DataWhale

Information Theory Basics

  • Entropy
    The entropy $H(X)$ of a discrete random variable $X$ with distribution $p(x)$ is defined as $H(X) = -\sum_{x\in \mathcal{X}} p(x)\log p(x)$

  • Joint Entropy
    The joint entropy $H(X,Y)$ of a pair of discrete random variables $(X,Y)$ with joint distribution $p(x,y)$ is defined as $H(X,Y) = -\sum_{x\in \mathcal{X}}\sum_{y\in \mathcal{Y}} p(x,y)\log p(x,y)$

  • Conditional Entropy
    The conditional entropy $H(Y|X)$ of a pair of discrete random variables $(X,Y)$ with joint distribution $p(x,y)$ is defined as $H(Y|X) = -\sum_{x\in \mathcal{X}}\sum_{y\in \mathcal{Y}} p(x,y)\log p(y|x)$

  • Information Gain
    The information gain $G(X)$ is defined as $G(X) = H(Y) - H(Y|X)$, which measures how much knowing the condition $X$ reduces the entropy $H(Y)$.

  • Gini impurity
    The Gini impurity measures how mixed a set of data is with respect to its class labels; it is defined as $I_G(f) = \sum_{i=1}^{m} f_i(1-f_i)$, where $f_i$ is the fraction of samples belonging to class $i$.
    The smaller it is, the more likely the set of data belongs to a single category. A short numerical sketch of these quantities is given after this list.
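
Below is a minimal numerical sketch of these quantities. The toy feature/label arrays, the helper function names, and the use of log base 2 are illustrative assumptions, not part of the definitions above.

import numpy as np

def entropy(labels):
    # H(Y) = -sum_y p(y) log2 p(y), estimated from label frequencies
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def conditional_entropy(feature, labels):
    # H(Y|X) = sum_x p(x) * H(Y | X = x)
    n = len(labels)
    h = 0.0
    for x in np.unique(feature):
        mask = feature == x
        h += mask.sum() / n * entropy(labels[mask])
    return h

def information_gain(feature, labels):
    # G(X) = H(Y) - H(Y|X)
    return entropy(labels) - conditional_entropy(feature, labels)

def gini(labels):
    # I_G = sum_i f_i (1 - f_i) over the class fractions f_i
    _, counts = np.unique(labels, return_counts=True)
    f = counts / counts.sum()
    return np.sum(f * (1 - f))

# made-up binary feature x and binary label y
x = np.array([0, 0, 0, 1, 1, 1, 1, 1])
y = np.array([0, 0, 1, 1, 1, 1, 1, 0])

print("H(Y)    =", entropy(y))
print("H(Y|X)  =", conditional_entropy(x, y))
print("G(X)    =", information_gain(x, y))
print("Gini(Y) =", gini(y))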

Decision Tree

Different tree-building algorithms

  • ID3 algorithm
    At each node, compute the information gain of every feature over the current sample set $D$ and select the feature $A_g$ with the maximum gain. If that maximum gain exceeds the threshold $\epsilon$, split the node into subsets $D_i$ according to the values of $A_g$ and repeat the procedure on each subset; otherwise stop and make the node a leaf.
    This algorithm only handles discrete features.

  • C4.5 algorithm
    It has the same principle as ID3 and overcomes four disadvantages of ID3:

     1. ID3 cannot handle continuous features
     2. information gain favors features with many distinct values, which can bias the result (C4.5 uses the gain ratio instead)
     3. ID3 cannot handle missing values in a feature
     4. ID3 does not address overfitting

       C4.5 handles overfitting with regularized preliminary pruning (pre-pruning).

  • CART algorithm (Classification And Regression Tree)

    1. for a classification tree, the Gini impurity is the basis of node splitting
    2. for a regression tree, the minimum sample variance is the basis of node splitting

    A toy comparison of the different split criteria (information gain, gain ratio, Gini) is sketched below.
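
The following minimal sketch scores one candidate discrete split under the three criteria: information gain (ID3), gain ratio (C4.5), and weighted Gini impurity (CART). The helper names and the toy feature/label arrays are made up for illustration; a real implementation would also handle continuous features, missing values, and pruning.

import numpy as np

def entropy(y):
    _, c = np.unique(y, return_counts=True)
    p = c / c.sum()
    return -np.sum(p * np.log2(p))

def gini(y):
    _, c = np.unique(y, return_counts=True)
    f = c / c.sum()
    return np.sum(f * (1 - f))

def split_scores(x, y):
    # score a candidate split of labels y by a discrete feature x
    n = len(y)
    cond_entropy = 0.0   # H(Y|X)
    split_info = 0.0     # intrinsic information of the split (for the gain ratio)
    weighted_gini = 0.0  # weighted Gini impurity of the child nodes
    for v in np.unique(x):
        mask = x == v
        w = mask.sum() / n
        cond_entropy += w * entropy(y[mask])
        split_info -= w * np.log2(w)
        weighted_gini += w * gini(y[mask])
    gain = entropy(y) - cond_entropy
    gain_ratio = gain / split_info if split_info > 0 else 0.0
    return gain, gain_ratio, weighted_gini

# made-up discrete feature and binary label
x = np.array([0, 0, 1, 1, 2, 2, 2, 2])
y = np.array([0, 0, 1, 1, 1, 0, 1, 1])

gain, ratio, wg = split_scores(x, y)
print("information gain (ID3):", gain)   # larger is better
print("gain ratio (C4.5):     ", ratio)  # larger is better
print("weighted Gini (CART):  ", wg)     # smaller is better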

Model evaluation

Classification trees and regression trees are assessed with classification and regression evaluation metrics, respectively.

  • AUC and the ROC curve (for classification)
  • RMSE and quantiles of the errors (for regression)

Note: for more metrics, refer to the previous article. A small example with sklearn.metrics follows.
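
This is a minimal sketch of computing these metrics with sklearn.metrics; the toy label, score, and prediction arrays are made-up values for illustration.

import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve, mean_squared_error

# classification: true labels and predicted probabilities (made-up values)
y_true = np.array([0, 0, 1, 1, 1, 0])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.7, 0.2])
fpr, tpr, thresholds = roc_curve(y_true, y_score)
print("ROC points (FPR, TPR):", list(zip(fpr, tpr)))
print("AUC:", roc_auc_score(y_true, y_score))

# regression: true targets and predictions (made-up values)
y_reg = np.array([3.0, 5.0, 2.5, 7.0])
y_pred = np.array([2.5, 5.0, 4.0, 8.0])
print("RMSE:", np.sqrt(mean_squared_error(y_reg, y_pred)))
# quantiles of the absolute errors
print("error quantiles (50%, 90%):", np.quantile(np.abs(y_reg - y_pred), [0.5, 0.9]))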

The sklearn parameters

sklearn.tree.DecisionTreeClassifier(criterion='gini', splitter='best', max_depth=None, min_samples_split=2, min_samples_leaf=1, max_features=None, random_state=None, max_leaf_nodes=None)
  • criterion: the feature splitting criterion ('gini' or 'entropy')
  • max_depth: the maximum depth of the decision tree; limiting it helps against overfitting (see the sketch below)
  • min_samples_leaf: the minimum number of samples required in a leaf node
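
The sketch below compares an unconstrained tree with one regularized through max_depth and min_samples_leaf on the iris data; the particular parameter values are illustrative assumptions, not recommendations.

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()

# an unconstrained tree vs. a regularized one (illustrative values)
deep = DecisionTreeClassifier(criterion='gini', random_state=0)
shallow = DecisionTreeClassifier(criterion='entropy', max_depth=3,
                                 min_samples_leaf=5, random_state=0)

for name, clf in [("unconstrained", deep),
                  ("max_depth=3, min_samples_leaf=5", shallow)]:
    scores = cross_val_score(clf, iris.data, iris.target, cv=5)
    print(name, "mean CV accuracy:", scores.mean())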

Code to draw a decision tree:

# coding=utf-8

# The iris dataset bundled with sklearn
# iris contains three species: setosa, versicolor and virginica
# Each sample has 4 features:
# sepal length
# sepal width
# petal length
# petal width
from sklearn.datasets import load_iris
from sklearn import tree
import numpy as np

# load the iris data
iris = load_iris()

# indices of the samples held out for testing
test_idx = [0, 50, 100]

# training data
train_target = np.delete(iris.target, test_idx)
train_data = np.delete(iris.data, test_idx, axis=0)

# test data
test_target = iris.target[test_idx]
test_data = iris.data[test_idx]


# decision tree
clf = tree.DecisionTreeClassifier()
clf.fit(train_data, train_target)

# print the true test labels and the tree's predictions
# they should be identical (i.e. the tree predicts correctly)
print("test_target:")
print(test_target)
print("predict:")
print(clf.predict(test_data))


# visualize the decision tree
# requires pydot (the compatible pydotplus package is used here)
# also requires Graphviz (download it from www.graphviz.org)
from io import StringIO
import pydotplus

dot_data = StringIO()
tree.export_graphviz(clf,
                     out_file=dot_data,
                     feature_names=iris.feature_names,
                     class_names=iris.target_names,
                     filled=True, rounded=True,
                     impurity=False)

graph = pydotplus.graph_from_dot_data(dot_data.getvalue())
# write a pdf showing the whole decision process of the tree
graph.write_pdf("viz.pdf")
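
If Graphviz and pydotplus are not available, the same kind of drawing can be produced with sklearn's built-in plot_tree. This is a sketch assuming scikit-learn >= 0.21 and matplotlib; it refits on the full iris data purely so the snippet is self-contained.

import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn import tree

iris = load_iris()
clf = tree.DecisionTreeClassifier().fit(iris.data, iris.target)

# draw the tree with matplotlib instead of Graphviz
plt.figure(figsize=(12, 8))
tree.plot_tree(clf,
               feature_names=iris.feature_names,
               class_names=list(iris.target_names),
               filled=True, rounded=True)
plt.savefig("viz.png")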

The resulting decision tree is written to viz.pdf.


