Introduction to Decision Trees
A non-parametric learning algorithm
Naturally handles multi-class classification
Can also solve regression problems
Highly interpretable
Information Entropy
In information theory, entropy measures the uncertainty of a random variable.
The higher the entropy, the more uncertain the data; the lower the entropy, the more certain it is.
$$H = -\sum_{i=1}^{k} p_i \log(p_i)$$
where $p_i$ is the proportion of the $i$-th class among the $k$ classes.
A few examples (using the natural log):
1. $\{\frac{1}{3},\frac{1}{3},\frac{1}{3}\}$: $H_1 = -\frac{1}{3}\log\frac{1}{3} - \frac{1}{3}\log\frac{1}{3} - \frac{1}{3}\log\frac{1}{3} = 1.0986$
2. $\{\frac{1}{10},\frac{2}{10},\frac{7}{10}\}$: $H_2 = -\frac{1}{10}\log\frac{1}{10} - \frac{2}{10}\log\frac{2}{10} - \frac{7}{10}\log\frac{7}{10} = 0.8018$
3. $\{1,0,0\}$: $H_3 = -\log 1 = 0$
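As a quick numeric check of the three values above (a minimal sketch; entropy_of is just an illustrative helper, not from any library):

import numpy as np

# Illustrative helper: entropy of a discrete distribution (natural log,
# matching the worked values above); 0 * log(0) is treated as 0
def entropy_of(ps):
    ps = np.asarray(ps, dtype=float)
    ps = ps[ps > 0]
    return float(-np.sum(ps * np.log(ps)))

print(entropy_of([1/3, 1/3, 1/3]))  # 1.0986...
print(entropy_of([0.1, 0.2, 0.7]))  # 0.8018...
print(entropy_of([1, 0, 0]))        # 0.0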
Conclusion: entropy is highest when the classes are evenly matched. A decision tree's goal when splitting is therefore to choose, at each node, the dimension and threshold that yield the lowest entropy after the split, and the classification follows from those splits.
Entropy for binary classification: $H = -x\log(x) - (1-x)\log(1-x)$
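Plotting this curve makes the "evenly matched" intuition visible; a minimal sketch (the maximum sits at $x = 0.5$):

import numpy as np
import matplotlib.pyplot as plt

def entropy(p):
    return -p * np.log(p) - (1 - p) * np.log(1 - p)

x = np.linspace(0.01, 0.99, 200)  # avoid log(0) at the endpoints
plt.plot(x, entropy(x))
plt.show()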
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets

# Use only the petal length/width of the iris dataset so results are easy to plot
iris = datasets.load_iris()
X = iris.data[:, 2:]
y = iris.target
plt.scatter(X[y==0,0], X[y==0,1])
plt.scatter(X[y==1,0], X[y==1,1])
plt.scatter(X[y==2,0], X[y==2,1])
plt.show()
Training sklearn's decision tree with entropy as the splitting criterion
from sklearn.tree import DecisionTreeClassifier
# Limit the depth to 2 and split by information entropy
clf = DecisionTreeClassifier(max_depth=2, criterion='entropy')
clf.fit(X, y)
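To see what the depth-2 tree learned, one can rasterize its predictions over the feature plane; a sketch, where plot_decision_boundary is a hypothetical helper rather than a sklearn function:

import numpy as np
import matplotlib.pyplot as plt

# Hypothetical helper (not part of sklearn): predict on a dense grid and
# shade the plane by predicted class
def plot_decision_boundary(model, axis):
    x0, x1 = np.meshgrid(
        np.linspace(axis[0], axis[1], 200),
        np.linspace(axis[2], axis[3], 200),
    )
    X_new = np.c_[x0.ravel(), x1.ravel()]
    y_predict = model.predict(X_new).reshape(x0.shape)
    plt.contourf(x0, x1, y_predict, alpha=0.3)

plot_decision_boundary(clf, axis=[0.5, 7.5, 0, 3])
plt.scatter(X[y==0,0], X[y==0,1])
plt.scatter(X[y==1,0], X[y==1,1])
plt.scatter(X[y==2,0], X[y==2,1])
plt.show()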
Simulating entropy-based splitting
# Binary-case entropy as a function of one class's proportion p (as plotted earlier)
def entropy(p):
    return -p * np.log(p) - (1 - p) * np.log(1 - p)

# Split the samples along dimension d at threshold value
def split(X, y, d, value):
    index_a = X[:,d] <= value
    index_b = X[:,d] > value
    return X[index_a], X[index_b], y[index_a], y[index_b]

from collections import Counter
import math

# Multi-class entropy of a label vector (this redefines the binary version above)
def entropy(y):
    counter = Counter(y)
    res = 0.0
    for num in counter.values():
        p = num / len(y)
        res += -p * math.log(p)
    return res
# Exhaustively search every dimension and every candidate threshold
# (midpoints between consecutive sorted values) for the lowest-entropy split
def try_split(X, y):
    best_entropy = float('inf')
    best_d, best_v = -1, -1
    for d in range(X.shape[1]):
        sorted_index = np.argsort(X[:,d])
        for i in range(1, len(X)):
            if X[sorted_index[i-1], d] != X[sorted_index[i], d]:
                v = (X[sorted_index[i-1], d] + X[sorted_index[i], d]) / 2
                X_l, X_r, y_l, y_r = split(X, y, d, v)
                e = entropy(y_l) + entropy(y_r)
                if e < best_entropy:
                    best_entropy, best_d, best_v = e, d, v
    return best_entropy, best_d, best_v
best_entropy, best_d, best_v = try_split(X, y)
print("best_entropy:", best_entropy)
print("best_d:", best_d)
print("best_v:", best_v)
Gini Index
$$G = 1 - \sum_{i=1}^{k} p_i^2$$
A few examples:
1. $\{\frac{1}{3},\frac{1}{3},\frac{1}{3}\}$: $G_1 = 1 - (\frac{1}{3})^2 - (\frac{1}{3})^2 - (\frac{1}{3})^2 = 0.6667$
2. $\{\frac{1}{10},\frac{2}{10},\frac{7}{10}\}$: $G_2 = 1 - (\frac{1}{10})^2 - (\frac{2}{10})^2 - (\frac{7}{10})^2 = 0.46$
3. $\{1,0,0\}$: $G_3 = 1 - 1^2 = 0$
Gini index for binary classification: $G = 1 - x^2 - (1-x)^2 = -2x^2 + 2x$
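For comparison with the entropy curve earlier, this parabola can be plotted the same way (a minimal sketch; it also peaks at $x = 0.5$):

import numpy as np
import matplotlib.pyplot as plt

def gini_binary(p):
    return -2 * p**2 + 2 * p

x = np.linspace(0, 1, 200)
plt.plot(x, gini_binary(x))
plt.show()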
Training sklearn's decision tree with the Gini index
from sklearn.tree import DecisionTreeClassifier
# Same tree as before, but split by Gini impurity (sklearn's default criterion)
clf = DecisionTreeClassifier(max_depth=2, criterion='gini')
clf.fit(X, y)
Simulating Gini-based splitting
# Split the samples along dimension d at threshold value (same helper as above)
def split(X, y, d, value):
    index_a = X[:,d] <= value
    index_b = X[:,d] > value
    return X[index_a], X[index_b], y[index_a], y[index_b]

from collections import Counter

# Gini impurity of a label vector
def gini(y):
    counter = Counter(y)
    res = 1.0
    for num in counter.values():
        p = num / len(y)
        res -= p**2
    return res
# Same exhaustive search as try_split above, but minimizing Gini impurity
def try_split(X, y):
    best_g = float('inf')
    best_d, best_v = -1, -1
    for d in range(X.shape[1]):
        sorted_index = np.argsort(X[:,d])
        for i in range(1, len(X)):
            if X[sorted_index[i-1], d] != X[sorted_index[i], d]:
                v = (X[sorted_index[i-1], d] + X[sorted_index[i], d]) / 2
                X_l, X_r, y_l, y_r = split(X, y, d, v)
                g = gini(y_l) + gini(y_r)
                if g < best_g:
                    best_g, best_d, best_v = g, d, v
    return best_g, best_d, best_v
best_g, best_d, best_v = try_split(X, y)
print("best_g:", best_g)
print("best_d:", best_d)
print("best_v:", best_v)
Information Entropy vs. the Gini Index
Entropy is slightly slower to compute than the Gini index (a rough timing sketch follows this list).
sklearn defaults to the Gini index.
Most of the time, neither criterion is clearly better than the other.
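A rough way to check the speed claim (a sketch; absolute numbers depend on the machine):

import timeit
setup = "import numpy as np; p = np.random.uniform(0.01, 0.99, 10000)"
# Entropy needs logarithms, Gini only squares, so entropy is usually a bit slower
t_entropy = timeit.timeit("-(p*np.log(p) + (1-p)*np.log(1-p))", setup=setup, number=1000)
t_gini = timeit.timeit("1 - p**2 - (1-p)**2", setup=setup, number=1000)
print("entropy:", t_entropy, "gini:", t_gini)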
Decision Trees for Regression
import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.model_selection import train_test_split

# Note: load_boston was removed in scikit-learn 1.2;
# on newer versions, datasets.fetch_california_housing() is a common substitute
boston = datasets.load_boston()
X = boston.data
y = boston.target
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=666)

from sklearn.tree import DecisionTreeRegressor
dt_reg = DecisionTreeRegressor()
dt_reg.fit(X_train, y_train)
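Scoring the regressor hints at why pruning matters; a sketch (exact numbers vary):

# An unconstrained tree can memorize the training set (R^2 of 1.0 there)
# while scoring noticeably lower on held-out data, a classic overfitting sign
print(dt_reg.score(X_train, y_train))
print(dt_reg.score(X_test, y_test))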
CART (Classification and Regression Tree)
Splits the data in two according to some dimension d and some threshold v.
The decision tree implemented in sklearn is CART.
from sklearn.tree import DecisionTreeClassifier
dt_clf = DecisionTreeClassifier()
# Classifiers need discrete labels, so fit on the iris data from earlier
# (the Boston split above is a regression task)
dt_clf.fit(iris.data[:, 2:], iris.target)
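The learned (d, v) splits can be inspected with sklearn's export_text; a sketch applied to the depth-2 iris tree clf trained earlier (the feature names are just labels for the two petal columns):

from sklearn.tree import export_text
# Print the tree as text: each line shows a rule "feature <= threshold",
# i.e. the (d, v) pairs CART chose
print(export_text(clf, feature_names=["petal length", "petal width"]))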
Complexity
With m samples and n features:
Training: $O(n \cdot m \log m)$
Prediction: $O(\log m)$
Pruning is usually needed to reduce complexity and to curb overfitting, as sketched below.
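In sklearn, pruning is mostly controlled through hyperparameters; a sketch with purely illustrative values that would normally be tuned, e.g. by cross-validation:

from sklearn.tree import DecisionTreeClassifier
clf_pruned = DecisionTreeClassifier(
    max_depth=4,            # limit the depth of the tree
    min_samples_split=10,   # a node needs at least 10 samples to be split
    min_samples_leaf=5,     # every leaf must keep at least 5 samples
    max_leaf_nodes=8,       # cap the total number of leaves
)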
Limitations of Decision Trees
1. Decision boundaries are always axis-aligned (horizontal or vertical); the tree can never split along an oblique direction.
2. Very sensitive to individual samples (see the sketch below).
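A small experiment can make the sensitivity concrete: drop a single sample, refit, and count how many predictions flip on random query points (a self-contained sketch; the dropped index is arbitrary):

import numpy as np
from sklearn import datasets
from sklearn.tree import DecisionTreeClassifier

iris = datasets.load_iris()
X2, y2 = iris.data[:, 2:], iris.target
# Drop one (arbitrarily chosen) sample and refit
X_drop, y_drop = np.delete(X2, 105, axis=0), np.delete(y2, 105)
clf_a = DecisionTreeClassifier(criterion='entropy').fit(X2, y2)
clf_b = DecisionTreeClassifier(criterion='entropy').fit(X_drop, y_drop)
# Compare the two trees' predictions on random points in the feature range
grid = np.c_[np.random.uniform(0.5, 7.5, 5000), np.random.uniform(0, 3, 5000)]
print(np.sum(clf_a.predict(grid) != clf_b.predict(grid)), "of 5000 predictions differ")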