Feature Importance Algorithms
Project link: https://github.com/Wchenguang/gglearn/blob/master/DecisionTree/李航机器学习讲解/FeatureImportance.ipynb
Information Gain: Formulas
- Definition of entropy:
  - The entropy of the attribute $y$, which measures the uncertainty of the feature:
  $$P\left(Y=y_{j}\right)=p_{j}, \quad j=1,2, \cdots, n$$
  $$H(Y)=-\sum_{j=1}^{n} p_{j} \log p_{j}$$
- Definition of conditional entropy:
  - The uncertainty of $y$ when $x$ is known:
  $$P\left(X=x_{i}, Y=y_{j}\right)=p_{i j}, \quad i=1,2, \cdots, n ; \quad j=1,2, \cdots, m$$
  $$H(Y | X)=\sum_{i=1}^{n} p_{i} H\left(Y | X=x_{i}\right), \quad \text{where } p_{i}=P\left(X=x_{i}\right)$$
- Information gain computation procedure:
  - Compute the entropy of the dataset $D$, i.e. the entropy of the label $y$:
  $$H(D)=-\sum_{k=1}^{K} \frac{\left|C_{k}\right|}{|D|} \log _{2} \frac{\left|C_{k}\right|}{|D|}$$
  - Compute the entropy of $y$ under each value of the feature $A$ (i.e. of $x$):
  $$H(D | A)=\sum_{i=1}^{n} \frac{\left|D_{i}\right|}{|D|} H\left(D_{i}\right)=-\sum_{i=1}^{n} \frac{\left|D_{i}\right|}{|D|} \sum_{k=1}^{K} \frac{\left|D_{i k}\right|}{\left|D_{i}\right|} \log _{2} \frac{\left|D_{i k}\right|}{\left|D_{i}\right|}$$
  - Take the difference to obtain the gain (a small worked instance follows this list):
  $$g(D, A)=H(D)-H(D | A)$$
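A small worked instance of these quantities, added here for illustration (it is not part of the original notebook): take a binary label $y$ with equal class probabilities and a feature $x$ that determines $y$ completely. Then
$$H(Y)=-\left(\tfrac{1}{2}\log_{2}\tfrac{1}{2}+\tfrac{1}{2}\log_{2}\tfrac{1}{2}\right)=1, \qquad H(Y | X)=\tfrac{1}{2}\cdot 0+\tfrac{1}{2}\cdot 0=0, \qquad g=H(Y)-H(Y | X)=1,$$
so knowing $x$ removes all of the uncertainty in $y$, and such a feature receives the maximum possible gain of 1 bit.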
import numpy as np
import math

'''
Entropy H(Y) of a label array.
'''
def entropy(y_values):
    e = 0
    unique_vals = np.unique(y_values)
    for val in unique_vals:
        p = np.sum(y_values == val) / len(y_values)  # empirical probability of this value
        e += (p * math.log(p, 2))
    return -1 * e

'''
Information gain g(D, A) = H(Y) - H(Y | X): the entropy of y minus the
conditional entropy of y given the feature x.
'''
def entropy_condition(x_values, y_values):
    ey = entropy(y_values)
    ey_condition = 0
    xy = np.hstack((x_values, y_values))  # x and y side by side as two columns
    unique_x = np.unique(x_values)
    for x_val in unique_x:
        px = np.sum(x_values == x_val) / len(x_values)
        xy_condition_x = xy[np.where(xy[:, 0] == x_val)]
        ey_condition_x = entropy(xy_condition_x[:, 1])  # entropy of y within this slice of x
        ey_condition += (px * ey_condition_x)
    return ey - ey_condition

'''
Information gain ratio: removes the bias of plain information gain toward
features with many distinct values.
'''
def entropy_condition_ratio(x_values, y_values):
    return entropy_condition(x_values, y_values) / entropy(x_values)
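As a quick sanity check of these helpers (this snippet is an addition, not part of the original notebook): a fair binary label should have entropy 1 bit, and a feature identical to the label should achieve the full gain of 1 bit.

# Sanity check: entropy of a fair binary label, and gain of a feature equal to the label.
y_toy = np.array([[0], [0], [1], [1]])
print(entropy(y_toy))                   # expected: 1.0
print(entropy_condition(y_toy, y_toy))  # expected: 1.0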
- Using the example on page 62 of the book as a test; the following prints the information gain of each feature A1–A4:
xy = np.array([[0,0,0,0,0,1,1,1,1,1,2,2,2,2,2],
               [0,0,1,1,0,0,0,1,0,0,0,0,1,1,0],
               [0,0,0,1,0,0,0,1,1,1,1,1,0,0,0],
               [0,1,1,0,0,0,1,1,2,2,2,1,1,2,0],
               [0,0,1,1,0,0,0,1,1,1,1,1,1,1,0]]).T
# A1
print(entropy_condition(xy[:, 0].reshape(-1, 1), xy[:, -1].reshape(-1, 1)))
# A2
print(entropy_condition(xy[:, 1].reshape(-1, 1), xy[:, -1].reshape(-1, 1)))
# A3
print(entropy_condition(xy[:, 2].reshape(-1, 1), xy[:, -1].reshape(-1, 1)))
# A4
print(entropy_condition(xy[:, 3].reshape(-1, 1), xy[:, -1].reshape(-1, 1)))
- The results match those in the book. Repeating the computation with the information gain ratio:
xy = np.array([[0,0,0,0,0,1,1,1,1,1,2,2,2,2,2],
               [0,0,1,1,0,0,0,1,0,0,0,0,1,1,0],
               [0,0,0,1,0,0,0,1,1,1,1,1,0,0,0],
               [0,1,1,0,0,0,1,1,2,2,2,1,1,2,0],
               [0,0,1,1,0,0,0,1,1,1,1,1,1,1,0]]).T
# A1
print(entropy_condition_ratio(xy[:, 0].reshape(-1, 1), xy[:, -1].reshape(-1, 1)))
# A2
print(entropy_condition_ratio(xy[:, 1].reshape(-1, 1), xy[:, -1].reshape(-1, 1)))
# A3
print(entropy_condition_ratio(xy[:, 2].reshape(-1, 1), xy[:, -1].reshape(-1, 1)))
# A4
print(entropy_condition_ratio(xy[:, 3].reshape(-1, 1), xy[:, -1].reshape(-1, 1)))
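To turn these numbers into an ID3-style split choice, the feature with the largest information gain would be selected. A minimal sketch of that step (an addition to the notebook; it reuses the `entropy_condition` function and the `xy` array defined above):

# Scan every feature column and report the one with the largest information gain.
gains = [entropy_condition(xy[:, i].reshape(-1, 1), xy[:, -1].reshape(-1, 1))
         for i in range(xy.shape[1] - 1)]
best = int(np.argmax(gains))
print('information gains:', gains)
print('best feature: A{}'.format(best + 1))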
Gini Index: Formulas
$$\operatorname{Gini}(p)=\sum_{k=1}^{K} p_{k}\left(1-p_{k}\right)=1-\sum_{k=1}^{K} p_{k}^{2}$$
$$\operatorname{Gini}(D)=1-\sum_{k=1}^{K}\left(\frac{\left|C_{k}\right|}{|D|}\right)^{2}$$
$$\operatorname{Gini}(D, A)=\frac{\left|D_{1}\right|}{|D|} \operatorname{Gini}\left(D_{1}\right)+\frac{\left|D_{2}\right|}{|D|} \operatorname{Gini}\left(D_{2}\right)$$
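As a quick numerical check of the first formula (added for illustration): a pure node has Gini index 0, while an evenly split binary node attains the maximum value of 0.5:
$$\operatorname{Gini}\left(\tfrac{1}{2}, \tfrac{1}{2}\right)=1-\left(\tfrac{1}{4}+\tfrac{1}{4}\right)=\tfrac{1}{2}, \qquad \operatorname{Gini}(1, 0)=1-(1+0)=0$$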
'''
Gini index of a label array.
'''
def gini(y_values):
    g = 0
    unique_vals = np.unique(y_values)
    for val in unique_vals:
        p = np.sum(y_values == val) / len(y_values)
        g += (p * p)
    return 1 - g

'''
Gini index of y under each binary split "x == value vs. x != value".
Returns a dict mapping each value of x to Gini(D, A) for that split.
'''
def gini_condition(x_values, y_values):
    g_condition = {}
    xy = np.hstack((x_values, y_values))
    unique_x = np.unique(x_values)
    for x_val in unique_x:
        xy_condition_x = xy[np.where(xy[:, 0] == x_val)]     # samples with x == x_val
        xy_condition_notx = xy[np.where(xy[:, 0] != x_val)]  # samples with x != x_val
        g_condition[x_val] = (len(xy_condition_x) / len(x_values) * gini(xy_condition_x[:, 1])
                              + len(xy_condition_notx) / len(x_values) * gini(xy_condition_notx[:, 1]))
    return g_condition
xy = np.array([[0,0,0,0,0,1,1,1,1,1,2,2,2,2,2],
               [0,0,1,1,0,0,0,1,0,0,0,0,1,1,0],
               [0,0,0,1,0,0,0,1,1,1,1,1,0,0,0],
               [0,1,1,0,0,0,1,1,2,2,2,1,1,2,0],
               [0,0,1,1,0,0,0,1,1,1,1,1,1,1,0]]).T
# A1
print(gini_condition(xy[:, 0].reshape(-1, 1), xy[:, -1].reshape(-1, 1)))
# A2
print(gini_condition(xy[:, 1].reshape(-1, 1), xy[:, -1].reshape(-1, 1)))
# A3
print(gini_condition(xy[:, 2].reshape(-1, 1), xy[:, -1].reshape(-1, 1)))
# A4
print(gini_condition(xy[:, 3].reshape(-1, 1), xy[:, -1].reshape(-1, 1)))
- This matches page 71 of the book; the feature and the value of $x$ with the smallest Gini index are chosen as the optimal feature and split point.
- In effect, picking the smallest Gini index means picking the feature, and the value of that feature, under which the uncertainty of $y$ is smallest (see the sketch below).
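A minimal sketch of that selection step (an addition, not from the original notebook; it reuses `gini_condition` and the `xy` array defined above):

# Search every feature column and every candidate split value for the
# smallest weighted Gini index.
best_feature, best_value, best_gini = None, None, float('inf')
for i in range(xy.shape[1] - 1):
    splits = gini_condition(xy[:, i].reshape(-1, 1), xy[:, -1].reshape(-1, 1))
    for value, g in splits.items():
        if g < best_gini:
            best_feature, best_value, best_gini = i, value, g
print('best split: A{} == {} (Gini {:.3f})'.format(best_feature + 1, best_value, best_gini))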
Comparison of Feature Importance
Computing feature importance with the random forest algorithm, using the data from the book as an example:
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(random_state=42).fit(xy[:, :-1], xy[:, -1])
print(rf.feature_importances_)
- Broadly consistent with the results above.
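For a rough side-by-side view, the hand-computed information gains can be normalized to sum to 1, the same scale that `feature_importances_` uses. A sketch under that assumption (not part of the original notebook; it reuses `entropy_condition`, `xy`, and `rf` from above):

# Normalize the information gains so they sum to 1, then print both
# importance vectors for comparison.
gains = np.array([entropy_condition(xy[:, i].reshape(-1, 1), xy[:, -1].reshape(-1, 1))
                  for i in range(xy.shape[1] - 1)])
print('normalized information gain:', gains / gains.sum())
print('random forest importances:  ', rf.feature_importances_)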