# Computing Feature Importance in scikit-learn's Decision Tree Algorithm

The `feature_importances_` attribute of the `sklearn.tree.DecisionTreeClassifier` class returns the importance of each feature: the higher the value, the more important the feature. The official scikit-learn documentation explains it as follows:

> The importance of a feature is computed as the (normalized) total reduction of the criterion brought by that feature. It is also known as the Gini importance.

In an answer on Stack Overflow, Seljuk Gülcan points out that if each feature is used in only one split, `feature_importances_` is exactly this Gini importance, computed for the node where the feature is used as:

```
N_t / N * (impurity - N_t_R / N_t * right_impurity
                    - N_t_L / N_t * left_impurity)
```

where `N` is the total number of samples, `N_t` is the number of samples at the current node, and `N_t_L` / `N_t_R` are the numbers of samples in the left / right child.
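As a sketch, the per-node formula can be written out directly (the function name and argument names here are illustrative, not part of sklearn's API):

```python
def node_importance(N, N_t, N_t_L, N_t_R, impurity, left_impurity, right_impurity):
    """Weighted impurity decrease contributed by a single split node."""
    return (N_t / N) * (impurity
                        - (N_t_R / N_t) * right_impurity
                        - (N_t_L / N_t) * left_impurity)

# Root split of the worked example below: 4 samples with Gini 0.375,
# a 3-sample left child with Gini 4/9, and a single pure sample on the right.
print(node_importance(N=4, N_t=4, N_t_L=3, N_t_R=1,
                      impurity=0.375, left_impurity=4/9, right_impurity=0.0))
# ≈ 0.0417
```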


```python
from sklearn.tree import DecisionTreeClassifier, export_graphviz

X = [[1, 0, 0], [0, 0, 0], [0, 0, 1], [0, 1, 0]]
y = [1, 0, 1, 1]

clf = DecisionTreeClassifier()
clf.fit(X, y)

# Unnormalized importances: the raw weighted impurity decrease per feature
feat_importance = clf.tree_.compute_feature_importances(normalize=False)
print("feat importance = " + str(feat_importance))

# Write the fitted tree to a .dot file for inspection with Graphviz
export_graphviz(clf, out_file='test/tree.dot')
```


```
feat importance = [0.25       0.08333333 0.04166667]
```


At the root node (all 4 samples, Gini = 0.375), the split on feature `X[2]` sends 3 samples to a child with Gini ≈ 0.444 and 1 sample to a pure leaf, so feature `X[2]` contributes:

```
feature_importance = (4 / 4) * (0.375 - (3 / 4 * 0.444)) = 0.042
```


The 3-sample node (Gini ≈ 0.444) is then split on feature `X[1]`, sending 2 samples to a child with Gini = 0.5 and 1 sample to a pure leaf, so feature `X[1]` contributes:

```
feature_importance = (3 / 4) * (0.444 - (2 / 3 * 0.5)) = 0.083
```


Finally, the remaining 2-sample node (Gini = 0.5) is split on feature `X[0]` into two pure leaves, so feature `X[0]` contributes:

```
feature_importance = (2 / 4) * (0.5) = 0.25
```

These three values match the printed `feat_importance` array, ordered by feature index.
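The `feature_importances_` attribute reports the same values normalized to sum to 1; a quick check on the raw vector computed above:

```python
import numpy as np

# Hand-computed raw (unnormalized) importances from the three splits,
# ordered by feature index: X[0], X[1], X[2]
raw = np.array([0.25, 1/12, 1/24])

# clf.feature_importances_ is this vector scaled so it sums to 1
print(raw / raw.sum())  # ≈ [0.667, 0.222, 0.111]
```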


1. Entropy: $Entropy(p)=-\displaystyle\sum_{k=1}^Kp_k\log_2p_k$

2. Gini impurity: $Gini(p)=\displaystyle\sum_{k=1}^Kp_k(1-p_k)=1-\displaystyle\sum_{k=1}^Kp_k^2$
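The Gini values used in the worked example (0.375, ≈0.444, 0.5) follow directly from footnote 2; a minimal sketch (the helper name is mine, not sklearn's):

```python
def gini(counts):
    """Gini impurity 1 - sum(p_k^2), from a list of per-class sample counts."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

print(gini([3, 1]))  # root node, y = [1, 0, 1, 1]   -> 0.375
print(gini([2, 1]))  # 3-sample child, y = [1, 0, 1] -> ~0.444
print(gini([1, 1]))  # 2-sample child, y = [1, 0]    -> 0.5
```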
