Machine Learning with Scikit-Learn and Tensorflow 7.7 Feature Importance

qinhanmin

Book information
Hands-On Machine Learning with Scikit-Learn and Tensorflow
Publisher: O’Reilly Media, Inc, USA
Paperback: 566 pages
Language: English
ISBN: 1491962291
Barcode: 9781491962299
Product dimensions: 18 x 2.9 x 23.3 cm
ASIN: 1491962291

This series of blog posts is a translation of the book (originally written in Chinese).
Code and data download: https://github.com/ageron/handson-ml

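In a decision tree, important features tend to appear near the root, while unimportant features tend to appear near the leaves or not at all. We can therefore estimate a feature's importance from the average depth at which it appears across the trees in the forest. scikit-learn computes an importance score for each feature automatically during training (in current releases the score is based on how much the feature reduces impurity across the trees, which likewise favors features used near the root), and the scores are available through the feature_importances_ attribute. On the iris data, for example, petal length (44%) and petal width (42%) are the important features, while sepal length (11%) and sepal width (2%) are comparatively unimportant.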

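from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

iris = load_iris()
# Train 500 trees on all CPU cores; the fixed seed makes the run repeatable
rnd_clf = RandomForestClassifier(n_estimators=500, n_jobs=-1, random_state=42)
rnd_clf.fit(iris["data"], iris["target"])
# Print the importance score of each feature (the scores sum to 1)
for name, importance in zip(iris["feature_names"], rnd_clf.feature_importances_):
    print(name, "=", importance)
# output (from the original run; exact values can vary with the library version)
# sepal length (cm) = 0.112492250999
# sepal width (cm) = 0.0231192882825
# petal length (cm) = 0.441030464364
# petal width (cm) = 0.423357996355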
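
As a quick check on what these scores are, the sketch below compares the forest's feature_importances_ with the average of the per-tree scores stored in estimators_. It assumes a recent scikit-learn release, in which the forest reports the mean of its trees' normalized importances; the variable names (forest, per_tree, mean_of_trees) are illustrative and do not come from the book.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

iris = load_iris()
forest = RandomForestClassifier(n_estimators=100, random_state=42)
forest.fit(iris["data"], iris["target"])

# Each fitted tree exposes its own normalized importance vector
per_tree = np.array([tree.feature_importances_ for tree in forest.estimators_])
mean_of_trees = per_tree.mean(axis=0)

# The forest-level scores should sum to 1 and match the per-tree average
print("sum of forest importances:", forest.feature_importances_.sum())
print("max difference from per-tree mean:",
      np.abs(mean_of_trees - forest.feature_importances_).max())
```

Because the scores are normalized, they are best read as relative shares within one model rather than as absolute quantities that can be compared across datasets.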

The following example uses the digits data to plot the importance of each pixel.

import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier

digits = load_digits()
rnd_clf = RandomForestClassifier(random_state=42)
rnd_clf.fit(digits["data"], digits["target"])

def plot_digit(data):
    # load_digits images are 8x8 pixels (64 features), not 28x28
    image = data.reshape(8, 8)
    plt.imshow(image, cmap="hot",
               interpolation="nearest")
    plt.axis("off")

# Draw the importance scores as a heat map with a labeled color bar
plot_digit(rnd_clf.feature_importances_)
cbar = plt.colorbar(ticks=[rnd_clf.feature_importances_.min(), rnd_clf.feature_importances_.max()])
cbar.ax.set_yticklabels(['Not important', 'Very important'])
plt.show()
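
To read the same importances numerically rather than as a heat map, the following sketch ranks the pixels of the digits data and reports the five highest-scoring positions. It is only an illustration: the names forest, importance_grid, and top_pixels are made up here, and the choice of five pixels is arbitrary.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier

digits = load_digits()
forest = RandomForestClassifier(random_state=42)
forest.fit(digits["data"], digits["target"])

# Arrange the 64 scores on the 8x8 pixel grid of the digits images
importance_grid = forest.feature_importances_.reshape(8, 8)

# Flat indices of the five highest scores, most important first
top_flat = np.argsort(forest.feature_importances_)[::-1][:5]
# Convert each flat index to a (row, column) position on the grid
top_pixels = [
    tuple(int(i) for i in np.unravel_index(idx, importance_grid.shape))
    for idx in top_flat
]

for (row, col), idx in zip(top_pixels, top_flat):
    print(f"pixel (row={row}, col={col}): {forest.feature_importances_[idx]:.4f}")
```

Pixels that never vary in the data, such as the top-left corner, cannot be used for a split and therefore receive a score of zero.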

[Figure: heat map of pixel importance for the digits data]

The following example uses the MNIST data to plot the importance of each pixel.

import matplotlib.pyplot as plt
from sklearn.datasets import fetch_openml
from sklearn.ensemble import RandomForestClassifier

# fetch_mldata has been removed from scikit-learn; fetch_openml loads MNIST instead
mnist = fetch_openml("mnist_784", version=1, as_frame=False)
rnd_clf = RandomForestClassifier(random_state=42)
rnd_clf.fit(mnist["data"], mnist["target"])

def plot_digit(data):
    # MNIST images are 28x28 pixels (784 features)
    image = data.reshape(28, 28)
    plt.imshow(image, cmap="hot",
               interpolation="nearest")
    plt.axis("off")

# Draw the importance scores as a heat map with a labeled color bar
plot_digit(rnd_clf.feature_importances_)
cbar = plt.colorbar(ticks=[rnd_clf.feature_importances_.min(), rnd_clf.feature_importances_.max()])
cbar.ax.set_yticklabels(['Not important', 'Very important'])
plt.show()

[Figure: heat map of pixel importance for the MNIST data]

As these examples show, a random forest makes it easy to obtain the importance of each feature, which is very useful for feature selection.
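
One way to act on the scores for feature selection is scikit-learn's SelectFromModel, which keeps the features whose importance meets a threshold. The sketch below is an illustration on the iris data, not part of the book's text; the "mean" threshold and the variable names (selector, X_selected, kept) are choices made here for the example.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

iris = load_iris()

# Keep only features whose importance is at least the mean importance
selector = SelectFromModel(
    RandomForestClassifier(n_estimators=500, random_state=42),
    threshold="mean",
)
X_selected = selector.fit_transform(iris["data"], iris["target"])

# get_support() returns a boolean mask over the original features
kept = [name for name, keep in zip(iris["feature_names"], selector.get_support()) if keep]
print("kept features:", kept)
print("shape before:", iris["data"].shape, "after:", X_selected.shape)
```

A stricter or looser cut can be obtained by passing a numeric threshold instead; which value is appropriate depends on the downstream model and should be checked with validation data.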
