sklearn之feature_importance_参数的学习

最新推荐文章于 2024-06-14 13:17:12 发布

一条咸鱼在网游

最新推荐文章于 2024-06-14 13:17:12 发布

阅读量4.6k

点赞数 1

文章标签：学习机器学习 python

本文链接：https://blog.csdn.net/weixin_63016274/article/details/127583887

版权

一、随机森林【原文链接：https://blog.csdn.net/zjuPeco/article/details/77371645】

随机森林的算法可以用如下几个步骤概括：

用有抽样放回的方法（bootstrap）从样本集中选取n个样本作为一个训练集
用抽样得到的样本集生成一棵决策树。在生成的每一个结点：
随机不重复地选择d个特征
利用这d个特征分别对样本集进行划分，找到最佳的划分特征（可用基尼系数、增益率或者信息增益判别）
重复步骤1到步骤2共k次，k即为随机森林中决策树的个数。
用训练得到的随机森林对测试样本进行预测，并用票选法决定预测的结果。
一个数据集中往往有成百上前个特征，如何在其中选择比结果影响最大的那几个特征，以此来缩减建立模型时的特征数是我们比较关心的问题。这样的方法其实很多，比如主成分分析，lasso等等。这里我们要介绍的是用随机森林来对进行特征筛选。

说白了就是看看每个特征在随机森林中的每颗树上做了多大的贡献，然后取个平均值，最后比一比特征之间的贡献大小。
from sklearn.cross_validation import train_test_split
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
x, y = df.iloc[:, 1:].values, df.iloc[:, 0].values
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.3, random_state = 0)
feat_labels = df.columns[1:]
forest = RandomForestClassifier(n_estimators=10000, random_state=0, n_jobs=-1)
forest.fit(x_train, y_train)

importances = forest.feature_importances_
indices = np.argsort(importances)[::-1]
for f in range(x_train.shape[1]):
print("%2d) %-*s %f" % (f + 1, 30, feat_labels[indices[f]], importances[indices[f]]))

如果要筛选出重要性比较高的变量的话：

threshold = 0.15
x_selected = x_train[:, importances > threshold]
x_selected.shape
(124, 3)
帮我们选好了3个重要性大于0.15的特征

原文链接：

(8条消息) 利用随机森林对特征重要性进行评估（含实例+代码讲解）_旅途中的宽~的博客-CSDN博客_利用随机森林对特征重要性进行评估

导入数据

import pandas as pd
url = 'http://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.data'
wine_data = pd.read_csv(url, header = None) #X

添加wine_data列名

wine_data.columns = ['Class label', 'Alcohol', 'Malic acid', 'Ash',
'Alcalinity of ash', 'Magnesium', 'Total phenols',
'Flavanoids', 'Nonflavanoid phenols', 'Proanthocyanins',
'Color intensity', 'Hue', 'OD280/OD315 of diluted wines', 'Proline']

import numpy as np
np.unique(wine_data['Class label'])
【a = np.unique(A)

对于一维数组或者列表，unique函数去除其中重复的元素，并按元素由大到小返回一个新的无元素重复的元组或者列表】

检查一下数据是否有空数组：

wine_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 178 entries, 0 to 177
Data columns (total 14 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Class label 178 non-null int64
1 Alcohol 178 non-null float64
2 Malic acid 178 non-null float64
3 Ash 178 non-null float64
4 Alcalinity of ash 178 non-null float64
5 Magnesium 178 non-null int64
6 Total phenols 178 non-null float64
7 Flavanoids 178 non-null float64
8 Nonflavanoid phenols 178 non-null float64
9 Proanthocyanins 178 non-null float64
10 Color intensity 178 non-null float64
11 Hue 178 non-null float64
12 OD280/OD315 of diluted wines 178 non-null float64
13 Proline 178 non-null int64
dtypes: float64(11), int64(3)
除去class label之外共有13个特征，数据集的大小为178

训练：

from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
x, y = wine_data.iloc[:, 1:].values, wine_data.iloc[:, 0].values
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.3, random_state = 0)
feat_labels = df.columns[1:]
forest = RandomForestClassifier(n_estimators=10000, random_state=0, n_jobs=-1)
forest.fit(x_train, y_train)

结果：

1) 10 0.182483
2) 13 0.158610
3) 7 0.150948
4) 12 0.131987
5) 1 0.106589
6) 11 0.078243
7) 6 0.060718
8) 4 0.032033
9) 2 0.025400
10) 9 0.022351
11) 5 0.022078
12) 8 0.014645
13) 3 0.013916
筛选出重要性比较高的变量：

threshold = 0.15
x_selected = x_train[:, importances > threshold]
x_selected