This post is translated from the original: http://blog.datadive.net/selecting-good-features-part-iii-random-forests/
There are two kinds of methods:
1. Rank features by impurity (Gini impurity, information entropy, and the like), known as mean decrease impurity
from sklearn.datasets import load_boston
from sklearn.ensemble import RandomForestRegressor
import numpy as np
# Load the Boston housing dataset as an example
# (load_boston was removed in scikit-learn 1.2, so this needs an older release)
boston = load_boston()
X = boston["data"]
Y = boston["target"]
names = boston["feature_names"]
rf = RandomForestRegressor()
rf.fit(X, Y)
print("Features sorted by their score:")
print(sorted(zip(map(lambda x: round(x, 4), rf.feature_importances_), names),
             reverse=True))
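However, this ranking has two biases. First, variables with more categories tend to rank higher. Second, when the dataset contains two or more correlated features, those features should explain the target about equally well, yet the random forest's ranking contradicts this: once one of them has been used for a split, the impurity the others could have removed is already gone, so their importance drops sharply. The original post runs an experiment to illustrate this.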
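This is not specific to random forests: most model-based feature selection methods have trouble with interpreting the ranking of correlated variables. As a result, dropping a feature does not necessarily cause a large drop in model performance, because other correlated features remain.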
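The effect of correlated features can be reproduced on synthetic data. The following is only a rough sketch in the spirit of the original post's experiment, not its exact code: the hidden `seed` signal, the noise level, the sample size, and the forest hyperparameters are all assumptions chosen for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(0)
size = 2000

# Assumed setup: three features that are all noisy copies of one hidden signal
seed = rng.normal(0, 1, size)
X = np.column_stack([seed + rng.normal(0, 0.1, size) for _ in range(3)])
Y = X.sum(axis=1)

X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.3, random_state=0)

rf = RandomForestRegressor(n_estimators=50, max_features=2, random_state=0)
rf.fit(X_train, Y_train)
importances = rf.feature_importances_
print("Impurity-based importances:", np.round(importances, 3))

# Drop the top-ranked feature and refit: the remaining correlated
# features carry almost the same information
top = int(np.argmax(importances))
rf_dropped = RandomForestRegressor(n_estimators=50, random_state=0)
rf_dropped.fit(np.delete(X_train, top, axis=1), Y_train)

r2_full = rf.score(X_test, Y_test)
r2_dropped = rf_dropped.score(np.delete(X_test, top, axis=1), Y_test)
print("R^2 with all features:", round(r2_full, 3))
print("R^2 without the top feature:", round(r2_dropped, 3))
```

All three features contribute equally to `Y` by construction, so any spread in the printed importances comes from the method rather than from the data, and the second score shows how little is lost when one of the correlated features is removed.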
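2. Mean decrease accuracy
Randomly permute the values of each feature in turn and measure how much the model's accuracy drops each time. The size of the drop reflects how important that feature is to the model's accuracy.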
# sklearn.cross_validation was removed in scikit-learn 0.20; use model_selection
from sklearn.model_selection import ShuffleSplit
from sklearn.metrics import r2_score
from collections import defaultdict
X = boston["data"]
Y = boston["target"]
rf = RandomForestRegressor()
scores = defaultdict(list)
# Cross-validate the scores on a number of different random splits of the data
for train_idx, test_idx in ShuffleSplit(n_splits=100, test_size=0.3).split(X):
    X_train, X_test = X[train_idx], X[test_idx]
    Y_train, Y_test = Y[train_idx], Y[test_idx]
    rf.fit(X_train, Y_train)
    acc = r2_score(Y_test, rf.predict(X_test))
    for i in range(X.shape[1]):
        # Shuffle column i of a copy of the test set, leaving other columns intact
        X_t = X_test.copy()
        np.random.shuffle(X_t[:, i])
        shuff_acc = r2_score(Y_test, rf.predict(X_t))
        # Relative drop in R^2 caused by shuffling feature i
        scores[names[i]].append((acc - shuff_acc) / acc)
print("Features sorted by their score:")
print(sorted([(round(np.mean(score), 4), feat) for
              feat, score in scores.items()], reverse=True))
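Recent scikit-learn releases (0.22 and later) also ship a helper for this technique, `sklearn.inspection.permutation_importance`. Below is a minimal sketch of how it might be used. The synthetic dataset from `make_regression` and all parameter values are assumptions standing in for real data. Note that this helper reports the absolute drop in score, not the relative drop `(acc - shuff_acc) / acc` computed above.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption): 5 features, 2 of them informative
X, y = make_regression(n_samples=500, n_features=5, n_informative=2,
                       noise=1.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

rf = RandomForestRegressor(n_estimators=50, random_state=0)
rf.fit(X_train, y_train)

# Shuffle each column of the held-out set n_repeats times and
# record the resulting drop in R^2
result = permutation_importance(rf, X_test, y_test, scoring="r2",
                                n_repeats=10, random_state=0)

ranked = sorted(
    ((round(float(m), 4), i) for i, m in enumerate(result.importances_mean)),
    reverse=True)
print("Features (by column index) sorted by mean drop in R^2:")
print(ranked)
```

Scoring on held-out data, as here, measures how much the model relies on a feature to generalize, which mirrors the use of the test split in the loop above.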