Feature Engineering with scikit-learn

AI_Engine

Article link: https://blog.csdn.net/weixin_36325602/article/details/104337969

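This article introduces three families of feature selection methods in scikit-learn: Filter, Embedded, and Wrapper. Filter methods such as variance filtering, the chi-square test, and mutual information are independent of the learning algorithm. Embedded methods select features through model regularization, for example with SelectFromModel. Wrapper methods such as RFE perform feature selection together with model training and have relatively high complexity.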

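Filter Methods

Filter methods are usually applied during preprocessing, and the selection is completely independent of any learning algorithm.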

Variance Filtering

    **Essence:** Removes features whose own variance falls below a threshold.

    **API:** VarianceThreshold

    **Example:**

import pandas as pd
data = pd.read_csv(r'../Data/digit recognizor.csv')
x = data.iloc[:,1:]
y = data.iloc[:,0]
x.shape

(42000, 784)

from sklearn.feature_selection import VarianceThreshold
selector = VarianceThreshold() # instantiate; the default threshold is 0
x_var = selector.fit_transform(x) # new feature matrix with the failing features removed
x_var.shape

(42000, 708)

import numpy as np
median = np.median(x.var().values)
median

1352.286703180131

x_fc_var = VarianceThreshold(threshold=median).fit_transform(x)
x_fc_var.shape

(42000, 392)

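%%timeit # cell magic: measures the run time
# When a feature is binary, its values are Bernoulli random variables with variance Var[X] = p(1-p), where X is the feature and p is the proportion of one of the two values within that feature
# With p = 0.8, a feature is removed when one of its two values accounts for more than 80% of the samples
x_bvar = VarianceThreshold(threshold=0.8 * (1-0.8)).fit_transform(x)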
x_bvar.shape

(42000, 685)
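The pixel columns above are not actually binary, so the p(1-p) rule is easier to see on a toy example. The following is a minimal sketch on a hypothetical 10-row binary matrix (the data and the names `X_toy` and `toy_selector` are made up for illustration); it checks that the variances scikit-learn computes match p(1-p) and shows which columns survive the 0.8 * (1 - 0.8) threshold.

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

# Hypothetical binary features, 10 samples each:
#   column 0: nine 1s and one 0  -> p = 0.9, variance 0.9 * 0.1 = 0.09
#   column 1: five 1s and five 0s -> p = 0.5, variance 0.5 * 0.5 = 0.25
#   column 2: all 0s              -> variance 0
X_toy = np.column_stack([
    [1] * 9 + [0],
    [1] * 5 + [0] * 5,
    [0] * 10,
])

toy_selector = VarianceThreshold(threshold=0.8 * (1 - 0.8))  # threshold = 0.16
X_toy_reduced = toy_selector.fit_transform(X_toy)

print(toy_selector.variances_)     # per-column variance, computed with ddof=0
print(toy_selector.get_support())  # boolean mask of the columns that were kept
print(X_toy_reduced.shape)
```

Only the balanced column has a variance above 0.16, so it is the only one kept. Note that `variances_` is the population variance, whereas pandas `DataFrame.var()` defaults to the sample variance (ddof=1), so the two differ slightly on small data.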

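Chi-Square Filtering

    **Essence:** Measures the dependence between each feature and the label, judged by the chi-square statistic and its p-value. As a rule of thumb, a p-value at or below a significance level of 0.05 or 0.01 indicates that the feature is related to the label, so the feature can be kept. Note that chi2 requires non-negative feature values.

    **API:** chi2, SelectKBest

    **Example:**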

from sklearn.feature_selection import chi2
from sklearn.feature_selection import SelectKBest
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier
x_chi_fc_var = SelectKBest(chi2, k=300).fit_transform(x_fc_var, y) # keep the 300 features with the highest chi-square scores
x_chi_fc_var.shape

(42000, 300)

Compute the chi-square statistic and p-value of every feature

chi_value, p_value = chi2(x_fc_var, y)
chi_value
p_value

Compute a reasonable value for k

k = chi_value.shape[0] - (p_value>0.05).sum()
k

392
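On the digit data every feature passes the p-value cut, which makes it hard to see the rule at work. Here is a small sketch on hypothetical count data (the arrays and the names `X_toy`, `y_toy`, and `toy_selector` are assumptions for illustration) in which only one of three features depends on the label.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2

# Hypothetical data: 50 samples of class 0 followed by 50 samples of class 1.
y_toy = np.array([0] * 50 + [1] * 50)
X_toy = np.column_stack([
    10 * y_toy + 1,       # 1 for class 0, 11 for class 1: strongly tied to the label
    np.full(100, 5),      # constant: carries no information about the label
    np.tile([1, 3], 50),  # alternates 1, 3 identically within both classes
])

toy_chi, toy_p = chi2(X_toy, y_toy)

# Same rule as above: total number of features minus those with p > 0.05.
toy_k = toy_chi.shape[0] - int((toy_p > 0.05).sum())

toy_selector = SelectKBest(chi2, k=toy_k).fit(X_toy, y_toy)
X_toy_selected = toy_selector.transform(X_toy)

print(toy_k)
print(toy_selector.get_support())  # mask of the features that were kept
```

The second and third columns have the same totals in both classes, so their observed and expected counts coincide, the statistic is 0, and the p-value is 1. Only the first column is kept.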

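F-Test

    **Essence:** The ANOVA F-test, which filters features by the linear relationship between each feature and the label. (Its null hypothesis is that there is no significant linear relationship between the feature and the label.)

    **API:** f_classif

    **Example:**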

from sklearn.feature_selection import f_classif
f_value, p_value = f_classif(x_fc_var, y)
f_value.shape
p_value.shape

(392,)
(392,)

k = f_value.shape[0] - (p_value>0.05).sum()
k

392

from sklearn.feature_selection import SelectKBest
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier
x_f_fc_var = SelectKBest(f_classif, k=392).fit_transform(x_fc_var, y) # keep the 392 features with the highest F scores
x_f_fc_var.shape

(42000, 392)

cross_val_score(RandomForestClassifier(n_estimators=10,random_state=0), x_f_fc_var, y, cv=5).mean()

0.9388098166696807

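Mutual Information

    **Essence:** Filters features by any relationship, linear or nonlinear, between each feature and the label. It is more general than the F-test. Instead of an F-value and a p-value, it returns an estimate of the mutual information between each feature and the label. The estimate is non-negative: 0 means the two variables are independent, and larger values mean stronger dependence.

    **API:** mutual_info_classif (classification); mutual_info_regression (regression)

    **Example:**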

from sklearn.feature_selection import mutual_info_classif as mic
result = mic(x_fc_var, y)
k = result.shape[0] - (result<=0).sum() # the estimate is never negative, so count the features whose estimate is 0
k

392

x_mic_fc_var = SelectKBest(mic, k=392).fit_transform(x_fc_var, y) # keep the 392 features with the highest mutual information
x_mic_fc_var.shape

(42000, 392)

cross_val_score(RandomForestClassifier(n_estimators=10,random_state=0), x_mic_fc_var, y, cv=5).mean()

0.9388098166696807
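To illustrate the difference between the F-test and mutual information, here is a minimal sketch on synthetic data (the data and the names `x_sym`, `x_noise`, and `toy_mi` are assumptions for illustration). The label depends on the first feature only through its absolute value, so both classes have the same mean and the F-test sees nothing, while the mutual information estimate is clearly positive. The second feature is pure noise.

```python
import numpy as np
from sklearn.feature_selection import f_classif, mutual_info_classif

rng = np.random.RandomState(0)

# Feature 0: evenly spaced on [-2, 2]; the label is 1 when |x| > 1.
# The relationship is deterministic but not linear: both classes have mean 0.
x_sym = np.linspace(-2, 2, 400)
y_toy = (np.abs(x_sym) > 1).astype(int)

# Feature 1: Gaussian noise that is unrelated to the label.
x_noise = rng.normal(size=400)

X_toy = np.column_stack([x_sym, x_noise])

toy_f, _ = f_classif(X_toy, y_toy)
toy_mi = mutual_info_classif(X_toy, y_toy, random_state=0)

print("F statistic:       ", toy_f)   # essentially 0 for feature 0
print("mutual information:", toy_mi)  # clearly larger for feature 0 than for the noise
```

mutual_info_classif relies on a nearest-neighbor estimator that adds a tiny amount of random noise, which is why `random_state` is fixed here.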

Embedded Methods

In one sentence: the algorithm itself decides which features to use; it is a strengthened version of filtering.

API: SelectFromModel

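Notes: For models with a penalty term, the larger the regularization penalty, the smaller the coefficients of the features in the model. With an L1 penalty, once the penalty is large enough some coefficients become exactly 0, and as it keeps growing all coefficients tend to 0. Some coefficients reach 0 earlier than others, and the corresponding features are the ones that can be dropped. In other words, we keep the features with the larger coefficients. Support vector machines and logistic regression use the parameter C to control the sparsity of the returned feature matrix: the smaller C is, the fewer features are returned. Lasso regression uses the parameter alpha: the larger alpha is, the fewer features are returned.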
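The effect of C can be sketched as follows. This is only an illustration under stated assumptions: the data is synthetic (make_classification), the model is a LinearSVC with an L1 penalty, and the C values are arbitrary. With an L1-penalized estimator, SelectFromModel keeps the features whose coefficients are not (numerically) zero.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectFromModel
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# Synthetic data: 20 features, of which 5 are informative.
X_toy, y_toy = make_classification(
    n_samples=200, n_features=20, n_informative=5, random_state=0
)
X_toy = StandardScaler().fit_transform(X_toy)

n_kept = {}
for C in [0.001, 0.01, 0.1, 1.0]:
    # An L1 penalty drives some coefficients to exactly 0; a smaller C means a stronger penalty.
    svc = LinearSVC(C=C, penalty="l1", dual=False, max_iter=5000, random_state=0)
    toy_selector = SelectFromModel(svc).fit(X_toy, y_toy)
    n_kept[C] = int(toy_selector.get_support().sum())

print(n_kept)  # number of features kept for each value of C
```

get_support() is used instead of transform() because a very small C can leave no feature at all, in which case the transformed matrix would be empty.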

Example:

import pandas as pd
data = pd.read_csv(r'../Data/digit recognizor.csv')
x = data.iloc[:,1:]
y = data.iloc[:,0]

from sklearn.feature_selection import SelectFromModel
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier(n_estimators=10, random_state=0).fit(x, y)
clf.feature_importances_

clf = RandomForestClassifier(n_estimators=10, random_state=0)
x_embedded = SelectFromModel(estimator=clf, threshold=0.005).fit_transform(x, y)
x_embedded.shape

(42000, 47)
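Besides a fixed number such as 0.005, the threshold of SelectFromModel can also be given relative to the importances themselves, using the strings "mean" and "median" or a scaled form such as "1.25*mean". A minimal sketch follows, using scikit-learn's small built-in digits dataset (load_digits, 64 pixel features) as a stand-in for the CSV above; the names `X_digits` and `n_kept` are assumptions for illustration.

```python
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

# Stand-in data: 8x8 digit images flattened to 64 features.
X_digits, y_digits = load_digits(return_X_y=True)

forest = RandomForestClassifier(n_estimators=10, random_state=0)

n_kept = {}
for threshold in ["mean", "median", "1.25*mean"]:
    toy_selector = SelectFromModel(forest, threshold=threshold).fit(X_digits, y_digits)
    # threshold_ holds the numeric value the string was resolved to.
    n_kept[threshold] = int(toy_selector.get_support().sum())
    print(threshold, toy_selector.threshold_, n_kept[threshold])
```

A relative threshold avoids having to guess an absolute importance value, since importances shrink as the number of features grows.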

Use a learning curve to find the best value of threshold

%matplotlib inline  
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import cross_val_score
thresholds = np.linspace(0,(clf.fit(x, y).feature_importances_).max(),10)
scores = []
for threshold in thresholds:
    x_embedded = SelectFromModel(clf, threshold=threshold).fit_transform(x, y)
    score = cross_val_score(clf, x_embedded, y, cv=5).mean()
    scores.append(score)
plt.plot(thresholds, scores)
plt.show()

[Figure: mean cross-validation score versus threshold]

Based on the learning curve, refine the threshold grid and inspect the new curve.

scores = []
for threshold in np.linspace(0, 0.002, 10):
    x_embedded = SelectFromModel(clf, threshold=threshold).fit_transform(x, y)
    score = cross_val_score(clf, x_embedded, y, cv=5).mean()
    scores.append(score)
plt.figure(figsize=[20, 5])
plt.plot(np.linspace(0, 0.002, 10), scores)
plt.xticks(np.linspace(0, 0.002, 10))
plt.show()

[Figure: mean cross-validation score versus threshold, for thresholds between 0 and 0.002]
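One caveat about the loops above: the features are selected on the full dataset before cross-validation, so every validation fold has already influenced the selection and the scores may be somewhat optimistic. A possible alternative, sketched here under the assumption that scikit-learn's built-in digits dataset stands in for the CSV, is to put the selector and the classifier into a Pipeline so that the selection is refit inside each fold.

```python
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Stand-in data: 8x8 digit images flattened to 64 features.
X_digits, y_digits = load_digits(return_X_y=True)

pipe = make_pipeline(
    # Step 1: keep the features whose importance reaches the threshold.
    SelectFromModel(
        RandomForestClassifier(n_estimators=10, random_state=0),
        threshold=0.005,
    ),
    # Step 2: train the final classifier on the selected features only.
    RandomForestClassifier(n_estimators=10, random_state=0),
)

# Both steps are fitted on the training part of each fold.
pipe_score = cross_val_score(pipe, X_digits, y_digits, cv=5).mean()
print(pipe_score)
```

The same pipeline can be dropped into the threshold loop in place of the separate fit_transform and cross_val_score calls.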

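Wrapper Methods

Essence: Feature selection and model training happen at the same time. Unlike the embedded approach, where we supply an evaluation metric or a statistical threshold ourselves, a wrapper method relies on an objective function to choose the features for us.

Idea: Starting from the initial feature set, a wrapper method evaluates every feature by its importance, removes some features from the current set, and repeats the process until the number of remaining features meets our requirement.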

Complexity: filter methods < wrapper methods < embedded methods

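API: RFE. Its key attributes are .support_ (a boolean mask over all features marking the selected ones) and .ranking_ (the overall importance ranking of the features after the iterative selection).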
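The elimination loop described in the idea above can be sketched by hand. This is a simplified illustration, not scikit-learn's implementation: the function `manual_rfe` and the synthetic data are hypothetical, and the sketch assumes an estimator that exposes feature_importances_.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier


def manual_rfe(estimator, X, y, n_features_to_select, step):
    """Repeatedly drop the least important features until the target count is reached."""
    remaining = np.arange(X.shape[1])  # column indices still in play
    while remaining.size > n_features_to_select:
        estimator.fit(X[:, remaining], y)
        # Never drop more than needed to land exactly on the target.
        n_drop = min(step, remaining.size - n_features_to_select)
        # Positions (within `remaining`) sorted from least to most important.
        order = np.argsort(estimator.feature_importances_)
        remaining = np.delete(remaining, order[:n_drop])
    return remaining


X_toy, y_toy = make_classification(
    n_samples=200, n_features=10, n_informative=3, random_state=0
)
kept = manual_rfe(
    RandomForestClassifier(n_estimators=10, random_state=0),
    X_toy, y_toy, n_features_to_select=4, step=2,
)
print(kept)  # indices of the 4 surviving columns
```

The model is refit after every round because the importances change once features have been removed, which is also what makes the method costly.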

Example:

from sklearn.feature_selection import RFE
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
clf = RandomForestClassifier(n_estimators=10, random_state=0)
rfe = RFE(estimator=clf, n_features_to_select=340, step=50).fit(x, y)
rfe.support_.sum()

340

rfe.ranking_

x_rfe = rfe.transform(x)
x_rfe

array([[  0,   0,   0, ...,   0,   0,   0],
       [  0,   0,   0, ...,   0,   0,   0],
       [  0,   0,   0, ...,   0,   0,   0],
       ...,
       [  0,   0,   0, ...,   0, 128, 255],
       [  0,   0, 146, ...,   0,   0,   0],
       [  0,   0,   0, ...,   0,   0,   0]])

cross_val_score(estimator=clf, X=x_rfe, y=y ,cv=5).mean()

0.9389522459432109
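With 784 pixel columns the support_ and ranking_ arrays are hard to read, so here is a small sketch on synthetic data with 10 features (the data and the names `X_toy` and `rfe_toy` are assumptions for illustration). Selecting 4 features with step=2 takes three elimination rounds: 10, then 8, then 6, then 4.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE

X_toy, y_toy = make_classification(
    n_samples=200, n_features=10, n_informative=3, random_state=0
)

rfe_toy = RFE(
    estimator=RandomForestClassifier(n_estimators=10, random_state=0),
    n_features_to_select=4,
    step=2,
).fit(X_toy, y_toy)

print(rfe_toy.support_)                  # True for the 4 selected features
print(rfe_toy.ranking_)                  # 1 = selected; larger = eliminated earlier
print(np.flatnonzero(rfe_toy.support_))  # column indices of the selected features
```

All selected features share rank 1. The features dropped in the last round get rank 2, and those dropped in the first round get the highest rank, here 4.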

%matplotlib inline  
import matplotlib.pyplot as plt
scores = []
for i in range(1, 751, 50):
    x_rfe = RFE(estimator=clf, n_features_to_select=i, step=50).fit_transform(x, y)
    score = cross_val_score(clf, x_rfe, y, cv=5).mean()
    scores.append(score)
plt.figure(figsize=[20, 5])
plt.plot(range(1, 751, 50), scores)
plt.xticks(range(1, 751, 50))
plt.show()

[Figure: mean cross-validation score versus number of selected features]
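As an alternative to looping over n_features_to_select by hand, scikit-learn also provides RFECV, which runs the elimination inside cross-validation and picks the number of features with the best score. The following is a minimal sketch on synthetic data (the data, the settings, and the name `rfecv_toy` are assumptions for illustration, not the author's setup).

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV

X_toy, y_toy = make_classification(
    n_samples=200, n_features=12, n_informative=4, random_state=0
)

rfecv_toy = RFECV(
    estimator=RandomForestClassifier(n_estimators=10, random_state=0),
    step=2,              # drop 2 features per round
    cv=5,                # 5-fold cross-validation at every feature count
    scoring="accuracy",
).fit(X_toy, y_toy)

print(rfecv_toy.n_features_)  # number of features chosen by cross-validation
print(rfecv_toy.support_)     # mask of the chosen features
```

On the full 784-feature digit data this is slow, so a larger step is probably needed there.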
