When building a machine learning model, not every attribute contributes equally to the model, nor does having more attributes guarantee a better model. We therefore want to select, from the many candidate attributes, the feature variables that contribute most to the model's output and predictions; this process is called feature selection. Feature selection removes irrelevant variables and reduces the amount of training data, which saves training time, improves model accuracy, and can also alleviate overfitting.
The sklearn.feature_selection module implements feature selection algorithms, currently mainly univariate feature selection and recursive feature elimination. Its classes are used for feature selection or dimensionality reduction on a sample set, to improve an algorithm's accuracy or to boost performance on high-dimensional datasets.
①sklearn.feature_selection.VarianceThreshold(threshold=0.0): selects features by their variance; by default, features with zero variance are removed automatically
- Default behavior: remove features with zero variance
In [1]: from sklearn.feature_selection import VarianceThreshold
...: X = [[0, 2, 0, 3], [0, 1, 4, 3], [0, 1, 1, 3]]
...: selector = VarianceThreshold()  # default threshold=0.0
...: selector.fit_transform(X)
...:
Out[1]:
array([[2, 0],
[1, 4],
[1, 1]])
- The threshold parameter: only features whose variance is strictly greater than the threshold are kept
In [6]: from sklearn.feature_selection import VarianceThreshold
...: X = [[0, 2, 0, 3], [0, 1, 4, 3], [0, 1, 1, 6]]
...: selector = VarianceThreshold(threshold=2)
...: selector.fit_transform(X)
...:
Out[6]:
array([[0],
[4],
[1]])
In [7]: selector.variances_
Out[7]: array([ 0. , 0.22222222, 2.88888889, 2. ])
- The variances_ attribute holds the variance of each feature in the sample
In [2]: selector.variances_
Out[2]: array([ 0. , 0.22222222, 2.88888889, 0. ])
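The values in variances_ are simply the per-column population variances (ddof=0), which can be verified independently with NumPy (a sanity check, not part of the sklearn API):

```python
import numpy as np

# VarianceThreshold scores each column by its population variance (ddof=0);
# np.var with its default ddof=0 reproduces the variances_ attribute above.
X = np.array([[0, 2, 0, 3], [0, 1, 4, 3], [0, 1, 1, 3]])
variances = np.var(X, axis=0)
print(variances)  # [0.         0.22222222 2.88888889 0.        ]
```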
- fit(X, y=None): learns the feature variances from X and returns the fitted sklearn.feature_selection.variance_threshold.VarianceThreshold object; the y parameter exists only for compatibility with sklearn.pipeline.Pipeline
In [8]: from sklearn.feature_selection import VarianceThreshold
...: X = [[0, 2, 0, 3], [0, 1, 4, 3], [0, 1, 1, 6]]
...: st = VarianceThreshold(threshold=2)
...: st.fit(X)
...: print(type(st.fit(X)))
...: st.variances_
...:
<class 'sklearn.feature_selection.variance_threshold.VarianceThreshold'>
Out[8]: array([ 0. , 0.22222222, 2.88888889, 2. ])
- fit_transform(X, y=None, **fit_params): fits on X and returns the array reduced to the selected features
In [9]: selector.fit_transform(X)
Out[9]:
array([[0],
[4],
[1]])
- get_params(deep=True): returns the estimator's parameters as a dict
In [10]: selector.get_params(deep=True)
Out[10]: {'threshold': 2}
- get_support(indices=False): with indices=False, returns a boolean array over all features, True for columns that pass the threshold and False otherwise; with indices=True, returns an array of the integer indices of the passing columns
In [11]: selector.get_support(False)
Out[11]: array([False, False, True, False], dtype=bool)
In [12]: selector.get_support(True)
Out[12]: array([2], dtype=int64)
- inverse_transform(X): returns an array of the original shape, with the removed feature columns filled with zeros
In [13]: selector.inverse_transform(selector.fit_transform(X))
Out[13]:
array([[0, 0, 0, 0],
[0, 0, 4, 0],
[0, 0, 1, 0]])
- set_params(**params): sets the estimator's parameters
In [14]: selector.set_params(threshold=1)
Out[14]: VarianceThreshold(threshold=1)
- transform(X): reduces X to the selected features using the fitted selector
In [15]: X2= [[0, 2, 0, 3], [0, 1, 4, 3], [0, 1, 1, 3]]
...: selector.transform(X2)
...:
Out[15]:
array([[0, 3],
[4, 3],
[1, 3]])
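One subtlety in the transcript above: set_params only changes the threshold and does not refit, so transform(X2) applies the variances learned from the original X. Column 3 of X2 is constant, yet it is kept, because its fitted variance (2.0) exceeds the new threshold of 1. A minimal reconstruction of the session:

```python
from sklearn.feature_selection import VarianceThreshold

X = [[0, 2, 0, 3], [0, 1, 4, 3], [0, 1, 1, 6]]    # data the selector was fitted on
selector = VarianceThreshold(threshold=2).fit(X)  # fitted variances: [0, 0.22, 2.89, 2.0]
selector.set_params(threshold=1)                  # new cutoff, but no refit

X2 = [[0, 2, 0, 3], [0, 1, 4, 3], [0, 1, 1, 3]]
# columns 2 and 3 pass because the *fitted* variances exceed 1,
# even though column 3 of X2 has zero variance
print(selector.transform(X2))
```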
②sklearn.feature_selection.GenericUnivariateSelect(score_func=<function f_classif>, mode='percentile', param=1e-05): univariate feature selection with a configurable strategy, which also makes it possible to search for the best strategy as part of hyperparameter tuning
Parameters:
score_func: a callable that takes two arrays X and y and returns a (scores, pvalues) tuple
mode: the feature selection mode, one of 'percentile', 'k_best', 'fpr', 'fdr', 'fwe'
In [1]: from sklearn.datasets import load_iris
...: from sklearn.feature_selection import chi2
...: from sklearn.feature_selection import GenericUnivariateSelect
...: iris = load_iris()
...: X,y = iris.data , iris.target
...: s = GenericUnivariateSelect(score_func =chi2, mode='k_best',param=2)
...: s.fit_transform(X,y)
...:
Out[1]:
array([[ 1.4, 0.2],
[ 1.4, 0.2],
[ 1.3, 0.2],
[ 1.5, 0.2],
[ 1.4, 0.2],
[ 1.7, 0.4],
.....
[ 5.4, 2.3],
[ 5.1, 1.8]])
The scores_ attribute holds each feature's score:
In [2]: s.scores_
Out[2]: array([ 10.81782088, 3.59449902, 116.16984746, 67.24482759])
The pvalues_ attribute holds each feature's p-value:
In [3]: s.pvalues_
Out[3]:
array([ 4.47651499e-03, 1.65754167e-01, 5.94344354e-26,
2.50017968e-15])
③sklearn.feature_selection.SelectKBest(score_func=<function f_classif>, k=10): selects the k features with the highest scores
Parameters: k is an integer or 'all'
Attributes:
In [1]: from sklearn.datasets import load_iris
...: from sklearn.feature_selection import SelectKBest,chi2
...: iris = load_iris()
...: X,y = iris.data,iris.target
...: s = SelectKBest(chi2, k='all').fit(X,y)
...:
In [2]: s.scores_
Out[2]: array([ 10.81782088, 3.59449902, 116.16984746, 67.24482759])
In [3]: s.pvalues_
Out[3]:
array([ 4.47651499e-03, 1.65754167e-01, 5.94344354e-26,
2.50017968e-15])
In [4]: sk =SelectKBest(chi2, k=2).fit(X,y)
In [5]: sk.scores_
Out[5]: array([ 10.81782088, 3.59449902, 116.16984746, 67.24482759])
In [6]: sk.pvalues_
Out[6]:
array([ 4.47651499e-03, 1.65754167e-01, 5.94344354e-26,
2.50017968e-15])
Methods:
- fit_transform(X, y=None, **fit_params): fits and transforms the data
In [7]: sk.fit_transform(X,y)
Out[7]:
array([[ 1.4, 0.2],
[ 1.4, 0.2],
[ 1.3, 0.2],
...
[ 5.2, 2. ],
[ 5.4, 2.3],
[ 5.1, 1.8]])
- get_support(indices=False): with indices=False, returns a boolean array, True for selected feature columns and False otherwise; with indices=True, returns the integer indices of the selected columns
In [8]: sk.get_support(indices=False)
Out[8]: array([False, False, True, True], dtype=bool)
In [9]: sk.get_support(indices=True)
Out[9]: array([2, 3], dtype=int64)
- inverse_transform(X): returns an array of the same shape as the original X, with unselected features replaced by zeros
In [11]: sk.inverse_transform(sk.fit_transform(X,y))
Out[11]:
array([[ 0. , 0. , 1.4, 0.2],
[ 0. , 0. , 1.4, 0.2],
[ 0. , 0. , 1.3, 0.2],
[ 0. , 0. , 1.5, 0.2],
....
[ 0. , 0. , 5.4, 2.3],
[ 0. , 0. , 5.1, 1.8]])
- transform(X): returns X reduced to the selected features
In [3]: sk.transform(X)
Out[3]:
array([[ 1.4, 0.2],
[ 1.4, 0.2],
[ 1.3, 0.2],
[ 1.5, 0.2],
...
[ 5.4, 2.3],
[ 5.1, 1.8]])
④sklearn.feature_selection.SelectPercentile(score_func=<function f_classif>, percentile=10): keeps the top percentile percent of features, ranked by score
In [1]: from sklearn.datasets import load_iris
...: from sklearn.feature_selection import SelectPercentile,chi2
...: iris = load_iris()
...: X, y = iris.data, iris.target
...: sp=SelectPercentile(chi2, percentile=33).fit(X,y)
...: print(sp.scores_)
...: X_new = sp.fit_transform(X,y)
...: X_new[:10]
...:
[ 10.81782088 3.59449902 116.16984746 67.24482759]
Out[1]:
array([[ 1.4],
[ 1.4],
[ 1.3],
[ 1.5],
[ 1.4],
[ 1.7],
[ 1.4],
[ 1.5],
[ 1.4],
[ 1.5]])
In [2]: from sklearn.datasets import load_iris
...: from sklearn.feature_selection import SelectPercentile,chi2
...: iris = load_iris()
...: X, y = iris.data, iris.target
...: sp=SelectPercentile(chi2, percentile=34).fit(X,y)
...: print(sp.scores_)
...: X_new = sp.fit_transform(X,y)
...: X_new[:10]
...:
[ 10.81782088 3.59449902 116.16984746 67.24482759]
Out[2]:
array([[ 1.4, 0.2],
[ 1.4, 0.2],
[ 1.3, 0.2],
[ 1.5, 0.2],
[ 1.4, 0.2],
[ 1.7, 0.4],
[ 1.4, 0.3],
[ 1.5, 0.2],
[ 1.4, 0.2],
[ 1.5, 0.1]])
Why does percentile=33 select only one feature while percentile=34 selects two? How is the cutoff computed from the percentile value? ----- open question
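A plausible answer (based on the selection rule in this generation of sklearn; treat it as an assumption, not the documented API): SelectPercentile keeps features whose score is strictly greater than the (100 − percentile)-th percentile of the scores, computed with linear interpolation. With four features, that cutoff lands just above the second-highest chi2 score at percentile=33 and just below it at percentile=34:

```python
import numpy as np

# chi2 scores of the four iris features, copied from the output above
scores = np.array([10.81782088, 3.59449902, 116.16984746, 67.24482759])

for pct in (33, 34):
    # assumed rule: keep scores strictly above the (100 - pct)-th percentile
    threshold = np.percentile(scores, 100 - pct)
    kept = int(np.sum(scores > threshold))
    print(pct, round(threshold, 2), kept)  # 33 -> ~67.73, 1 kept; 34 -> ~66.12, 2 kept
```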
⑤sklearn.feature_selection.SelectFpr(score_func=<function f_classif>, alpha=0.05): false positive rate test; alpha defaults to 0.05, and features whose p-value (pvalues_) is above alpha are filtered out
In [3]: from sklearn.datasets import load_iris
...: from sklearn.feature_selection import SelectFpr,chi2
...: iris = load_iris()
...: X, y = iris.data, iris.target
...: sp=SelectFpr(chi2, alpha=0.05).fit(X,y)
...:
In [4]: sp.pvalues_
Out[4]:
array([ 4.47651499e-03, 1.65754167e-01, 5.94344354e-26,
2.50017968e-15])
In [5]: sp.get_support(indices=True)
Out[5]: array([0, 2, 3], dtype=int64)
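SelectFpr's rule is a plain per-feature threshold: keep every feature whose p-value is below alpha. Rechecking the output above, with the p-values copied from it:

```python
import numpy as np

# p-values copied from sp.pvalues_ above
pvalues = np.array([4.47651499e-03, 1.65754167e-01, 5.94344354e-26, 2.50017968e-15])
alpha = 0.05
kept = np.flatnonzero(pvalues < alpha)  # indices with p-value below alpha
print(kept)  # [0 2 3], matching sp.get_support(indices=True)
```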
⑥sklearn.feature_selection.SelectFdr(score_func=<function f_classif>, alpha=0.05): similar to SelectFpr, but controls the false discovery rate using the Benjamini-Hochberg procedure
In [6]: from sklearn.datasets import load_iris
...: from sklearn.feature_selection import SelectFdr,chi2
...: iris = load_iris()
...: X, y = iris.data, iris.target
...: sp=SelectFdr(chi2, alpha=0.004).fit(X,y)
...:
In [7]: sp.pvalues_
Out[7]:
array([ 4.47651499e-03, 1.65754167e-01, 5.94344354e-26,
2.50017968e-15])
In [8]: sp.get_support(indices=True)
Out[8]: array([2, 3], dtype=int64)
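"Similar to SelectFpr" hides a real difference: SelectFdr applies the Benjamini-Hochberg step-up procedure, keeping the k smallest p-values where k is the largest rank with p_(k) ≤ alpha·k/n. A minimal sketch of that procedure (a hypothetical helper, not the sklearn internals), which reproduces the selection above:

```python
import numpy as np

def benjamini_hochberg_select(pvalues, alpha):
    """Hypothetical helper: indices kept by the Benjamini-Hochberg step-up rule."""
    p = np.asarray(pvalues)
    n = len(p)
    order = np.argsort(p)
    # compare the k-th smallest p-value against alpha * k / n (k is 1-based)
    below = p[order] <= alpha * np.arange(1, n + 1) / n
    if not below.any():
        return np.array([], dtype=int)
    k = below.nonzero()[0].max()      # largest rank that passes its stepped cutoff
    return np.sort(order[:k + 1])     # keep the k+1 smallest p-values

pvalues = [4.47651499e-03, 1.65754167e-01, 5.94344354e-26, 2.50017968e-15]
print(benjamini_hochberg_select(pvalues, alpha=0.004))  # [2 3]
```

With the default alpha=0.05, feature 0 (p ≈ 0.0045) would also pass its stepped cutoff of 0.0375; the example uses alpha=0.004 so that only features 2 and 3 survive.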