Automated Feature Selection

Latest recommended article published on 2024-09-22 19:01:53

Taohongfei_huster


Categories: Machine learning feature engineering, sklearn

Article link: https://blog.csdn.net/qq_41951186/article/details/83066451


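This article introduces three strategies for automated feature selection: univariate statistics, model-based selection, and iterative selection, and works through each on the cancer dataset. Univariate statistics filters features by p-value, model-based selection relies on a model's measure of feature importance, and iterative selection methods such as RFE remove unimportant features step by step. These methods help reduce noise features and can improve model performance.

Summary generated automatically by CSDN.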

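Automated feature selection judges how useful each feature is and keeps only the most useful features from the original data. There are three common strategies: univariate statistics, model-based selection, and iterative selection. The sections below analyze them using the cancer dataset.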

1. Univariate statistics

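In univariate statistics, we test whether there is a statistically significant relationship between each feature and the target, and then keep the features for which that relationship holds with the highest confidence. (A key property of these tests is that they are univariate: each feature is considered in isolation, so a feature that is informative only in combination with another feature will be discarded.)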
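The caveat about features that are informative only in combination can be illustrated with a small synthetic example. This is a sketch under assumptions that are not part of the original article: the XOR-style data, the helper feature `x_direct`, and the noise scale are all made up for illustration. Two binary features jointly determine the label, yet each one looks uninformative to `f_classif` on its own.

```python
import numpy as np
from sklearn.feature_selection import f_classif

rng = np.random.RandomState(0)

# Hypothetical toy data: every combination of two binary features, repeated 50 times.
grid = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
X_pair = np.tile(grid, (50, 1))

# The label is the XOR of the two features, so it depends on them only jointly.
y = np.logical_xor(X_pair[:, 0], X_pair[:, 1]).astype(int)

# A third, hypothetical feature that tracks the label directly, plus some noise.
x_direct = y + 0.5 * rng.normal(size=len(y))

X = np.column_stack([X_pair, x_direct])

# f_classif runs a separate one-way ANOVA F-test per column.
f_scores, p_values = f_classif(X, y)
for name, p in zip(["x1", "x2", "x_direct"], p_values):
    print(f"{name}: p-value = {p:.3g}")
```

In this construction the class means of `x1` and `x2` are identical, so their p-values are large and a univariate selector would drop both, even though together they predict the label perfectly. Only `x_direct` gets a small p-value.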

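To use univariate feature selection in scikit-learn, you choose a test, f_classif for classification or f_regression for regression, and then a method for discarding features based on the p-values computed by the test. All of these methods use a threshold to discard features whose p-value is too high, since a high p-value means the feature is unlikely to be related to the target. The methods differ in how they compute the threshold. The simplest are SelectKBest, which keeps a fixed number k of features, and SelectPercentile, which keeps a fixed percentage of features.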
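Before adding noise features, it may help to see the two selectors side by side on the unmodified cancer data. The following is a minimal sketch rather than part of the original walkthrough. The values `k=10` and `percentile=50` are arbitrary choices for illustration, and fitting on the full dataset without a train/test split is a simplification.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, SelectPercentile, f_classif

cancer = load_breast_cancer()
X, y = cancer.data, cancer.target  # 569 samples, 30 features

# Keep a fixed number of features: the 10 with the highest F-scores.
k_best = SelectKBest(score_func=f_classif, k=10).fit(X, y)

# Keep a fixed share of features: the top 50% by F-score.
top_half = SelectPercentile(score_func=f_classif, percentile=50).fit(X, y)

# get_support() returns a boolean mask over the original columns.
print("SelectKBest kept:", k_best.get_support().sum())
print("SelectPercentile kept:", top_half.get_support().sum())

# transform() drops the columns that were not selected.
X_k = k_best.transform(X)
print("Reduced shape:", X_k.shape)
print("Selected by SelectKBest:", cancer.feature_names[k_best.get_support()])
```

With 30 input features, `k=10` keeps 10 columns and `percentile=50` keeps 15. The fitted selectors also expose `scores_` and `pvalues_`, which is useful for checking how close a dropped feature was to the cut-off.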

import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectPercentile
from sklearn.model_selection import train_test_split

cancer = load_breast_cancer()

# get deterministic random numbers
rng = np.random.RandomState(42)
noise = rng.normal(size=(len(cancer.data), 50))
# add noise features to the data
# the first 30 features are from the dataset, the next 50 are noise
X_w_noise = np.hstack([cancer.data, noise])

X_train, X_test, y_train, y_test = train_test_split(
    X_w_noise, cancer.target, random_state=0, test_size=.5)
# use f_classif (the default) and SelectPercentile to select 50% of features
select = SelectPercentile(percentile=50)
select.fit(X_