DC Academy Study Notes (20): Optimizing Models with Feature Selection

Latest recommended article published on 2024-06-15 19:10:54

weixin_33970449

Tags: artificial intelligence, data structures and algorithms, python

Original link: https://yq.aliyun.com/articles/478997

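Definition of feature selection:

Feature selection, also called feature subset selection (FSS) or attribute selection, means choosing N features out of the M available features so that a specific metric of the system is optimized. It picks the most effective of the original features in order to reduce the dimensionality of the dataset. It is an important way to improve the performance of a learning algorithm and a key data preprocessing step in pattern recognition. For a learning algorithm, good training samples are the key to training a model.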
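
The definition above can be written compactly as an optimization problem. This is only a sketch: the symbols $S$ (a candidate feature subset), $S^{*}$ (the chosen subset) and $J$ (whatever evaluation metric is being optimized, for example cross-validated accuracy) are introduced here for illustration and do not come from the original notes.

```latex
S^{*} = \arg\max_{S \subseteq \{1,\dots,M\},\ |S| = N} J(S)
```

Checking every candidate would mean evaluating $\binom{M}{N}$ subsets, which is why practical methods rely on rankings or greedy searches instead of exhaustive enumeration.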

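Feature selection methods:

Data-driven: analyze the training data at hand to find out which features in x matter most for predicting y. The three main families of methods are:

Correlation: measure how strongly each feature x in the existing data is correlated with the target y
Iterative removal (or addition): once the algorithm has been chosen, search for the feature subset that makes the model perform best
Model-based: use a model such as a random forest that directly reports the importance of each training feature, or rely on regularization added when fitting the model, which filters features as a side effect, to pick out the most important ones

Domain experts: choose features using the knowledge and experience of experts in the relevant field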
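
As a rough illustration of the correlation and model-based families, the sketch below ranks the iris features in two ways. It makes several assumptions that are not in the original notes: the data comes from `sklearn.datasets.load_iris` rather than the UCI download, the ANOVA F-score from `f_classif` stands in for "correlation" because the target is categorical, and the random forest settings (`n_estimators=200`, `random_state=0`) are arbitrary.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import f_classif

# Load the iris data bundled with scikit-learn (no download needed).
data = load_iris()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = data.target

# Correlation-style ranking: ANOVA F-score of each feature against the class label.
f_scores, p_values = f_classif(X, y)
corr_rank = pd.Series(f_scores, index=X.columns).sort_values(ascending=False)

# Model-based ranking: impurity-based importances from a random forest.
forest = RandomForestClassifier(n_estimators=200, random_state=0)
forest.fit(X, y)
model_rank = pd.Series(
    forest.feature_importances_, index=X.columns
).sort_values(ascending=False)

print("F-score ranking:")
print(corr_rank)
print("Random forest importance ranking:")
print(model_rank)
```

Both rankings only order the features; deciding how many of the top-ranked features to keep is still a separate choice, usually made by checking cross-validated performance.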

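Iterative feature selection

Problem addressed: suppose we have already decided which algorithm to use. How do we know which subset of X, used as the training features, gives the best model?

Solution: add features greedily. Start from an empty set; in each round, try each remaining feature, keep the one that gives the highest cross-validated accuracy, and stop once adding a feature no longer improves the score.

Python implementation of iterative feature selection: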

import numpy as np
import pandas as pd
from sklearn import linear_model
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import LabelEncoder

# Load the iris data from the UCI repository; the file has no header row.
iris = pd.read_csv('http://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data', header=None)
iris.columns = ['SepalLengthCm', 'SepalWidthCm', 'PetalLengthCm', 'PetalWidthCm', 'Species']

# Encode the species names as integer class labels.
le = LabelEncoder()
le.fit(iris['Species'])
y = le.transform(iris['Species'])

# Note: the accuracies printed below depend on the scikit-learn version,
# because the default LogisticRegression solver has changed over time.
lm = linear_model.LogisticRegression()
features = ['PetalLengthCm', 'PetalWidthCm', 'SepalLengthCm', 'SepalWidthCm']

# Greedy forward selection: grow the feature set one feature per round.
selected_features = []
rest_features = features[:]
best_acc = 0
while len(rest_features) > 0:
    temp_best_i = ''
    temp_best_acc = 0
    # Try adding each remaining feature and score it with 5-fold cross-validation.
    for feature_i in rest_features:
        temp_features = selected_features + [feature_i]
        X = iris[temp_features]
        scores = cross_val_score(lm, X, y, cv=5, scoring='accuracy')
        acc = np.mean(scores)
        if acc > temp_best_acc:
            temp_best_acc = acc
            temp_best_i = feature_i
    print("select", temp_best_i, "acc:", temp_best_acc)
    # Keep the best candidate only if it beats the best accuracy so far.
    if temp_best_acc > best_acc:
        best_acc = temp_best_acc
        selected_features.append(temp_best_i)
        rest_features.remove(temp_best_i)
    else:
        break
print("best feature set: ", selected_features, "acc: ", best_acc)

select PetalWidthCm acc: 0.853333333333
select SepalWidthCm acc: 0.94
select PetalLengthCm acc: 0.953333333333
select SepalLengthCm acc: 0.96
best feature set:  ['PetalWidthCm', 'SepalWidthCm', 'PetalLengthCm', 'SepalLengthCm'] acc:  0.96

OK, this matches the earlier result: selecting all four features gives the best performance.
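
The same search can also run in the removal direction. The sketch below is not part of the original notes: it uses scikit-learn's `SequentialFeatureSelector` (available from scikit-learn 0.24) with `direction="backward"`, loads iris through `load_iris` to avoid the download, and assumes an illustrative target of two features. `max_iter=1000` is set only so that the solver converges on the unscaled data.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

data = load_iris()
X, y = data.data, data.target
feature_names = list(data.feature_names)

# max_iter is raised so the default solver converges on the unscaled data.
model = LogisticRegression(max_iter=1000)

# Backward elimination: start with all features and drop one per round,
# removing the feature whose absence hurts cross-validated accuracy least.
selector = SequentialFeatureSelector(
    model,
    n_features_to_select=2,  # illustrative target size (an assumption)
    direction="backward",
    scoring="accuracy",
    cv=5,
)
selector.fit(X, y)

# get_support() returns a boolean mask over the original feature columns.
kept = [name for name, keep in zip(feature_names, selector.get_support()) if keep]
print("kept features:", kept)
```

Unlike the hand-written loop, this selector stops at a fixed subset size rather than when the score stops improving, so the two approaches can end up with different feature sets.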

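Finally, let's look at what feature engineering is.

Feature engineering

There is a saying that circulates widely in the industry: data and features determine the upper limit of machine learning, while models and algorithms merely approach that limit. So what exactly is feature engineering? As the name suggests, it is essentially an engineering activity whose goal is to extract as much useful feature information as possible from the raw data for algorithms and models to use.

A diagram quoted from Zhihu: https://www.zhihu.com/question/29316149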
