Feature Enhancement: Feature Selection

cicilover

Category: machine learning. Tags: feature selection, DT, classification, feature enhancement

Article link: https://blog.csdn.net/cicilover/article/details/77854621

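A good combination of features does not need to be large to make a model perform well. Redundant features do not necessarily hurt the model's performance, but they make the CPU do wasted work. PCA, for example, is mainly used to remove redundant, linearly correlated feature combinations, because such combinations contribute nothing further to model training. Poor features, on the other hand, will naturally lower the model's accuracy.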

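Feature selection differs somewhat from methods such as PCA, which rebuild the features from principal components. With PCA, the rebuilt features are often impossible to interpret. Feature selection, by contrast, never modifies the feature values; it concentrates on finding the small number of features that improve model performance the most.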
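The difference can be seen on a small example. The sketch below uses a hypothetical toy matrix (the values and the names `f0` to `f3` are made up for illustration, not taken from the Titanic data) and checks that the columns kept by `SelectPercentile` are untouched copies of original columns, while every PCA component mixes all of the original features.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectPercentile, chi2

# Hypothetical toy data: 6 samples, 4 non-negative features (chi2 needs X >= 0).
X = np.array([
    [1, 0, 3, 2],
    [1, 0, 2, 0],
    [1, 1, 3, 1],
    [0, 1, 2, 0],
    [0, 0, 3, 1],
    [0, 1, 2, 0],
], dtype=float)
y = np.array([1, 1, 1, 0, 0, 0])
feature_names = ['f0', 'f1', 'f2', 'f3']

# Feature selection: keep the top 50% of features ranked by chi-squared score.
selector = SelectPercentile(chi2, percentile=50)
X_selected = selector.fit_transform(X, y)
kept = selector.get_support(indices=True)
print('kept features:', [feature_names[i] for i in kept])

# The selected matrix is just a column subset of X; no value was changed.
print('values unchanged:', np.array_equal(X_selected, X[:, kept]))

# PCA: each new column is a weighted combination of all four original features.
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)
print('weights of the first component:', pca.components_[0])
```

Because the kept columns still carry their original names, the result of feature selection can be read directly, which is not the case for the PCA components.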

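Below, we again use the Titanic dataset and try to find the best feature combination through feature selection, with the goal of improving prediction accuracy.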
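The script in this post ranks features with a chi-squared test (`feature_selection.chi2`). As a rough illustration of what such a score measures, here is a simplified pure-Python sketch for one non-negative feature column: it compares the feature total observed in each class with the total expected if the feature were independent of the class. The function name `chi2_score` and the two toy columns are hypothetical, and scikit-learn's actual implementation may differ in details.

```python
def chi2_score(column, labels):
    """Chi-squared statistic between a non-negative feature and class labels."""
    total = float(sum(column))
    if total == 0:
        return 0.0  # a feature that is always zero carries no information
    n_samples = float(len(labels))
    score = 0.0
    for cls in sorted(set(labels)):
        # Feature mass actually observed inside this class.
        observed = sum(v for v, lab in zip(column, labels) if lab == cls)
        # Mass expected if the feature were independent of the class.
        class_share = sum(1 for lab in labels if lab == cls) / n_samples
        expected = total * class_share
        score += (observed - expected) ** 2 / expected
    return score


# Hypothetical toy data: label 1 = survived, 0 = did not survive.
labels = [1, 1, 1, 0, 0, 0]
informative = [1, 1, 1, 0, 0, 0]    # perfectly aligned with the label
uninformative = [1, 0, 0, 1, 0, 0]  # spread evenly over both classes

print(chi2_score(informative, labels))
print(chi2_score(uninformative, labels))
```

A higher score means the feature's distribution differs more between classes, so it ranks higher when a percentile of features is selected.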

Python source code:

# coding=utf-8
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn import feature_selection
from sklearn.feature_extraction import DictVectorizer
# sklearn.cross_validation was removed; its functions now live in sklearn.model_selection
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

# ------------- download data
titanic = pd.read_csv('http://biostat.mc.vanderbilt.edu/wiki/pub/Main/DataSets/titanic.txt')
# ------------- separate features and target
y = titanic['survived']
X = titanic.drop(['row.names', 'name', 'survived'], axis=1)
# ------------- fill missing ages with the mean age, all other missing values with 'UNKNOWN'
X['age'] = X['age'].fillna(X['age'].mean())
X = X.fillna('UNKNOWN')
# ------------- split data, 25% for test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=33)
# ------------- feature vectorization (categorical columns become one-hot features)
vec = DictVectorizer()
X_train = vec.fit_transform(X_train.to_dict(orient='records'))
X_test = vec.transform(X_test.to_dict(orient='records'))
# ------------- number of features after vectorization
print('Dimensions of handled vector', len(vec.feature_names_))
# ------------- baseline: decision tree on all features, scored on the test set
dt = DecisionTreeClassifier(criterion='entropy')
dt.fit(X_train, y_train)
print(dt.score(X_test, y_test))
# ------------- keep the top 20% of features by chi-squared score,
# ------------- then train and score a tree with the same configuration
fs = feature_selection.SelectPercentile(feature_selection.chi2, percentile=20)
X_train_fs = fs.fit_transform(X_train, y_train)
dt.fit(X_train_fs, y_train)
X_test_fs = fs.transform(X_test)
print(dt.score(X_test_fs, y_test))

# ------------- sweep the percentile from 1 to 99 in steps of 2,
# ------------- recording the mean 5-fold cross-validation accuracy for each
percentiles = list(range(1, 100, 2))
results = []

for i in percentiles:
    fs = feature_selection.SelectPercentile(feature_selection.chi2, percentile=i)
    X_train_fs = fs.fit_transform(X_train, y_train)
    scores = cross_val_score(dt, X_train_fs, y_train, cv=5)
    results.append(scores.mean())
results = np.array(results)
print(results)
# ------------- find the percentile with the best cross-validation accuracy
# np.argmax returns a single integer index even when several percentiles tie,
# which avoids the TypeError raised by int(np.where(...)[0])
opt = int(np.argmax(results))
print('Optimal percentile of features', percentiles[opt])

# ------------- use the best percentile with the same tree configuration on the test data
# (the original script hard-coded percentile=7 here)
fs = feature_selection.SelectPercentile(feature_selection.chi2, percentile=percentiles[opt])
X_train_fs = fs.fit_transform(X_train, y_train)
dt.fit(X_train_fs, y_train)
X_test_fs = fs.transform(X_test)
print(dt.score(X_test_fs, y_test))

plt.plot(percentiles, results)
plt.xlabel('percentiles of features')
plt.ylabel('accuracy')
plt.show()
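One possible refinement, offered here as a sketch rather than as part of the original script: the sweep above selects features on the whole training set before cross-validating, so each validation fold has already influenced which features were kept. Wrapping the selector and the tree in a `Pipeline` and searching the percentile with `GridSearchCV` refits the selection inside every fold. The example runs on synthetic data from `make_classification` (an assumption made so that it does not depend on the Titanic download), and `MinMaxScaler` is added only because `chi2` requires non-negative inputs.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectPercentile, chi2
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the vectorized Titanic features (assumption).
X, y = make_classification(n_samples=400, n_features=40, n_informative=5,
                           random_state=33)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=33)

pipeline = Pipeline([
    ('scale', MinMaxScaler()),            # chi2 needs non-negative features
    ('select', SelectPercentile(chi2)),   # refit inside each CV fold
    ('tree', DecisionTreeClassifier(criterion='entropy', random_state=33)),
])

# Same percentile grid as the sweep: 1, 3, ..., 99.
param_grid = {'select__percentile': list(range(1, 100, 2))}
search = GridSearchCV(pipeline, param_grid, cv=5)
search.fit(X_train, y_train)

best_percentile = search.best_params_['select__percentile']
test_accuracy = search.score(X_test, y_test)
print('Best percentile:', best_percentile)
print('Test accuracy:', test_accuracy)
```

After fitting, `search` behaves like the best pipeline refit on the full training set, so `search.score` applies scaling, selection, and the tree in one call.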

Result: