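I. Purpose of feature selection
1. Reduce overfitting: with fewer redundant features, the model has less room to fit noise in the data
2. Improve the accuracy of the algorithm
3. Reduce training time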
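II. Feature selection methods
1. Univariate feature selection (chi-squared test)
The chi-squared test scores each feature on its own by how strongly it depends on the class label, and the k highest-scoring features are kept. It requires non-negative feature values.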
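import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2

# Load the iris data from a local CSV file that has no header row.
names = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'class']
data = pd.read_csv(r'iris.csv', names=names)
array = data.values
X = array[:, 0:4].astype(float)  # the four numeric features
Y = array[:, 4]                  # the class label

# Score every feature against the label with the chi-squared test.
# Note: k=4 keeps all four iris features; use a smaller k to actually drop some.
test = SelectKBest(score_func=chi2, k=4)
fit = test.fit(X, Y)
np.set_printoptions(precision=3)
print(fit.scores_)  # one chi-squared score per feature
features = fit.transform(X)
print(features)     # the columns of X that were kept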
Output:
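As a variation on the listing above, here is a minimal sketch that does not need a local iris.csv: it assumes the copy of the iris data bundled with scikit-learn (load_iris) is an acceptable stand-in, and it sets k=2 so that some features are actually discarded. The names selector, mask, and selected are only illustrative.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2

# Assumption: the iris data bundled with scikit-learn replaces the local CSV file.
iris = load_iris()
X, y = iris.data, iris.target

# Keep only the two features with the highest chi-squared scores.
selector = SelectKBest(score_func=chi2, k=2)
X_new = selector.fit_transform(X, y)

# get_support() returns a boolean mask marking the features that were kept.
mask = selector.get_support()
selected = [name for name, keep in zip(iris.feature_names, mask) if keep]

print(selector.scores_)  # one chi-squared score per feature
print(selected)          # names of the features that were kept
print(X_new.shape)       # same number of rows, two columns
```

On this dataset the two petal measurements get by far the highest scores, so they are the ones kept. Mapping the mask back to feature names is usually easier to read than printing the transformed array.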
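2. Recursive feature elimination
The code below uses a support vector machine with a linear kernel as the base model and applies recursive feature elimination (RFE) to select the three features that have the greatest influence on the prediction.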
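import pandas as pd
from sklearn.feature_selection import RFE
from sklearn.svm import SVC

# Load the iris data from a local CSV file that has no header row.
names = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'class']
data = pd.read_csv(r'iris.csv', names=names)
array = data.values
X = array[:, 0:4].astype(float)  # the four numeric features
Y = array[:, 4]                  # the class label

# A linear kernel exposes coef_, which RFE uses to rank the features.
model = SVC(kernel="linear")
# Recent scikit-learn versions require n_features_to_select as a keyword argument.
rfe = RFE(model, n_features_to_select=3)
fit = rfe.fit(X, Y)
print("Number of features:")
print(fit.n_features_)
print("Selected features:")
print(fit.support_)
print("Feature ranking:")
print(fit.ranking_)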
Output:
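The same idea can be sketched without a local iris.csv. The version below assumes the iris data bundled with scikit-learn (load_iris) in place of the CSV file, and the variable names kept and dropped are only illustrative. It shows how to turn support_ and ranking_ into feature names.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import RFE
from sklearn.svm import SVC

# Assumption: the iris data bundled with scikit-learn replaces the local CSV file.
iris = load_iris()
X, y = iris.data, iris.target

# RFE repeatedly fits the model and removes the weakest feature
# (one per round by default) until three features remain.
rfe = RFE(SVC(kernel="linear"), n_features_to_select=3)
rfe.fit(X, y)

# support_ is a boolean mask; ranking_ is 1 for every selected feature
# and grows with how early a feature was eliminated.
kept = [n for n, keep in zip(iris.feature_names, rfe.support_) if keep]
dropped = [n for n, keep in zip(iris.feature_names, rfe.support_) if not keep]

print("Kept:", kept)
print("Dropped:", dropped)
print("Ranking:", rfe.ranking_)
```

With four input features and three to keep, only one elimination round runs, so exactly one feature ends up with rank 2 and the rest have rank 1.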