1. Introduction
Feature selection is the process of selecting, from the raw data, the features that give the model its best predictive performance. Feature selection methods fall into two broad families: statistics-based feature selection and model-based feature selection.
2. Background
Classification tasks:
- true positive rate and false positive rate
- sensitivity (true positive rate) and specificity
- false negative rate and false positive rate
Regression tasks:
- mean absolute error
- R²
Meta-metrics are metrics that are not directly tied to a model's predictive performance; they measure everything surrounding the prediction:
- time needed to fit/train the model
- time the fitted model needs to predict new instances
- size of the data that must be persisted (stored permanently)
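These meta-metrics can be measured with nothing but the standard library. A minimal sketch on toy data (the synthetic matrix and the DecisionTreeClassifier below are illustrative stand-ins, not part of the credit card example):

```python
import time
import pickle

import numpy as np
from sklearn.tree import DecisionTreeClassifier

# toy data standing in for a real feature matrix
rng = np.random.RandomState(0)
X_demo = rng.rand(1000, 5)
y_demo = (X_demo[:, 0] > .5).astype(int)

model = DecisionTreeClassifier(random_state=0)

start = time.perf_counter()
model.fit(X_demo, y_demo)                    # meta-metric 1: time to fit
fit_seconds = time.perf_counter() - start

start = time.perf_counter()
model.predict(X_demo)                        # meta-metric 2: time to predict
predict_seconds = time.perf_counter() - start

persisted_bytes = len(pickle.dumps(model))   # meta-metric 3: size of the persisted model
```

The `get_best_model_and_accuracy` helper defined next reports the first two of these automatically from the grid search's `cv_results_`.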
2.1 get_best_model_and_accuracy
Define a get_best_model_and_accuracy function that:
- searches over all the given parameters to optimize the machine learning pipeline
- prints metrics that help assess the quality of the pipeline
# import the grid search module
from sklearn.model_selection import GridSearchCV

def get_best_model_and_accuracy(model, params, X, y):
    grid = GridSearchCV(model,           # the model to search over
                        params,          # the parameters to try
                        error_score=0.)  # if a parameter set raises an error, score it 0
    grid.fit(X, y)  # fit the model and parameters
    # classical performance metric
    print("Best Accuracy: {}".format(grid.best_score_))
    # the parameters that produced the best accuracy
    print("Best Parameters: {}".format(grid.best_params_))
    # average time to fit (seconds)
    print("Average Time to Fit (s): {}".format(round(grid.cv_results_['mean_fit_time'].mean(), 3)))
    # average time to score (seconds)
    # indicative of the model's real-world latency
    print("Average Time to Score (s): {}".format(round(grid.cv_results_['mean_score_time'].mean(), 3)))
Exploratory data analysis on the credit card default dataset
import pandas as pd
import numpy as np
# seed the random number generator so results are reproducible
np.random.seed(123)
path = '/home/kesci/input/credit_card4700/credit_card_default.csv'
# load the dataset
credit_card_default = pd.read_csv(path)
# descriptive statistics
# transpose with .T for easier reading
credit_card_default.describe().T
| | count | mean | std | min | 25% | 50% | 75% | max |
---|---|---|---|---|---|---|---|---|
LIMIT_BAL | 30000.0 | 167484.322667 | 129747.661567 | 10000.0 | 50000.00 | 140000.0 | 240000.00 | 1000000.0 |
SEX | 30000.0 | 1.603733 | 0.489129 | 1.0 | 1.00 | 2.0 | 2.00 | 2.0 |
EDUCATION | 30000.0 | 1.853133 | 0.790349 | 0.0 | 1.00 | 2.0 | 2.00 | 6.0 |
MARRIAGE | 30000.0 | 1.551867 | 0.521970 | 0.0 | 1.00 | 2.0 | 2.00 | 3.0 |
AGE | 30000.0 | 35.485500 | 9.217904 | 21.0 | 28.00 | 34.0 | 41.00 | 79.0 |
PAY_0 | 30000.0 | -0.016700 | 1.123802 | -2.0 | -1.00 | 0.0 | 0.00 | 8.0 |
PAY_2 | 30000.0 | -0.133767 | 1.197186 | -2.0 | -1.00 | 0.0 | 0.00 | 8.0 |
PAY_3 | 30000.0 | -0.166200 | 1.196868 | -2.0 | -1.00 | 0.0 | 0.00 | 8.0 |
PAY_4 | 30000.0 | -0.220667 | 1.169139 | -2.0 | -1.00 | 0.0 | 0.00 | 8.0 |
PAY_5 | 30000.0 | -0.266200 | 1.133187 | -2.0 | -1.00 | 0.0 | 0.00 | 8.0 |
PAY_6 | 30000.0 | -0.291100 | 1.149988 | -2.0 | -1.00 | 0.0 | 0.00 | 8.0 |
BILL_AMT1 | 30000.0 | 51223.330900 | 73635.860576 | -165580.0 | 3558.75 | 22381.5 | 67091.00 | 964511.0 |
BILL_AMT2 | 30000.0 | 49179.075167 | 71173.768783 | -69777.0 | 2984.75 | 21200.0 | 64006.25 | 983931.0 |
BILL_AMT3 | 30000.0 | 47013.154800 | 69349.387427 | -157264.0 | 2666.25 | 20088.5 | 60164.75 | 1664089.0 |
BILL_AMT4 | 30000.0 | 43262.948967 | 64332.856134 | -170000.0 | 2326.75 | 19052.0 | 54506.00 | 891586.0 |
BILL_AMT5 | 30000.0 | 40311.400967 | 60797.155770 | -81334.0 | 1763.00 | 18104.5 | 50190.50 | 927171.0 |
BILL_AMT6 | 30000.0 | 38871.760400 | 59554.107537 | -339603.0 | 1256.00 | 17071.0 | 49198.25 | 961664.0 |
PAY_AMT1 | 30000.0 | 5663.580500 | 16563.280354 | 0.0 | 1000.00 | 2100.0 | 5006.00 | 873552.0 |
PAY_AMT2 | 30000.0 | 5921.163500 | 23040.870402 | 0.0 | 833.00 | 2009.0 | 5000.00 | 1684259.0 |
PAY_AMT3 | 30000.0 | 5225.681500 | 17606.961470 | 0.0 | 390.00 | 1800.0 | 4505.00 | 896040.0 |
PAY_AMT4 | 30000.0 | 4826.076867 | 15666.159744 | 0.0 | 296.00 | 1500.0 | 4013.25 | 621000.0 |
PAY_AMT5 | 30000.0 | 4799.387633 | 15278.305679 | 0.0 | 252.50 | 1500.0 | 4031.50 | 426529.0 |
PAY_AMT6 | 30000.0 | 5215.502567 | 17777.465775 | 0.0 | 117.75 | 1500.0 | 4000.00 | 528666.0 |
default payment next month | 30000.0 | 0.221200 | 0.415062 | 0.0 | 0.00 | 0.0 | 0.00 | 1.0 |
# check for missing values — none found
credit_card_default.isnull().sum()
LIMIT_BAL                     0
SEX                           0
EDUCATION                     0
MARRIAGE                      0
AGE                           0
PAY_0                         0
PAY_2                         0
PAY_3                         0
PAY_4                         0
PAY_5                         0
PAY_6                         0
BILL_AMT1                     0
BILL_AMT2                     0
BILL_AMT3                     0
BILL_AMT4                     0
BILL_AMT5                     0
BILL_AMT6                     0
PAY_AMT1                      0
PAY_AMT2                      0
PAY_AMT3                      0
PAY_AMT4                      0
PAY_AMT5                      0
PAY_AMT6                      0
default payment next month    0
dtype: int64
# 30,000 rows, 24 columns
credit_card_default.shape
(30000, 24)
credit_card_default.head()
| | LIMIT_BAL | SEX | EDUCATION | MARRIAGE | AGE | PAY_0 | PAY_2 | PAY_3 | PAY_4 | PAY_5 | ... | BILL_AMT4 | BILL_AMT5 | BILL_AMT6 | PAY_AMT1 | PAY_AMT2 | PAY_AMT3 | PAY_AMT4 | PAY_AMT5 | PAY_AMT6 | default payment next month |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 20000 | 2 | 2 | 1 | 24 | 2 | 2 | -1 | -1 | -2 | ... | 0 | 0 | 0 | 0 | 689 | 0 | 0 | 0 | 0 | 1 |
1 | 120000 | 2 | 2 | 2 | 26 | -1 | 2 | 0 | 0 | 0 | ... | 3272 | 3455 | 3261 | 0 | 1000 | 1000 | 1000 | 0 | 2000 | 1 |
2 | 90000 | 2 | 2 | 2 | 34 | 0 | 0 | 0 | 0 | 0 | ... | 14331 | 14948 | 15549 | 1518 | 1500 | 1000 | 1000 | 1000 | 5000 | 0 |
3 | 50000 | 2 | 2 | 1 | 37 | 0 | 0 | 0 | 0 | 0 | ... | 28314 | 28959 | 29547 | 2000 | 2019 | 1200 | 1100 | 1069 | 1000 | 0 |
4 | 50000 | 1 | 2 | 1 | 57 | -1 | 0 | -1 | 0 | 0 | ... | 20940 | 19146 | 19131 | 2000 | 36681 | 10000 | 9000 | 689 | 679 | 0 |
# features
X = credit_card_default.drop('default payment next month', axis=1)
# label
y = credit_card_default['default payment next month']
# the null accuracy
y.value_counts(normalize=True)
0    0.7788
1    0.2212
Name: default payment next month, dtype: float64
2.2 Creating a baseline machine learning pipeline
# import four models
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
import warnings
warnings.filterwarnings("ignore")
# set up variables for the grid search
# first, the parameters for each machine learning model
# logistic regression
lr_params = {'C':[1e-1, 1e0, 1e1, 1e2], 'penalty':['l1', 'l2']}
# KNN
knn_params = {'n_neighbors': [1, 3, 5, 7]}
# decision tree
tree_params = {'max_depth':[None, 1, 3, 5, 7]}
# random forest
forest_params = {'n_estimators': [10, 50, 100],
                 'max_depth': [None, 1, 3, 5, 7]}
# instantiate the machine learning models
lr = LogisticRegression()
knn = KNeighborsClassifier()
d_tree = DecisionTreeClassifier()
forest = RandomForestClassifier()
get_best_model_and_accuracy(lr, lr_params, X, y)
Best Accuracy: 0.8095333333333333
Best Parameters: {'C': 0.1, 'penalty': 'l1'}
Average Time to Fit (s): 0.524
Average Time to Score (s): 0.004
get_best_model_and_accuracy(knn, knn_params, X, y)
Best Accuracy: 0.7602333333333333
Best Parameters: {'n_neighbors': 7}
Average Time to Fit (s): 0.02
Average Time to Score (s): 0.704
KNN's accuracy, 0.7602, does not even beat the null accuracy of 0.7788.
KNN is a distance-based model: it measures closeness in space, which implicitly assumes that all features are on the same scale. Our data is probably not like that, so for KNN we need a more elaborate pipeline to assess its baseline performance more fairly.
# import the required packages
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
# set the KNN parameters for use inside the pipeline
knn_pipe_params = {'classifier__{}'.format(k): v for k, v in knn_params.items()}
# KNN needs standardized data
knn_pipe = Pipeline([('scale', StandardScaler()), ('classifier', knn)])
# fast to fit, slow to predict
get_best_model_and_accuracy(knn_pipe, knn_pipe_params, X, y)
print(knn_pipe_params) # {'classifier__n_neighbors': [1, 3, 5, 7]}
Best Accuracy: 0.8008
Best Parameters: {'classifier__n_neighbors': 7}
Average Time to Fit (s): 0.031
Average Time to Score (s): 5.075
{'classifier__n_neighbors': [1, 3, 5, 7]}
After z-score standardization with StandardScaler, the pipeline's accuracy at least beats the null accuracy, but the extra preprocessing step hurts prediction time badly. For now, logistic regression is still in the lead: higher accuracy and faster.
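The z-score standardization step can be verified on its own. A small sketch (the toy matrix below is illustrative, mimicking a LIMIT_BAL-sized column next to a small ordinal code):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# toy feature matrix with very different column scales
X_demo = np.array([[10000., 1.], [50000., 2.], [140000., 2.], [240000., 3.]])

scaled = StandardScaler().fit_transform(X_demo)

# after scaling, every column has mean 0 and standard deviation 1,
# so KNN's distance computation no longer favors the large-scale column
print(scaled.mean(axis=0), scaled.std(axis=0))
```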
# the decision tree has the best accuracy so far; it fits faster than logistic regression and predicts faster than KNN
get_best_model_and_accuracy(d_tree, tree_params, X, y)
Best Accuracy: 0.8202666666666667
Best Parameters: {'max_depth': 3}
Average Time to Fit (s): 0.136
Average Time to Score (s): 0.002
get_best_model_and_accuracy(forest, forest_params, X, y)
Best Accuracy: 0.8195666666666667
Best Parameters: {'max_depth': 7, 'n_estimators': 50}
Average Time to Fit (s): 0.916
Average Time to Score (s): 0.037
2.3 Types of feature selection
Statistics-based feature selection relies heavily on statistical tests that live outside the machine learning model, selecting features during the training phase of the pipeline.
Model-based feature selection relies on a preprocessing step in which an auxiliary machine learning model is trained and its predictive power is used to select features.
Statistics-based feature selection:
- Pearson correlations
- hypothesis testing
Both are univariate methods, meaning they are at their simplest when each feature is evaluated on its own to build a better dataset for the machine learning pipeline.
# correlation coefficients
credit_card_default.corr()
| | LIMIT_BAL | SEX | EDUCATION | MARRIAGE | AGE | PAY_0 | PAY_2 | PAY_3 | PAY_4 | PAY_5 | ... | BILL_AMT4 | BILL_AMT5 | BILL_AMT6 | PAY_AMT1 | PAY_AMT2 | PAY_AMT3 | PAY_AMT4 | PAY_AMT5 | PAY_AMT6 | default payment next month |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
LIMIT_BAL | 1.000000 | 0.024755 | -0.219161 | -0.108139 | 0.144713 | -0.271214 | -0.296382 | -0.286123 | -0.267460 | -0.249411 | ... | 0.293988 | 0.295562 | 0.290389 | 0.195236 | 0.178408 | 0.210167 | 0.203242 | 0.217202 | 0.219595 | -0.153520 |
SEX | 0.024755 | 1.000000 | 0.014232 | -0.031389 | -0.090874 | -0.057643 | -0.070771 | -0.066096 | -0.060173 | -0.055064 | ... | -0.021880 | -0.017005 | -0.016733 | -0.000242 | -0.001391 | -0.008597 | -0.002229 | -0.001667 | -0.002766 | -0.039961 |
EDUCATION | -0.219161 | 0.014232 | 1.000000 | -0.143464 | 0.175061 | 0.105364 | 0.121566 | 0.114025 | 0.108793 | 0.097520 | ... | -0.000451 | -0.007567 | -0.009099 | -0.037456 | -0.030038 | -0.039943 | -0.038218 | -0.040358 | -0.037200 | 0.028006 |
MARRIAGE | -0.108139 | -0.031389 | -0.143464 | 1.000000 | -0.414170 | 0.019917 | 0.024199 | 0.032688 | 0.033122 | 0.035629 | ... | -0.023344 | -0.025393 | -0.021207 | -0.005979 | -0.008093 | -0.003541 | -0.012659 | -0.001205 | -0.006641 | -0.024339 |
AGE | 0.144713 | -0.090874 | 0.175061 | -0.414170 | 1.000000 | -0.039447 | -0.050148 | -0.053048 | -0.049722 | -0.053826 | ... | 0.051353 | 0.049345 | 0.047613 | 0.026147 | 0.021785 | 0.029247 | 0.021379 | 0.022850 | 0.019478 | 0.013890 |
PAY_0 | -0.271214 | -0.057643 | 0.105364 | 0.019917 | -0.039447 | 1.000000 | 0.672164 | 0.574245 | 0.538841 | 0.509426 | ... | 0.179125 | 0.180635 | 0.176980 | -0.079269 | -0.070101 | -0.070561 | -0.064005 | -0.058190 | -0.058673 | 0.324794 |
PAY_2 | -0.296382 | -0.070771 | 0.121566 | 0.024199 | -0.050148 | 0.672164 | 1.000000 | 0.766552 | 0.662067 | 0.622780 | ... | 0.222237 | 0.221348 | 0.219403 | -0.080701 | -0.058990 | -0.055901 | -0.046858 | -0.037093 | -0.036500 | 0.263551 |
PAY_3 | -0.286123 | -0.066096 | 0.114025 | 0.032688 | -0.053048 | 0.574245 | 0.766552 | 1.000000 | 0.777359 | 0.686775 | ... | 0.227202 | 0.225145 | 0.222327 | 0.001295 | -0.066793 | -0.053311 | -0.046067 | -0.035863 | -0.035861 | 0.235253 |
PAY_4 | -0.267460 | -0.060173 | 0.108793 | 0.033122 | -0.049722 | 0.538841 | 0.662067 | 0.777359 | 1.000000 | 0.819835 | ... | 0.245917 | 0.242902 | 0.239154 | -0.009362 | -0.001944 | -0.069235 | -0.043461 | -0.033590 | -0.026565 | 0.216614 |
PAY_5 | -0.249411 | -0.055064 | 0.097520 | 0.035629 | -0.053826 | 0.509426 | 0.622780 | 0.686775 | 0.819835 | 1.000000 | ... | 0.271915 | 0.269783 | 0.262509 | -0.006089 | -0.003191 | 0.009062 | -0.058299 | -0.033337 | -0.023027 | 0.204149 |
PAY_6 | -0.235195 | -0.044008 | 0.082316 | 0.034345 | -0.048773 | 0.474553 | 0.575501 | 0.632684 | 0.716449 | 0.816900 | ... | 0.266356 | 0.290894 | 0.285091 | -0.001496 | -0.005223 | 0.005834 | 0.019018 | -0.046434 | -0.025299 | 0.186866 |
BILL_AMT1 | 0.285430 | -0.033642 | 0.023581 | -0.023472 | 0.056239 | 0.187068 | 0.234887 | 0.208473 | 0.202812 | 0.206684 | ... | 0.860272 | 0.829779 | 0.802650 | 0.140277 | 0.099355 | 0.156887 | 0.158303 | 0.167026 | 0.179341 | -0.019644 |
BILL_AMT2 | 0.278314 | -0.031183 | 0.018749 | -0.021602 | 0.054283 | 0.189859 | 0.235257 | 0.237295 | 0.225816 | 0.226913 | ... | 0.892482 | 0.859778 | 0.831594 | 0.280365 | 0.100851 | 0.150718 | 0.147398 | 0.157957 | 0.174256 | -0.014193 |
BILL_AMT3 | 0.283236 | -0.024563 | 0.013002 | -0.024909 | 0.053710 | 0.179785 | 0.224146 | 0.227494 | 0.244983 | 0.243335 | ... | 0.923969 | 0.883910 | 0.853320 | 0.244335 | 0.316936 | 0.130011 | 0.143405 | 0.179712 | 0.182326 | -0.014076 |
BILL_AMT4 | 0.293988 | -0.021880 | -0.000451 | -0.023344 | 0.051353 | 0.179125 | 0.222237 | 0.227202 | 0.245917 | 0.271915 | ... | 1.000000 | 0.940134 | 0.900941 | 0.233012 | 0.207564 | 0.300023 | 0.130191 | 0.160433 | 0.177637 | -0.010156 |
BILL_AMT5 | 0.295562 | -0.017005 | -0.007567 | -0.025393 | 0.049345 | 0.180635 | 0.221348 | 0.225145 | 0.242902 | 0.269783 | ... | 0.940134 | 1.000000 | 0.946197 | 0.217031 | 0.181246 | 0.252305 | 0.293118 | 0.141574 | 0.164184 | -0.006760 |
BILL_AMT6 | 0.290389 | -0.016733 | -0.009099 | -0.021207 | 0.047613 | 0.176980 | 0.219403 | 0.222327 | 0.239154 | 0.262509 | ... | 0.900941 | 0.946197 | 1.000000 | 0.199965 | 0.172663 | 0.233770 | 0.250237 | 0.307729 | 0.115494 | -0.005372 |
PAY_AMT1 | 0.195236 | -0.000242 | -0.037456 | -0.005979 | 0.026147 | -0.079269 | -0.080701 | 0.001295 | -0.009362 | -0.006089 | ... | 0.233012 | 0.217031 | 0.199965 | 1.000000 | 0.285576 | 0.252191 | 0.199558 | 0.148459 | 0.185735 | -0.072929 |
PAY_AMT2 | 0.178408 | -0.001391 | -0.030038 | -0.008093 | 0.021785 | -0.070101 | -0.058990 | -0.066793 | -0.001944 | -0.003191 | ... | 0.207564 | 0.181246 | 0.172663 | 0.285576 | 1.000000 | 0.244770 | 0.180107 | 0.180908 | 0.157634 | -0.058579 |
PAY_AMT3 | 0.210167 | -0.008597 | -0.039943 | -0.003541 | 0.029247 | -0.070561 | -0.055901 | -0.053311 | -0.069235 | 0.009062 | ... | 0.300023 | 0.252305 | 0.233770 | 0.252191 | 0.244770 | 1.000000 | 0.216325 | 0.159214 | 0.162740 | -0.056250 |
PAY_AMT4 | 0.203242 | -0.002229 | -0.038218 | -0.012659 | 0.021379 | -0.064005 | -0.046858 | -0.046067 | -0.043461 | -0.058299 | ... | 0.130191 | 0.293118 | 0.250237 | 0.199558 | 0.180107 | 0.216325 | 1.000000 | 0.151830 | 0.157834 | -0.056827 |
PAY_AMT5 | 0.217202 | -0.001667 | -0.040358 | -0.001205 | 0.022850 | -0.058190 | -0.037093 | -0.035863 | -0.033590 | -0.033337 | ... | 0.160433 | 0.141574 | 0.307729 | 0.148459 | 0.180908 | 0.159214 | 0.151830 | 1.000000 | 0.154896 | -0.055124 |
PAY_AMT6 | 0.219595 | -0.002766 | -0.037200 | -0.006641 | 0.019478 | -0.058673 | -0.036500 | -0.035861 | -0.026565 | -0.023027 | ... | 0.177637 | 0.164184 | 0.115494 | 0.185735 | 0.157634 | 0.162740 | 0.157834 | 0.154896 | 1.000000 | -0.053183 |
default payment next month | -0.153520 | -0.039961 | 0.028006 | -0.024339 | 0.013890 | 0.324794 | 0.263551 | 0.235253 | 0.216614 | 0.204149 | ... | -0.010156 | -0.006760 | -0.005372 | -0.072929 | -0.058579 | -0.056250 | -0.056827 | -0.055124 | -0.053183 | 1.000000 |
# generate a heatmap with Seaborn
import seaborn as sns
import matplotlib.style as style
%matplotlib inline
# use a clean theme
style.use('fivethirtyeight')
sns.heatmap(credit_card_default.corr())
Correlation coefficients can identify feature interactions and redundant variables; finding and removing those redundant variables is a key way to reduce overfitting in machine learning.
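The same correlation matrix can also flag feature-feature redundancy, not just feature-label relationships. A small sketch on synthetic data (the column names 'a', 'b', 'c' and the 0.95 cutoff are made up for illustration):

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(123)
df = pd.DataFrame({'a': rng.rand(500)})
df['b'] = df['a'] * 2 + rng.rand(500) * .01   # 'b' is almost a copy of 'a'
df['c'] = rng.rand(500)                       # 'c' is independent

corr = df.corr().abs()
# look only at the upper triangle so each pair is considered once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
redundant = [col for col in upper.columns if (upper[col] > .95).any()]
print(redundant)  # 'b' is flagged as redundant with 'a'
```

Dropping the flagged columns keeps one representative of each highly correlated group.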
# correlations between the features and the label only
credit_card_default.corr()['default payment next month']
LIMIT_BAL                    -0.153520
SEX                          -0.039961
EDUCATION                     0.028006
MARRIAGE                     -0.024339
AGE                           0.013890
PAY_0                         0.324794
PAY_2                         0.263551
PAY_3                         0.235253
PAY_4                         0.216614
PAY_5                         0.204149
PAY_6                         0.186866
BILL_AMT1                    -0.019644
BILL_AMT2                    -0.014193
BILL_AMT3                    -0.014076
BILL_AMT4                    -0.010156
BILL_AMT5                    -0.006760
BILL_AMT6                    -0.005372
PAY_AMT1                     -0.072929
PAY_AMT2                     -0.058579
PAY_AMT3                     -0.056250
PAY_AMT4                     -0.056827
PAY_AMT5                     -0.055124
PAY_AMT6                     -0.053183
default payment next month    1.000000
Name: default payment next month, dtype: float64
# keep only the features whose correlation magnitude exceeds 0.2
credit_card_default.corr()['default payment next month'].abs() > .2
LIMIT_BAL                     False
SEX                           False
EDUCATION                     False
MARRIAGE                      False
AGE                           False
PAY_0                          True
PAY_2                          True
PAY_3                          True
PAY_4                          True
PAY_5                          True
PAY_6                         False
BILL_AMT1                     False
BILL_AMT2                     False
BILL_AMT3                     False
BILL_AMT4                     False
BILL_AMT5                     False
BILL_AMT6                     False
PAY_AMT1                      False
PAY_AMT2                      False
PAY_AMT3                      False
PAY_AMT4                      False
PAY_AMT5                      False
PAY_AMT6                      False
default payment next month     True
Name: default payment next month, dtype: bool
# store the features
mask = credit_card_default.corr()['default payment next month'].abs() > .2
highly_correlated_features = credit_card_default.columns[mask]
highly_correlated_features
Index(['PAY_0', 'PAY_2', 'PAY_3', 'PAY_4', 'PAY_5', 'default payment next month'], dtype='object')
# drop the label
highly_correlated_features = highly_correlated_features.drop('default payment next month')
highly_correlated_features
Index(['PAY_0', 'PAY_2', 'PAY_3', 'PAY_4', 'PAY_5'], dtype='object')
# only 5 highly correlated variables remain
X_subsetted = X[highly_correlated_features]
get_best_model_and_accuracy(d_tree, tree_params, X_subsetted, y)
Best Accuracy: 0.8196666666666667
Best Parameters: {'max_depth': 3}
Average Time to Fit (s): 0.007
Average Time to Score (s): 0.002
The accuracy, 0.8197, is slightly below the 0.8202 we need to beat, but fitting is roughly 20 times faster. Our model can learn almost as much from only 5 features as from the whole dataset, and much more quickly.
Let's wrap this correlation-based selection into a preprocessing step, CustomCorrelationChooser, with fit logic and transform logic:
- fit logic: select the columns of the feature matrix whose correlation exceeds a threshold
- transform logic: subset the dataset to only the important columns
Because scikit-learn relies on duck typing (rather than inheritance), a transformer class needs three methods: fit() (which returns self), transform(), and fit_transform(). Adding TransformerMixin as a base class gives us the last one for free. Adding BaseEstimator as a base class (and avoiding *args and **kwargs in the constructor) gives two extra methods, get_params() and set_params(), which make automated hyperparameter tuning convenient.
from sklearn.base import TransformerMixin, BaseEstimator

class CustomCorrelationChooser(TransformerMixin, BaseEstimator):
    def __init__(self, response, cols_to_keep=[], threshold=None):
        # store the response variable
        self.response = response
        # store the threshold
        self.threshold = threshold
        # initialize a variable that will hold the names of the columns to keep
        self.cols_to_keep = cols_to_keep

    def transform(self, X):
        # transform selects the appropriate columns
        return X[self.cols_to_keep]

    def fit(self, X, *_):
        # create a new DataFrame that holds both the features and the response
        df = pd.concat([X, self.response], axis=1)
        # keep the names of the columns whose correlation is above the threshold
        mask = df.corr()[df.columns[-1]].abs() > self.threshold
        self.cols_to_keep = df.columns[mask]
        # keep only the columns of X, dropping the response variable
        self.cols_to_keep = [c for c in self.cols_to_keep if c in X.columns]
        return self
# instantiate the feature selector
ccc = CustomCorrelationChooser(threshold=.2, response=y)
ccc.fit(X)
ccc.cols_to_keep
['PAY_0', 'PAY_2', 'PAY_3', 'PAY_4', 'PAY_5']
ccc.transform(X).head()
| | PAY_0 | PAY_2 | PAY_3 | PAY_4 | PAY_5 |
---|---|---|---|---|---|
0 | 2 | 2 | -1 | -1 | -2 |
1 | -1 | 2 | 0 | 0 | 0 |
2 | 0 | 0 | 0 | 0 | 0 |
3 | 0 | 0 | 0 | 0 | 0 |
4 | -1 | 0 | -1 | 0 | 0 |
# the pipeline
from copy import deepcopy
# initialize the feature selector with the response variable
ccc = CustomCorrelationChooser(response=y)
# create a pipeline that includes the selector
ccc_pipe = Pipeline([('correlation_select', ccc),
                     ('classifier', d_tree)])
tree_pipe_params = {'classifier__max_depth':
                    [None, 1, 3, 5, 7, 9, 11, 13, 15, 17, 19, 21]}
# copy the decision tree's parameters
ccc_pipe_params = deepcopy(tree_pipe_params)
# and add the selector's threshold options
ccc_pipe_params.update({'correlation_select__threshold':[0, .1, .2, .3]})
print(ccc_pipe_params)
{'classifier__max_depth': [None, 1, 3, 5, 7, 9, 11, 13, 15, 17, 19, 21], 'correlation_select__threshold': [0, 0.1, 0.2, 0.3]}
# a bit better than the original, and much faster
get_best_model_and_accuracy(ccc_pipe, ccc_pipe_params, X, y)
Best Accuracy: 0.8206
Best Parameters: {'classifier__max_depth': 5, 'correlation_select__threshold': 0.1}
Average Time to Fit (s): 0.088
Average Time to Score (s): 0.003
# the best threshold is 0.1
ccc = CustomCorrelationChooser(threshold=0.1, response=y)
ccc.fit(X)
# the selector kept the 5 columns we found earlier, plus LIMIT_BAL and PAY_6
ccc.cols_to_keep
['LIMIT_BAL', 'PAY_0', 'PAY_2', 'PAY_3', 'PAY_4', 'PAY_5', 'PAY_6']
Using hypothesis testing
Hypothesis testing is a statistical method that lets us run fairly sophisticated statistical tests on individual features. As a statistical test, a hypothesis test determines, given a sample of data, whether some condition can be assumed to hold for the whole dataset. The result tells us whether we should believe or reject the hypothesis (in favor of an alternative). Based on the sample data, the test decides whether the null hypothesis should be rejected. We usually draw the conclusion from a p-value (a non-negative number bounded above by 1, judged against a significance level).
# SelectKBest keeps the k highest-scoring features under a given scoring function
from sklearn.feature_selection import SelectKBest
# the ANOVA test
from sklearn.feature_selection import f_classif
# f_classif can handle negative values, but not every scoring function can
# chi2 (chi-squared) is also very common, but supports positive values only
# regression has its own set of hypothesis tests
P-values
A p-value is a probability: the probability of observing a result at least as extreme as the sample result, assuming the null hypothesis H0 is true. If the p-value is very small, the sample result would be very unlikely, even extreme, under the null hypothesis, which in turn suggests the null hypothesis is very probably wrong.
A common threshold for p-values is 0.05, meaning a feature with a p-value below 0.05 can be considered significant.
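The f_classif scoring used below can be sanity-checked on synthetic data where we know in advance which feature is informative (the toy columns are made up for illustration):

```python
import numpy as np
from sklearn.feature_selection import f_classif

rng = np.random.RandomState(0)
y_demo = rng.randint(0, 2, 500)
informative = y_demo + rng.rand(500) * .5   # shifts with the class label
noise = rng.rand(500)                       # ignores the label entirely
X_demo = np.column_stack([informative, noise])

f_scores, p_values = f_classif(X_demo, y_demo)
# the informative column gets a vanishingly small p-value; the noise column does not
print(p_values)
```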
# keep only the best 5 features
k_best = SelectKBest(f_classif, k=5)
# the matrix after selecting the best features
k_best.fit_transform(X, y)
array([[ 2, 2, -1, -1, -2], [-1, 2, 0, 0, 0], [ 0, 0, 0, 0, 0], ..., [ 4, 3, 2, -1, 0], [ 1, -1, 0, 0, 0], [ 0, 0, 0, 0, 0]])
# the p-values of the columns
k_best.pvalues_
array([1.30224395e-157, 4.39524880e-012, 1.22503803e-006, 2.48536389e-005, 1.61368459e-002, 0.00000000e+000, 0.00000000e+000, 0.00000000e+000, 1.89929659e-315, 1.12660795e-279, 7.29674048e-234, 6.67329549e-004, 1.39573624e-002, 1.47699827e-002, 7.85556416e-002, 2.41634443e-001, 3.52122521e-001, 1.14648761e-036, 3.16665676e-024, 1.84177029e-022, 6.83094160e-023, 1.24134477e-021, 3.03358907e-020])
# build a DataFrame of features and their p-values
# sorted by p-value
p_values = pd.DataFrame({'column': X.columns, 'p_value': k_best.pvalues_}).sort_values('p_value')
# the top 5 features
p_values.head()
| | column | p_value |
---|---|---|
5 | PAY_0 | 0.000000e+00 |
6 | PAY_2 | 0.000000e+00 |
7 | PAY_3 | 0.000000e+00 |
8 | PAY_4 | 1.899297e-315 |
9 | PAY_5 | 1.126608e-279 |
# features with low p-values
p_values[p_values['p_value'] < .05]
| | column | p_value |
---|---|---|
5 | PAY_0 | 0.000000e+00 |
6 | PAY_2 | 0.000000e+00 |
7 | PAY_3 | 0.000000e+00 |
8 | PAY_4 | 1.899297e-315 |
9 | PAY_5 | 1.126608e-279 |
10 | PAY_6 | 7.296740e-234 |
0 | LIMIT_BAL | 1.302244e-157 |
17 | PAY_AMT1 | 1.146488e-36 |
18 | PAY_AMT2 | 3.166657e-24 |
20 | PAY_AMT4 | 6.830942e-23 |
19 | PAY_AMT3 | 1.841770e-22 |
21 | PAY_AMT5 | 1.241345e-21 |
22 | PAY_AMT6 | 3.033589e-20 |
1 | SEX | 4.395249e-12 |
2 | EDUCATION | 1.225038e-06 |
3 | MARRIAGE | 2.485364e-05 |
11 | BILL_AMT1 | 6.673295e-04 |
12 | BILL_AMT2 | 1.395736e-02 |
13 | BILL_AMT3 | 1.476998e-02 |
4 | AGE | 1.613685e-02 |
# features with high p-values
p_values[p_values['p_value'] >= .05]
| | column | p_value |
---|---|---|
14 | BILL_AMT4 | 0.078556 |
15 | BILL_AMT5 | 0.241634 |
16 | BILL_AMT6 | 0.352123 |
# now let's try SelectKBest in a pipeline
from copy import deepcopy
k_best = SelectKBest(f_classif)
# build a pipeline with SelectKBest
select_k_pipe = Pipeline([('k_best', k_best),
('classifier', d_tree)])
select_k_best_pipe_params = deepcopy(tree_pipe_params)
# 'all' skips selection entirely and keeps every feature
select_k_best_pipe_params.update({'k_best__k':list(range(1,23)) + ['all']})
print(select_k_best_pipe_params) # {'k_best__k': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 'all'], 'classifier__max_depth': [None, 1, 3, 5, 7, 9, 11, 13, 15, 17, 19, 21]}
# compare with the correlation-based feature selector
get_best_model_and_accuracy(select_k_pipe, select_k_best_pipe_params, X, y)
{'classifier__max_depth': [None, 1, 3, 5, 7, 9, 11, 13, 15, 17, 19, 21], 'k_best__k': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 'all']}
Best Accuracy: 0.8206
Best Parameters: {'classifier__max_depth': 5, 'k_best__k': 7}
Average Time to Fit (s): 0.087
Average Time to Score (s): 0.003
k_best = SelectKBest(f_classif, k=7)
p_values.head(7)
| | column | p_value |
---|---|---|
5 | PAY_0 | 0.000000e+00 |
6 | PAY_2 | 0.000000e+00 |
7 | PAY_3 | 0.000000e+00 |
8 | PAY_4 | 1.899297e-315 |
9 | PAY_5 | 1.126608e-279 |
10 | PAY_6 | 7.296740e-234 |
0 | LIMIT_BAL | 1.302244e-157 |
Both statistical feature-selection methods pick the same 7 features every time. Let's select everything except those 7 features and see how it goes.
# sanity check
# use the worst features
the_worst_of_X = X[X.columns.drop(['LIMIT_BAL', 'PAY_0', 'PAY_2', 'PAY_3', 'PAY_4', 'PAY_5', 'PAY_6'])]
# if the selected features are especially bad
# performance suffers accordingly
get_best_model_and_accuracy(d_tree, tree_params, the_worst_of_X, y)
Best Accuracy: 0.7839
Best Parameters: {'max_depth': 5}
Average Time to Fit (s): 0.12
Average Time to Score (s): 0.002
2.4 Model-based feature selection
Natural language processing
# the tweet dataset
tweets_path = '/home/kesci/input/Twitter8140/twitter_sentiment.csv'
tweets = pd.read_csv(tweets_path, encoding='latin1')
tweets.head()
| | ItemID | Sentiment | SentimentText |
---|---|---|---|
0 | 1 | 0 | is so sad for my APL frie... |
1 | 2 | 0 | I missed the New Moon trail... |
2 | 3 | 1 | omg its already 7:30 :O |
3 | 4 | 0 | .. Omgaga. Im sooo im gunna CRy. I'... |
4 | 5 | 0 | i think mi bf is cheating on me!!! ... |
tweets_X, tweets_y = tweets['SentimentText'], tweets['Sentiment']
# the pipeline
from sklearn.feature_extraction.text import CountVectorizer
# import Naive Bayes to speed up processing
from sklearn.naive_bayes import MultinomialNB
featurizer = CountVectorizer()
text_pipe = Pipeline([('featurizer', featurizer),
('classify', MultinomialNB())])
text_pipe_params = {'featurizer__ngram_range':[(1, 2)],
'featurizer__max_features': [5000, 10000],
'featurizer__min_df': [0., .1, .2, .3],
'featurizer__max_df': [.7, .8, .9, 1.]}
get_best_model_and_accuracy(text_pipe, text_pipe_params, tweets_X, tweets_y)
Best Accuracy: 0.7557531328446129
Best Parameters: {'featurizer__max_df': 0.7, 'featurizer__max_features': 10000, 'featurizer__min_df': 0.0, 'featurizer__ngram_range': (1, 2)}
Average Time to Fit (s): 2.726
Average Time to Score (s): 0.419
# a more basic pipeline, using SelectKBest
featurizer = CountVectorizer(ngram_range=(1, 2))
select_k_text_pipe = Pipeline([('featurizer', featurizer),
('select_k', SelectKBest()),
('classify', MultinomialNB())])
select_k_text_pipe_params = {'select_k__k': [1000, 5000]}
get_best_model_and_accuracy(select_k_text_pipe,
select_k_text_pipe_params,
tweets_X, tweets_y)
Best Accuracy: 0.7529728270109712
Best Parameters: {'select_k__k': 5000}
Average Time to Fit (s): 3.36
Average Time to Score (s): 0.73
It looks like SelectKBest does not work well on text data: without a FeatureUnion we cannot match the earlier accuracy. Note that either way, fitting and predicting take a long time: univariate statistical methods perform poorly when the number of features is huge, as with the features produced by text vectorization.
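FeatureUnion concatenates the outputs of several transformers side by side into one feature matrix. A minimal, hedged sketch with made-up toy sentences (the word/character split is just one possible combination):

```python
from sklearn.pipeline import FeatureUnion
from sklearn.feature_extraction.text import CountVectorizer

docs = ['this movie was great', 'this movie was terrible', 'great great great']

# combine word unigrams with character trigrams into one feature matrix
union = FeatureUnion([
    ('words', CountVectorizer(analyzer='word')),
    ('chars', CountVectorizer(analyzer='char', ngram_range=(3, 3))),
])

features = union.fit_transform(docs)
# number of columns = word vocabulary size + character trigram vocabulary size
print(features.shape)
```

A union like this can be dropped into the text pipeline in place of the single `featurizer` step.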
Feature selection metrics for tree-based models
When fitting a decision tree, the tree starts at the root node and greedily chooses the best split at every node, optimizing a measure of node purity. By default, scikit-learn optimizes the gini metric at each step. As it splits, the model records how much each split contributes to the overall optimization goal.
# create a new decision tree classifier
tree = DecisionTreeClassifier()
tree.fit(X, y)
DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None, max_features=None, max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None, min_samples_leaf=1, min_samples_split=2, min_weight_fraction_leaf=0.0, presort=False, random_state=None, splitter='best')
# note: there are other features as well
importances = pd.DataFrame({'importance': tree.feature_importances_, 'feature':X.columns}).sort_values('importance', ascending=False)
importances.head()
| | importance | feature |
---|---|---|
5 | 0.161872 | PAY_0 |
4 | 0.072024 | AGE |
11 | 0.071858 | BILL_AMT1 |
0 | 0.056563 | LIMIT_BAL |
19 | 0.055726 | PAY_AMT3 |
The most important feature in the fit is PAY_0, which matches the results of the earlier statistical models. But the 2nd, 3rd, and 5th most important features showed no significance in the statistical tests beforehand. This means this feature-selection method can surface candidates the statistical tests missed.
The biggest difference between SelectFromModel and SelectKBest is that SelectFromModel does not use k, the number of features to keep: it uses a threshold, the minimum level of importance a feature must reach.
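To make the threshold semantics concrete, here is a small sketch on synthetic data (the toy matrix and the RandomForestClassifier are illustrative stand-ins). With threshold='median', features whose importance reaches the median importance survive, i.e. roughly half of them:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

rng = np.random.RandomState(0)
X_demo = rng.rand(300, 10)
y_demo = (X_demo[:, 0] + X_demo[:, 1] > 1).astype(int)  # only columns 0 and 1 matter

selector = SelectFromModel(RandomForestClassifier(n_estimators=50, random_state=0),
                           threshold='median')
X_selected = selector.fit_transform(X_demo, y_demo)

support = selector.get_support()
# the two truly informative columns clear the median importance
print(X_selected.shape, support[:2])
```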
# like SelectKBest, but uses the machine learning model's internal metric to judge feature importance, not the p-values of statistical tests
from sklearn.feature_selection import SelectFromModel
# instantiate a class that ranks features by the decision tree classifier's internal importance metric and selects them
select_from_model = SelectFromModel(DecisionTreeClassifier(),
threshold=.05)
selected_X = select_from_model.fit_transform(X, y)
selected_X.shape
(30000, 8)
# to speed things up later
tree_pipe_params = {'classifier__max_depth': [1, 3, 5, 7]}
from sklearn.pipeline import Pipeline
import warnings
warnings.filterwarnings('ignore')
# create a SelectFromModel based on a DecisionTreeClassifier
select = SelectFromModel(DecisionTreeClassifier())
select_from_pipe = Pipeline([('select', select),
('classifier', d_tree)])
select_from_pipe_params = deepcopy(tree_pipe_params)
select_from_pipe_params.update({
'select__threshold': [.01, .05, .1, .2, .25, .3, .4, .5, .6, "mean", "median", "2.*mean"],
'select__estimator__max_depth': [None, 1, 3, 5, 7]
})
print(select_from_pipe_params) # {'select__threshold': [0.01, 0.05, 0.1, 'mean', 'median', '2.*mean'], 'select__estimator__max_depth': [None, 1, 3, 5, 7], 'classifier__max_depth': [1, 3, 5, 7]}
get_best_model_and_accuracy(select_from_pipe,
select_from_pipe_params,
X, y)
{'classifier__max_depth': [1, 3, 5, 7], 'select__threshold': [0.01, 0.05, 0.1, 0.2, 0.25, 0.3, 0.4, 0.5, 0.6, 'mean', 'median', '2.*mean'], 'select__estimator__max_depth': [None, 1, 3, 5, 7]}
# set the pipeline's best parameters
select_from_pipe.set_params(**{'select__threshold':0.01,
'select__estimator__max_depth':None,
'classifier__max_depth':3})
# fit the data
select_from_pipe.steps[0][1].fit(X,y)
# list the selected columns
X.columns[select_from_pipe.steps[0][1].get_support()]
Index(['LIMIT_BAL', 'SEX', 'EDUCATION', 'MARRIAGE', 'AGE', 'PAY_0', 'PAY_2', 'PAY_3', 'PAY_6', 'BILL_AMT1', 'BILL_AMT2', 'BILL_AMT3', 'BILL_AMT4', 'BILL_AMT5', 'BILL_AMT6', 'PAY_AMT1', 'PAY_AMT2', 'PAY_AMT3', 'PAY_AMT4', 'PAY_AMT5', 'PAY_AMT6'], dtype='object')
select_from_pipe.steps[0][1]
SelectFromModel(estimator=DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None, max_features=None, max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None, min_samples_leaf=1, min_samples_split=2, min_weight_fraction_leaf=0.0, presort=False, random_state=None, splitter='best'), max_features=None, norm_order=1, prefit=False, threshold=0.01)
Linear models and regularization
SelectFromModel can work with any machine learning model that exposes a feature_importances_ or coef_ attribute: tree-based models expose the former, linear models the latter.
In linear models, regularization places extra constraints on the model to prevent overfitting and improve generalization to new data. It works by adding extra terms to the loss function being optimized, which means that during fitting, a regularized linear model may drastically shrink, or even wipe out, feature coefficients.
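The claim that regularization can shrink coefficients all the way to zero is easy to see with an L1-penalized logistic regression. A hedged sketch on synthetic data (the toy matrix is made up; note that recent scikit-learn versions require the `liblinear` solver for the L1 penalty):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.RandomState(0)
X_demo = rng.rand(500, 8)
y_demo = (X_demo[:, 0] > .5).astype(int)  # only column 0 drives the label

# a strong L1 penalty (small C) drives useless coefficients to exactly zero
lasso_like = LogisticRegression(penalty='l1', C=.1, solver='liblinear')
lasso_like.fit(X_demo, y_demo)

coefs = lasso_like.coef_[0]
print(coefs)  # several of the noise coefficients are exactly 0
```

SelectFromModel then simply keeps the features whose |coef_| clears the threshold, which is why an L1-regularized estimator doubles as a feature selector.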
# select features with a regularized logistic regression
logistic_selector = SelectFromModel(LogisticRegression())
# a new pipeline, searching over LogisticRegression's parameters
regularization_pipe = Pipeline([('select', logistic_selector),
('classifier', tree)])
regularization_pipe_params = deepcopy(tree_pipe_params)
# L1 and L2 regularization
regularization_pipe_params.update({
'select__threshold': [.01, .05, .1, "mean", "median", "2.*mean"],
'select__estimator__penalty': ['l1', 'l2'],
})
print(regularization_pipe_params) # {'select__threshold': [0.01, 0.05, 0.1, 'mean', 'median', '2.*mean'], 'classifier__max_depth': [1, 3, 5, 7], 'select__estimator__penalty': ['l1', 'l2']}
get_best_model_and_accuracy(regularization_pipe,
regularization_pipe_params,
X, y)
{'classifier__max_depth': [1, 3, 5, 7], 'select__threshold': [0.01, 0.05, 0.1, 'mean', 'median', '2.*mean'], 'select__estimator__penalty': ['l1', 'l2']}
Best Accuracy: 0.8211666666666667
Best Parameters: {'classifier__max_depth': 5, 'select__estimator__penalty': 'l1', 'select__threshold': 0.01}
Average Time to Fit (s): 0.389
Average Time to Score (s): 0.002
# set the pipeline's best parameters
regularization_pipe.set_params(**{'select__threshold': 0.01,
'classifier__max_depth': 5,
'select__estimator__penalty': 'l1'})
# fit the data
regularization_pipe.steps[0][1].fit(X, y)
# list the selected columns
X.columns[regularization_pipe.steps[0][1].get_support()]
Index(['SEX', 'EDUCATION', 'MARRIAGE', 'PAY_0', 'PAY_2', 'PAY_3', 'PAY_4', 'PAY_5'], dtype='object')
At this point, the biggest difference between the logistic regression classifier and the support vector classifier (SVC) is that the latter directly maximizes binary classification accuracy, while the former is better at modeling class probabilities.
# the SVC is a linear model that separates data with linear supports in Euclidean space
# it can only separate binary data
from sklearn.svm import LinearSVC
# select features with an SVC
svc_selector = SelectFromModel(LinearSVC())
svc_pipe = Pipeline([('select', svc_selector),
('classifier', tree)])
svc_pipe_params = deepcopy(tree_pipe_params)
svc_pipe_params.update({
'select__threshold': [.01, .05, .1, "mean", "median", "2.*mean"],
'select__estimator__penalty': ['l1', 'l2'],
'select__estimator__loss': ['squared_hinge', 'hinge'],
'select__estimator__dual': [True, False]
})
print(svc_pipe_params) # 'select__estimator__loss': ['squared_hinge', 'hinge'], 'select__threshold': [0.01, 0.05, 0.1, 'mean', 'median', '2.*mean'], 'select__estimator__penalty': ['l1', 'l2'], 'classifier__max_depth': [1, 3, 5, 7], 'select__estimator__dual': [True, False]}
get_best_model_and_accuracy(svc_pipe,
svc_pipe_params,
X, y)
{'classifier__max_depth': [1, 3, 5, 7], 'select__threshold': [0.01, 0.05, 0.1, 'mean', 'median', '2.*mean'], 'select__estimator__penalty': ['l1', 'l2'], 'select__estimator__loss': ['squared_hinge', 'hinge'], 'select__estimator__dual': [True, False]}
Best Accuracy: 0.8218333333333333
Best Parameters: {'classifier__max_depth': 5, 'select__estimator__dual': True, 'select__estimator__loss': 'squared_hinge', 'select__estimator__penalty': 'l2', 'select__threshold': 'median'}
Average Time to Fit (s): 0.662
Average Time to Score (s): 0.001
The SVC reached the highest accuracy so far. Fitting time took a hit, but if we can combine the fastest predictions with the best accuracy, the machine learning pipeline will be excellent: one that uses an SVC with regularization to find the best features for a decision tree classifier. Let's see which features the selector chose to reach this best accuracy:
# set the pipeline's best parameters
svc_pipe.set_params(**{'classifier__max_depth': 5,
'select__estimator__dual': False,
'select__estimator__loss': 'squared_hinge',
'select__estimator__penalty': 'l1',
'select__threshold': 0.01})
# fit the data
svc_pipe.steps[0][1].fit(X, y)
# list the selected columns
X.columns[svc_pipe.steps[0][1].get_support()]
Index(['SEX', 'EDUCATION', 'MARRIAGE', 'PAY_0', 'PAY_2', 'PAY_3', 'PAY_5'], dtype='object')
Compared with logistic regression, the only difference is the PAY_4 feature. We can see that removing a single feature does not affect the pipeline's performance.