Feature Selection (4)

1. Introduction

Feature selection is the process of choosing, from the original data, the features that work best for the model's predictions. Feature selection methods fall into two broad families: statistics-based feature selection and model-based feature selection.

2. Fundamentals

Classification tasks:

  • true positive rate and false positive rate
  • sensitivity (true positive rate) and specificity
  • false negative rate and false positive rate

Regression tasks:

  • mean absolute error
  • R²

Meta-metrics are metrics that are not directly tied to a model's predictive performance; they try to measure the performance of everything around the predictions (see the sketch after this list):

  • time needed to fit/train the model
  • time needed for a fitted model to predict new instances
  • size of the data that must be persisted (stored permanently)
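
To ground the classification and regression metrics above, here is a minimal sketch using scikit-learn; the toy labels and predictions are made up purely for illustration:

from sklearn.metrics import confusion_matrix, mean_absolute_error, r2_score

# hypothetical ground truth and predictions for a binary classifier
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# unpack the binary confusion matrix into its four cells
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

sensitivity = tp / (tp + fn)          # true positive rate
specificity = tn / (tn + fp)          # true negative rate
false_positive_rate = fp / (fp + tn)  # 1 - specificity

# the regression metrics, on equally made-up numbers
print(mean_absolute_error([3.0, 2.5, 4.1], [2.8, 2.7, 4.0]))
print(r2_score([3.0, 2.5, 4.1], [2.8, 2.7, 4.0]))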

2.1 get_best_model_and_accuracy

Define a get_best_model_and_accuracy function that:

  • searches all given parameters to optimize the machine learning pipeline
  • prints metrics that help assess the quality of the pipeline
# import the grid search module
from sklearn.model_selection import GridSearchCV


def get_best_model_and_accuracy(model, params, X, y):
    grid = GridSearchCV(model, # the model to search over
                        params, # the parameters to try
                        error_score=0.) # if a combination errors out, score it 0
    grid.fit(X, y) # fit the model and parameters
    # the classical performance metric
    print("Best Accuracy: {}".format(grid.best_score_))
    # the parameters that produced the best accuracy
    print("Best Parameters: {}".format(grid.best_params_))
    # average time to fit (seconds)
    print("Average Time to Fit (s): {}".format(round(grid.cv_results_['mean_fit_time'].mean(), 3)))
    # average time to predict (seconds)
    # this metric hints at how the model will perform in the real world
    print("Average Time to Score (s): {}".format(round(grid.cv_results_['mean_score_time'].mean(), 3)))

Exploratory data analysis of the credit card default dataset

import pandas as pd
import numpy as np

# seed the random number generator so results are always reproducible
np.random.seed(123)

path = '/home/kesci/input/credit_card4700/credit_card_default.csv'
# load the dataset
credit_card_default = pd.read_csv(path)
# descriptive statistics
# transpose with .T for easier reading
credit_card_default.describe().T
                             count           mean            std        min       25%       50%        75%        max
LIMIT_BAL                  30000.0  167484.322667  129747.661567    10000.0  50000.00  140000.0  240000.00  1000000.0
SEX                        30000.0       1.603733       0.489129        1.0      1.00       2.0       2.00        2.0
EDUCATION                  30000.0       1.853133       0.790349        0.0      1.00       2.0       2.00        6.0
MARRIAGE                   30000.0       1.551867       0.521970        0.0      1.00       2.0       2.00        3.0
AGE                        30000.0      35.485500       9.217904       21.0     28.00      34.0      41.00       79.0
PAY_0                      30000.0      -0.016700       1.123802       -2.0     -1.00       0.0       0.00        8.0
PAY_2                      30000.0      -0.133767       1.197186       -2.0     -1.00       0.0       0.00        8.0
PAY_3                      30000.0      -0.166200       1.196868       -2.0     -1.00       0.0       0.00        8.0
PAY_4                      30000.0      -0.220667       1.169139       -2.0     -1.00       0.0       0.00        8.0
PAY_5                      30000.0      -0.266200       1.133187       -2.0     -1.00       0.0       0.00        8.0
PAY_6                      30000.0      -0.291100       1.149988       -2.0     -1.00       0.0       0.00        8.0
BILL_AMT1                  30000.0   51223.330900   73635.860576  -165580.0   3558.75   22381.5   67091.00   964511.0
BILL_AMT2                  30000.0   49179.075167   71173.768783   -69777.0   2984.75   21200.0   64006.25   983931.0
BILL_AMT3                  30000.0   47013.154800   69349.387427  -157264.0   2666.25   20088.5   60164.75  1664089.0
BILL_AMT4                  30000.0   43262.948967   64332.856134  -170000.0   2326.75   19052.0   54506.00   891586.0
BILL_AMT5                  30000.0   40311.400967   60797.155770   -81334.0   1763.00   18104.5   50190.50   927171.0
BILL_AMT6                  30000.0   38871.760400   59554.107537  -339603.0   1256.00   17071.0   49198.25   961664.0
PAY_AMT1                   30000.0    5663.580500   16563.280354        0.0   1000.00    2100.0    5006.00   873552.0
PAY_AMT2                   30000.0    5921.163500   23040.870402        0.0    833.00    2009.0    5000.00  1684259.0
PAY_AMT3                   30000.0    5225.681500   17606.961470        0.0    390.00    1800.0    4505.00   896040.0
PAY_AMT4                   30000.0    4826.076867   15666.159744        0.0    296.00    1500.0    4013.25   621000.0
PAY_AMT5                   30000.0    4799.387633   15278.305679        0.0    252.50    1500.0    4031.50   426529.0
PAY_AMT6                   30000.0    5215.502567   17777.465775        0.0    117.75    1500.0    4000.00   528666.0
default payment next month 30000.0       0.221200       0.415062        0.0      0.00       0.0       0.00        1.0
# check for missing values; there are none
credit_card_default.isnull().sum()
LIMIT_BAL                     0
SEX                           0
EDUCATION                     0
MARRIAGE                      0
AGE                           0
PAY_0                         0
PAY_2                         0
PAY_3                         0
PAY_4                         0
PAY_5                         0
PAY_6                         0
BILL_AMT1                     0
BILL_AMT2                     0
BILL_AMT3                     0
BILL_AMT4                     0
BILL_AMT5                     0
BILL_AMT6                     0
PAY_AMT1                      0
PAY_AMT2                      0
PAY_AMT3                      0
PAY_AMT4                      0
PAY_AMT5                      0
PAY_AMT6                      0
default payment next month    0
dtype: int64
# 30,000 rows, 24 columns
credit_card_default.shape

(30000, 24)

credit_card_default.head()
   LIMIT_BAL  SEX  EDUCATION  MARRIAGE  AGE  PAY_0  PAY_2  PAY_3  PAY_4  PAY_5  ...  BILL_AMT4  BILL_AMT5  BILL_AMT6  PAY_AMT1  PAY_AMT2  PAY_AMT3  PAY_AMT4  PAY_AMT5  PAY_AMT6  default payment next month
0      20000    2          2         1   24      2      2     -1     -1     -2  ...          0          0          0         0       689         0         0         0         0                           1
1     120000    2          2         2   26     -1      2      0      0      0  ...       3272       3455       3261         0      1000      1000      1000         0      2000                           1
2      90000    2          2         2   34      0      0      0      0      0  ...      14331      14948      15549      1518      1500      1000      1000      1000      5000                           0
3      50000    2          2         1   37      0      0      0      0      0  ...      28314      28959      29547      2000      2019      1200      1100      1069      1000                           0
4      50000    1          2         1   57     -1      0     -1      0      0  ...      20940      19146      19131      2000     36681     10000      9000       689       679                           0

[5 rows x 24 columns]
# features
X = credit_card_default.drop('default payment next month', axis=1)

# label
y = credit_card_default['default payment next month']



# get the null accuracy (majority-class baseline)
y.value_counts(normalize=True)
0    0.7788
1    0.2212
Name: default payment next month, dtype: float64

2.2 Creating a baseline machine learning pipeline

# import the 4 models
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

import warnings
warnings.filterwarnings("ignore")

# set up variables for the grid search
# first, the parameters of each machine learning model

# logistic regression
lr_params = {'C':[1e-1, 1e0, 1e1, 1e2], 'penalty':['l1', 'l2']}

# KNN
knn_params = {'n_neighbors': [1, 3, 5, 7]}

# decision tree
tree_params = {'max_depth':[None, 1, 3, 5, 7]}

# random forest
forest_params = {'n_estimators': [10, 50, 100], 
                 'max_depth': [None, 1, 3, 5, 7]}
                 
                 
# instantiate the machine learning models
lr = LogisticRegression()
knn = KNeighborsClassifier()
d_tree = DecisionTreeClassifier()
forest = RandomForestClassifier()          





get_best_model_and_accuracy(lr, lr_params, X, y)
Best Accuracy: 0.8095333333333333
Best Parameters: {'C': 0.1, 'penalty': 'l1'}
Average Time to Fit (s): 0.524
Average Time to Score (s): 0.004
get_best_model_and_accuracy(knn, knn_params, X, y)
Best Accuracy: 0.7602333333333333
Best Parameters: {'n_neighbors': 7}
Average Time to Fit (s): 0.02
Average Time to Score (s): 0.704

Note that KNN's accuracy falls short of the null accuracy of 0.7788.
KNN is a distance-based model: it measures closeness in space, which implicitly assumes all features are on the same scale. Our data likely isn't, so for KNN we need a more complex pipeline before we can fairly assess its baseline performance.

# import the required packages
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# set the KNN parameters for the pipeline
knn_pipe_params = {'classifier__{}'.format(k): v for k, v in knn_params.items()}

# KNN needs standardized features, so scale first
knn_pipe = Pipeline([('scale', StandardScaler()), ('classifier', knn)])

# fast to fit, slow to predict
get_best_model_and_accuracy(knn_pipe, knn_pipe_params, X, y)

print(knn_pipe_params)  # {'classifier__n_neighbors': [1, 3, 5, 7]} 
Best Accuracy: 0.8008
Best Parameters: {'classifier__n_neighbors': 7}
Average Time to Fit (s): 0.031
Average Time to Score (s): 5.075
{'classifier__n_neighbors': [1, 3, 5, 7]}

After z-score standardization with StandardScaler (z = (x − μ) / σ, giving each feature mean 0 and standard deviation 1), the pipeline's accuracy at least beats the null accuracy, but the extra preprocessing step badly hurts prediction time. For now, logistic regression still leads: better accuracy and faster speed.

# the decision tree has the best accuracy so far; it fits faster than logistic regression and predicts faster than KNN
get_best_model_and_accuracy(d_tree, tree_params, X, y)
Best Accuracy: 0.8202666666666667
Best Parameters: {'max_depth': 3}
Average Time to Fit (s): 0.136
Average Time to Score (s): 0.002
get_best_model_and_accuracy(forest, forest_params, X, y)
Best Accuracy: 0.8195666666666667
Best Parameters: {'max_depth': 7, 'n_estimators': 50}
Average Time to Fit (s): 0.916
Average Time to Score (s): 0.037

2.3 Types of feature selection

Statistics-based feature selection relies largely on statistical tests that live outside the machine learning model, selecting features during the training phase of the pipeline.
Model-based feature selection relies on a preprocessing step in which an auxiliary machine learning model is trained and its predictive power is used to select features.

Statistics-based feature selection uses:

  • Pearson correlations
  • hypothesis testing

Both are univariate methods, meaning they evaluate one feature at a time; this is the simplest approach when the goal is to improve the pipeline by selecting single features to build a better dataset.
# correlation matrix
credit_card_default.corr()
[Output: the 24 × 24 correlation matrix of all columns. Notable values: the BILL_AMT1 through BILL_AMT6 columns are strongly inter-correlated (e.g. BILL_AMT5 vs. BILL_AMT6: 0.946), the PAY_* columns correlate moderately with one another (up to 0.820), and PAY_0 has the strongest correlation with the label (0.325).]
# generate a heatmap with Seaborn
import seaborn as sns
import matplotlib.style as style
%matplotlib inline

# use a clean theme
style.use('fivethirtyeight')

sns.heatmap(credit_card_default.corr())

Correlations can identify feature interactions and redundant variables; finding and removing those redundant variables is a key way to reduce overfitting in machine learning, as sketched below.
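
As a minimal sketch of hunting for redundancy (the 0.9 cutoff here is an arbitrary illustrative choice, not from the text), we can scan the upper triangle of the correlation matrix for highly inter-correlated pairs:

import numpy as np

corr_matrix = credit_card_default.corr().abs()
# keep only the upper triangle so each pair appears once and the diagonal is excluded
upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape, dtype=bool), k=1))
# columns that are near-duplicates of an earlier column
redundant = [col for col in upper.columns if (upper[col] > .9).any()]
print(redundant)  # likely flags several BILL_AMT columns, which correlate above 0.9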

# correlations between the features and the label only
credit_card_default.corr()['default payment next month'] 
LIMIT_BAL                    -0.153520
SEX                          -0.039961
EDUCATION                     0.028006
MARRIAGE                     -0.024339
AGE                           0.013890
PAY_0                         0.324794
PAY_2                         0.263551
PAY_3                         0.235253
PAY_4                         0.216614
PAY_5                         0.204149
PAY_6                         0.186866
BILL_AMT1                    -0.019644
BILL_AMT2                    -0.014193
BILL_AMT3                    -0.014076
BILL_AMT4                    -0.010156
BILL_AMT5                    -0.006760
BILL_AMT6                    -0.005372
PAY_AMT1                     -0.072929
PAY_AMT2                     -0.058579
PAY_AMT3                     -0.056250
PAY_AMT4                     -0.056827
PAY_AMT5                     -0.055124
PAY_AMT6                     -0.053183
default payment next month    1.000000
Name: default payment next month, dtype: float64
# keep only the features whose correlation with the label is beyond ±0.2
credit_card_default.corr()['default payment next month'].abs() > .2
LIMIT_BAL                     False
SEX                           False
EDUCATION                     False
MARRIAGE                      False
AGE                           False
PAY_0                          True
PAY_2                          True
PAY_3                          True
PAY_4                          True
PAY_5                          True
PAY_6                         False
BILL_AMT1                     False
BILL_AMT2                     False
BILL_AMT3                     False
BILL_AMT4                     False
BILL_AMT5                     False
BILL_AMT6                     False
PAY_AMT1                      False
PAY_AMT2                      False
PAY_AMT3                      False
PAY_AMT4                      False
PAY_AMT5                      False
PAY_AMT6                      False
default payment next month     True
Name: default payment next month, dtype: bool
# store the features
mask = credit_card_default.corr()['default payment next month'].abs() > .2
highly_correlated_features = credit_card_default.columns[mask]

highly_correlated_features
Index(['PAY_0', 'PAY_2', 'PAY_3', 'PAY_4', 'PAY_5',
       'default payment next month'],
      dtype='object')
# drop the label
highly_correlated_features = highly_correlated_features.drop('default payment next month')

highly_correlated_features

Index(['PAY_0', 'PAY_2', 'PAY_3', 'PAY_4', 'PAY_5'], dtype='object')

# only the 5 highly correlated variables
X_subsetted = X[highly_correlated_features]

get_best_model_and_accuracy(d_tree, tree_params, X_subsetted, y) 
Best Accuracy: 0.8196666666666667
Best Parameters: {'max_depth': 3}
Average Time to Fit (s): 0.007
Average Time to Score (s): 0.002

The accuracy is slightly below the 0.82026 we need to beat, but fitting is roughly 20 times faster. Our model can learn nearly as much from just 5 features as from the whole dataset, and much more quickly.

To make correlation-based selection part of the preprocessing stage, we wrap it in a CustomCorrelationChooser that implements a fit logic and a transform logic:

  • fit logic: select the columns of the feature matrix whose correlation with the response exceeds a threshold
  • transform logic: subset the dataset to keep only those important columns

Because scikit-learn relies on duck typing (rather than inheritance), a transformer class needs three methods: fit() (returning self), transform(), and fit_transform(). Adding TransformerMixin as a base class gives us the last one for free. Adding BaseEstimator as a base class (and avoiding *args and **kwargs in the constructor) gives two extra methods, get_params() and set_params(), which make automated hyperparameter tuning convenient.

from sklearn.base import TransformerMixin, BaseEstimator

class CustomCorrelationChooser(TransformerMixin, BaseEstimator):
    def __init__(self, response, cols_to_keep=[], threshold=None):
        # store the response variable
        self.response = response
        # store the threshold
        self.threshold = threshold
        # initialize a variable that will hold the names of the columns to keep
        self.cols_to_keep = cols_to_keep
        
    def transform(self, X):
        # transform selects the appropriate columns
        return X[self.cols_to_keep]
        
    def fit(self, X, *_):
        # build a new DataFrame holding both the features and the response
        df = pd.concat([X, self.response], axis=1)
        # keep the names of the columns whose correlation exceeds the threshold
        mask = df.corr()[df.columns[-1]].abs() > self.threshold
        self.cols_to_keep = df.columns[mask]
        # keep only the columns of X, dropping the response variable
        self.cols_to_keep = [c for c in self.cols_to_keep if c in X.columns]
        return self





# instantiate the feature selector
ccc = CustomCorrelationChooser(threshold=.2, response=y)
ccc.fit(X)

ccc.cols_to_keep

['PAY_0', 'PAY_2', 'PAY_3', 'PAY_4', 'PAY_5']

ccc.transform(X).head()
   PAY_0  PAY_2  PAY_3  PAY_4  PAY_5
0      2      2     -1     -1     -2
1     -1      2      0      0      0
2      0      0      0      0      0
3      0      0      0      0      0
4     -1      0     -1      0      0
# pipeline
from copy import deepcopy

# initialize the feature selector with the response variable
ccc = CustomCorrelationChooser(response=y)

# build a pipeline that includes the selector
ccc_pipe = Pipeline([('correlation_select', ccc), 
                     ('classifier', d_tree)])

tree_pipe_params = {'classifier__max_depth': 
                    [None, 1, 3, 5, 7, 9, 11, 13, 15, 17, 19, 21]}

# copy the decision tree parameters
ccc_pipe_params = deepcopy(tree_pipe_params)

# extend the grid with the selector's threshold values
ccc_pipe_params.update({'correlation_select__threshold':[0, .1, .2, .3]})

print(ccc_pipe_params)

{'classifier__max_depth': [None, 1, 3, 5, 7, 9, 11, 13, 15, 17, 19, 21], 'correlation_select__threshold': [0, 0.1, 0.2, 0.3]}

# slightly better than before, and fast
get_best_model_and_accuracy(ccc_pipe, ccc_pipe_params, X, y) 
Best Accuracy: 0.8206
Best Parameters: {'classifier__max_depth': 5, 'correlation_select__threshold': 0.1}
Average Time to Fit (s): 0.088
Average Time to Score (s): 0.003
# the best threshold is 0.1
ccc = CustomCorrelationChooser(threshold=0.1, response=y)
ccc.fit(X)

# the selector kept the 5 columns we found earlier, plus LIMIT_BAL and PAY_6
ccc.cols_to_keep

['LIMIT_BAL', 'PAY_0', 'PAY_2', 'PAY_3', 'PAY_4', 'PAY_5', 'PAY_6']

Using hypothesis testing

Hypothesis testing is a statistical method that lets us run more sophisticated statistical tests on individual features. As a statistical test, hypothesis testing determines whether, given a sample of data, a condition can be assumed to hold for the entire dataset. The outcome tells us whether to believe or reject a hypothesis (and choose the alternative): based on the sample, the test decides whether the null hypothesis should be rejected. We usually draw that conclusion from a p-value (a non-negative number bounded above by 1), compared against a chosen significance level. The sketch below shows the idea on a single feature.
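
As a concrete single-feature illustration (this snippet is my addition, not from the original text), the one-way ANOVA F-test that f_classif applies per feature can be reproduced for one column with scipy:

from scipy.stats import f_oneway

# split one feature's values by class label and test whether the group means differ
groups = [X['PAY_0'][y == label] for label in y.unique()]
f_stat, p_value = f_oneway(*groups)
print(f_stat, p_value)  # a tiny p-value suggests PAY_0's mean differs across classes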

# SelectKBest keeps the k highest-scoring features under a given scoring function
from sklearn.feature_selection import SelectKBest

# the ANOVA test
from sklearn.feature_selection import f_classif

# f_classif works even when features take negative values, which not every test does
# chi2 (chi-squared) is also very common, but it only supports non-negative values
# regression tasks have their own set of hypothesis tests

P-values

A p-value is a probability: the probability of observing a sample result at least this extreme under the assumption that the null hypothesis H0 is true. If the p-value is very small, then were the null hypothesis true, our sample result would be very unlikely, even extreme; this in turn suggests the null hypothesis is most likely wrong.

A common p-value threshold is 0.05, meaning features with p-values below 0.05 are treated as significant.

# keep only the 5 best features
k_best = SelectKBest(f_classif, k=5)




# the matrix after selecting the best features
k_best.fit_transform(X, y)
array([[ 2,  2, -1, -1, -2],
       [-1,  2,  0,  0,  0],
       [ 0,  0,  0,  0,  0],
       ...,
       [ 4,  3,  2, -1,  0],
       [ 1, -1,  0,  0,  0],
       [ 0,  0,  0,  0,  0]])
# p-values of the columns
k_best.pvalues_
array([1.30224395e-157, 4.39524880e-012, 1.22503803e-006, 2.48536389e-005,
       1.61368459e-002, 0.00000000e+000, 0.00000000e+000, 0.00000000e+000,
       1.89929659e-315, 1.12660795e-279, 7.29674048e-234, 6.67329549e-004,
       1.39573624e-002, 1.47699827e-002, 7.85556416e-002, 2.41634443e-001,
       3.52122521e-001, 1.14648761e-036, 3.16665676e-024, 1.84177029e-022,
       6.83094160e-023, 1.24134477e-021, 3.03358907e-020])
# build a DataFrame of features and p-values
# sorted by p-value
p_values = pd.DataFrame({'column': X.columns, 'p_value': k_best.pvalues_}).sort_values('p_value')

# the first 5 features
p_values.head()
      column        p_value
5      PAY_0   0.000000e+00
6      PAY_2   0.000000e+00
7      PAY_3   0.000000e+00
8      PAY_4  1.899297e-315
9      PAY_5  1.126608e-279
# features with low p-values
p_values[p_values['p_value'] < .05]
       column        p_value
5       PAY_0   0.000000e+00
6       PAY_2   0.000000e+00
7       PAY_3   0.000000e+00
8       PAY_4  1.899297e-315
9       PAY_5  1.126608e-279
10      PAY_6  7.296740e-234
0   LIMIT_BAL  1.302244e-157
17   PAY_AMT1   1.146488e-36
18   PAY_AMT2   3.166657e-24
20   PAY_AMT4   6.830942e-23
19   PAY_AMT3   1.841770e-22
21   PAY_AMT5   1.241345e-21
22   PAY_AMT6   3.033589e-20
1         SEX   4.395249e-12
2   EDUCATION   1.225038e-06
3    MARRIAGE   2.485364e-05
11  BILL_AMT1   6.673295e-04
12  BILL_AMT2   1.395736e-02
13  BILL_AMT3   1.476998e-02
4         AGE   1.613685e-02
# features with high p-values
p_values[p_values['p_value'] >= .05]
       column   p_value
14  BILL_AMT4  0.078556
15  BILL_AMT5  0.241634
16  BILL_AMT6  0.352123
# let's try SelectKBest
from copy import deepcopy

k_best = SelectKBest(f_classif)

# build a pipeline with SelectKBest
select_k_pipe = Pipeline([('k_best', k_best), 
                         ('classifier', d_tree)])

select_k_best_pipe_params = deepcopy(tree_pipe_params)
# the special value 'all' skips selection and keeps every feature
select_k_best_pipe_params.update({'k_best__k':list(range(1,23)) + ['all']})

print(select_k_best_pipe_params) # {'k_best__k': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 'all'], 'classifier__max_depth': [None, 1, 3, 5, 7, 9, 11, 13, 15, 17, 19, 21]}

# compare with the correlation-based feature selector
get_best_model_and_accuracy(select_k_pipe, select_k_best_pipe_params, X, y)
{'classifier__max_depth': [None, 1, 3, 5, 7, 9, 11, 13, 15, 17, 19, 21], 'k_best__k': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 'all']}
Best Accuracy: 0.8206
Best Parameters: {'classifier__max_depth': 5, 'k_best__k': 7}
Average Time to Fit (s): 0.087
Average Time to Score (s): 0.003
k_best = SelectKBest(f_classif, k=7)





p_values.head(7)
       column        p_value
5       PAY_0   0.000000e+00
6       PAY_2   0.000000e+00
7       PAY_3   0.000000e+00
8       PAY_4  1.899297e-315
9       PAY_5  1.126608e-279
10      PAY_6  7.296740e-234
0   LIMIT_BAL  1.302244e-157

The two statistical approaches to feature selection chose the same 7 features every time. Now let's select every feature except those 7 and see how that goes.

# sanity check
# use the worst features
the_worst_of_X = X[X.columns.drop(['LIMIT_BAL', 'PAY_0', 'PAY_2', 'PAY_3', 'PAY_4', 'PAY_5', 'PAY_6'])]

# if the selected features are especially bad
# performance suffers accordingly
get_best_model_and_accuracy(d_tree, tree_params, the_worst_of_X, y)
Best Accuracy: 0.7839
Best Parameters: {'max_depth': 5}
Average Time to Fit (s): 0.12
Average Time to Score (s): 0.002

2.4 Model-based feature selection

Natural language processing

# tweet dataset
tweets_path = '/home/kesci/input/Twitter8140/twitter_sentiment.csv'
tweets = pd.read_csv(tweets_path, encoding='latin1')

tweets.head()
   ItemID  Sentiment                                SentimentText
0       1          0                 is so sad for my APL frie...
1       2          0               I missed the New Moon trail...
2       3          1                      omg its already 7:30 :O
3       4          0      .. Omgaga. Im sooo im gunna CRy. I'...
4       5          0      i think mi bf is cheating on me!!! ...
tweets_X, tweets_y = tweets['SentimentText'], tweets['Sentiment']



# pipeline
from sklearn.feature_extraction.text import CountVectorizer
# import naive Bayes for faster processing
from sklearn.naive_bayes import MultinomialNB

featurizer = CountVectorizer()

text_pipe = Pipeline([('featurizer', featurizer), 
                 ('classify', MultinomialNB())])

text_pipe_params = {'featurizer__ngram_range':[(1, 2)], 
               'featurizer__max_features': [5000, 10000],
               'featurizer__min_df': [0., .1, .2, .3], 
               'featurizer__max_df': [.7, .8, .9, 1.]}


get_best_model_and_accuracy(text_pipe, text_pipe_params, tweets_X, tweets_y)
Best Accuracy: 0.7557531328446129
Best Parameters: {'featurizer__max_df': 0.7, 'featurizer__max_features': 10000, 'featurizer__min_df': 0.0, 'featurizer__ngram_range': (1, 2)}
Average Time to Fit (s): 2.726
Average Time to Score (s): 0.419
# a more basic pipeline that uses SelectKBest
featurizer = CountVectorizer(ngram_range=(1, 2))

select_k_text_pipe = Pipeline([('featurizer', featurizer), 
                      ('select_k', SelectKBest()),
                      ('classify', MultinomialNB())])

select_k_text_pipe_params = {'select_k__k': [1000, 5000]}

get_best_model_and_accuracy(select_k_text_pipe, 
                            select_k_text_pipe_params, 
                            tweets_X, tweets_y)
Best Accuracy: 0.7529728270109712
Best Parameters: {'select_k__k': 5000}
Average Time to Fit (s): 3.36
Average Time to Score (s): 0.73

It looks like SelectKBest is not a good fit for text data: without a FeatureUnion we cannot reach the earlier accuracy. Note that either way, fitting and prediction take a long time: univariate statistical methods perform poorly on very large numbers of features, such as those produced by text vectorization. A hedged sketch of the FeatureUnion idea follows.
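
As a sketch of that idea (the vectorizer choices and max_features values here are illustrative assumptions, not the original configuration), FeatureUnion runs several featurizers side by side and concatenates their outputs before classification:

from sklearn.pipeline import FeatureUnion
from sklearn.feature_extraction.text import TfidfVectorizer

# run two featurizers in parallel and stack their feature matrices column-wise
featurizer = FeatureUnion([
    ('tfidf_vect', TfidfVectorizer(max_features=5000)),
    ('count_vect', CountVectorizer(max_features=5000))
])

union_pipe = Pipeline([('featurizer', featurizer),
                       ('classify', MultinomialNB())])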

Feature selection metrics for tree-based models

When fitting a decision tree, the tree starts at the root node and greedily chooses the optimal split at every node, optimizing a metric of node purity. By default, scikit-learn optimizes the Gini impurity at each step. With every split, the model records how much that split helps the overall optimization goal; the small illustration below shows the computation.
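
As a toy illustration of the purity metric (this helper is a reimplementation for exposition, not scikit-learn's internal code), Gini impurity and the improvement from a candidate split can be computed like this:

import numpy as np

def gini(labels):
    # Gini impurity: 1 minus the sum of squared class proportions
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1 - np.sum(p ** 2)

# a hypothetical node with 80 negatives and 20 positives, split in half
parent = np.array([0] * 80 + [1] * 20)
left, right = parent[:50], parent[50:]

# impurity decrease = parent impurity - weighted impurity of the children
decrease = gini(parent) - (len(left) / len(parent)) * gini(left) \
                        - (len(right) / len(parent)) * gini(right)
print(decrease)  # 0.08: this split makes the children purer than the parent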

# create a new decision tree classifier
tree = DecisionTreeClassifier()

tree.fit(X, y)
DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
                       max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort=False,
                       random_state=None, splitter='best')
# note: there are more features beyond the ones shown
importances = pd.DataFrame({'importance': tree.feature_importances_, 'feature':X.columns}).sort_values('importance', ascending=False)

importances.head()
    importance    feature
5     0.161872      PAY_0
4     0.072024        AGE
11    0.071858  BILL_AMT1
0     0.056563  LIMIT_BAL
19    0.055726   PAY_AMT3

The most important feature in the fit is PAY_0, which matches what our statistical tests told us earlier. The 2nd, 3rd, and 5th most important features, however, showed no notable importance in the statistical tests. This means model-based feature selection has the potential to surface results that the statistical methods miss.

The biggest difference between SelectFromModel and SelectKBest is that SelectFromModel does not take k (the number of features to keep): instead, it uses a threshold, the minimum level of importance a feature must reach to be kept.

# similar to SelectKBest, but uses a machine learning model's internal metric to evaluate
# feature importance rather than the p-value of a statistical test
from sklearn.feature_selection import SelectFromModel

# instantiate a selector that ranks importance by a decision tree classifier's internal metric
select_from_model = SelectFromModel(DecisionTreeClassifier(), 
                                    threshold=.05)

selected_X = select_from_model.fit_transform(X, y)
selected_X.shape         
(30000, 8)
# to speed things up later
tree_pipe_params = {'classifier__max_depth': [1, 3, 5, 7]}

from sklearn.pipeline import Pipeline
import warnings
warnings.filterwarnings('ignore')

# create a SelectFromModel based on a DecisionTreeClassifier
select = SelectFromModel(DecisionTreeClassifier())

select_from_pipe = Pipeline([('select', select),
                             ('classifier', d_tree)])

select_from_pipe_params = deepcopy(tree_pipe_params)

select_from_pipe_params.update({
 'select__threshold': [.01, .05, .1, .2, .25, .3, .4, .5, .6, "mean", "median", "2.*mean"],
 'select__estimator__max_depth': [None, 1, 3, 5, 7]
 })

print(select_from_pipe_params)  # {'classifier__max_depth': [1, 3, 5, 7], 'select__threshold': [0.01, 0.05, 0.1, 0.2, 0.25, 0.3, 0.4, 0.5, 0.6, 'mean', 'median', '2.*mean'], 'select__estimator__max_depth': [None, 1, 3, 5, 7]}


get_best_model_and_accuracy(select_from_pipe, 
                            select_from_pipe_params, 
                            X, y)
{'classifier__max_depth': [1, 3, 5, 7], 'select__threshold': [0.01, 0.05, 0.1, 0.2, 0.25, 0.3, 0.4, 0.5, 0.6, 'mean', 'median', '2.*mean'], 'select__estimator__max_depth': [None, 1, 3, 5, 7]}
# set the pipeline's best parameters
select_from_pipe.set_params(**{'select__threshold':0.01,
                           'select__estimator__max_depth':None,
                           'classifier__max_depth':3})

# fit the data
select_from_pipe.steps[0][1].fit(X,y)

# list the selected columns
X.columns[select_from_pipe.steps[0][1].get_support()]
Index(['LIMIT_BAL', 'SEX', 'EDUCATION', 'MARRIAGE', 'AGE', 'PAY_0', 'PAY_2',
       'PAY_3', 'PAY_6', 'BILL_AMT1', 'BILL_AMT2', 'BILL_AMT3', 'BILL_AMT4',
       'BILL_AMT5', 'BILL_AMT6', 'PAY_AMT1', 'PAY_AMT2', 'PAY_AMT3',
       'PAY_AMT4', 'PAY_AMT5', 'PAY_AMT6'],
      dtype='object')
select_from_pipe.steps[0][1]
SelectFromModel(estimator=DecisionTreeClassifier(class_weight=None,
                                                 criterion='gini',
                                                 max_depth=None,
                                                 max_features=None,
                                                 max_leaf_nodes=None,
                                                 min_impurity_decrease=0.0,
                                                 min_impurity_split=None,
                                                 min_samples_leaf=1,
                                                 min_samples_split=2,
                                                 min_weight_fraction_leaf=0.0,
                                                 presort=False,
                                                 random_state=None,
                                                 splitter='best'),
                max_features=None, norm_order=1, prefit=False, threshold=0.01)

Linear models and regularization

SelectFromModel can work with any machine learning model that exposes a feature_importances_ or coef_ attribute. Tree-based models expose the former; linear models expose the latter.

In linear models, regularization imposes extra constraints on the model to prevent overfitting and improve generalization. It works by adding penalty terms to the loss function being optimized, which means that while fitting, a regularized linear model may severely shrink, or even zero out, some features, as the sketch below shows.
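
To see that shrinking in action, here is a minimal sketch (the C value and the liblinear solver are illustrative assumptions): an L1-penalized logistic regression drives some coefficients exactly to zero, effectively discarding those features.

# fit an L1-regularized logistic regression on the credit card data
l1_lr = LogisticRegression(penalty='l1', C=0.1, solver='liblinear')
l1_lr.fit(X, y)

# count how many coefficients the L1 penalty zeroed out
print((l1_lr.coef_ == 0).sum())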

# select features with a regularized logistic regression
logistic_selector = SelectFromModel(LogisticRegression())

# new pipeline, searching over LogisticRegression's parameters
regularization_pipe = Pipeline([('select', logistic_selector), 
 ('classifier', tree)])

regularization_pipe_params = deepcopy(tree_pipe_params)

# L1 and L2 regularization
regularization_pipe_params.update({
 'select__threshold': [.01, .05, .1, "mean", "median", "2.*mean"],
 'select__estimator__penalty': ['l1', 'l2'],
 })

print(regularization_pipe_params)  # {'select__threshold': [0.01, 0.05, 0.1, 'mean', 'median', '2.*mean'], 'classifier__max_depth': [1, 3, 5, 7], 'select__estimator__penalty': ['l1', 'l2']}


get_best_model_and_accuracy(regularization_pipe, 
                            regularization_pipe_params, 
                            X, y)
{'classifier__max_depth': [1, 3, 5, 7], 'select__threshold': [0.01, 0.05, 0.1, 'mean', 'median', '2.*mean'], 'select__estimator__penalty': ['l1', 'l2']}
Best Accuracy: 0.8211666666666667
Best Parameters: {'classifier__max_depth': 5, 'select__estimator__penalty': 'l1', 'select__threshold': 0.01}
Average Time to Fit (s): 0.389
Average Time to Score (s): 0.002
# set the pipeline's best parameters
regularization_pipe.set_params(**{'select__threshold': 0.01, 
 'classifier__max_depth': 5, 
 'select__estimator__penalty': 'l1'})

# fit the data
regularization_pipe.steps[0][1].fit(X, y)

# list the selected columns
X.columns[regularization_pipe.steps[0][1].get_support()]
Index(['SEX', 'EDUCATION', 'MARRIAGE', 'PAY_0', 'PAY_2', 'PAY_3', 'PAY_4',
       'PAY_5'],
      dtype='object')

So far, the biggest difference between the logistic regression classifier and the support vector classifier (SVC) is that the latter directly maximizes binary classification accuracy, while the former is a better model of class probabilities.

# the SVC is a linear model that separates data with linear supports in Euclidean space
# it only handles binary classification
from sklearn.svm import LinearSVC

# select features with the SVC
svc_selector = SelectFromModel(LinearSVC())

svc_pipe = Pipeline([('select', svc_selector), 
 ('classifier', tree)])

svc_pipe_params = deepcopy(tree_pipe_params)

svc_pipe_params.update({
 'select__threshold': [.01, .05, .1, "mean", "median", "2.*mean"],
 'select__estimator__penalty': ['l1', 'l2'],
 'select__estimator__loss': ['squared_hinge', 'hinge'],
 'select__estimator__dual': [True, False]
 })

print(svc_pipe_params)  # {'select__estimator__loss': ['squared_hinge', 'hinge'], 'select__threshold': [0.01, 0.05, 0.1, 'mean', 'median', '2.*mean'], 'select__estimator__penalty': ['l1', 'l2'], 'classifier__max_depth': [1, 3, 5, 7], 'select__estimator__dual': [True, False]}

get_best_model_and_accuracy(svc_pipe, 
                            svc_pipe_params, 
                            X, y) 
{'classifier__max_depth': [1, 3, 5, 7], 'select__threshold': [0.01, 0.05, 0.1, 'mean', 'median', '2.*mean'], 'select__estimator__penalty': ['l1', 'l2'], 'select__estimator__loss': ['squared_hinge', 'hinge'], 'select__estimator__dual': [True, False]}
Best Accuracy: 0.8218333333333333
Best Parameters: {'classifier__max_depth': 5, 'select__estimator__dual': True, 'select__estimator__loss': 'squared_hinge', 'select__estimator__penalty': 'l2', 'select__threshold': 'median'}
Average Time to Fit (s): 0.662
Average Time to Score (s): 0.001

The SVC reached the highest accuracy so far. Fitting time took a hit, but if we can pair the fastest predictions with the best accuracy, the machine learning pipeline will be in great shape: using an SVC with regularization to find the best features for the decision tree classifier. Let's see which features the selector chose to reach this best-so-far accuracy:

# set the pipeline's best parameters
svc_pipe.set_params(**{'classifier__max_depth': 5, 
                                  'select__estimator__dual': False, 
                                  'select__estimator__loss': 'squared_hinge', 
                                  'select__estimator__penalty': 'l1', 
                                  'select__threshold': 0.01})

# fit the data
svc_pipe.steps[0][1].fit(X, y)

# list the selected columns
X.columns[svc_pipe.steps[0][1].get_support()]
Index(['SEX', 'EDUCATION', 'MARRIAGE', 'PAY_0', 'PAY_2', 'PAY_3', 'PAY_5'], dtype='object')

Compared with logistic regression, the only difference is the PAY_4 feature. As we can see, removing a single feature does not change the pipeline's performance.
