SHAP-Based Feature Selection: An Explanation

Original article · last modified 2024-01-02 13:40:58

License: CC 4.0 BY-SA

Tags: #machine learning #artificial intelligence #deep learning

First published 2024-01-02 13:01:44

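This article describes how SHAP values are used for model explanation and feature selection: by computing each feature's contribution to the model's predictions, low-impact features can be eliminated. The procedure starts from an empty set of eliminated features and repeatedly scores and eliminates features until the preset number of features remains. CatBoost is used as an example to show how SHAP-based feature selection can be run to improve model performance.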

SHAP (SHapley Additive exPlanations) values are a method for explaining the predictions of machine learning models.

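In feature elimination, SHAP values measure each feature's contribution to the model output, which helps identify features that have little influence on the predictions and can therefore be removed.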

Two definitions come first, followed by the concrete steps.

Definition 1: $L_{i} = l(y_{i},a_{i})$ is the loss of the model's prediction for sample $x_{i}$.

Definition 2: $L_{i,{-j}} = l(y_{i},a_{i}-v_{i,j})$ is the loss of the model's prediction for sample $x_{i}$ after feature $j$ is removed.

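Here $a_{i}$ is the model's prediction for sample $x_{i}$, and $v_{i,j}$ is the contribution of feature $j$ to the prediction for sample $i$. The two are related as follows: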

$a_{i} = \sum_{j=1}^{M} v_{i,j}$

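In other words, the prediction $a_{i}$ for sample $x_{i}$ is the sum of the contributions of the individual features. (The standard SHAP decomposition also includes a constant base value, the expected model output. It is left out of the formula here for simplicity, and it is never removed during elimination.)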
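To make the two definitions concrete, here is a minimal sketch for a single sample. The contribution values and the target are made up for illustration, and absolute error is assumed as the per-sample loss $l$; any other loss could be substituted.

```python
# Hypothetical contributions v_{i,j} of M = 3 features for one sample x_i.
v_i = [1.5, -0.25, 0.75]
y_i = 2.5  # true target of the sample (made up for illustration)


def l(y, a):
    """Per-sample loss; absolute error is an assumption, not prescribed by the text."""
    return abs(y - a)


# Additivity: the prediction is the sum of the per-feature contributions.
a_i = sum(v_i)

# Definition 1: loss of the full prediction.
L_i = l(y_i, a_i)

# Definition 2: loss after removing feature j, i.e. subtracting its contribution.
j = 1
L_i_minus_j = l(y_i, a_i - v_i[j])

print(a_i, L_i, L_i_minus_j)
```

In this toy case the loss drops when feature 1 is removed, which is the kind of feature the elimination procedure tends to discard first.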

SHAP-based feature elimination (feature selection) then consists of the following steps:

1. Start with an empty set of eliminated features, E = {}.

2. Compute the current loss, that is, the loss after the features in E have been removed from the original feature set:

$L_{-E} = \sum_{i=1}^{N} L_{i,-E} = \sum_{i=1}^{N} l (y_{i},a_{i}-\sum _{k\in E} v_{i,k})$

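3. For each feature that has not yet been eliminated, use SHAP values to compute a score: the change in the loss caused by additionally eliminating that single feature: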

$score_{j} = L_{-E,-j} - L_{-E} = \sum_{i=1}^{N} L_{i,(-E,-j)} - L_{-E} = \sum_{i=1}^{N} l(y_{i},a_{i}-\sum _{k\in E} v_{i,k}-v_{i,j}) - L_{-E}$

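4. Eliminate the feature with the lowest score (the one whose removal increases the loss the least, or decreases it the most) and add it to the set E.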

5. If more features still need to be eliminated, return to step 2.
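The five steps above can be sketched in plain Python as follows. This is an illustrative sketch only, not CatBoost's implementation: it assumes the contribution matrix `v` is already available and stays fixed, whereas a real run may retrain the model between elimination rounds. The function name, the MAE loss and the toy data are all hypothetical.

```python
def mae_loss(y, a):
    """Per-sample absolute error, used here as the loss l (an assumption)."""
    return abs(y - a)


def shap_recursive_elimination(v, y, num_to_eliminate, loss=mae_loss):
    """Greedy elimination following steps 1-5.

    v[i][j] is the contribution of feature j to sample i, y[i] is the target.
    Returns the eliminated feature indices in elimination order.
    """
    n_features = len(v[0])
    if not 0 <= num_to_eliminate <= n_features:
        raise ValueError("num_to_eliminate must be between 0 and the feature count")

    a = [sum(row) for row in v]   # a_i = sum_j v_{i,j}
    eliminated = []               # step 1: E = {}

    while len(eliminated) < num_to_eliminate:   # step 5: repeat while needed
        # Step 2: current loss L_{-E}; `a` already excludes contributions of E.
        current = sum(loss(yi, ai) for yi, ai in zip(y, a))

        # Step 3: score_j = L_{-E,-j} - L_{-E} for every remaining feature j.
        scores = {}
        for j in range(n_features):
            if j in eliminated:
                continue
            without_j = sum(loss(yi, ai - row[j]) for yi, ai, row in zip(y, a, v))
            scores[j] = without_j - current

        # Step 4: eliminate the lowest-scoring feature and add it to E.
        worst = min(scores, key=scores.get)
        eliminated.append(worst)
        a = [ai - row[worst] for ai, row in zip(a, v)]

    return eliminated


# Toy data: feature 0 carries most of the signal, feature 1 only adds noise.
v = [[2.0, 0.5, -0.25],
     [1.0, -0.5, 0.25],
     [3.0, 0.25, 0.5]]
y = [1.75, 1.25, 3.5]

eliminated = shap_recursive_elimination(v, y, num_to_eliminate=2)
print(eliminated)
```

With this toy data the noisy feature 1 is removed first, because subtracting its contributions lowers the total loss, and feature 2 follows because removing it costs far less than removing feature 0.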

Usage example

'''
Feature selection can be based on one of three algorithms:
PredictionValuesChange, LossFunctionChange or ShapValues.
General usage:
model = CatBoost(params)
summary = model.select_features(
    train_pool,              # pool used for training
    eval_set,                # pool used for early stopping and features scores
    features_for_select,     # which features are allowed to eliminate?
    num_features_to_select,  # how many features do you want to select?
    algorithm,               # based on PredictionValuesChange, LossFunctionChange or ShapValues
    steps,                   # how many times it is allowed to train model?
                             # more steps - more accurate selection, especially for PredictionValuesChange
    shap_calc_type,          # one of Approximate, Regular or Exact
                             # used for LossFunctionChange and ShapValues algorithms
    train_final_model,       # is it required to fit model with selected features?
    plot                     # build a beautiful plot with metric values?
)

'''
from catboost import CatBoostRegressor, EShapCalcType, EFeaturesSelectionAlgorithm
# CatBoost parameters
ctb_params = dict(iterations=600,
                      learning_rate=0.1,  # note: when task_type is 'GPU', the learning rate needs to be raised (0.1 -> 1.0)
                      depth=8,
                      l2_leaf_reg=30,
                      bootstrap_type='Bernoulli',
                      subsample=0.66,
                      loss_function='MAE',
                      eval_metric = 'MAE',
                      metric_period=100,
                      od_type='Iter',
                      od_wait=30,
                      task_type='CPU',
                      allow_writing_files=False,
                      )

print("Feature Elimination Performing.")
ctb_model = CatBoostRegressor(**ctb_params)
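# Feature-selection arguments: training set, validation set, the full list of original
# feature names, how many features to keep, number of steps (more steps give a better
# selection), the selection algorithm, whether to train a new model on the selected
# features, and whether to draw a plot.
# df_offline_train, df_offline_valid, their *_target series and feature_name (a list of
# column names) are assumed to have been prepared earlier.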
summary = ctb_model.select_features(
        df_offline_train[feature_name], df_offline_train_target,
        eval_set=[(df_offline_valid[feature_name], df_offline_valid_target)],
        features_for_select=feature_name,
        num_features_to_select=len(feature_name)-24,    # Dropping from 124 to 100
        steps=3,
        algorithm=EFeaturesSelectionAlgorithm.RecursiveByShapValues,
        shap_calc_type=EShapCalcType.Regular,
        train_final_model=False,
        plot=True,
    )
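Once `select_features` returns, the result can be read from `summary`. The sketch below assumes the returned dict exposes the keys `selected_features_names` and `eliminated_features_names`; check this against your CatBoost version. The helper name and the stand-in dict are hypothetical and only there so the snippet runs on its own.

```python
def report_selection(summary):
    """Return (kept, dropped) feature-name lists from a select_features summary.

    The key names are an assumption about the returned dict; .get() keeps the
    helper from failing if a key is absent.
    """
    kept = list(summary.get("selected_features_names", []))
    dropped = list(summary.get("eliminated_features_names", []))
    print(f"kept {len(kept)} features, dropped {len(dropped)}")
    return kept, dropped


# Hypothetical stand-in for the dict returned by ctb_model.select_features(...).
example_summary = {
    "selected_features_names": ["f0", "f2"],
    "eliminated_features_names": ["f1"],
}

kept, dropped = report_selection(example_summary)
```

The `kept` list can then be used to subset the training and validation frames before fitting a final model, which is needed here because the call above sets `train_final_model=False`.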