Explaining Model Predictions with Amazon SageMaker Clarify and SHAP

戴玫芹

Published 2025-06-11 09:17:07


Article link: https://blog.csdn.net/gitblog_00825/article/details/148578398


data-science-on-aws: AI and Machine Learning with Kubeflow, Amazon EKS, and SageMaker. Project repository: https://gitcode.com/gh_mirrors/da/data-science-on-aws

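Introduction

In today's data-driven business environment, the interpretability of machine learning models matters more and more. As business needs expand and regulatory requirements grow, we need to understand why a model makes a particular decision. Amazon SageMaker Clarify addresses this with SHAP (SHapley Additive exPlanations), which quantifies how much each input feature contributes to a prediction.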

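Preparation

Before starting the explainability analysis, complete the following setup:

Environment setup: import the required Python libraries, including boto3, sagemaker, pandas, and numpy
Session initialization: create a SageMaker session and get the default bucket and execution role
Data preparation: prepare the test data for the explainability analysis, typically in JSON Lines format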

import boto3
import sagemaker
import pandas as pd
import numpy as np

sess = sagemaker.Session()
bucket = sess.default_bucket()
role = sagemaker.get_execution_role()
region = boto3.Session().region_name

sm = boto3.Session().client(service_name="sagemaker", region_name=region)

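Preparing the Test Data

For the explainability analysis, we need test data that matches the model's input format. This data typically contains the model's feature inputs, laid out the way the model expects them.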

test_data_explainability_path = "./data-clarify/test_data_explainability.jsonl"
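The article does not show what the test file looks like. As a minimal sketch, assuming each line is a JSON object with a "features" list ordered as review_body, product_category (the layout implied by the DataConfig and ModelConfig later in this article), a JSON Lines file could be written as follows. The sample records and the helper names write_jsonl and read_jsonl are hypothetical.

```python
import json
from pathlib import Path

# Hypothetical sample records; the feature order follows the headers
# used later in DataConfig: ["review_body", "product_category"].
sample_records = [
    {"features": ["Great product, works as described", "Digital_Software"]},
    {"features": ["Stopped working after a week", "Electronics"]},
]


def write_jsonl(records, path):
    """Write one JSON object per line (the JSON Lines layout)."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")
    return path


def read_jsonl(path):
    """Read a JSON Lines file back into a list of dicts."""
    with Path(path).open(encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]
```

Calling write_jsonl(sample_records, test_data_explainability_path) would then produce a file in this layout at the path defined above.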

Uploading the Data

Upload the prepared test data to an Amazon S3 bucket so that SageMaker Clarify can access it:

test_data_explainablity_s3_uri = sess.upload_data(
    bucket=bucket, key_prefix="bias/test_data_explainability", path=test_data_explainability_path
)

Model Setup

Before running the explainability analysis, create a model from the training job:

inference_image_uri = sagemaker.image_uris.retrieve(
    framework="tensorflow",
    region=region,
    version="2.3.1",
    py_version="py37",
    instance_type="ml.m5.4xlarge",
    image_scope="inference",
)

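# training_job_name must already hold the name of a completed training job (it is not defined in this article)
model_name = sess.create_model_from_job(training_job_name=training_job_name, image_uri=inference_image_uri)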

The SageMaker Clarify Processor

Create a SageMakerClarifyProcessor instance to run the explainability analysis:

from sagemaker import clarify

clarify_processor = clarify.SageMakerClarifyProcessor(
    role=role, instance_count=1, instance_type="ml.c5.2xlarge", sagemaker_session=sess
)

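Configuring the Explainability Analysis

Data Configuration (DataConfig)

The DataConfig object specifies where the input data lives, where the output is written, and how the data is laid out: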

explainability_report_prefix = "bias/explainability-report-{}".format(training_job_name)
explainability_output_path = "s3://{}/{}".format(bucket, explainability_report_prefix)

explainability_data_config = clarify.DataConfig(
    s3_data_input_path=test_data_explainablity_s3_uri,
    s3_output_path=explainability_output_path,
    headers=["review_body", "product_category"],
    features="features",
    dataset_type="application/jsonlines",
)

Model Configuration (ModelConfig)

The ModelConfig object describes the trained model and the instances used to serve it during the analysis:

model_config = clarify.ModelConfig(
    model_name=model_name,
    instance_type="ml.m5.4xlarge",
    instance_count=1,
    content_type="application/jsonlines",
    accept_type="application/jsonlines",
    content_template='{"features":$features}',
)
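To make the role of content_template concrete, the sketch below mimics the placeholder substitution with Python's string.Template. This is only an illustration of the resulting request body, not Clarify's actual implementation; render_request is a hypothetical helper.

```python
import json
from string import Template

# Same template string as in the ModelConfig above.
content_template = '{"features":$features}'


def render_request(template, feature_values):
    """Substitute the JSON-serialized feature list for the $features placeholder."""
    return Template(template).substitute(features=json.dumps(feature_values))


# One record's features, in the order review_body, product_category.
payload = render_request(content_template, ["ok", "Digital_Software"])
print(payload)
```

The printed line is a single JSON object whose "features" key holds the record's feature list, which matches the application/jsonlines content type declared in the ModelConfig.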

SHAP Configuration (SHAPConfig)

SHAPConfig defines the parameters of the SHAP analysis:

shap_config = clarify.SHAPConfig(
    baseline=[{"features": ["ok", "Digital_Software"]}],
    num_samples=5,
    agg_method="mean_abs",
)
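The agg_method parameter controls how per-record (local) SHAP values are combined into one global importance score per feature. The sketch below illustrates the three documented options ("mean_abs", "median", "mean_sq") on made-up numbers. It is a plain-Python illustration of the arithmetic, not Clarify's code, and aggregate_shap is a hypothetical helper.

```python
from statistics import mean, median


def aggregate_shap(local_values, agg_method="mean_abs"):
    """Combine one feature's local SHAP values into a single global score."""
    if agg_method == "mean_abs":
        return mean(abs(v) for v in local_values)
    if agg_method == "median":
        return median(local_values)
    if agg_method == "mean_sq":
        return mean(v * v for v in local_values)
    raise ValueError(f"unsupported agg_method: {agg_method}")


# Hypothetical local SHAP values of one feature across four records.
local_shap = [0.5, -0.5, 0.25, -0.25]

for method in ("mean_abs", "median", "mean_sq"):
    print(method, aggregate_shap(local_shap, method))
```

With values that cancel out like these, "median" reports zero while "mean_abs" does not, which is one reason "mean_abs" is a reasonable choice when the sign of the contribution matters less than its size.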

Running the Explainability Job

With the configuration in place, start the explainability job:

clarify_processor.run_explainability(
    model_config=model_config,
    model_scores="predicted_label",
    data_config=explainability_data_config,
    explainability_config=shap_config,
    wait=False,
    logs=False,
)
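Because the job is started with wait=False, the call returns before the analysis finishes. One way to block until it is done is to poll the processing job status, sketched below. The helper wait_for_processing_job and its defaults are hypothetical; it relies only on the describe_processing_job call of the boto3 SageMaker client.

```python
import time

# Statuses after which a processing job will not change any further.
TERMINAL_STATES = {"Completed", "Failed", "Stopped"}


def wait_for_processing_job(sm_client, job_name, poll_seconds=30, max_polls=120):
    """Poll a SageMaker processing job until it reaches a terminal status."""
    for _ in range(max_polls):
        response = sm_client.describe_processing_job(ProcessingJobName=job_name)
        status = response["ProcessingJobStatus"]
        if status in TERMINAL_STATES:
            return status
        time.sleep(poll_seconds)
    raise TimeoutError(f"{job_name} did not finish after {max_polls} polls")
```

With the objects created earlier, this could be called as wait_for_processing_job(sm, clarify_processor.latest_job.job_name), assuming the processor has already started a job.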

Viewing the Results

Once the job has finished, download the report from S3 and inspect it:

!aws s3 cp --recursive $explainability_output_path ./explainability_report/

The report includes:

Feature importance analysis
Each feature's contribution to the predictions
Visualizations of the SHAP values
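Besides the rendered report, the output folder is expected to contain a machine-readable analysis.json. The sketch below ranks features by global SHAP value. The nested layout (explanations, kernel_shap, a label key, global_shap_values) is an assumption about that file and should be checked against your own output; the numbers are made up and rank_global_shap is a hypothetical helper.

```python
import json


def rank_global_shap(analysis, label="label0"):
    """Return (feature, value) pairs sorted by absolute global SHAP value."""
    shap_values = analysis["explanations"]["kernel_shap"][label]["global_shap_values"]
    return sorted(shap_values.items(), key=lambda item: abs(item[1]), reverse=True)


# Hypothetical content in the assumed layout, not real results.
example_analysis = json.loads("""
{
  "explanations": {
    "kernel_shap": {
      "label0": {
        "global_shap_values": {"product_category": 0.1, "review_body": 0.3}
      }
    }
  }
}
""")

for feature, value in rank_global_shap(example_analysis):
    print(feature, value)
```

For a real run, the dictionary would come from json.load on ./explainability_report/analysis.json instead of the inline example.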

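Technical Background

A Brief Introduction to SHAP

SHAP (SHapley Additive exPlanations) is a method grounded in cooperative game theory for explaining machine learning predictions. Its core idea is to distribute a model's prediction among the input features, computing a contribution value for each feature.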
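For reference, the contribution of a feature is its Shapley value from cooperative game theory. The notation below is not from the article: F is the set of all features, S ranges over the subsets that exclude feature i, and f_x(S) denotes the model output when only the features in S take their values from the record x (the rest are filled in from the baseline).

```latex
\phi_i(f, x) = \sum_{S \subseteq F \setminus \{i\}}
  \frac{|S|!\,\bigl(|F| - |S| - 1\bigr)!}{|F|!}
  \Bigl[ f_x\bigl(S \cup \{i\}\bigr) - f_x(S) \Bigr]
```

Each term is the marginal effect of adding feature i to a subset S, weighted by how often that subset appears when features are added in a random order.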

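SHAP values have the following properties:

Consistency: if a model changes so that a feature's contribution grows, that feature's SHAP value does not decrease
Local accuracy: the feature contributions, added to the baseline (expected) output, sum to the model's prediction
Missingness: a missing feature receives a contribution of zero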
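The local accuracy property can be checked on a toy example. The sketch below computes exact Shapley values by enumerating every feature subset, which is only feasible for a handful of features; Clarify itself uses a sampling-based approximation instead. The toy model, input, and baseline are made up for illustration.

```python
from itertools import combinations
from math import factorial


def shapley_values(model, x, baseline):
    """Exact Shapley values: features outside a subset take the baseline value."""
    n = len(x)

    def value(subset):
        mixed = [x[i] if i in subset else baseline[i] for i in range(n)]
        return model(mixed)

    phi = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        total = 0.0
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                with_i = value(set(subset) | {i})
                without_i = value(set(subset))
                total += weight * (with_i - without_i)
        phi.append(total)
    return phi


def toy_model(features):
    """Hypothetical model: one additive term plus one interaction term."""
    a, b, c = features
    return 2.0 * a + b * c


x = [1.0, 2.0, 3.0]
baseline = [0.0, 0.0, 0.0]
phi = shapley_values(toy_model, x, baseline)

# Local accuracy: contributions plus the baseline output recover the prediction.
print(phi, sum(phi) + toy_model(baseline), toy_model(x))
```

Here the additive feature receives its full effect, while the two interacting features split their joint effect evenly, so the values come out at approximately 2, 3, and 3.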

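How SageMaker Clarify Works

Behind the scenes, SageMaker Clarify performs the following steps:

Launches temporary processing instances
Deploys a temporary endpoint for the model
Computes feature importance with SHAP
Generates a visual report
Cleans up the temporary resources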

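Best Practices

Choosing the number of samples: set num_samples according to the size of your data and the available compute
Choosing a baseline: pick a representative baseline record; for numeric features, the dataset mean is a common choice
Interpreting the report: focus on the feature contributions that matter for business decisions
Managing resources: clean up temporary resources promptly after the analysis to avoid unnecessary charges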
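As a small illustration of the baseline advice, the sketch below builds a baseline record from column-wise means. It applies to numeric features only; text features such as review_body have no meaningful mean, which is why this article's SHAPConfig uses a neutral literal instead. The rows and the helper mean_baseline are hypothetical.

```python
from statistics import mean


def mean_baseline(rows):
    """Column-wise mean of numeric rows, returned as one baseline record."""
    return [mean(column) for column in zip(*rows)]


# Hypothetical numeric dataset: each row is one record, each column one feature.
rows = [
    [1.0, 10.0],
    [3.0, 30.0],
    [5.0, 20.0],
]

baseline_record = mean_baseline(rows)
print(baseline_record)
```

The resulting record could then be wrapped in a list and passed as the baseline argument of SHAPConfig for a model with numeric inputs.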

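Conclusion

With Amazon SageMaker Clarify and SHAP, we can look inside a model's decision process and improve its interpretability and transparency. This matters for meeting compliance requirements, building user trust, and improving model performance.

In practice, consider making explainability analysis a standard step in the model development workflow, especially where key business decisions or regulated scenarios are involved.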


Disclosure: parts of this article were generated with AI assistance (AIGC) and are for reference only.