隐私计算实践4|逻辑回归LR与广义线性模型GLM开发实践

leeyayai_xixihah

已于 2024-07-31 15:36:20 修改

阅读量727

点赞数 10

分类专栏：隐私计算文章标签：逻辑回归算法机器学习

于 2024-07-31 10:25:36 首次发布

本文链接：https://blog.csdn.net/leeyayai_xixihah/article/details/140798622

版权

隐私计算专栏收录该内容

5 篇文章 0 订阅

订阅专栏

一：背景知识--LR/GLM应用场景及原理

1.基于sklearn调用TweedieRegressor对fetch_openml数据集建模

2. 调用 SS GLM接口建模--基于多方安全计算的建模

2.1 给节点创建SPU实例，用于后续的安全计算

2.2 处理数据-列处理，bob与alice分配不同的特征

2.3 定义多方安全计算中预测与评估的方法

2.4 使用SSGLM进行拟合

2.5 Tweedie 幂参数的影响

2.6 比较Tweedie 回归和安全单隐层广义线性模型 SSGLM

2.7 结果可视化

一：背景知识--LR/GLM应用场景及原理

误差组件为伯努利分布---逻辑回归

连接函数：逻辑回归sigmod函数，将函数值映射到0-1之间

隐语中采用的：一阶优化器加二阶优化器

任何想要计算的函数：都可以化为加法和乘法的运算来近似

用优化器找目标函数的局部最优值

多方安全计算的下的函数计算----> 多方安全计算的加法和乘法

--->同态加密算法与秘密分享算法

举例：求三个秘密方A（15），B（25），C（50），三个数的值。

先将这三个值拆分----> A，B，C，将三个分片进行交换---

A与C交换了3 和 4， A与B交换了5和9 --- >

A ,B，C分别在本地进行交换后的分片计算 -- > 最后将三个值汇总

分片--> 交换分片--> 相乘求和

二：隐语模型--密态SSLR/SSGLM

P值在0-2之间--由数据分布决定--但是没有这个先验知识-->多次实验p值，看最终的结果

优化器：SGD梯度下降法（一阶优化器），IRLS二阶优化器

选择SGD优化器，可进一步选择是否由IRLS进行初始化

offset_col:偏置

dist_scale:数据方差，属于先验知识（不好获取），用默认值1

weight_col:数据权重，不同样本的权重不同--不平衡数据集

三：应用实现-从理论到隐语应用

四：实践

1.基于sklearn调用TweedieRegressor对fetch_openml数据集建模

便于数据集下载

import ssl
ssl._create_default_https_context=ssl._create_unverified_context
#下载数据集用

数据集下载，使用，结果评估函数

from functools import partial

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.datasets import fetch_openml
from sklearn.metrics import (
    mean_absolute_error,
    mean_squared_error,
    mean_tweedie_deviance,
)


def load_mtpl2(n_samples=None):
    """Fetch the French Motor Third-Party Liability Claims dataset.

    Parameters
    ----------
    n_samples: int, default=None
      number of samples to select (for faster run time). Full dataset has
      678013 samples.
    """
    # freMTPL2freq dataset from https://www.openml.org/d/41214
    df_freq = fetch_openml(data_id=41214, as_frame=True).data
    df_freq["IDpol"] = df_freq["IDpol"].astype(int)
    df_freq.set_index("IDpol", inplace=True)

    # freMTPL2sev dataset from https://www.openml.org/d/41215
    df_sev = fetch_openml(data_id=41215, as_frame=True).data

    # sum ClaimAmount over identical IDs
    df_sev = df_sev.groupby("IDpol").sum()

    df = df_freq.join(df_sev, how="left")
    df["ClaimAmount"].fillna(0, inplace=True)

    # unquote string fields
    for column_name in df.columns[df.dtypes.values == object]:
        df[column_name] = df[column_name].str.strip("'")
    return df.iloc[:n_samples]


def plot_obs_pred(
    df,
    feature,
    weight,
    observed,
    predicted,
    y_label=None,
    title=None,
    ax=None,
    fill_legend=False,
):
    """Plot observed and predicted - aggregated per feature level.

    Parameters
    ----------
    df : DataFrame
        input data
    feature: str
        a column name of df for the feature to be plotted
    weight : str
        column name of df with the values of weights or exposure
    observed : str
        a column name of df with the observed target
    predicted : DataFrame
        a dataframe, with the same index as df, with the predicted target
    fill_legend : bool, default=False
        whether to show fill_between legend
    """
    # aggregate observed and predicted variables by feature level
    df_ = df.loc[:, [feature, weight]].copy()
    df_["observed"] = df[observed] * df[weight]
    df_["predicted"] = predicted * df[weight]
    df_ = (
        df_.groupby([feature])[[weight, "observed", "predicted"]]
        .sum()
        .assign(observed=lambda x: x["observed"] / x[weight])
        .assign(predicted=lambda x: x["predicted"] / x[weight])
    )

    ax = df_.loc[:, ["observed", "predicted"]].plot(style=".", ax=ax)
    y_max = df_.loc[:, ["observed", "predicted"]].values.max() * 0.8
    p2 = ax.fill_between(
        df_.index,
        0,
        y_max * df_[weight] / df_[weight].values.max(),
        color="g",
        alpha=0.1,
    )
    if fill_legend:
        ax.legend([p2], ["{} distribution".format(feature)])
    ax.set(
        ylabel=y_label if y_label is not None else None,
        title=title if title is not None else "Train: Observed vs Predicted",
    )


def score_estimator(
    estimator,
    X_train,
    X_test,
    df_train,
    df_test,
    target,
    weights,
    tweedie_powers=None,
):
    """Evaluate an estimator on train and test sets with different metrics"""

    metrics = [
        ("D² explained", None),  # Use default scorer if it exists
        ("mean abs. error", mean_absolute_error),
        ("mean squared error", mean_squared_error),
    ]
    if tweedie_powers:
        metrics += [
            (
                "mean Tweedie dev p={:.4f}".format(power),
                partial(mean_tweedie_deviance, power=power),
            )
            for power in tweedie_powers
        ]

    res = []
    for subset_label, X, df in [
        ("train", X_train, df_train),
        ("test", X_test, df_test),
    ]:
        y, _weights = df[target], df[weights]
        for score_label, metric in metrics:
            if isinstance(estimator, tuple) and len(estimator) == 2:
                # Score the model consisting of the product of frequency and
                # severity models.
                est_freq, est_sev = estimator
                y_pred = est_freq.predict(X) * est_sev.predict(X)
            else:
                y_pred = estimator.predict(X)

            if metric is None:
                if not hasattr(estimator, "score"):
                    continue
                score = estimator.score(X, y, sample_weight=_weights)
            else:
                score = metric(y, y_pred, sample_weight=_weights)

            res.append({"subset": subset_label, "metric": score_label, "score": score})

    res = (
        pd.DataFrame(res)
        .set_index(["metric", "subset"])
        .score.unstack(-1)
        .round(4)
        .loc[:, ["train", "test"]]
    )
    return res

数据处理与特征工程

这段代码展示了使用Pandas和Scikit-learn库对保险数据集进行预处理和特征转换的完整流程。以下是代码的详细解释：

导入库: 导入了ColumnTransformer、make_pipeline、FunctionTransformer、KBinsDiscretizer、OneHotEncoder和StandardScaler等，这些都是Scikit-learn中用于特征工程的组件。
加载数据: 调用load_mtpl2函数（未在代码中定义）加载了一个包含1000个样本的保险数据集。
数据清洗: 对数据进行了清洗，将索赔金额为0但索赔次数大于等于1的记录的索赔次数设置为0，以避免对模型的负面影响。
修正不合理观测值: 对索赔次数、风险敞口和索赔金额进行了限制，以消除可能的数据错误或异常值。
创建转换器: 定义了一个名为log_scale_transformer的管道，用于先对数据进行自然对数转换，然后标准化。
构建列转换器: 使用ColumnTransformer定义了多个转换步骤：
- 使用KBinsDiscretizer对VehAge和DrivAge进行分箱离散化。
- 使用OneHotEncoder对分类变量进行独热编码。
- 直接传递BonusMalus列。
- 使用之前定义的log_scale_transformer对Density列进行转换。
拟合并转换数据: 对数据集df应用column_trans转换器，得到转换后的特征矩阵X。
计算保险精算指标: 计算了“Pure Premium”（纯保费），这是每个保险持有人每单位风险敞口的预期总索赔金额。此外，还计算了“Frequency”（频率）和“AvgClaimAmount”（平均每起索赔金额）。
打印数据: 使用Pandas的option_context设置，限制打印时显示的最大列数为15，并打印了索赔金额大于0的前几行数据。

请注意，这段代码中的load_mtpl2函数未给出定义，您可能需要根据实际情况替换为实际的数据加载函数。此外，代码中的np.fmax用于避免除以0的情况，确保了即使在索赔次数为0的情况下也能正常计算平均每起索赔金额。

最后，print语句中的条件df.ClaimAmount > 0确保了只打印那些有有效索赔金额的记录。这对于建模来说是有意义的，因为模型需要正的索赔金额来进行训练。

from sklearn.compose import ColumnTransformer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import (
    FunctionTransformer,
    KBinsDiscretizer,
    OneHotEncoder,
    StandardScaler,
)

df = load_mtpl2(n_samples=1000)

# Note: filter out claims with zero amount, as the severity model
# requires strictly positive target values.
df.loc[(df["ClaimAmount"] == 0) & (df["ClaimNb"] >= 1), "ClaimNb"] = 0

# Correct for unreasonable observations (that might be data error)
# and a few exceptionally large claim amounts
df["ClaimNb"] = df["ClaimNb"].clip(upper=4)
df["Exposure"] = df["Exposure"].clip(upper=1)
df["ClaimAmount"] = df["ClaimAmount"].clip(upper=200000)

log_scale_transformer = make_pipeline(
    FunctionTransformer(func=np.log), StandardScaler()
)

column_trans = ColumnTransformer(
    [
        (
            "binned_numeric",
            KBinsDiscretizer(n_bins=10, encode='ordinal', strategy='quantile'),
            ["VehAge", "DrivAge"],
        ),
        (
            "onehot_categorical",
            OneHotEncoder(),
            ["VehBrand", "VehPower", "VehGas", "Region", "Area"],
        ),
        ("passthrough_numeric", "passthrough", ["BonusMalus"]),
        ("log_scaled_numeric", log_scale_transformer, ["Density"]),
    ],
    remainder="drop",
)
X = column_trans.fit_transform(df)

# Insurances companies are interested in modeling the Pure Premium, that is
# the expected total claim amount per unit of exposure for each policyholder
# in their portfolio:
df["PurePremium"] = df["ClaimAmount"] / df["Exposure"]

# This can be indirectly approximated by a 2-step modeling: the product of the
# Frequency times the average claim amount per claim:
df["Frequency"] = df["ClaimNb"] / df["Exposure"]
df["AvgClaimAmount"] = df["ClaimAmount"] / np.fmax(df["ClaimNb"], 1)

with pd.option_context("display.max_columns", 15):
    print(df[df.ClaimAmount > 0].head())

划分训练集与测试集

from sklearn.model_selection import train_test_split

df_train, df_test, X_train, X_test = train_test_split(df, X, random_state=0)

模型训练

from sklearn.linear_model import TweedieRegressor

glm_pure_premium = TweedieRegressor(power=1.9, alpha=0.1)
glm_pure_premium.fit(
    X_train, df_train["PurePremium"], sample_weight=df_train["Exposure"]
)

tweedie_powers = [1.5, 1.7, 1.8, 1.9, 1.99, 1.999, 1.9999]

scores_glm_pure_premium = score_estimator(
    glm_pure_premium,
    X_train,
    X_test,
    df_train,
    df_test,
    target="PurePremium",
    weights="Exposure",
    tweedie_powers=tweedie_powers,
)

scores = pd.concat(
    [scores_glm_pure_premium],
    axis=1,
    sort=True,
    keys=("TweedieRegressor"),
)
print("Evaluation of the Product Model and the Tweedie Regressor on target PurePremium")
with pd.option_context("display.expand_frame_repr", False):
    print(scores)

2. 调用 SS GLM接口建模--基于多方安全计算的建模

2.1 给节点创建SPU实例，用于后续的安全计算
import secretflow as sf

# Check the version of your SecretFlow
print('The version of SecretFlow: {}'.format(sf.__version__))

# In case you have a running secretflow runtime already.
sf.shutdown()

sf.init(['alice', 'bob'], address='local')

alice, bob = sf.PYU('alice'), sf.PYU('bob')
spu = sf.SPU(
    sf.utils.testing.cluster_def(
        ['alice', 'bob'],
        {"protocol": "REF2K", "field": "FM128", "fxp_fraction_bits": 40},
    ),
)
代码的详细步骤和解释：
检查 SecretFlow 版本:
print('The version of SecretFlow: {}'.format(sf.__version__))
这行代码打印出当前安装的 SecretFlow 库的版本。
关闭 SecretFlow 运行时:
sf.shutdown()
如果之前已经启动了 SecretFlow 运行时，这行代码会将其关闭。
初始化 SecretFlow 运行环境:
sf.init(['alice', 'bob'], address='local')
这行代码初始化了 SecretFlow 的运行环境，指定了两个参与方 alice 和 bob，并且设置通信地址为本地。
创建参与方的 Python Processing Unit (PYU):
alice, bob = sf.PYU('alice'), sf.PYU('bob')
为 alice 和 bob 创建了 PYU 实例，这些实例将用于执行 Python 函数。
创建 Secure Processing Unit (SPU):
spu = sf.SPU(
    sf.utils.testing.cluster_def(
        ['alice', 'bob'],
        {"protocol": "REF2K", "field": "FM128", "fxp_fraction_bits": 40},
    ),
)
这行代码创建了一个 SPU 实例，它将用于执行安全计算。cluster_def 函数定义了一个包含两个参与方的集群，并且设置了一些安全计算的参数，如协议类型 REF2K，有限域 FM128，以及定点数的小数位数 fxp_fraction_bits。

2.2 处理数据-列处理，bob与alice分配不同的特征

这段代码的关键在于将本地数据集转换为联邦化的数据结构
from secretflow.data import FedNdarray, PartitionWay

x, y = X_train, df_train["PurePremium"]
w = df_train["Exposure"]


def x_to_vdata(x):
    x = x.todense()
    v_data = FedNdarray(
        partitions={
            alice: alice(lambda: x[:, :15])(),
            bob: bob(lambda: x[:, 15:])(),
        },
        partition_way=PartitionWay.VERTICAL,
    )
    return v_data


v_data = x_to_vdata(x)

label_data = FedNdarray(
    partitions={alice: alice(lambda: y.values)()},
    partition_way=PartitionWay.VERTICAL,
)

sample_weight = FedNdarray(
    partitions={alice: alice(lambda: w.values)()},
    partition_way=PartitionWay.VERTICAL,
)
这段代码展示了如何在 SecretFlow 框架中使用联邦学习的方式处理数据。代码的主要目的是将本地数据集 X_train, y_train, 和 w（分别代表特征、标签和样本权重）转换成 SecretFlow 可以操作的联邦数据格式。以下是代码的详细解释：
导入依赖:
from secretflow.data import FedNdarray, PartitionWay
导入了 FedNdarray 类和 PartitionWay 枚举类型，这些用于创建联邦化的 N 维数组。
定义数据:
x, y = X_train, df_train["PurePremium"]
w = df_train["Exposure"]
这里定义了三个变量：x 代表训练特征集，y 代表与 df_train["PurePremium"] 相关的标签，w 代表样本权重。
定义转换函数:
def x_to_vdata(x):
    x = x.todense()
    v_data = FedNdarray(
        partitions={
            alice: alice(lambda: x[:, :15])(),
            bob: bob(lambda: x[:, 15:])(),
        },
        partition_way=PartitionWay.VERTICAL,
    )
    return v_data
定义了一个函数 x_to_vdata，它接收一个稀疏矩阵 x，将其转换为密集格式，然后使用 FedNdarray 创建一个垂直分割的联邦数组。特征矩阵 x 被分成两部分，前15列分配给 alice，其余部分分配给 bob。
创建特征的联邦数组:
v_data = x_to_vdata(x)
调用 x_to_vdata 函数并传入特征集 x，得到联邦化的特征数组 v_data。
创建标签的联邦数组:--标签都给alice
label_data = FedNdarray(
    partitions={alice: alice(lambda: y.values)()},
    partition_way=PartitionWay.VERTICAL,
)
创建了一个联邦数组 label_data，它包含标签 y，并且是垂直分割的。
创建样本权重的联邦数组:
sample_weight = FedNdarray(
    partitions={alice: alice(lambda: w.values)()},
    partition_way=PartitionWay.VERTICAL,
)
创建了一个联邦数组 sample_weight，它包含样本权重 w，并且同样是垂直分割的。

2.3 定义多方安全计算中预测与评估的方法
from secretflow.device.driver import reveal
from secretflow.ml.linear.ss_glm.core import get_dist

dist = 'Tweedie'
ss_glm_power = 1.9


class DirectRevealModel:
    def __init__(self, model) -> None:
        self.model = model

    def predict(self, X):
        vdata = x_to_vdata(X)
        y = self.model.predict(vdata)
        return reveal(y).reshape((-1,))

    def score(self, X, y, sample_weight=None):
        y = y.values
        y_pred = self.predict(X)

        constant = np.mean(y)
        if sample_weight is not None:
            constant *= sample_weight.shape[0] / np.sum(sample_weight)

        # Missing factor of 2 in deviance cancels out.
        deviance = get_dist(dist, 1, ss_glm_power).deviance(y_pred, y, None)
        deviance_null = get_dist(dist, 1, ss_glm_power).deviance(
            np.average(y, weights=sample_weight) + np.zeros(y.shape), y, None
        )
        return 1 - (deviance + constant) / (deviance_null + constant)
这段代码定义了一个名为 DirectRevealModel 的类，它封装了一个 SecretFlow 模型，并提供了 predict 和 score 方法。这个类主要用于在安全多方计算环境中直接获取模型预测结果和评估模型性能。以下是代码的详细解释：
导入依赖:
from secretflow.device.driver import reveal
from secretflow.ml.linear.ss_glm.core import get_dist
导入了 SecretFlow 的 reveal 函数，用于获取加密数据的明文值，以及 get_dist 函数，用于获取特定分布的类。
定义分布和幂参数:
dist = 'Tweedie'
ss_glm_power = 1.9
设置了使用的分布（Tweedie分布）和Tweedie分布的幂参数。
定义 DirectRevealModel 类:
class DirectRevealModel:
    def __init__(self, model) -> None:
        self.model = model
定义了 DirectRevealModel 类，它接受一个 SecretFlow 模型作为参数，并将其存储为实例变量。
预测方法:
def predict(self, X):
    vdata = x_to_vdata(X)
    y = self.model.predict(vdata)
    return reveal(y).reshape((-1,))
predict 方法接收特征数据 X，使用 x_to_vdata 函数将其转换为 SecretFlow 可以理解的联邦数据格式，然后使用封装的模型进行预测。通过 reveal 函数获取明文预测结果，并将其重塑为列向量形式。
评分方法:
def score(self, X, y, sample_weight=None):
    ...
score 方法用于评估模型性能。它接收测试数据 X、真实标签 y 以及可选的样本权重 sample_weight。计算模型在测试数据集上的 deviance（偏差）以及 null deviance（零偏差），然后根据这两个值计算模型的评分。
计算偏差:
deviance = get_dist(dist, 1, ss_glm_power).deviance(y_pred, y, None)
deviance_null = get_dist(dist, 1, ss_glm_power).deviance(...)
deviance_null = get_dist(dist, 1, ss_glm_power).deviance(...)

使用 get_dist 获取 Tweedie 分布的实例，并计算偏差。deviance 是模型预测与实际标签之间的偏差，而 deviance_null 是一个基线偏差，通常使用常数预测计算。
计算评分:
return 1 - (deviance + constant) / (deviance_null + constant)
根据偏差计算评分，评分越高表示模型性能越好。
reveal 函数的使用应该谨慎，因为它会泄露计算结果的明文，这在安全多方计算环境中可能不是期望的行为。通常，我们希望在不泄露明文的情况下评估模型性能。

2.4 使用SSGLM进行拟合

在 SecretFlow 框架中使用安全单隐层广义线性模型（SSGLM）进行拟合（训练），并封装了训练后的模型以进行预测和评分。以下是代码的详细步骤和解释：
import time
from secretflow.ml.linear.ss_glm import SSGLM

model = SSGLM(spu)

ss_glm_power = 1.9
start = time.time()
model.fit_irls(
    v_data,
    label_data,
    None,
    sample_weight,
    2,
    'Log',
    'Tweedie',
    ss_glm_power,
    l2_lambda=0.1,
    infeed_batch_size_limit=8000000,
    fraction_of_validation_set=0.2,
    stopping_rounds=2,
    stopping_metric='deviance',
    stopping_tolerance=0.001,
)

wrapped_model = DirectRevealModel(model)
导入时间模块:
import time
导入 Python 的 time 模块，用于测量模型训练所需的时间。
从 SecretFlow 导入 SSGLM:
from secretflow.ml.linear.ss_glm import SSGLM
从 SecretFlow 库中导入 SSGLM 类，用于创建安全多方计算环境中的广义线性模型。
初始化 SSGLM 模型:
model = SSGLM(spu)
使用之前创建的 SPU 实例 spu 初始化 SSGLM 模型。
设置 Tweedie 模型的幂参数:
ss_glm_power = 1.9
设置用于 Tweedie 模型的幂参数。
记录训练开始时间:
start = time.time()
记录模型训练开始的时间。
拟合 SSGLM 模型:
model.fit_irls(
    v_data,
    label_data,
    None,
    sample_weight,
    2,
    'Log',
    'Tweedie',
    ss_glm_power,
    l2_lambda=0.1,
    infeed_batch_size_limit=8000000,
    fraction_of_validation_set=0.2,
    stopping_rounds=2,
    stopping_metric='deviance',
    stopping_tolerance=0.001,
)
使用 fit_irls 方法拟合 SSGLM 模型。这个方法接受多个参数，包括：

v_data: 特征数据的联邦数组。
label_data: 标签数据的联邦数组。
None: 这里似乎是一个未使用的参数。
sample_weight: 样本权重的联邦数组。
2: 表示模型的多项式特征的度数。
'Log': 表示使用的链接函数。
'Tweedie': 指定分布族。
ss_glm_power: Tweedie 分布的幂参数。
l2_lambda: L2 正则化参数。
infeed_batch_size_limit: 输入数据的批处理大小限制。
fraction_of_validation_set: 验证集的比例。
stopping_rounds: 停止训练的迭代次数。
stopping_metric: 用于停止条件的评估指标。
stopping_tolerance: 停止条件的容差。
封装模型:
wrapped_model = DirectRevealModel(model)
创建 DirectRevealModel 类的实例，将训练后的 SSGLM 模型封装起来，以便进行直接的预测和评分。
这段代码展示了在安全多方计算环境中使用 SecretFlow 进行模型训练的完整流程。通过封装模型，可以在不泄露各方数据的情况下，使用模型进行预测和评估。

2.5 Tweedie 幂参数的影响

不同的 Tweedie 幂参数评估 Tweedie 回归模型和安全单隐层广义线性模型（SSGLM）的性能。

tweedie_powers = [1.5, 1.7, 1.8, 1.9, 1.99, 1.999, 1.9999]

scores_ss_glm_pure_premium = score_estimator(
    wrapped_model,
    X_train,
    X_test,
    df_train,
    df_test,
    target="PurePremium",
    weights="Exposure",
    tweedie_powers=tweedie_powers,
)

scores = pd.concat(
    [scores_glm_pure_premium, scores_ss_glm_pure_premium],
    axis=1,
    sort=True,
    keys=("TweedieRegressor", "SSGLMRegressor"),
)
print("Evaluation of the Tweedie Regressor and SS GLM on target PurePremium")
with pd.option_context("display.expand_frame_repr", False):
    print(scores)

2.6 比较Tweedie 回归和安全单隐层广义线性模型 SSGLM

比较两种不同模型（Tweedie 回归和安全单隐层广义线性模型 SSGLM）在训练集和测试集上的预测结果。

res = []
for subset_label, x, df in [
    ("train", X_train, df_train),
    ("test", X_test, df_test),
]:
    exposure = df["Exposure"].values
    res.append(
        {
            "subset": subset_label,
            "observed": df["ClaimAmount"].values.sum(),
            "predicted, tweedie, power=%.2f"
            % glm_pure_premium.power: np.sum(exposure * glm_pure_premium.predict(x)),
            "predicted, ss glm, power=%.2f"
            % ss_glm_power: np.sum(exposure * wrapped_model.predict(x)),
        }
    )

print(pd.DataFrame(res).set_index("subset").T)

2.7 结果可视化

使用 Python 和相关库来绘制不同模型的洛伦兹曲线（Lorenz Curve），并计算它们的基尼系数（Gini Index）。洛伦兹曲线是衡量模型预测风险分布公平性的一个工具，而基尼系数则是衡量模型预测好坏的一个指标，值越高表示模型的预测性能越好，完美模型的基尼系数为1

from sklearn.metrics import auc


def lorenz_curve(y_true, y_pred, exposure):
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    exposure = np.asarray(exposure)

    # order samples by increasing predicted risk:
    ranking = np.argsort(y_pred)
    ranked_exposure = exposure[ranking]
    ranked_pure_premium = y_true[ranking]
    cumulated_claim_amount = np.cumsum(ranked_pure_premium * ranked_exposure)
    cumulated_claim_amount /= cumulated_claim_amount[-1]
    cumulated_samples = np.linspace(0, 1, len(cumulated_claim_amount))
    return cumulated_samples, cumulated_claim_amount


fig, ax = plt.subplots(figsize=(8, 8))

y_pred_total_ss_glm = wrapped_model.predict(X_test).reshape((-1,))
y_pred_total = glm_pure_premium.predict(X_test)

for label, y_pred in [
    ("Compound Poisson Gamma", y_pred_total),
    ("Compound Poisson Gamma SS GLM", y_pred_total_ss_glm),
]:
    ordered_samples, cum_claims = lorenz_curve(
        df_test["PurePremium"], y_pred, df_test["Exposure"]
    )
    gini = 1 - 2 * auc(ordered_samples, cum_claims)
    label += " (Gini index: {:.3f})".format(gini)
    ax.plot(ordered_samples, cum_claims, linestyle="-", label=label)

# Oracle model: y_pred == y_test
ordered_samples, cum_claims = lorenz_curve(
    df_test["PurePremium"], df_test["PurePremium"], df_test["Exposure"]
)
gini = 1 - 2 * auc(ordered_samples, cum_claims)
label = "Oracle (Gini index: {:.3f})".format(gini)
ax.plot(ordered_samples, cum_claims, linestyle="-.", color="gray", label=label)

# Random baseline
ax.plot([0, 1], [0, 1], linestyle="--", color="black", label="Random baseline")
ax.set(
    title="Lorenz Curves",
    xlabel="Fraction of policyholders\n(ordered by model from safest to riskiest)",
    ylabel="Fraction of total claim amount",
)
ax.legend(loc="upper left")
plt.plot()