I. Background: Application Scenarios and Principles of LR/GLM
1. Application scenarios
- Logistic regression (LR): binary classification problems.
- Generalized linear model (GLM): risk premium prediction, i.e. estimating the expected total claim amount for the coverage to be provided.
Approach 1: model the pure premium directly with a Tweedie distribution (power p between 1 and 2).
Approach 2: approximate it indirectly with two-step modeling: pure premium = claim frequency (Poisson or negative binomial) × average claim amount (Gamma or inverse Gaussian).
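As a rough plaintext illustration of Approach 2 (a sketch only; X_tr, X_val, and df_tr are hypothetical preprocessed splits in the style of the freMTPL2 example in Part IV), the product model pairs a Poisson frequency GLM with a Gamma severity GLM:
from sklearn.linear_model import GammaRegressor, PoissonRegressor

# Frequency model: expected number of claims per unit of exposure (Poisson).
glm_freq = PoissonRegressor(alpha=1e-4)
glm_freq.fit(X_tr, df_tr["Frequency"], sample_weight=df_tr["Exposure"])

# Severity model: average claim amount, fitted on rows with at least one claim (Gamma).
mask = (df_tr["ClaimAmount"] > 0).values
glm_sev = GammaRegressor(alpha=10.0)
glm_sev.fit(
    X_tr[mask], df_tr.loc[mask, "AvgClaimAmount"],
    sample_weight=df_tr.loc[mask, "ClaimNb"],
)

# Pure premium prediction = frequency * severity.
pure_premium_pred = glm_freq.predict(X_val) * glm_sev.predict(X_val)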
2. Principles
- Linear regression (a basic form of GLM)
It assumes that the true value of the response variable Y is composed of a systematic component and an error component.
Systematic component: a linear predictor (a numeric term that can be fitted).
Error component: white noise (a Gaussian random variable).
Hence, in linear regression the conditional distribution of the response Y is Gaussian.
- Generalized linear model (GLM)
A GLM extends the error distribution to the exponential family: Bernoulli (logistic regression), Poisson, Gamma, compound Poisson-Gamma, Tweedie, and so on.
A GLM has three key components:
- Systematic component: a linear predictor η = Xβ, where X are the covariates and β are the unknown parameters to be estimated.
- Random component: an exponential-family distribution for the response Y, whose natural parameter θ is in one-to-one correspondence with the mean μ = E[Y]; the relation can be written θ = ψ(μ).
- Link function: a function g such that g(μ) = η, describing the relation between the systematic and random components (a "range mapping").
Common link functions: identity (Gaussian), logit (Bernoulli / logistic regression), and log (Poisson, Gamma, Tweedie).
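As a concrete instance of the three components, logistic regression uses the logit link, so the mean is recovered through its inverse, the sigmoid. A minimal numpy sketch with made-up data:
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))      # covariates
beta = np.array([0.5, -1.0, 2.0])  # unknown parameters (fixed here for illustration)

eta = X @ beta                     # systematic component: linear predictor
mu = 1.0 / (1.0 + np.exp(-eta))    # inverse logit link: mu = E[Y | X]
y = rng.binomial(1, mu)            # random component: Y ~ Bernoulli(mu)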
II. SecretFlow Models: Secret-Shared SSLR/SSGLM
- GLM parameter estimation
First-order optimizer: SGD (small computation and communication cost per iteration; random initialization).
Second-order optimizer: iteratively reweighted least squares (IRLS) (high computation and communication complexity; accurate initialization and fast convergence).
A practical combination: run the second-order optimizer for a few early iterations to get a good initialization, then switch to the first-order optimizer to converge quickly (see the IRLS sketch below).
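To make the IRLS idea concrete, here is a minimal plaintext numpy sketch for a Poisson GLM with log link (illustration only; SSGLM performs a comparable iteration on secret shares):
import numpy as np

def irls_poisson(X, y, n_iter=8):
    # IRLS for a Poisson GLM with log link: repeatedly solve a weighted
    # least-squares problem against a linearized "working response".
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        eta = X @ beta           # linear predictor
        mu = np.exp(eta)         # inverse link
        w = mu                   # working weights: (dmu/deta)^2 / Var(Y) = mu
        z = eta + (y - mu) / mu  # working response
        Xw = X * w[:, None]
        beta = np.linalg.solve(X.T @ Xw, X.T @ (w * z))
    return beta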
- Secret-sharing addition
Each party learns nothing about the other parties' secrets, and no neutral third party is required.
- Secret-sharing multiplication (see the sketch below)
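A minimal two-party additive secret-sharing sketch over a public prime modulus (the concrete values and the modulus are illustrative assumptions): addition needs only local work, while multiplication consumes a precomputed Beaver triple.
import numpy as np

rng = np.random.default_rng(0)
P = 2**61 - 1  # public prime modulus

def share(x):
    # split secret x into two additive shares modulo P
    r = int(rng.integers(0, P))
    return r, (x - r) % P

def reconstruct(a, b):
    return (a + b) % P

# Addition: each party adds its local shares; no interaction is needed.
x0, x1 = share(20)
y0, y1 = share(22)
assert reconstruct((x0 + y0) % P, (x1 + y1) % P) == 42

# Multiplication: uses a precomputed Beaver triple (a, b, c) with c = a*b.
a0, a1 = share(int(rng.integers(0, P)))
b0, b1 = share(int(rng.integers(0, P)))
a, b = reconstruct(a0, a1), reconstruct(b0, b1)
c0, c1 = share(a * b % P)
# The parties open e = x - a and f = y - b; these reveal nothing about x or y.
e = reconstruct((x0 - a0) % P, (x1 - a1) % P)
f = reconstruct((y0 - b0) % P, (y1 - b1) % P)
# z_i = c_i + e*b_i + f*a_i (plus e*f for one party) gives shares of x*y.
z0 = (c0 + e * b0 + f * a0 + e * f) % P
z1 = (c1 + e * b1 + f * a1) % P
assert reconstruct(z0, z1) == 20 * 22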
III. Implementation: From Theory to SecretFlow Applications
1. SSGLM parameter walkthrough
- Choose the distribution type assumed for the label and the link function (the dist and link arguments).
- If the Tweedie distribution is chosen, the power p needs tuning (usually within 1~2; a simple loop over candidate values can be used, as sketched after this list).
- Optimizer: IRLS or SGD.
- Other parameters (offset/bias, dispersion, sample weighting).
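A hedged sketch of the tuning loop for p: fit a plaintext TweedieRegressor per candidate power and compare validation deviance. X_tr, y_tr, X_val, y_val, and the grid are hypothetical placeholders; each candidate is scored with its own deviance, so read the comparison qualitatively.
import numpy as np
from sklearn.linear_model import TweedieRegressor
from sklearn.metrics import mean_tweedie_deviance

best_p, best_dev = None, np.inf
for p in [1.1, 1.3, 1.5, 1.7, 1.9]:  # candidate powers in (1, 2)
    reg = TweedieRegressor(power=p, alpha=0.1).fit(X_tr, y_tr)
    dev = mean_tweedie_deviance(y_val, reg.predict(X_val), power=p)
    if dev < best_dev:
        best_p, best_dev = p, dev
print("chosen power:", best_p)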
2. Using SSLR (see the sketch after this list)
1) Prepare the SPU and normalize the data.
2) Train the model.
3) Evaluate the model.
3. Using SSGLM
SSGLM can be trained with either the IRLS or the SGD optimizer.
Advantages of SS-LR and SSGLM: provably secure; no reliance on a trusted third party; support for multiple models (Bernoulli, Poisson, Gamma, and Tweedie distributions); computationally efficient.
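A minimal SSLR sketch following the SecretFlow SS-LR tutorial (the positional fit signature may differ across SecretFlow versions; spu, v_data, and label_data are assumed to be set up as in Part IV below):
from secretflow.ml.linear.ss_sgd import SSRegression

# SS-LR: logistic regression trained with SGD entirely on secret shares.
model = SSRegression(spu)
model.fit(
    v_data,      # vertically partitioned features (FedNdarray)
    label_data,  # labels, held by one party
    5,           # epochs
    0.3,         # learning rate
    1024,        # batch size
    't1',        # sigmoid approximation type
    'logistic',  # regression type
    'l2',        # penalty
    0.1,         # l2 regularization strength
)
spu_yhat = model.predict(v_data)  # predictions stay secret-shared until revealed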
IV. Practice: GLM Modeling for a Tweedie Distribution
- Model the freMTPL2 dataset (from OpenML) with sklearn's TweedieRegressor
Download the dataset and define some helper functions.
import ssl

# In case certificate verification fails over the connection
# when downloading the dataset below.
ssl._create_default_https_context = ssl._create_unverified_context

from functools import partial

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.datasets import fetch_openml
from sklearn.metrics import (
    mean_absolute_error,
    mean_squared_error,
    mean_tweedie_deviance,
)
# freMTPL2: how often / how much a driver files an insurance claim in a year.
def load_mtpl2(n_samples=None):
    """Fetch the French Motor Third-Party Liability Claims dataset.

    Parameters
    ----------
    n_samples: int, default=None
        Number of samples to select (for faster run time). The full dataset
        has 678013 samples.
    """
    # freMTPL2freq dataset from https://www.openml.org/d/41214
    df_freq = fetch_openml(data_id=41214, as_frame=True).data
    df_freq["IDpol"] = df_freq["IDpol"].astype(int)
    df_freq.set_index("IDpol", inplace=True)
    # freMTPL2sev dataset from https://www.openml.org/d/41215
    df_sev = fetch_openml(data_id=41215, as_frame=True).data
    # sum claim amounts that share the same policy ID
    df_sev = df_sev.groupby("IDpol").sum()
    df = df_freq.join(df_sev, how="left")
    df["ClaimAmount"].fillna(0, inplace=True)
    # unquote string fields
    for column_name in df.columns[df.dtypes.values == object]:
        df[column_name] = df[column_name].str.strip("'")
    return df.iloc[:n_samples]
def plot_obs_pred(
    df,
    feature,
    weight,
    observed,
    predicted,
    y_label=None,
    title=None,
    ax=None,
    fill_legend=False,
):
    """Plot observed and predicted target values, aggregated by feature level.

    Parameters
    ----------
    df : DataFrame
        input data
    feature: str
        a column name of df for the feature to be plotted
    weight : str
        column name of df with the values of weights or exposure
    observed : str
        a column name of df with the observed target
    predicted : DataFrame
        a dataframe, with the same index as df, with the predicted target
    fill_legend : bool, default=False
        whether to show fill_between legend
    """
    # aggregate observed and predicted variables by feature level
    df_ = df.loc[:, [feature, weight]].copy()
    df_["observed"] = df[observed] * df[weight]
    df_["predicted"] = predicted * df[weight]
    df_ = (
        df_.groupby([feature])[[weight, "observed", "predicted"]]
        .sum()
        .assign(observed=lambda x: x["observed"] / x[weight])
        .assign(predicted=lambda x: x["predicted"] / x[weight])
    )
    ax = df_.loc[:, ["observed", "predicted"]].plot(style=".", ax=ax)
    y_max = df_.loc[:, ["observed", "predicted"]].values.max() * 0.8
    p2 = ax.fill_between(
        df_.index,
        0,
        y_max * df_[weight] / df_[weight].values.max(),
        color="g",
        alpha=0.1,
    )
    if fill_legend:
        ax.legend([p2], ["{} distribution".format(feature)])
    ax.set(
        ylabel=y_label if y_label is not None else None,
        title=title if title is not None else "Train: Observed vs Predicted",
    )
def score_estimator(
    estimator,
    X_train,
    X_test,
    df_train,
    df_test,
    target,
    weights,
    tweedie_powers=None,
):
    """Evaluate an estimator on train and test sets with different metrics."""
    metrics = [
        ("D² explained", None),  # use default scorer if it exists
        ("mean abs. error", mean_absolute_error),
        ("mean squared error", mean_squared_error),
    ]
    if tweedie_powers:
        metrics += [
            (
                "mean Tweedie dev p={:.4f}".format(power),
                partial(mean_tweedie_deviance, power=power),
            )
            for power in tweedie_powers
        ]
    res = []
    for subset_label, X, df in [
        ("train", X_train, df_train),
        ("test", X_test, df_test),
    ]:
        y, _weights = df[target], df[weights]
        for score_label, metric in metrics:
            if isinstance(estimator, tuple) and len(estimator) == 2:
                # Score the model consisting of the product of frequency and
                # severity models.
                est_freq, est_sev = estimator
                y_pred = est_freq.predict(X) * est_sev.predict(X)
            else:
                y_pred = estimator.predict(X)
            if metric is None:
                if not hasattr(estimator, "score"):
                    continue
                score = estimator.score(X, y, sample_weight=_weights)
            else:
                score = metric(y, y_pred, sample_weight=_weights)
            res.append({"subset": subset_label, "metric": score_label, "score": score})
    res = (
        pd.DataFrame(res)
        .set_index(["metric", "subset"])
        .score.unstack(-1)
        .round(4)
        .loc[:, ["train", "test"]]
    )
    return res
Data preprocessing
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import (
    FunctionTransformer,
    KBinsDiscretizer,
    OneHotEncoder,
    StandardScaler,
)

df = load_mtpl2(n_samples=50000)

# Filter out claims with zero amount, as the severity model
# requires strictly positive target values.
df.loc[(df["ClaimAmount"] == 0) & (df["ClaimNb"] >= 1), "ClaimNb"] = 0

# Correct for unreasonable observations (that might be data errors)
# and a few exceptionally large claim amounts.
df["ClaimNb"] = df["ClaimNb"].clip(upper=4)
df["Exposure"] = df["Exposure"].clip(upper=1)
df["ClaimAmount"] = df["ClaimAmount"].clip(upper=200000)

log_scale_transformer = make_pipeline(
    FunctionTransformer(func=np.log), StandardScaler()
)
column_trans = ColumnTransformer(
    [
        (
            "binned_numeric",
            KBinsDiscretizer(n_bins=10, subsample=int(2e5), random_state=0),
            ["VehAge", "DrivAge"],
        ),
        (
            "onehot_categorical",
            OneHotEncoder(),
            ["VehBrand", "VehPower", "VehGas", "Region", "Area"],
        ),
        ("passthrough_numeric", "passthrough", ["BonusMalus"]),
        ("log_scaled_numeric", log_scale_transformer, ["Density"]),
    ],
    remainder="drop",
)
X = column_trans.fit_transform(df)

# Insurance companies are interested in modeling the Pure Premium, that is
# the expected total claim amount per unit of exposure for each policyholder
# in their portfolio:
df["PurePremium"] = df["ClaimAmount"] / df["Exposure"]

# This can be indirectly approximated by a 2-step modeling: the product of
# the claim frequency times the average claim amount per claim:
df["Frequency"] = df["ClaimNb"] / df["Exposure"]
df["AvgClaimAmount"] = df["ClaimAmount"] / np.fmax(df["ClaimNb"], 1)

with pd.option_context("display.max_columns", 15):
    print(df[df.ClaimAmount > 0].head())
Split into training and test sets
from sklearn.model_selection import train_test_split
df_train, df_test, X_train, X_test = train_test_split(df, X, random_state=0)
Build and train the model with TweedieRegressor
from sklearn.linear_model import TweedieRegressor

# Tweedie regression model. power sets the index of the Tweedie distribution
# (1.9 lies between the Poisson and Gamma special cases); alpha is the
# regularization strength and solver selects the optimizer.
glm_pure_premium = TweedieRegressor(power=1.9, alpha=0.1, solver='newton-cholesky')

# X_train is the feature matrix and df_train["PurePremium"] is the label;
# sample_weight weights each sample by its exposure.
glm_pure_premium.fit(
    X_train, df_train["PurePremium"], sample_weight=df_train["Exposure"]
)
tweedie_powers = [1.5, 1.7, 1.8, 1.9, 1.99, 1.999, 1.9999]
scores_glm_pure_premium = score_estimator(
    glm_pure_premium,
    X_train,
    X_test,
    df_train,
    df_test,
    target="PurePremium",
    weights="Exposure",
    tweedie_powers=tweedie_powers,
)
# Merge the evaluation results into a single DataFrame.
scores = pd.concat(
    [scores_glm_pure_premium],
    axis=1,
    sort=True,
    keys=["TweedieRegressor"],
)
print("Evaluation of the Tweedie Regressor on target PurePremium")
with pd.option_context("display.expand_frame_repr", False):
    print(scores)
- Model freMTPL2 with SecretFlow's SS_GLM interface
Fetch the iteration count of the plaintext model:
n_iter = glm_pure_premium.n_iter_  # number of iterations used by the plaintext model
Initialize the SPU. REF2K is a plaintext "MPC" protocol provided by SecretFlow, used here for demonstration purposes only.
import secretflow as sf

# Check the version of your SecretFlow.
print('The version of SecretFlow: {}'.format(sf.__version__))

# In case you have a running secretflow runtime already.
sf.shutdown()

# Initialize the SecretFlow runtime with the participants and address.
sf.init(['alice', 'bob'], address='local')

# Define the participants.
alice, bob = sf.PYU('alice'), sf.PYU('bob')
spu = sf.SPU(
    sf.utils.testing.cluster_def(
        ['alice', 'bob'],
        {"protocol": "REF2K", "field": "FM128", "fxp_fraction_bits": 40},
    ),
)
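For anything beyond a demo, REF2K should be swapped for a real MPC protocol. As a hedged sketch, SPU's two-party SEMI2K protocol uses the same cluster_def shape:
spu = sf.SPU(
    sf.utils.testing.cluster_def(
        ['alice', 'bob'],
        {"protocol": "SEMI2K", "field": "FM128", "fxp_fraction_bits": 40},
    ),
)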
Vertically partition the data (features) between the two parties
from secretflow.data import FedNdarray, PartitionWay

# Data preparation.
x, y = X_train, df_train["PurePremium"]
w = df_train["Exposure"]
def x_to_vdata(x):
    # densify the sparse matrix, then split the columns between the two parties
    x = x.todense()
    v_data = FedNdarray(
        partitions={
            alice: alice(lambda: x[:, :15])(),
            bob: bob(lambda: x[:, 15:])(),
        },
        partition_way=PartitionWay.VERTICAL,
    )
    return v_data

v_data = x_to_vdata(x)

# Wrap the label y as a federated ndarray held by alice.
label_data = FedNdarray(
    partitions={alice: alice(lambda: y.values)()},
    partition_way=PartitionWay.VERTICAL,
)

# Wrap the sample weight w as a federated ndarray held by alice.
sample_weight = FedNdarray(
    partitions={alice: alice(lambda: w.values)()},
    partition_way=PartitionWay.VERTICAL,
)
Define a wrapper model for evaluation
from secretflow.device.driver import reveal
from secretflow.ml.linear.ss_glm.core import get_dist

dist = 'Tweedie'
ss_glm_power = 1.9


class DirectRevealModel:
    def __init__(self, model) -> None:
        self.model = model

    def predict(self, X):
        vdata = x_to_vdata(X)
        y = self.model.predict(vdata)
        # reveal the secret-shared predictions and flatten them
        return reveal(y).reshape((-1,))

    def score(self, X, y, sample_weight=None):
        y = y.values
        y_pred = self.predict(X)
        constant = np.mean(y)
        if sample_weight is not None:
            constant *= sample_weight.shape[0] / np.sum(sample_weight)
        # Missing factor of 2 in deviance cancels out.
        deviance = get_dist(dist, 1, ss_glm_power).deviance(y_pred, y, None)
        deviance_null = get_dist(dist, 1, ss_glm_power).deviance(
            np.average(y, weights=sample_weight) + np.zeros(y.shape), y, None
        )
        # D²-style score: one minus the ratio of model deviance to null deviance
        return 1 - (deviance + constant) / (deviance_null + constant)
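The score above follows the deviance-explained statistic that sklearn reports for GLMs (the constant term is an implementation detail of this wrapper); the underlying formula is

D^2 = 1 - \frac{D(y, \hat{y})}{D(y, \bar{y})}

where D is the (Tweedie) deviance and \bar{y} is the weighted mean of the observations.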
Train and evaluate the model
import time
from secretflow.ml.linear.ss_glm import SSGLM

model = SSGLM(spu)
ss_glm_power = 1.9  # Tweedie power, as above
start = time.time()
model.fit_irls(
    v_data,         # features (vertically partitioned)
    label_data,     # label
    None,           # offset (none here)
    sample_weight,  # sample weight (the exposure)
    2,              # number of IRLS iterations
    'Log',          # link function
    'Tweedie',      # label distribution
    ss_glm_power,   # Tweedie power p
    l2_lambda=0.1,
    infeed_batch_size_limit=8000000,
    fraction_of_validation_set=0.2,
    stopping_rounds=2,
    stopping_metric='deviance',
    stopping_tolerance=0.001,
)
wrapped_model = DirectRevealModel(model)
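To compare against the plaintext model, the wrapped model can be scored with the score_estimator helper defined earlier (a sketch; note that it reveals the predictions in order to compute plaintext metrics):
scores_ss_glm = score_estimator(
    wrapped_model,
    X_train,
    X_test,
    df_train,
    df_test,
    target="PurePremium",
    weights="Exposure",
    tweedie_powers=tweedie_powers,
)
print("Evaluation of the SS-GLM model on target PurePremium")
print(scores_ss_glm)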
- Comparison of plaintext and secret-shared training/prediction results (see the sketch below)
The gap between the plaintext and the secret-shared model is acceptable.
A Lorenz curve is used to check how closely the distribution of the model's predictions matches the distribution of the observed data; the agreement is reasonable.
The Lorenz curve mainly visualizes the degree of inequality or concentration of a distribution.
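The lorenz_curve helper itself is not included in this note; a minimal sketch in the spirit of the sklearn Tweedie example (order policyholders by predicted risk, then accumulate observed claims) could look like this:
def lorenz_curve(y_true, y_pred, exposure):
    # cumulative share of claim amount vs. share of samples, ordered by prediction
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    exposure = np.asarray(exposure)
    ranking = np.argsort(y_pred)  # from lowest to highest predicted risk
    ranked_exposure = exposure[ranking]
    ranked_pure_premium = y_true[ranking]
    cumulated_claims = np.cumsum(ranked_pure_premium * ranked_exposure)
    cumulated_claims /= cumulated_claims[-1]
    cumulated_samples = np.linspace(0, 1, len(cumulated_claims))
    return cumulated_samples, cumulated_claims

# Compare plaintext and secret-shared predictions on the test set.
for label, y_pred in [
    ("sklearn", glm_pure_premium.predict(X_test)),
    ("SS-GLM", wrapped_model.predict(X_test)),
]:
    cum_samples, cum_claims = lorenz_curve(
        df_test["PurePremium"], y_pred, df_test["Exposure"]
    )
    plt.plot(cum_samples, cum_claims, label=label)
plt.plot([0, 1], [0, 1], "k--", label="random baseline")
plt.legend()
plt.show()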