I. Background: Application Scenarios and Principles of LR/GLM
1. Application scenarios
- Logistic regression (LR): binary classification problems.
- Generalized linear model (GLM): risk premium prediction, i.e. estimating the expected total claim amount for the coverage to be provided.
Approach 1: model the pure premium directly with a Tweedie distribution (power p between 1 and 2).
Approach 2: approximate it indirectly with two-step modeling: pure premium = claim frequency (Poisson or negative binomial) × average claim amount (Gamma or inverse Gaussian).
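As a rough plaintext illustration of Approach 2 (a sketch only; X_tr, X_val, and df_tr are hypothetical preprocessed splits in the style of the freMTPL2 example in Part IV), the product model pairs a Poisson frequency GLM with a Gamma severity GLM:
from sklearn.linear_model import GammaRegressor, PoissonRegressor

# Frequency model: expected number of claims per unit of exposure (Poisson).
glm_freq = PoissonRegressor(alpha=1e-4)
glm_freq.fit(X_tr, df_tr["Frequency"], sample_weight=df_tr["Exposure"])

# Severity model: average claim amount, fitted on rows with at least one claim (Gamma).
mask = (df_tr["ClaimAmount"] > 0).values
glm_sev = GammaRegressor(alpha=10.0)
glm_sev.fit(
    X_tr[mask], df_tr.loc[mask, "AvgClaimAmount"],
    sample_weight=df_tr.loc[mask, "ClaimNb"],
)

# Pure premium prediction = frequency * severity.
pure_premium_pred = glm_freq.predict(X_val) * glm_sev.predict(X_val)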
2. Principles
- Linear regression (a basic form of GLM)
It assumes that the true value of the response variable Y is composed of a systematic component and an error component.
Systematic component: a linear predictor (a numeric term that can be fitted).
Error component: white noise (a Gaussian random variable).
Hence, in linear regression the conditional distribution of the response Y is Gaussian.
- Generalized linear model (GLM)
A GLM extends the error distribution to the exponential family: Bernoulli (logistic regression), Poisson, Gamma, compound Poisson-Gamma, Tweedie, and so on.
A GLM has three key components:
- Systematic component: a linear predictor η = Xβ, where X are the covariates and β are the unknown parameters to be estimated.
- Random component: an exponential-family distribution for the response Y, whose natural parameter θ is in one-to-one correspondence with the mean μ = E[Y]; the relation can be written θ = ψ(μ).
- Link function: a function g such that g(μ) = η, describing the relation between the systematic and random components (a "range mapping").
Common link functions: identity (Gaussian), logit (Bernoulli / logistic regression), and log (Poisson, Gamma, Tweedie).
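As a concrete instance of the three components, logistic regression uses the logit link, so the mean is recovered through its inverse, the sigmoid. A minimal numpy sketch with made-up data:
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))      # covariates
beta = np.array([0.5, -1.0, 2.0])  # unknown parameters (fixed here for illustration)

eta = X @ beta                     # systematic component: linear predictor
mu = 1.0 / (1.0 + np.exp(-eta))    # inverse logit link: mu = E[Y | X]
y = rng.binomial(1, mu)            # random component: Y ~ Bernoulli(mu)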
II. SecretFlow Models: Secret-Shared SSLR/SSGLM
- GLM parameter estimation
First-order optimizer: SGD (small computation and communication cost per iteration; random initialization).
Second-order optimizer: iteratively reweighted least squares (IRLS) (high computation and communication complexity; accurate initialization and fast convergence).
A practical combination: run the second-order optimizer for a few early iterations to get a good initialization, then switch to the first-order optimizer to converge quickly (see the IRLS sketch below).
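To make the IRLS idea concrete, here is a minimal plaintext numpy sketch for a Poisson GLM with log link (illustration only; SSGLM performs a comparable iteration on secret shares):
import numpy as np

def irls_poisson(X, y, n_iter=8):
    # IRLS for a Poisson GLM with log link: repeatedly solve a weighted
    # least-squares problem against a linearized "working response".
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        eta = X @ beta           # linear predictor
        mu = np.exp(eta)         # inverse link
        w = mu                   # working weights: (dmu/deta)^2 / Var(Y) = mu
        z = eta + (y - mu) / mu  # working response
        Xw = X * w[:, None]
        beta = np.linalg.solve(X.T @ Xw, X.T @ (w * z))
    return beta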
- Secret-sharing addition
Each party learns nothing about the other parties' secrets, and no neutral third party is required.
- Secret-sharing multiplication (see the sketch below)
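A minimal two-party additive secret-sharing sketch over a public prime modulus (the concrete values and the modulus are illustrative assumptions): addition needs only local work, while multiplication consumes a precomputed Beaver triple.
import numpy as np

rng = np.random.default_rng(0)
P = 2**61 - 1  # public prime modulus

def share(x):
    # split secret x into two additive shares modulo P
    r = int(rng.integers(0, P))
    return r, (x - r) % P

def reconstruct(a, b):
    return (a + b) % P

# Addition: each party adds its local shares; no interaction is needed.
x0, x1 = share(20)
y0, y1 = share(22)
assert reconstruct((x0 + y0) % P, (x1 + y1) % P) == 42

# Multiplication: uses a precomputed Beaver triple (a, b, c) with c = a*b.
a0, a1 = share(int(rng.integers(0, P)))
b0, b1 = share(int(rng.integers(0, P)))
a, b = reconstruct(a0, a1), reconstruct(b0, b1)
c0, c1 = share(a * b % P)
# The parties open e = x - a and f = y - b; these reveal nothing about x or y.
e = reconstruct((x0 - a0) % P, (x1 - a1) % P)
f = reconstruct((y0 - b0) % P, (y1 - b1) % P)
# z_i = c_i + e*b_i + f*a_i (plus e*f for one party) gives shares of x*y.
z0 = (c0 + e * b0 + f * a0 + e * f) % P
z1 = (c1 + e * b1 + f * a1) % P
assert reconstruct(z0, z1) == 20 * 22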
III. Implementation: From Theory to SecretFlow Applications
1. SSGLM parameter walkthrough
- Choose the distribution type assumed for the label and the link function (the dist and link arguments).
- If the Tweedie distribution is chosen, the power p needs tuning (usually within 1~2; a simple loop over candidate values can be used, as sketched after this list).
- Optimizer: IRLS or SGD.
- Other parameters (offset/bias, dispersion, sample weighting).
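A hedged sketch of the tuning loop for p: fit a plaintext TweedieRegressor per candidate power and compare validation deviance. X_tr, y_tr, X_val, y_val, and the grid are hypothetical placeholders; each candidate is scored with its own deviance, so read the comparison qualitatively.
import numpy as np
from sklearn.linear_model import TweedieRegressor
from sklearn.metrics import mean_tweedie_deviance

best_p, best_dev = None, np.inf
for p in [1.1, 1.3, 1.5, 1.7, 1.9]:  # candidate powers in (1, 2)
    reg = TweedieRegressor(power=p, alpha=0.1).fit(X_tr, y_tr)
    dev = mean_tweedie_deviance(y_val, reg.predict(X_val), power=p)
    if dev < best_dev:
        best_p, best_dev = p, dev
print("chosen power:", best_p)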
2. Using SSLR (see the sketch after this list)
1) Prepare the SPU and normalize the data.
2) Train the model.
3) Evaluate the model.
3. Using SSGLM
SSGLM can be trained with either the IRLS or the SGD optimizer.
Advantages of SS-LR and SSGLM: provably secure; no reliance on a trusted third party; support for multiple models (Bernoulli, Poisson, Gamma, and Tweedie distributions); computationally efficient.
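A minimal SSLR sketch following the SecretFlow SS-LR tutorial (the positional fit signature may differ across SecretFlow versions; spu, v_data, and label_data are assumed to be set up as in Part IV below):
from secretflow.ml.linear.ss_sgd import SSRegression

# SS-LR: logistic regression trained with SGD entirely on secret shares.
model = SSRegression(spu)
model.fit(
    v_data,      # vertically partitioned features (FedNdarray)
    label_data,  # labels, held by one party
    5,           # epochs
    0.3,         # learning rate
    1024,        # batch size
    't1',        # sigmoid approximation type
    'logistic',  # regression type
    'l2',        # penalty
    0.1,         # l2 regularization strength
)
spu_yhat = model.predict(v_data)  # predictions stay secret-shared until revealed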
IV. Practice: GLM Modeling for a Tweedie Distribution
- Model the freMTPL2 dataset (from OpenML) with sklearn's TweedieRegressor
Download the dataset and define some helper functions.
import ssl

# In case certificate verification fails over the connection
# when downloading the dataset below.
ssl._create_default_https_context = ssl._create_unverified_context

from functools import partial

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.datasets import fetch_openml
from sklearn.metrics import (
    mean_absolute_error,
    mean_squared_error,
    mean_tweedie_deviance,
)
# freMTPL2: how often / how much a driver files an insurance claim in a year.
def load_mtpl2(n_samples=None):
    """Fetch the French Motor Third-Party Liability Claims dataset.

    Parameters
    ----------
    n_samples: int, default=None
        Number of samples to select (for faster run time). The full dataset
        has 678013 samples.
    """
    # freMTPL2freq dataset from https://www.openml.org/d/41214
    df_freq = fetch_openml(data_id=41214, as_frame=True).data
    df_freq["IDpol"] = df_freq["IDpol"].astype(int)
    df_freq.set_index("IDpol", inplace=True)
    # freMTPL2sev dataset from https://www.openml.org/d/41215
    df_sev = fetch_openml(data_id=41215, as_frame=True).data
    # sum claim amounts that share the same policy ID
    df_sev = df_sev.groupby("IDpol").sum()
    df = df_freq.join(df_sev, how="left")
    df["ClaimAmount"].fillna(0, inplace=True)
    # unquote string fields
    for column_name in df.columns[df.dtypes.values == object]:
        df[column_name] = df[column_name].str.strip("'")
    return df.iloc[:n_samples]
def plot_obs_pred(
    df,
    feature,
    weight,
    observed,
    predicted,
    y_label=None,
    title=None,
    ax=None,
    fill_legend=False,
):
    """Plot observed and predicted target values, aggregated by feature level.

    Parameters
    ----------
    df : DataFrame
        input data
    feature: str
        a column name of df for the feature to be plotted
    weight : str
        column name of df with the values of weights or exposure
    observed : str
        a column name of df with the observed target
    predicted : DataFrame
        a dataframe, with the same index as df, with the predicted target
    fill_legend : bool, default=False
        whether to show fill_between legend
    """
    # aggregate observed and predicted variables by feature level
    df_ = df.loc[:, [feature, weight]].copy()
    df_["observed"] = df[observed] * df[weight]
    df_["predicted"] = predicted * df[weight]
    df_ = (
        df_.groupby([feature])[[weight, "observed", "predicted"]]
        .sum()
        .assign(observed=lambda x: x["observed"] / x[weight])
        .assign(predicted=lambda x: x["predicted"] / x[weight])
    )
    ax = df_.loc[:, ["observed", "predicted"]].plot(style=".", ax=ax)
    y_max = df_.loc[:, ["observed", "predicted"]].values.max() * 0.8
    p2 = ax.fill_between(
        df_.index,
        0,
        y_max * df_[weight] / df_[weight].values.max(),
        color="g",
        alpha=0.1,
    )
    if fill_legend:
        ax.legend([p2], ["{} distribution".format(feature)])
    ax.set(
        ylabel=y_label if y_label is not None else None,
        title=title if title is not None else "Train: Observed vs Predicted",
    )
def score_estimator(
    estimator,
    X_train,
    X_test,
    df_train,
    df_test,
    target,
    weights,
    tweedie_powers=None,
):
    """Evaluate an estimator on train and test sets with different metrics."""
    metrics = [
        ("D² explained", None),  # use default scorer if it exists
        ("mean abs. error", mean_absolute_error),
        ("mean squared error", mean_squared_error),
    ]
    if tweedie_powers:
        metrics += [
            (
                "mean Tweedie dev p={:.4f}".format(power),
                partial(mean_tweedie_deviance, power=power),
            )
            for power in tweedie_powers
        ]
    res = []
    for subset_label, X, df in [
        ("train", X_train, df_train),
        ("test", X_test, df_test),
    ]:
        y, _weights = df[target], df[weights]
        for score_label, metric in metrics:
            if isinstance(estimator, tuple) and len(estimator) == 2:
                # Score the model consisting of the product of frequency and
                # severity models.
                est_freq, est_sev = estimator
                y_pred = est_freq.predict(X) * est_sev.predict(X)
            else:
                y_pred = estimator.predict(X)
            if metric is None:
                if not hasattr(estimator, "score"):
                    continue
                score = estimator.score(X, y, sample_weight=_weights)
            else:
                score = metric(y, y_pred, sample_weight=_weights)
            res.append({"subset": subset_label, "metric": score_label, "score": score})
    res = (
        pd.DataFrame(res)
        .set_index(["metric", "subset"])
        .score.unstack(-1)
        .round(4)
        .loc[:, ["train", "test"]]
    )
    return res
Data preprocessing
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import (
    FunctionTransformer,
    KBinsDiscretizer,
    OneHotEncoder,
    StandardScaler,
)

df = load_mtpl2(n_samples=50000)

# Filter out claims with zero amount, as the severity model
# requires strictly positive target values.
df.loc[(df["ClaimAmount"] == 0) & (df["ClaimNb"] >= 1), "ClaimNb"] = 0

# Correct for unreasonable observations (that might be data errors)
# and a few exceptionally large claim amounts.
df["ClaimNb"] = df["ClaimNb"].clip(upper=4)
df["Exposure"] = df["Exposure"].clip(upper=1)
df["ClaimAmount"] = df["ClaimAmount"].clip(upper=200000)

log_scale_transformer = make_pipeline(
    FunctionTransformer(func=np.log), StandardScaler()
)
column_trans = ColumnTransformer(
    [
        (
            "binned_numeric",
            KBinsDiscretizer(n_bins=10, subsample=int(2e5), random_state=0),
            ["VehAge", "DrivAge"],
        ),
        (
            "onehot_categorical",
            OneHotEncoder(),
            ["VehBrand", "VehPower", "VehGas", "Region", "Area"],
        ),
        ("passthrough_numeric", "passthrough", ["BonusMalus"]),
        ("log_scaled_numeric", log_scale_transformer, ["Density"]),
    ],
    remainder="drop",
)
X = column_trans.fit_transform(df)

# Insurance companies are interested in modeling the Pure Premium, that is
# the expected total claim amount per unit of exposure for each policyholder
# in their portfolio:
df["PurePremium"] = df["ClaimAmount"] / df["Exposure"]

# This can be indirectly approximated by a 2-step modeling: the product of
# the claim frequency times the average claim amount per claim:
df["Frequency"] = df["ClaimNb"] / df["Exposure"]
df["AvgClaimAmount"] = df["ClaimAmount"] / np.fmax(df["ClaimNb"], 1)

with pd.option_context("display.max_columns", 15):
    print(df[df.ClaimAmount > 0].head())
Split into training and test sets
from sklearn.model_selection import train_test_split
df_train, df_test, X_train, X_test = train_test_split(df, X, random_state=0)
Build and train the model with TweedieRegressor
from sklearn.linear_model import TweedieRegressor

# Tweedie regression model. power sets the index of the Tweedie distribution
# (1.9 lies between the Poisson and Gamma special cases); alpha is the
# regularization strength and solver selects the optimizer.
glm_pure_premium = TweedieRegressor(power=1.9, alpha=0.1, solver='newton-cholesky')

# X_train is the feature matrix and df_train["PurePremium"] is the label;
# sample_weight weights each sample by its exposure.
glm_pure_premium.fit(
    X_train, df_train["PurePremium"], sample_weight=df_train["Exposure"]
)
tweedie_powers = [1.5, 1.7, 1.8, 1.9, 1.99, 1.999, 1.9999]
scores_glm_pure_premium = score_estimator(
    glm_pure_premium,
    X_train,
    X_test,
    df_train,
    df_test,
    target="PurePremium",
    weights="Exposure",
    tweedie_powers=tweedie_powers,
)
# Merge the evaluation results into a single DataFrame.
scores = pd.concat(
    [scores_glm_pure_premium],
    axis=1,
    sort=True,
    keys=["TweedieRegressor"],
)
print("Evaluation of the Tweedie Regressor on target PurePremium")
with pd.option_context("display.expand_frame_repr", False):
    print(scores)
- Model freMTPL2 with SecretFlow's SS_GLM interface
Fetch the iteration count of the plaintext model:
n_iter = glm_pure_premium.n_iter_  # number of iterations used by the plaintext model
Initialize the SPU. REF2K is a plaintext "MPC" protocol provided by SecretFlow, used here for demonstration purposes only.
import secretflow as sf

# Check the version of your SecretFlow.
print('The version of SecretFlow: {}'.format(sf.__version__))

# In case you have a running secretflow runtime already.
sf.shutdown()

# Initialize the SecretFlow runtime with the participants and address.
sf.init(['alice', 'bob'], address='local')

# Define the participants.
alice, bob = sf.PYU('alice'), sf.PYU('bob')
spu = sf.SPU(
    sf.utils.testing.cluster_def(
        ['alice', 'bob'],
        {"protocol": "REF2K", "field": "FM128", "fxp_fraction_bits": 40},
    ),
)
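For anything beyond a demo, REF2K should be swapped for a real MPC protocol. As a hedged sketch, SPU's two-party SEMI2K protocol uses the same cluster_def shape:
spu = sf.SPU(
    sf.utils.testing.cluster_def(
        ['alice', 'bob'],
        {"protocol": "SEMI2K", "field": "FM128", "fxp_fraction_bits": 40},
    ),
)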
Vertically partition the data (features) between the two parties
from secretflow.data import FedNdarray, PartitionWay

# Data preparation.
x, y = X_train, df_train["PurePremium"]
w = df_train["Exposure"]
def x_to_vdata(x):
    # densify the sparse matrix, then split the columns between the two parties
    x = x.todense()
    v_data = FedNdarray(
        partitions={
            alice: alice(lambda: x[:, :15])(),
            bob: bob(lambda: x[:, 15:])(),
        },
        partition_way=PartitionWay.VERTICAL,
    )
    return v_data

v_data = x_to_vdata(x)

# Wrap the label y as a federated ndarray held by alice.
label_data = FedNdarray(
    partitions={alice: alice(lambda: y.values)()},
    partition_way=PartitionWay.VERTICAL,
)

# Wrap the sample weight w as a federated ndarray held by alice.
sample_weight = FedNdarray(
    partitions={alice: alice(lambda: w.values)()},
    partition_way=PartitionWay.VERTICAL,
)
Define a wrapper model for evaluation
from secretflow.device.driver import reveal
from secretflow.ml.linear.ss_glm.core import get_dist

dist = 'Tweedie'
ss_glm_power = 1.9


class DirectRevealModel:
    def __init__(self, model) -> None:
        self.model = model

    def predict(self, X):
        vdata = x_to_vdata(X)
        y = self.model.predict(vdata)
        # reveal the secret-shared predictions and flatten them
        return reveal(y).reshape((-1,))

    def score(self, X, y, sample_weight=None):
        y = y.values
        y_pred = self.predict(X)
        constant = np.mean(y)
        if sample_weight is not None:
            constant *= sample_weight.shape[0] / np.sum(sample_weight)
        # Missing factor of 2 in deviance cancels out.
        deviance = get_dist(dist, 1, ss_glm_power).deviance(y_pred, y, None)
        deviance_null = get_dist(dist, 1, ss_glm_power).deviance(
            np.average(y, weights=sample_weight) + np.zeros(y.shape), y, None
        )
        # D²-style score: one minus the ratio of model deviance to null deviance
        return 1 - (deviance + constant) / (deviance_null + constant)
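The score above follows the deviance-explained statistic that sklearn reports for GLMs (the constant term is an implementation detail of this wrapper); the underlying formula is

D^2 = 1 - \frac{D(y, \hat{y})}{D(y, \bar{y})}

where D is the (Tweedie) deviance and \bar{y} is the weighted mean of the observations.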
Train and evaluate the model
import time
from secretflow.ml.linear.ss_glm import SSGLM

model = SSGLM(spu)
ss_glm_power = 1.9  # Tweedie power, as above
start = time.time()
model.fit_irls(
    v_data,         # features (vertically partitioned)
    label_data,     # label
    None,           # offset (none here)
    sample_weight,  # sample weight (the exposure)
    2,              # number of IRLS iterations
    'Log',          # link function
    'Tweedie',      # label distribution
    ss_glm_power,   # Tweedie power p
    l2_lambda=0.1,
    infeed_batch_size_limit=8000000,
    fraction_of_validation_set=0.2,
    stopping_rounds=2,
    stopping_metric='deviance',
    stopping_tolerance=0.001,
)
wrapped_model = DirectRevealModel(model)
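To compare against the plaintext model, the wrapped model can be scored with the score_estimator helper defined earlier (a sketch; note that it reveals the predictions in order to compute plaintext metrics):
scores_ss_glm = score_estimator(
    wrapped_model,
    X_train,
    X_test,
    df_train,
    df_test,
    target="PurePremium",
    weights="Exposure",
    tweedie_powers=tweedie_powers,
)
print("Evaluation of the SS-GLM model on target PurePremium")
print(scores_ss_glm)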
- Comparison of plaintext and secret-shared training/prediction results (see the sketch below)
The gap between the plaintext and the secret-shared model is acceptable.
A Lorenz curve is used to check how closely the distribution of the model's predictions matches the distribution of the observed data; the agreement is reasonable.
The Lorenz curve mainly visualizes the degree of inequality or concentration of a distribution.
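The lorenz_curve helper itself is not included in this note; a minimal sketch in the spirit of the sklearn Tweedie example (order policyholders by predicted risk, then accumulate observed claims) could look like this:
def lorenz_curve(y_true, y_pred, exposure):
    # cumulative share of claim amount vs. share of samples, ordered by prediction
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    exposure = np.asarray(exposure)
    ranking = np.argsort(y_pred)  # from lowest to highest predicted risk
    ranked_exposure = exposure[ranking]
    ranked_pure_premium = y_true[ranking]
    cumulated_claims = np.cumsum(ranked_pure_premium * ranked_exposure)
    cumulated_claims /= cumulated_claims[-1]
    cumulated_samples = np.linspace(0, 1, len(cumulated_claims))
    return cumulated_samples, cumulated_claims

# Compare plaintext and secret-shared predictions on the test set.
for label, y_pred in [
    ("sklearn", glm_pure_premium.predict(X_test)),
    ("SS-GLM", wrapped_model.predict(X_test)),
]:
    cum_samples, cum_claims = lorenz_curve(
        df_test["PurePremium"], y_pred, df_test["Exposure"]
    )
    plt.plot(cum_samples, cum_claims, label=label)
plt.plot([0, 1], [0, 1], "k--", label="random baseline")
plt.legend()
plt.show()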