[隐私计算学习笔记]7——XGB算法与SGB算法

最新推荐文章于 2024-06-21 11:53:38 发布

Garyff

最新推荐文章于 2024-06-21 11:53:38 发布

阅读量482

点赞数 11

分类专栏：隐私计算文章标签：学习笔记

本文链接：https://blog.csdn.net/weixin_42551843/article/details/139801348

版权

隐私计算专栏收录该内容

11 篇文章 0 订阅

订阅专栏

1. 决策树模型

1.1 数据设定

纵向分割数据集：所有数据方的样本一致、样本特征不同、只有一方持有标签。如下示意：

1.2 XGBoost训练算法

单棵树分裂过程：

预计算：根据损失函数定义、样本标签、当前预测值，计算每个样本可以求得其一阶导 $g_{i}$ 和二阶导 $h_{i}$ ；
结点分裂：通过枚举所有分裂方案，选出带来最优增益值的方式执行分裂。分裂方案包含分裂特征和分裂阈值，可以将当前结点样本集合 $I$ 分裂为左子树样本集合 $I_{L}$ 和右子树样本集合 $I_{R}$ ，并由如下公式计算出此分裂方案的增益值：

其中： $\lambda$ 和 $\gamma$ 分别为叶节点数和叶节点权重的惩罚因子。

权重计算：由落入该结点的样本计算得到，公式如下：

1.3 SS-XGB训练算法

使用秘密分享计算分裂增益值和叶权重，使用秘密分享协议提供的加/乘等操作来实现安全的多方联合计算。特别需要关注的问题是：如何在计算分桶加和时，不泄漏任何样本分布相关的信息。通过引入一个密态下的向量𝑆就可以解决这个问题。

向量𝑆中标记为1的样本是被选中的样本需要加和，0相反。为了保证样本分布不泄漏，这个向量也是通过秘密分享协议保护的。在秘密分享协议的保护下，计算向量𝑆和梯度向量的内积，即可得到梯度在分桶内的累加和。

2. SS-XGB算法应用

import sys
import time
import logging

import secretflow as sf
from secretflow.ml.boost.ss_xgb_v import Xgb
from secretflow.device.driver import wait, reveal
from secretflow.data import FedNdarray, PartitionWay
from secretflow.data.split import train_test_split
import numpy as np
from sklearn.metrics import roc_auc_score
from sklearn.metrics import accuracy_score, classification_report


# init log
logging.basicConfig(stream=sys.stdout, level=logging.INFO)

# init all nodes in local Standalone Mode.
sf.init(['alice', 'bob', 'carol'], address='local')

# init PYU, the Python Processing Unit, process plaintext in each node.
alice = sf.PYU('alice')
bob = sf.PYU('bob')
carol = sf.PYU('carol')

# init SPU, the Secure Processing Unit,
#           process ciphertext under the protection of a multi-party secure computing protocol
spu = sf.SPU(sf.utils.testing.cluster_def(['alice', 'bob', 'carol']))

# read data in each party
def read_x(start, end):
    from sklearn.datasets import load_breast_cancer
    x = load_breast_cancer()['data']
    return x[:, start:end]

def read_y():
    from sklearn.datasets import load_breast_cancer
    return load_breast_cancer()['target']

# alice / bob / carol each hold one third of the features of the data
v_data = FedNdarray(
    partitions={
        alice: alice(read_x)(0, 10),
        bob: bob(read_x)(10, 20),
        carol: carol(read_x)(20, 30),
    },
    partition_way=PartitionWay.VERTICAL,
)
# Y label belongs to alice
label_data = FedNdarray(
    partitions={alice: alice(read_y)()},
    partition_way=PartitionWay.VERTICAL,
)
# wait IO finished
wait([p.data for p in v_data.partitions.values()])
wait([p.data for p in label_data.partitions.values()])
# split train data and test date
random_state = 1234
split_factor = 0.8
v_train_data, v_test_data = train_test_split(v_data, train_size=split_factor, random_state=random_state)
v_train_label, v_test_label= train_test_split(label_data, train_size=split_factor, random_state=random_state)
# run SS-XGB
xgb = Xgb(spu)
start = time.time()
params = {
    # for more detail, see Xgb API doc
    'num_boost_round': 5,
    'max_depth': 5,
    'learning_rate': 0.1,
    'sketch_eps': 0.08,
    'objective': 'logistic',
    'reg_lambda': 0.1,
    'subsample': 1,
    'colsample_by_tree': 1,
    'base_score': 0.5,
}
model = xgb.train(params, v_train_data,v_train_label)
logging.info(f"train time: {time.time() - start}")

# Do predict
start = time.time()
# Now the result is saved in the spu by ciphertext
spu_yhat = model.predict(v_test_data)
# reveal for auc, acc and classification report test.
yhat = reveal(spu_yhat)
logging.info(f"predict time: {time.time() - start}")
y = reveal(v_test_label.partitions[alice])
# get the area under curve(auc) score of classification
logging.info(f"auc: {roc_auc_score(y, yhat)}")
binary_class_results = np.where(yhat>0.5, 1, 0)
# get the accuracy score of classification
logging.info(f"acc: {accuracy_score(y, binary_class_results)}")
# get the report of classification
print("classification report:")
print(classification_report(y, binary_class_results))

3. SGB算法应用

import spu
from sklearn.metrics import roc_auc_score

import secretflow as sf
from secretflow.data import FedNdarray, PartitionWay
from secretflow.device.driver import reveal, wait
from secretflow.ml.boost.sgb_v import (
    Sgb,
    get_classic_XGB_params,
    get_classic_lightGBM_params,
)
from secretflow.ml.boost.sgb_v.model import load_model
import pprint

pp = pprint.PrettyPrinter(depth=4)

# Check the version of your SecretFlow
print('The version of SecretFlow: {}'.format(sf.__version__))

alice_ip = '127.0.0.1'
bob_ip = '127.0.0.1'
ip_party_map = {bob_ip: 'bob', alice_ip: 'alice'}

_system_config = {'lineage_pinning_enabled': False}
sf.shutdown()
# init cluster
sf.init(
    ['alice', 'bob'],
    address='local',
    _system_config=_system_config,
    object_store_memory=5 * 1024 * 1024 * 1024,
)

# SPU settings
cluster_def = {
    'nodes': [
        {'party': 'alice', 'id': 'local:0', 'address': alice_ip + ':12945'},
        {'party': 'bob', 'id': 'local:1', 'address': bob_ip + ':12946'},
        # {'party': 'carol', 'id': 'local:2', 'address': '127.0.0.1:12347'},
    ],
    'runtime_config': {
        # SEMI2K support 2/3 PC, ABY3 only support 3PC, CHEETAH only support 2PC.
        # pls pay attention to size of nodes above. nodes size need match to PC setting.
        'protocol': spu.spu_pb2.SEMI2K,
        'field': spu.spu_pb2.FM128,
    },
}

# HEU settings
heu_config = {
    'sk_keeper': {'party': 'alice'},
    'evaluators': [{'party': 'bob'}],
    'mode': 'PHEU',
    'he_parameters': {
        # ou is a fast encryption schema that is as secure as paillier.
        'schema': 'ou',
        'key_pair': {
            'generate': {
                # bit size should be 2048 to provide sufficient security.
                'bit_size': 2048,
            },
        },
    },
    'encoding': {
        'cleartext_type': 'DT_I32',
        'encoder': "IntegerEncoder",
        'encoder_args': {"scale": 1},
    },
}

# Data Settings
from sklearn.datasets import load_breast_cancer

ds = load_breast_cancer()
x, y = ds['data'], ds['target']

v_data = FedNdarray(
    {
        alice: (alice(lambda: x[:, :15])()),
        bob: (bob(lambda: x[:, 15:])()),
    },
    partition_way=PartitionWay.VERTICAL,
)
label_data = FedNdarray(
    {alice: (alice(lambda: y)())},
    partition_way=PartitionWay.VERTICAL,
)

#Parameters settings
params = get_classic_XGB_params()
params['num_boost_round'] = 3
params['max_depth'] = 3
pp.pprint(params)

#Use SGB
sgb = Sgb(heu)
model = sgb.train(params, v_data, label_data)

# Evaluation models
yhat = model.predict(v_data)
yhat = reveal(yhat)
print(f"auc: {roc_auc_score(y, yhat)}")

# each participant party needs a location to store
saving_path_dict = {
    # in production we may use remote oss, for example.
    device: "./" + device.party
    for device in v_data.partitions.keys()
}
r = model.save_model(saving_path_dict) # Save model
wait(r)

#Loading models
# alice is our label holder
model_loaded = load_model(saving_path_dict, alice)
fed_yhat_loaded = model_loaded.predict(v_data, alice)
yhat_loaded = reveal(fed_yhat_loaded.partitions[alice])

assert (
    yhat == yhat_loaded
).all(), "loaded model predictions should match original, yhat {} vs yhat_loaded {}".format(
    yhat, yhat_loaded
)

4. 算法对比

	SS-XGB	SGB
准备设备和数据	定义SPU设备多方数据放入FedNdarray	定义HEU设备多方数据放入FedNdarray
设置训练参数	参数为XGB的子集支持XGB经典功能	参数较多吸收XGB\lightGBM等更多功能
模型产出：评估和保存	预测结果为MPC密文全密态评估	预测结果为标签持有方明文标签持有方本地单方面评估

5. 小结

本篇讲解了决策树模型，例举了SS-XGB和SGB的应用实例，以及对这两种算法在设备准备和数据、训练参数设置以及模型评估和保存角度进行了对比。

（PS：感谢您看到这里！您的支持是我更新的最大动力）

Garyff

关注

11
点赞
踩
8

收藏

觉得还不错? 一键收藏
0
评论
[隐私计算学习笔记]7——XGB算法与SGB算法

本篇讲解了决策树模型，例举了SS-XGB和SGB的应用实例，以及对这两种算法在设备准备和数据、训练参数设置以及模型评估和保存角度进行了对比。
复制链接

扫一扫

专栏目录

	SS-XGB	SGB
准备设备和数据	定义SPU设备多方数据放入FedNdarray	定义HEU设备多方数据放入FedNdarray
设置训练参数	参数为XGB的子集支持XGB经典功能	参数较多吸收XGB\lightGBM等更多功能
模型产出：评估和保存	预测结果为MPC密文全密态评估	预测结果为标签持有方明文标签持有方本地单方面评估

[隐私计算学习笔记]7——XGB算法与SGB算法

1. 决策树模型

1.1 数据设定

1.2 XGBoost训练算法

1.3 SS-XGB训练算法

2. SS-XGB算法应用

3. SGB算法应用

4. 算法对比

5. 小结

“相关推荐”对你有帮助么？