Reference guide: https://www.secretflow.org.cn/zh-CN/docs/secretflow/v1.6.1b0/user_guide
### 1. Preprocessing and Private Set Intersection
#### 1.1 Preprocessing
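SecretFlow supports two data structures: DataFrame and FedNdarray.
- DataFrame wraps federated tabular data. It is made up of data blocks held by multiple parties and supports horizontal, vertical, and mixed partitioning:
  - Horizontal: the features are the same, but each party holds its own samples. API: HDataFrame.
  - Vertical: the parties' samples are aligned, but each party holds different features. API: VDataFrame.
  - Mixed: a partitioning that is both horizontal and vertical. API: MixDataFrame.
- FedNdarray wraps a federated ndarray. It is also made up of data blocks held by multiple parties, with each party's data represented as a numpy ndarray. It supports horizontal and vertical partitioning through a single API: FedNdarray.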
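To make the difference between horizontal and vertical partitioning concrete, here is a minimal sketch in plain Python. It does not use the SecretFlow API; the toy table, the party names, and the helper functions `horizontal_split` and `vertical_split` are all made up for illustration.

```python
# Toy table: each row is one sample, identified by "uid".
rows = [
    {"uid": 1, "age": 30, "income": 5000, "label": 0},
    {"uid": 2, "age": 45, "income": 8000, "label": 1},
    {"uid": 3, "age": 27, "income": 4200, "label": 0},
    {"uid": 4, "age": 52, "income": 9100, "label": 1},
]


def horizontal_split(rows, n_first):
    """Hypothetical helper: same columns for every party, different samples."""
    return {"alice": rows[:n_first], "bob": rows[n_first:]}


def vertical_split(rows, alice_cols, bob_cols):
    """Hypothetical helper: same (aligned) samples, different feature columns.

    The "uid" key is kept on both sides so the rows stay aligned.
    """
    return {
        "alice": [{c: r[c] for c in alice_cols} for r in rows],
        "bob": [{c: r[c] for c in bob_cols} for r in rows],
    }


h_parts = horizontal_split(rows, 2)
v_parts = vertical_split(rows, ["uid", "age"], ["uid", "income", "label"])

print("horizontal, alice:", h_parts["alice"])
print("vertical, bob:", v_parts["bob"])
```

In the horizontal case each party could train on its own rows with the same schema; in the vertical case the parties must first agree on which rows they share, which is what PSI is used for.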
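DataFrame and FedNdarray each provide read/write APIs that can be used directly.
SecretFlow also provides a range of preprocessing tools for this data: you can process it directly through the DataFrame API, or use the preprocessing components in the sf.preprocessing package.
Below is a WOE encoding example. The DataFrame example had not finished running at the time of writing and will be added once it completes.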
WOE encoding
WOE binning bins numeric variables against a binary target variable.
Code reference: https://www.secretflow.org.cn/zh-CN/docs/secretflow/v1.6.1b0/user_guide/preprocessing/WeightOfEvidenceEncoding
The result matches the official one.
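As a rough illustration of what a WOE value is, the sketch below computes one WOE per bin in plain Python. It is not the SecretFlow implementation. The sign convention (log of positive share over negative share), the fixed bin edges, and the 0.5 substitution for empty cells are assumptions; other tools may define WOE with the opposite sign or a different smoothing rule.

```python
import math


def woe_per_bin(values, labels, bin_edges):
    """Return one WOE value per bin.

    Bin i covers [bin_edges[i], bin_edges[i + 1]); the last bin also
    includes its right edge. Labels must be 0 or 1.
    """
    n_bins = len(bin_edges) - 1
    pos = [0] * n_bins
    neg = [0] * n_bins
    for v, y in zip(values, labels):
        for i in range(n_bins):
            is_last = i == n_bins - 1
            in_bin = bin_edges[i] <= v < bin_edges[i + 1]
            if in_bin or (is_last and v == bin_edges[-1]):
                if y == 1:
                    pos[i] += 1
                else:
                    neg[i] += 1
                break
    total_pos, total_neg = sum(pos), sum(neg)
    woe = []
    for p, n in zip(pos, neg):
        # Assumption: replace an empty cell by 0.5 to avoid log(0).
        pos_share = (p or 0.5) / total_pos
        neg_share = (n or 0.5) / total_neg
        woe.append(math.log(pos_share / neg_share))
    return woe


values = [1, 2, 3, 4, 5, 6, 7, 8]
labels = [0, 0, 0, 1, 0, 1, 1, 1]
woe = woe_per_bin(values, labels, bin_edges=[1, 5, 8])
print(woe)
```

With this toy data the low bin holds one positive and three negatives and the high bin the reverse, so the two WOE values are equal in magnitude with opposite signs.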
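#### 1.2 Private Set Intersection (PSI)
Private Set Intersection (PSI) is an algorithm that uses cryptographic methods to obtain the intersection of two datasets. The PSI process does not reveal any information outside the intersection.
In vertical partitioning scenarios, PSI is commonly used as the first step to align the data, after which data analysis or machine learning modeling can proceed.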
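The idea behind Diffie-Hellman-style PSI can be sketched in a few lines of plain Python. This is a toy, not the protocol SecretFlow runs: it swaps the elliptic curve of ECDH-PSI for modular exponentiation over a small prime, simulates both parties in one process, and is not secure. All names here (`toy_dh_psi`, `hash_to_group`, and so on) are made up for illustration.

```python
import hashlib
import math
import secrets

P = 2**127 - 1  # a Mersenne prime; far too small for real use


def hash_to_group(item):
    """Map an id string to an integer in [1, P - 1]."""
    digest = hashlib.sha256(item.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % (P - 1) + 1


def random_key():
    """Pick a secret exponent coprime to P - 1 so blinding is injective."""
    while True:
        key = secrets.randbelow(P - 3) + 2
        if math.gcd(key, P - 1) == 1:
            return key


def blind(values, key):
    """Raise every value to the secret exponent modulo P."""
    return [pow(v, key, P) for v in values]


def toy_dh_psi(alice_ids, bob_ids):
    a_key, b_key = random_key(), random_key()
    # Round 1: each party blinds its own hashed ids and sends them over.
    alice_once = blind([hash_to_group(i) for i in alice_ids], a_key)
    bob_once = blind([hash_to_group(i) for i in bob_ids], b_key)
    # Round 2: each party blinds what it received with its own key.
    alice_twice = blind(alice_once, b_key)    # done by bob, order preserved
    bob_twice = set(blind(bob_once, a_key))   # done by alice
    # H(x)^(a*b) is the same whichever key is applied first, so alice can
    # match doubly blinded values without seeing bob's raw ids.
    return sorted(i for i, v in zip(alice_ids, alice_twice) if v in bob_twice)


result = toy_dh_psi(["u1", "u2", "u3", "u4"], ["u3", "u4", "u5"])
print(result)
```

Only the party acting as receiver (alice here) learns the intersection; ids outside it stay hidden behind the other party's secret exponent.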
In SecretFlow, PSI can be used in two ways:
- Use interfaces such as spu.psi_csv:
import tempfile
# alice_path / bob_path are the input CSV paths of each party
_, alice_psi_path = tempfile.mkstemp()
_, bob_psi_path = tempfile.mkstemp()
spu.psi_csv(
    key="uid",
    input_path={alice: alice_path, bob: bob_path},
    output_path={alice: alice_psi_path, bob: bob_psi_path},
    receiver="alice",
    protocol="ECDH_PSI_2PC",
    sort=True,
)
- Use the data.vertical.read_csv interface:
from secretflow.data.vertical import read_csv as v_read_csv
vdf = v_read_csv(
    {alice: alice_path, bob: bob_path},
    spu=spu,
    keys="uid",
    drop_keys="uid",
    psi_protocl="ECDH_PSI_2PC",
)
vdf.columns
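SecretFlow also supports several PSI algorithms, which can be chosen according to the scenario: number of parties, bandwidth, compute power, degree of data imbalance, and so on.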
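### 2. Decision tree and linear regression models
#### 2.1 Decision trees
SecretFlow currently supports several decision tree algorithms (XGB), covering both regression and binary classification training. The algorithms differ in security, so choose one according to your actual scenario and security requirements. They fall roughly into three categories: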
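Algorithm | SS-XGB | SecureBoost | Horizontal XGBoost |
---|---|---|---|
API | secretflow.ml.boost.ss_xgb_v.Xgb | secretflow.ml.boost.sgb_v.Sgb | secretflow.ml.boost.homo_boost.SFXgboost |
Scenario | Vertical partitioning | Vertical partitioning | Horizontal partitioning |
Security | Provably secure; security depends on the secret sharing protocol used | Not provably secure; known attacks may lead to data leakage | Not provably secure |
Performance | Higher communication cost | Heavier computation, but lower communication volume | - |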
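Since the security of SS-XGB rests on secret sharing, a minimal sketch of additive secret sharing may help. It shows the general technique only and makes no claim about the concrete protocols the SPU runs; the modulus and function names are assumptions chosen for the example.

```python
import secrets

MODULUS = 2**64  # shares live in the ring of integers modulo 2^64


def share(value, n_parties=2):
    """Split value into n random-looking shares that sum to it mod MODULUS."""
    shares = [secrets.randbelow(MODULUS) for _ in range(n_parties - 1)]
    shares.append((value - sum(shares)) % MODULUS)
    return shares


def reconstruct(shares):
    """Recover the secret; this needs every share."""
    return sum(shares) % MODULUS


def add_shared(x_shares, y_shares):
    """Each party adds its two local shares; no communication is needed."""
    return [(x + y) % MODULUS for x, y in zip(x_shares, y_shares)]


x_shares = share(1234)
y_shares = share(4321)
z_shares = add_shared(x_shares, y_shares)
print(reconstruct(z_shares))
```

Addition is free of communication, while operations such as multiplication and comparison require interaction between the parties, which is consistent with the higher communication cost listed for SS-XGB.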
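Below we run a simple decision tree model, using SecretFlow for vertical federated learning. In this vertical mode, the partitioning is set up as follows:
- All data parties hold the same samples
- But each party holds different features of those samples
- Only one party holds the labels
Here alice and bob hold the first 10 features and the next 10 features respectively (columns 0-9 and 10-19 of the 30-feature breast cancer dataset). In general, the more features are assigned, the better the training result; I only assigned 10 to each party. The labels are assigned to alice only. The dataset is split 8:2 into training and test sets; the rest is similar to ordinary machine learning training and evaluation, with the hyperparameters set as needed. Finally, the evaluation metrics are computed and printed.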
import sys
import time
import logging
import secretflow as sf
from secretflow.ml.boost.ss_xgb_v import Xgb
from secretflow.device.driver import wait, reveal
from secretflow.data import FedNdarray, PartitionWay
from secretflow.data.split import train_test_split
import numpy as np
from sklearn.metrics import roc_auc_score
from sklearn.metrics import accuracy_score, classification_report
# init log
logging.basicConfig(stream=sys.stdout, level=logging.INFO)
# init all nodes in local Standalone Mode.
sf.init(['alice', 'bob'], address='local')
# init PYU, the Python Processing Unit, process plaintext in each node.
alice = sf.PYU('alice')
bob = sf.PYU('bob')
# init SPU, the Secure Processing Unit,
# process ciphertext under the protection of a multi-party secure computing protocol
spu = sf.SPU(sf.utils.testing.cluster_def(['alice', 'bob']))
# read data in each party
def read_x(start, end):
    from sklearn.datasets import load_breast_cancer
    x = load_breast_cancer()['data']
    return x[:, start:end]

def read_y():
    from sklearn.datasets import load_breast_cancer
    return load_breast_cancer()['target']

# alice and bob each hold 10 of the 30 features (columns 0-9 and 10-19)
v_data = FedNdarray(
    partitions={
        alice: alice(read_x)(0, 10),
        bob: bob(read_x)(10, 20),
    },
    partition_way=PartitionWay.VERTICAL,
)
# Y label belongs to alice
label_data = FedNdarray(
    partitions={alice: alice(read_y)()},
    partition_way=PartitionWay.VERTICAL,
)
# wait IO finished
wait([p.data for p in v_data.partitions.values()])
wait([p.data for p in label_data.partitions.values()])
# split train data and test data
random_state = 1234
split_factor = 0.8
v_train_data, v_test_data = train_test_split(v_data, train_size=split_factor, random_state=random_state)
v_train_label, v_test_label= train_test_split(label_data, train_size=split_factor, random_state=random_state)
# run SS-XGB
xgb = Xgb(spu)
start = time.time()
params = {
    # for more detail, see Xgb API doc
    'num_boost_round': 5,
    'max_depth': 5,
    'learning_rate': 0.1,
    'sketch_eps': 0.08,
    'objective': 'logistic',
    'reg_lambda': 0.1,
    'subsample': 1,
    'colsample_by_tree': 1,
    'base_score': 0.5,
}
model = xgb.train(params, v_train_data,v_train_label)
logging.info(f"train time: {time.time() - start}")
# Do predict
start = time.time()
# Now the result is saved in the spu by ciphertext
spu_yhat = model.predict(v_test_data)
# reveal for auc, acc and classification report test.
yhat = reveal(spu_yhat)
logging.info(f"predict time: {time.time() - start}")
y = reveal(v_test_label.partitions[alice])
# get the area under curve(auc) score of classification
logging.info(f"auc: {roc_auc_score(y, yhat)}")
binary_class_results = np.where(yhat>0.5, 1, 0)
# get the accuracy score of classification
logging.info(f"acc: {accuracy_score(y, binary_class_results)}")
# get the report of classification
print("classification report:")
print(classification_report(y, binary_class_results))
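One detail in the script above is easy to miss: features and labels are split by two separate train_test_split calls, and they stay row-aligned only because both calls use the same train_size and random_state. The plain-Python sketch below illustrates that idea with a hypothetical `split_indices` helper; it does not reproduce how SecretFlow shuffles internally.

```python
import random


def split_indices(n_samples, train_size, random_state):
    """Hypothetical helper: seeded shuffle, then cut at the train fraction."""
    indices = list(range(n_samples))
    random.Random(random_state).shuffle(indices)
    cut = int(n_samples * train_size)
    return indices[:cut], indices[cut:]


# Toy data: the label of row i is i % 2, and row i stores i as its first feature.
features = [[i, i * 10] for i in range(10)]
labels = [i % 2 for i in range(10)]

# Two independent calls, as in the script, with the same seed and fraction.
train_idx_x, test_idx_x = split_indices(len(features), 0.8, 1234)
train_idx_y, test_idx_y = split_indices(len(labels), 0.8, 1234)

x_train = [features[i] for i in train_idx_x]
y_train = [labels[i] for i in train_idx_y]

# Every training row still sits next to its own label.
print(all(x[0] % 2 == y for x, y in zip(x_train, y_train)))
```

Using a different random_state for the label split would pair rows with the wrong labels without raising any error, so the two values should always be kept in sync.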
#### 2.2 Linear regression models
SecretFlow supports several linear regression models:
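Algorithm | SS-SGD | HESS-SGD | SS-GLM | Mixed federated LR |
---|---|---|---|---|
API | secretflow.ml.linear.SSRegression | secretflow.ml.linear.HESSLogisticRegression | secretflow.ml.linear.SSGLM | secretflow.ml.linear.FlLogisticRegressionMix |
Scenario | Vertical | Vertical | Vertical | Mixed partitioning (2+n) |
Security | Provably secure; security depends on the secret sharing protocol used | Provably secure; security depends on the secret sharing protocol and the homomorphic encryption algorithm used | Provably secure; security depends on the secret sharing protocol used | Not provably secure; leaks some intermediate information |
Supported models | Linear regression, logistic regression | Logistic regression | Generalized linear regression | Logistic regression |
Performance | Higher communication volume; faster with high bandwidth (10 Gbps / LAN) | Heavier computation; faster when the network is constrained (bandwidth, latency) | - | - |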
Likewise, let's run an SSRegression example.
import sys
import time
import logging
import numpy as np
import secretflow as sf
from secretflow.data.split import train_test_split
from secretflow.device.driver import wait, reveal
from secretflow.data import FedNdarray, PartitionWay
from secretflow.ml.linear.ss_sgd import SSRegression
from sklearn.metrics import roc_auc_score, accuracy_score, classification_report
# init log
logging.basicConfig(stream=sys.stdout, level=logging.INFO)
# init all nodes in local Standalone Mode.
sf.init(['alice', 'bob'], address='local')
# init PYU, the Python Processing Unit, process plaintext in each node.
alice = sf.PYU('alice')
bob = sf.PYU('bob')
# init SPU, the Secure Processing Unit,
# process ciphertext under the protection of a multi-party secure computing protocol
spu = sf.SPU(sf.utils.testing.cluster_def(['alice', 'bob']))
# read data in each party
def read_x(start, end):
    # use breast_cancer as example
    from sklearn.datasets import load_breast_cancer
    from sklearn.preprocessing import StandardScaler
    x = load_breast_cancer()['data']
    # LR's train dataset must be standardized or normalized
    scaler = StandardScaler()
    x = scaler.fit_transform(x)
    return x[:, start:end]

def read_y():
    from sklearn.datasets import load_breast_cancer
    return load_breast_cancer()['target']

# alice and bob each hold half of the 30 features (columns 0-14 and 15-29)
# read_x is executed locally on each node.
v_data = FedNdarray(
    partitions={
        alice: alice(read_x)(0, 15),
        bob: bob(read_x)(15, 30),
    },
    partition_way=PartitionWay.VERTICAL,
)
# Y label belongs to alice
label_data = FedNdarray(
    partitions={alice: alice(read_y)()},
    partition_way=PartitionWay.VERTICAL,
)
# wait IO finished
wait([p.data for p in v_data.partitions.values()])
wait([p.data for p in label_data.partitions.values()])
# split train data and test data
random_state = 1234
split_factor = 0.8
v_train_data, v_test_data = train_test_split(v_data, train_size=split_factor, random_state=random_state)
v_train_label, v_test_label = train_test_split(label_data, train_size=split_factor, random_state=random_state)
# run SS-SGD
# SSRegression use spu to fit model.
model = SSRegression(spu)
start = time.time()
model.fit(
    v_train_data,   # x
    v_train_label,  # y
    5,              # epochs
    0.3,            # learning_rate
    32,             # batch_size
    't1',           # sig_type
    'logistic',     # reg_type
    'l2',           # penalty
    0.1,            # l2_norm
)
logging.info(f"train time: {time.time() - start}")
# Do predict
start = time.time()
# Now the result is saved in the spu by ciphertext
spu_yhat = model.predict(v_test_data)
# reveal for auc, acc and classification report test.
yhat = reveal(spu_yhat)
logging.info(f"predict time: {time.time() - start}")
y = reveal(v_test_label.partitions[alice])
# get the area under curve(auc) score of classification
logging.info(f"auc: {roc_auc_score(y, yhat)}")
binary_class_results = np.where(yhat > 0.5, 1, 0)
# get the accuracy score of classification
logging.info(f"acc: {accuracy_score(y, binary_class_results)}")
# get the report of classification
print("classification report:")
print(classification_report(y, binary_class_results))
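The sig_type argument in the fit call above selects how the sigmoid is approximated under MPC, where an exact exponential is expensive. The sketch below assumes that 't1' denotes a first-order Taylor expansion around zero, clipped to [0, 1]; that reading should be checked against the SSRegression API documentation, and the function names here are made up.

```python
import math


def exact_sigmoid(x):
    """Reference sigmoid in plaintext floating point."""
    return 1.0 / (1.0 + math.exp(-x))


def t1_sigmoid(x):
    """Assumed form of 't1': 0.5 + x / 4, clipped to the range [0, 1]."""
    return min(1.0, max(0.0, 0.5 + 0.25 * x))


for x in (-4.0, -1.0, 0.0, 0.4, 1.0, 4.0):
    print(f"x={x:+.1f}  exact={exact_sigmoid(x):.4f}  t1={t1_sigmoid(x):.4f}")
```

An approximation like this uses only additions, multiplications, and comparisons. It is close to the exact sigmoid near zero, which is one reason the features are standardized before training.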
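### 3. Neural network algorithms
Neural networks here mainly mean federated learning, which is divided into horizontal federated learning and vertical split learning, each with its own API (FLModel and SLModel). Note that both kinds of federated learning are not provably secure; their security has to be analyzed case by case for the actual scenario.
Here we run a horizontal federated image classification example. In essence, each participant repeatedly downloads the latest model from the server, trains it locally on its own data, and sends the encrypted model parameters to the server. The server securely aggregates the encrypted parameters, updates its model, and returns the new model to the participants, and the cycle of training and feedback repeats.
Following the official tutorial, the MNIST dataset is used: https://www.secretflow.org.cn/zh-CN/docs/secretflow/v1.3.0b0/tutorial/Federate_Learning_for_Image_Classification
In this run, training with multiple parties gave an accuracy nearly 2 percentage points higher than training with a single party.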
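The aggregation step of one federated round can be sketched in plain Python. This is a minimal illustration, not SecretFlow's FLModel: it assumes sample-weighted averaging (FedAvg) and uses pairwise random masks as one common way to hide individual uploads from the server. The function names and the mask range are assumptions.

```python
import random


def fed_avg(client_weights, client_sizes):
    """Sample-weighted average of the clients' parameter vectors."""
    total = sum(client_sizes)
    dim = len(client_weights[0])
    return [
        sum(w[j] * n for w, n in zip(client_weights, client_sizes)) / total
        for j in range(dim)
    ]


def pairwise_masks(n_clients, dim, seed=0):
    """Each pair (i, k) shares a random vector: i adds it, k subtracts it."""
    rng = random.Random(seed)
    masks = [[0.0] * dim for _ in range(n_clients)]
    for i in range(n_clients):
        for k in range(i + 1, n_clients):
            for j in range(dim):
                r = rng.uniform(-100.0, 100.0)
                masks[i][j] += r
                masks[k][j] -= r
    return masks


weights = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]  # local model parameters
sizes = [1, 3, 4]                               # local sample counts
masks = pairwise_masks(len(weights), dim=2)

# Each client uploads size * weights + mask; the server sees only masked data.
uploads = [
    [w[j] * n + m[j] for j in range(2)]
    for w, n, m in zip(weights, sizes, masks)
]
# The masks cancel in the sum, so the server still recovers the average.
secure_avg = [sum(u[j] for u in uploads) / sum(sizes) for j in range(2)]

print(fed_avg(weights, sizes), secure_avg)
```

The server ends up with the same average as plain FedAvg, up to floating-point rounding, without seeing any single client's unmasked parameters.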