[SecretFlow in Practice] Component Overview and Custom Development

Rabbit_QL

Article tags: algorithm, architecture

Article link: https://blog.csdn.net/Rabbit_QL/article/details/140035942

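Day 11: Component Overview and Custom Development

Instructor: Feng Jun (冯骏), privacy computing expert at Ant Group

Course video: https://www.bilibili.com/video/BV1Hw4m1e7CA

Introduction to the SecretFlow component spec and how to use SecretFlow components (Kuscia + SecretPad)

Secondary development on top of Kuscia/SecretPad
Integrating SecretFlow into any scheduling system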

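I. SecretFlow Open Specification

SecretFlow Open Specification

A collection of protocols proposed by SecretFlow for privacy-preserving computing applications.
It currently covers data, components, node evaluation, and run reports.
Every module in the SecretFlow ecosystem follows this specification.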
Github Repo: https://github.com/secretflow/spec
Doc: https://www.secretflow.org.cn/docs/spec/latest/zh-Hans

1. Data

https://github.com/secretflow/spec/blob/main/secretflow/spec/v1/data.proto

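Public Data
- All data is treated as distributed data
- Name: the name of the data
- Type
- Meta: public information known to all parties
- system_info: Python version, PyTorch version, etc.

DataRef

Each DataRef points to one piece of private data
- Ownership: which party the DataRef belongs to
- uri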

message DistData {
  // The name of this distributed data.
  string name = 1;

  // Type.
  string type = 2;

  // Describe the system information that used to generate this distributed
  // data.
  SystemInfo system_info = 3;

  // Public information, known to all parties.
  // i.e. VerticalTable.
  google.protobuf.Any meta = 4;

  // A reference to a data that is stored in the remote path.
  message DataRef {
    // The path information relative to StorageConfig of the party.
    string uri = 1;

    // The owner party.
    string party = 2;

    // The storage format, i.e. csv.
    string format = 3;
  }

  // Remote data references.
  repeated DataRef data_refs = 5;
}

Note: DistData itself is fully public. Only the data that each DataRef points to is private.
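To make the public/private split concrete, here is a minimal sketch that mirrors the message with plain dataclasses. These classes are illustrative stand-ins, not the generated protobuf classes from secretflow.spec; the field names follow the proto above, while the table name and URIs are made up.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class DataRef:
    """Stand-in for DistData.DataRef: where one party keeps its private part."""
    uri: str      # path relative to that party's StorageConfig
    party: str    # owner of the referenced data
    format: str   # storage format, e.g. "csv"


@dataclass
class DistData:
    """Stand-in for DistData: a public descriptor that every party may see."""
    name: str
    type: str
    data_refs: List[DataRef] = field(default_factory=list)

    def parties(self) -> List[str]:
        # Ownership is public; the bytes behind each uri stay with the owner.
        return [ref.party for ref in self.data_refs]


# Hypothetical vertically partitioned table held by two parties.
table = DistData(
    name="train_table",
    type="sf.table.vertical_table",
    data_refs=[
        DataRef(uri="train.csv", party="alice", format="csv"),
        DataRef(uri="train.csv", party="bob", format="csv"),
    ],
)
print(table.parties())
```

Everything in `table` could be shared with all parties without leaking a single row, which is what "fully public" means here.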

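Examples
- In MPC, each piece of data consists of several shares; underneath, each share can be viewed as a DataRef pointing to a remote object.
- In FL, for a federated data table, each partition can be viewed as a DataRef pointing to a remote object.

The complete call flow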

StorageConfig: https://github.com/secretflow/spec/blob/main/secretflow/spec/v1/data.proto#L38-L49
DataRef: https://github.com/secretflow/spec/blob/main/secretflow/spec/v1/data.proto#L77-L86

message StorageConfig {
  // enum[local_fs, s3]
  string type = 1;

  // For local_fs.
  message LocalFSConfig {
    // Working directory.
    string wd = 1;
  }

  // For S3 compatible object storage
  message S3Config {
    // endpoint https://play.min.io or http://127.0.0.1:9000 with scheme
    string endpoint = 1;
    // the bucket name of the oss datasource
    string bucket = 2;
    // the prefix of the oss datasource. e.g.  data/traindata/
    string prefix = 3;
    // access key
    string access_key_id = 4;
    // access secret
    string access_key_secret = 5;
    // virtual_host is the same as AliyunOSS/AWS S3's virtualhost , default true
    bool virtual_host = 6;
    // optional enum[s3v2,s3v4]
    string version = 7;
  }
  // local_fs config.
  LocalFSConfig local_fs = 2;
  // s3 config
  S3Config s3 = 3;
}
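The proto comment says a DataRef uri is relative to the owning party's StorageConfig, so the same uri can resolve to different places for different parties. A rough sketch of that resolution follows. The helper names are hypothetical, and the S3 key layout (prefix followed by uri) is my assumption rather than something the spec text above states.

```python
import posixpath


def resolve_local(wd: str, uri: str) -> str:
    """Join a party's local_fs working directory with a DataRef uri."""
    return posixpath.join(wd, uri)


def resolve_s3(bucket: str, prefix: str, uri: str) -> str:
    """Assumed S3 layout: bucket, then the configured prefix, then the uri."""
    key = posixpath.join(prefix, uri)
    return f"s3://{bucket}/{key}"


# The same uri lands in a different place for each party.
print(resolve_local("/tmp/alice", "input.csv"))
print(resolve_local("/tmp/bob", "input.csv"))
print(resolve_s3("demo-bucket", "data/traindata/", "input.csv"))
```

This is why the PSI example later can use `uri="input.csv"` for both parties: each one runs with its own working directory.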

2. Component

https://github.com/secretflow/spec/blob/main/secretflow/spec/v1/component.proto#L174C1-L192C2

// The definition of a comp.
message ComponentDef {
  // Namespace of the comp.
  string domain = 1;

  // Should be unique among all comps of the same domain.
  string name = 2;

  string desc = 3;

  // Version of the comp.
  string version = 4;

  repeated AttributeDef attrs = 5;

  repeated IoDef inputs = 6;

  repeated IoDef outputs = 7;
}

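ComponentDef

A component is located by its domain, name, and version
- domain: the component's namespace
- name: must be unique within the namespace
- version: the component's version
- attributes: the component's attributes
- inputs: the component's input requirements
- outputs: the component's output requirements
When expanded, the attributes form a tree
- Struct Attribute Group: a module
- Union Attribute Group: a single choice among its children
- Atomic Attribute: a concrete setting
The user chooses a protocol and then configures the computation under the corresponding module.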
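One way to picture the attribute tree is as nested dictionaries whose leaves are atomic attributes. The sketch below flattens such a tree into slash-separated paths, the same shape as the attr_paths used in psi_demo.py further down (for example "input/receiver_input/key"). The attribute names in `config` are hypothetical and are not taken from a real component definition.

```python
from typing import Dict, List, Tuple, Union

Atomic = Union[str, int, float, bool]


def flatten(tree: Dict[str, object], prefix: str = "") -> List[Tuple[str, Atomic]]:
    """Walk nested groups depth-first and return (path, value) pairs for the leaves."""
    pairs: List[Tuple[str, Atomic]] = []
    for name, node in tree.items():
        path = f"{prefix}/{name}" if prefix else name
        if isinstance(node, dict):
            pairs.extend(flatten(node, path))   # group: descend into children
        else:
            pairs.append((path, node))          # atomic attribute: record it
    return pairs


# Hypothetical config. "protocol" plays the role of a union group, so only
# the chosen branch ("ecdh") is kept and the alternatives are left out.
config = {
    "sort": True,
    "protocol": {"ecdh": {"curve_type": "CURVE_FOURQ"}},
    "bucket": {"size": 1048576},
}
for path, value in flatten(config):
    print(path, "=", value)
```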
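Inputs and outputs (IO)

https://www.secretflow.org.cn/zh-CN/docs/spec/v1.0.dev240328/spec#iodef
- Specifies the type of each input and output; more than one type may be allowed.
- For a table, the columns to use can be specified as well.
- Parameters can further be required for each selected column.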

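3. Node evaluation

Node Evaluation

A node is an instance of a component.
StorageConfig: the data path configuration of each party.
NodeEvalParam: defines the component's parameters (Attributes) and IO.

https://github.com/secretflow/spec/blob/main/secretflow/spec/v1/evaluation.proto#L31
NodeEvalResult: the component's run result (outputs).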
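In NodeEvalParam the attribute paths and the attribute values travel as two parallel lists, matched by position. A small sketch of that pairing follows; the helper `pair_attrs` is hypothetical, and the sample values are borrowed from the PSI example later in this post.

```python
from typing import Any, Dict, List


def pair_attrs(attr_paths: List[str], attrs: List[Any]) -> Dict[str, Any]:
    """Pair each attribute path with the value at the same index."""
    if len(attr_paths) != len(attrs):
        raise ValueError("attr_paths and attrs must have the same length")
    return dict(zip(attr_paths, attrs))


params = pair_attrs(
    ["protocol", "sort", "bucket_size"],
    ["PROTOCOL_ECDH", True, 1048576],
)
print(params)
```

Because the match is purely positional, reordering one list without the other silently assigns values to the wrong attributes, so the length check is the least a caller should do.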

4. Report

Used for front-end display (I'm not very familiar with front-end work, so this is just a quick look 👀)

https://github.com/secretflow/spec/blob/main/secretflow/spec/v1/report.proto

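Descriptions: displays several read-only fields as a group.
Table: displays rows of data.
Div: a section of a page, made up of Descriptions, Tables, or Divs.
Tab: one page of a report, made up of Divs.
Report: the top level of a report, made up of Tabs.

A report is also a kind of DistData, usually produced as the output of certain components; a user interface can render it according to these definitions.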
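The nesting is easier to see in code. Below is a minimal sketch of the hierarchy with plain dataclasses and a text renderer that a front end could replace with real widgets. The classes only imitate the shape described above; they are not the messages from report.proto, and the sample values are invented.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Union


@dataclass
class Descriptions:
    items: Dict[str, str]      # read-only fields shown as a group


@dataclass
class Table:
    headers: List[str]
    rows: List[List[str]]


@dataclass
class Div:
    name: str
    children: List[Union["Div", Descriptions, Table]] = field(default_factory=list)


@dataclass
class Tab:
    name: str
    divs: List[Div] = field(default_factory=list)


@dataclass
class Report:
    name: str
    tabs: List[Tab] = field(default_factory=list)


def render(node, depth: int = 0) -> List[str]:
    """Turn the report tree into indented text lines, one level per nesting step."""
    pad = "  " * depth
    if isinstance(node, Report):
        return [f"{pad}Report: {node.name}"] + [ln for t in node.tabs for ln in render(t, depth + 1)]
    if isinstance(node, Tab):
        return [f"{pad}Tab: {node.name}"] + [ln for d in node.divs for ln in render(d, depth + 1)]
    if isinstance(node, Div):
        return [f"{pad}Div: {node.name}"] + [ln for c in node.children for ln in render(c, depth + 1)]
    if isinstance(node, Descriptions):
        return [f"{pad}{k}: {v}" for k, v in node.items.items()]
    if isinstance(node, Table):
        return [pad + " | ".join(node.headers)] + [pad + " | ".join(r) for r in node.rows]
    raise TypeError(f"unsupported node: {type(node).__name__}")


# Invented sample report with one tab, one div, a summary and a table.
report = Report("vif", [Tab("main", [Div("summary", [
    Descriptions({"rows": "640"}),
    Table(["feature", "vif"], [["x1", "1.2"]]),
])])])
print("\n".join(render(report)))
```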

II. SecretFlow Component List

Component list: https://www.secretflow.org.cn/zh-CN/docs/secretflow/v1.7.0b0/component/comp_list

Component details (parameters, inputs, outputs) can be looked up in https://github.com/secretflow/secretflow/blob/main/docker/comp_list.json:

comps {
  domain: "stats"
  name: "ss_vif"
  desc: "Calculate Variance Inflation Factor(VIF) for vertical partitioning dataset\nby using secret sharing.\n- For large dataset(large than 10w samples & 200 features), recommend to use [Ring size: 128, Fxp: 40] options for SPU device."
  version: "0.0.1"
  inputs {
    name: "input_data"
    desc: "Input vertical table."
    types: "sf.table.vertical_table"
    types: "sf.table.individual"
    attrs {
      name: "feature_selects"
      desc: "Specify which features to calculate VIF with. If empty, all features will be used."
    }
  }
  outputs {
    name: "report"
    desc: "Output Variance Inflation Factor(VIF) report."
    types: "sf.report"
  }
}
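Since a component is identified by domain, name, and version, a component list is naturally indexed by that triple. The sketch below does this for a small hand-written JSON excerpt shaped like the entry above. The top-level "comps" key and the exact JSON layout are my assumption about comp_list.json, so check the real file before relying on it.

```python
import json

# Hand-written excerpt modelled on the ss_vif entry; not the real file.
raw = """
{"comps": [
  {"domain": "stats", "name": "ss_vif", "version": "0.0.1",
   "inputs": [{"name": "input_data",
               "types": ["sf.table.vertical_table", "sf.table.individual"]}],
   "outputs": [{"name": "report", "types": ["sf.report"]}]}
]}
"""


def index_components(doc: dict) -> dict:
    """Map (domain, name, version) to the component definition."""
    return {(c["domain"], c["name"], c["version"]): c for c in doc["comps"]}


index = index_components(json.loads(raw))
comp = index[("stats", "ss_vif", "0.0.1")]
print([io["types"] for io in comp["inputs"]])
print([io["types"] for io in comp["outputs"]])
```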

Components can also be browsed in SecretPad.

III. Invoking SecretFlow Components

https://www.secretflow.org.cn/zh-CN/docs/secretflow/v1.7.0b0/component/comp_guide

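There are three ways to invoke a SecretFlow component. The PSI (private set intersection) component is used as the example here.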

1. SecretFlow CLI/Lib

No other dependencies are required.

The CLI requires its arguments to be base64-encoded; it is mainly used for testing inside SecretFlow.
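A likely reason for the base64 step is that structured parameters contain quotes, braces, and newlines that are awkward to pass on a shell command line. The sketch below shows only the encoding round trip. It assumes a JSON payload for illustration; it does not reproduce the CLI's actual flags or input format, which are not covered in these notes.

```python
import base64
import json

# Illustrative payload only; the fields come from the PSI example below.
payload = {"domain": "data_prep", "name": "psi", "version": "0.0.5"}

text = json.dumps(payload, sort_keys=True)
encoded = base64.b64encode(text.encode("utf-8")).decode("ascii")
decoded = json.loads(base64.b64decode(encoded).decode("utf-8"))

print(encoded)   # a single shell-safe token
print(decoded)   # identical to the original payload
```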

Generate mock data: save the following bash script as generate_csv.sh.

#!/bin/bash

set -e
show_help() {
    echo "Usage: bash generate_csv.sh -c {col_name} -p {file_name}"
    echo "  -c"
    echo "          the column name of id."
    echo "  -p"
    echo "          the path of output csv."
}
if [[ "$#" -lt 1 ]]; then
    show_help
    exit
fi

while getopts ":c:p:" OPTION; do
    case $OPTION in
    c)
        COL_NAME=$OPTARG
        ;;
    p)
        FILE_PATH=$OPTARG
        ;;
    *)
        echo "Incorrect options provided"
        exit 1
        ;;
    esac
done


# header
echo $COL_NAME > $FILE_PATH

# generate 800 random int
for ((i=0; i<800; i++))
do
# from 0 to 1000
id=$(shuf -i 0-1000 -n 1)

# check duplicates
while grep -q "^$id$" $FILE_PATH
do
    id=$(shuf -i 0-1000 -n 1)
done

# write
echo "$id" >> $FILE_PATH
done

echo "Generated csv file is $FILE_PATH."

Run the script twice to generate the data for Alice and for Bob (each party gets 800 distinct random integers, which serve as ids).

Alice

mkdir -p /tmp/alice
bash generate_csv.sh -c id1 -p /tmp/alice/input.csv

Bob

mkdir -p /tmp/bob
bash generate_csv.sh -c id2 -p /tmp/bob/input.csv
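If `shuf` is not available (for example on macOS), the same data can be produced in Python. This is a sketch of an equivalent, not part of the original tutorial: it draws 800 distinct integers from 0 to 1000 in one step instead of retrying on duplicates, and it writes into a temporary directory rather than /tmp/alice and /tmp/bob.

```python
import random
import tempfile
from pathlib import Path


def generate_csv(col_name: str, path: Path, count: int = 800, upper: int = 1000) -> None:
    """Write a header line plus `count` distinct random integers from 0..upper."""
    ids = random.sample(range(upper + 1), count)   # sampling without replacement
    with path.open("w") as f:
        f.write(f"{col_name}\n")
        f.writelines(f"{i}\n" for i in ids)


# Temporary directory for the demo; point these at /tmp/alice/input.csv and
# /tmp/bob/input.csv to match the tutorial layout.
out_dir = Path(tempfile.mkdtemp())
generate_csv("id1", out_dir / "alice_input.csv")
generate_csv("id2", out_dir / "bob_input.csv")
print(sorted(p.name for p in out_dir.iterdir()))
```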

Save the following Python code as psi_demo.py.

The version used in the original tutorial differs from mine; I am using v1.6.1b0.

import json

from secretflow.component.entry import comp_eval
from secretflow.spec.extend.cluster_pb2 import (
    SFClusterConfig,
    SFClusterDesc,
)
from secretflow.spec.v1.component_pb2 import Attribute
from secretflow.spec.v1.data_pb2 import (
    DistData,
    TableSchema,
    IndividualTable,
    StorageConfig,
)
from secretflow.spec.v1.evaluation_pb2 import NodeEvalParam
import click


@click.command()
@click.argument("party", type=str)
def run(party: str):
    desc = SFClusterDesc(
        parties=["alice", "bob"],
        devices=[
            SFClusterDesc.DeviceDesc(
                name="spu",
                type="spu",
                parties=["alice", "bob"],
                config=json.dumps(
                    {
                        "runtime_config": {"protocol": "REF2K", "field": "FM64"},
                        "link_desc": {
                            "connect_retry_times": 60,
                            "connect_retry_interval_ms": 1000,
                            "brpc_channel_protocol": "http",
                            "brpc_channel_connection_type": "pooled",
                            "recv_timeout_ms": 1200 * 1000,
                            "http_timeout_ms": 1200 * 1000,
                        },
                    }
                ),
            ),
            SFClusterDesc.DeviceDesc(
                name="heu",
                type="heu",
                parties=[],
                config=json.dumps(
                    {
                        "mode": "PHEU",
                        "schema": "paillier",
                        "key_size": 2048,
                    }
                ),
            ),
        ],
    )

    sf_cluster_config = SFClusterConfig(
        desc=desc,
        public_config=SFClusterConfig.PublicConfig(
            ray_fed_config=SFClusterConfig.RayFedConfig(
                parties=["alice", "bob"],
                addresses=[
                    "127.0.0.1:61041",
                    "127.0.0.1:61042",
                ],
            ),
            spu_configs=[
                SFClusterConfig.SPUConfig(
                    name="spu",
                    parties=["alice", "bob"],
                    addresses=[
                        "127.0.0.1:61045",
                        "127.0.0.1:61046",
                    ],
                )
            ],
        ),
        private_config=SFClusterConfig.PrivateConfig(
            self_party=party,
            ray_head_addr="local",  # "local" starts a new Ray cluster instead of connecting to an existing one.
        ),
    )

    # check https://www.secretflow.org.cn/docs/spec/latest/zh-Hans/intro#nodeevalparam for details.
    sf_node_eval_param = NodeEvalParam(
        domain="data_prep",
        name="psi",
        version="0.0.5",
        attr_paths=[
            "protocol",
            "sort",
            "bucket_size",
            "ecdh_curve_type",
            "input/receiver_input/key",
            "input/sender_input/key",
        ],
        attrs=[
            Attribute(s="PROTOCOL_ECDH"),
            Attribute(b=True),
            Attribute(i64=1048576),
            Attribute(s="CURVE_FOURQ"),
            Attribute(ss=["id1"]),
            Attribute(ss=["id2"]),
        ],
        inputs=[
            DistData(
                name="receiver_input",
                type="sf.table.individual",
                data_refs=[
                    DistData.DataRef(uri="input.csv", party="alice", format="csv"),
                ],
            ),
            DistData(
                name="sender_input",
                type="sf.table.individual",
                data_refs=[
                    DistData.DataRef(uri="input.csv", party="bob", format="csv"),
                ],
            ),
        ],
        output_uris=[
            "output.csv",
        ],
    )

    sf_node_eval_param.inputs[0].meta.Pack(
        IndividualTable(
            schema=TableSchema(
                id_types=["str"],
                ids=["id1"],
            ),
            line_count=-1,
        ),
    )

    sf_node_eval_param.inputs[1].meta.Pack(
        IndividualTable(
            schema=TableSchema(
                id_types=["str"],
                ids=["id2"],
            ),
            line_count=-1,
        ),
    )

    storage_config = StorageConfig(
        type="local_fs",
        local_fs=StorageConfig.LocalFSConfig(wd=f"/tmp/{party}"),
    )

    res = comp_eval(sf_node_eval_param, storage_config, sf_cluster_config)

    print(f'Node eval res is \n{res}')


if __name__ == "__main__":
    run()

Run it

alice
```
python psi_demo.py alice
```
bob
```
python psi_demo.py bob
```

Both terminals should show output like the following:

Node eval res is 
outputs {
  name: "output.csv"
  type: "sf.table.vertical_table"
  system_info {
  }
  meta {
    type_url: "type.googleapis.com/secretflow.spec.v1.VerticalTable"
    value: "\n\n\n\003id1\"\003str\n\n\n\003id2\"\003str\020\200\005"
  }
  data_refs {
    uri: "output.csv"
    party: "alice"
    format: "csv"
  }
  data_refs {
    uri: "output.csv"
    party: "bob"
    format: "csv"
  }
}

The part to look at is data_refs in the output: there is a result file for alice and one for bob, which hold their intersection.
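The meta value in that output is a serialized protobuf message, and its last three bytes (`\020\200\005`) can be decoded by hand. The sketch below reads them as a tag followed by a varint. Interpreting field 2 as the table's line_count is my reading of data.proto and should be treated as an assumption.

```python
def read_varint(buf: bytes, pos: int):
    """Decode one protobuf base-128 varint starting at buf[pos]."""
    result, shift = 0, 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:        # a clear high bit marks the last byte
            return result, pos
        shift += 7


# Tail of the meta value above, written as the same octal escapes.
tail = bytes([0o020, 0o200, 0o005])
tag, pos = read_varint(tail, 0)
field_number, wire_type = tag >> 3, tag & 0x7
value, _ = read_varint(tail, pos)
print(field_number, wire_type, value)   # 2 0 640
```

An intersection of 640 rows is about what one would expect: each of Alice's 800 ids appears in Bob's set with probability 800/1001, giving roughly 800 × 800 / 1001 ≈ 639 matches on average.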

2. Kuscia

Simplifies data synchronization and scheduling.

I'm skipping this part for now and will fill it in later if needed.

https://www.secretflow.org.cn/zh-CN/docs/kuscia/v0.9.0b0/deployment/kuscia_deployment_instructions

3. SecretPad

Uses a graphical user interface.

The logs can also be seen at the bottom:

2024-06-28 10:03:05 INFO the jobId=rezb, taskId=rezb-gwsttndr-node-3 start ...
2024-06-28 10:03:17 INFO the jobId=rezb, taskId=rezb-gwsttndr-node-3 succeed

Click "View result" to jump to the result node page, where the data can be downloaded.

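IV. Adding a New SecretFlow Component

https://www.secretflow.org.cn/zh-CN/docs/secretpad-all-in-one/v1.6.1b0/more_tutorials/new_components

The example is creating an MPC-based secret comparison component. I only listened to this part briefly and did not try it hands-on.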

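1. SecretFlow components

Component: the smallest unit of computation that SecretFlow provides.
Component list: a collection of components.
Component instance: one invocation of a component.
Training flow: a DAG of component instances.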
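Because a training flow is a DAG, a scheduler can run its component instances in any topological order. Here is a small sketch using Python's standard graphlib module (Python 3.9+). The instance names and edges are hypothetical and only illustrate the idea.

```python
from graphlib import TopologicalSorter

# Hypothetical training flow: each key is a component instance, each value
# is the set of instances whose outputs it consumes.
flow = {
    "psi": set(),
    "train_test_split": {"psi"},
    "train": {"train_test_split"},
    "predict": {"train", "train_test_split"},
}

order = list(TopologicalSorter(flow).static_order())
print(order)
```

If the flow contained a cycle, `static_order()` would raise `graphlib.CycleError`, which is a cheap way to reject an invalid flow before launching anything.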

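2. SecretFlow background

(1) How the modules of the SecretFlow stack relate

SecretPad is the user interface of the SecretFlow platform. Users see the full component list there and use components to build training flows.
A Kuscia node is deployed at every computing party and is responsible for launching SecretFlow component instances.
The SecretFlow image contains the SecretFlow binary and is what actually executes component instances.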

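To modify or add a component, you need to:

Modify the SecretFlow code
Build the SecretFlow image
Update the component list in the SecretPad platform
Register the new component image in the Kuscia scheduling framework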

(2) Layers inside the SecretFlow image

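Kuscia Adapter: converts Kuscia data structures into SecretFlow component data structures. The code is at https://github.com/secretflow/secretflow/blob/main/secretflow/kuscia/entry.py and you do not need to modify it.
SecretFlow Comp Entry: reads the SecretFlow component data structures and calls the corresponding component. The code is at https://github.com/secretflow/secretflow/blob/main/secretflow/component/entry.py and this is where you declare your component.
SecretFlow Comps: all SecretFlow components. The code is at https://github.com/secretflow/secretflow/tree/main/secretflow/component and your new component goes under this folder.
SecretFlow Libraries: the SecretFlow API. You can build components from any of SecretFlow's existing algorithms; the original tutorial links to an overview of the first-party libraries. You may need to adjust this code.
SecretFlow Devices: SecretFlow abstracts local plaintext computation as PYU operations and encrypted computation as operations on secure devices: SPU (MPC, secure multi-party computation), HEU (HE, homomorphic encryption), and TEEU (TEE, trusted execution environment). If these are unfamiliar, read the document linked from the original tutorial. You generally do not need to modify this code.
Ray/RayFed: Ray is the foundation of SecretFlow and schedules resources inside a SecretFlow node launched by Kuscia; each computing party is its own Ray cluster. RayFed handles communication and coordination between the Ray clusters.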
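The role of the Comp Entry layer, looking up a declared component and calling it, can be sketched as a small registry keyed by (domain, name, version). This is only an illustration of the dispatch idea and is not how secretflow/component/entry.py is actually written. The `demo/compare` component is hypothetical and runs in plaintext, whereas a real secret comparison would run on a secure device such as SPU.

```python
from typing import Callable, Dict, Tuple

Key = Tuple[str, str, str]                      # (domain, name, version)
REGISTRY: Dict[Key, Callable[[dict], dict]] = {}


def register(domain: str, name: str, version: str):
    """Decorator that declares a component under its (domain, name, version) key."""
    def wrap(fn):
        REGISTRY[(domain, name, version)] = fn
        return fn
    return wrap


@register("demo", "compare", "0.0.1")           # hypothetical component
def compare(params: dict) -> dict:
    # Plaintext stand-in for the comparison logic.
    return {"greater": params["a"] > params["b"]}


def comp_eval_sketch(domain: str, name: str, version: str, params: dict) -> dict:
    """Look up the declared component and call it, failing loudly if it is unknown."""
    key = (domain, name, version)
    if key not in REGISTRY:
        raise KeyError(f"component not declared: {key}")
    return REGISTRY[key](params)


print(comp_eval_sketch("demo", "compare", "0.0.1", {"a": 3, "b": 2}))
```

The sketch also shows why a new component has to be declared in the entry layer: code that exists but is never registered cannot be reached by the dispatcher.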