Deploying a Federated Learning Development Environment with KubeFATE
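I recently needed to test some FATE features in a realistic setting and ran into several problems while deploying FATE with Docker Compose, so this post summarizes the steps and the fixes. It follows the official deployment documentation.
The deployment is done as the root user on two servers (1 and 2). Server 1 acts as both the deployment machine and a target machine; server 2 is a target machine only.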
1. Install docker and docker-compose
Both the deployment machine and the target machines need these.
curl -fsSL https://get.docker.com | bash -s docker --mirror Aliyun # install docker
curl -L "https://github.com/docker/compose/releases/download/1.29.1/docker-compose-$(uname -s)-$(uname -m)" -o /usr/local/bin/docker-compose # install docker-compose
chmod +x /usr/local/bin/docker-compose # make the binary executable
systemctl start docker # start docker
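Before moving on, it can help to confirm that both tools are actually on the PATH of every machine. The following is a minimal sketch; the helper name `missing_tools` is made up for illustration and is not part of Docker or KubeFATE.

```shell
# Print the name of every listed tool that cannot be found on PATH.
# missing_tools is a hypothetical helper, not part of Docker or KubeFATE.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

missing=$(missing_tools docker docker-compose)
if [ -n "$missing" ]; then
  # Unquoted on purpose so several names are printed on one line.
  echo "not installed:" $missing
else
  echo "docker and docker-compose are installed"
fi
```

Run it once on the deployment machine and once on each target machine.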
2. Download and unpack KubeFATE 1.6
Download the Source code (zip) of KubeFATE 1.6 from GitHub, upload it to server 1, and unzip it.
unzip KubeFATE-1.6.0.zip
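3. Define the parties to deploy
Run the following steps on the deployment machine.
# Enter the docker-deploy directory (inside the unpacked KubeFATE directory)
cd docker-deploy/
# Edit parties.conf as shown below
vi parties.conf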
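user=root # defaults to fate; change it to root
dir=/data/projects/fate
partylist=(10000 9999) # one party ID per target machine
partyiplist=(192.168.1.1 192.168.1.2) # replace with the target machines' IPs, one per party, separated by spaces (no commas)
servingiplist=(192.168.1.1 192.168.1.2) # replace with the target machines' IPs
exchangeip=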
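Since each party ID is paired with one IP, a mismatch between the two lists is an easy mistake to make. The sketch below counts the entries of both lists in parties.conf and compares them; `count_entries` is a hypothetical helper written for this post, not something KubeFATE ships.

```shell
# Count the whitespace-separated entries of an array assignment such as
#   partylist=(10000 9999)
# count_entries is a hypothetical helper, not part of KubeFATE.
count_entries() {
  # $1 = variable name, $2 = config file
  entries=$(sed -n "s/^$1=(\([^)]*\)).*/\1/p" "$2" | head -n 1)
  set -- $entries   # split on whitespace; $# is now the entry count
  echo "$#"
}

conf=parties.conf
if [ -f "$conf" ]; then
  ids=$(count_entries partylist "$conf")
  ips=$(count_entries partyiplist "$conf")
  if [ "$ids" -eq "$ips" ]; then
    echo "OK: $ids party IDs, $ips IPs"
  else
    echo "Mismatch: $ids party IDs but $ips IPs"
  fi
fi
```

It only counts entries; it does not check that the IPs themselves are valid or reachable.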
4. Generate the cluster deployment files
Run the following steps on the deployment machine.
cd docker-deploy/
bash generate_config.sh
# If this fails with "docker_deploy.sh: Permission denied", run the following
chmod +x ./docker_deploy.sh
Tar files have now been generated for every party, including the exchange node, and are stored in the outputs folder.
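A quick way to confirm that nothing is missing is to check the outputs folder for one archive per party. This sketch assumes the archives are named confs-<party_id>.tar, which is a naming assumption you should adjust if your outputs folder looks different; `missing_archives` is a hypothetical helper.

```shell
# Print each party ID that has no confs-<id>.tar in the given directory.
# Assumption: archives are named confs-<party_id>.tar; adjust if yours differ.
missing_archives() {
  dir=$1
  shift
  for id in "$@"; do
    [ -f "$dir/confs-$id.tar" ] || echo "$id"
  done
}

# Run from the docker-deploy directory; prints nothing when all archives exist.
if [ -d outputs ]; then
  missing_archives outputs 10000 9999
fi
```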
5. Deploy FATE to the target hosts
Run the following steps on the deployment machine.
bash docker_deploy.sh all
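If the deployment fails with an image pull timeout like the one below, follow the rest of this section; if there is no error, skip it.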
ERROR: error pulling image configuration: Get https://production.cloudflare.docker.com/registry-v2/docker/registry/v2/blobs/sha256/b7/b7b236ee4db0780ea765f0467d4cd858e5b2129bdce5fc8f1029466bc1e1d79e/data?verify=1620955831-7bGG01dQ%2BQJSt%2BxOnkb10XVaAr0%3D: dial tcp 104.18.123.25:443: i/o timeout
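I first tried the methods described in link 1 and link 2, but the problem could still come back later; switching the image registry is what finally solved it.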
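cd docker-deploy/
vim .env
# Set RegistryURI so that images are pulled from hub.c.163.com
RegistryURI=hub.c.163.com
# Save and quit with :wq
# After changing parties.conf or .env, regenerate the deployment files
bash generate_config.sh
# Deploying to the target machines should now succeed
bash docker_deploy.sh all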
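If you would rather not edit .env by hand, for example when repeating the deployment, the same change can be scripted. This is a sketch only: `set_registry` is a hypothetical helper, and it assumes .env already contains a line starting with `RegistryURI=`.

```shell
# Rewrite the RegistryURI line of an env file in place.
# set_registry is a hypothetical helper; it only changes an existing
# RegistryURI= line and leaves every other line untouched.
set_registry() {
  # $1 = env file, $2 = registry host
  tmp=$(mktemp) || return 1
  sed "s|^RegistryURI=.*|RegistryURI=$2|" "$1" > "$tmp" &&
    cat "$tmp" > "$1"   # write back through cat to keep the file's permissions
  rm -f "$tmp"
}

# Run from the docker-deploy directory.
if [ -f .env ]; then
  set_registry .env hub.c.163.com
  grep '^RegistryURI=' .env || echo "no RegistryURI line found in .env"
fi
```

As with a manual edit, the deployment files still have to be regenerated afterwards.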
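6. Verify the deployment
At this point each node should have a container named confs-<party_id>_python_1 that runs the fate-flow service. For example, on the Party 10000 node, run the following commands to verify the deployment: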
docker exec -it confs-10000_python_1 bash
cd /data/projects/fate/examples/toy_example/
python run_toy_example.py 10000 9999 1
Running it produced an error:
Traceback (most recent call last):
File "run_toy_example.py", line 210, in <module>
exec_toy_example(runtime_config)
File "run_toy_example.py", line 171, in exec_toy_example
jobid = exec_task(dsl_path, runtime_config)
File "run_toy_example.py", line 93, in exec_task
"failed to exec task, status:{}, stderr is {} stdout:{}".format(status, stderr, stdout))
ValueError: failed to exec task, status:100, stderr is None stdout:{'data': {'guest': {'10001': {'retcode': 0, 'retmsg': 'success'}}, 'host': {'9999': {'retcode': 104, 'retmsg': 'Federated schedule error, Please check rollSite and fateflow network connectivityrpc request error: <_InactiveRpcError of RPC that terminated with:\n\tstatus = StatusCode.UNAVAILABLE\n\tdetails = "\n[Roll Site Error TransInfo] \n location
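After a lot of troubleshooting, the cause turned out to be that ports 8080 and 20000 were not open.
# Start the firewall
systemctl start firewalld
# Open the required ports
firewall-cmd --add-port=8080/tcp --permanent
firewall-cmd --add-port=20000/tcp --permanent
# Reload so that the added ports take effect
firewall-cmd --reload
# Check that the ports were opened
firewall-cmd --query-port=8080/tcp
firewall-cmd --query-port=20000/tcp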
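The firewall-cmd query only reports the local firewall rules. To see whether a port is really reachable from the other server, a connection test is more telling. The sketch below relies on bash's /dev/tcp redirection and the coreutils `timeout` command; `port_open` is a hypothetical helper, not a FATE or firewalld tool.

```shell
# Succeed if a TCP connection to host:port can be opened within 3 seconds.
# port_open is a hypothetical helper built on bash's /dev/tcp redirection
# and the coreutils timeout command.
port_open() {
  # $1 = host, $2 = port
  timeout 3 bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$1" "$2" 2>/dev/null
}
```

For example, running `port_open 192.168.1.2 20000 && echo reachable` on server 1 probes server 2; a non-zero exit status means the port is closed, filtered, or the host did not answer in time.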
# Then run the example again
python run_toy_example.py 10000 9999 1
If the test passes, the output looks similar to the following:
stdout:{
"data": {
"board_url": "http://fateboard:8080/index.html#/dashboard?job_id=202105200547052456781&role=guest&party_id=9999",
"job_dsl_path": "/data/projects/fate/jobs/202105200547052456781/job_dsl.json",
"job_id": "202105200547052456781",
"job_runtime_conf_on_party_path": "/data/projects/fate/jobs/202105200547052456781/guest/job_runtime_on_party_conf.json",
"job_runtime_conf_path": "/data/projects/fate/jobs/202105200547052456781/job_runtime_conf.json",
"logs_directory": "/data/projects/fate/logs/202105200547052456781",
"model_info": {
"model_id": "guest-9999#host-10001#model",
"model_version": "202105200547052456781"
},
"pipeline_dsl_path": "/data/projects/fate/jobs/202105200547052456781/pipeline_dsl.json",
"train_runtime_conf_path": "/data/projects/fate/jobs/202105200547052456781/train_runtime_conf.json"
},
"jobId": "202105200547052456781",
"retcode": 0,
"retmsg": "success"
}
job status is running
job status is running
[INFO] [2021-05-20 05:47:33,785] [81:140386598397760] - secure_add_guest.py[line:99]: begin to init parameters of secure add example guest
[INFO] [2021-05-20 05:47:33,785] [81:140386598397760] - secure_add_guest.py[line:102]: begin to make guest data
[INFO] [2021-05-20 05:47:36,814] [81:140386598397760] - secure_add_guest.py[line:105]: split data into two random parts
[INFO] [2021-05-20 05:47:48,193] [81:140386598397760] - secure_add_guest.py[line:108]: share one random part data to host
[INFO] [2021-05-20 05:47:48,204] [81:140386598397760] - secure_add_guest.py[line:111]: get share of one random part data from host
[INFO] [2021-05-20 05:47:55,361] [81:140386598397760] - secure_add_guest.py[line:114]: begin to get sum of guest and host
[INFO] [2021-05-20 05:47:58,700] [81:140386598397760] - secure_add_guest.py[line:117]: receive host sum from guest
[INFO] [2021-05-20 05:47:59,918] [81:140386598397760] - secure_add_guest.py[line:124]: success to calculate secure_sum, it is 1999.9999999999995
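The "job status is running" lines show that the job has started, and a line such as "success to calculate secure_sum, it is 1999.9999999999995" shows that it has succeeded (the sum is 2000 up to floating-point error). You can then open http://<server-ip>:8080/ in a local browser to view fateboard.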
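Because the reported sum is a floating-point value, it may not print as exactly 2000.0, so an automated check should compare with a tolerance instead of matching the string. A minimal sketch, where `sum_is_close` and the 1e-6 tolerance are choices made for this example:

```shell
# Succeed if the last field of a log line is within 1e-6 of the expected value.
# sum_is_close is a hypothetical helper, not part of FATE.
sum_is_close() {
  # $1 = log line, $2 = expected value
  printf '%s\n' "$1" | awk -v want="$2" '{
    diff = $NF - want
    if (diff < 0) diff = -diff
    exit !(diff < 1e-6)
  }'
}

line='success to calculate secure_sum, it is 1999.9999999999995'
if sum_is_close "$line" 2000; then
  echo "secure_sum matches the expected 2000"
fi
```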