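A beginner's quick notes on the pitfalls I hit while deploying FATE on Spark.
This post covers the problems I ran into with the RabbitMQ cluster; notes on the other modules will be added over time.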
Environment
Operating system: CentOS 7 (multiple hosts)
FATE deployment mode: FATE on Spark (Spark in single-machine mode + no HDFS + RabbitMQ cluster)
Version: v1.8.0
Accounts: root and app
Link: FATE/fate_on_spark_deployment_guide.md at master · FederatedAI/FATE · GitHub
RabbitMQ with Erlang
The RabbitMQ module deployment corresponds to section 5.8 of the FATE on Spark deployment guide
FATE/rabbitmq_deployment_guide.zh.md at master · FederatedAI/FATE · GitHub
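Package download links
Problem: the official guide does not give download links for the RabbitMQ (with Erlang) dependency packages
Solution: RabbitMQ, Erlang, and ncurses can be downloaded from the following addresses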
https://github.com/rabbitmq/rabbitmq-server/releases/download/rabbitmq_v3_6_15/rabbitmq-server-generic-unix-3.6.15.tar.xz
http://www.erlang.org/download/otp_src_19.3.tar.gz
https://mirrors.aliyun.com/gnu/ncurses/ncurses-6.0.tar.gz
https://github.com/mirror/ncurses/archive/refs/tags/v6.0.tar.gz
PS: I am located outside China, so I am not sure whether these links are reachable from within China
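The downloads above can be scripted. The following is a minimal sketch, assuming `curl` is available; the helper names `pkg_name` and `fetch_packages` and the target directory are made up for illustration, and only the URLs come from the list above.

```shell
# URLs taken from the list above (ncurses from the Aliyun mirror;
# the GitHub mirror is an alternative source for the same release).
PKG_URLS="
https://github.com/rabbitmq/rabbitmq-server/releases/download/rabbitmq_v3_6_15/rabbitmq-server-generic-unix-3.6.15.tar.xz
http://www.erlang.org/download/otp_src_19.3.tar.gz
https://mirrors.aliyun.com/gnu/ncurses/ncurses-6.0.tar.gz
"

# Hypothetical helper: derive the local file name from a URL.
pkg_name() {
    printf '%s\n' "${1##*/}"
}

# Hypothetical helper: download every package into the given directory.
fetch_packages() {
    dest="${1:-.}"
    mkdir -p "$dest" || return 1
    for url in $PKG_URLS; do
        # -f: fail on HTTP errors, -L: follow redirects (GitHub release links redirect)
        curl -fL -o "$dest/$(pkg_name "$url")" "$url" || return 1
    done
}
```

Nothing is downloaded until the function is called, for example `fetch_packages /tmp/fate-deps` (the directory is only an example).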
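Installing the dependencies
Problem: errors are reported while building and installing the dependency packages (ncurses `make install`, then Erlang); the build summary shows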
********************** APPLICATIONS DISABLED **********************
jinterface : No Java compiler found
odbc : ODBC library - link check failed
********************** APPLICATIONS INFORMATION *******************
wx : wxWidgets not found, wx will NOT be usable
********************** DOCUMENTATION INFORMATION ******************
documentation :
fop is missing.
Using fakefop to generate placeholder PDF files.
Solution: as the root user, run `yum install gcc-c++ automake cmake ncurses-devel openssl-devel wxGTK-devel fop java-1.8.0-openjdk-devel unixODBC-devel libssh2-devel`
References:
How compile erlang without modules 'jinterface, odbc, wx' (asdf-vm/asdf-erlang, GitHub)
yum install reports "No package ****** available" (CSDN blog by 每天进步一点_点)
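Erlang command not found
Problem: after installation, running `erl` returns erl: command not found...
Solution: confirm that the installation finished without errors, then log out of the SSH session, log back in, and run `erl` again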
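A likely reason the re-login helps, stated here as an assumption, is that the directory containing `erl` was added to PATH in a profile file that the current shell had not reloaded. A small sketch for checking this; `ERL_BIN` is a placeholder for the bin directory of your own Erlang install prefix, and `dir_in_path` and `check_erl` are hypothetical helpers.

```shell
# Placeholder: replace with the bin directory of your Erlang install prefix.
ERL_BIN="${ERL_BIN:-/path/to/erlang/bin}"

# Hypothetical helper: succeed if directory $1 appears in the PATH-style string $2.
dir_in_path() {
    case ":$2:" in
        *":$1:"*) return 0 ;;
        *)        return 1 ;;
    esac
}

# Report whether erl is resolvable and whether ERL_BIN is on PATH.
check_erl() {
    if command -v erl >/dev/null 2>&1; then
        echo "erl found at $(command -v erl)"
    elif dir_in_path "$ERL_BIN" "$PATH"; then
        echo "$ERL_BIN is on PATH but erl is missing; re-check the install"
    else
        echo "$ERL_BIN is not on PATH; log in again or source the profile that sets it"
    fi
}
```

Calling `check_erl` in the session that reports "command not found" shows which of the three cases applies.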
Joining RabbitMQ nodes to the cluster
Problem: stopping the service on node host2 with `sbin/rabbitmqctl stop` fails with Authentication failed
attempted to contact: [rabbit@host2]
rabbit@host2:
* connected to epmd (port 4369) on host2
* epmd reports node 'rabbit' running on port 25672
* TCP connection succeeded but Erlang distribution failed
* Authentication failed (rejected by the remote node), please check the Erlang cookie
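Solution: the node has to be stopped while its original .erlang.cookie is still in place, because `rabbitmqctl` authenticates against the running node with that cookie.
So stop the node first, and only then copy the .erlang.cookie of the master node host1 to nodes host2 and host3
(the .erlang.cookie file is located in /home/app/)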
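The order of operations can be sketched as follows. This is an illustration under assumptions, not the official procedure: `RABBITMQ_HOME` is a placeholder for the RabbitMQ install directory, the app user is assumed to be able to `scp` from host1, the `sleep` is a crude wait for the node to boot, and `same_cookie` and `rejoin_cluster` are hypothetical helper names.

```shell
# Placeholder paths; only /home/app/.erlang.cookie comes from the notes above.
RABBITMQ_HOME="${RABBITMQ_HOME:-/path/to/rabbitmq_server-3.6.15}"
COOKIE="/home/app/.erlang.cookie"

# Hypothetical helper: succeed if two cookie files are byte-identical.
same_cookie() {
    cmp -s "$1" "$2"
}

# Run on host2 / host3 as the app user.
rejoin_cluster() {
    cd "$RABBITMQ_HOME" || return 1
    # 1. Stop the node while its original cookie is still in place.
    sbin/rabbitmqctl stop
    # 2. Replace the local cookie with the one from host1 (the file is read-only by default).
    chmod 600 "$COOKIE" 2>/dev/null
    scp app@host1:"$COOKIE" "$COOKIE" || return 1
    chmod 400 "$COOKIE"
    # 3. Start the node again and join it to the cluster of rabbit@host1.
    sbin/rabbitmq-server -detached
    sleep 10   # crude wait for the node to finish booting
    sbin/rabbitmqctl stop_app
    sbin/rabbitmqctl join_cluster rabbit@host1
    sbin/rabbitmqctl start_app
}
```

`same_cookie` can be used afterwards to compare the local cookie with a copy fetched from host1 if the authentication error persists.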
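(Update) Restarting the services
Start the master node first, then start the other nodes and join them to the cluster. Once that is done, check on each party that everything is running: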
ps -ef | grep -i rabbit
netstat -tlnp | grep -i 5672
With a healthy cluster deployment on both parties, the output includes
tcp 0 0 0.0.0.0:15672 0.0.0.0:* LISTEN 4797/beam.smp
tcp 0 0 0.0.0.0:25672 0.0.0.0:* LISTEN 4797/beam.smp
tcp6 0 0 :::5672 :::* LISTEN 4797/beam.smp
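The port check can be automated with a small filter over the `netstat` output. A minimal sketch, assuming the three ports shown above (5672, 15672, 25672) are the ones that matter; `missing_ports` is a hypothetical helper that reads a listing on stdin and prints the ports that are not in LISTEN state.

```shell
# Hypothetical helper: read `netstat -tlnp` output on stdin and print,
# space-separated, which of the expected RabbitMQ ports are not listening.
missing_ports() {
    mp_listing=$(cat)
    mp_missing=""
    for mp_port in 5672 15672 25672; do
        # Match ":<port>" followed by whitespace so 5672 does not match 15672.
        if ! printf '%s\n' "$mp_listing" | grep -Eq ":${mp_port}[[:space:]].*LISTEN"; then
            mp_missing="$mp_missing $mp_port"
        fi
    done
    printf '%s\n' "${mp_missing# }"
}
```

Usage: `netstat -tlnp | missing_ports`; an empty line means all three ports are listening.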
Connection refused when running a FATE job
Problem:
urllib3.exceptions.MaxRetryError: HTTPConnectionPool(host='xxx.xxx.xxx.xxx', port=15672): Max retries exceeded with url: /api/queues/202208101027104732950_secure_add_example_0_0-guest-9999-host-9999 (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f028771c1d0>: Failed to establish a new connection: [Errno 111] Connection refused',))
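Solution:
Confirm that both parties can open the UI at http://192.168.XXX.XXX:15672/ and log in with the fate user and password (default fate / fate)
If the login fails, reconfigure the RabbitMQ user:
rabbitmqctl add_user fate fate
rabbitmqctl set_user_tags fate administrator
rabbitmqctl set_permissions -p / fate ".*" ".*" ".*"
If the master node is fine and only the client node is not listening on 15672, start the application again with rabbitmqctl start_app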
sbin/rabbitmqctl start_app
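The UI check can also be done from the command line, which is handy on hosts without a browser. A minimal sketch, assuming `curl` is installed and the management plugin serves its HTTP API on port 15672; `mgmt_url` and `mgmt_status` are hypothetical helper names, and the default credentials are the fate / fate pair mentioned above.

```shell
# Hypothetical helper: management API overview URL for a host.
mgmt_url() {
    printf 'http://%s:15672/api/overview\n' "$1"
}

# Print the HTTP status code returned by the management API.
# $1 = host, $2 = optional user:password (defaults to fate:fate)
mgmt_status() {
    curl -s -o /dev/null -w '%{http_code}\n' -u "${2:-fate:fate}" "$(mgmt_url "$1")"
}
```

Run `mgmt_status 192.168.XXX.XXX` with your own address. A 200 means the API is up and the login works, a 401 points to the user configuration, and curl typically prints 000 when the connection is refused, which matches the error above.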
References: