After our periodically scheduled Spark jobs had been running for a while, new jobs could no longer be submitted to the YARN cluster (HDP 2.6.4), and the driver kept logging the following warning:
[WARN] Utils Service 'sparkDriver' could not bind on a random free port. You may check whether configuring an appropriate binding address.
Following other people's write-ups, I checked the hostname configuration and widened the random (ephemeral) port range, but the problem persisted. My guess was that the application was not releasing its TCP connections properly, which then prevented subsequent applications from running.
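For reference, this is the kind of check meant above: the hostname commands verify that the driver's advertised name resolves to the expected address, and the sysctl values are illustrative examples rather than the settings my cluster ended up with (Spark's spark.port.maxRetries, default 16, can also be raised if the bind failures really are caused by port exhaustion):

hostname -f                                                  # name the driver advertises
getent hosts "$(hostname -f)"                                # confirm it resolves to the expected interface
sysctl net.ipv4.ip_local_port_range                          # current ephemeral port range
sudo sysctl -w net.ipv4.ip_local_port_range="10000 65000"    # widen it (example values)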
Running netstat -an then showed quite a few connections stuck in the CLOSE_WAIT state.
ss -s showed that the TCP sockets were almost entirely IPv6 (over 90%). The following command gives a similar per-type count (column 5 of the lsof output is the TYPE field, i.e. IPv4 or IPv6):
lsof | grep CLOSE_WAIT | awk '{print $5}' | sort | uniq -c
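To confirm which remote port the leaked connections belong to, a per-port count also helps. A minimal sketch that splits the foreign-address column of netstat (the n=split form also copes with colon-separated IPv6 addresses):

netstat -an | awk '$6 == "CLOSE_WAIT" {n = split($5, a, ":"); print a[n]}' | sort | uniq -c | sort -rn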
Most of these CLOSE_WAIT connections were on port 50010, the default DataNode data-transfer port. Since our Hadoop applications do not communicate over IPv6 at all, I suspected a Hadoop bug and searched the known issues for HDP and CDH, where I indeed found a similar report:
On CDH3 there were known issues that could lead to this situation. If you could show us a few lines from the netstat output that represents a majority of your CLOSE_WAIT connections, I can narrow the cause down more, but I'll list potential causes here:
1) IPv6 is enabled on your machines and the RS's are trying to utilize the IPv6 protocol to communicate to DNs, but since hadoop does not support IPv6, the RS's do not properly close down those tcp connections. The output of the "lsof | grep CLOSE_WAIT" command will show IPv6 in the type column if these connections are due to errantly utilizing IPv6. To alleviate that, you can add the following java arguments to your java runtime:
-Djava.net.preferIPv4Stack=true -Djava.net.preferIPv6Addresses=false
Roughly translated: because IPv6 is enabled on the servers, the RSes try to communicate with the DNs over IPv6, but since Hadoop does not support IPv6 the RSes cannot close those TCP connections cleanly, leaving a pile of connections in the CLOSE_WAIT state; passing the flags above when starting the relevant Java processes disables IPv6 and avoids the problem.
I added -Djava.net.preferIPv4Stack=true -Djava.net.preferIPv6Addresses=false to the relevant Spark launch options and restarted the application. Two hours later, the number of CLOSE_WAIT connections on port 50010 was back down to a normal double-digit level, and the problem did not recur over the following week of observation.
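For reference, if the job is launched through spark-submit, one way to pass the options is sketched below; the class and jar names are placeholders, and whether the executors also need the flags depends on where the HDFS reads happen:

spark-submit \
  --driver-java-options "-Djava.net.preferIPv4Stack=true -Djava.net.preferIPv6Addresses=false" \
  --conf "spark.executor.extraJavaOptions=-Djava.net.preferIPv4Stack=true -Djava.net.preferIPv6Addresses=false" \
  --class com.example.PeriodicJob \
  periodic-job.jar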