Spark cluster port exhaustion - BindException: Address already in use

1. The exception

Spark jobs on this cluster used to run fine, but recently they started failing with:

java.net.BindException: Address already in use: Service 'org.apache.spark.network.netty.NettyBlockTransferService' failed after 16 retries (on a random free port)! Consider explicitly setting the appropriate binding address for the service 'org.apache.spark.network.netty.NettyBlockTransferService' (for example spark.driver.bindAddress for SparkDriver) to the correct binding address.

The Spark UI shows:
HTTP ERROR 500

Problem accessing /proxy/application_1588486936385_2884/. Reason:

    Address already in use

Caused by:

java.net.BindException: Address already in use
    at java.net.PlainSocketImpl.socketBind(Native Method)
    at java.net.AbstractPlainSocketImpl.bind(AbstractPlainSocketImpl.java:387)
    at java.net.Socket.bind(Socket.java:644)
    at org.apache.http.conn.scheme.PlainSocketFactory.connectSocket(PlainSocketFactory.java:120)
    at org.apache.http.impl.conn.DefaultClientConnectionOperator.openConnection(DefaultClientConnectionOperator.java:180)
    at org.apache.http.impl.conn.ManagedClientConnectionImpl.open(ManagedClientConnectionImpl.java:294)
    at org.apache.http.impl.client.DefaultRequestDirector.tryConnect(DefaultRequestDirector.java:643)
    at org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:479)
    at org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:906)
    at org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:805)
    at org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:784)
    at org.apache.hadoop.yarn.server.webproxy.WebAppProxyServlet.proxyLink(WebAppProxyServlet.java:200)
    at org.apache.hadoop.yarn.server.webproxy.WebAppProxyServlet.doGet(WebAppProxyServlet.java:387)
    at javax.servlet.http.HttpServlet.service(HttpServlet.java:707)
    at javax.servlet.http.HttpServlet.service(HttpServlet.java:820)
    at org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:511)
    at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1221)
The exception in the YARN application logs:
20/12/21 12:55:18 WARN Utils: Service 'org.apache.spark.network.netty.NettyBlockTransferService' could not bind on a random free port. You may check whether configuring an appropriate binding address.
20/12/21 12:55:18 ERROR CoarseGrainedExecutorBackend: Executor self-exiting due to : Unable to create executor due to Address already in use: Service 'org.apache.spark.network.netty.NettyBlockTransferService' failed after 100 retries (on a random free port)! Consider explicitly setting the appropriate binding address for the service 'org.apache.spark.network.netty.NettyBlockTransferService' (for example spark.driver.bindAddress for SparkDriver) to the correct binding address.
java.net.BindException: Address already in use: Service 'org.apache.spark.network.netty.NettyBlockTransferService' failed after 100 retries (on a random free port)! Consider explicitly setting the appropriate binding address for the service 'org.apache.spark.network.netty.NettyBlockTransferService' (for example spark.driver.bindAddress for SparkDriver) to the correct binding address.
    at sun.nio.ch.Net.bind0(Native Method)
    at sun.nio.ch.Net.bind(Net.java:433)
    at sun.nio.ch.Net.bind(Net.java:425)
    at sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:223)
    at io.netty.channel.socket.nio.NioServerSocketChannel.doBind(NioServerSocketChannel.java:128)
    at io.netty.channel.AbstractChannel$AbstractUnsafe.bind(AbstractChannel.java:558)
    at io.netty.channel.DefaultChannelPipeline$HeadContext.bind(DefaultChannelPipeline.java:1283)
    at io.netty.channel.AbstractChannelHandlerContext.invokeBind(AbstractChannelHandlerContext.java:501)
    at io.netty.channel.AbstractChannelHandlerContext.bind(AbstractChannelHandlerContext.java:486)
    at io.netty.channel.DefaultChannelPipeline.bind(DefaultChannelPipeline.java:989)
    at io.netty.channel.AbstractChannel.bind(AbstractChannel.java:254)
    at io.netty.bootstrap.AbstractBootstrap$2.run(AbstractBootstrap.java:364)
    at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:163)
    at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:403)
    at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:463)
    at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858)
    at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:138)
    at java.lang.Thread.run(Thread.java:745)
End of LogType:stderr
2. Analysis

If the service still cannot bind after 100 retries on random ports, the machine's ports must be almost entirely in use.

Check the total socket count with ss -s:
Total: 105626 (kernel 109563)
TCP:   105277 (estab 196, closed 79, orphaned 0, synrecv 0, timewait 77/0), ports 0

Transport Total     IP        IPv6
*         109563    -         -
RAW       1         0         1
UDP       15        8         7
TCP       105198    104809    389
INET      105214    104817    397
FRAG      0         0         0
Over 100,000 TCP sockets are open - the available ports are essentially exhausted.
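For context, Linux only hands out client-side ports from the ephemeral range, so the real ceiling is far below 65535. A quick check (not from the original post; a sketch assuming a stock Linux kernel, where 32768-60999 is a common default):

```shell
# Show the ephemeral (client-side) port range the kernel allocates from.
cat /proc/sys/net/ipv4/ip_local_port_range

# Compute how many client ports that range actually provides.
awk '{print $2 - $1 + 1}' /proc/sys/net/ipv4/ip_local_port_range
```

With the default range that is only about 28,000 ports per destination, so 100,000+ open sockets on one host easily starves new binds.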
Running ss also shows a huge number of connections stuck in CLOSE-WAIT:
ESTAB      0 0 [::ffff:192.168.827]:44693 [::ffff:192.168.860]:37528
CLOSE-WAIT 1 0 [::ffff:192.168.827]:58473 [::ffff:192.168.827]:50010
CLOSE-WAIT 1 0 [::ffff:192.168.827]:55800 [::ffff:192.168.827]:50010
CLOSE-WAIT 1 0 [::ffff:192.168.827]:37749 [::ffff:192.168.860]:50010
CLOSE-WAIT 1 0 [::ffff:192.168.827]:54642 [::ffff:192.168.827]:50010
CLOSE-WAIT 1 0 [::ffff:192.168.827]:39578 [::ffff:192.168.827]:50010
CLOSE-WAIT 1 0
...
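To see at a glance which states dominate, the ss output can be grouped by state. A small sketch (not from the original post):

```shell
# Count open sockets per TCP state; on a leaking machine CLOSE-WAIT dominates.
# NR > 1 skips the ss header line; column 1 is the state.
ss -tan | awk 'NR > 1 {count[$1]++} END {for (s in count) print count[s], s}' | sort -rn
```

On the broken host this would print a CLOSE-WAIT count in the tens of thousands at the top of the list.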
Pick one of these ports at random and dig in. Check its state:
[root@master ~]# netstat -anp | grep 54889
tcp  1 0 192.168.1.827:54889 192.168.1.803:50010 CLOSE_WAIT 19870/java
tcp6 1 0 192.168.1.827:54889 192.168.1.827:50010 CLOSE_WAIT 44212/java
Check the owning process:
[root@master ~]# ps -ef | grep 19870
root 17678 45042 0 16:34 pts/0 00:00:00 grep --color=auto 19870
root 19870 1 0 May04 ? 1-10:31:14 /opt/hadoop/jdk1.8.0_77/bin/java -Xmx16384m -Djava.library.path=/opt/hadoop/hadoop-2.7.7/lib -Djava.net.preferIPv4Stack=true -Dhadoop.log.dir=/opt/hadoop/hadoop-2.7.7/logs -Dhadoop.log.file=hadoop.log -Dhadoop.home.dir=/opt/hadoop/hadoop-2.7.7 -Dhadoop.id.str=root -Dhadoop.root.logger=INFO,console -Dhadoop.policy.file=hadoop-policy.xml -Djava.net.preferIPv4Stack=true -Xmx512m -Dproc_hiveserver2 -Dlog4j.configurationFile=hive-log4j2.properties -Djava.util.logging.config.file=/opt/hadoop/apache-hive-3.0.0-bin/conf/parquet-logging.properties -Djline.terminal=jline.UnsupportedTerminal -Dhadoop.security.logger=INFO,NullAppender org.apache.hadoop.util.RunJar /opt/hadoop/apache-hive-3.0.0-bin/lib/hive-service-3.0.0.jar org.apache.hive.service.server.HiveServer2
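That command line is very long. A shortcut for reading it: RunJar hides the real service name, but the main class it launches is the trailing argument, so it can be pulled out directly (a sketch; the PID 19870 is the one from the ps output above):

```shell
# Print only the last token of the process's command line -
# for a RunJar wrapper that is the main class it actually runs.
ps -o args= -p 19870 | awk '{print $NF}'
```

Here that last token is org.apache.hive.service.server.HiveServer2.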
List the Java processes:
[root@master ~]# jps
28966 HQuorumPeer
11113 SecondaryNameNode
28457 HRegionServer
10858 DataNode
15722 NodeManager
11403 ResourceManager
10707 NameNode
44212 ApplicationMaster
2839 Jps
19672 RunJar
28217 HMaster
19870 RunJar
42814 RunJar
So PID 19870 shows up in jps as a RunJar, and the ps output above confirms it is the HiveServer2 process.
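Instead of sampling one port at a time, the leaked sockets can be grouped by owning process in one pass. A sketch (not from the original post; in netstat -anp output column 6 is the state and column 7 the PID/program):

```shell
# Group CLOSE_WAIT connections by the owning PID/program name,
# largest offender first.
netstat -anp 2>/dev/null | awk '$6 == "CLOSE_WAIT" {count[$7]++} END {for (p in count) print count[p], p}' | sort -rn
```

On this host 19870/java (HiveServer2) would dominate the list.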
Next, look at the other end of those connections - port 50010 on the peer host:
[root@slave3 ~]# netstat -anp | grep 50010
tcp  0 0 0.0.0.0:50010       0.0.0.0:*           LISTEN      11430/java
tcp  0 0 192.168.1.859:50010 192.168.1.860:34366 ESTABLISHED 11430/java
tcp  0 0 192.168.1.859:38024 192.168.1.803:50010 ESTABLISHED 11430/java
tcp  0 0 192.168.1.859:50010 192.168.1.859:47796 ESTABLISHED 11430/java
tcp  0 0 192.168.1.859:38022 192.168.1.803:50010 ESTABLISHED 11430/java
tcp6 0 0 192.168.1.859:47796 192.168.1.859:50010 ESTABLISHED 29418/java

[root@slave3 ~]# ps -ef | grep 11430
root 11430 1 0 May03 ? 2-03:40:55 /opt/hadoop/jdk1.8.0_77/bin/java -Dproc_datanode -Xmx16384m -Djava.net.preferIPv4Stack=true -Dhadoop.log.dir=/opt/hadoop/hadoop-2.7.7/logs -Dhadoop.log.file=hadoop.log -Dhadoop.home.dir=/opt/hadoop/hadoop-2.7.7 -Dhadoop.id.str=root -Dhadoop.root.logger=INFO,console -Djava.library.path=/opt/hadoop/hadoop-2.7.7/lib/native -Dhadoop.policy.file=hadoop-policy.xml -Djava.net.preferIPv4Stack=true -Djava.net.preferIPv4Stack=true -Djava.net.preferIPv4Stack=true -Dhadoop.log.dir=/opt/hadoop/hadoop-2.7.7/logs -Dhadoop.log.file=hadoop-root-datanode-slave3.log -Dhadoop.home.dir=/opt/hadoop/hadoop-2.7.7 -Dhadoop.id.str=root -Dhadoop.root.logger=INFO,RFA -Djava.library.path=/opt/hadoop/hadoop-2.7.7/lib/native -Dhadoop.policy.file=hadoop-policy.xml -Djava.net.preferIPv4Stack=true -server -Dhadoop.security.logger=ERROR,RFAS -Dhadoop.security.logger=ERROR,RFAS -Dhadoop.security.logger=ERROR,RFAS -Dhadoop.security.logger=INFO,RFAS org.apache.hadoop.hdfs.server.datanode.DataNode
root 11571 1179 0 16:46 pts/0 00:00:00 grep --color=auto 11430

[root@slave3 ~]# jps
15488 HQuorumPeer
11665 Jps
5749 NodeManager
11430 DataNode
29418 HRegionServer
So port 50010 belongs to the DataNode.

Conclusion: HiveServer2 holds a large number of connections to the DataNodes that were never closed. (CLOSE_WAIT means the remote side has already closed the connection but the local application never called close() on its socket - a classic connection leak.)
3. Fixing it

After killing the HiveServer2 process, ss -s shows the socket count drop dramatically, and netstat becomes responsive again. (With this many sockets open, netstat crawls; ss stays fast.)
[root@master ~]# ss -s
Total: 983 (kernel 14276)
TCP:   640 (estab 199, closed 84, orphaned 0, synrecv 0, timewait 82/0), ports 0

Transport Total     IP        IPv6
*         14276     -         -
RAW       1         0         1
UDP       15        8         7
TCP       556       155       401
INET      572       163       409
FRAG      0         0         0
Hypothesis 1: I have occasionally connected to HiveServer2 from a client and killed the client when a query ran slowly, leaving the HiveServer2-to-DataNode connections open. But I run only a handful of such queries a year - far too few to explain this many leaked connections. Unlikely.

Hypothesis 2: some application code silently opens HiveServer2 connections without closing them. Also unlikely: after I killed HiveServer2, the applications kept running normally.
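Until the root cause inside HiveServer2 is found, a stopgap is to watch the CLOSE-WAIT count and raise an alarm before the ports run out again. A minimal sketch (not from the original post; the threshold value and the reaction are placeholders to adapt):

```shell
#!/bin/sh
# Stopgap monitor: warn when CLOSE-WAIT sockets pile up again.
# THRESHOLD is an arbitrary example value; tune it for your cluster.
THRESHOLD=10000

# "ss ... state close-wait" filters in the kernel; tail skips the header line.
count=$(ss -tan state close-wait | tail -n +2 | wc -l)

if [ "$count" -gt "$THRESHOLD" ]; then
    echo "WARNING: $count sockets in CLOSE-WAIT (threshold $THRESHOLD)"
    # e.g. restart HiveServer2 here, or page an operator
fi
```

Run it from cron every few minutes; restarting HiveServer2 releases the leaked sockets, as shown by the ss -s output above, so an automated restart is a workable (if blunt) mitigation.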