[Solved] Submitting a job from a local machine to a Spark cluster fails with: Initial job has not accepted any resources


The error output was:

18/04/17 18:18:14 INFO TaskSchedulerImpl: Adding task set 0.0 with 2 tasks
18/04/17 18:18:29 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources


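Submitting the same Python file from one of the cluster machines worked fine. I then tried running the example that ships with Spark from my local machine, and the problem persisted.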

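Although this is only a WARN, the job never actually ran, and it stayed in the running state in the Spark web UI. These are the commands I ran on my local machine and on the cluster, respectively: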

bin\spark-submit --master spark://192.168.3.207:7077 examples\src\main\python\pi.py
./spark-submit --master spark://192.168.3.207:7077 ../examples/src/main/python/pi.py
Both run the example that ships with Spark.
The fixes I found online boiled down to two. Neither worked for me, but I am recording them here:

1) Increase the memory available to the job:

bin\spark-submit --driver-memory 2000M --executor-memory 2000M --master spark://192.168.3.207:7077 examples\src\main\python\pi.py

2) Adjust the firewall to allow Spark traffic, or temporarily turn it off.
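Before changing firewall rules, it can help to check plain TCP reachability first. The following is a minimal sketch, not part of Spark: `can_connect` is a hypothetical helper name, and the host and port in the usage comment are simply the master address from the commands above. Traffic flows in both directions (the workers also open connections back to the submitting machine), so the same check is worth running from a worker toward the driver.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable networks.
        return False


# Example usage (run manually; the address is the master from the commands above):
#   can_connect("192.168.3.207", 7077)
```

A `False` result only says that the TCP handshake failed. It does not tell a firewall apart from a wrong address or a service that is not listening.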


Next I checked the master and slave logs, which showed no errors either. Then I opened the master web UI at http://192.168.3.207:8080/ and clicked into the job I had just submitted:


Clicking the stderr link of one of the workers showed the following:

18/04/17 18:55:54 INFO CoarseGrainedExecutorBackend: Started daemon with process name: 23412@he-200
18/04/17 18:55:54 INFO SignalUtils: Registered signal handler for TERM
18/04/17 18:55:54 INFO SignalUtils: Registered signal handler for HUP
18/04/17 18:55:54 INFO SignalUtils: Registered signal handler for INT
18/04/17 18:55:55 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
18/04/17 18:55:55 INFO SecurityManager: Changing view acls to: he,shaowei.liu
18/04/17 18:55:55 INFO SecurityManager: Changing modify acls to: he,shaowei.liu
18/04/17 18:55:55 INFO SecurityManager: Changing view acls groups to: 
18/04/17 18:55:55 INFO SecurityManager: Changing modify acls groups to: 
18/04/17 18:55:55 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(he, shaowei.liu); groups with view permissions: Set(); users  with modify permissions: Set(he, shaowei.liu); groups with modify permissions: Set()
Exception in thread "main" java.lang.reflect.UndeclaredThrowableException
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1713)
at org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:66)
at org.apache.spark.executor.CoarseGrainedExecutorBackend$.run(CoarseGrainedExecutorBackend.scala:188)
at org.apache.spark.executor.CoarseGrainedExecutorBackend$.main(CoarseGrainedExecutorBackend.scala:284)
at org.apache.spark.executor.CoarseGrainedExecutorBackend.main(CoarseGrainedExecutorBackend.scala)
Caused by: org.apache.spark.rpc.RpcTimeoutException: Cannot receive any reply from 192.168.56.1:51378 in 120 seconds. This timeout is controlled by spark.rpc.askTimeout
at org.apache.spark.rpc.RpcTimeout.org$apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcTimeout.scala:47)
at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:62)
at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:58)
...
Caused by: java.util.concurrent.TimeoutException: Cannot receive any reply from 192.168.56.1:51378 in 120 seconds
... 8 more

18/04/17 18:57:55 ERROR RpcOutboxMessage: Ask timeout before connecting successfully
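When the stderr is long, the address the executor failed to reach can be pulled out with a small script. This is only a sketch: the regular expression is modeled on the wording of the `RpcTimeoutException` message in the log above, and `find_unreachable_endpoints` is a hypothetical helper name, not a Spark API.

```python
import re

# Matches e.g. "Cannot receive any reply from 192.168.56.1:51378 in 120 seconds"
TIMEOUT_RE = re.compile(
    r"Cannot receive any reply from (\d{1,3}(?:\.\d{1,3}){3}):(\d+) in \d+ seconds"
)


def find_unreachable_endpoints(log_text: str) -> set:
    """Return the distinct (ip, port) pairs named in RPC timeout messages."""
    return {(m.group(1), int(m.group(2))) for m in TIMEOUT_RE.finditer(log_text)}
```

The resulting addresses can then be compared against the interfaces on the submitting machine.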


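The log reports a timeout connecting to 192.168.56.1:51378. But where did that IP come from? Running ipconfig on my machine gave the answer: 192.168.56.1 is the IP of the VirtualBox virtual network that Docker created on my machine. Apparently, when the job was submitted from my machine, Spark did not pick up the correct local IP address, so the cluster nodes kept timing out when they tried to connect back. The fix is simple: disable that network adapter.
I tried again, and the job finished quickly.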
bin\spark-submit --master spark://192.168.3.207:7077 examples\src\main\python\pi.py
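An alternative to disabling the adapter, which was not tried in the troubleshooting above, is to tell Spark explicitly which address the driver should advertise through the `spark.driver.host` property. The sketch below rests on two assumptions: that the interface the OS uses to route to the master is the one the workers can reach, and that setting `spark.driver.host` is enough in this setup. `local_ip_towards` and `build_submit_args` are hypothetical helper names.

```python
import socket


def local_ip_towards(host: str, port: int = 7077) -> str:
    """Return the local IPv4 address the OS would use to reach `host`."""
    # Connecting a UDP socket sends no packets; it only makes the OS pick
    # a route, and with it a local source address.
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        s.connect((host, port))
        return s.getsockname()[0]


def build_submit_args(master_host: str, script: str,
                      master_port: int = 7077,
                      spark_submit: str = "bin/spark-submit") -> list:
    """Build a spark-submit argument list that pins spark.driver.host."""
    driver_ip = local_ip_towards(master_host, master_port)
    return [
        spark_submit,
        "--master", f"spark://{master_host}:{master_port}",
        # Assumption: advertising this address lets executors connect back.
        "--conf", f"spark.driver.host={driver_ip}",
        script,
    ]
```

The returned list can be passed to `subprocess.run`. On Windows, the `spark_submit` argument would be `bin\spark-submit`, as in the commands above.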

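Looking at the logs in the web UI again, I saw that the cluster node connects back to my machine, fetches my job file pi.py into a temporary directory /tmp/spark-xxx/ on the node, and copies it into $SPARK_HOME/work/ before actually running it. I will study the exact flow when I have time. Here is the log for reference: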

18/04/17 19:13:11 INFO TransportClientFactory: Successfully created connection to /192.168.0.138:51843 after 3 ms (0 ms spent in bootstraps)
18/04/17 19:13:11 INFO DiskBlockManager: Created local directory at /tmp/spark-67d75b11-65e7-4bc7-89b5-c07fb159470f/executor-b8ce41a3-7c6e-49f6-95ef-7ed6cdef8e53/blockmgr-030eb78d-e46b-4feb-b7b7-108f9e61ec85
18/04/17 19:13:11 INFO MemoryStore: MemoryStore started with capacity 366.3 MB
18/04/17 19:13:12 INFO CoarseGrainedExecutorBackend: Connecting to driver: spark://CoarseGrainedScheduler@192.168.0.138:51843
18/04/17 19:13:12 INFO WorkerWatcher: Connecting to worker spark://Worker@192.168.3.102:34041
18/04/17 19:13:12 INFO TransportClientFactory: Successfully created connection to /192.168.3.102:34041 after 0 ms (0 ms spent in bootstraps)
18/04/17 19:13:12 INFO WorkerWatcher: Successfully connected to spark://Worker@192.168.3.102:34041
18/04/17 19:13:12 INFO BlockManagerMaster: Registering BlockManager BlockManagerId(1, 192.168.3.102, 44683, None)
18/04/17 19:13:12 INFO BlockManagerMaster: Registered BlockManager BlockManagerId(1, 192.168.3.102, 44683, None)
18/04/17 19:13:12 INFO BlockManager: Initialized BlockManager: BlockManagerId(1, 192.168.3.102, 44683, None)
18/04/17 19:13:14 INFO CoarseGrainedExecutorBackend: Got assigned task 0
18/04/17 19:13:14 INFO Executor: Running task 0.0 in stage 0.0 (TID 0)
18/04/17 19:13:14 INFO Executor: Fetching spark://192.168.0.138:51843/files/pi.py with timestamp 1523963609005
18/04/17 19:13:14 INFO TransportClientFactory: Successfully created connection to /192.168.0.138:51843 after 1 ms (0 ms spent in bootstraps)
18/04/17 19:13:14 INFO Utils: Fetching spark://192.168.0.138:51843/files/pi.py to /tmp/spark-67d75b11-65e7-4bc7-89b5-c07fb159470f/executor-b8ce41a3-7c6e-49f6-95ef-7ed6cdef8e53/spark-98745f3b-2f70-47b2-8c56-c5b9f6eac496/fetchFileTemp2255624304256249008.tmp
18/04/17 19:13:14 INFO Utils: Copying /tmp/spark-67d75b11-65e7-4bc7-89b5-c07fb159470f/executor-b8ce41a3-7c6e-49f6-95ef-7ed6cdef8e53/spark-98745f3b-2f70-47b2-8c56-c5b9f6eac496/-11088979641523963609005_cache to /home/ubutnu/spark_2_2_1/work/app-20180417191311-0005/1/./pi.py
……
18/04/17 19:13:14 INFO TransportClientFactory: Successfully created connection to /192.168.0.138:51866 after 5 ms (0 ms spent in bootstraps)
18/04/17 19:13:14 INFO Executor: Finished task 0.0 in stage 0.0 (TID 0). 1803 bytes result sent to driver
……
18/04/17 19:13:16 INFO CoarseGrainedExecutorBackend: Driver commanded a shutdown
18/04/17 19:13:16 INFO MemoryStore: MemoryStore cleared
18/04/17 19:13:16 INFO ShutdownHookManager: Shutdown hook called
18/04/17 19:13:16 INFO ShutdownHookManager: Deleting directory /tmp/spark-67d75b11-65e7-4bc7-89b5-c07fb159470f/executor-b8ce41a3-7c6e-49f6-95ef-7ed6cdef8e53/spark-98745f3b-2f70-47b2-8c56-c5b9f6eac496
