[Solved] Spark cluster problems on CentOS 6.2 (ongoing list)


System: CentOS 6.2

Nodes: 1 master, 16 workers

Spark version: 0.8.0

Kernel version: 2.6.32



Below are the problems I ran into and how I resolved them:

1. After one job finished, one node could no longer be connected; running jps on it showed a StandaloneExecutorBackend process that could not be killed.

Cause: unknown.

Solution: reboot the node and reconnect it to the cluster; a quick check sketch follows.
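
A minimal check sketch before rebooting (assuming shell access to the affected worker; <pid> is a placeholder for whatever jps prints):

# On the affected worker: confirm the orphaned executor process.
jps                    # look for StandaloneExecutorBackend
kill -9 <pid>          # per the above, even this may not clear it, hence the reboot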

2. The TaskTracker on a worker node fails to start.

Cause: the TaskTracker process on the worker was not shut down when the cluster was stopped.

Solution: after stopping the cluster, manually find the TaskTracker process on each worker node and kill it, as sketched below.
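
A minimal cleanup sketch (run on each worker; <pid> is a placeholder):

jps | grep TaskTracker        # note the pid of the leftover TaskTracker
kill <pid>                    # escalate to kill -9 if it will not exit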

3. Running start-master.sh reports: failed to launch org.apache.spark.deploy.master.Master

[root@hw024 spark-0.8.0-incubating]# ./bin/start-master.sh
starting org.apache.spark.deploy.master.Master, logging to /home/zhangqianlong/spark-0.8.0-incubating/bin/../logs/spark-root-org.apache.spark.deploy.master.Master-1-hw024.out
failed to launch org.apache.spark.deploy.master.Master:
Error: Could not find or load main class org.apache.spark.deploy.master.Master
full log in /home/zhangqianlong/spark-0.8.0-incubating/bin/../logs/spark-root-org.apache.spark.deploy.master.Master-1-hw024.out

Solution: run sbt/sbt clean assembly to rebuild the assembly jar, as sketched below.
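
A minimal rebuild-and-retry sketch (paths follow my install layout; adjust as needed):

cd $SPARK_HOME
sbt/sbt clean assembly        # rebuilds assembly/target/scala-2.9.3/spark-assembly_2.9.3-0.8.0-incubating-hadoop1.0.4.jar
./bin/start-master.sh         # the Master class is now on the classpath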

4. The job finished on the master and the printed result looked correct, but the worker nodes' /work/XX/stderr files reported an error.

cd $SPARK_HOME 
./run-example org.apache.spark.examples.SparkPi spark://hw024:7077

Standard output:

……

14/03/20 11:13:02 INFO scheduler.DAGScheduler: Stage 0 (reduce at SparkPi.scala:39) finished in 1.642 s

14/03/20 11:13:02 INFO cluster.ClusterScheduler: Remove TaskSet 0.0 from pool

14/03/20 11:13:02 INFO spark.SparkContext: Job finished: reduce at SparkPi.scala:39, took 1.708775428 s

Pi is roughly 3.13434

But the contents of /home/zhangqianlong/spark-0.8.0-incubating-bin-hadoop1/work/app-20140320111300-0008/8/stderr on the worker node were:

Spark Executor Command: "java" "-cp" ":/home/zhangqianlong/spark-0.8.0-incubating-bin-hadoop1/conf:/home/zhangqianlong/spark-0.8.0-incubating-bin-hadoop1/assembly/target/scala-2.9.3/spark-assembly_2.9.3-0.8.0-incubating-hadoop1.0.4.jar" "-Xms512M" "-Xmx512M" "org.apache.spark.executor.StandaloneExecutorBackend" "akka://spark@hw024:60929/user/StandaloneScheduler" "8" "hw018" "24"

====================================

14/03/20 11:05:15 INFO slf4j.Slf4jEventHandler: Slf4jEventHandler started

14/03/20 11:05:15 INFO executor.StandaloneExecutorBackend: Connecting to driver: akka://spark@hw024:60929/user/StandaloneScheduler

14/03/20 11:05:15 INFO executor.StandaloneExecutorBackend: Successfully registered with driver

14/03/20 11:05:15 INFO slf4j.Slf4jEventHandler: Slf4jEventHandler started

14/03/20 11:05:15 INFO spark.SparkEnv: Connecting to BlockManagerMaster: akka://spark@hw024:60929/user/BlockManagerMaster

14/03/20 11:05:15 INFO storage.MemoryStore: MemoryStore started with capacity 323.9 MB.

14/03/20 11:05:15 INFO storage.DiskStore: Created local directory at /tmp/spark-local-20140320110515-9151

14/03/20 11:05:15 INFO network.ConnectionManager: Bound socket to port 59511 with id = ConnectionManagerId(hw018,59511)

14/03/20 11:05:15 INFO storage.BlockManagerMaster: Trying to register BlockManager

14/03/20 11:05:15 INFO storage.BlockManagerMaster: Registered BlockManager

14/03/20 11:05:15 INFO spark.SparkEnv: Connecting to MapOutputTracker: akka://spark@hw024:60929/user/MapOutputTracker

14/03/20 11:05:15 INFO spark.HttpFileServer: HTTP File server directory is /tmp/spark-81a80beb-fd56-4573-9afe-ca9310d3ea8d

14/03/20 11:05:15 INFO server.Server: jetty-7.x.y-SNAPSHOT

14/03/20 11:05:15 INFO server.AbstractConnector: Started SocketConnector@0.0.0.0:56230

14/03/20 11:05:16 ERROR executor.StandaloneExecutorBackend: Driver terminated or disconnected! Shutting down.

This problem plagued me for a week, F**K!

After repeated discussions with other engineers: this error can be ignored. As long as the job runs to completion and hadoop fs -cat /XX/part-XXX prints the expected output, everything is fine. My guess is that it is a timing issue: the driver shuts down once the job completes, and executors that are still connected log "Driver terminated or disconnected" as they are torn down.

5. Runtime error: FileNotFoundException: too many open files

Cause: iterative computation opens too many temporary files.

Solution: raise the open-file limit on every node in /etc/security/limits.conf (note: do not SSH in, delete the file, and then copy a replacement over; that can leave the system impossible to log into). Restart Spark for the change to take effect; a sketch follows.
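
A minimal sketch of the limits.conf change (the 65535 value is my illustrative choice, not from the original setup; pick a limit that suits your workload):

# Append to /etc/security/limits.conf on every node -- edit in place, never delete-and-copy:
*    soft    nofile    65535
*    hard    nofile    65535
# Log out, log back in, and verify the new limit:
ulimit -n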


6. A program running on Spark crashed because the input data was too large (the same job succeeds with smaller input).

The error output:

14/04/15 16:14:33 INFO cluster.ClusterTaskSetManager: Starting task 2.0:92 as TID 594 on executor 24: hw028 (ANY)
14/04/15 16:14:33 INFO cluster.ClusterTaskSetManager: Serialized task 2.0:92 as 2119 bytes in 0 ms
14/04/15 16:14:33 INFO client.Client$ClientActor: Executor updated: app-20140415151451-0000/23 is now FAILED (Command exited with code 137)
14/04/15 16:14:33 INFO cluster.SparkDeploySchedulerBackend: Executor app-20140415151451-0000/23 removed: Command exited with code 137
14/04/15 16:14:33 ERROR client.Client$ClientActor: Master removed our application: FAILED; stopping client
14/04/15 16:14:33 ERROR cluster.SparkDeploySchedulerBackend: Disconnected from Spark cluster!
14/04/15 16:14:33 INFO cluster.ClusterScheduler: Remove TaskSet 2.0 from pool 
14/04/15 16:14:33 INFO cluster.ClusterScheduler: Ignoring update from TID 590 because its task set is gone
14/04/15 16:14:33 INFO cluster.ClusterScheduler: Ignoring update from TID 593 because its task set is gone
14/04/15 16:14:33 INFO scheduler.DAGScheduler: Failed to run count at PageRank.scala:43
Exception in thread "main" org.apache.spark.SparkException: Job failed: Error: Disconnected from Spark cluster
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:760)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:758)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:60)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:758)
at org.apache.spark.scheduler.DAGScheduler.processEvent(DAGScheduler.scala:379)
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$run(DAGScheduler.scala:441)
at org.apache.spark.scheduler.DAGScheduler$$anon$1.run(DAGScheduler.scala:149)

Cause: the RDDs are too large; each node holds too many RDD partitions and runs out of memory (exit code 137 is 128 + 9, i.e. the executor was killed with SIGKILL, typically by the kernel OOM killer).

Solution: add the parameter -Dspark.akka.frameSize=10000 (the value is in MB) to the launch command or to spark-env.sh, as sketched below.
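
A minimal spark-env.sh sketch (in Spark 0.8.x, JVM system properties were commonly passed through SPARK_JAVA_OPTS; treat the exact variable as an assumption of that era's convention):

# In conf/spark-env.sh:
export SPARK_JAVA_OPTS="$SPARK_JAVA_OPTS -Dspark.akka.frameSize=10000"    # value in MB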

7. "no recent heart beats" on worker nodes, caused by very large input data or a poor network.

Symptom: WARN storage.BlockManagerMasterActor: Removing BlockManager BlockManagerId(0, hw032, 39782, 0) with no recent heart beats: 46910ms exceeds 45000ms

Cause: with a poor network or a very large data volume, a worker fails to signal the master within the window (45 s by default), so the master assumes it has died.

Solution: add the parameter -Dspark.storage.blockManagerHeartBeatMs=60000 (the value is in milliseconds; 60000 ms = 60 s) to the launch command or to spark-env.sh, as sketched below.
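
The heartbeat timeout can be raised through the same assumed SPARK_JAVA_OPTS mechanism as in the previous sketch:

# In conf/spark-env.sh:
export SPARK_JAVA_OPTS="$SPARK_JAVA_OPTS -Dspark.storage.blockManagerHeartBeatMs=60000"    # value in ms (60 s)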
