Spark Debug

https://blog.csdn.net/weixin_40901056/article/details/90546701

 

1. Error: succeeds on large datasets, fails on small datasets
java.io.EOFException: Premature EOF: no length prefix available
    at org.apache.hadoop.hdfs.protocolPB.PBHelper.vintPrefixed(PBHelper.java:2326)
    at org.apache.hadoop.hdfs.protocol.datatransfer.PipelineAck.readFields(PipelineAck.java:244)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer$ResponseProcessor.run(DFSOutputStream.java:734)
org.apache.spark.SparkException: Exception thrown in awaitResult:
Caused by: java.util.concurrent.TimeoutException: Futures timed out after [300 seconds]
    at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
    at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
    at org.apache.spark.util.ThreadUtils$.awaitResultInForkJoinSafely(ThreadUtils.scala:212)
    ... 222 more
19/05/24 10:43:57 ERROR ApplicationMaster: User class threw exception: org.apache.spark.SparkException: Job aborted.
org.apache.spark.SparkException: Job aborted.
Caused by: org.apache.spark.SparkException: Exception thrown in awaitResult: 
    at org.apache.spark.util.ThreadUtils$.awaitResultInForkJoinSafely(ThreadUtils.scala:215)
Caused by: java.util.concurrent.TimeoutException: Futures timed out after [300 seconds]
    at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)

Root cause:
(1) --conf spark.sql.autoBroadcastJoinThreshold=200000000 causes a broadcast join to be triggered when the user-segment data is small (< 200 MB), and the broadcast takes too long and fails with the errors above.
(2) Even after adding --conf spark.sql.broadcastTimeout=3600, the job still fails (see the sketch below).
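A minimal configuration sketch (mine, not from the original post): setting spark.sql.autoBroadcastJoinThreshold to -1 disables automatic broadcast joins altogether, which is one way to avoid this code path; 3600 is the broadcast timeout value tried above.

import org.apache.spark.sql.SparkSession

// Sketch only: the values are illustrative, not the original job's settings.
val spark = SparkSession.builder()
  .appName("broadcast-join-tuning")
  // -1 disables auto-broadcast joins; 200000000 (~200 MB) was the value that caused the problem.
  .config("spark.sql.autoBroadcastJoinThreshold", "-1")
  // Raising the broadcast timeout alone did not help in this case.
  .config("spark.sql.broadcastTimeout", "3600")
  .enableHiveSupport()
  .getOrCreate()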

2. Error: too many ORC files; resolved by increasing --driver-memory
java.io.EOFException: Premature EOF: no length prefix available
 at org.apache.hadoop.hdfs.protocolPB.PBHelper.vintPrefixed(PBHelper.java:2326)
 at org.apache.hadoop.hdfs.protocol.datatransfer.PipelineAck.readFields(PipelineAck.java:244)
 at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer$ResponseProcessor.run(DFSOutputStream.java:734)
19/07/12 17:04:30 WARN DFSClient: Error Recovery for block BP-1670631107-10.198.28.233-1518587510448:blk_1080281905_6670799 in pipeline DatanodeInfoWithStorage[11.7.133.135:50010,DS-0b250dac-7bc7-49ca-991a-13a4bd603f3a,DISK], DatanodeInfoWithStorage[10.198.117.40:50010,DS-a4adda03-69b9-49e8-9b2b-8b1fb2d8a668,DISK], DatanodeInfoWithStorage[10.198.28.42:50010,DS-d0e0b34c-85d1-4d6f-b6e9-ecdf1aa10a67,DISK]: bad datanode DatanodeInfoWithStorage[11.7.133.135:50010,DS-0b250dac-7bc7-49ca-991a-13a4bd603f3a,DISK]
19/07/12 17:08:35 WARN DFSClient: DFSOutputStream ResponseProcessor exception  for block BP-1670631107-10.198.28.233-1518587510448:blk_1080281905_6671099
java.io.EOFException: Premature EOF: no length prefix available
 at org.apache.hadoop.hdfs.protocolPB.PBHelper.vintPrefixed(PBHelper.java:2326)
 at org.apache.hadoop.hdfs.protocol.datatransfer.PipelineAck.readFields(PipelineAck.java:244)
 at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer$ResponseProcessor.run(DFSOutputStream.java:734)
19/07/12 17:08:35 WARN DFSClient: Error Recovery for block BP-1670631107-10.198.28.233-1518587510448:blk_1080281905_6671099 in pipeline DatanodeInfoWithStorage[10.198.117.40:50010,DS-a4adda03-69b9-49e8-9b2b-8b1fb2d8a668,DISK], DatanodeInfoWithStorage[10.198.28.42:50010,DS-d0e0b34c-85d1-4d6f-b6e9-ecdf1aa10a67,DISK], DatanodeInfoWithStorage[10.198.127.131:50010,DS-ecb4fd12-503a-429b-bade-eda4c47d9160,DISK]: bad datanode DatanodeInfoWithStorage[10.198.117.40:50010,DS-a4adda03-69b9-49e8-9b2b-8b1fb2d8a668,DISK]

Root cause:
(1) The first run failed; the automatic retry succeeded, and the reason was initially unclear.
(2) Guess: caused by the table having too many ORC files; even adding .config("hive.exec.orc.split.strategy", "BI") did not help.
(3) Increasing driver-memory fixed it; presumably the ORC file metadata consumes a large amount of driver memory.
(4) Reference: https://community.hortonworks.com/articles/75501/orc-creation-best-practices.html

Solution:
(1) Increase driver memory, e.g. --driver-memory 8G.
(2) Increasing driver memory also speeds up reading the ORC table (when the BI strategy is not set); a small diagnostic sketch follows.
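As a rough sanity check (my own sketch, not from the original post), the number of files under the table's HDFS location can be counted before deciding whether the driver really needs more memory; the path below is a placeholder.

import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("orc-file-count").enableHiveSupport().getOrCreate()

// Placeholder location of the ORC table; substitute the real warehouse path.
val tablePath = new Path("/user/hive/warehouse/some_db.db/some_orc_table")
val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)

// Count leaf files recursively; a very large count means split planning will read many footers
// and keep a lot of file metadata on the driver.
val files = fs.listFiles(tablePath, true)
var n = 0L
while (files.hasNext) { files.next(); n += 1 }
println(s"ORC files under $tablePath: $n")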

3. Reading ORC tables takes a very long time
.config("hive.exec.orc.split.strategy", "BI")
.config("hive.exec.orc.default.stripe.size", 268435456L)

ORC split generation has three strategies (ETL, BI, HYBRID); the default is HYBRID: The HYBRID mode reads the footers for all files if there are fewer files than expected mapper count, switching over to generating 1 split per file if the average file sizes are smaller than the default HDFS blocksize. ETL strategy always reads the ORC footers before generating splits, while the BI strategy generates per-file splits fast without reading any data from HDFS.
(See https://cwiki.apache.org/confluence/display/Hive/Configuration+Properties#ConfigurationProperties-hive.exec.orc.split.strategy)
ETL: reads the footer of every ORC file, then generates splits according to the block size.
BI: each ORC file becomes one split; no metadata is read.
HYBRID: reads every ORC footer, then chooses between ETL and BI based on the average file size.

When we run SQL against an ORC-format Hive table, even a very simple job can spend a long time just generating tasks. The reason is that Spark reads the ORC files with the ETL strategy, which scans the footer/index of every ORC file; only after that scan does split generation (and hence task creation) actually start, so with many files it is slow. After switching the strategy to BI, each ORC file directly becomes one split, which speeds up task generation considerably; however, if individual files are very large, downstream processing may get slower.
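A minimal sketch of setting the two options mentioned above when building the session (assuming Spark 2.x with Hive support; the table name is a placeholder):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("orc-bi-split")
  // BI: one split per ORC file, no footer scan, so task generation is fast.
  .config("hive.exec.orc.split.strategy", "BI")
  // Write-side stripe size (256 MB), so newly written files stay a reasonable size for BI splits.
  .config("hive.exec.orc.default.stripe.size", "268435456")
  .enableHiveSupport()
  .getOrCreate()

// Placeholder table; note that driver memory (--driver-memory 8G) must still be set at submit time.
val df = spark.table("some_db.some_orc_table")
println(df.count())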

4. Futures timed out (executor heartbeat)
19/07/16 16:11:26 WARN Executor: Issue communicating with driver in heartbeater
org.apache.spark.SparkException: Error sending message [message = Heartbeat(58,[Lscala.Tuple2;@1037afd9,BlockManagerId(58, host-11-17-152-29.svc.ht1.n.jd.local, 43890, None))]
    at org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:119)
    at org.apache.spark.executor.Executor.org$apache$spark$executor$Executor$$reportHeartBeat(Executor.scala:688)
    at org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply$mcV$sp(Executor.scala:717)
    at org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply(Executor.scala:717)
    at org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply(Executor.scala:717)
    at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1962)
    at org.apache.spark.executor.Executor$$anon$1.run(Executor.scala:717)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.spark.rpc.RpcTimeoutException: Futures timed out after [10 seconds]. This timeout is controlled by spark.executor.heartbeatInterval
    at org.apache.spark.rpc.RpcTimeout.org$apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcTimeout.scala:48)
    at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:63)
    at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59)
    at scala.PartialFunction$OrElse.apply(PartialFunction.scala:167)
    at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:83)
    at org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:102)
    ... 13 more
Caused by: java.util.concurrent.TimeoutException: Futures timed out after [10 seconds]
    at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
    at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
    at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:190)
    at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
    at scala.concurrent.Await$.result(package.scala:190)
    at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:81)

(1) Cause 1: unknown.
Raising --conf spark.executor.heartbeatInterval=20s did not fix the problem.
Reference: https://issues.apache.org/jira/browse/SPARK-14140
(2) Cause 2: insufficient driver memory.
Increasing driver-memory, or setting the permanent generation via spark.driver.extraJavaOptions="-XX:PermSize=2g -XX:MaxPermSize=2g", did not fix the problem either.
From the Spark source, private val HEARTBEAT_MAX_FAILURES = conf.getInt("spark.executor.heartbeat.maxFailures", 60), the executor only kills itself after 60 consecutive failed heartbeat attempts, which is why the job appears to hang for a long time. (The relevant settings are sketched below.)
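For reference, a hedged sketch of the settings discussed in this section; spark.network.timeout is not mentioned in the original post but is commonly kept well above the heartbeat interval, and spark.driver.extraJavaOptions / driver memory are normally passed on the spark-submit command line rather than set in code.

import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

// Illustrative values only; in the case above neither change actually fixed the hang.
val conf = new SparkConf()
  .set("spark.executor.heartbeatInterval", "20s")
  .set("spark.network.timeout", "120s")  // assumption: not part of the original fix attempts

val spark = SparkSession.builder().config(conf).getOrCreate()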

 

 
