Spark error: ERROR shuffle.RetryingBlockFetcher: Exception while beginning fetch of 1 outstanding blocks

冷漠；

Last modified 2023-09-06 17:06:14

First published 2022-09-20 17:47:03

Article link: https://blog.csdn.net/qq_45124566/article/details/126739686

Error

While running a Spark job today I hit the following error:

ERROR shuffle.RetryingBlockFetcher: Exception while beginning fetch of 1 outstanding blocks 
java.io.IOException: Failed to connect to hostname/192.168.xx.xxx:50002
 at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:245)
 at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:187)
 at org.apache.spark.network.netty.NettyBlockTransferService$$anon$2.createAndStart(NettyBlockTransferService.scala:114)
 at org.apache.spark.network.shuffle.RetryingBlockFetcher.fetchAllOutstanding(RetryingBlockFetcher.java:141)
 at org.apache.spark.network.shuffle.RetryingBlockFetcher.start(RetryingBlockFetcher.java:121)
 at org.apache.spark.network.netty.NettyBlockTransferService.fetchBlocks(NettyBlockTransferService.scala:124)
 at org.apache.spark.network.BlockTransferService.fetchBlockSync(BlockTransferService.scala:98)
 at org.apache.spark.storage.BlockManager.getRemoteBytes(BlockManager.scala:757)
 at org.apache.spark.scheduler.TaskResultGetter$$anon$3$$anonfun$run$1.apply$mcV$sp(TaskResultGetter.scala:82)
 at org.apache.spark.scheduler.TaskResultGetter$$anon$3$$anonfun$run$1.apply(TaskResultGetter.scala:63)
 at org.apache.spark.scheduler.TaskResultGetter$$anon$3$$anonfun$run$1.apply(TaskResultGetter.scala:63)
 at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1945)
 at org.apache.spark.scheduler.TaskResultGetter$$anon$3.run(TaskResultGetter.scala:62)
 at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
 at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
 at java.lang.Thread.run(Thread.java:748)
 22/09/07 09:15:09 INFO RetryingBlockFetcher: Retrying fetch (1/3) for 1 outstanding blocks after 5000 ms
 ...

The key lines are:
shuffle.RetryingBlockFetcher: Exception while beginning fetch of 1 outstanding blocks
java.io.IOException: Failed to connect to hostname/192.168.xx.xxx:50002

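Cause

The error occurs because an executor died. Something tried to fetch data from that executor (I assumed a shuffle read), but the executor holding the data was already gone, so the fetch failed. Note that the TaskResultGetter and BlockManager.getRemoteBytes frames in this particular trace suggest the driver was fetching a task result block rather than shuffle data; either way, the remote executor could not be reached.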

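A shuffle has two parts: shuffle write and shuffle read.
The number of shuffle write tasks is determined by the number of partitions of the RDD in the previous stage, while the number of shuffle read partitions is controlled by Spark configuration parameters.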

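Shuffle write can be thought of as something like a saveAsLocalDiskFile operation: each task temporarily writes its intermediate results, organized by some partitioning rule, to the local disk of the executor it runs on.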

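Because the shuffle read partition count comes from configuration, a value that is too small combined with a large shuffle read volume forces each task to process a very large amount of data. That can crash the JVM, so fetching the shuffle data fails and the executor is lost as well. The Failed to connect to host error is what an executor loss looks like from the fetching side. Even when the JVM does not crash, oversized tasks can cause long GC pauses.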
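
The relationship between shuffle volume, partition count, and per-task load can be sketched with a little arithmetic. The snippet below is a rough illustration only: the 500 GiB shuffle size and the 128 MiB per-task target are made-up numbers, the helper names are hypothetical, and it assumes keys are evenly distributed.

```python
import math

GIB = 1024 ** 3
MIB = 1024 ** 2


def bytes_per_task(total_shuffle_bytes: int, num_partitions: int) -> float:
    """Average shuffle-read bytes per reduce task, assuming an even key distribution."""
    if num_partitions <= 0:
        raise ValueError("num_partitions must be positive")
    return total_shuffle_bytes / num_partitions


def partitions_for_target(total_shuffle_bytes: int, target_bytes_per_task: int) -> int:
    """Smallest partition count that keeps the average task at or below the target size."""
    if target_bytes_per_task <= 0:
        raise ValueError("target_bytes_per_task must be positive")
    return max(1, math.ceil(total_shuffle_bytes / target_bytes_per_task))


# Illustrative numbers, not measurements from the failing job.
total = 500 * GIB
print(bytes_per_task(total, 200) / GIB)         # GiB each task reads with 200 partitions
print(partitions_for_target(total, 128 * MIB))  # partitions needed for ~128 MiB per task
```

With skewed keys the largest task can be far bigger than this average, so treat the result as a lower bound on the partition count.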

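Solution

Once the cause is known, the fix follows. There are two angles: reduce the amount of data shuffled, and increase the number of partitions that process it.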

Reduce shuffle data

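Consider whether a map-side join or a broadcast join can avoid the shuffle altogether. Filter out unneeded data before the shuffle: if the raw data has 20 fields, select only the fields you actually need, which cuts down the shuffle volume.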
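
To make the idea concrete, here is a plain-Python sketch of what a map-side (broadcast) join does conceptually: the small table is turned into an in-memory lookup that every task can consult, so rows of the large table never need to be repartitioned by key. This is not Spark API code; the function and the field names (user_id, amount, name) are hypothetical.

```python
from typing import Dict, Iterable, Iterator, List


def map_side_join(
    large_rows: Iterable[Dict],
    small_rows: Iterable[Dict],
    key: str,
    keep_fields: List[str],
) -> Iterator[Dict]:
    """Inner-join large_rows with small_rows on `key` without regrouping large_rows."""
    # The small side plays the role of the broadcast copy: one dict lookup per row.
    lookup = {row[key]: row for row in small_rows}
    for row in large_rows:
        match = lookup.get(row[key])
        if match is None:
            continue  # inner join: rows without a match are dropped
        # Column pruning: carry only the fields that are needed downstream.
        pruned = {field: row[field] for field in keep_fields}
        yield {**pruned, **match}


orders = [
    {"user_id": 1, "amount": 10, "note": "unused"},
    {"user_id": 2, "amount": 5, "note": "unused"},
    {"user_id": 3, "amount": 7, "note": "unused"},
]
users = [{"user_id": 1, "name": "a"}, {"user_id": 2, "name": "b"}]

for joined in map_side_join(orders, users, "user_id", ["user_id", "amount"]):
    print(joined)
```

The approach only works when the small side fits comfortably in memory on every executor.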
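
Spark SQL and DataFrame operations such as join and group by

spark.sql.shuffle.partitions controls the number of shuffle partitions for Spark SQL and DataFrame operations. The default is 200; raise it according to the shuffle volume and the complexity of the computation.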
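
RDD operations such as join, groupBy, and reduceByKey

spark.default.parallelism controls the number of partitions used for shuffle read and reduce in RDD operations. When it is not set, shuffle operations such as reduceByKey and join use the largest partition count among the parent RDDs; for operations without a parent RDD, the default is the total number of cores running the job (8 in Mesos fine-grained mode, the number of local cores in local mode). The official recommendation is 2-3 tasks per core, that is, 2 to 3 times the total core count.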

Increase executor memory

Raise spark.executor.memory to a suitably larger value.
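
The three settings above can be passed to spark-submit as --conf key=value flags. The helper below is a small sketch that derives a parallelism value from the 2-3 tasks per core rule and renders the flags. The cluster size (10 executors with 4 cores each), the values 800 and 8g, and the jar name are placeholders, and both function names are hypothetical.

```python
from typing import Dict, List, Union


def recommended_parallelism(
    num_executors: int, cores_per_executor: int, tasks_per_core: int = 3
) -> int:
    """Total cores times the number of tasks per core (2-3 is the usual range)."""
    if min(num_executors, cores_per_executor, tasks_per_core) <= 0:
        raise ValueError("all arguments must be positive")
    return num_executors * cores_per_executor * tasks_per_core


def to_submit_args(conf: Dict[str, Union[int, str]]) -> List[str]:
    """Render settings as repeated `--conf key=value` flags, sorted by key."""
    args: List[str] = []
    for key in sorted(conf):
        args += ["--conf", f"{key}={conf[key]}"]
    return args


# Placeholder values; tune them against the real shuffle volume and cluster size.
conf = {
    "spark.sql.shuffle.partitions": 800,
    "spark.default.parallelism": recommended_parallelism(10, 4),
    "spark.executor.memory": "8g",
}
print(" ".join(["spark-submit", *to_submit_args(conf), "your-job.jar"]))
```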

Run more tasks in parallel to increase the job's parallelism

More parallel tasks means less data per task.
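
Check for data skew

Have null values been filtered out? Can abnormal data (a key with an unusually large amount of data) be handled separately? Consider changing the partitioning scheme.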
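
One common way to act on these questions is to drop null keys, detect hot keys, and "salt" the hot ones so their rows spread across several partitions. The plain-Python sketch below only illustrates the idea; the 50% threshold, the number of salts, the `#` separator, and the sample keys are all assumptions, and salting is a technique named here for illustration rather than something the steps above prescribe.

```python
import random
from collections import Counter
from typing import Hashable, Iterable, Set


def find_hot_keys(keys: Iterable[Hashable], threshold_ratio: float = 0.5) -> Set[Hashable]:
    """Keys whose share of all rows exceeds threshold_ratio."""
    counts = Counter(keys)
    total = sum(counts.values())
    return {k for k, c in counts.items() if c / total > threshold_ratio}


def salt_key(key: Hashable, hot_keys: Set[Hashable], num_salts: int, rng: random.Random) -> str:
    """Append a random suffix to hot keys so their rows land in different partitions."""
    if key in hot_keys:
        return f"{key}#{rng.randrange(num_salts)}"
    return str(key)


raw_keys = ["a"] * 90 + ["b"] * 5 + ["c"] * 5 + [None] * 10
keys = [k for k in raw_keys if k is not None]  # filter null keys first
hot = find_hot_keys(keys)                      # "a" holds 90 of 100 rows

rng = random.Random(0)  # seeded only to make the sketch repeatable
salted = [salt_key(k, hot, 10, rng) for k in keys]
print(Counter(salted).most_common(3))
```

After aggregating on the salted keys, strip the suffix and aggregate once more to get the final per-key result.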

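Other approaches that have worked in practice

If you use broadcast variables, reduce the size of the broadcast value or disable broadcasting. If the broadcasts come from Spark SQL's automatic broadcast joins, setting spark.sql.autoBroadcastJoinThreshold to -1 turns them off.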

[Figure: broadcast fetch data]

References:
https://www.cnblogs.com/double-kill/p/9012383.html
https://www.cnblogs.com/java-meng/p/15189266.html
https://www.jianshu.com/p/edd3ccc46980
