Pitfall notes: Memory is not enough for task serialization: java.lang.OutOfMemoryError
Background: a daily Spark job suddenly started failing with "Memory is not enough for task serialization: java.lang.OutOfMemoryError"; the full stack trace follows the config sketch below. First suspicion: the input data had grown and the results with it, so we doubled the memory per core and reran. It still failed. Second suspicion: extremely long strings in the data, so we quadrupled the per-core memory, which is already quite generous. Same error. Brute-force memory increases clearly weren't going to solve this; the error message itself had to guide a code review.
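For the record, the brute-force attempts were just configuration bumps along these lines (a sketch with illustrative values, not the job's real settings):

// Illustrative only: doubling, then quadrupling, memory per core changed nothing.
val conf = new org.apache.spark.SparkConf()
  .set("spark.executor.memory", "16g")  // per-core memory = executor memory / cores
  .set("spark.executor.cores", "4")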
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Memory is not enough for task serialization: java.lang.OutOfMemoryError
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1895)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1883)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1882)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1882)
at org.apache.spark.scheduler.DAGScheduler.submitMissingTasks(DAGScheduler.scala:1173)
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitStage(DAGScheduler.scala:1071)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$submitWaitingChildStages$6.apply(DAGScheduler.scala:884)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$submitWaitingChildStages$6.apply(DAGScheduler.scala:883)
at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
at org.apache.spark.scheduler.DAGScheduler.submitWaitingChildStages(DAGScheduler.scala:883)
at org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:1433)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2113)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2065)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2054)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:738)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2122)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2143)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2175)
at org.apache.spark.internal.io.SparkHadoopWriter$.write(SparkHadoopWriter.scala:79)
... 53 more
Caused by: java.lang.OutOfMemoryError
at java.io.ByteArrayOutputStream.hugeCapacity(ByteArrayOutputStream.java:123)
at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:117)
at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:153)
at org.apache.spark.util.ByteBufferOutputStream.write(ByteBufferOutputStream.scala:41)
at java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1877)
at java.io.ObjectOutputStream$BlockDataOutputStream.setBlockDataMode(ObjectOutputStream.java:1786)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1189)
at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348)
at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:43)
at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:100)
at org.apache.spark.scheduler.DAGScheduler.submitMissingTasks(DAGScheduler.scala:1157)
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitStage(DAGScheduler.scala:1071)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$submitWaitingChildStages$6.apply(DAGScheduler.scala:884)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$submitWaitingChildStages$6.apply(DAGScheduler.scala:883)
at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
at org.apache.spark.scheduler.DAGScheduler.submitWaitingChildStages(DAGScheduler.scala:883)
at org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:1433)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2113)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2065)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2054)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
This error is odd, though: it doesn't point at any specific line of business code, only at the line that writes to HDFS, which isn't much to go on. Since the job fuses quite a few data sources, we checked the size of each input one by one; all had grown slightly in recent days. Suspecting that these small increases had pushed the full dataset over some limit, we sampled 1% of the user data at the input and reran. Astonishingly, it failed again with the same OOM. Very strange: 1% of the users and 100% of the users produce the identical error.
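The 1% sampling was a one-line change at the input, roughly this (userDF is a hypothetical name for the input DataFrame):

import org.apache.spark.sql.DataFrame

// Keep a 1% sample of users to shrink the run.
val sampledUsers: DataFrame = userDF.sample(withReplacement = false, fraction = 0.01, seed = 42L)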
The small-batch run did surface one extra piece of information, though: the requested allocation exceeded the VM's limit. A quick search shows that the JVM caps array lengths at just under Integer.MAX_VALUE, i.e. 2^31 - 1, so the program was asking for a single array beyond that limit:
Memory is not enough for task serialization:
java.lang.OutOfMemoryError: Requested array size exceeds VM limit
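The cap is easy to reproduce in isolation; a toy demonstration (this is the JVM's limit, not Spark's):

// Throws java.lang.OutOfMemoryError: Requested array size exceeds VM limit on
// HotSpot even with a huge heap: array lengths are Int-indexed, and the practical
// maximum is slightly below Int.MaxValue (2^31 - 1).
val tooBig = new Array[Byte](Int.MaxValue)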
Next step: bisect the inputs. Split them into three groups A, B and C. With all three commented out, the job runs. Keep A, drop B and C: it runs. Keep A and B, drop C: it still runs. So the problem is in group C's features. Yet group C looks logically harmless, and a review of each of its upstream sources again showed only slight, unremarkable growth.

Baffling, until one detail stood out: group C's data is used directly in the DataFrame code, without a Broadcast. Suppose group C is 2 GB. Without a broadcast, the data is captured in the task closure, so every task ships with its own full copy from the driver. Setting aside the network pressure of the driver sending all those copies, each task allocates 2 GB just to hold this one dataset; with 4 tasks per executor, that is 8 GB for a single piece of data, an easy way to hit OOM. It also matches the stack trace: the OOM fires in DAGScheduler.submitMissingTasks while JavaSerializerInstance.serialize writes the task into a ByteArrayOutputStream, whose backing byte array a multi-gigabyte closure pushes past the 2^31 - 1 length cap. And it would explain why the 1% sample failed identically: the oversized closure doesn't shrink with the input. With group C broadcast instead, all tasks on an executor share one copy, and memory use stays bounded. Spark broadcasts small datasets automatically (for joins, tables under the 10 MB default of spark.sql.autoBroadcastJoinThreshold); anything larger is your own call.
After broadcasting group C's data, the problem was solved outright. A remarkably well-hidden pit.
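A minimal sketch of the fix, assuming the real job builds group C as a driver-side lookup table; the names and the toy data size are illustrative, not the actual code:

import org.apache.spark.sql.SparkSession

object BroadcastGroupC {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("broadcast-group-c").getOrCreate()
    val sc = spark.sparkContext

    // Stand-in for the group-C lookup table built on the driver (~2 GB in the real job).
    val groupC: Map[Long, String] = (0L until 1000L).map(i => i -> s"feature-$i").toMap

    // Broadcast once: each executor keeps a single shared copy for all of its
    // tasks, instead of a full copy riding inside every serialized task closure.
    val groupCBc = sc.broadcast(groupC)

    val users = sc.parallelize(0L until 100000L)
    val enriched = users.map(id => (id, groupCBc.value.getOrElse(id % 1000L, "unknown")))

    println(enriched.take(3).mkString(", "))
    spark.stop()
  }
}

With the broadcast, the serialized task closure carries only a small handle; the payload is shipped once per executor and shared by all of its tasks, so both driver-side task serialization and executor-side memory stay bounded.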