
Analysis of Spark's StackOverflowError in Iterative Algorithm

Published 2015-10-14

Table of Contents
  1. Why the iterative algorithm can cause stack overflow error
  2. How to fix stack overflow error in iterative algorithm
  3. Demo of Checkpointing
  4. References

Recently, I partially implemented the GBDT algorithm on Spark. When testing its performance with a large number of iterations (>300), I ran into java.lang.StackOverflowError.

Why the iterative algorithm can cause stack overflow error

An iterative algorithm with many iterations often builds a long lineage, which produces a long/deep Java object tree (the DAG of RDD objects) that needs to be serialized as part of task creation. During serialization the whole object DAG is traversed recursively, and a deep enough graph overflows the stack.
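The effect is easy to reproduce outside Spark. The sketch below is plain-JVM Scala, not Spark itself; the `Step` class and the chosen depths are made up for illustration. It serializes a parent-linked chain of objects, analogous to an RDD that references its predecessor: default Java serialization recurses once per link, so a sufficiently deep chain overflows the stack.

```scala
import java.io.{ByteArrayOutputStream, ObjectOutputStream}

// A chain of objects analogous to a long RDD lineage: each node
// references its parent, so default Java serialization recurses
// once per link when writing the object graph.
case class Step(iteration: Int, parent: Option[Step]) extends Serializable

object LineageDepthDemo {
  // Serialize an object graph, discarding the bytes.
  def serialize(obj: AnyRef): Unit = {
    val out = new ObjectOutputStream(new ByteArrayOutputStream())
    try out.writeObject(obj) finally out.close()
  }

  // Build a parent-linked chain of the given depth (iteratively, so
  // construction itself never touches the stack depth).
  def chain(depth: Int): Step =
    (1 to depth).foldLeft(Step(0, None))((parent, i) => Step(i, Some(parent)))

  def main(args: Array[String]): Unit = {
    serialize(chain(100)) // shallow chain: serializes fine
    println("depth 100: ok")
    try {
      serialize(chain(1000000)) // deep chain: recursion blows the stack
      println("depth 1000000: ok")
    } catch {
      case _: StackOverflowError => println("depth 1000000: StackOverflowError")
    }
  }
}
```

With a default JVM stack size, the million-link chain fails during `writeObject`, while building the chain itself succeeds, which is exactly the shape of the Spark failure: the job graph exists in memory, but task serialization cannot walk it.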

The lineage of my GBDT implementation is shown below (at the 4th iteration):

How to fix stack overflow error in iterative algorithm

Some people try to deal with this problem by caching the RDD, but that does not work. From the lineage point of view, caching to memory and checkpointing behave quite differently.

  • When an RDD is checkpointed, the data of the RDD is saved to HDFS (or any Hadoop-API-compatible fault-tolerant storage) and the lineage of the RDD is truncated. This is safe because, in case of worker failure, the RDD data can be read back from the fault-tolerant storage.
  • When an RDD is cached, the data of the RDD is cached in memory, but the lineage is not truncated. This is because if the in-memory data is lost, the lineage is required to recompute the data.
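The difference shows up directly in `RDD.toDebugString`. Below is a minimal local-mode sketch; the app name, checkpoint directory, and chain length are arbitrary choices for illustration:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object CacheVsCheckpoint {
  // Build a lineage of `links` map steps on top of a base RDD.
  def buildChain(sc: SparkContext, links: Int): RDD[Int] = {
    var rdd: RDD[Int] = sc.parallelize(1 to 100)
    for (_ <- 1 to links) rdd = rdd.map(_ + 1)
    rdd
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("cache-vs-checkpoint").setMaster("local[1]"))
    sc.setCheckpointDir("/tmp/ckpt-demo") // arbitrary demo path

    // Cached only: the data is in memory, but toDebugString still lists
    // the whole chain of MapPartitionsRDDs -- the lineage survives.
    val cached = buildChain(sc, 5).cache()
    cached.count()
    println(cached.toDebugString)

    // Checkpointed: mark BEFORE the first action, then materialize.
    // Afterwards the lineage is rooted at a checkpoint RDD instead of
    // the original parallelize + map chain.
    val ckpt = buildChain(sc, 5).cache()
    ckpt.checkpoint()
    ckpt.count()
    println(ckpt.toDebugString)
    println(s"isCheckpointed = ${ckpt.isCheckpointed}")

    sc.stop()
  }
}
```

Note that `checkpoint()` must be marked before the first action on that RDD; marking it afterwards has no effect on lineage already used by a job.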

So to deal with StackOverflowError due to long lineage, caching alone is not going to help; you have to checkpoint the RDD every 20~30 iterations. The correct way to do this is the following:

  1. Mark the RDD of every Nth iteration for both caching and checkpointing.
  2. Before generating the (N+1)th iteration's RDD, force the materialization of this RDD with rdd.count() or any other action. This persists the RDD in memory, saves it to HDFS, and truncates the lineage. If you mark every Nth iteration's RDD for checkpointing but only force materialization after ALL the iterations (instead of after each one as suggested), you will still get StackOverflowError.

After adding checkpointing, the lineage of my GBDT implementation changed to the following (at the 4th iteration):

Demo of Checkpointing

/* set checkpointing directory */
val conf = new SparkConf().setAppName(s"GBoost Example with $params")
val sc = new SparkContext(conf)
sc.setCheckpointDir(params.cp_dir) // params.cp_dir = hdfs://bda00:8020/user/houjp/gboost/data/checkpoint/

/* do checkpointing */
var output: RDD[Double] = init()
var iter = 1
while (iter <= num_iter) {
    val pre_output = output
    // iterative computation of output
    output = getNext(pre_output).persist()
    // checkpoint every 20th iteration
    if (iter % 20 == 0) {
        output.checkpoint()
    }
    // force the materialization of this RDD
    output.count()
    pre_output.unpersist()
    iter += 1
}

References

  1. http://apache-spark-user-list.1001560.n3.nabble.com/java-lang-StackOverflowError-when-calling-count-td5649.html