
Analysis of Spark's StackOverflowError in Iterative Algorithms

Published on 2015-10-14

Table of Contents
  1. Why iterative algorithms can cause a stack overflow error
  2. How to fix the stack overflow error in iterative algorithms
  3. Demo of Checkpointing
  4. References

Recently, I partially implemented the GBDT algorithm on Spark. When testing its performance with a large number of iterations (>300), I ran into a java.lang.StackOverflowError.

Why iterative algorithms can cause a stack overflow error

An iterative algorithm with many iterations often builds a long lineage, which results in a deep Java object graph (the DAG of RDD objects) that must be serialized as part of task creation. Serialization traverses the whole object DAG recursively, and once the graph is deep enough this traversal exhausts the stack, producing the StackOverflowError.
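To see why a long lineage alone is enough to trigger the error, here is a minimal, hypothetical sketch (not from the GBDT code; the object name and the iteration count are illustrative). Each pass through the loop wraps the previous RDD in a new one, so the object graph grows by one level per iteration; the exact depth at which serialization fails depends on the JVM thread stack size.

import org.apache.spark.{SparkConf, SparkContext}

object LongLineageDemo {
  def main(args: Array[String]): Unit = {
    // add .setMaster("local[*]") to the conf to run this outside spark-submit
    val sc = new SparkContext(new SparkConf().setAppName("LongLineageDemo"))
    var rdd = sc.parallelize(1 to 100)
    // Each map() only records a new RDD object pointing at its parent;
    // nothing is computed yet, but the lineage gets one level deeper.
    for (_ <- 1 to 10000) {
      rdd = rdd.map(_ + 1)
    }
    // The first action serializes tasks over the whole RDD DAG; with a
    // deep enough chain this recursive traversal overflows the stack.
    rdd.count()
    sc.stop()
  }
}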

The lineage of the GBDT algorithm I wrote is shown below (at the 4th iteration),

How to fix the stack overflow error in iterative algorithms

Some people try to deal with this problem by caching the RDD, but that does not work. From the lineage point of view, caching to memory and checkpointing behave quite differently (a short sketch after this list shows the difference):

  • When an RDD is checkpointed, the data of the RDD is saved to HDFS (or any Hadoop-API-compatible fault-tolerant storage) and the lineage of the RDD is truncated. This is okay because in case of a worker failure, the RDD data can be read back from the fault-tolerant storage.
  • When an RDD is cached, the data of the RDD is cached in memory, but the lineage is not truncated. This is because if the in-memory data is lost, the lineage is required to recompute the data.
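The difference is easy to observe with RDD.toDebugString, which prints an RDD's lineage. A minimal sketch, assuming an existing SparkContext sc (as in a spark-shell session) and a placeholder checkpoint directory:

sc.setCheckpointDir("hdfs:///tmp/checkpoint") // placeholder path

// Cached RDD: the printed lineage still reaches back to parallelize.
val cached = sc.parallelize(1 to 100).map(_ * 2).cache()
cached.count()
println(cached.toDebugString)

// Checkpointed RDD: after materialization the printed lineage is rooted
// at a checkpoint RDD that reads from the checkpoint directory.
val checkpointed = sc.parallelize(1 to 100).map(_ * 2)
checkpointed.checkpoint()
checkpointed.count()
println(checkpointed.toDebugString)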

So to deal with StackOverflowErrors due to a long lineage, caching alone is not going to help; you have to checkpoint the RDD every 20~30 iterations. The correct way to do this is the following:

  1. Mark the RDD of every Nth iteration for both caching and checkpointing.
  2. Before generating the (N+1)th iteration's RDD, force the materialization of this RDD by calling rdd.count() or any other action. This persists the RDD in memory, saves it to HDFS, and truncates the lineage. If you only mark every Nth iteration's RDD for checkpointing but force the materialization once after ALL the iterations (instead of at every Nth iteration as suggested), you will still get StackOverflowErrors.

After adding checkpointing, the lineage of the GBDT algorithm I wrote changed to the following (at the 4th iteration):

Demo of Checkpointing

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

/* set the checkpointing directory */
val conf = new SparkConf().setAppName(s"GBoost Example with $params")
val sc = new SparkContext(conf)
sc.setCheckpointDir(params.cp_dir) // params.cp_dir = hdfs://bda00:8020/user/houjp/gboost/data/checkpoint/

/* do checkpointing */
var output: RDD[Double] = init()
var iter = 1
while (iter <= num_iter) {
    val pre_output = output
    // iterative computation of output
    output = getNext(pre_output).persist()
    // checkpoint every 20th iteration
    if (iter % 20 == 0) {
        output.checkpoint()
    }
    // force the materialization of this RDD
    output.count()
    pre_output.unpersist()
    iter += 1
}
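Two details of the loop above are worth noting. checkpoint() is marked before the count(), matching the requirement in the RDD API docs that checkpoint() be called before any job has been executed on the RDD. output is also persisted before being checkpointed: the checkpoint data is written by a separate job, so without persist() the RDD would be recomputed from its lineage just to be written to storage. Finally, pre_output is unpersisted only after the count(), so computing output can still read its cached parent.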

References

  1. http://apache-spark-user-list.1001560.n3.nabble.com/java-lang-StackOverflowError-when-calling-count-td5649.html
