
Analysis of Spark's StackOverflowError in Iterative Algorithm

Published on 2015-10-14

Table of Contents
  1. Why iterative algorithms can cause stack overflow errors
  2. How to fix the stack overflow error in iterative algorithms
  3. Demo of Checkpointing
  4. References

Recently, I finished part of a GBDT (Gradient Boosted Decision Tree) algorithm on Spark. When testing its performance with a large number of iterations (>300), I ran into java.lang.StackOverflowError.

Why iterative algorithms can cause stack overflow errors

An iterative algorithm with many iterations often has a long lineage, which produces a deep Java object graph (the DAG of RDD objects) that must be serialized as part of task creation. Serialization traverses the whole object DAG recursively, so a sufficiently deep lineage exhausts the thread stack and throws the stack overflow error.
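To make this concrete, here is a minimal sketch that reproduces the problem (the iteration count, data size, and local master are illustrative assumptions, not from my GBDT program): every map call adds one more parent to the lineage, and the final action forces the whole chain to be serialized.

import org.apache.spark.{SparkConf, SparkContext}

object LongLineageDemo {
    def main(args: Array[String]): Unit = {
        val sc = new SparkContext(
            new SparkConf().setAppName("LongLineageDemo").setMaster("local[2]"))
        // each map adds one more level to the RDD lineage
        var rdd = sc.parallelize(1 to 1000)
        for (_ <- 1 to 3000) {
            rdd = rdd.map(_ + 1)
        }
        // task creation serializes the 3000-level-deep object DAG recursively,
        // which can throw java.lang.StackOverflowError here
        rdd.count()
        sc.stop()
    }
}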

The lineage of the GBDT algorithm I wrote, at the 4th iteration, is shown below:

How to fix the stack overflow error in iterative algorithms

Some people try to deal with this problem by caching the RDD, but that does not work. From the lineage point of view, there is a crucial difference between caching to memory and checkpointing:

  • When an RDD is checkpointed, the data of the RDD is saved to HDFS (or any Hadoop-API-compatible fault-tolerant storage) and the lineage of the RDD is truncated. This is safe because, in case of a worker failure, the RDD data can be read back from the fault-tolerant storage.
  • When an RDD is cached, the data of the RDD is cached in memory, but the lineage is not truncated. This is because if the in-memory data is lost, the lineage is required to recompute the data.
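This difference can be seen directly in an RDD's debug string. Below is a small sketch (the local master and the /tmp checkpoint directory are assumptions for illustration): the cached RDD's toDebugString still prints its full chain of parent RDDs, while the checkpointed RDD's lineage is cut off at the checkpoint.

import org.apache.spark.{SparkConf, SparkContext}

object CacheVsCheckpoint {
    def main(args: Array[String]): Unit = {
        val sc = new SparkContext(
            new SparkConf().setAppName("CacheVsCheckpoint").setMaster("local[2]"))
        sc.setCheckpointDir("/tmp/checkpoint") // illustrative local directory

        // cached: data lives in memory, but the lineage is kept intact
        val cached = sc.parallelize(1 to 100).map(_ * 2).map(_ + 1).cache()
        cached.count()
        println(cached.toDebugString) // prints the full chain of parent RDDs

        // checkpointed: data goes to stable storage, lineage is truncated
        val checkpointed = sc.parallelize(1 to 100).map(_ * 2).map(_ + 1)
        checkpointed.checkpoint()
        checkpointed.count() // materializes the RDD and writes the checkpoint
        println(checkpointed.toDebugString) // lineage now starts at a checkpoint RDD

        sc.stop()
    }
}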

So to deal with stack overflow errors due to long lineage, caching alone is not going to help; you have to checkpoint the RDD every 20-30 iterations. The correct way to do this is the following:

  1. Mark the RDD of every Nth iteration for both caching and checkpointing.
  2. Before generating the (N+1)th iteration's RDD, force the materialization of this RDD by calling rdd.count() or any other action. This persists the RDD in memory, saves it to HDFS, and truncates the lineage. If you mark every Nth iteration's RDD for checkpointing but only force the materialization after ALL the iterations (instead of after each one, as suggested), you will still hit stack overflow errors. The demo in the next section follows this pattern.

After adding checkpointing, the lineage of the GBDT algorithm I wrote, at the 4th iteration, changed to the following:

Demo of Checkpointing

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

/* set the checkpointing directory */
val conf = new SparkConf().setAppName(s"GBoost Example with $params")
val sc = new SparkContext(conf)
sc.setCheckpointDir(params.cp_dir) // params.cp_dir = hdfs://bda00:8020/user/houjp/gboost/data/checkpoint/

/* do checkpointing */
var output: RDD[Double] = init()
var iter = 1
while (iter <= num_iter) {
    val pre_output = output
    // compute the next iteration's RDD from the previous one and cache it
    output = getNext(pre_output).persist()
    // checkpoint every 20th iteration to truncate the lineage
    if (iter % 20 == 0) {
        output.checkpoint()
    }
    // force the materialization of this RDD: it is persisted in memory and,
    // on checkpoint iterations, written to HDFS with its lineage truncated
    output.count()
    // the previous iteration's data is no longer needed in memory
    pre_output.unpersist()
    iter += 1
}
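Note that output is persisted before it is checkpointed. Writing the checkpoint data triggers a second computation of the RDD, so if it is already cached in memory the checkpoint pass reads the cached partitions instead of recomputing the whole lineage. Also note that pre_output is unpersisted only after output.count() has materialized the new RDD, since computing output still needs the previous iteration's data.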

References

  1. http://apache-spark-user-list.1001560.n3.nabble.com/java-lang-StackOverflowError-when-calling-count-td5649.html
