Spark Program Stack Overflow Error

Analysis of Spark's StackOverflowError in Iterative Algorithms

Published on 2015-10-14

Table of Contents
  1. Why iterative algorithms can cause stack overflow errors
  2. How to fix stack overflow errors in iterative algorithms
  3. Demo of Checkpointing
  4. References

Recently, I partially implemented the GBDT algorithm on Spark. When testing its performance with a large number of iterations (>300), I ran into a java.lang.StackOverflowError.

Why iterative algorithms can cause stack overflow errors

An iterative algorithm with many iterations often builds a long lineage, which in turn produces a long/deep Java object tree (the DAG of RDD objects). This DAG is serialized as part of task creation, and serialization traverses the whole object graph recursively; with a deep enough graph, that recursion overflows the stack.

The lineage of the GBDT algorithm I wrote is shown below (at the 4th iteration):
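The figure itself is specific to my GBDT code, but the effect is easy to reproduce. Here is a minimal sketch (a hypothetical toy job, not my actual GBDT code) that builds a deep lineage the same way; the exact number of iterations needed to overflow depends on the JVM stack size:

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("LongLineageDemo"))
var rdd = sc.parallelize(1 to 1000)
// each pass adds one more MapPartitionsRDD to the dependency chain
for (_ <- 1 to 1000) {
    rdd = rdd.map(_ + 1)
}
// task serialization recursively walks the ~1000-deep DAG here and,
// with a deep enough chain, throws java.lang.StackOverflowError
rdd.count()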

How to fix stack overflow errors in iterative algorithms

Some people try to deal with this problem by caching the RDD, but that does not work. From the lineage point of view, caching to memory and checkpointing behave differently:

  • When an RDD is checkpointed, its data is saved to HDFS (or any Hadoop-API-compatible fault-tolerant storage) and its lineage is truncated. This is safe because, in case of a worker failure, the RDD data can be read back from the fault-tolerant storage.
  • When an RDD is cached, its data is kept in memory, but its lineage is not truncated, because if the in-memory data is lost, the lineage is required to recompute it. The difference is visible in each RDD's debug string, as the sketch after this list shows.
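A minimal sketch of that difference (assuming a throwaway local job and checkpoint directory; RDD.toDebugString prints the lineage):

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("CacheVsCheckpoint"))
sc.setCheckpointDir("/tmp/checkpoint") // any Hadoop-API-compatible path works

val cached = sc.parallelize(1 to 100).map(_ * 2).cache()
cached.count()
// the parent ParallelCollectionRDD is still listed: lineage kept
println(cached.toDebugString)

val checkpointed = sc.parallelize(1 to 100).map(_ * 2)
checkpointed.checkpoint() // mark before the first action on this RDD
checkpointed.count()      // materializes and writes to the checkpoint dir
// the lineage is now rooted at a CheckpointRDD: lineage truncated
println(checkpointed.toDebugString)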

So to deal with stack overflow errors caused by long lineage, caching alone is not going to help; you have to checkpoint the RDD every 20 to 30 iterations. The correct way to do this is the following:

  1. Mark the RDD of every Nth iteration for both caching and checkpointing.
  2. Before generating the (N+1)th iteration RDD, force the materialization of this RDD with rdd.count() or any other action. This persists the RDD in memory, saves it to HDFS, and truncates the lineage. If you mark every Nth iteration RDD for checkpointing but only force materialization after ALL the iterations (rather than after each Nth iteration as suggested), the lineage is never truncated along the way and you will still get stack overflow errors.

After adding checkpointing, the lineage of my GBDT algorithm changed to the following (at the 4th iteration):

Demo of Checkpointing

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

/* set the checkpointing directory */
val conf = new SparkConf().setAppName(s"GBoost Example with $params")
val sc = new SparkContext(conf)
sc.setCheckpointDir(params.cp_dir) // params.cp_dir = hdfs://bda00:8020/user/houjp/gboost/data/checkpoint/

/* do checkpointing */
var output: RDD[Double] = init()
var iter = 1
while (iter <= num_iter) {
    val pre_output = output
    // compute the next iteration's output from the previous one
    output = getNext(pre_output).persist()
    // mark every 20th iteration for checkpointing (must happen before the action below)
    if (iter % 20 == 0) {
        output.checkpoint()
    }
    // force materialization: caches the RDD and, on checkpoint iterations,
    // writes it to HDFS and truncates its lineage
    output.count()
    // the previous iteration's cached data is no longer needed
    pre_output.unpersist()
    iter += 1
}
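Two details in this loop matter. First, checkpoint() only marks the RDD; the actual write to HDFS happens when the next job (the count()) runs, so it must be called before the first action on that RDD. Second, persisting the RDD before checkpointing it avoids computing it twice, since the checkpoint write would otherwise recompute the RDD from its lineage.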

References

  1. http://apache-spark-user-list.1001560.n3.nabble.com/java-lang-StackOverflowError-when-calling-count-td5649.html