SparkCore (17): RDD Fault-Tolerance Mechanisms

RayBreslin

Article link: https://blog.csdn.net/u010886217/article/details/103289687

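I. Concept

If an error occurs while an RDD job is running, Spark has mechanisms to recover from it so that the job can keep executing. These mechanisms are what is meant by RDD fault tolerance.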

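II. Fault-tolerance categories

1. Driver failure

(1) If the job runs in client deploy mode: the application simply dies, and the driver is not recovered automatically.

(2) If the job runs in cluster deploy mode:

-> Spark on standalone/Mesos: passing --supervise to spark-submit makes the cluster manager restart the driver automatically when it fails, possibly on a different node.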

[root@hadoop spark-2.1.0-bin-2.6.0-cdh5.7.0]# bin/spark-submit 
Usage: spark-submit [options] <app jar | python file> [app arguments]
Usage: spark-submit --kill [submission ID] --master [spark://...]
Usage: spark-submit --status [submission ID] --master [spark://...]
Usage: spark-submit run-example [options] example-class [example args]

Options:

......
 Spark standalone or Mesos with cluster deploy mode only:
  --supervise                 If given, restarts the driver on failure.
  --kill SUBMISSION_ID        If given, kills the driver specified.
  --status SUBMISSION_ID      If given, requests the status of the driver specified.
......
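As a rough illustration of the --supervise option listed above, the following sketch assembles a supervised cluster-mode submission against a standalone master. The master URL, main class, and jar path are hypothetical placeholders, not values from the original post. The script only prints the command unless RUN=1 is set.

```shell
# Hypothetical placeholders (not from the original post); replace with real values.
MASTER_URL="spark://master-host:7077"
APP_CLASS="com.example.MyApp"
APP_JAR="/path/to/my-app.jar"

# --supervise only takes effect on standalone/Mesos with --deploy-mode cluster.
SUBMIT_CMD="bin/spark-submit --master $MASTER_URL --deploy-mode cluster --supervise --class $APP_CLASS $APP_JAR"

# Dry run by default: print the command for review.
# Set RUN=1 to execute it from the Spark home directory
# (relies on word splitting, so the values above must not contain spaces).
if [ "${RUN:-0}" = "1" ]; then
  $SUBMIT_CMD
else
  echo "$SUBMIT_CMD"
fi
```

A supervised driver keeps being restarted after failures, so stopping it for good goes through the --kill form shown in the usage text, using the submission ID reported when the application was submitted.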

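-> Spark on YARN: YARN re-launches the driver automatically for a limited number of attempts. The limit comes from spark.yarn.maxAppAttempts, which cannot exceed YARN's yarn.resourcemanager.am.max-attempts (2 in a stock YARN configuration), so the commonly quoted figure of four retries depends on how the cluster is configured.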
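The number of application attempts on YARN can be set explicitly at submission time. The sketch below requests four attempts through spark.yarn.maxAppAttempts. It assumes the cluster's yarn.resourcemanager.am.max-attempts is at least 4, since Spark's value must not exceed it. The main class and jar path are hypothetical placeholders.

```shell
# Hypothetical placeholders (not from the original post); replace with real values.
APP_CLASS="com.example.MyApp"
APP_JAR="/path/to/my-app.jar"

# Requested number of application attempts. Assumption: the YARN-side limit
# yarn.resourcemanager.am.max-attempts (yarn-site.xml) is >= this value.
MAX_ATTEMPTS=4

SUBMIT_CMD="bin/spark-submit --master yarn --deploy-mode cluster --conf spark.yarn.maxAppAttempts=$MAX_ATTEMPTS --class $APP_CLASS $APP_JAR"

# Dry run by default: print the command. Set RUN=1 to execute it
# (relies on word splitting, so the values above must not contain spaces).
if [ "${RUN:-0}" = "1" ]; then
  $SUBMIT_CMD
else
  echo "$SUBMIT_CMD"
fi
```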

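2. Executor failure

Typical cases are the machine (worker) hosting the executor process going down, or communication between the executor and the driver timing out. The driver removes the failed executor from its list of executors and requests resources again from the ResourceManager/master. A new executor is then started automatically on a Worker or NodeManager, and the affected tasks are re-run there.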
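The original post does not name the settings that govern this behaviour. As a hedged sketch, the standard Spark properties most closely related to it are spark.executor.heartbeatInterval (how often executors report to the driver), spark.network.timeout (how long the driver waits before treating an executor as lost), and spark.task.maxFailures (how many times a single task may fail before the job is aborted). The values below are Spark's documented defaults, spelled out only to make them visible. The master, main class, and jar path are hypothetical placeholders.

```shell
# Hypothetical placeholders (not from the original post); replace with real values.
MASTER="yarn"
APP_CLASS="com.example.MyApp"
APP_JAR="/path/to/my-app.jar"

# Documented Spark defaults, written out explicitly.
# The heartbeat interval should stay well below the network timeout.
HEARTBEAT_INTERVAL="10s"
NETWORK_TIMEOUT="120s"
TASK_MAX_FAILURES=4

SUBMIT_CMD="bin/spark-submit --master $MASTER --conf spark.executor.heartbeatInterval=$HEARTBEAT_INTERVAL --conf spark.network.timeout=$NETWORK_TIMEOUT --conf spark.task.maxFailures=$TASK_MAX_FAILURES --class $APP_CLASS $APP_JAR"

# Dry run by default: print the command. Set RUN=1 to execute it
# (relies on word splitting, so the values above must not contain spaces).
if [ "${RUN:-0}" = "1" ]; then
  $SUBMIT_CMD
else
  echo "$SUBMIT_CMD"
fi
```

Raising spark.network.timeout makes the driver slower to declare an executor lost, which can help on clusters with long GC pauses, at the cost of slower recovery from executors that are actually dead.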