Notes on the 2012 Spark RDD Paper

taochaoqiang

Article link: https://blog.csdn.net/superqiang34/article/details/78883600

Brief notes on the 2012 Spark paper "Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing", together with some of my own understanding.

Introduction

Resilient Distributed Datasets (RDDs)

2.1 RDD Abstraction

2.2 Spark Programming Interface

2.3 Advantages of the RDD Model

2.4 Applications Not Suitable for RDDs

Spark Programming Interface

3.1 RDD Operations in Spark

3.2 Example Applications

Representing RDDs

The hard part of representing RDDs is tracking the lineage produced by a chain of transformations. Spark handles this with a graph-based representation. Specifically, all RDDs share a common interface that exposes five pieces of information:
1. a set of partitions
2. a set of dependencies on parent RDDs
3. a function for computing the dataset based on its parents
4. metadata about its partitioning scheme
5. data placement
These correspond to the following interface methods:
[Figure: RDD interface methods]
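To make the five parts concrete, here is a minimal Python sketch of such an interface. The method names are modeled on the five items listed above; the classes `RDD`, `ListRDD`, `MappedRDD` and the helper `collect` are hypothetical and exist only for illustration. This is not Spark's actual code.

```python
from abc import ABC, abstractmethod
from typing import Any, Callable, Iterator, List, Optional


class RDD(ABC):
    """Hypothetical rendering of the five-part RDD interface."""

    @abstractmethod
    def partitions(self) -> List[int]:
        """1. The partition ids that make up this dataset."""

    @abstractmethod
    def dependencies(self) -> List["RDD"]:
        """2. The parent RDDs this dataset was derived from."""

    @abstractmethod
    def iterator(self, p: int) -> Iterator[Any]:
        """3. Compute the elements of partition p from the parents."""

    def partitioner(self) -> Optional[str]:
        """4. Partitioning metadata, e.g. 'hash' or 'range'; None if unknown."""
        return None

    def preferred_locations(self, p: int) -> List[str]:
        """5. Data placement: nodes where partition p is cheap to read."""
        return []


class ListRDD(RDD):
    """Source RDD backed by in-memory lists, one list per partition."""

    def __init__(self, chunks: List[List[Any]]):
        self._chunks = chunks

    def partitions(self) -> List[int]:
        return list(range(len(self._chunks)))

    def dependencies(self) -> List[RDD]:
        return []  # a source has no parents

    def iterator(self, p: int) -> Iterator[Any]:
        return iter(self._chunks[p])


class MappedRDD(RDD):
    """Result of map(): same partitions as the parent, computed lazily."""

    def __init__(self, parent: RDD, fn: Callable[[Any], Any]):
        self._parent = parent
        self._fn = fn

    def partitions(self) -> List[int]:
        return self._parent.partitions()

    def dependencies(self) -> List[RDD]:
        return [self._parent]

    def iterator(self, p: int) -> Iterator[Any]:
        # Partition p of the child only reads partition p of the parent.
        return (self._fn(x) for x in self._parent.iterator(p))


def collect(rdd: RDD) -> List[Any]:
    """Materialize every partition in order (a stand-in for an action)."""
    return [x for p in rdd.partitions() for x in rdd.iterator(p)]
```

Note that `MappedRDD` stores only its parent and a function, which is exactly the lineage information: nothing is computed until `collect` walks the partitions.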
This raises a question: how are the dependencies represented? Spark divides dependencies into two types:
- narrow dependencies
- wide dependencies
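With a narrow dependency, each partition of the parent RDD is used by at most one partition of the child RDD. With a wide dependency, each partition of the parent RDD may be used by multiple partitions of the child RDD. For example, map produces a narrow dependency, while join may produce wide dependencies. The figure below illustrates this:
[Figure: RDD dependencies]
Why draw this distinction? There are two reasons.
- Narrow dependencies allow pipelined execution on a single cluster node, with no data transferred across nodes. With wide dependencies, a child partition depends on multiple parent partitions, so the data must be shuffled across nodes, as in a MapReduce operation.
- Narrow dependencies make recovery after a node failure efficient: only the lost parent partitions need to be recomputed, and they can be recomputed in parallel on different nodes. With wide dependencies, a single failed node may cause the loss of some partitions from all the ancestors of an RDD, requiring a complete re-execution.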
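The two definitions and the recovery argument can be checked with a small Python sketch. It assumes a dependency is modeled as a plain dictionary from child partition id to the parent partition ids it reads; the names `Dependency`, `is_narrow` and `parents_to_recompute` are hypothetical, not Spark APIs.

```python
from typing import Dict, List, Set

# Assumed model: child partition id -> parent partition ids it reads.
Dependency = Dict[int, List[int]]


def is_narrow(dep: Dependency) -> bool:
    """Narrow if every parent partition is used by at most one child partition."""
    seen: Set[int] = set()
    for parents in dep.values():
        for p in set(parents):
            if p in seen:
                return False  # this parent partition feeds a second child
            seen.add(p)
    return True


def parents_to_recompute(dep: Dependency, lost_child: int) -> List[int]:
    """Parent partitions needed to rebuild one lost child partition."""
    return sorted(set(dep[lost_child]))


# map: child partition i reads only parent partition i.
map_dep: Dependency = {0: [0], 1: [1], 2: [2]}

# shuffle-style operation: every child partition reads every parent partition.
shuffle_dep: Dependency = {0: [0, 1, 2], 1: [0, 1, 2]}
```

With `map_dep`, losing child partition 1 requires only parent partition 1. With `shuffle_dep`, losing a single child partition requires all three parent partitions, which is why recovery across wide dependencies is so much more expensive.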

Implementation

Spark runs on top of the Mesos cluster manager, which handles resource management across applications, while Spark itself manages and schedules RDDs.

5.1 Job Scheduling

Spark's scheduler makes full use of the RDD representation. First, a figure:
[Figure: Spark scheduler]
This introduces the concept of a stage: each stage contains as many pipelined transformations with narrow dependencies as possible. The boundaries of the stages are the shuffle operations required for wide dependencies, or any already computed partitions that can short-circuit the computation of a parent RDD.
The Spark scheduler examines the RDD's lineage graph to build a DAG of stages, then launches tasks to compute the missing partitions of each stage until the target RDD has been computed.
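The stage-cutting rule can be sketched in a few lines of Python. This is a simplified illustration, not Spark's scheduler: the lineage graph is assumed to be a dictionary from RDD name to `(parent, kind)` pairs, the names `Lineage`, `build_stages` and `cached` are hypothetical, and an RDD reachable through several paths is simply assigned to the first stage that visits it.

```python
from typing import Dict, FrozenSet, List, Tuple

# Assumed model: RDD name -> list of (parent name, "narrow" or "wide").
Lineage = Dict[str, List[Tuple[str, str]]]


def build_stages(lineage: Lineage, target: str,
                 cached: FrozenSet[str] = frozenset()) -> List[List[str]]:
    """Group the RDDs needed for `target` into stages.

    Narrow dependencies are pipelined into the same stage; a wide
    dependency starts a new stage. RDDs in `cached` are treated as already
    computed, so their parents are not visited (the short-circuit case).
    Stages are returned starting from the target and walking backwards.
    """
    stages: List[List[str]] = []
    assigned: Dict[str, int] = {}

    def visit(rdd: str, stage: int) -> None:
        if rdd in assigned:
            return
        assigned[rdd] = stage
        stages[stage].append(rdd)
        if rdd in cached:
            return  # already computed: no need to walk further up
        for parent, kind in lineage.get(rdd, []):
            if kind == "narrow":
                visit(parent, stage)   # pipelined within the same stage
            else:
                new_stage(parent)      # shuffle boundary

    def new_stage(rdd: str) -> None:
        if rdd in assigned:
            return
        stages.append([])
        visit(rdd, len(stages) - 1)

    new_stage(target)
    return [sorted(s) for s in stages]


# Hypothetical lineage: G joins B (narrow) with F (wide); B comes from a
# shuffle over A; F is a union of D and E; D is a map over C.
example: Lineage = {
    "G": [("B", "narrow"), ("F", "wide")],
    "B": [("A", "wide")],
    "F": [("D", "narrow"), ("E", "narrow")],
    "D": [("C", "narrow")],
}
```

Calling `build_stages(example, "G")` yields three stages, and passing `cached=frozenset({"B"})` drops the stage containing `A`, since `B` no longer needs to be recomputed.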
Notably, the Spark scheduler assigns tasks based on data locality, using delay scheduling. That is, when the partition a task needs to process is already in memory on some node in the cluster, the scheduler sends the task directly to that node. Likewise, if the RDD is backed by a file in HDFS, the task is sent to the data node that holds the data. For wide dependencies, Spark materializes intermediate records on the nodes holding parent partitions to simplify fault recovery.
If a task fails, it can be re-run on another node as long as its stage's parents are still available. If some stages have become unavailable, the scheduler resubmits tasks to compute the missing partitions in parallel.
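The locality preference described above can be sketched as a simple priority order. This is only an illustration under assumptions: the function `pick_node` and its parameters are hypothetical, and the sketch leaves out the "delay" part of delay scheduling (waiting briefly for a local slot to free up before falling back to a non-local node).

```python
from typing import Dict, List, Optional


def pick_node(partition: str,
              in_memory: Dict[str, str],
              preferred: Dict[str, List[str]],
              free_nodes: List[str]) -> Optional[str]:
    """Choose a node for the task that computes `partition`.

    Priority: the node caching the partition in memory, then a preferred
    location (e.g. the host of the HDFS block), then any free node.
    Returns None when no node is free.
    """
    cached_on = in_memory.get(partition)
    if cached_on in free_nodes:
        return cached_on
    for node in preferred.get(partition, []):
        if node in free_nodes:
            return node
    return free_nodes[0] if free_nodes else None
```

Re-running a failed task fits the same picture: the failed node is removed from `free_nodes` and `pick_node` is called again, which works as long as the parent data of the stage is still available somewhere.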

5.2 Interpreter Integration

5.3 Memory Management

Spark provides three options for storage of persistent RDDs:
1. in-memory storage as deserialized Java objects
2. in-memory storage as serialized data
3. on-disk storage
In addition, when memory is limited, cached data is evicted using an LRU policy.
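As a rough illustration of LRU eviction, here is a minimal Python sketch built on `collections.OrderedDict`. The class `LRUPartitionCache` is hypothetical, and it assumes capacity is counted in number of entries rather than bytes, which is a simplification.

```python
from collections import OrderedDict
from typing import Any, Optional


class LRUPartitionCache:
    """Toy cache that evicts the least recently used entry when full."""

    def __init__(self, capacity: int):
        self.capacity = capacity  # assumed to be an entry count, not bytes
        self._entries: "OrderedDict[str, Any]" = OrderedDict()

    def get(self, key: str) -> Optional[Any]:
        if key not in self._entries:
            return None  # miss: the caller would recompute from lineage
        self._entries.move_to_end(key)  # mark as most recently used
        return self._entries[key]

    def put(self, key: str, value: Any) -> Optional[str]:
        """Store a value; return the evicted key, or None if nothing was evicted."""
        self._entries[key] = value
        self._entries.move_to_end(key)
        if len(self._entries) > self.capacity:
            evicted, _ = self._entries.popitem(last=False)  # oldest entry
            return evicted
        return None
```

Because RDDs carry their lineage, an evicted entry is not lost for good: a later miss can be served by recomputing the partition from its parents.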