Spark Basics: What Is an RDD

RDD is short for Resilient Distributed Dataset (RDDs): a resilient, distributed collection of data.

The source comment at the top of RDD.scala in the Spark codebase explains RDDs in detail (GitHub: https://github.com/apache/spark/blob/v2.4.4/core/src/main/scala/org/apache/spark/rdd/RDD.scala).

Taking the comment in Spark 2.4.4 as an example:

A Resilient Distributed Dataset (RDD), the basic abstraction in Spark. Represents an immutable,
partitioned collection of elements that can be operated on in parallel. This class contains the
basic operations available on all RDDs, such as `map`, `filter`, and `persist`. In addition,
[[org.apache.spark.rdd.PairRDDFunctions]] contains operations available only on RDDs of key-value
pairs, such as `groupByKey` and `join`;
[[org.apache.spark.rdd.DoubleRDDFunctions]] contains operations available only on RDDs of
Doubles; and
[[org.apache.spark.rdd.SequenceFileRDDFunctions]] contains operations available on RDDs that
can be saved as SequenceFiles.
All operations are automatically available on any RDD of the right type (e.g. RDD[(Int, Int)])
through implicit.

Internally, each RDD is characterized by five main properties:

- A list of partitions
- A function for computing each split
- A list of dependencies on other RDDs
- Optionally, a Partitioner for key-value RDDs (e.g. to say that the RDD is hash-partitioned)
- Optionally, a list of preferred locations to compute each split on (e.g. block locations for
an HDFS file)

All of the scheduling and execution in Spark is done based on these methods, allowing each RDD
to implement its own way of computing itself. Indeed, users can implement custom RDDs (e.g. for
reading data from a new storage system) by overriding these functions. Please refer to the
<a href="http://people.csail.mit.edu/matei/papers/2012/nsdi_spark.pdf">Spark paper</a>
for more details on RDD internals.

Key points:

  • A Resilient Distributed Dataset (RDD), the basic abstraction in Spark

Translation: a Resilient Distributed Dataset is the basic abstraction in Spark.

Explanation: resilient (recoverable) means the dataset is fault-tolerant and can be rebuilt when something fails; distributed means the dataset can be spread across different machines.

  • Represents an immutable, partitioned collection of elements that can be operated on in parallel.

Translation: an RDD is an immutable, partitioned collection of elements that can be processed in parallel.

Explanation: immutable follows the same design philosophy as Scala: once the dataset is built it can never be modified, which trivially solves consistency problems when multiple threads read the same data. Partitioned is what makes parallel, i.e. distributed, processing possible.
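
A minimal sketch of both points (it assumes a live SparkContext named sc):

val nums = sc.parallelize(1 to 10, numSlices = 4) // an RDD split into 4 partitions

// Transformations never modify nums; they return a new RDD
val doubled = nums.map(_ * 2)

println(nums.getNumPartitions)            // 4
println(doubled.collect().mkString(", ")) // 2, 4, 6, ..., 20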

  • This class contains the
    basic operations available on all RDDs, such as `map`, `filter`, and `persist`.

Translation: this abstract class contains the basic operations that every RDD provides, such as map, filter, and persist.

  • In addition,
    [[org.apache.spark.rdd.PairRDDFunctions]] contains operations available only on RDDs of key-value
    pairs, such as `groupByKey` and `join`;

Translation: in addition, PairRDDFunctions contains operations available only on key-value (KV) RDDs, such as groupByKey and join.
Explanation: KV RDDs support grouping by key, joining on key, and similar operations.
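
For example (again assuming a SparkContext named sc; the data is made up):

val scores = sc.parallelize(Seq(("alice", 90), ("bob", 80), ("alice", 70)))
val ages   = sc.parallelize(Seq(("alice", 30), ("bob", 25)))

// Available only because the elements are key-value pairs:
val grouped = scores.groupByKey() // RDD[(String, Iterable[Int])]
val joined  = scores.join(ages)   // RDD[(String, (Int, Int))]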

  • [[org.apache.spark.rdd.DoubleRDDFunctions]] contains operations available only on RDDs of
    Doubles;

Translation: DoubleRDDFunctions provides operations available only on RDDs of Doubles.
Explanation: numeric datasets support statistical operations such as sum, mean, and histograms.
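
For example (assuming a SparkContext named sc):

val ds = sc.parallelize(Seq(1.0, 2.0, 3.0, 4.0))

println(ds.sum())   // 10.0
println(ds.mean())  // 2.5
println(ds.stats()) // count, mean, stdev, max, min in one pass
val (buckets, counts) = ds.histogram(2) // 2 equal-width buckets and their counts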

  • and
    [[org.apache.spark.rdd.SequenceFileRDDFunctions]] contains operations available on RDDs that
    can be saved as SequenceFiles.

Translation: SequenceFileRDDFunctions provides operations on RDDs that can be saved as Hadoop SequenceFiles.
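
For example (the output path is hypothetical; the keys and values must be convertible to Hadoop Writables, as String and Int are):

val pairs = sc.parallelize(Seq(("a", 1), ("b", 2)))
pairs.saveAsSequenceFile("hdfs:///tmp/seq-demo")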

  • All operations are automatically available on any RDD of the right type (e.g. RDD[(Int, Int)])
    through implicit.

Translation: all of these operations become automatically available on any RDD of the right type (e.g. RDD[(Int, Int)]) through implicit conversions.
Explanation: the RDD companion object contains conversion functions marked with the implicit keyword; implicit conversion is a Scala language feature.
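
This is what the conversion looks like in the RDD companion object (Spark 2.4.x, abridged):

object RDD {
  implicit def rddToPairRDDFunctions[K, V](rdd: RDD[(K, V)])
    (implicit kt: ClassTag[K], vt: ClassTag[V], ord: Ordering[K] = null): PairRDDFunctions[K, V] = {
    new PairRDDFunctions(rdd)
  }
  // ... plus similar implicits for DoubleRDDFunctions, SequenceFileRDDFunctions, etc.
}

So when groupByKey is called on an RDD[(Int, Int)], the compiler applies rddToPairRDDFunctions automatically and the call resolves to PairRDDFunctions.groupByKey.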

  • Internally, each RDD is characterized by five main properties:

- A list of partitions
- A function for computing each split
- A list of dependencies on other RDDs
- Optionally, a Partitioner for key-value RDDs (e.g. to say that the RDD is hash-partitioned)
- Optionally, a list of preferred locations to compute each split on (e.g. block locations for
an HDFS file)

Translation: internally, each RDD is characterized by these five properties (in other words, a concrete RDD implements the corresponding abstract methods or overrides the default values):

  1. A list of partitions (getPartitions)
  2. A function for computing each partition (compute); every operation on an RDD ultimately operates on its underlying partitions
  3. A list of dependencies on other RDDs (getDependencies)
  4. Optionally, a Partitioner for key-value RDDs, e.g. a hash partitioner (partitioner)
  5. Optionally, a list of preferred locations on which to compute each partition, e.g. the locations of an HDFS file's blocks (getPreferredLocations)

  • All of the scheduling and execution in Spark is done based on these methods, allowing each RDD to implement its own way of computing itself. Indeed, users can implement custom RDDs (e.g. for reading data from a new storage system) by overriding these functions.

Translation: all task scheduling and execution in Spark is based on these methods, which lets each RDD implement its own way of computing itself. Users can implement custom RDDs, e.g. for reading data from a new storage system, by overriding these functions (a sketch follows).
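
As a sketch of what overriding these functions looks like, here is a toy custom RDD (all names are hypothetical) that generates the integers 0 until n without reading from any parent RDD:

import org.apache.spark.{Partition, SparkContext, TaskContext}
import org.apache.spark.rdd.RDD

class RangeLikeRDD(sc: SparkContext, n: Int, numSlices: Int)
  extends RDD[Int](sc, Nil) { // Nil: no parent RDDs, hence no dependencies

  private case class RangePartition(index: Int) extends Partition

  // Property 1: the list of partitions
  override protected def getPartitions: Array[Partition] =
    (0 until numSlices).map(i => RangePartition(i): Partition).toArray

  // Property 2: how to compute a single partition
  override def compute(split: Partition, context: TaskContext): Iterator[Int] =
    (split.index until n by numSlices).iterator

  // Properties 3-5 keep their defaults: no dependencies,
  // partitioner = None, getPreferredLocations = Nil
}

// new RangeLikeRDD(sc, 10, 3).collect() yields every value in 0 until 10,
// concatenated partition by partition: 0, 3, 6, 9, 1, 4, 7, 2, 5, 8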

  • Please refer to the <a href="http://people.csail.mit.edu/matei/papers/2012/nsdi_spark.pdf">Spark paper</a> for more details on RDD internals.

Translation: for more details on RDD internals, refer to the Spark paper.

 

RDD computation:

  • An RDD has a list of partitions
protected def getPartitions: Array[Partition]
What is a Partition?
trait Partition extends Serializable {
  /**
   * Get the partition's index within its parent RDD
   */
  def index: Int

  // A better default implementation of HashCode
  override def hashCode(): Int = index

  override def equals(other: Any): Boolean = super.equals(other)
}

A Partition always has an index, i.e. its number within its parent RDD.

 

  • The compute method, e.g. running map over an RDD

Computation works on the underlying partitions, so compute takes two parameters: the partition to compute and the task context.

def compute(split: Partition, context: TaskContext): Iterator[T]

  • The getDependencies method: the dependencies between RDDs

protected def getDependencies: Seq[Dependency[_]] = deps

An RDD's dependency list contains narrow dependencies and wide (shuffle) dependencies.
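
A quick way to see the difference (assuming a SparkContext named sc):

val base = sc.parallelize(Seq(("a", 1), ("b", 2)), 2)

// map is a narrow dependency: each child partition reads one parent partition
println(base.map(identity).dependencies.head) // OneToOneDependency

// groupByKey is a wide dependency: it requires a shuffle across partitions
println(base.groupByKey().dependencies.head)  // ShuffleDependency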

  • partitioner

@transient val partitioner: Option[Partitioner] = None
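
A quick check of this property (assuming a SparkContext named sc):

import org.apache.spark.HashPartitioner

val pairs = sc.parallelize(Seq(("a", 1), ("b", 2)))
println(pairs.partitioner) // None: parallelize does not partition by key
println(pairs.partitionBy(new HashPartitioner(4)).partitioner) // Some(HashPartitioner)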

  • getPreferredLocations
protected def getPreferredLocations(split: Partition): Seq[String] = Nil 
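
Spark exposes this through the public preferredLocations wrapper; a sketch (the HDFS path is hypothetical):

val file = sc.textFile("hdfs:///data/input.txt")
file.partitions.foreach { p =>
  // for an HDFS-backed RDD this prints the hosts holding each block
  println(s"partition ${p.index} -> ${file.preferredLocations(p)}")
}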