Spark Basics: What Is an RDD

RDD is short for Resilient Distributed Dataset (RDDs): a resilient, distributed collection of data.

The source comment at the top of RDD.scala in the Spark codebase explains RDDs in detail (GitHub: https://github.com/apache/spark/blob/v2.4.4/core/src/main/scala/org/apache/spark/rdd/RDD.scala).

Taking the comment in Spark 2.4.4 as an example:

A Resilient Distributed Dataset (RDD), the basic abstraction in Spark. Represents an immutable,
partitioned collection of elements that can be operated on in parallel. This class contains the
basic operations available on all RDDs, such as `map`, `filter`, and `persist`. In addition,
[[org.apache.spark.rdd.PairRDDFunctions]] contains operations available only on RDDs of key-value
pairs, such as `groupByKey` and `join`;
[[org.apache.spark.rdd.DoubleRDDFunctions]] contains operations available only on RDDs of
Doubles; and
[[org.apache.spark.rdd.SequenceFileRDDFunctions]] contains operations available on RDDs that
can be saved as SequenceFiles.
All operations are automatically available on any RDD of the right type (e.g. RDD[(Int, Int)])
through implicit.

Internally, each RDD is characterized by five main properties:

- A list of partitions
- A function for computing each split
- A list of dependencies on other RDDs
- Optionally, a Partitioner for key-value RDDs (e.g. to say that the RDD is hash-partitioned)
- Optionally, a list of preferred locations to compute each split on (e.g. block locations for
an HDFS file)

All of the scheduling and execution in Spark is done based on these methods, allowing each RDD
to implement its own way of computing itself. Indeed, users can implement custom RDDs (e.g. for
reading data from a new storage system) by overriding these functions. Please refer to the
<a href="http://people.csail.mit.edu/matei/papers/2012/nsdi_spark.pdf">Spark paper</a>
for more details on RDD internals.

Key points:

  • A Resilient Distributed Dataset (RDD), the basic abstraction in Spark

Translation: a Resilient Distributed Dataset is the basic abstraction in Spark.

Explanation: resilient (recoverable) means the dataset is fault-tolerant and can be rebuilt when something fails; distributed means the dataset can be spread across different machines.

  • Represents an immutable, partitioned collection of elements that can be operated on in parallel.

Translation: an RDD is an immutable, partitioned collection of elements that can be processed in parallel.

Explanation: immutable follows the same design philosophy as Scala: once the dataset is built it can never be modified, which trivially solves consistency problems when multiple threads read the same data. Partitioned is what makes parallel, i.e. distributed, processing possible.
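
A minimal sketch of both points (it assumes a live SparkContext named sc):

val nums = sc.parallelize(1 to 10, numSlices = 4) // an RDD split into 4 partitions

// Transformations never modify nums; they return a new RDD
val doubled = nums.map(_ * 2)

println(nums.getNumPartitions)            // 4
println(doubled.collect().mkString(", ")) // 2, 4, 6, ..., 20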

  • This class contains the
    basic operations available on all RDDs, such as `map`, `filter`, and `persist`.

Translation: this abstract class contains the basic operations that every RDD provides, such as map, filter, and persist.

  • In addition,
    [[org.apache.spark.rdd.PairRDDFunctions]] contains operations available only on RDDs of key-value
    pairs, such as `groupByKey` and `join`;

Translation: in addition, PairRDDFunctions contains operations available only on key-value (KV) RDDs, such as groupByKey and join.
Explanation: KV RDDs support grouping by key, joining on key, and similar operations.
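
For example (again assuming a SparkContext named sc; the data is made up):

val scores = sc.parallelize(Seq(("alice", 90), ("bob", 80), ("alice", 70)))
val ages   = sc.parallelize(Seq(("alice", 30), ("bob", 25)))

// Available only because the elements are key-value pairs:
val grouped = scores.groupByKey() // RDD[(String, Iterable[Int])]
val joined  = scores.join(ages)   // RDD[(String, (Int, Int))]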

  • [[org.apache.spark.rdd.DoubleRDDFunctions]] contains operations available only on RDDs of
    Doubles;

Translation: DoubleRDDFunctions provides operations available only on RDDs of Doubles.
Explanation: numeric datasets support statistical operations such as sum, mean, and histograms.
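
For example (assuming a SparkContext named sc):

val ds = sc.parallelize(Seq(1.0, 2.0, 3.0, 4.0))

println(ds.sum())   // 10.0
println(ds.mean())  // 2.5
println(ds.stats()) // count, mean, stdev, max, min in one pass
val (buckets, counts) = ds.histogram(2) // 2 equal-width buckets and their counts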

  • and
    [[org.apache.spark.rdd.SequenceFileRDDFunctions]] contains operations available on RDDs that
    can be saved as SequenceFiles.

Translation: SequenceFileRDDFunctions provides operations on RDDs that can be saved as Hadoop SequenceFiles.
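
For example (the output path is hypothetical; the keys and values must be convertible to Hadoop Writables, as String and Int are):

val pairs = sc.parallelize(Seq(("a", 1), ("b", 2)))
pairs.saveAsSequenceFile("hdfs:///tmp/seq-demo")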

  • All operations are automatically available on any RDD of the right type (e.g. RDD[(Int, Int)])
    through implicit.

Translation: all of these operations become automatically available on any RDD of the right type (e.g. RDD[(Int, Int)]) through implicit conversions.
Explanation: the RDD companion object contains conversion functions marked with the implicit keyword; implicit conversion is a Scala language feature.
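
This is what the conversion looks like in the RDD companion object (Spark 2.4.x, abridged):

object RDD {
  implicit def rddToPairRDDFunctions[K, V](rdd: RDD[(K, V)])
    (implicit kt: ClassTag[K], vt: ClassTag[V], ord: Ordering[K] = null): PairRDDFunctions[K, V] = {
    new PairRDDFunctions(rdd)
  }
  // ... plus similar implicits for DoubleRDDFunctions, SequenceFileRDDFunctions, etc.
}

So when groupByKey is called on an RDD[(Int, Int)], the compiler applies rddToPairRDDFunctions automatically and the call resolves to PairRDDFunctions.groupByKey.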

  • Internally, each RDD is characterized by five main properties:

- A list of partitions
- A function for computing each split
- A list of dependencies on other RDDs
- Optionally, a Partitioner for key-value RDDs (e.g. to say that the RDD is hash-partitioned)
- Optionally, a list of preferred locations to compute each split on (e.g. block locations for
an HDFS file)

Translation: internally, each RDD is characterized by these five properties (in other words, a concrete RDD implements the corresponding abstract methods or overrides the default values):

  1. A list of partitions (getPartitions)
  2. A function for computing each partition (compute); every operation on an RDD ultimately operates on its underlying partitions
  3. A list of dependencies on other RDDs (getDependencies)
  4. Optionally, a Partitioner for key-value RDDs, e.g. a hash partitioner (partitioner)
  5. Optionally, a list of preferred locations on which to compute each partition, e.g. the locations of an HDFS file's blocks (getPreferredLocations)

  • All of the scheduling and execution in Spark is done based on these methods, allowing each RDD to implement its own way of computing itself. Indeed, users can implement custom RDDs (e.g. for reading data from a new storage system) by overriding these functions.

Translation: all task scheduling and execution in Spark is based on these methods, which lets each RDD implement its own way of computing itself. Users can implement custom RDDs, e.g. for reading data from a new storage system, by overriding these functions (a sketch follows).
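
As a sketch of what overriding these functions looks like, here is a toy custom RDD (all names are hypothetical) that generates the integers 0 until n without reading from any parent RDD:

import org.apache.spark.{Partition, SparkContext, TaskContext}
import org.apache.spark.rdd.RDD

class RangeLikeRDD(sc: SparkContext, n: Int, numSlices: Int)
  extends RDD[Int](sc, Nil) { // Nil: no parent RDDs, hence no dependencies

  private case class RangePartition(index: Int) extends Partition

  // Property 1: the list of partitions
  override protected def getPartitions: Array[Partition] =
    (0 until numSlices).map(i => RangePartition(i): Partition).toArray

  // Property 2: how to compute a single partition
  override def compute(split: Partition, context: TaskContext): Iterator[Int] =
    (split.index until n by numSlices).iterator

  // Properties 3-5 keep their defaults: no dependencies,
  // partitioner = None, getPreferredLocations = Nil
}

// new RangeLikeRDD(sc, 10, 3).collect() yields every value in 0 until 10,
// concatenated partition by partition: 0, 3, 6, 9, 1, 4, 7, 2, 5, 8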

  • Please refer to the <a href="http://people.csail.mit.edu/matei/papers/2012/nsdi_spark.pdf">Spark paper</a> for more details on RDD internals.

Translation: for more details on RDD internals, refer to the Spark paper.

 

RDD computation:

  • An RDD has a list of partitions
protected def getPartitions: Array[Partition]
What is a Partition?
trait Partition extends Serializable {
  /**
   * Get the partition's index within its parent RDD
   */
  def index: Int

  // A better default implementation of HashCode
  override def hashCode(): Int = index

  override def equals(other: Any): Boolean = super.equals(other)
}

A Partition always has an index, i.e. its number within its parent RDD.

 

  • The compute method, e.g. running map over an RDD

Computation works on the underlying partitions, so compute takes two parameters: the partition to compute and the task context.

def compute(split: Partition, context: TaskContext): Iterator[T]

  • The getDependencies method: the dependencies between RDDs

protected def getDependencies: Seq[Dependency[_]] = deps

An RDD's dependency list contains narrow dependencies and wide (shuffle) dependencies.
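
A quick way to see the difference (assuming a SparkContext named sc):

val base = sc.parallelize(Seq(("a", 1), ("b", 2)), 2)

// map is a narrow dependency: each child partition reads one parent partition
println(base.map(identity).dependencies.head) // OneToOneDependency

// groupByKey is a wide dependency: it requires a shuffle across partitions
println(base.groupByKey().dependencies.head)  // ShuffleDependency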

  • partitioner

@transient val partitioner: Option[Partitioner] = None
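
A quick check of this property (assuming a SparkContext named sc):

import org.apache.spark.HashPartitioner

val pairs = sc.parallelize(Seq(("a", 1), ("b", 2)))
println(pairs.partitioner) // None: parallelize does not partition by key
println(pairs.partitionBy(new HashPartitioner(4)).partitioner) // Some(HashPartitioner)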

  • getPreferredLocations
protected def getPreferredLocations(split: Partition): Seq[String] = Nil 
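
Spark exposes this through the public preferredLocations wrapper; a sketch (the HDFS path is hypothetical):

val file = sc.textFile("hdfs:///data/input.txt")
file.partitions.foreach { p =>
  // for an HDFS-backed RDD this prints the hosts holding each block
  println(s"partition ${p.index} -> ${file.preferredLocations(p)}")
}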