Spark-rdd
@(spark)[rdd]
First, an introduction to RDD itself; then the individual RDD subclasses are described one by one in alphabetical order.
RDD
The base class is simply called RDD. The file is very long and contains a great many functions:
1. Explanations of the more self-explanatory functions are omitted.
2. There is a large set of transformation-style utility functions such as distinct.
3. To reiterate, sc.runJob is the entry point for every function that does real work, i.e. the actions (see the sketch after this list).
4. The companion object RDD contains a large number of implicit conversions.
5. The most important function here is override def compute(split: Partition, context: TaskContext): Iterator[T]; note that it returns an Iterator, and each RDD is computed through this mechanism, one partition at a time.
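As an illustration of point 3, here is a hedged sketch of how an action bottoms out in sc.runJob. The helper name countViaRunJob is made up for illustration; count in RDD.scala is implemented in essentially this shape, running one job whose tasks each measure their own partition's iterator, with the driver summing the per-partition results:

```scala
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

object RunJobSketch {
  // Hypothetical helper mirroring how count reaches sc.runJob: one task
  // per partition consumes that partition's Iterator, and the driver
  // sums the per-partition counts returned in the Array[Long].
  def countViaRunJob[T](sc: SparkContext, rdd: RDD[T]): Long =
    sc.runJob(rdd, (iter: Iterator[T]) => iter.size.toLong).sum
}
```

The class's own scaladoc summarizes the design: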
* Internally, each RDD is characterized by five main properties:
*
* - A list of partitions
* - A function for computing each split
* - A list of dependencies on other RDDs
* - Optionally, a Partitioner for key-value RDDs (e.g. to say that the RDD is hash-partitioned)
* - Optionally, a list of preferred locations to compute each split on (e.g. block locations for
* an HDFS file)
*
* All of the scheduling and execution in Spark is done based on these methods, allowing each RDD
* to implement its own way of computing itself. Indeed, users can implement custom RDDs (e.g. for
* reading data from a new storage system) by overriding these functions. Please refer to the
* [[http://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf Spark paper]] for more details
* on RDD internals.
*/
abstract class RDD[T: ClassTag](
@transient private var _sc: SparkContext,
@transient private var deps: Seq[Dependency[_]]
) extends Serializable with Logging {
Note the two constructor parameters (both are exercised in the sketch below):
1. a SparkContext
2. a Seq[Dependency[_]], i.e. this RDD's dependencies on its parent RDDs
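To make the five properties and these two constructor parameters concrete, here is a minimal, hypothetical custom RDD (the names SeqRDD and SeqPartition are made up for illustration). It hands the SparkContext to the base class and passes Nil for the dependency Seq, since it has no parent RDDs:

```scala
import scala.reflect.ClassTag
import org.apache.spark.{Partition, SparkContext, TaskContext}
import org.apache.spark.rdd.RDD

// Hypothetical partition marker: just an index, as the Partition trait requires.
case class SeqPartition(index: Int) extends Partition

// Hypothetical RDD serving a fixed in-memory Seq, split into numSlices parts.
class SeqRDD[T: ClassTag](sc: SparkContext, data: Seq[T], numSlices: Int)
    extends RDD[T](sc, Nil) {

  // Property 1: the list of partitions.
  override protected def getPartitions: Array[Partition] =
    Array.tabulate[Partition](numSlices)(i => SeqPartition(i))

  // Property 2: how to compute one split, returned as an Iterator.
  override def compute(split: Partition, context: TaskContext): Iterator[T] = {
    val i = split.index
    data.slice(i * data.length / numSlices, (i + 1) * data.length / numSlices).iterator
  }
}
```

Properties 3 to 5 (dependencies, partitioner, preferred locations) keep their defaults here: the Nil dependency Seq from the constructor, no Partitioner, and no location preferences.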
EmptyRDD
An empty RDD: compute simply throws an exception. Presumably a do-nothing RDD reserved for special uses; since it reports no partitions, compute can never actually be invoked.
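Paraphrasing EmptyRDD.scala, the entire class is roughly:

```scala
import scala.reflect.ClassTag
import org.apache.spark.{Partition, SparkContext, TaskContext}
import org.apache.spark.rdd.RDD

// Sketch of EmptyRDD: with zero partitions the scheduler never launches
// a task against it, so compute is unreachable and throwing is a guard.
class EmptyRDD[T: ClassTag](sc: SparkContext) extends RDD[T](sc, Nil) {
  override protected def getPartitions: Array[Partition] = Array.empty
  override def compute(split: Partition, context: TaskContext): Iterator[T] =
    throw new UnsupportedOperationException("empty RDD")
}
```

sc.emptyRDD[T] returns such an RDD; it is handy as a neutral element, e.g. when unioning a possibly empty collection of RDDs.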