Spark Core Source Analysis 17: RDD-related APIs

Blog: http://blog.csdn.net/yueqian_zhu/


I. RDD Creation Operations (SparkContext.scala)

1. Create an RDD from an in-memory collection; the resulting RDD holds elements of type T

def parallelize[T: ClassTag](
    seq: Seq[T],
    numSlices: Int = defaultParallelism): RDD[T]

def makeRDD[T: ClassTag](
    seq: Seq[T],
    numSlices: Int = defaultParallelism): RDD[T]

2. Read from a file (path/to/file; hdfs://ip:port/path/to/file; file://path/to/file). Paths without a scheme prefix are read from HDFS by default. The HDFS configuration is built from the Hadoop-related settings in SparkConf.

path:

(1) A single file path: only that file is loaded

(2) A directory path: all files directly under that directory are loaded (files in subdirectories are excluded)

(3) A glob pattern: loads multiple files, or all files under multiple directories

def textFile(
    path: String,
    minPartitions: Int = defaultMinPartitions): RDD[String]
Example:
val sc = new SparkContext(new SparkConf().setAppName("Spark Test"))

    // textFile: one RDD element per line
    val textFileRDD = sc.textFile("file:///Users/zhengze/Downloads/spark-1.4.1-bin-hadoop2.3/README.md")
    println(textFileRDD.count)  // print the number of lines
3. The old hadoopRDD interface
def hadoopRDD[K, V](
    conf: JobConf,
    inputFormatClass: Class[_ <: InputFormat[K, V]],
    keyClass: Class[K],
    valueClass: Class[V],
    minPartitions: Int = defaultMinPartitions): RDD[(K, V)]
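A minimal sketch of using this old mapred-API interface (the HDFS path here is just a placeholder):
    import org.apache.hadoop.io.{LongWritable, Text}
    import org.apache.hadoop.mapred.{FileInputFormat, JobConf, TextInputFormat}

    val jobConf = new JobConf(sc.hadoopConfiguration)
    FileInputFormat.setInputPaths(jobConf, "hdfs://ip:port/path/to/file")  // placeholder path
    val oldApiRDD = sc.hadoopRDD(jobConf, classOf[TextInputFormat], classOf[LongWritable], classOf[Text])
    println(oldApiRDD.count)  // number of records (lines, for TextInputFormat)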
4. The new Hadoop file interface, newAPIHadoopFile. path: the file(s) to read; conf: the Hadoop configuration, handled the same way as in textFile.

(1) The overloads without an explicit conf default to the Hadoop configuration held by the SparkContext.

(2) You can also supply your own conf; see the example below.

def newAPIHadoopFile[K, V, F <: NewInputFormat[K, V]](
    path: String,
    fClass: Class[F],
    kClass: Class[K],
    vClass: Class[V],
    conf: Configuration = hadoopConfiguration): RDD[(K, V)]
newAPIHadoopRDD takes no path parameter; the input path has to be carried in the Configuration, and the RDD's name can be set separately via the setName interface.
def newAPIHadoopRDD[K, V, F <: NewInputFormat[K, V]](
    conf: Configuration = hadoopConfiguration,
    fClass: Class[F],
    kClass: Class[K],
    vClass: Class[V]): RDD[(K, V)]
This overload relies on implicit ClassTags: if V is Text, vm is implicitly a ClassTag[Text], and vm.runtimeClass.asInstanceOf[Class[V]] recovers the class object, so this is equivalent to calling the interface above. Very convenient.
//eg. val file = sparkContext.hadoopFile[LongWritable, Text, TextInputFormat](path)
def newAPIHadoopFile[K, V, F <: NewInputFormat[K, V]]
    (path: String)
    (implicit km: ClassTag[K], vm: ClassTag[V], fm: ClassTag[F]): RDD[(K, V)]
Example (using the explicit overload so the custom conf is actually passed in):
val sc = new SparkContext(new SparkConf().setAppName("Spark Test"))
    // newAPIHadoopFile with a custom Configuration
    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.io.{LongWritable, Text}
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
    val conf = new Configuration(false)
    conf.addResource("xxx-core-site.xml")
    val hadoopFileRDD = sc.newAPIHadoopFile("hdfs://ip:port/path/to/file",
      classOf[TextInputFormat], classOf[LongWritable], classOf[Text], conf)
    println(hadoopFileRDD.count)  // print the number of lines
5. Merging the elements of multiple RDDs (without deduplication)
def union[T: ClassTag](rdds: Seq[RDD[T]]): RDD[T]
def union[T: ClassTag](first: RDD[T], rest: RDD[T]*): RDD[T]
Example:
val sc = new SparkContext(new SparkConf().setAppName("Spark Test"))
    // union
    val rdd1 = sc.parallelize(Seq(1,2,3))
    val rdd2 = sc.parallelize(Seq(3,4,5))
    sc.union(Seq(rdd1,rdd2)).foreach(println) // prints 1 2 3 3 4 5
II. Basic RDD Transformations (RDD.scala)
1. Storage-related operations (self-explanatory)
def persist(newLevel: StorageLevel): this.type
def cache(): this.type 
def unpersist(blocking: Boolean = true): this.type
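A quick sketch of how these are typically used (the file path is a placeholder):
    val lines = sc.textFile("file:///path/to/file").cache()  // cache() = persist(StorageLevel.MEMORY_ONLY)
    println(lines.count)   // first action: reads the file and caches the partitions
    println(lines.count)   // second action: served from the cache, no re-read
    lines.unpersist()      // release the cached blocks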
2. map
The function f maps each element of type T in the original RDD one-to-one to an element of type U, producing a MapPartitionsRDD.
/**
 * Return a new RDD by applying a function to all elements of this RDD.
 */
def map[U: ClassTag](f: T => U): RDD[U]
Example:
val sc = new SparkContext(new SparkConf().setAppName("Spark Test"))
    // map
    val m = sc.parallelize(Seq(1,2,3)).map(x => x+1).foreach(println)  // prints 2 3 4
3. flatMap
The function f maps each element of type T in the original RDD to a collection of U values, and all of these small collections are then flattened into one large collection.
The element type of the RDD produced by flatMap is the element type U of the collections returned by f, i.e. the result is an RDD[U].
/**
 *  Return a new RDD by first applying a function to all elements of this
 *  RDD, and then flattening the results.
 */
def flatMap[U: ClassTag](f: T => TraversableOnce[U]): RDD[U]
Example:
val sc = new SparkContext(new SparkConf().setAppName("Spark Test"))
    // flatMap
    val fm = sc.parallelize(Seq(1,2,3)).flatMap(x => Seq(x+2)).foreach(println)  // prints 3 4 5
4. filter
The function f maps each element of type T to a Boolean and is used to filter the elements, keeping those for which f returns true.
def filter(f: T => Boolean): RDD[T]
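Example (a small sketch):
    sc.parallelize(Seq(1,2,3,4,5)).filter(x => x % 2 == 1).foreach(println)  // prints 1 3 5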
5. distinct
Removes duplicate elements from the original RDD. The implicit ord parameter is not used here, so the result is not sorted.
/**
 * Return a new RDD containing the distinct elements in this RDD.
 */
def distinct(numPartitions: Int)(implicit ord: Ordering[T] = null): RDD[T] = withScope {
  map(x => (x, null)).reduceByKey((x, y) => x, numPartitions).map(_._1)
}

/**
 * Return a new RDD containing the distinct elements in this RDD.
 */
def distinct(): RDD[T] = withScope {
  distinct(partitions.length)
}
Example:
val sc = new SparkContext(new SparkConf().setAppName("Spark Test"))
    // distinct: regardless of how many partitions are used, all data ends up deduplicated
    val dis = sc.parallelize(Seq(1,2,3,4,5,4,3)).distinct().foreach(println) // prints 4 1 3 5 2
6. Repartitioning. repartition is simply coalesce with shuffle set to true.
Suppose N partitions are repartitioned into M partitions:
N < M: shuffle must be set to true
N > M and the two are close: shuffle can be set to false, giving a narrow dependency
N >>> M: if shuffle is false, performance will suffer (the upstream work collapses into very few tasks)
def repartition(numPartitions: Int)(implicit ord: Ordering[T] = null): RDD[T]
def coalesce(numPartitions: Int, shuffle: Boolean = false)(implicit ord: Ordering[T] = null)
    : RDD[T]
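A small sketch showing how the partition count changes:
    val rdd = sc.parallelize(1 to 100, 8)
    println(rdd.coalesce(2).partitions.length)       // 2: narrow dependency, no shuffle
    println(rdd.coalesce(16).partitions.length)      // still 8: growing the partition count requires shuffle = true
    println(rdd.repartition(16).partitions.length)   // 16: repartition(n) = coalesce(n, shuffle = true)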
7. union / ++
Merges two RDDs; duplicate elements may be present in the result.
/**
   * Return the union of this RDD and another one. Any identical elements will appear multiple
   * times (use `.distinct()` to eliminate them).
   */
  def union(other: RDD[T]): RDD[T]
def ++(other: RDD[T]): RDD[T]
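Example using the ++ alias (a small sketch):
    val r1 = sc.parallelize(Seq(1,2,3))
    val r2 = sc.parallelize(Seq(3,4,5))
    println((r1 ++ r2).count)  // 6: the duplicate element 3 is kept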
8. sortBy
First keyBy is called: the function f converts each element T of the original RDD into a (K, T) pair, i.e. the key K is produced from T by f.
Then sortByKey sorts by key across range partitions, and finally .values keeps only the sorted T elements.
/**
 * Return this RDD sorted by the given key function.
 */
def sortBy[K](
    f: (T) => K,
    ascending: Boolean = true,
    numPartitions: Int = this.partitions.length)
    (implicit ord: Ordering[K], ctag: ClassTag[K]): RDD[T] = withScope {
  this.keyBy[K](f)
      .sortByKey(ascending, numPartitions)
      .values
}
Example:
val sc = new SparkContext(new SparkConf().setAppName("Spark Test"))
    val rdd1 = sc.parallelize(Seq(1,2,3,4,5,4,3,6,7))
    .sortBy(x => x,true).foreach(println)  // prints 1 2 3 3 4 4 5 6 7
9. Intersection; the result contains no duplicate elements
def intersection(other: RDD[T]): RDD[T]
def intersection(
    other: RDD[T],
    partitioner: Partitioner)(implicit ord: Ordering[T] = null): RDD[T]
def intersection(other: RDD[T], numPartitions: Int): RDD[T]
Example:
val sc = new SparkContext(new SparkConf().setAppName("Spark Test"))
    val rdd1 = sc.parallelize(Seq(1,2,3))
    val rdd2 = sc.parallelize(Seq(3,4,5))
    rdd1.intersection(rdd2).foreach(println) // prints 3
10. groupBy: first the function f converts each element T of the original RDD into a (K, T) pair, i.e. the key K is produced from T by f.
Then groupByKey aggregates the RDD by key, producing an RDD of type (K, Iterable[T]). An example follows the signatures below.
def groupBy[K](f: T => K)(implicit kt: ClassTag[K]): RDD[(K, Iterable[T])]
def groupBy[K](
    f: T => K,
    numPartitions: Int)(implicit kt: ClassTag[K]): RDD[(K, Iterable[T])]
def groupBy[K](f: T => K, p: Partitioner)(implicit kt: ClassTag[K], ord: Ordering[K] = null)
    : RDD[(K, Iterable[T])]
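Example (a small sketch; the grouping function is just an illustration):
    sc.parallelize(1 to 9).groupBy(x => if (x % 2 == 0) "even" else "odd").foreach(println)
    // e.g. (even,CompactBuffer(2, 4, 6, 8)) and (odd,CompactBuffer(1, 3, 5, 7, 9))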
11. mapPartitions is similar to map, except that the parameter f operates on an iterator over each partition of the RDD instead of on individual elements.
This allows all elements in a partition to share one external object rather than creating one per element; for example, a JDBC connection does not need to be opened for every single element.
The preservesPartitioning parameter indicates whether the partitioner of the parent RDD is preserved.
mapPartitionsWithIndex is identical except that its parameter f also receives the partition index.
def mapPartitions[U: ClassTag](
    f: Iterator[T] => Iterator[U],
    preservesPartitioning: Boolean = false): RDD[U]
def mapPartitionsWithIndex[U: ClassTag](
    f: (Int, Iterator[T]) => Iterator[U],
    preservesPartitioning: Boolean = false): RDD[U]
Example:
val a = sc.parallelize(1 to 9, 3)
    // Within each partition, pair every element with its successor
    def myfunc[T](iter: Iterator[T]): Iterator[(T, T)] = {
      var res = List[(T, T)]()
      var pre = iter.next
      while (iter.hasNext) {
        val cur = iter.next
        res ::= (pre, cur)
        pre = cur
      }
      res.iterator
    }
    a.mapPartitions(myfunc).foreach(println)
12. zip combines two RDDs into a key/value RDD; the two RDDs must have the same number of partitions and the same number of elements.
zipPartitions combines multiple RDDs partition by partition into a new RDD; the RDDs must have the same number of partitions, but the number of elements in each partition may differ.
def zip[U: ClassTag](other: RDD[U]): RDD[(T, U)]
def zipPartitions[B: ClassTag, V: ClassTag]
    (rdd2: RDD[B], preservesPartitioning: Boolean)
    (f: (Iterator[T], Iterator[B]) => Iterator[V]): RDD[V]
def zipPartitions[B: ClassTag, V: ClassTag]
    (rdd2: RDD[B])
    (f: (Iterator[T], Iterator[B]) => Iterator[V]): RDD[V] = withScope {
  zipPartitions(rdd2, preservesPartitioning = false)(f)
}
def zipPartitions[B: ClassTag, C: ClassTag, V: ClassTag]
    (rdd2: RDD[B], rdd3: RDD[C], preservesPartitioning: Boolean)
    (f: (Iterator[T], Iterator[B], Iterator[C]) => Iterator[V]): RDD[V]
def zipPartitions[B: ClassTag, C: ClassTag, V: ClassTag]
    (rdd2: RDD[B], rdd3: RDD[C])
    (f: (Iterator[T], Iterator[B], Iterator[C]) => Iterator[V]): RDD[V] = withScope {
  zipPartitions(rdd2, rdd3, preservesPartitioning = false)(f)
}
def zipPartitions[B: ClassTag, C: ClassTag, D: ClassTag, V: ClassTag]
    (rdd2: RDD[B], rdd3: RDD[C], rdd4: RDD[D], preservesPartitioning: Boolean)
    (f: (Iterator[T], Iterator[B], Iterator[C], Iterator[D]) => Iterator[V]): RDD[V]
def zipPartitions[B: ClassTag, C: ClassTag, D: ClassTag, V: ClassTag]
    (rdd2: RDD[B], rdd3: RDD[C], rdd4: RDD[D])
    (f: (Iterator[T], Iterator[B], Iterator[C], Iterator[D]) => Iterator[V]): RDD[V] = withScope {
  zipPartitions(rdd2, rdd3, rdd4, preservesPartitioning = false)(f)
}
Example 1:
val a = sc.parallelize(1 to 100, 3)
    val b = sc.parallelize(101 to 200, 3)
    a.zip(b).foreach(println)
Example 2:
val a = sc.parallelize(0 to 9, 3)
    val b = sc.parallelize(10 to 19, 3)
    val c = sc.parallelize(100 to 109, 3)
    def myfunc(aiter: Iterator[Int], biter: Iterator[Int], citer: Iterator[Int]): Iterator[String] =
    {
      var res = List[String]()
      while (aiter.hasNext && biter.hasNext && citer.hasNext)
      {
        val x = aiter.next + " " + biter.next + " " + citer.next
        res ::= x
      }
      res.iterator
    }
    a.zipPartitions(b, c)(myfunc).foreach(println)

III. Transformations on Key/Value RDDs
1. partitionBy
Similar in spirit to repartition among the basic transformations: it repartitions the original RDD according to the given partitioner, producing a ShuffledRDD.
def partitionBy(partitioner: Partitioner): RDD[(K, V)]
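A minimal sketch using a HashPartitioner:
    import org.apache.spark.HashPartitioner
    val pairs = sc.parallelize(Seq((1, "a"), (2, "b"), (3, "c"), (4, "d")), 2)
    val repartitioned = pairs.partitionBy(new HashPartitioner(4))  // shuffles into 4 hash partitions
    println(repartitioned.partitions.length)  // 4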
2. mapValues / flatMapValues
Apply a map or flatMap operation to the values V of an RDD[(K, V)], leaving the keys unchanged.
def mapValues[U](f: V => U): RDD[(K, U)]
def flatMapValues[U](f: V => TraversableOnce[U]): RDD[(K, U)]
Example 1:
val a = sc.parallelize(List("dog", "tiger", "lion", "cat", "panther", "eagle"), 2)
    val b = a.map(x => (x.length, x))
    b.mapValues("x" + _ + "x").foreach(println)
Example 2:
val a = sc.parallelize(List("dog", "tiger", "lion", "cat", "panther", "eagle"), 2)
val b = a.map(x => (x.length, x))
b.flatMapValues("x" + _ + "x").collect
res6: Array[(Int, Char)] = Array((3,x), (3,d), (3,o), (3,g), (3,x), (5,x), (5,t), (5,i), (5,g), (5,e), (5,r), (5,x), (4,x), (4,l), (4,i), (4,o), (4,n), (4,x), (3,x), (3,c), (3,a), (3,t), (3,x), (7,x), (7,p), (7,a), (7,n), (7,t), (7,h), (7,e), (7,r), (7,x), (5,x), (5,e), (5,a), (5,g), (5,l), (5,e), (5,x))
3. Combine operations
createCombiner: the combiner-creation function, turning a V into a C
mergeValue: the merge function, combining a V with an existing C into a C
mergeCombiners: combines two C values into one C
partitioner: the partitioner
mapSideCombine: whether to combine on the map side
Looking a little deeper: since we are grouping by key, each partition first stores its (key, value) entries, checking whether the key already exists. The first time a key appears it obviously does not, so createCombiner builds the initial value; every later occurrence of the same key only calls mergeValue to merge the new value into the stored one.
def combineByKey[C](createCombiner: V => C,
    mergeValue: (C, V) => C,
    mergeCombiners: (C, C) => C,
    partitioner: Partitioner,
    mapSideCombine: Boolean = true,
    serializer: Serializer = null): RDD[(K, C)]
def combineByKey[C](createCombiner: V => C,
    mergeValue: (C, V) => C,
    mergeCombiners: (C, C) => C,
    numPartitions: Int): RDD[(K, C)]
def combineByKey[C](createCombiner: V => C, mergeValue: (C, V) => C, mergeCombiners: (C, C) => C)
  : RDD[(K, C)]
Example:
val a = sc.parallelize(List("dog","cat","gnu","salmon","rabbit","turkey","wolf","bear","bee"), 3)
    val b = sc.parallelize(List(1,1,2,2,2,1,2,2,2), 3)
    val c = b.zip(a)
    val d = c.combineByKey(List(_), (x:List[String], y:String) => y :: x, (x:List[String], y:List[String]) => x ::: y)
    d.foreach(println)
The following operations all ultimately boil down to calls to combineByKey.
def foldByKey(
    zeroValue: V,
    partitioner: Partitioner)(func: (V, V) => V): RDD[(K, V)]
def foldByKey(zeroValue: V, numPartitions: Int)(func: (V, V) => V): RDD[(K, V)]
def foldByKey(zeroValue: V)(func: (V, V) => V): RDD[(K, V)]
def reduceByKey(partitioner: Partitioner, func: (V, V) => V): RDD[(K, V)]
def reduceByKey(func: (V, V) => V, numPartitions: Int): RDD[(K, V)]
def reduceByKey(func: (V, V) => V): RDD[(K, V)]
def groupByKey(): RDD[(K, Iterable[V])]
def groupByKey(partitioner: Partitioner): RDD[(K, Iterable[V])]
def groupByKey(numPartitions: Int): RDD[(K, Iterable[V])]
def aggregateByKey[U: ClassTag](zeroValue: U, partitioner: Partitioner)(seqOp: (U, V) => U,
    combOp: (U, U) => U): RDD[(K, U)]
def aggregateByKey[U: ClassTag](zeroValue: U, numPartitions: Int)(seqOp: (U, V) => U,
    combOp: (U, U) => U): RDD[(K, U)]
def aggregateByKey[U: ClassTag](zeroValue: U)(seqOp: (U, V) => U,
    combOp: (U, U) => U): RDD[(K, U)]
Example 1: foldByKey (compared with combineByKey, the C type is the same as the V type; createCombiner is func(zeroValue, v), and mergeValue and mergeCombiners are the same function)
Note that, for each key, the zero value "x" is merged with the first element when a partition starts computing (it is not merged with every element in the partition):
val a = sc.parallelize(List("dog", "tiger", "lion", "cat", "panther", "eagle"), 2)
val b = a.map(x => (x.length, x))
b.foldByKey("x")(_ + _).collect
res85: Array[(Int, String)] = Array((4,xlion), (3,xdogcat), (7,xpanther), (5,xtigereagle))
Example 2: reduceByKey (compared with foldByKey, no createCombiner / zero value is needed)
val a = sc.parallelize(List("dog", "tiger", "lion", "cat", "panther", "eagle"), 2)
val b = a.map(x => (x.length, x))
b.reduceByKey(_ + _).collect
res87: Array[(Int, String)] = Array((4,lion), (3,dogcat), (7,panther), (5,tigereagle))
Example 3: groupByKey (compared with combineByKey, no merge function is needed; it simply collects all values that share the same key)
val a = sc.parallelize(List("dog", "tiger", "lion", "cat", "spider", "eagle"), 2)
val b = a.keyBy(_.length)
b.groupByKey.collect
res11: Array[(Int, Seq[String])] = Array((4,ArrayBuffer(lion)), (6,ArrayBuffer(spider)), (3,ArrayBuffer(dog, cat)), (5,ArrayBuffer(tiger, eagle)))
Example 4: aggregateByKey (can be viewed roughly as a combination of combineByKey and foldByKey: it takes a zeroValue as well as mergeValue (seqOp) and mergeCombiners (combOp) functions)
val pairRDD = sc.parallelize(List( ("cat",2), ("cat", 5), ("mouse", 4),("cat", 12), ("dog", 12), ("mouse", 2)), 1)
    pairRDD.aggregateByKey(100)( _ + _, _ + _).foreach(println)
    // output:
    //(mouse,106)
    //(dog,112)
    //(cat,119)
4. Join operations
For keys that appear in both RDDs, the corresponding values are combined into pairs.
def join[W](other: RDD[(K, W)]): RDD[(K, (V, W))]
def join[W](other: RDD[(K, W)], numPartitions: Int): RDD[(K, (V, W))]
def join[W](other: RDD[(K, W)], partitioner: Partitioner): RDD[(K, (V, W))]
Example:
val a = sc.parallelize(List("dog", "salmon", "salmon", "rat", "elephant"), 3)
val b = a.keyBy(_.length)
val c = sc.parallelize(List("dog","cat","gnu","salmon","rabbit","turkey","wolf","bear","bee"), 3)
val d = c.keyBy(_.length)
b.join(d).collect

res0: Array[(Int, (String, String))] = Array((6,(salmon,salmon)), (6,(salmon,rabbit)), (6,(salmon,turkey)), (6,(salmon,salmon)), (6,(salmon,rabbit)), (6,(salmon,turkey)), (3,(dog,dog)), (3,(dog,cat)), (3,(dog,gnu)), (3,(dog,bee)), (3,(rat,dog)), (3,(rat,cat)), (3,(rat,gnu)), (3,(rat,bee)))

IV. RDD Actions
1. Simple methods
def foreach(f: T => Unit): Unit
def foreachPartition(f: Iterator[T] => Unit): Unit
def collect(): Array[T]
private[spark] def collectPartitions(): Array[Array[T]]
def collect[U: ClassTag](f: PartialFunction[T, U]): RDD[U]
def count(): Long
def take(num: Int): Array[T]
def first(): T 
def top(num: Int)(implicit ord: Ordering[T]): Array[T]    // by default returns the num largest elements
def takeOrdered(num: Int)(implicit ord: Ordering[T]): Array[T]    // by default returns the num smallest elements
def max()(implicit ord: Ordering[T]): T
def min()(implicit ord: Ordering[T]): T 
def isEmpty(): Boolean
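A few of these in action (a small sketch):
    val r = sc.parallelize(Seq(3, 1, 4, 1, 5, 9, 2, 6))
    println(r.count)                         // 8
    println(r.first)                         // 3
    println(r.take(3).mkString(","))         // 3,1,4
    println(r.top(3).mkString(","))          // 9,6,5  (largest elements)
    println(r.takeOrdered(3).mkString(","))  // 1,1,2  (smallest elements)
    println(r.max)                           // 9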
2. toLocalIterator
Returns an iterator over all elements of this RDD; the computation consumes roughly as much memory as the largest partition of the RDD. The RDD being consumed may be a ShuffledRDD, so to avoid recomputation it is best to cache it beforehand.
def toLocalIterator: Iterator[T]
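A short sketch; caching first avoids recomputing the RDD as the iterator pulls in each partition:
    val big = sc.parallelize(1 to 1000, 10).cache()
    val it = big.toLocalIterator       // partitions are brought to the driver one at a time
    it.take(3).foreach(println)        // prints 1 2 3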
3. reduce
Applies a binary operation to the elements of the RDD and returns the result.
def reduce(f: (T, T) => T): T
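Example:
    println(sc.parallelize(1 to 10).reduce(_ + _))  // 1 + 2 + ... + 10 = 55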
4. aggregate:
zeroValue: the initial value
seqOp: aggregates the data within each partition into a value of type U
combOp: merges the per-partition values, together with the initial value, into the final value of type U
fold:
fold is a convenience interface over aggregate: op serves as both seqOp and combOp, and the result type is the same as the RDD's element type T.
def aggregate[U: ClassTag](zeroValue: U)(seqOp: (U, T) => U, combOp: (U, U) => U): U
def fold(zeroValue: T)(op: (T, T) => T): T
Example: see how the initial value "x" is applied three times:
  - once for each partition
  - once more when the partition results are combined by the second (combOp) function
zeroValue first participates in the computation within each partition, and then again when the partitions are merged, which is why three "x"s appear below.
val z = sc.parallelize(List("a","b","c","d","e","f"),2)
    z.aggregate("x")(_ + _, _+_)
    res116: String = xxdefxabc
5. countByValue
Counts each distinct value in the RDD and returns the counts as a local Map[value, count].
def countByValue()(implicit ord: Ordering[T] = null): Map[T, Long]
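Example (a small sketch):
    val counts = sc.parallelize(Seq("a", "b", "a", "c", "a")).countByValue()
    println(counts)  // e.g. Map(a -> 3, b -> 1, c -> 1)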
6. Storage-related actions
def saveAsTextFile(path: String): Unit
def saveAsTextFile(path: String, codec: Class[_ <: CompressionCodec]): Unit
def saveAsNewAPIHadoopFile[F <: NewOutputFormat[K, V]](
    path: String)(implicit fm: ClassTag[F]): Unit
def saveAsNewAPIHadoopFile(
    path: String,
    keyClass: Class[_],
    valueClass: Class[_],
    outputFormatClass: Class[_ <: NewOutputFormat[_, _]],
    conf: Configuration = self.context.hadoopConfiguration): Unit
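A minimal sketch of saveAsTextFile (the output paths are placeholders and must not already exist):
    val nums = sc.parallelize(1 to 100, 2)
    nums.saveAsTextFile("file:///tmp/spark-output")  // writes one part-xxxxx file per partition
    nums.saveAsTextFile("file:///tmp/spark-output-gz",
      classOf[org.apache.hadoop.io.compress.GzipCodec])  // compressed variant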

For a much more detailed API walkthrough, see http://homepage.cs.latrobe.edu.au/zhe/ZhenHeSparkRDDAPIExamples.html. It is an excellent reference.