1. RDD Overview
/**
 * A Resilient Distributed Dataset (RDD), the basic abstraction in Spark. Represents an immutable,
 * partitioned collection of elements that can be operated on in parallel. This class contains the
 * basic operations available on all RDDs, such as `map`, `filter`, and `persist`. In addition,
 * [[org.apache.spark.rdd.PairRDDFunctions]] contains operations available only on RDDs of key-value
 * pairs, such as `groupByKey` and `join`;
 *
 * Internally, each RDD is characterized by five main properties:
 *
 * - A list of partitions (an RDD is made up of a set of partitions)
 * - A function for computing each split (one function applied to each partition)
 * - A list of dependencies on other RDDs (each new RDD depends on its predecessors: R1 ----> R2 ----> R3)
 * - Optionally, a Partitioner for key-value RDDs (e.g. to say that the RDD is hash-partitioned)
 *   (a partitioner can be applied to key-value RDDs)
 * - Optionally, a list of preferred locations to compute each split on (e.g. block locations for
 *   an HDFS file)
 *   (tasks are scheduled to the locations closest to the data)
 */
1. An RDD is made up of a set of partitions:
protected def getPartitions: Array[Partition]
2. One function is applied to each partition:
def compute(split: Partition, context: TaskContext): Iterator[T]
3. Each new RDD depends on its predecessors (R1 ----> R2 ----> R3):
protected def getDependencies: Seq[Dependency[_]] = deps
4. A partitioner can be applied to key-value RDDs:
val partitioner: Option[Partitioner] = None
5. Tasks are scheduled to the locations closest to the data:
protected def getPreferredLocations(split: Partition): Seq[String] = Nil // Nil is the empty List:
case object Nil extends List[Nothing]
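To make the five properties concrete, here is a minimal sketch of a custom RDD. The names RangeRDD and RangePartition are invented for illustration; only getPartitions and compute must be overridden, while the other three properties keep their defaults:

import org.apache.spark.{Partition, SparkContext, TaskContext}
import org.apache.spark.rdd.RDD

// Hypothetical partition type: just carries its index
case class RangePartition(index: Int) extends Partition

// Hypothetical RDD producing 0 until n, spread round-robin over numSlices partitions
class RangeRDD(sc: SparkContext, n: Int, numSlices: Int)
  extends RDD[Int](sc, Nil) { // Nil: no parent RDDs, hence no dependencies

  // Property 1: a list of partitions
  override protected def getPartitions: Array[Partition] =
    (0 until numSlices).map(i => RangePartition(i): Partition).toArray

  // Property 2: a function for computing each split
  override def compute(split: Partition, context: TaskContext): Iterator[Int] =
    (split.index until n by numSlices).iterator

  // Properties 3-5 keep the defaults shown above:
  // getDependencies = deps (Nil here), partitioner = None, getPreferredLocations = Nil
}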
2. Creating RDDs
There are four ways to create an RDD in Spark; the two most common, from a collection and from external storage, are covered below:
2.1 Creating an RDD from a collection (in memory)
There are two methods for creating an RDD from a collection: parallelize and makeRDD.
- parallelize
val rdd1 = sc.parallelize(List(1,2,3,4))
rdd1.collect().foreach(println)
- makeRDD
val rdd2 = sc.makeRDD(List(1,2,3,4))
rdd2.collect().foreach(println)
Looking at the underlying implementation, makeRDD simply calls parallelize:
def makeRDD[T: ClassTag](
    seq: Seq[T],
    numSlices: Int = defaultParallelism): RDD[T] = withScope {
  parallelize(seq, numSlices)
}
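The numSlices argument controls how many partitions the resulting RDD gets; when omitted it falls back to defaultParallelism. A quick sketch:

val rdd3 = sc.makeRDD(List(1, 2, 3, 4), numSlices = 2)
println(rdd3.getNumPartitions) // 2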
2.2 Creating an RDD from external storage (files)
Sources include the local file system and HDFS:
sc.textFile("hdfs://ifeng:9000/hdfsapi/wc.txt")
  .flatMap(_.split(","))
  .map((_, 1))
  .reduceByKey(_ + _)
  .collect()
3. RDD Operations
RDD operations fall into two categories:
- transformations
- actions
RDD Operations
RDDs support two types of operations: transformations, which create a new dataset from an existing one, and actions, which return a value to the driver program after running a computation on the dataset. For example, map is a transformation that passes each dataset element through a function and returns a new RDD representing the results. On the other hand, reduce is an action that aggregates all the elements of the RDD using some function and returns the final result to the driver program (although there is also a parallel reduceByKey that returns a distributed dataset).
All transformations in Spark are lazy, in that they do not compute their results right away. Instead, they just remember the transformations applied to some base dataset (e.g. a file). The transformations are only computed when an action requires a result to be returned to the driver program. This design enables Spark to run more efficiently. For example, we can realize that a dataset created through map will be used in a reduce and return only the result of the reduce to the driver, rather than the larger mapped dataset.
By default, each transformed RDD may be recomputed each time you run an action on it. However, you may also persist an RDD in memory using the persist (or cache) method, in which case Spark will keep the elements around on the cluster for much faster access the next time you query it. There is also support for persisting RDDs on disk, or replicated across multiple nodes.
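A small sketch of this laziness in action: the map below is only recorded; its function does not run (and, in local mode, nothing is printed) until collect is called:

val nums = sc.parallelize(List(1, 2, 3))
val doubled = nums.map { x => println(s"computing $x"); x * 2 } // recorded, not executed
doubled.collect() // only now does "computing ..." print, and Array(2, 4, 6) is returned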
1 map
Element-wise mapping transformation; it can change the element type or just the value.
val RDD1 = sc.parallelize(List(1, 2, 3, 4, 5))
RDD1.map(_ * 2).foreach(println) // 2 4 6 8 10 (in some order)
2 mapPartitions
map applies its function to each element (fun -> elem), whereas mapPartitions applies it to a whole partition (fun -> partition), i.e. it operates one partition at a time.
val RDD2 = sc.parallelize(List(1, 2, 3, 4, 5),2)
RDD2.mapPartitions(_.map(_*2)).foreach(println)
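mapPartitions pays off when there is per-partition setup cost to amortize. In this sketch, createConnection and lookup are hypothetical stand-ins for an expensive resource that should be opened once per partition rather than once per element:

// Hypothetical stand-ins for an expensive per-partition resource
def createConnection(): java.sql.Connection = ??? // e.g. DriverManager.getConnection(...)
def lookup(conn: java.sql.Connection, x: Int): Int = ???

val results = RDD2.mapPartitions { iter =>
  val conn = createConnection()                    // opened once per partition, not per element
  val out = iter.map(x => lookup(conn, x)).toList  // materialize before closing conn
  conn.close()
  out.iterator
}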
3 mapPartitionsWithIndex
One partition corresponds to one task; the index tells you which partition the current task is processing.
RDD2.mapPartitionsWithIndex((index, partition) => {
  println("processing one partition")
  partition.map(x => s"partition: $index, element: $x")
}).foreach(println)
4 filter
/**
* Return a new RDD containing only the elements that satisfy a predicate.
*/
def filter(f: T => Boolean): RDD[T] = withScope {
  val cleanF = sc.clean(f)
  new MapPartitionsRDD[T, T](
    this,
    (context, pid, iter) => iter.filter(cleanF),
    preservesPartitioning = true)
}
RDD1.filter(_ > 2).foreach(println)
RDD1.filter(_ % 2 == 0).filter(_ > 2).foreach(println)
RDD1.filter(x => x % 2 == 0 && x > 2).foreach(println)
5 glom
Coalesces each partition's elements into a single Array.
/**
* Return an RDD created by coalescing all elements within each partition into an array.
*/
def glom(): RDD[Array[T]] = withScope {
  new MapPartitionsRDD[Array[T], T](this, (context, pid, iter) => Iterator(iter.toArray))
}
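A usage sketch: with three partitions, glom returns one array per partition, which makes the partitioning visible:

sc.parallelize(1 to 9, 3).glom().collect()
// Array(Array(1, 2, 3), Array(4, 5, 6), Array(7, 8, 9))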
6 sample
Takes a random sample of the RDD.
/**
* Return a sampled subset of this RDD.
*
* @param withReplacement can elements be sampled multiple times (replaced when sampled out)
* @param fraction expected size of the sample as a fraction of this RDD's size
* without replacement: probability that each element is chosen; fraction must be [0, 1]
* with replacement: expected number of times each element is chosen; fraction must be greater
* than or equal to 0
* @param seed seed for the random number generator
*
* @note This is NOT guaranteed to provide exactly the fraction of the count
* of the given [[RDD]].
*/
def sample(
    withReplacement: Boolean,
    fraction: Double,
    seed: Long = Utils.random.nextLong): RDD[T] = {
  require(fraction >= 0,
    s"Fraction must be nonnegative, but got ${fraction}")
  ... // rest of the implementation omitted
}
Usage:
sc.parallelize(1 to 20).sample(true, 0.5).foreach(println)
7 zip
Zips two RDDs together pairwise, one element to one.
/**
* Zips this RDD with another one, returning key-value pairs with the first element in each RDD,
* second element in each RDD, etc. Assumes that the two RDDs have the *same number of
* partitions* and the *same number of elements in each partition* (e.g. one was made through
* a map on the other).
*/
def zip[U: ClassTag](other: RDD[U]): RDD[(T, U)] = withScope {
  zipPartitions(other, preservesPartitioning = false) { (thisIter, otherIter) =>
    new Iterator[(T, U)] {
      def hasNext: Boolean = (thisIter.hasNext, otherIter.hasNext) match {
        case (true, true) => true
        case (false, false) => false
        case _ => throw new SparkException("Can only zip RDDs with " +
          "same number of elements in each partition")
      }
      def next(): (T, U) = (thisIter.next(), otherIter.next())
    }
  }
}
There is also zipWithIndex, which zips the elements of the RDD with their indices (as Longs).
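A usage sketch of both; remember that zip requires the same number of partitions and the same number of elements per partition:

val a = sc.parallelize(List("a", "b", "c"), 2)
val b = sc.parallelize(List(1, 2, 3), 2)
a.zip(b).collect()         // Array((a,1), (b,2), (c,3))
a.zipWithIndex().collect() // Array((a,0), (b,1), (c,2))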
8 flatMap
/**
* Return a new RDD by first applying a function to all elements of this
* RDD, and then flattening the results.
*/
def flatMap[U: ClassTag](f: T => TraversableOnce[U]): RDD[U] = withScope {
  val cleanF = sc.clean(f)
  new MapPartitionsRDD[U, T](this, (context, pid, iter) => iter.flatMap(cleanF))
}
The difference between map and flatMap
flatMap is simply map followed by an extra flatten step.
map turns an RDD of length N into another RDD of length N, whereas flatMap maps each of the N elements to a collection and then merges those collections into a single result RDD.
For example, take a data file "README.md" with three lines, the second of which is empty:
a b c
(empty line)
d
经过以下转换过程
val textFile = sc.textFile("README.md")
textFile.flatMap(_.split(" "))
其实就是经历了以下转换
["a b c", "", "d"] => [["a","b","c"],[],["d"]] => ["a","b","c","d"]
In this example, flatMap converts an RDD of lines, ["a b c", "", "d"], into a collection of words. What flatMap adds over map is exactly the [["a","b","c"],[],["d"]] => ["a","b","c","d"] flattening step.
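The same contrast as a self-contained sketch (using parallelize instead of a file; note that in practice splitting the empty line yields one empty string rather than an empty array):

val lines = sc.parallelize(List("a b c", "", "d"))
lines.map(_.split(" ")).collect()     // Array(Array(a, b, c), Array(""), Array(d)): still nested
lines.flatMap(_.split(" ")).collect() // Array(a, b, c, "", d): flattened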
9 flatMapValues
/**
* Pass each value in the key-value pair RDD through a flatMap function without changing the
* keys; this also retains the original RDD's partitioning.
*/
def flatMapValues[U](f: V => TraversableOnce[U]): RDD[(K, U)] = self.withScope {
  val cleanF = self.context.clean(f)
  new MapPartitionsRDD[(K, U), (K, V)](self,
    (context, pid, iter) => iter.flatMap { case (k, v) =>
      cleanF(v).map(x => (k, x))
    },
    preservesPartitioning = true)
}
10 mapValues
/**
* Pass each value in the key-value pair RDD through a map function without changing the keys;
* this also retains the original RDD's partitioning.
*/
def mapValues[U](f: V => U): RDD[(K, U)] = self.withScope {
  val cleanF = self.context.clean(f)
  new MapPartitionsRDD[(K, U), (K, V)](self,
    (context, pid, iter) => iter.map { case (k, v) => (k, cleanF(v)) },
    preservesPartitioning = true)
}
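A usage sketch for both: the keys pass through unchanged, only the values are transformed (and additionally flattened by flatMapValues):

val kv = sc.parallelize(List(("fruit", "apple,pear"), ("veg", "carrot")))
kv.mapValues(_.length).collect()         // Array((fruit,10), (veg,6))
kv.flatMapValues(_.split(",")).collect() // Array((fruit,apple), (fruit,pear), (veg,carrot))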
join
/**
* Return an RDD containing all pairs of elements with matching keys in `this` and `other`. Each
* pair of elements will be returned as a (k, (v1, v2)) tuple, where (k, v1) is in `this` and
* (k, v2) is in `other`. Uses the given Partitioner to partition the output RDD.
*/
def join[W](other: RDD[(K, W)], partitioner: Partitioner): RDD[(K, (V, W))] = self.withScope {
  this.cogroup(other, partitioner).flatMapValues( pair =>
    for (v <- pair._1.iterator; w <- pair._2.iterator) yield (v, w)
  )
}
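A usage sketch (using the overload that takes only the other RDD and the default partitioner):

val left  = sc.parallelize(List((1, "a"), (2, "b"), (3, "c")))
val right = sc.parallelize(List((1, "x"), (2, "y"), (2, "z")))
left.join(right).collect() // Array((1,(a,x)), (2,(b,y)), (2,(b,z))): only matching keys survive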
cogroup
join is itself implemented on top of cogroup.
cogroup returns the grouped values as CompactBuffers (an Iterable implementation).
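A usage sketch: cogroup lines up the values for each key from both RDDs, including keys present in only one of them:

val l = sc.parallelize(List((1, "a"), (2, "b")))
val r = sc.parallelize(List((2, "y"), (3, "z")))
l.cogroup(r).collect()
// Array((1,(CompactBuffer(a),CompactBuffer())),
//       (2,(CompactBuffer(b),CompactBuffer(y))),
//       (3,(CompactBuffer(),CompactBuffer(z))))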
union
The resulting partition count is the sum of the two inputs' partition counts.
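A quick check of that partition arithmetic:

val u1 = sc.parallelize(1 to 4, 2)
val u2 = sc.parallelize(5 to 8, 3)
println(u1.union(u2).getNumPartitions) // 5 = 2 + 3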
groupBy
/**
* Return an RDD of grouped items. Each group consists of a key and a sequence of elements
* mapping to that key. The ordering of elements within each group is not guaranteed, and
* may even differ each time the resulting RDD is evaluated.
*
* @note This operation may be very expensive. If you are grouping in order to perform an
* aggregation (such as a sum or average) over each key, using `PairRDDFunctions.aggregateByKey`
* or `PairRDDFunctions.reduceByKey` will provide much better performance.
*/
def groupBy[K](f: T => K, p: Partitioner)(implicit kt: ClassTag[K], ord: Ordering[K] = null)
    : RDD[(K, Iterable[T])] = withScope {
  val cleanF = sc.clean(f)
  this.map(t => (cleanF(t), t)).groupByKey(p)
}
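Usage sketch: group numbers by parity; each group is a key plus an Iterable of the elements mapping to it:

sc.parallelize(1 to 6).groupBy(_ % 2).collect()
// Array((0,CompactBuffer(2, 4, 6)), (1,CompactBuffer(1, 3, 5)))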
groupByKey
A special case of groupBy for key-value RDDs: it groups solely by the existing key.
wordCount
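A word-count sketch tying these operators together, showing the groupByKey variant next to the reduceByKey version from section 2.2 (as the groupBy scaladoc above notes, reduceByKey performs much better for aggregations because it combines locally before the shuffle):

val words = sc.parallelize(List("a b c", "a b", "a"))
  .flatMap(_.split(" "))
  .map((_, 1))

words.groupByKey().mapValues(_.sum).collect() // Array((a,3), (b,2), (c,1))
words.reduceByKey(_ + _).collect()            // Array((a,3), (b,2), (c,1))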