WordCount in Detail
Example code in spark-shell:
scala> val result = sc.textFile("file:///home/hadoop/data/wc.data").flatMap(_.split(" ")).map((_,1)).reduceByKey(_+_)
scala> result.toDebugString
res: String =
(2) ShuffledRDD[10] at reduceByKey at <console>:25 [Memory Deserialized 1x Replicated]
| CachedPartitions: 2; MemorySize: 416.0 B; ExternalBlockStoreSize: 0.0 B; DiskSize: 0.0 B
+-(2) MapPartitionsRDD[9] at map at <console>:25 [Memory Deserialized 1x Replicated]
| MapPartitionsRDD[8] at flatMap at <console>:25 [Memory Deserialized 1x Replicated]
| file:///home/hadoop/data/wc.data MapPartitionsRDD[1] at textFile at <console>:24 [Memory Deserialized 1x Replicated]
| file:///home/hadoop/data/wc.data HadoopRDD[0] at textFile at <console>:24 [Memory Deserialized 1x Replicated]
The toDebugString output above shows that the word count job creates five RDDs in total:
textFile: HadoopRDD and MapPartitionsRDD
flatMap: MapPartitionsRDD
map: MapPartitionsRDD
reduceByKey: ShuffledRDD
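The same lineage can also be checked programmatically. The following is a minimal sketch, assuming a local SparkContext and a small temporary file standing in for wc.data; the helper name lineage is hypothetical and not part of Spark. It walks rdd.dependencies recursively and collects the class name of every RDD in the chain.

```scala
import java.nio.file.Files
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

// In spark-shell this returns the existing context; elsewhere it starts a local one.
val sc = SparkContext.getOrCreate(
  new SparkConf().setMaster("local[2]").setAppName("lineage-sketch"))

// Assumption: a small temporary file is used in place of wc.data.
val input = Files.createTempFile("wc", ".data")
Files.write(input, "hello spark\nhello scala\n".getBytes("UTF-8"))

val result = sc.textFile(input.toUri.toString)
  .flatMap(_.split(" "))
  .map((_, 1))
  .reduceByKey(_ + _)

// Hypothetical helper: an RDD followed by all of its ancestors.
def lineage(rdd: RDD[_]): List[RDD[_]] =
  rdd :: rdd.dependencies.toList.flatMap(dep => lineage(dep.rdd))

val rddClasses = lineage(result).map(_.getClass.getSimpleName)
rddClasses.foreach(println)
```

For the word count above, the list should contain one ShuffledRDD, three MapPartitionsRDDs and one HadoopRDD, matching the five RDDs counted from toDebugString.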
The source of textFile is as follows:
def textFile(
path: String,
minPartitions: Int = defaultMinPartitions): RDD[String] = withScope {
assertNotStopped()
hadoopFile(path, classOf[TextInputFormat], classOf[LongWritable], classOf[Text],
minPartitions).map(pair => pair._2.toString).setName(path)
}
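As the source shows, textFile is built on hadoopFile, so it reads data the same way MapReduce does. It keeps only pair._2, the line content; pair._1 is the byte offset of the line within the file and is rarely needed. Here hadoopFile produces the HadoopRDD and map produces the MapPartitionsRDD.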
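To see the offsets that textFile discards, hadoopFile can be called directly with the same arguments. This is a sketch under assumptions: a local SparkContext, and a two-line temporary file whose name and content are made up for illustration.

```scala
import java.nio.file.Files
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapred.TextInputFormat
import org.apache.spark.{SparkConf, SparkContext}

// In spark-shell this returns the existing context; elsewhere it starts a local one.
val sc = SparkContext.getOrCreate(
  new SparkConf().setMaster("local[2]").setAppName("textfile-sketch"))

// Assumption: a tiny temporary input file with two lines.
val input = Files.createTempFile("offsets", ".txt")
Files.write(input, "a b\nc\n".getBytes("UTF-8"))

// Same call that textFile makes, but the key is kept instead of dropped.
// The Writables are converted immediately because Hadoop reuses them per record.
val offsetAndLine = sc
  .hadoopFile(input.toUri.toString, classOf[TextInputFormat],
    classOf[LongWritable], classOf[Text])
  .map(pair => (pair._1.get, pair._2.toString))
  .collect()

offsetAndLine.foreach(println)
```

For this file the pairs should be (0, "a b") and (4, "c"): the key is the byte position at which each line starts.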
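Dependencies
- Narrow dependency
Each partition of the parent RDD is used by at most one partition of the child RDD.
OneToOneDependency is an example. Narrow dependencies are all processed within a single stage.
- Wide dependency <= causes a shuffle and starts a new stage
A partition of the parent RDD is used by multiple partitions of the child RDD.
Question: why is shuffle called an expensive operation?
Answer: it involves disk I/O, data serialization, and network I/O. (Spark's shuffle data is also written to disk.)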
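The dependency type of an RDD can be inspected through rdd.dependencies. The sketch below assumes a local SparkContext and a made-up three-element input; it checks that map yields a OneToOneDependency (narrow) while reduceByKey yields a ShuffleDependency (wide).

```scala
import org.apache.spark.{OneToOneDependency, ShuffleDependency, SparkConf, SparkContext}

// In spark-shell this returns the existing context; elsewhere it starts a local one.
val sc = SparkContext.getOrCreate(
  new SparkConf().setMaster("local[2]").setAppName("dependency-sketch"))

// Assumption: a tiny in-memory data set split into two partitions.
val pairs = sc.parallelize(Seq("a", "b", "a"), 2).map((_, 1))
val counts = pairs.reduceByKey(_ + _)

// map keeps a one-to-one link between parent and child partitions.
val mapIsNarrow = pairs.dependencies.head.isInstanceOf[OneToOneDependency[_]]
// reduceByKey on an unpartitioned RDD has to redistribute records by key.
val reduceIsWide = counts.dependencies.head.isInstanceOf[ShuffleDependency[_, _, _]]

println(s"map narrow: $mapIsNarrow, reduceByKey wide: $reduceIsWide")
```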
Caching: cache and persist
cache calls persist under the hood with the storage level MEMORY_ONLY, so cache can be regarded as a special case of persist. The source is as follows:
/**
* Persist this RDD with the default storage level (`MEMORY_ONLY`).
*/
def cache(): this.type = persist()
…………
/**
* Persist this RDD with the default storage level (`MEMORY_ONLY`).
*/
def persist(): this.type = persist(StorageLevel.MEMORY_ONLY)
The storage levels are:
class StorageLevel private(
private var _useDisk: Boolean,
private var _useMemory: Boolean,
private var _useOffHeap: Boolean,
private var _deserialized: Boolean,
private var _replication: Int = 1)
…………
object StorageLevel {
val NONE = new StorageLevel(false, false, false, false)
val DISK_ONLY = new StorageLevel(true, false, false, false)
val DISK_ONLY_2 = new StorageLevel(true, false, false, false, 2)
val MEMORY_ONLY = new StorageLevel(false, true, false, true)
val MEMORY_ONLY_2 = new StorageLevel(false, true, false, true, 2)
val MEMORY_ONLY_SER = new StorageLevel(false, true, false, false)
val MEMORY_ONLY_SER_2 = new StorageLevel(false, true, false, false, 2)
val MEMORY_AND_DISK = new StorageLevel(true, true, false, true)
val MEMORY_AND_DISK_2 = new StorageLevel(true, true, false, true, 2)
val MEMORY_AND_DISK_SER = new StorageLevel(true, true, false, false)
val MEMORY_AND_DISK_SER_2 = new StorageLevel(true, true, false, false, 2)
val OFF_HEAP = new StorageLevel(true, true, true, false, 1)
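Notes:
cache and persist are both lazy: nothing is stored until the first action runs on the RDD. unpersist, in contrast, is eager and releases the storage as soon as it is called.
Caching is usually combined with Spark's Kryo serialization (which applies to the serialized storage levels such as MEMORY_ONLY_SER); otherwise the stored data can be much larger than the input data.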
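The lazy and eager behaviour can be observed with a counter. This is a sketch, assuming a local SparkContext; the function countComputations and the accumulator name are hypothetical. An accumulator counts how many elements are actually computed after cache(), after the first action, and after the second action.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

// In spark-shell this returns the existing context; elsewhere it starts a local one.
val sc = SparkContext.getOrCreate(
  new SparkConf().setMaster("local[2]").setAppName("cache-sketch"))

// Hypothetical helper: returns the number of computed elements at three points.
def countComputations(sc: SparkContext): (Long, Long, Long) = {
  val computed = sc.longAccumulator("computed")
  val rdd = sc.parallelize(1 to 100, 2).map { x => computed.add(1); x * 2 }
  rdd.cache()                               // lazy: only marks the RDD for caching
  val afterCache = computed.value.longValue
  rdd.count()                               // first action computes and stores the partitions
  val afterFirst = computed.value.longValue
  rdd.count()                               // second action reads the cached partitions
  val afterSecond = computed.value.longValue
  rdd.unpersist()                           // eager: the cached blocks are released
  (afterCache, afterFirst, afterSecond)
}

// cache() is persist() with MEMORY_ONLY; unpersist() resets the level to NONE.
val probe = sc.parallelize(1 to 10)
val levelAfterCache = probe.cache().getStorageLevel
val levelAfterUnpersist = probe.unpersist().getStorageLevel

println(countComputations(sc))
```

With enough memory for the cached partitions, countComputations should return (0, 100, 100): nothing runs at cache(), everything is computed once by the first action, and the second action recomputes nothing.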
groupByKey and reduceByKey
- Source underlying reduceByKey
@Experimental
def combineByKeyWithClassTag[C](
createCombiner: V => C,
mergeValue: (C, V) => C,
mergeCombiners: (C, C) => C,
partitioner: Partitioner,
mapSideCombine: Boolean = true,
serializer: Serializer = null)(implicit ct: ClassTag[C]): RDD[(K, C)]
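reduceByKey is implemented on top of combineByKeyWithClassTag, whose mapSideCombine parameter defaults to true, so by default values are pre-aggregated on the map side (a combiner) before the shuffle.
- Source underlying groupByKey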
* @note As currently implemented, groupByKey must be able to hold all the key-value pairs for any
* key in memory. If a key has too many values, it can result in an `OutOfMemoryError`.
*/
def groupByKey(partitioner: Partitioner): RDD[(K, Iterable[V])] = self.withScope {
// groupByKey shouldn't use map side combine because map side combine does not
// reduce the amount of data shuffled and requires all map side data be inserted
// into a hash table, leading to more objects in the old gen.
val createCombiner = (v: V) => CompactBuffer(v)
val mergeValue = (buf: CompactBuffer[V], v: V) => buf += v
val mergeCombiners = (c1: CompactBuffer[V], c2: CompactBuffer[V]) => c1 ++= c2
val bufs = combineByKeyWithClassTag[CompactBuffer[V]](
createCombiner, mergeValue, mergeCombiners, partitioner, mapSideCombine = false)
bufs.asInstanceOf[RDD[(K, Iterable[V])]]
}
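groupByKey also calls combineByKeyWithClassTag, but it explicitly passes mapSideCombine = false, so no map-side pre-aggregation takes place. In addition, as the comment notes, groupByKey shuffles every key-value pair and must hold all values of a key in memory, which can easily lead to an OutOfMemoryError.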
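The effect of map-side combine on the amount of shuffled data can be illustrated without Spark. The following is a plain-Scala model, not Spark's implementation: the two input partitions and all names are made up, and a "shuffled record" is simply a pair that leaves its map-side partition.

```scala
// Assumption: two map-side partitions of (word, 1) pairs.
val partitions: Seq[Seq[(String, Int)]] = Seq(
  Seq("a" -> 1, "b" -> 1, "a" -> 1),
  Seq("a" -> 1, "b" -> 1, "b" -> 1))

// reduceByKey style: pre-aggregate inside each partition, then shuffle the partial sums.
val combined: Seq[(String, Int)] = partitions.flatMap { part =>
  part.groupBy(_._1).map { case (k, kvs) => k -> kvs.map(_._2).sum }
}

// groupByKey style: no map-side combine, so every record is shuffled.
val uncombined: Seq[(String, Int)] = partitions.flatten

// Reduce side: merge whatever arrived for each key.
def reduceSide(records: Seq[(String, Int)]): Map[String, Int] =
  records.groupBy(_._1).map { case (k, kvs) => k -> kvs.map(_._2).sum }

val shuffledWithCombine = combined.size      // one record per key per partition
val shuffledWithoutCombine = uncombined.size // one record per input pair

println(s"with combine: $shuffledWithCombine, without: $shuffledWithoutCombine")
println(reduceSide(combined) == reduceSide(uncombined))
```

Both paths produce the same counts, but in this model the combined path moves 4 records instead of 6. The gap grows with the number of repeated keys per partition.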
repartition and coalesce
- repartition source
/**
* Return a new RDD that has exactly numPartitions partitions.
*
* Can increase or decrease the level of parallelism in this RDD. Internally, this uses
* a shuffle to redistribute data.
*
* If you are decreasing the number of partitions in this RDD, consider using `coalesce`,
* which can avoid performing a shuffle.
*
* TODO Fix the Shuffle+Repartition data loss issue described in SPARK-23207.
*/
def repartition(numPartitions: Int)(implicit ord: Ordering[T] = null): RDD[T] = withScope {
coalesce(numPartitions, shuffle = true)
}
repartition simply calls coalesce and explicitly passes shuffle = true, so repartition always performs a shuffle.
- coalesce source
/**
* Return a new RDD that is reduced into `numPartitions` partitions.
*
* This results in a narrow dependency, e.g. if you go from 1000 partitions
* to 100 partitions, there will not be a shuffle, instead each of the 100
* new partitions will claim 10 of the current partitions. If a larger number
* of partitions is requested, it will stay at the current number of partitions.
*
* However, if you're doing a drastic coalesce, e.g. to numPartitions = 1,
* this may result in your computation taking place on fewer nodes than
* you like (e.g. one node in the case of numPartitions = 1). To avoid this,
* you can pass shuffle = true. This will add a shuffle step, but means the
* current upstream partitions will be executed in parallel (per whatever
* the current partitioning is).
*
* @note With shuffle = true, you can actually coalesce to a larger number
* of partitions. This is useful if you have a small number of partitions,
* say 100, potentially with a few partitions being abnormally large. Calling
* coalesce(1000, shuffle = true) will result in 1000 partitions with the
* data distributed using a hash partitioner. The optional partition coalescer
* passed in must be serializable.
*/
def coalesce(numPartitions: Int, shuffle: Boolean = false,
partitionCoalescer: Option[PartitionCoalescer] = Option.empty)
(implicit ord: Ordering[T] = null)
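The shuffle parameter of coalesce defaults to false, so by default no shuffle takes place. Reducing the number of partitions with coalesce needs no shuffle; increasing it requires shuffle = true, otherwise the repartitioning has no effect and the partition count stays unchanged.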
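These rules can be checked by comparing partition counts. A minimal sketch, assuming a local SparkContext and an in-memory RDD with 4 partitions (the data and variable names are illustrative):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// In spark-shell this returns the existing context; elsewhere it starts a local one.
val sc = SparkContext.getOrCreate(
  new SparkConf().setMaster("local[2]").setAppName("coalesce-sketch"))

// Assumption: 100 integers spread over 4 partitions.
val rdd = sc.parallelize(1 to 100, 4)

val shrunk = rdd.coalesce(2).getNumPartitions                // fewer partitions, no shuffle
val notGrown = rdd.coalesce(8).getNumPartitions              // request ignored without shuffle
val grown = rdd.coalesce(8, shuffle = true).getNumPartitions // shuffle allows more partitions
val repartitioned = rdd.repartition(8).getNumPartitions      // same as the line above

println(s"$shrunk $notGrown $grown $repartitioned")
```

The expected counts are 2, 4, 8 and 8: only the calls that shuffle can raise the partition count above the original 4.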
cogroup and join
/**
* Return an RDD containing all pairs of elements with matching keys in `this` and `other`. Each
* pair of elements will be returned as a (k, (v1, v2)) tuple, where (k, v1) is in `this` and
* (k, v2) is in `other`. Uses the given Partitioner to partition the output RDD.
*/
def join[W](other: RDD[(K, W)], partitioner: Partitioner): RDD[(K, (V, W))] = self.withScope {
this.cogroup(other, partitioner).flatMapValues( pair =>
for (v <- pair._1.iterator; w <- pair._2.iterator) yield (v, w)
)
}
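join is implemented on top of cogroup. cogroup merges two RDDs into a new RDD in which each key maps to two Iterables: the first holds the values for that key from RDD1 and the second holds the values for that key from RDD2. The operation repartitions the data with the partitioner, so it normally requires a shuffle. (When both RDDs taking part in the join are already partitioned by the same partitioner, the dependency is narrow rather than wide: the corresponding partitions of the two RDDs are joined directly and no shuffle is needed.)
join combines the contents of two collections by key, whereas cogroup first gathers the values of each key in each RDD into an Iterable. For example:
rdd1 is: val rdd1 = sc.parallelize(List((1,"A"),(2,"B")))
rdd2 is: val rdd2 = sc.parallelize(List((1,10),(1,20),(2,30)))
rdd1.join(rdd2) gives: (2,(B,30)), (1,(A,10)), (1,(A,20))
rdd1.cogroup(rdd2) gives: (2,(CompactBuffer(B),CompactBuffer(30))), (1,(CompactBuffer(A),CompactBuffer(10, 20)))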
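The relationship between the two operators can be reproduced by hand. The sketch below assumes a local SparkContext and reuses the rdd1 and rdd2 data from the example; it rebuilds join from cogroup in the same shape as the join source and compares the sorted results. The names viaCogroup, joined and rebuilt are illustrative.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// In spark-shell this returns the existing context; elsewhere it starts a local one.
val sc = SparkContext.getOrCreate(
  new SparkConf().setMaster("local[2]").setAppName("join-sketch"))

val rdd1 = sc.parallelize(List((1, "A"), (2, "B")))
val rdd2 = sc.parallelize(List((1, 10), (1, 20), (2, 30)))

// Same shape as the join source: cogroup, then pair every left value
// with every right value that shares the key.
val viaCogroup = rdd1.cogroup(rdd2).flatMapValues { case (vs, ws) =>
  for (v <- vs.iterator; w <- ws.iterator) yield (v, w)
}

// Sort both results because the output order of a shuffle is not fixed.
val joined = rdd1.join(rdd2).collect().sorted.toList
val rebuilt = viaCogroup.collect().sorted.toList

println(joined)
println(joined == rebuilt)
```

A key that is missing from either side yields an empty Iterable in cogroup, so the for comprehension emits nothing for it. That is what makes this an inner join.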
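Tips
- To inspect and debug an RDD, use: rdd.toDebugString
- To get the number of partitions of an RDD:
rdd.partitions.size
- To see the elements of each partition of an RDD, there are two ways:
1) rdd.glom.collect
2) rdd.mapPartitionsWithIndex((index,partition) => {
partition.map(x=>s"partition $index, element $x")
}).collect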