- Accumulator
import org.apache.spark.util.LongAccumulator
val blankLines = new LongAccumulator
sc.register(blankLines, "blankLines")
Only use an accumulator inside a transformation for debugging: because of speculative execution and task retries, the same update can be applied more than once, so the value is not guaranteed to be accurate. For updates made inside an action, Spark applies each task's update exactly once, so there the accumulator is accurate.
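A minimal sketch of that point, continuing from the blankLines accumulator registered above; the SparkContext sc and the input/output paths are placeholders:
val lines = sc.textFile("input.txt")            // placeholder path
val callSigns = lines.flatMap { line =>
  if (line.isEmpty) blankLines.add(1)           // update inside a transformation: may double-count on retries/speculation
  line.split(" ")
}
callSigns.saveAsTextFile("output")              // blankLines.value is only meaningful after an action has run
println(s"Blank lines: ${blankLines.value}")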
- Broadcast (read only)
val signPrefixes = sc.broadcast(loadCallSignTable())
The broadcast value is sent to each worker node only once.
Keep broadcast variables immutable: if you mutate a broadcast value, the change is visible only on the local worker node and the other worker nodes are not affected.
Choose an appropriate serializer for large broadcast variables (e.g. Kryo).
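A minimal sketch of how the broadcast value is used on the workers, continuing from signPrefixes above and assuming loadCallSignTable() returns a Map[String, String] from call-sign prefix to country (the helper itself is not shown; the 2-character prefix rule and the sample data are only for the sketch):
val callSigns = sc.parallelize(Seq("W6BB", "VE3ABC", "K2XYZ"))   // sample data
val countries = callSigns.map { sign =>
  // each task reads the worker-local, read-only copy; do not mutate signPrefixes.value
  signPrefixes.value.getOrElse(sign.take(2), "Unknown")
}
// for large broadcast values a faster serializer helps, e.g. in the SparkConf:
// .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")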
- per-partition basis
mapPartitions() | Iterator of the elements in that partition | Iterator of our return elements | f: (Iterator[T]) → Iterator[U]
mapPartitionsWithIndex() | Integer of partition number, and Iterator of the elements in that partition | Iterator of our return elements | f: (Int, Iterator[T]) → Iterator[U]
foreachPartition() | Iterator of the elements | Nothing | f: (Iterator[T]) → Unit
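A minimal sketch of mapPartitions(): compute one (sum, count) pair per partition instead of one per element, then combine the pairs to get the mean:
val nums = sc.parallelize(1 to 100, 4)
val (sum, count) = nums.mapPartitions { iter =>
  var s = 0L; var c = 0L
  iter.foreach { x => s += x; c += 1 }
  Iterator((s, c))                                             // exactly one pair per partition
}.reduce { case ((s1, c1), (s2, c2)) => (s1 + s2, c1 + c2) }
println(sum.toDouble / count)                                  // 50.5
foreachPartition() is typically used the same way to set up one expensive resource (e.g. a database connection) per partition instead of per element.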
- pipe() to an external program (the program reads each element from standard input and writes its output lines to standard output)
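A minimal sketch of pipe(); the script name is a placeholder and the program must exist on every worker (or be shipped with sc.addFile):
val data = sc.parallelize(Seq("37.75889,-122.42683,37.7749,-122.4194"))
// each element is written to the program's stdin as one line;
// each line the program prints to stdout becomes one element of the result RDD
val piped = data.pipe("./finddistance.R")       // placeholder external program
piped.collect().foreach(println)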
- StatCounter on numeric RDD
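A minimal sketch: calling stats() on a numeric RDD returns a StatCounter with count, mean, stdev, max and min computed in a single pass:
val nums = sc.parallelize(Seq(1.0, 2.0, 3.0, 4.0))
val stats = nums.stats()          // org.apache.spark.util.StatCounter
println(stats.count)              // 4
println(stats.mean)               // 2.5
println(stats.stdev)
println(stats.max)                // 4.0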
- number of partitions after a transformation
- general
filter(), map(), flatMap(), distinct() | same as the parent RDD |
rdd.union(otherRDD) | rdd.partitions.size + otherRDD.partitions.size |
rdd.intersection(otherRDD) | max(rdd.partitions.size, otherRDD.partitions.size) |
rdd.subtract(otherRDD) | rdd.partitions.size |
rdd.cartesian(otherRDD) | rdd.partitions.size * otherRDD.partitions.size |
- pair
reduceByKey(), foldByKey(), combineByKey(), groupByKey() | same as the parent RDD |
sortByKey() | same as above |
mapValues(), flatMapValues() | same as above |
cogroup(), join(), leftOuterJoin(), rightOuterJoin() | sort all parent RDDs by their number of partitions in descending order and, starting with the RDD that has the most partitions, look for an existing partitioner: if one is found, it determines the number of partitions; if no parent RDD has a partitioner, the number comes from spark.default.parallelism; if that is not set either, the result has as many partitions as the parent RDD with the most partitions |
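A minimal sketch to check the rules above by inspecting partitions.size (the comments assume no parent has a partitioner and spark.default.parallelism is not explicitly set):
val a = sc.parallelize(1 to 10, 3)
val b = sc.parallelize(1 to 10, 5)
println(a.map(_ + 1).partitions.size)          // 3 (same as the parent)
println(a.union(b).partitions.size)            // 3 + 5 = 8
println(a.intersection(b).partitions.size)     // max(3, 5) = 5
println(a.subtract(b).partitions.size)         // 3 (same as a)
println(a.cartesian(b).partitions.size)        // 3 * 5 = 15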