- Accumulator
import org.apache.spark.util.LongAccumulator
val blankLines = new LongAccumulator
sc.register(blankLines, "blankLines")
Only use an accumulator inside a transformation for debugging: because of speculative execution and task retries, the same update can be applied more than once, so the value is not guaranteed to be accurate. For updates made inside an action, Spark applies each task's update exactly once, so there the accumulator is accurate.
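A minimal sketch of that point, continuing from the blankLines accumulator registered above; the SparkContext sc and the input/output paths are placeholders:
val lines = sc.textFile("input.txt")            // placeholder path
val callSigns = lines.flatMap { line =>
  if (line.isEmpty) blankLines.add(1)           // update inside a transformation: may double-count on retries/speculation
  line.split(" ")
}
callSigns.saveAsTextFile("output")              // blankLines.value is only meaningful after an action has run
println(s"Blank lines: ${blankLines.value}")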
- Broadcast (read only)
val signPrefixes = sc.broadcast(loadCallSignTable())
The broadcast value is sent to each worker node only once.
Keep broadcast variables immutable: if you mutate a broadcast value, the change is visible only on the local worker node and the other worker nodes are not affected.
Choose an appropriate serializer for large broadcast variables (e.g. Kryo).
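A minimal sketch of how the broadcast value is used on the workers, continuing from signPrefixes above and assuming loadCallSignTable() returns a Map[String, String] from call-sign prefix to country (the helper itself is not shown; the 2-character prefix rule and the sample data are only for the sketch):
val callSigns = sc.parallelize(Seq("W6BB", "VE3ABC", "K2XYZ"))   // sample data
val countries = callSigns.map { sign =>
  // each task reads the worker-local, read-only copy; do not mutate signPrefixes.value
  signPrefixes.value.getOrElse(sign.take(2), "Unknown")
}
// for large broadcast values a faster serializer helps, e.g. in the SparkConf:
// .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")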
- per-partition basis
mapPartitions() | Iterator of the elements in that partition | Iterator of our return elements | f: (Iterator[T]) → Iterator[U]
mapPartitionsWithIndex() | Integer of partition number, and Iterator of the elements in that partition | Iterator of our return elements | f: (Int, Iterator[T]) → Iterator[U]
foreachPartition() | Iterator of the elements | Nothing | f: (Iterator[T]) → Unit
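A minimal sketch of mapPartitions(): compute one (sum, count) pair per partition instead of one per element, then combine the pairs to get the mean:
val nums = sc.parallelize(1 to 100, 4)
val (sum, count) = nums.mapPartitions { iter =>
  var s = 0L; var c = 0L
  iter.foreach { x => s += x; c += 1 }
  Iterator((s, c))                                             // exactly one pair per partition
}.reduce { case ((s1, c1), (s2, c2)) => (s1 + s2, c1 + c2) }
println(sum.toDouble / count)                                  // 50.5
foreachPartition() is typically used the same way to set up one expensive resource (e.g. a database connection) per partition instead of per element.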
- pipe() to an external program (the program reads each element from standard input and writes its output lines to standard output)
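A minimal sketch of pipe(); the script name is a placeholder and the program must exist on every worker (or be shipped with sc.addFile):
val data = sc.parallelize(Seq("37.75889,-122.42683,37.7749,-122.4194"))
// each element is written to the program's stdin as one line;
// each line the program prints to stdout becomes one element of the result RDD
val piped = data.pipe("./finddistance.R")       // placeholder external program
piped.collect().foreach(println)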
- StatCounter on numeric RDD
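A minimal sketch: calling stats() on a numeric RDD returns a StatCounter with count, mean, stdev, max and min computed in a single pass:
val nums = sc.parallelize(Seq(1.0, 2.0, 3.0, 4.0))
val stats = nums.stats()          // org.apache.spark.util.StatCounter
println(stats.count)              // 4
println(stats.mean)               // 2.5
println(stats.stdev)
println(stats.max)                // 4.0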
- number of partitions after a transformation
- general
filter(), map(), flatMap(), distinct() | same as the parent RDD |
rdd.union(otherRDD) | rdd.partitions.size + otherRDD.partitions.size |
rdd.intersection(otherRDD) | max(rdd.partitions.size, otherRDD.partitions.size) |
rdd.subtract(otherRDD) | rdd.partitions.size |
rdd.cartesian(otherRDD) | rdd.partitions.size * otherRDD.partitions.size |
- pair
reduceByKey(), foldByKey(), combineByKey(), groupByKey() | same as the parent RDD |
sortByKey() | same as above |
mapValues(), flatMapValues() | same as above |
cogroup(), join(), leftOuterJoin(), rightOuterJoin() | sort all parent RDDs by their number of partitions in descending order and, starting with the RDD that has the most partitions, look for an existing partitioner: if one is found, it determines the number of partitions; if no parent RDD has a partitioner, the number comes from spark.default.parallelism; if that is not set either, the result has as many partitions as the parent RDD with the most partitions |
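A minimal sketch to check the rules above by inspecting partitions.size (the comments assume no parent has a partitioner and spark.default.parallelism is not explicitly set):
val a = sc.parallelize(1 to 10, 3)
val b = sc.parallelize(1 to 10, 5)
println(a.map(_ + 1).partitions.size)          // 3 (same as the parent)
println(a.union(b).partitions.size)            // 3 + 5 = 8
println(a.intersection(b).partitions.size)     // max(3, 5) = 5
println(a.subtract(b).partitions.size)         // 3 (same as a)
println(a.cartesian(b).partitions.size)        // 3 * 5 = 15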