Spark - Advanced Spark Programming

- Accumulator

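import org.apache.spark.util.LongAccumulator

val blankLines = new LongAccumulator
sc.register(blankLines, "blankLines")  // an accumulator must be registered before it is used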

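Update an accumulator inside a transformation only for debugging purposes: a speculative, retried, or recomputed task can apply the same update more than once, so the value is not accurate. Inside an action, each task's update is applied exactly once, so the accumulator is accurate.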
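To make the action-side rule concrete, here is a minimal sketch that counts blank lines inside foreach(), which is an action. The object name BlankLineCount, the helper countBlankLines, and the local[2] master are assumptions for illustration, not part of the original notes.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object BlankLineCount {
  // Counts empty lines with an accumulator that is updated inside an action.
  def countBlankLines(sc: SparkContext, lines: Seq[String]): Long = {
    // longAccumulator creates and registers the accumulator in one step.
    val blankLines = sc.longAccumulator("blankLines")
    // foreach() is an action, so each task's updates are applied exactly once.
    sc.parallelize(lines).foreach { line =>
      if (line.isEmpty) blankLines.add(1L)
    }
    // The value should only be read on the driver.
    blankLines.value
  }

  def main(args: Array[String]): Unit = {
    // Local master chosen only so the sketch runs on one machine.
    val conf = new SparkConf().setAppName("BlankLineCount").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try println(countBlankLines(sc, Seq("a", "", "b", "")))
    finally sc.stop()
  }
}
```

If the same add() call were moved into a map() that is later recomputed, the total could be higher than the real number of blank lines.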

- Broadcast (read only)

val signPrefixes = sc.broadcast(loadCallSignTable())

A broadcast value is sent to each worker node only once.

Try to keep a broadcast variable immutable. If the value is mutable, an update changes only the copy on the local worker node; the other worker nodes are not affected.

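Choose the right serializer for broadcast variables (for example Kryo, set through spark.serializer), because a large value has to be serialized before it is shipped to the worker nodes.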
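The broadcast notes above can be sketched as follows. The lookup table prefixTable is a hypothetical stand-in for loadCallSignTable(), and the object and method names are assumptions; the Kryo setting in main is optional and only shows where a serializer would be chosen.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object BroadcastLookup {
  // Hypothetical stand-in for loadCallSignTable(): call-sign prefix -> country.
  val prefixTable: Map[String, String] = Map("K" -> "USA", "G" -> "UK")

  def countries(sc: SparkContext, callSigns: Seq[String]): Seq[String] = {
    // An immutable Map is shipped to each worker once and is only read there.
    val signPrefixes = sc.broadcast(prefixTable)
    sc.parallelize(callSigns)
      .map(sign => signPrefixes.value.getOrElse(sign.take(1), "unknown"))
      .collect()
      .toSeq
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("BroadcastLookup")
      .setMaster("local[2]") // local master, for illustration only
      // Optional: use Kryo instead of the default Java serialization.
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    val sc = new SparkContext(conf)
    try println(countries(sc, Seq("K1ABC", "G4XYZ", "ZZ9")))
    finally sc.stop()
  }
}
```

Using an immutable Map avoids the problem described above: there is no way for one worker to change its copy and silently diverge from the others.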


- per-partition basis 

Function | Called with | Returns | Function signature on RDD[T]
mapPartitions() | Iterator of the elements in that partition | Iterator of our return elements | f: (Iterator[T]) → Iterator[U]
mapPartitionsWithIndex() | Integer of partition number, and Iterator of the elements in that partition | Iterator of our return elements | f: (Int, Iterator[T]) → Iterator[U]
foreachPartition() | Iterator of the elements in that partition | Nothing | f: (Iterator[T]) → Unit
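A short sketch of the three per-partition operators listed above may help. The object name PerPartition and its helper methods are assumptions for illustration; the data is split with an explicit partition count so the per-partition behaviour is visible.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object PerPartition {
  // mapPartitions: the function runs once per partition, not once per element.
  def partitionSums(sc: SparkContext, data: Seq[Int], numPartitions: Int): Seq[Int] =
    sc.parallelize(data, numPartitions)
      .mapPartitions(iter => Iterator(iter.sum))
      .collect()
      .toSeq

  // mapPartitionsWithIndex: tag every element with its partition number.
  def withPartitionIndex(sc: SparkContext, data: Seq[Int], numPartitions: Int): Seq[(Int, Int)] =
    sc.parallelize(data, numPartitions)
      .mapPartitionsWithIndex((idx, iter) => iter.map(x => (idx, x)))
      .collect()
      .toSeq

  // foreachPartition: an action with no return value, typically used for
  // per-partition setup such as opening one connection per partition.
  // Here it only counts the partitions that contain data.
  def nonEmptyPartitions(sc: SparkContext, data: Seq[Int], numPartitions: Int): Long = {
    val nonEmpty = sc.longAccumulator("nonEmptyPartitions")
    sc.parallelize(data, numPartitions)
      .foreachPartition(iter => if (iter.nonEmpty) nonEmpty.add(1L))
    nonEmpty.value
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("PerPartition").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      println(partitionSums(sc, Seq(1, 2, 3, 4), 2))
      println(withPartitionIndex(sc, Seq(1, 2, 3, 4), 2))
      println(nonEmptyPartitions(sc, Seq(1, 2, 3, 4), 2))
    } finally sc.stop()
  }
}
```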

- pipe() to an external program (the program reads each element as a line on standard input, and each line it writes to standard output becomes an element of the resulting RDD)
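As a small illustration of pipe(), the sketch below sends each element through the Unix tr command. It assumes tr is available on every worker, and the object and method names are hypothetical.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object PipeExample {
  // Each element is written to the external program's standard input as one line;
  // each line the program prints becomes one element of the resulting RDD.
  def upperCase(sc: SparkContext, words: Seq[String]): Seq[String] =
    sc.parallelize(words)
      .pipe("tr a-z A-Z") // assumes a Unix-like environment that provides tr
      .collect()
      .toSeq

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("PipeExample").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try println(upperCase(sc, Seq("hello", "spark")))
    finally sc.stop()
  }
}
```

One process is started per partition, so the external program should be cheap to launch and must be installed on all worker nodes.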

- StatCounter on numeric RDD (stats() returns count, mean, stdev, max and min, computed in a single pass)
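A minimal sketch of obtaining a StatCounter from an RDD of doubles. The object name NumericStats and the helper summarize are assumptions for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.util.StatCounter

object NumericStats {
  // stats() is available on RDD[Double] and computes all statistics in one pass.
  def summarize(sc: SparkContext, values: Seq[Double]): StatCounter =
    sc.parallelize(values).stats()

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("NumericStats").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val stats = summarize(sc, Seq(1.0, 2.0, 3.0, 4.0))
      println(s"count=${stats.count} mean=${stats.mean} stdev=${stats.stdev}")
      println(s"min=${stats.min} max=${stats.max}")
    } finally sc.stop()
  }
}
```

Calling stats() once is cheaper than calling mean(), stdev() and max() separately, since each of those would scan the RDD again.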

- number of partitions after a transformation

  • general

filter(), map(), flatMap(), distinct() | same as the parent RDD
rdd.union(otherRDD) | rdd.partitions.size + otherRDD.partitions.size
rdd.intersection(otherRDD) | max(rdd.partitions.size, otherRDD.partitions.size)
rdd.subtract(otherRDD) | rdd.partitions.size
rdd.cartesian(otherRDD) | rdd.partitions.size * otherRDD.partitions.size

  • pair

reduceByKey(), foldByKey(), combineByKey(), groupByKey() | same as the parent RDD
sortByKey() | same as above
mapValues(), flatMapValues() | same as above
cogroup(), join(), leftOuterJoin(), rightOuterJoin() | All parent RDDs are sorted by partition count in descending order, and, starting from the RDD with the most partitions, Spark looks for an existing partitioner. If one is found, the partition count is determined by that partitioner. Otherwise (no RDD has a partitioner) it is determined by spark.default.parallelism; if that is not set either, the partition count is the largest partition count among all the RDDs.
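The partition counts listed above can be checked with a small sketch like the one below. It assumes spark.default.parallelism is not set; the object name, the input ranges and the partition counts 3, 2 and 4 are arbitrary choices for illustration.

```scala
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

object PartitionCounts {
  // Returns the partition count of the result of several transformations.
  def counts(sc: SparkContext): Map[String, Int] = {
    val a = sc.parallelize(1 to 6, 3) // 3 partitions
    val b = sc.parallelize(4 to 9, 2) // 2 partitions

    // Pair RDDs: only `left` carries a partitioner (4 partitions).
    val left = a.map(x => (x, x)).partitionBy(new HashPartitioner(4))
    val right = b.map(x => (x, x))

    Map(
      "filter" -> a.filter(_ % 2 == 0).partitions.size, // same as parent
      "union" -> a.union(b).partitions.size,            // sum of both
      "subtract" -> a.subtract(b).partitions.size,      // same as the left RDD
      "cartesian" -> a.cartesian(b).partitions.size,    // product of both
      "join" -> left.join(right).partitions.size        // taken from left's partitioner
    )
  }

  def main(args: Array[String]): Unit = {
    // spark.default.parallelism is deliberately left unset here.
    val conf = new SparkConf().setAppName("PartitionCounts").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try counts(sc).foreach { case (op, n) => println(s"$op: $n") }
    finally sc.stop()
  }
}
```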
