The Way of Spark Mastery (Advanced): Spark from Beginner to Expert, Section 6: The Spark Programming Model (Part 3)

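This article belongs to the advanced part of The Way of Spark Mastery and covers two important Spark RDD operations in detail: repartitionAndSortWithinPartitions and aggregateByKey. repartitionAndSortWithinPartitions repartitions and sorts in a single step, which is more efficient than doing the two separately; aggregateByKey aggregates the values that share a key by means of the seqOp and combOp functions. The article gives usage examples for both operations and discusses how aggregateByKey behaves across Spark versions. It also covers common RDD actions such as reduce, count, first, and take.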

Author: Zhou Zhihu (周志湖)
Screen name: 摇摆少年梦
WeChat ID: zhouzhihubeyond

Main topics in this section

  1. RDD transformations (continued)
  2. RDD actions

1. RDD transformations (continued)

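(1) repartitionAndSortWithinPartitions(partitioner)
repartitionAndSortWithinPartitions is a variant of repartition. Unlike repartition, it also sorts the records by key within each partition produced by the given partitioner. Because the sort is pushed down into the shuffle, it is more efficient than calling repartition and then sorting within each partition.
Function definition: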
/**
* Repartition the RDD according to the given partitioner and, within each resulting partition,
* sort records by their keys.
*
* This is more efficient than calling repartition and then sorting within each partition
* because it can push the sorting down into the shuffle machinery.
*/
def repartitionAndSortWithinPartitions(partitioner: Partitioner): RDD[(K, V)]

Usage example:

scala> val data = sc.parallelize(List((1,3),(1,2),(5,4),(1, 4),(2,3),(2,4)),3)
data: org.apache.spark.rdd.RDD[(Int, Int)] = ParallelCollectionRDD[3] at parallelize at <console>:21

scala> import org.apache.spark.HashPartitioner
import org.apache.spark.HashPartitioner

scala> data.repartitionAndSortWithinPartitions(new HashPartitioner(3)).collect
res3: Array[(Int, Int)] = Array((1,4), (1,3), (1,2), (2,3), (2,4), (5,4))
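
The collect output above flattens all partitions into one array, which hides where each record ended up. The following is a minimal sketch, assuming Spark 1.3 or later running in local mode, that uses glom() to show the contents of each partition separately. The object name, the helper partitionLayout, and the app name are hypothetical and chosen only for illustration.

```scala
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

object RepartitionAndSortExample {

  // Hypothetical helper: returns the records of each partition after the shuffle.
  def partitionLayout(sc: SparkContext): Array[Array[(Int, Int)]] = {
    val data = sc.parallelize(List((1, 3), (1, 2), (5, 4), (1, 4), (2, 3), (2, 4)), 3)

    // HashPartitioner(3) assigns each record to a partition from its key's hash code;
    // the shuffle then sorts the records of every partition by key.
    val sorted = data.repartitionAndSortWithinPartitions(new HashPartitioner(3))

    // glom() turns each partition into one array, so the layout becomes visible.
    sorted.glom().collect()
  }

  def main(args: Array[String]): Unit = {
    // Local master and app name are assumptions made for this sketch.
    val conf = new SparkConf().setAppName("RepartitionAndSortExample").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      partitionLayout(sc).zipWithIndex.foreach { case (part, idx) =>
        println(s"partition $idx: ${part.mkString(", ")}")
      }
    } finally {
      sc.stop()
    }
  }
}
```

With Int keys, HashPartitioner(3) places key 1 in partition 1 and keys 2 and 5 in partition 2, leaving partition 0 empty. Records are sorted by key only inside each partition; the relative order of values that share a key is not specified.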


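(2) aggregateByKey(zeroValue)(seqOp, combOp, [numTasks])

aggregateByKey aggregates the values that share a key in a pair RDD, and the aggregation also starts from a neutral initial value (the zero value). Its definition is as follows: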
/**
* Aggregate the values of each key, using given combine functions and a neutral "zero value".
* This function can return a different result type, U, than the type of the values in this RDD,
* V. Thus, we need one operation for merging a V into a U and one operation for merging two U's,
* as in scala.TraversableOnce. The former operation is used for merging values within a
* partition, and the latter is used for merging values between partitions. To avoid memory
* allocation, both of these functions are allowed to modify and return their first argument
* instead of creating a new U.
*/
def aggregateByKey[U: ClassTag](zeroValue: U)(seqOp: (U, V) => U,
combOp: (U, U) => U): RDD[(K, U)]
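
To make the roles of seqOp and combOp concrete, here is a minimal sketch, assuming Spark 1.3 or later in local mode, that computes a (sum, count) pair per key. The result type U = (Int, Int) differs from the value type V = Int, which is the case the Scaladoc above describes. The object name, the helper sumAndCountByKey, and the sample data are hypothetical and are not the author's original example.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object AggregateByKeyExample {

  // Hypothetical helper: for every key, returns (sum of values, number of values).
  def sumAndCountByKey(data: RDD[(Int, Int)]): RDD[(Int, (Int, Int))] =
    data.aggregateByKey((0, 0))(
      // seqOp: merge one value V into the accumulator U within a partition.
      (acc, v) => (acc._1 + v, acc._2 + 1),
      // combOp: merge two accumulators U coming from different partitions.
      (a, b) => (a._1 + b._1, a._2 + b._2)
    )

  def main(args: Array[String]): Unit = {
    // Local master and app name are assumptions made for this sketch.
    val conf = new SparkConf().setAppName("AggregateByKeyExample").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val data = sc.parallelize(List((1, 3), (1, 2), (5, 4), (1, 4), (2, 3), (2, 4)), 3)
      sumAndCountByKey(data).collect().sortBy(_._1).foreach { case (key, (sum, count)) =>
        println(s"key=$key sum=$sum count=$count avg=${sum.toDouble / count}")
      }
    } finally {
      sc.stop()
    }
  }
}
```

For this data the per-key results are (9, 3) for key 1, (7, 2) for key 2, and (4, 1) for key 5. Because (0, 0) is neutral for both functions, the result does not depend on how the records are spread across partitions.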

Example code:

import org.apache.spark.SparkContext._
import org.apache.spa