1. reduceByKey
Principle
Spark first pre-aggregates the data for each key inside every partition (map-side combine), and only then performs the final aggregation after the shuffle. The benefit is that less data has to travel over the network, which makes it a good fit for per-key value aggregation.
import org.apache.spark.rdd.RDD
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

object Test11 {
  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf().setMaster("local[*]").setAppName("wordcount")
    val sc = new SparkContext(sparkConf)

    val rdd = sc.parallelize(List(1, 2, 5, 7, 8, 9, 3, 4, 4, 5), 3)
    // Pair each element with a count of 1 so it can be aggregated by key
    val rdds: RDD[(Int, Int)] = rdd.map((_, 1))

    // Default partitioner: derived from the upstream partition count (3 here)
    rdds.reduceByKey(_ + _).glom().map(_.toList).collect().foreach(println)
    println("*******************")
    // Explicit partitioner: hash the aggregated pairs into 3 partitions
    rdds.reduceByKey(new HashPartitioner(3), _ + _).glom().map(_.toList).collect().foreach(println)

    sc.stop()
  }
}
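To make the map-side combine concrete, here is a minimal, self-contained sketch (the object name WordCountCompare and its sample data are assumptions for illustration, not part of the original example). Both expressions return the same per-key sums, but reduceByKey combines values inside each partition before the shuffle, while the groupByKey-based version ships every (key, 1) pair across the network first.

import org.apache.spark.{SparkConf, SparkContext}

object WordCountCompare {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setMaster("local[*]").setAppName("compare"))
    val pairs = sc.parallelize(List("a", "b", "a", "c", "b", "a"), 2).map((_, 1))

    // Map-side combine: each partition reduces its own (key, 1) pairs first,
    // so at most one (key, partialSum) record per key per partition is shuffled.
    val viaReduce = pairs.reduceByKey(_ + _).collect().toMap

    // No map-side combine: every (key, 1) pair crosses the network, then is summed.
    val viaGroup = pairs.groupByKey().mapValues(_.sum).collect().toMap

    println(viaReduce) // e.g. Map(a -> 3, b -> 2, c -> 1)
    println(viaGroup)  // same totals, more shuffle traffic
    sc.stop()
  }
}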
/**
 * Scala aggregation operators (reduce, foldLeft) ==> collapse a collection into a single value.
 * Spark's reduceByKey aggregates per key instead, producing one result for each distinct key.
 */
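To see that contrast in plain Scala collections (the object name and sample data below are assumptions, chosen only to keep the sketch dependency-free):

object AggregationContrast {
  def main(args: Array[String]): Unit = {
    val nums = List(1, 2, 5, 7, 8, 9, 3, 4, 4, 5)

    // reduce / foldLeft: the whole collection collapses into ONE value
    val total1 = nums.reduce(_ + _)      // 48
    val total2 = nums.foldLeft(0)(_ + _) // 48

    // Per-key aggregation (what reduceByKey does on an RDD): ONE value PER key
    val perKey = nums.map((_, 1))
      .groupBy(_._1)
      .map { case (k, vs) => k -> vs.map(_._2).sum } // e.g. 4 -> 2, 5 -> 2, 9 -> 1, ...

    println(total1)
    println(total2)
    println(perKey)
  }
}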
/**
 * When no partitioner is passed, the number of partitions is taken from spark.default.parallelism if it is set; otherwise it is taken from the (largest) number of partitions of the upstream RDDs. The partitioner used by default is HashPartitioner.
*
* Choose a partitioner to use for a cogroup-like operation between a number of RDDs.
*
* If spark.default.parallelism is set, we'll use the value of SparkContext defaultParallelism
* as the default partitions number, otherwise we'll use the max number of upstream partitions.
*
* When available, we choose the partitioner from rdds with maximum number of partitions. If this
* partitioner is eligible (number of partitions within an order of maximum number of partitions
* in rdds), or has partition number higher than or equal to default partitions number - we use
* this partitioner.
*
* Otherwise, we'll use a new HashPartitioner with the default partitions number.
*
* Unless spark.default.parallelism is set, the number of partitions will be the same as the
* number of partitions in the largest upstream RDD, as this should be least likely to cause
* out-of-memory errors.
*
* We use two method parameters (rdd, others) to enforce callers passing at least 1 RDD.
*
def defaultPartitioner(rdd: RDD[_], others: RDD[_]*): Partitioner = {
  val rdds = (Seq(rdd) ++ others)
  val hasPartitioner = rdds.filter(_.partitioner.exists(_.numPartitions > 0))

  val hasMaxPartitioner: Option[RDD[_]] = if (hasPartitioner.nonEmpty) {
    Some(hasPartitioner.maxBy(_.partitions.length))
  } else {
    None
  }

  val defaultNumPartitions = if (rdd.context.conf.contains("spark.default.parallelism")) {
    rdd.context.defaultParallelism
  } else {
    rdds.map(_.partitions.length).max
  }

  // If the existing max partitioner is an eligible one, or its partitions number is larger
  // than or equal to the default number of partitions, use the existing partitioner.
  if (hasMaxPartitioner.nonEmpty && (isEligiblePartitioner(hasMaxPartitioner.get, rdds) ||
      defaultNumPartitions <= hasMaxPartitioner.get.getNumPartitions)) {
    hasMaxPartitioner.get.partitioner.get
  } else {
    new HashPartitioner(defaultNumPartitions)
  }
}
*/
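In the examples above the RDD has 3 partitions and spark.default.parallelism is not set, so defaultPartitioner falls back to a HashPartitioner(3). The following is a rough, hand-rolled sketch of how such a hash partitioner decides which partition a key lands in (non-negative modulo of the key's hashCode); the object and method names are assumptions for illustration, not the actual Spark class:

object HashPartitionSketch {
  // Approximation of HashPartitioner.getPartition: nonNegativeMod(key.hashCode, numPartitions)
  def partitionOf(key: Any, numPartitions: Int): Int = {
    val raw = key.hashCode % numPartitions
    if (raw < 0) raw + numPartitions else raw // keep the index non-negative
  }

  def main(args: Array[String]): Unit = {
    val keys = List(1, 2, 5, 7, 8, 9, 3, 4)
    // For Int keys, hashCode is the value itself, so 9 -> partition 0, 4 -> partition 1, ...
    keys.foreach(k => println(s"key $k -> partition ${partitionOf(k, 3)}"))
  }
}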
2. groupByKey
Principle
Like reduceByKey, groupByKey groups records that share the same key. The difference is that groupByKey does no map-side pre-aggregation: it only collects each key's values into a CompactBuffer, so it has to shuffle every record and therefore needs more network bandwidth.
import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

object Test12 {
  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf().setMaster("local[*]").setAppName("wordcount")
    val sc = new SparkContext(sparkConf)

    val rdd = sc.parallelize(List(1, 2, 5, 7, 8, 9, 3, 4, 4, 5), 3)
    // Pair each element with a count of 1, as in Test11
    val rdds: RDD[(Int, Int)] = rdd.map((_, 1))