Spark Key Operators

1. reduceByKey

Principle
reduceByKey first pre-aggregates the data for each key inside every partition (a map-side combine) and then performs the final aggregation after the shuffle. The benefit is that less data is sent over the network, so it is well suited to cases where the values need to be aggregated.

import org.apache.spark.rdd.RDD
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

object Test11 {

  def main(args: Array[String]): Unit = {

    val sparkconf = new SparkConf().setMaster("local[*]").setAppName("wordcount")
    val sc = new SparkContext(sparkconf)
    val rdd = sc.parallelize(List(1, 2, 5, 7, 8, 9, 3, 4, 4, 5), 3)
    val rdds: RDD[(Int, Int)] = rdd.map((_, 1))

    // Default partitioner: a HashPartitioner with as many partitions as the upstream RDD (3 here)
    rdds.reduceByKey(_ + _).glom().map(_.toList).collect().foreach(println)
    println("*******************")
    // Explicit partitioner: pass a HashPartitioner(3) directly
    rdds.reduceByKey(new HashPartitioner(3), _ + _).glom().map(_.toList).collect().foreach(println)
  }

}

/**
 * Scala's aggregation operators (reduce, foldLeft) collapse a whole collection into a single value;
 * Spark's reduceByKey applies the same kind of aggregation per key, producing one value per distinct key.
 */
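
To make the comparison in the comment above concrete, here is a minimal sketch (the variable names total and counts are mine, and it reuses the list and the SparkContext sc from Test11): foldLeft collapses the whole collection into a single value, while reduceByKey applies the same combining function separately for each key.

// Plain Scala: foldLeft folds the whole list into one value
val total = List(1, 2, 5, 7, 8, 9, 3, 4, 4, 5).foldLeft(0)(_ + _)   // 48

// Spark: the same (_ + _) is applied per key, giving one result per distinct key
// (assumes an existing SparkContext named sc, as in Test11)
val counts = sc.parallelize(List(1, 2, 5, 7, 8, 9, 3, 4, 4, 5))
  .map((_, 1))
  .reduceByKey(_ + _)   // RDD[(Int, Int)]: (value, occurrence count)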


/**
 * When the default partitioner is used, the number of partitions is taken from spark.default.parallelism (via the SparkContext's defaultParallelism) if it is set; otherwise it is the largest partition count among the upstream RDDs. The partitioner itself defaults to HashPartitioner. The Spark source for Partitioner.defaultPartitioner is quoted below:
 *
 * Choose a partitioner to use for a cogroup-like operation between a number of RDDs.
 *
 * If spark.default.parallelism is set, we'll use the value of SparkContext defaultParallelism
 * as the default partitions number, otherwise we'll use the max number of upstream partitions.
 *
 * When available, we choose the partitioner from rdds with maximum number of partitions. If this
 * partitioner is eligible (number of partitions within an order of maximum number of partitions
 * in rdds), or has partition number higher than or equal to default partitions number - we use
 * this partitioner.
 *
 * Otherwise, we'll use a new HashPartitioner with the default partitions number.
 *
 * Unless spark.default.parallelism is set, the number of partitions will be the same as the
 * number of partitions in the largest upstream RDD, as this should be least likely to cause
 * out-of-memory errors.
 *
 * We use two method parameters (rdd, others) to enforce callers passing at least 1 RDD.
 *

  def defaultPartitioner(rdd: RDD[_], others: RDD[_]*): Partitioner = {
    val rdds = (Seq(rdd) ++ others)
    val hasPartitioner = rdds.filter(_.partitioner.exists(_.numPartitions > 0))

    val hasMaxPartitioner: Option[RDD[_]] = if (hasPartitioner.nonEmpty) {
      Some(hasPartitioner.maxBy(_.partitions.length))
    } else {
      None
    }

    val defaultNumPartitions = if (rdd.context.conf.contains("spark.default.parallelism")) {
      rdd.context.defaultParallelism
    } else {
      rdds.map(_.partitions.length).max
    }

    // If the existing max partitioner is an eligible one, or its partitions number is larger
    // than or equal to the default number of partitions, use the existing partitioner.
    if (hasMaxPartitioner.nonEmpty && (isEligiblePartitioner(hasMaxPartitioner.get, rdds) ||
        defaultNumPartitions <= hasMaxPartitioner.get.getNumPartitions)) {
      hasMaxPartitioner.get.partitioner.get
    } else {
      new HashPartitioner(defaultNumPartitions)
    }
  }
*/
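
A quick way to observe this rule is to check getNumPartitions on the shuffled result. The lines below are a sketch that assumes the sc and rdds from Test11 (3 upstream partitions) and that spark.default.parallelism has not been set:

// No partitioner and no numPartitions argument: defaultPartitioner falls back to
// a HashPartitioner with the largest upstream partition count (3 here).
println(rdds.reduceByKey(_ + _).getNumPartitions)                           // 3

// An explicit partitioner is used as-is.
println(rdds.reduceByKey(new HashPartitioner(5), _ + _).getNumPartitions)   // 5

// The numPartitions overload simply wraps the number in a HashPartitioner.
println(rdds.reduceByKey(_ + _, 2).getNumPartitions)                        // 2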


2. groupByKey

Principle
Like reduceByKey, groupByKey groups records that share the same key. Unlike reduceByKey, however, it does no map-side pre-aggregation: it only gathers each key's values into a CompactBuffer, so every record has to be shuffled and more network bandwidth is consumed.

import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

object Test12 {

  def main(args: Array[String]): Unit = {

    val sparkconf = new SparkConf().setMaster("local[*]").setAppName("wordcount")
    val sc = new SparkContext(sparkconf)
    val rdd = sc.parallelize(List(1, 2, 5, 7, 8, 9, 3, 4, 4, 5), 3)
    val rdds: RDD[(Int, Int)] = rdd.map((_, 1))
    // The original post is truncated here; the lines below complete the example by analogy
    // with Test11: group the pairs by key and print each key with its CompactBuffer of values.
    rdds.groupByKey().collect().foreach(println)
  }

}
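
For comparison, the counts produced by reduceByKey can also be obtained from groupByKey by summing each buffer afterwards. This is a sketch continuing from the rdds above; it illustrates why reduceByKey is usually preferred: with groupByKey every (key, 1) pair crosses the network before any summing happens.

// Same final result as rdds.reduceByKey(_ + _), but with no map-side combine:
// all (key, 1) records are shuffled, then summed on the reduce side.
val grouped = rdds.groupByKey()           // RDD[(Int, Iterable[Int])], values held in a CompactBuffer
val summed = grouped.mapValues(_.sum)     // RDD[(Int, Int)]
summed.collect().foreach(println)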