1. reduceByKey
Principle
Spark first pre-aggregates the data for each key inside every partition (map-side combine), and only then performs the final aggregation after the shuffle. The benefit is that less data has to travel over the network, which makes it a good fit for per-key value aggregation.
import org.apache.spark.rdd.RDD
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

object Test11 {
  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf().setMaster("local[*]").setAppName("wordcount")
    val sc = new SparkContext(sparkConf)

    val rdd = sc.parallelize(List(1, 2, 5, 7, 8, 9, 3, 4, 4, 5), 3)
    // Pair each element with a count of 1 so it can be aggregated by key
    val rdds: RDD[(Int, Int)] = rdd.map((_, 1))

    // Default partitioner: derived from the upstream partition count (3 here)
    rdds.reduceByKey(_ + _).glom().map(_.toList).collect().foreach(println)
    println("*******************")
    // Explicit partitioner: hash the aggregated pairs into 3 partitions
    rdds.reduceByKey(new HashPartitioner(3), _ + _).glom().map(_.toList).collect().foreach(println)

    sc.stop()
  }
}
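To make the map-side combine concrete, here is a minimal, self-contained sketch (the object name WordCountCompare and its sample data are assumptions for illustration, not part of the original example). Both expressions return the same per-key sums, but reduceByKey combines values inside each partition before the shuffle, while the groupByKey-based version ships every (key, 1) pair across the network first.

import org.apache.spark.{SparkConf, SparkContext}

object WordCountCompare {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setMaster("local[*]").setAppName("compare"))
    val pairs = sc.parallelize(List("a", "b", "a", "c", "b", "a"), 2).map((_, 1))

    // Map-side combine: each partition reduces its own (key, 1) pairs first,
    // so at most one (key, partialSum) record per key per partition is shuffled.
    val viaReduce = pairs.reduceByKey(_ + _).collect().toMap

    // No map-side combine: every (key, 1) pair crosses the network, then is summed.
    val viaGroup = pairs.groupByKey().mapValues(_.sum).collect().toMap

    println(viaReduce) // e.g. Map(a -> 3, b -> 2, c -> 1)
    println(viaGroup)  // same totals, more shuffle traffic
    sc.stop()
  }
}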
/**
 * Scala aggregation operators (reduce, foldLeft) ==> collapse a collection into a single value.
 * Spark's reduceByKey aggregates per key instead, producing one result for each distinct key.
 */
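To see that contrast in plain Scala collections (the object name and sample data below are assumptions, chosen only to keep the sketch dependency-free):

object AggregationContrast {
  def main(args: Array[String]): Unit = {
    val nums = List(1, 2, 5, 7, 8, 9, 3, 4, 4, 5)

    // reduce / foldLeft: the whole collection collapses into ONE value
    val total1 = nums.reduce(_ + _)      // 48
    val total2 = nums.foldLeft(0)(_ + _) // 48

    // Per-key aggregation (what reduceByKey does on an RDD): ONE value PER key
    val perKey = nums.map((_, 1))
      .groupBy(_._1)
      .map { case (k, vs) => k -> vs.map(_._2).sum } // e.g. 4 -> 2, 5 -> 2, 9 -> 1, ...

    println(total1)
    println(total2)
    println(perKey)
  }
}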
/**
 * When no partitioner is passed, the number of partitions is taken from spark.default.parallelism if it is set; otherwise it is taken from the (largest) number of partitions of the upstream RDDs. The partitioner used by default is HashPartitioner.
*
* Choose a partitioner to use for a cogroup-like operation between a number of RDDs.
*
* If spark.default.parallelism is set, we'll use the value of SparkContext defaultParallelism
* as the default partitions number, otherwise we'll use the max number of upstream partitions.
*
* When available, we choose the partitioner from rdds with maximum number of partitions. If this
* partitioner is eligible (number of partitions within an order of maximum number of partitions
* in rdds), or has partition number higher than or equal to default partitions number - we use
* this partitioner.
*
* Otherwise, we'll use a new HashPartitioner with the default partitions number.
*
* Unless spark.default.parallelism is set, the number of partitions will be the same as the
* number of partitions in the largest upstream RDD, as this should be least likely to cause
* out-of-memory errors.
*
* We use two method parameters (rdd, others) to enforce callers passing at least 1 RDD.
*
def defaultPartitioner(rdd: RDD[_], others: RDD[_]*): Partitioner = {
  val rdds = (Seq(rdd) ++ others)
  val hasPartitioner = rdds.filter(_.partitioner.exists(_.numPartitions > 0))

  val hasMaxPartitioner: Option[RDD[_]] = if (hasPartitioner.nonEmpty) {
    Some(hasPartitioner.maxBy(_.partitions.length))
  } else {
    None
  }

  val defaultNumPartitions = if (rdd.context.conf.contains("spark.default.parallelism")) {
    rdd.context.defaultParallelism
  } else {
    rdds.map(_.partitions.length).max
  }

  // If the existing max partitioner is an eligible one, or its partitions number is larger
  // than or equal to the default number of partitions, use the existing partitioner.
  if (hasMaxPartitioner.nonEmpty && (isEligiblePartitioner(hasMaxPartitioner.get, rdds) ||
      defaultNumPartitions <= hasMaxPartitioner.get.getNumPartitions)) {
    hasMaxPartitioner.get.partitioner.get
  } else {
    new HashPartitioner(defaultNumPartitions)
  }
}
*/
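In the examples above the RDD has 3 partitions and spark.default.parallelism is not set, so defaultPartitioner falls back to a HashPartitioner(3). The following is a rough, hand-rolled sketch of how such a hash partitioner decides which partition a key lands in (non-negative modulo of the key's hashCode); the object and method names are assumptions for illustration, not the actual Spark class:

object HashPartitionSketch {
  // Approximation of HashPartitioner.getPartition: nonNegativeMod(key.hashCode, numPartitions)
  def partitionOf(key: Any, numPartitions: Int): Int = {
    val raw = key.hashCode % numPartitions
    if (raw < 0) raw + numPartitions else raw // keep the index non-negative
  }

  def main(args: Array[String]): Unit = {
    val keys = List(1, 2, 5, 7, 8, 9, 3, 4)
    // For Int keys, hashCode is the value itself, so 9 -> partition 0, 4 -> partition 1, ...
    keys.foreach(k => println(s"key $k -> partition ${partitionOf(k, 3)}"))
  }
}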
2. groupByKey
Principle
Like reduceByKey, groupByKey groups records that share the same key. The difference is that groupByKey does no map-side pre-aggregation: it only collects each key's values into a CompactBuffer, so it has to shuffle every record and therefore needs more network bandwidth.
import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

object Test12 {
  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf().setMaster("local[*]").setAppName("wordcount")
    val sc = new SparkContext(sparkConf)

    val rdd = sc.parallelize(List(1, 2, 5, 7, 8, 9, 3, 4, 4, 5), 3)
    // Pair each element with a count of 1, as in Test11
    val rdds: RDD[(Int, Int)] = rdd.map((_, 1))