Spark Operators: Shuffle Operators

In Spark there is an implicit conversion that turns an RDD of key-value pairs into PairRDDFunctions, which is where the shuffle operators below are defined.
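
A minimal sketch of where that conversion kicks in (in current Spark versions the implicit, rddToPairRDDFunctions, lives in the RDD companion object, so no extra import is needed; object and value names here are just for the demo):

import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

object ImplicitPairDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("ImplicitPairDemo").setMaster("local[*]"))

    val words: RDD[String] = sc.parallelize(List("spark", "flink", "spark"))
    // words.groupByKey()   // would not compile: RDD[String] is not a pair RDD

    val wordAndOne: RDD[(String, Int)] = words.map((_, 1))
    // the element type is now (K, V), so the implicit conversion to
    // PairRDDFunctions makes groupByKey / reduceByKey / ... available
    wordAndOne.groupByKey().collect().foreach(println)

    sc.stop()
  }
}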

groupByKey

This operator introduces a new stage. Its source lives in PairRDDFunctions, and it is not implemented by simply creating another MapPartitionsRDD. It is where the map side and the reduce side interact: after the map side finishes processing its data, it first spills the data to map-side local disk (shuffle write), and the reduce side then pulls it over the network (shuffle read).

As for why the output shrinks: the keys are merged in the output files, so there are far fewer keys than before, while the total number of values stays the same.

Another point worth discussing is how the partitioning rule is decided. If you do not pass in a partitioner, a default one is used: HashPartitioner, the same strategy MapReduce uses. The default picks the number of partitions from the largest partition count among the upstream RDDs (or spark.default.parallelism if it is set), and each record is assigned to a partition by taking its key's hashCode modulo that number.

Either a Partitioner or just a number of partitions can be passed in; the numPartitions overload simply builds a HashPartitioner and delegates to the Partitioner overload (see the source further below).

package com.doit.spark.restart

import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

object GroupByKey {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("GroupByKeyDemo").setMaster("local[*]")
    val sc = new SparkContext(conf)
    // 4 input partitions
    val words = sc.parallelize(List(
      "spark", "flink", "hadoop", "kafka",
      "spark", "hive", "hadoop", "kafka",
      "spark", "flink", "hive", "kafka",
      "spark", "flink", "hadoop", "kafka"
    ), 4)
    val wordAndOne: RDD[(String, Int)] = words.map((_, 1))
    // shuffle into 6 output partitions, grouping all values of the same key
    val value: RDD[(String, Iterable[Int])] = wordAndOne.groupByKey(6)
    value.saveAsTextFile("zhao")
    sc.stop()
  }
}


Output (6 partitions):
p0  (kafka,CompactBuffer(1, 1, 1, 1))
p1  (hadoop,CompactBuffer(1, 1, 1))
p2  (hive,CompactBuffer(1, 1))
p3  (empty)
p4  (flink,CompactBuffer(1, 1, 1))
p5  (spark,CompactBuffer(1, 1, 1, 1))
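
Why is p3 empty? HashPartitioner assigns a record to partition nonNegativeMod(key.hashCode, numPartitions), and none of the five distinct words happens to map to partition 3. A quick local sketch to check this (nonNegativeMod is rewritten here because Spark's Utils class is internal):

object WhichPartition {
  // same idea as Spark's internal Utils.nonNegativeMod
  def nonNegativeMod(x: Int, mod: Int): Int = {
    val raw = x % mod
    if (raw < 0) raw + mod else raw
  }

  def main(args: Array[String]): Unit = {
    val numPartitions = 6
    for (word <- List("spark", "flink", "hadoop", "kafka", "hive")) {
      println(s"$word -> partition ${nonNegativeMod(word.hashCode, numPartitions)}")
    }
    // matches the output above: kafka -> 0, hadoop -> 1, hive -> 2,
    // flink -> 4, spark -> 5, so nothing lands in partition 3
  }
}

The relevant Spark source for the two groupByKey overloads and for HashPartitioner follows: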
/**
   * Group the values for each key in the RDD into a single sequence. Hash-partitions the
   * resulting RDD with into `numPartitions` partitions. The ordering of elements within
   * each group is not guaranteed, and may even differ each time the resulting RDD is evaluated.
   *
   * @note This operation may be very expensive. If you are grouping in order to perform an
   * aggregation (such as a sum or average) over each key, using `PairRDDFunctions.aggregateByKey`
   * or `PairRDDFunctions.reduceByKey` will provide much better performance.
   *
   * @note As currently implemented, groupByKey must be able to hold all the key-value pairs for any
   * key in memory. If a key has too many values, it can result in an `OutOfMemoryError`.
   */
  def groupByKey(numPartitions: Int): RDD[(K, Iterable[V])] = self.withScope {
    groupByKey(new HashPartitioner(numPartitions))
  }


/**
   * Group the values for each key in the RDD into a single sequence. Allows controlling the
   * partitioning of the resulting key-value pair RDD by passing a Partitioner.
   * The ordering of elements within each group is not guaranteed, and may even differ
   * each time the resulting RDD is evaluated.
   *
   * @note This operation may be very expensive. If you are grouping in order to perform an
   * aggregation (such as a sum or average) over each key, using `PairRDDFunctions.aggregateByKey`
   * or `PairRDDFunctions.reduceByKey` will provide much better performance.
   *
   * @note As currently implemented, groupByKey must be able to hold all the key-value pairs for any
   * key in memory. If a key has too many values, it can result in an `OutOfMemoryError`.
   */
  def groupByKey(partitioner: Partitioner): RDD[(K, Iterable[V])] = self.withScope {
    // groupByKey shouldn't use map side combine because map side combine does not
    // reduce the amount of data shuffled and requires all map side data be inserted
    // into a hash table, leading to more objects in the old gen.
    val createCombiner = (v: V) => CompactBuffer(v)
    val mergeValue = (buf: CompactBuffer[V], v: V) => buf += v
    val mergeCombiners = (c1: CompactBuffer[V], c2: CompactBuffer[V]) => c1 ++= c2
    val bufs = combineByKeyWithClassTag[CompactBuffer[V]](
      createCombiner, mergeValue, mergeCombiners, partitioner, mapSideCombine = false)
    bufs.asInstanceOf[RDD[(K, Iterable[V])]]
  }
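
Since the second overload accepts any Partitioner, a custom one can be plugged in directly. A minimal sketch, assuming a made-up FirstLetterPartitioner that sends keys starting with "h" to partition 0 and everything else to partition 1:

import org.apache.spark.Partitioner
import org.apache.spark.{SparkConf, SparkContext}

// hypothetical partitioner, only for illustration
class FirstLetterPartitioner extends Partitioner {
  override def numPartitions: Int = 2
  override def getPartition(key: Any): Int =
    if (key.toString.startsWith("h")) 0 else 1
}

object GroupByKeyCustomPartitioner {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("GroupByKeyCustomPartitioner").setMaster("local[*]"))
    val wordAndOne = sc.parallelize(
      List("spark", "hive", "hadoop", "kafka", "spark"), 2).map((_, 1))
    // pass the custom partitioner straight into groupByKey
    wordAndOne.groupByKey(new FirstLetterPartitioner)
      .collect()
      .foreach(println)
    sc.stop()
  }
}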


/**
 * A [[org.apache.spark.Partitioner]] that implements hash-based partitioning using
 * Java's `Object.hashCode`.
 *
 * Java arrays have hashCodes that are based on the arrays' identities rather than their contents,
 * so attempting to partition an RDD[Array[_]] or RDD[(Array[_], _)] using a HashPartitioner will
 * produce an unexpected or incorrect result.
 */
class HashPartitioner(partitions: Int) extends Partitioner {
  require(partitions >= 0, s"Number of partitions ($partitions) cannot be negative.")

  def numPartitions: Int = partitions

  def getPartition(key: Any): Int = key match {
    case null => 0
    case _ => Utils.nonNegativeMod(key.hashCode, numPartitions)
  }

  override def equals(other: Any): Boolean = other match {
    case h: HashPartitioner =>
      h.numPartitions == numPartitions
    case _ =>
      false
  }

  override def hashCode: Int = numPartitions
}

val createCombiner = (v: V) => CompactBuffer(v)
val mergeValue = (buf: CompactBuffer[V], v: V) => buf += v
val mergeCombiners = (c1: CompactBuffer[V], c2: CompactBuffer[V]) => c1 ++= c2

The implementation boils down to combineByKeyWithClassTag, which creates a ShuffledRDD. Note that mapSideCombine = false (see the comment in the source above): the map side does not pre-aggregate, it just writes the raw (key, value) pairs out. After the shuffle, the reduce side creates a CompactBuffer(v) for the first value of each key (createCombiner), appends every further value of the same key with mergeValue, and merges buffers coming from different map outputs with mergeCombiners.
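
A sketch of the same idea at the user level, using the public combineByKey with map-side combine turned off (the output is equivalent to groupByKey, though groupByKey itself uses the internal combineByKeyWithClassTag and CompactBuffer):

import org.apache.spark.HashPartitioner
import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

import scala.collection.mutable.ArrayBuffer

object GroupByKeyViaCombineByKey {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("GroupByKeyViaCombineByKey").setMaster("local[*]"))
    val wordAndOne: RDD[(String, Int)] = sc.parallelize(
      List("spark", "flink", "hadoop", "spark", "flink", "spark"), 2).map((_, 1))

    // ArrayBuffer plays the role of Spark's internal CompactBuffer here
    val grouped: RDD[(String, ArrayBuffer[Int])] = wordAndOne.combineByKey(
      (v: Int) => ArrayBuffer(v),                                 // createCombiner
      (buf: ArrayBuffer[Int], v: Int) => buf += v,                // mergeValue
      (b1: ArrayBuffer[Int], b2: ArrayBuffer[Int]) => b1 ++= b2,  // mergeCombiners
      new HashPartitioner(3),
      mapSideCombine = false)                                     // same as groupByKey

    grouped.collect().foreach(println)
    sc.stop()
  }
}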

groupBy

This operator takes a grouping function and, in the overload shown below, a partitioner. It maps every element t to the pair (f(t), t), i.e. (result of calling the function, the untouched element), and then runs groupByKey on that, so each group ends up as a CompactBuffer of the original elements.

def groupBy[K](f: T => K, p: Partitioner)(implicit kt: ClassTag[K], ord: Ordering[K] = null)
      : RDD[(K, Iterable[T])] = withScope {
    val cleanF = sc.clean(f)
    this.map(t => (cleanF(t), t)).groupByKey(p)
  }
package com.doit.spark.restart

import org.apache.spark.{SparkConf, SparkContext}

object GroupBy {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("GroupByDemo").setMaster("local[*]")
    val sc = new SparkContext(conf)
    val words = sc.parallelize(List(
      "spark", "flink", "hadoop", "kafka",
      "spark", "hive", "hadoop", "kafka",
      "spark", "flink", "hive", "kafka",
      "spark", "flink", "hadoop", "kafka"
    ), 4)
    val wordAndOne = words.map((_, 1))
    // group the (word, 1) pairs by the word itself
    val grouped = wordAndOne.groupBy(_._1)
    grouped.collect().foreach(println)
    // the same thing expressed with groupByKey: map to (key, whole tuple) first
    wordAndOne.map(x => (x._1, x)).groupByKey().collect().foreach(println)
    sc.stop()
  }
}

 

Distinct

This operator removes duplicates: List(1, 1, 2, 2, 2, 3, 4, 5, 4, 3, 2, 4, 2, 5) => (1, 2, 3, 4, 5)

      case _ => map(x => (x, null)).reduceByKey((x, _) => x, numPartitions).map(_._1)

That is the implementation: the elements are first wrapped into pairs (x, null); then reduceByKey is called with a function that simply keeps the first value it sees for each key and does nothing else. Normally the function passed to reduceByKey is something like _ + _, and the aggregation runs twice, once on the map side and once on the reduce side; here the value is just null and is never combined. Finally map(_._1) takes the key back out of each pair, and the duplicates are gone.
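
A small sketch of the same trick spelled out by hand (the object name is just for the demo); the result should match what distinct itself returns:

import org.apache.spark.{SparkConf, SparkContext}

object DistinctByHand {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("DistinctByHand").setMaster("local[*]"))
    val nums = sc.parallelize(List(1, 1, 2, 2, 2, 3, 4, 5, 4, 3, 2, 4, 2, 5), 2)

    // the same pattern distinct uses internally:
    // pair every element with null, keep one pair per key, drop the null again
    val deduped = nums.map(x => (x, null)).reduceByKey((x, _) => x).map(_._1)

    println(deduped.collect().toList.sorted)         // List(1, 2, 3, 4, 5)
    println(nums.distinct().collect().toList.sorted) // same result
    sc.stop()
  }
}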
 
