Spark: Does the distinct operator pull all the data to the Driver?

没有文化，啥也不会


Article link: https://blog.csdn.net/x950913/article/details/114264435


Preface

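When writing SQL over big data, the distinct keyword is usually avoided because it tends to be slow. Spark Core nevertheless has a distinct operator, which the official documentation describes as returning a dataset without duplicate elements. I am planning to use Spark Core to parse JSON and generate Hive tables dynamically, which means deduplicating the keys of all the JSON records. reduce, reduceByKey, and groupByKey could all do this, but each needs the data reshaped first (reduceByKey and groupByKey require a pair RDD). I would rather use distinct, but I was worried about data skew, so I read its source code.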

distinct([numPartitions])

Return a new dataset that contains the distinct elements of the source dataset.

Source code


  /**
   * Return a new RDD containing the distinct elements in this RDD.
   */
  def distinct(numPartitions: Int)(implicit ord: Ordering[T] = null): RDD[T] = withScope {
    // Remove duplicates within one partition: every element is inserted into the map as a key, so duplicates collapse.
    def removeDuplicatesInPartition(partition: Iterator[T]): Iterator[T] = {
      // Create an instance of external append only map which ignores values.
      val map = new ExternalAppendOnlyMap[T, Null, Null](
        createCombiner = _ => null,
        mergeValue = (a, b) => a,
        mergeCombiners = (a, b) => a)
      map.insertAll(partition.map(_ -> null))
      map.iterator.map(_._1)
    }
    partitioner match {
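      // If the RDD already has a partitioner and the requested partition count equals the current one,
      // deduplicate inside each partition with removeDuplicatesInPartition (no extra shuffle).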
      case Some(_) if numPartitions == partitions.length =>
        mapPartitions(removeDuplicatesInPartition, preservesPartitioning = true)
      // Otherwise (no partitioner, or a different partition count), convert to a pair RDD and deduplicate with reduceByKey.
      case _ => map(x => (x, null)).reduceByKey((x, _) => x, numPartitions).map(_._1)
    }
  }

  /**
   * Return a new RDD containing the distinct elements in this RDD.
   */
  // If no partition count is given, the current partition count is passed in.
  def distinct(): RDD[T] = withScope {
    distinct(partitions.length)
  }
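
One way to see which case of the `partitioner match` above is taken is to look at the partitioner of the result: the first case keeps it (`preservesPartitioning = true`), while the fallback ends with a plain `map` that drops it. The sketch below assumes Spark Core is on the classpath and that the installed version's `distinct` matches the listing above; the object name, the sample data, and the `local[2]` master are made up for illustration.

```scala
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

object DistinctBranches {
  // Returns whether the result of distinct() still has a partitioner,
  // first for an unpartitioned RDD, then for one that went through partitionBy.
  def demo(): (Boolean, Boolean) = {
    val conf = new SparkConf().setAppName("distinct-branches").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      // Illustrative (key, count) pairs; the data is invented for this sketch.
      val pairs = sc.parallelize(Seq(("id", 1), ("name", 1), ("id", 1), ("age", 1)), 2)

      // parallelize gives an RDD with partitioner == None, so distinct() falls to `case _`.
      val viaShuffle = pairs.distinct()

      // partitionBy attaches a HashPartitioner, and distinct() reuses the current
      // partition count, so `case Some(_)` applies and each partition is deduplicated in place.
      val inPlace = pairs.partitionBy(new HashPartitioner(2)).distinct()

      // The lineage shows where shuffles occur in each plan.
      println(viaShuffle.toDebugString)
      println(inPlace.toDebugString)
      (viaShuffle.partitioner.isDefined, inPlace.partitioner.isDefined)
    } finally {
      sc.stop()
    }
  }

  def main(args: Array[String]): Unit = println(demo())
}
```

Note that `partitionBy` is itself a shuffle. The point is only that `distinct` adds no second one when the data is already partitioned.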

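From the source, distinct does one of two things, depending on the RDD's partitioner and the requested number of partitions:

If the RDD already has a partitioner (partitioner is Some(_)) and the requested number of partitions equals the RDD's current number (always true when no number is given), it calls removeDuplicatesInPartition to remove duplicates inside each partition. Because the partitioner has already placed equal elements in the same partition, deduplicating every partition deduplicates the whole dataset. Not passing a partition count is not enough on its own: an RDD without a partitioner always takes the second path.

Otherwise, that is, when the RDD has no partitioner or the requested number differs from the current one, it converts the RDD into a pair RDD whose key is the original element and whose value is null, then calls reduceByKey, which repartitions the data into the requested number of partitions and deduplicates it.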
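
The reduceByKey path can be modelled with plain Scala collections. This is a toy sketch, not Spark code: each inner `Seq` stands in for one partition, and `distinctViaShuffle` and `ShuffleDistinctModel` are hypothetical names. The modulo-of-hash routing mirrors what a hash partitioner does.

```scala
object ShuffleDistinctModel {
  // Each inner Seq stands in for one RDD partition; this is a toy model, not Spark code.
  def distinctViaShuffle[T](partitions: Seq[Seq[T]], numPartitions: Int): Seq[Seq[T]] = {
    // Map side: duplicates inside each input partition are combined first.
    val combined = partitions.map(_.distinct)
    // Shuffle: route each element to an output partition chosen by its hash code,
    // so equal elements always land in the same output partition.
    val routed = combined.flatten.groupBy(x => Math.floorMod(x.hashCode, numPartitions))
    // Reduce side: drop duplicates that arrived from different input partitions.
    (0 until numPartitions).map(i => routed.getOrElse(i, Seq.empty[T]).distinct)
  }

  def main(args: Array[String]): Unit = {
    val input = Seq(Seq("id", "name", "id"), Seq("name", "age"))
    distinctViaShuffle(input, 2).zipWithIndex.foreach { case (part, i) =>
      println(s"partition $i: ${part.mkString(", ")}")
    }
  }
}
```

In this model each input partition forwards at most one copy of any value, so a heavily repeated value does not by itself pile up on one reducer. Spark's reduceByKey combines on the map side by default, which is the behaviour the first step imitates.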

How the in-partition deduplication works:

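The method creates an ExternalAppendOnlyMap that effectively stores only keys (every value is null). This is a Spark utility class: an append-only map that first holds inserted data in memory and, once the in-memory data reaches a threshold, spills it to disk, sorting it as it spills. When the map is iterated, the spilled data is read back from disk and merged with what is still in memory.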

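Inserting every element into this map as a key is what performs the deduplication. Since both branches of distinct are built from distributed transformations (mapPartitions and reduceByKey), the work happens in executor tasks, and nothing in this code collects the data to the Driver.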
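
The "insert as key" idea can be sketched without Spark. The version below is a simplified stand-in that assumes everything fits in memory: a `LinkedHashMap` replaces ExternalAppendOnlyMap, so there is no spilling or sorting, and the object name `InPartitionDedup` is hypothetical.

```scala
import scala.collection.mutable

object InPartitionDedup {
  // Toy counterpart of removeDuplicatesInPartition: a plain in-memory map
  // replaces ExternalAppendOnlyMap, so nothing is ever spilled to disk here.
  def removeDuplicatesInPartition[T](partition: Iterator[T]): Iterator[T] = {
    val seen = mutable.LinkedHashMap.empty[T, Null]
    // Every element becomes a key with a null value; re-inserting a key changes nothing.
    partition.foreach(x => seen.getOrElseUpdate(x, null))
    seen.keysIterator
  }

  def main(args: Array[String]): Unit = {
    val partition = Iterator("id", "name", "id", "age", "name")
    println(removeDuplicatesInPartition(partition).toList)
  }
}
```

The real class exists for the case this sketch ignores, namely a partition too large to hold in memory at once.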