Spark-Operator-groupBy

note 1
These are my early learning notes, shared as-is without much rigorous verification; if you spot mistakes, corrections are welcome.
note 2
The whole operator series uses the same test data; the complete data set is in the first article of the series, Spark-Operator-Map.
note 3
Because that is how my work environment is set up, I personally develop in a mix of Java and Scala; please keep that in mind.
note 4
Code versions
    -Spark 2.2
    -Scala 2.11
  • groupByKey is used more often in practice
  • As the source below shows, groupBy is implemented on top of groupByKey
  • The principle is the same; knowing groupBy can make the calling code a little shorter (even if its Javadoc is longer)

  /**
   * Return an RDD of grouped items. Each group consists of a key and a sequence of elements
   * mapping to that key. The ordering of elements within each group is not guaranteed, and
   * may even differ each time the resulting RDD is evaluated.
   *
   * @note This operation may be very expensive. If you are grouping in order to perform an
   * aggregation (such as a sum or average) over each key, using `PairRDDFunctions.aggregateByKey`
   * or `PairRDDFunctions.reduceByKey` will provide much better performance.
   */
  def groupBy[K](f: T => K)(implicit kt: ClassTag[K]): RDD[(K, Iterable[T])] = withScope {
    groupBy[K](f, defaultPartitioner(this))
  }

  /**
   * Return an RDD of grouped elements. Each group consists of a key and a sequence of elements
   * mapping to that key. The ordering of elements within each group is not guaranteed, and
   * may even differ each time the resulting RDD is evaluated.
   *
   * @note This operation may be very expensive. If you are grouping in order to perform an
   * aggregation (such as a sum or average) over each key, using `PairRDDFunctions.aggregateByKey`
   * or `PairRDDFunctions.reduceByKey` will provide much better performance.
   */
  def groupBy[K](
      f: T => K,
      numPartitions: Int)(implicit kt: ClassTag[K]): RDD[(K, Iterable[T])] = withScope {
    groupBy(f, new HashPartitioner(numPartitions))
  }

  /**
   * Return an RDD of grouped items. Each group consists of a key and a sequence of elements
   * mapping to that key. The ordering of elements within each group is not guaranteed, and
   * may even differ each time the resulting RDD is evaluated.
   *
   * @note This operation may be very expensive. If you are grouping in order to perform an
   * aggregation (such as a sum or average) over each key, using `PairRDDFunctions.aggregateByKey`
   * or `PairRDDFunctions.reduceByKey` will provide much better performance.
   */
  def groupBy[K](f: T => K, p: Partitioner)(implicit kt: ClassTag[K], ord: Ordering[K] = null)
      : RDD[(K, Iterable[T])] = withScope {
    val cleanF = sc.clean(f)
    this.map(t => (cleanF(t), t)).groupByKey(p)
  }
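
All three overloads above funnel into the last one: key each element with the supplied function, then hand off to groupByKey with a partitioner. A minimal sketch of that equivalence (the sample data and variable names are made up for illustration, not from this series' test files):

val words = sc.parallelize(Seq("apple", "avocado", "banana"))

// Using groupBy directly: group words by their first character.
val byInitial = words.groupBy(w => w.charAt(0))

// What the call expands to internally, per the source above.
val byInitialManual = words.map(w => (w.charAt(0), w)).groupByKey()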
  • In its simplest form it takes a single function that returns the grouping key
  • It returns pairs: the first element is the extracted key, the second is the collection (an Iterable) of the elements that map to that key
import org.apache.spark.sql.SparkSession

object TestGroupBy {
  val ss = SparkSession.builder().master("local").appName("basic").getOrCreate()
  val sc = ss.sparkContext
  sc.setLogLevel("error")

  def main(args: Array[String]): Unit = {
    val rdd = sc.textFile("/home/missingli/IdeaProjects/SparkLearn/src/main/resources/sparkbasic.txt")
    val rdd2 = sc.textFile("/home/missingli/IdeaProjects/SparkLearn/src/main/resources/sparkbasic3.txt")
    // Group the combined lines by their first comma-separated field.
    val rdd3 = (rdd ++ rdd2).groupBy(r => {
      r.split(",", -1)(0)
    })
    // Print each key followed by every line that maps to it.
    rdd3.foreach(r => {
      println("-------------------")
      println(r._1)
      for (str <- r._2) {
        println(str)
      }
    })
  }
}

Test results

-------------------
4
4,6,s,b
4,6,s,b
-------------------
8
8,h,m,o
-------------------
6
6,q,w,e
6,q,w,e
-------------------
2
2,w,gd,h
-------------------
7
7,j,s,b
-------------------
5
5,h,d,o
5,h,d,o
-------------------
9
9,q,w,c
-------------------
3
3,h,r,x
3,h,r,x
-------------------
1
1,a,c,b
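
The grouping works, but as the @note in the Spark source warns, groupBy shuffles every element across the cluster. When the goal is only a per-key aggregate such as a count or a sum, reduceByKey (or aggregateByKey) combines values on the map side first and is much cheaper. A minimal sketch, reusing the rdd and rdd2 inputs from the test above; the per-key line count is illustrative and not part of the original test:

val counts = (rdd ++ rdd2)
  .map(r => (r.split(",", -1)(0), 1)) // key by the first comma-separated field, value 1 per line
  .reduceByKey(_ + _)                 // partial sums are combined map-side before the shuffle
counts.foreach(println)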