note 1
These are notes from my early days of learning, shared largely as-is without much careful verification. If you spot mistakes, corrections are welcome.
note 2
Every post in this operator series uses the same test data; the full data set is in the first post of the series, Spark-Operator-Map.
note 3
Because this matches my work environment, I develop in a mix of Java and Scala. Please keep that in mind.
note 4
Code versions
- Spark 2.2
- Scala 2.11
- groupByKey is more commonly used
- As the source below shows, groupBy is still implemented on top of groupByKey
- Since the underlying mechanism is the same, learning this method can make your code shorter (even if its doc comment is longer)
/**
* Return an RDD of grouped items. Each group consists of a key and a sequence of elements
* mapping to that key. The ordering of elements within each group is not guaranteed, and
* may even differ each time the resulting RDD is evaluated.
*
* @note This operation may be very expensive. If you are grouping in order to perform an
* aggregation (such as a sum or average) over each key, using `PairRDDFunctions.aggregateByKey`
* or `PairRDDFunctions.reduceByKey` will provide much better performance.
*/
def groupBy[K](f: T => K)(implicit kt: ClassTag[K]): RDD[(K, Iterable[T])] = withScope {
  groupBy[K](f, defaultPartitioner(this))
}
/**
* Return an RDD of grouped elements. Each group consists of a key and a sequence of elements
* mapping to that key. The ordering of elements within each group is not guaranteed, and
* may even differ each time the resulting RDD is evaluated.
*
* @note This operation may be very expensive. If you are grouping in order to perform an
* aggregation (such as a sum or average) over each key, using `PairRDDFunctions.aggregateByKey`
* or `PairRDDFunctions.reduceByKey` will provide much better performance.
*/
def groupBy[K](
    f: T => K,
    numPartitions: Int)(implicit kt: ClassTag[K]): RDD[(K, Iterable[T])] = withScope {
  groupBy(f, new HashPartitioner(numPartitions))
}
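As a quick illustration of the numPartitions overload, here is a minimal sketch (assuming a local SparkContext named sc, like the one in the test program further down): passing an explicit partition count simply wraps it in a HashPartitioner, as the source above shows.

// Group integers by parity into 2 partitions; per the source above, this is
// the same as groupBy(n => n % 2, new HashPartitioner(2))
val nums = sc.parallelize(1 to 10)
val byParity = nums.groupBy(n => n % 2, 2)
println(byParity.getNumPartitions) // prints 2
byParity.collect().foreach { case (k, vs) => println(s"$k -> ${vs.mkString(",")}") }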
/**
* Return an RDD of grouped items. Each group consists of a key and a sequence of elements
* mapping to that key. The ordering of elements within each group is not guaranteed, and
* may even differ each time the resulting RDD is evaluated.
*
* @note This operation may be very expensive. If you are grouping in order to perform an
* aggregation (such as a sum or average) over each key, using `PairRDDFunctions.aggregateByKey`
* or `PairRDDFunctions.reduceByKey` will provide much better performance.
*/
def groupBy[K](f: T => K, p: Partitioner)(implicit kt: ClassTag[K], ord: Ordering[K] = null)
    : RDD[(K, Iterable[T])] = withScope {
  val cleanF = sc.clean(f)
  this.map(t => (cleanF(t), t)).groupByKey(p)
}
- The simplest form takes a single function that extracts the grouping key from each element
- It returns an RDD of pairs: the first element is the extracted key, the second is an Iterable of all elements that mapped to that key, as the test program below demonstrates
import org.apache.spark.sql.SparkSession

object TestGroupBy {
  val ss = SparkSession.builder().master("local").appName("basic").getOrCreate()
  val sc = ss.sparkContext
  sc.setLogLevel("error")

  def main(args: Array[String]): Unit = {
    val rdd = sc.textFile("/home/missingli/IdeaProjects/SparkLearn/src/main/resources/sparkbasic.txt")
    val rdd2 = sc.textFile("/home/missingli/IdeaProjects/SparkLearn/src/main/resources/sparkbasic3.txt")
    // Union the two RDDs, then group each line by its first comma-separated field
    (rdd ++ rdd2).groupBy(r => r.split(",", -1)(0))
      .foreach(r => {
        println("-------------------")
        println(r._1) // the extracted key
        for (str <- r._2) { // every line that produced this key
          println(str)
        }
      })
  }
}
Test output
-------------------
4
4,6,s,b
4,6,s,b
-------------------
8
8,h,m,o
-------------------
6
6,q,w,e
6,q,w,e
-------------------
2
2,w,gd,h
-------------------
7
7,j,s,b
-------------------
5
5,h,d,o
5,h,d,o
-------------------
9
9,q,w,c
-------------------
3
3,h,r,x
3,h,r,x
-------------------
1
1,a,c,b
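The @note in the Scaladoc above warns that groupBy can be very expensive when all you need is a per-key aggregate. A hedged sketch of the recommended reduceByKey alternative, reusing rdd and rdd2 from the test program, which counts lines per key instead of materializing each group:

// reduceByKey combines values map-side before the shuffle, so far less
// data crosses the network than with groupBy
val counts = (rdd ++ rdd2)
  .map(r => (r.split(",", -1)(0), 1))
  .reduceByKey(_ + _)
counts.collect().foreach { case (k, c) => println(s"$k: $c") }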