This operator first groups the records by key and then counts the elements in each group.
Notes:
1. Only an RDD whose elements are (K, V) pairs can call this operator.
2. This operator is an action: it triggers a job that runs on the Executors; only the final result is brought back to the Driver.
3. The result is collected into Driver memory as a Map[K, Long].
So when the result set is very large, it is recommended to use
rdd.mapValues(_ => 1L).reduceByKey(_ + _) instead,
which returns an RDD of (K, Long) pairs rather than a local Map.
Source snippet:
/**
* Count the number of elements for each key, collecting the results to a local Map.
*
* @note This method should only be used if the resulting map is expected to be small, as
* the whole thing is loaded into the driver's memory.
* To handle very large results, consider using rdd.mapValues(_ => 1L).reduceByKey(_ + _), which
* returns an RDD[T, Long] instead of a map.
*/
def countByKey(): Map[K, Long] = self.withScope {
  self.mapValues(_ => 1L).reduceByKey(_ + _).collect().toMap
}
Code in action:
val rdd2: RDD[(String, Int)] = sc.parallelize(List(
  ("zhangsan", 18),
  ("zhangsan", 19),
  ("lisi", 20),
  ("lisi", 20),
  ("wangwu", 18)
), 3)
val countbykey: collection.Map[String, Long] = rdd2.countByKey()
println("countbykey = " + countbykey)
Output:
countbykey = Map(zhangsan -> 2, wangwu -> 1, lisi -> 2)
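For large result sets, the recommended alternative keeps the counts distributed instead of collecting them into Driver memory. A minimal sketch, reusing the sc and rdd2 from the example above:

// This is the same aggregation countByKey performs internally, but without
// the final collect(): the per-key counts stay distributed as an RDD.
val counts: RDD[(String, Long)] = rdd2
  .mapValues(_ => 1L)   // replace each value with a 1L marker
  .reduceByKey(_ + _)   // sum the markers per key on the Executors

The counts RDD can then be transformed further or written out in parallel (for example with saveAsTextFile) instead of being pulled into the Driver as a single Map.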