What underlies all the ByKey operations:
You can look at the official example code for combineByKey:
def combineByKey(self, createCombiner, mergeValue, mergeCombiners,
                 numPartitions=None, partitionFunc=portable_hash):
    """
    Generic function to combine the elements for each key using a custom
    set of aggregation functions.

    Turns an RDD[(K, V)] into a result of type RDD[(K, C)], for a "combined
    type" C.

    Users provide three functions:

        - `createCombiner`, which turns a V into a C (e.g., creates
          a one-element list)
        - `mergeValue`, to merge a V into a C (e.g., adds it to the end of
          a list)
        - `mergeCombiners`, to combine two C's into a single one (e.g., merges
          the lists)

    To avoid memory allocation, both mergeValue and mergeCombiners are allowed to
    modify and return their first argument instead of creating a new C.

    In addition, users can control the partitioning of the output RDD.

    Notes
    -----
    V and C can be different -- for example, one might group an RDD of type
    (Int, Int) into an RDD of type (Int, List[Int]).

    Examples
    --------
    >>> x = sc.parallelize([("a", 1), ("b", 1), ("a", 2)])
    >>> def to_list(a):
    ...     return [a]
    ...
    >>> def append(a, b):
    ...     a.append(b)
    ...     return a
    ...
    >>> def extend(a, b):
    ...     a.extend(b)
    ...     return a
    ...
    >>> sorted(x.combineByKey(to_list, append, extend).collect())
    [('a', [1, 2]), ('b', [1])]
    """
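As the official source shows, combineByKey takes three functions:
The first function (createCombiner) turns the first value seen for a key within a partition into the initial combiner; in my example that is [value, 1].
The second function (mergeValue) works within a partition: it merges every further value that has the same key into that key's combiner (in my example, adding it to the running sum and incrementing the count).
The third function (mergeCombiners) works across partitions: it merges the combiners that different partitions built for the same key (in my example, adding up the sums and the counts).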
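The division of labour between the three functions can be sketched without a cluster. The following is a simplified pure-Python model, not Spark's actual implementation: the helper name `simulate_combine_by_key` and the two-partition layout are assumptions made for illustration, while `to_list`, `append`, and `extend` are taken from the official docstring example.

```python
def simulate_combine_by_key(partitions, create_combiner, merge_value, merge_combiners):
    """Hypothetical model of combineByKey over a list of partitions (not Spark code)."""
    # Step 1: combine within each partition.
    per_partition = []
    for partition in partitions:
        combiners = {}
        for key, value in partition:
            if key not in combiners:
                # First value for this key in this partition: build a combiner.
                combiners[key] = create_combiner(value)
            else:
                # Later values for the same key: merge them into the combiner.
                combiners[key] = merge_value(combiners[key], value)
        per_partition.append(combiners)

    # Step 2: merge the per-partition combiners key by key.
    merged = {}
    for combiners in per_partition:
        for key, combiner in combiners.items():
            if key not in merged:
                merged[key] = combiner
            else:
                merged[key] = merge_combiners(merged[key], combiner)
    return sorted(merged.items())


# The three functions from the official docstring example.
def to_list(a):
    return [a]

def append(a, b):
    a.append(b)
    return a

def extend(a, b):
    a.extend(b)
    return a


# Assumed layout: the official sample data split over two partitions.
partitions = [[("a", 1), ("b", 1)], [("a", 2)]]
result = simulate_combine_by_key(partitions, to_list, append, extend)
print(result)  # [('a', [1, 2]), ('b', [1])]
```

In this model `to_list` runs three times (once for each key's first appearance in a partition), `append` never runs because no partition holds a key twice, and `extend` runs once to join the two lists for "a". A different partition layout would change which function does the work, but not the final result.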
If the official example is hard to follow, you can look at my example:
# Turns a V into a C (e.g., creates a one-element list).
# Called with the value only (never the key) the first time a key is seen in a partition.
def createCombiner(value):
    return [value, 1]

# Merges a V into a C (e.g., adds it to the end of a list).
# Works within a partition. x is the combiner built so far: ('a', 12) ==> [12, 1];
# y is the next value with the same key: ('a', 3) ==> y = 3.
# x[0] + y adds that value to the running sum for the key; x[1] + 1 updates the count.
def mergeValue(x, y):
    return [x[0] + y, x[1] + 1]

# Combines two C's into a single one (e.g., merges the lists).
# Works across partitions: adds up the sums and the counts built for the same key.
def mergeCombiners(x, y):
    return [x[0] + y[0], x[1] + y[1]]

# Call combineByKey only after the three functions have been defined.
rdd2 = rdd1.combineByKey(createCombiner, mergeValue, mergeCombiners)
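To make the flow concrete, here is a hand trace of the three functions in plain Python, with no Spark involved. The pairs ('a', 12) and ('a', 3) come from the comments above; the second partition holding ('a', 5) is an assumption added only so that mergeCombiners has something to merge.

```python
def createCombiner(value):
    return [value, 1]

def mergeValue(x, y):
    return [x[0] + y, x[1] + 1]

def mergeCombiners(x, y):
    return [x[0] + y[0], x[1] + y[1]]

# Partition 0 (assumed) holds ('a', 12) and ('a', 3).
p0 = createCombiner(12)   # first 'a' in this partition
p0 = mergeValue(p0, 3)    # second 'a' in the same partition

# Partition 1 (assumed) holds ('a', 5).
p1 = createCombiner(5)

# The two partition results for 'a' are merged at the end.
total = mergeCombiners(p0, p1)
print(p0, p1, total)  # [15, 2] [5, 1] [20, 3]
```

Each key therefore ends up with a [sum, count] pair. If a per-key average is the goal (the example does not say so), one possible follow-up would be `rdd2.mapValues(lambda c: c[0] / c[1])`.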