Notes on rdd.aggregateByKey() (reposted)
# aTuple is the zero value each key starts from, e.g. (0, 0) for (sum, count)
rdd1 = rdd1.aggregateByKey(aTuple, lambda a, b: (a[0] + b, a[1] + 1),
                           lambda a, b: (a[0] + b[0], a[1] + b[1]))
The meaning of each a and b pair above is as follows (to help visualize what happens):
First lambda expression, for the within-partition reduction step:
a: a TUPLE that holds (runningSum, runningCount).
b: a SCALAR that holds the next value.
Second lambda expression, for the cross-partition reduction step:
a: a TUPLE that holds (runningSum, runningCount).
b: a TUPLE that holds (nextPartitionsSum, nextPartitionsCount).
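The two steps can be imitated in plain Python to watch how a and b change shape. This is only a sketch, not Spark's implementation: `aggregate_by_key_local` and the sample ratings are hypothetical, and each inner list stands in for one partition.

```python
# Hypothetical data: each inner list plays the role of one partition.
partitions = [
    [("Children", 4.0), ("Fantasy", 3.0), ("Children", 3.0)],
    [("Children", 5.0), ("Fantasy", 4.0)],
]

zero = (0, 0)  # (runningSum, runningCount)
seq_op = lambda a, b: (a[0] + b, a[1] + 1)         # a: tuple, b: scalar value
comb_op = lambda a, b: (a[0] + b[0], a[1] + b[1])  # a and b: both tuples


def aggregate_by_key_local(partitions, zero, seq_op, comb_op):
    """Plain-Python imitation of aggregateByKey (illustration only)."""
    # Step 1: within each partition, fold every value into a per-key accumulator.
    partials = []
    for part in partitions:
        acc = {}
        for key, value in part:
            acc[key] = seq_op(acc.get(key, zero), value)
        partials.append(acc)
    # Step 2: across partitions, merge the per-key accumulators.
    merged = {}
    for acc in partials:
        for key, partial in acc.items():
            merged[key] = comb_op(merged[key], partial) if key in merged else partial
    return merged


result = aggregate_by_key_local(partitions, zero, seq_op, comb_op)
print(result)  # {'Children': (12.0, 3), 'Fantasy': (7.0, 2)}
```

In the first step b is always a single rating. In the second step b is a whole (sum, count) tuple produced by another partition.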
sumcntRDD_combined = combined_rdd.aggregateByKey(
    (0, 0),
    lambda acc, rating: (acc[0] + rating, acc[1] + 1),
    lambda acc1, acc2: (acc1[0] + acc2[0], acc1[1] + acc2[1]))
# output: [('Children', (31426.5, 9208)), ('Fantasy', (41312.5, 11834)), ('Romance', (63552.0, 18124)),
genre_result = sumcntRDD_combined.mapValues(lambda value: value[0] / value[1])
print("The average rating for each genre is: ")
print(genre_result.take(5))
# output: The average rating for each genre is:
# [('Children', 3.412956125108601), ('Fantasy', 3.4910005070136894), ('Romance', 3.5065107040388437), ('Action', 3.447984331646809), ('Thriller', 3.4937055799183425)]
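Second approach: divide inside the cross-partition lambda. Note that this is not safe in general. The second lambda is expected to return the same kind of (sum, count) tuple it receives, so this version only yields an average when a key's values come from exactly two partitions.
sumcntRDD_combined = combined_rdd.aggregateByKey((0, 0), lambda acc, rating: (acc[0] + rating, acc[1] + 1), lambda acc1, acc2: (acc1[0] + acc2[0]) / (acc1[1] + acc2[1]))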
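To see why returning a scalar from the second lambda is fragile, the merge step can be replayed in plain Python. This is a sketch with made-up per-partition accumulators for a single key; `merge_partials` is a hypothetical helper that merges left to right, which is an assumption since Spark does not promise a merge order.

```python
bad_comb = lambda acc1, acc2: (acc1[0] + acc2[0]) / (acc1[1] + acc2[1])
good_comb = lambda acc1, acc2: (acc1[0] + acc2[0], acc1[1] + acc2[1])


def merge_partials(partials, comb_op):
    """Merge per-partition (sum, count) accumulators left to right."""
    merged = partials[0]
    for partial in partials[1:]:
        merged = comb_op(merged, partial)
    return merged


# Hypothetical accumulators for one key spread over 1, 2, or 3 partitions.
one = [(7.0, 2)]
two = [(7.0, 2), (5.0, 1)]
three = [(7.0, 2), (5.0, 1), (3.0, 1)]

print(merge_partials(one, bad_comb))  # (7.0, 2): combiner never runs, still a tuple
print(merge_partials(two, bad_comb))  # 4.0: looks like an average

failed = False
try:
    merge_partials(three, bad_comb)   # the float from the first merge gets indexed
except TypeError as exc:
    failed = True
    print("three partitions fail:", exc)

# Keeping the tuple through every merge and dividing once at the end works
# for any number of partitions.
total, count = merge_partials(three, good_comb)
average = total / count
print(average)  # 3.75
```

In Spark terms, this corresponds to keeping the tuple-returning second lambda and doing the division afterwards with mapValues, as in the first approach.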