Notes on rdd.aggregateByKey()


R_Gattuso



Tags: data mining, spark

Original link: https://www.codenong.com/29930110/


Reposted notes on rdd.aggregateByKey()

aTuple = (0, 0)  # zero value for each key: (runningSum, runningCount)
rdd1 = rdd1.aggregateByKey(aTuple, lambda a,b: (a[0] + b,    a[1] + 1),
                                   lambda a,b: (a[0] + b[0], a[1] + b[1]))

The following describes what each a and b pair above means, so that you can visualize what is happening:

First lambda expression, for the within-partition reduction step:
   a: a TUPLE that holds (runningSum, runningCount).
   b: a SCALAR that holds the next value.

Second lambda expression, for the cross-partition reduction step:
   a: a TUPLE that holds (runningSum, runningCount).
   b: a TUPLE that holds (nextPartitionsSum, nextPartitionsCount).
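The two steps above can be traced without a Spark cluster. The following is a minimal pure-Python sketch, not Spark's actual implementation: `simulate_aggregate_by_key` is a hypothetical helper, and the partition layout and sample values are invented for illustration.

```python
def simulate_aggregate_by_key(partitions, zero_value, seq_op, comb_op):
    """Mimic the two phases of aggregateByKey on a list of partitions.

    Each partition is a list of (key, value) pairs. This helper is
    hypothetical and only illustrates the data flow.
    """
    # Phase 1: within each partition, fold every value into a per-key
    # accumulator, starting from zero_value (first lambda: a=tuple, b=scalar).
    per_partition = []
    for part in partitions:
        acc = {}
        for key, value in part:
            acc[key] = seq_op(acc.get(key, zero_value), value)
        per_partition.append(acc)

    # Phase 2: merge the per-partition accumulators of each key
    # (second lambda: a=tuple, b=tuple).
    merged = {}
    for acc in per_partition:
        for key, partial in acc.items():
            merged[key] = comb_op(merged[key], partial) if key in merged else partial
    return merged


# Two made-up partitions of (key, value) pairs.
partitions = [
    [("a", 1.0), ("b", 4.0), ("a", 3.0)],
    [("a", 5.0), ("b", 2.0)],
]

result = simulate_aggregate_by_key(
    partitions,
    (0, 0),
    lambda a, b: (a[0] + b, a[1] + 1),        # within-partition step
    lambda a, b: (a[0] + b[0], a[1] + b[1]),  # cross-partition step
)
print(result)  # {'a': (9.0, 3), 'b': (6.0, 2)}
```

Key "a" ends up as (4.0, 2) in the first partition and (5.0, 1) in the second, and the second lambda adds those two tuples element by element.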

sumcntRDD_combined = combined_rdd.aggregateByKey(
    (0, 0),
    lambda acc, rating: (acc[0] + rating, acc[1] + 1),
    lambda acc1, acc2: (acc1[0] + acc2[0], acc1[1] + acc2[1]))

# Output (truncated): [('Children', (31426.5, 9208)), ('Fantasy', (41312.5, 11834)), ('Romance', (63552.0, 18124)), ...

genre_result = sumcntRDD_combined.mapValues(lambda value: value[0] / value[1])
print("The average rating for each genre is: ")
print(genre_result.take(5))
# Output:
# The average rating for each genre is:
# [('Children', 3.412956125108601), ('Fantasy', 3.4910005070136894), ('Romance', 3.5065107040388437), ('Action', 3.447984331646809), ('Thriller', 3.4937055799183425)]
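The mapValues step is just a per-key division of sum by count. As a quick sanity check, the sketch below repeats that division in plain Python on the three (sum, count) pairs printed earlier. The names `sum_count` and `averages` are illustrative only.

```python
# (sum, count) pairs copied from the aggregateByKey output shown earlier.
sum_count = {
    "Children": (31426.5, 9208),
    "Fantasy": (41312.5, 11834),
    "Romance": (63552.0, 18124),
}

# Same operation as mapValues(lambda value: value[0] / value[1]).
averages = {genre: total / count for genre, (total, count) in sum_count.items()}

for genre, avg in averages.items():
    print(genre, round(avg, 4))
```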

Second approach

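# The merge lambda must return a (sum, count) tuple like its inputs, so the division is chained afterwards with mapValues.
genre_result = combined_rdd.aggregateByKey((0,0), lambda acc,rating: (acc[0]+rating, acc[1]+1), lambda acc1, acc2: (acc1[0]+acc2[0], acc1[1]+acc2[1])).mapValues(lambda value: value[0]/value[1])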
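A minimal pure-Python sketch of why the division cannot live inside the merge lambda: its result is fed back in as an input whenever a key has more than two partial results, so it has to stay a (sum, count) tuple. The `partials` values and both function names are hypothetical, and `functools.reduce` stands in for Spark's merging of partition results.

```python
from functools import reduce

# Made-up per-partition (sum, count) accumulators for a single key.
partials = [(9.0, 2), (3.0, 1), (8.0, 1)]


def merge_tuples(acc1, acc2):
    """Merge that keeps the (sum, count) shape."""
    return (acc1[0] + acc2[0], acc1[1] + acc2[1])


def merge_with_division(acc1, acc2):
    """Merge that divides too early and returns a bare float."""
    return (acc1[0] + acc2[0]) / (acc1[1] + acc2[1])


# Correct: merge all tuples first, divide once at the end.
total = reduce(merge_tuples, partials)
correct_average = total[0] / total[1]
print(correct_average)  # 5.0

# Broken: the first merge yields a float, so the next merge cannot index it.
first = merge_with_division(partials[0], partials[1])
try:
    merge_with_division(first, partials[2])
    failure = None
except TypeError as exc:
    failure = exc
print(type(failure).__name__)  # TypeError
```

Dividing once after all tuples are merged also avoids computing an average of averages, which would weight partitions incorrectly.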