RDD grouping: how do I group by key and aggregate multiple fields with RDDs?

I am new to Apache Spark and Scala, and I am currently learning this framework and programming language for big data. For a given field in a sample file, I am trying to find the total of another field, the count of records, and the list of values from a third field. I tried this on my own, but as a beginner I don't think my Spark RDD approach is a good one.

Please find the sample data below (Customerid: Int, Orderid: Int, Amount: Float):

44,8602,37.19
35,5368,65.89
2,3391,40.64
47,6694,14.98
29,680,13.08
91,8900,24.59
70,3959,68.68
85,1733,28.53
53,9900,83.55
14,1505,4.32
51,3378,19.80
42,6926,57.77
2,4424,55.77
79,9291,33.17
50,3901,23.57
20,6633,6.49
15,6148,65.53
44,8331,99.19
5,3505,64.18
48,5539,32.42

My current code:

(sc.textFile("file://../customer-orders.csv")
  .map(x => x.split(","))
  .map(x => (x(0).toInt, x(1).toInt))
  .map { case (x, y) => (x, List(y)) }
  .reduceByKey(_ ++ _)
  .sortBy(_._1, true))
  .fullOuterJoin(sc.textFile("file://../customer-orders.csv")
    .map(x => x.split(","))
    .map(x => (x(0).toInt, x(2).toFloat))
    .reduceByKey((x, y) => (x + y))
    .sortBy(_._1, true))
  .fullOuterJoin(sc.textFile("file://../customer-orders.csv")
    .map(x => x.split(","))
    .map(x => (x(0).toInt))
    .map(x => (x, 1))
    .reduceByKey((x, y) => (x + y))
    .sortBy(_._1, true))
  .sortBy(_._1, true)
  .take(50)
  .foreach(println)

Got a result like this:

(49,(Some((Some(List(8558, 6986, 686....)),Some(4394.5996))),Some(96)))

Expecting result like:

customerid, (orderids,..,..,....), totalamount, number of orderids

Is there a better approach? I also tried combineByKey with the code below, but the println statements inside it do not print anything.

scala> val reduced = inputrdd.combineByKey(
     |   (mark) => {
     |     println(s"Create combiner -> ${mark}")
     |     (mark, 1)
     |   },
     |   (acc: (Int, Int), v) => {
     |     println(s"""Merge value : (${acc._1} + ${v}, ${acc._2} + 1)""")
     |     (acc._1 + v, acc._2 + 1)
     |   },
     |   (acc1: (Int, Int), acc2: (Int, Int)) => {
     |     println(s"""Merge Combiner : (${acc1._1} + ${acc2._1}, ${acc1._2} + ${acc2._2})""")
     |     (acc1._1 + acc2._1, acc1._2 + acc2._2)
     |   }
     | )
reduced: org.apache.spark.rdd.RDD[(String, (Int, Int))] = ShuffledRDD[27] at combineByKey at <console>:29

scala> reduced.collect()
res5: Array[(String, (Int, Int))] = Array((maths,(110,2)), (physics,(214,3)), (english,(65,1)))

I am using Spark 2.2.0, Scala 2.11.8, and Java 1.8 build 101.

Solution

This is much easier to solve using the newer DataFrame API. First read the csv file and add the column names:

val df = spark.read.csv("file://../customer-orders.csv").toDF("Customerid", "Orderid", "Amount")
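Note that spark.read.csv without a schema reads every column as a string. If you prefer typed columns from the start, a minimal sketch (using the same column names; the name dfTyped is just illustrative) is to pass an explicit schema when reading:

import org.apache.spark.sql.types._

// Illustrative: read with an explicit schema so the columns are Int / Int / Float
// instead of all strings.
val schema = StructType(Seq(
  StructField("Customerid", IntegerType),
  StructField("Orderid", IntegerType),
  StructField("Amount", FloatType)
))
val dfTyped = spark.read.schema(schema).csv("file://../customer-orders.csv")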

Then use groupBy and agg to make the aggregations (here you want collect_list, sum and count):

import org.apache.spark.sql.functions.{collect_list, count, sum}
import spark.implicits._   // for the $"colname" syntax

val df2 = df.groupBy("Customerid").agg(
  collect_list($"Orderid") as "Orderids",
  sum($"Amount") as "TotalAmount",
  count($"Orderid") as "NumberOfOrderIds"
)

Resulting dataframe using the provided input example:

+----------+------------+-----------+----------------+
|Customerid|    Orderids|TotalAmount|NumberOfOrderIds|
+----------+------------+-----------+----------------+
|        51|      [3378]|       19.8|               1|
|        15|      [6148]|      65.53|               1|
|        29|       [680]|      13.08|               1|
|        42|      [6926]|      57.77|               1|
|        85|      [1733]|      28.53|               1|
|        35|      [5368]|      65.89|               1|
|        47|      [6694]|      14.98|               1|
|         5|      [3505]|      64.18|               1|
|        70|      [3959]|      68.68|               1|
|        44|[8602, 8331]|     136.38|               2|
|        53|      [9900]|      83.55|               1|
|        48|      [5539]|      32.42|               1|
|        79|      [9291]|      33.17|               1|
|        20|      [6633]|       6.49|               1|
|        14|      [1505]|       4.32|               1|
|        91|      [8900]|      24.59|               1|
|         2|[3391, 4424]|      96.41|               2|
|        50|      [3901]|      23.57|               1|
+----------+------------+-----------+----------------+
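If you want the rows ordered by customer id, as in the sortBy calls of your RDD attempt, you can sort before showing. A small illustrative addition:

// Sorts lexicographically if Customerid was read as a string,
// numerically with the typed read sketched above.
df2.orderBy($"Customerid").show(50, truncate = false)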

If you want to work with the data as an RDD after these transformations, you can convert it afterwards:

val rdd = df2.as[(Int, Seq[Int], Float, Int)].rdd
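If that typed conversion complains about the column types (for example because the csv was read with all-string columns, or because count produces a Long rather than an Int), a sketch of an alternative is to go through the untyped Row API. The getAs types below assume the explicit schema shown earlier:

val rddRows = df2.rdd.map { row =>
  (row.getAs[Int]("Customerid"),           // Int with the explicit schema
   row.getAs[Seq[Int]]("Orderids"),        // collect_list produces an array column
   row.getAs[Double]("TotalAmount"),       // sum over a float column yields a double
   row.getAs[Long]("NumberOfOrderIds"))    // count always yields a long
}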

Of course, it is also possible to solve this using RDDs directly. Use aggregateByKey:

val rdd = spark.sparkContext
  .textFile("test.csv")
  .map(x => x.split(","))
  .map(x => (x(0).toInt, (x(1).toInt, x(2).toFloat)))

val res = rdd.aggregateByKey((Seq[Int](), 0.0, 0))(
  (acc, xs) => (acc._1 ++ Seq(xs._1), acc._2 + xs._2, acc._3 + 1),
  (acc1, acc2) => (acc1._1 ++ acc2._1, acc1._2 + acc2._2, acc1._3 + acc2._3))

This is harder to read but will give the same result as the dataframe approach above.
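To print the aggregateByKey result in roughly the shape you were expecting (customerid, (orderids, ...), totalamount, number of orderids), one illustrative way is:

// Sort by customer id and print one formatted line per customer.
res.sortByKey()
  .collect()
  .foreach { case (customerId, (orderIds, total, count)) =>
    println(s"$customerId, (${orderIds.mkString(", ")}), $total, $count")
  }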
