Spark: accumulating history + overall totals + row-to-column pivot

Accumulating history in Spark relies mainly on window functions, while computing overall totals requires the rollup function.

1 Application scenarios:

1. We need to compute each user's total usage time (a running total over history).

2. The front-end display page needs to be queried across multiple dimensions, such as product, region, and so on.

3. The table header to be displayed looks like: product, 2015-04, 2015-05, 2015-06.

2 Raw data:

product_code |event_date |duration |

-------------|-----------|---------|

1438 |2016-05-13 |165 |

1438 |2016-05-14 |595 |

1438 |2016-05-15 |105 |

1629 |2016-05-13 |12340 |

1629 |2016-05-14 |13850 |

1629 |2016-05-15 |227 |

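To follow along, here is a minimal setup sketch (an assumption, not part of the original post: a Spark 1.x shell where sc and sqlContext already exist) that rebuilds this sample data as the df_userlogs_date DataFrame and the userlogs_date temp table used below:

// Sketch: rebuild the sample data; assumes an existing SparkContext `sc` and SQLContext `sqlContext`
import sqlContext.implicits._

val df_userlogs_date = sc.parallelize(Seq(
  ("1438", "2016-05-13", 165L),
  ("1438", "2016-05-14", 595L),
  ("1438", "2016-05-15", 105L),
  ("1629", "2016-05-13", 12340L),
  ("1629", "2016-05-14", 13850L),
  ("1629", "2016-05-15", 227L)
)).toDF("pcode", "event_date", "duration")

df_userlogs_date.registerTempTable("userlogs_date")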

3 Implementing the business scenarios

3.1 Scenario 1: accumulating history

As the source data shows, we already have each user's usage time per day. When aggregating, we want the 14th to also include the 13th, the 15th to include the 14th and 13th, and so on.

3.1.1 Spark SQL implementation

// Spark SQL: accumulate history with a window function
sqlContext.sql("""select pcode,event_date,sum(duration) over (partition by pcode order by event_date asc) as sum_duration
from userlogs_date""").show

+-----+----------+------------+

|pcode|event_date|sum_duration|

+-----+----------+------------+

| 1438|2016-05-13| 165|

| 1438|2016-05-14| 760|

| 1438|2016-05-15| 865|

| 1629|2016-05-13| 12340|

| 1629|2016-05-14| 26190|

| 1629|2016-05-15| 26417|

+-----+----------+------------+

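Note that on Spark 1.x (which the sqlContext / registerTempTable style here suggests), SQL window functions are generally only available through a HiveContext; a hedged setup sketch in case the plain SQLContext rejects the query:

// Hedged sketch: window functions in Spark 1.x SQL typically require a HiveContext
import org.apache.spark.sql.hive.HiveContext
val sqlContext = new HiveContext(sc)  // sc: the existing SparkContext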

3.1.2 DataFrame implementation

// Use the over function provided by Column and pass in the window spec
import org.apache.spark.sql.expressions._
import org.apache.spark.sql.functions._

val first_2_now_window = Window.partitionBy("pcode").orderBy("event_date")

df_userlogs_date.select(
  $"pcode",
  $"event_date",
  sum($"duration").over(first_2_now_window).as("sum_duration")
).show

+-----+----------+------------+

|pcode|event_date|sum_duration|

+-----+----------+------------+

| 1438|2016-05-13| 165|

| 1438|2016-05-14| 760|

| 1438|2016-05-15| 865|

| 1629|2016-05-13| 12340|

| 1629|2016-05-14| 26190|

| 1629|2016-05-15| 26417|

+-----+----------+------------+


3.1.3 Extension: accumulating over a range of rows

Real accumulation logic is far more complex than the above, for example accumulating the previous N days, or accumulating from N days before to N days after, and so on. Let's implement these below:

3.1.3.1 Accumulate the entire history up to the current row. The following four forms are exactly equivalent:

select pcode,event_date,sum(duration) over (partition by pcode order by event_date asc) as sum_duration from userlogs_date

select pcode,event_date,sum(duration) over (partition by pcode order by event_date asc rows between unbounded preceding and current row) as sum_duration from userlogs_date

Window.partitionBy("pcode").orderBy("event_date").rowsBetween(Long.MinValue, 0)

Window.partitionBy("pcode").orderBy("event_date")

3.1.3.2 Accumulate the previous N rows up to the current row, assuming N=3:

select pcode,event_date,sum(duration) over (partition by pcode order by event_date asc rows between 3 preceding and current row) as sum_duration from userlogs_date

Window.partitionBy("pcode").orderBy("event_date").rowsBetween(-3, 0)

3.1.3.3 Accumulate from N rows before to M rows after, assuming N=3 and M=5:

select pcode,event_date,sum(duration) over (partition by pcode order by event_date asc rows between 3 preceding and 5 following) as sum_duration from userlogs_date

Window.partitionBy("pcode").orderBy("event_date").rowsBetween(-3,5)

3.1.3.4 Accumulate all rows in the partition:

select pcode,event_date,sum(duration) over (partition by pcode order by event_date asc rows between unbounded preceding and unbounded following) as sum_duration from userlogs_date

Window.partitionBy("pcode").orderBy("event_date").rowsBetween(Long.MinValue, Long.MaxValue)
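Each of the bare Window specs above is applied exactly like first_2_now_window in 3.1.2. As a minimal, hedged sketch (the variable name prev3_next5_window is just for illustration), here is the 3-preceding / 5-following frame applied to df_userlogs_date:

// Sketch: apply a bounded row frame through the DataFrame API, reusing the names from 3.1.2
val prev3_next5_window = Window.partitionBy("pcode").orderBy("event_date").rowsBetween(-3, 5)

df_userlogs_date.select(
  $"pcode",
  $"event_date",
  sum($"duration").over(prev3_next5_window).as("sum_duration")
).show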

To summarize:

preceding: accumulates the previous N rows (within the partition). Use unbounded to start from the first row of the partition; N is the offset backward from the current row.

following: the opposite of preceding; accumulates the following N rows (within the partition). Use unbounded to run through the end of the partition; N is the offset forward from the current row.

current row: as the name implies, the current row itself, offset 0.

Note: the preceding-N, following-M, and current-row boundaries all include the row at that offset in the accumulation.

3.1.3.5 Measured results

Accumulating history: the current day and everything before it within the partition, form 1:

select pcode,event_date,sum(duration) over (partition by pcode order by event_date asc) as sum_duration from userlogs_date

+-----+----------+------------+

|pcode|event_date|sum_duration|

+-----+----------+------------+

| 1438|2016-05-13| 165|

| 1438|2016-05-14| 760|

| 1438|2016-05-15| 865|

| 1629|2016-05-13| 12340|

| 1629|2016-05-14| 26190|

| 1629|2016-05-15| 26417|

+-----+----------+------------+

Accumulating history: the current day and everything before it within the partition, form 2:

select pcode,event_date,sum(duration) over (partition by pcode order by event_date asc rows between unbounded preceding and current row) as sum_duration from userlogs_date

+-----+----------+------------+

|pcode|event_date|sum_duration|

+-----+----------+------------+

| 1438|2016-05-13| 165|

| 1438|2016-05-14| 760|

| 1438|2016-05-15| 865|

| 1629|2016-05-13| 12340|

| 1629|2016-05-14| 26190|

| 1629|2016-05-15| 26417|

+-----+----------+------------+

Accumulating today and yesterday:

select pcode,event_date,sum(duration) over (partition by pcode order by event_date asc rows between 1 preceding and current row) as sum_duration from userlogs_date

+-----+----------+------------+

|pcode|event_date|sum_duration|

+-----+----------+------------+

| 1438|2016-05-13| 165|

| 1438|2016-05-14| 760|

| 1438|2016-05-15| 700|

| 1629|2016-05-13| 12340|

| 1629|2016-05-14| 26190|

| 1629|2016-05-15| 14077|

+-----+----------+------------+

Accumulating yesterday, today, and tomorrow:

select pcode,event_date,sum(duration) over (partition by pcode order by event_date asc rows between 1 preceding and 1 following) as sum_duration from userlogs_date

+-----+----------+------------+

|pcode|event_date|sum_duration|

+-----+----------+------------+

| 1438|2016-05-13| 760|

| 1438|2016-05-14| 865|

| 1438|2016-05-15| 700|

| 1629|2016-05-13| 26190|

| 1629|2016-05-14| 26417|

| 1629|2016-05-15| 14077|

+-----+----------+------------+

Accumulating everything in the partition: the current day plus everything before and after it:

select pcode,event_date,sum(duration) over (partition by pcode order by event_date asc rows between unbounded preceding and unbounded following) as sum_duration from userlogs_date

+-----+----------+------------+

|pcode|event_date|sum_duration|

+-----+----------+------------+

| 1438|2016-05-13| 865|

| 1438|2016-05-14| 865|

| 1438|2016-05-15| 865|

| 1629|2016-05-13| 26417|

| 1629|2016-05-14| 26417|

| 1629|2016-05-15| 26417|

+-----+----------+------------+

3.2 Scenario 2: overall totals

3.2.1 Spark SQL implementation

// Spark SQL: use rollup to add the "all" subtotal rows
sqlContext.sql("""select pcode,event_date,sum(duration) as sum_duration
from userlogs_date_1
group by pcode,event_date with rollup
order by pcode,event_date""").show()

+-----+----------+------------+

|pcode|event_date|sum_duration|

+-----+----------+------------+

| null| null| 27282|

| 1438| null| 865|

| 1438|2016-05-13| 165|

| 1438|2016-05-14| 595|

| 1438|2016-05-15| 105|

| 1629| null| 26417|

| 1629|2016-05-13| 12340|

| 1629|2016-05-14| 13850|

| 1629|2016-05-15| 227|

+-----+----------+------------+

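The null values in pcode and event_date mark the subtotal rows that rollup adds. If, say, the front end only needs the per-product totals, here is a hedged sketch that wraps the same query and keeps just those rows:

// Sketch: keep only the per-pcode subtotal rows (event_date rolled up to null)
sqlContext.sql("""select pcode,event_date,sum(duration) as sum_duration
from userlogs_date_1
group by pcode,event_date with rollup""")
  .filter("event_date is null and pcode is not null")
  .show()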

3.2.2 DataFrame implementation

// Use the DataFrame rollup function for multi-dimensional "all" totals
df_userlogs_date.rollup($"pcode", $"event_date").agg(sum($"duration")).orderBy($"pcode", $"event_date").show

+-----+----------+-------------+

|pcode|event_date|sum(duration)|

+-----+----------+-------------+

| null| null| 27282|

| 1438| null| 865|

| 1438|2016-05-13| 165|

| 1438|2016-05-14| 595|

| 1438|2016-05-15| 105|

| 1629| null| 26417|

| 1629|2016-05-13| 12340|

| 1629|2016-05-14| 13850|

| 1629|2016-05-15| 227|

+-----+----------+-------------+

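Since the rollup output feeds a display page, one optional refinement is to relabel the nulls in the subtotal rows with something readable; a hedged sketch (the "ALL" label is an assumption about what the front end expects):

// Sketch: relabel the rolled-up nulls for display; the "ALL" label is an assumption
df_userlogs_date
  .rollup($"pcode", $"event_date")
  .agg(sum($"duration").as("sum_duration"))
  .select(
    coalesce($"pcode", lit("ALL")).as("pcode"),
    coalesce($"event_date", lit("ALL")).as("event_date"),
    $"sum_duration")
  .orderBy($"pcode", $"event_date")
  .show()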

3.3 Row-to-column transformation: pivot

pivot has no SQL syntax yet, so for now we use the DataFrame API.

val userlogs_date_all = sqlContext.sql("select dcode, pcode, event_date, sum(duration) as duration from userlogs group by dcode, pcode, event_date")

userlogs_date_all.registerTempTable("userlogs_date_all")

val dates = userlogs_date_all.select($"event_date").map(row => row.getAs[String]("event_date")).distinct().collect().toList

userlogs_date_all.groupBy($"dcode", $"pcode").pivot("event_date", dates).sum("duration").na.fill(0).show

+-----------------+-----+----------+----------+----------+----------+

| dcode|pcode|2016-05-26|2016-05-13|2016-05-14|2016-05-15|

+-----------------+-----+----------+----------+----------+----------+

| F2429186| 1438| 0| 0| 227| 0|

| AI2342441| 1438| 0| 0| 0| 345|

| A320018711| 1438| 0| 939| 0| 0|

| H2635817| 1438| 0| 522| 0| 0|

| D0288196| 1438| 0| 101| 0| 0|

| Y0242218| 1438| 0| 1036| 0| 0|

| H2392574| 1438| 0| 0| 689| 0|

| D2245588| 1438| 0| 0| 1| 0|

| Y2514906| 1438| 0| 0| 118| 4|

| H2540419| 1438| 0| 465| 242| 5|

| R2231926| 1438| 0| 0| 305| 0|

| H2684591| 1438| 0| 136| 0| 0|

| A2548470| 1438| 0| 412| 0| 0|

| GH000309| 1438| 0| 0| 0| 4|

| H2293216| 1438| 0| 0| 0| 534|

| R2170601| 1438| 0| 0| 0| 0|

|B2365238;B2559538| 1438| 0| 0| 0| 0|

| BQ005465| 1438| 0| 0| 642| 78|

| AH2180324| 1438| 0| 608| 146| 36|

| H0279306| 1438| 0| 490| 0| 0|

+-----------------+-----+----------+----------+----------+----------+

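Passing the distinct values explicitly to pivot (the dates list above) spares Spark a separate pass to discover them, but the resulting columns follow the order of that list; a hedged sketch that sorts the dates first so the date columns come out chronologically:

// Sketch: sort the pivot values so the date columns appear in chronological order
val sortedDates = dates.sorted

userlogs_date_all
  .groupBy($"dcode", $"pcode")
  .pivot("event_date", sortedDates)
  .sum("duration")
  .na.fill(0)
  .show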

Appendix

Below is the official API documentation for these two functions. org.apache.spark.sql.DataFrame.scala:


Create a multi-dimensional rollup for the current DataFrame using the specified columns, so we can run aggregation on them. See GroupedData for all the available aggregate functions.

This is a variant of rollup that can only group by existing columns using column names (i.e. cannot construct expressions).

// Compute the average for all numeric columns rolluped by department and group.
df.rollup("department", "group").avg()

// Compute the max age and average salary, rolluped by department and gender.
df.rollup($"department", $"gender").agg(Map("salary" -> "avg", "age" -> "max"))


def rollup(cols: Column*): GroupedData

Create a multi-dimensional rollup for the current DataFrame using the specified columns, so we can run aggregation on them. See GroupedData for all the available aggregate functions.

// Compute the average for all numeric columns rolluped by department and group.
df.rollup($"department", $"group").avg()

// Compute the max age and average salary, rolluped by department and gender.
df.rollup($"department", $"gender").agg(Map("salary" -> "avg", "age" -> "max"))

org.apache.spark.sql.Column.scala

def over(window: WindowSpec): Column

Define a windowing column.

val w = Window.partitionBy("name").orderBy("id")

df.select(

sum("price").over(w.rangeBetween(Long.MinValue, 2)),

avg("price").over(w.rowsBetween(0, 4))

)


posted on 2017-10-23 22:05 by xzc, category: hadoop
