In Spark, running totals over history are mainly built with window functions, while the "all" grand totals require the rollup function.
1 Use cases:
1. We need each user's total usage time (a running total over history).
2. The front-end page needs to be queried along several dimensions, such as product, region, and so on.
3. The table to be displayed has headers like: product, 2015-04, 2015-05, 2015-06.
2 Raw data:
product_code |event_date |duration |
-------------|-----------|---------|
1438 |2016-05-13 |165 |
1438 |2016-05-14 |595 |
1438 |2016-05-15 |105 |
1629 |2016-05-13 |12340 |
1629 |2016-05-14 |13850 |
1629 |2016-05-15 |227 |
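The examples below query a temp table named userlogs_date and a DataFrame named df_userlogs_date with a pcode column (corresponding to product_code above), but their creation is never shown. A minimal setup sketch, assuming a Spark 1.x-style sqlContext and that product_code maps to pcode:
// Setup sketch (assumed, not from the original post): build the sample rows as a
// DataFrame and register it as the "userlogs_date" temp table used below.
import sqlContext.implicits._

case class UserlogsDate(pcode: String, event_date: String, duration: Long)

val df_userlogs_date = sc.parallelize(Seq(
  UserlogsDate("1438", "2016-05-13", 165),
  UserlogsDate("1438", "2016-05-14", 595),
  UserlogsDate("1438", "2016-05-15", 105),
  UserlogsDate("1629", "2016-05-13", 12340),
  UserlogsDate("1629", "2016-05-14", 13850),
  UserlogsDate("1629", "2016-05-15", 227)
)).toDF()

df_userlogs_date.registerTempTable("userlogs_date")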
3 Implementing the business scenarios
3.1 Scenario 1: accumulating history
As the raw data shows, we already have each user's usage time per day. When aggregating, we want the 14th to include the 13th, the 15th to include the 14th and the 13th, and so on.
3.1.1 spark-sql implementation
//spark sql: use a window function to accumulate history
sqlContext.sql("""select pcode,event_date,sum(duration) over (partition by pcode order by event_date asc) as sum_duration
from userlogs_date""").show()
+-----+----------+------------+
|pcode|event_date|sum_duration|
+-----+----------+------------+
| 1438|2016-05-13| 165|
| 1438|2016-05-14| 760|
| 1438|2016-05-15| 865|
| 1629|2016-05-13| 12340|
| 1629|2016-05-14| 26190|
| 1629|2016-05-15| 26417|
+-----+----------+------------+
3.1.2 DataFrame implementation
//use Column's over function and pass in the window specification
import org.apache.spark.sql.expressions._
val first_2_now_window = Window.partitionBy("pcode").orderBy("event_date")
df_userlogs_date.select(
$"pcode",
$"event_date",
sum($"duration").over(first_2_now_window).as("sum_duration")
).show()
+-----+----------+------------+
|pcode|event_date|sum_duration|
+-----+----------+------------+
| 1438|2016-05-13| 165|
| 1438|2016-05-14| 760|
| 1438|2016-05-15| 865|
| 1629|2016-05-13| 12340|
| 1629|2016-05-14| 26190|
| 1629|2016-05-15| 26417|
+-----+----------+------------+
3.1.3 Extension: accumulating over a time range
Real-world accumulation logic is far more complex than the above, e.g. accumulating the previous N days, or from N days before to M days after. Let's implement these:
3.1.3.1 Accumulate all history
select pcode,event_date,sum(duration) over (partition by pcode order by event_date asc) as sum_duration from userlogs_date
select pcode,event_date,sum(duration) over (partition by pcode order by event_date asc rows between unbounded preceding and current row) as sum_duration from userlogs_date
Window.partitionBy("pcode").orderBy("event_date").rowsBetween(Long.MinValue, 0)
Window.partitionBy("pcode").orderBy("event_date")
The four forms above are completely equivalent (a quick check is sketched below).
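Strictly speaking, the default frame when an order by is present is RANGE-based (unbounded preceding to current row), so the forms coincide as long as event_date is unique within each pcode, as it is in this data set. A minimal verification sketch, reusing df_userlogs_date from the setup above:
// Sketch: compute the default-frame and explicit ROWS-frame sums side by side;
// the filter should return 0 rows if the two forms really are equivalent.
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window

val wDefault  = Window.partitionBy("pcode").orderBy("event_date")
val wExplicit = Window.partitionBy("pcode").orderBy("event_date").rowsBetween(Long.MinValue, 0)

df_userlogs_date
  .select(
    $"pcode", $"event_date",
    sum($"duration").over(wDefault).as("s_default"),
    sum($"duration").over(wExplicit).as("s_explicit")
  )
  .filter("s_default <> s_explicit")
  .count()   // expected: 0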
3.1.3.2 Accumulate the previous N days, assuming N = 3
select pcode,event_date,sum(duration) over (partition by pcode order by event_date asc rows between 3 preceding and current row) as sum_duration from userlogs_date
Window.partitionBy("pcode").orderBy("event_date").rowsBetween(-3, 0)
3.1.3.3 Accumulate from N days before to M days after, assuming N = 3 and M = 5
select pcode,event_date,sum(duration) over (partition by pcode order by event_date asc rows between 3 preceding and 5 following) as sum_duration from userlogs_date
Window.partitionBy("pcode").orderBy("event_date").rowsBetween(-3, 5)
3.1.3.4 Accumulate all rows in the partition
select pcode,event_date,sum(duration) over (partition by pcode order by event_date asc rows between unbounded preceding and unbounded following) as sum_duration from userlogs_date
Window.partitionBy("pcode").orderBy("event_date").rowsBetween(Long.MinValue, Long.MaxValue)
To summarize:
preceding: accumulates the N rows before the current row (within the partition). To start from the first row of the partition, use unbounded. N is the offset backwards from the current row.
following: the opposite of preceding; accumulates the N rows after the current row (within the partition). To run to the end of the partition, use unbounded. N is the offset forwards from the current row.
current row: as the name says, the current row, offset 0.
Note: the "N preceding", "M following", and current row boundaries all include the row at that offset.
3.1.3.5 Measured results
Accumulate history: the current day and everything before it within the partition, form 1:
select pcode,event_date,sum(duration) over (partition by pcode order by event_date asc) as sum_duration from userlogs_date
+-----+----------+------------+
|pcode|event_date|sum_duration|
+-----+----------+------------+
| 1438|2016-05-13| 165|
| 1438|2016-05-14| 760|
| 1438|2016-05-15| 865|
| 1629|2016-05-13| 12340|
| 1629|2016-05-14| 26190|
| 1629|2016-05-15| 26417|
+-----+----------+------------+
Accumulate history: the current day and everything before it within the partition, form 2:
select pcode,event_date,sum(duration) over (partition by pcode order by event_date asc rows between unbounded preceding and current row) as sum_duration from userlogs_date
+-----+----------+------------+
|pcode|event_date|sum_duration|
+-----+----------+------------+
| 1438|2016-05-13| 165|
| 1438|2016-05-14| 760|
| 1438|2016-05-15| 865|
| 1629|2016-05-13| 12340|
| 1629|2016-05-14| 26190|
| 1629|2016-05-15| 26417|
+-----+----------+------------+
Accumulate today and yesterday:
select pcode,event_date,sum(duration) over (partition by pcode order by event_date asc rows between 1 preceding and current row) as sum_duration from userlogs_date
+-----+----------+------------+
|pcode|event_date|sum_duration|
+-----+----------+------------+
| 1438|2016-05-13| 165|
| 1438|2016-05-14| 760|
| 1438|2016-05-15| 700|
| 1629|2016-05-13| 12340|
| 1629|2016-05-14| 26190|
| 1629|2016-05-15| 14077|
+-----+----------+------------+
Accumulate yesterday, today and tomorrow:
select pcode,event_date,sum(duration) over (partition by pcode order by event_date asc rows between 1 preceding and 1 following) as sum_duration from userlogs_date
+-----+----------+------------+
|pcode|event_date|sum_duration|
+-----+----------+------------+
| 1438|2016-05-13| 760|
| 1438|2016-05-14| 865|
| 1438|2016-05-15| 700|
| 1629|2016-05-13| 26190|
| 1629|2016-05-14| 26417|
| 1629|2016-05-15| 14077|
+-----+----------+------------+
Accumulate everything in the partition: the current day plus everything before and after it:
select pcode,event_date,sum(duration) over (partition by pcode order by event_date asc rows between unbounded preceding and unbounded following) as sum_duration from userlogs_date
+-----+----------+------------+
|pcode|event_date|sum_duration|
+-----+----------+------------+
| 1438|2016-05-13| 865|
| 1438|2016-05-14| 865|
| 1438|2016-05-15| 865|
| 1629|2016-05-13| 26417|
| 1629|2016-05-14| 26417|
| 1629|2016-05-15| 26417|
+-----+----------+------------+
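One caveat the examples above gloss over: ROWS-based frames count physical rows, so "3 preceding" only means "the previous 3 calendar days" if every pcode has exactly one row per day with no gaps. If days can be missing, a RANGE frame over a numeric day index is one way to get true calendar-day windows. A sketch of that approach (an assumption, not from the original post):
// Sketch: a RANGE frame over "days since epoch" accumulates the last 3 calendar
// days plus the current day, even when some days have no rows.
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window

val byDay = Window
  .partitionBy("pcode")
  .orderBy(datediff($"event_date", lit("1970-01-01")))  // numeric ordering key: days since epoch
  .rangeBetween(-3, 0)

df_userlogs_date
  .select($"pcode", $"event_date", sum($"duration").over(byDay).as("sum_duration_3d"))
  .show()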
3.2 Scenario 2: totals over all values
3.2.1 spark sql implementation
//spark sql: use rollup to add the "all" totals
sqlContext.sql("""select pcode,event_date,sum(duration) as sum_duration
from userlogs_date_1
group by pcode,event_date with rollup
order by pcode,event_date""").show()
+-----+----------+------------+
|pcode|event_date|sum_duration|
+-----+----------+------------+
| null| null| 27282|
| 1438| null| 865|
| 1438|2016-05-13| 165|
| 1438|2016-05-14| 595|
| 1438|2016-05-15| 105|
| 1629| null| 26417|
| 1629|2016-05-13| 12340|
| 1629|2016-05-14| 13850|
| 1629|2016-05-15| 227|
+-----+----------+------------+
3.2.2 DataFrame implementation
//use the DataFrame rollup function for multi-dimensional "all" totals
df_userlogs_date.rollup($"pcode", $"event_date").agg(sum($"duration")).orderBy($"pcode", $"event_date").show()
+-----+----------+-------------+
|pcode|event_date|sum(duration)|
+-----+----------+-------------+
| null| null| 27282|
| 1438| null| 865|
| 1438|2016-05-13| 165|
| 1438|2016-05-14| 595|
| 1438|2016-05-15| 105|
| 1629| null| 26417|
| 1629|2016-05-13| 12340|
| 1629|2016-05-14| 13850|
| 1629|2016-05-15| 227|
+-----+----------+-------------+
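For front-end display, the null markers that rollup emits for the subtotal rows can be awkward. A small follow-up sketch (assumed, not from the original post; it also assumes pcode and event_date are string columns) that relabels them as "ALL":
// Sketch: relabel rollup's null subtotal markers as "ALL" for display.
// Caveat: this cannot tell a subtotal row apart from a genuine null in the data;
// that distinction would need grouping()/grouping_id() instead.
import org.apache.spark.sql.functions._

df_userlogs_date
  .rollup($"pcode", $"event_date")
  .agg(sum($"duration").as("sum_duration"))
  .select(
    coalesce($"pcode", lit("ALL")).as("pcode"),
    coalesce($"event_date", lit("ALL")).as("event_date"),
    $"sum_duration"
  )
  .orderBy($"pcode", $"event_date")
  .show()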
3.3 Rows to columns -> pivot
pivot has no SQL syntax yet, so use the DataFrame API for now.
val userlogs_date_all = sqlContext.sql("select dcode, pcode, event_date, sum(duration) as duration from userlogs group by dcode, pcode, event_date")
userlogs_date_all.registerTempTable("userlogs_date_all")
val dates = userlogs_date_all.select($"event_date").map(row => row.getAs[String]("event_date")).distinct().collect().toList
userlogs_date_all.groupBy($"dcode", $"pcode").pivot("event_date", dates).sum("duration").na.fill(0).show()
+-----------------+-----+----------+----------+----------+----------+
| dcode|pcode|2016-05-26|2016-05-13|2016-05-14|2016-05-15|
+-----------------+-----+----------+----------+----------+----------+
| F2429186| 1438| 0| 0| 227| 0|
| AI2342441| 1438| 0| 0| 0| 345|
| A320018711| 1438| 0| 939| 0| 0|
| H2635817| 1438| 0| 522| 0| 0|
| D0288196| 1438| 0| 101| 0| 0|
| Y0242218| 1438| 0| 1036| 0| 0|
| H2392574| 1438| 0| 0| 689| 0|
| D2245588| 1438| 0| 0| 1| 0|
| Y2514906| 1438| 0| 0| 118| 4|
| H2540419| 1438| 0| 465| 242| 5|
| R2231926| 1438| 0| 0| 305| 0|
| H2684591| 1438| 0| 136| 0| 0|
| A2548470| 1438| 0| 412| 0| 0|
| GH000309| 1438| 0| 0| 0| 4|
| H2293216| 1438| 0| 0| 0| 534|
| R2170601| 1438| 0| 0| 0| 0|
|B2365238;B2559538| 1438| 0| 0| 0| 0|
| BQ005465| 1438| 0| 0| 642| 78|
| AH2180324| 1438| 0| 608| 146| 36|
| H0279306| 1438| 0| 490| 0| 0|
+-----------------+-----+----------+----------+----------+----------+
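Note that the pivoted date columns above come out in no particular order (2016-05-26 appears first). Since the front-end table header from use case 3 should be chronological, one small tweak is to sort the collected dates before passing them to pivot; a sketch reusing the names above:
// Sketch: sort the distinct dates so the pivoted columns appear chronologically.
val sortedDates = userlogs_date_all
  .select($"event_date")
  .map(row => row.getAs[String]("event_date"))
  .distinct()
  .collect()
  .toList
  .sorted

userlogs_date_all
  .groupBy($"dcode", $"pcode")
  .pivot("event_date", sortedDates)
  .sum("duration")
  .na.fill(0)
  .show()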
Appendix
Below is the official API documentation for the two functions, from org.apache.spark.sql.DataFrame:
def rollup(col1: String, cols: String*): GroupedData
Create a multi-dimensional rollup for the current DataFrame using the specified columns, so we can run aggregation on them. See GroupedData for all the available aggregate functions.
This is a variant of rollup that can only group by existing columns using column names (i.e. cannot construct expressions).
// Compute the average for all numeric columns rolluped by department and group.
df.rollup("department", "group").avg()
// Compute the max age and average salary, rolluped by department and gender.
df.rollup($"department", $"gender").agg(Map("salary" -> "avg", "age" -> "max"))
def rollup(cols: Column*): GroupedData
Create a multi-dimensional rollup for the current DataFrame using the specified columns, so we can run aggregation on them. See GroupedData for all the available aggregate functions.
// Compute the average for all numeric columns rolluped by department and group.
df.rollup($"department", $"group").avg()
// Compute the max age and average salary, rolluped by department and gender.
df.rollup($"department", $"gender").agg(Map("salary" -> "avg", "age" -> "max"))
org.apache.spark.sql.Column.scala
def over(window: WindowSpec): Column
Define a windowing column.
val w = Window.partitionBy("name").orderBy("id")
df.select(
sum("price").over(w.rangeBetween(Long.MinValue, 2)),
avg("price").over(w.rowsBetween(0, 4))
)
posted on 2017-10-23 22:05 by xzc, filed under: hadoop