Spark1.4发布,支持了窗口分析函数(window functions)。
在离线平台中,90%以上的离线分析任务都是使用Hive实现,其中必然会使用很多窗口分析函数,如果SparkSQL支持窗口分析函数,
加载数据:
2、窗口函数测试
| url | rate |
+-------+-------+
| url1 | 12 |
| url2 | 11 |
| url1 | 23 |
| url2 | 25 |
| url1 | 58 |
| url3 | 11 |
| url2 | 25 |
| url3 | 58 |
| url2 | 11 |
+-------+-------+
分组排序:
+-------+-------+----+
| url | rate | r |
+-------+-------+----+
| url1 | 58 | 1 |
| url1 | 23 | 2 |
| url1 | 12 | 3 |
| url2 | 25 | 1 |
| url2 | 25 | 2 |
| url2 | 11 | 3 |
| url2 | 11 | 4 |
| url3 | 58 | 1 |
| url3 | 11 | 2 |
+-------+-------+----+
分组统计sum
+-------+-------+-----+
| url | rate | r |
+-------+-------+-----+
| url1 | 12 | 93 |
| url1 | 23 | 93 |
| url1 | 58 | 93 |
| url2 | 11 | 72 |
| url2 | 25 | 72 |
| url2 | 25 | 72 |
| url2 | 11 | 72 |
| url3 | 11 | 69 |
| url3 | 58 | 69 |
+-------+-------+-----+
分组统计avg
+-------+-------+-------+
| url | rate | r |
+-------+-------+-------+
| url1 | 12 | 31.0 |
| url1 | 23 | 31.0 |
| url1 | 58 | 31.0 |
| url2 | 25 | 18.0 |
| url2 | 11 | 18.0 |
| url2 | 11 | 18.0 |
| url2 | 25 | 18.0 |
| url3 | 11 | 34.5 |
| url3 | 58 | 34.5 |
+-------+-------+-------+
分组统计count
+-------+-------+----+
| url | rate | r |
+-------+-------+----+
| url1 | 12 | 3 |
| url1 | 23 | 3 |
| url1 | 58 | 3 |
| url2 | 11 | 4 |
| url2 | 25 | 4 |
| url2 | 25 | 4 |
| url2 | 11 | 4 |
| url3 | 11 | 2 |
| url3 | 58 | 2 |
+-------+-------+----+
分组lag
+-------+-------+-------+
| url | rate | r |
+-------+-------+-------+
| url1 | 12 | NULL |
| url1 | 23 | 12 |
| url1 | 58 | 23 |
| url2 | 25 | NULL |
| url2 | 11 | 25 |
| url2 | 11 | 11 |
| url2 | 25 | 11 |
| url3 | 11 | NULL |
| url3 | 58 | 11 |
+-------+-------+-------+
在离线平台中,90%以上的离线分析任务都是使用Hive实现,其中必然会使用很多窗口分析函数,如果SparkSQL支持窗口分析函数,
那么对于后面Hive向SparkSQL中的迁移的工作量会大大降低,使用方式如下:
1、初始化数据
创建表
加载数据:
2、窗口函数测试
查询所有数据
+-------+-------+
| url | rate |
+-------+-------+
| url1 | 12 |
| url2 | 11 |
| url1 | 23 |
| url2 | 25 |
| url1 | 58 |
| url3 | 11 |
| url2 | 25 |
| url3 | 58 |
| url2 | 11 |
+-------+-------+
分组排序:
+-------+-------+----+
| url | rate | r |
+-------+-------+----+
| url1 | 58 | 1 |
| url1 | 23 | 2 |
| url1 | 12 | 3 |
| url2 | 25 | 1 |
| url2 | 25 | 2 |
| url2 | 11 | 3 |
| url2 | 11 | 4 |
| url3 | 58 | 1 |
| url3 | 11 | 2 |
+-------+-------+----+
分组统计sum
+-------+-------+-----+
| url | rate | r |
+-------+-------+-----+
| url1 | 12 | 93 |
| url1 | 23 | 93 |
| url1 | 58 | 93 |
| url2 | 11 | 72 |
| url2 | 25 | 72 |
| url2 | 25 | 72 |
| url2 | 11 | 72 |
| url3 | 11 | 69 |
| url3 | 58 | 69 |
+-------+-------+-----+
分组统计avg
+-------+-------+-------+
| url | rate | r |
+-------+-------+-------+
| url1 | 12 | 31.0 |
| url1 | 23 | 31.0 |
| url1 | 58 | 31.0 |
| url2 | 25 | 18.0 |
| url2 | 11 | 18.0 |
| url2 | 11 | 18.0 |
| url2 | 25 | 18.0 |
| url3 | 11 | 34.5 |
| url3 | 58 | 34.5 |
+-------+-------+-------+
分组统计count
+-------+-------+----+
| url | rate | r |
+-------+-------+----+
| url1 | 12 | 3 |
| url1 | 23 | 3 |
| url1 | 58 | 3 |
| url2 | 11 | 4 |
| url2 | 25 | 4 |
| url2 | 25 | 4 |
| url2 | 11 | 4 |
| url3 | 11 | 2 |
| url3 | 58 | 2 |
+-------+-------+----+
分组lag
+-------+-------+-------+
| url | rate | r |
+-------+-------+-------+
| url1 | 12 | NULL |
| url1 | 23 | 12 |
| url1 | 58 | 23 |
| url2 | 25 | NULL |
| url2 | 11 | 25 |
| url2 | 11 | 11 |
| url2 | 25 | 11 |
| url3 | 11 | NULL |
| url3 | 58 | 11 |
+-------+-------+-------+