SparkSQL Traffic Statistics Case Study

Data:

+---+-------------------+-------------------+----+
| id|          startTime|            endTime|flow|
+---+-------------------+-------------------+----+
|  1|2020-02-18 14:20:30|2020-02-18 14:46:30|  20|
|  1|2020-02-18 14:47:20|2020-02-18 15:20:30|  30|
|  1|2020-02-18 15:37:23|2020-02-18 16:05:26|  40|
|  1|2020-02-18 16:06:27|2020-02-18 17:20:49|  50|
|  1|2020-02-18 17:21:50|2020-02-18 18:03:27|  60|
|  2|2020-02-18 14:18:24|2020-02-18 15:01:40|  20|
|  2|2020-02-18 15:20:49|2020-02-18 15:30:24|  30|
|  2|2020-02-18 16:01:23|2020-02-18 16:40:32|  40|
|  2|2020-02-18 16:44:56|2020-02-18 17:40:52|  50|
|  3|2020-02-18 14:39:58|2020-02-18 15:35:53|  20|
|  3|2020-02-18 15:36:39|2020-02-18 16:24:54|  30|
+---+-------------------+-------------------+----+

Requirement: for each user, merge records that fall within 10 minutes of the previous record into a single record. For example, user 1's first two rows are only 50 seconds apart, so they collapse into one record from 14:20:30 to 15:20:30 with a combined flow of 50.

Code:
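The SQL steps below assume the sample data has been registered as a temp view named t_flow. A minimal sketch of that setup, borrowing the data/data.csv path and headerless column layout from the DSL example at the end (adjust both to your environment):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("SQLFlowCount") // hypothetical app name
  .master("local[*]")
  .getOrCreate()

// register the sample data as the t_flow view used by the queries below;
// assumes a headerless CSV laid out as id,startTime,endTime,flow
spark.read
  .option("inferSchema", "true") // so flow is read as a number, not a string
  .csv("data/data.csv")
  .toDF("id", "startTime", "endTime", "flow")
  .createOrReplaceTempView("t_flow")

// each step can then be executed with spark.sql("...").show()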

1. Partition by user and shift endTime down one row with LAG to form a new column

SELECT id, startTime, endTime, flow
	 	, LAG(endTime, 1, startTime) OVER(PARTITION BY id ORDER BY startTime) last_endTime
FROM t_flow

The first row of each partition has no previous row, so its last_endTime falls back to the row's own startTime (the third argument of LAG).

Result (reconstructed from the sample data; rows ordered by id and startTime):

+---+-------------------+-------------------+----+-------------------+
| id|          startTime|            endTime|flow|       last_endTime|
+---+-------------------+-------------------+----+-------------------+
|  1|2020-02-18 14:20:30|2020-02-18 14:46:30|  20|2020-02-18 14:20:30|
|  1|2020-02-18 14:47:20|2020-02-18 15:20:30|  30|2020-02-18 14:46:30|
|  1|2020-02-18 15:37:23|2020-02-18 16:05:26|  40|2020-02-18 15:20:30|
|  1|2020-02-18 16:06:27|2020-02-18 17:20:49|  50|2020-02-18 16:05:26|
|  1|2020-02-18 17:21:50|2020-02-18 18:03:27|  60|2020-02-18 17:20:49|
|  2|2020-02-18 14:18:24|2020-02-18 15:01:40|  20|2020-02-18 14:18:24|
|  2|2020-02-18 15:20:49|2020-02-18 15:30:24|  30|2020-02-18 15:01:40|
|  2|2020-02-18 16:01:23|2020-02-18 16:40:32|  40|2020-02-18 15:30:24|
|  2|2020-02-18 16:44:56|2020-02-18 17:40:52|  50|2020-02-18 16:40:32|
|  3|2020-02-18 14:39:58|2020-02-18 15:35:53|  20|2020-02-18 14:39:58|
|  3|2020-02-18 15:36:39|2020-02-18 16:24:54|  30|2020-02-18 15:35:53|
+---+-------------------+-------------------+----+-------------------+

2. Compute whether the gap between the current row's startTime and last_endTime is under 10 minutes (the merge condition): mark 0 if so, otherwise 1

SELECT id, startTime, endTime, flow
		, IF((to_unix_timestamp(startTime) - to_unix_timestamp(last_endTime))/60 < 10 , 0, 1) flag 
FROM (SELECT id, startTime, endTime, flow
	 	, LAG(endTime, 1, startTime) OVER(PARTITION BY id ORDER BY startTime) last_endTime
	  FROM t_flow) t1
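
Against the sample data, this should yield the following flag values (rows shown ordered by id and startTime):

+---+-------------------+-------------------+----+----+
| id|          startTime|            endTime|flow|flag|
+---+-------------------+-------------------+----+----+
|  1|2020-02-18 14:20:30|2020-02-18 14:46:30|  20|   0|
|  1|2020-02-18 14:47:20|2020-02-18 15:20:30|  30|   0|
|  1|2020-02-18 15:37:23|2020-02-18 16:05:26|  40|   1|
|  1|2020-02-18 16:06:27|2020-02-18 17:20:49|  50|   0|
|  1|2020-02-18 17:21:50|2020-02-18 18:03:27|  60|   0|
|  2|2020-02-18 14:18:24|2020-02-18 15:01:40|  20|   0|
|  2|2020-02-18 15:20:49|2020-02-18 15:30:24|  30|   1|
|  2|2020-02-18 16:01:23|2020-02-18 16:40:32|  40|   1|
|  2|2020-02-18 16:44:56|2020-02-18 17:40:52|  50|   0|
|  3|2020-02-18 14:39:58|2020-02-18 15:35:53|  20|   0|
|  3|2020-02-18 15:36:39|2020-02-18 16:24:54|  30|   0|
+---+-------------------+-------------------+----+----+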

 

3. Take a running sum of flag within each partition; rows that end up with the same sum(flag) value belong to the same merge group

SELECT t2.*, SUM(flag) OVER(PARTITION BY t2.id ORDER BY startTime) fid 
FROM(
	SELECT id, startTime, endTime, flow
			, IF((to_unix_timestamp(startTime) - to_unix_timestamp(last_endTime))/60 < 10 , 0, 1) flag 
	FROM (SELECT id, startTime, endTime, flow
		 	, LAG(endTime, 1, startTime) OVER(PARTITION BY id ORDER BY startTime) last_endTime
		  FROM t_flow) t1) t2 
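
The running sum converts the flags into group ids; with the sample data the fid column should come out as follows:

+---+-------------------+-------------------+----+----+---+
| id|          startTime|            endTime|flow|flag|fid|
+---+-------------------+-------------------+----+----+---+
|  1|2020-02-18 14:20:30|2020-02-18 14:46:30|  20|   0|  0|
|  1|2020-02-18 14:47:20|2020-02-18 15:20:30|  30|   0|  0|
|  1|2020-02-18 15:37:23|2020-02-18 16:05:26|  40|   1|  1|
|  1|2020-02-18 16:06:27|2020-02-18 17:20:49|  50|   0|  1|
|  1|2020-02-18 17:21:50|2020-02-18 18:03:27|  60|   0|  1|
|  2|2020-02-18 14:18:24|2020-02-18 15:01:40|  20|   0|  0|
|  2|2020-02-18 15:20:49|2020-02-18 15:30:24|  30|   1|  1|
|  2|2020-02-18 16:01:23|2020-02-18 16:40:32|  40|   1|  2|
|  2|2020-02-18 16:44:56|2020-02-18 17:40:52|  50|   0|  2|
|  3|2020-02-18 14:39:58|2020-02-18 15:35:53|  20|   0|  0|
|  3|2020-02-18 15:36:39|2020-02-18 16:24:54|  30|   0|  0|
+---+-------------------+-------------------+----+----+---+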

4. Group by id and fid, aggregate, and take the final result

SELECT id,MIN(startTime) startTime, MAX(endTime) endTime ,SUM(flow) totalFlow
FROM(
	SELECT t2.*,SUM(flag) OVER(PARTITION BY t2.id ORDER BY startTime) fid 
	FROM(
		SELECT id, startTime, endTime, flow
				, IF((to_unix_timestamp(startTime) - to_unix_timestamp(last_endTime))/60 < 10 , 0, 1) flag 
		FROM (SELECT id, startTime, endTime, flow
			 	, LAG(endTime, 1, startTime) OVER(PARTITION BY id ORDER BY startTime) last_endTime
			  FROM t_flow) t1) t2) t3 
GROUP BY id, fid
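
With the sample data, the final merged output should be:

+---+-------------------+-------------------+---------+
| id|          startTime|            endTime|totalFlow|
+---+-------------------+-------------------+---------+
|  1|2020-02-18 14:20:30|2020-02-18 15:20:30|       50|
|  1|2020-02-18 15:37:23|2020-02-18 18:03:27|      150|
|  2|2020-02-18 14:18:24|2020-02-18 15:01:40|       20|
|  2|2020-02-18 15:20:49|2020-02-18 15:30:24|       30|
|  2|2020-02-18 16:01:23|2020-02-18 17:40:52|       90|
|  3|2020-02-18 14:39:58|2020-02-18 16:24:54|       50|
+---+-------------------+-------------------+---------+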

 

DSL style:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.{DataFrame, SparkSession}

object DSLFlowCount {

  def main(args: Array[String]): Unit = {

    val spark: SparkSession = SparkSession.builder()
      .appName(this.getClass.getSimpleName)
      .master("local[*]")
      .getOrCreate()

    //create the DataFrame from the CSV sample data
    val df: DataFrame = spark
      .read
      .csv("data/data.csv").toDF("uid", "start_time", "end_time", "down_flow")

    import spark.implicits._
    import org.apache.spark.sql.functions._

    df.select(
      'uid,
      'start_time,
      'end_time,
      'down_flow,
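      // expr() is used because functions.lag(col, offset, default) only
      // accepts a literal default value, while this query needs the row's
      // own start_time column as the fallback for the first row in a partition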
      expr("LAG(end_time, 1, start_time)") over(Window.partitionBy("uid").orderBy($"start_time".asc)) as "lag_time"
    ).select(
      'uid,
      'start_time,
      'end_time,
      'down_flow,
      expr("IF((UNIX_TIMESTAMP(start_time) - UNIX_TIMESTAMP(lag_time)) / 60 > 10, 1, 0)") as "flag"
    ).select(
      'uid,
      'start_time,
      'end_time,
      'down_flow,
      sum('flag) over(Window.partitionBy("uid").orderBy("start_time")) as "fid"
    ).groupBy(
      "uid",
      "fid"
    ).agg(
      min("start_time") as "start_time",
      max("end_time").as("end_time"),
      sum("down_flow").as("total_down_flow")
    ).drop("fid")
      .show()
  }
}

 
