Spark SQL Case Study: Traffic Statistics

This post walks through a Spark SQL approach to aggregating user traffic data. By applying a window function and a conditional flag to the raw records, we can identify each user's consecutive sessions and compute, per session, the earliest start time, the latest end time, and the total traffic. Two consecutive records of the same user belong to the same session when the gap between the new record's start_time and the previous record's end_time is at most 600 seconds (10 minutes); a larger gap opens a new session. The approach suits scenarios that need real-time or batch processing of large volumes of online behavior data.

The input file data/liuliang.txt contains the following sample records:
uid,start_time,end_time,flow
1,2020-02-18 14:20:30,2020-02-18 14:46:30,20
1,2020-02-18 14:47:20,2020-02-18 15:20:30,30
1,2020-02-18 15:37:23,2020-02-18 16:05:26,40
1,2020-02-18 16:06:27,2020-02-18 17:20:49,50
1,2020-02-18 17:21:50,2020-02-18 18:03:27,60
2,2020-02-18 14:18:24,2020-02-18 15:01:40,20
2,2020-02-18 15:20:49,2020-02-18 15:30:24,30
2,2020-02-18 16:01:23,2020-02-18 16:40:32,40
2,2020-02-18 16:44:56,2020-02-18 17:40:52,50
3,2020-02-18 14:39:58,2020-02-18 15:35:53,20
3,2020-02-18 15:36:39,2020-02-18 15:24:54,30
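
The read in the code below uses only the header option, so every column (including flow) comes back as a string and sum(flow) relies on Spark's implicit string-to-number cast. As a minimal alternative sketch (not from the original post, placed where df is created), the same file can be read with an explicit schema that keeps the time columns as strings, so the SQL can still parse them with to_unix_timestamp, while making flow numeric:

import org.apache.spark.sql.types._

// Assumed column names match the header of data/liuliang.txt shown above.
val schema = StructType(Seq(
  StructField("uid", StringType),
  StructField("start_time", StringType), // left as string; parsed later by to_unix_timestamp
  StructField("end_time", StringType),
  StructField("flow", LongType)          // numeric, so sum(flow) needs no implicit cast
))

val typedDf = spark.read
  .option("header", "true")
  .schema(schema)
  .csv("data/liuliang.txt")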

SQL implementation in Scala:
import org.apache.spark.sql.{DataFrame, SparkSession}

object GuidDemoSql {
  def main(args: Array[String]): Unit = {
    val spark: SparkSession = SparkSession.builder()
      .appName("GuidDemoSql")
      .master("local[*]")
      .getOrCreate()

    // Read the CSV; with only the header option, every column is read as a string.
    val df: DataFrame = spark
      .read
      .option("header", "true")
      .csv("data/liuliang.txt")

    df.createTempView("v_flow")

    spark.sql(
      s"""
         |select
         |  uid,
         |  min(start_time) start_time,
         |  max(end_time) end_time,
         |  sum(flow) sum_flow
         |from (
         |  -- 3. running sum of the flags assigns each row a per-user session id
         |  select
         |    uid,
         |    start_time,
         |    end_time,
         |    flow,
         |    sum(flag) over(partition by uid order by start_time) sum_flag
         |  from (
         |    -- 2. flag = 1 when the gap to the previous record exceeds 600 seconds,
         |    --    i.e. the row opens a new session
         |    select
         |      uid,
         |      start_time,
         |      end_time,
         |      flow,
         |      if(to_unix_timestamp(start_time, 'yyyy-MM-dd HH:mm:ss') -
         |         to_unix_timestamp(lag_time, 'yyyy-MM-dd HH:mm:ss') > 600, 1, 0) flag
         |    from (
         |      -- 1. previous record's end_time; defaults to the row's own start_time
         |      --    for each user's first record, so its flag is 0
         |      select
         |        uid,
         |        start_time,
         |        end_time,
         |        flow,
         |        lag(end_time, 1, start_time) over(partition by uid order by start_time asc) lag_time
         |      from v_flow
         |    ) t1
         |  ) t2
         |) t3
         |group by uid, sum_flag
         |""".stripMargin
    ).show()

    spark.stop()
  }
}
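
The same sessionization can also be expressed with the DataFrame API instead of SQL. The following is a sketch (not from the original post) that assumes the df read above and would sit inside main after df is created; it uses lag, a running sum over a per-user window, and the same 600-second gap rule:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// Per-user window ordered by session start time.
val w = Window.partitionBy("uid").orderBy("start_time")

val result = df
  // Previous record's end_time; null for each user's first record.
  .withColumn("lag_time", lag(col("end_time"), 1).over(w))
  // Gap in seconds between this record's start and the previous record's end
  // (0 for the first record, via the coalesce with the row's own start_time).
  .withColumn("gap",
    unix_timestamp(col("start_time")) -
      unix_timestamp(coalesce(col("lag_time"), col("start_time"))))
  // A gap of more than 600 seconds starts a new session.
  .withColumn("flag", when(col("gap") > 600, 1).otherwise(0))
  // Running sum of flags is a per-user session id.
  .withColumn("session_id", sum("flag").over(w))
  .groupBy("uid", "session_id")
  .agg(
    min("start_time").as("start_time"),
    max("end_time").as("end_time"),
    sum("flow").as("sum_flow"))

result.show()

Because the window specifies an order by without an explicit frame, sum("flag").over(w) defaults to a running total from the first row of the partition up to the current row, which is exactly what turns the 0/1 flags into a session id.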
