Let's start with the syntax of these two functions:
lead(col,n,default) over()
Purpose: returns the value of col from the row n rows below (after) the current row within the window.
Argument 1: the column to take the value from.
Argument 2: the offset n, i.e. how many rows below; defaults to 1. This value must be a constant and cannot vary dynamically.
Argument 3: the default value, returned when the row n rows below does not exist (falls outside the window); if not specified, NULL is returned.
lag(col,n,default) over()
Purpose: returns the value of col from the row n rows above (before) the current row within the window; the exact opposite of lead().
Argument 1: the column to take the value from.
Argument 2: the offset n, i.e. how many rows above; defaults to 1. This value must be a constant and cannot vary dynamically.
Argument 3: the default value, returned when the row n rows above does not exist (falls outside the window); if not specified, NULL is returned.
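The semantics above can be sketched in a few lines. This is a minimal demo using Python's built-in sqlite3 module (SQLite 3.25+ supports window functions) as a stand-in for Hive/Spark SQL; the LEAD/LAG behavior shown, including the third default argument, matches the description:

```python
import sqlite3

# A toy table of three rows; SQLite's LEAD/LAG behave like Hive's here.
conn = sqlite3.connect(":memory:")
rows = conn.execute("""
    WITH t(id, val) AS (VALUES (1,'a'), (2,'b'), (3,'c'))
    SELECT id,
           val,
           LEAD(val, 1)         OVER (ORDER BY id) AS next_val,      -- NULL on the last row
           LEAD(val, 1, 'end')  OVER (ORDER BY id) AS next_or_dflt,  -- 3rd arg fills the gap
           LAG(val, 1, 'begin') OVER (ORDER BY id) AS prev_or_dflt   -- same idea, upward
    FROM t
""").fetchall()
for r in rows:
    print(r)
# (1, 'a', 'b', 'b', 'begin')
# (2, 'b', 'c', 'c', 'a')
# (3, 'c', None, 'end', 'b')
```

Note how the first row's LAG and the last row's LEAD fall outside the window: without the third argument you get NULL, with it you get the supplied default.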
Now let's look at the lead() window function in action:
Problem:
Compute each user's dwell time on every page.
Analysis:
Page dwell time: a user enters page A and then enters page B. If the time of entering A is date1 and the time of entering B is date2, then the dwell time on A is date2 - date1. In real production scenarios you may also need to account for factors such as capping gaps at 30 minutes.
First, let's see how to do this the ordinary way:
We need to find, for each page, the time the user left it by entering the next page. The first idea that comes to mind is to sort the rows, then self-join the sorted result, joining each row to the row for the same user's next page visit.
Example:
spark-sql> with test1 as
> (select '10001' as uid,'2021-08-10 10:18:31' as dateline,'url3' as page
> union all
> select '10001' as uid,'2021-08-10 10:01:26' as dateline,'url1' as page
> union all
> select '10001' as uid,'2021-08-10 10:09:48' as dateline,'url2' as page
> union all
> select '10002' as uid,'2021-08-10 09:38:31' as dateline,'url3' as page
> union all
> select '10002' as uid,'2021-08-10 09:21:26' as dateline,'url1' as page
> union all
> select '10002' as uid,'2021-08-10 09:19:48' as dateline,'url2' as page)
> select
> t1.uid,
> t1.dateline,
> t2.dateline as dateline_end,
> t1.page
> from
> (select
> uid,
> dateline,
> row_number() over(distribute by uid sort by dateline asc) as rank,
> page
> from test1) t1
>
> left join
> (select
> uid,
> dateline,
> row_number() over(distribute by uid sort by dateline asc) as rank,
> page
> from test1) t2
> on t1.uid=t2.uid and t1.rank+1=t2.rank;
10001 2021-08-10 10:01:26 2021-08-10 10:09:48 url1
10001 2021-08-10 10:09:48 2021-08-10 10:18:31 url2
10001 2021-08-10 10:18:31 NULL url3
10002 2021-08-10 09:19:48 2021-08-10 09:21:26 url2
10002 2021-08-10 09:21:26 2021-08-10 09:38:31 url1
10002 2021-08-10 09:38:31 NULL url3
Time taken: 23.37 seconds, Fetched 6 row(s)
From the result above we now have, on each row, the time the user entered a page (dateline) and the time they jumped to the next page (dateline_end); the difference between these two timestamps is exactly the page dwell time we want.
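As a quick check of that arithmetic, the dwell time for the first row of user 10001 can be computed directly. In Hive/Spark SQL you would typically write it as unix_timestamp(dateline_end) - unix_timestamp(dateline); here is the same subtraction in plain Python:

```python
from datetime import datetime

# First row for user 10001 from the result above:
# entered url1 at 10:01:26, jumped to url2 at 10:09:48.
fmt = "%Y-%m-%d %H:%M:%S"
date1 = datetime.strptime("2021-08-10 10:01:26", fmt)
date2 = datetime.strptime("2021-08-10 10:09:48", fmt)
dwell_seconds = int((date2 - date1).total_seconds())
print(dwell_seconds)  # 502 seconds, i.e. 8 minutes 22 seconds on url1
```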
Now let's see how to do the same thing with the lead() window function:
Example:
spark-sql> with test1 as
> (select '10001' as uid,'2021-08-10 10:18:31' as dateline,'url3' as page
> union all
> select '10001' as uid,'2021-08-10 10:01:26' as dateline,'url1' as page
> union all
> select '10001' as uid,'2021-08-10 10:09:48' as dateline,'url2' as page
> union all
> select '10002' as uid,'2021-08-10 09:38:31' as dateline,'url3' as page
> union all
> select '10002' as uid,'2021-08-10 09:21:26' as dateline,'url1' as page
> union all
> select '10002' as uid,'2021-08-10 09:19:48' as dateline,'url2' as page)
> select
> uid,
> dateline,
> lead(dateline,1) over(distribute by uid sort by dateline asc) as dateline_end,
> page
> from test1;
10001 2021-08-10 10:01:26 2021-08-10 10:09:48 url1
10001 2021-08-10 10:09:48 2021-08-10 10:18:31 url2
10001 2021-08-10 10:18:31 NULL url3
10002 2021-08-10 09:19:48 2021-08-10 09:21:26 url2
10002 2021-08-10 09:21:26 2021-08-10 09:38:31 url1
10002 2021-08-10 09:38:31 NULL url3
Time taken: 16.14 seconds, Fetched 6 row(s)
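To close the loop, here is a sketch that applies lead() and subtracts the timestamps in a single query, again using Python's sqlite3 as a stand-in for Hive/Spark SQL (in Hive you would use unix_timestamp() instead of strftime('%s', ...)). A production query would additionally cap or break sessions at gaps over 30 minutes, as noted in the analysis:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
rows = conn.execute("""
    WITH test1(uid, dateline, page) AS (VALUES
        ('10001', '2021-08-10 10:18:31', 'url3'),
        ('10001', '2021-08-10 10:01:26', 'url1'),
        ('10001', '2021-08-10 10:09:48', 'url2'),
        ('10002', '2021-08-10 09:38:31', 'url3'),
        ('10002', '2021-08-10 09:21:26', 'url1'),
        ('10002', '2021-08-10 09:19:48', 'url2'))
    SELECT uid, page,
           -- seconds until the same user's next page view;
           -- NULL on the last page (lead() falls outside the window)
           strftime('%s', LEAD(dateline) OVER w) - strftime('%s', dateline) AS dwell_seconds
    FROM test1
    WINDOW w AS (PARTITION BY uid ORDER BY dateline)
    ORDER BY uid, dateline
""").fetchall()
for r in rows:
    print(r)
# ('10001', 'url1', 502)
# ('10001', 'url2', 523)
# ('10001', 'url3', None)
# ('10002', 'url2', 98)
# ('10002', 'url1', 1025)
# ('10002', 'url3', None)
```

PARTITION BY uid / ORDER BY dateline plays the same role as distribute by uid sort by dateline in the spark-sql transcript above.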
Summary:
The lead() window function produces exactly the same result as the ordinary approach of row_number() sorting plus a self-join. But the row_number()-plus-self-join method is clearly much more cumbersome: it reads the data twice and adds an extra join, so it is not an ideal solution from a performance standpoint either.
For more applications of the row_number() function, see:
Applications of the Hive window function row_number()
The differences between the Hive window functions row_number(), rank(), and dense_rank()
Five deduplication techniques from Hive experts