问题导读 1.LAG功能是什么? 2.LEAD与LAG功能有什么相似的地方那个? 3.FIRST_VALUE与LAST_VALUE分别完成什么功能?
继续学习这四个分析函数(LAG,LEAD,FIRST_VALUE,LAST_VALUE)。注意: 这几个函数不支持WINDOW子句。
Hive版本为 apache-hive-0.13.1
数据准备:
cookie1,2019-04-10 10:00:02,url2 cookie1,2019-04-10 10:00:00,url1 cookie1,2019-04-10 10:03:04,1url3 cookie1,2019-04-10 10:50:05,url6 cookie1,2019-04-10 11:00:00,url7 cookie1,2019-04-10 10:10:00,url4 cookie1,2019-04-10 10:50:01,url5 cookie2,2019-04-10 10:00:02,url22 cookie2,2019-04-10 10:00:00,url11 cookie2,2019-04-10 10:03:04,1url33 cookie2,2019-04-10 10:50:05,url66 cookie2,2019-04-10 11:00:00,url77 cookie2,2019-04-10 10:10:00,url44 cookie2,2019-04-10 10:50:01,url55
CREATE EXTERNAL TABLE zcg1234 ( cookieid string, createtime string, --页面访问时间 url STRING --被访问页面 ) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' stored as textfile location '/tmp/zcg1234/';
hive> select * from zcg1234; OK cookie1 2019-04-10 10:00:02 url2 cookie1 2019-04-10 10:00:00 url1 cookie1 2019-04-10 10:03:04 1url3 cookie1 2019-04-10 10:50:05 url6 cookie1 2019-04-10 11:00:00 url7 cookie1 2019-04-10 10:10:00 url4 cookie1 2019-04-10 10:50:01 url5 cookie2 2019-04-10 10:00:02 url22 cookie2 2019-04-10 10:00:00 url11 cookie2 2019-04-10 10:03:04 1url33 cookie2 2019-04-10 10:50:05 url66 cookie2 2019-04-10 11:00:00 url77 cookie2 2019-04-10 10:10:00 url44 cookie2 2019-04-10 10:50:01 url55
LAG LAG(col,n,DEFAULT) 用于统计窗口内往上第n行值 第一个参数为列名,第二个参数为往上第n行(可选,默认为1),第三个参数为默认值(当往上第n行为NULL时候,取默认值,如不指定,则为NULL)
SELECT cookieid, createtime, url, ROW_NUMBER() OVER(PARTITION BY cookieid ORDER BY createtime) AS rn, LAG(createtime,1,'1970-01-01 00:00:00') OVER(PARTITION BY cookieid ORDER BY createtime) AS last_1_time, LAG(createtime,2) OVER(PARTITION BY cookieid ORDER BY createtime) AS last_2_time FROM zcg1234;
cookieid createtime url rn last_1_time last_2_time