第一题:
a_test、b_test、c_test三个hive表 每个表中都只有一列int类型且列名相同,求三个表中互不重复的数 a.txt b.txt c.txt 1 2 1 2 3 2 3 11 3 4 12 11 5 14 5 6 15 6 7 16 7 8 18 8 9 35 20 6 30 7 40 8 create table a_test( id int ); load data local inpath './data/a.txt' overwrite into table a_test; create table b_test( id int ); load data local inpath './data/b.txt' overwrite into table b_test; create table c_test( id int ); load data local inpath './data/c.txt' overwrite into table c_test;
答案粘贴处:
分析:求三个表中互不重复的数 使用的是union all 进行连接变成总再进行查询 三张表联合成一张总表,在外层再套一张表进行查询,用总表的id分组,再过滤having count(A.id)=1 select A.id from (select id from a_test union all select id from b_test union all select id from c_test) A group by A.id having count(A.id)=1; result: OK 4 9 12 14 15 16 18 20 30 35 40
第二题(*):
需求:求出连续七天登陆的总人数 数据: login_test表 uid dt login_status(1登录成功,0异常) 1 2019-07-11 1 1 2019-07-12 1 1 2019-07-13 1 1 2019-07-14 1 1 2019-07-15 1 1 2019-07-16 1 1 2019-07-17 1 1 2019-07-18 1 2 2019-07-11 1 2 2019-07-12 1 2 2019-07-13 0 2 2019-07-14 1 2 2019-07-15 1 2 2019-07-16 0 2 2019-07-17 1 2 2019-07-18 0 3 2019-07-11 1 3 2019-07-12 1 3 2019-07-13 1 3 2019-07-14 1 3 2019-07-15 1 3 2019-07-16 1 3 2019-07-17 1 3 2019-07-18 1 create table login_test( uid int, dt string, login_status int ) row format delimited fields terminated by '\t'; load data local inpath './data/login_test' into table login_test;
答案粘贴处:
第一种写法:Me select count(*) from ( select distinct uid from( select uid,dt,lag(dt,6) over(partition by uid order by dt) pre_dt, sum(login_status) over(partition by uid order by dt rows between 6 preceding and current row) total from login_test ) t where date_sub(dt,6)=pre_dt and t.total=7) t1; 第二种写法:answer select count(distinct a.uid) from (select uid,login_status, datediff(dt,lag(dt,6) over(distribute by uid sort by dt asc)) b from login_test where login_status=1) a where a.b=6 ;
第三题:
2014010114 2014010216 2014010317 2014010410 2014010506 2012010609 2012010732 2012010812 2012010919 2012011023 2001010116 2001010212 2001010310 2001010411 2001010529 2013010619 2013010722 2013010812 2013010929 2013011023 2008010105 2008010216 2008010337 2008010414 2008010516 2007010619 2007010712 2007010812 2007010999 2007011023 2010010114 2010010216 2010010317 2010010410 2010010506 2015010649 2015010722 2015010812 2015010999 2015011023 create table if not exists temperature( data string ) row format delimited fields terminated by '\t' lines terminated by '\n'; load data local inpath './data/temperature' into table temperature; 数据解析:比如:2010012325表示在2010年01月23日的气温为25度。 需求:现在要求使用hive,计算每一年的最大气温的日期+温度。效果如下: 20010105 29 20070109 99 20080103 37 20100103 17 20120107 32 20130109 29 20140103 17 20150109 99
答案粘贴处:
分析:计算每一年的最大气温的日期+温度 查询前面八位是年月日,后面两位是气温 按年进行分组通过气温进行排序 只选总表的气温=1的 就是排在第一名的 第一种写法: select t.dt,t.temp from (select substr(data,0,8) dt, substr(data,9,10) temp, rank() over(partition by substr(data,0,4) order by substr(data,9,10) desc) rn from temperature) t where t.rn=1; 第二种写法: select B.dt,B.temp from ( select A.year,A.temp,A.dt,dense_rank() over(partition by A.year order by A.temp desc) rk from (select substr(data,0,4) year,substr(data,9,2) temp,substr(data,0,8) dt from temperature ) A) B where B.rk=1; 分析:总体思路是一样的 利用substr前四位是年,后两位是气温进行切割
第四题: