答:
(1)order by:
order by会对所给的全部数据进行全局排序,并且,不管有多少数据,都只启动一个reducer来处理。
注意:如果指定了hive.mapred.mode=strict(默认值是nonstrict),这时就必须指定limit来限制输出条数,原因是:所有的数据都会在同一个reducer端进行,数据量大的情况下可能不能出结果,那么在这样的严格模式下,必须指定输出的条数。
(2)sort by:
sort by是局部排序。根据数据量的大小启动一到多个reducer来干活,并且,它会在进入reduce之前为每个reducer都产生一个排序文件(此时,并不能保证所有的数据是有序的 )。这样的好处是提高了全局排序的效率。
(3)distribute by:
distribute by的功能是:distribute by 控制map结果的分发,它会将具有相同字段的map输出分发到一个reduce节点上做处理。即:控制某个特定行到某个reducer中,为后续可能发生的聚集操作做准备。
注意:distribute by必须要写在sort by之前。
(4)cluster by:
当distribute by和sort by字段相同时,可以使用cluster by方式,即cluster by的功能就是distribute by和sort by相结合 。
注意:被cluster by指定的列只能是降序,不能指定asc和desc。
2.聚合函数是否可以写在order by后面,为什么?
答:HIve只能用聚合函数的别名排序,不可以用聚合函数的表达式排序 ,order by的执行顺序在select之后,所以需使用重新定义的列名进行排序。;
Mysql可以用聚合函数的别名排序,也可以用聚合函数的表达式排序。
主要原因,执行顺序的问题:
(1)from (2)join (3)on (4)where (5)select (6)group by (7)having (8)order by (9)limit
3.有以下数据
10001 100 2019-03-01 10002 200 2019-03-02 10003 300 2019-03-03 10004 400 2019-04-01 10005 500 2019-04-02 10006 600 2019-04-03 10007 700 2019-05-01 10008 800 2019-05-02 10009 900 2019-05-03 10010 910 2019-06-01 10011 920 2019-06-02 10012 930 2019-06-03
3.1 把以上数据利用动态分区导入到分区表中(按照年、月进行多级分区)
答:
(1)创建普通表与分区表:
create table t_price( id int, price int, date_time string )row format delimited fields terminated by '\t'; create table t_price_partition( id int, price int, )partitioned BY(year string, month string) row format delimited fields terminated by '\t';
(2)导入数据到普通表
load data local inpath '/opt/bigdata/data/t_price.txt' overwrite into table t_price;
(3)动态加载数据到分区表中
hive> set hive.exec.dynamic.partition=true; //使用动态分区 hive> set hive.exec.dynamic.partition.mode=nonstrict; //非严格模式 insert into table t_price_partition partition(year, month) select order_number,order_price,split(order_time,'-')[0],split(order_time,'-')[1] from t_price;
4.根据user_trade(用户购买明细)数据创建出合理的表结构,导入数据
数据(其中一个分区数据dt='2017-01-01'):
Allison 4 688.8 2755.2 shoes 1483822729 Francis 83 1.1 91.3 food 1483788170
创建user_trade表
create table if not exists user_trade ( user_name string, piece int, price double, pay_amount double, goods_category string, pay_time bigint) partitioned by (dt string) row format delimited fields terminated by '\t';
先设置动态分区
set hive.exec.dynamic.partition=true; set hive.exec.dynamic.partition.mode=nonstrict; set hive.exec.max.dynamic.partitions=10000; set hive.exec.max.dynamic.partitions.pernode=10000;
上传数据到hdfs
hdfs dfs -put /home/hadoop/user_trade/* /user/hive_remote/warehouse/db_hive.db/user_trade
进行表修复
msck repair table + 表名
查看分区
show partitions + 表名;
4.1 查出2018年一月到四月,每个品类有多少人购买,累计金额是多少?
select goods_category, count(distinct user_name) as user_num, sum(pay_amount) as total_amount from user_trade where dt between '2018-01-01' and '2018-04-30' group by goods_category;
4.2 查出2018年4月,支付金额超过五万元的用户
select user_name, sum(pay_amount) as total_amount from user_trade where dt between '2018-04-01' and '2018-04-30' group by user_name having sum(pay_amount)>50000;
//方案2 select user_name,total_amount from (select user_name, sum(pay_amount) as total_amount from user_trade where dt between '2018-04-01' and '2018-04-30' group by user_name) t where t.total_amount>50000;
4.3 查出2018年4月,支付金额最多的top5用户
select user_name, sum(pay_amount) as total_amount from user_trade where dt between '2018-04-01' and '2018-04-30' group by user_name order by total_amount desc limit 5;
5.根据user_info(用户信息)数据创建出合理的表结构,导入数据
数据:
10001 Abby female 38 hangzhou 2018-04-13 01:06:07 2 {"systemtype": "android", "education": "doctor", "marriage_status": "1", "phonebrand": "VIVO"} systemtype:android,education:doctor,marriage_status:1,phonebrand:VIVO 10002 Ailsa female 42 shenzhen 2018-06-03 20:30:03 2 {"systemtype": "android", "education": "bachelor", "marriage_status": "0", "phonebrand": "YIJIA"} systemtype:android,education:bachelor,marriage_status:0,phonebrand:YIJIA 10003 Alice female 16 changchun 2018-12-04 07:34:27 3 {"systemtype": "ios", "education": "bachelor", "marriage_status": "1", "phonebrand": "iphone7"} systemtype:ios,education:bachelor,marriage_status:1,phonebrand:iphone7
创建user_info表:
create table if not exists user_info ( user_id string, user_name string, sex string, age int, city string, firstactivetime string, level int, extra1 string, extra2 map<string,string>) row format delimited fields terminated by '\t' collection items terminated by ',' map keys terminated by ':' lines terminated by '\n' stored as textfile;
加载数据:
load data local inpath '/home/hadoop/user_info/user_info.txt' into table user_info;
5.1 用户的首次激活时间,与2019年5月1日的日期间隔
select user_name, datediff('2019-05-01', to_date(firstactivetime)) from user_info limit 10;
注释:datediff(string enddate,string startdate):结束日期减去开始日期的天数
5.2 统计一下四个年龄段20岁以下,20-30岁,30-40岁,40岁以上的用户数
select case when age<20 then '20岁以下' when age>=20 and age <30 then '20-30岁' when age>=30 and age <40 then '30-40岁' else '40岁以上' end, count(distinct user_id) as user_num from user_info group by case when age<20 then '20岁以下' when age>=20 and age <30 then '20-30岁' when age>=30 and age <40 then '30-40岁' else '40岁以上' end;
5.3 统计每个性别用户等级高低的分布情况(level大于5为高级)
select sex, if(level>5,'高','低'), count(distinct user_id) as user_num from user_info group by sex, if(level>5,'高','低');
5.4 统计每个月新激活的用户数
select substr(firstactivetime,1,7) as month, count(distinct user_id) as user_num from user_info group by substr(firstactivetime,1,7);
5.5 统计不同手机品牌的用户数
select extra2['phonebrand'] as phone_brand, count(distinct user_id) as user_num from user_info group by extra2['phonebrand']; #使用函数get_json_object将string转化为json select get_json_object(extra1,'$.phonebrand') as phone_brand, count(distinct user_id) as user_num from user_info group by get_json_object(extra1,'$.phonebrand');
6.现在我们已经有了两张表,我们再根据user_refund(退款信息)创建出合理的表结构,导入数据
数据(其中一个分区数据dt='2017-01-13'):
Carroll 43 8600.0 2017-01-13 02:27:59 DEMI 26 57772.0 2017-01-13 11:35:04
创建user_refund表:
create table if not exists user_refund( user_name string, refund_piece int, refund_amount double, refund_time string) partitioned by (dt string) row format delimited fields terminated by '\t';
先设置动态分区
set hive.exec.dynamic.partition=true; set hive.exec.dynamic.partition.mode=nonstrict; set hive.exec.max.dynamic.partitions=10000; set hive.exec.max.dynamic.partitions.pernode=10000;
上传数据到hdfs
load data local inpath '/home/hadoop/user_info/user_refund.txt' into table user_refund;
进行表修复
msck repair table + 表名
查看分区
show partitions + 表名;
6.1 在2018年购买后又退款的用户
select a.user_name from (select distinct user_name from user_trade where year(dt)=2018 ) a join (select distinct user_name from user_refund where year(dt)=2018) b on a.user_name = b.user_name;
6.2 在2018年购买,但是没有退款的用户
select a.user_name from ( select distinct user_name from user_trade where year(dt) = 2018 ) a left join (select distinct user_name from user_refund where year(dt)=2018)b on a.user_name = b.user_name where b.user_name is null;
6.3 在2017年和2018年都购买的用户
select a.user_name from (select distinct user_name from user_trade where year(dt)=2017) a join (select distinct user_name from user_trade where year(dt)=2018) b on a.user_name=b.user_name;
6.4 在2018年购买用户的学历分布在
select b.education, count(a.user_name) from (select distinct user_name from user_trade where year(dt)=2018) a join (select user_name, get_json_object(extra1, '$.education') as education from user_info) b on a.user_name = b.user_name group by education;
6.5 2018年每个用户的支付和退款金额汇总
select a.user_name, sum(a.pay_amount) as pay_amount, sum(a.refund_amount) as refund_amount from (select user_name, sum(pay_amount) as pay_amount, 0 as refund_amount from user_trade where year(dt)=2018 group by user_name union all select user_name, 0 as pay_amount, sum(refund_amount) as refund_amount from user_refund where year(dt)=2018 group by user_name) a group by a.user_name; ####方案二 select coalesce(a.user_name,b.user_name), a.pay_amount, b.refund_amount from ( select user_name, sum(pay_amount) as pay_amount, 0 as refund_amount from user_trade where year(dt)=2018 group by user_name )a full join ( select user_name, sum(refund_amount) as refund_amount from user_refund where year(dt)=2018 group by user_name )b on a.user_name=b.user_name;
6.6 首次激活时间在2017年,但是一直没有支付的用户年龄段分布
select a.age_level, count(a.user_name) from (select user_name, case when age<20 then '20岁以下' when age>=20 and age<30 then '20-30岁' when age>=30 and age<40 then '30-40岁' else '40岁以上' end as age_level from user_info where year(firstactivetime)=2017) a left join (select distinct user_name from user_trade where dt >= 2017) b on a.user_name = b.user_name where b.user_name is null group by age_level;
7.1 统计2018年每月的支付总额和当年累计支付总额
select a.month, a.pay_amount, sum(a.pay_amount) over(order by a.month) //逐月累加 from (select month(dt) month, sum(pay_amount) pay_amount from user_trade where year(dt)=2018 group by month(dt)) a;
7.2 统计2017-2018每月的支付总额和当年累积支付总额
select a.year, a.month, a.pay_amount, sum(a.pay_amount) over(partition by a.year order by a.month) from (select year(dt) year, month(dt) month, sum(pay_amount) pay_amount from user_trade where year(dt) in (2017,2018) group by year(dt), month(dt))a;
7.3 统计2018年1月,用户购买商品品类数量的排名
select user_name, count(distinct goods_category), row_number() over(order by count(distinct goods_category)), rank() over(order by count(distinct goods_category)), dense_rank() over(order by count(distinct goods_category)) from user_trade where substr(dt,1,7)='2018-01' group by user_name;
7.4 选出2019年支付金额排名在第10,20,30的用户
select a.user_name, a.pay_amount, a.rank from (select user_name, sum(pay_amount) pay_amount, rank() over(order by sum(pay_amount) desc) rank from user_trade where year(dt)=2019 group by user_name)a where a.rank in (10,20,30);
7.5 选出2018年退款金额排名前10%的用户
select a.user_name, a.refund_amount, a.level from (select user_name, sum(refund_amount) refund_amount, ntile(10) over(order by sum(refund_amount) desc) level from user_refund where year(dt)=2018 group by user_name)a where a.level=1;
7.6 支付时间间隔超过100天的用户数
select count(distinct user_name) from (select user_name, dt, lead(dt) over(partition by user_name order by dt) lead_dt from user_trade where dt>'0')a where a.lead_dt is not null and datediff(a.lead_dt,a.dt)>100;
7.7 每个城市,不同性别,2018年支付金额最高的TOP3用户
select c.user_name, c.city, c.sex, c.pay_amount, c.rank from (select a.user_name, b.city, b.sex, a.pay_amount, row_number() over(partition by b.city,b.sex order by a.pay_amount desc) rank from (select user_name, sum(pay_amount) pay_amount from user_trade where year(dt)=2018 group by user_name)a left join user_info b on a.user_name=b.user_name)c where c.rank<=3;
7.8 每个手机品牌退款金额前25%的用户
select * from (select a.user_name, extra2['phonebrand'] as phonebrand, a.refund_amount, ntile(4) over(partition by extra2['phonebrand'] order by a.refund_amount desc) level from (select user_name, sum(refund_amount) refund_amount from user_refund where dt>'0' group by user_name)a left join user_info b on a.user_name=b.user_name)c where c.level=1;
7.9 统计每月的支付金额和每年的支付金额汇总
select a.dt, sum(a.year_amount), sum(a.month_amount) from (select substr(dt,1,4) as dt, sum(pay_amount) year_amount, 0 as month_amount from user_trade where dt>'0' group by substr(dt,1,4) union all select substr(dt,1,7) as dt, 0 as year_amount, sum(pay_amount) as month_amount from user_trade where dt>'0' group by substr(dt,1,7) )a group by a.dt;
7.10 统计每个品类的购买用户数(这里要用到一张新表user_goods_category,第一列是用户名,第二列是用户购买的物品种类)
先建个user_goods_category表
数据:
Abby clothes,food,electronics Ailsa book,clothes,food Albert clothes,electronics,computer
见表,查询:
create table user_goods_category( user_name string, category_detail string) row format delimited fields terminated by '\t'; select b.category, count(distinct a.user_name) from user_goods_category a lateral view explode(split(category_detail,',')) b as category group by b.category;
7.11 计算出每12个月的用户累计支付金额
SELECT a.month, a.pay_amount, sum(a.pay_amount) over(order by a.month rows between 11 preceding and current row) FROM (SELECT substr(dt,1,7) as month, sum(pay_amount) as pay_amount FROM user_trade WHERE dt>'0' GROUP BY substr(dt,1,7))a;
7.12 计算出每4个月的最大退款金额
SELECT a.month, max(a.refund_amount) over(order by a.month rows between 3 preceding and current row) FROM (SELECT substr(dt,1,7) as month, sum(refund_amount) as refund_amount FROM user_refund WHERE dt>'0' GROUP BY substr(dt,1,7))a;
7.13 退款时间间隔最长的用户
SELECT b.user_name, b.refund_days FROM (SELECT a.user_name, datediff(a.dt,a.lag_dt) refund_days, rank() over(order by datediff(a.dt,a.lag_dt) desc) as rank FROM (SELECT user_name, dt, lag(dt) over(partition by user_name order by dt) lag_dt FROM user_refund WHERE dt>'0' )a WHERE a.lag_dt is not null)b WHERE b.rank=1;
7.14 每个性别,不同性别和手机品牌的退款金额分布
SELECT b.sex, b.phonebrand, sum(a.refund_amount) FROM (SELECT user_name, sum(refund_amount) as refund_amount FROM user_refund WHERE dt>'0' GROUP BY user_name)a JOIN (SELECT user_name, sex, extra2['phonebrand'] phonebrand FROM user_info) b on a.user_name=b.user_name GROUP BY b.sex, b.phonebrand GROUPING SETS (b.sex,(b.sex,b.phonebrand));
7.15 把每个用户购买的品类变成一行,品类间用逗号分隔
SELECT a.user_name, concat_ws(',',collect_set(a.goods_category)) FROM (SELECT user_name, goods_category FROM user_trade WHERE dt>'0' GROUP BY user_name, goods_category)a GROUP BY a.user_name;
7.16 2017,2018年按月累计去重的退款用户数
SELECT b.year, b.month, sum(b.user_num) over(partition by b.year order by b.month) FROM (SELECT a.year, a.month, count(distinct a.user_name) user_num FROM (SELECT year(dt) as year, user_name, min(month(dt)) as month FROM user_refund WHERE year(dt) in (2017,2018) GROUP BY year(dt), user_name)a GROUP BY a.year, a.month)b ORDER BY b.year, b.month limit 24;