hive学习-问题解答三

1.order by,sort by,distribute by,cluster by的区别?

答:

(1)order by:

order by会对所给的全部数据进行全局排序,并且,不管有多少数据,都只启动一个reducer来处理。

注意:如果指定了hive.mapred.mode=strict(默认值是nonstrict),这时就必须指定limit来限制输出条数,原因是:所有的数据都会在同一个reducer端进行,数据量大的情况下可能不能出结果,那么在这样的严格模式下,必须指定输出的条数。

(2)sort by:

sort by是局部排序。根据数据量的大小启动一到多个reducer来干活,并且,它会在进入reduce之前为每个reducer都产生一个排序文件(此时,并不能保证所有的数据是有序的 )。这样的好处是提高了全局排序的效率。

(3)distribute by:

distribute by的功能是:distribute  by 控制map结果的分发,它会将具有相同字段的map输出分发到一个reduce节点上做处理。即:控制某个特定行到某个reducer中,为后续可能发生的聚集操作做准备。

注意:distribute by必须要写在sort by之前。

(4)cluster by:

当distribute by和sort by字段相同时,可以使用cluster by方式,即cluster by的功能就是distribute by和sort by相结合 。

注意:被cluster by指定的列只能是降序,不能指定asc和desc。

2.聚合函数是否可以写在order by后面,为什么?

答:HIve只能用聚合函数的别名排序,不可以用聚合函数的表达式排序 ,order by的执行顺序在select之后,所以需使用重新定义的列名进行排序。;

Mysql可以用聚合函数的别名排序,也可以用聚合函数的表达式排序

主要原因,执行顺序的问题:

(1)from
(2)join
(3)on
(4)where
(5)select
(6)group by
(7)having
(8)order by
(9)limit

3.有以下数据

10001    100    2019-03-01
10002    200    2019-03-02
10003    300    2019-03-03
10004    400    2019-04-01
10005    500    2019-04-02
10006    600    2019-04-03
10007    700    2019-05-01
10008    800    2019-05-02
10009    900    2019-05-03
10010    910    2019-06-01
10011    920    2019-06-02
10012    930    2019-06-03

3.1 把以上数据利用动态分区导入到分区表中(按照年、月进行多级分区)

答:

(1)创建普通表与分区表:

create table t_price(
    id int,
    price  int, 
    date_time   string
)row format delimited fields terminated by '\t';
​
create table t_price_partition(
    id int,
    price int,    
)partitioned BY(year string, month string)
row format delimited fields terminated by '\t';

(2)导入数据到普通表

load data local inpath '/opt/bigdata/data/t_price.txt' overwrite into table t_price;

(3)动态加载数据到分区表中

hive> set hive.exec.dynamic.partition=true; //使用动态分区
hive> set hive.exec.dynamic.partition.mode=nonstrict; //非严格模式
   
insert into table t_price_partition partition(year, month) select order_number,order_price,split(order_time,'-')[0],split(order_time,'-')[1] from t_price;

4.根据user_trade(用户购买明细)数据创建出合理的表结构,导入数据

数据(其中一个分区数据dt='2017-01-01'):

Allison    4    688.8    2755.2    shoes    1483822729
Francis    83    1.1    91.3    food    1483788170

创建user_trade表

create table if not exists user_trade (
user_name  string,
piece  int,
price  double,
pay_amount  double, 
goods_category  string, 
pay_time  bigint) 
partitioned by (dt string)
row format delimited fields terminated by '\t';

先设置动态分区

set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;    
set hive.exec.max.dynamic.partitions=10000;
set hive.exec.max.dynamic.partitions.pernode=10000;

上传数据到hdfs

hdfs dfs -put /home/hadoop/user_trade/* /user/hive_remote/warehouse/db_hive.db/user_trade

进行表修复

msck repair table + 表名

查看分区

show partitions + 表名;

4.1 查出2018年一月到四月,每个品类有多少人购买,累计金额是多少?

select goods_category,
count(distinct user_name) as user_num,
sum(pay_amount) as total_amount
from user_trade 
where dt between '2018-01-01' and '2018-04-30'
group by goods_category;

4.2 查出2018年4月,支付金额超过五万元的用户

select user_name,
sum(pay_amount) as total_amount
from user_trade
where dt between '2018-04-01' and '2018-04-30'
group by user_name having sum(pay_amount)>50000;
​
//方案2 select user_name,total_amount from (select user_name, sum(pay_amount) as total_amount from user_trade where dt between '2018-04-01' and '2018-04-30' group by user_name) t where t.total_amount>50000;

4.3 查出2018年4月,支付金额最多的top5用户

select user_name,
sum(pay_amount) as total_amount
from user_trade
where dt between '2018-04-01' and '2018-04-30'
group by user_name 
order by total_amount desc limit 5;

5.根据user_info(用户信息)数据创建出合理的表结构,导入数据

数据:

10001    Abby    female    38    hangzhou    2018-04-13 01:06:07    2    {"systemtype": "android", "education": "doctor", "marriage_status": "1", "phonebrand": "VIVO"}    systemtype:android,education:doctor,marriage_status:1,phonebrand:VIVO
10002    Ailsa    female    42    shenzhen    2018-06-03 20:30:03    2    {"systemtype": "android", "education": "bachelor", "marriage_status": "0", "phonebrand": "YIJIA"}    systemtype:android,education:bachelor,marriage_status:0,phonebrand:YIJIA
10003    Alice    female    16    changchun    2018-12-04 07:34:27    3    {"systemtype": "ios", "education": "bachelor", "marriage_status": "1", "phonebrand": "iphone7"}    systemtype:ios,education:bachelor,marriage_status:1,phonebrand:iphone7

创建user_info表:

create table if not exists user_info (
user_id  string,
user_name  string,
sex  string,
age  int,
city  string,
firstactivetime  string,
level  int, 
extra1  string, 
extra2  map<string,string>) 
row format delimited fields terminated by '\t'
collection items terminated by ','
map keys terminated by ':'
lines terminated by '\n'
stored as textfile;

加载数据:

load data local inpath '/home/hadoop/user_info/user_info.txt' into table user_info;

5.1 用户的首次激活时间,与2019年5月1日的日期间隔

select user_name,
datediff('2019-05-01', to_date(firstactivetime))
from user_info
limit 10;

注释:datediff(string enddate,string startdate):结束日期减去开始日期的天数

5.2 统计一下四个年龄段20岁以下,20-30岁,30-40岁,40岁以上的用户数

select 
case when age<20 then '20岁以下'
when age>=20 and age <30 then '20-30岁'
when age>=30 and age <40 then '30-40岁'
else '40岁以上' end,
count(distinct user_id) as user_num
from user_info
group by 
case when age<20 then '20岁以下'
when age>=20 and age <30 then '20-30岁'
when age>=30 and age <40 then '30-40岁'
else '40岁以上' end;

5.3 统计每个性别用户等级高低的分布情况(level大于5为高级)

select sex,
if(level>5,'高','低'),
count(distinct user_id) as user_num
from user_info
group by sex,
if(level>5,'高','低');

5.4 统计每个月新激活的用户数

select substr(firstactivetime,1,7) as month,
count(distinct user_id) as user_num
from user_info
group by substr(firstactivetime,1,7);

5.5 统计不同手机品牌的用户数

select extra2['phonebrand'] as phone_brand,
count(distinct user_id) as user_num
from user_info
group by extra2['phonebrand'];
​
#使用函数get_json_object将string转化为json
select get_json_object(extra1,'$.phonebrand') as phone_brand,
count(distinct user_id) as user_num
from user_info
group by get_json_object(extra1,'$.phonebrand');

6.现在我们已经有了两张表,我们再根据user_refund(退款信息)创建出合理的表结构,导入数据

数据(其中一个分区数据dt='2017-01-13'):

Carroll    43    8600.0    2017-01-13 02:27:59
DEMI    26    57772.0    2017-01-13 11:35:04

创建user_refund表:

create table if not exists user_refund(
user_name string,
refund_piece int,
refund_amount double, 
refund_time string) 
partitioned by (dt string) 
row format delimited fields terminated by '\t';

先设置动态分区

set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;    
set hive.exec.max.dynamic.partitions=10000;
set hive.exec.max.dynamic.partitions.pernode=10000;

上传数据到hdfs

load data local inpath '/home/hadoop/user_info/user_refund.txt' into table user_refund;

进行表修复

msck repair table + 表名

查看分区

show partitions + 表名;

6.1 在2018年购买后又退款的用户

select a.user_name
from
(select distinct user_name
from user_trade
where year(dt)=2018
) a
join
(select distinct user_name
from user_refund
where year(dt)=2018) b
on a.user_name = b.user_name;

6.2 在2018年购买,但是没有退款的用户

select a.user_name
from (
select distinct user_name
from user_trade
where year(dt) = 2018
) a
left join
(select distinct user_name
from user_refund
where year(dt)=2018)b
on a.user_name = b.user_name
where b.user_name is null;

6.3 在2017年和2018年都购买的用户

select a.user_name
from (select distinct user_name
from user_trade
where year(dt)=2017) a
join
(select distinct user_name
from user_trade
where year(dt)=2018) b
on a.user_name=b.user_name;

6.4 在2018年购买用户的学历分布在

select b.education,
count(a.user_name)
from (select distinct user_name
from user_trade
where year(dt)=2018) a
join
(select user_name,
  get_json_object(extra1, '$.education') as education
 from user_info) b
 on a.user_name = b.user_name
 group by education;

6.5 2018年每个用户的支付和退款金额汇总

select a.user_name,
sum(a.pay_amount) as pay_amount,
sum(a.refund_amount) as refund_amount
from
(select user_name,
sum(pay_amount) as pay_amount,
0 as refund_amount
from user_trade
where year(dt)=2018
group by user_name
union all
select user_name,
0 as pay_amount,
sum(refund_amount) as refund_amount
from user_refund
where year(dt)=2018
group by user_name) a
group by a.user_name;
​
####方案二
select coalesce(a.user_name,b.user_name),
       a.pay_amount,
       b.refund_amount
from
     (
     select user_name,
            sum(pay_amount) as pay_amount,
            0 as refund_amount
     from user_trade
     where year(dt)=2018
     group by user_name
     )a
   full join
     (
     select user_name,
            sum(refund_amount) as refund_amount
     from user_refund
     where year(dt)=2018
     group by user_name
     )b on a.user_name=b.user_name;

6.6 首次激活时间在2017年,但是一直没有支付的用户年龄段分布

select a.age_level,
count(a.user_name)
from
(select user_name,
                case when age<20 then '20岁以下'
                  when age>=20 and age<30 then '20-30岁'
                  when age>=30 and age<40 then '30-40岁'
                  else '40岁以上' end as age_level
from user_info
where year(firstactivetime)=2017) a
left join 
(select distinct user_name
from user_trade
where dt >= 2017) b
on a.user_name = b.user_name
where b.user_name is null
group by age_level;

7.sql test:已经创建了三张表(用户信息表,交易明细表,退款表)

7.1 统计2018年每月的支付总额和当年累计支付总额

select a.month,
       a.pay_amount,
       sum(a.pay_amount) over(order by a.month) //逐月累加
from
     (select month(dt) month,
             sum(pay_amount) pay_amount
             from user_trade
             where year(dt)=2018
             group by month(dt)) a;

7.2 统计2017-2018每月的支付总额和当年累积支付总额

select a.year,
       a.month,
       a.pay_amount,
       sum(a.pay_amount) over(partition by a.year order by a.month)
from
     (select year(dt) year,
             month(dt) month,
             sum(pay_amount) pay_amount
      from user_trade
      where year(dt) in (2017,2018)
      group by year(dt),
               month(dt))a;

7.3 统计2018年1月,用户购买商品品类数量的排名

select user_name,
       count(distinct goods_category),
       row_number() over(order by count(distinct goods_category)),
       rank() over(order by count(distinct goods_category)),
       dense_rank() over(order by count(distinct goods_category))
from user_trade
where substr(dt,1,7)='2018-01'
group by user_name;

7.4 选出2019年支付金额排名在第10,20,30的用户

select a.user_name,
       a.pay_amount,
       a.rank
from
    (select user_name,
            sum(pay_amount) pay_amount,
            rank() over(order by sum(pay_amount) desc) rank
    from user_trade
    where year(dt)=2019
    group by user_name)a
where a.rank in (10,20,30);

7.5 选出2018年退款金额排名前10%的用户

select a.user_name,
       a.refund_amount,
       a.level
from
    (select user_name,
            sum(refund_amount) refund_amount,
            ntile(10) over(order by sum(refund_amount) desc) level
     from user_refund
     where year(dt)=2018
     group by user_name)a
where a.level=1;

7.6 支付时间间隔超过100天的用户数

select count(distinct user_name)
from
    (select user_name,
            dt,
            lead(dt) over(partition by user_name order by dt) lead_dt
    from user_trade
    where dt>'0')a
where a.lead_dt is not null
      and datediff(a.lead_dt,a.dt)>100;

7.7 每个城市,不同性别,2018年支付金额最高的TOP3用户

select c.user_name,
       c.city,
       c.sex,
       c.pay_amount,
       c.rank
from
    (select a.user_name,
            b.city,
            b.sex,
            a.pay_amount,
            row_number() over(partition by b.city,b.sex order by a.pay_amount desc) rank
     from
         (select user_name,
                 sum(pay_amount) pay_amount
          from user_trade
          where year(dt)=2018
          group by user_name)a
          left join user_info b on a.user_name=b.user_name)c
where c.rank<=3;

7.8 每个手机品牌退款金额前25%的用户

select *
from
    (select a.user_name,
    extra2['phonebrand'] as phonebrand,
    a.refund_amount,
    ntile(4) over(partition by extra2['phonebrand'] order by a.refund_amount desc) level
    from
        (select user_name,
                sum(refund_amount) refund_amount
         from user_refund
         where dt>'0'
         group by user_name)a
         left join user_info b on a.user_name=b.user_name)c
where c.level=1;

7.9 统计每月的支付金额和每年的支付金额汇总

select a.dt,
       sum(a.year_amount),
       sum(a.month_amount)
from
    (select substr(dt,1,4) as dt,
            sum(pay_amount) year_amount,
            0 as month_amount
    from user_trade
    where dt>'0'
    group by substr(dt,1,4)
    union all
    select substr(dt,1,7) as dt,
           0 as year_amount,
           sum(pay_amount) as month_amount
    from user_trade
    where dt>'0'
    group by substr(dt,1,7)
    )a
group by a.dt;

 

7.10 统计每个品类的购买用户数(这里要用到一张新表user_goods_category,第一列是用户名,第二列是用户购买的物品种类)

先建个user_goods_category表

数据:

Abby    clothes,food,electronics
Ailsa    book,clothes,food
Albert    clothes,electronics,computer

见表,查询:

create table user_goods_category(
user_name string, 
category_detail string)
row format delimited fields terminated by '\t';

select b.category,
       count(distinct a.user_name)
from user_goods_category a 
lateral view explode(split(category_detail,',')) b as category
group by b.category;

7.11 计算出每12个月的用户累计支付金额

SELECT a.month,
       a.pay_amount,
       sum(a.pay_amount) over(order by a.month rows between 11 preceding and current row)
FROM 
(SELECT substr(dt,1,7) as month,
       sum(pay_amount) as pay_amount
FROM user_trade
WHERE dt>'0'
GROUP BY substr(dt,1,7))a;

7.12 计算出每4个月的最大退款金额

SELECT a.month,
       max(a.refund_amount) over(order by a.month rows between 3 preceding and current row)
FROM 
(SELECT substr(dt,1,7) as month,
       sum(refund_amount) as refund_amount
FROM user_refund
WHERE dt>'0'
GROUP BY substr(dt,1,7))a;

7.13 退款时间间隔最长的用户

SELECT b.user_name,
       b.refund_days
FROM 
    (SELECT a.user_name,
           datediff(a.dt,a.lag_dt) refund_days,
           rank() over(order by datediff(a.dt,a.lag_dt) desc) as rank
    FROM 
          (SELECT user_name,
                 dt,
                 lag(dt) over(partition by user_name order by dt) lag_dt
          FROM user_refund
          WHERE dt>'0' )a
    WHERE a.lag_dt is not null)b 
WHERE b.rank=1;

7.14 每个性别,不同性别和手机品牌的退款金额分布

SELECT b.sex,
       b.phonebrand,
       sum(a.refund_amount)
FROM 
        (SELECT user_name,
               sum(refund_amount) as refund_amount
        FROM user_refund
        WHERE dt>'0'
        GROUP BY user_name)a
    JOIN
        (SELECT user_name,
                sex,
                extra2['phonebrand'] phonebrand
        FROM user_info) b on a.user_name=b.user_name
GROUP BY b.sex,
         b.phonebrand
GROUPING SETS (b.sex,(b.sex,b.phonebrand));

7.15 把每个用户购买的品类变成一行,品类间用逗号分隔

SELECT a.user_name,
       concat_ws(',',collect_set(a.goods_category))
FROM 
    (SELECT user_name,
           goods_category
    FROM user_trade
    WHERE dt>'0'
    GROUP BY user_name,
             goods_category)a
GROUP BY a.user_name;

7.16 2017,2018年按月累计去重的退款用户数

SELECT b.year,
       b.month,
       sum(b.user_num) over(partition by b.year order by b.month)
FROM
      (SELECT a.year,
             a.month,
             count(distinct a.user_name) user_num
      FROM 
          (SELECT year(dt) as year,
                 user_name,
                 min(month(dt)) as month
          FROM user_refund
          WHERE year(dt) in (2017,2018)
          GROUP BY year(dt),
                   user_name)a
      GROUP BY a.year,
               a.month)b
ORDER BY b.year,
         b.month 
limit 24;

 

评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值