hive学习-问题解答三

最新推荐文章于 2022-09-20 10:57:55 发布

Hi_Shook

最新推荐文章于 2022-09-20 10:57:55 发布

阅读量675

点赞数

本文链接：https://blog.csdn.net/weixin_42575593/article/details/107439158

版权

1.order by,sort by,distribute by,cluster by的区别？

答：

（1）order by：

order by会对所给的全部数据进行全局排序，并且，不管有多少数据，都只启动一个reducer来处理。

注意：如果指定了hive.mapred.mode=strict（默认值是nonstrict）,这时就必须指定limit来限制输出条数，原因是：所有的数据都会在同一个reducer端进行，数据量大的情况下可能不能出结果，那么在这样的严格模式下，必须指定输出的条数。

（2）sort by：

sort by是局部排序。根据数据量的大小启动一到多个reducer来干活，并且，它会在进入reduce之前为每个reducer都产生一个排序文件（此时，并不能保证所有的数据是有序的）。这样的好处是提高了全局排序的效率。

（3）distribute by：

distribute by的功能是：distribute by 控制map结果的分发，它会将具有相同字段的map输出分发到一个reduce节点上做处理。即：控制某个特定行到某个reducer中，为后续可能发生的聚集操作做准备。

注意：distribute by必须要写在sort by之前。

（4）cluster by：

当distribute by和sort by字段相同时，可以使用cluster by方式，即cluster by的功能就是distribute by和sort by相结合。

注意：被cluster by指定的列只能是降序，不能指定asc和desc。

2.聚合函数是否可以写在order by后面，为什么？

答：HIve只能用聚合函数的别名排序，不可以用聚合函数的表达式排序 ,order by的执行顺序在select之后，所以需使用重新定义的列名进行排序。；

Mysql可以用聚合函数的别名排序，也可以用聚合函数的表达式排序。

主要原因，执行顺序的问题：

(1)from
(2)join
(3)on
(4)where
(5)select
(6)group by
(7)having
(8)order by
(9)limit

3.有以下数据

10001    100    2019-03-01
10002    200    2019-03-02
10003    300    2019-03-03
10004    400    2019-04-01
10005    500    2019-04-02
10006    600    2019-04-03
10007    700    2019-05-01
10008    800    2019-05-02
10009    900    2019-05-03
10010    910    2019-06-01
10011    920    2019-06-02
10012    930    2019-06-03

3.1 把以上数据利用动态分区导入到分区表中(按照年、月进行多级分区)

答：

（1）创建普通表与分区表：

create table t_price(
    id int,
    price  int, 
    date_time   string
)row format delimited fields terminated by '\t';

create table t_price_partition(
    id int,
    price int,    
)partitioned BY(year string, month string)
row format delimited fields terminated by '\t';

(2)导入数据到普通表

load data local inpath '/opt/bigdata/data/t_price.txt' overwrite into table t_price;

（3）动态加载数据到分区表中

hive> set hive.exec.dynamic.partition=true; //使用动态分区
hive> set hive.exec.dynamic.partition.mode=nonstrict; //非严格模式
   
insert into table t_price_partition partition(year, month) select order_number,order_price,split(order_time,'-')[0],split(order_time,'-')[1] from t_price;

4.根据user_trade（用户购买明细）数据创建出合理的表结构，导入数据

数据(其中一个分区数据dt='2017-01-01')：

Allison    4    688.8    2755.2    shoes    1483822729
Francis    83    1.1    91.3    food    1483788170

创建user_trade表

create table if not exists user_trade (
user_name  string,
piece  int,
price  double,
pay_amount  double, 
goods_category  string, 
pay_time  bigint) 
partitioned by (dt string)
row format delimited fields terminated by '\t';

先设置动态分区

set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;    
set hive.exec.max.dynamic.partitions=10000;
set hive.exec.max.dynamic.partitions.pernode=10000;

上传数据到hdfs

hdfs dfs -put /home/hadoop/user_trade/* /user/hive_remote/warehouse/db_hive.db/user_trade

进行表修复

msck repair table + 表名

查看分区

show partitions + 表名；

4.1 查出2018年一月到四月，每个品类有多少人购买，累计金额是多少？

select goods_category,
count(distinct user_name) as user_num,
sum(pay_amount) as total_amount
from user_trade 
where dt between '2018-01-01' and '2018-04-30'
group by goods_category;

4.2 查出2018年4月，支付金额超过五万元的用户

select user_name,
sum(pay_amount) as total_amount
from user_trade
where dt between '2018-04-01' and '2018-04-30'
group by user_name having sum(pay_amount)>50000;

//方案2
select user_name,total_amount
from (select user_name,
sum(pay_amount) as total_amount
from user_trade
where dt between '2018-04-01' and '2018-04-30'
group by user_name) t
where t.total_amount>50000;

4.3 查出2018年4月，支付金额最多的top5用户

select user_name,
sum(pay_amount) as total_amount
from user_trade
where dt between '2018-04-01' and '2018-04-30'
group by user_name 
order by total_amount desc limit 5;

5.根据user_info（用户信息）数据创建出合理的表结构，导入数据

数据：

10001    Abby    female    38    hangzhou    2018-04-13 01:06:07    2    {"systemtype": "android", "education": "doctor", "marriage_status": "1", "phonebrand": "VIVO"}    systemtype:android,education:doctor,marriage_status:1,phonebrand:VIVO
10002    Ailsa    female    42    shenzhen    2018-06-03 20:30:03    2    {"systemtype": "android", "education": "bachelor", "marriage_status": "0", "phonebrand": "YIJIA"}    systemtype:android,education:bachelor,marriage_status:0,phonebrand:YIJIA
10003    Alice    female    16    changchun    2018-12-04 07:34:27    3    {"systemtype": "ios", "education": "bachelor", "marriage_status": "1", "phonebrand": "iphone7"}    systemtype:ios,education:bachelor,marriage_status:1,phonebrand:iphone7

创建user_info表：

create table if not exists user_info (
user_id  string,
user_name  string,
sex  string,
age  int,
city  string,
firstactivetime  string,
level  int, 
extra1  string, 
extra2  map<string,string>) 
row format delimited fields terminated by '\t'
collection items terminated by ','
map keys terminated by ':'
lines terminated by '\n'
stored as textfile;

加载数据：

load data local inpath '/home/hadoop/user_info/user_info.txt' into table user_info;

5.1 用户的首次激活时间，与2019年5月1日的日期间隔

select user_name,
datediff('2019-05-01', to_date(firstactivetime))
from user_info
limit 10;

注释：datediff(string enddate,string startdate):结束日期减去开始日期的天数

5.2 统计一下四个年龄段20岁以下，20-30岁，30-40岁，40岁以上的用户数

select 
case when age<20 then '20岁以下'
when age>=20 and age <30 then '20-30岁'
when age>=30 and age <40 then '30-40岁'
else '40岁以上' end,
count(distinct user_id) as user_num
from user_info
group by 
case when age<20 then '20岁以下'
when age>=20 and age <30 then '20-30岁'
when age>=30 and age <40 then '30-40岁'
else '40岁以上' end;

5.3 统计每个性别用户等级高低的分布情况（level大于5为高级）

select sex,
if(level>5,'高','低'),
count(distinct user_id) as user_num
from user_info
group by sex,
if(level>5,'高','低');

5.4 统计每个月新激活的用户数

select substr(firstactivetime,1,7) as month,
count(distinct user_id) as user_num
from user_info
group by substr(firstactivetime,1,7);

5.5 统计不同手机品牌的用户数

select extra2['phonebrand'] as phone_brand,
count(distinct user_id) as user_num
from user_info
group by extra2['phonebrand'];

#使用函数get_json_object将string转化为json
select get_json_object(extra1,'$.phonebrand') as phone_brand,
count(distinct user_id) as user_num
from user_info
group by get_json_object(extra1,'$.phonebrand');

6.现在我们已经有了两张表，我们再根据user_refund(退款信息)创建出合理的表结构，导入数据

数据(其中一个分区数据dt='2017-01-13')：

Carroll    43    8600.0    2017-01-13 02:27:59
DEMI    26    57772.0    2017-01-13 11:35:04

创建user_refund表：

create table if not exists user_refund(
user_name string,
refund_piece int,
refund_amount double, 
refund_time string) 
partitioned by (dt string) 
row format delimited fields terminated by '\t';

先设置动态分区

set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;    
set hive.exec.max.dynamic.partitions=10000;
set hive.exec.max.dynamic.partitions.pernode=10000;

上传数据到hdfs

load data local inpath '/home/hadoop/user_info/user_refund.txt' into table user_refund;

进行表修复

msck repair table + 表名

查看分区

show partitions + 表名；

6.1 在2018年购买后又退款的用户

select a.user_name
from
(select distinct user_name
from user_trade
where year(dt)=2018
) a
join
(select distinct user_name
from user_refund
where year(dt)=2018) b
on a.user_name = b.user_name;

6.2 在2018年购买，但是没有退款的用户

select a.user_name
from (
select distinct user_name
from user_trade
where year(dt) = 2018
) a
left join
(select distinct user_name
from user_refund
where year(dt)=2018)b
on a.user_name = b.user_name
where b.user_name is null;

6.3 在2017年和2018年都购买的用户

select a.user_name
from (select distinct user_name
from user_trade
where year(dt)=2017) a
join
(select distinct user_name
from user_trade
where year(dt)=2018) b
on a.user_name=b.user_name;

6.4 在2018年购买用户的学历分布在

select b.education,
count(a.user_name)
from (select distinct user_name
from user_trade
where year(dt)=2018) a
join
(select user_name,
  get_json_object(extra1, '$.education') as education
 from user_info) b
 on a.user_name = b.user_name
 group by education;

6.5 2018年每个用户的支付和退款金额汇总

select a.user_name,
sum(a.pay_amount) as pay_amount,
sum(a.refund_amount) as refund_amount
from
(select user_name,
sum(pay_amount) as pay_amount,
0 as refund_amount
from user_trade
where year(dt)=2018
group by user_name
union all
select user_name,
0 as pay_amount,
sum(refund_amount) as refund_amount
from user_refund
where year(dt)=2018
group by user_name) a
group by a.user_name;

####方案二
select coalesce(a.user_name,b.user_name),
       a.pay_amount,
       b.refund_amount
from
     (
     select user_name,
            sum(pay_amount) as pay_amount,
            0 as refund_amount
     from user_trade
     where year(dt)=2018
     group by user_name
     )a
   full join
     (
     select user_name,
            sum(refund_amount) as refund_amount
     from user_refund
     where year(dt)=2018
     group by user_name
     )b on a.user_name=b.user_name;

6.6 首次激活时间在2017年，但是一直没有支付的用户年龄段分布

select a.age_level,
count(a.user_name)
from
(select user_name,
                case when age<20 then '20岁以下'
                  when age>=20 and age<30 then '20-30岁'
                  when age>=30 and age<40 then '30-40岁'
                  else '40岁以上' end as age_level
from user_info
where year(firstactivetime)=2017) a
left join 
(select distinct user_name
from user_trade
where dt >= 2017) b
on a.user_name = b.user_name
where b.user_name is null
group by age_level;

7.sql test:已经创建了三张表（用户信息表，交易明细表，退款表）

7.1 统计2018年每月的支付总额和当年累计支付总额

select a.month,
       a.pay_amount,
       sum(a.pay_amount) over(order by a.month) //逐月累加
from
     (select month(dt) month,
             sum(pay_amount) pay_amount
             from user_trade
             where year(dt)=2018
             group by month(dt)) a;

7.2 统计2017-2018每月的支付总额和当年累积支付总额

select a.year,
       a.month,
       a.pay_amount,
       sum(a.pay_amount) over(partition by a.year order by a.month)
from
     (select year(dt) year,
             month(dt) month,
             sum(pay_amount) pay_amount
      from user_trade
      where year(dt) in (2017,2018)
      group by year(dt),
               month(dt))a;

7.3 统计2018年1月，用户购买商品品类数量的排名

select user_name,
       count(distinct goods_category),
       row_number() over(order by count(distinct goods_category)),
       rank() over(order by count(distinct goods_category)),
       dense_rank() over(order by count(distinct goods_category))
from user_trade
where substr(dt,1,7)='2018-01'
group by user_name;

7.4 选出2019年支付金额排名在第10,20,30的用户

select a.user_name,
       a.pay_amount,
       a.rank
from
    (select user_name,
            sum(pay_amount) pay_amount,
            rank() over(order by sum(pay_amount) desc) rank
    from user_trade
    where year(dt)=2019
    group by user_name)a
where a.rank in (10,20,30);

7.5 选出2018年退款金额排名前10%的用户

select a.user_name,
       a.refund_amount,
       a.level
from
    (select user_name,
            sum(refund_amount) refund_amount,
            ntile(10) over(order by sum(refund_amount) desc) level
     from user_refund
     where year(dt)=2018
     group by user_name)a
where a.level=1;

7.6 支付时间间隔超过100天的用户数

select count(distinct user_name)
from
    (select user_name,
            dt,
            lead(dt) over(partition by user_name order by dt) lead_dt
    from user_trade
    where dt>'0')a
where a.lead_dt is not null
      and datediff(a.lead_dt,a.dt)>100;

7.7 每个城市，不同性别，2018年支付金额最高的TOP3用户

select c.user_name,
       c.city,
       c.sex,
       c.pay_amount,
       c.rank
from
    (select a.user_name,
            b.city,
            b.sex,
            a.pay_amount,
            row_number() over(partition by b.city,b.sex order by a.pay_amount desc) rank
     from
         (select user_name,
                 sum(pay_amount) pay_amount
          from user_trade
          where year(dt)=2018
          group by user_name)a
          left join user_info b on a.user_name=b.user_name)c
where c.rank<=3;

7.8 每个手机品牌退款金额前25%的用户

select *
from
    (select a.user_name,
    extra2['phonebrand'] as phonebrand,
    a.refund_amount,
    ntile(4) over(partition by extra2['phonebrand'] order by a.refund_amount desc) level
    from
        (select user_name,
                sum(refund_amount) refund_amount
         from user_refund
         where dt>'0'
         group by user_name)a
         left join user_info b on a.user_name=b.user_name)c
where c.level=1;

7.9 统计每月的支付金额和每年的支付金额汇总

select a.dt,
       sum(a.year_amount),
       sum(a.month_amount)
from
    (select substr(dt,1,4) as dt,
            sum(pay_amount) year_amount,
            0 as month_amount
    from user_trade
    where dt>'0'
    group by substr(dt,1,4)
    union all
    select substr(dt,1,7) as dt,
           0 as year_amount,
           sum(pay_amount) as month_amount
    from user_trade
    where dt>'0'
    group by substr(dt,1,7)
    )a
group by a.dt;

7.10 统计每个品类的购买用户数（这里要用到一张新表user_goods_category，第一列是用户名，第二列是用户购买的物品种类）

先建个user_goods_category表

数据：

Abby    clothes,food,electronics
Ailsa    book,clothes,food
Albert    clothes,electronics,computer

见表，查询：

create table user_goods_category(
user_name string, 
category_detail string)
row format delimited fields terminated by '\t';

select b.category,
       count(distinct a.user_name)
from user_goods_category a 
lateral view explode(split(category_detail,',')) b as category
group by b.category;

7.11 计算出每12个月的用户累计支付金额

SELECT a.month,
       a.pay_amount,
       sum(a.pay_amount) over(order by a.month rows between 11 preceding and current row)
FROM 
(SELECT substr(dt,1,7) as month,
       sum(pay_amount) as pay_amount
FROM user_trade
WHERE dt>'0'
GROUP BY substr(dt,1,7))a;

7.12 计算出每4个月的最大退款金额

SELECT a.month,
       max(a.refund_amount) over(order by a.month rows between 3 preceding and current row)
FROM 
(SELECT substr(dt,1,7) as month,
       sum(refund_amount) as refund_amount
FROM user_refund
WHERE dt>'0'
GROUP BY substr(dt,1,7))a;

7.13 退款时间间隔最长的用户

SELECT b.user_name,
       b.refund_days
FROM 
    (SELECT a.user_name,
           datediff(a.dt,a.lag_dt) refund_days,
           rank() over(order by datediff(a.dt,a.lag_dt) desc) as rank
    FROM 
          (SELECT user_name,
                 dt,
                 lag(dt) over(partition by user_name order by dt) lag_dt
          FROM user_refund
          WHERE dt>'0' )a
    WHERE a.lag_dt is not null)b 
WHERE b.rank=1;

7.14 每个性别，不同性别和手机品牌的退款金额分布

SELECT b.sex,
       b.phonebrand,
       sum(a.refund_amount)
FROM 
        (SELECT user_name,
               sum(refund_amount) as refund_amount
        FROM user_refund
        WHERE dt>'0'
        GROUP BY user_name)a
    JOIN
        (SELECT user_name,
                sex,
                extra2['phonebrand'] phonebrand
        FROM user_info) b on a.user_name=b.user_name
GROUP BY b.sex,
         b.phonebrand
GROUPING SETS (b.sex,(b.sex,b.phonebrand));

7.15 把每个用户购买的品类变成一行，品类间用逗号分隔

SELECT a.user_name,
       concat_ws(',',collect_set(a.goods_category))
FROM 
    (SELECT user_name,
           goods_category
    FROM user_trade
    WHERE dt>'0'
    GROUP BY user_name,
             goods_category)a
GROUP BY a.user_name;

7.16 2017,2018年按月累计去重的退款用户数

SELECT b.year,
       b.month,
       sum(b.user_num) over(partition by b.year order by b.month)
FROM
      (SELECT a.year,
             a.month,
             count(distinct a.user_name) user_num
      FROM 
          (SELECT year(dt) as year,
                 user_name,
                 min(month(dt)) as month
          FROM user_refund
          WHERE year(dt) in (2017,2018)
          GROUP BY year(dt),
                   user_name)a
      GROUP BY a.year,
               a.month)b
ORDER BY b.year,
         b.month 
limit 24;