The HQL in this post can be dropped into the SQL slot of the shell template from 【数据库笔记】hive查询结果导出到本地; there is likewise a MySQL shell template in 【DA】shell脚本+SQL应用.
1 HiveSQL Core Skills 1: Common Functions
- Master basic Hive syntax, common functions, and how to combine them
- Master the analysis approach and implementation techniques for basic business metrics
1.1 Basic Syntax
① SELECT …A… FROM …B… WHERE …C…
- A: column names
- B: table name
- C: filter conditions
SELECT user_name
FROM user_info
WHERE city='beijing' and sex='female'
limit 10;
Note: if the table is partitioned, the WHERE clause must filter on the partition column.
Show a table's partitions:
hive> show partitions table_name;
Query data from a specific partition:
hive> select * from table_name where partition_date='2018-04-10' ;
② GROUP BY (grouping and aggregation)
select goods_category as category,
count(distinct buyer_id) as user_num,
sum(pay_amount) as total_amount
from user_trade
where sale_date between '2019-01-01' and '2019-03-31'
group by goods_category
GROUP BY …… HAVING
SELECT user_name,
sum(pay_amount) as total_amount
FROM user_trade
WHERE year(dt)=2019 and month(dt)=4
GROUP BY user_name
HAVING sum(pay_amount)>50000;
③ ORDER BY
First attempt in the terminal:
SELECT user_name,
sum(pay_amount) as total_amount
FROM user_trade
WHERE year(dt)=2019 and month(dt)=4
GROUP BY user_name
HAVING sum(pay_amount)>50000
order by total_amount desc;
This errors out; in Hive, order by must be paired with limit:
SELECT user_name,
sum(pay_amount) as total_amount
FROM user_trade
WHERE year(dt)=2019 and month(dt)=4
GROUP BY user_name
HAVING sum(pay_amount)>50000
order by total_amount desc
limit 5;
④ HiveSQL execution order
FROM → WHERE → GROUP BY → HAVING → SELECT → ORDER BY
1.2 Common Functions
List the functions available in Hive:
show functions;
Show the usage of a specific function:
desc function <function_name>;
desc function extended <function_name>;
① Converting timestamps to dates
SELECT pay_time,
from_unixtime(pay_time,'yyyy-MM-dd hh:mm:ss'),
from_unixtime(pay_time,'yyyy-MM'),
from_unixtime(pay_time,'hh')
FROM user_trade
WHERE dt='2019-04-09';
Note: the month is an uppercase MM; the format letters are case-sensitive (hh is the 12-hour clock, HH the 24-hour clock).
from_unixtime(bigint unixtime, string format): converts a Unix timestamp into a date string in the given format, e.g.
- yyyy-MM-dd hh:mm:ss
- yyyy-MM-dd hh
- yyyy-MM-dd hh:mm
- yyyyMMdd
Extension: converting a date back into a timestamp — unix_timestamp(string date)
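The from_unixtime / unix_timestamp pair can be mimicked with Python's standard library, which makes their behavior easy to check locally. This is a sketch, not Hive itself: Hive formats in the session time zone, so UTC (gmtime/timegm) is used here to keep the output deterministic.

```python
import calendar
import time

def from_unixtime(ts, fmt="%Y-%m-%d %H:%M:%S"):
    # Hive pattern 'yyyy-MM-dd HH:mm:ss' corresponds to this strftime format
    return time.strftime(fmt, time.gmtime(ts))

def unix_timestamp(date_str, fmt="%Y-%m-%d %H:%M:%S"):
    # inverse direction: date string back to epoch seconds
    return calendar.timegm(time.strptime(date_str, fmt))

print(from_unixtime(0, "%Y-%m-%d"))           # 1970-01-01
print(unix_timestamp("1970-01-02 00:00:00"))  # 86400
```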
② Computing date intervals
SELECT user_name,
datediff('2019-05-01',to_date(firstactivetime))
FROM user_info
limit 10;
datediff(string enddate, string startdate): the number of days from startdate to enddate.
Extension: date addition and subtraction functions — date_add and date_sub (the date argument must be a string):
- date_add(string startdate, int days)
- date_sub(string startdate, int days)
date_add("2019-01-01",10)
The last argument can only be a bare number like 10; `interval 10 day(s)` or `10 day(s)` is MySQL syntax and does not work in Hive.
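The semantics of datediff and date_add are plain calendar arithmetic, which a short Python sketch pins down (the dates below are illustrative):

```python
from datetime import date, timedelta

def datediff(enddate, startdate):
    # enddate minus startdate, in days; both are 'yyyy-MM-dd' strings
    return (date.fromisoformat(enddate) - date.fromisoformat(startdate)).days

def date_add(startdate, days):
    # note: plain integer days, matching Hive (no MySQL-style 'interval')
    return (date.fromisoformat(startdate) + timedelta(days=days)).isoformat()

print(datediff("2019-05-01", "2019-04-21"))  # 10
print(date_add("2019-01-01", 10))            # 2019-01-11
```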
③ Conditional functions: case & if
SELECT case when age<20 then '20岁以下'
when age>=20 and age<30 then '20-30岁'
when age>=30 and age<40 then '30-40岁'
else '40岁以上' end as age_type,
count(distinct user_id) user_num
FROM user_info
GROUP BY case when age<20 then '20岁以下'
when age>=20 and age<30 then '20-30岁'
when age>=30 and age<40 then '30-40岁'
else '40岁以上' end;
Note: the alias must be written `end as age_type`; writing it as a quoted 'age_type' does not work (unlike MySQL).
SELECT sex,
if(level>5,'高','低') as level_type,
count(distinct user_id) user_num
FROM user_info
GROUP BY sex,
if(level>5,'高','低');
Note: the if(level>5,'高','低') in the SELECT must be exactly the same expression as in the GROUP BY; case when and if cannot be mixed between the two.
④ String functions
SELECT substr(firstactivetime,1,7) as month,
count(distinct user_id) user_num
FROM user_info
GROUP BY substr(firstactivetime,1,7);
substr(string A, int start, int len): if the length is omitted, the substring runs from the start position to the end of the string.
-- Case 1: extra1 is a JSON string
SELECT get_json_object(extra1, '$.phonebrand') as phone_brand,
count(distinct user_id) user_num
FROM user_info
GROUP BY get_json_object(extra1, '$.phonebrand');
-- Case 2: extra2 is a map
SELECT extra2['phonebrand'] as phone_brand,
count(distinct user_id) user_num
FROM user_info
GROUP BY extra2['phonebrand'];
get_json_object(string json_string, string path)
- param 1: the JSON field to parse
- param 2: a $.key path selecting the value to extract
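What get_json_object does is easy to mirror in Python: parse the JSON string, then index the key. The extra1 value below is a made-up example row:

```python
import json

# a sample value of the extra1 column (illustrative, not real data)
extra1 = '{"phonebrand": "iPhone", "systemtype": "ios"}'

# get_json_object(extra1, '$.phonebrand') is essentially:
phone_brand = json.loads(extra1)["phonebrand"]
print(phone_brand)  # iPhone
```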
⑤ Aggregate functions
SELECT avg(pay_amount) as avg_amount,
datediff(max(from_unixtime(pay_time,'yyyy-MM-dd')),min(from_unixtime(pay_time,'yyyy-MM-dd')))
FROM user_trade
WHERE year(dt)='2018' and user_name='ELLA';
Tip: max(from_unixtime(pay_time,'yyyy-MM-dd')) = from_unixtime(max(pay_time),'yyyy-MM-dd'), and formatting after taking the max touches only one value, so it is cheaper.
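The equivalence holds because the 'yyyy-MM-dd' format is monotonic: string order matches timestamp order. A quick check in Python, with made-up timestamps and UTC formatting for determinism:

```python
import time

# illustrative pay_time values (epoch seconds), not real data
pay_times = [1523180400, 1530417600, 1514764800]

def fmt(ts):
    # Hive's from_unixtime(ts, 'yyyy-MM-dd'), rendered in UTC
    return time.strftime("%Y-%m-%d", time.gmtime(ts))

# max of the formatted dates equals the formatted max timestamp
assert max(map(fmt, pay_times)) == fmt(max(pay_times))
print(fmt(max(pay_times)))  # 2018-07-01
```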
1.3 Key Exercises
Hive needs two layers (a subquery):
SELECT count(a.user_name)
FROM
(SELECT user_name,
count(distinct goods_category) as category_num
FROM user_trade
WHERE year(dt)='2018'
GROUP BY user_name
HAVING count(distinct goods_category)>2) a;
MySQL needs only one layer:
SELECT count(user_name),count(distinct goods_category)
FROM user_trade
WHERE year(dt)='2018'
GROUP BY user_name
HAVING count(distinct goods_category)>2;
Written this way in Hive, it errors: Hive requires that any field appearing in HAVING also appear in the same-level SELECT.
select case when age>=20 and age<30 then '20-30'
when age>=30 and age<40 then '30-40'
end as age_bins,
marr,
count(distinct user_id) as num
from
(
select user_id,
age,
get_json_object(extra1,'$.marriage_status') as marr
from user_info
where substring(firstactivetime,1,4)='2018' and age>=20 and age<40
) t
group by case when age>=20 and age<30 then '20-30'
when age>=30 and age<40 then '30-40'
end,marr;
SELECT a.age_type,
if(a.marriage_status=1,'已婚','未婚'),
count(distinct a.user_id)
FROM
(SELECT case when age<20 then '20岁以下'
when age>=20 and age<30 then '20-30岁'
when age>=30 and age<40 then '30-40岁'
else '40岁以上' end as age_type,
get_json_object(extra1, '$.marriage_status') as
marriage_status,
user_id
FROM user_info
WHERE to_date(firstactivetime) between '2018-01-01'
and '2018-12-31') a
WHERE a.age_type in ('20-30岁','30-40岁')
GROUP BY a.age_type,
if(a.marriage_status=1,'已婚','未婚');
1.4 Common Errors and Fixes
- Draft the SQL in Notepad++ first, then Ctrl+A, Ctrl+C to copy and Shift+Insert to paste into Xshell. Crucially, indent only with spaces in Notepad++ — never TAB, or the pasted code will not run — and end every HQL statement with ;
- An HQL/MySQL difference around HAVING: HQL requires that computed fields used in HAVING also appear in the same-level SELECT; see 1.3 Key Exercises
- If the queried table is partitioned, the WHERE clause must filter on the partition column; see 1.1 ①
- HQL's order by must be combined with limit; see 1.1 ③
- The month format letter is MM; see 1.2 ①
- The last argument of date_add(string startdate, int days) or date_sub only accepts a number — no interval or day, another HQL/MySQL difference; see 1.2 ②
- In HQL the case when alias must be written end as age_type, not a quoted 'age_type', unlike MySQL; see 1.2 ③
- In HQL the if(level>5,'高','低') in SELECT must match the one in GROUP BY; case when and if cannot be mixed; see 1.2 ③
- Punctuation mistakes
- Forgetting to alias a subquery
- Using a wrong field name
- Dropping a comma separator
2 HiveSQL Core Skills 2: Table Joins
- Master the various joins in HQL and how to combine them
- Master the technique of translating business metrics into query logic
2.1 Basic Syntax
① inner join
- Tables must be aliased when joined
- The condition after on must act as a unique key
`inner` can be omitted; the effect is the same
Solution 1
select distinct t.user_name
from
trade_2019 t
inner join
(
select distinct user_name
from user_refund
where year(dt)=2019
) u
on t.user_name=u.user_name;
Solution 2
SELECT a.user_name
FROM
(SELECT distinct user_name
FROM user_trade
WHERE year(dt)=2019)a
JOIN
(SELECT distinct user_name
FROM user_refund
WHERE year(dt)=2019)b
on a.user_name=b.user_name;
Solution 1
select distinct a1.user_name
from trade_2017 a1,trade_2018 a2
where a1.user_name=a2.user_name;
Solution 2
select a1.user_name
from
(
select distinct user_name
from user_trade
where year(dt)=2017
) a1
inner join
(
select distinct user_name
from user_trade
where year(dt)=2018
) a2
on a1.user_name=a2.user_name;
Solution 1
select distinct a1.user_name
from trade_2017 a1,trade_2018 a2,trade_2018 a3
where a1.user_name=a2.user_name
and a2.user_name=a3.user_name
Solution 2; the last line could equivalently be JOIN trade_2019 c on a.user_name=c.user_name;
SELECT distinct a.user_name
FROM trade_2017 a
JOIN trade_2018 b on a.user_name=b.user_name
JOIN trade_2019 c on b.user_name=c.user_name;
Solution 3; the last line could equivalently be )c on b.user_name=c.user_name;
SELECT a.user_name
FROM
(
SELECT distinct user_name
FROM trade_2017
)a
JOIN
(
SELECT distinct user_name
FROM trade_2018
)b on a.user_name=b.user_name
JOIN
(
SELECT distinct user_name
FROM trade_2019
)c on a.user_name=c.user_name;
When the tables are very large, the third form is recommended. Why? The distinct subqueries shrink each table first, reducing the join workload.
② left join & right join
A left join takes the left table (table 1) as the full set and returns the matching rows from the right table (table 2); unmatched rows show NULL. If a key in the left table matches multiple rows in the right table, extra rows are produced.
SELECT a.user_id,
a.user_name
FROM user_list_1 a
LEFT JOIN user_list_2 b
ON a.user_id=b.user_id
WHERE b.user_id is null;
Solution 1
select t.user_name
from
(
select distinct user_name
from trade_2019
) t
left join
(
select distinct user_name
from user_refund
where year(dt)=2019
) u
on t.user_name=u.user_name
where u.user_name is null;
Solution 2
SELECT a.user_name
FROM
(SELECT distinct user_name
FROM user_trade
WHERE year(dt)=2019)a
LEFT JOIN
(SELECT distinct user_name
FROM user_refund
WHERE year(dt)=2019)b on a.user_name=b.user_name
WHERE b.user_name is null;
Solution 1
select u.master,
count(distinct u.user_name) as num
from
(
select distinct user_name
from trade_2019
) t
inner join
(
select user_name,
get_json_object(extra1,'$.education') as master
from user_info
) u
on t.user_name=u.user_name
group by u.master;
Solution 2
SELECT b.education,
count(distinct a.user_name)
FROM
(SELECT distinct user_name
FROM user_trade
WHERE year(dt)=2019)a
LEFT JOIN
(SELECT user_name,
get_json_object(extra1, '$.education') as education
FROM user_info)b
on a.user_name=b.user_name
GROUP BY b.education;
Note: get_json_object(extra1, '$.education') can be replaced with extra2['education'].
Solution 1
select t1.user_name
from
(
select t1.user_name
from
(select distinct user_name from trade_2017) t1,
(select distinct user_name from trade_2018) t2
where t1.user_name=t2.user_name
) t1
left join
(
select distinct user_name from trade_2019
) t2
on t1.user_name=t2.user_name
where t2.user_name is null;
Solution 1 needs distinct; when the tables are small it hardly matters, but on large partitioned tables omitting distinct makes execution very slow.
Solution 2
SELECT a.user_name
FROM
(SELECT distinct user_name
FROM trade_2017)a
JOIN
(SELECT distinct user_name
FROM trade_2018)b on a.user_name=b.user_name
LEFT JOIN
(SELECT distinct user_name
FROM trade_2019)c on b.user_name=c.user_name
WHERE c.user_name is null;
③ full join
SELECT *
FROM user_list_1 a
FULL JOIN user_list_2 b
ON a.user_id=b.user_id;
SELECT coalesce(a.user_name,b.user_name)
FROM user_list_1 a
FULL JOIN user_list_2 b
on a.user_id=b.user_id;
coalesce(expression_1, expression_2, ..., expression_n)
Evaluates the expressions in order and returns the first non-NULL value; if every expression is NULL, it returns NULL.
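The rule above can be stated as a few lines of Python, with None playing the role of NULL (a sketch of the semantics, not Hive's implementation):

```python
def coalesce(*exprs):
    # return the first non-NULL (non-None) argument, in order
    for e in exprs:
        if e is not None:
            return e
    return None  # every argument was NULL

print(coalesce(None, "Alice", "Bob"))  # Alice
print(coalesce(None, None))            # None
```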
④ union all
select user_id,
user_name
from user_list_1
union all
select user_id,
user_name
from user_list_3;
SELECT count(distinct a.user_name),
count(a.user_name)
FROM
(
SELECT user_name
FROM trade_2017
UNION ALL
SELECT user_name
FROM trade_2018
UNION ALL
SELECT user_name
FROM trade_2019)a;
SELECT count(distinct a.user_name),
count(a.user_name)
FROM
(
SELECT user_name
FROM trade_2017
UNION
SELECT user_name
FROM trade_2018
UNION
SELECT user_name
FROM trade_2019)a;
The difference between UNION ALL and UNION: UNION ALL keeps every row, duplicates included; UNION deduplicates the combined result (at the cost of an extra aggregation step).
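A quick way to see the difference is to run both against a toy dataset. Here SQLite stands in for Hive (the UNION / UNION ALL semantics are the same), and trade_2017 / trade_2018 are made-up stand-ins for the post's tables:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE trade_2017 (user_name TEXT);
CREATE TABLE trade_2018 (user_name TEXT);
INSERT INTO trade_2017 VALUES ('Alice'), ('Bob'), ('Alice');
INSERT INTO trade_2018 VALUES ('Alice'), ('Carol');
""")
# UNION ALL keeps every row: 3 + 2 = 5
all_rows = conn.execute(
    "SELECT user_name FROM trade_2017 UNION ALL SELECT user_name FROM trade_2018"
).fetchall()
# UNION deduplicates across both inputs: Alice, Bob, Carol
uniq_rows = conn.execute(
    "SELECT user_name FROM trade_2017 UNION SELECT user_name FROM trade_2018"
).fetchall()
print(len(all_rows), len(uniq_rows))  # 5 3
```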
Solution 1
select user_name,
sum(pay_amount),
sum(refund_amount)
from
(
select user_name,
sum(amount) as pay_amount,
0 as refund_amount
from trade_2019
group by user_name
union all
select user_name,
0 as pay_amount,
sum(refund_amount) as refund_amount
from user_refund
where year(dt)=2019
group by user_name
) u
group by user_name;
Solution 2: full join
select coalesce(u.user_name,r.user_name) as user_name,
u.pay_amount,
r.refund_amount
from
(
select user_name,
sum(amount) as pay_amount
from trade_2019
group by user_name
) u
full join
(
select user_name,
sum(refund_amount) as refund_amount
from user_refund
where year(dt)=2019
group by user_name
) r
on u.user_name=r.user_name;
How do we turn the NULLs into 0?
select coalesce(u.user_name,r.user_name) as user_name,
if(u.pay_amount is null,0,u.pay_amount),
if(r.refund_amount is null,0,r.refund_amount)
from
(
select user_name,
sum(amount) as pay_amount
from trade_2019
group by user_name
) u
full join
(
select user_name,
sum(refund_amount) as refund_amount
from user_refund
where year(dt)=2019
group by user_name
) r
on u.user_name=r.user_name;
select u.user_name,
u.pay_amount,
r.refund_amount
from
(
select user_name,
sum(amount) as pay_amount
from trade_2019
group by user_name
) u
left join
(
select user_name,
sum(refund_amount) as refund_amount
from user_refund
where year(dt)=2019
group by user_name
) r
on u.user_name=r.user_name;
2.2 Key Exercises
Solution 1
select case when u.age<20 then '20以下'
when u.age>=20 and u.age<30 then '20-30岁'
when u.age>=30 and u.age<40 then '30-40岁'
else '40以上'
end as age_bins,
count(distinct u.user_name) as num
from
(
select user_name,
avg(age) as age
from user_info
where substring(firstactivetime,1,4)='2017'
group by user_name
) u
left join
(
select user_name
from user_trade
where year(dt)>=2017
group by user_name
) t
on u.user_name=t.user_name
where t.user_name is null
group by case when u.age<20 then '20以下'
when u.age>=20 and u.age<30 then '20-30岁'
when u.age>=30 and u.age<40 then '30-40岁'
else '40以上'
end;
Solution 2
SELECT a.age_level,
count(a.user_name)
FROM
(SELECT user_name,
case when age<20 then '20岁以下'
when age>=20 and age<30 then '20-30岁'
when age>=30 and age<40 then '30-40岁'
else '40岁以上' end as age_level
FROM user_info
WHERE year(firstactivetime)=2017)a
LEFT JOIN
(SELECT distinct user_name
FROM user_trade
WHERE dt>'0')b
on a.user_name=b.user_name
WHERE b.user_name is null
GROUP BY a.age_level;
select year(firstactivetime),
month(firstactivetime),
day(firstactivetime),
hour(firstactivetime),
minute(firstactivetime),
second(firstactivetime)
from user_info
limit 5;
select u1.h,
count(u1.user_name) as num
from
(
select user_name,
hour(firstactivetime) as h
from user_info
) u1
inner join
(
select user_name
from user_trade
where year(dt)=2018 or year(dt)=2019
group by user_name
) u2
on u1.user_name=u2.user_name
group by u1.h;
3 HiveSQL Core Skills 3: Window Functions
- Master sum() and avg() for cumulative calculations
- Master row_number() and rank() for ranking
- Master ntile() for bucketed queries
- Master the offset analysis functions lag() and lead()
3.1 Cumulative Window Functions
① sum(…) over(……)
When building reports, you often need the cumulative total up to a given month, which in Excel is a simple formula (see Excel怎么累计求和).
How do we compute such cumulative values in HiveSQL? With window functions!
A few notes on window functions:
- Window functions are also called analytic functions.
- A window function is similar to an aggregate function, but it produces one result per input row.
- Aggregate functions (sum, avg, max, ...) collapse many rows into one, so the output usually has fewer rows than the input. When we want to show both the pre-aggregation rows and the aggregated values, window functions come in.
- Window functions are evaluated at SELECT time, before ORDER BY.
select mon,
pay_amount,
sum(pay_amount) over(order by mon) as pay_amount
from
(
select month(dt) as mon,
sum(pay_amount) as pay_amount
from user_trade
where year(dt)=2018
group by month(dt)
) a;
select y,m,
pay_amount,
sum(pay_amount) over (partition by y order by m) as total_pay
from
(
select year(dt) as y,
month(dt) as m,
sum(pay_amount) as pay_amount
from user_trade
where year(dt) in (2017,2018)
group by year(dt),month(dt)
) u;
- partition by inside over() does the grouping
- order by sets the order of accumulation: ascending ASC (the default) or descending DESC
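The cumulative behavior of sum(...) over(order by ...) can be checked locally; SQLite (3.25+, bundled with recent Python) implements the same window-function semantics as Hive here, and the monthly table below is a toy stand-in:

```python
import sqlite3  # window functions need SQLite >= 3.25

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE monthly (mon INTEGER, pay_amount INTEGER);
INSERT INTO monthly VALUES (1, 100), (2, 150), (3, 50);
""")
# with only ORDER BY in OVER(), each row sums everything up to itself
rows = conn.execute("""
SELECT mon,
       pay_amount,
       SUM(pay_amount) OVER (ORDER BY mon) AS running_total
FROM monthly
ORDER BY mon
""").fetchall()
print(rows)  # [(1, 100, 100), (2, 150, 250), (3, 50, 300)]
```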
② avg(…) over(……) ☆☆☆☆☆☆
When you look at stock K-line charts, you often see 7-day and 30-day moving-average curves. How do we compute a moving average with a window function?
select mon,
pay_amount,
avg(pay_amount) over(order by mon rows between 2 preceding and current row) as every3monAVG
from
(
select month(dt) as mon,
sum(pay_amount) as pay_amount
from user_trade
where year(dt)=2018
group by month(dt)
) a;
We use rows between 2 preceding and current row to restrict the range of the moving average: the window covers the current row plus the two rows before it, which is exactly the "last three months" the exercise asks for.
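The frame clause can be verified the same way; again SQLite stands in for Hive and the monthly amounts are made up:

```python
import sqlite3  # window functions need SQLite >= 3.25

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE monthly (mon INTEGER, pay_amount INTEGER);
INSERT INTO monthly VALUES (1, 90), (2, 150), (3, 60), (4, 210);
""")
# each average covers at most 3 rows: the current one and the 2 before it
rows = conn.execute("""
SELECT mon,
       AVG(pay_amount) OVER (ORDER BY mon
                             ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) AS mov_avg
FROM monthly
ORDER BY mon
""").fetchall()
print(rows)  # [(1, 90.0), (2, 120.0), (3, 100.0), (4, 140.0)]
```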
③ Syntax summary
3.2 Ranking Window Functions
row_number() 、rank()、dense_rank()
select user_name,
count(distinct goods_category) as goods_num,
row_number() over (order by count(distinct goods_category)) as row_number,
rank() over (order by count(distinct goods_category)) as rank,
dense_rank() over (order by count(distinct goods_category)) as dense_rank
from user_trade
where year(dt)=2019 and month(dt)=1
group by user_name;
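The three functions differ only in how they treat ties, which a tiny dataset makes obvious. SQLite stands in for Hive below, with made-up users where Alice and Bob tie:

```python
import sqlite3  # window functions need SQLite >= 3.25

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE scores (user_name TEXT, goods_num INTEGER);
INSERT INTO scores VALUES ('Alice', 1), ('Bob', 1), ('Carol', 2);
""")
# rank() gives tied rows the same rank and skips the next one (1,1,3);
# dense_rank() does not skip (1,1,2); row_number() breaks ties arbitrarily.
rows = conn.execute("""
SELECT user_name,
       ROW_NUMBER() OVER (ORDER BY goods_num) AS rn,
       RANK()       OVER (ORDER BY goods_num) AS rk,
       DENSE_RANK() OVER (ORDER BY goods_num) AS drk
FROM scores
ORDER BY user_name
""").fetchall()
for r in rows:
    print(r)
```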
Solution 1
select user_name
from
(
select user_name,
dense_rank() over (order by pay_amount desc) as ranking
from
(
select user_name,
sum(pay_amount) as pay_amount
from user_trade
where year(dt)=2019
group by user_name
) b
)a
where ranking in (10,20,30);
Solution 2
select user_name,
pay_amount,
ranking
from
(
select user_name,
sum(pay_amount) as pay_amount,
dense_rank() over(order by sum(pay_amount) desc) as ranking
from user_trade
where year(dt)=2019
group by user_name
) t
where ranking in (10,20,30);
3.3 Slicing Window Functions
ntile(n) over(……) ☆☆☆☆☆☆ — splits the ordered rows into n roughly equal buckets, numbered 1 to n
select user_name,
sum(pay_amount) as pay_amount,
ntile(5) over(order by sum(pay_amount) desc) as level
from user_trade
where year(dt)=2019 and month(dt)=1
group by user_name;
select user_name,
refund_amount,
level
from
(
select user_name,
sum(refund_amount) as refund_amount,
ntile(10) over(order by sum(refund_amount) desc) as level
from user_refund
where year(dt)=2019
group by user_name
) a
where level=1;
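The bucket assignment is easy to see on a toy table; SQLite stands in for Hive, and the refund amounts are invented:

```python
import sqlite3  # window functions need SQLite >= 3.25

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE refunds (user_name TEXT, refund_amount INTEGER);
INSERT INTO refunds VALUES ('a', 400), ('b', 300), ('c', 200), ('d', 100);
""")
# NTILE(2) over 4 rows ordered descending: top half -> 1, bottom half -> 2
rows = conn.execute("""
SELECT user_name,
       NTILE(2) OVER (ORDER BY refund_amount DESC) AS level
FROM refunds
ORDER BY refund_amount DESC
""").fetchall()
print(rows)  # [('a', 1), ('b', 1), ('c', 2), ('d', 2)]
```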
3.4 Offset Window Functions
lag, lead ☆☆☆☆☆☆ — lag(col, n, default) returns the value of col from n rows earlier within the partition (default, or NULL if no default is given, when no such row exists); lead(col, n, default) looks n rows ahead.
select user_name,
dt,
lag(dt,1,'None') over(partition by user_name order by dt),
lag(dt) over(partition by user_name order by dt),
lag(dt,2,'None') over(partition by user_name order by dt),
lag(dt,2) over(partition by user_name order by dt)
from user_trade
where dt is not null
and user_name in ('Alice','Alexander');
user_name | dt | lag_window_0 | lag_window_1 | lag_window_2 | lag_window_3 |
Alexander | 2017-04-08 | None | NULL | None | NULL |
Alexander | 2017-12-02 | 2017-04-08 | 2017-04-08 | None | NULL |
Alexander | 2017-12-02 | 2017-12-02 | 2017-12-02 | 2017-04-08 | 2017-04-08 |
Alexander | 2018-02-03 | 2017-12-02 | 2017-12-02 | 2017-12-02 | 2017-12-02 |
Alice | 2017-12-11 | None | NULL | None | NULL |
Alice | 2018-07-01 | 2017-12-11 | 2017-12-11 | None | NULL |
Alice | 2018-07-01 | 2018-07-01 | 2018-07-01 | 2017-12-11 | 2017-12-11 |
Alice | 2018-08-15 | 2018-07-01 | 2018-07-01 | 2018-07-01 | 2018-07-01 |
select user_name,
dt,
lead(dt,1,'None') over(partition by user_name order by dt),
lead(dt) over(partition by user_name order by dt),
lead(dt,2,'None') over(partition by user_name order by dt),
lead(dt,2) over(partition by user_name order by dt)
from user_trade
where dt is not null
and user_name in ('Alice','Alexander');
user_name | dt | lead_window_0 | lead_window_1 | lead_window_2 | lead_window_3 |
Alexander | 2017-04-08 | 2017-12-02 | 2017-12-02 | 2017-12-02 | 2017-12-02 |
Alexander | 2017-12-02 | 2017-12-02 | 2017-12-02 | 2018-02-03 | 2018-02-03 |
Alexander | 2017-12-02 | 2018-02-03 | 2018-02-03 | None | NULL |
Alexander | 2018-02-03 | None | NULL | None | NULL |
Alice | 2017-12-11 | 2018-07-01 | 2018-07-01 | 2018-07-01 | 2018-07-01 |
Alice | 2018-07-01 | 2018-07-01 | 2018-07-01 | 2018-08-15 | 2018-08-15 |
Alice | 2018-07-01 | 2018-08-15 | 2018-08-15 | None | NULL |
Alice | 2018-08-15 | None | NULL | None | NULL |
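A minimal runnable version of the two tables above, with SQLite standing in for Hive and only Alice's (made-up) rows kept:

```python
import sqlite3  # window functions need SQLite >= 3.25

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE user_trade (user_name TEXT, dt TEXT);
INSERT INTO user_trade VALUES
  ('Alice', '2017-12-11'), ('Alice', '2018-07-01'), ('Alice', '2018-08-15');
""")
# lag looks one row back, lead one row forward; 'None' fills the edges
rows = conn.execute("""
SELECT dt,
       LAG(dt, 1, 'None')  OVER (PARTITION BY user_name ORDER BY dt) AS prev_dt,
       LEAD(dt, 1, 'None') OVER (PARTITION BY user_name ORDER BY dt) AS next_dt
FROM user_trade
ORDER BY dt
""").fetchall()
for r in rows:
    print(r)
# ('2017-12-11', 'None', '2018-07-01')
# ('2018-07-01', '2017-12-11', '2018-08-15')
# ('2018-08-15', '2018-07-01', 'None')
```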
select distinct user_name
from
(
select user_name,
dt as first_pay,
lead(dt,1) over(partition by user_name order by dt) as next_pay
from user_trade
where dt is not null
) a
where datediff(next_pay,first_pay)>100;
3.5 Key Exercises
select city,
sex,
user_name,
pay_amount,
ranking
from
(
select u2.city,
u2.sex,
u1.user_name,
u1.pay_amount,
dense_rank() over(partition by u2.city,u2.sex order by u1.pay_amount desc) as ranking
from
(
select user_name,
sum(pay_amount) as pay_amount
from user_trade
where year(dt)=2018
group by user_name
) u1
left join
(
select user_name,
sex,
city
from user_info
) u2
on u1.user_name=u2.user_name
) a
where ranking<=3;
select phonebrand,
user_name,
refund_amount,
level
from
(
select r.user_name,
u.phonebrand,
r.refund_amount,
ntile(4) over(partition by u.phonebrand order by r.refund_amount desc) as level
from
(
select user_name,
sum(refund_amount) as refund_amount
from user_refund
where dt is not null
group by user_name
) r
left join
(
select distinct user_name,
get_json_object(extra1,'$.phonebrand') as phonebrand
from user_info
) u
on r.user_name=u.user_name
) t
where level=1;
4 Other Functions
4.1 String Functions
substr, substring
substr(string A, int start): returns the substring of A from position start to the end; equivalent to substring(string A, int start).
substr(string A, int start, int len): returns the substring of A starting at position start with length len; equivalent to substring(string A, int start, int len).
substring_index
Extracts part of a string around a delimiter:
substring_index(str, delim, count)
>>> select substring_index("blog.jb51.net",".",2) as abstract from my_content_t
blog.jb51
Note: if count is negative, e.g. -2, counting starts from the right and the substring runs to the end of the string.
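substring_index is straightforward to emulate, which also pins down the negative-count behavior; this is a sketch of the semantics, not the actual implementation:

```python
def substring_index(s, delim, count):
    # split on the delimiter, then rejoin the leading (count > 0)
    # or trailing (count < 0) pieces
    parts = s.split(delim)
    if count > 0:
        return delim.join(parts[:count])
    if count < 0:
        return delim.join(parts[count:])
    return ""

print(substring_index("blog.jb51.net", ".", 2))   # blog.jb51
print(substring_index("blog.jb51.net", ".", -2))  # jb51.net
```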
5 Common Optimization Techniques
- Learn techniques that improve Hive execution efficiency
- Learn to switch perspectives and solve a problem in several ways
Hive is a standard data-warehouse component in the big-data world, so efficiency matters both when designing tables and when querying. What hurts Hive performance is almost never raw data volume, but data skew, data redundancy, too many jobs or too much I/O, unreasonable MapReduce allocation, and so on.
Tuning Hive covers both optimizing the HiveSQL statements themselves and adjusting Hive configuration and MapReduce settings.
- Column pruning: read only the columns the query needs. With many columns or large data volumes, selecting all columns or omitting the partition filter forces full-column and full-table scans, which are very inefficient. The related Hive setting is hive.optimize.cp, default true.
- Partition pruning: read only the partitions the query needs. The related setting is hive.optimize.pruner, default true.
5.2 Sorting: sort by instead of order by
HiveSQL's order by works like in other SQL dialects: it sorts the result **globally** on some column, which forces all map-side data into a single reducer and may run for a very long time on large data.
With sort by, Hive starts multiple reducers as needed and guarantees that each reducer's output is locally sorted. To control which reducer each map-side key goes to, sort by is usually combined with distribute by.
Without distribute by, map-side rows are distributed to reducers at random.
--unoptimized
select a,b,c
from table
where xxx
order by a
limit 10;
--optimized
select a,b,c
from table
where xxx
distribute by user_id
sort by a
limit 10;
5.3 Deduplication: group by instead of distinct
-- fetch all paying users from user_trade
-- original
SELECT distinct user_name
FROM user_trade
WHERE dt>'0';
-- optimized
SELECT user_name
FROM user_trade
WHERE dt>'0'
GROUP BY user_name;
Note: with very large data volumes (and many duplicate values), deduplicating with group by first and then counting with count() is faster than a direct count(distinct ...).
5.4 Aggregation: grouping sets, cube, rollup
grouping sets
The usual approach:
-- distribution by sex --
SELECT sex,
count(distinct user_id)
FROM user_info
GROUP BY sex;
-- distribution by city --
SELECT city,
count(distinct user_id)
FROM user_info
GROUP BY city;
-- distribution by level --
SELECT level,
count(distinct user_id)
FROM user_info
GROUP BY level;
Drawback: three separate SQL statements, executed three times — repetitive and slow.
Optimized:
-- user distribution by sex, city, and level
SELECT sex,
city,
level,
count(distinct user_id)
FROM user_info
GROUP BY sex,city,level
GROUPING SETS (sex,city,level);
-- distribution by sex --
SELECT sex,
count(distinct user_id)
FROM user_info
GROUP BY sex;
-- city distribution within each sex --
SELECT sex,
city,
count(distinct user_id)
FROM user_info
GROUP BY sex,
city;
Optimized:
SELECT sex,
city,
count(distinct user_id)
FROM user_info
GROUP BY sex,city
GROUPING SETS (sex,(sex,city));
Note: rows whose second column is NULL give the distribution by sex alone; the rows with a city give the city distribution within each sex.
cube
cube: aggregates over every combination of the group by dimensions
-- user distribution for every combination of sex, city, and level
SELECT sex,
city,
level,
count(distinct user_id)
FROM user_info
GROUP BY sex,city,level
GROUPING SETS (sex,city,level,(sex,city),(sex,level),
(city,level),(sex,city,level));
-- optimized --
-- user distribution for every combination of sex, city, and level --
SELECT sex,
city,
level,
count(distinct user_id)
FROM user_info
GROUP BY sex,city,level
with cube;
Note: after the query runs, organizing the output is essential!
rollup
rollup: aggregates hierarchically, leading with the leftmost dimension; it is a subset of cube.
SELECT a.dt,
sum(a.year_amount),
sum(a.month_amount)
FROM
(SELECT substr(dt,1,4) as dt,
sum(pay_amount) year_amount,
0 as month_amount
FROM user_trade
WHERE dt>'0'
GROUP BY substr(dt,1,4)
UNION ALL
SELECT substr(dt,1,7) as dt,
0 as year_amount,
sum(pay_amount) as month_amount
FROM user_trade
WHERE dt>'0'
GROUP BY substr(dt,1,7)
)a
GROUP BY a.dt;
Optimized:
-- optimized
SELECT year(dt) as year,
month(dt) as month,
sum(pay_amount)
FROM user_trade
WHERE dt>'0'
GROUP BY year(dt),
month(dt)
with rollup;
5.5 Solving Problems from a Different Angle
All roads lead to Rome, and SQL is no exception: many different statements achieve the same result, so learn to switch perspectives and apply them flexibly.
Solution 1
-- users who purchased in both 2017 and 2018 --
SELECT a.user_name
FROM
(SELECT distinct user_name
FROM user_trade
WHERE year(dt)=2017)a
JOIN
(SELECT distinct user_name
FROM user_trade
WHERE year(dt)=2018)b on a.user_name=b.user_name;
Solution 2
SELECT a.user_name
FROM
(SELECT user_name,
count(distinct year(dt)) as year_num
FROM user_trade
WHERE year(dt) in (2017,2018)
GROUP BY user_name)a
WHERE a.year_num=2;
SELECT user_name,
count(distinct year(dt)) as year_num
FROM user_trade
WHERE year(dt) in (2017,2018)
GROUP BY user_name
having count(distinct year(dt))=2;
5.6 Enable Parallel Execution for union all
Jobs in Hive that do not depend on each other can run in parallel; the classic case is multiple subqueries combined with union all. When cluster resources are relatively plentiful, enable parallel execution with: set hive.exec.parallel=true;
-- total payment and refund amount per user --
SELECT a.user_name,
sum(a.pay_amount),
sum(a.refund_amount)
FROM
(
SELECT user_name,
sum(pay_amount) as pay_amount,
0 as refund_amount
FROM user_trade
WHERE dt>'0'
GROUP BY user_name
UNION ALL
SELECT user_name,
0 as pay_amount,
sum(refund_amount) as refund_amount
FROM user_refund
WHERE dt>'0'
GROUP BY user_name
)a
GROUP BY a.user_name;
5.7 Join Optimization
- Small tables first, large table last: Hive assumes the last table in the query is the largest; it buffers the other tables and streams the last one.
- Use the same join key: when joining three or more tables, if every on clause uses the same join key, only a single MapReduce job is produced.
- Filter data as early as possible: reduce the data volume at every stage — add partition filters for partitioned tables, and select only the columns you actually need.
5.8 Follow Strict Mode
Strict mode forbids three kinds of risky HiveSQL statements; executing one raises an error immediately:
- querying a partitioned table without restricting the partition column
- a join between two tables that produces a Cartesian product
- an order by without a limit
Enable strict mode by setting hive.mapred.mode=strict; it helps avoid needless waste of cluster resources.
5.9 SQL Rules from the Alibaba Development Manual