hive sql调优

最新推荐文章于 2022-12-18 11:50:55 发布

长尾效应

最新推荐文章于 2022-12-18 11:50:55 发布

阅读量203

点赞数

本文链接：https://blog.csdn.net/weixin_40668508/article/details/118965716

版权

1.使用multi-table-insert写法对union all进行优化

（1）原SQL（目标是往一个新表user_new里，插入最大值和最小值的数据）

insert into table user_new
select sex, max(age) as stat, 'max' as class
from user
group by sex
union all
select sex, min(age) as stat, 'min' as class
from user
group by sex;

（2）优化SQL（完成相同的目标，使用multi-table-insert写法）

from user
insert into table user_new
select sex, max(age) as stat, 'max' as class
group by sex
insert into table user_new
select sex, min(age) as stat, 'min' as class
group by sex;

2.使用grouping sets替代union all进行优化

(1)原SQL（目标是进行2个group by，但后一个的分组字段是前者的子集）

select * from 
(
  select sex, age, count(1) as user_num
  from user
  group by sex, age
  union all
  select sex, null as age, count(1) as user_num
  from user
  group by sex
) a;

select sex, age, count(1)
from user
group by sex, age
grouping sets((sex),(sex,age));

3.使用子查询对count(distinct)可能造成的数据倾斜进行优化

(1)原SQL（统计每种性别有多少种不同年龄）

select sex, count(distinct age) as age_num
from sex
group by sex;

(2)优化SQL（完成相同的目标，使用子查询避免数据出现倾斜而导致性能问题）

select sex, count(1) as age_num
from (
  select sex, age, count(1) as num
  group by sex, age
) a;

4.使用left join替代not in完成取A表中没有但B表中有的数据

（1）原SQL（查找订单日期不在2020-10-23日的用户的user_id）

select user_id
from user
where order_id not in(
  select order_id from order
  where order_date != '2020-10-23'
);

(2)优化SQL（完成相同的目标，用left join去实现，避免not in不走索引带来的问题）

select a.user_id
from user a
left join
order b
on a.order_id = b.order_id and b.order_date != '2020-10-23'
where b.user_id is null;

总结

在这里插入图片描述

在有数据倾斜的情况下，第一种写法的SQL方式更优。
当数据大到一定的量级时，第一种写法的SQL有两个作业，可以把处理逻辑分散到两个阶段中，即第一个阶段先处理一部分数据，缩小数据量，第二个阶段在已经缩小的数据集上继续处理。
而第二种写法的SQL，经过Map阶段处理的数据还非常多时，所有的数据却都需要交给一个Reduce节点去处理，就好比千军万马过独木桥一样，不仅无法利用到分布式集群的优势，还要浪费大量时间在等待，而这个等待的时间远比第一种写法的SQL多个MapReduce所延长的流程导致额外花费的时间还多。
在Hive3.0中即使遇到数据倾斜，第二种写法的SQL将hive.optimize.countdistinct设置为true，则整个写法也能达到第一种写法的SQL的效果。

开窗函数

select name,orderdate,cost,
sum(cost) over() as sample1,--所有行相加
sum(cost) over(partition by name) as sample2,--按name分组，组内数据相加
sum(cost) over(partition by name order by orderdate) as sample3,--按name分组，组内数据累加
sum(cost) over(partition by name order by orderdate rows between UNBOUNDED PRECEDING and current row )  as sample4 ,--和sample3一样,由起点到当前行的聚合
sum(cost) over(partition by name order by orderdate rows between 1 PRECEDING   and current row) as sample5, --当前行和前面一行做聚合
sum(cost) over(partition by name order by orderdate rows between 1 PRECEDING   AND 1 FOLLOWING  ) as sample6,--当前行和前边一行及后面一行
sum(cost) over(partition by name order by orderdate rows between current row and UNBOUNDED FOLLOWING ) as sample7 --当前行及后面所有行
from t_window;

长尾效应

关注

0
点赞
踩
0

收藏

觉得还不错? 一键收藏
0
评论
hive sql调优

1.使用multi-table-insert写法对union all进行优化（1）原SQL（目标是往一个新表user_new里，插入最大值和最小值的数据）insert into table user_newselect sex, max(age) as stat, 'max' as classfrom usergroup by sexunion allselect sex, min(age) as stat, 'min' as classfrom usergroup by sex;（2
复制链接

扫一扫