Hive中SQL的优化技巧

Hive中SQL的优化技巧,核心思想是避免数据倾斜。


1、避免在同一个查询中同时出现count, distinct,group by

2、left join 时把小数据量的表放在前面

3、尽量使用子查询


参数配置

SET mapred.reduce.tasks=50;
SET mapreduce.reduce.memory.mb=6000;
SET mapreduce.reduce.shuffle.memory.limit.percent=0.06;

涉及数据倾斜的话,主要是reduce中数据倾斜的问题,可能通过设置hive中reduce的并行数,reduce的内存大小单位为m,reduce中 shuffle的刷磁盘的比例,来解决。


实例一

--分月
select substr(a.day,1,6)month,count(distinct a.userid)  
from dms.tracklog_5min a  
join default.site_activeuser_tmp c
on a.userid=c.id
where a.day>='201505' and a.day<'201506'
group by substr(a.day,1,6) ;


--优化后
select '201505',count(*) from 
(
select distinct c.userid
from 
(select userid   from default.site_activeuser_tmp where month='201505') c
left join
(
select userid  from
dms.tracklog_5min 
where day>='201505' and day<'201506'  
) tmp 
on tmp.userid=c.userid
) t;


实例二

--分事业部
select substr(a.day,1,6)month,count(distinct a.userid) ,b.dept_name 
from dms.tracklog_5min a join   default.d_channel b
on a.host=b.host  
join default.site_activeuser_tmp c
on a.userid=c.id
where a.day>='201505' and a.day<'201506'
group by substr(a.day,1,6),b.dept_name;

--优化后
SET mapred.reduce.tasks=50;
SET mapreduce.reduce.memory.mb=6000;
SET mapreduce.reduce.shuffle.memory.limit.percent=0.06;

select "201505" month,count(t.userid),t.dept_name 
from 
(select userid from default.site_activeuser_tmp where month='201505') c
left join
(
select distinct a.userid userid,b.dept_name dept_name from default.d_channel b
left join 
(select host,userid from dms.tracklog_5min where day>='201505' and day<'201506' ) a
on a.host=b.host  
)t
on t.userid=c.userid
group by t.dept_name ;


实例三

--分产品
select substr(a.day,1,6)month,count(distinct a.userid) ,b.dept_name,b.prod_name 
from dms.tracklog_5min a join   default.d_channel b
on a.host=b.host  
join default.site_activeuser_tmp c
on a.userid=c.id
where a.day>='201505' and a.day<'201506'
group by substr(a.day,1,6),b.dept_name,b.prod_name;


--优化后
select "201505" month,count(t.userid) cnt,t.dept_name dept_name,t.prod_name prod_name
from 
(select userid from default.site_activeuser_tmp where month='201505') c
left join
(
select distinct a.userid userid,b.dept_name dept_name,b.prod_name prod_name from default.d_channel b
left join 
(select host,userid from dms.tracklog_5min where day>='201505' and day<'201506' ) a
on a.host=b.host  
)t
on t.userid=c.userid
group by t.prod_name,t.dept_name ;


  • 2
    点赞
  • 4
    收藏
    觉得还不错? 一键收藏
  • 0
    评论

“相关推荐”对你有帮助么?

  • 非常没帮助
  • 没帮助
  • 一般
  • 有帮助
  • 非常有帮助
提交
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值