hive中sql优化解决策略

最新推荐文章于 2024-07-01 13:18:56 发布

大数据爱好者社区

最新推荐文章于 2024-07-01 13:18:56 发布

阅读量8.4k

点赞数

分类专栏： hive 文章标签： hive优化

本文链接：https://blog.csdn.net/qq_35036995/article/details/80298449

版权

hive 专栏收录该内容

11 篇文章 2 订阅

订阅专栏

友情提示：更多有关大数据、人工智能方面技术文章请关注博主个人微信公众号：高级大数据架构师！

Hive中的sql优化

如果是按时间分区的表，查询时一定要使用分区限制，如果没有分区限制，会从该表的所有数据里面遍历。

注意sql中or的使用，or 这个逻辑必须单独括起来，否则可能引起无分区限制，下面举个例子，想查询到的是gd或gx的某天的数据。
Select x from t where ds=d1 and province=’gd’ or province=’gx’
该语句会从所有的分区里面查询！也没有得到自己想要的数据！这里的意思是某天gd的或者所有日期里面gx的数据，正确的写法是：
Select x from t where ds=d1 and (province=’gd’ or province=’gx’)

很容易出错的地方
1、  只支持等值连接，不支持非等值连接
2、  内连接时小表放前面，大表放后面 a Join b on  a.x=b.x  a为小表
3、  连接小表时使用map join 条件
SELECT /*+ MAPJOIN(b) */ a.key, a.value  FROM a join b on a.key = b.key
l   小表可以放进内存（维度表、行数<2w 为宜）
l   小表不能为驱动表 (eg:left outer join 时的左表)

Full outer join 在on中过滤分区有问题，解决方法是将分区过滤条件放到左表和右表子查询里面
比如：
Select a.x,b.x from a full|left|right outer join b on (a.key=b.key and a.ds=d1 and b.ds=d1)
应该写成
Select t1.x ,t2.x from

(select * from a where ds=d1) t1
full outer join
(select * from b where ds=d1) t2

on t1.key=t2.key

内连接注意事项
要使用 a join b on a.key=b.key，
下面写法在mysql/oracle等价于内连接，但是在Hive中会导致笛卡尔积，会跑不出来的
From a ,b where a.key=b.key

join的两个表中有一个表应该是distinct的，但如果实际的数据并非distinct的，这个时候应该做一次select distinct放到子查询，再join

慎用Order By

尽量不用order by，因为order by是全局的，会只有一个reduce

一些用法
1、结果压缩：（默认不压缩）
set hive.exec.compress.output=true;
2、压缩算法：（这个一般不用设，默认是gzip）
set mapred.output.compression.codec=org.apache.hadoop.io
.compress.GzipCodec;

3、特殊情况下设置reduce的个数
set mapred.reduce.tasks=n

4、数据严重倾斜时在sql前面加上
Set hive.groupby.skewindata = true
数据严重倾斜通常发生在 group by，其中维度值少，维度对应的记录数特别大的时候。如计算男女比例。

优化思路：
减少生成的mapreduce步骤。
1.      使用CASE WHEN代替子查询。
2.      由于日志数据量不大，直接用distinct处理。
优化前SQL：
Total MapReduce jobs = 20
insert overwrite table t_md_soft_wp7_dload partition(ds=20120820)
select g_f,dload_count,dload_user,tensoft_dload_count,tensoft_dload_user,outsoft_dload_count,outsoft_dload_user
from
(select
temp1.g_f,
temp1.dload_count,
temp1.dload_user,
temp2.tensoft_dload_count,
temp2.tensoft_dload_user,
temp3.outsoft_dload_count,
temp3.outsoft_dload_user
from
(select
g_f,
count(1) as dload_user,
sum(t1.pv) as dload_count
from
(select
g_f,
cookie_id,
count(1) as pv
from
t_od_soft_wp7_dload
where
ds=20120820
group by g_f,cookie_id) t1
group by g_f) temp1 left outer join
(select
g_f, count(1) as tensoft_dload_user,
sum(tt3.login_pv) as tensoft_dload_count
from
(select
g_f,
cookie_id,
count(1) as login_pv
from
t_od_soft_wp7_dload tt1 join t_rd_soft_wp7_app tt2 on tt1.ds=tt2.ds and tt1.ios_soft_id = tt2.appid
where
tt1.ds=20120820 and tt2.is_self_rd = 1
group by g_f,cookie_id) tt3
group by g_f) temp2 on temp1.g_f = temp2.g_f
left outer join
(select
g_f,
count(1) as outsoft_dload_user,
sum(tt6.login_pv) as outsoft_dload_count
from
(select
g_f,
cookie_id,
count(1) as login_pv
from
t_od_soft_wp7_dload tt4 join t_rd_soft_wp7_app tt5 on tt4.ds=tt5.ds and tt4.ios_soft_id = tt5.appid
where
tt4.ds=20120820 and tt5.is_self_rd = 0
group by g_f,cookie_id) tt6
group by g_f) temp3 on temp1.g_f = temp3.g_f
union all
select
temp4.g_f,
temp4.dload_count,
temp4.dload_user,
temp5.tensoft_dload_count,
temp5.tensoft_dload_user,
temp6.outsoft_dload_count,
temp6.outsoft_dload_user
from
(select
cast('-1' as bigint) as g_f,
count(1) as dload_user,
sum(tt7.pv) as dload_count
from
(select
cast('-1' as bigint) as g_f,
cookie_id,
count(1) as pv
from
t_od_soft_wp7_dload
where
ds=20120820
group by g_f,cookie_id) tt7
group by g_f) temp4 left outer join
(select
cast('-1' as bigint) as g_f,
count(1) as tensoft_dload_user,
sum(tt10.login_pv) as tensoft_dload_count
from
(select
cast('-1' as bigint) as g_f,
cookie_id,
count(1) as login_pv
from
t_od_soft_wp7_dload tt8 join t_rd_soft_wp7_app tt9 on tt8.ds=tt9.ds and tt8.ios_soft_id = tt9.appid
where
tt8.ds=20120820 and tt9.is_self_rd = 1
group by g_f,cookie_id) tt10
group by g_f) temp5 on temp4.g_f = temp5.g_f
left outer join
(select
cast('-1' as bigint) as g_f,
count(1) as outsoft_dload_user,
sum(tt13.login_pv) as outsoft_dload_count
from
(select
cast('-1' as bigint) as g_f,
cookie_id,
count(1) as login_pv
from
t_od_soft_wp7_dload tt11 join t_rd_soft_wp7_app tt12 on tt11.ds=tt12.ds and tt11.ios_soft_id = tt12.appid
where
tt11.ds=20120820 and tt12.is_self_rd = 0
group by g_f,cookie_id) tt13
group by g_f) temp6 on temp4.g_f = temp6.g_f ) t;

优化后SQL:
Total MapReduce jobs = 3
insert overwrite table t_md_soft_wp7_dload partition(ds=20120820)
select
g_f,
count(cookie_id) dload_count,
count(distinct cookie_id) dload_user,
count(case when is_self_rd=1 then cookie_id end ) tensoft_dload_count,
count(distinct case when is_self_rd=1 then cookie_id end ) tensoft_dload_user,
count( case when is_self_rd=0 then cookie_id end ) outsoft_dload_count,
count(distinct case when is_self_rd=0 then cookie_id end ) outsoft_dload_user
from
(select g_f,cookie_id,is_self_rd from
t_od_soft_wp7_dload t1 left outer join (select appid,is_self_rd from t_rd_soft_wp7_app where ds=20120820)  t2 on  t1.ios_soft_id = t2.appid where t1.ds=20120820
union all
select cast('-1' as bigint) as g_f,cookie_id,is_self_rd
from
t_od_soft_wp7_dload t1 left outer join (select appid,is_self_rd from t_rd_soft_wp7_app where ds=20120820)  t2 on  t1.ios_soft_id = t2.appid where t1.ds=20120820
) t
group by g_f;
优化效果：同样方式优化了2个作业，从 2-3小时降到 2分钟。

大数据爱好者社区

关注

0
点赞
踩
30

收藏

觉得还不错? 一键收藏
打赏
0
评论
hive中sql优化解决策略

友情提示：更多有关大数据、人工智能方面技术文章请关注博主个人微信公众号：高级大数据架构师！Hive中的sql优化如果是按时间分区的表，查询时一定要使用分区限制，如果没有分区限制，会从该表的所有数据里面遍历。注意sql中or的使用，or 这个逻辑必须单独括起来，否则可能引起无分区限制，下面举个例子，想查询到的是gd或gx的某天的数据。Select x from t where ds=...
复制链接

扫一扫