Reposted from: https://blog.csdn.net/qq_26803795/article/details/79466592
This post continues the Hive series and works through everything related to Hive optimization in depth:
Hive optimization:
1) Map optimization
• Increase the number of map tasks:
set mapred.map.tasks=10;
• Decrease the number of map tasks (merge small files):
set mapred.max.split.size=100000000; -- max bytes per input split (~100 MB here)
set mapred.min.split.size.per.node=100000000; -- min bytes combined per node
set mapred.min.split.size.per.rack=100000000; -- min bytes combined per rack
set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat; -- combine small files into one split
• Map-side aggregation (combiner):
set hive.map.aggr=true;
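With the combiner enabled, each map task pre-aggregates its own output before the shuffle, so far less data reaches the reducers. A minimal sketch, assuming a hypothetical table log with a user_id column:
set hive.map.aggr=true;
-- each mapper emits partial per-user_id counts instead of raw rows
select user_id, count(*) as pv from log group by user_id;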
2) Reduce optimization
• Set the number of reduce tasks:
set mapred.reduce.tasks=10;
• Amount of data each reduce task processes:
set hive.exec.reducers.bytes.per.reducer=100000; -- reducer count ≈ total input bytes / this value
• Avoid constructs that trigger extra MapReduce stages:
1) group by
2) order by (use distribute by plus sort by instead: a global order by funnels all rows through a single reducer; see the sketch after this list)
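A minimal sketch of the order by workaround, assuming a hypothetical table log with columns dt and ts: distribute by picks the reducer per row, and sort by orders rows within each reducer, so no single reducer has to sort everything.
-- a global sort forces one reducer: select * from log order by ts;
-- per-reducer sorting scales out instead:
select * from log distribute by dt sort by dt, ts;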
3) Join optimization
• Join ON conditions:
SELECT a.val, b.val, c.val FROM a JOIN b ON (a.key = b.key1) JOIN c ON (a.key = c.key1)
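Hive merges joins that share the same join key into a single MapReduce job; in the query above both joins use a.key, so it runs as one job. For contrast, per the Hive join manual, joining on different keys of b needs two jobs:
-- b.key1 and b.key2 differ, so this is executed as two MapReduce jobs:
SELECT a.val, b.val, c.val FROM a JOIN b ON (a.key = b.key1) JOIN c ON (c.key = b.key2)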
• Join order:
/*+ STREAMTABLE(a) */ : a is treated as the big (streamed) table
/*+ MAPJOIN(b) */ : b is treated as the small table
SELECT /*+ STREAMTABLE(a) */ a.val, b.val, c.val
FROM a
JOIN b ON (a.key = b.key1)
JOIN c ON (c.key = b.key1);
4) Data skew optimization
• The general-purpose fix:
set hive.groupby.skewindata=true;
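With this switch on, Hive splits a skewed group by into two MR jobs: the first distributes rows randomly to reducers and partially aggregates, the second aggregates the partial results by the real key. A minimal sketch, assuming a hypothetical table log:
set hive.groupby.skewindata=true;
-- job 1: random distribution + partial counts; job 2: final counts per user_id
select user_id, count(*) as pv from log group by user_id;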
• A small table joined with a big table:
small_table join big_table (map-join the small table; see the sketch below)
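A minimal map-join sketch, with hypothetical tables big_table(id, val) and small_table(id, name): the small table is broadcast into every mapper's memory, so the big table is never shuffled to reducers:
select /*+ MAPJOIN(s) */ b.val, s.name
from big_table b
join small_table s on b.id = s.id;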
• Lots of 0 or NULL values in the join key:
Scatter the troublesome keys with a random value so they stop piling onto one reducer:
on case when (x.uid = '-' or x.uid = '0' or x.uid is null) then concat('dp_hive_search', rand()) else x.uid end = f.user_id;
• A big table joined with another big table:
select /*+ MAPJOIN(t12) */ *
from dw_log t11
join (
  select /*+ MAPJOIN(t) */ t1.*
  from ( select user_id from dw_log group by user_id ) t
  join dw_user t1 on t.user_id = t1.user_id
) t12 on t11.user_id = t12.user_id;
• Lots of special values when doing count distinct:
Exclude the special values from the distinct count, then add 1 back to account for them as a single group:
select cast(count(distinct user_id) + 1 as bigint) as user_cnt
from tab_a
where user_id is not null and user_id <> '';
• Trading space for time:
Replace several count(distinct ...) scans with a union all expansion plus a pre-aggregating group by:
select day,
       count(case when type = 'session' then 1 else null end) as session_cnt,
       count(case when type = 'user' then 1 else null end) as user_cnt
from (
  select day, session_id, type
  from (
    select day, session_id, 'session' as type from log
    union all
    select day, user_id, 'user' as type from log
  ) tmp
  group by day, session_id, type
) t1
group by day;
5) Other optimizations
• Partition pruning:
Partition predicates in the WHERE clause are applied early, so there is no need for a dedicated filtering subquery; Join and Group By can be written directly.
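A minimal sketch, assuming a hypothetical table log partitioned by dt: the dt predicate in WHERE prunes partitions before the join executes, so there is no need to pre-filter log in a subquery:
select u.name, count(*) as pv
from log l
join users u on l.user_id = u.id
where l.dt = '20170301' -- partition filter, applied at input pruning time
group by u.name;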
• Cartesian products:
A join without an ON condition (or with an invalid one) degenerates into a Cartesian product, which Hive can only execute with a single reducer.
• Union all:
Doing the union all first and the join or group by afterwards cuts down the MR stages: several selects still need only one MR job.
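A minimal sketch of the pattern, with hypothetical tables a and b: aggregate once over the UNION ALL instead of grouping each table separately and combining the results:
select t.key, count(*) as cnt
from (
  select key from a
  union all
  select key from b
) t
group by t.key;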
• Multi-insert & multi-group by:
Scan one base table once and produce several differently aggregated result sets in a single pass:
FROM from_statement
INSERT OVERWRITE TABLE table1 [PARTITION (partcol1=val1)] select_statement1 group by key1
INSERT OVERWRITE TABLE table2 [PARTITION (partcol2=val2)] select_statement2 group by key2
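A concrete sketch of the template above, with hypothetical tables pv_users, pv_gender_sum, and pv_age_sum: pv_users is scanned once and feeds both aggregations:
FROM pv_users
INSERT OVERWRITE TABLE pv_gender_sum
  SELECT gender, count(distinct userid) GROUP BY gender
INSERT OVERWRITE TABLE pv_age_sum
  SELECT age, count(distinct userid) GROUP BY age;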
• Automatic merge:
When output files are smaller than the threshold, Hive launches an extra MR job to merge them:
hive.merge.mapfiles = true -- whether to merge map output files, default true
hive.merge.mapredfiles = false -- whether to merge reduce output files, default false
hive.merge.size.per.task = 256*1000*1000 -- target size of the merged files
• Multi-Count Distinct:
Count several distinct columns over one table (requires: set hive.groupby.skewindata=true;):
select dt, count(distinct uniq_id), count(distinct ip)
from ods_log
where dt = 20170301
group by dt;
• Parallel execution:
Enable it in Hive with: set hive.exec.parallel=true;
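When enabled, Hive runs stages with no dependency on each other concurrently; the branches of a UNION ALL are the typical case. A minimal sketch with hypothetical tables a and b (hive.exec.parallel.thread.number caps how many stages run at once):
set hive.exec.parallel=true;
-- the two count(*) stages are independent and can run concurrently
select * from (
  select count(*) as cnt from a
  union all
  select count(*) as cnt from b
) t;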