Hive tuning: set hive.exec.mode.local.auto.input.files.max=8;

Article link: https://blog.csdn.net/weixin_43003792/article/details/114498773

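-- Enable automatic local-mode MR for small jobs
set hive.exec.mode.local.auto=true;
-- Maximum input size for local MR; local mode is used when the input is smaller than this value (default 134217728, i.e. 128 MB)
set hive.exec.mode.local.auto.inputbytes.max=50000000;
-- Maximum number of input files for local MR; local mode is used when the number of input files is smaller than this value (default 4)
set hive.exec.mode.local.auto.input.files.max=10;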
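As a rough usage sketch, local mode can be switched on for a small query and switched off again to compare run times. The table emp is borrowed from the explain example at the end of this article; whether the query really runs locally depends on its input staying under the two thresholds above.

```sql
-- Let Hive pick local mode for a small query
set hive.exec.mode.local.auto=true;
select count(*) from emp;

-- Turn it off again and rerun to compare with execution on the cluster
set hive.exec.mode.local.auto=false;
select count(*) from emp;
```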
1. Filtering null keys
Test without filtering out null ids:
hive (default)> insert overwrite table jointable
select n.* from nullidtable n left join ori o on n.id = o.id;
Test with null ids filtered out:
hive (default)> insert overwrite table jointable
select n.* from (select * from nullidtable where id is not null ) n left join ori o on n.id = o.id;
2. Converting null keys (required to resolve data skew)
Without key conversion:
insert overwrite table jointable
select n.* from nullidtable n left join ori b on n.id = b.id;
Note: when a left join meets too many null keys, the data skews. Replace each null key with a random value in the ON clause:
insert overwrite table jointable
select n.* from nullidtable n full join ori o on
case when n.id is null then concat('hive', rand()) else n.id end = o.id;
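Before choosing between filtering and converting, it can help to measure how many null keys there actually are. A minimal check, assuming the same nullidtable as above:

```sql
-- Count rows with a null join key versus rows with a usable key
select
  sum(case when id is null then 1 else 0 end)     as null_ids,
  sum(case when id is not null then 1 else 0 end) as non_null_ids
from nullidtable;
```

If the null rows are simply bad data, filtering is enough; if they must stay in the result, the random-key conversion spreads them over the reducers.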
Group By
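(1) Whether to aggregate on the map side (default true)
hive.map.aggr = true   (works like a combiner; Spark's reduceByKey does this by default)
(2) Number of rows used for map-side aggregation
hive.groupby.mapaggr.checkinterval = 100000
(3) Balance the load when the data is skewed (default false)
hive.groupby.skewindata = true   (similar to adding a random number to the key, at the cost of roughly one extra MR job)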
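When this option is set to true, the generated query plan contains two MR jobs. In the first job, the map output is distributed randomly across the reducers; each reducer performs a partial aggregation and writes its result. Rows with the same Group By key may therefore land on different reducers, which is what balances the load. The second job then distributes the pre-aggregated data to the reducers by Group By key (this step guarantees that identical Group By keys reach the same reducer) and completes the final aggregation.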
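The two-job plan can be imitated by hand, which may make the idea easier to see. This is only an illustrative sketch, not what Hive generates internally: the salt column and the bucket count of 10 are assumptions chosen for the example.

```sql
-- Stage 1 (inner queries): spread each id over 10 random buckets and aggregate partially
-- Stage 2 (outer query): merge the partial counts per id
select id, sum(cnt) as cnt
from (
  select id, salt, count(*) as cnt
  from (
    select id, cast(floor(rand() * 10) as int) as salt
    from bigtable
  ) salted
  group by id, salt
) partial_agg
group by id;
```

A hot id is split into up to 10 partial groups in the first stage, so no single reducer has to process all of its rows.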
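Count(Distinct): deduplicated counting
On large data a global COUNT(DISTINCT) is handled by a single reduce task, so it is usually replaced by a GROUP BY followed by a COUNT: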
select count(distinct id) from bigtable;
-- group by id first to deduplicate, then count
select count(id) from (select id from bigtable group by id) a;
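The GROUP BY form only pays off when the inner stage can use more than one reducer. A sketch, where the value 5 is an arbitrary example and not a recommendation:

```sql
-- Allow several reducers for the GROUP BY stage (5 is an example value)
set mapreduce.job.reduces=5;
select count(id) from (select id from bigtable group by id) a;
```

The rewrite adds an extra job, so it tends to help only when the table is large enough for the single-reducer COUNT(DISTINCT) to be the bottleneck.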
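Row and column filtering
Column handling: in SELECT, take only the columns you need; if the table is partitioned, filter on the partition columns wherever possible; avoid SELECT *.
Row handling: the first query below filters after the join, the second filters ori in a subquery before the join.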
hive (default)> select o.id from bigtable b
join ori o on o.id = b.id
where o.id <= 10;
hive (default)> select b.id from bigtable b
join (select id from ori where id <= 10 ) o on b.id = o.id;
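For the partition-filtering advice, a minimal sketch using the partitioned table dept_par that is created in the dynamic partition section of this article. The value 10 is a hypothetical partition value.

```sql
-- Only the deptno=10 partition is scanned, and only the two needed columns are selected
select dname, loc
from dept_par
where deptno = 10;
```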
Dynamic partition tuning
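1. Dynamic partition settings
(1) Enable dynamic partitioning (default true, enabled)
hive.exec.dynamic.partition=true
(2) Switch to non-strict mode (the dynamic partition mode defaults to strict, which requires at least one partition to be specified statically; nonstrict allows every partition column to be dynamic)
hive.exec.dynamic.partition.mode=nonstrict
(3) Maximum total number of dynamic partitions that can be created across all nodes running MR
hive.exec.max.dynamic.partitions=1000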

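Columns of the source table dept:
deptno int
dname string
loc
create table dept_par(dname string,loc int)
partitioned by(deptno int)
row format delimited fields terminated by '\t';
-- Dynamic partition values are taken by position from the last column of the query, regardless of column name, so the partition column must be placed last.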
insert into table dept_par partition(deptno)
select dname,loc,deptno from dept;
alter table dept_par drop partition
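To see which partitions the dynamic insert created, a check along these lines could be run against dept_par:

```sql
-- List the partitions; one deptno=... entry is expected per distinct deptno in dept
show partitions dept_par;

-- Row count per partition
select deptno, count(*) as cnt
from dept_par
group by deptno;
```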

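To increase the number of map tasks: by the formula computeSplitSize(Math.max(minSize,Math.min(maxSize,blocksize)))=blocksize=128M, the split size defaults to the block size. Adjust maxSize: setting maxSize below blocksize increases the number of map tasks.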
hive (default)> set mapreduce.input.fileinputformat.split.maxsize=100;
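A worked instance of the formula, under assumed values: minSize = 1 byte, blocksize = 128 MB, and a maxSize of 64 MB chosen purely for illustration.

```sql
-- computeSplitSize = max(minSize, min(maxSize, blocksize))
--   maxSize above blocksize:  max(1, min(maxSize, 134217728))  = 134217728  (128 MB)
--   maxSize = 67108864:       max(1, min(67108864, 134217728)) = 67108864   (64 MB)
set mapreduce.input.fileinputformat.split.maxsize=67108864;
select count(*) from emp;
```

Halving the split size roughly doubles the number of map tasks for a large input; the exact count depends on the input format and the file layout.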
-- Enable parallel execution of independent stages
set hive.exec.parallel=true;
-- Maximum degree of parallelism allowed for a single SQL statement (default 8)
set hive.exec.parallel.thread.number=16;
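Parallel execution only helps when a query has stages that do not depend on each other. A small sketch using the tables from earlier in this article; the src and cnt column names are made up for the example:

```sql
set hive.exec.parallel=true;

-- The two counts are independent of each other, so their stages can run at the same time
select t.src, t.cnt
from (
  select 'bigtable' as src, count(*) as cnt from bigtable
  union all
  select 'ori' as src, count(*) as cnt from ori
) t;
```

The gain is only visible when the cluster has spare resources to run both stages at once.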

-- EXPLAIN prints the execution plan, which lets you analyze how the MR job runs
explain select * from emp;
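A plain select * may be served by a simple fetch and show no MR stage in its plan. To see map and reduce stages, an aggregation can be explained instead; this sketch reuses bigtable and its id column from earlier in the article:

```sql
-- The plan for a GROUP BY lists the map-side and reduce-side operators
explain select id, count(*) from bigtable group by id;

-- explain extended prints additional detail about the plan
explain extended select id, count(*) from bigtable group by id;
```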