Basic Hive Optimization Notes

lijie_cq

Category: hive  Tags: optimization, hive

Article link: https://blog.csdn.net/qq_20641565/article/details/52833569

Hive optimization

I. View the execution plan:

explain select * from lijie.test where id = '1';
explain extended select * from lijie.test where id = '1';

II. Local mode

hive.exec.mode.local.auto=false;   -- the default; set it to true to let Hive run small jobs locally
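
As a rough sketch of how local mode might be switched on for a session: the two threshold properties below are standard Hive settings, but the default values quoted in the comments can differ between versions, so treat them as assumptions and check your own hive-default.xml.

```sql
-- Let Hive decide per job whether it is small enough to run in local mode.
set hive.exec.mode.local.auto=true;
-- Upper bound on total input size in bytes (134217728 = 128 MB is the commonly cited default).
set hive.exec.mode.local.auto.inputbytes.max=134217728;
-- Upper bound on the number of input files (4 is the commonly cited default).
set hive.exec.mode.local.auto.input.files.max=4;

-- A small query such as this one can then run without being submitted to the cluster.
select * from lijie.test where id = '1';
```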

III. Set the queue (pick a queue with plenty of resources)

mapred.queue.name=hadoop
mapred.job.queue.name=hadoop

IV. Set the priority

mapred.job.priority=HIGH
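
On Hadoop 2 and later the mapred.* names used above are deprecated aliases. A minimal sketch of the same queue and priority settings under the newer names, assuming a queue called hadoop exists on your cluster:

```sql
-- Newer name for mapred.job.queue.name.
set mapreduce.job.queuename=hadoop;
-- Newer name for mapred.job.priority.
-- Accepted values: VERY_HIGH, HIGH, NORMAL, LOW, VERY_LOW.
set mapreduce.job.priority=HIGH;
```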

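V. Enable parallel execution in Hive
1. Hive converts a query into one or more stages (by default Hive executes only one stage at a time).
2. If a job has multiple stages and each stage depends on the previous one, the job cannot be run in parallel.
3. In a union all, for example, the individual queries do not depend on each other, so running them in parallel improves efficiency considerably.
hive.exec.parallel defaults to false.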

-- enable parallel execution
set hive.exec.parallel=true;
-- maximum number of parallel threads
set hive.exec.parallel.thread.number=8;

    <property>
        <name>hive.exec.parallel</name>
        <value>true</value>
    </property>
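
To illustrate the union all case from point 3, here is a small sketch. The tables lijie.test_a and lijie.test_b are hypothetical names made up for this example. The two count(*) branches do not depend on each other, so with hive.exec.parallel enabled their stages can be scheduled at the same time.

```sql
set hive.exec.parallel=true;
set hive.exec.parallel.thread.number=8;

-- Each branch of the union all becomes an independent stage.
select t.src, t.cnt
from (
    select 'a' as src, count(*) as cnt from lijie.test_a
    union all
    select 'b' as src, count(*) as cnt from lijie.test_b
) t;
```

Running explain on the query lists the stage dependencies, which shows whether any stages can actually overlap.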

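VI. Set the number of mappers and reducers
The number of mappers is determined by the number of input splits (by default a split is the same size as an HDFS block; this can be changed through the InputFormat).
The number of reducers defaults to 1 in plain MapReduce; Hive itself defaults mapred.reduce.tasks to -1 and estimates the number from the input size unless you set it explicitly.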

mapred.reduce.tasks=3
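
A sketch of the settings that usually influence these two numbers. The byte values are arbitrary examples rather than recommendations.

```sql
-- Reducers: let Hive estimate the count as roughly input size / bytes per reducer,
-- capped by hive.exec.reducers.max.
set hive.exec.reducers.bytes.per.reducer=268435456;
set hive.exec.reducers.max=100;
-- Or fix the count explicitly, which overrides the estimate:
-- set mapred.reduce.tasks=3;

-- Mappers: combine small files into larger splits so fewer map tasks are launched.
set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;
set mapred.max.split.size=268435456;
```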

VII. JVM reuse (for jobs over a large number of small files, enabling JVM reuse cuts roughly 45% of the run time)

mapred.job.reuse.jvm.num.tasks

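The setting is defined in mapred-default.xml.
It defaults to 1, so with many small files every file gets its own JVM, which is started for that file and shut down when it finishes. If each file is only 1 MB and 128 MB are processed in total, the JVM is started and stopped 128 times, which performs poorly. A value of 15 to 20 is typical.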

<property>
    <name>mapred.job.reuse.jvm.num.tasks</name>
    <!-- defaults to 1; -1 means no limit -->
    <value>1</value>
</property>
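
The same property can also be set per session rather than in the XML file. A minimal sketch, using 16 as an example value from the 15 to 20 range mentioned above:

```sql
-- Each JVM now runs up to 16 tasks of the same job before it exits.
-- For the 128-file example this means on the order of 128 / 16 = 8 JVM
-- start-ups instead of 128 (the exact number depends on how tasks are
-- spread over the available slots).
set mapred.job.reuse.jvm.num.tasks=16;
```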

VIII. Create indexes
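
The notes give no example for this item, so the following is only a sketch of the classic Hive index syntax. The index name idx_test_id is made up for illustration. Note that index support was removed in Hive 3.0, so this applies to older versions only.

```sql
-- Create a compact index on the id column; the index data is built later.
create index idx_test_id
on table lijie.test (id)
as 'COMPACT'
with deferred rebuild;

-- Build (or refresh) the index data.
alter index idx_test_id on lijie.test rebuild;
```
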
IX. Create partitions
Hive uses static partitioning by default.
To enable dynamic partitioning:

    hive.exec.dynamic.partition=true
    hive.exec.dynamic.partition.mode=nonstrict;

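Static partitioning always creates the partition, whether or not the select returns any rows.
Dynamic partitioning creates a partition only when the select returns rows for it.
Dynamic partitioning allocates reducers for each partition (static partitioning defaults to 1).
Dynamic: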

    insert overwrite table test partition(dt)
    select f1,f2,f3,...dt
    from test1
    where .......;

Static:

    insert overwrite table test partition(dt=20161016)
    select f1,f2,f3...
    from test1
    where .......;
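
For completeness, a sketch of a full dynamic-partition load. The column types, the date filter, and the partition limits are assumptions chosen for the example; the key point is that the partition column (dt) must come last in the select list.

```sql
-- Hypothetical target table, partitioned by dt.
create table if not exists test (f1 string, f2 string, f3 string)
partitioned by (dt string);

set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
-- Limits on how many partitions one statement may create (values are examples).
set hive.exec.max.dynamic.partitions=1000;
set hive.exec.max.dynamic.partitions.pernode=100;

-- One partition is created for every distinct dt value the select returns.
insert overwrite table test partition (dt)
select f1, f2, f3, dt
from test1
where dt >= '20161001';
```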

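X. Speculative execution (when there are many map or reduce tasks and all but one have finished, Hadoop leaves the slow task running, launches an identical copy of it, and uses whichever finishes first; this is usually unnecessary and can be turned off)
MapReduce settings (set in mapred-site.xml):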

mapred.map.tasks.speculative.execution=false;
mapred.reduce.tasks.speculative.execution=false;

Hive setting:

hive.mapred.reduce.tasks.speculative.execution
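
A sketch of turning speculative execution off for a single Hive session instead of cluster-wide. The mapreduce.* names are the Hadoop 2 equivalents of the mapred.* properties above.

```sql
-- Hadoop 2+ names for the map-side and reduce-side switches.
set mapreduce.map.speculative=false;
set mapreduce.reduce.speculative=false;
-- Hive's own switch for reduce-side speculation.
set hive.mapred.reduce.tasks.speculative.execution=false;
```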

XI. Do not use distinct for deduplication; use group by
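XII. Joining several tables.
Each time you join a table, give that join its own on a.col = b.col condition. Many people writing Oracle SQL list all the tables first and specify the conditions afterwards, because Oracle has its own optimizer. Hive does not, so an Oracle-style query over three tables produces a Cartesian product:
rows of table 1 × rows of table 2 × rows of table 3, which is filtered only afterwards. This performs very poorly.
Examples for XI and XII: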

-- Oracle style
select 
    distinct
    a.report_no,
    a.xxx,
    b.xxx 
from 
    a,b,c
where 
    a.report_no = b.rep_num and
    a.price > 1000 and 
    a.report_no = c.num and 
    c.flag = '1'

-- Hive style
select 
    a.report_no,
    a.xxx,
    b.xxx 
from 
    a inner join b on a.report_no = b.rep_num
    inner join c on a.report_no = c.num
where 
    a.price > 1000 and
    c.flag = '1'
group by 
    a.report_no,
    a.xxx,
    b.xxx
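
The same idea applies when counting distinct values. As a sketch, reusing table a from the example above: a plain count(distinct ...) tends to funnel all rows through a single reducer, whereas grouping first spreads the deduplication across many reducers and leaves only a cheap final count.

```sql
-- Single-reducer bottleneck on large tables:
select count(distinct report_no) from a;

-- Deduplicate with group by first, then count the groups:
select count(*)
from (
    select report_no
    from a
    group by report_no
) t;
```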

XIII. If table a often has missing data, so that id contains null values, a join will cause data skew. The following approach resolves it:

select 
    *
from 
     a left outer join b 
     on 
     case when a.user_id is null then concat('test',rand() ) else a.user_id end = b.user_id;
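
An alternative sketch, not from the original notes: keep the null keys out of the join entirely and append them afterwards with union all. The column b.user_name is hypothetical and stands in for whatever columns you need from b.

```sql
select t.user_id, t.user_name
from (
    -- Rows with a real key take part in the join.
    select a.user_id, b.user_name
    from a left outer join b on a.user_id = b.user_id
    where a.user_id is not null

    union all

    -- Rows with a null key can never match, so skip the join for them.
    select a.user_id, cast(null as string) as user_name
    from a
    where a.user_id is null
) t;
```

The rand() approach above reads a only once, while this version scans it twice but never shuffles the null-key rows to a join reducer. Which one is cheaper depends on how many null keys there are.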