Basic starting points
1. Consider the environment (hardware, servers, configuration)
2. The business logic (how each statistical metric is implemented)
3. Code or configuration properties (the properties in hive-default.xml)
Specific optimization techniques
1. Inspect the execution plan with explain and explain extended
2. join
3. limit optimization
4. Local mode
5. Parallel execution
6. Strict mode
7. Number of mappers and reducers
8. JVM reuse
9. Data skew
10. Indexing, also a Hive optimization (indexes perform poorly in Hive and are not recommended)
11. Partitioning, itself a form of optimization
12. Number of jobs
Execution plan
explain: shows only the execution plan of the HQL statement.
explain extended: shows the execution plan of the HQL statement plus the generated abstract syntax tree.
explain:
explain
select
class,
count(*)
from win
group by class;
Each Stage in the output represents one phase of execution.
Output:
STAGE DEPENDENCIES: --- stage dependencies
Stage-1 is a root stage -- Stage-1 runs first
Stage-0 depends on stages: Stage-1 --- Stage-0 depends on Stage-1
STAGE PLANS: --- stage plans
Stage: Stage-1
Map Reduce --- Stage-1 is a MapReduce stage
Map Operator Tree: -- the map-side operators
TableScan --- start scanning the table
alias: win -- the table is win
Statistics: Num rows: 1 Data size: 145 Basic stats: COMPLETE Column stats: NONE -- table statistics
Select Operator -- the select operation
expressions: class (type: string) -- the selected column is class
outputColumnNames: class --- the output column is also class
Statistics: Num rows: 1 Data size: 145 Basic stats: COMPLETE Column stats: NONE
Group By Operator --- the group-by operation
aggregations: count() --- the aggregate function count
keys: class (type: string) --- grouped on the class column
mode: hash --- the mode is hash
outputColumnNames: _col0, _col1 --- two temporary (virtual) columns are output here
Statistics: Num rows: 1 Data size: 145 Basic stats: COMPLETE Column stats: NONE
Reduce Output Operator -- output sent to the reduce phase
key expressions: _col0 (type: string) --- the output key, i.e. class
sort order: +
Map-reduce partition columns: _col0 (type: string) --- partitioned by _col0, i.e. grouped by class
Statistics: Num rows: 1 Data size: 145 Basic stats: COMPLETE Column stats: NONE
value expressions: _col1 (type: bigint)
Reduce Operator Tree: ---- the actual reduce-side work that produces the final output
Group By Operator
aggregations: count(VALUE._col0)
keys: KEY._col0 (type: string)
mode: mergepartial
outputColumnNames: _col0, _col1
Statistics: Num rows: 1 Data size: 145 Basic stats: COMPLETE Column stats: NONE
File Output Operator
compressed: true
Statistics: Num rows: 1 Data size: 145 Basic stats: COMPLETE Column stats: NONE
table: -- input and output formats
input format: org.apache.hadoop.mapred.TextInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
Stage: Stage-0 --- Stage-0
Fetch Operator --- the fetch operation that returns the final result
limit: -1
Processor Tree:
ListSink
explain extended
explain extended
select
class,
count(*)
from win
group by class;
Output: the information is far more detailed; even the final output directory is shown.
ABSTRACT SYNTAX TREE: --- what the parser produces: the abstract syntax tree, read top-down
TOK_QUERY --- the query; which table it reads from
TOK_FROM
TOK_TABREF
TOK_TABNAME
win
TOK_INSERT ---- where the result goes; here a temporary file
TOK_DESTINATION
TOK_DIR
TOK_TMP_FILE
TOK_SELECT ---- which column is selected, followed by the count
TOK_SELEXPR
TOK_TABLE_OR_COL
class
TOK_SELEXPR
TOK_FUNCTIONSTAR
count
TOK_GROUPBY ---- which column to group by
TOK_TABLE_OR_COL
class
STAGE DEPENDENCIES:
Stage-1 is a root stage
Stage-0 depends on stages: Stage-1
STAGE PLANS:
Stage: Stage-1
Map Reduce
Map Operator Tree:
TableScan
alias: win
Statistics: Num rows: 1 Data size: 145 Basic stats: COMPLETE Column stats: NONE
GatherStats: false
Select Operator
expressions: class (type: string)
outputColumnNames: class
Statistics: Num rows: 1 Data size: 145 Basic stats: COMPLETE Column stats: NONE
Group By Operator
aggregations: count()
keys: class (type: string)
mode: hash
outputColumnNames: _col0, _col1
Statistics: Num rows: 1 Data size: 145 Basic stats: COMPLETE Column stats: NONE
Reduce Output Operator
key expressions: _col0 (type: string)
sort order: +
Map-reduce partition columns: _col0 (type: string)
Statistics: Num rows: 1 Data size: 145 Basic stats: COMPLETE Column stats: NONE
tag: -1
value expressions: _col1 (type: bigint)
auto parallelism: false
Path -> Alias:
hdfs://cmcc/user/hive/warehouse/cmcc.db/win [win]
Path -> Partition:
hdfs://cmcc/user/hive/warehouse/cmcc.db/win
Partition
base file name: win
input format: org.apache.hadoop.mapred.TextInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
properties:
COLUMN_STATS_ACCURATE true
bucket_count -1
columns name,class,score
columns.comments
columns.types string:string:int
field.delim
file.inputformat org.apache.hadoop.mapred.TextInputFormat
file.outputformat org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
location hdfs://cmcc/user/hive/warehouse/cmcc.db/win
name cmcc.win
numFiles 1
serialization.ddl struct win { string name, string class, i32 score}
serialization.format
serialization.lib org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
totalSize 145
transient_lastDdlTime 1568359214
serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
input format: org.apache.hadoop.mapred.TextInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
properties:
COLUMN_STATS_ACCURATE true
bucket_count -1
columns name,class,score
columns.comments
columns.types string:string:int
field.delim
file.inputformat org.apache.hadoop.mapred.TextInputFormat
file.outputformat org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
location hdfs://cmcc/user/hive/warehouse/cmcc.db/win
name cmcc.win
numFiles 1
serialization.ddl struct win { string name, string class, i32 score}
serialization.format
serialization.lib org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
totalSize 145
transient_lastDdlTime 1568359214
serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
name: cmcc.win
name: cmcc.win
Truncated Path -> Alias:
/cmcc.db/win [win]
Needs Tagging: false
Reduce Operator Tree:
Group By Operator
aggregations: count(VALUE._col0)
keys: KEY._col0 (type: string)
mode: mergepartial
outputColumnNames: _col0, _col1
Statistics: Num rows: 1 Data size: 145 Basic stats: COMPLETE Column stats: NONE
File Output Operator
compressed: true
GlobalTableId: 0
directory: hdfs://cmcc/tmp/hive/hadoop/573e56e0-ee46-47d7-9f61-98649dce77a3/hive_2019-09-20_14-15-00_585_2418580963205350294-1/-mr-10000/.hive-staging_hive_2019-09-20_14-15-00_585_2418580963205350294-1/-ext-10001
NumFilesPerFileSink: 1
Statistics: Num rows: 1 Data size: 145 Basic stats: COMPLETE Column stats: NONE
Stats Publishing Key Prefix: hdfs://cmcc/tmp/hive/hadoop/573e56e0-ee46-47d7-9f61-98649dce77a3/hive_2019-09-20_14-15-00_585_2418580963205350294-1/-mr-10000/.hive-staging_hive_2019-09-20_14-15-00_585_2418580963205350294-1/-ext-10001/
table:
input format: org.apache.hadoop.mapred.TextInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
properties:
columns _col0,_col1
columns.types string:bigint
escape.delim \
hive.serialization.extend.additional.nesting.levels true
serialization.format 1
serialization.lib org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
TotalFiles: 1
GatherStats: false
MultiFileSpray: false
Stage: Stage-0
Fetch Operator
limit: -1
Processor Tree:
ListSink
join
In Hive the small table (result set) should always drive the large table (result set).
In Hive, on conditions may only be equality comparisons joined with and.
Check whether Hive is configured to convert a common join into a map-side join, and what the small-table file-size threshold for mapjoin is (see the sketch after these properties).
Pay attention to Hive's skew join settings:
hive.optimize.skewjoin=false --- whether the skew join optimization is enabled; off by default
hive.skewjoin.key=100000 --- a join key seen on more than this many rows is treated as skewed
hive.skewjoin.mapjoin.map.tasks=10000 --- number of map tasks for the follow-up map join on the skewed keys
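A sketch of the map-join conversion mentioned above (the property values shown are the usual defaults and are only illustrative; t_user1 and t_user2 are the sample tables used later in these notes):
-- let Hive rewrite a common join into a map-side join automatically
set hive.auto.convert.join=true;
-- tables below this size (bytes) are treated as the small, broadcast side
set hive.mapjoin.smalltable.filesize=25000000;
select
t1.*
from t_user1 t1
join t_user2 t2
on t1.id = t2.id;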
limit optimization
hive.limit.row.max.size=100000 --- at most 100000 rows for limit
hive.limit.optimize.limit.file=10 ---- limit the sample to at most 10 files
hive.limit.optimize.enable=false (recommended to enable when limit is used a lot) --- whether the limit optimization is on; off by default
hive.limit.optimize.fetch.max=50000 ---- a fetch-style query (e.g. select * from a table) returns at most 50000 rows via this optimization
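A minimal sketch of enabling the limit optimization for one session, using the properties listed above (whether it actually helps depends on how the data is laid out):
set hive.limit.optimize.enable=true;   -- sample the input instead of scanning it all
set hive.limit.row.max.size=100000;    -- the sample may cover at most this many rows
set hive.limit.optimize.limit.file=10; -- and at most this many files
select * from win limit 10;            -- a fetch-style query that benefits from the sampling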
Local mode
hive.exec.mode.local.auto=false (recommended to enable) --- local mode
hive.exec.mode.local.auto.inputbytes.max=134217728 --- total input size, 128 MB
hive.exec.mode.local.auto.input.files.max=4 --- at most 4 input files
Together these mean: with local mode enabled, a query whose total input is no more than 128 MB and spans no more than 4 files runs in local mode.
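For example, with the thresholds above a small test query can run entirely in-process instead of being submitted to the cluster; a sketch:
set hive.exec.mode.local.auto=true;                      -- turn local mode on
set hive.exec.mode.local.auto.inputbytes.max=134217728;  -- at most 128 MB of input
set hive.exec.mode.local.auto.input.files.max=4;         -- at most 4 input files
select class, count(*) from win group by class;          -- small enough to run locally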
Parallel execution
hive.exec.parallel=false (recommended to enable)
hive.exec.parallel.thread.number=8 -- up to 8 stages may run in parallel (the default)
In principle more parallelism is better, but setting it too high also hurts because the stages compete for cluster resources.
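A sketch of enabling it per session; only stages with no dependency on each other (for example the two branches of a union all) can actually overlap:
set hive.exec.parallel=true;             -- allow independent stages to run at the same time
set hive.exec.parallel.thread.number=8;  -- at most 8 stages in parallel
-- the two branches below do not depend on each other, so their stages may run concurrently
select class, count(*) as cnt from win group by class
union all
select class, count(distinct name) as cnt from win group by class;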
Strict mode
hive.mapred.mode=nonstrict --- non-strict mode, the default
hive.mapred.mode=strict -- strict mode
When enabled, strict mode blocks certain risky queries, as in the sketch below.
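For example, strict mode rejects an order by without a limit (it would funnel every row through a single reducer), a query on a partitioned table without a partition filter, and Cartesian-product joins. A sketch (t_log and its dt partition column are made-up names for illustration):
set hive.mapred.mode=strict;
-- rejected in strict mode: order by with no limit
-- select * from t_log order by dt;
-- accepted: partition filter plus a bounded order by
select * from t_log where dt = '2019-09-20' order by dt limit 100;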
Number of mappers and reducers
More mappers and reducers are not always better, nor are fewer; the right number depends on the job.
Merge small files before processing (set the input format class to CombineTextInputFormat).
Merging small files through configuration:
mapred.max.split.size=256000000 -- maximum split size, about 256 MB
mapred.min.split.size.per.node=1 -- minimum split size per node
mapred.min.split.size.per.rack=1 --- minimum split size per rack
hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat -- input format that combines small files
Setting manually:
set mapred.map.tasks=2; -- set the number of map tasks by hand; the default is 2
Number of reducers (determined automatically or set manually):
mapred.reduce.tasks=-1 --- number of reducers; the default -1 lets Hive decide automatically
hive.exec.reducers.max=1009 --- maximum number of reducers
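A sketch that combines the two sides (the numbers are only examples; left alone, Hive derives the reducer count from hive.exec.reducers.bytes.per.reducer):
-- merge small input files so one mapper can read several of them
set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;
set mapred.max.split.size=256000000;
-- pin the reducer count instead of letting Hive choose
set mapred.reduce.tasks=4;
select class, count(*) from win group by class;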
JVM reuse
mapreduce.job.jvm.numtasks=1 --- by default each JVM runs only one task
mapred.job.reuse.jvm.num.tasks=1; --- the older property name for the same setting
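A sketch, assuming a job made up of many short tasks where JVM start-up time dominates:
set mapreduce.job.jvm.numtasks=10;      -- one JVM runs up to 10 tasks back to back
set mapred.job.reuse.jvm.num.tasks=10;  -- same setting under the older property name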
Number of jobs
Usually one query produces one job; typically each subquery, join, group by, limit, and similar operation corresponds to one job.
1 job:
select
t1.*
from t_user1 t1
left join t_user2 t2
on t1.id = t2.id
where t2.id is null
;
The following produces 3 jobs:
select
t1.*
from t_user1 t1
where id in (
select
t2.id
from t_user2 t2
limit 1
)
;