Basic starting points
1. Consider the environment (hardware, servers, configuration)
2. The business logic (how each statistical metric is implemented)
3. Code or configuration properties (the properties in hive-default.xml)
Specific optimization techniques
1. Inspect the execution plan with explain and explain extended
2. join
3. limit optimization
4. Local mode
5. Parallel execution
6. Strict mode
7. Number of mappers and reducers
8. JVM reuse
9. Data skew
10. Indexing, also a Hive optimization (indexes perform poorly in Hive and are not recommended)
11. Partitioning, itself a form of optimization
12. Number of jobs
Execution plan
explain: shows only the execution plan of the HQL statement.
explain extended: shows the execution plan of the HQL statement plus the generated abstract syntax tree.
explain:
explain
select
class,
count(*)
from win
group by class;
Each Stage in the output represents one phase of execution.
Output:
STAGE DEPENDENCIES: --- stage dependencies
Stage-1 is a root stage -- Stage-1 runs first
Stage-0 depends on stages: Stage-1 --- Stage-0 depends on Stage-1
STAGE PLANS: --- stage plans
Stage: Stage-1
Map Reduce --- Stage-1 is a MapReduce stage
Map Operator Tree: -- the map-side operators
TableScan --- start scanning the table
alias: win -- the table is win
Statistics: Num rows: 1 Data size: 145 Basic stats: COMPLETE Column stats: NONE -- table statistics
Select Operator -- the select operation
expressions: class (type: string) -- the selected column is class
outputColumnNames: class --- the output column is also class
Statistics: Num rows: 1 Data size: 145 Basic stats: COMPLETE Column stats: NONE
Group By Operator --- the group-by operation
aggregations: count() --- the aggregate function count
keys: class (type: string) --- grouped on the class column
mode: hash --- the mode is hash
outputColumnNames: _col0, _col1 --- two temporary (virtual) columns are output here
Statistics: Num rows: 1 Data size: 145 Basic stats: COMPLETE Column stats: NONE
Reduce Output Operator -- output sent to the reduce phase
key expressions: _col0 (type: string) --- the output key, i.e. class
sort order: +
Map-reduce partition columns: _col0 (type: string) --- partitioned by _col0, i.e. grouped by class
Statistics: Num rows: 1 Data size: 145 Basic stats: COMPLETE Column stats: NONE
value expressions: _col1 (type: bigint)
Reduce Operator Tree: ---- the actual reduce-side work that produces the final output
Group By Operator
aggregations: count(VALUE._col0)
keys: KEY._col0 (type: string)
mode: mergepartial
outputColumnNames: _col0, _col1
Statistics: Num rows: 1 Data size: 145 Basic stats: COMPLETE Column stats: NONE
File Output Operator
compressed: true
Statistics: Num rows: 1 Data size: 145 Basic stats: COMPLETE Column stats: NONE
table: -- input and output formats
input format: org.apache.hadoop.mapred.TextInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
Stage: Stage-0 --- Stage-0
Fetch Operator --- the fetch operation that returns the final result
limit: -1
Processor Tree:
ListSink
explain extended
explain extended
select
class,
count(*)
from win
group by class;
Output: the information is far more detailed; even the final output directory is shown.
ABSTRACT SYNTAX TREE: --- what the parser produces: the abstract syntax tree, read top-down
TOK_QUERY --- the query; which table it reads from
TOK_FROM
TOK_TABREF
TOK_TABNAME
win
TOK_INSERT ---- where the result goes; here a temporary file
TOK_DESTINATION
TOK_DIR
TOK_TMP_FILE
TOK_SELECT ---- which column is selected, followed by the count
TOK_SELEXPR
TOK_TABLE_OR_COL
class
TOK_SELEXPR
TOK_FUNCTIONSTAR
count
TOK_GROUPBY ---- which column to group by
TOK_TABLE_OR_COL
class
STAGE DEPENDENCIES:
Stage-1 is a root stage
Stage-0 depends on stages: Stage-1
STAGE PLANS:
Stage: Stage-1
Map Reduce
Map Operator Tree:
TableScan
alias: win
Statistics: Num rows: 1 Data size: 145 Basic stats: COMPLETE Column stats: NONE
GatherStats: false
Select Operator
expressions: class (type: string)
outputColumnNames: class
Statistics: Num rows: 1 Data size: 145 Basic stats: COMPLETE Column stats: NONE
Group By Operator
aggregations: count()
keys: class (type: string)
mode: hash
outputColumnNames: _col0, _col1
Statistics: Num rows: 1 Data size: 145 Basic stats: COMPLETE Column stats: NONE
Reduce Output Operator
key expressions: _col0 (type: string)
sort order: +
Map-reduce partition columns: _col0 (type: string)
Statistics: Num rows: 1 Data size: 145 Basic stats: COMPLETE Column stats: NONE
tag: -1
value expressions: _col1 (type: bigint)
auto parallelism: false
Path -> Alias:
hdfs://cmcc/user/hive/warehouse/cmcc.db/win [win]
Path -> Partition:
hdfs://cmcc/user/hive/warehouse/cmcc.db/win
Partition
base file name: win
input format: org.apache.hadoop.mapred.TextInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
properties:
COLUMN_STATS_ACCURATE true
bucket_count -1
columns name,class,score
columns.comments
columns.types string:string:int
field.delim
file.inputformat org.apache.hadoop.mapred.TextInputFormat
file.outputformat org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
location hdfs://cmcc/user/hive/warehouse/cmcc.db/win
name cmcc.win
numFiles 1
serialization.ddl struct win { string name, string class, i32 score}
serialization.format
serialization.lib org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
totalSize 145
transient_lastDdlTime 1568359214
serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
input format: org.apache.hadoop.mapred.TextInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
properties:
COLUMN_STATS_ACCURATE true
bucket_count -1
columns name,class,score
columns.comments
columns.types string:string:int
field.delim
file.inputformat org.apache.hadoop.mapred.TextInputFormat
file.outputformat org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
location hdfs://cmcc/user/hive/warehouse/cmcc.db/win
name cmcc.win
numFiles 1
serialization.ddl struct win { string name, string class, i32 score}
serialization.format
serialization.lib org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
totalSize 145
transient_lastDdlTime 1568359214
serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
name: cmcc.win
name: cmcc.win
Truncated Path -> Alias:
/cmcc.db/win [win]
Needs Tagging: false
Reduce Operator Tree:
Group By Operator
aggregations: count(VALUE._col0)
keys: KEY._col0 (type: string)
mode: mergepartial
outputColumnNames: _col0, _col1
Statistics: Num rows: 1 Data size: 145 Basic stats: COMPLETE Column stats: NONE
File Output Operator
compressed: true
GlobalTableId: 0
directory: hdfs://cmcc/tmp/hive/hadoop/573e56e0-ee46-47d7-9f61-98649dce77a3/hive_2019-09-20_14-15-00_585_2418580963205350294-1/-mr-10000/.hive-staging_hive_2019-09-20_14-15-00_585_2418580963205350294-1/-ext-10001
NumFilesPerFileSink: 1
Statistics: Num rows: 1 Data size: 145 Basic stats: COMPLETE Column stats: NONE
Stats Publishing Key Prefix: hdfs://cmcc/tmp/hive/hadoop/573e56e0-ee46-47d7-9f61-98649dce77a3/hive_2019-09-20_14-15-00_585_2418580963205350294-1/-mr-10000/.hive-staging_hive_2019-09-20_14-15-00_585_2418580963205350294-1/-ext-10001/
table:
input format: org.apache.hadoop.mapred.TextInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
properties:
columns _col0,_col1
columns.types string:bigint
escape.delim \
hive.serialization.extend.additional.nesting.levels true
serialization.format 1
serialization.lib org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
TotalFiles: 1
GatherStats: false
MultiFileSpray: false
Stage: Stage-0
Fetch Operator
limit: -1
Processor Tree:
ListSink
join
In Hive the small table (result set) should always drive the large table (result set).
In Hive, on conditions may only be equality comparisons joined with and.
Check whether Hive is configured to convert a common join into a map-side join, and what the small-table file-size threshold for mapjoin is (see the sketch after these properties).
Pay attention to Hive's skew join settings:
hive.optimize.skewjoin=false --- whether the skew join optimization is enabled; off by default
hive.skewjoin.key=100000 --- a join key seen on more than this many rows is treated as skewed
hive.skewjoin.mapjoin.map.tasks=10000 --- number of map tasks for the follow-up map join on the skewed keys
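A sketch of the map-join conversion mentioned above (the property values shown are the usual defaults and are only illustrative; t_user1 and t_user2 are the sample tables used later in these notes):
-- let Hive rewrite a common join into a map-side join automatically
set hive.auto.convert.join=true;
-- tables below this size (bytes) are treated as the small, broadcast side
set hive.mapjoin.smalltable.filesize=25000000;
select
t1.*
from t_user1 t1
join t_user2 t2
on t1.id = t2.id;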
limit optimization
hive.limit.row.max.size=100000 --- at most 100000 rows for limit
hive.limit.optimize.limit.file=10 ---- limit the sample to at most 10 files
hive.limit.optimize.enable=false (recommended to enable when limit is used a lot) --- whether the limit optimization is on; off by default
hive.limit.optimize.fetch.max=50000 ---- a fetch-style query (e.g. select * from a table) returns at most 50000 rows via this optimization
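A minimal sketch of enabling the limit optimization for one session, using the properties listed above (whether it actually helps depends on how the data is laid out):
set hive.limit.optimize.enable=true;   -- sample the input instead of scanning it all
set hive.limit.row.max.size=100000;    -- the sample may cover at most this many rows
set hive.limit.optimize.limit.file=10; -- and at most this many files
select * from win limit 10;            -- a fetch-style query that benefits from the sampling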
Local mode
hive.exec.mode.local.auto=false (recommended to enable) --- local mode
hive.exec.mode.local.auto.inputbytes.max=134217728 --- total input size, 128 MB
hive.exec.mode.local.auto.input.files.max=4 --- at most 4 input files
Together these mean: with local mode enabled, a query whose total input is no more than 128 MB and spans no more than 4 files runs in local mode.
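For example, with the thresholds above a small test query can run entirely in-process instead of being submitted to the cluster; a sketch:
set hive.exec.mode.local.auto=true;                      -- turn local mode on
set hive.exec.mode.local.auto.inputbytes.max=134217728;  -- at most 128 MB of input
set hive.exec.mode.local.auto.input.files.max=4;         -- at most 4 input files
select class, count(*) from win group by class;          -- small enough to run locally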
Parallel execution
hive.exec.parallel=false (recommended to enable)
hive.exec.parallel.thread.number=8 -- up to 8 stages may run in parallel (the default)
In principle more parallelism is better, but setting it too high also hurts because the stages compete for cluster resources.
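A sketch of enabling it per session; only stages with no dependency on each other (for example the two branches of a union all) can actually overlap:
set hive.exec.parallel=true;             -- allow independent stages to run at the same time
set hive.exec.parallel.thread.number=8;  -- at most 8 stages in parallel
-- the two branches below do not depend on each other, so their stages may run concurrently
select class, count(*) as cnt from win group by class
union all
select class, count(distinct name) as cnt from win group by class;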
Strict mode
hive.mapred.mode=nonstrict --- non-strict mode, the default
hive.mapred.mode=strict -- strict mode
When enabled, strict mode blocks certain risky queries, as in the sketch below.
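For example, strict mode rejects an order by without a limit (it would funnel every row through a single reducer), a query on a partitioned table without a partition filter, and Cartesian-product joins. A sketch (t_log and its dt partition column are made-up names for illustration):
set hive.mapred.mode=strict;
-- rejected in strict mode: order by with no limit
-- select * from t_log order by dt;
-- accepted: partition filter plus a bounded order by
select * from t_log where dt = '2019-09-20' order by dt limit 100;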
Number of mappers and reducers
More mappers and reducers are not always better, nor are fewer; the right number depends on the job.
Merge small files before processing (set the input format class to CombineTextInputFormat).
Merging small files through configuration:
mapred.max.split.size=256000000 -- maximum split size, about 256 MB
mapred.min.split.size.per.node=1 -- minimum split size per node
mapred.min.split.size.per.rack=1 --- minimum split size per rack
hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat -- input format that combines small files
Setting manually:
set mapred.map.tasks=2; -- set the number of map tasks by hand; the default is 2
Number of reducers (determined automatically or set manually):
mapred.reduce.tasks=-1 --- number of reducers; the default -1 lets Hive decide automatically
hive.exec.reducers.max=1009 --- maximum number of reducers
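A sketch that combines the two sides (the numbers are only examples; left alone, Hive derives the reducer count from hive.exec.reducers.bytes.per.reducer):
-- merge small input files so one mapper can read several of them
set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;
set mapred.max.split.size=256000000;
-- pin the reducer count instead of letting Hive choose
set mapred.reduce.tasks=4;
select class, count(*) from win group by class;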
JVM reuse
mapreduce.job.jvm.numtasks=1 --- by default each JVM runs only one task
mapred.job.reuse.jvm.num.tasks=1; --- the older property name for the same setting
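A sketch, assuming a job made up of many short tasks where JVM start-up time dominates:
set mapreduce.job.jvm.numtasks=10;      -- one JVM runs up to 10 tasks back to back
set mapred.job.reuse.jvm.num.tasks=10;  -- same setting under the older property name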
Number of jobs
Usually one query produces one job; typically each subquery, join, group by, limit, and similar operation corresponds to one job.
1 job:
select
t1.*
from t_user1 t1
left join t_user2 t2
on t1.id = t2.id
where t2.id is null
;
The following produces 3 jobs:
select
t1.*
from t_user1 t1
where id in (
select
t2.id
from t_user2 t2
limit 1
)
;