Hive Optimization
The core idea of Hive optimization: treat Hive SQL as a MapReduce program and optimize it as such.
The following kinds of SQL are not converted to MapReduce for execution:
SELECT that reads only columns of the table itself
WHERE that filters only on columns of the table itself
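For example (a sketch; the table and columns match the student table used in the session below, and whether the filtered scan avoids MapReduce also depends on the fetch strategy discussed at the end of this section):

```sql
-- Served by a simple fetch task; no MapReduce job is launched:
SELECT id, name FROM student;
SELECT * FROM student WHERE id > 10;

-- These need aggregation or a shuffle, so they compile to MapReduce:
SELECT count(*) FROM student;
SELECT id, name FROM student ORDER BY id;
```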
EXPLAIN shows the execution plan: EXPLAIN [EXTENDED] query
hive> explain extended select * from student;
OK
Explain
STAGE DEPENDENCIES:
  Stage-0 is a root stage

STAGE PLANS:
  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        TableScan
          alias: student
          Statistics: Num rows: 1 Data size: 618 Basic stats: COMPLETE Column stats: NONE
          GatherStats: false
          Select Operator
            expressions: id (type: int), name (type: string), likes (type: array<string>), address (type: map<string,string>)
            outputColumnNames: _col0, _col1, _col2, _col3
            Statistics: Num rows: 1 Data size: 618 Basic stats: COMPLETE Column stats: NONE
            ListSink
Time taken: 0.231 seconds, Fetched: 18 row(s)
========================================================
hive> explain extended select count(*) from student;
OK
Explain
STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 depends on stages: Stage-1

STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Map Operator Tree:
          TableScan
            alias: student
            Statistics: Num rows: 1 Data size: 618 Basic stats: COMPLETE Column stats: COMPLETE
            GatherStats: false
            Select Operator
              Statistics: Num rows: 1 Data size: 618 Basic stats: COMPLETE Column stats: COMPLETE
              Group By Operator
                aggregations: count()
                mode: hash
                outputColumnNames: _col0
                Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: COMPLETE
                Reduce Output Operator
                  null sort order:
                  sort order:
                  Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: COMPLETE
                  tag: -1
                  value expressions: _col0 (type: bigint)
                  auto parallelism: false
      Path -> Alias:
        hdfs://mycluster/user/hive/warehouse/student [student]
      Path -> Partition:
        hdfs://mycluster/user/hive/warehouse/student
          Partition
            base file name: student
            input format: org.apache.hadoop.mapred.TextInputFormat
            output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
            properties:
              bucket_count -1
              colelction.delim -
              column.name.delimiter ,
              columns id,name,likes,address
              columns.comments
              columns.types int:string:array<string>:map<string,string>
              field.delim ,
              file.inputformat org.apache.hadoop.mapred.TextInputFormat
              file.outputformat org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
              location hdfs://mycluster/user/hive/warehouse/student
              mapkey.delim :
              name default.student
              numFiles 1
              numRows 0
              rawDataSize 0
              serialization.ddl struct student { i32 id, string name, list<string> likes, map<string,string> address}
              serialization.format ,
              serialization.lib org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
              totalSize 618
              transient_lastDdlTime 1624695643
            serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

            input format: org.apache.hadoop.mapred.TextInputFormat
            output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
            properties:
              bucket_count -1
              colelction.delim -
              column.name.delimiter ,
              columns id,name,likes,address
              columns.comments
              columns.types int:string:array<string>:map<string,string>
              field.delim ,
              file.inputformat org.apache.hadoop.mapred.TextInputFormat
              file.outputformat org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
              location hdfs://mycluster/user/hive/warehouse/student
              mapkey.delim :
              name default.student
              numFiles 1
              numRows 0
              rawDataSize 0
              serialization.ddl struct student { i32 id, string name, list<string> likes, map<string,string> address}
              serialization.format ,
              serialization.lib org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
              totalSize 618
              transient_lastDdlTime 1624695643
            serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
            name: default.student
          name: default.student
      Truncated Path -> Alias:
        /student [student]
      Needs Tagging: false
      Reduce Operator Tree:
        Group By Operator
          aggregations: count(VALUE._col0)
          mode: mergepartial
          outputColumnNames: _col0
          Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: COMPLETE
          File Output Operator
            compressed: false
            GlobalTableId: 0
            directory: hdfs://mycluster/tmp/hive/root/e7e5657c-f366-4391-a117-3838e2f530ba/hive_2021-07-03_09-26-39_045_671258483012844310-1/-mr-10001/.hive-staging_hive_2021-07-03_09-26-39_045_671258483012844310-1/-ext-10002
            NumFilesPerFileSink: 1
            Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: COMPLETE
            Stats Publishing Key Prefix: hdfs://mycluster/tmp/hive/root/e7e5657c-f366-4391-a117-3838e2f530ba/hive_2021-07-03_09-26-39_045_671258483012844310-1/-mr-10001/.hive-staging_hive_2021-07-03_09-26-39_045_671258483012844310-1/-ext-10002/
            table:
                input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
                properties:
                  columns _col0
                  columns.types bigint
                  escape.delim \
                  hive.serialization.extend.additional.nesting.levels true
                  serialization.escape.crlf true
                  serialization.format 1
                  serialization.lib org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
                serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
            TotalFiles: 1
            GatherStats: false
            MultiFileSpray: false

  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink
Time taken: 0.183 seconds, Fetched: 121 row(s)
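Comparing the two plans above: the plain SELECT * needs only a single Fetch Operator stage (Stage-0), while count(*) adds a MapReduce stage (Stage-1) for the aggregation. A quick way to check the stage structure without the EXTENDED detail, sketched below (output omitted):

```sql
-- Plain EXPLAIN skips the partition/SerDe detail, so stages are easier to see.
EXPLAIN SELECT * FROM student;         -- plan shows only Stage-0 (Fetch Operator)
EXPLAIN SELECT count(*) FROM student;  -- plan shows Stage-1 (Map Reduce) plus Stage-0 (Fetch)
```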
Hive Fetch Strategy
For certain queries, Hive does not need to run a MapReduce computation at all.
set hive.fetch.task.conversion=none/more;
1. Hive's default fetch strategy is more. If the strategy is set to none, then every query, even the simplest SELECT, is converted to a MapReduce job.
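A minimal sketch of toggling the fetch strategy in a Hive session (the student table is the one from the examples above):

```sql
-- With fetch conversion disabled, even a bare SELECT launches a MapReduce job:
SET hive.fetch.task.conversion=none;
SELECT * FROM student;

-- Restore the default; simple scans are again served directly by a fetch task:
SET hive.fetch.task.conversion=more;
SELECT * FROM student;
```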