HIVE调优MapJoin

最新推荐文章于 2024-05-22 16:49:19 发布

难以触及的高度

最新推荐文章于 2024-05-22 16:49:19 发布

阅读量1.6k

点赞数 42

文章标签： hive hadoop 数据仓库

本文链接：https://blog.csdn.net/2301_77836489/article/details/138697347

版权

HIVE调优MapJoin

1.mapjoin （1.2以后自动默认启动mapjoin）

1.mapjoin （1.2以后自动默认启动mapjoin）

select /*+mapjoin(b)*/ a.xx,b.xxx from a left outer join b on a.id=b.id

2.创建表格


CREATE EXTERNAL TABLE IF NOT EXISTS learn4.student1(
id STRING COMMENT "学生ID",
name STRING COMMENT "学生姓名",
age int COMMENT "年龄",
gender STRING COMMENT "性别",
clazz STRING COMMENT "班级"
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ",";

load data local inpath "/usr/local/soft/hive-3.1.2/data/students.txt" INTO TABLE learn4.student1;


CREATE EXTERNAL TABLE IF NOT EXISTS learn4.score1(
id STRING COMMENT "学生ID",
subject_id STRING COMMENT "科目ID",
score int COMMENT "成绩"
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ",";
load data local inpath "/usr/local/soft/hive-3.1.2/data/score.txt" INTO TABLE learn4.score1;

3.查询建表


CREATE TABLE mapJonTest AS SELECT max.name,min.subject_id,min.score FROM learn4.student1 max JOIN learn4.score1 min ON max.id = min.id;

建表所需时间：

INFO : Total MapReduce CPU Time Spent: 2 seconds 510 msec
INFO : Completed executing command(queryId=root_20240511090524_3a34bdda-4247-4af4-b686-d681856af110); Time taken: 19.199 seconds
INFO : OK
INFO : Concurrency mode is disabled, not creating a lock manager
No rows affected (20.707 seconds)

4.通过 explain 展示执行计划

explain SELECT max.name,min.subject_id,min.score FROM learn4.student1 max JOIN learn4.score1 min ON max.id = min.id;


查看详细信息：

explain extended SELECT max.name,min.subject_id,min.score FROM learn4.student1 max JOIN learn4.score1 min ON max.id = min.id;

| STAGE DEPENDENCIES: | -- 执行stage的依赖
| Stage-4 is a root stage |   Stage-4 表示根流程 --表示最先执行的流程
| Stage-3 depends on stages: Stage-4 |   Stage-3 依赖 Stage-4
| Stage-0 depends on stages: Stage-3 |   Stage-0 依赖 Stage-3 依赖 Stage-4
| |
| STAGE PLANS: |
| Stage: Stage-4 |
| Map Reduce Local Work |
| Alias -> Map Local Tables: |
| $hdt$_1:min |
| Fetch Operator |
| limit: -1 |
| Alias -> Map Local Operator Tree: |
| $hdt$_1:min |
| TableScan |   TableScan 扫描的表
| alias: min |
| Statistics: Num rows: 1 Data size: 1385400 Basic stats: COMPLETE Column stats: NONE |
| Filter Operator |
| predicate: id is not null (type: boolean) |
| Statistics: Num rows: 1 Data size: 1385400 Basic stats: COMPLETE Column stats: NONE |
| Select Operator |
| expressions: id (type: string), subject_id (type: string), score (type: int) |
| outputColumnNames: _col0, _col1, _col2 |
| Statistics: Num rows: 1 Data size: 1385400 Basic stats: COMPLETE Column stats: NONE |
| HashTable Sink Operator |
| keys: |
| 0 _col0 (type: string) |
| 1 _col0 (type: string) |
| |
| Stage: Stage-3 |
| Map Reduce |
| Map Operator Tree: |
| TableScan |
| alias: max |
| Statistics: Num rows: 1 Data size: 388080000 Basic stats: COMPLETE Column stats: NONE |
| Filter Operator |
| predicate: id is not null (type: boolean) |
| Statistics: Num rows: 1 Data size: 388080000 Basic stats: COMPLETE Column stats: NONE |
| Select Operator |
| expressions: id (type: string), name (type: string) |
| outputColumnNames: _col0, _col1 |
| Statistics: Num rows: 1 Data size: 388080000 Basic stats: COMPLETE Column stats: NONE |
| Map Join Operator | -- 不需要做任何操作默认开启 Map JOIN 操作
| condition map: |
| Inner Join 0 to 1 |
| keys: |
| 0 _col0 (type: string) |
| 1 _col0 (type: string) |
| outputColumnNames: _col1, _col3, _col4 |
| Statistics: Num rows: 1 Data size: 426888009 Basic stats: COMPLETE Column stats: NONE |
| Select Operator |
| expressions: _col1 (type: string), _col3 (type: string), _col4 (type: int) |
| outputColumnNames: _col0, _col1, _col2 |
| Statistics: Num rows: 1 Data size: 426888009 Basic stats: COMPLETE Column stats: NONE |
| File Output Operator |
| compressed: false |
| Statistics: Num rows: 1 Data size: 426888009 Basic stats: COMPLETE Column stats: NONE |
| table: |
| input format: org.apache.hadoop.mapred.SequenceFileInputFormat |
| output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat |
| serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe |
| Execution mode: vectorized |
| Local Work: |
| Map Reduce Local Work |
| |
| Stage: Stage-0 |
| Fetch Operator |
| limit: -1 |
| Processor Tree: |
| ListSink |

5.Map JOIN 相关设置：

1）设置自动选择Mapjoin

set hive.auto.convert.join = true; 默认为true

set hive.auto.convert.join = false; 默认为true

2）大表小表的阈值设置（默认25M以下认为是小表）：

set hive.mapjoin.smalltable.filesize = 25000000;

set hive.mapjoin.smalltable.filesize = 10000000;

难以触及的高度

关注

42
点赞
踩
44

收藏

觉得还不错? 一键收藏
0
评论
HIVE调优MapJoin

HIVE调优具体实现
复制链接

扫一扫

HIVE调优MapJoin

HIVE调优MapJoin

1.mapjoin （1.2以后自动默认启动mapjoin）

2.创建表格

3.查询建表

4.通过 explain 展示执行计划

5.Map JOIN 相关设置：

“相关推荐”对你有帮助么？