Common Hive Table Joins
Compared with writing joins directly in MapReduce, Hive's advantage is that it simplifies the tedious processing work, and table joins are a case in point. This article covers Hive's four main join types: the inner join, the outer join, the semi join, and the map join.
We will use the data in the sales and things tables below to illustrate what each join does.
(Figure 1: the sales table) (Figure 2: the things table)
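In text form, the two sample tables can be recreated with statements like the following. The column names and types are assumptions (the EXPLAIN output later in this article shows id as a string), and the row values are reconstructed from the query results shown below:

```sql
-- Assumed schemas; row values reconstructed from the query outputs in this article.
CREATE TABLE sales (name STRING, id STRING);
CREATE TABLE things (id STRING, name STRING);

-- INSERT ... VALUES requires Hive 0.14 or later.
INSERT INTO TABLE sales VALUES
  ('joe', '2'), ('hank', '3'), ('wangwu', '4'), ('lisi', '0'), ('daic', '2');
INSERT INTO TABLE things VALUES
  ('1', 'air'), ('2', 'shuit'), ('3', 'milk'), ('4', 'water');
```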
Inner join
The inner join is the simplest kind of join: it outputs only the rows that match in both tables. The tables are combined with the JOIN keyword, and the join predicate follows the ON keyword, which is where the equality condition is specified; conditions can of course be combined with AND, OR, and so on.
For example: select sales.*,things.* from sales join things on (sales.id=things.id);
(Figure 3: query result)
Adding an AND condition: select sales.*,things.* from sales join things on (sales.id=things.id and sales.id>2);
(Figure 4: query result)
A single join normally runs as one MapReduce job; you can use EXPLAIN to see how many jobs a query executes.
For example: explain extended select sales.*,things.* from sales join things on (sales.id=things.id);
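As an aside, when a query joins more than two tables on the same join key, Hive can evaluate all of the joins in a single MapReduce job; joins on different keys need one job each. A sketch (the discounts table is hypothetical, not part of the example data):

```sql
-- discounts is a hypothetical third table, shown only to illustrate the rule.
-- Both ON clauses use sales.id, so Hive can run this as a single job;
-- EXPLAIN would confirm the number of stages.
SELECT sales.*, things.*, discounts.*
FROM sales
JOIN things ON (sales.id = things.id)
JOIN discounts ON (sales.id = discounts.id);
```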
Outer join
An outer join also outputs rows that have no match in the other table. There are three forms: left outer join, right outer join, and full outer join.
left outer join
A left outer join outputs every row of the left table; where a row has no match in the joined table, the right table's columns are shown as NULL.
For example: select sales.*,things.* from sales left outer join things on (sales.id=things.id);
joe 2 shuit 2
hank 3 milk 3
wangwu 4 water 4
lisi 0 NULL NULL
daic 2 shuit 2
right outer join
Compared with the left outer join, the right outer join simply swaps the roles of the two tables.
For example: select sales.*,things.* from sales right outer join things on (sales.id=things.id);
joe 2 shuit 2
daic 2 shuit 2
wangwu 4 water 4
NULL NULL air 1
hank 3 milk 3
full outer join
As the name suggests, the full outer join outputs a row for every row of both tables, whether or not it has a match.
For example:
select sales.*,things.* from sales full outer join things on (sales.id=things.id);
lisi 0 NULL NULL
wangwu 4 water 4
NULL NULL air 1
joe 2 shuit 2
daic 2 shuit 2
hank 3 milk 3
Semi join
A semi join is similar to a left join, except that it does not output any columns from the right table:
For example: select * from sales left semi join things on (sales.id=things.id);
joe 2
hank 3
wangwu 4
daic 2
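A LEFT SEMI JOIN behaves like an IN/EXISTS subquery: the right table may only be referenced in the ON clause, and its columns never appear in the output. The query above can equivalently be written as:

```sql
-- Equivalent IN-subquery form of the LEFT SEMI JOIN above;
-- it returns the same sales rows.
SELECT * FROM sales
WHERE sales.id IN (SELECT id FROM things);
```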
Map join
When one table is small enough to fit in memory, such as the sales table here, it can be loaded into memory and the join performed there. To request a map join explicitly, use a hint written as a comment.
For example: select /*+ mapjoin(sales) */ sales.*,things.* from sales join things on (sales.id=things.id);
joe 2 shuit 2
hank 3 milk 3
wangwu 4 water 4
daic 2 shuit 2
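In recent Hive versions the hint is usually unnecessary: when hive.auto.convert.join is enabled (the default since Hive 0.11), Hive converts a common join to a map join automatically whenever the smaller table falls under a size threshold. A sketch of the relevant settings (the threshold shown is the usual default; check your own configuration):

```sql
-- Convert common joins to map joins automatically.
SET hive.auto.convert.join=true;
-- Size threshold (bytes) below which a table counts as "small"; 25 MB by default.
SET hive.mapjoin.smalltable.filesize=25000000;

SELECT sales.*, things.* FROM sales JOIN things ON (sales.id = things.id);
```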
Finally, let's look at the execution plan.
For example: explain select /*+ mapjoin(sales) */ sales.*,things.* from sales join things on (sales.id=things.id);
STAGE DEPENDENCIES:
Stage-2 is a root stage
Stage-1 depends on stages: Stage-2
Stage-0 depends on stages: Stage-1
STAGE PLANS:
Stage: Stage-2
Spark
DagName: hadoop_20190126120909_7f4e37ab-c15f-465e-89d7-14f2b8283d6a:32
Vertices:
Map 2
Map Operator Tree:
TableScan
alias: things
Statistics: Num rows: 1 Data size: 29 Basic stats: COMPLETE Column stats: NONE
Filter Operator
predicate: id is not null (type: boolean)
Statistics: Num rows: 1 Data size: 29 Basic stats: COMPLETE Column stats: NONE
Spark HashTable Sink Operator
keys:
0 id (type: string)
1 id (type: string)
Local Work:
Map Reduce Local Work
Stage: Stage-1
Spark
DagName: hadoop_20190126120909_7f4e37ab-c15f-465e-89d7-14f2b8283d6a:31
Vertices:
Map 1
Map Operator Tree:
TableScan
alias: sales
Statistics: Num rows: 1 Data size: 36 Basic stats: COMPLETE Column stats: NONE
Filter Operator
predicate: id is not null (type: boolean)
Statistics: Num rows: 1 Data size: 36 Basic stats: COMPLETE Column stats: NONE
Map Join Operator
condition map:
Inner Join 0 to 1
keys:
0 id (type: string)
1 id (type: string)
outputColumnNames: _col0, _col1, _col5, _col6
input vertices:
1 Map 2
Statistics: Num rows: 1 Data size: 39 Basic stats: COMPLETE Column stats: NONE
Select Operator
expressions: _col0 (type: string), _col1 (type: string), _col5 (type: string), _col6 (type: string)
outputColumnNames: _col0, _col1, _col2, _col3
Statistics: Num rows: 1 Data size: 39 Basic stats: COMPLETE Column stats: NONE
File Output Operator
compressed: false
Statistics: Num rows: 1 Data size: 39 Basic stats: COMPLETE Column stats: NONE
table:
input format: org.apache.hadoop.mapred.TextInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
Local Work:
Map Reduce Local Work
Stage: Stage-0
Fetch Operator
limit: -1
Processor Tree:
ListSink