08 - Hive Advanced Queries: join

Note: my friend, please do not repost this article, since you can read it right here.

Hello everyone. Today we will learn the join syntax for advanced queries in Hive.

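Are you hoping to learn Hive from end to end? I plan to write the whole thing up, as far as my own knowledge goes. What I write is fairly down-to-earth, because I know little about the fancy stuff. Let's keep at it and improve together!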

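1 Let's first review the previous lesson. There we learned that order by is a global operation and group by is an aggregation operation. One way to avoid data skew is to set the parameter: hive.groupby.skewindata=true;
This load-balances the work when the data is skewed. When the option is set to true, the generated query plan has two MR jobs. If you want to learn more about data skew, here is a link: Causes of data skew (数据倾斜的原因)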
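
As a quick reminder of how that setting is used, here is a minimal sketch. It assumes the table m that is created in section 3 below; the count query is only an illustration and is not taken from the previous lesson.

```sql
-- Ask Hive to plan a skewed GROUP BY as two MR jobs
set hive.groupby.skewindata=true;

-- Illustrative aggregation over table m (created in section 3)
select col, count(*) as cnt
from m
group by col;
```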

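2 Today's topic: [join]
Table joins
2.1 Two tables m and n are joined on the on condition: one record from m and one record from n are combined into one new record.
2.2 join (equi-join): a row is output only when the key value exists in both m and n.
2.3 left outer join: every row of the left table is output whether or not its key exists in n; a row of the right table is output only when its key also exists in the left table.
2.4 right outer join: the opposite of left outer join.
2.5 left semi join: similar to exists.
2.6 mapjoin: the join is completed on the map side, with no reduce needed; it joins in memory and is an optimization.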

3 Create table m and table n with the following fields and data:

Table m:
col	col2
A	1
C	5
B	2
C	3

Table n:
col	col3
C	4
D	5
A	6

Run the following statements in the Hive console:

create table m(
col string,
col2 string
)
row format delimited fields terminated by '\t' 
lines terminated by '\n'
stored as textfile;

create table n(
col string,
col3 string
)
row format delimited fields terminated by '\t' 
lines terminated by '\n'
stored as textfile;

load data local inpath '/usr/host/m' into table m;
load data local inpath '/usr/host/n' into table n;

The statements above should be easy to follow: they simply create the tables and load the data.

hive> select * from n;
OK
C	4
D	5
A	6
Time taken: 0.415 seconds
hive> select * from m;
OK
A	1
C	5
B	2
C	3
Time taken: 0.317 seconds
hive> 

Next we start the join operations. The statement is as follows:

hive> set hive.auto.convert.join=true;
hive> select s.col,s.col2,t.col3
    > from
    > (select col,col2 from m)s
    > join
    > (select col,col3 from n)t
    > on s.col=t.col;
java.lang.InstantiationException: org.antlr.runtime.CommonToken
Continuing ...
java.lang.RuntimeException: failed to evaluate: <unbound>=Class.new();
Continuing ...
java.lang.InstantiationException: org.antlr.runtime.CommonToken
Continuing ...
java.lang.RuntimeException: failed to evaluate: <unbound>=Class.new();
Continuing ...
Total MapReduce jobs = 3
Ended Job = -1393545778, job is filtered out (removed at runtime).
Ended Job = -1334217954, job is filtered out (removed at runtime).
2016-06-06 05:25:03	Starting to launch local task to process map join;	maximum memory = 518979584
2016-06-06 05:25:05	Processing rows:	3	Hashtable size:	3	Memory usage:	5078208	rate:	0.01
2016-06-06 05:25:05	Dump the hashtable into file: file:/tmp/root/hive_2016-06-06_05-24-57_982_7288788097348023892/-local-10002/HashTable-Stage-3/MapJoin-mapfile01--.hashtable
2016-06-06 05:25:05	Upload 1 File to: file:/tmp/root/hive_2016-06-06_05-24-57_982_7288788097348023892/-local-10002/HashTable-Stage-3/MapJoin-mapfile01--.hashtable File size: 432
2016-06-06 05:25:05	End of local task; Time Taken: 1.587 sec.
Execution completed successfully
Mapred Local Task Succeeded . Convert the Join into MapJoin
Mapred Local Task Succeeded . Convert the Join into MapJoin
Launching Job 2 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
Job running in-process (local Hadoop)
Hadoop job information for null: number of mappers: 1; number of reducers: 0
2016-06-06 05:25:19,055 null map = 0%,  reduce = 0%
2016-06-06 05:25:28,989 null map = 100%,  reduce = 0%, Cumulative CPU 0.97 sec
2016-06-06 05:25:30,087 null map = 100%,  reduce = 0%, Cumulative CPU 0.97 sec
2016-06-06 05:25:31,173 null map = 100%,  reduce = 0%, Cumulative CPU 0.97 sec
MapReduce Total cumulative CPU time: 970 msec
Ended Job = job_1465200327080_0019
Execution completed successfully
Mapred Local Task Succeeded . Convert the Join into MapJoin
OK
A	1	6
C	5	4
C	3	4
Time taken: 35.081 seconds
hive> 

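join is an equi-join: a row is output only when the key value exists in both m and n. That is why the output looks as shown above: B exists only in m and D exists only in n, so neither appears.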
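
To check which join strategy Hive picks without running the job, the same query can be prefixed with explain. This is only a sketch: the exact plan text depends on the Hive version, but with hive.auto.convert.join=true and tables this small the plan would normally contain a Map Join Operator.

```sql
set hive.auto.convert.join=true;

-- Print the query plan instead of executing the join
explain
select s.col, s.col2, t.col3
from (select col, col2 from m) s
join (select col, col3 from n) t
on s.col = t.col;
```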

2 Left outer join

hive> set hive.optimize.skewjoin=true;
hive> set hive.auto.convert.join=true;
hive> select s.col,s.col2,t.col3
    > from
    > (select col,col2 from m)s
    > left outer join
    > (select col,col3 from n)t
    > on s.col=t.col;
Total MapReduce jobs = 2
Ended Job = 1311401655, job is filtered out (removed at runtime).
16/06/06 05:57:17 INFO Configuration.deprecation: mapred.job.name is deprecated. Instead, use mapreduce.job.name
16/06/06 05:57:17 INFO Configuration.deprecation: mapred.system.dir is deprecated. Instead, use mapreduce.jobtracker.system.dir
2016-06-06 05:57:18	Starting to launch local task to process map join;	maximum memory = 518979584
2016-06-06 05:57:20	Processing rows:	3	Hashtable size:	3	Memory usage:	5081104	rate:	0.01
2016-06-06 05:57:20	Dump the hashtable into file: file:/tmp/root/hive_2016-06-06_05-57-12_989_8198446239599600254/-local-10002/HashTable-Stage-3/MapJoin-mapfile111--.hashtable
2016-06-06 05:57:20	Upload 1 File to: file:/tmp/root/hive_2016-06-06_05-57-12_989_8198446239599600254/-local-10002/HashTable-Stage-3/MapJoin-mapfile111--.hashtable File size: 435
2016-06-06 05:57:20	End of local task; Time Taken: 1.401 sec.
Execution completed successfully
Mapred Local Task Succeeded . Convert the Join into MapJoin
Mapred Local Task Succeeded . Convert the Join into MapJoin
Launching Job 2 out of 2
Number of reduce tasks is set to 0 since there's no reduce operator
16/06/06 05:57:21 INFO Configuration.deprecation: mapred.job.name is deprecated. Instead, use mapreduce.job.name
16/06/06 05:57:21 INFO Configuration.deprecation: mapred.system.dir is deprecated. Instead, use mapreduce.jobtracker.system.dir
Job running in-process (local Hadoop)
Hadoop job information for null: number of mappers: 1; number of reducers: 0
2016-06-06 05:57:32,269 null map = 0%,  reduce = 0%
2016-06-06 05:57:41,293 null map = 100%,  reduce = 0%, Cumulative CPU 1.04 sec
2016-06-06 05:57:42,388 null map = 100%,  reduce = 0%, Cumulative CPU 1.04 sec
MapReduce Total cumulative CPU time: 1 seconds 40 msec
Ended Job = job_1465200327080_0027
Execution completed successfully
Mapred Local Task Succeeded . Convert the Join into MapJoin
OK
A	1	6
C	5	4
B	2	NULL
C	3	4
Time taken: 30.428 seconds
hive> 

3 Right outer join

hive> set hive.auto.convert.join=true;
hive> select s.col,s.col2,t.col3
    > from
    > (select col,col2 from m)s
    > right outer join
    > (select col,col3 from n)t
    > on s.col=t.col;
java.lang.InstantiationException: org.antlr.runtime.CommonToken
Continuing ...
java.lang.RuntimeException: failed to evaluate: <unbound>=Class.new();
Continuing ...
java.lang.InstantiationException: org.antlr.runtime.CommonToken
Continuing ...
java.lang.RuntimeException: failed to evaluate: <unbound>=Class.new();
Continuing ...
Total MapReduce jobs = 2
Ended Job = 84151671, job is filtered out (removed at runtime).
16/06/06 06:01:40 INFO Configuration.deprecation: mapred.job.name is deprecated. Instead, use mapreduce.job.name
16/06/06 06:01:40 INFO Configuration.deprecation: mapred.system.dir is deprecated. Instead, use mapreduce.jobtracker.system.dir
16/06/06 06:01:40 INFO Configuration.deprecation: mapred.local.dir is deprecated. Instead, use mapreduce.cluster.local.dir
2016-06-06 06:01:41	Starting to launch local task to process map join;	maximum memory = 518979584
2016-06-06 06:01:43	Processing rows:	3	Hashtable size:	3	Memory usage:	5105544	rate:	0.01
2016-06-06 06:01:43	Dump the hashtable into file: file:/tmp/root/hive_2016-06-06_06-01-36_820_8095318663586481472/-local-10002/HashTable-Stage-3/MapJoin-mapfile00--.hashtable
2016-06-06 06:01:43	Upload 1 File to: file:/tmp/root/hive_2016-06-06_06-01-36_820_8095318663586481472/-local-10002/HashTable-Stage-3/MapJoin-mapfile00--.hashtable File size: 451
2016-06-06 06:01:43	End of local task; Time Taken: 1.559 sec.
Execution completed successfully
Mapred Local Task Succeeded . Convert the Join into MapJoin
Mapred Local Task Succeeded . Convert the Join into MapJoin
Launching Job 2 out of 2
Number of reduce tasks is set to 0 since there's no reduce operator
16/06/06 06:01:44 INFO Configuration.deprecation: mapred.job.name is deprecated. Instead, use mapreduce.job.name
16/06/06 06:01:44 INFO Configuration.deprecation: mapred.system.dir is deprecated. Instead, use mapreduce.jobtracker.system.dir
16/06/06 06:01:44 INFO Configuration.deprecation: mapred.local.dir is deprecated. Instead, use mapreduce.cluster.local.dir
Job running in-process (local Hadoop)
Hadoop job information for null: number of mappers: 1; number of reducers: 0
2016-06-06 06:01:56,531 null map = 0%,  reduce = 0%
2016-06-06 06:02:05,414 null map = 100%,  reduce = 0%, Cumulative CPU 0.71 sec
2016-06-06 06:02:06,504 null map = 100%,  reduce = 0%, Cumulative CPU 0.71 sec
MapReduce Total cumulative CPU time: 710 msec
Ended Job = job_1465200327080_0029
Execution completed successfully
Mapred Local Task Succeeded . Convert the Join into MapJoin
OK
C	5	4
C	3	4
NULL	NULL	5
A	1	6
Time taken: 31.606 seconds
hive> 

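Why does the output look like this? Take some time to study the figure below.
4 Comparison of the outputs
[Figure: comparison of the join outputs]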
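
Section 2.5 lists left semi join, but none of the runs above use it. A minimal sketch against the same tables m and n follows; it is not taken from the session above. With left semi join, only columns of the left table may appear in the select list, and each row of m is returned at most once no matter how many rows of n match.

```sql
-- Keep the rows of m whose col also appears in n; roughly
-- "select col, col2 from m where col in (select col from n)"
select m.col, m.col2
from m
left semi join n
on m.col = n.col;
```

With the data from section 3, this should keep the rows of m whose col is A or C and drop B, since B does not appear in n.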

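5 Optimization parameter settings
If you run into problems and a statement will not execute, try setting the following parameter before running it: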

set hive.optimize.skewjoin=true;

This parameter serves a similar purpose to the optimization parameter set for group by earlier: it is meant to mitigate data skew, here for joins.
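
A minimal sketch of how this might be combined with its companion threshold follows. hive.skewjoin.key is a standard Hive setting (default 100000), but it is not used in the session above, and the value shown is just that default. As far as I know, this runtime optimization is only applied to inner joins, so the sketch uses a plain join.

```sql
-- Turn on runtime handling of skewed join keys
set hive.optimize.skewjoin=true;
-- A join key with more rows than this is treated as skewed
set hive.skewjoin.key=100000;

select s.col, s.col2, t.col3
from (select col, col2 from m) s
join (select col, col3 from n) t
on s.col = t.col;
```

With tables as small as m and n the threshold is never reached, so the result is the same as the plain join; the settings only matter on large, skewed inputs.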

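6 mapjoin
mapjoin is simply an optimized form of join.
mapjoin (map side join)
On the map side, the small table is loaded into memory; the large table is then read and joined against the in-memory small table.
This uses the distributed cache.
How mapjoin works:
[Figure: how mapjoin works]
Advantages:
Consumes no reduce resources on the cluster (reduce slots are relatively scarce)
Skips the reduce phase, so the job runs faster
Lowers network traffic
Disadvantages:
Takes up some memory, so the table loaded into memory must not be too large, because every compute node loads its own copy
Produces many small files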

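There are two ways to turn a join into a mapjoin.
First way: set the following parameter so that Hive automatically chooses between common join and map join based on the SQL:
set hive.auto.convert.join=true;
The parameter hive.mapjoin.smalltable.filesize sets the size threshold below which a table counts as small; its default is 25 MB.
Second way: specify it manually with a hint, in the following form: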

select /*+mapjoin(n)*/ m.col,m.col2,n.col3 from m
join n
on m.col=n.col;
hive> select /*+mapjoin(n)*/ m.col,m.col2,n.col3 from m
    > join n
    > on m.col=n.col;
Total MapReduce jobs = 1
16/06/06 06:04:14 INFO Configuration.deprecation: mapred.job.name is deprecated. Instead, use mapreduce.job.name
16/06/06 06:04:14 INFO Configuration.deprecation: mapred.system.dir is deprecated. Instead, use mapreduce.jobtracker.system.dir
16/06/06 06:04:14 INFO Configuration.deprecation: mapred.local.dir is deprecated. Instead, use mapreduce.cluster.local.dir
2016-06-06 06:04:16	Starting to launch local task to process map join;	maximum memory = 518979584
2016-06-06 06:04:17	Processing rows:	3	Hashtable size:	3	Memory usage:	5062584	rate:	0.01
2016-06-06 06:04:17	Dump the hashtable into file: file:/tmp/root/hive_2016-06-06_06-04-09_627_2210755106302628931/-local-10002/HashTable-Stage-1/MapJoin-n-11--.hashtable
2016-06-06 06:04:17	Upload 1 File to: file:/tmp/root/hive_2016-06-06_06-04-09_627_2210755106302628931/-local-10002/HashTable-Stage-1/MapJoin-n-11--.hashtable File size: 432
2016-06-06 06:04:17	End of local task; Time Taken: 1.451 sec.
Execution completed successfully
Mapred Local Task Succeeded . Convert the Join into MapJoin
Mapred Local Task Succeeded . Convert the Join into MapJoin
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
16/06/06 06:04:18 INFO Configuration.deprecation: mapred.job.name is deprecated. Instead, use mapreduce.job.name
16/06/06 06:04:18 INFO Configuration.deprecation: mapred.system.dir is deprecated. Instead, use mapreduce.jobtracker.system.dir
16/06/06 06:04:18 INFO Configuration.deprecation: mapred.local.dir is deprecated. Instead, use mapreduce.cluster.local.dir
Job running in-process (local Hadoop)
Hadoop job information for null: number of mappers: 1; number of reducers: 0
2016-06-06 06:04:29,775 null map = 0%,  reduce = 0%
2016-06-06 06:04:38,832 null map = 100%,  reduce = 0%, Cumulative CPU 1.01 sec
2016-06-06 06:04:39,896 null map = 100%,  reduce = 0%, Cumulative CPU 1.01 sec
2016-06-06 06:04:40,995 null map = 100%,  reduce = 0%, Cumulative CPU 1.01 sec
MapReduce Total cumulative CPU time: 1 seconds 10 msec
Ended Job = job_1465200327080_0030
Execution completed successfully
Mapred Local Task Succeeded . Convert the Join into MapJoin
OK
A	1	6
C	5	4
C	3	4
Time taken: 32.387 seconds
hive> 

To sum up briefly, mapjoin is useful in these cases:
1 One of the tables in the join is very small
2 Non-equi join operations
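
The second case can be sketched as follows. This is an illustration, not something run in the session above, and it rests on several assumptions: that the Hive version in use does not accept non-equality conditions in the on clause, that cartesian products are allowed (non-strict mode), and that the hive.ignore.mapjoin.hint setting exists in that version (newer versions ignore the hint unless it is set to false). The cast to int is also my addition, because col2 and col3 are declared as string.

```sql
-- Honour the mapjoin hint instead of ignoring it
set hive.ignore.mapjoin.hint=false;

-- No ON clause: each row of m is paired with every row of n held in memory,
-- and the non-equality condition is then applied as a filter
select /*+ mapjoin(n) */ m.col, m.col2, n.col3
from m
join n
where cast(m.col2 as int) < cast(n.col3 as int);
```

Because n is held in memory on every mapper, this kind of pairing is only reasonable when n is very small.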

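OK, I'm a bit tired, so let's stop here for today. If you are reading this and want to learn more or get in touch with me, follow my WeChat official account, name: 谢华东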
