Hive: mapJoin and skewJoin

Latest recommended article published on 2023-10-08 10:09:19

cclovezbf


Category: hive. Tags: hive, hadoop, spark

Article link: https://blog.csdn.net/cclovezbf/article/details/125100797


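An earlier article, "Speeding up DataX hdfsReader" (https://blog.csdn.net/cclovezbf/article/details/124492076), noted that hdfsReader splits work by file count: a table with 10 files becomes 10 tasks, so reading from HDFS or Hive can stay slow until the table has more files. The usual advice is to raise the number of Hive reducers, give each reducer less data, or use distribute by. None of that is wrong, but whether it helps depends on the engine (MR, Spark or Tez) and on whether file merging is enabled. That article used the following settings: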

set hive.merge.size.per.task=67108864;
set hive.merge.sparkfiles=true;

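Together with distribute by rand()*10 in the SQL,

these settings split what was originally a single 450 MB file into several smaller files.

There is a catch, though. The previous article described a job with 275 tasks of which only 10 were actually running; it wrote 10 files, which were then merged into the final 5. The key points are the 275 tasks and the merge.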

——————————————————————————————————————————

Now let's look at a different question.

set spark.driver.memory=1G;
set spark.executor.cores=2;
set spark.executor.instances=2;
set spark.executor.memory=4G;

Test 1

insert overwrite table test.`result`
select s1.id, s2.score, s2.name
from test.student s1
join test.student s2
on s1.score = s2.score
distribute by s1.name;

How many files will this insert produce, and which parameters does that depend on?

Result:

[root@worker01 /data/cc_test/hive]# hdfs dfs -du -h /user/hive/warehouse/test.db/result
264.9 M 529.8 M /user/hive/warehouse/test.db/result/000000_0

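This raises some questions.

Shouldn't a join involve a shuffle? Why is shuffle write 0 here?

Why are there only two map tasks and no reduce stage?

Test 2

insert overwrite table test.`result`
select s1.id, s2.score, s2.name
from test.student s1
join test.student s2
on s1.score = s2.score
distribute by s2.name; -- name has only 10 distinct values

Surprisingly, a reduce stage now appears, yet there is still only one output file.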

[root@worker01 /data/cc_test/hive]# hdfs dfs -ls /user/hive/warehouse/test.db/result
Found 1 items
-rwxrwx--x+ 2 hive hive 277779784 2022-06-02 15:08 /user/hive/warehouse/test.db/result/000000_0

Test 3

mapred.reduce.tasks

Default Value: -1
Added In: Hive 0.1.0

The default number of reduce tasks per job. Typically set to a prime close to the number of available hosts. Ignored when mapred.job.tracker is "local". Hadoop set this to 1 by default, whereas Hive uses -1 as its default value. By setting this property to -1, Hive will automatically figure out what should be the number of reducers.

hive.exec.reducers.bytes.per.reducer

Default Value: 1,000,000,000 prior to Hive 0.14.0; 256 MB (256,000,000) in Hive 0.14.0 and later
Added In: Hive 0.2.0; default changed in 0.14.0 with HIVE-7158 (and HIVE-7917)

Size per reducer. The default in Hive 0.14.0 and earlier is 1 GB, that is, if the input size is 10 GB then 10 reducers will be used. In Hive 0.14.0 and later the default is 256 MB, that is, if the input size is 1 GB then 4 reducers will be used.
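The sizing rule quoted above can be sketched as a small calculation. This is only a rough model, not Hive's actual code: the function name is hypothetical, and the cap of 1009 is an assumption modelled on hive.exec.reducers.max. Engines such as Hive on Spark may adjust the estimate further.

```python
import math

def estimate_reducers(input_bytes, bytes_per_reducer=256_000_000, max_reducers=1009):
    """Rough estimate: one reducer per bytes_per_reducer of input, at least 1, capped.

    max_reducers is an assumed cap (modelled on hive.exec.reducers.max).
    """
    wanted = math.ceil(input_bytes / bytes_per_reducer)
    return max(1, min(max_reducers, wanted))

# The two examples from the documentation text above.
print(estimate_reducers(10 * 10**9, bytes_per_reducer=1_000_000_000))  # 10
print(estimate_reducers(1 * 10**9))                                    # 4
# A tiny input still gets one reducer.
print(estimate_reducers(23_000))                                       # 1
```

Under this model a very small input ends up with a single reducer unless mapred.reduce.tasks is set explicitly, which would be consistent with Test 2 producing one file.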

set mapred.reduce.tasks=3;

insert overwrite table test.`result`
select s1.id, s2.score, s2.name
from test.student s1
join test.student s2
on s1.score = s2.score;

The job still has only a map stage.

Test 4

set mapred.reduce.tasks=3;

insert overwrite table test.`result`
select s1.id, s2.score, s2.name
from test.student s1
join test.student s2
on s1.score = s2.score
distribute by s2.name;

Finally, it works.

There are 3 reducers and 3 output files, so the parameter took effect.

[root@worker01 /data/cc_test/hive]# hdfs dfs -ls /user/hive/warehouse/test.db/result
Found 3 items
-rwxrwx--x+ 2 hive hive 275495946 2022-06-02 15:19 /user/hive/warehouse/test.db/result/000000_0
-rwxrwx--x+ 2 hive hive 1319496 2022-06-02 15:18 /user/hive/warehouse/test.db/result/000001_0
-rwxrwx--x+ 2 hive hive 964342 2022-06-02 15:18 /user/hive/warehouse/test.db/result/000002_0

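A screenshot of the job is attached as well.

It raises these questions:

1. Why did node27 spend so long in GC?

2. Why does one of the 3 tasks run far longer than the others, and can that be avoided?

Working backwards from the four tests above to the underlying behaviour, in summary:

1. In these tests, a plain insert overwrite table c select ... from a join b produces no reduce stage at all.

2. distribute by does produce a reduce stage, but the number of reducers depends on the data size or on the reducer count you set.

Point 1 is still puzzling: a join with no shuffle? Surely not, yet that is what the test shows. Then one parameter came to mind: map join. The join should have been a shuffle join, but because it ran as a map join, the shuffle disappeared.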
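Why a map join needs no shuffle can be illustrated with a toy in-memory version. This is a conceptual sketch only, not how Hive implements it; the function name and the sample rows are made up for illustration.

```python
from collections import defaultdict

def map_join(big_rows, small_rows, key):
    """Join inside each 'map task': hash the small side once, stream the big side past it."""
    lookup = defaultdict(list)
    for row in small_rows:        # the small table is held in memory by every mapper
        lookup[row[key]].append(row)
    for row in big_rows:          # the big table is streamed; no rows move between tasks
        for match in lookup.get(row[key], []):
            yield row, match

# Hypothetical miniature of the student self-join on score.
students = [{"id": 1, "score": 0}, {"id": 2, "score": 0}, {"id": 3, "score": 7}]
pairs = list(map_join(students, students, "score"))
print(len(pairs))  # 5: score 0 gives 2 x 2 pairs, score 7 gives 1
```

Because every mapper already holds the whole small table, matching rows never have to be routed to a common reducer, so the shuffle write stays at 0.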

One more test.

set mapred.reduce.tasks=-1;
set hive.auto.convert.join=false;

insert overwrite table test.`result`
select s1.id, s2.score, s2.name
from test.student s1
join test.student s2
on s1.score = s2.score;

Ha, nothing escapes my sharp eyes: the right-hand side now clearly shows shuffle write and shuffle read.

So now we have one more topic to study.

mapjoin

hive.auto.convert.join -- true means enabled

Default Value: false in 0.7.0 to 0.10.0; true in 0.11.0 and later (HIVE-3297)
Added In: 0.7.0 with HIVE-1642

Whether Hive enables the optimization about converting common join into mapjoin based on the input file size. (Note that hive-default.xml.template incorrectly gives the default as false in Hive 0.11.0 through 0.13.1.)

hive.auto.convert.join.noconditionaltask

Default Value: true
Added In: 0.11.0 with HIVE-3784 (default changed to true with HIVE-4146)

Whether Hive enables the optimization about converting common join into mapjoin based on the input file size. If this parameter is on, and the sum of size for n-1 of the tables/partitions for an n-way join is smaller than the size specified by hive.auto.convert.join.noconditionaltask.size, the join is directly converted to a mapjoin (there is no conditional task).

hive.auto.convert.join.noconditionaltask.size

Default Value: 10000000 -- 10 MB by default; raising it to 20 or 25 MB is fine, but don't go much higher
Added In: 0.11.0 with HIVE-3784

If hive.auto.convert.join.noconditionaltask is off, this parameter does not take effect. However, if it is on, and the sum of size for n-1 of the tables/partitions for an n-way join is smaller than this size, the join is directly converted to a mapjoin (there is no conditional task). The default is 10MB.
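The size rule described above can be written as a small check. This is a sketch of the rule as the documentation states it, not Hive's planner code; the function name and the example sizes are illustrative assumptions.

```python
def converts_to_mapjoin(table_sizes_bytes, threshold=10_000_000):
    """True if the n-1 smallest inputs of an n-way join together fit under the threshold."""
    if len(table_sizes_bytes) < 2:
        return False
    small_side = sorted(table_sizes_bytes)[:-1]   # every input except the largest one
    return sum(small_side) < threshold

# A self-join of a table of roughly 23 KB: far below 10 MB, so it becomes a map join.
print(converts_to_mapjoin([23_000, 23_000]))          # True
# Two large inputs: the smaller one alone exceeds the threshold.
print(converts_to_mapjoin([50_000_000, 30_000_000]))  # False
```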

hive.auto.convert.join.use.nonstaged

Default Value: false
Added In: 0.13.0 with HIVE-6144 (default originally true, but changed to false with HIVE-6749 also in 0.13.0)

For conditional joins, if input stream from a small alias can be directly applied to the join operator without filtering or projection, the alias need not be pre-staged in the distributed cache via a mapred local task. Currently, this is not working with vectorization or Tez execution engine. -- setting this one aside for now; note the mention of vectorization

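Let's recap. What problems did we run into?

The student table itself is only 23.0 K, yet the self-join produced 264.9 M of output.

Because of how I deliberately designed the key, the join is skewed. In this test the shuffle join turned out to be faster than the map join.

How do we fix the skew? How do we get evenly sized output files? And how do the reducer count, Spark cores and distribute by relate to each other?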
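The jump from 23.0 K to 264.9 M follows from how an equi-join multiplies rows: a key that occurs c times on each side contributes c × c output rows. The sketch below assumes a hypothetical table of 5,000 rows in which 4,994 share score 0 and the other six scores are unique; the exact distribution in the test table is not given in this post.

```python
from collections import Counter

def self_join_rows(keys):
    """Number of rows an equi self-join produces: sum of count^2 over the keys."""
    counts = Counter(keys)
    return sum(c * c for c in counts.values())

# Assumed distribution: one hot key with 4994 rows, six keys with a single row each.
keys = [0] * 4994 + [1, 2, 3, 4, 5, 6]
print(self_join_rows(keys))  # 24940042
```

Almost all of the output comes from the single hot key, which is also why one task ends up doing nearly all the work.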

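How the reducer count, Spark cores and distribute by relate

Reducers and distribute by: let n be the number of reducers and m the distribute by value of a row.

Each row goes to reducer hash(m) % n. Take distribute by name as an example:

name = [cc0 cc1 cc2 ... cc10], 11 distinct values

set mapred.reduce.tasks=4;

Then hash(cc0) % 4 = 1 sends cc0 to reducer 1, hash(cc1) % 4 = 2 sends cc1 to reducer 2, and so on.

In the end reducer 1 handles [cc0 cc2 cc5], reducer 2 handles [cc1 cc7 cc8],

reducer 3 handles [cc4 cc6 cc9] and reducer 0 handles [cc3 cc10]. The numbers are only an illustration.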
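The hash(m) % n rule can be tried out with a few lines of code. The hash function below is a stand-in chosen for illustration (a 31-based rolling hash over the UTF-8 bytes); Hive's real hash may assign the names differently, so the printed grouping is not expected to match the example above.

```python
def stand_in_hash(text):
    """Illustrative 32-bit rolling hash; an assumption, not necessarily Hive's function."""
    h = 0
    for b in text.encode("utf-8"):
        h = (31 * h + b) & 0xFFFFFFFF
    return h

def reducer_for(value, num_reducers):
    # Clear the sign bit, then take the remainder: the result is always in 0..n-1.
    return (stand_in_hash(value) & 0x7FFFFFFF) % num_reducers

names = [f"cc{i}" for i in range(11)]
assignment = {}
for name in names:
    assignment.setdefault(reducer_for(name, 4), []).append(name)

for reducer in sorted(assignment):
    print(reducer, assignment[reducer])
```

Whatever the hash function, the important property is the same: all rows with the same name land on the same reducer, so 11 names can never fill more than 11 reducers and a hot name overloads exactly one of them.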

Actual test.

set spark.driver.memory=1G;
set spark.executor.cores=2;
set spark.executor.instances=2;
set spark.executor.memory=4G;
set mapred.reduce.tasks=3; -- number of reducers

set hive.auto.convert.join=false; -- disable map join
set spark.dynamicAllocation.enabled=false; -- disable dynamic allocation

insert overwrite local directory '/data/cc_test/hive/'
select s1.id, s2.score, s2.name
from test.student s1
join test.student s2
on s1.score = s2.score
distribute by s2.name;

hdfs dfs -get /user/hive/warehouse/test.db/result/* ./ # pull the data onto the server to inspect it

File 000000_0, which corresponds to reducer 0, contains only cc0, cc9, cc6 and cc3, which matches the guess above.

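So what role do the Spark cores play?

An analogy: a reducer is like a machine, a Spark core is like a worker, and the distribute by values are like the raw materials.

We now have 3 machines (reducers), 4 workers (2 executors × 2 cores) and 10 kinds of material.

Can 4 workers run 3 machines? Yes, all 3 at the same time, but one worker is slacking: in the screenshot one core does no work.

set mapred.reduce.tasks=6;

set spark.executor.cores=1;
set spark.executor.instances=2;

Can 2 workers run 6 machines? Yes. A worker finishes one machine and then moves on to the next, so the 6 reduce tasks are processed two at a time. The workers are just busier, which suits a capitalist's idea of exploitation.

Which of the two setups is better? With plenty of money pick the first; on a tight budget pick the second.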
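The two setups can be compared with a little arithmetic. The sketch assumes each task occupies exactly one core (Spark's default of one CPU per task) and ignores differences in task duration; the function name is made up for illustration.

```python
import math

def schedule(num_tasks, executors, cores_per_executor):
    """Return (task slots, waves needed, idle slots in the last wave)."""
    slots = executors * cores_per_executor
    waves = math.ceil(num_tasks / slots)
    idle_in_last_wave = waves * slots - num_tasks
    return slots, waves, idle_in_last_wave

# 3 reducers on 2 executors x 2 cores: one wave, one core left idle.
print(schedule(3, 2, 2))  # (4, 1, 1)
# 6 reducers on 2 executors x 1 core: three waves, no core idle.
print(schedule(6, 2, 1))  # (2, 3, 0)
```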

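Fixing the data skew

For a join between two heavily repeated keys there are already plenty of solutions online.

1. Add a random number. We know the problem is that there are too many rows with score=0, so we just need to spread them out.

Big table joined to a small table: the big table's key becomes key + rand*10, and each key in the small table is expanded to key1 key2 key3 ... key10, which makes the small table 10 times larger.

The big table keeps the same number of rows, and growing the small table tenfold does not matter much.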
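The salting idea above can be checked on toy data. This is an in-memory sketch with made-up rows and names, not HiveQL; it only shows that the salted join returns the same number of rows as the plain join while spreading the hot key over 10 sub-keys.

```python
import random

SALTS = 10  # number of sub-keys each key is split into

def salted_join(big_rows, small_rows, seed=0):
    rng = random.Random(seed)
    # Small side: replicate every row once per salt value (10x the rows).
    small = {}
    for key, value in small_rows:
        for salt in range(SALTS):
            small.setdefault((key, salt), []).append(value)
    # Big side: attach one random salt per row; the row count is unchanged.
    out = []
    for key, value in big_rows:
        salt = rng.randrange(SALTS)
        for match in small.get((key, salt), []):
            out.append((key, value, match))
    return out

big = [(0, f"b{i}") for i in range(1000)] + [(1, "b_other")]
small = [(0, "s0"), (1, "s1")]
result = salted_join(big, small)
print(len(result))  # 1001, the same as joining on the unsalted key
```

Each big-table row still finds exactly one partner per matching small-table row, because every (key, salt) combination exists on the small side.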

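What about a big table joined to a big table?

That is not really worth studying here. In my case 5,000 × 5,000 = 25 million rows; with 10,000 × 10,000 it would be 100 million.

That is still not much data. Now imagine 1 million × 100,000 = 100 billion rows: does your data really contain that much? Do you really need that much?

This is not what I want to focus on anyway. What I want to study is skewjoin.

set hive.skewjoin.mapjoin.min.split=3355443; -- one tenth of the default
set hive.skewjoin.key=100; -- set low so the effect is easy to see
set hive.optimize.skewjoin=true; -- enable skew join

Related article: hive-skewJoin (https://blog.csdn.net/cclovezbf/article/details/125049631). Skew join is what Hive does for you when the data is skewed; that article looks at how exactly it handles the skew. See also Configuration Properties - Apache Hive: https://cwiki.apache.org/confluence/display/Hive/Configuration+Properties#ConfigurationProperties-hive.optimize.skewjoin.compiletime

You can see that there is now one more job.

Ordinary reduce

Skew join reduce

Note that both produce 6 files, but they are not the same thing. The files produced by the shuffle join are result files,

while the files produced by the skew join are intermediate temporary files.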

22/06/02 17:09:46 INFO persistence.RowContainer: RowContainer created temp file /data/yarn/nm/usercache/hive/appcache/application_1654001063366_0144/container_1654001063366_0144_01_000002/tmp/hive-rowcontainer7641468650612939567/RowContainer1236677213473064017.8d5162ca104fa7e79fe80fd92bb657fb.tmp
22/06/02 17:09:46 INFO persistence.RowContainer: RowContainer copied temp file /data/yarn/nm/usercache/hive/appcache/application_1654001063366_0144/container_1654001063366_0144_01_000002/tmp/hive-rowcontainer7641468650612939567/RowContainer1236677213473064017.8d5162ca104fa7e79fe80fd92bb657fb.tmp to dfs directory hdfs://s2cluster/tmp/hive/hive/abb80ffb-b171-4536-b316-f39da9368c47/hive_2022-06-02_17-09-45_178_6674305773529237288-19/-mr-10003/_tmp.hive_skew_join_bigkeys_0/_tmp.000000_0
22/06/02 17:09:46 INFO persistence.RowContainer: Using md5Str: 8d5162ca104fa7e79fe80fd92bb657fb for keyObject: [0]
22/06/02 17:09:46 INFO common.FileUtils: Local directories not specified; created a tmp file: /data/yarn/nm/usercache/hive/appcache/application_1654001063366_0144/container_1654001063366_0144_01_000002/tmp/hive-rowcontainer1150619221020179884
22/06/02 17:09:46 INFO persistence.RowContainer: RowContainer created temp file /data/yarn/nm/usercache/hive/appcache/application_1654001063366_0144/container_1654001063366_0144_01_000002/tmp/hive-rowcontainer1150619221020179884/RowContainer211622451391464989.8d5162ca104fa7e79fe80fd92bb657fb.tmp
22/06/02 17:09:46 INFO persistence.RowContainer: RowContainer copied temp file /data/yarn/nm/usercache/hive/appcache/application_1654001063366_0144/container_1654001063366_0144_01_000002/tmp/hive-rowcontainer1150619221020179884/RowContainer211622451391464989.8d5162ca104fa7e79fe80fd92bb657fb.tmp to dfs directory hdfs://s2cluster/tmp/hive/hive/abb80ffb-b171-4536-b316-f39da9368c47/hive_2022-06-02_17-09-45_178_6674305773529237288-19/-mr-10003/_tmp.hive_skew_join_smallkeys_0_1/_tmp.000000_0

22/06/02 17:09:47 INFO rdd.HadoopRDD: Input split: Paths:/tmp/hive/hive/abb80ffb-b171-4536-b316-f39da9368c47/hive_2022-06-02_17-09-45_178_6674305773529237288-19/-mr-10003/hive_skew_join_smallkeys_0_1/RowContainer211622451391464989.8d5162ca104fa7e79fe80fd92bb657fb.tmp:0+99967InputFormatClass: org.apache.hadoop.mapred.SequenceFileInputFormat

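See what happens here: the rows of any key that has more than 100 rows are written to one file, and the rows with the same key from the other table are written to a second file. Two map tasks then read these two files. Doesn't the 4994 here look very familiar?

The result for the skewed key is computed afterwards. You can also see that its timing differs from the other tasks: it finishes later.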
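The two-stage behaviour visible in the log can be mimicked in memory. This is a conceptual sketch under simplifying assumptions (the function names and toy rows are invented, and real Hive detects skewed keys while the join is running rather than by counting up front); it only shows that setting the hot key aside and joining it in a follow-up step gives the same result as one big join.

```python
from collections import Counter, defaultdict

def hash_join(left, right):
    lookup = defaultdict(list)
    for k, v in right:
        lookup[k].append(v)
    return [(k, lv, rv) for k, lv in left for rv in lookup.get(k, [])]

def skew_join(left, right, threshold=100):
    counts = Counter(k for k, _ in left)
    skewed = {k for k, c in counts.items() if c > threshold}
    # Stage 1: ordinary join for the well-behaved keys.
    normal = hash_join([r for r in left if r[0] not in skewed],
                       [r for r in right if r[0] not in skewed])
    # Rows of skewed keys are set aside, like the bigkeys / smallkeys temp files.
    big_keys = [r for r in left if r[0] in skewed]
    small_keys = [r for r in right if r[0] in skewed]
    # Stage 2: a follow-up join that handles only the skewed keys.
    late = hash_join(big_keys, small_keys)
    return normal + late, skewed

rows = [(0, i) for i in range(150)] + [(k, k) for k in range(1, 7)]
joined, skewed = skew_join(rows, rows)
print(skewed, len(joined))  # {0} 22506
```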

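Evenly sized output files

This part is fairly simple.

Based on what we have learned: we know the output is roughly 250 M, and we want 10 files of about 25 M each.

So:

set mapred.reduce.tasks=10;

To squeeze more performance out of the machines we would set 10 cores (note that the values below actually give 2 × 2 = 4):

set spark.executor.cores=2;
set spark.executor.instances=2;

For a better distribution, distribute by id, because id increases monotonically from 1 to 10,000.

insert overwrite table test.`result`
select s1.id, s2.score, s2.name
from test.student s1
join test.student s2
on s1.score = s2.score
distribute by s1.id;

Turn off the earlier settings:

set hive.optimize.skewjoin=false;

set hive.auto.convert.join=false;
set spark.dynamicAllocation.enabled=false;

distribute by round(rand()*20) is another option, but it ran into a problem: only 5 files had data and the other 5 were empty. I tried 20, 50, 100 and 1000 and always got 5 files.

From now on, anyone who recommends distribute by round(rand()*20) deserves two good slaps, no fewer! Does round(rand()*20) really work? Who told them so, and where did they read it? Did they try it? Did the author of the article they read try it? Three questions for the soul.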
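One possible explanation for the 5 files, offered as a hypothesis rather than something verified in this post: round(rand()*20) yields whole-number doubles, and if Hive hashes a double the way Java's Double.hashCode does and picks the reducer as (hash & Integer.MAX_VALUE) % n, those hashes are always even. With 10 reducers only the even-numbered ones would then receive data. The sketch below reproduces that arithmetic; both the hash and the reducer formula are assumptions.

```python
import random
import struct

def java_double_hash(x):
    """Java's Double.hashCode: fold the 64 IEEE-754 bits into a signed 32-bit int."""
    bits = struct.unpack(">Q", struct.pack(">d", x))[0]
    h = (bits ^ (bits >> 32)) & 0xFFFFFFFF
    return h - 2**32 if h >= 2**31 else h

def reducer_for(hash_code, num_reducers):
    # Assumed rule: clear the sign bit, then take the remainder.
    return (hash_code & 0x7FFFFFFF) % num_reducers

# round(rand()*20) can only produce the whole numbers 0.0 .. 20.0.
round_buckets = sorted({reducer_for(java_double_hash(float(k)), 10) for k in range(21)})
print(round_buckets)  # [0, 2, 4, 6, 8]

# A non-integer double hashed and reduced with a pmod-style remainder, for comparison.
rng = random.Random(1)
pmod_buckets = {reducer_for(java_double_hash(1000 * rng.random()) % 80, 10)
                for _ in range(1000)}
print(len(pmod_buckets))
```

Under the same assumption, a value like 1000*rand() is almost never a whole number, so its low bits vary and the hash spreads across all reducers.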

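After a lot of searching, here is the correct answer:

insert overwrite table test.`result`
select s1.id, s2.score, s2.name
from test.student s1
join test.student s2
on s1.score = s2.score
distribute by pmod(hash(1000*rand(1)), 80); -- this is the one! I'll look into it further when I have time.

Reference: distribute by rand() - Zhihu, https://zhuanlan.zhihu.com/p/252776975. rand() and rand(int seed) are two random number functions that return a double between 0 and 1; if a seed is given, the sequence of random numbers is stable. On distribute by, the official Hive documentation says: Hive uses the columns in Distri…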