黑猴子的家: Hive Table Optimization: Joining a Large Table with a Large Table

Article link: https://blog.csdn.net/qq_28652401/article/details/83509424

1. Filtering null keys

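A join sometimes times out because a few keys carry far too many rows. All rows with the same key are sent to the same reducer, so that reducer runs out of memory. When this happens, examine the offending keys carefully: in many cases the rows behind them are bad data and should be filtered out in the SQL statement. A common case is a key column that is null, which is handled as follows.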

Hands-on example

(1) Configure the history server

Edit mapred-site.xml

<property>
    <name>mapreduce.jobhistory.address</name>
    <value>hadoop102:10020</value>
</property>

<property>
    <name>mapreduce.jobhistory.webapp.address</name>
    <value>hadoop102:19888</value>
</property>

Start the history server

[victor@hadoop102 hadoop]$ sbin/mr-jobhistory-daemon.sh start historyserver

View the job history
http://192.168.1.102:19888/jobhistory

(2) Create the original data table, the null-id table, and the table that holds the joined result

create table ori(id bigint, time bigint, uid string, keyword string, url_rank int, 
click_num int, click_url string) 
row format delimited fields terminated by '\t';

create table nullidtable(id bigint, time bigint, uid string, keyword string,
 url_rank int, click_num int, click_url string) 
row format delimited fields terminated by '\t';

create table jointable(id bigint, time bigint, uid string, keyword string,
 url_rank int, click_num int, click_url string) 
row format delimited fields terminated by '\t';

(3) Load the original data and the null-id data into their respective tables

hive (default)> load data local inpath '/opt/module/datas/ori' 
into table ori;

hive (default)> load data local inpath '/opt/module/datas/nullid' 
into table nullidtable;

(4) Test without filtering null ids

hive (default)> insert overwrite table jointable
                select n.* from nullidtable n left join ori o on n.id = o.id;

Time taken: 42.038 seconds

(5) Test with null ids filtered out

hive (default)> insert overwrite table jointable
                select n.* from (select * from nullidtable where id is not null ) n 
                left join ori o on n.id = o.id;

Time taken: 31.725 seconds
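
Why filtering helps can be sketched with a toy model in Python. This is only an illustration under stated assumptions, not how Hive is implemented: the partitioner `reducer_for`, the reducer count, and the row counts are all made up for the sketch, and null keys are assumed to be treated as one single key.

```python
from collections import Counter

NUM_REDUCERS = 5  # hypothetical reducer count for this sketch


def reducer_for(key, num_reducers=NUM_REDUCERS):
    """Toy partitioner: rows with the same join key go to the same reducer."""
    # Assumption: all null keys count as one key, so they pile up on reducer 0.
    if key is None:
        return 0
    return key % num_reducers


def reducer_load(keys, num_reducers=NUM_REDUCERS):
    """Count how many rows each reducer would receive."""
    load = Counter(reducer_for(k, num_reducers) for k in keys)
    return [load.get(r, 0) for r in range(num_reducers)]


# Made-up data: 1,000 rows with valid ids and 9,000 rows whose id is null.
ids = list(range(1, 1001)) + [None] * 9000

unfiltered = reducer_load(ids)
# Equivalent in spirit to "where id is not null" in the subquery.
filtered = reducer_load([k for k in ids if k is not None])

print("unfiltered:", unfiltered)
print("filtered:  ", filtered)
```

In this model one reducer receives every null-key row until the filter is applied, after which the load is even. Note that the filter also removes the null-id rows from the result, so it is only appropriate when those rows really are bad data.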

2. Converting null keys

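Sometimes a null key accounts for many rows, but those rows are not bad data and must appear in the join result. In that case, assign a random value to the key column wherever it is null (in table nullidtable in the example), so that these rows are distributed randomly and evenly across different reducers. For example:
Hands-on example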

Without randomly distributing the null values

(1) Set the number of reducers to 5

hive (default)> set mapreduce.job.reduces = 5;

(2) Join the two tables

hive (default)> insert overwrite table jointable
                select n.* from nullidtable n 
                left join ori b on n.id = b.id;

Result: the data is visibly skewed; some reducers consume far more resources than the others.

With the null values randomly distributed

(1) Set the number of reducers to 5

hive (default)> set mapreduce.job.reduces = 5;

(2) Join the two tables

hive (default)> insert overwrite table jointable
                select n.* from nullidtable n full join ori o on 
                case when n.id is null then concat('hive', rand()) 
                else n.id end = o.id;

Result: the data skew is gone, and resource consumption is balanced across the reducers.
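
The effect of the random key can be sketched with a small Python simulation. It is a toy model under stated assumptions, not Hive's actual partitioner: `salted_key` mirrors the CASE expression in the query, while `reducer_for` (a CRC32 hash modulo the reducer count), the fixed random seed, and the row counts are hypothetical choices made for the illustration.

```python
import random
import zlib
from collections import Counter

NUM_REDUCERS = 5  # same value as mapreduce.job.reduces in the example


def salted_key(row_id, rng):
    """Mirror the CASE expression: null ids get a random 'hive<rand>' key."""
    if row_id is None:
        return "hive" + str(rng.random())
    return str(row_id)


def reducer_for(key, num_reducers=NUM_REDUCERS):
    """Toy partitioner (assumption): stable hash of the key modulo reducers."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def reducer_load(keys):
    """Count how many rows each reducer would receive."""
    load = Counter(reducer_for(k) for k in keys)
    return [load.get(r, 0) for r in range(NUM_REDUCERS)]


rng = random.Random(0)  # fixed seed so the sketch is repeatable
# Made-up data: 1,000 rows with valid ids and 9,000 rows whose id is null.
ids = list(range(1, 1001)) + [None] * 9000

# Without salting, every null row shares one key and so one reducer.
plain = reducer_load(["null" if i is None else str(i) for i in ids])
# With salting, each null row gets its own random key.
salted = reducer_load([salted_key(i, rng) for i in ids])

print("plain: ", plain)
print("salted:", salted)
```

Because a key such as 'hive0.42' never equals a real id, the salted rows still find no match in the other table, just as the null rows did; only the reducer they are sent to changes.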