Some Hive optimizations

Environment: Hive 1.2.1 + Hadoop 2.6.0

I. Map join optimization

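Principle: when one table in an inner join is small, or when the right-hand table of a left join is small, Hive automatically converts the MapReduce job into a map-only job, so the join runs on the map side instead of the reduce side. The small table is first converted into a hashtable on the local machine where the task is launched and then uploaded to the cluster. Every subsequent map task holds a full copy of the small table and joins against it directly, which skips the shuffle phase. This helps with some data-skew cases and improves overall efficiency.

In testing, serializing the small table into a Java hashtable needed roughly 10 times the small table's data size in memory (single-column test, each row an int).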

Map join log

Execution log at: /tmp/test/test_20150924105256_51743552-6005-4630-b054-96bb1d004b02.log

2015-09-24 10:53:00     Starting to launch local task to process map join;     maximum memory = 932184064
2015-09-24 10:53:03     Processing rows:     200000     Hashtable size:     199999     Memory usage:     156235088     percentage:     0.168
2015-09-24 10:53:03     Dump the side-table for tag: 1 with group count: 244027 into file: file:/tmp/test/369e0363-c6a9-4860-84a9-ea65525a1981/hive_2015-09-24_10-52-56_295_3688765404209462896-1/-local-10004/HashTable-Stage-4/MapJoin-mapfile01--.hashtable
2015-09-24 10:53:06     Uploaded 1 File to: file:/tmp/test/369e0363-c6a9-4860-84a9-ea65525a1981/hive_2015-09-24_10-52-56_295_3688765404209462896-1/-local-10004/HashTable-Stage-4/MapJoin-mapfile01--.hashtable (27460551 bytes)
2015-09-24 10:53:06     End of local task; Time Taken: 5.932 sec.

Execution completed successfully
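
The percentage column in the log above is the local task's memory usage divided by its maximum memory (156235088 / 932184064 ≈ 0.168). The property below is not mentioned in the original test; it is shown only as a sketch of the limit that this ratio is checked against, using what is, as far as I know, the documented default.

```sql
-- Fraction of the local task's maximum memory that the in-memory hashtable
-- may use before the local task is aborted. 0.90 is the documented default;
-- it is restated here for illustration, not changed.
set hive.mapjoin.localtask.max.memory.usage=0.90;
```

With the roughly 10x memory overhead measured above, a small table that passes the file-size threshold can still exceed this limit, so the two values are worth considering together.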

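1. Applies when one table of an inner join is small, or for a left join

set hive.auto.convert.join=true;

set hive.mapjoin.smalltable.filesize=100000000; (the Hive 1.2.1 default is 25 MB; raised here to 100 MB)

Example statement this applies to: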

select a.dvc_id

from tds_did_user_targ_mon a left outer join maptable b

on a.dvc_id=b.dvc_id;
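
One way to check whether these settings actually trigger the conversion is to run the example through EXPLAIN. This is a minimal sketch, assuming maptable is below the hive.mapjoin.smalltable.filesize threshold; the table names are the ones from the example, and the exact plan text varies between Hive versions.

```sql
-- Enable automatic conversion of common joins to map joins.
set hive.auto.convert.join=true;
-- Size threshold in bytes below which a table counts as small (100 MB here).
set hive.mapjoin.smalltable.filesize=100000000;

-- Print the plan instead of running the query. If the conversion applies,
-- the plan should contain a local-work stage that builds the hashtable and
-- a "Map Join Operator" in place of a reduce-side "Join Operator".
explain
select a.dvc_id
from tds_did_user_targ_mon a left outer join maptable b
on a.dvc_id=b.dvc_id;
```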

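2. Applies only to inner joins in which every table other than the first is small; the join is converted automatically

set hive.auto.convert.join=true;

set hive.auto.convert.join.noconditionaltask.size=60000000; (the Hive 1.2.1 default is 10 MB; raised here to 60 MB)

set hive.auto.convert.join.noconditionaltask=true;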
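
A sketch of a query this second group of settings is aimed at. The table maptable2 and its dvc_id column are hypothetical, added only to show a join with more than one small table. According to the Hive configuration documentation, the join is converted directly to a map join, with no conditional task, when the combined size of all tables except one is below hive.auto.convert.join.noconditionaltask.size.

```sql
set hive.auto.convert.join=true;
set hive.auto.convert.join.noconditionaltask=true;
-- Threshold in bytes for the combined size of the small tables (60 MB here).
set hive.auto.convert.join.noconditionaltask.size=60000000;

-- tds_did_user_targ_mon is the large first table; maptable and the
-- hypothetical maptable2 are assumed to fit under the threshold together.
select a.dvc_id
from tds_did_user_targ_mon a
join maptable b on a.dvc_id=b.dvc_id
join maptable2 c on a.dvc_id=c.dvc_id;
```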