Optimizing Joins running on HDInsight Hive on Azure at GFS

Introduction

To analyze hardware utilization within their data centers, Microsoft’s Online Services Division – Global Foundation Services (GFS) is working with Hadoop / Hive via HDInsight on Azure.  A common scenario is to perform joins between the various tables of data.  This quick blog post provides a little context on how we managed to take a query from >2h to <10min, and the thinking behind it.

Background

The join is a three-column join between a large fact table (~1.2B rows/day) and a smaller dimension table (~300K rows).  The size of a single day of compressed source files is ~4.2GB; decompressed, it is ~120GB.  When performing a regular join (in Hive parlance, a “common join”), the query generated ~230GB of intermediate files.  On a 4-node HDInsight on Azure cluster, using a 1/6th sample of the large table for a single day of data, the query took 2h 24min.

SELECT
colA, colB, … , colN
FROM FactTable f
LEFT OUTER JOIN DimensionTable d
ON d.colC = f.colC
AND d.colD = f.colD
AND d.colE = f.colE
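
For context, these were external Hive tables declared over the gzip-compressed source files.  A minimal sketch of what such a declaration might look like (the column list, delimiter, and storage path are illustrative, not the actual GFS schema):

CREATE EXTERNAL TABLE FactTable (
  colA STRING,
  colB STRING,
  -- ... remaining columns ...
  colC STRING,
  colD STRING,
  colE STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
-- gzip-compressed text files; Hive decompresses them transparently on read
LOCATION 'asv://container@account/facttable/';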

Join Categories

Our options for improving join performance are outlined in Join Strategies in Hive (see References):

- Common Join: the standard Hive join.  This was our baseline: 2h 24min on 1/6 of the full dataset.
- Map Join: designed for joins between a large table and one small table, where the small table can be loaded into memory.  Map joins should work perfectly for this scenario.
- Bucket Map Join: great for joining large tables together, where you create buckets for the tables so the joins occur bucket-to-bucket.  Not optimal for this situation since we had created external Hive tables against the data (we wanted to avoid the additional step / processing time needed to create bucketed tables; a sketch of that extra step follows this list).
- Skewed Joins: a hint to tell Hive that the data is skewed so it can optimize the query accordingly.  Reviewing the join column groupings (e.g. colC, colD, colE in the above query), the data was evenly distributed across 38 buckets – so not skewed at all.
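
For comparison, the bucket map join route would have required creating and loading bucketed copies of the tables before querying, roughly along these lines (a sketch only; the bucket count and table name are illustrative):

CREATE TABLE FactTableBucketed (
  colA STRING,
  -- ... remaining columns ...
  colC STRING,
  colD STRING,
  colE STRING
)
CLUSTERED BY (colC, colD, colE) INTO 64 BUCKETS;

-- populate the bucketed copy (an extra pass over ~120GB/day of data)
set hive.enforce.bucketing=true;
INSERT OVERWRITE TABLE FactTableBucketed
SELECT * FROM FactTable;

-- and enable the optimization at query time
set hive.optimize.bucketmapjoin=true;
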
Query Path

Below is the thought process we went through to get the best query performance.

Test Run Duration Mappers Reducers
Base Query* 2:23:59 23 1
Compression* 1:24:38 23 1
Configure Reducer Task Size* 0:21:39 23 30
Full Dataset 2:01:56 134 182
Increase Nodes (4 to 10) 1:10:57 134 182
Map Joins 0:09:58 132 0

* sample data size (1/6 of the full daily dataset)

Test Run   File bytes read (map)   File bytes read (reduce)   File bytes written (map)   File bytes written (reduce)
Base Query* 43,370,646,355 78,930,287,557 67,577,746,322 59,748,935,558
Compression* 1,727,983,197 39,441,385,351 2,695,972,976 20,259,915,184
Configure Reducer Task Size* 3,285,339,403 38,775,855,507 2,677,260,304 19,595,626,728
Full Dataset 106,420,783,433 255,327,019,090 17,460,501,681 128,929,981,208
Increase Nodes (4 to 10) 106,420,795,137 255,327,093,479 17,460,513,463 128,930,072,938
Map Joins 540,664 0 7,212,269 0

Base Query

As noted above, on just 1/6 of the data, the regular (common) join took 2h 24min.

Compressing the Intermediate Files and Output

As noted earlier, analysis showed that ~230GB of intermediate files were being generated.  Compressing the intermediate files (using the set commands below) improved query performance (down to 1:24:38) and reduced both the file bytes read and the file bytes written.

set mapred.compress.map.output=true;
set mapred.map.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;
set hive.exec.compress.intermediate=true;

Note that HDInsight currently supports the Gzip and BZ2 codecs – we chose the Gzip codec to match the gzip-compressed source files.

Configure Reducer Task Size

In the previous two queries, it was apparent that only one reducer was in operation, and increasing the number of reducers (up to a point) should improve query performance as well.  Adding the reducer-size configuration below brought the query down to 0:21:39.

set hive.exec.reducers.bytes.per.reducer=25000000;
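
For reference, Hive sizes the reduce phase with roughly the heuristic below (the input size used is Hive’s own estimate of the data feeding the reducers, so treat the numbers as approximate):

-- reducers = min( ceil(estimated_input_bytes / hive.exec.reducers.bytes.per.reducer),
--                 hive.exec.reducers.max )   -- the cap defaults to 999 in this generation of Hive
-- e.g. ~4.5GB of estimated reducer input at 25MB per reducer works out to ~180 reducers,
-- which is in line with the 182 reducers we later saw on the full dataset.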

Full Dataset

While this improved performance, once we switched back to the full dataset with the above configuration, it took 134 mappers and 182 reducers to complete the job in 2:01:56.  By increasing the number of nodes from four to ten, the query duration dropped to 1:10:57.

Map Joins

The great thing about map joins is that they were designed for exactly this type of situation – large tables joined to a small table.  The small table can be placed into memory / the distributed cache.  Using the configuration below, we took a query that ran in 1:10:57 down to 00:09:58.  Note that with map joins there are no reducers, because the join can be completed during the map phase with far less data movement.

set hive.auto.convert.join=true;
set hive.mapjoin.smalltable.filesize=50000000;

An important note: do not forget the hive.mapjoin.smalltable.filesize setting.  By default it is 25MB, and in this case the smaller table was 43MB.  Because I had forgotten to raise it to 50MB, all of my original map join tests had silently reverted back to common joins.
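
If you prefer to be explicit rather than rely on auto-conversion, Hive also accepts a MAPJOIN hint naming the table to load into memory; note that with a LEFT OUTER JOIN only the right-hand table (the dimension table here) can be the in-memory side.  A sketch against the earlier query (column list elided as before):

SELECT /*+ MAPJOIN(d) */ colA, colB, … , colN
FROM FactTable f
LEFT OUTER JOIN DimensionTable d
ON d.colC = f.colC AND d.colD = f.colD AND d.colE = f.colE;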

Verifying Map Joins are Happening

There are a few ways to verify that map joins are actually happening (vs. common joins):

1. With a map join there are no reducers, because the join is done at the map level
2. The command line reports that a map join is being done as the smaller table is pushed into memory (note the “Dump the hashtable” line)
3. And right at the end, there is a call-out that the join is being converted into a MapJoin

Below is the command line output of a map join:

2013-04-26 10:52:41   Starting to launch local task to process map join;  maximum memory = 932118528
2013-04-26 10:52:45   Processing rows: 200000   Hashtable size: 199999   Memory usage: 145227488   rate: 0.156
2013-04-26 10:52:47   Processing rows: 300000   Hashtable size: 299999   Memory usage: 183032536   rate: 0.196
2013-04-26 10:52:49   Processing rows: 330936   Hashtable size: 330936   Memory usage: 149795152   rate: 0.161
2013-04-26 10:52:49   Dump the hashtable into file: file:/tmp/msgbigdata/hive_2013-04-26_22-52-34_959_3143934780177488621/-local-10002/HashTable-Stage-4/MapJoin-mapfile01--.hashtable
2013-04-26 10:52:56   Upload 1 File to: file:/tmp/msgbigdata/hive_2013-04-26_22-52-34_959_3143934780177488621/-local-10002/HashTable-Stage-4/MapJoin-mapfile01--.hashtable  File size: 39687547
2013-04-26 10:52:56   End of local task; Time Taken: 14.203 sec.
Execution completed successfully
Mapred Local Task Succeeded . Convert the Join into MapJoin
Mapred Local Task Succeeded . Convert the Join into MapJoin
Launching Job 2 out of 2
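
Another quick check is to EXPLAIN the query and look for a “Map Join Operator” (preceded by a local stage that builds the hash table) rather than a plain “Join Operator” with a reduce stage.  A sketch using the earlier query:

EXPLAIN
SELECT colA, colB, … , colN
FROM FactTable f
LEFT OUTER JOIN DimensionTable d
ON d.colC = f.colC AND d.colD = f.colD AND d.colE = f.colE;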

Discussion

By compressing the intermediate / map output files and configuring the map join correctly (and adding some extra nodes), we were able to take a join query that originally took >2h to complete and get it under 10min.  For this particular situation map joins were perfect, but it will be important for you to analyze your own data first to see whether it is skewed, whether the smaller table fits in memory, etc.
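
For reference, the complete set of settings used for the final run: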

set mapred.compress.map.output=true;
set mapred.map.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;
set hive.exec.compress.intermediate=true;
set hive.auto.convert.join=true;
set hive.mapjoin.smalltable.filesize=50000000;

References

Other great references on Hive Map Joins include:

- Join Strategies in Hive: https://cwiki.apache.org/Hive/presentations.data/Hive%20Summit%202011-join.pdf

- Join Optimization in Hive: http://www.slideshare.net/aiolos127/join-optimization-in-hive

- Hadoop’s Map Side Join Implements Hash Join: http://stackoverflow.com/questions/2823303/hadoops-map-side-join-implements-hash-join

- Apache Hive Language Manual > Joins: https://cwiki.apache.org/Hive/languagemanual-joins.html

Ref:  http://dennyglee.com/2013/04/26/optimizing-joins-running-on-hdinsight-hive-on-azure-at-gfs/
