Optimizing Joins running on HDInsight Hive on Azure at GFS

Introduction

To analyze hardware utilization within their data centers, Microsoft’s Online Services Division – Global Foundation Services (GFS) is working with Hadoop / Hive via HDInsight on Azure.  A common scenario is to perform joins between the various tables of data.  This quick blog post provides a little context on how we managed to take a query from >2h to <10min, and the thinking behind it.

Background

The join is a three-column join between a large fact table (~1.2B rows/day) and a smaller dimension table (~300K rows).  The size of a single day of compressed source files is ~4.2GB; decompressed, it is ~120GB.  When performing a regular join (in Hive parlance, a “common join”), the query generated ~230GB of intermediate files.  On a 4-node HDInsight on Azure cluster, using a 1/6th sample of the large table for a single day of data, the query took 2h 24min.

SELECT
colA, colB, … , colN
FROM FactTable f
LEFT OUTER JOIN DimensionTable d
ON d.colC = f.colC
AND d.colD = f.colD
AND d.colE = f.colE
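
For context, these were external Hive tables declared over the gzip-compressed source files.  A minimal sketch of what such a declaration might look like (the column list, delimiter, and storage path are illustrative, not the actual GFS schema):

CREATE EXTERNAL TABLE FactTable (
  colA STRING,
  colB STRING,
  -- ... remaining columns ...
  colC STRING,
  colD STRING,
  colE STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
-- gzip-compressed text files; Hive decompresses them transparently on read
LOCATION 'asv://container@account/facttable/';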

Join Categories

Our options for improving join performance are outlined in Join Strategies in Hive (see References):

- Common Join: the standard Hive join.  This was our baseline: 2h 24min on 1/6 of the full dataset.
- Map Join: designed for joins between a large table and one small table, where the small table can be loaded into memory.  Map joins should work perfectly for this scenario.
- Bucket Map Join: great for joining large tables together, where you create buckets for the tables so the joins occur bucket-to-bucket.  Not optimal for this situation since we had created external Hive tables against the data (we wanted to avoid the additional step / processing time needed to create bucketed tables; a sketch of that extra step follows this list).
- Skewed Joins: a hint to tell Hive that the data is skewed so it can optimize the query accordingly.  Reviewing the join column groupings (e.g. colC, colD, colE in the above query), the data was evenly distributed across 38 buckets – so not skewed at all.
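
For comparison, the bucket map join route would have required creating and loading bucketed copies of the tables before querying, roughly along these lines (a sketch only; the bucket count and table name are illustrative):

CREATE TABLE FactTableBucketed (
  colA STRING,
  -- ... remaining columns ...
  colC STRING,
  colD STRING,
  colE STRING
)
CLUSTERED BY (colC, colD, colE) INTO 64 BUCKETS;

-- populate the bucketed copy (an extra pass over ~120GB/day of data)
set hive.enforce.bucketing=true;
INSERT OVERWRITE TABLE FactTableBucketed
SELECT * FROM FactTable;

-- and enable the optimization at query time
set hive.optimize.bucketmapjoin=true;
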
Query Path

Below is the thought process we went through to get the best query performance.

Test Run Duration Mappers Reducers
Base Query* 2:23:59 23 1
Compression* 1:24:38 23 1
Configure Reducer Task Size* 0:21:39 23 30
Full Dataset 2:01:56 134 182
Increase Nodes (4 to 10) 1:10:57 134 182
Map Joins 0:09:58 132 0

* sample data size (1/6 of the full daily dataset)

Test Run   File bytes read (map)   File bytes read (reduce)   File bytes written (map)   File bytes written (reduce)
Base Query* 43,370,646,355 78,930,287,557 67,577,746,322 59,748,935,558
Compression* 1,727,983,197 39,441,385,351 2,695,972,976 20,259,915,184
Configure Reducer Task Size* 3,285,339,403 38,775,855,507 2,677,260,304 19,595,626,728
Full Dataset 106,420,783,433 255,327,019,090 17,460,501,681 128,929,981,208
Increase Nodes (4 to 10) 106,420,795,137 255,327,093,479 17,460,513,463 128,930,072,938
Map Joins 540,664 0 7,212,269 0

Base Query

As noted above, on just 1/6 of the data, the regular (common) join took 2h 24min.

Compressing the Intermediate Files and Output

As noted earlier, analysis showed that ~230GB of intermediate files were being generated.  Compressing the intermediate files (using the set commands below) improved query performance (down to 1:24:38) and reduced both the file bytes read and the file bytes written.

set mapred.compress.map.output=true;
set mapred.map.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;
set hive.exec.compress.intermediate=true;

Note that HDInsight currently supports the Gzip and BZ2 codecs – we chose the Gzip codec to match the gzip-compressed source files.

Configure Reducer Task Size

In the previous two queries, it was apparent that only one reducer was in operation, and increasing the number of reducers (up to a point) should improve query performance as well.  Adding the reducer-size configuration below brought the query down to 0:21:39.

set hive.exec.reducers.bytes.per.reducer=25000000;
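
For reference, Hive sizes the reduce phase with roughly the heuristic below (the input size used is Hive’s own estimate of the data feeding the reducers, so treat the numbers as approximate):

-- reducers = min( ceil(estimated_input_bytes / hive.exec.reducers.bytes.per.reducer),
--                 hive.exec.reducers.max )   -- the cap defaults to 999 in this generation of Hive
-- e.g. ~4.5GB of estimated reducer input at 25MB per reducer works out to ~180 reducers,
-- which is in line with the 182 reducers we later saw on the full dataset.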

Full Dataset

While this improved performance, once we switched back to the full dataset with the above configuration, it took 134 mappers and 182 reducers to complete the job in 2:01:56.  By increasing the number of nodes from four to ten, the query duration dropped to 1:10:57.

Map Joins

The great thing about map joins is that they were designed for exactly this type of situation – large tables joined to a small table.  The small table can be placed into memory / the distributed cache.  Using the configuration below, we took a query that ran in 1:10:57 down to 00:09:58.  Note that with map joins there are no reducers, because the join can be completed during the map phase with far less data movement.

set hive.auto.convert.join=true;
set hive.mapjoin.smalltable.filesize=50000000;

An important note: do not forget the hive.mapjoin.smalltable.filesize setting.  By default it is 25MB, and in this case the smaller table was 43MB.  Because I had forgotten to raise it to 50MB, all of my original map join tests had silently reverted back to common joins.
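
If you prefer to be explicit rather than rely on auto-conversion, Hive also accepts a MAPJOIN hint naming the table to load into memory; note that with a LEFT OUTER JOIN only the right-hand table (the dimension table here) can be the in-memory side.  A sketch against the earlier query (column list elided as before):

SELECT /*+ MAPJOIN(d) */ colA, colB, … , colN
FROM FactTable f
LEFT OUTER JOIN DimensionTable d
ON d.colC = f.colC AND d.colD = f.colD AND d.colE = f.colE;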

Verifying Map Joins are Happening

There are a few ways to verify that map joins are actually happening (vs. common joins):

1. With a map join there are no reducers, because the join is done at the map level
2. The command line reports that a map join is being done as the smaller table is pushed into memory (note the “Dump the hashtable” line)
3. And right at the end, there is a call-out that the join is being converted into a MapJoin

Below is the command line output of a map join:

2013-04-26 10:52:41   Starting to launch local task to process map join;  maximum memory = 932118528
2013-04-26 10:52:45   Processing rows: 200000   Hashtable size: 199999   Memory usage: 145227488   rate: 0.156
2013-04-26 10:52:47   Processing rows: 300000   Hashtable size: 299999   Memory usage: 183032536   rate: 0.196
2013-04-26 10:52:49   Processing rows: 330936   Hashtable size: 330936   Memory usage: 149795152   rate: 0.161
2013-04-26 10:52:49   Dump the hashtable into file: file:/tmp/msgbigdata/hive_2013-04-26_22-52-34_959_3143934780177488621/-local-10002/HashTable-Stage-4/MapJoin-mapfile01--.hashtable
2013-04-26 10:52:56   Upload 1 File to: file:/tmp/msgbigdata/hive_2013-04-26_22-52-34_959_3143934780177488621/-local-10002/HashTable-Stage-4/MapJoin-mapfile01--.hashtable  File size: 39687547
2013-04-26 10:52:56   End of local task; Time Taken: 14.203 sec.
Execution completed successfully
Mapred Local Task Succeeded . Convert the Join into MapJoin
Mapred Local Task Succeeded . Convert the Join into MapJoin
Launching Job 2 out of 2
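
Another quick check is to EXPLAIN the query and look for a “Map Join Operator” (preceded by a local stage that builds the hash table) rather than a plain “Join Operator” with a reduce stage.  A sketch using the earlier query:

EXPLAIN
SELECT colA, colB, … , colN
FROM FactTable f
LEFT OUTER JOIN DimensionTable d
ON d.colC = f.colC AND d.colD = f.colD AND d.colE = f.colE;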

Discussion

By compressing the intermediate / map output files and configuring the map join correctly (and adding some extra nodes), we were able to take a join query that originally took >2h to complete and get it under 10min.  For this particular situation map joins were perfect, but it will be important for you to analyze your own data first to see whether it is skewed, whether the smaller table fits in memory, etc.
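
For reference, the complete set of settings used for the final run: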

set mapred.compress.map.output=true;
set mapred.map.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;
set hive.exec.compress.intermediate=true;
set hive.auto.convert.join=true;
set hive.mapjoin.smalltable.filesize=50000000;

References

Other great references on Hive Map Joins include:

- Join Strategies in Hive: https://cwiki.apache.org/Hive/presentations.data/Hive%20Summit%202011-join.pdf

- Join Optimization in Hive: http://www.slideshare.net/aiolos127/join-optimization-in-hive

- Hadoop’s Map Side Join Implements Hash Join: http://stackoverflow.com/questions/2823303/hadoops-map-side-join-implements-hash-join

- Apache Hive Language Manual > Joins: https://cwiki.apache.org/Hive/languagemanual-joins.html

Ref:  http://dennyglee.com/2013/04/26/optimizing-joins-running-on-hdinsight-hive-on-azure-at-gfs/
