hive sort by,order by ,distribute by,cluster by

最新推荐文章于 2023-03-31 22:12:40 发布

xiewenbo

最新推荐文章于 2023-03-31 22:12:40 发布

阅读量875

点赞数

分类专栏： hive 优化

hive 优化专栏收录该内容

12 篇文章 1 订阅

订阅专栏

ORDER BY and SORT BY

The ORDER BY clause is familiar from other SQLdialects. It performs a total ordering of

thequery result set. This means that all the data is passed through a single reducer,

whichmay take an unacceptably long time to execute for larger data sets.

Hiveadds an alternative, SORT BY, that orders the data only within each reducer, thereby

performinga localordering, where eachreducer’s output will be sorted. Better performance

istraded for total ordering.

Inboth cases, the syntax differs only by the use of the ORDER or SORT keyword. You can

specifyany columns you wish and specify whether or not the columns are ascending

usingthe ASC keyword (thedefault) or descending using the DESC keyword.

Hereis an example using ORDER BY:

SELECT s.ymd, s.symbol, s.price_close

FROM stocks s

ORDER BY s.ymd ASC, s.symbol DESC;

Hereis the same example using SORT BY instead:

SELECT s.ymd, s.symbol, s.price_close

FROM stocks s

SORT BY s.ymd ASC, s.symbol DESC;

Thetwo queries look almost identical, but if more than one reducer is invoked, the

outputwill be sorted differently. While each reducer’s output files will be sorted,the

datawill probably overlap with the output of other reducers.

BecauseORDER BY can result in excessively long runtimes, Hive will require a LIMIT

clausewith ORDER BY if the propertyhive.mapred.mode is set to strict. By default, it is

set tononstrict.

DISTRIBUTE BY with SORT BY

DISTRIBUTE BY controls how map output is divided among reducers. Alldata that flows

througha MapReduce job is organized into key-value pairs. Hive must use this feature

internallywhen it converts your queries to MapReduce jobs.

Usually,you won’t need to worry about this feature. The exceptions are queries that

usethe Streamingfeature (see Chapter 14) and some stateful UDAFs (User-Defined

Aggregate Functions; see “Aggregate Functions” on page 164). There is one other scenario

wherethese clauses are useful.

Bydefault, MapReduce computes a hash on the keys output by mappers and tries to

evenlydistribute the key-value pairs among the available reducers using the hashvalues.

Unfortunately,this means that when we use SORT BY, the contents of one reducer’s

outputwill overlap significantly with the output of the other reducers, as far assorted

orderis concerned, even though the data is sorted within each reducer’s output.

DISTRIBUTE BY with SORT BY | 107

Say wewant the data for each stock symbol to be captured together. We can use

DISTRIBUTE BY to ensure that the records for each stock symbol go tothe same reducer,

thenuse SORT BY to order thedata the way we want. The following query demonstrates

thistechnique:

hive> SELECT s.ymd, s.symbol, s.price_close

> FROM stocks s

> DISTRIBUTE BY s.symbol

> SORT BY s.symbol ASC, s.ymd ASC;

1984-09-07 AAPL 26.5

1984-09-10 AAPL 26.37

1984-09-11 AAPL 26.87

1984-09-12 AAPL 26.12

1984-09-13 AAPL 27.5

1984-09-14 AAPL 27.87

1984-09-17 AAPL 28.62

1984-09-18 AAPL 27.62

1984-09-19 AAPL 27.0

1984-09-20 AAPL 27.12

...

Ofcourse, the ASC keywords couldhave been omitted as they are the defaults. The

ASC keyword is placed here for reasons that will be described shortly.

DISTRIBUTE BY works similar to GROUP BY in the sense that it controls how reducers

receiverows for processing, while SORT BY controls the sorting of data inside thereducer.

Notethat Hive requires that the DISTRIBUTE BY clause come before the SORT BY clause.

CLUSTER BY

In theprevious example, the s.symbol column was used in the DISTRIBUTE BY clause,

andthe s.symbol and the s.ymd columns in the SORT BY clause. Suppose that the same

columnsare used in both clauses and all columns are sorted by ascending order (the

default).In this case, the CLUSTER BY clause is a shor-hand way of expressing the same

query.

Forexample, let’s modify the previous query to drop sorting by s.ymd and use CLUSTER

BY on s.symbol:

hive> SELECT s.ymd, s.symbol, s.price_close

> FROM stocks s

> CLUSTER BY s.symbol;

2010-02-08 AAPL 194.12

2010-02-05 AAPL 195.46

2010-02-04 AAPL 192.05

2010-02-03 AAPL 199.23

2010-02-02 AAPL 195.86

2010-02-01 AAPL 194.73

2010-01-29 AAPL 192.06

2010-01-28 AAPL 199.29

2010-01-27 AAPL 207.88

Because the sort requirements are removed for the s.ymd, the output reflects the originalorder of the stock data, which is sorted descending.Using DISTRIBUTE BY ... SORT BY or the shorthand CLUSTER BY clauses is a way to exploit the parallelism of SORT BY, yet achieve a total ordering across the output files

关注

0
点赞
踩
0

收藏

觉得还不错? 一键收藏
0
评论
hive sort by,order by ,distribute by,cluster by

ORDER BY and SORT BYThe ORDER BY clause is familiar from other SQLdialects. It performs atotal ordering ofthequery result set. This means that all the data is passed through a single reducer,w
复制链接

扫一扫

专栏目录

评论

被折叠的条评论为什么被折叠?

到【灌水乐园】发言

查看更多评论

添加红包

成就一亿技术人!

hope_wisdom

发出的红包

实付元

使用余额支付

点击重新获取

扫码支付

钱包余额 0

抵扣说明：

1.余额是钱包充值的虚拟货币，按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载，可以购买VIP、付费专栏及课程。