hive sort by,order by ,distribute by,cluster by

ORDER BY and SORT BY

The ORDER BY clause is familiar from other SQLdialects. It performs a total ordering of

thequery result set. This means that all the data is passed through a single reducer,

whichmay take an unacceptably long time to execute for larger data sets.

Hiveadds an alternative, SORT BY, that orders the data only within each reducer, thereby

performinga localordering, where eachreducer’s output will be sorted. Better performance

istraded for total ordering.

Inboth cases, the syntax differs only by the use of the ORDER or SORT keyword. You can

specifyany columns you wish and specify whether or not the columns are ascending

usingthe ASC keyword (thedefault) or descending using the DESC keyword.

Hereis an example using ORDER BY:

SELECT s.ymd, s.symbol, s.price_close

FROM stocks s

ORDER BY s.ymd ASC, s.symbol DESC;

Hereis the same example using SORT BY instead:

SELECT s.ymd, s.symbol, s.price_close

FROM stocks s

SORT BY s.ymd ASC, s.symbol DESC;

Thetwo queries look almost identical, but if more than one reducer is invoked, the

outputwill be sorted differently. While each reducer’s output files will be sorted,the

datawill probably overlap with the output of other reducers.

BecauseORDER BY can result in excessively long runtimes, Hive will require a LIMIT

clausewith ORDER BY if the propertyhive.mapred.mode is set to strict. By default, it is

set tononstrict.

DISTRIBUTE BY with SORT BY

DISTRIBUTE BY controls how map output is divided among reducers. Alldata that flows

througha MapReduce job is organized into key-value pairs. Hive must use this feature

internallywhen it converts your queries to MapReduce jobs.

Usually,you won’t need to worry about this feature. The exceptions are queries that

usethe Streamingfeature (see Chapter 14) and some stateful UDAFs (User-Defined

Aggregate Functions; see “Aggregate Functions” on page 164). There is one other scenario

wherethese clauses are useful.

Bydefault, MapReduce computes a hash on the keys output by mappers and tries to

evenlydistribute the key-value pairs among the available reducers using the hashvalues.

Unfortunately,this means that when we use SORT BY, the contents of one reducer’s

outputwill overlap significantly with the output of the other reducers, as far assorted

orderis concerned, even though the data is sorted within each reducer’s output.

DISTRIBUTE BY with SORT BY | 107

Say wewant the data for each stock symbol to be captured together. We can use

DISTRIBUTE BY to ensure that the records for each stock symbol go tothe same reducer,

thenuse SORT BY to order thedata the way we want. The following query demonstrates

thistechnique:

hive> SELECT s.ymd, s.symbol, s.price_close

> FROM stocks s

> DISTRIBUTE BY s.symbol

> SORT BY s.symbol ASC, s.ymd ASC;

1984-09-07 AAPL 26.5

1984-09-10 AAPL 26.37

1984-09-11 AAPL 26.87

1984-09-12 AAPL 26.12

1984-09-13 AAPL 27.5

1984-09-14 AAPL 27.87

1984-09-17 AAPL 28.62

1984-09-18 AAPL 27.62

1984-09-19 AAPL 27.0

1984-09-20 AAPL 27.12

...

Ofcourse, the ASC keywords couldhave been omitted as they are the defaults. The

ASC keyword is placed here for reasons that will be described shortly.

DISTRIBUTE BY works similar to GROUP BY in the sense that it controls how reducers

receiverows for processing, while SORT BY controls the sorting of data inside thereducer.

Notethat Hive requires that the DISTRIBUTE BY clause come before the SORT BY clause.

CLUSTER BY

In theprevious example, the s.symbol column was used in the DISTRIBUTE BY clause,

andthe s.symbol and the s.ymd columns in the SORT BY clause. Suppose that the same

columnsare used in both clauses and all columns are sorted by ascending order (the

default).In this case, the CLUSTER BY clause is a shor-hand way of expressing the same

query.

Forexample, let’s modify the previous query to drop sorting by s.ymd and use CLUSTER

BY on s.symbol:

hive> SELECT s.ymd, s.symbol, s.price_close

> FROM stocks s

> CLUSTER BY s.symbol;

2010-02-08 AAPL 194.12

2010-02-05 AAPL 195.46

2010-02-04 AAPL 192.05

2010-02-03 AAPL 199.23

2010-02-02 AAPL 195.86

2010-02-01 AAPL 194.73

2010-01-29 AAPL 192.06

2010-01-28 AAPL 199.29

2010-01-27 AAPL 207.88

Because the sort requirements are removed for the s.ymd, the output reflects the originalorder of the stock data, which is sorted descending.Using DISTRIBUTE BY ... SORT BY or the shorthand CLUSTER BY clauses is a way to exploit the parallelism of SORT BY, yet achieve a total ordering across the output files
  • 0
    点赞
  • 0
    收藏
    觉得还不错? 一键收藏
  • 0
    评论
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值