Common Pig Commands

xiewenbo

STORE

Stores or saves results to the file system.

A = LOAD 'data' AS (a1:int,a2:int,a3:int);

DUMP A;
(1,2,3)
(4,2,1)
(8,3,4)
(4,3,3)
(7,2,5)
(8,4,3)

STORE A INTO 'myoutput' USING PigStorage ('*');

CAT myoutput;
1*2*3
4*2*1
8*3*4
4*3*3
7*2*5
8*4*3
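
To read the stored file back, the loader needs the same delimiter that STORE used. A minimal sketch, assuming the 'myoutput' location from the example above; the alias B is illustrative and not part of the original:

```pig
-- Read the stored tuples back, naming the same '*' delimiter used by STORE.
B = LOAD 'myoutput' USING PigStorage('*') AS (a1:int, a2:int, a3:int);
DUMP B;
```

If the delimiter is omitted here, PigStorage falls back to its default tab separator and each line would be read as a single field.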

SPLIT

Partitions a relation into two or more relations.

A = LOAD 'data' AS (f1:int,f2:int,f3:int);

DUMP A;                
(1,2,3)
(4,5,6)
(7,8,9)        

SPLIT A INTO X IF f1<7, Y IF f2==5, Z IF (f3<6 OR f3>6);

DUMP X;
(1,2,3)
(4,5,6)

DUMP Y;
(4,5,6)

DUMP Z;
(1,2,3)
(7,8,9)
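
As the output above shows, a tuple can land in more than one relation, such as (4,5,6), or in only one, such as (7,8,9). When a catch-all branch is wanted, the reference cited at the end of this post also documents an OTHERWISE clause. A short sketch on the same three-tuple relation; the output aliases are reused only for illustration:

```pig
A = LOAD 'data' AS (f1:int, f2:int, f3:int);

-- Tuples with f1 < 7 go to X; the remaining sample tuples go to Y.
SPLIT A INTO X IF f1 < 7, Y OTHERWISE;
```

With the sample data, X should hold (1,2,3) and (4,5,6), and Y should hold (7,8,9).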

Example

In this example, the SPLIT and FILTER statements are essentially equivalent. However, SPLIT is implemented as "split the data stream and then apply filters", so it is more expensive than FILTER: Pig has to filter and store two data streams.

SPLIT input_var INTO output_var IF (field1 is not null), ignored_var IF (field1 is null);  
-- where ignored_var is not used elsewhere
   
output_var = FILTER input_var BY (field1 is not null);

ORDER BY

Sorts a relation based on one or more fields.

A = LOAD 'mydata' AS (x: int, y: map[]);     
B = ORDER A BY x; -- this is allowed because x is a simple type
B = ORDER A BY y; -- this is not allowed because y is a complex type
B = ORDER A BY y#'id'; -- this is not allowed because y#'id' is an expression
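
One way around the expression restriction is to project the map value into its own field first and then sort on that field. A minimal sketch, assuming the 'id' values can be cast to chararray; the field name id and the alias C are illustrative:

```pig
A = LOAD 'mydata' AS (x:int, y:map[]);

-- Pull the map lookup out into a simple-typed field (the cast is an assumption about the data).
B = FOREACH A GENERATE x, y, (chararray) y#'id' AS id;

-- Sorting on the projected field is allowed because id is a simple type.
C = ORDER B BY id;
```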

Examples

Suppose we have relation A.

A = LOAD 'data' AS (a1:int,a2:int,a3:int);

DUMP A;
(1,2,3)
(4,2,1)
(8,3,4)
(4,3,3)
(7,2,5)
(8,4,3)

In this example relation A is sorted by the third field, a3, in descending order. Note that the order of the three tuples ending in 3 can vary.

X = ORDER A BY a3 DESC;

DUMP X;
(7,2,5)
(8,3,4)
(1,2,3)
(4,3,3)
(8,4,3)
(4,2,1)
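
ORDER BY also accepts several sort keys, each with its own ASC or DESC direction. A short sketch against the same relation A; the alias Y is illustrative:

```pig
-- Sort by a2 ascending, then break ties on a3 descending.
Y = ORDER A BY a2 ASC, a3 DESC;
```

With the sample data this should yield (7,2,5), (1,2,3), (4,2,1), (8,3,4), (4,3,3), (8,4,3), because no two tuples share both sort keys.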

LIMIT

Limits the number of output tuples.

Examples

In this example the limit is expressed as a scalar.

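a = load 'a.txt';
b = group a all;
c = foreach b generate COUNT(a) as sum;  -- c is a one-tuple relation holding the count
d = order a by $0;
e = limit d c.sum/100;                   -- scalar expression: about 1% of the counted tuples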

Suppose we have relation A.

A = LOAD 'data' AS (a1:int,a2:int,a3:int);

DUMP A;
(1,2,3)
(4,2,1)
(8,3,4)
(4,3,3)
(7,2,5)
(8,4,3)

In this example output is limited to 3 tuples. Note that there is no guarantee which three tuples will be output.

X = LIMIT A 3;

DUMP X;
(1,2,3)
(4,3,3)
(7,2,5)
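
When specific tuples are wanted rather than an arbitrary three, a common pattern is to sort first and limit afterwards. A minimal sketch on the same relation A; the alias B and the choice of a1 as the sort key are illustrative:

```pig
A = LOAD 'data' AS (a1:int, a2:int, a3:int);

-- Sort first so that LIMIT keeps the tuples with the largest a1 values.
B = ORDER A BY a1 DESC;
X = LIMIT B 3;
```

With the sample data, X should contain (8,3,4), (8,4,3), and (7,2,5); the relative order of the two tuples starting with 8 can vary.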

DISTINCT

Removes duplicate tuples in a relation.

Syntax

alias = DISTINCT alias [PARTITION BY partitioner] [PARALLEL n];

Terms

alias

The name of the relation.

PARTITION BY partitioner

Use this feature to specify the Hadoop Partitioner. The partitioner controls the partitioning of the keys of the intermediate map-outputs.

For more details, see http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/mapred/Partitioner.html
For usage, see Example: PARTITION BY.

PARALLEL n

Increase the parallelism of a job by specifying the number of reduce tasks, n.

For more information, see Use the Parallel Features.

Usage

Use the DISTINCT operator to remove duplicate tuples in a relation. DISTINCT does not preserve the original order of the contents (to eliminate duplicates, Pig must first sort the data). You cannot use DISTINCT on a subset of fields; to do this, use FOREACH and a nested block to first select the fields and then apply DISTINCT (see Example: Nested Block).

Example

Suppose we have relation A.

A = LOAD 'data' AS (a1:int,a2:int,a3:int);

DUMP A;
(8,3,4)
(1,2,3)        
(4,3,3)        
(4,3,3)        
(1,2,3)

In this example all duplicate tuples are removed.

X = DISTINCT A;

DUMP X;
(1,2,3)
(4,3,3)
(8,3,4)
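
The Usage note above says DISTINCT cannot be applied directly to a subset of fields. A minimal sketch of two ways around that, assuming the same 'data' file and schema; the aliases P, B, Y, vals, and uniq are illustrative names, not from the original:

```pig
A = LOAD 'data' AS (a1:int, a2:int, a3:int);

-- Option 1: project the fields of interest, then remove duplicate pairs.
P = FOREACH A GENERATE a1, a2;
X = DISTINCT P;

-- Option 2: a nested block that counts the distinct a2 values per a1 group.
B = GROUP A BY a1;
Y = FOREACH B {
    vals = A.a2;
    uniq = DISTINCT vals;
    GENERATE group, COUNT(uniq);
};
```

With the five sample tuples, X should contain the three pairs (1,2), (4,3), and (8,3).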

FILTER

Selects tuples from a relation based on some condition.

Syntax

alias = FILTER alias BY expression;

Terms

alias	The name of the relation.
BY	Required keyword.
expression	A boolean expression.

Usage

Use the FILTER operator to work with tuples or rows of data (if you want to work with columns of data, use the FOREACH...GENERATE operation).

FILTER is commonly used to select the data that you want; or, conversely, to filter out (remove) the data you don’t want.

Examples

Suppose we have relation A.

A = LOAD 'data' AS (a1:int,a2:int,a3:int);

DUMP A;
(1,2,3)
(4,2,1)
(8,3,4)
(4,3,3)
(7,2,5)
(8,4,3)

In this example the condition states that if the third field equals 3, the tuple is included in relation X.

X = FILTER A BY a3 == 3;

DUMP X;
(1,2,3)
(4,3,3)
(8,4,3)
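
Conditions can also be combined with AND, OR, and NOT. A short sketch on the same relation A, adapted from the style of the reference documentation; the alias Y is illustrative:

```pig
-- Keep tuples whose first field is 8, or whose a2 + a3 is not greater than a1.
Y = FILTER A BY (a1 == 8) OR (NOT (a2 + a3 > a1));
```

With the sample data, Y should contain (4,2,1), (8,3,4), (7,2,5), and (8,4,3).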

Reference: http://pig.apache.org/docs/r0.12.1/basic.html
