STORE
A = LOAD 'data' AS (a1:int,a2:int,a3:int);

DUMP A;
(1,2,3)
(4,2,1)
(8,3,4)
(4,3,3)
(7,2,5)
(8,4,3)

STORE A INTO 'myoutput' USING PigStorage ('*');

CAT myoutput;
1*2*3
4*2*1
8*3*4
4*3*3
7*2*5
8*4*3
SPLIT
Partitions a relation into two or more relations.
A = LOAD 'data' AS (f1:int,f2:int,f3:int);

DUMP A;
(1,2,3)
(4,5,6)
(7,8,9)

SPLIT A INTO X IF f1<7, Y IF f2==5, Z IF (f3<6 OR f3>6);

DUMP X;
(1,2,3)
(4,5,6)

DUMP Y;
(4,5,6)

DUMP Z;
(1,2,3)
(7,8,9)
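One way to read the SPLIT statement above is as three independent conditions evaluated against every tuple of A. The sketch below spells that out with separate FILTER statements; it illustrates the semantics only and is not a description of how Pig executes SPLIT internally.

```pig
A = LOAD 'data' AS (f1:int,f2:int,f3:int);

-- Each branch of the SPLIT corresponds to one independent filter over A.
X = FILTER A BY f1 < 7;
Y = FILTER A BY f2 == 5;
Z = FILTER A BY (f3 < 6 OR f3 > 6);

DUMP X;  -- contains (1,2,3) and (4,5,6)
DUMP Y;  -- contains (4,5,6)
DUMP Z;  -- contains (1,2,3) and (7,8,9)
```

Because the conditions are independent, a tuple can land in more than one relation: (4,5,6) appears in both X and Y. A tuple that satisfied none of the conditions would not appear in any output relation.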
Example
In this example, the SPLIT and FILTER statements produce essentially the same output_var. However, SPLIT is implemented as "split the data stream and then apply filters", so it is more expensive than FILTER: Pig has to filter and store two data streams instead of one.
SPLIT input_var INTO output_var IF (field1 is not null), ignored_var IF (field1 is null);
-- where ignored_var is not used elsewhere

output_var = FILTER input_var BY (field1 is not null);
ORDER BY
Sorts a relation based on one or more fields.
A = LOAD 'mydata' AS (x: int, y: map[]);
B = ORDER A BY x; -- this is allowed because x is a simple type
B = ORDER A BY y; -- this is not allowed because y is a complex type
B = ORDER A BY y#'id'; -- this is not allowed because y#'id' is an expression
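A common way around the restriction on expressions is to project the expression into a named field first and then sort on that field. The following is a minimal sketch under stated assumptions: the alias id and the cast to chararray are illustrative choices, not something the original example prescribes.

```pig
A = LOAD 'mydata' AS (x: int, y: map[]);

-- Project the map lookup into a named field of a simple type.
-- The alias "id" and the chararray cast are assumptions for this sketch.
B = FOREACH A GENERATE x, (chararray)(y#'id') AS id;

-- id is now a plain field, so it can be used as a sort key.
C = ORDER B BY id;

-- Several sort keys can be listed, each with its own direction.
D = ORDER B BY x DESC, id ASC;
```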
Examples
Suppose we have relation A.
A = LOAD 'data' AS (a1:int,a2:int,a3:int);

DUMP A;
(1,2,3)
(4,2,1)
(8,3,4)
(4,3,3)
(7,2,5)
(8,4,3)
In this example relation A is sorted by the third field, a3, in descending order. Note that the order of the three tuples ending in 3 can vary.
X = ORDER A BY a3 DESC;

DUMP X;
(7,2,5)
(8,3,4)
(1,2,3)
(4,3,3)
(8,4,3)
(4,2,1)
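If the relative order of tuples with equal a3 values matters, a secondary sort key can break the ties. The sketch below uses a1 as the tie-breaker, which is an arbitrary choice made for illustration; it happens to be sufficient here because the three tuples ending in 3 have distinct a1 values.

```pig
A = LOAD 'data' AS (a1:int,a2:int,a3:int);

-- Sort by a3 descending, then by a1 ascending among equal a3 values.
X = ORDER A BY a3 DESC, a1 ASC;

DUMP X;
-- (7,2,5)
-- (8,3,4)
-- (1,2,3)
-- (4,3,3)
-- (8,4,3)
-- (4,2,1)
```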
LIMIT
Limits the number of output tuples.
Examples
In this example the limit is expressed as a scalar.
a = load 'a.txt';
b = group a all;
c = foreach b generate COUNT(a) as sum;
d = order a by $0;
e = limit d c.sum/100;
Suppose we have relation A.
A = LOAD 'data' AS (a1:int,a2:int,a3:int);

DUMP A;
(1,2,3)
(4,2,1)
(8,3,4)
(4,3,3)
(7,2,5)
(8,4,3)
In this example output is limited to 3 tuples. Note that there is no guarantee which three tuples will be output.
X = LIMIT A 3;

DUMP X;
(1,2,3)
(4,3,3)
(7,2,5)
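When specific tuples are wanted rather than an arbitrary three, the usual approach is to apply ORDER immediately before LIMIT. The following sketch sorts on a1 and a2; the choice of sort keys is an assumption made for illustration.

```pig
A = LOAD 'data' AS (a1:int,a2:int,a3:int);

-- Sort first so that LIMIT returns the first three tuples of the sorted relation.
B = ORDER A BY a1 DESC, a2 ASC;
X = LIMIT B 3;

DUMP X;
-- (8,3,4)
-- (8,4,3)
-- (7,2,5)
```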
DISTINCT
Removes duplicate tuples in a relation.
Syntax
alias = DISTINCT alias [PARTITION BY partitioner] [PARALLEL n];
Terms
alias | The name of the relation.
PARTITION BY partitioner | Use this feature to specify the Hadoop Partitioner. The partitioner controls the partitioning of the keys of the intermediate map-outputs.
PARALLEL n | Increase the parallelism of a job by specifying the number of reduce tasks, n. For more information, see Use the Parallel Features.
Usage
Use the DISTINCT operator to remove duplicate tuples in a relation. DISTINCT does not preserve the original order of the contents (to eliminate duplicates, Pig must first sort the data). You cannot use DISTINCT on a subset of fields; to do this, use FOREACH and a nested block to first select the fields and then apply DISTINCT (see Example: Nested Block).
Example
Suppose we have relation A.
A = LOAD 'data' AS (a1:int,a2:int,a3:int);

DUMP A;
(8,3,4)
(1,2,3)
(4,3,3)
(4,3,3)
(1,2,3)
In this example all duplicate tuples are removed.
X = DISTINCT A;

DUMP X;
(1,2,3)
(4,3,3)
(8,3,4)
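The Usage note about applying DISTINCT to a subset of fields can be sketched in two ways: projecting the fields with FOREACH before DISTINCT, or applying DISTINCT inside a nested block after grouping. The aliases B, G, S, D and Y below are illustrative names, and the choice of a1 and a2 as the fields of interest is an assumption.

```pig
A = LOAD 'data' AS (a1:int,a2:int,a3:int);

-- Variant 1: project the fields of interest, then remove duplicates.
B = FOREACH A GENERATE a1, a2;
X = DISTINCT B;
-- For the sample data, X contains (1,2), (4,3) and (8,3).

-- Variant 2: nested block, counting the distinct a2 values per a1.
G = GROUP A BY a1;
Y = FOREACH G {
    S = A.a2;         -- project a2 from the bag of grouped tuples
    D = DISTINCT S;   -- remove duplicate a2 values within the group
    GENERATE group, COUNT(D);
};
```

Variant 1 is enough when only the distinct combinations are needed; the nested block is useful when the distinct values feed a per-group computation such as COUNT.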
FILTER
Selects tuples from a relation based on some condition.
Syntax
alias = FILTER alias BY expression;
Terms
alias | The name of the relation.
BY | Required keyword.
expression | A boolean expression.
Usage
Use the FILTER operator to work with tuples or rows of data (if you want to work with columns of data, use the FOREACH...GENERATE operation).
FILTER is commonly used to select the data that you want; or, conversely, to filter out (remove) the data you don’t want.
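Both directions, keeping the tuples you want and removing the ones you do not, come down to the boolean expression after BY. The sketch below assumes a relation with three int fields; the aliases wanted, rest and Y are illustrative names.

```pig
A = LOAD 'data' AS (a1:int,a2:int,a3:int);

-- Keep only the tuples whose third field equals 3.
wanted = FILTER A BY a3 == 3;

-- Remove those tuples instead by negating the condition.
rest = FILTER A BY NOT (a3 == 3);

-- Conditions can be combined with AND, OR and NOT.
Y = FILTER A BY (a1 == 8) OR (NOT (a2 + a3 > a1));
```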
Examples
Suppose we have relation A.
A = LOAD 'data' AS (a1:int,a2:int,a3:int);

DUMP A;
(1,2,3)
(4,2,1)
(8,3,4)
(4,3,3)
(7,2,5)
(8,4,3)
In this example the condition states that if the third field equals 3, the tuple is included in relation X.
X = FILTER A BY a3 == 3;

DUMP X;
(1,2,3)
(4,3,3)
(8,4,3)

Refer to: http://pig.apache.org/docs/r0.12.1/basic.html