使用pig的注意事项

最新推荐文章于 2021-02-08 10:46:36 发布

jack-zhu

最新推荐文章于 2021-02-08 10:46:36 发布

阅读量3k

点赞数

分类专栏： pig 文章标签：使用pig的注意事项

本文链接：https://blog.csdn.net/zhu_xun/article/details/17679101

版权

pig 专栏收录该内容

9 篇文章 0 订阅

订阅专栏

使用pig的注意事项

1.pig中dump和store的区别：

dump和store的作用都是将pig处理的结果进行输出，前者是打印结果，后者是将结果输出至hdfs文件系统中。

(1).Dump表示将数据打印出来，输出所有的元祖，会带上”( )”。

Eg:(zhangsan,22,男)

(2).Store表示将数据输出至hdfs，输出时不会带上”( )”。

Eg:zhangsan22,男

2.pig中组函数的使用：

在关系型数据库的标准sql中，可以不分组而直接使用组函数(比如SUM)，这是因为sql引擎会把该表的所有记录当作一个大组对待。但pig中组函数在使用前必须先显示地进行分组才行。

eg：有一个test表，里面有字段name,count，现在要统计test表中的count字段的和，那么可以通过下面简单的sql来做：

select SUM(count) as sum；

如果在pig中那就得先手动的分组了，如何手动分组？只要给每个元祖加上一条内容相同的字段，然后按照该字段进行分组，然后统计分组内的count之和。过程如下：

如果要实现输出如下形式的结果：name,count,sum，那么如何做？用sql比较简单：select name,count,sum(count)。

如果用pig，如何实现？思路是在上例的基础上在result上面再加一个字段id,内容为"pigIsVeryGood",过程如下：

data = load 'test.csv' as(name:chararray,count:int);
tmp01 = foreach data generate 'pigIsVeryGood' as id,name,count;
tmp02 = group tmp01 by id;
result = foreach tmp02 generate SUM($1.count) as sum;
dump result;
tmp03 = foreach resultgenerate 'pigIsVeryGood' as id,sum;
tmp04 = join tmp01 by id,tmp03 by id;
finalResult= foreach tmp04 generate name,count,sum;
dump finalResult;

3.Filter使用

是把符合条件的保留下，非符合条件的过滤掉

eg:info = Filter studentsInfo BY (age > 18)

把大于18岁的保留下，小于或等于18岁的过滤掉

4.引用运算符(Dereference Operators.)与解歧义运算符（Disambiguate Operator个人感觉还是叫“域运算符”吧：：）的区别及使用说明：

(1).引用运算符(Dereference Operators.)

Dereference Operators

Description

Operator	Symbol	Notes
tuple dereference	tuple.id or tuple.(id,…)	Tuple dereferencing can be done by name (tuple.field_name) or position (mytuple.$0). If a set of fields are dereferenced (tuple.(name1, name2) or tuple.($0, $1)), the expression represents a tuple composed of the specified fields. Note that if the dot operator is applied to a bytearray, the bytearray will be assumed to be a tuple.
bag dereference	bag.id or bag.(id,…)	Bag dereferencing can be done by name (bag.field_name) or position (bag.$0). If a set of fields are dereferenced (bag.(name1, name2) or bag.($0, $1)), the expression represents a bag composed of the specified fields.
map dereference	map#'key'	Map dereferencing must be done by key (field_name#key or $0#key). If the pound operator is applied to a bytearray, the bytearray is assumed to be a map. If the key does not exist, the empty string is returned.

Examples

Tuple Example

Suppose we have relation A.

LOAD 'data' as (f1:int, f2:tuple(t1:int,t2:int,t3:int));

DUMP A;
(1,(1,2,3))
(2,(4,5,6))
(3,(7,8,9))
(4,(1,4,7))
(5,(2,5,8))

In this example dereferencing is used to retrieve two fields from tuple f2.

X = FOREACH A GENERATE f2.t1,f2.t3;

DUMP X;
(1,3)
(4,6)
(7,9)
(1,7)
(2,8)

Bag Example

Suppose we have relation B, formed by grouping relation A (see the GROUP operator for information about the field names in relation B).

A = LOAD 'data' AS (f1:int, f2:int,f3:int);

DUMP A;
(1,2,3)
(4,2,1)
(8,3,4)
(4,3,3)
(7,2,5)
(8,4,3)

B = GROUP A BY f1;

DUMP B;
(1,{(1,2,3)})
(4,{(4,2,1),(4,3,3)})
(7,{(7,2,5)})
(8,{(8,3,4),(8,4,3)})

ILLUSTRATE B;
etc …
----------------------------------------------------------
| b   | group: int | a: bag({f1: int,f2: int,f3: int}) |
----------------------------------------------------------

In this example dereferencing is used with relation X to project the first field (f1) of each tuple in the bag (a).

X = FOREACH B GENERATE a.f1;

DUMP X;
({(1)})
({(4),(4)})
({(7)})
({(8),(8)})

Tuple/Bag Example

Suppose we have relation B, formed by grouping relation A (see the GROUP operator for information about the field names in relation B).

A = LOAD 'data' AS (f1:int, f2:int, f3:int);

DUMP A;
(1,2,3)
(4,2,1)
(8,3,4)
(4,3,3)
(7,2,5)
(8,4,3)

B = GROUP A BY (f1,f2);

DUMP B;
((1,2),{(1,2,3)})
((4,2),{(4,2,1)})
((4,3),{(4,3,3)})
((7,2),{(7,2,5)})
((8,3),{(8,3,4)})
((8,4),{(8,4,3)})

ILLUSTRATE B;
etc …
-------------------------------------------------------------------------------
| b     | group: tuple({f1: int,f2: int}) | a: bag({f1: int,f2: int,f3: int}) |
-------------------------------------------------------------------------------
|       | (8, 3)                                | {(8, 3, 4), (8, 3, 4)} |
-------------------------------------------------------------------------------

In this example dereferencing is used to project a field (f1) from a tuple (group) and a field (f1) from a bag (a).

X = FOREACH B GENERATE group.f1, a.f1;

DUMP X;
(1,{(1)})
(4,{(4)})
(4,{(4)})
(7,{(7)})
(8,{(8)})
(8,{(8)})

Map Example

Suppose we have relation A.

A = LOAD 'data' AS (f1:int, f2:map[]);

DUMP A;
(1,[open#apache])
(2,[apache#hadoop])
(3,[hadoop#pig])
(4,[pig#grunt])

In this example dereferencing is used to look up the value of key 'open'.

X = FOREACH A GENERATE f2#'open';

DUMP X;
(apache)
()
()
()

(2).解歧义运算符（Disambiguate Operator个人感觉还是叫“域运算符”吧：：

Disambiguate Operator

Use the disambiguate operator ( :: ) to identify field names after JOIN, COGROUP, CROSS, or FLATTEN operators.

In this example, to disambiguate y, use A::y or B::y. In cases where there is no ambiguity, such as z, the :: is not necessary but is still supported.

A = load 'data1' as (x, y);
B = load 'data2' as (x, y, z);
C = join A by x, B by x;
D = foreach C generate A::x,A::y,B::x,B::y,B::z;

未完待续.......

jack-zhu

关注

0
点赞
踩
1

收藏

觉得还不错? 一键收藏
0
评论
复制链接

分享到 QQ

分享到新浪微博

扫一扫

专栏目录