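Pig's built-in DISTINCT only removes records that are identical in every field; it does not support deduplicating on a subset of fields.
The Pig documentation states: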
You cannot use DISTINCT on a subset of fields. To do this, use FOREACH…GENERATE to select the fields, and then use DISTINCT (see Example: Nested Block).
In the example that follows in the documentation, DISTINCT is likewise applied to an entire relation, after a FILTER has been applied first.
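The approach described in the documentation can be sketched as follows. This is only a minimal illustration: it reuses the foo relation and schema from the example later in this post, and the choice of id and field1 as the fields to keep, as well as the alias names foo_ids and foo_ids_distinct, are assumptions made for the sketch.

```pig
foo = LOAD 'foo' USING PigStorage(',') AS (id:int, field1:chararray, field2:chararray, field3:chararray);
-- Project only the fields that should be unique (assumed here: id and field1).
foo_ids = FOREACH foo GENERATE id, field1;
-- DISTINCT now compares whole records, which contain just those two fields.
foo_ids_distinct = DISTINCT foo_ids;
DUMP foo_ids_distinct;
```

Note that the projection discards field2 and field3 entirely, so this only helps when the remaining fields are not needed.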
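In practice, however, poor schema design and redundant data often make it necessary to apply DISTINCT to certain fields only, while still keeping the partly useful data in the other fields.
This can be achieved by combining GROUP, FOREACH, and LIMIT.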
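Take a dataset foo(id, field1, field2, field3) as an example:
When id=1, only field1 is meaningful, and its value is always the same across those rows.
When id=2, field1 and field2 are meaningful, and their values are the same across those rows.
When id=3, field1, field2, and field3 are meaningful, and their values are the same across those rows.
(PS: a table designed like this violates the database normal forms.)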
1,value1,other1_1,other1_2
2,value2_1,value2_2,other2_1
3,value3_1,value3_2,value3_3
1,value1,other1_3,other1_4
1,value1,other1_5,other1_6
2,value2_1,value2_2,other2_2
4,value4_1,value4_2,
Pig code that applies DISTINCT to id only:
foo = LOAD 'foo' USING PigStorage(',') AS (id:int, field1:chararray, field2:chararray, field3:chararray);
foo_group = GROUP foo BY id;
result = FOREACH foo_group {
    foo_one = LIMIT foo 1;
    GENERATE FLATTEN(foo_one);
}
DUMP result;
Result:
(1,value1,other1_1,other1_2)
(2,value2_1,value2_2,other2_1)
(3,value3_1,value3_2,value3_3)
(4,value4_1,value4_2,)
The code above ran successfully on Pig 0.9.2.
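One caveat: LIMIT without a preceding ORDER does not guarantee which tuple of each group is returned, so the surviving values of the non-distinct fields may differ between runs. If a repeatable choice matters, a nested ORDER can be added before the LIMIT. The sketch below assumes that sorting by field2 and then field3 is an acceptable tie-break; the original post does not specify one, and the alias foo_sorted is introduced only for illustration.

```pig
foo = LOAD 'foo' USING PigStorage(',') AS (id:int, field1:chararray, field2:chararray, field3:chararray);
foo_group = GROUP foo BY id;
result = FOREACH foo_group {
    -- Assumed tie-break: sort each group's bag so the row kept is repeatable.
    foo_sorted = ORDER foo BY field2, field3;
    -- Keep the first row of the sorted bag for this id.
    foo_one = LIMIT foo_sorted 1;
    GENERATE FLATTEN(foo_one);
};
DUMP result;
```

The extra sort runs inside each group, so it adds cost in proportion to group size; for small groups like the sample data this is negligible.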