1. Generating a bag by calling a UDF fails with: ERROR 1068: Using Bag as key not supported.
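import sys
from pig_util import outputSchema
import re

# Capture the text that follows each "|Gd " up to the next "|".
rm = re.compile('\\|Gd ([^|]*)')

@outputSchema("{t:(inner_field_name_1:chararray,inner_field_name_2:chararray)}")
def aa(xx):
    # Build one (match, original line) tuple per match; Pig receives the list as a bag.
    cc = rm.findall(xx)
    oo = []
    for i in cc:
        oo.append((i, xx))
    return oo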
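To see what the UDF hands back before Pig is involved, the function can be exercised on its own. The sketch below is only a local check: the `outputSchema` stand-in is a hypothetical replacement for `pig_util.outputSchema` (so the file runs without Pig installed), and the sample line is made up.

```python
import re

# Hypothetical stand-in for pig_util.outputSchema, for local testing only.
def outputSchema(schema):
    def wrap(func):
        func.outputSchema = schema
        return func
    return wrap

rm = re.compile(r'\|Gd ([^|]*)')

@outputSchema("{t:(inner_field_name_1:chararray,inner_field_name_2:chararray)}")
def aa(xx):
    # One (match, original line) tuple per "|Gd ..." occurrence.
    return [(m, xx) for m in rm.findall(xx)]

sample = "id=1|Gd foo|Gd bar|x"  # made-up input line
result = aa(sample)
print(result)
```

The result is a list of two-element tuples, which matches the declared schema: a bag (the list) of tuples with two chararray fields.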
Pig script:
REGISTER 't.py' USING streaming_python AS udfs;
a = load 'part-r-00000' USING PigStorage('\n') as (dd:chararray);
--c = FILTER a BY $0 MATCHES '.*Gd.*';
a = FOREACH a GENERATE flatten(udfs.aa(dd)) ; -- Pass dd here so the field is typed as chararray; otherwise the UDF may receive an array.array, and findall fails. This runs successfully in local mode but fails when run against HDFS.
a = limit a 100;
dump a;
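If the field cannot be typed as chararray in the script, one possible workaround is to coerce the argument to text inside the UDF before calling findall. The helper below is a sketch under assumptions: `to_text` is a hypothetical name, the UTF-8 encoding is a guess, and the code is written for Python 3 so it can be tried locally. The Python 2 or Jython interpreter used by Pig may need `tostring()` in place of `tobytes()`, and this has not been verified on a cluster.

```python
import array

def to_text(value, encoding="utf-8"):
    """Return value as str regardless of the container it arrived in."""
    if isinstance(value, str):
        return value
    if isinstance(value, array.array):
        # Raw bytes wrapped in an array.array: unwrap them first.
        value = value.tobytes()
    if isinstance(value, (bytes, bytearray)):
        return bytes(value).decode(encoding)
    return str(value)

raw = array.array('b', b'|Gd foo')  # simulated untyped input
text = to_text(raw)
print(text)
```

Calling `rm.findall(to_text(xx))` instead of `rm.findall(xx)` would then work for both typed and untyped input, assuming the bytes really are UTF-8 text.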
Because the UDF returns a bag, flatten is needed in the FOREACH statement.
Curly braces denote a bag; parentheses denote a tuple.
Error:
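The effect of flatten on a bag can be modeled in plain Python: each tuple inside the bag becomes its own output row, and an empty bag contributes no rows. This is only an illustration of the semantics, not Pig code; `flatten_bag` and `split_words` are hypothetical names.

```python
def flatten_bag(rows, udf):
    """Model of FOREACH rows GENERATE flatten(udf(row))."""
    out = []
    for row in rows:
        bag = udf(row)        # a bag: list of tuples
        for tup in bag:       # flatten: one output row per tuple
            out.append(tup)
    return out

def split_words(line):
    # Toy UDF returning a bag of (word, original line) tuples.
    return [(w, line) for w in line.split()]

rows = ["a b", "", "c"]
flat = flatten_bag(rows, split_words)
print(flat)
```

Without flatten, each input row would instead yield a single field holding the whole bag.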
org.apache.pig.backend.executionengine.ExecException: ERROR 0: Exception while executing [POUserFunc (Name: POUserFunc(org.apache.pig.impl.builtin.StreamingUDF)[bag] - scope-3 Operator Key: scope-3)
streaming_python may not be supported when the job runs against HDFS; switching from streaming_python to jython made it work.
That is, change REGISTER 't.py' USING streaming_python AS udfs; to REGISTER 't.py' USING jython AS udfs;