person.txt
Create the file person.txt. Note that the two columns are separated by a tab (\t).
neil 411326199402110030
pony 41132519950911004x
jcal 12312423454556561
tony 412345671234908
person.py
Create the file person.py. This script turns the two input columns into three; the third column is the gender.
# -*- coding: utf-8 -*-
import sys

for line in sys.stdin:
    detail = line.strip().split("\t")
    if len(detail) != 2:
        # Skip malformed rows that do not have exactly two columns.
        continue
    else:
        name = detail[0]
        idcard = detail[1]
        if len(idcard) == 15:
            # 15-digit ID: the last digit encodes the gender (even = female).
            if int(idcard[-1]) % 2 == 0:
                print("\t".join([name,idcard,"female"]))
            else:
                print("\t".join([name,idcard,"male"]))
        elif len(idcard) == 18:
            # 18-digit ID: the second-to-last digit encodes the gender.
            if int(idcard[-2]) % 2 == 0:
                print("\t".join([name,idcard,"female"]))
            else:
                print("\t".join([name,idcard,"male"]))
        else:
            print("\t".join([name,idcard,"id card not correct"]))
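The gender rule used by the script can also be pulled out into a small function, which makes it easy to check without Hive. This is only a sketch: the name gender_from_idcard is hypothetical and not part of the original script, and treating a non-numeric gender digit as an invalid ID (instead of letting int() raise ValueError) is an added assumption.

```python
def gender_from_idcard(idcard):
    """Return "male", "female", or "id card not correct" for an ID string."""
    if len(idcard) == 15:
        digit = idcard[-1]   # 15-digit IDs: the last digit carries the gender
    elif len(idcard) == 18:
        digit = idcard[-2]   # 18-digit IDs: the second-to-last digit does
    else:
        return "id card not correct"
    # Assumption beyond the original script: a non-numeric gender digit is
    # reported as invalid rather than crashing the script with ValueError.
    if digit not in "0123456789":
        return "id card not correct"
    return "female" if int(digit) % 2 == 0 else "male"


# The four ID numbers from person.txt.
samples = [
    "411326199402110030",
    "41132519950911004x",
    "12312423454556561",
    "412345671234908",
]
results = [gender_from_idcard(s) for s in samples]
print(results)
```

With a helper like this, the loop over sys.stdin only needs to split each line and print name, idcard, and the returned label joined by tabs.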
Local test
cat person.txt|python person.py
neil 411326199402110030 male
pony 41132519950911004x female
jcal 12312423454556561 id card not correct
tony 412345671234908 female
Hive test
Create the table person
create table person(
name string,
idcard string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED as TEXTFILE;
Load the data
load data local inpath 'person.txt' overwrite into table person;
Register the script
add file person.py;
Run the query
select transform(name,idcard) USING 'python person.py' AS (name,idcard,gender) from person;
OK
neil 411326199402110030 male
pony 41132519950911004x female
jcal 12312423454556561 id card not correct
tony 412345671234908 female
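Python UDFs compared with Java UDFs:
- Most importantly, a Python UDF is more convenient: it needs none of the compile-and-package steps of a Java project, which matters especially to data developers.
- A Python UDF cannot operate on a single column the way a Java UDF can. It consumes one row (several columns) at a time and emits zero or more rows (several columns each).
- Because a Python UDF processes one row at a time, it cannot take advantage of vectorized execution; a Java UDF is not affected.
- A Python UDF has to launch a separate python process, which is costly; its execution efficiency is roughly 80% to 90% lower than Java's.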
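To make the row-in, rows-out contract from the list above concrete, here is a minimal sketch of the text format a transform script sees. It assumes Hive's default serialization for TRANSFORM: columns arrive as strings separated by tabs, and NULL is written as the literal \N. The helper names parse_row and format_row are hypothetical and only serve as illustration.

```python
NULL_MARKER = "\\N"  # Hive's default text representation of NULL


def parse_row(line):
    """Split one tab-separated input line into columns; the NULL marker becomes None."""
    return [None if col == NULL_MARKER else col
            for col in line.rstrip("\n").split("\t")]


def format_row(cols):
    """Join output columns with tabs; None is written back as the NULL marker."""
    return "\t".join(NULL_MARKER if col is None else str(col) for col in cols)


# One input row whose second column is NULL, plus one appended output column.
row = parse_row("neil\t\\N\n")
out = format_row(row + ["unknown"])
print(out)
```

Every line the script prints becomes one output row, so the number of print calls per input line decides whether a row is dropped, kept, or multiplied.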
Multi-row output test
Change person.py to the following, so that every print runs twice:
cat person.py
# -*- coding: utf-8 -*-
import sys

for line in sys.stdin:
    detail = line.strip().split("\t")
    if len(detail) != 2:
        continue
    else:
        name = detail[0]
        idcard = detail[1]
        # Each branch prints the same row twice, so one input row yields two output rows.
        if len(idcard) == 15:
            if int(idcard[-1]) % 2 == 0:
                print("\t".join([name,idcard,"female"]))
                print("\t".join([name,idcard,"female"]))
            else:
                print("\t".join([name,idcard,"male"]))
                print("\t".join([name,idcard,"male"]))
        elif len(idcard) == 18:
            if int(idcard[-2]) % 2 == 0:
                print("\t".join([name,idcard,"female"]))
                print("\t".join([name,idcard,"female"]))
            else:
                print("\t".join([name,idcard,"male"]))
                print("\t".join([name,idcard,"male"]))
        else:
            print("\t".join([name,idcard,"id card not correct"]))
            print("\t".join([name,idcard,"id card not correct"]))
Enter the Hive shell
add file person.py;
select transform(name,idcard) USING 'python person.py' AS (name,idcard,gender) from person;
The output is shown below; every row appears twice:
OK
neil 411326199402110030 male
neil 411326199402110030 male
pony 41132519950911004x female
pony 41132519950911004x female
jcal 12312423454556561 id card not correct
jcal 12312423454556561 id card not correct
tony 412345671234908 female
tony 412345671234908 female
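By the same logic, if every print is commented out, the query returns no rows at all.
For the detailed processing flow, see "Hive Python transform UDF 过程分析" (an analysis of the Hive Python transform UDF process).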
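The zero-row case can be useful on purpose, for example to filter rows. The sketch below is an illustration under assumptions, not part of the original scripts: the function name emit_valid_rows is hypothetical, and it reads from an in-memory sample instead of sys.stdin so that it can be run on its own.

```python
import io


def emit_valid_rows(lines):
    """Yield name and idcard only for IDs of length 15 or 18; other rows yield nothing."""
    for line in lines:
        detail = line.strip().split("\t")
        if len(detail) != 2:
            continue
        name, idcard = detail
        if len(idcard) in (15, 18):
            yield "\t".join([name, idcard])
        # No yield for other lengths: the row simply disappears from the result.


# In-memory stand-in for the rows Hive would send on stdin.
sample_input = io.StringIO(
    "neil\t411326199402110030\n"
    "jcal\t12312423454556561\n"
    "tony\t412345671234908\n"
)
kept = list(emit_valid_rows(sample_input))
for row in kept:
    print(row)
```

Used as a real transform script, the loop would iterate over sys.stdin instead of sample_input, and the query would declare two output columns, AS (name,idcard).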