Hive Python transform UDF example

houzhizhen

Article link: https://blog.csdn.net/houzhizhen/article/details/122098970

person.txt

Create the file person.txt. Note that the two columns are separated by a tab (\t).

neil	411326199402110030
pony	41132519950911004x
jcal	12312423454556561
tony	412345671234908

person.py

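Create the file person.py. The script turns the two input columns into three; the third column is the gender derived from the ID card number.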

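# -*- coding: utf-8 -*-
import sys

# Hive streams each row to stdin as one line of tab-separated columns.
for line in sys.stdin:
    detail = line.strip().split("\t")
    if len(detail) != 2:
        # Malformed row: emit nothing for it.
        continue
    else:
        name = detail[0]
        idcard = detail[1]
        if len(idcard) == 15:
            # 15-character ID: the last digit encodes gender (even = female).
            if int(idcard[-1]) % 2 == 0:
                print("\t".join([name,idcard,"female"]))
            else:
                print("\t".join([name,idcard,"male"]))
        elif len(idcard) == 18:
            # 18-character ID: the second-to-last digit encodes gender.
            if int(idcard[-2]) % 2 == 0:
                print("\t".join([name,idcard,"female"]))
            else:
                print("\t".join([name,idcard,"male"]))
        else:
            # Any other length is reported as an invalid ID.
            print("\t".join([name,idcard,"id card not correct"]))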

Local test

cat person.txt|python person.py
neil	411326199402110030	male
pony	41132519950911004x	female
jcal	12312423454556561	id card not correct
tony	412345671234908	female
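
To check the gender rule without a shell pipeline, the same logic can be wrapped in functions and run against the sample rows above. This is only a minimal sketch: the names `gender_from_idcard` and `transform_line` are hypothetical and not part of the original script. Unlike person.py, it also reports a non-numeric gender digit as an invalid ID instead of letting `int()` raise `ValueError`; that behaviour is an assumption added here.

```python
def gender_from_idcard(idcard):
    """Return 'male', 'female', or 'id card not correct' for an ID string.

    Hypothetical helper mirroring person.py: a 15-character ID uses its last
    digit, an 18-character ID uses its second-to-last digit (odd = male).
    """
    if len(idcard) == 15:
        digit = idcard[-1]
    elif len(idcard) == 18:
        digit = idcard[-2]
    else:
        return "id card not correct"
    # Assumption: a non-numeric gender digit is reported as invalid
    # instead of raising ValueError as person.py would.
    if digit not in "0123456789":
        return "id card not correct"
    return "female" if int(digit) % 2 == 0 else "male"


def transform_line(line):
    """Turn one 'name<TAB>idcard' line into 'name<TAB>idcard<TAB>gender'."""
    detail = line.strip().split("\t")
    if len(detail) != 2:
        return None  # malformed row: emit nothing
    name, idcard = detail
    return "\t".join([name, idcard, gender_from_idcard(idcard)])


if __name__ == "__main__":
    rows = [
        "neil\t411326199402110030",
        "pony\t41132519950911004x",
        "jcal\t12312423454556561",
        "tony\t412345671234908",
    ]
    for row in rows:
        print(transform_line(row))
```

Keeping the rule in a plain function makes it easy to test in isolation; the stdin loop in person.py would then only need to call `transform_line` and print any non-None result.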

Hive test

Create the table person

create table person(
name string,
idcard string)
ROW FORMAT DELIMITED 
FIELDS TERMINATED BY '\t'
STORED as TEXTFILE;

Load the data

load data local inpath 'person.txt' overwrite into table person;

Add the script

add file person.py;

Run the query

select transform(name,idcard) USING 'python person.py'  AS (name,idcard,gender) from person; 

OK
neil	411326199402110030	male
pony	41132519950911004x	female
jcal	12312423454556561	id card not correct
tony	412345671234908	female

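Python UDF vs. Java UDF:

Most importantly, a Python UDF is more convenient: it needs no Java project, compilation, or packaging step, which matters especially to data developers.
A Python UDF cannot operate on a single column. It consumes one row (several columns) at a time and emits zero or more rows (several columns each). A Java UDF can operate on a single column.
Because a Python UDF processes one row at a time, it cannot benefit from vectorized execution, whereas a Java UDF is not affected.
A Python UDF has to launch a separate python process, which is relatively expensive; execution efficiency is roughly 80% to 90% lower than with Java.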
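
The separate-process, row-at-a-time model described above can be imitated locally. The sketch below is only a simulation under stated assumptions, not Hive's actual implementation: it launches a child Python interpreter with `subprocess.run`, writes tab-separated rows to its stdin, and parses the rows it prints to stdout. The inline child script and the helper name `run_transform` are hypothetical.

```python
import subprocess
import sys

# Hypothetical stand-in for a transform script: it upper-cases the first
# column and appends the length of the second column as a third column.
CHILD_SCRIPT = r'''
import sys
for line in sys.stdin:
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 2:
        continue
    print("\t".join([cols[0].upper(), cols[1], str(len(cols[1]))]))
'''


def run_transform(rows):
    """Send rows (lists of strings) to a child process and parse its output rows."""
    # Serialize each row as one tab-separated line, as the transform input.
    payload = "".join("\t".join(row) + "\n" for row in rows)
    result = subprocess.run(
        [sys.executable, "-c", CHILD_SCRIPT],
        input=payload,
        capture_output=True,
        text=True,
        check=True,
    )
    # Each stdout line is one output row; split it back into columns.
    return [line.split("\t") for line in result.stdout.splitlines()]


if __name__ == "__main__":
    print(run_transform([["neil", "411326199402110030"],
                         ["tony", "412345671234908"]]))
```

Every value crosses the process boundary as text, so each row is serialized, piped, and parsed again. That overhead is one plausible reason a streaming script is slower than code running inside the query engine.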

Multi-row output test

Change person.py to the following, so that every print is executed twice:

cat person.py
# -*- coding: utf-8 -*-
import sys

for line in sys.stdin:
    detail = line.strip().split("\t")
    if len(detail) != 2:
        continue
    else:
        name = detail[0]
        idcard = detail[1]
        if len(idcard) == 15:
            if int(idcard[-1]) % 2 == 0:
                print("\t".join([name,idcard,"female"]))
                print("\t".join([name,idcard,"female"]))
            else:
                print("\t".join([name,idcard,"male"]))
                print("\t".join([name,idcard,"male"]))
        elif len(idcard) == 18:
            if int(idcard[-2]) % 2 == 0:
                print("\t".join([name,idcard,"female"]))
                print("\t".join([name,idcard,"female"]))
            else:
                print("\t".join([name,idcard,"male"]))
                print("\t".join([name,idcard,"male"]))
        else:
            print("\t".join([name,idcard,"id card not correct"]))
            print("\t".join([name,idcard,"id card not correct"]))

Enter the Hive CLI

add file person.py;

select transform(name,idcard) USING 'python person.py'  AS (name,idcard,gender) from person;

The output is as follows; each row appears twice:

OK
neil	411326199402110030	male
neil	411326199402110030	male
pony	41132519950911004x	female
pony	41132519950911004x	female
jcal	12312423454556561	id card not correct
jcal	12312423454556561	id card not correct
tony	412345671234908	female
tony	412345671234908	female

By the same logic, if every print is commented out, the query returns no rows at all.
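
The zero-or-more-rows behaviour can be sketched as a generator, where the number of yields per input row decides how many output rows would be produced. The function name `emit_rows` and the `copies` parameter are hypothetical, introduced only for illustration. With `copies=0` the effect matches commenting out every print.

```python
def emit_rows(lines, copies):
    """Yield each well-formed 'name<TAB>idcard' line `copies` times.

    Hypothetical illustration: every yielded string stands for one print()
    in a transform script, that is, one output row.
    """
    for line in lines:
        detail = line.strip().split("\t")
        if len(detail) != 2:
            continue  # malformed input row: zero output rows
        for _ in range(copies):
            yield "\t".join(detail)


sample = ["neil\t411326199402110030", "broken", "tony\t412345671234908"]
print(list(emit_rows(sample, 2)))  # each valid row appears twice
print(list(emit_rows(sample, 0)))  # no rows at all
```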

For the detailed processing flow, see the article "Hive Python transform UDF process analysis".