1.UDTF
上一篇介绍了基础的UDF——UDF和GenericUDF的实现,这一篇将介绍更复杂的用户自定义表生成函数(UDTF)。用户自定义表生成函数(UDTF)接受零个或多个输入,然后产生多列或多行的输出,如explode()。要实现UDTF,需要继承org.apache.hadoop.hive.ql.udf.generic.GenericUDTF,同时实现三个方法:
// 该方法指定输入输出参数:输入的Object Inspectors和输出的Struct。
abstract StructObjectInspector initialize(ObjectInspector[] args) throws UDFArgumentException;
// 该方法处理输入记录,然后通过forward()方法返回输出结果。
abstract void process(Object[] record) throws HiveException;
// 该方法用于通知UDTF没有行可以处理了。可以在该方法中清理代码或者附加其他处理输出。
abstract void close() throws HiveException;
2.示例
现在我们看一个分割字符串的例子:
@Description(
name = "explode_name",
value = "_FUNC_(col) - The parameter is a column name."
+ " The return value is two strings.",
extended = "Example:\n"
+ " > SELECT _FUNC_(col) FROM src;"
+ " > SELECT _FUNC_(col) AS (name, surname) FROM src;"
+ " > SELECT adTable.name,adTable.surname"
+ " > FROM src LATERAL VIEW _FUNC_(col) adTable AS name, surname;"
)
public class ExplodeNameUDTF extends GenericUDTF{
@Override
public StructObjectInspector initialize(ObjectInspector[] argOIs)
throws UDFArgumentException {
if(argOIs.length != 1){
throw new UDFArgumentException("ExplodeStringUDTF takes exactly one argument.");
}
if(argOIs[0].getCategory() != ObjectInspector.Category.PRIMITIVE
&& ((PrimitiveObjectInspector)argOIs[0]).getPrimitiveCategory() != PrimitiveObjectInspector.PrimitiveCategory.STRING){
throw new UDFArgumentTypeException(0, "ExplodeStringUDTF takes a string as a parameter.");
}
ArrayList<String> fieldNames = new ArrayList<String>();
ArrayList<ObjectInspector> fieldOIs = new ArrayList<ObjectInspector>();
fieldNames.add("name");
fieldOIs.add(PrimitiveObjectInspectorFactory.javaStringObjectInspector);
fieldNames.add("surname");
fieldOIs.add(PrimitiveObjectInspectorFactory.javaStringObjectInspector);
return ObjectInspectorFactory.getStandardStructObjectInspector(fieldNames, fieldOIs);
}
@Override
public void process(Object[] args) throws HiveException {
// TODO Auto-generated method stub
String input = args[0].toString();
String[] name = input.split(" ");
forward(name);
}
@Override
public void close() throws HiveException {
// TODO Auto-generated method stub
}
}
然后我们把代码编译打包后的jar文件添加到CLASSPATH,然后创建函数explode_name(),最后仍然使用上一节的数据表employee:
hive (mydb)> ADD jar /root/experiment/hive/hive-0.0.1-SNAPSHOT.jar;
hive (mydb)> CREATE TEMPORARY FUNCTION explode_name AS "edu.wzm.hive.udtf.ExplodeNameUDTF";
hive (mydb)> SELECT explode_name(name) FROM employee;
Query ID = root_20160118000909_c2052a8b-dc3b-4579-931e-d9059c00c25b
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1453096763931_0005, Tracking URL = http://master:8088/proxy/application_1453096763931_0005/
Kill Command = /root/install/hadoop-2.4.1/bin/hadoop job -kill job_1453096763931_0005
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2016-01-18 00:09:08,643 Stage-1 map = 0%, reduce = 0%
2016-01-18 00:09:14,152 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.03 sec
MapReduce Total cumulative CPU time: 1 seconds 30 msec
Ended Job = job_1453096763931_0005
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Cumulative CPU: 1.03 sec HDFS Read: 1040 HDFS Write: 80 SUCCESS
Total MapReduce CPU Time Spent: 1 seconds 30 msec
OK
John Doe
Mary Smith
Todd Jones
Bill King
Boss Man
Fred Finance
Stacy Accountant
Time taken: 13.765 seconds, Fetched: 7 row(s)
3.UDTF的使用
UDTF有两种使用方法:
1. 直接和SELECT一起使用
SELECT explode_name(name) AS (name, surname) FROM employee
SELECT explode_name(name) FROM employee
但是,不能添加其他字段:
SELECT name,explode_name(name) AS (name, surname) FROM employee
不能嵌套其他函数:
SELECT explode_name(explode_name(name)) FROM employee
不能和GROUP BY/SORT BY/DISTRIBUTE BY/CLUSTER BY一起使用:
SELECT explode_name(name) FROM employee GROUP BY name
2. 和LATERAL VIEW一起使用
SELECT adTable.name,adTable.surname
FROM employee
LATERAL VIEW explode_name(name) adTable AS name, surname;
源代码托管在GitHub上:https://github.com/GatsbyNewton/hive_udf