1.Introduction to Hive UDFs
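In Hive, users can define their own functions to extend HiveQL; such functions are called UDFs (user-defined functions). Besides ordinary UDFs there are two further kinds: UDAFs (user-defined aggregate functions) and UDTFs (user-defined table-generating functions). Before covering how to implement UDAFs and UDTFs, this chapter introduces the simpler cases, UDF and GenericUDF; the next chapter builds on them to cover UDAFs and UDTFs.
Hive provides two APIs for writing a UDF: the basic UDF class and the more complex GenericUDF class.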
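- org.apache.hadoop.hive.ql.exec.UDF: a basic UDF reads and returns primitive types, that is, the Hadoop and Hive primitive types such as Text, IntWritable, LongWritable, and DoubleWritable.
- org.apache.hadoop.hive.ql.udf.generic.GenericUDF: the more complex GenericUDF can also handle complex types such as Map, List (array), and Struct.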
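A UDF class can be documented with the @Description annotation, which takes three attributes:
- name: the name of the function in Hive.
- value: a description of the function's usage and parameters.
- extended: additional notes, such as an example. It is printed by DESCRIBE FUNCTION EXTENDED name.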
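To use a UDF in Hive, add the jar file, create a temporary function from the class, and drop the function when it is no longer needed:
hive> ADD JAR /root/experiment/hive/hive-0.0.1-SNAPSHOT.jar;
hive> CREATE TEMPORARY FUNCTION hello AS "edu.wzm.hive.HelloUDF";
hive> DROP TEMPORARY FUNCTION IF EXISTS hello;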
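The name, value, and extended attributes of @Description feed the output of DESCRIBE FUNCTION, where Hive substitutes the function name for the _FUNC_ placeholder. The following is a minimal plain-Java sketch of that substitution, not Hive's implementation: FunctionDoc is a hypothetical stand-in for Hive's annotation so the example compiles without Hive on the classpath, and describe() and HelloDoc are illustrative names.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;

final class DescriptionSketch {

    // Hypothetical stand-in for Hive's @Description annotation (assumption, not Hive's class).
    @Retention(RetentionPolicy.RUNTIME)
    @Target(ElementType.TYPE)
    @interface FunctionDoc {
        String name();
        String value();
        String extended() default "";
    }

    // Illustrative class carrying the documentation; a real UDF would extend Hive's UDF.
    @FunctionDoc(
        name = "hello",
        value = "_FUNC_(str) - returns \"Hello $str\" for the input string",
        extended = "Example:\n  > SELECT _FUNC_(name) FROM employee;")
    static final class HelloDoc { }

    // Builds the help text and replaces every _FUNC_ with the given function name.
    // With isExtended set, the extended text is appended, as in DESCRIBE FUNCTION EXTENDED.
    static String describe(Class<?> udfClass, String functionName, boolean isExtended) {
        FunctionDoc doc = udfClass.getAnnotation(FunctionDoc.class);
        if (doc == null) {
            return "no documentation available";
        }
        String text = doc.value();
        if (isExtended && !doc.extended().isEmpty()) {
            text += "\n" + doc.extended();
        }
        return text.replace("_FUNC_", functionName);
    }

    public static void main(String[] args) {
        System.out.println(describe(HelloDoc.class, "hello", false));
        System.out.println(describe(HelloDoc.class, "hello", true));
    }
}
```

Writing _FUNC_ instead of a hard-coded name keeps the help text correct whatever name the function is registered under with CREATE TEMPORARY FUNCTION.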
2.UDF
This chapter uses the following data:
hive (mydb)> SELECT * FROM employee;
OK
John Doe 100000.0 ["Mary Smith","Todd Jones"] {"Federal Taxes":0.2,"State Taxes":0.05,"Insurance":0.1} {"street":"1 Michigan Ave.","city":"Chicago","state":"IL","zip":60600} US CA
Mary Smith 80000.0 ["Bill King"] {"Federal Taxes":0.2,"State Taxes":0.05,"Insurance":0.1} {"street":"100 Ontario St.","city":"Chicago","state":"IL","zip":60601} US CA
Todd Jones 70000.0 [] {"Federal Taxes":0.15,"State Taxes":0.03,"Insurance":0.1} {"street":"200 Chicago Ave.","city":"Oak Park","state":"IL","zip":60700} US CA
Bill King 60000.0 [] {"Federal Taxes":0.15,"State Taxes":0.03,"Insurance":0.1} {"street":"300 Obscure Dr.","city":"Obscuria","state":"IL","zip":60100} US CA
Boss Man 200000.0 ["John Doe","Fred Finance"] {"Federal Taxes":0.3,"State Taxes":0.07,"Insurance":0.05} {"street":"1 Pretentious Drive.","city":"Chicago","state":"IL","zip":60500} US CA
Fred Finance 150000.0 ["Stacy Accountant"] {"Federal Taxes":0.3,"State Taxes":0.07,"Insurance":0.05} {"street":"2 Pretentious Drive.","city":"Chicago","state":"IL","zip":60500} US CA
Stacy Accountant 60000.0 [] {"Federal Taxes":0.15,"State Taxes":0.03,"Insurance":0.1} {"street":"300 Main St.","city":"Naperville","state":"IL","zip":60563} US CA
Time taken: 0.093 seconds, Fetched: 7 row(s)
hive (mydb)> DESCRIBE employee;
OK
name string
salary float
subordinates array<string>
deductions map<string,float>
address struct<street:string,city:string,state:string,zip:int>
Implementing a simple UDF is straightforward: extend UDF and implement an evaluate() method.
package edu.wzm.hive;

import org.apache.hadoop.hive.ql.exec.Description;
import org.apache.hadoop.hive.ql.exec.UDF;

@Description(
    name = "hello",
    value = "_FUNC_(str) - from the input string "
        + "returns the value that is \"Hello $str\"",
    extended = "Example:\n"
        + " > SELECT _FUNC_(str) FROM src;"
)
public class HelloUDF extends UDF {
    // Hive locates evaluate() by reflection and calls it for every row.
    public String evaluate(String str) {
        // String concatenation cannot throw, so no try/catch is needed here.
        return "Hello " + str;
    }
}
After adding the jar file and creating the function hello, the query returns:
hive (mydb)> SELECT hello(name) FROM employee;
OK
Hello John Doe
Hello Mary Smith
Hello Todd Jones
Hello Bill King
Hello Boss Man
Hello Fred Finance
Hello Stacy Accountant
Time taken: 0.198 seconds, Fetched: 7 row(s)
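A basic UDF does not override a fixed method; Hive looks up a method named evaluate by reflection, which is why a class may offer several evaluate() overloads for different argument types. The sketch below imitates that lookup in plain Java under simplifying assumptions: Greeter and call() are hypothetical names, the lookup matches the exact argument class only, and Hive's real resolver also handles type conversion.

```java
import java.lang.reflect.Method;

final class EvaluateResolutionSketch {

    // Hypothetical UDF-like class with two evaluate() overloads (it does not extend Hive's UDF).
    public static final class Greeter {
        public String evaluate(String str) {
            return "Hello " + str;
        }

        public String evaluate(Integer id) {
            return "Hello #" + id;
        }
    }

    // Finds the public evaluate() overload whose parameter type is exactly the argument's class
    // and invokes it. Reflection failures are rethrown unchecked to keep the sketch short.
    static Object call(Object udf, Object argument) {
        try {
            Method method = udf.getClass().getMethod("evaluate", argument.getClass());
            return method.invoke(udf, argument);
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("no matching evaluate() method", e);
        }
    }

    public static void main(String[] args) {
        Greeter greeter = new Greeter();
        System.out.println(call(greeter, "John Doe"));
        System.out.println(call(greeter, 7));
    }
}
```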
3.GenericUDF
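Implementing a GenericUDF is more involved. The class must extend GenericUDF, work with Object Inspectors, and check the number and types of the arguments it receives. A GenericUDF has to implement the following three methods: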
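// Called once, before evaluate(). It receives an array of ObjectInspectors, checks that the number and types of the arguments are correct, and returns the ObjectInspector that describes the result.
abstract ObjectInspector initialize(ObjectInspector[] arguments) throws UDFArgumentException;
// Similar to evaluate() in a basic UDF: it processes the actual argument values and returns the final result.
abstract Object evaluate(GenericUDF.DeferredObject[] arguments) throws HiveException;
// Returns the string Hive uses to display this function, for example in EXPLAIN output. Whatever string this method returns is what gets printed.
abstract String getDisplayString(String[] children);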
The following GenericUDF checks whether an array (list) contains a given element:
import org.apache.hadoop.hive.ql.exec.UDFArgumentException;
import org.apache.hadoop.hive.ql.exec.UDFArgumentLengthException;
import org.apache.hadoop.hive.ql.metadata.HiveException;
import org.apache.hadoop.hive.ql.udf.generic.GenericUDF;
import org.apache.hadoop.hive.serde2.objectinspector.ListObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.StringObjectInspector;

public class ComplexUDFExample extends GenericUDF {
    private ListObjectInspector listOI;
    private StringObjectInspector elementsOI;
    private StringObjectInspector argOI;

    @Override
    public String getDisplayString(String[] children) {
        return "arrayContainsExample()"; // this should probably be better
    }

    @Override
    public ObjectInspector initialize(ObjectInspector[] arguments) throws UDFArgumentException {
        if (arguments.length != 2) {
            throw new UDFArgumentLengthException("arrayContainsExample only takes 2 arguments: List<T>, T");
        }
        // 1. Check we received the right object types.
        ObjectInspector a = arguments[0];
        ObjectInspector b = arguments[1];
        if (!(a instanceof ListObjectInspector) || !(b instanceof StringObjectInspector)) {
            throw new UDFArgumentException("first argument must be a list / array, second argument must be a string");
        }
        this.listOI = (ListObjectInspector) a;
        this.argOI = (StringObjectInspector) b;
        // 2. Check that the list contains strings before casting the element inspector.
        if (!(listOI.getListElementObjectInspector() instanceof StringObjectInspector)) {
            throw new UDFArgumentException("first argument must be a list of strings");
        }
        this.elementsOI = (StringObjectInspector) listOI.getListElementObjectInspector();
        // The return type of our function is a boolean, so we provide the matching object inspector.
        return PrimitiveObjectInspectorFactory.javaBooleanObjectInspector;
    }

    @Override
    public Object evaluate(DeferredObject[] arguments) throws HiveException {
        // The raw objects may be Lazy types (e.g. LazyString); never cast them to Java types directly.
        Object list = arguments[0].get();
        Object rawArg = arguments[1].get();
        if (list == null || rawArg == null) {
            return Boolean.FALSE; // e.g. subordinates[0] on an empty array is NULL
        }
        // Convert to Java types through the object inspectors saved in initialize().
        String arg = argOI.getPrimitiveJavaObject(rawArg);
        int elemNum = listOI.getListLength(list);
        // See if our list contains the value we need.
        for (int i = 0; i < elemNum; i++) {
            String element = elementsOI.getPrimitiveJavaObject(listOI.getListElement(list, i));
            if (arg != null && arg.equals(element)) {
                return Boolean.TRUE;
            }
        }
        return Boolean.FALSE;
    }
}
Note: in Hive 1.0.1, and probably in later versions as well, the objects that evaluate() receives are typically Lazy types (package org.apache.hadoop.hive.serde2.lazy, for example LazyString) rather than Java types. Casting them directly to String or List<String> fails; they have to be converted to Java types through the Object Inspectors, as the code above does. For the error itself and its fix, see item 5 of the Hive error collection (Hive报错集锦).
After adding the jar file and creating the function contains, the query returns:
hive (mydb)> select contains(subordinates, subordinates[0]), subordinates from employee;
OK
true ["Mary Smith","Todd Jones"]
true ["Bill King"]
false []
false []
true ["John Doe","Fred Finance"]
true ["Stacy Accountant"]
false []
Time taken: 0.169 seconds, Fetched: 7 row(s)
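The false rows in this output come from the empty arrays: in HiveQL, subordinates[0] on an empty array is NULL, so there is nothing to find. The sketch below reproduces the per-row logic in plain Java with no Hive dependency; contains() and firstOrNull() are illustrative helper names, and firstOrNull() only mimics the NULL produced by an out-of-range index.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

final class ContainsSemanticsSketch {

    // Plain-Java equivalent of the per-row check: true only when the needle is non-null
    // and equal to some element of the list.
    static boolean contains(List<String> list, String needle) {
        if (list == null || needle == null) {
            return false;
        }
        for (String element : list) {
            if (needle.equals(element)) {
                return true;
            }
        }
        return false;
    }

    // Mimics subordinates[0]: an out-of-range index yields null instead of throwing.
    static String firstOrNull(List<String> list) {
        return list.isEmpty() ? null : list.get(0);
    }

    public static void main(String[] args) {
        List<List<String>> rows = Arrays.asList(
            Arrays.asList("Mary Smith", "Todd Jones"),
            Arrays.asList("Bill King"),
            Collections.<String>emptyList());
        for (List<String> subordinates : rows) {
            System.out.println(contains(subordinates, firstOrNull(subordinates)) + "\t" + subordinates);
        }
    }
}
```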
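Now let us look back at how a GenericUDF works:
- The UDF is instantiated with its default constructor.
- initialize() is called with an array of Object Inspectors (here a ListObjectInspector and a StringObjectInspector).
  - It first checks the number of arguments (2) and their types;
  - saves the Object Inspectors (listOI, argOI, elementsOI) for use in evaluate();
  - returns an ObjectInspector (a BooleanObjectInspector) that tells Hive how to read the function's result.
- evaluate() is called for every row of the query (e.g. contains(subordinates, subordinates[0])).
  - It extracts the values through the stored Object Inspectors;
  - processes them and returns a value that matches the ObjectInspector returned by initialize() (e.g. list.contains(element) ? true : false).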
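The lifecycle in the list above can be imitated without Hive. The following is a simplified model, not Hive's API: MiniGenericUdf is a hypothetical interface that uses Java classes in place of Object Inspectors, and runQuery() plays the role of the engine by calling initialize() once and evaluate() once per row.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class GenericUdfLifecycleSketch {

    // Hypothetical, simplified stand-in for GenericUDF (assumption, not Hive's API).
    interface MiniGenericUdf {
        Class<?> initialize(Class<?>[] argumentTypes); // validate once, return the result type
        Object evaluate(Object[] arguments);           // called for every row
    }

    static final class ArrayContains implements MiniGenericUdf {
        int initializeCalls;
        int evaluateCalls;

        @Override
        public Class<?> initialize(Class<?>[] argumentTypes) {
            initializeCalls++;
            if (argumentTypes.length != 2) {
                throw new IllegalArgumentException("expected 2 arguments: List<T>, T");
            }
            if (!List.class.isAssignableFrom(argumentTypes[0]) || argumentTypes[1] != String.class) {
                throw new IllegalArgumentException("expected a list and a string");
            }
            return Boolean.class;
        }

        @Override
        public Object evaluate(Object[] arguments) {
            evaluateCalls++;
            List<?> list = (List<?>) arguments[0];
            Object needle = arguments[1];
            return needle != null && list.contains(needle);
        }
    }

    // Plays the engine: one initialize() call, then one evaluate() call per row.
    static List<Object> runQuery(MiniGenericUdf udf, Class<?>[] types, List<Object[]> rows) {
        udf.initialize(types);
        List<Object> results = new ArrayList<>();
        for (Object[] row : rows) {
            results.add(udf.evaluate(row));
        }
        return results;
    }

    public static void main(String[] args) {
        List<Object[]> rows = new ArrayList<>();
        rows.add(new Object[] { Arrays.asList("Mary Smith", "Todd Jones"), "Mary Smith" });
        rows.add(new Object[] { Arrays.asList("Bill King"), "Bill King" });
        rows.add(new Object[] { new ArrayList<String>(), null });
        ArrayContains udf = new ArrayContains();
        System.out.println(runQuery(udf, new Class<?>[] { List.class, String.class }, rows));
        System.out.println("initialize calls: " + udf.initializeCalls + ", evaluate calls: " + udf.evaluateCalls);
    }
}
```

Doing the argument checks in initialize() keeps them out of the per-row path, so evaluate() only has to read values and compute the result.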
The source code is hosted on GitHub: https://github.com/GatsbyNewton/hive_udf