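There are two different interfaces you can implement to write a Hive UDF. One is very simple; the other is considerably more involved.
You can use the simple API (org.apache.hadoop.hive.ql.exec.UDF) as long as your function reads and returns primitive types. Here, primitive types means the Hadoop and Hive writable types: Text, IntWritable, LongWritable, DoubleWritable, and so on.
If, however, you plan to write a UDF that operates on nested data structures such as Map, List, and Set, you should use org.apache.hadoop.hive.ql.udf.generic.GenericUDF, which is somewhat more complex.
- Simple API - org.apache.hadoop.hive.ql.exec.UDF
- Complex API - org.apache.hadoop.hive.ql.udf.generic.GenericUDF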
import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

public class SimpleUDFExample extends UDF {
  // Hive finds this method by reflection and calls it once per row.
  public Text evaluate(Text input) {
    return new Text("Hello " + input.toString());
  }
}
Full code: https://github.com/rathboma/hive-extension-examples
#Testing the simple UDF
import org.apache.hadoop.io.Text;
import org.junit.Assert;
import org.junit.Test;

public class SimpleUDFExampleTest {
  @Test
  public void testUDF() {
    SimpleUDFExample example = new SimpleUDFExample();
    Assert.assertEquals("Hello world", example.evaluate(new Text("world")).toString());
  }
}
#Also test it in Hive
%> mvn assembly:single
%> hive
hive> ADD JAR target/hive-extensions-1.0-SNAPSHOT-jar-with-dependencies.jar;
hive> CREATE TEMPORARY FUNCTION helloworld as 'com.matthewrathbone.example.SimpleUDFExample';
hive> select helloworld(name) from people limit 1000;
In fact, this UDF has a bug: it does not handle a null argument. Nulls are quite common in large datasets, so they have to be handled explicitly.
public class SimpleUDFExample extends UDF {
  public Text evaluate(Text input) {
    // pass nulls through instead of throwing a NullPointerException
    if (input == null) return null;
    return new Text("Hello " + input.toString());
  }
}
Then add a second test to confirm the fix:
// added to SimpleUDFExampleTest
@Test
public void testUDFNullCheck() {
  SimpleUDFExample example = new SimpleUDFExample();
  Assert.assertNull(example.evaluate(null));
}
Then run mvn test to confirm that everything passes.
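#The complex API
The org.apache.hadoop.hive.ql.udf.generic.GenericUDF API provides a way to work with objects that are not writable types, such as struct, map, and array types.
This API requires you to manually manage object inspectors for the function's arguments, and to verify the number and types of the arguments you receive. An object inspector provides a consistent interface for an underlying object type, so that different object implementations can all be accessed in the same way from within Hive (for example, you could implement a struct as a Map, as long as you provide a corresponding object inspector).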
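To make the object inspector idea a little more concrete, here is a miniature of it in plain Java. This is only a sketch: ToyListInspector, JavaListInspector, CsvListInspector, and InspectorSketch are hypothetical names invented for illustration and are not Hive classes (Hive's real counterpart is ListObjectInspector). It shows one caller reading two different underlying representations of a list through the same interface.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical miniature of the inspector idea; this is NOT Hive's interface.
interface ToyListInspector {
  int length(Object data);
  Object element(Object data, int index);
}

// Reads lists that are stored as java.util.List.
class JavaListInspector implements ToyListInspector {
  public int length(Object data) { return ((List<?>) data).size(); }
  public Object element(Object data, int index) { return ((List<?>) data).get(index); }
}

// Reads lists that are stored as a single comma-separated String.
class CsvListInspector implements ToyListInspector {
  public int length(Object data) { return ((String) data).split(",").length; }
  public Object element(Object data, int index) { return ((String) data).split(",")[index]; }
}

class InspectorSketch {
  // The caller never looks at the concrete representation, only at the inspector.
  static boolean contains(ToyListInspector inspector, Object data, Object wanted) {
    for (int i = 0; i < inspector.length(data); i++) {
      if (wanted.equals(inspector.element(data, i))) {
        return true;
      }
    }
    return false;
  }

  public static void main(String[] args) {
    System.out.println(contains(new JavaListInspector(), Arrays.asList("a", "b", "c"), "b")); // true
    System.out.println(contains(new CsvListInspector(), "a,b,c", "b")); // true
  }
}
```

The contains method is written once and works for both representations, which is roughly the property that lets Hive treat differently implemented objects uniformly.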
The API requires you to implement the following three methods:
// This is like the evaluate method of the simple API. It takes the actual arguments and returns the result.
public abstract Object evaluate(GenericUDF.DeferredObject[] arguments) throws HiveException;
// The exact value does not matter much, but it should be a string representation of the function.
public abstract String getDisplayString(String[] children);
// Called once, before any evaluate() calls. You receive an array of object inspectors that represent the arguments of the function.
// This is where you validate that the function is receiving the correct argument types and the correct number of arguments.
public abstract ObjectInspector initialize(ObjectInspector[] arguments) throws UDFArgumentException;
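This probably does not make much sense in the abstract, so let's work through an example.
#Example
I am going to write a function, containsString, that returns true if a list contains a given string. It takes two arguments: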
- A list of Strings A
- A String B
For example:
containsString(List("a", "b", "c"), "b"); // true
containsString(List("a", "b", "c"), "d"); // false
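Before involving Hive at all, the intended behaviour can be pinned down in plain Java. The sketch below is only an illustration: the class name ContainsStringSketch is hypothetical and it uses nothing from Hive. It assumes the same null convention as the simple UDF earlier (a null argument yields a null result), which is why it returns a boxed Boolean rather than a primitive.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical plain-Java helper, not part of Hive: it only models the intended semantics.
class ContainsStringSketch {

  // Returns null if either argument is null, otherwise whether the list contains the value.
  static Boolean containsString(List<String> list, String value) {
    if (list == null || value == null) {
      return null;
    }
    for (String s : list) {
      if (value.equals(s)) {
        return Boolean.TRUE;
      }
    }
    return Boolean.FALSE;
  }

  public static void main(String[] args) {
    List<String> letters = Arrays.asList("a", "b", "c");
    System.out.println(containsString(letters, "b")); // true
    System.out.println(containsString(letters, "d")); // false
    System.out.println(containsString(null, "b"));    // null
  }
}
```

The Hive version has to produce the same answers, but it reads its arguments through object inspectors instead of receiving a List<String> and a String directly.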
Unlike the UDF API, the GenericUDF API requires a little more boilerplate.
import org.apache.hadoop.hive.ql.exec.UDFArgumentException;
import org.apache.hadoop.hive.ql.exec.UDFArgumentLengthException;
import org.apache.hadoop.hive.ql.metadata.HiveException;
import org.apache.hadoop.hive.ql.udf.generic.GenericUDF;
import org.apache.hadoop.hive.serde2.objectinspector.ListObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.StringObjectInspector;

public class ComplexUDFExample extends GenericUDF {
  ListObjectInspector listOI;
  StringObjectInspector elementOI;

  @Override
  public String getDisplayString(String[] arg0) {
    return "arrayContainsExample()"; // this should probably be better
  }

  @Override
  public ObjectInspector initialize(ObjectInspector[] arguments) throws UDFArgumentException {
    if (arguments.length != 2) {
      throw new UDFArgumentLengthException("arrayContainsExample only takes 2 arguments: List<T>, T");
    }
    // 1. Check we received the right object types.
    ObjectInspector a = arguments[0];
    ObjectInspector b = arguments[1];
    if (!(a instanceof ListObjectInspector) || !(b instanceof StringObjectInspector)) {
      throw new UDFArgumentException("first argument must be a list / array, second argument must be a string");
    }
    this.listOI = (ListObjectInspector) a;
    this.elementOI = (StringObjectInspector) b;
    // 2. Check that the list contains strings
    if (!(listOI.getListElementObjectInspector() instanceof StringObjectInspector)) {
      throw new UDFArgumentException("first argument must be a list of strings");
    }
    // the return type of our function is a boolean, so we provide the correct object inspector
    return PrimitiveObjectInspectorFactory.javaBooleanObjectInspector;
  }

  @Override
  public Object evaluate(DeferredObject[] arguments) throws HiveException {
    // get the raw list and the string from the deferred objects
    Object list = arguments[0].get();
    String arg = elementOI.getPrimitiveJavaObject(arguments[1].get());
    // check for nulls
    if (list == null || arg == null) {
      return null;
    }
    // read each element through its inspector: at query time the elements
    // may be Text objects rather than Java Strings, so a plain cast is unsafe
    StringObjectInspector itemOI = (StringObjectInspector) listOI.getListElementObjectInspector();
    int length = listOI.getListLength(list);
    for (int i = 0; i < length; i++) {
      if (arg.equals(itemOI.getPrimitiveJavaObject(listOI.getListElement(list, i)))) {
        return Boolean.TRUE;
      }
    }
    return Boolean.FALSE;
  }
}
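#Code walkthrough
The call pattern for a UDF is as follows:
1. The UDF is instantiated using its default constructor.
2. udf.initialize() is called with the array of object inspectors for the UDF's arguments (ListObjectInspector, StringObjectInspector).
   - We check that we received exactly two arguments and that they are of the expected types.
   - We store the object inspectors for use in evaluate() (listOI, elementOI).
   - We return an object inspector that lets Hive read the function's result (BooleanObjectInspector).
3. evaluate() is called for each row of the query with the arguments provided (for example, evaluate(List("a", "b", "c"), "c")).
   - We extract the values using the stored object inspectors.
   - We run our logic and return a value that conforms to the object inspector returned from initialize() (list.contains(element) ? true : false).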
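The call order above can be mimicked in plain Java without Hive on the classpath. The following is a toy sketch, not Hive's actual machinery: ToyUdf, ToyContains, and UdfLifecycleSketch are hypothetical names, and Class objects stand in for object inspectors purely to keep the example self-contained. The run method plays the role of the query engine: it initializes the function once, then evaluates it once per row.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for a UDF; the real base class is Hive's GenericUDF.
interface ToyUdf {
  void initialize(Class<?>[] argumentTypes); // step 2: validate the arguments once
  Object evaluate(Object[] row);             // step 3: called once per row
}

class ToyContains implements ToyUdf {
  int initializeCalls = 0;

  @Override
  public void initialize(Class<?>[] argumentTypes) {
    initializeCalls++;
    if (argumentTypes.length != 2) {
      throw new IllegalArgumentException("expected 2 arguments: List, String");
    }
    if (!List.class.isAssignableFrom(argumentTypes[0]) || argumentTypes[1] != String.class) {
      throw new IllegalArgumentException("expected (List, String)");
    }
  }

  @Override
  public Object evaluate(Object[] row) {
    if (row[0] == null || row[1] == null) {
      return null;
    }
    return ((List<?>) row[0]).contains(row[1]);
  }
}

class UdfLifecycleSketch {
  // Plays the role of the query engine: initialize once, then evaluate per row.
  static List<Object> run(ToyUdf udf, Class<?>[] types, List<Object[]> rows) {
    udf.initialize(types);
    List<Object> results = new ArrayList<>();
    for (Object[] row : rows) {
      results.add(udf.evaluate(row));
    }
    return results;
  }

  public static void main(String[] args) {
    List<Object[]> rows = new ArrayList<>();
    rows.add(new Object[] {Arrays.asList("a", "b", "c"), "c"});
    rows.add(new Object[] {Arrays.asList("a", "b", "c"), "d"});
    rows.add(new Object[] {null, "d"});
    // prints [true, false, null]
    System.out.println(run(new ToyContains(), new Class<?>[] {List.class, String.class}, rows));
  }
}
```

Because validation happens in initialize, a wrong argument count or type fails once, before any row is processed, rather than on every call to evaluate.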
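#Testing
The only complicated part of testing this function is the initialization. Once the call order is clear and we know how to build the object inspectors, the rest is fairly straightforward.
Here is my test: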
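import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.hive.ql.metadata.HiveException;
import org.apache.hadoop.hive.ql.udf.generic.GenericUDF.DeferredJavaObject;
import org.apache.hadoop.hive.ql.udf.generic.GenericUDF.DeferredObject;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorFactory;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.JavaBooleanObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;
import org.junit.Assert;
import org.junit.Test;

public class ComplexUDFExampleTest {
  @Test
  public void testComplexUDFReturnsCorrectValues() throws HiveException {
    // set up the models we need
    ComplexUDFExample example = new ComplexUDFExample();
    ObjectInspector stringOI = PrimitiveObjectInspectorFactory.javaStringObjectInspector;
    ObjectInspector listOI = ObjectInspectorFactory.getStandardListObjectInspector(stringOI);
    JavaBooleanObjectInspector resultInspector = (JavaBooleanObjectInspector) example.initialize(new ObjectInspector[]{listOI, stringOI});

    // create the actual UDF arguments
    List<String> list = new ArrayList<String>();
    list.add("a");
    list.add("b");
    list.add("c");

    // test our results
    // the value exists
    Object result = example.evaluate(new DeferredObject[]{new DeferredJavaObject(list), new DeferredJavaObject("a")});
    Assert.assertEquals(true, resultInspector.get(result));

    // the value doesn't exist
    Object result2 = example.evaluate(new DeferredObject[]{new DeferredJavaObject(list), new DeferredJavaObject("d")});
    Assert.assertEquals(false, resultInspector.get(result2));

    // arguments are null
    Object result3 = example.evaluate(new DeferredObject[]{new DeferredJavaObject(null), new DeferredJavaObject(null)});
    Assert.assertNull(result3);
  }
}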
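Once again, the source code is here: https://github.com/rathboma/hive-extension-examples
#Finishing up
I hope this article helps you understand how to extend Hive with custom functions. Although I did not cover them here, there are also user-defined aggregate functions (UDAFs), which process many rows and summarize them into a single result. If you are interested in learning more about this topic, see https://www.amazon.com/Programming-Hive-Edward-Capriolo/dp/1449319335?tag=matratsblo-20
This article is translated from: http://blog.beekeeperdata.com/2013/08/10/guide-to-writing-hive-udfs.html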