Hadoop Hive UDF Tutorial

Latest recommended article published on 2022-07-26 11:05:49

jiajiahebangbang


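This article explains in detail how to write user-defined functions (UDFs) in Hive. It covers two interfaces, the simple UDF (org.apache.hadoop.hive.ql.exec.UDF) and the more complex GenericUDF (org.apache.hadoop.hive.ql.udf.generic.GenericUDF), with implementation steps and an example for each. GenericUDF requires you to manage the input argument types manually through object inspectors, so that it can handle more complex data structures such as struct, map, and array. The article also includes test cases that help explain how a UDF is initialized and executed.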


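There are two different interfaces you can use to write a Hive UDF. One is very simple; the other is not.

The simple API (org.apache.hadoop.hive.ql.exec.UDF) can be used as long as your function reads and returns primitive types. Here, primitive types means the basic Hadoop and Hive Writable types: Text, IntWritable, LongWritable, DoubleWritable, and so on.

However, if you plan to write a UDF that operates on embedded data structures such as Map, List, and Set, you need to use org.apache.hadoop.hive.ql.udf.generic.GenericUDF, which is somewhat more involved.

Simple API - org.apache.hadoop.hive.ql.exec.UDF
Complex API - org.apache.hadoop.hive.ql.udf.generic.GenericUDF

Below I walk through an example UDF for each of the two interfaces, and provide the code and tests used along the way.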

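#The simple API

Writing a UDF with the simple API is only slightly more involved than writing a class with a single function (evaluate). Here is an example: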

import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

class SimpleUDFExample extends UDF {

  // Hive locates evaluate() by reflection and calls it once per row
  public Text evaluate(Text input) {
    return new Text("Hello " + input.toString());
  }
}

Full code: https://github.com/rathboma/hive-extension-examples

#Testing the simple UDF

Because the UDF is just a single function, you can test it with regular testing tools such as JUnit.

import org.apache.hadoop.io.Text;
import org.junit.Assert;
import org.junit.Test;

public class SimpleUDFExampleTest {
  @Test
  public void testUDF() {
    SimpleUDFExample example = new SimpleUDFExample();
    Assert.assertEquals("Hello world", example.evaluate(new Text("world")).toString());
  }
}

#Test in Hive as well

You should also test the UDF directly in Hive, especially if you are not sure that the function handles the right types.

%> mvn assembly:single
%> hive
hive> ADD JAR target/hive-extensions-1.0-SNAPSHOT-jar-with-dependencies.jar;
hive> CREATE TEMPORARY FUNCTION helloworld as 'com.matthewrathbone.example.SimpleUDFExample';
hive> select helloworld(name) from people limit 1000;

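In fact, this UDF has a bug: it does not handle a null argument. Nulls are very common in large data sets, so the function needs to guard against them.

To fix this, I added a simple null check: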

class SimpleUDFExample extends UDF {
  
  public Text evaluate(Text input) {
    if(input == null) return null;
    return new Text("Hello " + input.toString());
  }
}

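And I added a second test to SimpleUDFExampleTest to verify it:

  @Test
  public void testUDFNullCheck() {
    SimpleUDFExample example = new SimpleUDFExample();
    // a null input must produce a null result rather than a NullPointerException
    Assert.assertNull(example.evaluate(null));
  }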

Then run mvn test to confirm that everything passes.

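#The complex API

The org.apache.hadoop.hive.ql.udf.generic.GenericUDF API provides a way to work with objects that are not Writable types, for example struct, map, and array.

This API requires you to manually manage object inspectors for the function's arguments, and to verify the number and types of the arguments you receive. An object inspector provides a consistent interface to an underlying object type, so that different object implementations can all be accessed in the same way from within Hive (for example, you can implement a struct as a Map, as long as you provide a matching object inspector).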
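To make the object inspector idea more concrete, here is a dependency-free sketch in plain Java. It does not use Hive's classes: TextInspector, PlainStringInspector, Utf8BytesInspector, and InspectorSketch are hypothetical names invented for this illustration. It only shows how a single interface can give uniform access to values that are stored in different underlying representations.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical stand-in for the inspector idea; this is NOT Hive's ObjectInspector API.
interface TextInspector {
    // Convert an opaque stored value to a java.lang.String, or null if the value is null.
    String toJavaString(Object stored);
}

// Values stored directly as java.lang.String.
class PlainStringInspector implements TextInspector {
    @Override
    public String toJavaString(Object stored) {
        return (String) stored;
    }
}

// Values stored as UTF-8 encoded byte arrays.
class Utf8BytesInspector implements TextInspector {
    @Override
    public String toJavaString(Object stored) {
        return stored == null ? null : new String((byte[]) stored, StandardCharsets.UTF_8);
    }
}

class InspectorSketch {
    // The caller never looks at the storage format; it only talks to the inspector.
    static String greet(Object stored, TextInspector inspector) {
        String name = inspector.toJavaString(stored);
        return name == null ? null : "Hello " + name;
    }

    public static void main(String[] args) {
        System.out.println(greet("world", new PlainStringInspector()));
        byte[] raw = "world".getBytes(StandardCharsets.UTF_8);
        System.out.println(greet(raw, new Utf8BytesInspector()));
    }
}
```

Both calls print Hello world even though the two values are stored differently. In this sketch, greet plays the role of a UDF body that stays the same while the inspector hides the storage format.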

The API requires you to implement the following three methods:

// this is like the evaluate method of the simple API. It takes the actual arguments and returns the result
abstract Object evaluate(GenericUDF.DeferredObject[] arguments);

// Doesn't really matter, we can return anything, but should be a string representation of the function.
abstract String getDisplayString(String[] children);

// called once, before any evaluate() calls. You receive an array of object inspectors that represent the arguments of the function
// this is where you validate that the function is receiving the correct argument types, and the correct number of arguments.
abstract ObjectInspector initialize(ObjectInspector[] arguments);

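That probably does not make much sense in the abstract, so let's work through an example.

#Example

I will write a function, containsString, that takes two arguments: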

A list of Strings A
A String B

It returns true or false depending on whether list A contains string B.

For example:

containsString(List("a", "b", "c"), "b"); // true

containsString(List("a", "b", "c"), "d"); // false
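Before turning to the Hive API, the intended behavior can be pinned down in plain Java. The sketch below is an assumption about the semantics, based on the two examples above plus a rule that a null argument yields a null result; ContainsStringSketch is a hypothetical helper and not part of Hive.

```java
import java.util.Arrays;
import java.util.List;

// Plain-Java reference for the behavior we want from the Hive function (hypothetical helper).
class ContainsStringSketch {
    // Returns TRUE or FALSE, or null when either argument is null.
    static Boolean containsString(List<String> list, String value) {
        if (list == null || value == null) {
            return null;
        }
        for (String s : list) {
            if (value.equals(s)) {
                return Boolean.TRUE;
            }
        }
        return Boolean.FALSE;
    }

    public static void main(String[] args) {
        List<String> abc = Arrays.asList("a", "b", "c");
        System.out.println(containsString(abc, "b"));  // true
        System.out.println(containsString(abc, "d"));  // false
        System.out.println(containsString(null, "b")); // null
    }
}
```

Everything the GenericUDF version adds on top of this logic is type handling: validating the argument types once, and reading the values through object inspectors.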

Unlike the UDF API, the GenericUDF API requires a little more boilerplate.

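import java.util.List;

import org.apache.hadoop.hive.ql.exec.UDFArgumentException;
import org.apache.hadoop.hive.ql.exec.UDFArgumentLengthException;
import org.apache.hadoop.hive.ql.metadata.HiveException;
import org.apache.hadoop.hive.ql.udf.generic.GenericUDF;
import org.apache.hadoop.hive.serde2.objectinspector.ListObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.StringObjectInspector;

class ComplexUDFExample extends GenericUDF {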

  ListObjectInspector listOI;
  StringObjectInspector elementOI;

  @Override
  public String getDisplayString(String[] arg0) {
    return "arrayContainsExample()"; // this should probably be better
  }

  @Override
  public ObjectInspector initialize(ObjectInspector[] arguments) throws UDFArgumentException {
    if (arguments.length != 2) {
      throw new UDFArgumentLengthException("arrayContainsExample only takes 2 arguments: List<T>, T");
    }
    // 1. Check we received the right object types.
    ObjectInspector a = arguments[0];
    ObjectInspector b = arguments[1];
    if (!(a instanceof ListObjectInspector) || !(b instanceof StringObjectInspector)) {
      throw new UDFArgumentException("first argument must be a list / array, second argument must be a string");
    }
    this.listOI = (ListObjectInspector) a;
    this.elementOI = (StringObjectInspector) b;
    
    // 2. Check that the list contains strings
    if(!(listOI.getListElementObjectInspector() instanceof StringObjectInspector)) {
      throw new UDFArgumentException("first argument must be a list of strings");
    }
    
    // the return type of our function is a boolean, so we provide the correct object inspector
    return PrimitiveObjectInspectorFactory.javaBooleanObjectInspector;
  }
  
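  @Override
  public Object evaluate(DeferredObject[] arguments) throws HiveException {
    
    // get the list and string from the deferred objects using the object inspectors
    List<?> list = this.listOI.getList(arguments[0].get());
    String arg = elementOI.getPrimitiveJavaObject(arguments[1].get());
    
    // check for nulls
    if (list == null || arg == null) {
      return null;
    }
    
    // initialize() already verified that the list elements are strings; read each
    // element through its inspector, because inside Hive the elements may be Text
    // or lazy objects rather than java.lang.String
    StringObjectInspector listElementOI =
        (StringObjectInspector) this.listOI.getListElementObjectInspector();
    
    // see if our list contains the value we need
    for (Object element : list) {
      if (arg.equals(listElementOI.getPrimitiveJavaObject(element))) return Boolean.TRUE;
    }
    return Boolean.FALSE;
  }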
  
}

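#Code walkthrough
The call pattern for the UDF is as follows:

1. The UDF is initialized using a default constructor.

2. udf.initialize() is called with the array of object inspectors for the UDF's arguments (ListObjectInspector, StringObjectInspector).

We check that we received exactly two arguments and that they are of the expected types.

We store the object inspectors for use in evaluate() (listOI, elementOI).

We return an object inspector that lets Hive read the result of the function (BooleanObjectInspector).

3. evaluate() is called for each row of input with the arguments for that row (for example evaluate(List("a", "b", "c"), "c")).

We use the stored object inspectors to extract the values.

We run our logic and return a value that conforms to the object inspector returned from initialize() (list.contains(element) ? true : false).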
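The call order above can be mimicked without Hive. The following is a minimal sketch under stated assumptions: MiniUdf, MiniContains, and LifecycleSketch are hypothetical stand-ins written for this illustration, and argument types are represented as plain type-name strings instead of real object inspectors. It only demonstrates the pattern of constructing the function, validating once in initialize, and then evaluating once per row.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for the call pattern; this is NOT Hive's GenericUDF class.
abstract class MiniUdf {
    abstract String initialize(String[] argumentTypes); // returns the result type name
    abstract Object evaluate(Object[] row);
}

class MiniContains extends MiniUdf {
    int initializeCalls = 0;

    @Override
    String initialize(String[] argumentTypes) {
        initializeCalls++;
        // Validate argument count and types once, before any row is processed.
        if (argumentTypes.length != 2) {
            throw new IllegalArgumentException("expected 2 arguments: array<string>, string");
        }
        if (!"array<string>".equals(argumentTypes[0]) || !"string".equals(argumentTypes[1])) {
            throw new IllegalArgumentException("expected (array<string>, string)");
        }
        return "boolean";
    }

    @Override
    Object evaluate(Object[] row) {
        List<?> list = (List<?>) row[0];
        Object value = row[1];
        if (list == null || value == null) {
            return null;
        }
        return list.contains(value);
    }
}

class LifecycleSketch {
    // Mimics the engine: initialize once, then evaluate once per row.
    static Object[] run(MiniUdf udf, String[] argumentTypes, Object[][] rows) {
        udf.initialize(argumentTypes);
        Object[] results = new Object[rows.length];
        for (int i = 0; i < rows.length; i++) {
            results[i] = udf.evaluate(rows[i]);
        }
        return results;
    }

    public static void main(String[] args) {
        Object[][] rows = {
            {Arrays.asList("a", "b", "c"), "c"},
            {Arrays.asList("a", "b", "c"), "d"},
            {null, "a"}
        };
        Object[] out = run(new MiniContains(), new String[] {"array<string>", "string"}, rows);
        System.out.println(Arrays.toString(out)); // [true, false, null]
    }
}
```

Validating in initialize means a wrong call fails once, up front, instead of failing on every row.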

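#Testing

The only complex part of testing this function is the setup. Once the call order is clear and we know how to build the object inspectors, the rest is fairly easy.

My test looks like this: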

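import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.hive.ql.metadata.HiveException;
import org.apache.hadoop.hive.ql.udf.generic.GenericUDF.DeferredJavaObject;
import org.apache.hadoop.hive.ql.udf.generic.GenericUDF.DeferredObject;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorFactory;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.JavaBooleanObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;
import org.junit.Assert;
import org.junit.Test;

public class ComplexUDFExampleTest {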
  
  @Test
  public void testComplexUDFReturnsCorrectValues() throws HiveException {
    
    // set up the models we need
    ComplexUDFExample example = new ComplexUDFExample();
    ObjectInspector stringOI = PrimitiveObjectInspectorFactory.javaStringObjectInspector;
    ObjectInspector listOI = ObjectInspectorFactory.getStandardListObjectInspector(stringOI);
    JavaBooleanObjectInspector resultInspector = (JavaBooleanObjectInspector) example.initialize(new ObjectInspector[]{listOI, stringOI});
    
    // create the actual UDF arguments
    List<String> list = new ArrayList<String>();
    list.add("a");
    list.add("b");
    list.add("c");
    
    // test our results
    
    // the value exists
    Object result = example.evaluate(new DeferredObject[]{new DeferredJavaObject(list), new DeferredJavaObject("a")});
    Assert.assertEquals(true, resultInspector.get(result));
    
    // the value doesn't exist
    Object result2 = example.evaluate(new DeferredObject[]{new DeferredJavaObject(list), new DeferredJavaObject("d")});
    Assert.assertEquals(false, resultInspector.get(result2));
    
    // arguments are null
    Object result3 = example.evaluate(new DeferredObject[]{new DeferredJavaObject(null), new DeferredJavaObject(null)});
    Assert.assertNull(result3);
  }
}

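Again, the source code is here: https://github.com/rathboma/hive-extension-examples

#Finishing up

Hopefully this article has given you an idea of how to extend Hive with custom functions. Although I have not covered them here, there are also user-defined aggregate functions (UDAFs), which process many rows and combine them into a single result. If you are interested in learning more about the topic, see the book Programming Hive: https://www.amazon.com/Programming-Hive-Edward-Capriolo/dp/1449319335?tag=matratsblo-20

This article is a translation of: http://blog.beekeeperdata.com/2013/08/10/guide-to-writing-hive-udfs.html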