Hive UDF Tutorial (Part 1)

Latest recommended article published on 2023-05-07 14:21:04

GatsbyNewton

Category: Hive. Tags: Hive, UDF, GenericUDF

Article link: https://blog.csdn.net/u010376788/article/details/50532166

1. Introduction to Hive UDFs

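In Hive, users can define their own functions to extend HiveQL; such functions are called UDFs (user-defined functions). Besides ordinary UDFs, which compute one value per row, there are two other major kinds: UDAFs (user-defined aggregate functions) and UDTFs (user-defined table-generating functions). Before covering UDAFs and UDTFs, this chapter introduces the simpler implementations, UDF and GenericUDF; the next chapter builds on them to cover UDAFs and UDTFs.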

Hive offers two different APIs for writing UDFs: the basic UDF API and the more complex GenericUDF API.

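org.apache.hadoop.hive.ql.exec.UDF: the basic UDF class. Its functions read and return primitive types, that is, the basic Hadoop and Hive types such as Text, IntWritable, LongWritable, and DoubleWritable.
org.apache.hadoop.hive.ql.udf.generic.GenericUDF: the more complex GenericUDF class, which can also handle complex types such as map, array (list), and struct.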

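The @Description annotation is optional and documents the function. The _FUNC_ string inside it stands for the function name and is replaced by the actual name when the DESCRIBE FUNCTION command is run. @Description has three attributes:

name: the name of the function in Hive.
value: describes the function and its arguments.
extended: additional notes, such as usage examples; printed by DESCRIBE FUNCTION EXTENDED name.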
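
The placeholder substitution described above can be sketched in plain Java. This is only an illustration under stated assumptions, not Hive's implementation: the @Doc annotation, the HelloDoc class, and the describe() helper are hypothetical stand-ins invented here so that the example runs without Hive on the classpath.

```java
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;

// Hypothetical stand-in for Hive's @Description; NOT the real annotation.
@Retention(RetentionPolicy.RUNTIME)
@interface Doc {
    String name();
    String value();
    String extended() default "";
}

// Hypothetical documented class; it only carries the annotation.
@Doc(
    name = "hello",
    value = "_FUNC_(str) - returns \"Hello \" followed by str",
    extended = "Example:\n  > SELECT _FUNC_(name) FROM employee;"
)
class HelloDoc {}

class DescribeSketch {
    // Builds the help text and substitutes the _FUNC_ placeholder
    // with the name the function was registered under.
    static String describe(Class<?> udfClass, String functionName, boolean extended) {
        Doc doc = udfClass.getAnnotation(Doc.class);
        if (doc == null) {
            return "no documentation available";
        }
        String text = doc.value();
        if (extended && !doc.extended().isEmpty()) {
            text = text + "\n" + doc.extended();
        }
        return text.replace("_FUNC_", functionName);
    }

    public static void main(String[] args) {
        // Short form, analogous to DESCRIBE FUNCTION hello
        System.out.println(describe(HelloDoc.class, "hello", false));
        // Long form, analogous to DESCRIBE FUNCTION EXTENDED hello
        System.out.println(describe(HelloDoc.class, "hello", true));
    }
}
```

The function name is passed in separately because the placeholder is meant to show whatever name the function is known by, which is why the annotation text never hard-codes it.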

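To use a UDF in Hive, compile the Java source and package it into a jar file, add the jar to the CLASSPATH, and finally use a CREATE FUNCTION statement to define a function backed by the Java class: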

hive> ADD jar /root/experiment/hive/hive-0.0.1-SNAPSHOT.jar;
hive> CREATE TEMPORARY FUNCTION hello AS "edu.wzm.hive.HelloUDF";
hive> DROP TEMPORARY FUNCTION IF EXISTS hello;

2. UDF

This chapter uses the following data:

hive (mydb)> SELECT * FROM employee;          
OK
John Doe	100000.0	["Mary Smith","Todd Jones"]	{"Federal Taxes":0.2,"State Taxes":0.05,"Insurance":0.1}	{"street":"1 Michigan Ave.","city":"Chicago","state":"IL","zip":60600}	US	CA
Mary Smith	80000.0	["Bill King"]	{"Federal Taxes":0.2,"State Taxes":0.05,"Insurance":0.1}	{"street":"100 Ontario St.","city":"Chicago","state":"IL","zip":60601}	US	CA
Todd Jones	70000.0	[]	{"Federal Taxes":0.15,"State Taxes":0.03,"Insurance":0.1}	{"street":"200 Chicago Ave.","city":"Oak Park","state":"IL","zip":60700}	US	CA
Bill King	60000.0	[]	{"Federal Taxes":0.15,"State Taxes":0.03,"Insurance":0.1}	{"street":"300 Obscure Dr.","city":"Obscuria","state":"IL","zip":60100}	US	CA
Boss Man	200000.0	["John Doe","Fred Finance"]	{"Federal Taxes":0.3,"State Taxes":0.07,"Insurance":0.05}	{"street":"1 Pretentious Drive.","city":"Chicago","state":"IL","zip":60500}	US	CA
Fred Finance	150000.0	["Stacy Accountant"]	{"Federal Taxes":0.3,"State Taxes":0.07,"Insurance":0.05}	{"street":"2 Pretentious Drive.","city":"Chicago","state":"IL","zip":60500}	US	CA
Stacy Accountant	60000.0	[]	{"Federal Taxes":0.15,"State Taxes":0.03,"Insurance":0.1}	{"street":"300 Main St.","city":"Naperville","state":"IL","zip":60563}	US	CA
Time taken: 0.093 seconds, Fetched: 7 row(s)
hive (mydb)> DESCRIBE employee;          
OK
name                	string              	                    
salary              	float               	                    
subordinates        	array<string>       	                    
deductions          	map<string,float>   	                    
address             	struct<street:string,city:string,state:string,zip:int>

Implementing a simple UDF is straightforward: extend UDF and implement an evaluate() method.

package edu.wzm.hive;

import org.apache.hadoop.hive.ql.exec.Description;
import org.apache.hadoop.hive.ql.exec.UDF;

@Description(
	name = "hello",
	value = "_FUNC_(str) - returns the string \"Hello \" followed by str",
	extended = "Example:\n"
		+ "  > SELECT _FUNC_(str) FROM src;"
)
public class HelloUDF extends UDF {

	// Hive locates evaluate() by reflection, so it carries no @Override.
	public String evaluate(String str) {
		// Return NULL for NULL input instead of the string "Hello null".
		if (str == null) {
			return null;
		}
		return "Hello " + str;
	}
}

After adding the jar file and creating the function hello, the query produces the following result:

hive (mydb)> SELECT hello(name) FROM employee;
OK
Hello John Doe
Hello Mary Smith
Hello Todd Jones
Hello Bill King
Hello Boss Man
Hello Fred Finance
Hello Stacy Accountant
Time taken: 0.198 seconds, Fetched: 7 row(s)
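
One detail behind "just implement evaluate()": the method is not declared in the UDF base class, so Hive has to find it by reflection from the argument types, which is also why a simple UDF may overload it. The following is a minimal sketch of that kind of lookup in plain Java, with no Hive dependency. GreetSketch and EvaluateLookupSketch are hypothetical names made up for this illustration, and the real resolver in Hive is more elaborate (it also handles type conversions).

```java
import java.lang.reflect.Method;

// Hypothetical UDF-like class: no Hive dependency, just overloaded evaluate() methods.
class GreetSketch {
    public String evaluate(String s) {
        return s == null ? null : "Hello " + s;
    }

    public String evaluate(Integer n) {
        return n == null ? null : "Hello #" + n;
    }
}

class EvaluateLookupSketch {
    // Finds the evaluate() overload whose single parameter type equals the
    // runtime class of the (non-null) argument, then invokes it.
    static Object call(Object udf, Object arg) {
        try {
            Method m = udf.getClass().getMethod("evaluate", arg.getClass());
            m.setAccessible(true); // the sketch class is package-private
            return m.invoke(udf, arg);
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("no matching evaluate() for " + arg.getClass(), e);
        }
    }

    public static void main(String[] args) {
        GreetSketch udf = new GreetSketch();
        System.out.println(call(udf, "John Doe")); // Hello John Doe
        System.out.println(call(udf, 7));          // Hello #7
    }
}
```

Because the lookup is by exact parameter type in this sketch, an argument with no matching overload (for example a Double) ends in an IllegalStateException rather than a silent fallback.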

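3. GenericUDF

A GenericUDF is more involved to implement. The class must extend GenericUDF, work with Object Inspectors, and check the types and number of the arguments it receives. A GenericUDF has to implement the following three methods: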

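// Called once, before any call to evaluate(). It receives an array of ObjectInspectors and checks that the argument types and the number of arguments are correct.
abstract ObjectInspector initialize(ObjectInspector[] arguments) throws UDFArgumentException;

// Similar to evaluate() in a simple UDF: it processes the actual argument values and returns the final result.
abstract Object evaluate(GenericUDF.DeferredObject[] arguments) throws HiveException;

// Returns the string that Hive displays for this function call, for example in EXPLAIN output. The text shown is whatever string this method returns.
abstract String getDisplayString(String[] children);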

The following GenericUDF checks whether an array or list contains a given element:

import org.apache.hadoop.hive.ql.exec.UDFArgumentException;
import org.apache.hadoop.hive.ql.exec.UDFArgumentLengthException;
import org.apache.hadoop.hive.ql.metadata.HiveException;
import org.apache.hadoop.hive.ql.udf.generic.GenericUDF;
import org.apache.hadoop.hive.serde2.objectinspector.ListObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.StringObjectInspector;

public class ComplexUDFExample extends GenericUDF {

  private ListObjectInspector listOI;
  private StringObjectInspector elementsOI;
  private StringObjectInspector argOI;

  @Override
  public String getDisplayString(String[] children) {
    // Shown by Hive when it prints this function call.
    return "arrayContainsExample(" + String.join(", ", children) + ")";
  }

  @Override
  public ObjectInspector initialize(ObjectInspector[] arguments) throws UDFArgumentException {
    if (arguments.length != 2) {
      throw new UDFArgumentLengthException("arrayContainsExample takes exactly 2 arguments: List<String>, String");
    }
    // 1. Check that we received the right object types.
    ObjectInspector a = arguments[0];
    ObjectInspector b = arguments[1];
    if (!(a instanceof ListObjectInspector) || !(b instanceof StringObjectInspector)) {
      throw new UDFArgumentException("first argument must be a list / array, second argument must be a string");
    }
    this.listOI = (ListObjectInspector) a;
    this.argOI = (StringObjectInspector) b;

    // 2. Check that the list contains strings BEFORE casting the element inspector.
    ObjectInspector elementOI = this.listOI.getListElementObjectInspector();
    if (!(elementOI instanceof StringObjectInspector)) {
      throw new UDFArgumentException("first argument must be a list of strings");
    }
    this.elementsOI = (StringObjectInspector) elementOI;

    // The function returns a boolean, so we provide the matching object inspector.
    return PrimitiveObjectInspectorFactory.javaBooleanObjectInspector;
  }

  @Override
  public Object evaluate(DeferredObject[] arguments) throws HiveException {
    // The raw objects may be lazy types (e.g. LazyString), so they are never cast
    // to Java types directly; the object inspectors perform the conversion.
    Object list = arguments[0].get();
    Object rawArg = arguments[1].get();
    if (list == null || rawArg == null) {
      return Boolean.FALSE; // e.g. subordinates[0] of an empty array is NULL
    }
    String arg = argOI.getPrimitiveJavaObject(rawArg);

    // See whether the list contains the value we are looking for.
    int elemNum = this.listOI.getListLength(list);
    for (int i = 0; i < elemNum; i++) {
      String element = elementsOI.getPrimitiveJavaObject(this.listOI.getListElement(list, i));
      if (arg.equals(element)) {
        return Boolean.TRUE;
      }
    }
    return Boolean.FALSE;
  }
}

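Note: in Hive 1.0.1 (and probably in later versions as well), the values that evaluate() receives can be lazy types from the org.apache.hadoop.hive.serde2.lazy package (for example LazyString) rather than plain Java types. They have to be converted to Java types through the matching Object Inspector before they are processed; casting them directly to a Java type raises an error. For the fix, see item 5 in the post "Hive Error Collection" (Hive报错集锦).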
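
The point of this note can be shown with a small, self-contained sketch. Everything in it is an assumption made for illustration: LazyTextSketch stands in for a lazy value such as LazyString, and TextInspectorSketch stands in for a string Object Inspector; neither is a Hive class, and Hive's real lazy types work differently in detail.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical stand-in for a lazy value: keeps raw bytes and decodes on demand.
final class LazyTextSketch {
    private final byte[] raw;

    LazyTextSketch(String s) {
        this.raw = s.getBytes(StandardCharsets.UTF_8);
    }

    String decode() {
        return new String(raw, StandardCharsets.UTF_8);
    }
}

// Hypothetical stand-in for a string Object Inspector: it knows how to turn
// whatever representation it is given into a Java String.
final class TextInspectorSketch {
    static String toJava(Object value) {
        if (value == null) {
            return null;
        }
        if (value instanceof LazyTextSketch) {
            return ((LazyTextSketch) value).decode();
        }
        if (value instanceof String) {
            return (String) value;
        }
        throw new IllegalArgumentException("unsupported representation: " + value.getClass());
    }
}

class LazyConversionSketch {
    // Returns true when a direct cast to String throws ClassCastException.
    static boolean directCastFails(Object value) {
        try {
            String s = (String) value;
            return false;
        } catch (ClassCastException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        Object fromRow = new LazyTextSketch("Bill King");
        System.out.println(directCastFails(fromRow));            // true
        System.out.println(TextInspectorSketch.toJava(fromRow)); // Bill King
    }
}
```

Going through the inspector also keeps the calling code independent of the representation: the same toJava() call works whether the value arrives lazy or already as a String.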

After adding the jar file and creating the function contains, the query produces the following result:

hive (mydb)> select contains(subordinates, subordinates[0]), subordinates from employee;
OK
true	["Mary Smith","Todd Jones"]
true	["Bill King"]
false	[]
false	[]
true	["John Doe","Fred Finance"]
true	["Stacy Accountant"]
false	[]
Time taken: 0.169 seconds, Fetched: 7 row(s)

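Now let's look back at the GenericUDF model:

The UDF is instantiated with its default constructor.
initialize() is called with an array of Object Inspectors (here a ListObjectInspector and a StringObjectInspector).

- It first checks the number of arguments (2) and their types;
- it saves the Object Inspectors (listOI, argOI, elementsOI) for use in evaluate();
- it returns an ObjectInspector (a BooleanObjectInspector) that tells Hive how to read the function's result.

evaluate() is called for every row of the query (e.g., contains(subordinates, subordinates[0])).

- It extracts the values using the stored Object Inspectors;
- it processes them and returns a value that matches the ObjectInspector returned by initialize() (e.g., list.contains(element) ? true : false).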
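
The call sequence summarized above (construct, initialize() once, evaluate() per row) can be imitated without Hive. In this sketch, plain Java classes play the role of Object Inspectors; MiniGenericUdf, MiniContains, and LifecycleSketch are hypothetical names invented for the example and are far simpler than the real GenericUDF API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical, much-simplified stand-in for GenericUDF:
// Class objects take the place of ObjectInspectors.
abstract class MiniGenericUdf {
    abstract Class<?> initialize(Class<?>[] argTypes);
    abstract Object evaluate(Object[] args);
}

class MiniContains extends MiniGenericUdf {
    private boolean initialized;

    @Override
    Class<?> initialize(Class<?>[] argTypes) {
        // Check the number of arguments, then their types.
        if (argTypes.length != 2) {
            throw new IllegalArgumentException("expects 2 arguments: list, string");
        }
        if (!List.class.isAssignableFrom(argTypes[0]) || argTypes[1] != String.class) {
            throw new IllegalArgumentException("expects (list, string)");
        }
        initialized = true;
        return Boolean.class; // declares the result type to the caller
    }

    @Override
    Object evaluate(Object[] args) {
        if (!initialized) {
            throw new IllegalStateException("initialize() must run before evaluate()");
        }
        List<?> list = (List<?>) args[0];
        Object needle = args[1];
        return list != null && needle != null && list.contains(needle);
    }
}

class LifecycleSketch {
    // Plays the engine's part: initialize once, then evaluate every row.
    static List<Object> run(MiniGenericUdf udf, List<Object[]> rows) {
        Class<?> resultType = udf.initialize(new Class<?>[] { List.class, String.class });
        List<Object> results = new ArrayList<>();
        for (Object[] row : rows) {
            results.add(resultType.cast(udf.evaluate(row)));
        }
        return results;
    }

    public static void main(String[] args) {
        List<Object[]> rows = new ArrayList<>();
        rows.add(new Object[] { Arrays.asList("Mary Smith", "Todd Jones"), "Mary Smith" });
        rows.add(new Object[] { new ArrayList<String>(), null });
        System.out.println(run(new MiniContains(), rows)); // [true, false]
    }
}
```

Doing the argument checks once in initialize() rather than in evaluate() keeps the per-row path short, which matters because evaluate() runs for every row of the query.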

The source code is hosted on GitHub: https://github.com/GatsbyNewton/hive_udf
