1. Overview
1. UDF: one row in, one row out
2. UDTF: one row in, multiple rows out
3. UDAF: multiple rows in, one row out
When designing a custom function, consider:
a. What arguments will be passed into the function?
b. What result should it produce?
c. Keep the function generic and reusable.
2. UDTF Functions
2.1 How a UDTF Works
A custom UDTF can be created by extending the GenericUDTF abstract class and then implementing the
initialize, process, and possibly close methods. The initialize method is called by Hive to notify the
UDTF the argument types to expect. The UDTF must then return an object inspector corresponding to the
row objects that the UDTF will generate. Once initialize() has been called, Hive will give rows to the
UDTF using the process() method. While in process(), the UDTF can produce and forward rows to other
operators by calling forward(). Lastly, Hive will call the close() method when all the rows have
passed to the UDTF.
1. A custom UDTF extends the abstract class GenericUDTF.
2. It implements three methods: initialize, process, and close.
3. What each method does:
1) 'initialize()'
a. Validates the input arguments: Hive calls initialize() to tell the UDTF which argument types to expect, so this is where the arguments are checked (their number, their data types, and so on).
b. Declares the output schema: initialize() must return an object inspector corresponding to the row objects the UDTF will generate.
2) 'process()'
a. When it is called: after initialize() has run, i.e. once the input has been validated, Hive invokes this method.
b. What it does: process() is called once per input row. Inside process(), the row's data is traversed; if the column holds an array, each element of the array triggers one forward() call, emitting one output row per element.
3) 'close()'
a. When it is called: once all rows have been passed through the UDTF, Hive calls close().
b. What it does: releases resources.
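The call order above can be sketched in plain Java without any Hive dependency. This is an illustration of the lifecycle only: the method bodies below just record the call sequence, where a real UDTF would validate arguments, forward rows, and free resources.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the GenericUDTF call order: initialize() once,
// process() once per input row, close() once at the end.
public class UdtfLifecycle {
    static List<String> calls = new ArrayList<>();

    static void initialize() { calls.add("initialize"); }              // validate args, declare output schema
    static void process(String row) { calls.add("process(" + row + ")"); } // may forward 0..n rows
    static void close() { calls.add("close"); }                        // release resources

    public static void main(String[] args) {
        initialize();                                  // called once, first
        for (String row : new String[] {"r1", "r2"}) {
            process(row);                              // once per input row
        }
        close();                                       // called once, last
        System.out.println(calls);                     // prints [initialize, process(r1), process(r2), close]
    }
}
```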
2.2 Code Implementation
package cn.com.hive;

import org.apache.hadoop.hive.ql.exec.UDFArgumentException;
import org.apache.hadoop.hive.ql.metadata.HiveException;
import org.apache.hadoop.hive.ql.udf.generic.GenericUDTF;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorFactory;
import org.apache.hadoop.hive.serde2.objectinspector.StructField;
import org.apache.hadoop.hive.serde2.objectinspector.StructObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;
import org.json.JSONArray;

import java.util.ArrayList;
import java.util.List;

public class ExplodeJSONArray extends GenericUDTF {

    @Override
    public StructObjectInspector initialize(StructObjectInspector argOIs) throws UDFArgumentException {
        // Validate the argument count: exactly one argument is expected.
        List<? extends StructField> fieldRefs = argOIs.getAllStructFieldRefs();
        if (fieldRefs.size() != 1) {
            throw new UDFArgumentException("exactly one argument is required");
        }
        // Validate the argument type: it must be a string.
        ObjectInspector fieldObjectInspector = fieldRefs.get(0).getFieldObjectInspector();
        if (!"string".equals(fieldObjectInspector.getTypeName())) {
            throw new UDFArgumentException("wrong argument type: a string is required");
        }
        // Declare the output schema: a single string column named "actions".
        List<String> fieldNames = new ArrayList<String>();
        List<ObjectInspector> fieldOIs = new ArrayList<ObjectInspector>();
        fieldNames.add("actions");
        fieldOIs.add(PrimitiveObjectInspectorFactory.javaStringObjectInspector);
        return ObjectInspectorFactory.getStandardStructObjectInspector(fieldNames, fieldOIs);
    }

    @Override
    public void process(Object[] args) throws HiveException {
        // Called once per input row: parse the JSON array and emit one row per element.
        String json = args[0].toString();
        JSONArray jsonArray = new JSONArray(json);
        for (int i = 0; i < jsonArray.length(); i++) {
            String[] result = new String[1];
            result[0] = jsonArray.getString(i);
            forward(result);
        }
    }

    @Override
    public void close() throws HiveException {
        // No resources to release.
    }
}
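The effect of process() on a single row can be simulated with plain Java. The sketch below mimics the explode behavior with a naive comma split instead of org.json.JSONArray, so it is an illustration only: real JSON parsing must handle nesting, quoting, and escapes.

```java
import java.util.ArrayList;
import java.util.List;

public class ExplodeJsonArraySim {
    // Simulates what ExplodeJSONArray.process() does to one row:
    // each element of the array becomes one forwarded output row.
    public static List<String> explode(String jsonArray) {
        List<String> rows = new ArrayList<>();
        String body = jsonArray.substring(1, jsonArray.length() - 1); // strip [ and ]
        for (String element : body.split(",")) {
            rows.add(element.trim()); // stand-in for one forward() call
        }
        return rows;
    }

    public static void main(String[] args) {
        // One input row carrying a 3-element array -> three output rows.
        System.out.println(explode("[1,2,3]")); // prints [1, 2, 3]
    }
}
```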
3. UDF Functions
3.1 Code Implementation
package cn.com.hive.udf;

import org.apache.hadoop.hive.ql.exec.UDFArgumentException;
import org.apache.hadoop.hive.ql.metadata.HiveException;
import org.apache.hadoop.hive.ql.udf.generic.GenericUDF;
import org.apache.hadoop.hive.serde2.objectinspector.ConstantObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorFactory;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;
import org.json.JSONArray;
import org.json.JSONObject;

import java.util.ArrayList;
import java.util.List;

public class JsonArrayToStructArray extends GenericUDF {

    @Override
    public ObjectInspector initialize(ObjectInspector[] arguments) throws UDFArgumentException {
        // Expected call shape: json_array, key1, ..., keyN, "name1:type1", ..., "nameN:typeN"
        if (arguments.length < 3) {
            throw new UDFArgumentException("JsonArrayToStructArray requires at least 3 arguments");
        }
        for (ObjectInspector argument : arguments) {
            if (!"string".equals(argument.getTypeName())) {
                throw new UDFArgumentException("JsonArrayToStructArray only accepts string arguments");
            }
        }
        List<String> fieldNames = new ArrayList<>();
        List<ObjectInspector> fieldOIs = new ArrayList<>();
        // The second half of the arguments are constant "name:type" field specs.
        for (int i = (arguments.length + 1) / 2; i < arguments.length; i++) {
            if (!(arguments[i] instanceof ConstantObjectInspector)) {
                throw new UDFArgumentException("JsonArrayToStructArray field specs must be string constants");
            }
            String field = ((ConstantObjectInspector) arguments[i]).getWritableConstantValue().toString();
            String[] split = field.split(":");
            fieldNames.add(split[0]);
            switch (split[1]) {
                case "int":
                    fieldOIs.add(PrimitiveObjectInspectorFactory.javaIntObjectInspector);
                    break;
                case "string":
                    fieldOIs.add(PrimitiveObjectInspectorFactory.javaStringObjectInspector);
                    break;
                case "bigint":
                    fieldOIs.add(PrimitiveObjectInspectorFactory.javaLongObjectInspector);
                    break;
                default:
                    throw new UDFArgumentException("json_array_to_struct_array does not support type " + split[1]);
            }
        }
        // Output type: array<struct<name1:type1, ..., nameN:typeN>>
        return ObjectInspectorFactory
                .getStandardListObjectInspector(
                        ObjectInspectorFactory.getStandardStructObjectInspector(fieldNames, fieldOIs));
    }

    @Override
    public Object evaluate(DeferredObject[] arguments) throws HiveException {
        if (arguments[0].get() == null) {
            return null;
        }
        String strArray = arguments[0].get().toString();
        JSONArray jsonArray = new JSONArray(strArray);
        List<List<Object>> array = new ArrayList<>();
        for (int i = 0; i < jsonArray.length(); i++) {
            // Build one struct per JSON object, in the order of the key arguments.
            List<Object> struct = new ArrayList<>();
            JSONObject jsonObject = jsonArray.getJSONObject(i);
            for (int j = 1; j < (arguments.length + 1) / 2; j++) {
                String key = arguments[j].get().toString();
                struct.add(jsonObject.has(key) ? jsonObject.get(key) : null);
            }
            array.add(struct);
        }
        return array;
    }

    @Override
    public String getDisplayString(String[] children) {
        return getStandardDisplayString("json_array_to_struct_array", children);
    }
}
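The index arithmetic shared by initialize() and evaluate() is easier to see in isolation. The sketch below (plain Java, with hypothetical argument values) shows how `(n + 1) / 2` splits the argument list: index 0 is the JSON array column, indices 1 through (n + 1) / 2 - 1 are the keys to extract, and the remaining indices hold the constant "name:type" field specs.

```java
public class ArgLayoutDemo {
    // Boundary used by JsonArrayToStructArray: for n = 2k + 1 arguments
    // (1 JSON column + k keys + k field specs), (n + 1) / 2 = k + 1 is
    // the index of the first "name:type" spec.
    public static int specStart(int argCount) {
        return (argCount + 1) / 2;
    }

    public static void main(String[] args) {
        // Hypothetical call: json_array_to_struct_array(col, 'id', 'ts', 'id:bigint', 'ts:string')
        String[] callArgs = {"<json_array_column>", "id", "ts", "id:bigint", "ts:string"};
        int boundary = specStart(callArgs.length);   // 5 arguments -> boundary 3
        for (int i = boundary; i < callArgs.length; i++) {
            String[] split = callArgs[i].split(":"); // field name and Hive type
            System.out.println(split[0] + " -> " + split[1]);
        }
    }
}
```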