1 What is a UDF
UDF stands for User-Defined Function. Beyond the functions Hive provides natively, we can define our own functions whenever the built-ins do not meet our current needs.
Besides UDFs, we can also define aggregate functions (UDAF) and table-generating functions (UDTF).
2 How to create a UDF
2.1 Write a Java class that extends UDF or GenericUDF
If the function returns a simple data type, extending UDF and implementing the evaluate method is enough. For more complex types, extend GenericUDF and implement the initialize, getDisplayString, and evaluate methods.
import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.hive.ql.exec.UDFArgumentException;
import org.apache.hadoop.io.Text;

public class UDFStripDoubleQuotes extends UDF {
    private static final String DOUBLE_QUOTES = "\"";
    private static final String BLANK_SYMBOL = "";

    public Text evaluate(Text text) throws UDFArgumentException {
        // compare against the underlying String; comparing a String to a Text is always false
        if (null == text || BLANK_SYMBOL.equals(text.toString())) {
            throw new UDFArgumentException("The function STRIP_DOUBLE_QUOTES(s) takes exactly 1 argument.");
        }
        String temp = text.toString().trim();
        if (temp.startsWith(DOUBLE_QUOTES) || temp.endsWith(DOUBLE_QUOTES)) {
            temp = temp.replace(DOUBLE_QUOTES, BLANK_SYMBOL);
        }
        return new Text(temp);
    }
}
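The evaluate logic above can be exercised without a Hive runtime. Below is a minimal plain-Java sketch of the same quote-stripping behaviour; the StripQuotes class name is made up for illustration and is not part of the Hive code above.

```java
public class StripQuotes {
    // Mirrors UDFStripDoubleQuotes.evaluate: reject null/empty, trim,
    // then drop every double quote if the value starts or ends with one.
    public static String strip(String input) {
        if (input == null || input.isEmpty()) {
            throw new IllegalArgumentException("takes exactly 1 non-empty argument");
        }
        String temp = input.trim();
        if (temp.startsWith("\"") || temp.endsWith("\"")) {
            temp = temp.replace("\"", "");
        }
        return temp;
    }
}
```

Note that single-quoted values pass through unchanged, which is exactly why 'CHICAGO' and 'BOSTON' survive in the test output further below.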
2.2 Compile the Java class and package it into a jar
2.3 Add the jar in Hive
hive (hadoop)> add jar /opt/data/UDFStripDoubleQuotes.jar;
Added /opt/data/UDFStripDoubleQuotes.jar to class path
Added resource: /opt/data/UDFStripDoubleQuotes.jar
2.4 Create a temporary or permanent function
2.4.1 Create a temporary function
Syntax: CREATE TEMPORARY FUNCTION strip_double_quotes
AS 'com.hive.udf.UDFStripDoubleQuotes';
Example:
create temporary function EncryptByMD5 as 'MD5.EncryptByMD5' using jar 'hdfs:///user/xx/hiveUDF/EncryptByMD5.jar';
2.4.2 Create a permanent function
CREATE FUNCTION [db_name.]function_name AS class_name
[USING JAR|FILE|ARCHIVE 'file_uri' [, JAR|FILE|ARCHIVE 'file_uri'] ];
In plain terms: put the jar on HDFS, specify which database the function belongs to, and follow it with a URI pointing at the HDFS location of the jar.
For example:
CREATE FUNCTION db_name.function_name AS 'com.hive.udf.LowerAndUpperUDF' USING JAR 'hdfs:/var/hive/udf/lowerOrUpper.jar';
Example:
Permanent (a permanent function must be qualified with a Hive database name):
create function xx.EncryptByMD5 as 'MD5.EncryptByMD5' using jar 'hdfs:///user/xx/hiveUDF/EncryptByMD5.jar';
(Note: without the database-name prefix the function is usable from every database; with the prefix it can only be called while that database is current, and calling it from another database raises an "invalid function" error.)
2.5 Test
Prepare the test data: /opt/data/quotes.txt
"10" "ACCOUNTING" "NEW YORK"
"20" "RESEARCH" "DALLAS"
"30" "SALES" 'CHICAGO'
"40" "OPERATIONS" 'BOSTON'
On the Hive side:
CREATE TABLE t_dept LIKE dept;
LOAD DATA LOCAL INPATH '/opt/data/quotes.txt' INTO TABLE t_dept;
SELECT strip_double_quotes(dname) name, strip_double_quotes(loc) loc FROM t_dept;
Result:
name loc
ACCOUNTING NEW YORK
RESEARCH DALLAS
SALES 'CHICAGO'
OPERATIONS 'BOSTON'
3 How to create a UDAF
Extend AbstractGenericUDAFResolver and implement a GenericUDAFEvaluator. The GenericUDAFEvaluator runs different methods depending on the stage of the job; Hive uses GenericUDAFEvaluator.Mode to determine which stage it is in.
So which stages are there?
PARTIAL1: from raw data to partial aggregation; calls iterate and terminatePartial --> map input to map output
PARTIAL2: from partial aggregation to partial aggregation; calls merge and terminatePartial --> map output to reduce input
FINAL: from partial aggregation to full aggregation; calls merge and terminate --> reduce input to reduce output
COMPLETE: from raw data to full aggregation; calls iterate and terminate; map-only, no reduce stage
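To make the stage flow concrete, here is a plain-Java simulation of how a sum travels through PARTIAL1 (each mapper) and FINAL (the reducer). This only illustrates the data flow between the methods named above; class and method names are made up, and none of this is Hive API code.

```java
import java.util.Arrays;
import java.util.List;

public class ModeFlowDemo {
    // PARTIAL1: a "mapper" calls iterate per raw row, then terminatePartial emits a partial sum.
    public static double partial1(List<Double> rows) {
        double sum = 0;
        for (double row : rows) {
            sum += row;      // iterate
        }
        return sum;          // terminatePartial
    }

    // FINAL: the "reducer" calls merge per partial, then terminate yields the full sum.
    public static double fin(double... partials) {
        double sum = 0;
        for (double p : partials) {
            sum += p;        // merge
        }
        return sum;          // terminate
    }

    public static void main(String[] args) {
        double p1 = partial1(Arrays.asList(1.0, 2.0)); // mapper 1
        double p2 = partial1(Arrays.asList(3.0, 4.0)); // mapper 2
        System.out.println(fin(p1, p2));               // prints 10.0
    }
}
```

COMPLETE would correspond to calling partial1-style iteration and the final terminate in a single step, with no merge in between.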
A few caveats:
If the volume of data being aggregated is large, watch memory usage; out-of-memory errors are easy to hit.
Reuse objects as much as possible and avoid allocating new ones, to lighten the load on JVM garbage collection.
public class UDAFAdd extends AbstractGenericUDAFResolver {
    static final Log LOG = LogFactory.getLog(UDAFAdd.class.getName());

    @Override
    public GenericUDAFEvaluator getEvaluator(TypeInfo[] arguments) throws SemanticException {
        // check arguments length
        if (arguments.length != 1) {
            throw new UDFArgumentException("Exactly one argument is expected.");
        }
        // check that the argument's data type is primitive
        if (arguments[0].getCategory() != ObjectInspector.Category.PRIMITIVE) {
            throw new UDFArgumentException("Argument is not expected.");
        }
        switch (((PrimitiveTypeInfo) arguments[0]).getPrimitiveCategory()) {
            case BYTE:
            case SHORT:
            case INT:
            case LONG:
                return new UDAFAddLong();
            case FLOAT:
            case DOUBLE:
                return new UDAFAddDouble();
            default:
                throw new UDFArgumentException("Only numeric type arguments are accepted but "
                        + arguments[0].getTypeName() + " is passed.");
        }
    }
    public static class UDAFAddDouble extends GenericUDAFEvaluator {
        private PrimitiveObjectInspector inputOI;
        private DoubleWritable result;

        // invoked at the start of each stage
        @Override
        public ObjectInspector init(Mode mode, ObjectInspector[] arguments) throws HiveException {
            super.init(mode, arguments);
            // init the double result
            result = new DoubleWritable(0);
            inputOI = (PrimitiveObjectInspector) arguments[0];
            return PrimitiveObjectInspectorFactory.writableDoubleObjectInspector;
        }

        /**
         * Stores the running result during aggregation.
         * @author nickyzhang
         */
        static class AddDoubleAgg extends AbstractAggregationBuffer {
            boolean empty;
            double sum;

            @Override
            public int estimate() {
                return JavaDataModel.PRIMITIVES1 + JavaDataModel.PRIMITIVES2;
            }
        }

        /** Get a new aggregation object. */
        @Override
        public AggregationBuffer getNewAggregationBuffer() throws HiveException {
            AddDoubleAgg addDoubleAgg = new AddDoubleAgg();
            reset(addDoubleAgg);
            return addDoubleAgg;
        }

        /** Reset the aggregation; useful when reusing the same buffer. */
        @Override
        public void reset(AggregationBuffer agg) throws HiveException {
            AddDoubleAgg addDoubleAgg = (AddDoubleAgg) agg;
            addDoubleAgg.empty = true;
            addDoubleAgg.sum = 0;
        }

        /** Iterate through the original data. */
        @Override
        public void iterate(AggregationBuffer agg, Object[] arguments) throws HiveException {
            if (arguments.length != 1) {
                throw new UDFArgumentException("Just one argument expected!");
            }
            // pass the single argument, not the whole array
            this.merge(agg, arguments[0]);
        }

        /** Get the partial aggregation result. */
        @Override
        public Object terminatePartial(AggregationBuffer agg) throws HiveException {
            return terminate(agg);
        }

        /** The combiner or reducer merges the mapper output. */
        @Override
        public void merge(AggregationBuffer agg, Object partial) throws HiveException {
            if (partial == null) {
                return;
            }
            AddDoubleAgg addDoubleAgg = (AddDoubleAgg) agg;
            addDoubleAgg.empty = false;
            addDoubleAgg.sum += PrimitiveObjectInspectorUtils.getDouble(partial, inputOI);
        }

        /** Get the final aggregation result. */
        @Override
        public Object terminate(AggregationBuffer agg) throws HiveException {
            AddDoubleAgg addDoubleAgg = (AddDoubleAgg) agg;
            if (addDoubleAgg.empty) {
                return null;
            }
            result.set(addDoubleAgg.sum);
            return result;
        }
    }
    public static class UDAFAddLong extends GenericUDAFEvaluator {
        private PrimitiveObjectInspector inputOI;
        private LongWritable result;

        // invoked at the start of each stage
        @Override
        public ObjectInspector init(Mode mode, ObjectInspector[] arguments) throws HiveException {
            super.init(mode, arguments);
            // init the long result
            result = new LongWritable(0);
            inputOI = (PrimitiveObjectInspector) arguments[0];
            return PrimitiveObjectInspectorFactory.writableLongObjectInspector;
        }

        /**
         * Stores the running result during aggregation.
         * @author nickyzhang
         */
        static class AddLongAgg extends AbstractAggregationBuffer {
            boolean empty;
            long sum;

            @Override
            public int estimate() {
                return JavaDataModel.PRIMITIVES1 + JavaDataModel.PRIMITIVES2;
            }
        }

        /** Get a new aggregation object. */
        @Override
        public AggregationBuffer getNewAggregationBuffer() throws HiveException {
            AddLongAgg addLongAgg = new AddLongAgg();
            reset(addLongAgg);
            return addLongAgg;
        }

        /** Reset the aggregation; useful when reusing the same buffer. */
        @Override
        public void reset(AggregationBuffer agg) throws HiveException {
            AddLongAgg addLongAgg = (AddLongAgg) agg;
            addLongAgg.empty = true;
            addLongAgg.sum = 0;
        }

        /** Iterate through the original data. */
        @Override
        public void iterate(AggregationBuffer agg, Object[] arguments) throws HiveException {
            if (arguments.length != 1) {
                throw new UDFArgumentException("Just one argument expected!");
            }
            // pass the single argument, not the whole array
            this.merge(agg, arguments[0]);
        }

        /** Get the partial aggregation result. */
        @Override
        public Object terminatePartial(AggregationBuffer agg) throws HiveException {
            return terminate(agg);
        }

        /** The combiner or reducer merges the mapper output. */
        @Override
        public void merge(AggregationBuffer agg, Object partial) throws HiveException {
            if (partial == null) {
                return;
            }
            AddLongAgg addLongAgg = (AddLongAgg) agg;
            addLongAgg.empty = false;
            // use getLong for the long evaluator, not getDouble
            addLongAgg.sum += PrimitiveObjectInspectorUtils.getLong(partial, inputOI);
        }

        /** Get the final aggregation result. */
        @Override
        public Object terminate(AggregationBuffer agg) throws HiveException {
            AddLongAgg addLongAgg = (AddLongAgg) agg;
            if (addLongAgg.empty) {
                return null;
            }
            result.set(addLongAgg.sum);
            return result;
        }
    }
}
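The buffer logic in UDAFAddDouble (an empty flag plus a running sum, with terminate returning null when nothing was aggregated) can be checked in isolation. Here is a plain-Java mirror of that logic; AddDoubleBuffer is an illustrative name, and no Hive types are involved.

```java
public class AddDoubleBuffer {
    private boolean empty = true;
    private double sum = 0;

    // PARTIAL1/COMPLETE path: consume one raw value.
    public void iterate(double value) {
        merge(value);
    }

    // PARTIAL2/FINAL path: fold in a partial sum.
    public void merge(double partial) {
        empty = false;
        sum += partial;
    }

    // Returns null when no rows were seen, mirroring terminate() above.
    public Double terminate() {
        return empty ? null : sum;
    }
}
```

The null-on-empty behaviour matters: it is what makes SUM over zero rows return NULL in SQL rather than 0.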
4 How to create a UDTF
UDTFs are typically used for parsing work, such as parsing a URL and extracting its components. To write one, extend GenericUDTF.
public class UDTFEmail extends GenericUDTF {
    @Override
    public StructObjectInspector initialize(StructObjectInspector inspector) throws UDFArgumentException {
        if (inspector == null) {
            throw new UDFArgumentException("arguments is null");
        }
        List<? extends StructField> args = inspector.getAllStructFieldRefs();
        if (CollectionUtils.isEmpty(args) || args.size() != 1) {
            throw new UDFArgumentException("UDTF takes only one argument");
        }
        List<String> fields = new ArrayList<String>();
        fields.add("name");
        fields.add("email");
        List<ObjectInspector> fieldOIList = new ArrayList<ObjectInspector>();
        fieldOIList.add(PrimitiveObjectInspectorFactory.javaStringObjectInspector);
        fieldOIList.add(PrimitiveObjectInspectorFactory.javaStringObjectInspector);
        return ObjectInspectorFactory.getStandardStructObjectInspector(fields, fieldOIList);
    }

    @Override
    public void process(Object[] args) throws HiveException {
        if (ArrayUtils.isEmpty(args) || args.length != 1) {
            return;
        }
        String name = args[0].toString();
        String email = name + "@163.com";
        super.forward(new String[] { name, email });
    }

    @Override
    public void close() throws HiveException {
        super.forward(new String[] { "complete", "finish" });
    }
}
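The row-expansion rule inside process can likewise be isolated. A plain-Java sketch follows; EmailExpander is a made-up name, and the @163.com suffix simply comes from the example above.

```java
import java.util.Arrays;

public class EmailExpander {
    // One input row (a name) becomes one forwarded row with two columns: name and email.
    public static String[] expand(String name) {
        return new String[] { name, name + "@163.com" };
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(expand("nicky"))); // prints [nicky, nicky@163.com]
    }
}
```

In the real UDTF, process may call forward any number of times per input row, which is exactly what makes a table-generating function "one in, many out".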
5 Differences among UDF, UDAF, and UDTF
UDF: one row in, one row out
UDAF: many rows in, one row out; generally used for aggregation
UDTF: one row in, many rows out