Hive Custom Functions (UDF, UDAF)

When Hive's built-in functions cannot meet your processing needs, you can write your own user-defined functions.

### UDF

A user-defined function (UDF) operates on a single record at a time.

Workflow for creating one (Demo01 below walks through it):
1. Write a Java class.
2. Extend the UDF class.
3. Implement one or more evaluate methods.
4. Package the class into a jar.
5. Run `add jar` in hive.
6. Create a temporary function in hive.
7. Call the function from HQL.

Demo01:
Define the Java class:

```java
package UDFDemo;

import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

public class UDFTest extends UDF {

    // No-argument overload: always returns true.
    public boolean evaluate() {
        return true;
    }

    // Returns true when the int argument is a non-negative even number.
    public boolean evaluate(int b) {
        if (b < 0) {
            return false;
        }
        return b % 2 == 0;
    }

    // String overload: parse the string, then apply the same check.
    public boolean evaluate(String a) {
        int b = Integer.parseInt(a);
        if (b < 0) {
            return false;
        }
        return b % 2 == 0;
    }

    // Text overload: Hive passes string columns as Hadoop Text objects.
    public boolean evaluate(Text a) {
        int b = Integer.parseInt(a.toString());
        if (b < 0) {
            return false;
        }
        return b % 2 == 0;
    }

    // Two-argument overload: true when the first value is greater than the second.
    public boolean evaluate(Text t1, Text t2) {
        if (t1 == null || t2 == null) {
            return false;
        }
        double d1 = Double.parseDouble(t1.toString());
        double d2 = Double.parseDouble(t2.toString());
        return d1 > d2;
    }

    // The same comparison for plain String arguments.
    public boolean evaluate(String t1, String t2) {
        if (t1 == null || t2 == null) {
            return false;
        }
        double d1 = Double.parseDouble(t1);
        double d2 = Double.parseDouble(t2);
        return d1 > d2;
    }
}
```

Package it as UDFTest.jar, run `add jar` in hive, and create a function named bigthan that references the class UDFDemo.UDFTest:

```sql
add jar /liguodong/UDFTest.jar;
create temporary function bigthan as 'UDFDemo.UDFTest';

select no,num,bigthan(no,num) from testudf;
```
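For illustration, suppose testudf has two string columns, no and num (the table and data below are hypothetical); bigthan then returns true exactly where no is numerically greater than num:

```sql
-- Hypothetical setup, for illustration only
create table testudf (no string, num string);
insert into testudf values ('10', '5'), ('3', '8');

select no, num, bigthan(no, num) from testudf;
-- expected:
-- 10   5   true
-- 3    8   false
```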

### UDAF

A UDAF (user-defined aggregation function) operates on a set of records.

Developing a UDAF typically involves two steps:
1. Write a resolver class. The resolver handles type checking and operator overloading.
2. Write an evaluator class. The evaluator implements the actual UDAF logic.

Typically, the top-level UDAF class acts as the resolver, and inside it a nested evaluator class extending org.apache.hadoop.hive.ql.udf.generic.GenericUDAFEvaluator implements the UDAF logic, as in the sketch below.
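A minimal sketch of that two-part structure, using a trivial row-counting aggregate. The class names (SkeletonUDAF, SkeletonEvaluator) are illustrative, not from any library; a real resolver would also type-check its arguments, and merge() here casts the partial value directly to LongWritable, which relies on the writable long object inspector returned from init():

```java
package UDAFDemo; // illustrative package

import org.apache.hadoop.hive.ql.metadata.HiveException;
import org.apache.hadoop.hive.ql.parse.SemanticException;
import org.apache.hadoop.hive.ql.udf.generic.AbstractGenericUDAFResolver;
import org.apache.hadoop.hive.ql.udf.generic.GenericUDAFEvaluator;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;
import org.apache.hadoop.hive.serde2.typeinfo.TypeInfo;
import org.apache.hadoop.io.LongWritable;

// Resolver: type-checks the call site and hands back an evaluator.
public class SkeletonUDAF extends AbstractGenericUDAFResolver {

    @Override
    public GenericUDAFEvaluator getEvaluator(TypeInfo[] info) throws SemanticException {
        // Argument-count and type checks would go here.
        return new SkeletonEvaluator();
    }

    // Evaluator: the nested class carrying the aggregation logic.
    public static class SkeletonEvaluator extends GenericUDAFEvaluator {

        // Per-group intermediate state.
        static class CountBuffer implements AggregationBuffer {
            long n;
        }

        @Override
        public ObjectInspector init(Mode m, ObjectInspector[] parameters) throws HiveException {
            super.init(m, parameters);
            // Both the partial and the final result are longs.
            return PrimitiveObjectInspectorFactory.writableLongObjectInspector;
        }

        @Override
        public AggregationBuffer getNewAggregationBuffer() throws HiveException {
            return new CountBuffer();
        }

        @Override
        public void reset(AggregationBuffer agg) throws HiveException {
            ((CountBuffer) agg).n = 0;
        }

        // Map side: consume one raw input row.
        @Override
        public void iterate(AggregationBuffer agg, Object[] parameters) throws HiveException {
            ((CountBuffer) agg).n++;
        }

        // Emit the partial aggregate for the combiner/reducer.
        @Override
        public Object terminatePartial(AggregationBuffer agg) throws HiveException {
            return new LongWritable(((CountBuffer) agg).n);
        }

        // Combine a partial aggregate produced by terminatePartial().
        @Override
        public void merge(AggregationBuffer agg, Object partial) throws HiveException {
            if (partial != null) {
                ((CountBuffer) agg).n += ((LongWritable) partial).get();
            }
        }

        // Emit the final result.
        @Override
        public Object terminate(AggregationBuffer agg) throws HiveException {
            return new LongWritable(((CountBuffer) agg).n);
        }
    }
}
```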

1. Implementing the resolver
The resolver traditionally implements org.apache.hadoop.hive.ql.udf.generic.GenericUDAFResolver2, but extending AbstractGenericUDAFResolver is recommended instead, since it insulates your code from future changes to the Hive interface.
The difference between GenericUDAFResolver and GenericUDAFResolver2 is that the latter lets the evaluator access more information about the call, such as the DISTINCT qualifier and the wildcard form FUNCTION(*).

2. Implementing the evaluator
All evaluators must extend the abstract class org.apache.hadoop.hive.ql.udf.generic.GenericUDAFEvaluator. The subclass implements its abstract methods to realize the UDAF logic.

Mode
This enum matters: it identifies which MapReduce stage the UDAF is currently running in. Understand Mode and you understand how a Hive UDAF executes.

```java
public static enum Mode {
    PARTIAL1,
    PARTIAL2,
    FINAL,
    COMPLETE
};
```

PARTIAL1: the map stage of the MapReduce job: from raw data to partial aggregates. Calls iterate() and terminatePartial().

PARTIAL2: the map-side Combiner stage, which merges map output on the map side: from partial aggregates to partial aggregates. Calls merge() and terminatePartial().

FINAL: the reduce stage: from partial aggregates to the fully aggregated result. Calls merge() and terminate().

COMPLETE: present when the job is map-only (no reduce stage), so the map side emits the final result directly: from raw data straight to full aggregation. Calls iterate() and terminate().

Flow without a Combiner (figure omitted)

Flow with a Combiner (figure omitted)

Methods called in each MapReduce stage:

| Stage    | Methods called                        |
|----------|---------------------------------------|
| Map      | init(), iterate(), terminatePartial() |
| Combiner | merge(), terminatePartial()           |
| Reduce   | init(), merge(), terminate()          |

#### Where to find the source

apache-hive-1.2.1-src\ql\src\java\org\apache\hadoop\hive\ql\udf\generic

For example, the source of the count function:

```java
package org.apache.hadoop.hive.ql.udf.generic;

import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.hadoop.hive.ql.exec.Description;
import org.apache.hadoop.hive.ql.exec.UDFArgumentException;
import org.apache.hadoop.hive.ql.metadata.HiveException;
import org.apache.hadoop.hive.ql.parse.SemanticException;
import org.apache.hadoop.hive.ql.util.JavaDataModel;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.LongObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;
import org.apache.hadoop.hive.serde2.typeinfo.TypeInfo;
import org.apache.hadoop.io.LongWritable;

/**
 * This class implements the COUNT aggregation function as in SQL.
 */
@Description(name = "count",
    value = "_FUNC_(*) - Returns the total number of retrieved rows, including "
          + "rows containing NULL values.\n"
          + "_FUNC_(expr) - Returns the number of rows for which the supplied "
          + "expression is non-NULL.\n"
          + "_FUNC_(DISTINCT expr[, expr...]) - Returns the number of rows for "
          + "which the supplied expression(s) are unique and non-NULL.")
public class GenericUDAFCount implements GenericUDAFResolver2 {

  private static final Log LOG = LogFactory.getLog(GenericUDAFCount.class.getName());

  @Override
  public GenericUDAFEvaluator getEvaluator(TypeInfo[] parameters)
      throws SemanticException {
    // This method implementation is preserved for backward compatibility.
    return new GenericUDAFCountEvaluator();
  }

  @Override
  public GenericUDAFEvaluator getEvaluator(GenericUDAFParameterInfo paramInfo)
  throws SemanticException {

    TypeInfo[] parameters = paramInfo.getParameters();

    if (parameters.length == 0) {
      if (!paramInfo.isAllColumns()) {
        throw new UDFArgumentException("Argument expected");
      }
      assert !paramInfo.isDistinct() : "DISTINCT not supported with *";
    } else {
      if (parameters.length > 1 && !paramInfo.isDistinct()) {
        throw new UDFArgumentException("DISTINCT keyword must be specified");
      }
      assert !paramInfo.isAllColumns() : "* not supported in expression list";
    }

    return new GenericUDAFCountEvaluator().setCountAllColumns(
        paramInfo.isAllColumns());
  }

  /**
   * GenericUDAFCountEvaluator.
   *
   */
  public static class GenericUDAFCountEvaluator extends GenericUDAFEvaluator {
    private boolean countAllColumns = false;
    private LongObjectInspector partialCountAggOI;
    private LongWritable result;

    @Override
    public ObjectInspector init(Mode m, ObjectInspector[] parameters)
    throws HiveException {
      super.init(m, parameters);
      if (mode == Mode.PARTIAL2 || mode == Mode.FINAL) {
        partialCountAggOI = (LongObjectInspector)parameters[0];
      }
      result = new LongWritable(0);
      return PrimitiveObjectInspectorFactory.writableLongObjectInspector;
    }

    private GenericUDAFCountEvaluator setCountAllColumns(boolean countAllCols) {
      countAllColumns = countAllCols;
      return this;
    }

    /** class for storing count value. */
    @AggregationType(estimable = true)
    static class CountAgg extends AbstractAggregationBuffer {
      long value;
      @Override
      public int estimate() { return JavaDataModel.PRIMITIVES2; }
    }

    @Override
    public AggregationBuffer getNewAggregationBuffer() throws HiveException {
      CountAgg buffer = new CountAgg();
      reset(buffer);
      return buffer;
    }

    @Override
    public void reset(AggregationBuffer agg) throws HiveException {
      ((CountAgg) agg).value = 0;
    }

    @Override
    public void iterate(AggregationBuffer agg, Object[] parameters)
      throws HiveException {
      // parameters == null means the input table/split is empty
      if (parameters == null) {
        return;
      }
      if (countAllColumns) {
        assert parameters.length == 0;
        ((CountAgg) agg).value++;
      } else {
        boolean countThisRow = true;
        for (Object nextParam : parameters) {
          if (nextParam == null) {
            countThisRow = false;
            break;
          }
        }
        if (countThisRow) {
          ((CountAgg) agg).value++;
        }
      }
    }

    @Override
    public void merge(AggregationBuffer agg, Object partial)
      throws HiveException {
      if (partial != null) {
        long p = partialCountAggOI.get(partial);
        ((CountAgg) agg).value += p;
      }
    }

    @Override
    public Object terminate(AggregationBuffer agg) throws HiveException {
      result.set(((CountAgg) agg).value);
      return result;
    }

    @Override
    public Object terminatePartial(AggregationBuffer agg) throws HiveException {
      return terminate(agg);
    }
  }
}
```

Demo02:
Deployment works the same way as for a UDF. This Java class counts the rows in which the first column's value is greater than the second's.

```java
package UDAFDemo;

import org.apache.hadoop.hive.ql.exec.UDFArgumentTypeException;
import org.apache.hadoop.hive.ql.metadata.HiveException;
import org.apache.hadoop.hive.ql.parse.SemanticException;
import org.apache.hadoop.hive.ql.udf.generic.AbstractGenericUDAFResolver;
import org.apache.hadoop.hive.ql.udf.generic.GenericUDAFEvaluator;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.PrimitiveObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorUtils;
import org.apache.hadoop.hive.serde2.typeinfo.TypeInfo;
import org.apache.hadoop.io.LongWritable;

public class UDAFTest extends AbstractGenericUDAFResolver {

    // Type checking: this UDAF takes exactly two arguments.
    @Override
    public GenericUDAFEvaluator getEvaluator(TypeInfo[] info)
            throws SemanticException {
        if (info.length != 2) {
            throw new UDFArgumentTypeException(info.length - 1,
                    "Exactly two arguments are expected.");
        }

        // Return the class that implements the aggregation logic.
        return new GenericEvaluate();
    }

    public static class GenericEvaluate extends GenericUDAFEvaluator {

        private LongWritable result;
        private PrimitiveObjectInspector inputOI1;
        private PrimitiveObjectInspector inputOI2;

        /**
         * init() runs in both the map and the reduce stages.
         * Map stage (PARTIAL1/COMPLETE): parameters matches the UDAF's
         * input arguments.
         * Reduce stage (PARTIAL2/FINAL): parameters has length 1, the
         * partial aggregate produced by terminatePartial().
         */
        @Override
        public ObjectInspector init(Mode m, ObjectInspector[] parameters)
                throws HiveException {
            super.init(m, parameters);

            // Holds the final (or partial) result.
            result = new LongWritable(0);

            inputOI1 = (PrimitiveObjectInspector) parameters[0];
            if (parameters.length > 1) {
                inputOI2 = (PrimitiveObjectInspector) parameters[1];
            }

            // The aggregate is a count, so expose it as a long.
            return PrimitiveObjectInspectorFactory.writableLongObjectInspector;
        }

        // Map stage: agg caches the running count.
        @Override
        public void iterate(AggregationBuffer agg, Object[] parameters)
                throws HiveException {
            if (parameters == null) {
                return;
            }
            assert (parameters.length == 2);
            if (parameters[0] == null || parameters[1] == null) {
                return;
            }

            double base = PrimitiveObjectInspectorUtils.getDouble(parameters[0], inputOI1);
            double tmp = PrimitiveObjectInspectorUtils.getDouble(parameters[1], inputOI2);

            if (base > tmp) {
                ((CountAgg) agg).count++;
            }
        }

        // Allocate an aggregation buffer; called once per map task.
        @Override
        public AggregationBuffer getNewAggregationBuffer() throws HiveException {
            CountAgg agg = new CountAgg();
            reset(agg);
            return agg;
        }

        // Buffer holding the intermediate count.
        public static class CountAgg implements AggregationBuffer {
            long count;
        }

        @Override
        public void reset(AggregationBuffer countagg) throws HiveException {
            CountAgg agg = (CountAgg) countagg;
            agg.count = 0;
        }

        // Emits the partial result after iterate() has run.
        @Override
        public Object terminatePartial(AggregationBuffer agg)
                throws HiveException {
            result.set(((CountAgg) agg).count);
            return result;
        }

        // Combiner/reduce stage: in these modes parameters[0] (and hence
        // inputOI1) is the object inspector of the partial count.
        @Override
        public void merge(AggregationBuffer agg, Object partial)
                throws HiveException {
            if (partial != null) {
                long p = PrimitiveObjectInspectorUtils.getLong(partial, inputOI1);
                ((CountAgg) agg).count += p;
            }
        }

        @Override
        public Object terminate(AggregationBuffer agg) throws HiveException {
            result.set(((CountAgg) agg).count);
            return result;
        }
    }
}
```
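Deployment mirrors Demo01. A sketch, with a hypothetical jar path and function name:

```sql
add jar /liguodong/UDAFTest.jar;
create temporary function countbigger as 'UDAFDemo.UDAFTest';

select countbigger(no, num) from testudf;
```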

### Permanent functions

Option 1: To make a custom function permanently available in Hive, modify the source: add the function class, then register it in ql/src/java/org/apache/hadoop/hive/ql/exec/FunctionRegistry.java with a call such as registerUDF("parse_url", UDFParseUrl.class, false); and rebuild Hive.
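For instance, registering the bigthan UDF from Demo01 this way would mean adding a line like the one below next to the existing registerUDF calls (a sketch: the registration helpers vary across Hive versions, and it assumes UDFDemo.UDFTest has been added to the Hive source tree):

```java
// In FunctionRegistry's static registration block (Hive 1.x); a sketch:
registerUDF("bigthan", UDFDemo.UDFTest.class, false);
```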

Option 2: Start hive with an init file, hive -i <file>, where the file contains the initialization statements (the add jar and create temporary function commands).

Option 3: Use a .hiverc file:
1. Put the jar in the Hive install directory or another directory of your choice.
2. The ${HIVE_HOME}/bin directory contains a hidden .hiverc file.
3. Add the initialization statements to that file:

vi .hiverc

```sql
add jar /liguodong/UDFTest.jar;
create temporary function bigthan as 'UDFDemo.UDFTest';
```

Hive will then execute the .hiverc file automatically each time it starts.
