Developing a Custom Hive UDAF

This article walks through developing a custom aggregate function (UDAF) in Hive, focusing on the implementation of an actionCount function that sums action counts for rows matching a given condition.

Hive supports user-defined functions. A UDAF (user-defined aggregate function) accepts multiple input rows and produces a single output row, which is why it is typically used together with GROUP BY, as in the sketch below.
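Built-in aggregates such as sum() and count() follow the same many-rows-in, one-row-out pattern. A minimal illustration (the emp table and its columns are hypothetical, not from this post):

-- One output row per department: the aggregate collapses
-- all salary values in each group into a single value.
SELECT dept, sum(salary) AS total_salary
FROM emp
GROUP BY dept;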

The best learning material is actually the example UDAFs that ship with Hive itself. I'm using Hive 0.10, so the corresponding examples can be found in that release's source tree.

The functional requirement here is:
actionCount(act_code, act_times, '1')
For each row in a group, if act_code == '1', add that row's act_times to the running sum.
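For example, given a hypothetical table user_actions(user_id, act_code, act_times), a call might look like this (table and column names are illustrative):

-- For each user, sum act_times over the rows where act_code = '1'.
SELECT user_id,
       actionCount(act_code, act_times, '1') AS matched_times
FROM user_actions
GROUP BY user_id;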

package hive.udaf;


import org.apache.hadoop.hive.ql.exec.UDAF;
import org.apache.hadoop.hive.ql.exec.UDAFEvaluator;

/**
 * 
 * It should be very easy to follow and can be used as an example for writing
 * new UDAFs.
 * 
 * Note that Hive internally uses a different mechanism (called GenericUDAF) to
 * implement built-in aggregation functions, which are harder to program but
 * more efficient.
 * 
 */
public final class ActionCount extends UDAF {

  /**
   * The internal state of the actionCount aggregation.
   * 
   * Note that this is only needed if the internal state cannot be represented
   * by a primitive.
   * 
   * The internal state can also contain fields with types like
   * ArrayList<String> and HashMap<String,Double> if needed.
   */
  public static class UDAFState {
    private long mCount;
    private long mSum;
  }

  /**
   * The actual class for doing the aggregation. Hive will automatically look
   * for all internal classes of the UDAF that implements UDAFEvaluator.
   */
  public static class ActionCountEvaluator implements UDAFEvaluator {

    UDAFState state;

    public ActionCountEvaluator() {
      super();
      state = new UDAFState();
      init();
    }

    /**
     * Reset the state of the aggregation.
     */
    public void init() {
      state.mSum = 0;
      state.mCount = 0;
    }

    /**
     * Iterate through one row of original data.
     * 
     * The number and types of the arguments must match how this UDAF is
     * called from the Hive command line.
     * 
     * This function should always return true.
     */
    public boolean iterate(String act_code, long act_times, String act_type) { // called once per input row
      // Guard against NULL act_code values to avoid a NullPointerException.
      if (act_code != null && act_code.equals(act_type)) {
        state.mSum += act_times;
        state.mCount++;
      }
      return true;
    }

    /**
     * Terminate a partial aggregation and return the state. If the state is a
     * primitive, just return primitive Java classes like Integer or String.
     */
    public UDAFState terminatePartial() { // hand the partial state to the next stage
      // Return null for an empty partial so merge() can simply skip it.
      return state.mCount == 0 ? null : state;
    }

    /**
     * Merge with a partial aggregation.
     * 
     * This function should always have a single argument which has the same
     * type as the return value of terminatePartial().
     */
    public boolean merge(UDAFState o) { // merge a partial aggregation from another task
      if (o != null) {
        state.mSum += o.mSum;
        state.mCount += o.mCount;
      }
      return true;
    }

    /**
     * Terminates the aggregation and returns the final result.
     */
    public long terminate() { // return the final result
      // Unlike SQL AVG (which yields NULL for zero rows), return 0
      // when no row in the group matched.
      return state.mCount == 0 ? 0 : state.mSum;
    }
  }

  private ActionCount() {
    // prevent instantiation
  }

}
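To deploy the function, package the class into a jar, add the jar to the Hive session, and register the class with CREATE TEMPORARY FUNCTION. The jar path below is a placeholder for your own build output:

-- The jar path is hypothetical; point it at your compiled jar.
ADD JAR /path/to/action-count-udaf.jar;

-- Register the UDAF under the name used in queries.
CREATE TEMPORARY FUNCTION actionCount AS 'hive.udaf.ActionCount';

After registering, the GROUP BY query shown earlier works as-is.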

Ultimately, the key to writing Hive UDAFs well is a deep understanding of the MapReduce execution model: iterate() and terminatePartial() typically run during map-side partial aggregation, while merge() and terminate() run on the reduce side.

Author: linger



