Hive自定义函数(UDF、UDAF)_genericudafevaluator-CSDN博客

当Hive提供的内置函数无法满足你的业务处理需要时，此时就可以考虑使用用户自定义函数。

UDF

用户自定义函数（user defined function)–针对单条记录。
创建函数流程
1、自定义一个Java类
2、继承UDF类
3、重写evaluate方法
4、打成jar包
6、在hive执行add jar方法
7、在hive执行创建模板函数
8、hql中使用

Demo01:
自定义一个Java类

package UDFDemo;

import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

public class UDFTest extends UDF{

    public boolean evaluate(){
        return true;
    }

    public boolean evaluate(int b){
        //int b=Integer.parseInt(a);
        if(b<0){
            return false;
        }
        if(b%2==0){
            return true;
        }else {
            return false;
        }

    }

    public boolean evaluate(String a){
        int b=Integer.parseInt(a);

        if(b<0){
            return false;
        }
        if(b%2==0){
            return true;
        }else {
            return false;
        }

    }

    public boolean evaluate(Text a){
        int b=Integer.parseInt(a.toString());

        if(b<0){
            return false;
        }
        if(b%2==0){
            return true;
        }else {
            return false;
        }

    }
    public boolean evaluate(Text t1,Text t2){
    //public boolean evaluate(String t1, String t2){     
         if(t1==null || t2 ==null){
             return false;
         }

         double d1 = Double.parseDouble(t1.toString());
         double d2 = Double.parseDouble(t2.toString());
        /* double d1 = Double.parseDouble(t1);
         double d2 = Double.parseDouble(t2);*/
         if(d1>d2){
             return true;
         }else{
             return false;
         }   
    }



    public boolean evaluate(String t1, String t2){   
         if(t1==null || t2 ==null){
             return false;
         }

         double d1 = Double.parseDouble(t1);
         double d2 = Double.parseDouble(t2);
         if(d1>d2){
             return true;
         }else{
             return false;
         }   
    }
}
 
 1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84

打成jar包UDFTest.jar
在hive执行add jar方法
在hive创建一个bigthan的函数，引入的类是UDF.UDFTest

add jar /liguodong/UDFTest.jar;
create temporary function bigthan as 'UDFDemo.UDFTest';
 
 1
2

select no,num,bigthan(no,num) from testudf;
 
 1

UDAF

UDAF(user defined aggregation function)用户自定义聚合函数，针对记录集合

开发UDAF通用有两个步骤
第一个是编写resolver类，resolver负责类型检查，操作符重载。
第二个是编写evaluator类，evaluator真正实现UDAF的逻辑
通常来说，顶层UDAF类继承
org.apache.hadoop.hive.ql.udf.generic.GenericUDAFEvaluator里面编写嵌套类evaluator实现UDAF的逻辑。

一、实现resolver
resolver通常继承
org.apache.hadoop.hive.ql.udf.GenericUDAFResolver2，但是更建议继承AbstractGenericUDAFResolver，隔离将来hive接口的变化。
GenericUDAFResolver和GenericUDAFResolver2接口的区别是后面的允许evaluator实现可以访问更多的信息，例如DISTINCT限定符，通配符FUNCTION(*)。

二、实现evaluator
所有eva1uators必须继承抽象类
org.apache.hadoop.hive.ql.udf.generic.GenericUDAFEvaluator。予类必须实现它的一些抽象方法，实现UDAF的逻辑。

Mode
这个类比较重要，它表示了udaf在mapreduce的各个阶段，理解Mode的含义，就理解了hive的UDAF的运行流程。

public static enum Mode{
    PARTIAL1,
    PARTIAL2,
    FINAL,
    COMPLETE
};
 
 1
2
3
4
5
6

PARTIAL1：这个是mapreduce的map阶段：从原始数据到部分数据聚合，将会调用iterate()和terminatePartial()。

PARTIAL2：这个是mapreduce的map端的Combiner阶段，负责在map端合并map的数据；从部分数据聚合到部分数据聚合，将会调用merge()和terminatePartial()。

FINAL：mapreduce的reduce阶段：从部分数据的聚合到完全聚合，将会调用merge()和terminate()。

COMPLETE：如果出现了这个阶段，表示mapreduce只有map，没有reduce，所以map端就直接出结果了；从原始数据直接到完全聚合，将会调用iterate()和terminate()

流程–无Combiner

流程–有Combiner

mapreduce阶段调用函数
MAP
init()
iterate()
terminatePartial()

Combiner
merge()
terminatePartial()

REDUCE
init()
merge()
terminate()

查看源码路径

apache-hive-1.2.1-src\ql\src\java\org\apache\hadoop\hive\ql\udf\generic

例如：关于count函数的源码

package org.apache.hadoop.hive.ql.udf.generic;

import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.hadoop.hive.ql.exec.Description;
import org.apache.hadoop.hive.ql.exec.UDFArgumentException;
import org.apache.hadoop.hive.ql.metadata.HiveException;
import org.apache.hadoop.hive.ql.parse.SemanticException;
import org.apache.hadoop.hive.ql.util.JavaDataModel;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.LongObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;
import org.apache.hadoop.hive.serde2.typeinfo.TypeInfo;
import org.apache.hadoop.io.LongWritable;

/**
 * This class implements the COUNT aggregation function as in SQL.
 */
@Description(name = "count",
    value = "_FUNC_(*) - Returns the total number of retrieved rows, including "
          +        "rows containing NULL values.\n"

          + "_FUNC_(expr) - Returns the number of rows for which the supplied "
          +        "expression is non-NULL.\n"

          + "_FUNC_(DISTINCT expr[, expr...]) - Returns the number of rows for "
          +        "which the supplied expression(s) are unique and non-NULL.")
public class GenericUDAFCount implements GenericUDAFResolver2 {

  private static final Log LOG = LogFactory.getLog(GenericUDAFCount.class.getName());

  @Override
  public GenericUDAFEvaluator getEvaluator(TypeInfo[] parameters)
      throws SemanticException {
    // This method implementation is preserved for backward compatibility.
    return new GenericUDAFCountEvaluator();
  }

  @Override
  public GenericUDAFEvaluator getEvaluator(GenericUDAFParameterInfo paramInfo)
  throws SemanticException {

    TypeInfo[] parameters = paramInfo.getParameters();

    if (parameters.length == 0) {
      if (!paramInfo.isAllColumns()) {
        throw new UDFArgumentException("Argument expected");
      }
      assert !paramInfo.isDistinct() : "DISTINCT not supported with *";
    } else {
      if (parameters.length > 1 && !paramInfo.isDistinct()) {
        throw new UDFArgumentException("DISTINCT keyword must be specified");
      }
      assert !paramInfo.isAllColumns() : "* not supported in expression list";
    }

    return new GenericUDAFCountEvaluator().setCountAllColumns(
        paramInfo.isAllColumns());
  }

  /**
   * GenericUDAFCountEvaluator.
   *
   */
  public static class GenericUDAFCountEvaluator extends GenericUDAFEvaluator {
    private boolean countAllColumns = false;
    private LongObjectInspector partialCountAggOI;
    private LongWritable result;

    @Override
    public ObjectInspector init(Mode m, ObjectInspector[] parameters)
    throws HiveException {
      super.init(m, parameters);
      if (mode == Mode.PARTIAL2 || mode == Mode.FINAL) {
        partialCountAggOI = (LongObjectInspector)parameters[0];
      }
      result = new LongWritable(0);
      return PrimitiveObjectInspectorFactory.writableLongObjectInspector;
    }

    private GenericUDAFCountEvaluator setCountAllColumns(boolean countAllCols) {
      countAllColumns = countAllCols;
      return this;
    }

    /** class for storing count value. */
    @AggregationType(estimable = true)
    static class CountAgg extends AbstractAggregationBuffer {
      long value;
      @Override
      public int estimate() { return JavaDataModel.PRIMITIVES2; }
    }

    @Override
    public AggregationBuffer getNewAggregationBuffer() throws HiveException {
      CountAgg buffer = new CountAgg();
      reset(buffer);
      return buffer;
    }

    @Override
    public void reset(AggregationBuffer agg) throws HiveException {
      ((CountAgg) agg).value = 0;
    }

    @Override
    public void iterate(AggregationBuffer agg, Object[] parameters)
      throws HiveException {
      // parameters == null means the input table/split is empty
      if (parameters == null) {
        return;
      }
      if (countAllColumns) {
        assert parameters.length == 0;
        ((CountAgg) agg).value++;
      } else {
        boolean countThisRow = true;
        for (Object nextParam : parameters) {
          if (nextParam == null) {
            countThisRow = false;
            break;
          }
        }
        if (countThisRow) {
          ((CountAgg) agg).value++;
        }
      }
    }

    @Override
    public void merge(AggregationBuffer agg, Object partial)
      throws HiveException {
      if (partial != null) {
        long p = partialCountAggOI.get(partial);
        ((CountAgg) agg).value += p;
      }
    }

    @Override
    public Object terminate(AggregationBuffer agg) throws HiveException {
      result.set(((CountAgg) agg).value);
      return result;
    }

    @Override
    public Object terminatePartial(AggregationBuffer agg) throws HiveException {
      return terminate(agg);
    }
  }
}
 
 1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150

Demo02:
执行过程与UDF类似，该Java、类的功能是第一列的值大于第二列计数加1。

package UDAFDemo;

import org.apache.hadoop.hive.ql.exec.UDFArgumentTypeException;
import org.apache.hadoop.hive.ql.metadata.HiveException;
import org.apache.hadoop.hive.ql.parse.SemanticException;
import org.apache.hadoop.hive.ql.udf.generic.AbstractGenericUDAFResolver;
import org.apache.hadoop.hive.ql.udf.generic.GenericUDAFEvaluator;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.PrimitiveObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorUtils;
import org.apache.hadoop.hive.serde2.typeinfo.TypeInfo;
import org.apache.hadoop.io.LongWritable;

public class UDAFTest extends AbstractGenericUDAFResolver{
    //判断
    @Override
    public GenericUDAFEvaluator getEvaluator(TypeInfo[] info)//字段的描述信息参数parameters
            throws SemanticException {
        if(info.length !=2){
            throw new UDFArgumentTypeException(info.length-1,
                    "Exactly two argument is expected.");   
        }   

        //返回处理逻辑的类
        return new GenericEvaluate();
    }

    public static class GenericEvaluate extends GenericUDAFEvaluator{

        private LongWritable result;
        private PrimitiveObjectInspector inputIO1;
        private PrimitiveObjectInspector inputIO2;

        //这个方法map与reduce阶段都需要执行
        /**
         * map阶段：parameters长度与udaf输入的参数个数有关
         * reduce阶段：parameters长度为1
         */
        //初始化
        @Override
        public ObjectInspector init(Mode m, ObjectInspector[] parameters)
                throws HiveException {
            super.init(m, parameters);

            //返回最终的结果
            result = new LongWritable(0);

            inputIO1 = (PrimitiveObjectInspector) parameters[0];
            if (parameters.length>1) {
                inputIO2 = (PrimitiveObjectInspector) parameters[1];
            }

            return PrimitiveObjectInspectorFactory.writableBinaryObjectInspector;
        }

        //map阶段
        @Override
        public void iterate(AggregationBuffer agg, Object[] parameters)//agg缓存结果值
                throws HiveException {

            assert(parameters.length==2);

            if(parameters==null || parameters[0]==null ||  parameters[1]==null){
                return;
            }

            double base = PrimitiveObjectInspectorUtils.getDouble(parameters[0], inputIO1);
            double tmp = PrimitiveObjectInspectorUtils.getDouble(parameters[1], inputIO2);

            if(base > tmp){
                ((CountAgg)agg).count++;
            }

        }

        //获得一个聚合的缓冲对象，每个map执行一次
        @Override
        public AggregationBuffer getNewAggregationBuffer() throws HiveException {

            CountAgg agg = new CountAgg();

            reset(agg);

            return agg;
        }

        //自定义类用于计数
        public static class CountAgg implements AggregationBuffer{
            long count;//计数，保存每次临时的结果
        }

        //重置
        @Override
        public void reset(AggregationBuffer countagg) throws HiveException {
            CountAgg agg = (CountAgg)countagg;
            agg.count=0;
        }

        //该方法当做iterate执行后，部分结果返回。
        @Override
        public Object terminatePartial(AggregationBuffer agg)
                throws HiveException {

            result.set(((CountAgg)agg).count);

            return result;
        }



        @Override
        public void merge(AggregationBuffer agg, Object partial)
                throws HiveException {
            if(partial != null){
                long p = PrimitiveObjectInspectorUtils.getLong(partial, inputIO1);
                ((CountAgg)agg).count += p;
            }
        }

        @Override
        public Object terminate(AggregationBuffer agg) throws HiveException {
            result.set(((CountAgg)agg).count); 
            return result;
        }   
    }
}
 
 1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127

永久函数

方式1、如果希望在hive中自定义一个函数，且能永久使用，
则修改源码添加相应的函数类，然后在修改ql/src/java/org/apache/hadoop/hive/ql/exec/FunctionRegistry.java类，添加相应的注册函数代码registerUDF("parse_url",UDFParseUrl.class,false);。

方式2、hive -i ‘file’

方式3、新建hiverc文件
1、jar包放到安装日录下或者指定目录下
2、${HIVE_HOME}/bin目录下有个.hiverc文件，它是隐藏文件。
3、把初始化语句加载到文件中

vi .hiverc
add jar /liguodong/UDFTest.jar;
create temporary function bigthan as 'UDFDemo.UDFTest';
 
 1
2
3

然后打开hive时，它会自动执行.hiverc文件。