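When Hive's built-in functions cannot meet your business requirements, consider writing a user-defined function.
UDF
User-defined function (UDF): operates on a single record at a time.
Steps to create a function
1. Write a custom Java class
2. Extend the UDF class
3. Implement one or more evaluate methods
4. Package the class as a jar
5. Run add jar in Hive
6. Create a temporary function in Hive
7. Use the function in HQL
Demo01:
Write a custom Java class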
package UDFDemo;

import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

public class UDFTest extends UDF {

    // No arguments: always true.
    public boolean evaluate() {
        return true;
    }

    // True when b is a non-negative even number.
    public boolean evaluate(int b) {
        return b >= 0 && b % 2 == 0;
    }

    // Same check for a string that holds an integer; null yields false.
    public boolean evaluate(String a) {
        if (a == null) {
            return false;
        }
        return evaluate(Integer.parseInt(a));
    }

    // Same check for a Hadoop Text value; null yields false.
    public boolean evaluate(Text a) {
        if (a == null) {
            return false;
        }
        return evaluate(Integer.parseInt(a.toString()));
    }

    // True when t1 is numerically greater than t2; null input yields false.
    public boolean evaluate(Text t1, Text t2) {
        if (t1 == null || t2 == null) {
            return false;
        }
        return evaluate(t1.toString(), t2.toString());
    }

    // String variant of the comparison above.
    public boolean evaluate(String t1, String t2) {
        if (t1 == null || t2 == null) {
            return false;
        }
        double d1 = Double.parseDouble(t1);
        double d2 = Double.parseDouble(t2);
        return d1 > d2;
    }
}
Package the class as UDFTest.jar.
Run add jar in Hive.
Create a function named bigthan in Hive; the class it refers to is UDFDemo.UDFTest.
add jar /liguodong/UDFTest.jar;
create temporary function bigthan as 'UDFDemo.UDFTest';
select no,num,bigthan(no,num) from testudf;
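Before packaging the jar, it can help to check the comparison logic locally. The following is a minimal sketch with no Hive dependency; the class name BigThanCheck and the sample rows are hypothetical, and the method only mirrors the two-argument evaluate(String, String) from the demo above.

```java
public class BigThanCheck {

    // Mirrors evaluate(String, String) from the demo: null input yields false,
    // otherwise the two values are compared as doubles.
    public static boolean bigThan(String t1, String t2) {
        if (t1 == null || t2 == null) {
            return false;
        }
        return Double.parseDouble(t1) > Double.parseDouble(t2);
    }

    public static void main(String[] args) {
        // Hypothetical sample rows standing in for (no, num) from testudf.
        String[][] rows = { {"3", "2"}, {"1", "5"}, {"4", null} };
        for (String[] row : rows) {
            System.out.println(row[0] + "," + row[1] + " -> " + bigThan(row[0], row[1]));
        }
    }
}
```

Comparing this output with the result of the select statement on the same rows is a quick way to confirm that the registered function is the one you expect.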
UDAF
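UDAF (user-defined aggregation function): operates on a set of records.
Developing a UDAF generally takes two steps.
The first is to write a resolver class; the resolver handles type checking and operator overloading.
The second is to write an evaluator class; the evaluator implements the actual UDAF logic.
Typically, the top-level UDAF class is the resolver, and the evaluator is written inside it as a nested class that extends org.apache.hadoop.hive.ql.udf.generic.GenericUDAFEvaluator and implements the UDAF logic.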
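I. Implementing the resolver
A resolver usually implements the interface org.apache.hadoop.hive.ql.udf.generic.GenericUDAFResolver2, but extending AbstractGenericUDAFResolver is recommended because it insulates your code from future changes to the Hive interfaces.
The difference between the GenericUDAFResolver and GenericUDAFResolver2 interfaces is that the latter lets the evaluator implementation access more information, such as the DISTINCT qualifier and the wildcard form FUNCTION(*).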
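II. Implementing the evaluator
All evaluators must extend the abstract class org.apache.hadoop.hive.ql.udf.generic.GenericUDAFEvaluator. Subclasses must implement its abstract methods to provide the UDAF logic.
Mode
This enum is important: it identifies which stage of MapReduce the UDAF is running in. Once you understand what each Mode means, you understand how a Hive UDAF runs.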
public static enum Mode{
PARTIAL1,
PARTIAL2,
FINAL,
COMPLETE
};
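PARTIAL1: the map stage of MapReduce. It goes from raw data to a partial aggregation and calls iterate() and terminatePartial().
PARTIAL2: the map-side Combiner stage, which merges map output on the map side. It goes from partial aggregation to partial aggregation and calls merge() and terminatePartial().
FINAL: the reduce stage of MapReduce. It goes from partial aggregations to the full aggregation and calls merge() and terminate().
COMPLETE: if this mode occurs, the job has a map stage only and no reduce stage, so the map side produces the final result directly. It goes from raw data straight to the full aggregation and calls iterate() and terminate().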
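The mapping from mode to method calls described above can be written down as a small lookup. This is only an illustrative sketch: the class ModeCalls and its nested enum are local stand-ins defined here, not the Hive GenericUDAFEvaluator.Mode enum itself.

```java
import java.util.Arrays;
import java.util.List;

public class ModeCalls {

    // Local stand-in for GenericUDAFEvaluator.Mode (same four constants).
    public enum Mode { PARTIAL1, PARTIAL2, FINAL, COMPLETE }

    // Returns the evaluator methods invoked in the given mode,
    // following the descriptions in the text.
    public static List<String> callsFor(Mode mode) {
        switch (mode) {
            case PARTIAL1:
                return Arrays.asList("iterate", "terminatePartial");
            case PARTIAL2:
                return Arrays.asList("merge", "terminatePartial");
            case FINAL:
                return Arrays.asList("merge", "terminate");
            case COMPLETE:
                return Arrays.asList("iterate", "terminate");
            default:
                throw new IllegalArgumentException("unknown mode: " + mode);
        }
    }

    public static void main(String[] args) {
        for (Mode m : Mode.values()) {
            System.out.println(m + " -> " + callsFor(m));
        }
    }
}
```

Modes that consume raw rows call iterate(), modes that consume partial results call merge(); modes that emit a partial result end with terminatePartial(), and modes that emit the final result end with terminate().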
[Figure: flow without a Combiner]
[Figure: flow with a Combiner]
MapReduce stage    Functions called
MAP                init(), iterate(), terminatePartial()
Combiner           merge(), terminatePartial()
REDUCE             init(), merge(), terminate()
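To make the call sequence concrete, here is a rough plain-Java simulation of a count aggregation moving through the map, Combiner, and reduce stages. Everything in it (the class CountFlow, the Buffer type, the way splits are modelled as lists) is an assumption made for illustration; it borrows only the method names from the evaluator API and does not use Hive.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class CountFlow {

    // Plain-Java stand-in for an aggregation buffer.
    static final class Buffer { long value; }

    static void iterate(Buffer agg, String row) { if (row != null) { agg.value++; } }
    static void merge(Buffer agg, long partial) { agg.value += partial; }
    static long terminatePartial(Buffer agg) { return agg.value; }
    static long terminate(Buffer agg) { return agg.value; }

    // MAP: iterate() over raw rows, then terminatePartial().
    static long mapTask(List<String> split) {
        Buffer agg = new Buffer();
        for (String row : split) { iterate(agg, row); }
        return terminatePartial(agg);
    }

    // Combiner: merge() partial results, then terminatePartial().
    static long combine(List<Long> partials) {
        Buffer agg = new Buffer();
        for (long p : partials) { merge(agg, p); }
        return terminatePartial(agg);
    }

    // REDUCE: merge() partial results, then terminate().
    static long reduce(List<Long> partials) {
        Buffer agg = new Buffer();
        for (long p : partials) { merge(agg, p); }
        return terminate(agg);
    }

    // Runs the whole flow; each inner list plays the role of one map split.
    public static long count(List<List<String>> splits, boolean useCombiner) {
        List<Long> partials = new ArrayList<>();
        for (List<String> split : splits) { partials.add(mapTask(split)); }
        if (useCombiner) {
            partials = Collections.singletonList(combine(partials));
        }
        return reduce(partials);
    }

    public static void main(String[] args) {
        List<List<String>> splits = Arrays.asList(
                Arrays.asList("a", null, "b"),
                Arrays.asList("c"));
        System.out.println(count(splits, false));
        System.out.println(count(splits, true));
    }
}
```

Both calls give the same count, which is the property a UDAF needs: merging partial results must produce the same answer whether or not a Combiner runs.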
Where to find the source
apache-hive-1.2.1-src\ql\src\java\org\apache\hadoop\hive\ql\udf\generic
For example, the source of the count function:
package org.apache.hadoop.hive.ql.udf.generic;

import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.hadoop.hive.ql.exec.Description;
import org.apache.hadoop.hive.ql.exec.UDFArgumentException;
import org.apache.hadoop.hive.ql.metadata.HiveException;
import org.apache.hadoop.hive.ql.parse.SemanticException;
import org.apache.hadoop.hive.ql.util.JavaDataModel;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.LongObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;
import org.apache.hadoop.hive.serde2.typeinfo.TypeInfo;
import org.apache.hadoop.io.LongWritable;

/**
 * This class implements the COUNT aggregation function as in SQL.
 */
@Description(name = "count",
    value = "_FUNC_(*) - Returns the total number of retrieved rows, including "
          + "rows containing NULL values.\n"
          + "_FUNC_(expr) - Returns the number of rows for which the supplied "
          + "expression is non-NULL.\n"
          + "_FUNC_(DISTINCT expr[, expr...]) - Returns the number of rows for "
          + "which the supplied expression(s) are unique and non-NULL.")
public class GenericUDAFCount implements GenericUDAFResolver2 {

  private static final Log LOG = LogFactory.getLog(GenericUDAFCount.class.getName());

  @Override
  public GenericUDAFEvaluator getEvaluator(TypeInfo[] parameters)
      throws SemanticException {
    return new GenericUDAFCountEvaluator();
  }

  @Override
  public GenericUDAFEvaluator getEvaluator(GenericUDAFParameterInfo paramInfo)
      throws SemanticException {
    TypeInfo[] parameters = paramInfo.getParameters();
    if (parameters.length == 0) {
      if (!paramInfo.isAllColumns()) {
        throw new UDFArgumentException("Argument expected");
      }
      assert !paramInfo.isDistinct() : "DISTINCT not supported with *";
    } else {
      if (parameters.length > 1 && !paramInfo.isDistinct()) {
        throw new UDFArgumentException("DISTINCT keyword must be specified");
      }
      assert !paramInfo.isAllColumns() : "* not supported in expression list";
    }
    return new GenericUDAFCountEvaluator().setCountAllColumns(
        paramInfo.isAllColumns());
  }

  /**
   * GenericUDAFCountEvaluator.
   *
   */
  public static class GenericUDAFCountEvaluator extends GenericUDAFEvaluator {
    private boolean countAllColumns = false;
    private LongObjectInspector partialCountAggOI;
    private LongWritable result;

    @Override
    public ObjectInspector init(Mode m, ObjectInspector[] parameters)
        throws HiveException {
      super.init(m, parameters);
      if (mode == Mode.PARTIAL2 || mode == Mode.FINAL) {
        partialCountAggOI = (LongObjectInspector)parameters[0];
      }
      result = new LongWritable(0);
      return PrimitiveObjectInspectorFactory.writableLongObjectInspector;
    }

    private GenericUDAFCountEvaluator setCountAllColumns(boolean countAllCols) {
      countAllColumns = countAllCols;
      return this;
    }

    /** class for storing count value. */
    @AggregationType(estimable = true)
    static class CountAgg extends AbstractAggregationBuffer {
      long value;
      @Override
      public int estimate() { return JavaDataModel.PRIMITIVES2; }
    }

    @Override
    public AggregationBuffer getNewAggregationBuffer() throws HiveException {
      CountAgg buffer = new CountAgg();
      reset(buffer);
      return buffer;
    }

    @Override
    public void reset(AggregationBuffer agg) throws HiveException {
      ((CountAgg) agg).value = 0;
    }

    @Override
    public void iterate(AggregationBuffer agg, Object[] parameters)
        throws HiveException {
      if (parameters == null) {
        return;
      }
      if (countAllColumns) {
        assert parameters.length == 0;
        ((CountAgg) agg).value++;
      } else {
        boolean countThisRow = true;
        for (Object nextParam : parameters) {
          if (nextParam == null) {
            countThisRow = false;
            break;
          }
        }
        if (countThisRow) {
          ((CountAgg) agg).value++;
        }
      }
    }

    @Override
    public void merge(AggregationBuffer agg, Object partial)
        throws HiveException {
      if (partial != null) {
        long p = partialCountAggOI.get(partial);
        ((CountAgg) agg).value += p;
      }
    }

    @Override
    public Object terminate(AggregationBuffer agg) throws HiveException {
      result.set(((CountAgg) agg).value);
      return result;
    }

    @Override
    public Object terminatePartial(AggregationBuffer agg) throws HiveException {
      return terminate(agg);
    }
  }
}
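Demo02:
The workflow is the same as for a UDF. This Java class counts the rows in which the value of the first column is greater than the value of the second column.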
package UDAFDemo;

import org.apache.hadoop.hive.ql.exec.UDFArgumentTypeException;
import org.apache.hadoop.hive.ql.metadata.HiveException;
import org.apache.hadoop.hive.ql.parse.SemanticException;
import org.apache.hadoop.hive.ql.udf.generic.AbstractGenericUDAFResolver;
import org.apache.hadoop.hive.ql.udf.generic.GenericUDAFEvaluator;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.PrimitiveObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorUtils;
import org.apache.hadoop.hive.serde2.typeinfo.TypeInfo;
import org.apache.hadoop.io.LongWritable;

public class UDAFTest extends AbstractGenericUDAFResolver {

    @Override
    public GenericUDAFEvaluator getEvaluator(TypeInfo[] info) // type information for the arguments
            throws SemanticException {
        if (info.length != 2) {
            throw new UDFArgumentTypeException(info.length - 1,
                    "Exactly two arguments are expected.");
        }
        return new GenericEvaluate();
    }

    public static class GenericEvaluate extends GenericUDAFEvaluator {
        private LongWritable result;
        private PrimitiveObjectInspector inputIO1;
        private PrimitiveObjectInspector inputIO2;

        /**
         * Map stage: the length of parameters matches the number of UDAF arguments.
         * Reduce stage: parameters has length 1 (the partial result).
         */
        @Override
        public ObjectInspector init(Mode m, ObjectInspector[] parameters)
                throws HiveException {
            super.init(m, parameters);
            result = new LongWritable(0);
            inputIO1 = (PrimitiveObjectInspector) parameters[0];
            if (parameters.length > 1) {
                inputIO2 = (PrimitiveObjectInspector) parameters[1];
            }
            // The result is a LongWritable, so the matching long inspector is
            // returned (the original returned the binary inspector by mistake).
            return PrimitiveObjectInspectorFactory.writableLongObjectInspector;
        }

        @Override
        public void iterate(AggregationBuffer agg, Object[] parameters) // agg holds the running result
                throws HiveException {
            // Skip rows with missing values before touching the array.
            if (parameters == null || parameters[0] == null || parameters[1] == null) {
                return;
            }
            assert (parameters.length == 2);
            double base = PrimitiveObjectInspectorUtils.getDouble(parameters[0], inputIO1);
            double tmp = PrimitiveObjectInspectorUtils.getDouble(parameters[1], inputIO2);
            if (base > tmp) {
                ((CountAgg) agg).count++;
            }
        }

        @Override
        public AggregationBuffer getNewAggregationBuffer() throws HiveException {
            CountAgg agg = new CountAgg();
            reset(agg);
            return agg;
        }

        public static class CountAgg implements AggregationBuffer {
            long count;
        }

        @Override
        public void reset(AggregationBuffer countagg) throws HiveException {
            CountAgg agg = (CountAgg) countagg;
            agg.count = 0;
        }

        @Override
        public Object terminatePartial(AggregationBuffer agg)
                throws HiveException {
            result.set(((CountAgg) agg).count);
            return result;
        }

        @Override
        public void merge(AggregationBuffer agg, Object partial)
                throws HiveException {
            if (partial != null) {
                // In the merge stages parameters[0] is the partial count,
                // so inputIO1 describes a long value here.
                long p = PrimitiveObjectInspectorUtils.getLong(partial, inputIO1);
                ((CountAgg) agg).count += p;
            }
        }

        @Override
        public Object terminate(AggregationBuffer agg) throws HiveException {
            result.set(((CountAgg) agg).count);
            return result;
        }
    }
}
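To know what result to expect from this UDAF, the same rule can be computed locally on a few sample rows. The sketch below is a plain-Java reference under stated assumptions: the class name GreaterCountReference and the sample data are hypothetical, and rows with a null in either column are skipped in the same way iterate() skips them.

```java
public class GreaterCountReference {

    // Counts rows whose first value is greater than the second.
    // Rows with a null in either position are ignored, as in iterate().
    public static long countGreater(Double[][] rows) {
        long count = 0;
        for (Double[] row : rows) {
            if (row[0] == null || row[1] == null) {
                continue;
            }
            if (row[0] > row[1]) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // Hypothetical sample data: two rows qualify, one does not, one has a null.
        Double[][] rows = { {3.0, 2.0}, {1.0, 5.0}, {4.0, null}, {7.0, 6.0} };
        System.out.println(countGreater(rows));
    }
}
```

If the UDAF is registered and run over a table holding the same rows, its result should match this number.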
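Permanent functions
Option 1: if you want a custom function that is permanently available in Hive, add the function class to the Hive source, then modify the class ql/src/java/org/apache/hadoop/hive/ql/exec/FunctionRegistry.java to add the matching registration call, for example registerUDF("parse_url",UDFParseUrl.class,false);
Option 2: hive -i 'file'
Option 3: create a .hiverc file
1. Put the jar in the installation directory or in another directory of your choice.
2. The ${HIVE_HOME}/bin directory holds a file named .hiverc; it is a hidden file.
3. Add the initialization statements to that file:
vi .hiverc
add jar /liguodong/UDFTest.jar;
create temporary function bigthan as 'UDFDemo.UDFTest';
From then on, Hive runs the .hiverc file automatically each time it starts.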