Hive Functions and Performance Optimization
Hive Function Categories
Categorized by input and output (an example of each category follows the list)
- Standard functions: take one or more columns from a single row as input and return a single value
- Aggregate functions: take zero or more columns from multiple rows as input and return a single value
- Table-generating functions: take zero or more inputs and return multiple columns or rows
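As a quick illustration, here is one standard Hive built-in from each category (src stands in for any table; expected results are in the trailing comments):
hive> select upper('hive');            -- standard: one row in, one value out (HIVE)
hive> select count(*) from src;        -- aggregate: many rows in, one value out
hive> select explode(array(1, 2, 3));  -- table-generating: one row in, three rows out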
Categorized by implementation
- Built-in functions
- User-defined functions
  - UDF: user-defined standard function
  - UDAF: user-defined aggregate function
  - UDTF: user-defined table-generating function
Built-in Functions
- Standard functions
  - String functions
  - Type conversion functions
  - Mathematical functions
  - Date functions
  - Collection functions
  - Conditional functions
- Aggregate functions
- Table-generating functions
String Functions
Return type | Function | Description |
---|---|---|
string | concat(string/binary A, string/binary B...) | Returns the string that results from concatenating A and B in order, e.g. concat('hello', 'world') returns 'helloworld' |
int | instr(string str, string substr) | Returns the position of the first occurrence of substr in str. Returns null if either argument is null, and 0 if substr does not occur in str. The first character of str is at position 1 |
int | length(string A) | Returns the length of the string |
int | locate(string substr, string str[, int pos]) | Returns the position of the first occurrence of substr in str after position pos |
string | lower(string A) lcase(string A) | Returns the lowercase form of the string |
string | regexp_replace(string INITIAL_STRING, string PATTERN, string REPLACEMENT) | Replaces every substring of INITIAL_STRING that matches PATTERN with REPLACEMENT |
array | split(string str, string pat) | Splits str around matches of pat; pat is a regular expression |
string | substr(string/binary A, int start) substring(string/binary A, int start) | Returns the substring (or byte slice) of A from position start to the end of the string; a three-argument form, substr(A, start, len), returns a substring of length len |
string | trim(string A) | Removes the spaces from both ends of A |
map | str_to_map(text, delimiter1, delimiter2) | Splits text into key-value pairs, using delimiter1 between pairs and delimiter2 between key and value, and returns the result as a map |
binary | encode(string src, string charset) | Encodes the first argument into binary using the given character set; returns null if either argument is null |
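A few of these in action (standard Hive built-ins; the expected result follows each call as a comment):
hive> select concat('hello', 'world');              -- helloworld
hive> select instr('banana', 'an');                 -- 2
hive> select split('a,b,c', ',');                   -- ["a","b","c"]
hive> select substr('helloworld', 6);               -- world
hive> select str_to_map('k1:v1,k2:v2', ',', ':');   -- {"k1":"v1","k2":"v2"}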
Type Conversion and Mathematical Functions
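A couple of representative built-ins from this category (expected results in the comments):
hive> select cast('123' as int);   -- 123
hive> select round(3.1415, 2);     -- 3.14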
Date Functions
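For example:
hive> select datediff('2021-04-06', '2021-04-01');  -- 5
hive> select date_add('2021-04-06', 1);             -- 2021-04-07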
Collection Functions
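For example:
hive> select size(array(1, 2, 3));               -- 3
hive> select array_contains(array(1, 2, 3), 2);  -- true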
Conditional Functions
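For example:
hive> select if(1 > 0, 'yes', 'no');      -- yes
hive> select coalesce(null, 'fallback');  -- fallback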
Aggregate and Table-Generating Functions
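For example (emp is a hypothetical table):
hive> select count(*), avg(sal) from emp;    -- aggregates: one row out
hive> select explode(array('a', 'b', 'c'));  -- table-generating: three rows out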
Hive UDF Development Workflow
- Extend the UDF class or the GenericUDF class
- Override the evaluate() method and implement the function logic
- Compile and package the code into a jar file
- Copy the jar to the appropriate HDFS path
- Create a temporary or permanent function from the jar (see the sketch after this list)
- Call the function
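A minimal sketch of the registration step. A temporary function lives only for the current session and can load its jar from the local filesystem; a permanent function is recorded in the metastore and usually loads its jar from HDFS (the paths here are the ones used in the walkthrough below):
hive> add jar /opt/software/data/hivedata/20210406-UDF-1.0-SNAPSHOT.jar;
hive> create temporary function timeDiff as 'cn.kgc.kb11.udf.TimeDiff';
hive> create function timeDiff as 'cn.kgc.kb11.udf.TimeDiff'
    > using jar 'hdfs://hadoop12:9000/hive/jars/20210406-UDF-1.0-SNAPSHOT.jar';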
UDF Example
TimeDiff.java
package cn.kgc.kb11.udf;

import org.apache.hadoop.hive.ql.exec.Description;
import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;

import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;

/**
 * @Author ZhangPeng
 * @Date 2021/4/6
 * @Description A datediff variant: the third argument selects the unit in which
 *              the time difference is computed.
 */
@Description(
        name = "timeDiff",
        value = "_FUNC_(dateStr1, dateStr2, diffType) - returns the absolute time difference " +
                "between two timestamps in the unit given by diffType: " +
                "0=day, 1=hour, 2=minute, 3=second, 4=year, 5=month.",
        extended = "select timeDiff('2021-04-06 12:00:00', '2021-04-05 23:00:00', 1);\n" +
                "result is '13 hour'."
)
public class TimeDiff extends UDF {
    public Text evaluate(Text dateStr1, Text dateStr2, IntWritable diffType) throws ParseException {
        String s1 = dateStr1.toString();
        String s2 = dateStr2.toString();
        int type = diffType.get();
        SimpleDateFormat sdf = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");
        Date d1 = sdf.parse(s1);
        Date d2 = sdf.parse(s2);
        long diff = Math.abs(d1.getTime() - d2.getTime());
        long result = 0;
        String dateType = "";
        switch (type) {
            case 0:
                result = diff / 1000 / 3600 / 24;
                dateType = "day";
                break;
            case 1:
                result = diff / 1000 / 3600;
                dateType = "hour";
                break;
            case 2:
                result = diff / 1000 / 60;
                dateType = "minute";
                break;
            case 3:
                result = diff / 1000;
                dateType = "second";
                break;
            case 4:
                // approximation: a year is treated as 365 days
                result = diff / 1000 / 3600 / 24 / 365;
                dateType = "year";
                break;
            case 5:
                // approximation: a month is treated as 30 days
                result = diff / 1000 / 3600 / 24 / 30;
                dateType = "month";
                break;
        }
        return new Text(result + " " + dateType);
    }
}
TestGenericUDF.java
package cn.kgc.kb11.gudf;

import org.apache.hadoop.hive.ql.exec.Description;
import org.apache.hadoop.hive.ql.exec.UDFArgumentException;
import org.apache.hadoop.hive.ql.metadata.HiveException;
import org.apache.hadoop.hive.ql.udf.generic.GenericUDF;
import org.apache.hadoop.hive.serde2.objectinspector.ListObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.StringObjectInspector;

import java.util.Arrays;

/**
 * @Author ZhangPeng
 * @Date 2021/4/7
 * @Description
 */
@Description(
        name = "arrContains",
        value = "checks whether a value is an element of an array; the result is true or false",
        extended = "select arrContains(array('a','b','c'), 'c');\n" +
                "result is true"
)
public class TestGenericUDF extends GenericUDF {
    ListObjectInspector loi;
    StringObjectInspector valueOi;
    ObjectInspector argc;

    @Override
    public ObjectInspector initialize(ObjectInspector[] ois) throws UDFArgumentException {
        // initialize() records the input ObjectInspectors, validates the
        // arguments, and declares the return type (a boolean).
        if (ois.length != 2) {
            throw new UDFArgumentException("arrContains takes exactly two arguments: an array and a value");
        }
        loi = (ListObjectInspector) ois[0];
        valueOi = (StringObjectInspector) ois[1];
        argc = loi.getListElementObjectInspector();
        return PrimitiveObjectInspectorFactory.javaBooleanObjectInspector;
    }

    @Override
    public Object evaluate(DeferredObject[] dos) throws HiveException {
        Object arr = dos[0].get();
        Object value = dos[1].get();
        // The result must be computed per row. The original kept it in an
        // instance field that was never reset between rows, so once one row
        // matched, every following row also returned true.
        boolean result = false;
        int len = loi.getListLength(arr);
        for (int i = 0; i < len; i++) {
            Object element = loi.getListElement(arr, i);
            // note: this compares the raw objects, which assumes the value and
            // the array elements are materialized as the same Java type
            if (value != null && value.equals(element)) {
                result = true;
                break;
            }
        }
        return result;
    }

    @Override
    public String getDisplayString(String[] chis) {
        // shown in EXPLAIN output
        return "arrContains(" + Arrays.toString(chis) + ")";
    }
}
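Assuming the jar has already been added to the session, the function can be registered and called like this (temporary registration shown; the class name comes from the code above):
hive> create temporary function arrContains as 'cn.kgc.kb11.gudf.TestGenericUDF';
hive> select arrContains(array('a','b','c'), 'c');
OK
true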
Building the jar
E:\public\apache-maven-3.6.1\bin\mvn clean      (clean the target directory first)
E:\public\apache-maven-3.6.1\bin\mvn install    (compile and produce the jar)
Upload the jar to HDFS
hdfs dfs -put /opt/software/data/hivedata/20210406-UDF-1.0-SNAPSHOT.jar /hive/jars
Create the function in Hive
hive> create function timeDiff as 'cn.kgc.kb11.udf.TimeDiff'
> using jar 'hdfs://hadoop12:9000/hive/jars/20210406-UDF-1.0-SNAPSHOT.jar';
Call the function
hive> select timeDiff('2021-04-06 12:00:00','2021-04-05 23:00:00',1);
OK
13 hour
Time taken: 0.202 seconds, Fetched: 1 row(s)
UDAF Development
Introduction
Hive user-defined aggregate functions (UDAFs) are a powerful feature for building advanced data processing into queries. Hive supports two kinds of UDAF: simple and generic. As the names suggest, a simple UDAF is quite easy to write, but it relies on Java reflection, which costs performance, and some features, such as variable-length argument lists, are unavailable. A generic UDAF can use every feature, but it is more complex and less intuitive to write.
A UDAF is used together with GROUP BY in a Hive SQL statement. Keep in mind that Hive's GROUP BY returns exactly one record per group; this is stricter than classic MySQL, which tolerated non-aggregated columns in the select list. A call is sketched below.
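For instance, once the TestAvg function developed below is registered (the function name testAvg and the emp table here are hypothetical), each group yields exactly one averaged row:
hive> select deptno, testAvg(sal) from emp group by deptno;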
Development Steps
Developing a generic UDAF involves two steps: writing a resolver class and writing an evaluator class. The resolver is responsible for type checking and operator overloading; the evaluator implements the actual UDAF logic. Typically, the top-level UDAF class implements the org.apache.hadoop.hive.ql.udf.generic.GenericUDAFResolver2 interface and contains a nested evaluator class that implements the UDAF logic.
Complete Code
GenericUDAFEvaluator contains a nested enum, Mode. This enum matters: it identifies the MapReduce phase in which the UDAF is currently running, and understanding it is the key to understanding how a Hive UDAF executes.
public static enum Mode {
    /**
     * PARTIAL1: the map phase of the MapReduce job: from raw data to a partial aggregation.
     * iterate() and terminatePartial() will be called.
     */
    PARTIAL1,
    /**
     * PARTIAL2: the map-side combiner phase, which merges map output on the map side:
     * from partial aggregation to partial aggregation.
     * merge() and terminatePartial() will be called.
     */
    PARTIAL2,
    /**
     * FINAL: the reduce phase: from partial aggregations to the complete aggregation.
     * merge() and terminate() will be called.
     */
    FINAL,
    /**
     * COMPLETE: present when the job is map-only (no reduce), so the map side produces
     * the result directly: from raw data straight to the complete aggregation.
     * iterate() and terminate() will be called.
     */
    COMPLETE
};
TestAvg.java
package cn.kgc.kb11.gudaf;

import org.apache.hadoop.hive.ql.metadata.HiveException;
import org.apache.hadoop.hive.ql.parse.SemanticException;
import org.apache.hadoop.hive.ql.udf.generic.GenericUDAFEvaluator;
import org.apache.hadoop.hive.ql.udf.generic.GenericUDAFParameterInfo;
import org.apache.hadoop.hive.ql.udf.generic.GenericUDAFResolver2;
// Hive's own DoubleWritable (not org.apache.hadoop.io.DoubleWritable) is the
// type that writableDoubleObjectInspector expects.
import org.apache.hadoop.hive.serde2.io.DoubleWritable;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorFactory;
import org.apache.hadoop.hive.serde2.objectinspector.PrimitiveObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.StructField;
import org.apache.hadoop.hive.serde2.objectinspector.StructObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorUtils;
import org.apache.hadoop.hive.serde2.typeinfo.TypeInfo;
import org.apache.hadoop.io.IntWritable;

import java.util.ArrayList;
import java.util.List;

/**
 * @Author ZhangPeng
 * @Date 2021/4/7
 * @Description
 */
public class TestAvg implements GenericUDAFResolver2 {
    @Override
    public GenericUDAFEvaluator getEvaluator(GenericUDAFParameterInfo info) throws SemanticException {
        // Hand back the custom evaluator below. (The original returned Hive's
        // built-in GenericUDAFAverage evaluator here, which left AvgTest unused.)
        return new AvgTest();
    }

    @Override
    public GenericUDAFEvaluator getEvaluator(TypeInfo[] parameters) throws SemanticException {
        return new AvgTest();
    }

    public static class AvgTest extends GenericUDAFEvaluator {
        // ObjectInspector of the numeric input column (PARTIAL1/COMPLETE)
        PrimitiveObjectInspector inputOi;
        // ObjectInspector of the {sum, count} partial struct (PARTIAL2/FINAL)
        StructObjectInspector partialOi;
        StructField sumField;
        StructField countField;

        // init() records the input ObjectInspectors and declares the return type
        // of the current phase: a {sum, count} struct for the partial phases,
        // a double for the final result.
        @Override
        public ObjectInspector init(Mode m, ObjectInspector[] parameters) throws HiveException {
            super.init(m, parameters);
            if (m == Mode.PARTIAL1 || m == Mode.COMPLETE) {
                // the input is the raw column being averaged
                inputOi = (PrimitiveObjectInspector) parameters[0];
            } else {
                // the input is the partial struct emitted by terminatePartial()
                partialOi = (StructObjectInspector) parameters[0];
                sumField = partialOi.getStructFieldRef("sum");
                countField = partialOi.getStructFieldRef("count");
            }
            if (m == Mode.PARTIAL1 || m == Mode.PARTIAL2) {
                List<String> fieldsName = new ArrayList<>();
                fieldsName.add("sum");
                fieldsName.add("count");
                List<ObjectInspector> structFieldsOis = new ArrayList<>();
                structFieldsOis.add(PrimitiveObjectInspectorFactory.writableDoubleObjectInspector);
                structFieldsOis.add(PrimitiveObjectInspectorFactory.writableIntObjectInspector);
                return ObjectInspectorFactory.getStandardStructObjectInspector(fieldsName, structFieldsOis);
            }
            return PrimitiveObjectInspectorFactory.writableDoubleObjectInspector;
        }

        // Allocate the per-group buffer that stores the running sum and count
        // used by the mapper, combiner, and reducer.
        @Override
        public AggregationBuffer getNewAggregationBuffer() throws HiveException {
            AvgAgg avgAgg = new AvgAgg();
            reset(avgAgg);
            return avgAgg;
        }

        // MapReduce reuses mapper and reducer objects, so the buffer must be
        // reusable as well. (The original reset an unused field and left the
        // sum and count untouched.)
        @Override
        public void reset(AggregationBuffer agg) throws HiveException {
            AvgAgg a = (AvgAgg) agg;
            a.sum = 0.0;
            a.count = 0;
        }

        // Map phase: called once per input row; accumulate the single numeric
        // column and count it.
        @Override
        public void iterate(AggregationBuffer agg, Object[] parameters) throws HiveException {
            if (parameters == null || parameters[0] == null) {
                return;
            }
            AvgAgg a = (AvgAgg) agg;
            a.sum += PrimitiveObjectInspectorUtils.getDouble(parameters[0], inputOi);
            a.count++;
        }

        // Result returned at the end of the map phase and of the combiner:
        // the {sum, count} struct declared in init(). (The original returned
        // the final average here, which did not match the declared type.)
        @Override
        public Object terminatePartial(AggregationBuffer agg) throws HiveException {
            AvgAgg a = (AvgAgg) agg;
            return new Object[]{new DoubleWritable(a.sum), new IntWritable(a.count)};
        }

        // Combiner/reducer side: fold a partial {sum, count} into the buffer.
        // Reading through the ObjectInspector works regardless of the serde,
        // unlike the original's hard cast to LazyBinaryStruct.
        @Override
        public void merge(AggregationBuffer agg, Object partial) throws HiveException {
            if (partial == null) {
                return;
            }
            AvgAgg a = (AvgAgg) agg;
            a.sum += PrimitiveObjectInspectorUtils.getDouble(
                    partialOi.getStructFieldData(partial, sumField),
                    (PrimitiveObjectInspector) sumField.getFieldObjectInspector());
            a.count += PrimitiveObjectInspectorUtils.getInt(
                    partialOi.getStructFieldData(partial, countField),
                    (PrimitiveObjectInspector) countField.getFieldObjectInspector());
        }

        // Reduce phase (or the map side in map-only COMPLETE mode): produce
        // the final average.
        @Override
        public Object terminate(AggregationBuffer agg) throws HiveException {
            AvgAgg a = (AvgAgg) agg;
            if (a.count == 0) {
                return null; // avoid dividing by zero on empty groups
            }
            return new DoubleWritable(a.sum / a.count);
        }

        @AggregationType(estimable = true)
        static class AvgAgg extends AbstractAggregationBuffer {
            double sum;
            int count;

            @Override
            public int estimate() {
                return 16;
            }
        }
    }
}
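Registered like the UDF earlier (the jar and HDFS path are the ones from the walkthrough above; the emp table is hypothetical), the aggregate is then used with GROUP BY:
hive> create function testAvg as 'cn.kgc.kb11.gudaf.TestAvg'
    > using jar 'hdfs://hadoop12:9000/hive/jars/20210406-UDF-1.0-SNAPSHOT.jar';
hive> select deptno, testAvg(sal) from emp group by deptno;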
UDTF Development
Steps
1. Extend org.apache.hadoop.hive.ql.udf.generic.GenericUDTF
2. Implement the initialize, process, and close methods
3. The UDTF lifecycle is:
   a. initialize is called first; it returns the shape of the UDTF's output rows (the number of columns and their types)
   b. once initialization is done, process is called to handle the input arguments; results are emitted via the forward() method
   c. finally, close is called so that any resources can be cleaned up
Code
TestUDTF.java
package cn.kgc.kb11.gudtf;

import org.apache.hadoop.hive.ql.exec.Description;
import org.apache.hadoop.hive.ql.exec.UDFArgumentException;
import org.apache.hadoop.hive.ql.metadata.HiveException;
import org.apache.hadoop.hive.ql.udf.generic.GenericUDTF;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorFactory;
import org.apache.hadoop.hive.serde2.objectinspector.StructField;
import org.apache.hadoop.hive.serde2.objectinspector.StructObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.StringObjectInspector;

import java.util.ArrayList;
import java.util.List;

/**
 * @Author ZhangPeng
 * @Date 2021/4/7
 * @Description
 */
@Description(
        name = "splitToLines",
        value = "splits a string into rows using the given separator.",
        extended = "select splitToLines('a,b,c', ',');\nresult is:\n" +
                "'a'\n" +
                "'b'\n" +
                "'c'\n"
)
public class TestUDTF extends GenericUDTF {
    StringObjectInspector strOi;
    StringObjectInspector separatorChar;

    @Override
    public StructObjectInspector initialize(StructObjectInspector argOIs) throws UDFArgumentException {
        // Take the inspectors of the two actual arguments. (The original
        // assigned factory instances instead, which only happens to work
        // when the inputs really are Java strings.)
        List<? extends StructField> inputFields = argOIs.getAllStructFieldRefs();
        if (inputFields.size() != 2) {
            throw new UDFArgumentException("splitToLines takes exactly two arguments: a string and a separator");
        }
        strOi = (StringObjectInspector) inputFields.get(0).getFieldObjectInspector();
        separatorChar = (StringObjectInspector) inputFields.get(1).getFieldObjectInspector();
        // Declare the output: a single string column named "values".
        List<String> name = new ArrayList<>();
        name.add("values");
        List<ObjectInspector> listOis = new ArrayList<>();
        listOis.add(PrimitiveObjectInspectorFactory.javaStringObjectInspector);
        return ObjectInspectorFactory.getStandardStructObjectInspector(name, listOis);
    }

    @Override
    public void process(Object[] args) throws HiveException {
        String str = strOi.getPrimitiveJavaObject(args[0]);
        String sep = separatorChar.getPrimitiveJavaObject(args[1]);
        // split() interprets the separator as a regular expression
        String[] lines = str.split(sep);
        for (String line : lines) {
            // forward() emits one output row per call
            forward(new Object[]{line});
        }
    }

    @Override
    public void close() throws HiveException {
        // nothing to clean up
    }
}
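A sketch of how the UDTF would be called, both standalone and in the idiomatic LATERAL VIEW form (mytable and its line column are hypothetical; the function is assumed to be registered like the examples above):
hive> create temporary function splitToLines as 'cn.kgc.kb11.gudtf.TestUDTF';
hive> select splitToLines('a,b,c', ',');
OK
a
b
c
hive> select t.word from mytable lateral view splitToLines(line, ',') t as word;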