Hive UDF UDAF UDTF

In day-to-day work the functions Hive ships with are often not enough, so we need to write our own. Hive leaves that door open: by extending the relevant class or implementing the relevant interface and writing a little code, you can add exactly the functionality you need.

UDF (User-Defined Function): an ordinary user-defined function that operates on a single row and produces a single value.

UDAF (User-Defined Aggregation Function): a user-defined aggregate function that operates on many rows at once, similar to built-ins such as sum and avg.

UDTF (User-Defined Table-Generating Function): a user-defined table-generating function, used when one input row should produce multiple output rows.
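
For intuition, Hive's built-in functions already cover all three kinds (queries shown against the user table created below):

select upper(name) from user;                   -- built-in UDF: one value per row
select count(*) from user;                      -- built-in UDAF: one value for many rows
select explode(split(hobby, ',')) from user;    -- built-in UDTF: many rows per row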

The sections below walk through writing and using each of the three kinds of custom function.

Preparation

Prepare the data file user.txt:
zs      play,sing
ls      sleep,eat
mwf     study,sleep

hadoop fs -mkdir /external/user/
hadoop fs -put user.txt /external/user/

Create the user table in Hive:
create external table user(name string,hobby string) row format delimited fields terminated by '\t' location '/external/user';
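
A quick sanity check that the external table picks up the file (output assumes the sample user.txt above):

select * from user;
-- zs     play,sing
-- ls     sleep,eat
-- mwf    study,sleep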

The user table has two columns, name and hobby. With a bit of code we will implement the following:

(1) a UDF that converts the name to upper case

(2) a UDTF that splits the hobby column into two columns

(3) a UDAF that counts the total number of hobbies across all rows

Create a Maven project with the following pom.xml:
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <groupId>com.mwf</groupId>
    <artifactId>test_udf01</artifactId>
    <version>1.0-SNAPSHOT</version>


    <dependencies>
 <!-- https://mvnrepository.com/artifact/org.apache.hive/hive-exec -->
        <dependency>
            <groupId>org.apache.hive</groupId>
            <artifactId>hive-exec</artifactId>
            <version>1.2.1</version>
        </dependency>
 <!-- https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-common -->
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-common</artifactId>
            <version>2.7.3</version>
        </dependency>
    </dependencies>
    
</project>

UDF

To write a UDF, extend the UDF class and write one or more evaluate() methods containing the actual logic; Hive picks the matching evaluate() overload by reflection.

package com.mwf.demo;

import org.apache.hadoop.hive.ql.exec.UDF;

public class SimpleUDF extends UDF {
    // Upper-case the input string; pass nulls through unchanged
    public String evaluate(String str) {
        return str == null ? null : str.toUpperCase();
    }
}

Package and test

Build the jar (for example with mvn package), upload it to the server, add the jar in Hive, register a temporary function, and call it:
add jar test_udf.jar;
create temporary function udf as 'com.mwf.demo.SimpleUDF';
select udf(name) from user;
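
With the three sample rows, this should return ZS, LS and MWF.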

UDTF

To write a UDTF, extend the GenericUDTF class and override the initialize, process, and close methods.

initialize declares the names and types of the columns the UDTF returns.

process is called for every input row; it emits one or more output rows by calling forward().

close is called once after all rows have been processed and can be used for any final cleanup.

package com.mwf.demo;

import java.util.ArrayList;

import org.apache.hadoop.hive.ql.exec.UDFArgumentException;
import org.apache.hadoop.hive.ql.metadata.HiveException;
import org.apache.hadoop.hive.ql.udf.generic.GenericUDTF;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorFactory;
import org.apache.hadoop.hive.serde2.objectinspector.StructObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;

public class SimpleUDTF extends GenericUDTF {

    // Declare the two output columns (hobby1, hobby2), both strings
    @Override
    public StructObjectInspector initialize(StructObjectInspector argOIs) throws UDFArgumentException {
        ArrayList<String> colNames = new ArrayList<String>();
        colNames.add("hobby1");
        colNames.add("hobby2");
        ArrayList<ObjectInspector> fieldOIs = new ArrayList<ObjectInspector>();
        fieldOIs.add(PrimitiveObjectInspectorFactory.javaStringObjectInspector);
        fieldOIs.add(PrimitiveObjectInspectorFactory.javaStringObjectInspector);
        return ObjectInspectorFactory.getStandardStructObjectInspector(colNames, fieldOIs);
    }

    // Split the comma-separated hobby string and emit the parts as one output row
    @Override
    public void process(Object[] objects) throws HiveException {
        forward(objects[0].toString().split(","));
    }

    // Nothing to clean up
    @Override
    public void close() throws HiveException {
    }
}

Package and test

add jar test_udtf.jar;
create temporary function udtf as 'com.mwf.demo.SimpleUDTF';
select udtf(hobby) from user;
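
Called directly like this, the UDTF has to be the only expression in the select list. To keep the original columns next to the generated ones, the usual pattern is LATERAL VIEW; a minimal sketch, assuming the same udtf registration as above (row order may vary):

select u.name, t.hobby1, t.hobby2
from user u
lateral view udtf(u.hobby) t as hobby1, hobby2;
-- zs     play     sing
-- ls     sleep    eat
-- mwf    study    sleep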

UDAF

The following two articles explain this very well and are worth reading:

https://blog.csdn.net/zyz_home/article/details/79889519 (this is the one that made it click for me)

https://cloud.tencent.com/info/05fa14293c68fe91f7b4670389b8f7e5.html

Two classes are involved:

AbstractGenericUDAFResolver decides, based on the argument types passed in, which Evaluator should handle the call.

GenericUDAFEvaluator is where the concrete methods and the aggregation logic are written.

Before writing one, it helps to understand GenericUDAFEvaluator's inner Mode enum and the ObjectInspector interface; both are covered in the two articles above. Mode tells the evaluator which stage of the aggregation it is running: PARTIAL1 is the map side (iterate then terminatePartial), PARTIAL2 is the combine side (merge then terminatePartial), FINAL is the reduce side (merge then terminate), and COMPLETE is a map-only aggregation (iterate then terminate).

The methods a GenericUDAFEvaluator needs to implement:
// Establish the input/output ObjectInspectors for each stage
public ObjectInspector init(Mode m, ObjectInspector[] parameters) throws HiveException;

// Class that holds the intermediate aggregation state
abstract AggregationBuffer getNewAggregationBuffer() throws HiveException;

// Reset the aggregation state
public void reset(AggregationBuffer agg) throws HiveException;

// Map phase: process each row of input column data
public void iterate(AggregationBuffer agg, Object[] parameters) throws HiveException;

// Return the partial aggregation result at the end of the map/combine phase
public Object terminatePartial(AggregationBuffer agg) throws HiveException;

// Combiner merges map output; reducer merges mapper/combiner output
public void merge(AggregationBuffer agg, Object partial) throws HiveException;

// Reduce phase: return the final result
public Object terminate(AggregationBuffer agg) throws HiveException;

Code implementation

package com.mwf.demo;

import org.apache.hadoop.hive.ql.metadata.HiveException;
import org.apache.hadoop.hive.ql.parse.SemanticException;
import org.apache.hadoop.hive.ql.udf.generic.AbstractGenericUDAFResolver;
import org.apache.hadoop.hive.ql.udf.generic.GenericUDAFEvaluator;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorFactory;
import org.apache.hadoop.hive.serde2.objectinspector.PrimitiveObjectInspector;
import org.apache.hadoop.hive.serde2.typeinfo.TypeInfo;

public class SimpleUDAF extends AbstractGenericUDAFResolver {

    @Override
    public GenericUDAFEvaluator getEvaluator(TypeInfo[] info) throws SemanticException {
        return new TestEvaluator();
    }

    public static class TestEvaluator extends GenericUDAFEvaluator {

        // ObjectInspectors for the original input and for the partial (Integer) results
        PrimitiveObjectInspector inputOI;
        PrimitiveObjectInspector integerOI;
        ObjectInspector outputOI;

        // Declare the input, intermediate and output types for the different stages
        // (PARTIAL1, PARTIAL2, FINAL, COMPLETE)
        @Override
        public ObjectInspector init(Mode m, ObjectInspector[] parameters) throws HiveException {
            super.init(m, parameters);
            if (m == Mode.PARTIAL1 || m == Mode.COMPLETE) {
                // original rows come in
                inputOI = (PrimitiveObjectInspector) parameters[0];
            } else {
                // partial counts come in
                integerOI = (PrimitiveObjectInspector) parameters[0];
            }
            outputOI = ObjectInspectorFactory.getReflectionObjectInspector(Integer.class,
                    ObjectInspectorFactory.ObjectInspectorOptions.JAVA);
            return outputOI;
        }

        // Intermediate class that stores the aggregation state
        static class HobbyAggregationBuffer implements AggregationBuffer {
            int tempTotal = 0;
            void add(int num) { tempTotal += num; }
        }

        // Create a new aggregation buffer
        @Override
        public AggregationBuffer getNewAggregationBuffer() throws HiveException {
            return new HobbyAggregationBuffer();
        }

        // Reset the aggregation buffer in place (reassigning the parameter would have no effect)
        @Override
        public void reset(AggregationBuffer aggregationBuffer) throws HiveException {
            ((HobbyAggregationBuffer) aggregationBuffer).tempTotal = 0;
        }

        // Map side: count the comma-separated hobbies of each row
        @Override
        public void iterate(AggregationBuffer aggregationBuffer, Object[] objects) throws HiveException {
            if (objects[0] != null) {
                HobbyAggregationBuffer hobbyAgg = (HobbyAggregationBuffer) aggregationBuffer;
                Object p = inputOI.getPrimitiveJavaObject(objects[0]);
                hobbyAgg.add(String.valueOf(p).split(",").length);
            }
        }

        // Return the partial result produced by the map/combine phase
        @Override
        public Object terminatePartial(AggregationBuffer aggregationBuffer) throws HiveException {
            return ((HobbyAggregationBuffer) aggregationBuffer).tempTotal;
        }

        // Merge a partial result (from a mapper or combiner) into the buffer
        @Override
        public void merge(AggregationBuffer aggregationBuffer, Object o) throws HiveException {
            if (o != null) {
                HobbyAggregationBuffer hobbyAgg = (HobbyAggregationBuffer) aggregationBuffer;
                Integer partialSum = (Integer) integerOI.getPrimitiveJavaObject(o);
                hobbyAgg.add(partialSum);
            }
        }

        // Reduce side: return the final result
        @Override
        public Object terminate(AggregationBuffer aggregationBuffer) throws HiveException {
            return ((HobbyAggregationBuffer) aggregationBuffer).tempTotal;
        }
    }
}

Package and test

add jar test_udaf.jar;
create temporary function udaf as 'com.mwf.demo.SimpleUDAF';
select udaf(hobby) from user;
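
For the sample data every row has two comma-separated hobbies, so this should return 6. Like a built-in aggregate, the function can also be combined with group by; a small sketch under the same registration (row order may vary):

select name, udaf(hobby) from user group by name;
-- ls     2
-- mwf    2
-- zs     2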

The complete code is available at https://github.com/upupfeng/test_udf



Just a porter of knowledge; moving a little more of it around is always a good thing.
