An example of writing a Hive UDAF

Reposted from:

http://beekeeperdata.com/posts/hadoop/2015/08/17/hive-udaf-tutorial.html


This is part 3/3 in my tutorial series for extending Apache Hive.

In previous articles I outlined how to write simple functions for Hive - UDF and GenericUDF - followed by the table-generating variety, GenericUDTF.

In this post we will look at a function type in Hive that lets us work with the data across the rows of a column - a GenericUDAF, represented by org.apache.hadoop.hive.ql.udf.generic.AbstractGenericUDAFResolver and org.apache.hadoop.hive.ql.udf.generic.GenericUDAFEvaluator.

Examples of built-in UDAF functions include sum() and count().

Code

All code and data used in this post can be found in my hive examples GitHub repository.

Demonstration Data

The table that will be used for demonstration is called people. It has one column - name, which contains names of individuals and couples.

It is stored in a file called people.txt

~$ cat ./people.txt

John Smith
John and Ann White
Ted Green
Dorothy

We can upload this to Hadoop to a directory called people:

hadoop fs -mkdir people
hadoop fs -put ./people.txt people


Then load up the Hive shell and create the Hive table:

CREATE EXTERNAL TABLE people (name string)
ROW FORMAT DELIMITED FIELDS 
	TERMINATED BY '\t' 
	ESCAPED BY '' 
	LINES TERMINATED BY '\n'
STORED AS TEXTFILE 
LOCATION '/user/matthew/people';


The Value of UDAF

There are cases when we want to process the data in a column as a whole, rather than row by row - aggregating or ordering the values in a column, for example.

A Practical Example

I will work through an example of aggregating data. Our UDTF post manipulated people's names, so I will do something similar. Let's suppose we want to calculate the total number of letters in the entire name column of our people table.

To create a GenericUDAF we have to extend org.apache.hadoop.hive.ql.udf.generic.AbstractGenericUDAFResolver and org.apache.hadoop.hive.ql.udf.generic.GenericUDAFEvaluator.

The resolver checks the input parameters and specifies which evaluator to use, and so is fairly simple:

public GenericUDAFEvaluator getEvaluator(TypeInfo[] parameters) throws SemanticException;
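For reference, here is a sketch of what the resolver for our function might look like. The argument checks follow the pattern used by the built-in UDAFs; treat the details as an illustration rather than a verbatim copy of the repository code. The evaluator shown in the next section lives as a static inner class of this resolver.

import org.apache.hadoop.hive.ql.exec.UDFArgumentTypeException;
import org.apache.hadoop.hive.ql.parse.SemanticException;
import org.apache.hadoop.hive.ql.udf.generic.AbstractGenericUDAFResolver;
import org.apache.hadoop.hive.ql.udf.generic.GenericUDAFEvaluator;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.PrimitiveObjectInspector;
import org.apache.hadoop.hive.serde2.typeinfo.PrimitiveTypeInfo;
import org.apache.hadoop.hive.serde2.typeinfo.TypeInfo;

public class TotalNumOfLettersGenericUDAF extends AbstractGenericUDAFResolver {

    @Override
    public GenericUDAFEvaluator getEvaluator(TypeInfo[] parameters)
            throws SemanticException {
        // exactly one argument is expected
        if (parameters.length != 1) {
            throw new UDFArgumentTypeException(parameters.length - 1,
                    "Exactly one argument is expected.");
        }
        // and it must be a primitive string
        if (parameters[0].getCategory() != ObjectInspector.Category.PRIMITIVE) {
            throw new UDFArgumentTypeException(0,
                    "Only primitive type arguments are accepted, but "
                            + parameters[0].getTypeName() + " was passed.");
        }
        PrimitiveObjectInspector.PrimitiveCategory primitiveCategory =
                ((PrimitiveTypeInfo) parameters[0]).getPrimitiveCategory();
        if (primitiveCategory != PrimitiveObjectInspector.PrimitiveCategory.STRING) {
            throw new UDFArgumentTypeException(0,
                    "Only a string argument is accepted, but "
                            + parameters[0].getTypeName() + " was passed.");
        }
        return new TotalNumOfLettersEvaluator();
    }
}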
The main work happens inside the Evaluator, in which we have several methods to implement.

Before proceeding, if you are not familiar with object inspectors, you might want to read my first post on Hive UDFs, in which I give a brief summary of their purpose.

// object inspectors for input and output parameters
public ObjectInspector init(Mode m, ObjectInspector[] parameters) throws HiveException;

// return a new buffer that stores the intermediate result of the aggregation
abstract AggregationBuffer getNewAggregationBuffer() throws HiveException;

// reset the aggregation buffer
public void reset(AggregationBuffer agg) throws HiveException;

// process an input record
public void iterate(AggregationBuffer agg, Object[] parameters) throws HiveException;

// finalize the processing of a portion of the input data
public Object terminatePartial(AggregationBuffer agg) throws HiveException;

// add the results of two partial aggregations together
public void merge(AggregationBuffer agg, Object partial) throws HiveException;

// output the final result
public Object terminate(AggregationBuffer agg) throws HiveException;

The evaluator below calculates the total number of characters in all the strings in the specified column (including spaces):

public static class TotalNumOfLettersEvaluator extends GenericUDAFEvaluator {

        PrimitiveObjectInspector inputOI;
        ObjectInspector outputOI;
        PrimitiveObjectInspector integerOI;

        @Override
        public ObjectInspector init(Mode m, ObjectInspector[] parameters)
                throws HiveException {

            assert (parameters.length == 1);
            super.init(m, parameters);
           
            // init input object inspectors
            if (m == Mode.PARTIAL1 || m == Mode.COMPLETE) {
                inputOI = (PrimitiveObjectInspector) parameters[0];
            } else {
                integerOI = (PrimitiveObjectInspector) parameters[0];
            }

            // init output object inspectors
            outputOI = ObjectInspectorFactory.getReflectionObjectInspector(Integer.class,
                    ObjectInspectorOptions.JAVA);
            return outputOI;

        }

        /**
         * class for storing the current sum of letters
         */
        static class LetterSumAgg implements AggregationBuffer {
            int sum = 0;
            void add(int num){
            	sum += num;
            }
        }

        @Override
        public AggregationBuffer getNewAggregationBuffer() throws HiveException {
            LetterSumAgg result = new LetterSumAgg();
            return result;
        }

        @Override
        public void reset(AggregationBuffer agg) throws HiveException {
            // reset the buffer that was passed in, rather than creating
            // a new local one that would immediately be discarded
            ((LetterSumAgg) agg).sum = 0;
        }
        
        @Override
        public void iterate(AggregationBuffer agg, Object[] parameters)
                throws HiveException {
            assert (parameters.length == 1);
            if (parameters[0] != null) {
                LetterSumAgg myagg = (LetterSumAgg) agg;
                // inputOI is already a PrimitiveObjectInspector, no cast needed
                Object p1 = inputOI.getPrimitiveJavaObject(parameters[0]);
                myagg.add(String.valueOf(p1).length());
            }
        }

        @Override
        public Object terminatePartial(AggregationBuffer agg) throws HiveException {
            // return the partial sum accumulated so far by this task;
            // all state lives in the buffer, not in the evaluator itself
            LetterSumAgg myagg = (LetterSumAgg) agg;
            return myagg.sum;
        }

        @Override
        public void merge(AggregationBuffer agg, Object partial)
                throws HiveException {
            if (partial != null) {
                LetterSumAgg myagg = (LetterSumAgg) agg;
                // the partial result arrives as an Integer, as declared in init()
                Integer partialSum = (Integer) integerOI.getPrimitiveJavaObject(partial);
                myagg.add(partialSum);
            }
        }

        @Override
        public Object terminate(AggregationBuffer agg) throws HiveException {
            LetterSumAgg myagg = (LetterSumAgg) agg;
            return myagg.sum;
        }

    }


Code walkthrough

To understand the API of this function better, remember that a Hive query ultimately runs as MapReduce jobs. The MapReduce code itself has been written for us and is hidden from view for convenience (or inconvenience, perhaps). So let us refresh ourselves on Mappers, Combiners, and Reducers while thinking about this function. Remember that with Hadoop we have different machines, and on each machine Mappers and Reducers work independently of all the others.

So broadly, this function reads data (mapper), combines a bunch of mapper output into partial results (combiner), and finally creates a final, combined output (reducer). Because we aggregate across many combiners, we need to accommodate the idea of partial results.

Looking deeper at the structure of the class:

  • init - specifies input and output types of data (we have previously seen the requirement to specify input and output parameters)

  • iterate - reads data from the input table (a typical Mapper)

  • terminate - outputs the final result (the Reducer)

and then there are Partials and an AggregationBuffer:

  • terminatePartial - outputs a partial result
  • merge - merges partial results into a single result (e.g. the outputs of multiple combiner calls)

There are some good resources on combiners; Philippe Adjiman has a really good walkthrough.

The AggregationBuffer allows us to store intermediate (and final) results. By defining our own buffer, we can process any type of data we like.

In my code example a sum of letters is stored in our (simple) AggregationBuffer.

/**
* class for storing the current sum of letters
*/
static class LetterSumAgg implements AggregationBuffer {
	int sum = 0;
	void add(int num){
		sum += num;
	}
}
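As a purely hypothetical illustration (not part of this example), a buffer for an average() UDAF would need to carry both a running sum and a count, since a correct average cannot be reconstructed from partial averages alone:

// hypothetical buffer for an average() UDAF - illustration only
static class AverageAgg implements AggregationBuffer {
    long count = 0;
    double sum = 0;
    void add(double value) {
        count++;
        sum += value;
    }
}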


One final part of the init method which may still be confusing is the concept of Mode. Mode is used to define what the function should be doing at different stages of the MapReduce pipeline (mapping, combining, or reducing).

The Hive documentation gives the following explanation of Mode:

Parameters: In PARTIAL1 and COMPLETE mode, the parameters are original data; In PARTIAL2 and FINAL mode, the parameters are just partial aggregations.
Returns: In PARTIAL1 and PARTIAL2 mode, the ObjectInspector for the return value of terminatePartial() call; In FINAL and COMPLETE mode, the ObjectInspector for the return value of terminate() call.

That means the UDAF receives different input at different MapReduce stages. iterate reads a line from our table (or, more precisely, an input record as per the InputFormat of our table) and outputs something for aggregation in some other format. The partial-aggregation step (terminatePartial and merge) combines a number of these elements into an aggregated form of the same format. The final reducer then takes this input and outputs a final result, whose format may differ from the format in which the data arrived.
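Condensed into a schematic (our own summary of the documentation above, not Hive source):

// Which evaluator methods run in each Mode:
//
//   PARTIAL1 (map phase):      iterate()  -> terminatePartial()
//   PARTIAL2 (combine phase):  merge()    -> terminatePartial()
//   FINAL    (reduce phase):   merge()    -> terminate()
//   COMPLETE (map-only query): iterate()  -> terminate()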

Our Implementation

In the init() function we specify input as a string, final output as an integer, and partial aggregation output as an integer (stored in an aggregation buffer). That is, iterate() gets a String and merge() gets an Integer, while both terminatePartial() and terminate() output an Integer.

// init input object inspectors depending on the mode
if (m == Mode.PARTIAL1 || m == Mode.COMPLETE) {
	inputOI = (PrimitiveObjectInspector) parameters[0];
} else {
	integerOI = (PrimitiveObjectInspector) parameters[0];
}

// output
outputOI = ObjectInspectorFactory.getReflectionObjectInspector(Integer.class,
                    ObjectInspectorOptions.JAVA);


The iterate() function reads a string from the column, then calculates and stores its length:

public void iterate(AggregationBuffer agg, Object[] parameters)
	throws HiveException {
	...
	Object p1 = inputOI.getPrimitiveJavaObject(parameters[0]);
	myagg.add(String.valueOf(p1).length());
	...
}


merge() adds the result of a partial aggregation to the AggregationBuffer:

public void merge(AggregationBuffer agg, Object partial)
		throws HiveException {
	if (partial != null) {
		LetterSumAgg myagg = (LetterSumAgg) agg;
		Integer partialSum = (Integer) integerOI.getPrimitiveJavaObject(partial);
		myagg.add(partialSum);
	}
}


terminate() returns the contents of the AggregationBuffer; this is where the final result is produced:

public Object terminate(AggregationBuffer agg) throws HiveException {
	LetterSumAgg myagg = (LetterSumAgg) agg;
	return myagg.sum;
}


Using the Function in Hive

ADD JAR ./hive-extension-examples-master/target/hive-extensions-1.0-SNAPSHOT-jar-with-dependencies.jar;
CREATE TEMPORARY FUNCTION letters as 'com.matthewrathbone.example.TotalNumOfLettersGenericUDAF';

SELECT letters(name) 
FROM people;
OK
44
Time taken: 20.688 seconds


Testing

It is possible to write an effective unit test for parts of this process, although testing the function as a whole is tough because of the nature of the API. I would recommend unit-testing the individual aggregation methods if they are particularly complex. More trivially, the finished function can be tested against a test table in Hive.
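One pragmatic middle ground is to drive the evaluator by hand from plain Java, simulating the map and reduce stages ourselves. The sketch below assumes the evaluator is a static inner class of TotalNumOfLettersGenericUDAF as in this post; the harness class name and the stage layout are our own simplification of what Hive does, not a Hive API:

import org.apache.hadoop.hive.ql.udf.generic.GenericUDAFEvaluator;
import org.apache.hadoop.hive.ql.udf.generic.GenericUDAFEvaluator.AggregationBuffer;
import org.apache.hadoop.hive.ql.udf.generic.GenericUDAFEvaluator.Mode;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;

public class LettersEvaluatorHarness {

    public static void main(String[] args) throws Exception {
        ObjectInspector stringOI = PrimitiveObjectInspectorFactory.javaStringObjectInspector;
        ObjectInspector intOI = PrimitiveObjectInspectorFactory.javaIntObjectInspector;

        // "map" side: two mappers, each seeing part of the people table
        Object partial1 = runMapper(stringOI, "John Smith", "John and Ann White"); // 28
        Object partial2 = runMapper(stringOI, "Ted Green", "Dorothy");             // 16

        // "reduce" side: merge the partials and produce the final result
        GenericUDAFEvaluator reducer =
                new TotalNumOfLettersGenericUDAF.TotalNumOfLettersEvaluator();
        reducer.init(Mode.FINAL, new ObjectInspector[] { intOI });
        AggregationBuffer buf = reducer.getNewAggregationBuffer();
        reducer.merge(buf, partial1);
        reducer.merge(buf, partial2);
        System.out.println(reducer.terminate(buf)); // expect 44
    }

    // run one simulated mapper over a handful of rows and emit its partial sum
    private static Object runMapper(ObjectInspector stringOI, String... rows)
            throws Exception {
        GenericUDAFEvaluator mapper =
                new TotalNumOfLettersGenericUDAF.TotalNumOfLettersEvaluator();
        mapper.init(Mode.PARTIAL1, new ObjectInspector[] { stringOI });
        AggregationBuffer buf = mapper.getNewAggregationBuffer();
        for (String row : rows) {
            mapper.iterate(buf, new Object[] { row });
        }
        return mapper.terminatePartial(buf);
    }
}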

This is actually the recommended workflow for developers wishing to submit their functions to the Hive project itself. See the “Creating the tests” section of the official GenericUDAF tutorial.

Finishing up

By now you should be a pro at customizing Hive functions.

If you need more resources you can check out my personal blog post for a walkthrough of building regular user defined functions, or take a look at the Apache Hive Book.


