Writing and Using UDTFs in Hive

1. UDTF Introduction and Implementation

1.1 Introduction

A Hive UDTF can turn one input row into a single row with multiple columns, or into multiple rows with multiple columns, and is used quite frequently. This article walks through how to write and use a UDTF with a concrete example, and explains how it works under the hood. Before reading this article, please read the earlier article on writing UDFs.

Test data:

drop table if exists test;
create table test
(
  ind int,
  col string,
  col1 string
) ;
insert into test values (1,'a,b,c','1,2');
insert into test values (2,'j,k',null);
insert into test values (3,null,null) ;

For the first row, the expected output is:

ind  key  value
1    a    1
1    b    2
1    c    NULL

The other rows should produce similar output; if the input column is NULL, the row produces no output at all.

1.2 Implementing the UDTF

To write a UDTF (User-Defined Table-Generating Function), extend the GenericUDTF class. The relevant parts of that class look like this:

/**
 * A Generic User-defined Table Generating Function (UDTF)
 *
 * Generates a variable number of output rows for a single input row. Useful for
 * explode(array)...
 */
public abstract class GenericUDTF {

  public StructObjectInspector initialize(StructObjectInspector argOIs)
        throws UDFArgumentException {
    List<? extends StructField> inputFields = argOIs.getAllStructFieldRefs();
    ObjectInspector[] udtfInputOIs = new ObjectInspector[inputFields.size()];
    for (int i = 0; i < inputFields.size(); i++) {
      udtfInputOIs[i] = inputFields.get(i).getFieldObjectInspector();
    }
    return initialize(udtfInputOIs);
  }

  /**
   * Give a set of arguments for the UDTF to process.
   *
   * @param args
   *          object array of arguments
   */
  public abstract void process(Object[] args) throws HiveException;

  /**
   * Called to notify the UDTF that there are no more rows to process.
   * Clean up code or additional forward() calls can be made here.
   */
  public abstract void close() throws HiveException;
}

A subclass of GenericUDTF must implement the methods above. The initialize method is similar to its UDF counterpart: it validates the input types and declares the names and types of the returned columns. The process method is invoked once for every input row passed to the UDTF and emits one or more output rows by calling forward. The close method is called exactly once, after all calls to process have finished, and can be used for cleanup or to emit any remaining rows.

package com.practice.hive.udtf;

import java.util.List;

import com.google.common.collect.Lists;
import org.apache.hadoop.hive.ql.exec.UDFArgumentException;
import org.apache.hadoop.hive.ql.metadata.HiveException;
import org.apache.hadoop.hive.ql.udf.generic.GenericUDTF;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorFactory;
import org.apache.hadoop.hive.serde2.objectinspector.StructObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;

/**
 * @author liufeifei
 * @date 2018/06/20
 */
public class ArrToMapUDTF extends GenericUDTF {

    private String[] obj = new String[2];

    /**
     * Declares two output columns, both of type string.
     *
     * @param argOIs
     * @return
     * @throws UDFArgumentException
     */
    @Override
    public StructObjectInspector initialize(StructObjectInspector argOIs) throws UDFArgumentException {
        List<String> colName = Lists.newLinkedList();
        colName.add("key");
        colName.add("value");
        List<ObjectInspector> resType = Lists.newLinkedList();
        resType.add(PrimitiveObjectInspectorFactory.javaStringObjectInspector);
        resType.add(PrimitiveObjectInspectorFactory.javaStringObjectInspector);
        // Return the output column names and their types
        return ObjectInspectorFactory.getStandardStructObjectInspector(colName, resType);
    }

    @Override
    public void process(Object[] args) throws HiveException {
        // A NULL first argument produces no output rows
        if (args[0] == null) {
            return;
        }
        String arg1 = args[0].toString();

        String[] arr1 = arg1.split(",");
        String[] arr2 = null;
        if (args[1] != null) {
            arr2 = args[1].toString().split(",");
        }

        // Pair elements by index; pad missing values with null
        for (int i = 0; i < arr1.length; i++) {
            obj[0] = arr1[i];
            if (arr2 != null && arr2.length > i) {
                obj[1] = arr2[i];
            } else {
                obj[1] = null;
            }
            forward(obj);
        }
    }

    @Override
    public void close() throws HiveException {
    }
}
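The pairing logic inside process can be exercised outside of Hive by extracting it into a plain helper. The sketch below is purely illustrative (ArrToMapLogic and pair are hypothetical names, not part of the Hive API); it mirrors the split-and-pad behavior so it can be sanity-checked without a cluster:

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch: replicates the pairing logic of ArrToMapUDTF.process
// as a pure function, so it can be tested without a Hive runtime.
public class ArrToMapLogic {

    // Splits col and col1 on commas and pairs elements by index;
    // missing right-hand values become null, mirroring forward(obj).
    public static List<String[]> pair(String col, String col1) {
        List<String[]> rows = new ArrayList<>();
        if (col == null) {
            return rows; // NULL input produces no output rows
        }
        String[] keys = col.split(",");
        String[] values = (col1 == null) ? null : col1.split(",");
        for (int i = 0; i < keys.length; i++) {
            String value = (values != null && values.length > i) ? values[i] : null;
            rows.add(new String[] { keys[i], value });
        }
        return rows;
    }

    public static void main(String[] args) {
        for (String[] row : pair("a,b,c", "1,2")) {
            System.out.println(row[0] + "\t" + row[1]);
        }
    }
}
```

Running main reproduces the three output rows expected for the first test row, with the third value null.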

2. Using the UDTF

Execution results:

2.1 Method 1

-- Original data
hive> select * from test;
OK
1   a,b,c   1,2
2   j,k NULL
3   NULL    NULL
Time taken: 0.051 seconds, Fetched: 3 row(s)

-- Register the function and run it
hive> add jar /Users/liufeifei/hive/jar/hive.jar;
Added [/Users/liufeifei/hive/jar/hive.jar] to class path
Added resources: [/Users/liufeifei/hive/jar/hive.jar]
hive> create temporary function get_map as 'com.practice.hive.udtf.ArrToMapUDTF';
OK
Time taken: 0.005 seconds
hive> select get_map(col,col1) from test;
OK
a   1
b   2
c   NULL
j   NULL
k   NULL
Time taken: 1.008 seconds, Fetched: 5 row(s)

2.2 Method 2

The query above shows the basic usage of the get_map function. Its limitation is that no other columns can be referenced in the same query. Combining it with the lateral view keyword achieves the desired result.

hive> select t.ind,t.col,t.col1,t1.key,t1.value
    >   from test t
    > lateral view get_map(col,col1) t1 as key,value;
OK
1   a,b,c   1,2 a   1
1   a,b,c   1,2 b   2
1   a,b,c   1,2 c   NULL
2   j,k NULL    j   NULL
2   j,k NULL    k   NULL
Time taken: 0.045 seconds, Fetched: 5 row(s)

3. Execution Internals

Regarding the lateral view approach above:

This query involves two tables: t (test) and t1. A lateral view effectively joins the two tables, roughly as follows (my own understanding):

1. Project the ind, col, and col1 columns from table t.

2. Apply the UDTF to the col and col1 columns of t; name the resulting data set t1, with its columns named key and value.

3. Join t with t1 to produce the result set.

4. If the UDTF produces no rows but the left-side row should still be kept, use the OUTER keyword.

Quoting the official documentation:

Excerpt 1:

Lateral view is used in conjunction with user-defined table generating functions such as explode(). As mentioned in Built-in Table-Generating Functions, a UDTF generates zero or more output rows for each input row. A lateral view first applies the UDTF to each row of base table and then joins resulting output rows to the input rows to form a virtual table having the supplied table alias.

Excerpt 2:

The user can specify the optional OUTER keyword to generate rows even when a LATERAL VIEW usually would not generate a row. This happens when the UDTF used does not generate any rows which happens easily with explode when the column to explode is empty. In this case the source row would never appear in the results. OUTER can be used to prevent that and rows will be generated with NULL values in the columns coming from the UDTF.
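Applied to the test table from section 1.1, the OUTER behavior described in this excerpt can be sketched as follows. This is a hypothetical run: the extra row shown is inferred from the quoted documentation rather than from an actual execution.

```sql
-- Without OUTER, row 3 (col is NULL) vanishes, because get_map emits no rows.
-- With OUTER, the source row is kept and the UDTF columns are filled with NULL:
select t.ind, t.col, t.col1, t1.key, t1.value
  from test t
lateral view outer get_map(col, col1) t1 as key, value;
-- Expected extra row compared to section 2.2:
-- 3   NULL    NULL    NULL    NULL
```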

Documentation link

The execution plan is as follows:

STAGE PLANS:
  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        TableScan
          alias: t
          Statistics: Num rows: 3 Data size: 26 Basic stats: COMPLETE Column stats: NONE
          Lateral View Forward
            Statistics: Num rows: 3 Data size: 26 Basic stats: COMPLETE Column stats: NONE
            Select Operator
              expressions: ind (type: int), col (type: string), col1 (type: string)
              outputColumnNames: ind, col, col1
              Statistics: Num rows: 3 Data size: 26 Basic stats: COMPLETE Column stats: NONE
              Lateral View Join Operator
                outputColumnNames: _col0, _col1, _col2, _col6, _col7
                Statistics: Num rows: 6 Data size: 52 Basic stats: COMPLETE Column stats: NONE
                Select Operator
                  expressions: _col0 (type: int), _col1 (type: string), _col2 (type: string), _col6 (type: string), _col7 (type: string)
                  outputColumnNames: _col0, _col1, _col2, _col3, _col4
                  Statistics: Num rows: 6 Data size: 52 Basic stats: COMPLETE Column stats: NONE
                  ListSink
            Select Operator
              expressions: col (type: string), col1 (type: string)
              outputColumnNames: _col0, _col1
              Statistics: Num rows: 3 Data size: 26 Basic stats: COMPLETE Column stats: NONE
              UDTF Operator
                Statistics: Num rows: 3 Data size: 26 Basic stats: COMPLETE Column stats: NONE
                function name: com.ffl.study.hive.udtf.ArrToMapUDTF@6b7a0f18
                Lateral View Join Operator
                  outputColumnNames: _col0, _col1, _col2, _col6, _col7
                  Statistics: Num rows: 6 Data size: 52 Basic stats: COMPLETE Column stats: NONE
                  Select Operator
                    expressions: _col0 (type: int), _col1 (type: string), _col2 (type: string), _col6 (type: string), _col7 (type: string)
                    outputColumnNames: _col0, _col1, _col2, _col3, _col4
                    Statistics: Num rows: 6 Data size: 52 Basic stats: COMPLETE Column stats: NONE
                    ListSink

Opening the source code of LateralViewJoinOperator and LateralViewForwardOperator, we find the following comments:

LateralViewJoinOperator
/**
 * The lateral view join operator is used for FROM src LATERAL VIEW udtf()...
 * This operator was implemented with the following operator DAG in mind.
 *
 * For a query such as
 *
 * SELECT pageid, adid.* FROM example_table LATERAL VIEW explode(adid_list) AS
 * adid
 *
 * The top of the operator DAG will look similar to
 *
 *            [Table Scan]
 *                |
 *       [Lateral View Forward]
 *              /   \
 *   [Select](*)    [Select](adid_list)
 *            |      |
 *            |     [UDTF] (explode)
 *            \     /
 *      [Lateral View Join]
 *               |
 *               |
 *      [Select] (pageid, adid.*)
 *               |
 *              ....
 *
 * Rows from the table scan operator are first to a lateral view forward
 * operator that just forwards the row and marks the start of a LV. The
 * select operator on the left picks all the columns while the select operator
 * on the right picks only the columns needed by the UDTF.
 *
 * The output of select in the left branch and output of the UDTF in the right
 * branch are then sent to the lateral view join (LVJ). In most cases, the UDTF
 * will generate > 1 row for every row received from the TS, while the left
 * select operator will generate only one. For each row output from the TS, the
 * LVJ outputs all possible rows that can be created by joining the row from the
 * left select and one of the rows output from the UDTF.
 *
 * Additional lateral views can be supported by adding a similar DAG after the
 * previous LVJ operator.   // i.e., additional lateral views can be chained on afterwards
 */
LateralViewForwardOperator   // aids predicate push-down detection
/**
 * LateralViewForwardOperator. This operator sits at the head of the operator
 * DAG for a lateral view. This does nothing, but it aids the predicate push
 * down during traversal to identify when a lateral view occurs.
 *
 */

As we can see, lateral view behaves much like a join, and lateral view outer behaves like a left join (when explode produces no data, the original row is kept). Lateral view also supports predicate push-down; I will dig into that part of the source code when the need arises. ^~^
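The join analogy can be made concrete with a small model of the Lateral View Join semantics. The sketch below is illustrative only (LateralViewModel is a hypothetical name, not Hive source code): each source row is paired with every row the UDTF emitted, and OUTER pads an empty UDTF result with NULLs.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative model of Lateral View Join semantics: pairs one source row
// with the UDTF output rows; OUTER keeps the row when the UDTF emits nothing.
public class LateralViewModel {

    // leftRow: the projected source columns; udtfRows: rows the UDTF emitted.
    public static List<List<String>> lateralView(List<String> leftRow,
                                                 List<List<String>> udtfRows,
                                                 boolean outer,
                                                 int udtfColumnCount) {
        List<List<String>> result = new ArrayList<>();
        if (udtfRows.isEmpty() && outer) {
            // OUTER: emit the source row once, padded with NULLs
            List<String> padded = new ArrayList<>(leftRow);
            for (int i = 0; i < udtfColumnCount; i++) {
                padded.add(null);
            }
            result.add(padded);
            return result;
        }
        // Inner case: one output row per UDTF row, like a join on the row id
        for (List<String> udtfRow : udtfRows) {
            List<String> joined = new ArrayList<>(leftRow);
            joined.addAll(udtfRow);
            result.add(joined);
        }
        return result;
    }
}
```

With an empty udtfRows list, the inner variant drops the row while the OUTER variant keeps it, matching the left-join analogy above.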

Note: when I tested with large tables on another platform, Method 1 launched only a map task, whereas Method 2 launched both map and reduce tasks. On my local machine the data set is too small to trigger an MR job, so this effect cannot be observed there.

Code link
