Hive ObjectInspector Explained


I ran into ObjectInspector while writing GenericUDF/UDAF/UDTF functions, so this article covers it only from the perspective of writing custom functions.

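As we know, HQL is ultimately compiled into MapReduce jobs for execution. When we write a MapReduce job by hand, we write a Map class and a Reduce class, and we have to declare the data types of their inputs and outputs. Remember that these are not Java's basic types but Hadoop's XxxWritable wrappers: an int is written as IntWritable and a String as Text. In the same way, the role of ObjectInspector is to tell Hive the data types of the input and output (in a custom function this is set up in the initialization method), so that Hive can turn the HQL into an MR program.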

1. Official Explanation

Wiki

Hive uses ObjectInspector to analyze the internal structure of the row object and also the structure of the individual columns.

ObjectInspector provides a uniform way to access complex objects that can be stored in multiple formats in the memory, including:

  • Instance of a Java class (Thrift or native Java)
  • A standard Java object (we use java.util.List to represent Struct and Array, and use java.util.Map to represent Map)
  • A lazily-initialized object (for example, a Struct of string fields stored in a single Java string object with starting offset for each field)

A complex object can be represented by a pair of ObjectInspector and Java Object. The ObjectInspector not only tells us the structure of the Object, but also gives us ways to access the internal fields inside the Object. (The interface source code further down shows the same separation of type and instance: an ObjectInspector records only the type, which it can return directly, and offers access methods that take an Object parameter. It stores no data itself; instead it uses its own type information to interpret the object passed in.)
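To make the "pair of inspector and object" idea concrete, here is a minimal sketch in plain Java. `MiniStructInspector`, `ListBackedInspector`, and `DelimitedStringInspector` are hypothetical stand-ins written for this article, not Hive classes; they only imitate the idea that the same calling code can read a row whether it is stored as a `java.util.List` or as a single delimited string.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

final class UniformAccessDemo {
    // Caller code depends only on the inspector, not on how the row is stored.
    static String describe(MiniStructInspector oi, Object row) {
        StringBuilder sb = new StringBuilder();
        List<String> names = oi.fieldNames();
        for (int i = 0; i < names.size(); i++) {
            if (i > 0) {
                sb.append(", ");
            }
            sb.append(names.get(i)).append('=').append(oi.getField(row, i));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<String> names = Arrays.asList("id", "name");
        // Same logical row, two in-memory representations.
        System.out.println(describe(new ListBackedInspector(names), Arrays.<Object>asList(1, "hive")));
        System.out.println(describe(new DelimitedStringInspector(names, ','), "1,hive"));
    }
}

// Hypothetical, simplified stand-in for a struct inspector (NOT Hive's API).
interface MiniStructInspector {
    // Type information: available without any data object.
    List<String> fieldNames();

    // Data access: the inspector interprets the opaque object passed in.
    Object getField(Object data, int index);
}

// Row stored as a standard Java object: a java.util.List of field values.
final class ListBackedInspector implements MiniStructInspector {
    private final List<String> names;

    ListBackedInspector(List<String> names) {
        this.names = names;
    }

    public List<String> fieldNames() {
        return names;
    }

    public Object getField(Object data, int index) {
        if (data == null) {
            return null;
        }
        List<?> row = (List<?>) data;
        return (index < 0 || index >= row.size()) ? null : row.get(index);
    }
}

// Row stored as one delimited String; fields are split out only when asked for.
final class DelimitedStringInspector implements MiniStructInspector {
    private final List<String> names;
    private final char delimiter;

    DelimitedStringInspector(List<String> names, char delimiter) {
        this.names = names;
        this.delimiter = delimiter;
    }

    public List<String> fieldNames() {
        return names;
    }

    public Object getField(Object data, int index) {
        if (data == null) {
            return null;
        }
        String[] parts = ((String) data).split(Pattern.quote(String.valueOf(delimiter)), -1);
        return (index < 0 || index >= parts.length) ? null : parts[index];
    }
}
```

Both calls in `main` print `id=1, name=hive`. The inspectors hold no row data at all, which is the property the quoted passage describes.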

NOTE: Apache Hive recommends that custom ObjectInspectors created for use with custom SerDes have a no-argument constructor in addition to their normal constructors for serialization purposes. See HIVE-5380 for more details.

JAVA API DOC

ObjectInspector helps us to look into the internal structure of a complex object. A (probably configured) ObjectInspector instance stands for a specific type and a specific way to store the data of that type in the memory. For native java Object, we can directly access the internal structure through member fields and methods. ObjectInspector is a way to delegate that functionality away from the Object, so that we have more control on the behavior of those actions. An efficient implementation of ObjectInspector should rely on factory, so that we can make sure the same ObjectInspector only has one instance. That also makes sure hashCode() and equals() methods of java.lang.Object directly works for ObjectInspector as well.
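The point about factories and `hashCode()`/`equals()` can be sketched in plain Java. `MiniInspector` and `MiniInspectorFactory` below are hypothetical names invented for this illustration, and the caching strategy is an assumption about how such a factory could work, not a copy of Hive's `ObjectInspectorFactory`.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

final class InspectorCacheDemo {
    public static void main(String[] args) {
        MiniInspector stringOi = MiniInspectorFactory.primitive("string");
        MiniInspector a = MiniInspectorFactory.listOf(stringOi);
        MiniInspector b = MiniInspectorFactory.listOf(stringOi);
        System.out.println(a == b);       // true: the factory hands back the cached instance
        System.out.println(a.equals(b));  // true: so the default Object.equals is enough
        System.out.println(a.typeName()); // array<string>
    }
}

// Hypothetical stand-in for an inspector; it deliberately does NOT override
// equals/hashCode and relies on there being one instance per type.
final class MiniInspector {
    private final String typeName;

    MiniInspector(String typeName) {
        this.typeName = typeName;
    }

    String typeName() {
        return typeName;
    }
}

// Hypothetical factory: every request for the same type returns the same object.
final class MiniInspectorFactory {
    private static final Map<String, MiniInspector> PRIMITIVES = new ConcurrentHashMap<>();
    private static final Map<MiniInspector, MiniInspector> LISTS = new ConcurrentHashMap<>();

    private MiniInspectorFactory() {
    }

    static MiniInspector primitive(String name) {
        return PRIMITIVES.computeIfAbsent(name, MiniInspector::new);
    }

    static MiniInspector listOf(MiniInspector element) {
        // Keyed on the element inspector's identity, which works because
        // element inspectors are themselves cached.
        return LISTS.computeIfAbsent(element,
                e -> new MiniInspector("array<" + e.typeName() + ">"));
    }
}
```

Because instances are unique per type, two inspectors describe the same type exactly when they are the same object, so identity comparison and hash-based collections behave correctly without any custom `equals()`.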

2. Class Relationships

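The source code contains many more interfaces; only those I came across while writing custom functions are listed here. Knowing these interfaces and their implementation classes makes ObjectInspector easier to understand.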

2.1 The ObjectInspector interface

public interface ObjectInspector extends Cloneable {

    String getTypeName();

    ObjectInspector.Category getCategory();

    // PRIMITIVE is further subdivided by the values of the PrimitiveCategory enum
    public static enum Category {
        PRIMITIVE,  // primitive data types
        LIST,
        MAP,
        STRUCT,
        UNION;

        private Category() {
        }
    }
}
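As a small illustration of how `getCategory()` and `getTypeName()` relate, the sketch below dispatches on a category to build a nested type name. `MiniCategory` and `MiniType` are hypothetical stand-ins that mirror the enum in the listing; the `array<...>` and `map<...,...>` spellings follow the convention Hive type names use, but the code is my own simplification and handles only three of the five categories.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

final class CategoryDemo {
    public static void main(String[] args) {
        MiniType str = MiniType.primitive("string");
        MiniType intType = MiniType.primitive("int");
        MiniType nested = MiniType.list(MiniType.map(str, intType));
        System.out.println(nested.category + " " + typeName(nested));
    }

    // Walk the type tree, choosing the formatting rule by category.
    static String typeName(MiniType t) {
        switch (t.category) {
            case PRIMITIVE:
                return t.name;
            case LIST:
                return "array<" + typeName(t.children.get(0)) + ">";
            case MAP:
                return "map<" + typeName(t.children.get(0)) + ","
                        + typeName(t.children.get(1)) + ">";
            default:
                // STRUCT and UNION are left out of this sketch.
                throw new IllegalArgumentException("not handled in this sketch: " + t.category);
        }
    }
}

// Mirrors the Category enum from the listing (hypothetical copy, not Hive's class).
enum MiniCategory { PRIMITIVE, LIST, MAP, STRUCT, UNION }

// Hypothetical type descriptor: a category plus child types.
final class MiniType {
    final MiniCategory category;
    final String name;              // only meaningful for PRIMITIVE
    final List<MiniType> children;  // element type, or key and value types

    private MiniType(MiniCategory category, String name, List<MiniType> children) {
        this.category = category;
        this.name = name;
        this.children = children;
    }

    static MiniType primitive(String name) {
        return new MiniType(MiniCategory.PRIMITIVE, name, Collections.<MiniType>emptyList());
    }

    static MiniType list(MiniType element) {
        return new MiniType(MiniCategory.LIST, null, Collections.singletonList(element));
    }

    static MiniType map(MiniType key, MiniType value) {
        return new MiniType(MiniCategory.MAP, null, Arrays.asList(key, value));
    }
}
```

Running `main` prints `LIST array<map<string,int>>`. Code that receives an unknown inspector typically branches on the category first in the same way, then casts to the more specific inspector interface.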

2.1.1 The ListObjectInspector interface

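Main contents:

  • Get the object inspector for the list's elements
  • Get the element instance at a given index of the list
  • Get the length of the list
  • Get the list itself as a java.util.List (the returned List must not be an object that the inspector reuses, unless that List is part of the data object itself)
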
package org.apache.hadoop.hive.serde2.objectinspector;

import org.apache.hadoop.hive.common.classification.InterfaceAudience;
import org.apache.hadoop.hive.common.classification.InterfaceStability;

import java.util.List;

/**
 * ListObjectInspector.
 *
 */
@InterfaceAudience.Public
@InterfaceStability.Stable
public interface ListObjectInspector extends ObjectInspector {

  // ** Methods that does not need a data object **
  ObjectInspector getListElementObjectInspector();

  // ** Methods that need a data object **
  /**
   * returns null for null list, out-of-the-range index.
   */
  Object getListElement(Object data, int index);

  /**
   * returns -1 for data = null.
   */
  int getListLength(Object data);

  /**
   * returns null for data = null.
   * 
   * Note: This method should not return a List object that is reused by the
   * same ListObjectInspector, because it's possible that the same
   * ListObjectInspector will be used in multiple places in the code.
   * 
   * However it's OK if the List object is part of the Object data.
   */
  List<?> getList(Object data);

}

The StandardListObjectInspector implementation class

Key point: create a list object inspector as follows:

ObjectInspectorFactory.getStandardListObjectInspector(ObjectInspector listElementObjectInspector)

Source code:

package org.apache.hadoop.hive.serde2.objectinspector;

import java.util.ArrayList;
import java.util.List;
import java.util.Set;

/**
 * DefaultListObjectInspector works on list data that is stored as a Java List
 * or Java Array object.
 *
 * Always use the ObjectInspectorFactory to create new ObjectInspector objects,
 * instead of directly creating an instance of this class.
 */
public class StandardListObjectInspector implements SettableListObjectInspector {

  // Object inspector for the list's elements
  private ObjectInspector listElementObjectInspector;

  protected StandardListObjectInspector() {
    super();
  }

  /**
   * Call ObjectInspectorFactory.getStandardListObjectInspector instead.
   */
  protected StandardListObjectInspector(
      ObjectInspector listElementObjectInspector) {
    this.listElementObjectInspector = listElementObjectInspector;
  }

  // Returns the LIST category
  public final Category getCategory() {
    return Category.LIST;
  }

  // Without data: returns the element object inspector
  public ObjectInspector getListElementObjectInspector() {
    return listElementObjectInspector;
  }

  // With data: returns the element instance
  @SuppressWarnings({"rawtypes", "unchecked"})
  public Object getListElement(Object data, int index) {
    if (data == null) {
      return null;
    }
    // We support List<Object>, Set<Object> and Object[]
    // so we have to do differently.
    // If data is not a List, it must be either a Set or an array
    if (! (data instanceof List)) {
      // Not a Set either, so treat it as an array
      if (! (data instanceof Set)) {
        Object[] list = (Object[]) data;
        if (index < 0 || index >= list.length) {
          return null;
        }
        return list[index];
      } else {
        // Set case: copy into an ArrayList so it can be indexed
        data = new ArrayList((Set<?>) data);
      }
    // ...
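The getListElement listing above stops partway through the method. As a standalone illustration of the same three-way dispatch (List, Set, Object[]), here is a minimal sketch in plain Java. `ListElementDemo.elementAt` is a hypothetical helper written for this article; it is not the remainder of Hive's method, only an example of the null and out-of-range behaviour described in the ListObjectInspector Javadoc.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

final class ListElementDemo {
    public static void main(String[] args) {
        List<Object> asList = Arrays.<Object>asList("a", "b", "c");
        Set<Object> asSet = new LinkedHashSet<>(asList); // keeps insertion order
        Object[] asArray = {"a", "b", "c"};

        System.out.println(elementAt(asList, 1));   // b
        System.out.println(elementAt(asSet, 1));    // b
        System.out.println(elementAt(asArray, 1));  // b
        System.out.println(elementAt(asArray, 5));  // null (index out of range)
    }

    // Returns null for null data or an out-of-range index.
    static Object elementAt(Object data, int index) {
        if (data == null) {
            return null;
        }
        if (data instanceof Object[]) {
            Object[] array = (Object[]) data;
            return (index < 0 || index >= array.length) ? null : array[index];
        }
        List<?> list;
        if (data instanceof Set) {
            // A Set has no positional access, so copy it into a list first.
            list = new ArrayList<Object>((Set<?>) data);
        } else {
            list = (List<?>) data;
        }
        return (index < 0 || index >= list.size()) ? null : list.get(index);
    }
}
```

Note that indexing into a Set is only meaningful when the set has a stable iteration order, which is why the demo uses a `LinkedHashSet`.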