ObjectInspector lets us examine the internal structure of complex objects. It decouples how data is used from how data is stored, which improves code reuse.
An ObjectInspector instance describes how data of a particular type is laid out in memory and how to access it.
An ObjectInspector object holds no data itself; it only describes the storage type of the data and acts as a unified manager, or proxy, for operations on data objects.
The ObjectInspector interface frees Hive from any single data format, allowing the data flow to:
- switch between different formats at the input and output ends
- use different data formats in different Operators
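The decoupling above can be sketched in plain Java. This is a toy illustration of the idea, not Hive's actual API: the same consuming code reads a field through an inspector-like interface, regardless of whether the row is stored as an `Object[]` or as a delimited string.

```java
// Toy sketch (not Hive's API): an inspector decouples how a consumer reads a
// field from how the row is physically stored.
interface FieldReader {
    Object getField(Object row, int index);
}

// Format 1: rows stored as Object arrays.
class ArrayRowReader implements FieldReader {
    public Object getField(Object row, int index) {
        return ((Object[]) row)[index];
    }
}

// Format 2: rows stored as tab-delimited strings, parsed on access.
class DelimitedRowReader implements FieldReader {
    public Object getField(Object row, int index) {
        return ((String) row).split("\t")[index];
    }
}

public class InspectorSketch {
    // The consumer only needs the (row, reader) pair — it never sees the format.
    static String readSecondField(Object row, FieldReader reader) {
        return String.valueOf(reader.getField(row, 1));
    }

    public static void main(String[] args) {
        System.out.println(readSecondField(new Object[]{"123", "hive"}, new ArrayRowReader()));
        System.out.println(readSecondField("123\thive", new DelimitedRowReader()));
        // Both calls print "hive" even though the storage formats differ.
    }
}
```

Swapping the storage format only requires a new `FieldReader`; every consumer stays unchanged, which is the reuse the text describes.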
An enum, Category, defines the five type categories:
primitive (Primitive), list (List), key-value map (Map), struct (Struct), and union (Union).
The ObjectInspector interface definition:
```java
public interface ObjectInspector extends Cloneable {
  public static enum Category {
    PRIMITIVE, LIST, MAP, STRUCT, UNION
  };

  String getTypeName();

  Category getCategory();
}
```
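Callers typically check getCategory() first and then downcast to the matching sub-interface. A minimal self-contained sketch of that dispatch (the enum mirrors the interface above; describe() is a hypothetical consumer, not part of Hive):

```java
// Sketch of how a consumer dispatches on Category before downcasting.
// The enum mirrors the one in the ObjectInspector interface; describe()
// is a hypothetical consumer, not a Hive method.
public class CategoryDispatch {
    enum Category { PRIMITIVE, LIST, MAP, STRUCT, UNION }

    static String describe(Category c) {
        switch (c) {
            case PRIMITIVE: return "single value; cast to PrimitiveObjectInspector";
            case LIST:      return "element sequence; cast to ListObjectInspector";
            case MAP:       return "key/value pairs; cast to MapObjectInspector";
            case STRUCT:    return "named fields; cast to StructObjectInspector";
            default:        return "tagged alternatives; cast to UnionObjectInspector";
        }
    }

    public static void main(String[] args) {
        System.out.println(describe(Category.STRUCT));
        // prints: named fields; cast to StructObjectInspector
    }
}
```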
The sub-interfaces and abstract classes corresponding to ObjectInspector are:
- StructObjectInspector: resolves one row of data; it is itself composed of a group of ObjectInspectors, one per field
- MapObjectInspector
- ListObjectInspector
- PrimitiveObjectInspector: resolves primitive data types
- UnionObjectInspector
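The recursive composition of StructObjectInspector — a struct inspector built from one inspector per field — can be sketched with simplified plain-Java stand-ins (these Toy* classes are illustrations, not Hive's classes; rows are assumed to be Object arrays):

```java
import java.util.Arrays;
import java.util.List;

// Toy sketch, not Hive's API: a struct inspector is itself composed of a
// list of per-field inspectors, one for each column of the row.
interface ToyObjectInspector {
    String getTypeName();
}

class ToyIntInspector implements ToyObjectInspector {
    public String getTypeName() { return "int"; }
}

class ToyStringInspector implements ToyObjectInspector {
    public String getTypeName() { return "string"; }
}

public class ToyStructInspector implements ToyObjectInspector {
    private final List<ToyObjectInspector> fields;

    ToyStructInspector(ToyObjectInspector... fields) {
        this.fields = Arrays.asList(fields);
    }

    // The struct's type name is assembled from its field inspectors.
    public String getTypeName() {
        StringBuilder sb = new StringBuilder("struct<");
        for (int i = 0; i < fields.size(); i++) {
            if (i > 0) sb.append(',');
            sb.append(fields.get(i).getTypeName());
        }
        return sb.append('>').toString();
    }

    // Resolve one column of a row (here: rows are assumed to be Object[]).
    Object getStructFieldData(Object row, int index) {
        return ((Object[]) row)[index];
    }

    public static void main(String[] args) {
        ToyStructInspector oi = new ToyStructInspector(new ToyIntInspector(), new ToyStringInspector());
        Object[] row = {42, "hive"};
        System.out.println(oi.getTypeName());              // struct<int,string>
        System.out.println(oi.getStructFieldData(row, 1)); // hive
    }
}
```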
Hive SerDe test code:
```java
// Create the schema and store it in a Properties object
private Properties createProperties() {
  Properties tbl = new Properties();
  // Set the configuration parameters
  tbl.setProperty(Constants.SERIALIZATION_FORMAT, "9");
  tbl.setProperty("columns",
      "abyte,ashort,aint,along,adouble,astring,anullint,anullstring");
  tbl.setProperty("columns.types",
      "tinyint:smallint:int:bigint:double:string:int:string");
  tbl.setProperty(Constants.SERIALIZATION_NULL_FORMAT, "NULL");
  return tbl;
}

public void testLazySimpleSerDe() throws Throwable {
  try {
    // Create the SerDe
    LazySimpleSerDe serDe = new LazySimpleSerDe();
    Configuration conf = new Configuration();
    Properties tbl = createProperties();
    // Initialize the SerDe with the Properties
    serDe.initialize(conf, tbl);

    // Data ("1." is not a valid int, so that column is expected to be null)
    Text t = new Text("123\t456\t789\t1000\t5.3\thive and hadoop\t1.\tNULL");
    String s = "123\t456\t789\t1000\t5.3\thive and hadoop\tNULL\tNULL";
    Object[] expectedFieldsData = {new ByteWritable((byte) 123),
        new ShortWritable((short) 456), new IntWritable(789),
        new LongWritable(1000), new DoubleWritable(5.3),
        new Text("hive and hadoop"), null, null};

    // Test
    deserializeAndSerialize(serDe, t, s, expectedFieldsData);
  } catch (Throwable e) {
    e.printStackTrace();
    throw e;
  }
}

private void deserializeAndSerialize(LazySimpleSerDe serDe, Text t, String s,
    Object[] expectedFieldsData) throws SerDeException {
  // Get the row ObjectInspector
  StructObjectInspector oi = (StructObjectInspector) serDe
      .getObjectInspector();
  // Get the column information
  List<? extends StructField> fieldRefs = oi.getAllStructFieldRefs();
  assertEquals(8, fieldRefs.size());

  // Deserialize
  Object row = serDe.deserialize(t);
  for (int i = 0; i < fieldRefs.size(); i++) {
    Object fieldData = oi.getStructFieldData(row, fieldRefs.get(i));
    if (fieldData != null) {
      fieldData = ((LazyPrimitive) fieldData).getWritableObject();
    }
    assertEquals("Field " + i, expectedFieldsData[i], fieldData);
  }

  // Serialize
  assertEquals(Text.class, serDe.getSerializedClass());
  Text serializedText = (Text) serDe.serialize(row, oi);
  assertEquals("Serialized data", s, serializedText.toString());
}
```
Hive decouples reading the columns of a row from how the row is stored.
For a data consumer, the row Object plus its corresponding ObjectInspector are enough to read out every column as an object.
Hive's ExprNodeEvaluator, as well as UDF, UDAF, and UDTF, all operate on such (Object, ObjectInspector) pairs.
The general access pattern (where xxx stands in for a concrete category or type) is:
xxxObjectInspector poi = xxxObjectInspectorFactory.getxxxInspector(xxx type);
xxxObjectInspectorUtils.getxxx(raw object, poi)
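The factory/utils pattern above can be sketched with self-contained toy classes (these are illustrations of the pattern, not Hive's actual ObjectInspectorFactory or ObjectInspectorUtils classes):

```java
// Toy sketch of the factory/utils access pattern (not Hive's real API):
// a factory hands out shared inspector instances, and a utils class reads
// a typed value out of an untyped Object through the matching inspector.
class IntInspector {
    static final IntInspector INSTANCE = new IntInspector();
    int get(Object o) { return (Integer) o; }
}

class InspectorFactory {
    // Inspectors hold no data, so the factory can return singletons.
    static IntInspector getIntInspector() { return IntInspector.INSTANCE; }
}

class InspectorUtils {
    // Consumers always pass the (Object, inspector) pair, never a raw type.
    static int getInt(Object o, IntInspector poi) { return poi.get(o); }
}

public class FactoryPatternDemo {
    public static void main(String[] args) {
        IntInspector poi = InspectorFactory.getIntInspector();
        Object data = 7; // untyped object, e.g. a field from a deserialized row
        System.out.println(InspectorUtils.getInt(data, poi)); // prints 7
    }
}
```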