1. Background
A while back, while doing data synchronization with DataX, I found that source-side data of types such as binary and decimal could not be written into Hive columns. The official DataX HdfsWriter plugin documentation is one to two years old, and from the parts of the source I had read back then I knew the underlying Hadoop libraries actually support writing these types. After changing jobs I never got around to writing it down, so when the question came up again in the DataX group recently, it jogged my memory and I decided to patch the source and record it here.
One important point: DataX is only a framework that orchestrates synchronization between heterogeneous data sources. The actual reading and writing only works if the underlying data source supports it, so to find out whether a feature is supported, the first place to look is the data source itself.
Note: reading binary data back after writing it has its own pitfalls; I will write a separate post on how to support reading and writing binary data on HDFS. The relevant code changes have already been committed.
Feel free to grab it: GitHub repo
Branch: feature_hdfs_writer_decimal_binary_support
2. Environment
DataX version: 3.0
Hadoop version: 2.7.3
Hive version: 2.3.2
3. DataX Source Code
Start from HdfsWriter's startWrite method, which uses the fileType setting in the job configuration to decide which storage format is written to HDFS:
HdfsWriter:
public void startWrite(RecordReceiver lineReceiver) {
LOG.info("begin do write...");
LOG.info(String.format("write to file : [%s]", this.fileName));
if(fileType.equalsIgnoreCase("TEXT")){
// write a TEXT file
hdfsHelper.textFileStartWrite(lineReceiver,this.writerSliceConfig, this.fileName,
this.getTaskPluginCollector());
}else if(fileType.equalsIgnoreCase("ORC")){
// write an ORC file
hdfsHelper.orcFileStartWrite(lineReceiver,this.writerSliceConfig, this.fileName,
this.getTaskPluginCollector());
}
LOG.info("end do write");
}
Step into hdfsHelper to look at the actual write logic:
HdfsHelper:
// TEXT
public void textFileStartWrite(RecordReceiver lineReceiver, Configuration config, String fileName,
TaskPluginCollector taskPluginCollector){
...
RecordWriter writer = outFormat.getRecordWriter(fileSystem, conf, outputPath.toString(), Reporter.NULL);
Record record = null;
while ((record = lineReceiver.getFromReader()) != null) {
MutablePair<Text, Boolean> transportResult = transportOneRecord(record, fieldDelimiter, columns, taskPluginCollector);
if (!transportResult.getRight()) {
writer.write(NullWritable.get(),transportResult.getLeft());
}
}
writer.close(Reporter.NULL);
...
}
// ORC
public void orcFileStartWrite(RecordReceiver lineReceiver, Configuration config, String fileName,
TaskPluginCollector taskPluginCollector){
...
List<String> columnNames = getColumnNames(columns);
// build the ObjectInspector (type serializer) for each column; this method is key, and it is where the decimal support will be added later
List<ObjectInspector> columnTypeInspectors = getColumnTypeInspectors(columns);
StructObjectInspector inspector = (StructObjectInspector)ObjectInspectorFactory
.getStandardStructObjectInspector(columnNames, columnTypeInspectors);
...
RecordWriter writer = outFormat.getRecordWriter(fileSystem, conf, fileName, Reporter.NULL);
Record record = null;
while ((record = lineReceiver.getFromReader()) != null) {
MutablePair<List<Object>, Boolean> transportResult = transportOneRecord(record,columns,taskPluginCollector);
if (!transportResult.getRight()) {
// ORC records must be serialized with the matching ObjectInspectors before they can be written to HDFS
writer.write(NullWritable.get(), orcSerde.serialize(transportResult.getLeft(), inspector));
}
}
writer.close(Reporter.NULL);
...
}
// convert each record string received from the channel according to the configured column type
public static MutablePair<List<Object>, Boolean> transportOneRecord(
Record record,List<Configuration> columnsConfiguration,
TaskPluginCollector taskPluginCollector){
...
for (int i = 0; i < recordLength; i++) {
column = record.getColumn(i);
//todo as method
if (null != column.getRawData()) {
String rowData = column.getRawData().toString();
// DataX's enum of supported Hive data types
SupportHiveDataType columnType = SupportHiveDataType.valueOf(columnsConfiguration.get(i).getString(Key.TYPE).toUpperCase());
// convert the value according to the writer-side type configuration
switch (columnType) {
case TINYINT:
recordList.add(Byte.valueOf(rowData));
break;
...
}
From the code above we can see that writing TEXT files needs no special serialization: for TEXT output it is enough to add the missing type conversions in transportOneRecord. Writing ORC files, however, requires a matching type serializer (ObjectInspector) for each column. So the focus now should be on verifying whether the underlying Hadoop/Hive libraries really lack serializers for binary, decimal, and the other missing types.
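Before digging into the Hive source, a quick sanity check is to ask the Hive SDK directly for inspectors of these types. Below is a throwaway probe, a minimal sketch assuming hive-exec 2.x is on the classpath; the class name InspectorProbe is mine, and the calls are the same ones that show up in the listings that follow.

import org.apache.hadoop.hive.common.type.HiveDecimal;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorFactory;

public class InspectorProbe {
    public static void main(String[] args) {
        // ask the Hive SDK for a decimal inspector backed by HiveDecimal
        ObjectInspector decimalOi = ObjectInspectorFactory.getReflectionObjectInspector(
                HiveDecimal.class, ObjectInspectorFactory.ObjectInspectorOptions.JAVA);
        // ask for a binary inspector backed by byte[]
        ObjectInspector binaryOi = ObjectInspectorFactory.getReflectionObjectInspector(
                byte[].class, ObjectInspectorFactory.ObjectInspectorOptions.JAVA);
        // if both calls return an inspector, the SDK clearly knows these types;
        // the printed type names should be a decimal(...) type and binary
        System.out.println(decimalOi.getTypeName());
        System.out.println(binaryOi.getTypeName());
    }
}

If both calls return an inspector instead of throwing, the SDK side is fine and the gap is only in the DataX plugin.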
In the HdfsHelper listing above I also marked the entry point where ORC obtains the per-column serializers: the HdfsHelper.getColumnTypeInspectors method.
HdfsHelper:
// build an ObjectInspector for each column type configured on the writer
public List<ObjectInspector> getColumnTypeInspectors(List<Configuration> columns){
List<ObjectInspector> columnTypeInspectors = Lists.newArrayList();
for (Configuration eachColumnConf : columns) {
SupportHiveDataType columnType = SupportHiveDataType.valueOf(eachColumnConf.getString(Key.TYPE).toUpperCase());
ObjectInspector objectInspector = null;
switch (columnType) {
case TINYINT:
objectInspector = ObjectInspectorFactory.getReflectionObjectInspector(Byte.class, ObjectInspectorFactory.ObjectInspectorOptions.JAVA);
break;
...
}
The next step is clearly to look inside ObjectInspectorFactory to find out which ObjectInspector class each type maps to, which takes us into the Hive source code.
ObjectInspectorFactory:
public static ObjectInspector getReflectionObjectInspector(Type t, ObjectInspectorFactory.ObjectInspectorOptions options) {
// try the cache first
ObjectInspector oi = (ObjectInspector)objectInspectorCache.get(t);
if (oi == null) {
// not in the cache: resolve the actual inspector and add it to the cache
oi = getReflectionObjectInspectorNoCache(t, options);
objectInspectorCache.put(t, oi);
}
...
return oi;
}
private static ObjectInspector getReflectionObjectInspectorNoCache(Type t, ObjectInspectorFactory.ObjectInspectorOptions options) {
// complex types (Map, Array) are checked first, which shows the Hive SDK itself also supports writing these composite types
if (t instanceof GenericArrayType) {
GenericArrayType at = (GenericArrayType)t;
return getStandardListObjectInspector(getReflectionObjectInspector(at.getGenericComponentType(), options));
} else {
if (t instanceof ParameterizedType) {
ParameterizedType pt = (ParameterizedType)t;
if (List.class.isAssignableFrom((Class)pt.getRawType()) || Set.class.isAssignableFrom((Class)pt.getRawType())) {
return getStandardListObjectInspector(getReflectionObjectInspector(pt.getActualTypeArguments()[0], options));
}
if (Map.class.isAssignableFrom((Class)pt.getRawType())) {
return getStandardMapObjectInspector(getReflectionObjectInspector(pt.getActualTypeArguments()[0], options), getReflectionObjectInspector(pt.getActualTypeArguments()[1], options));
}
t = pt.getRawType();
}
if (!(t instanceof Class)) {
throw new RuntimeException(ObjectInspectorFactory.class.getName() + " internal error:" + t);
} else {
Class<?> c = (Class)t;
// depending on the kind of class passed in, look up the class object in the corresponding cache
if (PrimitiveObjectInspectorUtils.isPrimitiveJavaType(c)) {
return PrimitiveObjectInspectorFactory.getPrimitiveJavaObjectInspector(PrimitiveObjectInspectorUtils.getTypeEntryFromPrimitiveJavaType(c).primitiveCategory);
} else if (PrimitiveObjectInspectorUtils.isPrimitiveJavaClass(c)) {
return PrimitiveObjectInspectorFactory.getPrimitiveJavaObjectInspector(PrimitiveObjectInspectorUtils.getTypeEntryFromPrimitiveJavaClass(c).primitiveCategory);
} else if (PrimitiveObjectInspectorUtils.isPrimitiveWritableClass(c)) {
return PrimitiveObjectInspectorFactory.getPrimitiveWritableObjectInspector(PrimitiveObjectInspectorUtils.getTypeEntryFromPrimitiveWritableClass(c).primitiveCategory);
}
...
}
}
The code is clear enough: by looking at how these cached classes are initialized, we can tell which classes to use for our changes.
PrimitiveObjectInspectorUtils:
// register a type entry into the caches
static void registerType(PrimitiveObjectInspectorUtils.PrimitiveTypeEntry t) {
...
if (t.primitiveJavaType != null) {
primitiveJavaTypeToTypeEntry.put(t.primitiveJavaType, t);
}
if (t.primitiveJavaClass != null) {
primitiveJavaClassToTypeEntry.put(t.primitiveJavaClass, t);
}
if (t.primitiveWritableClass != null) {
primitiveWritableClassToTypeEntry.put(t.primitiveWritableClass, t);
}
...
}
// static initializer that registers the built-in type entries
static {
binaryTypeEntry = new PrimitiveObjectInspectorUtils.PrimitiveTypeEntry(PrimitiveCategory.BINARY, "binary", byte[].class, byte[].class, BytesWritable.class);
stringTypeEntry = new PrimitiveObjectInspectorUtils.PrimitiveTypeEntry(PrimitiveCategory.STRING, "string", (Class)null, String.class, Text.class);
booleanTypeEntry = new PrimitiveObjectInspectorUtils.PrimitiveTypeEntry(PrimitiveCategory.BOOLEAN, "boolean", Boolean.TYPE, Boolean.class, BooleanWritable.class);
intTypeEntry = new PrimitiveObjectInspectorUtils.PrimitiveTypeEntry(PrimitiveCategory.INT, "int", Integer.TYPE, Integer.class, IntWritable.class);
longTypeEntry = new PrimitiveObjectInspectorUtils.PrimitiveTypeEntry(PrimitiveCategory.LONG, "bigint", Long.TYPE, Long.class, LongWritable.class);
floatTypeEntry = new PrimitiveObjectInspectorUtils.PrimitiveTypeEntry(PrimitiveCategory.FLOAT, "float", Float.TYPE, Float.class, FloatWritable.class);
voidTypeEntry = new PrimitiveObjectInspectorUtils.PrimitiveTypeEntry(PrimitiveCategory.VOID, "void", Void.TYPE, Void.class, NullWritable.class);
doubleTypeEntry = new PrimitiveObjectInspectorUtils.PrimitiveTypeEntry(PrimitiveCategory.DOUBLE, "double", Double.TYPE, Double.class, DoubleWritable.class);
byteTypeEntry = new PrimitiveObjectInspectorUtils.PrimitiveTypeEntry(PrimitiveCategory.BYTE, "tinyint", Byte.TYPE, Byte.class, ByteWritable.class);
shortTypeEntry = new PrimitiveObjectInspectorUtils.PrimitiveTypeEntry(PrimitiveCategory.SHORT, "smallint", Short.TYPE, Short.class, ShortWritable.class);
dateTypeEntry = new PrimitiveObjectInspectorUtils.PrimitiveTypeEntry(PrimitiveCategory.DATE, "date", (Class)null, Date.class, DateWritable.class);
timestampTypeEntry = new PrimitiveObjectInspectorUtils.PrimitiveTypeEntry(PrimitiveCategory.TIMESTAMP, "timestamp", (Class)null, Timestamp.class, TimestampWritable.class);
decimalTypeEntry = new PrimitiveObjectInspectorUtils.PrimitiveTypeEntry(PrimitiveCategory.DECIMAL, "decimal", (Class)null, HiveDecimal.class, HiveDecimalWritable.class);
varcharTypeEntry = new PrimitiveObjectInspectorUtils.PrimitiveTypeEntry(PrimitiveCategory.VARCHAR, "varchar", (Class)null, HiveVarchar.class, HiveVarcharWritable.class);
charTypeEntry = new PrimitiveObjectInspectorUtils.PrimitiveTypeEntry(PrimitiveCategory.CHAR, "char", (Class)null, HiveChar.class, HiveCharWritable.class);
unknownTypeEntry = new PrimitiveObjectInspectorUtils.PrimitiveTypeEntry(PrimitiveCategory.UNKNOWN, "unknown", (Class)null, Object.class, (Class)null);
registerType(binaryTypeEntry);
registerType(stringTypeEntry);
registerType(charTypeEntry);
registerType(varcharTypeEntry);
registerType(booleanTypeEntry);
registerType(intTypeEntry);
registerType(longTypeEntry);
registerType(floatTypeEntry);
registerType(voidTypeEntry);
registerType(doubleTypeEntry);
registerType(byteTypeEntry);
registerType(shortTypeEntry);
registerType(dateTypeEntry);
registerType(timestampTypeEntry);
registerType(decimalTypeEntry);
registerType(unknownTypeEntry);
}
At this point it is clear that Hive does support writing binary, decimal, and similar types; we only need to pass in the right class. Taking decimal as the example, there are two candidates: HiveDecimal.class and HiveDecimalWritable.class. So back in HdfsHelper, add a DECIMAL case, and add a DECIMAL entry to the SupportHiveDataType enum (a sketch of the enum change follows the snippet below):
case DECIMAL:
objectInspector = ObjectInspectorFactory.getReflectionObjectInspector(HiveDecimal.class, ObjectInspectorFactory.ObjectInspectorOptions.JAVA);
break;
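For reference, here is a sketch of the enum change mentioned above. I am assuming the existing entries mirror the types already handled in the switch statements (the exact list may differ in your copy); the point is simply that SupportHiveDataType.valueOf() must be able to resolve the new column types from the job configuration.

public enum SupportHiveDataType {
    TINYINT, SMALLINT, INT, BIGINT,
    FLOAT, DOUBLE,
    STRING, VARCHAR, CHAR,
    BOOLEAN,
    DATE, TIMESTAMP,
    // new entries so that valueOf(type.toUpperCase()) accepts "decimal" and "binary"
    DECIMAL,
    BINARY
}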
There is one more pitfall I did not notice at first. Because I tested TEXT file output first, I had used Java's BigDecimal for the type conversion in transportOneRecord:
In the transportOneRecord method:
case DECIMAL:
recordList.add(new BigDecimal(rowData));
break;
As a result, running DataX fails with:
java.math.BigDecimal cannot be cast to org.apache.hadoop.hive.common.type.HiveDecimal
Reading the source showed that, at write time, the matching ObjectInspector is used to extract the record value, and underneath it only accepts HiveDecimal objects, so the original BigDecimal just needs to be replaced with a HiveDecimal:
case DECIMAL:
recordList.add(HiveDecimal.create(new BigDecimal(rowData)));
break;
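Binary follows the same pattern, roughly as sketched below. The exact conversion depends on how the reader hands over the value; here I assume the column is a BytesColumn whose asBytes() returns the raw byte[], and I only show the ORC path (as noted at the beginning, reading binary back has its own pitfalls and is covered separately).

// HdfsHelper.getColumnTypeInspectors: byte[] is the Java class registered for BINARY
// in PrimitiveObjectInspectorUtils, so the reflection inspector is built from byte[].class
case BINARY:
    objectInspector = ObjectInspectorFactory.getReflectionObjectInspector(byte[].class, ObjectInspectorFactory.ObjectInspectorOptions.JAVA);
    break;

// HdfsHelper.transportOneRecord: pass the raw bytes through instead of stringifying them
case BINARY:
    recordList.add(column.asBytes());
    break;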
The ORC write call chain is listed below for anyone who wants to trace it; I will not go into it here.
1. HdfsWriter$Task.startWrite ->
2. HdfsHelper.orcFileStartWrite ->
3. OrcOutputFormat$OrcRecordWriter.write ->
4. WriterImpl.addRow ->
5. WriterImpl$StructTreeWriter.write ->
6. WriterImpl$DecimalTreeWriter.write (each supported type has its own TreeWriter)
4. Verifying the Result
Compile and package the hdfswriter plugin, and replace the one under datax/plugins/writer/hdfswriter with it.
Note: the command to package a single module is mvn -U clean package -pl {module} -am -DskipTests
mvn -U clean package -pl hdfswriter -am -DskipTests
Create the corresponding table and some test records in MySQL.
Create the ORC table in Hive:
create table test_decimal(
id int,
money decimal(10,4)
)
row format delimited
fields terminated by ','
STORED AS ORC
;
Finally, the written result can be checked in Hive.
5. Code Location
The code is in my fork of DataX.
Branch: feature_hdfs_writer_decimal_binary_support