How DataX HdfsWriter Can Support Writing Composite Types Such as Map, List, and Set

1. Background

While analyzing hdfswriter earlier, I noticed that the underlying Hive API supports writing map, list, and set values, so I wanted to merge this capability into DataX, so that source-side data such as JSON or arrays can be imported into Hive directly.

2. Environment

DataX version: 3.0

Hadoop version: 2.7.3

Hive version: 2.3.2

3. Source Code Analysis

When writing to HDFS in Hive's file formats, the crucial step is obtaining an ObjectInspector matching each column type, so we start by looking at the ObjectInspectorFactory class:

ObjectInspectorFactory:

private static ObjectInspector getReflectionObjectInspectorNoCache(Type t, ObjectInspectorFactory.ObjectInspectorOptions options) {
    ...
    if (t instanceof ParameterizedType) {
        ParameterizedType pt = (ParameterizedType) t;
        // List and Set share the same standard list inspector,
        // parameterized by the element type's inspector
        if (List.class.isAssignableFrom((Class) pt.getRawType()) || Set.class.isAssignableFrom((Class) pt.getRawType())) {
            return getStandardListObjectInspector(getReflectionObjectInspector(pt.getActualTypeArguments()[0], options));
        }

        // Map gets a standard map inspector, parameterized by the
        // inspectors of its key and value types
        if (Map.class.isAssignableFrom((Class) pt.getRawType())) {
            return getStandardMapObjectInspector(getReflectionObjectInspector(pt.getActualTypeArguments()[0], options), getReflectionObjectInspector(pt.getActualTypeArguments()[1], options));
        }

        t = pt.getRawType();
    }
    ...
}

As the code shows, when the type is Map, List, or Set, a dedicated factory method builds the corresponding inspector. Below, the Map type is used as the worked example:

ObjectInspectorFactory:

public static StandardMapObjectInspector getStandardMapObjectInspector(ObjectInspector mapKeyObjectInspector, ObjectInspector mapValueObjectInspector) {
    // the (key inspector, value inspector) pair is the cache key
    ArrayList<ObjectInspector> signature = new ArrayList(2);
    signature.add(mapKeyObjectInspector);
    signature.add(mapValueObjectInspector);
    StandardMapObjectInspector result = (StandardMapObjectInspector) cachedStandardMapObjectInspector.get(signature);
    if (result == null) {
        result = new StandardMapObjectInspector(mapKeyObjectInspector, mapValueObjectInspector);
        // another thread may have inserted first; keep whichever instance won
        StandardMapObjectInspector prev = (StandardMapObjectInspector) cachedStandardMapObjectInspector.putIfAbsent(signature, result);
        if (prev != null) {
            result = prev;
        }
    }

    return result;
}

This method tells us that to obtain a map inspector we must supply the inspectors for the key and value types. Hive has no Object type, and the source values we read are usually JSON-formatted strings anyway, so in practice everything is received as a string. The hdfswriter changes below therefore assume both key and value are String:
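To make the inspector's role concrete, here is a minimal, runnable sketch; the class name and sample data are invented for illustration:

import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorFactory;
import org.apache.hadoop.hive.serde2.objectinspector.StandardMapObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;

public class MapInspectorDemo {
    public static void main(String[] args) {
        // the same string inspector serves for both key and value
        ObjectInspector stringOI = PrimitiveObjectInspectorFactory.javaStringObjectInspector;
        StandardMapObjectInspector mapOI =
                ObjectInspectorFactory.getStandardMapObjectInspector(stringOI, stringOI);

        // a plain java.util.Map is all the inspector needs to navigate
        Map<String, String> student = new HashMap<>();
        student.put("math", "90");

        System.out.println(mapOI.getMapSize(student));                 // 1
        System.out.println(mapOI.getMapValueElement(student, "math")); // 90
    }
}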

1. First, add a MAP constant to the SupportHiveDataType enum (see the sketch after this list).

2. Then modify two places in HdfsHelper: one converts the source value into a Map, the other obtains the ObjectInspector for the MAP type.
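A minimal sketch of the enum change; the pre-existing constants listed here mirror the stock hdfswriter plugin, and only MAP is new:

public enum SupportHiveDataType {
    TINYINT, SMALLINT, INT, BIGINT,
    FLOAT, DOUBLE,
    STRING, VARCHAR, CHAR,
    BOOLEAN,
    DATE, TIMESTAMP,
    // newly added: backs Hive map<string,string> columns
    MAP
}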

HdfsHelper:

public List<ObjectInspector> getColumnTypeInspectors(List<Configuration> columns) {
    ...
    case MAP:
        // both key and value are inspected as java String
        ObjectInspector stringInspector = ObjectInspectorFactory.getReflectionObjectInspector(String.class, ObjectInspectorFactory.ObjectInspectorOptions.JAVA);
        objectInspector = ObjectInspectorFactory.getStandardMapObjectInspector(stringInspector, stringInspector);
        break;
    ...
}
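For context, here is a hedged sketch of how such per-column inspectors are typically consumed on the ORC path; the class and method names are illustrative, with the real logic living in HdfsHelper's ORC writing code. The inspectors are assembled into a struct inspector and handed to Hive's OrcSerde, which is what ultimately walks the Map value:

import java.util.List;

import org.apache.hadoop.hive.ql.io.orc.OrcSerde;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorFactory;
import org.apache.hadoop.hive.serde2.objectinspector.StructObjectInspector;
import org.apache.hadoop.io.Writable;

public class OrcRowSketch {
    // columnNames comes from the job's "column" config; columnOIs is the
    // result of getColumnTypeInspectors; recordList is one converted row
    static Writable serializeRow(List<String> columnNames,
                                 List<ObjectInspector> columnOIs,
                                 List<Object> recordList) {
        StructObjectInspector rowOI = ObjectInspectorFactory
                .getStandardStructObjectInspector(columnNames, columnOIs);
        // OrcSerde walks recordList with rowOI, so a java.util.Map added
        // for a MAP column is serialized by the map inspector created above
        return new OrcSerde().serialize(recordList, rowOI);
    }
}

The second HdfsHelper change is in transportOneRecord, which converts the source column value: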

public static MutablePair<List<Object>, Boolean> transportOneRecord(
        Record record, List<Configuration> columnsConfiguration,
        TaskPluginCollector taskPluginCollector) {
    ...
    case MAP:
        // the source column carries a JSON string; parse it into a java.util.Map
        recordList.add(JSON.parseObject(column.asString(), Map.class));
        break;
    ...
}
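A quick, self-contained illustration of what that branch does to one column value; fastjson is the JSON library the snippet above already uses, and the sample value is invented:

import java.util.Map;

import com.alibaba.fastjson.JSON;

public class MapColumnParseDemo {
    public static void main(String[] args) {
        // hypothetical source value, e.g. from a MySQL varchar/json column
        String columnValue = "{\"math\":\"90\",\"english\":\"85\"}";
        Map map = JSON.parseObject(columnValue, Map.class);
        // the row now carries a real Map, which the map ObjectInspector
        // from getColumnTypeInspectors knows how to serialize
        System.out.println(map.get("math")); // 90
    }
}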

4. Test Results

Source data on the MySQL side:

Query result after writing as ORC:

For the text fileType, however, an additional change is still needed; for the underlying idea, see the article "Writing data to a Hive map-type column from Java":

public static MutablePair<List<Object>, Boolean> transportOneRecord(
        Record record, List<Configuration> columnsConfiguration,
        TaskPluginCollector taskPluginCollector,
        // new parameter distinguishing text from orc
        String fileType) {
    ...
    case MAP:
        Map map = JSON.parseObject(column.asString(), Map.class);
        if (fileType.equalsIgnoreCase("TEXT")) {
            // text files are written as plain delimited strings, so flatten
            // {k1=v1, k2=v2} into k1=v1,k2=v2 to match the table's
            // collection (',') and map-key ('=') delimiters
            recordList.add(map.toString().replaceAll("\\{|}|\\s", ""));
        } else {
            recordList.add(map);
        }
        break;
    ...
}
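A small sketch of the flattening itself, with invented sample data:

import java.util.LinkedHashMap;
import java.util.Map;

public class MapToTextDemo {
    public static void main(String[] args) {
        Map<String, String> map = new LinkedHashMap<>();
        map.put("math", "90");
        map.put("english", "85");

        // Map.toString() yields "{math=90, english=85}"; stripping braces
        // and whitespace leaves "math=90,english=85", matching the ','
        // collection delimiter and '=' map-key delimiter declared below
        String flattened = map.toString().replaceAll("\\{|}|\\s", "");
        System.out.println(flattened); // math=90,english=85
    }
}

Note that this relies on Map.toString(), so it breaks if a key or value itself contains ',', '=', braces, or meaningful whitespace, since the replaceAll strips all whitespace.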

When creating the table, the collection delimiters must also be specified:

create table test_map(
    id      int,
    student map<string,string>
)
row format delimited
fields terminated by '|'
collection items terminated by ','
map keys terminated by '='
;
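With these delimiters, an illustrative text-file row such as 1|math=90,english=85 parses as id = 1 and student = {"math":"90","english":"85"}.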

The DataX job then specifies the matching field delimiter on the writer:

"writer": {
	"name": "hdfswriter",
	"parameter": {
	"defaultFS": "hdfs://node:9000",
	"fileType": "text",
	"path": "/usr/hive/warehouse/test.db/test_map/",
	"fileName": "test",
	"column": [
	  {
	    "name": "id",
	    "type": "int"
	  },
	  {
	    "name": "student",
	    "type": "map"
	  }
	],
	"writeMode": "append",
	"fieldDelimiter": "|"
}
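Note that the fieldDelimiter here ('|') must differ from the ',' and '=' delimiters that appear inside the flattened map value; otherwise Hive would split the map content across columns.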

 

Final execution result:

5. Code Repository

GitHub repository
