1. Problem Background
While analyzing hdfswriter earlier, I noticed that the Hive API supports writing map, list, and set types, so I wanted to merge this capability into DataX so that source columns holding JSON, arrays, and similar structures can be imported directly into Hive.
2. Environment
DataX version: 3.0
Hadoop version: 2.7.3
Hive version: 2.3.2
3. Source Code
When writing to HDFS, the crucial step is obtaining an ObjectInspector that matches each column's type, so we start from the ObjectInspectorFactory class:
ObjectInspectorFactory:
private static ObjectInspector getReflectionObjectInspectorNoCache(Type t, ObjectInspectorFactory.ObjectInspectorOptions options) {
    ...
    if (t instanceof ParameterizedType) {
        ParameterizedType pt = (ParameterizedType) t;
        if (List.class.isAssignableFrom((Class) pt.getRawType()) || Set.class.isAssignableFrom((Class) pt.getRawType())) {
            return getStandardListObjectInspector(getReflectionObjectInspector(pt.getActualTypeArguments()[0], options));
        }
        if (Map.class.isAssignableFrom((Class) pt.getRawType())) {
            return getStandardMapObjectInspector(getReflectionObjectInspector(pt.getActualTypeArguments()[0], options), getReflectionObjectInspector(pt.getActualTypeArguments()[1], options));
        }
        t = pt.getRawType();
    }
    ...
}
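To see how this reflection path resolves a parameterized type in practice, here is a minimal, hypothetical probe (the class and field names are made up for illustration, and it assumes hive-exec is on the classpath): a Map<String, String> field goes through the ParameterizedType branch above and comes back as a StandardMapObjectInspector.
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorFactory;
import java.lang.reflect.Type;
import java.util.Map;

public class InspectorProbe {
    // Dummy field whose generic type is read back via reflection.
    private Map<String, String> sample;

    public static void main(String[] args) throws Exception {
        Type t = InspectorProbe.class.getDeclaredField("sample").getGenericType();
        ObjectInspector oi = ObjectInspectorFactory.getReflectionObjectInspector(
                t, ObjectInspectorFactory.ObjectInspectorOptions.JAVA);
        // Expected to print "MAP / StandardMapObjectInspector".
        System.out.println(oi.getCategory() + " / " + oi.getClass().getSimpleName());
    }
}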
As the code shows, when the type is Map, List, or Set, there is a dedicated factory method for building the actual inspector. The Map type is used as the example below:
ObjectInspectorFactory:
public static StandardMapObjectInspector getStandardMapObjectInspector(ObjectInspector mapKeyObjectInspector, ObjectInspector mapValueObjectInspector) {
    ArrayList<ObjectInspector> signature = new ArrayList(2);
    signature.add(mapKeyObjectInspector);
    signature.add(mapValueObjectInspector);
    StandardMapObjectInspector result = (StandardMapObjectInspector) cachedStandardMapObjectInspector.get(signature);
    if (result == null) {
        result = new StandardMapObjectInspector(mapKeyObjectInspector, mapValueObjectInspector);
        StandardMapObjectInspector prev = (StandardMapObjectInspector) cachedStandardMapObjectInspector.putIfAbsent(signature, result);
        if (prev != null) {
            result = prev;
        }
    }
    return result;
}
This method shows that building a Map inspector requires the inspectors for the corresponding key and value types. Since Hive has no generic "object" type, the usual practice is to receive every map value as a string, and the source data is typically a JSON-formatted string anyway. The hdfswriter is therefore modified below with both key and value treated as String:
1. First, add a MAP entry to the SupportHiveDataType enum (see the sketch after this list).
2. Modify two places in HdfsHelper: one converts the source column into a Map, the other obtains the ObjectInspector for the Map type.
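Step 1 is a one-line enum change. A minimal sketch, with the existing constants abbreviated from the DataX 3.0 hdfswriter source; only MAP is new:
public enum SupportHiveDataType {
    TINYINT, SMALLINT, INT, BIGINT,
    FLOAT, DOUBLE,
    STRING, VARCHAR, CHAR,
    BOOLEAN, DATE, TIMESTAMP,
    // Newly added: map columns (key and value both handled as String)
    MAP
}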
HdfsHelper:
public List<ObjectInspector> getColumnTypeInspectors(List<Configuration> columns) {
    ...
    case MAP:
        // Key and value are both treated as String, so the same String inspector is reused for both.
        ObjectInspector stringInspector = ObjectInspectorFactory.getReflectionObjectInspector(String.class, ObjectInspectorFactory.ObjectInspectorOptions.JAVA);
        objectInspector = ObjectInspectorFactory.getStandardMapObjectInspector(stringInspector, stringInspector);
        break;
    ...
}
public static MutablePair<List<Object>, Boolean> transportOneRecord(
        Record record, List<Configuration> columnsConfiguration,
        TaskPluginCollector taskPluginCollector) {
    ...
    case MAP:
        // Parse the source column (a JSON string) into a Map
        recordList.add(JSON.parseObject(column.asString(), Map.class));
        break;
    ...
}
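As an illustration of what this branch does (not part of the patch), the snippet below shows how a JSON string from the source column becomes a java.util.Map. It assumes fastjson (com.alibaba.fastjson), which DataX already bundles, and a made-up sample value; fastjson materializes Map.class as a plain HashMap here.
import com.alibaba.fastjson.JSON;
import java.util.Map;

public class MapColumnDemo {
    public static void main(String[] args) {
        // Hypothetical source cell, e.g. a JSON column read by mysqlreader.
        String raw = "{\"name\":\"Tom\",\"age\":\"18\"}";
        Map map = JSON.parseObject(raw, Map.class);
        // Prints something like {name=Tom, age=18}; key order is not guaranteed.
        // This Map is what gets written through the StandardMapObjectInspector above.
        System.out.println(map);
    }
}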
4. Test Results
Data on the MySQL side:
Query result after writing as ORC:
However, for the TEXT file type an additional change is still needed; the general approach follows the usual way of writing data into a Hive map-type column from Java:
public static MutablePair<List<Object>, Boolean> transportOneRecord(
        Record record, List<Configuration> columnsConfiguration,
        TaskPluginCollector taskPluginCollector,
        // New parameter to distinguish text from orc
        String fileType) {
    ...
    case MAP:
        Map map = JSON.parseObject(column.asString(), Map.class);
        if (fileType.equalsIgnoreCase("TEXT")) {
            // Special handling for the TEXT file type: strip braces and whitespace
            recordList.add(map.toString().replaceAll("\\{|}|\\s", ""));
        } else {
            recordList.add(map);
        }
        break;
    ...
}
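To make the TEXT branch concrete, the snippet below (illustrative only, using the same assumed sample value and fastjson behavior as above) shows the string that actually lands in the text file: stripping braces and whitespace from Map.toString() leaves key=value pairs joined by commas, which is exactly what the collection and map-key delimiters in the DDL below expect.
import com.alibaba.fastjson.JSON;
import java.util.Map;

public class TextMapFormatDemo {
    public static void main(String[] args) {
        Map map = JSON.parseObject("{\"name\":\"Tom\",\"age\":\"18\"}", Map.class);
        // Map.toString() gives "{name=Tom, age=18}" (order may vary); removing braces
        // and whitespace leaves "name=Tom,age=18", matching ',' as the collection item
        // delimiter and '=' as the map key delimiter.
        String cell = map.toString().replaceAll("\\{|}|\\s", "");
        System.out.println(cell);
    }
}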
When creating the table, the collection delimiters must also be specified:
create table test_map(
id int,
student map<string,string>
)
row format delimited
fields terminated by '|'
collection items terminated by ','
map keys terminated by '='
;
Specify the field delimiter in the DataX writer configuration:
"writer": {
"name": "hdfswriter",
"parameter": {
"defaultFS": "hdfs://node:9000",
"fileType": "text",
"path": "/usr/hive/warehouse/test.db/test_map/",
"fileName": "test",
"column": [
{
"name": "id",
"type": "int"
},
{
"name": "student",
"type": "map"
}
],
"writeMode": "append",
"fieldDelimiter": "|"
}
Final execution result:
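A quick way to verify the write from the Hive side is to read the map column back and index into a key; the query below is illustrative only and refers to the sample data assumed above:
-- Read the whole map column and a single key from it
select id, student, student['name'] from test_map limit 10;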