1. DataX Overview
DataX is a big-data synchronization engine developed by Alibaba, and its adoption keeps growing. Its basic principle is that every input and output is converted to a common intermediate format, with a middle layer handling all of the control logic that drives the process. An excerpt from the official documentation:
DataX itself is an offline data synchronization framework built on a Framework + plugin architecture: reading from and writing to data sources are abstracted as Reader/Writer plugins that slot into the overall synchronization framework.
Reader: the data collection module. It reads data from the source and hands it to the Framework.
Writer: the data writing module. It continuously pulls data from the Framework and writes it to the destination.
Framework: connects Reader and Writer, acting as the data transport channel between them, and handles the core concerns of buffering, flow control, concurrency, and data conversion.
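To make that contract concrete, here is a trimmed sketch of the plugin SPI in DataX's common module (com.alibaba.datax.common.spi); the Job-side split methods and lifecycle hooks are omitted, so treat this as an outline rather than the full interface:

public abstract class Reader extends BaseObject {
    public static abstract class Task extends AbstractTaskPlugin {
        // Called by the Framework: parse the source data and push each row
        // into the channel as a Record via the RecordSender.
        public abstract void startRead(RecordSender recordSender);
    }
}

public abstract class Writer extends BaseObject {
    public static abstract class Task extends AbstractTaskPlugin {
        // Called by the Framework: drain Records from the channel via the
        // RecordReceiver and write them to the destination.
        public abstract void startWrite(RecordReceiver lineReceiver);
    }
}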
2. Problems with the existing sync plugins
Hive side: the official hdfsreader plugin, or the open-source hivereader (modified from hdfsreader): https://github.com/deanxiao/DataX-HiveReader
MongoDB side: the official mongodbwriter plugin
hivereader
The Hive table is stored as \t-delimited text with the following layout:
uid\tlabel\tt
(user id, long) (user labels, comma-separated) (label timestamp)
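For example, a single row would look like this (sample values invented for illustration):

10001\ttag_a,tag_b,tag_c\t2019-06-01 12:00:00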
At first we used hivereader for the sync, with the following job configuration:
{
"core": {
"transport": {
"channel": {
"speed": {
"record": 10000
}
},
"flowControlInterval":10000
}
},
"job": {
"content": [
{
"reader": {
"name": "hivereader",
"parameter": {
"defaultFS": "hdfs://dap",
"hadoopConfig":{
"dfs.nameservices": "dap",
"dfs.ha.namenodes.dap": "namenode1966,namenode1944",
"dfs.namenode.rpc-address.dap.namenode1966": "dx-namenode-01:8020",
"dfs.namenode.rpc-address.dap.namenode1944": "dx-namenode-02:8020",
"dfs.client.failover.proxy.provider.dap": "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider"
},
"hiveSql":[
"select uid,label,t from test "
],
"username":"athena",
"hive_exec_queue":"root.boss.pac.dap.vip"
}
},
"writer": {
"name": "mongodbwriter",
"parameter": {
"address": [
"127.0.0.1:27017"
],
"userName": "xxx",
"userPassword": "xxx",
"dbName": "xxx",
"collectionName": "xxxx",
"column": [
{
"name": "uid",
"type": "long"
},
{
"name": "label",
"type": "Array",
"splitter": ","
},
{
"name": "t",
"type": "date"
}
],
"writeMode": {
"isReplace": "true",
"replaceKey": "uid"
}
}
}
}
],
"setting": {
"speed": {
"record" : 10000,
"channel": 10
}
}
}
}
Run the sync with this job file (python bin/datax.py job.json). It completes, but afterwards both uid and t have landed in MongoDB as strings, which does not match their original Hive types. Why? A look at the mongodbwriter source explains it:
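(The snippet below is a condensed paraphrase of the type-dispatch loop in mongodbwriter's batch-insert path, with field lookups abridged and error handling removed; it is not a verbatim copy of the source.)

// For each DataX Record, build a MongoDB Document field by field.
for (int i = 0; i < record.getColumnNumber(); i++) {
    String name = columnMeta.getJSONObject(i).getString("name");
    String type = columnMeta.getJSONObject(i).getString("type");
    Column col = record.getColumn(i);

    if (Column.Type.INT.name().equalsIgnoreCase(type)) {
        // Special case 1: "int" is parsed out of the raw value.
        data.put(name, Integer.parseInt(String.valueOf(col.getRawData())));
    } else if (col instanceof StringColumn) {
        if (KeyConstant.isArrayType(type.toLowerCase())) {
            // Special case 2: array types are split on the configured splitter.
            String splitter = columnMeta.getJSONObject(i).getString("splitter");
            data.put(name, col.asString().split(splitter));
        } else {
            // Every other string column is written as a plain string --
            // this is the branch all hivereader output falls into,
            // no matter what type the writer column declares.
            data.put(name, col.asString());
        }
    } else if (col instanceof LongColumn) {
        data.put(name, col.asLong());   // stays a long only if the reader produced a LongColumn
    } else if (col instanceof DateColumn) {
        data.put(name, col.asDate());   // stays a date only if the reader produced a DateColumn
    } // DoubleColumn, BoolColumn, etc. are handled the same way
}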
As the code shows, mongodbwriter special-cases only the int and array types; every other field is written according to whatever Column type the reader produced. hivereader, however, does not read the table's metadata types and emits every field as a string, so we have to fall back to hdfsreader, which lets us declare the column types ourselves.
3. Solution
hdfsreader
With the configuration below, the reader field types are declared by column index. Note that the writer still lists the same columns with their types: mongodbwriter validates the complete target column list even though most of those types end up unused. The finished job is then run the usual way, as shown after the config.
{
"core": {
"transport": {
"channel": {
"speed": {
"record": 10000
}
},
"flowControlInterval":10000
}
},
"job": {
"content": [
{
"reader": {
"name": "hdfsreader",
"parameter": {
"path": "/user/xxx",
"defaultFS": "hdfs://dap",
"hadoopConfig":{
"dfs.nameservices": "dap",
"dfs.ha.namenodes.dap": "namenode1966,namenode1944",
"dfs.namenode.rpc-address.dap.namenode1966": "dx-namenode-01:8020",
"dfs.namenode.rpc-address.dap.namenode1944": "dx-namenode-02:8020",
"dfs.client.failover.proxy.provider.dap": "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider"
},
"column": [
{
"index": 0,
"type": "long"
},
{
"index": 1,
"type": "string"
},
{
"index": 2,
"type": "date"
}
],
"fileType": "text",
"encoding": "UTF-8",
"fieldDelimiter": "\t",
"username":"athena",
"hive_exec_queue":"root.boss.pac.dap.vip"
}
},
"writer": {
"name": "mongodbwriter",
"parameter": {
"address": [
"127.0.0.1:27017"
],
"userName": "xxx",
"userPassword": "xxx",
"dbName": "xxx",
"collectionName": "xxxx",
"column": [
{
"name": "uid",
"type": "long"
},
{
"name": "label",
"type": "Array",
"splitter": ","
},
{
"name": "t",
"type": "date"
}
],
"writeMode": {
"isReplace": "true",
"replaceKey": "uid"
}
}
}
}
],
"setting": {
"speed": {
"record" : 10000,
"channel": 10
}
}
}
}
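With the reader now declaring the real types, uid arrives in MongoDB as a long and t as a date. The job is launched the standard DataX way (the job filename here is just a placeholder):

python bin/datax.py hive2mongo.json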