1 Introduction
DataX is the open-source version of Alibaba Cloud DataWorks Data Integration, an offline data synchronization tool/platform widely used within Alibaba Group. DataX provides efficient data synchronization between a wide range of heterogeneous data sources, including MySQL, Oracle, SQL Server, PostgreSQL, HDFS, Hive, ADS, HBase, TableStore (OTS), MaxCompute (ODPS), Hologres, and DRDS.
2 Supported Data Sources
3 Components
As an offline data synchronization framework, DataX is built on a Framework + plugin architecture: reading from and writing to data sources are abstracted into Reader/Writer plugins.
(1) Reader: the data collection module; it reads data from the source and sends it to the Framework.
(2) Writer: the data writing module; it takes data from the Framework and writes it to the destination.
(3) Framework: connects the Reader and Writer and acts as the data transfer channel between them, handling buffering, flow control, concurrency, and data conversion.
4 Core Architecture
A DataX job has its own lifecycle:
(1) Job: a complete end-to-end synchronization job for a single data source; it is the smallest business unit of DataX synchronization.
(2) Task: the smallest execution unit, produced by splitting a Job; each Task is responsible for synchronizing a portion of the data.
(3) TaskGroup: the Job invokes the Schedule module to regroup the split Tasks into task groups according to the configured concurrency; each task group runs its assigned Tasks with a fixed concurrency, which defaults to 5.
(4) JobContainer: the Job executor, analogous to the JobTracker in classic Hadoop MapReduce.
(5) TaskGroupContainer: the TaskGroup executor, which runs one group of Tasks, analogous to the TaskTracker.
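The splitting step above boils down to simple arithmetic: the number of task groups follows from the total channel (concurrency) count and the per-group channel count, which defaults to 5. A toy sketch of that calculation, not DataX's actual scheduler code:

```python
import math

def task_group_count(total_channels, channels_per_group=5):
    """Number of TaskGroups needed so that each group runs at most
    `channels_per_group` concurrent channels (DataX's default is 5)."""
    return math.ceil(total_channels / channels_per_group)

# A job configured with "channel": 12 would be scheduled as 3 task groups.
print(task_group_count(12))
```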
5 Test (MySQL <==> MySQL)
The JSON job configuration is as follows:
{
    "job": {
        "content": [
            {
                "reader": {
                    "name": "mysqlreader",
                    "parameter": {
                        "username": "root",
                        "password": "123456",
                        "column": ["id", "name", "age", "gender", "clazz"],
                        "splitPk": "age",
                        "connection": [
                            {
                                "table": ["student"],
                                "jdbcUrl": ["jdbc:mysql://master:3306/student"]
                            }
                        ]
                    }
                },
                "writer": {
                    "name": "mysqlwriter",
                    "parameter": {
                        "writeMode": "insert",
                        "username": "root",
                        "password": "123456",
                        "column": ["id", "name", "age", "gender", "clazz"],
                        "connection": [
                            {
                                "jdbcUrl": "jdbc:mysql://master:3306/student?useUnicode=true&characterEncoding=utf8",
                                "table": ["student_copy"]
                            }
                        ]
                    }
                }
            }
        ],
        "setting": {
            "speed": {
                "channel": 5
            }
        }
    }
}
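One easy mistake in hand-written job files is repeating a key (for example, `writeMode` appearing twice in the writer parameters); JSON parsers silently keep only the last occurrence, so DataX never warns about it. A small pre-flight check using Python's `object_pairs_hook`, purely illustrative:

```python
import json

def find_duplicate_keys(text):
    """Return the keys that appear more than once in any object of a JSON document."""
    dupes = []

    def hook(pairs):
        seen = set()
        for key, _ in pairs:
            if key in seen:
                dupes.append(key)
            seen.add(key)
        return dict(pairs)

    json.loads(text, object_pairs_hook=hook)
    return dupes

job = '{"writer": {"writeMode": "insert", "column": ["id"], "writeMode": "insert"}}'
print(find_duplicate_keys(job))  # the repeated key is reported
```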
The console output at job completion is as follows:
6 Test (MySQL <==> Hive2)
6.1 MySQL -> Hive
The JSON job configuration is as follows:
{
    "job": {
        "content": [
            {
                "reader": {
                    "name": "mysqlreader",
                    "parameter": {
                        "username": "root",
                        "password": "123456",
                        "column": ["id", "name", "age", "gender", "clazz"],
                        "splitPk": "age",
                        "connection": [
                            {
                                "table": ["student"],
                                "jdbcUrl": ["jdbc:mysql://master:3306/student"]
                            }
                        ]
                    }
                },
                "writer": {
                    "name": "hdfswriter",
                    "parameter": {
                        "defaultFS": "hdfs://master:9000",
                        "fileType": "text",
                        "path": "/hivestuinput",
                        "fileName": "student",
                        "column": [
                            {"name": "id", "type": "BIGINT"},
                            {"name": "name", "type": "STRING"},
                            {"name": "age", "type": "INT"},
                            {"name": "gender", "type": "STRING"},
                            {"name": "clazz", "type": "STRING"}
                        ],
                        "writeMode": "append",
                        "fieldDelimiter": ","
                    }
                }
            }
        ],
        "setting": {
            "speed": {
                "channel": 5
            }
        }
    }
}
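The hdfswriter settings above (`fileType` "text", `fieldDelimiter` ",") mean each record lands in HDFS as a single delimited line that the Hive table can then parse. A minimal sketch of that serialization, using made-up sample rows:

```python
def to_text_line(record, delimiter=","):
    """Serialize one record the way a text-format writer with
    fieldDelimiter "," would: values joined by the delimiter."""
    return delimiter.join(str(v) for v in record)

# Hypothetical student rows matching the column list in the config.
rows = [(1001, "Alice", 18, "F", "Class1"), (1002, "Bob", 19, "M", "Class2")]
for r in rows:
    print(to_text_line(r))
```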
6.2 Hive -> MySQL
The JSON job configuration is as follows:
{
    "job": {
        "setting": {
            "speed": {
                "channel": 5
            }
        },
        "content": [
            {
                "reader": {
                    "name": "hdfsreader",
                    "parameter": {
                        "path": "/hivestuinput/*",
                        "defaultFS": "hdfs://master:9000",
                        "column": [
                            {"index": 0, "type": "long"},
                            {"index": 1, "type": "string"},
                            {"index": 2, "type": "long"},
                            {"index": 3, "type": "string"},
                            {"index": 4, "type": "string"}
                        ],
                        "fileType": "text",
                        "encoding": "UTF-8",
                        "fieldDelimiter": ","
                    }
                },
                "writer": {
                    "name": "mysqlwriter",
                    "parameter": {
                        "writeMode": "replace",
                        "username": "root",
                        "password": "123456",
                        "column": ["id", "name", "age", "gender", "clazz"],
                        "connection": [
                            {
                                "jdbcUrl": "jdbc:mysql://master:3306/student?useUnicode=true&characterEncoding=utf8",
                                "table": ["student_copy"]
                            }
                        ]
                    }
                }
            }
        ]
    }
}
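On the way back, hdfsreader selects columns by index from each delimited line and converts them to the declared types. A simplified sketch of that parsing step (illustrative, not the plugin's real code):

```python
# Column spec mirroring the hdfsreader config: position in the line + target type.
COLUMNS = [
    {"index": 0, "type": "long"},
    {"index": 1, "type": "string"},
    {"index": 2, "type": "long"},
    {"index": 3, "type": "string"},
    {"index": 4, "type": "string"},
]

# Toy converters for the common text-file types.
CASTS = {"long": int, "double": float, "string": str, "boolean": lambda s: s == "true"}

def parse_line(line, columns=COLUMNS, delimiter=","):
    """Split one text line and cast each configured column to its declared type."""
    fields = line.rstrip("\n").split(delimiter)
    return [CASTS[c["type"]](fields[c["index"]]) for c in columns]

print(parse_line("1001,Alice,18,F,Class1"))
```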
6.3 Q&A
When type conversion is involved, the data type of each column must be specified explicitly; the default type is String. A suggested type-mapping table is as follows:
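As a rough reconstruction of that mapping (worth double-checking against the official hdfsreader documentation), the text-file column types correspond to DataX's internal types roughly as sketched below, with unspecified columns falling back to String:

```python
# Approximate hdfsreader -> DataX internal type mapping; verify against the docs.
HDFS_TO_DATAX = {
    "long": "Long",
    "double": "Double",
    "string": "String",
    "boolean": "Boolean",
    "date": "Date",
}

def datax_type(hdfs_type):
    # An unspecified type defaults to String, as noted above.
    return HDFS_TO_DATAX.get((hdfs_type or "string").lower(), "String")

print(datax_type("LONG"), datax_type(None))
```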
7 Test (MySQL <==> ES5.X)
7.1 MySQL -> ES5.X
The JSON job configuration is as follows:
{
    "job": {
        "content": [
            {
                "reader": {
                    "name": "mysqlreader",
                    "parameter": {
                        "username": "root",
                        "password": "123456",
                        "column": ["id", "name", "age", "gender", "clazz"],
                        "splitPk": "age",
                        "connection": [
                            {
                                "table": ["student_copy"],
                                "jdbcUrl": ["jdbc:mysql://master:3306/student"]
                            }
                        ]
                    }
                },
                "writer": {
                    "name": "elasticsearchwriter",
                    "parameter": {
                        "endpoint": "http://master:9200",
                        "index": "datax_index",
                        "type": "default",
                        "cleanup": false,
                        "settings": {
                            "index": {
                                "number_of_shards": 1,
                                "number_of_replicas": 0
                            }
                        },
                        "discovery": false,
                        "batchSize": 1000,
                        "splitter": ",",
                        "column": [
                            {"name": "id", "type": "long"},
                            {"name": "name", "type": "keyword"},
                            {"name": "age", "type": "integer"},
                            {"name": "gender", "type": "keyword"},
                            {"name": "clazz", "type": "keyword"}
                        ]
                    }
                }
            }
        ],
        "setting": {
            "speed": {
                "channel": 5
            }
        }
    }
}
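The elasticsearchwriter's `"batchSize": 1000` means records are flushed to ES in bulk requests of at most 1000 documents each. The batching itself is just chunking; a toy sketch:

```python
def batches(records, batch_size=1000):
    """Yield records in chunks of at most `batch_size`, the way a bulk
    writer configured with "batchSize": 1000 groups documents per request."""
    for i in range(0, len(records), batch_size):
        yield records[i:i + batch_size]

docs = list(range(2500))
print([len(b) for b in batches(docs)])  # three bulk requests: 1000, 1000, 500
```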
7.2 ES5.X -> MySQL
Since the plugins shipped with the official release do not include an Elasticsearch reader, a custom plugin is used here to read from ES. The key points are creating the ES client, implementing the read logic, and resolving the package dependencies. The JSON job configuration is as follows:
{
    "job": {
        "content": [
            {
                "reader": {
                    "name": "elasticsearchreader",
                    "parameter": {
                        "connection": ["master:9300"],
                        "index": "datax_index",
                        "type": "default",
                        "pageSize": 100,
                        "column": ["gender", "name", "id", "clazz", "age"]
                    }
                },
                "writer": {
                    "name": "mysqlwriter",
                    "parameter": {
                        "writeMode": "insert",
                        "username": "root",
                        "password": "123456",
                        "column": ["gender", "name", "id", "clazz", "age"],
                        "connection": [
                            {
                                "jdbcUrl": "jdbc:mysql://master:3306/student?useUnicode=true&characterEncoding=utf8",
                                "table": ["student_copy"]
                            }
                        ]
                    }
                }
            }
        ],
        "setting": {
            "speed": {
                "channel": 5
            }
        }
    }
}
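The custom reader's `"pageSize": 100` implies the index is drained page by page rather than in one query. A stubbed sketch of that read loop (the real plugin issues the requests through the ES client):

```python
def read_all(fetch_page, page_size=100):
    """Drain a paged source by fetching until a page comes back short."""
    out, offset = [], 0
    while True:
        page = fetch_page(offset, page_size)
        out.extend(page)
        if len(page) < page_size:
            return out
        offset += page_size

# Stub standing in for an ES query; 250 fake documents read in three pages.
data = [{"id": i} for i in range(250)]
fake_fetch = lambda offset, size: data[offset:offset + size]
print(len(read_all(fake_fetch)))
```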
7.3 Q&A
1. If you deploy only by downloading the official binary package (method one on the project page), the job fails with the error below. Checking the error shows that the elasticsearchwriter plugin is missing from DataX's plugins directory, even though the official documentation lists ES as supported. Further investigation shows that the binary .tar.gz simply does not ship the ES plugin; it is only included in the source code, so you have to download the source, build the package yourself, and upload it to the Linux host.
2. After building locally, uploading to Linux, unpacking, and writing the MySQL-to-ES JSON configuration, running the datax.py script still raised an error. Repeated inspection showed that the ESClient class of the elasticsearchwriter plugin enforces authentication; commenting out .setPreemptiveAuth(new HttpHost(endpoint)) disables that check.
8 DataX-Web
8.1 Deployment
DataX-Web is a distributed data synchronization tool built on top of DataX and is fairly simple to use. As a web front end for DataX it supports a limited set of data sources, but it can move data between the common relational databases and the Hadoop ecosystem. You can either download the source and compile it on Linux, or download the .tar.gz package for one-step deployment. The default metadata store is MySQL 5.7, which can be changed as needed. The main page looks like this:
Reference documentation:
https://github.com/WeiYe-Jing/datax-web/blob/master/doc/datax-web/datax-web-deploy.md
8.2 Q&A
1. Scheduled jobs, resource allocation, and email alerts can all be configured; the relevant pages are shown below:
2. If the executor reports a successful dispatch but the job itself fails, the web page's built-in log viewer shows the DataX error details. The error below is caused by a DataX rate-limiting bug: core.transport.channel.speed.byte and job.setting.speed.byte must either both be set or both be left unset. The problem and the fix are shown below:
{
    "core": {
        "transport": {
            "channel": {
                "speed": {
                    "byte": <number greater than 0>
                }
            }
        }
    },
    "job": {
        "setting": {
            "speed": {
                "channel": 3,
                "byte": <number greater than 0>
            }
        }
    }
}
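The constraint (both byte limits set, or neither) can be checked mechanically before submitting a job. A small helper, assuming the job configuration has already been parsed from the JSON file into a dict:

```python
def speed_bytes_consistent(config):
    """True iff core.transport.channel.speed.byte and job.setting.speed.byte
    are either both present or both absent, per the workaround above."""
    def get(d, *path):
        for key in path:
            d = d.get(key, {}) if isinstance(d, dict) else {}
        return d if d != {} else None

    core_byte = get(config, "core", "transport", "channel", "speed", "byte")
    job_byte = get(config, "job", "setting", "speed", "byte")
    return (core_byte is None) == (job_byte is None)

# Example values only; any number greater than 0 works.
ok = {"core": {"transport": {"channel": {"speed": {"byte": 1048576}}}},
      "job": {"setting": {"speed": {"channel": 3, "byte": 1048576}}}}
bad = {"job": {"setting": {"speed": {"channel": 3, "byte": 1048576}}}}
print(speed_bytes_consistent(ok), speed_bytes_consistent(bad))
```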
3. CPU utilization and memory usage can be viewed at a glance, and the job JSON can be pretty-printed in the UI: