I. DataX 3.0 Concepts
DataX is an offline synchronization tool for heterogeneous data sources. It provides stable and efficient data synchronization between a wide range of heterogeneous sources, including relational databases (MySQL, Oracle, ...), HDFS, Hive, ODPS, HBase, and more.
II. DataX 3.0 Framework Design
As an offline data synchronization framework, DataX is built on a Framework + plugin architecture (a minimal job skeleton illustrating this split follows the list below).
- Reader: the data collection module. It reads data from the source and hands it to the Framework.
- Writer: the data writing module. It continuously pulls data from the Framework and writes it to the destination.
- Framework: connects Reader and Writer, serving as the data transfer channel between them, and handles core concerns such as buffering, flow control, concurrency, and data conversion.
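Concretely, every DataX job file mirrors this plugin split: one reader, one writer, and framework-level settings. A bare skeleton (the plugin names here are only placeholders) looks like:
{
    "job": {
        "content": [
            {
                "reader": { "name": "mysqlreader", "parameter": { } },
                "writer": { "name": "hdfswriter", "parameter": { } }
            }
        ],
        "setting": {
            "speed": { "channel": 1 }
        }
    }
}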
III. Six Core Advantages of DataX 3.0
1. Reliable data quality monitoring
- Eliminates type distortion for individual data types during transfer
- Provides runtime monitoring of the data volume processed by a job
- Provides dirty-data detection
2. Rich data transformation features
3. Precise speed control: the speed block under job.setting caps concurrency (channel), bytes per second (byte), and records per second (record), for example:
"speed": {
"channel": 5,
"byte": 1048576,
"record": 10000
}
4. Strong synchronization performance
5. Robust fault tolerance
6. Extremely simple user experience
IV. Differences Between DataX and Sqoop

| Feature | DataX | Sqoop |
|---|---|---|
| Execution model | Single process, multi-threaded | MapReduce |
| Distributed execution | Not supported; can be worked around via a scheduling system | Supported |
| Flow control | Built in | Requires custom work |
| Statistics | Some statistics available; reporting requires custom work | None; collecting them across distributed tasks is inconvenient |
| Data validation | Available in the core module | None; collecting it across distributed tasks is inconvenient |
| Monitoring | Requires custom work | Requires custom work |
V. DataX Deployment
1. Download
Download the release package from GitHub and upload it to the master node.
URL: https://github.com/alibaba/DataX
2. Extract the package and configure environment variables
tar -zxvf datax.tar.gz
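A minimal way to put datax.py on the PATH, assuming the archive was extracted to /usr/local/soft/datax (the path used later in this article):
# Append to /etc/profile (or ~/.bashrc), then reload it
export DATAX_HOME=/usr/local/soft/datax
export PATH=$PATH:$DATAX_HOME/bin
source /etc/profile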
3. Verify the installation
Run the bundled self-test job; if it completes successfully, DataX is installed correctly:
[root@master datax]# python ./bin/datax.py ./job/job.json
VI. Using DataX
The command for generating a job template is datax.py -r <readerName> -w <writerName>, e.g. datax.py -r mysqlreader -w hdfswriter. The stream-to-stream job below simply prints that phrase as sample data:
{
"job": {
"content": [
{
"reader": {
"name": "streamreader",
"parameter": {
"column": [{
"type":"string",
"value":"dataX"
},
{
"type":"string",
"value":"生成模板的"
},
{
"type":"string",
"value":"命令"
}],
"sliceRecordCount": 10
}
},
"writer": {
"name": "streamwriter",
"parameter": {
"encoding": "UTF-8",
"print": true
}
}
}
],
"setting": {
"speed": {
"channel": 2
}
}
}
}
1. Usage Notes
1.1 DataX job submission command
python ./bin/datax.py ./job/job.json
Or, with datax.py on the PATH, simply: datax.py xxx.json
1.2 DataX configuration file format
View the mysql -> hdfs template: [root@master datax]# datax.py -r mysqlreader -w hdfswriter
{
"job": {
"content": [
{
"reader": {
"name": "mysqlreader",
"parameter": {
"column": [],
"connection": [
{
"jdbcUrl": [],
"table": []
}
],
"password": "",
"username": "",
"where": ""
}
},
"writer": {
"name": "hdfswriter",
"parameter": {
"column": [],
"compress": "",
"defaultFS": "",
"fieldDelimiter": "",
"fileName": "",
"fileType": "",
"path": "",
"writeMode": ""
}
}
}
],
"setting": {
"speed": {
"channel": "1"
}
}
}
}
2. Basic DataX Usage
2.1 Printing a stream to the console
View the streamreader --> streamwriter template: datax.py -r streamreader -w streamwriter
{
"job": {
"content": [
{
"reader": {
"name": "streamreader",
"parameter": {
"column": [{
"type": "string",
"value": "张三"
},
{
"type": "string",
"value": "真帅"
},
{
"type": "string",
"value": "李四"
},
{
"type": "string",
"value": "表示不服"
}
],
"sliceRecordCount": "2"
}
},
"writer": {
"name": "streamwriter",
"parameter": {
"encoding": "UTF-8",
"print": true
}
}
}
],
"setting": {
"speed": {
"channel": "2"
}
}
}
}
Run the sync job:
datax.py stream2stream.json
2.2 mysql2mysql (data migration)
Template: datax.py -r mysqlreader -w mysqlwriter. In the reader, "column": ["*"] selects every column of the source table, and "where" filters the rows to sync.
{
"job": {
"content": [
{
"reader": {
"name": "mysqlreader",
"parameter": {
"column": ["*"], # 同步的列名 (* 表示所有)
"connection": [
{
"jdbcUrl": ["jdbc:mysql://master:3306/bigdata30?useUnicode=true&characterEncoding=utf-8"],
"table": ["md_goods"]
}
],
"password": "12345678",
"username": "root",
"where": "goods_shop='某东自营官方旗舰店'"
}
},
"writer": {
}
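The writer block above is left empty; a minimal mysqlwriter sketch might look like the following (the target table md_goods_copy is a placeholder for illustration, the credentials are reused from the reader, and note that mysqlwriter takes jdbcUrl as a plain string):
"writer": {
    "name": "mysqlwriter",
    "parameter": {
        "column": ["*"],
        "connection": [
            {
                "jdbcUrl": "jdbc:mysql://master:3306/bigdata30?useUnicode=true&characterEncoding=utf-8",
                "table": ["md_goods_copy"]
            }
        ],
        "password": "12345678",
        "username": "root",
        "writeMode": "insert"
    }
}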
Run the sync job:
datax.py mysql2mysql.json
2.3 mysql2hdfs
View the template:
python /usr/local/soft/datax/bin/datax.py -r mysqlreader -w hdfswriter
Create the MySQL table and load some sample rows:
CREATE TABLE `t_user` (
  `id` bigint(10) NOT NULL,
  `name` varchar(100) DEFAULT NULL,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;
insert into `t_user`(`id`,`name`) values (1,'flink'),(2,'slave'),(3,'hive'),(4,'flink04'),(5,'hbase');
{
"job": {
"content": [
{
"reader": {
"name": "mysqlreader",
"parameter": {
"column": [
"id",
"name"
],
"connection": [
{
"jdbcUrl": [
"jdbc:mysql://1xx.1xx.xx.xxx:3306/student"
],
"table": [
"t_user"
]
}
],
"password": "12345678",
"username": "root"
}
},
"writer": {
"name": "hdfswriter",
"parameter": {
"column": [
{
"name": "id",
"type": "int"
},
{
"name": "name",
"type": "string"
}
],
"defaultFS": "hdfs://xxx.1xx.xx.xx0:9000",
"fieldDelimiter": " ",
"fileName": "t_user.txt",
"fileType": "text",
"path": "/",
"writeMode": "append"
}
}
}
],
"setting": {
"speed": {
"channel": "1"
}
}
}
}
Run: datax.py mysql2hdfs.json
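To spot-check the result, list and read the files on HDFS. HdfsWriter appends a random suffix to fileName, so use a wildcard (the path matches the job above):
hdfs dfs -ls /
hdfs dfs -cat /t_user.txt*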
2.4 mysql2hive
HdfsWriter currently supports only the textfile and orcfile formats.
View the template: datax.py -r mysqlreader -w hdfswriter
Create the Hive database: create database dataxinfo;
Create the table (its name, columns, and delimiter must match the HdfsWriter configuration below):
CREATE TABLE IF NOT EXISTS new_stu(
  id int,
  name STRING,
  email STRING,
  age int
) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
Write the configuration file mysql2hdfs2.json:
{
"job": {
"content": [
{
"reader": {
"name": "mysqlreader",
"parameter": {
"column": ["*"],
"connection": [
{
"jdbcUrl": ["jdbc:mysql://master:3306/bigdata30?useUnicode=true&characterEncoding=utf-8"],
"table": ["new_stu"]
}
],
"password": "12345678",
"username": "root"
}
},
"writer": {
"name": "hdfswriter",
"parameter": {
"column": [
{
"name": "id",
"type": "int"
},
{
"name": "name",
"type": "string"
},
{
"name": "email",
"type": "string"
},
{
"name": "age",
"type": "int"
}
],
"defaultFS": "hdfs://master:9000",
"fieldDelimiter": ",",
"fileName": "new_stu",
"fileType": "text",
"path": "/user/hive/warehouse/dataxinfo.db/new_stu",
"writeMode": "append"
}
}
}
],
"setting": {
"speed": {
"channel": "1"
}
}
}
}
Run: datax.py mysql2hdfs2.json
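Because the files land directly under the table's warehouse directory, you can verify the load from Hive (assuming the database and table created above):
hive -e "select * from dataxinfo.new_stu limit 10;"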
2.5 hive2mysql
View the template (DataX has no dedicated hivereader plugin; the files behind a Hive table are read with hdfsreader): datax.py -r hdfsreader -w mysqlwriter
cd /usr/local/soft/bigdata30
mkdir datax_jsons
cd datax_jsons
vim hive2mysql.json
{
"job": {
"content": [
{
"reader": {
"name": "hdfsreader",
"parameter": {
"column": [
{
"index": 0,
"type": "string"
},
{
"index": 1,
"type": "date"
},
{
"index": 2,
"type": "string"
}
],
"defaultFS": "hdfs://master:9000",
"encoding": "UTF-8",
"fieldDelimiter": ",",
"fileType": "text",
"path": "/user/hive/warehouse/bigdata30.db/business/*"
}
},
"writer": {
"name": "mysqlwriter",
"parameter": {
"column": [
"name",
"orderdate",
"cost"
],
"connection": [
{
"jdbcUrl": "jdbc:mysql://master:3306/sqoo pdb?useUnicode=true&characterEncoding=utf-8&useSSL=false",
"table": ["business_outx"]
}
],
"password": "12345678",
"username": "root",
"writeMode": "insert"
}
}
}
],
"setting": {
"speed": {
"channel": "1"
}
}
}
}
Run: datax.py hive2mysql.json
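You can confirm the rows arrived in MySQL with a quick count (credentials taken from the job file above; the target table business_outx must already exist, since mysqlwriter does not create it):
mysql -uroot -p12345678 -e "select count(*) from sqoopdb.business_outx;"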