DataX is an offline data synchronization tool/platform widely used within Alibaba Group. It implements efficient data synchronization between heterogeneous data sources such as MySQL, Oracle, SQL Server, PostgreSQL, HDFS, Hive, ADS, HBase, OTS, and ODPS.
The currently supported data sources are as follows:
Comparison with Sqoop:
1. Build
1.1 Build DataX
Open the datax project in IDEA and run the following Maven command:
mvn -U clean package assembly:assembly -Dmaven.test.skip=true
Once the build succeeds, the generated tar package is under the target directory; upload it to the designated directory on the Linux server.
1.2 Build datax-web
Open the datax-web project in IDEA and run the Maven command:
mvn clean install
Once the build succeeds, the generated package is under the build directory; upload it to the designated directory on the Linux server as well.
2. Installation
2.1 Test DataX
Extract datax.tar.gz, change into the datax directory, run the bundled self-check job, and review the result:
python bin/datax.py job/job.json
2023-09-13 15:27:28.851 [job-0] INFO StandAloneJobContainerCommunicator - Total 100000 records, 2600000 bytes | Speed 253.91KB/s, 10000 records/s | Error 0 records, 0 bytes | All Task WaitWriterTime 0.023s | All Task WaitReaderTime 0.030s | Percentage 100.00%
2023-09-13 15:27:28.851 [job-0] INFO JobContainer -
Job start time      : 2023-09-13 15:27:18
Job end time        : 2023-09-13 15:27:28
Total elapsed time  : 10s
Average throughput  : 253.91KB/s
Record write speed  : 10000rec/s
Total records read  : 100000
Read/write failures : 0
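For reference, the bundled job/job.json is a pure in-memory self-check that wires a streamreader to a streamwriter, so it exercises the framework without touching any external system. A simplified sketch of what it contains (exact contents vary between DataX versions; the column values here are illustrative):
{
  "job": {
    "setting": {
      "speed": { "channel": 1 },
      "errorLimit": { "record": 0, "percentage": 0.02 }
    },
    "content": [
      {
        "reader": {
          "name": "streamreader",
          "parameter": {
            "sliceRecordCount": 100000,
            "column": [
              { "value": "DataX", "type": "string" },
              { "value": 19890604, "type": "long" }
            ]
          }
        },
        "writer": {
          "name": "streamwriter",
          "parameter": { "print": false, "encoding": "UTF-8" }
        }
      }
    ]
  }
}
The 100000 records in the summary above correspond to sliceRecordCount multiplied by the number of channels.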
2.2 Install datax-web
Extract datax-web-2.1.2.tar.gz, change into the datax-web/bin directory, and run the install script:
./install.sh
2023-09-13 15:31:33.797 [INFO] (101530) Creating directory: [/data/etl/datax-web-2.1.2/bin/../modules].
2023-09-13 15:31:33.806 [INFO] (101530) ####### Start To Uncompress Packages ######
2023-09-13 15:31:33.809 [INFO] (101530) Uncompressing....
Do you want to decompress this package: [datax-admin_2.1.2_1.tar.gz]? (Y/N)y
2023-09-13 15:32:46.982 [INFO] (101530) Uncompress package: [datax-admin_2.1.2_1.tar.gz] to modules directory
Do you want to decompress this package: [datax-executor_2.1.2_1.tar.gz]? (Y/N)y
2023-09-13 15:33:03.246 [INFO] (101530) Uncompress package: [datax-executor_2.1.2_1.tar.gz] to modules directory
2023-09-13 15:33:03.485 [INFO] (101530) ####### Finish To Umcompress Packages ######
Scan modules directory: [/data/etl/datax-web-2.1.2/bin/../modules] to find server under dataxweb
2023-09-13 15:33:03.489 [INFO] (101530) ####### Start To Install Modules ######
2023-09-13 15:33:03.491 [INFO] (101530) Module servers could be installed:
[datax-admin] [datax-executor]
Do you want to confiugre and install [datax-admin]? (Y/N)y
2023-09-13 15:33:09.255 [INFO] (101530) Install module server: [datax-admin]
Start to make directory
2023-09-13 15:33:09.276 [INFO] (105228) Start to build directory
2023-09-13 15:33:09.278 [INFO] (105228) Creating directory: [/data/etl/datax-web-2.1.2/modules/datax-admin/bin/../logs].
2023-09-13 15:33:09.336 [INFO] (105228) Directory or file: [/data/etl/datax-web-2.1.2/modules/datax-admin/bin/../conf] has been exist
2023-09-13 15:33:09.338 [INFO] (105228) Creating directory: [/data/etl/datax-web-2.1.2/modules/datax-admin/bin/../data].
end to make directory
Start to initalize database
2023-09-13 15:33:09.487 [INFO] (105228) Scan out mysql command, so begin to initalize the database
Do you want to initalize database with sql: [/data/etl/datax-web-2.1.2/bin/db/datax_web.sql]? (Y/N)y
Please input the db host(default: 127.0.0.1):
Please input the db port(default: 3306):
Please input the db username(default: root):
Please input the db password(default: ): XXXXXX
Please input the db name(default: dataxweb)
mysql: [Warning] Using a password on the command line interface can be insecure.
Do you want to confiugre and install [datax-executor]? (Y/N)y
2023-09-13 15:34:10.863 [INFO] (101530) Install module server: [datax-executor]
2023-09-13 15:34:10.884 [INFO] (107668) Start to build directory
2023-09-13 15:34:10.886 [INFO] (107668) Creating directory: [/data/etl/datax-web-2.1.2/modules/datax-executor/bin/../logs].
2023-09-13 15:34:10.917 [INFO] (107668) Directory or file: [/data/etl/datax-web-2.1.2/modules/datax-executor/bin/../conf] has been exist
2023-09-13 15:34:10.919 [INFO] (107668) Creating directory: [/data/etl/datax-web-2.1.2/modules/datax-executor/bin/../data].
2023-09-13 15:34:10.953 [INFO] (107668) Creating directory: [/data/etl/datax-web-2.1.2/modules/datax-executor/bin/../json].
2023-09-13 15:34:10.984 [INFO] (101530) ####### Finish To Install Modules ######
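The database settings entered during installation are persisted so they can be changed later without re-running the installer; in datax-web 2.1.2 they live in modules/datax-admin/conf/bootstrap.properties. A sketch of the expected contents, mirroring the values entered above (key names may differ slightly between versions, so verify against the installed file):
# datax-admin database connection (sketch; values as entered during install)
DB_HOST=127.0.0.1
DB_PORT=3306
DB_USERNAME=root
DB_PASSWORD=XXXXXX
DB_DATABASE=dataxweb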
2.3 Configure the executor
Edit datax-web-2.1.2/modules/datax-executor/bin/env.properties:
DATAX_ADMIN_PORT=9527
## path of the DataX python launcher script
#PYTHON_PATH=/home/hadoop/install/datax/bin/datax.py
PYTHON_PATH=/data/etl/datax/bin/datax.py
2.4 Start the services
./start-all.sh
2023-09-13 15:36:26.441 [INFO] (112931) Try To Start Modules In Order
2023-09-13 15:36:26.448 [INFO] (112939) ####### Begin To Start Module: [datax-admin] ######
2023-09-13 15:36:26.454 [INFO] (112947) load environment variables
2023-09-13 15:36:26.755 [INFO] (112947) /data/base/jdk1.8.0_211/bin/java
2023-09-13 15:36:26.757 [INFO] (112947) Waiting DATAX-ADMIN to start complete ...
2023-09-13 15:36:26.964 [INFO] (112947) DATAX-ADMIN start success
2023-09-13 15:36:26.991 [INFO] (113168) ####### Begin To Start Module: [datax-executor] ######
2023-09-13 15:36:26.997 [INFO] (113176) load environment variables
2023-09-13 15:36:27.390 [INFO] (113176) /data/base/jdk1.8.0_211/bin/java
2023-09-13 15:36:27.393 [INFO] (113176) Waiting DATAX-EXEXUTOR to start complete ...
2023-09-13 15:36:27.763 [INFO] (113176) DATAX-EXEXUTOR start success
Open the web UI; the default credentials are admin/123456:
http://xx.16.60.31:9527/index.html
Open the API documentation page:
http://xx.16.60.31:9527/doc.html
3. Sync MySQL data to Hive
3.1 Create tables and prepare data
Create the table in MySQL and insert sample data:
CREATE TABLE `user_behavior_mysql` (
`user_id` varchar(20) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci NOT NULL,
`item_id` varchar(20) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci NOT NULL,
`category_id` varchar(20) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci NULL DEFAULT NULL,
`behavior_type` varchar(10) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci NULL DEFAULT NULL,
`time_stamp` varchar(20) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci NOT NULL,
PRIMARY KEY (`user_id`, `item_id`, `time_stamp`) USING BTREE
) ENGINE = InnoDB CHARACTER SET = utf8mb4 COLLATE = utf8mb4_general_ci ROW_FORMAT = DYNAMIC;
INSERT INTO `user_behavior_mysql` VALUES ('1', '1531036', '2920476', 'pv', '1511733732');
INSERT INTO `user_behavior_mysql` VALUES ('1', '2268318', '2520377', 'pv', '1511544070');
INSERT INTO `user_behavior_mysql` VALUES ('1', '230380', '411153', 'pv', '1511644942');
INSERT INTO `user_behavior_mysql` VALUES ('1', '2333346', '2520771', 'pv', '1511561733');
INSERT INTO `user_behavior_mysql` VALUES ('1', '2576651', '149192', 'pv', '1511572885');
INSERT INTO `user_behavior_mysql` VALUES ('1', '3745169', '2891509', 'pv', '1511725471');
INSERT INTO `user_behavior_mysql` VALUES ('1', '3827899', '2920476', 'pv', '1511713473');
INSERT INTO `user_behavior_mysql` VALUES ('1', '3830808', '4181361', 'pv', '1511593493');
INSERT INTO `user_behavior_mysql` VALUES ('1', '4365585', '2520377', 'pv', '1511596146');
INSERT INTO `user_behavior_mysql` VALUES ('1', '4606018', '2735466', 'pv', '1511616481');
Create the corresponding table in Hive. It is a plain text table, and its comma field delimiter must match the fieldDelimiter the hdfswriter uses below, so the files DataX writes into the warehouse directory are directly readable by Hive:
CREATE TABLE `user_behavior_mysql` (
`user_id` string,
`item_id` string,
`category_id` string,
`behavior_type` string,
`time_stamp` string
) row format delimited fields terminated by ",";
3.2 Add a MySQL data source
3.3 Add a Hive data source
3.4 Add a task template
3.5 Build the task
The generated JSON is as follows:
{
  "job": {
    "setting": {
      "speed": {
        "channel": 3
      },
      "errorLimit": {
        "record": 0,
        "percentage": 0.02
      }
    },
    "content": [
      {
        "reader": {
          "name": "mysqlreader",
          "parameter": {
            "username": "yRjwDFuoPKlqya9h9H2Amg==",
            "password": "XCYVpFosvZBBWobFzmLWvA==",
            "splitPk": ",",
            "connection": [
              {
                "querySql": [
                  "select * from user_behavior_mysql;"
                ],
                "jdbcUrl": [
                  "jdbc:mysql://10.16.60.31:3306/datasets"
                ]
              }
            ]
          }
        },
        "writer": {
          "name": "hdfswriter",
          "parameter": {
            "defaultFS": "hdfs://10.16.60.31:8020",
            "fileType": "text",
            "path": "/user/hive/warehouse/datasets.db/user_behavior_mysql",
            "fileName": "user_behavior_mysql",
            "writeMode": "append",
            "fieldDelimiter": ",",
            "column": [
              { "name": "user_id", "type": "string" },
              { "name": "item_id", "type": "string" },
              { "name": "category_id", "type": "string" },
              { "name": "behavior_type", "type": "string" },
              { "name": "time_stamp", "type": "string" }
            ]
          }
        }
      }
    ]
  }
}
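One caveat about the generated job: when querySql is used, mysqlreader ignores table/column settings and cannot split the query, so the splitPk here (a placeholder ",") has no effect and the three configured channels will not actually read in parallel. For large tables the reader is normally written with table, column, and an integer splitPk so each channel reads a disjoint key range. A minimal sketch, under the assumption that the table had an auto-increment integer column id to split on (credentials and jdbcUrl reused from above; the sample table's real primary-key columns are varchar, which splitPk does not support):
"reader": {
  "name": "mysqlreader",
  "parameter": {
    "username": "yRjwDFuoPKlqya9h9H2Amg==",
    "password": "XCYVpFosvZBBWobFzmLWvA==",
    "column": ["user_id", "item_id", "category_id", "behavior_type", "time_stamp"],
    "splitPk": "id",
    "connection": [
      {
        "table": ["user_behavior_mysql"],
        "jdbcUrl": ["jdbc:mysql://10.16.60.31:3306/datasets"]
      }
    ]
  }
}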
3.6 Execution results
The data in Hive looks as follows:
3.7 100-million-record test
We also tested syncing roughly 100 million records from MySQL; the results are shown below, and a note on tuning for this scale follows them.
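For a run at this scale the main knobs are the channel count plus an integer splitPk (see the reader sketch above) so the channels read disjoint ranges; hdfswriter then typically produces one file per channel. A minimal sketch of the corresponding job setting, with 8 as an arbitrary example value to be tuned against the source database's capacity:
"setting": {
  "speed": {
    "channel": 8
  },
  "errorLimit": {
    "record": 0,
    "percentage": 0.02
  }
}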