1. Installation
Source code:
https://github.com/alibaba/DataX
(1) Upload datax.tar.gz to the server and extract it.
(2) Run the self-check job from the DataX root directory:
bin/datax.py job/job.json
2. Usage
Example 1: read data from a stream reader and print it to the console.
(1) Run the following command in the DataX root directory to generate a job template. Each data source has its own reader/writer plugin, so the template and the command differ by source:
python bin/datax.py -r streamreader -w streamwriter
(2) Write the job configuration file based on the template:
vim job/stream2stream.json
{
    "job": {
        "content": [
            {
                "reader": {
                    "name": "streamreader",
                    "parameter": {
                        "sliceRecordCount": 10,
                        "column": [
                            {
                                "type": "long",
                                "value": "10"
                            },
                            {
                                "type": "string",
                                "value": "hello,DataX"
                            }
                        ]
                    }
                },
                "writer": {
                    "name": "streamwriter",
                    "parameter": {
                        "encoding": "UTF-8",
                        "print": true
                    }
                }
            }
        ],
        "setting": {
            "speed": {
                "channel": "2"
            }
        }
    }
}
Note: with streamreader, channel controls how many times the slice is printed: sliceRecordCount is 10 here, so with channel set to 2 a total of 20 records are printed. With other data sources such as MySQL, channel instead sets the number of concurrent channels.
(3) Run the job:
bin/datax.py job/stream2stream.json
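The note about channel can be checked with a quick calculation; a minimal sketch of the arithmetic, assuming streamreader emits its slice once per channel as described above:

```python
def stream_output_rows(slice_record_count, channel):
    # streamreader emits sliceRecordCount records per channel,
    # so streamwriter prints the slice once per channel
    return slice_record_count * channel

# the config above: sliceRecordCount = 10, channel = 2
print(stream_output_rows(10, 2))  # → 20
```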
Example 2: read data from MySQL and write it to HDFS.
(1) Run the following command in the DataX root directory to generate the template (again, the command depends on the reader and writer):
python bin/datax.py -r mysqlreader -w hdfswriter
(2) Prepare the source data:
create database datax;
use datax;
create table test(id int, name varchar(20));
insert into test values(1001,'zhangsan'),(1002,'lisi'),(1003,'wangwu');
(3) Write the job configuration file:
vim job/mysql2hdfs.json
{
    "job": {
        "content": [
            {
                "reader": {
                    "name": "mysqlreader",
                    "parameter": {
                        "column": [
                            "id",
                            "name"
                        ],
                        "connection": [
                            {
                                "jdbcUrl": [
                                    "jdbc:mysql://hadoop114:3306/datax"
                                ],
                                "table": [
                                    "test"
                                ]
                            }
                        ],
                        "username": "root",
                        "password": "199032"
                    }
                },
                "writer": {
                    "name": "hdfswriter",
                    "parameter": {
                        "column": [
                            {
                                "name": "id",
                                "type": "int"
                            },
                            {
                                "name": "name",
                                "type": "string"
                            }
                        ],
                        "defaultFS": "hdfs://hadoop114:9000",
                        "fieldDelimiter": "\t",
                        "fileName": "test.txt",
                        "fileType": "text",
                        "path": "/",
                        "writeMode": "append"
                    }
                }
            }
        ],
        "setting": {
            "speed": {
                "channel": "1"
            }
        }
    }
}
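The job files in these examples all share the same shape, so it can be handy to assemble them programmatically. A minimal sketch with the standard json module, reusing the exact values from the config above (host, credentials, and paths are the ones used in this example):

```python
import json

def build_mysql2hdfs_job(channel=1):
    """Assemble the mysql2hdfs DataX job dict; values mirror the example config above."""
    reader = {
        "name": "mysqlreader",
        "parameter": {
            "column": ["id", "name"],
            "connection": [{
                "jdbcUrl": ["jdbc:mysql://hadoop114:3306/datax"],
                "table": ["test"],
            }],
            "username": "root",
            "password": "199032",
        },
    }
    writer = {
        "name": "hdfswriter",
        "parameter": {
            "column": [{"name": "id", "type": "int"},
                       {"name": "name", "type": "string"}],
            "defaultFS": "hdfs://hadoop114:9000",
            "fieldDelimiter": "\t",
            "fileName": "test.txt",
            "fileType": "text",
            "path": "/",
            "writeMode": "append",
        },
    }
    return {"job": {"content": [{"reader": reader, "writer": writer}],
                    "setting": {"speed": {"channel": str(channel)}}}}

# print the generated config; redirect it to job/mysql2hdfs.json to use it
print(json.dumps(build_mysql2hdfs_job(), indent=4))
```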
mysqlreader parameter reference: (table omitted)
hdfswriter parameter reference: (table omitted)
(4) Run the job:
bin/datax.py job/mysql2hdfs.json
Note: at run time HdfsWriter appends a random suffix to the configured file name, one file per writing thread, so the file on HDFS is not named exactly test.txt. Since channel is 1 here, only one file is produced and you can rename it yourself:
hadoop fs -mv /test.txt* /test.txt
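The wildcard in the rename above works because each thread's output file keeps test.txt as a prefix. A quick sketch with Python's fnmatch (the suffixed file name shown is hypothetical, just illustrating the shape HdfsWriter produces):

```python
import fnmatch

# hypothetical file names: HdfsWriter output keeps the configured prefix
files = ["test.txt__9a8b7c6d_0000", "other.log"]

# the shell glob test.txt* matches only names with the test.txt prefix
matched = [f for f in files if fnmatch.fnmatch(f, "test.txt*")]
print(matched)  # → ['test.txt__9a8b7c6d_0000']
```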
Example 3: read data from HDFS and write it to MySQL.
(1) Rename the file uploaded in the previous example.
(2) Generate the official template:
python bin/datax.py -r hdfsreader -w mysqlwriter
(3) Create a test1 table in the datax database in MySQL to receive the data:
use datax;
create table test1(id int, name varchar(20));
(4) Write the job configuration file:
vim job/hdfs2mysql.json
{
    "job": {
        "content": [
            {
                "reader": {
                    "name": "hdfsreader",
                    "parameter": {
                        "column": ["*"],
                        "defaultFS": "hdfs://hadoop114:9000",
                        "encoding": "UTF-8",
                        "fieldDelimiter": "\t",
                        "fileType": "text",
                        "path": "/test.txt"
                    }
                },
                "writer": {
                    "name": "mysqlwriter",
                    "parameter": {
                        "column": [
                            "id",
                            "name"
                        ],
                        "connection": [
                            {
                                "jdbcUrl": "jdbc:mysql://hadoop114:3306/datax",
                                "table": ["test1"]
                            }
                        ],
                        "password": "199032",
                        "username": "root",
                        "writeMode": "insert"
                    }
                }
            }
        ],
        "setting": {
            "speed": {
                "channel": "1"
            }
        }
    }
}
(5) Run the job:
bin/datax.py job/hdfs2mysql.json