Introduction to DataX
1, DataX is an open-source offline synchronization tool for heterogeneous data sources from Alibaba. It provides stable and efficient data synchronization between a wide range of heterogeneous sources, including relational databases (MySQL, Oracle, etc.), HDFS, Hive, ODPS, HBase, and FTP.
2, 1) Design goal:
2) Framework design:
reader: data collection
writer: data writing
framework: the channel connecting reader and writer
3) How it works:
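The reader/framework/writer split above can be sketched as a simple producer–consumer pipeline. This is a minimal illustration of the idea, not DataX's actual implementation; all names here are made up:

```python
import queue
import threading

def reader(channel, records):
    # Reader plugin: pull records from the source and push them into the channel.
    for rec in records:
        channel.put(rec)
    channel.put(None)  # sentinel: no more data

def writer(channel, sink):
    # Writer plugin: pop records from the channel and write them to the target.
    while True:
        rec = channel.get()
        if rec is None:
            break
        sink.append(rec)

# The framework wires reader and writer together through a bounded channel.
channel = queue.Queue(maxsize=32)
sink = []
t = threading.Thread(target=writer, args=(channel, sink))
t.start()
reader(channel, [(1001, "zhangfei"), (1002, "lisi"), (1003, "liubei")])
t.join()
print(sink)  # all records arrived at the target
```

The bounded queue is the point: the reader never runs unbounded ahead of the writer, which is the role "framework" plays between the two plugins.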
Installation
1, Official address
Download: http://datax-opensource.oss-cn-hangzhou.aliyuncs.com/datax.tar.gz
2, Environment
- Linux
- JDK (1.8 recommended)
- Python (Python 2.6.x recommended)
3, Install
1) Upload the downloaded datax.tar.gz to /opt/app on hadoop12
[xian@hadoop12 soft]$ ls
datax.tar.gz
2) Extract datax.tar.gz to /opt/app
[xian@hadoop12 soft]$ tar -zxvf datax.tar.gz -C /opt/app/
3) Run the self-check script
[xian@hadoop12 bin]$ cd /opt/app/datax/bin/
[xian@hadoop12 bin]$ python datax.py /opt/app/datax/job/job.json
Getting started
1, Read data from a stream and print it to the console
1) View the configuration template
[xian@hadoop12 bin]$ python datax.py -r streamreader -w streamwriter
DataX (DATAX-OPENSOURCE-3.0), From Alibaba !
Copyright © 2010-2017, Alibaba Group. All Rights Reserved.
Please refer to the streamreader document: https://github.com/alibaba/DataX/blob/master/streamreader/doc/streamreader.md
Please refer to the streamwriter document: https://github.com/alibaba/DataX/blob/master/streamwriter/doc/streamwriter.md
Please save the following configuration as a json file and use
python {DATAX_HOME}/bin/datax.py {JSON_FILE_NAME}.json
to run the job.
{
    "job": {
        "content": [
            {
                "reader": {
                    "name": "streamreader",
                    "parameter": {
                        "column": [],
                        "sliceRecordCount": ""
                    }
                },
                "writer": {
                    "name": "streamwriter",
                    "parameter": {
                        "encoding": "",
                        "print": true
                    }
                }
            }
        ],
        "setting": {
            "speed": {
                "channel": ""
            }
        }
    }
}
2) Write a configuration file based on the template
[xian@hadoop12 job]$ vim stream2stream.json
Fill in the following:
{
    "job": {
        "content": [
            {
                "reader": {
                    "name": "streamreader",
                    "parameter": {
                        "sliceRecordCount": 10,
                        "column": [
                            {
                                "type": "long",
                                "value": "10"
                            },
                            {
                                "type": "string",
                                "value": "hello,DataX"
                            }
                        ]
                    }
                },
                "writer": {
                    "name": "streamwriter",
                    "parameter": {
                        "encoding": "UTF-8",
                        "print": true
                    }
                }
            }
        ],
        "setting": {
            "speed": {
                "channel": 1
            }
        }
    }
}
3) Run
[xian@hadoop12 job]$
/opt/app/datax/bin/datax.py /opt/app/datax/job/stream2stream.json
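What this job does can be approximated in a few lines of Python: streamreader emits sliceRecordCount copies of the configured column values, and streamwriter prints each record to the console. This is an illustration of the job's behavior, not DataX code:

```python
# Rough Python equivalent of the stream2stream job above (illustration only).
slice_record_count = 10
columns = [{"type": "long", "value": "10"},
           {"type": "string", "value": "hello,DataX"}]

# streamreader: generate sliceRecordCount identical records.
records = []
for _ in range(slice_record_count):
    records.append("\t".join(c["value"] for c in columns))

# streamwriter with "print": true writes each record to the console.
for line in records:
    print(line)  # each line: 10<TAB>hello,DataX
```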
1.2 Read data from MySQL and store it in HDFS
[xian@hadoop12 ~]$ python /opt/app/datax/bin/datax.py -r mysqlreader -w hdfswriter
{
    "job": {
        "content": [
            {
                "reader": {
                    "name": "mysqlreader",
                    "parameter": {
                        "column": [],
                        "connection": [
                            {
                                "jdbcUrl": [],
                                "table": []
                            }
                        ],
                        "password": "",
                        "username": "",
                        "where": ""
                    }
                },
                "writer": {
                    "name": "hdfswriter",
                    "parameter": {
                        "column": [],
                        "compress": "",
                        "defaultFS": "",
                        "fieldDelimiter": "",
                        "fileName": "",
                        "fileType": "",
                        "path": "",
                        "writeMode": ""
                    }
                }
            }
        ],
        "setting": {
            "speed": {
                "channel": ""
            }
        }
    }
}
mysqlreader parameters: column (source columns to read), connection.jdbcUrl, connection.table, username, password, where (optional row filter)
hdfswriter parameters: column (name/type of each target column), compress, defaultFS (NameNode address), fieldDelimiter, fileName, fileType (text or orc), path, writeMode
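Since DataX job files are plain JSON, a quick structural check before submitting can catch mistakes like smart quotes or trailing commas early. A minimal sketch (the required keys follow the template above; `check_job` is a hypothetical helper, not part of DataX):

```python
import json

def check_job(text):
    """Minimal structural check of a DataX job file: valid JSON with a
    reader and writer inside every job.content entry."""
    # json.loads raises ValueError on smart quotes, trailing commas, etc.
    job = json.loads(text)
    content = job["job"]["content"]
    return all("reader" in c and "writer" in c for c in content)

sample = '{"job": {"content": [{"reader": {"name": "mysqlreader"}, "writer": {"name": "hdfswriter"}}]}}'
print(check_job(sample))  # True
```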
1.2.1 Prepare data
1) Create the emp table
mysql> create database datax;
mysql> use datax;
mysql> create table emp(id int,name varchar(20));
2) Insert data
mysql> insert into emp values(1001,'zhangfei'),(1002,'lisi'),(1003,'liubei');
1.2.2 Write the configuration file
[xian@hadoop12 datax]$ vim /opt/app/datax/job/mysql2hdfs.json
{
    "job": {
        "content": [
            {
                "reader": {
                    "name": "mysqlreader",
                    "parameter": {
                        "column": [
                            "id",
                            "name"
                        ],
                        "connection": [
                            {
                                "jdbcUrl": [
                                    "jdbc:mysql://hadoop12:3306/datax"
                                ],
                                "table": [
                                    "emp"
                                ]
                            }
                        ],
                        "username": "root",
                        "password": "123456"
                    }
                },
                "writer": {
                    "name": "hdfswriter",
                    "parameter": {
                        "column": [
                            {
                                "name": "id",
                                "type": "int"
                            },
                            {
                                "name": "name",
                                "type": "string"
                            }
                        ],
                        "defaultFS": "hdfs://hadoop12:9000",
                        "fieldDelimiter": "\t",
                        "fileName": "emp.txt",
                        "fileType": "text",
                        "path": "/",
                        "writeMode": "append"
                    }
                }
            }
        ],
        "setting": {
            "speed": {
                "channel": "1"
            }
        }
    }
}
1.2.3 Run the job
[xian@hadoop12 datax]$ bin/datax.py job/mysql2hdfs.json
2021-08-17 16:02:16.581 [job-0] INFO JobContainer -
Job start time            : 2021-08-17 16:02:04
Job end time              : 2021-08-17 16:02:16
Total elapsed time        : 12s
Average traffic           : 3B/s
Record write speed        : 0rec/s
Total records read        : 3
Total read/write failures : 0
1.2.4 Check HDFS
Note: at runtime, HdfsWriter appends a random suffix to the configured fileName; each writing thread uses its own suffixed name as the actual file name.
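The effect can be illustrated like this. It is only a sketch: the exact suffix format is internal to DataX, and a UUID stands in for it here:

```python
import uuid

def actual_file_name(file_name):
    # Each writer thread appends a random suffix to the configured fileName,
    # so "emp.txt" ends up on HDFS as something like "emp.txt__<random>".
    return "%s__%s" % (file_name, uuid.uuid4().hex)

name = actual_file_name("emp.txt")
print(name)  # e.g. emp.txt__9f86d081884c7d65...
```

This is why the next example first renames `/emp.txt*` back to a fixed name before reading it.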
1.3 Read HDFS data and write it to MySQL
1) Rename the file uploaded in the previous example
[xian@hadoop12 datax]$ hadoop fs -mv /emp.txt* /emp.txt
2) View the official template
[xian@hadoop12 datax]$ python bin/datax.py -r hdfsreader -w mysqlwriter
{
    "job": {
        "content": [
            {
                "reader": {
                    "name": "hdfsreader",
                    "parameter": {
                        "column": [],
                        "defaultFS": "",
                        "encoding": "UTF-8",
                        "fieldDelimiter": ",",
                        "fileType": "orc",
                        "path": ""
                    }
                },
                "writer": {
                    "name": "mysqlwriter",
                    "parameter": {
                        "column": [],
                        "connection": [
                            {
                                "jdbcUrl": "",
                                "table": []
                            }
                        ],
                        "password": "",
                        "preSql": [],
                        "session": [],
                        "username": "",
                        "writeMode": ""
                    }
                }
            }
        ],
        "setting": {
            "speed": {
                "channel": ""
            }
        }
    }
}
3) Create the configuration file
[xian@hadoop12 datax]$ vim job/hdfs2mysql.json
{
    "job": {
        "content": [
            {
                "reader": {
                    "name": "hdfsreader",
                    "parameter": {
                        "column": ["*"],
                        "defaultFS": "hdfs://hadoop12:9000",
                        "encoding": "UTF-8",
                        "fieldDelimiter": "\t",
                        "fileType": "text",
                        "path": "/emp.txt"
                    }
                },
                "writer": {
                    "name": "mysqlwriter",
                    "parameter": {
                        "column": [
                            "id",
                            "name"
                        ],
                        "connection": [
                            {
                                "jdbcUrl": "jdbc:mysql://hadoop12:3306/datax",
                                "table": ["emp2"]
                            }
                        ],
                        "password": "123456",
                        "username": "root",
                        "writeMode": "insert"
                    }
                }
            }
        ],
        "setting": {
            "speed": {
                "channel": "1"
            }
        }
    }
}
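With fileType "text" and fieldDelimiter "\t", hdfsreader simply splits each line of the file on the delimiter before handing the fields to the writer. Roughly, in Python (illustration only, not DataX code):

```python
# How a text-mode hdfsreader record is split (simplified illustration).
field_delimiter = "\t"
line = "1001\tzhangfei"  # one line of /emp.txt as written by hdfswriter

fields = line.split(field_delimiter)
record = {"id": int(fields[0]), "name": fields[1]}
print(record)  # {'id': 1001, 'name': 'zhangfei'}
```

This is also why the fieldDelimiter here must match the one used by hdfswriter in the previous example.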
4) Create the emp2 table in MySQL's datax database
mysql> use datax;
mysql> create table emp2(id int,name varchar(20));
5) Run the job
[xian@hadoop12 datax]$ bin/datax.py job/hdfs2mysql.json
2021-08-17 16:21:53.616 [job-0] INFO JobContainer -
Job start time            : 2021-08-17 16:21:41
Job end time              : 2021-08-17 16:21:53
Total elapsed time        : 11s
Average traffic           : 3B/s
Record write speed        : 0rec/s
Total records read        : 3
Total read/write failures : 0
6) Check the emp2 table
mysql> select * from emp2;
+------+----------+
| id   | name     |
+------+----------+
| 1001 | zhangfei |
| 1002 | lisi     |
| 1003 | liubei   |
+------+----------+
3 rows in set (0.00 sec)
2, Oracle
Prerequisite: Oracle is already installed.
2.1 Read data from Oracle and store it in MySQL
1) Create a table in MySQL
[oracle@hadoop12 ~]$ mysql -uroot -p123456
mysql> create database oracle;
mysql> use oracle;
mysql> create table student(id int,name varchar(20));
2) Write the DataX configuration file
[oracle@hadoop12 ~]$ vim /opt/app/datax/job/oracle2mysql.json
{
    "job": {
        "content": [
            {
                "reader": {
                    "name": "oraclereader",
                    "parameter": {
                        "column": ["*"],
                        "connection": [
                            {
                                "jdbcUrl": ["jdbc:oracle:thin:@hadoop12:1521:orcl"],
                                "table": ["emp"]
                            }
                        ],
                        "password": "123456",
                        "username": "root"
                    }
                },
                "writer": {
                    "name": "mysqlwriter",
                    "parameter": {
                        "column": ["*"],
                        "connection": [
                            {
                                "jdbcUrl": "jdbc:mysql://hadoop12:3306/oracle",
                                "table": ["student"]
                            }
                        ],
                        "password": "123456",
                        "username": "root",
                        "writeMode": "insert"
                    }
                }
            }
        ],
        "setting": {
            "speed": {
                "channel": "1"
            }
        }
    }
}
3) Run the command
[oracle@hadoop12 ~]$
/opt/app/datax/bin/datax.py /opt/app/datax/job/oracle2mysql.json
Check the result:
mysql> select * from student;
+------+----------+
| id   | name     |
+------+----------+
|    1 | zhangfei |
+------+----------+
2.2 Read data from Oracle and store it in HDFS
1) Write the configuration file
[oracle@hadoop12 datax]$ vim job/oracle2hdfs.json
{
    "job": {
        "content": [
            {
                "reader": {
                    "name": "oraclereader",
                    "parameter": {
                        "column": ["*"],
                        "connection": [
                            {
                                "jdbcUrl": ["jdbc:oracle:thin:@hadoop12:1521:orcl"],
                                "table": ["emp"]
                            }
                        ],
                        "password": "123456",
                        "username": "root"
                    }
                },
                "writer": {
                    "name": "hdfswriter",
                    "parameter": {
                        "column": [
                            {
                                "name": "id",
                                "type": "int"
                            },
                            {
                                "name": "name",
                                "type": "string"
                            }
                        ],
                        "defaultFS": "hdfs://hadoop12:9000",
                        "fieldDelimiter": "\t",
                        "fileName": "oracle.txt",
                        "fileType": "text",
                        "path": "/",
                        "writeMode": "append"
                    }
                }
            }
        ],
        "setting": {
            "speed": {
                "channel": "1"
            }
        }
    }
}
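When writing hdfswriter's "column" list by hand, each Oracle source type needs a corresponding hdfswriter type, as in the id/name pair above. A hypothetical helper for building that list might look like this (the mapping choices are assumptions for illustration, not DataX's internal conversion table):

```python
# Hypothetical Oracle-to-hdfswriter type mapping for building the "column" list.
TYPE_MAP = {
    "INTEGER": "int",
    "NUMBER": "double",
    "VARCHAR2": "string",
    "DATE": "date",
}

def writer_columns(oracle_columns):
    # oracle_columns: list of (name, oracle_type) pairs;
    # unknown types fall back to "string".
    return [{"name": n, "type": TYPE_MAP.get(t, "string")}
            for n, t in oracle_columns]

cols = writer_columns([("id", "INTEGER"), ("name", "VARCHAR2")])
print(cols)  # [{'name': 'id', 'type': 'int'}, {'name': 'name', 'type': 'string'}]
```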
2) Run
[oracle@hadoop12 datax]$ bin/datax.py job/oracle2hdfs.json
3) Check the result in HDFS
3, MongoDB
Prerequisite: MongoDB is already installed.
3.1 Read data from MongoDB and import it into HDFS
1) Write the configuration file
[xian@hadoop12 datax]$ vim job/mongdb2hdfs.json
{
    "job": {
        "content": [
            {
                "reader": {
                    "name": "mongodbreader",
                    "parameter": {
                        "address": ["127.0.0.1:27017"],
                        "collectionName": "atguigu",
                        "column": [
                            {
                                "name": "name",
                                "type": "string"
                            },
                            {
                                "name": "url",
                                "type": "string"
                            }
                        ],
                        "dbName": "test"
                    }
                },
                "writer": {
                    "name": "hdfswriter",
                    "parameter": {
                        "column": [
                            {
                                "name": "name",
                                "type": "string"
                            },
                            {
                                "name": "url",
                                "type": "string"
                            }
                        ],
                        "defaultFS": "hdfs://hadoop12:9000",
                        "fieldDelimiter": "\t",
                        "fileName": "mongo.txt",
                        "fileType": "text",
                        "path": "/",
                        "writeMode": "append"
                    }
                }
            }
        ],
        "setting": {
            "speed": {
                "channel": "1"
            }
        }
    }
}
2) mongodbreader parameters
address: the MongoDB address(es); since MongoDB may be a cluster, the ip:port entries are given as a JSON array. [required]
userName: MongoDB user name. [optional]
userPassword: MongoDB password. [optional]
collectionName: the MongoDB collection name. [required]
column: the document fields to read. [required]
name: the field name. [required]
type: the field type. [optional]
splitter: MongoDB supports array types but the DataX framework does not, so array values read from MongoDB are joined into a single string with this separator. [optional]
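The splitter behavior is easy to picture (a simplified illustration; the document and field names are made up):

```python
# How the splitter handles a MongoDB array field (simplified illustration):
# an array value is joined into one string before entering the DataX channel.
splitter = ","
doc = {"name": "atguigu", "tags": ["java", "bigdata", "datax"]}

value = doc["tags"]
if isinstance(value, list):
    value = splitter.join(str(v) for v in value)
print(value)  # java,bigdata,datax
```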
3) Run
[xian@hadoop12 datax]$ bin/datax.py job/mongdb2hdfs.json
4) Check the result
3.2 Read data from MongoDB and import it into MySQL
1) Create a table in MySQL
mysql> create table atguigu(name varchar(20),url varchar(20));
2) Write the DataX configuration file
[xian@hadoop12 datax]$ vim job/mongodb2mysql.json
{
    "job": {
        "content": [
            {
                "reader": {
                    "name": "mongodbreader",
                    "parameter": {
                        "address": ["127.0.0.1:27017"],
                        "collectionName": "atguigu",
                        "column": [
                            {
                                "name": "name",
                                "type": "string"
                            },
                            {
                                "name": "url",
                                "type": "string"
                            }
                        ],
                        "dbName": "test"
                    }
                },
                "writer": {
                    "name": "mysqlwriter",
                    "parameter": {
                        "column": ["*"],
                        "connection": [
                            {
                                "jdbcUrl": "jdbc:mysql://hadoop12:3306/test",
                                "table": ["atguigu"]
                            }
                        ],
                        "password": "000000",
                        "username": "root",
                        "writeMode": "insert"
                    }
                }
            }
        ],
        "setting": {
            "speed": {
                "channel": "1"
            }
        }
    }
}
3) Run
[xian@hadoop12 datax]$ bin/datax.py job/mongodb2mysql.json
4) Check the result
mysql> select * from atguigu;
+---------+-----------------+
| name    | url             |
+---------+-----------------+
| emp     | www.emp.com     |
+---------+-----------------+
That's it, friends. If this helped, a like and a follow are much appreciated. May your salary keep climbing.