Business Data Collection Module: Implementing Full-Table Data Synchronization
Preface
Having decided to synchronize the full tables with DataX and the incremental tables with Maxwell, and having covered how to use DataX, we can now implement full-table data synchronization.
I. Full-Table Data Synchronization
1. Data Synchronization Channel
DataX synchronizes the full tables directly from the MySQL business database into HDFS. The data flow is shown in the figure below:
The target HDFS path is /origin_data/gmall/db/activity_info_full/xxxx-xx-xx. In HDFS, each table name carries the suffix _full relative to its MySQL counterpart, and each table directory holds one subdirectory per day, which simplifies the later Hive table creation and partition planning.
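Because the dated directory follows a fixed pattern, it can be derived in a shell script before each run. A minimal sketch, assuming the layout above (the variable names `do_date` and `target_dir` are illustrative, not part of the original setup):

```shell
#!/bin/bash
# Build the dated HDFS target path for a full-sync table.
# Layout: /origin_data/gmall/db/<table>_full/<date>, matching the path above.
table=activity_info
do_date=$(date -d '-1 day' +%F)   # full syncs typically load the previous day's snapshot
target_dir=/origin_data/gmall/db/${table}_full/${do_date}
echo "$target_dir"
```

In practice the directory would be created with `hadoop fs -mkdir -p "$target_dir"` before the DataX job runs, since hdfswriter expects the target path to already exist.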
2. Writing the DataX Configuration File
As a refresher, the full tables are listed in the figure below:
Taking one of these tables, activity_info, as an example, the corresponding configuration file is:
{
    "job": {
        "setting": {
            "speed": {
                "channel": 1
            }
        },
        "content": [
            {
                "reader": {
                    "name": "mysqlreader",
                    "parameter": {
                        "username": "root",
                        "password": "123456",
                        "column": [
                            "id",
                            "activity_name",
                            "activity_type",
                            "activity_desc",
                            "start_time",
                            "end_time",
                            "create_time"
                        ],
                        "splitPk": "",
                        "connection": [
                            {
                                "table": [
                                    "activity_info"
                                ],
                                "jdbcUrl": [
                                    "jdbc:mysql://hadoop102:3306/gmall"
                                ]
                            }
                        ]
                    }
                },
                "writer": {
                    "name": "hdfswriter",
                    "parameter": {
                        "defaultFS": "hdfs://hadoop102:8020",
                        "fileType": "text",
                        "path": "${targetDir}",
                        "fileName": "activity_info",
                        "column": [
                            {"name": "id", "type": "bigint"},
                            {"name": "activity_name", "type": "string"},
                            {"name": "activity_type", "type": "string"},
                            {"name": "activity_desc", "type": "string"},
                            {"name": "start_time", "type": "string"},
                            {"name": "end_time", "type": "string"},
                            {"name": "create_time", "type": "string"}
                        ],
                        "writeMode": "append",
                        "fieldDelimiter": "