Project background:
A large business table is extracted from GBase 8a into a MySQL database; the MySQL side of the table has two extra columns. The previous extraction was based on rowid, but rows have been deleted from this table over time, so although it now holds a little over 200 million records, its rowid values run as high as 2.7 billion. On inspection the table turned out to have a time column, so we decided to extract it day by day based on that column instead.
System environment:
GBase 8a cluster
MySQL 5.6.46 master-slave
Implementation steps
1. Query the columns that need to be synchronized
select group_concat(concat('"',column_name,'"')) from information_schema.columns where table_name='aa';
2. Generate the DataX job JSON file
{
    "job": {
        "setting": {
            "speed": {
                "byte": 1048576
            },
            "errorLimit": {
                "record": 0,
                "percentage": 0.02
            }
        },
        "content": [
            {
                "reader": {
                    "name": "${source_reader}",
                    "parameter": {
                        "username": "${source_db_user}",
                        "password": "${source_db_pwd}",
                        "connection": [
                            {
                                "querySql": [
                                    "select * from ${source_table_name} where ${date_column} >= '${start_time}' and ${date_column} < '${end_time}'"
                                ],
                                "jdbcUrl": [
                                    "${source_db_conn}"
                                ]
                            }
                        ]
                    }
                },
                "writer": {
                    "name": "${target_reader}",
                    "parameter": {
                        "username": "${target_db_user}",
                        "password": "${target_db_pwd}",
                        "session": [],
                        "preSql": [],
                        "column": ["col1", "col2"],
                        "connection": [
                            {
                                "table": [
                                    "${target_table_name}"
                                ],
                                "jdbcUrl": "${target_db_conn}"
                            }
                        ]
                    }
                }
            }
        ]
    }
}
3. Generate the table configuration file, in the following format
qjqx.aa:webdata.aa:2020-05-17:2020-05-18:aa_dd
qjqx.aa:webdata.aa:2020-05-16:2020-05-17:aa_dd
Notes:
qjqx.aa is the source database and table name
webdata.aa is the target database and table name
2020-05-17:2020-05-18 is the one-day window to synchronize
aa_dd is the time column in the source table that the extraction is based on
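The per-day entries can be generated rather than written by hand. A minimal sketch, assuming GNU date; the source/target names and the date range below are just the example values from this step:

```shell
#!/bin/bash
# Generate one table_name.txt line per day (GNU date assumed).
src='qjqx.aa'        # source db.table, example value
tgt='webdata.aa'     # target db.table, example value
col='aa_dd'          # time column used for partitioning
d='2020-05-16'       # first day to sync (inclusive)
end='2020-05-18'     # stop day (exclusive)
while [ "$d" != "$end" ]; do
    next=$(date -d "$d +1 day" +%Y-%m-%d)
    echo "${src}:${tgt}:${d}:${next}:${col}"
    d=$next
done
```

Redirect the output to /opt/datax/job/table_name.txt; for the range above it prints exactly the two example lines shown earlier.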
4. Prepare the sync script
#!/bin/bash
#function: sync a large table day by day with DataX
#version:0.3
#author:lineqi
#crt_time:2020-04-26
# Approach: split the large table by day and run one DataX job per day.
# Example DataX invocation:
# eg: /opt/datax/bin/datax.py -p "-Dsource_reader=${v_gbase_reader}" /opt/datax/job/mysqltest.json

# file and command paths
v_table_list='/opt/datax/job/table_name.txt'
v_exec_command='/opt/datax/bin/datax.py'
v_path_json='/opt/datax/job/mysqltomysql57.json'
v_path_log='/opt/datax/log/'

# common variables
v_source_table_name=''
v_target_table_name=''
v_start_time=''
v_end_time=''
v_date_column=''
v_append_time=' 00:00:00'
v_column_sets=''
v_sync_start_time=$(date -d "today" +"%Y-%m-%d-%H-%M-%S")

# source database connection
v_gbase_user='test'
v_gbase_pwd='test'
v_gbase_conn='jdbc:gbase://192.168.1.103:5258/test'
v_gbase_reader='rdbmsreader'

# target database connection
v_mysql_user='lineqi'
v_mysql_pwd='Lineqi#123!'
v_mysql_reader='mysqlwriter'
v_mysql_conn='jdbc:mysql://192.168.1.103:3306/webdata'

# read each entry from table_name.txt, parse its fields and run one DataX job
for table_name in $(cat $v_table_list)
do
    v_source_table_name=$(echo $table_name | awk -F ":" '{print $1}')
    v_target_table_name=$(echo $table_name | awk -F ":" '{print $2}')
    v_start_time=$(echo $table_name | awk -F ":" '{print $3}')"$v_append_time"
    v_end_time=$(echo $table_name | awk -F ":" '{print $4}')"$v_append_time"
    v_date_column=$(echo $table_name | awk -F ":" '{print $5}')
    $v_exec_command -p "\
        -Dsource_reader=${v_gbase_reader} \
        -Dsource_db_user=${v_gbase_user} \
        -Dsource_db_pwd=${v_gbase_pwd} \
        -Dsource_db_conn=${v_gbase_conn} \
        -Dsource_table_name=${v_source_table_name} \
        -Ddate_column=${v_date_column} \
        -Dstart_time='${v_start_time}' \
        -Dend_time='${v_end_time}' \
        -Dtarget_reader=${v_mysql_reader} \
        -Dtarget_db_user=${v_mysql_user} \
        -Dtarget_db_pwd=${v_mysql_pwd} \
        -Dtarget_db_conn=${v_mysql_conn} \
        -Dtarget_table_name=${v_target_table_name} \
        -Dcolumn_sets=${v_column_sets}" \
        "$v_path_json" >> "$v_path_log""$v_source_table_name""$v_sync_start_time".log 2>&1
done
5. Run the sync
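After the jobs finish, the per-table logs can be scanned for failures. A minimal sketch, assuming DataX's end-of-job statistics include the "读写失败总数" (total failed records) line in each log:

```shell
#!/bin/bash
# Post-sync check sketch. Assumption: each DataX log ends with a summary
# that contains a "读写失败总数 : N" line.
failed_records() {
    # print the failed-record count recorded in one DataX log file
    grep '读写失败总数' "$1" | awk -F: '{gsub(/ /, "", $2); print $2}'
}

# Flag any log under the script's log directory that reports failures
for f in /opt/datax/log/*.log; do
    [ -f "$f" ] || continue
    n=$(failed_records "$f")
    if [ "$n" != "0" ]; then
        echo "failures in $f: $n"
    fi
done
```

Any flagged day can then be re-run by putting its line back into table_name.txt and executing the script again.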
Summary
1. Why migrate day by day: this is a wide table with 400,000-600,000 rows per day, and pulling several days in a single run can easily cause a DataX out-of-memory error.
2. The columns to synchronize must be queried from the target table in advance and added to the JSON file by hand; passing the column list into the JSON file as a parameter would be more convenient.
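The idea in point 2 can be sketched as follows. DataX substitutes -D parameters into the job file, so the writer's column list could become a placeholder (column_sets is a hypothetical parameter name, matching the unused v_column_sets variable already defined in the script in step 4):

```json
"column": [ ${column_sets} ]
```

The script would then pass the output of the step-1 query, e.g. -Dcolumn_sets='"col1","col2"', and the hand-edited list in the JSON file would no longer be needed. This is a sketch of the suggested improvement, not something the original setup implements.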