之前公司交给我一个需求:需要从postgresql数据库的多张表同步到sqlsersver数据库的表中,这次利用阿里云开发的DataX对异构数据库之间同步数据。
DataX官网介绍:https://github.com/alibaba/DataX/blob/master/introduction.md
DataX安装使用:https://github.com/alibaba/DataX/blob/master/userGuid.md
简单总结下前置工作:
1.下载DataX工具包,这个是官方直接编译好能用的;
2.下载安装Python 2.6.X;
3. 打开命令窗口,切换到 bin 目录下: cd DataX/target/datax/datax/bin;
4. 在 ./datax/job 下创建 test.json,一张表对应一个json文件,这个配置是全量同步的;
5. 启动DataX命令:python datax.py ./job/test.json
{
"job": {
"setting": {
"speed": {
"channel": 3
},
"errorLimit": {
"record": 0,
"percentage": 0.02
}
},
"content": [
{
"reader": {
"name": "postgresqlreader",
"parameter": {
"username": "postgres",
"password": "root",
"column": ["id","token","origin_module","is_sso_token","cust_no","sign_time","is_valid"],
"where": "",
"connection": [
{
"table": [
"public.auth_token"],
"jdbcUrl": [
"jdbc:postgresql://localhost:5432/cloud_finance"
]
}
]
}
},
"writer": {
"name": "sqlserverwriter",
"parameter": {
"username": "sa",
"password": "root",
"column": ["id","token","origin_module","is_sso_token","cust_no","sign_time","is_valid"],
"connection": [
{
"table": [
"dbo.auth_token"],
"jdbcUrl": "jdbc:sqlserver://localhost:1433;DatabaseName=cloud_finance"
}
]
}
}
}
]
}
}
增量同步操作步骤:
1.实现增量同步需要在表中增加一个时间戳字段,如created_time,在同步配置文件中,通过where条件,根据时间戳字段筛选当前时间向前一段时间内的增量数据。
json文件中,${start_time} 和 ${end_time}为调用datax.py时传入的参数。这次的json例子是从sqlserver同步数据到pgsql,表的时间字段类型需要转换一下。
{
"job":{
"content":[
{
"reader": {
"name": "sqlserverreader",
"parameter": {
"username": "sa",
"password": "admin",
"where":"DATEDIFF(SS,'1970-1-1 08:00:00',created_time) >= ${start_time} and DATEDIFF(SS,'1970-1-1 08:00:00',created_time) < ${end_time}",
"column": [
"id", "cust_no", "id_no", "mobile", "name", "user_type","created_time"
],
"connection": [
{
"table": [
"test"
],
"jdbcUrl": [
"jdbc:sqlserver://localhost:1433;DatabaseName=cloud_finance"
]
}
]
}
},
"writer": {
"name": "postgresqlwriter",
"parameter": {
"username": "postgres",
"password": "admin",
"column": [
"id", "cust_no", "id_no", "mobile", "name", "user_type", "created_time"
],
"connection": [
{
"jdbcUrl": "jdbc:postgresql://localhost:5432/cloud_finance",
"table": [
"test"
]
}
]
}
}
}
],
"setting":{
"speed":{
"channel":"5"
}
}
}
}
2. 多表增量同步实现,创建一个脚本 task.sh
#!/bin/bash
source /etc/profile
# 截至时间设置为当前时间戳
end_time=$(date +%s)
# 开始时间设置为60s前时间戳
start_time=$(($end_time - 60))
echo "同步开始时间:", $start_time, "<br>\n";
echo "同步结束时间:", $end_time, "<br>\n";
python ./bin/datax.py ./job/test.json -p "-Dstart_time=$start_time -Dend_time=$end_time"
echo "test表同步结束";
python ./bin/datax.py ./job/auth_token.json -p "-Dstart_time=$start_time -Dend_time=$end_time"
echo "auth_token表同步结束";
3.设置定时任务:crontab -e 进入当前用户的工作表编辑定时器,设置每分钟执行一次
*/1 * * * * /data/app/datax/task.sh
参考链接:https://www.lhtry.net/Blog/Detail/29
4.有些表是没有时间字段的,where配置无法满足,可以通过querySql配置型来自定义筛选SQL。当用户配置了这一项之后,DataX系统就会忽略table,column这些配置型,直接使用这个配置项的内容对数据进行筛选,例如需要进行多表join后同步数据,使用select a,b from table_a join table_b on table_a.id = table_b.id
拓展:
Datax-web是DataX的操作界面工具,我们可通过页面选择数据源即可创建数据同步任务。参考链接:https://github.com/WeiYe-Jing/datax-web