FlinkX: Offline and Real-Time Synchronization of PostgreSQL Data to Kafka
I. Environment Setup
1. Pick a server (one with Flink already installed is best) and clone the project with git:
git clone https://github.com/DTStack/flinkx.git
cd flinkx
Or download the source directly:
wget https://github.com/DTStack/flinkx/archive/1.8.5.zip
unzip 1.8.5.zip
cd flinkx-1.8.5
2. Build the plugins:
mvn clean package -DskipTests
The build downloads a lot of dependencies, so it can be slow. If it fails because driver jars such as DB2, Dameng (DM), GBase, or ojdbc8 cannot be found, run the bundled ./install_jars.sh script (in the bin directory) to install them, then rebuild.
Offline Synchronization
First, write a fairly simple job that reads data from the PostgreSQL database and prints it straight to the console.
Job JSON file:
{
  "job": {
    "content": [{
      "reader": {
        "parameter": {
          "column": [{
            "name": "id",
            "type": "bigint",
            "key": "id"
          }],
          "username": "your username",
          "password": "your password",
          "connection": [{
            "jdbcUrl": ["your JDBC URL"],
            "table": ["yrw_test"]
          }],
          "where": "id > 0",
          "splitPk": "id",
          "fetchSize": 1000,
          "queryTimeOut": 1000,
          "customSql": "",
          "requestAccumulatorInterval": 2
        },
        "name": "postgresqlreader"
      },
      "writer": {
        "name": "streamwriter",
        "parameter": {
          "print": true
        }
      }
    }],
    "setting": {
      "speed": {
        "channel": 1,
        "bytes": 0
      },
      "errorLimit": {
        "record": 100
      }
    }
  }
}
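To try the job end to end you need the source table. Here is a minimal JDBC sketch for creating and seeding yrw_test; the connection details are placeholders, the single bigint id column simply mirrors the reader's column list above, and the PostgreSQL JDBC driver is assumed to be on the classpath:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;
import java.sql.Statement;

public class CreateSampleTable {
    public static void main(String[] args) throws SQLException {
        // Placeholders: use the same connection details as in the job JSON.
        String url = "jdbc:postgresql://localhost:5432/mydb";
        try (Connection conn = DriverManager.getConnection(url, "user", "password");
             Statement st = conn.createStatement()) {
            // One bigint primary-key column, matching the reader config.
            st.executeUpdate("CREATE TABLE IF NOT EXISTS yrw_test (id BIGINT PRIMARY KEY)");
            st.executeUpdate("INSERT INTO yrw_test (id) VALUES (1), (2), (3)");
        }
    }
}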
Run the job:
/data/bigdata/module/flinkx/flinkx_1.10/bin/flinkx -mode local \
-job /data/bigdata/module/flinkx/job/pg2print.json \
-pluginRoot /data/bigdata/module/flinkx/flinkx_1.10/plugins \
-confProp "{\"flink.checkpoint.interval\":60000}"
That's all it takes to run an offline PostgreSQL sync job. This uses local mode; FlinkX also supports standalone and YARN modes, which work as described in the official documentation.
Real-Time Synchronization
This example uses CDC to capture changes from the PostgreSQL database in real time and write them into Kafka.
Job JSON file:
{
  "job": {
    "content": [{
      "reader": {
        "parameter": {
          "username": "your username",
          "password": "your password",
          "url": "your PostgreSQL URL",
          "databaseName": "database name",
          "cat": "update,insert,delete",
          "tableList": [
            "table name"
          ],
          "statusInterval": 10000,
          "lsn": 0,
          "slotName": "aaa",
          "allowCreateSlot": true,
          "temporary": true,
          "pavingData": true
        },
        "name": "pgwalreader"
      },
      "writer": {
        "parameter": {
          "timezone": "UTC",
          "topic": "yrw_test_flinkx",
          "producerSettings": {
            "zookeeper.connect": "ZooKeeper address",
            "bootstrap.servers": "Kafka bootstrap.servers"
          }
        },
        "name": "kafka10writer"
      }
    }],
    "setting": {
      "speed": {
        "channel": 1,
        "bytes": 0
      },
      "errorLimit": {
        "record": 100
      },
      "restore": {
        "maxRowNumForCheckpoint": 0,
        "isRestore": false,
        "isStream": true,
        "restoreColumnName": "",
        "restoreColumnIndex": 0
      },
      "log": {
        "isLogger": false,
        "level": "debug",
        "path": "",
        "pattern": ""
      }
    }
  }
}
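Note that pgwalreader relies on PostgreSQL logical replication, so the server must run with wal_level = logical and have a free replication slot available. Below is a minimal JDBC sketch to check this before launching the job; the connection details are placeholders and the PostgreSQL JDBC driver is assumed to be on the classpath:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

public class PgCdcPrecheck {
    public static void main(String[] args) throws SQLException {
        // Placeholders: point these at the same database the job reads from.
        String url = "jdbc:postgresql://localhost:5432/mydb";
        try (Connection conn = DriverManager.getConnection(url, "user", "password");
             Statement st = conn.createStatement()) {
            // Logical decoding only works when wal_level is 'logical'.
            try (ResultSet rs = st.executeQuery("SHOW wal_level")) {
                rs.next();
                System.out.println("wal_level = " + rs.getString(1));
            }
            // List existing replication slots; with allowCreateSlot=true the
            // job will create one named after slotName ("aaa" above).
            try (ResultSet rs = st.executeQuery(
                    "SELECT slot_name, active FROM pg_replication_slots")) {
                while (rs.next()) {
                    System.out.println(rs.getString("slot_name")
                            + " active=" + rs.getBoolean("active"));
                }
            }
        }
    }
}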
Run the job:
/data/bigdata/module/flinkx/flinkx_1.10/bin/flinkx -mode local \
-job /data/bigdata/module/flinkx/job/pg_wal_kafka.json \
-pluginRoot /data/bigdata/module/flinkx/flinkx_1.10/plugins \
-confProp "{\"flink.checkpoint.interval\":60000}" \
-jobid yrw_test_1125
Running this, you will find the job errors out; two small fixes to the FlinkX source are needed. First, in com.dtstack.flinkx.pgwal.format.PgWalInputFormat#openInputFormat:
@Override
public void openInputFormat() throws IOException {
    super.openInputFormat(); // add this line
    executor = Executors.newFixedThreadPool(1);
    queue = new SynchronousQueue<>(true);
}
Second, in com.dtstack.flinkx.pgwal.reader.PgwalReader#readData:
@Override
public DataStream<Row> readData() {
    PgWalInputFormatBuilder builder = new PgWalInputFormatBuilder();
    builder.setUsername(username);
    builder.setPassword(password);
    builder.setUrl(url);
    builder.setDatabaseName(databaseName);
    builder.setCat(cat);
    builder.setPavingData(pavingData);
    builder.setTableList(tableList);
    builder.setRestoreConfig(restoreConfig);
    builder.setStatusInterval(statusInterval);
    builder.setLsn(lsn);
    builder.setSlotName(slotName);
    builder.setAllowCreateSlot(allowCreateSlot);
    builder.setTemporary(temporary);
    builder.setDataTransferConfig(dataTransferConfig); // add this line
    return createInput(builder.finish(), "pgwalreader");
}
Rebuild and repackage, and the job runs.
There is one more issue when using kafka10writer. Say the PostgreSQL changes are being written to a Kafka topic with 5 partitions: how do we guarantee that every operation on the same id is sent to the same partition? Only then is the order of changes to a single row preserved, which is what keeps the data consistent. The fix is to use the primary key as the record key: Kafka's default partitioner hashes the key, so all records with the same key land in the same partition. (Because pavingData is enabled, the flattened change event exposes the primary key as before_id or after_id.)
Modify com.dtstack.flinkx.kafka10.writer.Kafka10OutputFormat:
@Override
protected void emit(Map event) throws IOException {
    String tp = Formatter.format(event, topic, timezone);
    // Use the primary key as the record key.
    String key = (String) (event.get("before_id") == null ? event.get("after_id") : event.get("before_id"));
    producer.send(new ProducerRecord<>(tp, /* was: event.toString() */ key, objectMapper.writeValueAsString(event)), (metadata, exception) -> {
        if (Objects.nonNull(exception)) {
            String errorMessage = String.format("send data failed, data [%s], error info %s",
                    event, ExceptionUtil.getErrorMessage(exception));
            LOG.warn(errorMessage);
            throw new RuntimeException(errorMessage);
        }
    });
}
Repackage and rebuild, and that's it.
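To verify the partitioning behavior, consume the topic and print each record's key and partition; with the patched writer, every event carrying a given id key should report the same partition. A minimal sketch using the kafka-clients consumer API (the newer poll(Duration) variant; bootstrap address and group id are placeholders):

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class PartitionCheck {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder
        props.put("group.id", "partition-check");         // placeholder
        props.put("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("auto.offset.reset", "earliest");
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("yrw_test_flinkx"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> r : records) {
                    // A given key should always print the same partition.
                    System.out.printf("key=%s partition=%d offset=%d%n",
                            r.key(), r.partition(), r.offset());
                }
            }
        }
    }
}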
Finally, here is the FlinkX repository: https://github.com/DTStack/flinkx