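Flink CDC: Column Ordering in the PostgresCDC Full Snapshot Phase
Background
When PostgresCDC is used for data synchronization, state size grows sharply during the full snapshot phase. State TTL is normally configured on processing time, but during the snapshot phase Flink processes the data within a short period. The TTL causes expired state to be dropped, and because the data arrives unordered, the processing results become inconsistent. If the source data has a time column that reflects processing order, reading the snapshot ordered by that column resolves the inconsistency. If no TTL is configured, the out-of-order problem does not need to be considered. This article describes how to support column ordering in the PostgresCDC full snapshot phase (the non-incremental snapshot mode).
Implementation
The list of columns to sort by is read from a connector option; more than one column may be given. When the snapshot query is generated during the full snapshot phase, the sort columns are appended to the SQL statement.
Two classes are changed: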
- RelationalSnapshotChangeEventSource
- PostgresSnapshotChangeEventSource
Example connector option:
"debezium.snapshot.scan.order.columns": "time"
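The option carries a debezium. prefix, while the field defined in the Debezium code is named snapshot.scan.order.columns. Flink CDC's table connectors forward debezium.-prefixed options to the embedded Debezium engine with the prefix removed. The sketch below imitates that mapping using only JDK types; DebeziumOptionSketch and toDebeziumProperties are hypothetical names for illustration, not Flink CDC classes.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Properties;

final class DebeziumOptionSketch {
    // Prefix that marks pass-through options in the table connector.
    static final String DEBEZIUM_PREFIX = "debezium.";

    /** Copies every "debezium."-prefixed option into a Properties object, with the prefix removed. */
    static Properties toDebeziumProperties(Map<String, String> connectorOptions) {
        Properties props = new Properties();
        for (Map.Entry<String, String> e : connectorOptions.entrySet()) {
            if (e.getKey().startsWith(DEBEZIUM_PREFIX)) {
                props.setProperty(e.getKey().substring(DEBEZIUM_PREFIX.length()), e.getValue());
            }
        }
        return props;
    }

    public static void main(String[] args) {
        Map<String, String> options = new LinkedHashMap<>();
        options.put("connector", "postgres-cdc");
        options.put("debezium.snapshot.scan.order.columns", "time");

        Properties props = toDebeziumProperties(options);
        // The key now matches the Field name read by the snapshot source.
        System.out.println(props.getProperty("snapshot.scan.order.columns")); // prints "time"
    }
}
```

Options without the prefix, such as connector, are not passed to Debezium in this sketch.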
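Changes to RelationalSnapshotChangeEventSource
Location: io.debezium.relational.RelationalSnapshotChangeEventSource
Purpose: read the sort columns from the connector options and build the SQL
Changes:
- Add the sort-column configuration field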
public static final Field SNAPSHOT_SCAN_ORDER_COLUMNS = Field.create("snapshot.scan.order.columns")
        .withDisplayName("Snapshot scan order columns")
        .withType(ConfigDef.Type.LIST)
        .withWidth(ConfigDef.Width.LONG)
        .withValidation(Field::isListOfRegex);
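- Add an overridable method so that each database dialect can supply its own ordering clause (the default implementation returns the statement unchanged)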
protected Optional<String> getSnapshotSelectWithOrderBy(String select, List<String> orderColumns) {
    return Optional.of(select);
}
- Assemble the query statement by modifying the determineSnapshotSelect method
/**
 * Returns a valid query string for the specified table, either given by the user via snapshot select overrides or
 * defaulting to a statement provided by the DB-specific change event source.
 *
 * @param tableId the table to generate a query for
 * @return a valid query string or empty if table will not be snapshotted
 */
private Optional<String> determineSnapshotSelect(RelationalSnapshotContext<P, O> snapshotContext, TableId tableId) {
    String overriddenSelect = connectorConfig.getSnapshotSelectOverridesByTable().get(tableId);

    // try without catalog id, as this might or might not be populated based on the given connector
    if (overriddenSelect == null) {
        overriddenSelect = connectorConfig.getSnapshotSelectOverridesByTable().get(new TableId(null, tableId.schema(), tableId.table()));
    }

    if (overriddenSelect != null) {
        return Optional.of(enhanceOverriddenSelect(snapshotContext, overriddenSelect, tableId));
    }

    List<String> columns = getPreparedColumnNames(snapshotContext.partition, schema.tableFor(tableId));
    Optional<String> snapshotSelect = getSnapshotSelect(snapshotContext, tableId, columns);

    // Support ordering by the configured column names
    if (snapshotSelect.isPresent()) {
        String orderColumnsConf = this.connectorConfig.getConfig().getString(SNAPSHOT_SCAN_ORDER_COLUMNS);
        if (!Strings.isNullOrBlank(orderColumnsConf)) {
            List<String> orderColumns = Arrays.stream(orderColumnsConf.split(",")).map(
                    s -> {
                        // Quote the name the same way as the prepared column names, then validate it
                        String orderColumn = jdbcConnection.quotedColumnIdString(s.trim());
                        if (!columns.contains(orderColumn)) {
                            throw new ConnectException("Invalid order column '" + orderColumn + "' in '" + columns + "'");
                        }
                        return orderColumn;
                    }
            ).collect(Collectors.toList());
            if (!orderColumns.isEmpty()) {
                return getSnapshotSelectWithOrderBy(snapshotSelect.get(), orderColumns);
            }
        }
    }
    return snapshotSelect;
}
Changes to PostgresSnapshotChangeEventSource
Location: io.debezium.connector.postgresql.PostgresSnapshotChangeEventSource
Purpose: generate the ordering clause for Postgres
Changes:
- Override the getSnapshotSelectWithOrderBy method
@Override
protected Optional<String> getSnapshotSelectWithOrderBy(String select, List<String> orderColumns) {
    return Optional.of(select + " ORDER BY " + String.join(",", orderColumns));
}
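To see what statement these changes produce, the parsing, validation, and ORDER BY steps can be tried outside Debezium. The following is a minimal sketch under stated assumptions: OrderBySketch, quote, and withOrderBy are hypothetical names, quote is only a stand-in for jdbcConnection.quotedColumnIdString (assumed here to wrap the name in double quotes), and the table and column names are made up.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class OrderBySketch {
    /** Stand-in for jdbcConnection.quotedColumnIdString: wraps the name in double quotes. */
    static String quote(String column) {
        return "\"" + column + "\"";
    }

    /** Parses the comma-separated option value, validates each column, and appends ORDER BY. */
    static String withOrderBy(String select, String orderColumnsConf, List<String> quotedTableColumns) {
        if (orderColumnsConf == null || orderColumnsConf.trim().isEmpty()) {
            return select; // option not set: keep the original statement
        }
        List<String> orderColumns = new ArrayList<>();
        for (String raw : orderColumnsConf.split(",")) {
            String quoted = quote(raw.trim());
            // Reject names that are not among the columns selected by the snapshot query
            if (!quotedTableColumns.contains(quoted)) {
                throw new IllegalArgumentException(
                        "Invalid order column '" + quoted + "' in '" + quotedTableColumns + "'");
            }
            orderColumns.add(quoted);
        }
        return select + " ORDER BY " + String.join(",", orderColumns);
    }

    public static void main(String[] args) {
        List<String> columns = Arrays.asList("\"id\"", "\"name\"", "\"time\"");
        String select = "SELECT \"id\", \"name\", \"time\" FROM \"public\".\"orders\"";
        // prints: SELECT "id", "name", "time" FROM "public"."orders" ORDER BY "time","id"
        System.out.println(withOrderBy(select, "time, id", columns));
    }
}
```

As in the modified determineSnapshotSelect, an unknown column fails fast instead of silently producing an unordered snapshot.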