Flink CDC Series: The dialect Package of the flink-cdc-base Module

Originally published 2025-11-28 22:31:37


License: CC 4.0 BY-SA


DataSourceDialect
JdbcDataSourceDialect

DataSourceDialect

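This is the data source dialect interface. It defines a unified abstraction layer for interacting with different types of databases.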

Interface overview

/**
 * The dialect of data source.
 *
 * @param <C> The source config of data source.
 */
@Experimental
public interface DataSourceDialect<C extends SourceConfig> extends Serializable, Closeable {

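Dialect pattern: each database supplies its own implementation

Generic parameter: the type parameter C lets each dialect work with its own source config type

Experimental: @Experimental signals that the interface may still change

Serializable: instances can be shipped across a distributed environment

Closeable: supports resource cleanup

Core methods in detail
Dialect identity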

/** Get the name of dialect. */
String getName();

Purpose: returns the dialect name, such as "MySQL", "PostgreSQL", or "Oracle"

Data discovery

/** Discovers the list of data collection to capture. */
List<TableId> discoverDataCollections(C sourceConfig);

Purpose: discovers the set of tables to capture
Returns: a list of table identifiers

/**
 * Discovers the captured data collections' schema by {@link SourceConfig}.
 */
Map<TableId, TableChanges.TableChange> discoverDataCollectionSchemas(C sourceConfig);

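Purpose: discovers the schema definitions of the captured tables
Returns: a map from table ID to the TableChange that describes its schema

Offset management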

/**
 * Displays current offset from the database e.g. query Mysql binary logs by query <code>
 * SHOW MASTER STATUS</code>.
 */
Offset displayCurrentOffset(C sourceConfig);

Purpose: gets the database's current offset (for example, the MySQL binlog position)

/** Displays committed offset from the database e.g. query Postgresql confirmed_lsn */
default Offset displayCommittedOffset(C sourceConfig) {
    throw new UnsupportedOperationException();
}

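Purpose: gets the committed offset (only some databases need this)
Default implementation: throws UnsupportedOperationException; dialects that support the operation override it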
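
The "optional capability through a default method" idea can be sketched in isolation. The types below are hypothetical stand-ins (offsets are plain strings), and the fallback policy in startingOffset is only an assumption for illustration, not something the framework is known to do:

```java
final class CommittedOffsetSketch {

    /** Hypothetical, simplified dialect: one mandatory and one optional offset query. */
    interface OffsetDialect {
        String displayCurrentOffset();

        // Optional capability: dialects that cannot report a committed offset inherit this.
        default String displayCommittedOffset() {
            throw new UnsupportedOperationException();
        }
    }

    /** Binlog-style dialect: only the current offset is available. */
    static final class BinlogStyleDialect implements OffsetDialect {
        @Override
        public String displayCurrentOffset() {
            return "mysql-bin.000003:154";
        }
    }

    /** Slot-style dialect: overrides the default to report what the server has confirmed. */
    static final class SlotStyleDialect implements OffsetDialect {
        @Override
        public String displayCurrentOffset() {
            return "0/16B3748";
        }

        @Override
        public String displayCommittedOffset() {
            return "0/16B3700";
        }
    }

    /** Illustrative policy: use the committed offset if supported, else the current one. */
    static String startingOffset(OffsetDialect dialect) {
        try {
            return dialect.displayCommittedOffset();
        } catch (UnsupportedOperationException e) {
            return dialect.displayCurrentOffset();
        }
    }

    public static void main(String[] args) {
        System.out.println(startingOffset(new BinlogStyleDialect()));
        System.out.println(startingOffset(new SlotStyleDialect()));
    }
}
```

A dialect that never overrides the default costs nothing to write, while callers that need the committed offset have to be ready for the exception.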

Identifier handling

/** Check if the CollectionId is case-sensitive or not. */
boolean isDataCollectionIdCaseSensitive(C sourceConfig);

Purpose: checks whether identifiers such as table names are case-sensitive
Why it matters: it affects how SQL queries are built
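
To make the effect of this flag concrete, here is a minimal sketch of comparing a configured table name with a discovered one. The helper names are made up for this example, and lower-casing as the normalization rule is an assumption; a real database may fold identifiers differently:

```java
import java.util.Locale;

final class CaseSensitivitySketch {

    /** Normalizes an identifier for comparison; lower-casing is an illustrative choice. */
    static String normalize(String identifier, boolean caseSensitive) {
        return caseSensitive ? identifier : identifier.toLowerCase(Locale.ROOT);
    }

    /** Compares a configured table name with a discovered one under the given sensitivity. */
    static boolean sameTable(String configured, String discovered, boolean caseSensitive) {
        return normalize(configured, caseSensitive).equals(normalize(discovered, caseSensitive));
    }

    public static void main(String[] args) {
        System.out.println(sameTable("Orders", "orders", false));
        System.out.println(sameTable("Orders", "orders", true));
    }
}
```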

Data chunking

/** Returns the {@link ChunkSplitter} which used to split collection to splits. */
@Deprecated
ChunkSplitter createChunkSplitter(C sourceConfig);

ChunkSplitter createChunkSplitter(C sourceConfig, ChunkSplitterState chunkSplitterState);

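Purpose: creates the chunk splitter, which breaks a large table into small chunks that can be processed in parallel
Evolution: the single-argument overload is deprecated; the newer overload takes a ChunkSplitterState so the splitter can be restored from state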
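
As a rough picture of what "splitting a table into chunks" and "resuming from state" mean, the sketch below cuts a numeric key range into evenly sized chunks. It is a toy under stated assumptions: Chunk and the resumeFrom parameter are hypothetical stand-ins for a snapshot split and for ChunkSplitterState, and a real splitter also has to cope with non-numeric keys and skewed key distributions:

```java
import java.util.ArrayList;
import java.util.List;

final class ChunkSplitSketch {

    /** A half-open key range [start, end); hypothetical stand-in for a snapshot split. */
    static final class Chunk {
        final long start;
        final long end;

        Chunk(long start, long end) {
            this.start = start;
            this.end = end;
        }

        @Override
        public String toString() {
            return "[" + start + ", " + end + ")";
        }
    }

    /**
     * Splits the keys in [min, max] into chunks of at most chunkSize keys.
     * resumeFrom plays the role of restored state: keys below it are skipped.
     * Assumes max is well below Long.MAX_VALUE.
     */
    static List<Chunk> split(long min, long max, long chunkSize, long resumeFrom) {
        if (chunkSize <= 0) {
            throw new IllegalArgumentException("chunkSize must be positive");
        }
        List<Chunk> chunks = new ArrayList<>();
        long start = Math.max(min, resumeFrom);
        while (start <= max) {
            long end = Math.min(start + chunkSize, max + 1);
            chunks.add(new Chunk(start, end));
            start = end;
        }
        return chunks;
    }

    public static void main(String[] args) {
        System.out.println(split(1, 10, 4, 1)); // fresh start
        System.out.println(split(1, 10, 4, 5)); // resumed after the first chunk
    }
}
```

With keys 1 to 10 and a chunk size of 4, a fresh run yields three chunks, and a run resumed at key 5 yields only the last two.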

Fetch tasks

/** The fetch task used to fetch data of a snapshot split or stream split. */
FetchTask<SourceSplitBase> createFetchTask(SourceSplitBase sourceSplitBase);

Purpose: creates the fetch task that reads a snapshot split or a stream split

/** The task context used for fetch task to fetch data from external systems. */
FetchTask.Context createFetchTaskContext(C sourceConfig);

Purpose: creates the context a fetch task runs in (database connections and so on)

Checkpoint notification

/**
 * We may need the offset corresponding to the checkpointId. For example, we should commit LSN
 * of checkpoint to postgres's slot.
 */
default void notifyCheckpointComplete(long checkpointId, Offset offset) throws Exception {}

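Purpose: notifies the dialect when a checkpoint completes (for example, so it can commit the LSN to PostgreSQL)
Use: tells the database which data has been durably processed, so that data is not read again after a restart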
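
The interplay between checkpoints and offsets can be sketched as follows. Everything here is a simplified assumption: offsets are plain longs, Reader is a hypothetical component that remembers which offset each checkpoint covered, and the dialect only records the highest offset it was asked to confirm instead of talking to a database:

```java
import java.util.TreeMap;

final class CheckpointCommitSketch {

    /** Hypothetical stand-in for the dialect; offsets are plain longs here. */
    interface Dialect {
        // No-op by default, so dialects without a commit step need no code.
        default void notifyCheckpointComplete(long checkpointId, long offset) {}
    }

    /** Slot-style dialect that remembers the highest offset it was asked to confirm. */
    static final class SlotStyleDialect implements Dialect {
        long confirmedOffset = -1L;

        @Override
        public void notifyCheckpointComplete(long checkpointId, long offset) {
            confirmedOffset = Math.max(confirmedOffset, offset);
        }
    }

    /** Hypothetical reader that tracks the offset captured by each checkpoint. */
    static final class Reader {
        private final Dialect dialect;
        private final TreeMap<Long, Long> offsetsByCheckpoint = new TreeMap<>();

        Reader(Dialect dialect) {
            this.dialect = dialect;
        }

        void snapshotState(long checkpointId, long currentOffset) {
            offsetsByCheckpoint.put(checkpointId, currentOffset);
        }

        void onCheckpointComplete(long checkpointId) {
            Long offset = offsetsByCheckpoint.get(checkpointId);
            if (offset != null) {
                dialect.notifyCheckpointComplete(checkpointId, offset);
            }
            // This checkpoint and all older ones are no longer needed.
            offsetsByCheckpoint.headMap(checkpointId, true).clear();
        }

        int pendingCheckpoints() {
            return offsetsByCheckpoint.size();
        }
    }

    public static void main(String[] args) {
        SlotStyleDialect dialect = new SlotStyleDialect();
        Reader reader = new Reader(dialect);
        reader.snapshotState(1L, 100L);
        reader.snapshotState(2L, 250L);
        reader.snapshotState(3L, 400L);
        reader.onCheckpointComplete(2L);
        System.out.println(dialect.confirmedOffset + " confirmed, "
                + reader.pendingCheckpoints() + " checkpoint(s) pending");
    }
}
```

Only the offset that belongs to a completed checkpoint is confirmed, so the database never discards data that a restart from the last successful checkpoint would still need.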

Table filter check

/** Check if the tableId is included in SourceConfig. */
boolean isIncludeDataCollection(C sourceConfig, TableId tableId);

Purpose: checks whether a table falls within the configured capture scope
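
One plausible way to implement such a check is to match the fully qualified table name against a list of include patterns. The sketch below assumes regular expressions over "database.table" strings; the class and method names are made up for this example and the real config classes may express the filter differently:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

final class TableFilterSketch {

    /** Returns true if the qualified name fully matches at least one include pattern. */
    static boolean isIncluded(List<Pattern> includePatterns, String qualifiedName) {
        for (Pattern pattern : includePatterns) {
            if (pattern.matcher(qualifiedName).matches()) {
                return true;
            }
        }
        return false;
    }

    /** Keeps only the discovered tables that match one of the include regexes. */
    static List<String> filter(List<String> includeRegexes, List<String> discovered) {
        List<Pattern> patterns = new ArrayList<>();
        for (String regex : includeRegexes) {
            patterns.add(Pattern.compile(regex));
        }
        List<String> kept = new ArrayList<>();
        for (String table : discovered) {
            if (isIncluded(patterns, table)) {
                kept.add(table);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> include = Arrays.asList("shop\\.orders", "shop\\.user_.*");
        List<String> discovered =
                Arrays.asList("shop.orders", "shop.user_profile", "shop.audit_log", "crm.orders");
        System.out.println(filter(include, discovered));
    }
}
```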

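Concrete implementation examples
MySQL dialect implementation

import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;
import java.util.ArrayList;
import java.util.List;
// Flink CDC and Debezium imports omitted for brevity

// Simplified illustration of a MySQL dialect, not the connector's actual source code;
// getConnection(...) and BinlogOffset are placeholders
public class MySqlDataSourceDialect implements DataSourceDialect<MySqlSourceConfig> {

    @Override
    public String getName() {
        return "MySQL";
    }

    @Override
    public List<TableId> discoverDataCollections(MySqlSourceConfig sourceConfig) {
        // List the tables of the first configured database through JDBC metadata
        try (Connection conn = getConnection(sourceConfig);
             ResultSet tables = conn.getMetaData().getTables(
                 sourceConfig.getDatabaseList().get(0), null, "%", new String[]{"TABLE"})) {
            List<TableId> tableIds = new ArrayList<>();
            while (tables.next()) {
                // For MySQL the database is reported as the catalog
                tableIds.add(new TableId(tables.getString("TABLE_CAT"),
                                       tables.getString("TABLE_SCHEM"),
                                       tables.getString("TABLE_NAME")));
            }
            return tableIds;
        } catch (SQLException e) {
            throw new IllegalStateException("Failed to discover tables", e);
        }
    }

    @Override
    public Offset displayCurrentOffset(MySqlSourceConfig sourceConfig) {
        // Run SHOW MASTER STATUS to read the binlog position
        // (MySQL 8.4 replaces this statement with SHOW BINARY LOG STATUS)
        try (Connection conn = getConnection(sourceConfig);
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery("SHOW MASTER STATUS")) {
            if (rs.next()) {
                String file = rs.getString("File");
                long position = rs.getLong("Position");
                return new BinlogOffset(file, position);
            }
            // Fail loudly instead of returning null
            throw new IllegalStateException("No binlog position found; is binary logging enabled?");
        } catch (SQLException e) {
            throw new IllegalStateException("Failed to query the binlog position", e);
        }
    }

    @Override
    public ChunkSplitter createChunkSplitter(MySqlSourceConfig sourceConfig,
                                           ChunkSplitterState chunkSplitterState) {
        return new MySqlChunkSplitter(sourceConfig, chunkSplitterState);
    }

    // ... other methods
}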

PostgreSQL dialect implementation

// Simplified illustration of a PostgreSQL dialect, not the connector's actual source code
public class PostgresDataSourceDialect implements DataSourceDialect<PostgresSourceConfig> {

    @Override
    public String getName() {
        return "PostgreSQL";
    }

    @Override
    public List<TableId> discoverDataCollections(PostgresSourceConfig sourceConfig) {
        // Query pg_catalog.pg_tables to list the user tables
        String sql = "SELECT schemaname, tablename FROM pg_catalog.pg_tables " +
                    "WHERE schemaname NOT IN ('pg_catalog', 'information_schema')";
        // ... run the query and return the list of TableId
    }

    @Override
    public Offset displayCommittedOffset(PostgresSourceConfig sourceConfig) {
        // Query pg_replication_slots to read confirmed_flush_lsn
        String sql = "SELECT confirmed_flush_lsn FROM pg_replication_slots WHERE slot_name = ?";
        // ... run the query and return the LSN offset
    }

    @Override
    public void notifyCheckpointComplete(long checkpointId, Offset offset) throws Exception {
        // Confirm the LSN to the PostgreSQL replication slot. One way to express this in SQL
        // is pg_replication_slot_advance (PostgreSQL 11+); a streaming client would normally
        // report the flushed LSN over its replication connection instead
        String sql = "SELECT pg_replication_slot_advance(?, ?)";
        // ... run the confirmation
    }

    // ... other methods
}

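Design pattern analysis
Strategy pattern

// Each database plugs in its own dialect strategy
DataSourceDialect<MySqlSourceConfig> mysqlDialect = new MySqlDataSourceDialect();
DataSourceDialect<PostgresSourceConfig> postgresDialect = new PostgresDataSourceDialect();

Template method pattern
The framework drives the overall read flow and calls back into the dialect for the database-specific steps; default methods such as notifyCheckpointComplete act as optional hooks.

Factory pattern
Dialect creation can be centralized in a factory. The class below is an illustrative sketch rather than a class shipped with flink-cdc-base:

public class DataSourceDialectFactory {
    public static DataSourceDialect<? extends SourceConfig> createDialect(SourceConfig sourceConfig) {
        if (sourceConfig instanceof MySqlSourceConfig) {
            return new MySqlDataSourceDialect();
        } else if (sourceConfig instanceof PostgresSourceConfig) {
            return new PostgresDataSourceDialect();
        }
        throw new UnsupportedOperationException(
            "Unsupported source config: " + sourceConfig.getClass().getName());
    }
}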

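Role in the Flink CDC architecture
Where the dialect sits in the data flow:

Flink CDC Framework
    ↓
DataSourceDialect (unified interface)
    ↓
Concrete database dialects (MySQLDialect, PostgresDialect, etc.)
    ↓
Database-specific operations (SHOW MASTER STATUS, pg_replication_slots, etc.)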

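Configuration and usage examples
Using the dialect when creating a source

// Simplified illustration: this class is hypothetical and far smaller than the
// framework's own IncrementalSource class
public class MySqlIncrementalSource {

    private final DataSourceDialect<MySqlSourceConfig> dialect;
    private final MySqlSourceConfig sourceConfig;

    public MySqlIncrementalSource(MySqlSourceConfig sourceConfig) {
        this.sourceConfig = sourceConfig;
        this.dialect = new MySqlDataSourceDialect();
    }

    public List<TableId> discoverTables() {
        return dialect.discoverDataCollections(sourceConfig);
    }

    public List<SnapshotSplit> createSplits() throws Exception {
        // Start from an empty splitter state instead of null, then split table by table
        ChunkSplitter splitter =
            dialect.createChunkSplitter(sourceConfig, ChunkSplitterState.NO_SPLITTING_TABLE_STATE);
        List<SnapshotSplit> splits = new ArrayList<>();
        for (TableId tableId : discoverTables()) {
            splits.addAll(splitter.generateSplits(tableId));
        }
        return splits;
    }

    // ... other methods
}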

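Task execution flow

// Inside a data reading task (simplified illustration; readSnapshotData and
// readStreamData are placeholders)
public class MySqlFetchTask implements FetchTask<SourceSplitBase> {
    private final DataSourceDialect<MySqlSourceConfig> dialect;
    private final SourceSplitBase split;

    public MySqlFetchTask(DataSourceDialect<MySqlSourceConfig> dialect, SourceSplitBase split) {
        this.dialect = dialect;
        this.split = split;
    }

    @Override
    public void execute(FetchTask.Context context) throws Exception {
        // Pick the database-specific read logic based on the split type
        if (split.isSnapshotSplit()) {
            // Read the snapshot data
            readSnapshotData(split, context);
        } else {
            // Read the incremental (stream) data
            readStreamData(split, context);
        }
    }

    // ... other methods
}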

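Extensibility
Supporting a new database
To support a new database, you only need to:

Implement the DataSourceDialect interface
Provide a database-specific config class (it must implement SourceConfig; JDBC-based sources typically extend JdbcSourceConfig)
Implement all abstract methods

Default implementations
The interface provides sensible defaults for its optional hooks, which lowers the effort of writing a new dialect: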

default void notifyCheckpointComplete(long checkpointId, Offset offset) throws Exception {}
default void close() throws IOException {}

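Summary
The DataSourceDialect interface is a core abstraction of the Flink CDC framework. It:

Provides a unified abstraction layer: hides the differences between databases
Supports multiple databases: concrete dialects cover MySQL, PostgreSQL, Oracle, and others
Encapsulates database-specific behavior: each method maps to a specific database operation
Supports state management: the chunk splitter can be restored from state
Helps preserve data consistency: through offset management and checkpoint notification
Is easy to extend: a new database only needs to implement the interface

This interface lets Flink CDC handle a variety of relational databases in a uniform way, and it is a key design element of a general-purpose CDC framework.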

JdbcDataSourceDialect

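This is the JDBC data source dialect interface. It provides the dialect abstraction specifically for relational databases accessed through JDBC.

Interface overview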

/** The dialect of JDBC data source. */
@Experimental
public interface JdbcDataSourceDialect extends DataSourceDialect<JdbcSourceConfig> {

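Inheritance: extends DataSourceDialect

Specialization: an interface specialized for JDBC data sources

Config type: uses JdbcSourceConfig as the config type

Methods in detail
Data discovery methods (overriding the parent interface)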

/** Discovers the list of table to capture. */
@Override
List<TableId> discoverDataCollections(JdbcSourceConfig sourceConfig);

/** Discovers the captured tables' schema by {@link SourceConfig}. */
@Override
Map<TableId, TableChange> discoverDataCollectionSchemas(JdbcSourceConfig sourceConfig);

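Characteristics:

Re-declares the parent methods with the parameter type fixed to JdbcSourceConfig

Frames data discovery in JDBC terms (tables instead of generic data collections); the implementation is still left to the concrete dialect

JDBC connection management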

/**
 * Creates and opens a new {@link JdbcConnection} backing connection pool.
 *
 * @param sourceConfig a basic source configuration.
 * @return a utility that simplifies using a JDBC connection.
 */
JdbcConnection openJdbcConnection(JdbcSourceConfig sourceConfig);

Purpose: creates and opens a JDBC connection
Returns: Debezium's JdbcConnection utility class, which simplifies JDBC operations

/** Get a connection pool factory to create connection pool. */
JdbcConnectionPoolFactory getPooledDataSourceFactory();

Purpose: gets the connection pool factory used to create and manage database connection pools

Table schema query

/** Query and build the schema of table. */
TableChange queryTableSchema(JdbcConnection jdbc, TableId tableId);

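Purpose: queries and builds the schema of a single table
Parameters:

jdbc: the JDBC connection
tableId: the table identifier
Returns: the table's schema as a TableChange

Task creation method (overriding the parent interface)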

@Override
FetchTask<SourceSplitBase> createFetchTask(SourceSplitBase sourceSplitBase);

Purpose: creates a JDBC-specific fetch task

Resource cleanup method

default void close() throws IOException {
    JdbcConnectionPools.getInstance(getPooledDataSourceFactory()).clear();
}

Purpose: releases connection pool resources
Implementation: clears all pools through the JdbcConnectionPools singleton

Design characteristics
JDBC specialization

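// The parent interface is generic over SourceConfig
public interface DataSourceDialect<C extends SourceConfig>

// The child interface fixes the type argument to JdbcSourceConfig
public interface JdbcDataSourceDialect extends DataSourceDialect<JdbcSourceConfig>

Advantages:

Type safety: a mismatched config type is rejected at compile time

Direct access to JDBC-specific settings

Connection pool integration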

JdbcConnection openJdbcConnection(JdbcSourceConfig sourceConfig);
JdbcConnectionPoolFactory getPooledDataSourceFactory();

Design intent:

Unified connection management
Connection reuse for better performance
Protection against connection leaks
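
The reuse and cleanup goals listed above can be illustrated with a small registry that hands out one pool per endpoint. This is a sketch under assumptions: FakePool stands in for a real pooled DataSource, and keying by "host:port" is an illustrative choice rather than the key the framework uses:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

final class PoolRegistrySketch {

    /** Stand-in for a pooled DataSource; only tracks whether it has been closed. */
    static final class FakePool implements AutoCloseable {
        final String endpoint;
        volatile boolean closed;

        FakePool(String endpoint) {
            this.endpoint = endpoint;
        }

        @Override
        public void close() {
            closed = true;
        }
    }

    private final Map<String, FakePool> pools = new ConcurrentHashMap<>();
    private final AtomicInteger created = new AtomicInteger();

    /** Returns the existing pool for the endpoint, creating it on first use. */
    FakePool getOrCreate(String host, int port) {
        String key = host + ":" + port;
        return pools.computeIfAbsent(key, k -> {
            created.incrementAndGet();
            return new FakePool(k);
        });
    }

    /** Closes every pool and forgets it, so nothing leaks when the dialect is closed. */
    void clear() {
        pools.values().forEach(FakePool::close);
        pools.clear();
    }

    int createdCount() {
        return created.get();
    }

    int size() {
        return pools.size();
    }

    public static void main(String[] args) {
        PoolRegistrySketch registry = new PoolRegistrySketch();
        FakePool first = registry.getOrCreate("db1", 3306);
        FakePool second = registry.getOrCreate("db1", 3306);
        System.out.println("same pool reused: " + (first == second));
        registry.clear();
        System.out.println("closed after clear: " + first.closed);
    }
}
```

Because every caller goes through the registry, repeated requests for the same endpoint share one pool, and a single clear() call is enough to release all of them.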

Separate table schema query

TableChange queryTableSchema(JdbcConnection jdbc, TableId tableId);

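Purpose: offers a fine-grained, per-table schema query, which makes it easier to:

Discover new tables incrementally
Handle table schema changes
Query a specific table on demand

Concrete implementation examples
MySQL JDBC dialect implementation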

// Simplified illustration, not the connector's actual source code; calls marked
// "placeholder" stand in for logic that is not shown
public class MySqlJdbcDataSourceDialect implements JdbcDataSourceDialect {

    private final JdbcConnectionPoolFactory connectionPoolFactory =
        new MySqlConnectionPoolFactory();

    @Override
    public List<TableId> discoverDataCollections(JdbcSourceConfig sourceConfig) {
        try (JdbcConnection jdbc = openJdbcConnection(sourceConfig)) {
            // Query information_schema over JDBC to list the user tables
            return jdbc.queryAndMap(
                "SELECT TABLE_SCHEMA, TABLE_NAME FROM information_schema.tables " +
                "WHERE TABLE_TYPE = 'BASE TABLE' AND TABLE_SCHEMA NOT IN ('mysql', 'sys', 'performance_schema')",
                rs -> {
                    List<TableId> tables = new ArrayList<>();
                    while (rs.next()) {
                        // In MySQL, TABLE_SCHEMA is the database name, which maps to the
                        // catalog part of TableId; there is no separate schema level
                        tables.add(new TableId(rs.getString(1), null, rs.getString(2)));
                    }
                    return tables;
                });
        } catch (SQLException e) {
            throw new IllegalStateException("Failed to discover MySQL tables", e);
        }
    }

    @Override
    public JdbcConnection openJdbcConnection(JdbcSourceConfig sourceConfig) {
        // Build the MySQL JDBC connection settings
        Properties props = new Properties();
        props.setProperty("user", sourceConfig.getUsername());
        props.setProperty("password", sourceConfig.getPassword());
        // MySQL Connector/J expects connectTimeout in milliseconds
        props.setProperty("connectTimeout", String.valueOf(sourceConfig.getConnectTimeout().toMillis()));

        String url = String.format("jdbc:mysql://%s:%d/%s",
            sourceConfig.getHostname(),
            sourceConfig.getPort(),
            sourceConfig.getDatabaseList().get(0));

        // Placeholder constructor: shows the intent of wiring URL, properties, and pool factory
        return new JdbcConnection(url, props, getPooledDataSourceFactory());
    }

    @Override
    public JdbcConnectionPoolFactory getPooledDataSourceFactory() {
        return connectionPoolFactory;
    }

    @Override
    public TableChange queryTableSchema(JdbcConnection jdbc, TableId tableId) {
        // Query the detailed structure of one table (placeholder helper)
        return jdbc.getTableSchema(tableId);
    }

    @Override
    public FetchTask<SourceSplitBase> createFetchTask(SourceSplitBase sourceSplitBase) {
        if (sourceSplitBase.isSnapshotSplit()) {
            return new MySqlSnapshotFetchTask(sourceSplitBase.asSnapshotSplit());
        } else {
            return new MySqlStreamFetchTask(sourceSplitBase.asStreamSplit());
        }
    }

    @Override
    public String getName() {
        return "MySQL-JDBC";
    }

    // ... other methods
}

PostgreSQL JDBC dialect implementation

// Simplified illustration, not the connector's actual source code; calls marked
// "placeholder" stand in for logic that is not shown
public class PostgresJdbcDataSourceDialect implements JdbcDataSourceDialect {

    private final JdbcConnectionPoolFactory connectionPoolFactory =
        new PostgresConnectionPoolFactory();

    @Override
    public List<TableId> discoverDataCollections(JdbcSourceConfig sourceConfig) {
        try (JdbcConnection jdbc = openJdbcConnection(sourceConfig)) {
            // Query pg_catalog to list the user tables
            return jdbc.queryAndMap(
                "SELECT schemaname, tablename FROM pg_catalog.pg_tables " +
                "WHERE schemaname NOT IN ('pg_catalog', 'information_schema')",
                rs -> {
                    List<TableId> tables = new ArrayList<>();
                    while (rs.next()) {
                        String schema = rs.getString(1);
                        String table = rs.getString(2);
                        tables.add(new TableId(sourceConfig.getDatabaseList().get(0), schema, table));
                    }
                    return tables;
                });
        } catch (SQLException e) {
            throw new IllegalStateException("Failed to discover PostgreSQL tables", e);
        }
    }

    @Override
    public JdbcConnection openJdbcConnection(JdbcSourceConfig sourceConfig) {
        Properties props = new Properties();
        props.setProperty("user", sourceConfig.getUsername());
        props.setProperty("password", sourceConfig.getPassword());
        // The PostgreSQL JDBC driver expects connectTimeout in seconds;
        // getSeconds() also works on Java 8, unlike toSeconds()
        props.setProperty("connectTimeout", String.valueOf(sourceConfig.getConnectTimeout().getSeconds()));

        String url = String.format("jdbc:postgresql://%s:%d/%s",
            sourceConfig.getHostname(),
            sourceConfig.getPort(),
            sourceConfig.getDatabaseList().get(0));

        // Placeholder constructor: shows the intent of wiring URL, properties, and pool factory
        return new JdbcConnection(url, props, getPooledDataSourceFactory());
    }

    @Override
    public TableChange queryTableSchema(JdbcConnection jdbc, TableId tableId) {
        // PostgreSQL-specific table schema query (placeholder helper)
        return jdbc.readTableSchema(tableId.catalog(), tableId.schema(), tableId.table());
    }

    // ... other methods
}

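Connection pool management architecture
Connection pool factory design

// Simplified illustration of the idea; the actual class in flink-cdc-base differs in detail
public interface JdbcConnectionPoolFactory {
    DataSource createDataSource(JdbcSourceConfig sourceConfig);
    void closeDataSource(DataSource dataSource);
}

// MySQL connection pool factory implementation
public class MySqlConnectionPoolFactory implements JdbcConnectionPoolFactory {
    @Override
    public DataSource createDataSource(JdbcSourceConfig sourceConfig) {
        HikariConfig config = new HikariConfig();
        config.setJdbcUrl(String.format("jdbc:mysql://%s:%d/%s",
            sourceConfig.getHostname(), sourceConfig.getPort(),
            sourceConfig.getDatabaseList().get(0)));
        config.setUsername(sourceConfig.getUsername());
        config.setPassword(sourceConfig.getPassword());
        config.setMaximumPoolSize(sourceConfig.getConnectionPoolSize());
        config.setConnectionTimeout(sourceConfig.getConnectTimeout().toMillis());
        return new HikariDataSource(config);
    }

    @Override
    public void closeDataSource(DataSource dataSource) {
        // HikariDataSource is Closeable; closing it shuts down the pool
        if (dataSource instanceof HikariDataSource) {
            ((HikariDataSource) dataSource).close();
        }
    }
}

Connection pool manager

// Simplified illustration of the idea; the actual class in flink-cdc-base differs in detail
public class JdbcConnectionPools {
    // volatile is required for double-checked locking to be safe
    private static volatile JdbcConnectionPools instance;
    private final Map<String, DataSource> dataSourceMap = new ConcurrentHashMap<>();
    private final JdbcConnectionPoolFactory factory;

    private JdbcConnectionPools(JdbcConnectionPoolFactory factory) {
        this.factory = factory;
    }

    public static JdbcConnectionPools getInstance(JdbcConnectionPoolFactory factory) {
        if (instance == null) {
            synchronized (JdbcConnectionPools.class) {
                if (instance == null) {
                    instance = new JdbcConnectionPools(factory);
                }
            }
        }
        return instance;
    }

    public DataSource getDataSource(JdbcSourceConfig config) {
        String key = generateKey(config);
        return dataSourceMap.computeIfAbsent(key, k -> factory.createDataSource(config));
    }

    public void clear() {
        dataSourceMap.values().forEach(factory::closeDataSource);
        dataSourceMap.clear();
    }

    // One pool per endpoint and user (illustrative key format)
    private static String generateKey(JdbcSourceConfig config) {
        return config.getHostname() + ":" + config.getPort() + "/" + config.getUsername();
    }
}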

Position in the Flink CDC architecture
Dialect hierarchy:

DataSourceDialect<SourceConfig> (generic data source dialect)
    ↑
JdbcDataSourceDialect (JDBC data source dialect)
    ↑
Concrete JDBC dialects (MySqlJdbcDataSourceDialect, PostgresJdbcDataSourceDialect, etc.)

Role in the data reading flow:

IncrementalSource
    → JdbcDataSourceDialect.discoverDataCollections()  // discover tables
    → JdbcDataSourceDialect.openJdbcConnection()      // open a connection
    → JdbcDataSourceDialect.createFetchTask()         // create the fetch task
    → FetchTask.execute()                            // run the data read

Configuration and usage examples
Creating and using a dialect

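// Simplified illustration: the base class shape and the discoverTables/createSplits hooks
// are assumptions made for this example, not the framework's actual IncrementalSource API
public class MySqlIncrementalSource extends IncrementalSource<MySqlSourceConfig> {

    private final JdbcDataSourceDialect dialect;

    public MySqlIncrementalSource(MySqlSourceConfig sourceConfig) {
        super(sourceConfig);
        this.dialect = new MySqlJdbcDataSourceDialect();
    }

    @Override
    protected void discoverTables() {
        // Use the dialect to discover tables
        List<TableId> tables = dialect.discoverDataCollections(sourceConfig);
        // Process the discovered tables...
    }

    @Override
    protected void createSplits() throws Exception {
        // Use the dialect to create a splitter, starting from an empty state instead of null
        ChunkSplitter splitter =
            dialect.createChunkSplitter(sourceConfig, ChunkSplitterState.NO_SPLITTING_TABLE_STATE);
        List<SnapshotSplit> splits = new ArrayList<>();
        for (TableId tableId : dialect.discoverDataCollections(sourceConfig)) {
            splits.addAll(splitter.generateSplits(tableId));
        }
        // Assign the splits...
    }

    @Override
    public void close() throws Exception {
        // Release resources
        dialect.close();
        super.close();
    }
}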

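Table schema discovery flow

// getDialect(...) is a placeholder for however the caller obtains its dialect
public Map<TableId, TableChange> discoverTableSchemas(JdbcSourceConfig config) {
    JdbcDataSourceDialect dialect = getDialect(config);
    Map<TableId, TableChange> schemas = new HashMap<>();

    try (JdbcConnection jdbc = dialect.openJdbcConnection(config)) {
        List<TableId> tables = dialect.discoverDataCollections(config);

        for (TableId tableId : tables) {
            // Query the schema one table at a time, reusing the same connection
            TableChange schema = dialect.queryTableSchema(jdbc, tableId);
            schemas.put(tableId, schema);
        }
    } catch (SQLException e) {
        // Closing the JdbcConnection can throw SQLException
        throw new IllegalStateException("Failed to discover table schemas", e);
    }

    return schemas;
}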