Flink CDC in Practice: Two Approaches

This article is based on Flink CDC 2.3, the latest release as of 2023-02-03.

References: for more and up-to-date information, see the Flink CDC official documentation. This article focuses on the Flink-on-YARN programming practice; for the flink-sql-cli approach, please refer to the official documentation.

Environment

MySQL 5.6 / 5.7 / 8.0.x

Doris 1.1

Flink 1.14.4

Flink CDC 2.3.0

Flink CDC Concepts

CDC stands for Change Data Capture. In the broad sense, any technique that can capture data changes qualifies as CDC; in common usage, however, CDC refers to techniques for capturing changes made to data in databases.

Application scenarios

  1. Data synchronization, for backup and disaster recovery
  2. Data distribution: one source feeding multiple downstream systems, distributed through tools such as Kafka or RocketMQ
  3. Data collection (the "E" in ETL): data integration for data warehouses and data lakes

CDC techniques

The mainstream implementations in the industry today fall into two categories:

  • Query-based CDC, e.g. DataX. As scenarios demand ever lower latency, the drawbacks of this approach become more and more apparent: the offline-scheduling, batch-processing model leads to high latency; slicing the data on an offline schedule cannot guarantee consistency; and real-time delivery cannot be guaranteed either (a JDBC sketch of this style follows this list).
  • Log-based CDC, e.g. Debezium, Canal and Flink CDC. These tools consume the database's change log in real time; the streaming model can guarantee data consistency, delivers data with low latency, and meets today's increasingly real-time business needs.
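
To make the comparison concrete: a query-based approach essentially amounts to scheduled polling of the source table. Below is a minimal JDBC sketch of that style; the update_time column, credentials and schedule are assumptions for illustration only, not how DataX or any specific tool is actually implemented.

import java.sql.*;

// Hypothetical query-based "CDC": periodically re-query rows changed since the last run.
// Rows updated several times between two polls collapse into one state, and deletes are
// invisible, which is why this style cannot guarantee consistency or low latency.
public class PollingSyncSketch {
    public static void main(String[] args) throws SQLException {
        // in a real job this watermark would be persisted by the scheduler
        Timestamp lastPollTime = Timestamp.valueOf("2023-02-03 00:00:00");
        try (Connection conn = DriverManager.getConnection(
                "jdbc:mysql://192.168.22.xxx:3306/emp_1", "xxx", "xxx");
             PreparedStatement ps = conn.prepareStatement(
                     "SELECT emp_no, first_name, last_name, hire_date "
                             + "FROM employees_1 WHERE update_time > ?")) {
            ps.setTimestamp(1, lastPollTime);
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    // hand each row to the downstream batch load
                    System.out.println(rs.getInt("emp_no"));
                }
            }
        }
    }
}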

Common open-source CDC solutions

(Figure: comparison of common open-source CDC solutions; image not reproduced here.)

Hands-on with Flink CDC

Broadly speaking, there are two ways to use Flink CDC:

  • the Flink SQL approach
  • the Flink DataStream approach

And there are two ways to submit a Flink SQL job:

  • submitting SQL directly through the Flink client
  • submitting a packaged program

This hands-on exercise synchronizes data from MySQL to Doris; for the flink-sql-cli approach, please refer to the official documentation. Note that the MySQL CDC connector reads the binlog, so the source MySQL instance must have binlog enabled in ROW format.

pom.xml

<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <groupId>org.example</groupId>
    <artifactId>FlinkCDC</artifactId>
    <version>1.0-SNAPSHOT</version>

    <properties>
        <maven.compiler.source>8</maven.compiler.source>
        <maven.compiler.target>8</maven.compiler.target>
        <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
        <scala.version>2.12</scala.version>
        <java.version>1.8</java.version>
        <flink.version>1.14.4</flink.version>
        <fastjson.version>1.2.62</fastjson.version>
        <hadoop.version>2.8.3</hadoop.version>
        <scope.mode>compile</scope.mode>
        <slf4j.version>1.7.30</slf4j.version>
    </properties>

    <dependencies>

        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-table-api-scala-bridge_${scala.version}</artifactId>
            <version>${flink.version}</version>
        </dependency>
        
        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-table-planner_${scala.version}</artifactId>
            <version>${flink.version}</version>
        </dependency>
        
        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-streaming-scala_${scala.version}</artifactId>
            <version>${flink.version}</version>
        </dependency>
        
        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-clients_${scala.version}</artifactId>
            <version>${flink.version}</version>
        </dependency>
        
        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-connector-jdbc_${scala.version}</artifactId>
            <version>${flink.version}</version>
        </dependency>
        
        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-connector-kafka_${scala.version}</artifactId>
            <version>${flink.version}</version>
        </dependency>
        
        <dependency>
            <groupId>com.alibaba</groupId>
            <artifactId>fastjson</artifactId>
            <version>${fastjson.version}</version>
        </dependency>
        
        <!-- Add log dependencies when debugging locally -->
        <dependency>
            <groupId>org.slf4j</groupId>
            <artifactId>slf4j-api</artifactId>
            <version>${slf4j.version}</version>
        </dependency>
        
        <dependency>
            <groupId>org.slf4j</groupId>
            <artifactId>slf4j-log4j12</artifactId>
            <version>${slf4j.version}</version>
        </dependency>
        
        <!-- flink-doris-connector -->
        <dependency>
            <groupId>org.apache.doris</groupId>
            <artifactId>flink-doris-connector-1.14_2.12</artifactId>
            <version>1.1.0</version>
        </dependency>
        
        <dependency>
            <groupId>mysql</groupId>
            <artifactId>mysql-connector-java</artifactId>
            <version>8.0.27</version>
        </dependency>
        
        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-runtime-web_${scala.version}</artifactId>
            <version>${flink.version}</version>
        </dependency>

        <dependency>
            <groupId>com.ververica</groupId>
            <artifactId>flink-connector-mysql-cdc</artifactId>
            <version>2.3.0</version>
        </dependency>
    </dependencies>


    <build>
        <plugins>
            <plugin>
                <groupId>net.alchim31.maven</groupId>
                <artifactId>scala-maven-plugin</artifactId>
                <version>3.2.1</version>
                <executions>
                    <execution>
                        <id>scala-compile-first</id>
                        <phase>process-resources</phase>
                        <goals>
                            <goal>compile</goal>
                        </goals>
                    </execution>
                    <execution>
                        <id>scala-test-compile</id>
                        <phase>process-test-resources</phase>
                        <goals>
                            <goal>testCompile</goal>
                        </goals>
                    </execution>
                </executions>
                <configuration>
                    <args>
                        <arg>-feature</arg>
                    </args>
                </configuration>
            </plugin>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-compiler-plugin</artifactId>
                <version>3.8.1</version>
                <configuration>
                    <source>8</source>
                    <target>8</target>
                </configuration>
            </plugin>
        </plugins>
    </build>

</project>

Submitting SQL directly with the Flink client

For the detailed submission steps, see https://ververica.github.io/flink-cdc-connectors/master/content/%E5%BF%AB%E9%80%9F%E4%B8%8A%E6%89%8B/mysql-postgres-tutorial-zh.html

Flink SQL program

Java version

package com.kanaikee.bigdata.flink;

import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.table.api.bridge.java.StreamTableEnvironment;

import java.util.UUID;

public class FlinkSQLCDC {
    public static void main(String[] args) {
        final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.enableCheckpointing(10000);
        env.setParallelism(1);
        final StreamTableEnvironment tEnv = StreamTableEnvironment.create(env);
        // register a table in the catalog
        tEnv.executeSql(
                "CREATE TABLE cdc_test_source (\n" +
                        "  emp_no INT,\n" +
                        "  birth_date DATE,\n" +
                        "  first_name STRING,\n" +
                        "  last_name STRING,\n" +
                        "  gender STRING,\n" +
                        "  hire_date  STRING,\n" +
                        "  database_name STRING METADATA VIRTUAL,\n" +
                        "  table_name STRING METADATA VIRTUAL,\n" +
                        "  PRIMARY KEY (`emp_no`) NOT ENFORCED  \n" +
                        ") WITH (\n" +
                        "  'connector' = 'mysql-cdc',\n" +
                        "  'hostname' = '192.168.22.xxx',\n" +
                        "  'port' = '3306',\n" +
                        "  'username' = 'xxx',\n" +
                        "  'password' = 'xxx',\n" +
                        "  'database-name' = 'emp_[0-9]+',\n" +
                        "  'table-name' = 'employees_[0-9]+'\n" +
                        ")");

        String label = UUID.randomUUID().toString();
        //doris table
        tEnv.executeSql(
                "CREATE TABLE doris_test_sink (" +
                        "  emp_no INT,\n" +
                        "  birth_date STRING,\n" +
                        "  first_name STRING,\n" +
                        "  last_name STRING,\n" +
                        "  gender STRING,\n" +
                        "  hire_date  STRING\n" +
                        ") " +
                        "WITH (\n" +
                        "  'connector' = 'doris',\n" +
                        "  'fenodes' = '172.8.10.xxx:8030',\n" +
                        "  'table.identifier' = 'test_db.all_employees_info',\n" +
                        "  'username' = 'xxx',\n" +
                        "  'password' = 'xxx',\n" +
                /* doris stream load label, In the exactly-once scenario,
                   the label is globally unique and must be restarted from the latest checkpoint when restarting.
                   Exactly-once semantics can be turned off via sink.enable-2pc. */
                        "  'sink.label-prefix' ='" + label + "',\n" +
                        "  'sink.properties.format' = 'json',\n" +       //json data format
                        "  'sink.properties.read_json_by_line' = 'true'\n" +
                        ")");

        //insert into mysql table to doris table
        tEnv.executeSql("INSERT INTO doris_test_sink select emp_no,cast(birth_date as string) as birth_date ,first_name,last_name,gender,cast(hire_date as string) as hire_date   from cdc_test_source ");
    }
}
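
A note on submission: the pom above does not configure a shade or assembly plugin, so for a YARN deployment the CDC and Doris connector dependencies must either be bundled into a fat jar or placed under Flink's lib directory. The job itself is then submitted in the usual way, for example flink run -t yarn-per-job -c com.kanaikee.bigdata.flink.FlinkSQLCDC FlinkCDC-1.0-SNAPSHOT.jar (the exact command depends on your cluster and Hadoop setup).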

Scala version

package com.kanaikee.bigdata.flink.scala

import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment
import org.apache.flink.table.api.bridge.scala.StreamTableEnvironment

import java.util.UUID

object ScalaFlinkSQLCDC {

  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    env.enableCheckpointing(10000)
    env.setParallelism(1)
    val tEnv = StreamTableEnvironment.create(env)
    // register a table in the catalog
    tEnv.executeSql(
      "CREATE TABLE cdc_test_source (\n" +
        "  emp_no INT,\n" +
        "  birth_date DATE,\n" +
        "  first_name STRING,\n" +
        "  last_name STRING,\n" +
        "  gender STRING,\n" +
        "  hire_date  STRING,\n" +
        "  database_name STRING METADATA VIRTUAL,\n" +
        "  table_name STRING METADATA VIRTUAL,\n" +
        "  PRIMARY KEY (`emp_no`) NOT ENFORCED  \n" +
        ") WITH (\n" +
        "  'connector' = 'mysql-cdc',\n" +
        "  'hostname' = '192.168.22.xxx',\n" +
        "  'port' = '3306',\n" +
        "  'username' = 'xxx',\n" +
        "  'password' = 'xxx',\n" +
        "  'database-name' = 'emp_[0-9]+',\n" +
        "  'table-name' = 'employees_[0-9]+'\n" +
        ")");

    val label = UUID.randomUUID().toString
    //doris table
    tEnv.executeSql(
      "CREATE TABLE doris_test_sink (" +
        "  emp_no INT,\n" +
        "  birth_date STRING,\n" +
        "  first_name STRING,\n" +
        "  last_name STRING,\n" +
        "  gender STRING,\n" +
        "  hire_date  STRING\n" +
        ") " +
        "WITH (\n" +
        "  'connector' = 'doris',\n" +
        "  'fenodes' = '172.8.10.xxx:8030',\n" +
        "  'table.identifier' = 'test_db.all_employees_info',\n" +
        "  'username' = 'xxx',\n" +
        "  'password' = 'xxx',\n" +
        /* doris stream load label, In the exactly-once scenario,
           the label is globally unique and must be restarted from the latest checkpoint when restarting.
           Exactly-once semantics can be turned off via sink.enable-2pc. */
        "  'sink.label-prefix' ='" + label + "',\n" +
        "  'sink.properties.format' = 'json',\n" + //json data format
        "  'sink.properties.read_json_by_line' = 'true'\n" +
        ")");

    //insert into mysql table to doris table
    tEnv.executeSql("INSERT INTO doris_test_sink select emp_no,cast(birth_date as string) as birth_date ,first_name,last_name,gender,cast(hire_date as string) as hire_date   from cdc_test_source ");
  }
}


Flink DataStream approach

Java version

FlinkCDCTest.java

package com.kanaikee.bigdata.flink;

import com.alibaba.fastjson.JSON;
import com.alibaba.fastjson.JSONObject;
import com.kanaikee.bigdata.config.DataBaseConfig;
import com.ververica.cdc.connectors.mysql.source.MySqlSource;
import com.ververica.cdc.connectors.mysql.table.StartupOptions;
import com.ververica.cdc.debezium.JsonDebeziumDeserializationSchema;
import org.apache.doris.flink.cfg.DorisExecutionOptions;
import org.apache.doris.flink.cfg.DorisOptions;
import org.apache.doris.flink.cfg.DorisReadOptions;
import org.apache.doris.flink.sink.DorisSink;
import org.apache.doris.flink.sink.writer.SimpleStringSerializer;
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.util.Collector;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import java.util.Properties;
import java.util.UUID;

import static com.kanaikee.bigdata.config.DataBaseConfig.*;

public class FlinkCDCTest {

    private static final Logger log = LoggerFactory.getLogger(FlinkCDCTest.class);

    public static void main(String[] args){
        MySqlSource<String> mySqlSource = buildMysqlSource();

        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setParallelism(1);
        // enable checkpoint
        env.enableCheckpointing(10000);

        DataStreamSource<String> cdcSource = env.fromSource(mySqlSource, WatermarkStrategy.noWatermarks(), "MySQL CDC Source");
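
        // The source built in buildMysqlSource() uses JsonDebeziumDeserializationSchema, so each
        // record is a Debezium-style JSON change event, roughly:
        //   {"op":"c|u|d|r","source":{"db":"emp_1","table":"employees_1",...},
        //    "before":{...},"after":{...},"ts_ms":...}
        // For delete events ("op":"d") the "after" image is null, so enable the op filter below
        // (or add a null check) if deletes can occur on the source tables.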


        cdcSource.flatMap(new FlatMapFunction<String, String>() {
            @Override
            public void flatMap(String value, Collector<String> out) throws Exception {
                JSONObject rowJson = JSON.parseObject(value);
                String op = rowJson.getString("op");
                JSONObject source = rowJson.getJSONObject("source");
                String table = source.getString("table");

                // sync insert and update
                if (
//                        ("c".equals(op) || "u".equals(op)) &&
                                ("employees_1".equals(table) || "employees_2".equals(table))
                ) {
                    out.collect(rowJson.getJSONObject("after").toJSONString());
                }
            }
        })
                .sinkTo(buildDorisSink("all_employees_info"));

        try {
            env.execute("Full Database Sync ");
        } catch (Exception e) {
            log.error("Error: " + e.getMessage());
        }
    }


    /**
     * build Doris-Sink
     *
     * @param table tableName
     * @return DorisSink
     */
    public static DorisSink<String> buildDorisSink(String table) {
        DorisSink.Builder<String> builder = DorisSink.builder();

        DorisOptions.Builder dorisBuilder = DorisOptions.builder();
        dorisBuilder.setFenodes(DORIS_IP + ":" + DORIS_PORT)
                .setTableIdentifier("test_db" + "." + table)
                .setUsername(DORIS_USER)
                .setPassword(DORIS_PASS);

        String uuid = UUID.randomUUID().toString();

        Properties pro = new Properties();
        // json data format
        pro.setProperty("format", "json");
        pro.setProperty("read_json_by_line", "true");

        DorisExecutionOptions executionOptions = DorisExecutionOptions.builder()
                // do not sync delete operations to Doris
                .setDeletable(false)
                // stream-load label prefix
                .setLabelPrefix("label-" + uuid + table)
                .setStreamLoadProp(pro)
                .build();

        builder.setDorisReadOptions(DorisReadOptions.builder().build())
                .setDorisExecutionOptions(executionOptions)
                // serialize according to string
                .setSerializer(new SimpleStringSerializer())
                .setDorisOptions(dorisBuilder.build());

        return builder.build();
    }


    /**
     * build MySql-Source
     *
     * @return MySqlSource
     */
    public static MySqlSource<String> buildMysqlSource() {
        return MySqlSource.<String>builder()
                .hostname("192.168.10.10")
                .port(3306)
                .databaseList("emp_1", "emp_2")
                // fully-qualified <databaseName>.<tableName> format
                .tableList("emp_1" + "." + "employees_1",
                        "emp_1" + "." + "employees_2",
                        "emp_2" + "." + "employees_1",
                        "emp_2" + "." + "employees_2")
                .username("******")
                .password("******")
                .startupOptions(StartupOptions.earliest())
                // deserialize change records as JSON strings
                .deserializer(new JsonDebeziumDeserializationSchema())
                .build();
    }
}
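
The DataStream examples (this Java version and the Scala version below) import a DataBaseConfig class that is not shown in the article. A minimal sketch of what it might look like; only the constant names are taken from the code, the values are placeholders:

package com.kanaikee.bigdata.config;

// Hypothetical reconstruction of the configuration class referenced above.
public class DataBaseConfig {
    public static final String DORIS_IP = "172.8.10.xxx";
    public static final String DORIS_PORT = "8030";
    public static final String DORIS_USER = "xxx";
    public static final String DORIS_PASS = "xxx";
}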

Scala version

ScalaFlinkCDC.scala

package com.kanaikee.bigdata.flink.scala

import com.alibaba.fastjson.{JSON, JSONObject}
import com.kanaikee.bigdata.config.DataBaseConfig
import com.ververica.cdc.connectors.mysql.source.MySqlSource
import com.ververica.cdc.connectors.mysql.table.StartupOptions
import com.ververica.cdc.debezium.JsonDebeziumDeserializationSchema
import org.apache.doris.flink.cfg.{DorisExecutionOptions, DorisOptions, DorisReadOptions}
import org.apache.doris.flink.sink.DorisSink
import org.apache.doris.flink.sink.writer.SimpleStringSerializer
import org.apache.flink.api.common.eventtime.WatermarkStrategy
import org.apache.flink.api.common.functions.FlatMapFunction
import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment
import org.apache.flink.util.Collector

import java.util.{Properties, UUID}

object ScalaFlinkCDC {

  def main(args: Array[String]): Unit = {
    val mySqlSource: MySqlSource[String] = buildMysqlSource
    val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
    env.setParallelism(1)
    // enable checkpoint
    env.enableCheckpointing(10000)

    import org.apache.flink.api.scala._

    val cdcSource = env.fromSource(mySqlSource, WatermarkStrategy.noWatermarks[String](), "MySQL CDC Source")

    cdcSource.flatMap(new FlatMapFunction[String, String]() {
      override def flatMap(value: String, out: Collector[String]): Unit = {
        val rowJson: JSONObject = JSON.parseObject(value)
        val op: String = rowJson.getString("op")

        val source: JSONObject = rowJson.getJSONObject("source")
        val table: String = source.getString("table")

        // sync insert and update
        if ( // ("c".equals(op) || "u".equals(op)) &&
          ("employees_1" == table || "employees_2" == table)) {
          out.collect(rowJson.getJSONObject("after").toJSONString)
        }

      }
    }).sinkTo(buildDorisSink("all_employees_info"))

    env.execute("Full Database Sync ")

  }


  /**
   * build Doris-Sink
   *
   * @param table tableName
   * @return DorisSink
   */
  def buildDorisSink(table: String): DorisSink[String] = {
    val builder: DorisSink.Builder[String] = DorisSink.builder[String]
    val dorisBuilder: DorisOptions.Builder = DorisOptions.builder

    dorisBuilder.setFenodes(DataBaseConfig.DORIS_IP + ":" + DataBaseConfig.DORIS_PORT)
      .setTableIdentifier("test_db" + "." + table)
      .setUsername(DataBaseConfig.DORIS_USER)
      .setPassword(DataBaseConfig.DORIS_PASS)

    val uuid: String = UUID.randomUUID.toString
    val pro: Properties = new Properties
    // json data format
    pro.setProperty("format", "json")
    pro.setProperty("read_json_by_line", "true")

    val executionOptions: DorisExecutionOptions = DorisExecutionOptions.builder
      .setDeletable(false)
      .setLabelPrefix("label-" + uuid + table)
      .setStreamLoadProp(pro)
      .build

    builder.setDorisReadOptions(DorisReadOptions.builder.build)
      .setDorisExecutionOptions(executionOptions)
      .setSerializer(new SimpleStringSerializer)
      .setDorisOptions(dorisBuilder.build)

    builder.build
  }


  /**
   * build MySql-Source
   *
   * @return MySqlSource
   */
  def buildMysqlSource: MySqlSource[String] = {
    MySqlSource.builder[String]
      .hostname("192.168.10.10")
      .port(3306)
      .databaseList("emp_1", "emp_2")
      .tableList(
        "emp_1" + "." + "employees_1",
        "emp_1" + "." + "employees_2",
        "emp_2" + "." + "employees_1",
        "emp_2" + "." + "employees_2"
      )
      .username("******")
      .password("******")
      .startupOptions(StartupOptions.earliest)
      .deserializer(new JsonDebeziumDeserializationSchema)
      .build
  }
}


Summary

Comparing the three approaches above: for relatively simple jobs that do no transformation on the data, or that involve only a few tables, the Flink SQL / CLI approach is the most convenient; when the data needs some processing, the Flink DataStream approach is handier, and multi-table scenarios can also be handled with Flink side outputs, as sketched below.
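
The side-output idea deserves a short illustration. Below is a minimal sketch that reuses cdcSource and buildDorisSink from the FlinkCDCTest example above; the routing rule and the two target table names are made up for illustration.

// Additional imports assumed inside FlinkCDCTest:
//   org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator
//   org.apache.flink.streaming.api.functions.ProcessFunction
//   org.apache.flink.util.OutputTag

// Route employees_2 rows to a side output and everything else to the main output.
OutputTag<String> emp2Tag = new OutputTag<String>("employees_2-side") {};

SingleOutputStreamOperator<String> mainStream = cdcSource.process(
        new ProcessFunction<String, String>() {
            @Override
            public void processElement(String value, Context ctx, Collector<String> out) {
                JSONObject row = JSON.parseObject(value);
                String table = row.getJSONObject("source").getString("table");
                JSONObject after = row.getJSONObject("after");
                if (after == null) {
                    return; // delete events carry no "after" image
                }
                if ("employees_2".equals(table)) {
                    ctx.output(emp2Tag, after.toJSONString());
                } else {
                    out.collect(after.toJSONString());
                }
            }
        });

// Hypothetical target tables; buildDorisSink creates a fresh label prefix per call,
// so each route loads into its own Doris table independently.
mainStream.sinkTo(buildDorisSink("employees_1_info"));
mainStream.getSideOutput(emp2Tag).sinkTo(buildDorisSink("employees_2_info"));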
