Flink CDC: Two Approaches in Practice
This article uses Flink CDC 2.3, the latest release as of 2023-02-03.
References:
- The Flink CDC official documentation
- 王知无-import_bigdata: the article "Flink CDC 2.0: Principles Explained and Production Practice"
- Z-hhhhh: the article "Flink CDC MySQL2Doris case study: solving sharded multi-table sync"
For the latest information, see the Flink CDC website. This article focuses on programmatic practice with Flink on YARN; for the flink-sql-cli approach, please refer to the official documentation.
Environment
- MySQL 5.6 / 5.7 / 8.0.x
- Doris 1.1
- Flink 1.14.4
- Flink CDC 2.3.0
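One prerequisite the environment list leaves implicit (an assumption about a typical setup): the log-based capture used throughout this article requires the source MySQL server to have binary logging enabled in ROW format. A quick way to verify, runnable in any MySQL client:

```sql
-- Verify binlog prerequisites on the source MySQL server (sketch)
SHOW VARIABLES LIKE 'log_bin';          -- expect: ON
SHOW VARIABLES LIKE 'binlog_format';    -- expect: ROW
SHOW VARIABLES LIKE 'binlog_row_image'; -- expect: FULL
```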
Flink CDC Concepts
CDC stands for Change Data Capture. In the broadest sense, any technique that can capture data changes qualifies as CDC. As commonly used, and as used here, the term refers to capturing changes made to data in a database.
Application Scenarios
Typical CDC scenarios include data synchronization (for backup or disaster recovery), data distribution to multiple downstream systems, and data integration into data warehouses or lakes as the collection step of ETL.
CDC Techniques
Mainstream implementations in the industry today fall into two categories:
- Query-based CDC, such as DataX. As real-time requirements keep rising, this category's shortcomings have become evident: the offline-scheduling, batch-processing model means high latency; because snapshots are sliced by an offline scheduler, data consistency cannot be guaranteed; and real-time delivery is likewise out of reach. (A polling sketch after this list shows the pattern.)
- Log-based CDC, such as Debezium, Canal, and Flink CDC. These tools consume the database's change log in real time; the streaming model preserves data consistency and delivers data with low latency, meeting increasingly real-time business needs.
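To make the contrast concrete, here is a minimal sketch of query-based CDC, not taken from any of the referenced tools: a loop that polls rows by an update timestamp. The connection details, table, and `update_time` column are hypothetical. Note the built-in weaknesses: latency is bounded below by the polling interval, a row updated twice between polls loses its intermediate state, and deletes never show up at all.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.Timestamp;

public class PollingCdcSketch {
    public static void main(String[] args) throws Exception {
        // watermark: the newest update_time seen so far
        Timestamp watermark = Timestamp.valueOf("1970-01-01 00:00:00");
        try (Connection conn = DriverManager.getConnection(
                "jdbc:mysql://localhost:3306/emp_1", "user", "pass")) {
            while (true) {
                try (PreparedStatement ps = conn.prepareStatement(
                        "SELECT emp_no, update_time FROM employees WHERE update_time > ?")) {
                    ps.setTimestamp(1, watermark);
                    try (ResultSet rs = ps.executeQuery()) {
                        while (rs.next()) {
                            // each polled row is a "captured change"; deletes are invisible
                            System.out.println("changed row: " + rs.getInt("emp_no"));
                            watermark = rs.getTimestamp("update_time");
                        }
                    }
                }
                Thread.sleep(60_000); // the polling interval is the latency floor
            }
        }
    }
}
```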
Common Open-Source CDC Solutions
Of the solutions already mentioned, DataX is query-based, while Debezium, Canal, and Flink CDC are log-based; Flink CDC is distinctive in coupling log-based capture directly to Flink's streaming runtime, so captured changes can be processed and delivered with Flink's exactly-once guarantees.
Flink CDC Practice
Broadly speaking, Flink CDC can be used in two ways:
- the Flink SQL approach
- the Flink DataStream approach
A Flink SQL job, in turn, can be submitted in two ways:
- submitting SQL directly through the Flink client
- submitting a packaged program
This walkthrough synchronizes data from MySQL into Doris.
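The examples below write into a Doris table test_db.all_employees_info that the original article does not define. For reference, a possible DDL sketch, assuming a Unique Key model on emp_no; the column types, bucket count, and replication setting are assumptions to adapt to your cluster:

```sql
CREATE TABLE test_db.all_employees_info (
    `emp_no`     INT,
    `birth_date` DATE,
    `first_name` VARCHAR(50),
    `last_name`  VARCHAR(50),
    `gender`     VARCHAR(10),
    `hire_date`  DATE
)
UNIQUE KEY(`emp_no`)
DISTRIBUTED BY HASH(`emp_no`) BUCKETS 1
PROPERTIES ("replication_num" = "1");
```

Note that with a Unique Key on emp_no alone, identical keys arriving from different shard tables overwrite one another; if shards can collide, widen the key (for example with the database/table metadata columns shown later).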
pom.xml
```xml
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
<modelVersion>4.0.0</modelVersion>
<groupId>org.example</groupId>
<artifactId>FlinkCDC</artifactId>
<version>1.0-SNAPSHOT</version>
<properties>
<maven.compiler.source>8</maven.compiler.source>
<maven.compiler.target>8</maven.compiler.target>
<project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
<scala.version>2.12</scala.version>
<java.version>1.8</java.version>
<flink.version>1.14.4</flink.version>
<fastjson.version>1.2.62</fastjson.version>
<hadoop.version>2.8.3</hadoop.version>
<scope.mode>compile</scope.mode>
<slf4j.version>1.7.30</slf4j.version>
</properties>
<dependencies>
<dependency>
<groupId>org.apache.flink</groupId>
<artifactId>flink-table-api-scala-bridge_${scala.version}</artifactId>
<version>${flink.version}</version>
</dependency>
<dependency>
<groupId>org.apache.flink</groupId>
<artifactId>flink-table-planner_${scala.version}</artifactId>
<version>${flink.version}</version>
</dependency>
<dependency>
<groupId>org.apache.flink</groupId>
<artifactId>flink-streaming-scala_${scala.version}</artifactId>
<version>${flink.version}</version>
</dependency>
<dependency>
<groupId>org.apache.flink</groupId>
<artifactId>flink-clients_${scala.version}</artifactId>
<version>${flink.version}</version>
</dependency>
<dependency>
<groupId>org.apache.flink</groupId>
<artifactId>flink-connector-jdbc_${scala.version}</artifactId>
<version>${flink.version}</version>
</dependency>
<dependency>
<groupId>org.apache.flink</groupId>
<artifactId>flink-connector-kafka_${scala.version}</artifactId>
<version>${flink.version}</version>
</dependency>
<dependency>
<groupId>com.alibaba</groupId>
<artifactId>fastjson</artifactId>
<version>${fastjson.version}</version>
</dependency>
<!-- Add log dependencies when debugging locally -->
<dependency>
<groupId>org.slf4j</groupId>
<artifactId>slf4j-api</artifactId>
<version>${slf4j.version}</version>
</dependency>
<dependency>
<groupId>org.slf4j</groupId>
<artifactId>slf4j-log4j12</artifactId>
<version>${slf4j.version}</version>
</dependency>
<!-- flink-doris-connector -->
<dependency>
<groupId>org.apache.doris</groupId>
<artifactId>flink-doris-connector-1.14_2.12</artifactId>
<version>1.1.0</version>
</dependency>
<dependency>
<groupId>mysql</groupId>
<artifactId>mysql-connector-java</artifactId>
<version>8.0.27</version>
</dependency>
<dependency>
<groupId>org.apache.flink</groupId>
<artifactId>flink-runtime-web_${scala.version}</artifactId>
<version>${flink.version}</version>
</dependency>
<dependency>
<groupId>com.ververica</groupId>
<artifactId>flink-connector-mysql-cdc</artifactId>
<version>2.3.0</version>
</dependency>
</dependencies>
<build>
<plugins>
<plugin>
<groupId>net.alchim31.maven</groupId>
<artifactId>scala-maven-plugin</artifactId>
<version>3.2.1</version>
<executions>
<execution>
<id>scala-compile-first</id>
<phase>process-resources</phase>
<goals>
<goal>compile</goal>
</goals>
</execution>
<execution>
<id>scala-test-compile</id>
<phase>process-test-resources</phase>
<goals>
<goal>testCompile</goal>
</goals>
</execution>
</executions>
<configuration>
<args>
<arg>-feature</arg>
</args>
</configuration>
</plugin>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-compiler-plugin</artifactId>
<version>3.8.1</version>
<configuration>
<source>8</source>
<target>8</target>
</configuration>
</plugin>
</plugins>
</build>
</project>
```
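One packaging note: since the article targets jobs submitted to a YARN cluster, the dependencies above generally need to be bundled into a fat jar, which this pom does not yet do. A minimal sketch of a maven-shade-plugin entry to add inside the existing <plugins> section (the plugin version is an assumption):

```xml
<plugin>
    <groupId>org.apache.maven.plugins</groupId>
    <artifactId>maven-shade-plugin</artifactId>
    <version>3.2.4</version>
    <executions>
        <execution>
            <phase>package</phase>
            <goals>
                <goal>shade</goal>
            </goals>
        </execution>
    </executions>
</plugin>
```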
Submitting SQL directly via the Flink client
For the concrete submission steps, see https://ververica.github.io/flink-cdc-connectors/master/content/%E5%BF%AB%E9%80%9F%E4%B8%8A%E6%89%8B/mysql-postgres-tutorial-zh.html
Flink SQL program
Java version
```java
package com.kanaikee.bigdata.flink;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.table.api.bridge.java.StreamTableEnvironment;
import java.util.UUID;
public class FlinkSQLCDC {
public static void main(String[] args) {
final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.enableCheckpointing(10000);
env.setParallelism(1);
final StreamTableEnvironment tEnv = StreamTableEnvironment.create(env);
// register a table in the catalog
tEnv.executeSql(
"CREATE TABLE cdc_test_source (\n" +
" emp_no INT,\n" +
" birth_date DATE,\n" +
" first_name STRING,\n" +
" last_name STRING,\n" +
" gender STRING,\n" +
" hire_date STRING,\n" +
" database_name STRING METADATA VIRTUAL,\n" +
" table_name STRING METADATA VIRTUAL,\n" +
" PRIMARY KEY (`emp_no`) NOT ENFORCED \n" +
") WITH (\n" +
" 'connector' = 'mysql-cdc',\n" +
" 'hostname' = '192.168.22.xxx',\n" +
" 'port' = '3306',\n" +
" 'username' = 'xxx',\n" +
" 'password' = 'xxx',\n" +
" 'database-name' = 'emp_[0-9]+',\n" +
" 'table-name' = 'employees_[0-9]+'\n" +
")");
String label = UUID.randomUUID().toString();
//doris table
tEnv.executeSql(
"CREATE TABLE doris_test_sink (" +
" emp_no INT,\n" +
" birth_date STRING,\n" +
" first_name STRING,\n" +
" last_name STRING,\n" +
" gender STRING,\n" +
" hire_date STRING\n" +
") " +
"WITH (\n" +
" 'connector' = 'doris',\n" +
" 'fenodes' = '172.8.10.xxx:8030',\n" +
" 'table.identifier' = 'test_db.all_employees_info',\n" +
" 'username' = 'xxx',\n" +
" 'password' = 'xxx',\n" +
/* Doris stream-load label prefix. With exactly-once semantics the label must be
   globally unique, and a restarted job must resume from the latest checkpoint;
   exactly-once can be disabled via 'sink.enable-2pc'. */
" 'sink.label-prefix' ='" + label + "',\n" +
" 'sink.properties.format' = 'json',\n" + //json data format
" 'sink.properties.read_json_by_line' = 'true'\n" +
")");
//insert into mysql table to doris table
tEnv.executeSql("INSERT INTO doris_test_sink select emp_no,cast(birth_date as string) as birth_date ,first_name,last_name,gender,cast(hire_date as string) as hire_date from cdc_test_source ");
}
}
```
Scala version
```scala
package com.kanaikee.bigdata.flink.scala
import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment
import org.apache.flink.table.api.bridge.scala.StreamTableEnvironment
import java.util.UUID
object ScalaFlinkSQLCDC {
def main(args: Array[String]): Unit = {
val env = StreamExecutionEnvironment.getExecutionEnvironment
env.enableCheckpointing(10000)
env.setParallelism(1)
val tEnv = StreamTableEnvironment.create(env)
// register a table in the catalog
tEnv.executeSql(
"CREATE TABLE cdc_test_source (\n" +
" emp_no INT,\n" +
" birth_date DATE,\n" +
" first_name STRING,\n" +
" last_name STRING,\n" +
" gender STRING,\n" +
" hire_date STRING,\n" +
" database_name STRING METADATA VIRTUAL,\n" +
" table_name STRING METADATA VIRTUAL,\n" +
" PRIMARY KEY (`emp_no`) NOT ENFORCED \n" +
") WITH (\n" +
" 'connector' = 'mysql-cdc',\n" +
" 'hostname' = '192.168.22.xxx',\n" +
" 'port' = '3306',\n" +
" 'username' = 'xxx',\n" +
" 'password' = 'xxx',\n" +
" 'database-name' = 'emp_[0-9]+',\n" +
" 'table-name' = 'employees_[0-9]+'\n" +
")");
val label = UUID.randomUUID().toString
//doris table
tEnv.executeSql(
"CREATE TABLE doris_test_sink (" +
" emp_no INT,\n" +
" birth_date STRING,\n" +
" first_name STRING,\n" +
" last_name STRING,\n" +
" gender STRING,\n" +
" hire_date STRING\n" +
") " +
"WITH (\n" +
" 'connector' = 'doris',\n" +
" 'fenodes' = '172.8.10.xxx:8030',\n" +
" 'table.identifier' = 'test_db.all_employees_info',\n" +
" 'username' = 'xxx',\n" +
" 'password' = 'xxx',\n" +
/* Doris stream-load label prefix. With exactly-once semantics the label must be
   globally unique, and a restarted job must resume from the latest checkpoint;
   exactly-once can be disabled via 'sink.enable-2pc'. */
" 'sink.label-prefix' ='" + label + "',\n" +
" 'sink.properties.format' = 'json',\n" + //json data format
" 'sink.properties.read_json_by_line' = 'true'\n" +
")");
//insert into mysql table to doris table
tEnv.executeSql("INSERT INTO doris_test_sink select emp_no,cast(birth_date as string) as birth_date ,first_name,last_name,gender,cast(hire_date as string) as hire_date from cdc_test_source ");
}
}
```
Flink DataStream approach
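Unlike the SQL approach, the DataStream approach consumes raw Debezium-style change events as JSON strings. For orientation, a record emitted by JsonDebeziumDeserializationSchema looks roughly like the sketch below (field values are illustrative; in particular, the exact encoding of date columns depends on the Debezium converter settings):

```json
{
  "before": null,
  "after": {
    "emp_no": 10001,
    "first_name": "Georgi",
    "last_name": "Facello",
    "gender": "M"
  },
  "source": {
    "db": "emp_1",
    "table": "employees_1"
  },
  "op": "c",
  "ts_ms": 1675400000000
}
```

The flatMap functions below key off exactly these fields: op for the operation type (c = create, u = update, d = delete, r = snapshot read), source.table for routing, and after for the new row image.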
Java version
FlinkCDCTest.java
```java
package com.kanaikee.bigdata.flink;
import com.alibaba.fastjson.JSON;
import com.alibaba.fastjson.JSONObject;
import com.kanaikee.bigdata.config.DataBaseConfig;
import com.ververica.cdc.connectors.mysql.source.MySqlSource;
import com.ververica.cdc.connectors.mysql.table.StartupOptions;
import com.ververica.cdc.debezium.JsonDebeziumDeserializationSchema;
import org.apache.doris.flink.cfg.DorisExecutionOptions;
import org.apache.doris.flink.cfg.DorisOptions;
import org.apache.doris.flink.cfg.DorisReadOptions;
import org.apache.doris.flink.sink.DorisSink;
import org.apache.doris.flink.sink.writer.SimpleStringSerializer;
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.util.Collector;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import java.util.Properties;
import java.util.UUID;
import static com.kanaikee.bigdata.config.DataBaseConfig.*;
public class FlinkCDCTest {
private static final Logger log = LoggerFactory.getLogger(FlinkCDCTest.class);
public static void main(String[] args){
MySqlSource<String> mySqlSource = buildMysqlSource();
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.setParallelism(1);
// enable checkpoint
env.enableCheckpointing(10000);
DataStreamSource<String> cdcSource = env.fromSource(mySqlSource, WatermarkStrategy.noWatermarks(), "MySQL CDC Source");
cdcSource.flatMap(new FlatMapFunction<String, String>() {
@Override
public void flatMap(String value, Collector<String> out) throws Exception {
    JSONObject rowJson = JSON.parseObject(value);
    String op = rowJson.getString("op");
    JSONObject source = rowJson.getJSONObject("source");
    String table = source.getString("table");
    JSONObject after = rowJson.getJSONObject("after");
    // sync inserts and updates; "after" is null for delete events ("op" = "d"),
    // so guard against it to avoid a NullPointerException
    if (after != null
            // && ("c".equals(op) || "u".equals(op))  // optionally filter by operation
            && ("employees_1".equals(table) || "employees_2".equals(table))) {
        out.collect(after.toJSONString());
    }
}
})
.sinkTo(buildDorisSink("all_employees_info"));
try {
env.execute("Full Database Sync ");
} catch (Exception e) {
log.error("Error: " + e.getMessage());
}
}
/**
* build Doris-Sink
*
* @param table tableName
* @return DorisSink
*/
public static DorisSink<String> buildDorisSink(String table) {
DorisSink.Builder<String> builder = DorisSink.builder();
DorisOptions.Builder dorisBuilder = DorisOptions.builder();
dorisBuilder.setFenodes(DORIS_IP + ":" + DORIS_PORT)
.setTableIdentifier("test_db" + "." + table)
.setUsername(DORIS_USER)
.setPassword(DORIS_PASS);
String uuid = UUID.randomUUID().toString();
Properties pro = new Properties();
// json data format
pro.setProperty("format", "json");
pro.setProperty("read_json_by_line", "true");
DorisExecutionOptions executionOptions = DorisExecutionOptions.builder()
// ignore delete events (do not apply deletes on the Doris side)
.setDeletable(false)
// stream-load label prefix,
.setLabelPrefix("label-" + uuid + table)
.setStreamLoadProp(pro)
.build();
builder.setDorisReadOptions(DorisReadOptions.builder().build())
.setDorisExecutionOptions(executionOptions)
// serialize according to string
.setSerializer(new SimpleStringSerializer())
.setDorisOptions(dorisBuilder.build());
return builder.build();
}
/**
* build MySql-Source
*
* @return MySqlSource
*/
public static MySqlSource<String> buildMysqlSource() {
return MySqlSource.<String>builder()
.hostname("192.168.10.10")
.port(3306)
.databaseList("emp_1", "emp_2")
// fully-qualified <databaseName>.<tableName> format
.tableList("emp_1" + "." + "employees_1",
"emp_1" + "." + "employees_2",
"emp_2" + "." + "employees_1",
"emp_2" + "." + "employees_2")
.username("******")
.password("******")
.startupOptions(StartupOptions.earliest())
// deserialize change events to JSON strings
.deserializer(new JsonDebeziumDeserializationSchema())
.build();
}
}
```
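The example above (and the Scala version below) imports a DataBaseConfig class that the article never shows. A minimal sketch of what it presumably contains; the constant values here are placeholders:

```java
package com.kanaikee.bigdata.config;

// Hypothetical reconstruction: holds the Doris connection constants that
// FlinkCDCTest and ScalaFlinkCDC import. Replace the values with real ones.
public class DataBaseConfig {
    public static final String DORIS_IP = "172.8.10.xxx";
    public static final String DORIS_PORT = "8030";
    public static final String DORIS_USER = "xxx";
    public static final String DORIS_PASS = "xxx";
}
```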
Scala version
ScalaFlinkCDC.scala
```scala
package com.kanaikee.bigdata.flink.scala
import com.alibaba.fastjson.{JSON, JSONObject}
import com.kanaikee.bigdata.config.DataBaseConfig
import com.ververica.cdc.connectors.mysql.source.MySqlSource
import com.ververica.cdc.connectors.mysql.table.StartupOptions
import com.ververica.cdc.debezium.JsonDebeziumDeserializationSchema
import org.apache.doris.flink.cfg.{DorisExecutionOptions, DorisOptions, DorisReadOptions}
import org.apache.doris.flink.sink.DorisSink
import org.apache.doris.flink.sink.writer.SimpleStringSerializer
import org.apache.flink.api.common.eventtime.WatermarkStrategy
import org.apache.flink.api.common.functions.FlatMapFunction
import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment
import org.apache.flink.util.Collector
import java.util.{Properties, UUID}
object ScalaFlinkCDC {
def main(args: Array[String]): Unit = {
val mySqlSource: MySqlSource[String] = buildMysqlSource
val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
env.setParallelism(1)
// enable checkpoint
env.enableCheckpointing(10000)
import org.apache.flink.api.scala._
val cdcSource = env.fromSource(mySqlSource, WatermarkStrategy.noWatermarks[String](), "MySQL CDC Source")
cdcSource.flatMap(new FlatMapFunction[String, String]() {
override def flatMap(value: String, out: Collector[String]): Unit = {
    val rowJson: JSONObject = JSON.parseObject(value)
    val op: String = rowJson.getString("op")
    val source: JSONObject = rowJson.getJSONObject("source")
    val table: String = source.getString("table")
    val after: JSONObject = rowJson.getJSONObject("after")
    // sync inserts and updates; "after" is null for delete events ("op" = "d"),
    // so guard against it to avoid a NullPointerException
    if (after != null
        // && (op == "c" || op == "u")  // optionally filter by operation
        && (table == "employees_1" || table == "employees_2")) {
      out.collect(after.toJSONString)
    }
}
}).sinkTo(buildDorisSink("all_employees_info"))
env.execute("Full Database Sync ")
}
/**
* build Doris-Sink
*
* @param table tableName
* @return DorisSink
*/
def buildDorisSink(table: String): DorisSink[String] = {
val builder: DorisSink.Builder[String] = DorisSink.builder[String]
val dorisBuilder: DorisOptions.Builder = DorisOptions.builder
dorisBuilder.setFenodes(DataBaseConfig.DORIS_IP + ":" + DataBaseConfig.DORIS_PORT)
.setTableIdentifier("test_db" + "." + table)
.setUsername(DataBaseConfig.DORIS_USER)
.setPassword(DataBaseConfig.DORIS_PASS)
val uuid: String = UUID.randomUUID.toString
val pro: Properties = new Properties
// json data format
pro.setProperty("format", "json")
pro.setProperty("read_json_by_line", "true")
val executionOptions: DorisExecutionOptions = DorisExecutionOptions.builder
.setDeletable(false)
.setLabelPrefix("label-" + uuid + table)
.setStreamLoadProp(pro)
.build
builder.setDorisReadOptions(DorisReadOptions.builder.build)
.setDorisExecutionOptions(executionOptions)
.setSerializer(new SimpleStringSerializer)
.setDorisOptions(dorisBuilder.build)
builder.build
}
/**
* build MySql-Source
*
* @return MySqlSource
*/
def buildMysqlSource: MySqlSource[String] = {
MySqlSource.builder[String]
.hostname("192.168.10.10")
.port(3306)
.databaseList("emp_1", "emp_2")
.tableList(
"emp_1" + "." + "employees_1",
"emp_1" + "." + "employees_2",
"emp_2" + "." + "employees_1",
"emp_2" + "." + "employees_2"
)
.username("******")
.password("******")
.startupOptions(StartupOptions.earliest)
.deserializer(new JsonDebeziumDeserializationSchema)
.build
}
}
```
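Once packaged, either job can be submitted to YARN with the standard Flink CLI, for example `flink run -m yarn-cluster -c com.kanaikee.bigdata.flink.FlinkCDCTest FlinkCDC-1.0-SNAPSHOT.jar` (the jar name follows from the pom above; adjust the main class and YARN options to your cluster).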
Summary
Comparing the three ways of working shown above: for relatively simple jobs that do no transformation of the data, or that involve only a few tables, the Flink SQL / CLI route is the most convenient. When the data needs processing along the way, the Flink DataStream approach is the more flexible choice, and multi-table scenarios can also be handled with Flink side outputs, as sketched below.
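To illustrate that last point, here is a sketch of table routing with side outputs. It is not from the original article: it slots into the body of FlinkCDCTest.main() above (reusing cdcSource and buildDorisSink), requires imports for org.apache.flink.streaming.api.functions.ProcessFunction, org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator, and org.apache.flink.util.OutputTag, and assumes each source table lands in a Doris table of the same name:

```java
// Tags for each logical table; anonymous subclasses capture the type parameter.
final OutputTag<String> emp1Tag = new OutputTag<String>("employees_1") {};
final OutputTag<String> emp2Tag = new OutputTag<String>("employees_2") {};

SingleOutputStreamOperator<String> routed = cdcSource
        .process(new ProcessFunction<String, String>() {
            @Override
            public void processElement(String value, Context ctx, Collector<String> out) {
                JSONObject rowJson = JSON.parseObject(value);
                String table = rowJson.getJSONObject("source").getString("table");
                JSONObject after = rowJson.getJSONObject("after");
                if (after == null) { // skip delete events, as in the flatMap above
                    return;
                }
                // emit to the side output matching the source table
                if ("employees_1".equals(table)) {
                    ctx.output(emp1Tag, after.toJSONString());
                } else if ("employees_2".equals(table)) {
                    ctx.output(emp2Tag, after.toJSONString());
                }
            }
        });

// each side output gets its own Doris sink (and its own stream-load label prefix)
routed.getSideOutput(emp1Tag).sinkTo(buildDorisSink("employees_1"));
routed.getSideOutput(emp2Tag).sinkTo(buildDorisSink("employees_2"));
```

Compared with fanning the whole stream through several filter operators, a single ProcessFunction reads each record once and routes it, which keeps the per-table sinks independent of one another.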