The previous chapter used Flink SQL for data synchronization, but that approach has a problem: if a single database holds 100 or more tables, we would have to write 100-plus CREATE TABLE statements just to connect them, and the number of MySQL connections would spike accordingly. So for synchronizing many tables, an entire database, or several databases at once, the DataStream API is the way to go.
Below, we implement this streaming data synchronization with the Flink DataStream API in Java.
First create a project named flinkcdc-mysql-doris. The pom file needs the following dependencies:
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>
    <groupId>org.example</groupId>
    <artifactId>ztorn-cdc</artifactId>
    <version>1.0</version>
    <packaging>jar</packaging>
    <name>ztorn-cdc</name>
    <url>http://maven.apache.org</url>

    <properties>
        <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
        <commons.version>1.3.1</commons.version>
        <flink.guava.version>16.0</flink.guava.version>
        <flink.version>1.16.0</flink.version>
        <flinkcdc.version>2.3.0</flinkcdc.version>
        <mockito.version>3.12.4</mockito.version>
    </properties>

    <dependencies>
        <dependency>
            <groupId>com.ztorn</groupId>
            <artifactId>ztorn-common</artifactId>
            <version>1.0-SNAPSHOT</version>
        </dependency>
        <dependency>
            <groupId>com.ztorn.framework</groupId>
            <artifactId>ztorn-framework</artifactId>
            <version>1.0-SNAPSHOT</version>
        </dependency>
        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-python</artifactId>
            <version>${flink.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-table-planner_2.12</artifactId>
            <version>${flink.version}</version>
            <exclusions>
                <exclusion>
                    <groupId>org.slf4j</groupId>
                    <artifactId>slf4j-api</artifactId>
                </exclusion>
            </exclusions>
        </dependency>
        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-table-runtime</artifactId>
            <version>${flink.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-connector-jdbc</artifactId>
            <version>${flink.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-table-api-java-bridge</artifactId>
            <version>${flink.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-table-api-scala-bridge_2.12</artifactId>
            <version>${flink.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-core</artifactId>
            <version>${flink.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-table-common</artifactId>
            <version>${flink.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-table-api-java</artifactId>
            <version>${flink.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-clients</artifactId>
            <version>${flink.version}</version>
            <exclusions>
                <exclusion>
                    <groupId>org.slf4j</groupId>
                    <artifactId>slf4j-api</artifactId>
                </exclusion>
            </exclusions>
        </dependency>
        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-yarn</artifactId>
            <version>${flink.version}</version>
            <exclusions>
                <exclusion>
                    <groupId>org.apache.hadoop</groupId>
                    <artifactId>hadoop-yarn-common</artifactId>
                </exclusion>
                <exclusion>
                    <groupId>org.apache.hadoop</groupId>
                    <artifactId>hadoop-common</artifactId>
                </exclusion>
                <exclusion>
                    <groupId>org.apache.hadoop</groupId>
                    <artifactId>hadoop-hdfs</artifactId>
                </exclusion>
                <exclusion>
                    <groupId>org.apache.hadoop</groupId>
                    <artifactId>hadoop-yarn-client</artifactId>
                </exclusion>
                <exclusion>
                    <groupId>org.apache.hadoop</groupId>
                    <artifactId>hadoop-mapreduce-client-core</artifactId>
                </exclusion>
            </exclusions>
        </dependency>
        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-kubernetes</artifactId>
            <version>${flink.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-connector-kafka</artifactId>
            <version>${flink.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-shaded-guava</artifactId>
            <version>30.1.1-jre-${flink.guava.version}</version>
        </dependency>
        <dependency>
            <groupId>com.ververica</groupId>
            <artifactId>flink-sql-connector-mysql-cdc</artifactId>
            <version>${flinkcdc.version}</version>
        </dependency>
        <dependency>
            <groupId>com.ververica</groupId>
            <artifactId>flink-sql-connector-oracle-cdc</artifactId>
            <version>${flinkcdc.version}</version>
        </dependency>
        <dependency>
            <groupId>org.slf4j</groupId>
            <artifactId>slf4j-api</artifactId>
            <version>1.7.36</version>
        </dependency>
        <dependency>
            <groupId>org.slf4j</groupId>
            <artifactId>slf4j-simple</artifactId>
            <version>1.7.36</version>
        </dependency>
        <dependency>
            <groupId>commons-cli</groupId>
            <artifactId>commons-cli</artifactId>
            <version>${commons.version}</version>
        </dependency>
        <!--
        <dependency>
            <groupId>org.apache.doris</groupId>
            <artifactId>flink-doris-connector-1.16</artifactId>
            <version>1.4.0</version>
        </dependency>
        -->
        <dependency>
            <groupId>org.apache.doris</groupId>
            <artifactId>flink-doris-connector-1.15</artifactId>
            <version>1.4.0</version>
        </dependency>
        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-connector-pulsar</artifactId>
            <version>1.16.2</version>
        </dependency>
        <dependency>
            <groupId>io.streamnative.connectors</groupId>
            <artifactId>pulsar-flink-connector-2.12-1.12</artifactId>
            <version>2.7.6</version>
        </dependency>
        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-runtime-web</artifactId>
            <version>${flink.version}</version>
        </dependency>
        <dependency>
            <groupId>org.projectlombok</groupId>
            <artifactId>lombok</artifactId>
            <version>1.18.24</version>
            <scope>provided</scope>
        </dependency>
        <dependency>
            <groupId>com.alibaba</groupId>
            <artifactId>fastjson</artifactId>
            <version>2.0.32</version>
        </dependency>
        <dependency>
            <groupId>junit</groupId>
            <artifactId>junit</artifactId>
            <version>4.13.1</version>
            <scope>test</scope>
        </dependency>
        <dependency>
            <groupId>mysql</groupId>
            <artifactId>mysql-connector-java</artifactId>
            <version>5.1.49</version>
        </dependency>
        <dependency>
            <groupId>com.fasterxml.jackson.core</groupId>
            <artifactId>jackson-databind</artifactId>
            <version>2.14.1</version>
        </dependency>
        <dependency>
            <groupId>org.mockito</groupId>
            <artifactId>mockito-inline</artifactId>
            <version>${mockito.version}</version>
            <scope>test</scope>
        </dependency>
        <dependency>
            <groupId>io.debezium</groupId>
            <artifactId>debezium-core</artifactId>
            <version>1.6.4.Final</version>
        </dependency>
    </dependencies>

    <build>
        <resources>
            <resource>
                <directory>src/main/resources</directory>
            </resource>
            <resource>
                <directory>src/main/java</directory>
                <includes>
                    <include>**/*.xml</include>
                </includes>
            </resource>
        </resources>
        <plugins>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-compiler-plugin</artifactId>
                <configuration>
                    <source>8</source>
                    <target>8</target>
                </configuration>
            </plugin>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-assembly-plugin</artifactId>
                <version>3.4.2</version>
                <configuration>
                    <!-- bundle all project dependencies into one jar -->
                    <descriptorRefs>
                        <descriptorRef>jar-with-dependencies</descriptorRef>
                    </descriptorRefs>
                </configuration>
                <executions>
                    <execution>
                        <id>make-assembly</id>
                        <!-- bind to the package phase -->
                        <phase>package</phase>
                        <goals>
                            <goal>single</goal>
                        </goals>
                    </execution>
                </executions>
            </plugin>
        </plugins>
    </build>
</project>
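With the maven-assembly-plugin bound to the package phase as configured above, a plain mvn clean package builds the job as a fat jar (target/ztorn-cdc-1.0-jar-with-dependencies.jar) that can be submitted to a Flink cluster as-is.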
This project does not use Spring, Spring Boot, or Spring Cloud; the logic is implemented directly in the main method.
Step 1: create the Flink DataStream execution environment.
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
// Set the parallelism; here the config sets it to 1
if (Asserts.isNotNull(config.getParallelism())) {
    env.setParallelism(config.getParallelism());
    log.info("Set parallelism: " + config.getParallelism());
}
// Enable checkpointing (interval in ms) and set the checkpoint storage location
if (Asserts.isNotNull(config.getCheckpoint())) {
    env.enableCheckpointing(config.getCheckpoint());
    CheckpointConfig checkpointConfig = env.getCheckpointConfig();
    checkpointConfig.setCheckpointingMode(CheckpointingMode.EXACTLY_ONCE);
    // Retain externalized checkpoints on cancellation so the job can be resumed later
    checkpointConfig.enableExternalizedCheckpoints(CheckpointConfig.ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION);
    checkpointConfig.enableUnalignedCheckpoints();
    checkpointConfig.setCheckpointStorage("file:///D:/F/checkpoint/");
    log.info("Set checkpoint: " + config.getCheckpoint());
}
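This setup code runs inside the job's main method. To make the flow concrete, here is a minimal, self-contained sketch of what such a main method can look like with the Flink CDC 2.3 MySqlSource: a single source covers every table in a database, which is exactly the whole-database capture that motivated the DataStream approach. The hostname, credentials, and table patterns below are placeholders, and the stream is simply printed; the actual project wires it into Doris instead.

import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

import com.ververica.cdc.connectors.mysql.source.MySqlSource;
import com.ververica.cdc.connectors.mysql.table.StartupOptions;
import com.ververica.cdc.debezium.JsonDebeziumDeserializationSchema;

public class MysqlCdcSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setParallelism(1);
        env.enableCheckpointing(10_000);

        // One source handles all tables: "app_db.*" is a regex over fully-qualified
        // table names, so a single binlog connection serves hundreds of tables.
        MySqlSource<String> source = MySqlSource.<String>builder()
                .hostname("127.0.0.1")              // placeholder
                .port(3306)
                .username("root")                   // placeholder
                .password("123456")                 // placeholder
                .databaseList("app_db")             // placeholder database
                .tableList("app_db.*")              // capture every table in the database
                .startupOptions(StartupOptions.initial()) // full snapshot, then incremental binlog
                .scanNewlyAddedTableEnabled(true)   // pick up tables created after the job starts
                .includeSchemaChanges(true)         // emit DDL events for schema synchronization
                .deserializer(new JsonDebeziumDeserializationSchema())
                .build();

        env.fromSource(source, WatermarkStrategy.noWatermarks(), "MySQL CDC Source")
                .print(); // replace with the Doris sink in the real job
        env.execute("flinkcdc-mysql-doris");
    }
}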
Next, create a FlinkCDCConfig class to hold the various configuration parameters (Lombok's @Getter/@Setter generates the accessors such as getParallelism() and getCheckpoint() used above):
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

import lombok.Getter;
import lombok.Setter;

@Getter
@Setter
public class FlinkCDCConfig {

    public static final String SINK_DB = "sink.db";
    public static final String AUTO_CREATE = "auto.create";
    public static final String TABLE_PREFIX = "table.prefix";
    public static final String TABLE_SUFFIX = "table.suffix";
    public static final String TABLE_UPPER = "table.upper";
    public static final String TABLE_LOWER = "table.lower";
    public static final String COLUMN_REPLACE_LINE_BREAK = "column.replace.line-break";
    public static final String TIMEZONE = "timezone";

    private String type;
    private String hostname;
    private Integer port;
    private String username;
    private String password;
    private Integer checkpoint;
    private Integer parallelism;
    private String database;
    private String schema;
    private String table;
    private List<String> schemaTableNameList;
    private String startupMode;
    private Map<String, String> split;
    private Map<String, String> debezium;
    private Map<String, String> source;
    private Map<String, String> jdbc;
    private Map<String, String> sink;
    private List<Schema> schemaList; // Schema is a project-defined model describing a captured table
    private List<String> excludeTableList;
    private String schemaFieldName;
    private boolean scanNewlyAddedTableEnabled = false;
    private boolean includeSchemaChanges = false;

    public FlinkCDCConfig(
            String type,
            String hostname,
            Integer port,
            String username,
            String password,
            Integer checkpoint,
            Integer parallelism,
            String database,
            String schema,
            String table,
            String startupMode,
            Map<String, String> split,
            Map<String, String> debezium,
            Map<String, String> source,
            Map<String, String> sink,
            Map<String, String> jdbc,
            boolean scanNewlyAddedTableEnabled,
            boolean includeSchemaChanges) {
        init(
                type,
                hostname,
                port,
                username,
                password,
                checkpoint,
                parallelism,
                database,
                schema,
                table,
                startupMode,
                split,
                debezium,
                source,
                sink,
                jdbc,
                scanNewlyAddedTableEnabled,
                includeSchemaChanges);
    }

    public void init(
            String type,
            String hostname,
            Integer port,
            String username,
            String password,
            Integer checkpoint,
            Integer parallelism,
            String database,
            String schema,
            String table,
            String startupMode,
            Map<String, String> split,
            Map<String, String> debezium,
            Map<String, String> source,
            Map<String, String> sink,
            Map<String, String> jdbc,
            boolean scanNewlyAddedTableEnabled,
            boolean includeSchemaChanges) {
        this.type = type;
        this.hostname = hostname;
        this.port = port;
        this.username = username;
        this.password = password;
        this.checkpoint = checkpoint;
        this.parallelism = parallelism;
        this.database = database;
        this.schema = schema;
        this.table = table;
        this.startupMode = startupMode;
        this.split = split;
        this.debezium = debezium;
        this.source = source;
        this.sink = sink;
        this.jdbc = jdbc;
        this.scanNewlyAddedTableEnabled = scanNewlyAddedTableEnabled;
        this.includeSchemaChanges = includeSchemaChanges;
    }

    // Keys that the job consumes itself (sink database, table naming rules,
    // auto-create flag, timezone) must not be passed through to the sink connector.
    private boolean isSkip(String key) {
        switch (key) {
            case SINK_DB:
            case AUTO_CREATE:
            case TABLE_PREFIX:
            case TABLE_SUFFIX:
            case TABLE_UPPER:
            case TABLE_LOWER:
            case COLUMN_REPLACE_LINE_BREAK:
            case TIMEZONE:
                return true;
            default:
                return false;
        }
    }

    // Render the remaining sink options as "'key' = 'value'" pairs,
    // ready to be spliced into a connector configuration.
    public String getSinkConfigurationString() {
        return sink.entrySet().stream()
                .filter(t -> !isSkip(t.getKey()))
                .map(t -> String.format("'%s' = '%s'", t.getKey(), t.getValue()))
                .collect(Collectors.joining(",\n"));
    }
}
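To illustrate what getSinkConfigurationString produces, here is a small hypothetical example; the option names and values are made up for the demonstration:

import java.util.HashMap;
import java.util.Map;

public class SinkConfigDemo {
    public static void main(String[] args) {
        Map<String, String> sink = new HashMap<>();
        sink.put("connector", "doris");
        sink.put("fenodes", "127.0.0.1:8030"); // hypothetical Doris FE address
        sink.put("sink.db", "ods");            // internal key, consumed by the job itself
        sink.put("table.prefix", "ods_");      // internal key, filtered out as well

        FlinkCDCConfig config = new FlinkCDCConfig(
                "mysql", "127.0.0.1", 3306, "root", "123456",
                10000, 1, "app_db", null, "app_db.*", "initial",
                null, null, null, sink, null, true, true);

        // Internal keys are skipped; only connector options survive.
        // Output (entry order depends on the HashMap):
        // 'connector' = 'doris',
        // 'fenodes' = '127.0.0.1:8030'
        System.out.println(config.getSinkConfigurationString());
    }
}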
That completes the preparation of the Flink environment. The project has been published on Gitee as flinkcdc-mysql-doris: it uses Flink CDC to read MySQL's binlog and synchronize the data into Doris, giving real-time replication of business data. The project can automatically create tables in Doris from the MySQL table structures and automatically propagate schema changes, and it supports both full and incremental data pulls.