1. Introduction to CDC
CDC (Change Data Capture)
Core idea: monitor and capture changes to a database (inserts, updates, and deletes of data or tables) and record those changes in the order they occur.
Types: CDC tools are generally either query-based or log-based; Flink CDC is log-based, built on Debezium, and reads changes directly from the MySQL binlog.
Community open-source repository: https://github.com/ververica/flink-cdc-connectors
2. Hands-on
1) Enable the MySQL binlog and restart MySQL
sudo vim /etc/my.cnf
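In /etc/my.cnf, enable the binlog under the [mysqld] section. The values below are only a sketch matching the gmall database used later; server-id and the log file prefix are assumptions to adjust for your environment:

[mysqld]
# unique id for this server's binlog/replication
server-id = 1
# binlog file prefix; files are written as mysql-bin.000001, ...
log-bin = mysql-bin
# row-level binlog is what CDC needs
binlog_format = row
# only these databases are written to the binlog
binlog-do-db = gmall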
Check the default binlog storage location (per the configuration above, log files start with the mysql-bin prefix).
Switch to the root user: su -
cd /var/lib/mysql
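A quick check that the binlog files exist (the prefix assumes log-bin = mysql-bin as configured above; the files appear after the restart below):
ls -l mysql-bin.*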
Note: binlog-do-db controls which databases are written to the binlog; whenever my.cnf changes, MySQL must be restarted:
systemctl restart mysqld
Switch back to the regular user: exit
2) Write the code, package it, and run it on the cluster
Add the dependency to pom.xml.
Check that the connector version is supported by your Flink version; see the official docs:
Overview — CDC Connectors for Apache Flink® documentation (ververica.github.io)
<!-- Flink 1.17.0 is used here -->
<dependency>
    <groupId>com.ververica</groupId>
    <artifactId>flink-connector-mysql-cdc</artifactId>
    <version>2.4.0</version>
</dependency>
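The CDC connector alone will not compile or run the job below; a typical project also pulls in the core Flink dependencies. A sketch of what is usually added (artifact names and the provided scope are assumptions for a Flink 1.17.0 cluster deployment):

<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-streaming-java</artifactId>
    <version>1.17.0</version>
    <scope>provided</scope>
</dependency>
<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-clients</artifactId>
    <version>1.17.0</version>
    <scope>provided</scope>
</dependency>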
package com.atguigu.flinkcdc;
import com.ververica.cdc.connectors.mysql.source.MySqlSource;
import com.ververica.cdc.connectors.mysql.table.StartupOptions;
import com.ververica.cdc.debezium.JsonDebeziumDeserializationSchema;
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.streaming.api.environment.CheckpointConfig;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
/**
* ClassName: CDC_Test
* Package: com.atguigu.flinkcdc
* Description:
*
* @Author Wish
* @Create 2023/10/25 16:47
* @Version 1.0
*/
public class CDC_Test {
    public static void main(String[] args) throws Exception {
        // todo 1. Get the execution environment
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setParallelism(1);

        // Enable checkpointing every 5 s and store checkpoints on HDFS
        env.enableCheckpointing(5000L);
        CheckpointConfig checkpointConfig = env.getCheckpointConfig();
        checkpointConfig.setCheckpointStorage("hdfs://hadoop102:8020/flinkcdc/checkpoint");

        // todo 2. Create the MySQL source
        MySqlSource<String> mySqlSource = MySqlSource.<String>builder()
                .hostname("hadoop102")
                .port(3306)
                .databaseList("gmall")              // databases to capture; must be covered by binlog-do-db in /etc/my.cnf
                .tableList("gmall.base_trademark")  // tables must be prefixed with the database name; omit to capture every table in the listed databases
                .username("root")
                .password("000000")
                .deserializer(new JsonDebeziumDeserializationSchema())
                .startupOptions(StartupOptions.initial())   // full snapshot first, then read the binlog
                //.startupOptions(StartupOptions.latest())  // or: only read changes from the current binlog position
                .build();

        // todo 3. Read the change stream from MySQL and print it
        env.fromSource(mySqlSource, WatermarkStrategy.noWatermarks(), "mysql-source").print();

        // todo 4. Start the job; this name is shown in the Flink web UI
        env.execute("CDC_Test");
    }
}
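With StartupOptions.initial(), existing rows are emitted first as snapshot records, followed by live changes from the binlog. Roughly what one printed record from JsonDebeziumDeserializationSchema looks like (field values are illustrative, and the real source block contains more metadata than shown):

{"before":null,"after":{"id":1,"tm_name":"Redmi"},"source":{"db":"gmall","table":"base_trademark"},"op":"r","ts_ms":1698220000000}

op is "r" for snapshot reads and "c"/"u"/"d" for inserts, updates, and deletes; for updates both before and after are populated.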
Lifecycle --> package --> copy the jar to the flink directory on the VM --> run.
To run the packaged Java code on the cluster:
bin/flink run -t yarn-per-job -c com.atguigu.flinkcdc.CDC_Test ./Flink-1.0-SNAPSHOT.jar
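print() writes to the TaskManager's stdout, not to the shell that submitted the job, so the change records show up in the Flink web UI (TaskManager -> Stdout) or, after the job finishes, in the YARN aggregated logs, e.g.:
yarn logs -applicationId <application_id>
where <application_id> is the id printed when the job was submitted.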
3. Problems
Error 1:
Caused by: java.lang.NoClassDefFoundError: com/ververica/cdc/debezium/DebeziumDeserializationSchema
at java.lang.Class.getDeclaredMethods0(Native Method)
at java.lang.Class.privateGetDeclaredMethods(Class.java:2701)
at java.lang.Class.privateGetMethodRecursive(Class.java:3048)
at java.lang.Class.getMethod0(Class.java:3018)
at java.lang.Class.getMethod(Class.java:1784)
at org.apache.flink.client.program.PackagedProgram.hasMainMethod(PackagedProgram.java:307)
... 14 more
Cause:
The Maven shade plugin was not configured when packaging, so the CDC connector classes were not bundled into the jar.
Fix:
Add the following to pom.xml:
<build>
    <plugins>
        <plugin>
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-shade-plugin</artifactId>
            <version>3.2.4</version>
            <executions>
                <execution>
                    <phase>package</phase>
                    <goals>
                        <goal>shade</goal>
                    </goals>
                    <configuration>
                        <artifactSet>
                            <excludes>
                                <exclude>com.google.code.findbugs:jsr305</exclude>
                                <exclude>org.slf4j:*</exclude>
                                <exclude>log4j:*</exclude>
                            </excludes>
                        </artifactSet>
                        <filters>
                            <filter>
                                <!-- Do not copy the signatures in the META-INF folder.
                                     Otherwise, this might cause SecurityExceptions when using the JAR. -->
                                <artifact>*:*</artifact>
                                <excludes>
                                    <exclude>META-INF/*.SF</exclude>
                                    <exclude>META-INF/*.DSA</exclude>
                                    <exclude>META-INF/*.RSA</exclude>
                                </excludes>
                            </filter>
                        </filters>
                        <transformers combine.children="append">
                            <transformer implementation="org.apache.maven.plugins.shade.resource.ServicesResourceTransformer">
                            </transformer>
                        </transformers>
                    </configuration>
                </execution>
            </executions>
        </plugin>
    </plugins>
</build>
Re-package, upload, and run again.
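To confirm the shaded jar now bundles the missing class (a quick check, assuming the jar name from the run command):
jar tf Flink-1.0-SNAPSHOT.jar | grep DebeziumDeserializationSchema
The class from the NoClassDefFoundError should be listed.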
Error 2:
Caused by: org.apache.hadoop.security.AccessControlException: Permission denied: user=root, access=WRITE, inode="/user":atguigu:supergroup:drwxr-xr-x
Caused by: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.AccessControlException): Permission denied: user=root, access=WRITE, inode="/user":atguigu:supergroup:drwxr-xr-x
Cause:
The user (user=root) attempted a write (access=WRITE) on HDFS and was denied (Permission denied). The message shows that root tried to write under /user on HDFS, which is owned by atguigu:supergroup with mode drwxr-xr-x, so root lacks the required permission. This happened because we stayed logged in as root after checking the MySQL configuration and never switched back to the regular user.
Solution:
Switch back to the regular user and resubmit:
su atguigu
bin/flink run -t yarn-per-job -c com.atguigu.flinkcdc.CDC_Test ./Flink-1.0-SNAPSHOT.jar
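An alternative that avoids switching users (assuming the cluster is not kerberized) is to tell the Hadoop client which HDFS user to act as before submitting:
export HADOOP_USER_NAME=atguigu
bin/flink run -t yarn-per-job -c com.atguigu.flinkcdc.CDC_Test ./Flink-1.0-SNAPSHOT.jar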
Error 3:
Could not get job jar and dependencies from JAR file: JAR file does not exist: ./Flink-1.0-SNAPSHOT.jar
Cause:
The jar was uploaded while logged in as root, so the regular user cannot use it,
or
the jar path in the command is wrong.
Solution:
rm -rf Flink-1.0-SNAPSHOT.jar
Re-upload the jar (as the regular user) and run again:
bin/flink run -t yarn-per-job -c com.atguigu.flinkcdc.CDC_Test Flink-1.0-SNAPSHOT.jar