2. MySQL data source
2.1 Full caching of the MySQL table (JDBC connector)
2.1.1 Flink SQL
CREATE TABLE `triggers_mysql` (
  `trigger_id` INT NOT NULL,
  `trigger_source` STRING,
  `modify_time` BIGINT NOT NULL,
  `enc_type` TINYINT
) WITH (
  'connector' = 'jdbc',
  'url' = 'jdbc:mysql://localhost:3306/azkaban',
  'table-name' = 'triggers_test',
  'username' = 'root',
  'password' = '123456',
  'lookup.cache.max-rows' = '100',
  'lookup.cache.ttl' = '10s',
  'lookup.max-retries' = '3'
);
SELECT * FROM triggers_mysql;
2.1.2 Data preparation
All you need is a MySQL table containing the relevant data.
Pay attention to the type mapping between Flink SQL and MySQL (e.g. MySQL VARCHAR maps to Flink STRING); a sketch of the MySQL-side table follows below.
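For illustration, here is a minimal sketch of what the MySQL-side `triggers_test` table might look like; the original article does not show its DDL, so the column types here are assumptions chosen to line up with the Flink DDL above:
-- Hypothetical MySQL DDL for the source table; types chosen to match
-- the Flink SQL DDL (VARCHAR <-> STRING, BIGINT <-> BIGINT, TINYINT <-> TINYINT)
CREATE TABLE `triggers_test` (
  `trigger_id`     INT          NOT NULL,
  `trigger_source` VARCHAR(128),
  `modify_time`    BIGINT       NOT NULL,
  `enc_type`       TINYINT,
  PRIMARY KEY (`trigger_id`)
);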
2.1.3 sql-client
(1) Required JARs
flink-connector-jdbc_2.11-1.12.0.jar
mysql-connector-java-5.1.48.jar
flink-connector-mysql-cdc-1.1.1.jar (only needed for the CDC approach in 2.2)
2.1.4 Code
(1) Code
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.Table;
import org.apache.flink.table.api.bridge.java.StreamTableEnvironment;
import org.apache.flink.types.Row;

import java.io.Serializable;

public class FlinkMysqlAllSource implements Serializable {

    public static final String SQL = "CREATE TABLE `triggers_mysql` (\n" +
            "  `trigger_id` INT NOT NULL,\n" +
            "  `trigger_source` STRING,\n" +
            "  `modify_time` BIGINT NOT NULL,\n" +
            "  `enc_type` TINYINT\n" +
            ") WITH (\n" +
            "  'connector' = 'jdbc',\n" +
            "  'url' = 'jdbc:mysql://localhost:3306/azkaban',\n" +
            "  'table-name' = 'triggers_test',\n" +
            "  'username' = 'root',\n" +
            "  'password' = '123456',\n" +
            "  'lookup.cache.max-rows' = '100',\n" +
            "  'lookup.cache.ttl' = '10s',\n" +
            "  'lookup.max-retries' = '3'\n" +
            ")";

    public static void main(String[] args) {
        // Build the StreamExecutionEnvironment
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // Build EnvironmentSettings and select the Blink planner
        EnvironmentSettings bsSettings = EnvironmentSettings.newInstance().useBlinkPlanner().inStreamingMode().build();
        // Build the StreamTableEnvironment
        StreamTableEnvironment tEnv = StreamTableEnvironment.create(env, bsSettings);
        tEnv.executeSql(SQL);
        // Run the query
        Table table = tEnv.sqlQuery("SELECT * FROM triggers_mysql");
        // Convert back to a DataStream and print
        DataStream<Row> rowDataStream = tEnv.toAppendStream(table, Row.class);
        rowDataStream.map(new MapFunction<Row, Row>() {
            @Override
            public Row map(Row row) throws Exception {
                System.out.println(row);
                return row;
            }
        }).print("sql==").setParallelism(1);
        // Start the job; this line is essential!
        try {
            env.execute("test");
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
(2) POM
<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-connector-jdbc_2.11</artifactId>
    <version>1.12.0</version>
</dependency>
<!-- The MySQL JDBC driver listed in 2.1.3 must also be on the classpath -->
<dependency>
    <groupId>mysql</groupId>
    <artifactId>mysql-connector-java</artifactId>
    <version>5.1.48</version>
</dependency>
2.1.5 Pitfalls
(1) When reading, pay attention to these two parameters:
'lookup.cache.ttl' = '10s'   -- how long each cached row lives
'lookup.max-retries' = '3'   -- max retry attempts when a lookup fails
(2) When writing, pay attention to these two parameters:
sink.buffer-flush.max-rows
sink.buffer-flush.interval
2.1.6 Key configuration
sink.buffer-flush.max-rows  -- how many rows to buffer before a flush is triggered
sink.buffer-flush.interval  -- how long to buffer before a flush is triggered; the two conditions are checked together, and a flush fires when either is met
sink.max-retries            -- max retry attempts if writing records to the database fails
lookup.max-retries          -- max retry attempts if a database lookup fails
lookup.cache.ttl            -- maximum time-to-live of each cached row
lookup.cache.max-rows       -- maximum number of cached rows
scan.fetch-size             -- number of rows fetched from the database per round trip
scan.auto-commit            -- sets auto-commit on the JDBC driver
The remaining options are self-explanatory, so I won't go through them.
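To make the sink options concrete, here is a sketch of a JDBC sink table; the target table name and the flush values are illustrative assumptions, not from the original article:
-- Hypothetical JDBC sink illustrating the buffer-flush options above
CREATE TABLE `triggers_sink` (
  `trigger_id` INT NOT NULL,
  `trigger_source` STRING,
  `modify_time` BIGINT NOT NULL,
  `enc_type` TINYINT
) WITH (
  'connector' = 'jdbc',
  'url' = 'jdbc:mysql://localhost:3306/azkaban',
  'table-name' = 'triggers_out',           -- hypothetical target table
  'username' = 'root',
  'password' = '123456',
  'sink.buffer-flush.max-rows' = '100',    -- flush after 100 buffered rows...
  'sink.buffer-flush.interval' = '1s',     -- ...or after 1 second, whichever comes first
  'sink.max-retries' = '3'
);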
2.2 Incremental sync with flink-cdc-mysql
2.2.1 Flink SQL
CREATE TABLE `triggers_cdc` (
  `trigger_id` INT,
  `trigger_source` STRING NULL,
  `modify_time` BIGINT NOT NULL,
  `enc_type` TINYINT NULL,
  PRIMARY KEY (`trigger_id`) NOT ENFORCED
) WITH (
  'connector' = 'mysql-cdc',
  'hostname' = 'localhost',
  'port' = '3306',
  'username' = 'root',
  'password' = '123456',
  'database-name' = 'azkaban',
  'table-name' = 'triggers_test'
);
2.2.2 Data preparation
All you need here is a MySQL instance with binlog enabled; consult online resources for how to turn binlog on in MySQL. A quick way to verify it is sketched below.
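A minimal check using standard MySQL statements (not from the original article), assuming you can connect to the server:
-- Verify that binlog is on and in ROW format (required by mysql-cdc)
SHOW VARIABLES LIKE 'log_bin';        -- expect: ON
SHOW VARIABLES LIKE 'binlog_format';  -- expect: ROW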
2.2.3 sql-client
(1) JARs
wget
2.2.4 Code
(1) Code
import org.apache.flink.api.common.functions.FilterFunction;
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.Table;
import org.apache.flink.table.api.bridge.java.StreamTableEnvironment;
import org.apache.flink.types.Row;

import java.io.Serializable;

public class FlinkMysqlCdcSource implements Serializable {

    public static final String SQL = "CREATE TABLE `triggers_cdc` (\n" +
            "  `trigger_id` INT,\n" +
            "  `trigger_source` STRING NULL,\n" +
            "  `modify_time` BIGINT NOT NULL,\n" +
            "  `enc_type` TINYINT NULL,\n" +
            "  PRIMARY KEY (`trigger_id`) NOT ENFORCED\n" +
            ") WITH (\n" +
            "  'connector' = 'mysql-cdc',\n" +
            "  'hostname' = 'localhost',\n" +
            "  'port' = '3306',\n" +
            "  'username' = 'root',\n" +
            "  'password' = '123456',\n" +
            "  'database-name' = 'azkaban',\n" +
            "  'table-name' = 'triggers_test'\n" +
            ")";

    public static void main(String[] args) {
        // Build the StreamExecutionEnvironment
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // Build EnvironmentSettings and select the Blink planner
        EnvironmentSettings bsSettings = EnvironmentSettings.newInstance().useBlinkPlanner().inStreamingMode().build();
        // Build the StreamTableEnvironment
        StreamTableEnvironment tEnv = StreamTableEnvironment.create(env, bsSettings);
        tEnv.executeSql(SQL);
        // Run the query; CDC output is a changelog, so convert to a retract stream
        Table table = tEnv.sqlQuery("SELECT * FROM triggers_cdc");
        DataStream<Tuple2<Boolean, Row>> dataStream = tEnv.toRetractStream(table, Row.class);
        dataStream.map(new MapFunction<Tuple2<Boolean, Row>, Row>() {
            @Override
            public Row map(Tuple2<Boolean, Row> booleanRowTuple2) throws Exception {
                System.out.println(booleanRowTuple2.f0 + "--" + booleanRowTuple2.f1);
                // f0 == true marks an add message; f0 == false marks a retraction
                return booleanRowTuple2.f0 ? booleanRowTuple2.f1 : null;
            }
        }).filter(new FilterFunction<Row>() {
            @Override
            public boolean filter(Row row) throws Exception {
                // Drop the retraction markers
                return row != null;
            }
        }).print("triggers_cdc").setParallelism(1);
        // Start the job; this line is essential!
        try {
            env.execute("test");
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
(2) POM
<dependency>
    <groupId>com.alibaba.ververica</groupId>
    <artifactId>flink-connector-mysql-cdc</artifactId>
    <version>1.1.1</version>
</dependency>
To use Kafka's changelog-json format, the program additionally needs the following dependencies:
<dependency>
    <groupId>com.alibaba.ververica</groupId>
    <artifactId>flink-format-changelog-json</artifactId>
    <version>1.0.0</version>
</dependency>
<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-connector-kafka_2.11</artifactId>
    <version>1.11.0</version>
</dependency>
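To show how changelog-json fits in, here is a sketch of a Kafka sink table fed from the CDC source; the topic name and broker address are hypothetical, and the original article does not include this step:
-- Hypothetical Kafka sink using the changelog-json format
CREATE TABLE `triggers_kafka` (
  `trigger_id` INT,
  `trigger_source` STRING,
  `modify_time` BIGINT,
  `enc_type` TINYINT
) WITH (
  'connector' = 'kafka',
  'topic' = 'triggers_changelog',                    -- hypothetical topic
  'properties.bootstrap.servers' = 'localhost:9092', -- hypothetical broker
  'format' = 'changelog-json'
);
-- Forward the CDC changelog into Kafka
INSERT INTO `triggers_kafka` SELECT * FROM `triggers_cdc`;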
2.2.5 Pitfalls
(1)
Make sure to put the relevant JARs under flink/lib; placing them in other directories can lead to incomplete classloading.
2.2.6 Configuration
connector
database-name
hostname
password
port
property-version -- optional
server-id        -- optional; relates to MySQL's server-id setting
server-time-zone -- optional; the session time zone
table-name
username
A sketch with the optional parameters filled in follows.
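The same CDC table with the optional options set; the server-id value and time zone are illustrative assumptions:
-- The triggers_cdc DDL again, with optional options set to example values
CREATE TABLE `triggers_cdc_opts` (
  `trigger_id` INT,
  `trigger_source` STRING,
  `modify_time` BIGINT NOT NULL,
  `enc_type` TINYINT,
  PRIMARY KEY (`trigger_id`) NOT ENFORCED
) WITH (
  'connector' = 'mysql-cdc',
  'hostname' = 'localhost',
  'port' = '3306',
  'username' = 'root',
  'password' = '123456',
  'database-name' = 'azkaban',
  'table-name' = 'triggers_test',
  'server-id' = '5401',                  -- illustrative; should be unique per connector
  'server-time-zone' = 'Asia/Shanghai'   -- illustrative; match the MySQL server's zone
);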

This article walked through reading a MySQL source with Flink in two ways: full caching and CDC-based incremental sync. The full-cache approach creates a JDBC connection via Flink SQL and tunes the lookup cache parameters for data extraction; the incremental approach uses Flink's mysql-cdc connector together with MySQL's binlog to capture data changes in real time. Key code samples, common pitfalls, and configuration notes were covered along the way.