Monitoring MySQL with Flink SQL CDC
CREATE TABLE order_source_ms (
    id BIGINT,
    deal_amt DOUBLE,
    shop_id STRING,
    customer_id STRING,
    city_id BIGINT,
    product_count DOUBLE,
    order_at TIMESTAMP(3),
    last_updated_at TIMESTAMP(3),
    pay_at TIMESTAMP,
    refund_at TIMESTAMP,
    tenant_id STRING,
    order_category STRING,
    h AS HOUR(last_updated_at),
    pay_hour AS HOUR(pay_at),
    refund_hour AS HOUR(refund_at),
    m AS MINUTE(last_updated_at),
    dt AS TO_DATE(CAST(last_updated_at AS STRING)),
    pay_dt AS TO_DATE(CAST(pay_at AS STRING)),
    refund_dt AS TO_DATE(CAST(refund_at AS STRING)),
    PRIMARY KEY (id) NOT ENFORCED
) WITH (
    'connector' = 'mysql-cdc',
    'hostname' = 'ip',
    'port' = '3306',
    'username' = 'username',
    'password' = 'password',
    'database-name' = 'databasename',
    'scan.startup.mode' = 'latest-offset',
    'debezium.skipped.operations' = 'd',
    'table-name' = 'tablename'
);
Executing the DDL above in the Flink SQL Client establishes the connection to the corresponding MySQL table. The prerequisite is that the required connector jar, flink-sql-connector-mysql-cdc-2.2-SNAPSHOT.jar, has been placed in Flink's lib directory.
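Once the table is registered, the SQL Client can query it like any other table. As a sketch (the aggregation below is illustrative, not from the original article), a continuous per-day revenue rollup over the CDC stream could look like:

```sql
-- Hypothetical continuous query over the CDC table: as change events
-- arrive from the binlog, Flink keeps the per-day, per-city totals updated.
SELECT dt, city_id, SUM(deal_amt) AS total_amt, COUNT(*) AS order_cnt
FROM order_source_ms
GROUP BY dt, city_id;
```

Because the source is a changelog, the result is itself an updating table; a downstream sink would need to support upserts.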
The DDL in this example uses Flink SQL time functions in computed columns to extract the hour, minute, and date from each record:
h AS HOUR(last_updated_at),
pay_hour AS HOUR(pay_at),
refund_hour AS HOUR(refund_at),
m AS MINUTE(last_updated_at),
dt AS TO_DATE(CAST(last_updated_at AS STRING)),
pay_dt AS TO_DATE(CAST(pay_at AS STRING)),
refund_dt AS TO_DATE(CAST(refund_at AS STRING)),
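To make the semantics of these extractions concrete, they can be sketched in plain Java with java.time (the timestamp below is a made-up sample value, and the class name is my own):

```java
import java.time.LocalDate;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

public class TimeParts {
    static final DateTimeFormatter FMT =
            DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss");

    // Equivalent of h AS HOUR(last_updated_at)
    static int hourOf(String ts) {
        return LocalDateTime.parse(ts, FMT).getHour();
    }

    // Equivalent of m AS MINUTE(last_updated_at)
    static int minuteOf(String ts) {
        return LocalDateTime.parse(ts, FMT).getMinute();
    }

    // Equivalent of dt AS TO_DATE(CAST(last_updated_at AS STRING))
    static LocalDate dateOf(String ts) {
        return LocalDateTime.parse(ts, FMT).toLocalDate();
    }

    public static void main(String[] args) {
        String ts = "2022-05-01 13:45:09"; // sample value, not real data
        System.out.println(hourOf(ts));    // 13
        System.out.println(minuteOf(ts));  // 45
        System.out.println(dateOf(ts));    // 2022-05-01
    }
}
```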
Option reference:
The common options are documented on the official website:
Option | Required | Default | Type | Description |
---|---|---|---|---|
connector | required | (none) | String | Specify what connector to use, here should be 'mysql-cdc' . |
hostname | required | (none) | String | IP address or hostname of the MySQL database server. |
username | required | (none) | String | Username to use when connecting to the MySQL database server. |
password | required | (none) | String | Password to use when connecting to the MySQL database server. |
database-name | required | (none) | String | Database name of the MySQL server to monitor. The database-name also supports regular expressions to monitor multiple databases that match the regular expression. |
table-name | required | (none) | String | Table name of the MySQL database to monitor. The table-name also supports regular expressions to monitor multiple tables that match the regular expression. |
port | optional | 3306 | Integer | Integer port number of the MySQL database server. |
server-id | optional | (none) | Integer | A numeric ID or a numeric ID range of this database client. The numeric ID syntax is like '5400'; the numeric ID range syntax is like '5400-5408'. The numeric ID range syntax is recommended when 'scan.incremental.snapshot.enabled' is enabled. Every ID must be unique across all currently-running database processes in the MySQL cluster. This connector joins the MySQL cluster as another server (with this unique ID) so it can read the binlog. By default, a random number is generated between 5400 and 6400, though we recommend setting an explicit value. |
scan.incremental.snapshot.enabled | optional | true | Boolean | Incremental snapshot is a new mechanism to read the snapshot of a table. Compared to the old snapshot mechanism, the incremental snapshot has many advantages, including: (1) the source can be parallel during snapshot reading, (2) the source can perform checkpoints at chunk granularity during snapshot reading, (3) the source doesn't need to acquire a global read lock (FLUSH TABLES WITH READ LOCK) before snapshot reading. If you would like the source to run in parallel, each parallel reader should have a unique server id, so 'server-id' must be a range like '5400-6400', and the range must be larger than the parallelism. Please see the Incremental Snapshot Reading section for more detailed information. |
scan.incremental.snapshot.chunk.size | optional | 8096 | Integer | The chunk size (number of rows) of a table snapshot; captured tables are split into multiple chunks when reading the snapshot of a table. |
scan.snapshot.fetch.size | optional | 1024 | Integer | The maximum fetch size per poll when reading a table snapshot. |
scan.startup.mode | optional | initial | String | Optional startup mode for the MySQL CDC consumer; valid enumerations are "initial" and "latest-offset". Please see the Startup Reading Position section for more detailed information. |
server-time-zone | optional | UTC | String | The session time zone of the database server, e.g. "Asia/Shanghai". It controls how the TIMESTAMP type in MySQL is converted to STRING. |
debezium.min.row.count.to.stream.result | optional | 1000 | Integer | During a snapshot operation, the connector will query each included table to produce a read event for all rows in that table. This parameter determines whether the MySQL connection will pull all results for a table into memory (which is fast but requires large amounts of memory), or whether the results will instead be streamed (which can be slower, but will work for very large tables). The value specifies the minimum number of rows a table must contain before the connector will stream results, and defaults to 1,000. Set this parameter to '0' to skip all table size checks and always stream all results during a snapshot. |
connect.timeout | optional | 30s | Duration | The maximum time that the connector should wait after trying to connect to the MySQL database server before timing out. |
debezium.* | optional | (none) | String | Passes Debezium's properties through to the Debezium Embedded Engine, which is used to capture data changes from the MySQL server. For example: 'debezium.snapshot.mode' = 'never'. See the Debezium MySQL connector properties for more. |
One option in the DDL above deserves special mention:

'debezium.skipped.operations' = 'd'

This setting makes the connector skip delete operations when reading the MySQL binlog. It took quite a while to find: the business requirement was to filter out deletes, and there is no dedicated Flink SQL option for it. The answer turned out to be in the last row of the option table above: Debezium properties can be passed through with the debezium.* prefix, so any additional monitoring behavior you need can be looked up in the Debezium documentation linked there, which is where I found the option for filtering delete operations.
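For reference, Debezium's skipped.operations accepts a comma-separated list of operation codes, so other combinations are possible too (a sketch; only the 'd' value is used in this article):

```sql
-- c = inserts (create), u = updates, d = deletes, t = truncates
'debezium.skipped.operations' = 'd'      -- skip deletes only
-- 'debezium.skipped.operations' = 'd,u' -- would skip deletes and updates
```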
Connecting to MySQL CDC from code (DataStream API)
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import com.ververica.cdc.debezium.JsonDebeziumDeserializationSchema;
import com.ververica.cdc.connectors.mysql.source.MySqlSource;

public class MySqlSourceExample {
    public static void main(String[] args) throws Exception {
        MySqlSource<String> mySqlSource = MySqlSource.<String>builder()
                .hostname("yourHostname")
                .port(yourPort)
                .databaseList("yourDatabaseName") // set captured database
                .tableList("yourDatabaseName.yourTableName") // set captured table
                .username("yourUsername")
                .password("yourPassword")
                .deserializer(new JsonDebeziumDeserializationSchema()) // converts SourceRecord to JSON String
                .build();

        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // enable checkpoint
        env.enableCheckpointing(3000);

        env.fromSource(mySqlSource, WatermarkStrategy.noWatermarks(), "MySQL Source")
           // set 4 parallel source tasks
           .setParallelism(4)
           .print().setParallelism(1); // use parallelism 1 for sink to keep message ordering

        env.execute("Print MySQL Snapshot + Binlog");
    }
}
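The delete-skipping behavior discussed earlier should also be reachable from the DataStream API: the MySqlSource builder can take a java.util.Properties bundle that is handed to Debezium (check your flink-cdc version's javadoc for the exact method, typically debeziumProperties(...)). A minimal sketch of building that bundle — the class name is my own:

```java
import java.util.Properties;

public class DebeziumProps {
    // Builds the Debezium pass-through properties. In the builder chain above
    // this would be attached via something like .debeziumProperties(buildProps());
    // verify the method name against your connector version.
    static Properties buildProps() {
        Properties props = new Properties();
        // Same effect as 'debezium.skipped.operations' = 'd' in the SQL DDL:
        // skip delete events when reading the binlog. Note that the 'debezium.'
        // prefix is only used on the SQL side; here the raw Debezium key is passed.
        props.setProperty("skipped.operations", "d");
        return props;
    }

    public static void main(String[] args) {
        System.out.println(buildProps());
    }
}
```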
One more point worth adding: when I first built the Flink SQL table, I mapped the MySQL DECIMAL amount columns to DOUBLE. The DDL succeeds without error, but arithmetic on the amounts then suffers precision loss, so the Flink table should use DECIMAL for money columns as well.
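The precision issue is easy to reproduce outside Flink. Binary doubles cannot represent most decimal fractions exactly, which is why DECIMAL (exact decimal arithmetic, like BigDecimal on the JVM) is the right type for money:

```java
import java.math.BigDecimal;

public class MoneyPrecision {
    public static void main(String[] args) {
        // DOUBLE-style arithmetic: 0.1 + 0.2 is not exactly 0.3
        double d = 0.1 + 0.2;
        System.out.println(d); // prints 0.30000000000000004

        // DECIMAL-style arithmetic: exact
        BigDecimal b = new BigDecimal("0.1").add(new BigDecimal("0.2"));
        System.out.println(b); // prints 0.3
    }
}
```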
For more on Flink SQL, see: a detailed and systematic guide to Flink SQL and its use in enterprise production.