Dinky Practice Series: Real-Time Whole-Database Ingestion into Warehouses and Lakes with Flink CDC


Abstract: This article, part of the Dinky feature practice series, walks through real-time whole-database ingestion into data warehouses and lakes with Flink CDC. It covers:

  1. Preface

  2. Environment Requirements

  3. Source Database Preparation

  4. Whole-Database Sync Parameters

  5. Whole-Database Ingestion into the Lake: Hudi

  6. Whole-Database Ingestion into the Warehouse: StarRocks

  7. Whole-Database Sync to Another Database: MySQL

  8. Whole-Database Sync to Kafka

  9. Whole-Database Sync to Another Database: PostgreSQL

  10. Whole-Database Ingestion into the Warehouse: ClickHouse

  11. Summary

1. Preface

    Dinky's whole-database sync feature has been available for some time now. By reading this article you will become familiar with how to use it. The Dinky community has prepared this whole-database sync series to help everyone get started quickly.

    Since there are many kinds of sink databases in the industry, we deliberately chose several that are common or popular today for the demonstrations, and we use mysql-cdc as the source for whole-database sync to each sink. If your source is oracle-cdc, simply replace mysql-cdc with it after reading this article.

2. Environment Requirements

Software      Version
CDH           6.2.0
Hadoop        3.0.0-cdh6.2.0
Hive          2.1.1-cdh6.2.0
Hudi          0.11.1
Flink         1.13.6
Flink CDC     2.2.1
StarRocks     2.2.0
Dinky         0.6.6-SNAPSHOT
MySQL         5.7
PostgreSQL    13
ClickHouse    22.2.2.1 (standalone)

Required dependencies

    For whole-database sync, you only need to load, under Flink, the Flink connectors required by the surrounding components. The dependencies are as follows:

# Hive dependencies
antlr-runtime-3.5.2.jar
hive-exec-2.1.1-cdh6.2.0.jar
libfb303-0.9.3.jar
flink-sql-connector-hive-2.2.0_2.12-1.13.6.jar
hive-site.xml
# Hadoop dependency
flink-shaded-hadoop-2-uber-3.0.0-cdh6.3.0-7.0.jar
# Flink StarRocks dependency
flink-connector-starrocks-1.2.2_flink-1.13_2.12.jar
# Hudi dependency
hudi-flink1.13-bundle_2.12-0.11.1.jar
# Dinky hadoop dependency
flink-shaded-hadoop-3-uber-3.1.1.7.2.8.0-224-9.0.jar
# Dinky whole-database sync dependencies
dlink-client-1.13-0.6.5.jar
dlink-client-base-0.6.5.jar
dlink-common-0.6.5.jar
# Flink CDC dependency
flink-sql-connector-mysql-cdc-2.2.1.jar
# MySQL driver dependency
mysql-connector-java-8.0.21.jar
# Kafka Flink dependency
flink-sql-connector-kafka_2.12-1.13.6.jar
# PostgreSQL JDBC dependency
postgresql-42.2.14.jar
# ClickHouse dependencies
clickhouse-jdbc-0.2.6.jar
flink-connector-clickhouse-1.13.6.jar

Notes

    1. Hive dependencies go under $FLINK_HOME/lib and $DINKY_HOME/plugins

    2. Hadoop dependencies go under $FLINK_HOME/lib

    3. Flink StarRocks dependencies go under $FLINK_HOME/lib and $DINKY_HOME/plugins

    4. Hudi dependencies go under $FLINK_HOME/lib and $DINKY_HOME/plugins

    5. Dinky hadoop dependencies go under $DINKY_HOME/plugins (download from the network drive or the group announcement)

    6. Dinky whole-database sync dependencies go under $FLINK_HOME/lib

    7. Flink CDC dependencies go under $FLINK_HOME/lib and $DINKY_HOME/plugins

    8. The MySQL driver goes under $FLINK_HOME/lib and $DINKY_HOME/plugins

    9. Kafka Flink dependencies go under $FLINK_HOME/lib and $DINKY_HOME/plugins

    10. PostgreSQL JDBC dependencies go under $FLINK_HOME/lib and $DINKY_HOME/plugins

    11. ClickHouse dependencies go under $FLINK_HOME/lib and $DINKY_HOME/plugins

    After all of the above dependencies are in place, restart the Flink cluster and Dinky. If you run into jar conflicts along the way, resolve the conflicting packages yourself based on the error messages.

3. Source Database Preparation

Create the MySQL source tables

    The following SQL script is taken from the Flink CDC official website.

# MySQL source table creation statements
CREATE TABLE bigdata.products (
  id INTEGER NOT NULL AUTO_INCREMENT PRIMARY KEY,
  name VARCHAR(255) NOT NULL,
  description VARCHAR(512)
);
ALTER TABLE bigdata.products AUTO_INCREMENT = 101;
INSERT INTO bigdata.products
VALUES (default,"scooter","Small 2-wheel scooter"),
       (default,"car battery","12V car battery"),
       (default,"12-pack drill bits","12-pack of drill bits with sizes ranging from #40 to #3"),
       (default,"hammer","12oz carpenter's hammer"),
       (default,"hammer","14oz carpenter's hammer"),
       (default,"hammer","16oz carpenter's hammer"),
       (default,"rocks","box of assorted rocks"),
       (default,"jacket","water resistent black wind breaker"),
       (default,"spare tire","24 inch spare tire");

CREATE TABLE bigdata.orders (
  order_id INTEGER NOT NULL AUTO_INCREMENT PRIMARY KEY,
  order_date DATETIME NOT NULL,
  customer_name VARCHAR(255) NOT NULL,
  price DECIMAL(10, 5) NOT NULL,
  product_id INTEGER NOT NULL,
  order_status BOOLEAN NOT NULL -- Whether order has been placed
) AUTO_INCREMENT = 10001;
INSERT INTO bigdata.orders
VALUES (default, '2020-07-30 10:08:22', 'Jark', 50.50, 102, false),
       (default, '2020-07-30 10:11:09', 'Sally', 15.00, 105, false),
       (default, '2020-07-30 12:00:30', 'Edward', 25.25, 106, false);
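    Before configuring any sink, it helps to confirm that the sample data actually landed in the source. A minimal check (the expected counts simply follow from the INSERT statements above):

SELECT COUNT(*) FROM bigdata.products;  -- expect 9
SELECT COUNT(*) FROM bigdata.orders;    -- expect 3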

4. Whole-Database Sync Parameters

    Dinky's common whole-database sync parameters apply to most sink targets. A few sinks, such as Hudi, are implemented differently underneath and cannot be treated the same way. The common parameters, following the syntax Dinky provides, are as follows:

key                  value                                  side
connector            mysql-cdc                              source
hostname             host name                              source
port                 port                                   source
username             user name                              source
password             password                               source
checkpoint           checkpoint interval                    source
scan.startup.mode    full (initial) or incremental read     source
parallelism          1                                      source
database-name        database name                          source
table-name           table name(s), regex supported         source
sink.*               stands for all sink-side parameters    sink

    Tip: when using sink.*, note that the "sink." prefix is mandatory; the asterisk stands for any parameter of the sink side. For example, where a native Flink sink table DDL writes the connector option as "connector", the Dinky whole-database sync syntax requires "sink.connector". Every sink target must follow this convention. In sink parameter values Dinky also substitutes variables such as ${tableName} (the current source table name) and ${pkList} (its primary key columns), which the job scripts below rely on.
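    As a small illustration of the prefix rule, the snippet below is a plain Flink sink DDL shown only for comparison (it is not part of a sync job; the option names follow the Flink JDBC connector):

-- Options as written in a native Flink sink table DDL
CREATE TABLE sink_products (
  id INT,
  name STRING,
  description STRING
) WITH (
  'connector' = 'jdbc',
  'url' = 'jdbc:mysql://192.168.0.5:3306/test',
  'table-name' = 'test_products'
);
-- The same options inside EXECUTE CDCSOURCE take the "sink." prefix:
--   'sink.connector' = 'jdbc',
--   'sink.url' = 'jdbc:mysql://192.168.0.5:3306/test'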

5. Whole-Database Ingestion into the Lake: Hudi

Job script

EXECUTE CDCSOURCE demo_hudi2 WITH (
  'connector' = 'mysql-cdc',
  'hostname' = '192.168.0.4',
  'port' = '4406',
  'username' = 'root',
  'password' = '123456',
  'checkpoint' = '3000',
  'scan.startup.mode' = 'initial',
  'parallelism' = '1',
  'database-name' = 'bigdata',
  'table-name' = 'bigdata\.products,bigdata\.orders',
  'sink.connector' = 'hudi',
  'sink.path' = 'hdfs://nameservice1/data/hudi/${tableName}',
  'sink.hoodie.datasource.write.recordkey.field' = '${pkList}',
  'sink.hoodie.parquet.max.file.size' = '268435456',
  --'sink.write.precombine.field' = 'gmt_modified',
  'sink.write.tasks' = '1',
  'sink.write.bucket_assign.tasks' = '2',
  'sink.write.precombine' = 'true',
  'sink.compaction.async.enabled' = 'true',
  'sink.write.task.max.size' = '1024',
  'sink.write.rate.limit' = '3000',
  'sink.write.operation' = 'upsert',
  'sink.table.type' = 'COPY_ON_WRITE',
  'sink.compaction.tasks' = '1',
  'sink.compaction.delta_seconds' = '20',
  'sink.compaction.async.enabled' = 'true',
  'sink.read.streaming.skip_compaction' = 'true',
  'sink.compaction.delta_commits' = '20',
  'sink.compaction.trigger.strategy' = 'num_or_time',
  'sink.compaction.max_memory' = '500',
  'sink.changelog.enabled' = 'true',
  'sink.read.streaming.enabled' = 'true',
  'sink.read.streaming.check.interval' = '3',
  'sink.hive_sync.skip_ro_suffix' = 'true',
  'sink.hive_sync.enable' = 'true',
  'sink.hive_sync.mode' = 'hms',
  'sink.hive_sync.metastore.uris' = 'thrift://bigdata1:9083',
  'sink.hive_sync.db' = 'qhc_hudi_ods',
  'sink.hive_sync.table' = '${tableName}',
  'sink.table.prefix.schema' = 'true'
)

Create and submit the job

View the HDFS directory and Hive tables
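    The screenshots are omitted here. As a rough sanity-check sketch from the Hive side, you can verify that Hive sync registered the Hudi tables (the database name comes from 'sink.hive_sync.db' in the script above; the table names are the ones referenced by the StarRocks external tables in the next step):

-- run in Hive / Beeline
SHOW TABLES IN qhc_hudi_ods;
-- expected to include bigdata_orders and bigdata_products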

Create StarRocks external tables over Hudi

    Before creating the external tables, first make sure the hdfs-site.xml file has been deployed to the conf directory of every FE and BE node in StarRocks, then restart the FE and BE nodes. Hudi external tables are read-only and can only be used for queries. The only Hudi table type currently supported is Copy on Write.

Create and manage the Hudi resource

CREATE EXTERNAL RESOURCE "hudi0"
PROPERTIES (
    "type" = "hudi",
    "hive.metastore.uris" = "thrift://bigdata1:9083"
);
SHOW RESOURCES;

Create the Hudi external tables

CREATE EXTERNAL TABLE qhc_sta.orders (
  `order_id` int NULL COMMENT "",
  `order_date` datetime NULL COMMENT "",
  `customer_name` string NULL COMMENT "",
  `price` decimal(10, 5) NULL COMMENT "",
  `product_id` int NULL COMMENT "",
  `order_status` int NULL COMMENT ""
) ENGINE=HUDI
PROPERTIES (
    "resource" = "hudi0",
    "database" = "qhc_hudi_ods",
    "table" = "bigdata_orders"
);
CREATE EXTERNAL TABLE qhc_sta.products (
  id INT,
  name STRING,
  description STRING
) ENGINE=HUDI
PROPERTIES (
    "resource" = "hudi0",
    "database" = "qhc_hudi_ods",
    "table" = "bigdata_products"
);

View the Hudi external table data
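    A minimal query sketch against the external tables created above, to confirm the synced snapshot is visible from StarRocks:

SELECT COUNT(*) FROM qhc_sta.products;           -- expect 9, matching the source inserts
SELECT * FROM qhc_sta.orders ORDER BY order_id;  -- expect the 3 sample orders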

6. Whole-Database Ingestion into the Warehouse: StarRocks

Job script

EXECUTE CDCSOURCE jobname WITH (
  'connector' = 'mysql-cdc',
  'hostname' = '192.168.0.4',
  'port' = '3306',
  'username' = 'root',
  'password' = '123456',
  'checkpoint' = '3000',
  'scan.startup.mode' = 'initial',
  'parallelism' = '1',
  'table-name' = 'bigdata\.products,bigdata\.orders',
  'sink.connector' = 'starrocks',
  'sink.jdbc-url' = 'jdbc:mysql://192.168.0.4:19035',
  'sink.load-url' = '192.168.0.4:18035',
  'sink.username' = 'devuser',
  'sink.password' = '123456',
  'sink.sink.db' = 'qhc_ods',
  'sink.table.prefix' = 'ods_',
  'sink.table.lower' = 'true',
  'sink.database-name' = 'qhc_ods',
  'sink.table-name' = '${tableName}',
  'sink.sink.properties.format' = 'json',
  'sink.sink.properties.strip_outer_array' = 'true',
  'sink.sink.max-retries' = '10',
  'sink.sink.buffer-flush.interval-ms' = '15000',
  'sink.sink.parallelism' = '1'
)

Create the job

Create the StarRocks tables

CREATE TABLE qhc_ods.`ods_orders` (
  `order_id` largeint(40) NOT NULL COMMENT "",
  `order_date` datetime NOT NULL COMMENT "",
  `customer_name` varchar(65533) NULL COMMENT "",
  `price` decimal64(10, 5) NOT NULL COMMENT "",
  `product_id` bigint(20) NULL COMMENT "",
  `order_status` boolean NULL COMMENT ""
) ENGINE=OLAP
PRIMARY KEY(`order_id`)
COMMENT "OLAP"
DISTRIBUTED BY HASH(`order_id`) BUCKETS 10
PROPERTIES (
  "replication_num" = "3",
  "colocate_with" = "qhc",
  "in_memory" = "false",
  "storage_format" = "DEFAULT"
);
CREATE TABLE qhc_ods.`ods_products` (
  `id` largeint(40) NOT NULL COMMENT "",
  `name` varchar(65533) NOT NULL COMMENT "",
  `description` varchar(65533) NULL COMMENT ""
) ENGINE=OLAP
PRIMARY KEY(`id`)
COMMENT "OLAP"
DISTRIBUTED BY HASH(`id`) BUCKETS 10
PROPERTIES (
  "replication_num" = "3",
  "colocate_with" = "qhc",
  "in_memory" = "false",
  "storage_format" = "DEFAULT"
);

View the StarRocks tables

    Check that the StarRocks tables contain no data yet.

Submit the Flink whole-database sync job

Check StarRocks again
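    With the sync job running, a quick query sketch against the target tables should now return the source rows:

SELECT COUNT(*) FROM qhc_ods.ods_products;           -- expect 9
SELECT * FROM qhc_ods.ods_orders ORDER BY order_id;  -- expect the 3 sample orders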

7. Whole-Database Sync to Another Database: MySQL

Job script

EXECUTE CDCSOURCE cdc_mysql2 WITH (
  'connector' = 'mysql-cdc',
  'hostname' = '192.168.0.4',
  'port' = '3306',
  'username' = 'root',
  'password' = '123456',
  'checkpoint' = '3000',
  'scan.startup.mode' = 'initial',
  'parallelism' = '1',
  'table-name' = 'bigdata\.products,bigdata\.orders',
  'sink.connector' = 'jdbc',
  'sink.url' = 'jdbc:mysql://192.168.0.5:3306/test?characterEncoding=utf-8&useSSL=false',
  'sink.username' = 'root',
  'sink.password' = '123456',
  'sink.sink.db' = 'test',
  'sink.table.prefix' = 'test_',
  'sink.table.lower' = 'true',
  'sink.table-name' = '${tableName}',
  'sink.driver' = 'com.mysql.jdbc.Driver',
  'sink.sink.buffer-flush.interval' = '2s',
  'sink.sink.buffer-flush.max-rows' = '100',
  'sink.sink.max-retries' = '5'
)

Create the job

Create the MySQL target tables

drop table test.test_products;
CREATE TABLE test.test_products (
  id INTEGER NOT NULL,
  name VARCHAR(255) NOT NULL,
  description VARCHAR(512)
);
drop table test.test_orders;
CREATE TABLE test.test_orders (
  order_id INTEGER NOT NULL,
  order_date DATETIME NOT NULL,
  customer_name VARCHAR(255) NOT NULL,
  price DECIMAL(10, 5) NOT NULL,
  product_id INTEGER NOT NULL,
  order_status BOOLEAN NOT NULL -- Whether order has been placed
);

Submit the Flink whole-database sync job

View the MySQL data
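    A minimal check sketch on the MySQL target side (database and table prefix as configured in the job script):

SELECT COUNT(*) FROM test.test_products;           -- expect 9
SELECT * FROM test.test_orders ORDER BY order_id;  -- expect the 3 sample orders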

8. Whole-Database Sync to Kafka

Job script

# CDC job
EXECUTE CDCSOURCE cdc_kafka WITH (
  'connector' = 'mysql-cdc',
  'hostname' = '192.168.0.4',
  'port' = '3306',
  'username' = 'root',
  'password' = '123456',
  'checkpoint' = '3000',
  'scan.startup.mode' = 'initial',
  'parallelism' = '1',
  'table-name' = 'bigdata\.products,bigdata\.orders',
  'sink.connector' = 'datastream-kafka',
  'sink.topic' = 'cdctest',
  'sink.brokers' = 'bigdata2:9092,bigdata3:9092,bigdata4:9092'
)

Create the job

Create the Kafka topic

    Creating the topic can be skipped; Dinky whole-database sync creates it automatically.

# Create the topic
./bin/kafka-topics.sh \
  --create \
  --zookeeper bigdata2:2181,bigdata3:2181,bigdata4:2181 \
  --replication-factor 3 \
  --partitions 1 \
  --topic cdctest
# List topics
./bin/kafka-topics.sh --list \
  --zookeeper bigdata2:2181,bigdata3:2181,bigdata4:2181

Submit the Flink whole-database sync job

Check the consumer

    Check that data from both tables shows up in the topic.

./bin/kafka-console-consumer.sh  --bootstrap-server bigdata2:9092,bigdata3:9092,bigdata4:9092  --topic cdctest  --from-beginning  --group test_id

9. Whole-Database Sync to Another Database: PostgreSQL

Job script

EXECUTE CDCSOURCE cdc_postgresql5 WITH (
  'connector' = 'mysql-cdc',
  'hostname' = '192.168.0.4',
  'port' = '3306',
  'username' = 'root',
  'password' = '123456',
  'checkpoint' = '3000',
  'scan.startup.mode' = 'initial',
  'parallelism' = '1',
  'table-name' = 'bigdata\.products,bigdata\.orders',
  'sink.connector' = 'jdbc',
  'sink.url' = 'jdbc:postgresql://192.168.0.5:5432/test',
  'sink.username' = 'test',
  'sink.password' = '123456',
  'sink.sink.db' = 'test',
  'sink.table.prefix' = 'test_',
  'sink.table.lower' = 'true',
  'sink.table-name' = '${tableName}',
  'sink.driver' = 'org.postgresql.Driver',
  'sink.sink.buffer-flush.interval' = '2s',
  'sink.sink.buffer-flush.max-rows' = '100',
  'sink.sink.max-retries' = '5'
)

Create the job

Create the PostgreSQL target tables

CREATE schema test;
drop table test.test_products;
CREATE TABLE test.test_products (
  id INTEGER UNIQUE NOT NULL,
  name VARCHAR(255) NOT NULL,
  description VARCHAR(512)
);
drop table test.test_orders;
CREATE TABLE test.test_orders (
  order_id INTEGER UNIQUE NOT NULL,
  order_date timestamp NULL,
  customer_name VARCHAR(255) NOT NULL,
  price DECIMAL(10, 5) NULL,
  product_id INTEGER NULL,
  order_status INTEGER NOT NULL -- Whether order has been placed
);

Submit the Flink whole-database sync job

View the PostgreSQL data
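    A slightly richer check sketch on the PostgreSQL side, joining the two synced tables once the initial snapshot has completed:

SELECT o.order_id, o.customer_name, p.name AS product_name, o.price
FROM test.test_orders o
JOIN test.test_products p ON o.product_id = p.id
ORDER BY o.order_id;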

10. Whole-Database Ingestion into the Warehouse: ClickHouse

Job script

EXECUTE CDCSOURCE cdc_clickhouse WITH (
  'connector' = 'mysql-cdc',
  'hostname' = '192.168.0.4',
  'port' = '4406',
  'username' = 'root',
  'password' = '123456',
  'checkpoint' = '3000',
  'scan.startup.mode' = 'initial',
  'parallelism' = '1',
  'table-name' = 'bigdata\.products,bigdata\.orders',
  'sink.connector' = 'clickhouse',
  'sink.url' = 'clickhouse://192.168.0.5:8123',
  'sink.username' = 'default',
  'sink.password' = '123456',
  'sink.sink.db' = 'test',
  'sink.table.prefix' = 'test_',
  'sink.table.lower' = 'true',
  'sink.database-name' = 'test',
  'sink.table-name' = '${tableName}',
  'sink.sink.batch-size' = '500',
  'sink.sink.flush-interval' = '1000',
  'sink.sink.max-retries' = '3'
)

Create the job

Create the ClickHouse target tables

# The CREATE statements below define local (non-distributed) tables
create database test;
drop table test.test_products;
CREATE TABLE test.test_products (
  id Int64 NOT NULL,
  name String NOT NULL,
  description String
) ENGINE = MergeTree()
ORDER BY id
PRIMARY KEY id;
drop table test.test_orders;
CREATE TABLE test.test_orders (
  order_id Int64 NOT NULL,
  order_date DATETIME NOT NULL,
  customer_name String NOT NULL,
  price DECIMAL(10, 5) NOT NULL,
  product_id Int64 NOT NULL,
  order_status BOOLEAN NOT NULL -- Whether order has been placed
) ENGINE = MergeTree()
ORDER BY order_id
PRIMARY KEY order_id;

Submit the Flink whole-database sync job

View the ClickHouse data
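    A minimal check sketch on the ClickHouse side (count() is ClickHouse's shorthand for COUNT(*)):

SELECT count() FROM test.test_products;                                           -- expect 9
SELECT order_status, count() AS cnt FROM test.test_orders GROUP BY order_status;  -- the 3 sample orders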
