Items verified:
1. Creating a Hadoop-type catalog
2. Creating a Hive-type catalog
3. Reading a MySQL table via CDC and writing it into an Iceberg table in a Hive-type catalog.
Verification failed
4. Reading a MySQL table via CDC, writing into Kafka, then writing into an Iceberg table in a Hive-type catalog.
Verification failed
5. Summary
-
In Flink 1.11.1, flink mysql cdc works as a source, but it cannot write into Kafka; the CDC version available at the time did not support that, nor did it support Iceberg as a sink.
-
With Kafka as the source and an Iceberg table in hive_catalog as the sink, data from Kafka can be written into Iceberg successfully.
Packages to prepare
FLINK_VERSION=1.11.1
SCALA_VERSION=2.12
ICEBERG_VERSION=0.11.1
HIVE_VERSION=2.3.6
JAVA_VERSION=1.8
APACHE_FLINK_URL=archive.apache.org/dist/flink/
MAVEN_URL=https://repo1.maven.org/maven2
wget ${APACHE_FLINK_URL}/flink-${FLINK_VERSION}/flink-${FLINK_VERSION}-bin-scala_${SCALA_VERSION}.tgz
tar xzvf flink-${FLINK_VERSION}-bin-scala_${SCALA_VERSION}.tgz
ICEBERG_MAVEN_URL=${MAVEN_URL}/org/apache/iceberg
ICEBERG_PACKAGE=iceberg-flink-runtime
wget ${ICEBERG_MAVEN_URL}/${ICEBERG_PACKAGE}/${ICEBERG_VERSION}/${ICEBERG_PACKAGE}-${ICEBERG_VERSION}.jar
FLINK_CONNECTOR_URL=${MAVEN_URL}/org/apache/flink
FLINK_CONNECTOR_PACKAGE=flink-sql-connector-hive
wget ${FLINK_CONNECTOR_URL}/${FLINK_CONNECTOR_PACKAGE}-${HIVE_VERSION}_${SCALA_VERSION}/${FLINK_VERSION}/${FLINK_CONNECTOR_PACKAGE}-${HIVE_VERSION}_${SCALA_VERSION}-${FLINK_VERSION}.jar
Configure the Hadoop catalog
Set the Hadoop environment variable:
export HADOOP_CLASSPATH=`$HADOOP_HOME/bin/hadoop classpath`
Start Hadoop in advance; the HDFS service will be needed shortly.
Start Flink in advance.
Put iceberg-flink-runtime-0.11.1.jar into $FLINK_HOME/lib, then start the Flink SQL client:
./bin/sql-client.sh embedded shell
Create the hadoop_catalog
CREATE CATALOG hadoop_catalog WITH (
  'type'='iceberg',
  'catalog-type'='hadoop',
  'warehouse'='hdfs://lbpc:8020/warehouse/path',
  'property-version'='1'
);
-- Note: in Hadoop 3.x the NameNode's default port is no longer 9000.
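If the port in the warehouse URI is wrong, the catalog fails as soon as it touches HDFS. The actual NameNode address can be read back from the cluster configuration; a quick check, assuming a working Hadoop client on the PATH:

```
hdfs getconf -confKey fs.defaultFS
```

Whatever this prints (e.g. hdfs://lbpc:8020) is the prefix the warehouse path should use.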
Create a database and a table
USE CATALOG hadoop_catalog;
-- If hadoop_catalog was created before, creating it again simply re-attaches the existing
-- catalog; all of its databases are still there and need not be recreated.
CREATE DATABASE hadoop_catalog.iceberg_db;
CREATE TABLE hadoop_catalog.iceberg_db.sample (
  id BIGINT COMMENT 'unique id',
  data STRING
);
INSERT INTO hadoop_catalog.iceberg_db.sample VALUES (10, 'test10_U'), (11, 'test11'), (12, 'test12');
SET execution.type = streaming;
SELECT * FROM hadoop_catalog.iceberg_db.sample;
SET execution.type = batch;
SELECT * FROM hadoop_catalog.iceberg_db.sample;
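A Hadoop catalog keeps everything directly under the warehouse path, so the result of the statements above can be inspected on HDFS. A sketch of the expected layout (directory names follow Iceberg's convention; the exact file names will vary):

```
hdfs dfs -ls -R /warehouse/path/iceberg_db/sample
# .../sample/metadata   -> version-hint.text, v*.metadata.json, manifest files
# .../sample/data       -> data files produced by the INSERT
```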
Configure the Hive catalog
Put flink-sql-connector-hive-2.3.6_2.11-1.11.0.jar into $FLINK_HOME/lib:
https://repo.maven.apache.org/maven2/org/apache/flink/flink-sql-connector-hive-2.3.6_2.11/1.11.0/flink-sql-connector-hive-2.3.6_2.11-1.11.0.jar
Start the Flink SQL client
CREATE CATALOG hive_catalog WITH (
  'type'='iceberg',
  'catalog-type'='hive',
  'uri'='thrift://localhost:9083',
  'clients'='5',
  'property-version'='1',
  'warehouse'='hdfs://lbpc:8020/user/hive/warehouse'
);
Option notes:
type: must be iceberg; selects the Iceberg table format. (required)
catalog-type: Iceberg currently supports the hive and hadoop catalog types. (required)
uri: the thrift URI of the Hive metastore. (required)
clients: Hive metastore client pool size; the default is 2. (optional)
property-version: a version number describing the property format, kept for backward compatibility in case the format changes; the current version is 1. (optional)
warehouse: the Hive warehouse location. Specify it if hive-conf-dir is not set to a location containing hive-site.xml and no proper hive-site.xml is on the classpath.
hive-conf-dir: path to a directory containing the hive-site.xml that supplies custom Hive configuration values. If both hive-conf-dir and warehouse are set when creating the Iceberg catalog, the warehouse value overrides hive.metastore.warehouse.dir from <hive-conf-dir>/hive-site.xml (or from the hive-site.xml on the classpath).
Note: the hive_catalog can also be declared in the Flink SQL client configuration file, so it does not have to be recreated every time the SQL client starts.
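In Flink 1.11 the SQL client reads conf/sql-client-defaults.yaml, which has a catalogs section. A sketch of the equivalent entry, with the same values as the CREATE CATALOG statement above (exact key support depends on the Iceberg catalog factory):

```yaml
catalogs:
  - name: hive_catalog
    type: iceberg
    catalog-type: hive
    uri: thrift://localhost:9083
    clients: 5
    property-version: 1
    warehouse: hdfs://lbpc:8020/user/hive/warehouse
```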
Create a table in the hive_catalog
USE CATALOG hive_catalog;
CREATE DATABASE hive_catalog.iceberg_db;
CREATE TABLE hive_catalog.iceberg_db.sampleHive (
  id BIGINT COMMENT 'unique id',
  data STRING
);
INSERT INTO hive_catalog.iceberg_db.sampleHive VALUES (10, 'test10_U'), (11, 'test11'), (12, 'test12');
SELECT * FROM hive_catalog.iceberg_db.sampleHive;
Verify Flink MySQL CDC writing into Kafka and then into Iceberg:
Prepare the connector jar required by flink mysql cdc; integrating the 2.0 version failed.
Prepare the MySQL table
CREATE TABLE ordersmysql (
  order_id INT,
  customer_name STRING,
  PRIMARY KEY (order_id) NOT ENFORCED
) WITH (
  'connector' = 'mysql-cdc',
  'hostname' = 'localhost',
  'port' = '3306',
  'username' = 'root',
  'password' = '12345678',
  'database-name' = 'mydb',
  'table-name' = 'orders1'
);
DROP TABLE ordersmysql;
SELECT * FROM ordersmysql;
Prepare the Kafka table
CREATE TABLE ordersKafka (
  order_id INT,
  customer_name STRING
) WITH (
  'connector' = 'kafka',
  'topic' = 'test',
  'properties.bootstrap.servers' = 'lbpc:9092',
  'properties.group.id' = 'testGroup1',
  'scan.startup.mode' = 'latest-offset',  -- latest-offset or earliest-offset
  'format' = 'debezium-json'  -- debezium-json cannot be used for writing here (see the error below)
);
DROP TABLE default_catalog.default_database.ordersKafka;
SELECT * FROM ordersKafka;
Write the MySQL CDC table into Kafka
INSERT INTO ordersKafka
SELECT order_id, customer_name
FROM default_catalog.default_database.ordersmysql;
The write fails: the mysql cdc version matching Flink 1.11.1 does not yet support writing directly into Kafka.
A post by a Flink PMC member about Flink 1.11 confirms that the version at the time did not yet support upserts into Kafka.
[ERROR] Could not execute SQL statement. Reason:
org.apache.flink.table.api.TableException: Table sink 'default_catalog.default_database.ordersKafka' doesn't support consuming update and delete changes which is produced by node TableSourceScan(table=[[default_catalog, default_database, ordersmysql]], fields=[order_id, customer_name])
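For reference, Flink 1.12 later added the upsert-kafka connector, which does accept the update/delete changelog a CDC source produces. A sketch of what the sink table would look like on 1.12+ (not usable on 1.11.1):

```sql
CREATE TABLE ordersKafkaUpsert (
  order_id INT,
  customer_name STRING,
  PRIMARY KEY (order_id) NOT ENFORCED  -- upsert-kafka requires a primary key
) WITH (
  'connector' = 'upsert-kafka',
  'topic' = 'test',
  'properties.bootstrap.servers' = 'lbpc:9092',
  'key.format' = 'json',
  'value.format' = 'json'
);
```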
Create the Iceberg table
CREATE CATALOG hive_catalog WITH (
  'type'='iceberg',
  'catalog-type'='hive',
  'uri'='thrift://localhost:9083',
  'clients'='5',
  'property-version'='1',
  'warehouse'='hdfs://lbpc:8020/user/hive/warehouse'
);
CREATE DATABASE hive_catalog.iceberg_db;
CREATE TABLE hive_catalog.iceberg_db.ordersIceberg (
  order_id INT,
  customer_name STRING
);
Insert data from the Kafka table
INSERT INTO hive_catalog.iceberg_db.ordersIceberg
SELECT order_id, customer_name
FROM default_catalog.default_database.ordersKafka;
SELECT * FROM hive_catalog.iceberg_db.ordersIceberg;
Error summary:
1. If the Kafka-to-Iceberg write reports no error but no data appears, enable DEBUG logging for details; in my case it turned out to be a Hive lock problem:
2. Checkpointing must be enabled, otherwise the written data never becomes visible.
3. While an INSERT INTO job is still running you cannot see any data; the Iceberg table becomes visible only after the INSERT job finishes.
This is because the Flink SQL client has a single slot by default and runs only one job at a time. Change flink-conf.yaml:
taskmanager.numberOfTaskSlots: 2
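Both this and the checkpointing requirement from point 2 can live in flink-conf.yaml so that every session picks them up. A sketch (the interval value is an example; both keys exist in Flink 1.11):

```yaml
# flink-conf.yaml
taskmanager.numberOfTaskSlots: 2
execution.checkpointing.interval: 10s
```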
4. If a column in the MySQL table has type timestamp, the mapping fails even when the column is also declared as timestamp in Flink SQL.
According to the Flink documentation, timestamp should map to timestamp, so this looks like a bug.
Flink type mapping (at the bottom of the page): Apache Flink 1.11 Documentation: JDBC SQL Connector
Error message:
java.lang.ClassCastException: java.time.LocalDateTime cannot be cast to java.sql.Timestamp
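The exception itself is plain JVM behavior: java.time.LocalDateTime and java.sql.Timestamp share no inheritance relationship, so a cast between them always fails, and the supported conversion is Timestamp.valueOf. A minimal illustration (not Flink code):

```java
import java.sql.Timestamp;
import java.time.LocalDateTime;

public class TimestampCast {
    public static void main(String[] args) {
        LocalDateTime ldt = LocalDateTime.of(2021, 6, 1, 12, 0, 0);
        // A cast like (Timestamp) (Object) ldt would throw the ClassCastException
        // above at runtime; the explicit conversion method must be used instead:
        Timestamp ts = Timestamp.valueOf(ldt);
        System.out.println(ts);
    }
}
```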
5. Hive currently has a bug in its JDK 11 support:
Query the Iceberg table from Hive
//TODO
set iceberg.engine.hive.enabled=true
CREATE EXTERNAL TABLE iceberg_hive(order_id int,customer_name STRING)
STORED BY 'org.apache.iceberg.mr.hive.HiveIcebergStorageHandler'
LOCATION '/user/hive/warehouse/iceberg_db/ordersIceberg/';
Flink parameter settings
SET sql-client.execution.result-mode=table;
SET sql-client.execution.result-mode=changelog;
SET sql-client.execution.result-mode=tableau;
SET sql-client.execution.checkpointing.interval = 3s;
SET execution.type = streaming;
SET execution.type = batch;
If you want to use hints, remember to enable the setting below
/*+ OPTIONS('streaming'='true',
'monitor-interval'='3s'
)*/;
set table.dynamic-table-options.enabled=true;
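Put together, a streaming read of the earlier sample table with the hint looks like this:

```sql
SET table.dynamic-table-options.enabled=true;
SELECT * FROM hadoop_catalog.iceberg_db.sample
/*+ OPTIONS('streaming'='true', 'monitor-interval'='3s') */;
```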