Test background:
1. Iceberg v1 tables: Flink streaming write and streaming read both work, but automatic deduplication by primary key is not supported.
2. Iceberg v2 tables: Flink streaming write works, but streaming read does not; enabling upsert does give automatic deduplication (see the DDL sketch below).
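For reference, a v2 upsert table would be declared roughly like this (a minimal sketch; the table name is a placeholder and the catalog properties would be the same as in the v1 DDL further down):
-- Flink SQL sketch of a v2 upsert table (table name is a placeholder)
CREATE TABLE sample_v2_upsert (
  id   BIGINT NOT NULL,
  data STRING,
  PRIMARY KEY (id) NOT ENFORCED
) WITH (
  'connector' = 'iceberg',
  -- ... catalog properties as in the v1 DDL below ...
  'format-version' = '2',            -- Iceberg format v2
  'write.upsert.enabled' = 'true'    -- deduplicate by primary key on write
);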
Test goal: find a workaround so that an Iceberg v1 table supports deduplicated streaming write and streaming read; micro-batch writes are also acceptable, and minute-level near-real-time latency is good enough.
Test steps:
1. Create an Iceberg v1 table with Flink:
-- Flink SQL; 'format-version' defaults to 1, so this creates an Iceberg v1 table
CREATE TABLE `sample_stream_test01` (
  `id` BIGINT NOT NULL,
  `data` VARCHAR(2147483647)
) PARTITIONED BY (`id`) -- note: identity-partitioning on id creates one partition per key value
WITH (
  'connector' = 'iceberg',            -- required by the Flink Iceberg connector
  'catalog-name' = 'hive_catalog',    -- required; the name here is a placeholder
  'catalog-type' = 'hive',
  'catalog-database' = 'test001',
  'catalog-table' = 'sample_stream_test01',
  'uri' = 'thrift://xxxxxxx:9083',
  'warehouse' = 'hdfs://nameservice2/user/hive/warehouse/',
  'write.distribution-mode' = 'hash',
  'write.metadata.delete-after-commit.enabled' = 'true',
  'write.metadata.previous-versions-max' = '2'
);
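For completeness, the Flink streaming write and streaming read look roughly like this (kafka_source is a placeholder for the actual upstream table; the OPTIONS hint is Iceberg's documented way to enable streaming read in Flink SQL):
-- streaming write (kafka_source is a placeholder)
INSERT INTO sample_stream_test01 SELECT id, data FROM kafka_source;
-- streaming read via a dynamic table options hint
SELECT * FROM sample_stream_test01 /*+ OPTIONS('streaming'='true', 'monitor-interval'='10s') */;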
2. Streaming read of the table with Spark Structured Streaming throws a NullPointerException:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.{OutputMode, Trigger}

// Placeholders matching the warehouse URI in the DDL above
val DEFAULTFS = "hdfs://nameservice2"
val HIVE_WAREHOUSE = "/user/hive/warehouse/"

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("OrderStreamRead")
  .config("spark.streaming.stopGracefullyOnShutdown", "true")
  .config("spark.sql.sources.partitionOverwriteMode", "dynamic")
  // .config("spark.sql.session.timeZone", "GMT+8")
  .config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
  .config("spark.sql.catalog.spark_catalog", "org.apache.iceberg.spark.SparkSessionCatalog")
  .config("spark.sql.catalog.spark_catalog.type", "hive")
  .config("spark.sql.catalog.spark_catalog.warehouse", DEFAULTFS + HIVE_WAREHOUSE)
  .config("spark.sql.catalog.spark_catalog.iceberg.handle-timestamp-without-timezone", "true")
  .enableHiveSupport()
  .getOrCreate()
spark.sparkContext.setLogLevel("WARN")
import spark.implicits._

// Streaming (micro-batch) read of the Iceberg table
val df = spark.readStream
  .format("iceberg")
  // .option("stream-from-timestamp", "1650988800000") // 2022-04-26
  // .option("snapshot-id", "8256076131935366289") // 2022-04-26
  // .option("start-snapshot-id", "8256076131935366289") // 2022-04-26
  .option("streaming-skip-overwrite-snapshots", "true")
  .option("streaming-skip-delete-snapshots", "true")
  .load("test001.sample_stream_test02")

// Sink to stdout / console for inspection
val query = df.writeStream
  .format("console")
  .trigger(Trigger.ProcessingTime("2 seconds"))
  .outputMode(OutputMode.Append())
  .option("checkpointLocation", "/tmp/iceberg_stream_ckpt") // "~" is not expanded by Spark, so use an absolute path
  // .option("fanout-enabled", "true") // Iceberg write option; has no effect on the console sink
  .start()
query.awaitTermination()
Error message:
Identifier: [id = 031b8474-dc90-439a-9ef8-5afec3c6f4bc, runId = 3a85dfa9-e82f-4a3e-ba21-246ef47a1cdc]
Current Committed Offsets: {}
Current Available Offsets: {org.apache.iceberg.spark.source.SparkMicroBatchStream@42d07354: {"version":1,"snapshot_id":640627351335439071,"position":2,"scan_all_files":false}}
Current State: ACTIVE
Thread State: RUNNABLE
Logical Plan:
WriteToMicroBatchDataSource ConsoleWriter[numRows=20, truncate=true]
+- StreamingDataSourceV2Relation [id#0L, data#1], IcebergScan(table=spark_catalog.test001.sample_stream_test02, type=struct<1: id: required long, 2: data: optional string>, filters=[], runtimeFilters=[], caseSensitive=false), org.apache.iceberg.spark.source.SparkMicroBatchStream@42d07354
at org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:325)
at org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:209)
Caused by: java.lang.NullPointerException
3. Batch read with Spark works fine:
spark.table("spark_catalog.test001.sample_stream_test02").show()
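Since batch reads work, a minute-level alternative to true streaming is Iceberg's incremental batch read between two snapshots (these are documented Spark read options; the snapshot IDs below are placeholders you would take from the table's history/snapshots metadata tables):
// Incremental batch read: rows appended after start-snapshot-id (exclusive)
// up to end-snapshot-id (inclusive); the IDs here are placeholders.
val incDf = spark.read
  .format("iceberg")
  .option("start-snapshot-id", "640627351335439071")
  .option("end-snapshot-id", "8256076131935366289")
  .load("spark_catalog.test001.sample_stream_test02")
incDf.show()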
Question: how can Spark be made to stream-read a deduplicated Iceberg v1 table? One direction worth experimenting with is sketched below. Anyone interested in digging into this together is welcome to reach out on WeChat: celltobigs
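An untested sketch of that direction: once the stream source works (or with the incremental batch read above driven by a scheduler), deduplicate each micro-batch by id and MERGE it into a separate target table, so the target stays deduplicated across batches. dedup_target is a hypothetical table here, and MERGE INTO relies on the Iceberg SQL extensions configured earlier:
// Untested sketch: per-batch dedup + MERGE INTO a hypothetical dedup_target table.
// dropDuplicates keeps an arbitrary row per id within a batch; if "latest wins"
// matters, order the batch first.
val mergeBatch: (org.apache.spark.sql.DataFrame, Long) => Unit = (batch, batchId) => {
  batch.dropDuplicates("id").createOrReplaceTempView("updates")
  batch.sparkSession.sql(
    """MERGE INTO spark_catalog.test001.dedup_target t
      |USING updates s
      |ON t.id = s.id
      |WHEN MATCHED THEN UPDATE SET *
      |WHEN NOT MATCHED THEN INSERT *""".stripMargin)
}
val dedupQuery = df.writeStream
  .trigger(Trigger.ProcessingTime("60 seconds")) // minute-level, matching the test goal
  .foreachBatch(mergeBatch)
  .option("checkpointLocation", "/tmp/dedup_ckpt")
  .start()
dedupQuery.awaitTermination()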