Spark streaming read of an Iceberg v1 table

Test background:

1. Iceberg v1 tables can be stream-written and stream-read normally with Flink, but they cannot automatically deduplicate by primary key.

2. Iceberg v2 tables can be stream-written normally with Flink but cannot be stream-read; enabling upsert gives automatic deduplication (a sketch of the relevant settings follows).
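For reference, a minimal sketch of how such a v2 upsert table could be declared, here via Flink's TableEnvironment in Scala. The table name sample_stream_upsert and the 'catalog-name' value are illustrative, and the 'connector' / 'catalog-name' entries follow the Iceberg Flink connector style and may differ from the setup used here; the rest of the WITH properties mirror the v1 DDL shown in the test steps below. 'format-version' = '2' plus 'write.upsert.enabled' = 'true', together with the PRIMARY KEY, are the documented properties that enable dedup-by-key on write.

import org.apache.flink.table.api.{EnvironmentSettings, TableEnvironment}

object CreateV2UpsertTable {
  def main(args: Array[String]): Unit = {
    val tableEnv = TableEnvironment.create(
      EnvironmentSettings.newInstance().inStreamingMode().build())

    // Hypothetical v2 table: PRIMARY KEY + format-version 2 + upsert enabled
    // is what gives automatic deduplication by key on write.
    tableEnv.executeSql(
      """CREATE TABLE sample_stream_upsert (
        |  id   BIGINT NOT NULL,
        |  data STRING,
        |  PRIMARY KEY (id) NOT ENFORCED
        |) PARTITIONED BY (id)
        |WITH (
        |  'connector' = 'iceberg',
        |  'catalog-type' = 'hive',
        |  'catalog-name' = 'hive_catalog',
        |  'catalog-database' = 'test001',
        |  'catalog-table' = 'sample_stream_upsert',
        |  'uri' = 'thrift://xxxxxxx:9083',
        |  'warehouse' = 'hdfs://nameservice2/user/hive/warehouse/',
        |  'format-version' = '2',
        |  'write.upsert.enabled' = 'true'
        |)""".stripMargin)
  }
}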

Test goal: find a workaround so that an Iceberg v1 table supports deduplicated stream writes and stream reads. Micro-batch writes are also acceptable, as is minute-level near-real-time streaming.

Test steps:

1. Create an Iceberg v1 table with Flink:

-- flink 
-- v1
CREATE TABLE `sample_stream_test01` (
  `id` BIGINT NOT NULL,
  `data` VARCHAR(2147483647)
) PARTITIONED BY (`id`)
WITH (
  'catalog-database' = 'test001',
  'write.metadata.delete-after-commit.enabled' = 'true',
  'warehouse' = 'hdfs://nameservice2/user/hive/warehouse/',
  'uri' = 'thrift://xxxxxxx:9083',
  'write.metadata.previous-versions-max' = '2',
  'catalog-table' = 'sample_stream_test01',
  'catalog-type' = 'hive',
  'write.distribution-mode' = 'hash'
);
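For completeness, a minimal sketch of a continuous Flink write into this table, which is what produces the snapshots the Spark streaming read in step 2 consumes. The datagen source is hypothetical, the step-1 DDL is assumed to be registered in the same session, and the checkpoint interval is what controls how often the Iceberg sink commits a snapshot.

import org.apache.flink.table.api.{EnvironmentSettings, TableEnvironment}

object StreamWriteV1Table {
  def main(args: Array[String]): Unit = {
    val tableEnv = TableEnvironment.create(
      EnvironmentSettings.newInstance().inStreamingMode().build())

    // The Iceberg sink only commits a snapshot on checkpoint, so downstream
    // streaming reads see new data at roughly this interval.
    tableEnv.getConfig.getConfiguration
      .setString("execution.checkpointing.interval", "60 s")

    // Hypothetical datagen source producing continuous test rows.
    tableEnv.executeSql(
      """CREATE TABLE gen_source (
        |  id   BIGINT,
        |  data STRING
        |) WITH (
        |  'connector' = 'datagen',
        |  'rows-per-second' = '10',
        |  'fields.id.min' = '1',
        |  'fields.id.max' = '100'
        |)""".stripMargin)

    // Assumes the sample_stream_test01 DDL from step 1 has been executed in
    // this same session; each checkpoint then commits one Iceberg snapshot.
    tableEnv.executeSql(
      "INSERT INTO sample_stream_test01 SELECT id, data FROM gen_source")
  }
}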

2. Read the table with Spark Structured Streaming; this fails with a NullPointerException:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.{OutputMode, Trigger}

// DEFAULTFS and HIVE_WAREHOUSE are project constants, e.g. the HDFS defaultFS
// and the Hive warehouse path used in the Flink DDL above.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("OrderStreamRead")
  .config("spark.streaming.stopGracefullyOnShutdown", true)
  .config("spark.sql.sources.partitionOverwriteMode", "dynamic")
//  .config("spark.sql.session.timeZone", "GMT+8")
  .config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
  .config("spark.sql.catalog.spark_catalog", "org.apache.iceberg.spark.SparkSessionCatalog")
  .config("spark.sql.catalog.spark_catalog.type", "hive")
  .config("spark.sql.catalog.spark_catalog.warehouse", DEFAULTFS + HIVE_WAREHOUSE)
  .config("spark.sql.catalog.spark_catalog.iceberg.handle-timestamp-without-timezone", "true")
  .enableHiveSupport()
  .getOrCreate()

spark.sparkContext.setLogLevel("WARN")
import spark.implicits._
import spark.sql

// Micro-batch streaming read of the Iceberg table.
val df = spark.readStream
  .format("iceberg")
//  .option("stream-from-timestamp", "1650988800000") // 2022-04-26
//  .option("snapshot-id", "8256076131935366289") // 2022-04-26
//  .option("start-snapshot-id", "8256076131935366289") // 2022-04-26
  .option("streaming-skip-overwrite-snapshots", "true")
  .option("streaming-skip-delete-snapshots", "true")
  .load("test001.sample_stream_test02")

// Print each micro-batch to the console.
val query = df.writeStream
  .format("console")
  .trigger(Trigger.ProcessingTime("2 seconds"))
  .outputMode(OutputMode.Append())
  .option("checkpointLocation", "~/tmp/1.log")
  .option("fanout-enabled", "true")
  .start()
  .awaitTermination()

Error message:

Identifier: [id = 031b8474-dc90-439a-9ef8-5afec3c6f4bc, runId = 3a85dfa9-e82f-4a3e-ba21-246ef47a1cdc]
Current Committed Offsets: {}
Current Available Offsets: {org.apache.iceberg.spark.source.SparkMicroBatchStream@42d07354: {"version":1,"snapshot_id":640627351335439071,"position":2,"scan_all_files":false}}

Current State: ACTIVE
Thread State: RUNNABLE

Logical Plan:
WriteToMicroBatchDataSource ConsoleWriter[numRows=20, truncate=true]
+- StreamingDataSourceV2Relation [id#0L, data#1], IcebergScan(table=spark_catalog.test001.sample_stream_test02, type=struct<1: id: required long, 2: data: optional string>, filters=[], runtimeFilters=[], caseSensitive=false), org.apache.iceberg.spark.source.SparkMicroBatchStream@42d07354

	at org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:325)
	at org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:209)
Caused by: java.lang.NullPointerException

3. Batch reads with Spark work fine:

spark.table("spark_catalog.test001.sample_stream_test02").show()
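Since the test goal accepts minute-level micro-batches, another direction worth noting is Iceberg's incremental batch read between two snapshots; start-snapshot-id and end-snapshot-id are documented Iceberg Spark read options. A minimal sketch, reusing the spark session from step 2; the snapshot IDs below are placeholders taken from the output in this post, real values can be listed from the table's snapshots metadata table.

// List available snapshots to pick a start/end range.
spark.sql("SELECT committed_at, snapshot_id FROM test001.sample_stream_test02.snapshots").show()

// Incremental batch read: returns only the rows appended after
// start-snapshot-id (exclusive) up to end-snapshot-id (inclusive).
// Snapshot IDs here are placeholders.
val incremental = spark.read
  .format("iceberg")
  .option("start-snapshot-id", "640627351335439071")
  .option("end-snapshot-id", "8256076131935366289")
  .load("test001.sample_stream_test02")

incremental.show()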

Question: how can Spark be made to stream-read a deduplicated Iceberg v1 table? Anyone interested is welcome to learn and discuss together; add me on WeChat: celltobigs.
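One possible direction, sketched below under clear assumptions: it presumes the streaming-source NullPointerException above can be avoided (e.g. by starting from an explicit snapshot), and the target table test001.sample_stream_dedup is hypothetical. The idea is to stream-read the append-only v1 table and MERGE every micro-batch into a downstream Iceberg table keyed by id, so the target always holds the deduplicated latest rows; MERGE INTO needs the Iceberg SQL extensions already configured in step 2.

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.streaming.Trigger

// Reuses the `spark` session from step 2 (Iceberg extensions enabled).
val stream = spark.readStream
  .format("iceberg")
  .option("streaming-skip-overwrite-snapshots", "true")
  .option("streaming-skip-delete-snapshots", "true")
  .load("test001.sample_stream_test02")

stream.writeStream
  .trigger(Trigger.ProcessingTime("60 seconds"))
  .option("checkpointLocation", "/tmp/iceberg_dedup_ckpt") // hypothetical path
  .foreachBatch { (batch: DataFrame, batchId: Long) =>
    // Keep a single row per id within the batch (arbitrary pick if the same
    // key arrives more than once in one batch), then merge by key.
    batch.dropDuplicates("id").createOrReplaceTempView("batch_updates")
    batch.sparkSession.sql(
      """MERGE INTO test001.sample_stream_dedup t
        |USING batch_updates s
        |ON t.id = s.id
        |WHEN MATCHED THEN UPDATE SET *
        |WHEN NOT MATCHED THEN INSERT *""".stripMargin)
  }
  .start()
  .awaitTermination()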
