Test background:
1. Iceberg v1 tables: Flink streaming write and streaming read both work, but automatic deduplication by primary key is not supported.
2. Iceberg v2 tables: Flink streaming write works, but streaming read does not; enabling upsert does give automatic deduplication (see the DDL sketch below).
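For reference, a v2 upsert table would be declared roughly like this (a minimal sketch; the table name is a placeholder and the catalog properties would be the same as in the v1 DDL further down):
-- Flink SQL sketch of a v2 upsert table (table name is a placeholder)
CREATE TABLE sample_v2_upsert (
  id   BIGINT NOT NULL,
  data STRING,
  PRIMARY KEY (id) NOT ENFORCED
) WITH (
  'connector' = 'iceberg',
  -- ... catalog properties as in the v1 DDL below ...
  'format-version' = '2',            -- Iceberg format v2
  'write.upsert.enabled' = 'true'    -- deduplicate by primary key on write
);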
Test goal: find a workaround so that an Iceberg v1 table supports deduplicated streaming write and streaming read; micro-batch writes are also acceptable, and minute-level near-real-time latency is good enough.
Test steps:
1. Create an Iceberg v1 table with Flink:
-- Flink SQL; 'format-version' defaults to 1, so this creates an Iceberg v1 table
CREATE TABLE `sample_stream_test01` (
  `id` BIGINT NOT NULL,
  `data` VARCHAR(2147483647)
) PARTITIONED BY (`id`) -- note: identity-partitioning on id creates one partition per key value
WITH (
  'connector' = 'iceberg',            -- required by the Flink Iceberg connector
  'catalog-name' = 'hive_catalog',    -- required; the name here is a placeholder
  'catalog-type' = 'hive',
  'catalog-database' = 'test001',
  'catalog-table' = 'sample_stream_test01',
  'uri' = 'thrift://xxxxxxx:9083',
  'warehouse' = 'hdfs://nameservice2/user/hive/warehouse/',
  'write.distribution-mode' = 'hash',
  'write.metadata.delete-after-commit.enabled' = 'true',
  'write.metadata.previous-versions-max' = '2'
);
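For completeness, the Flink streaming write and streaming read look roughly like this (kafka_source is a placeholder for the actual upstream table; the OPTIONS hint is Iceberg's documented way to enable streaming read in Flink SQL):
-- streaming write (kafka_source is a placeholder)
INSERT INTO sample_stream_test01 SELECT id, data FROM kafka_source;
-- streaming read via a dynamic table options hint
SELECT * FROM sample_stream_test01 /*+ OPTIONS('streaming'='true', 'monitor-interval'='10s') */;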
2. Streaming read of the table with Spark Structured Streaming throws a NullPointerException:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.{OutputMode, Trigger}

// Placeholders matching the warehouse URI in the DDL above
val DEFAULTFS = "hdfs://nameservice2"
val HIVE_WAREHOUSE = "/user/hive/warehouse/"

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("OrderStreamRead")
  .config("spark.streaming.stopGracefullyOnShutdown", "true")
  .config("spark.sql.sources.partitionOverwriteMode", "dynamic")
  // .config("spark.sql.session.timeZone", "GMT+8")
  .config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
  .config("spark.sql.catalog.spark_catalog", "org.apache.iceberg.spark.SparkSessionCatalog")
  .config("spark.sql.catalog.spark_catalog.type", "hive")
  .config("spark.sql.catalog.spark_catalog.warehouse", DEFAULTFS + HIVE_WAREHOUSE)
  .config("spark.sql.catalog.spark_catalog.iceberg.handle-timestamp-without-timezone", "true")
  .enableHiveSupport()
  .getOrCreate()
spark.sparkContext.setLogLevel("WARN")
import spark.implicits._

// Streaming (micro-batch) read of the Iceberg table
val df = spark.readStream
  .format("iceberg")
  // .option("stream-from-timestamp", "1650988800000") // 2022-04-26
  // .option("snapshot-id", "8256076131935366289") // 2022-04-26
  // .option("start-snapshot-id", "8256076131935366289") // 2022-04-26
  .option("streaming-skip-overwrite-snapshots", "true")
  .option("streaming-skip-delete-snapshots", "true")
  .load("test001.sample_stream_test02")

// Sink to stdout / console for inspection
val query = df.writeStream
  .format("console")
  .trigger(Trigger.ProcessingTime("2 seconds"))
  .outputMode(OutputMode.Append())
  .option("checkpointLocation", "/tmp/iceberg_stream_ckpt") // "~" is not expanded by Spark, so use an absolute path
  // .option("fanout-enabled", "true") // Iceberg write option; has no effect on the console sink
  .start()
query.awaitTermination()
Error message:
Identifier: [id = 031b8474-dc90-439a-9ef8-5afec3c6f4bc, runId = 3a85dfa9-e82f-4a3e-ba21-246ef47a1cdc]
Current Committed Offsets: {}
Current Available Offsets: {org.apache.iceberg.spark.source.SparkMicroBatchStream@42d07354: {"version":1,"snapshot_id":640627351335439071,"position":2,"scan_all_files":false}}
Current State: ACTIVE
Thread State: RUNNABLE
Logical Plan:
WriteToMicroBatchDataSource ConsoleWriter[numRows=20, truncate=true]
+- StreamingDataSourceV2Relation [id#0L, data#1], IcebergScan(table=spark_catalog.test001.sample_stream_test02, type=struct<1: id: required long, 2: data: optional string>, filters=[], runtimeFilters=[], caseSensitive=false), org.apache.iceberg.spark.source.SparkMicroBatchStream@42d07354
at org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:325)
at org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:209)
Caused by: java.lang.NullPointerException
3. Batch read with Spark works fine:
spark.table("spark_catalog.test001.sample_stream_test02").show()
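Since batch reads work, a minute-level alternative to true streaming is Iceberg's incremental batch read between two snapshots (these are documented Spark read options; the snapshot IDs below are placeholders you would take from the table's history/snapshots metadata tables):
// Incremental batch read: rows appended after start-snapshot-id (exclusive)
// up to end-snapshot-id (inclusive); the IDs here are placeholders.
val incDf = spark.read
  .format("iceberg")
  .option("start-snapshot-id", "640627351335439071")
  .option("end-snapshot-id", "8256076131935366289")
  .load("spark_catalog.test001.sample_stream_test02")
incDf.show()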
Question: how can Spark be made to stream-read a deduplicated Iceberg v1 table? One direction worth experimenting with is sketched below. Anyone interested in digging into this together is welcome to reach out on WeChat: celltobigs
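An untested sketch of that direction: once the stream source works (or with the incremental batch read above driven by a scheduler), deduplicate each micro-batch by id and MERGE it into a separate target table, so the target stays deduplicated across batches. dedup_target is a hypothetical table here, and MERGE INTO relies on the Iceberg SQL extensions configured earlier:
// Untested sketch: per-batch dedup + MERGE INTO a hypothetical dedup_target table.
// dropDuplicates keeps an arbitrary row per id within a batch; if "latest wins"
// matters, order the batch first.
val mergeBatch: (org.apache.spark.sql.DataFrame, Long) => Unit = (batch, batchId) => {
  batch.dropDuplicates("id").createOrReplaceTempView("updates")
  batch.sparkSession.sql(
    """MERGE INTO spark_catalog.test001.dedup_target t
      |USING updates s
      |ON t.id = s.id
      |WHEN MATCHED THEN UPDATE SET *
      |WHEN NOT MATCHED THEN INSERT *""".stripMargin)
}
val dedupQuery = df.writeStream
  .trigger(Trigger.ProcessingTime("60 seconds")) // minute-level, matching the test goal
  .foreachBatch(mergeBatch)
  .option("checkpointLocation", "/tmp/dedup_ckpt")
  .start()
dedupQuery.awaitTermination()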