Hive 2.1.1 + Spark 3.0 + Iceberg: a diary of pitfalls

1. To use Iceberg with Spark SQL 3.0, add the iceberg-spark3-runtime jar:
    iceberg-spark3-runtime-0.11.1.jar
    Download: http://iceberg.apache.org/releases/
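    Once the jar is in place (for example under Spark's jars directory), a quick sanity check is to load one of its classes from spark-shell; the class name below is the same SparkCatalog class used in the catalog config in the next step:
    scala> Class.forName("org.apache.iceberg.spark.SparkCatalog")   // throws ClassNotFoundException if the runtime jar is not on the classpath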
2. Add the default Iceberg catalogs to conf/spark-defaults.conf:
    # set the Hive metastore version to 2.1.1
    spark.sql.hive.metastore.version=2.1.1

    # point to the Hive 2.1.1 jars
    spark.sql.hive.metastore.jars=/opt/cloudera/parcels/CDH/lib/hive/lib/*
    
    # use the Hive metastore as an Iceberg catalog
    spark.sql.catalog.hive_prod=org.apache.iceberg.spark.SparkCatalog
    spark.sql.catalog.hive_prod.type=hive
    spark.sql.catalog.hive_prod.uri=thrift://X.X.X.X:9083

    # use HDFS (a Hadoop catalog) as the Iceberg catalog
    spark.sql.catalog.zdm=org.apache.iceberg.spark.SparkCatalog
    spark.sql.catalog.zdm.type=hadoop
    spark.sql.catalog.zdm.warehouse=hdfs:///tmp/iceberg_test/
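
    The same catalog settings can also be supplied programmatically when building the SparkSession instead of editing spark-defaults.conf. A minimal sketch, reusing the catalog names above (adjust the metastore URI and warehouse path to your environment):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("iceberg-catalog-demo")
      // Hive-metastore-backed Iceberg catalog (same as spark.sql.catalog.hive_prod above)
      .config("spark.sql.catalog.hive_prod", "org.apache.iceberg.spark.SparkCatalog")
      .config("spark.sql.catalog.hive_prod.type", "hive")
      .config("spark.sql.catalog.hive_prod.uri", "thrift://X.X.X.X:9083")
      // HDFS-backed (Hadoop) Iceberg catalog (same as spark.sql.catalog.zdm above)
      .config("spark.sql.catalog.zdm", "org.apache.iceberg.spark.SparkCatalog")
      .config("spark.sql.catalog.zdm.type", "hadoop")
      .config("spark.sql.catalog.zdm.warehouse", "hdfs:///tmp/iceberg_test/")
      .getOrCreate()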
    
3. Start the Spark SQL client with the spark-sql script.
 Using the HDFS (Hadoop) catalog (a spark-shell equivalent of these steps is sketched after the directory listing below):
   1) Switch to the zdm Hadoop catalog
   spark-sql> use zdm;
   2) Create a database
   spark-sql> create database test;
   3) Create an Iceberg table
   spark-sql> create table iceberg_spark(id int ,name string) using iceberg;
   4) Insert data
   spark-sql> insert into table iceberg_spark values(1,'a'),(2,'b'),(3,'c'),(4,'d');
   5) Query the data
   spark-sql> select * from iceberg_spark;
    1    a
    2    b
    3    c
    4    d
   6) Inspect the Iceberg warehouse directory on HDFS
    ./tmp/
    └── iceberg_test
        └── test
            └── iceberg_spark
                ├── data
                │   ├── 00000-0-b8aab2a2-3d45-4ec9-8553-b7c76540f442-00001.parquet
                │   ├── 00001-1-894b55e6-2918-4964-b663-a29d40f86aa2-00001.parquet
                │   ├── 00002-2-3ab42919-e27a-44c9-afdf-e92ee613f5f4-00001.parquet
                │   └── 00003-3-c8f917ab-db98-49bc-9099-804499e53ccb-00001.parquet
                └── metadata
                    ├── 63f25f73-2bb8-475b-90c4-1397c8c3f248-m0.avro
                    ├── snap-7436248840915571836-1-63f25f73-2bb8-475b-90c4-1397c8c3f248.avro
                    ├── v1.metadata.json
                    ├── v2.metadata.json
                    └── version-hint.text
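
   The same Hadoop-catalog walkthrough can be driven from spark-shell with spark.sql. A sketch equivalent to steps 1-5 above, assuming the zdm catalog from spark-defaults.conf is configured:

   spark.sql("use zdm")                                   // switch to the Hadoop catalog
   spark.sql("create database if not exists test")
   spark.sql("create table if not exists test.iceberg_spark (id int, name string) using iceberg")
   spark.sql("insert into test.iceberg_spark values (1,'a'), (2,'b'), (3,'c'), (4,'d')")
   spark.sql("select * from test.iceberg_spark").show()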
                    
                    
 Using the Hive catalog
   1) Switch to the hive_prod (Hive metastore) catalog
   spark-sql> use hive_prod.test;  -- the test database already exists in Hive
   2) Create an Iceberg table
   spark-sql> create table ygy_iceberg_spark(id int ,name string) using iceberg;
   This fails with:
     Caused by: org.apache.thrift.TApplicationException: Invalid method name: 'get_table_req'
        at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:79)
        at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.recv_get_table_req(ThriftHiveMetastore.java:1567)
        at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.get_table_req(ThriftHiveMetastore.java:1554)
        at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.getTable(HiveMetaStoreClient.java:1350)
        at org.apache.iceberg.hive.HiveTableOperations.lambda$doRefresh$0(HiveTableOperations.java:130)
        at org.apache.iceberg.hive.ClientPool.run(ClientPool.java:55)
        at org.apache.iceberg.hive.HiveTableOperations.doRefresh(HiveTableOperations.java:130)
        ... 59 more
    After replacing the hive-metastore jar with version 2.1.1, the error changes to:
      Caused by: java.lang.NoSuchFieldError: METASTORE_CLIENT_FILTER_ENABLED
        at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.getIfClientFilterEnabled(HiveMetaStoreClient.java:307)
        at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.<init>(HiveMetaStoreClient.java:237)
        at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.<init>(HiveMetaStoreClient.java:219)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
        at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
        at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
        at org.apache.iceberg.common.DynConstructors$Ctor.newInstanceChecked(DynConstructors.java:60)
        at org.apache.iceberg.common.DynConstructors$Ctor.newInstance(DynConstructors.java:73)
        at org.apache.iceberg.hive.HiveClientPool.newClient(HiveClientPool.java:54)
    After also replacing the hive-exec jar with version 2.1.1, startup fails with:
      Exception in thread "main" java.lang.NoSuchFieldError: JAVA_9
      at org.apache.spark.storage.StorageUtils$.<init>(StorageUtils.scala:207)
      at org.apache.spark.storage.StorageUtils$.<clinit>(StorageUtils.scala)
      at org.apache.spark.storage.BlockManagerMasterEndpoint.<init>(BlockManagerMasterEndpoint.scala:93)
      at org.apache.spark.SparkEnv$.$anonfun$create$9(SparkEnv.scala:370)
      at org.apache.spark.SparkEnv$.registerOrLookupEndpoint$1(SparkEnv.scala:311)
      at org.apache.spark.SparkEnv$.create(SparkEnv.scala:359)
      at org.apache.spark.SparkEnv$.createDriverEnv(SparkEnv.scala:189)
      at org.apache.spark.SparkContext.createSparkEnv(SparkContext.scala:272)
      at org.apache.spark.SparkContext.<init>(SparkContext.scala:448)
      at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2589)
      at org.apache.spark.sql.SparkSession$Builder.$anonfun$getOrCreate$2(SparkSession.scala:934)
      at scala.Option.getOrElse(Option.scala:189)
     Root cause:
     Spark 3.0 depends on commons-lang3, and a matching commons-lang3 jar is already among Spark's bundled dependencies. While adapting Spark 3.0 to Hive 2.1.1, the hive-exec jar was pulled in, but the hive-exec-2.1.1 jar shades its own copy of the commons-lang3 classes, and that copy does not match the commons-lang3 version Spark 3.0 expects, which is what triggers the NoSuchFieldError above.
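     One way to confirm which jar a conflicting class is actually being loaded from is to ask the JVM for its code source (a diagnostic sketch, run from a Scala/spark-shell REPL started with the same classpath; JavaVersion is the commons-lang3 class whose JAVA_9 field is missing in the stack trace above):

     classOf[org.apache.commons.lang3.JavaVersion].getProtectionDomain.getCodeSource.getLocation
     // prints the jar that actually provided commons-lang3, e.g. a shaded hive-exec jar instead of the expected commons-lang3 jar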
     Fix:
     Download the Hive 2.1.1 source from https://github.com/apache/hive/tree/rel/release-2.1.1,
     open the hive-exec module's pom.xml, and remove these two entries:

     <include>org.apache.commons:commons-lang3</include>

     <dependency>
       <groupId>org.apache.commons</groupId>
       <artifactId>commons-lang3</artifactId>
       <version>${commons-lang3.version}</version>
     </dependency>

     then rebuild and repackage the module. Use the resulting hive-exec-2.1.1.jar to replace Spark's bundled hive-exec-2.3.7-core.jar.
     
    After the jar replacement, the create succeeds:
    spark-sql> create table ygy_iceberg_spark(id int, name string) using iceberg;
   3) Insert data
    spark-sql> insert into table ygy_iceberg_spark values(1,'a'),(2,'b'),(3,'c'),(4,'d');
   4) Query the data
    spark-sql> select * from ygy_iceberg_spark;
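
   The same table can also be read through the DataFrame API in spark-shell, which is a convenient check that the Hive-metastore catalog is wired up (a minimal sketch using the table created above):

   scala> spark.table("hive_prod.test.ygy_iceberg_spark").show()
   // or, equivalently, through the Iceberg source:
   scala> spark.read.format("iceberg").load("hive_prod.test.ygy_iceberg_spark").show()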

Reading data from Kafka and writing it into an Iceberg table
1. Start spark-shell
   Create a Kafka readStream
scala> var streamingInputDF = spark.readStream.format("kafka").option("kafka.bootstrap.servers", "X.X.X.X:9092").option("subscribe", "test1").option("startingOffsets", "latest").option("minPartitions", "10").option("failOnDataLoss", "true").load();
This fails with:
org.apache.spark.sql.AnalysisException: Failed to find data source: kafka. Please deploy the application as per the deployment section of "Structured Streaming + Kafka Integration Guide".;
  at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:681)
  at org.apache.spark.sql.streaming.DataStreamReader.load(DataStreamReader.scala:195)
  ... 47 elided
Fix: put spark-sql-kafka-0-10_2.12-3.0.2.jar into Spark's jars directory, then retry:

scala> var streamingInputDF = spark.readStream.format("kafka").option("kafka.bootstrap.servers", "X.X.X.X:9092").option("subscribe", "test1").option("startingOffsets", "latest").option("minPartitions", "10").option("failOnDataLoss", "true").load();
java.lang.NoClassDefFoundError: org/apache/kafka/common/serialization/ByteArraySerializer
  at org.apache.spark.sql.kafka010.KafkaSourceProvider$.<init>(KafkaSourceProvider.scala:557)
  at org.apache.spark.sql.kafka010.KafkaSourceProvider$.<clinit>(KafkaSourceProvider.scala)
  at org.apache.spark.sql.kafka010.KafkaSourceProvider.org$apache$spark$sql$kafka010$KafkaSourceProvider$$validateStreamOptions(KafkaSourceProvider.scala:326)
  at org.apache.spark.sql.kafka010.KafkaSourceProvider.sourceSchema(KafkaSourceProvider.scala:71)
  at org.apache.spark.sql.execution.datasources.DataSource.sourceSchema(DataSource.scala:235)
  at org.apache.spark.sql.execution.datasources.DataSource.sourceInfo$lzycompute(DataSource.scala:116)
  at org.apache.spark.sql.execution.datasources.DataSource.sourceInfo(DataSource.scala:116)
  at org.apache.spark.sql.execution.streaming.StreamingRelation$.apply(StreamingRelation.scala:35)
  at org.apache.spark.sql.streaming.DataStreamReader.load(DataStreamReader.scala:206)
  ... 47 elided
Caused by: java.lang.ClassNotFoundException: org.apache.kafka.common.serialization.ByteArraySerializer
  at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
  at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
  at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:349)
  at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
  ... 56 more
Fix: put the kafka-clients jar into the jars directory as well. The next attempt then fails with:

Caused by: java.lang.ClassNotFoundException: org.apache.spark.kafka010.KafkaConfigUpdater
  at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
  at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
  at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:349)
  at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
  ... 52 more
Fix: add spark-token-provider-kafka-0-10_2.12-3.0.1.jar. Once the stream is actually read, one more class is missing:

java.lang.NoClassDefFoundError: org/apache/commons/pool2/impl/GenericKeyedObjectPoolConfig
    at org.apache.spark.sql.kafka010.consumer.KafkaDataConsumer$.<init>(KafkaDataConsumer.scala:606)
    at org.apache.spark.sql.kafka010.consumer.KafkaDataConsumer$.<clinit>(KafkaDataConsumer.scala)
    at org.apache.spark.sql.kafka010.KafkaBatchPartitionReader.<init>(KafkaBatchPartitionReader.scala:52)
    at org.apache.spark.sql.kafka010.KafkaBatchReaderFactory$.createReader(KafkaBatchPartitionReader.scala:40)
    at org.apache.spark.sql.execution.datasources.v2.DataSourceRDD.compute(DataSourceRDD.scala:60)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349)
Fix: add commons-pool2-2.6.2.jar. With all four jars in place, the readStream can be created:


scala> val df = spark.readStream.format("kafka").option("kafka.bootstrap.servers","X.X.X.X:9092").option("subscribe", "test1").load();
df: org.apache.spark.sql.DataFrame = [key: binary, value: binary ... 5 more fields]
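
The Kafka source always returns the same fixed schema; printing it shows the columns that the parsing step below builds on:

scala> df.printSchema()
// expected output (roughly; nullability omitted):
// root
//  |-- key: binary
//  |-- value: binary
//  |-- topic: string
//  |-- partition: integer
//  |-- offset: long
//  |-- timestamp: timestamp
//  |-- timestampType: integer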

2. Parse the Kafka message value and keep only the needed columns (a quick local check of this parsing is sketched after the command):
scala> val ds = df.withColumn("id", split(col("value").cast("string"),",")(0).cast("Int")).withColumn("name", split(col("value").cast("string"),",")(1)).select("id","name");
ds: org.apache.spark.sql.DataFrame = [id: int, name: string]
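
The same split-and-cast logic can be checked on a small batch DataFrame before wiring it to the stream. A sketch in spark-shell; note that a message that does not match the id,name format comes out as null/null, which is one likely explanation for the null row in the query results further down:

import org.apache.spark.sql.functions._
import spark.implicits._

// "oops" has no comma: the cast to int fails and index 1 is out of range, so both columns are null
val sample = Seq("1,a", "2,b", "oops").toDF("value")
sample
  .withColumn("id", split(col("value"), ",")(0).cast("int"))
  .withColumn("name", split(col("value"), ",")(1))
  .select("id", "name")
  .show()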

3. Create a writeStream that writes into the Iceberg table
scala> import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeUnit

scala> import org.apache.spark.sql.streaming.{OutputMode, Trigger };
import org.apache.spark.sql.streaming.{OutputMode, Trigger}

scala> ds.writeStream.format("iceberg").outputMode("append").trigger(Trigger.ProcessingTime(1, TimeUnit.SECONDS)).option("path", "hive_prod.test.ygy_iceberg_spark").option("checkpointLocation", "hdfs:///tmp/checkpoint_table").start().awaitTermination();
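
Note that .awaitTermination() blocks the shell, so the verification query in step 5 cannot be run from the same session. Keeping the StreamingQuery handle instead lets you query and stop the stream interactively (a minimal variant of the call above):

scala> val query = ds.writeStream.format("iceberg").outputMode("append").trigger(Trigger.ProcessingTime(1, TimeUnit.SECONDS)).option("path", "hive_prod.test.ygy_iceberg_spark").option("checkpointLocation", "hdfs:///tmp/checkpoint_table").start()
// ...run the verification query in step 5, then stop the stream:
scala> query.stop()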

4. Start a console producer for topic test1
   kafka-console-producer --broker-list X.X.X.X:9092 --topic test1
   and write a few records:
   1,3
   1,4
   1,5

5. Check that the Iceberg table has received the Kafka data
scala> spark.sql("select * from hive_prod.test.ygy_iceberg_spark").show;
+----+----+
|  id|name|
+----+----+
|   2|   g|
|   3|   h|
|   1|   f|
|null|null|
|   1|   6|
|   1|   7|
|   1|   5|
|   1|   a|
|   2|   b|
|   3|   c|
|   4|   d|
+----+----+
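
Each streaming micro-batch commit creates a new Iceberg snapshot, so the table's snapshot history is another way to confirm the stream is writing. A sketch, assuming the Iceberg version in use exposes SQL metadata tables as <table>.snapshots:

scala> spark.sql("select committed_at, snapshot_id, operation from hive_prod.test.ygy_iceberg_spark.snapshots").show(false)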
