1. SparkSQL 3.0 needs the iceberg-spark3-runtime jar on its classpath to work with Iceberg:
iceberg-spark3-runtime-0.11.1.jar
Download: http://iceberg.apache.org/releases/
2. Edit conf/spark-defaults.conf to register the default Iceberg catalogs:
# Pin the Hive metastore version to 2.1.1
spark.sql.hive.metastore.version=2.1.1
# Use the Hive 2.1.1 jars shipped with CDH
spark.sql.hive.metastore.jars=/opt/cloudera/parcels/CDH/lib/hive/lib/*
# A catalog backed by the Hive metastore
spark.sql.catalog.hive_prod=org.apache.iceberg.spark.SparkCatalog
spark.sql.catalog.hive_prod.type=hive
spark.sql.catalog.hive_prod.uri=thrift://X.X.X.X:9083
# A catalog backed by HDFS (hadoop catalog)
spark.sql.catalog.zdm=org.apache.iceberg.spark.SparkCatalog
spark.sql.catalog.zdm.type=hadoop
spark.sql.catalog.zdm.warehouse=hdfs:///tmp/iceberg_test/
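If editing spark-defaults.conf is not convenient, the same properties can also be supplied per session as `--conf` flags (same keys as in the file; shown here for the Hive-backed catalog only):

```shell
spark-sql \
  --conf spark.sql.catalog.hive_prod=org.apache.iceberg.spark.SparkCatalog \
  --conf spark.sql.catalog.hive_prod.type=hive \
  --conf spark.sql.catalog.hive_prod.uri=thrift://X.X.X.X:9083
```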
3. Start the SparkSQL client with the spark-sql script
Using the HDFS (hadoop) catalog
1) Switch to the zdm hadoop catalog
spark-sql> use zdm;
2) Create a database
spark-sql> create database test;
3) Create an Iceberg table
spark-sql> create table iceberg_spark(id int, name string) using iceberg;
4) Insert data
spark-sql> insert into table iceberg_spark values(1,'a'),(2,'b'),(3,'c'),(4,'d');
5) Query the data
spark-sql> select * from iceberg_spark;
1 a
2 b
3 c
4 d
6) Inspect the Iceberg warehouse directory on HDFS
./tmp/
└── iceberg_test
└── test
└── iceberg_spark
├── data
│ ├── 00000-0-b8aab2a2-3d45-4ec9-8553-b7c76540f442-00001.parquet
│ ├── 00001-1-894b55e6-2918-4964-b663-a29d40f86aa2-00001.parquet
│ ├── 00002-2-3ab42919-e27a-44c9-afdf-e92ee613f5f4-00001.parquet
│ └── 00003-3-c8f917ab-db98-49bc-9099-804499e53ccb-00001.parquet
└── metadata
├── 63f25f73-2bb8-475b-90c4-1397c8c3f248-m0.avro
├── snap-7436248840915571836-1-63f25f73-2bb8-475b-90c4-1397c8c3f248.avro
├── v1.metadata.json
├── v2.metadata.json
└── version-hint.text
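In a hadoop catalog, version-hint.text is how readers locate the current table state: it holds a single version number N, and the current metadata file is vN.metadata.json. A minimal sketch of that resolution, using the file names from the listing above (plain Python, not the Iceberg API):

```python
def current_metadata_file(version_hint: str) -> str:
    """Resolve the current metadata file name from the contents of version-hint.text."""
    version = int(version_hint.strip())
    return f"v{version}.metadata.json"

# After the insert above, version-hint.text would contain "2":
print(current_metadata_file("2\n"))  # v2.metadata.json
```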
Using the Hive catalog
1) Switch to the hive_prod (Hive metastore) catalog
spark-sql> use hive_prod.test;  -- the test database already exists in Hive
2) Create an Iceberg table
spark-sql> create table ygy_iceberg_spark(id int, name string) using iceberg;
This fails with:
Caused by: org.apache.thrift.TApplicationException: Invalid method name: 'get_table_req'
at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:79)
at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.recv_get_table_req(ThriftHiveMetastore.java:1567)
at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.get_table_req(ThriftHiveMetastore.java:1554)
at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.getTable(HiveMetaStoreClient.java:1350)
at org.apache.iceberg.hive.HiveTableOperations.lambda$doRefresh$0(HiveTableOperations.java:130)
at org.apache.iceberg.hive.ClientPool.run(ClientPool.java:55)
at org.apache.iceberg.hive.HiveTableOperations.doRefresh(HiveTableOperations.java:130)
... 59 more
After swapping the hive-metastore jar for version 2.1.1, the error changes to:
Caused by: java.lang.NoSuchFieldError: METASTORE_CLIENT_FILTER_ENABLED
at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.getIfClientFilterEnabled(HiveMetaStoreClient.java:307)
at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.<init>(HiveMetaStoreClient.java:237)
at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.<init>(HiveMetaStoreClient.java:219)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at org.apache.iceberg.common.DynConstructors$Ctor.newInstanceChecked(DynConstructors.java:60)
at org.apache.iceberg.common.DynConstructors$Ctor.newInstance(DynConstructors.java:73)
at org.apache.iceberg.hive.HiveClientPool.newClient(HiveClientPool.java:54)
After also swapping the hive-exec jar for version 2.1.1, spark-sql fails on startup with:
Exception in thread "main" java.lang.NoSuchFieldError: JAVA_9
at org.apache.spark.storage.StorageUtils$.<init>(StorageUtils.scala:207)
at org.apache.spark.storage.StorageUtils$.<clinit>(StorageUtils.scala)
at org.apache.spark.storage.BlockManagerMasterEndpoint.<init>(BlockManagerMasterEndpoint.scala:93)
at org.apache.spark.SparkEnv$.$anonfun$create$9(SparkEnv.scala:370)
at org.apache.spark.SparkEnv$.registerOrLookupEndpoint$1(SparkEnv.scala:311)
at org.apache.spark.SparkEnv$.create(SparkEnv.scala:359)
at org.apache.spark.SparkEnv$.createDriverEnv(SparkEnv.scala:189)
at org.apache.spark.SparkContext.createSparkEnv(SparkContext.scala:272)
at org.apache.spark.SparkContext.<init>(SparkContext.scala:448)
at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2589)
at org.apache.spark.sql.SparkSession$Builder.$anonfun$getOrCreate$2(SparkSession.scala:934)
at scala.Option.getOrElse(Option.scala:189)
Root cause:
Spark 3.0 depends on commons-lang3, and a compatible commons-lang3.jar is already among its bundled dependencies. When adapting Spark 3.0 to Hive 2.1.1, the hive-exec jar was pulled in as well; the hive-exec-2.1.1 jar shades commons-lang3 classes directly into itself, and that bundled copy differs from the version Spark 3.0 expects, so class resolution fails at runtime.
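One way to confirm this kind of shading conflict is to list the class entries bundled inside each jar (a jar is just a zip archive). A minimal sketch of the check; here a throwaway in-memory "jar" stands in for the real hive-exec jar, whose path is commented as an assumption:

```python
import io
import zipfile

def shaded_lang3_classes(jar_bytes: bytes) -> list:
    """Return any commons-lang3 class entries bundled inside a jar."""
    with zipfile.ZipFile(io.BytesIO(jar_bytes)) as jar:
        return [n for n in jar.namelist()
                if n.startswith("org/apache/commons/lang3/") and n.endswith(".class")]

# Build a throwaway "jar" that shades a commons-lang3 class, as hive-exec does.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as jar:
    jar.writestr("org/apache/commons/lang3/StringUtils.class", b"")
    jar.writestr("org/apache/hadoop/hive/ql/exec/Task.class", b"")

conflicts = shaded_lang3_classes(buf.getvalue())
print(conflicts)  # ['org/apache/commons/lang3/StringUtils.class']

# Against the real artifact the same check would be (path is illustrative):
#   shaded_lang3_classes(open("hive-exec-2.1.1.jar", "rb").read())
```

A non-empty result means the jar carries its own commons-lang3 classes, which can collide with Spark's copy.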
Solution:
Download the Hive 2.1.1 source: https://github.com/apache/hive/tree/rel/release-2.1.1
In the hive-exec module's pom file, remove both of the following entries, so commons-lang3 is no longer shaded into the jar:
<include>org.apache.commons:commons-lang3</include>
<dependency>
  <groupId>org.apache.commons</groupId>
  <artifactId>commons-lang3</artifactId>
  <version>${commons-lang3.version}</version>
</dependency>
Then rebuild the module and use the resulting hive-exec-2.1.1.jar to replace Spark's bundled hive-exec-2.3.7-core.jar.
After replacing the jar, the table can be created:
spark-sql> create table ygy_iceberg_spark(id int, name string) using iceberg;
3) Insert data
spark-sql> insert into table ygy_iceberg_spark values(1,'a'),(2,'b'),(3,'c'),(4,'d');
4) Query the data
spark-sql> select * from ygy_iceberg_spark;
Reading Kafka data into an Iceberg table
1. Start spark-shell
Create a Kafka readStream
scala> var streamingInputDF = spark.readStream.format("kafka").option("kafka.bootstrap.servers", "X.X.X.X:9092").option("subscribe", "test1").option("startingOffsets", "latest").option("minPartitions", "10").option("failOnDataLoss", "true").load();
This fails with:
org.apache.spark.sql.AnalysisException: Failed to find data source: kafka. Please deploy the application as per the deployment section of "Structured Streaming + Kafka Integration Guide".;
at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:681)
at org.apache.spark.sql.streaming.DataStreamReader.load(DataStreamReader.scala:195)
... 47 elided
Put spark-sql-kafka-0-10_2.12-3.0.2.jar into the jars directory and retry:
scala> var streamingInputDF = spark.readStream.format("kafka").option("kafka.bootstrap.servers", "X.X.X.X:9092").option("subscribe", "test1").option("startingOffsets", "latest").option("minPartitions", "10").option("failOnDataLoss", "true").load();
java.lang.NoClassDefFoundError: org/apache/kafka/common/serialization/ByteArraySerializer
at org.apache.spark.sql.kafka010.KafkaSourceProvider$.<init>(KafkaSourceProvider.scala:557)
at org.apache.spark.sql.kafka010.KafkaSourceProvider$.<clinit>(KafkaSourceProvider.scala)
at org.apache.spark.sql.kafka010.KafkaSourceProvider.org$apache$spark$sql$kafka010$KafkaSourceProvider$$validateStreamOptions(KafkaSourceProvider.scala:326)
at org.apache.spark.sql.kafka010.KafkaSourceProvider.sourceSchema(KafkaSourceProvider.scala:71)
at org.apache.spark.sql.execution.datasources.DataSource.sourceSchema(DataSource.scala:235)
at org.apache.spark.sql.execution.datasources.DataSource.sourceInfo$lzycompute(DataSource.scala:116)
at org.apache.spark.sql.execution.datasources.DataSource.sourceInfo(DataSource.scala:116)
at org.apache.spark.sql.execution.streaming.StreamingRelation$.apply(StreamingRelation.scala:35)
at org.apache.spark.sql.streaming.DataStreamReader.load(DataStreamReader.scala:206)
... 47 elided
Caused by: java.lang.ClassNotFoundException: org.apache.kafka.common.serialization.ByteArraySerializer
at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:349)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
... 56 more
Put the kafka-clients jar into the jars directory as well; the error then becomes:
Caused by: java.lang.ClassNotFoundException: org.apache.spark.kafka010.KafkaConfigUpdater
at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:349)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
... 52 more
Add spark-token-provider-kafka-0-10_2.12-3.0.1.jar; the next failure appears at read time:
java.lang.NoClassDefFoundError: org/apache/commons/pool2/impl/GenericKeyedObjectPoolConfig
at org.apache.spark.sql.kafka010.consumer.KafkaDataConsumer$.<init>(KafkaDataConsumer.scala:606)
at org.apache.spark.sql.kafka010.consumer.KafkaDataConsumer$.<clinit>(KafkaDataConsumer.scala)
at org.apache.spark.sql.kafka010.KafkaBatchPartitionReader.<init>(KafkaBatchPartitionReader.scala:52)
at org.apache.spark.sql.kafka010.KafkaBatchReaderFactory$.createReader(KafkaBatchPartitionReader.scala:40)
at org.apache.spark.sql.execution.datasources.v2.DataSourceRDD.compute(DataSourceRDD.scala:60)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349)
Add commons-pool2-2.6.2.jar. With these four jars in place (spark-sql-kafka-0-10, kafka-clients, spark-token-provider-kafka-0-10, commons-pool2), the Kafka source finally loads:
scala> val df = spark.readStream.format("kafka").option("kafka.bootstrap.servers","X.X.X.X:9092").option("subscribe", "test1").load();
df: org.apache.spark.sql.DataFrame = [key: binary, value: binary ... 5 more fields]
2. Project the Kafka message value into columns
scala> val ds = df.withColumn("id", split(col("value").cast("string"),",")(0).cast("Int")).withColumn("name", split(col("value").cast("string"),",")(1)).select("id","name");
ds: org.apache.spark.sql.DataFrame = [id: int, name: string]
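The projection above treats the message value as a two-field CSV: field 0 cast to int becomes id, field 1 becomes name, and anything missing or non-numeric comes through as null. A sketch of that parsing logic outside Spark, mimicking the behavior of split(...)(n) plus cast("Int") (plain Python, not the Spark API):

```python
def parse_value(value: str):
    """Mimic split(value, ",")(0).cast("Int") and split(value, ",")(1):
    out-of-range fields and non-numeric ids become None, as Spark yields null."""
    parts = value.split(",")
    try:
        id_ = int(parts[0])
    except ValueError:
        id_ = None  # cast("Int") of a non-numeric string is null in Spark SQL
    name = parts[1] if len(parts) > 1 else None  # getItem past the end is null
    return (id_, name)

print(parse_value("1,a"))   # (1, 'a')
print(parse_value("oops"))  # (None, None)
```

This also explains how a malformed message can end up as a null row in the table.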
3. Create a writeStream that appends to the Iceberg table
scala> import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeUnit
scala> import org.apache.spark.sql.streaming.{OutputMode, Trigger };
import org.apache.spark.sql.streaming.{OutputMode, Trigger}
scala> ds.writeStream.format("iceberg").outputMode("append").trigger(Trigger.ProcessingTime(1, TimeUnit.SECONDS)).option("path", "hive_prod.test.ygy_iceberg_spark").option("checkpointLocation", "hdfs:///tmp/checkpoint_table").start().awaitTermination();
4. Start a console producer for topic test1
kafka-console-producer --broker-list X.X.X.X:9092 --topic test1
Produce a few messages:
1,3
1,4
1,5
5. Check that the Kafka data has arrived in the Iceberg table
scala> spark.sql("select * from hive_prod.test.ygy_iceberg_spark").show;
+----+----+
| id|name|
+----+----+
| 2| g|
| 3| h|
| 1| f|
|null|null|
| 1| 6|
| 1| 7|
| 1| 5|
| 1| a|
| 2| b|
| 3| c|
| 4| d|
+----+----+