FlinkCDC-Hudi: Real-Time MySQL Data Ingestion into the Lake, Part 2: Exceptions Encountered When Integrating Hudi with Spark, and How to Fix Them


1. Background


According to the Hudi documentation, integrating Hudi with Spark only requires picking the matching versions in one of the commands below and running it. Spark's built-in Ivy dependency manager then downloads the corresponding jars automatically (which requires internet access).

# Spark SQL for spark 3.1
spark-sql --packages org.apache.hudi:hudi-spark3.1.2-bundle_2.12:0.10.1,org.apache.spark:spark-avro_2.12:3.1.2 \
--conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer' \
--conf 'spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension'

# Spark SQL for spark 3.0
spark-sql --packages org.apache.hudi:hudi-spark3.0.3-bundle_2.12:0.10.1,org.apache.spark:spark-avro_2.12:3.0.3 \
--conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer' \
--conf 'spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension'

# Spark SQL for spark 2 with scala 2.11
spark-sql --packages org.apache.hudi:hudi-spark-bundle_2.11:0.10.1,org.apache.spark:spark-avro_2.11:2.4.4 \
--conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer' \
--conf 'spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension'

# Spark SQL for spark 2 with scala 2.12
spark-sql \
  --packages org.apache.hudi:hudi-spark-bundle_2.12:0.10.1,org.apache.spark:spark-avro_2.12:2.4.4 \
  --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer' \
  --conf 'spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension'

What if you are working in an offline (intranet-only) environment?
Option 1: download the jars on a machine with internet access, upload them to the offline environment, and replace --packages with --jars:

bin/spark-sql --jars hudi-spark3.1.2-bundle_2.12-0.11.0-SNAPSHOT.jar,spark-avro_2.12-3.1.2.jar --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer' --conf 'spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension'
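On the internet-connected machine, the jars resolved by --packages end up in Spark's Ivy cache (by default ~/.ivy2/jars) and can simply be copied from there. A minimal sketch, assuming the Hudi 0.10.1 / Spark 3.1.2 coordinates from above; the file names, user and host are placeholders:

# jars resolved by --packages land in ~/.ivy2/jars on the machine with internet access
ls ~/.ivy2/jars | grep -E 'hudi|spark-avro'
# copy them to the offline cluster (file names, user and host below are examples)
scp ~/.ivy2/jars/org.apache.hudi_hudi-spark3.1.2-bundle_2.12-0.10.1.jar \
    ~/.ivy2/jars/org.apache.spark_spark-avro_2.12-3.1.2.jar \
    user@offline-host:/opt/spark/jars/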

Option 2: build Hudi from source and upload the generated hudi-spark-bundle*.jar to the offline environment. The spark-avro jar can be downloaded from the Maven repository.
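For example, the spark-avro jar matching Spark 3.1.2 / Scala 2.12 can be fetched straight from Maven Central (the URL follows the standard repository layout; swap in the version that matches your Spark):

# the version in the URL is an example; pick the one matching your Spark
wget https://repo1.maven.org/maven2/org/apache/spark/spark-avro_2.12/3.1.2/spark-avro_2.12-3.1.2.jar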

Since exploring Hudi means frequently rebuilding its bundles for testing and verification, I went with Option 2.

The road to integrating Hudi with Spark was full of pitfalls; this post records them as a reference for those who follow.

Note:
If you do not have a Spark installation yet, download the Spark tarball from the official website and upload it to the target directory. After extracting it, copy Hive's hive-site.xml into spark/conf and spark-sql can immediately talk to Hive.
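A minimal sketch of that setup (paths and the Spark version are placeholders for your environment):

# paths and the Spark version are placeholders
tar -zxf spark-3.1.2-bin-hadoop3.2.tgz -C /opt
cp /path/to/hive/conf/hive-site.xml /opt/spark-3.1.2-bin-hadoop3.2/conf/
/opt/spark-3.1.2-bin-hadoop3.2/bin/spark-sql -e 'show databases;'   # should now list the Hive metastore databases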

2. Exceptions, Analysis, and Solutions

2.1 scala.Product$class not found

2.1.1 Symptom

After integrating Hudi with Spark, running a query in spark-sql fails with scala.Product$class not found. The stack trace is as follows:

Caused by: java.lang.NoClassDefFoundError: scala/Product$class
	at org.apache.hudi.HoodieTableFileIndexBase$PartitionPath.<init>(HoodieTableFileIndexBase.scala:276)
	at org.apache.hudi.HoodieTableFileIndexBase$$anonfun$getAllQueryPartitionPaths$1.apply(HoodieTableFileIndexBase.scala:241)
	at org.apache.hudi.HoodieTableFileIndexBase$$anonfun$getAllQueryPartitionPaths$1.apply(HoodieTableFileIndexBase.scala:239)
	at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238)
	at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
	at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
	at scala.collection.TraversableLike.map(TraversableLike.scala:238)
	at scala.collection.TraversableLike.map$(TraversableLike.scala:231)
	at scala.collection.AbstractTraversable.map(Traversable.scala:108)
	at org.apache.hudi.HoodieTableFileIndexBase.getAllQueryPartitionPaths(HoodieTableFileIndexBase.scala:239)
	at org.apache.hudi.HoodieTableFileIndexBase.loadPartitionPathFiles(HoodieTableFileIndexBase.scala:195)
	at org.apache.hudi.HoodieTableFileIndexBase.refresh0(HoodieTableFileIndexBase.scala:108)
	at org.apache.hudi.HoodieTableFileIndexBase.<init>(HoodieTableFileIndexBase.scala:88)
	at org.apache.hudi.SparkHoodieTableFileIndex.<init>(SparkHoodieTableFileIndex.scala:58)
	at org.apache.hudi.HoodieFileIndex.<init>(HoodieFileIndex.scala:64)
	at org.apache.hudi.DefaultSource.getBaseFileOnlyView(DefaultSource.scala:199)
	at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:120)
	at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:353)
	at org.apache.spark.sql.execution.datasources.FindDataSourceTable.$anonfun$readDataSourceTable$1(DataSourceStrategy.scala:261)
	at org.sparkproject.guava.cache.LocalCache$LocalManualCache$1.load(LocalCache.java:4792)
	at org.sparkproject.guava.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3599)
	at org.sparkproject.guava.cache.LocalCache$Segment.loadSync(LocalCache.java:2379)
	at org.sparkproject.guava.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2342)
	at org.sparkproject.guava.cache.LocalCache$Segment.get(LocalCache.java:2257)
	... 95 more
Caused by: java.lang.ClassNotFoundException: scala.Product$class
	at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
	at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:349)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
	... 120 more

2.1.2 Root cause

The error says a core Scala class cannot be found, yet Spark bundles its own Scala libraries and hudi-spark-bundle embeds a Scala version as well. Comparing the two shows that Spark was built with Scala 2.12 while the hudi-spark-bundle was built with Scala 2.11. This Scala version mismatch is why the Scala base class cannot be found at runtime.

2.1.3 Solution

Rebuild Hudi against the same Scala version as Spark.
Build command:
mvn clean install -DskipTests -D rat.skip=true -D scala-2.12 -D hadoop.version=3.2.1 -D hive.version=3.1.2 -Pflink-bundle-shade-hive3
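Before rebuilding, you can confirm which Scala version your Spark distribution ships with; spark-submit prints it in its version banner, and the bundle's Scala version is visible in its _2.11/_2.12 artifact-name suffix:

spark-submit --version    # the output includes a line like "Using Scala version 2.12.10"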

2.2 Spark3Adapter not found

2.2.1 Symptom

Launch spark-sql with:

bin/spark-sql --jars hudi-spark-bundle_2.12-0.11.0-SNAPSHOT.jar,spark-avro_2.12-3.1.2.jar --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer' --conf 'spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension'

Then, running a query in spark-sql fails with Spark3Adapter not found. The stack trace is as follows:

java.lang.ClassNotFoundException: org.apache.spark.sql.adapter.Spark3Adapter
	at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
	at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:349)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
	at org.apache.hudi.SparkAdapterSupport.sparkAdapter(SparkAdapterSupport.scala:35)
	at org.apache.hudi.SparkAdapterSupport.sparkAdapter$(SparkAdapterSupport.scala:29)
	at org.apache.spark.sql.parser.HoodieCommonSqlParser.sparkAdapter$lzycompute(HoodieCommonSqlParser.scala:34)
	at org.apache.spark.sql.parser.HoodieCommonSqlParser.sparkAdapter(HoodieCommonSqlParser.scala:34)
	at org.apache.spark.sql.parser.HoodieCommonSqlParser.sparkExtendedParser$lzycompute(HoodieCommonSqlParser.scala:38)
	at org.apache.spark.sql.parser.HoodieCommonSqlParser.sparkExtendedParser(HoodieCommonSqlParser.scala:38)
	at org.apache.spark.sql.parser.HoodieCommonSqlParser.$anonfun$parsePlan$1(HoodieCommonSqlParser.scala:44)
	at org.apache.spark.sql.parser.HoodieCommonSqlParser.parse(HoodieCommonSqlParser.scala:84)
	at org.apache.spark.sql.parser.HoodieCommonSqlParser.parsePlan(HoodieCommonSqlParser.scala:41)
	at org.apache.spark.sql.SparkSession.$anonfun$sql$2(SparkSession.scala:616)
	at org.apache.spark.sql.catalyst.QueryPlanningTracker.measurePhase(QueryPlanningTracker.scala:111)
	at org.apache.spark.sql.SparkSession.$anonfun$sql$1(SparkSession.scala:616)
	at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775)
	at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:613)
	at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:650)
	at org.apache.spark.sql.hive.thriftserver.SparkSQLDriver.run(SparkSQLDriver.scala:67)
	at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.processCmd(SparkSQLCLIDriver.scala:381)
	at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.$anonfun$processLine$1(SparkSQLCLIDriver.scala:500)
	at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.$anonfun$processLine$1$adapted(SparkSQLCLIDriver.scala:494)
	at scala.collection.Iterator.foreach(Iterator.scala:941)
	at scala.collection.Iterator.foreach$(Iterator.scala:941)
	at scala.collection.AbstractIterator.foreach(Iterator.scala:1429)
	at scala.collection.IterableLike.foreach(IterableLike.scala:74)
	at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
	at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
	at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.processLine(SparkSQLCLIDriver.scala:494)
	at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:284)
	at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
	at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:951)
	at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
	at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
	at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
	at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1039)
	at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1048)
	at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

2.2.2 Root cause

Decompiling hudi-spark-bundle_2.12-0.11.0-SNAPSHOT.jar shows that the package in question contains Spark2Adapter: Hudi is built against Spark 2.x by default. The mismatch between the Spark version Hudi was compiled for and the Spark runtime version causes this error.
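A quick way to check which adapter a given bundle actually contains is to list the jar entries (using the JDK's jar tool; the jar name here is the one from this example):

# jar name taken from this example
jar tf hudi-spark-bundle_2.12-0.11.0-SNAPSHOT.jar | grep Adapter

A default (Spark 2) build shows org/apache/spark/sql/adapter/Spark2Adapter.class, whereas a Spark 3 build contains the Spark3Adapter class the runtime is looking for.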

2.2.3 Solution

Rebuild the Hudi bundle against Spark 3.
Note: Hudi's pom.xml defines Spark profiles such as spark2, spark3 (targeting Spark 3.2.0) and spark3.1.x; pick the profile that matches your environment.
Build command:
mvn clean install -DskipTests -D rat.skip=true -D scala-2.12 -D hadoop.version=3.2.1 -D hive.version=3.1.2 -Pflink-bundle-shade-hive3 -Pspark3.1.x
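After the build finishes, the rebuilt bundle should sit under the packaging module of the Hudi source tree (path assumed from the Hudi repository layout), for example:

# path assumed from the Hudi source tree layout
ls packaging/hudi-spark-bundle/target/hudi-spark*-bundle_2.12-*.jar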

2.3 Parquet file not found when selecting from a Hudi table

2.3.1 Symptom

Queries on a Hudi table in spark-sql work at first, but after some time the same query fails because a parquet file can no longer be found. The stack trace is as follows:

Driver stacktrace:
	at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2258)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2207)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2206)
	at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
	at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
	at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2206)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1079)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1079)
	at scala.Option.foreach(Option.scala:407)
	at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1079)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2445)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2387)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2376)
	at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
	at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:868)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2196)
	at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:200)
	... 45 more
Caused by: java.io.FileNotFoundException: File does not exist:  98d4f2c9-e61b-4617-8bda-fc194446126f_0-1-14_20220214130525756.parquet
It is possible the underlying files have been updated. You can explicitly invalidate the cache in Spark by running 'REFRESH TABLE tableName' command in SQL or by recreating the Dataset/DataFrame involved.

2.3.2 Root cause

Hudi periodically compacts its files and deletes the old files once compaction finishes. Spark, however, still holds the old file listing in its cache, so the next query fails against files that no longer exist.

2.3.3 Solution

Run refresh table tablename in spark-sql to invalidate the cached file listing.
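For example (the table name is a placeholder for your own Hudi table):

-- hudi_tbl is a placeholder table name
refresh table hudi_tbl;
select count(*) from hudi_tbl;    -- re-lists the files and the query succeeds again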

2.4 Failed to find data source: hudi

2.4.1 Symptom

When the Hudi dependency is not configured, querying a Hudi table in spark-sql fails with the following stack trace:

java.util.concurrent.ExecutionException: java.lang.ClassNotFoundException: Failed to find data source: hudi. Please find packages at http://spark.apache.org/third-party-projects.html
	at org.sparkproject.guava.util.concurrent.AbstractFuture$Sync.getValue(AbstractFuture.java:306)
	at org.sparkproject.guava.util.concurrent.AbstractFuture$Sync.get(AbstractFuture.java:293)
	at org.sparkproject.guava.util.concurrent.AbstractFuture.get(AbstractFuture.java:116)
	at org.sparkproject.guava.util.concurrent.Uninterruptibles.getUninterruptibly(Uninterruptibles.java:135)
	at org.sparkproject.guava.cache.LocalCache$Segment.getAndRecordStats(LocalCache.java:2410)
	at org.sparkproject.guava.cache.LocalCache$Segment.loadSync(LocalCache.java:2380)
	at org.sparkproject.guava.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2342)
	at org.sparkproject.guava.cache.LocalCache$Segment.get(LocalCache.java:2257)
	at org.sparkproject.guava.cache.LocalCache.get(LocalCache.java:4000)
	at org.sparkproject.guava.cache.LocalCache$LocalManualCache.get(LocalCache.java:4789)
	at org.apache.spark.sql.catalyst.catalog.SessionCatalog.getCachedPlan(SessionCatalog.scala:155)
	at org.apache.spark.sql.execution.datasources.FindDataSourceTable.org$apache$spark$sql$execution$datasources$FindDataSourceTable$$readDataSourceTable(DataSourceStrategy.scala:249)
	at org.apache.spark.sql.execution.datasources.FindDataSourceTable$$anonfun$apply$2.applyOrElse(DataSourceStrategy.scala:288)
	at org.apache.spark.sql.execution.datasources.FindDataSourceTable$$anonfun$apply$2.applyOrElse(DataSourceStrategy.scala:278)
	at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.$anonfun$resolveOperatorsDown$2(AnalysisHelper.scala:108)
	at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:74)
	at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.$anonfun$resolveOperatorsDown$1(AnalysisHelper.scala:108)
	at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$.allowInvokingTransformsInAnalyzer(AnalysisHelper.scala:221)
	at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.resolveOperatorsDown(AnalysisHelper.scala:106)
	at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.resolveOperatorsDown$(AnalysisHelper.scala:104)
	at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperatorsDown(LogicalPlan.scala:29)
	at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.$anonfun$resolveOperatorsDown$4(AnalysisHelper.scala:113)
	at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$mapChildren$1(TreeNode.scala:408)
	at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:244)
	at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:406)
	at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:359)
	at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.$anonfun$resolveOperatorsDown$1(AnalysisHelper.scala:113)
	at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$.allowInvokingTransformsInAnalyzer(AnalysisHelper.scala:221)
	at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.resolveOperatorsDown(AnalysisHelper.scala:106)
	at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.resolveOperatorsDown$(AnalysisHelper.scala:104)
	at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperatorsDown(LogicalPlan.scala:29)
	at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.$anonfun$resolveOperatorsDown$4(AnalysisHelper.scala:113)
	at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$mapChildren$1(TreeNode.scala:408)
	at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:244)
	at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:406)
	at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:359)
	at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.$anonfun$resolveOperatorsDown$1(AnalysisHelper.scala:113)
	at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$.allowInvokingTransformsInAnalyzer(AnalysisHelper.scala:221)
	at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.resolveOperatorsDown(AnalysisHelper.scala:106)
	at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.resolveOperatorsDown$(AnalysisHelper.scala:104)
	at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperatorsDown(LogicalPlan.scala:29)
	at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.resolveOperators(AnalysisHelper.scala:73)
	at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.resolveOperators$(AnalysisHelper.scala:72)
	at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperators(LogicalPlan.scala:29)
	at org.apache.spark.sql.execution.datasources.FindDataSourceTable.apply(DataSourceStrategy.scala:278)
	at org.apache.spark.sql.execution.datasources.FindDataSourceTable.apply(DataSourceStrategy.scala:243)
	at org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$2(RuleExecutor.scala:216)
	at scala.collection.LinearSeqOptimized.foldLeft(LinearSeqOptimized.scala:126)
	at scala.collection.LinearSeqOptimized.foldLeft$(LinearSeqOptimized.scala:122)
	at scala.collection.immutable.List.foldLeft(List.scala:89)
	at org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$1(RuleExecutor.scala:213)
	at org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$1$adapted(RuleExecutor.scala:205)
	at scala.collection.immutable.List.foreach(List.scala:392)
	at org.apache.spark.sql.catalyst.rules.RuleExecutor.execute(RuleExecutor.scala:205)
	at org.apache.spark.sql.catalyst.analysis.Analyzer.org$apache$spark$sql$catalyst$analysis$Analyzer$$executeSameContext(Analyzer.scala:196)
	at org.apache.spark.sql.catalyst.analysis.Analyzer.execute(Analyzer.scala:190)
	at org.apache.spark.sql.catalyst.analysis.Analyzer.execute(Analyzer.scala:155)
	at org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$executeAndTrack$1(RuleExecutor.scala:183)
	at org.apache.spark.sql.catalyst.QueryPlanningTracker$.withTracker(QueryPlanningTracker.scala:88)
	at org.apache.spark.sql.catalyst.rules.RuleExecutor.executeAndTrack(RuleExecutor.scala:183)
	at org.apache.spark.sql.catalyst.analysis.Analyzer.$anonfun$executeAndCheck$1(Analyzer.scala:174)
	at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$.markInAnalyzer(AnalysisHelper.scala:228)
	at org.apache.spark.sql.catalyst.analysis.Analyzer.executeAndCheck(Analyzer.scala:173)
	at org.apache.spark.sql.execution.QueryExecution.$anonfun$analyzed$1(QueryExecution.scala:73)
	at org.apache.spark.sql.catalyst.QueryPlanningTracker.measurePhase(QueryPlanningTracker.scala:111)
	at org.apache.spark.sql.execution.QueryExecution.$anonfun$executePhase$1(QueryExecution.scala:143)
	at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775)
	at org.apache.spark.sql.execution.QueryExecution.executePhase(QueryExecution.scala:143)
	at org.apache.spark.sql.execution.QueryExecution.analyzed$lzycompute(QueryExecution.scala:73)
	at org.apache.spark.sql.execution.QueryExecution.analyzed(QueryExecution.scala:71)
	at org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:63)
	at org.apache.spark.sql.Dataset$.$anonfun$ofRows$2(Dataset.scala:98)
	at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775)
	at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:96)
	at org.apache.spark.sql.SparkSession.$anonfun$sql$1(SparkSession.scala:618)
	at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775)
	at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:613)
	at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:650)
	at org.apache.spark.sql.hive.thriftserver.SparkSQLDriver.run(SparkSQLDriver.scala:67)
	at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.processCmd(SparkSQLCLIDriver.scala:381)
	at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.$anonfun$processLine$1(SparkSQLCLIDriver.scala:500)
	at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.$anonfun$processLine$1$adapted(SparkSQLCLIDriver.scala:494)
	at scala.collection.Iterator.foreach(Iterator.scala:941)
	at scala.collection.Iterator.foreach$(Iterator.scala:941)
	at scala.collection.AbstractIterator.foreach(Iterator.scala:1429)
	at scala.collection.IterableLike.foreach(IterableLike.scala:74)
	at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
	at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
	at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.processLine(SparkSQLCLIDriver.scala:494)
	at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:284)
	at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
	at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:951)
	at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
	at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
	at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
	at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1039)
	at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1048)
	at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.lang.ClassNotFoundException: Failed to find data source: hudi. Please find packages at http://spark.apache.org/third-party-projects.html
	at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:692)
	at org.apache.spark.sql.execution.datasources.DataSource.providingClass$lzycompute(DataSource.scala:99)
	at org.apache.spark.sql.execution.datasources.DataSource.providingClass(DataSource.scala:98)
	at org.apache.spark.sql.execution.datasources.DataSource.providingInstance(DataSource.scala:112)
	at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:350)
	at org.apache.spark.sql.execution.datasources.FindDataSourceTable.$anonfun$readDataSourceTable$1(DataSourceStrategy.scala:261)
	at org.sparkproject.guava.cache.LocalCache$LocalManualCache$1.load(LocalCache.java:4792)
	at org.sparkproject.guava.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3599)
	at org.sparkproject.guava.cache.LocalCache$Segment.loadSync(LocalCache.java:2379)
	... 97 more
Caused by: java.lang.ClassNotFoundException: hudi.DefaultSource
	at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
	at org.apache.spark.sql.execution.datasources.DataSource$.$anonfun$lookupDataSource$5(DataSource.scala:666)
	at scala.util.Try$.apply(Try.scala:213)
	at org.apache.spark.sql.execution.datasources.DataSource$.$anonfun$lookupDataSource$4(DataSource.scala:666)
	at scala.util.Failure.orElse(Try.scala:224)
	at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:666)
	... 105 more

2.4.2 Root cause

spark-sql was started without the Hudi bundle jar on its runtime classpath.

2.4.3 Solution

Start spark-sql with the hudi-spark-bundle jar on the classpath. See Section 1 (Background) for how to add it.

