Notes on integrating Spline with Spark

This post describes how to integrate the Spline agent with Spark, covering download, configuration, and a memory problem hit along the way: a ClassGraphException caused by an OutOfMemoryError, resolved by giving the Spark application's driver more memory.


Spline consists of three parts: the agent, the server, and the UI. Only the agent integration is covered below.

Download

Choose the agent that matches your Spark and Scala versions; there are two ways to get it:
Download a pre-built bundle directly: https://repo1.maven.org/maven2/za/co/absa/spline/agent/spark/
Or clone the repository and build it locally yourself (see the sketch below): https://github.com/AbsaOSS/spline-spark-agent
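
If you build from source, a plain Maven build should produce the agent bundle. A minimal sketch (the Maven profiles for choosing the Spark/Scala versions differ between releases, so check the repository README for the profile names that match your environment):

git clone https://github.com/AbsaOSS/spline-spark-agent.git
cd spline-spark-agent
# add the Spark/Scala profile flags documented in the README for your target versions
mvn clean install -DskipTests
# the bundle jar should then appear under the corresponding bundle module's target/ directory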

Install the Spline agent

According to the official docs, Spline supports two integration modes (codeless and programmatic); here we use the first one, the codeless approach:

  1. Obtain the agent bundle from the download step above;
  2. Add spark-A.B-spline-agent-bundle_X.Y.jar to the Spark lib path;
  3. Edit spark-defaults.conf; the three blocks below are alternative configurations (log, Kafka, or both combined), pick the one you need:
vim $SPARK_HOME/conf/spark-defaults.conf

# Option 1: write the lineage messages to the application log
spark.sql.queryExecutionListeners=za.co.absa.spline.harvester.listener.SplineQueryExecutionListener
spark.spline.lineageDispatcher=log
spark.spline.lineageDispatcher.log.level=INFO
spark.spline.lineageDispatcher.log.className=za.co.absa.spline.harvester.dispatcher.LoggingLineageDispatcher

# Option 2: send the lineage messages to Kafka
spark.sql.queryExecutionListeners=za.co.absa.spline.harvester.listener.SplineQueryExecutionListener
spark.spline.lineageDispatcher=kafka
spark.spline.lineageDispatcher.kafka.topic=spark_lineage_test
spark.spline.lineageDispatcher.kafka.producer.bootstrap.servers=localhost:9092

# Option 3: combine multiple dispatchers via the composite dispatcher
spark.sql.queryExecutionListeners=za.co.absa.spline.harvester.listener.SplineQueryExecutionListener
spark.spline.lineageDispatcher=composite
spark.spline.lineageDispatcher.composite.dispatchers=log,kafka
spark.spline.lineageDispatcher.composite.className=za.co.absa.spline.harvester.dispatcher.CompositeLineageDispatcher
spark.spline.lineageDispatcher.composite.failOnErrors=false
spark.spline.lineageDispatcher.log.level=INFO
spark.spline.lineageDispatcher.log.className=za.co.absa.spline.harvester.dispatcher.LoggingLineageDispatcher
spark.spline.lineageDispatcher.kafka.topic=spark_lineage_test
spark.spline.lineageDispatcher.kafka.producer.bootstrap.servers=localhost:9092
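
The same properties can also be supplied per job on the command line instead of editing spark-defaults.conf. A minimal sketch using the log dispatcher (the application class and jar names are placeholders; it assumes the agent bundle jar is already on the Spark classpath per step 2, otherwise add it with --jars):

spark-submit \
  --class com.example.MyApp \
  --conf spark.sql.queryExecutionListeners=za.co.absa.spline.harvester.listener.SplineQueryExecutionListener \
  --conf spark.spline.lineageDispatcher=log \
  --conf spark.spline.lineageDispatcher.log.level=INFO \
  my-app.jar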

Only a few dispatchers are shown here; for more options, see the GitHub README:

https://github.com/AbsaOSS/spline-spark-agent/?tab=readme-ov-file#configuration

At this point the Spline lineage JSON can be found in the Spark application log and/or the Kafka topic, depending on which dispatcher(s) you configured.
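
To quickly confirm that lineage events are actually arriving, you can tail the topic with the standard Kafka console consumer (assuming the Kafka dispatcher settings shown above):

kafka-console-consumer.sh \
  --bootstrap-server localhost:9092 \
  --topic spark_lineage_test \
  --from-beginning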

Here is an error I ran into during integration that caused me a lot of pain.

After the Spark app is submitted, Spline initializes automatically, and this initialization consumes roughly 500 MB of driver memory, which I did not know at the time. The error was as follows:

ERROR ApplicationMaster: User class threw exception: java.lang.ExceptionInInitializerError
java.lang.ExceptionInInitializerError
	at za.co.absa.spline.harvester.plugin.registry.AutoDiscoveryPluginRegistry.<init>(AutoDiscoveryPluginRegistry.scala:51)
	at za.co.absa.spline.agent.SplineAgent$.create(SplineAgent.scala:66)
	at za.co.absa.spline.harvester.SparkLineageInitializer.createListener(SparkLineageInitializer.scala:162)
	at za.co.absa.spline.harvester.SparkLineageInitializer.$anonfun$createListener$6(SparkLineageInitializer.scala:139)
	at za.co.absa.spline.harvester.SparkLineageInitializer.withErrorHandling(SparkLineageInitializer.scala:176)
	at za.co.absa.spline.harvester.SparkLineageInitializer.createListener(SparkLineageInitializer.scala:138)
	at za.co.absa.spline.harvester.listener.SplineQueryExecutionListener.<init>(SplineQueryExecutionListener.scala:37)
	at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
	at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
	at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
	at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
	at org.apache.spark.util.Utils$.$anonfun$loadExtensions$1(Utils.scala:2930)
	at scala.collection.TraversableLike.$anonfun$flatMap$1(TraversableLike.scala:293)
	at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
	at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
	at scala.collection.TraversableLike.flatMap(TraversableLike.scala:293)
	at scala.collection.TraversableLike.flatMap$(TraversableLike.scala:290)
	at scala.collection.AbstractTraversable.flatMap(Traversable.scala:108)
	at org.apache.spark.util.Utils$.loadExtensions(Utils.scala:2919)
	at org.apache.spark.sql.util.ExecutionListenerManager.$anonfun$new$2(QueryExecutionListener.scala:90)
	at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
	at org.apache.spark.sql.internal.SQLConf$.withExistingConf(SQLConf.scala:158)
	at org.apache.spark.sql.util.ExecutionListenerManager.$anonfun$new$1(QueryExecutionListener.scala:90)
	at org.apache.spark.sql.util.ExecutionListenerManager.$anonfun$new$1$adapted(QueryExecutionListener.scala:88)
	at scala.Option.foreach(Option.scala:407)
	at org.apache.spark.sql.util.ExecutionListenerManager.<init>(QueryExecutionListener.scala:88)
	at org.apache.spark.sql.internal.BaseSessionStateBuilder.$anonfun$listenerManager$2(BaseSessionStateBuilder.scala:336)
	at scala.Option.getOrElse(Option.scala:189)
	at org.apache.spark.sql.internal.BaseSessionStateBuilder.listenerManager(BaseSessionStateBuilder.scala:336)
	at org.apache.spark.sql.internal.BaseSessionStateBuilder.build(BaseSessionStateBuilder.scala:364)
	at org.apache.spark.sql.SparkSession$.org$apache$spark$sql$SparkSession$$instantiateSessionState(SparkSession.scala:1175)
	at org.apache.spark.sql.SparkSession.$anonfun$sessionState$2(SparkSession.scala:162)
	at scala.Option.getOrElse(Option.scala:189)
	at org.apache.spark.sql.SparkSession.sessionState$lzycompute(SparkSession.scala:160)
	at org.apache.spark.sql.SparkSession.sessionState(SparkSession.scala:157)
	at org.apache.spark.sql.DataFrameReader.<init>(DataFrameReader.scala:698)
	at org.apache.spark.sql.SparkSession.read(SparkSession.scala:662)
	at com.hs.sdi.utils.DeltaMulTableSDIJob.$anonfun$createDataFrame$1(DeltaMulTableSDIJob.scala:390)
	at scala.collection.mutable.HashMap.$anonfun$foreach$1(HashMap.scala:149)
	at scala.collection.mutable.HashTable.foreachEntry(HashTable.scala:237)
	at scala.collection.mutable.HashTable.foreachEntry$(HashTable.scala:230)
	at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:44)
	at scala.collection.mutable.HashMap.foreach(HashMap.scala:149)
	at com.hs.sdi.utils.DeltaMulTableSDIJob.createDataFrame(DeltaMulTableSDIJob.scala:387)
	at com.hs.sdi.utils.DeltaMulTableSDIJob.calculation(DeltaMulTableSDIJob.scala:480)
	at com.hs.sdi.DeltaMulTableSDIJobMain$.main(DeltaMulTableSDIJobMain.scala:67)
	at com.hs.sdi.DeltaMulTableSDIJobMain.main(DeltaMulTableSDIJobMain.scala)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:739)
Caused by: io.github.classgraph.ClassGraphException: Uncaught exception during scan
	at io.github.classgraph.ClassGraph.scan(ClassGraph.java:1558)
	at io.github.classgraph.ClassGraph.scan(ClassGraph.java:1575)
	at io.github.classgraph.ClassGraph.scan(ClassGraph.java:1588)
	at za.co.absa.spline.harvester.plugin.registry.AutoDiscoveryPluginRegistry$.$anonfun$PluginClasses$2(AutoDiscoveryPluginRegistry.scala:96)
	at za.co.absa.commons.lang.ARM$.using(ARM.scala:30)
	at za.co.absa.commons.lang.ARM$ResourceWrapper.flatMap(ARM.scala:43)
	at za.co.absa.spline.harvester.plugin.registry.AutoDiscoveryPluginRegistry$.<init>(AutoDiscoveryPluginRegistry.scala:96)
	at za.co.absa.spline.harvester.plugin.registry.AutoDiscoveryPluginRegistry$.<clinit>(AutoDiscoveryPluginRegistry.scala)
	... 53 more
Caused by: java.lang.OutOfMemoryError: GC overhead limit exceeded
	at nonapi.io.github.classgraph.fileslice.reader.ClassfileReader.<init>(ClassfileReader.java:141)
	at io.github.classgraph.ClasspathElementZip$1.openClassfile(ClasspathElementZip.java:409)
	at io.github.classgraph.Classfile.<init>(Classfile.java:1925)
	at io.github.classgraph.Scanner$ClassfileScannerWorkUnitProcessor.processWorkUnit(Scanner.java:741)
	at io.github.classgraph.Scanner$ClassfileScannerWorkUnitProcessor.processWorkUnit(Scanner.java:664)
	at nonapi.io.github.classgraph.concurrency.WorkQueue.runWorkLoop(WorkQueue.java:246)
	at nonapi.io.github.classgraph.concurrency.WorkQueue.access$000(WorkQueue.java:50)
	at nonapi.io.github.classgraph.concurrency.WorkQueue$1.call(WorkQueue.java:201)
	at nonapi.io.github.classgraph.concurrency.WorkQueue$1.call(WorkQueue.java:198)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)

After combing through the official docs and all kinds of blog posts, I finally found the hint in GitHub issues: the classgraph component used by Spline needs a certain amount of memory for its classpath scan. I bumped my Spark app's driver memory from 500 MB to 1 GB (see the sketch after the links below) and the problem was solved. These are the two issues that pointed me in the right direction:

https://github.com/AbsaOSS/spline-spark-agent/issues/636
https://github.com/classgraph/classgraph/issues/338
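
For reference, the driver memory can be raised either per job on spark-submit or globally in spark-defaults.conf; a minimal sketch (the application class and jar names are placeholders):

# Per job:
spark-submit --driver-memory 1g --class com.example.MyApp my-app.jar

# Or globally in $SPARK_HOME/conf/spark-defaults.conf:
spark.driver.memory=1g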
