The table in this example is stored as Parquet. Each time Spark reads the table it scans every directory and file under the table's root directory and builds a FakeFileStatus entry for each file; these entries are used to derive the table schema. On every query Spark also checks whether the table's schema has changed, and if it has, the schema is regenerated from the Parquet metadata file, or from the data files' footers when no metadata file exists. All of the generated FakeFileStatus objects are cached in the driver process, so when the table contains a very large number of files the cache consumes too much driver memory and eventually triggers a GC OOM. Moreover, the cache only lives for one Spark submission: when the job ends the driver process exits and the cache is gone. My understanding is that this helps streaming jobs, or jobs that query the same table many times within a single submission, but it actually hurts performance when the table is queried only once. And why does Spark traverse the files at all instead of fetching the table schema from the Hive metastore?
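
For comparison, below is a minimal Spark 1.x-style sketch of the two read paths. The table name db.events, the warehouse path, and the variable names are placeholders, it assumes spark-hive is on the classpath, and it is only an illustration, not the code that produced the trace below.

  import org.apache.spark.{SparkConf, SparkContext}
  import org.apache.spark.sql.hive.HiveContext

  val sc = new SparkContext(new SparkConf().setAppName("parquet-schema-demo"))
  val sqlContext = new HiveContext(sc)

  // Skip footer-by-footer schema merging when all files share one schema.
  sqlContext.setConf("spark.sql.parquet.mergeSchema", "false")

  // Read through the Hive metastore: Spark can take the schema from the metastore entry.
  val viaMetastore = sqlContext.table("db.events")

  // Read the path directly: Spark has to list every file and infer the schema
  // itself, which is what builds the cached file-status entries on the driver.
  val viaPath = sqlContext.read.parquet("/warehouse/db.db/events")

  viaMetastore.printSchema()

Note that even a metastore-backed Parquet table goes through Spark's native Parquet reader by default (spark.sql.hive.convertMetastoreParquet is true), which is likely why the file listing and footer reads still happen; setting it to false falls back to the Hive SerDe read path, at the cost of Spark's Parquet optimizations.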
 
Exception stack trace:
  java.lang.OutOfMemoryError: GC overhead limit exceeded 
 
  at java.lang.StringBuilder.<init> 
  at java.io.ObjectStreamClass.getClassSignature(ObjectStreamClass.java:1458) 
 
  at java.io.ObjectStreamClass$MemberSignature.<init> 
  at java.io.ObjectStreamClass.computeDefaultSUID(ObjectStreamClass.java:1701) 
 
  at java.io.ObjectStreamClass.access$100(ObjectStreamClass.java:69) 
 
  at java.io.ObjectStreamClass$1.run(ObjectStreamClass.java:247) 
 
  at java.io.ObjectStreamClass$1.run(ObjectStreamClass.java:245) 
 
  at java.security.AccessController.doPrivileged(Native Method) 
 
  at java.io.ObjectStreamClass.getSerialVersionUID(ObjectStreamClass.java:244) 
 
  at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1601) 
 
  at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1514) 
 
  at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1750) 
 
  at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1347) 
 
  at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1964) 
 
  at java.io.ObjectInputStream.defaultReadObject(ObjectInputStream.java:498) 
 
  at org.apache.spark.rpc.netty.NettyRpcEndpointRef.readObject(NettyRpcEnv.scala:499) 
 
  at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) 
 
  at java.lang.reflect.Method.invoke(Method.java:601) 
 
  at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1866) 
 
  at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1771) 
 
  at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1964) 
 
  at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1888) 
 
  at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1347) 
 
  at java.io.ObjectInputStream.readObject(ObjectInputStream.java:369) 
 
  at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:76) 
 
  Exception in thread "dispatcher-event-loop-23" java.lang.OutOfMemoryError: GC overhead limit exceeded 
 
  at scala.collection.mutable.MapLike$class.getOrElseUpdate(MapLike.scala:187) 
 
  at scala.collection.mutable.AbstractMap.getOrElseUpdate(Map.scala:91) 
 
  at org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$resourceOffers$1$$anonfun$apply$5.apply(TaskSchedulerImpl.scala:293) 
 
  at scala.Option.foreach(Option.scala:236) 
 
  at org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$resourceOffers$1.apply(TaskSchedulerImpl.scala:284) 
 
  at scala.collection.immutable.List.foreach(List.scala:318) 
 
  at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverEndpoint.org$apache$spark$scheduler$cluster$CoarseGrainedSchedulerBackend$DriverEndpoint$$makeOffers(CoarseGrainedSchedulerBackend.scala:200) 
 
  at org.apache.spark.rpc.netty.Inbox$$anonfun$process$1.apply$mcV$sp(Inbox.scala:116) 
 
  at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:204) 
 
  at org.apache.spark.rpc.netty.Dispatcher$MessageLoop.run(Dispatcher.scala:215) 
 
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603) 
 
  at java.lang.Thread.run(Thread.java:722) 
 
  java.lang.OutOfMemoryError: GC overhead limit exceeded 
 
  at akka.actor.LightArrayRevolverScheduler$$anon$8.nextTick(Scheduler.scala:409) 
 
  at java.lang.Thread.run(Thread.java:722) 
 
  17/05/25 12:14:43 ERROR ActorSystemImpl: Uncaught fatal error from thread [sparkDriverActorSystem-scheduler-1] shutting down ActorSystem [sparkDriverActorSystem] 
 
  java.lang.OutOfMemoryError: GC overhead limit exceeded 
 
  at akka.actor.LightArrayRevolverScheduler$$anon$8.run(Scheduler.scala:375) 
 
  17/05/25 12:14:43 ERROR LiveListenerBus: SparkListenerBus has already stopped! Dropping event SparkListenerExecutorMetricsUpdate(5,WrappedArray()) 
 
  17/05/25 12:14:43 ERROR LiveListenerBus: SparkListenerBus has already stopped! Dropping event SparkListenerExecutorMetricsUpdate(1,WrappedArray()) 
 
  17/05/25 12:14:43 ERROR LiveListenerBus: SparkListenerBus has already stopped! Dropping event SparkListenerExecutorMetricsUpdate(4,WrappedArray()) 
 
  17/05/25 12:14:43 ERROR LiveListenerBus: SparkListenerBus has already stopped! Dropping event SparkListenerExecutorMetricsUpdate(8,WrappedArray()) 
 
  17/05/25 12:14:43 INFO RemoteActorRefProvider$RemotingTerminator: Remote daemon shut down; proceeding with flushing remote transports. 
 
  17/05/25 12:14:43 ERROR LiveListenerBus: SparkListenerBus has already stopped! Dropping event SparkListenerExecutorMetricsUpdate(20,WrappedArray()) 
 
  java.nio.channels.ClosedChannelException 
 
  17/05/25 12:14:43 ERROR LiveListenerBus: SparkListenerBus has already stopped! Dropping event SparkListenerExecutorMetricsUpdate(16,WrappedArray()) 
 
  java.nio.channels.ClosedChannelException 
 
  17/05/25 12:14:43 ERROR LiveListenerBus: SparkListenerBus has already stopped! Dropping event SparkListenerExecutorMetricsUpdate(7,WrappedArray()) 
 
  17/05/25 12:14:43 ERROR LiveListenerBus: SparkListenerBus has already stopped! Dropping event SparkListenerExecutorMetricsUpdate(15,WrappedArray()) 
 
  17/05/25 12:14:43 ERROR LiveListenerBus: SparkListenerBus has already stopped! Dropping event SparkListenerExecutorMetricsUpdate(12,WrappedArray()) 
 
  17/05/25 12:14:43 ERROR LiveListenerBus: SparkListenerBus has already stopped! Dropping event SparkListenerExecutorMetricsUpdate(5,WrappedArray()) 
 
  17/05/25 12:14:43 ERROR LiveListenerBus: SparkListenerBus has already stopped! Dropping event SparkListenerExecutorMetricsUpdate(20,WrappedArray()) 
 
  17/05/25 12:14:43 ERROR LiveListenerBus: SparkListenerBus has already stopped! Dropping event SparkListenerExecutorMetricsUpdate(7,WrappedArray()) 
 
  17/05/25 12:14:43 INFO YarnClientSchedulerBackend: Shutting down all executors 
 
  17/05/25 12:14:43 ERROR LiveListenerBus: SparkListenerBus has already stopped! Dropping event SparkListenerExecutorMetricsUpdate(1,WrappedArray()) 
 
  17/05/25 12:14:43 ERROR YarnScheduler: Lost executor 8 on hdp42.car.bj2.yongche.com: Executor heartbeat timed out after 143220 ms 
 
  17/05/25 12:14:43 ERROR LiveListenerBus: SparkListenerBus has already stopped! Dropping event SparkListenerExecutorMetricsUpdate(3,WrappedArray()) 
 
  17/05/25 12:14:43 ERROR YarnScheduler: Lost executor 11 on hdp46.car.bj2.yongche.com: Executor heartbeat timed out after 153371 ms 
 
  17/05/25 12:14:43 ERROR LiveListenerBus: SparkListenerBus has already stopped! Dropping event SparkListenerExecutorMetricsUpdate(4,WrappedArray()) 
 
  17/05/25 12:14:43 ERROR LiveListenerBus: SparkListenerBus has already stopped! Dropping event SparkListenerExecutorMetricsUpdate(11,WrappedArray()) 
 
  17/05/25 12:14:43 ERROR LiveListenerBus: SparkListenerBus has already stopped! Dropping event SparkListenerExecutorMetricsUpdate(6,WrappedArray()) 
 
  17/05/25 12:14:43 ERROR LiveListenerBus: SparkListenerBus has already stopped! Dropping event SparkListenerExecutorMetricsUpdate(6,WrappedArray()) 
 
  17/05/25 12:14:43 ERROR LiveListenerBus: SparkListenerBus has already stopped! Dropping event SparkListenerExecutorMetricsUpdate(14,WrappedArray()) 
 
  17/05/25 12:14:43 ERROR LiveListenerBus: SparkListenerBus has already stopped! Dropping event SparkListenerExecutorMetricsUpdate(18,WrappedArray()) 
 
  java.nio.channels.ClosedChannelException 
 
  17/05/25 12:14:43 ERROR LiveListenerBus: SparkListenerBus has already stopped! Dropping event SparkListenerExecutorMetricsUpdate(16,WrappedArray()) 
 
  17/05/25 12:14:43 ERROR LiveListenerBus: SparkListenerBus has already stopped! Dropping event SparkListenerExecutorMetricsUpdate(9,WrappedArray()) 
 
  17/05/25 12:14:43 ERROR LiveListenerBus: SparkListenerBus has already stopped! Dropping event SparkListenerExecutorMetricsUpdate(14,WrappedArray()) 
 
  17/05/25 12:14:43 ERROR LiveListenerBus: SparkListenerBus has already stopped! Dropping event SparkListenerExecutorMetricsUpdate(9,WrappedArray()) 
 
  17/05/25 12:14:43 ERROR LiveListenerBus: SparkListenerBus has already stopped! Dropping event SparkListenerExecutorMetricsUpdate(4,WrappedArray()) 
 
  17/05/25 12:14:43 WARN YarnSchedulerBackend$YarnSchedulerEndpoint: Container marked as failed: container_e2304_1488635601234_1323367_01_000009 on host: hdp42.car.bj2.yongche.com. Exit status: 1. Diagnostics: Exception from container-launch. 
 
  Exit code: 1 
 
  at org.apache.hadoop.util.Shell.runCommand(Shell.java:545) 
 
  at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:722) 
 
  at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302) 
 
  at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334) 
 
  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110) 
 
  at java.lang.Thread.run(Thread.java:722) 
 
  org.apache.spark.rpc.RpcTimeoutException: Futures timed out after [120 seconds]. This timeout is controlled by spark.rpc.askTimeout 
 
  at org.apache.spark.rpc.RpcTimeout.org$apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcTimeout.scala:48) 
 
  at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:63) 
 
  at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59) 
 
  at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:76) 
 
  at org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:101) 
 
  at org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:77) 
 
  at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.stopExecutors(CoarseGrainedSchedulerBackend.scala:335) 
 
  at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.stop(YarnClientSchedulerBackend.scala:190) 
 
  at org.apache.spark.scheduler.TaskSchedulerImpl.stop(TaskSchedulerImpl.scala:446) 
 
  at org.apache.spark.SparkContext$$anonfun$stop$9.apply$mcV$sp(SparkContext.scala:1740) 
 
  at org.apache.spark.util.Utils$.tryLogNonFatalError(Utils.scala:1219) 
 
  at com.yongche.App$.etl$1(App.scala:111) 
 
  at com.yongche.App.main(App.scala) 
 
  at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) 
 
  at java.lang.reflect.Method.invoke(Method.java:601) 
 
  at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181) 
 
  at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121) 
 
  Caused by: java.util.concurrent.TimeoutException: Futures timed out after [120 seconds] 
 
  at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223) 
 
  at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53) 
 
  at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75) 
 
  Container id: container_e2304_1488635601234_1323367_01_000009 
 
  Exit code: 1 
 
  at org.apache.hadoop.util.Shell.run(Shell.java:456) 
 
  at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:722) 
 
  at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302) 
 
  at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334) 
 
  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110) 
 
  Container exited with a non-zero exit code 1 
 
  org.apache.spark.rpc.RpcTimeoutException: Futures timed out after [120 seconds]. This timeout is controlled by spark.rpc.askTimeout 
 
  at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:63) 
 
  at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:33) 
 
  at org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:101) 
 
  at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.removeExecut

When Spark SQL reads a Hive partitioned table, every query scans all of the table's files to build the schema information and caches it on the driver; when the number of files is very large this can lead to a GC OOM. This design benefits streaming jobs that query the same table repeatedly within one submission, but is inefficient for a one-off query. The stack trace also shows memory problems while handling executor heartbeats. Possible fixes include optimizing how the schema is obtained or tuning the Spark configuration to reduce memory usage.
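
As a rough illustration of the configuration side of that, the sketch below sets a few keys; the values are placeholders rather than tuned recommendations, and in yarn-client mode the driver heap itself has to be raised with --driver-memory on spark-submit rather than in code, because the driver JVM is already running by the time SparkConf is evaluated.

  import org.apache.spark.SparkConf

  val conf = new SparkConf()
    .setAppName("etl")
    // Avoid reading every Parquet footer just to merge identical schemas.
    .set("spark.sql.parquet.mergeSchema", "false")
    // Give RPC calls more slack while the driver is busy with GC; the trace
    // above shows spark.rpc.askTimeout expiring at 120s.
    .set("spark.rpc.askTimeout", "300s")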