org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output location for shuffle 13

The following error occurred while running Spark:


org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output location for shuffle 6
at org.apache.spark.MapOutputTracker$$anonfun$org$apache$spark$MapOutputTracker$$convertMapStatuses$2.apply(MapOutputTracker.scala:697)
at org.apache.spark.MapOutputTracker$$anonfun$org$apache$spark$MapOutputTracker$$convertMapStatuses$2.apply(MapOutputTracker.scala:693)
at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala)
at org.apache.spark.MapOutputTracker$.org$apache$spark$MapOutputTracker$$convertMapStatuses(MapOutputTracker.scala:693)
at org.apache.spark.MapOutputTracker.getMapSizesByExecutorId(MapOutputTracker.scala:147)
at org.apache.spark.shuffle.BlockStoreShuffleReader.read(BlockStoreShuffleReader.scala:49)
at org.apache.spark.sql.execution.ShuffledRowRDD.compute(ShuffledRowRDD.scala:165)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:89)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:108)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

org.apache.spark.shuffle.FetchFailedException: Failed to connect to hostname/xx.xx.xx.xx:26969
at org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:513)
at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:444)
at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:61)
at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.sort_addToSorter(Unknown Source)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:395)
at org.apache.spark.sql.execution.RowIteratorFromScala.advanceNext(RowIterator.scala:83)
at org.apache.spark.sql.execution.joins.SortMergeJoinScanner.advancedStreamed(SortMergeJoinExec.scala:776)
at org.apache.spark.sql.execution.joins.SortMergeJoinScanner.findNextOuterJoinRows(SortMergeJoinExec.scala:737)
at org.apache.spark.sql.execution.joins.OneSideOuterIterator.advanceStream(SortMergeJoinExec.scala:899)
at org.apache.spark.sql.execution.joins.OneSideOuterIterator.advanceNext(SortMergeJoinExec.scala:935)
at org.apache.spark.sql.execution.RowIteratorToScala.hasNext(RowIterator.scala:68)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:395)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.execute(FileFormatWriter.scala:313)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:256)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:259)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$apply$mcV$sp$1.apply(FileFormatWriter.scala:189)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$apply$mcV$sp$1.apply(FileFormatWriter.scala:188)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:108)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.IOException: Failed to connect to node01/xx.xx.xx.xx:26969
at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:232)
at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:182)
at org.apache.spark.network.netty.NettyBlockTransferService$$anon$1.createAndStart(NettyBlockTransferService.scala:97)
at org.apache.spark.network.shuffle.RetryingBlockFetcher.fetchAllOutstanding(RetryingBlockFetcher.java:141)
at org.apache.spark.network.shuffle.RetryingBlockFetcher.lambda$initiateRetry$0(RetryingBlockFetcher.java:169)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:144)
… 1 more
Error analysis: the exception says that the metadata (output location) for a shuffle cannot be found, i.e. the shuffle output is gone, typically because there was no longer enough memory for the shuffle and the executor that produced it died.
Shuffle consists of two parts: shuffle read and shuffle write.

The number of shuffle-write partitions is controlled by the number of partitions of the previous stage's RDD, while the number of shuffle-read partitions is controlled by Spark configuration parameters.

If the shuffle-read parallelism parameter is set very small while the shuffle-read volume is large, a single task has to process a very large amount of data. That can crash the JVM, so fetching the shuffle data fails, and the executor is lost as well.

Now that we know the shuffle is the problem, here is how to address it (a configuration/code sketch follows the list):

Increase executor memory via spark.executor.memory
Consider a broadcast / map-side join to avoid the shuffle altogether
Filter before the shuffle to reduce the amount of shuffled data
Increase the parallelism
Also check whether data skew is occurring
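As a minimal, hedged sketch of these mitigations in one place (the table names, column names, and values below are placeholders for illustration, not recommendations):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.broadcast

val spark = SparkSession.builder()
  .appName("shuffle-tuning-sketch")
  .config("spark.executor.memory", "8g")           // 1. give each executor more memory
  .config("spark.sql.shuffle.partitions", "800")   // 4. raise shuffle-read parallelism
  .getOrCreate()

// 3. filter before the join so less data is shuffled
val big   = spark.table("big_table").filter("dt = '2020-01-01'")
val small = spark.table("small_dim")

// 2. broadcast the small side (map-side join) so the big side is not shuffled
val joined = big.join(broadcast(small), Seq("id"), "left")
joined.write.mode("overwrite").saveAsTable("result_table")
```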

1.1 --num-executors 100
Meaning: the maximum number of executors the job may request; it does not mean 100 executors are allocated up front. With dynamic allocation the number of executors is adjusted while the job runs: executors are requested when there are pending jobs/tasks, and an executor that stays idle too long is removed. The number of executors determines the job's parallelism.

Requesting executors: when tasks have been pending (backlogged) for longer than a threshold, resources are considered insufficient and more executors are requested.

When to request: when the pending backlog has lasted longer than spark.dynamicAllocation.schedulerBacklogTimeout (1s), a request is made.
How many to request: number requested = (number of running and pending tasks) * spark.dynamicAllocation.executorAllocationRatio (1) / parallelism per executor.
Removing executors:

spark.dynamicAllocation.enabled (false) decides whether dynamic resource allocation is used; the external shuffle service must be enabled.
spark.dynamicAllocation.executorIdleTimeout (60s): an executor idle for 60s (and holding no cached data) is removed.
Determining parallelism: the number of executors is the number of worker processes and directly determines the job's parallelism, or more precisely executors * cores. This is only the maximum parallelism the cluster physically provides; the actual parallelism is still set by the program, i.e. the number of RDD partitions. A configuration sketch follows.
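A hedged configuration sketch of the dynamic-allocation behaviour described above (the values mirror the defaults mentioned in the text, not tuning advice; note that spark.dynamicAllocation.executorAllocationRatio only exists from Spark 2.4 on):

```scala
import org.apache.spark.sql.SparkSession

// Sketch: dynamic allocation with an upper bound of 100 executors.
val spark = SparkSession.builder()
  .config("spark.dynamicAllocation.enabled", "true")
  .config("spark.shuffle.service.enabled", "true")                  // external shuffle service is required
  .config("spark.dynamicAllocation.maxExecutors", "100")            // cap, like --num-executors 100
  .config("spark.dynamicAllocation.schedulerBacklogTimeout", "1s")  // request executors after 1s of backlog
  .config("spark.dynamicAllocation.executorIdleTimeout", "60s")     // remove executors idle (and cache-free) for 60s
  .getOrCreate()
```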

1.2 --executor-memory 5g
Meaning: the amount of memory per executor. Spark tuning and OOM troubleshooting usually come down to adjusting executor memory, and the Spark memory model describes how that memory is divided, so executor memory management is very important.
Memory allocation: this parameter is the total allocation; at run time the Spark memory model subdivides that total. In production you typically split it according to how much cache memory and execution memory the program needs, so that memory is used sensibly, OOM is avoided, and performance improves.
1.3 --executor-cores 4
Meaning: the number of cores per executor, i.e. the parallelism inside a single executor: how many tasks one executor can run at the same time.
Parallelism: the core count determines how many tasks an executor runs concurrently, and the more concurrent tasks, the more executor memory they occupy. So with executor memory fixed, you can increase the number of executors and reduce cores per executor to keep total parallelism unchanged while lowering the OOM risk. If the job broadcasts large variables, increasing cores lets more tasks share each broadcast copy. A back-of-envelope example follows.
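A rough illustration of that trade-off, assuming 5 GB executors and 400 total task slots (it ignores the reserved/unified-memory split, so the numbers are only indicative):

```scala
// Same total parallelism, different memory per concurrently running task.
val memPerTaskA = 5.0 / 4 // 100 executors x 4 cores, 5 GB each -> ~1.25 GB per running task
val memPerTaskB = 5.0 / 2 // 200 executors x 2 cores, 5 GB each -> ~2.5  GB per running task
```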
1.4 --driver-memory
Meaning: the amount of driver memory. If you collect a lot of data to the driver or broadcast large variables, increase driver memory; 3-4 GB is usually enough.
Memory parameters
spark.storage.memoryFraction, spark.shuffle.memoryFraction (static memory management, before Spark 1.6)
Meaning: before Spark 1.6, Spark used static memory management, and these two parameters determined the sizes of the storage (cache) memory and the execution memory. From Spark 1.6 on, unified memory management (also called dynamic memory management) is used and these two parameters are deprecated (although they can still be made to take effect).
spark.memory.fraction (Spark 1.6 and later, unified memory management)
Meaning: Spark 1.6 and later use unified memory management, also called dynamic memory management: cache memory and execution memory are managed together and the boundary between them is dynamic. "Unified" means spark.memory.fraction is the fraction of heap memory used for execution, shuffle, and caching; the lower this value, the more often execution spills to disk and cached blocks get evicted. The default is usually fine; in Spark 2.2 it is 0.6, and the remaining 0.4 is used for user data structures (e.g. intermediate data built inside a map operator) and Spark's internal metadata.
spark.memory.storageFraction
Meaning: the amount of storage memory that will not be evicted, expressed as a fraction of spark.memory.fraction. The higher this value, the less memory is left for execution and shuffle, so spilling to disk becomes more frequent. The default is usually fine; in Spark 2.2 it is 0.5.
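A small sketch of how these two fractions carve up a 5 GB executor heap under Spark 2.x defaults (it assumes the usual ~300 MB reserved memory; exact numbers vary by version):

```scala
// Rough unified-memory arithmetic for --executor-memory 5g
val heapMb      = 5 * 1024                // 5120 MB heap
val unifiedMb   = (heapMb - 300) * 0.6    // spark.memory.fraction = 0.6 -> execution + storage, ~2892 MB
val protectedMb = unifiedMb * 0.5         // spark.memory.storageFraction = 0.5 -> ~1446 MB storage not evicted
val userMb      = (heapMb - 300) * 0.4    // the rest for user data structures and Spark metadata
```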
spark.kryoserializer.buffer.max
Meaning: the maximum buffer size used for Kryo serialization. If a large amount of data is collected to the driver, a "buffer limit exceeded" exception may be thrown; increase this parameter in that case. The default is 64m; if it fails, try 1024m. If a single serialized object is very large, also increase spark.kryoserializer.buffer (default 64k).
dfs.client.block.write.locateFollowingBlock.retries
Meaning: the number of retries when closing a file after writing a block; it is behind the "Unable to close file because the last block does not have enough number of replicas" exception, which is fixed in Hadoop 2.7.4. The default is 5; if it fails, try 6.
spark.driver.maxResultSize
Meaning: the maximum total size of results collected to the driver in one action. Increase it when you hit "Total size of serialized results of 16 tasks (1048.5 MB) is bigger than spark.driver.maxResultSize (1024.0 MB)". The default is 1g; if it fails, try 2g; 0 means unlimited.
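A combined sketch of the driver-side settings above (the values mirror the "if it fails, try X" suggestions in the text; dfs.client.block.write.locateFollowingBlock.retries is an HDFS client setting, shown here forwarded through the spark.hadoop.* prefix):

```scala
import org.apache.spark.sql.SparkSession

// Note: spark.driver.memory must be set before the driver JVM starts
// (i.e. via spark-submit --driver-memory), so only the runtime-settable properties appear here.
val spark = SparkSession.builder()
  .config("spark.driver.maxResultSize", "2g")          // cap on results collected to the driver
  .config("spark.kryoserializer.buffer.max", "1024m")  // Kryo buffer ceiling for large serialized objects
  .config("spark.hadoop.dfs.client.block.write.locateFollowingBlock.retries", "6") // forwarded to the HDFS client
  .getOrCreate()
```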
Shuffle parameters
spark.shuffle.file.buffer
Meaning: during shuffle write, data is first written into a BufferedOutputStream buffer and then spilled to disk; this parameter is the size of that buffer. Default 32k; 64k is recommended.
spark.shuffle.spill.batchSize
Meaning: during a shuffle spill, data has to be serialized and deserialized; this is the number of records processed per batch. The default is 10000; it can be raised to 20000-50000.
spark.shuffle.io.maxRetries
Meaning: when a shuffle-read fetch fails because of a network problem or a GC pause, it is retried automatically; this parameter sets the number of retries. At data volumes in the billions to tens of billions of rows, raise it for stability. The default is 3; 10-20 is recommended.
spark.shuffle.io.retryWait
Meaning: the wait between the retries configured by spark.shuffle.io.maxRetries. The default is 5s; 20s is recommended. A combined configuration sketch follows.
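A hedged sketch combining the shuffle-side settings above (values are the ones suggested in the text; spark.shuffle.spill.batchSize is an internal, sparsely documented setting, so treat it as an assumption):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .config("spark.shuffle.file.buffer", "64k")       // shuffle-write buffer before spilling to disk
  .config("spark.shuffle.spill.batchSize", "20000") // records (de)serialized per spill batch
  .config("spark.shuffle.io.maxRetries", "10")      // retries on fetch failure (network/GC)
  .config("spark.shuffle.io.retryWait", "20s")      // wait between retries
  .getOrCreate()
```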
spark.reducer.maxSizeInFlight
Meaning: the shuffle-read fetch buffer size, i.e. how much data is fetched at a time. Note that the 48 MB is fetched from 5 parallel requests, not 48 MB from a single node. Default 48m; 96m is recommended.
How it works: when pulling data from remote nodes, 5 requests are sent in parallel and each request is limited to 48M / 5; but fetching always happens in whole blocks, so the amount actually fetched can exceed that limit.
spark.reducer.maxReqsInFlight
Meaning: during shuffle read, the number of requests one task sends concurrently in a single batch. The default is Int.MaxValue.
How it works: when building remote requests, a single request is limited to 48M / 5, while a batch of fetches is limited to 48M, so ideally a batch sends 5 requests. But if blocks are distributed unevenly and a request ends up far smaller than 48M / 5 (say 1M), the 48M batch limit means the batch may send 48 requests. With many nodes, one task's batch can issue a very large number of requests, overloading the inbound connections of some nodes and causing failures.
spark.reducer.maxReqSizeShuffleToMem
Meaning: during shuffle read, a remote block larger than this value is written straight to disk. The default is Long.MaxValue; keep it below 2 GB, commonly 200M. It takes effect from Spark 2.2 (from Spark 2.3 the parameter is spark.maxRemoteBlockSizeFetchToMem). The shuffle-read parameters have changed a lot across Spark versions, so always tune against the Spark version your cluster runs.
How it works: if the data requested in one fetch is too large to fit in memory, it goes straight to disk. For heavily skewed jobs a single block can be huge, and without enough memory to hold it the task OOMs, so it is best to cap this parameter. Another reason is that Netty's hard limit is 2 GB, so anything above 2 GB fails anyway. In Spark 2.4 the default is Int.MaxValue - 512 (about 2 GB, with 512 bytes reserved for metadata); in Spark 3.0 the maximum is also 2 GB and the default is 200M.
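A sketch of the shuffle-read fetch limits discussed above (values are illustrative; the exact name of the fetch-to-disk threshold depends on the Spark version, spark.reducer.maxReqSizeShuffleToMem in 2.2 versus spark.maxRemoteBlockSizeFetchToMem from 2.3 on):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .config("spark.reducer.maxSizeInFlight", "96m")       // per-batch fetch budget, split across 5 parallel requests
  .config("spark.reducer.maxReqsInFlight", "256")       // cap on concurrent fetch requests per batch
  .config("spark.maxRemoteBlockSizeFetchToMem", "200m") // blocks larger than this go straight to disk (Spark 2.3+)
  .getOrCreate()
```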
spark.reducer.maxBlocksInFlightPerAddress
Meaning: during shuffle read, the maximum number of blocks fetched from one address at the same time; if it is too high, the external shuffle service or the NodeManager may crash. Default Int.MaxValue (supported since Spark 2.2.1).

How it works: during shuffle read, every task pulls its own blocks from the nodes where the shuffle write ran. If a shuffle-write executor ran 9 tasks, it wrote 9 data files; if the shuffle-read side has 1000 cores, 1000 tasks run at once and each needs 9 blocks from that executor, so in the extreme case one shuffle-write executor receives 9000 requests. When there are very many nodes, one shuffle-write executor is hit by many nodes at once and fails. The arithmetic is sketched below.
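The worked arithmetic behind that paragraph, with the cap as a comment (the 9-file / 1000-core figures come from the example above; the cap value is illustrative):

```scala
// Extreme-case request count against a single shuffle-write executor
val mapFilesPerExecutor = 9      // shuffle-write executor ran 9 map tasks -> 9 data files
val concurrentReducers  = 1000   // shuffle-read side has 1000 cores -> 1000 tasks at once
val worstCaseRequests   = mapFilesPerExecutor * concurrentReducers  // 9000 simultaneous block requests

// Limiting how many blocks one address serves at a time would be, e.g.:
// spark.conf.set("spark.reducer.maxBlocksInFlightPerAddress", "64")
```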

File-related parameters
spark.sql.files.maxPartitionBytes
Meaning: when Spark SQL reads files, the maximum number of bytes per partition; it determines the read parallelism. Default 128M. For example, a 300 MB text file is split into 3 slices of at most 128 MB, so Spark SQL reads it with at least 3 partitions.
How it works: the read parallelism = max(Spark's default parallelism, number of slices (file size / this parameter)). Be careful about whether compressed files are splittable. Also note that for Parquet, a slice corresponds to a row group. A worked example follows.
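A small sketch of the slice arithmetic described above (this follows the simplified rule from the text; Spark's actual planner also factors in a per-file open cost, and the default parallelism of 8 is an assumption):

```scala
// 300 MB text file, 128 MB max per partition -> 3 read partitions
val fileBytes       = 300L * 1024 * 1024
val maxPartitionB   = 128L * 1024 * 1024                                   // spark.sql.files.maxPartitionBytes
val slices          = math.ceil(fileBytes.toDouble / maxPartitionB).toInt  // = 3
val readParallelism = math.max(8 /* assumed default parallelism */, slices)
```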
spark.sql.parquet.compression.codec
Meaning: the compression codec for Parquet output. The default is snappy; gzip, lzo, or uncompressed are also possible.
spark.io.compression.codec
Meaning: the compression codec used for RDD partitions, broadcast variables, and shuffle output; the default in Spark 2.2 is lz4.
spark.serializer
Meaning: the serializer Spark uses for shuffle data, broadcasts, and RDD caching. The default, org.apache.spark.serializer.JavaSerializer, performs poorly, so org.apache.spark.serializer.KryoSerializer is normally used instead. When using Kryo, it is best to register the classes that take up the most space, which saves a lot of memory. The serializer for Spark task closures is configured by spark.closure.serializer, which currently only supports JavaSerializer. A registration sketch follows.
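A minimal Kryo sketch, assuming a hypothetical application class MyRecord:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

// Hypothetical application class that occupies a lot of space in shuffles / caches.
case class MyRecord(id: Long, name: String, tags: Seq[String])

val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .registerKryoClasses(Array(classOf[MyRecord])) // registering avoids writing the full class name per object

val spark = SparkSession.builder().config(conf).getOrCreate()
```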
spark.sql.hive.convertMetastoreParquet
Meaning: whether Spark uses its own SerDe to read Parquet files. For better performance, when reading Parquet files of tables created through the Hive metastore, Spark SQL uses its own Parquet SerDe rather than Hive's for serialization and deserialization, which can cause problems with null values and decimal precision. The default is true; set it to false to use the same SerDe as Hive.
spark.sql.parquet.writeLegacyFormat
Meaning: whether to write Parquet files in the legacy (Hive-style) format. Because of the decimal precision issue, Hive can fail to read Parquet files created by Spark; with this setting Spark writes Parquet in the same format Hive uses, so Hive can read it without errors. It is also best to keep decimal precision consistent between upstream and downstream tables, e.g. if a column in table a is decimal(10,2), the corresponding column in table b should also be decimal(10,2).
How it works: Hive writes Parquet decimals with one fixed legacy representation, while the standard Parquet spec stores them as int32 or int64 depending on precision, and Spark follows the standard spec. So for some precisions the underlying storage type differs, and Hive fails when reading Parquet files written by Spark. Setting spark.sql.parquet.writeLegacyFormat (default false) to true makes Spark read and write Parquet with the same format convention as Hive. A configuration sketch follows.
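A hedged sketch of the two Hive/Parquet compatibility switches described above (the database and table names are hypothetical):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().enableHiveSupport().getOrCreate()

// Read Hive-created Parquet tables with Hive's SerDe semantics (null / decimal handling)
spark.conf.set("spark.sql.hive.convertMetastoreParquet", "false")

// Write Parquet in the legacy (Hive-compatible) decimal layout so Hive can read it back
spark.conf.set("spark.sql.parquet.writeLegacyFormat", "true")

spark.sql("INSERT OVERWRITE TABLE target_db.target_table SELECT * FROM source_db.source_table")
```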
[Image: example parameter values, estimated for a data volume of about 50 GB]
