The following error occurred while running a Spark job:
org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output location for shuffle 6
at org.apache.spark.MapOutputTracker$$anonfun$org$apache$spark$MapOutputTracker$$convertMapStatuses$2.apply(MapOutputTracker.scala:697)
at org.apache.spark.MapOutputTracker$$anonfun$org$apache$spark$MapOutputTracker$$convertMapStatuses$2.apply(MapOutputTracker.scala:693)
at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:...)
at org.apache.spark.MapOutputTracker$.org$apache$spark$MapOutputTracker$$convertMapStatuses(MapOutputTracker.scala:693)
at org.apache.spark.MapOutputTracker.getMapSizesByExecutorId(MapOutputTracker.scala:147)
at org.apache.spark.shuffle.BlockStoreShuffleReader.read(BlockStoreShuffleReader.scala:49)
at org.apache.spark.sql.execution.ShuffledRowRDD.compute(ShuffledRowRDD.scala:165)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:89)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:108)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
org.apache.spark.shuffle.FetchFailedException: Failed to connect to hostname/xx.xx.xx.xx:26969
at org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:513)
at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:444)
at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:61)
at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:...)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.sort_addToSorter$(Unknown Source)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:395)
at org.apache.spark.sql.execution.RowIteratorFromScala.advanceNext(RowIterator.scala:83)
at org.apache.spark.sql.execution.joins.SortMergeJoinScanner.advancedStreamed(SortMergeJoinExec.scala:776)
at org.apache.spark.sql.execution.joins.SortMergeJoinScanner.findNextOuterJoinRows(SortMergeJoinExec.scala:737)
at org.apache.spark.sql.execution.joins.OneSideOuterIterator.advanceStream(SortMergeJoinExec.scala:899)
at org.apache.spark.sql.execution.joins.OneSideOuterIterator.advanceNext(SortMergeJoinExec.scala:935)
at org.apache.spark.sql.execution.RowIteratorToScala.hasNext(RowIterator.scala:68)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:395)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.execute(FileFormatWriter.scala:313)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:256)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:...)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:259)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$apply$mcV$sp$1.apply(FileFormatWriter.scala:189)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$apply$mcV$sp$1.apply(FileFormatWriter.scala:188)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:108)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.IOException: Failed to connect to node01/xx.xx.xx.xx:26969
at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:232)
at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:182)
at org.apache.spark.network.netty.NettyBlockTransferService$$anon$1.createAndStart(NettyBlockTransferService.scala:97)
at org.apache.spark.network.shuffle.RetryingBlockFetcher.fetchAllOutstanding(RetryingBlockFetcher.java:141)
at org.apache.spark.network.shuffle.RetryingBlockFetcher.lambda$initiateRetry$0(RetryingBlockFetcher.java:169)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:144)
... 1 more
Error analysis: this exception means the shuffle metadata could not be fetched — the output location for shuffle 6 is missing, which typically means the executor that held that shuffle output was lost.
A shuffle consists of two parts: shuffle write and shuffle read.
The number of shuffle write partitions is controlled by the partition count of the previous stage's RDD, while the number of shuffle read partitions is controlled by Spark configuration parameters.
If that parameter is set very low while the shuffle read volume is large, a single task ends up processing an enormous amount of data. That can crash the executor JVM, so the shuffle fetch fails and the executor (along with the shuffle output it served) is lost.
Having established that the shuffle is the problem, the remedies are:
Increase executor memory via spark.executor.memory.
Use a broadcast or map-side join to avoid the shuffle entirely (see the sketch after this list).
Filter before the shuffle to reduce the amount of data shuffled.
Increase the parallelism.
Also check whether data skew is present.
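As a sketch of the broadcast-join remedy above — the table and column names (orders, countries, country_code) are hypothetical placeholders:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.broadcast

val spark = SparkSession.builder().appName("broadcast-join-demo").getOrCreate()

val orders    = spark.table("orders")    // large fact table: shuffling it is expensive
val countries = spark.table("countries") // small dimension table: fits in executor memory

// broadcast() ships the small table to every executor, so the join happens
// map-side and the large table never needs to be shuffled.
val joined = orders.join(broadcast(countries), Seq("country_code"))
joined.write.mode("overwrite").parquet("/tmp/orders_enriched")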
1.1 --num-executors 100
Explanation: the maximum number of executors the job may request; it does not mean 100 executors are allocated up front. The executor count is adjusted dynamically while the job runs: executors are requested while jobs have pending tasks, and an executor that stays idle too long is removed. The executor count bounds the job's parallelism.
Requesting executors: when tasks have been pending (backlogged) for more than a set time, resources are deemed insufficient and more executors are requested.
When to request: once tasks have been backlogged longer than spark.dynamicAllocation.schedulerBacklogTimeout (1s), a request is issued.
How many: requested count = (number of running and pending tasks) * spark.dynamicAllocation.executorAllocationRatio (1) / parallelism.
Removing executors:
spark.dynamicAllocation.enabled (false) controls whether dynamic resource allocation is used; the external shuffle service must be enabled for it.
spark.dynamicAllocation.executorIdleTimeout (60s): an executor idle for 60s (and holding no cached data) is reclaimed.
Effect on parallelism: the executor count is the number of workers and directly bounds the job's parallelism — strictly, executors * cores does. That is only the physical upper bound; the actual parallelism is still set in the program, i.e. by the number of RDD partitions.
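A minimal sketch of enabling the dynamic-allocation behavior described above, with the defaults quoted in this section made explicit (values are illustrative; the external shuffle service must actually be deployed on the cluster):

import org.apache.spark.sql.SparkSession

// Illustrative dynamic-allocation setup. spark.shuffle.service.enabled requires
// the external shuffle service to be running (e.g. on each YARN NodeManager).
val spark = SparkSession.builder()
  .appName("dynamic-allocation-demo")
  .config("spark.dynamicAllocation.enabled", "true")
  .config("spark.shuffle.service.enabled", "true")
  .config("spark.dynamicAllocation.maxExecutors", "100")           // upper bound on executors
  .config("spark.dynamicAllocation.schedulerBacklogTimeout", "1s") // backlog before requesting
  .config("spark.dynamicAllocation.executorIdleTimeout", "60s")    // idle time before removal
  .getOrCreate()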
1.2 --executor-memory 5g
Explanation: the memory size of each executor. Most Spark tuning and OOM fixes revolve around executor memory, and the Spark memory model describes how executor memory is divided, so managing executor memory is very important.
Allocation: this parameter is the total allocation; at runtime Spark subdivides it according to its memory model. In production you typically tune the split between cache memory and computation memory based on how much the program uses of each, so memory is used sensibly, OOMs are avoided, and performance improves.
1.3 --executor-cores 4
Explanation: the number of cores per executor, i.e. the parallelism inside one executor — how many tasks it can run at the same time.
Parallelism: the core count determines how many tasks run concurrently in one executor; more concurrent tasks also means more executor memory in use. So with executor memory fixed, you can lower the OOM risk by increasing the executor count while decreasing cores per executor, keeping the total parallelism unchanged. Conversely, if the job broadcasts large variables, more cores per executor lets more tasks share one copy of the broadcast.
1.4 --driver-memory
Explanation: the driver's memory size. Increase it when collecting large amounts of data to the driver or broadcasting large variables; 3 to 4 GB is usually enough.
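Taken together, sections 1.1 through 1.4 map to the following configuration keys; a sketch with the example sizes used above (not recommendations for any particular workload):

import org.apache.spark.sql.SparkSession

// Equivalent config keys for the spark-submit flags discussed above.
// Note: spark.driver.memory only takes effect if set before the driver JVM
// starts (spark-submit flag or spark-defaults.conf); setting it inside the
// application has no effect in client mode.
val spark = SparkSession.builder()
  .appName("resource-sizing-demo")
  .config("spark.executor.instances", "100") // --num-executors
  .config("spark.executor.memory", "5g")     // --executor-memory
  .config("spark.executor.cores", "4")       // --executor-cores
  .config("spark.driver.memory", "4g")       // --driver-memory, see note above
  .getOrCreate()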
Memory parameters
spark.storage.memoryFraction, spark.shuffle.memoryFraction (static memory management, before Spark 1.6)
Explanation: before Spark 1.6, Spark used static memory management, and these two parameters set the sizes of the cache memory and execution memory regions. From Spark 1.6 on, unified (dynamic) memory management is used and these parameters are deprecated (though they can still be made effective, via spark.memory.useLegacyMode).
spark.memory.fraction (Spark 1.6 and later, unified memory management)
Explanation: Spark 1.6 and later use unified memory management, also called dynamic memory management — as the name implies, cache memory and execution memory are managed as one pool, dynamically. On "unified": spark.memory.fraction is the fraction of heap memory used for execution, shuffle, and caching. The lower this value, the more often execution spills to disk and cached blocks get evicted. The default is usually fine; in Spark 2.2 it is 0.6, and the remaining 0.4 holds user data structures (e.g. intermediate data built inside a map operator) and Spark's internal metadata.
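As a worked example of the split, a sketch of the unified-memory arithmetic for a 5g executor heap, assuming Spark 2.2 defaults (the fixed 300 MB reserved memory is part of the unified memory model):

// Unified memory model arithmetic, assuming --executor-memory 5g, Spark 2.2 defaults.
val heapBytes       = 5L * 1024 * 1024 * 1024
val reservedBytes   = 300L * 1024 * 1024 // fixed reserved memory
val memoryFraction  = 0.6                // spark.memory.fraction
val storageFraction = 0.5                // spark.memory.storageFraction

val unifiedBytes     = ((heapBytes - reservedBytes) * memoryFraction).toLong // execution + storage
val protectedStorage = (unifiedBytes * storageFraction).toLong               // storage safe from eviction

println(f"unified pool: ${unifiedBytes / math.pow(1024, 3)}%.2f GB")          // ≈ 2.82 GB
println(f"protected storage: ${protectedStorage / math.pow(1024, 3)}%.2f GB") // ≈ 1.41 GB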
spark.memory.storageFraction
Explanation: the share of storage memory that will not be evicted, expressed as a fraction of the spark.memory.fraction pool. The higher this value, the less memory is left for execution and shuffle, so spilling to disk becomes more frequent. The default is usually fine; in Spark 2.2 it is 0.5.
spark.kryoserializer.buffer.max
Explanation: the maximum buffer size used by Kryo serialization. Collecting large amounts of data to the driver can throw a "buffer limit exceeded" exception, in which case increase this parameter. The default is 64m; a common bump when it fails is 1024m. If a single object to be serialized is very large, also increase spark.kryoserializer.buffer (default 64k).
dfs.client.block.write.locateFollowingBlock.retries
Explanation: the number of retries when closing a file after writing a block. Too few retries is the cause of the "Unable to close file because the last block does not have enough number of replicas" exception; fixed in Hadoop 2.7.4. The default is 5; if you hit the error, raise it to 6.
spark.driver.maxResultSize
Explanation: the maximum total size of results collected to the driver in one action. Increase it when you see "Total size of serialized results of 16 tasks (1048.5 MB) is bigger than spark.driver.maxResultSize (1024.0 MB)". The default is 1g; a common bump is 2g, and 0 means unlimited.
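A sketch gathering the driver-side safety valves above into one SparkConf (sizes follow the suggestions in this section):

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.kryoserializer.buffer.max", "1024m") // Kryo "buffer limit exceeded"
  .set("spark.kryoserializer.buffer", "64k")       // initial Kryo buffer per object
  .set("spark.driver.maxResultSize", "2g")         // cap on results collected to the driver
  // Hadoop client settings can be passed through with the spark.hadoop. prefix:
  .set("spark.hadoop.dfs.client.block.write.locateFollowingBlock.retries", "6")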
Shuffle parameters
spark.shuffle.file.buffer
Explanation: during shuffle write, records are first written into a BufferedOutputStream buffer and then spilled to disk. This parameter is that buffer's size: default 32k, 64k is a reasonable setting.
spark.shuffle.spill.batchSize
Explanation: the number of records serialized and deserialized per batch when the shuffle spills. The default is 10000; it can be raised, e.g. to 20000 or 50000.
spark.shuffle.io.maxRetries
Explanation: when a shuffle read fetch fails because of a network problem or a GC pause, it is retried automatically; this parameter sets the retry count. At data volumes in the billions to tens of billions of records, it is best to raise it for stability. The default is 3; 10 to 20 is a reasonable range.
spark.shuffle.io.retryWait
Explanation: the wait between the retries of spark.shuffle.io.maxRetries. The default is 5s; 20s is a reasonable setting.
spark.reducer.maxSizeInFlight
Explanation: the buffer size for shuffle read fetches, i.e. how much data is fetched at a time. Note this means fetching 48M from 5 nodes in parallel, not 48M from a single node. The default is 48m; 96m is a reasonable setting.
How it works: data is fetched from remote nodes with 5 parallel requests, each request at most 48M / 5; but fetching operates on whole blocks as the smallest unit, so the amount actually fetched can exceed the limit.
spark.reducer.maxReqsInFlight
Explanation: the number of fetch requests one task has in flight in one batch during shuffle read. The default is Int.MaxValue.
How it works: when remote requests are constructed, each single request is capped at 48M / 5, and one batch of fetches is capped at 48M, so ideally a batch issues 5 requests. But if blocks are distributed unevenly and a request ends up far smaller than 48M / 5 (say 1M), the 48M batch limit means the batch can issue as many as 48 requests. With many nodes, one task's batch can issue a very large number of requests, overwhelming the inbound connections of some nodes and causing failures.
spark.reducer.maxReqSizeShuffleToMem
Explanation: during shuffle read, a remote block larger than this value is forced straight to disk. The default is Long.MaxValue; keep it under 2G, commonly 200M. Effective from Spark 2.2 (from Spark 2.3 the parameter was renamed spark.maxRemoteBlockSizeFetchToMem). The shuffle read parameters have changed considerably across Spark versions, so always match the parameter names to your cluster's Spark version.
How it works: if the data requested in one fetch is too big to fit in memory, it goes straight to disk. Under heavy data skew a single block can be huge, and without enough memory to hold it the fetch OOMs, so it is best to bound this parameter. Another reason is Netty's hard 2G limit: anything above 2G will definitely fail. In Spark 2.4 the default is Int.MaxValue - 512 (2G, with 512 bytes reserved for metadata); Spark 3.0 also caps it at 2G and defaults to 200M.
spark.reducer.maxBlocksInFlightPerAddress
Explanation: during shuffle read, the maximum number of blocks being fetched from one node at the same time; too many can crash the executor's shuffle service or the NodeManager. The default is Int.MaxValue (supported from Spark 2.2.1).
How it works: during shuffle read, every task fetches its own blocks from the nodes where the shuffle write ran. If a shuffle write executor ran 9 tasks, it wrote 9 data files. If the shuffle read side has 1000 cores running 1000 tasks at once, and each task fetches 9 blocks from that executor, in the extreme case one shuffle write executor is hit with 9000 requests. With very many nodes, one shuffle write executor can be fetched from by many nodes simultaneously and fail.
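A sketch collecting the shuffle-stability suggestions from this section into one place; the two caps marked illustrative have no recommended value in the text and depend on cluster size:

import org.apache.spark.SparkConf

// Shuffle-stability knobs from this section; match names to your Spark version.
val shuffleConf = new SparkConf()
  .set("spark.shuffle.file.buffer", "64k")           // write buffer (default 32k)
  .set("spark.shuffle.io.maxRetries", "10")          // fetch retries (default 3)
  .set("spark.shuffle.io.retryWait", "20s")          // wait between retries (default 5s)
  .set("spark.reducer.maxSizeInFlight", "96m")       // fetch buffer (default 48m)
  .set("spark.maxRemoteBlockSizeFetchToMem", "200m") // Spark 2.3+ name; on 2.2 use
                                                     // spark.reducer.maxReqSizeShuffleToMem
  .set("spark.reducer.maxReqsInFlight", "256")             // illustrative cap (default Int.MaxValue)
  .set("spark.reducer.maxBlocksInFlightPerAddress", "128") // illustrative cap (default Int.MaxValue)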
File-related parameters
spark.sql.files.maxPartitionBytes
Explanation: the maximum bytes per partition when Spark SQL reads files; it determines the read parallelism. Default 128M. For example, a 300M text file is carved into 3 splits at 128M each, so Spark SQL reads it with at least 3 partitions.
How it works: the read parallelism = max(Spark default parallelism, number of splits (file size / this parameter)). Watch out for whether compressed files are splittable. Also note that for the Parquet format, one split corresponds to one row group.
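A tiny sketch of the headline split formula for the 300M example (a simplification: the real planner also folds in spark.sql.files.openCostInBytes and packs small files together):

// Simplified read-parallelism math for a 300 MB splittable file.
val fileSizeMb         = 300.0
val maxPartitionMb     = 128.0 // spark.sql.files.maxPartitionBytes
val defaultParallelism = 2     // whatever the cluster default happens to be

val splits      = math.ceil(fileSizeMb / maxPartitionMb).toInt // 3
val parallelism = math.max(defaultParallelism, splits)         // 3
println(s"splits=$splits, read parallelism=$parallelism")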
spark.sql.parquet.compression.codec
Explanation: the compression codec for the Parquet format; default snappy, with gzip, lzo, or uncompressed as alternatives.
spark.io.compression.codec
Explanation: the codec used to compress Spark's internal data — RDD partitions, broadcast variables, and shuffle output. The Spark 2.2 default is lz4.
spark.serializer
Explanation: the serializer Spark uses, specifically for shuffle data, broadcasts, and RDD caching. The default, org.apache.spark.serializer.JavaSerializer, performs poorly, so org.apache.spark.serializer.KryoSerializer is generally used instead. When using Kryo it is best to register the types that consume the most space, which can save a great deal of it. Task serialization is configured separately via spark.closure.serializer, which currently supports only JavaSerializer.
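A sketch of switching to Kryo and registering the space-hungry types, as suggested above; the case classes are hypothetical stand-ins for your own record types:

import org.apache.spark.SparkConf

// Hypothetical record types standing in for whatever dominates your shuffles/caches.
case class UserEvent(userId: Long, page: String, ts: Long)
case class UserProfile(userId: Long, country: String)

val kryoConf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  // Registration lets Kryo write a small class id instead of the full class name.
  .registerKryoClasses(Array(classOf[UserEvent], classOf[UserProfile]))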
spark.sql.hive.convertMetastoreParquet
Explanation: whether Spark uses its own SerDe to parse Parquet files. For better performance, Spark SQL reads Parquet tables created through the Hive metastore with its own Parquet SerDe rather than Hive's Parquet SerDe, which can cause problems with null values and decimal precision. The default is true; setting it to false makes Spark use the same SerDe as Hive.
spark.sql.parquet.writeLegacyFormat
Explanation: whether to write Parquet files in the legacy (Hive-style) format. Because of the decimal precision issue, Hive can fail to read Parquet files created by Spark; with this flag Spark writes Parquet the same way Hive does, so Hive can read the files without error. Upstream and downstream tables should also keep their precision consistent, e.g. if a field in table a is decimal(10,2), table b should also use decimal(10,2).
How it works: Hive stores Parquet decimals in one fixed representation, while the standard Parquet spec stores them as int32 or int64 depending on the precision, and Spark follows the standard spec. So the underlying storage type varies with the decimal precision, and Parquet files written by Spark can fail when read by Hive. Setting spark.sql.parquet.writeLegacyFormat (default false) to true makes Spark read and write Parquet with the same format as Hive.
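A sketch combining the two Hive-compatibility switches above, for Parquet tables shared between Spark and Hive:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("hive-parquet-compat")
  .config("spark.sql.hive.convertMetastoreParquet", "false") // read with Hive's Parquet SerDe
  .config("spark.sql.parquet.writeLegacyFormat", "true")     // write Hive-compatible decimals
  .enableHiveSupport()
  .getOrCreate()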
Sizing estimate for a 50 GB dataset