The problem
Colleagues in another department built a Spark Streaming application that consumes data from Kafka. After running for over a month, it stopped consuming data, yet the application stayed in the RUNNING state on YARN.
Opening the Spark UI via the ApplicationMaster revealed something odd: on the Streaming page, batches that had been processed every 4 seconds simply stopped executing at some point.
The stderr log showed:
Exception in thread "pool-23-thread-1" java.lang.OutOfMemoryError: unable to create new native thread
at java.lang.Thread.start0(Native Method)
at java.lang.Thread.start(Thread.java:717)
at java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:957)
at java.util.concurrent.ThreadPoolExecutor.processWorkerExit(ThreadPoolExecutor.java:1025)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1167)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Exception in thread "JobGenerator" java.lang.OutOfMemoryError: unable to create new native thread
at java.lang.Thread.start0(Native Method)
at java.lang.Thread.start(Thread.java:717)
at java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:957)
at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1367)
at org.apache.spark.streaming.CheckpointWriter.write(Checkpoint.scala:290)
at org.apache.spark.streaming.scheduler.JobGenerator.doCheckpoint(JobGenerator.scala:297)
at org.apache.spark.streaming.scheduler.JobGenerator.org$apache$spark$streaming$scheduler$JobGenerator$$processEvent(JobGenerator.scala:186)
at org.apache.spark.streaming.scheduler.JobGenerator$$anon$1.onReceive(JobGenerator.scala:89)
at org.apache.spark.streaming.scheduler.JobGenerator$$anon$1.onReceive(JobGenerator.scala:88)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
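This flavor of OutOfMemoryError usually means the OS refused to create another native thread (a per-user process/thread limit was hit, or native memory was exhausted), not that the Java heap was full. On Linux, the relevant limits can be inspected like this; the `<pid>` in the commented-out command is a placeholder, not a value from this incident:

```shell
# Per-user limit on processes/threads; a low value here is a common culprit
ulimit -u

# System-wide ceiling on threads (Linux-specific)
cat /proc/sys/kernel/threads-max

# Number of native threads a running JVM currently holds
# (replace <pid> with the driver or executor process id):
# ps -o nlwp= -p <pid>
```

Comparing the JVM's live thread count against these limits over time can confirm whether a thread leak, rather than heap pressure, is what triggers the error.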
Strangely, the OutOfMemoryError on the executor did not fail the job as a whole, so the application that monitors job status never noticed the job had effectively died, and no alert was raised.
Job resource configuration
Executors: 1; memory per executor: 1 GB
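For reference, this sizing corresponds to the following spark-submit flags; the class name and jar are hypothetical placeholders, not the actual job's:

```shell
# Original (OOM-prone) sizing: a single executor with 1 GB of memory
# spark-submit --master yarn --deploy-mode cluster \
#   --num-executors 1 --executor-memory 1g \
#   --class com.example.StreamingJob app.jar

# Revised sizing per the fix below: raise executor memory to 2 GB
spark-submit --master yarn --deploy-mode cluster \
  --num-executors 1 --executor-memory 2g \
  --class com.example.StreamingJob app.jar
```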
Solution
1. Increase executor memory to 2 GB and resubmit the job.
2. Ask the code's developers to review it and figure out why the executor process died while the driver process did not.
My guess is that either the code kept the Error thrown by the executor from reaching the driver, or the driver received the executor's Error but did not terminate abnormally. In the other Spark Streaming jobs this team runs, exceptions thrown by executors at runtime have always caused the driver to fail.
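A minimal Java sketch (not the actual job's code) illustrating the second half of that guess: an Error thrown inside a thread-pool worker kills only that worker thread. The default uncaught-exception handler prints the stack trace to stderr, the pool replaces the dead worker, and the rest of the JVM keeps running, which is exactly the "stderr shows OutOfMemoryError but the process stays RUNNING" pattern seen here.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class PoolErrorDemo {

    // Returns true if the calling ("driver") thread survives an Error
    // thrown inside a pool worker thread.
    public static boolean survivesPoolError() {
        ExecutorService pool = Executors.newFixedThreadPool(1);
        // The Error kills only the worker thread; it is never rethrown
        // in the thread that called execute().
        pool.execute(() -> {
            throw new OutOfMemoryError("simulated: unable to create new native thread");
        });
        pool.shutdown();
        try {
            pool.awaitTermination(5, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        // Reached only because the Error did not propagate to this thread.
        return true;
    }

    public static void main(String[] args) {
        System.out.println("driver still alive: " + survivesPoolError());
    }
}
```

If the job's threads behave like this, one mitigation (an assumption, not something the source confirms) is a driver-side watchdog that checks batch progress and calls `StreamingContext.stop()` when batches stall, so the YARN state reflects reality.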