2021-09-18 Stage/Job cancelled because SparkContext was shut down

Check the output log:

[2021-09-17 21:10:49,078] {ssh.py:141} INFO - 21/09/18 05:10:49 INFO yarn.Client: Application report for application_1630745810692_0149 (state: RUNNING)
[2021-09-17 21:10:50,084] {ssh.py:141} INFO - 21/09/18 05:10:50 INFO yarn.Client: Application report for application_1630745810692_0149 (state: RUNNING)
[2021-09-17 21:10:51,094] {ssh.py:141} INFO - 21/09/18 05:10:51 INFO yarn.Client: Application report for application_1630745810692_0149 (state: FAILED)
21/09/18 05:10:51 INFO yarn.Client: 
	 client token: N/A
	 diagnostics: Application application_1630745810692_0149 failed 1 times (global limit =2; local limit is =1) due to AM Container for appattempt_1630745810692_0149_000001 exited with  exitCode: -100
Failing this attempt.Diagnostics: Container released on a *lost* nodeFor more detailed output, check the application tracking page: http://bd.vn0038.jmrh.com:8088/cluster/app/application_1630745810692_0149 Then click on links to logs of each attempt.
. Failing the application.
	 ApplicationMaster host: N/A
	 ApplicationMaster RPC port: -1
	 queue: root.users.hdfs
	 start time: 1631859822991
	 final status: FAILED
	 tracking URL: http://bd.vn0038.jmrh.com:8088/cluster/app/application_1630745810692_0149
	 user: hdfs
[2021-09-17 21:10:51,426] {ssh.py:141} INFO - 21/09/18 05:10:51 INFO yarn.Client: Deleted staging directory hdfs://bd.vn0038.jmrh.com:8020/user/hdfs/.sparkStaging/application_1630745810692_0149
[2021-09-17 21:10:51,583] {ssh.py:141} INFO - 21/09/18 05:10:51 ERROR yarn.Client: Application diagnostics message: Application application_1630745810692_0149 failed 1 times (global limit =2; local limit is =1) due to AM Container for appattempt_1630745810692_0149_000001 exited with  exitCode: -100
Failing this attempt.Diagnostics: Container released on a *lost* nodeFor more detailed output, check the application tracking page: http://bd.vn0038.jmrh.com:8088/cluster/app/application_1630745810692_0149 Then click on links to logs of each attempt.
. Failing the application.
Exception in thread "main" org.apache.spark.SparkException: Application application_1630745810692_0149 finished with failed status
	at org.apache.spark.deploy.yarn.Client.run(Client.scala:1158)
	at org.apache.spark.deploy.yarn.YarnClusterApplication.start(Client.scala:1606)
	at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:851)
	at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:167)
	at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:195)
	at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86)
	at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:926)
	at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:935)
	at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
[2021-09-17 21:10:51,635] {ssh.py:141} INFO - 21/09/18 05:10:51 INFO util.ShutdownHookManager: Shutdown hook called
[2021-09-17 21:10:51,677] {ssh.py:141} INFO - 21/09/18 05:10:51 INFO util.ShutdownHookManager: Deleting directory /tmp/spark-eb5770c2-1e88-4312-b1d9-a869ead6b77f
[2021-09-17 21:10:51,695] {ssh.py:141} INFO - 21/09/18 05:10:51 INFO util.ShutdownHookManager: Deleting directory /tmp/spark-1db0d829-4b28-4892-bdde-ad3c9f7561ff

sudo -u hdfs yarn logs -applicationId application_1630745810692_0149 > application_1630745810692_0149.log
This pulls the YARN application logs down to a local file for offline inspection.
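With the file local, a plain grep is usually enough to surface the interesting lines (the file name matches the redirect above):

grep -nE "ERROR|WARN|exitCode|Killing container" application_1630745810692_0149.log | less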

Assessment

The problem most likely lies in either (1) insufficient driver-side memory or (2) insufficient HDFS space.

For (1), increase the driver-side resources (the defaults are 1 core and 1G):
--conf "spark.driver.cores=2" \
--conf "spark.driver.memory=4G" \
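For context, a rough sketch of how these flags fit into the full spark-submit command. The class and jar are taken from the AM container command line in the logs further below; the rest of the original submit options are not visible in the logs, so treat this only as a template:

spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --conf "spark.driver.cores=2" \
  --conf "spark.driver.memory=4G" \
  --class myspark.warehouse.DriverTripClassification \
  /opt/project/deltaentropy/com.deltaentropy.bigdata.jar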

For (2), reduce the job's disk footprint, or remove unneeded files from HDFS to free space.
sudo -u hdfs hdfs dfs -du -h /
Check the disk usage of each top-level path on HDFS (the two columns are the logical size and the space consumed including replication).

SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/opt/cloudera/parcels/CDH-6.3.0-1.cdh6.3.0.p0.1279813/jars/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/cloudera/parcels/CDH-6.3.0-1.cdh6.3.0.p0.1279813/lib/hadoop/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/cloudera/parcels/CDH-6.3.0-1.cdh6.3.0.p0.1279813/lib/hadoop/log4j-slf4j-impl-2.8.2.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
9.8 T  28.4 T  /data
2.5 T  7.4 T   /distcp_dir
0      0       /output
18     384 M   /system
1.2 G  3.6 G   /tmp
8.2 T  24.5 T  /user
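The per-path figures alone do not show the total DataNode capacity; when in doubt, the cluster-level picture can also be checked with the standard admin report (the summary is printed first):

sudo -u hdfs hdfs dfsadmin -report | head -n 20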

These all look fine, so start with measure (1) and resubmit the job to see whether it helps. Some related references:

https://knowledge.informatica.com/s/article/576684?language=en_US
https://mail-archives.apache.org/mod_mbox/spark-user/201607.mbox/%3CCANvfmP-OeTrcx1JXRnhrhHAx3Uc+kcCHbRNGg0v0cney33B_aQ@mail.gmail.com%3E

Message 1

ERROR yarn.ApplicationMaster: RECEIVED SIGNAL TERM

Message 2

21/09/18 05:10:50 ERROR yarn.ApplicationMaster: Exception from Reporter thread.
org.apache.hadoop.yarn.exceptions.ApplicationAttemptNotFoundException: Application attempt appattempt_1630745810692_0149_000001 doesn't exist in ApplicationMasterService cache.

Log entry

container_1632616543267_0307_01_000001] is running 477093888B beyond the 'PHYSICAL' memory limit. Current usage: 1.9 GB of 1.5 GB physical memory used; 6.3 GB of 3.1 GB virtual memory used. Killing container.

Solution: increase the driver-side memory (the default is 1G):
spark.driver.memory=4G
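As a sanity check on the numbers: in cluster mode the AM container hosts the driver, so its size is roughly spark.driver.memory plus spark.driver.memoryOverhead (which defaults to 10% of the driver memory, with a 384 MB minimum), i.e. 1 GB + 384 MB here, which rounded up to the YARN allocation increment matches the 1.5 GB limit in the message above. If raising spark.driver.memory alone is not enough, the overhead can be raised explicitly as well, for example:

--conf "spark.driver.memory=4G" \
--conf "spark.driver.memoryOverhead=1G" \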

YARN's NodeManager runs a monitor that tracks each container's memory usage and kills any container that exceeds its physical (or virtual) memory limit; the container is not resized, it is simply killed. Leaving other considerations aside, turning this check off can also make the symptom go away, although raising the memory is the cleaner fix.
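If one does want to relax the enforcement, the relevant NodeManager switches are, to my knowledge, these yarn-site.xml properties (shown in the same property=value notation as above):

yarn.nodemanager.pmem-check-enabled=false
yarn.nodemanager.vmem-check-enabled=false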

[2021-10-11 06:51:05,045] {ssh.py:141} INFO - 21/10/11 14:51:05 INFO yarn.Client: Application report for application_1632616543267_0307 (state: RUNNING)
[2021-10-11 06:51:06,049] {ssh.py:141} INFO - 21/10/11 14:51:06 INFO yarn.Client: Application report for application_1632616543267_0307 (state: FAILED)
21/10/11 14:51:06 INFO yarn.Client: 
	 client token: N/A
	 diagnostics: Application application_1632616543267_0307 failed 1 times (global limit =2; local limit is =1) due to AM Container for appattempt_1632616543267_0307_000001 exited with  exitCode: -104
Failing this attempt.Diagnostics: [2021-10-11 14:50:52.822]Container [pid=169925,containerID=container_1632616543267_0307_01_000001] is running 477093888B beyond the 'PHYSICAL' memory limit. Current usage: 1.9 GB of 1.5 GB physical memory used; 6.3 GB of 3.1 GB virtual memory used. Killing container.
Dump of the process-tree for container_1632616543267_0307_01_000001 :
	|- PID PPID PGRPID SESSID CMD_NAME USER_MODE_TIME(MILLIS) SYSTEM_TIME(MILLIS) VMEM_USAGE(BYTES) RSSMEM_USAGE(PAGES) FULL_CMD_LINE
	|- 186405 169942 169925 169925 (java) 0 0 3377315840 254662 /usr/java/jdk1.8.0_181-cloudera/bin/java -server -Xmx1024m -Djava.io.tmpdir=/data/hadoop/yarn/nm/usercache/hdfs/appcache/application_1632616543267_0307/container_1632616543267_0307_01_000001/tmp -Dspark.yarn.app.container.log.dir=/var/yarn/container-logs/application_1632616543267_0307/container_1632616543267_0307_01_000001 org.apache.spark.deploy.yarn.ApplicationMaster --class myspark.warehouse.DriverTripClassification --jar file:/opt/project/deltaentropy/com.deltaentropy.bigdata.jar --arg  --arg ods_xty --arg vnbd_gps_202106_p --arg dw_xty --arg dwd_vnbd_driver_continuous_202106_f --arg dw_xty --arg dwd_vnbd_driver_merge_202106_f --properties-file /data/hadoop/yarn/nm/usercache/hdfs/appcache/application_1632616543267_0307/container_1632616543267_0307_01_000001/__spark_conf__/__spark_conf__.properties --dist-cache-conf /data/hadoop/yarn/nm/usercache/hdfs/appcache/application_1632616543267_0307/container_1632616543267_0307_01_000001/__spark_conf__/__spark_dist_cache__.properties 
	|- 169942 169925 169925 169925 (java) 16385 958 3377315840 254662 /usr/java/jdk1.8.0_181-cloudera/bin/java -server -Xmx1024m -Djava.io.tmpdir=/data/hadoop/yarn/nm/usercache/hdfs/appcache/application_1632616543267_0307/container_1632616543267_0307_01_000001/tmp -Dspark.yarn.app.container.log.dir=/var/yarn/container-logs/application_1632616543267_0307/container_1632616543267_0307_01_000001 org.apache.spark.deploy.yarn.ApplicationMaster --class myspark.warehouse.DriverTripClassification --jar file:/opt/project/deltaentropy/com.deltaentropy.bigdata.jar --arg  --arg ods_xty --arg vnbd_gps_202106_p --arg dw_xty --arg dwd_vnbd_driver_continuous_202106_f --arg dw_xty --arg dwd_vnbd_driver_merge_202106_f --properties-file /data/hadoop/yarn/nm/usercache/hdfs/appcache/application_1632616543267_0307/container_1632616543267_0307_01_000001/__spark_conf__/__spark_conf__.properties --dist-cache-conf /data/hadoop/yarn/nm/usercache/hdfs/appcache/application_1632616543267_0307/container_1632616543267_0307_01_000001/__spark_conf__/__spark_dist_cache__.properties 
	|- 169925 169920 169925 169925 (bash) 0 0 12144640 370 /bin/bash -c LD_LIBRARY_PATH="/opt/cloudera/parcels/CDH-6.3.0-1.cdh6.3.0.p0.1279813/lib/hadoop/../../../CDH-6.3.0-1.cdh6.3.0.p0.1279813/lib/hadoop/lib/native:" /usr/java/jdk1.8.0_181-cloudera/bin/java -server -Xmx1024m -Djava.io.tmpdir=/data/hadoop/yarn/nm/usercache/hdfs/appcache/application_1632616543267_0307/container_1632616543267_0307_01_000001/tmp -Dspark.yarn.app.container.log.dir=/var/yarn/container-logs/application_1632616543267_0307/container_1632616543267_0307_01_000001 org.apache.spark.deploy.yarn.ApplicationMaster --class 'myspark.warehouse.DriverTripClassification' --jar file:/opt/project/deltaentropy/com.deltaentropy.bigdata.jar --arg '' --arg 'ods_xty' --arg 'vnbd_gps_202106_p' --arg 'dw_xty' --arg 'dwd_vnbd_driver_continuous_202106_f' --arg 'dw_xty' --arg 'dwd_vnbd_driver_merge_202106_f' --properties-file /data/hadoop/yarn/nm/usercache/hdfs/appcache/application_1632616543267_0307/container_1632616543267_0307_01_000001/__spark_conf__/__spark_conf__.properties --dist-cache-conf /data/hadoop/yarn/nm/usercache/hdfs/appcac
[2021-10-11 06:51:06,050] {ssh.py:141} INFO - he/application_1632616543267_0307/container_1632616543267_0307_01_000001/__spark_conf__/__spark_dist_cache__.properties 1> /var/yarn/container-logs/application_1632616543267_0307/container_1632616543267_0307_01_000001/stdout 2> /var/yarn/container-logs/application_1632616543267_0307/container_1632616543267_0307_01_000001/stderr 

[2021-10-11 14:51:05.137]Container killed on request. Exit code is 143
[2021-10-11 14:51:05.139]Container exited with a non-zero exit code 143. 
For more detailed output, check the application tracking page: http://bd.vn0038.jmrh.com:8088/cluster/app/application_1632616543267_0307 Then click on links to logs of each attempt.
. Failing the application.
	 ApplicationMaster host: N/A
	 ApplicationMaster RPC port: -1
	 queue: root.users.hdfs
	 start time: 1633933132795
	 final status: FAILED
	 tracking URL: http://bd.vn0038.jmrh.com:8088/cluster/app/application_1632616543267_0307
	 user: hdfs
[2021-10-11 06:51:06,116] {ssh.py:141} INFO - 21/10/11 14:51:06 INFO yarn.Client: Deleted staging directory hdfs://bd.vn0038.jmrh.com:8020/user/hdfs/.sparkStaging/application_1632616543267_0307
[2021-10-11 06:51:06,144] {ssh.py:141} INFO - 21/10/11 14:51:06 ERROR yarn.Client: Application diagnostics message: Application application_1632616543267_0307 failed 1 times (global limit =2; local limit is =1) due to AM Container for appattempt_1632616543267_0307_000001 exited with  exitCode: -104
Failing this attempt.Diagnostics: [2021-10-11 14:50:52.822]Container [pid=169925,containerID=container_1632616543267_0307_01_000001] is running 477093888B beyond the 'PHYSICAL' memory limit. Current usage: 1.9 GB of 1.5 GB physical memory used; 6.3 GB of 3.1 GB virtual memory used. Killing container.
Dump of the process-tree for container_1632616543267_0307_01_000001 :
	|- PID PPID PGRPID SESSID CMD_NAME USER_MODE_TIME(MILLIS) SYSTEM_TIME(MILLIS) VMEM_USAGE(BYTES) RSSMEM_USAGE(PAGES) FULL_CMD_LINE
	
[2021-10-11 06:51:06,145] {ssh.py:141} INFO - |- 186405 169942 169925 169925 (java) 0 0 3377315840 254662 /usr/java/jdk1.8.0_181-cloudera/bin/java -server -Xmx1024m -Djava.io.tmpdir=/data/hadoop/yarn/nm/usercache/hdfs/appcache/application_1632616543267_0307/container_1632616543267_0307_01_000001/tmp -Dspark.yarn.app.container.log.dir=/var/yarn/container-logs/application_1632616543267_0307/container_1632616543267_0307_01_000001 org.apache.spark.deploy.yarn.ApplicationMaster --class myspark.warehouse.DriverTripClassification --jar file:/opt/project/deltaentropy/com.deltaentropy.bigdata.jar --arg  --arg ods_xty --arg vnbd_gps_202106_p --arg dw_xty --arg dwd_vnbd_driver_continuous_202106_f --arg dw_xty --arg dwd_vnbd_driver_merge_202106_f --properties-file /data/hadoop/yarn/nm/usercache/hdfs/appcache/application_1632616543267_0307/container_1632616543267_0307_01_000001/__spark_conf__/__spark_conf__.properties --dist-cache-conf /data/hadoop/yarn/nm/usercache/hdfs/appcache/application_1632616543267_0307/container_1632616543267_0307_01_000001/__spark_conf__/__spark_dist_cache__.properties 
	|- 169942 169925 169925 169925 (java) 16385 958 3377315840 254662 /usr/java/jdk1.8.0_181-cloudera/bin/java -server -Xmx1024m -Djava.io.tmpdir=/data/hadoop/yarn/nm/usercache/hdfs/appcache/application_1632616543267_0307/container_1632616543267_0307_01_000001/tmp -Dspark.yarn.app.container.log.dir=/var/yarn/container-logs/application_1632616543267_0307/container_1632616543267_0307_01_000001 org.apache.spark.deploy.yarn.ApplicationMaster --class myspark.warehouse.DriverTripClassification --jar file:/opt/project/deltaentropy/com.deltaentropy.bigdata.jar --arg  --arg ods_xty --arg vnbd_gps_202106_p --arg dw_xty --arg dwd_vnbd_driver_continuous_202106_f --arg dw_xty --arg dwd_vnbd_driver_merge_202106_f --properties-file /data/hadoop/yarn/nm/usercache/hdfs/appcache/application_1632616543267_0307/container_1632616543267_0307_01_000001/__spark_conf__/__spark_conf__.properties --dist-cache-conf /data/hadoop/yarn/nm/usercache/hdfs/appcache/application_1632616543267_0307/container_1632616543267_0307_01_000001/__spark_conf__/__spark_dist_cache__.properties 
	|- 169925 169920 169925 169925 (bash) 0 0 12144640 370 /bin/bash -c LD_LIBRARY_PATH="/opt/cloudera/parcels/CDH-6.3.0-1.cdh6.3.0.p0.1279813/lib/hadoop/../../../CDH-6.3.0-1.cdh6.3.0.p0.1279813/lib/hadoop/lib/native:" /usr/java/jdk1.8.0_181-cloudera/bin/java -server -Xmx1024m -Djava.io.tmpdir=/data/hadoop/yarn/nm/usercache/hdfs/appcache/application_1632616543267_0307/container_1632616543267_0307_01_000001/tmp -Dspark.yarn.app.container.log.dir=/var/yarn/container-logs/application_1632616543267_0307/container_1632616543267_0307_01_000001 org.apache.spark.deploy.yarn.ApplicationMaster --class 'myspark.warehouse.DriverTripClassification' --jar file:/opt/project/deltaentropy/com.deltaentropy.bigdata.jar --arg '' --arg 'ods_xty' --arg 'vnbd_gps_202106_p' --arg 'dw_xty' --arg 'dwd_vnbd_driver_continuous_202106_f' --arg 'dw_xty' --arg 'dwd_vnbd_driver_merge_202106_f' --properties-file /data/hadoop/yarn/nm/usercache/hdfs/appcache/application_1632616543267_0307/container_1632616543267_0307_01_000001/__spark_conf__/__spark_conf__.properties --dist-cache-conf /data/hadoop/yarn/nm/usercache/hdfs/appcache/application_1632616543267_0307/container_1632616543267_0307_01_000001/__spark_conf__/__spark_dist_cache__.properties 1> /var/yarn/container-logs/application_1632616543267_0307/container_1632616543267_0307_01_000001/stdout 2> /var/yarn/container-logs/application_1632616543267_0307/container_1632616543267_0307_01_000001/stderr 

[2021-10-11 14:51:05.137]Container killed on request. Exit code is 143
[2021-10-11 14:51:05.139]Container exited with a non-zero exit code 143. 
For more detailed output, check the application tracking page: http://bd.vn0038.jmrh.com:8088/cluster/app/application_1632616543267_0307 Then click on links to logs of each attempt.
. Failing the application.
[2021-10-11 06:51:06,146] {ssh.py:141} INFO - Exception in thread "main" 
[2021-10-11 06:51:06,147] {ssh.py:141} INFO - org.apache.spark.SparkException: Application application_1632616543267_0307 finished with failed status
[2021-10-11 06:51:06,147] {ssh.py:141} INFO - 
[2021-10-11 06:51:06,148] {ssh.py:141} INFO - 	at org.apache.spark.deploy.yarn.Client.run(Client.scala:1158)
[2021-10-11 06:51:06,148] {ssh.py:141} INFO - 	at org.apache.spark.deploy.yarn.YarnClusterApplication.start(Client.scala:1606)
[2021-10-11 06:51:06,149] {ssh.py:141} INFO - 	at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:851)
	at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:167)
[2021-10-11 06:51:06,149] {ssh.py:141} INFO - 	at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:195)
	at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86)
	
[2021-10-11 06:51:06,150] {ssh.py:141} INFO - at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:926)
	at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:935)
[2021-10-11 06:51:06,150] {ssh.py:141} INFO - 	at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
[2021-10-11 06:51:06,152] {ssh.py:141} INFO - 21/10/11 14:51:06 INFO util.ShutdownHookManager: Shutdown hook called
[2021-10-11 06:51:06,154] {ssh.py:141} INFO - 21/10/11 14:51:06 INFO util.ShutdownHookManager: Deleting directory /tmp/spark-a991f5d0-8c67-4aed-8b2d-093ec9942cea
[2021-10-11 06:51:06,265] {ssh.py:141} INFO - 21/10/11 14:51:06 INFO util.ShutdownHookManager: Deleting directory /tmp/spark-6e5df4c5-f52f-4ccd-9af3-f8ba5e65b380

Update 2022-01-25

Ran into this error again. Even after increasing the resources as described above, the job still failed. In the YARN logs on CDH, around the time the job failed, there were two messages of note.
Checking which server the failed container ran on:
ExecutorLostFailure (executor 1 exited caused by one of the running tasks) Reason: Container marked as failed: container_1642990739641_0001_01_000002 on host: bd.vn0108.jmrh.com. Exit status: -100. Diagnostics: Container released on a lost node
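To confirm the node's state from the command line (standard YARN and OS commands; the hostname comes from the message above), something like:

yarn node -list -all | grep bd.vn0108
ssh bd.vn0108.jmrh.com "df -h"

An UNHEALTHY or LOST state for bd.vn0108, together with nearly full local disks, would match the "lost node" diagnostics.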

YARN has a mechanism whereby, once a NodeManager's disk usage exceeds 90%, the node is marked unhealthy and gradually taken out of service. That presumably explains why the containers on it received a TERM signal and then shut down.
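The threshold behind this is the NodeManager disk health checker; as far as I know it is controlled by the following yarn-site.xml property (same property=value notation as above, 90.0 being the default). Raising it, or cleaning up the node's local disks, lets the node stay schedulable:

yarn.nodemanager.disk-health-checker.max-disk-utilization-per-disk-percentage=90.0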

This error looks similar to the situation described in the reference below:
https://aws.amazon.com/cn/premiumsupport/knowledge-center/emr-exit-status-100-lost-node/
The disk mounts in this cluster are fairly messy: there are two high-spec nodes, and their disks are also mounted on several lower-spec nodes, as could be seen for the bd.vn0108 node above.
