[SPARK-30602] SPIP: Support push-based shuffle to improve shuffle efficiency - ASF JIRA (apache.org)
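Problem background
When the business runs large computations with heavy shuffle write and read, tasks often fail. The backend logs are full of connection failures to port 7337, which is the service port of our shuffle server. Disk I/O and CPU load on the servers are very high, and at certain moments the number of connections to 7337 spikes sharply.
Note: this record only walks through the problem and does not contain any business information.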
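To get a rough picture of which shuffle servers are dropping connections, the "connection from host:7337 is closed" errors in the executor logs can be tallied per host. The following is only a minimal sketch: the function name count_closed_connections is hypothetical, and the regular expression assumes the log line format of the TransportResponseHandler error in this record; the two sample lines are taken from that log.

```python
import re
from collections import Counter

# Matches the endpoint in lines such as
# "... when connection from 10.12.13.194:7337 is closed".
# Assumption: the log format is the one shown in this record.
ENDPOINT_RE = re.compile(
    r"connection from (\d{1,3}(?:\.\d{1,3}){3}):(\d+) is closed"
)


def count_closed_connections(lines, port=7337):
    """Count 'connection ... is closed' errors per host for the given port."""
    counts = Counter()
    for line in lines:
        match = ENDPOINT_RE.search(line)
        if match and int(match.group(2)) == port:
            counts[match.group(1)] += 1
    return counts


sample = [
    "21/12/09 08:07:55 ERROR TransportResponseHandler: Still have 1 requests "
    "outstanding when connection from 10.12.13.194:7337 is closed",
    "21/12/09 08:07:55 ERROR BlockManager: Failed to connect to external "
    "shuffle server, will retry 2 more times after waiting 5 seconds...",
]

# Print each host together with its number of closed-connection errors.
for host, n in count_closed_connections(sample).most_common():
    print(host, n)
```

Hosts with the highest counts are the first candidates for checking disk I/O, CPU load, and connection counts on port 7337.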
21/12/09 08:07:55 ERROR TransportResponseHandler: Still have 1 requests outstanding when connection from 10.12.13.194:7337 is closed
21/12/09 08:07:55 ERROR BlockManager: Failed to connect to external shuffle server, will retry 2 more times after waiting 5 seconds...
java.lang.RuntimeException: java.util.concurrent.TimeoutException: Timeout waiting for task.
at org.spark_project.guava.base.Throwables.propagate(Throwables.java:160)
at org.apache.spark.network.client.TransportClient.sendRpcSync(TransportClient.java:258)
at org.apache.spark.network.shuffle.ExternalShuffleClient.registerWithShuffleServer(ExternalShuffleClient.java:144)
at org.apache.spark.storage.BlockManager$$anonfun$registerWit