PySpark error: Connection reset by peer: socket write error

PySpark fails with the following error:

Caused by: java.net.SocketException: Connection reset by peer: socket write error
	at java.net.SocketOutputStream.socketWrite0(Native Method)
	at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:111)
	at java.net.SocketOutputStream.write(SocketOutputStream.java:155)
	at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
	at java.io.BufferedOutputStream.write(BufferedOutputStream.java:126)
	at java.io.DataOutputStream.write(DataOutputStream.java:107)
	at java.io.FilterOutputStream.write(FilterOutputStream.java:97)
	at org.apache.spark.api.python.PythonRDD$.writeUTF(PythonRDD.scala:477)
	at org.apache.spark.api.python.PythonRDD$.write$1(PythonRDD.scala:297)
	at org.apache.spark.api.python.PythonRDD$.$anonfun$writeIteratorToStream$1(PythonRDD.scala:307)
	at org.apache.spark.api.python.PythonRDD$.$anonfun$writeIteratorToStream$1$adapted(PythonRDD.scala:307)
	at scala.collection.Iterator.foreach(Iterator.scala:943)
	at scala.collection.Iterator.foreach$(Iterator.scala:943)
	at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
	at org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:307)
	at org.apache.spark.api.python.PythonRunner$$anon$2.writeIteratorToStream(PythonRunner.scala:680)
	at org.apache.spark.api.python.BasePythonRunner$WriterThread.$anonfun$run$1(PythonRunner.scala:434)
	at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:2019)
	at org.apache.spark.api.python.BasePythonRunner$WriterThread.run(PythonRunner.scala:269)


Process finished with exit code 1

Solution:

Edit the file spark-2.4.3-bin-hadoop2.7\python\pyspark\worker.py, adding the marked lines at the end of the process method:

    def process():
        iterator = deserializer.load_stream(infile)
        serializer.dump_stream(func(split_index, iterator), outfile)
        # Added: drain any input the executor is still sending, so the
        # JVM writer thread never writes into a socket the worker has
        # already abandoned.
        for obj in iterator:
            pass

Then rebuild pyspark.zip in the python/lib folder so that it picks up the change; a sketch of that step follows.
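For convenience, here is a minimal sketch of that rebuild step in Python, assuming the spark-2.4.3-bin-hadoop2.7 directory layout used above; the spark_home path is an assumption, so adjust it to your actual installation:

    # Minimal sketch: rebuild python/lib/pyspark.zip after editing worker.py.
    # The spark_home path below is an assumption; point it at your install.
    import shutil

    spark_home = "spark-2.4.3-bin-hadoop2.7"

    # make_archive appends ".zip" itself; root_dir/base_dir keep the
    # "pyspark/" package at the archive root, which is where executors
    # expect to find it when the zip is added to sys.path.
    shutil.make_archive(
        spark_home + "/python/lib/pyspark",
        "zip",
        root_dir=spark_home + "/python",
        base_dir="pyspark",
    )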

The official explanation is as follows:
The issue may be that the worker process is completing before the executor has written all the data to it. The thread writing the data down the socket throws an exception and if this happens before the executor marks the task as complete it will cause trouble. The idea is to try to get the worker to pull all the data from the executor even if it's not needed to lazily compute the output. This is very inefficient of course so it is a workaround rather than a proper solution.
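For concreteness, here is a hedged sketch of the kind of job where this commonly surfaces (notably on Windows): an action such as first() or take() consumes only part of a partition, so the worker can finish before the executor's writer thread is done. The app name and data below are made up for illustration:

    # Hypothetical reproduction sketch: first() stops consuming the
    # partition early, which on some setups (notably Windows) lets the
    # worker exit while the executor is still writing to the socket.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.master("local[*]").appName("repro").getOrCreate()
    rdd = spark.sparkContext.parallelize(range(1000000), 4)

    # Only one element is needed; the rest of the partition's input may
    # still be in flight when the worker finishes, triggering the error.
    print(rdd.map(lambda x: x * 2).first())

    spark.stop()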

