While processing offline data with PySpark SQL, the job failed with the following error.
Error message:
[INFO] 2023-06-12 00:57:59.872 +0800 [taskAppId=TASK-20230612-9836545666240_2-1453120-2222026] TaskLogLogger-class org.apache.dolphinscheduler.plugin.task.shell.ShellTask:[63] - -> 23/06/12 00:57:59 WARN DataStreamer: Exception for BP-1725103169-10.238.8.29-1660021500540:blk_1686328797_614494010
java.io.EOFException: Unexpected EOF while trying to read response from server
at org.apache.hadoop.hdfs.protocolPB.PBHelperClient.vintPrefixed(PBHelperClient.java:549)
at org.apache.hadoop.hdfs.protocol.datatransfer.PipelineAck.readFields(PipelineAck.java:213)
at org.apache.hadoop.hdfs.DataStreamer$ResponseProcessor.run(DataStreamer.java:1086)
23/06/12 00:57:59 WARN DataStreamer: Error Recovery for BP-1725103169-10.238.8.29-1660021500540:blk_1686328797_614494010 in pipeline [DatanodeInfoWithStorage[10.238.8.31:50010,DS-7fa1dfa7-bdb4-47b8-b8c7-ce88ae6cac6d,DISK], DatanodeInfoWithStorage[10.238.6.114:50010,DS-c35b1205-bdb0-4cae-99b3-0f7c5e0c5e53,DISK], DatanodeInfoWithStorage[10.5.147.97:50010,DS-83003630-7b91-4854-9735-0bb83e12ea2f,DISK]]: datanode 0(DatanodeInfoWithStorage[10.238.8.31:50010,DS-7fa1dfa7-bdb4-47b8-b8c7-ce88ae6cac6d,DISK]) is bad.
Direct cause:
A Spark UDF was defined and registered in module A, and then referenced from module B. PySpark serializes UDFs by pickling, and a function pickled by reference requires its defining module to be importable and consistent on the executors, so splitting definition and use across modules can break at run time.
Solution:
Moving the code that defines and registers the UDF into the same module as the code that references it resolved the problem.
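A minimal sketch of the fixed layout, assuming PySpark is installed; the function name `add_prefix`, the view name `users`, and the sample data are illustrative, not from the original job. The key point is that the UDF is defined, registered, and referenced in the same module:

```python
def add_prefix(value):
    # Plain Python function used as the UDF body.
    # Keeping it at module top level (not a lambda or closure) makes it
    # pickle cleanly for the executors, and testable without Spark.
    return None if value is None else f"user_{value}"


def run_job():
    # Spark-specific imports are kept local so this module stays
    # importable even where PySpark is not installed.
    from pyspark.sql import SparkSession
    from pyspark.sql.types import StringType

    spark = SparkSession.builder.appName("udf-same-module").getOrCreate()

    # Register the UDF and reference it from SQL in the SAME module.
    spark.udf.register("add_prefix", add_prefix, StringType())
    spark.createDataFrame([("alice",), ("bob",)], ["name"]) \
        .createOrReplaceTempView("users")
    return spark.sql("SELECT add_prefix(name) AS tagged FROM users").collect()
```

On a cluster, calling `run_job()` submits the query; because `add_prefix` lives in the same module as the registration and the SQL that uses it, the pickled UDF resolves to a module the executors can import.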