spark如何解决文件不存在_Spark Streaming：java.io.FileNotFoundException：文件不存在：< input_filename> ._ COPYIN...

最新推荐文章于 2022-12-29 09:23:17 发布

weixin_39874196

最新推荐文章于 2022-12-29 09:23:17 发布

阅读量384

点赞数

文章标签： spark如何解决文件不存在

本文链接：https://blog.csdn.net/weixin_39874196/article/details/111862807

版权

I am writing a spark streaming application which reads input from HDFS. I submit spark application to yarn and then run a script which copies data from local fs to HDFS.

But Spark application starts throwing fileNotFoundException.

I believe this is happening because spark is picking up files before it is being copied fully onto HDFS.

Following is the some part of exception trace:

java.io.FileNotFoundException: File does not exist: ._COPYING_

at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:66)

at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:56)

at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsUpdateTimes(FSNamesystem.java:1932)

at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsInt(FSNamesystem.java:1873)

at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1853)

at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1825)

at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getBlockLocations(NameNodeRpcServer.java:559)

at org.apache.hadoop.hdfs.server.namenode.AuthorizationProviderProxyClientProtocol.getBlockLocations(AuthorizationProviderProxyClientProtocol.java:87)

at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getBlockLocations(ClientNamenodeProtocolServerSideTranslatorPB.java:363)

at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)

at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:619)

at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1060)

at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2044)

at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2040)

at java.security.AccessController.doPrivileged(Native Method)

at javax.security.auth.Subject.doAs(Subject.java:415)

at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1671)

at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2038)

at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)

at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)

at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)

at java.lang.reflect.Constructor.newInstance(Constructor.java:526)

at org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:106)

at org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:73)

Any suggestion how to resolve this?

Thanks

解决方案

You need to have your data producer filename the current file being copied different from the completely copied file. Then you need to add filter on file dstream getting the completely copied file only. E.g.

Ongoing copy file prefix _copying*

Completely copies file prefix data*

weixin_39874196

关注

0
点赞
踩
0

收藏

觉得还不错? 一键收藏
0
评论
spark如何解决文件不存在_Spark Streaming：java.io.FileNotFoundException：文件不存在：< input_filename> ._ COPYIN...

I am writing a spark streaming application which reads input from HDFS. I submit spark application to yarn and then run a script which copies data from local fs to HDFS.But Spark application starts th...
复制链接

扫一扫