While using Kettle to copy files from a Windows client to a Hadoop cluster, the job failed as shown below: the file was created on HDFS, but no data was ever written to it, and the following error was reported.
2017/04/12 15:32:08 - Hadoop Copy Files - Processing row, source file/folder: [E:/ArcGIS/2009-05-16_BJ_4.csv] ... destination file/folder: [hdfs://192.168.190.129:9000/nytaxi/csv]... wildcard: [ ^.*\.csv]
2017/04/12 15:32:08 - Hadoop Copy Files - ERROR (version 5.1.0.0, build 1 from 2014-06-19_19-02-57 by buildguy) : File system exception: Could not copy "file:///E:/ArcGIS/2009-05-16_BJ_4.csv" to "hdfs://192.168.190.129:9000/nytaxi/csv/2009-05-16_BJ_4.csv".
2017/04/12 15:32:08 - Hadoop Copy Files - ERROR (version 5.1.0.0, build 1 from 2014-06-19_19-02-57 by buildguy) : Caused by: Could not close the output stream for file "hdfs://192.168.190.129:9000/nytaxi/csv/2009-05-16_BJ_4.csv".
2017/04/12 15:32:08 - Hadoop Copy Files - ERROR (version 5.1.0.0, build 1 from 2014-06-19_19-02-57 by buildguy) : Caused by: DataStreamer Exception:
2017/04/12 15:32:08 - Hadoop Copy Files - ERROR (version 5.1.0.0, build 1 from 2014-06-19_19-02-57 by buildguy) : Caused by: Could not initialize class org.apache.hadoop.hdfs.protocol.HdfsConstants
2017/04/12 15:32:08 - job_local2hdfs - Finished job entry [Hadoop Copy Files] (result=[false])
2017/04/12 15:32:08 - job_local2hdfs - Job execution finished
2017/04/12 15:32:08 - Spoon - The job has ended.
I searched Baidu and fiddled with this for a long time without finding the cause, until I finally came across the following on the official Pentaho site:
Problem: Hadoop copy files step creates an empty file in HDFS and hangs or never writes any data.
Check: The Hadoop client side API that Pentaho calls to copy files to HDFS requires that PDI has network connectivity to the nodes in the cluster. The DNS names or IP addresses used within the cluster must resolve the same relative to the PDI machine as they do in the cluster. When PDI requests to put a file into HDFS, the Name Node will return the DNS names (or IP addresses, depending on the configuration) of the actual nodes that the data will be copied to.
The gist is that it is a network problem: the Windows PDI machine is not on the same network segment as the cluster, so it cannot reach (or resolve) the DataNode addresses that the NameNode returns. The suggested workaround is to install Kettle on Linux and run its graphical interface through Xshell in Xmanager.
Reference: http://www.mamicode.com/info-detail-1458941.html
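To see why reachability of the DataNodes matters, here is a minimal sketch of the kind of client-side copy that the Hadoop Copy Files step performs, written directly against the Hadoop FileSystem API. It is only an illustration under assumptions: the class name CopyToHdfs is made up, the NameNode address and paths are reused from the log above, and the Hadoop client jars on the classpath must match the cluster version.

```java
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CopyToHdfs {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // The client first contacts the NameNode to create the file entry...
        try (FileSystem fs = FileSystem.get(URI.create("hdfs://192.168.190.129:9000"), conf)) {
            // ...but the actual bytes are streamed directly to the DataNodes whose
            // addresses the NameNode returns. If those hostnames/IPs do not resolve
            // or are unreachable from this machine, the output stream cannot be
            // written or closed, and an empty file is left behind -- the same
            // symptom as the Kettle step above.
            fs.copyFromLocalFile(
                new Path("file:///E:/ArcGIS/2009-05-16_BJ_4.csv"),
                new Path("/nytaxi/csv/2009-05-16_BJ_4.csv"));
        }
    }
}
```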
So here is the problem: the whole point was to upload files directly from Windows to HDFS and skip the intermediate step of first copying them to Linux. But if Kettle has to be deployed on Linux, the files still have to be uploaded to the Linux machine anyway, which defeats the purpose.