I have recently been looking into manipulating files on an HDFS cluster from Python. Since I didn't want to pull in a third-party library for now, I went with Thrift. After borrowing from an example here, I wrote a simple script, hdfs-test.py:
import sys
sys.path.append('gen-py')          # Thrift-generated Python bindings
from hdfs import hadoopthrift_cli  # local wrapper around the generated client

host = '10.33.28.200'
port = 10086
fs_con = hadoopthrift_cli(host, port)
fs_con.connect()
fs_con.do_ls(r'hdfs://10.33.28.200:9000/')
Then I modified the server-side script start_thrift_server.sh, mainly to fix the locations of the jar files it references.
Start the server:
[root@hadoop1 scripts]# sh start_thrift_server.sh 10086
Starting the hadoop thrift server on port [10086]...
15/04/18 21:30:52 INFO hadoop.thrift: Starting the hadoop thrift server on port [10086]...
Start the client:
python hdfs-test.py
Unexpectedly, this is what came back:
[root@test py-hdfs]# python hdfs-test.py
Traceback (most recent call last):
  File "hdfs-test.py", line 11, in <module>
    fs_con.do_ls(r'hdfs://10.33.28.200:9000/')
  File "/root/py-hdfs/hdfs.py", line 297, in do_ls
    status = self.client.stat(path)
  File "gen-py/hadoopfs/ThriftHadoopFileSystem.py", line 452, in stat
    return self.recv_stat()
  File "gen-py/hadoopfs/ThriftHadoopFileSystem.py", line 463, in recv_stat
    (fname, mtype, rseqid) = self._iprot.readMessageBegin()
  File "build/bdist.linux-i686/egg/thrift/protocol/TBinaryProtocol.py", line 126, in readMessageBegin
  File "build/bdist.linux-i686/egg/thrift/protocol/TBinaryProtocol.py", line 206, in readI32
  File "build/bdist.linux-i686/egg/thrift/transport/TTransport.py", line 58, in readAll
  File "build/bdist.linux-i686/egg/thrift/transport/TTransport.py", line 159, in read
  File "build/bdist.linux-i686/egg/thrift/transport/TSocket.py", line 120, in read
thrift.transport.TTransport.TTransportException: TSocket read 0 bytes
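An aside on that exception: "TSocket read 0 bytes" is not itself the root cause. It simply means the peer closed the connection before sending a reply, so the transport hit end-of-stream while waiting for the response frame. The bare-socket sketch below (no Thrift involved, all names are illustrative) reproduces the same symptom: once the server closes the connection, the client's recv() returns zero bytes.

```python
import socket
import threading

# Server side: accept one connection and close it immediately,
# mimicking a handler that dies before sending a reply.
srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
srv.bind(('127.0.0.1', 0))        # port 0: let the OS pick a free port
srv.listen(1)
port = srv.getsockname()[1]

def accept_and_close():
    conn, _ = srv.accept()
    conn.close()                  # orderly shutdown: client will see EOF

t = threading.Thread(target=accept_and_close)
t.start()

cli = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
cli.connect(('127.0.0.1', port))
data = cli.recv(4096)             # peer sent FIN: recv returns b''
print(len(data))                  # -> 0, which Thrift reports as "read 0 bytes"
cli.close()
t.join()
srv.close()
```

So the client-side traceback only tells you the server gave up mid-call; the real error has to be found in the server's log, as below.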
The server side reported an error as well:
[root@hadoop1 scripts]# sh start_thrift_server.sh 10086
Starting the hadoop thrift server on port [10086]...
15/04/18 22:49:39 INFO hadoop.thrift: Starting the hadoop thrift server on port [10086]...
java.lang.IllegalArgumentException: Wrong FS: hdfs://10.33.28.200:9000/, expected: file:///
    at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:390)
    at org.apache.hadoop.fs.RawLocalFileSystem.pathToFile(RawLocalFileSystem.java:55)
    at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:398)
    at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:255)
    at org.apache.hadoop.thriftfs.HadoopThriftServer$HadoopThriftHandler.stat(HadoopThriftServer.java:425)
    at org.apache.hadoop.thriftfs.api.ThriftHadoopFileSystem$Processor$stat.process(Unknown Source)
    at org.apache.hadoop.thriftfs.api.ThriftHadoopFileSystem$Processor.process(Unknown Source)
    at com.facebook.thrift.server.TThreadPoolServer$WorkerProcess.run(Unknown Source)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1146)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:701)
However, if I change the path in the client script to
fs_con.do_ls(r'/')
it lists the server's local filesystem instead:
[root@test py-hdfs]# python hdfs-test.py
1 4096 1429308924000 rwxrwxrwx root root file:/tmp
1 12288 1400066227000 r-xr-xr-x root root file:/sbin
1 4096 1401269647000 rwxrwxrwx root root file:/share
1 4096 1316778468000 rwxr-xr-x root root file:/mnt
1 0 1429300092000 rw-r--r-- root root file:/.autofsck
1 12288 1400066213000 r-xr-xr-x root root file:/lib
1 16384 1396620511000 rwx------ root root file:/lost+found
1 4096 1400066199000 rwxr-xr-x root root file:/var
1 4096 1429311134000 r-xr-x--- root root file:/root
1 4096 1316778468000 rwxr-xr-x root root file:/srv
1 4096 1396620633000 rwxr-xr-x root root file:/selinux
1 1024 1396620854000 r-xr-xr-x root root file:/boot
1 0 1429300085000 rwxr-xr-x root root file:/sys
1 0 1400066782000 rw-r--r-- root root file:/.autorelabel
1 4096 1316778468000 rwxr-xr-x root root file:/home
1 4096 1401446050000 rwxr-xr-x root root file:/media
1 0 1429300085000 r-xr-xr-x root root file:/proc
1 4096 1429364692000 rwxr-xr-x root root file:/etc
1 4096 1416433067000 rwxr-xr-x root root file:/usr
1 4096 1316778468000 rwxr-xr-x root root file:/opt
1 3720 1429300104000 rwxr-xr-x root root file:/dev
1 4096 1400066213000 r-xr-xr-x root root file:/bin
My hunch was that the project couldn't find the filesystem configuration, i.e. the HDFS address declared in core-site.xml. But putting the conf directory on the path didn't help. I then read through the relevant code, including HadoopThriftServer.java, checking how the various path and pathname fields are handled, and found nothing wrong there either.
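For reference, the address Hadoop looks for is the fs.default.name property in core-site.xml (fs.defaultFS on newer Hadoop versions); for this cluster the entry would look like:

```xml
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://10.33.28.200:9000</value>
  </property>
</configuration>
```

When this property is absent from the server's effective configuration, Hadoop falls back to the default file:///, which is exactly what the "Wrong FS: ..., expected: file:///" error above is complaining about.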
As the saying goes: for foreign affairs ask Google, for domestic affairs ask Baidu. Perhaps my search skills just aren't up to par; a whole day on Google turned up nothing, but Baidu actually had the solution. Details here.
Sure enough, the project couldn't find the configuration file: in this setup it has to sit in the project directory. For this project, that means putting core-site.xml in the same directory as start_thrift_server.sh. After that:
[root@test py-hdfs]# python hdfs-test.py
0 0 1413390909861 rwxr-xr-x root supergroup hdfs://10.33.28.200:9000/root
0 0 1413412534130 rwxr-xr-x root supergroup hdfs://10.33.28.200:9000/user
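The lesson generalizes: if the thrift server cannot find a core-site.xml declaring an hdfs:// default filesystem, Hadoop silently falls back to file:///. As a quick sanity check, the stdlib sketch below (not part of the project; the helper name is mine) extracts whatever default filesystem a given core-site.xml actually declares:

```python
import xml.etree.ElementTree as ET

def default_fs(core_site_xml):
    """Return the default filesystem URI declared in a core-site.xml string,
    or None if no such property is present (Hadoop then uses file:///)."""
    root = ET.fromstring(core_site_xml)
    for prop in root.findall('property'):
        if prop.findtext('name') in ('fs.default.name', 'fs.defaultFS'):
            return prop.findtext('value')
    return None

sample = """
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://10.33.28.200:9000</value>
  </property>
</configuration>
"""
print(default_fs(sample))   # -> hdfs://10.33.28.200:9000
```

Running this against the core-site.xml that the server process actually picks up (the one next to start_thrift_server.sh, in this setup) tells you immediately whether you are about to talk to HDFS or to the local filesystem.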