Python cannot read from HDFS: requests.exceptions.ConnectionError: HTTPConnectionPool(host='big08', port=50075): Ma

Article link: https://blog.csdn.net/u010916338/article/details/97377766

1. Problem description:

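When working with HDFS through Python's hdfs library, the connection to HDFS itself works normally, but reading a file's contents with the code below fails.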

from hdfs.client import Client


# Read an HDFS file and return its lines (stripped of whitespace) as a list
def read_hdfs_file(client, filename):
    lines = []
    with client.read(filename, encoding='utf-8', delimiter='\n') as reader:
        for line in reader:
            lines.append(line.strip())
    # Print once after reading, not on every iteration
    print(lines)
    return lines


client = Client("http://192.168.129.14:50070", root='/')
print('Connection OK')
data = read_hdfs_file(client,'/input/python/data1')

Error message:

requests.exceptions.ConnectionError: HTTPConnectionPool(host='big08', port=50075): Max retries exceeded with url: /webhdfs/v1/input/python/data1?op=OPEN&namenoderpcaddress=ns&offset=0 (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f1557e064a8>: Failed to establish a new connection: [Errno 111] Connection refused',))

2. Solution

(1) The root path was not specified. Pass root when constructing the Client that connects to HDFS:


client = Client("http://10.0.30.9:50070", root='/')

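(2) The hosts file on the machine running the Python program has no hostname-to-IP mappings for the cluster nodes. Add a mapping for every node in the cluster.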
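
A quick way to confirm this on the machine running the Python program is to check whether each cluster hostname resolves. The following is a minimal sketch using only the standard library; unresolved_hosts is a hypothetical helper name, and the only hostname known from the error above is big08, so the real list of cluster hostnames has to be supplied by the reader.

```python
import socket


def unresolved_hosts(hostnames):
    """Return the hostnames that this machine cannot resolve to an IP."""
    missing = []
    for name in hostnames:
        try:
            # Uses the system resolver, which consults the hosts file
            socket.gethostbyname(name)
        except socket.gaierror:
            missing.append(name)
    return missing
```

Calling unresolved_hosts(['big08']) before and after editing the hosts file should show whether the new mapping has taken effect: any name still returned is one the client machine cannot reach by hostname.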

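Explanation: I run a cluster of 6 virtual machines and access it from a 7th virtual machine. In principle the entry IP address is already specified above, and a hosts file only exists to resolve hostnames to IPs, yet specifying the IP directly is not enough. I have not verified the internal mechanism, but the error message points to the cause: although the client is given the IP of the active NameNode, the failing request goes to host='big08', port=50075, not to the NameNode. This is consistent with how WebHDFS serves reads: the NameNode answers an OPEN request with an HTTP redirect to a DataNode, and that redirect names the DataNode by hostname, so the client machine has to be able to resolve that name through its hosts file.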
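
WebHDFS serves a file read in two steps: the NameNode replies to an OPEN request with an HTTP redirect, and the client then fetches the data from the DataNode named in the Location header. The sketch below, which uses only the standard library, sends the OPEN request without following the redirect so that the DataNode address can be inspected. The function names are hypothetical, and the sketch assumes an unsecured cluster that accepts the request without extra parameters such as a user name.

```python
import http.client
from urllib.parse import urlsplit


def redirect_target(location):
    """Extract (hostname, port) from a redirect Location URL."""
    parts = urlsplit(location)
    return parts.hostname, parts.port


def datanode_for(namenode_host, namenode_port, path):
    """Send a WebHDFS OPEN request to the NameNode and return the
    (hostname, port) it redirects to, or None if there is no redirect."""
    # http.client does not follow redirects, so the Location header is visible
    conn = http.client.HTTPConnection(namenode_host, namenode_port, timeout=10)
    try:
        conn.request("GET", f"/webhdfs/v1{path}?op=OPEN")
        resp = conn.getresponse()
        location = resp.getheader("Location")
    finally:
        conn.close()
    return redirect_target(location) if location else None
```

With the values from this post, datanode_for('192.168.129.14', 50070, '/input/python/data1') should return the hostname and port of a DataNode. If that hostname is one the client machine cannot resolve or reach, the read fails even though the NameNode was addressed by IP.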