The hosts in the code below are the NameNode nodes of the HDFS cluster; see 《学习笔记之Hdfs的Ha高可用原理》 for how to find the NameNodes.
1. snakebite
Operates HDFS via the native RPC protocol. Note that the original spotify/snakebite supports Python 2 only (a snakebite-py3 fork exists on PyPI).
GitHub: https://github.com/spotify/snakebite
Docs: https://snakebite.readthedocs.io/en/latest/client.html
# coding=utf-8
from snakebite.client import HAClient
from snakebite.namenode import Namenode

# The two NameNodes of the HA pair; 8020 is the default NameNode RPC port
n1 = Namenode("172.21.2.131", 8020)
n2 = Namenode("172.21.2.132", 8020)
client = HAClient([n1, n2], use_trash=False, effective_user='hadoop')

# ls() takes a list of paths and returns a generator of file-info dicts
for x in client.ls(['/user/hadoop/tmp.db/']):
    print(x['path'])
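Beyond ls, the same HAClient exposes most shell-style operations (mkdir, delete, text, and so on); each returns a generator that must be consumed for the RPC calls to actually execute. A minimal sketch, with hypothetical paths:
# Create a directory (create_parent behaves like mkdir -p)
for r in client.mkdir(['/user/hadoop/tmp.db/new_dir'], create_parent=True):
    print(r)
# Print the text content of a file
for line in client.text(['/user/hadoop/tmp.db/part-00000']):
    print(line)
# Delete recursively; irreversible here, since the client was created with use_trash=False
for r in client.delete(['/user/hadoop/tmp.db/new_dir'], recurse=True):
    print(r)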
2. pyhdfs
Operates HDFS via WebHDFS; this requires dfs.webhdfs.enabled to be set to true in hdfs-site.xml.
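In hdfs-site.xml the property takes the standard Hadoop configuration form:
<property>
    <name>dfs.webhdfs.enabled</name>
    <value>true</value>
</property>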
Docs: https://pypi.org/project/PyHDFS/
Reading from HDFS into a DataFrame
from pyhdfs import HdfsClient
import pandas as pd

# Pass both NameNodes and pyhdfs handles failover; 50070 is the default WebHDFS port
client = HdfsClient(hosts='172.21.2.131:50070,172.21.2.132:50070', user_name='hadoop')
# Open the HDFS file; returns a file-like object
inputfile = client.open('/user/jrapp/pyspark_featuretools_data/p0/members.csv')
# Read it with pandas (arguments: source file, encoding, separator, etc.)
df = pd.read_table(inputfile, encoding='utf-8',
                   sep=',',
                   parse_dates=['registration_init_time'],
                   infer_datetime_format=True,
                   dtype={'gender': 'category'})
print(df.head(10))
Writing a DataFrame to HDFS (this example uses the hdfs library's InsecureClient rather than pyhdfs)
from hdfs import InsecureClient
import pandas as pd

client_hdfs = InsecureClient('http://172.25.239.166:50070', user='hadoop')
# Write the DataFrame to HDFS as CSV
with client_hdfs.write('/user/jrapp/pyspark_featuretools_data/p0/tmp.csv', encoding='utf-8') as writer:
    df.to_csv(writer)
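pyhdfs can also write directly: HdfsClient.create() uploads bytes (or a file-like object) over WebHDFS, so the detour through the hdfs library is optional. A minimal sketch reusing the pyhdfs client from above, with a hypothetical target path:
# Serialize the DataFrame to CSV and create the file on HDFS
csv_bytes = df.to_csv(index=False).encode('utf-8')
client.create('/user/jrapp/pyspark_featuretools_data/p0/tmp_pyhdfs.csv',
              csv_bytes, overwrite=True)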
References:
https://creativedata.atlassian.net/wiki/spaces/SAP/pages/61177860
https://blog.csdn.net/wxfghy/article/details/80941088
3. pydoop
Must be run on a machine where the Hadoop environment variables (e.g. HADOOP_HOME, HADOOP_CONF_DIR) are already configured.
# Import pydoop's hdfs module.
# In a notebook it is recommended to first set os.environ['USER'] to your
# data-mart user name, then:
import pydoop.hdfs as hdfs

# Create a directory under the current user's HDFS home directory; if you are
# the mart_bag user, this creates hdfs://<namespace>/user/mart_bag/test
hdfs.mkdir('test')
# Long-list the files under a directory
hdfs.lsl("./test")
# Read a file
with hdfs.open('test/hello.txt', 'r') as fi:
    fi.read(3)
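For whole-file round trips, pydoop.hdfs also provides the dump() and load() convenience functions. A minimal sketch with hypothetical content; exact text/binary mode handling may vary by pydoop version:
# Write bytes to an HDFS file in one call
hdfs.dump(b'hello pydoop\n', 'test/hello.txt')
# Read the whole file back
content = hdfs.load('test/hello.txt')
print(content)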
References:
pydoop HDFS file operations: https://crs4.github.io/pydoop/tutorial/hdfs_api.html
pydoop MapReduce: https://crs4.github.io/pydoop/tutorial/mapred_api.html
pydoop API: https://crs4.github.io/pydoop/api_docs/hdfs_api.html#hdfs-api