The hosts in the code below are the NameNode nodes of the HDFS cluster; see 《学习笔记之Hdfs的Ha高可用原理》 for how to find the NameNodes.
1. snakebite
Operates HDFS via the native RPC protocol. Note that the original spotify/snakebite supports Python 2 only (a snakebite-py3 fork exists on PyPI).
GitHub: https://github.com/spotify/snakebite
Docs: https://snakebite.readthedocs.io/en/latest/client.html
# coding=utf-8
from snakebite.client import HAClient
from snakebite.namenode import Namenode

# The two NameNodes of the HA pair; 8020 is the default NameNode RPC port
n1 = Namenode("172.21.2.131", 8020)
n2 = Namenode("172.21.2.132", 8020)
client = HAClient([n1, n2], use_trash=False, effective_user='hadoop')

# ls() takes a list of paths and returns a generator of file-info dicts
for x in client.ls(['/user/hadoop/tmp.db/']):
    print(x['path'])
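Beyond ls, the same HAClient exposes most shell-style operations (mkdir, delete, text, and so on); each returns a generator that must be consumed for the RPC calls to actually execute. A minimal sketch, with hypothetical paths:
# Create a directory (create_parent behaves like mkdir -p)
for r in client.mkdir(['/user/hadoop/tmp.db/new_dir'], create_parent=True):
    print(r)
# Print the text content of a file
for line in client.text(['/user/hadoop/tmp.db/part-00000']):
    print(line)
# Delete recursively; irreversible here, since the client was created with use_trash=False
for r in client.delete(['/user/hadoop/tmp.db/new_dir'], recurse=True):
    print(r)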
2. pyhdfs
Operates HDFS via WebHDFS; this requires dfs.webhdfs.enabled to be set to true in hdfs-site.xml.
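In hdfs-site.xml the property takes the standard Hadoop configuration form:
<property>
    <name>dfs.webhdfs.enabled</name>
    <value>true</value>
</property>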
Docs: https://pypi.org/project/PyHDFS/
Reading from HDFS into a DataFrame
from pyhdfs import HdfsClient
import pandas as pd

# Pass both NameNodes and pyhdfs handles failover; 50070 is the default WebHDFS port
client = HdfsClient(hosts='172.21.2.131:50070,172.21.2.132:50070', user_name='hadoop')
# Open the HDFS file; returns a file-like object
inputfile = client.open('/user/jrapp/pyspark_featuretools_data/p0/members.csv')
# Read it with pandas (arguments: source file, encoding, separator, etc.)
df = pd.read_table(inputfile, encoding='utf-8',
                   sep=',',
                   parse_dates=['registration_init_time'],
                   infer_datetime_format=True,
                   dtype={'gender': 'category'})
print(df.head(10))
Writing a DataFrame to HDFS (this example uses the hdfs library's InsecureClient rather than pyhdfs)
from hdfs import InsecureClient
import pandas as pd

client_hdfs = InsecureClient('http://172.25.239.166:50070', user='hadoop')
# Write the DataFrame to HDFS as CSV
with client_hdfs.write('/user/jrapp/pyspark_featuretools_data/p0/tmp.csv', encoding='utf-8') as writer:
    df.to_csv(writer)
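pyhdfs can also write directly: HdfsClient.create() uploads bytes (or a file-like object) over WebHDFS, so the detour through the hdfs library is optional. A minimal sketch reusing the pyhdfs client from above, with a hypothetical target path:
# Serialize the DataFrame to CSV and create the file on HDFS
csv_bytes = df.to_csv(index=False).encode('utf-8')
client.create('/user/jrapp/pyspark_featuretools_data/p0/tmp_pyhdfs.csv',
              csv_bytes, overwrite=True)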
References:
https://creativedata.atlassian.net/wiki/spaces/SAP/pages/61177860
https://blog.csdn.net/wxfghy/article/details/80941088
3. pydoop
Must be run on a machine where the Hadoop environment variables (e.g. HADOOP_HOME, HADOOP_CONF_DIR) are already configured.
# Import pydoop's hdfs module.
# In a notebook it is recommended to first set os.environ['USER'] to your
# data-mart user name, then:
import pydoop.hdfs as hdfs

# Create a directory under the current user's HDFS home directory; if you are
# the mart_bag user, this creates hdfs://<namespace>/user/mart_bag/test
hdfs.mkdir('test')
# Long-list the files under a directory
hdfs.lsl("./test")
# Read a file
with hdfs.open('test/hello.txt', 'r') as fi:
    fi.read(3)
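For whole-file round trips, pydoop.hdfs also provides the dump() and load() convenience functions. A minimal sketch with hypothetical content; exact text/binary mode handling may vary by pydoop version:
# Write bytes to an HDFS file in one call
hdfs.dump(b'hello pydoop\n', 'test/hello.txt')
# Read the whole file back
content = hdfs.load('test/hello.txt')
print(content)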
References:
pydoop HDFS file operations: https://crs4.github.io/pydoop/tutorial/hdfs_api.html
pydoop MapReduce: https://crs4.github.io/pydoop/tutorial/mapred_api.html
pydoop API: https://crs4.github.io/pydoop/api_docs/hdfs_api.html#hdfs-api