If you want to load a pandas-generated JSON or CSV file into HDFS, simply passing an HDFS path to pandas does not work.
PS: going through Spark SQL's CSV/JSON writers already solves this cleanly; this post is about doing the write with pandas itself.
Use the hdfs package
import pandas as pd
from hdfs import InsecureClient
First, connect to the HDFS WebHDFS URI:
# connect to HDFS; the WebHDFS port in Hadoop 3 is 9870
client_hdfs = InsecureClient('http://master:9870')
You can then use the HDFS client to perform Hadoop operations.
Write (overwrite=True replaces an existing file, roughly like put -f; with the default overwrite=False the call fails if the file already exists, like copyFromLocal)
with client_hdfs.write('/user/ang/helloworld.csv', encoding='utf-8', overwrite=True) as writer:
    df.to_csv(writer)
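The same pattern works for JSON: client_hdfs.write yields an ordinary file-like object, and pandas writers accept any file handle. A minimal sketch of the principle, using an in-memory buffer in place of the HDFS writer (the sample DataFrame and the orient choice are illustrative assumptions):

```python
import io

import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})

# client_hdfs.write(...) yields a file-like writer; any object with a
# write() method works the same way, demonstrated here with io.StringIO.
buf = io.StringIO()
df.to_json(buf, orient='records')

print(buf.getvalue())  # [{"a":1,"b":3},{"a":2,"b":4}]
```

In a real job you would replace the StringIO buffer with the writer from client_hdfs.write('/user/ang/helloworld.json', ...).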
Read
with client_hdfs.read('/user/ang/helloworld.csv', encoding='utf-8') as reader:
    df1 = pd.read_csv(reader, index_col=0)
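index_col=0 is needed because to_csv writes the DataFrame index as the first CSV column; telling read_csv to use that column as the index makes the round trip reproduce the original frame. A self-contained sketch with an in-memory buffer standing in for the HDFS reader (the sample DataFrame is an assumption):

```python
import io

import pandas as pd

df = pd.DataFrame({'x': [1, 2], 'y': [3, 4]})

buf = io.StringIO()
df.to_csv(buf)   # the index becomes the first (unnamed) CSV column
buf.seek(0)

# index_col=0 re-reads that first column as the index, so the
# round trip reproduces the original frame exactly.
df1 = pd.read_csv(buf, index_col=0)
assert df.equals(df1)
```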
Other common operations
| Operation | python hdfs | hdfs command |
| --- | --- | --- |
| Delete a file | client_hdfs.delete("path") | hadoop fs -rm path |
| Download | client_hdfs.download("hdfs_path", "local_path") | hadoop fs -copyToLocal hdfs_path local_path |
| Upload a file | client_hdfs.upload("hdfs_path", "local_path") | hadoop fs -copyFromLocal local_path hdfs_path |
| Rename | client_hdfs.rename("old_path", "new_path") | hadoop fs -mv old_path new_path |
| List files | client_hdfs.list("path") | hadoop fs -ls path |
| Create a directory | client_hdfs.makedirs("path") | hadoop fs -mkdir path |
Documentation
https://hdfscli.readthedocs.io/en/latest/api.html#hdfs.client.Client.write