Reading HDFS Data with TensorFlow 2.5.0


新时代深漂农民工

Category: Machine Learning Engineering Issues. Tags: hdfs, tensorflow2

Article link: https://blog.csdn.net/u013227785/article/details/120080258


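import tensorflow as tf

# Original note: "must be three quotes" (a plain string literal also works in Python)
file_path = "hdfs://worker1:8020/tmp/tfrecord-dnn/train/*.tfrecord"
files = tf.io.gfile.glob(file_path)  # expand the glob pattern against HDFS
print(files)
dataset = tf.data.TFRecordDataset(files)
# The Example parsing function (decode_and_normalize) is omitted here
parsed_dataset = dataset.map(decode_and_normalize)
print(parsed_dataset)
batch_size = 20

parsed_dataset = parsed_dataset.batch(batch_size)
for batch in parsed_dataset:
    print(batch)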

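On the surface this looks simple: TensorFlow should provide an API for reading and writing HDFS. In practice it failed with an error very similar to the one reported in this issue.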

UnimplementedError: File system scheme hdfs not implemented #125

https://github.com/yahoo/TensorFlowOnSpark/issues/125
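The "scheme" in this error message is the prefix before `://` in the path. TensorFlow chooses a file system implementation from that prefix, so the error means no implementation is registered for `hdfs` in the installed build. The helper below is only an illustration using Python's standard library; `describe_uri` is a hypothetical name, and this is not how TensorFlow itself parses paths.

```python
from urllib.parse import urlsplit


def describe_uri(uri):
    """Split a path into the parts a scheme-based file system lookup cares about."""
    parts = urlsplit(uri)
    return {
        "scheme": parts.scheme,    # "hdfs" selects the HDFS file system
        "host": parts.hostname,    # NameNode host
        "port": parts.port,        # NameNode RPC port
        "path": parts.path,        # path (or glob pattern) inside HDFS
    }


if __name__ == "__main__":
    print(describe_uri("hdfs://worker1:8020/tmp/tfrecord-dnn/train/*.tfrecord"))
```

A path without a scheme, such as `/tmp/train.tfrecord`, yields an empty scheme and is treated as a local file.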

Reading through his troubleshooting notes, I came across one especially important passage:

The tensorflow which is installed through the pip install tensorflow occurred errors while it's running on CentOS7 (install_tensorflow_centos7). After I reinstall it with pip install tensorflow-1.2.1-cp27-cp27mu-manylinux1_x86_64.whl ,(download it) it works well.
In addition, I still have a problem, Could the mnist demo TFOS_spark_demo.ipynb run in jupyter notebook? When I am running it in jupyter notebook, the spark task has been keep waitting.
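The wheel filename in the quoted comment pins one specific build, whereas a bare `pip install tensorflow` lets pip pick a build for the current interpreter and platform. As a rough sketch, the filename can be split into its tags following the wheel naming convention (`{distribution}-{version}(-{build})?-{python}-{abi}-{platform}.whl`). `parse_wheel_filename` and `WheelTags` are hypothetical names introduced here for illustration.

```python
from typing import NamedTuple, Optional


class WheelTags(NamedTuple):
    distribution: str
    version: str
    build: Optional[str]
    python_tag: str
    abi_tag: str
    platform_tag: str


def parse_wheel_filename(filename):
    """Split a wheel filename into its name, version, and compatibility tags."""
    if not filename.endswith(".whl"):
        raise ValueError("not a wheel filename: " + filename)
    parts = filename[: -len(".whl")].split("-")
    if len(parts) == 5:
        dist, version, py, abi, plat = parts
        build = None
    elif len(parts) == 6:
        # The optional build tag sits between the version and the Python tag
        dist, version, build, py, abi, plat = parts
    else:
        raise ValueError("unexpected wheel filename layout: " + filename)
    return WheelTags(dist, version, build, py, abi, plat)


if __name__ == "__main__":
    print(parse_wheel_filename("tensorflow-1.2.1-cp27-cp27mu-manylinux1_x86_64.whl"))
```

For the file named in the issue, the tags say: CPython 2.7, the `cp27mu` ABI, and a `manylinux1_x86_64` platform.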

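Although I am on TensorFlow 2, this person's experience shows that a plain `pip install tensorflow` and a pip install of a specific whl file are not the same thing. That resolved the "file system not implemented" error, because after reinstalling, the error message changed to the following: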

Running the Python script that calls the TensorFlow reader fails with "libhdfs.so: cannot open shared object file: No such file or directory".

At first I had no idea what libhdfs.so was, so I Googled it and found:

https://www.codegrepper.com/code-examples/whatever/+libcudart.so.10.1%3A+cannot+open+shared+object+file%3A+No+such+file+or+directory+detectron2
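A "cannot open shared object file" error means the dynamic loader could not find the library on its search path. A minimal sketch for checking where `libhdfs.so` lives is shown below. It assumes the common Hadoop layout in which the native library sits under `lib/native` of `HADOOP_HDFS_HOME` or `HADOOP_HOME`, and it also scans `LD_LIBRARY_PATH`. The function names are hypothetical, and an actual installation may place the file elsewhere.

```python
import os
from pathlib import Path


def candidate_dirs(env):
    """Directories where libhdfs.so is commonly found (an assumed layout)."""
    dirs = []
    for var in ("HADOOP_HDFS_HOME", "HADOOP_HOME"):
        home = env.get(var)
        if home:
            dirs.append(Path(home) / "lib" / "native")
    for entry in env.get("LD_LIBRARY_PATH", "").split(os.pathsep):
        if entry:
            dirs.append(Path(entry))
    return dirs


def find_libhdfs(env=None):
    """Return every existing libhdfs.so under the candidate directories."""
    env = os.environ if env is None else env
    found = []
    for directory in candidate_dirs(env):
        lib = directory / "libhdfs.so"
        if lib.is_file() and lib not in found:
            found.append(lib)
    return found


if __name__ == "__main__":
    hits = find_libhdfs()
    if hits:
        for path in hits:
            print("found:", path)
    else:
        print("libhdfs.so not found via HADOOP_HDFS_HOME, HADOOP_HOME, or LD_LIBRARY_PATH")
```

If the file exists but sits in a directory that is not on `LD_LIBRARY_PATH`, adding that directory before launching Python is the usual next thing to try.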