file_path = "hdfs://worker1:8020/tmp/tfrecord-dnn/train/*.tfrecord" # 一定要三个引号
files = tf.io.gfile.glob(file_path)
print(files)
dataset = tf.data.TFRecordDataset(files)
#Example这里就省略了哈
parsed_dataset = dataset.map(decode_and_normalize)
print(parsed_dataset)
batch_size = 20
parsed_dataset = parsed_dataset.batch(batch_size)
for i in parsed_dataset:
print(i)
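Since the Example parsing is skipped above, here is a minimal sketch of what decode_and_normalize could look like; the feature names, shapes, and normalization are hypothetical and must match however the TFRecords were actually written:

def decode_and_normalize(serialized_example):
    # hypothetical schema -- feature names and shapes are assumptions
    feature_spec = {
        "features": tf.io.FixedLenFeature([32], tf.float32),  # assumed 32-dim input
        "label": tf.io.FixedLenFeature([], tf.int64),
    }
    parsed = tf.io.parse_single_example(serialized_example, feature_spec)
    # placeholder normalization: scale by the largest absolute value
    x = parsed["features"] / tf.reduce_max(tf.abs(parsed["features"]))
    return x, parsed["label"]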
On the surface this looks simple, since TensorFlow is supposed to provide read/write APIs for HDFS, but it failed with much the same bug this blogger ran into:
UnimplementedError: File system scheme hdfs not implemented #125
https://github.com/yahoo/TensorFlowOnSpark/issues/125
Going through his troubleshooting notes, I found one particularly important remark:
The tensorflow which is installed through pip install tensorflow throws errors while running on CentOS 7 (install_tensorflow_centos7). After I reinstalled it with pip install tensorflow-1.2.1-cp27-cp27mu-manylinux1_x86_64.whl (download it), it works well.
In addition, I still have a problem: could the mnist demo TFOS_spark_demo.ipynb run in a jupyter notebook? When I run it in jupyter notebook, the spark task keeps waiting.
Although I am on TensorFlow 2, this person's experience shows that pip install tensorflow and pip-installing a whl file are not the same thing, which explains the "file system scheme not implemented" error. Sure enough, after I reinstalled, the error message changed to the following:
When the Python script invoked the TensorFlow reader, it failed with "libhdfs.so: cannot open shared object file: No such file or directory".
At first I had no idea what this libhdfs.so was, so I Googled it and found a similar report: that person was missing a CUDA library, whereas in my case the HDFS one was missing. That also led me to another blogger's post on connecting TensorFlow to HDFS:
https://blog.csdn.net/ustbbsy/article/details/116529836
I followed his steps, but hit exactly the same bug.
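At this point it is worth checking whether the dynamic loader can find libhdfs.so at all. A minimal sketch in Python using ctypes (whether the load succeeds depends entirely on your LD_LIBRARY_PATH):

import ctypes

# try to load libhdfs.so the same way TensorFlow does at runtime;
# an OSError means the dynamic loader cannot locate the library
try:
    ctypes.CDLL("libhdfs.so")
    print("libhdfs.so found and loadable")
except OSError as e:
    print("libhdfs.so not loadable:", e)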
Eventually I found where libhdfs.so actually lives in a CDH deployment:
/opt/cloudera/parcels/CDH/lib64
Finally, I set the following environment variables in /etc/profile.
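The original post showed the settings as a screenshot; what follows is a sketch of the usual variables TensorFlow's HDFS support needs, assuming the CDH parcel layout above (JAVA_HOME and the exact Hadoop paths are assumptions to adapt to your cluster):

# /etc/profile additions (sketch -- adjust paths to your cluster)
export JAVA_HOME=/usr/java/default                       # assumed JDK location
export HADOOP_HOME=/opt/cloudera/parcels/CDH/lib/hadoop  # CDH parcel layout
# make libhdfs.so (and libjvm.so) visible to the dynamic loader
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/opt/cloudera/parcels/CDH/lib64:$JAVA_HOME/jre/lib/amd64/server
# TensorFlow's HDFS client also needs the Hadoop jars on the classpath
export CLASSPATH=$($HADOOP_HOME/bin/hadoop classpath --glob)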
Special note: the server where TensorFlow is installed is itself a CDH node.