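MindSpore's dataset conversion tools can convert CV and NLP datasets, public datasets such as CIFAR-10/CIFAR-100 and ImageNet, and TFRecord-format datasets into the MindSpore data format (MindRecord). For details, see: MindSpore data format conversion.
The following script performs the MindSpore data format conversion on Ascend with MindSpore 1.3: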
from mindspore.mindrecord import FileWriter
data_record_path = './datasets/convert_dataset_to_mindrecord/data_to_mindrecord/test.mindrecord'
writer = FileWriter(file_name=data_record_path, shard_num=4)

# Define the schema
data_schema = {"file_name": {"type": "string"}, "label": {"type": "int32"}, "data": {"type": "bytes"}}
writer.add_schema(data_schema, "test_schema")

# Prepare the data
file_name = "./datasets/convert_dataset_to_mindrecord/images/transform.jpg"
with open(file_name, "rb") as f:
    bytes_data = f.read()
data = [{"file_name": "transform.jpg", "label": 1, "data": bytes_data}]

indexes = ["file_name", "label"]
writer.add_index(indexes)

# Write the data
writer.write_raw_data(data)

# Generate the local files
writer.commit()
Running it produces the following error:
MRMOpenError Traceback (most recent call last)
/tmp/ipykernel_108737/3747345416.py in <module>
17
18 # Write the data
---> 19 writer.write_raw_data(data)
20
21 # Generate the local files
/opt/nvme0n1/root/miniforge3/envs/mdp/lib/python3.7/site-packages/mindspore/mindrecord/filewriter.py in write_raw_data(self, raw_data, parallel_writer)
250 """
251 if not self._writer.is_open:
--> 252 self._writer.open(self._paths)
253 if not self._writer.get_shard_header():
254 self._writer.set_shard_header(self._header)
/opt/nvme0n1/root/miniforge3/envs/mdp/lib/python3.7/site-packages/mindspore/mindrecord/shardwriter.py in open(self, paths)
53 if ret != ms.MSRStatus.SUCCESS:
54 logger.error("Failed to open paths")
---> 55 raise MRMOpenError
56 self._is_open = True
57 return ret
MRMOpenError: [MRMOpenError]: MindRecord File could not open successfully.
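Cause analysis:
Before MindSpore 1.6.0, generating a MindRecord dataset does not support overwriting existing files. When a file with the same name already exists in the output directory, the exception message does not accurately describe the actual problem, so you need to check the log. As shown below, the second line of the log reports that a MindRecord file with the same name already exists in the output directory and must be deleted first.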
[ERROR] ME(108737:281473404131632,MainProcess):2022-04-11-17:39:33.512.803 [mindspore/mindrecord/shardwriter.py:54] Failed to open paths
[ERROR] MD(108737,ffffa2445930,python3.7):2022-04-11-17:39:33.512.713 [mindspore/ccsrc/minddata/mindrecord/io/shard_writer.cc:92] OpenDataFiles] MindRecord file already existed, please delete file: /opt/nvme0n1/l00475263/workspace/datasets/convert_dataset_to_mindrecord/data_to_mindrecord/test.mindrecord0
[ERROR] MD(108737,ffffa2445930,python3.7):2022-04-11-17:39:33.512.752 [mindspore/ccsrc/minddata/mindrecord/io/shard_writer.cc:167] Open] Open data files failed.
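Solution:
1. Add deletion logic to the code so that any same-named MindRecord files in the output directory are removed before each run.
2. In MindSpore 1.6.0 and later, pass overwrite=True when creating the FileWriter object to enable overwriting.
Change line 3 of the script (the FileWriter construction) to: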
writer = FileWriter(file_name=data_record_path, shard_num=4, overwrite=True)
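For solution 1, the deletion logic could look roughly like the sketch below. It is only an illustration under stated assumptions: the helper name remove_stale_mindrecord is made up for this example, and it assumes that shard files are named by appending a number to file_name (as with test.mindrecord0 in the log above) and that any index files sit next to them with a .db suffix. Adjust the matching rule if your output directory looks different.

```python
import glob
import os


def remove_stale_mindrecord(file_name):
    """Delete leftovers of an earlier run (hypothetical helper).

    Assumed naming: file_name, file_name0, file_name1, ... and the same
    names with a ".db" suffix for index files.
    """
    removed = []
    # glob.escape protects special characters that may occur in the path.
    for path in glob.glob(glob.escape(file_name) + "*"):
        suffix = path[len(file_name):]
        # Strip an optional ".db" so "0" and "0.db" are treated alike.
        shard = suffix[:-3] if suffix.endswith(".db") else suffix
        # Only touch exact matches and numbered shards; unrelated files
        # such as "test.mindrecord_backup" are left alone.
        if (shard == "" or shard.isdigit()) and os.path.isfile(path):
            os.remove(path)
            removed.append(path)
    return sorted(removed)
```

Calling remove_stale_mindrecord(data_record_path) before constructing the FileWriter clears the old output, so the script can be rerun on versions that lack overwrite=True.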