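When create_dict_iterator() is called on a MindSpore Dataset, it launches the data loading and processing pipeline, so some errors surface at that moment. The root cause, however, may lie outside this interface, and users need to analyze the error log further to locate it.
The following describes two kinds of errors that occur when calling create_dict_iterator: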
1. 'DictIterator' has no attribute 'get_next'
# Take one sample from the test set and feed it to the model for prediction
test_ = ds_test.create_dict_iterator().get_next()
# Select the sample by its key
test = Tensor(test_['x'], mindspore.float32)

AttributeError: 'DictIterator' object has no attribute 'get_next'
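Cause analysis:
create_dict_iterator() returns a DictIterator object, which inherits from the internal Iterator class. Iterator implements the iterator protocol through the two special methods __iter__() and __next__().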
 1 class Iterator:
 2     """
 3     General Iterator over a dataset.
 4
 5     Attributes:
 6         dataset: Dataset to be iterated over
 7     """
 8
 9     def __init__(self, dataset, num_epochs=-1, output_numpy=False, do_copy=True):
10         ......
11
12     def __iter__(self):
13         return self
14
15     def __next__(self):
16         if not self._runtime_context:
17             logger.warning("Iterator does not have a running C++ pipeline." +
18                            "It might because Iterator stop() had been called, or C++ pipeline crashed silently.")
19             raise RuntimeError("Iterator does not have a running C++ pipeline.")
20
21         data = self._get_next()
22         if not data:
23             if self.__index == 0:
24                 logger.warning("No records available.")
25             if self.__ori_dataset.dataset_size is None:
26                 self.__ori_dataset.dataset_size = self.__index
27             raise StopIteration
28         self.__index += 1
29
30         if self.offload_model is not None:
31             data = offload.apply_offload_iterators(data, self.offload_model)
32
33         return data
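As the Iterator definition shows, calling __next__ returns the next record. Line 21, data = self._get_next(), indicates that the actual fetching is implemented in the _get_next() method of the subclass DictIterator or TupleIterator. Users can fetch the next record from the iterator in two ways: next(ds_test.create_dict_iterator()) or for item in ds_test.create_dict_iterator():.
In earlier versions, get_next() was a public method, so ds_test.create_dict_iterator().get_next() returned the next record as a dict.
Solution: fetch the processed data through the iterator protocol.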
1. next(ds_test.create_dict_iterator())
2. for item in ds_test.create_dict_iterator():
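The difference between the two working forms and the failing call can be sketched without MindSpore. ToyDictIterator below is a hypothetical stand-in written only for illustration; it is not MindSpore's DictIterator, but it follows the same pattern of a public __next__() that delegates to a private _get_next().

```python
class ToyDictIterator:
    """Hypothetical stand-in for DictIterator; not MindSpore code."""

    def __init__(self, rows):
        self._rows = list(rows)
        self._index = 0

    def __iter__(self):
        return self

    def __next__(self):
        data = self._get_next()
        if not data:
            raise StopIteration
        self._index += 1
        return data

    def _get_next(self):
        # Private helper: returns the next row as a dict, or {} when exhausted.
        if self._index >= len(self._rows):
            return {}
        return dict(self._rows[self._index])


rows = [{"x": 1.0}, {"x": 2.0}]

# Works: the built-in next() calls __next__().
first = next(ToyDictIterator(rows))

# Works: a for loop drives the same protocol until StopIteration.
collected = [item["x"] for item in ToyDictIterator(rows)]

# Fails: the class exposes no public get_next() method.
try:
    ToyDictIterator(rows).get_next()
    error_message = None
except AttributeError as err:
    error_message = str(err)
```

Note that each call to create_dict_iterator() builds a new iterator, so calling next() on a freshly created iterator every time would presumably keep returning the first record; to walk through the dataset, keep one iterator object and reuse it.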
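2. "Invalid data type" error when calling create_dict_iterator
The user loads data with GeneratorDataset and has defined a custom random-access data source.
The error message is as follows: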
1 E:\anaconda\envs\mindspore\lib\site-packages\mindspore\dataset\engine\datasets.py:3533: VisibleDeprecationWarning: Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes) is deprecated. If you meant to do this, you must specify 'dtype=object' when creating the ndarray.
2 yield tuple([np.array(x, copy=False) for x in val])
3 Traceback (most recent call last):
4 File "C:/Users/wkml996/Desktop/hypertext/mindspore/project/main.py", line 18, in <module>
5 for data in train_iter.create_dict_iterator():
6 File "E:\anaconda\envs\mindspore\lib\site-packages\mindspore\dataset\engine\iterators.py", line 122, in __next__
7 data = self._get_next()
8 File "E:\anaconda\envs\mindspore\lib\site-packages\mindspore\dataset\engine\iterators.py", line 173, in _get_next
9 raise err
10 File "E:\anaconda\envs\mindspore\lib\site-packages\mindspore\dataset\engine\iterators.py", line 166, in _get_next
11 return {k: self._transform_tensor(t) for k, t in self._iterator.GetNextAsMap().items()}
12 RuntimeError: Unexpected error. Invalid data type.
13 Line of code : 114
14 File : D:\jenkins\agent-working-dir\workspace\Compile_CPU_Windows\mindspore\mindspore\ccsrc\minddata\dataset\core\tensor.cc
15
16 WARNING: Logging before InitGoogleLogging() is written to STDERR
17 [ERROR] MD(23788,2,?):2021-9-27 18:49:1 [mindspore\ccsrc\minddata\dataset\core\data_type.cc:159] FromNpArray] Cannot convert from numpy type. Unknown data type is returned!
18 [ERROR] MD(23788,2,?):2021-9-27 18:49:1 [mindspore\ccsrc\minddata\dataset\core\data_type.cc:159] FromNpArray] Cannot convert from numpy type. Unknown data type is returned!
19 [ERROR] MD(23788,2,?):2021-9-27 18:49:1 [mindspore\ccsrc\minddata\dataset\util\task_manager.cc:217] InterruptMaster] Task is terminated with err msg(more detail in info level log):Unexpected error. Invalid data type.
20 Line of code : 114
21 File : D:\jenkins\agent-working-dir\workspace\Compile_CPU_Windows\mindspore\mindspore\ccsrc\minddata\dataset\core\tensor.cc
22
23
24 Process finished with exit code 1
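Cause analysis: line 12 of the error message reports "Invalid data type", which means the dtype of the input NumPy array is not what MindSpore expects. MindSpore accepts NumPy arrays whose dtype is int, float, or str.
When the user script converts data with np.array() and the input is a list or tuple whose elements are lists, tuples, or ndarrays of different lengths or shapes, the resulting NumPy array has the unexpected dtype object, which causes MindSpore to fail when loading the data.
For example, suppose one_sample[0] is a list of ndarrays: one_sample[0] = [np.array([1,2]), np.array([1,2,3])], where each element has dtype int64. After the np.array() conversion, the resulting data1 has dtype object, and MindSpore raises the "Invalid data type" exception when it converts the data to a Tensor.
Solution: after calling np.array() to convert the input data to a NumPy array, check the dtype of the result. If the dtype of data1 is object, make sure every element has the same shape before the conversion.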
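A minimal sketch of the check and one possible fix, using NumPy only. The helper pad_to_same_length and the choice of zero padding are assumptions made for illustration; whether padding (rather than truncating or restructuring the data) is appropriate depends on the dataset. dtype=object is passed explicitly here because older NumPy releases infer it with the deprecation warning seen in the log, while newer releases reject ragged input unless dtype=object is given.

```python
import numpy as np


def pad_to_same_length(arrays, pad_value=0, dtype=np.int64):
    """Hypothetical helper: right-pad 1-D arrays to a common length."""
    max_len = max(len(a) for a in arrays)
    padded = np.full((len(arrays), max_len), pad_value, dtype=dtype)
    for row, a in enumerate(arrays):
        padded[row, :len(a)] = a
    return padded


# Elements of different lengths, as in the example above.
ragged = [np.array([1, 2]), np.array([1, 2, 3])]

# Reproduces the problematic array: a 1-D array of Python objects.
data_bad = np.array(ragged, dtype=object)
has_object_dtype = data_bad.dtype == object

# After padding, every row has the same shape and the dtype is numeric.
data_ok = pad_to_same_length(ragged)
```

Checking arr.dtype == object on each column inside the custom data source, before the data reaches GeneratorDataset, makes the problem visible in the user script instead of deep inside the pipeline.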