Motivating example
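While recently porting a PyTorch project to Paddle, I kept hitting errors from DataLoader. From the API documentation the two frameworks looked identical, so I simply reused my PyTorch code, and then lost a lot of time debugging. The lesson is to read the API source code and compare the underlying implementations. The problems are recorded below.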
Paddle reads data mainly through two classes: paddle.io.Dataset and paddle.io.DataLoader.
The following example is taken from the official documentation:
import numpy as np
import paddle
from paddle.io import Dataset, DataLoader

# BATCH_SIZE is not defined in the snippet as quoted; 2 is an arbitrary value for illustration
BATCH_SIZE = 2

# define a random dataset
class RandomDataset(Dataset):
    def __init__(self, num_samples):
        self.num_samples = num_samples

    def __getitem__(self, idx):
        image = np.random.random([784]).astype('float32')
        label = np.random.randint(0, 9, (1, )).astype('int64')
        return image, label

    def __len__(self):
        return self.num_samples

dataset = RandomDataset(10)
loader = DataLoader(dataset,
                    batch_size=BATCH_SIZE,
                    shuffle=True,
                    drop_last=True,
                    num_workers=2)

# inspect individual samples
for i in range(len(dataset)):
    print(dataset[i])

# iterate over batches for training
for i, (image, label) in enumerate(loader()):
    print('Got it!')
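To make the division of labour between the two classes concrete: Dataset returns one sample at a time, and DataLoader groups several samples into a batch. The sketch below imitates that grouping step with NumPy only. The function collate_samples is a hypothetical name used for illustration; it is not Paddle's actual implementation, only an approximation of what batching (image, label) tuples amounts to.

```python
import numpy as np

def collate_samples(samples):
    """Stack each field of a list of (image, label) tuples along a new batch axis."""
    fields = zip(*samples)  # regroup by field: all images, then all labels
    return tuple(np.stack(field) for field in fields)

# Two samples shaped like the ones RandomDataset.__getitem__ returns
samples = [
    (np.random.random([784]).astype("float32"),
     np.random.randint(0, 9, (1,)).astype("int64"))
    for _ in range(2)
]

images, labels = collate_samples(samples)
print(images.shape, labels.shape)
```

Stacking like this only works when every sample has the same structure and every field is an array of the same shape, which is one reason the return type of __getitem__ matters.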
Error 1
The __getitem__(self, idx) method of the Dataset class returns data that is not of type numpy.ndarray.
For example, add this line before the return statement:
image = paddle.to_tensor(image)
This raises the following error:
SystemError: (Fatal) Blocking queue is killed because the data reader raises an exception.
[Hint: Expected killed_ != true, but received killed_:1 == true:1.] (at /paddle/paddle/fluid/operators/reader/blocking_queue.h:158)
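Assuming the cause is as described, one possible workaround is to make sure every value leaving __getitem__ is a numpy.ndarray. The sketch below uses a hypothetical helper, ensure_ndarray, which is not part of Paddle. It relies on the fact that tensor objects such as paddle.Tensor expose a .numpy() method. FakeTensor is only a stand-in so the sketch runs without Paddle installed.

```python
import numpy as np

def ensure_ndarray(value):
    """Return value as a numpy.ndarray (hypothetical helper, not a Paddle API)."""
    if isinstance(value, np.ndarray):
        return value
    # Tensor-like objects such as paddle.Tensor expose a .numpy() method
    to_numpy = getattr(value, "numpy", None)
    if callable(to_numpy):
        return np.asarray(to_numpy())
    # Plain Python lists and scalars
    return np.asarray(value)

class FakeTensor:
    """Stand-in for a tensor type so the sketch runs without Paddle."""
    def __init__(self, data):
        self._data = np.asarray(data)

    def numpy(self):
        return self._data

image = ensure_ndarray(FakeTensor([0.1, 0.2, 0.3]))
label = ensure_ndarray([7])
print(type(image).__name__, image.shape, label.shape)
```

In a Dataset, the call would go right before the return statement, for example return ensure_ndarray(image), ensure_ndarray(label).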
Error 2
The __getitem__(self, idx) method of the Dataset class returns data of type dict.
For example, change the return statement to:
return {'input': image, 'lb': label}
This raises exactly the same error.
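If the rest of the code expects named fields, one way to sidestep this is to return a tuple in a fixed order from __getitem__ and rebuild the dict after the batch is produced. The helpers sample_to_tuple and batch_to_dict below are hypothetical names introduced for illustration; the keys 'input' and 'lb' come from the dict example above.

```python
import numpy as np

# Field order shared by both helpers; key names follow the dict example
KEYS = ("input", "lb")

def sample_to_tuple(sample, keys=KEYS):
    """Flatten a dict sample into a tuple of ndarrays in a fixed key order."""
    return tuple(np.asarray(sample[k]) for k in keys)

def batch_to_dict(batch, keys=KEYS):
    """Re-attach the field names to a tuple produced by the loader."""
    return dict(zip(keys, batch))

sample = {
    "input": np.zeros(784, dtype="float32"),
    "lb": np.array([3], dtype="int64"),
}

as_tuple = sample_to_tuple(sample)   # what __getitem__ would return
restored = batch_to_dict(as_tuple)   # what the training loop would use
print([a.shape for a in as_tuple], sorted(restored))
```

Keeping the key order in a single constant avoids the two helpers silently disagreeing about which position holds which field.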
References
https://blog.csdn.net/qq_37668436/article/details/114336142
https://blog.csdn.net/qq_32097577/article/details/112385033