My training environment: Huawei Cloud ModelArts Notebook, image: mindspore1.7.0-py3.7-ubuntu18.04
OS: Ubuntu 18.04
MindSpore 1.7.0;
CPU;
Python 3.7.5;
After building the dataset with GeneratorDataset, calling create_dict_iterator() raises an index error:
IndexError: Only support integers, slices(`:`), ellipsis(`...`), None, bool, tensor with int, list and tuple ,but got 0 with type <class 'numpy.int64'>
Here is my Dataset class:
Here is the calling code:
Here is the error the program raises:
The problem may lie in the ds_xtrain and ds_ytrain arguments, which are Tensors, as shown below:
Here is the complete code:
import os
import csv
import time
import numpy as np
import pandas as pd
import seaborn as sns
from matplotlib import pyplot as plt
import mindspore as ms
import mindspore.dataset as ds
import mindspore.context as context
import mindspore.dataset.transforms.c_transforms as C
import mindspore.dataset.vision.c_transforms as CV
from mindspore import nn, Tensor
from mindspore.train import Model
from mindspore.nn.metrics import Accuracy, MAE, MSE
from mindspore.train.serialization import load_checkpoint, load_param_into_net
from mindspore.train.callback import ModelCheckpoint, CheckpointConfig, LossMonitor, TimeMonitor
context.set_context(mode=context.GRAPH_MODE, device_target='CPU')
# Load and inspect the dataset to prepare for data processing
with open('./auto-mpg.data') as csv_file:
    data = list(csv.reader(csv_file, delimiter=','))
# MPG: Miles Per Gallon, i.e. how many miles the car travels per gallon of fuel
column_names = ['MPG','Cylinders','Displacement','Horsepower','Weight',
'Acceleration', 'Model Year', 'Origin']
# Read the data with pandas and do initial cleaning: treat '?' as NaN, split on whitespace, etc.
raw_data = pd.read_csv("./auto-mpg.data", header=None, names=column_names,
                       sep=r'\s+', na_values='?', usecols=column_names)
data = raw_data.copy()
# Handle missing values in the dataset by dropping incomplete rows
data = data.dropna()
# Inspect the structure of the training data
origin = data.pop("Origin")
labels = data.pop("MPG")
states = data.describe()
states = states.transpose()
# Normalize the features (z-score: subtract the mean, divide by the std)
def norm(x):
    return (x - states['mean']) / states['std']
normed_data = norm(data)
# normed_data["Origin"] = origin
normed_data["MPG"] = labels
# Split the dataset into training and test sets at a 4:1 ratio
train_dataset = normed_data.sample(frac=0.8,random_state=0)
test_dataset = normed_data.drop(train_dataset.index)
# Training needs the features (X) separated from the target (Y); here MPG is Y and the remaining columns are X.
train_labels = train_dataset.pop('MPG')
test_labels = test_dataset.pop('MPG')
X_train, Y_train = np.array(train_dataset), np.array(train_labels)
X_test, Y_test = np.array(test_dataset), np.array(test_labels)
# Convert the datasets to Tensor format
ds_xtrain = Tensor(X_train, ms.float32)
ds_ytrain = Tensor(Y_train, ms.int32)
print(ds_xtrain)
print(ds_ytrain)
ds_xtest = Tensor(X_test, ms.float32)
ds_ytest = Tensor(Y_test, ms.int32)
class Dataset(object):
    def __init__(self, xdata, ydata):
        self.xdata = xdata
        self.ydata = ydata

    def __getitem__(self, index):
        return [self.xdata[index], self.ydata[index]]

    def __len__(self):
        return len(self.xdata)
print(ds_xtrain[0])
gendataset = Dataset(ds_xtrain, ds_ytrain)
gendataset = ds.GeneratorDataset(gendataset, ["data", "label"], shuffle=False)
for data in gendataset.create_dict_iterator():
    print('{}'.format(data["data"]), '{}'.format(data["label"]))
****************************************************Answer*****************************************************
The problem is that when GeneratorDataset calls __getitem__, the index it passes in is a numpy.int64, and in MindSpore 1.7 a Tensor cannot be indexed with that type, so __getitem__ needs to cast the index to a plain Python int first. I still don't fully understand what GeneratorDataset does under the hood, though: why does it pass in a numpy.int64?
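One plausible explanation (an assumption about the internals, not verified against the MindSpore source): the sampler materializes its index sequence as a NumPy array, and pulling elements out of a NumPy integer array yields NumPy scalar types rather than Python ints, which is exactly the type the IndexError reports:

```python
import numpy as np

# A sampler that shuffles indices typically builds them as a NumPy array.
indices = np.arange(5)
np.random.shuffle(indices)

# Indexing (or iterating) such an array yields numpy integer scalars,
# not Python ints -- the same type reported in the IndexError.
first = indices[0]
print(type(first))  # e.g. <class 'numpy.int64'> on 64-bit Linux
```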
The fixed code is shown in the image.
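A minimal sketch of the cast fix described above. The only change to the original class is the `int(index)` conversion in `__getitem__`; the class itself is framework-agnostic, so it is demonstrated here with NumPy arrays, but with MindSpore you would wrap it in `GeneratorDataset` exactly as before:

```python
import numpy as np

class Dataset(object):
    def __init__(self, xdata, ydata):
        self.xdata = xdata
        self.ydata = ydata

    def __getitem__(self, index):
        # GeneratorDataset hands in a numpy.int64; cast it to a plain
        # Python int so that Tensor indexing accepts it.
        index = int(index)
        return self.xdata[index], self.ydata[index]

    def __len__(self):
        return len(self.xdata)

# Indexing with a numpy integer now works:
d = Dataset(np.zeros((4, 3), np.float32), np.arange(4))
x, y = d[np.int64(1)]
print(y)  # 1
```

An alternative that sidesteps the issue entirely is to feed NumPy arrays (X_train, Y_train) to GeneratorDataset instead of Tensors, since NumPy arrays accept numpy integer indices; converting the Tensors back with `Tensor.asnumpy()` before building the Dataset would have the same effect.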