Using data generators in Keras
Background: in some projects, loading the entire dataset into memory at once consumes a great deal of RAM and can even crash the process. A Keras data generator instead loads data batch by batch during training, which effectively relieves this memory pressure.
1. Conventional approach
Load everything up front, then call model.fit directly:
import numpy as np
from keras.models import Sequential

# Load the entire dataset into memory at once (schematic)
X, y = np.load('some_training_set_with_labels.npy')

# Design model
model = Sequential()
[...]  # Your architecture
model.compile()

# Train model on the full in-memory dataset
model.fit(x=X, y=y)
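A quick back-of-the-envelope check of why full loading hurts. The sequence shape (150 timesteps of 200-dimensional word vectors) comes from this post; the sample count is a made-up assumption for illustration:

```python
# Rough memory cost of loading everything at once.
# 150 timesteps and 200-dim vectors are the sizes used later in this post;
# the sample count of 100k is an assumption for the estimate.
n_samples, timesteps, dim = 100_000, 150, 200
bytes_needed = n_samples * timesteps * dim * 8  # float64 = 8 bytes per value
print(bytes_needed / 1e9)  # gigabytes needed for the features alone
```

At these sizes the features alone need about 24 GB, which is exactly the situation a generator avoids by materializing only one batch at a time.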
2. Generator-based approach
Define a Python class DataGenerator that feeds data to the Keras model on the fly.
__init__ method: stores the information the generator needs about the data.
# Initialization
def __init__(self, keys, allblocks, batch_size, word_vectors, w2v_model, shuffle=True):
    'Initialization'
    self.batch_size = batch_size
    self.keys = keys
    self.allblocks = allblocks
    self.word_vectors = word_vectors
    self.w2v_model = w2v_model
    self.shuffle = shuffle
    self.on_epoch_end()
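The methods below all live in one class; in Keras it would subclass keras.utils.Sequence. To make the batch mechanics easy to follow, here is a framework-free sketch of the same protocol on toy data (no word vectors, names are illustrative):

```python
import math
import random

class ToyDataGenerator:
    """Framework-free sketch of the DataGenerator protocol.
    In Keras the real class would subclass keras.utils.Sequence."""

    def __init__(self, keys, batch_size, shuffle=True):
        self.keys = list(keys)
        self.batch_size = batch_size
        self.shuffle = shuffle
        self.on_epoch_end()

    def __len__(self):
        # Number of full batches per epoch (floor: a partial batch is dropped)
        return math.floor(len(self.keys) / self.batch_size)

    def __getitem__(self, index):
        # The slice of keys that makes up batch number `index`
        return self.keys[index * self.batch_size:(index + 1) * self.batch_size]

    def on_epoch_end(self):
        # Reorder the keys so the next epoch draws different batches
        if self.shuffle:
            random.shuffle(self.keys)

gen = ToyDataGenerator(range(1002), batch_size=10, shuffle=False)
print(len(gen), gen[0])  # 100 batches; first batch is keys 0..9
```

Keras calls `__len__` once per epoch to know how many batches to request, then calls `__getitem__` for each batch index, so the full dataset never has to sit in memory.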
on_epoch_end method: runs at the end of every epoch. Its main job here is to decide whether to shuffle the data; reshuffling the sample order after each epoch tends to make the trained model more robust.
# Runs at the end of every epoch
def on_epoch_end(self):
    'Reshuffles the sample order after each epoch'
    # Shuffle the keys in place so consecutive epochs draw different batches
    if self.shuffle:
        numpy.random.shuffle(self.keys)
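As a standalone demonstration of this shuffle step, on toy keys:

```python
import numpy

keys = list(range(6))
numpy.random.seed(0)        # fixed seed only to make the demo reproducible
numpy.random.shuffle(keys)  # in-place reorder, as on_epoch_end would do

# The same six keys survive, just in a new epoch order.
print(sorted(keys))
```

Only the order changes; no sample is added or lost, so every epoch still visits the same pool of data.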
__len__ method: returns the number of batches per epoch, so valid batch indexes run from 0 up to floor(total_samples / batch_size) - 1.
Why floor (rounding down)? Suppose there are 1002 samples in total and batch_size is 10: we can iterate over at most 100 full batches per epoch, and requesting more would revisit samples. Hence numpy.floor.
# Number of batches per epoch
def __len__(self):
    'Denotes the number of batches per epoch'
    return int(numpy.floor(len(self.keys) / self.batch_size))
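Plugging in the numbers from the example above:

```python
import numpy

total_samples, batch_size = 1002, 10
batches_per_epoch = int(numpy.floor(total_samples / batch_size))
print(batches_per_epoch)  # the trailing 2 samples are dropped each epoch
```

Note the side effect: the 2 leftover samples are skipped in every epoch, which is one more reason to reshuffle in on_epoch_end, so different samples land in the dropped tail each time.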
__getitem__ method: given a batch index, produces one batch of data.
# Generate one batch of data for the given batch index
def __getitem__(self, index):
    'Generate one batch of data'
    # Select the keys that fall into this batch
    batch_keys = self.keys[index * self.batch_size: (index + 1) * self.batch_size]
    batch_data = [self.allblocks[key] for key in batch_keys]
    # Generate data
    X, y = self.__data_generation(batch_data)
    return X, y
__data_generation method: a custom helper that actually builds the batch arrays returned by __getitem__.
# Builds the actual batch arrays
def __data_generation(self, batch_data):
    batch_X = []
    batch_Y = []
    for sample in batch_data:
        code, label = sample
        tokens = myutils.getTokens(code)
        vector_list = []
        for token in tokens:
            # Skip whitespace and tokens without a trained embedding
            if token in self.word_vectors.key_to_index and token != " ":
                vector = self.w2v_model.wv[token]
                vector_list.append(vector.tolist())
        # Pad (or truncate) each token sequence to a fixed length of 150
        padded_sequence = sequence.pad_sequences([vector_list], padding='post', maxlen=150, dtype='float64')
        batch_X.append(padded_sequence[0])
        batch_Y.append(label)
        # Sanity check: every sample should come out as (1, 150, 200)
        if padded_sequence.shape != (1, 150, 200):
            print(padded_sequence.shape)
    return numpy.array(batch_X), numpy.array(batch_Y)
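The key step above is pad_sequences: each sample's variable-length list of token vectors becomes a fixed (150, 200) array. A numpy-only sketch of what padding='post' does for one sequence (sizes shrunk for the demo; note that Keras truncates from the front by default, truncating='pre'):

```python
import numpy

def pad_post(vector_list, maxlen, dim):
    """Right-pad a list of `dim`-wide vectors with zero rows up to `maxlen`
    rows; truncate from the front if too long (Keras default truncating='pre')."""
    out = numpy.zeros((maxlen, dim), dtype='float64')
    rows = numpy.array(vector_list[-maxlen:], dtype='float64')
    if len(rows):
        out[:len(rows)] = rows
    return out

padded = pad_post([[1.0, 2.0], [3.0, 4.0]], maxlen=4, dim=2)
print(padded.shape)        # fixed output shape regardless of input length
print(padded[2].tolist())  # zero padding after the real data
```

Because every sample comes out the same shape, the whole batch can be stacked into one (batch_size, 150, 200) tensor, which is exactly what the LSTM layer expects.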
That completes the data generator. Next, here is how it plugs into a deep learning pipeline.
The example below uses Keras to build and train an LSTM (Long Short-Term Memory) model for a multi-class problem, covering model definition, generator wiring, class balancing, training, and evaluation.
# Build the model
model = Sequential()
model.add(LSTM(neurons, dropout=dropout, recurrent_dropout=dropout))  # around 50 units seems good
model.add(Dense(8, activation='softmax'))
model.compile(loss="sparse_categorical_crossentropy", optimizer='adam', metrics=['accuracy'])

now = datetime.now()  # current date and time
nowformat = now.strftime("%H:%M")
print("Compiled LSTM: ", nowformat)
# Training-set generator (shuffled between epochs)
train_generator = DataGenerator(keystrain, allblocks, batchsize, word_vectors, w2v_model)
# Validation-set generator (shuffle=False so predictions stay aligned with labels)
validate_generator = DataGenerator(keystest, allblocks, batchsize, word_vectors, w2v_model, shuffle=False)
# Final-test-set generator
finaltest_generator = DataGenerator(keysfinaltest, allblocks, batchsize, word_vectors, w2v_model, shuffle=False)
# The classes are imbalanced, so weight them with class_weight during training
class_weights = class_weight.compute_class_weight('balanced', classes=numpy.unique(y_train), y=y_train)
# Map each actual class label to its weight (enumerate would only be safe
# if the labels happened to be the contiguous integers 0..n-1)
class_weights_dict = {cls: weight for cls, weight in zip(numpy.unique(y_train), class_weights)}
print(class_weights_dict)
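compute_class_weight('balanced', ...) gives each class the weight n_samples / (n_classes * count_of_that_class), so rarer classes contribute more to the loss. A hand-rolled version on toy labels (the labels here are made up for illustration):

```python
import numpy

y_train = numpy.array([0, 0, 0, 1])  # toy imbalanced labels (assumption)
classes, counts = numpy.unique(y_train, return_counts=True)
# 'balanced' heuristic: n_samples / (n_classes * count_per_class)
weights = len(y_train) / (len(classes) * counts)
class_weights_dict = {int(c): float(w) for c, w in zip(classes, weights)}
print(class_weights_dict)  # the rarer class gets the larger weight
```

Here class 1 appears once out of four samples, so it is weighted 2.0 versus about 0.67 for class 0; passing this dict to model.fit scales each sample's loss by its class weight.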
# fit_generator is deprecated in recent Keras; model.fit accepts generators
# directly, and a Sequence reports its own number of batches per epoch
history = model.fit(
    train_generator,
    epochs=epochs,
    validation_data=validate_generator,
    class_weight=class_weights_dict
)
# Evaluate on the held-out test set; the generator defines its own length,
# so no steps argument is needed. Because __len__ floors, a partial final
# batch is dropped, so trim the labels to match the predictions.
yhat = model.predict(finaltest_generator, verbose=0)
yhat_classes = numpy.argmax(yhat, axis=1)
n = len(yhat_classes)
accuracy = accuracy_score(FinaltestY[:n], yhat_classes)
precision = precision_score(FinaltestY[:n], yhat_classes, average='weighted')
recall = recall_score(FinaltestY[:n], yhat_classes, average='weighted')
F1Score = f1_score(FinaltestY[:n], yhat_classes, average='weighted')
print("Accuracy: " + str(accuracy))
print("Precision: " + str(precision))
print("Recall: " + str(recall))
print("F1 score: " + str(F1Score))
print("\n")
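The evaluation hinges on numpy.argmax turning each row of softmax probabilities into a class id; a toy walk-through (the predictions and labels are made up):

```python
import numpy

yhat = numpy.array([[0.1, 0.9],   # predicted class probabilities per sample
                    [0.8, 0.2],
                    [0.3, 0.7]])
yhat_classes = numpy.argmax(yhat, axis=1)  # index of the largest probability
y_true = numpy.array([1, 0, 0])
accuracy = float(numpy.mean(yhat_classes == y_true))
print(yhat_classes.tolist(), accuracy)  # 2 of 3 predictions are correct
```

sklearn's accuracy_score computes the same fraction; precision, recall, and F1 with average='weighted' additionally weight each class's score by its support.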
References:
1. A detailed example of how to use data generators with Keras
2. Building image data generators with PyTorch, Keras, and TensorFlow