Using data generators in Keras
Background: in some projects, loading the entire dataset into memory at once consumes a great deal of RAM and can even crash the process. A Keras data generator instead loads data batch by batch during training, which effectively relieves this memory pressure.
1. Conventional approach
Load everything up front, then call model.fit directly:
import numpy as np
from keras.models import Sequential

# Load the entire dataset into memory at once (schematic)
X, y = np.load('some_training_set_with_labels.npy')

# Design model
model = Sequential()
[...]  # Your architecture
model.compile()

# Train model on the full in-memory dataset
model.fit(x=X, y=y)
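A quick back-of-the-envelope check of why full loading hurts. The sequence shape (150 timesteps of 200-dimensional word vectors) comes from this post; the sample count is a made-up assumption for illustration:

```python
# Rough memory cost of loading everything at once.
# 150 timesteps and 200-dim vectors are the sizes used later in this post;
# the sample count of 100k is an assumption for the estimate.
n_samples, timesteps, dim = 100_000, 150, 200
bytes_needed = n_samples * timesteps * dim * 8  # float64 = 8 bytes per value
print(bytes_needed / 1e9)  # gigabytes needed for the features alone
```

At these sizes the features alone need about 24 GB, which is exactly the situation a generator avoids by materializing only one batch at a time.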
2. Generator-based approach
Define a Python class DataGenerator that feeds data to the Keras model on the fly.
__init__ method: stores the information the generator needs about the data.
# Initialization
def __init__(self, keys, allblocks, batch_size, word_vectors, w2v_model, shuffle=True):
    'Initialization'
    self.batch_size = batch_size
    self.keys = keys
    self.allblocks = allblocks
    self.word_vectors = word_vectors
    self.w2v_model = w2v_model
    self.shuffle = shuffle
    self.on_epoch_end()
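The methods below all live in one class; in Keras it would subclass keras.utils.Sequence. To make the batch mechanics easy to follow, here is a framework-free sketch of the same protocol on toy data (no word vectors, names are illustrative):

```python
import math
import random

class ToyDataGenerator:
    """Framework-free sketch of the DataGenerator protocol.
    In Keras the real class would subclass keras.utils.Sequence."""

    def __init__(self, keys, batch_size, shuffle=True):
        self.keys = list(keys)
        self.batch_size = batch_size
        self.shuffle = shuffle
        self.on_epoch_end()

    def __len__(self):
        # Number of full batches per epoch (floor: a partial batch is dropped)
        return math.floor(len(self.keys) / self.batch_size)

    def __getitem__(self, index):
        # The slice of keys that makes up batch number `index`
        return self.keys[index * self.batch_size:(index + 1) * self.batch_size]

    def on_epoch_end(self):
        # Reorder the keys so the next epoch draws different batches
        if self.shuffle:
            random.shuffle(self.keys)

gen = ToyDataGenerator(range(1002), batch_size=10, shuffle=False)
print(len(gen), gen[0])  # 100 batches; first batch is keys 0..9
```

Keras calls `__len__` once per epoch to know how many batches to request, then calls `__getitem__` for each batch index, so the full dataset never has to sit in memory.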
on_epoch_end method: runs at the end of every epoch. Its main job here is to decide whether to shuffle the data; reshuffling the sample order after each epoch tends to make the trained model more robust.
# Runs at the end of every epoch
def on_epoch_end(self):
    'Reshuffles the sample order after each epoch'
    # Shuffle the keys in place so consecutive epochs draw different batches
    if self.shuffle:
        numpy.random.shuffle(self.keys)
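As a standalone demonstration of this shuffle step, on toy keys:

```python
import numpy

keys = list(range(6))
numpy.random.seed(0)        # fixed seed only to make the demo reproducible
numpy.random.shuffle(keys)  # in-place reorder, as on_epoch_end would do

# The same six keys survive, just in a new epoch order.
print(sorted(keys))
```

Only the order changes; no sample is added or lost, so every epoch still visits the same pool of data.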
__len__ method: returns the number of batches per epoch, so valid batch indexes run from 0 up to floor(total_samples / batch_size) - 1.
Why floor (rounding down)? Suppose there are 1002 samples in total and batch_size is 10: we can iterate over at most 100 full batches per epoch, and requesting more would revisit samples. Hence numpy.floor.
# Number of batches per epoch
def __len__(self):
    'Denotes the number of batches per epoch'
    return int(numpy.floor(len(self.keys) / self.batch_size))
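Plugging in the numbers from the example above:

```python
import numpy

total_samples, batch_size = 1002, 10
batches_per_epoch = int(numpy.floor(total_samples / batch_size))
print(batches_per_epoch)  # the trailing 2 samples are dropped each epoch
```

Note the side effect: the 2 leftover samples are skipped in every epoch, which is one more reason to reshuffle in on_epoch_end, so different samples land in the dropped tail each time.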
__getitem__ method: given a batch index, produces one batch of data.
# Generate one batch of data for the given batch index
def __getitem__(self, index):
    'Generate one batch of data'
    # Select the keys that fall into this batch
    batch_keys = self.keys[index * self.batch_size: (index + 1) * self.batch_size]
    batch_data = [self.allblocks[key] for key in batch_keys]
    # Generate data
    X, y = self.__data_generation(batch_data)
    return X, y
__data_generation method: a custom helper that actually builds the batch arrays returned by __getitem__.
# Builds the actual batch arrays
def __data_generation(self, batch_data):
    batch_X = []
    batch_Y = []
    for sample in batch_data:
        code, label = sample
        tokens = myutils.getTokens(code)
        vector_list = []
        for token in tokens:
            # Skip whitespace and tokens without a trained embedding
            if token in self.word_vectors.key_to_index and token != " ":
                vector = self.w2v_model.wv[token]
                vector_list.append(vector.tolist())
        # Pad (or truncate) each token sequence to a fixed length of 150
        padded_sequence = sequence.pad_sequences([vector_list], padding='post', maxlen=150, dtype='float64')
        batch_X.append(padded_sequence[0])
        batch_Y.append(label)
        # Sanity check: every sample should come out as (1, 150, 200)
        if padded_sequence.shape != (1, 150, 200):
            print(padded_sequence.shape)
    return numpy.array(batch_X), numpy.array(batch_Y)
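The key step above is pad_sequences: each sample's variable-length list of token vectors becomes a fixed (150, 200) array. A numpy-only sketch of what padding='post' does for one sequence (sizes shrunk for the demo; note that Keras truncates from the front by default, truncating='pre'):

```python
import numpy

def pad_post(vector_list, maxlen, dim):
    """Right-pad a list of `dim`-wide vectors with zero rows up to `maxlen`
    rows; truncate from the front if too long (Keras default truncating='pre')."""
    out = numpy.zeros((maxlen, dim), dtype='float64')
    rows = numpy.array(vector_list[-maxlen:], dtype='float64')
    if len(rows):
        out[:len(rows)] = rows
    return out

padded = pad_post([[1.0, 2.0], [3.0, 4.0]], maxlen=4, dim=2)
print(padded.shape)        # fixed output shape regardless of input length
print(padded[2].tolist())  # zero padding after the real data
```

Because every sample comes out the same shape, the whole batch can be stacked into one (batch_size, 150, 200) tensor, which is exactly what the LSTM layer expects.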
That completes the data generator. Next, here is how it plugs into a deep learning pipeline.
The example below uses Keras to build and train an LSTM (Long Short-Term Memory) model for a multi-class problem, covering model definition, generator wiring, class balancing, training, and evaluation.
# Build the model
model = Sequential()
model.add(LSTM(neurons, dropout=dropout, recurrent_dropout=dropout))  # around 50 units seems good
model.add(Dense(8, activation='softmax'))
model.compile(loss="sparse_categorical_crossentropy", optimizer='adam', metrics=['accuracy'])

now = datetime.now()  # current date and time
nowformat = now.strftime("%H:%M")
print("Compiled LSTM: ", nowformat)
# Training-set generator (shuffled between epochs)
train_generator = DataGenerator(keystrain, allblocks, batchsize, word_vectors, w2v_model)
# Validation-set generator (shuffle=False so predictions stay aligned with labels)
validate_generator = DataGenerator(keystest, allblocks, batchsize, word_vectors, w2v_model, shuffle=False)
# Final-test-set generator
finaltest_generator = DataGenerator(keysfinaltest, allblocks, batchsize, word_vectors, w2v_model, shuffle=False)
# The classes are imbalanced, so weight them with class_weight during training
class_weights = class_weight.compute_class_weight('balanced', classes=numpy.unique(y_train), y=y_train)
# Map each actual class label to its weight (enumerate would only be safe
# if the labels happened to be the contiguous integers 0..n-1)
class_weights_dict = {cls: weight for cls, weight in zip(numpy.unique(y_train), class_weights)}
print(class_weights_dict)
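compute_class_weight('balanced', ...) gives each class the weight n_samples / (n_classes * count_of_that_class), so rarer classes contribute more to the loss. A hand-rolled version on toy labels (the labels here are made up for illustration):

```python
import numpy

y_train = numpy.array([0, 0, 0, 1])  # toy imbalanced labels (assumption)
classes, counts = numpy.unique(y_train, return_counts=True)
# 'balanced' heuristic: n_samples / (n_classes * count_per_class)
weights = len(y_train) / (len(classes) * counts)
class_weights_dict = {int(c): float(w) for c, w in zip(classes, weights)}
print(class_weights_dict)  # the rarer class gets the larger weight
```

Here class 1 appears once out of four samples, so it is weighted 2.0 versus about 0.67 for class 0; passing this dict to model.fit scales each sample's loss by its class weight.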
# fit_generator is deprecated in recent Keras; model.fit accepts generators
# directly, and a Sequence reports its own number of batches per epoch
history = model.fit(
    train_generator,
    epochs=epochs,
    validation_data=validate_generator,
    class_weight=class_weights_dict
)
# Evaluate on the held-out test set; the generator defines its own length,
# so no steps argument is needed. Because __len__ floors, a partial final
# batch is dropped, so trim the labels to match the predictions.
yhat = model.predict(finaltest_generator, verbose=0)
yhat_classes = numpy.argmax(yhat, axis=1)
n = len(yhat_classes)
accuracy = accuracy_score(FinaltestY[:n], yhat_classes)
precision = precision_score(FinaltestY[:n], yhat_classes, average='weighted')
recall = recall_score(FinaltestY[:n], yhat_classes, average='weighted')
F1Score = f1_score(FinaltestY[:n], yhat_classes, average='weighted')
print("Accuracy: " + str(accuracy))
print("Precision: " + str(precision))
print("Recall: " + str(recall))
print("F1 score: " + str(F1Score))
print("\n")
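The evaluation hinges on numpy.argmax turning each row of softmax probabilities into a class id; a toy walk-through (the predictions and labels are made up):

```python
import numpy

yhat = numpy.array([[0.1, 0.9],   # predicted class probabilities per sample
                    [0.8, 0.2],
                    [0.3, 0.7]])
yhat_classes = numpy.argmax(yhat, axis=1)  # index of the largest probability
y_true = numpy.array([1, 0, 0])
accuracy = float(numpy.mean(yhat_classes == y_true))
print(yhat_classes.tolist(), accuracy)  # 2 of 3 predictions are correct
```

sklearn's accuracy_score computes the same fraction; precision, recall, and F1 with average='weighted' additionally weight each class's score by its support.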
References:
1. A detailed example of how to use data generators with Keras
2. Building image data generators with PyTorch, Keras, and TensorFlow