Keras flow_from_dataframe Tutorial

Latest recommended article published on 2021-08-11 23:52:59

YaksaWang


Category: Machine Learning

Original article: https://medium.com/@vijayabhaskar96/tutorial-on-keras-flow-from-dataframe-1fd4493d237c


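This article shows how to use Keras's flow_from_dataframe function to work with image datasets, in particular those where the images sit in one directory and the class information is stored in a separate CSV file. Through a worked example it covers setting up data augmentation, splitting the data into training and validation sets, and using the function for model training and prediction.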


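A worked example of flow_from_dataframe

Original article: Tutorial on Keras flow_from_dataframe

Note: this article assumes you have at least some experience with Keras.

Image datasets found online mostly come in two common formats.

The first is the most common: all images are stored in folders named after their classes. You can configure augmentation with Keras's ImageDataGenerator and read the images from the directory tree with the flow_from_directory method.

The second common format I have come across keeps all images in a single directory, with their classes mapped in a CSV or JSON file. Keras previously had no support for this format, so you had to either move the images into separate per-class directories or write a custom generator. That is why I wrote the flow_from_dataframe function. It takes a Pandas DataFrame with one column of filenames (with or without extensions) and one column of class names, reads the images straight from the directory, and maps each to its class name. The function was recently merged into the official keras-preprocessing library.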
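
As a rough illustration of this second layout, the sketch below builds a tiny DataFrame by hand. The column names `id` and `label` mirror the CIFAR-10 CSV used later in the tutorial; the rows themselves are made up for the example.

```python
import pandas as pd

# Hypothetical rows: one image file name per row plus its class name.
df = pd.DataFrame(
    {
        "id": ["1.png", "2.png", "3.png", "4.png"],
        "label": ["frog", "truck", "truck", "deer"],
    }
)

# All images live in one directory; the class comes from the table,
# not from a per-class sub-folder.
class_names = sorted(df["label"].unique())
counts = df["label"].value_counts().to_dict()
print(class_names)
print(counts)
```

A table like this is all flow_from_dataframe needs besides the image directory: one column to locate each file and one to label it.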

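To use flow_from_dataframe, you need pandas installed:

pip install pandas

Note: make sure you are using the latest keras-preprocessing library, either by using the latest version of Keras or by installing it directly from the GitHub repo.

To update keras-preprocessing, first uninstall the old version:

pip uninstall keras-preprocessing

Then install keras-preprocessing from GitHub:

pip install git+https://github.com/keras-team/keras-preprocessing.git

Finally, restart the kernel if necessary.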

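Previously, if you needed to do regression or predict multiple columns while still using ImageDataGenerator's augmentation features, you had to write a custom generator. Now you can keep the target values as just another column (it must have a numeric dtype) and simply pass the column name(s) to flow_from_dataframe.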
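
Because the CSV files in this tutorial are read with `dtype=str`, a regression target would need converting to a numeric dtype first. Here is a minimal sketch, assuming a hypothetical `angle` column and a hypothetical `./images/` directory, neither of which is part of the CIFAR-10 data:

```python
import pandas as pd

# Hypothetical regression table: file names plus a numeric target stored as text.
df = pd.DataFrame({"id": ["a.png", "b.png", "c.png"],
                   "angle": ["12.5", "-3.0", "0.75"]})

# class_mode="raw" expects numeric targets, so convert the text column.
df["angle"] = pd.to_numeric(df["angle"])

# The generator could then be created along these lines (not run here):
# datagen.flow_from_dataframe(dataframe=df, directory="./images/",
#                             x_col="id", y_col="angle", class_mode="raw")
print(df["angle"].dtype)
```

For several targets at once, y_col would be a list of such numeric column names.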

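Getting started
First, download the dataset and save the image files in a single directory.
As an example, I will use the CIFAR-10 dataset.
Download and extract train.7z and test.7z to get a "train" folder and a "test" folder.
Download trainLabels.csv, which maps the filenames of the training images to class names.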

Let's dive into the code!

Import the packages and read the CSV files with pandas:

from keras.models import Sequential
from keras.preprocessing.image import ImageDataGenerator
from keras.layers import Dense, Activation, Flatten, Dropout, BatchNormalization
from keras.layers import Conv2D, MaxPooling2D
from keras import regularizers, optimizers
import pandas as pd
import numpy as np

def append_ext(fn):
    # Turn a bare image ID such as "1" into the file name "1.png".
    return fn + ".png"

# Read every column as str so the IDs and labels stay text.
traindf = pd.read_csv("./trainLabels.csv", dtype=str)
testdf = pd.read_csv("./sampleSubmission.csv", dtype=str)
traindf["id"] = traindf["id"].apply(append_ext)
testdf["id"] = testdf["id"].apply(append_ext)

# Rescale pixels to [0, 1] and reserve 25% of the rows for validation.
datagen = ImageDataGenerator(rescale=1./255., validation_split=0.25)

Notice that I append ".png" to every value in the DataFrame's "id" column to turn the file IDs into actual filenames.
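
One thing to watch: applying `append_ext` twice (for example by re-running a notebook cell) would produce names like `1.png.png`. A hedged variant that leaves already-suffixed names alone might look like this; `append_ext_once` is a name invented for the sketch, not part of the original code.

```python
def append_ext_once(fn, ext=".png"):
    """Return fn with ext appended, unless fn already ends with ext."""
    return fn if fn.endswith(ext) else fn + ext

ids = ["1", "2", "3.png"]
filenames = [append_ext_once(i) for i in ids]
print(filenames)
```

It can be passed to `.apply` on the "id" column in exactly the same way as `append_ext`.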
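Also note that I split the training data into two sets, one for training (train) and one for validation (valid). Specifying validation_split=0.25 is all it takes: the validation set will hold 25% of the images.
Pass the DataFrame to two separate flow_from_dataframe calls: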

train_generator = datagen.flow_from_dataframe(dataframe=traindf,
                                              directory="./train/",
                                              x_col="id",
                                              y_col="label",
                                              subset="training",
                                              batch_size=32,
                                              seed=42,
                                              shuffle=True,
                                              class_mode="categorical",
                                              target_size=(32, 32))
valid_generator = datagen.flow_from_dataframe(dataframe=traindf,
                                              directory="./train/",
                                              x_col="id",
                                              y_col="label",
                                              subset="validation",
                                              batch_size=32,
                                              seed=42,
                                              shuffle=True,
                                              class_mode="categorical",
                                              target_size=(32, 32))
# The test generator only rescales; it has no labels and no validation split.
test_datagen = ImageDataGenerator(rescale=1./255.)
test_generator = test_datagen.flow_from_dataframe(dataframe=testdf,
                                                  directory="./test/",
                                                  x_col="id",
                                                  y_col=None,
                                                  batch_size=32,
                                                  seed=42,
                                                  shuffle=False,
                                                  class_mode=None,
                                                  target_size=(32, 32))

Because validation_split was used to split the dataset, you must tell each flow_from_dataframe call which subset it should serve, using the subset argument.
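
To get a feel for the subset sizes, here is a small arithmetic sketch. It assumes a training set of 50,000 images and a simple truncating split; the exact rounding Keras applies internally may differ, and `split_sizes` is a helper invented for the illustration.

```python
def split_sizes(n_images, validation_split):
    """Approximate (training, validation) sizes for a validation fraction."""
    n_valid = int(n_images * validation_split)
    return n_images - n_valid, n_valid

# 50000 is an assumed training-set size, not a value taken from the article.
n_train, n_valid = split_sizes(50000, 0.25)
print(n_train, n_valid)
```

When the generators are created, Keras prints how many validated image filenames each one found, which is a quick way to check the real numbers.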

Arguments of flow_from_dataframe:

directory: (str) Path to the directory that contains all the images. Set this to None if x_col holds absolute paths to the image files rather than bare filenames.
x_col: (str) Name of the column that contains the image filenames.
y_col: (str or list of str) Name of the column(s) holding the targets: the class-name column for class modes such as "categorical", "binary" or "sparse", or the numeric target column(s) for "raw". Use None for the test generator.
class_mode: (str) As in flow_from_directory, this accepts "categorical" (default), "binary", "sparse", "input" and None, plus one extra value, "raw". With "raw", the data in the given column or list of columns is treated as raw target values, so those columns must have numeric dtypes. This helps for regression tasks, such as predicting the steering angle from images of a steering wheel, or for models that predict several values at once. For the test generator, set this to None so that only the images are returned.
batch_size: For the train and valid generators choose whatever suits you, but for the test generator pick a number that divides the total number of test images exactly. Why only for the test generator? Strictly speaking, the batch sizes of the train and valid generators should also divide their sample counts, but there it matters little: if a few images are left out in one epoch, they will be sampled in the next. The test set, however, must be sampled exactly once, no more and no less. If this is confusing, just set it to 1 (at the cost of some speed).
shuffle: Set this to False for the test generator only (True for the others), because the images must be yielded in order so that the predictions can be matched with their unique IDs or filenames.
drop_duplicates: Whether to drop duplicate entries in the DataFrame's x_col. The default is True; set it to False if for some reason you want to keep duplicates.
validate_filenames: Whether to validate the image filenames in x_col. If True, invalid images are ignored. Disabling this option can speed up instantiation when you have a huge number of files. The default is True.
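
The batch_size advice for the test generator can be checked with a few lines of plain Python. This is only a sketch: `exact_batch_sizes` is a hypothetical helper, and 300000 is an assumed test-set size used for illustration; in practice you would pass `test_generator.n`.

```python
def exact_batch_sizes(n_samples, max_batch=128):
    """Batch sizes up to max_batch that divide n_samples with no remainder."""
    return [b for b in range(1, max_batch + 1) if n_samples % b == 0]

# 300000 is an assumed number of test images, not a value from the article.
candidates = exact_batch_sizes(300000)
print(candidates)
```

Any value in the returned list lets the test generator cover every image exactly once with `n // batch_size` steps.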

Build the model:

model = Sequential()
model.add(Conv2D(32, (3, 3), padding='same',
                 input_shape=(32,32,3)))
model.add(Activation('relu'))
model.add(Conv2D(32, (3, 3)))
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Dropout(0.25))
model.add(Conv2D(64, (3, 3), padding='same'))
model.add(Activation('relu'))
model.add(Conv2D(64, (3, 3)))
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Dropout(0.25))
model.add(Flatten())
model.add(Dense(512))
model.add(Activation('relu'))
model.add(Dropout(0.5))
model.add(Dense(10, activation='softmax'))
# Use the RMSprop class; the lowercase alias is not available in every Keras release.
model.compile(optimizers.RMSprop(lr=0.0001, decay=1e-6), loss="categorical_crossentropy", metrics=["accuracy"])

Fit the model:

STEP_SIZE_TRAIN=train_generator.n//train_generator.batch_size
STEP_SIZE_VALID=valid_generator.n//valid_generator.batch_size
STEP_SIZE_TEST=test_generator.n//test_generator.batch_size
model.fit_generator(generator=train_generator,
                    steps_per_epoch=STEP_SIZE_TRAIN,
                    validation_data=valid_generator,
                    validation_steps=STEP_SIZE_VALID,
                    epochs=10
                   )
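
The floor division used for the step sizes silently skips the last partial batch in each pass. The sketch below shows the arithmetic; the sample counts assume a 50,000-image training set split 75/25, and `steps_and_leftover` is a helper made up for the example.

```python
import math

def steps_and_leftover(n_samples, batch_size):
    """Return (floor steps, samples skipped per pass, ceil steps)."""
    floor_steps = n_samples // batch_size
    leftover = n_samples - floor_steps * batch_size
    ceil_steps = math.ceil(n_samples / batch_size)
    return floor_steps, leftover, ceil_steps

# 37500 and 12500 are assumed subset sizes, not values from the article.
train_steps = steps_and_leftover(37500, 32)
valid_steps = steps_and_leftover(12500, 32)
print(train_steps)
print(valid_steps)
```

For training this is harmless because shuffling brings the skipped images back in later epochs. Using the ceiling value instead would cover every sample, with a smaller final batch.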

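Evaluate the model

model.evaluate_generator(generator=valid_generator,
                         steps=STEP_SIZE_VALID)

Since we are evaluating the model, the validation set should be treated like a test set: each of its images should be sampled exactly once. If you plan to evaluate, change the valid generator's batch size to 1 or to a number that divides the validation sample count exactly. The order does not matter here, so shuffle can stay True as before.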

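Predict the output

For prediction you could use flow_from_directory, because you are simply reading images from a directory and running the model on them; a DataFrame without class names adds little.
If you want to predict with flow_from_directory, see the last part of this tutorial.

But if you insist, you can predict with flow_from_dataframe too!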

test_generator.reset()
pred = model.predict_generator(test_generator,
                               steps=STEP_SIZE_TEST,
                               verbose=1)

You need to reset test_generator before every call to predict_generator. This matters: without the reset, the outputs come back in an unexpected order.

predicted_class_indices=np.argmax(pred,axis=1)

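predicted_class_indices now holds the predicted labels, but you cannot yet tell what they mean, because all you see are numbers such as 0, 1, 4, 1, 0, 6...
You need to map the predicted indices back to class names, and then pair them with their unique IDs (such as filenames), to find out what was predicted for each image.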

labels = (train_generator.class_indices)
labels = dict((v,k) for k,v in labels.items())
predictions = [labels[k] for k in predicted_class_indices]
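
To make the mapping concrete, here is a self-contained toy version. The `class_indices` dictionary and the probability rows are invented for the example; in the tutorial they come from `train_generator.class_indices` and `model.predict_generator`.

```python
import numpy as np

# Hypothetical class_indices mapping and softmax outputs for three images.
class_indices = {"cat": 0, "dog": 1, "frog": 2}
pred = np.array([[0.1, 0.7, 0.2],
                 [0.8, 0.1, 0.1],
                 [0.2, 0.2, 0.6]])

# Invert the name -> index mapping, then look up each row's argmax.
index_to_label = {v: k for k, v in class_indices.items()}
predicted = [index_to_label[i] for i in np.argmax(pred, axis=1)]
print(predicted)
```

The inversion works because class_indices assigns each class name a distinct integer, so the reversed dictionary loses nothing.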

Finally, save the results to a CSV file:

filenames=test_generator.filenames
results=pd.DataFrame({"Filename":filenames, "Predictions":predictions})
results.to_csv("results.csv",index=False)