Data Preprocessing for Deep Learning

Remote Sensing

Category: Machine Learning

Article link: https://blog.csdn.net/RSstudent/article/details/103929467

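There is not much material on this topic online.
Preprocessing here means turning a large collection of images into a dataset that can be loaded and used the way MNIST is.
Because I use the image data generator that ships with Keras, which can handle steps such as feature standardization automatically, my own preprocessing comes down to the following:
1. Rename the data: give every file a uniform name. I use the form 0_1.jpg, where the first number (0) is the class label and the second number (1) is the index of the image.
2. Resize the data: the network requires inputs of identical size, so every image must be resized to the same dimensions.
3. Load the data into NumPy arrays: the network is fed NumPy arrays, so the data must be converted from individual image files into arrays.
4. Save the NumPy arrays so that the network can read them later.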
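
As a minimal sketch of the naming scheme in step 1, the two helpers below build and parse names of the form <label>_<index>.jpg. The names make_name and parse_name are hypothetical and are not part of the code in this post.

```python
import os

def make_name(label, index, ext='.jpg'):
    """Build a file name such as 0_1.jpg from a class label and an image index."""
    return '{}_{}{}'.format(label, index, ext)

def parse_name(filename):
    """Recover (label, index) from a name produced by make_name."""
    stem, _ = os.path.splitext(os.path.basename(filename))
    label, index = stem.split('_')
    return int(label), int(index)

print(make_name(0, 1))         # 0_1.jpg
print(parse_name('3_42.jpg'))  # (3, 42)
```

Keeping the label in the file name means the label can be recovered later without a separate index file.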

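Before processing, the images must be organized into one main folder that contains one subfolder per class, with each subfolder holding the images of that class.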
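
A quick way to confirm that the folder layout matches this description is to count the files in each class subfolder. The helper below is only a sketch; count_images_per_class is a hypothetical name, and it assumes every entry inside a class subfolder is an image.

```python
import os

def count_images_per_class(root):
    """Return {class_folder_name: number_of_files} for each subfolder of root."""
    counts = {}
    for name in sorted(os.listdir(root)):
        class_dir = os.path.join(root, name)
        # Skip stray files in the root folder (for example a licence file)
        if os.path.isdir(class_dir):
            counts[name] = len(os.listdir(class_dir))
    return counts
```

Calling count_images_per_class on the main folder should return one entry per class, which also makes a badly unbalanced class easy to spot before any files are renamed.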

Code:

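import os
import numpy as np
from PIL import Image

# Helper class for image preprocessing
class PreProcess(object):
    def __init__(self, filepath, fileoutpath, datatype, width, height, validation_per=10, test_per=10):
        # Main folder with one subfolder per class (e.g. flower_photos)
        self.filepath = filepath
        # Folder that receives the resized images
        self.fileoutpath = fileoutpath
        # List of class subfolder names; the position in the list is the label
        self.datatype = datatype
        self.width = width
        self.height = height
        # Percentage of images assigned to the validation and test sets
        self.validation_per = validation_per
        self.test_per = test_per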
    # Rename every image to <label>_<index>.jpg
    def RenamePhoto(self):
        for typecount, typename in enumerate(self.datatype):
            typedir = os.path.join(self.filepath, typename)
            # Run this once on the original files: renaming onto an existing
            # name fails on Windows and overwrites the target on POSIX systems
            for photocount, photo in enumerate(sorted(os.listdir(typedir))):
                newname = str(typecount) + '_' + str(photocount) + '.jpg'
                os.rename(os.path.join(typedir, photo), os.path.join(typedir, newname))
        print('Finish rename!')
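    # Resize every image to (width, height) and write it to fileoutpath
    def ResizePhoto(self):
        os.makedirs(self.fileoutpath, exist_ok=True)
        # Iterate over the class folders only, so stray files in the main folder are ignored
        for typename in self.datatype:
            typedir = os.path.join(self.filepath, typename)
            for photo in os.listdir(typedir):
                img = Image.open(os.path.join(typedir, photo))
                # Force 3 channels so that every image later yields the same array shape
                img = img.convert('RGB')
                processedimg = img.resize((self.width, self.height), Image.BILINEAR)
                processedimg.save(os.path.join(self.fileoutpath, photo))
        print('Finish resize!')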
    # Load the resized images into NumPy arrays and split them into
    # training, validation and test sets
    def MakeDataset(self):
        train_images = []
        train_labels = []
        validation_images = []
        validation_labels = []
        test_images = []
        test_labels = []
        for photo in os.listdir(self.fileoutpath):
            img = Image.open(os.path.join(self.fileoutpath, photo)).convert('RGB')
            img = np.array(img, dtype=np.float32)
            # The label is the part of the file name before the underscore
            label = int(photo.split('_')[0])
            # Draw an integer in [0, 100) to decide which subset gets this image
            chance = np.random.randint(100)
            if chance < self.validation_per:
                validation_images.append(img)
                validation_labels.append(label)
            elif chance < (self.validation_per + self.test_per):
                test_images.append(img)
                test_labels.append(label)
            else:
                train_images.append(img)
                train_labels.append(label)
        # Shuffle images and labels from the same RNG state so pairs stay aligned
        state = np.random.get_state()
        np.random.shuffle(train_images)
        np.random.set_state(state)
        np.random.shuffle(train_labels)
        print('Finish making dataset!')
        # Return six separate arrays; packing them into one array fails on
        # recent NumPy versions because their shapes differ
        return (np.asarray(train_images), np.asarray(train_labels),
                np.asarray(validation_images), np.asarray(validation_labels),
                np.asarray(test_images), np.asarray(test_labels))
# Main function
def main():
    preprocess = PreProcess(
        filepath='C:\\Users\\Dash\\Desktop\\Tensorflow\\preprocess\\flower_photos',
        fileoutpath='C:\\Users\\Dash\\Desktop\\Tensorflow\\preprocess\\processed_flowers',
        datatype=['daisy', 'dandelion', 'roses', 'sunflowers', 'tulips'],
        width=299, height=299
    )
    preprocess.RenamePhoto()
    preprocess.ResizePhoto()
    dataset = preprocess.MakeDataset()
    # Save the finished NumPy arrays to a folder on my D drive
    os.makedirs('D:\\dataset', exist_ok=True)
    np.save('D:\\dataset\\train_images', dataset[0])
    np.save('D:\\dataset\\train_labels', dataset[1])
    np.save('D:\\dataset\\validation_images', dataset[2])
    np.save('D:\\dataset\\validation_labels', dataset[3])
    np.save('D:\\dataset\\test_images', dataset[4])
    np.save('D:\\dataset\\test_lebels', dataset[5])
    print('Finish save image as numpy array!')

# Entry point
if __name__ == '__main__':
    main()
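
Because MakeDataset assigns each image to a subset at random, the subset sizes only approximate the requested percentages, and a different run gives a different split. The sketch below shows the same idea on plain indices with a seeded generator so the split is reproducible. It is an illustration under assumptions, not the code of this post: split_indices and shuffle_together are hypothetical names, and np.random.default_rng is used in place of the global np.random functions.

```python
import numpy as np

def split_indices(n, validation_per=10, test_per=10, seed=0):
    """Assign each of n items to train/validation/test by percentage."""
    rng = np.random.default_rng(seed)
    chance = rng.integers(0, 100, size=n)  # one integer in [0, 100) per item
    idx = np.arange(n)
    validation = idx[chance < validation_per]
    test = idx[(chance >= validation_per) & (chance < validation_per + test_per)]
    train = idx[chance >= validation_per + test_per]
    return train, validation, test

def shuffle_together(images, labels, seed=0):
    """Shuffle two equal-length arrays with one permutation so pairs stay aligned."""
    perm = np.random.default_rng(seed).permutation(len(images))
    return images[perm], labels[perm]

train, validation, test = split_indices(1000)
print(len(train), len(validation), len(test))
```

Using a single permutation for both arrays is an alternative to saving and restoring the RNG state, and it works directly on NumPy arrays.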

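This code is reusable: as long as the data is stored in the layout described above, it can do the preprocessing. It produces the following result:
[Figure: screenshot of the result]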

To read the data back, only the corresponding load calls are needed; the loaded data looks just like MNIST.

from keras.models import Sequential
from keras.layers import Conv2D,Input,Flatten,Dense,Dropout
from keras.utils import to_categorical
import numpy as np

# Load the data
X_train = np.load('D:\\dataset\\train_images.npy')
Y_train = np.load('D:\\dataset\\train_labels.npy')
X_validation = np.load('D:\\dataset\\validation_images.npy')
Y_validation = np.load('D:\\dataset\\validation_labels.npy')
X_test = np.load('D:\\dataset\\test_images.npy')
Y_test = np.load('D:\\dataset\\test_lebels.npy')
print(X_train.shape)
print(Y_train.shape)
print(X_validation.shape)
print(Y_validation.shape)
print(X_test.shape)
print(Y_test.shape)
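
The loaded images hold raw pixel values and the labels are plain integers. This post leaves normalization to the Keras image data generator; as a NumPy-only alternative, the sketch below scales pixels to [0, 1] and one-hot encodes the labels. The function prepare_for_training and the synthetic demo arrays are hypothetical stand-ins, not the saved dataset.

```python
import numpy as np

def prepare_for_training(images, labels, num_classes):
    """Scale pixel values to [0, 1] and one-hot encode integer labels."""
    x = images.astype(np.float32) / 255.0
    # Row i of the identity matrix is the one-hot vector for class i
    y = np.eye(num_classes, dtype=np.float32)[labels.astype(np.int64)]
    return x, y

# Synthetic stand-ins for X_train and Y_train: 4 RGB images of 8x8 pixels
X_demo = np.random.default_rng(0).integers(0, 256, size=(4, 8, 8, 3)).astype(np.float32)
Y_demo = np.array([0, 2, 1, 4])
x, y = prepare_for_training(X_demo, Y_demo, num_classes=5)
print(x.shape, y.shape)  # (4, 8, 8, 3) (4, 5)
```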

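The data is not small, so loading takes a while!
The result of loading:
[Figure: screenshot of the printed shapes]
As the output shows, the validation and test sets each take up about 10% of the data, the data is stored as NumPy arrays, and the layout is channels_last, just as expected.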
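
The same two checks can be made in code rather than read off the printed shapes. The sketch below uses small zero-filled arrays as stand-ins for the loaded data; describe_split and is_channels_last are hypothetical helper names.

```python
import numpy as np

def describe_split(train, validation, test):
    """Return the fraction of samples that falls in each subset."""
    total = len(train) + len(validation) + len(test)
    return {
        'train': len(train) / total,
        'validation': len(validation) / total,
        'test': len(test) / total,
    }

def is_channels_last(images, channels=3):
    """True if images has shape (N, height, width, channels)."""
    return images.ndim == 4 and images.shape[-1] == channels

# Stand-in arrays: 80 / 10 / 10 images of 8x8 RGB pixels
demo_train = np.zeros((80, 8, 8, 3), dtype=np.float32)
demo_validation = np.zeros((10, 8, 8, 3), dtype=np.float32)
demo_test = np.zeros((10, 8, 8, 3), dtype=np.float32)
fractions = describe_split(demo_train, demo_validation, demo_test)
print(fractions)
print(is_channels_last(demo_train))  # True
```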

Main references:
Mike高的正正小课堂
《Tensorflow实战Google深度学习框架第二版》, Zheng Zeyu et al.
《Keras深度学习实战》, Rajdeep Dua et al.
