CNN Project Preparation: Extract, Transform, Load (ETL) for Images | PyTorch Series (11)

flyfor2013

Category: Efficient Introduction to PyTorch series

Article link: https://blog.csdn.net/flyfor2013/article/details/105911642

By AI_study

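Extract, Transform, and Load (ETL) with PyTorch

Welcome back to this series on neural network programming with PyTorch. In this post, we write the first code of part two of the series.

We will demonstrate a very simple extract, transform, and load pipeline using torchvision, PyTorch's computer vision package for machine learning. Without further ado, let's get started.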

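Project overview

We will follow four general steps as we work through this project:

Prepare the data
Build the model
Train the model
Analyze the model's results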

The ETL process

In this post, we start by preparing the data. To prepare our data, we will follow what is known as the ETL process.

ETL: https://en.wikipedia.org/wiki/Extract,_transform,_load

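Extract the data from a data source.
Transform the data into the desired format.
Load the data into a suitable structure.

The ETL process can be thought of as a fractal process because it can be applied at various scales. It can be applied on a small scale, such as a single program, or on a large scale, all the way up to the enterprise level, where large systems handle each of the individual parts.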
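To make the small-scale case concrete, here is a minimal sketch of the three ETL steps inside a single program, using only the Python standard library. The inline CSV string and the function names (extract, transform, load) are assumptions made for illustration; they are not part of PyTorch or of the original post.

```python
import csv
import io

# Hypothetical raw data standing in for a real source (file, database, web API).
RAW_CSV = "label,pixel_sum\n0,1200\n1,3400\n2,560\n"


def extract(source):
    """Extract: read raw records (dicts of strings) from the data source."""
    return list(csv.DictReader(io.StringIO(source)))


def transform(records):
    """Transform: convert the string fields into the numeric format we need."""
    return [(int(r["label"]), float(r["pixel_sum"])) for r in records]


def load(rows):
    """Load: put the rows into a structure that is easy to access (a dict)."""
    return {label: value for label, value in rows}


table = load(transform(extract(RAW_CSV)))
print(table)
```

The same three-stage shape reappears later in this post, where torchvision handles the extract and transform stages and a DataLoader handles the load stage.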

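If you would like to learn more about the general data science pipeline, check out the data science post, where it is covered in more detail.

data science post: https://deeplizard.com/learn/video/d11chG7Z-xk

Once we have completed the ETL process, we are ready to begin building and training our deep learning model. PyTorch has some built-in packages and classes that make the ETL process fairly easy.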

PyTorch Imports

We begin by importing all of the necessary PyTorch libraries.

import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F


import torchvision
import torchvision.transforms as transforms

This table describes each of these packages:

Package	Description
torch	The top-level PyTorch package and tensor library.
torch.nn	A subpackage that contains modules and extensible classes for building neural networks.
torch.optim	A subpackage that contains standard optimization operations like SGD and Adam.
torch.nn.functional	A functional interface that contains typical operations used for building neural networks like loss functions and convolutions.
torchvision	A package that provides access to popular datasets, model architectures, and image transformations for computer vision.
torchvision.transforms	An interface that contains common transforms for image processing.

Other Imports

The next imports are standard Python packages used for data science:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt


from sklearn.metrics import confusion_matrix
#from plotcm import plot_confusion_matrix


import pdb


torch.set_printoptions(linewidth=120)

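Note that pdb is the Python debugger, and the commented-out plotcm import refers to a local file, introduced in a later post, that is used to plot the confusion matrix. The last line sets the print options for PyTorch print statements.

We are now ready to prepare our data.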

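Preparing data with PyTorch

Our ultimate goal when preparing our data is to do the following (ETL):

Extract: get the Fashion-MNIST image data from the source.
Transform: put the data into tensor form.
Load: put the data into an object that makes it easily accessible.

For these purposes, PyTorch provides us with two classes: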

Class	Description
torch.utils.data.Dataset	An abstract class for representing a dataset.
torch.utils.data.DataLoader	Wraps a dataset and provides access to the underlying data.

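An abstract class is a Python class that has methods we must implement, so we can create a custom dataset by creating a subclass that extends the functionality of the Dataset class.

Abstract classes: https://docs.python.org/3/library/abc.html

To create a custom dataset with PyTorch, we extend the Dataset class by creating a subclass that implements these required methods. Once that is done, the new subclass can be passed to a PyTorch DataLoader object.

We will use the Fashion-MNIST dataset that is built into the torchvision package, so we do not have to do this for our project. Just know that the built-in Fashion-MNIST dataset class is doing this behind the scenes.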

All subclasses of the Dataset class must override __len__, that provides the size of the dataset, and __getitem__, supporting integer indexing in range from 0 to len(self) exclusive.

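Specifically, two methods need to be implemented. The __len__ method returns the length of the dataset, and the __getitem__ method returns the element at a specific index location within the dataset.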
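As a rough illustration of these two methods, the sketch below defines a toy custom dataset. The class name SquaresDataset and its contents are hypothetical and chosen only for this example; the only PyTorch APIs assumed are torch.utils.data.Dataset and torch.tensor.

```python
import torch
from torch.utils.data import Dataset


class SquaresDataset(Dataset):
    """Hypothetical toy dataset: sample i is the pair (i, i squared) as tensors."""

    def __init__(self, size):
        self.size = size

    def __len__(self):
        # Size of the dataset.
        return self.size

    def __getitem__(self, index):
        # Support integer indexing from 0 up to len(self), exclusive.
        if not 0 <= index < self.size:
            raise IndexError(index)
        x = torch.tensor(float(index))
        return x, x ** 2


toy_set = SquaresDataset(5)
x, y = toy_set[3]
print(len(toy_set), x.item(), y.item())
```

Because the class implements both methods, an instance like toy_set could be passed to a DataLoader in the same way as the built-in datasets.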

The PyTorch torchvision package

The torchvision package gives us access to the following resources:

Datasets (like MNIST and Fashion-MNIST)
Models (like VGG16)
Transforms
Utils

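Computer vision

All of these resources relate to deep learning computer vision tasks.

When we learned about the Fashion-MNIST dataset in a previous post, we saw that the arXiv paper introducing the dataset stated that the authors intended it to be a drop-in replacement for the original MNIST dataset.

The idea was that frameworks like PyTorch could add Fashion-MNIST simply by changing the URL used to retrieve the data.

This is exactly what happened in PyTorch. The PyTorch FashionMNIST dataset simply extends the MNIST dataset and overrides the urls.

Here is the class definition from PyTorch's torchvision source code: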

class FashionMNIST(MNIST):
    """`Fashion-MNIST <https://github.com/zalandoresearch/fashion-mnist>`_ Dataset.


    Args:
        root (string): Root directory of dataset where ``processed/training.pt``
            and  ``processed/test.pt`` exist.
        train (bool, optional): If True, creates dataset from ``training.pt``,
            otherwise from ``test.pt``.
        download (bool, optional): If true, downloads the dataset from the internet and
            puts it in root directory. If dataset is already downloaded, it is not
            downloaded again.
        transform (callable, optional): A function/transform that  takes in an PIL image
            and returns a transformed version. E.g, ``transforms.RandomCrop``
        target_transform (callable, optional): A function/transform that takes in the
            target and transforms it.
    """
    urls = [
        'http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/train-images-idx3-ubyte.gz',
        'http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/train-labels-idx1-ubyte.gz',
        'http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/t10k-images-idx3-ubyte.gz',
        'http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/t10k-labels-idx1-ubyte.gz',
    ]
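The pattern in this class definition, a subclass that inherits everything and overrides only a class attribute, can be sketched in plain Python. The class names and example.com URLs below are hypothetical placeholders for illustration; this is not the torchvision implementation.

```python
class DigitsDataset:
    """Hypothetical base class; a stand-in for illustration, not torchvision's MNIST."""

    # Class attribute listing where the raw files live (placeholder URLs).
    urls = [
        "https://example.com/digits/train-images.gz",
        "https://example.com/digits/train-labels.gz",
    ]

    def sources(self):
        # The attribute is looked up through the instance, so a subclass
        # that redefines `urls` changes the result without new methods.
        return list(self.urls)


class FashionDataset(DigitsDataset):
    """Inherits all behavior and only swaps the download locations."""

    urls = [
        "https://example.com/fashion/train-images.gz",
        "https://example.com/fashion/train-labels.gz",
    ]


digits_sources = DigitsDataset().sources()
fashion_sources = FashionDataset().sources()
print(digits_sources[0])
print(fashion_sources[0])
```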

Now let's see how to take advantage of torchvision.

The PyTorch Dataset class

To get an instance of the FashionMNIST dataset using torchvision, we just create one like so:

train_set = torchvision.datasets.FashionMNIST(
    root='./data'
    ,train=True
    ,download=True
    ,transform=transforms.Compose([
        transforms.ToTensor()
    ])
)

Note that the root argument used to be './data/FashionMNIST'. It has since changed because of torchvision updates.

We specify the following arguments:

Parameter	Description
root	The location on disk where the data is located.
train	Whether the dataset is the training set.
download	Whether the data should be downloaded.
transform	A composition of transformations that should be performed on the dataset elements.

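Since we want our images to be transformed into tensors, we use the built-in transforms.ToTensor() transformation, and since this dataset will be used for training, we name the instance train_set.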

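The first time we run this code, the Fashion-MNIST dataset is downloaded locally. Subsequent calls check whether the data already exists before downloading it, so we do not have to worry about repeated downloads or repeated network calls.

The PyTorch DataLoader class

To create a DataLoader wrapper for our training set, we do it like this: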

train_loader = torch.utils.data.DataLoader(train_set
    ,batch_size=1000
    ,shuffle=True
)

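We simply pass train_set as an argument. We can now take advantage of DataLoader features that would otherwise be fairly complicated to implement by hand:

batch_size (1000 in our case)
shuffle (True in our case)
num_workers (default is 0, which means the main process will be used)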
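The effect of these options can be sketched without downloading Fashion-MNIST by wrapping synthetic tensors in a TensorDataset. The tensors below are placeholders that only mimic the shape of Fashion-MNIST samples, and the 3000-sample size is an arbitrary choice for this example.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Placeholder data: 3000 blank 1x28x28 "images" with random labels in 0..9.
images = torch.zeros(3000, 1, 28, 28)
labels = torch.randint(0, 10, (3000,))
fake_set = TensorDataset(images, labels)

# Same options as above; num_workers is left at its default of 0.
loader = DataLoader(fake_set, batch_size=1000, shuffle=True)

# Each iteration yields one batch of images and the matching labels.
batch_images, batch_labels = next(iter(loader))
print(len(loader))          # number of batches per pass over the data
print(batch_images.shape)
print(batch_labels.shape)
```

With shuffle=True, the samples are drawn in a new random order on every pass, which is something we would otherwise have to manage ourselves.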

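ETL summary

From an ETL perspective, we achieved the extract and the transform using torchvision when we created the dataset:

Extract: the raw data was extracted from the web.
Transform: the raw image data was transformed into a tensor.
Load: train_set was wrapped by (loaded into) the data loader, giving us access to the underlying data.

We should now have a good understanding of the torchvision module provided by PyTorch, and of how to use the Dataset and DataLoader classes in torch.utils.data to streamline ETL tasks.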

This article is a translation. It was prepared with care, but the translation may not be perfect.

The original English article is available at:

https://deeplizard.com/learn/video/8n-TGaBZnk4
