An intuitive guide to Convolutional Neural Networks

by Daphne Cornelisse

In this article, we will explore Convolutional Neural Networks (CNNs) and, on a high level, go through how they are inspired by the structure of the brain. If you want to read more about the brain specifically, there are more resources at the end of the article to help you further.

The Brain

We are constantly analysing the world around us. Without conscious effort, we make predictions about everything we see, and act upon them. When we see something, we label every object based on what we have learned in the past. To illustrate this, look at this picture for a moment.

You probably thought something like “that’s a happy little boy standing on a chair”. Or maybe you thought he looks like he is screaming, about to attack this cake in front of him.

This is what we subconsciously do all day. We see, label, make predictions, and recognize patterns. But how do we do that? How is it that we can interpret everything we see?

It took nature over 500 million years to create a system to do this. The collaboration between the eyes and the brain, called the primary visual pathway, is the reason we can make sense of the world around us.

While vision starts in the eyes, the actual interpretation of what we see happens in the brain, in the primary visual cortex.

When you see an object, the light receptors in your eyes send signals via the optic nerve to the primary visual cortex, where the input is processed. The primary visual cortex makes sense of what the eye sees.

All of this seems very natural to us. We barely even think about how special it is that we are able to recognise all the objects and people we see in our lives. The deeply complex hierarchical structure of neurons and connections in the brain plays a major role in this process of remembering and labelling objects.

Think about how we learned what, for example, an umbrella is. Or a duck, lamp, candle, or book. In the beginning, our parents or family told us the names of the objects in our direct environment. We learned from the examples that were given to us. Slowly but surely we started to recognise certain things more and more often in our environment. They became so common that the next time we saw them, we would instantly know what the name of the object was. They became part of our model of the world.

Convolutional Neural Networks

Similar to how a child learns to recognise objects, we need to show an algorithm millions of pictures before it is able to generalize from the input and make predictions for images it has never seen before.

Computers ‘see’ in a different way than we do. Their world consists only of numbers. Every image can be represented as a 2-dimensional array of numbers, known as pixels.
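To make the "array of numbers" idea concrete, here is a minimal sketch using NumPy. The 4x4 pixel values are made up for illustration; a real photograph would simply be a much larger array.

```python
import numpy as np

# A tiny, made-up 4x4 greyscale "image": each entry is one pixel intensity
# (0 = black, 255 = white).
image = np.array([
    [  0,   0, 255, 255],
    [  0,   0, 255, 255],
    [255, 255,   0,   0],
    [255, 255,   0,   0],
], dtype=np.uint8)

print(image.shape)   # (height, width)
print(image[0, 2])   # intensity of the pixel in row 0, column 2

# Networks usually work with floats in [0, 1] rather than integers in [0, 255].
scaled = image.astype(np.float32) / 255.0
```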

But the fact that they perceive images in a different way doesn’t mean we can’t train them to recognize patterns, like we do. We just have to think of what an image is in a different way.

To teach an algorithm how to recognise objects in images, we use a specific type of Artificial Neural Network: a Convolutional Neural Network (CNN). Their name stems from one of the most important operations in the network: convolution.

Convolutional Neural Networks are inspired by the brain. Research in the 1950s and 1960s by D. H. Hubel and T. N. Wiesel on the brains of mammals suggested a new model for how mammals perceive the world visually. They showed that cat and monkey visual cortexes include neurons that respond exclusively to stimuli in a small, local region of the visual field.

In their paper, they described two basic types of visual neuron cells in the brain that each act in a different way: simple cells (S cells) and complex cells (C cells).

The simple cells activate, for example, when they identify basic shapes such as lines in a fixed area and at a specific angle. The complex cells have larger receptive fields and their output is not sensitive to the specific position in the field.

The complex cells continue to respond to a certain stimulus, even when its absolute position on the retina changes. “Complex” here means more flexible.

In vision, the receptive field of a single sensory neuron is the specific region of the retina in which something will affect the firing of that neuron (that is, will activate the neuron). Every sensory neuron has a similar receptive field, and these fields overlap.

Further, the concept of hierarchy plays a significant role in the brain. Information is stored in sequences of patterns, in sequential order. The neocortex, which is the outermost layer of the brain, stores information hierarchically. It is stored in cortical columns, or uniformly organised groupings of neurons in the neocortex.

In 1980, a researcher called Fukushima proposed a hierarchical neural network model. He called it the neocognitron. This model was inspired by the concepts of the Simple and Complex cells. The neocognitron was able to recognise patterns by learning about the shapes of objects.

Later, in 1998, Convolutional Neural Networks were introduced in a paper by LeCun, Bottou, Bengio and Haffner. Their first Convolutional Neural Network was called LeNet-5 and was able to classify handwritten digits.

For the entire history on Convolutional Neural Nets, you can go here.

Architecture

In the remainder of this article, I will take you through the architecture of a CNN and show you the Python implementation as well.

Convolutional Neural Networks have a different architecture than regular Neural Networks. Regular Neural Networks transform an input by putting it through a series of hidden layers. Every layer is made up of a set of neurons, where each neuron is fully connected to all neurons in the previous layer. Finally, there is a last fully connected layer, the output layer, that represents the predictions.

Convolutional Neural Networks are a bit different. First of all, the layers are organised in 3 dimensions: width, height and depth. Further, the neurons in one layer do not connect to all the neurons in the next layer but only to a small region of it. Lastly, the final output will be reduced to a single vector of probability scores, organized along the depth dimension.

CNNs have two components:

The hidden layers / feature extraction part

In this part, the network will perform a series of convolutions and pooling operations during which the features are detected. If you had a picture of a zebra, this is the part where the network would recognise its stripes, two ears, and four legs.

The classification part

Here, the fully connected layers will serve as a classifier on top of these extracted features. They will assign a probability for the object on the image being what the algorithm predicts it is.

# Before we start building, we import the libraries
import numpy as np
from keras.layers import Conv2D, Activation, MaxPool2D, Flatten, Dense
from keras.models import Sequential

Feature extraction

Convolution is one of the main building blocks of a CNN. The term convolution refers to the mathematical combination of two functions to produce a third function. It merges two sets of information.

In the case of a CNN, the convolution is performed on the input data with the use of a filter or kernel (these terms are used interchangeably) to then produce a feature map.

We execute a convolution by sliding the filter over the input. At every location, the filter and the patch of input beneath it are multiplied element by element, and the sum of the products is written to the corresponding position of the feature map.
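The sliding operation can be sketched in a few lines of NumPy. This is an illustrative implementation with stride 1 and no padding, not what Keras runs internally; the function name `conv2d` and the vertical-edge kernel are assumptions chosen for the example. Like most deep learning libraries, it does not flip the kernel, so strictly speaking it computes a cross-correlation.

```python
import numpy as np

def conv2d(image, kernel):
    """Slide `kernel` over `image` (stride 1, no padding) and return the feature map."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    feature_map = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # Patch of the input currently under the filter
            window = image[i:i + kh, j:j + kw]
            # Element-wise multiply, then sum into one output value
            feature_map[i, j] = np.sum(window * kernel)
    return feature_map

image = np.arange(25, dtype=float).reshape(5, 5)   # toy 5x5 input
kernel = np.array([[1., 0., -1.],
                   [1., 0., -1.],
                   [1., 0., -1.]])                 # hypothetical vertical-edge filter
feature_map = conv2d(image, kernel)
print(feature_map.shape)   # (3, 3): a 3x3 filter fits in 3x3 positions on a 5x5 input
```

Because the toy input increases by the same amount from left to right everywhere, every position of the feature map holds the same value.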

In the animation below, you can see the convolution operation. You can see the filter (the green square) is sliding over our input (the blue square) and the sum of the convolution goes into the feature map (the red square).

The area of our filter is also called the receptive field, named after the neuron cells! The size of this filter is 3x3.

For the sake of explanation, I have shown you the operation in 2D, but in reality convolutions are performed in 3D. Each image is represented as a 3D matrix with dimensions for width, height, and depth. Depth is a dimension because of the colour channels used in an image (RGB).

We perform numerous convolutions on our input, where each operation uses a different filter. This results in different feature maps. In the end, we take all of these feature maps and stack them together as the final output of the convolution layer.
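As a rough sketch of the 3D case, the filter gets one slice per input channel, and each filter still produces a single 2D feature map; several filters give several maps that are stacked along the depth axis. The helper name `conv2d_multichannel` and the all-ones inputs are assumptions made for illustration.

```python
import numpy as np

def conv2d_multichannel(image, kernel):
    """image: (H, W, C), kernel: (kh, kw, C). Returns one 2D feature map."""
    kh, kw, _ = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # Multiply over height, width AND channels, then sum everything
            out[i, j] = np.sum(image[i:i + kh, j:j + kw, :] * kernel)
    return out

rgb = np.ones((5, 5, 3))                  # toy 5x5 RGB image
filters = [np.ones((3, 3, 3)),            # two made-up filters
           2 * np.ones((3, 3, 3))]

# One feature map per filter, stacked along the last (depth) axis
stacked = np.stack([conv2d_multichannel(rgb, f) for f in filters], axis=-1)
print(stacked.shape)   # (3, 3, 2): depth equals the number of filters
```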

Just like in any other Neural Network, we use an activation function to make our output non-linear. In a Convolutional Neural Network, the output of the convolution is passed through the activation function. A common choice is the ReLU activation function.
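ReLU itself is very simple: it keeps positive values and replaces negative ones with zero. A minimal NumPy sketch, with a made-up feature map:

```python
import numpy as np

def relu(x):
    """Rectified linear unit: max(0, x), applied element-wise."""
    return np.maximum(0, x)

feature_map = np.array([[-6.0, 2.0],
                        [ 0.0, -1.5]])
activated = relu(feature_map)
print(activated)   # negative entries become 0, positive entries are unchanged
```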

Stride is the size of the step the convolution filter moves each time. A stride size is usually 1, meaning the filter slides pixel by pixel. By increasing the stride size, your filter is sliding over the input with a larger interval and thus has less overlap between the cells.

The animation below shows stride size 1 in action.

Because the size of the feature map is always smaller than the input, we have to do something to prevent our feature map from shrinking. This is where we use padding.

A border of zero-valued pixels is added around the input so that our feature map will not shrink. In addition to keeping the spatial size constant after performing convolution, padding also improves performance and makes sure the kernel and stride size will fit in the input.
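How stride and padding affect the size of the feature map follows the standard formula `(n + 2p - k) // s + 1` along each spatial dimension, where `n` is the input size, `k` the kernel size, `p` the padding and `s` the stride. The helper name `conv_output_size` is made up for this sketch; `np.pad` is used to show what zero padding does to the input.

```python
import numpy as np

def conv_output_size(n, k, stride=1, padding=0):
    """Output length along one spatial dimension of a convolution."""
    return (n + 2 * padding - k) // stride + 1

no_padding = conv_output_size(28, 2)               # 27: the map shrinks
same_size = conv_output_size(28, 3, padding=1)     # 28: padding keeps the size
strided = conv_output_size(28, 3, stride=2)        # 13: larger steps, smaller map

# Zero padding: surround a 3x3 input with a one-pixel border of zeros
padded = np.pad(np.ones((3, 3)), pad_width=1, mode="constant", constant_values=0)
print(padded.shape)   # (5, 5)
```

The first case, a 28x28 input with a 2x2 kernel and no padding, gives the 27x27 output that appears in the model summary of this article.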

After a convolution layer, it is common to add a pooling layer. The function of pooling is to progressively reduce the spatial size of the feature maps, which reduces the number of parameters and the amount of computation in the network. This shortens the training time and helps control overfitting.

The most common type of pooling is max pooling, which takes the maximum value in each window. The window size needs to be specified beforehand. This decreases the feature map size while keeping the most significant information.
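Max pooling can be sketched with a reshape trick in NumPy. This version assumes non-overlapping square windows and drops any trailing rows or columns that do not fill a whole window; the function name `max_pool2d` and the input values are illustrative.

```python
import numpy as np

def max_pool2d(feature_map, size=2):
    """Non-overlapping max pooling over `size` x `size` windows."""
    h, w = feature_map.shape
    out_h, out_w = h // size, w // size
    # Drop rows/columns that do not fill a complete window
    trimmed = feature_map[:out_h * size, :out_w * size]
    # Split each spatial axis into (window index, position inside window)
    windows = trimmed.reshape(out_h, size, out_w, size)
    return windows.max(axis=(1, 3))

fm = np.array([[1, 3, 2, 4],
               [5, 6, 1, 0],
               [7, 2, 9, 8],
               [1, 0, 3, 4]])
pooled = max_pool2d(fm)
print(pooled)
# [[6 4]
#  [7 9]]
```

Applied to a 27x27 feature map, the same function returns a 13x13 map, because the last row and column cannot fill a 2x2 window.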

Thus when using a CNN, the four important hyperparameters we have to decide on are:

the kernel size
the filter count (that is, how many filters we want to use)
stride (how big the steps of the filter are)
padding

# Images fed into this model are 28 x 28 pixels with 1 channel (greyscale)
img_shape = (28, 28, 1)

# Set up the model
model = Sequential()

# Add a convolutional layer with 6 filters of size 2x2.
# Stride defaults to 1 and padding to 'valid', so the 28x28 input shrinks to 27x27.
model.add(Conv2D(6, 2, input_shape=img_shape))

# Add ReLU activation to the layer
model.add(Activation('relu'))

# Pooling: 2x2 max pooling
model.add(MaxPool2D(2))

A nice way of visualizing a convolution layer is shown below. Try to look at it for a bit and really understand what is happening.

Classification

After the convolution and pooling layers, our classification part consists of a few fully connected layers. However, these fully connected layers can only accept 1-dimensional data. To convert our 3D data to 1D, we use the Flatten layer in Keras. This essentially arranges our 3D volume into a 1D vector.
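Flattening is just a reshape. The sketch below uses NumPy on a 13x13x6 volume, the shape that comes out of the pooling layer in this article's model; the array contents are placeholders.

```python
import numpy as np

# One pooled volume: 13 x 13 spatial positions, 6 feature maps
volume = np.arange(13 * 13 * 6).reshape(13, 13, 6)
flat = volume.reshape(-1)
print(flat.shape)      # (1014,) because 13 * 13 * 6 = 1014

# With a batch dimension in front, only the per-sample axes are flattened
batch = np.zeros((32, 13, 13, 6))
flat_batch = batch.reshape(len(batch), -1)
print(flat_batch.shape)   # (32, 1014)
```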

The last layers of a CNN are fully connected layers. Neurons in a fully connected layer have full connections to all the activations in the previous layer. This part is in principle the same as a regular Neural Network.
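A fully connected layer followed by softmax can be sketched as one matrix product plus a normalisation step. The weights below are random placeholders rather than trained values, and the sizes (1014 inputs, 10 classes) are taken from this article's model.

```python
import numpy as np

def softmax(z):
    """Turn raw scores into probabilities that sum to 1."""
    z = z - np.max(z)      # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

rng = np.random.default_rng(0)
x = rng.normal(size=1014)                     # flattened features (placeholder values)
W = rng.normal(scale=0.01, size=(1014, 10))   # every input connects to every output neuron
b = np.zeros(10)

scores = x @ W + b        # dense layer: weighted sum per output neuron
probs = softmax(scores)   # one probability per class
print(probs.shape, probs.sum())
```

The predicted class is the index with the highest probability, `np.argmax(probs)`.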

# Fully connected layers

# Use Flatten to convert 3D data to 1D
model.add(Flatten())

# Add dense layer with 10 neurons
model.add(Dense(10))

# We use the softmax activation function for our last layer
model.add(Activation('softmax'))

# Give an overview of our model
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
conv2d_1 (Conv2D)            (None, 27, 27, 6)         30        
_________________________________________________________________
activation_1 (Activation)    (None, 27, 27, 6)         0         
_________________________________________________________________
max_pooling2d_1 (MaxPooling2 (None, 13, 13, 6)         0         
_________________________________________________________________
flatten_1 (Flatten)          (None, 1014)              0         
_________________________________________________________________
dense_1 (Dense)              (None, 10)                10150     
_________________________________________________________________
activation_2 (Activation)    (None, 10)                0         
=================================================================
Total params: 10,180
Trainable params: 10,180
Non-trainable params: 0
_________________________________________________________________
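The parameter counts in the summary can be checked by hand. Each convolution filter has one weight per kernel position and input channel plus one bias, and each dense neuron has one weight per input plus one bias. The helper names below are made up for this sketch.

```python
def conv2d_params(kernel_size, in_channels, filters):
    """Weights per filter (k * k * channels) plus one bias, times the number of filters."""
    return (kernel_size * kernel_size * in_channels + 1) * filters

def dense_params(inputs, units):
    """One weight per input plus one bias, for every output neuron."""
    return (inputs + 1) * units

conv = conv2d_params(kernel_size=2, in_channels=1, filters=6)   # 30
dense = dense_params(inputs=13 * 13 * 6, units=10)              # 10150
total = conv + dense                                            # 10180
print(conv, dense, total)
```

Activation, pooling and flatten layers have no trainable parameters, which is why they show 0 in the table.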

Training

Training a CNN works in the same way as training a regular neural network, using backpropagation and gradient descent. However, here this is a bit more mathematically complex because of the convolution operations.
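The core loop of gradient descent is the same whatever the network looks like: compute the loss, compute its gradient with respect to the weights, and take a small step against the gradient. The sketch below uses a deliberately tiny problem (one weight, fitting y = 2x) so the gradient can be written by hand; it is plain gradient descent with an arbitrary learning rate, not the Adam optimiser used later in this article.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = 2.0 * x                  # the relationship we want the model to learn
w = 0.0                      # single trainable weight, starting far from the answer
learning_rate = 0.05

losses = []
for _ in range(50):
    pred = w * x                                  # forward pass
    loss = np.mean((pred - y) ** 2)               # mean squared error
    grad = np.mean(2.0 * (pred - y) * x)          # dLoss/dw, derived by hand
    w = w - learning_rate * grad                  # gradient descent update
    losses.append(loss)

print(w)   # approaches 2.0 as the loss shrinks
```

In a real CNN, backpropagation computes these gradients automatically for every filter weight and dense weight.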

If you would like to read more about how regular neural nets work, you can read my previous article.

"""Before the training process, we have to put together a learning process in a particular form. It consists of 3 elements: an optimiser, a loss function and a metric."""
model.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=['acc'])

# Dataset with handwritten digits to train the model on
from keras.datasets import mnist

(x_train, y_train), (x_test, y_test) = mnist.load_data()

# Add a trailing channel dimension: (N, 28, 28) -> (N, 28, 28, 1)
x_train = np.expand_dims(x_train, -1)
x_test = np.expand_dims(x_test, -1)

# Train the model, iterating on the data in batches of 32 samples
# for 10 epochs
model.fit(x_train, y_train, batch_size=32, epochs=10, validation_data=(x_test, y_test))

# Training...

Train on 60000 samples, validate on 10000 samples
Epoch 1/10
60000/60000 [==============================] - 10s 175us/step - loss: 4.0330 - acc: 0.7424 - val_loss: 3.5352 - val_acc: 0.7746
Epoch 2/10
60000/60000 [==============================] - 10s 169us/step - loss: 3.5208 - acc: 0.7746 - val_loss: 3.4403 - val_acc: 0.7794
Epoch 3/10
60000/60000 [==============================] - 11s 176us/step - loss: 2.4443 - acc: 0.8372 - val_loss: 1.9846 - val_acc: 0.8645
Epoch 4/10
60000/60000 [==============================] - 10s 173us/step - loss: 1.8943 - acc: 0.8691 - val_loss: 1.8478 - val_acc: 0.8713
Epoch 5/10
60000/60000 [==============================] - 10s 174us/step - loss: 1.7726 - acc: 0.8735 - val_loss: 1.7595 - val_acc: 0.8718
Epoch 6/10
60000/60000 [==============================] - 10s 174us/step - loss: 1.6943 - acc: 0.8765 - val_loss: 1.7150 - val_acc: 0.8745
Epoch 7/10
60000/60000 [==============================] - 10s 173us/step - loss: 1.6765 - acc: 0.8777 - val_loss: 1.7268 - val_acc: 0.8688
Epoch 8/10
60000/60000 [==============================] - 10s 173us/step - loss: 1.6676 - acc: 0.8799 - val_loss: 1.7110 - val_acc: 0.8749
Epoch 9/10
60000/60000 [==============================] - 10s 172us/step - loss: 1.4759 - acc: 0.8888 - val_loss: 0.1346 - val_acc: 0.9597
Epoch 10/10
60000/60000 [==============================] - 11s 177us/step - loss: 0.1026 - acc: 0.9681 - val_loss: 0.1144 - val_acc: 0.9693
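The `loss` and `acc` columns in the log come from the loss function and metric chosen at compile time. As a rough sketch of what they measure, sparse categorical cross-entropy is the average of minus the log of the probability assigned to the correct class, and accuracy is the fraction of samples whose most probable class is the correct one. The NumPy functions and the four made-up predictions below are illustrative, not the Keras implementation.

```python
import numpy as np

def sparse_categorical_crossentropy(probs, labels):
    """probs: (N, classes) predicted probabilities; labels: (N,) integer class ids."""
    picked = probs[np.arange(len(labels)), labels]   # probability of the true class
    return float(np.mean(-np.log(picked)))

def accuracy(probs, labels):
    """Fraction of samples where the most probable class equals the label."""
    return float(np.mean(np.argmax(probs, axis=1) == labels))

probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.8, 0.1],
                  [0.3, 0.3, 0.4],
                  [0.5, 0.4, 0.1]])    # made-up predictions for 4 samples, 3 classes
labels = np.array([0, 1, 2, 1])        # the last sample is misclassified

loss = sparse_categorical_crossentropy(probs, labels)
acc = accuracy(probs, labels)
print(loss, acc)
```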

Summary

In summary, CNNs are especially useful for image classification and recognition. They have two main parts: a feature extraction part and a classification part.

The main special technique in CNNs is convolution, where a filter slides over the input and combines the input values with the filter values to produce the feature map. In the end, our goal is to feed new images to our CNN so it can give a probability for the object it thinks it sees, or describe an image with text.

You can find the entire code here.
