A step-by-step guided reading of a deep learning classic: the paper that started LeNet and one of the earliest CNN papers, "Handwritten Digit Recognition with a Back-Propagation Network"

Fyaxm

Last modified: 2024-10-24 14:04:30

First published: 2024-10-24 13:21:31

Article link: https://blog.csdn.net/liaoziqiang/article/details/143206275

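Preface: In deep learning, convolutional neural networks (CNNs) have become a mainstay of image recognition, and like many major ideas they had a bumpy road. The underlying principles trace back to Hubel and Wiesel's studies of the visual system in the 1950s and 1960s. Building on that work, the Japanese scientist Kunihiko Fukushima proposed the Neocognitron around 1980, which is widely regarded as the first CNN-like network, although it was not trained with backpropagation. The turning point came in 1989, when LeCun and colleagues at Bell Labs used backpropagation to train a convolutional network in "Backpropagation Applied to Handwritten Zip Code Recognition" and obtained good results; that network was the prototype of the famous LeNet. Shortly afterwards, in "Handwritten Digit Recognition with a Back-Propagation Network", they presented a more mature version that adds averaging/subsampling (average pooling) layers, which reduces the number of parameters and improves robustness. This network is commonly known as LeNet-1, one of the best-known networks in the history of deep learning. This article walks through that second paper, because its architecture is more mature and closer to modern designs. I hope it helps readers understand CNNs from the source.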

My own understanding is limited, so some explanations here may be unclear or wrong; corrections are very welcome.

Abstract

We present an application of back-propagation networks to handwritten digit recognition. Minimal preprocessing of the data was required, but the architecture of the network was highly constrained and specifically designed for the task. The input of the network consists of normalized images of isolated digits. The method has a 1% error rate and about a 9% reject rate on zipcode digits provided by the U.S. Postal Service.

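In short: the paper applies a back-propagation network to handwritten digit recognition. Very little preprocessing is needed, but the architecture is heavily constrained and tailored to the task. The input is a normalized image of an isolated digit, and on zip code digits provided by the U.S. Postal Service the method reaches a 1% error rate at a reject rate of about 9%.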

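The first sentence states the goal: handwritten digit recognition. It is a classic task that many deep learning beginners practice on. It is a basic classification problem with 10 classes: the network takes one image as input and outputs a class label from 0 to 9.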

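“Minimal preprocessing of the data was required” highlights a key property of deep learning: automatic feature extraction. A deep model can work directly on raw inputs such as images, audio, or text, and learns its own features instead of relying on elaborate manual feature engineering. Before deep learning, similar tasks took more work: data was preprocessed with spatial/frequency-domain transforms, principal component analysis (PCA), and similar tools to suit a particular feature extraction algorithm. The features themselves were usually hand-designed, which was laborious and often gave unsatisfactory results, although such pipelines could be cheaper to compute.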

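“the architecture of the network was highly constrained” refers to constraints placed on the connections of the network. As later sections show, connections in the early layers are local and many units share the same weights, which greatly reduces the number of free parameters. “specifically designed for the task” says that these constraints were chosen using prior knowledge about shape recognition. A separate, practical limitation is that a network like this needs a fixed-size input and produces a fixed-size output, which is inflexible in many applications. Later research addressed this with sequence models (such as recurrent neural networks), self-attention (such as the Transformer), and other techniques; another option is to adjust the data itself by padding, cropping, or resizing. This paper takes the data route: every digit is resized to the same size and normalized, as the third sentence of the abstract notes.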

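Judging by the last sentence, the paper reaches a high accuracy. The “reject rate” is the fraction of images the system declines to classify: when its confidence in a decision is too low, or the image does not look recognizable, it rejects the image rather than output a label that may be wrong.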

1. Introduction

The main point of this paper is to show that large back-propagation (BP) networks can be applied to real image-recognition problems without a large, complex preprocessing stage requiring detailed engineering. Unlike most previous work on the subject (Denker et al., 1989), the learning network is directly fed with images, rather than feature vectors, thus demonstrating the ability of BP networks to deal with large amounts of low-level information.

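In short: the paper's main point is that large back-propagation (BP) networks can be applied to real image-recognition problems without a large, hand-engineered preprocessing stage. Unlike most earlier work (Denker et al., 1989), the network is fed images directly rather than feature vectors, which shows that BP networks can handle large amounts of low-level information.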

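A word on the last sentence. “Low-level information” usually means low-level image features such as color, corners, edges, and texture. Traditional methods obtain them with techniques such as Canny edge detection or Harris corner detection; in deep learning, the first few layers of the network usually take over this job. By stacking layers, the network extracts progressively higher-level features, such as shapes and patterns, and even complex information such as actions and poses. This lets the model understand image content at several levels.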

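The “feature vectors” mentioned in the text are one way of expressing higher-level features. You can think of a feature vector as a “compact description” of the data: it keeps the characteristics that matter most for the task and stores them as a vector, a form that suits neural networks well. In the earlier work the paper contrasts itself with, such vectors came from hand-designed extractors. Today they can also come from a pretrained model, so the feature vector can be used directly as the input to a downstream task and a great deal of training is saved. In natural language processing, for example, a pretrained model can produce an embedding vector that is then used to fine-tune downstream tasks such as text classification, sentiment analysis, or named entity recognition.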

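Traditional methods depend on manual feature extraction: taking feature vectors as input means the researcher has to design the feature extraction algorithm, which is both time-consuming and subjective. A deep learning model, by contrast, only needs the raw image and extracts features from low level to high level on its own. This automated feature learning makes modeling considerably simpler while improving adaptability and performance.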

Previous work performed on simple digit images(Le Cun, 1989)showed that the architecture of the network strongly influences the network’s generalization ability. Good generalization can only be obtained by designing a network architecture that contains a certain amount of a priori knowledge about the problem. The basic design principle is to minimize the number of free parameters that must be determined by the learning algorithm, without overly reducing the computational power of the network. This principle increases the probability of correct generalization because it results in a specialized network architecture that has a reduced entropy (Denker et al., 1987; Patarnello and Carnevali, 1987; Tishby, Levin and Solla, 1989; Le Cun, 1989). On the other hand, some effort must be devoted to designing appropriate constraints into the architecture.

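In short: earlier work on simple digit images (Le Cun, 1989) showed that architecture strongly affects generalization, and that good generalization requires building some a priori knowledge about the problem into the architecture. The design principle is to minimize the number of free parameters the learning algorithm must determine without overly reducing the network's computational power. This raises the probability of correct generalization because the resulting specialized architecture has reduced entropy; the price is the effort of designing suitable constraints.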

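First, “a priori knowledge”. It is knowledge about the data or the model, specific to a domain or task, that comes from experience, theory, or observation; put simply, information we already have. Researchers keep looking for ways to remove human judgment entirely, but bringing in prior knowledge is sometimes essential. For example, processing images with a fully connected network leads to an enormous amount of computation, whereas a convolutional network, designed around the locality of images, needs far fewer parameters and makes complex image tasks feasible.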

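When designing a neural network, an important principle is to control the number of parameters and choose the number of layers sensibly. If you have ever tuned a model, you will have noticed that a small change in a setting can make a dramatic difference. In general, too many parameters lead to overfitting: the model is so flexible that it memorizes details of the training data instead of learning general rules (like a student who learns by rote), and it then performs poorly on new data, i.e. it generalizes badly. That is why the paper asks to “minimize the number of free parameters that must be determined by the learning algorithm”, so that the model learns general rules and generalizes to unseen data.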
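To make the argument about free parameters concrete, here is a minimal sketch that counts the parameters of a hypothetical fully connected layer against a shared-weight layer with the same number of output units. The sizes (a 28 by 28 input, 4 feature maps, 5 by 5 kernels) are the ones the paper gives later for its first layer; the dense layer is an illustrative comparison and not something the paper builds.

```python
# Free parameters of a dense layer vs. a shared-weight (convolutional) layer
# that produces the same number of output units.
in_h = in_w = 28                 # input plane
k = 5                            # kernel side length
n_maps = 4                       # feature maps in the first hidden layer
out_h = out_w = in_h - k + 1     # 24: no padding, stride 1

n_inputs = in_h * in_w                    # 784 input pixels
n_outputs = n_maps * out_h * out_w        # 2304 output units

# Dense: every output unit has its own weight for every input, plus a bias.
dense_params = n_outputs * (n_inputs + 1)
# Shared weights: one 5x5 kernel plus one bias per feature map.
conv_params = n_maps * (k * k + 1)

print(dense_params, conv_params)
```

The shared-weight layer needs 104 parameters where the dense layer would need about 1.8 million, which is the kind of reduction the design principle is after.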

2. Zipcode Recognition

The handwritten digit-recognition application was chosen because it is a relatively simple machine vision task: the input consists of black or white pixels, the digits are usually well-separated from the background, and there are only ten output categories. Yet the problem deals with objects in a real two-dimensional space and the mapping from image space to category space has both considerable regularity and considerable complexity. The problem has added attraction because it is of great practical value.

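In short: handwritten digit recognition was chosen because it is a relatively simple machine vision task (black or white pixels, digits usually well separated from the background, only ten classes), yet it involves real two-dimensional objects, and the mapping from image space to category space is both regular and complex. It is also of great practical value.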

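Not much needs explaining here. “The input consists of black or white pixels” means the images can be handled as 0/1 binary images, which greatly reduces the amount of data and the difficulty of processing.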

The database used to train and test the network is a superset of the one used in the work reported last year (Denker et al., 1989). We emphasize that the method of solution reported here relies more heavily on automatic learning, and much less on hand-designed preprocessing.

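In short: the database is a superset of the one used in the previous year's work (Denker et al., 1989), and the method reported here relies far more on automatic learning and far less on hand-designed preprocessing.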

The database consists of 9298 segmented numerals digitized from handwritten zipcodes that appeared on real U.S. Mail passing through the Buffalo, N.Y. post office. Examples of such images are shown in figure 1. The digits were written by many different people, using a great variety of sizes, writing styles and instruments, with widely varying levels of care. This was supplemented by a set of 3349 printed digits coming from 35 different fonts. The training set consisted of 7291 handwritten digits plus 2549 printed digits. The remaining 2007 handwritten and 700 printed digits were used as the test set. The printed fonts in the test set were different from the printed fonts in the training set. One important feature of this database, which is a common feature to all real-world databases, is that both the training set and the testing set contain numerous examples that are ambiguous, unclassifiable, or even misclassified.

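In short: the database contains 9298 segmented digits taken from handwritten zip codes on real U.S. mail that passed through the Buffalo, N.Y. post office (Figure 1 shows examples). They were written by many different people, in many sizes and styles, with different instruments and varying care. Printed digits from 35 different fonts were added. The training set has 7291 handwritten and 2549 printed digits; the remaining 2007 handwritten and 700 printed digits form the test set, whose printed fonts differ from those in the training set. As in any real-world database, both sets contain many examples that are ambiguous, unclassifiable, or even mislabeled.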

[Figure 1 of the paper; image not included]

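Summary of the dataset:

Split	Handwritten digits	Printed digits	Total
Training set	7291	2549	9840
Test set	2007	700	2707
Total	9298	3249	12547

Note that the paper quotes 3349 printed digits, whereas the training and test splits it reports add up to 2549 + 700 = 3249; the table uses the sum of the splits.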
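The dataset arithmetic can be checked in a few lines. The counts below are copied from the quoted paragraph; the variable names are only illustrative. The check also shows that the printed-digit total quoted in the paper (3349) does not match the sum of the two splits.

```python
# Counts as quoted from the paper.
train_set = {"handwritten": 7291, "printed": 2549}
test_set = {"handwritten": 2007, "printed": 700}
quoted_printed = 3349  # "a set of 3349 printed digits" in the quoted text

handwritten_total = train_set["handwritten"] + test_set["handwritten"]
printed_total = train_set["printed"] + test_set["printed"]
train_total = sum(train_set.values())
test_total = sum(test_set.values())

print("handwritten:", handwritten_total)
print("printed (sum of splits):", printed_total)
print("training set:", train_total, "test set:", test_total)
print("printed digits unaccounted for:", quoted_printed - printed_total)
```

The handwritten total matches the 9298 quoted in the paper, while the printed total falls 100 short of the quoted figure.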

3. Preprocessing

Acquisition, binarization, location of the zip code, and preliminary segmentation were performed by Postal Service contractors (Wang and Srihari, 1988). Some of these steps constitute very hard tasks in themselves. The segmentation (separating each digit from its neighbors) would be a relatively simple task if we could assume that a character is contiguous and is disconnected from its neighbors, but neither of these assumptions holds in practice. Many ambiguous characters in the database are the result of mis-segmentation (especially broken 5’s) as can be seen on figure 2.

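In short: acquisition, binarization, zip code location, and preliminary segmentation were done by Postal Service contractors (Wang and Srihari, 1988), and some of these steps are very hard in themselves. Segmentation would be fairly easy if each character were contiguous and disconnected from its neighbors, but neither holds in practice. Many ambiguous characters in the database, especially broken 5's, come from mis-segmentation (see Figure 2).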

[Figure 2 of the paper; image not included]

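Data preprocessing matters a great deal in deep learning. It covers cleaning, standardization, normalization, and similar steps, all aimed at better model performance and more efficient training. By removing noise, handling missing values, and rescaling feature ranges, preprocessing speeds up convergence and lowers the risk of overfitting, and suitable data augmentation can further improve generalization. Good preprocessing is often the foundation of a successful training run. In this paper the authors focus only on recognizing individual characters; locating the zip code and the preliminary segmentation were done upstream by Postal Service contractors.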

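How the data was obtained is also worth a comment. Deep learning usually needs a lot of training data, and data is often hard to get. For popular tasks it can be collected from the web, but for highly specialized tasks that rarely works. One option is to hire professional annotators; as the saying goes, there is as much intelligence as there is human labor behind it. Another is crowdsourcing: by posting the task on a crowdsourcing platform, the publisher can collect labeled data quickly and draw on the wisdom of the crowd.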

At this point, the size of a digit varies but is typically around 40 by 60 pixels. Since the input of a back-propagation network is fixed size, it is necessary to normalize the size of the characters. This was performed using a linear transformation to make the characters fit in a 16 by 16 pixel image. This transformation preserves the aspect ratio of the character and is performed after extraneous marks in the image have been removed. Because of the linear transformation, the resulting image is not binary but has multiple gray levels, since a variable number of pixels in the original image can fall into a given pixel in the target image. The gray levels of each image are scaled and translated to fall within the range -1 to 1.

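In short: at this stage digit sizes vary, typically around 40 by 60 pixels. Because a back-propagation network needs a fixed-size input, each character is fitted into a 16 by 16 pixel image by a linear transformation that preserves its aspect ratio, applied after extraneous marks have been removed. The result is no longer binary but has multiple gray levels, because a varying number of source pixels can fall into one target pixel. The gray levels of each image are then scaled and translated into the range -1 to 1.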

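“the resulting image is not binary but has multiple gray levels” means that resizing the image involves something like linear interpolation, so each target pixel blends several source pixels and gray levels appear. The paper then scales pixel values to $[-1, 1]$. Keeping inputs in this range keeps the first units away from the saturated regions of the activation function, which helps against vanishing gradients, and it puts all inputs on a common scale, which makes training more efficient and stable. It was a sensible and effective choice for its time.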

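Note: the ReLU activation that is standard today only came into wide use around 2010; in 1989 the usual choices were functions such as the hyperbolic tangent (tanh). When the magnitude of the input is large, the derivative of tanh becomes very small (it tends to 0), so gradients vanish and the parameters stop being updated. In addition, tanh maps its input into the interval $(-1, 1)$, so feeding inputs in $[-1, 1]$ keeps the input on the same scale as the activations of later layers, which helps keep the network stable.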
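The two points above (scaling gray levels into $[-1, 1]$ and the shrinking derivative of tanh) can be illustrated with a small sketch. The min-max scaling used here is an assumption: the paper only says that the gray levels are "scaled and translated" into the range, without giving the formula. The function names are hypothetical.

```python
import math


def scale_to_unit_range(pixels):
    """Min-max scale a flat list of gray levels into [-1, 1] (assumed formula)."""
    lo, hi = min(pixels), max(pixels)
    if hi == lo:
        # A constant image carries no contrast; map it to the middle of the range.
        return [0.0 for _ in pixels]
    return [2.0 * (p - lo) / (hi - lo) - 1.0 for p in pixels]


def tanh_grad(x):
    """Derivative of tanh at x: 1 - tanh(x)^2."""
    return 1.0 - math.tanh(x) ** 2


scaled = scale_to_unit_range([0, 64, 128, 255])
print(scaled)

# The derivative is largest at 0 and collapses for large inputs.
for x in (0.0, 1.0, 5.0):
    print(x, tanh_grad(x))
```

Inside $[-1, 1]$ the derivative of tanh stays above 0.4, while at an input of 5 it has already dropped below 0.001, which is the saturation effect the note describes.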

4. The Network

The remainder of the recognition is entirely performed by a multi-layer network. All of the connections in the network are adaptive, although heavily constrained, and are trained using back-propagation. This is in contrast with earlier work (Denker et al., 1989) where the first few layers of connections were hand-chosen constants. The input of the network is a 16 by 16 normalized image, and the output is composed of 10 units: one per class. When a pattern belonging to class $i$ is presented, the desired output is $+ 1$ for the $i$ th output unit, and $- 1$ for the other output units.

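In short: the rest of the recognition is done entirely by a multi-layer network. All connections are adaptive, though heavily constrained, and are trained with back-propagation, in contrast to earlier work (Denker et al., 1989) where the first few layers of connections were hand-chosen constants. The input is a 16 by 16 normalized image and the output has 10 units, one per class. For a pattern of class $i$, the desired output is $+1$ for the $i$th output unit and $-1$ for all others.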

“The remainder of the recognition” means everything other than the preprocessing described above. Every weight in this network is learned by backpropagation; none is set by hand.

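This passage also describes the network's input and output. The input can be viewed as a $16 \times 16$ matrix. The output can be viewed as a vector $[c_0, c_1, \ldots, c_9]$ with 10 components, one for each digit from 0 to 9. Note that the “desired output is $+1$” in the original is the ideal case; in practice each component is a floating-point number in $(-1, 1)$, which can be read as the network's “confidence” that the image belongs to that class. The class with the highest confidence is taken as the prediction. As the abstract mentions, the system also refuses to classify when its confidence is too low; the exact criterion appears later.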
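As a minimal sketch of the output convention just described, the following builds the $+1$/$-1$ target vector for a class and picks the most active output unit as the prediction. The function names and the example activations are hypothetical.

```python
def make_target(label, n_classes=10):
    """Desired output: +1 for the unit of the true class, -1 for all others."""
    return [1.0 if i == label else -1.0 for i in range(n_classes)]


def predict(outputs):
    """Return the index of the most active output unit."""
    return max(range(len(outputs)), key=lambda i: outputs[i])


target = make_target(3)
print(target)

# Made-up activations in (-1, 1); unit 7 is the most active.
outputs = [-0.9, -0.8, -0.95, 0.1, -0.7, -0.9, -0.85, 0.8, -0.6, -0.9]
print(predict(outputs))
```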

A fully connected network with enough discriminative power for the task would have far too many parameters to be able to generalize correctly. Therefore, a restricted connection scheme must be devised, guided by our prior knowledge about shape recognition. There are well-known advantages to performing shape recognition by detecting and combining local features. We have required our network to do this by constraining the connections in the first few layers to be local. In addition, if a feature detector is useful in one part of the image, it is likely to be useful in other parts of the image as well. One reason for this is that the salient features of a distorted character might be displaced slightly from their position in a typical character. One solution to this problem is to scan the input image with a single neuron that has a local receptive field and store the states of this neuron in corresponding locations in a layer called a feature map (see Figure 3). This operation is equivalent to a convolution with a small-sized kernel, followed by a squashing function. The process can be performed in parallel by implementing the feature map as a plane of neurons whose weight vectors are constrained to be equal. That is, units in a feature map are constrained to perform the same operation on different parts of the image. An interesting side-effect of this weight-sharing technique, already described in (Rumelhart, Hinton, and Williams, 1986), is to reduce the number of free parameters by a large amount since a large number of units share the same weights. In addition, a certain level of shift invariance is present in the system: shifting the input will shift the result on the feature map but will leave it unchanged otherwise. In practice, it will be necessary to have multiple feature maps, extracting different features from the same image.

For the commentary below, the passage is split into seven numbered parts. ① The opening sentence, on why a fully connected network is unsuitable. ② “Therefore, a restricted connection scheme…” through “…to be local.” ③ “In addition, if a feature detector…” through “…in a typical character.” ④ “One solution to this problem…” through “…followed by a squashing function.” ⑤ “The process can be performed in parallel…” through “…share the same weights.” ⑥ The sentence on shift invariance. ⑦ The final sentence, on multiple feature maps.

[Figure 3 of the paper; image not included]

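At first glance this long paragraph is hard going, but it is a 1989 paper and much of today's terminology had not settled yet. Here is a restatement in more modern terms: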

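Sentence ①: a fully connected network would have too many parameters and would overfit easily, so it is not used.
Sentence ②: to understand it we first need to say what “restricted” means. It means a restricted field of view: a convolution kernel only “sees” what lies inside the window it covers, so it can only process local information in the image. The opposite is an unrestricted view, as in the fully connected network of the previous sentence, where each unit is connected to every unit of the previous layer and sees everything. In short, convolution kernels are used to process local information in the image.
Sentence ③ may be harder. Take an example: suppose a kernel $k$ detects the feature “corner”. That feature can appear anywhere in the image. As the paper says, “the salient features of a distorted character might be displaced slightly from their position in a typical character”: for the same digit “7”, compared with the standard form, some people write it longer and some write it slanted, so the corner moves elsewhere. The feature therefore has to be searched for at every position, which is why the detector “is likely to be useful in other parts of the image as well”. To detect it wherever it occurs, we apply the kernel at every position of the image. That is exactly the operation described in sentence ④.
Sentence ④ also introduces two concepts. The first is the local receptive field: the region of the input that a unit responds to; “local” means it responds to only a small part of the image. The second is the feature map: the convolution results from all positions assembled into one image, i.e. simply the output of a convolutional layer. Because its content reflects an extracted feature, it is given the name “feature map”.
Sentence ⑤ introduces a very important concept: weight sharing. Within one feature map, whichever part of the image is being processed, the same kernel with the same parameters is used, so all positions “share” one kernel. Put the other way round, one kernel extracts one feature and is responsible for detecting it at every position, so it has to be shared. According to sentence ⑦, a convolutional layer usually needs to extract several features, so it normally has several kernels and outputs several feature maps, one per feature. The passage also points to two benefits. First, the number of parameters drops sharply, which matters a lot for something as compute-hungry as deep learning. Second, the computation can be parallelized, which also speeds it up considerably. Parallelization requires that the units have no data dependencies on one another; convolutions over different regions of the image are independent, so they parallelize easily (which is also a large part of why GPUs became so important for deep learning).
Sentence ⑥ describes a property of convolution that the paper calls shift invariance (in today's terminology, translation equivariance). For a deeper understanding, it is worth reading up on convolution in digital image processing. In plain words: shift the input by one pixel and the feature map shifts by one pixel, with no other change. This property contributes a lot to the robustness of the network. Convolution alone does not provide other invariances, such as rotation, scale, or affine invariance.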
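The ideas in sentences ④ to ⑥ (one shared kernel scanned over the image, a squashing function, and the shift property) can be sketched in pure Python. This is only an illustration: the kernel values and the toy image are made up, tanh is assumed as the squashing function, and, as in most deep learning code, the kernel is applied without flipping (strictly speaking a cross-correlation).

```python
import math


def feature_map(image, kernel, bias=0.0):
    """Scan one shared kernel over the image (no padding, stride 1), then squash with tanh."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    result = []
    for r in range(out_h):
        row = []
        for c in range(out_w):
            total = bias
            for i in range(kh):
                for j in range(kw):
                    total += kernel[i][j] * image[r + i][c + j]
            row.append(math.tanh(total))
        result.append(row)
    return result


def shift_right(image, n=1):
    """Shift the image n pixels to the right, filling the left edge with zeros."""
    return [[0.0] * n + row[:-n] for row in image]


# Toy 6x6 image: a vertical stroke in column 2 on an empty background.
image = [[1.0 if c == 2 else 0.0 for c in range(6)] for _ in range(6)]
# Hypothetical 3x3 kernel that responds to a vertical stroke.
kernel = [[-1.0, 2.0, -1.0] for _ in range(3)]

original = feature_map(image, kernel)
shifted = feature_map(shift_right(image), kernel)

# Apart from the first column (which depends on the fill values), each column of
# the shifted result equals the previous column of the original result.
equivariant = all(
    shifted[r][c] == original[r][c - 1]
    for r in range(len(original))
    for c in range(1, len(original[0]))
)
print(equivariant)
```

The same 9 weights are reused at all 16 output positions, which is the weight sharing of sentence ⑤, and the final check is the shift property of sentence ⑥.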

The idea of local, convolutional feature maps can be applied to subsequent hidden layers as well, to extract features of increasing complexity and abstraction. Interestingly, higher-level features require less precise coding of their location. Reduced precision is actually advantageous since a slight distortion or translation of the input will have a reduced effect on the representation. Thus, each feature extraction in our network is followed by an additional layer that performs a local averaging and subsampling, reducing the resolution of the feature map. This layer introduces a certain level of invariance to distortions and translations. A functional module of our network consists of a layer of shared-weight feature maps followed by an averaging/subsampling layer. This is reminiscent of the Neocognitron architecture (Fukushima and Miyake, 1982), with the notable difference that we use backpropagation (rather than unsupervised learning), which we feel is more appropriate for this sort of classification problem.

For the commentary below, the passage is split into four numbered parts. ① The opening sentence, on applying convolutional feature maps to later hidden layers. ② “Interestingly, higher-level features…” through “…a reduced effect on the representation.” ③ “Thus, each feature extraction…” through “…followed by an averaging/subsampling layer.” ④ The final sentence, comparing the design with the Neocognitron.

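Like the previous paragraph, this one discusses the ideas and theory behind the design and is fairly dense, so let us work through it sentence by sentence: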

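Sentence ①: the same kind of convolution can be applied to later layers to extract higher-level features. For example, the first layer might extract edges, the second corners, the third shapes, and the fully connected layer outputs the shape category. Each layer's features are built on the lower-level features output by the previous layer.
Sentence ②: as anyone who has studied CNNs knows, the output size usually shrinks as depth increases, so a single unit in a higher layer may correspond to a $3 \times 3$, $9 \times 9$, or even larger region of the original image. A slightly skewed input (a handwritten digit, say) therefore has little effect on the high-level representation.
Sentence ③: for the reason given in sentence ②, the authors follow each convolutional layer with an extra layer that performs local averaging and subsampling. It simply takes the mean of each region, which lowers the resolution; small deformations or shifts inside the region that maps to one output unit then no longer matter, because they average out to the same value. This is the well-known pooling layer. Besides “pond”, the word “pool” also means “to combine”: the layer pools the information of a region together. (The standard Chinese term 池化层 follows the “pond” sense; 聚合层, “aggregation layer”, would arguably be closer in meaning.)
Sentence ④ compares this design with the Neocognitron, in which each stage consists of an S-layer and a C-layer, much like the convolution + pooling pair. This paper, however, uses supervised learning trained with backpropagation; labels are easy to obtain for digit classification, and training this way is more efficient and accurate.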
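A minimal sketch of the averaging/subsampling step discussed in sentence ③: non-overlapping 2 by 2 cells are averaged, halving the resolution. It implements plain averaging only; any trainable scaling, bias, or squashing the original layer may apply is left out, and the toy feature maps are made up.

```python
def average_pool(fmap, size=2):
    """Average non-overlapping size x size cells of a 2-D feature map."""
    pooled = []
    for r in range(0, len(fmap) - size + 1, size):
        row = []
        for c in range(0, len(fmap[0]) - size + 1, size):
            cell = [fmap[r + i][c + j] for i in range(size) for j in range(size)]
            row.append(sum(cell) / len(cell))
        pooled.append(row)
    return pooled


# Two 4x4 feature maps whose single active unit sits at different
# positions inside the same 2x2 cell.
a = [[0.0] * 4 for _ in range(4)]
b = [[0.0] * 4 for _ in range(4)]
a[0][0] = 1.0
b[1][1] = 1.0

print(average_pool(a))
print(average_pool(b))
```

Both maps pool to the same result, so a one-pixel displacement inside a cell is invisible to the next layer. A displacement that crosses a cell boundary does change the output, which is why the paper speaks of "a certain level" of invariance.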

The network architecture, represented in Figure 4, is a direct extension of the ones described in (Le Cun, 1989; Le Cun et al., 1990a). The network has four hidden layers respectively named $H_1$ , $H_2$ , $H_3$ , and $H_4$ . Layers $H_1$ and $H_3$ are shared-weight feature extractors, while $H_2$ and $H_4$ are averaging/subsampling layers.

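In short: the architecture (Figure 4) directly extends those described in (Le Cun, 1989; Le Cun et al., 1990a). It has four hidden layers, $H_1$ to $H_4$: $H_1$ and $H_3$ are shared-weight feature extractors, while $H_2$ and $H_4$ are averaging/subsampling layers.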

[Figure 4 of the paper: network architecture; image not included]

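From here on the paper describes the network structure. I will give an overall summary of the structure at the end, and in the next few paragraphs I only comment on the confusing parts. The network has six layers in total: one input layer, four hidden layers, and one output layer. It is a small network; an ordinary laptop can train it quickly today on the CPU alone.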

Although the size of the active part of the input is 16 by 16, the actual input is a 28 by 28 plane to avoid problems when a kernel overlaps a boundary. $H_1$ is composed of 4 groups of 576 units arranged as 4 independent 24 by 24 feature maps. These four feature maps are designated by $H_{1.1}$ , $H_{1.2}$ , $H_{1.3}$ , and $H_{1.4}$ . Each unit in a feature map takes its input from a 5 by 5 neighborhood on the input plane. As described above, corresponding connections on each unit in a given feature map are constrained to have the same weight. In other words, all of the 576 units in $H_{1.1}$ use the same set of 26 weights (including the bias). Units in another map (say $H_{1.4}$ ) share another set of 26 weights.

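In short: although the active part of the input is $16 \times 16$, the actual input is a $28 \times 28$ plane, to avoid problems when a kernel overlaps a boundary. $H_1$ consists of 4 independent $24 \times 24$ feature maps ($H_{1.1}$ to $H_{1.4}$) of 576 units each, and every unit takes its input from a $5 \times 5$ neighborhood of the input plane. ① Within a map the weights are shared: all 576 units of $H_{1.1}$ use the same 26 weights (bias included), and the units of another map such as $H_{1.4}$ share a different set of 26.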

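The paper does not pad inside the layers; instead it reserves enough room up front, which amounts to padding once at the input layer. Anyone who has studied CNNs will recognize the idea. The interesting question is why the number is 28. $H_1$ convolves with a $5 \times 5$ kernel, so the first position where the kernel can be centered is $(2, 2)$ (counting from 0); the first two and last two rows and columns produce no output, so 4 pixels of border must be reserved. $H_3$ needs another 4 pixels, but because $H_2$ has subsampled by 2 in between, those 4 pixels correspond to 8 pixels of the original input. In total $4 + 8 = 12$ pixels must be reserved so that no information from the digit region is cut off. With a $16 \times 16$ digit, the input plane therefore becomes $28 \times 28$.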

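A word on the parameter count. $H_1$ has four kernels, and each kernel has $5 \times 5 = 25$ weights. A bias term $b$ is normally added after the convolution, so each feature map needs 26 parameters. Parameters are shared only within a map, not across maps, so $H_1$ needs $26 \times 4 = 104$ parameters in total (which also explains sentence ①).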
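The size bookkeeping above can be traced in a few lines: each unpadded $5 \times 5$ convolution removes 4 pixels per dimension and each subsampling layer halves the size. The helper names are hypothetical; the layer sequence is the one described in the paper.

```python
def conv_out(size, kernel):
    """Output side length of an unpadded, stride-1 convolution."""
    return size - kernel + 1


def pool_out(size, window):
    """Output side length of non-overlapping subsampling."""
    return size // window


layers = [("conv", 5), ("pool", 2), ("conv", 5), ("pool", 2)]  # H1, H2, H3, H4

size = 28
sizes = [size]
for kind, k in layers:
    size = conv_out(size, k) if kind == "conv" else pool_out(size, k)
    sizes.append(size)
print(sizes)

# Border needed at input resolution: 4 pixels for H1, plus 4 pixels for H3
# that count double because H2 has halved the resolution in between.
margin = (5 - 1) + 2 * (5 - 1)
print(16 + margin)
```

The trace gives side lengths 28, 24, 12, 8, 4, matching the plane sizes quoted for the input and for $H_1$ to $H_4$, and the margin of 12 pixels added to the 16-pixel digit gives the 28-pixel input plane.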

Layer $H_2$ is the averaging/subsampling layer. It is composed of 4 planes of size 12 by 12. Each unit in one of these planes takes inputs from 4 units on the corresponding plane in $H_1$ . Receptive fields do not overlap. All the weights are constrained to be equal, even within a single unit. Therefore, $H_2$ performs a local averaging and a 2-to-1 subsampling of $H_1$ in each direction.

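In short: $H_2$ is the averaging/subsampling layer, made of 4 planes of size $12 \times 12$. Each unit takes its input from 4 units of the corresponding $H_1$ plane, receptive fields do not overlap, and all weights are constrained to be equal, even within a single unit. $H_2$ thus performs a local averaging and a 2-to-1 subsampling of $H_1$ in each direction.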

Put simply, the input map is divided into $2 \times 2$ cells, each cell is averaged, and the output is a new map with half the height and half the width.

Layer $H_3$ is composed of 12 feature maps. Each feature map contains 64 units arranged in an 8 by 8 plane. These feature maps are designated as $H_{3.1}$ , $H_{3.2}$ , … $H_{3.12}$ . The connection scheme between $H_2$ and $H_3$ is similar to the one between the input and $H_1$ , but slightly more complicated because $H_3$ has multiple 2-D maps. Each unit’s receptive field is composed of one or two 5 by 5 neighborhoods centered around units that are at identical positions within each $H_2$ map. All units in a given map are constrained to have identical weight vectors. The maps in $H_2$ on which a map in $H_3$ takes its inputs are chosen according to a scheme described in Table 1. According to this scheme, the network is composed of two almost independent modules. Layer $H_4$ plays the same role as layer $H_2$ ; it is composed of 12 groups of 16 units arranged in 4 by 4 planes.

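In short: $H_3$ has 12 feature maps ($H_{3.1}$ to $H_{3.12}$), each an $8 \times 8$ plane of 64 units. The connection scheme from $H_2$ to $H_3$ resembles the one from the input to $H_1$ but is slightly more complicated, because several 2-D maps are involved: each $H_3$ unit's receptive field is made of one or two $5 \times 5$ neighborhoods, centered at the same position in the chosen $H_2$ maps. All units in one map share the same weight vector. Table 1 specifies which $H_2$ maps feed each $H_3$ map, and under this scheme the network splits into two almost independent modules. $H_4$ plays the same role as $H_2$ and consists of 12 planes of $4 \times 4 = 16$ units.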

	$H_{3.1}$	$H_{3.2}$	$H_{3.3}$	$H_{3.4}$	$H_{3.5}$	$H_{3.6}$	$H_{3.7}$	$H_{3.8}$	$H_{3.9}$	$H_{3.10}$	$H_{3.11}$	$H_{3.12}$
$H_{2.1}$	×	×	×		×	×
$H_{2.2}$		×	×	×	×	×
$H_{2.3}$							×	×	×		×	×
$H_{2.4}$								×	×	×	×	×

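The paper's description of the parameters does not follow today's conventions. In modern terms: the input is $12 \times 12 \times 4$; each of the 12 output maps uses a kernel of size $5 \times 5 \times d$, where the depth $d$ (1 or 2) is determined by the table; the stride is 1 (inferred) with no padding; and the output is $8 \times 8 \times 12$.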

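One setting in this layer looks odd today: the convolution from $H_2$ to $H_3$ is not fully connected across channels but selective. It can be seen as a way of grouping and combining features: a given high-level feature may need only some of the low-level features, so not every low-level map has to take part in every output map. The choice builds prior knowledge into the network and was probably an empirical combination found by experiment. A clear benefit is lower computational cost, but as hardware improved the technique fell out of common use. The connections in the table are visualized below: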

[Image: visualization of the $H_2$-to-$H_3$ connections in Table 1; not included]

The output layer has 10 units and is fully connected to $H_4$ . In summary, the network has 4635 units, 98,442 connections, and 2578 independent parameters. This architecture was derived using the Optimal Brain Damage technique (Le Cun et al., 1990b) starting from a previous architecture (Le Cun et al., 1990a) that had 4 times more free parameters.

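In short: the output layer has 10 units and is fully connected to $H_4$. Overall the network has 4635 units, 98,442 connections, and 2578 independent parameters. The architecture was derived with the Optimal Brain Damage technique (Le Cun et al., 1990b) from an earlier architecture (Le Cun et al., 1990a) that had 4 times more free parameters.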

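As described above, the output of $H_4$ has size $4 \times 4 \times 12$, i.e. 192 units; all of them are fully connected to the 10 units of the last layer, which outputs a 10-dimensional vector.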
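The totals quoted in the paper (4635 units, 98,442 connections, 2578 independent parameters) can be reproduced from the layer descriptions and Table 1. Two points in the sketch below are assumptions, not statements from the paper: each averaging/subsampling map is given one shared trainable weight plus one bias, and a single constant bias unit is counted among the units. They are chosen because they make the arithmetic match the quoted totals.

```python
# Table 1: which H2 maps (1 to 4) feed each H3 map (1 to 12).
H3_INPUTS = {
    1: [1], 2: [1, 2], 3: [1, 2], 4: [2], 5: [1, 2], 6: [1, 2],
    7: [3], 8: [3, 4], 9: [3, 4], 10: [4], 11: [3, 4], 12: [3, 4],
}
n_kernels = sum(len(sources) for sources in H3_INPUTS.values())  # 20 kernels of 5x5

units = {
    "input": 28 * 28,
    "H1": 4 * 24 * 24,
    "H2": 4 * 12 * 12,
    "H3": 12 * 8 * 8,
    "H4": 12 * 4 * 4,
    "output": 10,
}

params = {
    "H1": 4 * (25 + 1),            # one 5x5 kernel plus a bias per map
    "H2": 4 * (1 + 1),             # assumption: one shared weight plus a bias per map
    "H3": n_kernels * 25 + 12,     # 20 kernels plus one bias per map
    "H4": 12 * (1 + 1),            # assumption: same as H2
    "output": 10 * (units["H4"] + 1),
}

connections = {
    "H1": units["H1"] * (25 + 1),          # each unit: 25 inputs plus bias
    "H2": units["H2"] * (4 + 1),           # each unit: 4 inputs plus bias
    "H3": 64 * (n_kernels * 25 + 12),      # 64 positions per map
    "H4": units["H4"] * (4 + 1),
    "output": 10 * (units["H4"] + 1),
}

total_units = sum(units.values()) + 1      # assumption: plus one constant bias unit
total_params = sum(params.values())
total_connections = sum(connections.values())
print(total_units, total_connections, total_params)
```

Note how the output layer alone accounts for 1930 of the 2578 parameters, while the two convolutional layers, which hold most of the connections, need only 616 between them.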

That completes the walk through the network structure. We can now summarize it:

Layer	Input size	Kernel size	Padding	Stride	Output size	Notes
Input	-	-	-	-	$28 \times 28 \times 1$	Active region: $16 \times 16$
$H_1$	$28 \times 28 \times 1$	$5 \times 5$, 4 kernels	None	1	$24 \times 24 \times 4$	-
$H_2$	$24 \times 24 \times 4$	$2 \times 2$ averaging	None	2	$12 \times 12 \times 4$	Non-overlapping receptive fields, equal weights
$H_3$	$12 \times 12 \times 4$	$5 \times 5$, 20 kernels	None	1	$8 \times 8 \times 12$	20 kernels because Table 1 has 20 connections
$H_4$	$8 \times 8 \times 12$	$2 \times 2$ averaging	None	2	$4 \times 4 \times 12$	Same role as $H_2$
Output	$4 \times 4 \times 12$	-	-	-	10	Fully connected

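If you want to implement a network yourself, I do not recommend this exact model; look for a refined LeNet variant to practice on instead. The paper does not state the activation function explicitly (it only mentions a “squashing function”), but judging from implementations available online it was most likely tanh.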

5. Results

After 30 training passes, the error rate on the training set (7291 handwritten plus 2549 printed digits) was 1.1%, and the MSE was 0.017. On the whole test set (2007 handwritten plus 700 printed characters), the error rate was 3.4%, and the MSE was 0.024. All the classification errors occurred on handwritten characters.

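In short: after 30 training passes the training set (7291 handwritten plus 2549 printed digits) had an error rate of 1.1% and an MSE of 0.017. On the whole test set (2007 handwritten plus 700 printed characters) the error rate was 3.4% and the MSE 0.024, with every classification error occurring on handwritten characters.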

In a realistic application, the user is not so much interested in the raw error rate as in the number of rejections necessary to reach a given level of accuracy. In our case, we measured the percentage of test patterns that must be rejected in order to achieve a 1% error rate. Our rejection criterion was based on three conditions: the activity level of the most-active output unit should be larger than a given threshold $t_1$ , the activity level of the second most-active unit should be smaller than a given threshold $t_2$ , and finally, the difference between the activity levels of these two units should be larger than a given threshold $t_d$ . The best percentage of rejections on the complete test set was 5.7% for a 1% error rate. On the handwritten set only, the result was 9% rejections for a 1% error rate. It should be emphasized that the rejection thresholds were obtained using performance measures on the test set. About half the substitution errors in the testing set were due to faulty segmentation, and an additional quarter were due to erroneous assignment of the desired category. Some of the remaining images were ambiguous even to humans, and in a few cases, the network misclassified the image for no discernible reason.

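In short: in practice users care less about the raw error rate than about how many inputs must be rejected to reach a given accuracy, so the authors measured the percentage of test patterns that had to be rejected to get a 1% error rate. Their rejection criterion has three conditions: the most active output must exceed a threshold $t_1$, the second most active output must be below a threshold $t_2$, and the difference between the two must exceed a threshold $t_d$. The best result was 5.7% rejections on the complete test set and 9% on the handwritten set alone, each for a 1% error rate. The thresholds were tuned on the test set itself. About half of the substitution errors came from faulty segmentation and another quarter from wrongly assigned labels; some of the remaining images were ambiguous even to humans, and in a few cases the network erred for no discernible reason.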
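The three-condition rejection rule can be sketched as a small function. The function name, the threshold values, and the example activations are all hypothetical; the paper does not report the thresholds it used.

```python
def classify_or_reject(outputs, t1, t2, td):
    """Return the predicted class, or None if the pattern is rejected.

    Accept only if the most active unit exceeds t1, the runner-up is below t2,
    and the gap between the two exceeds td.
    """
    ranked = sorted(range(len(outputs)), key=lambda i: outputs[i], reverse=True)
    best, second = ranked[0], ranked[1]
    accepted = (
        outputs[best] > t1
        and outputs[second] < t2
        and outputs[best] - outputs[second] > td
    )
    return best if accepted else None


# Hypothetical thresholds, for illustration only.
T1, T2, TD = 0.5, 0.0, 0.6

clear = [-0.9] * 10
clear[3] = 0.95           # one clearly dominant unit

ambiguous = [-0.9] * 10
ambiguous[3] = 0.6
ambiguous[8] = 0.4        # a strong runner-up, e.g. a 3 that looks like an 8

print(classify_or_reject(clear, T1, T2, TD))
print(classify_or_reject(ambiguous, T1, T2, TD))
```

Raising the thresholds rejects more patterns and lowers the error rate on the rest, which is the trade-off behind the "5.7% rejections for a 1% error rate" figure.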

Even though a second-order version of back-propagation was used, it is interesting to note that the learning took only 30 passes through the training set. We think this can be attributed to the large amount of redundancy present in real data. A complete training session (30 passes through the training set plus test) takes about 3 days on a SUN SPARCstation 1 using the SN2 connectionist simulator (Bottou and Le Cun, 1989).

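In short: although a second-order version of back-propagation was used, the authors find it notable that learning took only 30 passes through the training set, which they attribute to the large redundancy in real data. A complete training session (30 passes plus testing) took about 3 days on a SUN SPARCstation 1 with the SN2 connectionist simulator (Bottou and Le Cun, 1989).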

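Today we mostly update weights using first-order derivatives, but this paper used a second-order version of backpropagation. I will not go into it here, since the topic is too involved for this article.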

After successful training, the network was implemented on a commercial Digital Signal Processor board containing an AT&T DSP-32C general-purpose DSP chip with a peak performance of 12.5 million multiply-add operations per second on 32-bit floating-point numbers. The DSP operates as a coprocessor in a PC connected to a video camera. The PC performs the digitization, binarization, and segmentation of the image, while the DSP performs the size normalization and the classification. The overall throughput of the digit recognizer, including image acquisition, is 10 to 12 classifications per second and is limited mainly by the normalization step. On normalized digits, the DSP performs more than 30 classifications per second.

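In short: after training, the network was implemented on a commercial DSP board with an AT&T DSP-32C chip (peak performance of 12.5 million multiply-add operations per second on 32-bit floating-point numbers). The DSP works as a coprocessor in a PC connected to a video camera: the PC does digitization, binarization, and segmentation, while the DSP does size normalization and classification. Overall throughput, including image acquisition, is 10 to 12 classifications per second, limited mainly by normalization; on already normalized digits the DSP exceeds 30 classifications per second.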

This section reports the results of training and inference, and I do not think anything needs special explanation. In summary:

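Training set: error rate 1.1%, MSE 0.017
Test set: error rate 3.4%, MSE 0.024; all classification errors were on handwritten characters
Whole test set: the best rejection rate for a 1% error rate was 5.7%
Handwritten digits only: 9% rejections for a 1% error rate
Training took 30 passes through the training set, about 3 days in total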

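The authors also deployed the network on a DSP board, showing that it could run in a practical setting. A more complete, production-oriented system was described by LeCun and colleagues around 1998.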

6. Conclusion

Back-propagation learning was successfully applied to a large, real-world task. Our results appear to be at the state of the art in handwritten digit recognition. The network had many connections but relatively few free parameters. The network architecture and the constraints on the weights were designed to incorporate geometric knowledge about the task into the system. Because of its architecture, the network could be trained on a low-level representation of data that had minimal preprocessing (as opposed to elaborate feature extraction). Due to the redundant nature of the data and the constraints imposed on the network, the learning time was relatively short considering the size of the training set. Scaling properties were far better than one would expect just from extrapolating results of back-propagation on smaller, artificial problems. Preliminary results on alphanumeric characters show that the method can be directly extended to larger tasks.

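In short: back-propagation was successfully applied to a large real-world task, with results that appear to be state of the art in handwritten digit recognition. The network has many connections but relatively few free parameters, and its architecture and weight constraints build geometric knowledge about the task into the system. As a result it could be trained on a low-level representation with minimal preprocessing instead of elaborate feature extraction. Thanks to the redundancy of the data and the constraints on the network, learning time was relatively short for the size of the training set, and scaling was far better than extrapolation from smaller, artificial problems would suggest. Preliminary results on alphanumeric characters indicate that the method extends directly to larger tasks.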

The final network of connections and weights obtained by back-propagation learning was readily implementable on commercial digital signal processing hardware. Throughput rates, from camera to classified image, of more than ten digits per second were obtained.

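In short: the trained network was readily implementable on commercial digital signal processing hardware, reaching more than ten digits per second from camera to classified image.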

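There is not much to explain here either; it is a summary of the whole paper. The paper introduced what is now called LeNet-1, an epoch-making innovation and an important milestone in the history of deep learning. The success of convolution plus backpropagation on handwritten digits pushed image recognition forward and laid groundwork for the broad use of deep learning. It demonstrated the potential of neural networks on complex tasks and made automatic feature extraction practical. Later CNNs differ a great deal from the original LeNet, but they build on the same ideas. LeCun kept refining LeNet after this paper and presented the classic LeNet-5 in 1998, which is the version people usually mean by “LeNet”.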

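Regrettably, for roughly two decades after this paper neural networks were too far ahead of their time to become mainstream; during that period many SVM-based methods matched or even exceeded LeNet's performance. That changed with AlexNet's decisive comeback in 2012 (a reading of that paper is planned). The foresight and persistence of LeCun, Hinton, and others finally had their moment!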