文献翻译：Backpropagation Applied to Handwritten Zip Code Recognition

最新推荐文章于 2022-08-13 22:20:44 发布

super鹿

文章标签：神经网络、深度学习、计算机视觉、算法

原文链接：https://www.mitpressjournals.org/doi/abs/10.1162/neco.1989.1.4.541

Abstract：The ability of learning networks to generalize can be greatly enhanced by providing constraints from the task domain. This paper demonstrates how such constraints can be integrated into a backpropagation network through the architecture of the network. This approach has been successfully applied to the recognition of handwritten zip code digits provided by the U.S. Postal Service. A single network learns the entire recognition operation, going from the normalized image of the character to the final classification.

译文：通过提供来自任务域的约束，可以大大增强学习网络的泛化能力。本文介绍了如何通过网络架构将这些约束集成到一个反向传播网络中。该方法已成功应用于美国邮政总局提供的手写邮政编码数字识别。单个网络学习整个识别操作，从字符的规格化图像到最终的分类。

1. Introduction

Previous work performed on recognizing sample digit images (LeCun 1989) showed that good generalization on complex tasks can be obtained by designing a network architecture that contains a certain amount of a priori knowledge about the task. The basic design principle is to reduce the number of free parameters in the network as much as possible without overly reducing its computational power. Application of this principle increases the probability of correct generalization because it results in a specialized network architecture that has a reduced entropy (Denker et al. 1987; Patarnello and Carnevali 1987; Tishby et al. 1989; LeCun 1989), and a reduced Vapnik-Chervonenkis dimensionality (Baum and Haussler 1989).

译文：以前对样本数字图像的识别工作(LeCun 1989)表明，通过设计一个包含一定数量的任务先验知识的网络体系结构，可以获得对复杂任务的良好泛化。基本的设计原则是在不过度降低计算能力的前提下，尽可能减少网络中自由参数的数量。这一原则的应用增加了正确泛化的可能性，因为它产生了一种熵减小(Denker等，1987；Patarnello和Carnevali 1987；Tishby等，1989；LeCun 1989)且Vapnik-Chervonenkis维数降低(Baum and Haussler 1989)的专用网络体系结构。

In this paper, we apply the backpropagation algorithm (Rumelhart et al. 1986) to a real-world problem in recognizing handwritten digits taken from the U.S. Mail. Unlike previous results reported by our group on the problem (Denker et al. 1989), the learning network is directly fed with images, rather than feature vectors, thus demonstrating the ability of backpropagation networks to deal with large amounts of low-level information.

译文：在本文中，我们将反向传播算法(Rumelhart et al . 1986)应用于识别从美国邮件中提取的手写数字的实际问题。不同于我们小组之前对该问题的研究结果(Denker et al 1989)，学习网络输入的直接是图像而不是特征向量，从而证明了反向传播网络处理大量低级信息的能力。

2. Zip Codes

2.1 Data Base. The data base used to train and test the network consists of 9298 segmented numerals digitized from handwritten zip codes that appeared on U.S. mail passing through the Buffalo, NY post office. Examples of such images are shown in Figure 1. The digits were written by many different people, using a great variety of sizes, writing styles, and instruments, with widely varying amounts of care; 7291 examples are used for training the network and 2007 are used for testing the generalization performance. One important feature of this data base is that both the training set and the testing set contain numerous examples that are ambiguous, unclassifiable, or even misclassified.

译文：2.1数据库。用于训练和测试网络的数据库由9298个分段数字组成，这些分段数字是从通过纽约州布法罗邮局的美国邮件中出现的手写邮政编码数字化的。这种图像的示例如图1所示。数字是由许多不同的人写的，使用的大小、书写风格和工具各不相同，而且小心程度也大不相同。 7291个示例用于训练网络，而2007个示例用于测试泛化性能。该数据库的一个重要特征是训练集和测试集都包含许多模棱两可，无法分类甚至是错误分类的示例。

Figure 1 Examples of original zip codes (top) and normalized digits from the testing set (bottom)

2.2 Preprocessing. Locating the zip code on the envelope and separating each digit from its neighbours, a very hard task in itself, was performed by Postal Service contractors (Wang and Srihari 1988). At this point, the size of a digit image varies but is typically around 40 by 60 pixels. A linear transformation is then applied to make the image fit in a 16 by 16 pixel image. This transformation preserves the aspect ratio of the character, and is performed after extraneous marks in the image have been removed. Because of the linear transformation, the resulting image is not binary but has multiple gray levels, since a variable number of pixels in the original image can fall into a given pixel in the target image. The gray levels of each image are scaled and translated to fall within the range -1 to 1.

译文：2.2预处理。在邮政信封上找到邮政编码并将每个数字与邻居分开，这本身就是一项艰巨的任务，由邮政服务承包商完成（Wang and Srihari 1988）。此时，数字图像的大小有所变化，但通常约为40 x 60像素。然后应用线性变换以使图像适合16 x 16像素的图像。此转换将保留字符的长宽比，并在图像中的多余标记被删除后执行。由于线性变换，所得图像不是二进制而是多个灰度级，因为原始图像中可变数量的像素可以落入目标图像中的给定像素。缩放并转换每个图像的灰度级，使其落在-1到1的范围内。
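The final scaling step can be sketched in a few lines of Python. This is only an illustration under assumptions: the paper does not say which linear map was used, so the sketch assumes that the darkest pixel of each image goes to -1 and the brightest to +1. The function name `scale_gray_levels` and the handling of a constant image are hypothetical.

```python
def scale_gray_levels(image):
    """Linearly map the gray levels of a 2-D image (list of rows) into [-1, 1].

    Assumption: per-image min/max scaling; the paper only states the target range.
    """
    flat = [p for row in image for p in row]
    lo, hi = min(flat), max(flat)
    if hi == lo:
        # Constant image: nothing to stretch, so map every pixel to -1
        # (an arbitrary choice made for this sketch).
        return [[-1.0 for _ in row] for row in image]
    scale = 2.0 / (hi - lo)
    # Translate so the minimum sits at 0, scale to width 2, then shift to [-1, 1].
    return [[(p - lo) * scale - 1.0 for p in row] for row in image]


image = [[0, 64], [128, 255]]
scaled = scale_gray_levels(image)
```

After this step the darkest pixel of `image` maps to -1 and the brightest to (approximately, up to floating-point rounding) +1, with the other gray levels spread linearly in between.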

3. Network Design

3.1 Input and Output. The remainder of the recognition is entirely performed by a multilayer network. All of the connections in the network are adaptive, although heavily constrained, and are trained using backpropagation. This is in contrast with earlier work (Denker et al. 1989) where the first few layers of connections were hand-chosen constants implemented on a neural-network chip. The input of the network is a 16 by 16 normalized image. The output is composed of 10 units (one per class) and uses place coding.

译文：输入和输出。识别的其余部分完全由一个多层网络来执行。网络中的所有连接都是自适应的，尽管有很大的限制，并且使用反向传播进行训练。这与早期的工作(Denker et al.1989)形成了对比，在早期的工作中，最初的几层连接是手工选择的常数，在神经网络芯片上实现。网络的输入是一个16×16的归一化图像。输出由10个单元组成(每个类一个单元)，并使用位置编码。

3.2 Feature Maps and Weight Sharing. Classical work in visual pattern recognition has demonstrated the advantage of extracting local features and combining them to form higher order features. Such knowledge can be easily built into the network by forcing the hidden units to combine only local sources of information. Distinctive features of an object can appear at various locations on the input image. Therefore it seems judicious to have a set of feature detectors that can detect a particular instance of a feature anywhere on the input plane. Since the precise location of a feature is not relevant to the classification, we can afford to lose some position information in the process. Nevertheless, approximate position information must be preserved, to allow the next levels to detect higher order, more complex features (Fukushima 1980; Mozer 1987).

译文：3.2特征图和权重共享。经典的视觉模式识别研究已经证明了提取局部特征并将其结合形成高阶特征的优点。通过迫使隐藏单元仅结合局部信息源，可以轻松地将此类知识构建到网络中。对象的特征可以出现在输入图像的不同位置。因此，拥有一组可以检测输入平面上任何地方的特定特征实例的特征检测器似乎是明智的。由于特征的精确位置与分类无关，我们可能会在这个过程中丢失一些位置信息。然而，必须保留近似的位置信息，以便下一级别能够检测到更高级、更复杂的特征(Fukushima 1980;Mozer 1987)

The detection of a particular feature at any location on the input can be easily done using the "weight sharing" technique. Weight sharing was described in Rumelhart et al. (1986) for the so-called T-C problem and consists in having several connections (links) controlled by a single parameter (weight). It can be interpreted as imposing equality constraints among the connection strengths. This technique can be implemented with very little computational overhead.

译文：使用“权重共享”技术可以很容易地在输入的任何位置检测特定的特征。Rumelhart等人(1986)针对所谓的T-C问题描述了权重共享，它让多个连接由单个参数(权重)控制。它可以解释为在连接强度之间施加相等约束。这种技术的实现只需要很少的计算开销。

Weight sharing not only greatly reduces the number of free parameters in the network but also can express information about the geometry and topology of the task. In our case, the first hidden layer is composed of several planes that we call feature maps. All units in a plane share the same set of weights, thereby detecting the same feature at different locations. Since the exact position of the feature is not important, the feature maps need not have as many units as the input.

译文：权值共享不仅大大减少了网络中自由参数的数量，而且可以表示任务的几何和拓扑信息。在我们的例子中，第一个隐藏层由几个面组成，我们称之为特征图。一个平面上的所有单元共享同一组权值，从而在不同的位置检测相同的特征。由于特性的确切位置并不重要，因此特征图不需要像输入一样有很多单元。

3.3 Network Architecture. The network is represented in Figure 2. Its architecture is a direct extension of the one proposed in LeCun (1989). The network has three hidden layers named H1, H2, and H3, respectively. Connections entering H1 and H2 are local and are heavily constrained.

译文：网络如图2所示。它的架构是对LeCun(1989)提出的架构的直接扩展。网络有三个隐藏层，分别命名为H1、H2和H3。输入H1和H2的连接是局部的，并且受到很大的限制。

Figure 2 Network architecture

H1 is composed of 12 groups of 64 units arranged as 12 independent 8 by 8 feature maps. These 12 feature maps will be designated by H1.1, H1.2, ..., H1.12. Each unit in a feature map takes input on a 5 by 5 neighbourhood on the input plane. For units in layer H1 that are one unit apart, their receptive fields (in the input layer) are two pixels apart. Thus, the input image is undersampled and some position information is eliminated. A similar two-to-one undersampling occurs going from layer H1 to H2. The motivation is that high resolution may be needed to detect the presence of a feature, while its exact position need not be determined with equally high precision.

译文：H1由12组、每组64个单元组成，排列为12个独立的8 x 8特征图。这12个特征图将由H1.1，H1.2，...，H1.12指定。特征图中的每个单元在输入平面上的5 x 5邻域中接受输入。对于层H1中相隔一个单元的单元，它们的感受野（在输入层中）相隔两个像素。因此，输入图像被欠采样，一些位置信息被消除。从层H1到H2也发生类似的二对一欠采样。其动机是：可能需要高分辨率来检测特征的存在，而无需以同样高的精度确定其确切位置。

It is also known that the kinds of features that are important at one place in the image are likely to be important in other places. Therefore, corresponding connections on each unit in a given feature map are constrained to have the same weights. Each unit performs the same operation on corresponding parts of the image. The function performed by a feature map can thus be interpreted as a nonlinear subsampled convolution with a 5 by 5 kernel.

译文：我们也知道，在图像的某个地方很重要的特征，可能在其他地方也很重要。因此，在给定的特征图中，每个单元对应的连接被约束为具有相同的权值。每个单元对图像的相应部分执行相同的操作。因此，由特征图执行的函数可以解释为与5×5核的非线性次采样卷积。

Of course, units in another map (say H1.4) share another set of 25 weights. Units do not share their biases (thresholds). Each unit thus has 25 input lines plus a bias. Connections extending past the boundaries of the input plane take their input from a virtual background plane whose state is equal to a constant, predetermined background level, in our case -1. Thus, layer H1 comprises 768 units (8 by 8 times 12), 19968 connections (768 times 26), but only 1068 free parameters (768 biases plus 25 times 12 feature kernels) since many connections share the same weight.

译文：当然，另一个特征图(比如H1.4)中的单元共享另一组25个权重。单元之间不共享偏差(阈值)。因此每个单元有25条输入线加上一个偏差。超越输入平面边界的连接从虚拟背景平面获取它们的输入，虚拟背景平面的状态等于一个预先确定的常量背景级别，在我们的案例中为-1。因此，H1层包含768个单元(8×8乘以12)，19968个连接(768乘以26)，但只有1068个自由参数(768个偏差加上25乘以12个特征内核)，因为许多连接共享相同的权重。
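The computation performed by one H1 feature map (a shared 5 by 5 kernel, one bias per unit, receptive fields two pixels apart, and a background level of -1 outside the image) can be sketched in plain Python. Several details are assumptions made for this sketch and are not stated in the text: the receptive fields are centred on every second pixel, the nonlinearity is a plain `tanh` rather than the scaled hyperbolic tangent used in the paper, and the names `feature_map`, `kernel`, and `biases` are hypothetical.

```python
import math


def feature_map(image, kernel, biases, stride=2, background=-1.0):
    """Nonlinear subsampled convolution of a square image with a shared kernel.

    image:  size x size list of rows (16 x 16 for H1)
    kernel: k x k shared weights (5 x 5 for H1)
    biases: one bias per output unit (8 x 8 for H1); biases are not shared
    """
    size, k = len(image), len(kernel)
    pad = k // 2                 # assumed: receptive fields centred on the unit
    out = size // stride         # 16 // 2 = 8 units per side

    def pixel(r, c):
        # Connections past the image border read the constant background level.
        if 0 <= r < size and 0 <= c < size:
            return image[r][c]
        return background

    result = []
    for i in range(out):
        row = []
        for j in range(out):
            total = biases[i][j]
            for a in range(k):
                for b in range(k):
                    total += kernel[a][b] * pixel(i * stride + a - pad,
                                                  j * stride + b - pad)
            row.append(math.tanh(total))   # plain tanh stands in for the scaled one
        result.append(row)
    return result


image = [[-1.0] * 16 for _ in range(16)]    # blank (all-background) input
kernel = [[0.1] * 5 for _ in range(5)]      # 25 shared weights
biases = [[0.0] * 8 for _ in range(8)]      # 64 unshared biases
h1_map = feature_map(image, kernel, biases)
print(len(h1_map), len(h1_map[0]))          # 8 8
```

Every unit applies the same 25 weights, so one map costs 25 free weights plus 64 biases regardless of how many connections it has, which is the source of the gap between 19968 connections and 1068 free parameters in H1.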

Layer H2 is also composed of 12 feature maps. Each feature map contains 16 units arranged in a 4 by 4 plane. As before, these feature maps will be designated as H2.1, H2.2, ..., H2.12. The connection scheme between H1 and H2 is quite similar to the one between the input and H1, but slightly more complicated because H1 has multiple two-dimensional maps. Each unit in H2 combines local information coming from 8 of the 12 different feature maps in H1. Its receptive field is composed of eight 5 by 5 neighborhoods centered around units that are at identical positions within each of the eight maps. Thus, a unit in H2 has 200 inputs, 200 weights, and a bias. Once again, all units in a given map are constrained to have identical weight vectors. The eight maps in H1 on which a map in H2 takes its inputs are chosen according to a scheme that will not be described here. Connections falling off the boundaries are treated as they are in H1. To summarize, layer H2 contains 192 units (12 times 4 by 4) and there is a total of 38592 connections between layers H1 and H2 (192 units times 201 input lines). All these connections are controlled by only 2592 free parameters (12 feature maps times 200 weights plus 192 biases).

译文：层H2也由12个特征图组成。每个特征图包含16个单元，按4×4的平面排列。和以前一样，这些特征图将被指定为H2.1、H2.2、...、H2.12。H1和H2之间的连接方案与输入和H1之间的连接方案非常相似，但是稍微复杂一些，因为H1有多个二维特征图。H2中的每个单元结合来自H1中的12个不同特征图中的8个的局部信息。它的感受野是由8个5乘5的邻域组成的，这些邻域以在8张特征图上的相同位置的单元为中心。因此，H2中的一个单元有200个输入、200个权重和一个偏差。同样，给定特征图中的所有单元都被约束为具有相同的权重向量。H2中的一个特征图从H1中的哪8个特征图获取输入，是根据这里不进行描述的方案选择的。超出边界的连接将像在H1中一样对待。总而言之，H2层包含192个单元(12 * 4 * 4)，H1层和H2层之间总共有38592个连接(192个单元乘以201个输入线)。所有这些连接仅由2592个自由参数控制(12个特征图乘以200个权重加上192个偏差)。

Layer H3 has 30 units, and is fully connected to H2. The number of connections between H2 and H3 is thus 5790 (30 times 192 plus 30 biases). The output layer has 10 units and is also fully connected to H3, adding another 310 weights. In summary, the network has 1256 units, 64660 connections, and 9760 independent parameters.

译文：H3层有30个单元，完全连接到H2。因此，H2和H3之间的连接数是5790(30乘以192加上30个偏差)。输出层有10个单元，也完全连接到H3，再增加310个权重。总的来说，网络有1256个单元、64660个连接和9760个独立参数。
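The totals quoted for the network can be checked with a short Python calculation. It uses only the layer sizes given in the text; the variable names are hypothetical, and the 256 input pixels are counted as units, which is what makes the total come to 1256.

```python
# Layer sizes taken from the text.
input_units = 16 * 16                      # 256 input pixels
h1_units = 12 * 8 * 8                      # 12 maps of 8 x 8
h2_units = 12 * 4 * 4                      # 12 maps of 4 x 4
h3_units = 30
out_units = 10

# Connections: every unit has its input lines plus one bias line.
h1_conn = h1_units * (5 * 5 + 1)           # 25 inputs + bias
h2_conn = h2_units * (8 * 5 * 5 + 1)       # 8 maps x 25 inputs + bias
h3_conn = h3_units * (h2_units + 1)        # fully connected to H2
out_conn = out_units * (h3_units + 1)      # fully connected to H3

# Free parameters: shared kernels count once per map, biases are not shared.
h1_params = 12 * 25 + h1_units
h2_params = 12 * 200 + h2_units
h3_params = h3_conn                        # no sharing in fully connected layers
out_params = out_conn

total_units = input_units + h1_units + h2_units + h3_units + out_units
total_conn = h1_conn + h2_conn + h3_conn + out_conn
total_params = h1_params + h2_params + h3_params + out_params

print(total_units, total_conn, total_params)   # 1256 64660 9760
```

Weight sharing accounts for the whole difference between connections and parameters: H1 and H2 together hold 58560 connections but only 3660 free parameters, while the two fully connected layers contribute 6100 of each.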

4. Experimental Environment

All simulations were performed using the backpropagation simulator SN (Bottou and LeCun 1988) running on a SUN-4/260.

译文：所有的模拟都是使用SUN-4/260上运行的反向传播模拟器SN(Bottou和LeCun 1988)进行的。

The nonlinear function used at each node was a scaled hyperbolic tangent. Symmetric functions of that kind are believed to yield faster convergence, although the learning can be extremely slow if some weights are too small (LeCun 1987). The target values for the output units were chosen within the quasilinear range of the sigmoid. This prevents the weights from growing indefinitely and prevents the output units from operating in the flat spot of the sigmoid. The output cost function was the mean squared error.

译文：在每个节点上使用的非线性函数是一个缩放的双曲正切函数。这类对称函数被认为可以产生更快的收敛速度，尽管如果一些权重太小，学习可能会非常慢(leCun 1987)。输出单元的目标值选择在sigmoid的准线性范围内。这可以防止权值不确定地增长，并防止输出单元在sigmoid的平面点上运行。输出成本函数是均方误差。

Before training, the weights were initialized with random values using a uniform distribution between -2.4/Fi and 2.4/Fi, where Fi is the number of inputs (fan-in) of the unit to which the connection belongs. This technique tends to keep the total inputs within the operating range of the sigmoid.

译文：在训练之前，使用-2.4/Fi和2.4/Fi之间的均匀分布对权值进行初始化，其中Fi是连接所属单元的输入数(扇入)。这种技术倾向于将总输入保持在sigmoid的操作范围内。
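A minimal Python sketch of this fan-in-scaled initialization follows. The helper name `init_weights` is hypothetical, and the scale constant is left as a parameter whose default of 2.4 is the value given in the published paper.

```python
import random


def init_weights(fan_in, scale=2.4, rng=random):
    """Draw one unit's incoming weights uniformly from [-scale/fan_in, scale/fan_in]."""
    limit = scale / fan_in
    return [rng.uniform(-limit, limit) for _ in range(fan_in)]


rng = random.Random(0)                       # fixed seed so the sketch is repeatable
h1_weights = init_weights(25, rng=rng)       # an H1 unit has 25 input lines
h2_weights = init_weights(200, rng=rng)      # an H2 unit has 200 input lines
```

Dividing by the fan-in means that units with many inputs, such as those in H2, start with proportionally smaller weights, so their summed input stays in a similar range to that of units with few inputs.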

During each learning experiment, the patterns were repeatedly presented in a constant order. The weights were updated according to the so-called stochastic gradient or “on-line” procedure(updating after each presentation of a single pattern) as opposed to the “true” gradient procedure (averaging over the whole training set before updating the weights). From empirical study (supported by theoretical arguments), the stochastic gradient was found to converge much faster than the true gradient, especially on large, redundant data bases. It also finds solutions that are more robust.

All experiments were done using a special version of Newton’s algorithm that uses a positive, diagonal approximation of the Hessian matrix (LeCun 1987; Becker and LeCun 1988). This algorithm is not believed to bring a tremendous increase in learning speed but it converges reliably without requiring extensive adjustments of the parameters.

译文：在每次学习实验中，这些模式都以固定的顺序重复出现。权重的更新是根据所谓的随机梯度或“在线”过程(在单个模式出现后更新)，而不是“真正的”梯度过程(在更新权重之前对整个训练集进行平均)。从实证研究(有理论论据支持)发现，随机梯度比真实梯度收敛得快得多，尤其是在大型、冗余的数据库上。它还找到了更鲁棒的解决方案。所有的实验都是使用牛顿算法的一个特殊版本来完成的，该算法使用了海森矩阵的一个正对角线的近似(LeCun 1987;贝克尔和勒存1988)。这种算法虽然不能显著提高学习速度，但它的收敛性很好，不需要对参数进行大量的调整。
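The difference between the on-line and the "true" gradient procedures can be illustrated on a toy problem. The Python sketch below is not the paper's setup: it fits a single weight to a small, deliberately redundant data set with plain gradient descent (not the diagonal Newton method mentioned above), and all names and constants are hypothetical.

```python
def online_pass(w, xs, ys, lr):
    """One pass of stochastic ("on-line") updates: one update per pattern."""
    for x, y in zip(xs, ys):
        w -= lr * (w * x - y) * x          # gradient of 0.5 * (w*x - y)**2
    return w


def batch_pass(w, xs, ys, lr):
    """One pass of the "true" gradient: average over the set, then one update."""
    grad = sum((w * x - y) * x for x, y in zip(xs, ys)) / len(xs)
    return w - lr * grad


# Redundant data: the same four patterns repeated 25 times, presented in a
# constant order. The target relation is y = 3 * x.
xs = [0.5, 1.0, 1.5, 2.0] * 25
ys = [3.0 * x for x in xs]

w_online = w_batch = 0.0
for _ in range(3):                          # three passes through the set
    w_online = online_pass(w_online, xs, ys, lr=0.05)
    w_batch = batch_pass(w_batch, xs, ys, lr=0.05)

print(abs(w_online - 3.0) < abs(w_batch - 3.0))   # True
```

With the same number of passes and the same learning rate, the on-line version makes 100 updates per pass instead of one, so on redundant data it gets much closer to the target weight. This toy example only shows the effect of update frequency; it says nothing about the robustness of the solutions.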

5. Results

After each pass through the training set, the performance was measured both on the training and on the test set. The network was trained for 23 passes through the training set (167693 pattern presentations).

译文：每次通过训练集后，都对训练集和测试集的性能进行测量。网络通过训练集进行了23次训练(167693模式演示)。

After these 23 passes, the MSE averaged over the patterns and over the output units was $2.5 \times 10^{-3}$ on the training set and $1.8 \times 10^{-2}$ on the test set. The percentage of misclassified patterns was 0.14% on the training set (10 mistakes) and 5.0% on the test set (102 mistakes). As can be seen in Figure 3, the convergence is extremely quick, and shows that backpropagation can be used on fairly large tasks with reasonable training times. This is due in part to the high redundancy of real data.

译文：在这23遍训练之后，MSE在模式和输出单元上的平均值在训练集上是 $2.5 \times 10^{-3}$ ，在测试集上是 $1.8 \times 10^{-2}$ 。错误分类的模式在训练集占0.14%(10个错误)，在测试集占5.0%(102个错误)。从图3中可以看出，收敛速度非常快，并且可以在训练时间合理的情况下将反向传播用于相当大的任务。这部分是由于实际数据的高冗余。

In a realistic application, the user usually is interested in the number of rejections necessary to reach a given level of accuracy rather than in the raw error rate. We measured the percentage of the test patterns that must be rejected in order to get 1% error on the remaining test patterns. Our main rejection criterion was that the difference between the activity levels of the two most active units should exceed a given threshold.

译文：在实际应用中，用户通常会对达到给定精度水平所需的拒绝次数感兴趣，而不是对原始错误率感兴趣。我们测量了必须被拒绝的测试模式的百分比，以便在剩下的测试模式上得到1%的错误。我们的主要拒绝标准是两个最活跃的单元的活动水平之间的差异应该超过给定的阈值。

Figure 3 Log mean squared error (MSE) (top) and raw error rate (bottom) versus number of training passes

The percentage of rejections was then 12.1% for 1% classification error on the remaining (nonrejected) test patterns. It should be emphasized that the rejection thresholds were obtained using performance measures on the test set.

译文：在其余（未拒绝）测试模式上，对于1％的分类错误，拒绝的百分比为12.1％。应该强调的是，拒绝阈值是使用测试集上的性能指标获得的。
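The main rejection criterion can be sketched in Python as follows. The function name `classify_with_rejection` and the threshold value used in the example are hypothetical; the paper does not report the threshold it used.

```python
def classify_with_rejection(activities, threshold):
    """Return the index of the most active output unit, or None to reject.

    A pattern is accepted only if the gap between the two most active
    output units exceeds the threshold.
    """
    ranked = sorted(range(len(activities)),
                    key=lambda i: activities[i], reverse=True)
    best, runner_up = ranked[0], ranked[1]
    if activities[best] - activities[runner_up] > threshold:
        return best
    return None


# Ten output activities, one per digit class (place coding).
clear = [-0.9, -0.8, 0.9, -0.7, -0.9, -0.8, -0.9, -0.9, -0.8, -0.9]
ambiguous = [-0.9, 0.5, -0.8, -0.9, -0.9, -0.8, -0.9, 0.6, -0.8, -0.9]

clear_label = classify_with_rejection(clear, threshold=0.5)
ambiguous_label = classify_with_rejection(ambiguous, threshold=0.5)
print(clear_label, ambiguous_label)   # 2 None
```

Raising the threshold rejects more patterns and lowers the error rate on those that remain, which is the trade-off behind the rejection percentage reported above.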

Some kernels synthesized by the network can be interpreted as feature detectors remarkably similar to those found to exist in biological and artificial character recognizers, such as spatial derivative estimators or off-center/on-surround type feature detectors.

译文：由网络合成的某些内核可以被解释为特征检测器，与发现的生物人工字符识别器中存在的特征检测器非常相似，例如空间导数估计器或偏心/环绕型特征检测器。

Most misclassifications are due to erroneous segmentation of the image into individual characters. Segmentation is a very difficult problem, especially when the characters overlap extensively. Other mistakes are due to ambiguous patterns, low resolution effects, or writing styles not present in the training set.

译文：大多数错误分类是由于将图像错误地分割为单个字符而引起的。分割是一个非常困难的问题，尤其是当字符大量重叠时。其他错误是由于模棱两可的图案，低分辨率的效果或训练集中没有的写作风格所致。

Other networks with fewer feature maps were tried, but produced worse results. Various fully connected, unconstrained networks were also tried, but their generalization performance was quite bad. For example, a fully connected network with one hidden layer of 40 units (10690 connections total) gave the following results: 1.6% misclassifications on the training set, 8.1% misclassifications on the test set, and 19.4% rejections for 1% error on the remaining test patterns. A full comparative study will be described in another paper.

译文：尝试了其他具有较少特征图的网络，但产生了较差的结果。还尝试了各种完全连接且不受限制的网络，但泛化性能很差。例如，一个具有40个单元的隐藏层的全连接网络（总共10690个连接）得出以下结果：训练集上的错误分类为1.6％，测试集上的错误分类为8.1％，剩余的测试模式上1％的错误则拒绝了19.4％。完整的比较研究将在另一篇论文中进行描述。

5.1 Comparison with Other Work. The first several stages of processing in our previous system (described in Denker et al 1989) involved convolutions in which the coefficients had been laboriously hand designed. In the present system, the first two layers of the network are constrained to be convolutional, but the system automatically learns the coefficients that make up the kernels. This “constrained backpropagation” is the key to success of the present system: it not only builds in shift-invariance, but vastly reduces the entropy, the Vapnik-Chervonenkis dimensionality, and the number of free parameters, thereby proportionately reducing the amount of training data required to achieve a given level of generalization performance (Denker et al. 1987; Baum and Haussler 1989). The present system performs slightly better than the previous system. This is remarkable considering that much less specific information about the problem was built into the network. Furthermore, the new approach seems to have more potential for improvement by designing more specialized architectures with more connections and fewer free parameters.

译文：5.1与其他工作的比较。在我们之前的系统(Denker等人1989年描述过)中，前几个处理阶段涉及到系数经过艰苦手工设计的卷积。在目前的系统中，网络的前两层被限制为卷积，但是系统自动学习构成内核的系数。这种“受限反向传播”是本系统成功的关键：它不仅建立了平移不变性，而且极大地降低了熵，Vapnik-Chervonenkis维数和自由参数的数量，从而成比例地减少了训练量达到给定水平的泛化性能所需的数据（Denker等，1987； Baum和Haussler，1989）。本系统比以前的系统性能稍好。考虑到网络中内置的有关该问题的信息要少得多，因此这非常了不起。此外，通过设计具有更多连接和更少自由参数的更专业的体系结构，新方法似乎具有更大的改进潜力。

Waibel (1989) describes a large network (but still small compared to ours) with about 18000 connections and 1800 free parameters, trained on a speech recognition task. Because training time was prohibitive (18 days on an Alliant mini-supercomputer), he suggested building the network from smaller, separately trained networks. We did not need such a modular construction procedure since our training times were “only” 3 days on a Sun workstation, and in any case it is not clear how to partition our problem into separately trainable subproblems.

译文：Waibel(1989)描述了一个拥有18000个连接和1800个自由参数的大型网络(但与我们的网络相比仍然很小)，训练用于语音识别任务。由于训练时间非常长(在Alliant的迷你超级计算机上训练18天)，他建议使用更小的、单独训练的网络中构建网络。我们不需要这样一个模块化的构建过程，因为我们在Sun工作站上的培训时间“只有”3天，而且在任何情况下都不清楚如何将我们的问题划分为可单独培训的子问题。

5.2 DSP Implementation. During the recognition process, almost all the computation time is spent performing multiply-accumulate operations, a task that digital signal processors (DSP) are specifically designed for. We used an off-the-shelf board that contains 256 kbytes of local memory and an AT&T DSP-32C general purpose DSP with a peak performance of 12.5 million multiply-add operations per second on 32 bit floating point numbers (25 MFLOPS). The DSP operates as a coprocessor; the host is a personal computer (PC), which also contains a video acquisition board connected to a camera.

译文：DSP实现。在识别过程中，几乎所有的计算时间都用来执行乘法累加运算，这是数字信号处理器(DSP)专门为其设计的任务。我们使用的现成板包含256 KB的本地内存和AT＆T DSP-32C通用DSP，其32位浮点数（25 MFLOPS）的峰值性能为每秒1,250万次乘法加法运算。 DSP作为协处理器运行；主机是一台个人计算机（PC），还包含连接到摄像机的视频采集板。

The personal computer digitizes an image and binarizes it using an adaptive thresholding technique. The thresholded image is then scanned and each connected component (or segment) is isolated. Components that are too small or too large are discarded; remaining components are sent to the DSP for normalization and recognition. The PC gives a variable sized pixel map representation of a single digit to the DSP, which performs the normalization and the classification.

译文：个人计算机将图像数字化并使用自适应阈值技术对其进行二值化。然后对阈值图像进行扫描，并隔离每个连接的组件(或段)。组件太小或太大被丢弃。将剩余的组件发送给DSP进行归一化和识别。PC机向DSP提供一个可变大小的单个数字像素图表示，由DSP进行归一化和分类。

The overall throughput of the digit recognizer including image acquisition is 10 to 12 classifications per second and is limited mainly by the normalization step. On normalized digits, the DSP performs more than 30 classifications per second.

译文：包括图像采集在内的数字识别器的总体吞吐量为每秒10到12个分类，主要受归一化步骤的限制。在归一化数字上，DSP每秒执行30多个分类。

6. Conclusion

We have successfully applied backpropagation learning to a large, real-world task. Our results appear to be at the state of the art in digit recognition. Our network was trained on a low-level representation of data that had minimal preprocessing (as opposed to elaborate feature extraction). The network had many connections but relatively few free parameters. The network architecture and the constraints on the weights were designed to incorporate geometric knowledge about the task into the system. Because of the redundant nature of the data and because of the constraints imposed on the network, the learning time was relatively short considering the size of the training set. Scaling properties were far better than one would expect just from extrapolating results of backpropagation on smaller, artificial problems.

译文：我们已经成功地将反向传播学习应用于大型的现实任务。我们的结果似乎是数字识别领域的最佳状态。我们的网络是在预处理最少的低水平表示的数据上训练的（与复杂的特征提取相反）。该网络具有许多连接，但是自由参数相对较少。网络体系结构和权重约束旨在将有关任务的几何知识整合到系统中。由于数据的冗余性质和网络上的限制，考虑到训练集的大小，学习时间相对较短。缩放特性远远好于仅通过反向传播对较小的人为问题进行的推断得出的结果。

The final network of connections and weights obtained by backpropagation learning was readily implementable on commercial digital signal processing hardware. Throughput rates, from camera to classified image, of more than 10 digits per second were obtained.

译文：通过反向传播学习获得的最终连接和权重网络很容易在商业数字信号处理硬件上实现。从相机到分类图像的吞吐率都超过了每秒10位。

This work points out the necessity of having flexible “network design” software tools that ease the design of complex, specialized network architectures.

译文：这项工作指出了拥有灵活的“网络设计”软件工具的必要性，这些工具可以简化复杂的专用网络体系结构的设计。