[Paper Translation] Deep learning

Paper title: Deep learning
Paper source: Deep learning
Translated by: BDML@CQUT Lab

Deep learning

Yann LeCun, Yoshua Bengio & Geoffrey Hinton

Abstract

Deep learning allows computational models that are composed of multiple processing layers to learn representations of data with multiple levels of abstraction. These methods have dramatically improved the state-of-the-art in speech recognition, visual object recognition, object detection and many other domains such as drug discovery and genomics. Deep learning discovers intricate structure in large data sets by using the backpropagation algorithm to indicate how a machine should change its internal parameters that are used to compute the representation in each layer from the representation in the previous layer. Deep convolutional nets have brought about breakthroughs in processing images, video, speech and audio, whereas recurrent nets have shone light on sequential data such as text and speech.

Main text

Machine-learning technology powers many aspects of modern society: from web searches to content filtering on social networks to recommendations on e-commerce websites, and it is increasingly present in consumer products such as cameras and smartphones. Machine-learning systems are used to identify objects in images, transcribe speech into text, match news items, posts or products with users’ interests, and select relevant results of search. Increasingly, these applications make use of a class of techniques called deep learning.

Conventional machine-learning techniques were limited in their ability to process natural data in their raw form. For decades, constructing a pattern-recognition or machine-learning system required careful engineering and considerable domain expertise to design a feature extractor that transformed the raw data (such as the pixel values of an image) into a suitable internal representation or feature vector from which the learning subsystem, often a classifier, could detect or classify patterns in the input.

Representation learning is a set of methods that allows a machine to be fed with raw data and to automatically discover the representations needed for detection or classification. Deep-learning methods are representation-learning methods with multiple levels of representation, obtained by composing simple but non-linear modules that each transform the representation at one level (starting with the raw input) into a representation at a higher, slightly more abstract level. With the composition of enough such transformations, very complex functions can be learned. For classification tasks, higher layers of representation amplify aspects of the input that are important for discrimination and suppress irrelevant variations. An image, for example, comes in the form of an array of pixel values, and the learned features in the first layer of representation typically represent the presence or absence of edges at particular orientations and locations in the image. The second layer typically detects motifs by spotting particular arrangements of edges, regardless of small variations in the edge positions. The third layer may assemble motifs into larger combinations that correspond to parts of familiar objects, and subsequent layers would detect objects as combinations of these parts. The key aspect of deep learning is that these layers of features are not designed by human engineers: they are learned from data using a general-purpose learning procedure.

Deep learning is making major advances in solving problems that have resisted the best attempts of the artificial intelligence community for many years. It has turned out to be very good at discovering intricate structures in high-dimensional data and is therefore applicable to many domains of science, business and government. In addition to beating records in image recognition and speech recognition, it has beaten other machine-learning techniques at predicting the activity of potential drug molecules, analysing particle accelerator data, reconstructing brain circuits, and predicting the effects of mutations in non-coding DNA on gene expression and disease. Perhaps more surprisingly, deep learning has produced extremely promising results for various tasks in natural language understanding, particularly topic classification, sentiment analysis, question answering and language translation.

We think that deep learning will have many more successes in the near future because it requires very little engineering by hand, so it can easily take advantage of increases in the amount of available computation and data. New learning algorithms and architectures that are currently being developed for deep neural networks will only accelerate this progress.

Supervised learning

The most common form of machine learning, deep or not, is supervised learning. Imagine that we want to build a system that can classify images as containing, say, a house, a car, a person or a pet. We first collect a large data set of images of houses, cars, people and pets, each labelled with its category. During training, the machine is shown an image and produces an output in the form of a vector of scores, one for each category. We want the desired category to have the highest score of all categories, but this is unlikely to happen before training. We compute an objective function that measures the error (or distance) between the output scores and the desired pattern of scores. The machine then modifies its internal adjustable parameters to reduce this error. These adjustable parameters, often called weights, are real numbers that can be seen as ‘knobs’ that define the input–output function of the machine. In a typical deep-learning system, there may be hundreds of millions of these adjustable weights, and hundreds of millions of labelled examples with which to train the machine.

To properly adjust the weight vector, the learning algorithm computes a gradient vector that, for each weight, indicates by what amount the error would increase or decrease if the weight were increased by a tiny amount. The weight vector is then adjusted in the opposite direction to the gradient vector.
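
In symbols (a standard way to write this update; the learning rate η is a small positive step size left implicit in the text), each weight w of the objective E is nudged against its own gradient:

$$w \leftarrow w - \eta \, \frac{\partial E}{\partial w}$$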

The objective function, averaged over all the training examples, can be seen as a kind of hilly landscape in the high-dimensional space of weight values. The negative gradient vector indicates the direction of steepest descent in this landscape, taking it closer to a minimum, where the output error is low on average.

In practice, most practitioners use a procedure called stochastic gradient descent (SGD). This consists of showing the input vector for a few examples, computing the outputs and the errors, computing the average gradient for those examples, and adjusting the weights accordingly. The process is repeated for many small sets of examples from the training set until the average of the objective function stops decreasing. It is called stochastic because each small set of examples gives a noisy estimate of the average gradient over all examples. This simple procedure usually finds a good set of weights surprisingly quickly when compared with far more elaborate optimization techniques. After training, the performance of the system is measured on a different set of examples called a test set. This serves to test the generalization ability of the machine — its ability to produce sensible answers on new inputs that it has never seen during training.
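
As a concrete sketch of this loop, here is minibatch SGD on a made-up linear scoring model with a squared-error objective; the data, model and learning rate are illustrative stand-ins, not the paper’s setup:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 20))        # 1,000 training examples, 20 features
true_W = rng.normal(size=(20, 4))
T = X @ true_W                         # desired pattern of scores (4 categories)
W = np.zeros((20, 4))                  # adjustable weights, the 'knobs'
lr, batch = 0.01, 32

for step in range(500):
    idx = rng.integers(0, len(X), size=batch)  # a small, random set of examples
    Y = X[idx] @ W                             # outputs for this minibatch
    err = Y - T[idx]                           # error w.r.t. the desired scores
    grad = X[idx].T @ err / batch              # noisy estimate of the average gradient
    W -= lr * grad                             # adjust weights opposite the gradient
```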

Many of the current practical applications of machine learning use linear classifiers on top of hand-engineered features. A two-class linear classifier computes a weighted sum of the feature vector components. If the weighted sum is above a threshold, the input is classified as belonging to a particular category.
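
In code, such a classifier is just a thresholded weighted sum; the feature values, weights and threshold below are hypothetical:

```python
def linear_classify(x, w, threshold=0.0):
    """Two-class linear classifier: weighted sum of feature components vs. a threshold."""
    weighted_sum = sum(wi * xi for wi, xi in zip(w, x))
    return 1 if weighted_sum > threshold else 0

print(linear_classify([0.2, 1.5, -0.3], [1.0, 0.8, 2.0]))  # 1: sum = 0.8 > 0
```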

Since the 1960s we have known that linear classifiers can only carve their input space into very simple regions, namely half-spaces separated by a hyperplane [19]. But problems such as image and speech recognition require the input–output function to be insensitive to irrelevant variations of the input, such as variations in position, orientation or illumination of an object, or variations in the pitch or accent of speech, while being very sensitive to particular minute variations (for example, the difference between a white wolf and a breed of wolf-like white dog called a Samoyed). At the pixel level, images of two Samoyeds in different poses and in different environments may be very different from each other, whereas two images of a Samoyed and a wolf in the same position and on similar backgrounds may be very similar to each other. A linear classifier, or any other ‘shallow’ classifier operating on raw pixels could not possibly distinguish the latter two, while putting the former two in the same category. This is why shallow classifiers require a good feature extractor that solves the selectivity–invariance dilemma — one that produces representations that are selective to the aspects of the image that are important for discrimination, but that are invariant to irrelevant aspects such as the pose of the animal. To make classifiers more powerful, one can use generic non-linear features, as with kernel methods, but generic features such as those arising with the Gaussian kernel do not allow the learner to generalize well far from the training examples. The conventional option is to hand design good feature extractors, which requires a considerable amount of engineering skill and domain expertise. But this can all be avoided if good features can be learned automatically using a general-purpose learning procedure. This is the key advantage of deep learning.

Figure 1 | Multilayer neural networks and backpropagation.

a, A multi-layer neural network (shown by the connected dots) can distort the input space to make the classes of data (examples of which are on the red and blue lines) linearly separable. Note how a regular grid (shown on the left) in input space is also transformed (shown in the middle panel) by hidden units. This is an illustrative example with only two input units, two hidden units and one output unit, but the networks used for object recognition or natural language processing contain tens or hundreds of thousands of units. Reproduced with permission from C. Olah (http://colah.github.io/).

b, The chain rule of derivatives tells us how two small effects (that of a small change of x on y, and that of y on z) are composed. A small change Δx in x gets transformed first into a small change Δy in y by getting multiplied by ∂y/∂x (that is, the definition of partial derivative). Similarly, the change Δy creates a change Δz in z. Substituting one equation into the other gives the chain rule of derivatives — how Δx gets turned into Δz through multiplication by the product of ∂y/∂x and ∂z/∂y. It also works when x, y and z are vectors (and the derivatives are Jacobian matrices).

c, The equations used for computing the forward pass in a neural net with two hidden layers and one output layer, each constituting a module through which one can backpropagate gradients. At each layer, we first compute the total input z to each unit, which is a weighted sum of the outputs of the units in the layer below. Then a non-linear function f(.) is applied to z to get the output of the unit. For simplicity, we have omitted bias terms. The non-linear functions used in neural networks include the rectified linear unit (ReLU) f(z) = max(0,z), commonly used in recent years, as well as the more conventional sigmoids, such as the hyperbolic tangent, f(z) = (exp(z) − exp(−z))/(exp(z) + exp(−z)), and the logistic function, f(z) = 1/(1 + exp(−z)).

d, The equations used for computing the backward pass. At each hidden layer we compute the error derivative with respect to the output of each unit, which is a weighted sum of the error derivatives with respect to the total inputs to the units in the layer above. We then convert the error derivative with respect to the output into the error derivative with respect to the input by multiplying it by the gradient of f(z). At the output layer, the error derivative with respect to the output of a unit is computed by differentiating the cost function. This gives y_l − t_l if the cost function for unit l is 0.5(y_l − t_l)^2, where t_l is the target value. Once the ∂E/∂z_k is known, the error-derivative for the weight w_jk on the connection from unit j in the layer below is just y_j ∂E/∂z_k.
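
The sketch below walks through these forward and backward passes for a tiny network with one hidden layer (the two-hidden-layer case of the figure repeats the middle step); weights, input and target are made-up numbers, and the output unit is kept linear for simplicity:

```python
import numpy as np

def relu(z):      return np.maximum(0.0, z)
def relu_grad(z): return (z > 0).astype(float)

x  = np.array([0.5, -1.0])                 # input units
W1 = np.array([[0.1, 0.4], [-0.3, 0.2]])   # weights: input -> hidden
W2 = np.array([[0.7], [-0.5]])             # weights: hidden -> output
t  = np.array([1.0])                       # target value t_l

# Forward pass: total input z is a weighted sum of outputs from the layer
# below, then the non-linearity f is applied.
z1 = x @ W1;  y1 = relu(z1)
z2 = y1 @ W2; y2 = z2                      # linear output unit

# Backward pass: for the cost 0.5*(y_l - t_l)^2, dE/dy at the output is y - t
# (and equals dE/dz here, since the output unit is linear); each hidden layer
# turns dE/dy into dE/dz by multiplying by f'(z), and the gradient for a
# weight w_jk is y_j * dE/dz_k.
dE_dz2 = y2 - t
dW2    = np.outer(y1, dE_dz2)
dE_dy1 = dE_dz2 @ W2.T                     # weighted sum of derivatives from above
dE_dz1 = dE_dy1 * relu_grad(z1)
dW1    = np.outer(x, dE_dz1)
```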

A deep-learning architecture is a multilayer stack of simple modules, all (or most) of which are subject to learning, and many of which compute non-linear input–output mappings. Each module in the stack transforms its input to increase both the selectivity and the invariance of the representation. With multiple non-linear layers, say a depth of 5 to 20, a system can implement extremely intricate functions of its inputs that are simultaneously sensitive to minute details — distinguishing Samoyeds from white wolves — and insensitive to large irrelevant variations such as the background, pose, lighting and surrounding objects.

Backpropagation to train multilayer architectures

From the earliest days of pattern recognition, the aim of researchers has been to replace hand-engineered features with trainable multilayer networks, but despite its simplicity, the solution was not widely understood until the mid 1980s. As it turns out, multilayer architectures can be trained by simple stochastic gradient descent. As long as the modules are relatively smooth functions of their inputs and of their internal weights, one can compute gradients using the backpropagation procedure. The idea that this could be done, and that it worked, was discovered independently by several different groups during the 1970s and 1980s.

The backpropagation procedure to compute the gradient of an objective function with respect to the weights of a multilayer stack of modules is nothing more than a practical application of the chain rule for derivatives. The key insight is that the derivative (or gradient) of the objective with respect to the input of a module can be computed by working backwards from the gradient with respect to the output of that module (or the input of the subsequent module) (Fig. 1). The backpropagation equation can be applied repeatedly to propagate gradients through all modules, starting from the output at the top (where the network produces its prediction) all the way to the bottom (where the external input is fed). Once these gradients have been computed, it is straightforward to compute the gradients with respect to the weights of each module.

Many applications of deep learning use feedforward neural network architectures (Fig. 1), which learn to map a fixed-size input (for example, an image) to a fixed-size output (for example, a probability for each of several categories). To go from one layer to the next, a set of units compute a weighted sum of their inputs from the previous layer and pass the result through a non-linear function. At present, the most popular non-linear function is the rectified linear unit (ReLU), which is simply the half-wave rectifier f(z) = max(z, 0). In past decades, neural nets used smoother non-linearities, such as tanh(z) or 1/(1 + exp(−z)), but the ReLU typically learns much faster in networks with many layers, allowing training of a deep supervised network without unsupervised pre-training. Units that are not in the input or output layer are conventionally called hidden units. The hidden layers can be seen as distorting the input in a non-linear way so that categories become linearly separable by the last layer (Fig. 1).

In the late 1990s, neural nets and backpropagation were largely forsaken by the machine-learning community and ignored by the computer-vision and speech-recognition communities. It was widely thought that learning useful, multistage, feature extractors with little prior knowledge was infeasible. In particular, it was commonly thought that simple gradient descent would get trapped in poor local minima — weight configurations for which no small change would reduce the average error.

In practice, poor local minima are rarely a problem with large networks. Regardless of the initial conditions, the system nearly always reaches solutions of very similar quality. Recent theoretical and empirical results strongly suggest that local minima are not a serious issue in general. Instead, the landscape is packed with a combinatorially large number of saddle points where the gradient is zero, and the surface curves up in most dimensions and curves down in the remainder. The analysis seems to show that saddle points with only a few downward curving directions are present in very large numbers, but almost all of them have very similar values of the objective function. Hence, it does not much matter which of these saddle points the algorithm gets stuck at.

Interest in deep feedforward networks was revived around 2006 by a group of researchers brought together by the Canadian Institute for Advanced Research (CIFAR). The researchers introduced unsupervised learning procedures that could create layers of feature detectors without requiring labelled data. The objective in learning each layer of feature detectors was to be able to reconstruct or model the activities of feature detectors (or raw inputs) in the layer below. By ‘pre-training’ several layers of progressively more complex feature detectors using this reconstruction objective, the weights of a deep network could be initialized to sensible values. A final layer of output units could then be added to the top of the network and the whole deep system could be fine-tuned using standard backpropagation. This worked remarkably well for recognizing handwritten digits or for detecting pedestrians, especially when the amount of labelled data was very limited.

The first major application of this pre-training approach was in speech recognition, and it was made possible by the advent of fast graphics processing units (GPUs) that were convenient to program and allowed researchers to train networks 10 or 20 times faster. In 2009, the approach was used to map short temporal windows of coefficients extracted from a sound wave to a set of probabilities for the various fragments of speech that might be represented by the frame in the centre of the window. It achieved record-breaking results on a standard speech recognition benchmark that used a small vocabulary and was quickly developed to give record-breaking results on a large vocabulary task. By 2012, versions of the deep net from 2009 were being developed by many of the major speech groups and were already being deployed in Android phones. For smaller data sets, unsupervised pre-training helps to prevent overfitting, leading to significantly better generalization when the number of labelled examples is small, or in a transfer setting where we have lots of examples for some ‘source’ tasks but very few for some ‘target’ tasks. Once deep learning had been rehabilitated, it turned out that the pre-training stage was only needed for small data sets.

There was, however, one particular type of deep, feedforward network that was much easier to train and generalized much better than networks with full connectivity between adjacent layers. This was the convolutional neural network (ConvNet). It achieved many practical successes during the period when neural networks were out of favour and it has recently been widely adopted by the computer-vision community.

Convolutional neural networks

ConvNets are designed to process data that come in the form of multiple arrays, for example a colour image composed of three 2D arrays containing pixel intensities in the three colour channels. Many data modalities are in the form of multiple arrays: 1D for signals and sequences, including language; 2D for images or audio spectrograms; and 3D for video or volumetric images. There are four key ideas behind ConvNets that take advantage of the properties of natural signals: local connections, shared weights, pooling and the use of many layers.

Figure 2 | Inside a convolutional network. The outputs (not the filters) of each layer (horizontally) of a typical convolutional network architecture applied to the image of a Samoyed dog (bottom left; and RGB (red, green, blue) inputs, bottom right). Each rectangular image is a feature map corresponding to the output for one of the learned features, detected at each of the image positions. Information flows bottom up, with lower-level features acting as oriented edge detectors, and a score is computed for each image class in output. ReLU, rectified linear unit.

The architecture of a typical ConvNet (Fig. 2) is structured as a series of stages. The first few stages are composed of two types of layers: convolutional layers and pooling layers. Units in a convolutional layer are organized in feature maps, within which each unit is connected to local patches in the feature maps of the previous layer through a set of weights called a filter bank. The result of this local weighted sum is then passed through a non-linearity such as a ReLU. All units in a feature map share the same filter bank. Different feature maps in a layer use different filter banks. The reason for this architecture is twofold. First, in array data such as images, local groups of values are often highly correlated, forming distinctive local motifs that are easily detected. Second, the local statistics of images and other signals are invariant to location. In other words, if a motif can appear in one part of the image, it could appear anywhere, hence the idea of units at different locations sharing the same weights and detecting the same pattern in different parts of the array. Mathematically, the filtering operation performed by a feature map is a discrete convolution, hence the name.
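
A minimal sketch of that filtering operation for one feature map: a discrete 2D convolution (written, as most deep-learning code is, as a cross-correlation) of a single input channel with one small filter, with a ReLU applied to each local weighted sum. A real layer applies a whole bank of such filters; the example filter here is hand-made for illustration:

```python
import numpy as np

def conv2d(image, kernel):
    """One feature map: slide a shared filter over the image, take local
    weighted sums, then apply the ReLU non-linearity."""
    H, W = image.shape
    kh, kw = kernel.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            patch = image[i:i + kh, j:j + kw]   # local patch, shared weights
            out[i, j] = np.sum(patch * kernel)  # local weighted sum
    return np.maximum(out, 0.0)

edge_filter = np.array([[1., 0., -1.]] * 3)     # a crude vertical-edge motif detector
fmap = conv2d(np.random.default_rng(0).random((8, 8)), edge_filter)
```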

Although the role of the convolutional layer is to detect local conjunctions of features from the previous layer, the role of the pooling layer is to merge semantically similar features into one. Because the relative positions of the features forming a motif can vary somewhat, reliably detecting the motif can be done by coarse-graining the position of each feature. A typical pooling unit computes the maximum of a local patch of units in one feature map (or in a few feature maps). Neighbouring pooling units take input from patches that are shifted by more than one row or column, thereby reducing the dimension of the representation and creating an invariance to small shifts and distortions. Two or three stages of convolution, non-linearity and pooling are stacked, followed by more convolutional and fully-connected layers. Backpropagating gradients through a ConvNet is as simple as through a regular deep network, allowing all the weights in all the filter banks to be trained.
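
A sketch of a typical max-pooling unit over non-overlapping 2 × 2 patches, which halves each spatial dimension of a feature map:

```python
import numpy as np

def max_pool(fmap, size=2):
    """Each pooling unit outputs the maximum of a local patch; neighbouring
    units read patches shifted by `size` rows/columns."""
    H2, W2 = fmap.shape[0] // size, fmap.shape[1] // size
    out = np.zeros((H2, W2))
    for i in range(H2):
        for j in range(W2):
            out[i, j] = fmap[i*size:(i+1)*size, j*size:(j+1)*size].max()
    return out
```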

Deep neural networks exploit the property that many natural signals are compositional hierarchies, in which higher-level features are obtained by composing lower-level ones. In images, local combinations of edges form motifs, motifs assemble into parts, and parts form objects. Similar hierarchies exist in speech and text from sounds to phones, phonemes, syllables, words and sentences. The pooling allows representations to vary very little when elements in the previous layer vary in position and appearance.

The convolutional and pooling layers in ConvNets are directly inspired by the classic notions of simple cells and complex cells in visual neuroscience, and the overall architecture is reminiscent of the LGN–V1–V2–V4–IT hierarchy in the visual cortex ventral pathway. When ConvNet models and monkeys are shown the same picture, the activations of high-level units in the ConvNet explains half of the variance of random sets of 160 neurons in the monkey’s inferotemporal cortex. ConvNets have their roots in the neocognitron, the architecture of which was somewhat similar, but did not have an end-to-end supervised-learning algorithm such as backpropagation. A primitive 1D ConvNet called a time-delay neural net was used for the recognition of phonemes and simple words.

There have been numerous applications of convolutional networks going back to the early 1990s, starting with time-delay neural networks for speech recognition and document reading. The document reading system used a ConvNet trained jointly with a probabilistic model that implemented language constraints. By the late 1990s this system was reading over 10% of all the cheques in the United States. A number of ConvNet-based optical character recognition and handwriting recognition systems were later deployed by Microsoft. ConvNets were also experimented with in the early 1990s for object detection in natural images, including faces and hands, and for face recognition.

Image understanding with deep convolutional networks

Since the early 2000s, ConvNets have been applied with great success to the detection, segmentation and recognition of objects and regions in images. These were all tasks in which labelled data was relatively abundant, such as traffic sign recognition, the segmentation of biological images particularly for connectomics, and the detection of faces, text, pedestrians and human bodies in natural images. A major recent practical success of ConvNets is face recognition.

Importantly, images can be labelled at the pixel level, which will have applications in technology, including autonomous mobile robots and self-driving cars. Companies such as Mobileye and NVIDIA are using such ConvNet-based methods in their upcoming vision systems for cars. Other applications gaining importance involve natural language understanding and speech recognition.

Figure 3 | From image to text.

Captions generated by a recurrent neural network (RNN) taking, as extra input, the representation extracted by a deep convolution neural network (CNN) from a test image, with the RNN trained to ‘translate’ high-level representations of images into captions (top). When the RNN is given the ability to focus its attention on a different location in the input image (middle and bottom; the lighter patches were given more attention) as it generates each word (bold), we found that it exploits this to achieve better ‘translation’ of images into captions.

Despite these successes, ConvNets were largely forsaken by the mainstream computer-vision and machine-learning communities until the ImageNet competition in 2012. When deep convolutional networks were applied to a data set of about a million images from the web that contained 1,000 different classes, they achieved spectacular results, almost halving the error rates of the best competing approaches [1]. This success came from the efficient use of GPUs, ReLUs, a new regularization technique called dropout, and techniques to generate more training examples by deforming the existing ones. This success has brought about a revolution in computer vision; ConvNets are now the dominant approach for almost all recognition and detection tasks and approach human performance on some tasks. A recent stunning demonstration combines ConvNets and recurrent net modules for the generation of image captions (Fig. 3).

Recent ConvNet architectures have 10 to 20 layers of ReLUs, hundreds of millions of weights, and billions of connections between units. Whereas training such large networks could have taken weeks only two years ago, progress in hardware, software and algorithm parallelization have reduced training times to a few hours.

The performance of ConvNet-based vision systems has caused most major technology companies, including Google, Facebook, Microsoft, IBM, Yahoo!, Twitter and Adobe, as well as a quickly growing number of start-ups to initiate research and development projects and to deploy ConvNet-based image understanding products and services.

ConvNets are easily amenable to efficient hardware implementations in chips or field-programmable gate arrays. A number of companies such as NVIDIA, Mobileye, Intel, Qualcomm and Samsung are developing ConvNet chips to enable real-time vision applications in smartphones, cameras, robots and self-driving cars.

Distributed representations and language processing

Deep-learning theory shows that deep nets have two different exponential advantages over classic learning algorithms that do not use distributed representations. Both of these advantages arise from the power of composition and depend on the underlying data-generating distribution having an appropriate componential structure. First, learning distributed representations enables generalization to new combinations of the values of learned features beyond those seen during training (for example, 2^n combinations are possible with n binary features). Second, composing layers of representation in a deep net brings the potential for another exponential advantage (exponential in the depth).

The hidden layers of a multilayer neural network learn to represent the network’s inputs in a way that makes it easy to predict the target outputs. This is nicely demonstrated by training a multilayer neural network to predict the next word in a sequence from a local context of earlier words. Each word in the context is presented to the network as a one-of-N vector, that is, one component has a value of 1 and the rest are 0. In the first layer, each word creates a different pattern of activations, or word vectors (Fig. 4). In a language model, the other layers of the network learn to convert the input word vectors into an output word vector for the predicted next word, which can be used to predict the probability for any word in the vocabulary to appear as the next word. The network learns word vectors that contain many active components each of which can be interpreted as a separate feature of the word, as was first demonstrated in the context of learning distributed representations for symbols. These semantic features were not explicitly present in the input. They were discovered by the learning procedure as a good way of factorizing the structured relationships between the input and output symbols into multiple ‘micro-rules’. Learning word vectors turned out to also work very well when the word sequences come from a large corpus of real text and the individual micro-rules are unreliable. When trained to predict the next word in a news story, for example, the learned word vectors for Tuesday and Wednesday are very similar, as are the word vectors for Sweden and Norway. Such representations are called distributed representations because their elements (the features) are not mutually exclusive and their many configurations correspond to the variations seen in the observed data. These word vectors are composed of learned features that were not determined ahead of time by experts, but automatically discovered by the neural network. Vector representations of words learned from text are now very widely used in natural language applications.
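
A small sketch of the first layer just described: a one-of-N vector multiplied by a matrix of learned word vectors simply selects the row for that word. The vocabulary and dimensions here are invented for illustration:

```python
import numpy as np

vocab = ["tuesday", "wednesday", "sweden", "norway"]
N, d = len(vocab), 3
E = np.random.default_rng(0).normal(size=(N, d))  # learned word vectors, one row per word

def one_hot(word):
    v = np.zeros(N)
    v[vocab.index(word)] = 1.0                    # one component 1, the rest 0
    return v

word_vec = one_hot("tuesday") @ E                 # equals the row E[0]
```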

The issue of representation lies at the heart of the debate between the logic-inspired and the neural-network-inspired paradigms for cognition. In the logic-inspired paradigm, an instance of a symbol is something for which the only property is that it is either identical or non-identical to other symbol instances. It has no internal structure that is relevant to its use; and to reason with symbols, they must be bound to the variables in judiciously chosen rules of inference. By contrast, neural networks just use big activity vectors, big weight matrices and scalar non-linearities to perform the type of fast ‘intuitive’ inference that underpins effortless commonsense reasoning.

Before the introduction of neural language models, the standard approach to statistical modelling of language did not exploit distributed representations: it was based on counting frequencies of occurrences of short symbol sequences of length up to N (called N-grams). The number of possible N-grams is on the order of V^N, where V is the vocabulary size, so taking into account a context of more than a handful of words would require very large training corpora. N-grams treat each word as an atomic unit, so they cannot generalize across semantically related sequences of words, whereas neural language models can because they associate each word with a vector of real valued features, and semantically related words end up close to each other in that vector space (Fig. 4).
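
For contrast, a minimal count-based model with N = 2 (bigrams) on a toy corpus; note that it has no way to relate, say, Tuesday to Wednesday beyond their raw counts:

```python
from collections import Counter

corpus = "the cat sat on the mat the cat ate".split()
bigrams = Counter(zip(corpus, corpus[1:]))  # frequencies of length-2 symbol sequences
contexts = Counter(corpus[:-1])

def p_next(word, context):
    """P(word | context) by maximum likelihood over bigram counts."""
    return bigrams[(context, word)] / contexts[context]

print(p_next("cat", "the"))   # 2/3: 'the' is followed by 'cat' twice, 'mat' once
```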

Figure 4 | Visualizing the learned word vectors.

On the left is an illustration of word representations learned for modelling language, non-linearly projected to 2D for visualization using the t-SNE algorithm. On the right is a 2D representation of phrases learned by an English-to-French encoder–decoder recurrent neural network. One can observe that semantically similar words or sequences of words are mapped to nearby representations. The distributed representations of words are obtained by using backpropagation to jointly learn a representation for each word and a function that predicts a target quantity such as the next word in a sequence (for language modelling) or a whole sequence of translated words (for machine translation).

Recurrent neural networks

When backpropagation was first introduced, its most exciting use was for training recurrent neural networks (RNNs). For tasks that involve sequential inputs, such as speech and language, it is often better to use RNNs (Fig. 5). RNNs process an input sequence one element at a time, maintaining in their hidden units a ‘state vector’ that implicitly contains information about the history of all the past elements of the sequence. When we consider the outputs of the hidden units at different discrete time steps as if they were the outputs of different neurons in a deep multilayer network (Fig. 5, right), it becomes clear how we can apply backpropagation to train RNNs.
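
A sketch of that unrolled forward pass, with the same parameters U, V, W (as labelled in Fig. 5) reused at every time step; dimensions and inputs are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_hid, d_out = 4, 8, 3
U = rng.normal(scale=0.1, size=(d_hid, d_in))    # input -> state
W = rng.normal(scale=0.1, size=(d_hid, d_hid))   # state -> state (the recurrence)
V = rng.normal(scale=0.1, size=(d_out, d_hid))   # state -> output

xs = [rng.normal(size=d_in) for _ in range(5)]   # an input sequence x_1 .. x_5
s = np.zeros(d_hid)                              # the 'state vector'
for x_t in xs:
    s = np.tanh(U @ x_t + W @ s)                 # new state from input and history
    o_t = V @ s                                  # output depends on all past inputs
```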

Figure 5 | A recurrent neural network and the unfolding in time of the computation involved in its forward computation.

The artificial neurons (for example, hidden units grouped under node s with values s_t at time t) get inputs from other neurons at previous time steps (this is represented with the black square, representing a delay of one time step, on the left). In this way, a recurrent neural network can map an input sequence with elements x_t into an output sequence with elements o_t, with each o_t depending on all the previous x_t′ (for t′ ≤ t). The same parameters (matrices U, V, W) are used at each time step. Many other architectures are possible, including a variant in which the network can generate a sequence of outputs (for example, words), each of which is used as inputs for the next time step. The backpropagation algorithm (Fig. 1) can be directly applied to the computational graph of the unfolded network on the right, to compute the derivative of a total error (for example, the log-probability of generating the right sequence of outputs) with respect to all the states s_t and all the parameters.

RNNs are very powerful dynamic systems, but training them has proved to be problematic because the backpropagated gradients either grow or shrink at each time step, so over many time steps they typically explode or vanish.

Thanks to advances in their architecture and ways of training them, RNNs have been found to be very good at predicting the next character in the text or the next word in a sequence, but they can also be used for more complex tasks. For example, after reading an English sentence one word at a time, an English ‘encoder’ network can be trained so that the final state vector of its hidden units is a good representation of the thought expressed by the sentence. This thought vector can then be used as the initial hidden state of (or as extra input to) a jointly trained French ‘decoder’ network, which outputs a probability distribution for the first word of the French translation. If a particular first word is chosen from this distribution and provided as input to the decoder network it will then output a probability distribution for the second word of the translation and so on until a full stop is chosen. Overall, this process generates sequences of French words according to a probability distribution that depends on the English sentence. This rather naive way of performing machine translation has quickly become competitive with the state-of-the-art, and this raises serious doubts about whether understanding a sentence requires anything like the internal symbolic expressions that are manipulated by using inference rules. It is more compatible with the view that everyday reasoning involves many simultaneous analogies that each contribute plausibility to a conclusion.
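
The decoding loop described here can be sketched as follows. The ‘networks’ are random stand-ins (the vocabulary, matrices and dimensions are all invented) whose only purpose is to make the control flow concrete: emit a distribution, pick a word, feed it back, stop at the full stop:

```python
import numpy as np

rng = np.random.default_rng(1)
vocab = ["le", "chat", "est", "noir", "."]
d = 8
thought = rng.normal(size=d)                 # encoder's final hidden state vector

Wh = rng.normal(scale=0.5, size=(d, d))
Wx = rng.normal(scale=0.5, size=(d, len(vocab)))
Wo = rng.normal(scale=0.5, size=(len(vocab), d))

def decoder_step(h, prev_one_hot):
    h = np.tanh(Wh @ h + Wx @ prev_one_hot)  # update decoder state
    p = np.exp(Wo @ h); p /= p.sum()         # softmax over the vocabulary
    return h, p

h, prev, out = thought, np.zeros(len(vocab)), []
while True:
    h, p = decoder_step(h, prev)
    w = int(np.argmax(p))                    # greedily choose the next word
    out.append(vocab[w])
    if vocab[w] == "." or len(out) > 10:     # stop once a full stop is chosen
        break
    prev = np.zeros(len(vocab)); prev[w] = 1.0
```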

Instead of translating the meaning of a French sentence into an English sentence, one can learn to ‘translate’ the meaning of an image into an English sentence (Fig. 3). The encoder here is a deep ConvNet that converts the pixels into an activity vector in its last hidden layer. The decoder is an RNN similar to the ones used for machine translation and neural language modelling. There has been a surge of interest in such systems recently.

RNNs, once unfolded in time (Fig. 5), can be seen as very deep feedforward networks in which all the layers share the same weights. Although their main purpose is to learn long-term dependencies, theoretical and empirical evidence shows that it is difficult to learn to store information for very long.

To correct for that, one idea is to augment the network with an explicit memory. The first proposal of this kind is the long short-term memory (LSTM) networks that use special hidden units, the natural behaviour of which is to remember inputs for a long time. A special unit called the memory cell acts like an accumulator or a gated leaky neuron: it has a connection to itself at the next time step that has a weight of one, so it copies its own real-valued state and accumulates the external signal, but this self-connection is multiplicatively gated by another unit that learns to decide when to clear the content of the memory.
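
Reduced to just that gating logic, the memory cell might be sketched like this; real LSTM cells add input and output gates and learned weight matrices, and the scalar signals here are stand-ins:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def memory_cell_step(c, signal, forget_logit):
    """Self-connection of weight one (the cell copies its own state) plus
    accumulation of the incoming signal, multiplicatively gated by a unit
    that decides when to clear the memory."""
    gate = sigmoid(forget_logit)   # near 1: keep the content; near 0: clear it
    return gate * c + signal

c = 0.0
for signal, keep_logit in [(1.0, 5.0), (0.5, 5.0), (0.0, -5.0)]:
    c = memory_cell_step(c, signal, keep_logit)
# the first two steps accumulate ~1.5; the third gates the content away
```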

LSTM networks have subsequently proved to be more effective than conventional RNNs, especially when they have several layers for each time step [87], enabling an entire speech recognition system that goes all the way from acoustics to the sequence of characters in the transcription. LSTM networks or related forms of gated units are also currently used for the encoder and decoder networks that perform so well at machine translation.

Over the past year, several authors have made different proposals to augment RNNs with a memory module. Proposals include the Neural Turing Machine in which the network is augmented by a ‘tape-like’ memory that the RNN can choose to read from or write to, and memory networks, in which a regular network is augmented by a kind of associative memory. Memory networks have yielded excellent performance on standard question-answering benchmarks. The memory is used to remember the story about which the network is later asked to answer questions.

Beyond simple memorization, neural Turing machines and memory networks are being used for tasks that would normally require reasoning and symbol manipulation. Neural Turing machines can be taught ‘algorithms’. Among other things, they can learn to output a sorted list of symbols when their input consists of an unsorted sequence in which each symbol is accompanied by a real value that indicates its priority in the list. Memory networks can be trained to keep track of the state of the world in a setting similar to a text adventure game and after reading a story, they can answer questions that require complex inference. In one test example, the network is shown a 15-sentence version of The Lord of the Rings and correctly answers questions such as “where is Frodo now?”.

The future of deep learning

Unsupervised learning had a catalytic effect in reviving interest in deep learning, but has since been overshadowed by the successes of purely supervised learning. Although we have not focused on it in this Review, we expect unsupervised learning to become far more important in the longer term. Human and animal learning is largely unsupervised: we discover the structure of the world by observing it, not by being told the name of every object.

Human vision is an active process that sequentially samples the optic array in an intelligent, task-specific way using a small, high-resolution fovea with a large, low-resolution surround. We expect much of the future progress in vision to come from systems that are trained end-to-end and combine ConvNets with RNNs that use reinforcement learning to decide where to look. Systems combining deep learning and reinforcement learning are in their infancy, but they already outperform passive vision systems at classification tasks and produce impressive results in learning to play many different video games.

Natural language understanding is another area in which deep learning is poised to make a large impact over the next few years. We expect systems that use RNNs to understand sentences or whole documents will become much better when they learn strategies for selectively attending to one part at a time.

Ultimately, major progress in artificial intelligence will come about through systems that combine representation learning with complex reasoning. Although deep learning and simple reasoning have been used for speech and handwriting recognition for a long time, new paradigms are needed to replace rule-based manipulation of symbolic expressions by operations on large vectors.
