Deep Learning — Yann LeCun, Yoshua Bengio and Geoffrey Hinton

Original article: Deep Learning

Since the author is rather inexperienced, roughly 70% of this post is machine translation. Apologies in advance.

The first paper is the survey of deep learning by the three giants, Yann LeCun, Yoshua Bengio and Geoffrey Hinton.

Paper download: Deep learning (Three Giants' Survey)

Deep Learning

Abstract


Deep learning allows computational models that are composed of multiple processing layers to learn representations of data with multiple levels of abstraction. These methods have dramatically improved the state-of-the-art in speech recognition, visual object recognition, object detection and many other domains such as drug discovery and genomics. Deep learning discovers intricate structure in large data sets by using the backpropagation algorithm to indicate how a machine should change its internal parameters that are used to compute the representation in each layer from the representation in the previous layer. Deep convolutional nets have brought about breakthroughs in processing images, video, speech and audio, whereas recurrent nets have shone light on sequential data such as text and speech.


Review


Machine-learning technology powers many aspects of modern society: from web searches to content filtering on social networks to recommendations on e-commerce websites, and it is increasingly present in consumer products such as cameras and smartphones. Machine-learning systems are used to identify objects in images, transcribe speech into text, match news items, posts or products with users' interests, and select relevant results of search. Increasingly, these applications make use of a class of techniques called deep learning.



Conventional machine-learning techniques were limited in their ability to process natural data in their raw form. For decades, constructing a pattern-recognition or machine-learning system required careful engineering and considerable domain expertise to design a feature extractor that transformed the raw data (such as the pixel values of an image) into a suitable internal representation or feature vector from which the learning subsystem, often a classifier, could detect or classify patterns in the input.



Representation learning is a set of methods that allows a machine to be fed with raw data and to automatically discover the representations needed for detection or classification. Deep-learning methods are representation-learning methods with multiple levels of representation, obtained by composing simple but non-linear modules that each transform the representation at one level (starting with the raw input) into a representation at a higher, slightly more abstract level. With the composition of enough such transformations, very complex functions can be learned. For classification tasks, higher layers of representation amplify aspects of the input that are important for discrimination and suppress irrelevant variations. An image, for example, comes in the form of an array of pixel values, and the learned features in the first layer of representation typically represent the presence or absence of edges at particular orientations and locations in the image. The second layer typically detects motifs by spotting particular arrangements of edges, regardless of small variations in the edge positions. The third layer may assemble motifs into larger combinations that correspond to parts of familiar objects, and subsequent layers would detect objects as combinations of these parts. The key aspect of deep learning is that these layers of features are not designed by human engineers: they are learned from data using a general-purpose learning procedure.



Deep learning is making major advances in solving problems that have resisted the best attempts of the artificial intelligence community for many years. It has turned out to be very good at discovering intricate structures in high-dimensional data and is therefore applicable to many domains of science, business and government. In addition to beating records in image recognition and speech recognition, it has beaten other machine-learning techniques at predicting the activity of potential drug molecules, analysing particle accelerator data, reconstructing brain circuits, and predicting the effects of mutations in non-coding DNA on gene expression and disease. Perhaps more surprisingly, deep learning has produced extremely promising results for various tasks in natural language understanding, particularly topic classification, sentiment analysis, question answering and language translation.



We think that deep learning will have many more successes in the near future because it requires very little engineering by hand, so it can easily take advantage of increases in the amount of available computation and data. New learning algorithms and architectures that are currently being developed for deep neural networks will only accelerate this progress.


Supervised learning


The most common form of machine learning, deep or not, is supervised learning. Imagine that we want to build a system that can classify images as containing, say, a house, a car, a person or a pet. We first collect a large data set of images of houses, cars, people and pets, each labelled with its category. During training, the machine is shown an image and produces an output in the form of a vector of scores, one for each category. We want the desired category to have the highest score of all categories, but this is unlikely to happen before training. We compute an objective function that measures the error (or distance) between the output scores and the desired pattern of scores. The machine then modifies its internal adjustable parameters to reduce this error. These adjustable parameters, often called weights, are real numbers that can be seen as ‘knobs’ that define the input–output function of the machine. In a typical deep-learning system, there may be hundreds of millions of these adjustable weights, and hundreds of millions of labelled examples with which to train the machine.



To properly adjust the weight vector, the learning algorithm computes a gradient vector that, for each weight, indicates by what amount the error would increase or decrease if the weight were increased by a tiny amount. The weight vector is then adjusted in the opposite direction to the gradient vector.

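To make this update rule concrete, here is a minimal numpy sketch (not from the paper; the toy objective, the names `error` and `w`, and the learning rate are all invented for illustration). It estimates, for each weight, by how much the error changes when that weight is increased by a tiny amount, then moves the weights opposite to the gradient:

```python
import numpy as np

# Hypothetical squared-error objective for a tiny linear model.
x = np.array([1.0, 2.0])          # one training input
t = 1.0                            # its desired output
w = np.array([0.5, -0.3])          # adjustable weights ("knobs")

def error(w):
    return 0.5 * (w @ x - t) ** 2  # objective for this single example

# For each weight, estimate by what amount the error changes when that
# weight is increased by a tiny amount: a finite-difference gradient.
eps = 1e-6
grad = np.array([
    (error(w + eps * np.eye(2)[i]) - error(w)) / eps for i in range(2)
])

w -= 0.1 * grad  # adjust the weight vector opposite to the gradient
```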


The objective function, averaged over all the training examples, can be seen as a kind of hilly landscape in the high-dimensional space of weight values. The negative gradient vector indicates the direction of steepest descent in this landscape, taking it closer to a minimum, where the output error is low on average.



In practice, most practitioners use a procedure called stochastic gradient descent (SGD). This consists of showing the input vector for a few examples, computing the outputs and the errors, computing the average gradient for those examples, and adjusting the weights accordingly. The process is repeated for many small sets of examples from the training set until the average of the objective function stops decreasing. It is called stochastic because each small set of examples gives a noisy estimate of the average gradient over all examples. This simple procedure usually finds a good set of weights surprisingly quickly when compared with far more elaborate optimization techniques. After training, the performance of the system is measured on a different set of examples called a test set. This serves to test the generalization ability of the machine — its ability to produce sensible answers on new inputs that it has never seen during training.

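A minimal sketch of the SGD loop described above, on a toy linear model (the data, batch size, learning rate and epoch count are illustrative assumptions, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy linearly-generated data; all names here are illustrative.
X = rng.normal(size=(1000, 5))             # 1,000 examples, 5 features
w_true = rng.normal(size=5)
y = X @ w_true + 0.1 * rng.normal(size=1000)

w = np.zeros(5)                            # weights to be learned
lr, batch = 0.05, 32

for epoch in range(20):
    idx = rng.permutation(len(X))
    for start in range(0, len(X), batch):
        b = idx[start:start + batch]       # a few examples
        err = X[b] @ w - y[b]              # outputs and errors
        grad = X[b].T @ err / len(b)       # average gradient for the batch
        w -= lr * grad                     # adjust weights accordingly
```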


Many of the current practical applications of machine learning use linear classifiers on top of hand-engineered features. A two-class linear classifier computes a weighted sum of the feature vector components. If the weighted sum is above a threshold, the input is classified as belonging to a particular category.

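For concreteness, such a two-class linear classifier — a weighted sum of feature components compared against a threshold — might look like this (feature values, weights and threshold are invented for the example):

```python
import numpy as np

def linear_classify(features, weights, threshold):
    """Two-class linear classifier: weighted sum of the feature vector
    components, compared against a threshold."""
    return (features @ weights) > threshold

# e.g. a hand-engineered 3-component feature vector
print(linear_classify(np.array([0.2, 1.5, -0.3]),
                      np.array([1.0, 0.8, 0.5]), threshold=1.0))  # True
```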


Since the 1960s we have known that linear classifiers can only carve their input space into very simple regions, namely half-spaces separated by a hyperplane. But problems such as image and speech recognition require the input–output function to be insensitive to irrelevant variations of the input, such as variations in position, orientation or illumination of an object, or variations in the pitch or accent of speech, while being very sensitive to particular minute variations (for example, the difference between a white wolf and a breed of wolf-like white dog called a Samoyed). At the pixel level, images of two Samoyeds in different poses and in different environments may be very different from each other, whereas two images of a Samoyed and a wolf in the same position and on similar backgrounds may be very similar to each other. A linear classifier, or any other ‘shallow’ classifier operating on raw pixels could not possibly distinguish the latter two, while putting the former two in the same category. This is why shallow classifiers require a good feature extractor that solves the selectivity–invariance dilemma — one that produces representations that are selective to the aspects of the image that are important for discrimination, but that are invariant to irrelevant aspects such as the pose of the animal. To make classifiers more powerful, one can use generic non-linear features, as with kernel methods, but generic features such as those arising with the Gaussian kernel do not allow the learner to generalize well far from the training examples. The conventional option is to hand design good feature extractors, which requires a considerable amount of engineering skill and domain expertise. But this can all be avoided if good features can be learned automatically using a general-purpose learning procedure. This is the key advantage of deep learning.





A deep-learning architecture is a multilayer stack of simple modules, all (or most) of which are subject to learning, and many of which compute non-linear input–output mappings. Each module in the stack transforms its input to increase both the selectivity and the invariance of the representation. With multiple non-linear layers, say a depth of 5 to 20, a system can implement extremely intricate functions of its inputs that are simultaneously sensitive to minute details — distinguishing Samoyeds from white wolves — and insensitive to large irrelevant variations such as the background, pose, lighting and surrounding objects.


Backpropagation to train multilayer architectures


From the earliest days of pattern recognition, the aim of researchers has been to replace hand-engineered features with trainable multilayer networks, but despite its simplicity, the solution was not widely understood until the mid 1980s. As it turns out, multilayer architectures can be trained by simple stochastic gradient descent. As long as the modules are relatively smooth functions of their inputs and of their internal weights, one can compute gradients using the backpropagation procedure. The idea that this could be done, and that it worked, was discovered independently by several different groups during the 1970s and 1980s.



The backpropagation procedure to compute the gradient of an objective function with respect to the weights of a multilayer stack of modules is nothing more than a practical application of the chain rule for derivatives. The key insight is that the derivative (or gradient) of the objective with respect to the input of a module can be computed by working backwards from the gradient with respect to the output of that module (or the input of the subsequent module) (Fig. 1). The backpropagation equation can be applied repeatedly to propagate gradients through all modules, starting from the output at the top (where the network produces its prediction) all the way to the bottom (where the external input is fed). Once these gradients have been computed, it is straightforward to compute the gradients with respect to the weights of each module.

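The chain-rule bookkeeping can be made concrete with a small hand-written sketch: a two-module stack (a ReLU layer under a linear output layer) where gradients are computed backwards from the output error. This is illustrative only, not the paper's code; the shapes and the squared-error objective are assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=4)                 # external input (bottom)
t = np.array([0.0, 1.0])               # desired output
W1, W2 = rng.normal(size=(8, 4)), rng.normal(size=(2, 8))

# Forward pass through a two-module stack.
z1 = W1 @ x
h1 = np.maximum(z1, 0.0)               # ReLU module
y = W2 @ h1                            # output at the top, where the
E = 0.5 * np.sum((y - t) ** 2)         # network produces its prediction

# Backward pass: the chain rule applied repeatedly, top to bottom.
dE_dy = y - t                          # gradient w.r.t. the top module's output
dE_dW2 = np.outer(dE_dy, h1)           # gradient w.r.t. that module's weights
dE_dh1 = W2.T @ dE_dy                  # gradient w.r.t. the top module's input
dE_dz1 = dE_dh1 * (z1 > 0)             # back through the ReLU
dE_dW1 = np.outer(dE_dz1, x)           # gradient w.r.t. the bottom weights
```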


Many applications of deep learning use feedforward neural network architectures (Fig. 1), which learn to map a fixed-size input (for example, an image) to a fixed-size output (for example, a probability for each of several categories). To go from one layer to the next, a set of units compute a weighted sum of their inputs from the previous layer and pass the result through a non-linear function. At present, the most popular non-linear function is the rectified linear unit (ReLU), which is simply the half-wave rectifier f(z) = max(z, 0). In past decades, neural nets used smoother non-linearities, such as tanh(z) or 1/(1 + exp(−z)), but the ReLU typically learns much faster in networks with many layers, allowing training of a deep supervised network without unsupervised pre-training. Units that are not in the input or output layer are conventionally called hidden units. The hidden layers can be seen as distorting the input in a non-linear way so that categories become linearly separable by the last layer (Fig. 1).

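A sketch of such a forward pass, assuming a made-up three-layer network and a softmax at the top to turn scores into per-category probabilities (the softmax is a common choice, not something this paragraph prescribes):

```python
import numpy as np

def relu(z):          # the half-wave rectifier f(z) = max(z, 0)
    return np.maximum(z, 0.0)

def softmax(z):       # turn final scores into per-category probabilities
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(0)
x = rng.normal(size=16)                          # fixed-size input
layers = [rng.normal(size=(32, 16)),             # input -> hidden
          rng.normal(size=(32, 32)),             # hidden -> hidden
          rng.normal(size=(10, 32))]             # hidden -> 10 categories

h = x
for W in layers[:-1]:
    h = relu(W @ h)           # weighted sum from the previous layer, then ReLU
p = softmax(layers[-1] @ h)   # fixed-size output: a probability per category
```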


In the late 1990s, neural nets and backpropagation were largely forsaken by the machine-learning community and ignored by the computer-vision and speech-recognition communities. It was widely thought that learning useful, multistage, feature extractors with little prior knowledge was infeasible. In particular, it was commonly thought that simple gradient descent would get trapped in poor local minima — weight configurations for which no small change would reduce the average error.



In practice, poor local minima are rarely a problem with large networks. Regardless of the initial conditions, the system nearly always reaches solutions of very similar quality. Recent theoretical and empirical results strongly suggest that local minima are not a serious issue in general. Instead, the landscape is packed with a combinatorially large number of saddle points where the gradient is zero, and the surface curves up in most dimensions and curves down in the remainder. The analysis seems to show that saddle points with only a few downward curving directions are present in very large numbers, but almost all of them have very similar values of the objective function. Hence, it does not much matter which of these saddle points the algorithm gets stuck at.



Interest in deep feedforward networks was revived around 2006 by a group of researchers brought together by the Canadian Institute for Advanced Research (CIFAR). The researchers introduced unsupervised learning procedures that could create layers of feature detectors without requiring labelled data. The objective in learning each layer of feature detectors was to be able to reconstruct or model the activities of feature detectors (or raw inputs) in the layer below. By ‘pre-training’ several layers of progressively more complex feature detectors using this reconstruction objective, the weights of a deep network could be initialized to sensible values. A final layer of output units could then be added to the top of the network and the whole deep system could be fine-tuned using standard backpropagation. This worked remarkably well for recognizing handwritten digits or for detecting pedestrians, especially when the amount of labelled data was very limited.



The first major application of this pre-training approach was in speech recognition, and it was made possible by the advent of fast graphics processing units (GPUs) that were convenient to program and allowed researchers to train networks 10 or 20 times faster. In 2009, the approach was used to map short temporal windows of coefficients extracted from a sound wave to a set of probabilities for the various fragments of speech that might be represented by the frame in the centre of the window. It achieved record-breaking results on a standard speech recognition benchmark that used a small vocabulary and was quickly developed to give record-breaking results on a large vocabulary task. By 2012, versions of the deep net from 2009 were being developed by many of the major speech groups and were already being deployed in Android phones. For smaller data sets, unsupervised pre-training helps to prevent overfitting, leading to significantly better generalization when the number of labelled examples is small, or in a transfer setting where we have lots of examples for some ‘source’ tasks but very few for some ‘target’ tasks. Once deep learning had been rehabilitated, it turned out that the pre-training stage was only needed for small data sets.



There was, however, one particular type of deep, feedforward network that was much easier to train and generalized much better than networks with full connectivity between adjacent layers. This was the convolutional neural network (ConvNet). It achieved many practical successes during the period when neural networks were out of favour and it has recently been widely adopted by the computer-vision community.


Convolutional neural networks


ConvNets are designed to process data that come in the form of multiple arrays, for example a colour image composed of three 2D arrays containing pixel intensities in the three colour channels. Many data modalities are in the form of multiple arrays: 1D for signals and sequences, including language; 2D for images or audio spectrograms; and 3D for video or volumetric images. There are four key ideas behind ConvNets that take advantage of the properties of natural signals: local connections, shared weights, pooling and the use of many layers.



The architecture of a typical ConvNet (Fig. 2) is structured as a series of stages. The first few stages are composed of two types of layers: convolutional layers and pooling layers. Units in a convolutional layer are organized in feature maps, within which each unit is connected to local patches in the feature maps of the previous layer through a set of weights called a filter bank. The result of this local weighted sum is then passed through a non-linearity such as a ReLU. All units in a feature map share the same filter bank. Different feature maps in a layer use different filter banks. The reason for this architecture is twofold. First, in array data such as images, local groups of values are often highly correlated, forming distinctive local motifs that are easily detected. Second, the local statistics of images and other signals are invariant to location. In other words, if a motif can appear in one part of the image, it could appear anywhere, hence the idea of units at different locations sharing the same weights and detecting the same pattern in different parts of the array. Mathematically, the filtering operation performed by a feature map is a discrete convolution, hence the name.

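The "local patch, shared filter bank" computation can be written directly as a discrete convolution over one 2D input map. A minimal sketch (one input map, one filter; note that, as in most deep-learning conventions, the kernel is applied unflipped, i.e. as cross-correlation, and the input image and toy edge filter below are made up):

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Discrete 2D convolution ('valid' mode) producing one feature map:
    every output unit applies the same kernel weights to a local patch,
    which is what weight sharing means here."""
    H, W = image.shape
    kh, kw = kernel.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            patch = image[i:i + kh, j:j + kw]      # local patch
            out[i, j] = np.sum(patch * kernel)     # local weighted sum
    return out

image = np.random.default_rng(0).normal(size=(8, 8))
edge_filter = np.array([[1.0, -1.0]])              # toy oriented-edge detector
feature_map = np.maximum(conv2d_valid(image, edge_filter), 0.0)  # then ReLU
```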


Although the role of the convolutional layer is to detect local conjunctions of features from the previous layer, the role of the pooling layer is to merge semantically similar features into one. Because the relative positions of the features forming a motif can vary somewhat, reliably detecting the motif can be done by coarse-graining the position of each feature. A typical pooling unit computes the maximum of a local patch of units in one feature map (or in a few feature maps). Neighbouring pooling units take input from patches that are shifted by more than one row or column, thereby reducing the dimension of the representation and creating an invariance to small shifts and distortions. Two or three stages of convolution, non-linearity and pooling are stacked, followed by more convolutional and fully-connected layers. Backpropagating gradients through a ConvNet is as simple as through a regular deep network, allowing all the weights in all the filter banks to be trained.

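A minimal max-pooling sketch matching this description, with an assumed patch size and stride of 2:

```python
import numpy as np

def max_pool(feature_map, size=2, stride=2):
    """Each pooling unit takes the maximum over a local patch;
    neighbouring units read patches shifted by `stride`, which
    coarse-grains positions and shrinks the representation."""
    H, W = feature_map.shape
    rows = (H - size) // stride + 1
    cols = (W - size) // stride + 1
    out = np.zeros((rows, cols))
    for i in range(rows):
        for j in range(cols):
            out[i, j] = feature_map[i * stride:i * stride + size,
                                    j * stride:j * stride + size].max()
    return out
```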


Deep neural networks exploit the property that many natural signals are compositional hierarchies, in which higher-level features are obtained by composing lower-level ones. In images, local combinations of edges form motifs, motifs assemble into parts, and parts form objects. Similar hierarchies exist in speech and text from sounds to phones, phonemes, syllables, words and sentences. The pooling allows representations to vary very little when elements in the previous layer vary in position and appearance.



The convolutional and pooling layers in ConvNets are directly inspired by the classic notions of simple cells and complex cells in visual neuroscience, and the overall architecture is reminiscent of the LGN–V1–V2–V4–IT hierarchy in the visual cortex ventral pathway. When ConvNet models and monkeys are shown the same picture, the activations of high-level units in the ConvNet explain half of the variance of random sets of 160 neurons in the monkey’s inferotemporal cortex. ConvNets have their roots in the neocognitron, the architecture of which was somewhat similar, but did not have an end-to-end supervised-learning algorithm such as backpropagation. A primitive 1D ConvNet called a time-delay neural net was used for the recognition of phonemes and simple words.



There have been numerous applications of convolutional networks going back to the early 1990s, starting with time-delay neural networks for speech recognition and document reading. The document reading system used a ConvNet trained jointly with a probabilistic model that implemented language constraints. By the late 1990s this system was reading over 10% of all the cheques in the United States. A number of ConvNet-based optical character recognition and handwriting recognition systems were later deployed by Microsoft. ConvNets were also experimented with in the early 1990s for object detection in natural images, including faces and hands, and for face recognition.


Image understanding with deep convolutional networks


Since the early 2000s, ConvNets have been applied with great success to the detection, segmentation and recognition of objects and regions in images. These were all tasks in which labelled data was relatively abundant, such as traffic sign recognition, the segmentation of biological images particularly for connectomics, and the detection of faces, text, pedestrians and human bodies in natural images. A major recent practical success of ConvNets is face recognition.



Importantly, images can be labelled at the pixel level, which will have applications in technology, including autonomous mobile robots and self-driving cars. Companies such as Mobileye and NVIDIA are using such ConvNet-based methods in their upcoming vision systems for cars. Other applications gaining importance involve natural language understanding and speech recognition.



Despite these successes, ConvNets were largely forsaken by the mainstream computer-vision and machine-learning communities until the ImageNet competition in 2012. When deep convolutional networks were applied to a data set of about a million images from the web that contained 1,000 different classes, they achieved spectacular results, almost halving the error rates of the best competing approaches. This success came from the efficient use of GPUs, ReLUs, a new regularization technique called dropout, and techniques to generate more training examples by deforming the existing ones. This success has brought about a revolution in computer vision; ConvNets are now the dominant approach for almost all recognition and detection tasks and approach human performance on some tasks. A recent stunning demonstration combines ConvNets and recurrent net modules for the generation of image captions (Fig. 3).



Recent ConvNet architectures have 10 to 20 layers of ReLUs, hundreds of millions of weights, and billions of connections between units. Whereas training such large networks could have taken weeks only two years ago, progress in hardware, software and algorithm parallelization has reduced training times to a few hours.



The performance of ConvNet-based vision systems has caused most major technology companies, including Google, Facebook, Microsoft, IBM, Yahoo!, Twitter and Adobe, as well as a quickly growing number of start-ups to initiate research and development projects and to deploy ConvNet-based image understanding products and services.



ConvNets are easily amenable to efficient hardware implementations in chips or field-programmable gate arrays. A number of companies such as NVIDIA, Mobileye, Intel, Qualcomm and Samsung are developing ConvNet chips to enable real-time vision applications in smartphones, cameras, robots and self-driving cars.


Distributed representations and language processing


Deep-learning theory shows that deep nets have two different exponential advantages over classic learning algorithms that do not use distributed representations. Both of these advantages arise from the power of composition and depend on the underlying data-generating distribution having an appropriate componential structure. First, learning distributed representations enables generalization to new combinations of the values of learned features beyond those seen during training (for example, 2^n combinations are possible with n binary features). Second, composing layers of representation in a deep net brings the potential for another exponential advantage (exponential in the depth).



The hidden layers of a multilayer neural network learn to represent the network’s inputs in a way that makes it easy to predict the target outputs. This is nicely demonstrated by training a multilayer neural network to predict the next word in a sequence from a local context of earlier words. Each word in the context is presented to the network as a one-of-N vector, that is, one component has a value of 1 and the rest are 0. In the first layer, each word creates a different pattern of activations, or word vectors (Fig. 4). In a language model, the other layers of the network learn to convert the input word vectors into an output word vector for the predicted next word, which can be used to predict the probability for any word in the vocabulary to appear as the next word. The network learns word vectors that contain many active components each of which can be interpreted as a separate feature of the word, as was first demonstrated in the context of learning distributed representations for symbols. These semantic features were not explicitly present in the input. They were discovered by the learning procedure as a good way of factorizing the structured relationships between the input and output symbols into multiple ‘micro-rules’. Learning word vectors turned out to also work very well when the word sequences come from a large corpus of real text and the individual micro-rules are unreliable. When trained to predict the next word in a news story, for example, the learned word vectors for Tuesday and Wednesday are very similar, as are the word vectors for Sweden and Norway. Such representations are called distributed representations because their elements (the features) are not mutually exclusive and their many configurations correspond to the variations seen in the observed data. These word vectors are composed of learned features that were not determined ahead of time by experts, but automatically discovered by the neural network. Vector representations of words learned from text are now very widely used in natural language applications.

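The one-of-N encoding and the first-layer word vectors can be shown in a few lines: multiplying a one-of-N vector by the first layer's weight matrix simply selects one row of that matrix, which is the word's vector (the vocabulary and dimensionality below are toy assumptions):

```python
import numpy as np

vocab = ["tuesday", "wednesday", "sweden", "norway"]
dim = 8                                            # word-vector size
rng = np.random.default_rng(0)
E = rng.normal(size=(len(vocab), dim))             # first-layer weights

def one_of_n(word):
    v = np.zeros(len(vocab))
    v[vocab.index(word)] = 1.0                     # one component is 1, rest 0
    return v

# Multiplying the one-of-N vector by the first layer's weight matrix
# just selects that word's row: the word vector.
assert np.allclose(one_of_n("sweden") @ E, E[vocab.index("sweden")])
```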


The issue of representation lies at the heart of the debate between the logic-inspired and the neural-network-inspired paradigms for cognition. In the logic-inspired paradigm, an instance of a symbol is something for which the only property is that it is either identical or non-identical to other symbol instances. It has no internal structure that is relevant to its use; and to reason with symbols, they must be bound to the variables in judiciously chosen rules of inference. By contrast, neural networks just use big activity vectors, big weight matrices and scalar non-linearities to perform the type of fast ‘intuitive’ inference that underpins effortless commonsense reasoning.



Before the introduction of neural language models, the standard approach to statistical modelling of language did not exploit distributed representations: it was based on counting frequencies of occurrences of short symbol sequences of length up to N (called N-grams). The number of possible N-grams is on the order of V^N, where V is the vocabulary size, so taking into account a context of more than a handful of words would require very large training corpora. N-grams treat each word as an atomic unit, so they cannot generalize across semantically related sequences of words, whereas neural language models can because they associate each word with a vector of real-valued features, and semantically related words end up close to each other in that vector space (Fig. 4).

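A tiny sketch of the counting approach for N = 2 (bigrams), on a made-up corpus:

```python
from collections import Counter, defaultdict

corpus = "the cat sat on the mat the cat ran".split()

# Count bigrams (N = 2): frequencies of length-2 symbol sequences.
counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    counts[prev][nxt] += 1

def p_next(prev, nxt):
    """Maximum-likelihood bigram estimate of P(next | prev)."""
    total = sum(counts[prev].values())
    return counts[prev][nxt] / total if total else 0.0

print(p_next("the", "cat"))   # 2/3: 'the' is followed by 'cat' twice, 'mat' once
```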

Recurrent neural networks


When backpropagation was first introduced, its most exciting use was for training recurrent neural networks (RNNs). For tasks that involve sequential inputs, such as speech and language, it is often better to use RNNs (Fig. 5). RNNs process an input sequence one element at a time, maintaining in their hidden units a ‘state vector’ that implicitly contains information about the history of all the past elements of the sequence. When we consider the outputs of the hidden units at different discrete time steps as if they were the outputs of different neurons in a deep multilayer network (Fig. 5, right), it becomes clear how we can apply backpropagation to train RNNs.

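A sketch of the state-vector recurrence: the hidden state is updated from the current input element and the previous state. The sizes and the tanh non-linearity are conventional assumptions for a vanilla RNN, not mandated by the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
dim_in, dim_h = 4, 16
U = rng.normal(size=(dim_h, dim_in), scale=0.1)   # input -> hidden
W = rng.normal(size=(dim_h, dim_h), scale=0.1)    # hidden -> hidden (recurrent)
b = np.zeros(dim_h)

sequence = rng.normal(size=(10, dim_in))          # 10 time steps
h = np.zeros(dim_h)                               # the 'state vector'

for x_t in sequence:                              # one element at a time
    h = np.tanh(U @ x_t + W @ h + b)              # state implicitly summarizes
                                                  # the history of past elements
```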


RNNs are very powerful dynamic systems, but training them has proved to be problematic because the backpropagated gradients either grow or shrink at each time step, so over many time steps they typically explode or vanish.



Thanks to advances in their architecture and ways of training them, RNNs have been found to be very good at predicting the next character in the text or the next word in a sequence, but they can also be used for more complex tasks. For example, after reading an English sentence one word at a time, an English ‘encoder’ network can be trained so that the final state vector of its hidden units is a good representation of the thought expressed by the sentence. This thought vector can then be used as the initial hidden state of (or as extra input to) a jointly trained French ‘decoder’ network, which outputs a probability distribution for the first word of the French translation. If a particular first word is chosen from this distribution and provided as input to the decoder network it will then output a probability distribution for the second word of the translation and so on until a full stop is chosen. Overall, this process generates sequences of French words according to a probability distribution that depends on the English sentence. This rather naive way of performing machine translation has quickly become competitive with the state-of-the-art, and this raises serious doubts about whether understanding a sentence requires anything like the internal symbolic expressions that are manipulated by using inference rules. It is more compatible with the view that everyday reasoning involves many simultaneous analogies that each contribute plausibility to a conclusion.

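The encode-then-decode loop can be sketched end to end. Everything below is illustrative and untrained — a single shared toy vocabulary for both languages, random weights, greedy word choice, and reusing the stop symbol as the decoder's start input — just to show the control flow: the encoder's final state seeds the decoder, and each chosen word is fed back until the full stop is picked:

```python
import numpy as np

rng = np.random.default_rng(0)
dh, V = 16, 5                       # hidden size; toy vocabulary size
words = ["<stop>", "le", "chat", "est", "noir"]

# Illustrative, untrained parameters standing in for jointly trained
# encoder/decoder networks.
W_enc = rng.normal(size=(dh, dh + V), scale=0.1)
W_dec = rng.normal(size=(dh, dh + V), scale=0.1)
W_out = rng.normal(size=(V, dh), scale=0.1)

def one_hot(i):
    v = np.zeros(V); v[i] = 1.0; return v

def softmax(z):
    e = np.exp(z - z.max()); return e / e.sum()

# Encoder: read the source sentence one word at a time (toy indices here);
# its final hidden state is the 'thought vector'.
h = np.zeros(dh)
for idx in [1, 3, 2]:
    h = np.tanh(W_enc @ np.concatenate([h, one_hot(idx)]))

# Decoder: start from the thought vector; pick a word, feed it back,
# and repeat until the full stop is chosen.
word, out = 0, []
for _ in range(10):
    h = np.tanh(W_dec @ np.concatenate([h, one_hot(word)]))
    word = int(np.argmax(softmax(W_out @ h)))   # distribution over next word
    if words[word] == "<stop>":
        break
    out.append(words[word])
```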


Instead of translating the meaning of a French sentence into an English sentence, one can learn to ‘translate’ the meaning of an image into an English sentence (Fig. 3). The encoder here is a deep ConvNet that converts the pixels into an activity vector in its last hidden layer. The decoder is an RNN similar to the ones used for machine translation and neural language modelling. There has been a surge of interest in such systems recently (see examples mentioned in ref. 86).



RNNs, once unfolded in time (Fig. 5), can be seen as very deep feedforward networks in which all the layers share the same weights. Although their main purpose is to learn long-term dependencies, theoretical and empirical evidence shows that it is difficult to learn to store information for very long.



To correct for that, one idea is to augment the network with an explicit memory. The first proposal of this kind is the long short-term memory (LSTM) networks that use special hidden units, the natural behaviour of which is to remember inputs for a long time. A special unit called the memory cell acts like an accumulator or a gated leaky neuron: it has a connection to itself at the next time step that has a weight of one, so it copies its own real-valued state and accumulates the external signal, but this self-connection is multiplicatively gated by another unit that learns to decide when to clear the content of the memory.

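A sketch of one step of such a memory cell. The exact gate equations below follow the common LSTM variant, an assumption beyond what the paragraph states: the cell state carries over with a self-connection of weight one, multiplicatively gated by a forget gate, while accumulating the gated external signal:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h, c, params):
    """One step of an LSTM memory cell. The cell c has a self-connection
    of weight one (it carries over additively), multiplicatively gated by
    the forget gate f, which learns when to clear the memory content."""
    Wf, Wi, Wg, Wo = params               # one weight matrix per gate
    z = np.concatenate([x, h])
    f = sigmoid(Wf @ z)                   # keep or clear the memory
    i = sigmoid(Wi @ z)                   # gate the incoming signal
    g = np.tanh(Wg @ z)                   # candidate external signal
    o = sigmoid(Wo @ z)                   # expose (part of) the state
    c = f * c + i * g                     # accumulator behaviour
    h = o * np.tanh(c)
    return h, c

rng = np.random.default_rng(0)
dx, dh = 4, 8
params = [rng.normal(size=(dh, dx + dh), scale=0.1) for _ in range(4)]
h, c = np.zeros(dh), np.zeros(dh)
h, c = lstm_step(rng.normal(size=dx), h, c, params)
```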


LSTM networks have subsequently proved to be more effective than conventional RNNs, especially when they have several layers for each time step, enabling an entire speech recognition system that goes all the way from acoustics to the sequence of characters in the transcription. LSTM networks or related forms of gated units are also currently used for the encoder and decoder networks that perform so well at machine translation.



Over the past year, several authors have made different proposals to augment RNNs with a memory module. Proposals include the Neural Turing Machine in which the network is augmented by a ‘tape-like’ memory that the RNN can choose to read from or write to, and memory networks, in which a regular network is augmented by a kind of associative memory. Memory networks have yielded excellent performance on standard question-answering benchmarks. The memory is used to remember the story about which the network is later asked to answer questions.



Beyond simple memorization, neural Turing machines and memory networks are being used for tasks that would normally require reasoning and symbol manipulation. Neural Turing machines can be taught ‘algorithms’. Among other things, they can learn to output a sorted list of symbols when their input consists of an unsorted sequence in which each symbol is accompanied by a real value that indicates its priority in the list. Memory networks can be trained to keep track of the state of the world in a setting similar to a text adventure game, and after reading a story, they can answer questions that require complex inference. In one test example, the network is shown a 15-sentence version of The Lord of the Rings and correctly answers questions such as “where is Frodo now?”.


The future of deep learning


Unsupervised learning had a catalytic effect in reviving interest in deep learning, but has since been overshadowed by the successes of purely supervised learning. Although we have not focused on it in this Review, we expect unsupervised learning to become far more important in the longer term. Human and animal learning is largely unsupervised: we discover the structure of the world by observing it, not by being told the name of every object.



Human vision is an active process that sequentially samples the optic array in an intelligent, task-specific way using a small, high-resolution fovea with a large, low-resolution surround. We expect much of the future progress in vision to come from systems that are trained end-to-end and combine ConvNets with RNNs that use reinforcement learning to decide where to look. Systems combining deep learning and reinforcement learning are in their infancy, but they already outperform passive vision systems at classification tasks and produce impressive results in learning to play many different video games.



Natural language understanding is another area in which deep learning is poised to make a large impact over the next few years. We expect systems that use RNNs to understand sentences or whole documents will become much better when they learn strategies for selectively attending to one part at a time.



Ultimately, major progress in artificial intelligence will come about through systems that combine representation learning with complex reasoning. Although deep learning and simple reasoning have been used for speech and handwriting recognition for a long time, new paradigms are needed to replace rule-based manipulation of symbolic expressions by operations on large vectors.

