Deep Learning


1. Introduction to Deep Learning

Deep learning is a class of machine-learning techniques.

Deep learning allows computational models that are composed of multiple processing layers to learn representations of data with multiple levels of abstraction.

Deep learning is used to identify objects in images, transcribe speech into text, match news items, posts or products with users’ interests, and select relevant search results.

Advantages:

The conventional option is to hand design good feature extractors, which requires a considerable amount of engineering skill and domain expertise. But this can all be avoided if good features can be learned automatically using a general-purpose learning procedure. This is the key advantage of deep learning.

2. Representation Learning

Representation learning is a set of methods that allows a machine to be fed with raw data and to automatically discover the representations needed for detection or classification.

Deep-learning methods are representation-learning methods with multiple levels of representation, obtained by composing simple but non-linear modules.

3. Supervised Learning

During training, the machine is shown an image and produces an output in the form of a vector of scores, one for each category.

Training objective:

We want the desired category to have the highest score of all categories.

Training:

Objective function: measures the error (or distance) between the output scores and the desired pattern of scores.

Parameters: there may be hundreds of millions of these adjustable weights.

In a feedforward neural network, the weighted sum of the outputs of the previous layer is fed as input to the next layer.

The objective function can be seen as a kind of hilly landscape in the high-dimensional space of weight values.

The learning algorithm computes a gradient vector that, for each weight, indicates by what amount the error would increase or decrease if the weight were increased by a tiny amount. The weight vector is then adjusted in the opposite direction to the gradient vector.

The negative gradient vector indicates the direction of steepest descent in this landscape, taking the weight vector closer to a minimum, where the output error is low on average.
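As a hedged illustration (not code from the review), the sketch below computes the gradient of a squared-error objective for a tiny linear model in NumPy and adjusts the weight vector in the opposite direction of that gradient; the data, learning rate and model are all invented for the example.

```python
import numpy as np

# Toy data: 4 examples, 3 input features (invented for illustration).
X = np.array([[0.5, 1.2, -0.3],
              [1.0, -0.7, 0.8],
              [-0.2, 0.4, 1.5],
              [0.9, 0.1, -1.1]])
y = np.array([1.0, -1.0, 0.5, 0.0])   # desired outputs

w = np.zeros(3)                        # adjustable weights
learning_rate = 0.1

# Objective: mean squared error between the outputs X @ w and the targets y.
error = X @ w - y
objective = np.mean(error ** 2)

# Gradient of the objective with respect to each weight:
# d/dw mean((Xw - y)^2) = 2/N * X^T (Xw - y)
grad = 2.0 / len(y) * X.T @ error

# Move the weight vector a small step opposite to the gradient
# (the direction of steepest descent in the "hilly landscape").
w = w - learning_rate * grad
```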

Stochastic Gradient Descent

In practice, most practitioners use a procedure called stochastic gradient descent (SGD).

It is called stochastic because each small set of examples gives a noisy estimate of the average gradient over all examples.
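A minimal mini-batch SGD sketch in NumPy, again on an invented linear-regression problem; the batch size, step count and learning rate are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dataset (invented): 100 examples, 3 features.
X = rng.normal(size=(100, 3))
true_w = np.array([0.5, -1.0, 2.0])
y = X @ true_w + 0.1 * rng.normal(size=100)

w = np.zeros(3)
learning_rate = 0.05
batch_size = 8

for step in range(200):
    # Pick a small random subset of examples.
    idx = rng.choice(len(y), size=batch_size, replace=False)
    X_batch, y_batch = X[idx], y[idx]

    # Noisy estimate of the average gradient over all examples.
    error = X_batch @ w - y_batch
    grad = 2.0 / batch_size * X_batch.T @ error

    w -= learning_rate * grad   # adjust the weights against the gradient
```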

At the pixel level, images of two Samoyeds in different poses and in different environments may be very different from each other, whereas two images of a Samoyed and a wolf in the same position and on similar backgrounds may be very similar to each other. 

A linear classifier, or any other ‘shallow’ classifier operating on raw pixels, could not possibly distinguish the latter two, while putting the former two in the same category.

This is why shallow classifiers require a good feature extractor that solves the selectivity–invariance dilemma — one that produces representations that are selective to the aspects of the image that are important for discrimination, but that are invariant to irrelevant aspects such as the pose of the animal.

The Backpropagation Algorithm

The backpropagation procedure to compute the gradient of an objective function with respect to the weights of a multilayer stack of modules is nothing more than a practical application of the chain rule for derivatives.

The backpropagation equation can be applied repeatedly to propagate gradients through all modules, starting from the output at the top all the way to the bottom.
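A hand-rolled sketch of that chain-rule bookkeeping for a tiny two-layer network with a ReLU hidden layer and squared-error loss; all sizes and data are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Tiny two-layer network: 3 inputs -> 4 hidden units (ReLU) -> 1 output.
x = rng.normal(size=3)          # one input example (invented)
target = 1.0
W1 = rng.normal(size=(4, 3)) * 0.1
W2 = rng.normal(size=(1, 4)) * 0.1

# Forward pass: each layer takes a weighted sum of the previous layer.
z1 = W1 @ x                     # pre-activations of the hidden layer
h1 = np.maximum(z1, 0.0)        # ReLU non-linearity
y = W2 @ h1                     # network output
loss = 0.5 * (y[0] - target) ** 2

# Backward pass: repeated application of the chain rule,
# starting from the output at the top down to the bottom.
dL_dy = y[0] - target                     # d loss / d output
dL_dW2 = dL_dy * h1[np.newaxis, :]        # gradient for the top-layer weights
dL_dh1 = dL_dy * W2[0]                    # propagate to the hidden activations
dL_dz1 = dL_dh1 * (z1 > 0)                # through the ReLU derivative
dL_dW1 = np.outer(dL_dz1, x)              # gradient for the bottom-layer weights
```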

In the late 1990s, neural nets and backpropagation were largely forsaken. It was widely thought that learning useful, multistage feature extractors with little prior knowledge was infeasible. In particular, it was commonly thought that simple gradient descent would get trapped in poor local minima.

By 2012, versions of the deep net from 2009 were being developed by many of the major speech groups [6] and were already being deployed in Android phones.

Convolutional Neural Networks

The architecture of a typical ConvNet is structured as a series of stages. The first few stages are composed of two types of layers: convolutional layers and pooling layers.

Units in a convolutional layer are organized in feature maps, within which each unit is connected to local patches in the feature maps of the previous layer through a set of weights called a filter bank.

The result of this local weighted sum is then passed through a non-linearity such as a ReLU. All units in a feature map share the same filter bank.

The role of the pooling layer is to merge semantically similar features into one.

The pooling allows representations to vary very little when elements in the previous layer vary in position and appearance.

Two or three stages of convolution, non-linearity and pooling are stacked, followed by more convolutional and fully-connected layers.
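A hedged sketch of such a stack, assuming PyTorch is available; the channel counts, kernel sizes and input resolution are arbitrary illustration choices rather than values from the review.

```python
import torch
import torch.nn as nn

# A small ConvNet: repeated convolution -> ReLU -> pooling stages,
# followed by a fully-connected classifier. All sizes are illustrative.
model = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),   # filter bank over local patches
    nn.ReLU(),                                    # non-linearity
    nn.MaxPool2d(2),                              # merge nearby, similar features
    nn.Conv2d(16, 32, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Flatten(),
    nn.Linear(32 * 8 * 8, 10),                    # fully-connected output layer
)

# One forward pass on a fake batch of 32x32 RGB images.
images = torch.randn(4, 3, 32, 32)
scores = model(images)     # shape: (4, 10), one score per category
```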

ConvNets were largely forsaken by the mainstream computer-vision and machine-learning communities until the ImageNet competition in 2012.

This success came from the efficient use of GPUs, ReLUs, a new regularization technique called dropout, and techniques to generate more training examples by deforming the existing ones.

Distributed Representations and Language Processing

Both of the advantages described below arise from the power of composition and depend on the underlying data-generating distribution having an appropriate componential structure.

First, learning distributed representations enables generalization to new combinations of the values of learned features beyond those seen during training (for example, 2^n combinations are possible with n binary features).

Second, composing layers of representation in a deep net brings the potential for another exponential advantage (exponential in the depth).

This is nicely demonstrated by training a multilayer neural network to predict the next word in a sequence from a local context of earlier words.

The process:

In the first layer, each word creates a different pattern of activations, or word vectors.

The other layers of the network learn to convert the input word vectors into an output word vector for the predicted next word, which can be used to predict the probability for any word in the vocabulary to appear as the next word.

The network learns word vectors that contain many active components, each of which can be interpreted as a separate feature of the word.
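A minimal sketch of such a next-word model in PyTorch; the vocabulary size, embedding width and context length are invented for illustration.

```python
import torch
import torch.nn as nn

vocab_size, embed_dim, context = 1000, 64, 3   # illustrative sizes

model = nn.Sequential(
    nn.Embedding(vocab_size, embed_dim),   # first layer: one word vector per word
    nn.Flatten(),                          # concatenate the context word vectors
    nn.Linear(context * embed_dim, 128),
    nn.ReLU(),
    nn.Linear(128, vocab_size),            # a score for every word in the vocabulary
)

# Fake context of 3 word indices; softmax turns the scores into
# a probability for every word to appear as the next word.
context_ids = torch.tensor([[12, 47, 305]])
probs = torch.softmax(model(context_ids), dim=-1)   # shape: (1, vocab_size)
```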

In the logic-inspired paradigm, an instance of a symbol is something for which the only property is that it is either identical or non-identical to other symbol instances. It has no internal structure that is relevant to its use;

By contrast, neural networks just use big activity vectors, big weight matrices and scalar non-linearities to perform the type of fast ‘intuitive’ inference.

Recurrent Neural Networks

For tasks that involve sequential inputs, such as speech and language, it is often better to use RNNs.

RNNs process an input sequence one element at a time, maintaining in their hidden units a ‘state vector’ that implicitly contains information about the history of all the past elements of the sequence.

RNNs are very powerful dynamic systems, but training them has proved to be problematic because the backpropagated gradients either grow or shrink at each time step, so over many time steps they typically explode or vanish.
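A hedged sketch of the basic recurrence in NumPy; the tanh non-linearity and the sizes are common conventions rather than specifics from the review.

```python
import numpy as np

rng = np.random.default_rng(0)

input_dim, hidden_dim = 5, 8                          # illustrative sizes
U = rng.normal(size=(hidden_dim, input_dim)) * 0.1    # input-to-hidden weights
W = rng.normal(size=(hidden_dim, hidden_dim)) * 0.1   # hidden-to-hidden weights

sequence = rng.normal(size=(10, input_dim))   # 10 time steps of fake input
h = np.zeros(hidden_dim)                      # the 'state vector'

for x_t in sequence:
    # The state vector implicitly summarizes the whole past of the sequence.
    h = np.tanh(U @ x_t + W @ h)
```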

More advanced applications:

Translating English into French

For example, after reading an English sentence one word at a time, an English ‘encoder’ network can be trained so that the final state vector of its hidden units is a good representation of the thought expressed by the sentence.

This thought vector can then be used as the initial hidden state of a jointly trained French ‘decoder’ network, which outputs a probability distribution for the first word of the French translation.

If a particular first word is chosen from this distribution and provided as input to the decoder network, it will then output a probability distribution for the second word of the translation, and so on until a full stop is chosen.

Overall, this process generates sequences of French words according to a probability distribution that depends on the English sentence.
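A minimal encoder-decoder sketch in PyTorch, assuming GRU units for the two recurrent networks and invented vocabulary sizes; it only goes as far as the distribution over the first French word.

```python
import torch
import torch.nn as nn

en_vocab, fr_vocab, dim = 1000, 1200, 64      # illustrative sizes

en_embed = nn.Embedding(en_vocab, dim)
encoder = nn.GRU(dim, dim, batch_first=True)  # English 'encoder' network
fr_embed = nn.Embedding(fr_vocab, dim)
decoder = nn.GRU(dim, dim, batch_first=True)  # French 'decoder' network
to_scores = nn.Linear(dim, fr_vocab)

# Read an English sentence (fake word indices) one word at a time.
english = torch.tensor([[5, 42, 7, 100]])
_, thought = encoder(en_embed(english))       # final state: the 'thought vector'

# Use the thought vector as the decoder's initial hidden state and
# produce a probability distribution over the first French word.
start_token = torch.tensor([[0]])             # hypothetical start-of-sentence id
out, _ = decoder(fr_embed(start_token), thought)
first_word_probs = torch.softmax(to_scores(out[:, -1]), dim=-1)
```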

Generating descriptions of images:

The encoder here is a deep ConvNet that converts the pixels into an activity vector in its last hidden layer. The decoder is an RNN similar to the ones used for machine translation and neural language modelling.

LSTM:

Although the main purpose of RNNs is to learn long-term dependencies, theoretical and empirical evidence shows that it is difficult for them to learn to store information for very long.

To correct for that, one idea is to augment the network with an explicit memory. The first proposal of this kind was the long short-term memory (LSTM) network, which uses special hidden units whose natural behaviour is to remember inputs for a long time.
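A brief sketch using PyTorch's LSTM cell (sizes arbitrary); the extra cell state c is the explicit memory that can hold information across many time steps.

```python
import torch
import torch.nn as nn

input_dim, hidden_dim = 5, 8                  # illustrative sizes
cell = nn.LSTMCell(input_dim, hidden_dim)

h = torch.zeros(1, hidden_dim)                # ordinary hidden state
c = torch.zeros(1, hidden_dim)                # explicit memory cell

sequence = torch.randn(20, 1, input_dim)      # 20 fake time steps
for x_t in sequence:
    # Gates inside the cell decide what to write to, keep in,
    # and read from the memory c at each step.
    h, c = cell(x_t, (h, c))
```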


Future Directions

Although we have not focused on it in this Review, we expect unsupervised learning to become far more important in the longer term. Human and animal learning is largely unsupervised: we discover the structure of the world by observing it, not by being told the name of every object.

(Unsupervised learning is a growing trend.)

We expect much of the future progress in vision to come from systems that are trained end-to-end and combine ConvNets with RNNs that use reinforcement learning to decide where to look.

Natural language understanding is another area in which deep learning is poised to make a large impact over the next few years. We expect systems that use RNNs to understand sentences or whole documents will become much better when they learn strategies for attending to one part at a time.
