Backpropagation: Computational Graph Fundamentals

This article covers the fundamentals behind backpropagation and is well worth a read.

Original article: http://colah.github.io/posts/2015-08-Backprop/

Calculus on Computational Graphs: Backpropagation

Posted on August 31, 2015

Introduction

Backpropagation is the key algorithm that makes training deep models computationally tractable. For modern neural networks, it can make training with gradient descent as much as ten million times faster, relative to a naive implementation. That’s the difference between a model taking a week to train and taking 200,000 years.

Beyond its use in deep learning, backpropagation is a powerful computational tool in many other areas, ranging from weather forecasting to analyzing numerical stability – it just goes by different names. In fact, the algorithm has been reinvented at least dozens of times in different fields (see Griewank (2010)). The general, application independent, name is “reverse-mode differentiation.”

Fundamentally, it’s a technique for calculating derivatives quickly. And it’s an essential trick to have in your bag, not only in deep learning, but in a wide variety of numerical computing situations.

Computational Graphs

Computational graphs are a nice way to think about mathematical expressions. For example, consider the expression e=(a+b)∗(b+1). There are three operations: two additions and one multiplication. To help us talk about this, let’s introduce two intermediary variables, c and d so that every function’s output has a variable. We now have:

c=a+b
d=b+1
e=c∗d

To create a computational graph, we make each of these operations, along with the input variables, into nodes. When one node’s value is the input to another node, an arrow goes from one to another.

[Figure: the computational graph for e=(a+b)∗(b+1), with inputs a and b feeding the intermediate nodes c=a+b and d=b+1, which feed the output node e=c∗d]

These sorts of graphs come up all the time in computer science, especially in talking about functional programs. They are very closely related to the notions of dependency graphs and call graphs. They’re also the core abstraction behind the popular deep learning framework Theano.

We can evaluate the expression by setting the input variables to certain values and computing nodes up through the graph. For example, let’s set a=2 and b=1:

[Figure: the same graph evaluated with a=2 and b=1, giving c=3, d=2 and e=6]

The expression evaluates to 6.
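
As a small illustration (my own Python sketch, not code from the original post), the three assignments can simply be evaluated in topological order, one node at a time:

# Evaluate e = (a + b) * (b + 1) by computing each node in order.
def evaluate(a, b):
    c = a + b      # first addition node
    d = b + 1      # second addition node
    e = c * d      # multiplication node
    return c, d, e

c, d, e = evaluate(a=2, b=1)
print(c, d, e)     # 3 2 6, so the expression evaluates to 6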

Derivatives on Computational Graphs

If one wants to understand derivatives in a computational graph, the key is to understand derivatives on the edges. If a directly affects c, then we want to know how it affects c. If a changes a little bit, how does c change? We call this the partial derivative of c with respect to a.

To evaluate the partial derivatives in this graph, we need the sum rule and the product rule:

∂(a+b)/∂a = ∂a/∂a + ∂b/∂a = 1 (sum rule)
∂(uv)/∂u = u ∂v/∂u + v ∂u/∂u = v (product rule)

Below, the graph has the derivative on each edge labeled.

[Figure: the graph with the derivative labeled on each edge: ∂c/∂a=1, ∂c/∂b=1, ∂d/∂b=1, ∂e/∂c=d=2 and ∂e/∂d=c=3]

What if we want to understand how nodes that aren’t directly connected affect each other? Let’s consider how e is affected by a. If we change a at a speed of 1, c also changes at a speed of 1. In turn, c changing at a speed of 1 causes e to change at a speed of 2. So e changes at a rate of 1∗2 with respect to a.

The general rule is to sum over all possible paths from one node to the other, multiplying the derivatives on each edge of the path together. For example, to get the derivative of e with respect to b we get:

∂e/∂b = 1∗2 + 1∗3

This accounts for how b affects e through c and also how it affects it through d.

This general “sum over paths” rule is just a different way of thinking about the multivariate chain rule.
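
To make the path-sum concrete, here is a hedged Python sketch (mine, not the post's) that adds the contributions of the two paths b → c → e and b → d → e, and checks the result against a finite-difference estimate:

# Sum over the two paths from b to e, then compare with a numerical
# approximation of de/db at a = 2, b = 1.
def e_of(a, b):
    return (a + b) * (b + 1)

a, b = 2.0, 1.0
c, d = a + b, b + 1

# Edge derivatives: dc/db = 1, dd/db = 1, de/dc = d, de/dd = c.
path_sum = d * 1 + c * 1          # 2*1 + 3*1 = 5

eps = 1e-6
numeric = (e_of(a, b + eps) - e_of(a, b - eps)) / (2 * eps)

print(path_sum, round(numeric, 6))   # both are 5.0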

Factoring Paths

The problem with just “summing over the paths” is that it’s very easy to get a combinatorial explosion in the number of possible paths.

[Figure: a graph in which three edges labeled α, β, γ lead from X to Y, and three edges labeled δ, ε, ζ lead from Y to Z]

In the above diagram, there are three paths from X to Y, and a further three paths from Y to Z. If we want to get the derivative ∂Z/∂X by summing over all paths, we need to sum over 3∗3=9 paths:

∂Z/∂X = αδ + αε + αζ + βδ + βε + βζ + γδ + γε + γζ

The above only has nine paths, but it would be easy to have the number of paths to grow exponentially as the graph becomes more complicated.

Instead of just naively summing over the paths, it would be much better to factor them:

∂Z/∂X = (α + β + γ)(δ + ε + ζ)
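
As a hedged sketch (the edge values below are made up purely for illustration), summing the nine path products and multiplying the two factored sums give the same number, but the factored form uses far fewer operations:

from itertools import product

# Hypothetical derivatives on the three X->Y edges and the three Y->Z edges.
xy_edges = [0.5, -1.0, 2.0]   # alpha, beta, gamma
yz_edges = [3.0, 0.25, -2.0]  # delta, epsilon, zeta

# Naive sum over all 3 * 3 = 9 paths.
naive = sum(p * q for p, q in product(xy_edges, yz_edges))

# Factored form: (alpha + beta + gamma) * (delta + epsilon + zeta).
factored = sum(xy_edges) * sum(yz_edges)

print(naive, factored)   # identical values: 1.875 1.875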

This is where “forward-mode differentiation” and “reverse-mode differentiation” come in. They’re algorithms for efficiently computing the sum by factoring the paths. Instead of summing over all of the paths explicitly, they compute the same sum more efficiently by merging paths back together at every node. In fact, both algorithms touch each edge exactly once!

Forward-mode differentiation starts at an input to the graph and moves towards the end. At every node, it sums all the paths feeding in. Each of those paths represents one way in which the input affects that node. By adding them up, we get the total way in which the node is affected by the input, it’s derivative.

[Figure: forward-mode differentiation on the X→Y→Z graph, applying the operator ∂/∂X at every node]
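
Here is a minimal forward-mode sketch for the running example e=(a+b)∗(b+1) (my own code, assuming we differentiate with respect to b): each node carries its value together with its derivative with respect to b, built up by the sum and product rules:

# Forward-mode differentiation of e = (a + b) * (b + 1) with respect to b.
# Each node is stored as (value, derivative-with-respect-to-b).
a, da = 2.0, 0.0      # da/db = 0
b, db = 1.0, 1.0      # db/db = 1

c, dc = a + b, da + db            # sum rule
d, dd = b + 1, db                 # the constant 1 contributes no derivative
e, de = c * d, dc * d + c * dd    # product rule

print(e, de)   # 6.0 5.0, i.e. de/db = 5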

Though you probably didn’t think of it in terms of graphs, forward-mode differentiation is very similar to what you implicitly learned to do if you took an introduction to calculus class.

Reverse-mode differentiation, on the other hand, starts at an output of the graph and moves towards the beginning. At each node, it merges all paths which originated at that node.

[Figure: reverse-mode differentiation on the X→Y→Z graph, applying the operator ∂Z/∂ at every node]
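
A matching reverse-mode sketch (again my own hedged illustration, not the post's code): after a forward pass computes every node's value, a single backward pass from e accumulates ∂e/∂node at each node, merging the paths that leave it:

# Reverse-mode differentiation (backpropagation) on e = (a + b) * (b + 1).
a, b = 2.0, 1.0

# Forward pass: compute every node's value.
c = a + b
d = b + 1
e = c * d

# Backward pass: de/dnode for every node, starting from de/de = 1.
de_de = 1.0
de_dc = de_de * d          # e = c * d  ->  de/dc = d
de_dd = de_de * c          # e = c * d  ->  de/dd = c
de_da = de_dc * 1.0        # c = a + b  ->  dc/da = 1
de_db = de_dc * 1.0 + de_dd * 1.0   # the paths through c and through d merge here

print(de_da, de_db)   # 2.0 5.0, derivatives of e w.r.t. both inputs in one pass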

Forward-mode differentiation tracks how one input affects every node. Reverse-mode differentiation tracks how every node affects one output. That is, forward-mode differentiation applies the operator ∂/∂X to every node, while reverse mode differentiation applies the operator ∂Z/∂ to every node.

Computational Victories

At this point, you might wonder why anyone would care about reverse-mode differentiation. It looks like a strange way of doing the same thing as the forward-mode. Is there some advantage?

Let’s consider our original example again:

[Figure: the computational graph for e=(a+b)∗(b+1) again, with the derivatives on its edges]

We can use forward-mode differentiation from b up. This gives us the derivative of every node with respect to b.

[Figure: forward-mode differentiation from b, giving ∂b/∂b=1, ∂c/∂b=1, ∂d/∂b=1 and ∂e/∂b=5]

We’ve computed ∂e/∂b, the derivative of our output with respect to one of our inputs.

What if we do reverse-mode differentiation from e down? This gives us the derivative of e with respect to every node:

[Figure: reverse-mode differentiation from e, giving ∂e/∂e=1, ∂e/∂d=3, ∂e/∂c=2, ∂e/∂a=2 and ∂e/∂b=5]

When I say that reverse-mode differentiation gives us the derivative of e with respect to every node, I really do mean every node. We get both ∂e/∂a and ∂e/∂b, the derivatives of e with respect to both inputs. Forward-mode differentiation gave us the derivative of our output with respect to a single input, but reverse-mode differentiation gives us all of them.

For this graph, that’s only a factor of two speed up, but imagine a function with a million inputs and one output. Forward-mode differentiation would require us to go through the graph a million times to get the derivatives. Reverse-mode differentiation can get them all in one fell swoop! A speed up of a factor of a million is pretty nice!

When training neural networks, we think of the cost (a value describing how bad a neural network performs) as a function of the parameters (numbers describing how the network behaves). We want to calculate the derivatives of the cost with respect to all the parameters, for use in gradient descent. Now, there’s often millions, or even tens of millions of parameters in a neural network. So, reverse-mode differentiation, called backpropagation in the context of neural networks, gives us a massive speed up!
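
As a schematic, hedged sketch of why this matters for training (the toy linear model, data and learning rate below are made up, not from the post), the gradient with respect to every parameter feeds directly into a gradient-descent update:

import numpy as np

# Toy setup: a linear model y_hat = X @ w with a squared-error cost.
rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))      # 8 examples, 3 parameters
y = rng.normal(size=8)
w = np.zeros(3)
lr = 0.1

for _ in range(200):
    residual = X @ w - y                 # forward pass of the toy model
    grad_w = X.T @ residual / len(y)     # gradient of 0.5*mean(residual**2) w.r.t. every entry of w
    w -= lr * grad_w                     # gradient-descent step

print(0.5 * np.mean((X @ w - y) ** 2))   # the cost after training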

(Are there any cases where forward-mode differentiation makes more sense? Yes, there are! Where the reverse-mode gives the derivatives of one output with respect to all inputs, the forward-mode gives us the derivatives of all outputs with respect to one input. If one has a function with lots of outputs, forward-mode differentiation can be much, much, much faster.)

Isn’t This Trivial?

When I first understood what backpropagation was, my reaction was: “Oh, that’s just the chain rule! How did it take us so long to figure out?” I’m not the only one who’s had that reaction. It’s true that if you ask “is there a smart way to calculate derivatives in feedforward neural networks?” the answer isn’t that difficult.

But I think it was much more difficult than it might seem. You see, at the time backpropagation was invented, people weren’t very focused on the feedforward neural networks that we study. It also wasn’t obvious that derivatives were the right way to train them. Those are only obvious once you realize you can quickly calculate derivatives. There was a circular dependency.

Worse, it would be very easy to write off any piece of the circular dependency as impossible on casual thought. Training neural networks with derivatives? Surely you’d just get stuck in local minima. And obviously it would be expensive to compute all those derivatives. It’s only because we know this approach works that we don’t immediately start listing reasons it’s likely not to.

That’s the benefit of hindsight. Once you’ve framed the question, the hardest work is already done.

Conclusion

Derivatives are cheaper than you think. That’s the main lesson to take away from this post. In fact, they’re unintuitively cheap, and us silly humans have had to repeatedly rediscover this fact. That’s an important thing to understand in deep learning. It’s also a really useful thing to know in other fields, and only more so if it isn’t common knowledge.

Are there other lessons? I think there are.

Backpropagation is also a useful lens for understanding how derivatives flow through a model. This can be extremely helpful in reasoning about why some models are difficult to optimize. The classic example of this is the problem of vanishing gradients in recurrent neural networks.

Finally, I claim there is a broad algorithmic lesson to take away from these techniques. Backpropagation and forward-mode differentiation use a powerful pair of tricks (linearization and dynamic programming) to compute derivatives more efficiently than one might think possible. If you really understand these techniques, you can use them to efficiently calculate several other interesting expressions involving derivatives. We’ll explore this in a later blog post.

This post gives a very abstract treatment of backpropagation. I strongly recommend reading Michael Nielsen’s chapter on it for an excellent discussion, more concretely focused on neural networks.

Acknowledgments

Thank you to Greg Corrado, Jon Shlens, Samy Bengio and Anelia Angelova for taking the time to proofread this post.

Thanks also to Dario Amodei, Michael Nielsen and Yoshua Bengio for discussion of approaches to explaining backpropagation. Also thanks to all those who tolerated me practicing explaining backpropagation in talks and seminar series!

Footnote: This might feel a bit like dynamic programming. That’s because it is!

Reference: https://zhuanlan.zhihu.com/p/25081671?refer=xiaoleimlnote
