Calculus on Computational Graphs: Backpropagation

Link to the original English post: here

Introduction

Backpropagation is the key algorithm that makes training deep models computationally tractable. For modern neural networks, it can make training with gradient descent as much as ten million times faster, relative to a naive implementation. That’s the difference between a model taking a week to train and taking 200,000 years.


Beyond its use in deep learning, backpropagation is a powerful computational tool in many other areas, ranging from weather forecasting to analyzing numerical stability – it just goes by different names. In fact, the algorithm has been reinvented at least dozens of times in different fields (see Griewank (2010)). The general, application independent, name is “reverse-mode differentiation.”


Fundamentally, it’s a technique for calculating derivatives quickly. And it’s an essential trick to have in your bag, not only in deep learning, but in a wide variety of numerical computing situations.


One thing to keep in mind: at its core, backpropagation is simply a technique for computing gradients.

Computational Graphs

Computational graphs are a nice way to think about mathematical expressions. For example, consider the expression $e = (a + b) \cdot (b + 1)$. There are three operations: two additions and one multiplication. To help us talk about this, let's introduce two intermediary variables, $c$ and $d$, so that every function's output has a variable. We now have:


$c = a + b$

$d = b + 1$

$e = c \cdot d$

To create a computational graph, we make each of these operations, along with the input variables, into nodes. When one node’s value is the input to another node, an arrow goes from one to another.


[Figure: the computational graph for $e = (a + b) \cdot (b + 1)$]

These sorts of graphs come up all the time in computer science, especially in talking about functional programs. They are very closely related to the notions of dependency graphs and call graphs. They’re also the core abstraction behind the popular deep learning framework Theano.

We can evaluate the expression by setting the input variables to certain values and computing nodes up through the graph. For example, let's set $a = 2$ and $b = 1$:


[Figure: the graph evaluated with $a = 2$ and $b = 1$]

After evaluating the graph node by node, the whole expression comes out to 6.
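As a minimal sketch of the same bottom-up evaluation (illustrative code, not part of the original post), the graph can be computed in Python exactly as written above:

```python
# Evaluate the computational graph e = (a + b) * (b + 1) bottom-up,
# one intermediate node at a time.
def evaluate(a, b):
    c = a + b      # addition node
    d = b + 1      # addition node
    e = c * d      # multiplication node
    return e

print(evaluate(2, 1))  # 6, matching the worked example above
```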

Derivatives on Computational Graphs

If one wants to understand derivatives in a computational graph, the key is to understand derivatives on the edges. If $a$ directly affects $c$, then we want to know how it affects $c$. If $a$ changes a little bit, how does $c$ change? We call this the partial derivative of $c$ with respect to $a$. (It is a partial derivative because $c$ also depends on $b$; here we ask only about the change that comes from $a$.)

To evaluate the partial derivatives in this graph, we need the sum rule and the product rule:


[Figure: the sum rule and the product rule]
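For reference, these are the two rules the figure presumably illustrated, written here for independent inputs:

$$\frac{\partial}{\partial a}(a + b) = \frac{\partial a}{\partial a} + \frac{\partial b}{\partial a} = 1 \qquad \text{(sum rule)}$$

$$\frac{\partial}{\partial u}(u v) = u \frac{\partial v}{\partial u} + v \frac{\partial u}{\partial u} = v \qquad \text{(product rule)}$$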

Below, the graph has the derivative on each edge labeled.


$$\frac{\partial e}{\partial c} = \frac{\partial}{\partial c}(c \cdot d) = d \cdot \frac{\partial c}{\partial c} + c \cdot \frac{\partial d}{\partial c} = d + 0 = b + 1$$

$$\frac{\partial e}{\partial d} = \frac{\partial}{\partial d}(c \cdot d) = d \cdot \frac{\partial c}{\partial d} + c \cdot \frac{\partial d}{\partial d} = 0 + c = a + b$$

What if we want to understand how nodes that aren't directly connected affect each other? Let's consider how $e$ is affected by $a$. If we change $a$ at a speed of 1, $c$ also changes at a speed of 1. In turn, $c$ changing at a speed of 1 causes $e$ to change at a speed of 2. So $e$ changes at a rate of $1 \cdot 2$ with respect to $a$.

$$\frac{\partial e}{\partial a} = \frac{\partial e}{\partial c} \cdot \frac{\partial c}{\partial a} = 2 \cdot 1 = 2$$

The general rule is to sum over all possible paths from one node to the other, multiplying the derivatives on each edge of the path together. For example, to get the derivative of $e$ with respect to $b$ (there are two paths, $b \to c \to e$ and $b \to d \to e$), we get:

$$\frac{\partial e}{\partial b} = \frac{\partial e}{\partial c} \cdot \frac{\partial c}{\partial b} + \frac{\partial e}{\partial d} \cdot \frac{\partial d}{\partial b} = 2 \cdot 1 + 3 \cdot 1$$

This accounts for how b affects e through c and also how it affects it through d.
This general “sum over paths” rule is just a different way of thinking about the multivariate chain rule.

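As a quick numerical sanity check (an illustrative sketch, not part of the original post), the path-sum value $2 \cdot 1 + 3 \cdot 1 = 5$ agrees with a central-difference estimate of $\frac{\partial e}{\partial b}$:

```python
# Compare the "sum over paths" derivative de/db with a finite-difference estimate.
def e(a, b):
    return (a + b) * (b + 1)

a, b, eps = 2.0, 1.0, 1e-6
numeric = (e(a, b + eps) - e(a, b - eps)) / (2 * eps)
path_sum = 2 * 1 + 3 * 1   # path b -> c -> e plus path b -> d -> e
print(numeric, path_sum)   # both are (approximately) 5
```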

Factoring Paths

The problem with just “summing over the paths” is that it’s very easy to get a combinatorial explosion in the number of possible paths.


[Figure: a graph with three edges from $X$ to $Y$ (labeled $\alpha, \beta, \gamma$) and three edges from $Y$ to $Z$ (labeled $\delta, \epsilon, \zeta$)]

In the above diagram, there are three paths from $X$ to $Y$, and a further three paths from $Y$ to $Z$. If we want to get the derivative $\frac{\partial Z}{\partial X}$ by summing over all paths, we need to sum over $3 \cdot 3 = 9$ paths:


$$\frac{\partial Z}{\partial X} = \alpha\delta + \alpha\epsilon + \alpha\zeta + \beta\delta + \beta\epsilon + \beta\zeta + \gamma\delta + \gamma\epsilon + \gamma\zeta$$

The above only has nine paths, but it would be easy to have the number of paths grow exponentially as the graph becomes more complicated.

Instead of just naively summing over the paths, it would be much better to factor them:


$$\frac{\partial Z}{\partial X} = (\alpha + \beta + \gamma)(\delta + \epsilon + \zeta)$$
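To make the saving concrete (an illustrative sketch with made-up edge values, not part of the original post), the naive nine-term path sum and the factored form give the same derivative while doing fewer multiplications:

```python
import itertools
import random

# Hypothetical edge derivatives: alpha, beta, gamma on X -> Y and
# delta, epsilon, zeta on Y -> Z.
xy = [random.random() for _ in range(3)]
yz = [random.random() for _ in range(3)]

# Naive sum over all 3 * 3 = 9 paths: one product per path.
path_sum = sum(p * q for p, q in itertools.product(xy, yz))

# Factored form: merge the paths at the intermediate node Y first.
factored = sum(xy) * sum(yz)

print(abs(path_sum - factored) < 1e-12)  # True: same value, fewer multiplications
```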

This is where “forward-mode differentiation” and “reverse-mode differentiation” come in. They’re algorithms for efficiently computing the sum by factoring the paths. Instead of summing over all of the paths explicitly, they compute the same sum more efficiently by merging paths back together at every node. In fact, both algorithms touch each edge exactly once!

Forward-mode differentiation starts at an input to the graph and moves towards the end. At every node, it sums all the paths feeding in. Each of those paths represents one way in which the input affects that node. By adding them up, we get the total way in which the node is affected by the input, its derivative.


[Figure: forward-mode differentiation, working from the input towards the output]

Though you probably didn’t think of it in terms of graphs, forward-mode differentiation is very similar to what you implicitly learned to do if you took an introduction to calculus class.
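Here is a minimal sketch (illustrative code, not from the original post) of forward-mode differentiation on the earlier graph, carrying a (value, derivative with respect to $b$) pair through every node:

```python
# Forward-mode differentiation on e = (a + b) * (b + 1), seeded with db/db = 1.
a, b = 2.0, 1.0
da, db = 0.0, 1.0                  # differentiate with respect to b

c, dc = a + b, da + db             # sum rule
d, dd = b + 1.0, db                # d = b + 1, the constant drops out
e, de = c * d, dc * d + c * dd     # product rule

print(dc, dd, de)                  # 1.0 1.0 5.0  ->  de/db = 5
```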

Reverse-mode differentiation, on the other hand, starts at an output of the graph and moves towards the beginning. At each node, it merges all paths which originated at that node.


[Figure: reverse-mode differentiation, working from the output back towards the inputs]

Forward-mode differentiation tracks how one input affects every node. Reverse-mode differentiation tracks how every node affects one output. That is, forward-mode differentiation applies the operator $\frac{\partial}{\partial X}$ to every node, while reverse-mode differentiation applies the operator $\frac{\partial Z}{\partial}$ to every node.



A Worked Example

[Figure: the computational graph for $e = (a + b) \cdot (b + 1)$ again]

Using the same graph as before, let's work through two questions:

1. How does the input node $b$ affect the output node $e$? (This is forward-mode differentiation, working from the bottom up.)

$$\frac{\partial c}{\partial b} = 1, \quad \frac{\partial d}{\partial b} = 1 \;\Longrightarrow\; \frac{\partial e}{\partial b} = \frac{\partial e}{\partial c} \cdot \frac{\partial c}{\partial b} + \frac{\partial e}{\partial d} \cdot \frac{\partial d}{\partial b} = 2 \cdot 1 + 3 \cdot 1$$

2. How is the output node $e$ affected by every other node? (This is reverse-mode differentiation, working from the top down.)

$$\frac{\partial e}{\partial c} = d \cdot \frac{\partial c}{\partial c} + c \cdot \frac{\partial d}{\partial c} = d + 0 = b + 1, \qquad \frac{\partial e}{\partial d} = d \cdot \frac{\partial c}{\partial d} + c \cdot \frac{\partial d}{\partial d} = 0 + c = a + b$$

$$\frac{\partial e}{\partial a} = \frac{\partial e}{\partial c} \cdot \frac{\partial c}{\partial a} = 2 \cdot 1 = 2, \qquad \frac{\partial e}{\partial b} = \frac{\partial e}{\partial c} \cdot \frac{\partial c}{\partial b} + \frac{\partial e}{\partial d} \cdot \frac{\partial d}{\partial b} = 2 \cdot 1 + 3 \cdot 1$$
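The same reverse-mode computation as a minimal sketch (illustrative code, not from the original post): one forward pass stores the node values, then one backward pass accumulates the derivative of $e$ with respect to every node:

```python
# Reverse-mode differentiation on e = (a + b) * (b + 1).
a, b = 2.0, 1.0
c, d = a + b, b + 1.0      # forward pass stores intermediate values
e = c * d

de_de = 1.0                # seed: de/de = 1 at the output
de_dc = de_de * d          # e = c * d  ->  de/dc = d
de_dd = de_de * c          #            ->  de/dd = c
de_da = de_dc * 1.0        # c = a + b  ->  dc/da = 1
de_db = de_dc * 1.0 + de_dd * 1.0   # b feeds both c and d

print(de_dc, de_dd, de_da, de_db)   # 2.0 3.0 2.0 5.0
```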


Computational Victories

At this point, you might wonder why anyone would care about reverse-mode differentiation. It looks like a strange way of doing the same thing as the forward-mode. Is there some advantage?

Let’s consider our original example again:


[Figure: the computational graph for $e = (a + b) \cdot (b + 1)$]

We can use forward-mode differentiation from $b$ up. This gives us the derivative of every node with respect to $b$.


[Figure: forward-mode differentiation from $b$, labeling every node with its derivative with respect to $b$]

We've computed $\frac{\partial e}{\partial b}$, the derivative of our output with respect to one of our inputs.

What if we do reverse-mode differentiation from $e$ down? This gives us the derivative of $e$ with respect to every node:


[Figure: reverse-mode differentiation from $e$, labeling every node with the derivative of $e$ with respect to that node]

When I say that reverse-mode differentiation gives us the derivative of $e$ with respect to every node, I really do mean every node. We get both $\frac{\partial e}{\partial a}$ and $\frac{\partial e}{\partial b}$, the derivatives of $e$ with respect to both inputs. Forward-mode differentiation gave us the derivative of our output with respect to a single input, but reverse-mode differentiation gives us all of them.


For this graph, that’s only a factor of two speed up, but imagine a function with a million inputs and one output. Forward-mode differentiation would require us to go through the graph a million times to get the derivatives. Reverse-mode differentiation can get them all in one fell swoop! A speed up of a factor of a million is pretty nice!


When training neural networks, we think of the cost (a value describing how bad a neural network performs) as a function of the parameters (numbers describing how the network behaves). We want to calculate the derivatives of the cost with respect to all the parameters, for use in gradient descent. Now, there’s often millions, or even tens of millions of parameters in a neural network. So, reverse-mode differentiation, called backpropagation in the context of neural networks, gives us a massive speed up!

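As a toy illustration of that setup (a sketch with assumed data and a single linear "neuron" $y = wx + v$, not from the original post), one backward pass per example yields both parameter derivatives, which gradient descent then uses:

```python
# Train y = w*x + v on toy data generated by y = 2x + 1, using squared-error
# cost and plain gradient descent. The backward pass computes d(cost)/dw and
# d(cost)/dv from the single output adjoint d(cost)/dy.
data = [(1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]
w, v, lr = 0.0, 0.0, 0.05

for _ in range(200):
    for x, target in data:
        y = w * x + v                   # forward pass
        dcost_dy = y - target           # cost = 0.5 * (y - target)**2
        dcost_dw = dcost_dy * x         # dy/dw = x
        dcost_dv = dcost_dy             # dy/dv = 1
        w -= lr * dcost_dw              # gradient descent update
        v -= lr * dcost_dv

print(round(w, 2), round(v, 2))         # close to 2.0 and 1.0
```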

(Are there any cases where forward-mode differentiation makes more sense? Yes, there are! Where the reverse-mode gives the derivatives of one output with respect to all inputs, the forward-mode gives us the derivatives of all outputs with respect to one input. If one has a function with lots of outputs, forward-mode differentiation can be much, much, much faster.)


Isn’t This Trivial?

When I first understood what backpropagation was, my reaction was: "Oh, that's just the chain rule! How did it take us so long to figure out?" I'm not the only one who's had that reaction. It's true that if you ask "is there a smart way to calculate derivatives in feedforward neural networks?" the answer isn't that difficult.


But I think it was much more difficult than it might seem. You see, at the time backpropagation was invented, people weren’t very focused on the feedforward neural networks that we study. It also wasn’t obvious that derivatives were the right way to train them. Those are only obvious once you realize you can quickly calculate derivatives. There was a circular dependency.


Worse, it would be very easy to write off any piece of the circular dependency as impossible on casual thought. Training neural networks with derivatives? Surely you’d just get stuck in local minima. And obviously it would be expensive to compute all those derivatives. It’s only because we know this approach works that we don’t immediately start listing reasons it’s likely not to.


