Calculus on Computational Graphs: Backpropagation

https://colah.github.io/posts/2015-08-Backprop/
https://github.com/colah/colah.github.io/tree/master/posts

1. Introduction

Backpropagation is the key algorithm that makes training deep models computationally tractable. For modern neural networks, it can make training with gradient descent as much as ten million times faster, relative to a naive implementation. That’s the difference between a model taking a week to train and taking 200,000 years.

Beyond its use in deep learning, backpropagation is a powerful computational tool in many other areas, ranging from weather forecasting to analyzing numerical stability - it just goes by different names. In fact, the algorithm has been reinvented at least dozens of times in different fields. The general, application independent, name is “reverse-mode differentiation.”

Fundamentally, it’s a technique for calculating derivatives quickly. And it’s an essential trick to have in your bag, not only in deep learning, but in a wide variety of numerical computing situations.

2. Computational Graphs

Computational graphs are a nice way to think about mathematical expressions. For example, consider the expression $e = (a+b)*(b+1)$. There are three operations: two additions and one multiplication. To help us talk about this, let's introduce two intermediary variables, $c$ and $d$, so that every function's output has a variable. We now have:

$$\begin{aligned} c &= a + b \\ d &= b + 1 \\ e &= c * d \end{aligned}$$

To create a computational graph, we make each of these operations, along with the input variables, into nodes. When one node’s value is the input to another node, an arrow goes from one to another.

[Figure: the computational graph for $e = (a+b)*(b+1)$, with nodes for $a$, $b$, $c$, $d$, $e$ and an arrow wherever one node's value feeds into another]

These sorts of graphs come up all the time in computer science, especially in talking about functional programs. They are very closely related to the notions of dependency graphs and call graphs. They’re also the core abstraction behind the popular deep learning framework Theano.

We can evaluate the expression by setting the input variables to certain values and computing nodes up through the graph. For example, let's set $a = 2$ and $b = 1$:

[Figure: the graph evaluated with $a = 2$ and $b = 1$, giving $c = 3$, $d = 2$, $e = 6$]

The expression evaluates to 6.
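
As a concrete illustration (the `Var`, `Add`, and `Mul` classes below are my own sketch, not code from the original post), the graph can be built out of small node objects and evaluated by walking from the inputs up to the output:

```python
# A minimal sketch of the computational graph for e = (a + b) * (b + 1).
# Each input and each operation becomes a node; evaluating the output
# node walks the graph from the inputs upward.

class Var:
    def __init__(self, value):
        self.value = value
    def eval(self):
        return self.value

class Add:
    def __init__(self, x, y):
        self.x, self.y = x, y
    def eval(self):
        return self.x.eval() + self.y.eval()

class Mul:
    def __init__(self, x, y):
        self.x, self.y = x, y
    def eval(self):
        return self.x.eval() * self.y.eval()

a, b = Var(2), Var(1)
c = Add(a, b)        # c = a + b
d = Add(b, Var(1))   # d = b + 1
e = Mul(c, d)        # e = c * d

print(e.eval())      # 6
```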

3. Derivatives on Computational Graphs

If one wants to understand derivatives in a computational graph, the key is to understand derivatives on the edges. If $a$ directly affects $c$, then we want to know how it affects $c$. If $a$ changes a little bit, how does $c$ change? We call this the partial derivative of $c$ with respect to $a$.

To evaluate the partial derivatives in this graph, we need the sum rule and the product rule:

$$\begin{aligned} \frac{\partial}{\partial a}(a+b) &= \frac{\partial a}{\partial a} + \frac{\partial b}{\partial a} = 1 + 0 = 1 \\ \frac{\partial}{\partial u}uv &= v\frac{\partial u}{\partial u} + u\frac{\partial v}{\partial u} = v * 1 + u * 0 = v \end{aligned}$$

Below, the graph has the derivative on each edge labeled.

[Figure: the graph with the partial derivative labeled on each edge]

$$\begin{aligned} \frac{\partial c}{\partial a} &= \frac{\partial}{\partial a}(a+b) = \frac{\partial a}{\partial a} + \frac{\partial b}{\partial a} = 1 + 0 = 1 \\ \frac{\partial c}{\partial b} &= \frac{\partial}{\partial b}(a+b) = \frac{\partial a}{\partial b} + \frac{\partial b}{\partial b} = 0 + 1 = 1 \\ \frac{\partial d}{\partial b} &= \frac{\partial}{\partial b}(b+1) = \frac{\partial b}{\partial b} + \frac{\partial 1}{\partial b} = 1 + 0 = 1 \\ \frac{\partial e}{\partial c} &= \frac{\partial}{\partial c}(c * d) = d\frac{\partial c}{\partial c} + c\frac{\partial d}{\partial c} = d * 1 + c * 0 = d = 2 \\ \frac{\partial e}{\partial d} &= \frac{\partial}{\partial d}(c * d) = d\frac{\partial c}{\partial d} + c\frac{\partial d}{\partial d} = d * 0 + c * 1 = c = 3 \end{aligned}$$
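
If you want to double-check these edge derivatives numerically, a quick finite-difference estimate (purely illustrative; the `numeric_partial` helper is mine, not from the post) gives the same values:

```python
# Central-difference check of the per-edge derivatives above.

def numeric_partial(f, args, i, h=1e-6):
    """Estimate the partial derivative of f with respect to args[i]."""
    lo, hi = list(args), list(args)
    lo[i] -= h
    hi[i] += h
    return (f(*hi) - f(*lo)) / (2 * h)

a, b = 2.0, 1.0
c, d = a + b, b + 1.0

print(numeric_partial(lambda a_, b_: a_ + b_, (a, b), 0))  # dc/da ≈ 1
print(numeric_partial(lambda b_: b_ + 1.0, (b,), 0))       # dd/db ≈ 1
print(numeric_partial(lambda c_, d_: c_ * d_, (c, d), 0))  # de/dc ≈ d = 2
print(numeric_partial(lambda c_, d_: c_ * d_, (c, d), 1))  # de/dd ≈ c = 3
```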

What if we want to understand how nodes that aren’t directly connected affect each other?

Let's consider how $e$ is affected by $a$. If we change $a$ at a speed of 1, $c$ also changes at a speed of 1. In turn, $c$ changing at a speed of 1 causes $e$ to change at a speed of 2. So $e$ changes at a rate of $1 * 2$ with respect to $a$.

The general rule is to sum over all possible paths from one node to the other, multiplying the derivatives on each edge of the path together. For example, to get the derivative of $e$ with respect to $b$ we get:

$$\begin{aligned} \frac{\partial e}{\partial a} &= \frac{\partial e}{\partial c}\frac{\partial c}{\partial a} = 2 * 1 = 2 \\ \frac{\partial e}{\partial b} &= \frac{\partial e}{\partial c}\frac{\partial c}{\partial b} + \frac{\partial e}{\partial d}\frac{\partial d}{\partial b} = 2 * 1 + 3 * 1 = 5 \end{aligned}$$

This accounts for how $b$ affects $e$ through $c$ and also how it affects it through $d$.

This general “sum over paths” rule is just a different way of thinking about the multivariate chain rule.
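A brute-force version of this rule is easy to write down. The sketch below (my own illustration, not from the post) stores the edge derivatives of the graph above, enumerates every path from a node to $e$, and sums the products along each path:

```python
# "Sum over paths" by brute force, for the graph of e = (a + b) * (b + 1).

# Edge derivatives, keyed as (child, parent): d(parent)/d(child).
edges = {
    ('a', 'c'): 1,  # dc/da
    ('b', 'c'): 1,  # dc/db
    ('b', 'd'): 1,  # dd/db
    ('c', 'e'): 2,  # de/dc = d
    ('d', 'e'): 3,  # de/dd = c
}

def paths(src, dst):
    """Yield the list of edge derivatives along each path from src to dst."""
    if src == dst:
        yield []
        return
    for (u, v), deriv in edges.items():
        if u == src:
            for rest in paths(v, dst):
                yield [deriv] + rest

def sum_over_paths(src, dst):
    total = 0
    for path in paths(src, dst):
        product = 1
        for deriv in path:
            product *= deriv
        total += product
    return total

print(sum_over_paths('a', 'e'))  # 2  (one path: a -> c -> e)
print(sum_over_paths('b', 'e'))  # 5  (b -> c -> e gives 2, b -> d -> e gives 3)
```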

4. Factoring Paths

The problem with just “summing over the paths” is that it’s very easy to get a combinatorial explosion in the number of possible paths.

[Figure: nodes $X$, $Y$, $Z$ with three edges $\alpha$, $\beta$, $\gamma$ from $X$ to $Y$ and three edges $\delta$, $\epsilon$, $\zeta$ from $Y$ to $Z$]

In the above diagram, there are three paths from $X$ to $Y$, and a further three paths from $Y$ to $Z$. If we want to get the derivative $\frac{\partial Z}{\partial X}$ by summing over all paths, we need to sum over $3 * 3 = 9$ paths:

$$\frac{\partial Z}{\partial X} = \alpha\delta + \alpha\epsilon + \alpha\zeta + \beta\delta + \beta\epsilon + \beta\zeta + \gamma\delta + \gamma\epsilon + \gamma\zeta$$

The above only has nine paths, but it would be easy to have the number of paths grow exponentially as the graph becomes more complicated.

Instead of just naively summing over the paths, it would be much better to factor them:

$$\frac{\partial Z}{\partial X} = (\alpha + \beta + \gamma)(\delta + \epsilon + \zeta)$$
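
It is worth convincing yourself that the factored form really is the same sum. With arbitrary numbers standing in for the Greek-lettered edge derivatives:

```python
# The nine-term path sum and the factored product agree.
alpha, beta, gamma = 0.5, 1.0, 2.0
delta, epsilon, zeta = 3.0, 0.25, 4.0

path_sum = sum(p * q
               for p in (alpha, beta, gamma)
               for q in (delta, epsilon, zeta))              # nine paths
factored = (alpha + beta + gamma) * (delta + epsilon + zeta)

print(path_sum, factored)  # 25.375 25.375
```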

This is where “forward-mode differentiation” and “reverse-mode differentiation” come in. They’re algorithms for efficiently computing the sum by factoring the paths. Instead of summing over all of the paths explicitly, they compute the same sum more efficiently by merging paths back together at every node. In fact, both algorithms touch each edge exactly once!

Forward-mode differentiation starts at an input to the graph and moves towards the end. At every node, it sums all the paths feeding in. Each of those paths represents one way in which the input affects that node. By adding them up, we get the total way in which the node is affected by the input, its derivative.

[Figure: forward-mode differentiation, applying $\frac{\partial}{\partial X}$ from an input forward through the graph]
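
For the graph of $e = (a+b)*(b+1)$, forward-mode differentiation with respect to $b$ looks like the following sketch (hand-unrolled by me for this tiny graph, not code from the post):

```python
# Forward-mode differentiation of e = (a + b) * (b + 1) with respect to b:
# walk the nodes in order and, at each node, sum the contributions
# flowing in along its incoming edges.

a, b = 2.0, 1.0
c, d = a + b, b + 1.0

ddb = {'a': 0.0, 'b': 1.0}                  # seed: we differentiate w.r.t. b
ddb['c'] = 1.0 * ddb['a'] + 1.0 * ddb['b']  # edges a->c and b->c both carry 1
ddb['d'] = 1.0 * ddb['b']                   # edge b->d carries 1
ddb['e'] = d * ddb['c'] + c * ddb['d']      # edges c->e and d->e carry d and c

print(ddb)  # every node's derivative w.r.t. b; ddb['e'] == 5.0
```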

Though you probably didn’t think of it in terms of graphs, forward-mode differentiation is very similar to what you implicitly learned to do if you took an introduction to calculus class.

Reverse-mode differentiation, on the other hand, starts at an output of the graph and moves towards the beginning. At each node, it merges all paths which originated at that node.

[Figure: reverse-mode differentiation, applying $\frac{\partial Z}{\partial}$ from the output back through the graph]
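
A matching reverse-mode sketch for the same graph (again my own illustration) starts from $\frac{\partial e}{\partial e} = 1$ at the output and walks backwards, pushing the accumulated derivative across each edge and merging the paths that meet at a node:

```python
# Reverse-mode differentiation of the same graph: start with de/de = 1
# at the output and move backwards, multiplying by each edge derivative
# and summing where paths merge.

a, b = 2.0, 1.0
c, d = a + b, b + 1.0

grad = {'e': 1.0}                              # seed: de/de = 1
grad['c'] = grad['e'] * d                      # edge c->e carries d
grad['d'] = grad['e'] * c                      # edge d->e carries c
grad['b'] = grad['c'] * 1.0 + grad['d'] * 1.0  # paths through c and d merge at b
grad['a'] = grad['c'] * 1.0                    # edge a->c carries 1

print(grad)  # {'e': 1.0, 'c': 2.0, 'd': 3.0, 'b': 5.0, 'a': 2.0}
```

Note that the single backward pass fills in the derivative of $e$ with respect to every node, including both inputs, which is exactly the point the next section makes.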

Forward-mode differentiation tracks how one input affects every node. Reverse-mode differentiation tracks how every node affects one output. That is, forward-mode differentiation applies the operator $\frac{\partial}{\partial X}$ to every node, while reverse-mode differentiation applies the operator $\frac{\partial Z}{\partial}$ to every node.

5. Computational Victories

At this point, you might wonder why anyone would care about reverse-mode differentiation. It looks like a strange way of doing the same thing as the forward-mode. Is there some advantage?

Let’s consider our original example again:

[Figure: the computational graph for $e = (a+b)*(b+1)$ again]

We can use forward-mode differentiation from $b$ up. This gives us the derivative of every node with respect to $b$.

[Figure: forward-mode differentiation from $b$, labeling every node with its derivative with respect to $b$]

We've computed $\frac{\partial e}{\partial b}$, the derivative of our output with respect to one of our inputs.

What if we do reverse-mode differentiation from $e$ down? This gives us the derivative of $e$ with respect to every node:

[Figure: reverse-mode differentiation from $e$, labeling every node with the derivative of $e$ with respect to that node]

When I say that reverse-mode differentiation gives us the derivative of $e$ with respect to every node, I really do mean every node. We get both $\frac{\partial e}{\partial a}$ and $\frac{\partial e}{\partial b}$, the derivatives of $e$ with respect to both inputs. Forward-mode differentiation gave us the derivative of our output with respect to a single input, but reverse-mode differentiation gives us all of them.

For this graph, that’s only a factor of two speed up, but imagine a function with a million inputs and one output. Forward-mode differentiation would require us to go through the graph a million times to get the derivatives. Reverse-mode differentiation can get them all in one fell swoop! A speed up of a factor of a million is pretty nice!

When training neural networks, we think of the cost (a value describing how bad a neural network performs) as a function of the parameters (numbers describing how the network behaves). We want to calculate the derivatives of the cost with respect to all the parameters, for use in gradient descent. Now, there’s often millions, or even tens of millions of parameters in a neural network. So, reverse-mode differentiation, called backpropagation in the context of neural networks, gives us a massive speed up!
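
As a toy picture of that workflow (purely illustrative; real networks have vastly more parameters and a proper loss function), treat $a$ and $b$ as the parameters, $e$ as the cost, and take gradient-descent steps using the gradients from one reverse-mode pass:

```python
# A toy gradient-descent loop on e = (a + b) * (b + 1), treating a and b
# as parameters and e as the cost. One backward pass yields both gradients.

def cost_and_grads(a, b):
    c, d = a + b, b + 1.0
    e = c * d
    grads = {'a': d, 'b': d + c}   # de/da and de/db from one reverse pass
    return e, grads

a, b, lr = 2.0, 1.0, 0.1
for step in range(3):
    e, g = cost_and_grads(a, b)
    a -= lr * g['a']               # gradient descent: move against the gradient
    b -= lr * g['b']
    print(step, round(e, 3), round(a, 3), round(b, 3))
```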

Are there any cases where forward-mode differentiation makes more sense? Yes, there are! Where the reverse-mode gives the derivatives of one output with respect to all inputs, the forward-mode gives us the derivatives of all outputs with respect to one input. If one has a function with lots of outputs, forward-mode differentiation can be much, much, much faster.

6. Isn’t This Trivial?

When I first understood what backpropagation was, my reaction was: “Oh, that’s just the chain rule! How did it take us so long to figure out?” I’m not the only one who’s had that reaction. It’s true that if you ask “is there a smart way to calculate derivatives in feedforward neural networks?” the answer isn’t that difficult.

But I think it was much more difficult than it might seem. You see, at the time backpropagation was invented, people weren’t very focused on the feedforward neural networks that we study. It also wasn’t obvious that derivatives were the right way to train them. Those are only obvious once you realize you can quickly calculate derivatives. There was a circular dependency.

Worse, it would be very easy to write off any piece of the circular dependency as impossible on casual thought. Training neural networks with derivatives? Surely you’d just get stuck in local minima. And obviously it would be expensive to compute all those derivatives. It’s only because we know this approach works that we don’t immediately start listing reasons it’s likely not to.

That’s the benefit of hindsight. Once you’ve framed the question, the hardest work is already done.

7. Conclusion

Derivatives are cheaper than you think. That’s the main lesson to take away from this post. In fact, they’re unintuitively cheap, and us silly humans have had to repeatedly rediscover this fact. That’s an important thing to understand in deep learning. It’s also a really useful thing to know in other fields, and only more so if it isn’t common knowledge.

Are there other lessons? I think there are.

Backpropagation is also a useful lens for understanding how derivatives flow through a model. This can be extremely helpful in reasoning about why some models are difficult to optimize. The classic example of this is the problem of vanishing gradients in recurrent neural networks.

Finally, I claim there is a broad algorithmic lesson to take away from these techniques. Backpropagation and forward-mode differentiation use a powerful pair of tricks (linearization and dynamic programming) to compute derivatives more efficiently than one might think possible. If you really understand these techniques, you can use them to efficiently calculate several other interesting expressions involving derivatives. We’ll explore this in a later blog post.

This post gives a very abstract treatment of backpropagation. I strongly recommend reading Michael Nielsen’s chapter on it for an excellent discussion, more concretely focused on neural networks.

How the backpropagation algorithm works
http://neuralnetworksanddeeplearning.com/chap2.html

Acknowledgments

Thanks to all those who tolerated me practicing explaining backpropagation in talks and seminar series!

References

[1] Yongqiang Cheng, https://yongqiang.blog.csdn.net/
[2] Neural Networks and Deep Learning, http://neuralnetworksanddeeplearning.com/index.html
[3] colah’s blog, https://colah.github.io/
[4] Daniel Sabinasz, https://www.sabinasz.net/list-of-articles/
