2021-06-13

m0_59276263

Tags: neural networks

Original source: https://github.com/Microsoft/ai-edu

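Neural Network Fundamentals

2.2 Nonlinear Backpropagation

2.2.1 Posing the Problem

In the linear example above, the error is passed back to the initial values $w$ and $b$ in one shot: a single step that directly modifies $w$ and $b$ is enough to correct the error. The computational graph explains why: however complicated the intermediate calculations are, they are all linear, so the error can be propagated all the way back at once. The drawback is that a combination of linear operations can only solve linear problems, not more complex ones. As explained in the basic principles of neural networks, an activation function is needed to connect two linear units.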
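
The claim that stacked linear units stay linear can be checked with a few lines of Python. This is only a sketch: the weights `w1`, `b1`, `w2`, `b2` are arbitrary illustration values, not taken from the original example.

```python
# Hypothetical two-layer linear model; the weights are arbitrary illustration values.
w1, b1 = 2.0, 1.0
w2, b2 = -3.0, 0.5

def two_layers(x):
    """Apply two linear units back to back, with no activation in between."""
    return w2 * (w1 * x + b1) + b2

# The same mapping collapsed into one linear unit:
# w2*(w1*x + b1) + b2 = (w2*w1)*x + (w2*b1 + b2)
w, b = w2 * w1, w2 * b1 + b2

def one_layer(x):
    """Single linear unit equivalent to the two stacked units."""
    return w * x + b

for x in (-1.0, 0.0, 2.5):
    assert abs(two_layers(x) - one_layer(x)) < 1e-12
```

Because the two units collapse into one, adding more linear layers adds no expressive power, which is why a nonlinear activation is placed between them.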

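Now consider a nonlinear example, shown in Figure 2-8.

Figure 2-8 Nonlinear backpropagation

Here $1 < x \le 10$ and $0 < y < 2.15$. Suppose five people represent $x, a, b, c, y$ respectively:

Forward pass

Person 1, the input layer, picks a random first value of $x$ from the range $(1, 10]$; suppose it is $2$;
Person 2, the first network layer, receives $x$ from person 1 and computes $a=x^2$;
Person 3, the second network layer, receives $a$ from person 2 and computes $b=\ln (a)$;
Person 4, the third network layer, receives $b$ from person 3 and computes $c=\sqrt{b}$;
Person 5, the output layer, receives $c$ from person 4.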

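Backward pass

Person 5 computes the difference between $c$ and $y$, $\Delta c = c - y$, and passes it back to person 4;
Person 4 receives $\Delta c$ from person 5 and computes $\Delta b = \Delta c \cdot 2\sqrt{b}$;
Person 3 receives $\Delta b$ from person 4 and computes $\Delta a = \Delta b \cdot a$;
Person 2 receives $\Delta a$ from person 3 and computes $\Delta x = \frac{\Delta a}{2x}$;
Person 1 receives $\Delta x$ from person 2, updates $x \leftarrow x - \Delta x$, and goes back to step 1.

The question: suppose we want to end up with $c = 2.13$. What should $x$ be? (An error below $0.001$ is acceptable.)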

2.2.2 Analytical Solution

$c=\sqrt{b}=\sqrt{\ln(a)}=\sqrt{\ln(x^2)}=2.13$
$x = e^{2.13^2/2} \approx 9.6644$
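
As a quick sanity check, the closed form can be evaluated and substituted back into the forward computation. This is a minimal sketch; the names `forward_c` and `x_closed` are made up for the illustration, and it assumes $x > 0$ so that $\ln(x^2) = 2\ln x$.

```python
import math

def forward_c(x):
    """Forward computation from the example: c = sqrt(ln(x^2))."""
    a = x * x
    b = math.log(a)
    return math.sqrt(b)

target_c = 2.13
# Solve sqrt(ln(x^2)) = c for x > 0:  ln(x^2) = c^2  =>  x = exp(c^2 / 2)
x_closed = math.exp(target_c ** 2 / 2)

print(f"x = {x_closed:.4f}, c(x) = {forward_c(x_closed):.6f}")
```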

2.2.3 Iterative Gradient Solution

$\frac{da}{dx}=\frac{d(x^2)}{dx}=2x=\frac{\Delta a}{\Delta x} \tag{1}$
$\frac{db}{da} =\frac{d(\ln{a})}{da} =\frac{1}{a} = \frac{\Delta b}{\Delta a} \tag{2}$
$\frac{dc}{db}=\frac{d(\sqrt{b})}{db}=\frac{1}{2\sqrt{b}}=\frac{\Delta c}{\Delta b} \tag{3}$
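This gives the following set of formulas, which propagate the error $\Delta c$ of the last layer all the way back to $\Delta x$ at the front, so that $x$ can be updated:
$\Delta c = c - y \tag{4}$
$\Delta b = \Delta c \cdot 2\sqrt{b} \tag{from Eq. 3}$
$\Delta a = \Delta b \cdot a \tag{from Eq. 2}$
$\Delta x = \Delta a / (2x) \tag{from Eq. 1}$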
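
The chain of local derivatives can be verified numerically. The sketch below multiplies the three derivatives from Eqs. (1) to (3) and compares the product with a central finite difference; the function names and the step `h` are choices made for this illustration only.

```python
import math

def forward(x):
    """Forward pass of the example: returns a, b, c."""
    a = x * x
    b = math.log(a)
    c = math.sqrt(b)
    return a, b, c

def dc_dx_chain(x):
    """Multiply the three local derivatives from Eqs. (1)-(3)."""
    a, b, _ = forward(x)
    da_dx = 2 * x                    # Eq. (1)
    db_da = 1 / a                    # Eq. (2)
    dc_db = 1 / (2 * math.sqrt(b))   # Eq. (3)
    return dc_db * db_da * da_dx

def dc_dx_numeric(x, h=1e-6):
    """Central finite difference, used only as an independent check."""
    return (forward(x + h)[2] - forward(x - h)[2]) / (2 * h)

x = 2.0
analytic = dc_dx_chain(x)
numeric = dc_dx_numeric(x)

# One backward pass divides the output error by this overall slope.
delta_c = forward(x)[2] - 2.13
delta_x = delta_c / analytic
print(f"chain rule: {analytic:.6f}, finite difference: {numeric:.6f}, delta_x: {delta_x:.6f}")
```

Dividing $\Delta c$ by the overall slope $dc/dx$ is the same as applying the three backward formulas one after another.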

Starting from the initial value $x = 2$ with $\Delta x=0$, the step-by-step results are listed in Table 2-2.

Table 2-2 Forward and backward iterative computation

|Direction|Formula|Iteration 1|Iteration 2|Iteration 3|Iteration 4|Iteration 5|
|-|-|-|-|-|-|-|
|Forward|$x=x-\Delta x$|2|4.243|7.344|9.295|9.656|
|Forward|$a=x^2$|4|18.005|53.934|86.404|93.233|
|Forward|$b=\ln(a)$|1.386|2.891|3.988|4.459|4.535|
|Forward|$c=\sqrt{b}$|1.177|1.700|1.997|2.112|2.129|
||Label $y$|2.13|2.13|2.13|2.13|2.13|
|Backward|$\Delta c = c - y$|-0.953|-0.430|-0.133|-0.018||
|Backward|$\Delta b = \Delta c \cdot 2\sqrt{b}$|-2.243|-1.462|-0.531|-0.078||
|Backward|$\Delta a = \Delta b \cdot a$|-8.973|-26.317|-28.662|-6.698||
|Backward|$\Delta x = \Delta a / 2x$|-2.243|-3.101|-1.951|-0.360||

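In Table 2-2, start with the "Iteration 1" column: read from top to bottom, it is one complete forward plus backward pass. The last row is $-2.243$. Carry it to the first row of the "Iteration 2" column, where $2 - (-2.243) = 4.243$, and continue downward. By the fifth round the forward pass yields $c = 2.129$, within $0.001$ of the target $2.13$, so the iteration stops.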

Running the sample code produces the following output:

how to play: 1) input x, 2) calculate c, 3) input target number but not faraway from c
input x as initial number(1.2,10), you can try 1.3:
2
c=1.177410
input y as target number(0.5,2), you can try 1.8:
2.13
forward...
x=2.000000,a=4.000000,b=1.386294,c=1.177410
backward...
delta_c=-0.952590, delta_b=-2.243178, delta_a=-8.972712, delta_x=-2.243178
......
forward...
x=9.655706,a=93.232666,b=4.535098,c=2.129577
backward...
done!

To save space, only the first step and the last (fifth) step are shown: c=1.177410 at the first step and c=2.129577 at the last, where the iteration stops.

The code is as follows:

# Copyright (c) Microsoft. All rights reserved.
# Licensed under the MIT license. See LICENSE file in the project root for full license information.

import numpy as np
import matplotlib.pyplot as plt

def draw_fun(X,Y):
    x = np.linspace(1.2,10)
    a = x*x
    b = np.log(a)
    c = np.sqrt(b)
    plt.plot(x,c)

    plt.plot(X,Y,'x')

    d = 1/(x*np.sqrt(np.log(x**2)))
    plt.plot(x,d)
    plt.show()


def forward(x):
    a = x*x
    b = np.log(a)
    c = np.sqrt(b)
    return a,b,c

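def backward(x,a,b,c,y):
    # error at the output: difference between the prediction c and the target y
    loss = c - y
    delta_c = loss
    # undo c = sqrt(b): dc/db = 1/(2*sqrt(b)), so delta_b = delta_c * 2*sqrt(b)
    delta_b = delta_c * 2 * np.sqrt(b)
    # undo b = ln(a): db/da = 1/a, so delta_a = delta_b * a
    delta_a = delta_b * a
    # undo a = x^2: da/dx = 2x, so delta_x = delta_a / (2x)
    delta_x = delta_a / 2 / x
    return loss, delta_x, delta_a, delta_b, delta_c

def update(x, delta_x):
    x = x - delta_x
    # keep x inside the domain where sqrt(ln(x^2)) is real (x > 1)
    if x < 1:
        x = 1.1
    return x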

if __name__ == '__main__':
    print("how to play: 1) input x, 2) calculate c, 3) input target number but not faraway from c")
    print("input x as initial number(1.2,10), you can try 1.3:")
    line = input()
    x = float(line)
    
    a,b,c = forward(x)
    print("c=%f" %c)
    print("input y as target number(0.5,2), you can try 1.8:")
    line = input()
    y = float(line)

    error = 1e-3

    X,Y = [],[]

    for i in range(20):
        # forward
        print("forward...")
        a,b,c = forward(x)
        print("x=%f,a=%f,b=%f,c=%f" %(x,a,b,c))
        X.append(x)
        Y.append(c)
        # backward
        print("backward...")
        loss, delta_x, delta_a, delta_b, delta_c = backward(x,a,b,c,y)
        if abs(loss) < error:
            print("done!")
            break
        # update x
        x = update(x, delta_x)
        print("delta_c=%f, delta_b=%f, delta_a=%f, delta_x=%f\n" %(delta_c, delta_b, delta_a, delta_x))

    
    draw_fun(X,Y)
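
For checking Table 2-2 without typing inputs, the same loop can be written as a plain function. This is a sketch under stated assumptions: `solve_x` is a name introduced here, it skips plotting, and it omits the `x < 1` guard of `update`, so it is only meant for starting points whose iterates stay above 1.

```python
import math

def solve_x(x0, y, tol=1e-3, max_iter=20):
    """Repeat forward/backward passes until |c - y| < tol.

    Returns the list of (x, c) pairs, one per forward pass.
    """
    x = x0
    history = []
    for _ in range(max_iter):
        # forward pass
        a = x * x
        b = math.log(a)
        c = math.sqrt(b)
        history.append((x, c))
        # backward pass
        delta_c = c - y
        if abs(delta_c) < tol:
            break
        delta_b = delta_c * 2 * math.sqrt(b)
        delta_a = delta_b * a
        delta_x = delta_a / (2 * x)
        # update x
        x = x - delta_x
    return history

history = solve_x(2.0, 2.13)
for i, (x, c) in enumerate(history, start=1):
    print(f"iteration {i}: x={x:.3f}, c={c:.3f}")
```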


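2.3 Gradient Descent

2.3.1 Understanding Gradient Descent Through Natural Phenomena

Most articles use the example of "a person stranded on a mountain who needs to get down to the valley quickly", who will "look for the steepest direction at the current position and walk down it". That example ignores safety: a person cannot actually walk along the steepest direction and has to take the slope into account.

In nature, the best example of gradient descent is spring water flowing down a mountain:

Under gravity, water flows along the steepest direction at its current position, sometimes forming a waterfall (gradient descent);
The path down the mountain is not unique: at one location several directions may be equally steep, so the stream splits (multiple solutions can be obtained);
In a hollow the water may form a lake and stop descending (a local optimum is reached rather than the global optimum).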

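2.3.2 The Mathematics of Gradient Descent

The gradient descent formula is:

$\theta_{n+1} = \theta_{n} - \eta \cdot \nabla J(\theta_n) \tag{1}$

where:

$\theta_{n+1}$: the next value;
$\theta_n$: the current value;
$-$: the minus sign, meaning the direction opposite to the gradient;
$\eta$: the learning rate or step size, which controls how far each step goes: not so large that it overshoots the best point, not so small that it takes too long;
$\nabla$: the gradient, the direction of steepest ascent at the current position;
$J(\theta)$: the function.

The three elements of gradient descent

The current point;
The direction;
The step size.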
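
The three elements map directly onto a few lines of code. The sketch below is a generic loop for formula (1); the function name `gradient_descent` and the test objective $J(\theta) = (\theta - 3)^2$ are assumptions made for this illustration.

```python
def gradient_descent(grad, theta0, eta, steps):
    """Apply theta <- theta - eta * grad(theta) a fixed number of times."""
    theta = theta0                       # element 1: the current point
    for _ in range(steps):
        direction = -grad(theta)         # element 2: opposite to the gradient
        theta = theta + eta * direction  # element 3: step size eta
    return theta

# Hypothetical objective J(theta) = (theta - 3)^2 with gradient 2 * (theta - 3)
result = gradient_descent(lambda t: 2 * (t - 3), theta0=0.0, eta=0.1, steps=100)
print(result)
```

With $\eta = 0.1$ each step shrinks the distance to the minimum at $\theta = 3$ by a factor of $0.8$, so 100 steps land very close to it.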

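Why is it called "gradient descent"?

The name carries two meanings:

Gradient: the direction of steepest ascent of the function at the current position;
Descent: moving opposite to the derivative, which in mathematical terms is the minus sign.

In other words, moving opposite to the direction of ascent is descent.

Figure 2-9 Steps of gradient descent

Figure 2-9 illustrates gradient descent on both sides of the function's extremum. The goal of gradient descent is to move the value of x toward the extremum.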
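
To see the two-sided behavior in numbers, the sketch below starts once on each side of the minimum. The figure does not name its function, so $J(x) = x^2$, the step size $0.3$, and the helper name `descend` are assumptions for the illustration.

```python
def descend(x0, eta=0.3, steps=3):
    """Gradient descent on the assumed J(x) = x^2, whose derivative is 2x."""
    xs = [x0]
    for _ in range(steps):
        xs.append(xs[-1] - eta * 2 * xs[-1])
    return xs

left = descend(-1.2)    # derivative is negative here, so x increases
right = descend(1.2)    # derivative is positive here, so x decreases
print("from the left: ", left)
print("from the right:", right)
```

On the left the derivative is negative, so subtracting it moves $x$ to the right. On the right the derivative is positive, so $x$ moves left. Both paths approach the extremum at $0$.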

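2.3.3 Gradient Descent for a Single-Variable Function

Consider a single-variable function:

$J(x) = x ^2$

Our goal is to find the minimum of this function, so we compute its derivative:

$J^{'} (x) = 2 x$

Let the initial position be:

$x_0=1.2$

Let the learning rate be:

$\eta = 0.3$

From formula (1), the iteration formula is:

$x_{n+1} = x_{n} - \eta \cdot \nabla J(x_n)= x_{n} - \eta \cdot 2x_n$

With the termination condition $J (x) < 0.001$ (error = 1e-3 in the code below), the iteration proceeds as follows: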

x=0.480000, y=0.230400
x=0.192000, y=0.036864
x=0.076800, y=0.005898
x=0.030720, y=0.000944
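
For this particular function the recurrence can also be solved by hand, which explains the printed values. Substituting $\eta = 0.3$ into the iteration formula gives:

$x_{n+1} = x_n - 0.3 \cdot 2x_n = 0.4\,x_n \quad\Rightarrow\quad x_n = 1.2 \cdot 0.4^n$
$J(x_n) = x_n^2 = 1.44 \cdot 0.16^n$

For $n = 3$ this yields $x_3 = 0.0768$ and $J(x_3) \approx 0.005898$. For $n = 4$ it yields $x_4 = 0.03072$ and $J(x_4) \approx 0.000944$, matching the last two printed lines.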

# Copyright (c) Microsoft. All rights reserved.
# Licensed under the MIT license. See LICENSE file in the project root for full license information.

import numpy as np
import matplotlib.pyplot as plt

def target_function(x):
    y = x*x
    return y

def derivative_function(x):
    return 2*x

def draw_function():
    x = np.linspace(-1.2,1.2)
    y = target_function(x)
    plt.plot(x,y)

def draw_gd(X):
    Y = []
    for i in range(len(X)):
        Y.append(target_function(X[i]))
    
    plt.plot(X,Y)

if __name__ == '__main__':
    x = 1.2
    eta = 0.3
    error = 1e-3
    X = []
    X.append(x)
    y = target_function(x)
    while y > error:
        x = x - eta * derivative_function(x)
        X.append(x)
        y = target_function(x)
        print("x=%f, y=%f" %(x,y))


    draw_function()
    draw_gd(X)
    plt.show()

The code also plots the curve $J(x)=x^2$ together with the descent path (figure not reproduced here).

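2.3.4 Gradient Descent for Two Variables

Consider a two-variable function:

$J(x,y) = x^2 + \sin^2(y)$

Our goal is to find the minimum of this function, so we compute its partial derivatives:

${\partial{J(x,y)} \over \partial{x}} = 2x$
${\partial{J(x,y)} \over \partial{y}} = 2 \sin y \cos y$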
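
The two partial derivatives can be checked against finite differences. This is a small sketch: `grad_J` and `numeric_grad` are names introduced here, and the step `h` is an arbitrary small value.

```python
import math

def J(x, y):
    """The two-variable function from the text."""
    return x ** 2 + math.sin(y) ** 2

def grad_J(x, y):
    """Analytic partial derivatives: (2x, 2 sin(y) cos(y))."""
    return 2 * x, 2 * math.sin(y) * math.cos(y)

def numeric_grad(f, x, y, h=1e-6):
    """Central finite differences in each variable, as an independent check."""
    dx = (f(x + h, y) - f(x - h, y)) / (2 * h)
    dy = (f(x, y + h) - f(x, y - h)) / (2 * h)
    return dx, dy

gx, gy = grad_J(3.0, 1.0)
nx, ny = numeric_grad(J, 3.0, 1.0)
print(f"analytic: ({gx:.6f}, {gy:.6f}), numeric: ({nx:.6f}, {ny:.6f})")
```

Note that $2 \sin y \cos y = \sin 2y$, so the second component could equally be written as `math.sin(2 * y)`.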

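Let the initial position be:

$(x_0,y_0)=(3,1)$

Let the learning rate be:

$\eta = 0.1$

From formula (1), the iteration formula is:
$(x_{n+1},y_{n+1}) = (x_n,y_n) - \eta \cdot \nabla J(x_n,y_n) = (x_n,y_n) - \eta \cdot (2x_n, 2 \sin y_n \cos y_n) \tag{2}$

Using formula (2) with the termination condition $J (x, y) < 0.01$, the iteration proceeds as shown in Table 2-3.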

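Table 2-3 Iterations of two-variable gradient descent

|Iteration|x|y|J(x,y)|
|-|-|-|-|
|0 (initial point)|3|1|9.708073|
|1|2.4|0.909070|6.382415|
|…|…|…|…|
|15|0.105553|0.063481|0.015166|
|16|0.084442|0.050819|0.009711|

After 16 iterations, $J (x, y)$ is $0.009711$, which satisfies the condition of being less than $0.01$, so the iteration stops.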
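
The last row of the table can be reproduced without any plotting. The sketch below simply counts the updates until $J(x,y)$ drops below $0.01$; the variable names are chosen for this illustration.

```python
import math

def J(x, y):
    """The two-variable function from the text."""
    return x ** 2 + math.sin(y) ** 2

x, y, eta = 3.0, 1.0, 0.1
updates = 0
while J(x, y) >= 0.01:
    # simultaneous update: both right-hand sides use the old x and y
    x, y = x - eta * 2 * x, y - eta * 2 * math.sin(y) * math.cos(y)
    updates += 1

print(f"{updates} updates: x={x:.6f}, y={y:.6f}, J={J(x, y):.6f}")
```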

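The process above is illustrated in Table 2-4. Because there are two variables, a three-dimensional plot is needed. Note the faint black line in the middle of the two plots: it traces the gradient descent path, going downhill from the red highlands to the blue lowlands.

Table 2-4 Gradient descent in three-dimensional space

# Copyright (c) Microsoft. All rights reserved.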
# Licensed under the MIT license. See LICENSE file in the project root for full license information.

import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D

def target_function(x,y):
    J = x**2 + np.sin(y)**2
    return J

def derivative_function(theta):
    x = theta[0]
    y = theta[1]
    return np.array([2*x,2*np.sin(y)*np.cos(y)])

def show_3d_surface(x, y, z):
    fig = plt.figure()
    # Axes3D(fig) is no longer attached to the figure automatically in recent
    # matplotlib releases, so create the 3D axes through the figure instead
    ax = fig.add_subplot(projection='3d')

    u = np.linspace(-3, 3, 100)
    v = np.linspace(-3, 3, 100)
    X, Y = np.meshgrid(u, v)
    # evaluate J on the whole grid at once
    R = target_function(X, Y)

    ax.plot_surface(X, Y, R, cmap='rainbow')
    # draw the descent path on the same 3D axes
    ax.plot(x, y, z, c='black')
    plt.show()

if __name__ == '__main__':
    theta = np.array([3,1])
    eta = 0.1
    error = 1e-2

    X = []
    Y = []
    Z = []
    for i in range(100):
        print(theta)
        x=theta[0]
        y=theta[1]
        z=target_function(x,y)
        X.append(x)
        Y.append(y)
        Z.append(z)
        print("%d: x=%f, y=%f, z=%f" %(i,x,y,z))
        d_theta = derivative_function(theta)
        print("    ",d_theta)
        theta = theta - eta * d_theta
        if z < error:
            break
    show_3d_surface(X,Y,Z)

[Two 3D surface plots of the descent path: view angle 1 and view angle 2; images not reproduced]

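2.3.5 Choosing the Learning Rate η

In formulas the learning rate is written as $\eta$. In code it is named learning_rate or eta. The code below tries several learning rates on $J(x) = (x-1)^2 + 0.1$ to show how they affect the iteration, as summarized in Table 2-5.

Table 2-5 Effect of different learning rates on the iteration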

# Copyright (c) Microsoft. All rights reserved.
# Licensed under the MIT license. See LICENSE file in the project root for full license information.

import numpy as np
import matplotlib.pyplot as plt

def targetFunction(x):
    y = (x-1)**2 + 0.1
    return y

def derivativeFun(x):
    y = 2*(x-1)
    return y

def create_sample():
    x = np.linspace(-1,3,num=100)
    y = targetFunction(x)
    return x,y

def draw_base():
    x,y=create_sample()
    plt.plot(x,y,'.')
    plt.show()
    return x,y
   
def gd(eta):
    x = -0.8
    a = np.zeros((2,10))
    for i in range(10):
        a[0,i] = x
        a[1,i] = targetFunction(x)
        dx = derivativeFun(x)
        x = x - eta*dx
    
    plt.plot(a[0,:],a[1,:],'x')
    plt.plot(a[0,:],a[1,:])
    plt.title("eta=%f" %eta)
    plt.show()

if __name__ == '__main__':

    eta = [1.1,1.,0.8,0.6,0.4,0.2,0.1]

    for e in eta:
        X,Y=create_sample()
        plt.plot(X,Y,'.')
        #plt.show()
        gd(e)
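
The plots can be complemented with a numeric summary. For $J(x) = (x-1)^2 + 0.1$ each update multiplies the distance $x - 1$ by $(1 - 2\eta)$, so the iteration converges only when $|1 - 2\eta| < 1$, that is $0 < \eta < 1$. The sketch below measures that distance after 10 updates for the same learning rates; `distance_after` is a helper introduced here, not part of the original code.

```python
def distance_after(eta, x0=-0.8, steps=10):
    """Return |x - 1| after `steps` gradient updates on J(x) = (x-1)^2 + 0.1."""
    x = x0
    for _ in range(steps):
        x = x - eta * 2 * (x - 1)   # derivative of J is 2 * (x - 1)
    return abs(x - 1)

for eta in [1.1, 1.0, 0.8, 0.6, 0.4, 0.2, 0.1]:
    print(f"eta={eta}: |x-1| after 10 updates = {distance_after(eta):.6f}")
```

With $\eta = 1.1$ the distance grows at every step (divergence), with $\eta = 1.0$ the point jumps back and forth between $-0.8$ and $2.8$, and smaller values converge, slowly when $\eta$ is as small as $0.1$.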
