Generative learning: model the probability distribution of the input data
Discriminative learning: map inputs to outputs, i.e. separate classes of points
Neural network architecture
A neural network with two hidden layers
Multilayer Perceptron (MLP): despite the name, actually built from sigmoid neurons, not perceptrons
Suppose we want to recognize a handwritten digit image:
![](https://www.evernote.com/shard/s27/res/e3009f2b-229d-42d2-9c09-951e1b9fbcc6.png)
If the image is 64×64, the input layer has 64×64 = 4096 neurons
If the image is 28×28, the input layer has 28×28 = 784 neurons
If the output layer has a single neuron, an output > 0.5 means the image is a 9; < 0.5 means it is not
Cost function (also called loss function or objective function):
![](Introduction%20to%20Deep%20Learning%20Advance%20Algorithm%20&%20Applications_files/screenshot-neuralnetworksanddeeplearning.com%202015-09-13%2020-57-03.png)
C: the cost
w: the weights
b: the biases
n: the number of training examples
x: an input
a: the network's output when x is the input
||v||: the length (norm) of vector v
The smaller C(w,b), the smaller the gap between the predicted and true values
Goal: minimize C(w,b)
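Written out, the quadratic cost shown in the screenshot above is:

```latex
C(w,b) = \frac{1}{2n} \sum_x \left\| \, y(x) - a \, \right\|^2
```

where the sum runs over the n training inputs x, y(x) is the desired output, and a is the network's output for input x.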
The minimization problem can be solved with gradient descent
Example: C(v) where v has two components v1, v2
Gradient descent can get stuck in a local optimum
It is guaranteed to find the global minimum only when the objective function is convex
The effective step size shrinks automatically as the gradient approaches zero near a minimum
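The update rule v → v − η∇C(v) can be sketched on a toy convex function C(v1, v2) = v1² + v2², chosen purely for illustration (the function and names below are mine, not from the notes):

```python
# Minimal gradient descent on C(v1, v2) = v1^2 + v2^2, whose gradient
# is (2*v1, 2*v2) and whose global minimum is at (0, 0).
def gradient_descent(grad, v, eta=0.1, steps=100):
    for _ in range(steps):
        g = grad(v)
        # move a small step in the direction of steepest descent
        v = [vi - eta * gi for vi, gi in zip(v, g)]
    return v

v_min = gradient_descent(lambda v: [2 * v[0], 2 * v[1]], [3.0, -4.0])
print(v_min)  # both components approach 0, the global minimum
```

Note how the updates shrink as v approaches the minimum: the gradient itself goes to zero, so even with a fixed η the steps get smaller.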
The backpropagation algorithm
5.1 Iterate over the examples in the training set
5.2 Compare the network's predicted value at the output layer with the target value
5.3 Work backwards (output layer => hidden layers => input layer), updating each connection's weight so as to minimize the error
5.4 The algorithm in detail
Input: D, a data set; l, the learning rate; a multilayer feed-forward network
Output: a trained neural network
5.4.1 Initialize the weights and biases: randomly, e.g. in [-1, 1] or [-0.5, 0.5]; each unit has its own bias
5.4.2 For each training example X:
5.4.2.1: Propagate the inputs forward from the input layer
![](Introduction%20to%20Deep%20Learning%20Advance%20Algorithm%20&%20Applications_files/Image%20[47].png)
![](Introduction%20to%20Deep%20Learning%20Advance%20Algorithm%20&%20Applications_files/Image%20[48].png)
![](Introduction%20to%20Deep%20Learning%20Advance%20Algorithm%20&%20Applications_files/Image%20[49].png)
![](Introduction%20to%20Deep%20Learning%20Advance%20Algorithm%20&%20Applications_files/Image%20[50].png)
5.4.2.2 Propagate the error backward
For the output layer:
![](Introduction%20to%20Deep%20Learning%20Advance%20Algorithm%20&%20Applications_files/Image%20[51].png)
For the hidden layers:
![](Introduction%20to%20Deep%20Learning%20Advance%20Algorithm%20&%20Applications_files/Image%20[52].png)
Weight update:
![](Introduction%20to%20Deep%20Learning%20Advance%20Algorithm%20&%20Applications_files/Image%20[53].png)
Bias update:
![](Introduction%20to%20Deep%20Learning%20Advance%20Algorithm%20&%20Applications_files/Image%20[54].png)
5.4.3 Termination conditions
5.4.3.1 The weight updates fall below some threshold
5.4.3.2 The prediction error rate falls below some threshold
5.4.3.3 A preset number of iterations is reached
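The forward and backward passes (steps 5.4.2.1 and 5.4.2.2) can be sketched for a single training example; the tiny [2, 3, 1] sigmoid network, the quadratic cost, and all variable names below are illustrative, not from the notes:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
W1, b1 = rng.standard_normal((3, 2)), np.zeros((3, 1))  # input -> hidden
W2, b2 = rng.standard_normal((1, 3)), np.zeros((1, 1))  # hidden -> output
x, y = np.array([[0.5], [0.2]]), np.array([[1.0]])      # one training example
eta = 0.5                                               # learning rate l

# 5.4.2.1: propagate the input forward
a1 = sigmoid(W1 @ x + b1)
a2 = sigmoid(W2 @ a1 + b2)
cost_before = 0.5 * ((a2 - y) ** 2).item()

# 5.4.2.2: propagate the error backward (quadratic cost)
delta2 = (a2 - y) * a2 * (1 - a2)          # output-layer error
delta1 = (W2.T @ delta2) * a1 * (1 - a1)   # hidden-layer error

# update weights and biases
W2 -= eta * delta2 @ a1.T; b2 -= eta * delta2
W1 -= eta * delta1 @ x.T;  b1 -= eta * delta1

# one update step reduces the cost on this example
cost_after = 0.5 * ((sigmoid(W2 @ sigmoid(W1 @ x + b1) + b2) - y) ** 2).item()
```

Repeating this loop over all training examples until one of the 5.4.3 conditions holds is the whole algorithm.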
The cross-entropy cost function
Consider a slightly more complex neural network:
![](Introduction%20to%20Deep%20Learning%20Advance%20Algorithm%20&%20Applications_files/screenshot-neuralnetworksanddeeplearning.com%202015-09-20%2021-24-18.png)
![](Introduction%20to%20Deep%20Learning%20Advance%20Algorithm%20&%20Applications_files/screenshot-neuralnetworksanddeeplearning.com%202015-09-20%2021-25-05.png)
![](Introduction%20to%20Deep%20Learning%20Advance%20Algorithm%20&%20Applications_files/screenshot-neuralnetworksanddeeplearning.com%202015-09-20%2021-25-25.png)
Define the cross-entropy function:
![](Introduction%20to%20Deep%20Learning%20Advance%20Algorithm%20&%20Applications_files/screenshot-neuralnetworksanddeeplearning.com%202015-09-20%2021-26-08%20[1].png)
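Written out, the cross-entropy cost for a single sigmoid neuron is:

```latex
C = -\frac{1}{n} \sum_x \left[ \, y \ln a + (1 - y) \ln (1 - a) \, \right]
```

where the sum runs over the n training inputs x, y is the desired output, and a is the neuron's actual output.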
Why does this work as a cost function?
1. Its value is always >= 0 (verify)
2. When a = y, the cost is 0
![](Introduction%20to%20Deep%20Learning%20Advance%20Algorithm%20&%20Applications_files/screenshot-neuralnetworksanddeeplearning.com%202015-09-20%2021-28-31.png)
![](Introduction%20to%20Deep%20Learning%20Advance%20Algorithm%20&%20Applications_files/screenshot-neuralnetworksanddeeplearning.com%202015-09-20%2021-29-08.png)
Using the definition of the sigmoid function
![](Introduction%20to%20Deep%20Learning%20Advance%20Algorithm%20&%20Applications_files/screenshot-neuralnetworksanddeeplearning.com%202015-09-20%2021-31-04.png)
we derive:
![](Introduction%20to%20Deep%20Learning%20Advance%20Algorithm%20&%20Applications_files/screenshot-neuralnetworksanddeeplearning.com%202015-09-20%2021-31-32.png)
![](Introduction%20to%20Deep%20Learning%20Advance%20Algorithm%20&%20Applications_files/screenshot-neuralnetworksanddeeplearning.com%202015-09-20%2021-32-36.png)
The speed of learning depends on
![](Introduction%20to%20Deep%20Learning%20Advance%20Algorithm%20&%20Applications_files/screenshot-neuralnetworksanddeeplearning.com%202015-09-20%2021-33-27.png)
that is, on the output error
Benefit: when the error is large, the updates are large and learning is fast; when the error is small, learning slows down
The bias behaves similarly:
![](Introduction%20to%20Deep%20Learning%20Advance%20Algorithm%20&%20Applications_files/screenshot-neuralnetworksanddeeplearning.com%202015-09-20%2021-34-52.png)
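Written out, the two derivatives in the screenshots above are:

```latex
\frac{\partial C}{\partial w_j} = \frac{1}{n} \sum_x x_j \left( \sigma(z) - y \right),
\qquad
\frac{\partial C}{\partial b} = \frac{1}{n} \sum_x \left( \sigma(z) - y \right)
```

Both are proportional to the output error σ(z) − y, with no σ′(z) factor, which is exactly why the learning slowdown of the quadratic cost disappears.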
Demo with cross-entropy:
w = 0.6, b = 0.9
![](Introduction%20to%20Deep%20Learning%20Advance%20Algorithm%20&%20Applications_files/screenshot-neuralnetworksanddeeplearning.com%202015-09-20%2021-36-10.png)
w = 2.0, b = 2.0
![](Introduction%20to%20Deep%20Learning%20Advance%20Algorithm%20&%20Applications_files/screenshot-neuralnetworksanddeeplearning.com%202015-09-20%2021-36-52.png)
Compared with the earlier quadratic cost:
The learning rate here is 0.005, but the exact value is not the point; what matters is how the learning speed changes, i.e. the shape of the curves.
The above is the cost for a single neuron; for a multi-neuron output layer:
![](Introduction%20to%20Deep%20Learning%20Advance%20Algorithm%20&%20Applications_files/screenshot-neuralnetworksanddeeplearning.com%202015-09-20%2021-40-21.png)
This sums the contributions of all the neurons in the output layer
Summary:
The cross-entropy cost is almost always a better choice than the quadratic cost
If the output neurons are linear, the quadratic cost is fine (there is no learning-slowdown problem)
Regularization
The most common form of regularization: L2 regularization (weight decay)
Regularized cross-entropy:
![](Introduction%20to%20Deep%20Learning%20Advance%20Algorithm%20&%20Applications_files/screenshot-neuralnetworksanddeeplearning.com%202015-09-21%2017-08-36.png)
An extra term is added: the sum of the squared weights (over all weights w in the network)
λ > 0: the regularization parameter
n: the number of training examples
Similarly, for the quadratic cost,
Regularized quadratic cost:
![](Introduction%20to%20Deep%20Learning%20Advance%20Algorithm%20&%20Applications_files/screenshot-neuralnetworksanddeeplearning.com%202015-09-21%2017-13-43.png)
Both cases can be written in the general form:
![](Introduction%20to%20Deep%20Learning%20Advance%20Algorithm%20&%20Applications_files/screenshot-neuralnetworksanddeeplearning.com%202015-09-21%2017-14-46.png)
The regularized cost prefers small weights w, unless larger weights considerably reduce the first term C0.
λ adjusts the relative importance of the two terms: a smaller λ favors minimizing C0; a larger λ favors minimizing the added term (keeping the weights small).
Taking partial derivatives of the formula above:
![](Introduction%20to%20Deep%20Learning%20Advance%20Algorithm%20&%20Applications_files/screenshot-neuralnetworksanddeeplearning.com%202015-09-21%2017-21-05.png)
Both partial derivatives can be computed with the backpropagation algorithm introduced earlier,
with one extra term added:
![](Introduction%20to%20Deep%20Learning%20Advance%20Algorithm%20&%20Applications_files/screenshot-neuralnetworksanddeeplearning.com%202015-09-21%2017-22-55.png)
For the bias b, the partial derivative is unchanged
By gradient descent, the update rules become:
![](Introduction%20to%20Deep%20Learning%20Advance%20Algorithm%20&%20Applications_files/screenshot-neuralnetworksanddeeplearning.com%202015-09-21%2017-31-38.png)
![](Introduction%20to%20Deep%20Learning%20Advance%20Algorithm%20&%20Applications_files/screenshot-neuralnetworksanddeeplearning.com%202015-09-21%2017-32-09.png)
For stochastic gradient descent:
![](Introduction%20to%20Deep%20Learning%20Advance%20Algorithm%20&%20Applications_files/screenshot-neuralnetworksanddeeplearning.com%202015-09-21%2017-36-13.png)
![](Introduction%20to%20Deep%20Learning%20Advance%20Algorithm%20&%20Applications_files/screenshot-neuralnetworksanddeeplearning.com%202015-09-21%2017-36-49.png)
The sum runs over all examples x in a mini-batch
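Written out, the gradients and update rules from the screenshots above are (C0 is the unregularized cost, m the mini-batch size):

```latex
\frac{\partial C}{\partial w} = \frac{\partial C_0}{\partial w} + \frac{\lambda}{n} w,
\qquad
\frac{\partial C}{\partial b} = \frac{\partial C_0}{\partial b}
```

so gradient descent rescales (decays) each weight before the usual step:

```latex
w \to \left(1 - \frac{\eta\lambda}{n}\right) w - \eta \frac{\partial C_0}{\partial w},
\qquad
b \to b - \eta \frac{\partial C_0}{\partial b}
```

and for stochastic gradient descent the C0 gradient is estimated from a mini-batch:

```latex
w \to \left(1 - \frac{\eta\lambda}{n}\right) w - \frac{\eta}{m} \sum_x \frac{\partial C_x}{\partial w}
```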
Experiment:
Hidden layer: 30 neurons; mini-batch size: 10; learning rate: 0.5; cross-entropy cost
>>> import mnist_loader
>>> training_data, validation_data, test_data = \
... mnist_loader.load_data_wrapper()
>>> import network2
>>> net = network2.Network([784, 30, 10], cost=network2.CrossEntropyCost)
>>> net.large_weight_initializer()
>>> net.SGD(training_data[:1000], 400, 10, 0.5,
... evaluation_data=test_data, lmbda = 0.1,
... monitor_evaluation_cost=True, monitor_evaluation_accuracy=True,
... monitor_training_cost=True, monitor_training_accuracy=True)
This time, accuracy on the test data keeps increasing:
The peak accuracy is higher too, showing that regularization reduces overfitting
What if we use all 50,000 training images?
Same parameters: 30 epochs, learning rate 0.5, mini-batch size 10
λ needs to change, because n has gone from 1,000 to 50,000
![](Introduction%20to%20Deep%20Learning%20Advance%20Algorithm%20&%20Applications_files/screenshot-neuralnetworksanddeeplearning.com%202015-09-21%2017-52-16.png)
has changed, so λ must be increased to keep the same weight decay; increase it to 5.0
>>> net.large_weight_initializer()
>>> net.SGD(training_data, 30, 10, 0.5,
... evaluation_data=test_data, lmbda = 5.0,
... monitor_evaluation_accuracy=True, monitor_training_accuracy=True)
The results are much better: test accuracy improves, and the gap between the two curves shrinks greatly
What about 100 neurons in the hidden layer?
>>> net = network2.Network([784, 100, 10], cost=network2.CrossEntropyCost)
>>> net.large_weight_initializer()
>>> net.SGD(training_data, 30, 10, 0.5, lmbda=5.0,
... evaluation_data=validation_data,
... monitor_evaluation_accuracy=True)
The final accuracy on the test set reaches 97.92%, a large improvement over the 30-neuron hidden layer
With a little more tuning (learning rate 0.1, λ = 5.0), only 30 epochs are needed for accuracy to pass 98%, reaching 98.04%
Regularization not only reduces overfitting; it also helps avoid getting trapped in local minima, making experiments easier to reproduce
Why does regularization reduce overfitting?
Consider a simple data set
![](Introduction%20to%20Deep%20Learning%20Advance%20Algorithm%20&%20Applications_files/screenshot-neuralnetworksanddeeplearning.com%202015-09-21%2018-06-06.png)
y = 2x
Which model is better?
y = 2x is simpler and still describes the data well; the chance that this is a coincidence is small, so we prefer y = 2x
The 9th-order polynomial is more likely to be capturing local noise in the data
In a neural network:
A regularized network favors small weights; with small weights, random changes in x do not change the model's output much, so the network is less likely to be swayed by local noise in the data.
An unregularized network, with larger weights, can change its output substantially to fit the data, and so learns local noise more easily.
Regularized networks tend to learn simpler models.
Simpler models are not always better; the preference comes mostly from experiments on large amounts of data. For now, the finding that regularization improves generalization is largely empirical, and the theoretical justification is still an open research question.
Implementing an improved neural network for handwritten-digit recognition:
Review of the earlier basic version: Network.py
Improvements made:
Cost function: cross-entropy
Regularization: L1, L2
Softmax output layer
Weight initialization with standard deviation 1/sqrt(n_in)
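The 1/sqrt(n_in) initialization can be sketched in a few lines; the function name and shapes below are illustrative, not from network2.py:

```python
import numpy as np

# Weights drawn from a Gaussian with standard deviation 1/sqrt(n_in),
# so a neuron's weighted input z stays O(1) regardless of fan-in;
# biases drawn from a standard Gaussian N(0, 1).
def init_layer(n_in, n_out, rng=np.random.default_rng(0)):
    w = rng.standard_normal((n_out, n_in)) / np.sqrt(n_in)
    b = rng.standard_normal((n_out, 1))
    return w, b

w, b = init_layer(784, 30)  # e.g. the MNIST input -> hidden layer
```

With this scaling, the weighted inputs do not start out deep in the sigmoid's saturated region, so early learning is faster than with plain N(0, 1) weights.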
So far, our example networks have had only 3 layers (one hidden layer):
![](Introduction%20to%20Deep%20Learning%20Advance%20Algorithm%20&%20Applications_files/screenshot-neuralnetworksanddeeplearning.com%202015-09-24%2020-07-00.png)
We reached 98% accuracy with the network above
Deeper networks:
![](Introduction%20to%20Deep%20Learning%20Advance%20Algorithm%20&%20Applications_files/screenshot-neuralnetworksanddeeplearning.com%202015-09-24%2020-08-45.png)
can learn concepts at different levels of abstraction:
For example, on images: the first layer learns edges and corners, the second learns basic shapes, the third learns object-level concepts
How do we train deep neural networks?
Difficulty: different layers learn at markedly different rates
When the rate near the output layer is appropriate, the earlier layers learn too slowly and can get stuck
We have used several hyper-parameters so far, including:
learning rate: η
regularization parameter: λ
So far we just picked reasonable-looking values; how should hyper-parameters be chosen?
Example:
Set the following parameters:
Hidden layer: 30 neurons; mini-batch size: 10; train for 30 epochs
η = 10.0, λ = 1000.0
>>> import mnist_loader
>>> training_data, validation_data, test_data = \
... mnist_loader.load_data_wrapper()
>>> import network2
>>> net = network2.Network([784, 30, 10])
>>> net.SGD(training_data, 30, 10, 10.0, lmbda = 1000.0,
... evaluation_data=validation_data, monitor_evaluation_accuracy=True)
Result:
Epoch 0 training complete
Accuracy on evaluation data: 1030 / 10000
Epoch 1 training complete
Accuracy on evaluation data: 990 / 10000
Epoch 2 training complete
Accuracy on evaluation data: 1009 / 10000
...
Epoch 27 training complete
Accuracy on evaluation data: 1009 / 10000
Epoch 28 training complete
Accuracy on evaluation data: 983 / 10000
Epoch 29 training complete
Accuracy on evaluation data: 967 / 10000
No better than random guessing!
There are many factors to adjust in a neural network:
architecture: number of layers, number of neurons per layer
how w and b are initialized
cost function
regularization: L1, L2
sigmoid output or softmax?
use dropout?
training-set size
mini-batch size
learning rate: η
regularization parameter: λ
Overall strategy:
Start simple: begin experimenting
E.g. for MNIST, if unsure where to start, simplify to just the two classes 0 and 1, cut the data volume by 80%, and use a two-layer network [784, 2] (faster than [784, 30, 2])
Get feedback faster: instead of checking accuracy once per epoch, check every 1,000 images,
and shrink the validation set, e.g. use 100 images instead of 10,000
Repeat the experiment:
>>> net = network2.Network([784, 10])
>>> net.SGD(training_data[:1000], 30, 10, 10.0, lmbda = 1000.0, \
... evaluation_data=validation_data[:100], \
... monitor_evaluation_accuracy=True)
Epoch 0 training complete
Accuracy on evaluation data: 10 / 100
Epoch 1 training complete
Accuracy on evaluation data: 10 / 100
Epoch 2 training complete
Accuracy on evaluation data: 10 / 100
...
Feedback now arrives much faster: each round previously took about 10 seconds, now under 1 second.
λ was 1000 before; since the training set shrank, λ must shrink proportionally to keep the weight decay the same: λ = 20.0
>>> net = network2.Network([784, 10])
>>> net.SGD(training_data[:1000], 30, 10, 10.0, lmbda = 20.0, \
... evaluation_data=validation_data[:100], \
... monitor_evaluation_accuracy=True)
Result:
Epoch 0 training complete
Accuracy on evaluation data: 12 / 100
Epoch 1 training complete
Accuracy on evaluation data: 14 / 100
Epoch 2 training complete
Accuracy on evaluation data: 25 / 100
Epoch 3 training complete
Accuracy on evaluation data: 18 / 100
Maybe the learning rate η = 10.0 is too low? Should it be higher?
Increase it to 100:
>>> net = network2.Network([784, 10])
>>> net.SGD(training_data[:1000], 30, 10, 100.0, lmbda = 20.0, \
... evaluation_data=validation_data[:100], \
... monitor_evaluation_accuracy=True)
Result:
Epoch 0 training complete
Accuracy on evaluation data: 10 / 100
Epoch 1 training complete
Accuracy on evaluation data: 10 / 100
Epoch 2 training complete
Accuracy on evaluation data: 10 / 100
Epoch 3 training complete
Accuracy on evaluation data: 10 / 100
Very poor results; maybe the learning rate should be lower instead? Try η = 1.0:
>>> net = network2.Network([784, 10])
>>> net.SGD(training_data[:1000], 30, 10, 1.0, lmbda = 20.0, \
... evaluation_data=validation_data[:100], \
... monitor_evaluation_accuracy=True)
Much better:
Epoch 0 training complete
Accuracy on evaluation data: 62 / 100
Epoch 1 training complete
Accuracy on evaluation data: 42 / 100
Epoch 2 training complete
Accuracy on evaluation data: 43 / 100
Epoch 3 training complete
Accuracy on evaluation data: 61 / 100
Now hold the other parameters fixed: 30 epochs, mini-batch size 10, λ = 5.0
and try learning rates 0.025, 0.25, 2.5
For the learning rate, start from 0.001, 0.01, 0.1, 1, 10; when the cost starts to increase, stop and try smaller values, then fine-tune
For MNIST this first finds 0.1, then 0.5, then settles at 0.25
For early stopping: stop when accuracy has changed very little over a stretch of epochs (not just one or two)
So far the learning rate has been held constant; it can start larger and decrease later: e.g. keep it constant until validation accuracy starts to worsen, then divide it by 2 or 3
For the regularization parameter λ:
Start without regularization and tune the learning rate first; then experiment with λ = 1.0, 10, 100, ..., find a good value, then fine-tune
For the mini-batch size:
too small: the fast matrix libraries and hardware are under-used
too large: the weights and biases are updated too infrequently
Fortunately the mini-batch size is fairly independent of the other parameters, so once chosen it need not be re-tuned
Automated search:
grid search over combinations of parameters
Random search: Random Search for Hyper-Parameter Optimization, James Bergstra and Yoshua Bengio (2012).
Efficient BackProp, Yann LeCun, Léon Bottou, Genevieve Orr and Klaus-Robert Müller (1998).
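A random search in the spirit of Bergstra and Bengio (2012) can be sketched as follows; `train_and_score` is a hypothetical stand-in for training a network and returning validation accuracy, and the sampling ranges are illustrative, not prescriptive:

```python
import random

# Randomly sample hyper-parameter settings on a log scale and keep the
# best one, instead of exhaustively walking a grid.
def random_search(train_and_score, n_trials=20, rng=random.Random(0)):
    best = (None, -1.0)
    for _ in range(n_trials):
        eta = 10 ** rng.uniform(-3, 1)    # learning rate in [0.001, 10]
        lmbda = 10 ** rng.uniform(-1, 2)  # regularization in [0.1, 100]
        score = train_and_score(eta, lmbda)
        if score > best[1]:
            best = ((eta, lmbda), score)
    return best

# toy stand-in scorer: pretend accuracy peaks near eta = 0.5, lmbda = 5.0
best_params, best_score = random_search(
    lambda eta, lmbda: -abs(eta - 0.5) - 0.01 * abs(lmbda - 5.0))
```

Sampling on a log scale matters because a good learning rate might lie anywhere between 0.001 and 10, and random search covers each individual dimension more densely than a grid with the same budget.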
Parameters influence one another
How to choose good hyper-parameters is still an open research topic
Other variants of stochastic gradient descent: Hessian optimization, momentum-based gradient descent
Besides sigmoid, what other artificial-neuron models are there?
tanh
tanh(w⋅x+b)
![](Introduction%20to%20Deep%20Learning%20Advance%20Algorithm%20&%20Applications_files/screenshot-neuralnetworksanddeeplearning.com%202015-09-23%2016-15-36.png)
![](Introduction%20to%20Deep%20Learning%20Advance%20Algorithm%20&%20Applications_files/screenshot-neuralnetworksanddeeplearning.com%202015-09-23%2016-16-18.png)
tanh is just a rescaled version of the sigmoid function
![](Introduction%20to%20Deep%20Learning%20Advance%20Algorithm%20&%20Applications_files/screenshot-neuralnetworksanddeeplearning.com%202015-09-23%2016-18-23.png)
its outputs lie between -1 and 1, not between 0 and 1 like the sigmoid, so the inputs should be rescaled to the range [-1, 1]
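The rescaling in the screenshot above is the standard identity relating the two functions:

```latex
\sigma(z) = \frac{1 + \tanh(z/2)}{2}
```

so a tanh network can represent exactly what a sigmoid network can, up to a linear rescaling of activations.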
The rectified linear neuron:
max(0,w⋅x+b)
![](Introduction%20to%20Deep%20Learning%20Advance%20Algorithm%20&%20Applications_files/screenshot-neuralnetworksanddeeplearning.com%202015-09-23%2016-26-24.png)
Like sigmoid and tanh, it can also approximate any function
Advantage: increasing the weighted input does not cause saturation; however, when the weighted input is negative the gradient is 0
Whether rectified linear units beat sigmoid and tanh has to be determined experimentally
Many aspects of neural networks still lack theoretical foundations: why they learn so well is not fully understood. Current experiments show good results, but developing the underlying theory still has a long way to go
The vanishing gradient problem:
>>> import mnist_loader
>>> training_data, validation_data, test_data = \
... mnist_loader.load_data_wrapper()
>>> import network2
>>> net = network2.Network([784, 30, 10])
>>> net.SGD(training_data, 30, 10, 0.1, lmbda=5.0,
... evaluation_data=validation_data, monitor_evaluation_accuracy=True)
Result: 96.48%
Add a hidden layer:
>>> net = network2.Network([784, 30, 30, 10])
>>> net.SGD(training_data, 30, 10, 0.1, lmbda=5.0,
... evaluation_data=validation_data, monitor_evaluation_accuracy=True)
Result: 96.90%
Add yet another hidden layer:
>>> net = network2.Network([784, 30, 30, 30, 10])
>>> net.SGD(training_data, 30, 10, 0.1, lmbda=5.0,
... evaluation_data=validation_data, monitor_evaluation_accuracy=True)
Result: 96.57%
Why does adding a layer lower the accuracy?
The length of each bar represents ∂C/∂b, the rate of change of the cost with respect to that neuron's bias:
![](Introduction%20to%20Deep%20Learning%20Advance%20Algorithm%20&%20Applications_files/screenshot-neuralnetworksanddeeplearning.com%202015-09-24%2020-20-09.png)
With random initialization, the first hidden layer learns far more slowly than the second
This can be verified by computation:
![](Introduction%20to%20Deep%20Learning%20Advance%20Algorithm%20&%20Applications_files/screenshot-neuralnetworksanddeeplearning.com%202015-09-24%2020-32-16.png)
![](Introduction%20to%20Deep%20Learning%20Advance%20Algorithm%20&%20Applications_files/screenshot-neuralnetworksanddeeplearning.com%202015-09-24%2020-34-03.png)
represents the learning speed
In the figure above:
∥δ1∥=0.07, ∥δ2∥=0.31
In the 5-layer network [784, 30, 30, 30, 10]:
the learning speeds of the hidden layers are 0.012, 0.060, and 0.283 respectively
These are only the initial speeds; as training proceeds over the epochs, the speeds change:
The first hidden layer ends up learning almost 100x more slowly than the fourth
This phenomenon is common in neural networks and is called the vanishing gradient problem
In the opposite case, the gradients of the earlier layers are much larger than those of the later layers: the exploding gradient problem
Either way, gradient-based learning in neural networks is unstable
To train deep networks, the vanishing gradient problem must be solved
Exploding gradient problem:
Suppose we try to fix the problem above by:
1. initializing large weights, e.g. w1 = w2 = w3 = w4 = 100
2. choosing the biases so that
![](Introduction%20to%20Deep%20Learning%20Advance%20Algorithm%20&%20Applications_files/screenshot-neuralnetworksanddeeplearning.com%202015-09-24%2021-22-47.png)
is not too small
For example, to make σ′ as large as possible (i.e. equal to 1/4), we can choose b so that z = 0:
b1 = -100*a0
z1 = 100 * a0 + (-100*a0) = 0
In that case:
![](Introduction%20to%20Deep%20Learning%20Advance%20Algorithm%20&%20Applications_files/screenshot-neuralnetworksanddeeplearning.com%202015-09-24%2021-27-25.png)
= 100 * 1/4 = 25
Each layer's gradient term is then 25 times the next layer's, and we get the exploding problem instead
Fundamentally, the issue is neither vanishing nor exploding per se: the gradient of an early layer is the product of terms contributed by all the later layers, which makes the network intrinsically unstable. The only escape would be for those products to balance out to roughly 1, and that is very unlikely.
So this is really an unstable gradient problem: once there are many layers, the layers learn at very different rates
Overall, though, the vanishing problem is the common case:
To overcome the vanishing problem, we need
![](Introduction%20to%20Deep%20Learning%20Advance%20Algorithm%20&%20Applications_files/screenshot-neuralnetworksanddeeplearning.com%202015-09-24%2021-27-25%20[1].png)
to have absolute value > 1; we can try making w large, but σ′(z) also depends on w: σ′(z) = σ′(wa + b)
so while making w large we must keep σ′(wa + b) from becoming small; that rarely works out, unless the input stays within a very narrow range
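For the one-neuron-per-layer chain discussed above (four layers, in the notation of Nielsen's book, on which these notes are based), the gradient at the first bias is the product:

```latex
\frac{\partial C}{\partial b_1}
= \sigma'(z_1)\, w_2\, \sigma'(z_2)\, w_3\, \sigma'(z_3)\, w_4\, \sigma'(z_4)\,
  \frac{\partial C}{\partial a_4}
```

Since σ′(z) ≤ 1/4 everywhere, each factor satisfies |w_j σ′(z_j)| < 1/4 whenever |w_j| < 1, so the product typically shrinks geometrically with depth (vanishing); conversely, consistently making |w_j σ′(z_j)| > 1 makes it grow geometrically (exploding).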
The example so far had only one neuron per layer:
![](Introduction%20to%20Deep%20Learning%20Advance%20Algorithm%20&%20Applications_files/screenshot-neuralnetworksanddeeplearning.com%202015-09-24%2020-49-35.png)
With multiple neurons per layer:
![](Introduction%20to%20Deep%20Learning%20Advance%20Algorithm%20&%20Applications_files/screenshot-neuralnetworksanddeeplearning.com%202015-09-24%2021-40-02.png)
the gradient at layer l of an L-layer network is:
![](Introduction%20to%20Deep%20Learning%20Advance%20Algorithm%20&%20Applications_files/screenshot-neuralnetworksanddeeplearning.com%202015-09-24%2021-41-15.png)
the matrix-and-vector form is analogous to before
So any sigmoid-based network suffers from highly unstable gradient updates: the vanishing or exploding gradient problem
Other difficulties in training deep networks:
Glorot and Bengio (2010): the sigmoid can cause the activations of the final hidden layer to saturate near 0, and they suggest alternative activation functions
Sutskever, Martens, Dahl and Hinton (2013): on the random initialization of weights and biases, proposing momentum-based stochastic gradient descent
In summary, training deep neural networks involves many difficulties.
This lesson: the instability of neural networks
the choice of activation function
how weights and biases are initialized
the details of the update procedure
the choice of hyper-parameters
All of these are active research topics, and some effective solutions have already emerged
Approaches to the vanishing gradient problem:
![](Introduction%20to%20Deep%20Learning%20Advance%20Algorithm%20&%20Applications_files/screenshot-www.quora.com%202015-09-25%2001-44-25.png)
The softplus function can be approximated by a max function: max(0, x + N(0, 1))
The max function is called the rectified linear function (ReL)
![](Introduction%20to%20Deep%20Learning%20Advance%20Algorithm%20&%20Applications_files/screenshot-www.quora.com%202015-09-25%2001-48-34.png)
Main differences between the sigmoid and ReL functions:
sigmoid outputs lie in [0, 1], ReL outputs in [0, ∞); so sigmoid is suited to describing probabilities, ReL to describing real values
the sigmoid's gradient vanishes as x becomes large or small
the ReL function's does not:
gradient = 0 (if x < 0), gradient = 1 (if x > 0)
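The two cases above can be written directly (function names are mine, for illustration):

```python
# The rectified linear function and its gradient: the gradient is 0 for
# negative inputs and 1 for positive inputs, so it never saturates for
# large positive z the way the sigmoid does.
def relu(z):
    return max(0.0, z)

def relu_grad(z):
    return 0.0 if z < 0 else 1.0

print(relu(-2.0), relu(3.0))            # 0.0 3.0
print(relu_grad(-2.0), relu_grad(3.0))  # 0.0 1.0
```

Because the gradient is exactly 1 on the active side, products of such factors across layers neither shrink nor blow up, which is one intuition for why ReL units sidestep the vanishing gradient problem.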
Advantages of Rectified Linear Units in neural networks:
no vanishing gradient problem
Currently the most popular, best-performing approach overall:
Convolutional Neural Networks (CNN)
A CNN's structure is quite different: the input layer is a two-dimensional grid of neurons (28×28)
Baseline:
3 layers
hidden layer: 100 neurons
train for 60 epochs
learning rate = 0.1
mini-batch size: 10
>>> import network3
>>> from network3 import Network
>>> from network3 import ConvPoolLayer, FullyConnectedLayer, SoftmaxLayer
>>> training_data, validation_data, test_data = network3.load_data_shared()
>>> mini_batch_size = 10
>>> net = Network([
        FullyConnectedLayer(n_in=784, n_out=100),
        SoftmaxLayer(n_in=100, n_out=10)], mini_batch_size)
>>> net.SGD(training_data, 60, mini_batch_size, 0.1, validation_data, test_data)
Result: 97.80% accuracy (vs. 98.04% last lesson)
This time: no regularization; last time there was
This time: softmax output; last time: sigmoid + cross-entropy
Add a convolutional layer:
>>> net = Network([
        ConvPoolLayer(image_shape=(mini_batch_size, 1, 28, 28),
                      filter_shape=(20, 1, 5, 5),
                      poolsize=(2, 2)),
        FullyConnectedLayer(n_in=20*12*12, n_out=100),
        SoftmaxLayer(n_in=100, n_out=10)], mini_batch_size)
>>> net.SGD(training_data, 60, mini_batch_size, 0.1, validation_data, test_data)
Accuracy: 98.78%, a clear improvement
Add a second convolutional layer (two in total):
>>> net = Network([
        ConvPoolLayer(image_shape=(mini_batch_size, 1, 28, 28),
                      filter_shape=(20, 1, 5, 5),
                      poolsize=(2, 2)),
        ConvPoolLayer(image_shape=(mini_batch_size, 20, 12, 12),
                      filter_shape=(40, 20, 5, 5),
                      poolsize=(2, 2)),
        FullyConnectedLayer(n_in=40*4*4, n_out=100),
        SoftmaxLayer(n_in=100, n_out=10)], mini_batch_size)
>>> net.SGD(training_data, 60, mini_batch_size, 0.1, validation_data, test_data)
Accuracy: 99.06% (a new record)
Replace sigmoid with Rectified Linear Units:
f(z) = max(0, z)
>>> net = Network([
        ConvPoolLayer(image_shape=(mini_batch_size, 1, 28, 28),
                      filter_shape=(20, 1, 5, 5),
                      poolsize=(2, 2),
                      activation_fn=ReLU),
        ConvPoolLayer(image_shape=(mini_batch_size, 20, 12, 12),
                      filter_shape=(40, 20, 5, 5),
                      poolsize=(2, 2),
                      activation_fn=ReLU),
        FullyConnectedLayer(n_in=40*4*4, n_out=100, activation_fn=ReLU),
        SoftmaxLayer(n_in=100, n_out=10)], mini_batch_size)
>>> net.SGD(training_data, 60, mini_batch_size, 0.03, validation_data, test_data, lmbda=0.1)
Accuracy: 99.23%, slightly better than the 99.06% obtained with sigmoid
Expand the training set: shift each image up, down, left, and right by one pixel
Total training set: 50,000 * 5 = 250,000
$ python expand_mnist.py
>>> expanded_training_data, _, _ = network3.load_data_shared(
        "../data/mnist_expanded.pkl.gz")
>>> net = Network([
        ConvPoolLayer(image_shape=(mini_batch_size, 1, 28, 28),
                      filter_shape=(20, 1, 5, 5),
                      poolsize=(2, 2),
                      activation_fn=ReLU),
        ConvPoolLayer(image_shape=(mini_batch_size, 20, 12, 12),
                      filter_shape=(40, 20, 5, 5),
                      poolsize=(2, 2),
                      activation_fn=ReLU),
        FullyConnectedLayer(n_in=40*4*4, n_out=100, activation_fn=ReLU),
        SoftmaxLayer(n_in=100, n_out=10)], mini_batch_size)
>>> net.SGD(expanded_training_data, 60, mini_batch_size, 0.03, validation_data, test_data, lmbda=0.1)
Result: 99.37%
Add another 100-neuron hidden layer on the fully-connected side:
>>> net = Network([
        ConvPoolLayer(image_shape=(mini_batch_size, 1, 28, 28),
                      filter_shape=(20, 1, 5, 5),
                      poolsize=(2, 2),
                      activation_fn=ReLU),
        ConvPoolLayer(image_shape=(mini_batch_size, 20, 12, 12),
                      filter_shape=(40, 20, 5, 5),
                      poolsize=(2, 2),
                      activation_fn=ReLU),
        FullyConnectedLayer(n_in=40*4*4, n_out=100, activation_fn=ReLU),
        FullyConnectedLayer(n_in=100, n_out=100, activation_fn=ReLU),
        SoftmaxLayer(n_in=100, n_out=10)], mini_batch_size)
>>> net.SGD(expanded_training_data, 60, mini_batch_size, 0.03, validation_data, test_data, lmbda=0.1)
Result: 99.43%; not much of an improvement
possibly overfitting
Add dropout to the final fully-connected layers:
>>> expanded_training_data, _, _ = network3.load_data_shared(
        "../data/mnist_expanded.pkl.gz")
>>> net = Network([
        ConvPoolLayer(image_shape=(mini_batch_size, 1, 28, 28),
                      filter_shape=(20, 1, 5, 5),
                      poolsize=(2, 2),
                      activation_fn=ReLU),
        ConvPoolLayer(image_shape=(mini_batch_size, 20, 12, 12),
                      filter_shape=(40, 20, 5, 5),
                      poolsize=(2, 2),
                      activation_fn=ReLU),
        FullyConnectedLayer(
            n_in=40*4*4, n_out=1000, activation_fn=ReLU, p_dropout=0.5),
        FullyConnectedLayer(
            n_in=1000, n_out=1000, activation_fn=ReLU, p_dropout=0.5),
        SoftmaxLayer(n_in=1000, n_out=10, p_dropout=0.5)], mini_batch_size)
>>> net.SGD(expanded_training_data, 40, mini_batch_size, 0.03, validation_data, test_data)
Result: 99.60%, a significant improvement
epochs: reduced to 40
the fully-connected hidden layers have 1,000 neurons
Ensemble of networks: train several networks and let them vote on the result; this sometimes helps
Why apply dropout only to the final fully-connected layers?
The convolutional layers already resist overfitting: the shared weights force each filter to learn from the entire image
Why does this overcome some of the difficulties of deep learning?
CNNs greatly reduce the number of parameters
dropout reduces overfitting
Rectified Linear Units replace the sigmoid, avoiding saturation and the problem of very different learning speeds across layers
GPUs make computation much faster; each update does less work, but we can train for many more iterations
How deep are today's deep neural networks (how many layers)?
At most twenty-odd layers
Restricted Boltzmann Machine:
invented by Geoff Hinton
used for dimensionality reduction, classification, regression, feature learning
unsupervised learning
![](Introduction%20to%20Deep%20Learning%20Advance%20Algorithm%20&%20Applications_files/screenshot-deeplearning4j.org%202015-09-27%2020-17-48.png)
activation f((weight w * input x) + bias b ) = output a
Multiple inputs:
Multiple hidden layers:
Reconstruction:
the hidden layer becomes the input layer; update in the reverse direction using the old weights and new biases:
back to the original input layer:
compare the computed values with the original inputs, minimize the error, and keep iterating:
Forward pass: use the input to predict the neuron activations, i.e. the output probabilities given the weights: p(a|x; w)
In the backward pass:
the activations are fed back into the network to predict the original data x; the RBM estimates the probability of x given the activations a: p(x|a; w)
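The two passes described above can be sketched in numpy; the dimensions, data, and variable names are illustrative (a minimal sketch of one reconstruction step, not a full RBM trainer):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
n_visible, n_hidden = 6, 3
W = rng.standard_normal((n_hidden, n_visible)) * 0.1
b_h = np.zeros(n_hidden)   # hidden biases, used in the forward pass
b_v = np.zeros(n_visible)  # visible biases, used in the backward pass

x = np.array([1.0, 0.0, 1.0, 1.0, 0.0, 0.0])  # one binary input vector

# forward pass: hidden activation probabilities p(h|x; W)
h = sigmoid(W @ x + b_h)

# backward pass: reconstruct the visible layer p(x|h; W) using the
# same weights (transposed) and the visible biases
x_recon = sigmoid(W.T @ h + b_v)
```

Training would then adjust W, b_h, and b_v to shrink the gap between x and x_recon, iterating forward/backward as the notes describe.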
Deep Belief Network: a stack of Restricted Boltzmann Machines
neurons within a layer do not communicate with one another
the last layer is usually a classification layer (e.g. softmax)
Except for the first and last layers,
each layer plays two roles: hidden layer for the layer before it, input layer for the layer after it
Generative
Deep Autoencoders:
composed of two mirrored Deep Belief Networks:
![Alt text](Introduction%20to%20Deep%20Learning%20Advance%20Algorithm%20&%20Applications_files/Image%20[8].png)
each layer is a Restricted Boltzmann Machine:
for MNIST, the inputs are binarized
Encoding:
784 (input) ----> 1000 ----> 500 ----> 250 ----> 30
the first hidden layer is larger (1000 > 784) because a binary sigmoid-belief unit carries less information than a real number
Decoding:
784 (output) <---- 1000 <---- 500 <---- 250 <---- 30
used for dimensionality reduction, image search (compression), data compression, information retrieval
scikit-neuralnetwork:
the Iris data set:
https://en.wikipedia.org/wiki/Iris_flower_data_set
https://github.com/aigamedev/scikit-neuralnetwork
Example:
import logging
logging.basicConfig()
from sknn.mlp import Classifier, Layer
import numpy as np
from sklearn import cross_validation
from sklearn import datasets
# load the Iris data set (150 samples, 4 features, 3 classes)
iris = datasets.load_iris()
# iris.data.shape, iris.target.shape
# hold out 40% of the data as a test set
X_train, X_test, y_train, y_test = cross_validation.train_test_split(iris.data, iris.target, test_size=0.4, random_state=0)
# one rectified-linear hidden layer with 100 units, linear output layer
nn = Classifier(
    layers=[
        Layer("Rectifier", units=100),
        Layer("Linear")],
    learning_rate=0.02,
    n_iter=10)
nn.fit(X_train, y_train)
y_pred = nn.predict(X_test)
score = nn.score(X_test, y_test)
# print("y_test", y_test)
# print("y_pred", y_pred)
print("score", score)
Forward pass: given these pixels, should the weights send a stronger signal for "elephant" or for "dog"?
Backward pass: given "elephant" and "dog", what distribution of pixels should I expect?
Discriminative learning: map inputs to outputs, i.e. separate classes of points