Parameters & hyperparameters:
hyperparameters: learning rate $\alpha$, # iterations, # hidden layers, # hidden units, choice of activation function, mini-batch size, …
hyperparameters determine parameters to some extent.
bias and variance
- High variance: good on the training set, poor on the test set → regularization, data augmentation, early stopping (early stopping is not ideal: it mixes the job of optimizing J with the job of reducing overfitting)
- High bias: equally poor on both
- High variance and high bias: poor on the training set, even worse on the test set
- Improving Deep Neural Networks: Hyper-parameter tuning, Regularization and Optimization.
- Structuring your Machine Learning project.
- Convolutional Neural Networks.
- Natural Language Processing: Building sequence models.
2.11 Vectorization
L1 regularization: J += $\frac{\lambda}{2m} ||w||_1$
L2 regularization (weight decay by factor $1-\frac{\alpha \lambda}{m}$): J += $\frac{\lambda}{2m} ||w||^2_2$
$\lambda$: regularization parameter
The larger $\lambda$ is, the smaller $W$ becomes, and hence the smaller $Z$ (its range narrows), so the nonlinear activation functions operate mostly in their near-linear region; this lowers the network's expressive power and prevents overfitting.
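A minimal numpy sketch of how the L2 penalty above is typically wired into the cost and the gradient for one layer's weight matrix W; the function name and argument order are my own:

```python
import numpy as np

def add_l2_regularization(cost, dW, W, lambd, m):
    cost = cost + (lambd / (2 * m)) * np.sum(np.square(W))  # J += lambda/(2m) * ||W||_2^2
    dW = dW + (lambd / m) * W                                # extra gradient term from the penalty
    return cost, dW
```

With this extra gradient term, the plain update $w := w - \alpha\,dw$ becomes $w := (1-\frac{\alpha\lambda}{m})w - \alpha\,dw_{\text{orig}}$, which is exactly the weight-decay factor above.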
Dropout regularization:
Intuition for drop-out: cannot rely on any one feature, so have to spread out weights.
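A minimal sketch of inverted dropout for one layer's activations `a`; `keep_prob` and the rescaling step follow the usual convention, the function name is my own:

```python
import numpy as np

def inverted_dropout(a, keep_prob=0.8):
    mask = np.random.rand(*a.shape) < keep_prob  # keep each unit with probability keep_prob
    a = a * mask                                 # zero out the dropped units
    a = a / keep_prob                            # scale up so the expected activation is unchanged
    return a, mask                               # reuse mask to zero the same units in backprop
```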
exploding & vanishing gradients: in very deep networks, activations and gradients can grow or shrink exponentially with depth.
weight initialization:
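One common remedy for exploding/vanishing gradients is to scale each layer's initial weights by its fan-in; a minimal sketch of He initialization (variance $2/n^{[l-1]}$, suited to ReLU), with my own helper name:

```python
import numpy as np

def initialize_parameters_he(layer_dims):
    params = {}
    for l in range(1, len(layer_dims)):
        # variance 2 / n_prev keeps activations on a similar scale layer after layer
        params["W" + str(l)] = np.random.randn(layer_dims[l], layer_dims[l - 1]) * np.sqrt(2.0 / layer_dims[l - 1])
        params["b" + str(l)] = np.zeros((layer_dims[l], 1))
    return params
```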
Gradient check:
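A minimal sketch of gradient checking on a flattened parameter vector: compare the backprop gradient with a two-sided numerical estimate (function and argument names are my own):

```python
import numpy as np

def gradient_check(cost_fn, analytic_grad, theta, eps=1e-7):
    num_grad = np.zeros_like(theta)
    for i in range(theta.size):
        plus, minus = theta.copy(), theta.copy()
        plus[i] += eps
        minus[i] -= eps
        num_grad[i] = (cost_fn(plus) - cost_fn(minus)) / (2 * eps)  # two-sided difference
    diff = np.linalg.norm(analytic_grad - num_grad) / (np.linalg.norm(analytic_grad) + np.linalg.norm(num_grad))
    return diff  # roughly < 1e-7 suggests backprop is correct
```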
mini-batch gradient descent: a compromise between batch gradient descent and stochastic gradient descent.
epoch: a single pass through the training set.
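A minimal sketch of cutting a shuffled training set into mini-batches; running once over all of them is one epoch (examples stored as columns, as in the course notation; the helper name is my own):

```python
import numpy as np

def random_mini_batches(X, Y, batch_size=64, seed=0):
    rng = np.random.default_rng(seed)
    m = X.shape[1]                      # number of examples (stored as columns)
    perm = rng.permutation(m)           # shuffle so batches are not biased by data order
    X, Y = X[:, perm], Y[:, perm]
    return [(X[:, k:k + batch_size], Y[:, k:k + batch_size]) for k in range(0, m, batch_size)]
```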
Momentum: exponentially weighted averages of the gradient, with $\beta = 0.9$.
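A minimal sketch of the momentum update for a single parameter, using the $\beta = 0.9$ default above (names are my own):

```python
def momentum_step(w, dw, v_dw, beta=0.9, alpha=0.01):
    v_dw = beta * v_dw + (1 - beta) * dw  # exponentially weighted average of the gradients
    w = w - alpha * v_dw                  # step along the smoothed gradient
    return w, v_dw
```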
RMSprop: keep an exponentially weighted average $S_{dw}$ of the squared gradients, and update $w := w - \alpha \frac{dw}{\sqrt{S_{dw}}}$: this shrinks the step along directions with large gradients and enlarges it along directions with small gradients, which damps the oscillations.
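The same update written out as code, with the usual small epsilon added to the denominator for numerical stability (names are my own):

```python
import numpy as np

def rmsprop_step(w, dw, s_dw, beta2=0.999, alpha=0.01, eps=1e-8):
    s_dw = beta2 * s_dw + (1 - beta2) * dw ** 2   # exponentially weighted average of dw^2
    w = w - alpha * dw / (np.sqrt(s_dw) + eps)    # large S_dw -> smaller step, small S_dw -> larger step
    return w, s_dw
```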
Adam: combine Momentum and RMSprop.
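A minimal sketch of one Adam step for a single parameter: the momentum-style first moment and the RMSprop-style second moment from above, each bias-corrected by the iteration count t (names are my own):

```python
import numpy as np

def adam_step(w, dw, v_dw, s_dw, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    v_dw = beta1 * v_dw + (1 - beta1) * dw        # first moment (momentum term)
    s_dw = beta2 * s_dw + (1 - beta2) * dw ** 2   # second moment (RMSprop term)
    v_hat = v_dw / (1 - beta1 ** t)               # bias correction for early iterations
    s_hat = s_dw / (1 - beta2 ** t)
    w = w - alpha * v_hat / (np.sqrt(s_hat) + eps)
    return w, v_dw, s_dw
```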
Learning rate decay
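One common schedule from the course, where the rate shrinks with the epoch number (the helper name is my own):

```python
def decayed_learning_rate(alpha0, decay_rate, epoch_num):
    # alpha = alpha_0 / (1 + decay_rate * epoch_num)
    return alpha0 / (1 + decay_rate * epoch_num)
```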
the problem of local optima …
The local optima encountered in high-dimensional spaces are more likely to be saddle points: for an n-dimensional space, the probability that the surface curves the same way (all concave or all convex) in every one of the n dimensions is very small.
import numpy as np
import tensorflow as tf

w = tf.Variable(0, dtype=tf.float32)
cost = tf.add(tf.add(w**2, tf.multiply(-10., w)), 25)  # cost = w^2 - 10w + 25 = (w - 5)^2
train = tf.train.GradientDescentOptimizer(0.01).minimize(cost)
init = tf.global_variables_initializer()

session = tf.Session()
session.run(init)
print(session.run(w))  # 0.0

session.run(train)
print(session.run(w))  # 0.1

for i in range(1000):
    session.run(train)
print(session.run(w))  # 4.99999
Having a single real-number evaluation metric makes it much faster to screen for the better model.
The residual structure designed in ResNet makes it easy for the network to effectively ignore useless convolutional layers (layers whose parameters are nearly 0).
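A minimal, simplified sketch (one linear layer instead of the real two-conv block) of why the skip connection lets the network ignore a useless layer: if W and b are near zero, the block reduces to the identity mapping:

```python
import numpy as np

def relu(x):
    return np.maximum(0, x)

def residual_block(a_skip, W, b):
    z = W @ a_skip + b       # the block's own transformation (≈ 0 if W, b ≈ 0)
    return relu(z + a_skip)  # skip connection: add the input back before the activation
```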
Max pooling performs better than mean pooling in almost all tasks (averaging the features away is rarely useful).
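A small numerical illustration of the difference on a 4×4 feature map with 2×2 windows: max pooling keeps the strongest response in each window, while mean pooling averages it away:

```python
import numpy as np

x = np.arange(16, dtype=float).reshape(4, 4)
windows = x.reshape(2, 2, 2, 2).transpose(0, 2, 1, 3).reshape(2, 2, 4)  # the four 2x2 windows
max_pool = windows.max(axis=-1)    # [[ 5.  7.] [13. 15.]]
mean_pool = windows.mean(axis=-1)  # [[ 2.5  4.5] [10.5 12.5]]
```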