2. Hyperparameter tuning, Batch Normalization and Programming Frameworks - Note 3
1. Hyperparameter tuning
- Parameters to tune. The instructor's ranking of their importance (more * = more important):
  - $\alpha$ (learning rate) ***
  - $\beta$ (momentum, if not using Adam) **
  - $\beta_1, \beta_2, \epsilon$ (Adam): usually left at their defaults
  - #layers *
  - #hidden units **
  - learning rate decay *
  - mini-batch size **
- Sample random values in the hyperparameter space rather than using grid search; you can also search coarse to fine, progressively narrowing the range.
- Understand what each parameter means. For example, the exponentially weighted average parameter $\beta$ effectively averages over $\approx \frac{1}{1-\beta}$ past values, so sampling $\beta$ uniformly at random makes little sense; it should be sampled on a log scale (e.g., sample $1-\beta$ log-uniformly).
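A minimal sketch of log-scale random sampling (the search ranges $[10^{-4}, 10^{0}]$ for $\alpha$ and $[0.9, 0.999]$ for $\beta$ are illustrative assumptions, not from the course):

```python
import numpy as np

rng = np.random.default_rng(0)

# Learning rate: sample the exponent uniformly, so alpha is
# log-uniform in [1e-4, 1] rather than uniform on a linear scale.
r = rng.uniform(-4, 0)
alpha = 10 ** r

# Momentum: since beta averages over ~1/(1-beta) values, sample
# 1 - beta log-uniformly in [1e-3, 1e-1], giving beta in [0.9, 0.999].
r = rng.uniform(-3, -1)
beta = 1 - 10 ** r

print(f"alpha={alpha:.5f}, beta={beta:.4f}")
```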
2. Batch Normalization
Benefits:
- makes the network more robust;
- reduces sensitivity to the choice of hyperparameters;
- makes it easier to train deeper networks;
- provides a slight regularization effect.
Batch Norm normalizes $z$ (recall: $z = Wx + b$, $a = g(z)$):
$$
\begin{aligned}
\mu &= \frac{1}{m} \sum_i z_i \\
\sigma^2 &= \frac{1}{m} \sum_i (z_i - \mu)^2 \\
z_\text{norm}^{(i)} &= \frac{z^{(i)} - \mu}{\sqrt{\sigma^2 + \epsilon}} \\
\tilde{z}^{(i)} &= \gamma\, z_\text{norm}^{(i)} + \beta
\end{aligned}
$$
Here $\gamma$ and $\beta$ are not hyperparameters: they are determined during learning. Because the normalization subtracts the mean, the constant bias $b$ cancels out, so in the end only $W^{[i]}$, $\gamma^{[i]}$ and $\beta^{[i]}$ need to be learned, and they are learned in exactly the same way as $W$.
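A minimal NumPy sketch of the forward computation above (the function name, shapes, and the `eps` default are my own assumptions):

```python
import numpy as np

def batchnorm_forward(z, gamma, beta, eps=1e-8):
    """Normalize z over the mini-batch, then apply a learned scale and shift.

    z:     (n_units, m) pre-activations of one layer for a mini-batch of m
    gamma: (n_units, 1) learned scale
    beta:  (n_units, 1) learned shift
    """
    mu = z.mean(axis=1, keepdims=True)        # mu = (1/m) sum_i z_i
    var = z.var(axis=1, keepdims=True)        # sigma^2
    z_norm = (z - mu) / np.sqrt(var + eps)    # zero mean, unit variance
    return gamma * z_norm + beta              # z_tilde
```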
- Test time: $\mu$ and $\sigma^2$ are not available when predicting on a single example. Two options (sketched below):
  - estimate them with an exponentially weighted average over the training mini-batches, or
  - compute them once over the entire training set.
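A sketch of the first option under assumed names; the momentum value 0.9 placed on the old estimate is also an assumption:

```python
import numpy as np

def update_running_stats(z_batch, running_mu, running_var, momentum=0.9):
    """Update exponentially weighted estimates once per training mini-batch."""
    mu = z_batch.mean(axis=1, keepdims=True)
    var = z_batch.var(axis=1, keepdims=True)
    running_mu = momentum * running_mu + (1 - momentum) * mu
    running_var = momentum * running_var + (1 - momentum) * var
    return running_mu, running_var

def batchnorm_test(z, gamma, beta, running_mu, running_var, eps=1e-8):
    """At test time, normalize with the running estimates instead of batch stats."""
    z_norm = (z - running_mu) / np.sqrt(running_var + eps)
    return gamma * z_norm + beta
```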
3. Multi-class classification
Extend Logistic Regression to Softmax Regression:
$$
\mathcal{L}(\hat{y}, y) = - \sum_{j=1}^{n_\text{class}} y_j \log \hat{y}_j
$$
Just replace the output layer with an $n \times 1$ softmax output; otherwise the differences from logistic regression are small.
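A minimal NumPy sketch of the softmax output and the loss above (the max-subtraction and the small clamp inside the log are numerical-stability assumptions, not part of the notes):

```python
import numpy as np

def softmax(z):
    """z: (n_class, m) logits -> (n_class, m) class probabilities."""
    z = z - z.max(axis=0, keepdims=True)    # stability: exp of large z overflows
    e = np.exp(z)
    return e / e.sum(axis=0, keepdims=True)

def cross_entropy(y_hat, y):
    """L(y_hat, y) = -sum_j y_j log(y_hat_j), averaged over the m examples."""
    m = y.shape[1]
    return -np.sum(y * np.log(y_hat + 1e-12)) / m
```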
4. TensorFlow
You only need to implement the forward pass; the backward pass is computed for you automatically.
The basic TensorFlow workflow:
- Create Tensors (variables/placeholders) that are not yet executed/evaluated.
- Write operations between those Tensors to build the computation graph (tf.matmul, tf.add, ...).
- Initialize your Tensors.
- Create a Session.
- Run the Session on the "optimizer" object (using a feed dictionary to bind values to placeholder variables).
Example code:
```python
import tensorflow as tf

y_hat = tf.constant(36, name='y_hat')            # Define y_hat constant. Set to 36.
y = tf.constant(39, name='y')                    # Define y. Set to 39.
loss = tf.Variable((y - y_hat)**2, name='loss')  # Create a variable for the loss

init = tf.global_variables_initializer()         # When init is run later (session.run(init)),
                                                 # the loss variable will be initialized and ready to be computed

with tf.Session() as session:                    # Create a session and print the output
    session.run(init)                            # Initializes the variables
    print(session.run(loss))                     # (39 - 36)**2 = 9
```
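The example above uses no placeholders; as a minimal sketch of the feed-dictionary step from the list, assuming the same TensorFlow 1.x API (the names are mine):

```python
import tensorflow as tf

# A placeholder is a Tensor whose value is supplied only at run time.
x = tf.placeholder(tf.int64, name='x')

with tf.Session() as session:
    # feed_dict binds a concrete value to the placeholder for this run
    print(session.run(2 * x, feed_dict={x: 3}))  # prints 6
```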