Hands-On Machine Learning with Scikit-Learn, Keras and TensorFlow 2 (2nd Edition) — Chapter 11: Training Deep Neural Networks

Training deep neural networks is far from easy. You are likely to run into problems such as:

  1. Vanishing or exploding gradients, which leave the lower layers of the network poorly trained
  2. Not enough training data, or labeling it is too costly
  3. Extremely slow training
  4. Models with many parameters overfit easily, especially when the training data is scarce or noisy

0. Import the required libraries

import tensorflow as tf
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
%matplotlib inline
import os

for i in (tf, np, mpl):
    print(i.__name__,": ",i.__version__,sep="")

Output:

tensorflow: 2.2.0
numpy: 1.17.4
matplotlib: 3.1.2

1. The Vanishing/Exploding Gradients Problem

1.1 Vanishing and exploding gradients

Vanishing gradients: as backpropagation propagates the error gradient down to the lower layers, the gradients often get smaller and smaller, so the lower layers are barely updated and training never converges to a good solution.

Exploding gradients: the gradients can instead grow larger and larger as they are propagated, so the weight updates become huge and training diverges. This problem is most common in recurrent neural networks.

These problems were observed long ago and were one of the reasons deep neural networks were largely abandoned in the early 2000s.

In a 2010 paper, Xavier Glorot and Yoshua Bengio pinpointed the main culprits: the combination of the sigmoid activation function and the then-popular weight initialization (a normal distribution with mean 0 and standard deviation 1). With this setup, the variance of each layer's outputs keeps growing as the signal moves forward through the network, so the weighted sums become large in absolute value and the sigmoid saturates near 0 or 1. Moreover, the sigmoid's mean is about 0.5 rather than 0, which makes things worse; the hyperbolic tangent (tanh), whose outputs are centered on 0, behaves somewhat better in this respect.

# The logistic (sigmoid) activation function
def logit(x):
    return 1/(1+np.exp(-x))

z = np.linspace(-5, 5, 200)

plt.plot([-5, 5], [0, 0], 'k-')
plt.plot([-5, 5], [1, 1], 'k--')
plt.plot([0, 0], [-0.2, 1.2], 'k-')
plt.plot([-5, 5], [-3/4, 7/4], 'g--')
plt.plot(z, logit(z), "b-", linewidth=2)
props = dict(facecolor='black', shrink=0.1)
plt.annotate('Saturating', xytext=(3.5, 0.7), xy=(5, 1), arrowprops=props, fontsize=14, ha="center")
plt.annotate('Saturating', xytext=(-3.5, 0.3), xy=(-5, 0), arrowprops=props, fontsize=14, ha="center")
plt.annotate('Linear', xytext=(2, 0.2), xy=(0, 0.5), arrowprops=props, fontsize=14, ha="center")
plt.grid(True)
plt.title("Sigmoid activation function", fontsize=14)
plt.axis([-5, 5, -0.2, 1.2])

plt.show()

Output:

The figure above shows the sigmoid activation function. When the input is larger than about 4 or smaller than about -4, the output is very close to 1 or 0: the function saturates and its derivative is close to 0, so during backpropagation there is almost no gradient left to propagate, which causes the vanishing gradients problem.

1.2 Glorot and He Initialization (also known as Xavier and He Initialization)

In their paper, Glorot and Bengio argue that the signal needs to flow properly in both directions: ideally the variance of each layer's outputs should equal the variance of its inputs, and the gradients should have equal variance before and after flowing through a layer in the reverse direction. This cannot be guaranteed unless each layer has the same number of inputs and outputs, but they proposed a good compromise: when using a logistic-like (sigmoid) activation function, initialize the weights randomly in one of the following ways:

  • Method 1: a normal distribution with mean 0 and variance \sigma^2 = \frac{1}{fan_{avg}}
  • Method 2: a uniform distribution between -r and r

where:

fan_{avg} = \frac{fan_{in}+fan_{out}}{2}

r = \sqrt{\frac{3}{fan_{avg}}}

This initialization strategy is called Xavier initialization or Glorot initialization.

If you replace fan_{avg} with fan_{in} in the equations above, you get LeCun initialization, which Yann LeCun proposed back in the 1990s. When fan_{in} = fan_{out}, LeCun initialization is equivalent to Glorot initialization.
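
As a quick numerical illustration of these formulas (a minimal NumPy sketch with hypothetical layer sizes, not part of the book's code), both distributions end up with the same variance:

import numpy as np

fan_in, fan_out = 784, 300                    # hypothetical layer sizes
fan_avg = (fan_in + fan_out) / 2

sigma = np.sqrt(1 / fan_avg)                  # Glorot normal: std = sqrt(1 / fan_avg)
W_normal = np.random.normal(0.0, sigma, size=(fan_in, fan_out))

r = np.sqrt(3 / fan_avg)                      # Glorot uniform: limit r = sqrt(3 / fan_avg)
W_uniform = np.random.uniform(-r, r, size=(fan_in, fan_out))

print(W_normal.std(), W_uniform.std())        # both are close to sigma ≈ 0.043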

It took researchers more than a decade to appreciate how important these details are; using Glorot initialization can speed up training considerably, and it is one of the tricks that made deep learning practical.

The initialization strategy for the ReLU activation function (and its variants, such as ELU) is called He initialization, after the paper's first author Kaiming He. For details see: Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification (https://arxiv.org/abs/1502.01852)

The SELU activation function should be used with LeCun initialization.

By default, Keras uses Glorot initialization with a uniform distribution. You can switch to another initializer via the kernel_initializer argument.

[name for name in dir(tf.keras.initializers) if not name.startswith("_")]

Output:

['Constant',
 'GlorotNormal',
 'GlorotUniform',
 'Identity',
 'Initializer',
 'Ones',
 'Orthogonal',
 'RandomNormal',
 'RandomUniform',
 'TruncatedNormal',
 'VarianceScaling',
 'Zeros',
 'constant',
 'deserialize',
 'get',
 'glorot_normal',
 'glorot_uniform',
 'he_normal',
 'he_uniform',
 'identity',
 'lecun_normal',
 'lecun_uniform',
 'ones',
 'orthogonal',
 'serialize',
 'zeros']

The output above lists all the initializers available in Keras.

tf.keras.layers.Dense(10, activation="relu",kernel_initializer="he_normal")

Output:

<tensorflow.python.keras.layers.core.Dense at 0x1da88f60>

The code above creates a Dense layer with the ReLU activation function and He initialization based on a normal distribution.

init = tf.keras.initializers.VarianceScaling(scale=2.0, mode="fan_avg",distribution="uniform")
tf.keras.layers.Dense(10, activation="relu", kernel_initializer=init)

Output:

<tensorflow.python.keras.layers.core.Dense at 0x1db5c978>

The code above uses the ReLU activation function with a He-style initialization that is uniform and based on fan_avg rather than fan_in.

1.3 Nonsaturating Activation Functions

As mentioned above, Glorot initialization is a good choice when using the sigmoid function. The ReLU activation function, however, generally behaves much better in deep networks: it does not saturate for positive values, and it is very fast to compute.

Unfortunately, ReLU has its own problem, known as dying ReLUs: when a neuron's weighted input sum is negative, its output is 0 and so is its gradient, so the neuron may stop learning entirely and never come back to life.

Leaky ReLU, RReLU and PReLU

To solve this problem, the leaky ReLU was proposed, defined as:

LeakyReLU_\alpha(x) = max(\alpha x, x)

where:

  • \alpha: the slope of the function for x < 0, typically set to 0.01

This small slope ensures that leaky ReLU neurons never die.

A 2015 study by Bing Xu et al. (Empirical Evaluation of Rectified Activations in Convolutional Network, https://arxiv.org/abs/1505.00853) showed that the leaky variants of ReLU usually outperform the plain ReLU. They also evaluated the randomized leaky ReLU (RReLU), where alpha is picked randomly within a given range during training and fixed to its average at test time; it also performed well and seems to act as a regularizer, reducing the risk of overfitting. Finally, they evaluated the parametric leaky ReLU (PReLU), where alpha is learned during training rather than being a fixed hyperparameter; PReLU strongly outperformed the plain ReLU on large image datasets, but it tends to overfit on small datasets.

Unfortunately, RReLU is not implemented in Keras yet.
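
If you really need it, a custom layer along the following lines might do the trick; this is only a rough sketch of the idea (the layer name, the default alpha range [1/8, 1/3] and the training-flag handling are assumptions, not an official Keras implementation):

import tensorflow as tf

class RandomizedLeakyReLU(tf.keras.layers.Layer):
    def __init__(self, lower=1/8, upper=1/3, **kwargs):
        super().__init__(**kwargs)
        self.lower, self.upper = lower, upper

    def call(self, inputs, training=None):
        if training:
            # During training, sample a random slope for each value
            alpha = tf.random.uniform(tf.shape(inputs), self.lower, self.upper)
        else:
            # At test time, use the average slope
            alpha = (self.lower + self.upper) / 2
        return tf.where(inputs >= 0, inputs, alpha * inputs)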

def leaky_relu(x, alpha=0.01):
    return np.maximum(alpha*x, x)

plt.plot(z, leaky_relu(z, 0.05), "b-", linewidth=2)
plt.plot([-5, 5], [0, 0], 'k-')
plt.plot([0, 0], [-0.5, 4.2], 'k-')
plt.grid(True)
props = dict(facecolor='black', shrink=0.1)
plt.annotate('Leak', xytext=(-3.5, 0.5), xy=(-5, -0.2), arrowprops=props, fontsize=14, ha="center")
plt.title("Leaky ReLU activation function", fontsize=14)
plt.axis([-5, 5, -0.5, 4.2])

plt.show()

Output:

The figure above shows the leaky ReLU function (plotted here with alpha = 0.05).

[m for m in dir(tf.keras.activations) if not m.startswith("_")]

Output:

['deserialize',
 'elu',
 'exponential',
 'get',
 'hard_sigmoid',
 'linear',
 'relu',
 'selu',
 'serialize',
 'sigmoid',
 'softmax',
 'softplus',
 'softsign',
 'swish',
 'tanh']

The output above lists all the activation functions available in Keras.

[m for m in dir(tf.keras.layers) if "relu" in m.lower()]

Output:

['LeakyReLU', 'PReLU', 'ReLU', 'ThresholdedReLU']

The following example trains a neural network on Fashion MNIST using Leaky ReLU:

(X_train_full, y_train_full), (X_test, y_test) = tf.keras.datasets.fashion_mnist.load_data()
X_train_full = X_train_full / 255.0
X_test = X_test / 255.0
X_valid, X_train = X_train_full[:5000], X_train_full[5000:]
y_valid, y_train = y_train_full[:5000], y_train_full[5000:]

tf.random.set_seed(42)
np.random.seed(42)

model = tf.keras.models.Sequential([
    tf.keras.layers.Flatten(input_shape=[28, 28]),
    tf.keras.layers.Dense(300, kernel_initializer="he_normal"),
    tf.keras.layers.LeakyReLU(),
    tf.keras.layers.Dense(100, kernel_initializer="he_normal"),
    tf.keras.layers.LeakyReLU(),
    tf.keras.layers.Dense(10, activation="softmax")
])

model.compile(loss="sparse_categorical_crossentropy", optimizer=tf.keras.optimizers.SGD(learning_rate=1e-3), metrics=["accuracy"])

history = model.fit(X_train, y_train, epochs=10, validation_data=(X_valid, y_valid))

Output:

Epoch 1/10
1719/1719 [==============================] - 4s 2ms/step - loss: 1.2819 - accuracy: 0.6229 - val_loss: 0.8886 - val_accuracy: 0.7160
Epoch 2/10
1719/1719 [==============================] - 4s 2ms/step - loss: 0.7955 - accuracy: 0.7362 - val_loss: 0.7130 - val_accuracy: 0.7656
Epoch 3/10
1719/1719 [==============================] - 4s 2ms/step - loss: 0.6816 - accuracy: 0.7721 - val_loss: 0.6427 - val_accuracy: 0.7898
Epoch 4/10
1719/1719 [==============================] - 4s 2ms/step - loss: 0.6217 - accuracy: 0.7944 - val_loss: 0.5900 - val_accuracy: 0.8066
Epoch 5/10
1719/1719 [==============================] - 4s 2ms/step - loss: 0.5832 - accuracy: 0.8075 - val_loss: 0.5582 - val_accuracy: 0.8202
Epoch 6/10
1719/1719 [==============================] - 4s 2ms/step - loss: 0.5553 - accuracy: 0.8157 - val_loss: 0.5350 - val_accuracy: 0.8238
Epoch 7/10
1719/1719 [==============================] - 4s 2ms/step - loss: 0.5338 - accuracy: 0.8225 - val_loss: 0.5157 - val_accuracy: 0.8304
Epoch 8/10
1719/1719 [==============================] - 4s 2ms/step - loss: 0.5173 - accuracy: 0.8272 - val_loss: 0.5079 - val_accuracy: 0.8286
Epoch 9/10
1719/1719 [==============================] - 4s 2ms/step - loss: 0.5040 - accuracy: 0.8288 - val_loss: 0.4895 - val_accuracy: 0.8390
Epoch 10/10
1719/1719 [==============================] - 4s 2ms/step - loss: 0.4924 - accuracy: 0.8321 - val_loss: 0.4817 - val_accuracy: 0.8396

The following example trains the same network on Fashion MNIST using PReLU:

tf.random.set_seed(42)
np.random.seed(42)

model = tf.keras.models.Sequential([
    tf.keras.layers.Flatten(input_shape=[28, 28]),
    tf.keras.layers.Dense(300, kernel_initializer="he_normal"),
    tf.keras.layers.PReLU(),
    tf.keras.layers.Dense(100, kernel_initializer="he_normal"),
    tf.keras.layers.PReLU(),
    tf.keras.layers.Dense(10, activation="softmax")
])

model.compile(loss="sparse_categorical_crossentropy", optimizer=tf.keras.optimizers.SGD(learning_rate=1e-3), metrics=["accuracy"])

history = model.fit(X_train, y_train, epochs=10,validation_data=(X_valid, y_valid))

Output:

Epoch 1/10
1719/1719 [==============================] - 4s 3ms/step - loss: 1.3461 - accuracy: 0.6209 - val_loss: 0.9255 - val_accuracy: 0.7186
Epoch 2/10
1719/1719 [==============================] - 4s 2ms/step - loss: 0.8197 - accuracy: 0.7355 - val_loss: 0.7305 - val_accuracy: 0.7630
Epoch 3/10
1719/1719 [==============================] - 4s 2ms/step - loss: 0.6966 - accuracy: 0.7694 - val_loss: 0.6565 - val_accuracy: 0.7880
Epoch 4/10
1719/1719 [==============================] - 4s 3ms/step - loss: 0.6331 - accuracy: 0.7909 - val_loss: 0.6003 - val_accuracy: 0.8050
Epoch 5/10
1719/1719 [==============================] - 4s 3ms/step - loss: 0.5917 - accuracy: 0.8057 - val_loss: 0.5656 - val_accuracy: 0.8178
Epoch 6/10
1719/1719 [==============================] - 4s 3ms/step - loss: 0.5618 - accuracy: 0.8134 - val_loss: 0.5406 - val_accuracy: 0.8236
Epoch 7/10
1719/1719 [==============================] - 4s 3ms/step - loss: 0.5390 - accuracy: 0.8205 - val_loss: 0.5196 - val_accuracy: 0.8314
Epoch 8/10
1719/1719 [==============================] - 4s 3ms/step - loss: 0.5213 - accuracy: 0.8258 - val_loss: 0.5113 - val_accuracy: 0.8316
Epoch 9/10
1719/1719 [==============================] - 4s 3ms/step - loss: 0.5070 - accuracy: 0.8288 - val_loss: 0.4916 - val_accuracy: 0.8378
Epoch 10/10
1719/1719 [==============================] - 4s 3ms/step - loss: 0.4945 - accuracy: 0.8316 - val_loss: 0.4826 - val_accuracy: 0.8396

ELU

In a 2015 paper, Djork-Arné Clevert et al. proposed a new activation function called ELU (exponential linear unit), which outperformed all the ReLU variants in their experiments: training time was reduced and the network performed better on the test set. It is defined as:

ELU_\alpha(x) = \begin{cases} \alpha(\exp(x)-1) & \text{if } x < 0 \\ x & \text{if } x \geq 0 \end{cases}

ELU looks a lot like ReLU, with a few differences:

  1. It takes on negative values when x < 0, which allows the unit's average output to be closer to 0; alpha defines the value the function approaches when x is a large negative number, and it is usually set to 1
  2. It has a nonzero gradient for x < 0, which avoids the dead-neuron problem
  3. When alpha = 1 the function is smooth everywhere, including around x = 0, which helps speed up gradient descent since it does not bounce as much around x = 0.

def elu(x, alpha=1):
    return np.where(x<0, alpha*(np.exp(x)-1),x)

plt.plot(z, elu(z), "b-", linewidth=2)
plt.plot([-5, 5], [0, 0], 'k-')
plt.plot([-5, 5], [-1, -1], 'k--')
plt.plot([0, 0], [-2.2, 3.2], 'k-')
plt.grid(True)
plt.title(r"ELU activation function ($\alpha=1$)", fontsize=14)
plt.axis([-5, 5, -2.2, 3.2])

plt.show()

Output:

The figure above shows the ELU function with alpha = 1.

The main drawback of ELU is that it is slower to compute than ReLU (because of the exponential for negative inputs). During training this is compensated by a faster convergence rate, but at test time an ELU network is simply slower than a ReLU network.

SELU

In a 2017 paper (Self-Normalizing Neural Networks, https://arxiv.org/abs/1706.02515), Günter Klambauer et al. proposed SELU (Scaled ELU), a scaled variant of ELU. They showed that if you build a network composed exclusively of a stack of dense layers and all of them use the SELU activation function, the network self-normalizes: the output of each layer tends to keep a mean of 0 and a standard deviation of 1 during training, even for networks more than 1,000 layers deep, which solves the vanishing/exploding gradients problem. For such architectures, SELU often significantly outperforms other activation functions, especially for very deep networks. There are, however, a few conditions for self-normalization to happen:

  1. The input features must be standardized (mean 0, standard deviation 1)
  2. Every hidden layer's weights must be initialized with LeCun normal initialization, e.g. kernel_initializer="lecun_normal" in Keras
  3. The network's architecture must be sequential (a plain stack of dense layers); in non-sequential architectures such as recurrent networks, self-normalization is not guaranteed and SELU may not beat other activation functions
  4. The paper only guarantees self-normalization when all layers are dense, but other researchers have reported that SELU can also work well in convolutional neural networks.

Note that in such a network you cannot use techniques that break self-normalization, such as L1 or L2 regularization, dropout, max-norm, skip connections or other non-sequential topologies.

from scipy.special import erfc

z = np.linspace(-5, 5, 200)

# alpha and scale to self normalize with mean 0 and standard deviation 1
alpha_0_1 = -np.sqrt(2 / np.pi) / (erfc(1/np.sqrt(2)) * np.exp(1/2) - 1)
scale_0_1 = (1 - erfc(1 / np.sqrt(2)) * np.sqrt(np.e)) * np.sqrt(2 * np.pi) * (2 * erfc(np.sqrt(2))*np.e**2 + np.pi*erfc(1/np.sqrt(2))**2*np.e - 2*(2+np.pi)*erfc(1/np.sqrt(2))*np.sqrt(np.e)+np.pi+2)**(-1/2)

def selu(x, scale=scale_0_1, alpha=alpha_0_1):
    return scale * elu(x, alpha)

plt.plot(z, selu(z), "b-", linewidth=2)
plt.plot([-5, 5], [0, 0], 'k-')
plt.plot([-5, 5], [-1.758, -1.758], 'k--')
plt.plot([0, 0], [-2.2, 3.2], 'k-')
plt.grid(True)
plt.title("SELU activation function", fontsize=14)
plt.axis([-5, 5, -2.2, 3.2])

plt.show()

Output:

np.random.seed(42)
Z = np.random.normal(size=(500, 100)) # standardized inputs
for layer in range(1000):
    W = np.random.normal(size=(100, 100), scale=np.sqrt(1 / 100)) # LeCun initialization
    Z = selu(np.dot(Z, W))
    means = np.mean(Z, axis=0).mean()
    stds = np.std(Z, axis=0).mean()
    if layer % 100 == 0:
        print("Layer {}: mean {:.2f}, std deviation {:.2f}".format(layer, means, stds))

Output:

Layer 0: mean -0.00, std deviation 1.00
Layer 100: mean 0.02, std deviation 0.96
Layer 200: mean 0.01, std deviation 0.90
Layer 300: mean -0.02, std deviation 0.92
Layer 400: mean 0.05, std deviation 0.89
Layer 500: mean 0.01, std deviation 0.93
Layer 600: mean 0.02, std deviation 0.92
Layer 700: mean -0.02, std deviation 0.90
Layer 800: mean 0.05, std deviation 0.83
Layer 900: mean 0.02, std deviation 1.00

As the output above shows, even after 1,000 layers the outputs roughly keep a mean of 0 and a standard deviation of 1.

The following example uses SELU to train a neural network with 100 hidden layers on Fashion MNIST; note that the inputs must be standardized first:

np.random.seed(42)
tf.random.set_seed(42)

model = tf.keras.models.Sequential()
model.add(tf.keras.layers.Flatten(input_shape=[28, 28]))
model.add(tf.keras.layers.Dense(300, activation="selu",kernel_initializer="lecun_normal"))

for layer in range(99):
    model.add(tf.keras.layers.Dense(100, activation="selu",kernel_initializer="lecun_normal"))

model.add(tf.keras.layers.Dense(10, activation="softmax"))

model.compile(loss="sparse_categorical_crossentropy",optimizer=tf.keras.optimizers.SGD(learning_rate=1e-3),metrics=["accuracy"])

# Standardize the inputs (zero mean, unit variance)
pixel_means = X_train.mean(axis=0, keepdims=True)
pixel_stds = X_train.std(axis=0, keepdims=True)
X_train_scaled = (X_train - pixel_means) / pixel_stds
X_valid_scaled = (X_valid - pixel_means) / pixel_stds
X_test_scaled = (X_test - pixel_means) / pixel_stds

history = model.fit(X_train_scaled, y_train, epochs=5,validation_data=(X_valid_scaled, y_valid))

Output:

Epoch 1/5
1719/1719 [==============================] - 31s 18ms/step - loss: 1.0580 - accuracy: 0.5978 - val_loss: 0.6905 - val_accuracy: 0.7422
Epoch 2/5
1719/1719 [==============================] - 31s 18ms/step - loss: 0.6646 - accuracy: 0.7588 - val_loss: 0.5594 - val_accuracy: 0.8112
Epoch 3/5
1719/1719 [==============================] - 31s 18ms/step - loss: 0.5790 - accuracy: 0.7971 - val_loss: 0.5434 - val_accuracy: 0.8112
Epoch 4/5
1719/1719 [==============================] - 31s 18ms/step - loss: 0.5211 - accuracy: 0.8221 - val_loss: 0.4808 - val_accuracy: 0.8370
Epoch 5/5
1719/1719 [==============================] - 31s 18ms/step - loss: 0.5105 - accuracy: 0.8245 - val_loss: 0.5023 - val_accuracy: 0.8258

Now let's try the same 100-hidden-layer network with ReLU instead:

np.random.seed(42)
tf.random.set_seed(42)

model = tf.keras.models.Sequential()
model.add(tf.keras.layers.Flatten(input_shape=[28, 28]))
model.add(tf.keras.layers.Dense(300, activation="relu", kernel_initializer="he_normal"))

for layer in range(99):
    model.add(tf.keras.layers.Dense(100, activation="relu", kernel_initializer="he_normal"))

model.add(tf.keras.layers.Dense(10, activation="softmax"))

model.compile(loss="sparse_categorical_crossentropy",optimizer=tf.keras.optimizers.SGD(learning_rate=1e-3),metrics=["accuracy"])

history = model.fit(X_train_scaled, y_train, epochs=5, validation_data=(X_valid_scaled, y_valid))

Output:

Epoch 1/5
1719/1719 [==============================] - 32s 18ms/step - loss: 1.7799 - accuracy: 0.2701 - val_loss: 1.3950 - val_accuracy: 0.3578
Epoch 2/5
1719/1719 [==============================] - 31s 18ms/step - loss: 1.1989 - accuracy: 0.4928 - val_loss: 1.1368 - val_accuracy: 0.5300
Epoch 3/5
1719/1719 [==============================] - 31s 18ms/step - loss: 0.9704 - accuracy: 0.6159 - val_loss: 1.0297 - val_accuracy: 0.6104
Epoch 4/5
1719/1719 [==============================] - 31s 18ms/step - loss: 2.0043 - accuracy: 0.2796 - val_loss: 2.0500 - val_accuracy: 0.2606
Epoch 5/5
1719/1719 [==============================] - 31s 18ms/step - loss: 1.5996 - accuracy: 0.3725 - val_loss: 1.4712 - val_accuracy: 0.3778

Comparing the two runs, SELU clearly performs far better than ReLU on this very deep network.

So which activation function should you use for the hidden layers of a deep neural network?

  1. In general, the preferred order is: SELU, ELU, leaky ReLU and its variants, ReLU, tanh, logistic.
  2. If the network's architecture prevents it from self-normalizing, ELU may perform better than SELU.
  3. If you care a lot about runtime latency, leaky ReLU is a good first choice; if you don't want to tweak yet another hyperparameter, use Keras's default alpha of 0.3.
  4. If you have spare time and computing power, use cross-validation to evaluate other activation functions: for example RReLU if the network is overfitting, or PReLU if the training set is very large.
  5. Since ReLU is still the most widely used activation function, many libraries and hardware accelerators provide ReLU-specific optimizations, so if speed is your top priority, ReLU remains the best choice.

1.4 Batch Normalization

Although He initialization combined with ELU (or any ReLU variant) can significantly reduce the vanishing/exploding gradients problems at the beginning of training, it doesn't guarantee that they won't come back later during training.

In a 2015 paper (Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift, https://arxiv.org/abs/1502.03167), Sergey Ioffe and Christian Szegedy proposed a technique called Batch Normalization (BN) to address these problems.

The idea is to add an operation just before or after the activation function of each hidden layer: it zero-centers and normalizes its inputs, then scales and shifts the result using two new parameter vectors per layer (one for scaling, one for shifting).

If you add a BN layer as the very first layer of your model, you no longer need to standardize the training set yourself.

At prediction time we often need to predict for a single instance, so we cannot compute a batch mean and standard deviation. One solution is to run the whole training set through the network after training and compute each BN layer's input means and standard deviations; these "final" statistics are then used at prediction time.

However, most implementations of BN instead estimate these final statistics during training, by keeping a moving average of each layer's input means and variances. This is what Keras's BatchNormalization layer does.

To summarize, the output scale vector (gamma) and the offset vector (beta) are learned through regular backpropagation, while the final input mean vector (mu) and the final input standard deviation vector (sigma) are estimated with an exponential moving average.

Ioffe and Szegedy demonstrated that BN considerably improves deep neural networks: it produced a large improvement on the ImageNet classification task, the vanishing gradients problem was strongly reduced (even with saturating activation functions such as the logistic or tanh), and the networks became much less sensitive to weight initialization.

Each training epoch is slower with BN (because of the extra computations), but the network typically needs far fewer epochs to converge, which compensates for the overhead.
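
To make the transformation concrete, here is a minimal NumPy sketch of what a single BN operation computes on one batch (the names and values are illustrative only; gamma and beta would be learned, and eps is the usual smoothing term):

import numpy as np

def batch_norm(X, gamma, beta, eps=0.001):
    mu = X.mean(axis=0)                       # per-feature batch mean
    var = X.var(axis=0)                       # per-feature batch variance
    X_hat = (X - mu) / np.sqrt(var + eps)     # zero-center and normalize
    return gamma * X_hat + beta               # then scale and shift

X_batch = np.random.randn(32, 4) * 5.0 + 3.0  # a batch with arbitrary mean/scale
out = batch_norm(X_batch, gamma=np.ones(4), beta=np.zeros(4))
print(out.mean(axis=0).round(3), out.std(axis=0).round(3))  # ≈ 0 and ≈ 1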

Let's implement a model with BN layers in Keras:

model = tf.keras.models.Sequential([
    tf.keras.layers.Flatten(input_shape=[28, 28]),
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.Dense(300, activation="relu"),
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.Dense(100, activation="relu"),
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.Dense(10, activation="softmax")
])

model.summary()

Output:

Model: "sequential_4"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
flatten_4 (Flatten)          (None, 784)               0         
_________________________________________________________________
batch_normalization (BatchNo (None, 784)               3136      
_________________________________________________________________
dense_210 (Dense)            (None, 300)               235500    
_________________________________________________________________
batch_normalization_1 (Batch (None, 300)               1200      
_________________________________________________________________
dense_211 (Dense)            (None, 100)               30100     
_________________________________________________________________
batch_normalization_2 (Batch (None, 100)               400       
_________________________________________________________________
dense_212 (Dense)            (None, 10)                1010      
=================================================================
Total params: 271,346
Trainable params: 268,978
Non-trainable params: 2,368
_________________________________________________________________
bn1 = model.layers[1]
[(var.name, var.trainable) for var in bn1.variables]

Output:

[('batch_normalization/gamma:0', True),
 ('batch_normalization/beta:0', True),
 ('batch_normalization/moving_mean:0', False),
 ('batch_normalization/moving_variance:0', False)]

As the output shows, each BN layer creates four parameters per input feature: gamma, beta, mu and sigma. mu and sigma are the moving averages; they are not affected by backpropagation, so Keras lists them as non-trainable (2,368 of the parameters above).

bn1.updates

Output:

[<tf.Operation 'cond/Identity' type=Identity>,
 <tf.Operation 'cond_1/Identity' type=Identity>]
model.compile(loss="sparse_categorical_crossentropy",optimizer=tf.keras.optimizers.SGD(learning_rate=1e-3),metrics=["accuracy"])

history = model.fit(X_train, y_train, epochs=10,validation_data=(X_valid, y_valid))

Output:

Epoch 1/10
1719/1719 [==============================] - 7s 4ms/step - loss: 0.8750 - accuracy: 0.7123 - val_loss: 0.5525 - val_accuracy: 0.8228
Epoch 2/10
1719/1719 [==============================] - 7s 4ms/step - loss: 0.5753 - accuracy: 0.8030 - val_loss: 0.4725 - val_accuracy: 0.8476
Epoch 3/10
1719/1719 [==============================] - 6s 4ms/step - loss: 0.5189 - accuracy: 0.8205 - val_loss: 0.4375 - val_accuracy: 0.8552
Epoch 4/10
1719/1719 [==============================] - 7s 4ms/step - loss: 0.4827 - accuracy: 0.8323 - val_loss: 0.4152 - val_accuracy: 0.8598
Epoch 5/10
1719/1719 [==============================] - 7s 4ms/step - loss: 0.4565 - accuracy: 0.8408 - val_loss: 0.3997 - val_accuracy: 0.8640
Epoch 6/10
1719/1719 [==============================] - 7s 4ms/step - loss: 0.4398 - accuracy: 0.8472 - val_loss: 0.3867 - val_accuracy: 0.8698
Epoch 7/10
1719/1719 [==============================] - 7s 4ms/step - loss: 0.4242 - accuracy: 0.8512 - val_loss: 0.3763 - val_accuracy: 0.8708
Epoch 8/10
1719/1719 [==============================] - 7s 4ms/step - loss: 0.4143 - accuracy: 0.8540 - val_loss: 0.3713 - val_accuracy: 0.8738
Epoch 9/10
1719/1719 [==============================] - 7s 4ms/step - loss: 0.4024 - accuracy: 0.8580 - val_loss: 0.3631 - val_accuracy: 0.8750
Epoch 10/10
1719/1719 [==============================] - 7s 4ms/step - loss: 0.3914 - accuracy: 0.8624 - val_loss: 0.3573 - val_accuracy: 0.8758

A layer that is immediately followed by a BN layer does not need its own bias term, since BN's offset parameter (beta) makes it redundant; you can set use_bias=False on that layer to save a few parameters:

model = tf.keras.models.Sequential([
    tf.keras.layers.Flatten(input_shape=[28, 28]),
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.Dense(300, use_bias=False),
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.Activation("relu"),
    tf.keras.layers.Dense(100, use_bias=False),
    tf.keras.layers.Activation("relu"),
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.Dense(10, activation="softmax")
])

model.compile(loss="sparse_categorical_crossentropy",optimizer=tf.keras.optimizers.SGD(learning_rate=1e-3),metrics=["accuracy"])

history = model.fit(X_train, y_train, epochs=10,validation_data=(X_valid, y_valid))

Output:

Epoch 1/10
1719/1719 [==============================] - 6s 4ms/step - loss: 0.8626 - accuracy: 0.7103 - val_loss: 0.5652 - val_accuracy: 0.8096
Epoch 2/10
1719/1719 [==============================] - 6s 4ms/step - loss: 0.5799 - accuracy: 0.7991 - val_loss: 0.4792 - val_accuracy: 0.8392
Epoch 3/10
1719/1719 [==============================] - 6s 4ms/step - loss: 0.5190 - accuracy: 0.8191 - val_loss: 0.4441 - val_accuracy: 0.8482
Epoch 4/10
1719/1719 [==============================] - 6s 4ms/step - loss: 0.4816 - accuracy: 0.8331 - val_loss: 0.4205 - val_accuracy: 0.8550
Epoch 5/10
1719/1719 [==============================] - 6s 4ms/step - loss: 0.4565 - accuracy: 0.8397 - val_loss: 0.4066 - val_accuracy: 0.8606
Epoch 6/10
1719/1719 [==============================] - 6s 4ms/step - loss: 0.4411 - accuracy: 0.8456 - val_loss: 0.3963 - val_accuracy: 0.8630
Epoch 7/10
1719/1719 [==============================] - 6s 4ms/step - loss: 0.4244 - accuracy: 0.8509 - val_loss: 0.3852 - val_accuracy: 0.8672
Epoch 8/10
1719/1719 [==============================] - 6s 4ms/step - loss: 0.4125 - accuracy: 0.8556 - val_loss: 0.3794 - val_accuracy: 0.8702
Epoch 9/10
1719/1719 [==============================] - 6s 4ms/step - loss: 0.4020 - accuracy: 0.8585 - val_loss: 0.3716 - val_accuracy: 0.8716
Epoch 10/10
1719/1719 [==============================] - 6s 4ms/step - loss: 0.3932 - accuracy: 0.8611 - val_loss: 0.3668 - val_accuracy: 0.8716

Hyperparameters of the BatchNormalization class:

  1. momentum: used when updating the exponential moving averages: \hat{v} \leftarrow \hat{v} \times momentum + v \times (1 - momentum). A good value is usually close to 1, e.g. 0.9, 0.99 or 0.999 (you want more 9s for larger datasets, fewer for smaller ones); see the small numeric example right after this list.
  2. axis: determines which axis is normalized; it defaults to -1, i.e. the last axis.
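
For example, here is the moving-average update from item 1 applied to a few hypothetical batch statistics (just an illustration of the formula, not Keras internals):

v_hat, momentum = 0.0, 0.99
for v in [1.0, 1.2, 0.8, 1.1]:        # batch means observed during training
    v_hat = v_hat * momentum + v * (1 - momentum)
print(round(v_hat, 4))                # drifts slowly toward the typical batch statistic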

BN has become one of the most commonly used components in deep neural networks, to the point that it is often omitted from architecture diagrams.

1.5 Gradient Clipping

Gradient clipping is another technique to mitigate the exploding gradients problem: during backpropagation, the gradients are clipped so that they never exceed a chosen threshold.

It is mostly used in recurrent neural networks, where the Batch Normalization discussed above is tricky to apply.

In Keras it looks like this (a quick numerical check follows the list):

  1. optimizer = tf.keras.optimizers.SGD(clipvalue=1.0): clips every component of the gradient vector to the range [-1.0, 1.0]. Note that this may change the direction of the gradient: for example, the gradient [0.9, 100] becomes [0.9, 1.0] after clipping, turning a vector that points mostly along the second axis into one that points roughly along the diagonal.
  2. optimizer = tf.keras.optimizers.SGD(clipnorm=1.0): if the gradient's L2 norm exceeds the threshold, the whole gradient vector is rescaled so that its norm equals the threshold. For example, the gradient [0.9, 100] has an L2 norm of sqrt(0.81+10000), which is larger than 1.0, so every component is divided by that norm, giving [0.00899964, 0.9999595]. This method preserves the gradient's direction.
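
As a quick sanity check of the two behaviors described above, we can reproduce those numbers with TensorFlow's generic clipping ops (these are the low-level ops, not the optimizer arguments themselves):

import tensorflow as tf

g = tf.constant([0.9, 100.0])
print(tf.clip_by_value(g, -1.0, 1.0).numpy())   # [0.9, 1.0]            -> direction changes
print(tf.clip_by_norm(g, 1.0).numpy())          # ≈ [0.009, 0.99996]    -> direction preserved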

Which of the two should you use in practice? Usually you need to try both, with different thresholds, and pick whichever performs best on the validation set.

2. Reusing Pretrained Layers

Transfer learning: find an existing network trained on a related task and reuse its lower layers (the generic feature detectors) to solve your own problem. This greatly speeds up training and requires much less data.

Note: for transfer learning in image classification, if your images do not match the input size expected by the reused model, you must resize them first.

The output layer of the original model usually has to be replaced, since it is tailored to the original task's classes. How many layers to reuse depends on how similar the two tasks are: the more similar they are, the more layers you can keep; if they are extremely similar, you can keep everything except the output layer.

2.1 Transfer Learning with Keras

Suppose we have a version of Fashion MNIST with only 8 classes (all classes except sandal and shirt), and that we have already trained a model A on it that reaches about 90% accuracy.

Now suppose we only have images of sandals and shirts and we want to train a binary classifier (shirt = positive, sandal = negative), but we only have 200 labeled examples. Training a model B from scratch on this small dataset might reach around 97% accuracy. That sounds good, but for a binary classification task it is not that impressive, so let's see whether reusing model A does better:

# Split the dataset: task A keeps 8 classes, task B is a binary sandal/shirt task
def split_dataset(X,y):
    y_5_or_6 = (y==5)|(y==6)
    y_A = y[~y_5_or_6]
    y_A[y_A>6] -= 2 # shift class indices 7, 8, 9 down to 5, 6, 7

    y_B = (y[y_5_or_6] == 6).astype(np.float32) # binary task: is it a shirt (class 6)? -> 1.0 / 0.0
    return ((X[~y_5_or_6], y_A),(X[y_5_or_6], y_B))

(X_train_A, y_train_A), (X_train_B, y_train_B) = split_dataset(X_train, y_train)
(X_valid_A, y_valid_A), (X_valid_B, y_valid_B) = split_dataset(X_valid, y_valid)
(X_test_A, y_test_A), (X_test_B, y_test_B) = split_dataset(X_test, y_test)
X_train_B = X_train_B[:200]
y_train_B = y_train_B[:200]

for i in (X_train_A, X_train_B):
    print(i.shape)

Output:

(43986, 28, 28)
(200, 28, 28)
y_train_A[:30]

Output:

array([4, 0, 5, 7, 7, 7, 4, 4, 3, 4, 0, 1, 6, 3, 4, 3, 2, 6, 5, 3, 4, 5,
       1, 3, 4, 2, 0, 6, 7, 1], dtype=uint8)
y_train_B[:30]

Output:

array([1., 1., 0., 0., 0., 0., 1., 1., 1., 0., 0., 1., 1., 0., 0., 0., 0.,
       0., 0., 1., 1., 0., 0., 1., 1., 0., 1., 1., 1., 1.], dtype=float32)

Model A:

# Train model A
tf.random.set_seed(42)
np.random.seed(42)

model_A = tf.keras.models.Sequential()
model_A.add(tf.keras.layers.Flatten(input_shape=[28, 28]))
for n_hidden in (300, 100, 50, 50, 50):
    model_A.add(tf.keras.layers.Dense(n_hidden, activation="selu"))
model_A.add(tf.keras.layers.Dense(8, activation="softmax"))

model_A.compile(loss="sparse_categorical_crossentropy",optimizer=tf.keras.optimizers.SGD(learning_rate=1e-3),metrics=["accuracy"])

history = model_A.fit(X_train_A, y_train_A, epochs=20,validation_data=(X_valid_A, y_valid_A))

Output:

Epoch 1/20
1375/1375 [==============================] - 4s 3ms/step - loss: 0.5926 - accuracy: 0.8103 - val_loss: 0.3890 - val_accuracy: 0.8677
Epoch 2/20
1375/1375 [==============================] - 4s 3ms/step - loss: 0.3523 - accuracy: 0.8786 - val_loss: 0.3288 - val_accuracy: 0.8822
Epoch 3/20
1375/1375 [==============================] - 4s 3ms/step - loss: 0.3170 - accuracy: 0.8896 - val_loss: 0.3013 - val_accuracy: 0.8991
Epoch 4/20
1375/1375 [==============================] - 4s 3ms/step - loss: 0.2973 - accuracy: 0.8975 - val_loss: 0.2895 - val_accuracy: 0.9018
Epoch 5/20
1375/1375 [==============================] - 4s 3ms/step - loss: 0.2834 - accuracy: 0.9021 - val_loss: 0.2776 - val_accuracy: 0.9066
Epoch 6/20
1375/1375 [==============================] - 4s 3ms/step - loss: 0.2729 - accuracy: 0.9061 - val_loss: 0.2735 - val_accuracy: 0.9063
Epoch 7/20
1375/1375 [==============================] - 4s 3ms/step - loss: 0.2640 - accuracy: 0.9093 - val_loss: 0.2722 - val_accuracy: 0.9083
Epoch 8/20
1375/1375 [==============================] - 4s 3ms/step - loss: 0.2572 - accuracy: 0.9125 - val_loss: 0.2588 - val_accuracy: 0.9141
Epoch 9/20
1375/1375 [==============================] - 4s 3ms/step - loss: 0.2518 - accuracy: 0.9136 - val_loss: 0.2563 - val_accuracy: 0.9143
Epoch 10/20
1375/1375 [==============================] - 4s 3ms/step - loss: 0.2468 - accuracy: 0.9153 - val_loss: 0.2541 - val_accuracy: 0.9158
Epoch 11/20
1375/1375 [==============================] - 4s 3ms/step - loss: 0.2422 - accuracy: 0.9177 - val_loss: 0.2495 - val_accuracy: 0.9155
Epoch 12/20
1375/1375 [==============================] - 4s 3ms/step - loss: 0.2382 - accuracy: 0.9190 - val_loss: 0.2513 - val_accuracy: 0.9126
Epoch 13/20
1375/1375 [==============================] - 4s 3ms/step - loss: 0.2350 - accuracy: 0.9199 - val_loss: 0.2445 - val_accuracy: 0.9153
Epoch 14/20
1375/1375 [==============================] - 4s 3ms/step - loss: 0.2315 - accuracy: 0.9213 - val_loss: 0.2415 - val_accuracy: 0.9175
Epoch 15/20
1375/1375 [==============================] - 4s 3ms/step - loss: 0.2287 - accuracy: 0.9214 - val_loss: 0.2446 - val_accuracy: 0.9185
Epoch 16/20
1375/1375 [==============================] - 4s 3ms/step - loss: 0.2254 - accuracy: 0.9225 - val_loss: 0.2385 - val_accuracy: 0.9195
Epoch 17/20
1375/1375 [==============================] - 4s 3ms/step - loss: 0.2230 - accuracy: 0.9235 - val_loss: 0.2409 - val_accuracy: 0.9183
Epoch 18/20
1375/1375 [==============================] - 4s 3ms/step - loss: 0.2201 - accuracy: 0.9245 - val_loss: 0.2423 - val_accuracy: 0.9153
Epoch 19/20
1375/1375 [==============================] - 4s 3ms/step - loss: 0.2178 - accuracy: 0.9252 - val_loss: 0.2329 - val_accuracy: 0.9195
Epoch 20/20
1375/1375 [==============================] - 4s 3ms/step - loss: 0.2156 - accuracy: 0.9261 - val_loss: 0.2333 - val_accuracy: 0.9208
# Save model A
model_A.save("my_model_A.h5")
print("Saved successfully")

Output:

Saved successfully

Model B:

model_B = tf.keras.models.Sequential()
model_B.add(tf.keras.layers.Flatten(input_shape=[28, 28]))
for n_hidden in (300, 100, 50, 50, 50):
    model_B.add(tf.keras.layers.Dense(n_hidden, activation="selu"))
model_B.add(tf.keras.layers.Dense(1, activation="sigmoid"))

model_B.compile(loss="binary_crossentropy",optimizer=tf.keras.optimizers.SGD(learning_rate=1e-3),metrics=["accuracy"])

history = model_B.fit(X_train_B, y_train_B, epochs=20,validation_data=(X_valid_B, y_valid_B))

Output:

Epoch 1/20
7/7 [==============================] - 0s 28ms/step - loss: 0.9573 - accuracy: 0.4650 - val_loss: 0.6314 - val_accuracy: 0.6004
Epoch 2/20
7/7 [==============================] - 0s 13ms/step - loss: 0.5692 - accuracy: 0.7450 - val_loss: 0.4784 - val_accuracy: 0.8529
Epoch 3/20
7/7 [==============================] - 0s 13ms/step - loss: 0.4503 - accuracy: 0.8650 - val_loss: 0.4102 - val_accuracy: 0.8945
Epoch 4/20
7/7 [==============================] - 0s 12ms/step - loss: 0.3879 - accuracy: 0.8950 - val_loss: 0.3647 - val_accuracy: 0.9178
Epoch 5/20
7/7 [==============================] - 0s 13ms/step - loss: 0.3435 - accuracy: 0.9250 - val_loss: 0.3300 - val_accuracy: 0.9320
Epoch 6/20
7/7 [==============================] - 0s 13ms/step - loss: 0.3081 - accuracy: 0.9300 - val_loss: 0.3019 - val_accuracy: 0.9402
Epoch 7/20
7/7 [==============================] - 0s 13ms/step - loss: 0.2800 - accuracy: 0.9350 - val_loss: 0.2804 - val_accuracy: 0.9422
Epoch 8/20
7/7 [==============================] - 0s 12ms/step - loss: 0.2564 - accuracy: 0.9450 - val_loss: 0.2606 - val_accuracy: 0.9473
Epoch 9/20
7/7 [==============================] - 0s 14ms/step - loss: 0.2362 - accuracy: 0.9550 - val_loss: 0.2428 - val_accuracy: 0.9523
Epoch 10/20
7/7 [==============================] - 0s 13ms/step - loss: 0.2188 - accuracy: 0.9600 - val_loss: 0.2281 - val_accuracy: 0.9544
Epoch 11/20
7/7 [==============================] - 0s 14ms/step - loss: 0.2036 - accuracy: 0.9700 - val_loss: 0.2150 - val_accuracy: 0.9584
Epoch 12/20
7/7 [==============================] - 0s 15ms/step - loss: 0.1898 - accuracy: 0.9700 - val_loss: 0.2036 - val_accuracy: 0.9584
Epoch 13/20
7/7 [==============================] - 0s 14ms/step - loss: 0.1773 - accuracy: 0.9750 - val_loss: 0.1931 - val_accuracy: 0.9615
Epoch 14/20
7/7 [==============================] - 0s 16ms/step - loss: 0.1668 - accuracy: 0.9800 - val_loss: 0.1838 - val_accuracy: 0.9635
Epoch 15/20
7/7 [==============================] - 0s 15ms/step - loss: 0.1570 - accuracy: 0.9900 - val_loss: 0.1746 - val_accuracy: 0.9686
Epoch 16/20
7/7 [==============================] - 0s 12ms/step - loss: 0.1481 - accuracy: 0.9900 - val_loss: 0.1674 - val_accuracy: 0.9686
Epoch 17/20
7/7 [==============================] - 0s 13ms/step - loss: 0.1406 - accuracy: 0.9900 - val_loss: 0.1604 - val_accuracy: 0.9706
Epoch 18/20
7/7 [==============================] - 0s 14ms/step - loss: 0.1334 - accuracy: 0.9900 - val_loss: 0.1539 - val_accuracy: 0.9706
Epoch 19/20
7/7 [==============================] - 0s 13ms/step - loss: 0.1268 - accuracy: 0.9900 - val_loss: 0.1482 - val_accuracy: 0.9716
Epoch 20/20
7/7 [==============================] - 0s 12ms/step - loss: 0.1208 - accuracy: 0.9900 - val_loss: 0.1431 - val_accuracy: 0.9716
model.summary()

Output:

Model: "sequential_5"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
flatten_5 (Flatten)          (None, 784)               0         
_________________________________________________________________
batch_normalization_3 (Batch (None, 784)               3136      
_________________________________________________________________
dense_213 (Dense)            (None, 300)               235200    
_________________________________________________________________
batch_normalization_4 (Batch (None, 300)               1200      
_________________________________________________________________
activation (Activation)      (None, 300)               0         
_________________________________________________________________
dense_214 (Dense)            (None, 100)               30000     
_________________________________________________________________
activation_1 (Activation)    (None, 100)               0         
_________________________________________________________________
batch_normalization_5 (Batch (None, 100)               400       
_________________________________________________________________
dense_215 (Dense)            (None, 10)                1010      
=================================================================
Total params: 270,946
Trainable params: 268,578
Non-trainable params: 2,368
_________________________________________________________________

Transfer learning: reuse model A's layers to build model B:

model_A = tf.keras.models.load_model("my_model_A.h5")

# Reuse all of model A's layers except its output layer. Note that model_B_on_A
# shares these layers with model_A, so training it also modifies model_A;
# clone model A first (as below) if you need to keep the original intact.
model_B_on_A = tf.keras.models.Sequential(model_A.layers[:-1])
model_B_on_A.add(tf.keras.layers.Dense(1, activation="sigmoid"))

model_A_clone = tf.keras.models.clone_model(model_A)
model_A_clone.set_weights(model_A.get_weights())

# Freeze the reused layers for the first few epochs, so the new output layer
# can learn reasonable weights without wrecking the pretrained ones.
for layer in model_B_on_A.layers[:-1]:
    layer.trainable = False

model_B_on_A.compile(loss="binary_crossentropy",optimizer=tf.keras.optimizers.SGD(learning_rate=1e-3),metrics=["accuracy"])

history = model_B_on_A.fit(X_train_B, y_train_B, epochs=4, validation_data=(X_valid_B, y_valid_B))

Output:

Epoch 1/4
7/7 [==============================] - 0s 22ms/step - loss: 0.5727 - accuracy: 0.6550 - val_loss: 0.5767 - val_accuracy: 0.6430
Epoch 2/4
7/7 [==============================] - 0s 13ms/step - loss: 0.5367 - accuracy: 0.6900 - val_loss: 0.5398 - val_accuracy: 0.6846
Epoch 3/4
7/7 [==============================] - 0s 13ms/step - loss: 0.5003 - accuracy: 0.7350 - val_loss: 0.5082 - val_accuracy: 0.7160
Epoch 4/4
7/7 [==============================] - 0s 12ms/step - loss: 0.4692 - accuracy: 0.7550 - val_loss: 0.4799 - val_accuracy: 0.7404
for layer in model_B_on_A.layers[:-1]:
    layer.trainable = True

model_B_on_A.compile(loss="binary_crossentropy",optimizer=tf.keras.optimizers.SGD(learning_rate=1e-3),metrics=["accuracy"])

history = model_B_on_A.fit(X_train_B, y_train_B, epochs=16, validation_data=(X_valid_B, y_valid_B))

Output:

Epoch 1/16
7/7 [==============================] - 0s 23ms/step - loss: 0.3920 - accuracy: 0.8200 - val_loss: 0.3428 - val_accuracy: 0.8671
Epoch 2/16
7/7 [==============================] - 0s 14ms/step - loss: 0.2775 - accuracy: 0.9350 - val_loss: 0.2585 - val_accuracy: 0.9290
Epoch 3/16
7/7 [==============================] - 0s 13ms/step - loss: 0.2070 - accuracy: 0.9650 - val_loss: 0.2099 - val_accuracy: 0.9544
Epoch 4/16
7/7 [==============================] - 0s 13ms/step - loss: 0.1661 - accuracy: 0.9800 - val_loss: 0.1782 - val_accuracy: 0.9696
Epoch 5/16
7/7 [==============================] - 0s 13ms/step - loss: 0.1390 - accuracy: 0.9800 - val_loss: 0.1556 - val_accuracy: 0.9757
Epoch 6/16
7/7 [==============================] - 0s 13ms/step - loss: 0.1192 - accuracy: 0.9950 - val_loss: 0.1389 - val_accuracy: 0.9797
Epoch 7/16
7/7 [==============================] - 0s 14ms/step - loss: 0.1047 - accuracy: 0.9950 - val_loss: 0.1263 - val_accuracy: 0.9848
Epoch 8/16
7/7 [==============================] - 0s 13ms/step - loss: 0.0935 - accuracy: 0.9950 - val_loss: 0.1162 - val_accuracy: 0.9858
Epoch 9/16
7/7 [==============================] - 0s 12ms/step - loss: 0.0846 - accuracy: 1.0000 - val_loss: 0.1065 - val_accuracy: 0.9888
Epoch 10/16
7/7 [==============================] - 0s 13ms/step - loss: 0.0762 - accuracy: 1.0000 - val_loss: 0.0999 - val_accuracy: 0.9899
Epoch 11/16
7/7 [==============================] - 0s 13ms/step - loss: 0.0704 - accuracy: 1.0000 - val_loss: 0.0939 - val_accuracy: 0.9899
Epoch 12/16
7/7 [==============================] - 0s 13ms/step - loss: 0.0649 - accuracy: 1.0000 - val_loss: 0.0888 - val_accuracy: 0.9899
Epoch 13/16
7/7 [==============================] - 0s 13ms/step - loss: 0.0602 - accuracy: 1.0000 - val_loss: 0.0839 - val_accuracy: 0.9899
Epoch 14/16
7/7 [==============================] - 0s 12ms/step - loss: 0.0559 - accuracy: 1.0000 - val_loss: 0.0802 - val_accuracy: 0.9899
Epoch 15/16
7/7 [==============================] - 0s 13ms/step - loss: 0.0525 - accuracy: 1.0000 - val_loss: 0.0769 - val_accuracy: 0.9899
Epoch 16/16
7/7 [==============================] - 0s 15ms/step - loss: 0.0496 - accuracy: 1.0000 - val_loss: 0.0739 - val_accuracy: 0.9899
model_B.evaluate(X_test_B, y_test_B)

Output:

63/63 [==============================] - 0s 2ms/step - loss: 0.1408 - accuracy: 0.9705

[0.1408407837152481, 0.9704999923706055]
model_B_on_A.evaluate(X_test_B, y_test_B)

Output:

63/63 [==============================] - 0s 2ms/step - loss: 0.0681 - accuracy: 0.9930

[0.0681409016251564, 0.9929999709129333]
(1-0.9704999923706055)/(1-0.9929999709129333)

Output:

4.2142692926661045

As the computation above shows, transfer learning reduced the error rate by a factor of more than 4 (from about 2.95% down to about 0.70%).

Although this looks impressive, keep in mind that transfer learning does not work that well with small dense networks like these; it works best with deep convolutional neural networks.

2.2 Unsupervised Pretraining

Suppose you need to tackle a complex task, you don't have much labeled data, and you cannot find a suitable pretrained model to reuse. If you cannot collect more labeled data either, unsupervised pretraining can help.

In the early days of deep learning, training deep models was hard, so people often used a technique called greedy layer-wise pretraining: train a single-layer unsupervised model (typically a Restricted Boltzmann Machine, RBM), freeze it, add another layer on top and train it, then freeze again, add another layer, and so on.

Nowadays it is much more common to train the whole unsupervised model in one shot, typically using autoencoders or GANs.

2.3 Pretraining on an Auxiliary Task

If you cannot obtain much labeled data, one more option is to first train a network on an auxiliary task for which labeled data is easy to obtain or generate automatically, and then reuse that network's lower layers for your actual task.

For natural language processing (NLP) tasks, for example, you can download a large corpus of text documents and generate labels from it automatically.

Self-supervised learning means automatically generating labels from the data itself and then training a model on the resulting labeled dataset. Since no human labeling is involved, it is usually classified as a form of unsupervised learning.

3. Faster Optimizers

Training a deep neural network can be painfully slow. So far we have seen several ways to speed it up:

  1. Using a good initialization strategy for the connection weights
  2. Using a good activation function
  3. Using Batch Normalization
  4. Reusing parts of a pretrained network

Another big speed boost comes from using a faster optimizer than plain gradient descent; the most popular ones are presented below.

3.1 Momentum Optimization

Gradient descent does not take earlier gradients into account, so when the local gradient is tiny it proceeds very slowly.

Momentum optimization, in contrast, accumulates past gradients. Its update equations are:

m \leftarrow \beta m - \eta \triangledown_\theta J(\theta)

{\theta} \leftarrow \theta + m

where:

  • \beta is the momentum, a value between 0 and 1, typically 0.9
  • \eta is the learning rate

Plain (stochastic) gradient descent makes quick progress at first, but it slows down more and more as it approaches the optimum, where the gradients are small.

Momentum optimization keeps picking up speed until it reaches the minimum, and it also helps roll past local optima. Because of the momentum it may overshoot the optimum, come back, overshoot again, and oscillate a few times before settling down.

One drawback of momentum is the extra hyperparameter to tune, but in practice a value of 0.9 usually works well, and momentum optimization is almost always faster than regular SGD.

optimizer = tf.keras.optimizers.SGD(learning_rate=0.001, momentum=0.9)
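
For intuition, here is a minimal NumPy sketch of the two update equations above, applied to a hypothetical one-dimensional loss J(θ) = θ²/2 (whose gradient is simply θ):

import numpy as np

def gradient(theta):
    return theta                           # gradient of J(theta) = theta**2 / 2

theta, m = 10.0, 0.0
eta, beta = 0.1, 0.9
for step in range(200):
    m = beta * m - eta * gradient(theta)   # accumulate the gradients (decayed by beta)
    theta = theta + m                      # move by the accumulated momentum
print(round(theta, 6))                     # approaches the optimum at 0, oscillating along the way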

3.2 Nesterov Accelerated Gradient (NAG)

Nesterov accelerated gradient is a variant of momentum optimization proposed by Yurii Nesterov in 1983. Instead of measuring the gradient at the current position theta, it measures it slightly ahead in the direction of the momentum, at theta + beta*m. The update equations are:

m \leftarrow \beta m - \eta \triangledown_\theta J(\theta + \beta m)

{\theta} \leftarrow \theta + m

NAG usually converges faster than regular momentum optimization.

optimizer = tf.keras.optimizers.SGD(learning_rate=0.001, momentum=0.9, nesterov=True)

3.3 AdaGrad

AdaGrad corrects the descent direction so that it points a bit more toward the global optimum, by scaling down the gradient vector along the steepest dimensions. Its update equations are:

s \leftarrow s + \triangledown_\theta J(\theta) \otimes \triangledown_\theta J(\theta)

\theta \leftarrow \theta - \eta \triangledown_\theta J(\theta) \oslash \sqrt{s+\epsilon}

The first step accumulates the squares of the gradients into the vector s (\otimes denotes element-wise multiplication). The second step is almost identical to gradient descent, except that the gradient vector is scaled down by a factor of \sqrt{s+\epsilon} (\oslash denotes element-wise division); epsilon is a smoothing term that avoids division by zero, typically set to 10^{-10}.

In short, the algorithm decays the learning rate, and it does so faster along the steeper dimensions; this is called an adaptive learning rate, and it requires less tuning of the learning rate hyperparameter.

AdaGrad often works well for simple quadratic problems, but it frequently stops too early when training neural networks: the learning rate gets scaled down so much that training grinds to a halt before reaching the global optimum. So even though Keras provides an Adagrad optimizer, you should not use it for deep neural networks (it may still be fine for simpler tasks such as linear regression).

optimizer = tf.keras.optimizers.Adagrad(learning_rate=0.001)
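
The element-wise accumulation and scaling can be written out in a few lines of NumPy (an illustrative sketch on a hypothetical elongated 2D quadratic, not the TensorFlow implementation):

import numpy as np

def gradient(theta):
    return np.array([0.1, 2.0]) * theta    # one gentle dimension, one steep dimension

theta = np.array([10.0, 10.0])
s = np.zeros(2)
eta, eps = 0.5, 1e-10
for step in range(100):
    g = gradient(theta)
    s += g * g                              # accumulate squared gradients (element-wise)
    theta -= eta * g / np.sqrt(s + eps)     # gradient descent, scaled down per dimension
print(theta.round(3), np.sqrt(s).round(1))  # the steeper dimension accumulated a much larger s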

3.4 RMSProp

RMSProp fixes AdaGrad's tendency to slow down too fast and never reach the global optimum: it accumulates only the gradients from the most recent iterations (using exponential decay), rather than all gradients since the beginning of training.

s \leftarrow \beta s + (1-\beta)\triangledown_\theta J(\theta) \otimes \triangledown_\theta J(\theta)

\theta \leftarrow \theta - \eta \triangledown_\theta J(\theta) \oslash \sqrt{s+\epsilon}

where \beta is the decay rate, typically set to 0.9. Yes, it is yet another hyperparameter, but the default value of 0.9 usually works well, so you rarely need to tune it.

Until Adam came along, RMSProp was the preferred optimization algorithm of many researchers.

optimizer = tf.keras.optimizers.RMSprop(learning_rate=0.001, rho=0.9)

3.5 Adam and Nadam Optimization

Adam (adaptive moment estimation) combines the ideas of momentum optimization and RMSProp. Its update equations are:

  1. m \leftarrow \beta_1 m - (1-\beta_1)\triangledown_\theta J(\theta)
  2. s \leftarrow \beta_2 s + (1-\beta_2)\triangledown_\theta J(\theta) \otimes \triangledown_\theta J(\theta)
  3. \hat{m} \leftarrow \frac{m}{1-\beta_1^T}
  4. \hat{s} \leftarrow \frac{s}{1-\beta_2^T}
  5. \theta \leftarrow \theta + \eta \hat{m} \oslash \sqrt{\hat{s}+\epsilon}

where T is the iteration number (starting at 1), \beta_1 is typically set to 0.9, \beta_2 to 0.999, and \epsilon to 10^{-7}. Steps 3 and 4 correct the fact that m and s are biased toward 0 at the beginning of training.

optimizer = tf.keras.optimizers.Adam(learning_rate=0.001, beta_1=0.9, beta_2=0.999)
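
Putting the five steps together, a minimal NumPy sketch of the Adam update loop looks like this (hypothetical one-dimensional gradient function, just to illustrate the equations; the real optimizer applies this per variable at every training step):

import numpy as np

def gradient(theta):
    return theta                          # gradient of the toy loss J(theta) = theta**2 / 2

theta = 10.0
m, s = 0.0, 0.0
eta, beta1, beta2, eps = 0.01, 0.9, 0.999, 1e-7
for t in range(1, 2001):                  # t starts at 1 for the bias corrections
    g = gradient(theta)
    m = beta1 * m - (1 - beta1) * g       # step 1: decaying average of gradients (note the minus sign)
    s = beta2 * s + (1 - beta2) * g * g   # step 2: decaying average of squared gradients
    m_hat = m / (1 - beta1 ** t)          # step 3: bias-correct m
    s_hat = s / (1 - beta2 ** t)          # step 4: bias-correct s
    theta = theta + eta * m_hat / np.sqrt(s_hat + eps)   # step 5: parameter update
print(round(theta, 4))                    # ends up close to the optimum at 0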

3.6 AdaMax, a Variant of Adam

AdaMax replaces step 2 of Adam with the following (taking the max of the decayed s and the absolute gradient instead of an exponentially decaying average of squares) and drops step 4:

s \leftarrow max(\beta_2 s, \left|\triangledown_\theta J(\theta)\right|)

optimizer = tf.keras.optimizers.Adamax(learning_rate=0.001, beta_1=0.9, beta_2=0.999)

3.7 Nadam, a Variant of Adam

Nadam is simply Adam plus the Nesterov trick, so it often converges slightly faster than Adam.

optimizer = tf.keras.optimizers.Nadam(learning_rate=0.001, beta_1=0.9, beta_2=0.999)

4. Learning Rate Scheduling

If the learning rate is set way too high, training may diverge. If it is somewhat too high, progress is fast at first, but the model then keeps bouncing around the optimum and never really settles down. If it is too low, training eventually converges to the optimum but takes a very long time.

Learning schedules address this by starting with a fairly large learning rate to speed up training and then reducing it over time, so the model does not keep overshooting the optimum. Here are some of the most commonly used schedules:

1. Power scheduling expresses the learning rate as a function of the iteration number t: \eta(t) = \eta_0/(1+t/s)^c, where \eta_0 is the initial learning rate, and the power c (usually set to 1) and the steps s are hyperparameters. The learning rate drops at each iteration: quickly at first, then more and more slowly.

Keras implements this with c=1 and s=1/decay:

optimizer = tf.keras.optimizers.SGD(lr=0.01, decay=1e-4) 

model = tf.keras.models.Sequential([
    tf.keras.layers.Flatten(input_shape=[28, 28]),
    tf.keras.layers.Dense(300, activation="selu", kernel_initializer="lecun_normal"),
    tf.keras.layers.Dense(100, activation="selu", kernel_initializer="lecun_normal"),
    tf.keras.layers.Dense(10, activation="softmax")
])
model.compile(loss="sparse_categorical_crossentropy", optimizer=optimizer, metrics=["accuracy"])

n_epochs = 25
history = model.fit(X_train_scaled, y_train, epochs=n_epochs,validation_data=(X_valid_scaled, y_valid))

Output:

Epoch 1/25
1719/1719 [==============================] - 4s 2ms/step - loss: 0.4854 - accuracy: 0.8289 - val_loss: 0.4019 - val_accuracy: 0.8634
Epoch 2/25
1719/1719 [==============================] - 4s 2ms/step - loss: 0.3770 - accuracy: 0.8657 - val_loss: 0.3771 - val_accuracy: 0.8686
Epoch 3/25
1719/1719 [==============================] - 4s 2ms/step - loss: 0.3449 - accuracy: 0.8773 - val_loss: 0.3642 - val_accuracy: 0.8724
Epoch 4/25
1719/1719 [==============================] - 4s 2ms/step - loss: 0.3231 - accuracy: 0.8851 - val_loss: 0.3509 - val_accuracy: 0.8806
Epoch 5/25
1719/1719 [==============================] - 4s 2ms/step - loss: 0.3075 - accuracy: 0.8896 - val_loss: 0.3476 - val_accuracy: 0.8798
Epoch 6/25
1719/1719 [==============================] - 4s 2ms/step - loss: 0.2953 - accuracy: 0.8942 - val_loss: 0.3402 - val_accuracy: 0.8798
Epoch 7/25
1719/1719 [==============================] - 4s 2ms/step - loss: 0.2850 - accuracy: 0.8984 - val_loss: 0.3425 - val_accuracy: 0.8820
Epoch 8/25
1719/1719 [==============================] - 4s 2ms/step - loss: 0.2756 - accuracy: 0.9023 - val_loss: 0.3349 - val_accuracy: 0.8882
Epoch 9/25
1719/1719 [==============================] - 4s 2ms/step - loss: 0.2680 - accuracy: 0.9044 - val_loss: 0.3328 - val_accuracy: 0.8842
Epoch 10/25
1719/1719 [==============================] - 4s 2ms/step - loss: 0.2612 - accuracy: 0.9065 - val_loss: 0.3313 - val_accuracy: 0.8834
Epoch 11/25
1719/1719 [==============================] - 4s 2ms/step - loss: 0.2548 - accuracy: 0.9097 - val_loss: 0.3264 - val_accuracy: 0.8854
Epoch 12/25
1719/1719 [==============================] - 4s 2ms/step - loss: 0.2489 - accuracy: 0.9116 - val_loss: 0.3266 - val_accuracy: 0.8884
Epoch 13/25
1719/1719 [==============================] - 4s 2ms/step - loss: 0.2442 - accuracy: 0.9132 - val_loss: 0.3240 - val_accuracy: 0.8846
Epoch 14/25
1719/1719 [==============================] - 4s 2ms/step - loss: 0.2391 - accuracy: 0.9150 - val_loss: 0.3255 - val_accuracy: 0.8880
Epoch 15/25
1719/1719 [==============================] - 4s 2ms/step - loss: 0.2346 - accuracy: 0.9164 - val_loss: 0.3238 - val_accuracy: 0.8866
Epoch 16/25
1719/1719 [==============================] - 4s 2ms/step - loss: 0.2310 - accuracy: 0.9185 - val_loss: 0.3189 - val_accuracy: 0.8912
Epoch 17/25
1719/1719 [==============================] - 4s 2ms/step - loss: 0.2269 - accuracy: 0.9198 - val_loss: 0.3225 - val_accuracy: 0.8864
Epoch 18/25
1719/1719 [==============================] - 4s 2ms/step - loss: 0.2233 - accuracy: 0.9205 - val_loss: 0.3209 - val_accuracy: 0.8882
Epoch 19/25
1719/1719 [==============================] - 4s 2ms/step - loss: 0.2200 - accuracy: 0.9224 - val_loss: 0.3185 - val_accuracy: 0.8894
Epoch 20/25
1719/1719 [==============================] - 4s 2ms/step - loss: 0.2162 - accuracy: 0.9237 - val_loss: 0.3206 - val_accuracy: 0.8906
Epoch 21/25
1719/1719 [==============================] - 4s 2ms/step - loss: 0.2138 - accuracy: 0.9254 - val_loss: 0.3219 - val_accuracy: 0.8878
Epoch 22/25
1719/1719 [==============================] - 4s 2ms/step - loss: 0.2111 - accuracy: 0.9254 - val_loss: 0.3202 - val_accuracy: 0.8894
Epoch 23/25
1719/1719 [==============================] - 4s 2ms/step - loss: 0.2081 - accuracy: 0.9263 - val_loss: 0.3219 - val_accuracy: 0.8878
Epoch 24/25
1719/1719 [==============================] - 4s 2ms/step - loss: 0.2057 - accuracy: 0.9279 - val_loss: 0.3183 - val_accuracy: 0.8900
Epoch 25/25
1719/1719 [==============================] - 4s 2ms/step - loss: 0.2032 - accuracy: 0.9296 - val_loss: 0.3228 - val_accuracy: 0.8890
learning_rate = 0.01
decay = 1e-4
batch_size = 32
n_steps_per_epoch = len(X_train) // batch_size
epochs = np.arange(n_epochs)
lrs = learning_rate / (1 + decay * epochs * n_steps_per_epoch)

plt.plot(epochs, lrs,  "o-")
plt.axis([0, n_epochs - 1, 0, 0.01])
plt.xlabel("Epoch")
plt.ylabel("Learning Rate")
plt.title("Power Scheduling", fontsize=14)
plt.grid(True)

plt.show()

Output:

2. Exponential scheduling: \eta(t) = \eta_0 \cdot 0.1^{t/s}, i.e. the learning rate is divided by 10 every s steps.

def exponential_decay_fn(epoch):
    return 0.01 * 0.1**(epoch / 20)

def exponential_decay(lr0, s):
    def exponential_decay_fn(epoch):
        return lr0 * 0.1**(epoch / s)
    return exponential_decay_fn

exponential_decay_fn = exponential_decay(lr0=0.01, s=20)

model = tf.keras.models.Sequential([
    tf.keras.layers.Flatten(input_shape=[28, 28]),
    tf.keras.layers.Dense(300, activation="selu", kernel_initializer="lecun_normal"),
    tf.keras.layers.Dense(100, activation="selu", kernel_initializer="lecun_normal"),
    tf.keras.layers.Dense(10, activation="softmax")
])
model.compile(loss="sparse_categorical_crossentropy", optimizer="nadam", metrics=["accuracy"])
n_epochs = 25

lr_scheduler = tf.keras.callbacks.LearningRateScheduler(exponential_decay_fn)
history = model.fit(X_train_scaled, y_train, epochs=n_epochs,validation_data=(X_valid_scaled, y_valid),callbacks=[lr_scheduler])

Output:

Epoch 1/25
1719/1719 [==============================] - 6s 4ms/step - loss: 0.8467 - accuracy: 0.7609 - val_loss: 0.7625 - val_accuracy: 0.7572 - lr: 0.0100
Epoch 2/25
1719/1719 [==============================] - 6s 4ms/step - loss: 0.6825 - accuracy: 0.7932 - val_loss: 0.6958 - val_accuracy: 0.7904 - lr: 0.0089
Epoch 3/25
1719/1719 [==============================] - 6s 3ms/step - loss: 0.5947 - accuracy: 0.8178 - val_loss: 0.6523 - val_accuracy: 0.8182 - lr: 0.0079
Epoch 4/25
1719/1719 [==============================] - 6s 3ms/step - loss: 0.5415 - accuracy: 0.8342 - val_loss: 0.6047 - val_accuracy: 0.8186 - lr: 0.0071
Epoch 5/25
1719/1719 [==============================] - 6s 4ms/step - loss: 0.4785 - accuracy: 0.8492 - val_loss: 0.5403 - val_accuracy: 0.8432 - lr: 0.0063
Epoch 6/25
1719/1719 [==============================] - 6s 4ms/step - loss: 0.4411 - accuracy: 0.8593 - val_loss: 0.4944 - val_accuracy: 0.8510 - lr: 0.0056
Epoch 7/25
1719/1719 [==============================] - 6s 4ms/step - loss: 0.4106 - accuracy: 0.8693 - val_loss: 0.5093 - val_accuracy: 0.8628 - lr: 0.0050
Epoch 8/25
1719/1719 [==============================] - 6s 4ms/step - loss: 0.3799 - accuracy: 0.8772 - val_loss: 0.5035 - val_accuracy: 0.8604 - lr: 0.0045
Epoch 9/25
1719/1719 [==============================] - 6s 4ms/step - loss: 0.3555 - accuracy: 0.8845 - val_loss: 0.4684 - val_accuracy: 0.8726 - lr: 0.0040
Epoch 10/25
1719/1719 [==============================] - 6s 4ms/step - loss: 0.3319 - accuracy: 0.8913 - val_loss: 0.4865 - val_accuracy: 0.8596 - lr: 0.0035
Epoch 11/25
1719/1719 [==============================] - 6s 4ms/step - loss: 0.3080 - accuracy: 0.8946 - val_loss: 0.5358 - val_accuracy: 0.8558 - lr: 0.0032
Epoch 12/25
1719/1719 [==============================] - 6s 4ms/step - loss: 0.2829 - accuracy: 0.9051 - val_loss: 0.5133 - val_accuracy: 0.8754 - lr: 0.0028
Epoch 13/25
1719/1719 [==============================] - 6s 4ms/step - loss: 0.2665 - accuracy: 0.9097 - val_loss: 0.4599 - val_accuracy: 0.8786 - lr: 0.0025
Epoch 14/25
1719/1719 [==============================] - 6s 4ms/step - loss: 0.2485 - accuracy: 0.9159 - val_loss: 0.4454 - val_accuracy: 0.8756 - lr: 0.0022
Epoch 15/25
1719/1719 [==============================] - 6s 4ms/step - loss: 0.2268 - accuracy: 0.9220 - val_loss: 0.4413 - val_accuracy: 0.8882 - lr: 0.0020
Epoch 16/25
1719/1719 [==============================] - 6s 4ms/step - loss: 0.2100 - accuracy: 0.9284 - val_loss: 0.4901 - val_accuracy: 0.8852 - lr: 0.0018
Epoch 17/25
1719/1719 [==============================] - 6s 4ms/step - loss: 0.1964 - accuracy: 0.9337 - val_loss: 0.5033 - val_accuracy: 0.8880 - lr: 0.0016
Epoch 18/25
1719/1719 [==============================] - 6s 4ms/step - loss: 0.1831 - accuracy: 0.9387 - val_loss: 0.4918 - val_accuracy: 0.8918 - lr: 0.0014
Epoch 19/25
1719/1719 [==============================] - 6s 4ms/step - loss: 0.1685 - accuracy: 0.9434 - val_loss: 0.4844 - val_accuracy: 0.8912 - lr: 0.0013
Epoch 20/25
1719/1719 [==============================] - 6s 4ms/step - loss: 0.1598 - accuracy: 0.9459 - val_loss: 0.5139 - val_accuracy: 0.8846 - lr: 0.0011
Epoch 21/25
1719/1719 [==============================] - 6s 4ms/step - loss: 0.1487 - accuracy: 0.9503 - val_loss: 0.5492 - val_accuracy: 0.8894 - lr: 0.0010
Epoch 22/25
1719/1719 [==============================] - 6s 4ms/step - loss: 0.1404 - accuracy: 0.9545 - val_loss: 0.5564 - val_accuracy: 0.8922 - lr: 8.9125e-04
Epoch 23/25
1719/1719 [==============================] - 6s 4ms/step - loss: 0.1324 - accuracy: 0.9568 - val_loss: 0.5844 - val_accuracy: 0.8918 - lr: 7.9433e-04
Epoch 24/25
1719/1719 [==============================] - 6s 4ms/step - loss: 0.1232 - accuracy: 0.9607 - val_loss: 0.6140 - val_accuracy: 0.8902 - lr: 7.0795e-04
Epoch 25/25
1719/1719 [==============================] - 6s 4ms/step - loss: 0.1163 - accuracy: 0.9633 - val_loss: 0.6287 - val_accuracy: 0.8936 - lr: 6.3096e-04
plt.plot(history.epoch, history.history["lr"], "o-")
plt.axis([0, n_epochs - 1, 0, 0.011])
plt.xlabel("Epoch")
plt.ylabel("Learning Rate")
plt.title("Exponential Scheduling", fontsize=14)
plt.grid(True)

plt.show()

Output:

If you want to update the learning rate at every iteration (batch) rather than once per epoch, you need to define a custom callback class:

K = tf.keras.backend

class ExponentialDecay(tf.keras.callbacks.Callback):
    def __init__(self, s=40000):
        super().__init__()
        self.s = s

    def on_batch_begin(self, batch, logs=None):
        # Note: the `batch` argument is reset at each epoch
        lr = K.get_value(self.model.optimizer.lr)
        K.set_value(self.model.optimizer.lr, lr * 0.1**(1 / self.s))

    def on_epoch_end(self, epoch, logs=None):
        logs = logs or {}
        logs['lr'] = K.get_value(self.model.optimizer.lr)

model = tf.keras.models.Sequential([
    tf.keras.layers.Flatten(input_shape=[28, 28]),
    tf.keras.layers.Dense(300, activation="selu", kernel_initializer="lecun_normal"),
    tf.keras.layers.Dense(100, activation="selu", kernel_initializer="lecun_normal"),
    tf.keras.layers.Dense(10, activation="softmax")
])
lr0 = 0.01
optimizer = tf.keras.optimizers.Nadam(learning_rate=lr0)
model.compile(loss="sparse_categorical_crossentropy", optimizer=optimizer, metrics=["accuracy"])
n_epochs = 25

s = 20 * len(X_train) // 32 # number of steps in 20 epochs (batch size = 32)
exp_decay = ExponentialDecay(s)
history = model.fit(X_train_scaled, y_train, epochs=n_epochs,validation_data=(X_valid_scaled, y_valid),
                    callbacks=[exp_decay])

Output:

Epoch 1/25
1719/1719 [==============================] - 7s 4ms/step - loss: 0.8009 - accuracy: 0.7677 - val_loss: 0.8101 - val_accuracy: 0.7580 - lr: 0.0089
Epoch 2/25
1719/1719 [==============================] - 7s 4ms/step - loss: 0.6447 - accuracy: 0.8066 - val_loss: 0.5788 - val_accuracy: 0.8472 - lr: 0.0079
Epoch 3/25
1719/1719 [==============================] - 7s 4ms/step - loss: 0.5534 - accuracy: 0.8301 - val_loss: 0.5007 - val_accuracy: 0.8470 - lr: 0.0071
Epoch 4/25
1719/1719 [==============================] - 7s 4ms/step - loss: 0.5228 - accuracy: 0.8390 - val_loss: 0.4884 - val_accuracy: 0.8452 - lr: 0.0063
Epoch 5/25
1719/1719 [==============================] - 7s 4ms/step - loss: 0.4468 - accuracy: 0.8564 - val_loss: 0.5696 - val_accuracy: 0.8450 - lr: 0.0056
Epoch 6/25
1719/1719 [==============================] - 7s 4ms/step - loss: 0.4387 - accuracy: 0.8617 - val_loss: 0.5285 - val_accuracy: 0.8458 - lr: 0.0050
Epoch 7/25
1719/1719 [==============================] - 7s 4ms/step - loss: 0.4079 - accuracy: 0.8689 - val_loss: 0.4563 - val_accuracy: 0.8608 - lr: 0.0045
Epoch 8/25
1719/1719 [==============================] - 8s 4ms/step - loss: 0.3551 - accuracy: 0.8823 - val_loss: 0.5059 - val_accuracy: 0.8664 - lr: 0.0040
Epoch 9/25
1719/1719 [==============================] - 7s 4ms/step - loss: 0.3367 - accuracy: 0.8887 - val_loss: 0.5077 - val_accuracy: 0.8634 - lr: 0.0035
Epoch 10/25
1719/1719 [==============================] - 7s 4ms/step - loss: 0.3071 - accuracy: 0.8981 - val_loss: 0.4854 - val_accuracy: 0.8622 - lr: 0.0032
Epoch 11/25
1719/1719 [==============================] - 7s 4ms/step - loss: 0.2884 - accuracy: 0.9034 - val_loss: 0.4310 - val_accuracy: 0.8822 - lr: 0.0028
Epoch 12/25
1719/1719 [==============================] - 7s 4ms/step - loss: 0.2745 - accuracy: 0.9085 - val_loss: 0.3991 - val_accuracy: 0.8842 - lr: 0.0025
Epoch 13/25
1719/1719 [==============================] - 7s 4ms/step - loss: 0.2507 - accuracy: 0.9152 - val_loss: 0.4072 - val_accuracy: 0.8864 - lr: 0.0022
Epoch 14/25
1719/1719 [==============================] - 7s 4ms/step - loss: 0.2287 - accuracy: 0.9217 - val_loss: 0.4278 - val_accuracy: 0.8922 - lr: 0.0020
Epoch 15/25
1719/1719 [==============================] - 7s 4ms/step - loss: 0.2147 - accuracy: 0.9260 - val_loss: 0.4727 - val_accuracy: 0.8828 - lr: 0.0018
Epoch 16/25
1719/1719 [==============================] - 7s 4ms/step - loss: 0.2003 - accuracy: 0.9322 - val_loss: 0.4798 - val_accuracy: 0.8816 - lr: 0.0016
Epoch 17/25
1719/1719 [==============================] - 7s 4ms/step - loss: 0.1833 - accuracy: 0.9363 - val_loss: 0.4768 - val_accuracy: 0.8880 - lr: 0.0014
Epoch 18/25
1719/1719 [==============================] - 7s 4ms/step - loss: 0.1732 - accuracy: 0.9399 - val_loss: 0.4821 - val_accuracy: 0.8876 - lr: 0.0013
Epoch 19/25
1719/1719 [==============================] - 7s 4ms/step - loss: 0.1602 - accuracy: 0.9459 - val_loss: 0.4611 - val_accuracy: 0.8882 - lr: 0.0011
Epoch 20/25
1719/1719 [==============================] - 7s 4ms/step - loss: 0.1485 - accuracy: 0.9505 - val_loss: 0.5002 - val_accuracy: 0.8862 - lr: 9.9967e-04
Epoch 21/25
1719/1719 [==============================] - 7s 4ms/step - loss: 0.1404 - accuracy: 0.9526 - val_loss: 0.5100 - val_accuracy: 0.8890 - lr: 8.9094e-04
Epoch 22/25
1719/1719 [==============================] - 7s 4ms/step - loss: 0.1296 - accuracy: 0.9565 - val_loss: 0.5331 - val_accuracy: 0.8924 - lr: 7.9404e-04
Epoch 23/25
1719/1719 [==============================] - 7s 4ms/step - loss: 0.1240 - accuracy: 0.9581 - val_loss: 0.5813 - val_accuracy: 0.8864 - lr: 7.0767e-04
Epoch 24/25
1719/1719 [==============================] - 7s 4ms/step - loss: 0.1159 - accuracy: 0.9619 - val_loss: 0.6023 - val_accuracy: 0.8904 - lr: 6.3071e-04
Epoch 25/25
1719/1719 [==============================] - 7s 4ms/step - loss: 0.1071 - accuracy: 0.9647 - val_loss: 0.6166 - val_accuracy: 0.8856 - lr: 5.6211e-04
n_steps = n_epochs * len(X_train) // 32
steps = np.arange(n_steps)
lrs = lr0 * 0.1**(steps / s)

plt.plot(steps, lrs, "-", linewidth=2)
plt.axis([0, n_steps - 1, 0, lr0 * 1.1])
plt.xlabel("Batch")
plt.ylabel("Learning Rate")
plt.title("Exponential Scheduling (per batch)", fontsize=14)
plt.grid(True)
plt.show()

Output:

3. Piecewise constant scheduling: use a constant learning rate for a number of epochs, then a smaller one for the next stretch of epochs, and so on. For example, \eta_0 = 0.1 for the first 5 epochs, then \eta_1 = 0.001 for the next 50 epochs, etc.

def piecewise_constant_fn(epoch):
    if epoch < 5:
        return 0.01
    elif epoch < 15:
        return 0.005
    else:
        return 0.001
    
def piecewise_constant(boundaries, values):
    boundaries = np.array([0] + boundaries)
    values = np.array(values)
    def piecewise_constant_fn(epoch):
        return values[np.argmax(boundaries > epoch) - 1]
    return piecewise_constant_fn

piecewise_constant_fn = piecewise_constant([5, 15], [0.01, 0.005, 0.001])

lr_scheduler = tf.keras.callbacks.LearningRateScheduler(piecewise_constant_fn)

model = tf.keras.models.Sequential([
    tf.keras.layers.Flatten(input_shape=[28, 28]),
    tf.keras.layers.Dense(300, activation="selu", kernel_initializer="lecun_normal"),
    tf.keras.layers.Dense(100, activation="selu", kernel_initializer="lecun_normal"),
    tf.keras.layers.Dense(10, activation="softmax")
])
model.compile(loss="sparse_categorical_crossentropy", optimizer="nadam", metrics=["accuracy"])
n_epochs = 25
history = model.fit(X_train_scaled, y_train, epochs=n_epochs,validation_data=(X_valid_scaled, y_valid),
                    callbacks=[lr_scheduler])

Output:

Epoch 1/25
1719/1719 [==============================] - 6s 4ms/step - loss: 0.8698 - accuracy: 0.7538 - val_loss: 0.9075 - val_accuracy: 0.7786 - lr: 0.0100
Epoch 2/25
1719/1719 [==============================] - 6s 4ms/step - loss: 0.7719 - accuracy: 0.7702 - val_loss: 1.2671 - val_accuracy: 0.6852 - lr: 0.0100
Epoch 3/25
1719/1719 [==============================] - 6s 4ms/step - loss: 0.8871 - accuracy: 0.7415 - val_loss: 0.9666 - val_accuracy: 0.7132 - lr: 0.0100
Epoch 4/25
1719/1719 [==============================] - 6s 4ms/step - loss: 0.8058 - accuracy: 0.7593 - val_loss: 0.9763 - val_accuracy: 0.6970 - lr: 0.0100
Epoch 5/25
1719/1719 [==============================] - 6s 4ms/step - loss: 1.0436 - accuracy: 0.6513 - val_loss: 1.2444 - val_accuracy: 0.6288 - lr: 0.0100
Epoch 6/25
1719/1719 [==============================] - 6s 4ms/step - loss: 0.8879 - accuracy: 0.6553 - val_loss: 0.9390 - val_accuracy: 0.6750 - lr: 0.0050
Epoch 7/25
1719/1719 [==============================] - 6s 4ms/step - loss: 0.7548 - accuracy: 0.7186 - val_loss: 0.7852 - val_accuracy: 0.7534 - lr: 0.0050
Epoch 8/25
1719/1719 [==============================] - 6s 4ms/step - loss: 0.6474 - accuracy: 0.7623 - val_loss: 0.7529 - val_accuracy: 0.7542 - lr: 0.0050
Epoch 9/25
1719/1719 [==============================] - 7s 4ms/step - loss: 0.5770 - accuracy: 0.8196 - val_loss: 0.5932 - val_accuracy: 0.8414 - lr: 0.0050
Epoch 10/25
1719/1719 [==============================] - 6s 4ms/step - loss: 0.5391 - accuracy: 0.8449 - val_loss: 0.5886 - val_accuracy: 0.8486 - lr: 0.0050
Epoch 11/25
1719/1719 [==============================] - 6s 4ms/step - loss: 0.5011 - accuracy: 0.8542 - val_loss: 0.5813 - val_accuracy: 0.8550 - lr: 0.0050
Epoch 12/25
1719/1719 [==============================] - 6s 4ms/step - loss: 0.4896 - accuracy: 0.8568 - val_loss: 0.6262 - val_accuracy: 0.8544 - lr: 0.0050
Epoch 13/25
1719/1719 [==============================] - 6s 4ms/step - loss: 0.4655 - accuracy: 0.8617 - val_loss: 0.6051 - val_accuracy: 0.8572 - lr: 0.0050
Epoch 14/25
1719/1719 [==============================] - 6s 4ms/step - loss: 0.4856 - accuracy: 0.8613 - val_loss: 0.6400 - val_accuracy: 0.8624 - lr: 0.0050
Epoch 15/25
1719/1719 [==============================] - 6s 4ms/step - loss: 0.4549 - accuracy: 0.8693 - val_loss: 0.6759 - val_accuracy: 0.8544 - lr: 0.0050
Epoch 16/25
1719/1719 [==============================] - 6s 4ms/step - loss: 0.3333 - accuracy: 0.8979 - val_loss: 0.5550 - val_accuracy: 0.8728 - lr: 0.0010
Epoch 17/25
1719/1719 [==============================] - 7s 4ms/step - loss: 0.3035 - accuracy: 0.9052 - val_loss: 0.5827 - val_accuracy: 0.8688 - lr: 0.0010
Epoch 18/25
1719/1719 [==============================] - 6s 4ms/step - loss: 0.2858 - accuracy: 0.9093 - val_loss: 0.5516 - val_accuracy: 0.8774 - lr: 0.0010
Epoch 19/25
1719/1719 [==============================] - 6s 3ms/step - loss: 0.2751 - accuracy: 0.9109 - val_loss: 0.5637 - val_accuracy: 0.8740 - lr: 0.0010
Epoch 20/25
1719/1719 [==============================] - 6s 4ms/step - loss: 0.2640 - accuracy: 0.9148 - val_loss: 0.5681 - val_accuracy: 0.8774 - lr: 0.0010
Epoch 21/25
1719/1719 [==============================] - 7s 4ms/step - loss: 0.2537 - accuracy: 0.9192 - val_loss: 0.5629 - val_accuracy: 0.8776 - lr: 0.0010
Epoch 22/25
1719/1719 [==============================] - 6s 4ms/step - loss: 0.2446 - accuracy: 0.9208 - val_loss: 0.6173 - val_accuracy: 0.8722 - lr: 0.0010
Epoch 23/25
1719/1719 [==============================] - 6s 4ms/step - loss: 0.2373 - accuracy: 0.9230 - val_loss: 0.5811 - val_accuracy: 0.8750 - lr: 0.0010
Epoch 24/25
1719/1719 [==============================] - 7s 4ms/step - loss: 0.2299 - accuracy: 0.9251 - val_loss: 0.6269 - val_accuracy: 0.8768 - lr: 0.0010
Epoch 25/25
1719/1719 [==============================] - 6s 4ms/step - loss: 0.2219 - accuracy: 0.9278 - val_loss: 0.6314 - val_accuracy: 0.8746 - lr: 0.0010
plt.plot(history.epoch, [piecewise_constant_fn(epoch) for epoch in history.epoch], "o-")
plt.axis([0, n_epochs - 1, 0, 0.011])
plt.xlabel("Epoch")
plt.ylabel("Learning Rate")
plt.title("Piecewise Constant Scheduling", fontsize=14)
plt.grid(True)
plt.show()

Output: (plot of the learning rate vs. epoch, titled "Piecewise Constant Scheduling")

4. Performance scheduling: measure the validation error every N steps (just as for early stopping) and multiply the learning rate by a factor λ when the error stops dropping.

tf.random.set_seed(42)
np.random.seed(42)

lr_scheduler = tf.keras.callbacks.ReduceLROnPlateau(factor=0.5, patience=5)

model = tf.keras.models.Sequential([
    tf.keras.layers.Flatten(input_shape=[28, 28]),
    tf.keras.layers.Dense(300, activation="selu", kernel_initializer="lecun_normal"),
    tf.keras.layers.Dense(100, activation="selu", kernel_initializer="lecun_normal"),
    tf.keras.layers.Dense(10, activation="softmax")
])
optimizer = tf.keras.optimizers.SGD(lr=0.02, momentum=0.9)
model.compile(loss="sparse_categorical_crossentropy", optimizer=optimizer, metrics=["accuracy"])
n_epochs = 25
history = model.fit(X_train_scaled, y_train, epochs=n_epochs, validation_data=(X_valid_scaled, y_valid),
                    callbacks=[lr_scheduler])

Output:

Epoch 1/25
1719/1719 [==============================] - 4s 2ms/step - loss: 0.5893 - accuracy: 0.8080 - val_loss: 0.4866 - val_accuracy: 0.8440 - lr: 0.0200
Epoch 2/25
1719/1719 [==============================] - 4s 2ms/step - loss: 0.4949 - accuracy: 0.8399 - val_loss: 0.5799 - val_accuracy: 0.8386 - lr: 0.0200
Epoch 3/25
1719/1719 [==============================] - 4s 2ms/step - loss: 0.5274 - accuracy: 0.8387 - val_loss: 0.5704 - val_accuracy: 0.8496 - lr: 0.0200
Epoch 4/25
1719/1719 [==============================] - 4s 2ms/step - loss: 0.5211 - accuracy: 0.8457 - val_loss: 0.4998 - val_accuracy: 0.8472 - lr: 0.0200
Epoch 5/25
1719/1719 [==============================] - 4s 2ms/step - loss: 0.5173 - accuracy: 0.8495 - val_loss: 0.5814 - val_accuracy: 0.8408 - lr: 0.0200
Epoch 6/25
1719/1719 [==============================] - 4s 2ms/step - loss: 0.5041 - accuracy: 0.8567 - val_loss: 0.5605 - val_accuracy: 0.8558 - lr: 0.0200
Epoch 7/25
1719/1719 [==============================] - 4s 2ms/step - loss: 0.3054 - accuracy: 0.8941 - val_loss: 0.3880 - val_accuracy: 0.8734 - lr: 0.0100
Epoch 8/25
1719/1719 [==============================] - 4s 2ms/step - loss: 0.2582 - accuracy: 0.9063 - val_loss: 0.3813 - val_accuracy: 0.8794 - lr: 0.0100
Epoch 9/25
1719/1719 [==============================] - 4s 2ms/step - loss: 0.2346 - accuracy: 0.9129 - val_loss: 0.4071 - val_accuracy: 0.8878 - lr: 0.0100
Epoch 10/25
1719/1719 [==============================] - 4s 2ms/step - loss: 0.2165 - accuracy: 0.9209 - val_loss: 0.4049 - val_accuracy: 0.8880 - lr: 0.0100
Epoch 11/25
1719/1719 [==============================] - 4s 2ms/step - loss: 0.2013 - accuracy: 0.9259 - val_loss: 0.4272 - val_accuracy: 0.8822 - lr: 0.0100
Epoch 12/25
1719/1719 [==============================] - 4s 2ms/step - loss: 0.1980 - accuracy: 0.9267 - val_loss: 0.4333 - val_accuracy: 0.8782 - lr: 0.0100
Epoch 13/25
1719/1719 [==============================] - 4s 3ms/step - loss: 0.1893 - accuracy: 0.9300 - val_loss: 0.4563 - val_accuracy: 0.8882 - lr: 0.0100
Epoch 14/25
1719/1719 [==============================] - 4s 3ms/step - loss: 0.1302 - accuracy: 0.9497 - val_loss: 0.4210 - val_accuracy: 0.8926 - lr: 0.0050
Epoch 15/25
1719/1719 [==============================] - 4s 3ms/step - loss: 0.1166 - accuracy: 0.9553 - val_loss: 0.4265 - val_accuracy: 0.8922 - lr: 0.0050
Epoch 16/25
1719/1719 [==============================] - 4s 3ms/step - loss: 0.1059 - accuracy: 0.9591 - val_loss: 0.4405 - val_accuracy: 0.8968 - lr: 0.0050
Epoch 17/25
1719/1719 [==============================] - 4s 3ms/step - loss: 0.0990 - accuracy: 0.9620 - val_loss: 0.4638 - val_accuracy: 0.8940 - lr: 0.0050
Epoch 18/25
1719/1719 [==============================] - 4s 3ms/step - loss: 0.0923 - accuracy: 0.9642 - val_loss: 0.4891 - val_accuracy: 0.8922 - lr: 0.0050
Epoch 19/25
1719/1719 [==============================] - 4s 2ms/step - loss: 0.0718 - accuracy: 0.9728 - val_loss: 0.4853 - val_accuracy: 0.8984 - lr: 0.0025
Epoch 20/25
1719/1719 [==============================] - 4s 2ms/step - loss: 0.0653 - accuracy: 0.9759 - val_loss: 0.4915 - val_accuracy: 0.8960 - lr: 0.0025
Epoch 21/25
1719/1719 [==============================] - 4s 2ms/step - loss: 0.0615 - accuracy: 0.9777 - val_loss: 0.5045 - val_accuracy: 0.8968 - lr: 0.0025
Epoch 22/25
1719/1719 [==============================] - 4s 2ms/step - loss: 0.0584 - accuracy: 0.9788 - val_loss: 0.5199 - val_accuracy: 0.8986 - lr: 0.0025
Epoch 23/25
1719/1719 [==============================] - 4s 2ms/step - loss: 0.0551 - accuracy: 0.9802 - val_loss: 0.5174 - val_accuracy: 0.8960 - lr: 0.0025
Epoch 24/25
1719/1719 [==============================] - 4s 2ms/step - loss: 0.0461 - accuracy: 0.9846 - val_loss: 0.5340 - val_accuracy: 0.8966 - lr: 0.0012
Epoch 25/25
1719/1719 [==============================] - 4s 2ms/step - loss: 0.0438 - accuracy: 0.9849 - val_loss: 0.5439 - val_accuracy: 0.8942 - lr: 0.0012
plt.plot(history.epoch, history.history["lr"], "bo-")
plt.xlabel("Epoch")
plt.ylabel("Learning Rate", color='b')
plt.tick_params('y', colors='b')
plt.gca().set_xlim(0, n_epochs - 1)
plt.grid(True)

ax2 = plt.gca().twinx()
ax2.plot(history.epoch, history.history["val_loss"], "r^-")
ax2.set_ylabel('Validation Loss', color='r')
ax2.tick_params('y', colors='r')

plt.title("Reduce LR on Plateau", fontsize=14)
plt.show()

Output: (plot of the learning rate and validation loss vs. epoch, titled "Reduce LR on Plateau")

5. tf.keras scheduling: pick a schedule from tf.keras.optimizers.schedules and pass it to the optimizer as its learning rate; the rate is then updated at every step rather than once per epoch. This approach is simple, and the schedule is saved along with the model. Note that it is not part of the multi-backend Keras API; it is specific to tf.keras.

model = tf.keras.models.Sequential([
    tf.keras.layers.Flatten(input_shape=[28, 28]),
    tf.keras.layers.Dense(300, activation="selu", kernel_initializer="lecun_normal"),
    tf.keras.layers.Dense(100, activation="selu", kernel_initializer="lecun_normal"),
    tf.keras.layers.Dense(10, activation="softmax")
])
s = 20 * len(X_train) // 32 # number of steps in 20 epochs (batch size = 32)
learning_rate = tf.keras.optimizers.schedules.ExponentialDecay(0.01, s, 0.1)
optimizer = tf.keras.optimizers.SGD(learning_rate)
model.compile(loss="sparse_categorical_crossentropy", optimizer=optimizer, metrics=["accuracy"])
n_epochs = 25

history = model.fit(X_train_scaled, y_train, epochs=n_epochs,validation_data=(X_valid_scaled, y_valid))

Output:

Epoch 1/25
1719/1719 [==============================] - 4s 2ms/step - loss: 0.4893 - accuracy: 0.8274 - val_loss: 0.4095 - val_accuracy: 0.8598
Epoch 2/25
1719/1719 [==============================] - 4s 2ms/step - loss: 0.3819 - accuracy: 0.8651 - val_loss: 0.3740 - val_accuracy: 0.8696
Epoch 3/25
1719/1719 [==============================] - 4s 3ms/step - loss: 0.3486 - accuracy: 0.8765 - val_loss: 0.3731 - val_accuracy: 0.8686
Epoch 4/25
1719/1719 [==============================] - 4s 2ms/step - loss: 0.3262 - accuracy: 0.8836 - val_loss: 0.3495 - val_accuracy: 0.8804
Epoch 5/25
1719/1719 [==============================] - 4s 2ms/step - loss: 0.3102 - accuracy: 0.8895 - val_loss: 0.3431 - val_accuracy: 0.8794
Epoch 6/25
1719/1719 [==============================] - 4s 2ms/step - loss: 0.2956 - accuracy: 0.8953 - val_loss: 0.3416 - val_accuracy: 0.8816
Epoch 7/25
1719/1719 [==============================] - 4s 2ms/step - loss: 0.2852 - accuracy: 0.8989 - val_loss: 0.3355 - val_accuracy: 0.8812
Epoch 8/25
1719/1719 [==============================] - 4s 2ms/step - loss: 0.2759 - accuracy: 0.9018 - val_loss: 0.3364 - val_accuracy: 0.8826
Epoch 9/25
1719/1719 [==============================] - 4s 2ms/step - loss: 0.2676 - accuracy: 0.9053 - val_loss: 0.3265 - val_accuracy: 0.8860
Epoch 10/25
1719/1719 [==============================] - 4s 2ms/step - loss: 0.2606 - accuracy: 0.9068 - val_loss: 0.3241 - val_accuracy: 0.8866
Epoch 11/25
1719/1719 [==============================] - 4s 2ms/step - loss: 0.2549 - accuracy: 0.9088 - val_loss: 0.3252 - val_accuracy: 0.8862
Epoch 12/25
1719/1719 [==============================] - 4s 2ms/step - loss: 0.2495 - accuracy: 0.9127 - val_loss: 0.3302 - val_accuracy: 0.8816
Epoch 13/25
1719/1719 [==============================] - 4s 2ms/step - loss: 0.2448 - accuracy: 0.9137 - val_loss: 0.3218 - val_accuracy: 0.8878
Epoch 14/25
1719/1719 [==============================] - 4s 2ms/step - loss: 0.2414 - accuracy: 0.9147 - val_loss: 0.3223 - val_accuracy: 0.8852
Epoch 15/25
1719/1719 [==============================] - 4s 2ms/step - loss: 0.2374 - accuracy: 0.9166 - val_loss: 0.3210 - val_accuracy: 0.8868
Epoch 16/25
1719/1719 [==============================] - 4s 2ms/step - loss: 0.2342 - accuracy: 0.9179 - val_loss: 0.3185 - val_accuracy: 0.8888
Epoch 17/25
1719/1719 [==============================] - 4s 2ms/step - loss: 0.2316 - accuracy: 0.9185 - val_loss: 0.3199 - val_accuracy: 0.8894
Epoch 18/25
1719/1719 [==============================] - 4s 2ms/step - loss: 0.2290 - accuracy: 0.9197 - val_loss: 0.3169 - val_accuracy: 0.8900
Epoch 19/25
1719/1719 [==============================] - 4s 2ms/step - loss: 0.2269 - accuracy: 0.9206 - val_loss: 0.3198 - val_accuracy: 0.8896
Epoch 20/25
1719/1719 [==============================] - 4s 2ms/step - loss: 0.2249 - accuracy: 0.9218 - val_loss: 0.3170 - val_accuracy: 0.8906
Epoch 21/25
1719/1719 [==============================] - 4s 2ms/step - loss: 0.2228 - accuracy: 0.9223 - val_loss: 0.3180 - val_accuracy: 0.8902
Epoch 22/25
1719/1719 [==============================] - 4s 2ms/step - loss: 0.2215 - accuracy: 0.9224 - val_loss: 0.3165 - val_accuracy: 0.8918
Epoch 23/25
1719/1719 [==============================] - 4s 2ms/step - loss: 0.2200 - accuracy: 0.9234 - val_loss: 0.3172 - val_accuracy: 0.8896
Epoch 24/25
1719/1719 [==============================] - 4s 2ms/step - loss: 0.2187 - accuracy: 0.9241 - val_loss: 0.3167 - val_accuracy: 0.8902
Epoch 25/25
1719/1719 [==============================] - 4s 2ms/step - loss: 0.2179 - accuracy: 0.9241 - val_loss: 0.3166 - val_accuracy: 0.8920

For piecewise constant scheduling, the corresponding tf.keras schedule can be used as follows:

n_steps_per_epoch = len(X_train) // 32  # batch size = 32
learning_rate = tf.keras.optimizers.schedules.PiecewiseConstantDecay(
    boundaries=[5.0 * n_steps_per_epoch, 15.0 * n_steps_per_epoch],
    values=[0.01, 0.005, 0.001])
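
Like the ExponentialDecay schedule above, this object is then passed to an optimizer as its learning rate (a usage sketch that simply mirrors the earlier code):

optimizer = tf.keras.optimizers.SGD(learning_rate)
model.compile(loss="sparse_categorical_crossentropy", optimizer=optimizer, metrics=["accuracy"])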

6. 1cycle scheduling: start training at learning rate η0, increase it linearly up to η1 halfway through training, then decrease it linearly back down to η0 during the second half, and finish the last few epochs by dropping the rate linearly by several orders of magnitude.

K = tf.keras.backend

class ExponentialLearningRate(tf.keras.callbacks.Callback):
    def __init__(self, factor):
        self.factor = factor
        self.rates = []
        self.losses = []
    def on_batch_end(self, batch, logs):
        self.rates.append(K.get_value(self.model.optimizer.lr))
        self.losses.append(logs["loss"])
        K.set_value(self.model.optimizer.lr, self.model.optimizer.lr * self.factor)

def find_learning_rate(model, X, y, epochs=1, batch_size=32, min_rate=10**-5, max_rate=10):
    init_weights = model.get_weights()
    iterations = len(X) // batch_size * epochs
    factor = np.exp(np.log(max_rate / min_rate) / iterations)
    init_lr = K.get_value(model.optimizer.lr)
    K.set_value(model.optimizer.lr, min_rate)
    exp_lr = ExponentialLearningRate(factor)
    history = model.fit(X, y, epochs=epochs, batch_size=batch_size,
                        callbacks=[exp_lr])
    K.set_value(model.optimizer.lr, init_lr)
    model.set_weights(init_weights)
    return exp_lr.rates, exp_lr.losses

def plot_lr_vs_loss(rates, losses):
    plt.plot(rates, losses)
    plt.gca().set_xscale('log')
    plt.hlines(min(losses), min(rates), max(rates))
    plt.axis([min(rates), max(rates), min(losses), (losses[0] + min(losses)) / 2])
    plt.xlabel("Learning rate")
    plt.ylabel("Loss")
    
tf.random.set_seed(42)
np.random.seed(42)

model = tf.keras.models.Sequential([
    tf.keras.layers.Flatten(input_shape=[28, 28]),
    tf.keras.layers.Dense(300, activation="selu", kernel_initializer="lecun_normal"),
    tf.keras.layers.Dense(100, activation="selu", kernel_initializer="lecun_normal"),
    tf.keras.layers.Dense(10, activation="softmax")
])
model.compile(loss="sparse_categorical_crossentropy",optimizer=tf.keras.optimizers.SGD(lr=1e-3),
              metrics=["accuracy"])

batch_size = 128
rates, losses = find_learning_rate(model, X_train_scaled, y_train, epochs=1, batch_size=batch_size)
plot_lr_vs_loss(rates, losses)

Output:

430/430 [==============================] - 1s 3ms/step - loss: nan - accuracy: 0.3952

class OneCycleScheduler(tf.keras.callbacks.Callback):
    def __init__(self, iterations, max_rate, start_rate=None,
                 last_iterations=None, last_rate=None):
        self.iterations = iterations
        self.max_rate = max_rate
        self.start_rate = start_rate or max_rate / 10
        self.last_iterations = last_iterations or iterations // 10 + 1
        self.half_iteration = (iterations - self.last_iterations) // 2
        self.last_rate = last_rate or self.start_rate / 1000
        self.iteration = 0
    def _interpolate(self, iter1, iter2, rate1, rate2):
        return ((rate2 - rate1) * (self.iteration - iter1)
                / (iter2 - iter1) + rate1)
    def on_batch_begin(self, batch, logs):
        if self.iteration < self.half_iteration:
            rate = self._interpolate(0, self.half_iteration, self.start_rate, self.max_rate)
        elif self.iteration < 2 * self.half_iteration:
            rate = self._interpolate(self.half_iteration, 2 * self.half_iteration,
                                     self.max_rate, self.start_rate)
        else:
            rate = self._interpolate(2 * self.half_iteration, self.iterations,
                                     self.start_rate, self.last_rate)
            rate = max(rate, self.last_rate)
        self.iteration += 1
        K.set_value(self.model.optimizer.lr, rate)
        
n_epochs = 25
onecycle = OneCycleScheduler(len(X_train) // batch_size * n_epochs, max_rate=0.05)
history = model.fit(X_train_scaled, y_train, epochs=n_epochs, batch_size=batch_size,
                    validation_data=(X_valid_scaled, y_valid), callbacks=[onecycle])

Output:

Epoch 1/25
430/430 [==============================] - 1s 3ms/step - loss: 0.6572 - accuracy: 0.7740 - val_loss: 0.4872 - val_accuracy: 0.8338
Epoch 2/25
430/430 [==============================] - 1s 3ms/step - loss: 0.4580 - accuracy: 0.8397 - val_loss: 0.4274 - val_accuracy: 0.8522
Epoch 3/25
430/430 [==============================] - 1s 3ms/step - loss: 0.4121 - accuracy: 0.8545 - val_loss: 0.4113 - val_accuracy: 0.8592
Epoch 4/25
430/430 [==============================] - 1s 3ms/step - loss: 0.3837 - accuracy: 0.8642 - val_loss: 0.3868 - val_accuracy: 0.8690
Epoch 5/25
430/430 [==============================] - 1s 3ms/step - loss: 0.3639 - accuracy: 0.8718 - val_loss: 0.3762 - val_accuracy: 0.8680
Epoch 6/25
430/430 [==============================] - 1s 3ms/step - loss: 0.3456 - accuracy: 0.8773 - val_loss: 0.3746 - val_accuracy: 0.8700
Epoch 7/25
430/430 [==============================] - 1s 3ms/step - loss: 0.3329 - accuracy: 0.8810 - val_loss: 0.3632 - val_accuracy: 0.8714
Epoch 8/25
430/430 [==============================] - 1s 3ms/step - loss: 0.3185 - accuracy: 0.8858 - val_loss: 0.3954 - val_accuracy: 0.8608
Epoch 9/25
430/430 [==============================] - 1s 3ms/step - loss: 0.3064 - accuracy: 0.8891 - val_loss: 0.3484 - val_accuracy: 0.8752
Epoch 10/25
430/430 [==============================] - 1s 3ms/step - loss: 0.2944 - accuracy: 0.8927 - val_loss: 0.3397 - val_accuracy: 0.8814
Epoch 11/25
430/430 [==============================] - 1s 3ms/step - loss: 0.2837 - accuracy: 0.8960 - val_loss: 0.3466 - val_accuracy: 0.8798
Epoch 12/25
430/430 [==============================] - 1s 3ms/step - loss: 0.2708 - accuracy: 0.9026 - val_loss: 0.3647 - val_accuracy: 0.8694
Epoch 13/25
430/430 [==============================] - 1s 3ms/step - loss: 0.2535 - accuracy: 0.9082 - val_loss: 0.3345 - val_accuracy: 0.8848
Epoch 14/25
430/430 [==============================] - 1s 3ms/step - loss: 0.2403 - accuracy: 0.9138 - val_loss: 0.3449 - val_accuracy: 0.8812
Epoch 15/25
430/430 [==============================] - 1s 3ms/step - loss: 0.2278 - accuracy: 0.9180 - val_loss: 0.3256 - val_accuracy: 0.8854
Epoch 16/25
430/430 [==============================] - 1s 3ms/step - loss: 0.2157 - accuracy: 0.9231 - val_loss: 0.3287 - val_accuracy: 0.8832
Epoch 17/25
430/430 [==============================] - 1s 3ms/step - loss: 0.2061 - accuracy: 0.9266 - val_loss: 0.3341 - val_accuracy: 0.8872
Epoch 18/25
430/430 [==============================] - 1s 3ms/step - loss: 0.1977 - accuracy: 0.9305 - val_loss: 0.3235 - val_accuracy: 0.8898
Epoch 19/25
430/430 [==============================] - 1s 3ms/step - loss: 0.1891 - accuracy: 0.9341 - val_loss: 0.3225 - val_accuracy: 0.8906
Epoch 20/25
430/430 [==============================] - 1s 3ms/step - loss: 0.1820 - accuracy: 0.9366 - val_loss: 0.3220 - val_accuracy: 0.8920
Epoch 21/25
430/430 [==============================] - 1s 3ms/step - loss: 0.1752 - accuracy: 0.9401 - val_loss: 0.3212 - val_accuracy: 0.8896
Epoch 22/25
430/430 [==============================] - 1s 3ms/step - loss: 0.1700 - accuracy: 0.9421 - val_loss: 0.3174 - val_accuracy: 0.8948
Epoch 23/25
430/430 [==============================] - 1s 3ms/step - loss: 0.1655 - accuracy: 0.9439 - val_loss: 0.3178 - val_accuracy: 0.8942
Epoch 24/25
430/430 [==============================] - 1s 3ms/step - loss: 0.1627 - accuracy: 0.9456 - val_loss: 0.3168 - val_accuracy: 0.8932
Epoch 25/25
430/430 [==============================] - 1s 3ms/step - loss: 0.1610 - accuracy: 0.9464 - val_loss: 0.3161 - val_accuracy: 0.8944

5. Avoiding Overfitting Through Regularization

Deep neural networks typically have tens of thousands of parameters, sometimes even millions. This gives them an enormous amount of freedom: they can fit a huge variety of complex datasets, but they are also prone to overfitting the training set, so regularization is needed.

Chapter 10 already introduced one of the best regularization techniques: early stopping. And although batch normalization (BN) was designed to address unstable gradients, it also acts as a fairly good regularizer. This section covers several other popular regularization techniques: L1 and L2 regularization, dropout, and max-norm regularization.

5.1 L1 and L2 Regularization

layer = tf.keras.layers.Dense(100, activation="elu",
                           kernel_initializer="he_normal",
                           kernel_regularizer=tf.keras.regularizers.l2(0.01))
# or l1(0.1) for ℓ1 regularization with a factor of 0.1
# or l1_l2(0.1, 0.01) for both ℓ1 and ℓ2 regularization, with factors 0.1 and 0.01 respectively
model = tf.keras.models.Sequential([
    tf.keras.layers.Flatten(input_shape=[28, 28]),
    tf.keras.layers.Dense(300, activation="elu",kernel_initializer="he_normal",
                          kernel_regularizer=tf.keras.regularizers.l2(0.01)),
    tf.keras.layers.Dense(100, activation="elu", kernel_initializer="he_normal",
                       kernel_regularizer=tf.keras.regularizers.l2(0.01)),
    tf.keras.layers.Dense(10, activation="softmax",
                          kernel_regularizer=tf.keras.regularizers.l2(0.01))
])

model.summary()

model.compile(loss="sparse_categorical_crossentropy", optimizer="nadam", metrics=["accuracy"])

n_epochs = 2
history = model.fit(X_train_scaled, y_train, epochs=n_epochs, validation_data=(X_valid_scaled, y_valid))

Output:

Model: "sequential_10"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
flatten_12 (Flatten)         (None, 784)               0         
_________________________________________________________________
dense_34 (Dense)             (None, 300)               235500    
_________________________________________________________________
dense_35 (Dense)             (None, 100)               30100     
_________________________________________________________________
dense_36 (Dense)             (None, 10)                1010      
=================================================================
Total params: 266,610
Trainable params: 266,610
Non-trainable params: 0
_________________________________________________________________

Epoch 1/2
1719/1719 [==============================] - 7s 4ms/step - loss: 1.6699 - accuracy: 0.8116 - val_loss: 0.7200 - val_accuracy: 0.8324
Epoch 2/2
1719/1719 [==============================] - 7s 4ms/step - loss: 0.7182 - accuracy: 0.8276 - val_loss: 0.6826 - val_accuracy: 0.8376

The l2() function returns a regularizer that computes the regularization loss at each training step and adds it to the final loss.
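
For instance (an illustrative snippet, not in the original text), the regularization losses created this way can be inspected through the model's losses attribute:

reg_losses = model.losses               # one scalar tensor per regularized layer
total_reg_loss = tf.add_n(reg_losses)   # total penalty added to the training loss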

To keep the code readable and avoid repeating the same arguments for every layer, you can use Python's functools.partial() function:

from functools import partial

RegularizedDense = partial(tf.keras.layers.Dense, activation="relu",kernel_initializer="he_normal",
                          kernel_regularizer=tf.keras.regularizers.l2(0.01))

model = tf.keras.models.Sequential([
    tf.keras.layers.Flatten(input_shape=[28,28]),
    RegularizedDense(300),
    RegularizedDense(100),
    RegularizedDense(10, activation="softmax")
])

model.summary()

model.compile(loss="sparse_categorical_crossentropy",optimizer="nadam",metrics=["accuracy"])

history = model.fit(X_train_scaled, y_train, epochs=2, validation_data=(X_valid_scaled, y_valid))

Output:

Model: "sequential_11"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
flatten_13 (Flatten)         (None, 784)               0         
_________________________________________________________________
dense_37 (Dense)             (None, 300)               235500    
_________________________________________________________________
dense_38 (Dense)             (None, 100)               30100     
_________________________________________________________________
dense_39 (Dense)             (None, 10)                1010      
=================================================================
Total params: 266,610
Trainable params: 266,610
Non-trainable params: 0
_________________________________________________________________

Epoch 1/2
1719/1719 [==============================] - 7s 4ms/step - loss: 1.5084 - accuracy: 0.8173 - val_loss: 0.7206 - val_accuracy: 0.8416
Epoch 2/2
1719/1719 [==============================] - 7s 4ms/step - loss: 0.7243 - accuracy: 0.8305 - val_loss: 0.6923 - val_accuracy: 0.8428

5.2 Dropout

Dropout: at every training step, each neuron has some probability of being temporarily "dropped out", i.e. ignored during that step (it may be active again at the next step). During testing and prediction, all neurons participate in the computation.

The dropout rate is typically set to 20–30% for recurrent neural networks and 40–50% for convolutional neural networks.

model = tf.keras.models.Sequential([
    tf.keras.layers.Flatten(input_shape=[28, 28]),
    tf.keras.layers.Dropout(rate=0.2),
    tf.keras.layers.Dense(300, activation="elu", kernel_initializer="he_normal"),
    tf.keras.layers.Dropout(rate=0.2),
    tf.keras.layers.Dense(100, activation="elu", kernel_initializer="he_normal"),
    tf.keras.layers.Dropout(rate=0.2),
    tf.keras.layers.Dense(10, activation="softmax")
])

model.compile(loss="sparse_categorical_crossentropy", optimizer="nadam", metrics=["accuracy"])

n_epochs = 2
history = model.fit(X_train_scaled, y_train, epochs=n_epochs,validation_data=(X_valid_scaled, y_valid))

Output:

Epoch 1/2
1719/1719 [==============================] - 7s 4ms/step - loss: 0.5669 - accuracy: 0.8050 - val_loss: 0.3605 - val_accuracy: 0.8646
Epoch 2/2
1719/1719 [==============================] - 7s 4ms/step - loss: 0.4206 - accuracy: 0.8457 - val_loss: 0.3431 - val_accuracy: 0.8694

In general, if the model is overfitting you can increase the dropout rate, and if it is underfitting you can decrease it. Dropout does slow down convergence, but it usually yields a better model in the end.

If the model uses the SELU activation function, you should use alpha dropout instead: a variant of dropout that preserves the mean and standard deviation of its inputs.

tf.random.set_seed(42)
np.random.seed(42)

model = tf.keras.models.Sequential([
    tf.keras.layers.Flatten(input_shape=[28, 28]),
    tf.keras.layers.AlphaDropout(rate=0.2),
    tf.keras.layers.Dense(300, activation="selu", kernel_initializer="lecun_normal"),
    tf.keras.layers.AlphaDropout(rate=0.2),
    tf.keras.layers.Dense(100, activation="selu", kernel_initializer="lecun_normal"),
    tf.keras.layers.AlphaDropout(rate=0.2),
    tf.keras.layers.Dense(10, activation="softmax")
])
optimizer = tf.keras.optimizers.SGD(lr=0.01, momentum=0.9, nesterov=True)
model.compile(loss="sparse_categorical_crossentropy", optimizer=optimizer, metrics=["accuracy"])
n_epochs = 20
history = model.fit(X_train_scaled, y_train, epochs=n_epochs, validation_data=(X_valid_scaled, y_valid))

Output:

Epoch 1/20
1719/1719 [==============================] - 5s 3ms/step - loss: 0.6654 - accuracy: 0.7595 - val_loss: 0.5929 - val_accuracy: 0.8408
Epoch 2/20
1719/1719 [==============================] - 5s 3ms/step - loss: 0.5605 - accuracy: 0.7934 - val_loss: 0.5595 - val_accuracy: 0.8392
Epoch 3/20
1719/1719 [==============================] - 5s 3ms/step - loss: 0.5284 - accuracy: 0.8049 - val_loss: 0.4848 - val_accuracy: 0.8590
Epoch 4/20
1719/1719 [==============================] - 5s 3ms/step - loss: 0.5070 - accuracy: 0.8125 - val_loss: 0.4618 - val_accuracy: 0.8598
Epoch 5/20
1719/1719 [==============================] - 5s 3ms/step - loss: 0.4918 - accuracy: 0.8181 - val_loss: 0.4688 - val_accuracy: 0.8572
Epoch 6/20
1719/1719 [==============================] - 5s 3ms/step - loss: 0.4864 - accuracy: 0.8189 - val_loss: 0.4704 - val_accuracy: 0.8596
Epoch 7/20
1719/1719 [==============================] - 5s 3ms/step - loss: 0.4722 - accuracy: 0.8237 - val_loss: 0.4696 - val_accuracy: 0.8696
Epoch 8/20
1719/1719 [==============================] - 5s 3ms/step - loss: 0.4635 - accuracy: 0.8283 - val_loss: 0.4709 - val_accuracy: 0.8604
Epoch 9/20
1719/1719 [==============================] - 5s 3ms/step - loss: 0.4581 - accuracy: 0.8308 - val_loss: 0.4123 - val_accuracy: 0.8738
Epoch 10/20
1719/1719 [==============================] - 5s 3ms/step - loss: 0.4529 - accuracy: 0.8325 - val_loss: 0.4713 - val_accuracy: 0.8658
Epoch 11/20
1719/1719 [==============================] - 5s 3ms/step - loss: 0.4473 - accuracy: 0.8336 - val_loss: 0.4112 - val_accuracy: 0.8688
Epoch 12/20
1719/1719 [==============================] - 5s 3ms/step - loss: 0.4465 - accuracy: 0.8341 - val_loss: 0.5552 - val_accuracy: 0.8486
Epoch 13/20
1719/1719 [==============================] - 5s 3ms/step - loss: 0.4410 - accuracy: 0.8371 - val_loss: 0.4546 - val_accuracy: 0.8666
Epoch 14/20
1719/1719 [==============================] - 5s 3ms/step - loss: 0.4317 - accuracy: 0.8388 - val_loss: 0.4423 - val_accuracy: 0.8728
Epoch 15/20
1719/1719 [==============================] - 5s 3ms/step - loss: 0.4321 - accuracy: 0.8371 - val_loss: 0.4319 - val_accuracy: 0.8722
Epoch 16/20
1719/1719 [==============================] - 5s 3ms/step - loss: 0.4308 - accuracy: 0.8411 - val_loss: 0.4295 - val_accuracy: 0.8738
Epoch 17/20
1719/1719 [==============================] - 5s 3ms/step - loss: 0.4256 - accuracy: 0.8417 - val_loss: 0.5517 - val_accuracy: 0.8566
Epoch 18/20
1719/1719 [==============================] - 5s 3ms/step - loss: 0.4263 - accuracy: 0.8420 - val_loss: 0.5069 - val_accuracy: 0.8690
Epoch 19/20
1719/1719 [==============================] - 5s 3ms/step - loss: 0.4212 - accuracy: 0.8442 - val_loss: 0.4775 - val_accuracy: 0.8764
Epoch 20/20
1719/1719 [==============================] - 5s 3ms/step - loss: 0.4237 - accuracy: 0.8431 - val_loss: 0.4106 - val_accuracy: 0.8750
model.evaluate(X_test_scaled, y_test)

Output:

313/313 [==============================] - 1s 2ms/step - loss: 0.4480 - accuracy: 0.8660

[0.4479816257953644, 0.8659999966621399]
model.evaluate(X_train_scaled, y_train)

Output:

1719/1719 [==============================] - 3s 2ms/step - loss: 0.3351 - accuracy: 0.8863

[0.33514055609703064, 0.8862727284431458]

5.3 Monte Carlo (MC) Dropout

tf.random.set_seed(42)
np.random.seed(42)

y_probas = np.stack([model(X_test_scaled, training=True) for sample in range(100)])

y_probas.shape

Output:

(100, 10000, 10)

Setting training=True keeps dropout active during these forward passes. Since a different set of neurons is dropped each time, every pass produces a slightly different prediction.

Each forward pass over the test set returns a matrix of shape (10000, 10), i.e. one row per instance and one column per class; stacking 100 such matrices gives an array of shape (100, 10000, 10).

y_proba = y_probas.mean(axis=0)
y_std = y_probas.std(axis=0)

print(y_proba.shape, y_std.shape)

Output:

(10000, 10) (10000, 10)

We then average over the 100 predictions (and compute their standard deviation). This Monte Carlo estimate is generally much more reliable than a single prediction made with dropout turned off.

Let's look at the model's single prediction for the first instance of the test set:

np.round(model.predict(X_test_scaled[:1]), 2)

Output:

array([[0.  , 0.  , 0.  , 0.  , 0.  , 0.05, 0.  , 0.01, 0.  , 0.94]],
      dtype=float32)

This single prediction suggests, with high confidence, that the instance belongs to class 9 (ankle boot).

Now let's look at the 100 Monte Carlo samples for the same instance:

np.round(y_probas[:, :1], 2)

Output:

array([[[0.  , 0.  , 0.  , 0.  , 0.  , 0.41, 0.  , 0.33, 0.  , 0.26]],

       [[0.  , 0.  , 0.  , 0.  , 0.  , 0.01, 0.  , 0.3 , 0.  , 0.68]],

       [[0.  , 0.  , 0.  , 0.  , 0.  , 0.62, 0.  , 0.01, 0.  , 0.38]],

       [[0.  , 0.  , 0.  , 0.  , 0.  , 0.11, 0.  , 0.38, 0.  , 0.51]],

       [[0.  , 0.  , 0.  , 0.  , 0.  , 0.28, 0.  , 0.45, 0.  , 0.27]],

       [[0.  , 0.  , 0.  , 0.  , 0.  , 0.01, 0.  , 0.27, 0.  , 0.72]],

       [[0.  , 0.  , 0.  , 0.  , 0.  , 0.38, 0.  , 0.23, 0.  , 0.39]],

       [[0.  , 0.  , 0.  , 0.  , 0.  , 0.  , 0.  , 0.26, 0.  , 0.74]],

       [[0.  , 0.  , 0.  , 0.  , 0.  , 0.61, 0.  , 0.04, 0.  , 0.35]],

       [[0.  , 0.  , 0.  , 0.  , 0.  , 0.05, 0.  , 0.02, 0.  , 0.93]],

       [[0.  , 0.  , 0.  , 0.  , 0.  , 0.02, 0.  , 0.21, 0.  , 0.77]],

       [[0.  , 0.  , 0.  , 0.  , 0.  , 0.95, 0.  , 0.01, 0.  , 0.04]],

       [[0.  , 0.  , 0.  , 0.  , 0.  , 0.89, 0.  , 0.06, 0.  , 0.04]],

       [[0.  , 0.  , 0.  , 0.  , 0.  , 0.42, 0.  , 0.28, 0.  , 0.29]],

       [[0.  , 0.  , 0.  , 0.  , 0.  , 0.09, 0.  , 0.06, 0.  , 0.85]],

       [[0.  , 0.  , 0.  , 0.  , 0.  , 0.74, 0.  , 0.03, 0.  , 0.24]],

       [[0.  , 0.  , 0.  , 0.  , 0.  , 0.54, 0.  , 0.01, 0.  , 0.46]],

       [[0.  , 0.  , 0.  , 0.  , 0.  , 0.04, 0.  , 0.06, 0.  , 0.91]],

       [[0.  , 0.  , 0.  , 0.  , 0.  , 0.01, 0.  , 0.23, 0.  , 0.77]],

       [[0.  , 0.  , 0.  , 0.  , 0.  , 0.22, 0.  , 0.03, 0.  , 0.75]],

       [[0.  , 0.  , 0.  , 0.  , 0.  , 0.2 , 0.  , 0.01, 0.  , 0.79]],

       [[0.  , 0.  , 0.  , 0.  , 0.  , 0.9 , 0.  , 0.04, 0.  , 0.06]],

       [[0.  , 0.  , 0.  , 0.  , 0.  , 0.02, 0.  , 0.02, 0.  , 0.96]],

       [[0.  , 0.  , 0.  , 0.  , 0.  , 0.03, 0.  , 0.32, 0.  , 0.66]],

       [[0.  , 0.  , 0.  , 0.  , 0.  , 0.12, 0.  , 0.03, 0.  , 0.85]],

       [[0.  , 0.  , 0.  , 0.  , 0.  , 0.52, 0.  , 0.43, 0.  , 0.05]],

       [[0.  , 0.  , 0.  , 0.  , 0.  , 0.59, 0.  , 0.11, 0.  , 0.31]],

       [[0.  , 0.  , 0.  , 0.  , 0.  , 0.27, 0.  , 0.13, 0.  , 0.6 ]],

       [[0.  , 0.  , 0.  , 0.  , 0.  , 0.74, 0.  , 0.02, 0.  , 0.23]],

       [[0.  , 0.  , 0.  , 0.  , 0.  , 0.39, 0.  , 0.45, 0.  , 0.16]],

       [[0.  , 0.  , 0.  , 0.  , 0.  , 0.51, 0.  , 0.07, 0.  , 0.42]],

       [[0.  , 0.  , 0.  , 0.  , 0.  , 0.63, 0.  , 0.03, 0.  , 0.34]],

       [[0.  , 0.  , 0.  , 0.  , 0.  , 0.88, 0.  , 0.04, 0.  , 0.08]],

       [[0.  , 0.  , 0.  , 0.  , 0.  , 0.77, 0.  , 0.22, 0.  , 0.01]],

       [[0.  , 0.  , 0.  , 0.  , 0.  , 0.38, 0.  , 0.09, 0.  , 0.53]],

       [[0.  , 0.  , 0.  , 0.  , 0.  , 0.14, 0.  , 0.79, 0.  , 0.07]],

       [[0.  , 0.  , 0.  , 0.  , 0.  , 0.18, 0.  , 0.03, 0.  , 0.79]],

       [[0.  , 0.  , 0.  , 0.  , 0.  , 0.82, 0.  , 0.13, 0.  , 0.05]],

       [[0.  , 0.  , 0.  , 0.  , 0.  , 0.25, 0.  , 0.45, 0.  , 0.29]],

       [[0.  , 0.  , 0.  , 0.  , 0.  , 0.93, 0.  , 0.  , 0.  , 0.06]],

       [[0.  , 0.  , 0.  , 0.  , 0.  , 0.34, 0.  , 0.11, 0.  , 0.56]],

       [[0.  , 0.  , 0.  , 0.  , 0.  , 0.12, 0.  , 0.02, 0.  , 0.87]],

       [[0.  , 0.  , 0.  , 0.  , 0.  , 0.71, 0.  , 0.19, 0.  , 0.1 ]],

       [[0.  , 0.  , 0.  , 0.  , 0.  , 0.01, 0.  , 0.08, 0.  , 0.91]],

       [[0.  , 0.  , 0.  , 0.  , 0.  , 0.07, 0.  , 0.06, 0.  , 0.87]],

       [[0.  , 0.  , 0.  , 0.  , 0.  , 0.01, 0.  , 0.27, 0.  , 0.72]],

       [[0.  , 0.  , 0.  , 0.  , 0.  , 0.21, 0.  , 0.22, 0.  , 0.57]],

       [[0.  , 0.  , 0.  , 0.  , 0.  , 0.06, 0.  , 0.12, 0.  , 0.82]],

       [[0.  , 0.  , 0.  , 0.  , 0.  , 0.02, 0.  , 0.25, 0.  , 0.73]],

       [[0.  , 0.  , 0.  , 0.  , 0.  , 0.11, 0.  , 0.19, 0.  , 0.7 ]],

       [[0.  , 0.  , 0.  , 0.  , 0.  , 0.36, 0.  , 0.22, 0.  , 0.42]],

       [[0.  , 0.  , 0.  , 0.  , 0.  , 0.47, 0.  , 0.16, 0.  , 0.37]],

       [[0.  , 0.  , 0.  , 0.  , 0.  , 0.11, 0.  , 0.02, 0.  , 0.87]],

       [[0.  , 0.  , 0.  , 0.  , 0.  , 0.76, 0.  , 0.09, 0.  , 0.15]],

       [[0.  , 0.  , 0.  , 0.  , 0.  , 0.11, 0.  , 0.  , 0.  , 0.89]],

       [[0.  , 0.  , 0.  , 0.  , 0.  , 0.01, 0.  , 0.02, 0.  , 0.97]],

       [[0.  , 0.  , 0.  , 0.  , 0.  , 0.02, 0.  , 0.39, 0.  , 0.59]],

       [[0.  , 0.  , 0.  , 0.  , 0.  , 0.24, 0.  , 0.24, 0.  , 0.52]],

       [[0.  , 0.  , 0.  , 0.  , 0.  , 0.1 , 0.  , 0.3 , 0.  , 0.6 ]],

       [[0.  , 0.  , 0.  , 0.  , 0.  , 0.21, 0.  , 0.09, 0.  , 0.7 ]],

       [[0.  , 0.  , 0.  , 0.  , 0.  , 0.61, 0.  , 0.03, 0.  , 0.36]],

       [[0.  , 0.  , 0.  , 0.  , 0.  , 0.16, 0.  , 0.28, 0.  , 0.56]],

       [[0.  , 0.  , 0.  , 0.  , 0.  , 0.54, 0.  , 0.16, 0.  , 0.3 ]],

       [[0.  , 0.  , 0.  , 0.  , 0.  , 0.06, 0.  , 0.03, 0.  , 0.91]],

       [[0.  , 0.  , 0.  , 0.  , 0.  , 0.52, 0.  , 0.05, 0.  , 0.44]],

       [[0.  , 0.  , 0.  , 0.  , 0.  , 0.57, 0.  , 0.05, 0.  , 0.37]],

       [[0.  , 0.  , 0.  , 0.  , 0.  , 0.28, 0.  , 0.02, 0.  , 0.7 ]],

       [[0.  , 0.  , 0.  , 0.  , 0.  , 0.03, 0.  , 0.01, 0.  , 0.96]],

       [[0.  , 0.  , 0.  , 0.  , 0.  , 0.14, 0.  , 0.11, 0.  , 0.75]],

       [[0.  , 0.  , 0.  , 0.  , 0.  , 0.19, 0.  , 0.25, 0.  , 0.56]],

       [[0.  , 0.  , 0.  , 0.  , 0.  , 0.35, 0.  , 0.17, 0.  , 0.48]],

       [[0.  , 0.  , 0.  , 0.  , 0.  , 0.21, 0.  , 0.04, 0.  , 0.75]],

       [[0.  , 0.  , 0.  , 0.  , 0.  , 0.44, 0.  , 0.15, 0.  , 0.42]],

       [[0.  , 0.  , 0.  , 0.  , 0.  , 0.  , 0.  , 0.08, 0.  , 0.91]],

       [[0.  , 0.  , 0.  , 0.  , 0.  , 0.21, 0.  , 0.11, 0.  , 0.68]],

       [[0.  , 0.  , 0.  , 0.  , 0.  , 0.13, 0.  , 0.03, 0.  , 0.84]],

       [[0.  , 0.  , 0.  , 0.  , 0.  , 0.19, 0.  , 0.11, 0.  , 0.7 ]],

       [[0.  , 0.  , 0.  , 0.  , 0.  , 0.4 , 0.  , 0.49, 0.  , 0.1 ]],

       [[0.  , 0.  , 0.  , 0.  , 0.  , 0.25, 0.  , 0.22, 0.  , 0.53]],

       [[0.  , 0.  , 0.  , 0.  , 0.  , 0.09, 0.  , 0.06, 0.  , 0.85]],

       [[0.  , 0.  , 0.  , 0.  , 0.  , 0.07, 0.  , 0.38, 0.  , 0.56]],

       [[0.  , 0.  , 0.  , 0.  , 0.  , 0.46, 0.  , 0.04, 0.  , 0.5 ]],

       [[0.  , 0.  , 0.  , 0.  , 0.  , 0.08, 0.  , 0.  , 0.  , 0.92]],

       [[0.  , 0.  , 0.  , 0.  , 0.  , 0.08, 0.  , 0.2 , 0.  , 0.72]],

       [[0.  , 0.  , 0.  , 0.  , 0.  , 0.07, 0.  , 0.24, 0.  , 0.69]],

       [[0.  , 0.  , 0.  , 0.  , 0.  , 0.26, 0.  , 0.06, 0.  , 0.68]],

       [[0.  , 0.  , 0.  , 0.  , 0.  , 1.  , 0.  , 0.  , 0.  , 0.  ]],

       [[0.  , 0.  , 0.  , 0.  , 0.  , 0.25, 0.  , 0.04, 0.  , 0.71]],

       [[0.  , 0.  , 0.  , 0.  , 0.  , 0.59, 0.  , 0.1 , 0.  , 0.3 ]],

       [[0.  , 0.  , 0.  , 0.  , 0.  , 0.1 , 0.  , 0.01, 0.  , 0.89]],

       [[0.  , 0.  , 0.  , 0.  , 0.  , 0.08, 0.  , 0.61, 0.  , 0.31]],

       [[0.  , 0.  , 0.  , 0.  , 0.  , 0.41, 0.  , 0.05, 0.  , 0.54]],

       [[0.  , 0.  , 0.  , 0.  , 0.  , 0.8 , 0.  , 0.18, 0.  , 0.02]],

       [[0.  , 0.  , 0.  , 0.  , 0.  , 0.77, 0.  , 0.11, 0.  , 0.13]],

       [[0.  , 0.  , 0.  , 0.  , 0.  , 0.79, 0.  , 0.06, 0.  , 0.15]],

       [[0.  , 0.  , 0.  , 0.  , 0.  , 0.42, 0.  , 0.35, 0.  , 0.23]],

       [[0.  , 0.  , 0.  , 0.  , 0.  , 0.11, 0.  , 0.13, 0.  , 0.76]],

       [[0.  , 0.  , 0.  , 0.  , 0.  , 0.29, 0.  , 0.12, 0.  , 0.59]],

       [[0.  , 0.  , 0.  , 0.  , 0.  , 0.99, 0.  , 0.01, 0.  , 0.01]],

       [[0.  , 0.  , 0.  , 0.  , 0.  , 0.69, 0.  , 0.08, 0.  , 0.22]]],
      dtype=float32)

These results need some interpretation: across the samples, the probability mass is spread mainly over classes 9, 5, and 7.

In Fashion MNIST, classes 9, 5, and 7 correspond to ankle boots, sandals, and sneakers respectively, so this spread of predictions is quite reasonable: they are all types of footwear.

Averaging these predictions gives the MC Dropout estimate:

np.round(y_proba[:1], 2)

Output:

array([[0.  , 0.  , 0.  , 0.  , 0.  , 0.33, 0.  , 0.15, 0.  , 0.52]],
      dtype=float32)

The model still considers class 9 the most likely, but now with only about 52% confidence rather than 94%, which is far more reasonable.

Let's also look at the standard deviation of the probability estimates:

y_std = y_probas.std(axis=0)
np.round(y_std[:1], 2)

Output:

array([[0.  , 0.  , 0.  , 0.  , 0.  , 0.29, 0.  , 0.15, 0.  , 0.29]],
      dtype=float32)
y_pred = np.argmax(y_proba, axis=1)
y_pred

Output:

array([9, 2, 1, ..., 8, 1, 5], dtype=int64)
accuracy = np.sum(y_pred == y_test) / len(y_test)
accuracy

Output:

0.865

Note: the number of Monte Carlo samples used for prediction is a hyperparameter that needs tuning. The higher it is, the more accurate the predictions and uncertainty estimates will generally be, but prediction time grows proportionally.
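
As a minimal sketch (a hypothetical helper, not part of the original notebook), the sampling logic above can be wrapped in a function that exposes this hyperparameter:

def mc_predict_proba(model, X, n_samples=100):
    # Run n_samples stochastic forward passes with dropout kept active,
    # then return the mean and standard deviation of the class probabilities.
    probas = np.stack([model(X, training=True) for _ in range(n_samples)])
    return probas.mean(axis=0), probas.std(axis=0)

y_proba, y_std = mc_predict_proba(model, X_test_scaled, n_samples=100)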

class MCDropout(tf.keras.layers.Dropout):
    def call(self, inputs):
        return super().call(inputs, training=True)

class MCAlphaDropout(tf.keras.layers.AlphaDropout):
    def call(self, inputs):
        return super().call(inputs, training=True)
    
tf.random.set_seed(42)
np.random.seed(42)

mc_model = tf.keras.models.Sequential([
    MCAlphaDropout(layer.rate) if isinstance(layer, tf.keras.layers.AlphaDropout) else layer
    for layer in model.layers
])

mc_model.summary()

Output:

Model: "sequential_14"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
flatten_15 (Flatten)         (None, 784)               0         
_________________________________________________________________
mc_alpha_dropout (MCAlphaDro (None, 784)               0         
_________________________________________________________________
dense_43 (Dense)             (None, 300)               235500    
_________________________________________________________________
mc_alpha_dropout_1 (MCAlphaD (None, 300)               0         
_________________________________________________________________
dense_44 (Dense)             (None, 100)               30100     
_________________________________________________________________
mc_alpha_dropout_2 (MCAlphaD (None, 100)               0         
_________________________________________________________________
dense_45 (Dense)             (None, 10)                1010      
=================================================================
Total params: 266,610
Trainable params: 266,610
Non-trainable params: 0
_________________________________________________________________
optimizer = tf.keras.optimizers.SGD(lr=0.01, momentum=0.9, nesterov=True)
mc_model.compile(loss="sparse_categorical_crossentropy", optimizer=optimizer, metrics=["accuracy"])

mc_model.set_weights(model.get_weights())

np.round(np.mean([mc_model.predict(X_test_scaled[:1]) for sample in range(100)], axis=0), 2)

Output:

array([[0.  , 0.  , 0.  , 0.  , 0.  , 0.32, 0.  , 0.21, 0.  , 0.47]],
      dtype=float32)

If you are training a model from scratch, you can simply use MCDropout wherever you would otherwise use Dropout, as shown in the sketch below.
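
For example, a model trained from scratch with MC Dropout could be defined like this (a minimal sketch that simply substitutes the MCDropout class defined above into the earlier dropout model):

mc_model_scratch = tf.keras.models.Sequential([
    tf.keras.layers.Flatten(input_shape=[28, 28]),
    MCDropout(rate=0.2),
    tf.keras.layers.Dense(300, activation="elu", kernel_initializer="he_normal"),
    MCDropout(rate=0.2),
    tf.keras.layers.Dense(100, activation="elu", kernel_initializer="he_normal"),
    MCDropout(rate=0.2),
    tf.keras.layers.Dense(10, activation="softmax")
])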

If you already have a trained model that contains Dropout layers, create a new model that is identical except that every Dropout layer is replaced by MCDropout, then copy the trained model's weights into the new model (as was done above for AlphaDropout).

In short, MC Dropout is an excellent technique that boosts dropout models and provides better uncertainty estimates.

5.4 Max-Norm Regularization

Max-norm regularization does not add a regularization term to the overall loss function. Instead, after each training step it computes \left \| W \right \|_2 and, if necessary, rescales the weights:

W \leftarrow W\frac{r}{\left \| W \right \|_2}

where r is the max-norm hyperparameter; reducing r increases the amount of regularization, which helps reduce the risk of overfitting.

Max-norm regularization can also help alleviate the unstable gradients problem.

In Keras it is implemented as follows:

layer = tf.keras.layers.Dense(100, activation="selu", kernel_initializer="lecun_normal",
                           kernel_constraint=tf.keras.constraints.max_norm(1.0))
MaxNormDense = partial(tf.keras.layers.Dense,
                       activation="selu", kernel_initializer="lecun_normal",
                       kernel_constraint=tf.keras.constraints.max_norm(1.))

model = tf.keras.models.Sequential([
    tf.keras.layers.Flatten(input_shape=[28, 28]),
    MaxNormDense(300),
    MaxNormDense(100),
    tf.keras.layers.Dense(10, activation="softmax")
])
model.compile(loss="sparse_categorical_crossentropy", optimizer="nadam", metrics=["accuracy"])
n_epochs = 2
history = model.fit(X_train_scaled, y_train, epochs=n_epochs,validation_data=(X_valid_scaled, y_valid))

Output:

Epoch 1/2
1719/1719 [==============================] - 7s 4ms/step - loss: 0.4738 - accuracy: 0.8333 - val_loss: 0.3692 - val_accuracy: 0.8654
Epoch 2/2
1719/1719 [==============================] - 7s 4ms/step - loss: 0.3555 - accuracy: 0.8715 - val_loss: 0.3808 - val_accuracy: 0.8636

The max_norm() constraint's axis argument defaults to 0, so for a Dense layer the constraint is applied independently to each neuron's vector of incoming weights.
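
For convolutional layers, the constraint would normally be applied per output filter by listing the kernel's spatial and input-channel axes (a hedged sketch based on the Keras documentation for max_norm):

conv = tf.keras.layers.Conv2D(64, kernel_size=3, activation="relu",
                              kernel_constraint=tf.keras.constraints.max_norm(1., axis=[0, 1, 2]))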

6. Summary and Practical Guidelines

In most cases, the following default configuration works well:

Hyperparameter | Default value
Kernel initializer | He initialization
Activation function | ELU
Normalization | None if shallow; Batch Norm if deep
Regularization | Early stopping (+ ℓ2 reg. if needed)
Optimizer | Momentum optimization (or RMSProp or Nadam)
Learning rate schedule | 1cycle
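
As a quick illustration (my own sketch, not from the original text), a Fashion MNIST classifier following this default configuration might look as follows; the 1cycle schedule from the table would be added separately, e.g. via the OneCycleScheduler callback defined earlier:

model = tf.keras.models.Sequential([
    tf.keras.layers.Flatten(input_shape=[28, 28]),
    tf.keras.layers.Dense(300, activation="elu", kernel_initializer="he_normal"),
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.Dense(100, activation="elu", kernel_initializer="he_normal"),
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.Dense(10, activation="softmax")
])
optimizer = tf.keras.optimizers.SGD(learning_rate=0.01, momentum=0.9, nesterov=True)
model.compile(loss="sparse_categorical_crossentropy", optimizer=optimizer,
              metrics=["accuracy"])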

If the network is a simple stack of dense (fully connected) layers, it can self-normalize, and the following configuration should be used instead:

Hyperparameter | Default value
Kernel initializer | LeCun initialization
Activation function | SELU
Normalization | None (self-normalization)
Regularization | Alpha dropout if needed
Optimizer | Momentum optimization (or RMSProp or Nadam)
Learning rate schedule | 1cycle

Most networks can be built following one of the two configurations above. For some special requirements, consider the following exceptions:

  1. If you need a sparse model, you can use L1 regularization. If you need an even sparser model, you can use the TensorFlow Model Optimization Toolkit. Note that this breaks self-normalization, so in that case you should stick with the default configuration.
  2. If you need a low-latency model (one that makes very fast predictions), use fewer layers, fold the batch normalization layers into the preceding layers, and prefer fast activation functions such as leaky ReLU or plain ReLU. Finally, reduce the float precision from 32 bits to 16 or even 8 bits.
  3. If you are building a risk-sensitive application, or if prediction latency is not critical, you can use MC Dropout to boost performance and obtain more reliable probability estimates along with uncertainty estimates.

 
