Introduction to Deep Learning Optimization
Model Optimization and Tuning
We usually tune a model in the hope of obtaining one that is "both fast and good", i.e., optimized for both efficiency and effectiveness.
Model optimization generally focuses on two aspects: the process (here, the model training process) and the outcome (the goal).
Optimizing the outcome
- Better accuracy
  - Higher evaluation metrics, such as F1
  - Avoiding high variance and high bias
- Lower cost
  - Smaller model size
  - Minimal latency
  - Lower CPU, memory, and disk requirements
- These two groups of goals often conflict, so trade-offs must be made
Optimizing the training process
- Training time
  - Fewer iterations and trials
- Avoiding training pitfalls (a mitigation sketch follows this list)
  - Vanishing gradients
  - Exploding gradients
  - Overfitting
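Each of these pitfalls has a standard mitigation in Keras. The sketch below is illustrative rather than a tuned recipe: gradient clipping (clipnorm) guards against exploding gradients, ReLU with He initialization helps against vanishing gradients, and Dropout counteracts overfitting. The layer sizes, the 4-feature input shape (matching Iris), and the clipnorm value are all assumptions for illustration.
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.layers import Dense, Dropout

#He initialization pairs well with ReLU and helps keep gradients from vanishing
model = keras.Sequential([
    Dense(32, activation="relu", kernel_initializer="he_normal", input_shape=(4,)),
    Dropout(0.2),  #randomly drops units during training to reduce overfitting
    Dense(3, activation="softmax"),
])

#clipnorm caps the gradient norm, a standard guard against exploding gradients
optimizer = keras.optimizers.RMSprop(learning_rate=0.001, clipnorm=1.0)
model.compile(loss="categorical_crossentropy", optimizer=optimizer,
              metrics=["accuracy"])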
The Tuning Process
Preparation
- Set clear goals
  - Accuracy vs. efficiency
- Select and prepare the training data
  - Cover as many categories as possible
- Plan for model testing (see the split sketch after this list)
  - Multiple test cases
  - Simulate production usage scenarios
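One concrete way to plan for testing is to hold out a final test set before any tuning, and tune only against a validation split. A minimal sketch with scikit-learn; the split fractions and random_state are arbitrary choices for illustration:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

#Hold out 20% as a final test set; stratify keeps class proportions intact
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

#Split the remainder into train and validation sets for the tuning experiments
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.25, stratify=y_trainval, random_state=42)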
Tuning Levers
Note: these are the tools for tuning, i.e., the aspects of the model that can be adjusted.
- Network architecture
  - Layers, nodes, weights
  - Activation functions
- Training parameters
  - Epochs and batches
  - Normalization and regularization. For details, see 概念归一化、标准化和正则化的区别与联系 - 知乎 (zhihu.com)
  - Optimizer
Tuning Best Practices
- Pick one lever or hyperparameter at a time
- Use your understanding and experience to choose a set of candidate values. For the activation function, for example, ReLU is the usual first choice.
- Run the experiments against the same sampled data and environment, and log the results
- Compare the results and pick the best option/value
- Combine multiple levers and select the best overall scheme
- Validate the scheme's stability on several independent datasets
Caveats
- No one size fits all
  - The best hyperparameter values depend on your specific problem and input data
- Experiment, experiment, experiment
  - The first experiment identifies the best scheme
  - The second experiment validates the scheme's stability
  - The third experiment applies new inputs to verify how broadly the scheme applies
Common Functions for the Program
Below we walk through basic model tuning and optimization with a concrete program. Before that, we define the common functions used by all of the experiments. The data comes from the classic Iris dataset. The code is as follows:
# Import packages
import numpy as np
import tensorflow as tf
from tensorflow import keras
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_iris
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, BatchNormalization, Dropout
import matplotlib.pyplot as plt
#-------------------------------------------------------------------------
#Function to convert flower names to numeric values (handy when labels
#arrive as strings, e.g. from a CSV; load_iris below already returns
#numeric targets, so this helper is not used in the experiments)
#-------------------------------------------------------------------------
def type_to_numeric(x):
    if x == 'setosa':
        return 0
    elif x == 'versicolor':
        return 1
    else:
        return 2
#-------------------------------------------------------------------------
#Function to read and preprocess the data, getting it ready for deep learning
#-------------------------------------------------------------------------
def get_data():
    iris_data = load_iris()
    #load_iris already provides numeric class labels, so no label encoding is needed
    X_data = iris_data.data
    Y_data = iris_data.target

    #Create a scaler that is fit on the input data, then standardize the features
    scaler = StandardScaler().fit(X_data)
    X_data = scaler.transform(X_data)
    #print(X_data[:5])

    #Convert the target variable to a one-hot encoded array
    Y_data = tf.keras.utils.to_categorical(Y_data, 3)
    #print(Y_data[:5])

    return X_data, Y_data
#-------------------------------------------------------------------------
#Function to create the default configuration for the model. Individual
#settings will be overridden as required during experimentation
#-------------------------------------------------------------------------
def base_model_config():
    model_config = {
        "HIDDEN_NODES": [32, 64],
        "HIDDEN_ACTIVATION": "relu",
        "OUTPUT_NODES": 3,
        "OUTPUT_ACTIVATION": "softmax",
        "WEIGHTS_INITIALIZER": "random_normal",
        "BIAS_INITIALIZER": "zeros",
        "NORMALIZATION": "none",
        "OPTIMIZER": "rmsprop",
        "LEARNING_RATE": 0.001,
        "REGULARIZER": None,
        "DROPOUT_RATE": 0.0,
        "EPOCHS": 10,
        "BATCH_SIZE": 16,
        "VALIDATION_SPLIT": 0.2,
        "VERBOSE": 0,
        "LOSS_FUNCTION": "categorical_crossentropy",
        "METRICS": ["accuracy"]
    }
    return model_config
#-------------------------------------------------------------------------
#Function to create an optimizer based on the optimizer name and learning rate
#-------------------------------------------------------------------------
def get_optimizer(optimizer_name, learning_rate):
    #Supported names: 'sgd', 'rmsprop', 'adam', 'adagrad'
    if optimizer_name == 'adagrad':
        optimizer = keras.optimizers.Adagrad(learning_rate=learning_rate)
    elif optimizer_name == 'rmsprop':
        optimizer = keras.optimizers.RMSprop(learning_rate=learning_rate)
    elif optimizer_name == 'adam':
        optimizer = keras.optimizers.Adam(learning_rate=learning_rate)
    else:
        optimizer = keras.optimizers.SGD(learning_rate=learning_rate)
    return optimizer
#-------------------------------------------------------------------------
#Function to create a model and fit it to the data
#-------------------------------------------------------------------------
def create_and_run_model(model_config, X, Y, model_name):
    model = Sequential(name=model_name)

    #Build the hidden layers from the HIDDEN_NODES list
    for layer in range(len(model_config["HIDDEN_NODES"])):
        if layer == 0:
            #The first hidden layer also defines the input shape
            model.add(Dense(model_config["HIDDEN_NODES"][layer],
                            input_shape=(X.shape[1],),
                            name="Dense-Layer-" + str(layer),
                            kernel_initializer=model_config["WEIGHTS_INITIALIZER"],
                            bias_initializer=model_config["BIAS_INITIALIZER"],
                            kernel_regularizer=model_config["REGULARIZER"],
                            activation=model_config["HIDDEN_ACTIVATION"]))
        else:
            #Optionally insert batch normalization and dropout between hidden layers
            if model_config["NORMALIZATION"] == "batch":
                model.add(BatchNormalization())
            if model_config["DROPOUT_RATE"] > 0.0:
                model.add(Dropout(model_config["DROPOUT_RATE"]))
            model.add(Dense(model_config["HIDDEN_NODES"][layer],
                            name="Dense-Layer-" + str(layer),
                            kernel_initializer=model_config["WEIGHTS_INITIALIZER"],
                            bias_initializer=model_config["BIAS_INITIALIZER"],
                            kernel_regularizer=model_config["REGULARIZER"],
                            activation=model_config["HIDDEN_ACTIVATION"]))

    #Output layer
    model.add(Dense(model_config["OUTPUT_NODES"],
                    name="Output-Layer",
                    activation=model_config["OUTPUT_ACTIVATION"]))

    optimizer = get_optimizer(model_config["OPTIMIZER"],
                              model_config["LEARNING_RATE"])
    model.compile(loss=model_config["LOSS_FUNCTION"],
                  optimizer=optimizer,
                  metrics=model_config["METRICS"])

    print("\n*******************************************************")
    model.summary()

    #Stratified split keeps the class proportions equal in train and validation sets
    X_train, X_val, Y_train, Y_val = train_test_split(
        X, Y,
        stratify=Y,
        test_size=model_config["VALIDATION_SPLIT"])

    history = model.fit(X_train,
                        Y_train,
                        batch_size=model_config["BATCH_SIZE"],
                        epochs=model_config["EPOCHS"],
                        verbose=model_config["VERBOSE"],
                        validation_data=(X_val, Y_val))
    return history
#-------------------------------------------------------------------------
#Function to plot the accuracy curves collected during the experiments
#-------------------------------------------------------------------------
def plot_graph(accuracy_measures, title):
    plt.figure(figsize=(15, 10))
    for experiment in accuracy_measures.keys():
        plt.plot(accuracy_measures[experiment],
                 label=experiment,
                 linewidth=3)
    plt.title(title)
    plt.xlabel("Epochs")
    plt.ylabel("Accuracy")
    plt.legend()
    plt.show()
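With the common functions defined, a quick smoke test with the unchanged defaults looks like the following; it should print a model summary and plot a single baseline accuracy curve. The model name "Baseline" is just an illustrative label.
#Quick smoke test of the common functions using the default configuration
X, Y = get_data()
model_config = base_model_config()
history = create_and_run_model(model_config, X, Y, "Baseline")
plot_graph({"Baseline": history.history["accuracy"]}, "Baseline Accuracy")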
Tuning the Deep Learning Network
Tuning Epochs and Batch Size
Batch size
The batch size is the number of samples passed to the model in a single step. A larger batch size means
- Better GPU utilization
- Fewer training iterations
- Possibly unstable training
In general, choose an appropriate batch size; 32 is a commonly recommended value.
Epochs
The number of epochs is how many times the entire training set is passed through the model, i.e., how many times the training data is reused. Returns diminish as the epoch count grows, and too many epochs can make the model unstable.
Recommendation: pick the earliest epoch count at which accuracy stabilizes.
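Rather than eyeballing the curve, Keras can stop training automatically once validation accuracy stops improving. A minimal sketch; the patience value is an illustrative choice, and note that create_and_run_model above would need an extra parameter to pass callbacks through to fit():
from tensorflow import keras

#Stop when validation accuracy has not improved for 5 consecutive epochs,
#restoring the weights from the best epoch seen
early_stop = keras.callbacks.EarlyStopping(
    monitor="val_accuracy", patience=5, restore_best_weights=True)

#Usage inside fit(), e.g.:
#model.fit(X_train, Y_train, epochs=100, callbacks=[early_stop], ...)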
Experiment code
#Initialize the measures
accuracy_measures = {}
for batch_size in range(16, 128, 16):
    #Load default configuration
    model_config = base_model_config()
    #Acquire and process input data
    X, Y = get_data()
    #Set the number of epochs
    model_config["EPOCHS"] = 20
    #Set the batch size
    model_config["BATCH_SIZE"] = batch_size
    model_name = "Batch-Size-" + str(batch_size)
    history = create_and_run_model(model_config, X, Y, model_name)
    accuracy_measures[model_name] = history.history["accuracy"]

#Plot
plot_graph(accuracy_measures, "Compare Batch Size and Epoch")
The resulting plot is shown below.
The following patterns can be observed:
- As epochs increase, accuracy trends upward and gradually levels off
- The smaller the batch size, the higher the initial accuracy
Also, because the Iris dataset contains relatively few samples, the results vary from run to run.
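One way to reduce this run-to-run noise is to repeat each experiment several times and average the accuracy curves. A sketch under that assumption; the helper name and the default of 3 repeats are illustrative:
#Average the accuracy curve over several runs to smooth out run-to-run noise
def averaged_accuracy(model_config, X, Y, model_name, repeats=3):
    runs = []
    for i in range(repeats):
        history = create_and_run_model(model_config, X, Y,
                                       model_name + "-run" + str(i))
        runs.append(history.history["accuracy"])
    return np.mean(np.array(runs), axis=0)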
Tuning Hidden Layers
As the number of layers increases
- The model may learn more complex relationships
- Training and inference cost more, including in time
- The risk of overfitting grows
Recommendations
- For simple problems, 2 hidden layers are generally enough
- Start with a small number and add layers gradually based on experiment results
Experiment code
#Initialize the measures
accuracy_measures = {}
layer_list = []
for layer_count in range(1, 6):
    #Add another hidden layer of 32 nodes on each iteration
    layer_list.append(32)
    #Load default configuration
    model_config = base_model_config()
    #Acquire and process input data
    X, Y = get_data()
    #"HIDDEN_NODES" lists the node count of each hidden layer
    model_config["HIDDEN_NODES"] = layer_list
    model_name = "Layer-" + str(layer_count)
    history = create_and_run_model(model_config, X, Y, model_name)
    accuracy_measures[model_name] = history.history["accuracy"]

#Plot
plot_graph(accuracy_measures, "Compare Layers")
The results of the run are shown below.
As you can see, adding more layers does not necessarily improve the results.
Tuning the Number of Nodes
As the number of nodes increases
- The model may learn more complex relationships
- Training and inference cost more, including in time
- The risk of overfitting grows
Recommendations
- Stay between the node counts of the input and output layers
- Start with a small value (e.g., 32) and increase gradually based on experiments
Note: the number of nodes in the input layer equals the number of features per sample; it is unrelated to the number of samples.
Experiment code
Only the code is given below; readers can run it themselves to observe the effect.
#Initialize the measures
accuracy_measures = {}
for node_count in range(8, 40, 8):
    #Use a fixed depth of 2 hidden layers
    layer_list = []
    for layer_count in range(2):
        layer_list.append(node_count)
    #Load default configuration
    model_config = base_model_config()
    #Acquire and process input data
    X, Y = get_data()
    #"HIDDEN_NODES" lists the node count of each hidden layer
    model_config["HIDDEN_NODES"] = layer_list
    model_name = "Nodes-" + str(node_count)
    history = create_and_run_model(model_config, X, Y, model_name)
    accuracy_measures[model_name] = history.history["accuracy"]

#Plot
plot_graph(accuracy_measures, "Compare Nodes")
Choosing Activation Functions
Generally, the hidden layers and the output layer use different activation functions.
For the hidden layers, the activation function
- Depends on the problem itself and on the network architecture
- Mainly affects gradient descent
- Recommendations:
  - For ANNs and CNNs, prefer ReLU; for RNNs, prefer sigmoid or tanh
  - Experimentation is still essential
For the output layer, the activation function
- Depends on the type of problem
  - Binary classification: sigmoid
  - Multiclass classification: softmax
  - Regression: linear
Each output activation pairs with a matching loss function, as sketched below.
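A minimal sketch of the three pairings; the layer sizes are illustrative:
from tensorflow.keras.layers import Dense

#Binary classification: a single sigmoid unit, paired with binary cross-entropy
binary_output = Dense(1, activation="sigmoid")      #loss="binary_crossentropy"

#Multiclass classification: one softmax unit per class, with categorical cross-entropy
multiclass_output = Dense(3, activation="softmax")  #loss="categorical_crossentropy"

#Regression: a linear unit, typically with a mean-squared-error loss
regression_output = Dense(1, activation="linear")   #loss="mse"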
Experiment code
#Initialize the measures
accuracy_measures = {}
activation_list = ['relu', 'sigmoid', 'tanh']
for activation in activation_list:
    #Load default configuration
    model_config = base_model_config()
    #Acquire and process input data
    X, Y = get_data()
    #Set the hidden-layer activation function
    model_config["HIDDEN_ACTIVATION"] = activation
    model_name = "Model-" + activation
    history = create_and_run_model(model_config, X, Y, model_name)
    accuracy_measures[model_name] = history.history["accuracy"]

#Plot
plot_graph(accuracy_measures, "Compare Activation Functions")
Weight Initialization
The following initialization techniques are available:
- Random normal: sample from a standard normal distribution
- Zeros: set all parameters to zero
- Ones: set all parameters to one
- Random uniform: sample from a uniform distribution
The difference between random normal and random uniform: values drawn from a standard normal distribution cluster around the mean, while values drawn from a uniform distribution are spread evenly across the range. Random normal is the recommended default, as it performs best in most cases; the others, including custom initializers, can of course also be used.
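The difference between the two random schemes is easy to observe by sampling from the initializers directly. A small sketch; the sample size and the parameter values (Keras defaults) are illustrative:
import tensorflow as tf

#Draw 10,000 values from each initializer and compare their spread
normal_init = tf.keras.initializers.RandomNormal(mean=0.0, stddev=0.05)
uniform_init = tf.keras.initializers.RandomUniform(minval=-0.05, maxval=0.05)

normal_sample = normal_init(shape=(10000,)).numpy()
uniform_sample = uniform_init(shape=(10000,)).numpy()

#Normal values cluster near the mean; uniform values spread evenly over the range
print("normal  std: %.4f" % normal_sample.std())
print("uniform std: %.4f" % uniform_sample.std())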
Experiment code
#Initialize the measures
accuracy_measures = {}
initializer_list = ['random_normal', 'zeros', 'ones', 'random_uniform']
for initializer in initializer_list:
    #Load default configuration
    model_config = base_model_config()
    #Acquire and process input data
    X, Y = get_data()
    #Set the weights initializer for the hidden layers
    model_config["WEIGHTS_INITIALIZER"] = initializer
    model_name = "Model-" + initializer
    history = create_and_run_model(model_config, X, Y, model_name)
    accuracy_measures[model_name] = history.history["accuracy"]

#Plot
plot_graph(accuracy_measures, "Compare Weights Initialization")