2021SC@SDUSC
深度学习可以归结为一个优化问题,最小化目标函数。最优化的求解过程,首先求解目标函数的梯度,然后将参数向负梯度方向更新,学习率表明梯度更新的步伐大小。最优化的过程依赖的算法称为优化器,可以看出深度学习优化器的两个核心是梯度与学习率,前者决定参数更新的方向后者决定参数更新程度。
PaddleOCR的方向分类器所使用的的优化器为Adam优化算法,它结合了AdaGrad和RMSProp两种优化算法的优点,对梯度的一阶矩估计(First Moment Estimation,即梯度的均值)和二阶矩估计(Second Moment Estimation,即梯度的未中心化的方差)进行综合考虑,计算出更新步长。Adam的公式如下图所示:
方向分类器的beta1的值为0.9,beta2的值为0.999,学习率为0.001,L2正则化。第一个公式计算历史梯度的一阶指数平滑值,用于得到带有动量的梯度值。第二个公式计算历史梯度平方的一阶指数平滑值,用于得到每个权重参数的学习率权重参数。第三个公式计算变量更新值,而且可知变量更新值正比于历史梯度的一阶指数平滑值,反比于历史梯度平方的一阶指数平滑值。历史梯度的一阶指数平滑就是使用历史梯度信息对当前梯度进行一个修正(消除变量更新时的震荡),从而得到一个稳定的梯度更新值。历史梯度平方的一阶指数平滑可以解决梯度稀疏的问题;频繁更新的梯度将会被赋予一个较小的学习率,而稀疏的梯度则会被赋予一个较大的学习率,通过上述机制,在数据分布稀疏的场景,能更好利用稀疏梯度的信息,比标准的SGD算法更有效地收敛。由于Adam加入了一阶动量和二阶动量,基于包含L2正则化项的梯度来计算一阶动量和二阶动量,使得参数的更新系数就会变化,与单纯的权重衰减就会变得不同。总结来说,Adam 优化器可以根据历史梯度的震荡情况和过滤震荡后的真实历史梯度对变量进行更新,L2正则化可以避免过拟合问题。
PaddleOCR的方向分类器的optimizer代码如下所示,位于PaddleOCR-release-2.2->ppocr->optimizer->optimizer.py。
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
from __future__ import unicode_literals
from paddle import optimizer as optim
class Adam(object):
def __init__(self,
learning_rate=0.001,
beta1=0.9,
beta2=0.999,
epsilon=1e-08,
parameter_list=None,
weight_decay=None,
grad_clip=None,
name=None,
lazy_mode=False,
**kwargs):
self.learning_rate = learning_rate
self.beta1 = beta1
self.beta2 = beta2
self.epsilon = epsilon
self.parameter_list = parameter_list
self.learning_rate = learning_rate
self.weight_decay = weight_decay
self.grad_clip = grad_clip
self.name = name
self.lazy_mode = lazy_mode
def __call__(self, parameters):
opt = optim.Adam(
learning_rate=self.learning_rate,
beta1=self.beta1,
beta2=self.beta2,
epsilon=self.epsilon,
weight_decay=self.weight_decay,
grad_clip=self.grad_clip,
name=self.name,
lazy_mode=self.lazy_mode,
parameters=parameters)
return opt
上述代码涉及的Adam算法主体在paddle包中的optimizer包的adam.py。其中的主要代码如下所示:
class Adam(Optimizer):
def _get_accumulator(self, name, param):
"""Utility function to fetch an accumulator for a parameter
Args:
name: name of the accumulator
param: parameter variable for which accumulator is to be fetched
Returns:
accumulator variable for the parameter
"""
if self._name is not None:
name = self._name + "_" + name
find_master = self._multi_precision and param.dtype == core.VarDesc.VarType.FP16
target_param = self._master_weights[
param.name] if find_master else param
target_name = target_param.name
if (name not in self._accumulators or
target_name not in self._accumulators[name]):
raise Exception("Accumulator {} does not exist for parameter {}".
format(name, target_name))
return self._accumulators[name][target_name]
def _append_optimize_op(self, block, param_and_grad):
assert isinstance(block, framework.Block)
moment1 = self._get_accumulator(self._moment1_acc_str,
param_and_grad[0])
moment2 = self._get_accumulator(self._moment2_acc_str,
param_and_grad[0])
beta1_pow_acc = self._get_accumulator(self._beta1_pow_acc_str,
param_and_grad[0])
beta2_pow_acc = self._get_accumulator(self._beta2_pow_acc_str,
param_and_grad[0])
find_master = self._multi_precision and param_and_grad[
0].dtype == core.VarDesc.VarType.FP16
master_weight = (self._master_weights[param_and_grad[0].name]
if find_master else None)
lr = self._create_param_lr(param_and_grad)
# create the adam optimize op
if framework.in_dygraph_mode():
_beta1 = self._beta1 if not isinstance(
self._beta1, Variable) else self._beta1.numpy().item(0)
_beta2 = self._beta2 if not isinstance(
self._beta2, Variable) else self._beta2.numpy().item(0)
_, _, _, _, _ = core.ops.adam(
param_and_grad[0], param_and_grad[1], lr, moment1, moment2,
beta1_pow_acc, beta2_pow_acc, param_and_grad[0], moment1,
moment2, beta1_pow_acc, beta2_pow_acc, 'epsilon', self._epsilon,
'lazy_mode', self._lazy_mode, 'min_row_size_to_use_multithread',
1000, 'beta1', _beta1, 'beta2', _beta2)
return None
inputs = {
"Param": [param_and_grad[0]],
"Grad": [param_and_grad[1]],
"LearningRate": [lr],
"Moment1": [moment1],
"Moment2": [moment2],
"Beta1Pow": [beta1_pow_acc],
"Beta2Pow": [beta2_pow_acc]
}
outputs = {
"ParamOut": [param_and_grad[0]],
"Moment1Out": [moment1],
"Moment2Out": [moment2],
"Beta1PowOut": [beta1_pow_acc],
"Beta2PowOut": [beta2_pow_acc],
}
attrs = {
"epsilon": self._epsilon,
"lazy_mode": self._lazy_mode,
"min_row_size_to_use_multithread": 1000,
"multi_precision": find_master
}
if isinstance(self._beta1, Variable):
inputs['Beta1Tensor'] = self._beta1
else:
attrs['beta1'] = self._beta1
if isinstance(self._beta2, Variable):
inputs['Beta2Tensor'] = self._beta2
else:
attrs['beta2'] = self._beta2
if find_master:
inputs["MasterParam"] = master_weight
outputs["MasterParamOut"] = master_weight
adam_op = block.append_op(
type=self.type,
inputs=inputs,
outputs=outputs,
attrs=attrs,
stop_gradient=True)
return adam_op