# tensorflow Optimizer Algorithms

## tensorflow optimizers

```
tf.train.Optimizer
tf.train.GradientDescentOptimizer
tf.train.MomentumOptimizer
tf.train.AdagradOptimizer
tf.train.AdadeltaOptimizer
tf.train.AdamOptimizer
tf.train.FtrlOptimizer
tf.train.RMSPropOptimizer
```

### Gradient descent

For a hypothesis $h_\theta(x)=\theta_1x$ with parameter $\theta_1$, the loss function is:
$J(\theta_1)=\frac{1}{2m}\sum_{i=1}^m(h_\theta(x^{(i)})-y^{(i)})^2$
and $\theta_1$ is obtained by solving:
$\operatorname{minimize}_{\theta_1}\,J(\theta_1)$
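As a quick numeric check of the loss above, a minimal sketch in plain Python; the data points (sampled from $y=2x$, so the minimizer is $\theta_1=2$) are made up for illustration:

```python
# Evaluate J(theta_1) for the one-parameter model h_theta(x) = theta_1 * x.

def loss(theta1, xs, ys):
    """J(theta_1) = 1/(2m) * sum((theta_1 * x_i - y_i)^2)."""
    m = len(xs)
    return sum((theta1 * x - y) ** 2 for x, y in zip(xs, ys)) / (2 * m)

xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]       # y = 2x, so the loss is minimized at theta_1 = 2

print(loss(2.0, xs, ys))   # exactly 0.0 at the minimizer
print(loss(1.0, xs, ys))   # positive anywhere else
```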

### How gradient descent works

$\theta_{j}=\theta_j-\alpha \frac{\partial}{\partial \theta_j}J(\theta_0,\theta_1)$
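The update rule can be run directly on the one-parameter model above; a minimal sketch, where the data and the learning rate $\alpha=0.1$ are illustrative choices:

```python
# Iterate theta_j <- theta_j - alpha * dJ/dtheta_j until convergence.

def grad(theta1, xs, ys):
    # dJ/dtheta_1 = 1/m * sum((theta_1 * x_i - y_i) * x_i)
    m = len(xs)
    return sum((theta1 * x - y) * x for x, y in zip(xs, ys)) / m

xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]            # minimizer is theta_1 = 2
theta1, alpha = 0.0, 0.1
for _ in range(100):
    theta1 -= alpha * grad(theta1, xs, ys)
print(theta1)                    # converges toward 2.0
```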

### Gradient descent variants

#### Batch gradient descent

$\theta_j=\theta_j-\eta \cdot \nabla_{\theta}J(\theta)$

The linear model and its loss are:
$h(\theta)=\sum_{j=0}^n\theta_jx_j$
$J(\theta)=\frac{1}{2m}\sum_{i=1}^m(y^i-h_\theta(x^i))^2$
The update rule follows from the gradient of the loss:
$\frac{\partial J(\theta)}{\partial \theta_j}=-\frac{1}{m}\sum_{i=1}^m(y^i-h_\theta(x^i))x_j^i$
$\theta_j=\theta_j+\frac{1}{m}\sum_{i=1}^{m}(y^i-h_\theta(x^i))x_j^i$

```python
for i in range(nb_epochs):
    # evaluate_gradient is pseudocode for computing the full-batch gradient
    params_grad = evaluate_gradient(loss_function, data, params)
    params = params - learning_rate * params_grad
```

#### Stochastic gradient descent

$\theta_j=\theta_j-\eta\cdot \nabla_{\theta}J(\theta;x^{(i)};y^{(i)})$

$J(\theta)=\frac{1}{2m}\sum_{i=1}^m(y^i-h_\theta(x^i))^2$

$\theta_j=\theta_j+(y^i-h_\theta(x^i))x_j^i$, solved for each $j$

```python
for i in range(nb_epochs):
    np.random.shuffle(data)
    for example in data:
        # one update per training example
        params_grad = evaluate_gradient(loss_function, example, params)
        params = params - learning_rate * params_grad
```

#### Mini-batch gradient descent

$\theta_j=\theta_j-\eta\cdot \nabla_{\theta}J(\theta;x^{(i:i+n)};y^{(i:i+n)})$

```python
for i in range(nb_epochs):
    np.random.shuffle(data)
    for batch in get_batches(data, batch_size=50):
        # one update per mini-batch of 50 examples
        params_grad = evaluate_gradient(loss_function, batch, params)
        params = params - learning_rate * params_grad
```

### Problems

1. A suitable learning rate $\alpha$ is hard to choose: too large a value causes oscillation and prevents reaching the optimum, while too small a value makes training painfully slow.
2. A single learning rate applies to all parameters. When the data is sparse and features occur with different frequencies, it is preferable to use a different learning rate per feature, with larger rates for rarely occurring features.
3. The threshold on the objective function must be defined in advance, and training stops once the loss falls below it; the order in which training data is presented affects this, so the data is usually shuffled to reduce the effect.
4. Minimizing high-dimensional, non-convex error functions is technically hard.
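Problem 1 can be seen on a toy example; a minimal sketch, where the loss $J(\theta)=\theta^2$ (gradient $2\theta$) and both learning rates are arbitrary choices:

```python
# With a small learning rate the iterates contract toward the optimum;
# with a large one each step overshoots and |theta| grows without bound.

def run_gd(alpha, steps=20):
    theta = 1.0
    for _ in range(steps):
        theta -= alpha * 2 * theta   # theta <- theta - alpha * dJ/dtheta
    return theta

print(abs(run_gd(0.1)))   # shrinks toward the optimum at 0
print(abs(run_gd(1.5)))   # diverges: each step flips sign and doubles |theta|
```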

The vanilla update that the adaptive methods below refine is:
$\theta_{t+1}=\theta_t-\alpha\frac{dJ(\theta)}{d\theta_t}$

## tf.train.AdadeltaOptimizer

Adadelta keeps a decaying average of past squared gradients:
$E[g^2]_t=\rho E[g^2]_{t-1}+(1-\rho)g_t^2$

$RMS[g]_t=\sqrt{E[g^2]_t+\epsilon}$

The update rescales the gradient by the ratio of the RMS of past updates to the RMS of past gradients:
$\Delta \theta_t=-\frac{RMS[\Delta \theta]_{t-1}}{RMS[g]_t}g_t$
$\theta_{t+1}=\theta_t+\Delta \theta_t$

1. Initialize the accumulation variables $E[g^2]_0=0$ and $E[\Delta\theta^2]_0=0$
2. for $t=1:T$ do (loop over the number of updates)
3. compute the gradient $g_t$
4. accumulate the gradient average: $E[g^2]_t=\rho E[g^2]_{t-1}+(1-\rho)g_t^2$
5. compute the parameter update: $\Delta \theta_t = - \frac{RMS[\Delta \theta]_{t-1}}{RMS[g]_t}g_t$
6. accumulate the update average: $E[\Delta \theta^2]_t=\rho E[\Delta \theta^2]_{t-1}+(1-\rho)\Delta \theta_t^2$
7. apply the update: $\theta_{t+1}=\theta_t+\Delta \theta_t$
8. end for
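The algorithm above can be transcribed directly into NumPy; a sketch, where the quadratic test loss $J(\theta)=\theta^2$ and the values of $\rho$ and $\epsilon$ are illustrative assumptions:

```python
import numpy as np

def adadelta(grad_fn, theta0, rho=0.95, eps=1e-6, steps=500):
    theta = float(theta0)
    Eg2, Edt2 = 0.0, 0.0                    # step 1: E[g^2]_0 = E[dtheta^2]_0 = 0
    for _ in range(steps):                  # step 2: loop over updates
        g = grad_fn(theta)                  # step 3: compute gradient g_t
        Eg2 = rho * Eg2 + (1 - rho) * g**2  # step 4: accumulate E[g^2]_t
        dt = -np.sqrt(Edt2 + eps) / np.sqrt(Eg2 + eps) * g   # step 5: update
        Edt2 = rho * Edt2 + (1 - rho) * dt**2                # step 6: accumulate
        theta += dt                         # step 7: apply the update
    return theta

theta = adadelta(lambda t: 2 * t, 1.0)   # minimize J(theta) = theta^2
print(theta)   # has moved from 1.0 toward the minimizer at 0
```

Note that no learning rate appears: the step size is built from the two RMS accumulators, which is the point of the method.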

## tf.train.AdagradOptimizer

Adagrad scales the learning rate by the accumulated sum of squared gradients:
$\Delta \theta_t=-\frac{\eta}{\sqrt{\sum_{\tau=1}^tg_{\tau}^2+ \epsilon}} \odot g_t$
$\theta_{t+1}=\theta_t-\frac{\eta}{\sqrt{\sum_{\tau=1}^tg_{\tau}^2+\epsilon}} \odot g_t$
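A NumPy sketch of this update, showing the per-parameter effect on a loss whose coordinates have very different curvatures; $\eta$ and the toy loss are illustrative assumptions:

```python
import numpy as np

def adagrad(grad_fn, theta0, eta=0.5, eps=1e-8, steps=200):
    theta = np.asarray(theta0, dtype=float)
    acc = np.zeros_like(theta)                   # running sum of g_tau^2
    for _ in range(steps):
        g = grad_fn(theta)
        acc += g**2                              # accumulate squared gradients
        theta = theta - eta / np.sqrt(acc + eps) * g
    return theta

# Minimize J(theta) = theta_0^2 + 10 * theta_1^2: each coordinate gets its
# own effective step size from its own accumulator.
theta = adagrad(lambda t: np.array([2 * t[0], 20 * t[1]]), [1.0, 1.0])
print(theta)   # both coordinates end near 0
```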

## tf.train.MomentumOptimizer

Momentum accumulates a velocity $v_t$ from past gradients and steps along it:
$v_t=\rho v_{t-1}+\eta\frac{dJ(\theta)}{d\theta_t}$
$\theta_{t+1}=\theta_t-v_t$
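A minimal sketch of the momentum update in plain Python; the momentum $\rho$, learning rate $\eta$, and the toy quadratic loss are illustrative choices:

```python
def momentum_gd(grad_fn, theta0, rho=0.9, eta=0.05, steps=100):
    theta, v = float(theta0), 0.0
    for _ in range(steps):
        v = rho * v + eta * grad_fn(theta)   # velocity: decaying sum of gradients
        theta -= v                           # step along the velocity
    return theta

theta = momentum_gd(lambda t: 2 * t, 1.0)    # minimize J(theta) = theta^2
print(theta)   # oscillates past the minimum but decays toward 0
```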

## tf.train.AdamOptimizer

Adam maintains exponential moving averages of the gradient and of its square:
$m_t=\beta_1m_{t-1}+(1-\beta_1)g_t$
$v_t=\beta_2v_{t-1}+(1-\beta_2)g_t^2$

Both estimates are bias-corrected before the update:
$\hat m_t=\frac{m_t}{1-\beta_1^t}$
$\hat v_t=\frac{v_t}{1-\beta_2^t}$
$\theta_{t+1}=\theta_t-\frac {\eta}{\sqrt{\hat v_t + \epsilon}}\hat m_t$
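A NumPy sketch of the Adam update including the bias corrections; the defaults $\beta_1=0.9$, $\beta_2=0.999$ follow common practice, while $\eta$ and the toy loss are illustrative assumptions:

```python
import numpy as np

def adam(grad_fn, theta0, eta=0.1, beta1=0.9, beta2=0.999, eps=1e-8, steps=300):
    theta, m, v = float(theta0), 0.0, 0.0
    for t in range(1, steps + 1):
        g = grad_fn(theta)
        m = beta1 * m + (1 - beta1) * g       # first-moment estimate m_t
        v = beta2 * v + (1 - beta2) * g**2    # second-moment estimate v_t
        m_hat = m / (1 - beta1**t)            # bias corrections
        v_hat = v / (1 - beta2**t)
        theta -= eta * m_hat / np.sqrt(v_hat + eps)
    return theta

theta = adam(lambda t: 2 * t, 1.0)   # minimize J(theta) = theta^2
print(theta)   # settles near the minimizer 0
```

The bias corrections matter early on: with $m_0=v_0=0$ the raw averages underestimate the true moments until $\beta_1^t$ and $\beta_2^t$ decay.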

## tf.train.RMSPropOptimizer

$E[g^2]_t=0.9E[g^2]_{t-1}+0.1g_t^2$
$\theta_{t+1}=\theta_t-\frac{\eta}{\sqrt{E[g^2]_t+\epsilon}}\odot g_t$
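A NumPy sketch of this update with the fixed 0.9 / 0.1 decay from the formula; $\eta$ and the toy quadratic loss are illustrative assumptions:

```python
import numpy as np

def rmsprop(grad_fn, theta0, eta=0.01, eps=1e-8, steps=500):
    theta, Eg2 = float(theta0), 0.0
    for _ in range(steps):
        g = grad_fn(theta)
        Eg2 = 0.9 * Eg2 + 0.1 * g**2              # decaying average E[g^2]_t
        theta -= eta / np.sqrt(Eg2 + eps) * g     # RMS-normalized step
    return theta

theta = rmsprop(lambda t: 2 * t, 1.0)   # minimize J(theta) = theta^2
print(theta)   # ends near the minimizer 0
```

Unlike Adagrad, the decaying average keeps the denominator from growing without bound, so the effective step size does not shrink to zero.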

## How to choose an optimizer

SGD usually takes longer to train and can get stuck at saddle points, but with good initialization and a learning-rate schedule its results are more reliable.

## Other gradient optimization tricks

1. Shuffling the data (e.g. with a shuffle function) and training over the dataset multiple times.
2. Batch normalization, which prevents gradients from vanishing or exploding as they propagate through the layers.
3. Early stopping to avoid overfitting: monitor the loss on a validation set during training and stop when appropriate.
4. Adding Gaussian noise to the gradients:
$g_{t,i}=g_{t,i}+N(0,\delta_t^2)$
$\delta_t^2=\frac{\eta}{(1+t)^{\gamma}}$
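The annealed noise schedule above can be sketched as follows; the values $\eta=0.01$ and $\gamma=0.55$ are assumptions chosen for illustration, not prescribed by this document:

```python
import numpy as np

rng = np.random.default_rng(0)

def noisy_grad(g, t, eta=0.01, gamma=0.55):
    var = eta / (1 + t) ** gamma      # the delta_t^2 schedule: decays with t
    return g + rng.normal(0.0, np.sqrt(var), size=np.shape(g))

g = np.zeros(3)
print(noisy_grad(g, t=1))         # noticeable noise early in training
print(noisy_grad(g, t=10_000))    # the noise variance has decayed with t
```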

## Usage in tensorflow

```python
# Create an optimizer with the desired parameters.
opt = tf.train.GradientDescentOptimizer(learning_rate=0.1)
# Add Ops to the graph to minimize a cost by updating a list of variables.
# "cost" is a Tensor, and the list of variables contains tf.Variable
# objects.
opt_op = opt.minimize(cost, var_list=<list of variables>)

# Execute opt_op to do one step of training:
opt_op.run()
```

To process the gradients before applying them, use `compute_gradients()` and `apply_gradients()` instead of `minimize()`:

```python
# Create an optimizer.
opt = tf.train.GradientDescentOptimizer(learning_rate=0.1)

# Compute the gradients for a list of variables.
grads_and_vars = opt.compute_gradients(loss, <list of variables>)

# grads_and_vars is a list of tuples (gradient, variable).  Do whatever you
# need to the 'gradient' part, for example cap them, etc.
capped_grads_and_vars = [(MyCapper(gv[0]), gv[1]) for gv in grads_and_vars]

# Ask the optimizer to apply the capped gradients.
opt.apply_gradients(capped_grads_and_vars)
```