TensorFlow Train Module (tf.train)

Training

Optimizer Classes

Optimizer

The Optimizer base class provides methods to compute gradients for a loss and apply gradients to variables. A collection of subclasses implements classic optimization algorithms such as GradientDescent and Adagrad.

class tf.train.Optimizer

Base class for optimizers.

This class defines the API to add Ops to train a model. You never use this class directly, but instead instantiate one of its subclasses such as GradientDescentOptimizer, AdagradOptimizer, or MomentumOptimizer.

Usage
# Create an optimizer with the desired parameters.
opt = GradientDescentOptimizer(learning_rate=0.1)

# Add Ops to the graph to minimize a cost by updating a list of variables.
# "cost" is a Tensor, and the list of variables contains variables.Variable
# objects.
opt_op = opt.minimize(cost, var_list=<list of variables>)
Execute opt_op to do one step of training:
opt_op.run()
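
A minimal runnable sketch of this pattern, assuming the graph-mode TensorFlow API described here; the variable x and the quadratic cost are made up for illustration:

import tensorflow as tf

# Hypothetical toy problem: minimize (x - 2)^2 with gradient descent.
x = tf.Variable(5.0, name='x')
cost = tf.square(x - 2.0)

opt = tf.train.GradientDescentOptimizer(learning_rate=0.1)
opt_op = opt.minimize(cost, var_list=[x])

with tf.Session() as sess:
    sess.run(tf.initialize_all_variables())
    for _ in range(100):
        sess.run(opt_op)      # one training step per call
    print(sess.run(x))        # close to 2.0 after training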

Processing gradients before applying them.

Calling minimize() takes care of both computing the gradients and applying them to the variables. If you want to process the gradients before applying them you can instead use the optimizer in three steps:
* Compute the gradients with compute_gradients().
* Process the gradients as you wish.
* Apply the processed gradients with apply_gradients().

The benefit of this three-step approach is that you can modify the gradients to suit your needs before they are applied. For example:

# Create an optimizer.
opt = GradientDescentOptimizer(learning_rate=0.1)

# Compute the gradients for a list of variables.
grads_and_vars = opt.compute_gradients(loss, <list of variables>)

# grads_and_vars is a list of tuples (gradient, variable).  Do whatever you
# need to the 'gradient' part, for example cap them, etc.
capped_grads_and_vars = [(MyCapper(gv[0]), gv[1]) for gv in grads_and_vars]

# Ask the optimizer to apply the capped gradients.
opt.apply_gradients(capped_grads_and_vars)
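
MyCapper above is only a placeholder. A minimal sketch of one possible capping function using tf.clip_by_value; the name my_capper and the clip range [-1, 1] are assumptions for illustration:

def my_capper(grad):
    # Clip each gradient element into the range [-1.0, 1.0].
    return tf.clip_by_value(grad, -1.0, 1.0)

# Skip variables that received no gradient (None entries).
capped_grads_and_vars = [(my_capper(g), v) for g, v in grads_and_vars if g is not None]
train_op = opt.apply_gradients(capped_grads_and_vars)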

Base class methods:
tf.train.Optimizer.__init__(use_locking, name)

Create a new Optimizer.
This must be called by the constructors of subclasses.

Args:
use_locking: Bool. If True, use locks to prevent concurrent updates to variables.
name: A non-empty string. The name to use for accumulators created for the optimizer.

Raises:
ValueError: if name is malformed.

tf.train.Optimizer.minimize(loss, global_step=None, var_list=None, gate_gradients=1, name=None)

Add operations to minimize ‘loss’ by updating ‘var_list’.
This method simply combines calls to compute_gradients() and apply_gradients(). If you want to process the gradients before applying them, call compute_gradients() and apply_gradients() explicitly instead of using this function; minimize() applies the computed gradients directly, without modification.

Args:
loss: A Tensor containing the value to minimize.
global_step: Optional Variable to increment by one after the variables have been updated.
var_list: Optional list of variables.Variable to update to minimize ‘loss’. Defaults to the list of variables collected in the graph under the key GraphKeys.TRAINABLE_VARIABLES.
gate_gradients: How to gate the computation of gradients. Can be GATE_NONE, GATE_OP, or GATE_GRAPH.
name: Optional name for the returned operation.
Returns:
An Operation that updates the variables in ‘var_list’. If ‘global_step’ was not None, that operation also increments global_step.

Raises:
ValueError: if some of the variables are not variables.Variable objects.
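
A short sketch of minimize() with an explicit var_list and a global_step counter; loss, w and b are assumed to already exist in the graph:

global_step = tf.Variable(0, trainable=False, name='global_step')
opt = tf.train.GradientDescentOptimizer(learning_rate=0.1)

# Only w and b are updated; the returned op also increments global_step by one.
train_op = opt.minimize(loss, global_step=global_step, var_list=[w, b])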

tf.train.Optimizer.compute_gradients(loss, var_list=None, gate_gradients=1)

Compute gradients of “loss” for the variables in “var_list”.
This is the first part of minimize(). It returns a list of (gradient, variable) pairs where “gradient” is the gradient for “variable”. Note that “gradient” can be a Tensor, an IndexedSlices, or None if there is no gradient for the given variable.

Args:
loss: A Tensor containing the value to minimize.
var_list: Optional list of variables.Variable to update to minimize “loss”. Defaults to the list of variables collected in the graph under the key GraphKeys.TRAINABLE_VARIABLES (that is, the list of trainable weights).
gate_gradients: How to gate the computation of gradients. Can be GATE_NONE, GATE_OP, or GATE_GRAPH.

Returns:
A list of (gradient, variable) pairs.

Raises:
TypeError: If var_list contains anything other than variables.Variable objects.
ValueError: If some arguments are invalid.

tf.train.Optimizer.apply_gradients(grads_and_vars, global_step=None, name=None)

Apply gradients to variables.
This is the second part of minimize(). It returns an Operation that applies gradients.

Args:
grads_and_vars: List of (gradient, variable) pairs as returned by compute_gradients() in the previous step.
global_step: Optional Variable to increment by one after the variables have been updated.
If a global_step Variable is passed, the returned operation increments it by one each time the gradients are applied; this is commonly used as a training-step counter (for learning-rate decay, checkpointing, etc.).
name: Optional name for the returned operation. Default to the name passed to the Optimizer constructor.
Returns:
An Operation that applies the specified gradients. If ‘global_step’ was not None, that operation also increments global_step.

Raises:
TypeError: if grads_and_vars is malformed.
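
Since compute_gradients() may return None for variables the loss does not depend on, a common pattern is to filter those pairs out before calling apply_gradients(). A minimal sketch, with loss and opt assumed to exist:

grads_and_vars = opt.compute_gradients(loss)

# Keep only the pairs that actually received a gradient.
filtered = [(g, v) for g, v in grads_and_vars if g is not None]

global_step = tf.Variable(0, trainable=False)
train_op = opt.apply_gradients(filtered, global_step=global_step)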


Gating Gradients

Both minimize() and compute_gradients() accept a gate_gradients argument that controls the degree of parallelism during the application of the gradients.

The possible values are: GATE_NONE, GATE_OP, and GATE_GRAPH.

GATE_NONE: Compute and apply gradients in parallel. This provides the maximum parallelism in execution, at the cost of some non-reproducibility in the results. For example the two gradients of MatMul depend on the input values: With GATE_NONE one of the gradients could be applied to one of the inputs before the other gradient is computed resulting in non-reproducible results.

GATE_OP: For each Op, make sure all gradients are computed before they are used. This prevents race conditions for Ops that generate gradients for multiple inputs where the gradients depend on the inputs.

GATE_GRAPH: Make sure all gradients for all variables are computed before any one of them is used. This provides the least parallelism but can be useful if you want to process all gradients before applying any of them.
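
For example, to force every gradient to be computed before any variable is updated, pass the gating constant to minimize(); a sketch, with loss assumed to exist:

opt = tf.train.GradientDescentOptimizer(learning_rate=0.1)

# GATE_GRAPH: compute every gradient before any variable is updated.
train_op = opt.minimize(loss, gate_gradients=tf.train.Optimizer.GATE_GRAPH)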


Slots

Some optimizer subclasses, such as MomentumOptimizer and AdagradOptimizer allocate and manage additional variables associated with the variables to train. These are called Slots. Slots have names and you can ask the optimizer for the names of the slots that it uses. Once you have a slot name you can ask the optimizer for the variable it created to hold the slot value.
This can be useful if you want to debug a training algorithm, log or report statistics about the slots, etc.


tf.train.Optimizer.get_slot_names()

Return a list of the names of slots created by the Optimizer.
See get_slot().
Returns:
A list of strings.

tf.train.Optimizer.get_slot(var, name)

Return a slot named “name” created for “var” by the Optimizer.
Some Optimizer subclasses use additional variables. For example Momentum and Adagrad use variables to accumulate updates. This method gives access to these Variables if for some reason you need them.

Use get_slot_names() to get the list of slot names created by the Optimizer.

Args:
var: A variable passed to minimize() or apply_gradients().
name: A string.
Returns:
The Variable for the slot if it was created, None otherwise.
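
A short sketch of inspecting slots, assuming a MomentumOptimizer and an existing variable w and loss; MomentumOptimizer conventionally creates a slot named 'momentum', but checking get_slot_names() is the reliable way to find out:

opt = tf.train.MomentumOptimizer(learning_rate=0.01, momentum=0.9)
train_op = opt.minimize(loss)

print(opt.get_slot_names())                   # e.g. ['momentum']
momentum_accum = opt.get_slot(w, 'momentum')  # Variable accumulating w's updates, or None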


Subclasses

class tf.train.GradientDescentOptimizer

tf.train.GradientDescentOptimizer.__init__(learning_rate, use_locking=False, name='GradientDescent')


class tf.train.AdagradOptimizer

Optimizer that implements the Adagrad algorithm.
tf.train.AdagradOptimizer.__init__(learning_rate, initial_accumulator_value=0.1, use_locking=False, name='Adagrad')


class tf.train.MomentumOptimizer

Optimizer that implements the Momentum algorithm.
tf.train.MomentumOptimizer.__init__(learning_rate, momentum, use_locking=False, name='Momentum')


class tf.train.AdamOptimizer

Optimizer that implements the Adam algorithm.
tf.train.AdamOptimizer.__init__(learning_rate=0.001, beta1=0.9, beta2=0.999, epsilon=1e-08, use_locking=False, name='Adam')
Construct a new Adam optimizer.


class tf.train.FtrlOptimizer

Optimizer that implements the FTRL algorithm.
tf.train.FtrlOptimizer.__init__(learning_rate, learning_rate_power=-0.5, initial_accumulator_value=0.1, l1_regularization_strength=0.0, l2_regularization_strength=0.0, use_locking=False, name='Ftrl')


class tf.train.RMSPropOptimizer

Optimizer that implements the RMSProp algorithm.
tf.train.RMSPropOptimizer.__init__(learning_rate, decay, momentum=0.0, epsilon=1e-10, use_locking=False, name='RMSProp')
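
All of these subclasses share the Optimizer interface, so switching algorithms is usually a one-line change. A sketch, with loss assumed to exist:

opt = tf.train.AdamOptimizer(learning_rate=0.001)
# opt = tf.train.RMSPropOptimizer(learning_rate=0.001, decay=0.9)
# opt = tf.train.MomentumOptimizer(learning_rate=0.001, momentum=0.9)
train_op = opt.minimize(loss)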


Gradient Computation

TensorFlow provides functions to compute the derivatives for a given TensorFlow computation graph, adding operations to the graph. The optimizer classes automatically compute derivatives on your graph, but creators of new Optimizers or expert users can call the lower-level functions below.

If you have other needs, you can manipulate gradients directly with the lower-level functions below.

tf.gradients(ys, xs, grad_ys=None, name='gradients', colocate_gradients_with_ops=False, gate_gradients=False, aggregation_method=None)

class tf.AggregationMethod

tf.stop_gradient(input, name=None)
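
A small sketch of these lower-level functions: tf.gradients() builds gradient Ops symbolically, and tf.stop_gradient() blocks backpropagation through its input (the variable x is made up for illustration):

x = tf.Variable(3.0)

y = tf.square(x)
dy_dx = tf.gradients(y, [x])     # a list with one Tensor: dy/dx = 2*x

z = tf.square(tf.stop_gradient(x)) + x
dz_dx = tf.gradients(z, [x])     # the squared term is treated as a constant, so dz/dx = 1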

Gradient Clipping

TensorFlow provides several operations that you can use to add clipping functions to your graph. You can use these functions to perform general data clipping, but they’re particularly useful for handling exploding or vanishing gradients.

tf.clip_by_value(t, clip_value_min, clip_value_max, name=None)

tf.clip_by_norm(t, clip_norm, name=None)

tf.clip_by_average_norm(t, clip_norm, name=None)

tf.clip_by_global_norm(t_list, clip_norm, use_norm=None, name=None)

tf.global_norm(t_list, name=None)
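
A common use is clipping gradients by their global norm between compute_gradients() and apply_gradients(). A minimal sketch, with loss and opt assumed to exist and the clip norm 5.0 chosen arbitrarily:

grads_and_vars = opt.compute_gradients(loss)
grads, variables = zip(*grads_and_vars)

# Rescale all gradients together so that their global norm is at most 5.0.
clipped_grads, global_norm = tf.clip_by_global_norm(grads, clip_norm=5.0)
train_op = opt.apply_gradients(list(zip(clipped_grads, variables)))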

Decaying the learning rate

tf.train.exponential_decay(learning_rate, global_step, decay_steps, decay_rate, staircase=False, name=None)
Applies exponential decay to the learning rate.
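
A typical sketch: the step counter passed to minimize() drives the decay (loss is assumed to exist and the decay numbers are arbitrary):

global_step = tf.Variable(0, trainable=False)

# Start at 0.1 and multiply the rate by 0.96 every 100000 steps.
learning_rate = tf.train.exponential_decay(0.1, global_step, 100000, 0.96, staircase=True)

opt = tf.train.GradientDescentOptimizer(learning_rate)
train_op = opt.minimize(loss, global_step=global_step)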

Moving Averages

Some training algorithms, such as GradientDescent and Momentum, benefit from maintaining a moving average of the variables during optimization. Evaluating with these moving averages often improves results significantly.

tf.train.ExponentialMovingAverage

tf.train.ExponentialMovingAverage.__init__(decay, num_updates=None, name='ExponentialMovingAverage')
Creates a new ExponentialMovingAverage object.

tf.train.ExponentialMovingAverage.apply(var_list=None)

tf.train.ExponentialMovingAverage.average(var)
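
A minimal sketch of the usual pattern: apply() creates shadow variables and an op that updates them, and average() returns the shadow variable for reading; train_op, var0 and var1 are assumed to already exist:

ema = tf.train.ExponentialMovingAverage(decay=0.999)

# Create shadow variables for var0/var1 and an op that updates the averages.
maintain_averages_op = ema.apply([var0, var1])

# Run the averaging op after each training step.
with tf.control_dependencies([train_op]):
    train_with_averages = tf.group(maintain_averages_op)

avg_var0 = ema.average(var0)   # the shadow Variable holding var0's moving average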

To be continued.
