SGD
Source code
Constructor header:
/**
* Stochastic Gradient Descent is a technique for minimizing a function which
* can be expressed as a sum of other functions. That is, suppose we have
*
* \f[
* f(A) = \sum_{i = 0}^{n} f_i(A)
* \f]
*
* and our task is to minimize \f$ A \f$. Stochastic gradient descent iterates
* over each function \f$ f_i(A) \f$, based on the specified update policy. By
* default vanilla update policy (see ens::VanillaUpdate) is used. The SGD class
* supports either scanning through each of the \f$ n \f$ functions \f$
* f_i(A)\f$ linearly, or in a random sequence. The algorithm continues until
* \f$ j\f$ reaches the maximum number of iterations---or when a full sequence
* of updates through each of the \f$ n \f$ functions \f$ f_i(A) \f$ produces an
* improvement within a certain tolerance \f$ \epsilon \f$. That is,
*
* \f[
* | f(A_{j + n}) - f(A_j) | < \epsilon.
* \f]
*
* The parameter \f$\epsilon\f$ is specified by the tolerance parameter to the
* constructor; \f$n\f$ is specified by the maxIterations parameter.
*
* This class is useful for data-dependent functions whose objective function
* can be expressed as a sum of objective functions operating on an individual
* point. Then, SGD considers the gradient of the objective function operating
* on an individual point in its update of \f$ A \f$.
*
* SGD can optimize differentiable separable functions. For more details, see
* the documentation on function types included with this distribution or on the
* ensmallen website.
*
* @tparam UpdatePolicyType Update policy used by SGD during the iterative
* update process. By default vanilla update policy (see ens::VanillaUpdate)
* is used.
* @tparam DecayPolicyType Decay policy used during the iterative update
* process to adjust the step size. By default the step size isn't going to
* be adjusted (i.e. NoDecay is used).
*/
template<typename UpdatePolicyType = VanillaUpdate,
typename DecayPolicyType = NoDecay>
class SGD
{
public:
/**
* Construct the SGD optimizer with the given function and parameters. The
* defaults here are not necessarily good for the given problem, so it is
* suggested that the values used be tailored to the task at hand. The
* maximum number of iterations refers to the maximum number of points that
* are processed (i.e., one iteration equals one point; one iteration does not
* equal one pass over the dataset).
*
* @param stepSize Step size for each iteration.
* @param batchSize Batch size to use for each step.
* @param maxIterations Maximum number of iterations allowed (0 means no
* limit).
* @param tolerance Maximum absolute tolerance to terminate algorithm.
* @param shuffle If true, the function order is shuffled; otherwise, each
* function is visited in linear order.
* @param updatePolicy Instantiated update policy used to adjust the given
* parameters.
* @param decayPolicy Instantiated decay policy used to adjust the step size.
* @param resetPolicy Flag that determines whether update policy parameters
* are reset before every Optimize call.
* @param exactObjective Calculate the exact objective (Default: estimate the
* final objective obtained on the last pass over the data).
*/
SGD(const double stepSize = 0.01,
const size_t batchSize = 32,
const size_t maxIterations = 100000,
const double tolerance = 1e-5,
const bool shuffle = true,
const UpdatePolicyType& updatePolicy = UpdatePolicyType(),
const DecayPolicyType& decayPolicy = DecayPolicyType(),
const bool resetPolicy = true,
const bool exactObjective = false);
Implementation:
template<typename UpdatePolicyType, typename DecayPolicyType>
SGD<UpdatePolicyType, DecayPolicyType>::SGD(
const double stepSize,
const size_t batchSize,
const size_t maxIterations,
const double tolerance,
const bool shuffle,
const UpdatePolicyType& updatePolicy,
const DecayPolicyType& decayPolicy,
const bool resetPolicy,
const bool exactObjective) :
stepSize(stepSize),
batchSize(batchSize),
maxIterations(maxIterations),
tolerance(tolerance),
shuffle(shuffle),
exactObjective(exactObjective),
updatePolicy(updatePolicy),
decayPolicy(decayPolicy),
resetPolicy(resetPolicy),
isInitialized(false)
{
/* Nothing to do. */ }
The doc comment already explains stochastic gradient descent clearly.
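To make the iteration counting and the stopping rule from that comment concrete, here is a minimal standalone sketch. It is not ensmallen's implementation: `SumOfSquares`, `MiniSgd` and `DemoMiniSgd` are hypothetical names invented for this illustration, the batch size is fixed at 1, the functions are visited in linear order, and the objective is only estimated by summing the terms seen during a pass (an assumption that loosely mirrors the `exactObjective = false` default).

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// Illustrative separable objective: f(a) = sum_i (a - x_i)^2.
// Its exact minimizer is the mean of the data points.
struct SumOfSquares
{
  std::vector<double> points;

  std::size_t NumFunctions() const { return points.size(); }

  // Value of the i-th term f_i(a).
  double Evaluate(const double a, const std::size_t i) const
  {
    return (a - points[i]) * (a - points[i]);
  }

  // Derivative of the i-th term with respect to a.
  double Gradient(const double a, const std::size_t i) const
  {
    return 2.0 * (a - points[i]);
  }
};

// Hypothetical helper (not ensmallen API): SGD with batch size 1 and linear
// visiting order. One iteration processes one point; maxIterations == 0
// means "no limit". Returns the objective estimate of the last pass, which
// may be a partial pass if the iteration limit was hit.
double MiniSgd(const SumOfSquares& f,
               double& a,
               const double stepSize,
               const std::size_t maxIterations,
               const double tolerance)
{
  const std::size_t n = f.NumFunctions();
  assert(n > 0);

  double lastObjective = std::numeric_limits<double>::infinity();
  double passObjective = 0.0;

  for (std::size_t j = 0; maxIterations == 0 || j < maxIterations; ++j)
  {
    const std::size_t i = j % n;

    // A full pass over the n functions has just finished: compare the
    // objective estimates of the last two passes against the tolerance.
    if (i == 0 && j > 0)
    {
      if (std::abs(lastObjective - passObjective) < tolerance)
        break;
      lastObjective = passObjective;
      passObjective = 0.0;
    }

    passObjective += f.Evaluate(a, i);
    a -= stepSize * f.Gradient(a, i);  // Step against the gradient of f_i.
  }

  return passObjective;
}

// Usage: minimize over the points {1, 2, 3}, starting from a = 0.
double DemoMiniSgd()
{
  SumOfSquares f{{1.0, 2.0, 3.0}};
  double a = 0.0;
  MiniSgd(f, a, 0.05, 100000, 1e-8);
  return a;
}
```

With a constant step size the iterate settles close to the mean 2 rather than exactly on it, which is the expected behavior of cyclic SGD without step-size decay.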
The constructor predefines a set of default values. It is worth looking at how the two template parameters are implemented:
VanillaUpdate
/**
* Vanilla update policy for Stochastic Gradient Descent (SGD). The following
* update scheme is used to update SGD in every iteration:
*
* \f[
* A_{j + 1} = A_j + \alpha \nabla f_i(A)
* \f]
*
* where \f$ \alpha \f$ is a parameter which specifies the step size. \f$ i \f$
* is chosen according to \f$ j \f$ (the iteration number).
*/
class VanillaUpdate
{
public:
/**
* The UpdatePolicyType policy classes must contain an internal 'Policy'
* template class with two template arguments: MatType and GradType. This is
* instantiated at the start of the optimization.
*/
template<typename MatType, typename GradType>
class Policy
{
public:
/**
* This is called by the optimizer method before the start of the iteration
* update process. The vanilla update doesn't initialize anything.
*
* @param parent Instantiated parent class.
* @param rows Number of rows in the gradient matrix.
* @param cols Number of columns in the gradient matrix.
*/
Policy(const VanillaUpdate& /* parent */,
const size_t /* rows */,
const size_t /* cols */)
{
/* Do nothing. */ }
/**
* Update step for SGD. The function parameters are updated in the negative
* direction of the gradient.
*
* @param iterate Parameters that minimize the function.
* @param stepSize Step size to be used for the given iteration.
* @param gradient The gradient matrix.
*/
void Update(MatType& iterate,
const double stepSize,
const GradType& gradient)
{
// Perform the vanilla SGD update.
iterate -= stepSize * gradient;
}
};
};
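As the name suggests, this is the simple update policy (in Concrete Mathematics, the authors likewise use the ice cream flavors vanilla and rocky road as metaphors for the simple and the complicated variant):
$$f_{k+1} = f_k + \lambda_k p_k$$
where $p_k$ is the search direction, taken to be the negative gradient direction $p_k = -\nabla f_k$, and $\lambda_k$ is the step size.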
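A single update step can be worked through by hand. The sketch below is a standalone illustration on `std::vector<double>`, not ensmallen code; `VanillaStep` and `DemoVanillaStep` are hypothetical names. It mirrors the `iterate -= stepSize * gradient` line from `Update()`. Note that the Doxygen formula in the quoted class comment is written with a plus sign in front of the gradient term, while the code subtracts, which matches taking the negative gradient as the search direction.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for VanillaUpdate::Policy::Update() on plain
// vectors: move the iterate against the gradient, scaled by the step size.
void VanillaStep(std::vector<double>& iterate,
                 const double stepSize,
                 const std::vector<double>& gradient)
{
  assert(iterate.size() == gradient.size());
  for (std::size_t k = 0; k < iterate.size(); ++k)
    iterate[k] -= stepSize * gradient[k];
}

// Worked example for f(x, y) = x^2 + y^2, whose gradient is (2x, 2y).
// Starting at (1, -2) with step size 0.25:
//   x: 1  - 0.25 * 2    =  0.5
//   y: -2 - 0.25 * (-4) = -1.0
std::vector<double> DemoVanillaStep()
{
  std::vector<double> iterate = {1.0, -2.0};
  const std::vector<double> gradient = {2.0 * iterate[0], 2.0 * iterate[1]};
  VanillaStep(iterate, 0.25, gradient);
  return iterate;
}
```

Each coordinate moves halfway toward the minimizer at the origin, so the objective drops from 5 to 1.25 in one step.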
NoDecay
/**
* Definition of the NoDecay class. Use this as a template for your own.
*/
class NoDecay
{
public:
/**
* This constructor is called before the first iteration.
*/
NoDecay() {
}
/**
* The DecayPolicyType policy classes must contain an internal 'Policy'
* template class with two template arguments: MatType and GradType. This is
* initialized at the start of the optimization, and holds parameters specific
* to an individual optimization.
*/
template<typename MatType, typename GradType>
class Policy
{
public:
/**
* This constructor is called by the SGD Optimize() method before the start
* of the iteration update process.
*/
Policy(NoDecay& /* parent */) {
}
/**
* This function is called in each iteration after the policy update.
*
* @param iterate Parameters that minimize the function.
* @param stepSize Step size to be used for the given iteration.
* @param gradient The gradient matrix.
*/
void Update(MatType& /* iterate */,
double& /* stepSize */,
const GradType& /* gradient */)
{
// Nothing to do here.
}
/**
* This function is called in each iteration after the SVRG update step.
*
* @param iterate Parameters that minimize the function.
* @param iterate0 The last function parameters at time t - 1.
* @param gradient The current gradient matrix at time t.
* @param fullGradient The computed full gradient.
* @param stepSize Step size to be used for the given iteration.
*/
void Update(const MatType& /* iterate */,
const MatType& /* iterate0 */,
const GradType& /* gradient */,
const GradType& /* fullGradient */,
const size_t /* numBatches */,
double& /* stepSize */)
{
// Nothing to do here.
}
};
};
As the name says, it does nothing.
Optimize header:
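Since the class comment invites using NoDecay as a template, the following sketch shows what a decay policy of the same shape might look like. `GeometricDecay` is a hypothetical class written for this illustration and is not part of ensmallen; it is exercised here on `std::vector<double>` instead of Armadillo types, and the SVRG overload of `Update()` is left out for brevity.

```cpp
#include <cassert>
#include <vector>

// Hypothetical decay policy with the same outer/inner class layout as
// NoDecay: the outer class holds the settings, the inner Policy class is
// created once per optimization run.
class GeometricDecay
{
 public:
  explicit GeometricDecay(const double factor = 0.5) : factor(factor)
  {
    assert(factor > 0.0 && factor <= 1.0);
  }

  double Factor() const { return factor; }

  template<typename MatType, typename GradType>
  class Policy
  {
   public:
    // Same constructor signature as NoDecay::Policy: the parent policy.
    Policy(GeometricDecay& parent) : parent(parent) { }

    // Called once per iteration after the parameter update; shrinks the
    // step size in place through the non-const reference.
    void Update(MatType& /* iterate */,
                double& stepSize,
                const GradType& /* gradient */)
    {
      stepSize *= parent.Factor();
    }

   private:
    GeometricDecay& parent;
  };

 private:
  double factor;
};

// Usage: three updates with factor 0.5 take the step size from 0.8 to 0.1.
double DemoGeometricDecay()
{
  typedef std::vector<double> Vec;
  GeometricDecay decay(0.5);
  GeometricDecay::Policy<Vec, Vec> policy(decay);

  Vec iterate = {0.0};
  const Vec gradient = {0.0};
  double stepSize = 0.8;
  for (int step = 0; step < 3; ++step)
    policy.Update(iterate, stepSize, gradient);
  return stepSize;
}
```

The point to notice is that `stepSize` is passed by non-const reference, which is what allows a decay policy to change the step size that the optimizer uses in later iterations.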
/**
* Optimize the given function using stochastic gradient descent. The given
* starting point will be modified to store the finishing point of the
* algorithm, and the final objective value is returned.
*
* @tparam SeparableFunctionType Type of the function to be optimized.
* @tparam MatType Type of matrix to optimize with.
* @tparam GradType Type of matrix to use to represent function gradients.
* @tparam CallbackTypes Types of callback functions.
* @param function Function to optimize.
* @param iterate Starting point (will be modified).
* @param callbacks Callback functions.
* @return Objective value of the final point.
*/
template<typename SeparableFunctionType,
typename MatType,
typename GradType,
typename... CallbackTypes>
typename std::enable_if<IsArmaType<GradType>::value,
typename MatType::elem_type>::type
Optimize(SeparableFunctionType& function,
MatType& iterate,
CallbackTypes&&... callbacks);
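The return type in this declaration uses `std::enable_if`: the overload only takes part in overload resolution when the trait on `GradType` is true, and in that case the return type is the element type of `MatType`. The sketch below reproduces the pattern in isolation. `IsStdVector`, `SquaredNorm` and `DemoSquaredNorm` are hypothetical names standing in for `IsArmaType` and `Optimize`, and `value_type` plays the role of Armadillo's `elem_type`; defaulting `GradType` to `MatType` is an assumption made so the example can be called directly.

```cpp
#include <cassert>
#include <type_traits>
#include <vector>

// Hypothetical trait standing in for IsArmaType: true only for std::vector.
template<typename T>
struct IsStdVector : std::false_type { };

template<typename T, typename Alloc>
struct IsStdVector<std::vector<T, Alloc>> : std::true_type { };

// Enabled only when GradType satisfies the trait. When enabled, the return
// type is the element type of MatType, as in the Optimize() declaration.
template<typename MatType, typename GradType = MatType>
typename std::enable_if<IsStdVector<GradType>::value,
                        typename MatType::value_type>::type
SquaredNorm(const MatType& iterate)
{
  typedef typename MatType::value_type ElemType;
  ElemType sum = ElemType(0);
  for (const ElemType& v : iterate)
    sum += v * v;
  return sum;
}

// The deduced return type follows the element type of the argument.
static_assert(
    std::is_same<decltype(SquaredNorm(std::vector<float>())), float>::value,
    "return type should be the element type");

// Usage: 3^2 + 4^2 = 25, returned as float because MatType holds floats.
float DemoSquaredNorm()
{
  const std::vector<float> iterate = {3.0f, 4.0f};
  const float result = SquaredNorm(iterate);
  assert(result == 25.0f);
  return result;
}
```

Calling `SquaredNorm` with a type for which the trait is false removes the overload from consideration instead of producing an error inside the function body, which is what lets a library offer separate overloads for different matrix families.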
Implementation:
//! Optimize the function (minimize).
template<typename UpdatePolicyType, typename DecayPolicyType>
template<typename SeparableFunctionType,
typename MatType,
typename GradType,
typename... CallbackTypes>
typename std::enable_if<IsArmaType<GradType>::value,
typename MatType::elem_type>::type
SGD<UpdatePolicyType, DecayPolicyType>::Optimize(
SeparableFunctionType& function,
MatType& iterateIn,
CallbackTypes&&... callbacks)
{
// Convenience typedefs.
typedef typename MatType::elem_type ElemType;
typedef typename MatTypeTraits<MatType>::BaseMatType BaseMatType;
typedef