Adam(ensmallen)

胧月夜い

Tags: stochastic gradient descent

Article link: https://blog.csdn.net/qq_46013251/article/details/119066660

Adam

SGD
- Source code
Adam
- Pseudocode
- Source code
- AdaMax
  - Pseudocode
  - Source code
References

SGD

Source code

Constructor declaration (header file):

/**
 * Stochastic Gradient Descent is a technique for minimizing a function which
 * can be expressed as a sum of other functions.  That is, suppose we have
 *
 * \f[
 * f(A) = \sum_{i = 0}^{n} f_i(A)
 * \f]
 *
 * and our task is to minimize \f$ A \f$.  Stochastic gradient descent iterates
 * over each function \f$ f_i(A) \f$, based on the specified update policy. By
 * default vanilla update policy (see ens::VanillaUpdate) is used. The SGD class
 * supports either scanning through each of the \f$ n \f$ functions \f$
 * f_i(A)\f$ linearly, or in a random sequence.  The algorithm continues until
 * \f$ j\f$ reaches the maximum number of iterations---or when a full sequence
 * of updates through each of the \f$ n \f$ functions \f$ f_i(A) \f$ produces an
 * improvement within a certain tolerance \f$ \epsilon \f$.  That is,
 *
 * \f[
 * | f(A_{j + n}) - f(A_j) | < \epsilon.
 * \f]
 *
 * The parameter \f$\epsilon\f$ is specified by the tolerance parameter to the
 * constructor; \f$n\f$ is specified by the maxIterations parameter.
 *
 * This class is useful for data-dependent functions whose objective function
 * can be expressed as a sum of objective functions operating on an individual
 * point.  Then, SGD considers the gradient of the objective function operating
 * on an individual point in its update of \f$ A \f$.
 *
 * SGD can optimize differentiable separable functions.  For more details, see
 * the documentation on function types included with this distribution or on the
 * ensmallen website.
 *
 * @tparam UpdatePolicyType Update policy used by SGD during the iterative
 *     update process. By default vanilla update policy (see ens::VanillaUpdate)
 *     is used.
 * @tparam DecayPolicyType Decay policy used during the iterative update
 *     process to adjust the step size. By default the step size isn't going to
 *     be adjusted (i.e. NoDecay is used).
 */
template<typename UpdatePolicyType = VanillaUpdate,
         typename DecayPolicyType = NoDecay>
class SGD
{
 public:
  /**
   * Construct the SGD optimizer with the given function and parameters. The
   * defaults here are not necessarily good for the given problem, so it is
   * suggested that the values used be tailored to the task at hand.  The
   * maximum number of iterations refers to the maximum number of points that
   * are processed (i.e., one iteration equals one point; one iteration does not
   * equal one pass over the dataset).
   *
   * @param stepSize Step size for each iteration.
   * @param batchSize Batch size to use for each step.
   * @param maxIterations Maximum number of iterations allowed (0 means no
   *     limit).
   * @param tolerance Maximum absolute tolerance to terminate algorithm.
   * @param shuffle If true, the function order is shuffled; otherwise, each
   *     function is visited in linear order.
   * @param updatePolicy Instantiated update policy used to adjust the given
   *                     parameters.
   * @param decayPolicy Instantiated decay policy used to adjust the step size.
   * @param resetPolicy Flag that determines whether update policy parameters
   *                    are reset before every Optimize call.
   * @param exactObjective Calculate the exact objective (Default: estimate the
   *        final objective obtained on the last pass over the data).
   */
  SGD(const double stepSize = 0.01,
      const size_t batchSize = 32,
      const size_t maxIterations = 100000,
      const double tolerance = 1e-5,
      const bool shuffle = true,
      const UpdatePolicyType& updatePolicy = UpdatePolicyType(),
      const DecayPolicyType& decayPolicy = DecayPolicyType(),
      const bool resetPolicy = true,
      const bool exactObjective = false);

Implementation:

template<typename UpdatePolicyType, typename DecayPolicyType>
SGD<UpdatePolicyType, DecayPolicyType>::SGD(
    const double stepSize,
    const size_t batchSize,
    const size_t maxIterations,
    const double tolerance,
    const bool shuffle,
    const UpdatePolicyType& updatePolicy,
    const DecayPolicyType& decayPolicy,
    const bool resetPolicy,
    const bool exactObjective) :
    stepSize(stepSize),
    batchSize(batchSize),
    maxIterations(maxIterations),
    tolerance(tolerance),
    shuffle(shuffle),
    exactObjective(exactObjective),
    updatePolicy(updatePolicy),
    decayPolicy(decayPolicy),
    resetPolicy(resetPolicy),
    isInitialized(false)
{ /* Nothing to do. */ }

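The comments above describe stochastic gradient descent clearly: the objective is a sum of functions $f_i$, the optimizer visits them in linear or shuffled order, and it stops at the iteration limit or once a full pass changes the objective by less than the tolerance.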
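The loop described in those comments can be sketched without ensmallen. The following is a minimal illustration, not the library's implementation: `SquaredDistanceSketch` and `SgdSketch` are hypothetical names, the scalar objective $f(a) = \sum_i (a - x_i)^2$ is chosen only because its minimizer (the mean) is easy to check, and averaging the gradient over the batch is an assumption of this sketch.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <limits>
#include <numeric>
#include <random>
#include <vector>

// Hypothetical separable objective: f(a) = sum_i (a - x_i)^2.
// Its minimizer is the mean of the data points.
struct SquaredDistanceSketch
{
  std::vector<double> points;

  std::size_t NumFunctions() const { return points.size(); }

  // Objective of the single term f_i.
  double Evaluate(const double a, const std::size_t i) const
  {
    return (a - points[i]) * (a - points[i]);
  }

  // Gradient of the single term f_i.
  double Gradient(const double a, const std::size_t i) const
  {
    return 2.0 * (a - points[i]);
  }
};

// Minimal mini-batch SGD loop; parameter names mirror the constructor above.
// The iteration limit is only checked between passes in this sketch.
double SgdSketch(const SquaredDistanceSketch& f,
                 double& iterate,
                 const double stepSize = 0.01,
                 const std::size_t batchSize = 32,
                 const std::size_t maxIterations = 100000,
                 const double tolerance = 1e-5,
                 const bool shuffle = true)
{
  assert(batchSize > 0 && f.NumFunctions() > 0);

  std::vector<std::size_t> order(f.NumFunctions());
  std::iota(order.begin(), order.end(), std::size_t{0});
  std::mt19937 rng(42);  // Fixed seed, chosen only for reproducibility.

  double lastObjective = std::numeric_limits<double>::max();
  double overallObjective = 0.0;
  std::size_t processed = 0;  // One iteration is counted per processed point.

  while (maxIterations == 0 || processed < maxIterations)
  {
    if (shuffle)
      std::shuffle(order.begin(), order.end(), rng);

    overallObjective = 0.0;
    for (std::size_t start = 0; start < order.size(); start += batchSize)
    {
      const std::size_t end = std::min(start + batchSize, order.size());
      double gradient = 0.0;
      for (std::size_t k = start; k < end; ++k)
      {
        overallObjective += f.Evaluate(iterate, order[k]);
        gradient += f.Gradient(iterate, order[k]);
      }
      // Vanilla update with the batch-averaged gradient (an assumption here).
      iterate -= stepSize * gradient / static_cast<double>(end - start);
      processed += end - start;
    }

    // Stop once a full pass changes the estimated objective by < tolerance.
    if (std::abs(lastObjective - overallObjective) < tolerance)
      break;
    lastObjective = overallObjective;
  }
  return overallObjective;
}
```

Calling `SgdSketch` on the points 1, 2, 3, 4 with a batch size of 4 drives the iterate toward their mean, 2.5. The returned value is the objective accumulated during the last pass, which corresponds to the estimated (not exact) final objective mentioned for the `exactObjective` parameter.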

The constructor supplies default values for all of its parameters. Next, look at how the two template parameters are implemented:

VanillaUpdate

/**
 * Vanilla update policy for Stochastic Gradient Descent (SGD). The following
 * update scheme is used to update SGD in every iteration:
 *
 * \f[
 * A_{j + 1} = A_j + \alpha \nabla f_i(A)
 * \f]
 *
 * where \f$ \alpha \f$ is a parameter which specifies the step size.  \f$ i \f$
 * is chosen according to \f$ j \f$ (the iteration number).
 */
class VanillaUpdate
{
 public:
  /**
   * The UpdatePolicyType policy classes must contain an internal 'Policy'
   * template class with two template arguments: MatType and GradType.  This is
   * instantiated at the start of the optimization.
   */
  template<typename MatType, typename GradType>
  class Policy
  {
   public:
    /**
     * This is called by the optimizer method before the start of the iteration
     * update process.  The vanilla update doesn't initialize anything.
     *
     * @param parent Instantiated parent class.
     * @param rows Number of rows in the gradient matrix.
     * @param cols Number of columns in the gradient matrix.
     */
    Policy(const VanillaUpdate& /* parent */,
           const size_t /* rows */,
           const size_t /* cols */)
    { /* Do nothing. */ }

   /**
    * Update step for SGD.  The function parameters are updated in the negative
    * direction of the gradient.
    *
    * @param iterate Parameters that minimize the function.
    * @param stepSize Step size to be used for the given iteration.
    * @param gradient The gradient matrix.
    */
    void Update(MatType& iterate,
                const double stepSize,
                const GradType& gradient)
    {
      // Perform the vanilla SGD update.
      iterate -= stepSize * gradient;
    }
  };
};

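As the name suggests, this is the plain update policy (in Concrete Mathematics, the authors likewise use the ice-cream flavors vanilla and rocky road as metaphors for the simple and the elaborate variant):
$x_{k+1} = x_k + \lambda_k p_k$
Here $p_k$ is the search direction, taken to be the negative gradient $p_k = -\nabla f(x_k)$, and $\lambda_k$ is the step size. Substituting gives $x_{k+1} = x_k - \lambda_k \nabla f(x_k)$, which is exactly what the line iterate -= stepSize * gradient computes. Note that the formula in the class comment is written with a plus sign, while the code subtracts.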
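The update rule can be tried in isolation on plain vectors. This is a minimal sketch under the assumption that the iterate and gradient are stored as `std::vector<double>`; `VanillaUpdateSketch` is a hypothetical free function, not the ensmallen class.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Sketch of x_{k+1} = x_k - lambda_k * grad f(x_k) on plain vectors.
// VanillaUpdateSketch is a hypothetical name, not the ensmallen class.
inline void VanillaUpdateSketch(std::vector<double>& iterate,
                                const double stepSize,
                                const std::vector<double>& gradient)
{
  assert(iterate.size() == gradient.size());
  for (std::size_t i = 0; i < iterate.size(); ++i)
    iterate[i] -= stepSize * gradient[i];  // Move against the gradient.
}
```

For $f(x) = x_1^2 + x_2^2$ at $x = (1, -2)$ the gradient is $(2, -4)$; with a step size of 0.25 one update moves the iterate to $(0.5, -1)$, halfway to the minimizer at the origin.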

NoDecay

/**
 * Definition of the NoDecay class. Use this as a template for your own.
 */
class NoDecay
{
 public:
  /**
   * This constructor is called before the first iteration.
   */
  NoDecay() { }

  /**
   * The DecayPolicyType policy classes must contain an internal 'Policy'
   * template class with two template arguments: MatType and GradType.  This is
   * initialized at the start of the optimization, and holds parameters specific
   * to an individual optimization.
   */
  template<typename MatType, typename GradType>
  class Policy
  {
   public:
    /**
     * This constructor is called by the SGD Optimize() method before the start
     * of the iteration update process.
     */
    Policy(NoDecay& /* parent */) { }

    /**
     * This function is called in each iteration after the policy update.
     *
     * @param iterate Parameters that minimize the function.
     * @param stepSize Step size to be used for the given iteration.
     * @param gradient The gradient matrix.
     */
    void Update(MatType& /* iterate */,
                double& /* stepSize */,
                const GradType& /* gradient */)
    {
      // Nothing to do here.
    }

    /**
     * This function is called in each iteration after the SVRG update step.
     *
     * @param iterate Parameters that minimize the function.
     * @param iterate0 The last function parameters at time t - 1.
     * @param gradient The current gradient matrix at time t.
     * @param fullGradient The computed full gradient.
     * @param stepSize Step size to be used for the given iteration.
     */
    void Update(const MatType& /* iterate */,
                const MatType& /* iterate0 */,
                const GradType& /* gradient */,
                const GradType& /* fullGradient */,
                const size_t /* numBatches */,
                double& /* stepSize */)
    {
      // Nothing to do here.
    }
  };
};

As the name says, it does nothing: both Update overloads have empty bodies, so the step size is never changed.
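Since the class comment invites using NoDecay as a template, here is a rough sketch of what a non-trivial decay policy with the same shape might look like. `ExponentialDecaySketch` and its `factor` parameter are hypothetical and not part of ensmallen, and the SVRG overload of `Update` is left out for brevity.

```cpp
#include <cassert>
#include <vector>

// Hypothetical decay policy following the shape of NoDecay above:
// an outer class holding settings and an inner Policy template whose
// Update() may modify the step size in place.
class ExponentialDecaySketch
{
 public:
  explicit ExponentialDecaySketch(const double factor = 0.5) : factor(factor)
  {
    assert(factor > 0.0 && factor <= 1.0);
  }

  double Factor() const { return factor; }

  template<typename MatType, typename GradType>
  class Policy
  {
   public:
    explicit Policy(ExponentialDecaySketch& parent) : parent(parent) { }

    // Called once per iteration after the policy update: shrink the step.
    void Update(MatType& /* iterate */,
                double& stepSize,
                const GradType& /* gradient */)
    {
      stepSize *= parent.Factor();
    }

   private:
    ExponentialDecaySketch& parent;
  };

 private:
  double factor;
};
```

The step size is passed by non-const reference, which is what lets a decay policy change it; NoDecay simply leaves that reference untouched.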

Optimize declaration (header file):

  /**
   * Optimize the given function using stochastic gradient descent.  The given
   * starting point will be modified to store the finishing point of the
   * algorithm, and the final objective value is returned.
   *
   * @tparam SeparableFunctionType Type of the function to be optimized.
   * @tparam MatType Type of matrix to optimize with.
   * @tparam GradType Type of matrix to use to represent function gradients.
   * @tparam CallbackTypes Types of callback functions.
   * @param function Function to optimize.
   * @param iterate Starting point (will be modified).
   * @param callbacks Callback functions.
   * @return Objective value of the final point.
   */
  template<typename SeparableFunctionType,
           typename MatType,
           typename GradType,
           typename... CallbackTypes>
  typename std::enable_if<IsArmaType<GradType>::value,
      typename MatType::elem_type>::type
  Optimize(SeparableFunctionType& function,
           MatType& iterate,
           CallbackTypes&&... callbacks);
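The return type in this declaration is wrapped in `std::enable_if`, so the overload only takes part in overload resolution when `IsArmaType<GradType>::value` is true, and it then returns `MatType::elem_type`. The mechanism can be illustrated with a small self-contained sketch; `IsVectorSketch` and `SumSketch` are hypothetical stand-ins, `std::vector` replaces the Armadillo types, and defaulting `GradType` to `MatType` is an assumption made here so that the call needs no explicit template arguments.

```cpp
#include <cassert>
#include <type_traits>
#include <vector>

// Hypothetical stand-in for IsArmaType: true only for std::vector<T>.
template<typename T>
struct IsVectorSketch : std::false_type { };

template<typename T>
struct IsVectorSketch<std::vector<T>> : std::true_type { };

// This overload exists only when GradType passes the trait; the return type
// is then the element type of MatType, mirroring MatType::elem_type above.
template<typename MatType, typename GradType = MatType>
typename std::enable_if<IsVectorSketch<GradType>::value,
    typename MatType::value_type>::type
SumSketch(const MatType& iterate)
{
  typename MatType::value_type total = 0;
  for (const auto& v : iterate)
    total += v;
  return total;
}
```

If the trait is false for the chosen `GradType`, the function is silently removed from the candidate set instead of causing a hard error, which leaves room for other overloads to handle other matrix types.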

Implementation:

//! Optimize the function (minimize).
template<typename UpdatePolicyType, typename DecayPolicyType>
template<typename SeparableFunctionType,
         typename MatType,
         typename GradType,
         typename... CallbackTypes>
typename std::enable_if<IsArmaType<GradType>::value,
typename MatType::elem_type>::type
SGD<UpdatePolicyType, DecayPolicyType>::Optimize(
    SeparableFunctionType& function,
    MatType& iterateIn,
    CallbackTypes&&... callbacks)
{
  // Convenience typedefs.
  typedef typename MatType::elem_type ElemType;
  typedef typename MatTypeTraits<MatType>::BaseMatType BaseMatType;
  typedef typename MatT
