Deep Learning Course 2, Week 2 Quiz Notes: Optimization Algorithms

Optimization Algorithms

  1. Using the notation for mini-batch gradient descent, to which of the following does $a^{[2]\{4\}(3)}$ correspond?
  • The activation of the third layer when the input is the fourth example of the second mini-batch.
  • The activation of the second layer when the input is the third example of the fourth mini-batch.
  • The activation of the fourth layer when the input is the second example of the third mini-batch.
  • The activation of the second layer when the input is the fourth example of the third mini-batch.
  2. Which of these statements about mini-batch gradient descent do you agree with?
  • You should implement mini-batch gradient descent without an explicit for-loop over different mini-batches so that the algorithm processes all mini-batches at the same time (vectorization).
  • Training one epoch (one pass through the training set) using mini-batch gradient descent is faster than training one epoch using batch gradient descent.
  • When the mini-batch size is the same as the training size, mini-batch gradient descent is equivalent to batch gradient descent.
    (Explanation: Batch gradient descent uses all the examples at each iteration; this is equivalent to having a single mini-batch whose size equals the whole training set in mini-batch gradient descent. A minimal sketch of the mini-batch loop follows below.)
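To make the mechanics concrete, here is a minimal numpy sketch of one epoch of mini-batch gradient descent. The `forward_backward` helper and the `dW`/`db` gradient-naming convention are hypothetical, for illustration only; note that setting `batch_size = m` recovers batch gradient descent, and a cost value is computed on a different mini-batch at every step.

```python
import numpy as np

def run_one_epoch(X, Y, params, forward_backward,
                  learning_rate=0.01, batch_size=64, seed=0):
    """One epoch of mini-batch gradient descent.

    X: (n_x, m) inputs, Y: (1, m) labels, params: dict of weights/biases.
    forward_backward: hypothetical helper returning (cost, grads) on a mini-batch,
    with gradients keyed as "dW1", "db1", ... for parameters "W1", "b1", ...
    With batch_size == m this is exactly one step of batch gradient descent.
    """
    m = X.shape[1]
    perm = np.random.default_rng(seed).permutation(m)  # shuffle once per epoch
    X_shuf, Y_shuf = X[:, perm], Y[:, perm]

    costs = []
    for t in range(0, m, batch_size):          # explicit loop over mini-batches
        X_t = X_shuf[:, t:t + batch_size]      # computation is vectorized across
        Y_t = Y_shuf[:, t:t + batch_size]      # the examples inside each mini-batch
        cost, grads = forward_backward(X_t, Y_t, params)
        for key in params:                     # plain gradient-descent update
            params[key] -= learning_rate * grads["d" + key]
        costs.append(cost)                     # one cost value per mini-batch
    return params, costs
```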
  3. Why is the best mini-batch size usually not 1 and not m, but instead something in-between? Check all that are true.
  • If the mini-batch size is 1, you lose the benefits of vectorization across examples in the mini-batch.
  • If the mini-batch size is m, you end up with stochastic gradient descent, which is usually slower than mini-batch gradient descent.
  • If the mini-batch size is 1, you end up having to process the entire training set before making any progress.
  • If the mini-batch size is m, you end up with batch gradient descent, which has to process the whole training set before making progress.
  4. While using mini-batch gradient descent with a batch size larger than 1 but less than m, the plot of the cost function $J$ looks like this:
    [Figure: cost $J$ plotted against mini-batch iterations, decreasing overall but not monotonically]
    You notice that the value of $J$ is not always decreasing. Which of the following is the most likely reason for that?
  • In mini-batch gradient descent we calculate $J(\hat{y}^{\{t\}}, y^{\{t\}})$, thus with each batch we compute the cost over a new set of data.
  • A bad implementation of the backpropagation process; we should use gradient checking to debug our implementation.
  • You are not implementing the moving averages correctly. Using moving averages will smooth the graph.
  • The algorithm is at a local minimum, thus the noisy behavior.
    (Explanation: Yes. Since at each iteration we work with a different mini-batch of data, the loss does not have to decrease at every iteration.)
  5. Suppose the temperatures in Casablanca over the first two days of January are the same:
    Jan 1st: $\theta_1 = 10^{\circ}C$
    Jan 2nd: $\theta_2 = 10^{\circ}C$

    (We used Fahrenheit in the lecture, so we will use Celsius here in honor of the metric world.)
    Say you use an exponentially weighted average with $\beta = 0.5$ to track the temperature: $v_0 = 0$, $v_t = \beta v_{t-1} + (1-\beta)\theta_t$. If $v_2$ is the value computed after day 2 without bias correction, and $v_2^{corrected}$ is the value you compute with bias correction, what are these values? (You might be able to do this without a calculator, but you don't actually need one. Remember what bias correction is doing.) A worked computation follows the options below.
  • $v_2 = 10$, $v_2^{corrected} = 10$
  • $v_2 = 7.5$, $v_2^{corrected} = 7.5$
  • $v_2 = 7.5$, $v_2^{corrected} = 10$
  • $v_2 = 10$, $v_2^{corrected} = 7.5$
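Working through the stated update $v_t = \beta v_{t-1} + (1-\beta)\theta_t$ with $\beta = 0.5$, $v_0 = 0$, and $\theta_1 = \theta_2 = 10$ is straightforward arithmetic:

$$
\begin{aligned}
v_1 &= 0.5 \cdot 0 + 0.5 \cdot 10 = 5 \\
v_2 &= 0.5 \cdot 5 + 0.5 \cdot 10 = 7.5 \\
v_2^{corrected} &= \frac{v_2}{1 - \beta^2} = \frac{7.5}{1 - 0.25} = 10
\end{aligned}
$$

So $v_2 = 7.5$ without bias correction and $v_2^{corrected} = 10$ with it, which matches the third option.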
  6. Which of the following is true about learning rate decay?
  • The intuition behind it is that for later epochs our parameters are closer to a minimum thus it is more convenient to take smaller steps to prevent large oscillations.
  • We use it to increase the size of the steps taken in each mini-batch iteration.
  • The intuition behind it is that for later epochs our parameters are closer to a minimum thus it is more convenient to take larger steps to accelerate the convergence.
  • It helps to reduce the variance of a model.
    (Explanation: Reducing the learning rate over time reduces the oscillation around a minimum; a sketch of a common decay schedule follows below.)
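For concreteness, here is a minimal sketch of one decay schedule discussed in the course, $\alpha = \frac{\alpha_0}{1 + \text{decay\_rate} \cdot \text{epoch\_num}}$; the numeric values in the comment are only example settings, not prescribed ones.

```python
def decayed_learning_rate(alpha0, epoch_num, decay_rate):
    """Learning-rate decay: alpha shrinks as epochs go by, so later epochs
    take smaller steps and oscillate less around the minimum."""
    return alpha0 / (1.0 + decay_rate * epoch_num)

# Example with alpha0 = 0.2 and decay_rate = 1.0:
# epoch 0 -> 0.200, epoch 1 -> 0.100, epoch 2 -> 0.067, epoch 3 -> 0.050
```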
  7. You use an exponentially weighted average on the London temperature dataset. You use the following to track the temperature: $v_t = \beta v_{t-1} + (1-\beta)\theta_t$. The yellow and red lines were computed using values $\beta_1$ and $\beta_2$ respectively. Which of the following are true?
    [Figure: London daily temperatures with two exponentially weighted averages, a yellow curve ($\beta_1$) and a red curve ($\beta_2$)]
  • $\beta_1 < \beta_2$
  • $\beta_1 = \beta_2$
  • $\beta_1 > \beta_2$
  • $\beta_1 = 0, \beta_2 > 0$
    Explanation: the larger $\beta$ is, the smoother (and more lagged) the curve; see the sketch below.
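A small numpy sketch that tracks the same (synthetic, made-up) temperature series with two different values of $\beta$ illustrates this; the curve with the larger $\beta$ comes out smoother but lags further behind the data.

```python
import numpy as np

def ewa(thetas, beta):
    """Exponentially weighted average v_t = beta * v_{t-1} + (1 - beta) * theta_t, with v_0 = 0."""
    v, out = 0.0, []
    for theta in thetas:
        v = beta * v + (1 - beta) * theta
        out.append(v)
    return np.array(out)

# Synthetic noisy daily temperatures (illustrative only)
temps = 10 + 5 * np.sin(np.linspace(0, 6, 100)) + np.random.randn(100)
curve_small_beta = ewa(temps, beta=0.5)   # follows the data closely, noisier
curve_large_beta = ewa(temps, beta=0.98)  # much smoother, lags behind the data
```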
  8. Consider the figure:
    [Figure: gradient descent trajectory oscillating across the cost contours]
    Suppose this plot was generated with gradient descent with momentum $\beta = 0.01$. What happens if we increase the value of $\beta$ to 0.1?
  • The gradient descent process starts oscillating in the vertical direction.
  • The gradient descent process starts moving more in the horizontal direction and less in the vertical.
  • The gradient descent process moves less in the horizontal direction and more in the vertical direction.
  • The gradient descent process moves more in the horizontal and the vertical axis.
    Explanation: As $\beta$ increases, more past gradients are averaged into each step, so the oscillation amplitude shrinks while the steps keep covering ground in the direction of consistent progress; in this plot that means less movement in the vertical direction and more in the horizontal direction. A minimal momentum update is sketched below.
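For reference, here is a minimal sketch of the gradient-descent-with-momentum update for a single weight matrix, using the course's $v_{dW}$ naming convention (an illustrative sketch, not the assignment's implementation):

```python
import numpy as np

def momentum_step(W, dW, vdW, learning_rate=0.01, beta=0.9):
    """One gradient-descent-with-momentum update.

    vdW is the running (exponentially weighted) average of past gradients:
    components that oscillate in sign cancel out, while components that
    consistently point the same way accumulate.
    """
    vdW = beta * vdW + (1 - beta) * dW
    W = W - learning_rate * vdW
    return W, vdW
```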
  9. Suppose batch gradient descent in a deep network is taking excessively long to find a value of the parameters that achieves a small value for the cost function $\mathcal{J}(W^{[1]}, b^{[1]}, \ldots, W^{[L]}, b^{[L]})$. Which of the following techniques could help find parameter values that attain a small value for $\mathcal{J}$? (Check all that apply.)
  • Normalize the input data.
    (Explanation: Yes. In some cases, when the scales of the features are very different, normalizing the input data speeds up the training process; a minimal normalization sketch follows after this question.)
  • Try better random initialization for the weights.
    (Explanation: Yes. As seen in previous lectures, this can help the gradient descent process avoid vanishing gradients.)
  • Add more data to the training set.
  • Try using gradient descent with momentum.
    (Explanation: Yes. The use of momentum can improve the speed of training, although other methods, such as Adam, might give better results.)
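As referenced in the first option above, here is a minimal sketch of input normalization, assuming the inputs are stored with examples as columns (shape $(n_x, m)$); the training-set mean and variance are reused for the test set so both are scaled consistently.

```python
import numpy as np

def normalize_inputs(X_train, X_test, eps=1e-8):
    """Normalize each feature to zero mean and unit variance.

    X_train and X_test have shape (n_x, m) with examples as columns.
    The statistics are computed on the training set only and then
    applied unchanged to the test set.
    """
    mu = np.mean(X_train, axis=1, keepdims=True)
    sigma2 = np.var(X_train, axis=1, keepdims=True)
    X_train_norm = (X_train - mu) / np.sqrt(sigma2 + eps)
    X_test_norm = (X_test - mu) / np.sqrt(sigma2 + eps)
    return X_train_norm, X_test_norm
```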
  10. Which of the following are true about Adam?
  • Adam can only be used with batch gradient descent and not with mini-batch gradient descent.
  • The most important hyperparameter in Adam is $\epsilon$ and it should be carefully tuned.
  • Adam combines the advantages of RMSProp and momentum.
  • Adam automatically tunes the hyperparameter $\alpha$.
    (Explanation: Precisely. Adam combines the features of RMSProp and momentum, which is why it uses the two parameters $\beta_1$ and $\beta_2$ in addition to $\epsilon$. A minimal Adam update is sketched below.)
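To see how the two pieces fit together, here is a minimal sketch of one Adam update for a single parameter, with the momentum-style average ($\beta_1$), the RMSProp-style average of squared gradients ($\beta_2$), bias correction, and $\epsilon$ for numerical stability; the default hyperparameter values shown are just the commonly used ones.

```python
import numpy as np

def adam_step(W, dW, vdW, sdW, t, learning_rate=0.001,
              beta1=0.9, beta2=0.999, epsilon=1e-8):
    """One Adam update for parameter W at iteration t (t starts at 1).

    vdW: momentum-style running average of the gradients (beta1 term).
    sdW: RMSProp-style running average of the squared gradients (beta2 term).
    """
    vdW = beta1 * vdW + (1 - beta1) * dW
    sdW = beta2 * sdW + (1 - beta2) * (dW ** 2)
    vdW_corr = vdW / (1 - beta1 ** t)   # bias correction
    sdW_corr = sdW / (1 - beta2 ** t)
    W = W - learning_rate * vdW_corr / (np.sqrt(sdW_corr) + epsilon)
    return W, vdW, sdW
```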
