reinforce

I am studying RL using reinforcement/reinforce.py from pytorch/examples, and I have some questions about it.

  1. What does action.reinforce(r) do internally?

  2. The REINFORCE update rule is θ ← θ + α · v_t · ∇_θ log π(a_t | s_t), where v_t is the return. This calls for gradient "ascent", but optimizer.step() performs gradient "descent". Is action.reinforce(r) multiplying the log-probability by -r? Then it would make sense.

  3. In autograd.backward(model.saved_actions, [None for _ in model.saved_actions]), what is the role of None here?
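As a numeric sanity check for question 2 (my own sketch, not from the example repo), consider a hypothetical one-parameter Bernoulli policy π(a=1) = sigmoid(θ): one gradient-ascent step on r · log π is exactly one gradient-descent step on −r · log π.

```python
import math

# Hypothetical one-parameter policy: pi(a=1) = sigmoid(theta).
# Ascending on reward * log pi takes the same step as descending on
# -reward * log pi, which is the form optimizer.step() can handle.
def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

theta = 0.0
lr, reward = 0.1, 2.0

# d/dtheta log sigmoid(theta) = 1 - sigmoid(theta)
grad_log_pi = 1.0 - sigmoid(theta)

ascent = theta + lr * reward * grad_log_pi      # gradient ascent on  r * log pi
descent = theta - lr * (-reward * grad_log_pi)  # gradient descent on -r * log pi
assert abs(ascent - descent) < 1e-12
```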

  1. It finds the .creator of the output and calls this method. Basically, it just saves the reward in the .reward attribute of the creator function. Then, when the backward method is called, the StochasticFunction class will discard the grad_output it received and pass the saved reward to the backward method.
  2. Yes, the gradient formulas are written so that they negate the reward. You might not find reward.neg() there because they may have been slightly rewritten, but the result is still a gradient meant to be used with descent.
  3. You need to give autograd the first list, so that it can discover all the stochastic nodes you want to optimize. The second list can look like that because the stochastic functions don't need any gradients (they'll discard them anyway), so you can give them None.
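For reference, here is a hedged sketch of the same update written against the modern PyTorch API (0.4 and later removed stochastic Functions and Tensor.reinforce()): you keep the log-probability explicitly and negate the reward yourself, exactly as answer 2 describes. The two-action policy and the reward value are placeholders, not code from the example repo.

```python
import torch
from torch.distributions import Categorical

# Hypothetical 2-action policy; these parameters stand in for a policy network.
logits = torch.zeros(2, requires_grad=True)
optimizer = torch.optim.SGD([logits], lr=0.1)

dist = Categorical(logits=logits)
action = dist.sample()
reward = 1.0  # stand-in for the return v_t

# Old API (removed): action.reinforce(reward), then
# autograd.backward(model.saved_actions, [None for _ in model.saved_actions]).
# Modern form: descend on -reward * log pi(a), i.e. ascend on reward * log pi(a).
loss = -reward * dist.log_prob(action)

optimizer.zero_grad()
loss.backward()
optimizer.step()
```

After optimizer.step(), logits has moved in the direction that makes the sampled action more probable when the reward is positive, which is exactly the REINFORCE ascent step.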
