I am studying RL with reinforcement/reinforce.py in pytorch/examples. I have some questions about it.
- What does `action.reinforce(r)` internally do?
- Below is the REINFORCE update rule, where v_t is the return:

  theta <- theta + alpha * v_t * grad_theta(log pi_theta(a_t | s_t))

  We need to do gradient "ascent" as above, but if we use `optimizer.step()`, it is gradient "descent". Is `action.reinforce(r)` multiplying the log probability by `-r`? Then it makes sense.
- In `autograd.backward(model.saved_actions, [None for _ in model.saved_actions])`, what is the role of `None` here?
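The ascent-vs-descent sign question can be sanity-checked numerically without torch. Below is a minimal sketch, assuming a hypothetical one-parameter Bernoulli policy pi(a=1) = theta (the function name `d_log_prob` and all values are illustrative):

```python
def d_log_prob(theta, action):
    # d/dtheta log pi(action; theta) for a Bernoulli(theta) policy:
    # log pi(1) = log(theta), log pi(0) = log(1 - theta)
    return 1.0 / theta if action == 1 else -1.0 / (1.0 - theta)

theta, lr, action, reward = 0.5, 0.1, 1, 2.0

# REINFORCE gradient *ascent* on reward * log pi:
theta_ascent = theta + lr * reward * d_log_prob(theta, action)

# Gradient *descent* on the negated objective -reward * log pi,
# which is the convention optimizer.step() expects:
grad_of_loss = -reward * d_log_prob(theta, action)
theta_descent = theta - lr * grad_of_loss

print(theta_ascent, theta_descent)  # both updates land on the same theta
```

Multiplying the log-probability gradient by `-r` turns the ascent step into an ordinary descent step, which is exactly the equivalence asked about above.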
- It finds the `.creator` of the output and calls this method. Basically, it just saves the reward in the `.reward` attribute of the creator function. Then, when the `backward` method is called, the `StochasticFunction` class will discard the `grad_output` it received and pass the saved reward to the `backward` method.
- Yes, the gradient formulas are written in such a way that they negate the reward. You might not find `reward.neg()` there because they might have been slightly rewritten, but it's still a gradient to be used with descent.
- You need to give autograd the first list so that it can discover all the stochastic nodes you want to optimize. The second list can look like that because the stochastic functions don't need any gradients (they'll discard them anyway), so you can give them `None`.