I am studying RL with reinforcement/reinforce.py in pytorch/examples. I have some questions about it.
- What does `action.reinforce(r)` internally do?
- Below is the REINFORCE update rule, where v_t is the return:

  theta <- theta + alpha * v_t * grad_theta(log pi_theta(a_t | s_t))

  We need to do gradient "ascent" as above, but if we use `optimizer.step()`, it is gradient "descent". Is `action.reinforce(r)` multiplying the log probability by `-r`? Then it makes sense.
- In `autograd.backward(model.saved_actions, [None for _ in model.saved_actions])`, what is the role of `None` here?
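The ascent-vs-descent sign question can be sanity-checked numerically without torch. Below is a minimal sketch, assuming a hypothetical one-parameter Bernoulli policy pi(a=1) = theta (the function name `d_log_prob` and all values are illustrative):

```python
def d_log_prob(theta, action):
    # d/dtheta log pi(action; theta) for a Bernoulli(theta) policy:
    # log pi(1) = log(theta), log pi(0) = log(1 - theta)
    return 1.0 / theta if action == 1 else -1.0 / (1.0 - theta)

theta, lr, action, reward = 0.5, 0.1, 1, 2.0

# REINFORCE gradient *ascent* on reward * log pi:
theta_ascent = theta + lr * reward * d_log_prob(theta, action)

# Gradient *descent* on the negated objective -reward * log pi,
# which is the convention optimizer.step() expects:
grad_of_loss = -reward * d_log_prob(theta, action)
theta_descent = theta - lr * grad_of_loss

print(theta_ascent, theta_descent)  # both updates land on the same theta
```

Multiplying the log-probability gradient by `-r` turns the ascent step into an ordinary descent step, which is exactly the equivalence asked about above.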
- It finds the `.creator` of the output and calls this method. Basically, it just saves the reward in the `.reward` attribute of the creator function. Then, when the `backward` method is called, the `StochasticFunction` class will discard the `grad_output` it received and pass the saved reward to the `backward` method.
- Yes, the gradient formulas are written in such a way that they negate the reward. You might not find `reward.neg()` there because they might have been slightly rewritten, but it's still a gradient to be used with descent.
- You need to give autograd the first list so that it can discover all the stochastic nodes you want to optimize. The second list can look like that because the stochastic functions don't need any gradients (they'll discard them anyway), so you can give them `None`.