In the last post, Overview of RL, we saw two different methodologies: Policy Gradient, which aims at training a policy (the Actor), and Q-Learning, which aims at training a state-action value function (the Critic).
We start this post by providing some insight into the intuition behind Actor-Critic. I was inspired by this nice post: https://towardsdatascience.com/introduction-to-actor-critic-7642bdb2b3d2.
AC = Policy Gradient + Q-Learning
I assume you all know what Policy Gradient is. Concisely, we want to learn a policy (the Actor): we play the game, say, 1000 times and record the total reward of each episode. Averaging these totals gives an estimate of the expected reward, and the aim is to maximize it by gradient ascent, as in the sketch below.
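To make this concrete, here is a minimal REINFORCE-style sketch of that loop. It is an illustration under assumptions, not the exact setup of this post: it presumes PyTorch and gymnasium are installed, uses CartPole-v1 as a stand-in for "the game", and the network architecture and learning rate are arbitrary choices.

```python
# Minimal REINFORCE sketch: play episodes, weight log-probabilities by the
# episode's total reward, and ascend the gradient of the expected reward.
# Assumptions: PyTorch + gymnasium installed; CartPole-v1 as the example game.
import gymnasium as gym
import torch
import torch.nn as nn

env = gym.make("CartPole-v1")
# Policy network: maps the 4-dim observation to logits over the 2 actions.
policy = nn.Sequential(nn.Linear(4, 64), nn.Tanh(), nn.Linear(64, 2))
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-2)

for episode in range(1000):          # "play the game 1000 times"
    obs, _ = env.reset()
    log_probs, rewards = [], []
    done = False
    while not done:
        logits = policy(torch.as_tensor(obs, dtype=torch.float32))
        dist = torch.distributions.Categorical(logits=logits)
        action = dist.sample()
        log_probs.append(dist.log_prob(action))
        obs, reward, terminated, truncated, _ = env.step(action.item())
        rewards.append(reward)
        done = terminated or truncated

    total_reward = sum(rewards)      # total reward of this episode
    # Gradient ascent on the expected reward: minimize the negative of
    # sum_t log pi(a_t | s_t) * (total episode reward).
    loss = -torch.stack(log_probs).sum() * total_reward
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

Remember this equation: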