- TD Learning (Temporal Difference Learning)
Previous post on Monte Carlo methods: https://blog.csdn.net/weixin_43909872/article/details/85873569
Monte Carlo methods need a complete episode before they can evaluate anything, but in many situations we cannot obtain a complete episode chain. In those cases we can use TD learning, i.e. online temporal-difference learning, which bootstraps its updates from incomplete sequences.
The concrete algorithm is as follows:
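Since the algorithm image is not reproduced here, for reference the standard tabular TD(0) prediction update is:

V(S_t) \leftarrow V(S_t) + \alpha \left[ R_{t+1} + \gamma V(S_{t+1}) - V(S_t) \right]

where \alpha is the step size, \gamma the discount factor, and the bracketed term is the TD error: unlike MC, the target R_{t+1} + \gamma V(S_{t+1}) is available after a single step.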
David Silver's slides have three diagrams that contrast MC, TD, and DP very clearly: MC backs up the complete sampled return, TD backs up one sampled step and bootstraps from the current value estimate, and DP performs a full-width one-step backup over all successors.
- Sarsa
Algorithm: the essential part is the update of Q(S,A) in the middle.
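For reference, the Sarsa update that this step performs is:

Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \left[ R_{t+1} + \gamma Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t) \right]

where A_{t+1} is the next action actually drawn from the current policy.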
For our gridworld example (x is the start position, T the terminal state):
o o o o o o o o o o
o o o o o o o o o o
o o o o o o o o o o
x o o o o o o T o o
o o o o o o o o o o
o o o o o o o o o o
o o o o o o o o o o
Main code:
# Pick the next action from the same behavior policy (on-policy)
next_action_probs = policy(next_state)
next_action = np.random.choice(np.arange(len(next_action_probs)), p=next_action_probs)

# Update statistics
stats.episode_rewards[i_episode] += reward
stats.episode_lengths[i_episode] = t

# TD update: bootstrap from the action actually selected in next_state,
# then move Q(S,A) toward the target
td_target = reward + discount_factor * Q[next_state][next_action]
td_delta = td_target - Q[state][action]
Q[state][action] += alpha * td_delta
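The policy used above is not shown in the snippet; in this style of implementation it is typically an epsilon-greedy policy derived from Q. A minimal sketch, assuming Q maps each state to a numpy array of action values (the helper name make_epsilon_greedy_policy is illustrative, not taken from the post):

import numpy as np

def make_epsilon_greedy_policy(Q, epsilon, nA):
    # Returns a function mapping a state to a length-nA vector of action probabilities.
    def policy_fn(state):
        probs = np.ones(nA, dtype=float) * epsilon / nA  # spread epsilon uniformly
        best_action = np.argmax(Q[state])                # current greedy action
        probs[best_action] += 1.0 - epsilon              # give it the remaining mass
        return probs
    return policy_fn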
Results:
- Q-Learning
The algorithm is as follows; the difference is that when updating Q(S,A) it uses the best Q(S',a) over all actions, rather than the value of the next action actually selected.
This is why Q-Learning is off-policy while Sarsa is on-policy: the update target follows the greedy policy even though behavior follows an exploratory one.
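For comparison with the Sarsa update above, the standard Q-Learning update is:

Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \left[ R_{t+1} + \gamma \max_a Q(S_{t+1}, a) - Q(S_t, A_t) \right]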
- Double-Q-Learning: a variant that keeps two estimates Q1 and Q2, letting one select the maximizing action and the other evaluate it, which reduces the maximization bias of the Q-Learning target.
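A minimal sketch of one Double-Q-Learning update step (van Hasselt, 2010): a coin flip picks which table to update, and the other table supplies the evaluation. state, action, next_state, reward, alpha, and discount_factor are assumed to come from a surrounding loop like the ones in this post:

import numpy as np

if np.random.rand() < 0.5:
    best = np.argmax(Q1[next_state])                              # select with Q1 ...
    td_target = reward + discount_factor * Q2[next_state][best]  # ... evaluate with Q2
    Q1[state][action] += alpha * (td_target - Q1[state][action])
else:
    best = np.argmax(Q2[next_state])                              # select with Q2 ...
    td_target = reward + discount_factor * Q1[next_state][best]  # ... evaluate with Q1
    Q2[state][action] += alpha * (td_target - Q2[state][action])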
Main Q-Learning code:
# Take a step with the behavior policy
action_probs = policy(state)
action = np.random.choice(np.arange(len(action_probs)), p=action_probs)
next_state, reward, done, _ = env.step(action)

# Update statistics
stats.episode_rewards[i_episode] += reward
stats.episode_lengths[i_episode] = t

# TD update: bootstrap from the greedy action in next_state (off-policy),
# then move Q(S,A) toward the target
best_next_action = np.argmax(Q[next_state])
td_target = reward + discount_factor * Q[next_state][best_next_action]
td_delta = td_target - Q[state][action]
Q[state][action] += alpha * td_delta
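For context, here is a minimal sketch of the episode loop such a fragment sits in, assuming a gym-style env with the old 4-tuple step API (as in the snippet above), the make_epsilon_greedy_policy helper sketched earlier, and hyperparameters num_episodes, alpha, and discount_factor:

import itertools
from collections import defaultdict
import numpy as np

Q = defaultdict(lambda: np.zeros(env.action_space.n))
policy = make_epsilon_greedy_policy(Q, epsilon=0.1, nA=env.action_space.n)

for i_episode in range(num_episodes):
    state = env.reset()
    for t in itertools.count():
        # Behave epsilon-greedily, but update toward the greedy target
        action_probs = policy(state)
        action = np.random.choice(np.arange(len(action_probs)), p=action_probs)
        next_state, reward, done, _ = env.step(action)

        best_next_action = np.argmax(Q[next_state])
        td_target = reward + discount_factor * Q[next_state][best_next_action]
        Q[state][action] += alpha * (td_target - Q[state][action])

        if done:
            break
        state = next_state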
Results:
Full code:
https://github.com/Neo-47/Reinforcement-Learning-Algorithms/tree/master/TD