Reinforcement Learning - MDPs - TD Learning (Sarsa & Q-Learning)

  1. TD Learning - Temporal Difference Learning

The Monte Carlo methods covered earlier: https://blog.csdn.net/weixin_43909872/article/details/85873569
Monte Carlo methods need complete episodes before they can evaluate a policy, but in many cases we cannot obtain full episode trajectories. In those situations we can use TD learning (online temporal-difference learning) instead, which updates its value estimates step by step by bootstrapping from the current estimates.

The algorithm is as follows:
[Figure: TD(0) algorithm]
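In case the figure does not render, here is a minimal sketch of tabular TD(0) prediction, assuming a gym-style environment env (old step API returning four values, matching the snippets below), a policy(state) function that returns an action, and illustrative hyper-parameter names alpha and discount_factor:

from collections import defaultdict

def td0_prediction(env, policy, num_episodes, alpha=0.5, discount_factor=1.0):
    # State-value estimates; unseen states default to 0
    V = defaultdict(float)
    for _ in range(num_episodes):
        state = env.reset()
        done = False
        while not done:
            action = policy(state)
            next_state, reward, done, _ = env.step(action)
            # TD(0) update: bootstrap from the current estimate of the next state,
            # so no complete episode is required before learning starts
            V[state] += alpha * (reward + discount_factor * V[next_state] - V[state])
            state = next_state
    return V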
David Silver's slides contain three figures that clearly contrast the differences between MC, TD, and DP:
[Figures: MC, TD, and DP comparison from David Silver's slides]

  2. Sarsa

Algorithm: the key part is the update of Q(S, A) in the middle of the loop.
[Figure: Sarsa algorithm]
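For reference, the standard Sarsa update (step size α, discount factor γ), where A' is the action actually chosen in S' by the same ε-greedy policy:

Q(S, A) ← Q(S, A) + α [ R + γ Q(S', A') − Q(S, A) ]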
For our gridworld example (x marks the agent's current position, T the terminal goal):
o o o o o o o o o o
o o o o o o o o o o
o o o o o o o o o o
x o o o o o o T o o
o o o o o o o o o o
o o o o o o o o o o
o o o o o o o o o o

Main code:

            # Pick the next action (Sarsa is on-policy: A' is drawn from the same epsilon-greedy policy)
            next_action_probs = policy(next_state)
            next_action = np.random.choice(np.arange(len(next_action_probs)), p=next_action_probs)

            # Update statistics
            stats.episode_rewards[i_episode] += reward
            stats.episode_lengths[i_episode] = t

            # TD Update: target uses the Q-value of the action actually selected next
            td_target = reward + discount_factor * Q[next_state][next_action]
            td_delta = td_target - Q[state][action]
            Q[state][action] += alpha * td_delta
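The snippet above relies on names defined elsewhere in the full code: Q (an action-value table), alpha (the step size), stats, and policy, an ε-greedy policy over Q. A minimal sketch of such a policy helper, assuming Q maps a state to a NumPy array of action values (the helper name is illustrative, in the style of the linked repository):

import numpy as np

def make_epsilon_greedy_policy(Q, epsilon, nA):
    # Returns a function mapping a state to epsilon-greedy action probabilities
    def policy_fn(state):
        # Spread probability epsilon uniformly over all nA actions ...
        probs = np.ones(nA, dtype=float) * epsilon / nA
        # ... and put the remaining 1 - epsilon mass on the current greedy action
        best_action = np.argmax(Q[state])
        probs[best_action] += 1.0 - epsilon
        return probs
    return policy_fn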

Results:
[Figures: Sarsa training results on the gridworld]

  3. Q-Learning

The algorithm is shown below. The difference is that when updating Q(S, A), it uses the best Q(S', a) over all actions in the next state, rather than the Q-value of the action that will actually be taken next.
This is why Q-Learning is off-policy, while Sarsa is on-policy.
[Figure: Q-Learning algorithm]
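Compared with the Sarsa update above, the target uses a max over the next state's actions instead of the next action actually taken:

Q(S, A) ← Q(S, A) + α [ R + γ max_a Q(S', a) − Q(S, A) ]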

Double-Q-Learning
[Figure: Double Q-Learning algorithm]
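Since only the figure is referenced, here is a minimal sketch of the Double Q-Learning update for a single transition, assuming two tabular estimates Q1 and Q2 with the same layout as Q above; the function name and signature are illustrative, not code from the linked repository:

import numpy as np

def double_q_update(Q1, Q2, state, action, reward, next_state, alpha, discount_factor):
    # With probability 0.5 update Q1 (evaluated by Q2), otherwise the reverse
    if np.random.rand() < 0.5:
        # Q1 selects the greedy next action, Q2 provides its value
        best_next_action = np.argmax(Q1[next_state])
        td_target = reward + discount_factor * Q2[next_state][best_next_action]
        Q1[state][action] += alpha * (td_target - Q1[state][action])
    else:
        # Roles swapped: Q2 selects, Q1 evaluates
        best_next_action = np.argmax(Q2[next_state])
        td_target = reward + discount_factor * Q1[next_state][best_next_action]
        Q2[state][action] += alpha * (td_target - Q2[state][action])

Decoupling action selection from action evaluation in this way reduces the overestimation (maximization bias) of plain Q-Learning; actions are typically chosen ε-greedily with respect to Q1 + Q2.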

Main code for Q-Learning:

            # Take a step using the epsilon-greedy behavior policy
            action_probs = policy(state)
            action = np.random.choice(np.arange(len(action_probs)), p=action_probs)
            next_state, reward, done, _ = env.step(action)

            # Update statistics
            stats.episode_rewards[i_episode] += reward
            stats.episode_lengths[i_episode] = t

            # TD Update: bootstrap from the greedy (best) action in the next state
            best_next_action = np.argmax(Q[next_state])
            td_target = reward + discount_factor * Q[next_state][best_next_action]
            td_delta = td_target - Q[state][action]
            Q[state][action] += alpha * td_delta
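Both snippets are the body of the inner loop of a training function. Roughly, the surrounding scaffolding they assume looks like the sketch below, using the make_epsilon_greedy_policy helper sketched earlier; the names EpisodeStats and q_learning are illustrative, and the actual code is in the repository linked at the end:

import itertools
from collections import defaultdict, namedtuple
import numpy as np

EpisodeStats = namedtuple("EpisodeStats", ["episode_lengths", "episode_rewards"])

def q_learning(env, num_episodes, discount_factor=1.0, alpha=0.5, epsilon=0.1):
    # One array of action values per state; unseen states start at zero
    Q = defaultdict(lambda: np.zeros(env.action_space.n))
    stats = EpisodeStats(episode_lengths=np.zeros(num_episodes),
                         episode_rewards=np.zeros(num_episodes))
    # Behavior policy: epsilon-greedy with respect to the current Q
    policy = make_epsilon_greedy_policy(Q, epsilon, env.action_space.n)

    for i_episode in range(num_episodes):
        state = env.reset()
        for t in itertools.count():
            # Inner-loop body shown above (take a step, update stats, TD update)
            action_probs = policy(state)
            action = np.random.choice(np.arange(len(action_probs)), p=action_probs)
            next_state, reward, done, _ = env.step(action)

            stats.episode_rewards[i_episode] += reward
            stats.episode_lengths[i_episode] = t

            best_next_action = np.argmax(Q[next_state])
            td_target = reward + discount_factor * Q[next_state][best_next_action]
            Q[state][action] += alpha * (td_target - Q[state][action])

            if done:
                break
            state = next_state

    return Q, stats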

Results:
[Figures: Q-Learning training results]

Full code:
https://github.com/Neo-47/Reinforcement-Learning-Algorithms/tree/master/TD
