Reinforcement Learning in Practice (2): Solving the Taxi Problem with Q-Learning and SARSA

1. The Taxi Problem

Problem Description

The Taxi Problem
    from "Hierarchical Reinforcement Learning with the MAXQ Value Function Decomposition"
    by Tom Dietterich

    Description:
    There are four designated locations in the grid world indicated by R(ed), B(lue), G(reen), and Y(ellow). When the episode starts, the taxi starts off at a random square and the passenger is at a random location. The taxi drives to the passenger's location, picks up the passenger, drives to the passenger's destination (another one of the four specified locations), and then drops off the passenger. Once the passenger is dropped off, the episode ends.

    Observations: 
    There are 500 discrete states since there are 25 taxi positions, 5 possible locations of the passenger (including the case when the passenger is in the taxi), and 4 destination locations. 

MAP = [
    "+---------+",
    "|R: | : :G|",
    "| : : : : |",
    "| : : : : |",
    "| | : | : |",
    "|Y| : |B: |",
    "+---------+",
]    

    Actions: 
    There are 6 discrete deterministic actions:
    - 0: move south
    - 1: move north
    - 2: move east 
    - 3: move west 
    - 4: pickup passenger
    - 5: dropoff passenger
    
    Rewards: 
    There is a reward of -1 for each action and an additional reward of +20 for delivering the passenger. There is a reward of -10 for executing actions "pickup" and "dropoff" illegally.
    
    Rendering:
    - blue: passenger
    - magenta: destination
    - yellow: empty taxi
    - green: full taxi
    - other letters (R, G, B and Y): locations for passengers and destinations

    state space is represented by:
        (taxi_row, taxi_col, passenger_location, destination)
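
Each observation is a single integer that packs this tuple. A sketch of the encoding, consistent with the 500-state count above and, to the best of my knowledge, with what gym's TaxiEnv does internally (the environment also exposes encode/decode helpers):

# 25 taxi positions x 5 passenger locations x 4 destinations = 500 states
def encode(taxi_row, taxi_col, passenger_location, destination):
    return ((taxi_row * 5 + taxi_col) * 5 + passenger_location) * 4 + destination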
        

2. Q-Learning and SARSA

A full theoretical treatment is omitted here; only the algorithm descriptions and flowcharts are summarized.

2.1 Q-Learning

2.1.1 Algorithm Description

[Figure: Q-Learning algorithm pseudocode (image not reproduced)]
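
For reference, the standard Q-Learning update rule (presumably what the figure showed) is:

Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]

where \alpha is the learning rate and \gamma the discount factor, matching the hyperparameters used in the code below.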

2.1.2 Flowchart

Start
→ Initialize the Q values randomly
→ Begin an episode and initialize the state s
→ Select an action a in state s with the ε-greedy policy
→ Execute a, transition to the new state s' and receive reward r
→ Update the Q table according to the algorithm
→ If s' is a terminal state, end the episode; otherwise set s ← s' and go back to the action-selection step

2.2 SARSA

2.2.1 Algorithm Description

[Figure: SARSA algorithm pseudocode (image not reproduced)]
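
For reference, the standard SARSA update rule (presumably what the figure showed) is:

Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma Q(s', a') - Q(s, a) \right]

where a' is the action actually selected in s' by the ε-greedy policy.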

2.2.2 Flowchart

Start
→ Initialize the Q values randomly
→ Begin an episode, initialize the state s
→ Select an action a in state s with the ε-greedy policy
→ Execute a, transition to the new state s' and receive reward r
→ Select an action a' in state s' with the ε-greedy policy
→ Update the Q table according to the algorithm
→ If s' is a terminal state, end the episode; otherwise set s ← s', a ← a' and go back to the execution step

2.3 Differences Between the Two

First, two definitions:

On-policy: the policy that generates the data is the same policy that is evaluated and improved.
Off-policy: the policy that generates the data is not the same policy that is evaluated and improved.

Q-Learning is an off-policy algorithm; SARSA is an on-policy algorithm.

This is visible in the two algorithm descriptions. In Q-Learning, the Q-table update uses the maximum Q value over all actions in s' (a greedy target), while the action actually executed at the next step is re-selected with the ε-greedy policy and has no direct connection to the update. In SARSA, the action executed at the next step is exactly the ε-greedy action a' that was used to update the Q table at the previous step, as shown in the sketch below.
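
The difference is easiest to see in the update target itself. A minimal side-by-side sketch, assuming the Q dictionary, env, gamma, alpha and the step variables (state, action, reward, nextstate, and for SARSA nextaction) are defined as in the code of Section 3:

# Q-Learning target: the greedy (max) action in s', regardless of what is executed next
td_target_qlearning = reward + gamma * max(Q[(nextstate, a)] for a in range(env.action_space.n))

# SARSA target: the ε-greedy action a' that really will be executed next
td_target_sarsa = reward + gamma * Q[(nextstate, nextaction)]

# both algorithms then apply the same incremental update:
# Q[(state, action)] += alpha * (td_target - Q[(state, action)])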

3. Code Implementation

3.1 Some Notes on the gym Environment

3.1.1 env.reset()

Resets the state of the environment and returns an initial observation.
Returns: observation (object): the initial observation of the space.

.reset() resets the environment and returns the initial state. Conceptually, the game starts over from the beginning.
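
A minimal usage sketch (with the old gym API used throughout this post, where reset() returns the observation directly; for Taxi the observation is a single integer):

import gym

env = gym.make('Taxi-v2')
state = env.reset()   # an integer in [0, 500) encoding (taxi_row, taxi_col, passenger_location, destination)
print(state)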

3.1.2 env.step()

Run one timestep of the environment's dynamics. When end of episode is reached, you are responsible for calling `reset()` to reset this environment's state.

Accepts an action and returns a tuple (observation, reward, done, info).

Args:
       action (object): an action provided by the agent
        
Returns:
       observation (object): agent's observation of the current environment
       reward (float) : amount of reward returned after previous action
       done (boolean): whether the episode has ended, in which case further step() calls will return undefined results
       info (dict): contains auxiliary diagnostic information (helpful for debugging, and sometimes learning)

.step() executes one action in the environment and returns the tuple (observation, reward, done, info).

Meaning of the tuple elements:
observation (object): the state after the agent executes action a, i.e., the "next state s'".
reward (float): the reward the agent receives for executing action a.
done (boolean): whether the episode has ended, i.e., whether s' is a terminal state (True if it is, False otherwise).
info (dict): auxiliary diagnostic information (useful for debugging, and sometimes for learning); usually not needed.
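
Continuing the sketch above, a random agent that consumes the returned tuple until the episode ends (old gym API, 4-tuple return):

state = env.reset()
done = False
total_reward = 0
while not done:
    action = env.action_space.sample()                  # random action, just to exercise the API
    nextstate, reward, done, info = env.step(action)    # observation, reward, done flag, debug info
    total_reward += reward
    state = nextstate
print(total_reward)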

3.2 Q-Learning

import gym
import random

env = gym.make('Taxi-v2')

# learning rate
alpha = 0.5
# discount factor
gamma = 0.9
# ε for the ε-greedy policy
epsilon = 0.05

# initialize the Q table
Q = {}
for s in range(env.observation_space.n):
    for a in range(env.action_space.n):
        Q[(s, a)] = 0.0


# update the Q table
def update_q_table(prev_state, action, reward, nextstate, alpha, gamma):
    # max Q(s', a') over all actions a'
    qa = max([Q[(nextstate, a)] for a in range(env.action_space.n)])
    # Q-Learning update of the Q value
    Q[(prev_state, action)] += alpha * (reward + gamma * qa - Q[(prev_state, action)])


# ε-greedy action selection
def epsilon_greedy_policy(state, epsilon):
    # with probability ε, take a random action (exploration)
    if random.uniform(0, 1) < epsilon:
        return env.action_space.sample()
    # otherwise, take the action with the highest Q value in the current state (exploitation)
    else:
        return max(range(env.action_space.n), key=lambda x: Q[(state, x)])


# train for 1000 episodes
for i in range(1000):
    r = 0
    # initialize the state (env.reset() resets the environment)
    state = env.reset()
    # one episode
    while True:
        # render the current state of the agent and environment (optional)
        # env.render()
        # select an action with the ε-greedy policy
        action = epsilon_greedy_policy(state, epsilon)
        # execute the action and observe the outcome
        nextstate, reward, done, _ = env.step(action)
        # update the Q table
        update_q_table(state, action, reward, nextstate, alpha, gamma)
        # s ← s'
        state = nextstate
        # accumulate the reward
        r += reward
        # check whether the episode has reached a terminal state
        if done:
            break
    # print the total reward of this episode
    print("[Episode %d] Total reward: %d" % (i + 1, r))
env.close()
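
To sanity-check the learned Q table, one can roll out a purely greedy episode. This is a sketch that is not part of the original script; it assumes the Q dictionary above is in scope and should be run just before env.close():

state = env.reset()
done = False
steps = 0
total_reward = 0
# follow the greedy policy derived from the learned Q table, with a step cap as a safety net
while not done and steps < 100:
    action = max(range(env.action_space.n), key=lambda a: Q[(state, a)])
    state, reward, done, _ = env.step(action)
    total_reward += reward
    steps += 1
print("Greedy rollout reward:", total_reward)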

3.3 SARSA

import gym
import random

env = gym.make('Taxi-v2')

# learning rate
alpha = 0.5
# discount factor
gamma = 0.9
# ε for the ε-greedy policy
epsilon = 0.05

# initialize the Q table
Q = {}
for s in range(env.observation_space.n):
    for a in range(env.action_space.n):
        Q[(s, a)] = 0.0


# ε-greedy action selection
def epsilon_greedy_policy(state, epsilon):
    # with probability ε, take a random action (exploration)
    if random.uniform(0, 1) < epsilon:
        return env.action_space.sample()
    # otherwise, take the action with the highest Q value in the current state (exploitation)
    else:
        return max(range(env.action_space.n), key=lambda x: Q[(state, x)])


# train for 1000 episodes
for i in range(1000):
    r = 0
    # initialize the state (env.reset() resets the environment)
    state = env.reset()
    # select the first action with the ε-greedy policy
    action = epsilon_greedy_policy(state, epsilon)
    # one episode
    while True:
        # render the current state of the agent and environment (optional; commented out to match the Q-Learning script and keep 1000 training episodes fast)
        # env.render()
        # execute the action and observe the outcome
        nextstate, reward, done, _ = env.step(action)
        # select the next action with the ε-greedy policy
        nextaction = epsilon_greedy_policy(nextstate, epsilon)
        # SARSA update of the Q table
        Q[(state, action)] += alpha * (reward + gamma * Q[(nextstate, nextaction)] - Q[(state, action)])
        # a ← a'
        action = nextaction
        # s ← s'
        state = nextstate
        # accumulate the reward
        r += reward
        # check whether the episode has reached a terminal state
        if done:
            break
    # print the total reward of this episode
    print("[Episode %d] Total reward: %d" % (i + 1, r))
env.close()
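
Both scripts keep ε fixed at 0.05. A common refinement, not in the original code, is to decay ε across episodes so that early episodes explore more and later ones mostly exploit, for example inside the training loop:

# hypothetical decay schedule: start near 0.5 and anneal towards 0.01 over the episodes
epsilon = max(0.01, 0.5 * (0.99 ** i))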

4. References

[1] OpenAI Gym. http://gym.openai.com/
[2] Sutton R S, Barto A G. Reinforcement Learning: An Introduction (2nd ed.) [M]. 2018.
[3] Ravichandiran S. Hands-On Reinforcement Learning with Python: Master reinforcement and deep reinforcement learning using OpenAI Gym and TensorFlow [M]. 2018.

Comments and corrections are welcome!
