Sarsa is another reinforcement learning algorithm. It is labeled on-policy, i.e., it is a "does what it says" algorithm. Below is the pseudocode for Q-learning and Sarsa (the figure is from the Bilibili uploader 莫烦 (Morvan); the red annotations are mine):
As you can see, Q-learning and Sarsa are quite similar.
In my view, the main differences are the following two points:
1. The Q-table update formula: Q-learning uses maxQ(s', a'), i.e., when looking one step ahead it directly takes the largest Q value as the contribution; Sarsa uses Q(s', a'), where a' is chosen by the ε-greedy policy (90% of the time pick the action with the largest Q value, 10% of the time pick at random, which may or may not be the greedy action), so it is not necessarily maxQ(s', a'). Both update rules are written out right after this list.
2. The action a for the next loop iteration: in Sarsa it is exactly the a' that was used when updating Q for the previous S and a, whereas Q-learning re-selects this iteration's a with the ε-greedy policy. In other words, when Q-learning updates the previous Q entry, it does not assume that the action used in the target is the action actually taken in the next state, but Sarsa does, which is why it is the "does what it says" algorithm.
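Written out, the two update rules are as follows (in the code below, α corresponds to ALPHA and γ to GAMMA; the max in Q-learning is taken over the next state's actions):

Q-learning:  Q(s, a) <- Q(s, a) + α * [R + γ * maxQ(s', a') - Q(s, a)]
Sarsa:       Q(s, a) <- Q(s, a) + α * [R + γ * Q(s', a') - Q(s, a)],  where a' is the action the ε-greedy policy actually takes in s'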
Below, the code from the previous article (Q-learning) is modified into the Sarsa algorithm (the modified lines are marked with ###### SARSA at the end). If anything is wrong, corrections from readers are very welcome:
import numpy as np
import pandas as pd
import time
np.random.seed(2) # reproducible
N_STATES = 6 # the length of the 1 dimensional world
ACTIONS = ['left', 'right'] # available actions
EPSILON = 0.9 # greedy policy
ALPHA = 0.1 # learning rate
GAMMA = 0.9 # discount factor
MAX_EPISODES = 13 # maximum episodes
FRESH_TIME = 0.3 # fresh time for one move
def build_q_table(n_states, actions):
    table = pd.DataFrame(
        np.zeros((n_states, len(actions))),  # q_table initial values
        columns=actions,  # actions' names
    )
    # print(table)  # show table
    return table
def choose_action(state, q_table):
    # This is how to choose an action
    state_actions = q_table.iloc[state, :]
    if (np.random.uniform() > EPSILON) or ((state_actions == 0).all()):  # act non-greedy, or this state's actions have no value yet
        # if (state_actions == 0).all():
        action_name = np.random.choice(ACTIONS)
    else:  # act greedy
        action_name = state_actions.idxmax()  # use idxmax instead of argmax, since argmax means something different in newer pandas versions
    return action_name
def get_env_feedback(S, A):
    # This is how agent will interact with the environment
    if A == 'right':  # move right
        if S == N_STATES - 2:  # terminate
            S_ = 'terminal'
            R = 1
        else:
            S_ = S + 1
            R = 0
    else:  # move left
        R = 0
        if S == 0:
            S_ = S  # reach the wall
        else:
            S_ = S - 1
    return S_, R
def update_env(S, episode, step_counter):
    # This is how the environment is updated
    env_list = ['-'] * (N_STATES - 1) + ['T']  # '-----T' our environment
    if S == 'terminal':
        interaction = 'Episode %s: total_steps = %s' % (episode + 1, step_counter)
        print('\r{}'.format(interaction), end='')
        time.sleep(2)
        print('\r                                ', end='')  # clear the line before the next episode
    else:
        env_list[S] = 'o'
        interaction = ''.join(env_list)
        print('\r{}'.format(interaction), end='')
        time.sleep(FRESH_TIME)
def rl():
    # main part of RL loop
    q_table = build_q_table(N_STATES, ACTIONS)
    for episode in range(MAX_EPISODES):
        step_counter = 0
        S = 0
        is_terminated = False
        update_env(S, episode, step_counter)
        A = choose_action(S, q_table)  ###### SARSA: choose the first action before entering the loop
        while not is_terminated:
            S_, R = get_env_feedback(S, A)  # get the next state S_ and the reward R
            q_predict = q_table.loc[S, A]
            if S_ != 'terminal':
                A_ = choose_action(S_, q_table)  ###### SARSA: use the ε-greedy policy to pick the next state's action A_ (note: based on S_, not S)
                q_target = R + GAMMA * q_table.loc[S_, A_]  ###### SARSA: bootstrap from the action that will actually be taken
            else:
                q_target = R  # next state is terminal
                is_terminated = True  # terminate this episode
            q_table.loc[S, A] += ALPHA * (q_target - q_predict)  # update
            S = S_  # move to next state
            if not is_terminated:
                A = A_  ###### SARSA: the A_ chosen above really is the action executed in the next step
            update_env(S, episode, step_counter + 1)
            step_counter += 1
    return q_table
if __name__ == "__main__":
    q_table = rl()
    print('\r\nQ-table:\n')
    print(q_table)
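For comparison, here is a minimal sketch of how the same inner loop reads with the Q-learning (off-policy) target. It is reconstructed from the description above rather than copied verbatim from the previous article, and it assumes the same surrounding rl() setup (q_table, S, is_terminated, episode, step_counter initialized as above):

# Q-learning version of the inner loop (sketch), for comparison with the Sarsa loop above
while not is_terminated:
    A = choose_action(S, q_table)  # the action to execute is re-chosen from scratch every step (ε-greedy)
    S_, R = get_env_feedback(S, A)
    q_predict = q_table.loc[S, A]
    if S_ != 'terminal':
        q_target = R + GAMMA * q_table.iloc[S_, :].max()  # greedy target; may differ from the action actually taken next
    else:
        q_target = R
        is_terminated = True
    q_table.loc[S, A] += ALPHA * (q_target - q_predict)
    S = S_
    update_env(S, episode, step_counter + 1)
    step_counter += 1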