Introduction to Reinforcement Learning 2: The SARSA Algorithm

I_belong_to_jesus

Article link: https://blog.csdn.net/fangfanglovezhou/article/details/122682593

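SARSA (State-Action-Reward-State-Action) is a reinforcement learning algorithm. Like Q-Learning, it learns iteratively while the agent interacts with the environment, but it updates the Q-function with a different rule. A SARSA implementation looks like this: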

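# Env, epsilon_greedy, Q, ALPHA, GAMMA and MAX_STEP are not defined here; see the Q-Learning article.
for i in range(200):  # 200 training episodes
    e = Env()
    # Choose the first action before entering the loop.
    action = epsilon_greedy(Q, e.present_state)
    while (e.is_end is False) and (e.step < MAX_STEP):
        state = e.present_state
        reward = e.interact(action)  # execute the action chosen earlier
        new_state = e.present_state
        # Choose the next action now: it enters the update and is executed in the next iteration.
        new_action = epsilon_greedy(Q, e.present_state)
        Q[state, action] = (1 - ALPHA) * Q[state, action] + \
            ALPHA * (reward + GAMMA * Q[new_state, new_action])
        action = new_action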
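Written as an equation, the update performed inside the loop is shown here. The symbols are shorthand introduced for this note: s and a stand for state and action, r for reward, s' and a' for new_state and new_action, α for ALPHA and γ for GAMMA. The second line is only an algebraic rearrangement of the first.

```latex
\begin{aligned}
Q(s, a) &\leftarrow (1 - \alpha)\, Q(s, a) + \alpha \bigl[ r + \gamma\, Q(s', a') \bigr] \\
        &= Q(s, a) + \alpha \bigl[ r + \gamma\, Q(s', a') - Q(s, a) \bigr]
\end{aligned}
```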

For comparison, here is the Q-function update in Q-Learning:

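# Env, epsilon_greedy, Q, ALPHA, GAMMA and MAX_STEP are not defined here; see the Q-Learning article.
for i in range(200):  # 200 training episodes
    e = Env()
    while (e.is_end is False) and (e.step < MAX_STEP):
        # The action is chosen inside the loop, right before it is executed.
        action = epsilon_greedy(Q, e.present_state)
        state = e.present_state
        reward = e.interact(action)
        new_state = e.present_state
        # The target uses the maximum Q-value in new_state, not the action actually chosen next.
        Q[state, action] = (1 - ALPHA) * Q[state, action] + \
            ALPHA * (reward + GAMMA * Q[new_state, :].max())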
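To make the difference between the two updates concrete, here is a minimal sketch that applies both rules to a single transition. The Q-table, the transition, the values of ALPHA and GAMMA, and the helper names sarsa_target and q_learning_target are all made up for illustration; they do not come from the original code.

```python
import numpy as np

ALPHA = 0.1   # learning rate (illustrative value, an assumption)
GAMMA = 0.9   # discount factor (illustrative value, an assumption)


def sarsa_target(Q, reward, new_state, new_action):
    """Target built from the action actually chosen in new_state."""
    return reward + GAMMA * Q[new_state, new_action]


def q_learning_target(Q, reward, new_state):
    """Target built from the highest-valued action in new_state."""
    return reward + GAMMA * Q[new_state, :].max()


# Hypothetical Q-table with 2 states (rows) and 2 actions (columns).
Q = np.array([[0.0, 0.0],
              [1.0, 5.0]])

# One transition: in state 0 take action 0, receive reward 1, land in state 1,
# where epsilon-greedy happened to explore and picked action 0.
state, action, reward, new_state, new_action = 0, 0, 1.0, 1, 0

sarsa_t = sarsa_target(Q, reward, new_state, new_action)  # 1 + 0.9 * Q[1, 0]
q_t = q_learning_target(Q, reward, new_state)             # 1 + 0.9 * max(Q[1])

# Same blending step in both algorithms; only the target differs.
sarsa_q = (1 - ALPHA) * Q[state, action] + ALPHA * sarsa_t
qlearn_q = (1 - ALPHA) * Q[state, action] + ALPHA * q_t
print(sarsa_t, q_t, sarsa_q, qlearn_q)
```

When the exploratory action in new_state is not the greedy one, as in this made-up transition, the SARSA target is lower than the Q-Learning target. When the chosen action happens to be the greedy one, the two targets coincide.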

For the full code, see the Q-Learning article. The SARSA update proceeds as follows:

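1) Record the current state in state;

2) Execute the action chosen in the previous step (it was also chosen with epsilon_greedy: with some probability a random action, otherwise the action with the highest Q-value) to obtain the reward reward and the new state new_state reached after the action;

3) In the new state new_state, use epsilon_greedy to choose the next action new_action;

4) Update Q[state, action] toward reward + GAMMA * Q[new_state, new_action], weighted by the learning rate ALPHA;

5) Set action = new_action, which will be executed in the next iteration of the loop.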
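The five steps can be tried end to end with a small self-contained sketch. The original Env and epsilon_greedy live in the Q-Learning article and are not shown here, so the versions below are hypothetical stand-ins: a five-cell corridor that pays reward 1 at the rightmost cell, an epsilon-greedy rule with random tie-breaking, and illustrative values for ALPHA, GAMMA, EPSILON and MAX_STEP.

```python
import numpy as np

# Illustrative hyperparameters (assumptions, not taken from the original article).
ALPHA, GAMMA, EPSILON, MAX_STEP = 0.1, 0.9, 0.1, 100
N_STATES, N_ACTIONS = 5, 2   # corridor cells; action 0 = left, 1 = right

rng = np.random.default_rng(0)


class Env:
    """Hypothetical corridor: start in cell 0, reward 1 on reaching the last cell."""

    def __init__(self):
        self.present_state = 0
        self.step = 0
        self.is_end = False

    def interact(self, action):
        move = 1 if action == 1 else -1
        # Clamp the position so the agent cannot leave the corridor.
        self.present_state = min(max(self.present_state + move, 0), N_STATES - 1)
        self.step += 1
        self.is_end = self.present_state == N_STATES - 1
        return 1.0 if self.is_end else 0.0


def epsilon_greedy(Q, state):
    """Explore with probability EPSILON, otherwise pick a highest-valued action."""
    if rng.random() < EPSILON:
        return int(rng.integers(N_ACTIONS))
    best = np.flatnonzero(Q[state] == Q[state].max())  # break ties at random
    return int(rng.choice(best))


def train(episodes=200):
    Q = np.zeros((N_STATES, N_ACTIONS))
    for _ in range(episodes):
        e = Env()
        action = epsilon_greedy(Q, e.present_state)
        while not e.is_end and e.step < MAX_STEP:
            state = e.present_state                          # step 1
            reward = e.interact(action)                      # step 2
            new_state = e.present_state
            new_action = epsilon_greedy(Q, new_state)        # step 3
            Q[state, action] = (1 - ALPHA) * Q[state, action] + \
                ALPHA * (reward + GAMMA * Q[new_state, new_action])  # step 4
            action = new_action                              # step 5
    return Q


Q = train()
greedy_policy = Q.argmax(axis=1)   # greedy action per cell after training
print(Q)
print(greedy_policy)
```

In this sketch the row of Q for the terminal cell stays at zero because the loop exits before any update starts from it, so the target for the final move reduces to the reward alone. On such a corridor one would expect the greedy policy to favour moving right in the non-terminal cells, although that depends on the seed and the number of episodes.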

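As the steps show, the agent starts from a state (State, S), executes an action (Action, A), receives a reward (Reward, R) and a new state (S), and then selects a new action (A) in that new state. The Q-function is updated using the new state and the new action, hence the name: S-A-R-S-A -> SARSA.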
