【RL】Q-Learning & SARSA (State-Action-Reward-State-Action)

Q-Learning

Abstract

Q-Learning is one of the most important algorithms in reinforcement learning; many modern reinforcement learning algorithms are refinements of it. The core of Q-Learning is the Q-table, which stores an estimated action value Q(s, a) for every state-action pair.
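
As a minimal sketch of what such a table looks like (the states and numbers below are purely illustrative, for a grid world with four actions), each state maps to one estimated value per action:

# Hypothetical Q-table: state -> [Q(s, up), Q(s, down), Q(s, left), Q(s, right)]
# All states and values are made up for illustration.
q_table = {
    (0, 0): [0.00, 0.12, 0.00, 0.05],  # from (0, 0), "down" currently looks best
    (0, 1): [0.03, 0.00, 0.01, 0.20],  # from (0, 1), "right" currently looks best
}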

Mathematical Principles

$$ Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_t + \gamma \max_{a} Q(s_{t+1}, a) - Q(s_t, a_t) \right] $$
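
To make the update concrete, suppose (purely illustrative numbers) α = 0.01, γ = 0.9, the current estimate Q(s_t, a_t) = 0.5, the observed reward r_t = 1, and the best next-state value max_a Q(s_{t+1}, a) = 0.8. One application of the rule gives:

$$ Q(s_t, a_t) \leftarrow 0.5 + 0.01 \left[ 1 + 0.9 \times 0.8 - 0.5 \right] = 0.5 + 0.01 \times 1.22 = 0.5122 $$

Each step nudges the estimate a small fraction (α) of the way toward the TD target.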

Core Code

# Sample <s, a, r, s'>
    def learn(self, state, action, reward, next_state):
        current_q = self.q_table[state][action]
        # Update the Q-table: the TD target uses the maximum Q-value of the next state
        new_q = reward + self.discount_factor * max(self.q_table[next_state])
        self.q_table[state][action] += self.learning_rate * (new_q - current_q)

ε-Greedy Strategy

An agent that always chooses the next action greedily from Q(s, a) is very limited. The ε-greedy strategy is used to balance exploiting past experience and exploring new actions: with probability ε a random action is taken, otherwise the action with the highest Q-value is chosen.

# Select an action from the Q-table
    def get_action(self, state):
        if np.random.rand() < self.epsilon:
            # With probability epsilon, explore: take a random action
            action = np.random.choice(self.actions)
        else:
            # Otherwise exploit: choose the best action from the Q-table
            state_action = self.q_table[state]
            action = self.arg_max(state_action)
        return action

Full Q-Learning Code

import numpy as np
import random
from collections import defaultdict


class QLearningAgent:
    def __init__(self, actions):
        # The four actions are represented as indices: [0, 1, 2, 3]
        self.actions = actions
        self.learning_rate = 0.01
        self.discount_factor = 0.9
        # epsilon for the epsilon-greedy strategy
        self.epsilon = 0.1
        self.q_table = defaultdict(lambda: [0.0, 0.0, 0.0, 0.0])

    # Sample <s, a, r, s'>
    def learn(self, state, action, reward, next_state):
        current_q = self.q_table[state][action]
        # Update the Q-table: the TD target uses the maximum Q-value of the next state
        new_q = reward + self.discount_factor * max(self.q_table[next_state])
        self.q_table[state][action] += self.learning_rate * (new_q - current_q)

    # Select an action from the Q-table
    def get_action(self, state):
        if np.random.rand() < self.epsilon:
            # With probability epsilon, explore: take a random action
            action = np.random.choice(self.actions)
        else:
            # Otherwise exploit: choose the best action from the Q-table
            state_action = self.q_table[state]
            action = self.arg_max(state_action)
        return action

    @staticmethod
    def arg_max(state_action):
        # Return the index of the largest Q-value, breaking ties at random
        max_index_list = []
        max_value = state_action[0]
        for index, value in enumerate(state_action):
            if value > max_value:
                max_index_list.clear()
                max_value = value
                max_index_list.append(index)
            elif value == max_value:
                max_index_list.append(index)
        return random.choice(max_index_list)
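
A training loop for this agent might look roughly like the sketch below. The env object is hypothetical: it is assumed to expose a grid-world-style interface where reset() returns an initial state and step(action) returns (next_state, reward, done). States are converted to strings only so they can serve as dictionary keys.

# Minimal training-loop sketch; env is a hypothetical grid-world environment.
agent = QLearningAgent(actions=list(range(4)))

for episode in range(1000):
    state = env.reset()
    done = False
    while not done:
        action = agent.get_action(str(state))
        next_state, reward, done = env.step(action)
        # Off-policy update: the TD target maximizes over next-state actions
        agent.learn(str(state), action, reward, str(next_state))
        state = next_state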

SARSA (State-Action-Reward-State-Action)

Abstract

SARSA is very similar to Q-Learning, but with one key difference: it is on-policy. The TD target uses the Q-value of the action a_{t+1} that the agent actually takes in the next state (chosen by the same ε-greedy policy) instead of the greedy maximum, so exploratory moves feed back into the update. In practice, an agent driven by SARSA therefore behaves more "timidly" (more conservatively) than one driven by Q-Learning.

Mathematical Principles

$$ Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_t + \gamma\, Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t) \right] $$

Core Code

# Sample <s, a, r, a', s'>
    def learn(self, state, action, reward, next_action, next_state):
        current_q = self.q_table[state][action]
        # Update the Q-table: the TD target uses the Q-value of the action actually taken next
        new_q = reward + self.discount_factor * self.q_table[next_state][next_action]
        self.q_table[state][action] += self.learning_rate * (new_q - current_q)
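
Because SARSA is on-policy, the next action must be selected by the same ε-greedy policy before the update is applied. A corresponding training-loop sketch (again with the hypothetical env from above, and assuming a SARSAAgent class identical to QLearningAgent except for the learn method shown here) could look like this:

# Minimal SARSA training-loop sketch; env and SARSAAgent are assumed as described above.
agent = SARSAAgent(actions=list(range(4)))

for episode in range(1000):
    state = env.reset()
    action = agent.get_action(str(state))
    done = False
    while not done:
        next_state, reward, done = env.step(action)
        # The next action is chosen by the same epsilon-greedy policy...
        next_action = agent.get_action(str(next_state))
        # ...and that actual action is used in the update (on-policy)
        agent.learn(str(state), action, reward, next_action, str(next_state))
        state, action = next_state, next_action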

Author (of this markdown)

SaleJuice
