Q-Learning & SARSA (State-Action-Reward-State-Action)
Q-Learning
Abstract
Q-Learning is one of the most important algorithms in reinforcement learning. Many modern reinforcement learning algorithms are refinements of Q-Learning. Its core data structure is the Q-table, which stores an estimated value for every state-action pair.
Mathematical Principles
$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_t + \gamma \max_{a} Q(s_{t+1}, a) - Q(s_t, a_t) \right]$$
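To make the update concrete, here is one hypothetical step using the hyperparameters from the code below ($\alpha = 0.01$, $\gamma = 0.9$) and assumed values $Q(s_t, a_t) = 0.2$, $r_t = 1$, $\max_a Q(s_{t+1}, a) = 0.5$:

$$Q(s_t, a_t) \leftarrow 0.2 + 0.01\,[\,1 + 0.9 \times 0.5 - 0.2\,] = 0.2 + 0.01 \times 1.25 = 0.2125$$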
Core Code
```python
# Sample <s, a, r, s'> and update the Q-table
def learn(self, state, action, reward, next_state):
    current_q = self.q_table[state][action]
    # TD target: reward plus discounted value of the best next action
    new_q = reward + self.discount_factor * max(self.q_table[next_state])
    self.q_table[state][action] += self.learning_rate * (new_q - current_q)
```
ε-Greedy Strategy
An agent that always picks the action with the highest Q(s, a) can get stuck exploiting early, inaccurate estimates. The ε-greedy strategy balances exploitation of experience with exploration: with probability ε the agent takes a random action, and otherwise it acts greedily on the Q-table.
```python
# Choose an action for the given state (epsilon-greedy)
def get_action(self, state):
    if np.random.rand() < self.epsilon:
        # Explore: take a random action with probability epsilon
        action = np.random.choice(self.actions)
    else:
        # Exploit: take the best-known action from the Q-table
        state_action = self.q_table[state]
        action = self.arg_max(state_action)
    return action
```
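In practice, ε is often decayed over time so the agent explores heavily at first and exploits more as the Q-table converges. A minimal sketch; the decay rate and floor are assumptions, not part of the original code:

```python
# Hypothetical epsilon decay: shrink epsilon after each episode, but keep it
# above a small floor so some exploration always remains.
EPSILON_DECAY = 0.995  # assumed decay factor
EPSILON_MIN = 0.01     # assumed exploration floor

def decay_epsilon(agent):
    agent.epsilon = max(EPSILON_MIN, agent.epsilon * EPSILON_DECAY)
```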
Complete Q-Learning Code
```python
import numpy as np
import random
from collections import defaultdict


class QLearningAgent:
    def __init__(self, actions):
        # The four actions are encoded as indices: [0, 1, 2, 3]
        self.actions = actions
        self.learning_rate = 0.01
        self.discount_factor = 0.9
        # epsilon for the epsilon-greedy strategy
        self.epsilon = 0.1
        # Q-table: maps each state to a list of Q-values, one per action
        self.q_table = defaultdict(lambda: [0.0, 0.0, 0.0, 0.0])

    # Sample <s, a, r, s'> and update the Q-table
    def learn(self, state, action, reward, next_state):
        current_q = self.q_table[state][action]
        # TD target: reward plus discounted value of the best next action
        new_q = reward + self.discount_factor * max(self.q_table[next_state])
        self.q_table[state][action] += self.learning_rate * (new_q - current_q)

    # Choose an action for the given state (epsilon-greedy)
    def get_action(self, state):
        if np.random.rand() < self.epsilon:
            # Explore: take a random action with probability epsilon
            action = np.random.choice(self.actions)
        else:
            # Exploit: take the best-known action from the Q-table
            state_action = self.q_table[state]
            action = self.arg_max(state_action)
        return action

    @staticmethod
    def arg_max(state_action):
        # Return the index of the largest Q-value, breaking ties at random
        max_index_list = []
        max_value = state_action[0]
        for index, value in enumerate(state_action):
            if value > max_value:
                max_index_list.clear()
                max_value = value
                max_index_list.append(index)
            elif value == max_value:
                max_index_list.append(index)
        return random.choice(max_index_list)
```
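The agent class is environment-agnostic. To show how it is driven, here is a minimal, hypothetical training loop; the `CorridorEnv` below is invented for illustration and is not part of the original code:

```python
# A tiny, hypothetical 1-D corridor environment: states 0..4, action 1 moves
# right, any other action moves left; reaching state 4 gives reward 1 and
# ends the episode.
class CorridorEnv:
    def reset(self):
        self.pos = 0
        return self.pos

    def step(self, action):
        self.pos = min(self.pos + 1, 4) if action == 1 else max(self.pos - 1, 0)
        done = (self.pos == 4)
        reward = 1.0 if done else 0.0
        return self.pos, reward, done


env = CorridorEnv()
agent = QLearningAgent(actions=list(range(4)))

for episode in range(1000):
    state = env.reset()
    done = False
    while not done:
        action = agent.get_action(state)
        next_state, reward, done = env.step(action)
        # Off-policy update: the target uses the best next action,
        # regardless of which action the agent takes next
        agent.learn(state, action, reward, next_state)
        state = next_state
```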
SARSA (State-Action-Reward-State-Action)
Abstract
SARSA is very similar to Q-Learning, with one key difference: Q-Learning is off-policy (its update target uses the best next action), while SARSA is on-policy (its target uses the next action the agent actually takes, exploratory moves included). Because the cost of exploration flows back into its value estimates, an agent driven by SARSA behaves more conservatively ("timidly") than one driven by Q-Learning; in the classic cliff-walking grid world, for example, SARSA learns the safer path farther from the cliff.
Mathematical Principles
$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_t + \gamma Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t) \right]$$
Core Code
```python
# Sample <s, a, r, s', a'> and update the Q-table
def learn(self, state, action, reward, next_action, next_state):
    current_q = self.q_table[state][action]
    # TD target: reward plus discounted Q-value of the action actually taken next
    new_q = reward + self.discount_factor * self.q_table[next_state][next_action]
    self.q_table[state][action] += self.learning_rate * (new_q - current_q)
```
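The training loop also differs from Q-Learning's: the next action must be selected with the same ε-greedy policy before the update, and then actually executed. A minimal sketch, reusing the hypothetical `CorridorEnv` from the Q-Learning example; `SarsaAgent` is assumed to be `QLearningAgent` with the `learn` method above:

```python
# Hypothetical on-policy training loop; SarsaAgent is an assumed variant of
# QLearningAgent that uses the SARSA learn() shown above.
env = CorridorEnv()
agent = SarsaAgent(actions=list(range(4)))

for episode in range(1000):
    state = env.reset()
    action = agent.get_action(state)  # choose the first action on-policy
    done = False
    while not done:
        next_state, reward, done = env.step(action)
        # Select the next action first, then learn from the actual <s,a,r,s',a'>
        next_action = agent.get_action(next_state)
        agent.learn(state, action, reward, next_action, next_state)
        state, action = next_state, next_action
```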
Author (of this markdown)
SaleJuice