Reinforcement Learning: The Difference Between Q-learning and Sarsa
A casual write-up of my own impressions from learning Q-learning and Sarsa.
Preface
Since coming back to Chengdu from my studies in Beijing, I can't say the summer taught me much "textbook" knowledge, but it genuinely broadened my horizons as an undergraduate and made my goals much clearer. Before this, my understanding of reinforcement learning was limited to the rough flow of the algorithms: I knew the what, but not the why. Even now it strikes me as very similar to genetic algorithms; could one actually contain the other? (I'll need to keep studying to find out.)
This informal post mainly describes the characteristics of the Q-learning algorithm, and how it relates to and differs from Sarsa; I hope it is of some help. The example code is adapted from Morvan Python's programs:
https://github.com/MorvanZhou/Reinforcement-learning-with-tensorflow
A simple Q-learning example
The Q-learning algorithm flow (a worked one-step update follows this list):
1. Build the qtable; set the learning rate alpha, the greedy factor greedy (the probability of exploiting rather than exploring), and the discount factor gamma; initialize the state state.
2. Choose an action based on the current state (greedy matters mainly here: the occasional random choice helps the algorithm escape local optima).
3. Execute the chosen action to get the new state state_ and the reward reward.
4. From the qtable take q_old = qtable[state, action] and compute q_new = reward + gamma * qtable[state_, :].max().
5. Update the qtable: qtable[state, action] += alpha * (q_new - q_old).
6. state = state_.
7. Repeat steps 2-6.
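Before the full program, here is a minimal sketch of steps 4-5 in isolation, on a toy q-table. All the states, values, and the transition below are made up, purely for illustration:

import pandas as pd
import numpy as np

# Toy q-table: 3 states x 2 actions, one seeded value for illustration
qtable = pd.DataFrame(np.zeros((3, 2)), columns=['left', 'right'])
qtable.loc[1, 'right'] = 0.5
alpha, gamma = 0.1, 0.9

# Suppose we are in state 0, take 'right', and land in state_ 1 with reward 0
state, action, state_, reward = 0, 'right', 1, 0
q_old = qtable.loc[state, action]                      # step 4: current estimate
q_new = reward + gamma * qtable.loc[state_, :].max()   # step 4: TD target = 0.45
qtable.loc[state, action] += alpha * (q_new - q_old)   # step 5: entry becomes 0.045
print(qtable)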
# Q-learning algorithm
import pandas as pd
import numpy as np
import time

class Rl:
    def __init__(self):
        self.n_state = 6        # states 0..5; the treasure T sits to the right of state 5
        self.n_epoch = 15       # number of training episodes
        self.action = ['left', 'right']
        self.alpha = 0.1        # learning rate
        self.greedy = 0.9       # probability of acting greedily (epsilon-greedy)
        self.gamma = 0.9        # discount factor
        self.fresh = 0.1        # seconds between screen refreshes
        # np.random.seed(0)
    def build_qtable(self):
        # q-table: one row per state, one column per action, initialized to zero
        q_table = pd.DataFrame(np.zeros((self.n_state, len(self.action))), columns=self.action)
        return q_table
    def chose_action(self, state, q_table):
        # epsilon-greedy: act randomly with probability 1 - greedy,
        # or when this state's row is still all zeros
        action = q_table.iloc[state, :]
        if np.random.rand() > self.greedy or (action == 0).all():
            idx_name = np.random.choice(action.index)
        else:
            idx_name = action.idxmax()
        return idx_name
    def get_env_feedback(self, state, action):
        # 1-D corridor: moving right from the last state reaches T (reward 1);
        # every other move gives reward 0, and 'left' at state 0 stays put
if action == 'right':
if state == self.n_state-1:
reward = 1
state_ = 'terminal'
else:
reward = 0
state_ = state + 1
else:
if state == 0:
state_ = 0
reward = 0
else:
state_ = state - 1
reward = 0
return state_,reward
    def update_env(self, state, episode, n_step):
        env = ['-'] * self.n_state + ['T']
        if state == 'terminal':
            result = 'Episode {0}: reached T in {1} steps'.format(episode, n_step)
            print(result)
        else:
            env[state] = 'o'
            print('\r' + ''.join(env), end='')
            time.sleep(self.fresh)
    def train(self):
        q_table = self.build_qtable()
        for episode in range(self.n_epoch):
            n_step = 0
            state = 0
            is_complete = False
            self.update_env(state, episode, n_step)
            while not is_complete:
                action = self.chose_action(state, q_table)
                state_, reward = self.get_env_feedback(state, action)
                q_predict = q_table.loc[state, action]
                if state_ != 'terminal':
                    # Q-learning target: bootstrap off the best value at state_
                    q_real = reward + self.gamma * q_table.iloc[state_, :].max()
                else:
                    q_real = reward
                    is_complete = True
                q_table.loc[state, action] += self.alpha * (q_real - q_predict)
                n_step += 1
                state = state_
                self.update_env(state, episode, n_step)
        return q_table
rl = Rl()
q_table = rl.train()
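A small usage sketch: train() returns the learned q-table, so the greedy policy can be read straight off it (idxmax picks the highest-valued action in each row):

print(q_table)
print(q_table.idxmax(axis=1))   # greedy action for each state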
A simple Sarsa example
The Sarsa algorithm flow (a worked one-step update follows this list):
1. Build the qtable; set the learning rate alpha, the greedy factor greedy, and the discount factor gamma; initialize the state state.
2. Choose an action based on the current state (as before, greedy helps the algorithm escape local optima).
3. Execute the chosen action to get the new state state_ and the reward reward.
4. Choose the next action action_ at state_; then take q_old = qtable[state, action] and compute q_new = reward + gamma * qtable[state_, action_].
5. Update the qtable: qtable[state, action] += alpha * (q_new - q_old).
6. state = state_, action = action_: the action executed in the next iteration is exactly the action_ just chosen (and used in the update) in this iteration.
7. Repeat steps 3-6 (step 2 is only needed once, for the very first action).
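A minimal sketch of steps 4-6 with made-up numbers (again, everything here is hypothetical), showing that Sarsa's bootstrap uses the action actually chosen at state_:

import pandas as pd
import numpy as np

qtable = pd.DataFrame(np.zeros((3, 2)), columns=['left', 'right'])
qtable.loc[1, 'right'] = 0.5
alpha, gamma = 0.1, 0.9

# In state 0 we took 'right' and landed in state_ 1 with reward 0;
# suppose the epsilon-greedy choice at state_ happened to be 'left'
state, action, state_, action_, reward = 0, 'right', 1, 'left', 0
q_old = qtable.loc[state, action]
q_new = reward + gamma * qtable.loc[state_, action_]   # uses 'left': 0.0, not the max 0.5
qtable.loc[state, action] += alpha * (q_new - q_old)   # step 5
state, action = state_, action_                        # step 6: carry the chosen action forward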
# Sarsa algorithm
import pandas as pd
import numpy as np
import time

class Rl:
    def __init__(self):
        self.n_state = 6        # states 0..5; the treasure T sits to the right of state 5
        self.n_epoch = 20       # number of training episodes
        self.action = ['left', 'right']
        self.alpha = 0.1        # learning rate
        self.greedy = 0.9       # probability of acting greedily (epsilon-greedy)
        self.gamma = 0.9        # discount factor
        self.fresh = 0.01       # seconds between screen refreshes
        # np.random.seed(0)
    def build_qtable(self):
        # q-table: one row per state, one column per action, initialized to zero
        q_table = pd.DataFrame(np.zeros((self.n_state, len(self.action))), columns=self.action)
        return q_table
    def chose_action(self, state, q_table):
        # epsilon-greedy: act randomly with probability 1 - greedy,
        # or when this state's row is still all zeros
        action = q_table.iloc[state, :]
        if np.random.rand() > self.greedy or (action == 0).all():
            idx_name = np.random.choice(action.index)
        else:
            idx_name = action.idxmax()
        return idx_name
    def get_env_feedback(self, state, action):
        # 1-D corridor: moving right from the last state reaches T (reward 1);
        # every other move gives reward 0, and 'left' at state 0 stays put
if action == 'right':
if state == self.n_state-1:
reward = 1
state_ = 'terminal'
else:
reward = 0
state_ = state + 1
else:
if state == 0:
state_ = 0
reward = 0
else:
state_ = state - 1
reward = 0
return state_,reward
    def update_env(self, state, episode, n_step):
        env = ['-'] * self.n_state + ['T']
        if state == 'terminal':
            result = 'Episode {0}: reached T in {1} steps'.format(episode, n_step)
            print(result)
        else:
            env[state] = 'o'
            print('\r' + ''.join(env), end='')
            time.sleep(self.fresh)
    def train(self):
        q_table = self.build_qtable()
        for episode in range(self.n_epoch):
            n_step = 0
            state = 0
            is_complete = False
            # Sarsa: pick the first action up front; every later action is
            # chosen inside the loop, right after observing state_
            action = self.chose_action(state, q_table)
            self.update_env(state, episode, n_step)
            while not is_complete:
                state_, reward = self.get_env_feedback(state, action)
                q_predict = q_table.loc[state, action]
                if state_ != 'terminal':
                    # choose the next action first, then bootstrap with it
                    action_ = self.chose_action(state_, q_table)
                    q_real = reward + self.gamma * q_table.loc[state_, action_]
                else:
                    action_ = None
                    q_real = reward
                    is_complete = True
                q_table.loc[state, action] += self.alpha * (q_real - q_predict)
                n_step += 1
                state, action = state_, action_
                self.update_env(state, episode, n_step)
        return q_table
rl = Rl()
rl.train()
Q-learning execution results
Sarsa execution results
Although the output of the programs above is stochastic from run to run, the rough pattern is visible: Q-learning converges comparatively quickly, while Sarsa may need more time to converge; in exchange, Sarsa is less prone than Q-learning to "making mistakes" along the way.
Summary
Both Q-learning and Sarsa make decisions from a q-table they build up; the difference lies in how the update target is formed. Q-learning executes an action, but when computing q_new it does not necessarily use the action it will actually execute next: it bootstraps off the best entry, qtable[state_, :].max() (off-policy). Sarsa computes q_new from the very action it has chosen, and will execute, at state_ (on-policy). The sketch below makes the difference concrete.
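Here is the same (made-up) transition evaluated under both targets; all the numbers are hypothetical:

import pandas as pd

q_table = pd.DataFrame({'left': [0.0, -0.5], 'right': [0.1, 0.8]})
state_, reward, gamma = 1, 0, 0.9

# Q-learning: always bootstraps off the best action at state_
target_q = reward + gamma * q_table.loc[state_, :].max()      # 0.9 * 0.8 = 0.72
# Sarsa: bootstraps off whatever action was actually chosen at state_;
# if exploration picked the risky 'left', its low value propagates back
action_ = 'left'
target_sarsa = reward + gamma * q_table.loc[state_, action_]  # 0.9 * -0.5 = -0.45
print(target_q, target_sarsa)

This is one way to read the empirical observation above: Sarsa's target reflects the exploratory action it will really take, so it learns to stay away from risky states, at the cost of slower convergence.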
Willing to get "bruised and bloodied" for my ideals.