Problem description:
To illustrate the problem, suppose the agent is a taxi program. There are four locations; the agent must pick up a passenger at one location and drop them off at another. A successful drop-off is rewarded with +20 points, and every step taken costs -1 point. The agent also loses 10 points for picking up or dropping off a passenger at an illegal location. The agent's goal is to learn, in as little time as possible, to pick up and drop off passengers at the correct locations without making illegal pickups.
In the environment, R, G, Y, and B mark the different pickup and drop-off locations, and the small rectangle represents the taxi.
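The listing below is a minimal sketch for inspecting this environment before training, not part of the solution itself; it assumes the Gym Taxi environment is available (registered as "Taxi-v1" in the gym version used here, and as "Taxi-v3" in newer gym releases).
# Minimal sketch: inspect the Taxi environment before training
import gym

env = gym.make("Taxi-v1")            # "Taxi-v3" on newer gym versions
print(env.observation_space.n)       # number of discrete states (500)
print(env.action_space.n)            # 6 actions: south, north, east, west, pickup, dropoff
state = env.reset()
env.render()                         # grid showing R, G, Y, B and the taxi rectangle
state, reward, done, info = env.step(env.action_space.sample())
print(reward)                        # -1 per step, -10 for an illegal pickup/dropoff, +20 for success
env.close()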
Solution approach:
1. Off-policy learning algorithm: Q-learning
Algorithm steps
1. First, we initialize the Q function to some arbitrary values.
2. We take an action from a state using the epsilon-greedy policy (ε) and move to the new state.
3. We update the Q value of the previous state by following the update rule: Q(s, a) ← Q(s, a) + α (r + γ max_a' Q(s', a') - Q(s, a)).
4. We repeat steps 2 and 3 until we reach the terminal state.
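As a worked illustration of step 3 (the numbers here are purely hypothetical, chosen to match the hyperparameters used in the code below):
# Hypothetical single Q-learning update, values chosen only for illustration
alpha, gamma = 0.4, 0.99
q_sa = 0.0           # current estimate of Q(s, a)
reward = -1          # reward received after taking action a in state s
max_q_next = 5.0     # max over a' of Q(s', a') in the next state
q_sa += alpha * (reward + gamma * max_q_next - q_sa)
print(q_sa)          # 0.4 * (-1 + 0.99 * 5.0 - 0.0) = 1.58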
Implementation code
# Off-policy learning algorithm: Q-learning
import random
import gym

env = gym.make("Taxi-v1")  # in newer gym versions the environment is named "Taxi-v3"
# env.render()
alpha = 0.4      # learning rate
gamma = 0.99     # discount factor
epsilon = 0.017  # exploration rate

# Initialize the Q table: a dictionary that stores the state-action value for each (state, action) pair
q = {}
for s in range(env.observation_space.n):
    for a in range(env.action_space.n):
        q[(s, a)] = 0.0

# Update the Q table via the Q-learning update rule:
# take the maximum value over all actions in the next state and store it in qa
def update_q_table(prev_state, action, reward, nextstate, alpha, gamma):
    qa = max([q[(nextstate, a)] for a in range(env.action_space.n)])
    q[(prev_state, action)] += alpha * (reward + gamma * qa - q[(prev_state, action)])

# Epsilon-greedy policy:
# we draw a random number from a uniform distribution; if it is less than epsilon,
# we explore a random action in the state, otherwise we exploit the action with the maximum Q value
def epsilon_greedy_policy(state, epsilon):
    if random.uniform(0, 1) < epsilon:
        return env.action_space.sample()
    else:
        return max(list(range(env.action_space.n)), key=lambda x: q[(state, x)])

# Perform Q-learning for each episode
for i in range(8000):
    r = 0
    # First we reset the environment and get the initial state
    prev_state = env.reset()
    while True:
        env.render()
        # In each state we select an action with the epsilon-greedy policy
        action = epsilon_greedy_policy(prev_state, epsilon)
        # Take the selected action and move to the next state
        nextstate, reward, done, _ = env.step(action)
        # Update the Q value using the update_q_table function
        update_q_table(prev_state, action, reward, nextstate, alpha, gamma)
        # The next state becomes the previous state for the following step
        prev_state = nextstate
        # Accumulate the rewards in r
        r += reward
        # done is True when we reach the terminal state of the episode;
        # in that case break the loop and start the next episode
        if done:
            break
    print("total reward", r)
    # example output: total reward 9
env.close()
Run screenshot:
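After training, one way to check what the agent has learned is to run an extra episode greedily (no exploration) from the learned Q table. This sketch is not part of the original program; it assumes the q dictionary from the training loop above is still in memory.
# Sketch: evaluate the learned Q table with a purely greedy policy
eval_env = gym.make("Taxi-v1")
state = eval_env.reset()
total = 0
while True:
    # always pick the action with the highest learned Q value
    action = max(range(eval_env.action_space.n), key=lambda a: q[(state, a)])
    state, reward, done, _ = eval_env.step(action)
    total += reward
    if done:
        break
print("greedy episode reward", total)
eval_env.close()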
2. On-policy learning algorithm: SARSA
Algorithm steps
1. First, we initialize the Q values to some arbitrary values.
2. We select an action by the epsilon-greedy policy (ε) and move from one state to another.
3. We update the Q value of the previous state by following the update rule Q(s, a) ← Q(s, a) + α (r + γ Q(s', a') - Q(s, a)), where a' is the action selected by the epsilon-greedy policy (ε) in the next state.
4. We repeat steps 2 and 3 until we reach the terminal state.
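For comparison with the Q-learning example above, here is the same kind of hypothetical transition updated with the SARSA rule; the key difference is that the target uses the Q value of the next action a' that the epsilon-greedy policy actually picks, which need not be the maximum:
# Hypothetical single SARSA update, values chosen only for illustration
alpha, gamma = 0.85, 0.90
q_sa = 0.0           # current estimate of Q(s, a)
reward = -1          # reward received after taking action a in state s
q_next = 2.0         # Q(s', a') for the action a' actually selected by epsilon-greedy,
                     # which may be lower than max_a Q(s', a) used by Q-learning
q_sa += alpha * (reward + gamma * q_next - q_sa)
print(q_sa)          # 0.85 * (-1 + 0.90 * 2.0 - 0.0) = 0.68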
Implementation code
# On-policy learning algorithm: SARSA
import random
import gym

env = gym.make("Taxi-v1")  # in newer gym versions the environment is named "Taxi-v3"
alpha = 0.85    # learning rate
gamma = 0.90    # discount factor
epsilon = 0.8   # exploration rate

# Initialize the Q table
Q = {}
for s in range(env.observation_space.n):
    for a in range(env.action_space.n):
        Q[(s, a)] = 0.0

# Epsilon-greedy policy for exploration
def epsilon_greedy(state, epsilon):
    if random.uniform(0, 1) < epsilon:
        return env.action_space.sample()
    else:
        return max(list(range(env.action_space.n)), key=lambda x: Q[(state, x)])

for i in range(4000):
    # Store the cumulative reward of each episode in r
    r = 0
    # For each episode, reset the environment and get the initial state
    state = env.reset()
    # Pick the first action using the epsilon-greedy policy
    action = epsilon_greedy(state, epsilon)
    while True:
        env.render()
        nextstate, reward, done, _ = env.step(action)
        # Pick the next action with the same epsilon-greedy policy
        nextaction = epsilon_greedy(nextstate, epsilon)
        # Update the Q value of the previous state-action pair using the SARSA rule
        Q[(state, action)] += alpha * (reward + gamma * Q[(nextstate, nextaction)] - Q[(state, action)])
        action = nextaction
        state = nextstate
        r += reward
        if done:
            break
    print("total reward", r)
env.close()
Run screenshot:
Comparison of the two methods:
In Q-learning, we select actions with the epsilon-greedy policy, but when updating the Q value we simply use the action with the maximum Q value in the next state, regardless of which action is actually taken next.
In SARSA, we also select actions with the epsilon-greedy policy, but when updating the Q value we use the Q value of the action that the epsilon-greedy policy actually selects in the next state.
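A minimal sketch of that difference, reusing the variable names from the two programs above, shows that only the bootstrap target changes:
# Q-learning target: greedy value of the next state, independent of the next action taken
target_q_learning = reward + gamma * max(q[(nextstate, a)] for a in range(env.action_space.n))
# SARSA target: value of the next action actually chosen by the epsilon-greedy policy
target_sarsa = reward + gamma * Q[(nextstate, nextaction)]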
References:
Sudharsan Ravichandiran, Hands-On Reinforcement Learning with Python: Master Reinforcement and Deep Reinforcement Learning Using OpenAI Gym and TensorFlow, Packt Publishing, 2018.