[Reinforcement Learning] Solving the Taxi Problem with TD Learning

Problem description:

To illustrate the problem, imagine the agent is a taxi-driving program. There are four locations; the agent has to pick up a passenger at one location and drop them off at another. A successful drop-off earns +20 points, and every driving step costs -1 point. The agent also loses 10 points for picking up or dropping off a passenger at an illegal location. The goal is to learn to pick up and drop off passengers at the correct locations in as few steps as possible, without picking up illegal passengers.

In the rendered environment, (R, G, Y, B) mark the different pick-up/drop-off locations, and the small rectangle is the taxi.
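Before diving into the algorithms, a minimal sketch (assuming the standard Gym Taxi environment used below) that creates the environment and inspects its state and action spaces:

# minimal sketch: create the Taxi environment and inspect it
import gym

env = gym.make("Taxi-v1")       # registered as "Taxi-v3" in newer Gym releases

# discrete state and action spaces of the Taxi grid
print(env.observation_space)    # number of discrete states
print(env.action_space)         # 6 actions: south, north, east, west, pickup, dropoff

env.reset()
env.render()                    # shows the grid, the R/G/Y/B locations and the taxi
env.close()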

        

Approach:

1. Off-policy learning algorithm: Q-learning

Algorithm steps

1. First, we initialize the Q function to some arbitrary values

2. We select an action from the current state using the epsilon-greedy policy (\epsilon > 0) and move to the new state

3. We update the Q value of the previous state by following the update rule (a short numeric example appears after these steps):

        Q(s,a) = Q(s,a) + \alpha ( r + \gamma \max_{a'} Q(s',a') - Q(s,a) )

4. We repeat steps 2 and 3 until we reach the terminal state
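As a quick sanity check of the rule, here is one update worked through by hand, using the same alpha and gamma as the script below; the Q values and reward are purely illustrative:

# one hand-worked Q-learning update with illustrative numbers
alpha, gamma = 0.4, 0.99
q_sa = 0.0            # current Q(s, a)
reward = -1           # a single driving step costs -1
max_q_next = 2.0      # assumed max over Q(s', a') in the next state
q_sa += alpha * (reward + gamma * max_q_next - q_sa)
print(q_sa)           # 0.4 * (-1 + 0.99 * 2.0 - 0.0) = 0.392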

Implementation

# off-policy learning algorithm: Q-learning
import gym
import random

env = gym.make("Taxi-v1")  # registered as "Taxi-v3" in newer Gym releases
# env.render()

alpha = 0.4
gamma = 0.99
epsilon = 0.017
# initialize the Q table: a dictionary that stores a value for every (state, action) pair
q={}
for s in range(env.observation_space.n):
    for a in range(env.action_space.n):
        q[(s,a)] = 0.0

# update the Q table with the Q-learning update rule:
# take the maximum Q value over all actions in the next state (qa)
# and move Q(prev_state, action) towards reward + gamma * qa
def update_q_table(prev_state, action, reward, nextstate, alpha, gamma):
    qa = max([q[(nextstate, a)] for a in range(env.action_space.n)])
    q[(prev_state, action)] += alpha * (reward + gamma * qa - q[(prev_state, action)])


# epsilon-greedy policy:
# draw a uniform random number; if it is below epsilon we explore (random action),
# otherwise we exploit the action with the maximum Q value in this state
def epsilon_greedy_policy(state,epsilon):
    if random.uniform(0,1) < epsilon:
        return env.action_space.sample()
    else:
        return max(list(range(env.action_space.n)),key=lambda x:q[(state,x)])



# for each episode, perform Q-learning
for i in range(8000):
    r = 0
    #first we initialize the env
    prev_state = env.reset()
    
    
    while True:
        env.render()

        # in each state we select action by epsilon greedy policy 
        action = epsilon_greedy_policy(prev_state,epsilon)
        
        #then we take the selected action and move to next state
        nextstate,reward,done,_ = env.step(action)

        #update the q value using update_q_table function
        update_q_table(prev_state,action,reward,nextstate,alpha,gamma)

        #update the prev state as next state
        prev_state = nextstate

        #store the rewards in r
        r += reward
        
        # done is True when we reach the terminal state of the episode;
        # in that case break out of the loop and start the next episode
        if done:
            break
    print("total reward",r)
    #total reward 9

env.close()

Run screenshot

2. On-policy learning algorithm: SARSA

Algorithm steps

1. First, we initialize the Q values to some arbitrary values

2. We select an action using the epsilon-greedy policy (\epsilon > 0) and move from one state to another

3. We update the Q value of the previous state by following the update rule:

        Q(s,a) = Q(s,a) + \alpha ( r + \gamma Q(s',a') - Q(s,a) )

   where a' is the action selected in s' by the epsilon-greedy policy (\epsilon > 0)

4. We repeat steps 2 and 3 until we reach the terminal state

Implementation

# on-policy learning algorithm: SARSA
import gym
import random

env = gym.make("Taxi-v1")

alpha = 0.85
gamma = 0.90
epsilon = 0.8
Q={}
for s in range(env.observation_space.n):
    for a in range(env.action_space.n):
        Q[(s,a)] = 0.0

# epsilon-greedy action selection: explore with probability epsilon, otherwise exploit
def epsilon_greedy(state,epsilon):
    if random.uniform(0,1) <epsilon:
        return env.action_space.sample()
    else:
        return max(list(range(env.action_space.n)),key= lambda x: Q[(state,x)])



for i in range(4000):
    #store cumulative reward of each episode in r
    r = 0

    # for each episode, initialize the state
    state = env.reset()

    #pick the action using epsilon greedy policy
    action = epsilon_greedy(state,epsilon)
    while True:
        env.render()
        nextstate,reward,done,_ = env.step(action)
        #pick the next action 
        nextaction =epsilon_greedy(nextstate,epsilon)

        # update the Q value of the current state-action pair with the SARSA rule
        Q[(state, action)] += alpha * (reward + gamma * Q[(nextstate, nextaction)] - Q[(state, action)])

        action = nextaction
        state = nextstate
        r+= reward

        if done:
            break
    print("total reward",r)

env.close()

Run screenshot

Comparison of the two methods:

In Q-learning, we select actions with the epsilon-greedy policy, but when we update the Q value we simply use the action with the maximum Q value in the next state.

In SARSA, we also select actions with the epsilon-greedy policy, and when we update the Q value we use the action that the epsilon-greedy policy actually selects in the next state.
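A minimal sketch of that difference, reusing names from the code above (Q, nextstate, reward, gamma, epsilon and the epsilon_greedy helper); only the bootstrap target of the update changes:

# both methods update Q(s,a) += alpha * (target - Q(s,a)); only the target differs

# Q-learning (off-policy): bootstrap with the best action in the next state
target = reward + gamma * max(Q[(nextstate, a)] for a in range(env.action_space.n))

# SARSA (on-policy): bootstrap with the action the epsilon-greedy policy actually takes next
nextaction = epsilon_greedy(nextstate, epsilon)
target = reward + gamma * Q[(nextstate, nextaction)]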

Reference:

Sudharsan Ravichandiran, Hands-On Reinforcement Learning with Python: Master Reinforcement and Deep Reinforcement Learning Using OpenAI Gym and TensorFlow, Packt Publishing.
