Q-Learning in Practice: Finding a Room

Introduction

The example comes from A Painless Q-learning Tutorial.

Put simply: starting from some room, find a path to the target room.
(Figure: room layout from the tutorial)

Code Implementation

import numpy as np
from tqdm import trange

room_num = 6
room_paths = [(0, 4), (3, 4), (3, 1), (1, 5), (2, 3), (4, 5)]
target_room = 5

# Q matrix, initialized to 0
Q = np.zeros((room_num, room_num))
# R (reward) matrix, initialized to -1 (no door between the rooms)
reward = np.full((room_num, room_num), -1)
# rooms connected by a door get reward 0
for room_path in room_paths:
    if room_path[1] == target_room:
        reward[room_path[0]][room_path[1]] = 100  # moving into the target room gives reward 100
    else:
        reward[room_path[0]][room_path[1]] = 0

    # doors can be used in both directions
    if room_path[0] == target_room:
        reward[room_path[1]][room_path[0]] = 100  # moving into the target room gives reward 100
    else:
        reward[room_path[1]][room_path[0]] = 0

reward[target_room][target_room] = 100  # staying in the target room also gives reward 100
print("reward:")
print(reward)

max_epoch = 2000
gamma = 0.8  # discount factor
modes = ['one-path', 'one-step']
# one-path: keep walking (and updating) until the target room is reached
# one-step: take a single step per episode
mode = modes[1]

def one_step(current_state, Q, new_Q, reward, gamma):
    # randomly choose one of the feasible actions (entries with reward >= 0)
    p_action = (reward[current_state] >= 0).astype(int) / np.sum(reward[current_state] >= 0)
    current_action = np.random.choice(room_num, p=p_action)
    # update the Q matrix
    new_Q[current_state][current_action] = reward[current_state][current_action] + gamma * np.max(Q[current_action])
    new_state = current_action
    return new_state, new_Q

for epoch in trange(max_epoch):
    new_Q = Q.copy()
    current_state = np.random.randint(0, room_num)

    if mode == 'one-step':
        _, new_Q = one_step(current_state, Q, new_Q, reward, gamma)
    else:
        while current_state != target_room:
            current_state, new_Q = one_step(current_state, Q, new_Q, reward, gamma)
    Q = new_Q

print("Q:")
print(Q.round())
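
For reference, the reward matrix printed above should come out like this (rows are the current room, columns the next room; -1 marks pairs of rooms with no door between them), matching the R matrix in the tutorial:

[[ -1  -1  -1  -1   0  -1]
 [ -1  -1  -1   0  -1 100]
 [ -1  -1  -1   0  -1  -1]
 [ -1   0   0  -1   0  -1]
 [  0  -1  -1   0  -1 100]
 [ -1   0  -1  -1   0 100]]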

There are two update strategies here (both apply the same per-step update, spelled out after the list):

  1. one-path: keep walking and update at every step, until the target room is reached
  2. one-step: take only one step, update, then start over from a new random state
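
In both modes every step applies the update on the new_Q[current_state][current_action] = ... line above, which is the deterministic form of the Q-learning rule used in the tutorial: Q(state, action) = R(state, action) + gamma * max(Q(next_state, :)). For example, with Q still all zeros, stepping from room 1 through the door into room 5 sets Q[1][5] = 100 + 0.8 * max(Q[5]) = 100.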

Results

Reference result

(Figure: reference result, from A Painless Q-learning Tutorial)

one-path

Q:
[[  0.   0.   0.   0.  80.   0.]
 [  0.   0.   0.  64.   0. 100.]
 [  0.   0.   0.  64.   0.   0.]
 [  0.  80.  51.   0.  80.   0.]
 [ 64.   0.   0.  64.   0. 100.]
 [  0.   0.   0.   0.   0.   0.]]

This differs from the reference result because self-loops on the target room are not considered and the walk ends as soon as the target room is reached, so the Q values in the target room's row are never updated; a small variant that also fills in that row is sketched below.
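
A minimal sketch of that variant (not in the original code), assuming the rest of the script stays as above: in one-path mode, let each episode perform one last update from the target room before ending, so that row 5 of Q is also learned. The while loop in the else branch would become:

while True:
    at_target = current_state == target_room
    current_state, new_Q = one_step(current_state, Q, new_Q, reward, gamma)
    if at_target:
        # one extra update has just been made from the target room; stop here
        break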

one-step

Q:
[[  0.   0.   0.   0. 400.   0.]
 [  0.   0.   0. 320.   0. 500.]
 [  0.   0.   0. 320.   0.   0.]
 [  0. 400. 256.   0. 400.   0.]
 [320.   0.   0. 320.   0. 500.]
 [  0. 400.   0.   0. 400. 500.]]

This time the result matches the reference: since only a single step is taken at a time, it does not matter which room the walk starts or ends in, so every room (including the target) gets its row updated and the final result agrees with the reference.
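
As a quick usage check, the learned Q matrix can be turned into an actual route by always moving to the room with the largest Q value. A minimal sketch (the helper name extract_path is made up for illustration, not part of the original post):

def extract_path(Q, start, target, max_steps=room_num):
    # greedily follow the largest Q value until the target room is reached
    path = [start]
    state = start
    while state != target and len(path) <= max_steps:
        state = int(np.argmax(Q[state]))
        path.append(state)
    return path

print(extract_path(Q, start=2, target=target_room))  # e.g. [2, 3, 1, 5]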
