Q Learning: Reinforcement Learning with a Q Matrix

Article link: https://blog.csdn.net/weixin_40653652/article/details/109170938

Introduction

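Consider a maze of 12 rooms, numbered 0 to 11, with room 11 as the goal. An agent is placed in an arbitrary room and learns by reinforcement learning: it interacts with the environment over and over, and the training yields the corresponding Q matrix.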
[Figure: the maze layout]
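
For reference, the training loop in the code that follows applies a simplified Q-learning update with the learning rate fixed at 1. The notation here is an assumption added for illustration: s is the current room, a is the chosen move, and s' is the room that move leads to (in this maze an action is identified with its destination room, so s' = a).

```latex
Q(s, a) \leftarrow R(s, a) + \gamma \max_{a'} Q(s', a'), \qquad \gamma = 0.8
```

R(s, a) is -1 where there is no passage, 0 for an ordinary passage, and 100 for the passage between rooms 10 and 11.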

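import numpy as np
import random

# Reward matrix R: R[i, j] is the reward for moving from room i to room j.
# -1 means there is no passage between the two rooms.
R = -np.ones((12, 12))
# A reward of 0 marks an open passage (set in both directions).
R[0, 3] = R[3, 0] = 0
R[1, 2] = R[2, 1] = 0
R[2, 5] = R[5, 2] = 0
R[6, 7] = R[7, 6] = 0
R[3, 6] = R[6, 3] = 0
R[3, 4] = R[4, 3] = 0
R[1, 4] = R[4, 1] = 0
R[7, 4] = R[4, 7] = 0
R[5, 8] = R[8, 5] = 0
R[8, 9] = R[9, 8] = 0
R[9, 10] = R[10, 9] = 0
# The passage between room 10 and the goal room 11 carries a reward of 100.
R[10, 11] = R[11, 10] = 100
γ = 0.8  # discount factor
Q = np.zeros((12, 12))
for i in range(3000):
    # Each episode starts in a random room.
    state = random.randint(0, 11)
    while True:
        # Valid actions are the rooms reachable from the current room.
        r_pos_action = []
        for action in range(12):
            if R[state, action] >= 0:
                r_pos_action.append(action)
        # Explore: pick one valid action uniformly at random.
        next_state = random.choice(r_pos_action)
        # Q-learning update with the learning rate fixed at 1.
        Q[state, next_state] = R[state, next_state] + γ * Q[next_state].max()
        state = next_state
        # The episode ends when the agent reaches room 11.
        if state == 11:
            break
print('Q matrix:')
print(Q)
# Follow the greedy policy (largest Q value in each row) from room 0 to room 11.
state = 0
path = [state]
while state != 11:
    state = int(Q[state, :].argmax())
    path.append(state)
print('Path:')
for i in path[:-1]:
    print(i,'-->',end='')
print(path[-1])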

The trained Q matrix is shown in the figure below:
[Figure: the trained Q matrix]
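
As a rough check on the matrix, the values along the route to the goal can be worked out by hand from the update rule used in the code, Q[s, a] = R[s, a] + 0.8 * max(Q[a]). The sketch below assumes training has fully converged; these are the values the rule implies, not numbers read off the figure. Rooms 10 and 11 reward each other with 100, so their two entries have to be solved together first:

```latex
\begin{align*}
Q(10,11) &= 100 + 0.8\,Q(11,10), \qquad Q(11,10) = 100 + 0.8\,Q(10,11) \\
\Rightarrow\; Q(10,11) &= 180 + 0.64\,Q(10,11) \;\Rightarrow\; Q(10,11) = 500 \\
Q(9,10) &= 0.8 \times 500 = 400 \\
Q(8,9)  &= 0.8 \times 400 = 320 \\
Q(5,8)  &= 0.8 \times 320 = 256 \\
Q(2,5)  &= 0.8 \times 256 = 204.8
\end{align*}
```

Each step away from the goal multiplies the value by 0.8, which is why following the largest entry in each row leads toward room 11.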

Path:
0 -->3 -->4 -->1 -->2 -->5 -->8 -->9 -->10 -->11
The agent's path through the maze is shown in the figure below:
[Figure: the agent's path through the maze]
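
One way to sanity-check the learned route is to compare it with a plain breadth-first search over the same passages. This is only a cross-check sketch, not part of the original method; the `passages` list is copied from the zero-or-higher entries of R in the code above, and `shortest_path` is a hypothetical helper name.

```python
from collections import deque

# Passages of the first maze, taken from the entries of R that are not -1.
passages = [(0, 3), (1, 2), (2, 5), (6, 7), (3, 6), (3, 4), (1, 4),
            (7, 4), (5, 8), (8, 9), (9, 10), (10, 11)]


def shortest_path(passages, start, goal):
    """Breadth-first search; returns one shortest list of rooms, or None."""
    neighbors = {}
    for a, b in passages:
        neighbors.setdefault(a, []).append(b)
        neighbors.setdefault(b, []).append(a)
    parent = {start: None}  # also serves as the visited set
    queue = deque([start])
    while queue:
        room = queue.popleft()
        if room == goal:
            break
        for nxt in sorted(neighbors.get(room, [])):
            if nxt not in parent:
                parent[nxt] = room
                queue.append(nxt)
    if goal not in parent:
        return None
    # Walk back from the goal to the start through the parent links.
    path = []
    room = goal
    while room is not None:
        path.append(room)
        room = parent[room]
    return path[::-1]


print(shortest_path(passages, 0, 11))
# prints [0, 3, 4, 1, 2, 5, 8, 9, 10, 11]
```

The search returns the same sequence of rooms as the greedy walk over the Q matrix, which suggests the learned policy follows the shortest route in this maze.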
Now change the maze by opening two more passages, one between rooms 4 and 5 and one between rooms 7 and 10:

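import numpy as np
import random

# Reward matrix R: R[i, j] is the reward for moving from room i to room j.
# -1 means there is no passage between the two rooms.
R = -np.ones((12, 12))
# The two new passages: 4-5 and 7-10.
R[4, 5] = R[5, 4] = 0
R[7, 10] = R[10, 7] = 0
# The remaining passages are the same as in the first maze.
R[0, 3] = R[3, 0] = 0
R[1, 2] = R[2, 1] = 0
R[2, 5] = R[5, 2] = 0
R[6, 7] = R[7, 6] = 0
R[3, 6] = R[6, 3] = 0
R[3, 4] = R[4, 3] = 0
R[1, 4] = R[4, 1] = 0
R[7, 4] = R[4, 7] = 0
R[5, 8] = R[8, 5] = 0
R[8, 9] = R[9, 8] = 0
R[9, 10] = R[10, 9] = 0
# The passage between room 10 and the goal room 11 carries a reward of 100.
R[10, 11] = R[11, 10] = 100
γ = 0.8  # discount factor
Q = np.zeros((12, 12))
for i in range(3000):
    # Each episode starts in a random room.
    state = random.randint(0, 11)
    while True:
        # Valid actions are the rooms reachable from the current room.
        r_pos_action = []
        for action in range(12):
            if R[state, action] >= 0:
                r_pos_action.append(action)
        # Explore: pick one valid action uniformly at random.
        next_state = random.choice(r_pos_action)
        # Q-learning update with the learning rate fixed at 1.
        Q[state, next_state] = R[state, next_state] + γ * Q[next_state].max()
        state = next_state
        # The episode ends when the agent reaches room 11.
        if state == 11:
            break
print('Q matrix:')
print(Q)
# Follow the greedy policy (largest Q value in each row) from room 0 to room 11.
state = 0
path = [state]
while state != 11:
    state = int(Q[state, :].argmax())
    path.append(state)
print('Path:')
for i in path[:-1]:
    print(i,'-->',end='')
print(path[-1])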
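
Since the two listings differ only in their passages, the shared logic can be pulled into functions. The following is a minimal sketch under stated assumptions: `build_reward_matrix`, `train`, and `greedy_path` are hypothetical names introduced here, the random seed is fixed only to make runs repeatable, and the `max_steps` cap is an added guard so that path extraction cannot loop forever on a poorly trained matrix.

```python
import random

import numpy as np

GOAL = 11
GAMMA = 0.8  # discount factor, as in the listings above


def build_reward_matrix(passages, goal_passage, n_rooms=12):
    """-1 = no passage, 0 = open passage, 100 = passage into the goal."""
    R = -np.ones((n_rooms, n_rooms))
    for a, b in passages:
        R[a, b] = R[b, a] = 0
    a, b = goal_passage
    R[a, b] = R[b, a] = 100  # symmetric, matching the original code
    return R


def train(R, episodes=3000, seed=0):
    """Random-exploration Q-learning with the learning rate fixed at 1."""
    rng = random.Random(seed)
    n = R.shape[0]
    Q = np.zeros((n, n))
    for _ in range(episodes):
        state = rng.randrange(n)
        while True:
            valid = [a for a in range(n) if R[state, a] >= 0]
            next_state = rng.choice(valid)
            Q[state, next_state] = R[state, next_state] + GAMMA * Q[next_state].max()
            state = next_state
            if state == GOAL:
                break
    return Q


def greedy_path(Q, start, max_steps=50):
    """Follow the largest Q value in each row until the goal or the step cap."""
    path = [start]
    while path[-1] != GOAL and len(path) <= max_steps:
        path.append(int(Q[path[-1]].argmax()))
    return path


first_maze = [(0, 3), (1, 2), (2, 5), (6, 7), (3, 6), (3, 4), (1, 4),
              (7, 4), (5, 8), (8, 9), (9, 10)]
second_maze = first_maze + [(4, 5), (7, 10)]

Q2 = train(build_reward_matrix(second_maze, (10, 11)))
path2 = greedy_path(Q2, 0)
print(' --> '.join(str(room) for room in path2))
```

With the extra passage between rooms 7 and 10, the goal is five moves from room 0, so the printed path should be much shorter than the nine-move route in the first maze.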