Reinforcement Learning: Q-Table Methods (Q-learning and Sarsa)

.XMY.

Last modified 2023-03-06 21:54:55


First published 2021-01-21 18:08:00

Article link: https://blog.csdn.net/weixin_43968987/article/details/112959287


"No need to be happy; you withered away long ago anyway." (from 月球上的人)

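This is my first post and I don't know my way around the editor yet, so the layout will be rough.
Setting everything else aside, let's get straight to the point.
This post introduces one of the simplest reinforcement learning methods: a Q-table implementation of Q-learning.
1. The reinforcement learning interaction loop:
[Figure: interaction diagram, image not preserved]
Start from the environment's point of view. The environment emits an observation (obs); the agent receives it and responds with an action. An evaluation function judges how "good" or "bad" that action is in the environment and returns a reward. The training algorithm then uses the reward, the interaction sequence and related variables to improve the agent's action-selection policy. The action also produces a new observation (next_obs), which is handed back to the agent. This continues until the goal is reached or the agent "dies".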
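The loop described above can be sketched in a few lines. This is only an illustration, not the environment used later in this post: `CorridorEnv`, `run_episode` and the reward values are hypothetical names and numbers made up for the example, and the "policy" is just a function from observation to action.

```python
class CorridorEnv:
    """Hypothetical toy environment: walk right along a corridor to its last cell."""

    def __init__(self, length=5):
        self.length = length
        self.pos = 0

    def reset(self):
        self.pos = 0
        return self.pos  # the first observation

    def step(self, action):
        # action 1 moves right, any other action moves left (clamped at cell 0)
        self.pos = max(0, self.pos + (1 if action == 1 else -1))
        done = self.pos == self.length - 1
        reward = 10 if done else -1  # made-up rewards for the sketch
        return self.pos, reward, done


def run_episode(env, policy, max_steps=1000):
    """obs -> action -> (next_obs, reward, done), repeated until the episode ends."""
    obs = env.reset()
    total, steps, done = 0, 0, False
    while not done and steps < max_steps:
        action = policy(obs)                  # the agent reacts to the observation
        obs, reward, done = env.step(action)  # the environment answers
        total += reward
        steps += 1
    return total, steps


# A fixed "always go right" policy: three -1 steps, then +10 on arrival
print(run_episode(CorridorEnv(), policy=lambda obs: 1))  # (7, 4)
```

A learning agent would replace the fixed `policy` and update itself from each `(obs, action, reward, next_obs)` tuple inside the loop.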

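2. The environment:
No existing reinforcement learning environment is used here. I wrote a simple one of my own that suits a Q-table, similar to the cliff problem (CliffWalking). It is a 7×7 grid with a configurable start and goal. For now it contains one 2×2 block of trap cells; when the "little turtle" in the environment walks into that block, the path-finding episode ends. (How do I indent this?)
[Figure: the 7×7 maze grid]
In the figure above, 1 marks a cell that ends the episode (a trap), 0 is a walkable cell, and -1 is the goal.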
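To make the layout concrete, here is a small sketch that builds the same grid and shows how a cell is turned into a single state number. The trap block (rows 1 to 2, columns 1 to 2) and the goal at (5, 5) match the values used in this post's code; the helper names `encode` and `decode` are my own and do not appear in the listings.

```python
import numpy as np

# 0 = walkable cell, 1 = trap (ends the episode), -1 = goal
maze = np.zeros((7, 7), dtype=int)
maze[1:3, 1:3] = 1   # the 2x2 trap block
maze[5, 5] = -1      # the goal


def encode(row, col, width=7):
    """Map a (row, col) cell to a state index in 0..48, i.e. a Q-table row."""
    return row * width + col


def decode(state, width=7):
    """Inverse of encode: state index back to (row, col)."""
    return divmod(state, width)


print(maze)
print(encode(5, 5))  # 40
print(decode(40))    # (5, 5)
```

With 49 states and 4 actions, the Q-table is simply a 49×4 array indexed by `encode(row, col)`.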
Here is the environment code (it's pretty clumsy):

import numpy as np

class Turtle(object):
    """A 7x7 grid world; a state is encoded as row * 7 + col."""

    def __init__(self, x, y, x_, y_):
        self.signal = 0                        # becomes 1 once the episode has ended
        self.maze = np.zeros((7, 7))           # 0 = path, 1 = trap, -1 = goal
        self.start = (x, y)                    # remembered so reset() can return here
        self.init_location = np.array([x, y])  # current position of the turtle
        self.destination = (x_, y_)
        self.reward = -1                       # every ordinary step costs -1

    def step(self, action):
        # 0 down  1 up  2 left  3 right; moves off the grid are clamped to the edge
        if action == 0:
            self.init_location[0] = min(self.init_location[0] + 1, 6)
        elif action == 1:
            self.init_location[0] = max(self.init_location[0] - 1, 0)
        elif action == 2:
            self.init_location[1] = max(self.init_location[1] - 1, 0)
        else:
            self.init_location[1] = min(self.init_location[1] + 1, 6)

        cell = self.maze[self.init_location[0]][self.init_location[1]]
        if cell == 1:       # walked into a trap
            self.reward = -50
            self.signal = 1
        elif cell == -1:    # reached the goal
            self.reward = 50
            self.signal = 1

        return self.init_location[0] * 7 + self.init_location[1], self.reward, self.signal

    def reset(self):
        # Put the turtle back on the configured start cell
        self.init_location = np.array(self.start)
        self.maze[self.destination[0]][self.destination[1]] = -1
        self.signal = 0
        self.reward = -1
        # 2x2 trap block at rows 1-2, columns 1-2
        for i in range(1, 3):
            for j in range(1, 3):
                self.maze[i][j] = 1
        return self.init_location[0] * 7 + self.init_location[1]

s = Turtle(0,0,5,5)
s.reset()

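3. The agent
A few words about the agent. Two algorithms are used here: the main one is Q-learning, and Sarsa can be obtained by changing a little of the code.
Comparing the two, Q-learning is the "bolder" one and Sarsa the more "cautious" one. Here are their update rules:
Q-learning: Q(s,a) <-- Q(s,a) + α[r + γ max_a' Q(s',a') - Q(s,a)]
Sarsa: Q(s,a) <-- Q(s,a) + α[r + γ Q(s',a') - Q(s,a)]
Q-learning takes a maximum: no matter which action the agent actually takes next, the table is updated with the largest Q-value available in the next state. In the CliffWalking environment the effect is easy to see: the trained agent hugs the cliff edge all the way to the goal, because that route is the shortest. Early in training, however, it falls off the cliff much more easily than Sarsa does.
Sarsa needs to know which action is actually taken next, so its estimates include the consequences of the agent's own exploratory moves. In CliffWalking the agent keeps away from the cliff on its way to the goal: falling off gives a very bad reward, a random step next to the edge can cause a fall, and so the cells beside the cliff end up with lower values and the agent stays as far from them as it can.
These two behaviours are not very pronounced in the environment used here, but you can change the coordinates and modify the environment to build your own.
In practice, though, Q-learning is used more often than Sarsa. Why? Look at Sarsa first and spell it out: obs, action, reward, next_obs, next_action. The last item is next_action, which means the update needs the next action as an input. This shows up directly in the update rule: α[r + γ Q(s',a') - Q(s,a)] needs the next action a', whereas Q-learning's α[r + γ max_a' Q(s',a') - Q(s,a)] does not. In real applications, having to obtain the next action before you can update is rather awkward, and it is not common. (I'd love to insert a blank line here, but I really don't know how; the layout is rough.)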
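A small worked example may make the difference between the two targets easier to see. The Q-values below are hypothetical numbers picked for illustration, and the function names are mine; α = 0.1 and γ = 0.9 are the constants used in this post's code. The situation: the agent is in state 0, moves right into state 1, and from state 1 the action "down" leads into the trap.

```python
import numpy as np

ALPHA, GAMMA = 0.1, 0.9  # learning rate and discount used in this post


def q_learning_target(table, reward, next_state, done):
    # Uses the best action in the next state, whatever the agent really does next
    return reward if done else reward + GAMMA * np.max(table[next_state])


def sarsa_target(table, reward, next_state, next_action, done):
    # Uses the action the agent actually picked for the next step
    return reward if done else reward + GAMMA * table[next_state, next_action]


table = np.zeros((49, 4))
# Hypothetical values for state 1: down (0) is terrible, right (3) is good
table[1] = [-50.0, 0.0, 1.0, 5.0]

reward, next_state = -1, 1
exploratory_action = 0  # suppose epsilon-greedy happened to pick "down"

q_target = q_learning_target(table, reward, next_state, done=False)
s_target = sarsa_target(table, reward, next_state, exploratory_action, done=False)
print(f"{q_target:.2f}")  # 3.50   = -1 + 0.9 * 5
print(f"{s_target:.2f}")  # -46.00 = -1 + 0.9 * (-50)

# One update of Q(0, right), starting from 0
q_new = 0.0 + ALPHA * (q_target - 0.0)
s_new = 0.0 + ALPHA * (s_target - 0.0)
print(f"{q_new:.2f} {s_new:.2f}")  # 0.35 -4.60
```

Under Q-learning the exploratory "down" leaves the value of moving right untouched, while under Sarsa the same exploratory step drags it down. That is the "bold" versus "cautious" behaviour described above.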
The main code follows:

# Snack is a .py file (Snack.py) that holds the Turtle environment written above
from Snack import Turtle
import numpy as np
# Using Q_table to solve this problem

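class Q_table_for_snack(object):
    
    def __init__(self):
        # Initialize the Q-table: 49 states (7x7 cells) x 4 actions
        self.table = np.zeros((49,4))
    
    def sample(self,obs):
        '''
        Action selection (epsilon-greedy):
        with probability 0.1, pick a random action.
        This is the "exploration" step: you can't learn by rote,
        so a suitable amount of exploration is necessary.
        The exploration rate could be decayed so that it gets smaller
        as training goes on, but this environment and method are simple
        enough that no decay is defined here.
        '''
        if np.random.uniform()<0.1:
            return np.random.randint(4)
        else:
            # Pick the action with the largest Q-value. Several actions may
            # tie for the maximum, so use np.where and np.random.choice to
            # pick one of the tied actions at random.
            action_max = np.max(self.table[obs,:])
            choice_action = np.where(action_max==self.table[obs,:])[0]
            return np.random.choice(choice_action)
         
    def learn(self,obs,action,reward,n_obs,n_action,done):
        '''
        Update the Q-table with one of the two methods.
        The update rules are:
        Q-learning: Q(s,a) <-- Q(s,a) + α[r + γ*max_a' Q(s',a') - Q(s,a)]
        Sarsa:      Q(s,a) <-- Q(s,a) + α[r + γ*Q(s',a') - Q(s,a)]
        Here α = 0.1 and γ = 0.9.
        '''
        Q_ = self.table[obs][action]
        if done:
            # Terminal step: there is no next state to bootstrap from
            target_Q = reward
        else:
            target_Q = reward + 0.9*np.max(self.table[n_obs,:])  # Q-learning
            #target_Q = reward + 0.9*self.table[n_obs][n_action]  # Sarsa: use this line instead
        self.table[obs][action] += 0.1*(target_Q-Q_)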
        
def run(env,agent):
    # Reset the environment and get the initial state
    statement = env.reset()
    # Choose the first action
    action = agent.sample(statement)
    score = 0
    steps = 0

    while True:
    
        # Keep interacting until the environment signals the end of the episode
        next_statement,reward,done = env.step(action)
        next_action = agent.sample(next_statement)
        agent.learn(statement,action,reward,next_statement,next_action,done)
        statement = next_statement
        action = next_action
        score += reward
        steps += 1
        
        if done:
            break
            
    # Print the episode score
    print("The score is",score)

# Create the environment: start at (0,0), goal at (5,5)
env = Turtle(0,0,5,5)

agent = Q_table_for_snack()
# Train for 150 episodes
for i in range(150):
    run(env,agent)

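Finally, here are the training results for Q-learning and Sarsa (Sarsa comes first):
[Figure: Sarsa training result]
[Figure: Q-learning training result]
Below is the complete code. It additionally shrinks the exploration threshold after every step and prints a visualization of the learned policy.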

import matplotlib.pyplot as plt
import numpy as np

class Turtle(object):
    """A 7x7 grid world; a state is encoded as row * 7 + col."""

    def __init__(self, x, y, x_, y_):
        self.signal = 0                        # becomes 1 once the episode has ended
        self.maze = np.zeros((7, 7))           # 0 = path, 1 = trap, -1 = goal
        self.start = (x, y)                    # remembered so reset() can return here
        self.init_location = np.array([x, y])  # current position of the turtle
        self.destination = (x_, y_)
        self.reward = -1                       # every ordinary step costs -1

    def step(self, action):
        # 0 down  1 up  2 left  3 right; moves off the grid are clamped to the edge
        if action == 0:
            self.init_location[0] = min(self.init_location[0] + 1, 6)
        elif action == 1:
            self.init_location[0] = max(self.init_location[0] - 1, 0)
        elif action == 2:
            self.init_location[1] = max(self.init_location[1] - 1, 0)
        else:
            self.init_location[1] = min(self.init_location[1] + 1, 6)

        cell = self.maze[self.init_location[0]][self.init_location[1]]
        if cell == 1:       # walked into a trap
            self.reward = -50
            self.signal = 1
        elif cell == -1:    # reached the goal
            self.reward = 50
            self.signal = 1

        return self.init_location[0] * 7 + self.init_location[1], self.reward, self.signal

    def reset(self):
        # Put the turtle back on the configured start cell
        self.init_location = np.array(self.start)
        self.maze[self.destination[0]][self.destination[1]] = -1
        self.signal = 0
        self.reward = -1
        # 2x2 trap block at rows 1-2, columns 1-2
        for i in range(1, 3):
            for j in range(1, 3):
                self.maze[i][j] = 1
        return self.init_location[0] * 7 + self.init_location[1]

class Q_table_for_snack(object):

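    def __init__(self):

        self.table = np.zeros((49, 4))
        self.decrease = 0.001   # how much the exploration rate drops per sample() call
        self.min_limit = 0.1    # current exploration rate (epsilon), despite its name

    def sample(self, obs):

        # Shrink the exploration rate on every call, but never below 0
        self.min_limit = max(0.0, self.min_limit - self.decrease)
        if np.random.uniform() < self.min_limit:
            return np.random.randint(4)
        else:
            # Greedy choice, breaking ties between equal Q-values at random
            action_max = np.max(self.table[obs, :])
            choice_action = np.where(action_max == self.table[obs, :])[0]
            return np.random.choice(choice_action)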

    def learn(self, obs, action, reward, n_obs, n_action, done):

        Q_ = self.table[obs][action]
        if done:
            target_Q = reward
        else:
            target_Q = reward + 0.9 * np.max(self.table[n_obs, :])  # Q_learning
        self.table[obs][action] += 0.1 * (target_Q - Q_)

    def Routine(self, Max):
        for i in range(7):
            for j in range(7):
                if Max[i * 7 + j] == 0:
                    print("↓", end="  ")
                elif Max[i * 7 + j] == 1:
                    print("↑", end="  ")
                elif Max[i * 7 + j] == 2:
                    print("←", end="  ")
                else:
                    print("→", end="  ")
            print()

def run(env, agent):

    statement = env.reset()
    action = agent.sample(statement)
    score = 0
    steps = 0

    while True:

        next_statement, reward, done = env.step(action)
        next_action = agent.sample(next_statement)
        agent.learn(statement, action, reward, next_statement, next_action, done)
        statement = next_statement
        action = next_action
        score += reward
        steps += 1
        if done:
            break

    print()
    return score

if __name__ == "__main__":

    env = Turtle(0, 0, 5, 5)
    agent = Q_table_for_snack()

    x = []
    y = []

    for i in range(150):
        score = run(env, agent)
        Max = np.argmax(agent.table,axis=1)
        agent.Routine(Max)
        x.append(i)
        y.append(score)

    plt.plot(x, y, color='green', marker='', linestyle='solid', linewidth=2, markersize=12)
    plt.show()
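The listing above lowers the exploration threshold by 0.001 on every call to sample, so starting from 0.1 the random exploration is gone after roughly 100 calls. A common alternative is to decay once per episode and keep a small floor. Here is a minimal sketch of that idea, not the scheme used in this post: the function name `epsilon_schedule` and the `floor` value of 0.01 are my own assumptions.

```python
def epsilon_schedule(start=0.1, decrease=0.001, floor=0.01, episodes=150):
    """Return the exploration rate to use in each episode (hypothetical helper)."""
    eps = start
    values = []
    for _ in range(episodes):
        values.append(eps)
        # Linear decay, clamped so some exploration always remains
        eps = max(floor, eps - decrease)
    return values


schedule = epsilon_schedule()
print(schedule[0], schedule[-1])  # 0.1 0.01
```

Inside a training loop, the agent would read `schedule[i]` for episode `i` instead of updating its threshold in sample.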

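Below is the result of training for 150 episodes:
[Figure: score per episode over 150 training episodes]
We can then visualize how the agent moves, using up, down, left and right arrows. The figure below shows, for 4 separate runs of 150 episodes each, the route choices implied by the values in the Q-table once training has finished. At any given cell, the arrow shows the action with the largest Q-value there, that is, the direction the agent is most inclined to take. Note that the arrows on the trap cells and on the goal cell have no real meaning.
[Figure: arrow maps of the learned policy from the 4 runs]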
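The arrow map shows one preferred action per cell. To read off the actual route, you can follow those arrows from the start until the goal or a trap is reached. Below is a minimal sketch under a few assumptions: `greedy_path` is a hypothetical helper that is not part of the code above, and the Q-table is filled by hand for the demonstration instead of coming from a training run.

```python
import numpy as np

# Same action coding as the environment: 0 down, 1 up, 2 left, 3 right
MOVES = {0: (1, 0), 1: (-1, 0), 2: (0, -1), 3: (0, 1)}
TRAPS = {(1, 1), (1, 2), (2, 1), (2, 2)}


def greedy_path(table, start=(0, 0), goal=(5, 5), size=7, max_steps=49):
    """Follow the highest-valued action from each cell and return the visited cells."""
    row, col = start
    path = [(row, col)]
    for _ in range(max_steps):
        if (row, col) == goal or (row, col) in TRAPS:
            break  # the episode would end here
        action = int(np.argmax(table[row * size + col]))
        d_row, d_col = MOVES[action]
        # Clamp to the grid, as the environment does
        row = min(max(row + d_row, 0), size - 1)
        col = min(max(col + d_col, 0), size - 1)
        path.append((row, col))
    return path


# Hand-made table for the demo: go down column 0, then right along row 5
demo_table = np.zeros((49, 4))
for r in range(5):
    demo_table[r * 7 + 0, 0] = 1.0   # prefer "down" in column 0
for c in range(5):
    demo_table[5 * 7 + c, 3] = 1.0   # prefer "right" in row 5

path = greedy_path(demo_table)
print(path)
```

With a trained agent you would pass `agent.table` instead of `demo_table`. The `max_steps` cap is there because a partly trained table can send the agent round in circles.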