Reinforcement Learning Case Study: GridWorld [Python Implementation of Value Iteration and Policy Iteration]

Reinforcement Learning Notes

This series of notes is mainly based on Prof. Shiyu Zhao's (Westlake University) course "Mathematical Foundations of Reinforcement Learning" on Bilibili. In my opinion, Prof. Zhao's lectures explain the material in a clear and accessible way and are well suited for getting started.

Chapter 1: Basic Concepts of Reinforcement Learning
Chapter 2: The Bellman Equation
Chapter 3: The Bellman Optimality Equation
Chapter 4: Value Iteration and Policy Iteration
Chapter 5: Reinforcement Learning in Practice: GridWorld



In the previous chapters we studied the mathematical principles behind two key algorithms in reinforcement learning: value iteration and policy iteration. Both algorithms optimize the decision-making process with the goal of maximizing long-term return. Value iteration approaches the optimal policy by repeatedly updating the state-value function, while policy iteration alternates between evaluating the current policy and improving it.

In this chapter we show how these two algorithms are implemented on a concrete problem, GridWorld. Working through the analysis and the code should deepen the understanding of how the algorithms work and how to implement them. The complete Python code is given at the end of the article.

I. Problem Description

1 Intro to GridWorld

As shown in the figure below, black cells are walls/obstacles, white cells are non-terminal cells, the cell marked "S" is the start of every episode, and the cell marked "5" is the goal. The agent starts from the "S" cell and, at every step, chooses one of four actions: "up", "right", "down", or "left", moving to the adjacent cell in that direction:

  • If the next cell is a wall/obstacle, the agent does not move and receives a reward of -1;
  • If the next cell is a non-terminal cell, the agent moves there and receives a reward of 0;
  • If the next cell is the goal cell, the episode ends and the agent receives a reward of 5 (these rules are summarized in the short sketch after this list).
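As an illustration only, the reward rule can be written as a tiny helper function. Note one subtlety in the code at the end of the article: the planning model GridWorld.transition assigns -1 for bumping into a wall (matching the description above), while the interactive GridWorld.step method uses its reward_wall parameter, which defaults to -5. The helper below is a hypothetical sketch following the -1 convention:

# Minimal sketch of the reward rule, for illustration only.
# Cell codes follow the grid encoding used later:
#   -1 -> wall/obstacle, 0 -> non-terminal, >0 -> goal (its value is the reward).
def reward_of_move(next_cell_value):
    if next_cell_value == -1:    # wall/obstacle: the agent stays in place
        return -1
    elif next_cell_value == 0:   # ordinary cell: no reward
        return 0
    else:                        # goal cell: episode ends with reward 5 here
        return next_cell_value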
# %matplotlib inline is only needed when running inside a Jupyter notebook
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt

from gridworld import GridWorld

gw = GridWorld()
gw.plot_grid(plot_title='The grid')

[Figure: the grid plotted by gw.plot_grid]

2 States and State Values

Excluding the wall surrounding the grid, there are 56 cells in total (including the obstacles inside the grid), and they correspond to 56 states (the obstacle cells can never be occupied, and reaching the goal cell terminates the episode). We index these states with integers from 0 to 55 (see gridworld.py for the conversion between a state index and its position; a small sketch of this conversion is also given after the figure below). GridWorld.plot_state_values() plots the state value $v(s)$ of every state:

gw.plot_state_values(np.arange(56),value_format="{:d}",plot_title='Each state as an integer')

[Figure: each state labeled with its integer index]
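The conversion referenced above is plain row-major indexing on the 7×8 grid (ignoring the surrounding wall). A minimal sketch, mirroring what GridWorld.int_to_state and GridWorld.get_current_state do in the complete code:

n_cols = 8  # the grid (without the surrounding wall) has 7 rows and 8 columns

def state_to_pos(s):
    # row-major: state 0 is the top-left cell, state 55 the bottom-right cell
    return s // n_cols, s % n_cols

def pos_to_state(y, x):
    return y * n_cols + x

print(state_to_pos(9))     # (1, 1), the start cell "S"
print(pos_to_state(1, 7))  # 15, the goal cell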

3 Taking Actions

Use GridWorld.step(action) to execute one action and GridWorld.reset() to restart an episode. An action is an integer from 0 to 3:

  1. 0: "up";
  2. 1: "right";
  3. 2: "down";
  4. 3: "left".
gw.reset()

current_state = gw.get_current_state()
tile_pos = gw.int_to_state(current_state)

print("The current state is {}, which corresponds to tile position {}\n".format(current_state,tile_pos))

action = np.random.randint(4)
reward, terminated, next_state = gw.step(action)
tile_pos = gw.int_to_state(next_state)

print("Take action {}, get reward {}, move to state {}".format(action,reward,next_state))
print("Now the current state is {}, which corresponds to tile position {}\n".format(next_state,tile_pos))

gw.reset()
current_state = gw.get_current_state()
tile_pos = gw.int_to_state(current_state)
print("Reset episode")
print("Now the current state is {}, which corresponds to tile position {}".format(current_state,tile_pos))

The output is as follows:

[Output of the code above]
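Building on the single step above, an entire episode can be rolled out with a random policy until termination. A minimal sketch, assuming the gw object created earlier:

gw.reset()
total_reward, steps = 0, 0
terminated = 0
while not terminated and steps < 1000:   # cap the episode length, just in case
    action = np.random.randint(4)
    reward, terminated, next_state = gw.step(action)
    total_reward += reward
    steps += 1
print("Episode finished after {} steps with total reward {}".format(steps, total_reward))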

4 Plotting Deterministic Policies

A deterministic policy is a function from states to actions and can be represented by a (56,)-shaped NumPy array whose entries are integers in [0, 3]. The call below plots a random deterministic policy on the grid:

gw.plot_policy(np.random.randint(4,size=(56,)),plot_title='A deterministic policy')

[Figure: a random deterministic policy shown as arrows]
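The policy iteration code in the next section also uses the stochastic representation $\pi(a|s)$, a (56, 4) array of action probabilities. Converting between the two representations is a one-liner in each direction; a small sketch for illustration:

policy = np.random.randint(4, size=(56,))  # deterministic policy

pi = np.zeros((56, 4))                     # stochastic representation pi(a|s)
pi[np.arange(56), policy] = 1              # all probability mass on the chosen action

recovered = np.argmax(pi, axis=1)          # back to the deterministic form
assert np.array_equal(recovered, policy)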

II. Algorithm Implementation

1 Policy Iteration

Recall that the value function $v_\pi(s)$ of a policy $\pi$ can be computed iteratively by policy evaluation (see the earlier chapter on value iteration and policy iteration). The iteration is given by
$$v_{k+1}(s)=\sum_{a}\pi(a|s)\sum_{s',r}p(s',r|s,a)\left[r+\gamma v_{k}(s')\right],$$
which can be rewritten as
$$v_{k+1}(s)=\sum_{a}\pi(a|s)\left[\mathbb{E}[r|s,a]+\gamma\sum_{s'}p(s'|s,a)\,v_k(s')\right].$$
If we collect the value functions $v_{k+1}, v_k$ into vectors, this becomes
$$v_{k+1} = \sum_{a}\pi(a|\cdot)\odot\left[R(a)+\gamma P(a)\,v_{k}\right],$$
where $R(a)$ is the vector of expected one-step rewards under action $a$, $P(a)$ is the transition probability matrix under action $a$, and $\odot$ is the element-wise product over states. The policy iteration algorithm is implemented as shown below.
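To make the connection between the vector form above and the code explicit, the policy evaluation step alone can be written as a standalone function. This is only a sketch; it assumes a GridWorld instance gw whose transition(action) method returns the pair (R(a), P(a)), as in the complete code at the end of the article:

def policy_evaluation(gw, pi, gamma=0.9, tol=1e-5, max_it=1000):
    # pi: (state_num, 4) array with pi[s, a] = pi(a|s)
    n = gw.get_state_num()
    v = np.zeros(n)
    for _ in range(max_it):
        v_new = np.zeros(n)
        for action in range(4):
            reward, tran_prob = gw.transition(action)      # R(a), P(a)
            v_new += pi[:, action] * (reward + gamma * tran_prob @ v)
        if np.linalg.norm(v_new - v) < tol:
            break
        v = v_new
    return v

The policy_iteration method below wraps exactly this evaluation loop inside an outer policy improvement loop.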

def policy_iteration(self, gamma=0.9, max_it=1000, tol=1e-5):

    # start from a random stochastic policy pi(a|s)
    stochastic_mat = np.random.rand(self.get_state_num(), 4)
    pi = stochastic_mat / stochastic_mat.sum(axis=1)[:, None]  # pi(a|s)
    policy = np.argmax(pi, axis=1)

    for _ in range(max_it):

        # Policy Evaluation: v_{k+1} = sum_a pi(a|.) * (R(a) + gamma * P(a) v_k)
        v = np.zeros((self.get_state_num(),))
        for _ in range(max_it):
            value_temp = np.zeros((self.get_state_num(),))
            for action in range(4):
                reward, tran_prob = self.transition(action)
                value_temp = value_temp + pi[:, action] * (reward + gamma * np.matmul(tran_prob, v))
            if np.linalg.norm(value_temp - v) < tol:
                break
            v = value_temp

        v_final = v

        # Policy Improvement: greedy policy w.r.t. q(s,a) = R(a) + gamma * P(a) v
        q = np.zeros((self.get_state_num(), 4))  # q(s,a)
        for action in range(4):
            reward, tran_prob = self.transition(action)
            q[:, action] = reward + gamma * np.matmul(tran_prob, v_final)
        now_policy = np.argmax(q, axis=1)

        # stop when the policy is stable; otherwise continue with the greedy policy
        if np.array_equal(policy, now_policy):
            break
        policy = now_policy
        pi = np.zeros((self.get_state_num(), 4))
        pi[np.arange(self.get_state_num()), policy] = 1  # greedy (deterministic) policy

    # if max_it is reached without the policy stabilizing, the most recent
    # value function and greedy policy are returned instead
    return v_final, policy

Choosing $\gamma=0.9$, the result is shown below:

v,policy = gw.policy_iteration(gamma=0.9)
gw.plot_state_values(v, plot_title='Value function of Policy Iteration')
gw.plot_policy(policy, plot_title='Greedy Policy of Policy Iteration')

[Figure: state values and greedy policy obtained by policy iteration]

2 Value Iteration

Similarly, we can implement the value iteration algorithm. It repeatedly applies the Bellman optimality update $v_{k+1}(s)=\max_a q_k(s,a)$ with $q_k(s,a)=\mathbb{E}[r|s,a]+\gamma\sum_{s'}p(s'|s,a)\,v_k(s')$, and reads off the greedy policy in every iteration:

def value_iteration(self, gamma, eps=1e-5, max_iter=2000):
    # input:
    #   gamma,    (float in (0,1)) discount factor of the return
    #   eps,      (float) stopping criterion
    #   max_iter, (int) maximum number of iterations
    # output:
    #   optim_value,  (1d numpy array, float) optimal value function
    #   optim_policy, (1d numpy array, int in {0,1,2,3}) optimal policy

    policy = np.zeros((self.get_state_num(),), dtype=int)
    v = np.zeros((self.get_state_num(),))

    for _ in range(max_iter):

        # Policy Update: greedy policy w.r.t. q_k(s,a) = R(a) + gamma * P(a) v_k
        q = np.zeros((self.get_state_num(), 4))  # q(s,a)
        for action in range(4):
            reward, tran_prob = self.transition(action)
            q[:, action] = reward + gamma * np.matmul(tran_prob, v)
        policy = np.argmax(q, axis=1)

        # Value Update: v_{k+1}(s) = max_a q_k(s,a)
        v_tmp = np.max(q, axis=1)
        if np.linalg.norm(v_tmp - v) < eps:
            break
        v = v_tmp

    optim_value = v
    optim_policy = policy

    return optim_value, optim_policy

Choosing $\gamma=0.9$, the result is shown below:

v,policy = gw.value_iteration(gamma=0.9)
gw.plot_state_values(v, plot_title='State values of value iteration')
gw.plot_policy(policy, plot_title='Optimal policy of value iteration')

[Figure: state values and optimal policy obtained by value iteration]

We can see that the optimal value function $v^*$ obtained by value iteration agrees with the one obtained by policy iteration, and the resulting optimal policies agree as well.
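This agreement can also be checked numerically; a quick sanity check, assuming both algorithms have been run as above:

v_pi, policy_pi = gw.policy_iteration(gamma=0.9)
v_vi, policy_vi = gw.value_iteration(gamma=0.9)

print(np.allclose(v_pi, v_vi, atol=1e-3))    # value functions agree up to the stopping tolerance
print(np.array_equal(policy_pi, policy_vi))  # greedy policies agree (up to ties between equally good actions)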

III. Complete Code

import matplotlib.pyplot as plt
import numpy as np


class GridWorld:

    def __init__(self, reward_wall=-5):
        # initialize grid with 2d numpy array
        # >0: goal
        # -1: wall/obstacles
        # 0: non-terminal
        self._grid = np.array(
            [[0, 0, 0, 0, 0, -1, 0, 0],
             [0, 0, 0, -1, 0, 0, 0, 5],
             [0, 0, 0, -1, -1, 0, 0, 0],
             [0, 0, 0, -1, -1, 0, 0, 0],
             [0, 0, 0, 0, 0, 0, 0, 0],
             [0, 0, 0, 0, 0, 0, 0, 0],
             [0, 0, 0, 0, 0, 0, 0, 0]
             ])
        # wall around the grid, padding grid with -1
        self._grid_padded = np.pad(self._grid, pad_width=1, mode='constant', constant_values=-1)
        self._reward_wall = reward_wall

        # set start state
        self._start_state = (1, 1)
        self._random_start = False

        # store position of goal states and non-terminal states
        idx_goal_state_y, idx_goal_state_x = np.nonzero(self._grid > 0)
        self._goal_states = [(idx_goal_state_y[i], idx_goal_state_x[i]) for i in range(len(idx_goal_state_x))]

        idx_non_term_y, idx_non_term_x = np.nonzero(self._grid == 0)
        self._non_term_states = [(idx_non_term_y[i], idx_non_term_x[i]) for i in range(len(idx_non_term_x))]

        # store the current state in the padded grid
        self._state_padded = (self._start_state[0] + 1, self._start_state[1] + 1)

    def get_state_num(self):
        # get the number of states (total_state_number) in the grid, note: the wall/obstacles inside the grid are
        # counted as state as well
        return np.prod(np.shape(self._grid))

    def get_state_grid(self):

        state_grid = np.multiply(np.reshape(np.arange(self.get_state_num()), self._grid.shape), self._grid >= 0) - (
                self._grid == -1)

        return state_grid, np.pad(state_grid, pad_width=1, mode='constant', constant_values=-1)

    def get_current_state(self):
        # get the current state as an integer from 0 to total_state_number-1
        y, x = self._state_padded
        return (y - 1) * self._grid.shape[1] + (x - 1)

    def int_to_state(self, int_obs):
        # convert an integer from 0 to total_state_number-1 to the position on the non-padded grid
        x = int_obs % self._grid.shape[1]
        y = int_obs // self._grid.shape[1]
        return y, x

    def reset(self):
        # reset the gridworld
        if self._random_start:
            # randomly start at a non-terminal state
            idx_start = np.random.randint(len(self._non_term_states))
            start_state = self._non_term_states[idx_start]
            self._state_padded = (start_state[0] + 1, start_state[1] + 1)
        else:
            # start at the designated start_state
            self._state_padded = (self._start_state[0] + 1, self._start_state[1] + 1)

    def step(self, action):
        # take one step according to the action
        # input: action (integer between 0 and 3)
        # output: reward           reward of this action
        #         terminated       1 if reaching the terminal state, 0 otherwise
        #         next_state       next state after this action, integer from 0 to total_state_number-1)
        y, x = self._state_padded

        if action == 0:  # up
            new_state_padded = (y - 1, x)
        elif action == 1:  # right
            new_state_padded = (y, x + 1)
        elif action == 2:  # down
            new_state_padded = (y + 1, x)
        elif action == 3:  # left
            new_state_padded = (y, x - 1)
        else:
            raise ValueError("Invalid action: {} is not 0, 1, 2, or 3.".format(action))

        new_y, new_x = new_state_padded
        if self._grid_padded[new_y, new_x] == -1:  # wall/obstacle
            reward = self._reward_wall
            new_state_padded = (y, x)
        elif self._grid_padded[new_y, new_x] == 0:  # non-terminal cell
            reward = 0.
        else:  # a goal
            reward = self._grid_padded[new_y, new_x]
            self.reset()
            terminated = 1
            return reward, terminated, self.get_current_state()

        terminated = 0
        self._state_padded = new_state_padded
        return reward, terminated, self.get_current_state()

    def plot_grid(self, plot_title=None):
        # plot the grid
        plt.figure(figsize=(5, 5),dpi=200)
        plt.imshow(self._grid_padded <= -1, cmap='binary', interpolation="nearest")
        ax = plt.gca()
        ax.grid(0)
        plt.xticks([])
        plt.yticks([])

        if plot_title:
            plt.title(plot_title)

        plt.text(
            self._start_state[1] + 1, self._start_state[0] + 1,
            r"$\mathbf{S}$", ha='center', va='center')
        for goal_state in self._goal_states:
            plt.text(
                goal_state[1] + 1, goal_state[0] + 1,
                "{:d}".format(self._grid[goal_state[0], goal_state[1]]), ha='center', va='center')
        h, w = self._grid_padded.shape
        for y in range(h - 1):
            plt.plot([-0.5, w - 0.5], [y + 0.5, y + 0.5], '-k', lw=2)
        for x in range(w - 1):
            plt.plot([x + 0.5, x + 0.5], [-0.5, h - 0.5], '-k', lw=2)

    def plot_state_values(self, state_values, value_format="{:.1f}",plot_title=None):
        # plot the state values
        # input: state_values     (total_state_number, )-numpy array, state value function
        #        plot_title       str, title of the plot
        plt.figure(figsize=(5, 5),dpi=200)
        plt.imshow((self._grid_padded <= -1) + (self._grid_padded > 0) * 0.5, cmap='Greys', vmin=0, vmax=1)
        ax = plt.gca()
        ax.grid(0)
        plt.xticks([])
        plt.yticks([])

        if plot_title:
            plt.title(plot_title)

        for (int_obs, state_value) in enumerate(state_values):
            y, x = self.int_to_state(int_obs)
            if (y, x) in self._non_term_states:
                plt.text(x + 1, y + 1, value_format.format(state_value), ha='center', va='center')

        h, w = self._grid_padded.shape
        for y in range(h - 1):
            plt.plot([-0.5, w - 0.5], [y + 0.5, y + 0.5], '-k', lw=2)
        for x in range(w - 1):
            plt.plot([x + 0.5, x + 0.5], [-0.5, h - 0.5], '-k', lw=2)

    def plot_policy(self, policy, plot_title=None):
        # plot a deterministic policy
        # input: policy           (total_state_number, )-numpy array, contains action as integer from 0 to 3
        #        plot_title       str, title of the plot
        action_names = [r"$\uparrow$", r"$\rightarrow$", r"$\downarrow$", r"$\leftarrow$"]
        plt.figure(figsize=(5, 5),dpi=200)
        plt.imshow((self._grid_padded <= -1) + (self._grid_padded > 0) * 0.5, cmap='Greys', vmin=0, vmax=1)
        ax = plt.gca()
        ax.grid(0)
        plt.xticks([])
        plt.yticks([])

        if plot_title:
            plt.title(plot_title)

        for (int_obs, action) in enumerate(policy):
            y, x = self.int_to_state(int_obs)
            if (y, x) in self._non_term_states:
                action_arrow = action_names[action]
                plt.text(x + 1, y + 1, action_arrow, ha='center', va='center')
    # Transition model: for a given action, return the expected one-step reward
    # vector R(a) and the transition probability matrix P(a) over all states
    def transition(self, action):
        # Shifting the padded grid by one cell in the action direction lines every
        # cell up with the cell the agent would move into; the anchor is the
        # top-left corner of that shifted window inside the padded grid.
        if action == 0:  # up
            anchor_state_padded = (0, 1)
        elif action == 1:  # right
            anchor_state_padded = (1, 2)
        elif action == 2:  # down
            anchor_state_padded = (2, 1)
        elif action == 3:  # left
            anchor_state_padded = (1, 0)
        else:
            raise ValueError("Invalid action: {} is not 0, 1, 2, or 3.".format(action))

        state_num = self.get_state_num()
        h, w = self._grid.shape
        y_a, x_a = anchor_state_padded

        # reward for taking the action from each non-terminal cell: the value of the
        # destination cell in the padded grid (-1 wall, 0 non-terminal, >0 goal)
        reward = np.multiply(self._grid_padded[y_a:y_a + h, x_a:x_a + w], self._grid == 0)

        # index of the destination state of every cell
        state_grid, state_grid_padded = self.get_state_grid()
        next_state = state_grid_padded[y_a:y_a + h, x_a:x_a + w]
        # bumping into a wall (destination index -1) keeps the agent in place
        next_state = np.multiply(state_grid, next_state == -1) + np.multiply(next_state, next_state > -1)
        # obstacle states are never occupied; goal states transition to themselves
        next_state[self._grid == -1] = -1
        next_state[self._grid > 0] = state_grid[self._grid > 0]

        next_state_vec = next_state.flatten()
        state_vec = state_grid.flatten()

        # deterministic transitions: probability 1 from each valid state to its destination
        probability = np.zeros((state_num, state_num))
        probability[state_vec[state_vec > -1], next_state_vec[state_vec > -1]] = 1
        return reward.flatten(), probability
    
    # Value Iteration Algorithm
    def value_iteration(self, gamma, eps=1e-5, max_iter=2000):
        # input:
        #   gamma,    (float in (0,1)) discount factor of the return
        #   eps,      (float) stopping criterion
        #   max_iter, (int) maximum number of iterations
        # output:
        #   optim_value,  (1d numpy array, float) optimal value function
        #   optim_policy, (1d numpy array, int in {0,1,2,3}) optimal policy

        policy = np.zeros((self.get_state_num(),), dtype=int)
        v = np.zeros((self.get_state_num(),))

        for _ in range(max_iter):

            # Policy Update: greedy policy w.r.t. q_k(s,a) = R(a) + gamma * P(a) v_k
            q = np.zeros((self.get_state_num(), 4))  # q(s,a)
            for action in range(4):
                reward, tran_prob = self.transition(action)
                q[:, action] = reward + gamma * np.matmul(tran_prob, v)
            policy = np.argmax(q, axis=1)

            # Value Update: v_{k+1}(s) = max_a q_k(s,a)
            v_tmp = np.max(q, axis=1)
            if np.linalg.norm(v_tmp - v) < eps:
                break
            v = v_tmp

        optim_value = v
        optim_policy = policy

        return optim_value, optim_policy
    
    # Policy Iteration Algorithm
    def policy_iteration(self, gamma=0.9, max_it=1000, tol=1e-5):

        # start from a random stochastic policy pi(a|s)
        stochastic_mat = np.random.rand(self.get_state_num(), 4)
        pi = stochastic_mat / stochastic_mat.sum(axis=1)[:, None]  # pi(a|s)
        policy = np.argmax(pi, axis=1)

        for _ in range(max_it):

            # Policy Evaluation: v_{k+1} = sum_a pi(a|.) * (R(a) + gamma * P(a) v_k)
            v = np.zeros((self.get_state_num(),))
            for _ in range(max_it):
                value_temp = np.zeros((self.get_state_num(),))
                for action in range(4):
                    reward, tran_prob = self.transition(action)
                    value_temp = value_temp + pi[:, action] * (reward + gamma * np.matmul(tran_prob, v))
                if np.linalg.norm(value_temp - v) < tol:
                    break
                v = value_temp

            v_final = v

            # Policy Improvement: greedy policy w.r.t. q(s,a) = R(a) + gamma * P(a) v
            q = np.zeros((self.get_state_num(), 4))  # q(s,a)
            for action in range(4):
                reward, tran_prob = self.transition(action)
                q[:, action] = reward + gamma * np.matmul(tran_prob, v_final)
            now_policy = np.argmax(q, axis=1)

            # stop when the policy is stable; otherwise continue with the greedy policy
            if np.array_equal(policy, now_policy):
                break
            policy = now_policy
            pi = np.zeros((self.get_state_num(), 4))
            pi[np.arange(self.get_state_num()), policy] = 1  # greedy (deterministic) policy

        # if max_it is reached without the policy stabilizing, the most recent
        # value function and greedy policy are returned instead
        return v_final, policy
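For completeness, a minimal usage sketch of the class above (assuming it is saved as gridworld.py, as referenced earlier), reproducing the experiments from the previous sections:

if __name__ == "__main__":
    gw = GridWorld()
    gw.plot_grid(plot_title='The grid')

    # run both algorithms with the same discount factor
    v_pi, policy_pi = gw.policy_iteration(gamma=0.9)
    v_vi, policy_vi = gw.value_iteration(gamma=0.9)

    gw.plot_state_values(v_vi, plot_title='State values of value iteration')
    gw.plot_policy(policy_vi, plot_title='Optimal policy of value iteration')
    plt.show()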

IV. References

  1. Zhao, S. Mathematical Foundations of Reinforcement Learning. Springer Nature Press and Tsinghua University Press.
  2. Sutton, Richard S., and Andrew G. Barto. Reinforcement Learning: An Introduction. MIT Press, 2018.