[Reinforcement Learning] CUHK Reinforcement Learning Course Assignment Walkthrough 01_2: Course Assignment 2

Article link: https://blog.csdn.net/Liao164462791/article/details/122559360

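This post walks through an assignment from the CUHK reinforcement learning course, focusing on model-based tabular methods: policy iteration and value iteration. For policy iteration, it details how the value function is updated until convergence and how the optimal policy is then found. For value iteration, it highlights the difference from policy iteration: each loop updates the value function only once. Finally, it shows how the trained agent applies the learned policy in the FrozenLake game.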


Course resources

Course homepage: https://cuhkrlcourse.github.io/
Video lectures (Bilibili): https://space.bilibili.com/511221970/channel/seriesdetail?sid=764099
Related material (EasyRL): https://datawhalechina.github.io/easy-rl/#/
Reinforcement Learning: An Introduction: https://web.stanford.edu/class/psych209/Readings/SuttonBartoIPRLBook2ndEd.pdf
GitHub repository (assignment source): https://github.com/cuhkrlcourse/ierg5350-assignment-2021
Gitee (my walkthrough): https://gitee.com/cstern-liao/cuhk_rl_assignment

2 Model-based tabular methods

Model-based vs. Model-free

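Agents fall into two classes depending on whether they model the real world: model-based and model-free. A model-based agent builds a virtual model of the real world. With the state transition function $P(s_{t+1} \mid s_t, a_t)$ and the reward function $R(s_t, a_t)$, it can predict which state it will reach and what reward it will receive after taking a given action in a given state, so it can learn a policy or value function that maximizes reward directly from the model. In most real-world problems, however, we cannot obtain every element of the environment, and its state transition and reward functions are unknown to us. **In that case we need model-free learning.** Model-free learning does not model the real environment: the agent can only act in the real environment under some policy, wait for the reward and the state transition, and use that feedback to update its behavior policy, iterating until it learns the optimal policy.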
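To make the distinction concrete, here is a minimal sketch. It assumes a hypothetical two-state MDP (the tables `P` and `R` and both function names are made up for illustration and are not part of the assignment). With the model, the agent can compute the reward and the full next-state distribution without acting; without it, the agent only observes one sampled outcome per step.

```python
import random

# Hypothetical 2-state, 2-action MDP, used only for illustration.
# P[s][a] is a list of (probability, next_state) pairs; R[s][a] is the reward.
P = {
    0: {0: [(1.0, 0)], 1: [(0.8, 1), (0.2, 0)]},
    1: {0: [(1.0, 0)], 1: [(1.0, 1)]},
}
R = {0: {0: 0.0, 1: 1.0}, 1: {0: 0.0, 1: 2.0}}


def model_based_lookahead(state, action):
    """With a known model, predict the outcome without acting."""
    next_state_dist = {}
    for prob, next_state in P[state][action]:
        next_state_dist[next_state] = next_state_dist.get(next_state, 0.0) + prob
    return R[state][action], next_state_dist


def model_free_step(state, action, rng):
    """Without a model, the agent only sees one sampled transition."""
    probs = [prob for prob, _ in P[state][action]]
    next_states = [next_state for _, next_state in P[state][action]]
    sampled_next = rng.choices(next_states, weights=probs, k=1)[0]
    return R[state][action], sampled_next


reward, dist = model_based_lookahead(0, 1)
rng = random.Random(0)
sampled_reward, sampled_next = model_free_step(0, 1, rng)
print(reward, dist)
print(sampled_reward, sampled_next)
```

In the model-free case, `P` is hidden inside the environment; it appears in `model_free_step` only to simulate that environment.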

This assignment uses model-based tabular methods.

In Section 2, the assignment provides the definition of the parent class TabularRLTrainerAbstract.

# Run this cell without modification

class TabularRLTrainerAbstract:
    """This is the abstract class for tabular RL trainer. We will inherit the specific
    algorithm's trainer from this abstract class, so that we can reuse the codes like
    getting the dynamic of the environment (self._get_transitions()) or rendering the
    learned policy (self.render())."""
    
    def __init__(self, env_name='FrozenLake8x8-v1', model_based=True):
        self.env_name = env_name
        self.env = gym.make(self.env_name)
        self.action_dim = self.env.action_space.n
        self.obs_dim = self.env.observation_space.n
        
        self.model_based = model_based

    def _get_transitions(self, state, act):
        """Query the environment to get the transition probability,
        reward, the next state, and done given a pair of state and action.
        We implement this function for you. But you need to know the 
        return format of this function.
        """
        self._check_env_name()
        assert self.model_based, "You should not use _get_transitions in " \
            "model-free algorithm!"
        
        # Call the internal attribute of the environment.
        # `transitions` is a list containing all possible next states and the
        # probability, reward, and termination indicator corresponding to each
        transitions = self.env.env.P[state][act]

        # Given a certain state and action pair, it is possible
        # to find there exist multiple transitions, since the 
        # environment is not deterministic.
        # You need to know the return format of this function: a list of dicts
        ret = []
        for prob, next_state, reward, done in transitions:
            ret.append({
                "prob": prob,
                "next_state": next_state,
                "reward": reward,
                "done": done
            })
        return ret
    
    def _check_env_name(self):
        assert self.env_name.startswith('FrozenLake')

    def print_table(self):
        """Print a nicely formatted table; only works for the FrozenLake8x8 env.
        We write this function for you."""
        self._check_env_name()
        print_table(self.table)

    def train(self):
        """Conduct one iteration of learning."""
        raise NotImplementedError("You need to override the "
                                  "Trainer.train() function.")

    def evaluate(self):
        """Use the function you write to evaluate current policy.
        Return the mean episode reward of 1000 episodes when seed=0."""
        result = evaluate(self.policy, 1000, env_name=self.env_name)
        return result

    def render(self):
        """Reuse your evaluate function, render current policy 
        for one episode when seed=0"""
        evaluate(self.policy, 1, render=True, env_name=self.env_name)

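It is an abstract class, meaning its train() method must be overridden. In Sections 2.1 and 2.2 we subclass it and override train() to implement policy iteration and value iteration, respectively. Pay particular attention to the _get_transitions(self, state, act) method: for a given state-action pair it returns a list of dicts, one per possible transition, with the keys prob, next_state, reward, and done.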
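The sketch below shows how the return format of `_get_transitions` might be consumed. It does not call gym: `raw_transitions` is a hypothetical stand-in for `env.P[state][act]` with made-up numbers, and `to_dicts` and `expected_backup` are illustrative helper names, not functions from the assignment. Skipping the bootstrap term when `done` is true is one common convention and is an assumption here.

```python
# Hypothetical stand-in for env.P[state][act]: each entry is
# (prob, next_state, reward, done). The numbers are made up for illustration.
raw_transitions = [
    (1 / 3, 0, 0.0, False),
    (1 / 3, 1, 0.0, False),
    (1 / 3, 8, 0.0, False),
]


def to_dicts(transitions):
    """Convert raw tuples into the list-of-dicts format of _get_transitions."""
    return [
        {"prob": prob, "next_state": next_state, "reward": reward, "done": done}
        for prob, next_state, reward, done in transitions
    ]


def expected_backup(transitions, values, gamma):
    """Compute sum_i prob_i * (reward_i + gamma * V(next_state_i))."""
    total = 0.0
    for t in transitions:
        # Assumption: do not bootstrap from terminal transitions.
        future = 0.0 if t["done"] else values[t["next_state"]]
        total += t["prob"] * (t["reward"] + gamma * future)
    return total


values = {0: 0.0, 1: 0.3, 8: 0.6}  # made-up state values
ret = to_dicts(raw_transitions)
q = expected_backup(ret, values, gamma=0.9)
print(ret[0])
print(q)
```

Because the environment is stochastic, one state-action pair can yield several transitions, so the expectation is a probability-weighted sum over the whole list.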

2.1 Policy iteration

First, a recap of the policy iteration algorithm:

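1. Given the environment's transitions, update the value function until it converges. This first step is an inner loop that exits once the value function differs only slightly from its value in the previous inner iteration (that is, it has converged).

$v_{k+1}(s)=\mathbb{E}_{s'}[R(s,a)+\gamma v_k(s')]$

where $a$ is given by the policy of the current iteration, $s'$ is the next state, $R$ is the reward function, and $v_k(s')$ is the value of the next state from the previous inner iteration.

2. Find the policy that maximizes the value function in this round of iteration:

$a=\arg\max_a \mathbb{E}_{s'}[R(s,a)+\gamma v_k(s')]$

3. If the policy found is the same as in the previous round, stop iterating; otherwise go back to step 1 and continue.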
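The three steps above can be sketched in plain Python as follows. This is a minimal illustration on a hypothetical three-state chain MDP, not the assignment's FrozenLake trainer. `TRANSITIONS`, `q_value`, `evaluate_policy`, `improve_policy`, and `policy_iteration` are all names introduced here, and the convergence threshold `eps` is an assumed value.

```python
# Hypothetical 3-state chain MDP (not FrozenLake).
# TRANSITIONS[s][a] is a list of (prob, next_state, reward, done).
TRANSITIONS = {
    0: {0: [(1.0, 0, 0.0, False)], 1: [(1.0, 1, 0.0, False)]},
    1: {0: [(1.0, 0, 0.0, False)], 1: [(1.0, 2, 1.0, True)]},
    2: {0: [(1.0, 2, 0.0, True)], 1: [(1.0, 2, 0.0, True)]},
}
N_STATES, N_ACTIONS = 3, 2


def q_value(state, action, values, gamma):
    """Expected one-step return E_{s'}[R(s, a) + gamma * v(s')]."""
    return sum(
        prob * (reward + gamma * (0.0 if done else values[next_state]))
        for prob, next_state, reward, done in TRANSITIONS[state][action]
    )


def evaluate_policy(policy, gamma, eps=1e-10):
    """Step 1: iterate the value function until it stops changing."""
    values = [0.0] * N_STATES
    while True:
        new_values = [q_value(s, policy[s], values, gamma) for s in range(N_STATES)]
        if max(abs(a - b) for a, b in zip(new_values, values)) < eps:
            return new_values
        values = new_values


def improve_policy(values, gamma):
    """Step 2: pick the greedy action in every state."""
    return [
        max(range(N_ACTIONS), key=lambda a: q_value(s, a, values, gamma))
        for s in range(N_STATES)
    ]


def policy_iteration(gamma=0.9):
    policy = [0] * N_STATES
    while True:
        values = evaluate_policy(policy, gamma)
        new_policy = improve_policy(values, gamma)
        # Step 3: stop once the greedy policy no longer changes.
        if new_policy == policy:
            return policy, values
        policy = new_policy


policy, values = policy_iteration()
print(policy, values)
```

On this toy chain the loop settles on moving right (action 1) in states 0 and 1, which is the only way to collect the reward of 1.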