【CS285 Deep Reinforcement Learning】Homework 3 Walkthrough [Deep Reinforcement Learning]

Preface / Introduction

In short: train a convolutional network to play Atari.

In full: Part 1 uses Q-learning plus a convolutional network to implement and evaluate a trained Atari agent; Part 2 modifies the policy gradient from the previous homework into an actor-critic formulation. That part may involve fewer than 20 lines of code changes, but running and comparing the results takes a lot of time, because we need to train on quite a bit of data for the value function.

One gotcha with the environment setup: you need to switch to Python 3.7, which means installing the whole environment yet again... this environment really has a lot of issues. Hopefully I won't have to install it one more time: one of my machines can't use its GPU (the graphics card is too old), so I'm running on a lab machine's GPU instead, which means I've now installed the same environment four times... rough, especially with the Tsinghua mirror constantly dropping. 【Also: do not attempt this homework without a GPU, it is really, really slow. The PDF says at least a day without a GPU, versus maybe 3 hours with one.】

Code: https://gitee.com/kin_zhang/drl-hwprogramm

Reading Order

PDF

First, the reading order recommended by the PDF starts from the code already present in homework 1 and homework 2:

  • infrastructure/rl_trainer.py
  • infrastructure/utils.py
  • policies/MLP_policy.py

I find that every time I start writing, I end up filling in all the other files along the way. Also, why does the first file open straight onto # TODO: get this from Piazza? For anyone not registered on Piazza that's essentially withheld... Fortunately the differences are small; just note that a few parameters have changed compared with hw1 & hw2.

rl_trainer.py is not much different from hw2's; the main addition is save_expert_data_to_disk, which is a bit odd. Reference [1] uses it with itr=0 and save_expert_data_to_disk=True, but logically, when itr=0 you load the expert data and return right away, so you never reach that branch; and dumping on the very first itr=0 load wouldn't be necessary anyway. This confuses me, and I don't know what answer was given on Piazza.

The other files like utils.py are, I think, unchanged, so I won't go over them again. In MLP_policy.py, the update function's inputs no longer include advantages but are just observations and actions; in other words, the signature differs slightly from hw2's.

Oddly, since I filled things in by following jumps from run_hw3_dqn.py rather than in the PDF's order, I never hit the MLPPolicyAC class in MLP_policy.py; I'll leave it aside for now.

Reading run_hw3_dqn.py

First, Part 1's run_hw3_dqn.py.

In the main function:

trainer = Q_Trainer(params)
trainer.run_training_loop()

The agent initialized inside Q_Trainer is: self.params['agent_class'] = DQNAgent

So let's jump straight into DQNAgent and look at its initialization:

self.critic = DQNCritic(agent_params, self.optimizer_spec)
self.actor = ArgMaxPolicy(self.critic)

self.replay_buffer = MemoryOptimizedReplayBuffer(
    agent_params['replay_buffer_size'], agent_params['frame_history_len'], lander=lander)

Continuing into each of their constructors, there is nothing to fill in and nothing worth noting. So back up to trainer.run_training_loop(), which runs Q_Trainer's run_training_loop:

self.rl_trainer = RL_Trainer(self.params)
def run_training_loop(self):
        self.rl_trainer.run_training_loop(
            self.agent_params['num_timesteps'],
            collect_policy = self.rl_trainer.agent.actor,
            eval_policy = self.rl_trainer.agent.actor,
        )

Jumping into it, the rl_trainer.py file was already filled in above, so the key piece is self.rl_trainer.agent.actor. In the constructor this is determined by agent_class(self.env, self.params['agent_params']), i.e. it is the DQNAgent's actor, which is self.actor = ArgMaxPolicy(self.critic).
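For reference, here is a minimal sketch of what ArgMaxPolicy.get_action typically looks like (this assumes the critic exposes a qa_values(obs) helper returning per-action Q-values as a numpy array; the actual argmax_policy.py may differ in details):

class ArgMaxPolicy:
    """Greedy policy: always pick the action with the highest Q-value under the critic."""

    def __init__(self, critic):
        self.critic = critic

    def get_action(self, obs):
        # add a batch dimension when a single observation is passed in
        # (an Atari frame stack is already 3-D: H x W x channels*frames)
        observation = obs if obs.ndim > 3 else obs[None]
        # assumed helper: returns per-action Q-values of shape (batch, num_actions)
        q_values = self.critic.qa_values(observation)
        return q_values.argmax(-1)[0]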

Ignoring the rest for now, jump straight to the training call:

all_logs = self.train_agent()

def train_agent(self):
    # sample a batch of transitions from the replay buffer
    ob_batch, ac_batch, re_batch, next_ob_batch, terminal_batch = self.agent.sample(self.params['train_batch_size'])
    # use the sampled batch to train the agent
    train_log = self.agent.train(ob_batch, ac_batch, re_batch, next_ob_batch, terminal_batch)

Then we can see that the agent's train function needs filling in: just follow each function it calls, check what inputs it expects, and pass them in; the functions we jump into will be filled in shortly.

def train(self, ob_no, ac_na, re_n, next_ob_no, terminal_n):
        log = {}
        if (self.t > self.learning_starts
                and self.t % self.learning_freq == 0
                and self.replay_buffer.can_sample(self.batch_size)
        ):

            # TODO fill in the call to the update function using the appropriate tensors
            # update(self, ob_no, ac_na, next_ob_no, reward_n, terminal_n)
            log = self.critic.update(ob_no, ac_na, next_ob_no, re_n, terminal_n)

            # TODO update the target network periodically 
            # HINT: your critic already has this functionality implemented
            if self.num_param_updates % self.target_update_freq == 0:
                self.critic.update_target_network()

The first one is fairly obvious; for the second, the hint says the functionality is already implemented. At this point the overall flow is clear, so next jump into the critic's update function.

dqn_critic.py

First, take a look at these two lines:

qa_t_values = self.q_net(ob_no)
q_t_values = torch.gather(qa_t_values, 1, ac_na.unsqueeze(1)).squeeze(1)

What they are doing: the first feeds the current observation into the network to obtain the qa values, i.e. the Q-value of each action under the current observation.

Then gather, as the docs put it, "Gathers values along an axis specified by dim."

torch.gather(input, dim, index, *, sparse_grad=False, out=None) → Tensor

See the official docs for details: https://pytorch.org/docs/stable/generated/torch.gather.html

There is also a CSDN write-up: Pytorch系列(1): torch.gather()

The too-lazy-to-click version: gather uses index to pick out the values at specific positions of input.

When gathering, to make ac_na's dimensions line up with qa_t_values so the indexing works along that axis, unsqueeze adds a dimension at position 1, and afterwards squeeze collapses the result back down to ac_na's original shape.
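A quick runnable toy example of this unsqueeze → gather → squeeze pattern (numbers are made up):

import torch

qa_t_values = torch.tensor([[1.0, 2.0, 3.0, 4.0],   # Q-values for 4 actions, batch of 3 obs
                            [5.0, 6.0, 7.0, 8.0],
                            [9.0, 8.0, 7.0, 6.0]])
ac_na = torch.tensor([2, 0, 3])                      # action actually taken in each row

# unsqueeze(1): (3,) -> (3, 1) so the index tensor has the same rank as qa_t_values
# gather(dim=1): pick, in each row, the Q-value at the taken action's index
# squeeze(1):   (3, 1) -> (3,), back to ac_na's original shape
q_t_values = torch.gather(qa_t_values, 1, ac_na.unsqueeze(1)).squeeze(1)
print(q_t_values)  # tensor([3., 5., 6.])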

Here is an illustration of squeeze and unsqueeze (figure: https://i.stack.imgur.com/9AJJA.png).

  • The first TODO is to compute the Q-values from the target network.

    I traced it back: q_net_target ← network_initializer(self.ob_dim, self.ac_dim) ← hparams['q_func'], where 'q_func' is create_atari_q_network or create_lander_q_network depending on which environment you run → in the end it is just the network layers.

    So we know the input here should be the next observations, in order to obtain the next-step Q-values:

    qa_tp1_values = self.q_net_target(next_ob_no)
    
  • Next up: double Q.

    So let's recall what double Q is. Double Q-learning: $y = r + \gamma\, Q_{\phi'}\big(s', \arg\max_{a'} Q_{\phi}(s', a')\big)$

    From this we know that, just as in the first step, we need the action values for the next observation, but computed with $\phi$, i.e. the current q_net.

Putting it all together:

# TODO compute the Q-values from the target network 
qa_tp1_values = self.q_net_target(next_ob_no)

if self.double_q:
    # You must fill this part for Q2 of the Q-learning portion of the homework.
    # In double Q-learning, the best action is selected using the Q-network that
    # is being updated, but the Q-value for this action is obtained from the
    # target Q-network. See page 5 of https://arxiv.org/pdf/1509.06461.pdf for more details.
    next_qa_value = self.q_net(next_ob_no)
    actions = next_qa_value.argmax(1)
    q_tp1 = torch.gather(qa_tp1_values, 1, actions.unsqueeze(1)).squeeze(1)
else:
    q_tp1, _ = qa_tp1_values.max(dim=1)

# TODO compute targets for minimizing Bellman error
# HINT: as you saw in lecture, this would be:
    #currentReward + self.gamma * qValuesOfNextTimestep * (not terminal)
target = reward_n + self.gamma * q_tp1 * (1 - terminal_n)
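The rest of the update (loss and optimizer step) is provided by the skeleton, if I remember right. Just for intuition, here is a self-contained toy illustration (not the homework code) of how such a target is typically compared against the predicted Q-values with a Huber loss, with the target detached so gradients flow only through the online network's predictions:

import torch
import torch.nn.functional as F

gamma = 0.99
q_t_values = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)  # Q(s, a) from the online net
q_tp1 = torch.tensor([1.5, 0.5, 2.0])                           # bootstrapped next-step values
reward_n = torch.tensor([0.0, 1.0, -1.0])
terminal_n = torch.tensor([0.0, 0.0, 1.0])

target = reward_n + gamma * q_tp1 * (1 - terminal_n)
loss = F.smooth_l1_loss(q_t_values, target.detach())  # Huber loss, standard for DQN
loss.backward()
print(loss.item(), q_t_values.grad)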

Then back to the run_training_loop function (while writing yesterday I noticed my Ctrl+click jumps kept landing in hw2... even though I did pip install -e . — argh).

So I had actually missed this part: first collect trajectories, then train.

# collect trajectories, to be used for training
if isinstance(self.agent, DQNAgent):
   # only perform an env step and add to replay buffer for DQN
   self.agent.step_env()

So let's jump into self.agent.step_env().

dqn_agent.py

The first TODO's hint already says to store the latest obs and that it returns an idx; just go to dqn_utils.py and find the function that does exactly that.

The second TODO uses eps to decide whether to take a random action.

  • The question here: why is the random action necessarily something like np.random.randint?

    Judging from how the action is used as an index later, it is just an integer in [0, num_actions).

After that I think the HINTs are explicit enough; it should be fine.

  • Jumping from the agent into actor.get_action: reference [1] writes argmax(-1), and when I looked it up I wondered why not just max? But what we want here is the action, i.e. the index, so argmax is the right call; and since a single observation gives a 1-D array of Q-values, dropping the -1 should also work. I'll experiment with it later; see the small demo after the snippet below.

    index_array = np.argmax(x, axis=-1)
    # returns the *index* of the maximum along the last axis (i.e. the action),
    # whereas np.max(x, axis=-1) would return the maximum value itself
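To make the argmax-vs-max question concrete, a tiny numpy check (toy numbers):

import numpy as np

q_values = np.array([0.1, 0.7, 0.2])   # Q-values for 3 discrete actions
print(np.argmax(q_values, axis=-1))    # 1   -> the index, i.e. the action we want
print(np.max(q_values, axis=-1))       # 0.7 -> only the best value, not which action

# for a 1-D array, axis=-1 behaves the same as the default axis;
# it only matters once the input is batched with shape (batch, num_actions)
batch = np.array([[0.1, 0.7, 0.2],
                  [0.9, 0.3, 0.5]])
print(np.argmax(batch, axis=-1))       # [1 0]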
    

Everything else should be fine too, since the HINTs are very explicit.

def step_env(self):
    """
        Step the env and store the transition
        At the end of this block of code, the simulator should have been
        advanced one step, and the replay buffer should contain one more transition.
        Note that self.last_obs must always point to the new latest observation.
    """        

    # TODO store the latest observation ("frame") into the replay buffer
    # HINT: the replay buffer used here is `MemoryOptimizedReplayBuffer`
        # in dqn_utils.py
    self.replay_buffer_idx = self.replay_buffer.store_frame(self.last_obs)

    eps = self.exploration.value(self.t)

    # TODO use epsilon greedy exploration when selecting action
    # act randomly with probability eps, or always while t < learning_starts (warm-up phase)
    perform_random_action = np.random.random() < eps or self.t < self.learning_starts
    if perform_random_action:
        # HINT: take random action 
            # with probability eps (see np.random.random())
            # OR if your current step number (see self.t) is less that self.learning_starts
        action = np.random.randint(self.num_actions)
    else:
        # HINT: Your actor will take in multiple previous observations ("frames") in order
            # to deal with the partial observability of the environment. Get the most recent 
            # `frame_history_len` observations using functionality from the replay buffer,
            # and then use those observations as input to your actor. 
        obs = self.replay_buffer.encode_recent_observation()
        action = self.actor.get_action(obs)
    
    # TODO take a step in the environment using the action from the policy
    # HINT1: remember that self.last_obs must always point to the newest/latest observation
    # HINT2: remember the following useful function that you've seen before:
        #obs, reward, done, info = env.step(action)
    self.last_obs, reward, done, info = self.env.step(action)

    # TODO store the result of taking this action into the replay buffer
    # HINT1: see your replay buffer's `store_effect` function
    # HINT2: one of the arguments you'll need to pass in is self.replay_buffer_idx from above
    # store_effect(self, idx, action, reward, done):
    self.replay_buffer.store_effect(self.replay_buffer_idx, action, reward, done)

    # TODO if taking this step resulted in done, reset the env (and the latest observation)
    if done:
        self.last_obs = self.env.reset()
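As an aside, conceptually (this is a toy sketch, not the actual MemoryOptimizedReplayBuffer implementation) encode_recent_observation stacks the last frame_history_len frames along the channel axis so the actor sees a short history:

import numpy as np

def encode_recent_frames(frames, frame_history_len):
    """Toy version: concatenate the last `frame_history_len` (H, W, C) frames along channels."""
    recent = list(frames[-frame_history_len:])
    # zero-pad at the front if there isn't enough history yet
    while len(recent) < frame_history_len:
        recent = [np.zeros_like(recent[0])] + recent
    return np.concatenate(recent, axis=-1)

frames = [np.random.randint(0, 255, (84, 84, 1), dtype=np.uint8) for _ in range(3)]
print(encode_recent_frames(frames, frame_history_len=4).shape)  # (84, 84, 4)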

At this point, let's check whether all the .py files mentioned in the PDF have been completed.

Hmm, only MLPPolicyAC hasn't been touched yet. Tracing backwards, we know it is reached from run_hw3_actor_critic.py.

Reading run_hw3_actor_critic.py

First, the main function: the same initialization pattern as before.

trainer = AC_Trainer(params)
trainer.run_training_loop()

Then go into AC_Trainer:

self.params['agent_class'] = ACAgent
self.rl_trainer = RL_Trainer(self.params)

Then look at ACAgent's initialization, and after that RL_Trainer; nothing surprising, so keep going. Next comes run_training_loop(), and just as before we need to jump into the agent to look at its sample and train steps.

ac_agent.py

This one already comes with the pseudocode laid out; we just need to fill in the corresponding update calls, then jump into each update to check it is complete.

def train(self, ob_no, ac_na, re_n, next_ob_no, terminal_n):
    for i in range(self.agent_params['num_critic_updates_per_agent_update']):
        loss_critic = self.critic.update(ob_no, ac_na, next_ob_no, re_n, terminal_n)

    advantage = self.estimate_advantage(ob_no, next_ob_no, re_n, terminal_n)

    for i in range(self.agent_params['num_actor_updates_per_agent_update']):
        loss_actor = self.actor.update(ob_no, ac_na, advantage)

    loss = OrderedDict()
    loss['Critic_Loss'] = loss_critic
    loss['Actor_Loss'] = loss_actor

    return loss

In order, first the critic's update → i.e. bootstrapped_continuous_critic.py:

for i in range(self.num_grad_steps_per_target_update * self.num_target_updates):
    if i % self.num_grad_steps_per_target_update == 0:
        # recompute the bootstrapped target every num_grad_steps_per_target_update steps,
        # including the very first iteration (hence the `== 0` check)
        value_s_prime = self.forward_np(next_ob_no)
        targets = reward_n + self.gamma * value_s_prime * (1 - terminal_n)
        targets = ptu.from_numpy(targets)

    predictions = self.forward(ptu.from_numpy(ob_no))

    assert predictions.shape == targets.shape
    loss = self.loss(predictions, targets)

    self.optimizer.zero_grad()
    loss.backward()
    self.optimizer.step()

From the hints, the overall procedure is fairly clear (a small sanity check of the update schedule follows this list):

  1. Run self.num_grad_steps_per_target_update * self.num_target_updates gradient steps in total.
  2. Every self.num_grad_steps_per_target_update steps (which includes the first step),
  3. recompute the target values by:
    1. calculating V(s') by querying the critic with next_ob_no,
    2. computing the target values as r(s, a) + gamma * V(s').
  4. Finally, run the usual loss / optimizer update step.
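A quick check of that schedule with toy numbers (not the real hyperparameters) shows why the condition has to be == 0 rather than a bare if i % ...::

num_grad_steps_per_target_update = 3
num_target_updates = 2

for i in range(num_grad_steps_per_target_update * num_target_updates):
    recompute_target = (i % num_grad_steps_per_target_update == 0)
    print(i, recompute_target)
# True at i = 0 and i = 3: the target is refreshed every
# num_grad_steps_per_target_update gradient steps, including the very first one.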

Next, the estimate_advantage function. The HINT is clear: the main point is that querying the critic goes through its forward_np function; the rest just follows the hints.

value_s = self.critic.forward_np(ob_no)
value_s_prime = self.critic.forward_np(next_ob_no)
qsa_value = re_n + self.gamma * value_s_prime * (1-terminal_n)
adv_n = qsa_value - value_s

Last of all, the actor's update, i.e. MLP_policy.py.

MLP_policy.py

This one is even simpler: just lift it from hw2. The PDF notes you can ignore the nn_baseline part.

class MLPPolicyAC(MLPPolicy):
    def update(self, observations, actions, adv_n=None):
        # TODO: update the policy and return the loss
        observations = ptu.from_numpy(observations)
        actions = ptu.from_numpy(actions)
        advantages = ptu.from_numpy(adv_n)

        log_pi = self.forward(observations).log_prob(actions)
        # policy-gradient loss: maximize E[log pi(a|s) * advantage] by minimizing its negative
        loss = torch.neg(torch.mean(torch.mul(log_pi, advantages)))
        
        self.optimizer.zero_grad()
        loss.backward()
        self.optimizer.step()

        return loss.item()

References

  1. https://github.com/vincentkslim/cs285_homework_fall2020/tree/master/hw3
  2. https://github.com/fokx/cs285_fall2020/tree/master/hw3