MAML-RL PyTorch Code Walkthrough (8) -- maml_rl/envs/navigation.py

Last modified: 2023-01-15 15:53:05

First published: 2023-01-15 15:49:48

Article link: https://blog.csdn.net/m0_48948682/article/details/128695041

Introduction

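Most of the MAML (meta-learning) code available online targets image tasks; implementations for reinforcement learning are comparatively rare.

My own research direction is related to MAML-RL, so I decided to read through some source code.

The original MAML code is written in TensorFlow. I found a PyTorch implementation on GitHub and am studying that package here.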

Source code link

https://github.com/dragen1860/MAML-Pytorch-RL

File path

./maml_rl/envs/navigation.py

`import` packages

import numpy as np

import gym
from gym import spaces
from gym.utils import seeding

`Navigation2DEnv()` class

class Navigation2DEnv(gym.Env):
    
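	#### This class implements a 2D planar navigation environment. At each time step the
	#### agent takes an action (a velocity clipped to [-0.1, 0.1]) and receives a penalty
	#### equal to its L2 distance to the goal, i.e. the reward is the negative distance.
	#### The family of navigation tasks is generated by sampling goal positions uniformly,
	#### with both x and y drawn from [-0.5, 0.5].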
	"""2D navigation problems, as described in [1]. The code is adapted from 
	https://github.com/cbfinn/maml_rl/blob/9c8e2ebd741cb0c7b8bf2d040c4caeeb8e06cc95/maml_examples/point_env_randgoal.py

	At each time step, the 2D agent takes an action (its velocity, clipped in
	[-0.1, 0.1]), and receives a penalty equal to its L2 distance to the goal 
	position (ie. the reward is `-distance`). The 2D navigation tasks are 
	generated by sampling goal positions from the uniform distribution 
	on [-0.5, 0.5]^2.

	[1] Chelsea Finn, Pieter Abbeel, Sergey Levine, "Model-Agnostic 
		Meta-Learning for Fast Adaptation of Deep Networks", 2017 
		(https://arxiv.org/abs/1703.03400)
	"""

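	#### Constructor. The class inherits from gym.Env. The state and action spaces are
	#### observation_space and action_space. The task is a dict; its 'goal' entry is used
	#### as the goal position, defaulting to the origin when the key is absent.
	#### The state starts at [0, 0], and the random number generator is seeded.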
	def __init__(self, task={}):
		super(Navigation2DEnv, self).__init__()

		self.observation_space = spaces.Box(low=-np.inf, high=np.inf,
		                                    shape=(2,), dtype=np.float32)
		self.action_space = spaces.Box(low=-0.1, high=0.1,
		                               shape=(2,), dtype=np.float32)

		self._task = task
		self._goal = task.get('goal', np.zeros(2, dtype=np.float32))
		self._state = np.zeros(2, dtype=np.float32)
		self.seed()

	#### seeding.np_random is a gym utility built on NumPy. It returns a random number
	#### generator, stored in self.np_random, together with the seed that was actually used.
	def seed(self, seed=None):
		self.np_random, seed = seeding.np_random(seed)
		return [seed]

	#### Sample num_tasks goal positions uniformly from [-0.5, 0.5] in each coordinate
	#### and wrap each goal in its own task dict.
	def sample_tasks(self, num_tasks):
		goals = self.np_random.uniform(-0.5, 0.5, size=(num_tasks, 2))
		tasks = [{'goal': goal} for goal in goals]
		return tasks

	#### Switch the environment to a new task: store the task dict and update the goal.
	def reset_task(self, task):
		self._task = task
		self._goal = task['goal']

	#### Reset the 2D state to [0, 0] and return it.
	def reset(self, env=True):
		self._state = np.zeros(2, dtype=np.float32)
		return self._state

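	#### First clip the action to [-0.1, 0.1], then assert that it lies in the action space.
	#### The action is a displacement, so by vector addition the next state is simply the
	#### current coordinates plus the action. x and y hold the offset from the goal; the
	#### penalty (negative reward) is the L2 norm of that offset. The episode is done once
	#### both |x| and |y| are below 0.01. The task dict is returned in the info slot.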
	def step(self, action):
		action = np.clip(action, -0.1, 0.1)
		assert self.action_space.contains(action)
		self._state = self._state + action

		x = self._state[0] - self._goal[0]
		y = self._state[1] - self._goal[1]
		reward = -np.sqrt(x ** 2 + y ** 2)
		done = ((np.abs(x) < 0.01) and (np.abs(y) < 0.01))

		return self._state, reward, done, self._task
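
To make the walkthrough above concrete, here is a minimal gym-free sketch that mirrors the same logic using only NumPy. It is not the repository's code: the function names `sample_goals`, `step`, and `greedy_rollout` are hypothetical, `np.random.default_rng` stands in for `self.np_random`, and the "always head straight for the goal" policy is a hand-written assumption used only to show how reward and `done` behave.

```python
import numpy as np

def sample_goals(rng, num_tasks):
    """Draw goals uniformly from [-0.5, 0.5]^2, as sample_tasks does."""
    return rng.uniform(-0.5, 0.5, size=(num_tasks, 2))

def step(state, action, goal):
    """One transition following the same rules as Navigation2DEnv.step."""
    action = np.clip(action, -0.1, 0.1)        # per-axis velocity limit
    next_state = state + action                # the action is a displacement
    diff = next_state - goal
    reward = -np.sqrt(np.sum(diff ** 2))       # negative L2 distance to the goal
    done = bool(np.all(np.abs(diff) < 0.01))   # both coordinates within 0.01
    return next_state, reward, done

def greedy_rollout(goal, max_steps=100):
    """Hypothetical policy: always request the full remaining offset."""
    state = np.zeros(2)                        # same start state as reset()
    rewards = []
    done = False
    for _ in range(max_steps):
        state, reward, done = step(state, goal - state, goal)
        rewards.append(reward)
        if done:
            break
    return state, rewards, done

rng = np.random.default_rng(0)                 # stand-in for self.np_random
goals = sample_goals(rng, num_tasks=3)
for goal in goals:
    final_state, rewards, done = greedy_rollout(goal)
    print("goal", goal, "steps", len(rewards), "done", done,
          "return", round(float(sum(rewards)), 3))
```

Because every goal coordinate lies within 0.5 of the origin and each step can cover up to 0.1 per axis, this greedy policy should finish in at most five steps. A learned MAML policy would instead have to infer the goal from the rewards it observes during adaptation.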