MAML-RL PyTorch Code Walkthrough (12) -- maml_rl/envs/mujoco/ant.py

Overview

Most of the MAML (meta-learning) code to be found online deals with image tasks; there is comparatively little code on the reinforcement-learning side.

Since my own work is related to MAML-RL, I decided to read through some source code.

The original MAML code is based on TensorFlow; I found a PyTorch implementation on GitHub and will work through that package here.

Source Link

https://github.com/dragen1860/MAML-Pytorch-RL

File Path

./maml_rl/envs/mujoco/ant.py

import

import numpy as np
from gym.envs.mujoco import AntEnv as AntEnv_

AntEnv()

This class is the base class; the task variants below are all built on top of it.

class AntEnv(AntEnv_):

    #### @property turns the decorated method into a read-only attribute of the same
    #### name; since no .setter or .deleter is defined, action_scaling cannot be
    #### reassigned from outside. If the instance has no 'action_space' attribute yet,
    #### return 1.0. Otherwise, if self._action_scaling has not been computed yet
    #### (is None), cache half the range of the action space, i.e. 0.5 * (high - low).
    @property
    def action_scaling(self):
        if not hasattr(self, 'action_space'):
            return 1.0
        if self._action_scaling is None:
            lb, ub = self.action_space.low, self.action_space.high
            self._action_scaling = 0.5 * (ub - lb)
        return self._action_scaling

    #### Build the observation from the MuJoCo simulator: joint positions (qpos,
    #### skipping the first two entries), joint velocities (qvel), the external contact
    #### forces cfrc_ext clipped to [-1, 1], the torso rotation matrix, and the torso
    #### center of mass, all concatenated into a single flat float32 numpy array.
    def _get_obs(self):
        return np.concatenate([
            self.sim.data.qpos.flat[2:],
            self.sim.data.qvel.flat,
            np.clip(self.sim.data.cfrc_ext, -1, 1).flat,
            self.sim.data.get_body_xmat("torso").flat,
            self.get_body_com("torso").flat,
        ]).astype(np.float32).flatten()

    #### Set up the in-simulation camera: look up the id of the 'track' camera, switch
    #### the viewer to fixed-camera mode (type 2), attach it to that camera, set the
    #### viewing distance to 0.35 times the model extent, and hide the on-screen overlay.
    def viewer_setup(self):
        camera_id = self.model.camera_name2id('track')
        self.viewer.cam.type = 2
        self.viewer.cam.fixedcamid = camera_id
        self.viewer.cam.distance = self.model.stat.extent * 0.35
        # Hide the overlay
        self.viewer._hide_overlay = True

    #### Rendering. In 'rgb_array' mode, render off-screen, read a 500x500 pixel buffer
    #### from the viewer, and return it as an array. In 'human' mode, simply render to
    #### the interactive viewer and return nothing.
    def render(self, mode='human'):
        if mode == 'rgb_array':
            self._get_viewer().render()
            # window size used for old mujoco-py:
            width, height = 500, 500
            data = self._get_viewer().read_pixels(width, height, depth=False)
            return data
        elif mode == 'human':
            self._get_viewer().render()
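To make action_scaling concrete, here is a small standalone sketch, not part of the repository: it assumes the stock ant.xml, whose 8 actuators have a control range of [-1, 1], and shows the half-range value together with the control cost that the task variants below build from it.

import numpy as np

# Hypothetical stand-ins for self.action_space.low / .high of the stock Ant model.
low = -np.ones(8, dtype=np.float32)
high = np.ones(8, dtype=np.float32)

# Half of the action range, exactly what the action_scaling property caches.
action_scaling = 0.5 * (high - low)          # -> array of ones for a [-1, 1] range

# The control cost used by AntVelEnv / AntDirEnv / AntPosEnv below.
action = np.array([0.5, -0.5, 1.0, -1.0, 0.2, -0.2, 0.0, 0.3], dtype=np.float32)
ctrl_cost = 0.5 * 1e-2 * np.sum(np.square(action / action_scaling))
print(action_scaling, ctrl_cost)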

AntVelEnv()

class AntVelEnv(AntEnv):

    #### Ant environment with a target velocity, inheriting from AntEnv. The reward is
    #### composed of a control cost, a contact cost, a survival bonus, and a penalty
    #### proportional to the gap between the current forward velocity and the target
    #### velocity. Target velocities are sampled from the uniform distribution on [0, 3].
    """Ant environment with target velocity, as described in [1]. The
    code is adapted from
    https://github.com/cbfinn/maml_rl/blob/9c8e2ebd741cb0c7b8bf2d040c4caeeb8e06cc95/rllab/envs/mujoco/ant_env_rand.py

    The ant follows the dynamics from MuJoCo [2], and receives at each
    time step a reward composed of a control cost, a contact cost, a survival
    reward, and a penalty equal to the difference between its current velocity
    and the target velocity. The tasks are generated by sampling the target
    velocities from the uniform distribution on [0, 3].

    [1] Chelsea Finn, Pieter Abbeel, Sergey Levine, "Model-Agnostic
        Meta-Learning for Fast Adaptation of Deep Networks", 2017
        (https://arxiv.org/abs/1703.03400)
    [2] Emanuel Todorov, Tom Erez, Yuval Tassa, "MuJoCo: A physics engine for
        model-based control", 2012
        (https://homes.cs.washington.edu/~todorov/papers/TodorovIROS12.pdf)
    """

    #### Store the task dict and its 'velocity' entry (default 0.0), leave the action
    #### scaling uninitialized, and call the parent constructor.
    def __init__(self, task={}):
        self._task = task
        self._goal_vel = task.get('velocity', 0.0)
        self._action_scaling = None
        super(AntVelEnv, self).__init__()

    def step(self, action):

        #### Read the x position of the torso before the action (xposbefore), run the
        #### simulation for self.frame_skip steps with the given action, then read the
        #### x position of the torso afterwards (xposafter).
        xposbefore = self.get_body_com("torso")[0]
        self.do_simulation(action, self.frame_skip)
        xposafter = self.get_body_com("torso")[0]

        #### The forward velocity is the displacement divided by dt; the forward reward
        #### penalizes the absolute gap to the target velocity (plus a constant offset
        #### of 1.0); the survival bonus is 0.05; the control cost grows with the
        #### magnitude of the scaled action; the contact cost is computed from the
        #### clipped external contact forces.
        forward_vel = (xposafter - xposbefore) / self.dt
        forward_reward = -1.0 * np.abs(forward_vel - self._goal_vel) + 1.0
        survive_reward = 0.05
        ctrl_cost = 0.5 * 1e-2 * np.sum(np.square(action / self.action_scaling))
        contact_cost = 0.5 * 1e-3 * np.sum(
            np.square(np.clip(self.sim.data.cfrc_ext, -1, 1)))

        #### Build the observation via _get_obs() inherited from AntEnv and sum the
        #### reward terms. self.state_vector() concatenates qpos and qvel into one
        #### vector; the episode continues (notdone) as long as every state entry is
        #### finite and the torso height state[2] stays within [0.2, 1.0]; otherwise
        #### done is True. infos records the individual reward terms and the task.
        #### Finally return the usual (observation, reward, done, info) tuple.
        observation = self._get_obs()
        reward = forward_reward - ctrl_cost - contact_cost + survive_reward
        state = self.state_vector()
        notdone = np.isfinite(state).all() \
                  and state[2] >= 0.2 and state[2] <= 1.0
        done = not notdone
        infos = dict(reward_forward=forward_reward, reward_ctrl=-ctrl_cost,
                     reward_contact=-contact_cost, reward_survive=survive_reward,
                     task=self._task)
        return (observation, reward, done, infos)

    #### Sample num_tasks target velocities from the uniform distribution on [0.0, 3.0]
    #### and wrap each one in a dict under the key 'velocity'.
    def sample_tasks(self, num_tasks):
        velocities = self.np_random.uniform(0.0, 3.0, size=(num_tasks,))
        tasks = [{'velocity': velocity} for velocity in velocities]
        return tasks

    #### Reset the task: store the task dict and its target velocity.
    def reset_task(self, task):
        self._task = task
        self._goal_vel = task['velocity']
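Below is a minimal usage sketch of the task interface, hypothetical driver code rather than anything in this repository: it samples a few velocity tasks, adapts the environment to each one, and rolls out random actions (it assumes mujoco-py and the standard gym Ant assets are installed).

env = AntVelEnv()
tasks = env.sample_tasks(num_tasks=3)        # e.g. [{'velocity': 1.7}, ...]
for task in tasks:
    env.reset_task(task)                     # updates self._goal_vel
    observation = env.reset()
    for _ in range(10):
        action = env.action_space.sample()   # random policy, just for illustration
        observation, reward, done, infos = env.step(action)
        if done:                             # torso height left [0.2, 1.0]
            break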

AntDirEnv()

class AntDirEnv(AntEnv):

    #### Ant environment with a target direction, inheriting from AntEnv. The reward is
    #### composed of a control cost, a contact cost, a survival bonus, and a term equal
    #### to the velocity along the target direction. Target directions are sampled from
    #### {-1, +1} with probability 0.5 each (-1: backward, +1: forward).
    """Ant environment with target direction, as described in [1]. The
    code is adapted from
    https://github.com/cbfinn/maml_rl/blob/9c8e2ebd741cb0c7b8bf2d040c4caeeb8e06cc95/rllab/envs/mujoco/ant_env_rand_direc.py

    The ant follows the dynamics from MuJoCo [2], and receives at each
    time step a reward composed of a control cost, a contact cost, a survival
    reward, and a reward equal to its velocity in the target direction. The
    tasks are generated by sampling the target directions from a Bernoulli
    distribution on {-1, 1} with parameter 0.5 (-1: backward, +1: forward).

    [1] Chelsea Finn, Pieter Abbeel, Sergey Levine, "Model-Agnostic
        Meta-Learning for Fast Adaptation of Deep Networks", 2017
        (https://arxiv.org/abs/1703.03400)
    [2] Emanuel Todorov, Tom Erez, Yuval Tassa, "MuJoCo: A physics engine for
        model-based control", 2012
        (https://homes.cs.washington.edu/~todorov/papers/TodorovIROS12.pdf)
    """

    #### Store the task dict and its 'direction' entry (default +1), leave the action
    #### scaling uninitialized, and call the parent constructor.
    def __init__(self, task={}):
        self._task = task
        self._goal_dir = task.get('direction', 1)
        self._action_scaling = None
        super(AntDirEnv, self).__init__()

    def step(self, action):

        #### Read the x position of the torso before the action (xposbefore), run the
        #### simulation for self.frame_skip steps with the given action, then read the
        #### x position of the torso afterwards (xposafter).
        xposbefore = self.get_body_com("torso")[0]
        self.do_simulation(action, self.frame_skip)
        xposafter = self.get_body_com("torso")[0]

        #### The forward velocity is the displacement divided by dt; the forward reward
        #### is that velocity projected onto the target direction; the survival bonus is
        #### 0.05; the control cost grows with the magnitude of the scaled action; the
        #### contact cost is computed from the clipped external contact forces.
        forward_vel = (xposafter - xposbefore) / self.dt
        forward_reward = self._goal_dir * forward_vel
        survive_reward = 0.05
        ctrl_cost = 0.5 * 1e-2 * np.sum(np.square(action / self.action_scaling))
        contact_cost = 0.5 * 1e-3 * np.sum(
            np.square(np.clip(self.sim.data.cfrc_ext, -1, 1)))

        #### Build the observation via _get_obs() inherited from AntEnv and sum the
        #### reward terms. self.state_vector() concatenates qpos and qvel into one
        #### vector; the episode continues (notdone) as long as every state entry is
        #### finite and the torso height state[2] stays within [0.2, 1.0]; otherwise
        #### done is True. infos records the individual reward terms and the task.
        #### Finally return the usual (observation, reward, done, info) tuple.
        observation = self._get_obs()
        reward = forward_reward - ctrl_cost - contact_cost + survive_reward
        state = self.state_vector()
        notdone = np.isfinite(state).all() \
                  and state[2] >= 0.2 and state[2] <= 1.0
        done = not notdone
        infos = dict(reward_forward=forward_reward, reward_ctrl=-ctrl_cost,
                     reward_contact=-contact_cost, reward_survive=survive_reward,
                     task=self._task)
        return (observation, reward, done, infos)

    #### Sample num_tasks directions from a Bernoulli distribution with p = 0.5, map the
    #### outcomes {0, 1} to {-1, +1}, and wrap each one in a dict under the key 'direction'.
    def sample_tasks(self, num_tasks):
        directions = 2 * self.np_random.binomial(1, p=0.5, size=(num_tasks,)) - 1
        tasks = [{'direction': direction} for direction in directions]
        return tasks

    #### Reset the task: store the task dict and its target direction.
    def reset_task(self, task):
        self._task = task
        self._goal_dir = task['direction']
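The expression 2 * binomial(1, p=0.5) - 1 in sample_tasks maps the Bernoulli outcomes {0, 1} to the two directions {-1, +1}. The following standalone sketch, with its own RandomState and purely illustrative values, shows the mapping:

import numpy as np

rng = np.random.RandomState(0)               # standalone generator for the sketch
coins = rng.binomial(1, p=0.5, size=(5,))    # Bernoulli draws in {0, 1}
directions = 2 * coins - 1                   # 0 -> -1 (backward), 1 -> +1 (forward)
print(coins, directions)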

AntPosEnv()

class AntPosEnv(AntEnv):

    #### Ant environment with a target position, inheriting from AntEnv. The reward is
    #### composed of a control cost, a contact cost, a survival bonus, and a penalty
    #### equal to the L1 distance between the current torso position and the target
    #### position. Target positions are sampled from the uniform distribution on [-3, 3]^2.
    """Ant environment with target position. The code is adapted from
    https://github.com/cbfinn/maml_rl/blob/9c8e2ebd741cb0c7b8bf2d040c4caeeb8e06cc95/rllab/envs/mujoco/ant_env_rand_goal.py

    The ant follows the dynamics from MuJoCo [1], and receives at each
    time step a reward composed of a control cost, a contact cost, a survival
    reward, and a penalty equal to its L1 distance to the target position. The
    tasks are generated by sampling the target positions from the uniform
    distribution on [-3, 3]^2.

    [1] Emanuel Todorov, Tom Erez, Yuval Tassa, "MuJoCo: A physics engine for
        model-based control", 2012
        (https://homes.cs.washington.edu/~todorov/papers/TodorovIROS12.pdf)
    """

    #### Store the task dict and its 'position' entry (default (0, 0)), leave the action
    #### scaling uninitialized, and call the parent constructor.
    def __init__(self, task={}):
        self._task = task
        self._goal_pos = task.get('position', np.zeros((2,), dtype=np.float32))
        self._action_scaling = None
        super(AntPosEnv, self).__init__()

    def step(self, action):

        #### Run the simulation for self.frame_skip steps with the given action, then
        #### read the (x, y) position of the torso (xyposafter).
        self.do_simulation(action, self.frame_skip)
        xyposafter = self.get_body_com("torso")[:2]

        #### The goal reward is the negative L1 (Manhattan) distance to the target
        #### position plus a constant offset of 4.0; the survival bonus is 0.05; the
        #### control cost grows with the magnitude of the scaled action; the contact
        #### cost is computed from the clipped external contact forces.
        goal_reward = -np.sum(np.abs(xyposafter - self._goal_pos)) + 4.0
        survive_reward = 0.05
        ctrl_cost = 0.5 * 1e-2 * np.sum(np.square(action / self.action_scaling))
        contact_cost = 0.5 * 1e-3 * np.sum(
            np.square(np.clip(self.sim.data.cfrc_ext, -1, 1)))

        #### Build the observation via _get_obs() inherited from AntEnv and sum the
        #### reward terms. self.state_vector() concatenates qpos and qvel into one
        #### vector; the episode continues (notdone) as long as every state entry is
        #### finite and the torso height state[2] stays within [0.2, 1.0]; otherwise
        #### done is True. infos records the individual reward terms and the task.
        #### Finally return the usual (observation, reward, done, info) tuple.
        observation = self._get_obs()
        reward = goal_reward - ctrl_cost - contact_cost + survive_reward
        state = self.state_vector()
        notdone = np.isfinite(state).all() \
                  and state[2] >= 0.2 and state[2] <= 1.0
        done = not notdone
        infos = dict(reward_goal=goal_reward, reward_ctrl=-ctrl_cost,
                     reward_contact=-contact_cost, reward_survive=survive_reward,
                     task=self._task)
        return (observation, reward, done, infos)

    #### Sample num_tasks target positions from the uniform distribution on [-3.0, 3.0]^2
    #### and wrap each one in a dict under the key 'position'.
    def sample_tasks(self, num_tasks):
        positions = self.np_random.uniform(-3.0, 3.0, size=(num_tasks, 2))
        tasks = [{'position': position} for position in positions]
        return tasks

    #### Reset the task: store the task dict and its target position.
    def reset_task(self, task):
        self._task = task
        self._goal_pos = task['position']
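To contrast the three variants, here is a quick numeric sketch of their task-specific reward terms only (the control, contact, and survival terms are identical in all three); the numbers are made up for illustration.

import numpy as np

forward_vel = 1.2                                    # (xposafter - xposbefore) / dt
goal_vel, goal_dir = 2.0, -1                         # sampled task parameters
xypos = np.array([0.5, -0.3])                        # current torso (x, y)
goal_pos = np.array([2.0, 1.0])                      # sampled target position

vel_reward = -1.0 * np.abs(forward_vel - goal_vel) + 1.0   # AntVelEnv:  0.2
dir_reward = goal_dir * forward_vel                        # AntDirEnv: -1.2
pos_reward = -np.sum(np.abs(xypos - goal_pos)) + 4.0       # AntPosEnv:  1.2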