Playing Atari with Deep Reinforcement Learning

Research Topic

The goal is to obtain a model whose input is raw high-dimensional sensory input (pixels) and whose output is the Q-value function (one value per possible action).
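For concreteness, here is a minimal PyTorch sketch of such a Q-network. The layer sizes follow the architecture described in the paper (a stack of 4 preprocessed 84 × 84 frames in, one Q-value per action out), but treat the code as an illustrative sketch rather than the authors' exact implementation.

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Maps a stack of 4 preprocessed 84x84 frames to one Q-value per action."""
    def __init__(self, num_actions: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(4, 16, kernel_size=8, stride=4),   # 16 filters of 8x8, stride 4
            nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=4, stride=2),  # 32 filters of 4x4, stride 2
            nn.ReLU(),
            nn.Flatten(),
            nn.Linear(32 * 9 * 9, 256),                  # 84x84 -> 20x20 -> 9x9 feature maps
            nn.ReLU(),
            nn.Linear(256, num_actions),                 # one Q-value per legal action
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, 4, 84, 84), pixel values scaled to [0, 1]
        return self.net(x)
```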

Reinforcement learning presents several challenges from a deep learning perspective:

  1. most successful deep learning applications to date have required large amounts of hand-labelled training data. RL algorithms, on the other hand, must be able to learn from a scalar reward signal that is frequently sparse, noisy and delayed.
  2. most deep learning algorithms assume the data samples to be independent, while in reinforcement learning one typically encounters sequences of highly correlated states.
  3. in RL the data distribution changes as the algorithm learns new behaviours, which can be problematic for deep learning methods that assume a fixed underlying distribution.

This paper applies an experience replay mechanism to alleviate the problems of correlated data and non-stationary distributions.
(Correction: the experience replay mechanism was not first proposed in this paper; it was introduced in an earlier work, "Reinforcement learning for robots using neural networks".)

Method

[Figure: Algorithm 1 from the paper, deep Q-learning with experience replay]
This approach has several advantages over standard online Q-learning:

  1. each step of experience is potentially used in many weight updates, which allows for greater data efficiency.
  2. learning directly from consecutive samples is inefficient due to the strong correlations between the samples; randomizing the samples breaks these correlations and therefore reduces the variance of the updates (see the replay-buffer sketch after this list).
  3. when learning on-policy, the current parameters determine the next data sample that the parameters are trained on. By using experience replay the behavior distribution is averaged over many of its previous states, smoothing out learning and avoiding oscillations or divergence in the parameters.
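A minimal sketch of such a replay memory, assuming simple uniform sampling (the class name, capacity, and batch size below are illustrative choices, not values from the paper):

```python
import random
from collections import deque

class ReplayBuffer:
    """Stores (state, action, reward, next_state, done) transitions and samples
    them uniformly at random, breaking correlations between consecutive steps."""
    def __init__(self, capacity: int = 100_000):
        self.buffer = deque(maxlen=capacity)  # oldest transitions are discarded automatically

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size: int = 32):
        batch = random.sample(self.buffer, batch_size)
        states, actions, rewards, next_states, dones = zip(*batch)
        return states, actions, rewards, next_states, dones

    def __len__(self):
        return len(self.buffer)
```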

data preprocessing and model architecture

raw data: 210 × 160 pixel images with a 128-color palette → gray-scale → down-sampled to a 110 × 84 image → final input representation: crop an 84 × 84 region of the image that roughly captures the playing area

The same preprocessing is applied to the last 4 frames of a history, and the results are stacked to produce the input to the Q-function.
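A rough sketch of this pipeline in Python (using OpenCV for resizing and assuming RGB input frames; the crop offset `crop_top` is an assumption, since the paper only says the region roughly captures the playing area):

```python
import numpy as np
import cv2

FRAME_HISTORY = 4  # number of preprocessed frames stacked as the Q-network input

def preprocess(frame: np.ndarray, crop_top: int = 18) -> np.ndarray:
    """210x160x3 RGB frame -> 84x84 gray-scale crop of the playing area.

    crop_top is an assumed offset; the paper does not give the exact value."""
    gray = cv2.cvtColor(frame, cv2.COLOR_RGB2GRAY)                      # to gray-scale
    small = cv2.resize(gray, (84, 110), interpolation=cv2.INTER_AREA)   # down-sample to 110x84
    return small[crop_top:crop_top + 84, :]                             # crop an 84x84 region

def stack_frames(last_frames) -> np.ndarray:
    """Stack the last 4 preprocessed frames into a (4, 84, 84) array."""
    assert len(last_frames) == FRAME_HISTORY
    return np.stack([preprocess(f) for f in last_frames], axis=0)
```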

Experiment

reward

While they evaluated the agents on the real and unmodified games, they made one change to the reward structure of the games during training only. Since the scale of scores varies greatly from game to game, they fixed all positive rewards to be 1 and all negative rewards to be -1, leaving 0 rewards unchanged.
Clipping the rewards in this manner limits the scale of the error derivatives and makes it easier to use the same learning rate across multiple games. At the same time, it could affect the performance of the agent, since it cannot differentiate between rewards of different magnitude.
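In code, this clipping is simply a sign function on the raw score change (a minimal sketch):

```python
def clip_reward(raw_reward: float) -> float:
    """Map positive rewards to +1, negative rewards to -1, and leave 0 unchanged."""
    if raw_reward > 0:
        return 1.0
    if raw_reward < 0:
        return -1.0
    return 0.0
```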

They also used one more trick, frame-skipping: the agent sees and selects actions only on every k-th frame, and its last action is repeated on the skipped frames (k = 4 for most games, k = 3 for Space Invaders).
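A rough sketch of a frame-skipping wrapper, assuming a Gym-style `env.step` interface (the wrapper itself is not part of the paper):

```python
class FrameSkip:
    """Repeat the chosen action for k frames and accumulate the reward,
    so the agent only selects a new action on every k-th frame."""
    def __init__(self, env, k: int = 4):
        self.env = env
        self.k = k

    def step(self, action):
        total_reward, done = 0.0, False
        for _ in range(self.k):
            obs, reward, done, info = self.env.step(action)  # assumed Gym-style API
            total_reward += reward
            if done:
                break
        return obs, total_reward, done, info
```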

training and stability
  • evaluation metric
    • the total reward the agent collects in an episode or game, averaged over a number of games
    • the policy's estimated action-value function Q: they collect a fixed set of states by running a random policy before training starts and track the average of the maximum predicted Q for these states (a minimal sketch of this follows below)
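A minimal sketch of the second metric, assuming `q_network` is a PyTorch module like the one sketched earlier and `heldout_states` is the fixed batch of states collected by the random policy:

```python
import torch

@torch.no_grad()
def average_max_q(q_network, heldout_states: torch.Tensor) -> float:
    """heldout_states: (N, 4, 84, 84) batch of states collected before training.

    Returns the mean over these states of max_a Q(s, a), which should rise
    smoothly as training progresses."""
    q_values = q_network(heldout_states)              # (N, num_actions)
    return q_values.max(dim=1).values.mean().item()
```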