Playing Atari with Deep Reinforcement Learning

Research Topic

The goal is to obtain a model whose input is raw high-dimensional sensory input (pixels) and whose output is the Q-value function (one value per possible action).
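For concreteness, here is a minimal PyTorch sketch of such a Q-network. The layer sizes follow the architecture described in the paper (a stack of 4 preprocessed 84 × 84 frames in, one Q-value per action out), but treat the code as an illustrative sketch rather than the authors' exact implementation.

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Maps a stack of 4 preprocessed 84x84 frames to one Q-value per action."""
    def __init__(self, num_actions: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(4, 16, kernel_size=8, stride=4),   # 16 filters of 8x8, stride 4
            nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=4, stride=2),  # 32 filters of 4x4, stride 2
            nn.ReLU(),
            nn.Flatten(),
            nn.Linear(32 * 9 * 9, 256),                  # 84x84 -> 20x20 -> 9x9 feature maps
            nn.ReLU(),
            nn.Linear(256, num_actions),                 # one Q-value per legal action
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, 4, 84, 84), pixel values scaled to [0, 1]
        return self.net(x)
```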

Reinforcement learning presents several challenges from a deep learning perspective:

  1. most successful deep learning applications to date have required large amounts of hand-labelled training data. RL algorithms, on the other hand, must be able to learn from a scalar reward signal that is frequently sparse, noisy and delayed.
  2. most deep learning algorithms assume the data samples to be independent, while in reinforcement learning one typically encounters sequences of highly correlated states.
  3. in RL the data distribution changes as the algorithm learns new behaviours, which can be problematic for deep learning methods that assume a fixed underlying distribution.

This paper applies an experience replay mechanism to alleviate the problems of correlated data and non-stationary distributions.
(Correction: the experience replay mechanism was not first proposed in this paper; it was introduced in an earlier work, "Reinforcement learning for robots using neural networks".)

Method

[Figure: Algorithm 1 from the paper, deep Q-learning with experience replay]
This approach has several advantages over standard online Q-learning:

  1. each step of experience is potentially used in many weight updates, which allows for greater data efficiency.
  2. learning directly from consecutive samples is inefficient due to the strong correlations between the samples; randomizing the samples breaks these correlations and therefore reduces the variance of the updates (see the replay-buffer sketch after this list).
  3. when learning on-policy, the current parameters determine the next data sample that the parameters are trained on. By using experience replay the behavior distribution is averaged over many of its previous states, smoothing out learning and avoiding oscillations or divergence in the parameters.
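A minimal sketch of such a replay memory, assuming simple uniform sampling (the class name, capacity, and batch size below are illustrative choices, not values from the paper):

```python
import random
from collections import deque

class ReplayBuffer:
    """Stores (state, action, reward, next_state, done) transitions and samples
    them uniformly at random, breaking correlations between consecutive steps."""
    def __init__(self, capacity: int = 100_000):
        self.buffer = deque(maxlen=capacity)  # oldest transitions are discarded automatically

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size: int = 32):
        batch = random.sample(self.buffer, batch_size)
        states, actions, rewards, next_states, dones = zip(*batch)
        return states, actions, rewards, next_states, dones

    def __len__(self):
        return len(self.buffer)
```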

data preprocessing and model architecture

raw data: 210 × 160 pixel images with a 128-color palette → gray-scale → down-sampled to a 110 × 84 image → final input representation: crop an 84 × 84 region of the image that roughly captures the playing area

The same preprocessing is applied to the last 4 frames of a history, and the results are stacked to produce the input to the Q-function.
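A rough sketch of this pipeline in Python (using OpenCV for resizing and assuming RGB input frames; the crop offset `crop_top` is an assumption, since the paper only says the region roughly captures the playing area):

```python
import numpy as np
import cv2

FRAME_HISTORY = 4  # number of preprocessed frames stacked as the Q-network input

def preprocess(frame: np.ndarray, crop_top: int = 18) -> np.ndarray:
    """210x160x3 RGB frame -> 84x84 gray-scale crop of the playing area.

    crop_top is an assumed offset; the paper does not give the exact value."""
    gray = cv2.cvtColor(frame, cv2.COLOR_RGB2GRAY)                      # to gray-scale
    small = cv2.resize(gray, (84, 110), interpolation=cv2.INTER_AREA)   # down-sample to 110x84
    return small[crop_top:crop_top + 84, :]                             # crop an 84x84 region

def stack_frames(last_frames) -> np.ndarray:
    """Stack the last 4 preprocessed frames into a (4, 84, 84) array."""
    assert len(last_frames) == FRAME_HISTORY
    return np.stack([preprocess(f) for f in last_frames], axis=0)
```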

Experiment

reward

While they evaluated the agents on the real and unmodified games, they made one change to the reward structure of the games during training only. Since the scale of scores varies greatly from game to game, they fixed all positive rewards to be 1 and all negative rewards to be -1, leaving 0 rewards unchanged.
Clipping the rewards in this manner limits the scale of the error derivatives and makes it easier to use the same learning rate across multiple games. At the same time, it could affect the performance of the agent, since it cannot differentiate between rewards of different magnitude.
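In code, this clipping is simply a sign function on the raw score change (a minimal sketch):

```python
def clip_reward(raw_reward: float) -> float:
    """Map positive rewards to +1, negative rewards to -1, and leave 0 unchanged."""
    if raw_reward > 0:
        return 1.0
    if raw_reward < 0:
        return -1.0
    return 0.0
```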

They also used one more trick, frame-skipping: the agent sees and selects actions only on every k-th frame, and its last action is repeated on the skipped frames (k = 4 for most games, k = 3 for Space Invaders).
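A rough sketch of a frame-skipping wrapper, assuming a Gym-style `env.step` interface (the wrapper itself is not part of the paper):

```python
class FrameSkip:
    """Repeat the chosen action for k frames and accumulate the reward,
    so the agent only selects a new action on every k-th frame."""
    def __init__(self, env, k: int = 4):
        self.env = env
        self.k = k

    def step(self, action):
        total_reward, done = 0.0, False
        for _ in range(self.k):
            obs, reward, done, info = self.env.step(action)  # assumed Gym-style API
            total_reward += reward
            if done:
                break
        return obs, total_reward, done, info
```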

training and stability
  • evaluation metric
    • the total reward the agent collects in an episode or game, averaged over a number of games
    • the policy's estimated action-value function Q: they collect a fixed set of states by running a random policy before training starts and track the average of the maximum predicted Q for these states (a minimal sketch of this follows below)
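A minimal sketch of the second metric, assuming `q_network` is a PyTorch module like the one sketched earlier and `heldout_states` is the fixed batch of states collected by the random policy:

```python
import torch

@torch.no_grad()
def average_max_q(q_network, heldout_states: torch.Tensor) -> float:
    """heldout_states: (N, 4, 84, 84) batch of states collected before training.

    Returns the mean over these states of max_a Q(s, a), which should rise
    smoothly as training progresses."""
    q_values = q_network(heldout_states)              # (N, num_actions)
    return q_values.max(dim=1).values.mean().item()
```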