Deep Reinforcement Learning: Pong from Pixels — translation and notes


Original post:

http://karpathy.github.io/2016/05/31/rl/

Preface

Let me give a quick overview first; without it, the rest of the post is hard to follow.
The rough idea of the Policy Gradient (PG) algorithm: the input is a state, the output is an action, and the environment occasionally gives you some feedback (rewards), much as it does for a person.
The network update differs from supervised learning: supervised learning tells you directly which action is correct, but PG has no such label. So what do we do?
We have to rely on the crude feedback the environment provides: if an action earned a good reward, we substantially increase the probability of taking that action;
if it was punished, we suppress the probability of taking that action.
The network's update gradient is roughly: (the gradient for this action) ∇θ log p(x;θ) × (its score) f(x).
We use this pseudo-label to update the network,
so that its decisions become smarter.
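
To make the preface concrete, here is a tiny numpy sketch of that update rule (purely illustrative: the logistic policy, the variable names, and the learning rate are my own assumptions, not the post's code):

import numpy as np

# Sketch only: a logistic policy over two actions (UP = 1, DOWN = 0).
# theta: parameters, s: state vector, a: sampled action, f: the score/reward for that action.
def pg_update(theta, s, a, f, lr=0.01):
    p_up = 1.0 / (1.0 + np.exp(-np.dot(theta, s)))   # P(UP | s)
    grad_logp = (a - p_up) * s                        # d log p(a|s) / d theta for this logistic policy
    return theta + lr * f * grad_logp                 # step along grad(log p) scaled by the score f

A positive f pushes the sampled action's probability up; a negative f pushes it down, which is exactly the behaviour described above.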

Every time I reread this, I understand something new.
I have updated some of the content again, and also merged in the translation by a very capable translator from HIT — many thanks and much respect; their translation is far better than mine.
Let me also add the PG flow diagram I drew today.

Policy Gradient structure / flow diagram

[Figure: Policy Gradient structure flow diagram]
And the Actor-Critic (AC) one too, for comparison:
[Figure: Actor-Critic structure flow diagram]
You can see that the f(K) in PG is replaced by the output of the critic net. For an introduction to AC networks, Morvan Zhou's tutorial covers it; there is not much more to say here.
Hmm, after saying all this, I get the feeling not many people will actually read this post...

Deep Reinforcement Learning: Pong from Pixels

May 31, 2016

This is a long overdue blog post on Reinforcement Learning (RL). RL is hot! You may have noticed that computers can now automatically learn to play ATARI games (from raw game pixels!), they are beating world champions at Go, simulated quadrupeds are learning to run and leap, and robots are learning how to perform complex manipulation tasks that defy explicit programming. It turns out that all of these advances fall under the umbrella of RL research. I also became interested in RL myself over the last ~year: I worked through Richard Sutton’s book, read through David Silver’s course, watched John Schulmann’s lectures, wrote an RL library in Javascript, over the summer interned at DeepMind working in the DeepRL group, and most recently pitched in a little with the design/development of OpenAI Gym, a new RL benchmarking toolkit. So I’ve certainly been on this funwagon for at least a year but until now I haven’t gotten around to writing up a short post on why RL is a big deal, what it’s about, how it all developed and where it might be going.

This paragraph uses another blogger's translation — it turns out someone had already translated the post long ago.

I re-uploaded all of the images in case some of the originals become unreachable.

This is a long-overdue post about reinforcement learning (RL). RL is hot! You may have noticed that computers can now automatically learn to play Atari games (from raw game pixels!), they have beaten world champions at Go, quadruped robots are learning to run and leap, and robots are learning to perform complex manipulation tasks without being explicitly programmed. All of these advances come out of RL research. I became interested in RL myself over the past year: I worked through Richard Sutton's book and David Silver's course, watched John Schulman's lectures, wrote an RL library in Javascript, interned at DeepMind over the summer, and most recently helped a little with the design and development of OpenAI Gym. So I have been riding this fun wagon for at least a year, but only now have I found the time to write about why RL is a big deal, what it is about, how it developed, and where it might be going.

Examples of RL in the wild. From left to right: Deep Q Learning network playing ATARI, AlphaGo, Berkeley robot stacking Legos, physically-simulated quadruped leaping over terrain.

It’s interesting to reflect on the nature of recent progress in RL. I broadly like to think about four separate factors that hold back AI:

  • Compute (the obvious one: Moore’s Law, GPUs, ASICs),

  • Data (in a nice form, not just out there somewhere on the internet - e.g. ImageNet),

  • Algorithms (research and ideas, e.g. backprop, CNN, LSTM), and

  • Infrastructure (software under you - Linux, TCP/IP, Git, ROS, PR2, AWS, AMT, TensorFlow, etc.).
Starting from RL's progress in recent years, here are the four key factors behind AI's advance, translated:

  • Compute (Moore's law, GPUs, ASICs)

  • Data (in a nice form, e.g. ImageNet)

  • Algorithms (research and ideas, e.g. backprop, ConvNets, LSTM), and

  • Infrastructure (Linux, TCP/IP, Git, ROS, PR2, AWS, AMT, TensorFlow, etc.)

Similar to what happened in Computer Vision, the progress in RL is not driven as much as you might reasonably assume by new amazing ideas. In Computer Vision, the 2012 AlexNet was mostly a scaled up (deeper and wider) version of 1990’s ConvNets. Similarly, the ATARI Deep Q Learning paper from 2013 is an implementation of a standard algorithm (Q Learning with function approximation, which you can find in the standard RL book of Sutton 1998), where the function approximator happened to be a ConvNet. AlphaGo uses policy gradients with Monte Carlo Tree Search (MCTS) - these are also standard components. Of course, it takes a lot of skill and patience to get it to work, and multiple clever tweaks on top of old algorithms have been developed, but to a first-order approximation the main driver of recent progress is not the algorithms but (similar to Computer Vision) compute/data/infrastructure.

Similar to what happened in computer vision, the recent progress in RL is not driven by amazing new ideas. In computer vision, the 2012 AlexNet was largely a scaled-up (deeper and wider) version of 1990s ConvNets. Likewise, the 2013 ATARI Deep Q-Learning paper is an implementation of a standard algorithm (Q-Learning with function approximation), except that the function approximator happens to be a ConvNet. AlphaGo uses policy gradients with Monte Carlo Tree Search (MCTS) — also standard, existing components. Of course, it takes a lot of skill and patience to make these work, and many clever tweaks have been developed on top of the old algorithms, but to a first approximation the main driver of recent progress is not new algorithms but compute, data, and infrastructure.

Now back to RL. Whenever there is a disconnect between how magical something seems and how simple it is under the hood I get all antsy and really want to write a blog post. In this case I’ve seen many people who can’t believe that we can automatically learn to play most ATARI games at human level, with one algorithm, from pixels, and from scratch - and it is amazing, and I’ve been there myself! But at the core the approach we use is also really quite profoundly dumb (though I understand it’s easy to make such claims in retrospect). Anyway, I’d like to walk you through Policy Gradients (PG), our favorite default choice for attacking RL problems at the moment. If you’re from outside of RL you might be curious why I’m not presenting DQN instead, which is an alternative and better-known RL algorithm, widely popularized by the ATARI game playing paper. It turns out that Q-Learning is not a great algorithm (you could say that DQN is so 2013 (okay I’m 50% joking)). In fact most people prefer to use Policy Gradients, including the authors of the original DQN paper who have shown Policy Gradients to work better than Q Learning when tuned well. PG is preferred because it is end-to-end: there’s an explicit policy and a principled approach that directly optimizes the expected reward. Anyway, as a running example we’ll learn to play an ATARI game (Pong!) with PG, from scratch, from pixels, with a deep neural network, and the whole thing is 130 lines of Python only using numpy as a dependency (Gist link). Lets get to it.
Now back to reinforcement learning. Whenever there is a gap between how magical something seems and how simple it is under the hood, I get antsy and want to write a post about it. In this case I have met many people who simply cannot believe that we can automatically learn to play most ATARI games at a human level, with one algorithm, from pixels, from scratch — and it is amazing; I was there myself! But at its core the approach we use is really quite profoundly dumb (though I admit it is easy to say that in hindsight). Anyway, I will walk you through Policy Gradients (PG), our favorite default choice for attacking RL problems at the moment. If you are from outside RL you might wonder why I am not presenting DQN instead — an alternative, better-known RL algorithm, widely popularized by the ATARI game-playing paper (from DeepMind). It turns out Q-Learning is not actually a great algorithm, and most people prefer policy gradients, including the authors of the original DQN paper, who showed that well-tuned policy gradients work better than Q-Learning. PG is preferred because it is end-to-end: there is an explicit policy and a principled approach that directly optimizes the expected reward. Anyway, as a running example we will learn to play an ATARI game (Pong!) with PG, from scratch, from pixels, with a deep neural network; the whole thing is 130 lines of Python using only numpy. Let's get to it.

From this paragraph on, the translation is my own.

Back to reinforcement learning. Whenever something looks magical while the principle behind it is simple, I get restless and want to write a post about it (same here?). I have found that many people do not believe we can make a machine learn to play ATARI games automatically, from pixel images, from scratch, with a single algorithm, up to human level — which is astonishing, and I used to think so too! Yet the core method we use is still rather crude. In any case, I want to walk you through the PG algorithm, currently the most common weapon for attacking RL problems. If you are an RL outsider, you may ask why we don't present DQN instead, an alternative and widely known RL algorithm that appears throughout the ATARI game papers. It turns out Q-learning is not that great an algorithm (you could even say DQN is "so 2013" — meaning outdated, I suppose?). Most people lean toward PG, and even the authors of the original DQN paper showed that PG does better when properly tuned. PG is the preferred choice: end-to-end, with an explicit policy and a principled way of directly optimizing the expected return (doesn't that sound like DDPG?). As a running example, we will use the Pong simulation environment and play the game with PG, from scratch (presumably meaning randomly initialized parameters, no prior knowledge?), from the raw image pixels, with a reasonably deep network; the whole example is 130 lines of code using only numpy — no TensorFlow or PyTorch. Neat. Let's get started!
Translating is slow; I probably should not be translating these less essential passages, since my main focus should be on learning DDPG, which I have not finished, and there is only one day left. A bit panicked...

Pong from pixels
Left: The game of Pong. Right: Pong is a special case of a Markov Decision Process (MDP): A graph where each node is a particular game state and each edge is a possible (in general probabilistic) transition. Each edge also gives a reward, and the goal is to compute the optimal way of acting in any state to maximize rewards.

The translation proper:

The game of Pong is an excellent example of a simple RL task. In the ATARI 2600 version we’ll use you play as one of the paddles (the other is controlled by a decent AI) and you have to bounce the ball past the other player (I don’t really have to explain Pong, right?). On the low level the game works as follows: we receive an image frame (a 210x160x3 byte array (integers from 0 to 255 giving pixel values)) and we get to decide if we want to move the paddle UP or DOWN (i.e. a binary choice). After every single choice the game simulator executes the action and gives us a reward: Either a +1 reward if the ball went past the opponent, a -1 reward if we missed the ball, or 0 otherwise. And of course, our goal is to move the paddle so that we get lots of reward.
The gist:
The input image is 210×160×3 with integer pixel values between 0 and 255; the output action is to move the paddle UP or DOWN. After every single choice, the game simulator executes the action and gives us a reward: +1 if the ball went past the opponent, -1 if we missed the ball, and 0 otherwise. Our goal, of course, is to move the paddle so that we collect lots of reward.

As we go through the solution keep in mind that we’ll try to make very few assumptions about Pong because we secretly don’t really care about Pong; We care about complex, high-dimensional problems like robot manipulation, assembly and navigation. Pong is just a fun toy test case, something we play with while we figure out how to write very general AI systems that can one day do arbitrary useful tasks.
As we work through the solution, keep in mind that we will make very few assumptions about Pong, because secretly we do not really care about Pong; we care about complex, high-dimensional problems such as robot manipulation, assembly, and navigation. Pong is just a fun toy test case, something to play with while we figure out how to write very general AI systems that could one day carry out arbitrary useful tasks.

Policy network. First, we’re going to define a policy network that implements our player (or “agent”). This network will take the state of the game and decide what we should do (move UP or DOWN). As our favorite simple block of compute we’ll use a 2-layer neural network that takes the raw image pixels (100,800 numbers total (2101603)), and produces a single number indicating the probability of going UP. Note that it is standard to use a stochastic policy, meaning that we only produce a probability of moving UP. Every iteration we will sample from this distribution (i.e. toss a biased coin) to get the actual move. The reason for this will become more clear once we talk about training.
Policy network. First, we define the policy network that implements our player (or "agent"). This network takes the game state and decides what to do (move UP or DOWN). As our favorite simple block of compute, we use a 2-layer fully connected neural network that takes the raw image pixels (100,800 numbers in total, 210×160×3) and produces a single number: the probability of going UP. Note that it is standard to use a stochastic policy, meaning we only output the probability of moving UP. Every iteration we sample from this distribution (i.e. toss a biased coin) to get the actual move. The reason for this will become clearer once we talk about training.

Our policy network is a 2-layer fully-connected net.

and to make things concrete here is how you might implement this policy network in Python/numpy. Suppose we’re given a vector x that holds the (preprocessed) pixel information. We would compute:
To make this concrete, here is how you might implement the policy network in Python/numpy. Suppose we are given a vector x holding the (preprocessed) pixel information. We would compute:

h = np.dot(W1, x) # compute hidden layer neuron activations
h[h<0] = 0 # ReLU nonlinearity: threshold at zero
logp = np.dot(W2, h) # compute log probability of going up
p = 1.0 / (1.0 + np.exp(-logp)) # sigmoid function (gives probability of going up)

where in this snippet W1 and W2 are two matrices that we initialize randomly. We’re not using biases because meh. Notice that we use the sigmoid non-linearity at the end, which squashes the output probability to the range [0,1]. Intuitively, the neurons in the hidden layer (which have their weights arranged along the rows of W1) can detect various game scenarios (e.g. the ball is in the top, and our paddle is in the middle), and the weights in W2 can then decide if in each case we should be going UP or DOWN. Now, the initial random W1 and W2 will of course cause the player to spasm on spot. So the only problem now is to find W1 and W2 that lead to expert play of Pong!

In this snippet, W1 and W2 are two matrices that we initialize randomly. We are not using biases because, meh. Note the sigmoid non-linearity at the end, which squashes the output probability into the range [0,1]. Intuitively, the neurons in the hidden layer (whose weights are arranged along the rows of W1) can detect various game scenarios (e.g. the ball is at the top and our paddle is in the middle), and the weights in W2 then decide whether, in each of those cases, we should go UP or DOWN. Of course, the initial random W1 and W2 will just make the player twitch on the spot. So the only remaining problem is to find a W1 and W2 that lead to expert play of Pong!

Fine print: preprocessing. Ideally you’d want to feed at least 2 frames to the policy network so that it can detect motion. To make things a bit simpler (I did these experiments on my Macbook) I’ll do a tiny bit of preprocessing, e.g. we’ll actually feed difference frames to the network (i.e. subtraction of current and last frame).
Fine print (roughly, a small caveat): preprocessing. Ideally you would feed at least 2 frames to the policy network so that it can detect motion. To keep things a bit simpler (I ran these experiments on my Macbook), I do a tiny bit of preprocessing: we actually feed difference frames to the network (i.e. the current frame minus the last frame).
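
As an aside, here is a rough sketch of what such preprocessing could look like. The crop region and the background color values are assumptions on my part (the later part of the post only says the input ends up as an 80×80 difference image), so treat this as illustrative rather than the post's exact code:

import numpy as np

def preprocess(frame):
    # frame: 210x160x3 uint8 ATARI image -> 6400 (80x80) float vector
    f = frame[35:195]                 # crop to the playing field (assumed region)
    f = f[::2, ::2, 0]                # downsample by 2 and keep one color channel -> 80x80
    f = f.astype(np.float64)
    f[f == 144] = 0                   # erase background (palette values assumed)
    f[f == 109] = 0
    f[f != 0] = 1                     # paddles and ball become 1
    return f.ravel()

prev_frame = None
def frame_to_input(frame):
    # feed the difference of consecutive preprocessed frames, as described above
    global prev_frame
    cur = preprocess(frame)
    x = cur - prev_frame if prev_frame is not None else np.zeros_like(cur)
    prev_frame = cur
    return x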

It sounds kind of impossible. At this point I’d like you to appreciate just how difficult the RL problem is. We get 100,800 numbers (2101603) and forward our policy network (which easily involves on order of a million parameters in W1 and W2). Suppose that we decide to go UP. The game might respond that we get 0 reward this time step and gives us another 100,800 numbers for the next frame. We could repeat this process for hundred timesteps before we get any non-zero reward! E.g. suppose we finally get a +1. That’s great, but how can we tell what made that happen? Was it something we did just now? Or maybe 76 frames ago? Or maybe it had something to do with frame 10 and then frame 90? And how do we figure out which of the million knobs to change and how, in order to do better in the future? We call this the credit assignment problem. In the specific case of Pong we know that we get a +1 if the ball makes it past the opponent. The true cause is that we happened to bounce the ball on a good trajectory, but in fact we did so many frames ago - e.g. maybe about 20 in case of Pong, and every single action we did afterwards had zero effect on whether or not we end up getting the reward. In other words we’re faced with a very difficult problem and things are looking quite bleak.

This sounds almost impossible. At this point I want you to appreciate just how hard the RL problem is. We get 100,800 numbers (210×160×3) and forward them through our policy network (which easily involves on the order of a million parameters in W1 and W2). Suppose we decide to go UP. The game might respond that we get 0 reward this time step, and hand us another 100,800 numbers for the next frame. We might repeat this for a hundred time steps before we get any non-zero reward! Say we finally get a +1. Great, but what made that happen? Something we did just now? Or maybe 76 frames ago? Or did it have something to do with frame 10 and then frame 90? And how do we figure out which of the million knobs to change, and how, to do better in the future? We call this the credit assignment problem. In the specific case of Pong we know we get a +1 if the ball makes it past the opponent. The true cause is that we happened to bounce the ball back on a good trajectory, but we did that many frames earlier — maybe about 20 frames in Pong's case — and every single action we took afterwards had no effect on whether we eventually got the reward. You can see how hard this is!
In other words, we face a very difficult problem, and things look quite bleak.

Supervised Learning. Before we dive into the Policy Gradients solution I’d like to remind you briefly about supervised learning because, as we’ll see, RL is very similar. Refer to the diagram below. In ordinary supervised learning we would feed an image to the network and get some probabilities, e.g. for two classes UP and DOWN. I’m showing log probabilities (-1.2, -0.36) for UP and DOWN instead of the raw probabilities (30% and 70% in this case) because we always optimize the log probability of the correct label (this makes math nicer, and is equivalent to optimizing the raw probability because log is monotonic). Now, in supervised learning we would have access to a label. For example, we might be told that the correct thing to do right now is to go UP (label 0). In an implementation we would enter gradient of 1.0 on the log probability of UP and run backprop to compute the gradient vector ∇Wlogp(y=UP∣x). This gradient would tell us how we should change every one of our million parameters to make the network slightly more likely to predict UP. For example, one of the million parameters in the network might have a gradient of -2.1, which means that if we were to increase that parameter by a small positive amount (e.g. 0.001), the log probability of UP would decrease by 2.1 * 0.001 (decrease due to the negative sign). If we then did a parameter update then, yay, our network would now be slightly more likely to predict UP when it sees a very similar image in the future.
Let's first look at how supervised learning handles this:
Supervised learning. Before we dive into the policy-gradient solution, I want to briefly recall supervised learning because, as we will see, RL is very similar. Refer to the diagram below. In ordinary supervised learning we feed an image to the network and obtain some probabilities, e.g. for the two classes UP and DOWN. I show the log probabilities of UP and DOWN, (-1.2, -0.36), instead of the raw probabilities (30% and 70% in this case), because we always optimize the log probability of the correct label (this makes the math nicer and is equivalent to optimizing the raw probability, since log is monotonic). In supervised learning we have access to a label. For example, we might be told that the correct thing to do right now is UP (label 0). In an implementation we would enter a gradient of 1.0 on the log probability of UP and run backprop to compute the gradient vector ∇_W log p(y = UP ∣ x). This gradient tells us how to change each of the million parameters to make the network slightly more likely to predict UP. For example, one of those million parameters might have a gradient of -2.1, meaning that if we increased that parameter by a small positive amount (e.g. 0.001, think of a learning rate), the log probability of UP would decrease by 2.1 × 0.001 (decrease because of the negative sign). If we then did a parameter update, our network would be slightly more likely to predict UP the next time it sees a very similar image.

[Figure: the supervised learning case — a gradient of 1.0 is entered on the log probability of the correct label (UP)]
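
To make the "enter a gradient of 1.0 on log p(UP|x) and backprop" step concrete, here is a hedged numpy sketch that reuses the W1/W2 network from the snippet above (the function name and the assumption that W2 is a vector are mine):

import numpy as np

def grads_for_label_up(W1, W2, x):
    # forward pass, same as the policy network snippet above
    h = np.dot(W1, x)
    h[h < 0] = 0                                # ReLU
    p = 1.0 / (1.0 + np.exp(-np.dot(W2, h)))    # p = P(UP | x)
    # backward pass for the objective log p(UP | x)
    dlogit = 1.0 - p                            # d log p(UP|x) / d logit for a sigmoid output
    dW2 = dlogit * h                            # gradient for the output weights
    dh = dlogit * W2                            # backprop into the hidden layer
    dh[h <= 0] = 0                              # ReLU gate
    dW1 = np.outer(dh, x)                       # gradient for the first-layer weights
    return dW1, dW2                             # ascending these makes UP more likely
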
Policy Gradients. Okay, but what do we do if we do not have the correct label in the Reinforcement Learning setting? Here is the Policy Gradients solution (again refer to diagram below). Our policy network calculated probability of going UP as 30% (logprob -1.2) and DOWN as 70% (logprob -0.36). We will now sample an action from this distribution; E.g. suppose we sample DOWN, and we will execute it in the game. At this point notice one interesting fact: We could immediately fill in a gradient of 1.0 for DOWN as we did in supervised learning, and find the gradient vector that would encourage the network to be slightly more likely to do the DOWN action in the future. So we can immediately evaluate this gradient and that’s great, but the problem is that at least for now we do not yet know if going DOWN is good. But the critical point is that that’s okay, because we can simply wait a bit and see! For example in Pong we could wait until the end of the game, then take the reward we get (either +1 if we won or -1 if we lost), and enter that scalar as the gradient for the action we have taken (DOWN in this case). In the example below, going DOWN ended up to us losing the game (-1 reward). So if we fill in -1 for log probability of DOWN and do backprop we will find a gradient that discourages the network to take the DOWN action for that input in the future (and rightly so, since taking that action led to us losing the game).

Policy gradients:

The concept of a gradient

Before the material below, let's first review what a gradient does: in a neural network, how do we make the output move toward the result we want?

Two references, in case my own explanation falls short:
https://blog.csdn.net/zhulf0804/article/details/52250220
or this video, from 6:30 onward:
https://www.bilibili.com/video/av32598744

Suppose that for an input x, after passing through our weights w, the output y comes out as 0.7 while the correct answer is 1. How should we update w so that the output gets closer to 1?
This is where the gradient comes in.
During the update, the independent variable becomes w (x is now fixed, effectively a constant) and the dependent variable is still y. How do we change w to make y larger? Intuitively, move along the positive direction of the gradient of y with respect to w, and y will increase. That is all there is to it.
The supervised learning in the video is framed as making a loss smaller, which is essentially the same thing:
if the prediction is y = 0.7 and the target is 1, we want loss = (1 - 0.7)^2 = 0.09 to shrink; the dependent variable is the loss, and shrinking it amounts to making y larger.
If the target were 0.2, then loss = (0.2 - 0.7)^2 = 0.25, and shrinking the loss means making y smaller.
So, roughly: updating the parameters along the positive gradient direction makes the output y larger, and the negative direction makes y smaller. Keep that conclusion in mind and you are mostly set.


(Minor meltdown: this morning I wrote a section on why this kind of gradient update improves things, then restarted my computer without saving, and it is all gone.)


Rewriting it:
Task: how do we train the policy network?
Assumption: in supervised learning we already know, for every state, which action to take, so training the network is very simple.
That is, if the correct action in state s_i is a_i, then the policy network's output probability π(s_i) = p(a_i|s_i) should eventually converge to a large value. Suppose that at initialization p(a_i|s_i) is only 0.5; we then need to maximize p(a_i|s_i). In TensorFlow and most other frameworks the usual operation is to minimize something, so we flip it around and minimize -p(a_i|s_i); the update then makes the output probability p(a_i|s_i) larger for the same input state, leaning toward the correct action.
OK, so we have arrived at -p(a_i|s_i) — why don't we use that directly? Because -log(p(a_i|s_i)), i.e. -ln(p(a_i|s_i)), behaves like a cross-entropy loss and works better. Fine: my program minimizes -log(p(a_i|s_i)); does p(a_i|s_i) then actually increase?
Here is the figure I drew last time:
[Figure: plot of -log(p) against p]
Now it is clearer: -log(p(a_i|s_i)) is positive for p(a_i|s_i) < 1 and monotonically decreasing in p, so minimizing -log(p(a_i|s_i)) pushes p(a_i|s_i) toward 1, i.e. makes it larger. It really does work!
The formula in textbooks and papers is this one:
[Figure: the objective from the textbooks — the same per-step term summed over all steps of an episode]
Why that form? Because a single game takes many steps to finish, so we add the terms for every step together and the updates do not have to happen so frequently.

But! This is reinforcement learning: the program does not know the correct action for state s_i (s is the state, i is the step index, a time-like label; same below). It only knows that it took action a_i — not necessarily the correct one — and received reward r_i.

So when training the policy network, we assign each a_i a value A_i, usually called the Advantage. A_i > 0 means action a_i was a good action; A_i < 0 means a bad one; the larger |A_i|, the stronger the signal. We can then use A_i as the evaluation of the policy network's output action: we simply minimize A_i · (-ln(p(a_i|s_i))). If the action was good, A_i is positive and p(a_i|s_i) ends up larger — the larger A_i, the more p(a_i|s_i) grows, and vice versa.
If the action was bad, A_i is negative and p(a_i|s_i) ends up smaller — the more negative A_i, the more it shrinks; if it is only slightly negative, it shrinks only a little.
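
A tiny numpy sketch of this advantage-weighted objective (purely illustrative; the probabilities and advantages below are made up):

import numpy as np

p_a = np.array([0.5, 0.8, 0.3])            # probabilities the policy gave to the actions we actually took
A   = np.array([+1.0, -1.0, +2.0])         # their advantages: >0 encourage, <0 discourage
loss = np.mean(A * (-np.log(p_a)))         # minimizing this raises p where A>0 and lowers p where A<0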

How A_i is assigned and computed:

It is usually computed with the discounted-return formula:

[Formula: A_i is the γ-discounted sum of the rewards from step i to the end of the episode]
Here T is the total number of steps in the episode, γ is the discount factor (typically 0.99), and r_t is the reward obtained at step t.
Let's look at two examples:
[Figure: the two example environments]


The environments are shown above; let's analyze them in detail.


  • Inverted pendulum (CartPole):

Action: move the cart left or right;
Reward: as long as the pole has not tipped past the allowed angle, every step gets reward 1. That is, r_i = 1 for every step until the episode ends.

  • The A_i formula is as above.

γ is the discount factor, taken as 0.99.
T is the step at which the episode ends; suppose T = 4.
i is the current step.
At step i = 1, i.e. the first step:
A_1 = 0.99 + 0.99×0.99 + 0.99×0.99×0.99 ≈ 0.99×3
A_2 = 0.99 + 0.99×0.99 ≈ 0.99×2
A_3 = 0.99
[Figure: plot of the A_i values]
So they do indeed decrease roughly linearly.
These values then still need to be standardized:
A_norm = (A − mean) / std
where A is the list of the A_i values, mean is its average, and std its standard deviation. After standardization the distribution of values looks like this:
[Figure: the A_i values after standardization]


You can see that only the standardized values correctly reflect how good an action really was! If we used the r_i directly, everything would be positive, every action would be reinforced, and the update would have no discrimination at all.
After standardization, within the same batch, even if your r_i is positive but small, your A_i can come out negative and that behaviour will be suppressed.
OK, this also explains this remark from Morvan Zhou's tutorial:
[Figure: quote from Morvan Zhou's tutorial]
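
Here is a small sketch of the discounted-return-plus-standardization computation just described, following the R_t = Σ_k γ^k r_{t+k} formula quoted later in the post (the function name is mine):

import numpy as np

def discounted_advantages(rewards, gamma=0.99):
    R = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running     # R_t = r_t + gamma * R_{t+1}
        R[t] = running
    return (R - R.mean()) / (R.std() + 1e-8)       # standardize: roughly half positive, half negative

# e.g. a CartPole-style episode with reward 1 at every step:
print(discounted_advantages(np.array([1.0, 1.0, 1.0, 1.0])))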


Now the second example:
[Figure: the second example environment]
[Figure: the second example environment, continued]
The distribution of r_i before standardization:
[Figure: r_i before standardization]
The distribution of A_i after standardization:
[Figure: A_i after standardization]


Below is the translation of the corresponding paragraph of the original post; it is not especially good, so skim it if you like — the content is all covered above.

OK, but what do we do if we do not have the correct label, as in the reinforcement learning setting? Here is the policy-gradient solution (again, refer to the diagram below). Our policy network computed the probability of UP as 30% (log prob -1.2) and DOWN as 70% (log prob -0.36). We now sample an action from this distribution; suppose we sample DOWN and execute it in the game. At this point, notice an interesting fact: we could immediately fill in a gradient of 1.0 for DOWN, just as in supervised learning — pass it in as a positive gradient and update the network — and that would make this output larger, nudging the network to be slightly more likely to choose DOWN in the future. So we can evaluate this gradient right away, which is great; the problem is that, at least for now, we do not yet know whether DOWN is any good.
But that is fine, because we can simply wait a bit and see! For example, in Pong we can wait until the end of the game, take the reward we got (+1 if we won, -1 if we lost), and enter that scalar as the gradient coefficient for the action we took (DOWN in this case). In the example below, going DOWN ended with us losing the game (-1 reward). So if we fill in -1 for the log probability of DOWN and run backprop, we find a gradient that discourages the network from taking the DOWN action for that input in the future (and rightly so, since taking that action led to us losing the game).

[Figure: the policy-gradient case — the episode's eventual reward is filled in as the gradient for the sampled action]
And that’s it: we have a stochastic policy that samples actions and then actions that happen to eventually lead to good outcomes get encouraged in the future, and actions taken that lead to bad outcomes get discouraged. Also, the reward does not even need to be +1 or -1 if we win the game eventually. It can be an arbitrary measure of some kind of eventual quality. For example if things turn out really well it could be 10.0, which we would then enter as the gradient instead of -1 to start off backprop. That’s the beauty of neural nets; Using them can feel like cheating: You’re allowed to have 1 million parameters embedded in 1 teraflop of compute and you can make it do arbitrary things with SGD. It shouldn’t work, but amusingly we live in a universe where it does.
And that's it: we have a stochastic policy that samples actions; actions that happen to eventually lead to good outcomes get encouraged in the future, and actions that lead to bad outcomes get discouraged. Moreover, the reward does not even need to be +1 or -1 for eventually winning the game; it can be an arbitrary measure of some eventual quality. For example, if things turn out really well it could be 10.0, which we would then enter as the gradient instead of -1 to start off backprop. That is the beauty of neural nets; using them can feel like cheating: you are allowed to have a million parameters embedded in a teraflop of compute, and you can make it do arbitrary things with SGD (stochastic gradient descent). It shouldn't work, but amusingly we live in a universe where it does.

Training protocol. So here is how the training will work in detail. We will initialize the policy network with some W1, W2 and play 100 games of Pong (we call these policy “rollouts”). Lets assume that each game is made up of 200 frames so in total we’ve made 20,000 decisions for going UP or DOWN and for each one of these we know the parameter gradient, which tells us how we should change the parameters if we wanted to encourage that decision in that state in the future. All that remains now is to label every decision we’ve made as good or bad. For example suppose we won 12 games and lost 88. We’ll take all 20012 = 2400 decisions we made in the winning games and do a positive update (filling in a +1.0 in the gradient for the sampled action, doing backprop, and parameter update encouraging the actions we picked in all those states). And we’ll take the other 20088 = 17600 decisions we made in the losing games and do a negative update (discouraging whatever we did). And… that’s it. The network will now become slightly more likely to repeat actions that worked, and slightly less likely to repeat actions that didn’t work. Now we play another 100 games with our new, slightly improved policy and rinse and repeat.

Training protocol

So here is how the training works in detail. We initialize the policy network's W1 and W2 and play 100 games of Pong (we call these policy "rollouts"). Let's assume each game consists of 200 frames, so in total we have made 20,000 UP-or-DOWN decisions, and for each one we know the parameter gradient, which tells us how to change the parameters if we want to encourage that decision in that state in the future. All that remains is to label each decision we made as good or bad — that is, to pick the coefficient on each gradient: good or bad? how good? how bad? For example, suppose we won 12 games and lost 88. We take all 200 × 12 = 2,400 decisions made in the winning games and do a positive update (fill in +1.0 as the gradient for the sampled action, backprop, and update the parameters so the actions we picked in all those states become more likely). Then we take the other 200 × 88 = 17,600 decisions made in the losing games and do a negative update (discouraging whatever we did). And... that's it. The network is now slightly more likely to repeat the actions that worked, and slightly less likely to repeat the ones that did not. Then we play another 100 games with our new, slightly improved policy, and rinse and repeat.

Policy Gradients: Run a policy for a while. See what actions led to high rewards. Increase their probability.
Policy gradients: run a policy for a while. See which actions led to high reward. Increase their probability.
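
Here is a toy, self-contained sketch of this batching protocol. The "game" is a stand-in (you "win" if you mostly pick UP), not Pong, and the linear policy and all names are my own; it only illustrates the loop of rolling out episodes, labelling every decision with the episode's outcome, and doing one update per batch:

import numpy as np

rng = np.random.default_rng(0)
theta = rng.normal(size=4) * 0.01                      # a tiny linear policy on a 4-dim fake state

def play_episode(theta):
    states, actions = [], []
    for _ in range(5):                                 # 5 decisions per fake episode
        s = rng.normal(size=4)
        p_up = 1.0 / (1.0 + np.exp(-theta @ s))
        a = 1 if rng.random() < p_up else 0            # sample the action (toss a biased coin)
        states.append(s); actions.append(a)
    reward = 1.0 if np.mean(actions) > 0.5 else -1.0   # "win" if we mostly chose UP
    return states, actions, reward

for batch in range(200):                               # play a batch of episodes, then one update
    grad = np.zeros_like(theta)
    for _ in range(10):
        states, actions, reward = play_episode(theta)
        for s, a in zip(states, actions):
            p_up = 1.0 / (1.0 + np.exp(-theta @ s))
            grad += reward * (a - p_up) * s            # the episode outcome is the gradient coefficient
    theta += 0.05 * grad / 10.0                        # encourage actions from winning episodes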

Cartoon diagram of 4 games. Each black circle is some game state (three example states are visualized on the bottom), and each arrow is a transition, annotated with the action that was sampled. In this case we won 2 games and lost 2 games. With Policy Gradients we would take the two games we won and slightly encourage every single action we made in that episode. Conversely, we would also take the two games we lost and slightly discourage every single action we made in that episode.
A cartoon of 4 games. Each black circle is a game state (three example states are visualized at the bottom), and each arrow is a transition, annotated with the action that was sampled. In this case we won 2 games and lost 2. With policy gradients we take the two games we won and slightly encourage every single action we took in those episodes; conversely, we take the two games we lost and slightly discourage every single action we took in those episodes.

If you think through this process you’ll start to find a few funny properties. For example what if we made a good action in frame 50 (bouncing the ball back correctly), but then missed the ball in frame 150? If every single action is now labeled as bad (because we lost), wouldn’t that discourage the correct bounce on frame 50? You’re right - it would. However, when you consider the process over thousands/millions of games, then doing the first bounce correctly makes you slightly more likely to win down the road, so on average you’ll see more positive than negative updates for the correct bounce and your policy will end up doing the right thing.
If you think this process through, you will start to find a few funny properties. For example, what if we made a good move at frame 50 (bouncing the ball back correctly) but then missed the ball at frame 150? If every single action is now labeled bad (because we lost), won't the update also discourage the correct bounce at frame 50? You are right — it will. However, when you consider the process over thousands or millions of games, doing the first bounce correctly makes you slightly more likely to win down the road, so on average you will see more positive than negative updates for the correct bounce, and your policy will end up converging to the right behaviour.

Update: December 9, 2016 - alternative view. In my explanation above I use the terms such as “fill in the gradient and backprop”, which I realize is a special kind of thinking if you’re used to writing your own backprop code, or using Torch where the gradients are explicit and open for tinkering. However, if you’re used to Theano or TensorFlow you might be a little perplexed because the code is oranized around specifying a loss function and the backprop is fully automatic and hard to tinker with. In this case, the following alternative view might be more intuitive. In vanilla supervised learning the objective is to maximize ∑ilogp(yi∣xi) where xi,yi are training examples (such as images and their labels). Policy gradients is exactly the same as supervised learning with two minor differences: 1) We don’t have the correct labels yi so as a “fake label” we substitute the action we happened to sample from the policy when it saw xi, and 2) We modulate the loss for each example multiplicatively based on the eventual outcome, since we want to increase the log probability for actions that worked and decrease it for those that didn’t. So in summary our loss now looks like ∑iAilogp(yi∣xi), where yi is the action we happened to sample and Ai is a number that we call an advantage. In the case of Pong, for example, Ai could be 1.0 if we eventually won in the episode that contained xi and -1.0 if we lost. This will ensure that we maximize the log probability of actions that led to good outcome and minimize the log probability of those that didn’t. So reinforcement learning is exactly like supervised learning, but on a continuously changing dataset (the episodes), scaled by the advantage, and we only want to do one (or very few) updates based on each sampled dataset.
The formulas in this passage get a bit more involved and touch on some theory, so let me paraphrase as well as translate.
Update: December 9, 2016 — an alternative view. In my explanation above I used terms such as "fill in the gradient and backprop", which I realize is a special way of thinking if you are used to writing your own backprop code or using Torch, where the gradients are explicit and open for tinkering. However, if you are used to Theano or TensorFlow you might be a little perplexed, because the code is organized around specifying a loss function and the backprop is fully automatic and hard to tinker with. In that case the following alternative view may be more intuitive. In vanilla supervised learning, the objective is to maximize ∑_i log p(y_i ∣ x_i), where x_i, y_i are training examples (such as images and their labels). Policy gradients are exactly the same as supervised learning, with two minor differences:
1) We do not have correct labels y_i, so as a "fake label" we substitute the action the policy happened to sample when it saw x_i;
2) We modulate each example's loss multiplicatively based on the eventual outcome, because we want to increase the log probability of actions that worked and decrease it for those that did not.
In summary, the loss now looks like ∑_i A_i log p(y_i ∣ x_i), where y_i is the action we happened to sample and A_i is a number we call the advantage. In Pong, for example, A_i could be 1.0 if we eventually won the episode that contained x_i, and -1.0 if we lost. This ensures we maximize the log probability of actions that led to good outcomes and minimize the log probability of those that did not. So reinforcement learning is exactly like supervised learning, but on a continuously changing dataset (the episodes), scaled by the advantage, and we only want to do one (or very few) updates based on each sampled dataset.
OK, translated this way it makes sense to me; I hope it does to you as well.

More general advantage functions. I also promised a bit more discussion of the returns. So far we have judged the goodness of every individual action based on whether or not we win the game. In a more general RL setting we would receive some reward rt at every time step. One common choice is to use a discounted reward, so the “eventual reward” in the diagram above would become Rt=∑∞k=0γkrt+k, where γ is a number between 0 and 1 called a discount factor (e.g. 0.99). The expression states that the strength with which we encourage a sampled action is the weighted sum of all rewards afterwards, but later rewards are exponentially less important. In practice it can can also be important to normalize these. For example, suppose we compute Rt for all of the 20,000 actions in the batch of 100 Pong game rollouts above. One good idea is to “standardize” these returns (e.g. subtract mean, divide by standard deviation) before we plug them into backprop. This way we’re always encouraging and discouraging roughly half of the performed actions. Mathematically you can also interpret these tricks as a way of controlling the variance of the policy gradient estimator. A more in-depth exploration can be found here.
More general advantage functions. I also promised a bit more discussion of the returns. So far we have judged the goodness of every individual action based on whether or not we win the game. In a more general RL setting we receive some reward r_t at every time step (i.e. the reward signal is not so sparse — every step the environment gives some feedback). One common choice is to use a discounted reward, so the "eventual reward" in the diagram above becomes R_t = ∑_{k=0}^{∞} γ^k r_{t+k}, where γ is a number between 0 and 1 called the discount factor (e.g. 0.99). The expression says that the strength with which we encourage a sampled action is the weighted sum of all rewards that come afterwards, with later rewards exponentially less important. In practice it is also important to normalize these. For example, suppose we compute R_t for all 20,000 actions in the batch of 100 Pong rollouts above. A good idea is to "standardize" these returns (subtract the mean, divide by the standard deviation) before plugging them into backprop. This way we are always encouraging and discouraging roughly half of the performed actions. (Roughly speaking, without standardization the values can be very lopsided and only a small fraction of the actions effectively drive the update; a more concrete argument appears above.) Mathematically, you can also interpret these tricks as a way of controlling the variance of the policy gradient estimator. A more in-depth exploration can be found in the reference linked from the original post.

Deriving Policy Gradients. I’d like to also give a sketch of where Policy Gradients come from mathematically. Policy Gradients are a special case of a more general score function gradient estimator. The general case is that when we have an expression of the form Ex∼p(x∣θ)[f(x)] - i.e. the expectation of some scalar valued score function f(x) under some probability distribution p(x;θ) parameterized by some θ. Hint hint, f(x) will become our reward function (or advantage function more generally) and p(x) will be our policy network, which is really a model for p(a∣I), giving a distribution over actions for any image I. Then we are interested in finding how we should shift the distribution (through its parameters θ) to increase the scores of its samples, as judged by f (i.e. how do we change the network’s parameters so that action samples get higher rewards). We have that:

Deriving the policy gradient

I want to get the whole story straight for myself, combining it with a few things I have seen in other books, plus some small annotations.
Mathematical preliminaries:

  • The relation between an expectation and a probability distribution.
    Here f(x) is the probability density and g(x) is the value function. In the continuous case the formula is the one below, which explains how the first line of the derivation further down is expanded:
    E[g(X)] = ∫ g(x) f(x) dx

Deriving policy gradients: I would also like to sketch where policy gradients come from mathematically. Policy gradients are a special case of a more general score-function gradient estimator. The general case is when we have an expression of the form E_{x∼p(x∣θ)}[f(x)] — the expectation of some scalar-valued score function f(x) under a probability distribution p(x;θ) parameterized by some θ. Hint: f(x) will become our reward function (or, more generally, the advantage function), and p(x) will be our policy network, which is really a model of p(a∣I), giving a distribution over actions for any input image I (the input is the image, the output is the action). We are then interested in how we should shift the distribution (through its parameters θ) to increase the scores of its samples, as judged by f (i.e. how to change the network's parameters so that action samples get higher rewards). We have:

∇_θ E_x[f(x)] = ∇_θ ∑_x p(x;θ) f(x)                      (definition of expectation)
             = ∑_x f(x) ∇_θ p(x;θ)                       (swap sum and gradient)
             = ∑_x p(x;θ) f(x) ∇_θ p(x;θ) / p(x;θ)       (multiply and divide by p(x;θ))
             = ∑_x p(x;θ) f(x) ∇_θ log p(x;θ)            (since ∇_θ log p = ∇_θ p / p)
             = E_x[ f(x) ∇_θ log p(x;θ) ]                (back to an expectation)

On sampling

The text mentions sampling many times. I had read a post before saying that PG's sampling consumes a lot of time, which I did not really understand — where exactly do I sample?
In Morvan Zhou's tutorial, the loss is computed like this:

with tf.name_scope('loss'):
      # maximizing the total reward (log_p * R) is the same as minimizing -(log_p * R); TF only provides minimization
      neg_log_prob = tf.nn.sparse_softmax_cross_entropy_with_logits(logits=all_act, labels=self.tf_acts) # -log probability of the chosen actions
      # the following line is equivalent:
      # neg_log_prob = tf.reduce_sum(-tf.log(self.all_act_prob)*tf.one_hot(self.tf_acts, self.n_actions), axis=1)
      loss = tf.reduce_mean(neg_log_prob * self.tf_vt)  # (vt = this reward + discounted future reward) guides the gradient step

That is, we compute the negative log probability of every action taken in this episode, multiply by vt, and take the mean. I believe this is the sampling process, except that it uses every step of the episode. (That is my current understanding; I will update it if I find a better one.) 2019-02-15

To put this in English, we have some distribution p(x;θ) (I used shorthand p(x) to reduce clutter) that we can sample from (e.g. this could be a gaussian). For each sample we can also evaluate the score function f which takes the sample and gives us some scalar-valued score. This equation is telling us how we should shift the distribution (through its parameters θ) if we wanted its samples to achieve higher scores, as judged by f. In particular, it says that look: draw some samples x, evaluate their scores f(x), and for each x also evaluate the second term ∇θlogp(x;θ). What is this second term? It’s a vector - the gradient that’s giving us the direction in the parameter space that would lead to increase of the probability assigned to an x. In other words if we were to nudge θ in the direction of ∇θlogp(x;θ) we would see the new probability assigned to some x slightly increase. If you look back at the formula, it’s telling us that we should take this direction and multiply onto it the scalar-valued score f(x). This will make it so that samples that have a higher score will “tug” on the probability density stronger than the samples that have lower score, so if we were to do an update based on several samples from p the probability density would shift around in the direction of higher scores, making highly-scoring samples more likely.

In plain words: we have some distribution p(x;θ) (I use the shorthand p(x) to reduce clutter) that we can sample from (for example, a Gaussian). For each sample we can also evaluate the score function f, which takes the sample and gives us a scalar-valued score. The equation tells us how we should shift the distribution (through its parameters θ) if we want its samples to achieve higher scores, as judged by f. In particular it says: draw some samples x, evaluate their scores f(x), and for each x also evaluate the second term ∇_θ log p(x;θ). What is this second term? It is a vector — the gradient giving the direction in parameter space that would increase the probability assigned to that x. In other words, if we nudge θ in the direction of ∇_θ log p(x;θ), the probability assigned to that x increases slightly. Looking back at the formula, it tells us to take this direction and multiply onto it the scalar-valued score f(x). This makes samples with a higher score "tug" on the probability density more strongly than samples with a lower score (roughly: the high-scoring samples drive larger parameter updates), so if we do an update based on several samples from p, the probability density shifts in the direction of higher scores, making high-scoring samples more likely.
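
A small numpy sketch of this score-function estimator for a 1-D Gaussian whose mean we shift (the score function f and all the numbers here are made up for illustration):

import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 0.0, 1.0
f = lambda x: (x > 1.0).astype(float)          # score is 1 only in a small "good" region, 0 elsewhere

for step in range(100):
    x = rng.normal(mu, sigma, size=1000)       # sample from p(x; mu)
    score = (x - mu) / sigma**2                # grad_mu log N(x; mu, sigma)
    grad_mu = np.mean(f(x) * score)            # Monte-Carlo estimate of grad_mu E[f(x)]
    mu += 0.1 * grad_mu                        # nudge the mean toward higher-scoring samples

print(mu)                                      # the mean drifts toward the region where f is high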

A visualization of the score function gradient estimator. Left: A gaussian distribution and a few samples from it (blue dots). On each blue dot we also plot the gradient of the log probability with respect to the gaussian’s mean parameter. The arrow indicates the direction in which the mean of the distribution should be nudged to increase the probability of that sample. Middle: Overlay of some score function giving -1 everywhere except +1 in some small regions (note this can be an arbitrary and not necessarily differentiable scalar-valued function). The arrows are now color coded because due to the multiplication in the update we are going to average up all the green arrows, and the negative of the red arrows. Right: after parameter update, the green arrows and the reversed red arrows nudge us to left and towards the bottom. Samples from this distribution will now have a higher expected score, as desired.

[Figure: the three-panel visualization described in the caption above]
Morvan's tutorial does not explain this figure in detail, so let me translate it and add my own reading.
A visualization of the score function gradient estimator:

  • Left: a Gaussian distribution and a few samples from it (blue dots). On each blue dot we also plot the gradient of the log probability with respect to the Gaussian's mean parameter. The arrow indicates the direction in which the mean of the distribution should be nudged to increase the probability of that sample.
  • Middle: an overlay of a score function that is -1 everywhere except +1 in some small region (note that this can be an arbitrary, not necessarily differentiable, scalar-valued function). The arrows are now color-coded because, due to the multiplication in the update, we will average up all the green arrows and the negatives of the red arrows.
  • Right: after the parameter update, the green arrows and the reversed red arrows nudge the distribution to the left and toward the bottom. Samples from this distribution now have a higher expected score, as desired.
  • My simple reading: the arrows are the directions of the gradient update; here we want positive feedback, so the update follows the gradient direction rather than its negative.
  • So where the gradient direction and the score have the same sign, the update goes along the gradient; the lower-left of the middle panel is exactly that case, so the density is pulled toward the lower left, slightly outward.
  • In the upper-right region the score is essentially -1, so the pull is reversed — the net effect is again toward the lower left, slightly inward — which is why the whole distribution shifts to the lower left and shrinks a little.
  • That is my intuitive reading; I am not certain it is exactly right, but I think it is basically OK...

I hope the connection to RL is clear. Our policy network gives us samples of actions, and some of them work better than others (as judged by the advantage function). This little piece of math is telling us that the way to change the policy’s parameters is to do some rollouts, take the gradient of the sampled actions, multiply it by the score and add everything, which is what we’ve done above. For a more thorough derivation and discussion I recommend John Schulman’s lecture.
I hope the connection between supervised learning and RL is now clear. Our policy network gives us samples of actions, and some of them work better than others (as judged by the advantage function). This little piece of math tells us that the way to change the policy's parameters is to do some rollouts, take the gradients of the sampled actions, multiply them by their scores, and add everything up — which is exactly what we did above. For a more thorough derivation and discussion I recommend John Schulman's lectures.

Learning. Alright, we’ve developed the intuition for policy gradients and saw a sketch of their derivation. I implemented the whole approach in a 130-line Python script, which uses OpenAI Gym’s ATARI 2600 Pong. I trained a 2-layer policy network with 200 hidden layer units using RMSProp on batches of 10 episodes (each episode is a few dozen games, because the games go up to score of 21 for either player). I did not tune the hyperparameters too much and ran the experiment on my (slow) Macbook, but after training for 3 nights I ended up with a policy that is slightly better than the AI player. The total number of episodes was approximately 8,000 so the algorithm played roughly 200,000 Pong games (quite a lot isn’t it!) and made a total of ~800 updates. I’m told by friends that if you train on GPU with ConvNets for a few days you can beat the AI player more often, and if you also optimize hyperparameters carefully you can also consistently dominate the AI player (i.e. win every single game). However, I didn’t spend too much time computing or tweaking, so instead we end up with a Pong AI that illustrates the main ideas and works quite well:

Learning

Alright, we have developed the intuition for policy gradients and seen a sketch of their derivation. I implemented the whole approach on top of OpenAI Gym's ATARI 2600 Pong, in 130 lines of Python. I trained a 2-layer policy network with 200 hidden units, updating the parameters with RMSProp (an optimizer worth looking up) on batches of 10 episodes. I did not tune the hyperparameters much and ran the experiment on my (slow) Macbook. After training for 3 nights (how long does RL training usually take on a personal machine?), I ended up with a policy that is slightly better than the built-in AI player. In total the algorithm played roughly 200,000 Pong games and made about 800 parameter updates. Friends tell me that if you train on a GPU with ConvNets for a few days you can beat the AI player more often, and if you also carefully optimize the hyperparameters you can consistently dominate it (i.e. win every single game). However, I did not spend much time computing or tweaking; the point here is only to illustrate the core ideas, so we end up with a Pong AI that simply plays quite well.

There is a video of this on YouTube; I cannot embed the link here, so go look it up.
The learned agent (in green, right) facing off with the hard-coded AI opponent (left).

Learned weights. We can also take a look at the learned weights. Due to preprocessing every one of our inputs is an 80x80 difference image (current frame minus last frame). We can now take every row of W1, stretch them out to 80x80 and visualize. Below is a collection of 40 (out of 200) neurons in a grid. White pixels are positive weights and black pixels are negative weights. Notice that several neurons are tuned to particular traces of bouncing ball, encoded with alternating black and white along the line. The ball can only be at a single spot, so these neurons are multitasking and will “fire” for multiple locations of the ball along that line. The alternating black and white is interesting because as the ball travels along the trace, the neuron’s activity will fluctuate as a sine wave and due to the ReLU it would “fire” at discrete, separated positions along the trace. There’s a bit of noise in the images, which I assume would have been mitigated if I used L2 regularization.

The learned weights

We can also take a look at the learned weights. After preprocessing, each input to the network is an 80×80 difference image (the current frame minus the previous one, so it reflects motion). We take W1 row by row (each row has 80×80 weights), reshape, and visualize. Below are 40 of the 200 hidden-layer neurons. White pixels are positive weights and black pixels are negative. Notice that several neurons are tuned to particular traces of the bouncing ball (alternating black and white along a line). Since the ball can only be at one spot at a time, these neurons are multitasking and will "fire" for multiple positions of the ball along that line. As the ball travels along the trace, the neuron's activity fluctuates like a sine wave, and because of the ReLU it "fires" only at discrete, separated positions along the trace. There is a bit of noise in the images, which I assume would be mitigated by L2 regularization.

(A note from me: I do not fully understand this visualization — the author did not use a conv net, and I am not sure what these traces are useful for — so I will leave it at that.)
[Figure: 40 of the 200 hidden-unit weight rows, each reshaped to 80×80]
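
For reference, the visualization described above could be produced with something like the following sketch (here W1 is random just so the snippet runs; in practice it would be the trained matrix):

import numpy as np
import matplotlib.pyplot as plt

W1 = np.random.randn(200, 80 * 80)               # stand-in for the trained first-layer weights
fig, axes = plt.subplots(5, 8, figsize=(12, 8))  # show 40 of the 200 hidden units
for ax, row in zip(axes.ravel(), W1[:40]):
    ax.imshow(row.reshape(80, 80), cmap='gray')  # white = positive weight, black = negative
    ax.axis('off')
plt.show()
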
What isn’t happening
So there you have it - we learned to play Pong from from raw pixels with Policy Gradients and it works quite well. The approach is a fancy form of guess-and-check, where the “guess” refers to sampling rollouts from our current policy, and the “check” refers to encouraging actions that lead to good outcomes. Modulo some details, this represents the state of the art in how we currently approach reinforcement learning problems. Its impressive that we can learn these behaviors, but if you understood the algorithm intuitively and you know how it works you should be at least a bit disappointed. In particular, how does it not work?

So there you have it: we learned to play Pong from raw pixels with policy gradients, and it works quite well. The approach is a fancy form of "guess-and-check" (trial-and-error): the "guess" is sampling rollouts from the current policy, and the "check" is encouraging the actions that happen to lead to good outcomes (winning, or scoring higher). Modulo some details, this represents the state of the art in how we currently approach RL problems. It is impressive that we can learn these behaviours, but once you understand the algorithm intuitively you should be at least a bit disappointed, and ask: in what situations does it fail?

The algorithmic part above is now fully translated; what follows are the author's reflections, for which I mostly pasted machine-translated text and lightly edited it.

Compare that to how a human might learn to play Pong. You show them the game and say something along the lines of “You’re in control of a paddle and you can move it up and down, and your task is to bounce the ball past the other player controlled by AI”, and you’re set and ready to go. Notice some of the differences:
Compare that to how a human might learn to play Pong. You show them the game and say something along the lines of "you are in control of a paddle, you can move it up and down, and your task is to bounce the ball past the other player, which is controlled by an AI", and they are set and ready to go. Notice some of the differences:

  • In practical settings we usually communicate the task in some manner (e.g. English above), but in a standard RL problem you assume an arbitrary reward function that you have to discover through environment interactions. It can be argued that if a human went into game of Pong but without knowing anything about the reward function (indeed, especially if the reward function was some static but random function), the human would have a lot of difficulty learning what to do but Policy Gradients would be indifferent, and likely work much better. Similarly, if we took the frames and permuted the pixels randomly then humans would likely fail, but our Policy Gradient solution could not even tell the difference (if it’s using a fully connected network as done here).

  • In practical settings we usually communicate the task in some manner (e.g. in English), but in a standard RL problem there is only a reward function discovered through interaction with the environment. If a human had to learn the game while knowing nothing about the reward function (especially if the reward function were some static but random function), they would have great difficulty figuring out what to do, whereas policy gradients would be indifferent and would likely work much better. Similarly, if we permuted the pixels of every frame randomly, a human would very likely fail, but our policy-gradient solution could not even tell the difference (given that it uses a fully connected network, as done here).

  • A human brings in a huge amount of prior knowledge, such as intuitive physics (the ball bounces, it’s unlikely to teleport, it’s unlikely to suddenly stop, it maintains a constant velocity, etc.), and intuitive psychology (the AI opponent “wants” to win, is likely following an obvious strategy of moving towards the ball, etc.). You also understand the concept of being “in control” of a paddle, and that it responds to your UP/DOWN key commands. In contrast, our algorithms start from scratch which is simultaneously impressive (because it works) and depressing (because we lack concrete ideas for how not to).
    Humans bring in a huge amount of prior knowledge, such as intuitive physics (the ball bounces, it will not teleport, it will not suddenly stop, it moves at a roughly constant velocity, and so on) and intuitive psychology. You also understand that the paddle is "under your control" and responds to your UP/DOWN commands. Our algorithm, in contrast, starts completely from scratch.

  • Policy Gradients are a brute force solution, where the correct actions are eventually discovered and internalized into a policy. Humans build a rich, abstract model and plan within it. In Pong, I can reason that the opponent is quite slow so it might be a good strategy to bounce the ball with high vertical velocity, which would cause the opponent to not catch it in time. However, it also feels as though we also eventually “internalize” good solutions into what feels more like a reactive muscle memory policy. For example if you’re learning a new motor task (e.g. driving a car with stick shift?) you often feel yourself thinking a lot in the beginning but eventually the task becomes automatic and mindless.
    Policy gradients are a brute-force solution: the correct actions are eventually discovered and internalized into the policy. A human builds a rich abstract model and plans within it. In Pong, I can reason that the opponent is quite slow, so bouncing the ball back at high speed might be a good strategy. However, it also seems that we eventually "internalize" good solutions into something more like reflexive muscle memory. For example, when learning to drive a stick-shift car, you find yourself thinking a lot at the beginning, but eventually the task becomes automatic and unconscious.

  • Policy Gradients have to actually experience a positive reward, and experience it very often in order to eventually and slowly shift the policy parameters towards repeating moves that give high rewards. With our abstract model, humans can figure out what is likely to give rewards without ever actually experiencing the rewarding or unrewarding transition. I don’t have to actually experience crashing my car into a wall a few hundred times before I slowly start avoiding to do so.
    Policy gradients have to keep experiencing positive rewards in order to slowly shift the parameters toward repeating the moves that give high rewards. With our abstract models, humans can figure out which actions are likely to give reward without ever actually experiencing the rewarding (or unrewarding) transitions. I do not need to crash my car into a wall a few hundred times before I learn to avoid doing so — a very good example, I think.

In conclusion, once you understand the “trick” by which these algorithms work you can reason through their strengths and weaknesses. In particular, we are nowhere near humans in building abstract, rich representations of games that we can plan within and use for rapid learning. One day a computer will look at an array of pixels and notice a key, a door, and think to itself that it is probably a good idea to pick up the key and reach the door. For now there is nothing anywhere close to this, and trying to get there is an active area of research.

  • In conclusion, once you understand the "trick" by which these algorithms work, you can reason about their strengths and weaknesses. In particular, we are nowhere near humans in building abstract, rich representations of games that we can plan within and use for rapid learning. One day a computer will look at an array of pixels, notice a key and a door, and think to itself that picking up the key and reaching the door is probably a good idea. For now there is nothing anywhere close to this, and trying to get there is an active area of research.

Non-differentiable computation in Neural Networks
I’d like to mention one more interesting application of Policy Gradients unrelated to games: It allows us to design and train neural networks with components that perform (or interact with) non-differentiable computation. The idea was first introduced in Williams 1992 and more recently popularized by Recurrent Models of Visual Attention under the name “hard attention”, in the context of a model that processed an image with a sequence of low-resolution foveal glances (inspired by our own human eyes). In particular, at every iteration an RNN would receive a small piece of the image and sample a location to look at next. For example the RNN might look at position (5,30), receive a small piece of the image, then decide to look at (24, 50), etc. The problem with this idea is that there a piece of network that produces a distribution of where to look next and then samples from it. Unfortunately, this operation is non-differentiable because, intuitively, we don’t know what would have happened if we sampled a different location. More generally, consider a neural network from some inputs to outputs:

Non-differentiable computation in neural networks

This section covers another interesting application of policy gradients, unrelated to games: they allow us to design and train neural networks with components that perform (or interact with) non-differentiable computation. The idea was popularized under the name "hard attention" by the paper "Recurrent Models of Visual Attention", in the context of a model that processes an image with a sequence of low-resolution foveal glances. In every iteration, an RNN receives a small piece of the image as input and samples a location to look at next. For example, the RNN might look at position (5, 30), receive a small piece of the image, then decide to look at (24, 50), and so on. The crucial point is that a part of the network produces a probability distribution over where to look next and then samples from it. Unfortunately, that operation is non-differentiable. More generally, consider a neural network from some inputs to some outputs:
Notice that most arrows (in blue) are differentiable as normal, but some of the representation transformations could optionally also include a non-differentiable sampling operation (in red). We can backprop through the blue arrows just fine, but the red arrow represents a dependency that we cannot backprop through.
Note that most arrows (blue) are differentiable as usual, but some of the representation transformations may also include a non-differentiable sampling operation (red). We can backprop through the blue arrows just fine, but the red arrow is a dependency we cannot backprop through.

Policy gradients to the rescue! We’ll think about the part of the network that does the sampling as a small stochastic policy embedded in the wider network. Therefore, during training we will produce several samples (indicated by the branches below), and then we’ll encourage samples that eventually led to good outcomes (in this case for example measured by the loss at the end). In other words we will train the parameters involved in the blue arrows with backprop as usual, but the parameters involved with the red arrow will now be updated independently of the backward pass using policy gradients, encouraging samples that led to low loss. This idea was also recently formalized nicely in Gradient Estimation Using Stochastic Computation Graphs.
Policy gradients to the rescue! We think of the part of the network that does the sampling as a small stochastic policy embedded in the wider network. During training we produce several samples (the branches in the figure below), and then we encourage the samples that eventually led to good outcomes (measured here, for example, by the final loss). In other words, the parameters on the blue arrows are trained with ordinary backprop as usual, while the parameters on the red arrow are updated, independently of the backward pass, using policy gradients that encourage samples leading to low loss. The paper "Gradient Estimation Using Stochastic Computation Graphs" formalizes this idea nicely.
Trainable Memory I/O. You’ll also find this idea in many other papers. For example, a Neural Turing Machine has a memory tape that they it read and write from. To do a write operation one would like to execute something like m[i] = x, where i and x are predicted by an RNN controller network. However, this operation is non-differentiable because there is no signal telling us what would have happened to the loss if we were to write to a different location j != i. Therefore, the NTM has to do soft read and write operations. It predicts an attention distribution a (with elements between 0 and 1 and summing to 1, and peaky around the index we’d like to write to), and then doing for all i: m[i] = a[i]*x. This is now differentiable, but we have to pay a heavy computational price because we have to touch every single memory cell just to write to one position. Imagine if every assignment in our computers had to touch the entire RAM!
Trainable memory I/O. You will find this idea in many other papers as well. For example, a Neural Turing Machine (NTM) has a memory tape that it reads from and writes to. A write operation would like to do something like m[i] = x, where i and x are predicted by an RNN controller network. But this operation is non-differentiable, because there is no signal telling us what would have happened to the loss if we had written to a different location j != i. So the NTM performs "soft" read and write operations instead: it predicts an attention distribution a (with elements between 0 and 1, summing to 1, and peaked around the index we would like to write to), and then does, for all i: m[i] = a[i]*x. This is now differentiable, but we pay a heavy computational price, because we have to touch every single memory cell just to write to one position. Imagine if every assignment in our computers had to touch the entire RAM!
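
A toy contrast between the two write modes just described (the memory, the attention vector, and the value below are made up for illustration; this is not the NTM's actual implementation):

import numpy as np

rng = np.random.default_rng(0)
m = np.zeros(5)                                  # a fake memory of 5 cells
a = np.array([0.05, 0.1, 0.7, 0.1, 0.05])        # attention distribution, peaked on index 2
x = 3.0                                          # value to write

m_soft = a * x                                   # soft write: m[i] = a[i]*x for all i -- touches every cell, but differentiable
i = rng.choice(len(m), p=a)                      # hard write: sample one location from a
m_hard = m.copy(); m_hard[i] = x                 # only one cell is touched; training it needs policy gradients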

However, we can use policy gradients to circumvent this problem (in theory), as done in RL-NTM. We still predict an attention distribution a, but instead of doing the soft write we sample locations to write to: i = sample(a); m[i] = x. During training we would do this for a small batch of i, and in the end make whatever branch worked best more likely. The large computational advantage is that we now only have to read/write at a single location at test time. However, as pointed out in the paper this strategy is very difficult to get working because one must accidentally stumble by working algorithms through sampling. The current consensus is that PG works well only in settings where there are a few discrete choices so that one is not hopelessly sampling through huge search spaces.
However, we can (in theory) use policy gradients to circumvent this problem, as RL-NTM does (arxiv.org/abs/1505.00521). We still predict an attention distribution a, but instead of doing the soft write we sample a location to write to: i = sample(a); m[i] = x. During training we do this for a small batch of sampled i, and in the end make whichever branch worked best more likely. The big computational advantage is that at test time we only need to read/write at a single location. However, as the paper points out, this strategy is very difficult to get working, because the algorithm must stumble onto working behaviour through sampling. The current consensus is that PG works well only in settings with a few discrete choices, so that one is not hopelessly sampling through a huge search space.

However, with Policy Gradients and in cases where a lot of data/compute is available we can in principle dream big - for instance we can design neural networks that learn to interact with large, non-differentiable modules such as Latex compilers (e.g. if you’d like char-rnn to generate latex that compiles), or a SLAM system, or LQR solvers, or something. Or, for example, a superintelligence might want to learn to interact with the internet over TCP/IP (which is sadly non-differentiable) to access vital information needed to take over the world. That’s a great example.

Conclusions
We saw that Policy Gradients are a powerful, general algorithm and as an example we trained an ATARI Pong agent from raw pixels, from scratch, in 130 lines of Python. More generally the same algorithm can be used to train agents for arbitrary games and one day hopefully on many valuable real-world control problems. I wanted to add a few more notes in closing:
We have seen that policy gradients are a powerful, general algorithm; as an example, we trained an ATARI Pong agent from raw pixels, from scratch, in 130 lines of Python. The same algorithm can be used to train agents for arbitrary other games, and hopefully one day for valuable real-world control problems. Before putting down the pen, I want to add a few closing notes:

On advancing AI. We saw that the algorithm works through a brute-force search where you jitter around randomly at first and must accidentally stumble into rewarding situations at least once, and ideally often and repeatedly before the policy distribution shifts its parameters to repeat the responsible actions. We also saw that humans approach these problems very differently, in what feels more like rapid abstract model building - something we have barely even scratched the surface of in research (although many people are trying). Since these abstract models are very difficult (if not impossible) to explicitly annotate, this is also why there is so much interest recently in (unsupervised) generative models and program induction.
On advancing AI. We saw that the algorithm works through brute-force search: you jitter around randomly at first and must stumble into rewarding situations at least once — ideally often — before the policy shifts its parameters to repeat the responsible actions. We also saw that humans approach these problems very differently, in a way that feels more like rapid abstract model building. Since those abstract models are very hard (if not impossible) to annotate explicitly, this is also why there has been so much recent interest in (unsupervised) generative models and program induction.

On use in complex robotics settings. The algorithm does not scale naively to settings where huge amounts of exploration are difficult to obtain. For instance, in robotic settings one might have a single (or few) robots, interacting with the world in real time. This prohibits naive applications of the algorithm as I presented it in this post. One related line of work intended to mitigate this problem is deterministic policy gradients - instead of requiring samples from a stochastic policy and encouraging the ones that get higher scores, the approach uses a deterministic policy and gets the gradient information directly from a second network (called a critic) that models the score function. This approach can in principle be much more efficient in settings with very high-dimensional actions where sampling actions provides poor coverage, but so far seems empirically slightly finicky to get working. Another related approach is to scale up robotics, as we’re starting to see with Google’s robot arm farm, or perhaps even Tesla’s Model S + Autopilot.
On use in complex robotics settings. The algorithm does not scale naively to settings where huge amounts of exploration are hard to obtain, for example robot control, where one (or a few) robots interact with the real world in real time. One related line of work that mitigates this is deterministic policy gradients: instead of sampling from a stochastic policy and encouraging the samples that score higher, this approach uses a deterministic policy and obtains the gradient information directly from a second network (called a critic) that models the score function. Another line of work adds extra guidance: in many practical cases we can obtain expert demonstrations from humans. For example, AlphaGo first uses supervised learning on human expert games to predict human moves, and the resulting human-imitating policy is later fine-tuned with policy gradients toward the "real" objective of winning the game.

There is also a line of work that tries to make the search process less hopeless by adding additional supervision. In many practical cases, for instance, one can obtain expert trajectories from a human. For example AlphaGo first uses supervised learning to predict human moves from expert Go games and the resulting human mimicking policy is later finetuned with policy gradients on the “real” objective of winning the game. In some cases one might have fewer expert trajectories (e.g. from robot teleoperation) and there are techniques for taking advantage of this data under the umbrella of apprenticeship learning. Finally, if no supervised data is provided by humans it can also be in some cases computed with expensive optimization techniques, e.g. by trajectory optimization in a known dynamics model (such as F=ma in a physical simulator), or in cases where one learns an approximate local dynamics model (as seen in very promising framework of Guided Policy Search).
There is also a line of work that tries to make the search process less hopeless by adding additional supervision. In many practical cases, for instance, one can obtain expert trajectories from a human. For example, AlphaGo first uses supervised learning to predict human moves from expert Go games, and the resulting human-mimicking policy is later fine-tuned with policy gradients on the "real" objective of winning the game. In some cases one might have far fewer expert trajectories (e.g. from robot teleoperation), and there are techniques for taking advantage of such data under the umbrella of apprenticeship learning. Finally, if humans provide no supervised data at all, it can in some cases be computed with expensive optimization techniques, e.g. by trajectory optimization in a known dynamics model (such as F = ma in a physics simulator), or in settings where one learns an approximate local dynamics model (as in the very promising Guided Policy Search framework).
On using PG in practice. As a last note, I’d like to do something I wish I had done in my RNN blog post. I think I may have given the impression that RNNs are magic and automatically do arbitrary sequential problems. The truth is that getting these models to work can be tricky, requires care and expertise, and in many cases could also be an overkill, where simpler methods could get you 90%+ of the way there. The same goes for Policy Gradients. They are not automatic: You need a lot of samples, it trains forever, it is difficult to debug when it doesn’t work. One should always try a BB gun before reaching for the Bazooka. In the case of Reinforcement Learning for example, one strong baseline that should always be tried first is the cross-entropy method (CEM), a simple stochastic hill-climbing “guess and check” approach inspired loosely by evolution. And if you insist on trying out Policy Gradients for your problem make sure you pay close attention to the tricks section in papers, start simple first, and use a variation of PG called TRPO, which almost always works better and more consistently than vanilla PG in practice. The core idea is to avoid parameter updates that change your policy too much, as enforced by a constraint on the KL divergence between the distributions predicted by the old and the new policy on a batch of data (instead of conjugate gradients the simplest instantiation of this idea could be implemented by doing a line search and checking the KL along the way).
On using PG in practice. In my earlier post on RNNs I may have given the impression that RNNs are magic and automatically solve arbitrary sequence problems. The truth is that getting these models to work takes care, expertise, and a lot of tricks, and in many cases they are overkill when simpler methods would get you 90%+ of the way there. The same goes for policy gradients. They are not automatic: you need a lot of samples, training takes forever, and it is hard to debug when it does not work. One should always try a BB gun before reaching for the bazooka. In reinforcement learning, a strong baseline that should always be tried first is the cross-entropy method (CEM), a simple stochastic hill-climbing "guess and check" approach loosely inspired by evolution. And if you insist on using policy gradients for your problem, pay close attention to the tricks sections of papers, start simple, and use the PG variant called TRPO, which in practice almost always works better and more consistently than vanilla PG. Its core idea is to avoid parameter updates that change the policy too much, enforced by a constraint on the KL divergence between the action distributions predicted by the old and the new policy on a batch of data (the simplest instantiation of this idea is a line search that checks the KL along the way, instead of conjugate gradients).
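
As a rough illustration of that last point — take a step, check the KL, shrink the step if the policies moved too far apart — here is a toy sketch for a small categorical policy. This is only the KL-check / line-search idea, not TRPO itself, and all the numbers are made up:

import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def kl(p, q):
    return np.sum(p * np.log(p / q))

logits_old = np.array([0.2, -0.1, 0.0])
full_step  = np.array([1.5, -0.5, 0.3])          # some proposed parameter update (made up)
max_kl = 0.01
p_old = softmax(logits_old)
step = full_step.copy()
for _ in range(10):                              # backtracking line search on the step size
    if kl(p_old, softmax(logits_old + step)) <= max_kl:
        break
    step *= 0.5                                  # shrink until the old and new policies stay close
logits_new = logits_old + step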

And that’s it! I hope I gave you a sense of where we are with Reinforcement Learning, what the challenges are, and if you’re eager to help advance RL I invite you to do so within our OpenAI Gym ? Until next time!

Closing thoughts on the translation:

The author's framing is really excellent: he connects the unfamiliar territory of RL to the familiar setting of supervised learning and then points out exactly where they differ, which lowers the entry barrier for newcomers a great deal. Many thanks.
Also, I would still encourage you to read the English original directly, looking up unfamiliar words as you go; you will understand it better than by reading my version or Morvan's. Reading the Chinese is a shortcut, but eventually you have to walk the road yourself.
I think this translation still has some value; once you have read through it, you should basically understand the PG algorithm.
