# Background

"But is it possible to do this the other way around: to convert a reinforcement learning task into a supervised learning task?
"In general, there is no way to do this. The key difficulty is that whereas in supervised learning, the goal is to reconstruct the unknown function f that assigns output values y to data points x, in reinforcement learning, the goal is to find the input x* that gives the maximum reward R(x*).
"Nonetheless, is there a way that we could apply ideas from supervised learning to perform reinforcement learning? Suppose, for example, that we are given a set of training examples of the form (xi, R(xi)), where the xi are points and the R(xi) are the corresponding observed rewards. In supervised learning, we would attempt to find a function h that approximates R well. If h were a perfect approximation of R, then we could find x* by applying standard optimization algorithms to h."

# From Q-Learning to DQN

## 3 Value Function Approximation

f can be any kind of function, for example a linear function:
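As a minimal sketch of what a linear approximator looks like (the weight names, learning rate, and the SGD update here are illustrative assumptions, not from the original post): f(s, a; w) = w1·s + w2·a + b, trained toward a target value by gradient descent on the squared error.

```python
def linear_q(s, a, w):
    """Linear value-function approximation: f(s, a; w) = w1*s + w2*a + b."""
    w1, w2, b = w
    return w1 * s + w2 * a + b

def sgd_step(s, a, target, w, lr=0.1):
    """One gradient-descent step on 0.5 * (f(s,a;w) - target)^2.

    The gradient w.r.t. (w1, w2, b) is err * (s, a, 1), where
    err = prediction - target (e.g. target = r + gamma * max_a' Q(s', a')).
    """
    err = linear_q(s, a, w) - target
    w1, w2, b = w
    return (w1 - lr * err * s, w2 - lr * err * a, b - lr * err)
```

Repeated `sgd_step` calls drive the linear predictor toward the target value for the given (s, a) pair; in real value-function approximation the targets come from bootstrapped Bellman estimates.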

## 2 Why Policy Network?

The expected sum of all discounted rewards.
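Concretely, for a trajectory with rewards r_0, r_1, …, the discounted return is G = Σ_t γ^t r_t; the policy objective is the expectation of G. A minimal sketch (γ and the rewards are illustrative):

```python
def discounted_return(rewards, gamma=0.99):
    """G = r_0 + gamma*r_1 + gamma^2*r_2 + ...

    Computed back-to-front via G_t = r_t + gamma * G_{t+1},
    which avoids explicitly raising gamma to powers.
    """
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g
```

For example, `discounted_return([1, 1, 1], gamma=0.5)` gives 1 + 0.5 + 0.25 = 1.75.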

## 5 Another Angle: Compute It Directly

f(s,a) can serve not only as an evaluation metric for actions, but also as the objective function itself. Just as in AlphaGo, the evaluation metric is win or lose, and the objective is to win. This is entirely consistent with the objective analyzed earlier. We can therefore use the evaluation metric f(s,a) to optimize the policy, and in doing so we are simultaneously optimizing f(s,a) itself. The problem then reduces to computing the gradient of f(s,a) with respect to the parameters. The formula below is taken directly from Andrej Karpathy's blog, where f(x) plays the role of f(s,a).
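The identity referenced here (the original formula did not survive extraction; this is the standard score-function gradient estimator from Karpathy's "Pong from Pixels" post) is:

```latex
\nabla_\theta \, \mathbb{E}_{x \sim p(x \mid \theta)}\!\left[ f(x) \right]
  = \mathbb{E}_{x \sim p(x \mid \theta)}\!\left[ f(x) \, \nabla_\theta \log p(x \mid \theta) \right]
```

That is, the gradient of the expected score can be estimated by sampling x from the policy distribution and weighting each sample's log-probability gradient by its score f(x) — no gradient of f itself is ever needed.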

Advantage of Actor-Critic methods: they allow single-step updates, which is faster than traditional Policy Gradient (which only updates at the end of each episode).

Disadvantage of Actor-Critic methods: learning depends on the Critic's value estimates, but the Critic is hard to get to converge, and updating the Actor at the same time makes convergence even harder. To address this, Google DeepMind proposed an upgraded Actor-Critic: Deep Deterministic Policy Gradient (DDPG), which incorporates the strengths of DQN to tackle the convergence difficulty.

When the Actor adjusts its behavior, it is like driving forward blindfolded; the Critic is the one holding the steering wheel and correcting the Actor's direction.
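The single-step update mentioned above can be sketched in a tabular setting (the state/preference tables, learning rates, and the softmax policy parameterization here are my own illustrative assumptions): the Critic's TD error δ = r + γV(s′) − V(s) drives both the value update and the Actor's policy update, at every step rather than at episode end.

```python
import math

def softmax(prefs):
    """Convert action preferences to a probability distribution."""
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    z = sum(exps)
    return [e / z for e in exps]

def actor_critic_step(V, prefs, s, a, r, s_next, done,
                      alpha_v=0.1, alpha_pi=0.1, gamma=0.99):
    """One online actor-critic update after observing (s, a, r, s_next)."""
    target = r if done else r + gamma * V[s_next]
    delta = target - V[s]              # Critic's TD error
    V[s] += alpha_v * delta            # Critic update
    probs = softmax(prefs[s])
    for b in range(len(prefs[s])):     # Actor update: delta * grad log pi
        grad = (1.0 if b == a else 0.0) - probs[b]
        prefs[s][b] += alpha_pi * delta * grad
```

Because positive TD errors reinforce the taken action and negative ones suppress it, the policy improves incrementally after every single transition.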

## DDPG

In DDPG, the actor network takes the state as input and outputs an action, with a DNN doing the function fitting. For continuous actions the NN output layer can use tanh or sigmoid; for discrete actions, a softmax output layer yields a probability distribution over actions. The critic network takes the state and action as input and outputs the Q-value.
