In most reinforcement learning (RL) problems the environment's model is unknown, so dynamic programming cannot be applied directly. One approach is to learn the MDP itself; the earlier post in this series on understanding policy iteration and value iteration describes that idea in detail. But learning a model raises sampling problems of its own: with purely random sampling, some states are very hard to ever reach, while sampling under a fixed policy makes it hard to recover the true transition probabilities. Once the learned model is wrong, both value iteration and policy iteration break down. This motivates Model-free Reinforcement Learning: interact with the environment and learn directly from data, without building a model.
Model-free Reinforcement Learning
Model-free reinforcement learning first estimates from data the value of states (or state-action pairs), i.e., the expected cumulative reward. With those estimates in hand, it then performs model-free control: optimizing the current policy so as to maximize the value function.
So how does model-free prediction of the value function work?
In model-free RL we have no access to the state transitions or the reward function; all we have is a set of episodes. Previously we used such episodes to learn a model; in model-free methods we instead use them to learn the value function or the policy directly, with no need to learn the MDP. There are two key steps: 1. estimate the value function; 2. optimize the policy.
Value Function Estimate
In model-based RL (MDP), the value function is computed by dynamic programming:
$v_{\pi}(s)=\mathbb{E}_{\pi}[R_{t+1}+\gamma v_{\pi}(S_{t+1}) \mid S_{t}=s]$
In model-free methods we do not know the state transitions, so the expectation in this equation cannot be computed directly.
Monte-Carlo Methods
Monte-Carlo methods are a broad class of computational algorithms that rely on repeated random sampling to obtain numerical results. A classic example is estimating the area of a circle, as shown in the following figure: scatter points uniformly over the bounding square, then compute
$\text{Circle Area} = \text{Square Area} \times \frac{\text{points in circle}}{\text{points in total}}$
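As a minimal sketch of this idea, the following Python snippet estimates a circle's area by uniform sampling over its bounding square (the function name and parameters are illustrative, not from the original post):

```python
import random

def estimate_circle_area(radius=1.0, n_points=100_000, seed=0):
    """Estimate a circle's area by uniform sampling in its bounding square."""
    rng = random.Random(seed)
    inside = 0
    for _ in range(n_points):
        x = rng.uniform(-radius, radius)
        y = rng.uniform(-radius, radius)
        if x * x + y * y <= radius * radius:
            inside += 1
    square_area = (2 * radius) ** 2
    # circle area = square area * (points in circle / points in total)
    return square_area * inside / n_points

area = estimate_circle_area()
print(area)  # close to pi for a unit circle
```

With 100,000 points the estimate is typically within a few hundredths of the true value; the error shrinks as the number of samples grows.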
Monte-Carlo Value Estimation
We have many episodes; from each episode we can compute the total discounted reward:
$G_{t}=R_{t+1}+\gamma R_{t+2}+\ldots=\sum_{k=0}^{\infty} \gamma^{k} R_{t+k+1}$
The value function is the expected return, which can be approximated by the sample mean:
$V^{\pi}(s) = \mathbb{E}[G_{t} \mid s_{t}=s, \pi] \approx \frac{1}{N} \sum_{i=1}^{N} G_{t}^{(i)}$
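The discounted return above can be computed from a list of sampled rewards with a simple backward sweep. This is an illustrative sketch, not code from the original post:

```python
def discounted_return(rewards, gamma=0.9):
    """Compute G_t = R_{t+1} + gamma*R_{t+2} + ... for a finite reward list."""
    g = 0.0
    # sweep backwards so each earlier reward picks up one more discount factor
    for r in reversed(rewards):
        g = r + gamma * g
    return g

print(discounted_return([1.0, 0.0, 2.0], gamma=0.5))  # 1 + 0.5*0 + 0.25*2 = 1.5
```

Averaging `discounted_return` over N sampled episodes starting from state s gives the Monte-Carlo estimate of V(s).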
This method can be summarized in two steps: 1. use policy $\pi$ to sample $N$ episodes starting from state $s$; 2. compute the average cumulative reward. Notice that this sampling-based method computes the value directly, with no need for the MDP's state transition probabilities.
More concretely, this idea can be written as the following procedure:
- Sample episodes under policy $\pi$
- Every time-step $t$ that state $s$ is visited in an episode:
  - Increment counter: $N(s) \leftarrow N(s) + 1$
  - Increment total return: $S(s) \leftarrow S(s) + G_{t}$
- Value is estimated by mean return: $V(s) = S(s) / N(s)$
- By the law of large numbers, $V(s) \rightarrow V^{\pi}(s)$ as $N(s) \rightarrow \infty$
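The counting procedure above can be sketched as an every-visit Monte-Carlo estimator. The episode format and names here are assumptions for illustration:

```python
from collections import defaultdict

def mc_value_estimate(episodes, gamma=1.0):
    """Every-visit Monte-Carlo prediction.

    episodes: list of trajectories, each a list of (state, reward) pairs,
    where reward is the reward received after leaving that state.
    """
    n = defaultdict(int)        # N(s): visit counts
    total = defaultdict(float)  # S(s): accumulated returns
    for episode in episodes:
        g = 0.0
        # backward sweep: g becomes the return G_t at each visited state
        for state, reward in reversed(episode):
            g = reward + gamma * g
            n[state] += 1
            total[state] += g
    return {s: total[s] / n[s] for s in n}  # V(s) = S(s) / N(s)

# Two tiny hypothetical episodes over states 'a' and 'b'
episodes = [[('a', 1.0), ('b', 0.0)], [('a', 0.0), ('b', 1.0)]]
print(mc_value_estimate(episodes))
```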
Incremental Monte-Carlo Updates
- Update $V(s)$ incrementally after each episode
- For each state $S_{t}$ with cumulative return $G_{t}$:
$N(S_{t}) \leftarrow N(S_{t}) + 1$
$V(S_{t}) \leftarrow V(S_{t}) + \frac{1}{N(S_{t})}\left(G_{t} - V(S_{t})\right)$
- For non-stationary problems (i.e. the environment could be varying over time), it can be useful to track a running mean, i.e. forget old episodes
If the environment's state transitions and reward function keep changing over time, the environment is called non-stationary (the environment itself is of course stochastic); if the distribution stays fixed, it is called stationary. When the underlying distribution does change, we need to forget older episodes: a plain average weights old and new episodes equally, so it can never forget old data. Using a constant step size $\alpha$ does the forgetting:
$V(S_{t}) \leftarrow V(S_{t}) + \alpha (G_{t} - V(S_{t}))$
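A minimal sketch of this incremental update, contrasting the exact running mean ($\alpha = 1/N$) with a constant step size that forgets old episodes (names are illustrative):

```python
def incremental_update(v, g, alpha=None, n=None):
    """One incremental MC update of V(S_t) toward the return G_t.

    With alpha = 1/N this reproduces the exact running mean; a constant
    alpha down-weights old episodes, which suits non-stationary problems.
    """
    if alpha is None:
        alpha = 1.0 / n  # exact running-mean step size
    return v + alpha * (g - v)

v = 0.0
for g in [1.0, 1.0, 1.0]:
    v = incremental_update(v, g, alpha=0.5)
print(v)  # 0.875: recent returns dominate the estimate
```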
Some properties of Monte-Carlo value estimation:
- MC methods learn directly from episodes of experience
- MC is model-free: no knowledge of MDP transitions / rewards
- MC learns from complete episodes: no bootstrapping (discussed later)
- MC uses the simplest possible idea: value = mean return
- Caveat: can only apply MC to episodic MDPs i.e., all episodes must terminate
Temporal-Difference Learning
TD methods replace the unknown future return with an estimate of the next state's value:
$G_{t}=R_{t+1}+\gamma R_{t+2}+\ldots \approx R_{t+1}+\gamma V(S_{t+1})$
$V(S_{t}) \leftarrow V(S_{t}) + \alpha(R_{t+1}+\gamma V(S_{t+1}) - V(S_{t}))$
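A tabular TD(0) evaluation loop implementing this update might look like the following sketch (the episode format is an assumption for illustration):

```python
from collections import defaultdict

def td0_evaluate(episodes, alpha=0.1, gamma=1.0, n_sweeps=1):
    """Tabular TD(0) policy evaluation.

    episodes: trajectories of (state, reward, next_state) triples;
    next_state is None at termination, where V is taken to be 0.
    """
    v = defaultdict(float)
    for _ in range(n_sweeps):
        for episode in episodes:
            for s, r, s_next in episode:
                # TD target: R_{t+1} + gamma * V(S_{t+1})
                target = r + (gamma * v[s_next] if s_next is not None else 0.0)
                v[s] += alpha * (target - v[s])  # move V(S_t) toward the target
    return dict(v)

# A two-state chain a -> b -> terminal, with reward 1 on the final step
episodes = [[('a', 0.0, 'b'), ('b', 1.0, None)]]
print(td0_evaluate(episodes, alpha=0.5, n_sweeps=100))
```

Unlike MC, each update here uses only one transition, so learning can happen online, before the episode ends.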
TD methods have the following four main characteristics:
- TD methods learn directly from episodes of experience
- TD is model-free: no knowledge of MDP transitions / rewards
- TD learns from incomplete episodes, by bootstrapping
- TD updates a guess towards a guess
Monte Carlo vs. Temporal Difference
A comparison of Monte Carlo and Temporal Difference methods:
- The same goal: learn $V^{\pi}$ from episodes of experience under policy $\pi$
- Incremental every-visit Monte-Carlo:
  - Update value $V(S_{t})$ toward the actual return $G_{t}$:
    $V(S_{t}) \leftarrow V(S_{t}) + \alpha(G_{t}-V(S_{t}))$
- Simplest temporal-difference learning algorithm: TD(0):
  - Update value $V(S_{t})$ toward the estimated return $R_{t+1} + \gamma V(S_{t+1})$
  - TD target: $R_{t+1} + \gamma V(S_{t+1})$
  - TD error: $\delta_{t} = R_{t+1} + \gamma V(S_{t+1}) - V(S_{t})$
Advantages and Disadvantages of MC vs. TD
- TD can learn before knowing the final outcome
  - TD can learn online after every step
  - MC must wait until the end of the episode before the return is known
- TD can learn without the final outcome
  - TD can learn from incomplete sequences
  - MC can only learn from complete sequences
  - TD works in continuing (non-terminating) environments
  - MC only works for episodic (terminating) environments
Bias/Variance Trade-Off
- The return $G_{t}$ is an unbiased estimate of $V^{\pi}(S_{t})$: sampling under the current policy and averaging the returns yields an unbiased estimate.
- The TD target $R_{t+1} + \gamma V(S_{t+1})$ is a biased estimate of $V^{\pi}(S_{t})$: it contains the estimate $V(S_{t+1})$ of a future value. If that estimate were exact, the TD target would also be unbiased, but $V(S_{t+1})$ is hard to estimate exactly, so the TD target is biased.
- The TD target has much lower variance than the return: the return $G_{t}$ depends on many random actions, transitions, and rewards, whereas the TD target depends on only one random action, transition, and reward.
Advantages and Disadvantages of MC vs. TD (2)
- MC has high variance, zero bias: MC methods have good convergence properties (even with function approximation), are not very sensitive to the initial value, and are very simple to understand and use, but they need many samples to drive the variance down.
- TD has low variance, some bias: TD methods are usually more efficient than MC. TD(0) converges to $V^{\pi}(s)$ (but not always with function approximation), and TD is more sensitive to the initial value than MC.
n-step model-free prediction
Due to time constraints, one may skip the n-step prediction section and head directly to model-free control.
- Define the n-step return:
  $G_{t}^{(n)} = R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^{n-1}R_{t+n} + \gamma^{n} V(S_{t+n})$
- n-step temporal-difference learning:
  $V(S_{t}) \leftarrow V(S_{t}) + \alpha(G_{t}^{(n)} - V(S_{t}))$
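The n-step return can be sketched as follows. The indexing conventions are assumptions for illustration: `rewards[k]` holds $R_{k+1}$, and `values` includes the terminal state with value 0:

```python
def n_step_return(rewards, values, t, n, gamma=0.9):
    """Compute G_t^(n) = R_{t+1} + ... + gamma^{n-1} R_{t+n} + gamma^n V(S_{t+n}).

    rewards: [R_1, ..., R_T] for an episode of length T.
    values:  [V(S_0), ..., V(S_T)] with V(S_T) = 0 at termination.
    Falls back to the full Monte-Carlo return when t + n runs past the end.
    """
    T = len(rewards)
    g = 0.0
    for k in range(min(n, T - t)):   # accumulate up to n discounted rewards
        g += (gamma ** k) * rewards[t + k]
    if t + n <= T:                   # bootstrap from V(S_{t+n}) if it exists
        g += (gamma ** n) * values[t + n]
    return g

# 2-step return from t=0 in a 3-step episode, then the update toward it
g = n_step_return([1.0, 1.0, 1.0], [0.0, 0.5, 0.5, 0.0], t=0, n=2, gamma=1.0)
alpha, v0 = 0.1, 0.0
v0 = v0 + alpha * (g - v0)  # V(S_t) <- V(S_t) + alpha * (G_t^(n) - V(S_t))
```

Setting n = 1 recovers the TD(0) target, while letting n exceed the episode length recovers the Monte-Carlo return, so n interpolates between the two methods.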
With a value function estimate in hand, the next step is policy improvement.
My WeChat official account: 小小何先生
About the account: it mainly shares content on deep learning, game AI, reinforcement learning, and related topics. Follow along, and let's learn and improve together!