Model-Free Prediction
Estimate the value function of an unknown MDP
Monte-Carlo Learning
Features:
- learn directly from episodes of experience
- model-free: no knowledge of MDP transitions / rewards
- learns from complete episodes: no bootstrapping
- uses the simplest possible idea: value = mean return
- can only apply MC to episodic MDPs: all episodes must terminate
Monte-Carlo Policy Evaluation
Goal: learn $v_\pi$ from episodes of experience under policy $\pi$:

$$S_1, A_1, R_2, \dots, S_k \sim \pi$$
The definition of the value function is $v_\pi(s) = \mathbb{E}\left[G_t \mid S_t = s\right]$, but Monte-Carlo policy evaluation uses the empirical mean return instead of the expected return.
First-Visit Monte-Carlo Policy Evaluation
Algorithm:
- To evaluate state $s$
- The first time-step $t$ that state $s$ is visited in an episode:
  - Increment counter: $N(s) \leftarrow N(s) + 1$
  - Increment total return: $S(s) \leftarrow S(s) + G_t$
- Value is estimated by the mean return $V(s) = \frac{S(s)}{N(s)}$
- By the law of large numbers, $V(s) \rightarrow v_\pi(s)$ as $N(s) \rightarrow \infty$
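A minimal sketch of first-visit MC prediction, assuming each episode is given as a list of (state, reward) pairs (the reward received after leaving that state) and a discount factor `gamma`; all names are illustrative:

```python
from collections import defaultdict

def first_visit_mc(episodes, gamma=1.0):
    """Estimate V(s) as the mean of first-visit returns over episodes."""
    N = defaultdict(int)      # visit counts N(s)
    S = defaultdict(float)    # total returns S(s)
    for episode in episodes:
        # Compute G_t for every time-step by backward accumulation.
        G, returns = 0.0, []
        for _, reward in reversed(episode):
            G = reward + gamma * G
            returns.append(G)
        returns.reverse()
        visited = set()
        for (state, _), G_t in zip(episode, returns):
            if state in visited:      # only the first visit to s counts
                continue
            visited.add(state)
            N[state] += 1
            S[state] += G_t
    return {s: S[s] / N[s] for s in N}  # V(s) = S(s) / N(s)
```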
Incremental Monte-Carlo
Foundation: the mean $\mu_1, \mu_2, \dots$ of a sequence $x_1, x_2, \dots$ can be computed incrementally:

$$\begin{aligned} \mu_k &= \frac{1}{k} \sum_{j=1}^k x_j \\ &= \frac{1}{k} \left(x_k + \sum_{j=1}^{k-1} x_j \right) \\ &= \frac{1}{k} \left(x_k + (k-1)\mu_{k-1} \right) \\ &= \mu_{k-1} + \frac{1}{k}(x_k - \mu_{k-1}) \end{aligned}$$
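As a quick illustration of the last line, a tiny sketch that keeps only the previous mean and a counter (names are illustrative):

```python
def incremental_mean(xs):
    """Running means via mu_k = mu_{k-1} + (x_k - mu_{k-1}) / k."""
    mu, means = 0.0, []
    for k, x in enumerate(xs, start=1):
        mu = mu + (x - mu) / k     # incremental update, no sum over history
        means.append(mu)
    return means

# incremental_mean([1, 2, 3, 4]) -> [1.0, 1.5, 2.0, 2.5]
```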
Algorithm:
- Update $V(s)$ incrementally after episode $S_1, A_1, R_2, \dots, S_T$
- For each state $S_t$ with return $G_t$:
  - $N(S_t) \leftarrow N(S_t) + 1$
  - $V(S_t) \leftarrow V(S_t) + \frac{1}{N(S_t)} \left(G_t - V(S_t) \right)$
- In non-stationary problems, it can be useful to track a running mean, i.e. forget old episodes:
  - $V(S_t) \leftarrow V(S_t) + \alpha \left(G_t - V(S_t) \right)$
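A sketch of this incremental update applied after one finished episode; `episode_returns` (an assumed iterable of (state, G_t) pairs) and the optional constant step-size `alpha` are illustrative names:

```python
def incremental_mc_update(V, N, episode_returns, alpha=None):
    """Update V in place after one episode, one (state, G_t) pair per step."""
    for state, G_t in episode_returns:
        N[state] = N.get(state, 0) + 1
        # Constant alpha gives a running mean (forgets old episodes);
        # otherwise fall back to the exact 1/N(s) mean.
        step = alpha if alpha is not None else 1.0 / N[state]
        V[state] = V.get(state, 0.0) + step * (G_t - V.get(state, 0.0))
    return V
```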
Temporal-Difference Learning
Features (compared with MC in the next section):
- learn directly from episodes of experience
- model-free
- learns from incomplete episodes, by bootstrapping
- updates a guess towards a guess
MC vs. TD
- Incremental every-visit Monte-Carlo
  - Update value $V(S_t)$ toward the actual return ${\color{red}G_t}$:
    $$V(S_t) \leftarrow V(S_t) + \alpha \left({\color{red}G_t} - V(S_t) \right)$$
- Simplest temporal-difference learning algorithm: TD(0)
  - Update value $V(S_t)$ toward the estimated return ${\color{red}R_{t+1} + \gamma V(S_{t+1})}$:
    $$V(S_t) \leftarrow V(S_t) + \alpha \left({\color{red}R_{t+1} + \gamma V(S_{t+1})} - V(S_t) \right)$$
  - $R_{t+1} + \gamma V(S_{t+1})$ is called the TD target
  - $\delta_t = R_{t+1} + \gamma V(S_{t+1}) - V(S_t)$ is called the TD error
  - (a small sketch of this update follows the comparison below)
- MC has high variance, zero bias, and is not very sensitive to the initial value.
- TD has low variance, some bias, and is more sensitive to the initial value.
- TD exploits the Markov property; MC does not exploit the Markov property.
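A minimal sketch of one TD(0) update step, assuming `V` is a dict of value estimates and `terminal` flags whether $S_{t+1}$ ends the episode (all names are illustrative):

```python
def td0_update(V, state, reward, next_state, alpha=0.1, gamma=1.0, terminal=False):
    """Apply one TD(0) update after observing (S_t, R_{t+1}, S_{t+1})."""
    v_next = 0.0 if terminal else V.get(next_state, 0.0)
    td_target = reward + gamma * v_next            # R_{t+1} + gamma * V(S_{t+1})
    td_error = td_target - V.get(state, 0.0)       # delta_t
    V[state] = V.get(state, 0.0) + alpha * td_error
    return td_error
```

Unlike the MC update above, this can be applied online after every step, before the episode terminates.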
Unified View
Dynamic Programming Backup
Monte-Carlo Backup
Temporal-Difference Backup
Unified View of RL
TD($\lambda$)
n-step TD
Consider the following n-step returns for $n = 1, 2, \dots, \infty$:
$$\begin{aligned} n=1\ \text{(TD)} \quad\ \ G_t^{(1)} &= R_{t+1} + \gamma V(S_{t+1}) \\ n=2 \qquad\qquad\ \ G_t^{(2)} &= R_{t+1} + \gamma R_{t+2} + \gamma^2 V(S_{t+2}) \\ &\ \ \vdots \\ n=\infty\ \text{(MC)} \ \ \ G_t^{(\infty)} &= R_{t+1} + \gamma R_{t+2} + \dots + \gamma^{T-1}R_T \end{aligned}$$
n-step temporal-difference learning
$$V(S_t) \leftarrow V(S_t) + \alpha \left(G_t^{(n)} - V(S_t) \right)$$
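A sketch of computing $G_t^{(n)}$ from a stored window of rewards; `rewards`, `next_state` and the terminal handling are illustrative assumptions:

```python
def n_step_return(rewards, V, next_state, n, gamma=1.0):
    """Compute G_t^(n) = R_{t+1} + ... + gamma^(n-1) R_{t+n} + gamma^n V(S_{t+n}).

    `rewards` holds R_{t+1}, R_{t+2}, ... and `next_state` is S_{t+n}; if the
    episode ends within n steps, the bootstrap term is dropped (terminal value 0).
    """
    G = 0.0
    for k, r in enumerate(rewards[:n]):
        G += (gamma ** k) * r
    if len(rewards) >= n and next_state is not None:
        G += (gamma ** n) * V.get(next_state, 0.0)   # bootstrap from V(S_{t+n})
    return G
```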
Forward View of TD($\lambda$)
$\lambda$-return
For the same state $S_t$ there may be many different n-step returns. To make effective use of this information, we take their weighted average, using weight $(1-\lambda)\lambda^{n-1}$ for the n-step return, so $G_t^\lambda = (1-\lambda)\sum_{n=1}^\infty \lambda^{n-1}G_t^{(n)}$.
TD($\lambda$):

$$V(S_t) \leftarrow V(S_t) + \alpha \left(G_t^\lambda - V(S_t) \right)$$
The weights sum to 1:

$$\sum \text{weight} = \sum_{n=1}^\infty (1-\lambda)\lambda^{n-1} = (1-\lambda)\, \frac{\lambda^0(1-\lambda^\infty)}{1-\lambda} = 1$$
Features:
- Update the value function towards the $\lambda$-return
- Forward-view looks into the future to compute $G_t^\lambda$
- Like MC, it can only be computed from complete episodes
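A sketch of the $\lambda$-return for a finished episode, assuming `n_step_returns` already holds $G_t^{(1)}, \dots, G_t^{(T-t)}$ (the last entry being the full MC return); the leftover weight is folded onto the final return so the weights still sum to 1:

```python
def lambda_return(n_step_returns, lam):
    """Weighted average G_t^lambda = (1-lam) * sum_n lam^(n-1) * G_t^(n)."""
    G_lambda = 0.0
    for n, G_n in enumerate(n_step_returns[:-1], start=1):
        G_lambda += (1.0 - lam) * (lam ** (n - 1)) * G_n
    # Remaining weight lam^(N-1) goes to the final (Monte-Carlo) return.
    G_lambda += (lam ** (len(n_step_returns) - 1)) * n_step_returns[-1]
    return G_lambda
```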
Backward View of TD($\lambda$)
Eligibility Traces
The eligibility trace indicates how much influence a visit to state $s$ has on the update made at time-step $t$:
$$\begin{aligned} E_0(s) &= 0 \\ E_t(s) &= \gamma \lambda E_{t-1}(s) + 1(S_t=s) \end{aligned}$$
- Keep an eligibility trace for every state $s$
- For each step of the episode, update the value $V(s)$ for every state $s$
- In proportion to the TD error $\delta_t$ and the eligibility trace $E_t(s)$:

$$\begin{aligned} \delta_t &= R_{t+1} + \gamma V(S_{t+1}) - V(S_t) \\ V(s) &\leftarrow V(s) + \alpha \delta_t E_t(s) \end{aligned}$$
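A sketch of one backward-view TD($\lambda$) step over a small, enumerable state set; `V`, `E` and `states` are assumed dictionaries / iterables, and `E` should be reset to zero at the start of each episode (all names are illustrative):

```python
def td_lambda_backward_step(V, E, state, reward, next_state, states,
                            alpha=0.1, gamma=1.0, lam=0.9, terminal=False):
    """One online backward-view TD(lambda) update after (S_t, R_{t+1}, S_{t+1})."""
    v_next = 0.0 if terminal else V.get(next_state, 0.0)
    delta = reward + gamma * v_next - V.get(state, 0.0)     # TD error delta_t
    for s in states:
        # Decay every trace, and bump the trace of the visited state by 1.
        E[s] = gamma * lam * E.get(s, 0.0) + (1.0 if s == state else 0.0)
        # Every state is updated in proportion to delta_t and its trace.
        V[s] = V.get(s, 0.0) + alpha * delta * E[s]
    return delta
```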
Relationship Between Forward and Backward TD
Online
- When $\lambda=0$, only the current state $s_t$ is updated at step $t$:
  $$\begin{aligned} E_t(s) &= 1(s = s_t) \\ V(s) &\leftarrow V(s) + \alpha \delta_t E_t(s) \end{aligned}$$
  This is exactly equivalent to the TD(0) update:
  $$V(S_t) \leftarrow V(S_t) + \alpha \delta_t$$
- When $\lambda = 1$, credit is deferred until the end of the episode. Consider an episode where $s$ is visited once, at time-step $k$.
- The TD(1) eligibility trace discounts the time since the visit:
  $$E_t(s) = \gamma E_{t-1}(s) + 1(S_t = s) = \begin{cases} 0 & \text{if}\ t < k \\ \gamma^{t-k} & \text{if}\ t \geq k \end{cases}$$
- TD(1) updates accumulate error online:
  $$\begin{aligned} \sum_{t=1}^{T-1}\alpha \delta_t E_t(s) &= \alpha \sum_{t=k}^{T-1} \gamma^{t-k} \delta_t \\ \sum_{t=k}^{T-1} \gamma^{t-k} \delta_t &= \delta_k + \gamma\delta_{k+1} + \gamma^2\delta_{k+2} + \ldots + \gamma^{T-1-k}\delta_{T-1} \\ &= R_{k+1} + \gamma V(S_{k+1}) - V(S_k) \\ &\quad + \gamma\left(R_{k+2}+ \gamma V(S_{k+2}) - V(S_{k+1}) \right) \\ &\quad + \gamma^2 \left(R_{k+3}+ \gamma V(S_{k+3}) - V(S_{k+2}) \right) \\ &\quad \ \ \vdots \\ &\quad + \gamma^{T-1-k} \left(R_{T}+ \gamma V(S_{T}) - V(S_{T-1}) \right) \\ &= R_{k+1} + \gamma R_{k+2} + \gamma^2 R_{k+3} + \ldots + \gamma^{T-1-k}R_T + \gamma^{T-k} V(S_{T}) - V(S_k), \quad V(S_T) = 0 \\ &= G_k - V(S_k) \end{aligned}$$
- TD(1) is roughly equivalent to every-visit Monte-Carlo
- Error is accumulated online, step-by-step
- If value function is only updated offline at end of episode, then total update is exactly the same as MC
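A small numerical check of the telescoping identity above, under the assumption that the visited state is the start state of the episode; states are simply labelled $0, \dots, T$ with $V(S_T) = 0$, and all names are illustrative:

```python
def check_td1_telescoping(rewards, V_init, gamma=1.0):
    """Verify sum_t gamma^t * delta_t == G_0 - V(S_0) for one episode."""
    T = len(rewards)
    V = {t: v for t, v in enumerate(V_init)}
    V[T] = 0.0                                   # terminal value is zero
    deltas = [rewards[t] + gamma * V[t + 1] - V[t] for t in range(T)]
    lhs = sum((gamma ** t) * d for t, d in enumerate(deltas))
    G0 = sum((gamma ** t) * r for t, r in enumerate(rewards))
    return abs(lhs - (G0 - V[0])) < 1e-9

# check_td1_telescoping([1.0, 0.5, 2.0], [0.3, -0.1, 0.7], gamma=0.9) -> True
```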
Offline
With offline updates (applied only at the end of the episode), backward view = forward view (see the reference page).
To be honest, I haven't fully understood the backward view. I hope someone more knowledgeable can clarify: why is the eligibility trace introduced in the first place? What is the difference between offline and online updates, and how does that difference show up in the proof?
Algorithm