Chapter 10: On-policy Control with Approximation

1 Introduction

In the control problem, we focus on the parametric action-value function $\hat{q}(s,a,\mathbf{w})\approx q_*(s,a)$, where $\mathbf{w}\in\mathbb{R}^d$, because it is easy to plan with an action-value function: just select the action with the largest value. If the action-value function is not accurate enough, we can refine it at decision time with rollout algorithms or Monte Carlo Tree Search.

  • For episodic cases, it is easy to extend the evaluation algorithms of Chapter 9: just use an $\epsilon$-greedy policy (a soft version of the greedy policy). A semi-gradient n-step Sarsa algorithm is presented.
  • For continuing cases, a new definition of return (the average reward) is introduced, and a differential semi-gradient Sarsa algorithm is presented.

2 On-policy control with approximation for episodic tasks

2.1 General gradient-descent update for action-value prediction

$$\mathbf{w}_{t+1}=\mathbf{w}_{t}+\alpha\left[U_t-\hat{q}(S_t,A_t,\mathbf{w}_{t})\right]\nabla \hat{q}(S_t,A_t,\mathbf{w}_{t}) \tag{10.1}$$
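For concreteness, here is a minimal sketch of update (10.1) with linear function approximation; the feature function `features(s, a)` is a hypothetical helper introduced only for illustration:

```python
import numpy as np

def semi_gradient_update(w, features, s, a, U_t, alpha):
    """One application of update (10.1) with a linear q-hat.

    For a linear approximator q_hat(s, a, w) = w . x(s, a), the gradient with
    respect to w is just the feature vector x(s, a). `features(s, a)` is a
    hypothetical feature function returning a NumPy array of the same length as w.
    """
    x = features(s, a)                 # gradient of q_hat(s, a, w) in the linear case
    q_sa = np.dot(w, x)                # current estimate q_hat(s, a, w)
    return w + alpha * (U_t - q_sa) * x
```

Here $U_t$ can be any update target, for example the one-step Sarsa target $R_{t+1}+\gamma\hat{q}(S_{t+1},A_{t+1},\mathbf{w}_t)$.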

2.2 Semi-gradient n-step Sarsa

By replacing the update target of (10.1) with the n-step return

$$G_{t:t+n}=R_{t+1}+\gamma R_{t+2}+\dots+\gamma^{n-1}R_{t+n}+\gamma^{n}\hat{q}(S_{t+n},A_{t+n},\mathbf{w}_{t+n-1}),\quad t+n<T \tag{10.4}$$

we get the update equation for semi-gradient n-step Sarsa:

$$\mathbf{w}_{t+n}=\mathbf{w}_{t+n-1}+\alpha\left[G_{t:t+n}-\hat{q}(S_t,A_t,\mathbf{w}_{t+n-1})\right]\nabla\hat{q}(S_t,A_t,\mathbf{w}_{t+n-1}),\quad 0\leq t<T \tag{10.5}$$

Episodic semi-gradient n-step Sarsa for estimating $\hat{q}\approx q_*$ or $q_{\pi}$:

[pseudocode box from the textbook]
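In place of the pseudocode box, here is a minimal Python sketch of the algorithm with linear features. The environment interface (`env.reset()`, `env.step(a)`), the feature function `features(s, a)`, and the action set are assumptions made for illustration only:

```python
import numpy as np

def episodic_n_step_sarsa(env, features, actions, num_features,
                          n=4, alpha=0.1, gamma=1.0, epsilon=0.1, episodes=100):
    """Sketch of episodic semi-gradient n-step Sarsa with a linear q-hat."""
    w = np.zeros(num_features)
    q_hat = lambda s, a: np.dot(w, features(s, a))

    def eps_greedy(s):
        if np.random.rand() < epsilon:
            return actions[np.random.randint(len(actions))]
        return max(actions, key=lambda a: q_hat(s, a))

    for _ in range(episodes):
        S = [env.reset()]                     # assumed environment interface
        A = [eps_greedy(S[0])]
        R = [0.0]                             # R[t] is the reward received at step t
        T, t = float('inf'), 0
        while True:
            if t < T:
                s_next, r, done = env.step(A[t])
                S.append(s_next)
                R.append(r)
                if done:
                    T = t + 1
                else:
                    A.append(eps_greedy(s_next))
            tau = t - n + 1                   # time whose estimate is updated
            if tau >= 0:
                G = sum(gamma ** (i - tau - 1) * R[i]
                        for i in range(tau + 1, int(min(tau + n, T)) + 1))
                if tau + n < T:
                    G += gamma ** n * q_hat(S[tau + n], A[tau + n])   # (10.4)
                x = features(S[tau], A[tau])  # gradient of q-hat in the linear case
                w += alpha * (G - np.dot(w, x)) * x                   # update (10.5)
            if tau == T - 1:
                break
            t += 1
    return w
```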

3 On-policy control with approximation for continuing tasks

The average-reward setting is a third way, alongside the episodic and discounted settings, of formulating the goal in Markov decision problems (MDPs). It applies to continuing problems, which have no start or end state, and it uses no discounting.

3.1 Average reward

Discounted value is problematic with function approximation. The root cause of the difficulties with the discounted control setting is that with function approximation we have lost the policy improvement theorem (Section 4.2). It is no longer true that if we change the policy to improve the discounted value of one state then we are guaranteed to have improved the overall policy in any useful sense (e.g. generalisation could ruin the policy elsewhere).

Average reward:

$$\begin{aligned} r(\pi) &\doteq \lim_{h\to\infty}\frac{1}{h}\sum_{t=1}^{h}\mathbb{E}\big[R_t \mid S_0,A_{0:t-1}\sim\pi\big] && (10.6)\\ &= \lim_{t\to\infty}\mathbb{E}\big[R_t \mid S_0,A_{0:t-1}\sim\pi\big] && (10.7)\\ &= \sum_s \mu_{\pi}(s)\sum_a \pi(a|s)\sum_{s',r}p(s',r|s,a)\,r \end{aligned}$$

This quantity is essentially the average reward under $\pi$, as suggested by (10.7). In particular, we consider all policies that attain the maximal value of $r(\pi)$ to be optimal.

Ergodicity assumption

$$\mu_{\pi}(s)\doteq\lim_{t\to\infty}\Pr\{S_t=s \mid A_{0:t-1}\sim\pi\}$$

The steady-state distribution $\mu_{\pi}$ is assumed to exist and to be independent of $S_0$. This assumption about the MDP is known as ergodicity. It means that where the MDP starts, or any early decision made by the agent, can have only a temporary effect; in the long run the expectation of being in a state depends only on the policy and the MDP transition probabilities. Ergodicity is sufficient to guarantee the existence of the limits in the equations above.

Steady state distribution

$$\sum_s \mu_{\pi}(s)\sum_a \pi(a|s)\,p(s'|s,a)=\mu_{\pi}(s') \tag{10.8}$$

This is the special distribution under which, if you select actions according to $\pi$, you remain in the same distribution.
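A quick way to see (10.8) in action: for a small MDP with known dynamics, the steady-state distribution is the left eigenvector of the policy's state-transition matrix with eigenvalue 1, and $r(\pi)$ then follows from it. The two-state MDP below (with expected rewards $r(s,a)$ in place of the full $p(s',r|s,a)$ form) is made up purely for illustration:

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP: transition probs p[s, a, s'] and expected rewards r[s, a].
p = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.1, 0.9]]])
r = np.array([[1.0, 0.0],
              [2.0, -1.0]])
pi = np.array([[0.6, 0.4],      # pi(a|s) for s = 0
               [0.3, 0.7]])     # pi(a|s) for s = 1

# State-to-state transition matrix under pi: P[s, s'] = sum_a pi(a|s) p(s'|s, a).
P = np.einsum('sa,sax->sx', pi, p)

# The steady-state distribution mu_pi satisfies mu_pi @ P = mu_pi (equation 10.8),
# i.e. it is the left eigenvector of P with eigenvalue 1, normalised to sum to 1.
eigvals, eigvecs = np.linalg.eig(P.T)
mu = np.real(eigvecs[:, np.argmax(np.isclose(eigvals, 1.0))])
mu = mu / mu.sum()

print(np.allclose(mu @ P, mu))          # verifies (10.8)
print(np.sum(mu[:, None] * pi * r))     # average reward r(pi)
```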

Differential return:

$$G_t=R_{t+1}-r(\pi)+R_{t+2}-r(\pi)+R_{t+3}-r(\pi)+\dots \tag{10.9}$$
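A tiny numeric illustration of why each reward in (10.9) is centred by $r(\pi)$, using a made-up reward stream whose long-run average is 0.5 (all values here are assumptions for illustration):

```python
# Hypothetical reward stream alternating 1, 0, so its average reward r(pi) is 0.5.
rewards = [1, 0] * 50
r_pi = 0.5

plain_sum = sum(rewards)                       # undiscounted return: grows with the horizon
diff_sum = sum(x - r_pi for x in rewards)      # partial sums of (10.9): stay bounded

print(plain_sum)   # 50, and it keeps growing as more rewards arrive
print(diff_sum)    # 0.0, centring by r(pi) keeps the differential return finite
```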

Bellman equations:

In the differential setting, the Bellman equations drop all $\gamma$s and replace each reward with the difference between the reward and the true average reward:

$$v_{\pi}(s)=\sum_a \pi(a|s)\sum_{r,s'}p(s',r|s,a)\big[r-r(\pi)+v_{\pi}(s')\big]$$

$$q_{\pi}(s,a)=\sum_{r,s'}p(s',r|s,a)\Big[r-r(\pi)+\sum_{a'}\pi(a'|s')\,q_{\pi}(s',a')\Big]$$

$$v_*(s)=\max_a\sum_{r,s'}p(s',r|s,a)\big[r-\max_{\pi}r(\pi)+v_*(s')\big]$$

$$q_*(s,a)=\sum_{r,s'}p(s',r|s,a)\big[r-\max_{\pi}r(\pi)+\max_{a'}q_*(s',a')\big]$$

Differential TD errors:

$$\delta_t\doteq R_{t+1}-\bar{R}_t+\hat{v}(S_{t+1},\mathbf{w}_t)-\hat{v}(S_t,\mathbf{w}_t) \tag{10.10}$$

$$\delta_t\doteq R_{t+1}-\bar{R}_t+\hat{q}(S_{t+1},A_{t+1},\mathbf{w}_t)-\hat{q}(S_t,A_t,\mathbf{w}_t) \tag{10.11}$$

where $\bar{R}_t$ is an estimate at time $t$ of the average reward $r(\pi)$.

Gradient update with the differential return / differential TD error:

$$\mathbf{w}_{t+1}=\mathbf{w}_{t}+\alpha\,\delta_t\,\nabla \hat{q}(S_t,A_t,\mathbf{w}_t) \tag{10.12}$$
Many of the previous algorithms and theoretical results carry over to this new setting without change.
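As a sketch of how update (10.12) fits into a learning loop, here is a minimal differential semi-gradient (one-step) Sarsa with linear features. The environment interface, the feature function `features(s, a)`, and the step size `beta` for the average-reward estimate are assumptions made for illustration:

```python
import numpy as np

def differential_sarsa(env, features, actions, num_features,
                       alpha=0.1, beta=0.01, epsilon=0.1, steps=100000):
    """Sketch of differential semi-gradient (one-step) Sarsa with a linear q-hat."""
    w = np.zeros(num_features)
    r_bar = 0.0                                   # estimate of the average reward
    q_hat = lambda s, a: np.dot(w, features(s, a))

    def eps_greedy(s):
        if np.random.rand() < epsilon:
            return actions[np.random.randint(len(actions))]
        return max(actions, key=lambda a: q_hat(s, a))

    s = env.reset()                               # assumed environment interface
    a = eps_greedy(s)
    for _ in range(steps):
        s_next, r, _ = env.step(a)                # continuing task: no terminal state
        a_next = eps_greedy(s_next)
        delta = r - r_bar + q_hat(s_next, a_next) - q_hat(s, a)   # TD error (10.11)
        r_bar += beta * delta                     # update the average-reward estimate
        w += alpha * delta * features(s, a)       # semi-gradient update (10.12)
        s, a = s_next, a_next
    return w, r_bar
```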

Convergence

For methods that learn action values, we currently seem to be without a local improvement guarantee.

3.2 Differential Semi-gradient n-step Sarsa

  • Differential n-step return:
    $$G_{t:t+n}=R_{t+1}-\bar{R}_{t+n-1}+\dots+R_{t+n}-\bar{R}_{t+n-1}+\hat{q}(S_{t+n},A_{t+n},\mathbf{w}_{t+n-1}) \tag{10.14}$$
  • n-step TD error:
    $$\delta_t=G_{t:t+n}-\hat{q}(S_{t},A_t,\mathbf{w}_{t+n-1}) \tag{10.15}$$
  • Differential semi-gradient n-step Sarsa for estimating $\hat{q}\approx q_{\pi}$ or $q_*$
    TODO upload figure at page 277 (a Python sketch of the update follows this list)
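Until that figure is in place, here is a minimal sketch of the inner update of differential semi-gradient n-step Sarsa, combining (10.14), (10.15), and (10.12) for a linear q-hat. The buffers `S`, `A`, `R` and the feature function `features(s, a)` are hypothetical, introduced only for illustration:

```python
import numpy as np

def differential_n_step_update(w, r_bar, features, S, A, R, tau, n, alpha, beta):
    """One update at time tau for differential semi-gradient n-step Sarsa.

    S, A, R are hypothetical buffers: S[t] and A[t] are the state and action at
    time t, and R[t] is the reward received on the transition into S[t].
    Returns the updated weights and average-reward estimate.
    """
    q_hat = lambda t: np.dot(w, features(S[t], A[t]))

    # Differential n-step return (10.14): each reward is centred by the current
    # average-reward estimate, bootstrapping from q_hat at time tau + n.
    G = sum(R[i] - r_bar for i in range(tau + 1, tau + n + 1)) + q_hat(tau + n)

    delta = G - q_hat(tau)                              # n-step TD error (10.15)
    r_bar = r_bar + beta * delta                        # update average-reward estimate
    w = w + alpha * delta * features(S[tau], A[tau])    # semi-gradient update (10.12)
    return w, r_bar
```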