Chapter 10: On-policy Control with Approximation

1 Introduction

In the control problem, we focus on the parametric action-value function $\hat{q}(s,a,\mathbf{w})\approx q_*(s,a)$, where $\mathbf{w}\in\mathbb{R}^d$, because it is easy to plan with an action-value function: just select the action with the largest value. If the action-value function is not accurate enough, we can refine it at decision time with rollout algorithms or Monte Carlo Tree Search.

  • For episodic cases, it is easy to extend the evaluation algorithms of Chapter 9: just use an $\epsilon$-greedy policy (a soft version of the greedy policy). A semi-gradient n-step Sarsa algorithm is presented.
  • For continuing cases, a new definition of return (the average reward) is introduced, and a differential semi-gradient Sarsa algorithm is presented.

2 On-policy control with approximation for episodic tasks

2.1 General gradient-descent update for action-value prediction

$$\mathbf{w}_{t+1}=\mathbf{w}_{t}+\alpha\left[U_t-\hat{q}(S_t,A_t,\mathbf{w}_{t})\right]\nabla \hat{q}(S_t,A_t,\mathbf{w}_{t}) \tag{10.1}$$
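For concreteness, here is a minimal sketch of update (10.1) with linear function approximation; the feature function `features(s, a)` is a hypothetical helper introduced only for illustration:

```python
import numpy as np

def semi_gradient_update(w, features, s, a, U_t, alpha):
    """One application of update (10.1) with a linear q-hat.

    For a linear approximator q_hat(s, a, w) = w . x(s, a), the gradient with
    respect to w is just the feature vector x(s, a). `features(s, a)` is a
    hypothetical feature function returning a NumPy array of the same length as w.
    """
    x = features(s, a)                 # gradient of q_hat(s, a, w) in the linear case
    q_sa = np.dot(w, x)                # current estimate q_hat(s, a, w)
    return w + alpha * (U_t - q_sa) * x
```

Here $U_t$ can be any update target, for example the one-step Sarsa target $R_{t+1}+\gamma\hat{q}(S_{t+1},A_{t+1},\mathbf{w}_t)$.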

2.2 Semi-gradient n-step Sarsa

By replacing the update target of (10.1) with the n-step return

$$G_{t:t+n}=R_{t+1}+\gamma R_{t+2}+\dots+\gamma^{n-1}R_{t+n}+\gamma^{n}\hat{q}(S_{t+n},A_{t+n},\mathbf{w}_{t+n-1}),\quad t+n<T \tag{10.4}$$

we get the update equation for semi-gradient n-step Sarsa:

$$\mathbf{w}_{t+n}=\mathbf{w}_{t+n-1}+\alpha\left[G_{t:t+n}-\hat{q}(S_t,A_t,\mathbf{w}_{t+n-1})\right]\nabla\hat{q}(S_t,A_t,\mathbf{w}_{t+n-1}),\quad 0\leq t<T \tag{10.5}$$

Episodic semi-gradient n-step Sarsa for estimating $\hat{q}\approx q_*$ or $q_{\pi}$:

[pseudocode box from the textbook]
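In place of the pseudocode box, here is a minimal Python sketch of the algorithm with linear features. The environment interface (`env.reset()`, `env.step(a)`), the feature function `features(s, a)`, and the action set are assumptions made for illustration only:

```python
import numpy as np

def episodic_n_step_sarsa(env, features, actions, num_features,
                          n=4, alpha=0.1, gamma=1.0, epsilon=0.1, episodes=100):
    """Sketch of episodic semi-gradient n-step Sarsa with a linear q-hat."""
    w = np.zeros(num_features)
    q_hat = lambda s, a: np.dot(w, features(s, a))

    def eps_greedy(s):
        if np.random.rand() < epsilon:
            return actions[np.random.randint(len(actions))]
        return max(actions, key=lambda a: q_hat(s, a))

    for _ in range(episodes):
        S = [env.reset()]                     # assumed environment interface
        A = [eps_greedy(S[0])]
        R = [0.0]                             # R[t] is the reward received at step t
        T, t = float('inf'), 0
        while True:
            if t < T:
                s_next, r, done = env.step(A[t])
                S.append(s_next)
                R.append(r)
                if done:
                    T = t + 1
                else:
                    A.append(eps_greedy(s_next))
            tau = t - n + 1                   # time whose estimate is updated
            if tau >= 0:
                G = sum(gamma ** (i - tau - 1) * R[i]
                        for i in range(tau + 1, int(min(tau + n, T)) + 1))
                if tau + n < T:
                    G += gamma ** n * q_hat(S[tau + n], A[tau + n])   # (10.4)
                x = features(S[tau], A[tau])  # gradient of q-hat in the linear case
                w += alpha * (G - np.dot(w, x)) * x                   # update (10.5)
            if tau == T - 1:
                break
            t += 1
    return w
```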

3 On-policy control with approximation for continuing tasks

The average-reward setting is a third way, alongside the episodic and discounted settings, of formulating the goal in Markov decision problems (MDPs). It applies to continuing problems, which have no start or end state, and it uses no discounting.

3.1 Average reward

Discounted value is problematic with function approximation. The root cause of the difficulties with the discounted control setting is that with function approximation we have lost the policy improvement theorem (Section 4.2). It is no longer true that if we change the policy to improve the discounted value of one state then we are guaranteed to have improved the overall policy in any useful sense (e.g. generalisation could ruin the policy elsewhere).

Average reward:

$$\begin{aligned} r(\pi) &\doteq \lim_{h\to\infty}\frac{1}{h}\sum_{t=1}^{h}\mathbb{E}\big[R_t \mid S_0,A_{0:t-1}\sim\pi\big] && (10.6)\\ &= \lim_{t\to\infty}\mathbb{E}\big[R_t \mid S_0,A_{0:t-1}\sim\pi\big] && (10.7)\\ &= \sum_s \mu_{\pi}(s)\sum_a \pi(a|s)\sum_{s',r}p(s',r|s,a)\,r \end{aligned}$$

This quantity is essentially the average reward under $\pi$, as suggested by (10.7). In particular, we consider all policies that attain the maximal value of $r(\pi)$ to be optimal.

Ergodicity assumption

$$\mu_{\pi}(s)\doteq\lim_{t\to\infty}\Pr\{S_t=s \mid A_{0:t-1}\sim\pi\}$$

The steady-state distribution $\mu_{\pi}$ is assumed to exist and to be independent of $S_0$. This assumption about the MDP is known as ergodicity. It means that where the MDP starts, or any early decision made by the agent, can have only a temporary effect; in the long run the expectation of being in a state depends only on the policy and the MDP transition probabilities. Ergodicity is sufficient to guarantee the existence of the limits in the equations above.

Steady state distribution

$$\sum_s \mu_{\pi}(s)\sum_a \pi(a|s)\,p(s'|s,a)=\mu_{\pi}(s') \tag{10.8}$$

This is the special distribution under which, if you select actions according to $\pi$, you remain in the same distribution.
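A quick way to see (10.8) in action: for a small MDP with known dynamics, the steady-state distribution is the left eigenvector of the policy's state-transition matrix with eigenvalue 1, and $r(\pi)$ then follows from it. The two-state MDP below (with expected rewards $r(s,a)$ in place of the full $p(s',r|s,a)$ form) is made up purely for illustration:

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP: transition probs p[s, a, s'] and expected rewards r[s, a].
p = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.1, 0.9]]])
r = np.array([[1.0, 0.0],
              [2.0, -1.0]])
pi = np.array([[0.6, 0.4],      # pi(a|s) for s = 0
               [0.3, 0.7]])     # pi(a|s) for s = 1

# State-to-state transition matrix under pi: P[s, s'] = sum_a pi(a|s) p(s'|s, a).
P = np.einsum('sa,sax->sx', pi, p)

# The steady-state distribution mu_pi satisfies mu_pi @ P = mu_pi (equation 10.8),
# i.e. it is the left eigenvector of P with eigenvalue 1, normalised to sum to 1.
eigvals, eigvecs = np.linalg.eig(P.T)
mu = np.real(eigvecs[:, np.argmax(np.isclose(eigvals, 1.0))])
mu = mu / mu.sum()

print(np.allclose(mu @ P, mu))          # verifies (10.8)
print(np.sum(mu[:, None] * pi * r))     # average reward r(pi)
```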

Differential return:

$$G_t=R_{t+1}-r(\pi)+R_{t+2}-r(\pi)+R_{t+3}-r(\pi)+\dots \tag{10.9}$$
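A tiny numeric illustration of why each reward in (10.9) is centred by $r(\pi)$, using a made-up reward stream whose long-run average is 0.5 (all values here are assumptions for illustration):

```python
# Hypothetical reward stream alternating 1, 0, so its average reward r(pi) is 0.5.
rewards = [1, 0] * 50
r_pi = 0.5

plain_sum = sum(rewards)                       # undiscounted return: grows with the horizon
diff_sum = sum(x - r_pi for x in rewards)      # partial sums of (10.9): stay bounded

print(plain_sum)   # 50, and it keeps growing as more rewards arrive
print(diff_sum)    # 0.0, centring by r(pi) keeps the differential return finite
```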

Bellman equations:

In the differential setting, the Bellman equations drop all $\gamma$s and replace each reward with the difference between the reward and the true average reward:

$$v_{\pi}(s)=\sum_a \pi(a|s)\sum_{r,s'}p(s',r|s,a)\big[r-r(\pi)+v_{\pi}(s')\big]$$

$$q_{\pi}(s,a)=\sum_{r,s'}p(s',r|s,a)\Big[r-r(\pi)+\sum_{a'}\pi(a'|s')\,q_{\pi}(s',a')\Big]$$

$$v_*(s)=\max_a\sum_{r,s'}p(s',r|s,a)\big[r-\max_{\pi}r(\pi)+v_*(s')\big]$$

$$q_*(s,a)=\sum_{r,s'}p(s',r|s,a)\big[r-\max_{\pi}r(\pi)+\max_{a'}q_*(s',a')\big]$$

Differential TD errors:

$$\delta_t\doteq R_{t+1}-\bar{R}_t+\hat{v}(S_{t+1},\mathbf{w}_t)-\hat{v}(S_t,\mathbf{w}_t) \tag{10.10}$$

$$\delta_t\doteq R_{t+1}-\bar{R}_t+\hat{q}(S_{t+1},A_{t+1},\mathbf{w}_t)-\hat{q}(S_t,A_t,\mathbf{w}_t) \tag{10.11}$$

where $\bar{R}_t$ is an estimate at time $t$ of the average reward $r(\pi)$.

Gradient update with the differential return / differential TD error:

$$\mathbf{w}_{t+1}=\mathbf{w}_{t}+\alpha\,\delta_t\,\nabla \hat{q}(S_t,A_t,\mathbf{w}_t) \tag{10.12}$$
Many of the previous algorithms and theoretical results carry over to this new setting without change.
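As a sketch of how update (10.12) fits into a learning loop, here is a minimal differential semi-gradient (one-step) Sarsa with linear features. The environment interface, the feature function `features(s, a)`, and the step size `beta` for the average-reward estimate are assumptions made for illustration:

```python
import numpy as np

def differential_sarsa(env, features, actions, num_features,
                       alpha=0.1, beta=0.01, epsilon=0.1, steps=100000):
    """Sketch of differential semi-gradient (one-step) Sarsa with a linear q-hat."""
    w = np.zeros(num_features)
    r_bar = 0.0                                   # estimate of the average reward
    q_hat = lambda s, a: np.dot(w, features(s, a))

    def eps_greedy(s):
        if np.random.rand() < epsilon:
            return actions[np.random.randint(len(actions))]
        return max(actions, key=lambda a: q_hat(s, a))

    s = env.reset()                               # assumed environment interface
    a = eps_greedy(s)
    for _ in range(steps):
        s_next, r, _ = env.step(a)                # continuing task: no terminal state
        a_next = eps_greedy(s_next)
        delta = r - r_bar + q_hat(s_next, a_next) - q_hat(s, a)   # TD error (10.11)
        r_bar += beta * delta                     # update the average-reward estimate
        w += alpha * delta * features(s, a)       # semi-gradient update (10.12)
        s, a = s_next, a_next
    return w, r_bar
```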

Convergence

For methods that learn action values, we currently seem to be without a local improvement guarantee.

3.2 Differential Semi-gradient n-step Sarsa

  • Differential n-step return:
    $$G_{t:t+n}=R_{t+1}-\bar{R}_{t+n-1}+\dots+R_{t+n}-\bar{R}_{t+n-1}+\hat{q}(S_{t+n},A_{t+n},\mathbf{w}_{t+n-1}) \tag{10.14}$$
  • n-step TD error:
    $$\delta_t=G_{t:t+n}-\hat{q}(S_{t},A_t,\mathbf{w}_{t+n-1}) \tag{10.15}$$
  • Differential semi-gradient n-step Sarsa for estimating $\hat{q}\approx q_{\pi}$ or $q_*$
    TODO upload figure at page 277 (a Python sketch of the update follows this list)
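Until that figure is in place, here is a minimal sketch of the inner update of differential semi-gradient n-step Sarsa, combining (10.14), (10.15), and (10.12) for a linear q-hat. The buffers `S`, `A`, `R` and the feature function `features(s, a)` are hypothetical, introduced only for illustration:

```python
import numpy as np

def differential_n_step_update(w, r_bar, features, S, A, R, tau, n, alpha, beta):
    """One update at time tau for differential semi-gradient n-step Sarsa.

    S, A, R are hypothetical buffers: S[t] and A[t] are the state and action at
    time t, and R[t] is the reward received on the transition into S[t].
    Returns the updated weights and average-reward estimate.
    """
    q_hat = lambda t: np.dot(w, features(S[t], A[t]))

    # Differential n-step return (10.14): each reward is centred by the current
    # average-reward estimate, bootstrapping from q_hat at time tau + n.
    G = sum(R[i] - r_bar for i in range(tau + 1, tau + n + 1)) + q_hat(tau + n)

    delta = G - q_hat(tau)                              # n-step TD error (10.15)
    r_bar = r_bar + beta * delta                        # update average-reward estimate
    w = w + alpha * delta * features(S[tau], A[tau])    # semi-gradient update (10.12)
    return w, r_bar
```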