Self-Study Notes on "Introduction to Stochastic Dynamic Programming", Chapter 2

Discounted Dynamic Programming

1. Introduction

  • Infinite horizon with discount factor $0<\alpha<1$
  • Countable state space $S$
  • Finite action space $A$
  • Bounded rewards: $|R(i,a)|<B$ for all $i,a$
  • Stationary policy: (1) deterministic (non-randomized); (2) the action chosen at time $t$ depends only on the current state, not on the time
  • Because both the transition probabilities and a stationary policy depend only on the current state and not on the history, the process is called a Markov decision process (MDP)
  • Total expected discounted return:
    $$V_\pi(i)=\mathbb{E}\left[\sum_{n=0}^{\infty}\alpha^n R(X_n,a_n)\,\middle|\,X_0=i\right]$$
    Since rewards are bounded, $|V_\pi(i)|<B/(1-\alpha)$.

2. The Optimality Equation and Optimal Policies


$$V(i)=\sup_\pi V_\pi(i).$$
A policy $\pi^*$ is called $\alpha$-optimal if
$$V_{\pi^*}(i)=V(i)\quad\text{for all } i\ge 0.$$

Theorem (optimality equation)

$$V(i)=\max_{a\in A}\left[R(i,a)+\alpha\sum_j P_{ij}(a)V(j)\right],\quad i\ge 0$$

  • By contrast, the policy-specific equation $V_g(i)=R(i,g(i))+\alpha\sum_j P_{ij}(g(i))V_g(j)$ for a fixed stationary policy $g$ holds naturally (condition on the first transition) and needs no proof.

Theorem (optimality equation and optimal policy)

Let $f$ be the stationary policy that, in state $i$, selects the action maximizing the right-hand side of the optimality equation, i.e.,
$$R(i,f(i))+\alpha\sum_j P_{ij}(f(i))V(j)=\max_a\left[R(i,a)+\alpha\sum_j P_{ij}(a)V(j)\right],\quad\forall i\ge 0.$$
Then $f$ is an $\alpha$-optimal policy, i.e.,
$$V_f(i)=V(i)\quad\text{for all } i\ge 0.$$

  • This theorem tells us that once the optimality equation is solved, an optimal policy can be read off from it.

Theorem (uniqueness)

$V$ is the unique bounded solution of the optimality equation.

  • Proposition: for any stationary policy $g$, $V_g$ is the unique solution of
    $$V_g(i)=R(i,g(i))+\alpha\sum_j P_{ij}(g(i))V_g(j).$$
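For a finite state space, the proposition above makes policy evaluation a single linear solve, $(I-\alpha P_g)V_g=R_g$. A minimal sketch in Python/NumPy, with a made-up 3-state example (the reward vector `R_g` and transition matrix `P_g` are hypothetical, not from the book):

```python
import numpy as np

alpha = 0.9  # discount factor, 0 < alpha < 1

# Hypothetical data for a fixed stationary policy g:
# R_g[i] = R(i, g(i)),  P_g[i, j] = P_ij(g(i)).
R_g = np.array([1.0, 0.5, -0.2])
P_g = np.array([[0.7, 0.2, 0.1],
                [0.1, 0.8, 0.1],
                [0.2, 0.3, 0.5]])

# V_g is the unique solution of V_g = R_g + alpha * P_g V_g.
V_g = np.linalg.solve(np.eye(3) - alpha * P_g, R_g)
print(V_g)  # expected discounted return from each start state
```

Invertibility of $I-\alpha P_g$ is guaranteed since $\alpha<1$ and the spectral radius of a stochastic matrix is 1.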

3. Computing Optimal Policies

3.1 Value Iteration

  • The most commonly used iterative algorithm for MDPs
  • Idea: approximate the infinite-horizon value function by finite-horizon value functions
  • Steps (a runnable sketch follows the example below):
    • Let $V_0(i)$ be an arbitrary bounded function (automatic when the state space is finite)
    • Compute $V_n(i)=\max_a\left\{R(i,a)+\alpha\sum_j P_{ij}(a)V_{n-1}(j)\right\}$, where $R(i,a)$ is uniformly bounded and $0<\alpha<1$
    • Stopping criterion: $\|V_n-V_{n-1}\|_\infty<\delta$
  • Propositions:
    • If $V_0\equiv 0$, then $\|V-V_n\|_\infty\le\frac{\alpha^{n+1}B}{1-\alpha}$
    • For any bounded $V_0$, $V_n(i)$ converges to $V(i)$ uniformly in $i$ as $n\to\infty$
  • When proving these propositions, mind the difference between the finite-horizon and infinite-horizon problems: the boundary condition matters
  • Example: machine replacement model (see the sketch after this list)
    • The machine's state is $i$, and the action is whether to replace it. Replacement costs $R$ and moves the machine to state 0 (new) at the next epoch; otherwise the state transitions to $j$ with probability $P_{ij}$. In state $i$ the operating cost is $c(i)$, increasing in $i$. The objective is to minimize the total expected discounted cost over an infinite horizon:
      $$V(i)=\min\left\{R+\alpha V(0),\ \alpha\sum_j P_{ij}V(j)\right\}+c(i)$$
    • To study monotonicity of the value function in $i$, we add a condition on $P_{ij}$: for every $k$, $\sum_{j=k}^\infty P_{ij}$ is increasing in $i$. Letting $T_i$ be a random variable representing the next state from state $i$, this condition says $T_{i+1}\ge_{st}T_i$.
    • Stochastic order relations:
      • Definition: a random variable $X$ is stochastically larger than a random variable $Y$ if $P(X\ge a)\ge P(Y\ge a)$ for every $a$.
      • Lemma: (a) if $X\ge_{st}Y$, then $E[X]\ge E[Y]$; (b) $X\ge_{st}Y$ if and only if $E[f(X)]\ge E[f(Y)]$ for every increasing function $f$.
    • Consider the $n$-stage problem
      $$V_n(i)=\min\left\{R+\alpha V_{n-1}(0),\ \alpha\sum_j P_{ij}V_{n-1}(j)\right\}+c(i),\qquad V_0(i)=c(i).$$
      We show by induction that $V_n(i)$ is increasing in $i$. Clearly $V_0(i)$ is increasing. Suppose $V_{n-1}(i)$ is increasing; then, since $T_{i+1}\ge_{st}T_i$, part (b) of the lemma gives that $\sum_j P_{ij}V_{n-1}(j)=E[V_{n-1}(T_i)]$ is increasing in $i$.
    • Optimal policy: since $V(i)-c(i)$ has the form $\min\{A,B_i\}$, where $A=R+\alpha V(0)$ is a constant and $B_i=\alpha\sum_j P_{ij}V(j)$ is increasing in $i$, the optimal policy is a threshold policy: keep the machine when $i<\bar i$, replace it when $i\ge\bar i$.
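A minimal value-iteration sketch for the replacement model, with made-up numbers for the replacement cost, the operating costs $c(i)$, and the wear-out transitions $P_{ij}$ (all hypothetical); it also recovers the threshold structure of the optimal policy:

```python
import numpy as np

alpha, R_cost, delta = 0.9, 6.0, 1e-8
n = 5                                       # states 0 (new) .. n-1 (worst)
c = np.arange(1.0, n + 1)                   # operating cost c(i), increasing in i

# Hypothetical wear-out transitions: stay with prob. 0.4, degrade one step with 0.6;
# the worst state is absorbing. Rows are stochastically increasing in i.
P = np.zeros((n, n))
for i in range(n - 1):
    P[i, i], P[i, i + 1] = 0.4, 0.6
P[n - 1, n - 1] = 1.0

V = c.copy()                                # V_0(i) = c(i)
while True:
    keep = alpha * P @ V                    # continue with the current machine
    replace = R_cost + alpha * V[0]         # pay R, restart from state 0
    V_new = np.minimum(replace, keep) + c
    if np.max(np.abs(V_new - V)) < delta:   # stopping criterion ||V_n - V_{n-1}|| < delta
        V = V_new
        break
    V = V_new

# Read off the (threshold) policy from the converged value function.
keep = alpha * P @ V
replace = R_cost + alpha * V[0]
policy = np.where(replace < keep, "replace", "keep")
print(V)
print(policy)                               # "keep" below the threshold state, "replace" above
```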

3.2 Policy Iteration

  • Proposition: let $g$ be a stationary policy with expected return $V_g$, and let $h$ be a policy satisfying
    $$R(i,h(i))+\alpha\sum_j P_{ij}(h(i))V_g(j)=\max_a\left\{R(i,a)+\alpha\sum_j P_{ij}(a)V_g(j)\right\}.$$
    Then
    $$V_h(i)\ge V_g(i),\quad\forall i.$$
    If $V_h(i)=V_g(i)$ for all $i$, then $V_g=V_h=V$.
  • Steps (for a finite state space; a runnable sketch follows this list):
    • Choose any stationary policy $g$
    • Compute $V_g$ from the system
      $$V_g(i)=R(i,g(i))+\alpha\sum_j P_{ij}(g(i))V_g(j),\quad i=1,\dots,n$$
      (the solution is unique)
    • Obtain the improved policy $h$ as defined in the proposition above
    • Repeat steps two and three
    • If the state space is finite, the policy space is also finite, so we reach an optimal policy in finitely many steps
  • The idea behind policy iteration is close to that of reinforcement learning
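A minimal policy-iteration sketch for a finite MDP, alternating the evaluation and improvement steps above; the random rewards `R[i, a]` and transitions `P[i, a, j]` are made up purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
alpha, S, A = 0.9, 4, 2

# Hypothetical MDP: rewards R[i, a] and transitions P[i, a, j].
R = rng.uniform(0, 1, (S, A))
P = rng.uniform(0, 1, (S, A, S))
P /= P.sum(axis=2, keepdims=True)           # normalize rows to probabilities

g = np.zeros(S, dtype=int)                  # start from an arbitrary stationary policy
while True:
    # Policy evaluation: solve V_g = R_g + alpha * P_g V_g (unique solution).
    R_g = R[np.arange(S), g]
    P_g = P[np.arange(S), g]
    V_g = np.linalg.solve(np.eye(S) - alpha * P_g, R_g)

    # Policy improvement: h(i) maximizes R(i,a) + alpha * sum_j P_ij(a) V_g(j).
    Q = R + alpha * P @ V_g                 # Q[i, a]
    h = Q.argmax(axis=1)
    if np.array_equal(h, g):                # no strict improvement: g is optimal
        break
    g = h

print(g, V_g)
```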

3.3 Linear Programming

  • Proposition: if $u$ is a bounded function satisfying
    $$u(i)\ge\max_a\left\{R(i,a)+\alpha\sum_j P_{ij}(a)u(j)\right\},\quad i\ge 0,$$
    then
    $$u(i)\ge V(i),\quad\forall i.$$
    This proposition says that $V$ is the smallest function satisfying the inequality above.
  • Let $0<\beta<1$; then $V$ is the unique solution of the optimization problem (a sketch for the finite case follows this list)
    $$\begin{array}{ll}\min_u & \sum_{i=0}^\infty \beta^i u(i)\\ \text{s.t.} & u(i)\ge R(i,a)+\alpha\sum_j P_{ij}(a)u(j)\quad\forall i,a\end{array}$$
    Notes: 1. the weights $\beta^i$ make the objective well defined (the sum converges for bounded $u$); 2. the search over $u$ can be restricted to a parametric class, e.g. quadratic functions, turning the problem into an optimization over a few parameters; 3. there may be far too many constraints (curse of dimensionality); one remedy is to first find a reasonable policy, run simulations to generate sample paths, and use them to restrict the set of states $i$ considered.
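For a finite state space this is an ordinary LP, and off-the-shelf solvers apply. A minimal sketch using scipy.optimize.linprog, with a made-up MDP (hypothetical `R`, `P`); each state-action pair contributes one inequality, rewritten as $(\alpha P_{i\cdot}(a)-e_i)\,u\le -R(i,a)$ to match the $A_{ub}x\le b_{ub}$ form linprog expects:

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(1)
alpha, beta, S, A = 0.9, 0.5, 4, 2

# Hypothetical finite MDP.
R = rng.uniform(0, 1, (S, A))
P = rng.uniform(0, 1, (S, A, S))
P /= P.sum(axis=2, keepdims=True)

c = beta ** np.arange(S)                    # objective weights beta^i

# One row per (i, a):  (alpha * P[i, a, :] - e_i) @ u <= -R(i, a).
I = np.eye(S)
A_ub = np.vstack([alpha * P[i, a] - I[i] for i in range(S) for a in range(A)])
b_ub = np.array([-R[i, a] for i in range(S) for a in range(A)])

res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=(None, None))  # u may take any sign
V = res.x                                   # the optimal value function
print(V)
```

In the finite case any strictly positive weights would do in place of $\beta^i$; the geometric weights mirror the infinite-state formulation.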

4. Extension: Unbounded Rewards

  • We no longer require $R(i,a)$ to be uniformly bounded; instead, we require that for every policy $\pi$,
    $$\left|E_\pi\left[R(X_{n-1},a_{n-1})\mid X_0=i\right]\right|\le B_i n^k,$$
    where $B_i$ and $k$ are constants. It follows that
    $$\left|E_\pi\left[\sum_{n=0}^\infty\alpha^n R(X_n,a_n)\,\middle|\,X_0=i\right]\right|\le B_i\sum_{n=0}^\infty\alpha^n(n+1)^k<\infty,$$
    where the series converges by the ratio test, since the ratio of successive terms tends to $\alpha<1$ (a numerical check follows this list).
  • The optimality-equation result still holds, but the optimal-policy and uniqueness results may no longer hold.
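A quick numerical sanity check that $\sum_n \alpha^n(n+1)^k$ is finite; the values $\alpha=0.9$ and $k=3$ are arbitrary:

```python
import numpy as np

alpha, k = 0.9, 3
n = np.arange(2000)
partial = np.cumsum(alpha ** n * (n + 1.0) ** k)
print(partial[[99, 499, 1999]])  # partial sums stabilize: the series converges
```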