[Mathematical Foundations of Reinforcement Learning] Course Notes (3): Bellman Optimality Equation

csu一言

Last modified 2023-03-13 18:35:40

Tags: artificial intelligence, machine learning

First published 2023-03-13 18:26:31

Article link: https://blog.csdn.net/baidu_40880350/article/details/129501557

1. Definition of the optimal policy

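A policy is evaluated by its state values. If two policies $\pi_1, \pi_2$ satisfy the following inequality, then $\pi_1$ is considered better than $\pi_2$:
$v_{\pi_1}(s) \ge v_{\pi_2}(s) \quad \text{for all } s \in S$
Definition of the optimal policy: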

A policy $\pi^*$ is optimal if $v_{\pi^{*}}(s) \ge v_{\pi}(s)$ for all $s \in S$ and for any other policy $\pi$.
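
To make the comparison concrete, here is a minimal sketch that evaluates two policies on a made-up 2-state, 2-action MDP and checks the inequality state by state. The MDP (`P`, `R`), the value of `gamma`, and the helper names `state_values` and `is_at_least_as_good` are illustrative assumptions, not part of the course material. The state values are obtained by solving the Bellman equation $v_\pi = r_\pi + \gamma P_\pi v_\pi$ as a linear system.

```python
import numpy as np

def state_values(P, R, policy, gamma):
    """Solve v = r_pi + gamma * P_pi v for a (possibly stochastic) policy.

    P[s, a, t]   : probability of moving from state s to state t under action a
    R[s, a]      : expected immediate reward
    policy[s, a] : probability of taking action a in state s
    """
    r_pi = np.sum(policy * R, axis=1)            # r_pi[s] = sum_a pi(a|s) R[s, a]
    P_pi = np.einsum("sa,sat->st", policy, P)    # P_pi[s, t] = sum_a pi(a|s) P[s, a, t]
    n = P.shape[0]
    return np.linalg.solve(np.eye(n) - gamma * P_pi, r_pi)

def is_at_least_as_good(v1, v2, tol=1e-12):
    """True if v1(s) >= v2(s) for every state s."""
    return bool(np.all(v1 >= v2 - tol))

# Hypothetical MDP: in both states, action 0 leads to state 0 and
# action 1 leads to state 1; taking action 1 always pays reward 1.
P = np.array([
    [[1.0, 0.0], [0.0, 1.0]],
    [[1.0, 0.0], [0.0, 1.0]],
])
R = np.array([
    [0.0, 1.0],
    [0.0, 1.0],
])
gamma = 0.9

pi_1 = np.array([[0.0, 1.0], [0.0, 1.0]])   # always take action 1
pi_2 = np.array([[0.5, 0.5], [0.5, 0.5]])   # choose uniformly at random

v1 = state_values(P, R, pi_1, gamma)
v2 = state_values(P, R, pi_2, gamma)
print("v_pi1 =", v1)
print("v_pi2 =", v2)
print("pi_1 at least as good as pi_2:", is_at_least_as_good(v1, v2))
```

In this toy example `pi_1` dominates `pi_2` in every state, which is exactly the sense in which one policy is "better" than another.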

Four questions: does an optimal policy exist; is it unique; is it stochastic or deterministic; and how can it be obtained?

2. Bellman optimality equation (BOE)

Elementwise form
$\begin{aligned} v(s) & = \max\limits_{\pi}\sum\limits_{a} \pi(a|s) \left( \sum\limits_{r} p(r|s, a)r + \gamma \sum\limits_{s'} p(s'|s,a) v(s') \right) \\ & = \max\limits_{\pi}\sum\limits_{a} \pi(a|s)q(s,a), \quad \forall s \in S. \end{aligned}$
- Known: $p(r|s,a)$, $p(s'|s,a)$. Unknown, to be solved for: $v(s)$, $v(s')$
- The Bellman equation depends on a given $\pi$, whereas in the Bellman optimality equation $\pi$ is not given and must be solved for as well
Matrix-vector form
$\begin{aligned} v = \max\limits_{\pi}(r_{\pi}+\gamma P_{\pi}v) \end{aligned}$
- Open questions: how to solve it, whether a solution exists, whether the solution is unique, and how it relates to the optimal policy

3. Rewrite Equation

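Since $\sum_a \pi(a|s) = 1$ and $\pi(a|s) \ge 0$, the sum $\sum_a \pi(a|s)q(s,a)$ is a weighted average of the q-values, so it is maximized by putting all the probability on the action with the largest q-value:
$\max\limits_{\pi}\sum\limits_{a} \pi(a|s)q(s,a)= \max\limits_{a\in \mathcal{A}(s)}q(s,a)$
The maximum is attained by
$a^* = \arg\max\limits_{a}q(s,a), \quad \pi(a|s)= \left\{ \begin{array}{ll} 1 & a=a^* \\ 0 & a \neq a^* \end{array} \right.$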
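
A quick numerical check of this identity, as a sketch only: the q-values below are made up, and the stochastic policies are sampled at random with NumPy's Dirichlet sampler. No sampled policy should score higher than the greedy one-hot policy.

```python
import numpy as np

rng = np.random.default_rng(0)
q = np.array([1.0, 3.0, 2.0])          # hypothetical q(s, a) for a single state s

# Sample many stochastic policies pi(.|s) from the probability simplex.
policies = rng.dirichlet(np.ones(len(q)), size=10_000)
weighted = policies @ q                 # sum_a pi(a|s) q(s, a) for each sample

# Greedy (deterministic) policy: all probability on a* = argmax_a q(s, a).
a_star = int(np.argmax(q))
greedy = np.zeros(len(q))
greedy[a_star] = 1.0

best_sampled = weighted.max()
greedy_value = greedy @ q

print("best value among sampled policies:", best_sampled)
print("value of the greedy policy:       ", greedy_value)
```

Every sampled value is a convex combination of the entries of `q`, so it cannot exceed `q.max()`, and the greedy policy reaches that bound.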

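Let
$f(v) := \max\limits_{\pi}(r_\pi + \gamma P_\pi v)$
Then the Bellman optimality equation becomes (for a fixed $v$, the maximization over $\pi$ is solved first)
$v = f (v)$
where
$[f(v)]_s = \max\limits_{\pi} \sum\limits_{a} \pi(a|s)q(s,a) = \max\limits_{a} q(s,a), \quad s \in S$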
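
As a sketch of how $f$ could be computed in practice, the snippet below assumes the model is stored as a NumPy array `P[s, a, s']` of transition probabilities and an array `R[s, a]` of expected immediate rewards (that is, $\sum_r p(r|s,a)r$). These array layouts, the function name `bellman_optimality_operator`, and the toy MDP are assumptions made for illustration.

```python
import numpy as np

def bellman_optimality_operator(v, P, R, gamma):
    """Return f(v) and the greedy action in each state.

    q[s, a] = R[s, a] + gamma * sum_{s'} P[s, a, s'] * v[s']
    [f(v)]_s = max_a q[s, a]
    """
    q = R + gamma * P @ v                  # shape (n_states, n_actions)
    return q.max(axis=1), q.argmax(axis=1)

# Hypothetical 2-state MDP: action 0 leads to state 0, action 1 leads to
# state 1, and taking action 1 always pays reward 1.
P = np.array([
    [[1.0, 0.0], [0.0, 1.0]],
    [[1.0, 0.0], [0.0, 1.0]],
])
R = np.array([
    [0.0, 1.0],
    [0.0, 1.0],
])
gamma = 0.9

fv_zero, greedy_zero = bellman_optimality_operator(np.zeros(2), P, R, gamma)
fv_fixed, greedy_fixed = bellman_optimality_operator(np.array([10.0, 10.0]), P, R, gamma)
print("f(0)        =", fv_zero)
print("f([10, 10]) =", fv_fixed)
```

In this toy MDP the vector `[10, 10]` is mapped to itself, so it is a fixed point of `f` and therefore solves $v = f(v)$.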

4. Contraction Mapping Theorem

For any equation that has the form $x = f (x)$, if $f$ is a contraction mapping (that is, $\|f(x_1) - f(x_2)\| \le \gamma \|x_1 - x_2\|$ for some $\gamma \in (0,1)$ and all $x_1, x_2$), then

Existence: there exists a fixed point $x^*$ satisfying $f(x^*)=x^*$.
Uniqueness: the fixed point $x^*$ is unique.
Algorithm: consider a sequence $\{x_k\}$ where $x_{k+1}=f(x_k)$; then $x_k \rightarrow x^*$ as $k\rightarrow\infty$. Moreover, the convergence is exponentially fast.
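
The iterative algorithm can be sketched on a one-dimensional toy problem. The map $f(x) = 0.5x + 1$ is chosen here purely for illustration (it is not from the course): it is a contraction with factor 0.5 and has the fixed point $x^* = 2$. The helper name `fixed_point_iteration` is likewise hypothetical.

```python
def fixed_point_iteration(f, x0, tol=1e-10, max_iter=1000):
    """Iterate x_{k+1} = f(x_k) until successive iterates differ by less than tol."""
    x = x0
    history = [x]
    for _ in range(max_iter):
        x_next = f(x)
        history.append(x_next)
        if abs(x_next - x) < tol:
            break
        x = x_next
    return history[-1], history

def f(x):
    # Illustrative contraction: |f(x1) - f(x2)| = 0.5 * |x1 - x2|
    return 0.5 * x + 1.0

x_star, history = fixed_point_iteration(f, x0=10.0)
errors = [abs(x - 2.0) for x in history]

print("approximate fixed point:", x_star)
print("first errors:", errors[:5])
```

The error shrinks by the contraction factor at every step, which is what "exponentially fast" means here.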

5. Solution

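$f(v)=\max\limits_{\pi}(r_\pi + \gamma P_\pi v)$

$f(v)$ is a contraction mapping: $\|f(v_1) - f(v_2)\|_\infty \le \gamma \|v_1 - v_2\|_\infty$ for any $v_1, v_2$, where $\gamma \in (0,1)$ is the discount rate.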

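From the Contraction Mapping Theorem it follows that:

The Bellman optimality equation has a solution $v^*$, and this solution is unique.
For any initial guess $v_0$, the sequence $v_{k+1} = f(v_k)$ converges to $v^*$ exponentially fast, and the convergence rate is determined by $\gamma$.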
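
The two conclusions can be checked numerically with a short sketch. The MDP below is randomly generated (transition probabilities from a Dirichlet sampler, rewards from a normal distribution), so all of its numbers are illustrative assumptions. The function names `f` and `value_iteration` are chosen for this example.

```python
import numpy as np

def f(v, P, R, gamma):
    """Bellman optimality operator: [f(v)]_s = max_a (R[s,a] + gamma * sum_s' P[s,a,s'] v[s'])."""
    return (R + gamma * P @ v).max(axis=1)

def value_iteration(P, R, gamma, v0, tol=1e-10, max_iter=10_000):
    """Iterate v_{k+1} = f(v_k) and keep every iterate."""
    v = np.asarray(v0, dtype=float)
    trajectory = [v]
    for _ in range(max_iter):
        v_next = f(v, P, R, gamma)
        trajectory.append(v_next)
        if np.max(np.abs(v_next - v)) < tol:
            break
        v = v_next
    return trajectory[-1], trajectory

# Hypothetical random MDP with 4 states and 3 actions.
rng = np.random.default_rng(0)
n_states, n_actions, gamma = 4, 3, 0.9
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # shape (S, A, S)
R = rng.normal(size=(n_states, n_actions))

# Two very different starting points should reach the same v*.
v_star_a, traj_a = value_iteration(P, R, gamma, np.zeros(n_states))
v_star_b, traj_b = value_iteration(P, R, gamma, 100.0 * np.ones(n_states))

# Max-norm distance to v* along the second run.
errors = [np.max(np.abs(v - v_star_a)) for v in traj_b]
print("v* from v0 = 0:  ", v_star_a)
print("v* from v0 = 100:", v_star_b)
print("first error ratios:", [errors[k + 1] / errors[k] for k in range(5)])
```

Both runs should end at the same vector, and each error ratio should stay at or below `gamma`, consistent with the contraction property.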

6. Analyzing optimal policies

$v(s) = \max\limits_{\pi}\sum\limits_{a} \pi(a|s) \left( \sum\limits_{r} p(r|s, a)r + \gamma \sum\limits_{s'} p(s'|s,a) v(s') \right), \quad \forall s \in S.$

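Factors that determine the optimal policy:
- Reward design: $r$ (the affine transformation $r \rightarrow ar+b$ with $a > 0$ does not change the optimal policy; what matters is not the absolute reward values but their relative values)
- System model: $p(r|s,a)$, $p(s'|s,a)$
- Discount rate: $\gamma$ (the smaller $\gamma$ is, the more short-sighted the agent becomes, focusing on immediate rewards)
Because $\gamma < 1$ discounts rewards that arrive later, the discount rate is what drives the agent to take the shortest path.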
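
Two of these factors can be illustrated with a small sketch. The 2-state MDP below is made up for this purpose: in state 0 the agent can stay and collect reward 1 right away, or move to state 1 for reward 0 and then collect reward 2 per step by staying there. The helper `solve_boe` (plain value iteration) and all numbers are assumptions for illustration only.

```python
import numpy as np

def solve_boe(P, R, gamma, tol=1e-10, max_iter=100_000):
    """Solve v = f(v) by iterating v_{k+1} = f(v_k); return v* and a greedy policy."""
    v = np.zeros(P.shape[0])
    for _ in range(max_iter):
        v_next = (R + gamma * P @ v).max(axis=1)
        converged = np.max(np.abs(v_next - v)) < tol
        v = v_next
        if converged:
            break
    q = R + gamma * P @ v
    return v, q.argmax(axis=1)

# Hypothetical MDP: action 0 = stay, action 1 = move to the other state.
P = np.array([
    [[1.0, 0.0], [0.0, 1.0]],   # state 0: stay in 0 / move to 1
    [[0.0, 1.0], [1.0, 0.0]],   # state 1: stay in 1 / move to 0
])
R = np.array([
    [1.0, 0.0],   # state 0: staying pays 1, moving pays 0
    [2.0, 0.0],   # state 1: staying pays 2, moving pays 0
])

# Discount rate: a small gamma makes the agent short-sighted.
_, policy_myopic = solve_boe(P, R, gamma=0.1)
v_far, policy_far = solve_boe(P, R, gamma=0.9)
print("gamma = 0.1:", policy_myopic)   # state 0 grabs the immediate reward
print("gamma = 0.9:", policy_far)      # state 0 moves toward the larger long-run reward

# Reward design: r -> a*r + b with a > 0 leaves the optimal policy unchanged.
a, b, gamma = 2.0, 5.0, 0.9
v_affine, policy_affine = solve_boe(P, a * R + b, gamma)
print("after r -> 2r + 5:", policy_affine)
print("v' - (a*v + b/(1-gamma)):", v_affine - (a * v_far + b / (1 - gamma)))
```

In this example the optimal values under the transformed reward equal $a v^* + \frac{b}{1-\gamma}$ in every state, so the ordering of actions, and hence the greedy policy, is unaffected.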