Reinforcement Learning and Control as Probabilistic Inference: Tutorial and Review

This post summarizes how, in both deterministic and stochastic environments, the optimal policy can be obtained from a posterior probability. It introduces the trajectory probability and the optimality variable $O$, and derives the probability of a trajectory under the optimal policy. Backward messages and value functions are used to compute these probabilities iteratively. In the deterministic case, maximum entropy RL is equivalent to the posterior-decision problem; in the stochastic case the objective additionally involves the dynamics. Finally, variational inference is used to show that maximum entropy RL maximizes a lower bound on the probability of optimality.
  • Basic concepts

    • Trajectory probability (the probability that a trajectory $\tau$ occurs)

    $$p(\tau) = p(s_1)\prod_t p(a_t|s_t)\,p(s_{t+1}|s_t,a_t)$$

    • Optimality variable $O$ (the probability that action $a$ is chosen because it is optimal, rather than by accident)

    $$p(O_t|s_t,a_t) = \exp(r(s_t,a_t))$$

    Issue 1: if $r$ can be positive, this "probability" can exceed one. A fix is to subtract the maximum reward from $r$.

    Issue 2: does the absolute scale of $r$ matter? Only the relative values of $r$ affect the result, because we only ever work with the conditional $p(\cdot|O)$. (The sketch at the end of this section illustrates both points.)

    • Backward message (the probability that, from this step onward, the rest of the trajectory is optimal)

    $$\beta(s_t,a_t) = p(O_{t:T}|s_t,a_t)$$
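
A minimal tabular sketch of these definitions, assuming a hypothetical toy MDP with `N_S` states, `N_A` actions, and horizon `T` (all names and numbers below are illustrative, not from the paper). It builds the optimality likelihood $\exp(r)$, shifted by the maximum reward so it stays a valid probability (Issue 1; the shift only rescales $p(O_{1:T})$, so the posterior is unchanged, which is Issue 2), and the trajectory probability $p(\tau)$ for one sampled trajectory.

```python
import numpy as np

# Hypothetical toy MDP (illustrative only): N_S states, N_A actions, horizon T.
rng = np.random.default_rng(0)
N_S, N_A, T = 4, 3, 5
r = rng.normal(size=(N_S, N_A))                     # reward table r(s, a)
P = rng.dirichlet(np.ones(N_S), size=(N_S, N_A))    # dynamics p(s' | s, a)
pi = np.full((N_S, N_A), 1.0 / N_A)                 # action prior p(a | s), uniform

# Optimality likelihood p(O_t | s_t, a_t) = exp(r - r_max) <= 1; the shift cancels
# when conditioning on O, so only relative rewards matter.
p_O = np.exp(r - r.max())

# Trajectory probability p(tau) = p(s_1) prod_t p(a_t|s_t) p(s_{t+1}|s_t,a_t)
s, log_p_tau = 0, 0.0                               # fix the initial state for simplicity
for t in range(T):
    a = rng.choice(N_A, p=pi[s])
    s_next = rng.choice(N_S, p=P[s, a])
    log_p_tau += np.log(pi[s, a]) + np.log(P[s, a, s_next])
    s = s_next

print(p_O.max(), np.exp(log_p_tau))                 # p(O|s,a) stays <= 1; one trajectory's probability
```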


  • The optimal policy and the optimality variable $O$

    • Probability of any trajectory under the optimal policy

    $$p(\tau|O_{1:T}) \propto p(\tau,O_{1:T}) = p(s_1)\prod_t p(O_t|s_t,a_t)\,p(s_{t+1}|s_t,a_t) = p(s_1)\prod_t p(s_{t+1}|s_t,a_t)\exp\Big(\sum_t r(s_t,a_t)\Big)$$

    In this article only $p(\cdot|O)$ is treated as controllable; every other quantity can be regarded as a constant. The proportionality above holds because the denominator $p(O_{1:T})$ is not controllable and is therefore a constant.

    • Physical meaning: this defines the optimal policy as follows: in a deterministic environment, trajectories with the same cumulative reward are selected with the same probability, and trajectories with lower cumulative reward are selected with lower probability, decaying exponentially (see the snippet below).
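
A tiny numerical illustration of this exponential weighting, with made-up returns for four hypothetical trajectories:

```python
import numpy as np

# Made-up cumulative returns of four trajectories in a deterministic environment.
returns = np.array([5.0, 5.0, 4.0, 2.0])
p_tau = np.exp(returns - returns.max())
p_tau /= p_tau.sum()                      # p(tau | O_{1:T}) ∝ exp(sum_t r_t)
print(p_tau)  # equal returns -> equal probability; lower returns decay exponentially
```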


  • Backward Message

    • Recursion for the backward message

    $$\beta(s_t,a_t) = p(O_{t:T}|s_t,a_t) = \int_{s_{t+1}} p(O_{t:T},s_{t+1}|s_t,a_t)\,ds_{t+1} = \int_{s_{t+1}} p(O_{t+1:T}|s_{t+1})\,p(s_{t+1}|s_t,a_t)\,p(O_t|s_t,a_t)\,ds_{t+1}$$

    The first factor is $\beta(s_{t+1})$, defined on the next line; the second is given by the environment dynamics; the third is proportional to $\exp(r)$.

    • Backward message over states

    $$\beta(s_t) = \int_{a_t} p(a_t|s_t)\,p(O_{t:T}|s_t,a_t)\,da_t = p(O_{t:T}|s_t)$$

    The first factor $p(a_t|s_t)$ is the action prior; it is independent of the optimal policy, so we can treat it as a constant (a uniform distribution). The second factor is $\beta(s_t,a_t)$.

    • Solution (iteration): starting from $T$ and working backwards, alternately compute $\beta(s_t,a_t)$ and $\beta(s_t)$, as in the sketch below.
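
A minimal sketch of this backward pass for a finite-horizon tabular MDP, assuming the same illustrative reward table `r`, dynamics `P`, and uniform action prior as in the earlier snippet (all hypothetical); rewards are shifted so that $p(O_t|s_t,a_t)\le 1$.

```python
import numpy as np

rng = np.random.default_rng(0)
N_S, N_A, T = 4, 3, 5
r = rng.normal(size=(N_S, N_A))
P = rng.dirichlet(np.ones(N_S), size=(N_S, N_A))   # p(s' | s, a)
p_O = np.exp(r - r.max())                          # p(O_t | s_t, a_t)
prior = np.full((N_S, N_A), 1.0 / N_A)             # uniform action prior p(a_t | s_t)

beta_s = np.ones(N_S)                              # beta(s_{T+1}) = 1 (nothing left to satisfy)
beta_sa = np.zeros((T, N_S, N_A))
for t in reversed(range(T)):
    # beta(s_t, a_t) = p(O_t|s_t,a_t) * sum_{s'} p(s'|s_t,a_t) beta(s')
    beta_sa[t] = p_O * (P @ beta_s)
    # beta(s_t) = sum_a p(a_t|s_t) beta(s_t, a_t)
    beta_s = (prior * beta_sa[t]).sum(axis=1)

print(beta_s)          # beta(s_1) = p(O_{1:T} | s_1) for each initial state
```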


  • Posterior probability and the policy

    • The policy

    $$p(a_t|s_t,O_{t:T}) = \frac{p(a_t,s_t|O_{t:T})}{p(s_t|O_{t:T})} = \frac{p(a_t,s_t,O_{t:T})}{p(s_t,O_{t:T})} = \frac{p(O_{t:T}|a_t,s_t)\,p(a_t|s_t)\,p(s_t)}{p(O_{t:T}|s_t)\,p(s_t)} \propto \frac{p(O_{t:T}|a_t,s_t)}{p(O_{t:T}|s_t)} = \frac{\beta(s_t,a_t)}{\beta(s_t)}$$

    This shows that computing $\beta$ is equivalent to computing the policy, and the procedure for computing $\beta$ was given above.

    • Value functions

    $$Q(s_t,a_t) = \log\beta(s_t,a_t) = r(s_t,a_t) + \log E_{s_{t+1}}\big[\exp(V(s_{t+1}))\big] \\ V(s_t) = \log\beta(s_t) = \log E_{a_t}\big[\exp(Q(s_t,a_t))\big]$$

    Looking at the decision rule above, it is easy to see that the structure is the same as in SAC, so we define the relation between the value functions in the SAC style.

    When one $Q$ value is much larger than the others, $V \approx \max Q$, hence the name "softmax" (see the check below).
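
A short numerical check of these relations, using made-up $Q$-values for one state: the posterior policy is proportional to $\beta(s,a)/\beta(s)$, which under a uniform prior is a softmax over $Q$; $V=\log E_a[\exp Q]$; and $V$ approaches $\max_a Q$ when one action dominates.

```python
import numpy as np

# Made-up Q-values for the actions available in one state.
Q = np.array([1.0, 0.5, -1.0])
V = np.log(np.mean(np.exp(Q)))              # V = log E_a[exp(Q)] under a uniform action prior

# Posterior policy: proportional to beta(s,a)/beta(s), i.e. a softmax over Q.
policy = np.exp(Q - Q.max())
policy /= policy.sum()
print(V, policy)

# "Softmax" behaviour of V: when one Q dominates, V is close to max Q
# (the log N_A offset from the uniform prior becomes negligible).
Q_big = np.array([100.0, 0.5, -1.0])
print(np.log(np.mean(np.exp(Q_big))), Q_big.max())   # ~98.9 vs 100.0
```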


  • Optimization objective (connection to maximum entropy)

    • Deterministic environments

    $$p(\tau|O_{1:T}) \propto p(s_1)\prod_t p(s_{t+1}|s_t,a_t)\exp\Big(\sum_t r(s_t,a_t)\Big) = \exp\Big(\sum_t r(s_t,a_t)\Big)$$

    (In a deterministic environment the dynamics factors are indicators equal to 1 for any feasible trajectory, so only the exponentiated return remains.)

    • Variational approximation (approximate $p(\tau|O_{1:T})$ with $\hat p(\tau)$)

    $$\min_{\hat p(\tau)} D_{KL}\big(\hat p(\tau)\,\|\,p(\tau|O_{1:T})\big) = \max_{\hat p(\tau)} E_{s,a\sim\hat p(\tau)}\big[\log p(\tau|O_{1:T}) - \log\hat p(\tau)\big] \\
    = \max_{\hat p(\tau)} E_{s,a\sim\hat p(\tau)}\Big[\log p(s_1) + \sum_{t=1}^T \log p(s_{t+1}|s_t,a_t) + \sum_{t=1}^T r(s_t,a_t) - \log p(s_1) - \sum_{t=1}^T \log p(s_{t+1}|s_t,a_t) - \sum_{t=1}^T \log\hat\pi(a_t|s_t)\Big] \\
    = \max_{\hat p(\tau)} E_{s,a\sim\hat p(\tau)}\Big[\sum_{t=1}^T r(s_t,a_t) - \sum_{t=1}^T \log\hat\pi(a_t|s_t)\Big] \\
    = \max_{\hat p(\tau)} \sum_{t=1}^T E_{s,a\sim\hat p(\tau)}\big[r(s_t,a_t)\big] + \sum_{t=1}^T E_{s\sim\hat p(\tau)}\big[H(\hat\pi(a_t|s_t))\big]$$

    This shows that, in a deterministic environment, maximum entropy RL is equivalent to the posterior-decision problem (verified numerically below).
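
A numerical check of this equivalence on a tiny deterministic MDP (all tables and sizes below are made up for illustration). Because the dynamics are deterministic, every trajectory is an action sequence and $\hat p(\tau)=\prod_t\hat\pi(a_t|s_t)$, so everything can be enumerated: the KL on the left equals a constant ($\log\sum_\tau \exp(R(\tau))$, which is $\log p(O_{1:T})$ up to the uniform action-prior term) minus the maximum-entropy objective, so minimizing one maximizes the other.

```python
import numpy as np
from itertools import product

# Hypothetical deterministic tabular MDP (illustrative only).
rng = np.random.default_rng(0)
T, N_S, N_A = 3, 4, 2
r = rng.normal(size=(N_S, N_A))                 # reward table r(s, a)
next_s = rng.integers(N_S, size=(N_S, N_A))     # deterministic dynamics s' = f(s, a)
logits = rng.normal(size=(N_S, N_A))
pi_hat = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)

# Enumerate every trajectory (every action sequence, since dynamics are deterministic).
rets, log_q = [], []
for actions in product(range(N_A), repeat=T):
    s, R, lq = 0, 0.0, 0.0
    for a in actions:
        R += r[s, a]
        lq += np.log(pi_hat[s, a])
        s = next_s[s, a]
    rets.append(R)
    log_q.append(lq)
rets, log_q = np.array(rets), np.array(log_q)

log_Z = np.logaddexp.reduce(rets)               # log p(O_{1:T}) up to the action-prior constant
log_post = rets - log_Z                         # log p(tau | O_{1:T})
q = np.exp(log_q)                               # hat p(tau)

kl = np.sum(q * (log_q - log_post))             # D_KL(hat p || p(.|O))
maxent = np.sum(q * (rets - log_q))             # E[sum r] + sum E[H(pi_hat)]
print(np.isclose(kl, log_Z - maxent))           # True: minimizing KL == maximizing the max-ent objective
```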


  • Optimization in stochastic environments

    • Stochastic environments

    If $\hat p(\tau)$ is left fully flexible, i.e. $\hat p(\tau) = \hat p(s_1)\prod_t \hat p(s_{t+1}|s_t,a_t)\,\hat\pi(a_t|s_t)$ with a learnable initial-state distribution and learnable dynamics, then

    $$\min_{\hat p(\tau)} D_{KL}\big(\hat p(\tau)\,\|\,p(\tau|O_{1:T})\big) = \max_{\hat p(\tau)} E_{s\sim\hat p(\tau)}\big[\log p(s_1) - \log\hat p(s_1)\big] + \sum_{t=1}^T E_{s,a\sim\hat p(\tau)}\big[r(s_t,a_t) + \log p(s_{t+1}|s_t,a_t) - \log\hat p(s_{t+1}|s_t,a_t)\big] + \sum_{t=1}^T E_{s\sim\hat p(\tau)}\big[H(\hat\pi(a_t|s_t))\big]$$

    The objective contains the dynamics terms, which implicitly means the objective can be improved by changing the dynamics: the $\log p - \log\hat p$ terms act as a KL penalty for deviating from the true dynamics, but deviation is still allowed. This assumption is unreasonable and makes the learned policy risk-seeking (overly optimistic about lucky transitions), as illustrated below.

    Moreover, this objective is hard to optimize directly in the model-free setting.
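
A one-step illustration of this optimism problem (numbers made up): conditioning on optimality tilts the transition distribution toward the lucky next state in proportion to $\exp(V(s'))$, so the agent plans as if rare good outcomes were more likely than they really are.

```python
import numpy as np

# From (s, a) the true dynamics reach a high-value next state with prob 0.1
# and a low-value next state with prob 0.9 (illustrative numbers).
p_true = np.array([0.1, 0.9])             # p(s' | s, a) over [good, bad]
V_next = np.array([1.0, 0.0])             # soft values V(s') of the two next states
post = p_true * np.exp(V_next)            # p(s' | s, a, O_{t+1:T}) ∝ p(s'|s,a) exp(V(s'))
post /= post.sum()
print(post)                               # ≈ [0.23, 0.77]: the lucky outcome looks far likelier
```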

    • Variational approximation (optimize with the dynamics held fixed)

    $$\hat p(\tau) = p(s_1)\prod_t p(s_{t+1}|s_t,a_t)\,\hat\pi(a_t|s_t) \approx p(\tau|O_{1:T})$$

    The factorization above explicitly fixes the dynamics in $\hat p$ so they cannot be changed; we can then obtain a formula analogous to the deterministic case, as follows.

    $$\min_{\hat p(\tau)} D_{KL}\big(\hat p(\tau)\,\|\,p(\tau|O_{1:T})\big) = \max_{\hat p(\tau)} \sum_{t=1}^T E_{s,a\sim\hat p(\tau)}\big[r(s_t,a_t)\big] + \sum_{t=1}^T E_{s\sim\hat p(\tau)}\big[H(\hat\pi(a_t|s_t))\big] \\
    = \max_{\hat p(\tau)} \sum_{t=1}^T -E_{s\sim\hat p(\tau)}\Big[D_{KL}\Big(\hat\pi(a_t|s_t)\,\Big\|\,\frac{\exp(r_t)}{\exp(V_t)}\Big)\Big] + \sum_{t=1}^T E_{s\sim\hat p(\tau)}\big[V_t\big]$$

    It follows that the KL term vanishes when $\hat\pi = \frac{\exp(r_t)}{\exp(V_t)}$, which is then the optimal policy, and the attained value is $\sum_t E[V_t]$ with $V_t = \log\int_A \exp(r(s_t,a_t))\,da_t$; this is exact at the final time step, and earlier time steps replace $r$ with $Q$ via a backward recursion.

    As in the deterministic case, $Q$ and $V$ are then defined in the SAC style (I did not fully understand Equation (14) in the paper). A small check of the per-step identity is given below.
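
A quick numerical check of the per-step identity used above, with made-up rewards for the actions in one state: $E_\pi[r] + H(\pi) = V - D_{KL}(\pi\,\|\,\exp(r-V))$, so the soft objective is maximized by $\pi=\exp(r-V)$ and its maximum equals $V$.

```python
import numpy as np

r = np.array([1.0, 0.0, -2.0])                  # rewards r(s_t, a) for one state (made up)
V = np.log(np.sum(np.exp(r)))                   # V_t = log sum_a exp(r(s_t, a))
pi_star = np.exp(r - V)                         # optimal soft policy exp(r_t) / exp(V_t)

def soft_objective(pi):
    return np.sum(pi * r) - np.sum(pi * np.log(pi))   # E_pi[r] + H(pi)

rng = np.random.default_rng(0)
for _ in range(5):                              # random policies never beat pi_star
    pi = rng.dirichlet(np.ones(len(r)))
    assert soft_objective(pi) <= soft_objective(pi_star) + 1e-9
print(np.isclose(soft_objective(pi_star), V))   # True: the optimum value is exactly V_t
```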

    • Variational inference (overview: approximate the true distribution $p$ with a simple distribution $q$ by first finding a lower bound, the ELBO, on $\log p(x)$ and optimizing it)

    $$\log p(x) \geq E_{z\sim q}\big[\log p(x,z) - \log q(z)\big] = \mathrm{ELBO}$$

    Use $\hat p(\tau) = p(s_1)\prod_t p(s_{t+1}|s_t,a_t)\,\hat\pi(a_t|s_t)$ as the approximating distribution $q$, with latent $z=\tau$ and evidence $x = O_{1:T}$, where the dynamics are constrained to the true dynamics; substituting into the bound above gives the following.

    $$\log p(O_{1:T}) \geq E_{\tau\sim\hat p}\big[\log p(O_{1:T},\tau) - \log\hat p(\tau)\big] = E_{\tau\sim\hat p}\Big[\sum_{t=1}^T r(s_t,a_t) - \log\hat\pi(a_t|s_t)\Big] = \mathrm{ELBO}$$

    This shows that maximum entropy RL is equivalent to maximizing a lower bound on the log-probability of optimality, $\log p(O_{1:T})$.

    The second equality holds because the dynamics terms in $\log p(O_{1:T},\tau)$ and $\log\hat p(\tau)$ cancel. A Monte-Carlo sketch of this objective is given below.
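
Because the variational family uses the true dynamics, the ELBO can be estimated model-free simply by rolling out $\hat\pi$ in the environment. A minimal Monte-Carlo sketch, reusing the illustrative stochastic MDP tables from the earlier snippets (all names and sizes hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)
T, N_S, N_A = 5, 4, 3
r = rng.normal(size=(N_S, N_A))                       # reward table
P = rng.dirichlet(np.ones(N_S), size=(N_S, N_A))      # true dynamics p(s' | s, a)
logits = rng.normal(size=(N_S, N_A))
pi_hat = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)

def elbo_estimate(n_rollouts=20000):
    """Monte-Carlo estimate of E_{tau ~ hat p}[ sum_t r(s_t,a_t) - log hat pi(a_t|s_t) ]."""
    total = 0.0
    for _ in range(n_rollouts):
        s = 0                                          # fixed initial state for simplicity
        for t in range(T):
            a = rng.choice(N_A, p=pi_hat[s])           # sample the policy
            total += r[s, a] - np.log(pi_hat[s, a])    # reward plus entropy term
            s = rng.choice(N_S, p=P[s, a])             # sample the *true* dynamics
    return total / n_rollouts

print(elbo_estimate())   # estimates the ELBO (up to the constant from the uniform action prior)
```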
