[Reinforcement Learning] SPG, DPG, DDPG (DPG3)

Warning: math formulas ahead

Policy Gradient

$$
\begin{aligned}
J(\pi_\theta) &= \int_S \rho^\pi(s) \int_A \pi_\theta(s,a)\, r(s,a)\, da\, ds \\
&= E_{s\sim\rho^\pi,\, a\sim\pi_\theta}\left[ r(s,a) \right]
\end{aligned}
$$

$$
\rho^\pi(s') = \int_S \sum_{t=1}^{\infty} \gamma^{t-1}\, p_1(s)\, p(s\to s', t, \pi)\, ds
$$

where $p_1(s)$ is the probability (density) that the initial state is $s$, and

$p(s\to s', t, \pi)$ is the probability (density) of reaching $s'$ from $s$ after $t$ time steps under policy $\pi$.
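To make the discounted state distribution concrete, here is a minimal tabular sketch. Everything in it is an assumption for illustration: the two-state chain `P`, the start distribution `p1`, the per-state expected reward `r`, and the helper name `discounted_state_distribution` do not come from the original post. It also assumes the indexing in which the $t=1$ term is the start distribution itself, so that $J$ matches the expected discounted return.

```python
def discounted_state_distribution(p1, P, gamma, horizon=2000):
    """Truncated sum over t >= 1 of gamma^(t-1) * Pr(state at step t = s').

    p1[i]   : probability that the initial state is i
    P[i][j] : probability of moving from i to j in one step under the policy
    Assumption: step t = 1 is the initial state (zero transitions so far).
    """
    n = len(p1)
    rho = [0.0] * n
    d = list(p1)      # state distribution at the current step
    weight = 1.0      # gamma^(t-1)
    for _ in range(horizon):
        for j in range(n):
            rho[j] += weight * d[j]
        # Push the distribution one step forward through the chain.
        d = [sum(d[i] * P[i][j] for i in range(n)) for j in range(n)]
        weight *= gamma
    return rho


# Hypothetical two-state example (policy already folded into P and r).
p1 = [1.0, 0.0]
P = [[0.9, 0.1],
     [0.5, 0.5]]
r = [0.0, 1.0]        # expected reward in each state under the policy
gamma = 0.9

rho = discounted_state_distribution(p1, P, gamma)
J = sum(rho[s] * r[s] for s in range(len(r)))
print("rho =", rho)
print("J   =", J)
```

Note that `rho` is unnormalized: its entries sum to roughly $1/(1-\gamma)$ rather than 1, which is why $J$ computed this way is a discounted return and not an average reward.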

SPG

stochastic policy gradient

"Stochastic" refers to a stochastic policy $\pi_\theta(a|s) = P[a \mid s; \theta]$ (written $\pi_\theta(s,a)$ in the integrals):

$$
\begin{aligned}
\nabla_\theta J(\pi_\theta) &= \int_S \rho^\pi(s) \int_A \nabla_\theta \pi_\theta(s,a)\, Q^\pi(s,a)\, da\, ds \\
&= E_{s\sim\rho^\pi,\, a\sim\pi_\theta}\left[ \nabla_\theta \log \pi_\theta(s,a)\, Q^\pi(s,a) \right]
\end{aligned}
$$
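The expectation form of the stochastic policy gradient can be estimated by sampling actions and averaging the score times $Q$. The sketch below is a toy illustration under assumptions that are not in the original post: a single-state, single-step problem, a Gaussian policy with mean `theta` and unit variance, and a made-up reward $-(a-2)^2$ that plays the role of $Q^\pi$. The function names are hypothetical.

```python
import random


def spg_estimate(theta, n_samples=200_000, seed=0):
    """Monte Carlo estimate of E_a[ d/dtheta log pi_theta(a) * Q(a) ].

    Assumed policy: a ~ Normal(theta, 1), so the score is (a - theta).
    Assumed Q (single step, so Q equals the reward): -(a - 2)^2.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        a = rng.gauss(theta, 1.0)
        score = a - theta            # gradient of log N(a; theta, 1) w.r.t. theta
        q = -(a - 2.0) ** 2
        total += score * q
    return total / n_samples


def exact_gradient(theta):
    """Closed form for this toy: J(theta) = -((theta - 2)^2 + 1)."""
    return -2.0 * (theta - 2.0)


print("sampled:", spg_estimate(0.0))
print("exact  :", exact_gradient(0.0))
```

The sampled value only approaches the exact one as the number of samples grows; the estimator integrates over actions by sampling, which is the cost the deterministic variant avoids.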

DPG

deterministic policy gradient

The resulting policy $\mu_\theta$ maps each state to a single, deterministic action:

$$
\begin{aligned}
J(\mu_\theta) &= \int_S \rho^\mu(s)\, r(s, \mu_\theta(s))\, ds \\
&= E_{s\sim\rho^\mu}\left[ r(s, \mu_\theta(s)) \right]
\end{aligned}
$$

$$
\begin{aligned}
\nabla_\theta J(\mu_\theta) &= \int_S \rho^\mu(s)\, \nabla_\theta \mu_\theta(s)\, \nabla_a Q^\mu(s,a)\big|_{a=\mu_\theta(s)}\, ds \\
&= E_{s\sim\rho^\mu}\left[ \nabla_\theta \mu_\theta(s)\, \nabla_a Q^\mu(s,a)\big|_{a=\mu_\theta(s)} \right]
\end{aligned}
$$
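The deterministic gradient is a chain rule: the gradient of the action with respect to $\theta$, times the gradient of $Q$ with respect to the action, evaluated at $a=\mu_\theta(s)$. The sketch below checks that on a toy problem. All specifics are assumptions for illustration and do not come from the original post: a scalar linear policy `mu(theta, s) = theta * s`, a known single-step critic `q(s, a) = -(a - 2s)^2`, and a fixed batch of states standing in for samples from $\rho^\mu$ (in this single-step toy the state distribution does not depend on $\theta$).

```python
def mu(theta, s):
    """Assumed deterministic policy: a = theta * s."""
    return theta * s


def grad_theta_mu(theta, s):
    """d mu / d theta for the linear policy above."""
    return s


def q(s, a):
    """Assumed known critic (single step): best action is a = 2s."""
    return -(a - 2.0 * s) ** 2


def grad_a_q(s, a):
    """d q / d a for the critic above."""
    return -2.0 * (a - 2.0 * s)


def dpg_gradient(theta, states):
    """Sample average of grad_theta mu(s) * grad_a Q(s, a) at a = mu(s)."""
    terms = [grad_theta_mu(theta, s) * grad_a_q(s, mu(theta, s)) for s in states]
    return sum(terms) / len(terms)


def objective(theta, states):
    """J(theta) for the toy: average of Q(s, mu(s)) over the state batch."""
    return sum(q(s, mu(theta, s)) for s in states) / len(states)


states = [-1.0, 0.5, 2.0]   # hypothetical samples standing in for rho^mu
theta = 0.5
eps = 1e-5
finite_diff = (objective(theta + eps, states) - objective(theta - eps, states)) / (2 * eps)

print("chain rule :", dpg_gradient(theta, states))
print("finite diff:", finite_diff)
```

Unlike the stochastic case, no actions are sampled here: the average runs over states only, and the action enters through $\mu_\theta(s)$.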

DDPG

[Figure: image not available]
