强化学习学习笔记-李宏毅

Policy Gradient

  • actor+env+reward function,env和reward是不能控制的,唯一可以变的是actor,Policy π \pi π是一个网络,参数为 θ \theta θ,输入是当前的观察,输出是采取的行为,例如游戏中输入的是游戏画面 s 1 s_1 s1,输出的是采取的操作 a 1 a_1 a1,有了决定的action a 1 a_1 a1之后会获取对应的reward r 1 r_1 r1,并且画面也会有对应的改变得到 s 2 s_2 s2,这个过程不断进行得到一个trajectory τ = { s 1 , a 1 , s 2 , a 2 , ⋯   , s T , a T } \tau = \{s_1,a_1,s_2,a_2,\cdots,s_T,a_T\} τ={s1,a1,s2,a2,,sT,aT},假设网络参数固定,那么某条trajectory的几率是 p θ ( τ ) = p ( s 1 ) p θ ( a 1 ∣ s 1 ) p ( s 2 ∣ s 1 , a 1 ) p θ ( a 2 ∣ s 2 ) ⋯ = p ( s 1 ) ∏ t = 1 T p θ ( a t ∣ s t ) p ( s t + 1 ∣ s t , a t ) p_\theta(\tau) = p(s_1)p_\theta(a_1|s_1)p(s_2|s_1,a_1)p_\theta(a_2|s_2)\cdots = p(s_1)\prod_{t = 1}^Tp_\theta(a_t|s_t)p(s_{t+1}|s_t,a_t) pθ(τ)=p(s1)pθ(a1s1)p(s2s1,a1)pθ(a2s2)=p(s1)t=1Tpθ(atst)p(st+1st,at),某一条trajectory得到的reward R ( τ ) = ∑ t = 1 T r t R(\tau)=\sum_{t = 1}^Tr_t R(τ)=t=1Trt,目标就是调整网络参数,使得reward的期望值大 R ‾ θ = ∑ τ R ( τ ) p θ ( τ ) \overline{R}_\theta = \sum_\tau R(\tau)p_\theta(\tau) Rθ=τR(τ)pθ(τ),如何优化 θ \theta θ呢,梯度下降 ∇ ( ‾ R ) θ = ∑ τ R ( τ ) ∇ p θ ( τ ) = ∑ τ R ( τ ) p θ ( τ ) ∇ p θ ( τ ) p θ ( τ ) = ∑ τ R ( τ ) p θ ( τ ) ∇ log ⁡ p θ ( τ ) = E τ ∼ p θ ( τ ) [ R ( τ ) ∇ log ⁡ p θ ( τ ) ] = 1 N ∑ n = 1 N R ( τ n ) ∇ log ⁡ p θ ( τ n ) = 1 N ∑ n = 1 N ∑ t = 1 T n R ( τ n ) ∇ log ⁡ p θ ( a t n ∣ s t n ) \nabla \overline(R)_\theta=\sum_\tau R(\tau)\nabla p_\theta(\tau) = \sum_\tau R(\tau)p_\theta(\tau)\frac{\nabla p_\theta(\tau)}{p_\theta(\tau)}=\sum_\tau R(\tau)p_\theta(\tau)\nabla \log p_\theta(\tau) = E_{\tau\sim p_\theta(\tau)}[R(\tau)\nabla\log p_\theta(\tau)] = \frac{1}{N}\sum_{n = 1}^NR(\tau^n)\nabla\log p_\theta(\tau^n) = \frac{1}{N}\sum_{n = 1}^N\sum_{t = 1}^{T_n}R(\tau^n)\nabla\log p_\theta(a^n_t|s_t^n) (R)θ=τR(τ)pθ(τ)=τR(τ)pθ(τ)pθ(τ)pθ(τ)=τR(τ)pθ(τ)logpθ(τ)=Eτpθ(τ)[R(τ)logpθ(τ)]=N1n=1NR(τn)logpθ(τn)=N1n=1Nt=1TnR(τn)logpθ(atnstn),更新参数 θ ← θ + η ∇ R ‾ θ \theta\leftarrow \theta + \eta\nabla \overline{R}_\theta θθ+ηRθ,训练数据的获得,根据当前的网络,去玩游戏获取不同的trajectory,记录 s i t , a i t , r i s^t_i,a^t_i,r_i sit,ait,ri的数据对,计算梯度,更新参数,之后再次sample trajectory;
  • 本质上可以看做一个分类问题,网络希望输入 s i t s_i^t sit输出 a i t a_i^t ait使得reward r i r_i ri最大,其中 r i r_i ri是针对整场游戏而言的,所以可以看做以reward为权重的log likelihood,希望加权的likelihood越大越好,以此提升输入 s i t s_i^t sit得到reward大的时候对应的 a i t a_i^t ait的几率,对应的就是分类的时候提升正确类别对应的几率;
  • 由于训练的时候是sample,所以假设所有的reward都为正的时候可能会存在问题,所以reward整体都减去一个常量,可以取作reward的期望;
  • 现在reward是trajectory粒度的,但是一条trajectory里面可能并不是所有的action都是好的,所以需要为不同的步骤分配不同的credit,此时变为 R ( τ n ) → ∑ t ′ = t T n r t ′ n → ∑ t ′ = t T n γ t ′ − t r t ′ n (随时间指数) R(\tau^n)\rightarrow \sum_{t'=t}^{T_n}r_{t'}^n\rightarrow\sum_{t'=t}^{T_n}\gamma^{t'-t}r_{t'}^n(随时间指数) R(τn)t=tTnrtnt=tTnγttrtn(随时间指数),减去bias之后记作 A θ ( s t , a t ) A^\theta(s_t,a_t) Aθ(st,at)

Proximal Policy Optimization

  • On-policy:参与学习的agent和与环境互动的agent是同一个,上面的就是on-policy的做法,存在的问题就是更新了参数之后,之前sample出来的数据就不能再次使用了;
  • Off-policy:参与学习的agent和与环境互动的agent不是同一个,希望使用sample出来的数据多次,使用从 π θ ′ \pi_{\theta'} πθ中sample出来的数据来训练 π θ \pi_\theta πθ,其中 θ ′ \theta' θ是固定的;
  • importance sampling: E x ∼ p [ f ( x ) ] = 1 N ∑ i = 1 N f ( x i ) E_{x\sim p}[f(x)] = \frac{1}{N}\sum_{i = 1}^Nf(x^i) Exp[f(x)]=N1i=1Nf(xi),但是我们现在不能从 p p p sample数据,只能从 q ( x ) q(x) q(x) sample数据,所以换成 E x ∼ p [ f ( x ) ] = ∫ f ( x ) p ( x ) d x = ∫ f ( x ) p ( x ) q ( x ) q ( x ) d x = E x ∼ q [ f ( x ) p ( x ) q ( x ) ] E_{x\sim p}[f(x)] = \int f(x)p(x)dx = \int f(x)\frac{p(x)}{q(x)}q(x)dx = E_{x\sim q}[f(x)\frac{p(x)}{q(x)}] Exp[f(x)]=f(x)p(x)dx=f(x)q(x)p(x)q(x)dx=Exq[f(x)q(x)p(x)],也就是做了一个修正,乘上了 p ( x ) q ( x ) \frac{p(x)}{q(x)} q(x)p(x),也就是importance weight,但是importance sampling有一个问题就是 p p p q q q不能差太多;
  • 对应的梯度 ∇ R ‾ θ = E τ ∼ p θ ′ ( τ ) [ p θ ( τ ) p θ ′ ( τ ) R ( τ ) ∇ log ⁡ p θ ( τ ) ] = E ( s t , a t ) ∼ π θ ′ [ P θ ( s t , a t ) P θ ′ ( s t , a t ) A θ ′ ( s t , a t ) ∇ log ⁡ p θ ( a t n ∣ s t n ) ] = E ( s t , a t ) ∼ π θ ′ [ P θ ( a t ∣ s t ) P θ ′ ( a t ∣ s t ) p θ ( s t ) p θ ′ ( s t ) A θ ′ ( s t , a t ) ∇ log ⁡ p θ ( a t n ∣ s t n ) ] = E ( s t , a t ) ∼ π θ ′ [ P θ ( a t ∣ s t ) P θ ′ ( a t ∣ s t ) A θ ′ ( s t , a t ) ∇ log ⁡ p θ ( a t n ∣ s t n ) ] \nabla \overline R_\theta = E_{\tau\sim p_{\theta'}(\tau)}[\frac{p_\theta(\tau)}{p_{\theta'}(\tau)}R(\tau)\nabla\log p_\theta(\tau)]=E_{(s_t,a_t)\sim\pi_{\theta'}}[\frac{P_\theta(s_t,a_t)}{P_{\theta'}(s_t,a_t)}A^{\theta'}(s_t,a_t)\nabla\log p_\theta(a_t^n|s_t^n)]=E_{(s_t,a_t)\sim\pi_{\theta'}}[\frac{P_\theta(a_t|s_t)}{P_{\theta'}(a_t|s_t)}\frac{p_\theta(s_t)}{p_{\theta'}(s_t)}A^{\theta'}(s_t,a_t)\nabla\log p_\theta(a_t^n|s_t^n)]=E_{(s_t,a_t)\sim\pi_{\theta'}}[\frac{P_\theta(a_t|s_t)}{P_{\theta'}(a_t|s_t)}A^{\theta'}(s_t,a_t)\nabla\log p_\theta(a_t^n|s_t^n)] Rθ=Eτpθ(τ)[pθ(τ)pθ(τ)R(τ)logpθ(τ)]=E(st,at)πθ[Pθ(st,at)Pθ(st,at)Aθ(st,at)logpθ(atnstn)]=E(st,at)πθ[Pθ(atst)Pθ(atst)pθ(st)pθ(st)Aθ(st,at)logpθ(atnstn)]=E(st,at)πθ[Pθ(atst)Pθ(atst)Aθ(st,at)logpθ(atnstn)],根据 ∇ f ( x ) = f ( x ) ∇ log ⁡ f ( x ) \nabla f(x) = f(x)\nabla\log f(x) f(x)=f(x)logf(x)反推出原优化目标为 J θ ′ ( θ ) = E ( s t , a t ) ∼ π θ ′ [ P θ ( a t ∣ s t ) P θ ′ ( a t ∣ s t ) A θ ′ ( s t , a t ) ] J^{\theta'}(\theta) =E_{(s_t,a_t)\sim\pi_{\theta'}}[\frac{P_\theta(a_t|s_t)}{P_{\theta'}(a_t|s_t)}A^{\theta'}(s_t,a_t)] Jθ(θ)=E(st,at)πθ[Pθ(atst)Pθ(atst)Aθ(st,at)]
  • PPO就是加了一项使得 p θ p_\theta pθ p θ ′ p_{\theta'} pθ之间不能差太多, J P P O θ ′ ( θ ) = J θ ′ ( θ ) − β K L ( θ , θ ′ ) J^{\theta'}_{PPO}(\theta) = J^{\theta'}(\theta)-\beta KL(\theta,\theta') JPPOθ(θ)=Jθ(θ)βKL(θ,θ),其中 β \beta β动态调整,如果 K L ( θ , θ ′ ) > K L m a x KL(\theta,\theta')>KL_{max} KL(θ,θ)>KLmax增大 b e t a beta beta,如果 K L ( θ , θ ′ ) < K L m i n KL(\theta,\theta')<KL_{min} KL(θ,θ)<KLmin减小 b e t a beta beta

ref
https://www.youtube.com/watch?v=OAKAZhFmYoI&ab_channel=Hung-yiLee

  • 3
    点赞
  • 7
    收藏
    觉得还不错? 一键收藏
  • 0
    评论

“相关推荐”对你有帮助么?

  • 非常没帮助
  • 没帮助
  • 一般
  • 有帮助
  • 非常有帮助
提交
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值