Reinforcement Learning

1. Model Free

1.1 Monte Carlo

1.1.1 Value Iteration

On-policy MC control (the Monte Carlo analogue of SARSA):
1. Derive an ε-greedy policy from the current Q.
2. Sample trajectories $(s_1, a_1, r_1, s_2, a_2, r_2, \dots)$ and use first-visit MC returns.
3. Update $Q(s,a) = \frac{1}{N(s,a)}\sum_{i} G_i(s,a)$, the average of the observed returns.
4. Improve the policy based on the updated Q values (see the sketch after this list).
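A minimal sketch of one round of this loop, assuming a Gymnasium-style discrete environment (`env.reset()` / `env.step()`) and tabular `Q`, `N` stored as `defaultdict`s; the function names are illustrative, not from the source.

```python
import random
from collections import defaultdict

def epsilon_greedy(Q, state, n_actions, eps=0.1):
    """Step 1: derive an epsilon-greedy policy from the current Q."""
    if random.random() < eps:
        return random.randrange(n_actions)
    return max(range(n_actions), key=lambda a: Q[(state, a)])

def run_episode(env, Q, n_actions, eps=0.1):
    """Step 2: sample one trajectory (s1, a1, r1, s2, a2, r2, ...)."""
    traj, state, done = [], env.reset()[0], False
    while not done:
        action = epsilon_greedy(Q, state, n_actions, eps)
        next_state, reward, terminated, truncated, _ = env.step(action)
        traj.append((state, action, reward))
        state, done = next_state, terminated or truncated
    return traj

def first_visit_mc_update(Q, N, traj, gamma=1.0):
    """Step 3: Q(s,a) = running average of first-visit returns G_i(s,a)."""
    G, returns = 0.0, []
    for (s, a, r) in reversed(traj):          # accumulate returns backwards
        G = r + gamma * G
        returns.append((s, a, G))
    seen = set()
    for (s, a, G) in reversed(returns):       # forward pass: keep first visits only
        if (s, a) not in seen:
            seen.add((s, a))
            N[(s, a)] += 1
            Q[(s, a)] += (G - Q[(s, a)]) / N[(s, a)]   # incremental mean
    # Step 4: policy improvement is implicit -- epsilon_greedy over the updated Q.

# Usage sketch: Q, N = defaultdict(float), defaultdict(int)
# for _ in range(num_iterations): first_visit_mc_update(Q, N, run_episode(env, Q, n_actions))
```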

1.1.2 Policy Iteration

1.1.3 Policy Gradient

Essential formula in episodic setting
The objective to maximize is the following expectation over trajectories (the gradient of this surrogate is the policy gradient):

$$J = \mathbb{E}_{\tau \sim \pi}\Big[r(\tau) \sum_{t=0}^{T} \log \pi(a_t \mid s_t)\Big]$$

Essential formula in non-episodic setting
$$
\begin{aligned}
\nabla_\theta J(\theta) &= \nabla_\theta\, \mathbb{E}_{s \sim d^{\pi}} [V(s)] \\
&= \nabla_\theta \sum_{s \in \mathcal{S}} d^\pi(s) \sum_{a \in \mathcal{A}} Q^\pi(s, a)\, \pi_\theta(a \mid s) \\
&\propto \sum_{s \in \mathcal{S}} d^\pi(s) \sum_{a \in \mathcal{A}} Q^\pi(s, a)\, \nabla_\theta \pi_\theta(a \mid s) \\
&= \sum_{s \in \mathcal{S}} d^\pi(s) \sum_{a \in \mathcal{A}} \pi_\theta(a \mid s)\, Q^\pi(s, a)\, \frac{\nabla_\theta \pi_\theta(a \mid s)}{\pi_\theta(a \mid s)} \\
&= \mathbb{E}_\pi \big[ Q^\pi(s, a)\, \nabla_\theta \ln \pi_\theta(a \mid s) \big] && \text{because } (\ln x)' = 1/x
\end{aligned}
$$
where $\mathbb{E}_\pi$ refers to $\mathbb{E}_{s \sim d^\pi,\, a \sim \pi_\theta}$, i.e. both the state and the action distributions follow the policy $\pi_\theta$ (on-policy).

REINFORCE
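A hedged PyTorch sketch of the REINFORCE loss implied by the formula above, using the per-step discounted return $G_t$ in place of the whole-trajectory return $r(\tau)$ (a standard variance-reducing variant); the argument names and the way log-probabilities are collected are assumptions for illustration.

```python
import torch

def reinforce_loss(log_probs, rewards, gamma=0.99):
    """Surrogate loss whose gradient is -E[ G_t * grad_theta log pi(a_t|s_t) ].

    log_probs: list of log pi(a_t|s_t) tensors collected over one episode.
    rewards:   list of scalar rewards r_t from the same episode.
    """
    G, returns = 0.0, []
    for r in reversed(rewards):              # discounted return-to-go G_t
        G = r + gamma * G
        returns.append(G)
    returns = torch.tensor(list(reversed(returns)))
    log_probs = torch.stack(log_probs)
    # Minimizing this ascends J = E[ sum_t G_t * log pi(a_t | s_t) ].
    return -(returns * log_probs).sum()
```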

1.2 TD

1.2.1 Value Iteration

(can be done in a non-episodic environment)

SARSA
1. Non-episodic setting; needs the tuple $(s_t, a_t, r_t, s_{t+1}, a_{t+1})$.
2. Update $Q(s_t, a_t) = Q(s_t, a_t) + \alpha\big(r_t + \gamma\, Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t)\big)$.
3. Improve the policy based on the updated Q values (see the sketch after this list).
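A minimal sketch of this update for a tabular `Q` stored as a dict (e.g. `defaultdict(float)`); the names are illustrative.

```python
def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.99):
    """On-policy TD(0): move Q(s,a) toward r + gamma * Q(s', a'),
    where a' is the action the current policy actually takes in s'."""
    td_target = r + gamma * Q[(s_next, a_next)]
    Q[(s, a)] += alpha * (td_target - Q[(s, a)])
```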

Q-learning (off-policy learning)
1. Non-episodic setting; needs the tuple $(s_t, a_t, r_t, s_{t+1})$.
2. Update $Q(s_t, a_t) = Q(s_t, a_t) + \alpha\big(r_t + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t)\big)$, as sketched below.
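The same tabular setup for the Q-learning update; the only change is bootstrapping from $\max_{a'} Q(s_{t+1}, a')$ instead of the action actually taken next, which is what makes it off-policy.

```python
def q_learning_update(Q, s, a, r, s_next, n_actions, alpha=0.1, gamma=0.99):
    """Off-policy TD(0): bootstrap from the greedy value of the next state,
    regardless of which action the behaviour policy takes there."""
    best_next = max(Q[(s_next, a2)] for a2 in range(n_actions))
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
```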

DQN
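The referenced figure is unavailable; as a hedged reminder of the standard DQN recipe (replay buffer plus a periodically synced target network), a minimal PyTorch loss sketch. The network objects and batch layout are assumptions, not from the source.

```python
import torch
import torch.nn.functional as F

def dqn_loss(q_net, target_net, batch, gamma=0.99):
    """TD loss on a minibatch sampled from a replay buffer.

    batch: (states, actions (long), rewards, next_states, dones (float)) tensors.
    target_net: a frozen copy of q_net, synced every few thousand steps.
    """
    s, a, r, s_next, done = batch
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)      # Q(s_t, a_t)
    with torch.no_grad():
        max_next = target_net(s_next).max(dim=1).values       # max_a' Q_target(s', a')
        target = r + gamma * (1.0 - done) * max_next
    return F.smooth_l1_loss(q_sa, target)                     # Huber loss
```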

1.2.2 Policy Gradient


Actor-Critic (the critic estimates Q)
Advantage Actor-Critic (the critic estimates V; the advantage is $A(s,a) = Q(s,a) - V(s)$), as sketched below.
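A hedged one-step sketch of the advantage actor-critic update, where the advantage is estimated with the TD error $r + \gamma V(s') - V(s)$; the tensor arguments are assumptions for illustration.

```python
import torch

def a2c_losses(log_prob, value, reward, next_value, done, gamma=0.99):
    """One-step A2C: the critic regresses V(s) to the TD target,
    the actor ascends advantage * log pi(a|s)."""
    with torch.no_grad():
        td_target = reward + gamma * (1.0 - done) * next_value
        advantage = td_target - value                 # A(s,a) ~ r + gamma*V(s') - V(s)
    actor_loss = -advantage * log_prob                # policy-gradient term
    critic_loss = (value - td_target).pow(2)          # value regression (MSE)
    return actor_loss, critic_loss
```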

2. Model Based

2.1 Monte Carlo Tree Search

Four steps: selection, expansion, simulation, backpropagation.
In selection, we must pick the most promising actions; the Q values can be used to choose good moves to try.
In expansion, randomly expand the selected leaf node with a valid action.
In simulation, roll out from this action quickly to a terminal state using some cheap (e.g. imitation-learned) policy.
In backpropagation, evaluate the action using the expected rewards from all simulated paths (a max-expectation tree). A minimal sketch of selection and backpropagation follows.
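A minimal sketch of the selection and backpropagation steps, using the common UCT score (mean value plus an exploration bonus) as the "Q value used to pick good tries"; the `Node` fields and the constant `c` are assumptions.

```python
import math

class Node:
    def __init__(self, state, parent=None):
        self.state, self.parent = state, parent
        self.children = {}               # action -> child Node
        self.N, self.W = 0, 0.0          # visit count, total simulated reward

    def uct(self, c=1.4):
        """Selection score: exploitation (W/N) + exploration bonus."""
        if self.N == 0:
            return float("inf")          # always try unvisited children once
        return self.W / self.N + c * math.sqrt(math.log(self.parent.N) / self.N)

def select(root):
    """Selection: descend by the highest UCT score until reaching a leaf,
    which is then expanded and evaluated by a fast rollout (not shown)."""
    node = root
    while node.children:
        node = max(node.children.values(), key=lambda n: n.uct())
    return node

def backpropagate(node, reward):
    """Backpropagation: push the rollout reward up to the root."""
    while node is not None:
        node.N += 1
        node.W += reward
        node = node.parent
```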

2.2 Forward Search Tree

At a state, try different actions to reach different next states, then try the actions at those states, and so on; evaluate the resulting tree with expectation-max (expectimax) backups, as sketched below.
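A hedged sketch of a depth-limited forward search with expectation-max backups, assuming a known model: `transition(s, a)` returning `(prob, next_state)` pairs and `reward(s, a)` returning a scalar are hypothetical stand-ins.

```python
def forward_search(state, depth, actions, transition, reward, gamma=0.99):
    """Max over actions of [ r(s,a) + gamma * E_{s'}[ value of s' ] ]."""
    if depth == 0:
        return 0.0                                   # horizon reached
    best = float("-inf")
    for a in actions:
        expected_next = sum(
            p * forward_search(s2, depth - 1, actions, transition, reward, gamma)
            for p, s2 in transition(state, a)        # expectation over next states
        )
        best = max(best, reward(state, a) + gamma * expected_next)
    return best
```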

The Stanford class notes:
https://github.com/Zhenye-Na/reinforcement-learning-stanford

Bellman Optimality Equation

$$V_{\pi^*}(s) = \max_a Q_{\pi^*}(s,a)$$
$$\pi^*(s) = \arg\max_a Q_{\pi^*}(s,a)$$
$$V_{\pi}(s) = \sum_{a} \pi(a \mid s)\, Q_{\pi}(s,a)$$
$$Q_{\pi}(s,a) = r(s,a) + \gamma \sum_{s'} p(s' \mid s,a)\, V_{\pi}(s') = r(s,a) + \gamma \sum_{s'} p(s' \mid s,a) \sum_{a'} \pi(a' \mid s')\, Q_{\pi}(s',a')$$
$$Q_{\pi^*}(s,a) = r(s,a) + \gamma \sum_{s'} p(s' \mid s,a)\, V_{\pi^*}(s') = r(s,a) + \gamma \sum_{s'} p(s' \mid s,a) \max_{a'} Q_{\pi^*}(s',a')$$
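A small NumPy sketch of value iteration, i.e. repeatedly applying the Bellman optimality backup above; the array layout (`P[s, a, s']`, `R[s, a]`) is an assumption.

```python
import numpy as np

def value_iteration(P, R, gamma=0.99, tol=1e-6):
    """Iterate V(s) <- max_a [ R(s,a) + gamma * sum_s' P(s,a,s') V(s') ]."""
    n_states = P.shape[0]
    V = np.zeros(n_states)
    while True:
        Q = R + gamma * (P @ V)          # Q[s, a] = R[s, a] + gamma * E[V(s')]
        V_new = Q.max(axis=1)            # Bellman optimality backup
        if np.max(np.abs(V_new - V)) < tol:
            V = V_new
            break
        V = V_new
    pi = Q.argmax(axis=1)                # greedy policy w.r.t. the converged Q
    return V, pi
```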

AlphaGo

The details of MCTS in AlphaGo:

1. Selection phase: search down from the current root and pick the action with the maximum $Q + U$, where the exploration bonus is
$U(s,a) = \frac{P(s,a)}{1 + N(s,a)}$ (up to a constant factor).
Note: selection only walks to an edge $(s,a)$, not to a node (state).
In fact, the second step here is not expansion but simulation.
2. Simulation (evaluation): we take two perspectives, 1) a fast rollout value $Z$ and 2) the current value function $V(s)$.
3. Backup: once either evaluation finishes, back up along all the edges traversed on the current search path:
accumulate $V_{\text{total}}(s,a) \leftarrow V_{\text{total}}(s,a) + V_i$ and $Z_{\text{total}}(s,a) \leftarrow Z_{\text{total}}(s,a) + Z_i$, and increment the counts $N_v(s,a)$, $N_z(s,a)$.
The new $Q(s,a) = 0.5\, V_{\text{total}}/N_v + 0.5\, Z_{\text{total}}/N_z$.
4. Expansion: if an edge has been visited more than $N_{ts}$ times, it is expanded into a node (initialized with $N = 0$) and added to the current search tree, so in the next selection phase the actions under this node are again chosen by $\max(Q + U)$. Since all initial $Q(s,a)$ and $N(s,a)$ under a newly created node are zero, the first selection under the new node follows $\arg\max_a P(s,a)$ from the policy network (see the sketch below), because
$\max(Q + U) = \max(U) = \max(P(s,a))$ when $U(s,a) = \frac{P(s,a)}{1 + N(s,a)}$.
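A hedged sketch of the selection rule described above: pick the edge maximizing $Q + U$ with $U(s,a) \propto P(s,a)/(1+N(s,a))$; the dict-based edge statistics and the constant `c_puct` are illustrative assumptions.

```python
def select_edge(edges, c_puct=1.0):
    """edges: dict action -> {'Q': mean value, 'N': visit count, 'P': prior prob}.

    On a freshly expanded node all Q = 0 and N = 0, so the score reduces to
    c_puct * P(s,a): the first pick follows argmax P from the policy network.
    """
    def score(a):
        e = edges[a]
        return e["Q"] + c_puct * e["P"] / (1.0 + e["N"])
    return max(edges, key=score)
```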
