1. Model-Free
1.1 Monte Carlo
1.1.1 Value Iteration
Monte Carlo control:
1. Derive an ε-greedy policy from the current Q.
2. Sample trajectories $(s_1, a_1, r_1, s_2, a_2, r_2, \dots)$; use first-visit MC.
3. Update:
$$Q(s,a) = \frac{1}{N(s,a)}\sum_{i} G_i(s,a)$$
where $G_i(s,a)$ is the return following the first visit to $(s,a)$ in episode $i$, and $N(s,a)$ is the number of such visits.
4. Improve the policy based on the updated Q values (a sketch of the full loop follows).
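A minimal sketch of this loop, assuming a tabular, episodic environment with a Gym-style `reset()`/`step(a)` interface; the interface and all names here are illustrative, not from the notes:

```python
import random
from collections import defaultdict

def mc_control(env, n_actions, episodes=10_000, eps=0.1, gamma=1.0):
    Q = defaultdict(float)   # Q(s, a) estimates
    N = defaultdict(int)     # first-visit counts N(s, a)
    for _ in range(episodes):
        # 1. epsilon-greedy policy from the current Q
        def policy(s):
            if random.random() < eps:
                return random.randrange(n_actions)
            return max(range(n_actions), key=lambda a: Q[(s, a)])
        # 2. sample one trajectory (assumed step() -> (next_state, reward, done))
        traj, s, done = [], env.reset(), False
        while not done:
            a = policy(s)
            s2, r, done = env.step(a)
            traj.append((s, a, r))
            s = s2
        # 3. first-visit MC update: average the return G per (s, a)
        G, returns = 0.0, []
        for s, a, r in reversed(traj):
            G = r + gamma * G
            returns.append((s, a, G))
        seen = set()
        for s, a, G in reversed(returns):    # forward order again
            if (s, a) not in seen:           # first visit only
                seen.add((s, a))
                N[(s, a)] += 1
                Q[(s, a)] += (G - Q[(s, a)]) / N[(s, a)]  # running mean
        # 4. improvement is implicit: the next episode's epsilon-greedy
        #    policy is built from the updated Q
    return Q
```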
1.1.2 Policy Iteration
1.1.3 Policy Gradient
Essential formula in the episodic setting.
The objective to maximize is an expectation over trajectories:
$$J(\theta) = \mathbb{E}_{\tau\sim\pi}\Big[r(\tau)\sum_{t=0}^{T}\log \pi(a_t \mid s_t)\Big]$$
Differentiating the log-probability term (with $r(\tau)$ treated as a constant) yields the REINFORCE gradient $\nabla_\theta J = \mathbb{E}_{\tau\sim\pi}\big[r(\tau)\sum_{t=0}^{T}\nabla_\theta\log \pi(a_t \mid s_t)\big]$.
Essential formula in the non-episodic setting (the policy gradient theorem):
$$\begin{aligned}
\nabla_\theta J(\theta) &= \nabla_\theta \mathbb{E}_{s\sim d^{\pi}}[V(s)] \\
&= \nabla_\theta \sum_{s \in \mathcal{S}} d^\pi(s) \sum_{a \in \mathcal{A}} Q^\pi(s, a)\,\pi_\theta(a \mid s) \\
&\propto \sum_{s \in \mathcal{S}} d^\pi(s) \sum_{a \in \mathcal{A}} Q^\pi(s, a)\,\nabla_\theta \pi_\theta(a \mid s) \\
&= \sum_{s \in \mathcal{S}} d^\pi(s) \sum_{a \in \mathcal{A}} \pi_\theta(a \mid s)\, Q^\pi(s, a)\,\frac{\nabla_\theta \pi_\theta(a \mid s)}{\pi_\theta(a \mid s)} \\
&= \mathbb{E}_\pi\big[Q^\pi(s, a)\,\nabla_\theta \ln \pi_\theta(a \mid s)\big] && \scriptstyle{\text{; because } (\ln x)' = 1/x}
\end{aligned}$$
where $\mathbb{E}_\pi$ refers to $\mathbb{E}_{s \sim d^\pi,\, a \sim \pi_\theta}$ when both the state and action distributions follow the policy $\pi_\theta$ (on-policy).
REINFORCE
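A minimal REINFORCE sketch in PyTorch; the `policy_net`, optimizer, and environment interface are illustrative assumptions, and it uses the common return-to-go $G_t$ in place of the whole-trajectory return $r(\tau)$:

```python
import torch

def reinforce_update(policy_net, optimizer, env, gamma=0.99):
    """One episode of REINFORCE: ascend E[ G_t * log pi(a_t | s_t) ]."""
    log_probs, rewards = [], []
    s, done = env.reset(), False
    while not done:
        logits = policy_net(torch.as_tensor(s, dtype=torch.float32))
        dist = torch.distributions.Categorical(logits=logits)
        a = dist.sample()
        log_probs.append(dist.log_prob(a))
        s, r, done = env.step(a.item())   # assumed (state, reward, done) interface
        rewards.append(r)
    # discounted return-to-go G_t for each time step
    G, returns = 0.0, []
    for r in reversed(rewards):
        G = r + gamma * G
        returns.append(G)
    returns.reverse()
    returns = torch.tensor(returns)
    # gradient ascent on J = minimize the negative surrogate loss
    loss = -(torch.stack(log_probs) * returns).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```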
1.2 TD
1.2.1 Value Iteration
(can be done in non-episodic environments)
SARSA
1. Non-episodic setting; needs the tuple $(s_t, a_t, r_t, s_{t+1}, a_{t+1})$.
2. Update: $Q(s_t,a_t) \leftarrow Q(s_t,a_t) + \alpha\big(r_t + \gamma\, Q(s_{t+1},a_{t+1}) - Q(s_t,a_t)\big)$
3. Improve the policy based on the updated Q values (see the tabular sketch below).
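A tabular sketch of this on-policy loop, assuming the same illustrative Gym-style `env` interface as earlier (`step(a)` returning `(next_state, reward, done)`):

```python
import random
from collections import defaultdict

def sarsa(env, n_actions, steps=100_000, alpha=0.1, gamma=0.99, eps=0.1):
    Q = defaultdict(float)

    def eps_greedy(s):
        if random.random() < eps:
            return random.randrange(n_actions)
        return max(range(n_actions), key=lambda a: Q[(s, a)])

    s = env.reset()
    a = eps_greedy(s)
    for _ in range(steps):
        s2, r, done = env.step(a)
        a2 = eps_greedy(s2)            # next action from the same policy: on-policy
        target = r if done else r + gamma * Q[(s2, a2)]
        Q[(s, a)] += alpha * (target - Q[(s, a)])
        if done:
            s = env.reset()
            a = eps_greedy(s)
        else:
            s, a = s2, a2
    return Q
```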
Q-learning (off-policy learning)
1. Non-episodic setting; needs the tuple $(s_t, a_t, r_t, s_{t+1})$.
2. Update: $Q(s_t,a_t) \leftarrow Q(s_t,a_t) + \alpha\big(r_t + \gamma \max_{a'} Q(s_{t+1},a') - Q(s_t,a_t)\big)$ (sketch below).
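The same loop for off-policy Q-learning; only the bootstrap target changes, using $\max_{a'}$ instead of the sampled next action (same illustrative interface as in the SARSA sketch above):

```python
import random
from collections import defaultdict

def q_learning(env, n_actions, steps=100_000, alpha=0.1, gamma=0.99, eps=0.1):
    Q = defaultdict(float)

    def eps_greedy(s):
        if random.random() < eps:
            return random.randrange(n_actions)
        return max(range(n_actions), key=lambda a: Q[(s, a)])

    s = env.reset()
    for _ in range(steps):
        a = eps_greedy(s)              # behavior policy is exploratory...
        s2, r, done = env.step(a)
        best_next = max(Q[(s2, a2)] for a2 in range(n_actions))
        target = r if done else r + gamma * best_next  # ...target is greedy: off-policy
        Q[(s, a)] += alpha * (target - Q[(s, a)])
        s = env.reset() if done else s2
    return Q
```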
DQN: Q-learning with a neural-network Q function, trained with experience replay and a target network.
1.2.2 Policy Gradient
Actor-Critic (the critic estimates Q)
Advantage Actor-Critic (the critic estimates V; the advantage form is sketched below)
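The advantage variant replaces $Q^\pi(s,a)$ in the gradient derived above with an advantage estimate, which lowers variance without biasing the gradient; a common one-step (TD) estimate, reusing the symbols from the sections above, is:

$$\nabla_\theta J(\theta) = \mathbb{E}_\pi\big[A(s,a)\,\nabla_\theta \ln \pi_\theta(a \mid s)\big], \qquad A(s,a) \approx r + \gamma V(s') - V(s)$$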
2. Model-Based
2.1 Monte Carlo Tree Search
Four steps: selection, expansion, simulation, backpropagation.
In selection, we must pick the most promising actions; the Q value can guide which branches are worth trying (a standard selection rule is given below).
In expansion, randomly expand the selected leaf node with a valid action.
In simulation, roll out from that action quickly to a terminal state using a fast, imitation-learned policy.
In backpropagation, propagate the simulated rewards back up the tree and evaluate each action by its expected reward (an expectimax tree).
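For the selection step, a standard concrete rule is UCT (an assumption here, since the notes only say to use the Q value), which adds an exploration bonus to Q:

$$a^* = \operatorname*{arg\,max}_a \left[ Q(s,a) + c \sqrt{\frac{\ln N(s)}{N(s,a)}} \right]$$

where $N(s)$ counts visits to the node, $N(s,a)$ counts visits to the edge, and $c$ trades off exploration against exploitation.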
2.2 Forward Search Tree
At a state, try the different actions, move to the resulting states, and try the actions at those states in turn; then evaluate the tree with expectimax (a recursive sketch follows).
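A minimal recursive sketch of this expectimax-style forward search, assuming a known model `model(s, a)` that returns a list of `(prob, reward, next_state)` transitions (all names illustrative):

```python
def forward_search(s, depth, actions, model, gamma=0.99):
    """Expectimax: max over actions, expectation over next states."""
    if depth == 0:
        return 0.0
    best = float("-inf")
    for a in actions:
        q = 0.0
        for prob, r, s2 in model(s, a):   # enumerate the known transition model
            q += prob * (r + gamma * forward_search(s2, depth - 1,
                                                    actions, model, gamma))
        best = max(best, q)               # max over actions at this state
    return best
```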
The Stanford class notes:
https://github.com/Zhenye-Na/reinforcement-learning-stanford
Bellman Optimality Equation
$$V_{\pi^*}(s) = \max_a Q(s,a)$$
$$\pi^*(s) = \operatorname*{arg\,max}_a Q(s,a)$$
$$V_{\pi}(s) = \sum_{a} \pi(a \mid s)\, Q(s,a)$$
$$Q_{\pi}(s,a) = r(s,a) + \gamma \sum_{s'} p(s' \mid s,a)\, V_{\pi}(s') = r(s,a) + \gamma \sum_{s'} p(s' \mid s,a) \sum_{a'} \pi(a' \mid s')\, Q_{\pi}(s',a')$$
$$Q_{\pi^*}(s,a) = r(s,a) + \gamma \sum_{s'} p(s' \mid s,a)\, V_{\pi^*}(s') = r(s,a) + \gamma \sum_{s'} p(s' \mid s,a) \max_{a'} Q(s',a')$$
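These optimality equations translate directly into tabular Q-value iteration, repeatedly applying the Bellman optimality backup until convergence; a minimal sketch assuming a known model given as transition lists `P[s][a]` and rewards `R[s][a]` (names illustrative):

```python
def q_value_iteration(states, actions, P, R, gamma=0.99, tol=1e-6):
    # P[s][a]: list of (prob, next_state) pairs; R[s][a]: immediate reward
    Q = {(s, a): 0.0 for s in states for a in actions}
    while True:
        delta = 0.0
        for s in states:
            for a in actions:
                # Bellman optimality backup: r + gamma * E[ max_a' Q(s', a') ]
                v_next = sum(p * max(Q[(s2, a2)] for a2 in actions)
                             for p, s2 in P[s][a])
                new_q = R[s][a] + gamma * v_next
                delta = max(delta, abs(new_q - Q[(s, a)]))
                Q[(s, a)] = new_q
        if delta < tol:     # stop when no Q value moves more than tol
            return Q
```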
AlphaGo
The details of MCTS in AlphaGo:
1. Selection phase: search from the current node and select the action with the maximum Q + CUB, where CUB = P(s,a)/(1+N(s,a)) and P(s,a) is the prior from the policy network.
Note that selection walks along edges (s,a), not nodes (states). Also, the second step here is actually simulation, not expansion.
2. Simulation: evaluate the selected leaf from two perspectives, 1) a fast rollout value and 2) the current value function V(s).
3. Backup: once either evaluation finishes, back up along all edges of the searched path in the current tree: V_total += V_i and Z_total(s,a) += Z_i, incrementing the corresponding counts N_v(s,a) and N_z(s,a). The new Q = 0.5 · V_total/N_v + 0.5 · Z_total/N_z.
4. Expansion: if an edge is visited more than N_ts times, it is expanded into a node (initialized with N = 0) and added to the current search tree, so in the next selection phase the actions under this node are again chosen by max(Q + CUB). Since all initial Q(s,a) and N(s,a) under a newly created node are zero, the first selection under the new node follows argmax P(s,a) from the policy network, because then max(Q + CUB) = max(CUB) = max(P(s,a)), with CUB = P(s,a)/(1+N(s,a)). A bookkeeping sketch follows.
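A sketch of the selection rule and mixed backup described above; the class and field names are illustrative, and the 0.5/0.5 mixing and CUB form follow these notes rather than the exact AlphaGo paper:

```python
class Edge:
    """Per-edge statistics for the simplified AlphaGo-style search above."""
    def __init__(self, prior):
        self.P = prior                      # prior P(s, a) from the policy network
        self.N = 0                          # edge visit count, used in CUB
        self.N_v = self.N_z = 0             # counts for value-net / rollout backups
        self.V_total = self.Z_total = 0.0   # accumulated evaluations

    def Q(self):
        q_v = self.V_total / self.N_v if self.N_v else 0.0
        q_z = self.Z_total / self.N_z if self.N_z else 0.0
        return 0.5 * q_v + 0.5 * q_z        # the notes' 0.5/0.5 mixed estimate

    def cub(self):
        return self.P / (1 + self.N)        # CUB = P(s, a) / (1 + N(s, a))

def select(edges):
    # max(Q + CUB); with all Q = 0 and N = 0 this reduces to argmax P(s, a)
    return max(edges, key=lambda e: e.Q() + e.cub())

def backup(path, v, z):
    # back up a value-net estimate v and a rollout outcome z along the path
    for edge in path:
        edge.N += 1
        edge.V_total += v; edge.N_v += 1
        edge.Z_total += z; edge.N_z += 1
```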