1. Model-Free
1.1 Monte Carlo
1.1.1 Value Iteration
Monte Carlo control:
1. Derive an ε-greedy policy from the current Q.
2. Sample trajectories $(s_1, a_1, r_1, s_2, a_2, r_2, \dots)$; use first-visit MC.
3. Update:
$$Q(s,a) = \frac{1}{N(s,a)}\sum_{i} G_i(s,a)$$
where $G_i(s,a)$ is the return following the first visit to $(s,a)$ in episode $i$, and $N(s,a)$ is the number of such visits.
4. Improve the policy based on the updated Q values (a sketch of the full loop follows).
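A minimal sketch of this loop, assuming a tabular, episodic environment with a Gym-style `reset()`/`step(a)` interface; the interface and all names here are illustrative, not from the notes:

```python
import random
from collections import defaultdict

def mc_control(env, n_actions, episodes=10_000, eps=0.1, gamma=1.0):
    Q = defaultdict(float)   # Q(s, a) estimates
    N = defaultdict(int)     # first-visit counts N(s, a)
    for _ in range(episodes):
        # 1. epsilon-greedy policy from the current Q
        def policy(s):
            if random.random() < eps:
                return random.randrange(n_actions)
            return max(range(n_actions), key=lambda a: Q[(s, a)])
        # 2. sample one trajectory (assumed step() -> (next_state, reward, done))
        traj, s, done = [], env.reset(), False
        while not done:
            a = policy(s)
            s2, r, done = env.step(a)
            traj.append((s, a, r))
            s = s2
        # 3. first-visit MC update: average the return G per (s, a)
        G, returns = 0.0, []
        for s, a, r in reversed(traj):
            G = r + gamma * G
            returns.append((s, a, G))
        seen = set()
        for s, a, G in reversed(returns):    # forward order again
            if (s, a) not in seen:           # first visit only
                seen.add((s, a))
                N[(s, a)] += 1
                Q[(s, a)] += (G - Q[(s, a)]) / N[(s, a)]  # running mean
        # 4. improvement is implicit: the next episode's epsilon-greedy
        #    policy is built from the updated Q
    return Q
```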
1.1.2 Policy Iteration
1.1.3 Policy Gradient
Essential formula in the episodic setting.
The objective to maximize is an expectation over trajectories:
$$J(\theta) = \mathbb{E}_{\tau\sim\pi}\Big[r(\tau)\sum_{t=0}^{T}\log \pi(a_t \mid s_t)\Big]$$
Differentiating the log-probability term (with $r(\tau)$ treated as a constant) yields the REINFORCE gradient $\nabla_\theta J = \mathbb{E}_{\tau\sim\pi}\big[r(\tau)\sum_{t=0}^{T}\nabla_\theta\log \pi(a_t \mid s_t)\big]$.
Essential formula in the non-episodic setting (the policy gradient theorem):
$$\begin{aligned}
\nabla_\theta J(\theta) &= \nabla_\theta \mathbb{E}_{s\sim d^{\pi}}[V(s)] \\
&= \nabla_\theta \sum_{s \in \mathcal{S}} d^\pi(s) \sum_{a \in \mathcal{A}} Q^\pi(s, a)\,\pi_\theta(a \mid s) \\
&\propto \sum_{s \in \mathcal{S}} d^\pi(s) \sum_{a \in \mathcal{A}} Q^\pi(s, a)\,\nabla_\theta \pi_\theta(a \mid s) \\
&= \sum_{s \in \mathcal{S}} d^\pi(s) \sum_{a \in \mathcal{A}} \pi_\theta(a \mid s)\, Q^\pi(s, a)\,\frac{\nabla_\theta \pi_\theta(a \mid s)}{\pi_\theta(a \mid s)} \\
&= \mathbb{E}_\pi\big[Q^\pi(s, a)\,\nabla_\theta \ln \pi_\theta(a \mid s)\big] && \scriptstyle{\text{; because } (\ln x)' = 1/x}
\end{aligned}$$
where $\mathbb{E}_\pi$ refers to $\mathbb{E}_{s \sim d^\pi,\, a \sim \pi_\theta}$ when both the state and action distributions follow the policy $\pi_\theta$ (on-policy).
REINFORCE
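A minimal REINFORCE sketch in PyTorch; the `policy_net`, optimizer, and environment interface are illustrative assumptions, and it uses the common return-to-go $G_t$ in place of the whole-trajectory return $r(\tau)$:

```python
import torch

def reinforce_update(policy_net, optimizer, env, gamma=0.99):
    """One episode of REINFORCE: ascend E[ G_t * log pi(a_t | s_t) ]."""
    log_probs, rewards = [], []
    s, done = env.reset(), False
    while not done:
        logits = policy_net(torch.as_tensor(s, dtype=torch.float32))
        dist = torch.distributions.Categorical(logits=logits)
        a = dist.sample()
        log_probs.append(dist.log_prob(a))
        s, r, done = env.step(a.item())   # assumed (state, reward, done) interface
        rewards.append(r)
    # discounted return-to-go G_t for each time step
    G, returns = 0.0, []
    for r in reversed(rewards):
        G = r + gamma * G
        returns.append(G)
    returns.reverse()
    returns = torch.tensor(returns)
    # gradient ascent on J = minimize the negative surrogate loss
    loss = -(torch.stack(log_probs) * returns).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```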
1.2 TD
1.2.1 Value Iteration
(can be done in non-episodic environments)
SARSA
1. Non-episodic setting; needs the tuple $(s_t, a_t, r_t, s_{t+1}, a_{t+1})$.
2. Update: $Q(s_t,a_t) \leftarrow Q(s_t,a_t) + \alpha\big(r_t + \gamma\, Q(s_{t+1},a_{t+1}) - Q(s_t,a_t)\big)$
3. Improve the policy based on the updated Q values (see the tabular sketch below).
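A tabular sketch of this on-policy loop, assuming the same illustrative Gym-style `env` interface as earlier (`step(a)` returning `(next_state, reward, done)`):

```python
import random
from collections import defaultdict

def sarsa(env, n_actions, steps=100_000, alpha=0.1, gamma=0.99, eps=0.1):
    Q = defaultdict(float)

    def eps_greedy(s):
        if random.random() < eps:
            return random.randrange(n_actions)
        return max(range(n_actions), key=lambda a: Q[(s, a)])

    s = env.reset()
    a = eps_greedy(s)
    for _ in range(steps):
        s2, r, done = env.step(a)
        a2 = eps_greedy(s2)            # next action from the same policy: on-policy
        target = r if done else r + gamma * Q[(s2, a2)]
        Q[(s, a)] += alpha * (target - Q[(s, a)])
        if done:
            s = env.reset()
            a = eps_greedy(s)
        else:
            s, a = s2, a2
    return Q
```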
Q-learning (off-policy learning)
1. Non-episodic setting; needs the tuple $(s_t, a_t, r_t, s_{t+1})$.
2. Update: $Q(s_t,a_t) \leftarrow Q(s_t,a_t) + \alpha\big(r_t + \gamma \max_{a'} Q(s_{t+1},a') - Q(s_t,a_t)\big)$ (sketch below).
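The same loop for off-policy Q-learning; only the bootstrap target changes, using $\max_{a'}$ instead of the sampled next action (same illustrative interface as in the SARSA sketch above):

```python
import random
from collections import defaultdict

def q_learning(env, n_actions, steps=100_000, alpha=0.1, gamma=0.99, eps=0.1):
    Q = defaultdict(float)

    def eps_greedy(s):
        if random.random() < eps:
            return random.randrange(n_actions)
        return max(range(n_actions), key=lambda a: Q[(s, a)])

    s = env.reset()
    for _ in range(steps):
        a = eps_greedy(s)              # behavior policy is exploratory...
        s2, r, done = env.step(a)
        best_next = max(Q[(s2, a2)] for a2 in range(n_actions))
        target = r if done else r + gamma * best_next  # ...target is greedy: off-policy
        Q[(s, a)] += alpha * (target - Q[(s, a)])
        s = env.reset() if done else s2
    return Q
```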
DQN: Q-learning with a neural-network Q function, trained with experience replay and a target network.
1.2.2 Policy Gradient
Actor-Critic (the critic estimates Q)
Advantage Actor-Critic (the critic estimates V; the advantage form is sketched below)
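The advantage variant replaces $Q^\pi(s,a)$ in the gradient derived above with an advantage estimate, which lowers variance without biasing the gradient; a common one-step (TD) estimate, reusing the symbols from the sections above, is:

$$\nabla_\theta J(\theta) = \mathbb{E}_\pi\big[A(s,a)\,\nabla_\theta \ln \pi_\theta(a \mid s)\big], \qquad A(s,a) \approx r + \gamma V(s') - V(s)$$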
2. Model-Based
2.1 Monte Carlo Tree Search
Four steps: selection, expansion, simulation, backpropagation.
In selection, we must pick the most promising actions; the Q value can guide which branches are worth trying (a standard selection rule is given below).
In expansion, randomly expand the selected leaf node with a valid action.
In simulation, roll out from that action quickly to a terminal state using a fast, imitation-learned policy.
In backpropagation, propagate the simulated rewards back up the tree and evaluate each action by its expected reward (an expectimax tree).
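For the selection step, a standard concrete rule is UCT (an assumption here, since the notes only say to use the Q value), which adds an exploration bonus to Q:

$$a^* = \operatorname*{arg\,max}_a \left[ Q(s,a) + c \sqrt{\frac{\ln N(s)}{N(s,a)}} \right]$$

where $N(s)$ counts visits to the node, $N(s,a)$ counts visits to the edge, and $c$ trades off exploration against exploitation.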
2.2 Forward Search Tree
At a state, try the different actions, move to the resulting states, and try the actions at those states in turn; then evaluate the tree with expectimax (a recursive sketch follows).
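A minimal recursive sketch of this expectimax-style forward search, assuming a known model `model(s, a)` that returns a list of `(prob, reward, next_state)` transitions (all names illustrative):

```python
def forward_search(s, depth, actions, model, gamma=0.99):
    """Expectimax: max over actions, expectation over next states."""
    if depth == 0:
        return 0.0
    best = float("-inf")
    for a in actions:
        q = 0.0
        for prob, r, s2 in model(s, a):   # enumerate the known transition model
            q += prob * (r + gamma * forward_search(s2, depth - 1,
                                                    actions, model, gamma))
        best = max(best, q)               # max over actions at this state
    return best
```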
The Stanford class notes:
https://github.com/Zhenye-Na/reinforcement-learning-stanford
Bellman Optimality Equation
$$V_{\pi^*}(s) = \max_a Q(s,a)$$
$$\pi^*(s) = \operatorname*{arg\,max}_a Q(s,a)$$
$$V_{\pi}(s) = \sum_{a} \pi(a \mid s)\, Q(s,a)$$
$$Q_{\pi}(s,a) = r(s,a) + \gamma \sum_{s'} p(s' \mid s,a)\, V_{\pi}(s') = r(s,a) + \gamma \sum_{s'} p(s' \mid s,a) \sum_{a'} \pi(a' \mid s')\, Q_{\pi}(s',a')$$
$$Q_{\pi^*}(s,a) = r(s,a) + \gamma \sum_{s'} p(s' \mid s,a)\, V_{\pi^*}(s') = r(s,a) + \gamma \sum_{s'} p(s' \mid s,a) \max_{a'} Q(s',a')$$
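These optimality equations translate directly into tabular Q-value iteration, repeatedly applying the Bellman optimality backup until convergence; a minimal sketch assuming a known model given as transition lists `P[s][a]` and rewards `R[s][a]` (names illustrative):

```python
def q_value_iteration(states, actions, P, R, gamma=0.99, tol=1e-6):
    # P[s][a]: list of (prob, next_state) pairs; R[s][a]: immediate reward
    Q = {(s, a): 0.0 for s in states for a in actions}
    while True:
        delta = 0.0
        for s in states:
            for a in actions:
                # Bellman optimality backup: r + gamma * E[ max_a' Q(s', a') ]
                v_next = sum(p * max(Q[(s2, a2)] for a2 in actions)
                             for p, s2 in P[s][a])
                new_q = R[s][a] + gamma * v_next
                delta = max(delta, abs(new_q - Q[(s, a)]))
                Q[(s, a)] = new_q
        if delta < tol:     # stop when no Q value moves more than tol
            return Q
```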
AlphaGo
The details of MCTS in AlphaGo:
1. Selection phase: search from the current node and select the action with the maximum Q + CUB, where CUB = P(s,a)/(1+N(s,a)) and P(s,a) is the prior from the policy network.
Note that selection walks along edges (s,a), not nodes (states). Also, the second step here is actually simulation, not expansion.
2. Simulation: evaluate the selected leaf from two perspectives, 1) a fast rollout value and 2) the current value function V(s).
3. Backup: once either evaluation finishes, back up along all edges of the searched path in the current tree: V_total += V_i and Z_total(s,a) += Z_i, incrementing the corresponding counts N_v(s,a) and N_z(s,a). The new Q = 0.5 · V_total/N_v + 0.5 · Z_total/N_z.
4. Expansion: if an edge is visited more than N_ts times, it is expanded into a node (initialized with N = 0) and added to the current search tree, so in the next selection phase the actions under this node are again chosen by max(Q + CUB). Since all initial Q(s,a) and N(s,a) under a newly created node are zero, the first selection under the new node follows argmax P(s,a) from the policy network, because then max(Q + CUB) = max(CUB) = max(P(s,a)), with CUB = P(s,a)/(1+N(s,a)). A bookkeeping sketch follows.
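A sketch of the selection rule and mixed backup described above; the class and field names are illustrative, and the 0.5/0.5 mixing and CUB form follow these notes rather than the exact AlphaGo paper:

```python
class Edge:
    """Per-edge statistics for the simplified AlphaGo-style search above."""
    def __init__(self, prior):
        self.P = prior                      # prior P(s, a) from the policy network
        self.N = 0                          # edge visit count, used in CUB
        self.N_v = self.N_z = 0             # counts for value-net / rollout backups
        self.V_total = self.Z_total = 0.0   # accumulated evaluations

    def Q(self):
        q_v = self.V_total / self.N_v if self.N_v else 0.0
        q_z = self.Z_total / self.N_z if self.N_z else 0.0
        return 0.5 * q_v + 0.5 * q_z        # the notes' 0.5/0.5 mixed estimate

    def cub(self):
        return self.P / (1 + self.N)        # CUB = P(s, a) / (1 + N(s, a))

def select(edges):
    # max(Q + CUB); with all Q = 0 and N = 0 this reduces to argmax P(s, a)
    return max(edges, key=lambda e: e.Q() + e.cub())

def backup(path, v, z):
    # back up a value-net estimate v and a rollout outcome z along the path
    for edge in path:
        edge.N += 1
        edge.V_total += v; edge.N_v += 1
        edge.Z_total += z; edge.N_z += 1
```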