Rollout Algorithm
Monte Carlo Tree Search
Construct a search tree node by node based on the outcomes of simulations.
Construction of a search tree
(1) Selection: Starting from the root node $R$, recursively select the optimal child node until a leaf node $L$ is reached.
(2) Expansion: If $L$ is not a terminal node, create one or more child nodes and select one of them, $C$.
(3) Simulation: Run a simulated playout starting from $C$ until the game ends. → Rollout policy: usually a uniform random policy.
(4) Backpropagation: Propagate the result of the simulation back along the current action sequence, updating each node's total simulation gain $Q(v)$ and total number of visits $N(v)$.
(5) UCT: Upper Confidence Bound applied to Trees, the function used to select the next node to traverse among the visited nodes.
$$UCT(v_i, v) = \frac{Q(v_i)}{N(v_i)} + c\sqrt{\frac{\log N(v)}{N(v_i)}}$$
where $\frac{Q(v_i)}{N(v_i)}$ is the exploitation component, which can be viewed as the win-rate estimate of the child node $v_i$; $\sqrt{\frac{\log N(v)}{N(v_i)}}$ is the exploration component; and $c$ is the coefficient that trades off exploitation against exploration (the selection degenerates to a greedy algorithm when $c=0$).
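To make the five phases concrete, here is a minimal Python sketch of one MCTS iteration with UCT selection. The `Node` structure and the `legal_moves`, `step`, `is_terminal`, and `payoff` callbacks are hypothetical stand-ins for a real game interface, not part of any library.

```python
# Minimal sketch of one MCTS iteration; game interface is hypothetical.
import math
import random

class Node:
    def __init__(self, state, parent=None):
        self.state = state          # game state at this node
        self.parent = parent
        self.children = []
        self.N = 0                  # total number of visits, N(v)
        self.Q = 0.0                # total simulation gain, Q(v)

def uct(child, parent, c=1.41):
    """UCT score: exploitation term plus exploration term."""
    if child.N == 0:
        return float("inf")         # visit unvisited children first
    return child.Q / child.N + c * math.sqrt(math.log(parent.N) / child.N)

def select(root):
    """(1) Selection: from root R, follow the best child down to a leaf L."""
    node = root
    while node.children:
        node = max(node.children, key=lambda ch: uct(ch, node))
    return node

def expand(leaf, legal_moves, step):
    """(2) Expansion: if L is not terminal, add children and pick one, C."""
    for move in legal_moves(leaf.state):
        leaf.children.append(Node(step(leaf.state, move), parent=leaf))
    return random.choice(leaf.children) if leaf.children else leaf

def simulate(node, legal_moves, step, is_terminal, payoff):
    """(3) Simulation: uniform-random rollout from C until the game ends."""
    state = node.state
    while not is_terminal(state):
        state = step(state, random.choice(legal_moves(state)))
    return payoff(state)

def backpropagate(node, gain):
    """(4) Backpropagation: update Q(v) and N(v) along the path to the root."""
    while node is not None:
        node.N += 1
        node.Q += gain
        node = node.parent
```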
Essence of Rollout
The Rollout algorithm is a decision-time planning algorithm based on MC control. Unlike MC control, which estimates the whole value function in order to find the optimal policy $\pi^*$, the Rollout algorithm estimates only the values at the current state (planning at decision time).
For each state, the Rollout policy chooses the action corresponding to the maximum estimate (the new policy $\pi'$), which satisfies:
$$q_{\pi}(s, \pi'(s)) \ge v_{\pi}(s)$$
Therefore, the essence of the Rollout algorithm is to improve the current policy, not to find the optimal policy $\pi^*$.
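A minimal sketch of this decision-time planning loop, assuming a hypothetical simulator API `sample_step(s, a) -> (next_state, reward, done)` and a given `base_policy`:

```python
# Decision-time rollout: Monte Carlo estimate of q_pi(s, a) for each action,
# then act greedily. The simulator API here is an assumed stand-in.
def rollout_action(state, actions, sample_step, base_policy,
                   n_trajectories=100, max_len=200, gamma=1.0):
    """Return the greedy action, i.e. the improved policy pi'(s)."""
    best_action, best_q = None, float("-inf")
    for a in actions(state):                      # factor (1): |A(s)|
        total = 0.0
        for _ in range(n_trajectories):           # factor (4): trajectory count
            s, r, done = sample_step(state, a)    # try action a first
            ret, discount, t = r, gamma, 0
            while not done and t < max_len:       # factor (2): trajectory length
                s, r, done = sample_step(s, base_policy(s))  # factor (3)
                ret += discount * r
                discount *= gamma
                t += 1
            total += ret
        q_hat = total / n_trajectories
        if q_hat > best_q:
            best_action, best_q = a, q_hat
    return best_action
```

The parameters `actions`, `max_len`, `base_policy`, and `n_trajectories` correspond directly to the four efficiency factors listed in the next section.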
The Efficiency of Rollout
The efficiency of Rollout is constrained by the time available for a single decision, which depends on:
(1) the number of possible actions $|\mathcal{A}(s)|$;
(2) the length of each simulated trajectory;
(3) the execution time of the base policy;
(4) the number of simulated trajectories needed for a good value estimate.
Geometric interpretation of Rollout
According to Bellman's equation, each policy $\mu$ defines the linear function $T_{\mu}J$, whose value at $x$ is given by:
$$(T_{\mu}J)(x) = E\left\{ g(x, \mu(x), w) + \alpha J\big(f(x, \mu(x), w)\big) \right\}, \quad \text{for all } x$$
And the Bellman operator $T$ has value at state $x$ given by:
$$(TJ)(x) = \min_{u \in U(x)} E\left\{ g(x, u, w) + \alpha J\big(f(x, u, w)\big) \right\}, \quad \text{for all } x$$
which can also be written as $TJ = \min_{\mu} T_{\mu}J$.[^1]
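As a small numeric illustration (model data made up for the demo, not from the paper): for a two-state, two-action discounted problem, each $T_{\mu}J$ is affine in $J$, and $TJ$ is their pointwise minimum over all policies.

```python
# Numeric check that TJ = min_mu T_mu J on a tiny made-up MDP.
import numpy as np

alpha = 0.9                                   # discount factor
g = np.array([[1.0, 4.0],                     # g[x, u]: expected stage cost
              [2.0, 0.5]])
P = {0: np.array([[0.8, 0.2], [0.3, 0.7]]),   # P[u][x, x']: transition probs
     1: np.array([[0.1, 0.9], [0.6, 0.4]])}

def T_mu(J, mu):
    """(T_mu J)(x) = g(x, mu(x)) + alpha * E[J(x')] under policy mu."""
    return np.array([g[x, mu[x]] + alpha * P[mu[x]][x] @ J for x in range(2)])

def T(J):
    """(TJ)(x) = min_u { g(x, u) + alpha * E[J(x')] }."""
    return np.array([min(g[x, u] + alpha * P[u][x] @ J for u in (0, 1))
                     for x in range(2)])

J = np.zeros(2)
policies = [(u0, u1) for u0 in (0, 1) for u1 in (0, 1)]
stacked = np.array([T_mu(J, mu) for mu in policies])
assert np.allclose(T(J), stacked.min(axis=0))  # TJ = min_mu T_mu J
```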
Truncated Rollout
Truncated Rollout = $m$ steps of value iteration with the base policy $\mu$ + a terminal cost function approximation $\tilde{J} \approx J_{\mu}$.
(1) Truncated Rollout with one-step lookahead:
$$T_{\tilde{\mu}}(T_{\mu}^{m}\tilde{J}) = T(T_{\mu}^{m}\tilde{J})$$
that is, $\tilde{\mu}$ attains the minimum in the one-step lookahead minimization applied to $T_{\mu}^{m}\tilde{J}$.
(2) Truncated Rollout with $l$-step lookahead:
$$T_{\tilde{\mu}}(T^{l-1}T_{\mu}^{m}\tilde{J}) = T(T^{l-1}T_{\mu}^{m}\tilde{J})$$
The role of $l$: as $l$ grows, the effective starting point $T^{l-1}\tilde{J}$ approaches $J^*$.
The role of $m$: as $m$ grows, the starting point $T_{\mu}^{m}\tilde{J}$ approaches $J_{\mu}$.
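A sketch of truncated rollout with one-step lookahead, under the same made-up two-state model as above: apply $T_{\mu}$ $m$ times to a terminal cost guess $\tilde{J}$, then take one minimization step to obtain $\tilde{\mu}$.

```python
# Truncated rollout sketch: m applications of T_mu, then one lookahead step.
import numpy as np

alpha, m = 0.9, 5
g = np.array([[1.0, 4.0],
              [2.0, 0.5]])
P = {0: np.array([[0.8, 0.2], [0.3, 0.7]]),
     1: np.array([[0.1, 0.9], [0.6, 0.4]])}
mu = (0, 1)                                   # base policy
J_tilde = np.zeros(2)                         # terminal cost approximation

def T_mu(J):
    return np.array([g[x, mu[x]] + alpha * P[mu[x]][x] @ J for x in range(2)])

J = J_tilde
for _ in range(m):                            # compute T_mu^m J_tilde
    J = T_mu(J)

# One-step lookahead on top: mu_tilde(x) attains the minimum in (TJ)(x),
# i.e. T_mu_tilde(T_mu^m J_tilde) = T(T_mu^m J_tilde).
mu_tilde = tuple(int(np.argmin([g[x, u] + alpha * P[u][x] @ J
                                for u in (0, 1)])) for x in range(2))
print("rollout policy mu_tilde:", mu_tilde)
```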
Receding horizon in MPC
Problem formulation
Consider a finite-horizon $l$-stage optimal control problem involving the same cost function and the requirement that the state after $l$ steps is driven to 0. This is the problem:
$$\min_{u_t,\, t=k,\dots,k+l-1} \sum_{t=k}^{k+l-1} g(x_t, u_t)$$
subject to the system equation constraints:
$$x_{t+1} = f(x_t, u_t), \quad t = k, \dots, k+l-1$$
the control constraints:
$$u_t \in U(x_t), \quad t = k, \dots, k+l-1$$
and the terminal state constraint:
$$x_{k+l} = 0$$
If $\{\tilde{u}_k, \dots, \tilde{u}_{k+l-1}\}$ is the optimal control sequence of this problem, we apply $\tilde{u}_k$ and discard the other controls $\tilde{u}_{k+1}, \dots, \tilde{u}_{k+l-1}$.
Once the next state $x_{k+1}$ is revealed, we repeat the process at the next stage.
In summary, the receding horizon in MPC is equivalent to $l$-step lookahead rollout.
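A toy receding-horizon loop illustrating this scheme; the integer-state dynamics, quadratic costs, and three-element control set are all invented for the demo. Each stage solves the $l$-stage problem with terminal constraint $x_{k+l} = 0$ by brute force, applies only the first control, and re-plans.

```python
# Toy receding-horizon MPC: brute-force l-stage planning, apply first control.
from itertools import product

def f(x, u):                         # system equation: x_{t+1} = f(x_t, u_t)
    return x + u

def g(x, u):                         # stage cost g(x_t, u_t)
    return x * x + u * u

def mpc_control(x_k, l=3, controls=(-1, 0, 1)):
    """Minimize sum_{t=k}^{k+l-1} g(x_t, u_t) subject to x_{k+l} = 0."""
    best_seq, best_cost = None, float("inf")
    for seq in product(controls, repeat=l):
        x, cost = x_k, 0.0
        for u in seq:
            cost += g(x, u)
            x = f(x, u)
        if x == 0 and cost < best_cost:          # terminal state constraint
            best_seq, best_cost = seq, cost
    assert best_seq is not None, "no feasible control sequence"
    return best_seq[0]               # apply u_k; discard u_{k+1}, ..., u_{k+l-1}

x = 3
for k in range(6):                   # re-plan once each next state is revealed
    u = mpc_control(x)
    x = f(x, u)
    print(f"k={k}: applied u={u}, next state x={x}")
```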
References:
8.10 Rollout Algorithms
Monte Carlo Tree Search
Bertsekas, Dimitri. "Newton's method for reinforcement learning and model predictive control." Results in Control and Optimization 7 (2022): 100121.
[^1]: $T$ and $T_{\mu}$ are the Bellman operators, defined to give a compact expression.