Rollout Algorithm
Monte Carlo Tree Search
Construct a search tree node by node based on the outcomes of simulations.
Construction of a search tree
(1) Selection: Starting from the root node $R$, recursively select the optimal child node until a leaf node $L$ is reached.
(2) Expansion: If $L$ is not a terminal node, create one or more child nodes and select one of them, $C$.
(3) Simulation: Run a simulated playout starting from $C$ until the game ends. → Rollout policy: usually a uniform random policy.
(4) Backpropagation: Propagate the result of the simulation back along the current action sequence, updating each node's total simulation gain $Q(v)$ and total number of visits $N(v)$.
(5) UCT: Upper Confidence Bound applied to Trees, the function used to select the next node to traverse among the visited nodes.
$$UCT(v_i, v) = \frac{Q(v_i)}{N(v_i)} + c\sqrt{\frac{\log N(v)}{N(v_i)}}$$
where $\frac{Q(v_i)}{N(v_i)}$ is the exploitation component, which can be viewed as the win-rate estimate of the child node $v_i$; $\sqrt{\frac{\log N(v)}{N(v_i)}}$ is the exploration component; and $c$ is the coefficient that trades off exploitation against exploration (the selection degenerates to a greedy algorithm when $c=0$).
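To make the five phases concrete, here is a minimal Python sketch of one MCTS iteration with UCT selection. The `Node` structure and the `legal_moves`, `step`, `is_terminal`, and `payoff` callbacks are hypothetical stand-ins for a real game interface, not part of any library.

```python
# Minimal sketch of one MCTS iteration; game interface is hypothetical.
import math
import random

class Node:
    def __init__(self, state, parent=None):
        self.state = state          # game state at this node
        self.parent = parent
        self.children = []
        self.N = 0                  # total number of visits, N(v)
        self.Q = 0.0                # total simulation gain, Q(v)

def uct(child, parent, c=1.41):
    """UCT score: exploitation term plus exploration term."""
    if child.N == 0:
        return float("inf")         # visit unvisited children first
    return child.Q / child.N + c * math.sqrt(math.log(parent.N) / child.N)

def select(root):
    """(1) Selection: from root R, follow the best child down to a leaf L."""
    node = root
    while node.children:
        node = max(node.children, key=lambda ch: uct(ch, node))
    return node

def expand(leaf, legal_moves, step):
    """(2) Expansion: if L is not terminal, add children and pick one, C."""
    for move in legal_moves(leaf.state):
        leaf.children.append(Node(step(leaf.state, move), parent=leaf))
    return random.choice(leaf.children) if leaf.children else leaf

def simulate(node, legal_moves, step, is_terminal, payoff):
    """(3) Simulation: uniform-random rollout from C until the game ends."""
    state = node.state
    while not is_terminal(state):
        state = step(state, random.choice(legal_moves(state)))
    return payoff(state)

def backpropagate(node, gain):
    """(4) Backpropagation: update Q(v) and N(v) along the path to the root."""
    while node is not None:
        node.N += 1
        node.Q += gain
        node = node.parent
```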
Essence of Rollout
The Rollout algorithm is a decision-time planning algorithm based on MC control. Unlike MC control, which estimates the whole value function in order to find the optimal policy $\pi^*$, the Rollout algorithm estimates only the values at the current state (planning at decision time).
For each state, the Rollout policy chooses the action corresponding to the maximum estimate (the new policy $\pi'$), which satisfies:
$$q_{\pi}(s, \pi'(s)) \ge v_{\pi}(s)$$
Therefore, the essence of the Rollout algorithm is to improve the current policy, not to find the optimal policy $\pi^*$.
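A minimal sketch of this decision-time planning loop, assuming a hypothetical simulator API `sample_step(s, a) -> (next_state, reward, done)` and a given `base_policy`:

```python
# Decision-time rollout: Monte Carlo estimate of q_pi(s, a) for each action,
# then act greedily. The simulator API here is an assumed stand-in.
def rollout_action(state, actions, sample_step, base_policy,
                   n_trajectories=100, max_len=200, gamma=1.0):
    """Return the greedy action, i.e. the improved policy pi'(s)."""
    best_action, best_q = None, float("-inf")
    for a in actions(state):                      # factor (1): |A(s)|
        total = 0.0
        for _ in range(n_trajectories):           # factor (4): trajectory count
            s, r, done = sample_step(state, a)    # try action a first
            ret, discount, t = r, gamma, 0
            while not done and t < max_len:       # factor (2): trajectory length
                s, r, done = sample_step(s, base_policy(s))  # factor (3)
                ret += discount * r
                discount *= gamma
                t += 1
            total += ret
        q_hat = total / n_trajectories
        if q_hat > best_q:
            best_action, best_q = a, q_hat
    return best_action
```

The parameters `actions`, `max_len`, `base_policy`, and `n_trajectories` correspond directly to the four efficiency factors listed in the next section.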
The Efficiency of Rollout
The efficiency of Rollout is constrained by the time available for a single decision, which depends on:
(1) the number of possible actions $|\mathcal{A}(s)|$;
(2) the length of each simulated trajectory;
(3) the execution time of the base policy;
(4) the number of simulated trajectories needed for a good value estimate.
Geometric interpretation of Rollout
According to Bellman's equation, each policy $\mu$ defines the linear function $T_{\mu}J$, whose value at $x$ is given by:
$$(T_{\mu}J)(x) = E\left\{ g(x, \mu(x), w) + \alpha J\big(f(x, \mu(x), w)\big) \right\}, \quad \text{for all } x$$
And the Bellman operator $T$ has value at state $x$ given by:
$$(TJ)(x) = \min_{u \in U(x)} E\left\{ g(x, u, w) + \alpha J\big(f(x, u, w)\big) \right\}, \quad \text{for all } x$$
which can also be written as $TJ = \min_{\mu} T_{\mu}J$.[^1]
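As a small numeric illustration (model data made up for the demo, not from the paper): for a two-state, two-action discounted problem, each $T_{\mu}J$ is affine in $J$, and $TJ$ is their pointwise minimum over all policies.

```python
# Numeric check that TJ = min_mu T_mu J on a tiny made-up MDP.
import numpy as np

alpha = 0.9                                   # discount factor
g = np.array([[1.0, 4.0],                     # g[x, u]: expected stage cost
              [2.0, 0.5]])
P = {0: np.array([[0.8, 0.2], [0.3, 0.7]]),   # P[u][x, x']: transition probs
     1: np.array([[0.1, 0.9], [0.6, 0.4]])}

def T_mu(J, mu):
    """(T_mu J)(x) = g(x, mu(x)) + alpha * E[J(x')] under policy mu."""
    return np.array([g[x, mu[x]] + alpha * P[mu[x]][x] @ J for x in range(2)])

def T(J):
    """(TJ)(x) = min_u { g(x, u) + alpha * E[J(x')] }."""
    return np.array([min(g[x, u] + alpha * P[u][x] @ J for u in (0, 1))
                     for x in range(2)])

J = np.zeros(2)
policies = [(u0, u1) for u0 in (0, 1) for u1 in (0, 1)]
stacked = np.array([T_mu(J, mu) for mu in policies])
assert np.allclose(T(J), stacked.min(axis=0))  # TJ = min_mu T_mu J
```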
Truncated Rollout
Truncated Rollout = $m$ steps of value iteration with the base policy $\mu$ + a terminal cost function approximation $\tilde{J} \approx J_{\mu}$.
(1) Truncated Rollout with one-step lookahead:
$$T_{\tilde{\mu}}(T_{\mu}^{m}\tilde{J}) = T(T_{\mu}^{m}\tilde{J})$$
that is, $\tilde{\mu}$ attains the minimum in the one-step lookahead minimization applied to $T_{\mu}^{m}\tilde{J}$.
(2) Truncated Rollout with $l$-step lookahead:
$$T_{\tilde{\mu}}(T^{l-1}T_{\mu}^{m}\tilde{J}) = T(T^{l-1}T_{\mu}^{m}\tilde{J})$$
The role of $l$: as $l$ grows, the effective starting point $T^{l-1}\tilde{J}$ approaches $J^*$.
The role of $m$: as $m$ grows, the starting point $T_{\mu}^{m}\tilde{J}$ approaches $J_{\mu}$.
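A sketch of truncated rollout with one-step lookahead, under the same made-up two-state model as above: apply $T_{\mu}$ $m$ times to a terminal cost guess $\tilde{J}$, then take one minimization step to obtain $\tilde{\mu}$.

```python
# Truncated rollout sketch: m applications of T_mu, then one lookahead step.
import numpy as np

alpha, m = 0.9, 5
g = np.array([[1.0, 4.0],
              [2.0, 0.5]])
P = {0: np.array([[0.8, 0.2], [0.3, 0.7]]),
     1: np.array([[0.1, 0.9], [0.6, 0.4]])}
mu = (0, 1)                                   # base policy
J_tilde = np.zeros(2)                         # terminal cost approximation

def T_mu(J):
    return np.array([g[x, mu[x]] + alpha * P[mu[x]][x] @ J for x in range(2)])

J = J_tilde
for _ in range(m):                            # compute T_mu^m J_tilde
    J = T_mu(J)

# One-step lookahead on top: mu_tilde(x) attains the minimum in (TJ)(x),
# i.e. T_mu_tilde(T_mu^m J_tilde) = T(T_mu^m J_tilde).
mu_tilde = tuple(int(np.argmin([g[x, u] + alpha * P[u][x] @ J
                                for u in (0, 1)])) for x in range(2))
print("rollout policy mu_tilde:", mu_tilde)
```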
Receding horizon in MPC
Problem formulation
Consider a finite-horizon $l$-stage optimal control problem involving the same cost function and the requirement that the state after $l$ steps is driven to 0. This is the problem:
$$\min_{u_t,\, t=k,\dots,k+l-1} \sum_{t=k}^{k+l-1} g(x_t, u_t)$$
subject to the system equation constraints:
$$x_{t+1} = f(x_t, u_t), \quad t = k, \dots, k+l-1$$
the control constraints:
$$u_t \in U(x_t), \quad t = k, \dots, k+l-1$$
and the terminal state constraint:
$$x_{k+l} = 0$$
If $\{\tilde{u}_k, \dots, \tilde{u}_{k+l-1}\}$ is the optimal control sequence of this problem, we apply $\tilde{u}_k$ and discard the other controls $\tilde{u}_{k+1}, \dots, \tilde{u}_{k+l-1}$.
Once the next state $x_{k+1}$ is revealed, we repeat the process at the next stage.
In summary, the receding horizon in MPC is equivalent to $l$-step lookahead rollout.
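A toy receding-horizon loop illustrating this scheme; the integer-state dynamics, quadratic costs, and three-element control set are all invented for the demo. Each stage solves the $l$-stage problem with terminal constraint $x_{k+l} = 0$ by brute force, applies only the first control, and re-plans.

```python
# Toy receding-horizon MPC: brute-force l-stage planning, apply first control.
from itertools import product

def f(x, u):                         # system equation: x_{t+1} = f(x_t, u_t)
    return x + u

def g(x, u):                         # stage cost g(x_t, u_t)
    return x * x + u * u

def mpc_control(x_k, l=3, controls=(-1, 0, 1)):
    """Minimize sum_{t=k}^{k+l-1} g(x_t, u_t) subject to x_{k+l} = 0."""
    best_seq, best_cost = None, float("inf")
    for seq in product(controls, repeat=l):
        x, cost = x_k, 0.0
        for u in seq:
            cost += g(x, u)
            x = f(x, u)
        if x == 0 and cost < best_cost:          # terminal state constraint
            best_seq, best_cost = seq, cost
    assert best_seq is not None, "no feasible control sequence"
    return best_seq[0]               # apply u_k; discard u_{k+1}, ..., u_{k+l-1}

x = 3
for k in range(6):                   # re-plan once each next state is revealed
    u = mpc_control(x)
    x = f(x, u)
    print(f"k={k}: applied u={u}, next state x={x}")
```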
References:
8.10 Rollout Algorithms
Monte Carlo Tree Search
Bertsekas, Dimitri. "Newton's method for reinforcement learning and model predictive control." Results in Control and Optimization 7 (2022): 100121.
[^1]: $T$ and $T_{\mu}$ are the Bellman operators, defined to give a compact expression.