Rollout Algorithm

Monte Carlo Tree Search

Construct a search tree node by node, based on the outcomes of simulated playouts.

Construction of a search tree

(1) Selection: starting from the root node $R$, recursively select the optimal child node until a leaf node $L$ is reached.
(2) Expansion: if $L$ is not a terminal node, create one or more child nodes and select one of them, $C$.
(3) Simulation: run a simulated playout starting from $C$ until the game ends → the rollout policy is usually a uniform random distribution.
(4) Backpropagation: propagate the simulation result back along the current action sequence, i.e., update the total simulation return $Q(v)$ and the total visit count $N(v)$ of each node on the path.
(5) UCT (Upper Confidence Bound applied to Trees): the function used to select the next node to traverse among the visited nodes (a minimal code sketch is given below).
$$UCT(v_i, v) = \frac{Q(v_i)}{N(v_i)} + c\sqrt{\frac{\log N(v)}{N(v_i)}}$$
where $\frac{Q(v_i)}{N(v_i)}$ is the exploitation component, which can be viewed as the win-rate estimate of the child node $v_i$; $\sqrt{\frac{\log N(v)}{N(v_i)}}$ is the exploration component; and $c$ is the trade-off coefficient between exploitation and exploration (the selection becomes greedy when $c = 0$).

(Figure: MCTS)
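The steps above can be combined into a short sketch. The following Python snippet is a minimal, illustrative implementation of UCT-based MCTS; it assumes a hypothetical `state` interface with `legal_actions()`, `step(action)`, `is_terminal()` and `reward()` methods, and is not tied to any particular game or library.

```python
import math
import random

class Node:
    def __init__(self, state, parent=None):
        self.state = state
        self.parent = parent
        self.children = {}                         # action -> child Node
        self.untried = [] if state.is_terminal() else list(state.legal_actions())
        self.Q = 0.0                               # total simulation return
        self.N = 0                                 # visit count

def uct(child, parent, c=1.41):
    """UCT score: win-rate estimate plus exploration bonus."""
    return child.Q / child.N + c * math.sqrt(math.log(parent.N) / child.N)

def mcts(root, n_iter=1000):
    for _ in range(n_iter):
        node = root
        # (1) Selection: descend by UCT while fully expanded and non-terminal.
        while not node.untried and node.children:
            node = max(node.children.values(), key=lambda ch: uct(ch, node))
        # (2) Expansion: create one child C for a randomly chosen untried action.
        if node.untried:
            action = node.untried.pop(random.randrange(len(node.untried)))
            child = Node(node.state.step(action), parent=node)
            node.children[action] = child
            node = child
        # (3) Simulation: roll out with a uniform random policy until the game ends.
        sim = node.state
        while not sim.is_terminal():
            sim = sim.step(random.choice(sim.legal_actions()))
        result = sim.reward()
        # (4) Backpropagation: update Q(v) and N(v) along the path back to the root.
        while node is not None:
            node.N += 1
            node.Q += result
            node = node.parent
    # Finally, act with the most-visited child of the root.
    return max(root.children.items(), key=lambda kv: kv[1].N)[0]
```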

Essence of Rollout

The Rollout algorithm is a decision-time planning algorithm based on MC control. Unlike MC control, which estimates the value function over all states in order to find the optimal policy $\pi^*$, the Rollout algorithm estimates only the values relevant to the current state (planning at decision time).
For the current state, the Rollout policy chooses the action corresponding to the maximum estimate (giving a new policy $\pi'$), which satisfies:
$$q_{\pi}(s, \pi'(s)) \ge v_{\pi}(s)$$
Therefore, the essence of the Rollout algorithm is to improve the current policy, not to find the optimal policy $\pi^*$.
(Figure: Rollout)
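As a concrete illustration of planning at decision time, the sketch below performs a single rollout decision. It assumes a hypothetical simulator `step(state, action) -> (next_state, reward, done)`, an action set `actions(state)`, and a `base_policy(state)`; all of these names are placeholders, not a real API.

```python
def mc_return(step, state, base_policy, horizon=100):
    """Simulate one episode under the base policy and return its (undiscounted) return."""
    ret, s = 0.0, state
    for _ in range(horizon):
        s, r, done = step(s, base_policy(s))
        ret += r
        if done:
            break
    return ret

def rollout_action(step, state, base_policy, actions, n_sim=20, horizon=100):
    """Estimate q_pi(s, a) by Monte Carlo simulation and act greedily.

    Acting greedily with respect to q_pi is one step of policy improvement:
    the resulting pi'(s) satisfies q_pi(s, pi'(s)) >= v_pi(s).
    """
    best_a, best_q = None, float("-inf")
    for a in actions(state):
        q = 0.0
        for _ in range(n_sim):
            s1, r, done = step(state, a)
            q += r + (0.0 if done else mc_return(step, s1, base_policy, horizon))
        q /= n_sim
        if q > best_q:
            best_a, best_q = a, q
    return best_a
```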

The Efficiency of Rollout

The efficiency of rollout is constrained by the time available for a single decision, which depends on the following factors (a rough budget estimate follows the list):
(1) The number of possible actions $|\mathcal{A}(s)|$.
(2) The length of the simulated trajectories.
(3) The execution time of the rollout policy.
(4) The number of simulated trajectories needed for a good value estimate.
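Multiplying these factors gives a rough per-decision simulation budget; the numbers below are purely illustrative.

```python
# Back-of-envelope budget for one rollout decision (illustrative numbers only):
n_actions       = 10      # |A(s)|
n_trajectories  = 20      # simulated trajectories per action
trajectory_len  = 100     # steps per simulated trajectory
t_step_s        = 1e-4    # seconds per policy call / simulator step

decision_time = n_actions * n_trajectories * trajectory_len * t_step_s
print(f"approx. {decision_time:.1f} s per decision")   # approx. 2.0 s per decision
```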

Geometric interpretation of Rollout

According to Bellman's equation, each policy $\mu$ defines the linear function $T_{\mu}J$, whose value at $x$ is given by:
$$(T_{\mu}J)(x) = E\left\{ g(x, \mu(x), w) + \alpha J(f(x, \mu(x), w)) \right\}, \quad \text{for all } x$$
And the Bellman operator $T$ has value at state $x$ given by:
$$(TJ)(x) = \min_{u \in U(x)} E\left\{ g(x, u, w) + \alpha J(f(x, u, w)) \right\}, \quad \text{for all } x,$$
which can also be written as $TJ = \min_{\mu} T_{\mu}J$ [1].

(Figure: Geometric interpretation of Rollout)
(Figure: Policy iteration with rollout)
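For a finite MDP, these operators can be written out directly. The sketch below is a minimal illustration, assuming a transition tensor `P[u, x, x']`, an expected stage-cost array `g[u, x]`, a policy array `mu[x]`, and a discount factor `alpha`; these arrays are assumed inputs, not part of the reference.

```python
import numpy as np

def T_mu(J, mu, P, g, alpha):
    """(T_mu J)(x) = E{ g(x, mu(x), w) + alpha * J(f(x, mu(x), w)) } -- linear (affine) in J."""
    n = len(J)
    return np.array([g[mu[x], x] + alpha * P[mu[x], x] @ J for x in range(n)])

def T(J, P, g, alpha):
    """(T J)(x) = min_u E{ g(x, u, w) + alpha * J(f(x, u, w)) } = min_mu (T_mu J)(x)."""
    n_actions = P.shape[0]
    # Evaluate the one-step value for every control u and take the pointwise minimum.
    return np.min([g[u] + alpha * P[u] @ J for u in range(n_actions)], axis=0)
```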

Truncated Rollout

Truncated Rollout = $m$ steps of value iteration with the base policy $\mu$ + a terminal cost function approximation $\tilde{J}$ (approximating $J_{\mu}$).
(1) Truncated Rollout with one-step lookahead
$$T_{\tilde{\mu}}(T_{\mu}^{m}\tilde{J}) = T(T_{\mu}^{m}\tilde{J})$$
that is, $\tilde{\mu}$ attains the minimum in the one-step lookahead applied to $T_{\mu}^{m}\tilde{J}$.
(Figure: Truncated Rollout with one-step lookahead)

(2) Truncated Rollout with $l$-step lookahead
$$T_{\tilde{\mu}}(T^{l-1}T_{\mu}^{m}\tilde{J}) = T(T^{l-1}T_{\mu}^{m}\tilde{J})$$

The role of $l$: as $l$ increases, the lookahead starting point $T^{l-1}J$ moves toward $J^*$.
The role of $m$: as $m$ increases, the starting point $T_{\mu}^{m}\tilde{J}$ moves toward $J_{\mu}$.
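Reusing the operators `T` and `T_mu` from the finite-MDP sketch above (same assumed `P`, `g`, `mu`, `alpha`), a truncated rollout policy with $l$-step lookahead can be illustrated as follows:

```python
import numpy as np

def truncated_rollout_policy(J_tilde, mu, P, g, alpha, m=3, ell=1):
    # m applications of the base-policy operator: T_mu^m J_tilde (truncated rollout with mu).
    J = np.array(J_tilde, dtype=float)
    for _ in range(m):
        J = T_mu(J, mu, P, g, alpha)
    # (l - 1) applications of the optimal operator: T^{l-1} T_mu^m J_tilde.
    for _ in range(ell - 1):
        J = T(J, P, g, alpha)
    # The rollout policy mu~ attains the minimum in the final lookahead step.
    n_actions = P.shape[0]
    lookahead = np.array([g[u] + alpha * P[u] @ J for u in range(n_actions)])
    return lookahead.argmin(axis=0)   # mu~(x) for every state x
```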

Receding horizon in MPC

Problem formulation

Consider a finite-horizon $l$-stage optimal control problem involving the same cost function, with the additional requirement that the state after $l$ steps be driven to 0. This is the problem:
$$\min_{u_t,\; t=k,\dots,k+l-1} \; \sum^{k+l-1}_{t=k} g(x_t, u_t)$$
Subject to the system equation constraints:
$$x_{t+1} = f(x_t, u_t), \quad t = k, \dots, k+l-1,$$
The control constraints:
$$u_t \in U(x_t), \quad t = k, \dots, k+l-1,$$
and the terminal state constraint:
$$x_{k+l} = 0.$$
If $\{\tilde{u}_k, \dots, \tilde{u}_{k+l-1}\}$ is the optimal control sequence of this problem, we apply $\tilde{u}_k$ and discard the other controls $\tilde{u}_{k+1}, \dots, \tilde{u}_{k+l-1}$.
At the next stage, once the next state $x_{k+1}$ is revealed, we repeat the process.

In summary, the receding horizon scheme in MPC is equivalent to $l$-step lookahead rollout.
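A minimal sketch of this receding-horizon loop follows. It assumes a hypothetical solver `solve_l_stage(x, l)` that returns the optimal control sequence $[\tilde{u}_k, \dots, \tilde{u}_{k+l-1}]$ for the $l$-stage problem above (including the terminal constraint), and a system model `f(x, u)`; both are placeholders, not a real API.

```python
def mpc(x0, f, solve_l_stage, l=10, n_steps=100):
    x, applied = x0, []
    for _ in range(n_steps):
        u_seq = solve_l_stage(x, l)   # solve the l-stage problem from the current state
        u = u_seq[0]                  # apply only the first control ...
        applied.append(u)             # ... and discard u_{k+1}, ..., u_{k+l-1}
        x = f(x, u)                   # the next state is revealed; repeat
    return applied
```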

References:
8.10 Rollout Algorithms (8.10 rollout算法)
Monte Carlo Tree Search (蒙特卡洛树搜索)
Bertsekas, Dimitri. "Newton's method for reinforcement learning and model predictive control." Results in Control and Optimization 7 (2022): 100121.


[1] $T$ and $T_{\mu}$ are the Bellman operators, defined here to give a compact expression.
