Model Predictive Control

Model predictive control (MPC) is a control-optimization approach: given a model of the environment, it searches for the action trajectory that achieves the highest reward.

Model description

  1. We know the environment model: $s_{t+1}=f(s_t,a_t)$
  2. We know the initial state $s_0$
  3. We know the reward for any state, action, and next state: $r(s_t,a_t,s_{t+1})$
    Goal: find the action trajectory $A=\{a_0,\dots,a_N\}$ that reaches the target state $s_f$.
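The sampling-based planners described later all rely on this interface. Below is a minimal sketch in Python; `dynamics_model`, `reward_fn`, and `rollout_return` are hypothetical stand-ins for whatever known or learned model the text assumes, not part of the original article.

```python
# Hypothetical stand-ins for the known quantities listed above; a real
# implementation would plug in its own (learned or analytic) model.
def dynamics_model(s, a):
    """s_{t+1} = f(s_t, a_t): next state predicted by the environment model."""
    raise NotImplementedError

def reward_fn(s, a, s_next):
    """r(s_t, a_t, s_{t+1}): reward for a single transition."""
    raise NotImplementedError

def rollout_return(s0, actions):
    """Return R of one action sequence {a_0, ..., a_{H-1}} evaluated under the model."""
    s, total = s0, 0.0
    for a in actions:
        s_next = dynamics_model(s, a)
        total += reward_fn(s, a, s_next)
        s = s_next
    return total
```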

Problem analysis

We want to find the action trajectory, i.e. the policy $\pi(a_t|s_t)$, that achieves
$$\arg\max_\pi \; R = E_\pi\Big[\sum_{i=0}^{N} r(s_i,a_i)\Big] = r(s_0,\pi(s_0)) + r(f(s_0,a_0),a_1) + \dots$$
In principle we could differentiate this expression with respect to $\pi$ and solve for the maximum. In practice, however, the model $f(s_t,a_t)$ we have is only approximate, so the predicted return drifts further and further from the true return as errors accumulate along the rollout.
Model predictive control offers a practical way to attack this problem.

Main ideas of MPC

  1. Predicting infinitely many steps ahead is infeasible (error accumulation means the rollout departs from reality after a certain number of steps), so we instead maximize the reward over a horizon of $H$ steps.
  2. Instead of maximizing by gradient methods, we search for a feasible solution by sampling. (Because the search is sampling-based and the underlying problem is NP-complete, exploring effectively becomes the key question.) A sketch of the resulting receding-horizon loop follows this list.
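A minimal sketch of the receding-horizon loop these ideas lead to, assuming a hypothetical environment whose `step` returns `(next_state, reward, done)` and a hypothetical `plan_actions` that stands in for any of the planners described below:

```python
def mpc_control(env, s0, horizon, n_steps, plan_actions):
    """Generic receding-horizon loop: plan H steps ahead, execute only the first
    action in the real environment, observe the next state, and replan."""
    s = s0
    for _ in range(n_steps):
        plan = plan_actions(s, horizon)      # H-step action sequence from the current state
        s, reward, done = env.step(plan[0])  # only the first planned action is executed for real
        if done:
            break
    return s
```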

MPC methods

Random shooting

  1. Independently sample $N$ random action sequences $\{A_0,\dots,A_N\}$, each of the form $A_i=\{a_0^{i},\dots,a_{H-1}^i\}$. Every action inside a sequence is also chosen at random.
  2. Compute the return $R^i$ of each sequence: after taking action $a_k^i$, the environment model gives the next state $s_{k+1}^i=f(s_k^i,a_k^i)$ and hence the reward $r^i_k=r(s_k^i,a_k^i,s_{k+1}^i)$; then take the next action $a_{k+1}^i$, and so on until the sequence is exhausted. This yields the return $R^i$ of every sequence.
  3. Select $A^*=\arg\max_A R$.
  4. Execute only the first action $a^*_0$ (this action is taken in the real environment).
  5. After observing the next state, repeat the whole process.

The selection procedure is simple, but in high-dimensional action spaces random sampling rarely finds a good solution. A short sketch is given below.
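A minimal sketch of random shooting, reusing the hypothetical `rollout_return` from the earlier sketch and assuming actions live in a box `[a_low, a_high]` of dimension `act_dim`:

```python
import numpy as np

def random_shooting(s0, horizon, n_seq, act_dim, a_low, a_high, rng=None):
    """Sample N random H-step action sequences, score each one with the model,
    and return the best sequence (only its first action will be executed)."""
    rng = rng or np.random.default_rng()
    # Shape (N, H, act_dim): every action of every sequence is drawn uniformly at random.
    candidates = rng.uniform(a_low, a_high, size=(n_seq, horizon, act_dim))
    returns = np.array([rollout_return(s0, seq) for seq in candidates])  # R^i per sequence
    best = int(np.argmax(returns))                                       # A* = argmax_A R
    return candidates[best]
```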

CEM (Iterative Random Shooting with Refinement)

  1. In the first iteration, sample actions at random, $a_t \sim N(\mu_t,\Sigma_t)$ with $\mu_t=0$ and $\Sigma_t$ fixed, to obtain $N$ action sequences.
  2. Select the $J$ sequences with the highest returns, the elites $A_{elites}$.
  3. Update the sampling distribution: $\mu_t = \alpha\,\mathrm{mean}(A_{elites}) + (1-\alpha)\,\mu_t$ and $\Sigma_t = \alpha\,\mathrm{var}(A_{elites}) + (1-\alpha)\,\Sigma_t$ (each of the $H$ steps of the horizon has its own $\mu_t$ and $\Sigma_t$).
  4. Repeat the above for $M$ iterations.
  5. Take the first action of the final $\mathrm{mean}(A_{elites})$ as the current action.
  6. After observing the next state, repeat the process.
    Because CEM concentrates its exploration in regions of the action space with higher reward, it performs better than random shooting. A sketch follows below.
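A sketch of the CEM planner under the same assumptions as above (hypothetical `rollout_return`, diagonal Gaussians); `alpha`, `n_elite` ($J$), `n_iter` ($M$), and `init_var` are the hyperparameters named in the list:

```python
import numpy as np

def cem_plan(s0, horizon, n_seq, n_elite, n_iter, act_dim, alpha=0.5, init_var=1.0, rng=None):
    """Iterative random shooting with refinement of a per-timestep Gaussian."""
    rng = rng or np.random.default_rng()
    mu = np.zeros((horizon, act_dim))                 # mu_t = 0 in the first iteration
    sigma = np.full((horizon, act_dim), init_var)     # fixed initial (diagonal) variance
    for _ in range(n_iter):                           # M refinement iterations
        # Sample N sequences; each timestep t of the horizon has its own Gaussian.
        candidates = mu + np.sqrt(sigma) * rng.standard_normal((n_seq, horizon, act_dim))
        returns = np.array([rollout_return(s0, seq) for seq in candidates])
        elites = candidates[np.argsort(returns)[-n_elite:]]          # top-J sequences
        elite_mean, elite_var = elites.mean(axis=0), elites.var(axis=0)
        # Soft update of the sampling distribution toward the elites.
        mu = alpha * elite_mean + (1 - alpha) * mu
        sigma = alpha * elite_var + (1 - alpha) * sigma
    return elite_mean[0]    # first action of the final iteration's elite mean
```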

Filtering and Reward-weighted Refinement

  1. Mean update: $\mu_{t+1}=\dfrac{\sum_{k=0}^N e^{\gamma R_k}\,a_t^k}{\sum_{j=0}^N e^{\gamma R_j}}$
  2. Noise for each of the $H$ steps of the horizon is sampled and then filtered (smoothed) across time:
    $u_t^i \sim N(0,\Sigma),\quad t \in \{0,\dots,H-1\},\ i \in \{0,\dots,N-1\}$
    $n_t^i = \beta u_t^i + (1-\beta)\,n_{t-1}^i$
    $a_{t+1}^i = \mu_t^i + n_t^i$
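A sketch of the reward-weighted mean update and the time-correlated (filtered) noise, following the formulas above; `gamma` is the reward-weighting temperature and `beta` the filter coefficient, and the max-subtraction in the weights is only a numerical-stability detail, not part of the original formula:

```python
import numpy as np

def reward_weighted_mean(candidates, returns, gamma):
    """Softmax-weighted average of the sampled actions, weighted by exp(gamma * R_k)."""
    w = np.exp(gamma * (returns - returns.max()))   # subtracting the max only improves stability
    w /= w.sum()
    # candidates has shape (N, H, act_dim); the weights run over the N sequences.
    return np.einsum('k,kta->ta', w, candidates)    # new mu_t for every step of the horizon

def filtered_noise(n_seq, horizon, act_dim, sigma, beta, rng=None):
    """Time-correlated noise: n_t = beta * u_t + (1 - beta) * n_{t-1}, with u_t ~ N(0, Sigma)."""
    rng = rng or np.random.default_rng()
    u = np.sqrt(sigma) * rng.standard_normal((n_seq, horizon, act_dim))
    n = np.zeros_like(u)
    n[:, 0] = beta * u[:, 0]
    for t in range(1, horizon):
        n[:, t] = beta * u[:, t] + (1 - beta) * n[:, t - 1]
    return n    # candidate actions are then a_t^i = mu_t + n_t^i
```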

Limitations of MPC

  1. Because we only maximize reward over the next $H$ steps and ignore everything beyond the horizon, the solution is not globally optimal, only optimal (at best) over that window.
  2. Because the search is sampling-based, the solution found is likely not optimal (even for the $H$-step problem alone).
  3. The results of each sampling-based search are not turned into experience that could inform exploration at later steps, so there is no learning; the data are used inefficiently, and essentially only the environment model and the reward function are exploited.
