by Shubheksha

Robustness in Complex Systems: an academic article summary

Today, we’re going to look at the paper titled “Robustness in Complex Systems” published in 2001 by Steven D. Gribble. All pull quotes and figures are from the paper.

This paper argues that a common design paradigm for systems is fundamentally flawed, resulting in unstable, unpredictable behavior as the complexity of the system grows.

The “common design paradigm” refers to the practice of predicting the environment the system will operate in and its failure modes. The paper states that as a system becomes more complex, it will encounter conditions that weren’t predicted, so it should be designed to cope with failure gracefully. The paper explores these ideas with the help of “distributed data structures (DDS), a scalable, cluster-based storage server.”

By their very nature, large systems operate through the complex interaction of many components. This interaction leads to a pervasive coupling of the elements of the system; this coupling may be strong (e.g., packets sent between adjacent routers in a network) or subtle (e.g., synchronization of routing advertisements across a wide area network).

A common characteristic that such large systems exhibit is something known as the Butterfly Effect. This refers to a small, unexpected disturbance that, through the intricate interaction of the system’s components, causes a widespread change.

A common goal for system design is robustness: the ability of a system to operate correctly in various conditions and fail gracefully in an unexpected situation. The paper argues against the common pattern of trying to predict a certain set of operation conditions for the system and architecting it to work well in only those conditions.

It is also effectively impossible to predict all of the perturbations that a system will experience as a result of changes in environmental conditions, such as hardware failures, load bursts, or the introduction of misbehaving software. Given this, we believe that any system that attempts to gain robustness solely through precognition is prone to fragility.

DDS: A Case Study

The hypothesis stated above is explored using a scalable, cluster-based storage system, Distributed Data Structures (DDS): “a high-capacity, high-throughput virtual hash table that is partitioned and replicated across many individual storage nodes called bricks.”

This system was built using a predictive design philosophy like the one described above.

Based on extensive experience with such systems, we attempted to reason about the behavior of the software components, algorithms, protocols, and hardware elements of the system, as well as the workloads it would receive.

When the system operated within the scope of the assumptions made by the designers, it worked fine. They were able to scale it and improve performance. However, in the case when one or more of the assumptions about the operating conditions were violated, the system behaved in unexpected ways resulting in data loss or inconsistencies.

Next, we talk about several such anomalies.

Garbage Collection Thrashing and Bounded Synchrony

The system designers used timeouts to detect failure of components in the system. If a particular component didn’t respond within the specified time, it was considered dead. They assumed bounded synchrony in the system.
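
To make the timeout assumption concrete, here is a minimal sketch of a timeout-based failure detector. It is not the DDS implementation: the class name, the brick identifier, and the one-second timeout are all hypothetical, and a fake clock stands in for real time so the example is deterministic. It shows that a brick which is merely slow (for instance, paused) looks exactly like a dead one once the bound is exceeded.

```python
import time


class TimeoutFailureDetector:
    """Suspects a brick has failed if no heartbeat arrives within `timeout_s`."""

    def __init__(self, timeout_s, clock=time.monotonic):
        self.timeout_s = timeout_s
        self.clock = clock          # injectable so the example needs no real sleeping
        self.last_heartbeat = {}

    def heartbeat(self, brick):
        self.last_heartbeat[brick] = self.clock()

    def suspected_failed(self, brick):
        last = self.last_heartbeat.get(brick)
        if last is None:
            return True             # never heard from this brick
        return self.clock() - last > self.timeout_s


# Fake clock (hypothetical values) to simulate a long pause.
now = [0.0]
detector = TimeoutFailureDetector(timeout_s=1.0, clock=lambda: now[0])

detector.heartbeat("brick-1")
now[0] += 0.5
alive_under_bound = not detector.suspected_failed("brick-1")

now[0] += 2.0   # the brick is slow, not dead, but the detector cannot tell
suspected_after_pause = detector.suspected_failed("brick-1")
```

The detector is only as good as the bound it assumes: any delay longer than `timeout_s` is indistinguishable from a crash.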

The DDS was implemented in Java, and therefore made use of garbage collection. The garbage collector in our JVM was a mark-and-sweep collector; as a result, as more active objects were resident in the JVM heap, the duration that the garbage collector would run in order to reclaim a fixed amount of memory would increase.

When the system was at saturation, even slight variations in load on the bricks would increase the time taken by the garbage collector, which in turn dropped the throughput of the brick. This is called GC thrashing. The affected bricks would lag behind their counterparts, leading to a further degradation in the performance of the system.
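
The feedback loop behind GC thrashing can be illustrated with a toy model. This is an assumption-laden sketch, not a model taken from the paper: the function name, the rates, and the linear GC cost are all made up for illustration. Each pending request pins live objects, GC time grows with the number of live objects, and time spent collecting is time not spent serving.

```python
def simulate_brick(arrival_rate, service_rate, gc_cost_per_live_obj, steps=200):
    """Toy discrete-time model of one brick (all parameters hypothetical).

    Every step, `arrival_rate` requests arrive and stay live until served.
    The fraction of the step lost to GC grows with the live backlog, which
    shrinks the capacity left for serving requests.
    """
    backlog = 0.0
    for _ in range(steps):
        backlog += arrival_rate
        gc_fraction = min(1.0, gc_cost_per_live_obj * backlog)
        served = min(backlog, service_rate * (1.0 - gc_fraction))
        backlog -= served
    return backlog


# Slightly below the tipping point: every request is served each step.
stable_backlog = simulate_brick(arrival_rate=90, service_rate=100,
                                gc_cost_per_live_obj=0.001)

# A small increase in load: the backlog feeds GC time, which feeds the backlog.
runaway_backlog = simulate_brick(arrival_rate=92, service_rate=100,
                                 gc_cost_per_live_obj=0.001)
```

In this model a load increase of about 2% is enough to move the brick from a stable backlog of zero to unbounded growth, which is the kind of sensitivity near saturation that the summary describes.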

Hence, garbage collection violated the assumption of bounded synchrony when the system was nearing or beyond its saturation point.

Another assumption made while designing the system was that failures are independent. DDS used replication to make the system fault-tolerant, on the expectation that the probability of multiple replicas failing simultaneously was very small.

However, this assumption was violated when they encountered a race condition in their code that caused a memory leak without affecting correctness.

Whenever we launched our system, we would tend to launch all bricks at the same time. Given roughly balanced load across the system, all bricks therefore would run out of heap space at nearly the same time, several days after they were launched. We also speculated that our automatic failover mechanisms exacerbated this situation by increasing the load on a replica after a peer had failed, increasing the rate at which the replica leaked memory.
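
A small simulation can show why a slow leak plus balanced load produces correlated failures. The sketch below is illustrative only: the heap size, load, leak per request, and initial memory usage are hypothetical numbers, and the leak is assumed to be proportional to the load a replica handles. Survivors absorb the load of failed peers, mimicking the failover effect described in the quote.

```python
def simulate_leaking_replicas(initial_used_kb, heap_kb, total_load, leak_kb_per_request):
    """Return the hour at which each replica exhausts its heap.

    Load is split evenly across surviving replicas, so a failure raises
    the leak rate of every replica that is still alive.
    """
    used = list(initial_used_kb)
    failed_at = [None] * len(used)
    hour = 0
    while any(t is None for t in failed_at):
        hour += 1
        alive = [i for i, t in enumerate(failed_at) if t is None]
        load_per_replica = total_load / len(alive)
        for i in alive:
            used[i] += leak_kb_per_request * load_per_replica
            if used[i] >= heap_kb:
                failed_at[i] = hour
    return failed_at


# Three replicas launched together with nearly identical memory usage.
failed_at = simulate_leaking_replicas(
    initial_used_kb=[0, 5_000, 10_000],
    heap_kb=1_000_000,
    total_load=30_000,          # requests per hour across the cluster
    leak_kb_per_request=1,
)
```

With these numbers the first replica fails after 99 hours and the other two fail one hour later, so replication offers little protection: the failures are days away from launch but almost simultaneous with each other.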

Since all the replicas were subjected to a uniform load without taking performance degradation and other issues into consideration, this created a coupling between the replicas and…

…when combined with a slow memory leak, lead to the violation of our assumption of independent failures, which in turn caused our system to experience unavailability and partial data loss

Unchecked Dependencies and Fail-stop

Based on the assumption that a component that timed out had failed, the designers also assumed “fail-stop” failures: a component that has failed will not resume functioning later. The bricks in the system performed all long-latency work (disk I/O) asynchronously.

However, they failed to notice that some parts of their code made use of blocking function calls. This caused the main event-handling thread to be borrowed at unpredictable times, so bricks would seize up inexplicably for a couple of minutes and then resume.
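
The effect of a blocking call on an event-driven component can be reproduced in a few lines. This is a sketch using Python's `asyncio` rather than the Java code the DDS was written in, and the task names and durations are hypothetical. A heartbeat task shares the event loop with a request handler; when the handler blocks, the heartbeat goes silent and then resumes, which is exactly the behavior that breaks a fail-stop assumption.

```python
import asyncio
import time


async def heartbeat(gaps, interval_s=0.01, duration_s=0.5):
    """Record the time between consecutive heartbeats on the event loop."""
    end = time.monotonic() + duration_s
    last = time.monotonic()
    while time.monotonic() < end:
        await asyncio.sleep(interval_s)
        now = time.monotonic()
        gaps.append(now - last)
        last = now


async def handler(blocking):
    await asyncio.sleep(0.05)
    if blocking:
        time.sleep(0.2)            # blocking call: stalls the whole event loop
    else:
        await asyncio.sleep(0.2)   # asynchronous wait: other tasks keep running


async def max_heartbeat_gap(blocking):
    gaps = []
    await asyncio.gather(heartbeat(gaps), handler(blocking))
    return max(gaps)


gap_async = asyncio.run(max_heartbeat_gap(blocking=False))
gap_blocking = asyncio.run(max_heartbeat_gap(blocking=True))
```

In the blocking run the longest heartbeat gap is at least as long as the blocking call, so a peer using a timeout shorter than that would declare the component dead, only to see it come back.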

While this error was due to our own failure to verify the behavior of code we were using, it serves to demonstrate that the low-level interaction between independently built components can have profound implications on the overall behavior of the system. A very subtle change in behavior resulted in the violation of our fail-stop assumption across the entire cluster, which eventually lead to the corruption of data in our system.

Towards Robust Systems

..small changes to a complex, coupled system can result in large, unexpected changes in behavior, possibly taking the system outside of its designers’ expected operating regime.

A few solutions which can help us make more robust systems:

Systematic Over-provisioning

When approaching the saturation point, systems tend to become fragile when trying to accommodate unexpected behavior. One way to combat this is to deliberately over-provision the system.

However, this has its own set of issues: it leads to the under-utilization of resources. It also requires predicting the expected operating environment and hence the saturation point of the system. This can’t be done in an accurate manner in most cases.

Use Admission Control

Another technique is to start rejecting load once the system starts approaching the saturation point. However, this requires predicting the saturation point — something that’s not always possible, especially with large systems which have a lot of contributing variables.

Rejecting requests also consumes some resources from the system. Services designed with admission control in mind usually have two operating modes: normal where the requests are processed and an extremely lightweight mode where they’re rejected.
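
The two operating modes can be sketched as follows. This is a minimal illustration under stated assumptions, not code from the paper: the class name and the fixed `max_in_flight` limit are hypothetical, and choosing that limit is precisely the saturation-point prediction the text warns about.

```python
class AdmissionController:
    """Rejects new requests once in-flight work reaches `max_in_flight`."""

    def __init__(self, max_in_flight):
        self.max_in_flight = max_in_flight
        self.in_flight = 0
        self.rejected = 0

    def try_admit(self):
        if self.in_flight >= self.max_in_flight:
            self.rejected += 1      # lightweight mode: count and return immediately
            return False
        self.in_flight += 1         # normal mode: the caller goes on to process
        return True

    def release(self):
        self.in_flight -= 1         # call when an admitted request completes


controller = AdmissionController(max_in_flight=2)
results = [controller.try_admit() for _ in range(3)]   # third request exceeds the limit
controller.release()                                   # one request finishes
admitted_after_release = controller.try_admit()
```

The rejection path does no allocation and no I/O, which keeps the cost of saying "no" far below the cost of doing the work.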

Build Introspection into the system

..an introspective system is one in which the ability to monitor the system is designed in from the beginning.

When a system can be monitored, and designers and operators can derive meaningful measurements about its operation, it’s much more robust than a black-box system. It’s easier to adapt such a system to change in its environment, as well as manage and maintain it.

Introduce adaptivity by closing the control loop

An example of a control loop is human designers and operators adapting the design in response to a change in its operating environment indicated through various measurements. However, the timeline for such a control loop isn’t very predictable. The authors argue that systems should be built with internal control loops.
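
One way to picture an internal control loop is a controller that adjusts a concurrency limit from measured latency. The sketch below uses additive-increase / multiplicative-decrease, a technique chosen here purely for illustration and not one the paper prescribes; the function name, the target latency, and the measurements are hypothetical.

```python
def adapt_limit(limit, observed_latency_ms, target_latency_ms,
                increase=1, decrease_factor=0.5, min_limit=1):
    """One step of an additive-increase / multiplicative-decrease controller."""
    if observed_latency_ms > target_latency_ms:
        # Measurement says the system is struggling: back off quickly.
        return max(min_limit, int(limit * decrease_factor))
    # System looks healthy: probe for a little more capacity.
    return limit + increase


limit = 10
history = []
for latency_ms in [20, 30, 40, 250, 300, 40, 30]:   # hypothetical measurements
    limit = adapt_limit(limit, latency_ms, target_latency_ms=100)
    history.append(limit)
```

Here the introspection signal (latency) feeds directly into the control variable (the limit) without a human in the loop, so the reaction time is bounded by the measurement interval rather than by an operator's schedule.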

These systems incorporate the results of introspection, and attempt to adapt control variables dynamically to keep the system operating in a stable or well-performing regime.

All such systems have the property that the component performing the adaptation is able to hypothesize somewhat precisely about the effects of the adaptation; without this ability, the system would be “operating in the dark”, and likely would become unpredictable. A new, interesting approach to hypothesizing about the effects of adaptation is to use statistical machine learning; given this, a system can experiment with changes in order to build up a model of their effects.

Plan for failure

Complex systems must expect failure and plan for it accordingly.

A few techniques for doing this:

  1. Decouple components to contain failures locally
  2. Minimize damage by using robust abstractions such as transactions
  3. Minimize the amount of time spent in a failure state (for example, by using checkpointing to recover rapidly)
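
The checkpointing idea in the last item can be sketched as below. This is an assumed, simplified design rather than anything described in the paper: the file name, the JSON format, and the `applied` counter are hypothetical. State is written to a temporary file and then renamed into place, so a crash mid-write leaves the previous checkpoint intact.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write state to a temp file, then rename it over the old checkpoint."""
    directory = os.path.dirname(path) or "."
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())     # make sure the bytes reach the disk first
    os.replace(tmp_path, path)   # atomic rename on POSIX file systems


def load_checkpoint(path, default):
    """Return the last saved state, or `default` if no checkpoint exists."""
    try:
        with open(path) as f:
            return json.load(f)
    except FileNotFoundError:
        return default


with tempfile.TemporaryDirectory() as workdir:
    checkpoint = os.path.join(workdir, "brick.ckpt")
    fresh = load_checkpoint(checkpoint, default={"applied": 0})
    save_checkpoint(checkpoint, {"applied": 42})
    recovered = load_checkpoint(checkpoint, default={"applied": 0})   # simulated restart
```

After a restart the component resumes from the last saved state instead of rebuilding everything, which shortens the time it spends in a failed state.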

In this paper, the authors argue that designing systems by assuming the constraints and nature of their operation, failures, and behavior often leads to fragile and unpredictable systems. We need a radically different approach to build systems that are more robust in the face of failure.

This different design paradigm is one in which systems are given the best possible chance of stable behavior (through techniques such as over-provisioning, admission control, and introspection), as well as the ability to adapt to unexpected situations (by treating introspection as feedback to a closed control loop). Ultimately, systems must be designed to handle failures gracefully, as complexity seems to lead to an inevitable unpredictability.

Translated from: https://www.freecodecamp.org/news/robustness-in-complex-systems-a-summary-95d6f4067116/
