Robustness in Complex Systems: An Academic Article Summary

Original article: https://www.freecodecamp.org/news/robustness-in-complex-systems-a-summary-95d6f4067116/

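This article summarizes Steven D. Gribble's paper “Robustness in Complex Systems” and highlights the limits of the predictive design paradigm. Using the Distributed Data Structures (DDS) case study, it shows the problems that arise from trying to predict a system's operating conditions, such as garbage collection thrashing that breaks bounded synchrony, slow leaks, and correlated failures. To achieve greater robustness, it presents over-provisioning, admission control, system introspection, and closed-loop control as ways of coping with the uncertainty of complex systems.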

by Shubheksha

Robustness in Complex Systems: an academic article summary

Today, we’re going to look at the paper titled “Robustness in Complex Systems” published in 2001 by Steven D. Gribble. All pull quotes and figures are from the paper.

This paper argues that a common design paradigm for systems is fundamentally flawed, resulting in unstable, unpredictable behavior as the complexity of the system grows.

The “common design paradigm” refers to the practice of predicting the environment a system will operate in, along with its failure modes. The paper states that as a system becomes more complex, it will face conditions that nobody predicted, so it should be designed to cope with failure gracefully. The paper explores these ideas with the help of “distributed data structures (DDS), a scalable, cluster-based storage server.”

By their very nature, large systems operate through the complex interaction of many components. This interaction leads to a pervasive coupling of the elements of the system; this coupling may be strong (e.g., packets sent between adjacent routers in a network) or subtle (e.g., synchronization of routing advertisements across a wide area network).

A common characteristic of such large systems is the Butterfly Effect: a small, unexpected disturbance is amplified by the intricate interaction of various components and causes a widespread change in the system.

A common goal for system design is robustness: the ability of a system to operate correctly in various conditions and fail gracefully in an unexpected situation. The paper argues against the common pattern of trying to predict a certain set of operation conditions for the system and architecting it to work well in only those conditions.

It is also effectively impossible to predict all of the perturbations that a system will experience as a result of changes in environmental conditions, such as hardware failures, load bursts, or the introduction of misbehaving software. Given this, we believe that any system that attempts to gain robustness solely through precognition is prone to fragility.

DDS: A Case Study

The hypothesis stated above is explored using a scalable, cluster-based storage system, Distributed Data Structures (DDS): “a high-capacity, high-throughput virtual hash table that is partitioned and replicated across many individual storage nodes called bricks.”

This system was built using a predictive design philosophy like the one described above.

Based on extensive experience with such systems, we attempted to reason about the behavior of the software components, algorithms, protocols, and hardware elements of the system, as well as the workloads it would receive.

When the system operated within the scope of the assumptions made by the designers, it worked fine. They were able to scale it and improve performance. However, in the case when one or more of the assumptions about the operating conditions were violated, the system behaved in unexpected ways resulting in data loss or inconsistencies.

Next, we talk about several such anomalies.

Garbage Collection Thrashing and Bounded Synchrony

The system designers used timeouts to detect the failure of components in the system. If a particular component didn’t respond within the specified time, it was considered dead. In other words, they assumed bounded synchrony: a healthy component always responds within a known time limit.

The DDS was implemented in Java, and therefore made use of garbage collection. The garbage collector in our JVM was a mark-and-sweep collector; as a result, as more active objects were resident in the JVM heap, the duration that the garbage collector would run in order to reclaim a fixed amount of memory would increase.

When the system was at saturation, even slight variations in the load on a brick would increase the time taken by its garbage collector, which in turn dropped the brick’s throughput. This is called GC thrashing. The affected bricks would lag behind their counterparts, further degrading the performance of the system.

Hence, garbage collection violated the assumption of bounded synchrony whenever the system was near or beyond its saturation point.
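
To make the failure mode concrete, here is a minimal sketch of a timeout-based failure detector. It is not the DDS implementation: the class and function names, the 2-second timeout, and the heartbeat schedule are all assumptions chosen for illustration. It shows how a brick that is merely stalled (for example by a long garbage collection pause) gets declared dead even though it never crashed.

```python
from dataclasses import dataclass


@dataclass
class TimeoutDetector:
    """Hypothetical detector: a peer is suspected dead once it has been
    silent for longer than `timeout_s` (the bounded-synchrony assumption)."""
    timeout_s: float

    def is_suspected(self, last_heartbeat_s: float, now_s: float) -> bool:
        return (now_s - last_heartbeat_s) > self.timeout_s


def heartbeat_times(interval_s, pauses, count):
    """Times at which a live brick sends its heartbeats.

    `pauses` maps a heartbeat index to an extra stall in seconds (standing in
    for a long GC pause) that delays that heartbeat and all later ones.
    """
    times, now = [], 0.0
    for i in range(count):
        now += interval_s + pauses.get(i, 0.0)
        times.append(now)
    return times


def false_suspicions(detector, beats):
    """Count the gaps during which a brick that never crashed looks dead."""
    previous = [0.0] + beats[:-1]
    return sum(detector.is_suspected(p, b) for p, b in zip(previous, beats))


detector = TimeoutDetector(timeout_s=2.0)
healthy = heartbeat_times(1.0, {}, 5)          # steady 1 s gaps
thrashing = heartbeat_times(1.0, {3: 4.0}, 5)  # one 4 s stall before the 4th beat

print(false_suspicions(detector, healthy))     # 0
print(false_suspicions(detector, thrashing))   # 1
```

The detector itself is correct under its assumption. The problem is that the assumption stops holding once pauses grow with load.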

Another assumption made while designing the system was that failures are independent. DDS used replication to make the system fault-tolerant, on the assumption that the probability of multiple replicas failing simultaneously was very small.

However, this assumption was violated when they encountered a race condition in their code that caused a memory leak without affecting correctness.

Whenever we launched our system, we would tend to launch all bricks at the same time. Given roughly balanced load across the system, all bricks therefore would run out of heap space at nearly the same time, several days after they were launched. We also speculated that our automatic failover mechanisms exacerbated this situation by increasing the load on a replica after a peer had failed, increasing the rate at which the replica leaked memory.

Since all the replicas were subjected to a uniform load without taking performance degradation and other issues into consideration, this created a coupling between the replicas and…

…when combined with a slow memory leak, led to the violation of our assumption of independent failures, which in turn caused our system to experience unavailability and partial data loss.
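
The arithmetic behind this correlated failure can be sketched in a few lines. The model below is an assumption made for illustration, not something taken from the paper: each brick leaks a fixed amount of heap per thousand requests and receives a constant, balanced load. All names and numbers are hypothetical. The staggered launch is only a contrast case, not a remedy proposed by the author.

```python
def hours_until_heap_exhausted(heap_mb, leak_mb_per_krequest, krequests_per_hour):
    """Lifetime of one brick under a constant leak rate (simplified model)."""
    return heap_mb / (leak_mb_per_krequest * krequests_per_hour)


def failure_times(launch_hours, heap_mb, leak_mb_per_krequest, krequests_per_hour):
    """Wall-clock hour at which each replica runs out of heap."""
    lifetime = hours_until_heap_exhausted(
        heap_mb, leak_mb_per_krequest, krequests_per_hour
    )
    return [start + lifetime for start in launch_hours]


# Three replicas, 1000 MB heap, leaking 1 MB per 1000 requests,
# each serving 10,000 requests per hour (balanced load).
together = failure_times([0.0, 0.0, 0.0], 1000.0, 1.0, 10.0)
staggered = failure_times([0.0, 24.0, 48.0], 1000.0, 1.0, 10.0)

print(together)   # [100.0, 100.0, 100.0] -> every replica dies in the same hour
print(staggered)  # [100.0, 124.0, 148.0]
```

With identical launch times and identical load, replication offers no protection, because every copy reaches the same limit together.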

Unchecked Dependencies and Fail-stop

Based on the assumption that a component that timed out has failed, the designers also assumed “fail-stop” failures: a component that has failed will not resume functioning after a while. The bricks in the system performed all long-latency work (disk I/O) asynchronously.

However, they failed to notice that some parts of their code made use of blocking function calls. This caused the main event-handling thread to be borrowed at random, so bricks would seize up inexplicably for a couple of minutes and then resume.

While this error was due to our own failure to verify the behavior of code we were using, it serves to demonstrate that the low-level interaction between independently built components can have profound implications on the overall behavior of the system. A very subtle change in behavior resulted in the violation of our fail-stop assumption across the entire cluster, which eventually led to the corruption of data in our system.

Towards Robust Systems

…small changes to a complex, coupled system can result in large, unexpected changes in behavior, possibly taking the system outside of its designers’ expected operating regime.

A few solutions that can help us build more robust systems:

Systematic Over-provisioning

When approaching the saturation point, systems tend to become fragile when trying to accommodate unexpected behavior. One way to combat this is to deliberately over-provision the system.

However, this has its own set of issues: it leads to the under-utilization of resources. It also requires predicting the expected operating environment and hence the saturation point of the system. This can’t be done in an accurate manner in most cases.
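
The trade-off can be sketched with a small sizing calculation. The function name, the parameters, and the numbers are assumptions for illustration only. Note that both inputs (peak load and per-brick capacity) are themselves predictions, which is exactly the weakness described above.

```python
import math


def bricks_needed(peak_load_rps, brick_capacity_rps, target_utilization):
    """Bricks required so that each one stays at or below
    `target_utilization` of its estimated capacity at peak load."""
    if not 0.0 < target_utilization <= 1.0:
        raise ValueError("target_utilization must be in (0, 1]")
    return math.ceil(peak_load_rps / (brick_capacity_rps * target_utilization))


# Estimated peak of 9000 requests/s, estimated 1000 requests/s per brick.
print(bricks_needed(9000, 1000, 1.0))  # 9  -> runs right at saturation
print(bricks_needed(9000, 1000, 0.5))  # 18 -> half of the capacity sits idle
```

Doubling the headroom doubles the hardware, and the answer is only as good as the two estimates fed into it.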

Use Admission Control

Another technique is to start rejecting load once the system starts approaching the saturation point. However, this requires predicting the saturation point — something that’s not always possible, especially with large systems which have a lot of contributing variables.

Rejecting requests also consumes some resources from the system. Services designed with admission control in mind usually have two operating modes: normal where the requests are processed and an extremely lightweight mode where they’re rejected.
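
The two operating modes can be sketched as follows. This is a minimal illustration, not code from the paper: the class name, the `max_in_flight` limit, and the `handle` helper are hypothetical, and a real service would also need to make the counter thread-safe.

```python
class AdmissionController:
    """Admits work until `max_in_flight` requests are being processed,
    then rejects on a cheap path that does no real work."""

    def __init__(self, max_in_flight):
        self.max_in_flight = max_in_flight
        self.in_flight = 0
        self.rejected = 0

    def try_admit(self):
        if self.in_flight >= self.max_in_flight:
            self.rejected += 1   # lightweight mode: count and return at once
            return False
        self.in_flight += 1      # normal mode: reserve a processing slot
        return True

    def release(self):
        if self.in_flight > 0:
            self.in_flight -= 1


def handle(controller, work):
    """Run `work` if admitted, otherwise return a rejection marker."""
    if not controller.try_admit():
        return "rejected"
    try:
        return work()
    finally:
        controller.release()


controller = AdmissionController(max_in_flight=2)
print(handle(controller, lambda: "ok"))  # ok
```

The limit of 2 stands in for the saturation point, which, as noted above, has to be estimated and may be wrong.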

Build Introspection into the system

…an introspective system is one in which the ability to monitor the system is designed in from the beginning.

When a system can be monitored, and designers and operators can derive meaningful measurements about its operation, it’s much more robust than a black-box system. It’s easier to adapt such a system to changes in its environment, as well as to manage and maintain it.
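
As a rough sketch of what "designed in from the beginning" might look like, the store below records its own hit rate and read latency as part of every lookup. The class, its metrics, and the injectable `clock` parameter are assumptions made for this example and do not come from DDS.

```python
import time
from collections import deque


class IntrospectiveStore:
    """A toy key-value store that measures itself on every read."""

    def __init__(self, clock=time.perf_counter, window=100):
        self._data = {}
        self._clock = clock                       # injectable for testing
        self._latencies_s = deque(maxlen=window)  # most recent read latencies
        self.hits = 0
        self.misses = 0

    def put(self, key, value):
        self._data[key] = value

    def get(self, key):
        start = self._clock()
        value = self._data.get(key)
        self._latencies_s.append(self._clock() - start)
        if value is None:
            self.misses += 1
        else:
            self.hits += 1
        return value

    def stats(self):
        """Measurements an operator or a control loop can read at any time."""
        total = self.hits + self.misses
        return {
            "requests": total,
            "hit_rate": self.hits / total if total else 0.0,
            "max_latency_s": max(self._latencies_s, default=0.0),
        }


store = IntrospectiveStore()
store.put("a", "1")
store.get("a")
store.get("missing")
print(store.stats()["hit_rate"])  # 0.5
```

Because the measurements live inside the component, they are available to operators and to any automated logic without extra instrumentation.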

Introduce adaptivity by closing the control loop

An example of a control loop is human designers and operators adapting the design in response to a change in its operating environment indicated through various measurements. However, the timeline for such a control loop isn’t very predictable. The author argues that systems should be built with internal control loops.

These systems incorporate the results of introspection, and attempt to adapt control variables dynamically to keep the system operating in a stable or well-performing regime.

All such systems have the property that the component performing the adaptation is able to hypothesize somewhat precisely about the effects of the adaptation; without this ability, the system would be “operating in the dark”, and likely would become unpredictable. A new, interesting approach to hypothesizing about the effects of adaptation is to use statistical machine learning; given this, a system can experiment with changes in order to build up a model of their effects.
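
A very small internal control loop might look like the sketch below. It uses additive-increase, multiplicative-decrease (AIMD), a rule borrowed from TCP congestion control and swapped in here for illustration. It is not the statistical machine learning approach the paper mentions, and the function name, latency target, and limits are all assumptions.

```python
def adjust_limit(limit, observed_latency_ms, target_latency_ms,
                 min_limit=1, max_limit=1024):
    """One step of an AIMD controller for a concurrency limit.

    Measured latency (the introspection result) is the feedback signal,
    and the concurrency limit is the control variable.
    """
    if observed_latency_ms > target_latency_ms:
        new_limit = limit // 2   # back off quickly when the system is struggling
    else:
        new_limit = limit + 1    # probe gently for spare capacity
    return max(min_limit, min(max_limit, new_limit))


limit = 8
history = []
for latency_ms in [50, 50, 300, 50]:   # hypothetical measurements
    limit = adjust_limit(limit, latency_ms, target_latency_ms=100)
    history.append(limit)

print(history)  # [9, 10, 5, 6]
```

The controller can predict the direction of its effect (a lower limit means less load), which is the minimum needed to avoid "operating in the dark".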

Plan for failure

Complex systems must expect failure and plan for it accordingly.

A few techniques for doing this:

decouple components to contain failures locally
minimize damage by using robust abstractions such as transactions
minimize the amount of time spent in a failure state (using checkpointing to recover rapidly)
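
The last technique, checkpointing, can be sketched as a snapshot plus a log of the writes made since that snapshot. This is an illustrative toy rather than anything from the paper: the class is hypothetical, and the in-memory `_checkpoint` and `_log` stand in for what would have to be durable storage in a real system.

```python
import copy


class CheckpointedStore:
    """Toy store that recovers from a snapshot plus a short write log."""

    def __init__(self):
        self.state = {}
        self._checkpoint = {}   # stand-in for a durable snapshot
        self._log = []          # stand-in for a durable log of later writes

    def put(self, key, value):
        self._log.append((key, value))  # record the write before applying it
        self.state[key] = value

    def checkpoint(self):
        """Snapshot the state so recovery only replays later writes."""
        self._checkpoint = copy.deepcopy(self.state)
        self._log.clear()

    def recover(self):
        """Rebuild state after a crash; returns how many writes were replayed."""
        self.state = copy.deepcopy(self._checkpoint)
        for key, value in self._log:
            self.state[key] = value
        return len(self._log)


store = CheckpointedStore()
store.put("a", 1)
store.checkpoint()
store.put("b", 2)
store.state = {}              # simulate losing the in-memory state
print(store.recover())        # 1 (only the write after the checkpoint)
print(store.state)            # {'a': 1, 'b': 2}
```

The more often the store checkpoints, the shorter the log it has to replay, and so the less time it spends in the failed state.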

In this paper, the author argues that designing systems by assuming the constraints and nature of their operation, failures, and behavior often leads to fragile and unpredictable systems. We need a radically different approach to build systems that are more robust in the face of failure.

This different design paradigm is one in which systems are given the best possible chance of stable behavior (through techniques such as over-provisioning, admission control, and introspection), as well as the ability to adapt to unexpected situations (by treating introspection as feedback to a closed control loop). Ultimately, systems must be designed to handle failures gracefully, as complexity seems to lead to an inevitable unpredictability.
