混沌工程是什么_平静中的混沌：什么是混沌工程？

最新推荐文章于 2024-07-18 22:06:15 发布

weixin_26742939

最新推荐文章于 2024-07-18 22:06:15 发布

阅读量623

点赞数

文章标签： python

原文链接：https://towardsdatascience.com/chaos-in-the-calm-what-is-chaos-engineering-5311c27b2c7e

版权

混沌工程是什么

Chaos Engineering isn’t just something that Dr. Robotnik does; it’s a serious and increasingly common part of the development life cycle.

混沌工程不仅仅是Robotnik博士所做的。 这是开发生命周期中重要且日益普遍的部分。

In an ever more dynamic global environment, there is growing pressure to introduce continuous testing to the DevOps toolchain to ensure security and resilience. This is more so the case as the cloud becomes the dominant playground for an increasing number of firms, for where there is emerging tech, there is an emerging threat.

在瞬息万变的全球环境中，向DevOps工具链引入持续测试以确保安全性和弹性的压力越来越大。当云成为越来越多的公司的主要场所时，情况就更是如此，对于有新兴技术的地方 ， 新兴威胁就在这里 。

In practice, a system that has not been routinely and effectively tested is more prone to downtime, which can lead to disappointment or even lost customers. In ITIC’s 11th annual Hourly Cost of Downtime Survey, it was found that 87% of organisations are now required to be available 99.99% of the time (which — they note — is up 81% in the past 2.5 years). Incredibly, 40% of enterprise organisations indicated that a single hour of downtime can cost them from $1 million to over $5 million.

在实践中，未经例行有效测试的系统更容易发生停机，这可能导致失望甚至失去客户。在ITIC的第11次年度每小时停机成本调查中 ，发现现在要求99.99％的时间有87％的组织处于可用状态(他们指出，在过去2.5年中，这一比例增加了81％)。令人难以置信的是，有40％的企业组织表示，一个小时的停机可能使他们蒙受的损失从100万美元到500万美元以上。

So how do you preempt and prepare for the kind of disaster that could cost your company millions? Well, one increasingly popular approach is to productively break your own things before someone else does.

那么，您如何抢占先机并为可能造成公司数百万美元损失的灾难做准备？ 好吧，一种越来越流行的方法是在别人做之前有效地破坏自己的东西。

曝光，改善 (Expose, Improve)

To those of us who aren’t wizards of the DevOps world, just the idea of deliberately introducing chaos to your hard work is chilling. Getting to grips with the concept isn’t so hard though, and once you understand the work required, I’m sure you will agree that the outcomes are well worth the resource investment.

对于那些不是DevOps领域向导的人来说，故意将混乱引入您的辛苦工作的想法是令人毛骨悚然的。不过，掌握这个概念并不难，一旦您了解了所需的工作，我相信您会同意成果值得投入资源。

A fairly traditional example of this would be penetration testing or ethical hacking whereby a company pays [typically] a third-party to try and find the security weaknesses in their infrastructure. Chaos engineering takes this idea of testing security, and dials it up a notch.

一个相当传统的例子是渗透测试或道德黑客 ，公司通常会向第三方支付费用，以试图发现其基础架构中的安全漏洞。混沌工程学采用了测试安全性的想法，并将其提高了一个等级。

To explain at a high-level, you introduce disruption (such as a major server crash, a hack attempt, a blackout) in to the normal production state of a system in order to proactively (rather than reactively) test its resilience.

为了进行高层次的解释，您会在系统的正常生产状态中引入中断 (例如主要的服务器崩溃，黑客尝试，停电)，以便主动 (而不是被动地 )测试其弹性。

Image for post — Simple view of the development life-cycle (graphic my own)

It is designed to be a replication of a real-life ‘emergency’ situation and differs from traditional and mainstream testing by taking place in the production environment, rather than earlier in the development life-cycle.

它旨在复制现实生活中的“紧急情况”，并且不同于传统测试和主流测试，而是在生产环境中进行，而不是在开发生命周期中进行。

But why would you go to the lengths of trying to tank your own network or fry your own hardware if you don’t need to?

但是，如果您不需要的话，为什么还要花很多钱尝试自己的网络或油炸自己的硬件呢？

Well, it’s sort of like the old saying “failure to prepare is preparation to fail”; if you don’t do it, then someone else probably will. Therefore, to avoid being left in a catastrophic situation with unexpected downtime, it’s best to put any mitigating procedures in place well in advance.

好吧，这有点像古话“准备失败就是准备失败”。如果您不这样做，那么其他人可能会这样做。因此，为避免因意外停机而陷入灾难性状况，最好提前采取任何缓解措施。

By exposing any possible weaknesses, you can ultimately improve confidence in any systems for the future.

通过暴露任何可能的弱点，您最终可以提高对未来任何系统的信心。

像科学学科一样对待它 (Treat it like a scientific discipline)

Masters of chaos Gremlin state clearly in their whitepaper that:

混乱的格雷姆林大师在白皮书中明确指出：

“Critical to chaos engineering is that it is treated as a scientific discipline.”

“对混沌工程至关重要的是，它被视为一门科学学科。”

Rather than just hiring someone to try and break your system, chaos engineering done right allows you to revert back to an original state without any impact to customer or employees. From a compliance perspective, it allows you to validate and justify hypothetical defense mechanisms. Gremlin set out six steps that should be followed for successful engineering:

完成混乱的工程设计不仅可以雇用他人尝试破坏系统，还可以使您恢复到原始状态，而对客户或员工没有任何影响。从合规性角度来看，它使您可以验证和证明假设的防御机制。 Gremlin列出了成功进行工程设计应遵循的六个步骤：

Form a hypothesis
形成假设
Plan your experiment
计划实验
Minimize the blast radius
最小化爆炸半径
Run the experiment
运行实验
Celebrate the outcome
庆祝结果
Complete the mission
完成任务

There are many critics and skeptics of chaos engineering because, of course, there are risks involved. But following these steps and utilizing best practice — including starting with a staging environment, always having a “kill switch”, and planning in advance — drastically reduces the chance of things going awry. Remember: if you don’t test things breaking in a controlled environment, you’re ensuring that you will struggle to fix them when they inevitably go wrong in the future!

混乱工程学受到许多批评家和怀疑论者的关注，因为当然其中涉及风险。但是，遵循这些步骤并利用最佳实践(包括从临时环境开始，始终具有“致命的转变”以及预先进行计划)，可以大大减少发生问题的机会。切记：如果您不测试在受控环境中发生的故障，那么您将确保在将来不可避免地出问题时要努力修复它们！

Ultimately by simulating events that could be completely overwhelming, you make them non-events that — should they occur in the future — would barely leave a dent.

最终，通过模拟可能完全压倒性的事件，可以使它们成为非事件-如果它们将来会发生-几乎不会留下任何痕迹。

解决假设 (Tackle assumptions)

Another draw of chaos engineering is that it challenges perhaps the biggest blight of the software engineering world: incorrect assumptions. Whether it’s believing that you hold all the power, or that’ll you never run out of space, letting these fallacies take root allows issues to creep in.

混乱工程的另一种吸引之处是，它可能挑战了软件工程领域的最大困境：错误的假设。无论是相信您拥有所有权力，还是永远不会用完空间，让这些谬论生根发芽会让问题蔓延。

If you aren’t continuously testing, and pushing your infrastructure to the limit, then you are relying on discovering your system weaknesses on the fly as they appear in the real world, potentially to the detriment of your customer base.

如果您不进行持续的测试并将基础架构推向极限，那么您将依赖于在现实世界中即时发现系统漏洞，这可能会损害客户群。

Be it major or minor downtime, every minute matters. It’s therefore true, as Andy Grove so excellently puts it, that “Success breeds complacency. Complacency breeds failure”.

无论是重大停机还是次要停机，每分钟都很重要。 因此，正如安迪·格鲁夫(Andy Grove)所说的那样，“成功培养了自满。 自满滋生失败”。

猿面军 (The Simian Army)

Chaos engineering isn’t a new concept. Back in 2011, Netflix started work on a suite of open-source tools referred to as “The Simian Army”. These tools have been designed to test the security, reliability, and resilience of their Amazon Web Services infrastructure and include big names like Chaos Monkey, Chaos Kong, and Janitor Monkey.

混沌工程并不是一个新概念。早在2011年，Netflix就着手开发一套称为“猿猴军”的开源工具。这些工具旨在测试其Amazon Web Services基础架构的安全性，可靠性和弹性，其中包括Chaos Monkey，Chaos Kong和Janitor Monkey等知名品牌。

Chaos Monkey was Netflix’s first tool, and it worked by rebooting servers at random. Whilst incredibly innovative, it seems to have developed some infamy online in part because it does deviate from the modern fundamentals of chaos engineering. Generally speaking, it’s not advisable to break things at random in the middle of busy periods, but instead to use controls and measures to understand weaknesses (unless, like Netflix, you have a massive team of engineers lined up to bag said monkey!)

Chaos Monkey是Netflix的第一个工具，它通过随机重启服务器来工作。尽管具有令人难以置信的创新性，但它似乎在网上发展了一些臭名昭著的部分原因是它确实偏离了混沌工程学的现代基础。一般而言，不建议在繁忙的时段中随意破坏事物，而应使用控制和措施来了解弱点(除非像Netflix一样，您有庞大的工程师团队排着长队说猴子！)

Netflix subsequently introduced a whole cast of monkeys to work with, including Chaos Kong (which wiped out a whole AWS region *shudders*) and Latency Monkey (which introduces delays in communication to simulate real-world degradation and outages). More recently, the monkeys have been developed separately and many are built into Spinnaker, Netflix’s “open source multi-cloud Continuous Delivery platform”

Netflix随后引入了一系列可与之合作的猴子，包括Chaos Kong (消灭了整个AWS地区* shudders *)和Latency Monkey (引入了通信延迟以模拟现实世界的退化和中断)。最近，猴子已经单独开发，许多猴子内置在Spinnaker中，这是Netflix的“开源多云连续交付平台”

You can read in detail about the Simian Army and the incredible chaos engineering projects at Netflix on the Netflix Tech Blog.

您可以在Netflix Tech Blog上详细了解有关Simian Army和不可思议的混乱工程项目的信息。

平静中的混乱 (Chaos in the Calm)

Other companies including Salesforce, Amazon, and Uber have been using resilience testing for years to improve reliability, and because so many projects are open source it’s not difficult to embrace the chaos.

包括Salesforce，Amazon和Uber在内的其他公司多年来一直在使用弹性测试来提高可靠性，并且由于有如此多的项目是开源的，因此不难摆脱混乱。

There are great forums (including the Chaos Community) as well as a wealth of offerings, including VMware’s ‘Mangle’ that is described as a tool to enable you:

有很棒的论坛(包括Chaos社区 )以及丰富的产品，包括VMware的“ Mangle ”，它被描述为使您能够使用的工具：

“…to run chaos engineering experiments seamlessly against applications and infrastructure components to assess resiliency and fault tolerance. It is designed to introduce faults with very little pre-configuration…”

“……针对应用程序和基础架构组件无缝运行混乱的工程实验，以评估弹性和容错能力。它旨在以很少的预配置引入故障……”

There are also legitimate full-scale services like the super cool ‘Gremlin’, which offers a platform that allows you introduce ‘attacks’ across your stack with multiple failure modes.

还有一些合法的全面服务，例如超级酷的“ Gremlin ”，它提供了一个平台，可让您通过多种故障模式在堆栈中引入“攻击”。

As a practice, it’s not quite the standard yet. Even if it doesn’t need to be the most difficult process ever, firms want the assurance of experienced DevOps who can engineer chaos safely, and that can be challenging to 1) find, and 2) invest in.

实际上，这还不是很标准。即使不需要这是有史以来最困难的过程，公司也希望得到经验丰富的DevOps的保证，他们可以安全地制造混乱，这对于1)寻找和2)投资可能是具有挑战性的。

Ultimately though, the need for continuous testing will only become more necessary as emerging threats become more sophisticated; as customer reliance on functional systems grows, the room for error shrinks.

但最终，随着新出现的威胁变得越来越复杂，对连续测试的需求将变得更加必要。随着客户对功能系统的依赖增加，错误的余地也随之缩小。

As Bill Gates famously warned us 5 years ago:

正如比尔·盖茨5年前著名地警告我们的那样：