Recently, Amazon Web Services experienced one of the largest disruptions the cloud computing service has seen. The failure caused a ripple effect across many websites such as Quora and HootSuite, which were completely down during the outage. Other websites, such as Reddit, were at least partially affected.
Failures happen, and with the cloud computing industry as young as it is, I am positive this will not be the last outage, or the biggest.
But there were several issues with this incident (on both sides) that can teach us all something about planning for and handling crises.
How You Communicate in a Crisis is Critically Important
Amazon is facing its share of criticism for the AWS outage, as it should be. But the harshest criticism I’ve seen isn’t about the length of the outage, or the fact that there was one in the first place. The worst criticism I’ve read was about Amazon’s lack of any response for more than 40 minutes after the outage began—an eternity when your website is down.
As the outage continued, many customers were upset that the updates appeared as if they had been written by the legal department (and judging by the delay in getting updates, they may have been) instead of being written by real human beings who were working to resolve the problem.
The lesson here is clear—when you have any kind of crisis, communication with those affected is extremely important. In emergency mode, it may not be possible to pick up the phone to talk to a client or customer, but updating your website or changing the voicemail message can have a major impact.
When you communicate with your customers about a problem, be honest and sincere. It’s amazing how much a little sincerity can do to appease an upset customer. 37signals is a great example. When their Basecamp service has an outage (which is extremely rare), they respond with a detailed explanation and a compassionate apology, which we’ve yet to see from Amazon.
Have a Contingency Plan
As I was hearing of some extremely large websites being completely down due to the AWS outage, I couldn’t help wondering why they built their systems without any redundancy or backup plan. Cloud computing is a relatively young industry, and although Amazon Web Services has been very reliable, failures happen. You wouldn’t stop backing up your computer just because you’d never had a hard drive crash—it’s just bound to happen sooner or later.
One of the biggest advantages of cloud computing is its rapid scalability. It is entirely possible to set up two completely separate cloud environments, one at AWS and one at Rackspace for instance, and keep one as a backup that can be scaled up to production, either manually or automatically, when a failure occurs.
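As a rough illustration of the automatic variant, here is a minimal Python sketch of choosing between a primary and a standby environment based on a health check. Everything specific in it is an assumption made for the example and not something described in the article: the two example.com URLs, the `/health` path, and the function names `pick_environment` and `http_health_check` are all hypothetical.

```python
import urllib.request
from typing import Callable, Optional, Sequence

# Hypothetical endpoints, listed in order of preference (primary first).
ENVIRONMENTS = [
    "https://primary.example.com",  # e.g. the environment hosted at one provider
    "https://standby.example.com",  # e.g. the backup hosted at a second provider
]


def pick_environment(
    environments: Sequence[str],
    is_healthy: Callable[[str], bool],
) -> Optional[str]:
    """Return the first environment that passes its health check, or None."""
    for env in environments:
        try:
            if is_healthy(env):
                return env
        except Exception:
            # A health check that raises is treated the same as a failed check.
            continue
    return None


def http_health_check(base_url: str, timeout: float = 3.0) -> bool:
    """Assumed convention: a healthy site answers GET /health with HTTP 200."""
    try:
        with urllib.request.urlopen(base_url + "/health", timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        # Covers connection errors, timeouts, and HTTP error responses.
        return False
```

Calling `pick_environment(ENVIRONMENTS, http_health_check)` would return the primary URL while it is healthy and the standby URL once the primary stops responding. In practice the result would feed a DNS update or a load balancer change, and a manual failover would simply skip the automated check.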
Now, what does this mean for you and me?
In my e-commerce business, we rely heavily on search engines to drive traffic (and revenues). In 2003, a Google update resulted in our website dropping from the top of page one to around page 50 for almost every major keyword phrase. Our traffic (and revenues) disappeared overnight. We scrambled to drive traffic through search marketing and other avenues, but it was too late by that point. It was our busy season and there were no customers.
Any time a single point of failure can cause a project or service to fail completely, you are just asking for trouble. I’ve seen it happen time and time again when companies hire a single contractor to program an application on a tight deadline, or rely on a single client for almost all of their income.
The solution in our e-commerce business was to diversify our marketing strategy. We still get a significant amount of traffic from search engines, but also drive revenues through search marketing, email marketing, social media and other forms of advertising. If another update causes a drop, we will definitely be affected, but it won’t be devastating.
Whether you have several contractors that can step in to help on projects in an emergency, or work to diversify your client roster so your biggest client doesn’t bring in the majority of your income, you should investigate ways you can limit the effect of any one incident on your business.
Image via Anyk / Shutterstock
Translated from: https://www.sitepoint.com/two-important-lessons-from-the-aws-failure-2/