IT Project Revelations: Lessons from Titanic (Part 10) (Repost)

clijovtbq401783153

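Recapping Titanic's situation: after the collision with the iceberg (see Part 8) the ship still appeared sound and no one had been injured. From the bridge, the ship's integrity seemed intact. White Star director Bruce Ismay was determined to save face and protect the company's reputation (see Part 9). At 11:50 p.m., 10 minutes after the collision, Ismay pushed to restart the ship, and Titanic limped off the ice shelf. Passengers, unaware of any danger, were relieved that the ship was under way again and gave little thought to the collision, the potential damage, or the consequences.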

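In today's IT projects it is vital that the Mean Time To Recovery (MTTR) procedures for the IT solution (see Part 9) are set up, prepared, planned, and tested within the project itself (see Part 4), and "institutionalized" with dedicated staff (operations groups and technical support). Data collected in the second quadrant of a problem (the four quadrants are detection, determination, resolution, and recovery) has to stand up to rigorous review.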
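To make the four quadrants concrete, here is a minimal Python sketch of how an operations group might record them and derive MTTR. The `Incident` class, its field names, and the choice to mark each quadrant by its end timestamp are assumptions made for illustration; the article does not prescribe a data model.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Dict, List


@dataclass
class Incident:
    """One outage, with a timestamp for the end of each quadrant (hypothetical fields)."""
    started: datetime     # outage begins
    detected: datetime    # end of quadrant 1: detection
    determined: datetime  # end of quadrant 2: determination
    resolved: datetime    # end of quadrant 3: resolution (fix applied)
    recovered: datetime   # end of quadrant 4: recovery (service restored)

    def quadrant_durations(self) -> Dict[str, timedelta]:
        """Time spent in each of the four quadrants."""
        return {
            "detection": self.detected - self.started,
            "determination": self.determined - self.detected,
            "resolution": self.resolved - self.determined,
            "recovery": self.recovered - self.resolved,
        }

    def time_to_recovery(self) -> timedelta:
        """Total time from the start of the outage to restored service."""
        return self.recovered - self.started


def mean_time_to_recovery(incidents: List[Incident]) -> timedelta:
    """Average time to recovery over a set of incidents."""
    if not incidents:
        raise ValueError("need at least one incident")
    total = sum((i.time_to_recovery() for i in incidents), timedelta())
    return total / len(incidents)
```

Breaking the total down per quadrant shows where the time goes. A long determination quadrant, for instance, suggests that the data gathered there is not yet good enough to act on.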

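Before a fix is applied to a faulty production system, the team needs to assess the overall risk of the fix itself. Executive intervention should be treated like any other input and examined carefully so that it does not make the situation worse. Importantly, any input that does not make sense should be challenged immediately, without repercussions for those who challenge it.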

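Whether Captain Smith took part in the decision to restart the voyage hardly matters, because Ismay was already steering events to suit his own agenda. Smith was still optimistic when he went to the wireless room to inform the White Star Line in Boston; after all, there was great confidence in the design of the ship, with its 73 watertight compartments. His wireless message said that Titanic had struck ice but suffered little damage, that everyone aboard was safe, and that as a precaution the ship was proceeding to Halifax, Canada. The message would give White Star enough time to arrange trains and carriages to take the passengers on to New York. Wireless messages were not encrypted, so this one was picked up by the world's media, which is why early reports of the collision in the European press were overwhelmingly optimistic.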

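In today's IT projects, MTTR procedures need to be fully controlled by the groups responsible for the IT solution and the services it provides. Communications and announcements about an outage must be made in close coordination with these groups, and support decisions need to be made when communicating externally with the recipients of the service. Inaccurate information quickly erodes confidence in the service provider.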

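The second search party, which included the architect Thomas Andrews and the carpenter John Hutchinson, returned with a more accurate assessment of the damage and better data. The first search party had not descended enough decks to see the full extent of the damage. In fact, within seconds of the collision, flooding had begun in the coal bunkers and Boiler Room 5. One fireman later testified that he saw a gaping hole 2 feet into the floor of the coal bunker. Pumps were started right away and seemed able to cope with the flooding and keep the ship afloat. Andrews knew that once the mail room flooded, the ship was doomed.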

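The lesson for today's IT projects is that, to pinpoint faults, the support team needs detailed knowledge of the integrated working solution and must be able to break it down into logical layers and decompose it into a series of products and components. The key is to document the work at every stage of the project and to hand that documentation over as knowledge to the support staff who run the solution afterwards.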

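After the ship restarted, Boiler Room 6 also began to flood. Only about 20 minutes later it was clear how inaccurate the initial determination had been, and the fix was not working: the mail room was lost to flooding. Smith conferred with Andrews and the officers and decided that the ship, then sailing at 8 knots, should come to a gradual stop. The forward motion had taken its toll. The ship had taken on more water, the flooding was becoming catastrophic, and other parts of the ship that the collision had not affected began to leak under the pressure of the water.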

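Today, when an IT solution falters and is in an MTTR situation, it is important to keep assessing and reassessing the environmental data (the evidence) and to monitor the environment for changes. The first fix is usually temporary (see Part 9), intended only to bring the solution back into service. A permanent replacement fix may take hours or days to put in place, and the solution may have to be patched in the background; for example, code may need to be reworked or a new component integrated into the solution. Such a fix must go through rigorous planning and testing (see Parts 4 and 5) before it is put into production under the project's procedures, which is why a robust change management process and a test/staging environment are required.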

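Andrews correctly predicted to Smith that the ship had about two hours before it sank. This was a death sentence. Smith finally recognized that the situation was beyond saving, unlike right after the collision, when something could still have been done.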

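The lesson for today's IT projects is that MTTR procedures are cyclical and allow for several recovery attempts within a limited time frame. Ismay, however, forced the situation beyond the limits of MTTR, that is, beyond recovery.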
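The cyclical, time-boxed nature of an MTTR procedure can be sketched as a loop: apply a candidate fix, check whether service is restored, and try the next fix while time remains. The function name `recover_within`, its parameters, and the idea of returning `None` as the signal to escalate are illustrative assumptions, not something the article specifies.

```python
import time
from typing import Callable, Iterable, Optional


def recover_within(
    fixes: Iterable[Callable[[], None]],
    is_healthy: Callable[[], bool],
    budget_seconds: float,
    clock: Callable[[], float] = time.monotonic,
) -> Optional[int]:
    """Try candidate fixes in order until the service is healthy or time runs out.

    Returns the index of the fix that restored service, or None when the
    budget is spent or no fix worked (the point at which to escalate).
    """
    deadline = clock() + budget_seconds
    for index, apply_fix in enumerate(fixes):
        if clock() >= deadline:
            # Out of time: stop attempting fixes instead of making things worse.
            return None
        apply_fix()
        # Reassess the evidence after every attempt before deciding what to do next.
        if is_healthy():
            return index
    return None
```

The `clock` parameter exists only so the loop can be exercised with a simulated clock. The point the sketch tries to capture is that the budget is checked before each attempt, so a fix is never started once the window for recovery has closed.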

Conclusions

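Today, many IT projects make a critical situation far worse by not following the established procedures for operating and recovering a solution. Institutionalized MTTR procedures should help curb the kind of disparate decision making seen on Titanic and keep a critical situation from becoming a catastrophe. So should the support staff's detailed knowledge of the solution. The next installment will look at the disaster recovery stage of the IT project.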

Original text:

In recapping Titanic’s situation, following the collision (Part 8) the ship appeared to be in remarkably good shape. No one had been injured and from the bridge the integrity of the ship appeared to be sound. White Star Director Bruce Ismay was hell bent on saving face--and his company’s reputation (Part 9). At 11:50 p.m., 10 minutes after the collision, Ismay pushed to restart the ship and limp Titanic off the ice shelf. Passengers, unaware of any dangers, later testified their initial relief that the ship was restarting the journey again, with little concern about the collision, the potential damage and consequences.

In today’s IT projects it is vital that Mean Time To Recovery (MTTR) procedures for the IT solution (see Part 9) are set up, prepared, planned and tested--in the project itself (Part 4) and "institutionalized" with the staff (operations groups/technical support). Data collected in the second "problem" quadrant (the four quadrants are: detection, determination, resolution and recovery) has to stand up to rigorous review.

Before a resolution or fix is applied into production, the team needs to assess the overall risk of proceeding with it. Executive intervention is handled like any other input and needs to stand up to careful examination so as not to further deteriorate the situation. Importantly, it needs to be challenged if it does not make sense, without any repercussions.

Whether Captain Smith was part of the decision to restart Titanic was not really relevant, as Ismay was in control of the situation, driving forward his own agenda. Smith proceeded to the wireless room to inform the White Star Line in Boston of the situation. Smith was still optimistic; after all, there was great confidence in the design of the ship with the 73 watertight compartments. Smith sent a wireless message outlining that Titanic had struck ice but with little damage. Everyone was safe aboard, and as a precaution the ship was proceeding to Halifax. The message would give White Star time to organize trains and carriages to transport the passengers to New York. Wireless messages were not encrypted and this one was intercepted by the world media. It was the reason why early reports of the collision that appeared in the European press were overwhelmingly optimistic.

In today’s IT projects, MTTR procedures need to be completely controlled by the groups responsible for the IT solution and the services it provides. Communications or announcements related to an outage situation need to be made in close conjunction with these groups and support decisions made when communicating externally to the service recipients of a solution. Inaccurate information can quickly erode confidence in the service provider.

The second search party, with the architect Thomas Andrews and the carpenter John Hutchinson, returned with a more accurate assessment of the situation and better data. The first search party had not descended enough decks to see the full extent of the damage. Within seconds of the collision, flooding had occurred in the coal bunkers and Boiler Room 5. One of the firemen later testified seeing a gaping hole 2 feet into the floor of the coal bunker. Suction lines were set up right away and the pumping seemed to be coping with the rate of flooding to keep the ship afloat. Andrews knew that if the mail room was lost to flooding, the ship was doomed.

The lesson from this for IT projects today is that, in order to pinpoint faults, the support team needs a detailed knowledge of the integrated working solution, and the ability to break it down into logical layers and decompose it into a sequence of products and components. The importance of creating documentation at each stage of the project, and then transferring it as knowledge to support staff for later use in the operation, is key.

After restarting the ship, Boiler Room 6 had started to flood. Around 20 minutes later it was apparent that the initial determination was grossly inaccurate, and the fix was not resolving the situation. The mail room was lost to flooding. Smith conferred with Andrews and the officers, determining that the ship--sailing now at 8 knots--should come to a gradual stop. The forward motion had taken its toll. The ship had taken on more water resulting in increased flooding that was becoming catastrophic. Other parts of the ship, which were initially unaffected, had started to spring leaks under the strain of the water.

In today’s world, in an MTTR situation where an IT solution falters, it is important to keep assessing and reassessing the environmental data (evidence) and monitoring the environment for any changes. The first fix applied is usually temporary (Part 9) so as to get the solution online and back into service. It may take hours or days to get a permanent fix in place. The solution may have to be patched up in the background. For example, code may have to be reworked or a new component integrated into the solution. This then needs to go through rigorous planning and testing (Part 4 and Part 5) before implementing into production using the procedures from the project, hence the requirement for a robust change management process and a test/staging environment.

Andrews rightly predicted to Smith that the ship had approximately two hours before foundering. This was a death sentence, and Smith finally recognized the situation was hopeless and not recoverable like it had been right after the collision.

The lesson from this for IT projects today is that MTTR procedures are cyclical and allow for several attempts at recovery, in a limited time frame. However, Ismay forced a situation where the ship went beyond MTTR or recovery.

Conclusions
Today, many IT projects severely compromise a critical situation by not following an established process in operation and recovery of a solution. Institutionalized MTTR procedures should help minimize disparate decision making as carried out on Titanic and prevent a critical situation from becoming catastrophic. So should the support staff’s detailed knowledge of the solution. The next installment will look at the disaster recovery stage of the IT project.

Source: ITPUB Blog, http://blog.itpub.net/7839396/viewspace-955624/. If you repost this article, please credit the source; otherwise legal action may be taken.
