What Is High Availability | Part 2

Design of a high availability system

Ironically, adding more components to a system can undermine efforts to achieve high availability, because complex systems inherently have more potential failure points and are harder to implement correctly. Most highly available systems follow a simple design pattern: a single, high-quality, multi-purpose physical system with comprehensive internal redundancy that runs all functions, paired with a second system at a separate physical location.

This classic design pattern is common among financial institutions, for example. The computing and communications industries established the Service Availability Forum to foster the creation of highly available network infrastructure products, services, and systems. The same basic design principle applies beyond information technology, in fields such as nuclear power, aerospace, and medical care.

High availability requires a suitable hosting environment: power supply, floor-level air conditioning with particulate filtering, maintenance service, security service, and protection against malicious acts and theft. Pay attention also to the risk of fire and water damage. Power and communication cables must be redundant and buried. They should not run exposed through the building's underground garage, which is too often seen in buildings in Paris. These criteria are the first to consider when choosing a hosting provider (when renting highly available premises).

For each level of the architecture, for each component, and for each connection between components, the following must be established:

  • How to detect a failure.
  • How the component is secured: made redundant, backed up, etc. Examples: backup server, clustered system, WebSphere clustering, RAID storage, backups, dual attachment to a SAN, degraded mode, unused spare hardware ready to be reinstalled.
  • How the switch to the backup or degraded mode is triggered: manually after analysis, or automatically?
  • How to ensure that the emergency system starts from a stable, known state. Examples: start from a copy of the database and reapply the archive logs, restart batches from a known state, use two-phase commit for transactions that update multiple data repositories, etc.
  • How the application restarts on the backup system. Examples: application restart, restart of interrupted batches, activation of a degraded mode, takeover of the failed server's IP address by the backup server, etc.
  • How in-flight transactions or sessions are resumed. Examples: session persistence on the application server, a mechanism that guarantees a response to a client for a transaction that completed successfully before the failure but for which the client has not received an answer, etc.
  • How to return to the nominal situation. Examples:
      • If a degraded mode lets pending transactions be stored in a file while a database is down, how are those transactions reapplied when the database becomes active again?
      • If a failed component has been deactivated, how is it reintroduced into active service (e.g., need to resynchronize data, retest the component, etc.)?
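The last example in the checklist, storing transactions in a file during a database outage and reapplying them afterwards, can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not a production design: `FlakyStore` and `JournalingWriter` are hypothetical names, the "database" is an in-memory list, and transactions are assumed to be JSON-serializable dictionaries.

```python
import json
import os
import tempfile


class FlakyStore:
    """Hypothetical stand-in for a database that can be switched off."""

    def __init__(self):
        self.available = True
        self.rows = []

    def write(self, txn):
        if not self.available:
            raise ConnectionError("store is down")
        self.rows.append(txn)


class JournalingWriter:
    """Degraded mode: journal transactions to a file while the store is down."""

    def __init__(self, store, journal_path):
        self.store = store
        self.journal_path = journal_path

    def _has_backlog(self):
        return (os.path.exists(self.journal_path)
                and os.path.getsize(self.journal_path) > 0)

    def submit(self, txn):
        # While a backlog exists, keep journaling so order is preserved.
        if not self._has_backlog():
            try:
                self.store.write(txn)
                return "applied"
            except ConnectionError:
                pass  # fall through to the journal
        with open(self.journal_path, "a", encoding="utf-8") as f:
            f.write(json.dumps(txn) + "\n")
        return "journaled"

    def replay(self):
        """Reapply journaled transactions in order; return how many succeeded."""
        if not self._has_backlog():
            return 0
        with open(self.journal_path, encoding="utf-8") as f:
            pending = [json.loads(line) for line in f if line.strip()]
        applied = 0
        try:
            for txn in pending:
                self.store.write(txn)
                applied += 1
        finally:
            # Keep only what was not applied, so a retry does not duplicate.
            remaining = pending[applied:]
            if remaining:
                with open(self.journal_path, "w", encoding="utf-8") as f:
                    f.writelines(json.dumps(t) + "\n" for t in remaining)
            else:
                os.remove(self.journal_path)
        return applied


if __name__ == "__main__":
    path = os.path.join(tempfile.mkdtemp(), "journal.log")
    store = FlakyStore()
    writer = JournalingWriter(store, path)
    writer.submit({"id": 1})
    store.available = False      # outage begins
    writer.submit({"id": 2})     # journaled
    store.available = True       # database is back
    writer.replay()
    print(store.rows)
```

Note the design choice in `submit`: once a backlog exists, new transactions are journaled too, so that nothing overtakes the transactions still waiting to be replayed. A real system would also need to handle concurrent writers and crashes in the middle of a write.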

Load balancing and sensitivity

Sensitivity is often managed by making elements redundant behind a load balancing mechanism. For such a system to offer a real gain in reliability, check that if one element fails, the remaining elements have enough capacity to keep providing the service.

In other words, with two active load-balanced servers, a single server must be able to carry the entire load. With three servers, a single server must be able to carry 50% of the load (assuming that the probability of simultaneous incidents on two servers is negligible). For reliability alone, there is little point in having many servers back each other up. For example, an element that is 99% reliable, made redundant once, gives a reliability of 99.99% (the probability that both elements fail at the same time is 1/100 × 1/100 = 1/10,000).
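The arithmetic above can be checked with a short sketch. The function names are illustrative, and the second function assumes that failures are independent, as the text does.

```python
def required_capacity_share(n_servers, tolerated_failures=1):
    """Share of the total load each server must be able to carry so that
    the survivors still cover 100% after `tolerated_failures` servers fail."""
    survivors = n_servers - tolerated_failures
    if survivors < 1:
        raise ValueError("at least one server must survive")
    return 1.0 / survivors


def combined_reliability(single, copies):
    """Reliability of `copies` redundant elements, assuming independent
    failures: the group fails only if every copy fails at the same time."""
    return 1.0 - (1.0 - single) ** copies


if __name__ == "__main__":
    print(required_capacity_share(2))               # two servers: full load each
    print(required_capacity_share(3))               # three servers: half each
    print(round(combined_reliability(0.99, 2), 6))  # one redundant copy
    print(round(combined_reliability(0.99, 3), 6))  # a second redundant copy
```

Going from one element to two removes 99% of the remaining unreliability, while each further copy improves a figure that is already very close to 1, which is why piling up many servers adds little for reliability alone.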

Differential redundancy

Redundancy of an element is usually achieved by using several identical components. To be effective, this assumes that the failure of one component is random and independent of the failure of the others. This is the case for hardware failures, for example.

This is not the case for all failures: for example, an operating system flaw or a software defect in a component can, when conditions are right, occur on all components at once. For this reason, when the application is extremely sensitive, redundancy is built from components of different kinds that perform the same function. This can lead to:

  • Choosing different kinds of servers, with different operating systems and different infrastructure software products,
  • Developing the same component twice, each time respecting the contracts that apply to the component's interface.

Processes that improve availability

These processes play two distinct roles:

Processes that reduce the number of failures

Since prevention is better than cure, implementing control processes that reduce the number of incidents on the system improves availability. Two processes can play this role:

  • Change management: 60% of errors are related to a recent change. A formalized process, accompanied by adequate tests (carried out in a proper pre-production environment), can eliminate many incidents.
  • Proactive error management: incidents can often be detected before they occur (response times increase, etc.). A process dedicated to this task, equipped with adequate tools (measurement system, reporting, etc.), can intervene even before the incident happens.

By implementing these processes, many incidents can be avoided.
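As an illustration of proactive error management, the sketch below raises a warning when response times drift upward before an outright failure. It is one possible approach, not a prescribed one: the class name `ResponseTimeWatch`, the baseline, the 1.5× factor, and the window size are all assumptions chosen for the example.

```python
from collections import deque


class ResponseTimeWatch:
    """Warn when the moving average of recent response times exceeds
    a multiple of the expected baseline."""

    def __init__(self, baseline_ms, factor=1.5, window=5):
        self.threshold_ms = baseline_ms * factor
        self.samples = deque(maxlen=window)  # keeps only the latest samples

    def observe(self, response_ms):
        """Record one measurement; return True if a warning should be raised."""
        self.samples.append(response_ms)
        if len(self.samples) < self.samples.maxlen:
            return False  # not enough data yet to judge a trend
        average = sum(self.samples) / len(self.samples)
        return average > self.threshold_ms


if __name__ == "__main__":
    watch = ResponseTimeWatch(baseline_ms=100, factor=1.5, window=3)
    for ms in [100, 110, 120, 200, 250]:
        print(ms, watch.observe(ms))
```

Averaging over a window rather than reacting to a single slow request avoids alerting on isolated spikes; the trade-off is that the warning arrives a few samples later.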

Processes that reduce the duration of outages

Breakdowns always happen eventually. When they do, the error recovery process is essential for restoring service as quickly as possible. This process must have one goal: enabling users to use the service again as quickly as possible. Permanent repair should be avoided at this stage because it takes much longer; instead, the process should implement a workaround for the problem.

High availability cluster

A high availability cluster (as opposed to a computing cluster) is a cluster of computers whose goal is to provide a service while avoiding downtime.

Redundancy with voting system

In this mode, several components process the same inputs and therefore (in principle) produce the same output.

The results produced by all the components are collected, then an algorithm is applied to produce the final result. The algorithm can be simple (majority vote) or more complex (mean, weighted mean, median, etc.). The aim is to eliminate erroneous results caused by a malfunction in one of the components and/or to obtain a more reliable result by combining several slightly different results.

This approach:

  • Does not allow load balancing
  • Introduces the problem of the reliability of the component that runs the voting algorithm

This method is commonly used in the following cases:

  • Systems based on sensors (e.g., temperature sensors) in which the sensors are redundant
  • Systems in which several different components perform the same function and a better outcome can be achieved by combining their results (e.g., a pattern recognition system that uses multiple algorithms for a better recognition rate).
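The voting scheme described in this section can be sketched as follows. This is a minimal illustration: `majority_vote` and `median_vote` are hypothetical helper names, and the sensor readings are made-up values.

```python
from collections import Counter
from statistics import median


def majority_vote(outputs):
    """Return the output produced by a strict majority of components.

    Raises ValueError when no value has more than half of the votes,
    since the voter then cannot tell which components are faulty.
    """
    if not outputs:
        raise ValueError("no outputs to vote on")
    winner, count = Counter(outputs).most_common(1)[0]
    if count * 2 <= len(outputs):
        raise ValueError("no strict majority")
    return winner


def median_vote(readings):
    """Combine numeric readings; the median ignores a single wild outlier."""
    return median(readings)


if __name__ == "__main__":
    # Three redundant components, one of which malfunctions.
    print(majority_vote(["open", "open", "closed"]))
    # Three temperature sensors, one returning a faulty reading.
    print(median_vote([21.4, 21.6, 85.0]))
```

A mean would be pulled toward the faulty 85.0 reading, while the median stays with the two healthy sensors. The voter itself remains a single point of failure, which is the reliability problem noted above.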

Continued…

Translated from: https://www.eukhost.com/blog/webhosting/what-is-high-availability-part-2-2/
