RAS: reliability, availability, serviceability

Latest recommended article published on 2024-03-17 17:09:06

wlzsddx


http://en.wikipedia.org/wiki/Reliability,_availability_and_serviceability_(computer_hardware)

http://en.wikipedia.org/wiki/Fault-tolerant_system

Reliability, availability and serviceability are combined to get a good RAS system.

Reliability requires keeping the system functioning correctly. If it malfunctions, try to isolate the fault and report it to the maintenance party.

Availability requires a quick recovery procedure. It has to cover fault alarming, recovery, isolation, reporting to the maintenance team and automatic handling, all in order to reduce downtime.

Fault-tolerant design is the mindset for a RAS system

0. Fault detection, reporting

1. Fault isolation, reduced functionality

2. Recovery from error: roll forward, roll back, checkpointing

Self-stabilizing

3. Some sort of duplication

The basic characteristics of fault tolerance require:

No single point of failure: if a system experiences a failure, it must continue to operate without interruption during the repair process.

Fault isolation to the failing component: when a failure occurs, the system must be able to isolate the failure to the offending component. This requires the addition of dedicated failure detection mechanisms that exist only for the purpose of fault isolation. Recovery from a fault condition requires classifying the fault or failing component. The National Institute of Standards and Technology (NIST) categorizes faults based on Locality, Cause, Duration and Effect.

Fault containment to prevent propagation of the failure: some failure mechanisms can cause a system to fail by propagating the failure to the rest of the system. An example of this kind of failure is the "rogue transmitter", which can swamp legitimate communication in a system and cause overall system failure. Mechanisms that isolate a rogue transmitter or failing component to protect the system are required.

Availability of reversion modes.

In addition, fault-tolerant systems are characterized in terms of both planned service outages and unplanned service outages. These are usually measured at the application level and not just at a hardware level. The figure of merit is called availability and is expressed as a percentage. For example, a five nines system would statistically provide 99.999% availability.

Fault-tolerant systems are typically based on the concept of redundancy.

Lockstep systems detect faults with DMR (dual modular redundancy) and correct faults with TMR (triple modular redundancy)

4. Geographic redundancy, multi-datacenter awareness, switchover

(flexible horizontal scalability)
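The lockstep idea from item 3 can be sketched in a few lines. This is only a minimal illustration, not a description of any real hardware voter: the function names `dmr_check` and `tmr_vote` are hypothetical, and the module outputs are assumed to be plain hashable Python values. Two modules can only show that they disagree; a third module is what makes it possible to pick the majority value and point at the faulty one.

```python
from collections import Counter


def dmr_check(a, b):
    """Dual modular redundancy: two lockstep outputs can only reveal a mismatch."""
    return a == b


def tmr_vote(a, b, c):
    """Triple modular redundancy: majority vote over three lockstep outputs.

    Returns the majority value and the indices of the modules that disagreed.
    """
    outputs = [a, b, c]
    value, count = Counter(outputs).most_common(1)[0]
    if count < 2:
        # All three outputs differ, so the fault cannot be masked.
        raise RuntimeError("no majority: more than one module is faulty")
    faulty = [i for i, out in enumerate(outputs) if out != value]
    return value, faulty


print(dmr_check(7, 9))    # False: fault detected, but not which side is wrong
print(tmr_vote(7, 7, 9))  # (7, [2]): fault masked, module 2 isolated
```

The list of disagreeing modules is what ties the voter back to items 0 and 1: it gives the system something to report and something to isolate.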

1. Reliability

Bathtub curve:
The failure rate λ is the reciprocal of the MTBF, which is a measure of the reliability of a component. Plotting the statistical failure rate λ over time t gives the bathtub curve: a high but falling rate of early failures, a long flat phase with a roughly constant failure rate, and a rising rate once wear-out begins.

MTBF of a single module = total device hours / number of failures
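Since the plot itself is not reproduced here, the shape can be sketched numerically. The model below is an assumption and not taken from the source: it adds a decreasing Weibull hazard (early failures), a constant rate (random failures) and an increasing Weibull hazard (wear-out). All parameter values are invented purely to make the shape visible.

```python
def weibull_hazard(t, shape, scale):
    """Hazard rate of a Weibull distribution: (k/s) * (t/s)**(k - 1), for t > 0."""
    return (shape / scale) * (t / scale) ** (shape - 1)


def bathtub_rate(t_hours):
    """Illustrative failure rate (failures per hour) at age t_hours > 0."""
    early = weibull_hazard(t_hours, shape=0.5, scale=2_000.0)   # falls with age
    random_failures = 1e-4                                      # constant floor
    wear_out = weibull_hazard(t_hours, shape=5.0, scale=60_000.0)  # rises with age
    return early + random_failures + wear_out


for t in (10, 1_000, 20_000, 60_000, 100_000):
    print(f"{t:>7} h  lambda = {bathtub_rate(t):.6f} per hour")
```

The printed rate is high at the start, lowest in the middle of the range, and climbs again at the end, which is the bathtub shape.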

Reliability:

The probability that a system will operate without failure up to a specified time t

The probability of failure by time t is the complement: 1 minus the reliability

System failure rate = 1/MTBF = number of failures / total device hours

The MTBF is determined by adding together the FIT values (failures per 10^9 device hours)
of the individual components and taking the reciprocal of the sum.
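As a rough sketch of the two calculations above: MTBF from field data, and MTBF from component FIT values. It assumes that one FIT is one failure per 10^9 device hours, that failure rates are constant, and that any single component failure fails the whole module (a series system), so the rates simply add. The function names and the sample numbers are made up for illustration.

```python
FIT_HOURS = 1e9  # one FIT = one failure per 10^9 device hours


def mtbf_from_field_data(total_device_hours, failures):
    """MTBF in hours = total device hours / number of failures."""
    return total_device_hours / failures


def mtbf_from_fit(fit_values):
    """Sum the component FIT values, then take the reciprocal to get MTBF in hours."""
    total_fit = sum(fit_values)
    return FIT_HOURS / total_fit


# 100 devices running 10,000 hours each, with 4 failures observed.
print(mtbf_from_field_data(100 * 10_000, 4))  # 250000.0

# Three components rated at 100, 250 and 150 FIT.
print(mtbf_from_fit([100, 250, 150]))         # 2000000.0
```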

R(t) = exp(-t/MTBF)

The likelihood that a system will operate successfully up to its MTBF is exp(-1) = 0.368
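The formula is easy to check numerically. The sketch below assumes the constant failure rate that R(t) = exp(-t/MTBF) implies; the function name and the 50,000 hour MTBF are arbitrary examples.

```python
import math


def reliability(t_hours, mtbf_hours):
    """Probability of running t_hours without failure: R(t) = exp(-t / MTBF)."""
    return math.exp(-t_hours / mtbf_hours)


mtbf = 50_000.0
print(round(reliability(mtbf, mtbf), 3))      # 0.368: survival up to the MTBF itself
print(round(1 - reliability(mtbf, mtbf), 3))  # 0.632: probability of failure by then
```

In other words, the MTBF is not a guaranteed lifetime: only about 37% of units are expected to reach it without a failure.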

Service life

The service life is the time for which the device or component is designed to
function. It is therefore the time up to the beginning of the wear-out phase,
which sets in through physical wear or aging caused by chemical reactions. In
devices with electromechanical parts (relays), the service life is mainly
determined by the number of operations and the connected load.

The MTBF figure for a product can be derived in various ways: lab test data, actual field failure data, or prediction models (such as Telcordia SR-332 or MIL-HDBK-217). The RelCalc for Windows software can help you do your MTBF prediction.

Reliability is theoretically defined as the probability of success (1 minus the probability of failure), as the frequency of failures, or, in terms of availability, as a probability derived from reliability and maintainability. Maintainability and maintenance may be defined as a part of reliability engineering. Reliability plays a key role in the cost-effectiveness of systems.

Reliability engineering for complex systems requires a different, more elaborate systems approach than for non-complex systems. It may involve the creation of proper use studies and requirements specification, hardware and software design, functional (failure) analysis, testing and analyzing manufacturing, maintenance, transport, storage, spare parts stocking, operations research, human factors and technical documentation. Data and information acquisition and organisation may also be important. Effective reliability engineering requires an understanding of the basics of failure mechanisms, which in turn calls for experience, broad engineering skills and good knowledge of many specialized fields of engineering, such as tribology, stress and fracture mechanics, fatigue, thermal, shock, electrical and chemical engineering.

2. Availability

Availability is the probability a system is operational at a given time, i.e. the amount of time a device is actually operating as the percentage of total time it should be operating. In high availability applications, availability may be reported as minutes or hours of downtime per year. Availability features allow the system to stay operational even when faults do occur.

Five nines: unavailability of 0.00001; 10,000 hours; 10 years; 5.26 minutes of downtime per year

The downtime is less than 5.26 minutes per year
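The 5.26 minute figure follows directly from the definition of availability as a fraction of total time. A small sketch, where the function names are hypothetical and a year is assumed to be 365.25 days:

```python
MINUTES_PER_YEAR = 365.25 * 24 * 60  # 525,960 minutes, averaging in leap years


def availability(uptime_hours, total_hours):
    """Fraction of the time the system was actually operating."""
    return uptime_hours / total_hours


def downtime_minutes_per_year(availability_fraction):
    """Expected downtime per year for a given availability (e.g. 0.99999)."""
    return (1.0 - availability_fraction) * MINUTES_PER_YEAR


for nines in (0.999, 0.9999, 0.99999):
    print(f"{nines}: {downtime_minutes_per_year(nines):.2f} minutes per year")
```

Each additional nine cuts the allowed downtime by a factor of ten: roughly 526 minutes for three nines, 52.6 for four, and 5.26 for five.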