RAS: reliability, availability, serviceability

http://en.wikipedia.org/wiki/Reliability,_availability_and_serviceability_(computer_hardware)

http://en.wikipedia.org/wiki/Fault-tolerant_system


Reliability, availability, and serviceability together make a good RAS system.

Reliability requires the system to keep functioning correctly. If a malfunction occurs, the system should try to isolate it and report it to the maintenance party.

Availability requires a quick recovery procedure. The design needs to cover fault alarms, recovery, isolation, and reporting to the maintenance team, and should handle these automatically in order to reduce downtime.

Fault-tolerant design is the mindset for a RAS system:

0. Fault detection, reporting

1. Fault isolation, reduced functionality

2. Recovery from error: roll forward, roll back, checkpointing

 Self-stabilizing

3. Some sort of duplication

The basic characteristics of fault tolerance require:

No single point of failure – If a system experiences a failure, it must continue to operate without interruption during the repair process.

Fault isolation to the failing component – When a failure occurs, the system must be able to isolate the failure to the offending component. This requires the addition of dedicated failure detection mechanisms that exist only for the purpose of fault isolation. Recovery from a fault condition requires classifying the fault or failing component. The National Institute of Standards and Technology (NIST) categorizes faults based on Locality, Cause, Duration and Effect.

Fault containment to prevent propagation of the failure – Some failure mechanisms can cause a system to fail by propagating the failure to the rest of the system. An example of this kind of failure is the "rogue transmitter", which can swamp legitimate communication in a system and cause overall system failure. Mechanisms that isolate a rogue transmitter or failing component to protect the system are required.

Availability of reversion modes

In addition, fault-tolerant systems are characterized in terms of both planned service outages and unplanned service outages. These are usually measured at the application level and not just at a hardware level. The figure of merit is called availability and is expressed as a percentage. For example, a five nines system would statistically provide 99.999% availability.

Fault-tolerant systems are typically based on the concept of redundancy.


Lockstep systems detect faults with DMR (dual modular redundancy) and correct faults with TMR (triple modular redundancy).


4. Geographic redundancy, multi-datacenter awareness, switchover

(flexible horizontal scalability)
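As a rough illustration of the lockstep idea mentioned above, here is a minimal sketch of a DMR comparison and a TMR majority vote. The function names `dmr_check` and `tmr_vote` are hypothetical, and the sketch assumes module outputs are simple hashable values that can be compared for equality; real lockstep hardware compares signals cycle by cycle.

```python
from collections import Counter


def dmr_check(a, b):
    """Dual modular redundancy: two outputs can reveal a fault,
    but a mismatch does not say which module is the faulty one."""
    return a == b


def tmr_vote(a, b, c):
    """Triple modular redundancy: return the value that at least two
    of the three modules agree on, masking a single faulty module."""
    value, count = Counter([a, b, c]).most_common(1)[0]
    if count < 2:
        # All three outputs differ, so no majority exists.
        raise RuntimeError("no majority: more than one module disagrees")
    return value


if __name__ == "__main__":
    print(dmr_check(42, 42))     # outputs agree, no fault detected
    print(dmr_check(42, 41))     # mismatch, fault detected but not located
    print(tmr_vote(42, 42, 41))  # the single bad output is outvoted
```

DMR only detects a disagreement, which is why the notes pair it with fault detection, while TMR can also correct a single fault by voting.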

1. Reliability

Bathtub curve:
The failure rate λ is the reciprocal of the MTBF, which is a measure of the reliability of a component. Plotting the statistical failure rate λ over time t gives the bathtub function (bathtub curve): a high early-failure rate, a roughly constant rate during the useful life, and a rising rate in the wear-out phase.

A single module's MTBF = total device hours / number of failures



Reliability:

The probability that a system will operate without failure up to a specified time t.

The probability of failure = 1 - reliability.

The system failure rate = 1/MTBF = number of failures / total device hours
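The two ratios above can be sketched in a few lines. The function names and the sample figures (1,000 devices observed for 1,000 hours each, with 5 failures) are made up for illustration and are not from any real data set.

```python
def mtbf_hours(total_device_hours, failures):
    """MTBF = total device hours / number of failures."""
    if failures <= 0:
        raise ValueError("at least one failure is needed to estimate MTBF")
    return total_device_hours / failures


def failure_rate_per_hour(total_device_hours, failures):
    """Failure rate (lambda) = number of failures / total device hours = 1 / MTBF."""
    if total_device_hours <= 0:
        raise ValueError("total device hours must be positive")
    return failures / total_device_hours


if __name__ == "__main__":
    # Hypothetical field data: 1,000 devices x 1,000 hours, 5 failures.
    device_hours = 1_000 * 1_000
    print(mtbf_hours(device_hours, 5))             # 200000.0 hours
    print(failure_rate_per_hour(device_hours, 5))  # failures per device hour
```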



The system failure rate is determined by adding together the FIT values (failures
per 10^9 device hours) of the individual components. The MTBF is the reciprocal
of that sum.
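A small sketch of the FIT calculation follows. It assumes a series system (any component failure fails the system) with constant failure rates, so the FIT values simply add; the function name and the component FIT values are hypothetical.

```python
FIT_DEVICE_HOURS = 1e9  # 1 FIT = 1 failure per 10^9 device hours


def system_mtbf_from_fit(fit_values):
    """Sum the component FIT values and convert the total to an MTBF in hours."""
    total_fit = sum(fit_values)
    if total_fit <= 0:
        raise ValueError("total FIT must be positive")
    return FIT_DEVICE_HOURS / total_fit


if __name__ == "__main__":
    # Hypothetical component FIT values: 100 + 250 + 150 = 500 FIT in total.
    print(system_mtbf_from_fit([100, 250, 150]))  # 2000000.0 hours
```

Because the FIT values add, the system MTBF is always lower than the MTBF of its least reliable component under these assumptions.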


R(t) = exp(-t/MTBF)


The likelihood that a system will operate successfully up to its MTBF = exp(-1) ≈ 0.368
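The formula R(t) = exp(-t/MTBF) is easy to check numerically. This is a minimal sketch that assumes a constant failure rate (the flat part of the bathtub curve); the function name and the 50,000-hour MTBF are illustrative only.

```python
import math


def reliability(t_hours, mtbf_hours):
    """R(t) = exp(-t / MTBF): probability of surviving to time t
    under a constant failure rate."""
    if mtbf_hours <= 0:
        raise ValueError("MTBF must be positive")
    return math.exp(-t_hours / mtbf_hours)


if __name__ == "__main__":
    mtbf = 50_000.0  # hypothetical MTBF in hours
    print(round(reliability(mtbf, mtbf), 3))       # 0.368, i.e. exp(-1)
    print(round(reliability(0.1 * mtbf, mtbf), 3)) # survival to 10% of the MTBF
```

The first result shows why MTBF should not be read as a guaranteed lifetime: only about 37% of units are expected to survive that long.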


Service life

The service life is the time for which the device or component is designed to 
function. It is therefore the time up to the beginning of the wear-out phase, 
which is caused by physical wear or by aging due to chemical reactions. In the 
case of devices with electromechanical parts (relays), the service life is mainly 
defined by the number of operations and the connected load.



The MTBF figure for a product can be derived in various ways: lab test data, actual field failure data, or prediction models (such as Telcordia SR-332 or MIL-HDBK-217). Software such as RelCalc for Windows can help with MTBF prediction.



Reliability is theoretically defined as the probability of success (reliability = 1 - probability of failure), as the frequency of failures, or in terms of availability, a probability derived from reliability and maintainability. Maintainability and maintenance may be defined as a part of reliability engineering. Reliability plays a key role in the cost-effectiveness of systems.

Reliability engineering for complex systems requires a different, more elaborate systems approach than for non-complex systems. It may involve proper-use studies and requirements specification, hardware and software design, functional (failure) analysis, testing, and the analysis of manufacturing, maintenance, transport, storage, spare parts stocking, operations research, human factors, and technical documentation. Data and information acquisition and organisation may also be important. Effective reliability engineering requires an understanding of the basics of failure mechanisms, which in turn calls for experience, broad engineering skills, and good knowledge of many specialized fields of engineering, such as tribology, stress and fracture mechanics, fatigue, thermal, shock, electrical, and chemical engineering.

2. Availability

Availability is the probability a system is operational at a given time, i.e. the amount of time a device is actually operating as the percentage of total time it should be operating. In high availability applications, availability may be reported as minutes or hours of downtime per year. Availability features allow the system to stay operational even when faults do occur.

Five nines: unavailability of 0.00001 (10,000 hours, 10 years), 5.26 minutes of downtime per year

The downtime is less than 5.26 minutes per year
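The 5.26-minute figure can be reproduced with a short calculation. This sketch assumes an average year of 365.25 days (525,960 minutes); the function names are hypothetical.

```python
MINUTES_PER_YEAR = 365.25 * 24 * 60  # 525,960 minutes in an average year


def availability_from_nines(nines):
    """Availability for a given number of nines, e.g. 5 -> 0.99999."""
    return 1.0 - 10.0 ** (-nines)


def downtime_minutes_per_year(availability):
    """Allowed downtime per year = unavailability x minutes per year."""
    if not 0.0 <= availability <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    return (1.0 - availability) * MINUTES_PER_YEAR


if __name__ == "__main__":
    for n in (3, 4, 5):
        a = availability_from_nines(n)
        print(n, "nines:", round(downtime_minutes_per_year(a), 2), "minutes/year")
```

For five nines this gives roughly 5.26 minutes per year; each additional nine cuts the allowed downtime by a factor of ten.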


3. Serviceability, maintainability


REF:

http://download.csdn.net/detail/wlzsddx/7369421

http://download.csdn.net/detail/wlzsddx/7369399

Systems Engineering: Building Successful Systems

