DDIA (1): Chapter 1. Reliable, Scalable, and Maintainable Applications

This chapter discusses three key concerns of software systems. Reliability means the system continues to work correctly even in the face of hardware faults, software errors, and human error, and is strengthened through redundancy and fault-recovery strategies. Scalability is about coping effectively as data volume, traffic, or complexity grows, illustrated by how Twitter implements posting tweets and rendering home timelines. Maintainability means the system is easy to understand and modify, so that different people can collaborate productively on maintaining it and adapting it to new requirements.

Overall

In this chapter, we focus on three concerns that are important in most software systems:

  • Reliability
  • Scalability
  • Maintainability

Reliability

The system should continue to work correctly (performing the correct function at the desired level of performance) even in the face of adversity (hardware or software faults, and even human error).

Hardware faults
  • hard disks crash, RAM becomes faulty, the power grid has a blackout, someone unplugs the wrong network cable.
  • MTTF (mean time to failure) of a hard disk: about 10-50 years.
  • Response
    1. add redundancy: disks set up in a RAID configuration, dual power supplies, hot-swappable CPUs, and datacenters with batteries and diesel generators for backup power.
    2. as long as you can restore a backup onto a new machine fairly quickly, the downtime in case of machine failure is not catastrophic in most applications.
Software Errors
  • some bugs…
  • some small things can help
    • carefully thinking about assumptions and interactions
    • thorough testing
    • process isolation
    • allowing processes to crash and restart (see the sketch after this list)
    • measuring, monitoring, and analyzing system behavior
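
As a small illustration of "allowing processes to crash and restart", here is a minimal sketch, not from the book: a supervisor process restarts a hypothetical worker whenever it exits with a non-zero code. The names worker and supervise and the crash simulation are made up for the example.

import multiprocessing
import random
import sys
import time

def worker():
    # Hypothetical worker: pretend to handle a request, then exit.
    # A crash (non-zero exit) takes down only this process, not the supervisor.
    time.sleep(0.1)
    sys.exit(1 if random.random() < 0.7 else 0)

def supervise(max_restarts=5):
    restarts = 0
    while restarts <= max_restarts:
        proc = multiprocessing.Process(target=worker)
        proc.start()
        proc.join()                      # wait until the worker exits or crashes
        if proc.exitcode == 0:           # clean shutdown: stop supervising
            print("worker exited cleanly")
            return
        restarts += 1
        print(f"worker crashed (exit code {proc.exitcode}), restarting")
    print("giving up after too many restarts")

if __name__ == "__main__":
    supervise()
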
Human Errors

One study found that configuration errors by operators were the leading cause of outages, whereas hardware faults (servers or network) played a role in only 10-25% of outages.

  • some approaches:
    • provide fully featured non-production sandbox environments
    • design systems in a way that minimizes opportunities for error.
    • CI(continuous integration) / CD(continuous delivery) + automate testing
    • monitoring: performance metrics such as RPC response times, and resource usage such as memory and disk
Importance

Every application is expected to work reliably.

Scalability

As the system grows(in data volume, traffic volume, or complexity), there should be reasonable ways of dealing with that growth.

Describing Load

Consider Twitter as an example. Two of Twitter’s main operations are:

  • Post tweet: a user publishes a new message to their followers (4,600 requests per sec on average, over 12,000 requests per sec at peak)
  • Home timeline: a user views tweets posted by the people they follow (300,000 requests per sec)

There are two ways of implementing these two operations:

Approach 1: Insert new tweets into a global collection of tweets

When a user requests their home timeline, look up all the people they follow, find all the tweets for each of those users, and merge them (sorted by time), for example with a query like the one below.

-- Assemble a user's home timeline at read time (approach 1)
SELECT tweets.*, users.* FROM tweets
  JOIN users ON tweets.sender_id = users.id
  JOIN follows ON follows.followee_id = users.id
  WHERE follows.follower_id = current_user
Approach 2: Maintain a cache for each user’s home timeline, like a mailbox

When a user posts a tweet, look up all the people who follow that user, and insert the new tweet into each of their home timeline caches. The request to read the home timeline is then cheap, because its result has been computed ahead of time.
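
A minimal in-memory sketch of this fan-out-on-write idea (not Twitter's actual implementation; followers, timelines, post_tweet, and home_timeline below are made-up stand-ins for real datastores and services):

from collections import defaultdict

followers = defaultdict(set)    # user_id -> ids of that user's followers
timelines = defaultdict(list)   # user_id -> cached home timeline, newest first

def post_tweet(sender_id, text):
    # The write path does the heavy lifting: one insert per follower.
    for follower_id in followers[sender_id]:
        timelines[follower_id].insert(0, (sender_id, text))

def home_timeline(user_id, limit=50):
    # The read path is cheap: the timeline has already been materialized.
    return timelines[user_id][:limit]

# Example: user 2 follows user 1, so user 1's tweet lands in user 2's cache.
followers[1].add(2)
post_tweet(1, "hello")
print(home_timeline(2))   # [(1, 'hello')]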

Which approach does Twitter use?

At first, Twitter used approach 1, then switched to approach 2. This works better because the average rate of published tweets is almost two orders of magnitude lower than the rate of home timeline reads, so it is preferable to do more work at write time and less at read time.

Approach 2 also has a downside. On average, a tweet is delivered to about 75 followers, so 4,600 tweets per second become 345,000 writes per second to the home timeline caches. But some users have over 30 million followers, so a single tweet from such a user may result in over 30 million writes to home timelines, which Twitter tries to complete within five seconds.

Twitter is now moving to a hybrid of both approaches. Most users’ tweets continue to be fanned out to home timelines at the time when they post, as in approach 2, but a small number of users with a very large number of followers are excepted from this fan-out, as in approach 1: their tweets are fetched separately and merged in when each follower’s home timeline is read.
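
A rough, self-contained sketch of this hybrid (the 1,000,000-follower threshold and all names below are invented for illustration; a real system would use persistent stores and sort the merged result by timestamp):

from collections import defaultdict

CELEBRITY_THRESHOLD = 1_000_000   # hypothetical cut-off, not Twitter's real value

followers = defaultdict(set)      # user_id -> ids of that user's followers
following = defaultdict(set)      # user_id -> ids that user follows
timelines = defaultdict(list)     # user_id -> pre-computed home timeline cache
recent_tweets = defaultdict(list) # user_id -> that user's own recent tweets

def follow(follower_id, followee_id):
    followers[followee_id].add(follower_id)
    following[follower_id].add(followee_id)

def post_tweet(sender_id, text):
    recent_tweets[sender_id].insert(0, (sender_id, text))
    if len(followers[sender_id]) < CELEBRITY_THRESHOLD:
        # Normal user: fan out to every follower's cache now (approach 2).
        for follower_id in followers[sender_id]:
            timelines[follower_id].insert(0, (sender_id, text))
    # Celebrity: skip the fan-out; followers merge these tweets in at read time.

def home_timeline(user_id, limit=50):
    merged = list(timelines[user_id])                   # pre-computed part (approach 2)
    for followee_id in following[user_id]:
        if len(followers[followee_id]) >= CELEBRITY_THRESHOLD:
            merged.extend(recent_tweets[followee_id])   # read-time merge (approach 1)
    return merged[:limit]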

Describing Performance
  • throughput: the number of records processed or requests handled per second

  • response time: the time between a client sending a request and receiving a response

  • latency: the duration that a request is waiting to be handled
    In order to figure out how bad your outliers are, you can look at higher percentiles: the 95th, 99th, and 99.9th percentiles are common. For example, if the 95th percentile response time is 1 second, that means 95 out of 100 requests take less than 1 second and 5 out of 100 requests take 1 second or more (a small percentile-calculation sketch follows this list).
    High percentiles of response times, also known as tail latencies, are important because:

  • they directly affect users’ experience of the service

  • the users with the slowest requests are often those who have the most data on their accounts, that is, they are the most valuable users.

  • SLO (service level objective): a target for the expected performance and availability of a service

  • SLA (service level agreement): a contract that sets out those expectations for customers, often in terms of response-time percentiles

  • head-of-line blocking: a small number of slow requests at the front of a queue can hold up the processing of the requests behind them
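
As promised above, a small sketch of how tail latencies might be computed from raw measurements; the sample response times are invented, and percentile uses the simple nearest-rank method:

import math

def percentile(sorted_times, p):
    # Nearest-rank pth percentile of an ascending-sorted list (0 < p <= 100).
    rank = math.ceil(len(sorted_times) * p / 100)   # 1-based rank
    return sorted_times[rank - 1]

response_times_ms = sorted([32, 45, 48, 51, 55, 60, 73, 95, 120, 1500])
for p in (50, 95, 99):
    print(f"p{p}: {percentile(response_times_ms, p)} ms")
# A single 1500 ms outlier dominates p95 and p99 even though the median stays low.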

Approaches for coping with load
  • scaling up (vertical scaling): moving to a more powerful machine
  • scaling out (horizontal scaling): distributing the load across multiple smaller machines
  • Some systems are elastic, meaning that they can automatically add computing resources when they detect a load increase. An elastic system can be useful if load is highly unpredictable, but manually scaled systems are simpler and may have fewer faults.

Maintainability

Over time, many different people will work on the system (engineering and operations, both maintaining current behavior and adapting the system to new use cases), and they should all be able to work on it productively.

Operability: Making Life Easy for Operations

Make it easy for operations teams to keep the system running smoothly.

Simplicity: Managing Complexity

Making a system simpler does not necessarily mean reducing its functionality; it can also mean removing accidental complexity (complexity that is not inherent in the problem the software solves, but arises only from the implementation).

One of the best tools we have for removing accidental complexity is abstraction.

Evolvability: Making Change Easy

Make it easy for engineers to adapt the system to unanticipated use cases as requirements change; this is closely linked to its simplicity and its abstractions.

Summary

This chapter covered reliability (tolerating hardware, software, and human faults), scalability (describing load and performance, for example with response-time percentiles), and maintainability (operability, simplicity, and evolvability).
