Overall
In this chapter, we focus on three concerns that are important in most software systems:
- Reliability
- Scalability
- Maintainability
Reliability
The system should continue to work correctly (performing the correct function at the desired level of performance) even in the face of adversity (hardware or software faults, and even human error).
Hardware faults
- hard disks crash, RAM becomes faulty, the power grid has a blackout, someone unplugs the wrong network cable.
- MTTF (mean time to failure) of a hard disk: roughly 10-50 years.
- Response
- add redundancy: disks set up in a RAID configuration, dual power supplies, hot-swappable CPUs, datacenters with batteries and diesel generators for backup power.
- as long as you can restore a backup onto a new machine fairly quickly, the downtime in case of machine failure is not catastrophic in most applications.
Software Errors
- some bugs…
- some small things can help:
- carefully thinking about assumptions and interactions
- thorough testing
- process isolation
- allowing processes to crash and restart
- measuring, monitoring, and analyzing system behavior
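One of the items above, allowing processes to crash and restart, can be sketched as a simple supervision loop. This is a toy Python sketch; the function names, the simulated flaky task, and the retry limit are all illustrative assumptions, not from the text:

```python
# Toy sketch of "allowing processes to crash and restart": run a task,
# and if it raises, restart it instead of letting the fault propagate.
def run_with_restart(task, max_restarts=3):
    for attempt in range(1, max_restarts + 2):
        try:
            return task()
        except Exception as exc:
            print(f"task crashed ({exc!r}); restart attempt {attempt}")
    raise RuntimeError("task kept failing after restarts")

calls = {"n": 0}
def flaky_task():
    # Fails twice (simulating a transient software fault), then succeeds.
    calls["n"] += 1
    if calls["n"] < 3:
        raise ValueError("transient bug")
    return "ok"

result = run_with_restart(flaky_task)
print(result)  # "ok" after two restarts
```

Real systems do this at the process level with a supervisor (e.g., restarting a crashed worker process), but the principle is the same: contain the fault and recover, rather than letting one error take the system down.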
Human Errors
One study found that configuration errors by operators were the leading cause of outages, whereas hardware faults (servers or network) played a role in only 10-25% of outages.
- some approaches:
- provide fully featured non-production sandbox environments
- design systems in a way that minimizes opportunities for error.
- CI(continuous integration) / CD(continuous delivery) + automate testing
- Monitoring: set up detailed and clear monitoring of system behavior, e.g., RPC latency, memory/disk usage, and error rates
Importance
Every application is expected to work reliably, even "noncritical" ones.
Scalability
As the system grows(in data volume, traffic volume, or complexity), there should be reasonable ways of dealing with that growth.
Describing Load
Consider Twitter as an example; two of Twitter's main operations are:
- Post tweet: (4,600 requests per sec on average, 12,000 requests per sec at peak)
- Home timeline: (300,000 requests per sec)
There are two ways of implementing these two operations:
1. Insert new tweets into a global collection of tweets. To read a user's home timeline, find all the tweets for each of the people they follow and merge them (sorted by time):
```sql
SELECT tweets.*, users.* FROM tweets
  JOIN users ON tweets.sender_id = users.id
  JOIN follows ON follows.followee_id = users.id
WHERE follows.follower_id = current_user
```
2. Maintain a cache for each user's home timeline, like a mailbox. When a user posts a tweet, look up all the people who follow that user and insert the new tweet into each of their home timeline caches. Reading the home timeline is then cheap, because its result has been computed ahead of time.
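The two approaches can be contrasted with a minimal in-memory sketch. The data structures and function names here are illustrative, not Twitter's actual implementation:

```python
# Approach 1 vs. approach 2, sketched with in-memory data structures.
from collections import defaultdict

follows = defaultdict(set)        # follower_id -> set of followee_ids
followers_of = defaultdict(set)   # followee_id -> set of follower_ids
global_tweets = []                # approach 1: one global collection
timelines = defaultdict(list)     # approach 2: per-user timeline cache

def post_tweet_v1(user, text):
    # Approach 1: cheap write, expensive read.
    global_tweets.append((user, text))

def home_timeline_v1(user):
    # Read must scan the global collection for followed users' tweets.
    return [t for t in global_tweets if t[0] in follows[user]]

def post_tweet_v2(user, text):
    # Approach 2: fan out the write to every follower's cache.
    for follower in followers_of[user]:
        timelines[follower].append((user, text))

def home_timeline_v2(user):
    # Read is now a cheap cache lookup.
    return timelines[user]

follows["alice"].add("bob")       # alice follows bob
followers_of["bob"].add("alice")
post_tweet_v1("bob", "hello")
post_tweet_v2("bob", "hello")
print(home_timeline_v1("alice"))  # [('bob', 'hello')]
print(home_timeline_v2("alice"))  # [('bob', 'hello')]
```

Both reads return the same result; the difference is where the work happens, at read time (v1) or at write time (v2).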
Which approach Twitter used
At first, Twitter used approach 1, then switched to approach 2. Approach 2 works better because the average rate of published tweets is almost two orders of magnitude lower than the rate of home timeline reads, so in this case it is preferable to do more work at write time and less at read time.
Approach 2 also has a downside. On average, a tweet is delivered to about 75 followers, so 4.6k tweets per second become 345k writes per second to the home timeline caches. But some users have over 30 million followers, so a single tweet from such a user may result in over 30 million writes to home timelines, which Twitter tries to complete within five seconds.
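The 345k figure is straightforward arithmetic on the numbers above:

```python
# Back-of-the-envelope check of the fan-out write rate quoted above.
avg_tweets_per_sec = 4_600
avg_followers_per_tweet = 75
writes_per_sec = avg_tweets_per_sec * avg_followers_per_tweet
print(writes_per_sec)  # 345000 home timeline writes per second
```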
Twitter is moving to a hybrid of both approaches. Most users' tweets continue to be fanned out to home timelines at the time they are posted, as in approach 2, but a small number of users with a very large number of followers are exempted from this fan-out and handled as in approach 1.
Describing Performance
- throughput: the number of records processed per second, or the total time to run a job on a dataset of a certain size
- response time: the time between a client sending a request and receiving a response
- latency: the duration that a request is waiting to be handled
In order to figure out how bad your outliers are, you can look at higher percentiles: the 95th, 99th, and 99.9th percentiles are common. For example, if the 95th percentile response time is 1 second, that means 95 out of 100 requests take less than 1 second, and 5 out of 100 requests take 1 second or more.
High percentiles of response times, also known as tail latencies, are important because:
- they directly affect users' experience of the service
- the users with the slowest requests are often those who have the most data on their accounts; that is, they are the most valuable users.
Related terms:
- SLO (service level objective): an expected performance target for a service
- SLA (service level agreement): a contract that defines the expected performance and availability of a service
- head-of-line blocking: a small number of slow requests at the head of a queue hold up the processing of the requests behind them
Approaches for coping with load
- scaling up (vertical scaling): moving to a more powerful machine
- scaling out (horizontal scaling): distributing the load across multiple smaller machines
- Some systems are elastic, meaning that they can automatically add computing resources when they detect a load increase. An elastic system can be useful if load is highly unpredictable, but manually scaled systems are simpler and may have fewer faults.
Maintainability
Over time, many different people will work on the system (engineering and operations, both maintaining current behavior and adapting the system to new use cases), and they should all be able to work on it productively.
Operability
Simplicity: Managing Complexity
Making a system simpler does not necessarily mean reducing its functionality; it can also mean removing accidental complexity (complexity that is not inherent in the problem the software solves, but arises only from the implementation).
One of the best tools we have for removing accidental complexity is abstraction.