Overall
In this chapter, we focus on three concerns that are important in most software systems:
- Reliability
- Scalability
- Maintainability
Reliability
The system should continue to work correctly (performing the correct function at the desired level of performance) even in the face of adversity (hardware or software faults, and even human error).
Hardware faults
- hard disks crash, RAM becomes faulty, the power grid has a blackout, someone unplugs the wrong network cable.
- MTTF (mean time to failure) of a hard disk: roughly 10-50 years.
- Response
- add redundancy: disks set up in a RAID configuration, dual power supplies, hot-swappable CPUs, datacenters with batteries and diesel generators for backup power.
- as long as you can restore a backup onto a new machine fairly quickly, the downtime in case of machine failure is not catastrophic in most applications.
Software Errors
- some bugs…
- some small things can help:
- carefully thinking about assumptions and interactions
- thorough testing
- process isolation
- allowing processes to crash and restart
- measuring, monitoring, and analyzing system behavior
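One of the items above, allowing processes to crash and restart, can be sketched as a simple supervision loop. This is a toy Python sketch; the function names, the simulated flaky task, and the retry limit are all illustrative assumptions, not from the text:

```python
# Toy sketch of "allowing processes to crash and restart": run a task,
# and if it raises, restart it instead of letting the fault propagate.
def run_with_restart(task, max_restarts=3):
    for attempt in range(1, max_restarts + 2):
        try:
            return task()
        except Exception as exc:
            print(f"task crashed ({exc!r}); restart attempt {attempt}")
    raise RuntimeError("task kept failing after restarts")

calls = {"n": 0}
def flaky_task():
    # Fails twice (simulating a transient software fault), then succeeds.
    calls["n"] += 1
    if calls["n"] < 3:
        raise ValueError("transient bug")
    return "ok"

result = run_with_restart(flaky_task)
print(result)  # "ok" after two restarts
```

Real systems do this at the process level with a supervisor (e.g., restarting a crashed worker process), but the principle is the same: contain the fault and recover, rather than letting one error take the system down.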
Human Errors
One study found that configuration errors by operators were the leading cause of outages, whereas hardware faults (servers or network) played a role in only 10-25% of outages.
- some approaches:
- provide fully featured non-production sandbox environments
- design systems in a way that minimizes opportunities for error.
- CI(continuous integration) / CD(continuous delivery) + automate testing
- Monitoring: set up detailed and clear monitoring of system behavior, e.g., RPC latency, memory/disk usage, and error rates
Importance
Every application is expected to work reliably, even "noncritical" ones.
Scalability
As the system grows(in data volume, traffic volume, or complexity), there should be reasonable ways of dealing with that growth.
Describing Load
Consider Twitter as an example; two of Twitter's main operations are:
- Post tweet: (4,600 requests per sec on average, 12,000 requests per sec at peak)
- Home timeline: (300,000 requests per sec)
There are two ways of implementing these two operations:
1. Insert new tweets into a global collection of tweets. To read a user's home timeline, find all the tweets for each of the people they follow and merge them (sorted by time):
```sql
SELECT tweets.*, users.* FROM tweets
  JOIN users ON tweets.sender_id = users.id
  JOIN follows ON follows.followee_id = users.id
WHERE follows.follower_id = current_user
```
2. Maintain a cache for each user's home timeline, like a mailbox. When a user posts a tweet, look up all the people who follow that user and insert the new tweet into each of their home timeline caches. Reading the home timeline is then cheap, because its result has been computed ahead of time.
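The two approaches can be contrasted with a minimal in-memory sketch. The data structures and function names here are illustrative, not Twitter's actual implementation:

```python
# Approach 1 vs. approach 2, sketched with in-memory data structures.
from collections import defaultdict

follows = defaultdict(set)        # follower_id -> set of followee_ids
followers_of = defaultdict(set)   # followee_id -> set of follower_ids
global_tweets = []                # approach 1: one global collection
timelines = defaultdict(list)     # approach 2: per-user timeline cache

def post_tweet_v1(user, text):
    # Approach 1: cheap write, expensive read.
    global_tweets.append((user, text))

def home_timeline_v1(user):
    # Read must scan the global collection for followed users' tweets.
    return [t for t in global_tweets if t[0] in follows[user]]

def post_tweet_v2(user, text):
    # Approach 2: fan out the write to every follower's cache.
    for follower in followers_of[user]:
        timelines[follower].append((user, text))

def home_timeline_v2(user):
    # Read is now a cheap cache lookup.
    return timelines[user]

follows["alice"].add("bob")       # alice follows bob
followers_of["bob"].add("alice")
post_tweet_v1("bob", "hello")
post_tweet_v2("bob", "hello")
print(home_timeline_v1("alice"))  # [('bob', 'hello')]
print(home_timeline_v2("alice"))  # [('bob', 'hello')]
```

Both reads return the same result; the difference is where the work happens, at read time (v1) or at write time (v2).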
Which approach Twitter used
At first, Twitter used approach 1, then switched to approach 2. Approach 2 works better because the average rate of published tweets is almost two orders of magnitude lower than the rate of home timeline reads, so in this case it is preferable to do more work at write time and less at read time.
Approach 2 also has a downside. On average, a tweet is delivered to about 75 followers, so 4.6k tweets per second become 345k writes per second to the home timeline caches. But some users have over 30 million followers, so a single tweet from such a user may result in over 30 million writes to home timelines, which Twitter tries to complete within five seconds.
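The 345k figure is straightforward arithmetic on the numbers above:

```python
# Back-of-the-envelope check of the fan-out write rate quoted above.
avg_tweets_per_sec = 4_600
avg_followers_per_tweet = 75
writes_per_sec = avg_tweets_per_sec * avg_followers_per_tweet
print(writes_per_sec)  # 345000 home timeline writes per second
```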
Twitter is moving to a hybrid of both approaches. Most users' tweets continue to be fanned out to home timelines at the time they are posted, as in approach 2, but a small number of users with a very large number of followers are exempted from this fan-out and handled as in approach 1.
Describing Performance
- throughput: the number of records processed per second, or the total time to run a job on a dataset of a certain size
- response time: the time between a client sending a request and receiving a response
- latency: the duration that a request is waiting to be handled
In order to figure out how bad your outliers are, you can look at higher percentiles: the 95th, 99th, and 99.9th percentiles are common. For example, if the 95th percentile response time is 1 second, that means 95 out of 100 requests take less than 1 second, and 5 out of 100 requests take 1 second or more.
High percentiles of response times, also known as tail latencies, are important because:
- they directly affect users' experience of the service
- the users with the slowest requests are often those who have the most data on their accounts; that is, they are the most valuable users.
Related terms:
- SLO (service level objective): an expected performance target for a service
- SLA (service level agreement): a contract that defines the expected performance and availability of a service
- head-of-line blocking: a small number of slow requests at the head of a queue hold up the processing of the requests behind them
Approaches for coping with load
- scaling up (vertical scaling): moving to a more powerful machine
- scaling out (horizontal scaling): distributing the load across multiple smaller machines
- Some systems are elastic, meaning that they can automatically add computing resources when they detect a load increase. An elastic system can be useful if load is highly unpredictable, but manually scaled systems are simpler and may have fewer faults.
Maintainability
Over time, many different people will work on the system (engineering and operations, both maintaining current behavior and adapting the system to new use cases), and they should all be able to work on it productively.
Operability
Simplicity: Managing Complexity
Making a system simpler does not necessarily mean reducing its functionality; it can also mean removing accidental complexity (complexity that is not inherent in the problem the software solves, but arises only from the implementation).
One of the best tools we have for removing accidental complexity is abstraction.