Part I Introduction
Chapter 1. Introduction
The Sysadmin Approach to Service Management
Devs vs. Ops
The conflict between Devs and Ops
- Devs: “We want to launch anything, any time, without hindrance,” so launches get relabeled as “flag flips,” “incremental updates,” or “cherry-picks.”
- Ops: “We won’t want to ever change anything in the system once it works.”
Site Reliability Engineering
Composition of SRE team
- 50–60% are SWE
- 40–50% are very close to the SWE qualifications and have a set of technical skills that is useful to SRE but rare for most SWEs, such as:
- UNIX system internals
- networking (Layer 1 to Layer 3) expertise
SRE model
- SRE features
- quickly become bored by performing tasks by hand
- can write software to replace manual work
- Time management
- 50% cap on the aggregate “ops” work for all SREs
- Ensures the time to make service stable and operable.
Tenets of SRE
The responsibility
- availability
- latency
- performance
- efficiency
- change management
- monitoring
- emergency response
- capacity planning of their service(s)
Ensuring a Durable Focus on Engineering
- SREs should receive a maximum of two events per 8–12-hour on-call shift.
- End the vicious circle
- Postmortems should be written for all significant incidents.
- Fix the underlying problem rather than covering it up.
Pursuing Maximum Change Velocity Without Violating a Service’s SLO
- Error budget: actually, we don’t need 100% reliability.
- But how much reliability do we need?
• What level of availability will the users be happy with, given how they use the product?
• What alternatives are available to users who are dissatisfied with the product’s availability?
• What happens to users’ usage of the product at different availability levels?
Ideally, we would spend all of our error budget taking risks with the things we launch, in order to launch them quickly.
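The error budget idea can be made concrete with a little arithmetic. The sketch below is only an illustration: it assumes the SLO is measured as a fraction of successful requests over some window, and the function names (`error_budget`, `budget_remaining`, `can_launch`) are hypothetical, not taken from the book.

```python
def error_budget(slo: float, total_requests: int) -> int:
    """Requests allowed to fail in the window (assumes a request-based SLO)."""
    if not 0.0 <= slo <= 1.0:
        raise ValueError("slo must be between 0 and 1")
    return round((1.0 - slo) * total_requests)


def budget_remaining(slo: float, total_requests: int, failed_requests: int) -> int:
    """Budget left after subtracting the failures already observed."""
    return error_budget(slo, total_requests) - failed_requests


def can_launch(slo: float, total_requests: int, failed_requests: int) -> bool:
    """Hypothetical policy: risky launches continue only while budget remains."""
    return budget_remaining(slo, total_requests, failed_requests) > 0
```

With a 99.9% SLO over one million requests the budget is 1,000 failed requests; once those are spent, this policy pauses launches until the window resets.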
Monitoring
- Alerts
- A human needs to take action immediately in response to something that is either happening or about to happen, in order to improve the situation.
- Tickets
- Take action, but not immediately.
- The system cannot automatically handle the situation, but if a human takes action in a few days, no damage will result.
- Logging
- No one needs to look at this information, but it is recorded for diagnostic or forensic purposes.
- The expectation is that no one reads logs unless something else prompts them to do so.
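The three kinds of monitoring output above differ only in whether a human must act and how soon. A minimal sketch of that decision follows; the enum and the `classify` function are hypothetical names chosen for illustration and do not come from any real monitoring system.

```python
from enum import Enum


class Output(Enum):
    ALERT = "alert"    # a human must act immediately
    TICKET = "ticket"  # a human must act, but not immediately
    LOG = "log"        # recorded only for diagnostics or forensics


def classify(needs_human: bool, urgent: bool) -> Output:
    """Map an event to one of the three valid kinds of monitoring output."""
    if not needs_human:
        return Output.LOG
    return Output.ALERT if urgent else Output.TICKET
```

Anything that does not need a human falls through to logging, which matches the expectation that no one reads logs unless something else prompts them to.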
Emergency Response
- mean time to failure (MTTF)
- mean time to repair (MTTR)
- Playbooks matter: recording best practices ahead of time in a playbook improves MTTR compared with improvising.
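To see why MTTR matters as much as MTTF, it helps to relate both to availability. The sketch below uses the standard steady-state formula availability = MTTF / (MTTF + MTTR); that formula is a general reliability-engineering convention and an assumption here, since the notes do not state it.

```python
def availability(mttf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: the fraction of time the service is up."""
    if mttf_hours <= 0 or mttr_hours < 0:
        raise ValueError("mttf must be positive and mttr non-negative")
    return mttf_hours / (mttf_hours + mttr_hours)
```

Under this model, halving MTTR (for example through a good playbook) raises availability without any change in how often failures occur.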
Change Management
- automation
• Implementing progressive rollouts
• Quickly and accurately detecting problems
• Rolling back changes safely when problems arise
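The three automation practices above fit together as a single loop: deploy to a small fraction, check health, and roll back on the first problem. The following is a rough sketch under stated assumptions: `deploy`, `healthy`, and `rollback` are hypothetical callbacks supplied by the caller, not part of any real release tool.

```python
from typing import Callable, Sequence


def progressive_rollout(
    stages: Sequence[float],
    deploy: Callable[[float], None],
    healthy: Callable[[float], bool],
    rollback: Callable[[], None],
) -> bool:
    """Deploy stage by stage; roll back and stop at the first unhealthy stage."""
    for fraction in stages:
        deploy(fraction)           # push the change to this fraction of traffic
        if not healthy(fraction):  # detect problems quickly and accurately
            rollback()             # roll back safely when problems arise
            return False
    return True
```

A caller might pass `stages=[0.01, 0.1, 1.0]` so that a bad change is caught while it affects only a small share of users.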
Demand Forecasting and Capacity Planning
- Ensuring that there is sufficient capacity and redundancy to serve projected future demand with the required availability
- Capacity planning requires:
- An accurate organic growth forecast model
- An accurate accounting of demand sources
- Periodic load testing
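The three capacity planning inputs combine into a simple estimate: forecast organic demand, add demand from other known sources, and divide by the per-server capacity measured in load tests. The sketch below assumes compound monthly growth and a single QPS figure per server; both are simplifying assumptions, and every name in it is hypothetical.

```python
import math


def servers_needed(
    current_qps: float,
    monthly_growth: float,
    months: int,
    inorganic_qps: float,
    qps_per_server: float,
) -> int:
    """Estimate servers required at the end of the planning horizon."""
    if qps_per_server <= 0:
        raise ValueError("qps_per_server must be positive")
    # Organic demand: assumed compound growth from the current load.
    organic = current_qps * (1.0 + monthly_growth) ** months
    # Inorganic demand: launches, marketing campaigns, and similar sources.
    total = organic + inorganic_qps
    # Per-server capacity is assumed to come from periodic load testing.
    return math.ceil(total / qps_per_server)
```

The result is a raw minimum; redundancy for upgrades and failures would be added on top of it.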
Provisioning
Efficiency and Performance
- demand (load)
- capacity (a slower service equates to a loss of capacity)
- software efficiency
Chapter 2. The Production Environment at Google, from the Viewpoint of an SRE
Hardware
- Machine: A piece of hardware (or perhaps a VM)
- Server: A piece of software that implements a service
- The topology of a Google datacenter:
• Tens of machines are placed in a rack.
• Racks stand in a row.
• One or more rows form a cluster.
• Usually a datacenter building houses multiple clusters.
• Multiple datacenter buildings that are located close together form a campus.
- SDN: software-defined networking architecture
System Software That “Organizes” the Hardware
Managing Machines
Kubernetes: an open source container cluster orchestration framework started by Google in 2014.
Storage
- D (for disk, although D uses both spinning disks and flash storage)
- D is a fileserver running on almost all machines in a cluster.
- Users do not need to know which machine holds their data.
- Colossus
- Creates a cluster-wide filesystem that offers usual filesystem semantics, as well as replication and encryption.
- Several database-like services
- Bigtable
- A NoSQL database system.
- A sparse, distributed, persistent multidimensional sorted map that is indexed by row key, column key, and timestamp; each value in the map is an uninterpreted array of bytes.
- Supports eventually consistent, cross-datacenter replication.
- Spanner
- Offers an SQL-like interface and real consistency across the world.
- Several other database systems
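The Bigtable data model described above (a sparse sorted map keyed by row, column, and timestamp, with opaque byte values) can be mimicked in a few lines. This is a toy, in-memory sketch for intuition only: the class `SparseSortedMap` and its methods are hypothetical and are not Bigtable's API, and nothing here is distributed or persistent.

```python
from typing import Dict, List, Optional, Tuple

Key = Tuple[str, str, int]  # (row key, column key, timestamp)


class SparseSortedMap:
    """Toy model: only cells that were written exist (sparse)."""

    def __init__(self) -> None:
        self._cells: Dict[Key, bytes] = {}

    def put(self, row: str, column: str, timestamp: int, value: bytes) -> None:
        # Values are uninterpreted bytes; the map never inspects them.
        self._cells[(row, column, timestamp)] = value

    def get_latest(self, row: str, column: str) -> Optional[bytes]:
        """Return the value with the highest timestamp, or None if absent."""
        versions = [
            (ts, v) for (r, c, ts), v in self._cells.items()
            if r == row and c == column
        ]
        if not versions:
            return None
        return max(versions, key=lambda pair: pair[0])[1]

    def scan(self) -> List[Tuple[Key, bytes]]:
        """Iterate cells in sorted key order (row, then column, then timestamp)."""
        return sorted(self._cells.items())
```

Keeping several timestamps per cell is what lets a reader ask for the newest version while older versions remain available.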
Networking
- Global Software Load Balancer (GSLB) performs load balancing at three levels:
- Geographic load balancing for DNS requests (for example, to www.google.com)
- Load balancing at a user service level (for example, YouTube or Google Maps)
- Load balancing at the Remote Procedure Call (RPC) level
Other System Software
Lock Service
Monitoring and Alerting
Software Infrastructure
RPC
Development Environment
Shakespeare: A Sample Service
Two parts:
- Batch
- Creates an index, and writes the index into a Bigtable.
- Runs only once
- MapReduce
- The mapping phase reads Shakespeare’s texts and splits them into individual words. This is faster if performed in parallel by multiple workers.
- The shuffle phase sorts the tuples by word.
- In the reduce phase, a tuple of (word, list of locations) is created.
- Frontend
- handles end-user requests.
- This job is always up, as users in all time zones will want to search in Shakespeare’s books.
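The batch component's map, shuffle, and reduce phases can be sketched on a single machine. This is only an illustration of the data flow: the function names are hypothetical, a location is assumed to be a (title, word position) pair, and a real MapReduce would run the mapping phase across many parallel workers.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Location = Tuple[str, int]  # (text title, word position); an assumed format


def map_phase(texts: Dict[str, str]) -> List[Tuple[str, Location]]:
    """Split each text into words and emit (word, location) tuples."""
    tuples = []
    for title, text in texts.items():
        for position, word in enumerate(text.lower().split()):
            tuples.append((word, (title, position)))
    return tuples


def shuffle_phase(tuples: List[Tuple[str, Location]]) -> List[Tuple[str, Location]]:
    """Sort the tuples by word (stable, so locations keep their order)."""
    return sorted(tuples, key=lambda t: t[0])


def reduce_phase(sorted_tuples: List[Tuple[str, Location]]) -> Dict[str, List[Location]]:
    """Collapse the sorted tuples into word -> list of locations."""
    index: Dict[str, List[Location]] = defaultdict(list)
    for word, location in sorted_tuples:
        index[word].append(location)
    return dict(index)
```

Chaining the three calls, `reduce_phase(shuffle_phase(map_phase(texts)))`, yields the index that the batch job would then write to storage.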
Life of a Request
- Get the IP address
- the user’s device resolves the address with its DNS server
- This request ultimately ends up at Google’s DNS server, which talks to GSLB. As GSLB keeps track of traffic load among frontend servers across regions, it picks which server IP address to send to this user.
- Connect
- The browser connects to the HTTP server on this IP.
- This server (named the Google Frontend, or GFE) is a reverse proxy that terminates the TCP connection (2).
- RPC request
- The GFE looks up which service is required
- Again using GSLB, the server finds an available Shakespeare frontend server, and sends that server an RPC containing the HTTP request (3).
- Look up
- The Shakespeare server analyzes the HTTP request and constructs a protobuf containing the word to look up.
- The Shakespeare frontend server now needs to contact the Shakespeare backend server: the frontend server contacts GSLB to obtain the BNS address of a suitable and unloaded backend server (4).
- Get data
- That Shakespeare backend server now contacts a Bigtable server to obtain the requested data (5).
- Return
- The answer is written to the reply protobuf and returned to the Shakespeare backend server.
- The backend hands a protobuf containing the results to the Shakespeare frontend server, which assembles the HTML and returns the answer to the user.
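Several steps in the request path above depend on choosing a suitable, lightly loaded server. The sketch below shows one simple way to make such a choice, picking the address with the lowest reported load. This least-loaded policy is an assumption for illustration; the notes do not describe how GSLB actually decides, and `pick_backend` is a hypothetical name.

```python
from typing import Dict


def pick_backend(load_by_address: Dict[str, float]) -> str:
    """Return the address whose reported load is lowest."""
    if not load_by_address:
        raise ValueError("no backends available")
    return min(load_by_address, key=load_by_address.get)
```

For example, given loads of 0.9 and 0.2 for two backends, the function selects the one reporting 0.2.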
Job and Data Organization
Why N + 2?
- During updates, one task at a time will be unavailable.
- A machine failure might occur during a task update.
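The N + 2 rule reduces to a short calculation: N is the number of tasks needed to serve peak load, and the two spares cover one task being updated plus one machine failing at the same time. The sketch below assumes peak load and per-task capacity are known as whole QPS figures; the function name and the numbers in the usage note are illustrative.

```python
def tasks_needed(peak_qps: int, qps_per_task: int, spare: int = 2) -> int:
    """Tasks required to serve peak load, plus spares for update and failure."""
    if qps_per_task <= 0:
        raise ValueError("qps_per_task must be positive")
    n = -(-peak_qps // qps_per_task)  # ceiling division without floats
    return n + spare
```

With an illustrative peak of 3,470 QPS and 100 QPS per task, N is 35, so the job would run 37 tasks.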