[Reading Excerpts & Notes] Site Reliability Engineering: How Google Runs Production Systems (Part I)

LukaMadrid

Article link: https://blog.csdn.net/u014090659/article/details/109668009

Chapt 1. Introduction

The Sysadmin Approach to Service Management

Devs vs. Ops

The conflict between Devs and Ops

Devs: “We want to launch anything, any time, without hindrance.” When Ops adds launch gates, Devs respond with fewer “launches” and more “flag flips,” “incremental updates,” or “cherry‐picks.”
Ops: “We won’t want to ever change anything in the system once it works.”

Site Reliability Engineering

Composition of SRE team

50–60% are Google Software Engineers (SWEs)
40–50% are very close to the SWE qualifications and also have technical skills that are useful to SRE but rare among SWEs, most commonly:
- UNIX system internals
- networking (Layer 1 to Layer 3) expertise

SRE model

SRE features
- quickly become bored by performing tasks by hand
- can write software to replace manual work
Time management
- A 50% cap on the aggregate “ops” work for all SREs
- The cap ensures that SREs have enough time to make the service stable and operable.
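The 50% cap is simple arithmetic over the whole team. Below is a minimal sketch of such a check; the function names and the (ops hours, total hours) input format are my own assumptions, not something the book prescribes.

```python
from typing import Iterable, Tuple

OPS_CAP = 0.50  # the 50% cap on aggregate ops work


def aggregate_ops_fraction(hours: Iterable[Tuple[float, float]]) -> float:
    """Return the team-wide share of time spent on ops work.

    Each item is a hypothetical (ops_hours, total_hours) pair for one SRE.
    """
    pairs = list(hours)
    total = sum(t for _, t in pairs)
    if total <= 0:
        raise ValueError("total hours must be positive")
    return sum(o for o, _ in pairs) / total


def over_cap(hours: Iterable[Tuple[float, float]], cap: float = OPS_CAP) -> bool:
    """True when the aggregate ops share exceeds the cap."""
    return aggregate_ops_fraction(hours) > cap
```

Note that the cap is checked on the aggregate, so one SRE can have a heavy ops week as long as the team as a whole stays under 50%.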

Tenets of SRE

The responsibility

availability
latency
performance
efficiency
change management
monitoring
emergency response
capacity planning of their service(s)

Ensuring a Durable Focus on Engineering

SREs should receive a maximum of two events per 8–12-hour on-call shift.
- This leaves time to handle each event properly and ends the vicious circle
Postmortems should be written for all significant incidents
- Fix the underlying problem; do not cover it up

Pursuing Maximum Change Velocity Without Violating a Service’s SLO

Error budget.

Actually, we don’t need 100% reliability
But how much do we need?

• What level of availability will the users be happy with, given how they use the product?

• What alternatives are available to users who are dissatisfied with the product’s availability?

• What happens to users’ usage of the product at different availability levels?
Ideally, we would spend all of our error budget taking risks with things we launch, in order to launch them quickly.
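The error budget is one minus the availability target, so it can be turned into concrete numbers. Here is a rough sketch; the 30-day window and the request-based accounting are my own assumptions for illustration.

```python
def error_budget(slo: float) -> float:
    """Fraction of unavailability permitted by an availability target."""
    if not 0.0 < slo <= 1.0:
        raise ValueError("slo must be in (0, 1]")
    return 1.0 - slo


def allowed_downtime_minutes(slo: float, period_days: int = 30) -> float:
    """Minutes of full outage the budget allows over the period."""
    return error_budget(slo) * period_days * 24 * 60


def budget_remaining(slo: float, total_requests: int, failed_requests: int) -> float:
    """Failed requests we can still afford; negative means overspent."""
    return error_budget(slo) * total_requests - failed_requests
```

For example, a 99.9% target over 30 days allows about 43.2 minutes of downtime. Once `budget_remaining` goes negative, launches should slow down until the budget recovers.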

Monitoring

Alerts
- A human needs to take action immediately in response to something that is either happening or about to happen, in order to improve the situation.
Tickets
- A human needs to take action, but not immediately.
- The system cannot automatically handle the situation, but if a human takes action in a few days, no damage will result.
Logging
- No one needs to look at this information, but it is recorded for diagnostic or forensic purposes.
- The expectation is that no one reads logs unless something else prompts them to do so.
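The three kinds of monitoring output boil down to two questions: does a human need to act, and how soon? A minimal sketch of that decision follows; the enum and the two boolean inputs are my own simplification.

```python
from enum import Enum


class Output(Enum):
    ALERT = "alert"    # a human must act immediately
    TICKET = "ticket"  # a human must act, but not immediately
    LOG = "log"        # recorded only; nobody is expected to read it


def classify(needs_human: bool, urgent: bool) -> Output:
    """Map an event to alert, ticket, or log."""
    if not needs_human:
        return Output.LOG
    return Output.ALERT if urgent else Output.TICKET
```

For instance, `classify(needs_human=True, urgent=False)` gives `Output.TICKET`.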

Emergency Response

mean time to failure (MTTF)
mean time to repair (MTTR)
Playbooks matter: recording best practices ahead of time in a playbook gives roughly a 3x improvement in MTTR compared with “winging it.”
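Reliability is a function of MTTF and MTTR. One common way to relate them is the textbook steady-state formula availability = MTTF / (MTTF + MTTR). The notes above do not spell this formula out, so treat the sketch below as my own assumption.

```python
def availability(mttf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability under the MTTF / (MTTF + MTTR) model."""
    if mttf_hours <= 0 or mttr_hours < 0:
        raise ValueError("mttf must be positive and mttr non-negative")
    return mttf_hours / (mttf_hours + mttr_hours)
```

Under this model, shortening MTTR raises availability even if failures are just as frequent, which is why fast and well-practiced emergency response matters.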

Change Management

Automation is used for:
• Implementing progressive rollouts
• Quickly and accurately detecting problems
• Rolling back changes safely when problems arise
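The three automation goals above fit into one small loop. This is only a sketch under my own assumptions: `deploy`, `healthy`, and `rollback` are hypothetical callbacks, and real rollout tooling is far more involved.

```python
from typing import Callable, Sequence


def progressive_rollout(
    stages: Sequence[float],
    deploy: Callable[[float], None],
    healthy: Callable[[], bool],
    rollback: Callable[[], None],
) -> bool:
    """Deploy to a growing fraction of traffic; roll back on the first problem.

    Returns True if every stage passed the health check, False otherwise.
    """
    for fraction in stages:
        deploy(fraction)      # progressive rollout step
        if not healthy():     # detect problems quickly
            rollback()        # roll back safely
            return False
    return True
```

A typical call would use stages such as `[0.01, 0.1, 1.0]`, so a bad change only ever reaches a small share of users before it is reverted.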

Demand Forecasting and Capacity Planning

Ensuring that there is sufficient capacity and redundancy to serve projected future demand with the required availability
Capacity planning:
- An accurate forecast model for organic growth
- Accurate accounting of demand sources
- Regular load testing
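To make the three ingredients concrete, here is a back-of-the-envelope sketch. The compound-growth model, the parameter names, and the default of two spare servers are all my own assumptions; `qps_per_server` stands for the number you would get from load testing.

```python
import math


def forecast_demand(current_qps: float, monthly_growth: float,
                    months: int, inorganic_qps: float = 0.0) -> float:
    """Organic growth (compounded monthly) plus known inorganic demand."""
    return current_qps * (1 + monthly_growth) ** months + inorganic_qps


def servers_needed(demand_qps: float, qps_per_server: float, spare: int = 2) -> int:
    """Servers required to carry the demand, plus spare ones for redundancy."""
    if qps_per_server <= 0:
        raise ValueError("qps_per_server must be positive")
    return math.ceil(demand_qps / qps_per_server) + spare
```

With 1000 QPS today, 10% monthly growth, two months of lead time, and 40 QPS from a planned launch, the forecast is about 1250 QPS. At 100 QPS per server that means 13 servers, or 15 with two spares.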

Provisioning

Efficiency and Performance

Resource use is a function of:
demand (load)
capacity
software efficiency
Systems become slower as load is added, and a slowdown equates to a loss of capacity.

[Figure: mind map]

Chapt 2. The Production Environment at Google, from the Viewpoint of an SRE

Hardware

Machine: a piece of hardware (or perhaps a VM)
Server: a piece of software that implements a service
The topology of a Google datacenter:

• Tens of machines are placed in a rack.
• Racks stand in a row.
• One or more rows form a cluster.
• Usually a datacenter building houses multiple clusters.
• Multiple datacenter buildings that are located close together form a campus.
Datacenters are connected to each other by the B4 backbone network, a software-defined networking (SDN) architecture.

System Software That “Organizes” the Hardware

Managing Machines

Borg is the distributed cluster operating system that manages machines. Its descendant, Kubernetes, is an open source container cluster orchestration framework started by Google in 2014.

Storage

D
- Named for disk, although D uses both spinning disks and flash storage.
- D is a fileserver running on almost all machines in a cluster. Users should not have to know which machine stores their data; the layers above D handle that.
Colossus
- A layer on top of D that creates a cluster-wide filesystem offering the usual filesystem semantics, as well as replication and encryption.
Several database-like services are built on top of Colossus:
1. Bigtable
  A NoSQL database system.
  A sparse, distributed, persistent multidimensional sorted map that is indexed by row key, column key, and timestamp;
  each value in the map is an uninterpreted array of bytes.
  Bigtable supports eventually consistent, cross-datacenter replication.
2. Spanner
  Offers an SQL-like interface and real consistency across the world.
3. Several other database systems
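The Bigtable description (a sorted map indexed by row key, column key, and timestamp, with uninterpreted bytes as values) is easier to picture with a toy model. The class below is an in-memory sketch of that data model only. `TinyTable` and its methods are names I made up, and it has nothing to do with the real Bigtable API.

```python
from typing import Dict, List, Optional, Tuple


class TinyTable:
    """Toy model: (row key, column key, timestamp) -> bytes."""

    def __init__(self) -> None:
        # Sparse: only cells that were written take up space.
        self._cells: Dict[Tuple[str, str], Dict[int, bytes]] = {}

    def put(self, row: str, column: str, timestamp: int, value: bytes) -> None:
        self._cells.setdefault((row, column), {})[timestamp] = value

    def get(self, row: str, column: str,
            timestamp: Optional[int] = None) -> Optional[bytes]:
        """Return the newest version, or the newest one at or before timestamp."""
        versions = self._cells.get((row, column))
        if not versions:
            return None
        if timestamp is None:
            return versions[max(versions)]
        eligible = [t for t in versions if t <= timestamp]
        return versions[max(eligible)] if eligible else None

    def scan(self) -> List[Tuple[str, str, int, bytes]]:
        """All cells sorted by row key, column key, then newest timestamp first."""
        out = []
        for row, column in sorted(self._cells):
            versions = self._cells[(row, column)]
            for ts in sorted(versions, reverse=True):
                out.append((row, column, ts, versions[ts]))
        return out
```

Writing two versions of the same cell and reading it back without a timestamp returns the newest one, while passing a timestamp returns the value as of that time.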

Networking

Global Software Load Balancer (GSLB)
- Geographic load balancing for DNS requests (for example, to www.google.com)
- Load balancing at a user service level (for example, YouTube or Google Maps)
- Load balancing at the Remote Procedure Call (RPC) level

Other System Software

Lock Service
Monitoring and Alerting

Software Infrastructure

RPC

Development Environment

Shakespeare: A Sample Service

Two parts:

Batch

Creates an index, and writes the index into a Bigtable.
Needs to run only once
MapReduce
- The mapping phase reads Shakespeare’s texts and splits them into individual words.
  This is faster if performed in parallel by multiple workers.
- The shuffle phase sorts the tuples by word.
- In the reduce phase, a tuple of (word, list of locations) is created.
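The three phases can be sketched in a few lines of plain Python. This runs in a single process rather than on parallel workers, and the function names and the (document, position) location format are my own assumptions.

```python
from itertools import groupby
from typing import Dict, List, Tuple

Location = Tuple[str, int]  # hypothetical: (document id, word position)


def map_phase(doc_id: str, text: str) -> List[Tuple[str, Location]]:
    """Split one text into (word, location) tuples."""
    return [(word.lower(), (doc_id, pos)) for pos, word in enumerate(text.split())]


def shuffle_phase(pairs: List[Tuple[str, Location]]) -> List[Tuple[str, Location]]:
    """Sort the tuples by word (stable, so locations keep their order)."""
    return sorted(pairs, key=lambda pair: pair[0])


def reduce_phase(sorted_pairs: List[Tuple[str, Location]]) -> Dict[str, List[Location]]:
    """Collapse the sorted tuples into word -> list of locations."""
    return {
        word: [loc for _, loc in group]
        for word, group in groupby(sorted_pairs, key=lambda pair: pair[0])
    }


def build_index(docs: Dict[str, str]) -> Dict[str, List[Location]]:
    pairs: List[Tuple[str, Location]] = []
    for doc_id, text in docs.items():  # each text could go to its own worker
        pairs.extend(map_phase(doc_id, text))
    return reduce_phase(shuffle_phase(pairs))
```

In the real service the resulting index would be written into a Bigtable; here it is just a dictionary.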

Frontend

Handles end-user requests.
This job is always up, as users in all time zones will want to search in Shakespeare’s books.

Life of a Request

Get the IP address

The user’s device resolves the address with its DNS server (1).
This request ultimately ends up at Google’s DNS server, which talks to GSLB. As GSLB keeps track of traffic load among frontend servers across regions, it picks which server IP address to send to this user.

Connect

The browser connects to the HTTP server on this IP.
This server (named the Google Frontend, or GFE) is a reverse proxy that terminates the TCP connection (2).

RPC request

The GFE looks up which service is required (in this case, Shakespeare).
Again using GSLB, the server finds an available Shakespeare frontend server, and sends that server an RPC containing the HTTP request (3).

Look up

The Shakespeare server analyzes the HTTP request and constructs a protobuf containing the word to look up.
The Shakespeare frontend server now needs to contact the Shakespeare backend server: the frontend server contacts GSLB to obtain the BNS (Borg Naming Service) address of a suitable and unloaded backend server (4).

Get data

That Shakespeare backend server now contacts a Bigtable server to obtain the requested data (5).

Return

The answer is written to the reply protobuf and returned to the Shakespeare backend server.
The backend hands a protobuf containing the results to the Shakespeare frontend server, which assembles the HTML and returns the answer to the user.
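The whole request path can be mimicked with a toy simulation. Everything here is hypothetical: the server names, the load numbers, the `INDEX` dictionary standing in for Bigtable, plain dicts standing in for protobufs, and the "pick the least-loaded server" rule standing in for whatever GSLB really does.

```python
from typing import Dict, List, Tuple

# Hypothetical stand-in for the index that the batch job wrote to Bigtable.
INDEX: Dict[str, List[str]] = {"to": ["hamlet 3.1"]}


def gslb_pick(load_by_server: Dict[str, float]) -> str:
    """Simplified GSLB: choose the server with the lowest load."""
    return min(load_by_server, key=lambda name: load_by_server[name])


def handle_request(word: str, frontends: Dict[str, float],
                   backends: Dict[str, float],
                   trace: List[Tuple[str, str]]) -> str:
    """Follow one lookup from the GFE to the data and back."""
    frontend = gslb_pick(frontends)               # step (3): GFE -> frontend
    trace.append(("gfe->frontend", frontend))
    request = {"word": word.lower()}              # stands in for the request protobuf
    backend = gslb_pick(backends)                 # step (4): frontend -> backend
    trace.append(("frontend->backend", backend))
    reply = {"locations": INDEX.get(request["word"], [])}  # step (5): data lookup
    items = "".join(f"<li>{loc}</li>" for loc in reply["locations"])
    return f"<ul>{items}</ul>"                    # frontend assembles the HTML
```

The `trace` list records which servers were chosen at each hop, which makes the flow easy to inspect.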