[Reading Excerpts & Notes] SRE: How Google Runs Production Systems (Part I)

Chapter 1. Introduction

The Sysadmin Approach to Service Management

Devs vs. Ops

The conflict between Devs and Ops

  • Devs: "We want to launch anything, any time, without hindrance," so launches get relabeled as "flag flips," "incremental updates," or "cherry-picks."

  • Ops: "We won't want to ever change anything in the system once it works."

Site Reliability Engineering

Composition of SRE team

  • 50–60% are SWE
  • 40–50% are very close to the SWE qualifications and also have a set of technical skills that is useful to SRE but rare among most SWEs
    - UNIX system internals
    - networking (Layer 1 to Layer 3) expertise

SRE model

  • SRE features
    - quickly become bored by performing tasks by hand
    - can write software to replace manual work
  • Time management
    - 50% cap on the aggregate “ops” work for all SREs
    - The cap ensures SREs have enough time to make the service stable and operable.

Tenets of SRE

The responsibility
  • availability
  • latency
  • performance
  • efficiency
  • change management
  • monitoring
  • emergency response
  • capacity planning of their service(s)
Ensuring a Durable Focus on Engineering
  • SREs should receive a maximum of two events per 8–12-hour on-call shift.
    - This ends the vicious circle
  • Postmortems should be written for all significant incidents
    - Fix the problem; do not cover it up
Pursuing Maximum Change Velocity Without Violating a Service’s SLO
  • Error budget.

    Actually, we don’t need 100% reliability

  • But how much do we need?

    • What level of availability will the users be happy with, given how they use the product?

    • What alternatives are available to users who are dissatisfied with the product’s availability?

    • What happens to users’ usage of the product at different availability levels?
    Once the availability target is set, the error budget is one minus that target. We would spend all of our error budget taking risks with the things we launch, in order to launch them quickly.
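
The error budget idea above can be made concrete with a little arithmetic. The following is a minimal sketch, assuming the budget is measured as allowed downtime over a 30-day window; the function names and the window length are illustrative assumptions, not taken from the book.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Return the downtime (in minutes) that an availability SLO allows over the window."""
    if not 0.0 < slo <= 1.0:
        raise ValueError("slo must be in (0, 1]")
    total_minutes = window_days * 24 * 60
    # The error budget is one minus the availability target.
    return (1.0 - slo) * total_minutes


def budget_remaining(slo: float, downtime_minutes: float, window_days: int = 30) -> float:
    """Return the unspent budget; a negative value means the SLO was missed."""
    return error_budget_minutes(slo, window_days) - downtime_minutes


if __name__ == "__main__":
    for slo in (0.99, 0.999, 0.9999):
        print(f"SLO {slo:.2%}: {error_budget_minutes(slo):.1f} minutes per 30 days")
```

While `budget_remaining` stays positive, risky launches can proceed; once it reaches zero, the budget for the window is used up.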

Monitoring
  • Alerts
    • Take action immediately in response to something that is either happening or about to happen, in order to improve the situation.
  • Tickets
    • Take action, but not immediately.
    • The system cannot automatically handle the situation, but if a human takes action in a few days, no damage will result.
  • Logging
    • No one needs to look at this information, but it is recorded for diagnostic or forensic purposes.
    • The expectation is that no one reads logs unless something else prompts them to do so.
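
The three kinds of monitoring output above differ only in whether a human must act and how soon. A minimal sketch of that decision, assuming two boolean inputs (`needs_human` and `urgent` are hypothetical names introduced here):

```python
from enum import Enum


class Output(Enum):
    ALERT = "alert"    # a human must act immediately
    TICKET = "ticket"  # a human must act, but not immediately
    LOG = "log"        # recorded only; read when something else prompts it


def classify(needs_human: bool, urgent: bool) -> Output:
    """Map a monitoring event to one of the three output kinds."""
    if not needs_human:
        return Output.LOG
    return Output.ALERT if urgent else Output.TICKET
```

For example, a nearly full disk that will cause no damage for several days would be `classify(True, False)`, which yields a ticket rather than a page.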
Emergency Response
  • mean time to failure (MTTF)
  • mean time to repair (MTTR)
  • the importance of playbooks
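
One common way to relate the two metrics above is the steady-state formula availability = MTTF / (MTTF + MTTR). The notes do not quote this formula, so treat the sketch below as an assumption about how the numbers combine; the function name is illustrative.

```python
def availability(mttf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability from mean time to failure and mean time to repair."""
    if mttf_hours <= 0 or mttr_hours < 0:
        raise ValueError("mttf_hours must be positive and mttr_hours non-negative")
    return mttf_hours / (mttf_hours + mttr_hours)


if __name__ == "__main__":
    # Shortening repairs (for example with a practiced playbook) raises availability
    # even when failures are just as frequent.
    print(availability(999.0, 1.0))
    print(availability(999.0, 0.5))
```

Under this formula, lowering MTTR improves availability even when MTTF does not change, which is why a rehearsed playbook matters.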
Change Management
  • automation
    • Implementing progressive rollouts
    • Quickly and accurately detecting problems
    • Rolling back changes safely when problems arise
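
The three automation practices above fit together as a loop: deploy to a small fraction, check health, and either widen the rollout or roll back. The following is a sketch under stated assumptions: `deploy`, `healthy`, and `rollback` are hypothetical callbacks supplied by the caller, not part of any real deployment tool.

```python
from typing import Callable, Sequence


def progressive_rollout(
    stages: Sequence[float],
    deploy: Callable[[float], None],
    healthy: Callable[[], bool],
    rollback: Callable[[], None],
) -> bool:
    """Deploy to growing fractions of the fleet; roll back on the first failed check."""
    for fraction in stages:
        deploy(fraction)       # push the change to this fraction of the fleet
        if not healthy():      # detect problems before widening the rollout
            rollback()         # undo the change as soon as a problem is detected
            return False
    return True


if __name__ == "__main__":
    log = []
    ok = progressive_rollout(
        stages=[0.01, 0.1, 1.0],
        deploy=log.append,
        healthy=lambda: len(log) < 2,  # pretend the second stage is unhealthy
        rollback=lambda: log.append("rollback"),
    )
    print(ok, log)
```

Because each stage is checked before the next begins, a bad change in this sketch reaches only the fraction deployed so far.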
Demand Forecasting and Capacity Planning
  • Ensuring that there is sufficient capacity and redundancy to serve projected future demand with the required availability

  • Capacity planning:

    • An accurate forecast model of organic growth
    • An accurate accounting of demand sources
    • Periodic load testing
Provisioning
Efficiency and Performance
  • demand (load)
    slower
  • capacity (available capacity)
  • software efficiency

Mind map

Chapter 2. The Production Environment at Google, from the Viewpoint of an SRE

Hardware

  • Machine: A piece of hardware (or perhaps a VM)

  • Server: A piece of software that implements a service

  • the topology of a Google datacenter:

    • Tens of machines are placed in a rack.
    • Racks stand in a row.
    • One or more rows form a cluster.
    • Usually a datacenter building houses multiple clusters.
    • Multiple datacenter buildings that are located close together form a campus.

  • SDN: a software-defined networking architecture

System Software That “Organizes” the Hardware

Managing Machines

Kubernetes: an open source container cluster orchestration framework started by Google in 2014.

Storage

  1. D
    D stands for disk, although it uses both spinning disks and flash storage.
    D is a fileserver running on almost all machines in a cluster. Users do not need to know which machine stores their data.
  2. Colossus
    Creates a cluster-wide filesystem that offers the usual filesystem semantics, as well as replication and encryption.
  3. Several database-like services
    1. Bigtable
      A NoSQL database system.
      A sparse, distributed, persistent multidimensional sorted map that is indexed by row key, column key, and timestamp;
      each value in the map is an uninterpreted array of bytes.
      Bigtable supports eventually consistent, cross-datacenter replication.
    2. Spanner
      Offers an SQL-like interface and real consistency across the world.
    3. Several other database systems.
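
The Bigtable data model described above (a sparse sorted map from row key, column key, and timestamp to uninterpreted bytes) can be imitated in memory to make the indexing concrete. This is a toy sketch only: `SparseTable` and its methods are hypothetical names, and nothing here reflects the real Bigtable API, distribution, or persistence.

```python
from typing import Dict, Iterator, Optional, Tuple

Key = Tuple[str, str, int]  # (row key, column key, timestamp)


class SparseTable:
    """Toy in-memory map: only cells that were written take up space."""

    def __init__(self) -> None:
        self._cells: Dict[Key, bytes] = {}

    def put(self, row: str, column: str, timestamp: int, value: bytes) -> None:
        """Store an uninterpreted byte string under the three-part key."""
        self._cells[(row, column, timestamp)] = value

    def get(self, row: str, column: str) -> Optional[bytes]:
        """Return the value with the newest timestamp, or None if the cell is empty."""
        versions = [(ts, v) for (r, c, ts), v in self._cells.items()
                    if r == row and c == column]
        if not versions:
            return None
        return max(versions, key=lambda item: item[0])[1]

    def scan(self) -> Iterator[Tuple[Key, bytes]]:
        """Yield all cells in sorted key order."""
        for key in sorted(self._cells):
            yield key, self._cells[key]
```

Writing two versions of the same cell and calling `get` returns the newer one, which shows why the timestamp is part of the key.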

Networking

  • Global Software Load Balancer (GSLB)
    • Geographic load balancing for DNS requests (for example, to www.google.com)
    • Load balancing at a user service level (for example, YouTube or Google Maps)
    • Load balancing at the Remote Procedure Call (RPC) level

Other System Software

Lock Service
Monitoring and Alerting

Software Infrastructure

RPC

Development Environment

Shakespeare: A Sample Service

Two parts:

  1. Batch
  • Creates an index and writes it into a Bigtable.
  • Needs to run only once.
  • Implemented as a MapReduce:
    • The mapping phase reads Shakespeare’s texts and splits them into individual words.
      This is faster if performed in parallel by multiple workers.
    • The shuffle phase sorts the tuples by word.
    • In the reduce phase, a tuple of (word, list of locations) is created.
  2. Frontend
  • Handles end-user requests.
  • This job is always up, as users in all time zones may want to search at any time.
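
The three MapReduce phases of the batch component can be sketched in a single process. This is an illustration only, assuming a location is a (title, word position) pair; the real job runs the map phase in parallel across workers and writes the result into a Bigtable, neither of which is modeled here.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Location = Tuple[str, int]  # (work title, word position): a hypothetical format
Pair = Tuple[str, Location]


def map_phase(title: str, text: str) -> List[Pair]:
    """Split one text into words and emit a (word, location) tuple per word."""
    return [(word.lower(), (title, pos)) for pos, word in enumerate(text.split())]


def shuffle_phase(pairs: Iterable[Pair]) -> List[Pair]:
    """Sort the tuples by word (stable, so positions keep their order)."""
    return sorted(pairs, key=lambda pair: pair[0])


def reduce_phase(sorted_pairs: Iterable[Pair]) -> Dict[str, List[Location]]:
    """Collapse the sorted tuples into (word, list of locations)."""
    index: Dict[str, List[Location]] = defaultdict(list)
    for word, location in sorted_pairs:
        index[word].append(location)
    return dict(index)


def build_index(works: Dict[str, str]) -> Dict[str, List[Location]]:
    """Run map, shuffle, and reduce over all works sequentially."""
    pairs: List[Pair] = []
    for title, text in works.items():
        pairs.extend(map_phase(title, text))
    return reduce_phase(shuffle_phase(pairs))
```

Calling `build_index({"Hamlet": "to be or not to be"})` maps each word to every position where it occurs, which is the index the frontend later queries.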

Life of a Request

  1. Get the IP address
  • The user’s device resolves the address with its DNS server.
  • This request ultimately ends up at Google’s DNS server, which talks to GSLB. As GSLB keeps track of traffic load among frontend servers across regions, it picks which server IP address to send to this user.
  2. Connect
  • The browser connects to the HTTP server on this IP.
  • This server (named the Google Frontend, or GFE) is a reverse proxy that terminates the TCP connection (2).
  3. RPC request
  • The GFE looks up which service is required.
  • Again using GSLB, the server finds an available Shakespeare frontend server, and sends that server an RPC containing the HTTP request (3).
  4. Look up
  • The Shakespeare server analyzes the HTTP request and constructs a protobuf containing the word to look up.
  • The Shakespeare frontend server now needs to contact the Shakespeare backend server: the frontend server contacts GSLB to obtain the BNS address of a suitable and unloaded backend server (4).
  5. Get data
  • That Shakespeare backend server now contacts a Bigtable server to obtain the requested data (5).
  6. Return
  • The answer is written to the reply protobuf and returned to the Shakespeare backend server.
  • The backend hands a protobuf containing the results to the Shakespeare frontend server, which assembles the HTML and returns the answer to the user.
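
Several steps above rely on GSLB choosing a suitable, unloaded server. The notes do not say how GSLB decides, so the following is only a toy sketch that assumes the choice is "lowest reported load"; the function name and the server names in the usage note are hypothetical.

```python
from typing import Dict


def pick_least_loaded(load_by_server: Dict[str, float]) -> str:
    """Return the server with the lowest reported load (an assumed selection rule)."""
    if not load_by_server:
        raise ValueError("no servers available")
    return min(load_by_server, key=load_by_server.get)
```

For instance, `pick_least_loaded({"backend-a": 0.9, "backend-b": 0.2})` selects `backend-b`. A real balancer would also weigh factors such as region and capacity.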

Job and Data Organization

Why N + 2?

  1. During updates, one task at a time will be unavailable.
  2. A machine failure might occur during a task update.
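
The N + 2 rule above can be worked through numerically: N is the number of tasks needed to serve peak load, and the two spares cover one task being updated plus one machine failure at the same time. A minimal sketch, assuming illustrative figures of 3,470 QPS at peak and 100 QPS per task (the function name is hypothetical):

```python
import math


def tasks_needed(peak_qps: float, qps_per_task: float, spare: int = 2) -> int:
    """Return N + spare, where N tasks are enough to serve the peak load."""
    if qps_per_task <= 0:
        raise ValueError("qps_per_task must be positive")
    n = math.ceil(peak_qps / qps_per_task)  # minimum tasks that cover the peak
    return n + spare


if __name__ == "__main__":
    print(tasks_needed(3470, 100))
```

With these figures N is 35, so the job runs 37 tasks. Losing any two of them still leaves enough capacity for the peak.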
Mind map