Understanding the SRE Role (Part 1): Reflections on Reading the SRE Practical Handbook (SRE实战手册)

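This article introduces the role and responsibilities of the SRE (Site Reliability Engineer) and stresses its importance in improving system stability and availability. Drawing on Google's SRE practice, it explains the concepts of SLI (Service Level Indicator), SLO (Service Level Objective), and error budget, and describes how SLIs and SLOs are used to set and measure system stability. It also discusses what to consider when putting SLOs into practice, including identifying core paths, analyzing service dependencies, and validation strategies, so that the core business stays stable while the overall system is operated efficiently.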

1. Introduction

What is SRE? SRE stands for Site Reliability Engineering. The same abbreviation is also used for the people who practice it, Site Reliability Engineers.

The concept of SRE comes from Google. "Systems Engineer, Site Reliability Engineering" is the title of a job posting from Google's recruiting pages. Let's look at what the role requires:


Job description:

Site Reliability Engineering (SRE) combines software and systems engineering to build and run large-scale, massively distributed, fault-tolerant systems. SRE ensures that Google’s services—both our internally critical and our externally-visible systems—have reliability, uptime appropriate to users’ needs and a fast rate of improvement. Additionally SRE’s will keep an ever-watchful eye on our systems capacity and performance. Much of our software development focuses on optimizing existing systems, building infrastructure and eliminating work through automation.

On the SRE team, you’ll have the opportunity to manage the complex challenges of scale which are unique to Google, while using your expertise in coding, algorithms, complexity analysis and large-scale system design.
SRE’s culture of diversity, intellectual curiosity, problem solving and openness is key to its success. Our organization brings together people with a wide variety of backgrounds, experiences and perspectives. We encourage them to collaborate, think big and take risks in a blame-free environment. We promote self-direction to work on meaningful projects, while we also strive to create an environment that provides the support and mentorship needed to learn and grow.

To learn more: check out our books on Site Reliability Engineering, watch a recorded Hangout on Air to meet some of our SREs, or read a career profile about why a Software Engineer chose to join SRE.

Behind everything our users see online is the architecture built by the Technical Infrastructure team to keep it running. From developing and maintaining our data centers to building the next generation of Google platforms, we make Google’s product portfolio possible. We’re proud to be our engineers’ engineers and love voiding warranties by taking things apart so we can rebuild them. We keep our networks up and running, ensuring our users have the best and fastest experience possible.


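Google's SRE is currently regarded as the best practice in the field of system stability. Once microservices, containers, and other distributed technologies and products are introduced, the stability of a system with a complex architecture becomes hard to guarantee. This is where SRE is needed.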

2. What is SRE

One practitioner's answer:

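The core of DevOps is full-stack delivery, while the core of SRE is stability assurance, with attention to every activity of the business. What the two share is that both use software engineering to solve problems. DevOps emerged because competition in the internet market intensified: to cut the cost of trial and error, companies tend to ship a minimum viable product, which then has to iterate continuously and frequently to meet market demand and capture market share (product iteration involves the entire delivery chain). High-frequency iteration pushes R&D teams toward agile methods, agile places stricter demands on the operations team's full-stack delivery capability, and so operations has to adopt DevOps to achieve full-stack delivery. Because continuous iterative delivery (commonly called change) is what triggers failures and is the root source of instability, and because a lack of stability in internet products and services drives users away, possibly to competitors, a focus on business stability has also become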
