Understanding the SRE Role (Part 1): Notes on Reading the SRE Practical Handbook (SRE实战手册)

Latest recommended article published on 2024-07-11 22:30:14

星尘博客

Categories: SRE, Common Interview Questions, Linux. Tags: operations, operations development, devops, linux

Article link: https://blog.csdn.net/u011130655/article/details/113410295

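This article introduces the role and responsibilities of the SRE (Site Reliability Engineer) and stresses its importance in improving system stability and availability. Drawing on Google's SRE practice, it explains the concepts of SLI (Service Level Indicator), SLO (Service Level Objective), and error budget, and describes how SLIs and SLOs are used to set and measure system stability. It also discusses the factors to consider when putting SLOs into practice, including identifying core call paths, analyzing service dependencies, and choosing verification strategies, so that the core business stays stable while the system as a whole is operated efficiently.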

Summary generated by CSDN using AI technology

1. Introduction

What is SRE? SRE stands for Site Reliability Engineering. The same abbreviation is also used for the people who do this work, Site Reliability Engineers.

The concept of SRE comes from Google. "Systems Engineer, Site Reliability Engineering" is the title of one of Google's job postings. Let's look at what the role requires:

Job description:

Site Reliability Engineering (SRE) combines software and systems engineering to build and run large-scale, massively distributed, fault-tolerant systems. SRE ensures that Google’s services—both our internally critical and our externally-visible systems—have reliability, uptime appropriate to users’ needs and a fast rate of improvement. Additionally SRE’s will keep an ever-watchful eye on our systems capacity and performance. Much of our software development focuses on optimizing existing systems, building infrastructure and eliminating work through automation.

On the SRE team, you’ll have the opportunity to manage the complex challenges of scale which are unique to Google, while using your expertise in coding, algorithms, complexity analysis and large-scale system design.
SRE’s culture of diversity, intellectual curiosity, problem solving and openness is key to its success. Our organization brings together people with a wide variety of backgrounds, experiences and perspectives. We encourage them to collaborate, think big and take risks in a blame-free environment. We promote self-direction to work on meaningful projects, while we also strive to create an environment that provides the support and mentorship needed to learn and grow.

To learn more: check out our books on Site Reliability Engineering, watch a recorded Hangout on Air to meet some of our SREs, or read a career profile about why a Software Engineer chose to join SRE.

Behind everything our users see online is the architecture built by the Technical Infrastructure team to keep it running. From developing and maintaining our data centers to building the next generation of Google platforms, we make Google’s product portfolio possible. We’re proud to be our engineers’ engineers and love voiding warranties by taking things apart so we can rebuild them. We keep our networks up and running, ensuring our users have the best and fastest experience possible.

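Google SRE is currently regarded as the best practice in the reliability field. Once microservices, containers, and other distributed technologies and products are introduced, the stability of a system with a complex architecture becomes hard to guarantee. That is when SRE is needed.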

2. What Is SRE

One practitioner's answer:

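The core of DevOps is full-stack delivery; the core of SRE is stability assurance, with attention to every activity of the business. What the two share is that both solve problems with software engineering. DevOps emerged because competition in the internet market intensified: to reduce the cost of trial and error, companies often launch a minimum viable product, which then has to iterate continuously and frequently to meet market demand and capture market share (product iteration involves the entire delivery chain). Frequent iteration pushes development teams toward agile methods, and agile places stricter demands on the operations team's full-stack delivery capability, so operations has to adopt DevOps to achieve full-stack delivery. Because continuous iterative delivery (commonly called change) is what triggers failures and is the root cause of instability, and because a loss of stability in internet products and services drives users away, possibly to competitors, attention to business stability has also become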