Full Cycle Developers at Netflix: Operate What You Build

The year was 2012 and operating a critical service at Netflix was laborious. Deployments felt like walking through wet sand. Canarying was devolving into verifying endurance (“nothing broke after one week of canarying, let’s push it”) rather than correct functionality. Researching issues felt like bouncing a rubber ball between teams, hard to catch the root cause and harder yet to stop from bouncing between one another. All of these were signs that changes were needed.

Fast forward to 2018. Netflix has grown to 125M global members enjoying 140M+ hours of viewing per day. We’ve invested significantly in improving the development and operations story for our engineering teams. Along the way we’ve experimented with many approaches to building and operating our services. We’d like to share one approach, including its pros and cons, that is relatively common within Netflix. We hope that sharing our experiences inspires others to debate the alternatives and learn from our journey.

One Team's Journey

Edge Engineering is responsible for the first layer of AWS services that must be up for Netflix streaming to work. In the past, Edge Engineering had ops-focused teams and SRE specialists who owned the deploy+operate+support parts of the software life cycle. Releasing a new feature meant devs coordinating with the ops team on things like metrics, alerts, and capacity considerations, and then handing off code for the ops team to deploy and operate. To be effective at running the code and supporting partners, the ops teams needed ongoing training on new features and bug fixes. The primary upside of having a separate ops team was fewer developer interruptions when things were going well.

When things didn’t go well, the costs added up. Communication and knowledge transfers between devs and ops/SREs were lossy, requiring additional round trips to debug problems or answer partner questions. Deployment problems had a higher time-to-detect and time-to-resolve due to the ops teams having less direct knowledge of the changes being deployed. The gap between code complete and deployed was much longer than today, with releases happening on the order of weeks rather than days. Feedback went from ops, who directly experienced pains such as lack of alerting/monitoring or performance issues and increased latencies, to devs, who were hearing about those problems second-hand.

To improve on this, Edge Engineering experimented with a hybrid model where devs could push code themselves when needed, and also were responsible for off-hours production issues and support requests. This improved the feedback and learning cycles for developers. But having only partial responsibility left gaps. For example, even though devs could do their own deployments and debug pipeline breakages, they would often defer to the ops release specialist. The ops-focused people were motivated to do the day-to-day work but found it hard to prioritize automation so that others didn’t need to rely on them.

In search of a better way, we took a step back and decided to start from first principles. What were we trying to accomplish and why weren’t we being successful?

The Software Life Cycle

The purpose of the software life cycle is to optimize “time to value”: to effectively convert ideas into working products and services for customers. Developing and running a software service involves a full set of responsibilities: design, development, test, deploy, operate, and support.

We had been segmenting these responsibilities. At an extreme, this means each functional area is owned by a different person or role.

These specialized roles create efficiencies within each segment while potentially creating inefficiencies across the entire life cycle. Specialists develop expertise in a focused area and optimize what’s needed for that area. They get more effective at solving their piece of the puzzle. But software requires the entire life cycle to deliver value to customers. Having teams of specialists who each own a slice of the life cycle can create silos that slow down end-to-end progress. Grouping differing specialists together into one team can reduce silos, but having different people do each role adds communication overhead, introduces bottlenecks, and inhibits the effectiveness of feedback loops.

Operate What You Build

To rethink our approach, we drew inspiration from the principles of the devops movement. We could optimize for learning and feedback by breaking down silos and encouraging shared ownership of the full software life cycle.

“Operate what you build” puts the devops principles in action by having the team that develops a system also be responsible for operating and supporting that system. Distributing this responsibility to each development team, rather than externalizing it, creates direct feedback loops and aligns incentives. Teams that feel operational pain are empowered to remediate the pain by changing their system design or code; they are responsible and accountable for both functions. Each development team owns deployment issues, performance bugs, capacity planning, alerting gaps, partner support, and so on.

Scaling Through Developer Tools

Ownership of the full development life cycle adds significantly to what software developers are expected to do. Tooling that simplifies and automates common development needs helps to balance this out. For example, if software developers are expected to manage rollbacks of their services, rich tooling is needed that can detect problems, alert developers to them, and aid in the rollback.
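
As a rough illustration of the detect, alert, and roll back flow described above, here is a minimal Python sketch. It is not Netflix's tooling: the `Deployment` type, the `should_roll_back` rule (several consecutive error-rate samples above a threshold), and the `alert` and `roll_back` callbacks are all hypothetical names chosen for this example, and a real system would plug in its own monitoring and deployment services.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Deployment:
    """Hypothetical record of a service rollout (illustrative only)."""
    service: str
    version: str
    previous_version: str


def should_roll_back(error_rates: List[float], threshold: float,
                     min_breaches: int = 3) -> bool:
    """Return True when the last `min_breaches` samples all exceed `threshold`.

    Requiring several consecutive breaches is an assumed rule that avoids
    reacting to a single noisy sample.
    """
    if len(error_rates) < min_breaches:
        return False
    return all(rate > threshold for rate in error_rates[-min_breaches:])


def check_and_roll_back(deployment: Deployment,
                        error_rates: List[float],
                        threshold: float,
                        alert: Callable[[str], None],
                        roll_back: Callable[[str, str], None]) -> bool:
    """Detect a bad rollout, alert the owning developers, then roll back.

    `alert` and `roll_back` are injected callbacks standing in for whatever
    paging and deployment tools a team actually uses.
    """
    if not should_roll_back(error_rates, threshold):
        return False
    alert(f"{deployment.service} {deployment.version}: error rate above "
          f"{threshold:.1%}, rolling back to {deployment.previous_version}")
    roll_back(deployment.service, deployment.previous_version)
    return True
```

Passing the alerting and rollback actions in as callbacks keeps the decision logic easy to test in isolation, which matters when the same developers who write the service are also the ones paged for it.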

Netflix created centralized teams (e.g., Cloud Platform, Performance & Reliability Engineering, Engineering Tools) with the mission of developing common tooling and infrastructure to solve problems that every development team has. Those centralized teams act as force multipliers by turning their specialized knowledge into reusable building blocks.

Empowered with these tools in hand, development teams can focus on solving problems within their specific product domain. As additional tooling needs arise, centralized teams assess whether the needs are common across multiple dev teams. When they are, collaborations ensue. Sometimes these local needs are too specific to warrant centralized investment. In that case the development team decides if their need is important enough for them to solve on their own.

Balancing local versus central investment in similar problems is one of the toughest aspects of our approach. In our experience the benefits of finding novel solutions to developer needs are worth the risk of multiple groups creating parallel solutions that will need to converge down the road. Communication and alignment are the keys to success. By starting well-aligned on the needs and how common they are likely to be, we can better match the investment to the benefits to dev teams across Netflix.

Full Cycle Developers

By combining all of these ideas together, we arrived at a model where a development team, equipped with amazing developer productivity tools, is responsible for the full software life cycle: design, development, test, deploy, operate, and support.

Full cycle developers are expected to be knowledgeable and effective in all areas of the software life cycle. For many new-to-Netflix developers, this means ramping up on areas they haven’t focused on before. We run dev bootcamps and other forms of ongoing training to impart this knowledge and build up these skills. Knowledge is necessary but not sufficient; easy-to-use tools for deployment pipelines (e.g., Spinnaker) and monitoring (e.g., Atlas) are also needed for effective full cycle ownership.

Full cycle developers apply engineering discipline to all areas of the life cycle. They evaluate problems from a developer perspective and ask questions like “how can I automate what is needed to operate this system?” and “what self-service tool will enable my partners to answer their questions without needing me to be involved?” This helps our teams scale by favoring systems-focused rather than humans-focused thinking and automation over manual approaches.

Moving to a full cycle developer model requires a mindset shift. Some developers view design+development, and sometimes testing, as the primary way that they create value. This leads to the anti-pattern of viewing operations as a distraction, favoring short-term fixes to operational and support issues so that they can get back to their “real job”. But the “real job” of full cycle developers is to use their software development expertise to solve problems across the full life cycle. A full cycle developer thinks and acts like an SWE, SDET, and SRE. At times they create software that solves business problems, at other times they write test cases for that software, and at still other times they automate operational aspects of that system.

For this model to succeed, teams must be committed to the value it brings and be cognizant of the costs. Teams need to be staffed appropriately with enough headroom to manage builds and deployments, handle production issues, and respond to partner support requests. Time needs to be devoted to training. Tools need to be leveraged and invested in. Partnerships need to be fostered with centralized teams to create reusable components and solutions. All areas of the life cycle need to be considered during planning and retrospectives. Investments like automating alert responses and building self-service partner support tools need to be prioritized alongside business projects. With appropriate staffing, prioritization, and partnerships, teams can be successful at operating what they build. Without these, teams risk overload and burnout.

To apply this model outside of Netflix, adaptations are necessary. The common problems across your dev teams are likely similar, such as the need for continuous delivery pipelines and monitoring/observability. But many companies won’t have the staffing to invest in centralized teams as Netflix does, nor will they need the complexity that Netflix’s scale requires. Netflix’s tools are often open source, and it may be compelling to try them as a first pass. However, other open source and SaaS solutions to these problems can meet most companies’ needs. Start with analysis of the potential value and count the costs, followed by the mindset shift. Evaluate what you need and be mindful of bringing in the least complexity necessary.

Trade-offs

The tech industry has a wide range of ways to solve development and operations needs (see devops topologies for an extensive list). The full cycle model described here is common at Netflix, but has its downsides. Knowing the trade-offs before choosing a model can increase the chance of success.

With the full cycle model, priority is given to a larger area of ownership and effectiveness in those broader domains through tools. Breadth requires both interest and aptitude in a diverse range of technologies. Some developers prefer focusing on becoming world class experts in a narrow field and our industry needs those types of specialists for some areas. For those experts, the need to be broad, with reasonable depth in each area, may be uncomfortable and sometimes unfulfilling. Some at Netflix prefer to be in an area that needs deep expertise without requiring ongoing breadth and we support them in finding those roles; others enjoy and welcome the broader responsibilities.

In our experience with building and operating cloud-based systems, we’ve seen effectiveness with developers who value the breadth that owning the full cycle requires. But that breadth increases each developer’s cognitive load and means a team will balance more priorities every week than if they just focused on one area. We mitigate this by having an on-call rotation where developers take turns handling the deployment + operations + support responsibilities. When done well, that creates space for the others to do the focused, flow-state type work. When not done well, teams devolve into everyone jumping in on high-interrupt work like production issues, which can lead to burnout.
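
The rotation idea can be sketched in a few lines of Python. This is only an illustration under assumed conventions (weekly shifts, simple round-robin order, a hypothetical `build_rotation` helper); the post does not say how Netflix teams actually schedule on-call.

```python
from datetime import date, timedelta
from typing import List, Tuple


def build_rotation(developers: List[str], start: date,
                   weeks: int) -> List[Tuple[date, str]]:
    """Assign one developer per week, cycling through the team in order.

    Returns (shift_start_date, developer) pairs. Weekly shifts and
    round-robin order are assumptions made for this sketch.
    """
    if not developers:
        raise ValueError("at least one developer is required")
    return [
        (start + timedelta(weeks=i), developers[i % len(developers)])
        for i in range(weeks)
    ]
```

With a three-person team, each developer carries the deployment, operations, and support load one week in three, leaving the other two weeks for focused work.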

Tooling and automation help to scale expertise, but no tool will solve every problem in the developer productivity and operations space. Netflix has a “paved road” set of tools and practices that are formally supported by centralized teams. We don’t mandate adoption of those paved roads but encourage adoption by ensuring that development and operations using those technologies is a far better experience than not using them. The downside of our approach is that the ideal of “every team using every feature in every tool for their most important needs” is near impossible to achieve. Realizing the returns on investment for our centralized teams’ solutions requires effort, alignment, and ongoing adaptations.

Conclusion

The path from 2012 to today has been full of experiments, learning, and adaptations. Edge Engineering, whose earlier experiences motivated finding a better model, is actively applying the full cycle developer model today. Deployments are routine and frequent, canaries take hours instead of days, and developers can quickly research issues and make changes rather than bouncing the responsibilities across teams. Other groups are seeing similar benefits. However, we’re cognizant that we got here by applying and learning from alternate approaches. We expect tomorrow’s needs to motivate further evolution.

Interested in seeing this model in action? Want to be a part of exploring how we evolve our approaches for the future? Consider joining us.