Improving Continuous Integration for Our Monolith
Like most startups, our original monolithic codebase, Shipotle, grew rapidly to keep up with the exponential growth of the company. It soon became a tangled blob of complex business logic and data storage. There are various efforts under way to break our monolith into microservices, but the transition will take some time. In the meantime, working in Shipotle remains on the critical path of shipping many of our new features and products. To provide immediate relief for some of the pain points of working in a large, legacy repo, we launched a team at the beginning of 2020 focused exclusively on Developer Experience in Shipotle.
One of the biggest customer complaints was that the Continuous Integration (CI) process was too slow and prevented developers from iterating quickly on their features. At Convoy, we require all pull requests to pass suites of unit and integration tests as well as style checks before we allow merges into the mainline. Because there are so many tests in the monolith, it is impractical for developers to run all the different tests locally. Most developers’ workflows usually involve only running a few tests locally before relying on the CI environment to validate their changes as well as catch any regressions on the entire test suite.
When our team first set out to tackle this problem, the CI workflow took 20 minutes. All 100+ developers in Shipotle had to pay this tax for every commit they made, and it added up to limit the productivity of the entire engineering team. In addition to costing developers a lot of time, a long CI process also requires paying for more cloud compute and directly affects the company’s bottom line. Through our work, we managed to halve the CI time to 10 minutes even as the number of tests in the repo increased by over 20%.
Building Observability
Going into the project, our team knew that a lot of tasks would be exploratory in nature. To increase our confidence in the proposed changes, we needed to collect and analyze metrics from the CI pipeline. These metrics would also serve as an easy way to track progress over time and highlight bottlenecks that needed additional work. Luckily, the data we were looking for was readily available via webhooks from the CI pipeline. All we had to do was listen to the incoming webhooks and emit new Datadog metrics. Once the CI data was in Datadog, we were able to quickly build dashboards for easy visualization.
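For illustration, here is a minimal sketch of what that listener might look like. The article does not name a specific CI provider or metrics client, so the Express endpoint, the hot-shots StatsD client, and the webhook payload shape below are all assumptions rather than our actual implementation.

```typescript
import express from "express";
import StatsD from "hot-shots";

// Assumed payload shape -- real CI webhooks differ by provider.
interface CiJobEvent {
  pipeline: string;
  step: string; // e.g. "docker-build", "integration-tests"
  status: "passed" | "failed";
  durationSeconds: number;
}

const statsd = new StatsD({ prefix: "ci." });
const app = express();

app.post("/webhooks/ci", express.json(), (req, res) => {
  const event = req.body as CiJobEvent;

  // Turn each finished CI step into a timing metric tagged by step and
  // status, so the dashboards can chart durations and flag regressions.
  statsd.timing("step.duration", event.durationSeconds * 1000, [
    `pipeline:${event.pipeline}`,
    `step:${event.step}`,
    `status:${event.status}`,
  ]);

  res.sendStatus(204);
});

app.listen(3000);
```

Keeping the listener this thin means it stays stateless; all aggregation and visualization happens on the Datadog side.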
We ended up with two dashboards: build-test time and tests time. The build-test time dashboard gave us a top-level understanding of the CI process and helped us understand trends over time. The tests time dashboard dives into individual tests and helps us determine which tests are slow and flaky.
With the two dashboards, we quickly realized we needed to focus our efforts on driving down the (Docker) build times and the integration test times.
Improving Docker Build Times
At Convoy, we believe in running our tests inside of production-ready artifacts to minimize differences between environments. As such, building the production-ready Docker image is a prerequisite to running our test suites.
Our Docker build has 4 main phases:
- Code checkout
- Installing the necessary node modules
- Compiling TypeScript to JavaScript
- Pushing the built image to our container repository
One simple trick for shaving build time off any large, old monorepo is to leverage shallow git clones to skip downloading all of the repo’s history. We found that this simple change cut our clone times from 30 seconds to 5 seconds.
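As a sketch of what that looks like in practice (assuming a checkout step scripted in Node; the repository URL and environment variable names are placeholders), the important part is simply the --depth 1 flag:

```typescript
import { execSync } from "child_process";

// Hypothetical checkout step: clone only the tip of the branch under test
// instead of the full history. REPO_URL and CI_BRANCH stand in for whatever
// your CI provider actually exposes.
const repoUrl = process.env.REPO_URL ?? "git@github.com:example/shipotle.git";
const branch = process.env.CI_BRANCH ?? "main";

execSync(`git clone --depth 1 --branch ${branch} ${repoUrl} shipotle`, {
  stdio: "inherit",
});
```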
Dealing with Large Docker Layers
We quickly found out that steps 2 and 4 are intimately related: the fewer node modules you install, the faster the push (and pull). Docker is able to push different layers in parallel, but when a single layer contains all the node modules, the slowest part of the process ends up being the time it takes to compress thousands of files before the push.
When we first looked into our node modules directory, we realized that there were a couple packages duplicated over and over. Running yarn why helped us realize we were pinning the versions of certain packages in Shipotle’s package.json, preventing hoisting from happening within many of our required internal client libraries and resulting in many nested versions instead. Just by bumping lodash forward by 4 patch versions, we were able to remove 70 duplicate lodash copies in node modules! Systematically working through our package.json list with yarn why and yarn deduplicate, we halved the size of node modules from 2 GB to under 1 GB.
We also discovered another simple trick for shrinking the size of node modules in Docker images. Running yarn install caches package tarballs in a local cache directory to prevent fetching the same package over and over again. This makes a lot of sense for local development but is unnecessary for a production image. By changing our install command from yarn install to yarn install && yarn cache clean, we further trimmed down the size of our Docker image.
In addition to reducing the Docker image size, we also looked into making the Docker build more efficient. We wanted a system that could more efficiently leverage Docker’s built-in layer reuse. In particular, installing node modules over and over is extremely wasteful and slow. We rolled out a cache system that determines whether the checksum of the package.json and yarn.lock files has been encountered before. If the cache exists, we pull the corresponding Docker image, which shares the same node modules layer. If not, we skip the image pull, build the image from scratch, and update the cache with the new image. It does take a bit longer to pull the cached image before kicking off the build, but that is easily offset by not needing to install or push the large node modules layer.
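Here is a rough sketch of that caching scheme, assuming the build step is scripted in Node and can shell out to the docker CLI; the registry address, tag format, and build command are hypothetical stand-ins for our actual pipeline configuration.

```typescript
import { createHash } from "crypto";
import { readFileSync } from "fs";
import { execSync } from "child_process";

// Hypothetical registry and tag scheme -- adjust to your own pipeline.
const REGISTRY = "123456789012.dkr.ecr.us-west-2.amazonaws.com/shipotle";

// Key the cache on the dependency manifests: if neither file changed, a
// previously built image already contains the same node_modules layer.
const depsHash = createHash("sha256")
  .update(readFileSync("package.json"))
  .update(readFileSync("yarn.lock"))
  .digest("hex")
  .slice(0, 12);

const cacheImage = `${REGISTRY}:deps-${depsHash}`;

let haveCache = true;
try {
  execSync(`docker pull ${cacheImage}`, { stdio: "inherit" });
} catch {
  haveCache = false; // cache miss: build from scratch and seed the cache
}

const cacheFlag = haveCache ? `--cache-from ${cacheImage}` : "";
execSync(`docker build ${cacheFlag} -t ${cacheImage} .`, { stdio: "inherit" });

if (!haveCache) {
  execSync(`docker push ${cacheImage}`, { stdio: "inherit" });
}
```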
Improving TypeScript Compile Times
The other main step in our Docker build is compiling our TypeScript code into JavaScript. When we first started, the compile was taking roughly 280 seconds. We tried a variety of different experiments like increasing the machine size, breaking the compile apart into smaller chunks, and upgrading TypeScript versions. Nothing worked. In the end, it came down to a single TypeScript config flag. Our configuration had the incremental flag set to true. With incremental compiles, TypeScript is able to determine which files changed since the last compile and only type check and transpile those impacted files. Developers pay an expensive one-time boot-up cost in exchange for faster local iteration. However, because our production artifact does not need to be recompiled again and again, keeping this flag enabled in the Docker build is useless. In fact, we found that keeping the flag on greatly slows down the compile because the compiler has to do more work to output the information necessary to make incremental compiles possible. Switching the flag off immediately dropped our compile times to 130 seconds.
Speeding up Testing
Generally, the simplest way to speed up tests is to increase the number of containers running them. While the total amount of test compute stays roughly the same no matter how many processes it is split across, there is a cost overhead for each additional container/machine we want to launch. This is because it takes time to pull, extract, and start each Docker container. While the compute cost of running more machines scales linearly, shorter test times have diminishing returns on developer productivity. Given the limited capital we can spend in this area, it is easier to view this problem as an efficiency problem instead of just a speed problem.
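To make the trade-off concrete, here is a small back-of-the-envelope sketch. The 40 minutes of total test work and the 60-second per-container overhead are illustrative numbers, not measurements from our pipeline.

```typescript
// Illustrative only: how per-container overhead erodes the benefit of
// adding more test containers. All numbers below are assumptions.
const totalTestSeconds = 40 * 60; // total serial test work
const overheadSeconds = 60;       // pull/extract/start cost per container

for (const containers of [1, 2, 4, 8, 16, 32]) {
  const wallClock = overheadSeconds + totalTestSeconds / containers;
  const computeCost = containers * overheadSeconds + totalTestSeconds;
  console.log(
    `${containers} containers: ~${Math.round(wallClock / 60)} min wall clock, ` +
      `~${Math.round(computeCost / 60)} machine-minutes of compute`
  );
}
```

Wall-clock time quickly flattens out while the compute bill keeps climbing, which is why we treat this as an efficiency problem.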
Tackling the Slowest Tests
Once we built out our test dashboard, we could easily identify the problematic slow test suites that were blocking the build. While we did discover a testing race condition that would cause some tests to get locked out and idle for 3 minutes, we found most of the slowness was a result of the gradual build-up of code over time. Oftentimes there was inefficient or unnecessary setup and teardown logic that had been copied and pasted between test files, and the only way to fix it was to work with the individual teams. Although the work was seemingly unglamorous (writing tests is hard enough, but reading tests is even less enjoyable), our team was able to document some common anti-patterns and implement some guardrails to help prevent future mistakes.
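As a hypothetical example of the kind of anti-pattern we documented (assuming a Jest-style runner; the fixture helpers below are stand-ins, not real Shipotle code): an expensive fixture rebuilt before every test even though the tests never mutate it.

```typescript
// Hypothetical stand-ins for several seconds of database seeding/cleanup.
async function seedLargeShipmentFixture(): Promise<void> {
  // imagine many slow inserts here
}
async function truncateTables(): Promise<void> {
  // imagine cleanup of the tables the tests touched
}

describe("shipment pricing", () => {
  // Slow anti-pattern: re-seed before every single test.
  // beforeEach(async () => { await seedLargeShipmentFixture(); });

  // Faster: seed once per file and clean up once at the end, since the
  // tests only read from the fixture.
  beforeAll(async () => {
    await seedLargeShipmentFixture();
  });
  afterAll(async () => {
    await truncateTables();
  });

  it("prices a standard shipment", () => {
    // ...assertions against the shared, read-only fixture
  });
});
```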
Improving Testing Container Usage
Despite our best efforts to tackle the slowest tests, we were not keeping up with the influx of new testing code, especially for the integration tests. We eventually realized that we had never bothered to question the original test running strategy. The integration tests were originally set up to run as a single test process alongside the required Postgres and Redis servers, and this setup had never been revisited. A quick ssh into one of the test containers showed that the container was being underutilized!
After that discovery, we experimented with running multiple isolated test processes via backgrounding, passing each test process its own unique Postgres database and Redis server to maintain isolation. As we tweaked the number of background test processes to run inside each test container, we closely monitored our dashboards to understand if we were causing the CPUs to thrash or if we could push the machine harder. We found our sweet spot to be 5 background test processes (and their corresponding databases) running on a 3 vCPU machine. Before backgrounding, our integration tests were consistently taking 9–10 minutes. With our current setup, the tests take about half as long and sometimes even finish in less than 4 minutes.
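A rough sketch of the backgrounding idea is below, assuming a Jest-style runner that supports sharding; the environment variable names, database naming scheme, and shard flag are illustrative rather than our exact setup.

```typescript
import { spawn } from "child_process";

// Illustrative: run N test processes in one container, each pointed at its
// own Postgres database and Redis DB index so they cannot interfere.
const WORKERS = 5; // the sweet spot we landed on for a 3 vCPU machine

const runs = Array.from({ length: WORKERS }, (_, i) => {
  return new Promise<number>((resolve) => {
    const child = spawn("yarn", ["jest", `--shard=${i + 1}/${WORKERS}`], {
      stdio: "inherit",
      env: {
        ...process.env,
        // Hypothetical variable names: each worker gets isolated state.
        DATABASE_URL: `postgres://localhost:5432/shipotle_test_${i}`,
        REDIS_URL: `redis://localhost:6379/${i}`,
      },
    });
    child.on("exit", (code) => resolve(code ?? 1));
  });
});

// Fail the CI step if any worker fails.
Promise.all(runs).then((codes) => {
  process.exit(codes.every((c) => c === 0) ? 0 : 1);
});
```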
Working in, supporting, and optimizing a large monolithic codebase can be challenging in the best of times, and it can begin to feel like the legacy systems are actually slowing down progress. Although it took time for our team to get familiar with each corner of the monolith and begin to establish broader domain expertise, by digging in so deeply, we were able to uncover simple, high-impact fixes that greatly improved the CI pipeline.
Through this work, we discovered three key takeaways:
- Observability and transparency are critical when pushing forward a difficult project
- Sometimes it’s the smallest changes that make the biggest impact, but only by knowing the codebase intimately could we root them out
- Perseverance and a little out-of-the-box thinking can be key to uncovering new solutions
Hopefully hearing more about our process has been helpful and you can apply some of these tricks to your CI pipeline as well!
Source: https://medium.com/convoy-tech/improving-continuous-integration-for-our-monolith-27a5eaf82ae