Managing hundreds of servers for load testing: autoscaling, custom monitoring, DevOps culture

In the previous article, I talked about our load testing infrastructure. On average, we use about 100 servers to create the load and about 150 servers to run our service. All these servers need to be created, configured, started, and deleted. To reduce the amount of manual work, we use the same tools as in the production environment:
  • Terraform scripts for creating and deleting a test environment;
  • Ansible scripts for configuring, updating, and starting servers;
  • In-house Python scripts for dynamic scaling, depending on the load.

Thanks to the Terraform and Ansible scripts, all operations ranging from creating instances to starting servers are performed with only six commands:

#launch the required instances in the AWS console
ansible-playbook deploy-config.yml #update servers versions
ansible-playbook start-application.yml #start our app on these servers
ansible-playbook update-test-scenario.yml --ask-vault-pass #update the JMeter test scenario if it was changed
infrastructure-aws-cluster/jmeter_clients:~# terraform apply #create JMeter servers for creating the load
ansible-playbook start-jmeter-server-cluster.yml #start the JMeter cluster
ansible-playbook start-stress-test.yml #start the test
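The six commands above could be chained by a small wrapper so that a failed step stops the run. The following is only a minimal sketch, not the team's actual tooling: `STEPS`, `run_steps`, and the `runner` parameter are hypothetical names, the working directory for `terraform apply` is read off the shell prompt in the listing, and the fifth command is assumed to be `ansible-playbook`.

```python
import shlex
import subprocess
from typing import Callable, Optional, Sequence, Tuple

# (command, working directory) pairs in the order given in the article.
# Assumption: the terraform step runs inside the directory shown in the
# shell prompt of the original listing, taken here as a relative path.
STEPS: Sequence[Tuple[str, Optional[str]]] = (
    ("ansible-playbook deploy-config.yml", None),
    ("ansible-playbook start-application.yml", None),
    ("ansible-playbook update-test-scenario.yml --ask-vault-pass", None),
    ("terraform apply", "infrastructure-aws-cluster/jmeter_clients"),
    ("ansible-playbook start-jmeter-server-cluster.yml", None),
    ("ansible-playbook start-stress-test.yml", None),
)


def run_steps(
    steps: Sequence[Tuple[str, Optional[str]]] = STEPS,
    runner: Callable = subprocess.run,
) -> int:
    """Run the steps in order and stop at the first non-zero exit code.

    Returns the number of steps that completed successfully.
    """
    completed = 0
    for command, cwd in steps:
        # shlex.split turns the command line into an argument list,
        # so no shell is involved.
        result = runner(shlex.split(command), cwd=cwd)
        if result.returncode != 0:
            break
        completed += 1
    return completed
```

Calling `run_steps()` with the default `runner` would execute the real tools; passing a stub as `runner` makes the ordering and stop-on-failure logic checkable without touching any infrastructure.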

Dynamic server scaling

We have more than a hundred thousand simultaneously active online users during peak hours. There is no point in keeping the full number of servers running all the time, so we set up autoscaling for the board servers, which handle the requests made when a user opens a whiteboard, and for the API servers, which handle all other API requests. Servers are now created and deleted when needed.

This mechanism is handy for load testing: we can keep just the minimum required number of servers by default, and when we run the test, this number increases automatically. We may have four board servers at the start and up to forty at the peak. Additionally, new servers are not created immediately, but only after the current servers are loaded enough to trigger a scaling rule. An example rule for creating new instances could be reaching 50 percent CPU usage. This lets us avoid slowing down the growth of virtual users in a test scenario and avoid creating unnecessary servers.
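A threshold rule of this kind can be sketched in a few lines of Python. This is an illustration only, not the in-house scaling script: the function name, the step size of two servers, and the 20 percent scale-in threshold are assumptions, while the 50 percent scale-out threshold and the range of four to forty servers come from the text.

```python
def desired_servers(
    current: int,
    avg_cpu_percent: float,
    scale_out_at: float = 50.0,   # from the article's example rule
    scale_in_at: float = 20.0,    # assumption, not stated in the article
    step: int = 2,                # assumption: servers added/removed per decision
    min_servers: int = 4,
    max_servers: int = 40,
) -> int:
    """Return the server count a simple step-scaling rule would ask for."""
    if avg_cpu_percent >= scale_out_at:
        target = current + step
    elif avg_cpu_percent <= scale_in_at:
        target = current - step
    else:
        target = current
    # Never go below the idle minimum or above the peak maximum.
    return max(min_servers, min(max_servers, target))
```

For instance, four servers at 60 percent average CPU would grow to six, while four servers at 30 percent would stay at four, so capacity follows the virtual-user ramp instead of being provisioned up front.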

An additional advantage of this approach is that thanks to dynamic scaling, we learn how much capacity we need for different numbers of users that we haven’t yet seen in the production environment.

Collecting production-like metrics

There are many tools and approaches for monitoring load tests, but we went our own way.

We monitor the production environment using a standard technology stack: Logstash, Elasticsearch, Kibana, Prometheus, and Grafana. Our testing cluster is similar to the production cluster, so we decided to make the monitoring the same as in the production environment, with the same metrics. There are two reasons for that:

  • There’s no need to build a monitoring system from scratch; we already have a complete system;
  • We additionally test the production monitoring itself: if, while monitoring the test environment, we conclude that we do not have enough data to analyze a problem, we will not have enough data when that problem occurs in the production environment either.
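One way to picture "the same metrics in both environments" is to issue an identical query against each environment's Prometheus. The sketch below only builds the request URLs for the Prometheus HTTP API endpoint `/api/v1/query_range`; the host names, the helper names, and the PromQL expression (which assumes node_exporter metrics are collected) are hypothetical and not taken from the article.

```python
from typing import Dict, Union
from urllib.parse import urlencode

# Hypothetical Prometheus base URLs for the two environments.
ENVIRONMENTS: Dict[str, str] = {
    "production": "http://prometheus.prod.example",
    "loadtest": "http://prometheus.loadtest.example",
}

# Assumed example query: average non-idle CPU rate over one minute.
CPU_QUERY = 'avg(rate(node_cpu_seconds_total{mode!="idle"}[1m]))'


def range_query_url(
    base_url: str,
    promql: str,
    start: Union[int, str],
    end: Union[int, str],
    step: str = "15s",
) -> str:
    """Build a URL for Prometheus' range-query endpoint."""
    params = urlencode({"query": promql, "start": start, "end": end, "step": step})
    return f"{base_url.rstrip('/')}/api/v1/query_range?{params}"


def same_query_everywhere(promql: str, start: int, end: int) -> Dict[str, str]:
    """Return one URL per environment, all carrying the identical query."""
    return {
        name: range_query_url(base, promql, start, end)
        for name, base in ENVIRONMENTS.items()
    }
```

Because the query string is shared, a dashboard or script that works against the test cluster needs only a different base URL to work against production.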

What we include in the reports

  • Technical specification of the test bench;
  • Test scenario in a human-readable format;
  • A result that is understandable by both developers and managers;
  • General condition charts;
  • Charts that show a bottleneck or whatever was affected by the optimization under test.

It is crucial to store all results in one place. This way, they can be easily compared with each other from test run to test run.
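To illustrate why a single store with a uniform shape makes run-to-run comparison easy, here is a minimal sketch. The `LoadTestReport` fields loosely mirror the list above, but the class, the field names, and the metric names are hypothetical and do not describe the actual report format.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class LoadTestReport:
    run_id: str
    environment_spec: str            # technical specification of the test bench
    scenario: str                    # human-readable test scenario
    summary: str                     # result readable by developers and managers
    metrics: Dict[str, float] = field(default_factory=dict)


def compare(baseline: LoadTestReport, candidate: LoadTestReport) -> Dict[str, float]:
    """Percent change for every metric present in both reports.

    Metrics with a zero baseline are skipped to avoid dividing by zero.
    """
    changes: Dict[str, float] = {}
    for name, old in baseline.metrics.items():
        if name in candidate.metrics and old != 0:
            new = candidate.metrics[name]
            changes[name] = round((new - old) / old * 100, 1)
    return changes
```

With reports shaped like this, checking whether an optimization helped reduces to calling `compare` on two stored runs.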

We create reports in our product automatically by using our public API.

Infrastructure as code (IaC)

In our case, product quality is the responsibility not of QA engineers but of the entire team. Load tests are just one of the quality assurance tools. It’s great if the team understands that it is important to test new changes under load. To start thinking about it, the team has to take responsibility for the production environment’s well-being. Here we are helped by the principles of DevOps culture, which we have started to implement in our work.

But starting to think about conducting load tests is only the first step. The team will not be able to create thorough test cases without understanding the structure of the production environment. We encountered this problem when we began to set up the process of conducting load tests in teams. At that time, the teams had no means of understanding the production environment, so it was difficult for them to work on the design of the tests. There were two reasons for that: the lack of up-to-date documentation, or of anybody who keeps the whole schema of the production environment in their head, and the dramatic increase in the development team’s size.

The Infrastructure-as-Code approach, which we now use in the development team, can help the team to understand the production environment.

What we have already achieved using that approach:

  • Everything is automated and ready to be launched at any moment. This significantly reduces the recovery time in case of a failure in the data center and allows us to keep the right number of relevant test environments;
  • Reasonable savings: when we can, we deploy environments using OpenStack to replace expensive platforms like AWS;
  • Teams create load tests on their own because they understand the production environment;
  • Code replaces the documentation, so there is no need to update the documentation continually; it is always complete and up-to-date;
  • No need for a dedicated, narrowly specialized expert to do ordinary tasks. Any engineer can figure out the code;
  • Having a clear production environment structure makes it much easier to plan investigative load tests, like chaos monkey testing or long memory leak tests.

We want to extend this approach beyond creating the infrastructure to support various tools. For example, we have successfully converted the database test that I talked about in the previous article to code. Thanks to this, instead of a pre-provisioned environment, we have a set of scripts that we can use to create, in seven minutes, a fully configured environment in an empty AWS account and start the test. For the same reason, we are now looking closely at Gatling, which is positioned by its authors as a “Load test as code” tool.

Infrastructure-as-Code means that the infrastructure itself, and every new script the team writes to create infrastructure for new features, has to be tested like any other code. Tests must cover all of it. There are various testing frameworks for this, such as Molecule. There are also tools for chaos monkey testing: paid tools for AWS, Pumba for Docker, etc. They will allow us to solve different types of problems:

  • How can we check, in case one of the AWS instances fails, that the load is rebalanced among the remaining instances and that the service survives this sudden redirection of requests?
  • How can we simulate slow network connections, disconnects, and other technical problems that should not break the logic of the service’s infrastructure?
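Before terminating a real instance, the first question can be estimated on paper. The sketch below is a back-of-the-envelope calculation, not a chaos-testing tool: it assumes the balancer spreads requests evenly across healthy instances, and the function name, parameters, and the per-instance capacity figure are all hypothetical.

```python
from typing import Tuple


def load_after_failure(
    requests_per_second: float,
    instances: int,
    failed: int,
    capacity_per_instance: float,
) -> Tuple[float, bool]:
    """Return (load per surviving instance, whether it fits within capacity).

    Assumes requests are redistributed evenly among the healthy instances.
    """
    healthy = instances - failed
    if healthy <= 0:
        raise ValueError("no healthy instances left to take the load")
    per_instance = requests_per_second / healthy
    return per_instance, per_instance <= capacity_per_instance
```

For example, with 9,000 requests per second on ten instances that each handle up to 1,200, losing one instance leaves 1,000 per survivor, which fits; losing three does not. A real chaos test would then verify whether the service behaves the way this estimate predicts.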

We plan to solve these problems soon.

Conclusions

  • Do not waste your time on manual infrastructure orchestration. It is better to automate these actions so that all environments, including the production environment, are controlled more reliably;
  • Dynamic scaling significantly reduces the cost of maintaining the production and large test environments, and it also reduces the effect of human error on scaling;
  • You don’t need a separate monitoring system for tests. Instead, use the existing system from the production environment;
  • Load test reports must be automatically collected in one place and have a uniform look. This will allow you to compare them and analyze the changes quickly;
  • Load testing becomes a normal process in the company once teams start to feel responsible for the well-being of the production environment;
  • Load tests are infrastructure tests. If a load test finishes successfully, it may simply have been written incorrectly. To validate the correctness of the test, you need a thorough understanding of the production environment. The teams should have the means to understand the production environment by themselves. We solve this problem using the IaC approach;
  • Scripts that create the infrastructure also require testing, like any other code.

P.S.: This article was first published on Medium.

Translated from: https://habr.com/en/company/miro/blog/500652/
