Customer Notes: Diagnosing issues under load of a Web API app migrated to ASP.NET Core on Linux

When the engineers on the ASP.NET/.NET Core team talk to real customers about actual production problems they have, interesting stuff comes up. I've tried to capture a real customer interaction here without giving away their name or details.

The team recently had the opportunity to help a large customer of .NET investigate performance issues they’ve been having with a newly-ported ASP.NET Core 2.1 app when under load. The customer's developers are experienced with ASP.NET on Windows but in this case they needed help getting started with performance investigations with ASP.NET Core in Linux containers.

As with many performance investigations, there were a variety of issues contributing to the slowdowns, but the largest contributors were time spent garbage collecting (due to unnecessary large object allocations) and blocking calls that could be made asynchronous.

After resolving the technical and architectural issues detailed below, the customer's Web API went from only being able to handle several hundred concurrent users during load testing to being able to easily handle 3,000 and they are now running the new ASP.NET Core version of their backend web API in production.

Problem Statement

The customer recently migrated their .NET Framework 4.x ASP.NET-based backend Web API to ASP.NET Core 2.1. The migration was broad in scope and included a variety of tech changes.

Their previous Web API (we'll call it version 1) ran as an ASP.NET application (targeting .NET Framework 4.7.1) under IIS on Windows Server and used SQL Server databases (via Entity Framework) to persist data. The new (2.0) version of the application runs as an ASP.NET Core 2.1 app in Linux Docker containers with PostgreSQL backend databases (via Entity Framework Core). They used Nginx to load balance between multiple containers on a server and HAProxy load balancers between their two main servers. The Docker containers are managed manually or via Ansible integration for CI/CD (using Bamboo).

Although the new Web API worked well functionally, load tests began failing with only a few hundred concurrent users. Based on current user load and projected growth, they wanted the web API to support at least 2,000 concurrent users. Load testing was done using Visual Studio Team Services load tests running a combination of web tests mimicking users logging in, doing the stuff of their business, activating tasks in their application, as well as pings that the Mobile App's client makes regularly to check for backend connectivity. This customer also uses New Relic for application telemetry and, until recently, New Relic agents did not work with .NET Core 2.1. Because of this, there was unfortunately no app diagnostic information to help pinpoint sources of slowdowns.

Lessons Learned

Cross-Platform Investigations

One of the most interesting takeaways for me was not the specific performance issues encountered but, instead, the challenges this customer had working in a Linux environment. The team's developers are experienced with ASP.NET on Windows and comfortable debugging in Visual Studio. Despite this, the move to Linux containers has been challenging for them.

Because the engineers were unfamiliar with Linux, they hired a consultant to help deploy their Docker containers on Linux servers. This model worked to get the site deployed and running, but became a problem when the main backend began exhibiting performance issues. The performance problems only manifested themselves under a fairly heavy load, such that they could not be reproduced on a dev machine. Up until this investigation, the developers had never debugged on Linux or inside of a Docker container except when launching in a local container from Visual Studio with F5. They had no idea how to even begin diagnosing issues that only reproduced in their staging or production environments. Similarly, their dev-ops consultant was knowledgeable about Linux infrastructure but not familiar with application debugging or profiling tools like Visual Studio.

The ASP.NET team has some documentation on using PerfCollect and PerfView to gather cross-platform diagnostics, but the customer's devs did not manage to find these docs until they were pointed out. Once an ASP.NET Core team engineer spent a morning showing them how to use PerfCollect, LLDB, and other cross-platform debugging and performance profiling tools, they were able to make some serious headway debugging on their own. We want to make sure everyone can debug .NET Core on Linux with LLDB/SOS or remotely with Visual Studio as easily as possible.

The ASP.NET Core team now believes they need more documentation on how to diagnose issues in non-Windows environments (including Docker) and the documentation that already exists needs to be more discoverable. Important topics to make discoverable include PerfCollect, PerfView, debugging on Linux using LLDB and SOS, and possibly remote debugging with Visual Studio over SSH.

Issues in Web API Code

Once we gathered diagnostics, most of the perf issues ended up being common problems in the customer’s code. 

  1. The largest contributor to the app’s slowdown was frequent Generation 2 (Gen 2) GCs (Garbage Collections), which were happening because a commonly-used code path was downloading a lot of images (product images), converting those bytes into base64 strings, responding to the client with those strings, and then discarding the byte[] and string. The images were fairly large (>100 KB), so every time one was downloaded, a large byte[] and string had to be allocated. Because many of the images were shared between multiple clients, we solved the issue by caching the base64 strings for a short period of time (using IMemoryCache); see the first sketch after this list.

  2. HttpClient Pooling with HttpClientFactory

    1. When calling out to Web APIs there was a pattern of creating new HttpClient instances rather than using IHttpClientFactory to pool the clients.

    2. Despite implementing IDisposable, it is not a best practice to dispose HttpClient instances as soon as they’re out of scope, as they will leave their socket connection in a TIME_WAIT state for some time after being disposed. Instead, HttpClient instances should be re-used (see the IHttpClientFactory sketch after this list).

  3. Additional investigation showed that much of the application’s time was spent querying PostgreSQL for data (as is common). There were several underlying issues here.

    1. Database queries were being made in a blocking way instead of being asynchronous. We helped address the most common call sites and pointed the customer at the AsyncUsageAnalyzer to identify other async cleanup that could help (see the async query sketch after this list).

    2. Database connection pooling was not enabled. It is enabled by default for SQL Server, but not for PostgreSQL.

      1. We re-enabled database connection pooling. It was necessary to have different pooling settings for the common database (used by all requests) and the individual shard databases, which are used less frequently. While the common database needs a large pool, the shard connection pools need to be small to avoid having too many open, idle connections (see the connection string sketch after this list).

    3. The Web API had a fairly ‘chatty’ interface with the database and made a lot of small queries. We re-worked this interface to make fewer calls (by querying more data at once or by caching for short periods of time).

  4. There was also some impact from having other background worker containers on the web API’s servers consuming large amounts of CPU. This led to a ‘noisy neighbor’ problem where the web API containers didn’t have enough CPU time for their work. We showed the customer how to address this with Docker resource constraints.
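A few sketches of these fixes follow; the type names, endpoints, and settings in them are illustrative assumptions, not the customer's actual code. First, for item 1, a minimal sketch of caching the base64-encoded image strings with IMemoryCache so the large byte[] and string allocations aren't repeated for every request (the IImageDownloader abstraction, cache key format, and five-minute expiration are hypothetical):

```csharp
using System;
using System.Threading.Tasks;
using Microsoft.Extensions.Caching.Memory;

// Hypothetical abstraction over however the product images are actually fetched.
public interface IImageDownloader
{
    Task<byte[]> DownloadAsync(string imageId);
}

public class CachedImageService
{
    private readonly IMemoryCache _cache;          // registered via services.AddMemoryCache()
    private readonly IImageDownloader _downloader;

    public CachedImageService(IMemoryCache cache, IImageDownloader downloader)
    {
        _cache = cache;
        _downloader = downloader;
    }

    public async Task<string> GetImageBase64Async(string imageId)
    {
        // Cache the (large) base64 string briefly so repeated requests for the same
        // shared image don't allocate a new big byte[] and string every time.
        return await _cache.GetOrCreateAsync($"img:{imageId}", async entry =>
        {
            entry.AbsoluteExpirationRelativeToNow = TimeSpan.FromMinutes(5); // assumed TTL
            byte[] bytes = await _downloader.DownloadAsync(imageId);
            return Convert.ToBase64String(bytes);
        });
    }
}
```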

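For item 2, a sketch of pooling HttpClient instances through IHttpClientFactory (which ships with ASP.NET Core 2.1) instead of newing up and disposing a client per call; the "catalog" client name, base address, and endpoint are placeholders:

```csharp
using System;
using System.Net.Http;
using System.Threading.Tasks;
using Microsoft.Extensions.DependencyInjection;

public static class HttpClientRegistration
{
    public static void ConfigureServices(IServiceCollection services)
    {
        // Register a named client once; the factory pools and recycles the underlying
        // handlers so sockets aren't repeatedly left behind in TIME_WAIT.
        services.AddHttpClient("catalog", client =>
        {
            client.BaseAddress = new Uri("https://internal-api.example.com/"); // placeholder
        });
    }
}

public class CatalogClient
{
    private readonly IHttpClientFactory _factory;

    public CatalogClient(IHttpClientFactory factory) => _factory = factory;

    public async Task<string> GetProductAsync(string id)
    {
        // CreateClient is cheap and returns a client backed by a pooled handler;
        // don't create (or dispose) HttpClient instances per request.
        HttpClient client = _factory.CreateClient("catalog");
        return await client.GetStringAsync($"products/{id}"); // hypothetical endpoint
    }
}
```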
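For item 3.1, a sketch of turning a blocking Entity Framework Core query into its asynchronous equivalent; AppDbContext and Product are hypothetical stand-ins for the customer's model:

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;
using Microsoft.EntityFrameworkCore;

public class Product
{
    public int Id { get; set; }
    public bool IsActive { get; set; }
}

public class AppDbContext : DbContext
{
    public AppDbContext(DbContextOptions<AppDbContext> options) : base(options) { }
    public DbSet<Product> Products { get; set; }
}

public class ProductRepository
{
    private readonly AppDbContext _db;

    public ProductRepository(AppDbContext db) => _db = db;

    // Before: blocks a thread-pool thread for the whole round trip to PostgreSQL.
    public List<Product> GetActiveProducts()
        => _db.Products.Where(p => p.IsActive).ToList();

    // After: the thread is released back to the pool while the database does the work.
    public Task<List<Product>> GetActiveProductsAsync()
        => _db.Products.Where(p => p.IsActive).ToListAsync();
}
```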

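For item 3.2, a sketch of Npgsql connection strings with pooling turned on, sized differently for the common database (used by every request) and the less-frequently-used shard databases. Hosts, credentials, and pool sizes are illustrative only, and AppDbContext is the hypothetical context from the previous sketch:

```csharp
using Microsoft.EntityFrameworkCore;
using Microsoft.Extensions.DependencyInjection;

public static class DatabaseRegistration
{
    // Illustrative connection string; host, credentials, and pool sizes are placeholders.
    private const string CommonDbConnection =
        "Host=common-db;Database=app;Username=app;Password=secret;" +
        "Pooling=true;Minimum Pool Size=10;Maximum Pool Size=100";

    // Used when opening a connection to an individual shard; the pool is kept small
    // so rarely-used shards don't accumulate idle connections.
    private const string ShardDbConnectionTemplate =
        "Host=shard-{0};Database=app;Username=app;Password=secret;" +
        "Pooling=true;Minimum Pool Size=0;Maximum Pool Size=5";

    public static void ConfigureServices(IServiceCollection services)
    {
        // The common database serves every request, so it gets the large pool.
        services.AddDbContext<AppDbContext>(options =>
            options.UseNpgsql(CommonDbConnection));
    }
}
```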
Wrap Up

As shown in the graph below, at the end of our performance tuning, their backend was easily able to handle 3,000 concurrent users and they are now using their ASP.NET Core solution in production. The performance issues they saw overlapped a lot with those we’ve seen from other customers (especially the need for caching and for async calls), but proved to be extra challenging for the developers to diagnose due to the lack of familiarity with Linux and Docker environments.

[Load test charts: Performance and Errors, and Throughput and Tests, all trending up and to the right]

Some key areas of focus uncovered by this investigation were:

  • Being mindful of memory allocations to minimize GC pause times

  • Keeping long-running calls non-blocking/asynchronous

  • Minimizing calls to external resources (such as other web services or the database) with caching and grouping of requests

Hope you find this useful! Big thanks to Mike Rousos from the ASP.NET Core team for his work and analysis!

Translated from: https://www.hanselman.com/blog/customer-notes-diagnosing-issues-under-load-of-web-api-app-migrated-to-aspnet-core-on-linux
