Challenges of running gRPC services in production

There are several ways to make services communicate, which generally involve a transport layer. Our applications often rely on it to provide several abstractions and features, such as load balancing, retries and high availability.

However, when running a service in production, we get more network-related errors than we’d like. This post intends to show how we mitigated these errors while using gRPC for service-to-service communication.

Why gRPC?

Back in 2016, almost every service at InLoco made use of the HTTP1.1/JSON stack for communication. It worked well for a long time, but, as the company grew, some high-traffic services started requiring a more efficient way of communicating with internal clients.

Documentation of JSON APIs was also cumbersome to maintain, as they were not bound to the code itself, meaning that someone could deploy code that changes the API without changing the documentation appropriately.

In the search for a good alternative, we looked into gRPC, which solves the performance and schema definition issues described above, with the following features:

  • The API surface is defined directly in the protobuf files, where each method describes its own request/response types
  • Both client and server code can be auto-generated in many languages
  • It uses HTTP/2 in combination with Protobuf, which are both binary protocols, resulting in a more compact request/response payload
  • HTTP/2 also uses persistent connections, removing the need to constantly create/close connections, as HTTP/1.1 does.

But running gRPC services also provided us with some challenges, mostly due to the fact that HTTP/2 uses persistent connections.

Challenges of gRPC in production

We are heavy users of Kubernetes, and as such, our gRPC services are running on Kubernetes clusters, on Amazon EKS.

One of the challenges we faced was ensuring load balancing on our servers. As the number of servers changes dynamically due to autoscaling, the clients must be able to make use of new servers and drop connections to the ones that are no longer available, while ensuring that requests stay well-balanced across them according to some load balancing policy.

Load balancing

There are some solutions for this problem, as stated in the gRPC blog, including proxy load balancing and client-side load balancing. In the following sections, we explain the approaches we implemented, in chronological order.

Approach 1: Proxy Load Balancer with Linkerd 1.x

The first approach we implemented used a proxy load balancer, namely Linkerd 1.x, as Figure 1 shows. This solution worked well for some time, solving the load balancing issue from the server's perspective, but the client-to-proxy load was still unbalanced, meaning that some Linkerd instances handled a larger amount of requests than others.

Unbalanced traffic on the client-to-proxy link later proved to be problematic. Overloaded proxy instances could add too much latency, or even run out of memory sometimes, becoming increasingly hard to manage.

In addition, this solution proved to add considerable overhead (as it requires an additional network hop) and consumed a considerable amount of resources in our Kubernetes cluster, since we deployed Linkerd as a DaemonSet, meaning that a Linkerd pod runs on every worker node in the cluster.

Approach 2: Thick gRPC client

To tackle the issues with the first approach, we eliminated the proxy layer and moved the responsibility for load balancing into the client code, which we own.

To handle load balancing in the clients, we used grpc-go's naming.NewDNSResolverWithFreq(time.Duration) in combination with Kubernetes' headless services (to handle discovery of server pods). In this solution, the clients refresh the pool of hosts they can connect to by polling the target service's DNS every few seconds.
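
As a rough sketch, the client-side wiring with that (since-deprecated, and later removed) grpc-go API looked something like this; the target address and polling interval are illustrative:

```go
package main

import (
	"log"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/naming"
)

func main() {
	// Poll the headless service's DNS every 30 seconds to refresh the host pool.
	// (The interval and target address are illustrative.)
	resolver, err := naming.NewDNSResolverWithFreq(30 * time.Second)
	if err != nil {
		log.Fatalf("failed to create DNS resolver: %v", err)
	}

	// Round-robin across the A records returned by the headless service.
	// Note: WithBalancer and the naming package are deprecated and have been
	// removed from recent grpc-go releases.
	conn, err := grpc.Dial(
		"target-service.default.svc.cluster.local:50051",
		grpc.WithInsecure(),
		grpc.WithBalancer(grpc.RoundRobin(resolver)),
	)
	if err != nil {
		log.Fatalf("failed to dial: %v", err)
	}
	defer conn.Close()
	// conn can now be used to build the generated gRPC client stubs.
}
```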

This caused clients to connect directly to the server’s pods, which reduced our latency when compared to the proxy load balancer approach. The following diagram shows the components involved in this approach.

Figure 2: Thick client approach

However, dynamic service discovery using DNS is being deprecated by the Go gRPC implementation in favor of other protocols such as xDS. Not only that: in other languages, it had never been implemented in the first place.

We learned that, although this approach offers stable and highly performant communication, relying on implementations in client code can be brittle and hard to manage due to the diversity of gRPC implementations. This point holds true for other features, like rate limiting and authorization.

After trying these different approaches, we identified that we needed a generic, low-overhead, language-agnostic way to enable service discovery and load balancing.

Approach 3: Sidecar proxy with Envoy

After some research on the topic, we chose to use the sidecar pattern, adding another container to the client pod that handles service discovery and load balancing and provides some observability into our connections. We chose Envoy for its high performance and deployment simplicity.

Figure 3: Sidecar proxy approach

In this approach, the client containers connect to the Envoy sidecar, which maintains connections to the target service.

Using this approach, we got what we were seeking:

  • Low latency, as Envoy's overhead is minimal when compared to Linkerd 1.x
  • No additional code in the clients
  • Observability, as Envoy exports metrics in Prometheus format
  • Ability to enrich the network layer, as Envoy supports features like authorization and rate limiting

Service discovery and graceful shutdown

With proper load balancing configured, we still need a way for Envoy to discover new targets and update its pool of hosts.

There are a couple of options for service discovery with Envoy, such as DNS and EDS (based on xDS). For the sake of simplicity and familiarity, we chose to use DNS.
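
For illustration, a DNS-discovered upstream cluster in Envoy can be configured roughly as follows; the service hostname, port and timeout are illustrative, and exact field names vary between Envoy versions:

```yaml
static_resources:
  clusters:
    - name: target_service
      # Resolve the service hostname via DNS and keep the host list in sync
      # with the A records that come back.
      type: STRICT_DNS
      connect_timeout: 1s
      # Honor the TTL of the DNS records instead of a fixed refresh interval.
      respect_dns_ttl: true
      lb_policy: ROUND_ROBIN
      # gRPC requires HTTP/2 upstream.
      http2_protocol_options: {}
      load_assignment:
        cluster_name: target_service
        endpoints:
          - lb_endpoints:
              - endpoint:
                  address:
                    socket_address:
                      address: target-service.example.internal
                      port_value: 50051
```

With respect_dns_ttl, Envoy's refresh behavior is tied to the TTL we set on the DNS records, which matters for the shutdown flow described below.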

Integrating DNS service discovery in Kubernetes is quite straightforward, as we use external-dns, which lets us specify the hostname and DNS TTL directly on our Kubernetes Service, as follows:
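
A minimal sketch of such a Service (the hostname, TTL value, port and headless clusterIP are illustrative, not our exact configuration; the two external-dns annotations are the relevant part):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: target-service
  annotations:
    # external-dns creates a DNS record with this hostname...
    external-dns.alpha.kubernetes.io/hostname: target-service.example.internal
    # ...and this TTL (in seconds), which bounds how long clients cache it.
    external-dns.alpha.kubernetes.io/ttl: "5"
spec:
  # Headless, so the record resolves to the individual pod IPs.
  clusterIP: None
  selector:
    app: target-service
  ports:
    - name: grpc
      port: 50051
      targetPort: 50051
```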

A hidden complexity of using DNS as our service discovery mechanism is that it takes some time to propagate. So we need to give our gRPC clients some leeway to update their host lists before a terminating backend really stops receiving connections. Using DNS, the graceful shutdown flow is a bit trickier, as DNS records have a TTL associated with them, meaning Envoy caches the hosts for this period.

The following diagram shows a basic flow which ends with a failed request:

Figure 4: Terminating host makes request fail due to DNS caching

In this scenario, the second client request fails, as the server pod was no longer available, while the Envoy cache still had its IP.

To solve this issue, we must look at how Kubernetes handles pod termination, which is described in detail here. It consists of two steps running at the same time: the pod is removed from the Kubernetes Service endpoints (in our case, this also makes external-dns remove the pod's IP from the list of DNS records), and the container is sent a TERM signal, starting the graceful shutdown process.

To solve the terminating host issue, we used Kubernetes’ pre-stop hooks to prevent an immediate TERM signal from being sent to the pod, as follows:

Figure 5: preStop hook
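
A minimal sketch of what such a lifecycle block looks like on the server's pod spec; the container details and sleep duration are illustrative, and the sleep should cover the DNS TTL plus propagation time:

```yaml
spec:
  containers:
    - name: grpc-server
      image: example/grpc-server:latest  # illustrative
      lifecycle:
        preStop:
          exec:
            # Delay the TERM signal so DNS records expire and clients refresh
            # their host pools before the server starts shutting down.
            command: ["/bin/sh", "-c", "sleep 30"]
```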

With the preStop hook configured, our flow now looks like the following:

Figure 6: Successful server pod shutdown flow

With this solution, we give Envoy enough time to expire its DNS cache and perform a new DNS lookup, which no longer includes the dead pod's IP.

Future Improvements

While using Envoy brought us a lot of performance improvements and overall simplicity, DNS service discovery is still not ideal. It is not as robust, since it is based on polling: the clients are responsible for refreshing the pool of hosts when the TTL expires.

A more robust way is to use Envoy’s EDS, which is more flexible, adding capabilities such as canary deployments and more sophisticated load balancing strategies, but we still need some time to evaluate this approach and validate it in a production environment.

Translated from: https://medium.com/inlocotech/challenges-of-running-grpc-services-in-production-b3a113df2542
