Deploy Kubernetes gRPC Workloads With Zero Downtime

So you started down the path of using Kubernetes and everything has been a dream! You were able to deploy workloads with immense speed and efficiency and you were able to integrate your various services with each other with ease. Then you realized that requests were failing every time you deployed an update to one of your workloads during the deploy window. Zero downtime deploys can be achieved in Kubernetes. There is still hope for the stability of your workloads!

We will start off by defining a simple proto file containing the API which will be exposed by our sample application via gRPC:

syntax = "proto3";

option go_package = ".;example";

service Example {
  rpc Work(WorkItem) returns (WorkResponse) {}
}

message WorkItem {
  string name = 1;
  int32 size = 2;
}

message WorkResponse {
  string name = 1;
}

The API is defined as having one rpc (or method) called Work. This API will be used to send simulated work to be performed by our server.

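If you want to generate the Go bindings that are used later (example.RegisterExampleServer, example.NewExampleClient), an invocation along these lines should work with the protoc-gen-go plugin that was current for Go 1.14; the file name example.proto is an assumption:

protoc --go_out=plugins=grpc:. example.proto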

Now we will write some Go code to implement the gRPC server interface:

type exampleService struct {
	l               *zap.Logger
	requestDuration time.Duration
}

// Work implements example.ExampleServer
func (s *exampleService) Work(_ context.Context, req *example.WorkItem) (*example.WorkResponse, error) {
	s.l.Info("Work invoked", zap.String("name", req.Name), zap.Int("size", int(req.Size)))

	// sleep based upon the specified request size to simulate slow requests
	time.Sleep(time.Duration(req.Size) * time.Second)

	return &example.WorkResponse{Name: req.Name}, nil
}

The Work method is implemented to sleep for a period of seconds based upon the incoming request’s size parameter. This will be used later to simulate requests that take a long time to be processed.

Now we will write a func main() to define the startup of the application with a gRPC server:

package main

import (
	"context"
	"flag"
	"fmt"
	"net"
	"os"
	"os/signal"
	"syscall"
	"time"

	example "github.com/jwenz723/podlifecycle/server/proto"
	"github.com/oklog/run"
	"go.uber.org/zap"
)

func main() {
	grpcAddr := flag.String("grpcAddr", ":8080", "address to expose grpc on")
	flag.Parse()

	logger, _ := zap.NewDevelopment()
	defer logger.Sync()

	var g run.Group
	{
		lis, err := net.Listen("tcp", *grpcAddr)
		if err != nil {
			// exit immediately if the listener cannot be created
			logger.Fatal("failed to start grpc listener", zap.Error(err))
		}
		service := exampleService{
			l: logger,
		}
		grpcServer := NewGRPCServerFromListener(lis)
		g.Add(func() error {
			example.RegisterExampleServer(grpcServer.Server(), &service)
			logger.Info("starting grpc server...", zap.String("addr", *grpcAddr))
			return grpcServer.Start()
		}, func(err error) {
			logger.Info("shutting down grpc server...")
			grpcServer.Stop()
			lis.Close()
		})
	}
	{
		// This function just sits and waits for ctrl-C.
		cancelInterrupt := make(chan struct{})
		g.Add(func() error {
			c := make(chan os.Signal, 1)
			signal.Notify(c, syscall.SIGINT, syscall.SIGTERM)
			select {
			case sig := <-c:
				logger.Info("received signal", zap.String("signal", sig.String()))
				return fmt.Errorf("received signal %s", sig)
			case <-cancelInterrupt:
				logger.Info("cancel interrupt")
				return nil
			}
		}, func(error) {
			close(cancelInterrupt)
		})
	}

	logger.Info("exiting", zap.Error(g.Run()))
}

This code makes use of a run.Group to define multiple long-running components that will be executed in parallel. Each component has a defined startup func and a shutdown func. The run.Group takes care of automatically calling the shutdown func for each component when one of the startup funcs returns an error. Notice that the first component added to the run.Group with a call to g.Add is the gRPC server. The second component is a listener for the OS signals SIGINT and SIGTERM. When either of these signals occurs, an error is returned, which then causes the shutdown func of the gRPC server to be invoked. This allows the application to gracefully shut down the gRPC server and, ideally, cleanly end any open connections.

Now we will define a GRPCServer struct to encapsulate some helpful behavior:

package main

import (
	"net"

	"google.golang.org/grpc"
	"google.golang.org/grpc/health"
	healthpb "google.golang.org/grpc/health/grpc_health_v1"
)

type GRPCServer struct {
	// Listen address for the server specified as hostname:port
	address string

	// Listener for handling network requests
	listener net.Listener

	// GRPC server
	server *grpc.Server

	// Server for gRPC Health Check Protocol.
	healthServer *health.Server
}

// NewGRPCServerFromListener creates a new implementation of a GRPCServer given
// an existing net.Listener instance using default keepalive
func NewGRPCServerFromListener(listener net.Listener) *GRPCServer {
	grpcServer := &GRPCServer{
		address:  listener.Addr().String(),
		listener: listener,
	}

	grpcServer.server = grpc.NewServer()
	grpcServer.healthServer = health.NewServer()
	healthpb.RegisterHealthServer(grpcServer.server, grpcServer.healthServer)

	return grpcServer
}

// Start starts the underlying grpc.Server
func (gServer *GRPCServer) Start() error {
	// if health check is enabled, set the health status for all registered services
	if gServer.healthServer != nil {
		for name := range gServer.server.GetServiceInfo() {
			gServer.healthServer.SetServingStatus(
				name,
				healthpb.HealthCheckResponse_SERVING,
			)
		}
		gServer.healthServer.SetServingStatus(
			"",
			healthpb.HealthCheckResponse_SERVING,
		)
	}
	return gServer.server.Serve(gServer.listener)
}

func (gServer *GRPCServer) Stop() {
	gServer.server.GracefulStop()
}

// Server returns the grpc.Server for the GRPCServer instance
func (gServer *GRPCServer) Server() *grpc.Server {
	return gServer.server
}

The func NewGRPCServerFromListener serves as a constructor for our newly defined struct. This constructor instantiates a new instance of the healthServer. This is a gRPC server that implements the gRPC Health Checking Protocol, which gives us a way to integrate with the automated health checks that Kubernetes can perform later. Both this healthServer and our example gRPC server defined earlier are exposed on the same TCP listener, so the health checks flow through the same networking path as our real business-logic APIs.

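To make the health check concrete, here is a minimal sketch (not part of the original code) of what a client of the gRPC Health Checking Protocol does; the localhost:8080 address is an assumption, and an empty service name queries the overall server status, matching the SetServingStatus("", ...) call in Start():

package main

import (
	"context"
	"log"
	"time"

	"google.golang.org/grpc"
	healthpb "google.golang.org/grpc/health/grpc_health_v1"
)

func main() {
	conn, err := grpc.Dial("localhost:8080", grpc.WithInsecure())
	if err != nil {
		log.Fatalf("did not connect: %v", err)
	}
	defer conn.Close()

	ctx, cancel := context.WithTimeout(context.Background(), time.Second)
	defer cancel()

	// Check the overall serving status of the server; a per-service status
	// could be queried by setting Service to a registered service name.
	resp, err := healthpb.NewHealthClient(conn).Check(ctx, &healthpb.HealthCheckRequest{Service: ""})
	if err != nil {
		log.Fatalf("health check failed: %v", err)
	}
	log.Printf("health status: %s", resp.GetStatus())
}

This is essentially what the grpc-health-probe binary introduced below does for us, so we don't need to ship a custom checker.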

Now that we have an application, we need to build it into a Docker image to be deployed to Kubernetes. We need to make sure that our Docker image has an entrypoint that handles or forwards OS signals. Kubernetes will send a SIGTERM immediately when a pod termination is initiated. After the terminationGracePeriodSeconds has elapsed, a SIGKILL will be sent. There are various ways to configure an entrypoint in Docker; the recommended method is the exec form of ENTRYPOINT (see the last line in the Dockerfile defined below) because it doesn't wrap the command in a shell that might accidentally suppress the signals that are received. For example:

# Accept the Go version for the image to be set as a build argument.
ARG GO_VERSION=1.14

# Execute the build using an alpine linux environment so that it will execute properly in the final environment
FROM golang:${GO_VERSION}-alpine AS builder

ENV CGO_ENABLED=0

# Create the user and group files that will be used in the running container to
# run the process as an unprivileged user.
RUN mkdir /user && \
    echo 'nobody:x:65534:65534:nobody:/:' > /user/passwd && \
    echo 'nobody:x:65534:' > /user/group

# Install the Certificate-Authority certificates for the app to be able to make
# calls to HTTPS endpoints.
# Git is required for fetching the dependencies.
RUN apk add ca-certificates

# Set a directory to contain the go app to be compiled (this directory will work
# for go apps making use of go modules as long as go 1.13+ is used)
WORKDIR /go/src/podlifecycle

# Fetch dependencies first; they are less susceptible to change on every build
# and will therefore be cached for speeding up the next build
COPY go.mod ./
COPY go.sum ./
RUN go mod download

# Import the code from the context.
COPY . .

# Build the executable to `/podlifecycle`. Mark the build as statically linked.
RUN go build \
    -installsuffix 'static' \
    -o /podlifecycle .

# Final stage: the running container.
FROM alpine AS final

# Import the user and group files from the first stage.
COPY --from=builder /user/group /user/passwd /etc/

# Import the Certificate-Authority certificates for enabling HTTPS.
COPY --from=builder /etc/ssl/certs/ca-certificates.crt /etc/ssl/certs/

# Import the compiled executable from the first stage.
COPY --from=builder /podlifecycle /podlifecycle

# Perform any further action as an unprivileged user.
USER nobody:nobody

# Run the compiled binary in a way that can receive OS signals
ENTRYPOINT ["/podlifecycle"]
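For contrast, here is what the shell form would look like; it is shown only to illustrate the point above and is not used in this build, because /bin/sh would become PID 1 and would not forward SIGTERM to the Go binary:

# shell form (avoid): the process runs as a child of /bin/sh -c,
# so the SIGTERM sent by Kubernetes never reaches signal.Notify
ENTRYPOINT /podlifecycle

# exec form (used above): the Go binary is PID 1 and receives SIGTERM directly
ENTRYPOINT ["/podlifecycle"]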

We will also define a second Docker image which will encapsulate the open source grpc-health-probe application. This image will be deployed as a sidecar in our Kubernetes pods and will be used to call the gRPC health check API defined earlier. It is configured to run indefinitely until an OS signal tells it to shut down. The grpc-health-probe will be invoked by Kubernetes as defined later in the Kubernetes manifest. Here is the dockerfile:

ARG GO_VERSION=1.14

FROM golang:${GO_VERSION}-alpine AS builder

ENV CGO_ENABLED=0

# Create the user and group files that will be used in the running container to
# run the process as an unprivileged user.
RUN mkdir /user && \
    echo 'nobody:x:65534:65534:nobody:/:' > /user/passwd && \
    echo 'nobody:x:65534:' > /user/group

# Install the Certificate-Authority certificates for the app to be able to make
# calls to HTTPS endpoints.
# Git is required for fetching the dependencies.
RUN apk add ca-certificates git

RUN go get github.com/grpc-ecosystem/grpc-health-probe

# Final stage: the running container.
FROM alpine AS final

# Import the user and group files from the first stage.
COPY --from=builder /user/group /user/passwd /etc/

# Import the Certificate-Authority certificates for enabling HTTPS.
COPY --from=builder /etc/ssl/certs/ca-certificates.crt /etc/ssl/certs/

COPY --from=builder /go/bin/grpc-health-probe /usr/bin/local/grpc-health-probe

# Perform any further action as an unprivileged user.
USER nobody:nobody

# This will sleep indefinitely until SIGTERM or SIGINT occur https://stackoverflow.com/a/35770783
CMD exec /bin/sh -c "trap : TERM INT; (while true; do sleep 1000; done) & wait"
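Once the pod defined below is running, you can also invoke the probe by hand to confirm the health endpoint is reachable. The pod name here is a placeholder; the container name and binary path match the manifest that follows:

kubectl exec <pod-name> -c grpchealthprobe -- \
  /usr/bin/local/grpc-health-probe -addr=localhost:8080
# on success the probe should report something like: status: SERVING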

With our Docker images defined we will now move on to configuring Kubernetes manifests to deploy our workload. We will define a Service and a Deployment:

apiVersion: v1
kind: Service
metadata:
  name: podlifecycle
  labels:
    app: podlifecycle
spec:
  ports:
  - port: 8080
    name: grpc-server
  selector:
    app: podlifecycle
---
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: podlifecycle
  name: podlifecycle
spec:
  replicas: 1
  selector:
    matchLabels:
      app: podlifecycle
  template:
    metadata:
      labels:
        app: podlifecycle
        version: "1" # This will be changed in the future to cause deploys to occur
    spec:
      containers:
      - image: jwenz723/grpchealthprobe
        livenessProbe:
          exec:
            command:
            - /usr/bin/local/grpc-health-probe
            # This is the address of the gRPC server exposed in the other container
            - -addr=localhost:8080
          initialDelaySeconds: 30
          failureThreshold: 1
          successThreshold: 1
          periodSeconds: 10 # this is 3 times the failure threshold (periodSeconds * failureThreshold) of the readiness probe
        name: grpchealthprobe
        readinessProbe:
          exec:
            command:
            - /usr/bin/local/grpc-health-probe
            - -addr=:8080
          initialDelaySeconds: 5
          failureThreshold: 3
          successThreshold: 1
          periodSeconds: 3
        lifecycle:
          # Defining a preStop hook to sleep for a few seconds allows Kubernetes the necessary amount of time
          # to propagate the deletion of a pod to Kubernetes Services and Ingresses, this value will differ
          # based upon the sync frequency of your Ingress
          preStop:
            exec:
              command:
              - /bin/sleep
              - "20"
      - image: jwenz723/podlifecycle
        name: podlifecycle
        ports:
        - containerPort: 8080
        lifecycle:
          # Defining a preStop hook to sleep for a few seconds allows Kubernetes the necessary amount of time
          # to propagate the deletion of a pod to Kubernetes Services and Ingresses, this value will differ
          # based upon the sync frequency of your Ingress
          preStop:
            exec:
              command:
              - /bin/sleep
              - "20"
      # This is the amount of time Kubernetes will wait after sending a SIGTERM for your
      # application to shutdown before sending a SIGKILL
      terminationGracePeriodSeconds: 30

Above we have defined 2 containers to be deployed, a grpchealthprobe container and a podlifecycle container. In Kubernetes, when multiple containers are deployed into the same pod they are executed on the same logical host. This means that the grpchealthprobe container is able to send requests to podlifecycle at localhost:8080 (the address exposed by podlifecycle). The alternative to this is to compile the grpc-health-probe binary directly into your workload Docker image. I opted to keep it separate so that I can have a sidecar container that is reusable for all my gRPC serving workloads.

When configuring liveness and readiness probes, it is important that the thresholds not be set identically for both probes. The readiness probe is intended to notify Kubernetes when a workload is ready to serve traffic. In simple terms, this means that Kubernetes will add the IP address of the pod to the endpoint set of the corresponding Kubernetes Service when the workload is ready. The liveness probe is intended to ensure that your application never hangs. Sometimes applications get into a bad state that can only be recovered from by restarting the application; this is the type of behavior a liveness probe is built to help resolve. When a liveness probe fails, Kubernetes will delete the pod that is failing the probe and create a new pod. This can be dangerous, so take caution when defining liveness probes.

A basic suggestion for configuring these probes is to set the periodSeconds of the liveness probe to roughly 3 times the entire failure window of the readiness probe (its periodSeconds * failureThreshold).

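Applied literally, that rule of thumb looks something like the following; the numbers are illustrative only and are not the exact values used in the manifest above:

readinessProbe:
  periodSeconds: 3
  failureThreshold: 3    # full readiness failure window: 3s * 3 = 9s
livenessProbe:
  failureThreshold: 1
  periodSeconds: 27      # roughly 3 x the 9s readiness failure window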

When a pod enters a Terminating state, the pod is removed from Kubernetes Services and Ingresses to prevent new traffic from reaching the terminating pod. Unfortunately, this is done through asynchronous API calls, so it is not known exactly when a pod will be removed from routing. For this purpose a preStop hook has been added to sleep for a short period (20 seconds in the manifest above). In my case, this was sufficient for Kubernetes to execute all updates. This period will need to be tuned based upon how quickly your Ingress Controller performs updates.

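If you want to observe this propagation yourself (optional, not part of the deployment), you can watch the Service's endpoints during a rollout; the terminating pod's IP should drop out of the list shortly after the pod enters the Terminating state:

kubectl get endpoints podlifecycle -w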

With all of this in place, we will now deploy it all using Skaffold. We will start by defining a skaffold.yaml file:

apiVersion: skaffold/v1beta13
kind: Config
build:
  artifacts:
  - image: jwenz723/podlifecycle
    context: ./
    docker:
      dockerfile: ./Dockerfile
  local:
    push: false
deploy:
  kubectl:
    manifests:
    - ./k8s/*

Now we run:

skaffold run

You can alternatively execute `skaffold dev` if you want the logs of the workload to be piped to your terminal.

After Skaffold completes the deployment, you should see one pod:

➜  ~  kubectl get pods
NAME                            READY   STATUS    RESTARTS   AGE
podlifecycle-8577f67547-5k8gw   2/2     Running   0          70s

Now we will write a client application to send repeated load to the server. The client app will make use of the proto that we defined earlier to invoke the Work rpc in an infinite loop. If errors occur, they will be logged but will not cause the application to shut down. This is helpful when testing because if a deploy is not actually zero downtime, the client application will log errors. Here is the code for our client application:

package main

import (
	"context"
	"log"
	"time"

	example "github.com/jwenz723/podlifecycle/server/proto"
	"google.golang.org/grpc"
)

const (
	address = "192.168.64.24:32332"
)

func main() {
	// Set up a connection to the server.
	conn, err := grpc.Dial(address, grpc.WithInsecure())
	if err != nil {
		log.Fatalf("did not connect: %v", err)
	}
	defer conn.Close()
	c := example.NewExampleClient(conn)

	for {
		time.Sleep(100 * time.Millisecond)

		ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
		r, err := c.Work(ctx, &example.WorkItem{
			Name: "test",
			Size: 0,
		})
		// release the context as soon as the call returns rather than deferring
		// inside the loop, which would leak contexts until main exits
		cancel()
		if err != nil {
			log.Printf("could not DoStuff: %v", err)
			continue
		}
		log.Printf("Response: %s", r.GetName())
	}
}

The address const defined at the top of the file is the address where the Kubernetes podlifecycle service can be reached. In my case, I am deploying my workload to a minikube k8s cluster, so I can obtain the address of the podlifecycle service by running the command:

➜  ~  minikube service --url podlifecycle
http://192.168.64.24:32332

You can see the client application runs an infinite loop to send Work requests with 100 milliseconds between requests so we don’t overload anything. Build and start the client application to start sending traffic to the server.

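If you also want to confirm that in-flight requests are drained by GracefulStop during a deploy, one option (not part of the original client) is to send larger work items and widen the context timeout accordingly inside the loop above, for example:

// replace the call inside the client loop with a slower request:
ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
r, err := c.Work(ctx, &example.WorkItem{
	Name: "slow-test",
	Size: 5, // the server sleeps 5 seconds before responding
})
cancel()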

With the client application running and traffic successfully hitting the Kubernetes podlifecycle workload, we can now test zero downtime deploys by making small changes to the Kubernetes manifests defined earlier. We will change the version label value from “1” to “2”:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: podlifecycle
  ...
spec:
  ...
  template:
    metadata:
      labels:
        app: podlifecycle
        version: "2"

If you ran skaffold dev earlier, a deploy will be initiated as soon as you save your label change. If you used skaffold run, run the command again to start a new deploy. Watch the logs of the client application as the deploy progresses. You should see an endless stream of success logs and no errors, like:

2020/08/12 00:38:59 Response: test
2020/08/12 00:38:59 Response: test
2020/08/12 00:38:59 Response: test
2020/08/12 00:38:59 Response: test
2020/08/12 00:38:59 Response: test

Congrats, you just successfully completed a zero downtime deployment of a gRPC service! The source code for all components mentioned above can be found at https://github.com/jwenz723/podlifecycle

To learn more about the methods described above, check out the following resources:

I am still new to these topics, so I would love to hear your opinions on how my methods can be improved.

Translated from: https://medium.com/@jwenz723/deploy-kubernetes-grpc-workloads-with-zero-down-time-3585c146f74f
