Scaling Node.js Applications

by Samer Buna

Everything you need to know about Node.js built-in tools for scalability

Update: This article is now part of my book “Node.js Beyond The Basics”.

Read the updated version of this content and more about Node at jscomplete.com/node-beyond-basics.

Scalability in Node.js is not an afterthought. It’s something that’s baked into the core of the runtime. Node is named Node to emphasize the idea that a Node application should comprise multiple small distributed nodes that communicate with each other.

Are you running multiple nodes for your Node applications? Are you running a Node process on every CPU core of your production machines and load balancing all the requests among them? Did you know that Node has a built-in module to help with that?

Node’s cluster module not only provides an out-of-the-box solution to utilizing the full CPU power of a machine, but it also helps with increasing the availability of your Node processes and provides an option to restart the whole application with a zero downtime. This article covers all that goodness and more.

This article is a write-up of part of my Pluralsight course about Node.js. I cover similar content in video format there.

Strategies of Scalability

The workload is the most popular reason we scale our applications, but it’s not the only reason. We also scale our applications to increase their availability and tolerance to failure.

There are mainly three different things we can do to scale an application:

1 — Cloning

The easiest thing to do to scale a big application is to clone it multiple times and have each cloned instance handle part of the workload (with a load balancer, for example). This does not cost a lot in terms of development time and it’s highly effective. This strategy is the minimum you should do and Node.js has the built-in module, cluster, to make it easier for you to implement the cloning strategy on a single server.

2 — Decomposing

We can also scale an application by decomposing it based on functionalities and services. This means having multiple, different applications with different code bases and sometimes with their own dedicated databases and User Interfaces.

This strategy is commonly associated with the term Microservice, where micro indicates that those services should be as small as possible, but in reality, the size of the service is not what’s important but rather the enforcement of loose coupling and high cohesion between services. The implementation of this strategy is often not easy and could result in long-term unexpected problems, but when done right the advantages are great.

3 — Splitting

We can also split the application into multiple instances where each instance is responsible for only a part of the application’s data. This strategy is often named horizontal partitioning, or sharding, in databases. Data partitioning requires a lookup step before each operation to determine which instance of the application to use. For example, maybe we want to partition our users based on their country or language. We need to do a lookup of that information first.
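
Here is a minimal sketch of what that lookup step could look like; the country-to-instance map and the pickInstance helper are purely illustrative assumptions, not part of the original example:

// Hypothetical: application instances partitioned by user country
const instanceByCountry = {
  US: 'http://10.0.0.1:8080',
  FR: 'http://10.0.0.2:8080',
};

// The lookup step: determine which instance owns this user's data
const pickInstance = (user) => instanceByCountry[user.country];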

Successfully scaling a big application should eventually implement all three strategies. Node.js makes it easy to do so but I am going to focus on the cloning strategy in this article and explore the built-in tools available in Node.js to implement it.

Please note that you need a good understanding of Node.js child processes before reading this article. If you haven’t already, I recommend that you read this other article first:

Node.js Child Processes: Everything you need to know
How to use spawn(), exec(), execFile(), and fork()
medium.freecodecamp.org

The Cluster Module

The cluster module can be used to enable load balancing over an environment’s multiple CPU cores. It’s based on the child process module fork method and it basically allows us to fork the main application process as many times as we have CPU cores. It will then take over and load balance all requests to the main process across all forked processes.

The cluster module is Node’s helper for us to implement the cloning scalability strategy, but only on one machine. When you have a big machine with a lot of resources or when it’s easier and cheaper to add more resources to one machine rather than adding new machines, the cluster module is a great option for a really quick implementation of the cloning strategy.

Even small machines usually have multiple cores and even if you’re not worried about the load on your Node server, you should enable the cluster module anyway to increase your server availability and fault-tolerance. It’s a simple step and when using a process manager like PM2, for example, it becomes as simple as just providing an argument to the launch command!

But let me tell you how to use the cluster module natively and explain how it works.

The structure of what the cluster module does is simple. We create a master process and that master process forks a number of worker processes and manages them. Each worker process represents an instance of the application that we want to scale. All incoming requests are handled by the master process, which is the one that decides which worker process should handle an incoming request.

The master process’s job is easy because it actually just uses a round-robin algorithm to pick a worker process. This is enabled by default on all platforms except Windows and it can be globally modified to let the load-balancing be handled by the operating system itself.

The round-robin algorithm distributes the load evenly across all available processes on a rotational basis. The first request is forwarded to the first worker process, the second to the next worker process in the list, and so on. When the end of the list is reached, the algorithm starts again from the beginning.
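
Conceptually, the selection logic is as simple as this sketch (a simplified illustration, not the cluster module's actual implementation):

// Pick workers in rotation, wrapping around at the end of the list
let next = 0;
const pickWorker = (workers) => workers[next++ % workers.length];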

This is one of the simplest and most used load balancing algorithms. But it’s not the only one. More featured algorithms allow assigning priorities and selecting the least loaded server or the one with the fastest response time.

Load-Balancing an HTTP Server

Let’s clone and load balance a simple HTTP server using the cluster module. Here’s the simple Node’s hello-world example server slightly modified to simulate some CPU work before responding:

// server.js
const http = require('http');
const pid = process.pid;

http.createServer((req, res) => {
  for (let i=0; i<1e7; i++); // simulate CPU work
  res.end(`Handled by process ${pid}`);
}).listen(8080, () => {
  console.log(`Started process ${pid}`);
});

To verify that the balancer we’re going to create is going to work, I’ve included the process pid in the HTTP response to identify which instance of the application is actually handling a request.

Before we create a cluster to clone this server into multiple workers, let’s do a simple benchmark of how many requests this server can handle per second. We can use the Apache benchmarking tool for that. After running the simple server.js code above, run this ab command:

ab -c200 -t10 http://localhost:8080/

This command will test-load the server with 200 concurrent connections for 10 seconds.

On my machine, the single node server was able to handle about 51 requests per second. Of course, the results here will be different on different platforms and this is a very simplified test of performance that’s not 100% accurate, but it will clearly show the difference that a cluster would make in a multi-core environment.

Now that we have a reference benchmark, we can scale the application with the cloning strategy using the cluster module.

On the same level as the server.js file above, we can create a new file (cluster.js) for the master process with this content (explanation follows):

// cluster.js
const cluster = require('cluster');
const os = require('os');

if (cluster.isMaster) {
  const cpus = os.cpus().length;

  console.log(`Forking for ${cpus} CPUs`);
  for (let i = 0; i < cpus; i++) {
    cluster.fork(); // fork one worker per CPU core
  }
} else {
  require('./server'); // worker mode: run the actual server code
}

In cluster.js, we first required both the cluster module and the os module. We use the os module to read the number of CPU cores we can work with using os.cpus().

The cluster module gives us the handy Boolean flag isMaster to determine if this cluster.js file is being loaded as a master process or not. The first time we execute this file, we will be executing the master process and that isMaster flag will be set to true. In this case, we can instruct the master process to fork our server as many times as we have CPU cores.

Now we just read the number of CPUs we have using the os module, then with a for loop over that number, we call the cluster.fork method. The for loop will simply create as many workers as the number of CPUs in the system to take advantage of all the available processing power.

When the cluster.fork line is executed from the master process, the current file, cluster.js, is run again, but this time in worker mode with the isMaster flag set to false. There is actually another flag set to true in this case if you need to use it, which is the isWorker flag.
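
For example, we can log both flags from within cluster.js to see which mode the file is currently running in:

// Prints "isMaster: true, isWorker: false" in the master process
// and "isMaster: false, isWorker: true" in every forked worker
console.log(`isMaster: ${cluster.isMaster}, isWorker: ${cluster.isWorker}`);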

When the application runs as a worker, it can start doing the actual work. This is where we need to define our server logic, which, for this example, we can do by requiring the server.js file that we have already.

That’s basically it. That’s how easy it is to take advantage of all the processing power in a machine. To test the cluster, run the cluster.js file:
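
On my machine, the output looked something like this (the pids are illustrative and will differ on every run):

Forking for 8 CPUs
Started process 19674
Started process 19675
Started process 19676
Started process 19677
Started process 19678
Started process 19679
Started process 19680
Started process 19681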

I have 8 cores on my machine so it started 8 processes. It’s important to understand that these are completely different Node.js processes. Each worker process here will have its own event loop and memory space.

When we now hit the web server multiple times, the requests will start to get handled by different worker processes with different process ids. The workers will not be exactly rotated in sequence because the cluster module performs some optimizations when picking the next worker, but the load will be somehow distributed among the different worker processes.

We can use the same ab command above to load-test this cluster of processes:
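
ab -c200 -t10 http://localhost:8080/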

The cluster I created on my machine was able to handle 181 requests per second in comparison to the 51 requests per second that we got using a single Node process. The performance of this simple application tripled with just a few lines of code.

Broadcasting Messages to All Workers

Communicating between the master process and the workers is simple because under the hood the cluster module is just using the child_process.fork API, which means we also have communication channels available between the master process and each worker.

Based on the server.js/cluster.js example above, we can access the list of worker objects using cluster.workers, which is an object that holds a reference to all workers and can be used to read information about these workers. Since we have communication channels between the master process and all workers, to broadcast a message to all them we just need a simple loop over all the workers. For example:

Object.values(cluster.workers).forEach(worker => {
  worker.send(`Hello Worker ${worker.id}`);
});

We simply used Object.values to get an array of all workers from the cluster.workers object. Then, for each worker, we can use the send function to send over any value that we want.

In a worker file, server.js in our example, to read a message received from this master process, we can register a handler for the message event on the global process object. For example:

process.on('message', msg => {
  console.log(`Message from master: ${msg}`);
});

Here is what I see when I test these two additions to the cluster/server example:
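
The output was something like this, abridged, with pids and ordering varying per run:

Started process 21056
Started process 21058
Started process 21057
Message from master: Hello Worker 2
Message from master: Hello Worker 1
Started process 21059
Message from master: Hello Worker 3
Message from master: Hello Worker 4
...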

Every worker received a message from the master process. Note how the workers did not start in order.

Let’s make this communication example a little bit more practical. Let’s say we want our server to reply with the number of users we have created in our database. We’ll create a mock function that returns the number of users we have in the database and just have it square its value every time it’s called (dream growth):

// **** Mock DB Call
const numberOfUsersInDB = function() {
  this.count = this.count || 5;
  this.count = this.count * this.count;
  return this.count;
}
// ****

Every time numberOfUsersInDB is called, we’ll assume that a database connection has been made. What we want to do here — to avoid multiple DB requests — is to cache this call for a certain period of time, such as 10 seconds. However, we still don’t want the 8 forked workers to do their own DB requests and end up with 8 DB requests every 10 seconds. We can have the master process do just one request and tell all of the 8 workers about the new value for the user count using the communication interface.

In the master process mode, we can, for example, use the same loop to broadcast the users count value to all workers:

// Right after the fork loop within the isMaster=true block
const updateWorkers = () => {
  const usersCount = numberOfUsersInDB();
  Object.values(cluster.workers).forEach(worker => {
    worker.send({ usersCount });
  });
};

updateWorkers();
setInterval(updateWorkers, 10000);

Here we’re invoking updateWorkers for the first time and then invoking it every 10 seconds using a setInterval. This way, every 10 seconds, all workers will receive the new user count value over the process communication channel and only one database connection will be made.

In the server code, we can read the usersCount value using the same message event handler. We can simply cache that value in a module-global variable and use it anywhere we want.

For example:

const http = require('http');
const pid = process.pid;

let usersCount;

http.createServer((req, res) => {
  for (let i=0; i<1e7; i++); // simulate CPU work
  res.write(`Handled by process ${pid}\n`);
  res.end(`Users: ${usersCount}`);
}).listen(8080, () => {
  console.log(`Started process ${pid}`);
});

process.on('message', msg => {
  usersCount = msg.usersCount;
});

The above code makes the worker web server respond with the cached usersCount value. If you test the cluster code now, during the first 10 seconds you’ll get “25” as the users count from all workers (and only one DB request would be made). Then after another 10 seconds, all workers would start reporting the new user count, 625 (and only one other DB request would be made).
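
For example, hitting the cluster with curl during the first 10 seconds (the pid here is illustrative):

$ curl localhost:8080
Handled by process 30404
Users: 25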

This is all possible thanks to the communication channels between the master process and all workers.

Increasing Server Availability

One of the problems in running a single instance of a Node application is that when that instance crashes, it has to be restarted. This means some downtime between these two actions, even if the process was automated as it should be.

This also applies to the case when the server has to be restarted to deploy new code. With one instance, there will be downtime which affects the availability of the system.

When we have multiple instances, the availability of the system can be easily increased with just a few extra lines of code.

To simulate a random crash in the server process, we can simply do a process.exit call inside a timer that fires after a random amount of time:

// In server.js
setTimeout(() => {
  process.exit(1) // death by random timeout
}, Math.random() * 10000);

When a worker process exits like this, the master process will be notified using the exit event on the cluster model object. We can register a handler for that event and just fork a new worker process when any worker process exits.

For example:

// Right after the fork loop within the isMaster=true block
cluster.on('exit', (worker, code, signal) => {
  if (code !== 0 && !worker.exitedAfterDisconnect) {
    console.log(`Worker ${worker.id} crashed. ` +
                'Starting a new worker...');
    cluster.fork();
  }
});

It’s good to add the if condition above to make sure the worker process actually crashed and was not manually disconnected or killed by the master process itself. For example, the master process might decide that we are using too many resources based on the load patterns it sees and it will need to kill a few workers in that case. To do so, we can use the disconnect method on any worker and, in that case, the exitedAfterDisconnect flag will be set to true. The if statement above guards against forking a new worker in that case.

If we run the cluster with the handler above (and the random crash in server.js), after a random number of seconds, workers will start to crash and the master process will immediately fork new workers to increase the availability of the system. You can actually measure the availability using the same ab command and see how many requests the server will not be able to handle overall (because some of the unlucky requests will have to face the crash case and that’s hard to avoid.)

When I tested the code, only 17 requests failed out of over 1800 in the 10-second testing interval with 200 concurrent requests.
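
In other words, roughly 17 ÷ 1800 ≈ 0.94% of requests failed during that crash-heavy run.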

That’s over 99% availability. By just adding a few lines of code, we now don’t have to worry about process crashes anymore. The master guardian will keep an eye on those processes for us.

Zero-downtime Restarts

What about the case when we want to restart all worker processes when, for example, we need to deploy new code?

We have multiple instances running, so instead of restarting them together, we can simply restart them one at a time to allow other workers to continue to serve requests while one worker is being restarted.

Implementing this with the cluster module is easy. Since we don’t want to restart the master process once it’s up, we need a way to send this master process a command to instruct it to start restarting its workers. This is easy on Linux systems because we can simply listen to a process signal like SIGUSR2, which we can trigger by using the kill command on the process id and passing that signal:

// In Node
process.on('SIGUSR2', () => { ... });
// To trigger that
$ kill -SIGUSR2 PID

This way, the master process will not be killed and we have a way to instruct it to start doing something. SIGUSR2 is a proper signal to use here because this will be a user command. If you’re wondering why not SIGUSR1, it’s because Node uses that for its debugger and you want to avoid any conflicts.

Unfortunately, on Windows, these process signals are not supported and we would have to find another way to command the master process to do something. There are some alternatives. We can, for example, use standard input or socket input. Or we can monitor the existence of a process.pid file and watch that for a remove event. But to keep this example simple, we’ll just assume this server is running on a Linux platform.
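
As a rough sketch of the file-watching idea, assuming a made-up convention where deleting the pid file requests a restart (the file name and the restartWorkers function are hypothetical):

const fs = require('fs');

// The master records its pid in a file on startup
fs.writeFileSync('process.pid', String(process.pid));

// Deleting that file signals the master to restart its workers
fs.watch('.', (eventType, filename) => {
  if (filename === 'process.pid' && !fs.existsSync('process.pid')) {
    restartWorkers(); // hypothetical: kicks off the rolling restart below
  }
});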

Node works very well on Windows, but I think it’s a much safer option to host production Node applications on a Linux platform. This is not just because of Node itself, but many other production tools that are much more stable on Linux. This is my personal opinion and feel free to completely ignore it.

By the way, on recent versions of Windows, you can actually use a Linux subsystem and it works very well. I’ve tested it myself and it was nothing short of impressive. If you’re developing a Node applications on Windows, check out Bash on Windows and give it a try.

In our example, when the master process receives the SIGUSR2 signal, that means it’s time for it to restart its workers, but we want to do that one worker at a time. This simply means the master process should only restart the next worker when it’s done restarting the current one.

To begin this task, we need to get a reference to all current workers using the cluster.workers object and we can simply just store the workers in an array:

const workers = Object.values(cluster.workers);

Then, we can create a restartWorker function that receives the index of the worker to be restarted. This way we can do the restarting in sequence by having the function call itself when it’s ready for the next worker. Here’s an example restartWorker function that we can use (explanation follows):

const restartWorker = (workerIndex) => {
  const worker = workers[workerIndex];
  if (!worker) return; // stop condition: no more workers to restart

  worker.on('exit', () => {
    if (!worker.exitedAfterDisconnect) return; // only handle planned disconnects
    console.log(`Exited process ${worker.process.pid}`);

    // replace the disconnected worker, then move on once the new one is ready
    cluster.fork().on('listening', () => {
      restartWorker(workerIndex + 1);
    });
  });

  worker.disconnect();
};

restartWorker(0);

Inside the restartWorker function, we got a reference to the worker to be restarted and since we will be calling this function recursively to form a sequence, we need a stop condition. When we no longer have a worker to restart, we can just return. We then basically want to disconnect this worker (using worker.disconnect), but before restarting the next worker, we need to fork a new worker to replace this current one that we’re disconnecting.

We can use the exit event on the worker itself to fork a new worker when the current one exits, but we have to make sure that the exit action was actually triggered after a normal disconnect call. We can use the exitedAfterDisconnect flag. If this flag is not true, the exit was caused by something other than our disconnect call and, in that case, we should just return and do nothing. But if the flag is set to true, we can go ahead and fork a new worker to replace the one that we’re disconnecting.

When this new forked worker is ready, we can restart the next one. However, remember that the fork process is not synchronous, so we can’t just restart the next worker after the fork call. Instead, we can monitor the listening event on the newly forked worker, which tells us that this worker is connected and ready. When we get this event, we can safely restart the next worker in sequence.
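
Putting it together, one way to wire the whole sequence to the signal inside the master block is a sketch like this:

// Right after the fork loop within the isMaster=true block
process.on('SIGUSR2', () => {
  const workers = Object.values(cluster.workers);

  const restartWorker = (workerIndex) => {
    // ... same function as defined above ...
  };

  restartWorker(0);
});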

That’s all we need for a zero-downtime restart. To test it, you’ll need the master process id so you can send it the SIGUSR2 signal:

console.log(`Master PID: ${process.pid}`);

Start the cluster, copy the master process id, and then restart the cluster using the kill -SIGUSR2 PID command. You can also run the same ab command while restarting the cluster to see the effect that this restart process will have on availability. Spoiler alert, you should get ZERO failed requests:

Process monitors like PM2, which I personally use in production, make all the tasks we went through so far extremely easy and give a lot more features to monitor the health of a Node.js application. For example, with PM2, to launch a cluster for any app, all you need to do is use the -i argument:

pm2 start server.js -i max

And to do a zero downtime restart you just issue this magic command:

pm2 reload all

However, I find it helpful to first understand what actually will happen under the hood when you use these commands.

Shared State and Sticky Load Balancing

Good things always come with a cost. When we load balance a Node application, we lose some features that are only suitable for a single process. This problem is somewhat similar to what’s known in other languages as thread safety, which is about sharing data between threads. In our case, it’s sharing data between worker processes.

For example, with a cluster setup, we can no longer cache things in memory because every worker process will have its own memory space. If we cache something in one worker’s memory, other workers will not have access to it.

If we need to cache things with a cluster setup, we have to use a separate entity and read/write to that entity’s API from all workers. This entity can be a database server or if you want to use in-memory cache you can use a server like Redis or create a dedicated Node process with a read/write API for all other workers to communicate with.
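
For example, here is a sketch of the shared-cache idea using Redis; it assumes a Redis server running locally and the third-party redis npm package, neither of which is part of this article's code:

const redis = require('redis');
const client = redis.createClient();

// Any worker can write the shared value...
client.set('usersCount', 25);

// ...and any other worker can read it back
client.get('usersCount', (err, value) => {
  console.log(`Users: ${value}`);
});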

Don’t look at this as a disadvantage though, because using a separate entity for your application caching needs is part of decomposing your app for scalability. You should probably be doing that even if you’re running on a single core machine.

Other than caching, when we’re running on a cluster, stateful communication in general becomes a problem. Since the communication is not guaranteed to be with the same worker, creating a stateful channel on any one worker is not an option.

The most common example for this is authenticating users.

With a cluster, the request for authentication comes to the master balancer process, which sends it to a worker, say worker A in this example.

Worker A now recognizes the state of this user. However, when the same user makes another request, the load balancer will eventually send them to other workers, where they are not authenticated. Keeping a reference to an authenticated user session in one instance’s memory is not going to work anymore.

This problem can be solved in many ways. We can simply share the state across the many workers we have by storing these sessions’ information in a shared database or a Redis node. However, applying this strategy requires some code changes, which is not always an option.

If you can’t do the code modifications needed to make a shared storage of sessions here, there is a less invasive but not as efficient strategy. You can use what’s known as Sticky Load Balancing. This is much simpler to implement as many load balancers support this strategy out of the box. The idea is simple. When a user authenticates with a worker instance, we keep a record of that relation on the load balancer level.

Then, when the same user sends a new request, we do a lookup in this record to figure out which server has their session authenticated and keep sending them to that server instead of the normal distributed behavior. This way, the code on the server side does not have to be changed, but we don’t really get the benefit of load balancing for authenticated users here so only use sticky load balancing if you have no other option.
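
A sketch of what that balancer-level record could look like (illustrative only; real load balancers usually implement stickiness with cookies or IP hashing):

// Remember which worker served each client address
const sticky = new Map();

const pickWorkerFor = (clientIp, workers) => {
  if (!sticky.has(clientIp)) {
    sticky.set(clientIp, pickNextWorker(workers)); // hypothetical normal picker
  }
  return sticky.get(clientIp);
};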

The cluster module actually does not support sticky load balancing, but a few other load balancers can be configured to do sticky load balancing by default.

Thanks for reading.

Learning React or Node? Check out my books:

Translated from: https://www.freecodecamp.org/news/scaling-node-js-applications-8492bd8afadc/
