zmq 中文文档

ØMQ - The Guide


By Pieter Hintjens, CEO of iMatix

Please use the issue tracker for all comments and errata. This version covers the latest stable release of ZeroMQ (3.2). If you are using older versions of ZeroMQ then some of the examples and explanations won’t be accurate.

The Guide is originally in C, but also in PHP, Java, Python, Lua, and Haxe. We’ve also translated most of the examples into C++, C#, CL, Delphi, Erlang, F#, Felix, Haskell, Objective-C, Ruby, Ada, Basic, Clojure, Go, Haxe, Node.js, ooc, Perl, and Scala.

ZeroMQ 简介

ZeroMQ(也称为ØMQ、0MQ或zmq)看起来像一个可嵌入的网络库(an embeddable networking library),但用起来更像一个并发框架。它提供的 socket 可以在进程内、进程间、TCP 和组播等多种传输上承载原子消息。您可以把 socket 以 N-对-N 的方式连接起来,构成 fan-out、pub-sub、task distribution、request-reply 等模式。它的速度足以胜任集群产品的通信骨干(fabric)。它的异步 I/O 模型让您可以把可伸缩的多核应用程序构建成一组异步的消息处理任务。它拥有大量语言 API,可以运行在大多数操作系统上。ZeroMQ 来自 iMatix,以 LGPLv3 开源。

它如何开始

We took a normal TCP socket, injected it with a mix of radioactive isotopes stolen from a secret Soviet atomic research project, bombarded it with 1950-era cosmic rays, and put it into the hands of a drug-addled comic book author with a badly-disguised fetish for bulging muscles clad in spandex. Yes, ZeroMQ sockets are the world-saving superheroes of the networking world.

Figure 1 - A terrible accident…

![Figure 1](https://github.com/imatix/zguide/raw/master/images/fig1.png)

Zero的含义

ZeroMQ 名字里的 Ø 是一种权衡。一方面,这个奇怪的名字降低了 ZeroMQ 在 Google 和 Twitter 上的曝光度;另一方面,它惹恼了一些丹麦人,他们会给我们写“ØMG røtfl”、“Ø 不是一个长得滑稽的零!”以及“Rødgrød med fløde!”,据说最后这句是一种侮辱,意思是“愿你的邻居都是格伦德尔(Grendel)的直系后裔!”。看起来是笔公平的交易。

最初ZeroMQ中的0表示“零代理”,并且(接近于)“零延迟”(尽可能)。从那时起,它开始包含不同的目标:零管理、零成本、零浪费。更普遍地说,“零”指的是渗透在项目中的极简主义文化。我们通过消除复杂性而不是公开新功能来增加功能。

Audience

本书是为专业程序员编写的,他们想要学习如何制作将主导未来计算的大规模分布式软件。我们假设您可以阅读C代码,因为这里的大多数示例都是用C编写的,即使ZeroMQ在许多语言中都被使用。我们假设您关心规模,因为ZeroMQ首先解决了这个问题。我们假设您需要以尽可能少的成本获得尽可能好的结果,否则您将不会欣赏ZeroMQ所做的权衡。除了基本的背景知识,我们还将介绍使用ZeroMQ所需的网络和分布式计算的所有概念。

致谢

Thanks to Andy Oram for making the O’Reilly book happen, and editing this text.

Thanks to Bill Desmarais, Brian Dorsey, Daniel Lin, Eric Desgranges, Gonzalo Diethelm, Guido Goldstein, Hunter Ford, Kamil Shakirov, Martin Sustrik, Mike Castleman, Naveen Chawla, Nicola Peduzzi, Oliver Smith, Olivier Chamoux, Peter Alexander, Pierre Rouleau, Randy Dryburgh, John Unwin, Alex Thomas, Mihail Minkov, Jeremy Avnet, Michael Compton, Kamil Kisiel, Mark Kharitonov, Guillaume Aubert, Ian Barber, Mike Sheridan, Faruk Akgul, Oleg Sidorov, Lev Givon, Allister MacLeod, Alexander D’Archangel, Andreas Hoelzlwimmer, Han Holl, Robert G. Jakabosky, Felipe Cruz, Marcus McCurdy, Mikhail Kulemin, Dr. Gergő Érdi, Pavel Zhukov, Alexander Else, Giovanni Ruggiero, Rick “Technoweenie”, Daniel Lundin, Dave Hoover, Simon Jefford, Benjamin Peterson, Justin Case, Devon Weller, Richard Smith, Alexander Morland, Wadim Grasza, Michael Jakl, Uwe Dauernheim, Sebastian Nowicki, Simone Deponti, Aaron Raddon, Dan Colish, Markus Schirp, Benoit Larroque, Jonathan Palardy, Isaiah Peng, Arkadiusz Orzechowski, Umut Aydin, Matthew Horsfall, Jeremy W. Sherman, Eric Pugh, Tyler Sellon, John E. Vincent, Pavel Mitin, Min RK, Igor Wiedler, Olof Åkesson, Patrick Lucas, Heow Goodman, Senthil Palanisami, John Gallagher, Tomas Roos, Stephen McQuay, Erik Allik, Arnaud Cogoluègnes, Rob Gagnon, Dan Williams, Edward Smith, James Tucker, Kristian Kristensen, Vadim Shalts, Martin Trojer, Tom van Leeuwen, Hiten Pandya, Harm Aarts, Marc Harter, Iskren Ivov Chernev, Jay Han, Sonia Hamilton, Nathan Stocks, Naveen Palli, and Zed Shaw for their contributions to this work.

Chapter 1 - Basics

Fixing the World

如何解释 ZeroMQ?我们中的一些人从它能做的所有奇妙事情讲起:它是“打了类固醇”的 sockets;它像带路由的邮箱;它很快!另一些人试图分享他们的顿悟时刻,也就是当一切豁然开朗时那种 zap-pow-kaboom satori 式的范式转换瞬间:事情变简单了,复杂性消失了,它能开阔思维。还有人试图通过比较来解释:它更小、更简单,但看起来仍然很眼熟。就我个人而言,我想记住我们当初为什么要做 ZeroMQ,因为那很可能正是你们读者今天仍然在做的事情。

编程是一门伪装成艺术的科学,因为我们大多数人都不懂软件的物理原理,而且很少有人教过编程。
软件的物理不是算法、数据结构、语言和抽象。这些只是我们制造、使用、丢弃的工具。软件真正的物理特性是人的物理特性——具体地说,是我们在复杂性方面的局限性,以及我们合作解决大问题的愿望。这是编程的科学:制作人们能够理解和使用的积木,然后人们将一起工作来解决最大的问题。

我们生活在一个互联的世界,现代软件必须在这个世界中导航。因此,未来最大的解决方案的构建模块是相互连接和大规模并行的。仅仅让代码变得“强大而安静”是不够的。代码必须与代码对话。代码必须健谈、善于交际、关系良好。代码必须像人脑一样运行,数以万亿计的单个神经元相互发送信息,这是一个大规模的并行网络,没有中央控制,没有单点故障,但能够解决极其困难的问题。代码的未来看起来像人脑,这并非偶然,因为每个网络的端点,在某种程度上,都是人脑。

如果您使用线程、协议或网络做过任何工作,您就会发现这几乎是不可能的。这是一个梦。当您开始处理实际情况时,即使只是跨几个 socket 连接几个程序也足够麻烦。数万亿?其代价将难以想象。连接计算机是如此困难,以至于做这件事的软件和服务本身就是一项数十亿美元的生意。

所以我们生活在一个线路比我们使用它的能力超前数年的世界里。上世纪80年代,我们经历了一场软件危机。当时,弗雷德•布鲁克斯(Fred Brooks)等顶尖软件工程师相信,没有什么“灵丹妙药”能“保证生产率、可靠性或简单性哪怕提高一个数量级”。

布鲁克斯错过了免费和开源软件,正是这些软件解决了这场危机,使我们能够有效地共享知识。今天,我们面临着另一场软件危机,但我们很少谈论它。只有最大、最富有的公司才有能力创建连接的应用程序。有云,但它是私有的。我们的数据和知识正在从个人电脑上消失,变成我们无法访问、无法与之竞争的云。谁拥有我们的社交网络?这就像是反过来的大型机- pc革命。

我们可以把政治哲学留给另一本书。关键是,互联网提供了大量的潜在连接代码,现实情况是,对于大多数人来说,这是难以企及的,所以巨大而有趣的问题(在健康、教育、经济、交通、等等)仍然没有解决,因为没有办法连接代码,因此没有办法去连接可以一起工作的大脑来解决这些问题。

已经有很多尝试来解决连接代码的挑战。有数以千计的IETF规范,每个规范都解决了这个难题的一部分。对于应用程序开发人员来说,HTTP可能是一个简单到足以工作的解决方案,但是它鼓励开发人员和架构师从大服务器和thin,stupid的客户机的角度考虑问题,从而使问题变得更糟。

因此,今天人们仍然使用原始UDP和TCP、专有协议、HTTP和Websockets连接应用程序。它仍然痛苦、缓慢、难以扩展,而且本质上是集中的。分布式P2P架构主要是为了玩,而不是工作。有多少应用程序使用Skype或Bittorrent来交换数据?

这让我们回到编程科学。要改变世界,我们需要做两件事。第一,解决“如何在任何地方将任何代码连接到任何代码”的一般问题。第二,用最简单的模块来概括,让人们能够理解和使用。

这听起来简单得可笑。也许确实如此。这就是重点。

开始的前提

我们假设您至少使用了ZeroMQ的3.2版。我们假设您正在使用Linux机器或类似的东西。我们假设您可以或多或少地阅读C代码,因为这是示例的默认语言。我们假设,当我们编写像PUSH或SUBSCRIBE这样的常量时,您可以想象它们实际上被称为 ZMQ_PUSH 或 ZMQ_SUBSCRIBE(如果编程语言需要的话)。

获取例子

The examples live in a public GitHub repository. The simplest way to get all the examples is to clone this repository:

git clone --depth=1 https://github.com/imatix/zguide.git

接下来,浏览examples子目录。你会通过语言找到例子。如果您使用的语言中缺少示例,建议您提交翻译。正是由于许多人的努力,这篇文章才变得如此有用。所有示例都是根据MIT/X11授权的。

有求必应

让我们从一些代码开始。当然,我们从Hello World的例子开始。我们将创建一个客户端和一个服务器。客户端向服务器发送“Hello”,服务器以“World”作为响应。下面是C语言版的服务器,它在端口5555上打开一个ZeroMQ socket,读取请求,然后用“World”响应每个请求:

[hwserver: Hello World server in C]

C++ | C# | Clojure | CL | Delphi | Erlang | F# | Felix | Go | Haskell | Haxe | Java | Lua | Node.js | Objective-C| Perl | PHP | Python | Q | Racket | Ruby | Scala | Tcl | Ada | Basic | ooc
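
正文只给出了示例代码的链接。为了阅读方便,下面给出一个与上述描述一致的最小 C 语言示意(基于 libzmq 3.x 的 API 写成,只是示意,细节可能与官方 hwserver 示例略有出入):

//  Hello World 服务器的最小示意:把 REP socket 绑定到 tcp://*:5555
#include <zmq.h>
#include <stdio.h>
#include <unistd.h>
#include <assert.h>

int main (void)
{
    void *context = zmq_ctx_new ();
    void *responder = zmq_socket (context, ZMQ_REP);
    int rc = zmq_bind (responder, "tcp://*:5555");
    assert (rc == 0);

    while (1) {
        char buffer [10];
        zmq_recv (responder, buffer, 10, 0);   //  等待客户端的请求
        printf ("Received Hello\n");
        sleep (1);                             //  模拟一些“工作”
        zmq_send (responder, "World", 5, 0);   //  用 "World" 应答
    }
    zmq_close (responder);
    zmq_ctx_destroy (context);
    return 0;
}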

图2 - Request-Reply

![Figure 2](https://github.com/imatix/zguide/raw/master/images/fig2.png)

REQ-REP socket 对是同步的。客户端先执行 zmq_send() 再执行 zmq_recv(),如此循环(或者只执行一次)。执行任何其他顺序(例如连续发送两条消息)都会使 send 或 recv 调用返回 -1。类似地,服务端按自己需要的次数重复“先 zmq_recv() 再 zmq_send()”,顺序同样不能颠倒。

ZeroMQ使用C作为参考语言,这是我们在示例中使用的主要语言。如果您正在在线阅读本文,下面的示例链接将带您到其他编程语言的翻译。让我们在c++中比较相同的服务器:

//  Hello World server in C++
//  Binds REP socket to tcp://*:5555
//  Expects "Hello" from client, replies with "World"
//
#include <zmq.hpp>
#include <string>
#include <cstring>
#include <iostream>
#ifndef _WIN32
#include <unistd.h>
#else
#include <windows.h>

#define sleep(n)    Sleep(n)
#endif

int main () {
    //  Prepare our context and socket
    zmq::context_t context (1);
    zmq::socket_t socket (context, ZMQ_REP);
    socket.bind ("tcp://*:5555");

    while (true) {
        zmq::message_t request;

        //  Wait for next request from client
        socket.recv (&request);
        std::cout << "Received Hello" << std::endl;

        //  Do some 'work'
        sleep (1);

        //  Send reply back to client
        zmq::message_t reply (5);
        memcpy (reply.data (), "World", 5);
        socket.send (reply);
    }
    return 0;
}

hwserver.cpp: Hello World server

You can see that the ZeroMQ API is similar in C and C++. In a language like PHP or Java, we can hide even more and the code becomes even easier to read:

<?php
/*
 *  Hello World server
 *  Binds REP socket to tcp://*:5555
 *  Expects "Hello" from client, replies with "World"
 *  @author Ian Barber <ian(dot)barber(at)gmail(dot)com>
 */

$context = new ZMQContext(1);

//  Socket to talk to clients
$responder = new ZMQSocket($context, ZMQ::SOCKET_REP);
$responder->bind("tcp://*:5555");

while (true) {
    //  Wait for next request from client
    $request = $responder->recv();
    printf ("Received request: [%s]\n", $request);

    //  Do some 'work'
    sleep (1);

    //  Send reply back to client
    $responder->send("World");
}

hwserver.php: Hello World server

package guide;

//
//  Hello World server in Java
//  Binds REP socket to tcp://*:5555
//  Expects "Hello" from client, replies with "World"
//

import org.zeromq.SocketType;
import org.zeromq.ZMQ;
import org.zeromq.ZContext;

public class hwserver
{
    public static void main(String[] args) throws Exception
    {
        try (ZContext context = new ZContext()) {
            // Socket to talk to clients
            ZMQ.Socket socket = context.createSocket(SocketType.REP);
            socket.bind("tcp://*:5555");

            while (!Thread.currentThread().isInterrupted()) {
                byte[] reply = socket.recv(0);
                System.out.println(
                    "Received " + ": [" + new String(reply, ZMQ.CHARSET) + "]"
                );

                String response = "world";
                socket.send(response.getBytes(ZMQ.CHARSET), 0);

                Thread.sleep(1000); //  Do some 'work'
            }
        }
    }
}

hwserver.java: Hello World server

The server in other languages:

[hwserver: Hello World server in C]

C++ | C# | Clojure | CL | Delphi | Erlang | F# | Felix | Go | Haskell | Haxe | Java | Lua | Node.js | Objective-C| Perl | PHP | Python | Q | Racket | Ruby | Scala | Tcl | Ada | Basic | ooc

Here’s the client code:

[hwclient: Hello World client in C]

C++ | C# | Clojure | CL | Delphi | Erlang | F# | Felix | Go | Haskell | Haxe | Java | Lua | Node.js | Objective-C| Perl | PHP | Python | Q | Racket | Ruby | Scala | Tcl | Ada | Basic | ooc
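
这里同样给出客户端的一个最小 C 语言示意(假设 libzmq 3.x API;与正文链接的 hwclient 示例思路一致,仅供参考):

//  Hello World 客户端的最小示意:连接 tcp://localhost:5555,发送 "Hello",等待 "World"
#include <zmq.h>
#include <stdio.h>

int main (void)
{
    void *context = zmq_ctx_new ();
    void *requester = zmq_socket (context, ZMQ_REQ);
    zmq_connect (requester, "tcp://localhost:5555");

    int request_nbr;
    for (request_nbr = 0; request_nbr != 10; request_nbr++) {
        char buffer [10];
        printf ("Sending Hello %d...\n", request_nbr);
        zmq_send (requester, "Hello", 5, 0);
        zmq_recv (requester, buffer, 10, 0);
        printf ("Received World %d\n", request_nbr);
    }
    zmq_close (requester);
    zmq_ctx_destroy (context);
    return 0;
}

注意 REQ socket 强制“先发后收”的顺序,正好与服务器端 REP socket 的“先收后发”配对。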

这看起来太简单了,不太现实,但是正如我们已经知道的,ZeroMQ套接字具有超能力。您可以同时将数千个客户机扔到这个服务器上,它将继续愉快而快速地工作。有趣的是,先启动客户机,然后再启动服务器,看看它是如何工作的,然后再考虑一下这意味着什么。
让我们简要地解释一下这两个程序实际上在做什么。它们创建要使用的ZeroMQ context 和socket。不要担心这些词的意思。你会知道的。服务器将其REP (reply) socket 绑定到端口5555。服务器在一个循环中等待一个请求,每次都用一个响应来响应。客户机发送请求并从服务器读取响应。

如果您关闭服务器(Ctrl-C)并重新启动它,客户机将无法正常恢复。从进程崩溃中恢复并不那么容易。
创建一个可靠的request-reply流非常复杂,直到可靠的Request-Reply模式才会涉及它。

幕后发生了很多事情,但对我们程序员来说,重要的是代码有多短、多好,以及即使在重负载下也不会崩溃的频率。这是request-reply模式,可能是使用ZeroMQ的最简单方法。它映射到RPC和经典的 client/server模型。

需要对Strings小小的注意

除了以字节为单位的大小外,ZeroMQ对您发送的数据一无所知。这意味着您要负责安全地格式化它,以便应用程序能够读取它。为对象和复杂数据类型执行此操作是专门库(如协议缓冲区)的工作。但即使是字符串,你也要小心。

在C语言和其他一些语言中,字符串以空字节结束。我们可以发送一个字符串,如“HELLO”与额外的空字节:

zmq_send (requester, "Hello", 6, 0);

但是,如果您从另一种语言发送一个字符串,它可能不会包含那个空字节。例如,当我们用Python发送相同的字符串时,我们这样做:

socket.send ("Hello")

然后连接到线路上的是长度(对于较短的字符串是一个字节)和作为单个字符的字符串内容。

图 3 - ZeroMQ的 string

![Figure 3](https://github.com/imatix/zguide/raw/master/images/fig3.png)

如果您从C程序中读取这段代码,您将得到一个看起来像字符串的东西,并且可能意外地表现得像字符串(如果幸运的话,这5个字节后面跟着一个无辜的潜伏的null),但是它不是一个正确的字符串。当您的客户机和服务器不同意字符串格式时,您将得到奇怪的结果。

当您在 C 语言中从 ZeroMQ 接收字符串数据时,不能简单地假定它已经安全地以 null 结尾。每次读取字符串时,都应该分配一个多出一个字节空间的新缓冲区,把字符串复制过去,并正确地用 null 终止它。

因此,让我们建立一个规则,即ZeroMQ字符串是指定长度的,并且在传输时不带null。在最简单的情况下(在我们的示例中我们将这样做),ZeroMQ字符串整洁地映射到ZeroMQ消息框架,它看起来像上面的图—长度和一些字节。

在C语言中,我们需要做的是接收一个ZeroMQ字符串并将其作为一个有效的C字符串发送给应用程序:

//  Receive ZeroMQ string from socket and convert into C string
//  Chops string at 255 chars, if it's longer
static char *
s_recv (void *socket) {
    char buffer [256];
    int size = zmq_recv (socket, buffer, 255, 0);
    if (size == -1)
        return NULL;
    if (size > 255)
        size = 255;
    buffer [size] = '\0';
    /* use strndup(buffer, sizeof(buffer)-1) in *nix */
    return strdup (buffer);
}

这是一个方便的helper函数,本着使我们可以有效重用的精神,让我们编写一个类似的s_send函数,它以正确的ZeroMQ格式发送字符串,并将其打包到一个可以重用的头文件中。
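
作为参考,下面是 s_send 的一个最小示意,与上面的 s_recv 相对应:按 ZeroMQ 的约定发送“指定长度、不带结尾 null”的字符串(实际 zhelpers.h 中的实现可能略有不同):

//  Convert C string to ZeroMQ string and send to socket
static int
s_send (void *socket, char *string) {
    return zmq_send (socket, string, strlen (string), 0);
}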

结果是zhelpers.h,它是一个相当长的源代码,而且只对C开发人员有乐趣,所以请在闲暇时阅读它。

版本报告

ZeroMQ有几个版本,通常,如果遇到问题,它会在以后的版本中得到修复。所以这是一个很有用的技巧,可以准确地知道您实际链接的是哪个版本的ZeroMQ。

这里有一个小程序可以做到这一点:

[version: ZeroMQ version reporting in C]

C++ | C# | Clojure | CL | Delphi | Erlang | F# | Felix | Go | Haskell | Java | Lua | Node.js | Objective-C | Perl| PHP | Python | Q | Ruby | Scala | Tcl | Ada | Basic | Haxe | ooc | Racket
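
这个程序的核心只有一个调用:zmq_version() 是 libzmq 的核心 API,用来返回实际链接到的库版本。下面是一个最小示意:

//  报告当前链接的 ZeroMQ 版本
#include <zmq.h>
#include <stdio.h>

int main (void)
{
    int major, minor, patch;
    zmq_version (&major, &minor, &patch);
    printf ("Current ZeroMQ version is %d.%d.%d\n", major, minor, patch);
    return 0;
}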

传达信息

第二个经典模式是单向数据分发,其中服务器将更新推送到一组客户机。让我们看一个示例,它推出由邮政编码、温度和相对湿度组成的天气更新。我们将生成随机值,就像真实的气象站所做的那样。
这是服务器。我们将为这个应用程序使用端口5556:

[wuserver: Weather update server in C]

C++ | C# | Clojure | CL | Delphi | Erlang | F# | Felix | Go | Haskell | Haxe | Java | Lua | Node.js | Objective-C| Perl | PHP | Python | Racket | Ruby | Scala | Tcl | Ada | Basic | ooc | Q

这个更新流没有起点也没有终点,就像一个永无止境的广播。

下面是客户端应用程序,它监听更新流并获取与指定zip code有关的任何内容,默认情况下,纽约是开始任何冒险的好地方:

[wuclient: Weather update client in C]

C++ | C# | Clojure | CL | Delphi | Erlang | F# | Felix | Go | Haskell | Haxe | Java | Lua | Node.js | Objective-C| Perl | PHP | Python | Racket | Ruby | Scala | Tcl | Ada | Basic | ooc | Q

图 4 - Publish-Subscribe

![Figure 4](https://github.com/imatix/zguide/raw/master/images/fig4.png)

注意,当您使用 SUB socket 时,必须使用zmq_setsockopt()和SUBSCRIBE设置订阅,如下面的代码所示。如果不设置任何订阅,就不会收到任何消息。这是初学者常犯的错误。订阅者可以设置许多订阅,这些订阅被添加到一起。也就是说,如果更新匹配任何订阅,订阅方将接收更新。订阅者还可以取消特定的订阅。订阅通常是,但不一定是可打印的字符串。请参阅zmq_setsockopt()了解其工作原理。
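
订阅的设置大致如下(这里的 subscriber 和过滤字符串只是示例名称,并非正文示例的原始代码):

//  订阅过滤器按字节前缀匹配;空字符串("",长度 0)表示订阅所有消息
const char *filter = "10001 ";
zmq_setsockopt (subscriber, ZMQ_SUBSCRIBE, filter, strlen (filter));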

PUB-SUB socket 对(双方的意思)是异步的。客户机在循环中执行zmq_recv()(或者它只需要一次)。试图向 SUB socket发送消息将导致错误(单向的只能收不能发)。类似地,服务在需要的时候执行zmq_send(),但是不能在PUB scoket上执行zmq_recv()(单向的只能发不能收)。

理论上,对于ZeroMQ sockets,哪一端连接和哪一端绑定并不重要。然而,在实践中有一些未记录的差异,我将在稍后讨论。现在,绑定PUB并连接SUB,除非您的网络设计不允许这样做。

关于 PUB-SUB sockets,还有一件更重要的事情需要了解:您不知道订阅者何时开始接收消息。即使启动订阅服务器,等一下,然后启动发布服务器,订阅服务器也始终会错过发布服务器发送的第一个消息。这是因为当订阅服务器连接到发布服务器时(这需要一点时间,但不是零),发布服务器可能已经在发送消息了。

这种“慢加入者(slow joiner)”症状困扰过足够多的人,所以我们要详细解释一下。
记住 ZeroMQ 的 I/O 是异步的,在后台进行。假设有两个节点按以下顺序执行:

  • 订阅者连接到端点并接收和计数消息。
  • 发布者绑定到端点并立即发送1,000条消息。

那么订阅者很可能什么也收不到。您会愣一下,检查是否设置了正确的过滤器,然后重试,订阅者仍然什么也收不到。

建立 TCP 连接需要来回握手,这会花费几毫秒,具体取决于您的网络和对等点之间的跳数。在这段时间里,ZeroMQ 可以发送很多消息。为了便于讨论,假设建立一个连接需要 5 毫秒,并且这条链路每秒可以处理 100 万条消息。那么在订阅者连接发布者的这 5 毫秒里,发布者只需要 1 毫秒就能把那 1000 条消息发完。

在Sockets and Patterns中,我们将解释如何同步发布者和订阅者,以便在订阅者真正连接并准备好之前不会开始发布数据。有一个简单而愚蠢的方法可以延迟发布,那就是sleep。但是,不要在实际应用程序中这样做,因为它非常脆弱、不优雅且速度很慢。使用sleep向您自己证明发生了什么,然后等待Sockets and Patterns来查看如何正确地执行此操作。

同步的另一种选择是简单地假设发布的数据流是无限的,没有开始和结束。还有一种假设是订阅者不关心在启动之前发生了什么。这是我们如何构建天气客户端示例的。

因此,客户端订阅其选择的zip code,并为该zip code收集100个更新。如果zip code是随机分布的,这意味着大约有一千万次来自服务器的更新。您可以启动客户机,然后启动服务器,客户机将继续工作。您可以随时停止和重启服务器,客户机将继续工作。当客户机收集了它的100个更新后,它计算平均值,打印并退出。

关于发布-订阅(发布-订阅) publish-subscribe (pub-sub) 模式的几点:

  • 订阅服务器可以连接到多个发布服务器,每次使用一个连接调用。然后,数据将到达并交错(“公平排队”),这样就不会有一个发布者淹没其他发布者。

  • 如果发布者没有连接的订阅者,那么它将删除所有消息。

  • 如果您正在使用TCP,而订阅服务器很慢,则消息将在发布服务器上排队。稍后,我们将研究如何使用“高水位标记(high-water mark)”来保护publishers 不受此影响。

从ZeroMQ v3.x,当使用连接的协议(tcp://或ipc://)时,过滤发生在发布端。
使用epgm://协议,过滤发生在订阅方。在ZeroMQ v2.x,所有过滤都发生在订阅端。

我的笔记本电脑是2011年的英特尔i5,接收和过滤1000万条信息的时间是这样的:

$ time wuclient
Collecting updates from weather server...
Average temperature for zipcode '10001 ' was 28F

real    0m4.470s
user    0m0.000s
sys     0m0.008s

Divide and Conquer(分而治之)

图 5 - Parallel Pipeline

![Figure 5](https://github.com/imatix/zguide/raw/master/images/fig5.png)

作为最后一个例子(您肯定已经厌倦了有趣的代码,并希望重新研究比较抽象规范的语言学讨论),让我们来做一些超级计算。然后咖啡。我们的超级计算应用程序是一个相当典型的并行处理模型。我们有:

  • 一个 ventilator,产生许多可以并行处理的任务
  • 一组 worker,负责处理这些任务
  • 一个 sink,从 worker 进程收集结果

在现实中,workers 在超级快的机器上运行,可能使用gpu(图形处理单元)来做艰难的计算。这是ventilator 。它会生成100个任务,每个任务都有一条消息告诉worker睡眠几毫秒:

[taskvent: Parallel task ventilator in C]

C++ | C# | Clojure | CL | Delphi | Erlang | F# | Felix | Go | Haskell | Haxe | Java | Lua | Node.js | Objective-C| Perl | PHP | Python | Ruby | Scala | Tcl | Ada | Basic | ooc | Q | Racket

这是 worker 应用程序。它接收一条消息,按消息中给出的毫秒数休眠,然后发出信号表示它已经完成:
[taskwork: Parallel task worker in C]

C++ | C# | Clojure | CL | Delphi | Erlang | F# | Felix | Go | Haskell | Haxe | Java | Lua | Node.js | Objective-C| Perl | PHP | Python | Ruby | Scala | Tcl | Ada | Basic | ooc | Q | Racket
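
下面是 worker 的一个最小 C 语言示意(基于 libzmq 3.x API;其中 5557/5558 这两个端口号是本示意的假设,与正文链接示例使用的端口未必相同):

//  并行任务 worker 的示意:上游 PULL 拉任务,下游 PUSH 发结果
#include <zmq.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main (void)
{
    void *context = zmq_ctx_new ();

    //  上游:从 ventilator 拉取任务
    void *receiver = zmq_socket (context, ZMQ_PULL);
    zmq_connect (receiver, "tcp://localhost:5557");

    //  下游:把完成信号推给 sink
    void *sender = zmq_socket (context, ZMQ_PUSH);
    zmq_connect (sender, "tcp://localhost:5558");

    while (1) {
        char buffer [32] = { 0 };
        int size = zmq_recv (receiver, buffer, sizeof (buffer) - 1, 0);
        if (size == -1)
            break;                       //  出错或被中断
        int msec = atoi (buffer);        //  任务内容是需要“工作”的毫秒数
        usleep (msec * 1000);            //  模拟工作
        zmq_send (sender, "", 0, 0);     //  向 sink 发送一条空消息表示完成
    }
    zmq_close (receiver);
    zmq_close (sender);
    zmq_ctx_destroy (context);
    return 0;
}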

下面是sink应用程序。它收集了100个任务,然后计算出整个处理过程花费了多长时间,这样我们就可以确认,如果有多个任务,那么这些工人确实是并行运行的:
[tasksink: Parallel task sink in C]

C++ | C# | Clojure | CL | Delphi | Erlang | F# | Felix | Go | Haskell | Haxe | Java | Lua | Node.js | Objective-C| Perl | PHP | Python | Ruby | Scala | Tcl | Ada | Basic | ooc | Q | Racket

分别启动 1、2、4 个 worker 运行这一批任务时,sink 报告的总耗时大致如下:

  • 1 worker: total elapsed time: 5034 msecs.
  • 2 workers: total elapsed time: 2421 msecs.
  • 4 workers: total elapsed time: 1018 msecs.

让我们更详细地看看这段代码的一些方面:

  • worker 将上游连接到 ventilator,下游连接到 sink。这意味着可以任意添加 worker。如果让 worker 绑定到自己的端点,那么每次添加 worker 时,您都需要 (a) 更多的端点,并且 (b) 修改 ventilator 和/或 sink。我们说 ventilator 和 sink 是这个架构中“稳定”的部分,而 worker 是“动态”的部分。

  • 我们必须把批处理的开始与所有 worker 都启动并运行这件事同步起来(We have to synchronize the start of the batch with all workers being up and running.)。这是 ZeroMQ 中一个相当常见的问题,没有简单的解决方案。zmq_connect 方法需要一定的时间。因此,当一组 worker 连接到 ventilator 时,第一个成功连接的 worker 会在短时间内抢到大量消息,而其他 worker 还在连接。如果不以某种方式同步批处理的开始,系统根本不会并行运行。试着把 ventilator 里的等待去掉,看看会发生什么。

  • ventilator 的PUSH socket 将任务分配给worker(假设他们在批处理开始输出之前都连接好了)。这就是所谓的“负载平衡”,我们将再次详细讨论它。

  • sink的PULL均匀地收集worker的结果。这叫做“公平排队”。
    图 6 - Fair Queuing

![Figure 6](https://github.com/imatix/zguide/raw/master/images/fig6.png)

管道模式(pipeline pattern)还表现出“慢连接者”综合征,导致指责PUSH sockets不能正确地平衡负载。如果您正在使用 PUSH 和 PULL,而您的一个worker获得的消息比其他worker多得多,这是因为这个PULL socket连接得比其他worker更快,并且在其他worker设法连接之前捕获了大量消息。如果您想要适当的负载平衡,您可能需要查看Advanced Request-Reply Patterns中的负载平衡模式。

ZeroMQ编程

看过一些例子之后,您一定很想开始在一些应用程序中使用ZeroMQ。在你开始之前,深呼吸,放松,并思考一些基本的建议,这会帮你减轻很多压力和困惑。

  • 循序渐进的学习ZeroMQ。它只是一个简单的API,但它隐藏了大量的可能性。慢慢地把握每一种可能性。

  • 写好代码。丑陋的代码隐藏了问题,让别人很难帮助你。您可能已经习惯了无意义的变量名,但是阅读您代码的人不会习惯。使用真实的单词,而不是“我太粗心了,不能告诉您这个变量的真正用途”。使用一致的缩进和干净的布局。写好代码,你的世界就会更舒适。

  • 一边做一边测试。当您的程序无法工作时,您应该知道应该归咎于哪五行。当你使用极具魅力的ZeroMQ的时候,这一点尤其正确,因为在你开始尝试的几次之后,它都不会起作用。

    • 当您发现有些东西不像预期的那样工作时,将您的代码分成几部分,测试每一部分,看看哪一部分不工作。ZeroMQ允许你编写模块化代码;利用这一点。
  • 根据需要进行抽象(类、方法等)。如果你复制/粘贴了很多代码,你也会复制/粘贴错误。

正确理解 Context

ZeroMQ 应用程序总是从创建 context 开始,然后用它创建 socket。在 C 语言中,对应的是 zmq_ctx_new() 调用。一个进程中应该只创建并使用一个 context。从技术上讲,context 是一个进程中所有 socket 的容器,并充当 inproc socket 的传输载体,而 inproc 是在一个进程内连接线程的最快方式。如果一个进程在运行时有两个 context,它们就相当于两个独立的 ZeroMQ 实例。如果这正是您明确想要的,没问题;否则请记住:

Call zmq_ctx_new() once at the start of a process, and zmq_ctx_destroy() once at the end.

在进程开始时调用 zmq_ctx_new() 一次,在进程结束时调用 zmq_ctx_destroy() 一次。
如果使用 fork() 系统调用,请在 fork 之后、子进程代码的开头执行 zmq_ctx_new()。通常,您会在子进程中做有趣的(ZeroMQ)工作,而在父进程中做乏味的进程管理。
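
下面是这条 fork() 规则的一个最小示意(仅用来说明调用顺序,省略了错误处理):

//  先 fork,再在子进程里创建自己的 context
pid_t pid = fork ();
if (pid == 0) {
    //  子进程:在这里才创建 context,并完成所有 ZeroMQ 工作
    void *context = zmq_ctx_new ();
    //  ... 创建 socket、收发消息 ...
    zmq_ctx_destroy (context);
    _exit (0);
}
//  父进程:只做乏味的进程管理,不要碰子进程的 context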

退出前清理

一流的程序员与一流的杀手有相同的座右铭:当你完成工作时,总是要清理干净。当您在Python之类的语言中使用ZeroMQ时,会自动释放一些内容。但是在使用C语言时,必须小心地释放对象,否则会导致内存泄漏、应用程序不稳定,通常还会产生坏的因果报应。

内存泄漏是一回事,但 ZeroMQ 对您如何退出应用程序也非常挑剔。原因很技术化也很折磨人,但结果是:如果您留下任何打开的 socket,zmq_ctx_destroy() 将永远挂起。而且即使关闭了所有 socket,默认情况下,只要还有挂起的连接或发送,zmq_ctx_destroy() 也会一直等待,除非您在关闭这些 socket 之前把它们的 LINGER(逗留时间)设为零。

我们需要担心的ZeroMQ对象是 messages, sockets, 和 contexts。幸运的是,它非常简单,至少在简单的程序中:

  • 尽可能使用 zmq_send() 和 zmq_recv(),因为这样可以避免直接操作 zmq_msg_t 对象。

  • 如果您确实使用zmq_msg_recv(),那么总是在使用完接收到的消息后立即释放它,方法是调用zmq_msg_close()。

  • 如果您打开和关闭了许多sockets,这可能是您需要重新设计应用程序的标志。在某些情况下,在销毁上下文之前不会释放sockets句柄。

  • 退出程序后,关闭socket,然后调用zmq_ctx_destroy()。这会销毁context。

这至少是C开发的情况。在具有自动对象销毁的语言中,离开作用域时将销毁套接字和上下文。
如果使用异常,则必须在类似“final”块的地方进行清理,这与任何资源都是一样的。

如果你在做多线程的工作,它会变得比这更复杂。我们将在下一章中讨论多线程,但是由于有些人会不顾警告,在安全地行走前先尝试运行,下面是在多线程ZeroMQ应用程序中实现干净退出的快速而又脏的指南。

首先,不要尝试从多个线程使用同一个socket。请不要解释为什么你认为这将是非常有趣的,只是请不要这样做。接下来,您需要关闭具有正在进行的请求的每个socket。正确的方法是设置一个较低的逗留值(1秒),然后关闭socket。如果您的语言绑定在销毁context时没有自动为您完成此任务,我建议发送一个补丁。

最后,销毁context。这将导致任何阻塞接收或轮询或发送附加线程(即,共享context)返回一个错误。捕获该错误,然后设置逗留,关闭该线程中的socket,然后退出。不要两次破坏相同的Context。主线程中的zmq_ctx_destroy将阻塞,直到它所知道的所有socket都安全关闭为止。
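
对单个线程来说,“设置逗留时间、关闭 socket、销毁 context”这一串动作的最小示意如下(socket 和 context 是该线程持有的句柄,名字只是示例):

int linger = 1000;     //  最多等 1 秒钟把尚未送出的消息处理掉
zmq_setsockopt (socket, ZMQ_LINGER, &linger, sizeof (linger));
zmq_close (socket);
//  所有线程都按上面的方式关闭了自己的 socket 之后,再(且只在一个线程里)销毁 context
zmq_ctx_destroy (context);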

瞧!这是非常复杂和痛苦的,任何称职的语言绑定作者都会自动地这样做,使socket关闭舞蹈变得不必要。

为什么我们需要ZeroMQ

既然您已经看到了ZeroMQ的作用,让我们回到“为什么”。

现在的许多应用程序都是由跨越某种网络(LAN或Internet)的组件组成的。因此,许多应用程序开发人员最终都会进行某种消息传递。一些开发人员使用消息队列产品,但大多数时候他们自己使用TCP或UDP来完成。这些协议并不难使用,但是从a向B发送几个字节与以任何一种可靠的方式进行消息传递之间有很大的区别。

让我们看看在开始使用原始TCP连接各个部分时所面临的典型问题。任何可重用的消息层都需要解决所有或大部分问题:

  • 我们如何处理I/O?我们的应用程序是阻塞还是在后台处理I/O ?这是一个关键的设计决策。阻塞I/O会创建伸缩性不好的体系结构。但是后台I/O很难正确地执行。
  • 我们如何处理动态组件,即,暂时消失的碎片?我们是否将组件正式划分为“客户端”和“服务器”,并要求服务器不能消失?如果我们想把服务器连接到服务器呢?我们是否每隔几秒钟就尝试重新连接?
  • 我们如何在网络上表示消息?我们如何设置数据的框架,使其易于读写,不受缓冲区溢出的影响,对小消息有效,但对于那些戴着派对帽子跳舞的猫的大型视频来说,这已经足够了吗?
  • 我们如何处理无法立即交付的消息?特别是,如果我们正在等待一个组件重新联机?我们是丢弃消息,将它们放入数据库,还是放入内存队列?
  • 我们在哪里存储消息队列?如果从队列读取的组件非常慢,导致我们的队列增加,会发生什么?那么我们的策略是什么呢?
  • 我们如何处理丢失的消息?我们是等待新数据、请求重发,还是构建某种确保消息不会丢失的可靠性层?如果这个层本身崩溃了呢?
  • 如果我们需要使用不同的网络传输怎么办?比如说,多播而不是TCP单播?还是IPv6 ?我们是否需要重写应用程序,还是在某个层中抽象传输?
  • 我们如何路由消息?我们可以向多个对等点发送相同的消息吗?我们可以将回复发送回原始请求者吗?
  • 我们如何为另一种语言编写API ?我们是重新实现一个线级协议,还是重新打包一个库?如果是前者,如何保证栈的高效稳定?如果是后者,我们如何保证互操作性?
  • 我们如何表示数据,以便在不同的体系结构之间读取数据?我们是否对数据类型强制执行特定的编码?这是消息传递系统的工作,而不是更高一层的工作。
  • 我们如何处理网络错误?我们是等待并重试,默不作声地忽略它们,还是中止?

以一个典型的开源项目为例,比如 Hadoop ZooKeeper,读一读 src/c/src/zookeeper.c 里的 C API 代码。2013 年 1 月我读这段代码时,它有 4200 行晦涩的代码,里面实现了一个未写入文档的客户端/服务器网络通信协议。我看得出它很高效,因为它用的是 poll 而不是 select。但实际上,ZooKeeper 应该使用一个通用的消息层和明确文档化的有线级协议。对团队来说,一遍又一遍地重复造这种轮子实在太浪费了。

但是如何创建可重用的消息层呢?为什么在如此多的项目需要这种技术的时候,人们仍然在用一种很困难的方式来完成它,在他们的代码中驱动TCP套接字,并一次又一次地解决长列表中的问题?

事实证明,构建可重用的消息传递系统是非常困难的,这就是为什么很少有自由/开源软件项目尝试过,以及为什么商业消息传递产品是复杂的、昂贵的、不灵活的和脆弱的。2006年,iMatix设计了AMQP,它开始为自由/开源软件开发人员提供消息系统的第一个可重用配方。AMQP比其他许多设计都要好,但仍然相对复杂、昂贵和脆弱。学习使用它需要几周的时间,而创建当事情变得棘手时不会崩溃的稳定的体系结构需要几个月的时间。

图 7 - Messaging as it Starts

![Figure 7](https://github.com/imatix/zguide/raw/master/images/fig7.png)

大多数消息传递项目,如AMQP,都试图通过发明一个新的概念“broker”来解决这一长串问题,该概念负责寻址、路由和排队,从而以可重用的方式解决这些问题。这将导致客户机/服务器协议或一些未文档化协议之上的一组api,这些协议允许应用程序与此broker通信。在减少大型网络的复杂性方面,Brokers 是一件很好的事情。但是在Zookeeper这样的产品中添加基于代理的消息会让情况变得更糟,而不是更好。这将意味着添加一个额外的大框和一个新的单点故障。broker 迅速成为一个瓶颈和一个需要管理的新风险。如果软件支持它,我们可以添加第二个、第三个和第四个broker ,并制定一些故障转移方案。人们这样做。它创造了更多的活动部件,更多的复杂性,以及更多需要打破的东西。

以 broker 为中心的架构需要自己的运维团队。您真的需要日夜盯着这些 broker,当它们开始行为不端时用棍子敲打它们。您需要机器,需要备用机器,还需要管理这些机器的人。只有那些由多个团队花费数年时间构建的、包含许多活动部件的大型应用程序,才值得这样做。

图 8 - Messaging as it Becomes

![Figure 8](https://github.com/imatix/zguide/raw/master/images/fig8.png)

因此,中小型应用程序开发人员陷入了困境。它们要么避免网络编程,要么开发不可伸缩的单片应用程序。或者他们跳入网络编程,使脆弱、复杂的应用程序难以维护。或者他们押注于一个消息传递产品,最终开发出可伸缩的应用程序,这些应用程序依赖于昂贵且容易崩溃的技术。一直没有真正好的选择,这也许就是为什么messaging 在很大程度上停留在上个世纪,并激起强烈的情感:对用户来说是负面的,对那些销售支持和许可的人来说是欢欣鼓舞的。

我们需要的是能够完成消息传递功能的东西,但它的实现方式非常简单和廉价,可以在任何应用程序中运行,成本几乎为零。它应该是一个链接的库,没有任何其他依赖关系。没有额外的移动部件,所以没有额外的风险。它应该运行在任何操作系统上,并且可以使用任何编程语言。

这就是ZeroMQ:一个高效的、可嵌入的库,它解决了应用程序需要在不花费太多成本的情况下在网络上保持良好弹性的大部分问题。

特别地:

  • 它在后台线程中异步处理I/O。这些线程使用无锁数据结构与应用程序线程通信,因此并发ZeroMQ应用程序不需要锁、信号量或其他等待状态。

  • 组件可以动态进出,ZeroMQ将自动重新连接。这意味着您可以以任何顺序启动组件。您可以创建“面向服务的体系结构”(service-oriented architecture, soa),其中服务可以随时加入和离开网络。

  • 它在需要时自动对消息进行排队。它很聪明地做到了这一点,在对消息进行排队之前,尽可能地将消息推送到接收端。

  • 它有办法处理过满的队列(称为“高水位”)。当队列已满时,ZeroMQ会根据您正在执行的消息类型(所谓的“模式”)自动阻塞发送者或丢弃消息。

  • 它允许您的应用程序通过任意传输相互通信:TCP、多播、进程内、进程间。您不需要更改代码来使用不同的传输。

  • 它使用依赖于消息传递模式的不同策略安全地处理慢速/阻塞的readers 。

  • 它允许您使用各种模式路由消息,比如请求-应答和发布-订阅。这些模式是取决于你如何创建拓扑结构的,即网络的结构。

  • 它允许您创建代理来通过一个调用对消息进行排队、转发或捕获。代理可以降低网络的互连复杂性。

  • 它通过在网络上使用一个简单的框架,完全按照发送的方式传递整个消息。如果您写了一条10k的消息,您将收到一条10k的消息。

  • 它不将任何格式强加于消息。消息只是从零字节到数 GB 大小的二进制块(blob)。当您需要表示数据时,可以在它之上选用其他产品,例如 msgpack、谷歌的 protocol buffers 等。

  • 它通过在有意义的情况下自动重试来智能地处理网络错误。

  • 它可以减少你的碳足迹。用更少的CPU做更多的事情意味着您的机器使用更少的能量,并且您可以让旧的机器使用更长时间。 Al Gore会喜欢ZeroMQ的。

  • 实际上ZeroMQ做的远不止这些。
    它对如何开发网络应用程序有着颠覆性的影响。从表面上看,它是一个受 socket 启发的 API,您在上面执行 zmq_recv() 和 zmq_send()。但是消息处理很快会成为核心循环,您的应用程序很快就会分解成一组消息处理任务。它优雅而自然。而且它是可伸缩的:每个任务映射到一个节点,节点之间通过任意传输进行通信。同一进程中的两个节点(节点是线程)、同一台机器上的两个节点(节点是进程)、同一网络上的两个节点(节点是机器),都是一样的,应用程序代码无需改动。

Socket Scalability(Socket 的可伸缩性)

让我们看看ZeroMQ的可伸缩性。下面是一个shell脚本,它先启动天气服务器,然后并行地启动一堆客户机:

wuserver &
wuclient 12345 &
wuclient 23456 &
wuclient 34567 &
wuclient 45678 &
wuclient 56789 &

As the clients run, we take a look at the active processes using the top command, and we see something like (on a 4-core box):

PID  USER  PR  NI  VIRT  RES  SHR S %CPU %MEM   TIME+  COMMAND
7136  ph   20   0 1040m 959m 1156 R  157 12.0 16:25.47 wuserver
7966  ph   20   0 98608 1804 1372 S   33  0.0  0:03.94 wuclient
7963  ph   20   0 33116 1748 1372 S   14  0.0  0:00.76 wuclient
7965  ph   20   0 33116 1784 1372 S    6  0.0  0:00.47 wuclient
7964  ph   20   0 33116 1788 1372 S    5  0.0  0:00.25 wuclient
7967  ph   20   0 33072 1740 1372 S    5  0.0  0:00.35 wuclient

让我们想一下这里发生了什么。气象服务器只有一个套接字,但是这里我们让它并行地向五个客户机发送数据。我们可以有成千上万的并发客户端。服务器应用程序不会看到它们,也不会直接与它们对话。所以ZeroMQ套接字就像一台小服务器,默默地接受客户机请求,并以网络最快的速度将数据发送给它们。它是一个多线程服务器,可以从CPU中挤出更多的能量。

从ZeroMQ v2.2升级到ZeroMQ v3.2

Compatible Changes(兼容变更)

这些更改不会直接影响现有的应用程序代码:

  • Pub-sub 过滤现在在 publisher 端而不是 subscriber 端进行,这在许多 pub-sub 用例中显著提高了性能。You can mix v3.2 and v2.1/v2.2 publishers and subscribers safely.

  • ZeroMQ v3.2 has many new API methods (zmq_disconnect(), zmq_unbind(), zmq_monitor(), zmq_ctx_set(), etc.)

不兼容的变更

这些是影响应用程序和语言绑定的主要领域:(These are the main areas of impact on applications and language bindings):

  • Changed send/recv methods: zmq_send() and zmq_recv() have a different, simpler interface, and the old functionality is now provided by zmq_msg_send() and zmq_msg_recv(). Symptom: compile errors. Solution: fix up your code.

  • 这两个方法成功时返回正值,出错时返回 -1。在 v2.x 中,它们成功时总是返回零。症状:代码明明工作正常却被报告为出错。解决方案:严格用“返回码 == -1”判断错误,而不是“非零”。

  • zmq_poll() 现在以毫秒而不是微秒为单位等待。症状:应用程序停止响应(实际上是响应慢了 1000 倍)。解决方案:在所有 zmq_poll 调用中使用下面定义的 ZMQ_POLL_MSEC 宏。

  • ZMQ_NOBLOCK 现在称为 ZMQ_DONTWAIT。症状:在 ZMQ_NOBLOCK 宏上编译失败。

  • ZMQ_HWM socket 选项现在拆分为 ZMQ_SNDHWM 和 ZMQ_RCVHWM。症状:在 ZMQ_HWM 宏上编译失败。

  • 大多数(但不是全部)zmq_getsockopt() 选项现在是整数值。症状:zmq_setsockopt 和 zmq_getsockopt 返回运行时错误。

  • ZMQ_SWAP 选项已被删除。症状:在 ZMQ_SWAP 上编译失败。解决方案:重新设计使用此功能的任何代码。

Suggested Shim Macros

对于希望同时运行在 v2.x 和 v3.2 上的应用程序(例如语言绑定),我们的建议是尽可能模拟 v3.2 的行为。下面这些 C 宏定义(取自 CZMQ)可以帮助您的 C/C++ 代码在两个版本之间通用:

#ifndef ZMQ_DONTWAIT
#   define ZMQ_DONTWAIT     ZMQ_NOBLOCK
#endif
#if ZMQ_VERSION_MAJOR == 2
#   define zmq_msg_send(msg,sock,opt) zmq_send (sock, msg, opt)
#   define zmq_msg_recv(msg,sock,opt) zmq_recv (sock, msg, opt)
#   define zmq_ctx_destroy(context) zmq_term(context)
#   define ZMQ_POLL_MSEC    1000        //  zmq_poll is usec
#   define ZMQ_SNDHWM ZMQ_HWM
#   define ZMQ_RCVHWM ZMQ_HWM
#elif ZMQ_VERSION_MAJOR == 3
#   define ZMQ_POLL_MSEC    1           //  zmq_poll is msec
#endif

Warning: 不稳定的范例!

传统的网络编程建立在一个socket 与一个connection、一个peer通信的一般假设之上。有多播协议,但这些都是外来的。当我们假设““one socket = one connection””时,我们以某种方式扩展架构。我们创建逻辑线程,其中每个线程使用一个socket和一个peer。我们在这些线程中放置intelligence 和状态。
在ZeroMQ领域,sockets是快速后台通信引擎的入口,这些引擎可以自动地为您管理一整套连接。您无法查看、处理、打开、关闭或将状态附加到这些连接。无论您使用阻塞发送、接收或轮询,您只能与socket通信,而不是它为您管理的连接。连接是私有的,不可见的,这是ZeroMQ可伸缩性的关键。
这是因为,与socket通信的代码可以处理任意数量的连接,而无需更改周围的任何网络协议。ZeroMQ中的消息传递模式比应用程序代码中的消息传递模式扩展得更便宜。

所以一般的假设不再适用。当您阅读代码示例时,您的大脑将尝试将它们映射到您所知道的内容。您将读取“socket”并认为“啊,这表示到另一个节点的连接”。这是错误的。当你读到“thread”时,你的大脑又会想,“啊,一个thread代表了与另一个节点的连接”,你的大脑又会出错。
如果你是第一次读本指南,要意识到:在你真正写上一两天(也许三四天)ZeroMQ 代码之前,你可能会感到困惑,尤其是 ZeroMQ 把事情变得如此简单这一点;你可能会试着把那个普遍假设强加给 ZeroMQ,但那行不通。然后你会迎来属于你的启蒙与信任的时刻:当一切豁然开朗时,那种 zap-pow-kaboom satori 式的范式转换瞬间。

Chapter 2 - Sockets and Patterns

在第1章—基础知识中,我们将ZeroMQ作为驱动器,并提供了一些主要ZeroMQ模式的基本示例:请求-应答、发布-订阅和管道。在本章中,我们将亲自动手,开始学习如何在实际程序中使用这些工具。

我们将讨论:

  • 如何创建和使用 ZeroMQ sockets。
  • 如何在 sockets 上发送和接收消息。
  • 如何围绕 ZeroMQ 的异步 I/O 模型构建应用程序。
  • 如何在一个线程中处理多个 sockets。
  • 如何正确处理致命和非致命错误。
  • 如何处理像 Ctrl-C 这样的中断信号。
  • 如何干净地关闭 ZeroMQ 应用程序。
  • 如何检查 ZeroMQ 应用程序的内存泄漏。
  • 如何发送和接收多部分消息。
  • 如何跨网络转发消息。
  • 如何构建一个简单的消息队列代理(broker)。
  • 如何使用 ZeroMQ 编写多线程应用程序。
  • 如何使用 ZeroMQ 在线程之间传递信号。
  • 如何使用 ZeroMQ 协调一个节点网络。
  • 如何为发布-订阅创建和使用消息信封。
  • 如何使用高水位标记(HWM, high-water mark)防止内存溢出。

The Socket API

说句老实话,ZeroMQ对你耍了个花招,对此我们不道歉。这是为了你好,我们比你更伤心。ZeroMQ提供了一个熟悉的基于Socket的API,要隐藏一堆消息处理引擎需要付出很大的努力。然而,结果将慢慢地修正您关于如何设计和编写分布式软件的世界观。

socket 实际上是网络编程的事实标准 API(顺带还能防止你的眼珠子惊得掉到脸颊上)。ZeroMQ 对开发人员特别有吸引力的一点是,它使用的是 socket 和消息,而不是另外一套随意发明的概念。这要归功于 Martin Sustrik。它把“面向消息的中间件”这个保证让全场昏昏欲睡的说法,变成了“加辣升级版(Extra Spicy)的 socket!”,让我们莫名其妙地馋起了披萨,并渴望了解更多。
就像最喜欢的菜一样,ZeroMQ socket 很容易消化。socket 的生命由四个部分组成,就像 BSD socket 一样:

  • 创建和销毁 socket,两者合起来构成 socket 生命的业力循环 (see zmq_socket(), zmq_close()).

  • 通过设置套接字上的选项并在必要时检查它们来配置套接字(see zmq_setsockopt(), zmq_getsockopt()).

  • 通过创建从 socket 出发和到达 socket 的 ZeroMQ 连接,把 socket 接入网络拓扑 (see zmq_bind(), zmq_connect()).

  • 通过在 socket 上写入和读取消息来用它承载数据 (see zmq_msg_send(), zmq_msg_recv()).

注意,套接字总是空指针,消息(我们很快就会讲到)是结构。所以在C语言中,按原样传递sockets ,但是在所有处理消息的函数中传递消息的地址,比如zmq_msg_send()和zmq_msg_recv()。作为一个助记符,请认识到“在ZeroMQ中,您所有的sockets 都属于我们”,但是消息实际上是您在代码中拥有的东西。
创建、销毁和配置Sockets的工作原理与您对任何对象的期望一样。但是请记住ZeroMQ是一个异步的、有弹性的结构。这对我们如何将Sockets插入网络拓扑以及之后如何使用Sockets有一定的影响。

将Sockets插入拓扑中

要在两个节点之间创建连接,可以在一个节点中使用zmq_bind(),在另一个节点中使用zmq_connect()。一般来说,执行zmq_bind()的节点是一个“服务器”,位于一个已知的网络地址上,执行zmq_connect()的节点是一个“客户机”,具有未知或任意的网络地址。因此,我们说“将socket 绑定到端点”和“将socket 连接到端点”,端点就是那个已知的网络地址。

ZeroMQ连接与传统TCP连接有些不同。主要的显著差异是:

  • They go across an arbitrary transport (inproc, ipc, tcp, pgm, or epgm). See zmq_inproc(), zmq_ipc(), zmq_tcp(), zmq_pgm(), and zmq_epgm().

  • 一个 socket 可以有许多传出和传入的连接。

  • 没有 zmq_accept() 方法。当 socket 绑定到端点时,它会自动开始接受连接。

  • 网络连接本身发生在后台,如果网络连接中断,ZeroMQ将自动重新连接(例如,如果peer 消失,然后返回)。

  • 您的应用程序代码不能直接使用这些连接;它们被封装在socket下面。

许多架构遵循某种客户机/服务器模型,其中服务器是最静态的组件,而客户机是最动态的组件,即,他们来了又走的最多。有时存在寻址问题:服务器对客户机可见,但是反过来不一定是这样的。因此,很明显,哪个节点应该执行zmq_bind()(服务器),而哪个节点应该执行zmq_connect()(客户机)。它还取决于您使用的socket的类型,对于不常见的网络体系结构有一些例外。稍后我们将研究socket类型。

现在,假设在启动服务器之前先启动客户机。在传统的网络中,我们会看到一个大大的红色失败标志。但是ZeroMQ让我们任意地开始和停止。只要客户机节点执行zmq_connect(),连接就存在,该节点就可以开始向socket写入消息。在某个阶段(希望是在消息排队太多而开始被丢弃或客户机阻塞之前),服务器会启动,执行zmq_bind(),然后ZeroMQ开始传递消息。
一个服务器节点可以绑定到许多端点(即协议和地址的组合),并且它可以使用一个socket来实现这一点。这意味着它将接受跨不同传输的连接:

zmq_bind (socket, "tcp://*:5555");
zmq_bind (socket, "tcp://*:9999");
zmq_bind (socket, "inproc://somename");

对于大多数传输,不能像UDP那样两次绑定到同一个端点。然而,ipc传输允许一个进程绑定到第一个进程已经使用的端点。这意味着允许进程在崩溃后恢复。
虽然ZeroMQ试图对哪边绑定和哪边连接保持中立,但还是有区别的。稍后我们将更详细地看到这些。其结果是,您通常应该将“服务器”视为拓扑的静态部分,它绑定到或多或少固定的端点,而将“客户机”视为动态部分,它们来来去去并连接到这些端点。然后,围绕这个模型设计应用程序。它“正常工作”的可能性要大得多。

Sockets 有多个类型。Socket类型定义Sockets的语义、Socket向内和向外路由消息的策略、队列等。您可以将某些类型的Socket连接在一起,例如,publisher Socket和subscriber Socket。Socket在“messaging patterns”中协同工作。稍后我们将更详细地讨论这个问题。
正是能够以这些不同的方式连接Sockets,使ZeroMQ具备了作为消息队列系统的基本功能。在此之上还有一些层,比如代理,我们稍后将讨论它。但从本质上讲,使用ZeroMQ,您可以像孩子的积木玩具一样将各个部分拼接在一起,从而定义您的网络体系结构。

发送和接收消息

要发送和接收消息,可以使用zmq_msg_send()和zmq_msg_recv()方法。这些名称都是传统的,但是ZeroMQ的I/O模型与传统的TCP模型有很大的不同,您需要时间来理解它。

图 9 - TCP sockets are 1 to 1

![Figure 9](https://github.com/imatix/zguide/raw/master/images/fig9.png)

让我们来看看TCP sockets和ZeroMQ sockets在处理数据方面的主要区别:

  • ZeroMQ套接字像UDP一样携带消息,而不像TCP那样携带字节流。ZeroMQ消息是长度指定的二进制数据。我们很快就会讲到信息;它们的设计是针对性能进行优化的,因此有点棘手。
  • ZeroMQ套接字在后台线程中执行I/O。这意味着消息到达本地输入队列并从本地输出队列发送,无论您的应用程序在忙什么。
  • 根据socket类型,ZeroMQ sockets具有内置的1对n路由行为。

zmq_send()方法实际上并不将消息发送到socket connection(s)。它对消息进行排队,以便I/O线程可以异步发送消息。它不会阻塞,除非在某些异常情况下。因此,当zmq_send()返回到应用程序时,不一定要发送消息。

单播传输(Unicast Transports)

ZeroMQ提供了一组单播传输(inproc、ipc和tcp)和多播传输(epgm、pgm)。多播是一种先进的技术,我们稍后会讲到。不要开始使用它,除非你知道你的扇出比将使1到n单播不可能(Don’t even start using it unless you know that your fan-out ratios will make 1-to-N unicast impossible.)。

对于大多数常见的情况,使用tcp,这是一个断开连接式(disconnected )的tcp传输。它是弹性的,便携式的,和足够快的大多数情况下。我们将此称为断开连接式(disconnected ),因为ZeroMQ的tcp传输不需要在连接到端点之前存在端点。客户机和服务器可以随时连接和绑定,可以来回切换,并且对应用程序保持透明。

进程间传输 ipc 也是无连接式(disconnected)的,和 tcp 一样。它有一个限制:目前还不能在 Windows 上运行。按照惯例,我们使用带有“.ipc”扩展名的端点名称,以避免与其他文件名潜在的冲突。在 UNIX 系统上,如果使用 ipc 端点,则需要用适当的权限创建它们,否则在不同用户 id 下运行的进程之间可能无法共享。您还必须确保所有进程都能访问这些文件,例如让它们运行在相同的工作目录中。

线程间传输(inproc)是一种连接(connected )的信号传输。它比tcp或ipc快得多。与tcp和ipc相比,这种传输有一个特定的限制:服务器必须在任何客户机发出连接之前发出绑定。这是ZeroMQ的未来版本可能会解决的问题,但目前这定义了如何使用inproc套接字。我们创建并绑定一个socket,并启动子线程,子线程创建并连接其他socket。

ZeroMQ 不是中立的载体(Neutral Carrier)

ZeroMQ新手常问的一个问题(我自己也问过这个问题)是:“如何用ZeroMQ编写XYZ服务器?”
例如,“如何用ZeroMQ编写HTTP服务器?”这意味着,如果我们使用普通sockets 来承载HTTP请求和响应,我们应该能够使用ZeroMQsockets 来做同样的事情,只是更快更好。

答案曾经是“事情不是这样的”。ZeroMQ并不是一个中立的载体:它在使用的传输协议上强加了一个框架。这种帧与现有协议不兼容,现有协议倾向于使用自己的帧。例如,比较TCP/IP上的HTTP请求和ZeroMQ请求。

图 10 - HTTP on the Wire

![Figure 10](https://github.com/imatix/zguide/raw/master/images/fig10.png)

HTTP请求使用CR-LF作为最简单的帧分隔符,而ZeroMQ使用指定长度的帧。因此,您可以使用ZeroMQ编写类似http的协议,例如使用request-reply socket模式。但它不是HTTP。

图 11 - ZeroMQ on the Wire

![Figure 11](https://github.com/imatix/zguide/raw/master/images/fig11.png)

但是,从v3.3开始,ZeroMQ就有一个名为ZMQ_ROUTER_RAW的套接字选项,允许您在不使用ZeroMQ帧的情况下读写数据。您可以使用它来读写正确的HTTP请求和响应。Hardeep Singh对此做出了贡献,这样他就可以从ZeroMQ应用程序连接到Telnet服务器。在编写本文时,这还处于试验阶段,但它显示了ZeroMQ如何不断发展以解决新问题。也许下一个补丁就是你的了。

I/O Threads

我们说过 ZeroMQ 在后台线程中执行 I/O。对于除最极端情况之外的所有应用程序,一个 I/O 线程(供所有 socket 共用)就足够了。当您创建一个新的 context 时,它默认带一个 I/O 线程。一般的经验法则是:每秒每进出 1 GB 数据,分配一个 I/O 线程。要增加 I/O 线程的数量,请在创建任何 socket 之前使用 zmq_ctx_set() 调用:

int io_threads = 4;
void *context = zmq_ctx_new ();
zmq_ctx_set (context, ZMQ_IO_THREADS, io_threads);
assert (zmq_ctx_get (context, ZMQ_IO_THREADS) == io_threads);

我们已经看到一个 socket 可以同时处理几十个、甚至数千个连接。这对如何编写应用程序有着根本性的影响。传统的网络应用程序为每个远程连接配一个进程或线程,该进程或线程处理一个 socket。ZeroMQ 允许您把整个结构折叠进一个进程,然后再按可伸缩性的需要把它拆分开。

如果您只把 ZeroMQ 用于线程间通信(即没有外部 socket I/O 的多线程应用程序),可以把 I/O 线程数设置为零。这算不上一个重要的优化,更多只是个趣味性的细节。

消息传递模式(Messaging Patterns)

在ZeroMQ socket API的牛皮纸包装下,隐藏着消息传递模式的世界。如果您有企业消息传递方面的背景知识,或者熟悉UDP,那么您对这些可能会有些熟悉。但对ZeroMQ的大多数新来者来说,它们是一个惊喜。我们非常习惯TCP范例,其中socket 一对一地映射到另一个节点。

让我们简要回顾一下ZeroMQ为您做了什么。它将数据块(消息)快速有效地交付给节点。您可以将节点映射到线程、进程或节点。ZeroMQ为您的应用程序提供了一个可以使用的socket API,而不管实际的传输是什么(比如进程内、进程间、TCP或多播)。它会在同行来来去去时自动重新连接到他们。它根据需要在发送方和接收方对消息进行排队。它限制这些队列,以防止进程耗尽内存。它处理socket 错误。它在后台线程中执行所有I/O操作。它使用无锁技术在节点之间进行通信,因此从不存在锁、等待、信号量或死锁。

但是,它会根据称为模式的精确配方路由和排队消息。正是这些模式提供了ZeroMQ的智能。它们浓缩了我们来之不易的经验,即最好的数据和工作分发方式。ZeroMQ的模式是硬编码的,但是未来的版本可能允许用户定义模式。

ZeroMQ模式由具有匹配类型的 sockets 对实现。换句话说,要理解ZeroMQ模式,您需要了解sockets 类型及其协同工作的方式。大多数情况下,这只需要学习;在这个层面上,没有什么是显而易见的。
内置的核心ZeroMQ模式是:

  • Request-reply: 它将一组客户机连接到一组服务。这是一个远程过程调用和任务分发模式。

  • Pub-sub:它将一组发布者连接到一组订阅者。这是一个数据发布模式。

  • Pipeline:它以扇出/扇入模式连接节点,该模式可以有多个步骤和循环。
    这是一个并行的任务分发和收集模式

  • Exclusive pair:只把两个 socket 连接在一起。这是一个用于连接进程内两个线程的模式,不要与“普通的”socket 对混淆。

我们在第 1 章基础知识中讨论了前三种模式,本章后面会看到 exclusive pair 模式。zmq_socket() 的手册页对这些模式讲得非常清楚,值得反复读几遍,直到开始理解为止。下面这些 socket 组合按“连接-绑定”配对是有效的(任何一方都可以绑定):

  • PUB and SUB

  • REQ and REP

  • REQ and ROUTER (注意,REQ插入了一个额外的空帧)

  • DEALER and REP (注意,REP插入了一个额外的空帧)

  • DEALER and ROUTER

  • DEALER and DEALER

  • ROUTER and ROUTER

  • PUSH and PULL

  • PAIR and PAIR

您还将看到对XPUB和XSUB sockets的引用,我们稍后将对此进行讨论(它们类似于PUB和SUB的原始版本)。任何其他组合都将产生未文档化和不可靠的结果,如果您尝试ZeroMQ的未来版本,可能会返回错误。当然,您可以并且将通过代码桥接其他socket类型,即,从一个socket类型读取并写入到另一个socket类型。

High-Level Messaging Patterns

这四个核心模式被煮成ZeroMQ。它们是ZeroMQ API的一部分,在核心c++库中实现,并保证可以在所有优秀的零售商店中使用。
除此之外,我们还添加了高级消息传递模式。我们在ZeroMQ的基础上构建这些高级模式,并在应用程序中使用的任何语言中实现它们。它们不是核心库的一部分,不附带ZeroMQ包,并且作为ZeroMQ社区的一部分存在于它们自己的空间中。例如,我们在可靠的请求-应答模式中探索的Majordomo模式位于ZeroMQ组织中的GitHub Majordomo项目中。
在本书中,我们的目标之一是为您提供一组这样的高级模式,包括小型模式(如何明智地处理消息)和大型模式(如何构建可靠的发布子体系结构)。

处理消息(Working with Messages)

libzmq核心库实际上有两个api来发送和接收消息。我们已经看到和使用的zmq_send()和zmq_recv()方法都是简单的一行程序。我们将经常使用这些方法,但是zmq_recv()不擅长处理任意消息大小:它将消息截断为您提供的任何缓冲区大小。所以有第二个API与zmq_msg_t结构一起工作,它有一个更丰富但更困难的API:

  • Initialise a message: zmq_msg_init(), zmq_msg_init_size(), zmq_msg_init_data().
  • Sending and receiving a message: zmq_msg_send(), zmq_msg_recv().
  • Release a message: zmq_msg_close().
  • 访问 message content: zmq_msg_data(), zmq_msg_size(), zmq_msg_more().
  • 处理消息属性 message properties: zmq_msg_get(), zmq_msg_set().
  • Message manipulation: zmq_msg_copy(), zmq_msg_move().

在网络上,ZeroMQ消息是从零开始的任何大小的块,大小都适合存储在内存中。您可以使用协议缓冲区、msgpack、JSON或应用程序需要使用的任何其他东西来进行自己的序列化。选择一个可移植的数据表示形式是明智的,但是您可以自己做出关于权衡的决定。

在内存中,ZeroMQ消息是zmq_msg_t结构(或类,取决于您的语言)。下面是在C语言中使用ZeroMQ消息的基本规则:

  • 创建并传递zmq_msg_t对象,而不是数据块。
  • 要读取消息,可以使用zmq_msg_init()创建一个空消息,然后将其传递给zmq_msg_recv()。
  • 要从新数据中编写一条消息,可以使用zmq_msg_init_size()创建一条消息,同时分配某个大小的数据块。然后使用memcpy填充数据,并将消息传递给zmq_msg_send()。
  • 要释放(而不是销毁)消息,可以调用zmq_msg_close()。这将删除引用,最终ZeroMQ将销毁消息。
  • 要访问消息内容,可以使用zmq_msg_data()。要知道消息包含多少数据,可以使用zmq_msg_size()。
  • 不要使用zmq_msg_move()、zmq_msg_copy()或zmq_msg_init_data(),除非您阅读了手册页并确切地知道为什么需要这些。
  • 当您把一条消息传给 zmq_msg_send() 之后,ØMQ 会清空该消息,即把它的大小设为零。您不能把同一条消息发送两次,也不能在发送之后再访问消息数据。
  • 如果您使用zmq_send()和zmq_recv(),而不是消息结构,这些规则将不适用。

如果您希望多次发送相同的消息,并且消息大小相当,那么创建第二个消息,使用zmq_msg_init()初始化它,然后使用zmq_msg_copy()创建第一个消息的副本。这不是复制数据,而是复制引用。然后可以发送消息两次(如果创建了更多副本,则可以发送两次或多次),并且只有在发送或关闭最后一个副本时才最终销毁消息。
ZeroMQ还支持多部分消息,它允许您以单个在线消息的形式发送或接收帧列表。这在实际应用程序中得到了广泛的应用,我们将在本章后面和高级请求-应答模式中对此进行研究。帧(在ZeroMQ参考手册页面中也称为“消息部件”)是ZeroMQ消息的基本有线格式。帧是指定长度的数据块。长度可以是0以上。如果您已经做过TCP编程,您就会明白为什么帧是“我现在应该读取多少关于这个网络socket的数据”这个问题的有用答案。

有一个称为ZMTP的线级协议,它定义了ZeroMQ如何在TCP连接上读写帧。如果您对它的工作原理感兴趣,那么这个规范非常简短。
最初,ZeroMQ消息是一个帧,就像UDP一样。稍后,我们使用多部分消息对此进行了扩展,这些消息非常简单,就是一系列帧,其中“more”位设置为1,然后是一个位设置为0的帧。然后ZeroMQ API允许您编写带有“more”标志的消息,当您读取消息时,它允许您检查是否有“more”。
因此,在底层ZeroMQ API和参考手册中,消息和帧之间存在一些模糊。所以这里有一个有用的词汇:

  • 消息可以是一个或多个部分。
  • 这些部分也称为“帧(frame)”。
  • 每个部分都是zmq_msg_t对象。
  • 您在底层API中分别发送和接收每个部分。
  • 高级api提供包装器来发送整个多部分消息。

还有一些关于信息值得了解的事情:

  • 您可以发送零长度的消息,例如,从一个线程发送信号到另一个线程。
  • ZeroMQ 保证要么交付消息的所有部分(一个或多个),要么一个也不交付。
  • ZeroMQ 不会立即发送消息(单个或多个部分),而是在稍后某个不确定的时间发送。
  • 消息(单个或多个部分)必须能装进内存。如果您想发送任意大小的文件,应该把它们切成小块,并把每一块作为单独的单部分消息发送。使用多部分消息并不会减少内存消耗。
  • 在那些作用域结束时不会自动销毁对象的语言中,接收完一条消息后必须调用 zmq_msg_close()。发送消息之后不需要调用此方法。

重复一下,不要使用zmq_msg_init_data()。这是一种零拷贝的方法,肯定会给您带来麻烦。在您开始担心减少微秒之前,还有许多更重要的事情需要了解ZeroMQ。
使用这个丰富的API可能会很累。这些方法是针对性能而不是简单性进行优化的。如果你开始使用这些,你几乎肯定会弄错,直到你仔细阅读手册页。因此,一个好的语言绑定的主要工作之一就是将这个API封装在更容易使用的类中。

处理多个Sockets(Handling Multiple Sockets)

在到目前为止的所有例子中,大多数例子的主循环是:

  • 1. 等待 socket 上的消息。
  • 2. 处理消息。
  • 3. 重复。

如果我们想同时读取多个端点呢?最简单的方法是将一个socket连接到所有端点,并让ZeroMQ为我们执行扇入。如果远程端点使用相同的模式,这是合法的,但是将PULL socket连接到PUB端点将是错误的。
要同时读取多个sockets,可以使用zmq_poll()。更好的方法可能是将zmq_poll()封装在一个框架中,该框架将其转换为一个不错的事件驱动的反应器,但是它的工作量比我们在这里要介绍的多得多。
让我们从一个脏的hack开始,部分原因是为了好玩,但主要是因为它让我向您展示如何进行非阻塞socket 读取。下面是一个使用非阻塞读取从两个sockets读取的简单示例。这个相当混乱的程序既是天气更新的订阅者,又是并行任务的工作人员:

[msreader: Multiple socket reader in C]

C++ | C# | Clojure | CL | Delphi | Erlang | F# | Felix | Go | Java | Lua | Objective-C | Perl | PHP | Python | Ruby | Scala | Tcl | Ada | Basic | Haskell | Haxe | Node.js | ooc | Q | Racket

这种方法的代价是对第一个消息(循环末尾的休眠,当没有等待消息要处理时)增加一些延迟。在亚毫秒级延迟非常重要的应用程序中,这将是一个问题。此外,您还需要检查nanosleep()或其他函数的文档,以确保它不繁忙循环。
您可以通过先读取一个套接字,然后读取第二个套接字来公平地对待套接字,而不是像我们在本例中所做的那样对它们进行优先级排序。
现在让我们看看同样毫无意义的小应用程序,使用zmq_poll():
[mspoller: Multiple socket poller in C]

C++ | C# | Clojure | CL | Delphi | Erlang | F# | Felix | Go | Haskell | Java | Lua | Node.js | Objective-C | Perl| PHP | Python | Ruby | Scala | Tcl | Ada | Basic | Haxe | ooc | Q | Racket
The items structure has these four members:

typedef struct {
    void *socket;       //  ZeroMQ socket to poll on
    int fd;             //  OR, native file handle to poll on
    short events;       //  Events to poll on
    short revents;      //  Events returned after poll
} zmq_pollitem_t;
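
把它们组合起来,用 zmq_poll() 同时等待两个 socket 的主循环大致如下(receiver、subscriber 假定是已经创建并连接好的两个 socket,名字只是示例):

zmq_pollitem_t items [] = {
    { receiver,   0, ZMQ_POLLIN, 0 },
    { subscriber, 0, ZMQ_POLLIN, 0 }
};
while (1) {
    zmq_msg_t message;
    zmq_poll (items, 2, -1);                   //  -1 表示一直等到有事件为止
    if (items [0].revents & ZMQ_POLLIN) {
        zmq_msg_init (&message);
        zmq_msg_recv (&message, receiver, 0);
        //  处理任务
        zmq_msg_close (&message);
    }
    if (items [1].revents & ZMQ_POLLIN) {
        zmq_msg_init (&message);
        zmq_msg_recv (&message, subscriber, 0);
        //  处理天气更新
        zmq_msg_close (&message);
    }
}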

多部分消息(Multipart Messages)

ZeroMQ让我们用几个帧组成一个消息,给我们一个“多部分消息”。现实的应用程序大量使用多部分消息,既用于包装带有地址信息的消息,也用于简单的序列化。稍后我们将查看回复信封。

我们现在要学习的只是如何盲目而安全地读写任何应用程序(例如代理)中的多部分消息,这些应用程序需要在不检查消息的情况下转发消息。

当你处理多部分消息时,每个部分都是zmq_msg项。例如,如果要发送包含五个部分的消息,必须构造、发送和销毁五个zmq_msg项。您可以预先执行此操作(并将zmq_msg项存储在数组或其他结构中),或者在发送它们时逐个执行。

这是我们如何发送帧在一个多部分的消息(我们接收每帧到一个消息对象):

zmq_msg_send (&message, socket, ZMQ_SNDMORE);
…
zmq_msg_send (&message, socket, ZMQ_SNDMORE);
…
zmq_msg_send (&message, socket, 0);

下面是我们如何接收和处理一个消息中的所有部分,无论是单个部分还是多个部分:

while (1) {
    zmq_msg_t message;
    zmq_msg_init (&message);
    zmq_msg_recv (&message, socket, 0);
    //  Process the message frame
    …
    int more = zmq_msg_more (&message);   //  在关闭消息之前检查是否还有后续帧
    zmq_msg_close (&message);
    if (!more)
        break;      //  Last message frame
}

关于多部分消息需要知道的一些事情:

  • 当您发送一个多部分消息时,第一部分(以及所有后续部分)只有在您发送最后一部分时才实际通过网络发送。
  • 如果您正在使用zmq_poll(),当您接收到消息的第一部分时,其他部分也都已经到达。
  • 您将接收到消息的所有部分,或者完全不接收。
  • 消息的每个部分都是一个单独的zmq_msg项。
  • 无论是否选中more属性,都将接收消息的所有部分。
  • 发送时,ZeroMQ将消息帧在内存中排队,直到最后一个消息帧被接收,然后将它们全部发送出去。
  • 除了关闭套接字外,无法取消部分发送的消息。

中介和代理 Intermediaries and Proxies

ZeroMQ的目标是分散智能,但这并不意味着你的网络是中间的空白空间。它充满了消息感知的基础设施,通常,我们使用ZeroMQ构建该基础设施。ZeroMQ管道可以从很小的管道到成熟的面向服务的brokers。消息传递行业将此称为中介,即中间的内容处理任何一方。在ZeroMQ中,我们根据context调用这些代理、队列、转发器、设备或brokers。

这种模式在现实世界中极为常见,这也是为什么我们的社会和经济中充斥着中介机构,它们除了降低大型网络的复杂性和规模成本外,没有其他实际功能。真实的中介通常称为批发商、分销商、经理等等。

动态发现问题 The Dynamic Discovery Problem

在设计大型分布式架构时,您将遇到的问题之一是发现。也就是说,各个部分是如何相互了解的?这是特别困难的,如果部分来了又走了,所以我们称之为“动态发现问题”。

动态发现有几种解决方案。最简单的方法是通过硬编码(或配置)网络体系结构来完全避免这种情况,以便手工完成发现。也就是说,当您添加一个新片段时,您将重新配置网络以了解它。

图 12 - 小规模的发布-订阅网络(Small-Scale Pub-Sub Network)

![Figure 12](https://github.com/imatix/zguide/raw/master/images/fig12.png)

在实践中,这将导致越来越脆弱和笨拙的体系结构。假设有一个发布者和100个订阅者。通过在每个订阅服务器中配置一个发布服务器端点,可以将每个订阅服务器连接到发布服务器。这很简单。用户是动态的;发布者是静态的。现在假设您添加了更多的发布者。突然间,它不再那么容易了。如果您继续将每个订阅者连接到每个发布者,那么避免动态发现的成本就会越来越高。

Figure 13 -使用代理的发布-订阅网络 Pub-Sub Network with a Proxy

![Figure 13](https://github.com/imatix/zguide/raw/master/images/fig13.png)

对此有很多答案,但最简单的答案是添加中介;也就是说,网络中所有其他节点都连接到的一个静态点。在传统的消息传递中,这是消息代理的工作。ZeroMQ没有提供这样的消息代理,但是它让我们可以很容易地构建中介。

您可能想知道,如果所有网络最终都变得足够大,需要中介体,那么为什么不为所有应用程序设置一个message broker呢?对于初学者来说,这是一个公平的妥协。只要始终使用星型拓扑结构,忘记性能,事情就会正常工作。然而,消息代理是贪婪的;作为中央中介人,它们变得太复杂、太有状态,最终成为一个问题。

最好将中介看作简单的无状态消息交换机。一个很好的类比是HTTP代理;它在那里,但没有任何特殊的作用。在我们的示例中,添加一个 pub-sub代理解决了动态发现问题。我们在网络的“中间”设置代理。代理打开一个XSUB套接字、一个XPUB套接字,并将每个套接字绑定到已知的IP地址和端口。然后,所有其他进程都连接到代理,而不是彼此连接。添加更多订阅者或发布者变得很简单。

Figure 14 - Extended Pub-Sub

![Figure 14](https://github.com/imatix/zguide/raw/master/images/fig14.png)

我们需要XPUB和XSUB套接字,因为ZeroMQ从订阅者到发布者执行订阅转发。XSUB和XPUB与SUB和PUB完全一样,只是它们将订阅公开为特殊消息。代理必须通过从XPUB套接字读取这些订阅消息并将其写入XSUB套接字,从而将这些订阅消息从订阅方转发到发布方。这是XSUB和XPUB的主要用例。
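
这样一个 pub-sub 代理的核心可以非常短。下面是一个最小示意,直接借用后文介绍的 zmq_proxy() 来完成双向搬运(端口号是本示意的假设):

//  pub-sub 代理:XSUB 面向发布者,XPUB 面向订阅者
void *context = zmq_ctx_new ();

void *frontend = zmq_socket (context, ZMQ_XSUB);
zmq_bind (frontend, "tcp://*:5557");      //  发布者 connect 到这里

void *backend = zmq_socket (context, ZMQ_XPUB);
zmq_bind (backend, "tcp://*:5558");       //  订阅者 connect 到这里

//  zmq_proxy 在两个方向搬运消息,并把订阅消息从 XPUB 一侧转发回 XSUB 一侧
zmq_proxy (frontend, backend, NULL);

zmq_close (frontend);
zmq_close (backend);
zmq_ctx_destroy (context);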

共享队列(Shared Queue,DEALER 与 ROUTER sockets)

在Hello World客户机/服务器应用程序中,我们有一个客户机与一个服务通信。然而,在实际情况中,我们通常需要允许多个服务和多个客户机。这让我们可以扩展服务的功能(许多线程、进程或节点,而不是一个)。唯一的限制是服务必须是无状态的,所有状态都在请求中,或者在一些共享存储(如数据库)中。

Figure 15 -请求分发 Request Distribution

![Figure 15](https://github.com/imatix/zguide/raw/master/images/fig15.png)

有两种方法可以将多个客户机连接到多个服务器。蛮力方法是将每个客户端套接字连接到多个服务端点。一个客户端套接字可以连接到多个服务套接字,然后REQ套接字将在这些服务之间分发请求。假设您将一个客户端套接字连接到三个服务端点;客户机请求R1、R2、R3、R4。R1和R4是服务A的,R2是服务B的,R3是服务C的。

这种设计可以让您更便宜地添加更多的客户端。您还可以添加更多的服务。每个客户端将其请求分发给服务。但是每个客户机都必须知道服务拓扑。如果您有100个客户机,然后决定再添加3个服务,那么您需要重新配置并重新启动100个客户机,以便客户机了解这3个新服务。

这显然不是我们想在凌晨3点做的事情,因为我们的超级计算集群已经耗尽了资源,我们迫切需要添加几百个新的服务节点。太多的静态部分就像液体混凝土:知识是分散的,你拥有的静态部分越多,改变拓扑结构的努力就越大。我们想要的是位于客户机和服务之间的东西,它集中了拓扑的所有知识。理想情况下,我们应该能够在任何时候添加和删除服务或客户机,而不需要触及拓扑的任何其他部分。

因此,我们将编写一个小消息队列代理来提供这种灵活性。代理绑定到两个端点,一个用于客户机的前端,一个用于服务的后端。然后,它使用zmq_poll()监视这两个sockets 的活动,当它有一些活动时,它在它的两个sockets 之间传递消息。它实际上并不明确地管理任何队列—zeromq在每个sockets 上自动管理队列。

当您使用REQ与REP对话时,您将得到一个严格同步的请求-应答对话框。客户端发送一个请求。服务读取请求并发送响应。然后客户端读取应答。如果客户机或服务尝试执行其他操作(例如,在不等待响应的情况下连续发送两个请求),它们将得到一个错误。

但是我们的代理必须是非阻塞的。显然,我们可以使用zmq_poll()来等待两个socket上的活动,但是不能使用REP和REQ。

Figure 16 - Extended Request-Reply

![Figure 16](https://github.com/imatix/zguide/raw/master/images/fig16.png)

幸运的是,有两个名为 DEALER 和 ROUTER 的 socket 类型允许您执行非阻塞的请求-应答。在高级请求-应答模式一章中,您将看到 DEALER 和 ROUTER socket 如何让您构建各种异步的请求-应答流。现在,我们只需要看看 DEALER 和 ROUTER 如何让我们跨一个中介(也就是我们的小 broker)扩展 REQ-REP。
在这个简单的扩展请求-应答模式中,REQ 与 ROUTER 对话,DEALER 与 REP 对话。在 DEALER 与 ROUTER 之间,我们必须有代码(就像我们的 broker 一样)把消息从一个 socket 取出来,推到另一个 socket 里。
在这个简单的扩展请求-应答模式中,REQ与ROUTER 对话,而DEALER 与REP对话。在DEALER 与ROUTER 之间,我们必须有代码(就像我们的broker一样)将消息从一个socket 中提取出来,并将它们推送到另一个socket 中。
request-reply broker 绑定到两个端点,一个让 client 连接(前端 socket),另一个让 worker 连接(后端)。要测试这个 broker,您需要修改 worker,让它们连接到后端 socket。下面这个 client 可以说明我的意思:

[rrclient: Request-reply client in C]

C++ | C# | Clojure | CL | Delphi | Erlang | F# | Go | Haskell | Haxe | Java | Lua | Node.js | Perl | PHP | Python | Racket | Ruby | Scala | Tcl | Ada | Basic | Felix | Objective-C | ooc | Q

Here is the worker:
[rrworker: Request-reply worker in C]

C++ | C# | Clojure | CL | Delphi | Erlang | F# | Go | Haskell | Haxe | Java | Lua | Node.js | Perl | PHP | Python | Racket | Ruby | Scala | Tcl | Ada | Basic | Felix | Objective-C | ooc | Q

这是代理,它可以正确地处理多部分消息:
[rrbroker: Request-reply broker in C]

C++ | C# | Clojure | CL | Delphi | Erlang | F# | Go | Haskell | Haxe | Java | Lua | Node.js | Perl | PHP | Python | Ruby | Scala | Tcl | Ada | Basic | Felix | Objective-C | ooc | Q | Racket

图 17 - Request-Reply Broker

![Figure 17](https://github.com/imatix/zguide/raw/master/images/fig17.png)

使用请求-应答代理可以使客户机/服务器体系结构更容易伸缩,因为客户机看不到worker,而worker也看不到客户机。唯一的静态节点是中间的代理。

ZeroMQ的内置代理函数 ZeroMQ’s Built-In Proxy Function

原来,上一节的rrbroker中的核心循环非常有用,并且可以重用。它让我们可以毫不费力地构建pub-sub转发器和共享队列以及其他小型中介。ZeroMQ将其封装在一个方法中,zmq_proxy():

zmq_proxy (frontend, backend, capture);


必须正确地连接、绑定和配置这两个 socket(如果还想捕获数据,则是三个)。当我们调用 zmq_proxy 方法时,就像启动了 rrbroker 的主循环一样。让我们重写 request-reply broker 来调用 zmq_proxy,并把它重新包装成一个听起来很值钱的“消息队列”(有人靠功能更少的代码就赚到了一栋房子):

[msgqueue: Message queue broker in C]

C++ | C# | Clojure | CL | Delphi | Erlang | F# | Go | Haskell | Haxe | Java | Lua | Node.js | Perl | PHP | Python | Q | Ruby | Tcl | Ada | Basic | Felix | Objective-C | ooc | Racket | Scala

如果您和大多数ZeroMQ用户一样,在这个阶段,您的思想开始思考,“如果我将随机的套接字类型插入代理,我能做什么坏事?”简单的回答是:试一试,看看发生了什么。实际上,您通常会坚持使用 ROUTER/DEALER、XSUB/XPUB或PULL/PUSH。

传输桥接 Transport Bridging

ZeroMQ用户经常会问,“我如何将我的ZeroMQ网络与技术X连接起来?”其中X是其他网络或消息传递技术。

图 18 - Pub-Sub Forwarder Proxy

![Figure 18](https://github.com/imatix/zguide/raw/master/images/fig18.png)

答案很简单:造一座桥。桥接(bridge)是一个小应用程序,它在一个 socket 上说一种协议,并在另一个 socket 上与第二种协议互相转换。你也可以把它叫作协议解释器。ZeroMQ 中常见的桥接问题是桥接两种传输或两个网络。
例如,我们将编写一个小代理,它位于发布者和一组订阅者之间,连接两个网络。前端socket (SUB)面向气象服务器所在的内部网络,后端(PUB)面向外部网络上的订阅者。它订阅前端socket 上的天气服务,并在后端socket 上重新发布数据。

[wuproxy: Weather update proxy in C]

C++ | C# | Clojure | CL | Delphi | Erlang | F# | Go | Haskell | Haxe | Java | Lua | Node.js | Perl | PHP | Python | Ruby | Scala | Tcl | Ada | Basic | Felix | Objective-C | ooc | Q | Racket

它看起来与前面的代理示例非常相似,但关键部分是前端和后端sockets 位于两个不同的网络上。例如,我们可以使用这个模型将组播网络(pgm传输)连接到tcp publisher。

处理错误和 ETERM(Handling Errors and ETERM)

ZeroMQ的错误处理哲学是快速故障和恢复能力的结合。我们认为,流程应该尽可能容易受到内部错误的攻击,并且尽可能健壮地抵御外部攻击和错误。打个比方,如果一个活细胞检测到一个内部错误,它就会自我毁灭,但它也会尽一切可能抵抗来自外部的攻击。

断言充斥着ZeroMQ代码,对于健壮的代码是绝对重要的;它们只需要在细胞壁的右边。应该有这样一堵墙。如果不清楚故障是内部的还是外部的,那就是需要修复的设计缺陷。在C/ c++中,断言一旦出现错误就立即停止应用程序。在其他语言中,可能会出现异常或暂停。

当ZeroMQ检测到外部故障时,它会向调用代码返回一个错误。在一些罕见的情况下,如果没有明显的策略来从错误中恢复,它会无声地删除消息。

到目前为止,我们看到的大多数C示例中都没有错误处理。真正的代码应该对每个ZeroMQ调用执行错误处理。如果您使用的是C之外的语言绑定,那么绑定可能会为您处理错误。在C语言中,你需要自己做这个。有一些简单的规则,从POSIX约定开始:

  • 如果创建对象的方法失败,则返回NULL。
  • 处理数据的方法可能返回已处理的字节数,或在出现错误或故障时返回-1。
  • 其他方法在成功时返回0,在错误或失败时返回-1。
  • 错误代码在errno或zmq_errno()中提供。
  • zmq_strerror()提供了用于日志记录的描述性错误文本。

For example:

void *context = zmq_ctx_new ();
assert (context);
void *socket = zmq_socket (context, ZMQ_REP);
assert (socket);
int rc = zmq_bind (socket, "tcp://*:5555");
if (rc == -1) {
    printf ("E: bind failed: %s\n", strerror (errno));
    return -1;
}

有两个主要的例外情况,你应该作为非致命的处理:

  • 当您的代码接收到带有ZMQ_DONTWAIT选项的消息并且没有等待的数据时,ZeroMQ将返回-1并再次将errno设置为EAGAIN。
  • 当一个线程调用zmq_ctx_destroy(),而其他线程仍在执行阻塞工作时,zmq_ctx_destroy()调用关闭上下文,所有阻塞调用都以-1退出,errno设置为ETERM。

在C/ c++中,断言可以在经过优化的代码中完全删除,所以不要错误地将整个ZeroMQ调用封装在assert()中。它看起来整洁;然后优化器删除所有您想要执行的断言和调用,您的应用程序就会以令人印象深刻的方式崩溃。

图 19 - 带终止信号的并行管道(Parallel Pipeline with Kill Signaling)

![Figure 19](https://github.com/imatix/zguide/raw/master/images/fig19.png)

让我们看看如何干净利落地关闭进程。我们将使用上一节中的并行管道示例。如果我们在后台启动了大量 worker,现在想在批处理完成时把它们都停掉。让我们通过给 worker 发送一条 kill 消息来实现这一点。放置这段逻辑的最佳位置是 sink,因为它知道批处理什么时候完成。
我们怎样把 sink 和 worker 连接起来?PUSH/PULL socket 是单向的。我们可以换成另一种 socket 类型,或者混合多条 socket 流。让我们试试后者:使用发布-订阅模型向 worker 发送 kill 消息:

  • sink 在新端点上创建一个PUB socket 。
  • Workers 将他们的输入socket 连接到这个端点。
  • 当sink 检测到批处理的结束时,它向其PUB socket 发送一个kill。
  • 当Workers 检测到此终止消息时,它将退出。

sink不需要太多的新代码:

void *controller = zmq_socket (context, ZMQ_PUB);
zmq_bind (controller, "tcp://*:5559");
…
//  Send kill signal to workers
s_send (controller, "KILL");

这是worker 进程,它使用我们前面看到的zmq_poll()技术管理两个sockets (一个获取任务的PULL socket和一个获取控制命令的SUB socket):

[taskwork2: Parallel task worker with kill signaling in C]

C++ | C# | Clojure | CL | Delphi | Erlang | F# | Go | Haskell | Haxe | Java | Lua | Node.js | Objective-C | Perl| PHP | Python | Ruby | Scala | Tcl | Ada | Basic | Felix | ooc | Q | Racket

下面是修改后的 sink 应用程序。当它收集完结果后,会向所有 worker 发送一条“kill”消息:

[tasksink2: Parallel task sink with kill signaling in C]

C++ | C# | Clojure | CL | Delphi | Erlang | F# | Go | Haskell | Haxe | Java | Lua | Node.js | Objective-C | Perl| PHP | Python | Ruby | Scala | Tcl | Ada | Basic | Felix | ooc | Q | Racket

处理中断信号 Handling Interrupt Signals

当使用Ctrl-C或其他信号(如SIGTERM)中断时,实际应用程序需要干净地关闭。
默认情况下,这些操作只会杀死进程,这意味着不会刷新消息,不会干净地关闭文件,等等。

下面是我们如何处理不同语言的信号:

[interrupt: Handling Ctrl-C cleanly in C]

C++ | C# | Delphi | Erlang | Go | Haskell | Haxe | Java | Lua | Node.js | Perl | PHP | Python | Ruby | Scala | Ada | Basic | Clojure | CL | F# | Felix | Objective-C | ooc | Q | Racket | Tcl

该程序提供s_catch_signals(),它捕获Ctrl-C (SIGINT)和SIGTERM。当其中一个信号到达时,s_catch_signals()处理程序设置全局变量s_interrupted。由于您的信号处理程序,您的应用程序不会自动死亡。相反,你有机会收拾干净,优雅地离开。现在您必须显式地检查中断并正确地处理它。通过在主代码的开头调用s_catch_signals()(从interrupt.c复制)来实现这一点。这将设置信号处理。中断将影响ZeroMQ调用如下:

  • 如果您的代码在阻塞调用(发送消息、接收消息或轮询)中阻塞,那么当信号到达时,调用将返回EINTR。
  • 如果 s_recv() 之类的包装函数被中断,它会返回 NULL。

因此,请检查 EINTR 返回码、NULL 返回值和/或 s_interrupted 标志。下面是一个典型的代码片段:

s_catch_signals ();
client = zmq_socket (...);
while (!s_interrupted) {
    char *message = s_recv (client);
    if (!message)
        break;          //  Ctrl-C used
}
zmq_close (client);

如果您调用s_catch_signals()而不测试中断,那么您的应用程序将对Ctrl-C和SIGTERM免疫,这可能有用,但通常不是。

检测内存泄漏 Detecting Memory Leaks

任何长时间运行的应用程序都必须正确地管理内存,否则最终会耗尽所有可用内存并崩溃。如果您使用的语言可以自动处理这一问题,那么恭喜您。如果您使用C或c++或任何其他负责内存管理的语言编写程序,这里有一个关于使用valgrind的简短教程,其中包括报告程序中出现的任何泄漏。

  • To install valgrind, e.g., on Ubuntu or Debian, issue this command:
sudo apt-get install valgrind
  • By default, ZeroMQ will cause valgrind to complain a lot. To remove these warnings, create a file called vg.supp that contains this:
{
   <socketcall_sendto>
   Memcheck:Param
   socketcall.sendto(msg)
   fun:send
   ...
}
{
   <socketcall_sendto>
   Memcheck:Param
   socketcall.send(msg)
   fun:send
   ...
}
  • Fix your applications to exit cleanly after Ctrl-C. For any application that exits by itself, that’s not needed, but for long-running applications, this is essential, otherwise valgrind will complain about all currently allocated memory.

  • Build your application with -DDEBUG if it’s not your default setting. That ensures valgrind can tell you exactly where memory is being leaked.

  • Finally, run valgrind thus:

valgrind --tool=memcheck --leak-check=full --suppressions=vg.supp someprog

And after fixing any errors it reported, you should get the pleasant message:

==30536== ERROR SUMMARY: 0 errors from 0 contexts...

多线程与 ZeroMQ(Multithreading with ZeroMQ)

ZeroMQ 可能是有史以来编写多线程(MT)应用程序的最好方式。如果你习惯了传统 socket,ZeroMQ socket 需要你做一些调整;而 ZeroMQ 的多线程方式,则会把你所知道的关于编写 MT 应用程序的一切扔到花园里的一堆杂物上,浇上汽油,一把火点了。值得烧掉的书并不多,但大多数讲并发编程的书都值得一烧。

为了使MT程序完全完美(我的意思是字面意思),我们不需要互斥锁、锁或任何其他形式的线程间通信,除了通过ZeroMQ sockets 发送的消息。

所谓“完美的MT程序”,我的意思是代码易于编写和理解,在任何编程语言和任何操作系统中都可以使用相同的设计方法,并且可以跨任意数量的cpu伸缩,没有等待状态,没有收益递减点。

如果您花了多年时间学习各种技巧,使 MT 代码能正常工作,更不用说高效地使用锁、信号量和临界区,那么当您意识到这一切都是徒劳时,您会感到厌恶。如果说我们从 30 多年的并发编程中学到了什么,那就是:不要共享状态。这就像两个醉汉想分一杯啤酒:他们是不是好朋友并不重要,他们迟早会打起来;而且参加聚会的酒鬼越多,他们为啤酒打得越凶。大多数 MT 应用程序看起来都像一场醉汉的酒吧斗殴。

在编写经典的共享状态 MT 代码时,如果这些怪异的问题没有直接转化为压力和风险,那倒还算可笑——可惜那些在压力下看似正常的代码会突然失效。一家在 bug 代码方面拥有世界级经验的大公司发布了它的“多线程代码中的 11 个可能问题”清单,其中包括被遗忘的同步、不正确的粒度、读写撕裂、无锁重排序、锁护送(lock convoys)、两步舞(two-step dance)和优先级反转。

我们数了数是 7 个问题,不是 11 个。但这不是重点。重点是:您真的希望那些运行电网或股票市场的代码,在某个繁忙的周四下午 3 点开始出现两步舞式的锁护送吗?谁在乎这些术语到底是什么意思呢?这并不是吸引我们投身编程的那种乐趣,我们不想用越来越复杂的 hack 去对抗越来越复杂的副作用。

尽管一些被广泛使用的模型是整个行业的基础,但它们从根本上是被破坏的,共享状态并发就是其中之一。想要无限扩展的代码就像互联网一样,发送消息,除了对坏掉的编程模型的普遍蔑视之外,什么也不分享。
你应该遵循一些规则来编写快乐的多线程代码与ZeroMQ:

  • 在线程中单独隔离数据,永远不要在多个线程中共享数据。唯一的例外是ZeroMQ上下文,它是线程安全的。
  • 远离经典的并发机制,如互斥、临界区、信号量等。这些是ZeroMQ应用程序中的反模式。
  • 在进程开始时创建一个ZeroMQ上下文,并将其传递给希望通过inproc套接字连接的所有线程。
  • 使用附加线程在应用程序中创建结构,并使用inproc上的PAIR sockets将这些线程连接到它们的父线程。模式是:绑定父socket,然后创建连接其socket的子线程。
  • 使用分离的线程模拟独立的任务,并使用它们自己的contexts。通过tcp连接这些。稍后,您可以将它们转移到独立进程,而不需要显著更改代码。
  • 线程之间的所有交互都以ZeroMQ消息的形式发生,您可以或多或少地正式定义它。
  • 不要在线程之间共享ZeroMQ socket。ZeroMQ socket不是线程安全的。从技术上讲,可以将socket从一个线程迁移到另一个线程,但这需要技巧。在线程之间共享socket的惟一合理的地方是语言绑定,它需要像socket上的垃圾收集那样做。

例如,如果需要在应用程序中启动多个代理,则希望在它们各自的线程中运行每个代理。在一个线程中创建代理前端和后端socket,然后将socket传递给另一个线程中的代理,这很容易出错。这可能在一开始看起来有效,但在实际使用中会随机失败。记住:除非在创建socket的线程中,否则不要使用或关闭socket。

如果遵循这些规则,就可以很容易地构建优雅的多线程应用程序,然后根据需要将线程拆分为单独的进程。应用程序逻辑可以位于线程、进程或节点中:无论您的规模需要什么。
ZeroMQ使用本机OS线程,而不是虚拟的“绿色”线程。其优点是您不需要学习任何新的线程API,而且ZeroMQ线程可以干净地映射到您的操作系统。您可以使用诸如Intel的ThreadChecker之类的标准工具来查看您的应用程序在做什么。缺点是本地线程api并不总是可移植的,而且如果您有大量的线程(数千个),一些操作系统将会受到压力。

让我们看看这在实践中是如何工作的。我们要把原来的 Hello World 服务器变成能力更强的版本。原始服务器在单个线程中运行。如果每个请求的工作量很小,这没问题:一个 ØMQ 线程可以在一个 CPU 核心上全速运行,没有等待,完成大量工作。但实际的服务器必须对每个请求做大量工作。当 10,000 个客户端同时冲击服务器时,单个核心可能就不够了。因此,实际的服务器会启动多个 worker 线程,以最快的速度接收请求,并把请求分发给这些 worker 线程。worker 线程埋头处理工作,并最终把应答发送回去。

当然,您也可以用一个代理 broker 加上外部 worker 进程来完成这一切,但启动一个占满 16 个核心的进程,通常比启动 16 个各占一个核心的进程更容易。此外,把 worker 作为线程运行可以减少网络跳数、延迟和网络流量。Hello World 服务的 MT 版本基本上就是把 broker 和 worker 合并到了一个进程里:

[mtserver: Multithreaded service in C]

C++ | C# | Clojure | CL | Delphi | Erlang | F# | Go | Haskell | Haxe | Java | Lua | Perl | PHP | Python | Q | Ruby | Scala | Ada | Basic | Felix | Node.js | Objective-C | ooc | Racket | Tcl
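
下面是这个多线程服务的一个最小 C 语言示意(基于 libzmq 3.x 与 pthread;与正文链接的 mtserver 示例思路一致,仅作说明用途):

#include <zmq.h>
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static void *
worker_routine (void *context) {
    //  与分发器(DEALER)对话的 REP socket,通过 inproc 连接
    void *receiver = zmq_socket (context, ZMQ_REP);
    zmq_connect (receiver, "inproc://workers");

    while (1) {
        char buffer [16] = { 0 };
        zmq_recv (receiver, buffer, sizeof (buffer) - 1, 0);
        printf ("Received request: [%s]\n", buffer);
        sleep (1);                              //  模拟一些“工作”
        zmq_send (receiver, "World", 5, 0);
    }
    zmq_close (receiver);
    return NULL;
}

int main (void)
{
    void *context = zmq_ctx_new ();

    //  面向客户端的 ROUTER socket(tcp)
    void *clients = zmq_socket (context, ZMQ_ROUTER);
    zmq_bind (clients, "tcp://*:5555");

    //  面向 worker 线程的 DEALER socket(inproc)
    void *workers = zmq_socket (context, ZMQ_DEALER);
    zmq_bind (workers, "inproc://workers");

    //  启动一组 worker 线程,共享同一个 context
    int thread_nbr;
    for (thread_nbr = 0; thread_nbr < 5; thread_nbr++) {
        pthread_t worker;
        pthread_create (&worker, NULL, worker_routine, context);
    }
    //  用内置代理把两个 socket 连接起来(REQ-ROUTER-queue-DEALER-REP)
    zmq_proxy (clients, workers, NULL);

    //  正常情况下不会执行到这里,但还是做一下清理
    zmq_close (clients);
    zmq_close (workers);
    zmq_ctx_destroy (context);
    return 0;
}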

图 20 - 多线程服务器(Multithreaded Server)

![Figure 20](https://github.com/imatix/zguide/raw/master/images/fig20.png)

到目前为止,所有代码都应该是你能看懂的。它是如何工作的:

  • 服务器启动一组工作线程。每个工作线程创建一个REP socket,然后在这个socket上处理请求。工作线程就像单线程服务器。唯一的区别是传输(inproc而不是tcp)和绑定连接方向。
  • 服务器创建一个 ROUTER socket来与clients 通信,并将其绑定到外部接口(通过tcp)。
  • 服务器创建一个 DEALER socket来与工人对话,并将其绑定到其内部接口(通过inproc)。
  • 服务器启动连接两个sockets的代理。代理从所有客户端公平地提取传入请求,并将这些请求分发给工作人员。它还将回复路由回它们的原点。

注意,在大多数编程语言中,创建线程是不可移植的。POSIX库是pthreads,但是在Windows上必须使用不同的API。在我们的示例中,pthread_create调用启动一个运行我们定义的worker_routine函数的新线程。我们将在 Advanced Request-Reply Patterns中看到如何将其封装到可移植API中。

这里的“工作”只是一秒钟的停顿。我们可以在worker中做任何事情,包括与其他节点通信。
从 ØMQ socket 和节点的角度看,MT 服务器就是这个样子。请注意 request-reply 链是如何表示为 REQ-ROUTER-queue-DEALER-REP 的。

线程间的信号传递(Signaling Between Threads,PAIR Sockets)

当您开始使用ZeroMQ创建多线程应用程序时,您将遇到如何协调线程的问题。尽管您可能想要插入“sleep”语句,或者使用多线程技术(如信号量或互斥锁),但是您应该使用的惟一机制是ZeroMQ消息。记住酒鬼和啤酒瓶的故事。

让我们创建三个线程,当它们准备好时互相发出信号。在这个例子中,我们在inproc传输上使用 PAIR sockets:

[mtrelay: Multithreaded relay in C]

C++ | C# | Clojure | CL | Delphi | Erlang | F# | Go | Haskell | Haxe | Java | Lua | Perl | PHP | Python | Q | Ruby | Scala | Ada | Basic | Felix | Node.js | Objective-C | ooc | Racket | Tcl

Figure 21 - 接力赛(The Relay Race)

![Figure 21](https://github.com/imatix/zguide/raw/master/images/fig21.png)

这是一个经典的模式多线程与ZeroMQ:

  • 1. 两个线程使用共享 context,通过 inproc 进行通信。
  • 2. 父线程创建一个 socket,把它绑定到 inproc:// 端点,然后启动子线程,并把 context 传给它。
  • 3. 子线程创建第二个 socket,把它连接到那个 inproc:// 端点,然后向父线程发出准备就绪的信号。

注意,使用此模式的多线程代码不能扩展到进程。如果您使用inproc和 socket pairs,那么您正在构建一个紧密绑定的应用程序,即,其中线程在结构上相互依赖。当低延迟非常重要时,执行此操作。另一种设计模式是松散绑定的应用程序,其中线程有自己的context ,并通过ipc或tcp进行通信。您可以轻松地将松散绑定的线程拆分为单独的进程。
这是我们第一次展示使用 PAIR sockets的示例。为什么使用PAIR?其他socket 组合似乎也有效果,但它们都有副作用,可能会干扰信号:

  • 您可以使用PUSH作为发送方,PULL作为接收方。这看起来很简单,也很有效,但是请记住PUSH将向所有可用的接收者分发消息。如果你不小心启动了两个接收器(例如,你已经启动了一个接收器,然后你又启动了另一个接收器),你将“丢失”一半的信号。PAIR 具有拒绝多个连接的优势;这个PAIR 是独一无二的。
  • 您可以使用DEALER作为发送方,使用 ROUTER 作为接收方。ROUTER ,然而,将你的消息包装在一个“信封”,这意味着你的零大小的信号变成一个多部分的消息。如果您不关心数据并将任何内容视为有效信号,如果您从socket中读取的次数不超过一次,那么这就无关紧要了。然而,如果你决定发送真实的数据,你会突然发现ROUTER提供给你“错误”的消息。DEALER 还分发outgoing 的消息,像PUSH一样带来相同的风险。
  • 您可以把 PUB 用作发送方,把 SUB 用作接收方。这会正确地投递您的消息,就像您发送时那样,PUB 也不会像 PUSH 或 DEALER 那样做分发。但是,您需要给订阅者配置一个空订阅,这有点烦人。

由于这些原因,PAIR 是线程对之间协调的最佳选择。

节点的协调(Node Coordination)

当您想要协调网络上的一组节点时,PAIR sockets将不再有效。这是少数几个线程和节点策略不同的领域之一。基本上,节点来来去去,而线程通常是静态的。如果远程节点离开并返回, PAIR sockets不会自动重新连接。

Figure 22 - 发布-订阅同步(Pub-Sub Synchronization)

![Figure 22](https://github.com/imatix/zguide/raw/master/images/fig22.png)

线程和节点之间的第二个显著差异是,通常有固定数量的线程,但节点的数量更可变。让我们以前面的一个场景(天气服务器和客户机)为例,使用节点协调确保订阅者在启动时不会丢失数据。

以下是应用程序的工作原理:

  • 发布者预先知道它期望有多少订阅者。这只是它从某处得到的一个“神奇数字”。
  • 发布者启动并等待所有订阅者连接。这就是节点协调部分。每个订阅者先订阅,然后通过另一个 socket 告诉发布者它已经准备好了。
  • 当发布者连接上所有订阅者后,它开始发布数据。

在本例中,我们将使用 REQ-REP socket 流来同步订阅者和发布者。以下是发布者:

[syncpub: Synchronized publisher in C]

C++ | C# | Clojure | CL | Delphi | Erlang | F# | Go | Haskell | Haxe | Java | Lua | Node.js | Perl | PHP | Python | Racket | Ruby | Scala | Tcl | Ada | Basic | Felix | Objective-C | ooc | Q

And here is the subscriber:
[syncsub: Synchronized subscriber in C]

C++ | C# | Clojure | CL | Delphi | Erlang | F# | Go | Haskell | Haxe | Java | Lua | Node.js | Perl | PHP | Python | Racket | Ruby | Scala | Tcl | Ada | Basic | Felix | Objective-C | ooc | Q

这个Bash shell脚本将启动10个订阅者,然后是发布者:

echo "Starting subscribers..."
for ((a=0; a<10; a++)); do
    syncsub &
done
echo "Starting publisher..."
syncpub

这就得到了令人满意的结果:

Starting subscribers...
Starting publisher...
Received 1000000 updates
Received 1000000 updates
...
Received 1000000 updates
Received 1000000 updates

我们不能假设SUB connect将在REQ/REP对话框完成时完成。如果使用除inproc之外的任何传输,则不能保证出站连接将以任何顺序完成。因此,该示例在订阅和发送REQ/REP同步之间强制休眠一秒钟。

一个更健壮的模型可以是:

  • Publisher打开PUB socket 并开始发送“Hello”消息(而不是数据)。
  • 订阅者连接SUB socket ,当他们收到一条Hello消息时,他们通过 REQ/REP socket pair告诉发布者。
  • 当发布者获得所有必要的确认后,它就开始发送实际数据。

零拷贝(Zero-Copy)

ZeroMQ的消息API允许您直接从应用程序缓冲区发送和接收消息,而不需要复制数据。
我们称之为零拷贝,它可以在某些应用程序中提高性能。

您应该考虑在以高频率发送大内存块(数千字节)的特定情况下使用zero-copy。对于短消息或较低的消息率,使用零拷贝将使您的代码更混乱、更复杂,并且没有可度量的好处。像所有优化一样,当您知道它有帮助时使用它,并在前后进行度量。

要执行zero-copy,可以使用zmq_msg_init_data()创建一条消息,该消息引用已经用malloc()或其他分配器分配的数据块,然后将其传递给zmq_msg_send()。创建消息时,还传递一个函数,ZeroMQ在发送完消息后将调用该函数释放数据块。这是最简单的例子,假设buffer是一个在堆上分配了1000字节的块:

void my_free (void *data, void *hint) {
    free (data);
}
//  Send message from buffer, which we allocate and ZeroMQ will free for us
zmq_msg_t message;
zmq_msg_init_data (&message, buffer, 1000, my_free, NULL);
zmq_msg_send (&message, socket, 0);

注意,发送消息后不调用zmq_msg_close()—libzmq在实际发送消息后将自动调用zmq_msg_close()。

没有办法在接收时执行零复制:ZeroMQ提供了一个缓冲区,您可以存储任意长的缓冲区,但是它不会直接将数据写入应用程序缓冲区。

ZeroMQ的多部分消息与zero-copy可以很好地结合使用。在传统的消息传递中,你需要把不同的缓冲区拼装进一个可以发送的缓冲区,这意味着复制数据。使用ZeroMQ,您可以将来自不同来源的多个缓冲区作为单独的消息帧发送:把每个字段作为一个以长度界定的帧发送。对应用程序来说,这看起来像一系列发送和接收调用。但在内部,多个部分被写入网络并通过单个系统调用读取,因此非常高效。

发布-订阅消息信封Pub-Sub Message Envelopes

在 pub-sub 模式中,我们可以把订阅键拆分成一个单独的消息帧,称为信封。如果你想使用pub-sub信封,就需要自己动手构造。它是可选的,在之前的 pub-sub 例子中我们没有这样做。
对于简单的情况,使用 pub-sub信封要多做一些工作,但是对于实际情况,尤其是键和数据是自然分离的情况,使用它会更简洁。

Figure 23 - Pub-Sub Envelope with Separate Key

[外链图片转存失败,源站可能有防盗链机制,建议将图片保存下来直接上传(img-qBEp6GJW-1611294599101)(https://github.com/imatix/zguide/raw/master/images/fig23.png)]

订阅执行的是前缀匹配。也就是说,它们查找"所有以XYZ开头的消息"。一个明显的问题是:如何把键与数据分隔开,使前缀匹配不会意外匹配到数据。最好的答案是使用信封,因为匹配不会跨越帧边界。下面是一个极简示例,展示pub-sub信封在代码中的样子。此发布者发送A和B两种类型的消息。

The envelope holds the message type:

[psenvpub: Pub-Sub envelope publisher in C](javascript:😉

C++ | C# | Clojure | CL | Delphi | Erlang | F# | Go | Haskell | Haxe | Java | Lua | Node.js | Perl | PHP | Python | Ruby | Scala | Tcl | Ada | Basic | Felix | Objective-C | ooc | Q | Racket
The subscriber wants only messages of type B:
[psenvsub: Pub-Sub envelope subscriber in C](javascript:😉

C++ | C# | Clojure | CL | Delphi | Erlang | F# | Go | Haskell | Haxe | Java | Lua | Node.js | Perl | PHP | Python | Ruby | Scala | Tcl | Ada | Basic | Felix | Objective-C | ooc | Q | Racket

When you run the two programs, the subscriber should show you this:

[B] We would like to see this
[B] We would like to see this
[B] We would like to see this
...

此示例显示订阅筛选器拒绝或接受整个多部分消息(键和数据)。您永远不会得到多部分消息的一部分。如果您订阅了多个发布者,并且希望知道它们的地址,以便能够通过另一个socket (这是一个典型的用例)向它们发送数据,那么创建一个由三部分组成的消息。
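
下面是一个把发布端和订阅端放在同一进程里的最小演示,用来说明"信封帧 + 前缀订阅"的配合。其中inproc端点名weather和sleep(1)的粗糙同步只是演示用的权宜做法,并非上面psenvpub/psenvsub示例的原始代码:

#include <zmq.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main (void)
{
    void *context = zmq_ctx_new ();

    //  Publisher binds, subscriber connects over inproc for this demo
    void *publisher = zmq_socket (context, ZMQ_PUB);
    zmq_bind (publisher, "inproc://weather");
    void *subscriber = zmq_socket (context, ZMQ_SUB);
    zmq_connect (subscriber, "inproc://weather");

    //  Subscribe to messages whose envelope starts with "B" only
    zmq_setsockopt (subscriber, ZMQ_SUBSCRIBE, "B", 1);
    sleep (1);      //  crude: give the subscription time to take effect

    //  The envelope (frame 1) holds the message type, the body (frame 2) the content
    zmq_send (publisher, "A", 1, ZMQ_SNDMORE);
    zmq_send (publisher, "We don't want to see this", strlen ("We don't want to see this"), 0);
    zmq_send (publisher, "B", 1, ZMQ_SNDMORE);
    zmq_send (publisher, "We would like to see this", strlen ("We would like to see this"), 0);

    //  Only the [B] message arrives; read envelope and body as two frames
    char envelope [64], body [256];
    int size = zmq_recv (subscriber, envelope, sizeof (envelope) - 1, 0);
    envelope [size] = 0;
    size = zmq_recv (subscriber, body, sizeof (body) - 1, 0);
    body [size] = 0;
    printf ("[%s] %s\n", envelope, body);

    zmq_close (publisher);
    zmq_close (subscriber);
    zmq_ctx_destroy (context);
    return 0;
}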

Figure 24 - Pub-Sub Envelope with Sender Address

[外链图片转存失败,源站可能有防盗链机制,建议将图片保存下来直接上传(img-QeEQOICv-1611294599101)(https://github.com/imatix/zguide/raw/master/images/fig24.png)]

High-Water Marks

当您可以快速地从一个进程发送消息到另一个进程时,您很快就会发现内存是一种宝贵的资源,并且可以被轻松地填满。流程中某些地方的几秒钟延迟可能会变成积压,导致服务器崩溃,除非您了解问题并采取预防措施。

问题是这样的:假设您有一个进程A以很高的频率向正在处理它们的进程B发送消息。突然,B变得非常繁忙(垃圾收集、CPU过载等等),短时间内无法处理消息。对于一些繁重的垃圾收集,可能需要几秒钟的时间,或者如果有更严重的问题,可能需要更长的时间。进程A仍然试图疯狂发送的消息会发生什么情况?有些将位于B的网络缓冲区中。有些将位于以太网线路本身。有些将位于A的网络缓冲区中。其余的会在A的内存中积累,就像A后面的应用程序发送它们一样快。如果不采取一些预防措施,A很容易耗尽内存并崩溃。

这是消息代理的一个经典问题。更糟的是,从表面上看这是B的错,而B通常是A无法控制的、由用户编写的应用程序。

答案是什么?一种办法是把问题推给上游。A的消息也是从别处来的,那就告诉那个进程:"停!"以此类推。这叫做流量控制。听起来很有道理,但如果你发送的是Twitter信息流呢?你会在B恢复正常之前让全世界都停止发推吗?

流量控制在某些情况下有效,但在其他情况下无效。传输层不能告诉应用层"停止",就像地铁系统没法对一家大公司说"请让您的员工再多上半小时班,我太挤了"一样。消息传递的解决方案是对缓冲区大小设置限制,并在达到限制时采取一些明智的行动。在某些情况下(虽然不适用于地铁系统),答案是丢弃消息;在另一些情况下,最好的策略是等待。

ZeroMQ使用HWM(高水位)的概念来定义其内部管道的容量。每个socket 外或socket 内的连接都有自己的管道和用于发送和/或接收的HWM,这取决于socket 类型。一些socket (PUB, PUSH)只有发送缓冲区。有些(SUB、PULL、REQ、REP)只有接收缓冲区。一些(DEALER, ROUTER, PAIR)有发送和接收缓冲区。

在ZeroMQ v2.x中,HWM默认为无穷大。这很省事,但对高吞吐量的publisher来说通常也是致命的。在ZeroMQ v3.x中,默认值是1000,这合理得多。如果你还在用ZeroMQ v2.x,就应该总是在socket上设置HWM:可以设为1000以与ZeroMQ v3.x保持一致,也可以根据消息大小和订阅者的预期处理能力另行选择数值。

当socket 到达其HWM时,它将根据socket 类型阻塞或删除数据。如果PUB和ROUTER socket 到达它们的HWM,它们将丢弃数据,而其他 socket 类型将阻塞。在inproc传输中,发送方和接收方共享相同的缓冲区,因此实际的HWM是双方设置的HWM的和。最后,HWMs并不精确;由于libzmq实现其队列的方式,默认情况下最多可以获得1,000条消息,但实际缓冲区大小可能要小得多(只有一半)。
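
下面是设置HWM的一个最小草图(假设使用ZeroMQ v3.x的ZMQ_SNDHWM/ZMQ_RCVHWM选项;数值1000和端点地址只是示例):

#include <zmq.h>

int main (void)
{
    void *context = zmq_ctx_new ();

    //  Cap both directions of a DEALER socket at 1000 messages per pipe
    void *socket = zmq_socket (context, ZMQ_DEALER);
    int hwm = 1000;
    zmq_setsockopt (socket, ZMQ_SNDHWM, &hwm, sizeof (hwm));    //  outgoing pipe limit
    zmq_setsockopt (socket, ZMQ_RCVHWM, &hwm, sizeof (hwm));    //  incoming pipe limit

    //  Set the options before connecting, so they apply to the pipes created below
    zmq_connect (socket, "tcp://localhost:5555");

    //  ... use the socket ...

    zmq_close (socket);
    zmq_ctx_destroy (context);
    return 0;
}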

缺少消息问题的解决者(解决方式)Missing Message Problem Solver

在使用ZeroMQ构建应用程序时,您将不止一次地遇到这个问题:丢失预期接收到的消息。
我们已经整理了一个图表,介绍了造成这种情况的最常见原因。

Figure 25 - Missing Message Problem Solver

[外链图片转存失败,源站可能有防盗链机制,建议将图片保存下来直接上传(img-stgTeoZU-1611294599102)(https://github.com/imatix/zguide/raw/master/images/fig25.png)]

下面是图表的摘要:

  • 在 SUB sockets上,使用zmq_setsockopt()和ZMQ_SUBSCRIBE设置订阅,否则将不会收到消息。因为您通过前缀订阅消息,如果您订阅“”(空订阅),您将获得所有内容。
  • 如果您在PUB socket开始发送数据之后才启动SUB socket(即之后才建立连接),那么在连接建立之前发布的任何内容都会丢失。如果这是个问题,请调整您的体系结构,先启动SUB sockets,再让PUB socket开始发布。
  • 即使同步了 SUB 和 PUB socket,仍然可能丢失消息。这是因为在实际创建连接之前不会创建内部队列。如果您可以切换绑定/连接方向,以便 SUB socket 绑定,而PUB socket连接,您可能会发现它的工作方式与您所期望的一样。
  • 如果您使用REP和REQsockets,并且没有坚持同步发送/recv/send/recv命令,ZeroMQ将报告错误,您可能会忽略这些错误。然后,看起来就像是你在丢失信息。如果您使用REQ或REP,请坚持send/recv顺序,并且始终在实际代码中检查ZeroMQ调用中的错误。
  • 如果使用 PUSH sockets,您会发现第一个连接的PULL socket 将获取不公平的消息共享。只有在成功连接所有PULL套接字时才会发生准确的消息轮换,这可能需要几毫秒的时间。作为PUSH / PULL的替代方案,对于较低的数据速率,请考虑使用ROUTER / DEALER和负载平衡模式。
  • 如果您正在跨线程共享sockets ,请不要这样做。这将导致随机的怪异,并崩溃。
  • 如果使用inproc,请确保两个socket位于相同的context中。否则,连接端实际上会失败。同样,先绑定,然后连接。inproc不像tcp那样是一个断开连接的传输。
  • 如果您正在使用ROUTER sockets,通过发送不正确的身份帧(或忘记发送身份帧),很容易意外丢失消息。通常,在 ROUTER sockets 上设置ZMQ_ROUTER_MANDATORY选项是一个好主意,但是也要在每次发送调用时检查返回代码。
  • 最后,如果您真的不知道哪里出了问题,那么就创建一个最小的测试用例来重现问题,并向ZeroMQ社区寻求帮助。

Chapter 3 - 高级请求-应答模式 Advanced Request-Reply Patterns

在第2章-Sockets and Patterns 中,我们通过开发一系列小型应用程序来学习使用ZeroMQ的基础知识,每次都要探索ZeroMQ的新方面。在本章中,我们将继续使用这种方法,探索构建在ZeroMQ核心请求-应答模式之上的高级模式。

我们将讨论:

  • 请求-应答机制如何工作
  • 如何组合REQ、REP、DEALER和 ROUTER sockets
  • ROUTER sockets如何工作,详细的讨论
  • 负载平衡模式
  • 构建一个简单的负载平衡消息代理
  • 为ZeroMQ设计一个高级API
  • 构建异步请求-应答服务器
  • 一个详细的代理间路由示例

The Request-Reply Mechanisms(机制)

我们已经简要介绍了多部分消息。现在让我们看一个主要的用例,即回复消息信封。信封是一种用地址安全包装数据的方法,而不需要接触数据本身。通过将回复地址分离到信封中,我们可以编写通用的中介,如api和代理,无论消息有效负载或结构是什么,它们都可以创建、读取和删除地址。

在请求-应答模式中,信封保存着应答的返回地址。这就是无状态的ZeroMQ网络得以实现往返的请求-应答对话的方式。

当您使用REQ和REP sockets 时,您甚至看不到信封;这些sockets 自动处理它们。但是对于大多数有趣的请求-应答模式,您需要了解信封,特别是ROUTER sockets。我们会一步一步来。

The Simple Reply Envelope

请求-应答交换由请求消息和最终的应答消息组成。在简单的请求-应答模式中,每个请求都有一个应答。在更高级的模式中,请求和响应可以异步流动。然而,回复信封总是以相同的方式工作。

ZeroMQ应答信封正式由零个或多个应答地址组成,后跟一个空帧(信封分隔符),后跟消息体(零个或多个帧)。信封是由多个sockets 在一个链中一起工作创建的。我们来分解一下。

我们将从通过REQ socket发送"Hello"开始。REQ socket创建了最简单的回复信封:没有地址,只有一个空的分隔符帧和包含"Hello"字符串的消息帧。这是一个两帧的消息。

Figure 26 - Request with Minimal Envelope

[外链图片转存失败,源站可能有防盗链机制,建议将图片保存下来直接上传(img-mXoZTscS-1611294599104)(https://github.com/imatix/zguide/raw/master/images/fig26.png)]

REP socket执行与之配套的工作:它剥离信封,直到并包括分隔符帧为止,保存整个信封,并将"Hello"字符串传递给应用程序。因此,我们最初的Hello World示例在内部其实使用了请求-回复信封,只是应用程序从未看到它们。

如果您监视在hwclient和hwserver之间流动的网络数据,您将看到:每个请求和每个响应实际上是两个帧,一个空帧,然后是主体。这对于一个简单的REQ-REP对话框似乎没有多大意义。不过,当我们探讨ROUTER和DEALER 如何处理信封时,您将会看到原因。

加长回信信封The Extended Reply Envelope

现在,让我们使用中间的 ROUTER-DEALER代理扩展 REQ-REP对 ,看看这会如何影响回复信封。这是我们在Chapter 2 - Sockets and Patterns中已经看到的扩展请求-应答模式。实际上,我们可以插入任意数量的代理步骤。机制是一样的。

Figure 27 - Extended Request-Reply Pattern

[外链图片转存失败,源站可能有防盗链机制,建议将图片保存下来直接上传(img-pNXwX2It-1611294599105)(https://github.com/imatix/zguide/raw/master/images/fig27.png)]

代理在伪代码中这样做:

prepare context, frontend and backend sockets
while true:
    poll on both sockets
    if frontend had input:
        read all frames from frontend
        send to backend
    if backend had input:
        read all frames from backend
        send to frontend

与其他socket不同,ROUTER socket跟踪它所拥有的每个连接,并将这些信息告诉调用者。
它告诉调用者的方法是将连接标识粘贴到接收到的每个消息前面。
标识(有时称为地址)只是一个二进制字符串,除了“这是连接的惟一句柄”之外没有任何意义。
然后,当您通过ROUTER socket发送消息时,您首先发送一个标识帧。

zmq_socket()手册页这样描述它:

当接收消息时,ZMQ_ROUTER socket在把消息传递给应用程序之前,会在消息前面加上一个消息部分,其中包含该消息来源对等点的标识。接收到的消息在所有已连接的对等点之间公平排队。当发送消息时,ZMQ_ROUTER socket会移除消息的第一部分,并用它来确定消息应路由到的对等点的身份。
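
作为示意,下面的片段展示了从ROUTER socket读取"标识帧 + 空帧 + 数据帧"、再按同样顺序把应答发回去的大致写法。这里假设router是一个已创建好的ZMQ_ROUTER socket,且对端是REQ socket;这只是对上文机制的说明性草图:

//  Read one request arriving from a REQ peer via a ROUTER socket:
//  frame 1 = identity, frame 2 = empty delimiter, frame 3 = body
char identity [256], body [256];
int id_size = zmq_recv (router, identity, sizeof (identity), 0);
zmq_recv (router, body, sizeof (body), 0);          //  empty delimiter frame
int body_size = zmq_recv (router, body, sizeof (body) - 1, 0);
if (body_size > (int) sizeof (body) - 1)
    body_size = sizeof (body) - 1;                  //  truncated request
body [body_size] = 0;
printf ("request: %s\n", body);

//  To reply, send the same envelope back in front of our answer
zmq_send (router, identity, id_size, ZMQ_SNDMORE);
zmq_send (router, "", 0, ZMQ_SNDMORE);
zmq_send (router, "World", 5, 0);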

作为历史记录,ZeroMQ v2.2和更早的版本使用uuid作为标识。ZeroMQ v3.0和以后的版本在默认情况下生成一个5字节的标识(0 +一个随机32位整数)。这对网络性能有一定的影响,但仅当您使用多个代理跃点时,这种情况很少见。主要的更改是通过删除对UUID库的依赖来简化libzmq的构建。

身份是一个很难理解的概念,但如果你想成为ZeroMQ专家,它是必不可少的。 ROUTER socket 为它工作的每个连接创建一个随机标识。如果有三个 REQ sockets连接到ROUTER socket,它将为每个REQ sockets创建一个随机标识。
如果我们继续我们的工作示例,假设REQsocket有一个3字节的标识ABC。在内部,这意味着ROUTER socket保留一个哈希表,它可以在这个哈希表中搜索ABC并为REQsocket找到TCP连接。当我们从ROUTER socket接收消息时,我们得到三个帧。

Figure 28 - Request with One Address

[外链图片转存失败,源站可能有防盗链机制,建议将图片保存下来直接上传(img-QayKEadj-1611294599105)(https://github.com/imatix/zguide/raw/master/images/fig28.png)]

代理循环的核心是"从一个socket读取,向另一个socket写入",因此我们把这三帧原样发送到DEALER socket上。如果您现在嗅探网络流量,您将看到这三个帧从DEALER socket飞向REP socket。REP socket和前面一样,去掉整个信封(包括新的回复地址),并再次向调用者传递"Hello"。顺便提一下,REP socket一次只能处理一个请求-应答交换,这就是为什么如果您尝试读取多个请求、或发送多个应答而不遵守严格的recv-send循环,它会报错。
您现在应该能够可视化返回路径。当hwserver将“World”发送回来时,REP socket将其与它保存的信封打包,并通过网络向DEALER socket发送一个三帧回复消息。

Figure 29 - Reply with one Address

[外链图片转存失败,源站可能有防盗链机制,建议将图片保存下来直接上传(img-1AFlaDDp-1611294599106)(https://github.com/imatix/zguide/raw/master/images/fig29.png)]

现在DEALER 读取这三帧,并通过 ROUTER socket发送所有这三帧。 ROUTER接受消息的第一帧,即ABC标识,并为此查找连接。如果它发现了,它就会把接下来的两帧泵到网络上。

Figure 30 - Reply with Minimal Envelope

[外链图片转存失败,源站可能有防盗链机制,建议将图片保存下来直接上传(img-Afm7erMR-1611294599107)(https://github.com/imatix/zguide/raw/master/images/fig30.png)]

REQ socket接收此消息,检查第一帧确实是空分隔符,于是丢弃该帧并把"World"传递给调用应用程序,应用程序将它打印出来,这让第一次接触ZeroMQ的年轻一代大为惊讶。

What’s This Good For?

说实话,用于严格请求-应答或扩展请求-应答的用例在某种程度上是有限的。首先,没有简单的方法可以从常见的故障中恢复,比如由于应用程序代码错误导致服务器崩溃。我们将在可靠的请求-应答模式中看到更多这方面的内容。然而,一旦你掌握了这四个sockets 处理信封的方式,以及它们之间的通信方式,你就可以做一些非常有用的事情。我们了解了ROUTER 如何使用应答信封来决定将应答路由回哪个客户机REQ socket 。现在让我们用另一种方式来表达:

  • 每次ROUTER 给你一个消息,它会告诉你来自哪个对等点,作为一个身份。
  • 您可以将其与散列表一起使用(以标识为键),以便在新对等点到达时跟踪它们。
  • 如果将标识前缀作为消息的第一帧,ROUTER 将异步地将消息路由到连接到它的任何对等点。
    ROUTER sockets并不关心整个信封。它们对空分隔符一无所知。它们所关心的只是那个标识帧,这让它们知道要把消息发送到哪个连接。

(概述)Recap of Request-Reply Sockets

让我们来总结一下:

  • REQ socket向网络发送消息数据前面的空分隔符帧。REQ socket是同步的。REQ socket总是发送一个请求,然后等待一个响应。REQ socket每次只与一个对等点通信。如果您将一个REQ socket连接到多个对等点,则请求将被分发到每个对等点,并期望每个对等点一次发送一个响应。
  • REP socket读取并保存所有标识帧,直到并包括空分隔符,然后将以下一帧或多帧传递给调用方。 REP socket是同步的,每次只与一个对等点通信。如果您将一个 REP socket连接到多个对等点,则以公平的方式从对等点读取请求,并且始终将响应发送到发出最后一个请求的同一对等点。
  • DEALER socket不理会回复信封,并像处理任何多部分消息一样处理此消息。DEALER socket是异步的,就像PUSH and PULL 的组合。它们在所有连接之间分发发送的消息,并且公平队列接收来自所有连接的消息。
  • ROUTER socket 不理会回复信封,就像DEALER一样。它为其连接创建标识,并将这些标识作为任何接收到的消息中的第一帧传递给调用者。相反,当调用者发送消息时,它使用第一个消息帧作为标识来查找要发送到的连接。 ROUTERS是异步的。

Request-Reply Combinations(组合)

我们有四个 request-reply sockets,每个sockets具有特定的行为。我们已经看到它们如何以简单和扩展的请求-应答模式连接。但是这些sockets是您可以用来解决许多问题的构建块。
这些是合法的组合:

  • REQ to REP
  • DEALER to REP
  • REQ to ROUTER
  • DEALER to ROUTER
  • DEALER to DEALER
  • ROUTER to ROUTER

And these combinations are invalid (and I’ll explain why):

  • REQ to REQ
  • REQ to DEALER
  • REP to REP
  • REP to ROUTER

下面是一些记忆语义的技巧。DEALER 类似于异步REQ socket,而ROUTER 类似于异步REP socket。在我们使用REQ socket的地方,我们可以使用一个DEALER ;我们只需要自己读和写信封。在使用REP socket的地方,我们可以使用ROUTER ;我们只需要自己管理身份。将REQ和DEALER socket视为“客户端”,而REP和ROUTER socket视为“服务器”。大多数情况下,您需要绑定REP和ROUTER socket,并将REQ和DEALER socket连接到它们。它并不总是这么简单,但它是一个干净而令人难忘的起点。

The REQ to REP Combination

我们已经介绍了一个与REP服务器对话的REQ客户机,但让我们看其中一个方面:REQ客户机必须首先发起消息流。REP服务器无法主动与一个尚未向它发送过请求的REQ客户机通信。从技术上讲,这甚至是不可能的,如果您尝试这样做,API会返回一个EFSM错误。

The DEALER to REP Combination

现在,让我们用DEALER替换REQ客户端。这为我们提供了一个异步客户机,它可以与多个 REP服务器通信。如果我们使用DEALER重写“Hello World”客户端,我们就可以发送任意数量的“Hello”请求,而不需要等待回复。

当我们使用一个DEALER 与一个REP socket通信时,我们必须准确地模拟REQ socket将发送的信封,否则REP socket将把消息作为无效丢弃。所以,为了传递信息,我们:

  • 发送一个设置了MORE标志的空消息帧;然后
  • 发送消息体。

当我们收到信息时,我们:

  • 接收第一个帧,如果它不是空的,则丢弃整个消息;
  • 接收下一帧并将其传递给应用程序。
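
对照上面列出的收发步骤,下面是一个用DEALER模拟REQ信封、与"Hello World"式REP服务端对话的最小草图(端点地址只是假设,并非指南中hwclient的原始代码):

#include <zmq.h>
#include <stdio.h>

int main (void)
{
    void *context = zmq_ctx_new ();
    void *dealer = zmq_socket (context, ZMQ_DEALER);
    zmq_connect (dealer, "tcp://localhost:5555");

    //  Send: empty delimiter frame (with the MORE flag), then the body
    zmq_send (dealer, "", 0, ZMQ_SNDMORE);
    zmq_send (dealer, "Hello", 5, 0);

    //  Receive: the first frame must be the empty delimiter, then the body
    char buffer [256];
    int size = zmq_recv (dealer, buffer, sizeof (buffer), 0);
    if (size == 0) {
        size = zmq_recv (dealer, buffer, sizeof (buffer) - 1, 0);
        if (size > (int) sizeof (buffer) - 1)
            size = sizeof (buffer) - 1;     //  truncated reply
        buffer [size] = 0;
        printf ("Received: %s\n", buffer);
    }
    else
        printf ("Unexpected envelope, discarding\n");

    zmq_close (dealer);
    zmq_ctx_destroy (context);
    return 0;
}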

The REQ to ROUTER Combination

就像我们可以用DEALER替换REQ一样,我们也可以用ROUTER替换REP。这为我们提供了一个异步服务器,它可以同时与多个REQ客户机通信。如果我们使用ROUTER重写“Hello World”服务器,我们将能够并行处理任意数量的“Hello”请求。我们在 Chapter 2 - Sockets and Patterns mtserver示例中看到了这一点。我们可以用两种不同的方式使用ROUTER:

  • 作为在前端和后端sockets之间切换消息的代理。
  • 作为读取消息并对其进行操作的应用程序。

在第一种情况下,ROUTER 只是读取所有帧,包括人工身份帧,然后盲目地传递它们。在第二种情况下,ROUTER 必须知道它正在发送的回复信封的格式。由于另一个对等点是REQ socket,ROUTER 获得标识帧、空帧和数据帧。

The DEALER to ROUTER Combination

现在,我们可以切换REQ和REP与DEALER 和ROUTER ,以获得最强大的socket 组合,这是DEALER 与ROUTER 交谈。它为我们提供了与异步服务器通信的异步客户机,在异步服务器上,双方都完全控制消息格式。

因为DEALER 和ROUTER 都可以处理任意的消息格式,如果您希望安全地使用这些格式,您必须成为一个协议设计人员。至少您必须决定是否要模拟REQ/REP回复信封。这取决于你是否真的需要发送回复。

The DEALER to DEALER Combination

您可以用ROUTER替换REP,也可以用DEALER替换REP,前提是该DEALER只与一个对等点通信。

当您将REP 替换为DEALER时,您的worker 可以突然完全异步,发送任意数量的回复。这样做的代价是你必须自己管理回复信封,并把它们处理好,否则什么都不管用。稍后我们将看到一个工作示例。就目前而言,DEALER 对DEALER 模式是一种比较棘手的模式,值得庆幸的是,我们很少需要这种模式。

The ROUTER to ROUTER Combination

对于N-to-N连接,这听起来很完美,但它是最难用好的组合。在您对ZeroMQ相当熟练之前,应该避免使用它。我们将在Reliable Request-Reply Patterns的Freelance模式中看到一个例子,并在A Framework for Distributed Computing中看到另一种用于点对点工作的DEALER to ROUTER设计。

Invalid Combinations

大多数情况下,试图将客户端连接到客户端或服务器连接到服务器是一个坏主意,不会奏效。不过,我不会给出笼统的模糊警告,而是会详细解释:

  • REQ对REQ:双方都希望从互相发送消息开始,并且只有在您对事情进行计时以便两个对等方同时交换消息的情况下,这才能工作。一想到它就会伤到我的大脑。
  • REQ to DEALER:理论上可以这样做,但是如果添加第二个REQ,就会中断,因为DEALER无法向原始对等点发送回复。因此REQ socket 会混淆,并且/或返回针对其他客户机的消息。
  • REP to REP:双方都会等待对方发出第一条消息。
  • REP to ROUTER:理论上,如果ROUTER socket知道REP socket已经连接并且知道该连接的标识,ROUTER socket可以发起对话并发送格式正确的请求。但这种做法很混乱,与DEALER to ROUTER相比没有任何额外好处。

在这份有效与无效组合的梳理中,共同的规律是:ZeroMQ的socket连接总是不对称的,一个对等点绑定到端点,另一个对等点连接到该端点。此外,哪边绑定、哪边连接并不是任意的,而是遵循自然的模式。我们期望"一直在那里"的那一方负责绑定:它会是服务器、代理、发布者、收集器。"来来去去"的那一方负责连接:它们是clients和workers。记住这一点将帮助您设计更好的ZeroMQ架构。

探索Exploring ROUTER Sockets

让我们再仔细看看ROUTER sockets。我们已经看到了它们如何通过将单个消息路由到特定的连接来工作。我将更详细地解释如何识别这些连接,以及ROUTER sockets在不能发送消息时做什么。

Identities and Addresses

ZeroMQ中的标识概念特别指ROUTER sockets ,以及它们如何标识与其他socket的连接。
更广泛地说,身份在回复信封中用作地址。在大多数情况下,标识是任意的,并且是ROUTER sockets 的本地标识:它是哈希表中的一个查找键。独立地,对等点可以有物理地址(网络端点,如“tcp://192.168.55.117:5670”)或逻辑地址(UUID或电子邮件地址或其他惟一密钥)。

使用ROUTER sockets与特定对等点通信的应用程序,如果已经建立了必要的哈希表,就可以把逻辑地址转换为标识。由于ROUTER socket只有在某个连接向它发送消息时才会告知该连接的标识,因此您实际上只能回复一条消息,而不能主动与某个对等点通信。

即使你把规则反过来,让ROUTER去连接对等点而不是等待对等点连接进来,这一点仍然成立。不过,您可以强制ROUTER socket使用逻辑地址来代替它生成的标识。zmq_setsockopt参考页把这称为设置socket标识(socket identity)。
其工作原理如下:

  • 对等应用程序在绑定或连接之前设置其对等socket (DEALER 或REQ)的ZMQ_IDENTITY选项。
  • 通常,对等点然后连接到已经绑定的 ROUTER socket。但是 ROUTER也可以连接到对等点。
  • 在连接时,对等socket 告诉ROUTER socket,“请为这个连接使用这个标识”。
  • 如果对等socket 没有这样说,ROUTER就为连接生成它通常的任意随机标识。
  • ROUTER socket现在将此逻辑地址提供给应用程序,作为来自该对等点的任何消息的前缀标识帧。
  • ROUTER 还期望逻辑地址作为任何传出消息的前缀标识帧。

下面是连接到ROUTER socket的两个对等点的简单例子,其中一个附加了一个逻辑地址“PEER2”:

[identity: Identity check in C](javascript:😉

C++ | C# | Clojure | CL | Delphi | Erlang | F# | Go | Haskell | Haxe | Java | Lua | Node.js | Perl | PHP | Python | Q | Ruby | Scala | Tcl | Ada | Basic | Felix | Objective-C | ooc | Racket

Here is what the program prints:

----------------------------------------
[005] 006B8B4567
[000]
[039] ROUTER uses a generated 5 byte identity
----------------------------------------
[005] PEER2
[000]
[038] ROUTER uses REQ's socket identity

ROUTER的错误处理 ROUTER Error Handling

ROUTER sockets确实有一种处理无法发送到任何地方的消息的方法:它们无声地丢弃这些消息。
这种态度在工作代码中是有意义的,但它使调试变得困难。“发送标识为第一帧”的方法非常棘手,以至于我们在学习时经常会出错,而ROUTER在我们出错时的死寂也不是很有建设性。

从ZeroMQ v3.2开始,有一个socket选项可以用来捕捉这个错误:**ZMQ_ROUTER_MANDATORY**。在ROUTER socket上设置它之后,当您在发送调用中提供了不可路由的标识时,socket将报出EHOSTUNREACH错误。
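
下面是设置该选项并检查发送返回码的一个小片段(假设router是一个已创建的ZMQ_ROUTER socket,"unknown-peer"只是演示用的不可路由标识,片段还需要errno.h和stdio.h):

//  Make the ROUTER report unroutable identities instead of silently dropping
int mandatory = 1;
zmq_setsockopt (router, ZMQ_ROUTER_MANDATORY, &mandatory, sizeof (mandatory));

//  Sending to an identity that is not connected now fails with EHOSTUNREACH
int rc = zmq_send (router, "unknown-peer", 12, ZMQ_SNDMORE);
if (rc == -1 && errno == EHOSTUNREACH)
    printf ("E: no route to that peer\n");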

负载平衡模式The Load Balancing Pattern

现在让我们看一些代码。我们将看到如何将ROUTER socket连接到 REQ socket,然后连接到DEALER socket。这两个示例遵循相同的逻辑,即负载平衡模式。这种模式是我们第一次公开使用 ROUTER socket进行有意路由,而不是简单地充当应答通道。

负载平衡模式非常常见,我们将在本书中多次看到它。它解决了简单的循环式路由(如 PUSH和DEALER提供)的主要问题,即如果任务没有大致占用相同的时间,那么循环式路由就会变得低效。

这是邮局的比喻。如果每个柜台都有一个队列,有些人购买邮票(快速、简单的交易),有些人开立新帐户(非常慢的交易),那么您将发现邮票购买者被不公平地困在队列中。就像在邮局一样,如果您的消息传递体系结构是不公平的,人们会感到恼火。

邮局的解决方案是创建一个队列,这样即使一个或两个柜台工作缓慢,其他柜台将继续以先到先得的方式为客户服务。

PUSH和DEALER使用这种简单方法的一个原因是纯粹的性能。如果你到达美国任何一个主要机场,你会发现移民处排着长队。边境巡逻官员将提前派人在每个柜台排队,而不是使用单一队列。让人们提前走50码可以为每位乘客节省一到两分钟。由于每次护照检查的时间大致相同,所以或多或少是公平的。这是PUSH和DEALER的策略:提前发送工作负载,这样就有更少的旅行距离。

这是ZeroMQ反复出现的主题:世界上的问题是多种多样的,用正确的方法解决不同的问题才能让你受益。机场不是邮局,没有放之四海而皆准的方案,这其实是件好事。

让我们回到一个worker(DEALER或REQ)连接到一个broker(ROUTER)的场景。broker必须知道worker什么时候准备好了,并维护一个workers列表,以便每次都能取用最近最少使用的worker。

事实上,解决方案非常简单:工作人员在开始和完成每个任务后都会发送一条“ready”消息。broker逐个读取这些消息。每次读取一条消息时,它都是从最后一个使用的worker中读取的。因为我们使用的是ROUTER socket,我们得到一个标识,然后我们可以用它把一个任务发送回worker。
这是request-reply的一个曲解,因为任务与应答一起发送,任务的任何响应都作为一个新请求发送。下面的代码示例应该更清楚。

ROUTER Broker and REQ Workers

下面是一个负载平衡模式的例子,使用ROUTER broker与一组REQ workers对话:

[rtreq: ROUTER-to-REQ in C](javascript:😉

C++ | C# | Clojure | CL | Delphi | Erlang | F# | Go | Haskell | Haxe | Java | Lua | Node.js | Perl | PHP | Python | Ruby | Scala | Tcl | Ada | Basic | Felix | Objective-C | ooc | Q | Racket

该示例运行5秒,然后每个工作人员打印他们处理了多少任务。如果路由成功了,我们希望工作得到公平分配:

Completed: 20 tasks
Completed: 18 tasks
Completed: 21 tasks
Completed: 23 tasks
Completed: 19 tasks
Completed: 21 tasks
Completed: 17 tasks
Completed: 17 tasks
Completed: 25 tasks
Completed: 19 tasks

要与本例中的workers对话,我们必须构造一个对REQ友好的信封,它由一个标识帧和一个空的信封分隔符帧组成。

Figure 31 - Routing Envelope for REQ

[外链图片转存失败,源站可能有防盗链机制,建议将图片保存下来直接上传(img-WXYgPbyO-1611294599108)(https://github.com/imatix/zguide/raw/master/images/fig31.png)]

ROUTER Broker and DEALER Workers

凡是可以使用REQ的地方,都可以使用DEALER。有两个具体的区别:

  • REQ socket总是在任何数据帧之前发送一个空的分隔符帧;DEALER没有。
  • REQ socket在收到回复之前只发送一条消息;DEALER是完全异步的。

同步和异步行为对我们的示例没有影响,因为我们执行的是严格的请求-应答。当我们处理故障恢复时它才更加相关,我们将在 Reliable Request-Reply Patterns 中讨论这个问题。现在来看一个完全相同的例子,只是把REQ socket换成了DEALER socket:

[rtdealer: ROUTER-to-DEALER in C](javascript:😉

C++ | C# | Clojure | CL | Delphi | Erlang | F# | Go | Haskell | Haxe | Java | Lua | Node.js | Perl | PHP | Python | Ruby | Scala | Tcl | Ada | Basic | Felix | Objective-C | ooc | Q | Racket

代码几乎是相同的,除了worker使用一个DEALER socket,并读写数据帧之前的空帧。
当我想要保持与REQ worker的兼容性时,我使用这种方法。

但是,请记住空分隔符帧的原因:它允许多跳扩展请求在 REP socket中终止, REP socket使用该分隔符分隔应答信封,以便将数据帧传递给应用程序。

如果我们永远不需要把消息传递给REP socket,那么可以在两边都直接去掉空分隔符帧,这会让事情更简单。这通常是我为纯DEALER to ROUTER协议采用的设计。

负载平衡消息代理 A Load Balancing Message Broker

前面的示例只完成了一半。它可以用虚拟的请求和响应来管理一组工人,但是它没有办法与clients交谈。如果我们添加第二个接收clients请求的frontend ROUTER socket,并将我们的示例转换为一个可以将消息从前端切换到后端的代理,我们将得到一个有用的、可重用的小型负载平衡消息代理。

Figure 32 - Load Balancing Broker

[外链图片转存失败,源站可能有防盗链机制,建议将图片保存下来直接上传(img-lEW9ttjr-1611294599109)(https://github.com/imatix/zguide/raw/master/images/fig32.png)]

该代理执行以下操作:

  • 接受来自一组clients的连接。
  • 接受来自一组workers的连接。
  • 接受来自clients的请求,并将这些请求保存在一个队列中。
  • 使用负载平衡模式将这些请求发送给workers 。
  • 收到workers的回复。
  • 将这些响应发送回原始请求client。

代理代码相当长,但值得理解:

[lbbroker: Load balancing broker in C](javascript:😉

C++ | C# | Clojure | CL | Delphi | Erlang | F# | Go | Haskell | Haxe | Java | Lua | Node.js | Perl | PHP | Python | Ruby | Scala | Tcl | Ada | Basic | Felix | Objective-C | ooc | Q | Racket

这个程序最困难的部分是(a)每个socket 读取和写入的信封,以及(b)负载平衡算法。我们将依次从消息信封格式开始。

让我们遍历一个完整的请求-应答链,从client到worker,然后返回。在这段代码中,我们显式设置了client和worker sockets的标识,以便更容易地跟踪消息帧;在实际应用中,我们会让ROUTER sockets自行为连接生成标识。假设客户机的标识是"client",worker的标识是"worker"。客户机应用程序发送一个包含"Hello"的帧。

Figure 33 - Message that Client Sends

[外链图片转存失败,源站可能有防盗链机制,建议将图片保存下来直接上传(img-mMwvHS8c-1611294599111)(https://github.com/imatix/zguide/raw/master/images/fig33.png)]

Because the REQ socket adds its empty delimiter frame and the ROUTER socket adds its connection identity, the proxy reads off the frontend ROUTER socket the client address, empty delimiter frame, and the data part.

Figure 34 - Message Coming in on Frontend

[外链图片转存失败,源站可能有防盗链机制,建议将图片保存下来直接上传(img-JakJZh0t-1611294599112)(https://github.com/imatix/zguide/raw/master/images/fig34.png)]

The broker sends this to the worker, prefixed by the address of the chosen worker, plus an additional empty part to keep the REQ at the other end happy.

Figure 35 - Message Sent to Backend

[外链图片转存失败,源站可能有防盗链机制,建议将图片保存下来直接上传(img-Nq14nxJ9-1611294599113)(https://github.com/imatix/zguide/raw/master/images/fig35.png)]

This complex envelope stack gets chewed up first by the backend ROUTER socket, which removes the first frame. Then the REQ socket in the worker removes the empty part, and provides the rest to the worker application.

Figure 36 - Message Delivered to Worker

[外链图片转存失败,源站可能有防盗链机制,建议将图片保存下来直接上传(img-m40SfwHg-1611294599115)(https://github.com/imatix/zguide/raw/master/images/fig36.png)]

worker必须保存信封(即直到并包括空消息帧为止的所有部分),然后才能对数据部分做它需要做的事情。注意,REP socket会自动完成这一步,但我们使用的是REQ-ROUTER模式,以便获得恰当的负载平衡。

在返回路径上,消息与传入时的结构相同:后端socket交给代理一条五部分的消息,代理向前端socket发送一条三部分的消息,客户端最终收到一条单部分的消息。

现在让我们看看负载平衡算法。它要求客户端和工作人员都使用REQ套接字,并且工作人员在收到消息时正确地存储和重放信封。该算法是:

  • 创建一个poll集,它总是轮询后端,只有当有一个或多个工作人员可用时才轮询前端。
  • 轮询具有无限超时的活动。
  • 如果后端有活动,我们要么有一个“就绪”消息,要么有一个客户端的回复。在这两种情况下,我们都将worker地址(第一部分)存储在worker队列中,如果其余部分是客户机应答,则通过前端将其发送回客户机。
  • 如果前端有活动,我们接收客户机请求,弹出下一个worker(最后使用的),并将请求发送到后端。这意味着发送worker地址、空部分以及客户机请求的三个部分。
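
按照上面列出的算法,代理核心循环的一个简化示意如下。这不是lbbroker的原始代码:s_recv/s_send/s_sendmore假定来自指南的zhelpers.h,frontend/backend是已绑定好的两个ROUTER socket,错误处理、容量上限检查和常规头文件都被省略了:

#define NBR_WORKERS 10
char *worker_queue [NBR_WORKERS];       //  queue of identities of ready workers
int available_workers = 0;

while (1) {
    zmq_pollitem_t items [] = {
        { backend,  0, ZMQ_POLLIN, 0 },
        { frontend, 0, ZMQ_POLLIN, 0 }
    };
    //  Poll the frontend only if we have at least one ready worker
    int rc = zmq_poll (items, available_workers ? 2 : 1, -1);
    if (rc == -1)
        break;                          //  Interrupted

    if (items [0].revents & ZMQ_POLLIN) {
        //  A worker spoke to us: queue its identity, then see what it sent
        char *worker_id = s_recv (backend);
        worker_queue [available_workers++] = worker_id;

        free (s_recv (backend));        //  empty delimiter frame
        char *third = s_recv (backend);
        if (strcmp (third, "READY") != 0) {
            //  It is a reply: 'third' is the client identity, route it back
            free (s_recv (backend));    //  empty delimiter frame
            char *reply = s_recv (backend);
            s_sendmore (frontend, third);
            s_sendmore (frontend, "");
            s_send     (frontend, reply);
            free (reply);
        }
        free (third);
    }
    if (available_workers && (items [1].revents & ZMQ_POLLIN)) {
        //  A client request: pop the next ready worker and forward to it
        char *client_id = s_recv (frontend);
        free (s_recv (frontend));       //  empty delimiter frame
        char *request = s_recv (frontend);

        char *worker_id = worker_queue [--available_workers];
        s_sendmore (backend, worker_id);
        s_sendmore (backend, "");
        s_sendmore (backend, client_id);
        s_sendmore (backend, "");
        s_send     (backend, request);

        free (worker_id);
        free (client_id);
        free (request);
    }
}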

您现在应该看到,您可以重用和扩展负载平衡算法,并根据工作人员在其初始“就绪”消息中提供的信息进行更改。例如,工作人员可能启动并进行性能自我测试,然后告诉代理他们的速度有多快。然后,代理可以选择可用的最快的工人,而不是最老的工人。

A High-Level API for ZeroMQ

我们将把request-reply推入堆栈并打开另一个区域,即ZeroMQ API本身。这样做是有原因的:当我们编写更复杂的示例时,底层ZeroMQ API看起来越来越笨拙。看看我们的负载平衡代理的工作线程的核心:

while (true) {
    //  Get one address frame and empty delimiter
    char *address = s_recv (worker);
    char *empty = s_recv (worker);
    assert (*empty == 0);
    free (empty);

    //  Get request, send reply
    char *request = s_recv (worker);
    printf ("Worker: %s\n", request);
    free (request);

    s_sendmore (worker, address);
    s_sendmore (worker, "");
    s_send     (worker, "OK");
    free (address);
}

该代码甚至不能重用,因为它只能处理信封中的一个回复地址,而且它已经对ZeroMQ API进行了一些包装。如果我们使用libzmq简单消息API,我们必须这样写:

while (true) {
    //  Get one address frame and empty delimiter
    char address [255];
    int address_size = zmq_recv (worker, address, 255, 0);
    if (address_size == -1)
        break;

    char empty [1];
    int empty_size = zmq_recv (worker, empty, 1, 0);
    assert (empty_size <= 0);
    if (empty_size == -1)
        break;

    //  Get request, send reply
    char request [256];
    int request_size = zmq_recv (worker, request, 255, 0);
    if (request_size == -1)
        return NULL;
    request [request_size] = 0;
    printf ("Worker: %s\n", request);

    zmq_send (worker, address, address_size, ZMQ_SNDMORE);
    zmq_send (worker, empty, 0, ZMQ_SNDMORE);
    zmq_send (worker, "OK", 2, 0);
}

当代码太长而不能快速编写时,理解它也太长。到目前为止,我一直坚持使用本机API,因为作为ZeroMQ用户,我们需要深入了解这一点。但当它阻碍我们的时候,我们必须把它当作一个需要解决的问题。

当然,我们不能仅仅更改ZeroMQ API,这是一个文档化的公共契约,成千上万的人同意并依赖它。相反,我们基于到目前为止的经验,尤其是编写更复杂的请求-应答模式的经验,在顶层构建一个更高级别的API。

我们想要的是一个API,它允许我们一次性接收和发送完整的消息,包括包含任意数量回复地址的回复信封。它让我们用最少的代码行来做我们想做的事情。

创建一个好的消息API相当困难。我们有一个术语问题:ZeroMQ使用“message”来描述多部分消息和单个消息框架。我们有一个期望问题:有时将消息内容视为可打印的字符串数据是很自然的,有时将其视为二进制块。我们面临着技术上的挑战,尤其是如果我们想避免过多地复制数据的话。

尽管我的特定用例是C语言,但制作一个好的API所面临的挑战影响到所有的语言。无论您使用哪种语言,请考虑如何为您的语言绑定做出贡献,使其与我将要描述的C绑定一样好(或更好)。

Features of a Higher-Level API

My solution is to use three fairly natural and obvious concepts: string (already the basis for our s_send and s_recv) helpers, frame (a message frame), and message (a list of one or more frames). Here is the worker code, rewritten onto an API using these concepts:

while (true) {
    zmsg_t *msg = zmsg_recv (worker);
    zframe_reset (zmsg_last (msg), "OK", 2);
    zmsg_send (&msg, worker);
}

Cutting the amount of code we need to read and write complex messages is great: the results are easy to read and understand. Let’s continue this process for other aspects of working with ZeroMQ. Here’s a wish list of things I’d like in a higher-level API, based on my experience with ZeroMQ so far:

  • Automatic handling of sockets. I find it cumbersome to have to close sockets manually, and to have to explicitly define the linger timeout in some (but not all) cases. It’d be great to have a way to close sockets automatically when I close the context.

  • Portable thread management. Every nontrivial ZeroMQ application uses threads, but POSIX threads aren’t portable. So a decent high-level API should hide this under a portable layer.

  • Piping from parent to child threads. It’s a recurrent problem: how to signal between parent and child threads. Our API should provide a ZeroMQ message pipe (using PAIR sockets and inproc automatically).

  • Portable clocks. Even getting the time to a millisecond resolution, or sleeping for some milliseconds, is not portable. Realistic ZeroMQ applications need portable clocks, so our API should provide them.

  • A reactor to replace zmq_poll(). The poll loop is simple, but clumsy. Writing a lot of these, we end up doing the same work over and over: calculating timers, and calling code when sockets are ready. A simple reactor with socket readers and timers would save a lot of repeated work.

  • Proper handling of Ctrl-C. We already saw how to catch an interrupt. It would be useful if this happened in all applications.

The CZMQ High-Level API

Turning this wish list into reality for the C language gives us CZMQ, a ZeroMQ language binding for C. This high-level binding, in fact, developed out of earlier versions of the examples. It combines nicer semantics for working with ZeroMQ with some portability layers, and (importantly for C, but less for other languages) containers like hashes and lists. CZMQ also uses an elegant object model that leads to frankly lovely code.

Here is the load balancing broker rewritten to use a higher-level API (CZMQ for the C case):

[lbbroker2: Load balancing broker using high-level API in C](javascript:😉

C++ | Delphi | Haxe | Java | Lua | PHP | Python | Scala | Ada | Basic | C# | Clojure | CL | Erlang | F# | Felix | Go | Haskell | Node.js | Objective-C | ooc | Perl | Q | Racket | Ruby | Tcl

//  Shows how to handle Ctrl-C

#include <stdlib.h>
#include <stdio.h>
#include <signal.h>
#include <unistd.h>
#include <fcntl.h>
#include <errno.h>

#include <zmq.h>

//  Signal handling
//
//  Create a self-pipe and call s_catch_signals(pipe's writefd) in your
//  application at startup, and then exit your main loop if your pipe
//  contains any data. Works especially well with zmq_poll.

#define S_NOTIFY_MSG " "
#define S_ERROR_MSG "Error while writing to self-pipe.\n"
static int s_fd;
static void s_signal_handler (int signal_value)
{
    int rc = write (s_fd, S_NOTIFY_MSG, sizeof(S_NOTIFY_MSG));
    if (rc != sizeof(S_NOTIFY_MSG)) {
        write (STDOUT_FILENO, S_ERROR_MSG, sizeof(S_ERROR_MSG)-1);
        exit(1);
    }
}

static void s_catch_signals (int fd)
{
    s_fd = fd;

    struct sigaction action;
    action.sa_handler = s_signal_handler;
    //  Doesn't matter if SA_RESTART set because self-pipe will wake up zmq_poll
    //  But setting to 0 will allow zmq_read to be interrupted.
    action.sa_flags = 0;
    sigemptyset (&action.sa_mask);
    sigaction (SIGINT, &action, NULL);
    sigaction (SIGTERM, &action, NULL);
}

int main (void)
{
    int rc;

    void *context = zmq_ctx_new ();
    void *socket = zmq_socket (context, ZMQ_REP);
    zmq_bind (socket, "tcp://*:5555");

    int pipefds[2];
    rc = pipe(pipefds);
    if (rc != 0) {
        perror("Creating self-pipe");
        exit(1);
    }
    //  Make both ends of the pipe non-blocking
    for (int i = 0; i < 2; i++) {
        int flags = fcntl(pipefds[i], F_GETFL, 0);
        if (flags < 0) {
            perror ("fcntl(F_GETFL)");
            exit(1);
        }
        rc = fcntl (pipefds[i], F_SETFL, flags | O_NONBLOCK);
        if (rc != 0) {
            perror ("fcntl(F_SETFL)");
            exit(1);
        }
    }

    s_catch_signals (pipefds[1]);

    zmq_pollitem_t items [] = {
        { 0, pipefds[0], ZMQ_POLLIN, 0 },
        { socket, 0, ZMQ_POLLIN, 0 }
    };

    while (1) {
        rc = zmq_poll (items, 2, -1);
        if (rc == 0) {
            continue;
        } else if (rc < 0) {
            if (errno == EINTR) { continue; }
            perror("zmq_poll");
            exit(1);
        }

        //  Signal pipe FD
        if (items [0].revents & ZMQ_POLLIN) {
            char buffer [1];
            read (pipefds[0], buffer, 1);   //  clear notifying byte
            printf ("W: interrupt received, killing server...\n");
            break;
        }

        //  Read socket
        if (items [1].revents & ZMQ_POLLIN) {
            char buffer [255];
            //  Use non-blocking so we can continue to check self-pipe via zmq_poll
            rc = zmq_recv (socket, buffer, 255, ZMQ_NOBLOCK);
            if (rc < 0) {
                if (errno == EAGAIN) { continue; }
                if (errno == EINTR) { continue; }
                perror("recv");
                exit(1);
            }
            printf ("W: recv\n");

            //  Now send message back.
            //  ...
        }
    }

    printf ("W: cleaning up\n");
    zmq_close (socket);
    zmq_ctx_destroy (context);
    return 0;
}

Or, if you’re calling zmq_poll(), test on the return code:

if (zmq_poll (items, 2, 1000 * 1000) == -1)
    break;              //  Interrupted

The previous example still uses zmq_poll(). So how about reactors? The CZMQ zloop reactor is simple but functional. It lets you:

  • Set a reader on any socket, i.e., code that is called whenever the socket has input.
  • Cancel a reader on a socket.
  • Set a timer that goes off once or multiple times at specific intervals.
  • Cancel a timer.

zloop of course uses zmq_poll() internally. It rebuilds its poll set each time you add or remove readers, and it calculates the poll timeout to match the next timer. Then, it calls the reader and timer handlers for each socket and timer that need attention.

When we use a reactor pattern, our code turns inside out. The main logic looks like this:

zloop_t *reactor = zloop_new ();
zloop_reader (reactor, self->backend, s_handle_backend, self);
zloop_start (reactor);
zloop_destroy (&reactor);

The actual handling of messages sits inside dedicated functions or methods. You may not like the style—it’s a matter of taste. What it does help with is mixing timers and socket activity. In the rest of this text, we’ll use zmq_poll() in simpler cases, and zloop in more complex examples.

Here is the load balancing broker rewritten once again, this time to use zloop:

[lbbroker3: Load balancing broker using zloop in C](javascript:😉

Haxe | Java | Python | Ada | Basic | C++ | C# | Clojure | CL | Delphi | Erlang | F# | Felix | Go | Haskell | Lua | Node.js | Objective-C | ooc | Perl | PHP | Q | Racket | Ruby | Scala | Tcl

If you’re using child threads, they won’t receive the interrupt. To tell them to shutdown, you can either:

  • Destroy the context, if they are sharing the same context, in which case any blocking calls they are waiting on will end with ETERM.
  • Send them shutdown messages, if they are using their own contexts. For this you’ll need some socket plumbing.

异步客户端/服务器模式The Asynchronous Client/Server Pattern

在ROUTER to DEALER示例中,我们看到了一个1到N的用例,其中一个服务器与多个工作程序异步对话。我们可以将其颠倒过来,以获得一个非常有用的N-to-1架构,其中各种客户端与单个服务器进行通信,并以异步方式执行此操作。

Figure 37 - Asynchronous Client/Server

[外链图片转存失败,源站可能有防盗链机制,建议将图片保存下来直接上传(img-Lrh4Kjzy-1611294599116)(https://github.com/imatix/zguide/raw/master/images/fig37.png)]

以下是它的工作原理:

  • 客户端连接到服务器并发送请求。
  • 对于每个请求,服务器发送0个或更多回复。
  • 客户端可以发送多个请求而无需等待回复。
  • 服务器可以发送多个回复,而无需等待新请求。

Here’s code that shows how this works:

[asyncsrv: Asynchronous client/server in C](javascript:😉

C++ | C# | Clojure | Delphi | Erlang | F# | Go | Haskell | Haxe | Java | Lua | Node.js | PHP | Python | Ruby | Scala | Tcl | Ada | Basic | CL | Felix | Objective-C | ooc | Perl | Q | Racket

该示例在一个进程中运行,多个线程模拟真实的多进程体系结构。运行该示例时,您将看到三个客户端(每个客户端都有一个随机ID),打印出他们从服务器获得的回复。仔细查看,您会看到每个客户端任务每个请求获得0个或更多回复。

对此代码的一些评论:

  • 客户端每秒发送一次请求,并会收到零个或多个回复。为了用zmq_poll()实现这一点,我们不能简单地用1秒超时来轮询,否则我们只能在收到上一个回复后整整一秒才发送新请求。所以我们以较高频率轮询(每次1/100秒,共轮询100次),这样在时间上就近似准确了。
  • 服务器使用工作线程池,每个线程同步处理一个请求。它使用内部队列将它们连接到它的前端套接字。它使用zmq_proxy()调用连接前端和后端套接字。
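
作为对服务器主线程结构的示意,下面的片段展示了"前端ROUTER + 后端DEALER(inproc) + zmq_proxy"的骨架。端口5570、inproc端点名以及被省略的worker线程创建方式都是假设的简化写法,完整版本见上面链接的asyncsrv示例:

#include <zmq.h>

//  Server main thread: ROUTER for clients, DEALER for in-process workers,
//  then let zmq_proxy() shuttle messages between them.
static void *server_task (void *arg)
{
    (void) arg;     //  unused in this sketch
    void *context = zmq_ctx_new ();

    void *frontend = zmq_socket (context, ZMQ_ROUTER);
    zmq_bind (frontend, "tcp://*:5570");

    void *backend = zmq_socket (context, ZMQ_DEALER);
    zmq_bind (backend, "inproc://backend");

    //  Launch a pool of worker threads here; each worker connects a
    //  DEALER socket to inproc://backend (creation code omitted)

    zmq_proxy (frontend, backend, NULL);    //  blocks until the context is terminated

    zmq_close (frontend);
    zmq_close (backend);
    zmq_ctx_destroy (context);
    return NULL;
}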

Figure 38 - Detail of Asynchronous Server

[外链图片转存失败,源站可能有防盗链机制,建议将图片保存下来直接上传(img-PGvwssEI-1611294599117)(https://github.com/imatix/zguide/raw/master/images/fig38.png)]

Note that we’re doing DEALER to ROUTER dialog between client and server, but internally between the server main thread and workers, we’re doing DEALER to DEALER. If the workers were strictly synchronous, we’d use REP. However, because we want to send multiple replies, we need an async socket. We do not want to route replies, they always go to the single server thread that sent us the request.

Let’s think about the routing envelope. The client sends a message consisting of a single frame. The server thread receives a two-frame message (original message prefixed by client identity). We send these two frames on to the worker, which treats it as a normal reply envelope, returns that to us as a two frame message. We then use the first frame as an identity to route the second frame back to the client as a reply.

It looks something like this:

     client          server       frontend       worker
   [ DEALER ]<---->[ ROUTER <----> DEALER <----> DEALER ]
             1 part         2 parts       2 parts

Now for the sockets: we could use the load balancing ROUTER to DEALER pattern to talk to workers, but it’s extra work. In this case, a DEALER to DEALER pattern is probably fine: the trade-off is lower latency for each request, but higher risk of unbalanced work distribution. Simplicity wins in this case.

When you build servers that maintain stateful conversations with clients, you will run into a classic problem. If the server keeps some state per client, and clients keep coming and going, eventually it will run out of resources. Even if the same clients keep connecting, if you’re using default identities, each connection will look like a new one.

We cheat in the above example by keeping state only for a very short time (the time it takes a worker to process a request) and then throwing away the state. But that’s not practical for many cases. To properly manage client state in a stateful asynchronous server, you have to:

  • Do heartbeating from client to server. In our example, we send a request once per second, which can reliably be used as a heartbeat.

  • Store state using the client identity (whether generated or explicit) as key.

  • Detect a stopped heartbeat. If there’s no request from a client within, say, two seconds, the server can detect this and destroy any state it’s holding for that client.

Worked Example: Inter-Broker Routing

让我们看看目前为止所见过的所有内容,并将其扩展到实际应用程序。我们将在几次迭代中逐步构建它。我们最好的客户急需我们,并要求设计一个大型云计算设施。他有一个跨越许多数据中心的云的愿景,每个数据中心都是客户和工作者的集群,并且它们作为一个整体协同工作。因为我们足够聪明地知道实践总是胜过理论,所以我们建议使用ZeroMQ进行工作模拟。我们的客户,渴望在他自己的老板改变主意之前锁定预算,并在Twitter上阅读关于ZeroMQ的好消息,同意。

Establishing the Details

几杯espresso下肚之后,我们很想直接开始写代码,但一个小小的声音提醒我们:在为完全错误的问题做出一个轰动的解决方案之前,先把细节问清楚。于是我们问:"这个云要做什么工作?"

客户解释说:

  • 工作人员使用各种硬件,但他们都能够处理任何任务。每个集群有几百个工作者,总共有十几个集群。
  • 客户端为工作人员创建任务。每个任务都是一个独立的工作单元,所有客户想要的是找到一个可用的工人,并尽快将任务发送给它。会有很多客户,他们会随意来去匆匆。
  • 真正的困难是能够随时添加和删除集群。集群可以立即离开或加入云,为其所有工作人员和客户提供服务。
  • 如果他们自己的集群中没有工作人员,则客户端的任务将转移到云中的其他可用工作人员。
  • 客户端一次发送一个任务,等待回复。如果他们在X秒内没有得到答案,他们就会再次发出任务。这不是我们关心的问题;客户端API已经完成了。
  • 工人一次处理一项任务;他们是非常简单的野兽。如果它们崩溃,它们会被启动它们的任何脚本重新启动。

所以我们仔细检查以确保我们正确理解这一点:

  • "集群之间会有某种超高速的网络互连,对吧?"我们问。客户说:"是的,当然,我们不是白痴。"
  • "我们谈论的数据量有多大?"我们问。客户回答说:"每个集群最多有一千个客户端,每个客户端每秒最多发出十个请求。请求很小,回复也很小,每个都不超过1K字节。"

所以我们做了一些计算,看到这对普通的TCP很有效。 2,500个客户端x 10 /秒x 1,000字节x 2个方向= 50 MB /秒或400 Mb /秒,对1Gb网络来说不是问题。

这是一个简单的问题,不需要特殊的硬件或协议,只需要一些巧妙的路由算法和精心设计。我们首先设计一个集群(一个数据中心),然后我们弄清楚如何将集群连接在一起。

单个集群的体系结构Architecture of a Single Cluster

工人和客户是同步的。我们希望使用负载平衡模式将任务路由给workers。workers都是一样的;我们的设施没有"不同服务"的概念。workers是匿名的;客户端从不直接指定某个worker。我们在这里不尝试提供有保证的投递、重试等机制。
由于我们已经检查过的原因,客户和工人不会直接相互通话。它使得无法动态添加或删除节点。所以我们的基本模型包括我们之前看到的请求 - 回复消息代理。

Figure 39 - Cluster Architecture

[外链图片转存失败,源站可能有防盗链机制,建议将图片保存下来直接上传(img-8EXFg5iD-1611294599119)(https://github.com/imatix/zguide/raw/master/images/fig39.png)]

Scaling to Multiple Clusters

现在我们将其扩展到多个集群。每个集群都有一组客户端和工作者,以及将这些连接在一起的代理。

Figure 40 - Multiple Clusters

[外链图片转存失败,源站可能有防盗链机制,建议将图片保存下来直接上传(img-jOlzYaC1-1611294599120)(https://github.com/imatix/zguide/raw/master/images/fig40.png)]

问题是:我们如何让每个集群的客户端与其他集群的工作人员交谈?有几种可能性,每种都有利有弊:

  • 客户可以直接连接到两个经纪人。优点是我们不需要修改经纪人或工人。但客户端变得更加复杂并且意识到整体拓扑结构。例如,如果我们要添加第三个或第四个群集,则所有客户端都会受到影响。实际上,我们必须将路由和故障转移逻辑移动到客户端,这并不好。
  • 工人可以直接连接到两个经纪人。但REQ工作人员不能这样做,他们只能回复一个经纪人。我们可能会使用REP,但是REP并没有给我们提供可自定义的代理到工作路由,比如负载均衡,只有内置的负载均衡。那是失败的;如果我们想将工作分配给闲置工人,我们确实需要负载平衡。一种解决方案是将ROUTER套接字用于工作节点。让我们标记这个“想法#1”。
  • 经纪人可以相互联系。这看起来最好,因为它创建了最少的额外连接。我们无法动态添加集群,但这可能超出了范围。现在,客户和工作人员仍然不了解真正的网络拓扑,经纪人告诉对方何时有剩余容量。让我们标记这个"想法#2"。

让我们先探索想法#1。在这个模型中,我们让工人连接到两个经纪人,并接受任何一个经纪人的工作。

Figure 41 - Idea 1: Cross-connected Workers

[外链图片转存失败,源站可能有防盗链机制,建议将图片保存下来直接上传(img-9sWaiWxW-1611294599121)(https://github.com/imatix/zguide/raw/master/images/fig41.png)]

看起来很可行。然而,它并没有提供我们想要的东西,即如果可能的话客户得到本地工人,而只有在比等待更好的情况下才能得到远程工作人员。工人们也会向两个经纪人发出“准备好”的信号,可以同时获得两份工作,而其他工人仍处于闲置状态。看起来这个设计失败了,因为我们再次将路由逻辑放在边缘。

那么,想法#2然后。我们将经纪人互连,不要接触客户或工人,这些都是我们习惯的REQ。

Figure 42 - Idea 2: Brokers Talking to Each Other

[外链图片转存失败,源站可能有防盗链机制,建议将图片保存下来直接上传(img-4RYYS9jB-1611294599122)(https://github.com/imatix/zguide/raw/master/images/fig42.png)]

这种设计很有吸引力,因为问题在一个地方得到解决,对世界其他地方来说是不可见的。基本上,经纪人互相打开秘密通道,像骆驼商人一样低语,“嘿,我有一些闲置的容量。如果你有太多的客户,请给我一个喊叫,我们会处理”。
实际上,它只是一种更复杂的路由算法:经纪人成为彼此的分包商。在我们使用真实代码之前,还有其他一些关于此设计的事情:

  • 它将常见情况(同一群集上的客户端和工作者)视为默认情况,并为特殊情况(群集之间的混洗作业)执行额外工作。
  • 它允许我们为不同类型的工作使用不同的消息流。这意味着我们可以以不同方式处理它们,例如,使用不同类型的网络连接。
  • 感觉它会顺利扩展。连接三个或更多经纪人并不会变得过于复杂。如果我们发现这是一个问题,可以通过添加超级代理来轻松解决。

我们现在来做一个可以运行的例子。我们将整个集群打包到一个进程中。这显然是不现实的,但它使模拟变得简单,并且这个模拟可以相当准确地扩展到真实的多进程实现。这是ZeroMQ的魅力:您可以在微观层面进行设计并扩展到宏观层面。线程变成进程,进程再变成机器,而模式和逻辑保持不变。我们的每个"集群"进程都包含客户端线程、工作线程和代理线程。

我们现在很了解基本模型:

  • REQ客户端(REQ)线程创建工作负载并将它们传递给代理(ROUTER)。
  • REQ工作线程(REQ)处理工作负载并将结果返回给代理(ROUTER)。
  • 代理使用负载平衡模式对工作负载进行排队和分发。

Federation Versus Peering

有几种可能的方法来互连经纪人。我们想要的是能够告诉其他经纪人“我们有能力”,然后接收多个任务。我们还需要能够告诉其他经纪人,“停止,我们已经满员”。它不需要是完美的;有时我们可能接受我们无法立即处理的工作,然后我们会尽快完成。

最简单的互连是联合,其中经纪人为彼此模拟客户和工人。我们可以通过将我们的前端连接到其他代理的后端套接字来完成此操作。请注意,将套接字绑定到端点并将其连接到其他端点是合法的。

Figure 43 - Cross-connected Brokers in Federation Model

[外链图片转存失败,源站可能有防盗链机制,建议将图片保存下来直接上传(img-WTWave9k-1611294599123)(https://github.com/imatix/zguide/raw/master/images/fig43.png)]

这将给我们两个经纪人的简单逻辑和一个相当好的机制:当没有工人时,告诉其他经纪人"准备好",并从它那里接受一份工作。问题在于这个方案对于这个问题来说太简单了。联合式的代理一次只能处理一个任务。如果代理模拟的是锁步(lock-step)的客户端和工人,那么按照定义它自己也会是锁步的,即使它有许多可用的工人,这些工人也不会被用上。我们的经纪人需要以完全异步的方式连接。

联合模型非常适合其他类型的路由,尤其是面向服务的体系结构(SOA),它按服务名称和接近度而不是负载平衡或循环方式进行路由。因此,不要将其视为无用,它不适合所有用例。

让我们看一下对等方法,而不是联邦,让经纪人明确地相互了解并通过特权渠道进行交谈。让我们打破这一点,假设我们想要互连N个经纪人。每个代理都有(N-1)个对等体,所有代理都使用完全相同的代码和逻辑。经纪人之间有两种截然不同的信息流:

  • 每个broker都需要随时告诉它的对等方自己有多少可用的workers。这可以是相当简单的信息:只是一个定期更新的数量。显而易见(而且正确)的socket模式是pub-sub。因此,每个broker打开一个PUB socket来发布自己的状态信息,同时打开一个SUB socket并连接到其他每个broker的PUB socket,以获取对等方的状态信息。
  • 每个broker都需要一种方法把任务委派给对等方并异步拿回应答。我们将用ROUTER sockets来做这件事;没有其他组合可行。每个broker有两个这样的socket:一个用于它接收的任务,一个用于它委派出去的任务。如果不使用两个socket,那么每次都得额外判断读到的是请求还是应答,这意味着要往消息信封里添加更多信息。
    此外,broker与其本地clients和workers之间也存在信息流。

The Naming Ceremony

三个流 x 每个流两个socket = 我们必须在代理中管理的六个socket。选择好名字,对于让这套多socket的杂耍动作在我们脑中保持合理连贯至关重要。socket各自承担一些职责,它们所做的事情应该成为名字的基础。关键在于:几个星期后某个寒冷的星期一早晨、还没喝咖啡之前去读这段代码,也不会觉得痛苦。

让我们为这些socket举行一次萨满式的命名仪式。这三个流是:

  • broker与其本地clients和workers之间的本地请求-应答流。
  • broker与其对等brokers之间的云请求-应答流。
  • broker与其对等brokers之间的状态流。

找到长度相同、又有意义的名称,意味着我们的代码会很好地对齐。这不是什么大事,但对细节的关注是有帮助的。对于每个流,broker都有两个socket,我们可以正交地称之为前端(frontend)和后端(backend),并且会经常使用这两个名字。前端接收信息或任务,后端把它们发送给其他对等方。概念上的流向是从前端到后端(应答则以相反方向从后端回到前端)。因此,在为本教程编写的所有代码中,我们将使用这些socket名称:

  • localfe和localbe用于本地流。
  • cloudfe和cloudbe用于云流。
  • statefe和statebe用于状态流。

至于传输方式,由于我们是在一台机器上模拟整个系统,所有通信都使用ipc。它的好处是在连接性上表现得像tcp(也就是说,它是一种"断连型"传输,这一点与inproc不同),但我们不需要IP地址或DNS名称,那在这里会很麻烦。相反,我们将使用名为something-local、something-cloud和something-state的ipc端点,其中something是我们模拟集群的名称。

您可能会觉得,为了几个名字费这么大劲不值得:为什么不干脆叫它们s1、s2、s3、s4?答案是,除非你的大脑是一台完美的机器,否则你在阅读代码时需要很多帮助,而我们会看到这些名字确实有帮助。记住"三个流、两个方向"要比记住"六个不同的socket"容易得多。

Figure 44 - Broker Socket Arrangement

[外链图片转存失败,源站可能有防盗链机制,建议将图片保存下来直接上传(img-QlzWKbQV-1611294599124)(https://github.com/imatix/zguide/raw/master/images/fig44.png)]

Note that we connect the cloudbe in each broker to the cloudfe in every other broker, and likewise we connect the statebe in each broker to the statefe in every other broker.

Prototyping the State Flow

Because each socket flow has its own little traps for the unwary, we will test them in real code one-by-one, rather than try to throw the whole lot into code in one go. When we’re happy with each flow, we can put them together into a full program. We’ll start with the state flow.

Figure 45 - The State Flow

[外链图片转存失败,源站可能有防盗链机制,建议将图片保存下来直接上传(img-KV3zHFjz-1611294599125)(https://github.com/imatix/zguide/raw/master/images/fig45.png)]

Here is how this works in code:

[peering1: Prototype state flow in C](javascript:😉

C# | Clojure | Delphi | F# | Go | Haskell | Haxe | Java | Lua | Node.js | PHP | Python | Racket | Ruby | Scala | Tcl | Ada | Basic | C++ | CL | Erlang | Felix | Objective-C | ooc | Perl | Q

  • Each broker has an identity that we use to construct ipc endpoint names. A real broker would need to work with TCP and a more sophisticated configuration scheme. We’ll look at such schemes later in this book, but for now, using generated ipc names lets us ignore the problem of where to get TCP/IP addresses or names.

  • We use a zmq_poll() loop as the core of the program. This processes incoming messages and sends out state messages. We send a state message only if we did not get any incoming messages and we waited for a second. If we send out a state message each time we get one in, we’ll get message storms.

  • We use a two-part pub-sub message consisting of sender address and data. Note that we will need to know the address of the publisher in order to send it tasks, and the only way is to send this explicitly as a part of the message.

  • We don’t set identities on subscribers because if we did then we’d get outdated state information when connecting to running brokers.

  • We don’t set a HWM on the publisher, but if we were using ZeroMQ v2.x that would be a wise idea.

We can build this little program and run it three times to simulate three clusters. Let’s call them DC1, DC2, and DC3 (the names are arbitrary). We run these three commands, each in a separate window:

peering1 DC1 DC2 DC3  #  Start DC1 and connect to DC2 and DC3
peering1 DC2 DC1 DC3  #  Start DC2 and connect to DC1 and DC3
peering1 DC3 DC1 DC2  #  Start DC3 and connect to DC1 and DC2

You’ll see each cluster report the state of its peers, and after a few seconds they will all happily be printing random numbers once per second. Try this and satisfy yourself that the three brokers all match up and synchronize to per-second state updates.

In real life, we’d not send out state messages at regular intervals, but rather whenever we had a state change, i.e., whenever a worker becomes available or unavailable. That may seem like a lot of traffic, but state messages are small and we’ve established that the inter-cluster connections are super fast.

If we wanted to send state messages at precise intervals, we’d create a child thread and open the statebe socket in that thread. We’d then send irregular state updates to that child thread from our main thread and allow the child thread to conflate them into regular outgoing messages. This is more work than we need here.

Prototyping the Local and Cloud Flows

Let’s now prototype the flow of tasks via the local and cloud sockets. This code pulls requests from clients and then distributes them to local workers and cloud peers on a random basis.

Figure 46 - The Flow of Tasks

[外链图片转存失败,源站可能有防盗链机制,建议将图片保存下来直接上传(img-Su6ki9dp-1611294599126)(https://github.com/imatix/zguide/raw/master/images/fig46.png)]

Before we jump into the code, which is getting a little complex, let’s sketch the core routing logic and break it down into a simple yet robust design.

We need two queues, one for requests from local clients and one for requests from cloud clients. One option would be to pull messages off the local and cloud frontends, and pump these onto their respective queues. But this is kind of pointless because ZeroMQ sockets are queues already. So let’s use the ZeroMQ socket buffers as queues.

This was the technique we used in the load balancing broker, and it worked nicely. We only read from the two frontends when there is somewhere to send the requests. We can always read from the backends, as they give us replies to route back. As long as the backends aren’t talking to us, there’s no point in even looking at the frontends.

So our main loop becomes:

  • Poll the backends for activity. When we get a message, it may be “ready” from a worker or it may be a reply. If it’s a reply, route back via the local or cloud frontend.

  • If a worker replied, it became available, so we queue it and count it.

  • While there are workers available, take a request, if any, from either frontend and route to a local worker, or randomly, to a cloud peer.

Randomly sending tasks to a peer broker rather than a worker simulates work distribution across the cluster. It’s dumb, but that is fine for this stage.

We use broker identities to route messages between brokers. Each broker has a name that we provide on the command line in this simple prototype. As long as these names don’t overlap with the ZeroMQ-generated UUIDs used for client nodes, we can figure out whether to route a reply back to a client or to a broker.

Here is how this works in code. The interesting part starts around the comment “Interesting part”.

[peering2: Prototype local and cloud flow in C](javascript:😉

C# | Delphi | F# | Go | Haskell | Haxe | Java | Lua | PHP | Python | Ruby | Scala | Tcl | Ada | Basic | C++ | Clojure | CL | Erlang | Felix | Node.js | Objective-C | ooc | Perl | Q | Racket

For example, run two instances of the broker in two windows:

peering2 me you
peering2 you me

Some comments on this code:

  • In the C code at least, using the zmsg class makes life much easier, and our code much shorter. It’s obviously an abstraction that works. If you build ZeroMQ applications in C, you should use CZMQ.

  • Because we’re not getting any state information from peers, we naively assume they are running. The code prompts you to confirm when you’ve started all the brokers. In the real case, we’d not send anything to brokers who had not told us they exist.

You can satisfy yourself that the code works by watching it run forever. If there were any misrouted messages, clients would end up blocking, and the brokers would stop printing trace information. You can prove that by killing either of the brokers. The other broker tries to send requests to the cloud, and one-by-one its clients block, waiting for an answer.

Putting it All Together

Let’s put this together into a single package. As before, we’ll run an entire cluster as one process. We’re going to take the two previous examples and merge them into one properly working design that lets you simulate any number of clusters.

This code is the size of both previous prototypes together, at 270 LoC. That’s pretty good for a simulation of a cluster that includes clients and workers and cloud workload distribution. Here is the code:

[peering3: Full cluster simulation in C](javascript:😉

Delphi | F# | Go | Haskell | Haxe | Java | Lua | PHP | Python | Ruby | Tcl | Ada | Basic | C++ | C# | Clojure | CL | Erlang | Felix | Node.js | Objective-C | ooc | Perl | Q | Racket | Scala

  • The client threads detect and report a failed request. They do this by polling for a response and if none arrives after a while (10 seconds), printing an error message.

  • Client threads don’t print directly, but instead send a message to a monitor socket (PUSH) that the main loop collects (PULL) and prints off. This is the first case we’ve seen of using ZeroMQ sockets for monitoring and logging; this is a big use case that we’ll come back to later.

  • Clients simulate varying loads to get the cluster 100% at random moments, so that tasks are shifted over to the cloud. The number of clients and workers, and delays in the client and worker threads control this. Feel free to play with them to see if you can make a more realistic simulation.

  • The main loop uses two pollsets. It could in fact use three: information, backends, and frontends. As in the earlier prototype, there is no point in taking a frontend message if there is no backend capacity.

These are some of the problems that arose during development of this program:

  • Clients would freeze, due to requests or replies getting lost somewhere. Recall that the ROUTER socket drops messages it can’t route. The first tactic here was to modify the client thread to detect and report such problems. Secondly, I put zmsg_dump() calls after every receive and before every send in the main loop, until the origin of the problems was clear.

  • The main loop was mistakenly reading from more than one ready socket. This caused the first message to be lost. I fixed that by reading only from the first ready socket.

  • The zmsg class was not properly encoding UUIDs as C strings. This caused UUIDs that contain 0 bytes to be corrupted. I fixed that by modifying zmsg to encode UUIDs as printable hex strings.

This simulation does not detect disappearance of a cloud peer. If you start several peers and stop one, and it was broadcasting capacity to the others, they will continue to send it work even if it’s gone. You can try this, and you will get clients that complain of lost requests. The solution is twofold: first, only keep the capacity information for a short time so that if a peer does disappear, its capacity is quickly set to zero. Second, add reliability to the request-reply chain. We’ll look at reliability in the next chapter.

Chapter 4 - Reliable Request-Reply Patterns

第3章 - 高级请求-应答模式涵盖了ZeroMQ请求-应答模式的高级用法,并提供了可运行的示例。本章着眼于可靠性这个一般性问题,并在ZeroMQ的核心请求-应答模式之上构建一组可靠的消息传递模式。在本章中,我们将专注于用户空间中的请求-应答模式,这些可重用的模型可以帮助您设计自己的ZeroMQ架构:

  • 懒惰海盗模式(Lazy Pirate):来自客户端的可靠请求-应答
  • 简单海盗模式(Simple Pirate):使用负载平衡的可靠请求-应答
  • 偏执海盗模式(Paranoid Pirate):带心跳的可靠请求-应答
  • Majordomo模式:面向服务的可靠排队
  • 泰坦尼克模式(Titanic):基于磁盘的/断连状态下的可靠排队
  • 双星模式(Binary Star):主备服务器故障转移
  • Freelance模式:无代理的可靠请求-应答

What is “Reliability”?

大多数谈到“可靠性”的人并不真正知道他们的意思。我们只能根据失败来定义可靠性。也就是说,如果我们能够处理一组明确定义和理解的失败,那么我们就这些失败而言是可靠的。不多也不少。因此,让我们看一下分布式ZeroMQ应用程序中可能的失败原因,大概是概率的降序:

  • 应用程序代码是最糟糕的罪犯。它可能会崩溃并退出,冻结并停止响应输入,对输入运行速度太慢,耗尽所有内存,等等。
  • 系统代码(例如我们用ZeroMQ编写的代理)也可能因为和应用程序代码相同的原因而死掉。系统代码应该比应用程序代码更可靠,但它仍然可能崩溃,尤其是在为慢速客户端排队消息而耗尽内存时。
  • 消息队列可能溢出,这通常发生在那些学会了粗暴对待慢速客户端的系统代码中。当队列溢出时,它会开始丢弃消息,于是我们就"丢失"了消息。
  • 网络可能失败(例如,WiFi被关闭或超出范围)。在这种情况下,ZeroMQ将自动重新连接,但与此同时,消息可能会丢失。
  • 硬件可能出故障,并带走在那台机器上运行的所有进程。网络可能以异乎寻常的方式失败,例如交换机上的某些端口坏掉,网络的那一部分变得不可访问。整个数据中心可能受到雷击、地震、火灾,或者更平常的电力或冷却故障的影响。

使软件系统完全可靠地应对所有这些可能的故障是一项非常困难和昂贵的工作,超出了本书的范围。

上面列表中的前五种情况覆盖了大公司之外99.9%的实际需求(依据是我刚刚做的一项高度科学的研究,它还告诉我78%的统计数字是当场编出来的,而且永远不要相信我们没有亲自伪造过的统计数字),所以这正是我们要研究的内容。如果您是一家有钱处理最后两种情况的大公司,请立即联系我的公司!我海滨别墅后面正有一个大坑,等着改造成行政泳池。

Designing Reliability

因此,为了使事情变得非常简单,可靠性是“在代码冻结或崩溃时保持正常工作”,这种情况我们将缩短为“死亡”。但是,我们希望保持正常工作的事情比仅仅消息更复杂。我们需要采用每个核心ZeroMQ消息模式,看看如何使它工作(如果可以的话)即使代码死了。
我们一个接一个地拿走它们:

  • 请求 - 回复:如果服务器死亡(处理请求时),客户端可以解决这个问题,因为它不会得到回复。然后它可以放弃,等待,稍后再试,找到另一台服务器,依此类推。至于客户死亡,我们现在可以将其视为“别人的问题”。
  • Pub-sub:如果客户端死了(获得了一些数据),服务器就不知道了。 Pub-sub不会将任何信息从客户端发送回服务器。但是客户可以通过带外联系服务器,例如通过请求 - 回复,并询问“请重新发送我错过的所有内容”。至于服务器死亡,这在这里超出了范围。订户还可以自我验证他们没有运行得太慢,并采取行动(例如,警告操作员并死亡)。
  • 管道:如果worker死了(在工作时),ventilator(任务分发者)并不知情。管道就像时间的齿轮,只朝一个方向转动。但是下游的collector可以检测到某个任务没有完成,并向ventilator发回一条消息:"喂,重新发送任务324!"如果ventilator或collector死掉,最初发出任务批次的上游客户端总会等得不耐烦,然后重新发送整批任务。这并不优雅,但系统代码实际上不应该频繁死掉到让这成为问题的程度。

在本章中,我们将仅关注请求 - 回复,这是可靠消息传递的最低成果。

基本的请求 - 应答模式(对REP服务器套接字执行阻塞发送/接收的REQ客户端套接字)在处理最常见的故障类型时得分很低。如果服务器在处理请求时崩溃,则客户端将永久挂起。如果网络丢失请求或回复,则客户端将永久挂起。

由于ZeroMQ能够以静默方式重新连接对等端,负载平衡消息等,因此请求 - 应答仍然比TCP好得多。但它对于实际工作来说仍然不够好。您可以真正信任基本请求 - 回复模式的唯一情况是在同一进程中的两个线程之间,没有网络或单独的服务器进程死亡。

然而,通过一些额外的工作,这个简单的模式成为分布式网络中实际工作的良好基础,我们得到一组可靠的请求 - 回复(RRR)模式,我喜欢称之为海盗模式(你最终会得到这个笑话,我希望)。根据我的经验,有三种方法可以将客户端连接到服务器。每个都需要一种特定的可靠性方法:

  • 多个客户端直接与单个服务器通信。使用案例:客户需要与之交谈的单个知名服务器。我们打算处理的故障类型:服务器崩溃和重启以及网络断开连接。
  • 多个客户端与一个代理(broker proxy)通信,该代理把工作分发给多个workers。使用案例:面向服务的事务处理。我们打算处理的故障类型:workers崩溃和重启、workers忙循环、workers过载、队列崩溃和重启,以及网络断开。
  • 多个客户端与没有中间代理的多个服务器通信。使用案例:名称解析等分布式服务。我们打算处理的故障类型:服务崩溃和重启,服务繁忙循环,服务过载和网络断开。

这些方法中的每一种都有其权衡,通常你会混合它们。我们将详细介绍这三个方面。

Client-Side Reliability (Lazy Pirate Pattern)

我们可以通过对客户端进行一些更改来获得非常简单的可靠请求-应答。我们称之为懒惰海盗模式(Lazy Pirate)。我们不再做阻塞式接收,而是:

  • 轮询REQ套接字并仅在确定答复到达时才从其接收。
  • 如果在超时期限内没有回复,则重新发送请求。
  • 如果在多次请求后仍然没有回复,请放弃该交易。
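
对照上面三个步骤,下面是客户端重试逻辑的一个简化草图。REQUEST_TIMEOUT/REQUEST_RETRIES的取值、端点地址和"Hello"负载都只是示例,并且省略了上面链接的lpclient中的序号校验等细节:

#define REQUEST_TIMEOUT 2500            //  msecs
#define REQUEST_RETRIES 3

//  Assumes an existing 'context'; endpoint and payload are placeholders
void *client = zmq_socket (context, ZMQ_REQ);
zmq_connect (client, "tcp://localhost:5555");
int retries_left = REQUEST_RETRIES;

zmq_send (client, "Hello", 5, 0);
while (retries_left) {
    zmq_pollitem_t items [] = { { client, 0, ZMQ_POLLIN, 0 } };
    zmq_poll (items, 1, REQUEST_TIMEOUT);

    if (items [0].revents & ZMQ_POLLIN) {
        //  We got a reply in time: print it and stop retrying
        char reply [256];
        int size = zmq_recv (client, reply, sizeof (reply) - 1, 0);
        if (size == -1)
            break;
        if (size > (int) sizeof (reply) - 1)
            size = sizeof (reply) - 1;      //  truncated reply
        reply [size] = 0;
        printf ("I: server replied (%s)\n", reply);
        break;
    }
    else
    if (--retries_left == 0) {
        printf ("E: server seems to be offline, abandoning\n");
        break;
    }
    else {
        printf ("W: no response from server, retrying...\n");
        //  The old REQ socket is now in a confused state; discard it and reopen
        int linger = 0;
        zmq_setsockopt (client, ZMQ_LINGER, &linger, sizeof (linger));
        zmq_close (client);
        client = zmq_socket (context, ZMQ_REQ);
        zmq_connect (client, "tcp://localhost:5555");
        zmq_send (client, "Hello", 5, 0);   //  resend the request
    }
}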

如果你试图在严格的发送/接收顺序之外使用REQ socket,就会得到一个错误(从技术上讲,REQ socket实现了一个小型有限状态机来强制执行发送/接收的乒乓顺序,所以错误代码被称为"EFSM")。当我们想以海盗模式使用REQ时,这有点烦人,因为我们可能会在收到回复之前就重新发送若干个请求。

相当不错的暴力解决方案是在出错后关闭并重新打开REQ套接字:

[lpclient: Lazy Pirate client in C](javascript:😉

C++ | C# | Clojure | Delphi | Go | Haskell | Haxe | Java | Lua | Perl | PHP | Python | Ruby | Tcl | Ada | Basic | CL | Erlang | F# | Felix | Node.js | Objective-C | ooc | Q | Racket | Scala

Run this together with the matching server:
[lpserver: Lazy Pirate server in C](javascript:😉

C++ | C# | Clojure | Delphi | Go | Haskell | Haxe | Java | Lua | Perl | PHP | Python | Ruby | Scala | Tcl | Ada | Basic | CL | Erlang | F# | Felix | Node.js | Objective-C | ooc | Q | Racket

Figure 47 - The Lazy Pirate Pattern

[外链图片转存失败,源站可能有防盗链机制,建议将图片保存下来直接上传(img-f0znMj51-1611294599127)(https://github.com/imatix/zguide/raw/master/images/fig47.png)]

要运行此测试用例,请在两个控制台窗口中启动客户端和服务器。几条消息后服务器会随机出错。您可以查看客户的回复。以下是服务器的典型输出:

I: normal request (1)
I: normal request (2)
I: normal request (3)
I: simulating CPU overload
I: normal request (4)
I: simulating a crash

And here is the client’s response:

I: connecting to server...
I: server replied OK (1)
I: server replied OK (2)
I: server replied OK (3)
W: no response from server, retrying...
I: connecting to server...
W: no response from server, retrying...
I: connecting to server...
E: server seems to be offline, abandoning

客户端对每条消息进行编号,并检查回复是否严格按顺序返回:没有请求或回复丢失,也没有回复被重复或乱序返回。把测试多运行几次,直到你确信这个机制确实有效。在生产应用中并不需要这些序号,它们只是帮助我们确认设计可信。

客户端使用REQ socket,并且通过暴力的关闭/重新打开来绕过REQ socket强制的严格发送/接收循环。您可能想改用DEALER,但这不是一个好决定。首先,这意味着要模拟REQ在信封上做的那些"独门处理"(如果你已经忘了那是什么,这正说明你不会想自己去做)。其次,这可能意味着收到你意料之外的回复。

仅在客户端处理故障,适用于一组客户端与单个服务器通信的场景。它可以应对服务器崩溃,但仅限于"恢复"意味着重启同一台服务器的情况。如果是永久性故障,例如服务器硬件的电源烧毁,这种方法就不起作用了。由于服务器中的应用程序代码通常是任何架构中最大的故障来源,依赖单个服务器并不是一个好主意。

所以,利弊:

  • 利:简单易懂和实施。
  • 利:使用现有客户端和服务器应用程序代码轻松工作。
  • 利:ZeroMQ自动重试实际的重新连接,直到它工作。
  • 弊:不会故障转移到备份或备用服务器。

Basic Reliable Queuing (Simple Pirate Pattern)

我们的第二种方法是使用队列代理扩展Lazy Pirate模式,让我们透明地与多个服务器通信,我们可以更准确地称之为“工作者”。我们将分阶段开发,从最小的工作模型,简单的海盗模式开始。

在所有这些海盗模式中,workers都是无状态的。如果应用程序需要某些共享状态(例如共享数据库),我们在设计消息传递框架时并不知情。有了队列代理,workers就可以来来去去,而客户端对此一无所知。如果一个worker死掉,另一个会接管。这是一个很好、很简单的拓扑,只有一个真正的弱点,即中央队列本身:它可能成为一个需要管理的问题,以及一个单点故障。

Figure 48 - The Simple Pirate Pattern

[外链图片转存失败,源站可能有防盗链机制,建议将图片保存下来直接上传(img-CWjVVkY1-1611294599128)(https://github.com/imatix/zguide/raw/master/images/fig48.png)]

队列代理的基础是第3章 - 高级请求-应答模式中的负载平衡代理。为了处理死掉或阻塞的workers,我们最少需要做什么?事实证明,需要做的出奇地少。我们已经在客户端有了重试机制,因此使用负载平衡模式就能很好地工作。这符合ZeroMQ的理念:我们可以通过在中间插入简单朴素的代理,来扩展像请求-应答这样的点对点模式。

我们不需要特殊的客户;我们还在使用Lazy Pirate客户端。这是队列,与负载平衡代理的主要任务相同:

[spqueue: Simple Pirate queue in C](javascript:😉

C++ | C# | Clojure | Delphi | Go | Haskell | Haxe | Java | Lua | PHP | Python | Tcl | Ada | Basic | CL | Erlang | F# | Felix | Node.js | Objective-C | ooc | Perl | Q | Racket | Ruby | Scala

这是工作者,它使用Lazy Pirate服务器并使其适应负载平衡模式(使用REQ“就绪”信令):
[spworker: Simple Pirate worker in C](javascript:😉

C++ | C# | Clojure | Delphi | Go | Haskell | Haxe | Java | Lua | PHP | Python | Tcl | Ada | Basic | CL | Erlang | F# | Felix | Node.js | Objective-C | ooc | Perl | Q | Racket | Ruby | Scala

要对此进行测试,请按任意顺序启动少量workers、一个Lazy Pirate客户端和队列。你会看到workers最终一个个崩溃退出,客户端重试之后放弃。队列永远不会停止,您可以随时重启workers和客户端。此模型适用于任意数量的客户端和workers。

强大可靠的排队(偏执海盗模式)Robust Reliable Queuing (Paranoid Pirate Pattern)

Figure 49 - The Paranoid Pirate Pattern

[外链图片转存失败,源站可能有防盗链机制,建议将图片保存下来直接上传(img-6b1ltzB5-1611294599129)(https://github.com/imatix/zguide/raw/master/images/fig49.png)]

Simple Pirate Queue模式非常有效,特别是因为它只是两个现有模式的组合。它仍然存在一些缺点:

  • 面对队列崩溃和重启,它并不健壮。客户将恢复,但工人不会。虽然ZeroMQ会自动重新连接工作服务器,但就新启动的队列而言,工作人员尚未准备就绪,因此不存在。要解决这个问题,我们必须从队列到工作人员进行心跳检测,以便工作人员可以检测队列何时消失。
  • 队列不检测工作程序故障,因此如果工作程序在空闲时死亡,则队列无法将其从工作队列中删除,直到队列向其发送请求为止。客户端等待并重试。这不是一个关键问题,但并不好。为了使这项工作正常,我们从工作人员到队列进行心跳,以便队列可以在任何阶段检测丢失的工作人员。

我们将在一个适当迂腐的偏执狂海盗模式中解决这些问题。

我们以前为工人使用了REQ套接字。对于Paranoid Pirate工作人员,我们将切换到DEALER套接字。这样做的好处是让我们可以随时发送和接收消息,而不是REQ强加的锁步发送/接收。 DEALER的缺点是我们必须自己进行信封管理(重新阅读第3章 - 高级请求 - 回复模式以了解此概念的背景)。

We’re still using the Lazy Pirate client. Here is the Paranoid Pirate queue proxy:

[ppqueue: Paranoid Pirate queue in C](javascript:😉

C++ | C# | Go | Haskell | Haxe | Java | Lua | PHP | Python | Tcl | Ada | Basic | Clojure | CL | Delphi | Erlang | F# | Felix | Node.js | Objective-C | ooc | Perl | Q | Racket | Ruby | Scala
队列通过对workers的心跳扩展了负载平衡模式。心跳是那些看似"简单"、实则很难做对的事情之一。稍后我会详细解释。
Here is the Paranoid Pirate worker:

[ppworker: Paranoid Pirate worker in C](javascript:😉

C++ | C# | Go | Haskell | Haxe | Java | Lua | PHP | Python | Tcl | Ada | Basic | Clojure | CL | Delphi | Erlang | F# | Felix | Node.js | Objective-C | ooc | Perl | Q | Racket | Ruby | Scala
关于这个例子的一些评论:

  • 该代码包括以前的故障模拟。这使得它(a)很难调试,(b)重用是危险的。如果要调试此功能,请禁用故障模拟。
  • worker使用的重连策略类似于我们为Lazy Pirate客户端设计的那套,但有两个主要区别:(a)它做指数退避,(b)它会无限重试(而客户端只重试几次,然后就报告失败)。

尝试运行客户端、队列和workers,例如使用如下脚本:
ppqueue &
for i in 1 2 3 4; do
    ppworker &
    sleep 1
done
lpclient &

你应该看到工人在模拟崩溃时一个接一个地死去,客户最终放弃了。您可以停止并重新启动队列,客户端和工作人员将重新连接并继续运行。无论你对队列和工作人员做了什么,客户端都不会得到无序的回复:整个链条都有效,或者客户放弃。

Heartbeating

心跳解决的是"如何知道对端是活着还是死了"的问题。这不是ZeroMQ特有的问题。TCP的超时很长(大约30分钟左右),这意味着你可能无法得知对端到底是死了、断开了,还是带着一位红发女郎和一大笔差旅费去布拉格过伏特加周末了。

把心跳做对并不容易。编写Paranoid Pirate示例时,光是让心跳正常工作就花了大约五个小时,而请求-应答链的其余部分大概只花了十分钟。尤其容易造成"误报故障",即对端因为心跳没有被正确发送而判定对方已断开。我们将看看人们在ZeroMQ中做心跳的三种主要方式。

Shrugging It Off

The most common approach is to do no heartbeating at all and hope for the best. Many if not most ZeroMQ applications do this. ZeroMQ encourages this by hiding peers in many cases. What problems does this approach cause?

  • When we use a ROUTER socket in an application that tracks peers, as peers disconnect and reconnect, the application will leak memory (resources that the application holds for each peer) and get slower and slower.

  • When we use SUB- or DEALER-based data recipients, we can’t tell the difference between good silence (there’s no data) and bad silence (the other end died). When a recipient knows the other side died, it can for example switch over to a backup route.

  • If we use a TCP connection that stays silent for a long while, it will, in some networks, just die. Sending something (technically, a “keep-alive” more than a heartbeat), will keep the network alive.

One-Way Heartbeats

A second option is to send a heartbeat message from each node to its peers every second or so. When one node hears nothing from another within some timeout (several seconds, typically), it will treat that peer as dead. Sounds good, right? Sadly, no. This works in some cases but has nasty edge cases in others.

For pub-sub, this does work, and it’s the only model you can use. SUB sockets cannot talk back to PUB sockets, but PUB sockets can happily send “I’m alive” messages to their subscribers.

As an optimization, you can send heartbeats only when there is no real data to send. Furthermore, you can send heartbeats progressively slower and slower, if network activity is an issue (e.g., on mobile networks where activity drains the battery). As long as the recipient can detect a failure (sharp stop in activity), that’s fine.
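
As a rough sketch of the publisher side of this approach (assuming the CZMQ API used elsewhere in this chapter; the endpoint, topic string, and timing values are illustrative):

#include "czmq.h"

#define HEARTBEAT_INTERVAL  1000    //  msecs

int main (void)
{
    zctx_t *ctx = zctx_new ();
    void *publisher = zsocket_new (ctx, ZMQ_PUB);
    zsocket_bind (publisher, "tcp://*:6000");

    uint64_t heartbeat_at = zclock_time () + HEARTBEAT_INTERVAL;
    while (!zctx_interrupted) {
        //  ... publish real data here as it becomes available ...
        //  (sending real data could also push heartbeat_at forward,
        //  which is the optimization described above)
        if (zclock_time () > heartbeat_at) {
            zstr_send (publisher, "HEARTBEAT");
            heartbeat_at = zclock_time () + HEARTBEAT_INTERVAL;
        }
        zclock_sleep (100);         //  Placeholder for real work
    }
    zctx_destroy (&ctx);
    return 0;
}

A subscriber then treats any incoming message, data or heartbeat, as proof of life, and times out after a few missed intervals.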

Here are the typical problems with this design:

  • It can be inaccurate when we send large amounts of data, as heartbeats will be delayed behind that data. If heartbeats are delayed, you can get false timeouts and disconnections due to network congestion. Thus, always treat any incoming data as a heartbeat, whether or not the sender optimizes out heartbeats.

  • While the pub-sub pattern will drop messages for disappeared recipients, PUSH and DEALER sockets will queue them. So if you send heartbeats to a dead peer and it comes back, it will get all the heartbeats you sent, which can be thousands. Whoa, whoa!

  • This design assumes that heartbeat timeouts are the same across the whole network. But that won’t be accurate. Some peers will want very aggressive heartbeating in order to detect faults rapidly. And some will want very relaxed heartbeating, in order to let sleeping networks lie and save power.

Ping-Pong Heartbeats

The third option is to use a ping-pong dialog. One peer sends a ping command to the other, which replies with a pong command. Neither command has any payload. Pings and pongs are not correlated. Because the roles of “client” and “server” are arbitrary in some networks, we usually specify that either peer can in fact send a ping and expect a pong in response. However, because the timeouts depend on network topologies known best to dynamic clients, it is usually the client that pings the server.

This works for all ROUTER-based brokers. The same optimizations we used in the second model make this work even better: treat any incoming data as a pong, and only send a ping when not otherwise sending data.
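
Here's a rough sketch of the pinging side under this model, assuming CZMQ as in the rest of this chapter; the endpoint, timings, and bare "PING" string are my own illustrations, not a defined protocol. The server side would simply answer any ping with a pong addressed back to the sender.

#include "czmq.h"

#define PING_INTERVAL   1000        //  msecs between pings when idle
#define SERVER_TTL      3000        //  msecs of silence before we give up

int main (void)
{
    zctx_t *ctx = zctx_new ();
    void *client = zsocket_new (ctx, ZMQ_DEALER);
    zsocket_connect (client, "tcp://localhost:5555");

    uint64_t ping_at = zclock_time () + PING_INTERVAL;
    uint64_t expires = zclock_time () + SERVER_TTL;

    while (!zctx_interrupted) {
        zmq_pollitem_t items [] = { { client, 0, ZMQ_POLLIN, 0 } };
        zmq_poll (items, 1, PING_INTERVAL * ZMQ_POLL_MSEC);

        if (items [0].revents & ZMQ_POLLIN) {
            //  Any traffic at all (pong or real data) proves the server is alive
            char *msg = zstr_recv (client);
            free (msg);
            expires = zclock_time () + SERVER_TTL;
        }
        if (zclock_time () > expires) {
            printf ("W: server looks dead\n");
            break;
        }
        if (zclock_time () > ping_at) {
            zstr_send (client, "PING");
            ping_at = zclock_time () + PING_INTERVAL;
        }
    }
    zctx_destroy (&ctx);
    return 0;
}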

Heartbeating for Paranoid Pirate

For Paranoid Pirate, we chose the second approach. It might not have been the simplest option: if designing this today, I’d probably try a ping-pong approach instead. However the principles are similar. The heartbeat messages flow asynchronously in both directions, and either peer can decide the other is “dead” and stop talking to it.

In the worker, this is how we handle heartbeats from the queue:

  • We calculate a liveness, which is how many heartbeats we can still miss before deciding the queue is dead. It starts at three and we decrement it each time we miss a heartbeat.
  • We wait, in the zmq_poll loop, for one second each time, which is our heartbeat interval.
  • If there’s any message from the queue during that time, we reset our liveness to three.
  • If there’s no message during that time, we count down our liveness.
  • If the liveness reaches zero, we consider the queue dead.
  • If the queue is dead, we destroy our socket, create a new one, and reconnect.
  • To avoid opening and closing too many sockets, we wait for a certain interval before reconnecting, and we double the interval each time until it reaches 32 seconds.

And this is how we handle heartbeats to the queue:

  • We calculate when to send the next heartbeat; this is a single variable because we’re talking to one peer, the queue.
  • In the zmq_poll loop, whenever we pass this time, we send a heartbeat to the queue.

Here’s the essential heartbeating code for the worker:

#define HEARTBEAT_LIVENESS  3       //  3-5 is reasonable
#define HEARTBEAT_INTERVAL  1000    //  msecs
#define INTERVAL_INIT       1000    //  Initial reconnect
#define INTERVAL_MAX       32000    //  After exponential backoff

//  If liveness hits zero, queue is considered disconnected
size_t liveness = HEARTBEAT_LIVENESS;
size_t interval = INTERVAL_INIT;

//  Send out heartbeats at regular intervals
uint64_t heartbeat_at = zclock_time () + HEARTBEAT_INTERVAL;

while (true) {
    zmq_pollitem_t items [] = { { worker, 0, ZMQ_POLLIN, 0 } };
    int rc = zmq_poll (items, 1, HEARTBEAT_INTERVAL * ZMQ_POLL_MSEC);

    if (items [0].revents & ZMQ_POLLIN) {
        //  Receive any message from queue
        liveness = HEARTBEAT_LIVENESS;
        interval = INTERVAL_INIT;
    }
    else
    if (--liveness == 0) {
        zclock_sleep (interval);
        if (interval < INTERVAL_MAX)
            interval *= 2;
        zsocket_destroy (ctx, worker);
        //  ... recreate and reconnect the worker socket here ...
        liveness = HEARTBEAT_LIVENESS;
    }
    //  Send heartbeat to queue if it's time
    if (zclock_time () > heartbeat_at) {
        heartbeat_at = zclock_time () + HEARTBEAT_INTERVAL;
        //  Send heartbeat message to queue
    }
}

The queue does the same, but manages an expiration time for each worker.

Here are some tips for your own heartbeating implementation:

  • Use zmq_poll or a reactor as the core of your application’s main task.

  • Start by building the heartbeating between peers, test it by simulating failures, and then build the rest of the message flow. Adding heartbeating afterwards is much trickier.

  • Use simple tracing, i.e., print to console, to get this working. To help you trace the flow of messages between peers, use a dump method such as zmsg offers, and number your messages incrementally so you can see if there are gaps.

  • In a real application, heartbeating must be configurable and usually negotiated with the peer. Some peers will want aggressive heartbeating, as low as 10 msecs. Other peers will be far away and want heartbeating as high as 30 seconds.

  • If you have different heartbeat intervals for different peers, your poll timeout should be the lowest (shortest time) of these. Do not use an infinite timeout.

  • Do heartbeating on the same socket you use for messages, so your heartbeats also act as a keep-alive to stop the network connection from going stale (some firewalls can be unkind to silent connections).

Contracts and Protocols

If you’re paying attention, you’ll realize that Paranoid Pirate is not interoperable with Simple Pirate, because of the heartbeats. But how do we define “interoperable”? To guarantee interoperability, we need a kind of contract, an agreement that lets different teams in different times and places write code that is guaranteed to work together. We call this a “protocol”.

It’s fun to experiment without specifications, but that’s not a sensible basis for real applications. What happens if we want to write a worker in another language? Do we have to read code to see how things work? What if we want to change the protocol for some reason? Even a simple protocol will, if it’s successful, evolve and become more complex.

Lack of contracts is a sure sign of a disposable application. So let’s write a contract for this protocol. How do we do that?

There’s a wiki at rfc.zeromq.org that we made especially as a home for public ZeroMQ contracts.
To create a new specification, register on the wiki if needed, and follow the instructions. It’s fairly straightforward, though writing technical texts is not everyone’s cup of tea.

It took me about fifteen minutes to draft the new Pirate Pattern Protocol. It’s not a big specification, but it does capture enough to act as the basis for arguments (“your queue isn’t PPP compatible; please fix it!”).

Turning PPP into a real protocol would take more work:

  • There should be a protocol version number in the READY command so that it’s possible to distinguish between different versions of PPP.

  • Right now, READY and HEARTBEAT are not entirely distinct from requests and replies. To make them distinct, we would need a message structure that includes a “message type” part.

Service-Oriented Reliable Queuing (Majordomo Pattern)

Figure 50 - The Majordomo Pattern

![Figure 50 - The Majordomo Pattern](https://github.com/imatix/zguide/raw/master/images/fig50.png)

The nice thing about progress is how fast it happens when lawyers and committees aren't involved. The one-page MDP specification turns PPP into something more solid. This is how we should design complex architectures: start by writing down the contracts, and only then write software to implement them.

The Majordomo Protocol (MDP) extends and improves on PPP in one interesting way: it adds a “service name” to requests that the client sends, and asks workers to register for specific services. Adding service names turns our Paranoid Pirate queue into a service-oriented broker. The nice thing about MDP is that it came out of working code, a simpler ancestor protocol (PPP), and a precise set of improvements that each solved a clear problem. This made it easy to draft.
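
On the wire, a service-oriented request is nothing more than a few extra frames. Here's a hand-rolled sketch of what an MDP-style client request looks like when sent over a DEALER socket; the header string and frame layout follow my reading of the MDP/0.1 spec on rfc.zeromq.org, so check the spec before relying on the literal values:

#include "czmq.h"

int main (void)
{
    zctx_t *ctx = zctx_new ();
    void *client = zsocket_new (ctx, ZMQ_DEALER);
    zsocket_connect (client, "tcp://localhost:5555");

    zmsg_t *request = zmsg_new ();
    zmsg_addstr (request, "");              //  Empty delimiter (REQ emulation on DEALER)
    zmsg_addstr (request, "MDPC01");        //  Protocol/version header
    zmsg_addstr (request, "echo");          //  Service name
    zmsg_addstr (request, "Hello world");   //  Request body
    zmsg_send (&request, client);

    zctx_destroy (&ctx);
    return 0;
}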

To implement Majordomo, we need to write a framework for clients and workers. It’s really not sane to ask every application developer to read the spec and make it work, when they could be using a simpler API that does the work for them.

So while our first contract (MDP itself) defines how the pieces of our distributed architecture talk to each other, our second contract defines how user applications talk to the technical framework we’re going to design.

Majordomo has two halves, a client side and a worker side. Because we’ll write both client and worker applications, we will need two APIs. Here is a sketch for the client API, using a simple object-oriented approach:

//  Majordomo Protocol client example
//  Uses the mdcli API to hide all MDP aspects

//  Lets us build this source without creating a library
#include "mdcliapi.c"

int main (int argc, char *argv [])
{
    int verbose = (argc > 1 && streq (argv [1], "-v"));
    mdcli_t *session = mdcli_new ("tcp://localhost:5555", verbose);

    int count;
    for (count = 0; count < 100000; count++) {
        zmsg_t *request = zmsg_new ();
        zmsg_pushstr (request, "Hello world");
        zmsg_t *reply = mdcli_send (session, "echo", &request);
        if (reply)
            zmsg_destroy (&reply);
        else
            break;              //  Interrupt or failure
    }
    printf ("%d requests/replies processed\n", count);
    mdcli_destroy (&session);
    return 0;
}

That’s it. We open a session to the broker, send a request message, get a reply message back, and eventually close the connection. Here’s a sketch for the worker API:

//  Majordomo Protocol worker example
//  Uses the mdwrk API to hide all MDP aspects

//  Lets us build this source without creating a library
#include "mdwrkapi.c"

int main (int argc, char *argv [])
{
    int verbose = (argc > 1 && streq (argv [1], "-v"));
    mdwrk_t *session = mdwrk_new (
        "tcp://localhost:5555", "echo", verbose);

    zmsg_t *reply = NULL;
    while (true) {
        zmsg_t *request = mdwrk_recv (session, &reply);
        if (request == NULL)
            break;              //  Worker was interrupted
        reply = request;        //  Echo is complex... :)
    }
    mdwrk_destroy (&session);
    return 0;
}

It’s more or less symmetrical, but the worker dialog is a little different. The first time a worker does a recv(), it passes a null reply. Thereafter, it passes the current reply, and gets a new request.

The client and worker APIs were fairly simple to construct because they’re heavily based on the Paranoid Pirate code we already developed. Here is the client API:

[mdcliapi: Majordomo client API in C](javascript:😉

C# | Go | Haxe | Java | Lua | PHP | Python | Tcl | Ada | Basic | C++ | Clojure | CL | Delphi | Erlang | F# | Felix | Haskell | Node.js | Objective-C | ooc | Perl | Q | Racket | Ruby | Scala

[mdclient: Majordomo client application in C](javascript:😉

C++ | C# | Go | Haskell | Haxe | Java | Lua | PHP | Python | Tcl | Ada | Basic | Clojure | CL | Delphi | Erlang | F# | Felix | Node.js | Objective-C | ooc | Perl | Q | Racket | Ruby | Scala

[mdwrkapi: Majordomo worker API in C](javascript:😉

C# | Go | Haxe | Java | Lua | PHP | Python | Ruby | Tcl | Ada | Basic | C++ | Clojure | CL | Delphi | Erlang | F# | Felix | Haskell | Node.js | Objective-C | ooc | Perl | Q | Racket | Scala

[mdworker: Majordomo worker application in C](javascript:😉

C++ | C# | Go | Haskell | Haxe | Java | Lua | PHP | Python | Ruby | Tcl | Ada | Basic | Clojure | CL | Delphi | Erlang | F# | Felix | Node.js | Objective-C | ooc | Perl | Q | Racket | Scala

  • The APIs are single-threaded. This means, for example, that the worker won’t send heartbeats in the background. Happily, this is exactly what we want: if the worker application gets stuck, heartbeats will stop and the broker will stop sending requests to the worker.

  • The worker API doesn’t do an exponential back-off; it’s not worth the extra complexity.

  • The APIs don’t do any error reporting. If something isn’t as expected, they raise an assertion (or exception depending on the language). This is ideal for a reference implementation, so any protocol errors show immediately. For real applications, the API should be robust against invalid messages.

You might wonder why the worker API is manually closing its socket and opening a new one, when ZeroMQ will automatically reconnect a socket if the peer disappears and comes back. Look back at the Simple Pirate and Paranoid Pirate workers to understand. Although ZeroMQ will automatically reconnect workers if the broker dies and comes back up, this isn’t sufficient to re-register the workers with the broker. I know of at least two solutions. The simplest, which we use here, is for the worker to monitor the connection using heartbeats, and if it decides the broker is dead, to close its socket and start afresh with a new socket. The alternative is for the broker to challenge unknown workers when it gets a heartbeat from the worker and ask them to re-register. That would require protocol support.

Now let’s design the Majordomo broker. Its core structure is a set of queues, one per service. We will create these queues as workers appear (we could delete them as workers disappear, but forget that for now because it gets complex). Additionally, we keep a queue of workers per service.
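
As a sketch of that bookkeeping, using the CZMQ containers the C examples rely on (the struct and field names here are illustrative, not the actual mdbroker definitions):

#include "czmq.h"

//  Illustrative broker-side bookkeeping; not the real mdbroker structures
typedef struct {
    char *name;                 //  Service name
    zlist_t *requests;          //  Client requests waiting for a worker
    zlist_t *waiting;           //  Idle workers registered for this service
} service_t;

typedef struct {
    void *socket;               //  Single ROUTER socket for clients and workers
    zhash_t *services;          //  Hash of service name to service_t
    zhash_t *workers;           //  Hash of worker identity to worker state
} broker_t;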

And here is the broker:

[mdbroker: Majordomo broker in C](javascript:😉

C++ | C# | Go | Haskell | Haxe | Java | Lua | PHP | Python | Ruby | Tcl | Ada | Basic | Clojure | CL | Delphi | Erlang | F# | Felix | Node.js | Objective-C | ooc | Perl | Q | Racket | Scala

Here are some things to note about the broker code:

  • The Majordomo Protocol lets us handle both clients and workers on a single socket. This is nicer for those deploying and managing the broker: it just sits on one ZeroMQ endpoint rather than the two that most proxies need.

  • The broker implements all of MDP/0.1 properly (as far as I know), including disconnection if the broker sends invalid commands, heartbeating, and the rest.

  • It can be extended to run multiple threads, each managing one socket and one set of clients and workers. This could be interesting for segmenting large architectures. The C code is already organized around a broker class to make this trivial.

  • A primary/failover or live/live broker reliability model is easy, as the broker essentially has no state except service presence. It’s up to clients and workers to choose another broker if their first choice isn’t up and running.

  • The examples use five-second heartbeats, mainly to reduce the amount of output when you enable tracing. Realistic values would be lower for most LAN applications. However, any retry has to be slow enough to allow for a service to restart, say 10 seconds at least.

We later improved and extended the protocol and the Majordomo implementation, which now sits in its own Github project. If you want a properly usable Majordomo stack, use the GitHub project.

Asynchronous Majordomo Pattern

The Majordomo implementation in the previous section is simple and stupid. The client is just the original Simple Pirate, wrapped up in a sexy API. When I fire up a client, broker, and worker on a test box, it can process 100,000 requests in about 14 seconds. That is partially due to the code, which cheerfully copies message frames around as if CPU cycles were free. But the real problem is that we’re doing network round-trips. ZeroMQ disables Nagle’s algorithm, but round-tripping is still slow.

Theory is great in theory, but in practice, practice is better. Let’s measure the actual cost of round-tripping with a simple test program. This sends a bunch of messages, first waiting for a reply to each message, and second as a batch, reading all the replies back as a batch. Both approaches do the same work, but they give very different results. We mock up a client, broker, and worker:

[tripping: Round-trip demonstrator in C](javascript:😉

C++ | C# | Go | Haskell | Haxe | Java | Lua | PHP | Python | Tcl | Ada | Basic | Clojure | CL | Delphi | Erlang | F# | Felix | Node.js | Objective-C | ooc | Perl | Q | Racket | Ruby | Scala

Setting up test...
Synchronous round-trip test...
 9057 calls/second
Asynchronous round-trip test...
 173010 calls/second

Note that the client thread does a small pause before starting. This is to get around one of the “features” of the router socket: if you send a message with the address of a peer that’s not yet connected, the message gets discarded. In this example we don’t use the load balancing mechanism, so without the sleep, if the worker thread is too slow to connect, it will lose messages, making a mess of our test.

As we see, round-tripping in the simplest case is 20 times slower than the asynchronous, “shove it down the pipe as fast as it’ll go” approach. Let’s see if we can apply this to Majordomo to make it faster.

First, we modify the client API to send and receive in two separate methods:

mdcli_t *mdcli_new     (char *broker);
void     mdcli_destroy (mdcli_t **self_p);
int      mdcli_send    (mdcli_t *self, char *service, zmsg_t **request_p);
zmsg_t  *mdcli_recv    (mdcli_t *self);

It’s literally a few minutes’ work to refactor the synchronous client API to become asynchronous:

[mdcliapi2: Majordomo asynchronous client API in C](javascript:😉

C# | Haxe | Java | Lua | PHP | Python | Ruby | Tcl | Ada | Basic | C++ | Clojure | CL | Delphi | Erlang | F# | Felix | Go | Haskell | Node.js | Objective-C | ooc | Perl | Q | Racket | Scala

  • We use a DEALER socket instead of REQ, so we emulate REQ with an empty delimiter frame before each request and each response.
  • We don’t retry requests; if the application needs to retry, it can do this itself.
  • We break the synchronous send method into separate send and recv methods.
  • The send method is asynchronous and returns immediately after sending. The caller can thus send a number of messages before getting a response.
  • The recv method waits for (with a timeout) one response and returns that to the caller.

And here’s the corresponding client test program, which sends 100,000 messages and then receives 100,000 back:

[mdclient2: Majordomo client application in C](javascript:😉

C++ | C# | Haxe | Java | Lua | PHP | Python | Ruby | Tcl | Ada | Basic | Clojure | CL | Delphi | Erlang | F# | Felix | Go | Haskell | Node.js | Objective-C | ooc | Perl | Q | Racket | Scala

$ time mdclient
100000 requests/replies processed

real    0m14.088s
user    0m1.310s
sys     0m2.670s

And here’s the asynchronous client, with a single worker:

$ time mdclient2
100000 replies received

real    0m8.730s
user    0m0.920s
sys     0m1.550s

Twice as fast. Not bad, but let's fire up 10 workers and see how it handles the traffic:

$ time mdclient2
100000 replies received

real    0m3.863s
user    0m0.730s
sys     0m0.470s

It isn’t fully asynchronous because workers get their messages on a strict last-used basis. But it will scale better with more workers. On my PC, after eight or so workers, it doesn’t get any faster. Four cores only stretches so far. But we got a 4x improvement in throughput with just a few minutes’ work. The broker is still unoptimized. It spends most of its time copying message frames around, instead of doing zero-copy, which it could. But we’re getting 25K reliable request/reply calls a second, with pretty low effort.

However, the asynchronous Majordomo pattern isn’t all roses. It has a fundamental weakness, namely that it cannot survive a broker crash without more work. If you look at the mdcliapi2 code you’ll see it does not attempt to reconnect after a failure. A proper reconnect would require the following:

  • A number on every request and a matching number on every reply, which would ideally require a change to the protocol to enforce.
  • Tracking and holding onto all outstanding requests in the client API, i.e., those for which no reply has yet been received.
  • In case of failover, for the client API to resend all outstanding requests to the broker.

It’s not a deal breaker, but it does show that performance often means complexity. Is this worth doing for Majordomo? It depends on your use case. For a name lookup service you call once per session, no. For a web frontend serving thousands of clients, probably yes.

Service Discovery

So, we have a nice service-oriented broker, but we have no way of knowing whether a particular service is available or not. We know whether a request failed, but we don’t know why. It is useful to be able to ask the broker, “is the echo service running?” The most obvious way would be to modify our MDP/Client protocol to add commands to ask this. But MDP/Client has the great charm of being simple. Adding service discovery to it would make it as complex as the MDP/Worker protocol.

Another option is to do what email does, and ask that undeliverable requests be returned. This can work well in an asynchronous world, but it also adds complexity. We need ways to distinguish returned requests from replies and to handle these properly.

Let’s try to use what we’ve already built, building on top of MDP instead of modifying it. Service discovery is, itself, a service. It might indeed be one of several management services, such as “disable service X”, “provide statistics”, and so on. What we want is a general, extensible solution that doesn’t affect the protocol or existing applications.

So here’s a small RFC that layers this on top of MDP: the Majordomo Management Interface (MMI). We already implemented it in the broker, though unless you read the whole thing you probably missed that. I’ll explain how it works in the broker:

  • When a client requests a service that starts with mmi., instead of routing this to a worker, we handle it internally.

  • We handle just one service in this broker, which is mmi.service, the service discovery service.

  • The payload for the request is the name of an external service (a real one, provided by a worker).

  • The broker returns “200” (OK) or “404” (Not found), depending on whether there are workers registered for that service or not.

Here’s how we use the service discovery in an application:

[mmiecho: Service discovery over Majordomo in C](javascript:😉

C# | Go | Haxe | Java | Lua | PHP | Python | Ruby | Tcl | Ada | Basic | C++ | Clojure | CL | Delphi | Erlang | F# | Felix | Haskell | Node.js | Objective-C | ooc | Perl | Q | Racket | Scala
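
Because the mmiecho listing above is only linked, here's a minimal sketch of the same idea using the synchronous mdcli API from earlier in this chapter; the "200"/"404" status codes are the ones described above:

//  Ask the broker whether the "echo" service has any registered workers
#include "mdcliapi.c"

int main (void)
{
    mdcli_t *session = mdcli_new ("tcp://localhost:5555", 0);

    zmsg_t *request = zmsg_new ();
    zmsg_addstr (request, "echo");
    zmsg_t *reply = mdcli_send (session, "mmi.service", &request);
    if (reply) {
        char *code = zmsg_popstr (reply);
        printf ("Lookup echo service: %s\n", code);  //  "200" or "404"
        free (code);
        zmsg_destroy (&reply);
    }
    mdcli_destroy (&session);
    return 0;
}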

Idempotent Services

Idempotency is not something you take a pill for. What it means is that it's safe to repeat an operation. Checking the clock is idempotent. Lending one's credit card to one's children is not. While many client-to-server use cases are idempotent, some are not. Examples of idempotent use cases include:

  • Stateless task distribution, i.e., a pipeline where the servers are stateless workers that compute a reply based purely on the state provided by a request. In such a case, it’s safe (though inefficient) to execute the same request many times.

  • A name service that translates logical addresses into endpoints to bind or connect to. In such a case, it’s safe to make the same lookup request many times.

And here are some examples of non-idempotent use cases:

  • A logging service. One does not want the same log information recorded more than once.

  • Any service that has impact on downstream nodes, e.g., sends on information to other nodes. If that service gets the same request more than once, downstream nodes will get duplicate information.

  • Any service that modifies shared data in some non-idempotent way; e.g., a service that debits a bank account is not idempotent without extra work.

When our server applications are not idempotent, we have to think more carefully about when exactly they might crash. If an application dies when it’s idle, or while it’s processing a request, that’s usually fine. We can use database transactions to make sure a debit and a credit are always done together, if at all. If the server dies while sending its reply, that’s a problem, because as far as it’s concerned, it has done its work.

If the network dies just as the reply is making its way back to the client, the same problem arises. The client will think the server died and will resend the request, and the server will do the same work twice, which is not what we want.

To handle non-idempotent operations, use the fairly standard solution of detecting and rejecting duplicate requests (there's a sketch of this after the list). This means:

  • The client must stamp every request with a unique client identifier and a unique message number.

  • The server, before sending back a reply, stores it using the combination of client ID and message number as a key.

  • The server, when getting a request from a given client, first checks whether it has a reply for that client ID and message number. If so, it does not process the request, but just resends the reply.
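
Here's the promised sketch of that idea, using a CZMQ hash table as the reply store. The key format and the echo "processing" are illustrative only, and a real server would also set a free-function on stored replies and expire old entries:

#include "czmq.h"

//  Return a reply for (client_id, msg_number, request), resending the
//  stored reply when the same request arrives twice
static zmsg_t *
s_handle_request (zhash_t *replies, char *client_id, char *msg_number,
                  zmsg_t *request)
{
    char key [256];
    snprintf (key, sizeof (key), "%s:%s", client_id, msg_number);

    zmsg_t *stored = (zmsg_t *) zhash_lookup (replies, key);
    if (stored)
        return zmsg_dup (stored);       //  Duplicate: resend the saved reply

    zmsg_t *reply = zmsg_dup (request); //  "Process" the request (echo here)
    zhash_insert (replies, key, zmsg_dup (reply));
    return reply;
}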

Disconnected Reliability (Titanic Pattern)

Once you realize that Majordomo is a “reliable” message broker, you might be tempted to add some spinning rust (that is, ferrous-based hard disk platters). After all, this works for all the enterprise messaging systems. It’s such a tempting idea that it’s a little sad to have to be negative toward it. But brutal cynicism is one of my specialties. So, some reasons you don’t want rust-based brokers sitting in the center of your architecture are:

  • As you’ve seen, the Lazy Pirate client performs surprisingly well. It works across a whole range of architectures, from direct client-to-server to distributed queue proxies. It does tend to assume that workers are stateless and idempotent. But we can work around that limitation without resorting to rust.

  • Rust brings a whole set of problems, from slow performance to additional pieces that you have to manage, repair, and handle 6 a.m. panics from, as they inevitably break at the start of daily operations. The beauty of the Pirate patterns in general is their simplicity. They won’t crash. And if you’re still worried about the hardware, you can move to a peer-to-peer pattern that has no broker at all. I’ll explain later in this chapter.

Having said this, however, there is one sane use case for rust-based reliability, which is an asynchronous disconnected network. It solves a major problem with Pirate, namely that a client has to wait for an answer in real time. If clients and workers are only sporadically connected (think of email as an analogy), we can’t use a stateless network between clients and workers. We have to put state in the middle.

So, here’s the Titanic pattern, in which we write messages to disk to ensure they never get lost, no matter how sporadically clients and workers are connected. As we did for service discovery, we’re going to layer Titanic on top of MDP rather than extend it. It’s wonderfully lazy because it means we can implement our fire-and-forget reliability in a specialized worker, rather than in the broker. This is excellent for several reasons:

  • It is much easier because we divide and conquer: the broker handles message routing and the worker handles reliability.
  • It lets us mix brokers written in one language with workers written in another.
  • It lets us evolve the fire-and-forget technology independently.

The only downside is that there’s an extra network hop between broker and hard disk. The benefits are easily worth it.

There are many ways to make a persistent request-reply architecture. We’ll aim for one that is simple and painless. The simplest design I could come up with, after playing with this for a few hours, is a “proxy service”. That is, Titanic doesn’t affect workers at all. If a client wants a reply immediately, it talks directly to a service and hopes the service is available. If a client is happy to wait a while, it talks to Titanic instead and asks, “hey, buddy, would you take care of this for me while I go buy my groceries?”

Figure 51 - The Titanic Pattern

![Figure 51 - The Titanic Pattern](https://github.com/imatix/zguide/raw/master/images/fig51.png)

Titanic is thus both a worker and a client. The dialog between client and Titanic goes along these lines:

  • Client: Please accept this request for me. Titanic: OK, done.
  • Client: Do you have a reply for me? Titanic: Yes, here it is. Or, no, not yet.
  • Client: OK, you can wipe that request now, I’m happy. Titanic: OK, done.

Whereas the dialog between Titanic and broker and worker goes like this:

  • Titanic: Hey, Broker, is there a coffee service? Broker: Uhm, yeah, seems like.
  • Titanic: Hey, coffee service, please handle this for me.
  • Coffee: Sure, here you are.
  • Titanic: Sweeeeet!

You can work through this and the possible failure scenarios. If a worker crashes while processing a request, Titanic retries indefinitely. If a reply gets lost somewhere, Titanic will retry. If the request gets processed but the client doesn’t get the reply, it will ask again. If Titanic crashes while processing a request or a reply, the client will try again. As long as requests are fully committed to safe storage, work can’t get lost.

The handshaking is pedantic, but can be pipelined, i.e., clients can use the asynchronous Majordomo pattern to do a lot of work and then get the responses later.

We need some way for a client to request its replies. We’ll have many clients asking for the same services, and clients disappear and reappear with different identities. Here is a simple, reasonably secure solution:

  • Every request generates a universally unique ID (UUID), which Titanic returns to the client after it has queued the request.
  • When a client asks for a reply, it must specify the UUID for the original request.

In a realistic case, the client would want to store its request UUIDs safely, e.g., in a local database.

Before we jump off and write yet another formal specification (fun, fun!), let’s consider how the client talks to Titanic. One way is to use a single service and send it three different request types. Another way, which seems simpler, is to use three services:

  • titanic.request: store a request message, and return a UUID for the request.
  • titanic.reply: fetch a reply, if available, for a given request UUID.
  • titanic.close: confirm that a reply has been stored and processed.

We’ll just make a multithreaded worker, which as we’ve seen from our multithreading experience with ZeroMQ, is trivial. However, let’s first sketch what Titanic would look like in terms of ZeroMQ messages and frames. This gives us the Titanic Service Protocol (TSP).

Using TSP is clearly more work for client applications than accessing a service directly via MDP. Here’s the shortest robust “echo” client example:

[ticlient: Titanic client example in C](javascript:😉

C# | Haxe | Java | PHP | Python | Ruby | Tcl | Ada | Basic | C++ | Clojure | CL | Delphi | Erlang | F# | Felix | Go | Haskell | Lua | Node.js | Objective-C | ooc | Perl | Q | Racket | Scala

For example, this client blocks on each request whereas in a real application, we’d want to be doing useful work while tasks are executed. This requires some nontrivial plumbing to build a background thread and talk to that cleanly. It’s the kind of thing you want to wrap in a nice simple API that the average developer cannot misuse. It’s the same approach that we used for Majordomo.

Here’s the Titanic implementation. This server handles the three services using three threads, as proposed. It does full persistence to disk using the most brutal approach possible: one file per message. It’s so simple, it’s scary. The only complex part is that it keeps a separate queue of all requests, to avoid reading the directory over and over:

[titanic: Titanic broker example in C](javascript:😉

C# | Haxe | Java | PHP | Python | Ruby | Tcl | Ada | Basic | C++ | Clojure | CL | Delphi | Erlang | F# | Felix | Go | Haskell | Lua | Node.js | Objective-C | ooc | Perl | Q | Racket | Scala

Some notes about this code:

  • Note that some loops start by sending, others by receiving messages. This is because Titanic acts both as a client and a worker in different roles.
  • The Titanic broker uses the MMI service discovery protocol to send requests only to services that appear to be running. Since the MMI implementation in our little Majordomo broker is quite poor, this won’t work all the time.
  • We use an inproc connection to send new request data from the titanic.request service through to the main dispatcher. This saves the dispatcher from having to scan the disk directory, load all request files, and sort them by date/time.

The important thing about this example is not performance (which, although I haven’t tested it, is surely terrible), but how well it implements the reliability contract. To try it, start the mdbroker and titanic programs. Then start the ticlient, and then start the mdworker echo service. You can run all four of these using the -v option to do verbose activity tracing. You can stop and restart any piece except the client and nothing will get lost.

If you want to use Titanic in real cases, you’ll rapidly be asking “how do we make this faster?”

Here’s what I’d do, starting with the example implementation:

  • Use a single disk file for all data, rather than multiple files. Operating systems are usually better at handling a few large files than many smaller ones.
  • Organize that disk file as a circular buffer so that new requests can be written contiguously (with very occasional wraparound). One thread, writing full speed to a disk file, can work rapidly.
  • Keep the index in memory and rebuild the index at startup time, from the disk buffer. This saves the extra disk head flutter needed to keep the index fully safe on disk. You would want an fsync after every message, or every N milliseconds if you were prepared to lose the last M messages in case of a system failure.
  • Use a solid-state drive rather than spinning iron oxide platters.
  • Pre-allocate the entire file, or allocate it in large chunks, which allows the circular buffer to grow and shrink as needed. This avoids fragmentation and ensures that most reads and writes are contiguous.

And so on. What I’d not recommend is storing messages in a database, not even a “fast” key/value store, unless you really like a specific database and don’t have performance worries. You will pay a steep price for the abstraction, ten to a thousand times over a raw disk file.

If you want to make Titanic even more reliable, duplicate the requests to a second server, which you’d place in a second location just far away enough to survive a nuclear attack on your primary location, yet not so far that you get too much latency.

If you want to make Titanic much faster and less reliable, store requests and replies purely in memory. This will give you the functionality of a disconnected network, but requests won’t survive a crash of the Titanic server itself.

High-Availability Pair (Binary Star Pattern)

Figure 52 - High-Availability Pair, Normal Operation

![Figure 52 - High-Availability Pair, Normal Operation](https://github.com/imatix/zguide/raw/master/images/fig52.png)

The Binary Star pattern puts two servers in a primary-backup high-availability pair. At any given time, one of these (the active) accepts connections from client applications. The other (the passive) does nothing, but the two servers monitor each other. If the active disappears from the network, after a certain time the passive takes over as active.

We developed the Binary Star pattern at iMatix for our OpenAMQ server. We designed it:

  • To provide a straightforward high-availability solution.
  • To be simple enough to actually understand and use.
  • To fail over reliably when needed, and only when needed.

Assuming we have a Binary Star pair running, here are the different scenarios that will result in a failover:

  • The hardware running the primary server has a fatal problem (power supply explodes, machine catches fire, or someone simply unplugs it by mistake), and disappears. Applications see this, and reconnect to the backup server.
  • The network segment on which the primary server sits crashes—perhaps a router gets hit by a power spike—and applications start to reconnect to the backup server.
  • The primary server crashes or is killed by the operator and does not restart automatically.

Figure 53 - High-availability Pair During Failover

![Figure 53 - High-availability Pair During Failover](https://github.com/imatix/zguide/raw/master/images/fig53.png)

Recovery from failover works as follows:

  • The operators restart the primary server and fix whatever problems were causing it to disappear from the network.
  • The operators stop the backup server at a moment when it will cause minimal disruption to applications.
  • When applications have reconnected to the primary server, the operators restart the backup server.

Recovery (to using the primary server as active) is a manual operation. Painful experience teaches us that automatic recovery is undesirable. There are several reasons:

  • Failover creates an interruption of service to applications, possibly lasting 10-30 seconds. If there is a real emergency, this is much better than total outage. But if recovery creates a further 10-30 second outage, it is better that this happens off-peak, when users have gone off the network.

  • When there is an emergency, the absolute first priority is certainty for those trying to fix things. Automatic recovery creates uncertainty for system administrators, who can no longer be sure which server is in charge without double-checking.

  • Automatic recovery can create situations where networks fail over and then recover, placing operators in the difficult position of analyzing what happened. There was an interruption of service, but the cause isn’t clear.

Having said this, the Binary Star pattern will fail back to the primary server if this is running (again) and the backup server fails. In fact, this is how we provoke recovery.

The shutdown process for a Binary Star pair is to either:

  1. Stop the passive server and then stop the active server at any later time, or
  2. Stop both servers in any order but within a few seconds of each other.

Stopping the active and then the passive server with any delay longer than the failover timeout will cause applications to disconnect, then reconnect, and then disconnect again, which may disturb users.

Detailed Requirements

Binary Star is as simple as it can be, while still working accurately. In fact, the current design is the third complete redesign. Each of the previous designs we found to be too complex, trying to do too much, and we stripped out functionality until we came to a design that was understandable, easy to use, and reliable enough to be worth using.

These are our requirements for a high-availability architecture:

  • The failover is meant to provide insurance against catastrophic system failures, such as hardware breakdown, fire, accident, and so on. There are simpler ways to recover from ordinary server crashes and we already covered these.

  • Failover time should be under 60 seconds and preferably under 10 seconds.

  • Failover has to happen automatically, whereas recovery must happen manually. We want applications to switch over to the backup server automatically, but we do not want them to switch back to the primary server except when the operators have fixed whatever problem there was and decided that it is a good time to interrupt applications again.

  • The semantics for client applications should be simple and easy for developers to understand. Ideally, they should be hidden in the client API.

  • There should be clear instructions for network architects on how to avoid designs that could lead to split brain syndrome, in which both servers in a Binary Star pair think they are the active server.

  • There should be no dependencies on the order in which the two servers are started.

  • It must be possible to make planned stops and restarts of either server without stopping client applications (though they may be forced to reconnect).

  • Operators must be able to monitor both servers at all times.

  • It must be possible to connect the two servers using a high-speed dedicated network connection. That is, failover synchronization must be able to use a specific IP route.

We make the following assumptions:

  • A single backup server provides enough insurance; we don’t need multiple levels of backup.

  • The primary and backup servers are equally capable of carrying the application load. We do not attempt to balance load across the servers.

  • There is sufficient budget to cover a fully redundant backup server that does nothing almost all the time.

We don’t attempt to cover the following:

  • The use of an active backup server or load balancing. In a Binary Star pair, the backup server is inactive and does no useful work until the primary server goes offline.

  • The handling of persistent messages or transactions in any way. We assume the existence of a network of unreliable (and probably untrusted) servers or Binary Star pairs.

  • Any automatic exploration of the network. The Binary Star pair is manually and explicitly defined in the network and is known to applications (at least in their configuration data).

  • Replication of state or messages between servers. All server-side state must be recreated by applications when they fail over.

Here is the key terminology that we use in Binary Star:

  • Primary: the server that is normally or initially active.

  • Backup: the server that is normally passive. It will become active if and when the primary server disappears from the network, and when client applications ask the backup server to connect.

  • Active: the server that accepts client connections. There is at most one active server.

  • Passive: the server that takes over if the active disappears. Note that when a Binary Star pair is running normally, the primary server is active, and the backup is passive. When a failover has happened, the roles are switched.

To configure a Binary Star pair, you need to:

  1. Tell the primary server where the backup server is located.
  2. Tell the backup server where the primary server is located.
  3. Optionally, tune the failover response times, which must be the same for both servers.

The main tuning concern is how frequently you want the servers to check their peering status, and how quickly you want to activate failover. In our example, the failover timeout value defaults to 2,000 msec. If you reduce this, the backup server will take over as active more rapidly but may take over in cases where the primary server could recover. For example, you may have wrapped the primary server in a shell script that restarts it if it crashes. In that case, the timeout should be higher than the time needed to restart the primary server.

For client applications to work properly with a Binary Star pair, they must:

  1. Know both server addresses.
  2. Try to connect to the primary server, and if that fails, to the backup server.
  3. Detect a failed connection, typically using heartbeating.
  4. Try to reconnect to the primary, and then backup (in that order), with a delay between retries that is at least as high as the server failover timeout.
  5. Recreate all of the state they require on a server.
  6. Retransmit messages lost during a failover, if messages need to be reliable.

It’s not trivial work, and we’d usually wrap this in an API that hides it from real end-user applications.
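
To make that list concrete, here is a rough sketch of the client-side logic, in the spirit of the bstarcli example shown later in this section; the endpoints, timeouts, and request payload are illustrative:

#include "czmq.h"

#define REQUEST_TIMEOUT     1000    //  msecs
#define SETTLE_DELAY        2000    //  Before failing over; >= server failover timeout

int main (void)
{
    zctx_t *ctx = zctx_new ();
    const char *server [] = { "tcp://localhost:5001", "tcp://localhost:5002" };
    int server_nbr = 0;

    void *client = zsocket_new (ctx, ZMQ_REQ);
    zsocket_connect (client, server [server_nbr]);

    while (!zctx_interrupted) {
        zstr_send (client, "request");
        zmq_pollitem_t items [] = { { client, 0, ZMQ_POLLIN, 0 } };
        zmq_poll (items, 1, REQUEST_TIMEOUT * ZMQ_POLL_MSEC);

        if (items [0].revents & ZMQ_POLLIN) {
            char *reply = zstr_recv (client);   //  Server answered; carry on
            free (reply);
            zclock_sleep (1000);
        }
        else {
            //  No reply: assume this server is down, recreate the socket
            //  (Lazy Pirate style) and try the other server
            zsocket_destroy (ctx, client);
            server_nbr = (server_nbr + 1) % 2;
            zclock_sleep (SETTLE_DELAY);
            client = zsocket_new (ctx, ZMQ_REQ);
            zsocket_connect (client, server [server_nbr]);
        }
    }
    zctx_destroy (&ctx);
    return 0;
}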

These are the main limitations of the Binary Star pattern:

  • A server process cannot be part of more than one Binary Star pair.
  • A primary server can have a single backup server, and no more.
  • The passive server does no useful work, and is thus wasted.
  • The backup server must be capable of handling full application loads.
  • Failover configuration cannot be modified at runtime.
  • Client applications must do some work to benefit from failover.

Preventing Split-Brain Syndrome

Split-brain syndrome occurs when different parts of a cluster think they are active at the same time. It causes applications to stop seeing each other. Binary Star has an algorithm for detecting and eliminating split brain, which is based on a three-way decision mechanism (a server will not decide to become active until it gets application connection requests and it cannot see its peer server).

However, it is still possible to (mis)design a network to fool this algorithm. A typical scenario would be a Binary Star pair, that is distributed between two buildings, where each building also had a set of applications and where there was a single network link between both buildings. Breaking this link would create two sets of client applications, each with half of the Binary Star pair, and each failover server would become active.

To prevent split-brain situations, we must connect a Binary Star pair using a dedicated network link, which can be as simple as plugging them both into the same switch or, better, using a crossover cable directly between two machines.

We must not split a Binary Star architecture into two islands, each with a set of applications. While this may be a common type of network architecture, you should use federation, not high-availability failover, in such cases.

A suitably paranoid network configuration would use two private cluster interconnects, rather than a single one. Further, the network cards used for the cluster would be different from those used for message traffic, and possibly even on different paths on the server hardware. The goal is to separate possible failures in the network from possible failures in the cluster. Network ports can have a relatively high failure rate.

Binary Star Implementation

Without further ado, here is a proof-of-concept implementation of the Binary Star server. The primary and backup servers run the same code; you choose their roles when you run the code:

[bstarsrv: Binary Star server in C](javascript:😉

Haxe | Java | Python | Ruby | Tcl | Ada | Basic | C++ | C# | Clojure | CL | Delphi | Erlang | F# | Felix | Go | Haskell | Lua | Node.js | Objective-C | ooc | Perl | PHP | Q | Racket | Scala

[bstarcli: Binary Star client in C](javascript:😉

Haxe | Java | Python | Ruby | Tcl | Ada | Basic | C++ | C# | Clojure | CL | Delphi | Erlang | F# | Felix | Go | Haskell | Lua | Node.js | Objective-C | ooc | Perl | PHP | Q | Racket | Scala

bstarsrv -p     # Start primary
bstarsrv -b     # Start backup
bstarcli

You can then provoke failover by killing the primary server, and recovery by restarting the primary and killing the backup. Note how it’s the client vote that triggers failover, and recovery.

Binary Star is driven by a finite state machine. Events are the peer state, so "Peer Active" means the other server has told us it's active. "Client Request" means we've received a client request. "Client Vote" means we've received a client request AND our peer is inactive for two heartbeats.

Note that the servers use PUB-SUB sockets for state exchange. No other socket combination will work here. PUSH and DEALER block if there is no peer ready to receive a message. PAIR does not reconnect if the peer disappears and comes back. ROUTER needs the address of the peer before it can send it a message.

Figure 54 - Binary Star Finite State Machine

![Figure 54 - Binary Star Finite State Machine](https://github.com/imatix/zguide/raw/master/images/fig54.png)

Binary Star Reactor

Binary Star is useful and generic enough to package up as a reusable reactor class. The reactor then runs and calls our code whenever it has a message to process. This is much nicer than copying/pasting the Binary Star code into each server where we want that capability.

In C, we wrap the CZMQ zloop class that we saw before. zloop lets you register handlers to react on socket and timer events. In the Binary Star reactor, we provide handlers for voters and for state changes (active to passive, and vice versa). Here is the bstar API:

//  bstar class - Binary Star reactor

#include "bstar.h"

//  States we can be in at any point in time
typedef enum {
    STATE_PRIMARY = 1,          //  Primary, waiting for peer to connect
    STATE_BACKUP = 2,           //  Backup, waiting for peer to connect
    STATE_ACTIVE = 3,           //  Active - accepting connections
    STATE_PASSIVE = 4           //  Passive - not accepting connections
} state_t;

//  Events, which start with the states our peer can be in
typedef enum {
    PEER_PRIMARY = 1,           //  HA peer is pending primary
    PEER_BACKUP = 2,            //  HA peer is pending backup
    PEER_ACTIVE = 3,            //  HA peer is active
    PEER_PASSIVE = 4,           //  HA peer is passive
    CLIENT_REQUEST = 5          //  Client makes request
} event_t;

//  Structure of our class

struct _bstar_t {
    zctx_t *ctx;                //  Our private context
    zloop_t *loop;              //  Reactor loop
    void *statepub;             //  State publisher
    void *statesub;             //  State subscriber
    state_t state;              //  Current state
    event_t event;              //  Current event
    int64_t peer_expiry;        //  When peer is considered 'dead'
    zloop_fn *voter_fn;         //  Voting socket handler
    void *voter_arg;            //  Arguments for voting handler
    zloop_fn *active_fn;        //  Call when become active
    void *active_arg;           //  Arguments for handler
    zloop_fn *passive_fn;       //  Call when become passive
    void *passive_arg;          //  Arguments for handler
};

//  The finite-state machine is the same as in the proof-of-concept server.
//  To understand this reactor in detail, first read the CZMQ zloop class.

//  We send state information every this often
//  If peer doesn't respond in two heartbeats, it is 'dead'
#define BSTAR_HEARTBEAT     1000    //  In msecs

//  Binary Star finite state machine (applies event to state)
//  Returns -1 if there was an exception, 0 if event was valid.

static int
s_execute_fsm (bstar_t *self)
{
    int rc = 0;
    //  Primary server is waiting for peer to connect
    //  Accepts CLIENT_REQUEST events in this state
    if (self->state == STATE_PRIMARY) {
        if (self->event == PEER_BACKUP) {
            zclock_log ("I: connected to backup (passive), ready as active");
            self->state = STATE_ACTIVE;
            if (self->active_fn)
                (self->active_fn) (self->loop, NULL, self->active_arg);
        }
        else
        if (self->event == PEER_ACTIVE) {
            zclock_log ("I: connected to backup (active), ready as passive");
            self->state = STATE_PASSIVE;
            if (self->passive_fn)
                (self->passive_fn) (self->loop, NULL, self->passive_arg);
        }
        else
        if (self->event == CLIENT_REQUEST) {
            //  Allow client requests to turn us into the active if we've
            //  waited sufficiently long to believe the backup is not
            //  currently acting as active (i.e., after a failover)
            assert (self->peer_expiry > 0);
            if (zclock_time () >= self->peer_expiry) {
                zclock_log ("I: request from client, ready as active");
                self->state = STATE_ACTIVE;
                if (self->active_fn)
                    (self->active_fn) (self->loop, NULL, self->active_arg);
            }
            else
                //  Don't respond to clients yet - it's possible we're
                //  performing a failback and the backup is currently active
                rc = -1;
        }
    }
    else
    //  Backup server is waiting for peer to connect
    //  Rejects CLIENT_REQUEST events in this state
    if (self->state == STATE_BACKUP) {
        if (self->event == PEER_ACTIVE) {
            zclock_log ("I: connected to primary (active), ready as passive");
            self->state = STATE_PASSIVE;
            if (self->passive_fn)
                (self->passive_fn) (self->loop, NULL, self->passive_arg);
        }
        else
        if (self->event == CLIENT_REQUEST)
            rc = -1;
    }
    else
    //  Server is active
    //  Accepts CLIENT_REQUEST events in this state
    //  The only way out of ACTIVE is death
    if (self->state == STATE_ACTIVE) {
        if (self->event == PEER_ACTIVE) {
            //  Two actives would mean split-brain
            zclock_log ("E: fatal error - dual actives, aborting");
            rc = -1;
        }
    }
    else
    //  Server is passive
    //  CLIENT_REQUEST events can trigger failover if peer looks dead
    if (self->state == STATE_PASSIVE) {
        if (self->event == PEER_PRIMARY) {
            //  Peer is restarting - become active, peer will go passive
            zclock_log ("I: primary (passive) is restarting, ready as active");
            self->state = STATE_ACTIVE;
        }
        else
        if (self->event == PEER_BACKUP) {
            //  Peer is restarting - become active, peer will go passive
            zclock_log ("I: backup (passive) is restarting, ready as active");
            self->state = STATE_ACTIVE;
        }
        else
        if (self->event == PEER_PASSIVE) {
            //  Two passives would mean cluster would be non-responsive
            zclock_log ("E: fatal error - dual passives, aborting");
            rc = -1;
        }
        else
        if (self->event == CLIENT_REQUEST) {
            //  Peer becomes active if timeout has passed
            //  It's the client request that triggers the failover
            assert (self->peer_expiry > 0);
            if (zclock_time () >= self->peer_expiry) {
                //  If peer is dead, switch to the active state
                zclock_log ("I: failover successful, ready as active");
                self->state = STATE_ACTIVE;
            }
            else
                //  If peer is alive, reject connections
                rc = -1;
        }
        //  Call state change handler if necessary
        if (self->state == STATE_ACTIVE && self->active_fn)
            (self->active_fn) (self->loop, NULL, self->active_arg);
    }
    return rc;
}

static void
s_update_peer_expiry (bstar_t *self)
{
    self->peer_expiry = zclock_time () + 2 * BSTAR_HEARTBEAT;
}

//  Reactor event handlers...

//  Publish our state to peer
int s_send_state (zloop_t *loop, int timer_id, void *arg)
{
    bstar_t *self = (bstar_t *) arg;
    zstr_sendf (self->statepub, "%d", self->state);
    return 0;
}

//  Receive state from peer, execute finite state machine
int s_recv_state (zloop_t *loop, zmq_pollitem_t *poller, void *arg)
{
    bstar_t *self = (bstar_t *) arg;
    char *state = zstr_recv (poller->socket);
    if (state) {
        self->event = atoi (state);
        s_update_peer_expiry (self);
        free (state);
    }
    return s_execute_fsm (self);
}

//  Application wants to speak to us, see if it's possible
int s_voter_ready (zloop_t *loop, zmq_pollitem_t *poller, void *arg)
{
    bstar_t *self = (bstar_t *) arg;
    //  If server can accept input now, call appl handler
    self->event = CLIENT_REQUEST;
    if (s_execute_fsm (self) == 0)
        (self->voter_fn) (self->loop, poller, self->voter_arg);
    else {
        //  Destroy waiting message, no-one to read it
        zmsg_t *msg = zmsg_recv (poller->socket);
        zmsg_destroy (&msg);
    }
    return 0;
}

//  This is the constructor for our bstar class. We have to tell it
//  whether we're primary or backup server, as well as our local and
//  remote endpoints to bind and connect to:

bstar_t *
bstar_new (int primary, char *local, char *remote)
{
    bstar_t
        *self;

    self = (bstar_t *) zmalloc (sizeof (bstar_t));

    //  Initialize the Binary Star
    self->ctx = zctx_new ();
    self->loop = zloop_new ();
    self->state = primary? STATE_PRIMARY: STATE_BACKUP;

    //  Create publisher for state going to peer
    self->statepub = zsocket_new (self->ctx, ZMQ_PUB);
    zsocket_bind (self->statepub, local);

    //  Create subscriber for state coming from peer
    self->statesub = zsocket_new (self->ctx, ZMQ_SUB);
    zsocket_set_subscribe (self->statesub, "");
    zsocket_connect (self->statesub, remote);

    //  Set up basic reactor events
    zloop_timer (self->loop, BSTAR_HEARTBEAT, 0, s_send_state, self);
    zmq_pollitem_t poller = { self->statesub, 0, ZMQ_POLLIN };
    zloop_poller (self->loop, &poller, s_recv_state, self);
    return self;
}

//  The destructor shuts down the bstar reactor:

void
bstar_destroy (bstar_t **self_p)
{
    assert (self_p);
    if (*self_p) {
        bstar_t *self = *self_p;
        zloop_destroy (&self->loop);
        zctx_destroy (&self->ctx);
        free (self);
        *self_p = NULL;
    }
}

//  This method returns the underlying zloop reactor, so we can add
//  additional timers and readers:

zloop_t *
bstar_zloop (bstar_t *self)
{
    return self->loop;
}

//This method registers a client voter socket. Messages received//on this socket provide the CLIENT_REQUEST events for the Binary Star//FSM and are passed to the provided application handler. We require//exactly one voter per bstar instance:

int
bstar_voter (bstar_t *self, char endpoint, int type, zloop_fn handler,
void arg)
{
// Hold actual handler+arg so we can call this later

void *socket = zsocket_new (self->ctx, type);
zsocket_bind (socket, endpoint);
assert (!self->voter_fn);
self->voter_fn = handler;
self->voter_arg = arg;
zmq_pollitem_t poller = { socket, 0, ZMQ_POLLIN };
return zloop_poller (self->loop, &poller, s_voter_ready, self);
}

// Register handlers to be called each time there’s a state change:

void
bstar_new_active (bstar_t *self, zloop_fn handler, void *arg)
{
assert (!self->active_fn);
self->active_fn = handler;
self->active_arg = arg;
}

void
bstar_new_passive (bstar_t *self, zloop_fn handler, void *arg)
{
assert (!self->passive_fn);
self->passive_fn = handler;
self->passive_arg = arg;
}

// Enable/disable verbose tracing, for debugging:

void bstar_set_verbose (bstar_t *self, bool verbose)
{
zloop_set_verbose (self->loop, verbose);
}

// Finally, start the configured reactor. It will end if any handler
// returns -1 to the reactor, or if the process receives SIGINT or SIGTERM:

int
bstar_start (bstar_t *self)
{
assert (self->voter_fn);
s_update_peer_expiry (self);
return zloop_start (self->loop);
}

And here is the class implementation:

[bstar: Binary Star core class in C](javascript:😉


[bstarsrv2: Binary Star server, using core class in C](javascript:😉

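The bstarsrv2 example linked just above is the complete demonstration. As a quick orientation, here is a minimal sketch of how an application drives the API shown above, in the spirit of bstarsrv2; the endpoints, port numbers, and the echo handler are my own placeholders, and a real server would take its primary/backup role from the command line:

#include "czmq.h"
#include "bstar.h"      //  Assumes the class above is packaged as bstar.h/bstar.c

//  Placeholder voter handler: echo whatever the client sent. On a ROUTER
//  socket the first frame is the client address, so sending the message
//  straight back routes the reply to the right client.
static int
s_echo (zloop_t *loop, zmq_pollitem_t *poller, void *arg)
{
    zmsg_t *msg = zmsg_recv (poller->socket);
    if (msg)
        zmsg_send (&msg, poller->socket);
    return 0;
}

int main (void)
{
    //  Primary: publish state on 5003, expect the backup's state on 5004
    bstar_t *bstar = bstar_new (1, "tcp://*:5003", "tcp://localhost:5004");
    //  Clients send their requests (the "votes") to this ROUTER socket
    bstar_voter (bstar, "tcp://*:5001", ZMQ_ROUTER, s_echo, NULL);
    bstar_start (bstar);
    bstar_destroy (&bstar);
    return 0;
}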

Brokerless Reliability (Freelance Pattern)

It might seem ironic to focus so much on broker-based reliability, when we often explain ZeroMQ as “brokerless messaging”. However, in messaging, as in real life, the middleman is both a burden and a benefit. In practice, most messaging architectures benefit from a mix of distributed and brokered messaging. You get the best results when you can decide freely what trade-offs you want to make. This is why I can drive twenty minutes to a wholesaler to buy five cases of wine for a party, but I can also walk ten minutes to a corner store to buy one bottle for a dinner. Our highly context-sensitive relative valuations of time, energy, and cost are essential to the real world economy. And they are essential to an optimal message-based architecture.

This is why ZeroMQ does not impose a broker-centric architecture, though it does give you the tools to build brokers, aka proxies, and we’ve built a dozen or so different ones so far, just for practice.

So we’ll end this chapter by deconstructing the broker-based reliability we’ve built so far, and turning it back into a distributed peer-to-peer architecture I call the Freelance pattern. Our use case will be a name resolution service. This is a common problem with ZeroMQ architectures: how do we know the endpoint to connect to? Hard-coding TCP/IP addresses in code is insanely fragile. Using configuration files creates an administration nightmare. Imagine if you had to hand-configure your web browser, on every PC or mobile phone you used, to realize that “google.com” was “74.125.230.82”.

A ZeroMQ name service (and we’ll make a simple implementation) must do the following:

  • Resolve a logical name into at least a bind endpoint, and a connect endpoint. A realistic name service would provide multiple bind endpoints, and possibly multiple connect endpoints as well.

  • Allow us to manage multiple parallel environments, e.g., “test” versus “production”, without modifying code.

  • Be reliable, because if it is unavailable, applications won’t be able to connect to the network.

Putting a name service behind a service-oriented Majordomo broker is clever from some points of view. However, it’s simpler and much less surprising to just expose the name service as a server to which clients can connect directly. If we do this right, the name service becomes the only global network endpoint we need to hard-code in our code or configuration files.

Figure 55 - The Freelance Pattern

![Figure 55 - The Freelance Pattern](https://github.com/imatix/zguide/raw/master/images/fig55.png)

The types of failure we aim to handle are server crashes and restarts, server busy looping, server overload, and network issues. To get reliability, we’ll create a pool of name servers so if one crashes or goes away, clients can connect to another, and so on. In practice, two would be enough. But for the example, we’ll assume the pool can be any size.

In this architecture, a large set of clients connect to a small set of servers directly. The servers bind to their respective addresses. It’s fundamentally different from a broker-based approach like Majordomo, where workers connect to the broker. Clients have a couple of options:

  • Use REQ sockets and the Lazy Pirate pattern. Easy, but would need some additional intelligence so clients don’t stupidly try to reconnect to dead servers over and over.

  • Use DEALER sockets and blast out requests (which will be load balanced to all connected servers) until they get a reply. Effective, but not elegant.

  • Use ROUTER sockets so clients can address specific servers. But how does the client know the identity of the server sockets? Either the server has to ping the client first (complex), or the server has to use a hard-coded, fixed identity known to the client (nasty).

We’ll develop each of these in the following subsections.

Model One: Simple Retry and Failover

So our menu appears to offer: simple, brutal, complex, or nasty. Let’s start with simple and then work out the kinks. We take Lazy Pirate and rewrite it to work with multiple server endpoints.

Start one or several servers first, specifying a bind endpoint as the argument:

[flserver1: Freelance server, Model One in C](javascript:😉


[flclient1: Freelance client, Model One in C](javascript:😉


flserver1 tcp://*:5555 &
flserver1 tcp://*:5556 &
flclient1 tcp://localhost:5555 tcp://localhost:5556

Although the basic approach is Lazy Pirate, the client aims to just get one successful reply. It has two techniques, depending on whether you are running a single server or multiple servers:

  • With a single server, the client will retry several times, exactly as for Lazy Pirate.
  • With multiple servers, the client will try each server at most once until it’s received a reply or has tried all servers.

This solves the main weakness of Lazy Pirate, namely that it could not fail over to backup or alternate servers.
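The heart of the Model One client is just that retry loop over the two techniques above. Here is a minimal sketch of it, assuming a one-second timeout per attempt and the usual Lazy Pirate trick of destroying and recreating the REQ socket after a missed reply; the real flclient1 linked above also adds retries for the single-server case:

#include "czmq.h"

#define REQUEST_TIMEOUT  1000       //  msecs per attempt (an assumption)

//  Try one endpoint once; return the reply or NULL on timeout
static char *
s_try_request (zctx_t *ctx, char *endpoint, char *request)
{
    printf ("I: trying echo service at %s...\n", endpoint);
    void *client = zsocket_new (ctx, ZMQ_REQ);
    zsocket_connect (client, endpoint);
    zstr_send (client, request);

    //  Poll for a reply, with a timeout
    zmq_pollitem_t items [] = { { client, 0, ZMQ_POLLIN, 0 } };
    zmq_poll (items, 1, REQUEST_TIMEOUT * ZMQ_POLL_MSEC);
    char *reply = (items [0].revents & ZMQ_POLLIN)? zstr_recv (client): NULL;

    //  A REQ socket that missed its reply is stuck, so throw it away
    zsocket_destroy (ctx, client);
    return reply;
}

int main (int argc, char *argv [])
{
    zctx_t *ctx = zctx_new ();
    char *reply = NULL;
    int argn;
    //  Try each endpoint given on the command line at most once
    for (argn = 1; argn < argc && !reply; argn++)
        reply = s_try_request (ctx, argv [argn], "Hello");
    if (reply) {
        printf ("Service is running OK: %s\n", reply);
        free (reply);
    }
    else
        printf ("W: no server responded\n");
    zctx_destroy (&ctx);
    return 0;
}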

However, this design won’t work well in a real application. If we’re connecting many sockets and our primary name server is down, we’re going to experience this painful timeout each time.

Model Two: Brutal Shotgun Massacre

Let’s switch our client to using a DEALER socket. Our goal here is to make sure we get a reply back within the shortest possible time, no matter whether a particular server is up or down. Our client takes this approach:

  • We set things up, connecting to all servers.
  • When we have a request, we blast it out as many times as we have servers.
  • We wait for the first reply, and take that.
  • We ignore any other replies.

What will happen in practice is that when all servers are running, ZeroMQ will distribute the requests so that each server gets one request and sends one reply. When any server is offline and disconnected, ZeroMQ will distribute the requests to the remaining servers. So a server may in some cases get the same request more than once.

What’s more annoying for the client is that we’ll get multiple replies back, but there’s no guarantee we’ll get a precise number of replies. Requests and replies can get lost (e.g., if the server crashes while processing a request).

So we have to number requests and ignore any replies that don’t match the request number. Our Model One server will work because it’s an echo server, but coincidence is not a great basis for understanding. So we’ll make a Model Two server that chews up the message and returns a correctly numbered reply with the content “OK”. We’ll use messages consisting of two parts: a sequence number and a body.
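The server side of that contract is tiny. Here is a sketch of the Model Two server loop under those assumptions (two frames per request, the first being the sequence number); the real flserver2 is linked just below:

#include "czmq.h"

int main (int argc, char *argv [])
{
    if (argc < 2) {
        printf ("I: syntax: %s <endpoint>\n", argv [0]);
        return 0;
    }
    zctx_t *ctx = zctx_new ();
    void *server = zsocket_new (ctx, ZMQ_REP);
    zsocket_bind (server, argv [1]);
    printf ("I: service is ready at %s\n", argv [1]);

    while (true) {
        zmsg_t *request = zmsg_recv (server);
        if (!request)
            break;                  //  Interrupted
        //  Frame 1 is the sequence number, frame 2 the body (ignored here)
        char *sequence = zmsg_popstr (request);
        zmsg_destroy (&request);

        //  Reply with the same sequence number and "OK"
        zmsg_t *reply = zmsg_new ();
        zmsg_addstr (reply, sequence);
        zmsg_addstr (reply, "OK");
        zmsg_send (&reply, server);
        free (sequence);
    }
    zctx_destroy (&ctx);
    return 0;
}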

Start one or more servers, specifying a bind endpoint each time:

[flserver2: Freelance server, Model Two in C](javascript:😉


[flclient2: Freelance client, Model Two in C](javascript:😉


  • The client is structured as a nice little class-based API that hides the dirty work of creating ZeroMQ contexts and sockets and talking to the server. That is, if a shotgun blast to the midriff can be called “talking”.

  • The client will abandon the chase if it can’t find any responsive server within a few seconds.

  • The client has to create a valid REP envelope, i.e., add an empty message frame to the front of the message (see the sketch after this list).
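For reference, here is a stripped-down sketch of what the class does for a single request, without the API wrapper: blast the request at every connected server, then keep the first reply whose sequence number matches. It assumes czmq.h, a DEALER socket already connected to every server, and a caller-supplied timeout; the names are my own:

//  Send one request per connected server and return the body of the
//  first matching reply, or NULL if nothing matched before the timeout.
static char *
s_shotgun_request (void *dealer, int servers, int sequence, int timeout_ms)
{
    char seq_text [16];
    sprintf (seq_text, "%d", sequence);

    int server;
    for (server = 0; server < servers; server++) {
        zmsg_t *request = zmsg_new ();
        zmsg_addstr (request, "");          //  Empty frame = REP envelope
        zmsg_addstr (request, seq_text);    //  Sequence number
        zmsg_addstr (request, "random name");
        zmsg_send (&request, dealer);       //  Round-robins to the next server
    }
    //  Take the first reply whose sequence matches; discard the rest
    int64_t expires = zclock_time () + timeout_ms;
    while (zclock_time () < expires) {
        zmq_pollitem_t items [] = { { dealer, 0, ZMQ_POLLIN, 0 } };
        long timeout = (long) (expires - zclock_time ());
        if (zmq_poll (items, 1, timeout * ZMQ_POLL_MSEC) < 1)
            break;                          //  Timed out or interrupted
        zmsg_t *reply = zmsg_recv (dealer);
        free (zmsg_popstr (reply));         //  Drop the empty envelope frame
        char *reply_seq = zmsg_popstr (reply);
        int matched = (atoi (reply_seq) == sequence);
        free (reply_seq);
        if (matched) {
            char *body = zmsg_popstr (reply);
            zmsg_destroy (&reply);
            return body;
        }
        zmsg_destroy (&reply);              //  Stale reply; keep waiting
    }
    return NULL;
}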

The client performs 10,000 name resolution requests (fake ones, as our server does essentially nothing) and measures the average cost. On my test box, talking to one server, this requires about 60 microseconds. Talking to three servers, it takes about 80 microseconds.

The pros and cons of our shotgun approach are:

  • Pro: it is simple, easy to make and easy to understand.
  • Pro: it does the job of failover, and works rapidly, so long as there is at least one server running.
  • Con: it creates redundant network traffic.
  • Con: we can’t prioritize our servers, i.e., Primary, then Secondary.
  • Con: the server can do at most one request at a time, period.
Model Three: Complex and Nasty

The shotgun approach seems too good to be true. Let’s be scientific and work through all the alternatives. We’re going to explore the complex/nasty option, even if it’s only to finally realize that we preferred brutal. Ah, the story of my life.

We can solve the main problems of the client by switching to a ROUTER socket. That lets us send requests to specific servers, avoid servers we know are dead, and in general be as smart as we want to be. We can also solve the main problem of the server (single-threadedness) by switching to a ROUTER socket.

But doing ROUTER to ROUTER between two anonymous sockets (which haven’t set an identity) is not possible. Both sides generate an identity (for the other peer) only when they receive a first message, and thus neither can talk to the other until it has first received a message. The only way out of this conundrum is to cheat, and use hard-coded identities in one direction. The proper way to cheat, in a client/server case, is to let the client “know” the identity of the server. Doing it the other way around would be insane, on top of complex and nasty, because any number of clients should be able to arise independently. Insane, complex, and nasty are great attributes for a genocidal dictator, but terrible ones for software.

Rather than invent yet another concept to manage, we’ll use the connection endpoint as identity. This is a unique string on which both sides can agree without more prior knowledge than they already have for the shotgun model. It’s a sneaky and effective way to connect two ROUTER sockets.

Remember how ZeroMQ identities work. The server ROUTER socket sets an identity before it binds its socket. When a client connects, they do a little handshake to exchange identities, before either side sends a real message. The client ROUTER socket, having not set an identity, sends a null identity to the server. The server generates a random UUID to designate the client for its own use. The server sends its identity (which we’ve agreed is going to be an endpoint string) to the client.

This means that our client can route a message to the server (i.e., send on its ROUTER socket, specifying the server endpoint as identity) as soon as the connection is established. That’s not immediately after doing a zmq_connect(), but some random time thereafter. Herein lies one problem: we don’t know when the server will actually be available and complete its connection handshake. If the server is online, it could be after a few milliseconds. If the server is down and the sysadmin is out to lunch, it could be an hour from now.
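Here is a sketch of both halves of that trick, assuming a czmq context named ctx; the endpoint strings and request body are placeholders, and I use the raw zmq_setsockopt() call for the identity:

//  Server side: use our *connect* endpoint (the string clients will use)
//  as the ROUTER identity, and set it before binding
char *connect_endpoint = "tcp://localhost:5555";
char *bind_endpoint = "tcp://*:5555";
void *server = zsocket_new (ctx, ZMQ_ROUTER);
zmq_setsockopt (server, ZMQ_IDENTITY,
    connect_endpoint, strlen (connect_endpoint));
zsocket_bind (server, bind_endpoint);

//  Client side: connect, then address the server by that same string
void *client = zsocket_new (ctx, ZMQ_ROUTER);
zsocket_connect (client, connect_endpoint);
zclock_sleep (100);             //  Crude: hope the handshake has completed

zmsg_t *request = zmsg_new ();
zmsg_addstr (request, connect_endpoint);    //  Address frame = server identity
zmsg_addstr (request, "hello");             //  Placeholder request body
zmsg_send (&request, client);

That crude sleep is precisely the weakness the next paragraphs deal with: the client has no clean way to know when the handshake is done.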

There’s a small paradox here. We need to know when servers become connected and available for work. In the Freelance pattern, unlike the broker-based patterns we saw earlier in this chapter, servers are silent until spoken to. Thus we can’t talk to a server until it’s told us it’s online, which it can’t do until we’ve asked it.

My solution is to mix in a little of the shotgun approach from model 2, meaning we’ll fire (harmless) shots at anything we can, and if anything moves, we know it’s alive. We’re not going to fire real requests, but rather a kind of ping-pong heartbeat.

This brings us to the realm of protocols again, so here’s a short spec that defines how a Freelance client and server exchange ping-pong commands and request-reply commands.
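To make the idea concrete, here is a client-side sketch of those harmless shots; the PING and PONG command names follow the Freelance Protocol, while the function and its arguments are placeholders. The client pings every server it knows about, and any traffic back from a server, whether a PONG or a real reply, counts as proof of life and refreshes that server's expiry:

//  `client` is a ROUTER socket; each server's identity is its connect
//  endpoint, as described in the previous section.
static void
s_ping_servers (void *client, char **endpoints, size_t count)
{
    size_t index;
    for (index = 0; index < count; index++) {
        zmsg_t *ping = zmsg_new ();
        zmsg_addstr (ping, endpoints [index]);  //  Address frame
        zmsg_addstr (ping, "PING");             //  Control frame
        zmsg_send (&ping, client);
    }
}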

It is short and sweet to implement as a server. Here’s our echo server, Model Three, now speaking FLP:

[flserver3: Freelance server, Model Three in C](javascript:😉


[flclient3: Freelance client, Model Three in C](javascript:😉


[flcliapi: Freelance client API in C](javascript:😉


  • Multithreaded API: the client API consists of two parts, a synchronous flcliapi class that runs in the application thread, and an asynchronous agent class that runs as a background thread. Remember how ZeroMQ makes it easy to create multithreaded apps. The flcliapi and agent classes talk to each other with messages over an inproc socket. All ZeroMQ aspects (such as creating and destroying a context) are hidden in the API. The agent in effect acts like a mini-broker, talking to servers in the background, so that when we make a request, it can make a best effort to reach a server it believes is available.

  • Tickless poll timer: in previous poll loops we always used a fixed tick interval, e.g., 1 second, which is simple enough but not excellent on power-sensitive clients (such as notebooks or mobile phones), where waking the CPU costs power. For fun, and to help save the planet, the agent uses a tickless timer, which calculates the poll delay based on the next timeout we’re expecting. A proper implementation would keep an ordered list of timeouts. We just check all timeouts and calculate the poll delay until the next one (sketched below).
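A minimal sketch of that calculation, assuming a plain array of per-server expiry timestamps (expires, nbr_servers, and pipe are placeholder names) rather than the hash table the real agent uses:

//  expires [] holds each server's expiry time in msecs (zclock_time () units);
//  zero means no timeout is pending for that server.
int64_t tickless = zclock_time () + 1000 * 3600;    //  Wake at most hourly
int server;
for (server = 0; server < nbr_servers; server++)
    if (expires [server] && expires [server] < tickless)
        tickless = expires [server];

long timeout = (long) (tickless - zclock_time ());
if (timeout < 0)
    timeout = 0;                                    //  Already overdue
zmq_pollitem_t items [] = { { pipe, 0, ZMQ_POLLIN, 0 } };
zmq_poll (items, 1, timeout * ZMQ_POLL_MSEC);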

Conclusion

In this chapter, we’ve seen a variety of reliable request-reply mechanisms, each with certain costs and benefits. The example code is largely ready for real use, though it is not optimized. Of all the different patterns, the two that stand out for production use are the Majordomo pattern, for broker-based reliability, and the Freelance pattern, for brokerless reliability.

Chapter 5 - Advanced Pub-Sub Patterns

In Chapter 3 - Advanced Request-Reply Patterns and Chapter 4 - Reliable Request-Reply Patterns we looked at advanced use of ZeroMQ’s request-reply pattern. If you managed to digest all that, congratulations. In this chapter we’ll focus on publish-subscribe and extend ZeroMQ’s core pub-sub pattern with higher-level patterns for performance, reliability, state distribution, and monitoring.

We’ll cover:

  • When to use publish-subscribe
  • How to handle too-slow subscribers (the Suicidal Snail pattern)
  • How to design high-speed subscribers (the Black Box pattern)
  • How to monitor a pub-sub network (the Espresso pattern)
  • How to build a shared key-value store (the Clone pattern)
  • How to use reactors to simplify complex servers
  • How to use the Binary Star pattern to add failover to a server
Pros and Cons of Pub-Sub

ZeroMQ’s low-level patterns have their different characters. Pub-sub addresses an old messaging problem, which is multicast or group messaging. It has that unique mix of meticulous simplicity and brutal indifference that characterizes ZeroMQ. It’s worth understanding the trade-offs that pub-sub makes, how these benefit us, and how we can work around them if needed.

First, PUB sends each message to “all of many”, whereas PUSH and DEALER rotate messages to “one of many”. You cannot simply replace PUSH with PUB or vice versa and hope that things will work. This bears repeating because people seem to quite often suggest doing this.

More profoundly, pub-sub is aimed at scalability. This means large volumes of data, sent rapidly to many recipients. If you need millions of messages per second sent to thousands of points, you’ll appreciate pub-sub a lot more than if you need a few messages a second sent to a handful of recipients.

To get scalability, pub-sub uses the same trick as push-pull, which is to get rid of back-chatter. This means that recipients don’t talk back to senders. There are some exceptions, e.g., SUB sockets will send subscriptions to PUB sockets, but it’s anonymous and infrequent.

Killing back-chatter is essential to real scalability. With pub-sub, it’s how the pattern can map cleanly to the PGM multicast protocol, which is handled by the network switch. In other words, subscribers don’t connect to the publisher at all; they connect to a multicast group on the switch, to which the publisher sends its messages.

When we remove back-chatter, our overall message flow becomes much simpler, which lets us make simpler APIs, simpler protocols, and in general reach many more people. But we also remove any possibility to coordinate senders and receivers. What this means is:

  • Publishers can’t tell when subscribers are successfully connected, both on initial connections, and on reconnections after network failures.

  • Subscribers can’t tell publishers anything that would allow publishers to control the rate of messages they send. Publishers only have one setting, which is full-speed, and subscribers must either keep up or lose messages.

  • Publishers can’t tell when subscribers have disappeared due to processes crashing, networks breaking, and so on.

The downside is that we actually need all of these if we want to do reliable multicast. The ZeroMQ pub-sub pattern will lose messages arbitrarily when a subscriber is connecting, when a network failure occurs, or just if the subscriber or network can’t keep up with the publisher.

The upside is that there are many use cases where almost reliable multicast is just fine. When we need this back-chatter, we can either switch to using ROUTER-DEALER (which I tend to do for most normal volume cases), or we can add a separate channel for synchronization (we’ll see an example of this later in this chapter).

Pub-sub is like a radio broadcast; you miss everything before you join, and then how much information you get depends on the quality of your reception. Surprisingly, this model is useful and widespread because it maps perfectly to real world distribution of information. Think of Facebook and Twitter, the BBC World Service, and the sports results.

As we did for request-reply, let’s define reliability in terms of what can go wrong. Here are the classic failure cases for pub-sub:

  • Subscribers join late, so they miss messages the server already sent.
  • Subscribers can fetch messages too slowly, so queues build up and then overflow.
  • Subscribers can drop off and lose messages while they are away.
  • Subscribers can crash and restart, and lose whatever data they already received.
  • Networks can become overloaded and drop data (specifically, for PGM).
  • Networks can become too slow, so publisher-side queues overflow and publishers crash.

A lot more can go wrong but these are the typical failures we see in a realistic system. Since v3.x, ZeroMQ forces default limits on its internal buffers (the so-called high-water mark or HWM), so publisher crashes are rarer unless you deliberately set the HWM to infinite.
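If you do want to change those limits, it is one socket option per direction. A sketch using the CZMQ helpers for ZeroMQ 3.x, with arbitrary values, a placeholder endpoint, and an assumed context ctx:

void *publisher = zsocket_new (ctx, ZMQ_PUB);
zsocket_set_sndhwm (publisher, 100000);     //  Per-subscriber queue limit; 0 = unlimited
zsocket_bind (publisher, "tcp://*:5556");

void *subscriber = zsocket_new (ctx, ZMQ_SUB);
zsocket_set_rcvhwm (subscriber, 100000);
zsocket_set_subscribe (subscriber, "");
zsocket_connect (subscriber, "tcp://localhost:5556");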

All of these failure cases have answers, though not always simple ones. Reliability requires complexity that most of us don’t need, most of the time, which is why ZeroMQ doesn’t attempt to provide it out of the box (even if there was one global design for reliability, which there isn’t).

Pub-Sub Tracing (Espresso Pattern)

Let’s start this chapter by looking at a way to trace pub-sub networks. In Chapter 2 - Sockets and Patterns we saw a simple proxy that used XSUB and XPUB sockets to do transport bridging. The zmq_proxy() method has three arguments: a frontend and backend socket that it bridges together, and a capture socket to which it will send all messages.
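Before the listing, here is a rough sketch of that wiring, with placeholder endpoints and an assumed context ctx; the linked example uses inproc transports and a separate listener thread:

//  Frontend subscribes to the publisher(s); backend re-publishes to our own
//  subscribers; every message passing through is also copied to capture.
void *frontend = zsocket_new (ctx, ZMQ_XSUB);
zsocket_connect (frontend, "tcp://localhost:6000");
void *backend = zsocket_new (ctx, ZMQ_XPUB);
zsocket_bind (backend, "tcp://*:6001");

void *capture = zsocket_new (ctx, ZMQ_PAIR);
zsocket_bind (capture, "inproc://capture");
//  A listener thread connects a PAIR socket to inproc://capture and prints
//  whatever arrives there
zmq_proxy (frontend, backend, capture);     //  Runs until the context is terminated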

The code is deceptively simple:

[espresso: Espresso Pattern in C](javascript:😉


The subscriber thread subscribes to “A” and “B”, receives five messages, and then destroys its socket. When you run the example, the listener prints two subscription messages, five data messages, two unsubscribe messages, and then silence:

[002] 0141
[002] 0142
[007] B-91164
[007] B-12979
[007] A-52599
[007] A-06417