0MQ

ØMQ - The Guide

By Pieter Hintjens, CEO of iMatix

Please use the issue tracker for all comments and errata. This version covers the latest stable release of ZeroMQ (3.2). If you are using older versions of ZeroMQ then some of the examples and explanations won't be accurate.

The Guide is originally in C, but also in PHP, Python, Lua, and Haxe. We've also translated most of the examples into C++, C#, CL, Delphi, Erlang, F#, Felix, Haskell, Java, Objective-C, Ruby, Ada, Basic, Clojure, Go, Haxe, Node.js, ooc, Perl, and Scala.

Preface

ZeroMQ in a Hundred Words

ZeroMQ (also known as ØMQ, 0MQ, or zmq) looks like an embeddable networking library but acts like a concurrency framework. It gives you sockets that carry atomic messages across various transports like in-process, inter-process, TCP, and multicast. You can connect sockets N-to-N with patterns like fan-out, pub-sub, task distribution, and request-reply. It's fast enough to be the fabric for clustered products. Its asynchronous I/O model gives you scalable multicore applications, built as asynchronous message-processing tasks. It has a score of language APIs and runs on most operating systems. ZeroMQ is from iMatix and is LGPLv3 open source.

How It Began

We took a normal TCP socket, injected it with a mix of radioactive isotopes stolen from a secret Soviet atomic research project, bombarded it with 1950-era cosmic rays, and put it into the hands of a drug-addled comic book author with a badly-disguised fetish for bulging muscles clad in spandex. Yes, ZeroMQ sockets are the world-saving superheroes of the networking world.

Figure 1 - A terrible accident…

fig1.png

The Zen of Zero

The Ø in ZeroMQ is all about tradeoffs. On the one hand this strange name lowers ZeroMQ's visibility on Google and Twitter. On the other hand it annoys the heck out of some Danish folk who write us things like "ØMG røtfl", and "Ø is not a funny looking zero!" and "Rødgrød med fløde!", which is apparently an insult that means "may your neighbours be the direct descendants of Grendel!" Seems like a fair trade.

Originally the zero in ZeroMQ was meant as "zero broker" and (as close to) "zero latency" (as possible). Since then, it has come to encompass different goals: zero administration, zero cost, zero waste. More generally, "zero" refers to the culture of minimalism that permeates the project. We add power by removing complexity rather than by exposing new functionality.

Audience

This book is written for professional programmers who want to learn how to make the massively distributed software that will dominate the future of computing. We assume you can read C code, because most of the examples here are in C even though ZeroMQ is used in many languages. We assume you care about scale, because ZeroMQ solves that problem above all others. We assume you need the best possible results with the least possible cost, because otherwise you won't appreciate the trade-offs that ZeroMQ makes. Other than that basic background, we try to present all the concepts in networking and distributed computing you will need to use ZeroMQ.

Acknowledgements

Thanks to Andy Oram for making the O'Reilly book happen, and editing this text.

Thanks to Bill Desmarais, Brian Dorsey, Daniel Lin, Eric Desgranges, Gonzalo Diethelm, Guido Goldstein, Hunter Ford, Kamil Shakirov, Martin Sustrik, Mike Castleman, Naveen Chawla, Nicola Peduzzi, Oliver Smith, Olivier Chamoux, Peter Alexander, Pierre Rouleau, Randy Dryburgh, John Unwin, Alex Thomas, Mihail Minkov, Jeremy Avnet, Michael Compton, Kamil Kisiel, Mark Kharitonov, Guillaume Aubert, Ian Barber, Mike Sheridan, Faruk Akgul, Oleg Sidorov, Lev Givon, Allister MacLeod, Alexander D'Archangel, Andreas Hoelzlwimmer, Han Holl, Robert G. Jakabosky, Felipe Cruz, Marcus McCurdy, Mikhail Kulemin, Dr. Gergő Érdi, Pavel Zhukov, Alexander Else, Giovanni Ruggiero, Rick "Technoweenie", Daniel Lundin, Dave Hoover, Simon Jefford, Benjamin Peterson, Justin Case, Devon Weller, Richard Smith, Alexander Morland, Wadim Grasza, Michael Jakl, Uwe Dauernheim, Sebastian Nowicki, Simone Deponti, Aaron Raddon, Dan Colish, Markus Schirp, Benoit Larroque, Jonathan Palardy, Isaiah Peng, Arkadiusz Orzechowski, Umut Aydin, Matthew Horsfall, Jeremy W. Sherman, Eric Pugh, Tyler Sellon, John E. Vincent, Pavel Mitin, Min RK, Igor Wiedler, Olof Åkesson, Patrick Lucas, Heow Goodman, Senthil Palanisami, John Gallagher, Tomas Roos, Stephen McQuay, Erik Allik, Arnaud Cogoluègnes, Rob Gagnon, Dan Williams, Edward Smith, James Tucker, Kristian Kristensen, Vadim Shalts, Martin Trojer, Tom van Leeuwen, Hiten Pandya, Harm Aarts, Marc Harter, Iskren Ivov Chernev, Jay Han, Sonia Hamilton, Nathan Stocks, Naveen Palli, and Zed Shaw for their contributions to this work.


Chapter 1 - Basics

Fixing the World

How to explain ZeroMQ? Some of us start by saying all the wonderful things it does. It's sockets on steroids. It's like mailboxes with routing. It's fast! Others try to share their moment of enlightenment, that zap-pow-kaboom satori paradigm-shift moment when it all became obvious. Things just become simpler. Complexity goes away. It opens the mind. Others try to explain by comparison. It's smaller, simpler, but still looks familiar. Personally, I like to remember why we made ZeroMQ at all, because that's most likely where you, the reader, still are today.

Programming is science dressed up as art because most of us don't understand the physics of software and it's rarely, if ever, taught. The physics of software is not algorithms, data structures, languages and abstractions. These are just tools we make, use, throw away. The real physics of software is the physics of people: specifically, our limitations when it comes to complexity, and our desire to work together to solve large problems in pieces. This is the science of programming: make building blocks that people can understand and use easily, and people will work together to solve the very largest problems.

We live in a connected world, and modern software has to navigate this world. So the building blocks for tomorrow's very largest solutions are connected and massively parallel. It's not enough for code to be "strong and silent" any more. Code has to talk to code. Code has to be chatty, sociable, well-connected. Code has to run like the human brain, trillions of individual neurons firing off messages to each other, a massively parallel network with no central control, no single point of failure, yet able to solve immensely difficult problems. And it's no accident that the future of code looks like the human brain, because the endpoints of every network are, at some level, human brains.

If you've done any work with threads, protocols, or networks, you'll realize this is pretty much impossible. It's a dream. Even connecting a few programs across a few sockets is plain nasty when you start to handle real life situations. Trillions? The cost would be unimaginable. Connecting computers is so difficult that software and services to do this are a multi-billion dollar business.

So we live in a world where the wiring is years ahead of our ability to use it. We had a software crisis in the 1980s, when leading software engineers like Fred Brooks believed there was no "Silver Bullet" to "promise even one order of magnitude of improvement in productivity, reliability, or simplicity".

Brooks missed free and open source software, which solved that crisis, enabling us to share knowledge efficiently. Today we face another software crisis, but it's one we don't talk about much. Only the largest, richest firms can afford to create connected applications. There is a cloud, but it's proprietary. Our data and our knowledge are disappearing from our personal computers into clouds that we cannot access and with which we cannot compete. Who owns our social networks? It is like the mainframe-PC revolution in reverse.

We can leave the political philosophy for another book. The point is that while the Internet offers the potential of massively connected code, the reality is that this is out of reach for most of us, and so large interesting problems (in health, education, economics, transport, and so on) remain unsolved because there is no way to connect the code, and thus no way to connect the brains that could work together to solve these problems.

There have been many attempts to solve the challenge of connected code. There are thousands of IETF specifications, each solving part of the puzzle. For application developers, HTTP is perhaps the one solution to have been simple enough to work, but it arguably makes the problem worse by encouraging developers and architects to think in terms of big servers and thin, stupid clients.

So today people are still connecting applications using raw UDP and TCP, proprietary protocols, HTTP, and Websockets. It remains painful, slow, hard to scale, and essentially centralized. Distributed P2P architectures are mostly for play, not work. How many applications use Skype or Bittorrent to exchange data?

Which brings us back to the science of programming. To fix the world, we needed to do two things. One, to solve the general problem of "how to connect any code to any code, anywhere". Two, to wrap that up in the simplest possible building blocks that people could understand and use easily.

It sounds ridiculously simple. And maybe it is. That's kind of the whole point.

Starting Assumptions

We assume you are using at least version 3.2 of ZeroMQ. We assume you are using a Linux box or something similar. We assume you can read C code, more or less, as that's the default language for the examples. We assume that when we write constants like PUSH or SUBSCRIBE, you can imagine they are really called ZMQ_PUSH or ZMQ_SUBSCRIBE if the programming language needs it.

Getting the Examples

The examples live in a public GitHub repository. The simplest way to get all the examples is to clone this repository:

git clone --depth=1 https://github.com/imatix/zguide.git

Next, browse the examples subdirectory. You'll find examples by language. If there are examples missing in a language you use, you're encouraged to submit a translation. This is how this text became so useful, thanks to the work of many people. All examples are licensed under MIT/X11.

Ask and Ye Shall Receive

So let's start with some code. We start of course with a Hello World example. We'll make a client and a server. The client sends "Hello" to the server, which replies with "World". Here's the server in C, which opens a ZeroMQ socket on port 5555, reads requests on it, and replies with "World" to each request:


C++ | C# | Clojure | CL | Delphi | Erlang | F# | Felix | Go | Haskell | Haxe | Java | Lua | Node.js | Objective-C | Perl | PHP | Python | Q | Racket | Ruby | Scala | Tcl | Ada | Basic | ooc

Figure 2 - Request-Reply

fig2.png

The REQ-REP socket pair is in lockstep. The client issues zmq_send() and then zmq_recv(), in a loop (or once if that's all it needs). Doing any other sequence (e.g., sending two messages in a row) will result in a return code of -1 from the send or recv call. Similarly, the service issues zmq_recv() and then zmq_send() in that order, as often as it needs to.

ZeroMQ uses C as its reference language and this is the main language we'll use for examples. If you're reading this online, the link below the example takes you to translations into other programming languages. Let's compare the same server in C++:

//
// Hello World server in C++
// Binds REP socket to tcp://*:5555
// Expects "Hello" from client, replies with "World"
//

#include <zmq.hpp>
#include <cstring>
#include <string>
#include <iostream>
#ifndef _WIN32
#include <unistd.h>
#else
#include <windows.h>

// Windows Sleep() takes milliseconds, POSIX sleep() takes seconds
#define sleep(n) Sleep((n) * 1000)
#endif

int main () {
    // Prepare our context and socket
    zmq::context_t context (1);
    zmq::socket_t socket (context, ZMQ_REP);
    socket.bind ("tcp://*:5555");

    while (true) {
        zmq::message_t request;

        // Wait for next request from client
        socket.recv (&request);
        std::cout << "Received Hello" << std::endl;

        // Do some 'work'
        sleep (1);

        // Send reply back to client
        zmq::message_t reply (5);
        memcpy (reply.data (), "World", 5);
        socket.send (reply);
    }
    return 0;
}

hwserver.cpp: Hello World server

You can see that the ZeroMQ API is similar in C and C++. In a language like PHP or Java, we can hide even more and the code becomes even easier to read:

<?php
/*
 * Hello World server
 * Binds REP socket to tcp://*:5555
 * Expects "Hello" from client, replies with "World"
 * @author Ian Barber <ian(dot)barber(at)gmail(dot)com>
 */

$context = new ZMQContext(1);

// Socket to talk to clients
$responder = new ZMQSocket($context, ZMQ::SOCKET_REP);
$responder->bind("tcp://*:5555");

while (true) {
    // Wait for next request from client
    $request = $responder->recv();
    printf ("Received request: [%s]\n", $request);

    // Do some 'work'
    sleep (1);

    // Send reply back to client
    $responder->send("World");
}

hwserver.php: Hello World server

//
// Hello World server in Java
// Binds REP socket to tcp://*:5555
// Expects "Hello" from client, replies with "World"
//

import org.zeromq.ZMQ;

public class hwserver {

    public static void main(String[] args) throws Exception {
        ZMQ.Context context = ZMQ.context(1);

        // Socket to talk to clients
        ZMQ.Socket responder = context.socket(ZMQ.REP);
        responder.bind("tcp://*:5555");

        while (!Thread.currentThread().isInterrupted()) {
            // Wait for next request from the client
            byte[] request = responder.recv(0);
            System.out.println("Received Hello");

            // Do some 'work'
            Thread.sleep(1000);

            // Send reply back to client
            String reply = "World";
            responder.send(reply.getBytes(), 0);
        }
        responder.close();
        context.term();
    }
}

hwserver.java: Hello World server

The server in other languages:


C++ | C# | Clojure | CL | Delphi | Erlang | F# | Felix | Go | Haskell | Haxe | Java | Lua | Node.js | Objective-C | Perl | PHP | Python | Q | Racket | Ruby | Scala | Tcl | Ada | Basic | ooc

Here's the client code:


C++ | C# | Clojure | CL | Delphi | Erlang | F# | Felix | Go | Haskell | Haxe | Java | Lua | Node.js | Objective-C | Perl | PHP | Python | Q | Racket | Ruby | Scala | Tcl | Ada | Basic | ooc

Now this looks too simple to be realistic, but ZeroMQ sockets have, as we already learned, superpowers. You could throw thousands of clients at this server, all at once, and it would continue to work happily and quickly. For fun, try starting the client and then starting the server, see how it all still works, then think for a second what this means.

Let us explain briefly what these two programs are actually doing. They create a ZeroMQ context to work with, and a socket. Don't worry what the words mean. You'll pick it up. The server binds its REP (reply) socket to port 5555. The server waits for a request in a loop, and responds each time with a reply. The client sends a request and reads the reply back from the server.

If you kill the server (Ctrl-C) and restart it, the client won't recover properly. Recovering from crashing processes isn't quite that easy. Making a reliable request-reply flow is complex enough that we won't cover it until Chapter 4 - Reliable Request-Reply Patterns.

There is a lot happening behind the scenes but what matters to us programmers is how short and sweet the code is, and how often it doesn't crash, even under a heavy load. This is the request-reply pattern, probably the simplest way to use ZeroMQ. It maps to RPC and the classic client/server model.

A Minor Note on Strings

ZeroMQ doesn't know anything about the data you send except its size in bytes. That means you are responsible for formatting it safely so that applications can read it back. Doing this for objects and complex data types is a job for specialized libraries like Protocol Buffers. But even for strings, you need to take care.

In C and some other languages, strings are terminated with a null byte. We could send a string like "Hello" with that extra null byte:

zmq_send (requester, "Hello", 6, 0);

However, if you send a string from another language, it probably will not include that null byte. For example, when we send that same string in Python, we do this:

socket.send ("Hello")

Then what goes onto the wire is a length (one byte for shorter strings) and the string contents as individual characters.

Figure 3 - A ZeroMQ string

fig3.png

And if you read this from a C program, you will get something that looks like a string, and might by accident act like a string (if by luck the five bytes find themselves followed by an innocently lurking null), but isn't a proper string. When your client and server don't agree on the string format, you will get weird results.

When you receive string data from ZeroMQ in C, you simply cannot trust that it's safely terminated. Every single time you read a string, you should allocate a new buffer with space for an extra byte, copy the string, and terminate it properly with a null.

So let's establish the rule that ZeroMQ strings are length-specified and are sent on the wire without a trailing null. In the simplest case (and we'll do this in our examples), a ZeroMQ string maps neatly to a ZeroMQ message frame, which looks like the above figure: a length and some bytes.

Here is what we need to do, in C, to receive a ZeroMQ string and deliver it to the application as a valid C string:

// Receive ZeroMQ string from socket and convert into C string
// Chops string at 255 chars, if it's longer
// Caller must free the returned string; returns NULL on error

static char *
s_recv (void *socket) {
    char buffer [256];
    int size = zmq_recv (socket, buffer, 255, 0);
    if (size == -1)
        return NULL;
    if (size > 255)
        size = 255;
    buffer [size] = 0;
    return strdup (buffer);
}

This makes a handy helper function and in the spirit of making things we can reuse profitably, let's write a similar s_send function that sends strings in the correct ZeroMQ format, and package this into a header file we can reuse.

The result is zhelpers.h, which lets us write sweeter and shorter ZeroMQ applications in C. It is a fairly long source, and only fun for C developers, so read it at leisure.

Version Reporting

ZeroMQ does come in several versions and quite often, if you hit a problem, it'll be something that's been fixed in a later version. So it's a useful trick to know exactly what version of ZeroMQ you're actually linking with.

Here is a tiny program that does that:


C++ | C# | Clojure | CL | Delphi | Erlang | F# | Felix | Go | Haskell | Java | Lua | Node.js | Objective-C | Perl | PHP | Python | Q | Ruby | Scala | Tcl | Ada | Basic | Haxe | ooc | Racket

Getting the Message Out

The second classic pattern is one-way data distribution, in which a server pushes updates to a set of clients. Let's see an example that pushes out weather updates consisting of a zip code, temperature, and relative humidity. We'll generate random values, just like the real weather stations do.

Here's the server. We'll use port 5556 for this application:


C++ | C# | Clojure | CL | Delphi | Erlang | F# | Felix | Go | Haskell | Haxe | Java | Lua | Node.js | Objective-C | Perl | PHP | Python | Racket | Ruby | Scala | Tcl | Ada | Basic | ooc | Q

There's no start and no end to this stream of updates; it's like a never-ending broadcast.

Here is the client application, which listens to the stream of updates and grabs anything to do with a specified zip code, by default New York City because that's a great place to start any adventure:


C++ | C# | Clojure | CL | Delphi | Erlang | F# | Felix | Go | Haskell | Haxe | Java | Lua | Node.js | Objective-C | Perl | PHP | Python | Racket | Ruby | Scala | Tcl | Ada | Basic | ooc | Q

Figure 4 - Publish-Subscribe

fig4.png

Note that when you use a SUB socket you must set a subscription using zmq_setsockopt() and SUBSCRIBE, as in this code. If you don't set any subscription, you won't get any messages. It's a common mistake for beginners. The subscriber can set many subscriptions, which are added together. That is, if an update matches ANY subscription, the subscriber receives it. The subscriber can also cancel specific subscriptions. A subscription is often, but not necessarily, a printable string. See zmq_setsockopt() for how this works.
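
To make the "matches ANY subscription" rule concrete, here is a small model in plain C. It relies on one fact from the zmq_setsockopt() documentation: a ZMQ_SUBSCRIBE filter matches a message when the message begins with the filter's bytes, and an empty filter matches everything. The function names are hypothetical, and this is only a model of the rule, not how libzmq implements filtering internally.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

//  Returns 1 if 'message' starts with the bytes of 'prefix', else 0.
//  An empty prefix matches every message.
static int
matches_prefix (const void *message, size_t message_size,
                const void *prefix, size_t prefix_size)
{
    if (prefix_size > message_size)
        return 0;
    return memcmp (message, prefix, prefix_size) == 0;
}

//  Returns 1 if the message matches ANY of the subscriptions.
//  With no subscriptions at all, nothing is accepted.
static int
subscriber_accepts (const char *message,
                    const char *subscriptions [], size_t count)
{
    assert (message);
    size_t index;
    for (index = 0; index < count; index++) {
        if (matches_prefix (message, strlen (message),
                            subscriptions [index],
                            strlen (subscriptions [index])))
            return 1;
    }
    return 0;
}
```

With subscriptions "10001 " and "90210 ", a message "10001 28 45" is accepted and "10002 30 50" is not. With an empty subscription list nothing is accepted, which is the beginner's mistake described above.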

The PUB-SUB socket pair is asynchronous. The client does zmq_recv(), in a loop (or once if that's all it needs). Trying to send a message to a SUB socket will cause an error. Similarly, the service does zmq_send() as often as it needs to, but must not do zmq_recv() on a PUB socket.

In theory with ZeroMQ sockets, it does not matter which end connects and which end binds. However, in practice there are undocumented differences that I'll come to later. For now, bind the PUB and connect the SUB, unless your network design makes that impossible.

There is one more important thing to know about PUB-SUB sockets: you do not know precisely when a subscriber starts to get messages. Even if you start a subscriber, wait a while, and then start the publisher, the subscriber will always miss the first messages that the publisher sends. This is because as the subscriber connects to the publisher (something that takes a small but non-zero time), the publisher may already be sending messages out.

This "slow joiner" symptom hits enough people often enough that we're going to explain it in detail. Remember that ZeroMQ does asynchronous I/O, i.e., in the background. Say you have two nodes doing this, in this order:

  • Subscriber connects to an endpoint and receives and counts messages.
  • Publisher binds to an endpoint and immediately sends 1,000 messages.

Then the subscriber will most likely not receive anything. You'll blink, check that you set a correct filter and try again, and the subscriber will still not receive anything.

Making a TCP connection involves to and from handshaking that takes several milliseconds depending on your network and the number of hops between peers. In that time, ZeroMQ can send many messages. For the sake of argument, assume it takes 5 msecs to establish a connection, and that same link can handle 1M messages per second. During the 5 msecs that the subscriber is connecting to the publisher, it takes the publisher only 1 msec to send out those 1K messages.
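
The arithmetic behind that claim is short enough to write down. The sketch below uses only the figures from the paragraph above (5 msecs to connect, 1M messages per second, 1K messages to send); the function names are ours.

```c
#include <assert.h>

//  Number of messages a publisher can push out while a subscriber is
//  still connecting, given the connect time and the link's throughput.
static long
messages_sent_while_connecting (long connect_msecs, long messages_per_sec)
{
    assert (connect_msecs >= 0 && messages_per_sec >= 0);
    return connect_msecs * messages_per_sec / 1000;
}

//  Milliseconds (rounded up) needed to send 'count' messages.
static long
msecs_to_send (long count, long messages_per_sec)
{
    assert (count >= 0 && messages_per_sec > 0);
    return (count * 1000 + messages_per_sec - 1) / messages_per_sec;
}
```

With these numbers the publisher has room for 5,000 messages during the handshake, while the whole batch of 1,000 takes 1 msec, so every message is gone before the subscriber is connected.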

In Chapter 2 - Sockets and Patterns we'll explain how to synchronize a publisher and subscribers so that you don't start to publish data until the subscribers really are connected and ready. There is a simple and stupid way to delay the publisher, which is to sleep. Don't do this in a real application, though, because it is extremely fragile as well as inelegant and slow. Use sleeps to prove to yourself what's happening, and then wait for Chapter 2 - Sockets and Patterns to see how to do this right.

The alternative to synchronization is to simply assume that the published data stream is infinite and has no start and no end. One also assumes that the subscriber doesn't care what transpired before it started up. This is how we built our weather client example.

So the client subscribes to its chosen zip code and collects 100 updates for that zip code. That means about ten million updates from the server, if zip codes are randomly distributed. You can start the client, and then the server, and the client will keep working. You can stop and restart the server as often as you like, and the client will keep working. When the client has collected its hundred updates, it calculates the average, prints it, and exits.

Some points about the publish-subscribe (pub-sub) pattern:

  • A subscriber can connect to more than one publisher, using one connect call each time. Data will then arrive and be interleaved ("fair-queued") so that no single publisher drowns out the others.
  • If a publisher has no connected subscribers, then it will simply drop all messages.
  • If you're using TCP and a subscriber is slow, messages will queue up on the publisher. We'll look at how to protect publishers against this using the "high-water mark" later.
  • From ZeroMQ v3.x, filtering happens at the publisher side when using a connected protocol (tcp:// or ipc://). Using the epgm:// protocol, filtering happens at the subscriber side. In ZeroMQ v2.x, all filtering happened at the subscriber side.

This is how long it takes to receive and filter 10M messages on my laptop, which is a 2011-era Intel i5, decent but nothing special:

$ time wuclient
Collecting updates from weather server...
Average temperature for zipcode '10001 ' was 28F

real    0m4.470s
user    0m0.000s
sys     0m0.008s

Divide and Conquer

Figure 5 - Parallel Pipeline

fig5.png

As a final example (you are surely getting tired of juicy code and want to delve back into philological discussions about comparative abstractive norms), let's do a little supercomputing. Then coffee. Our supercomputing application is a fairly typical parallel processing model. We have:

  • A ventilator that produces tasks that can be done in parallel
  • A set of workers that process tasks
  • A sink that collects results back from the worker processes

In reality, workers run on superfast boxes, perhaps using GPUs (graphics processing units) to do the hard math. Here is the ventilator. It generates 100 tasks, each a message telling the worker to sleep for some number of milliseconds:


C++ | C# | Clojure | CL | Delphi | Erlang | F# | Felix | Go | Haskell | Haxe | Java | Lua | Node.js | Objective-C | Perl | PHP | Python | Ruby | Scala | Tcl | Ada | Basic | ooc | Q | Racket

Here is the worker application. It receives a message, sleeps for that number of milliseconds, and then signals that it's finished:


C++ | C# | Clojure | CL | Delphi | Erlang | F# | Felix | Go | Haskell | Haxe | Java | Lua | Node.js | Objective-C | Perl | PHP | Python | Ruby | Scala | Tcl | Ada | Basic | ooc | Q | Racket

Here is the sink application. It collects the 100 tasks, then calculates how long the overall processing took, so we can confirm that the workers really were running in parallel if there is more than one of them:


C++ | C# | Clojure | CL | Delphi | Erlang | F# | Felix | Go | Haskell | Haxe | Java | Lua | Node.js | Objective-C | Perl | PHP | Python | Ruby | Scala | Tcl | Ada | Basic | ooc | Q | Racket

The average cost of a batch is 5 seconds. When we start 1, 2, or 4 workers we get results like this from the sink:

  • 1 worker: total elapsed time: 5034 msecs.
  • 2 workers: total elapsed time: 2421 msecs.
  • 4 workers: total elapsed time: 1018 msecs.
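
One way to read those timings is as a speedup relative to the single-worker run. The helper below is a hypothetical illustration in plain C, not part of the example programs. Since the three figures come from separate runs, the ratios are only indicative.

```c
#include <assert.h>

//  Speedup of a parallel run relative to the single-worker run:
//  how many times faster the batch finished.
static double
speedup (double single_worker_msecs, double parallel_msecs)
{
    assert (single_worker_msecs > 0 && parallel_msecs > 0);
    return single_worker_msecs / parallel_msecs;
}
```

Applied to the sink's output, two workers come out roughly 2.1 times faster than one (5034 / 2421) and four workers roughly 4.9 times faster (5034 / 1018), which is consistent with the workers really running in parallel.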

Let's look at some aspects of this code in more detail:

  • The workers connect upstream to the ventilator, and downstream to the sink. This means you can add workers arbitrarily. If the workers bound to their endpoints, you would need (a) more endpoints and (b) to modify the ventilator and/or the sink each time you added a worker. We say that the ventilator and sink are stable parts of our architecture and the workers are dynamic parts of it.
  • We have to synchronize the start of the batch with all workers being up and running. This is a fairly common gotcha in ZeroMQ and there is no easy solution. The zmq_connect method takes a certain time. So when a set of workers connect to the ventilator, the first one to successfully connect will get a whole load of messages in that short time while the others are also connecting. If you don't synchronize the start of the batch somehow, the system won't run in parallel at all. Try removing the wait in the ventilator, and see what happens.
  • The ventilator's PUSH socket distributes tasks to workers (assuming they are all connected before the batch starts going out) evenly. This is called load balancing and it's something we'll look at again in more detail.
  • The sink's PULL socket collects results from workers evenly. This is called fair-queuing.

Figure 6 - Fair Queuing

fig6.png

The pipeline pattern also exhibits the "slow joiner" syndrome, leading to accusations that PUSH sockets don't load balance properly. If you are using PUSH and PULL, and one of your workers gets way more messages than the others, it's because that PULL socket has joined faster than the others, and grabs a lot of messages before the others manage to connect. If you want proper load balancing, you probably want to look at the load balancing pattern in Chapter 3 - Advanced Request-Reply Patterns.
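
The effect is easy to reproduce on paper. The following is a toy model in plain C, not libzmq code. It assumes the sender hands out one task per "tick", round-robin, among whichever workers are connected at that tick. All names and the tick-based timing are our own simplifications.

```c
#include <assert.h>
#include <stddef.h>

//  Hand out 'tasks' messages, one per tick, round-robin among the
//  workers connected at that tick. Worker i counts as connected from
//  tick 'join_tick [i]' onwards. If nobody is connected, the sender
//  waits for the next tick (as a blocking send would).
//  On return, 'received [i]' holds the number of tasks worker i got.
static void
simulate_push (int tasks, const int join_tick [], int received [],
               size_t workers)
{
    assert (tasks >= 0 && workers > 0);
    size_t index;
    for (index = 0; index < workers; index++)
        received [index] = 0;

    size_t next = 0;            //  Round-robin cursor
    int tick = 0;
    int sent = 0;
    while (sent < tasks) {
        //  Find the next connected worker, starting at the cursor
        size_t tried;
        for (tried = 0; tried < workers; tried++) {
            size_t candidate = (next + tried) % workers;
            if (join_tick [candidate] <= tick) {
                received [candidate]++;
                next = (candidate + 1) % workers;
                sent++;
                break;
            }
        }
        tick++;
    }
}
```

With 100 tasks and four workers that all join at tick 0, each worker gets 25 tasks. If the first worker joins at tick 0 and the other three at tick 40, the first one ends up with 55 tasks and the others with 15 each, which is the lopsided picture that gets blamed on PUSH.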

Programming with ZeroMQ

Having seen some examples, you must be eager to start using ZeroMQ in some apps. Before you start that, take a deep breath, chillax, and reflect on some basic advice that will save you much stress and confusion.

  • Learn ZeroMQ step-by-step. It's just one simple API, but it hides a world of possibilities. Take the possibilities slowly and master each one.
  • Write nice code. Ugly code hides problems and makes it hard for others to help you. You might get used to meaningless variable names, but people reading your code won't. Use names that are real words, that say something other than "I'm too careless to tell you what this variable is really for". Use consistent indentation and clean layout. Write nice code and your world will be more comfortable.
  • Test what you make as you make it. When your program doesn't work, you should know what five lines are to blame. This is especially true when you do ZeroMQ magic, which just won't work the first few times you try it.
  • When you find that things don't work as expected, break your code into pieces, test each one, see which one is not working. ZeroMQ lets you make essentially modular code; use that to your advantage.
  • Make abstractions (classes, methods, whatever) as you need them. If you copy/paste a lot of code, you're going to copy/paste errors, too.

Getting the Context Right

ZeroMQ applications always start by creating a context, and then using that for creating sockets. In C, it's the zmq_ctx_new() call. You should create and use exactly one context in your process. Technically, the context is the container for all sockets in a single process, and acts as the transport for inproc sockets, which are the fastest way to connect threads in one process. If at runtime a process has two contexts, these are like separate ZeroMQ instances. If that's explicitly what you want, OK, but otherwise remember:

Call zmq_ctx_new() once at the start of a process, and zmq_ctx_destroy() once at the end.

If you're using the fork() system call, do zmq_ctx_new() after the fork and at the beginning of the child process code. In general, you want to do interesting (ZeroMQ) stuff in the children, and boring process management in the parent.

Making a Clean Exit

Classy(优等的) programmers share the same motto as classy hit men: always clean-up when you finish the job. When you use ZeroMQ in a language like Python, stuff gets automatically(自动地) freed for you. But when using C, you have to carefully free objects when you're finished with them or else you get memory leaks, unstable(不稳定的) applications, and generally bad karma(因果报应).

Memory leaks are one thing, but ZeroMQ is quite finicky about how you exit an application. The reasons are technical and painful, but the upshot is that if you leave any sockets open, the zmq_ctx_destroy() function will hang forever. And even if you close all sockets, zmq_ctx_destroy() will by default wait forever if there are pending connects or sends unless you set the LINGER to zero on those sockets before closing them.

The ZeroMQ objects we need to worry about are messages, sockets, and contexts. Luckily it's quite simple, at least in simple programs:

  • If you are opening and closing a lot of sockets, that's probably a sign that you need to redesign your application. In some cases socket handles won't be freed until you destroy the context.
  • When you exit the program, close your sockets and then call zmq_ctx_destroy(). This destroys the context.

This is at least the case for C development. In a language with automatic object destruction, sockets and contexts will be destroyed as you leave the scope. If you use exceptions you'll have to do the clean-up in something like a "finally" block, the same as for any resource.

If you're doing multithreaded work, it gets rather more complex than this. We'll get to multithreading in the next chapter, but because some of you will, despite warnings, try to run before you can safely walk, below is the quick and dirty guide to making a clean exit in a multithreaded ZeroMQ application.

First, do not try to use the same socket from multiple threads. Please don't explain why you think this would be excellent fun, just please don't do it. Next, you need to shut down each socket that has ongoing requests. The proper way is to set a low LINGER value (1 second), and then close the socket. If your language binding doesn't do this for you automatically when you destroy a context, I'd suggest sending a patch.

Finally, destroy the context. This will cause any blocking receives or polls or sends in attached threads (i.e., which share the same context) to return with an error. Catch that error, and then set linger on, and close sockets in that thread, and exit. Do not destroy the same context twice. The zmq_ctx_destroy() call in the main thread will block until all sockets it knows about are safely closed.

Voila! It's complex and painful enough that any language binding author worth his or her salt will do this automatically and make the socket closing dance unnecessary.

Why We Needed ZeroMQ

Now that you've seen ZeroMQ in action, let's go back to the "why".

Many applications these days consist of components that stretch across some kind of network, either a LAN or the Internet. So many application developers end up doing some kind of messaging. Some developers use message queuing products, but most of the time they do it themselves, using TCP or UDP. These protocols are not hard to use, but there is a great difference between sending a few bytes from A to B, and doing messaging in any kind of reliable way.

Let's look at the typical problems we face when we start to connect pieces using raw TCP. Any reusable messaging layer would need to solve all or most of these:

  • How do we handle I/O? Does our application block, or do we handle I/O in the background? This is a key design decision. Blocking I/O creates architectures that do not scale well. But background I/O can be very hard to do right.
  • How do we handle dynamic components, i.e., pieces that go away temporarily? Do we formally split components into "clients" and "servers" and mandate that servers cannot disappear? What then if we want to connect servers to servers? Do we try to reconnect every few seconds?
  • How do we represent a message on the wire? How do we frame data so it's easy to write and read, safe from buffer overflows, efficient for small messages, yet adequate for the very largest videos of dancing cats wearing party hats?
  • How do we handle messages that we can't deliver immediately? Particularly, if we're waiting for a component to come back online? Do we discard messages, put them into a database, or into a memory queue?
  • Where do we store message queues? What happens if the component reading from a queue is very slow and causes our queues to build up? What's our strategy then?
  • How do we handle lost messages? Do we wait for fresh data, request a resend, or do we build some kind of reliability layer that ensures messages cannot be lost? What if that layer itself crashes?
  • What if we need to use a different network transport, say, multicast instead of TCP unicast? Or IPv6? Do we need to rewrite the applications, or is the transport abstracted in some layer?
  • How do we route messages? Can we send the same message to multiple peers? Can we send replies back to an original requester?
  • How do we write an API for another language? Do we re-implement a wire-level protocol or do we repackage a library? If the former, how can we guarantee efficient and stable stacks? If the latter, how can we guarantee interoperability?
  • How do we represent data so that it can be read between different architectures? Do we enforce a particular encoding for data types? How far is this the job of the messaging system rather than a higher layer?
  • How do we handle network errors? Do we wait and retry, ignore them silently, or abort?

Take a typical open source project like Hadoop Zookeeper and read the C API code in src/c/src/zookeeper.c. When I read this code, in January 2013, it was 4,200 lines of mystery and in there is an undocumented, client/server network communication protocol. I see it's efficient because it uses poll instead of select. But really, Zookeeper should be using a generic messaging layer and an explicitly documented wire level protocol. It is incredibly wasteful for teams to be building this particular wheel over and over.

But how to make a reusable messaging layer? Why, when so many projects need this technology, are people still doing it the hard way by driving TCP sockets in their code, and solving the problems in that long list over and over?

It turns out that building reusable messaging systems is really difficult, which is why few FOSS projects ever tried, and why commercial messaging products are complex, expensive, inflexible, and brittle. In 2006, iMatix designed AMQP which started to give FOSS developers perhaps the first reusable recipe for a messaging system. AMQP works better than many other designs, but remains relatively complex, expensive, and brittle. It takes weeks to learn to use, and months to create stable architectures that don't crash when things get hairy.

Figure 7 - Messaging as it Starts

fig7.png

Most messaging projects, like AMQP, that try to solve this long list of problems in a reusable way do so by inventing a new concept, the "broker", that does addressing, routing, and queuing. This results in a client/server protocol or a set of APIs on top of some undocumented protocol that allows applications to speak to this broker. Brokers are an excellent thing in reducing the complexity of large networks. But adding broker-based messaging to a product like Zookeeper would make it worse, not better. It would mean adding an additional big box, and a new single point of failure. A broker rapidly becomes a bottleneck and a new risk to manage. If the software supports it, we can add a second, third, and fourth broker and make some failover scheme. People do this. It creates more moving pieces, more complexity, and more things to break.

And a broker-centric setup needs its own operations team. You literally need to watch the brokers day and night, and beat them with a stick when they start misbehaving. You need boxes, and you need backup boxes, and you need people to manage those boxes. It is only worth doing for large applications with many moving pieces, built by several teams of people over several years.

Figure 8 - Messaging as it Becomes

fig8.png

So small to medium application developers are trapped. Either they avoid network programming and make monolithic applications that do not scale. Or they jump into network programming and make brittle, complex applications that are hard to maintain. Or they bet on a messaging product, and end up with scalable applications that depend on expensive, easily broken technology. There has been no really good choice, which is maybe why messaging is largely stuck in the last century and stirs strong emotions: negative ones for users, gleeful joy for those selling support and licenses.

What we need is something that does the job of messaging, but does it in such a simple and cheap way that it can work in any application, with close to zero cost. It should be a library which you just link, without any other dependencies. No additional moving pieces, so no additional risk. It should run on any OS and work with any programming language.

And this is ZeroMQ: an efficient, embeddable library that solves most of the problems an application needs to become nicely elastic across a network, without much cost.

Specifically:

  • It handles I/O asynchronously, in background threads. These communicate with application threads using lock-free data structures, so concurrent ZeroMQ applications need no locks, semaphores, or other wait states.
  • Components can come and go dynamically and ZeroMQ will automatically reconnect. This means you can start components in any order. You can create "service-oriented architectures" (SOAs) where services can join and leave the network at any time.
  • It queues messages automatically when needed. It does this intelligently, pushing messages as close as possible to the receiver before queuing them.
  • It has ways of dealing with over-full queues (called "high water mark"). When a queue is full, ZeroMQ automatically blocks senders, or throws away messages, depending on the kind of messaging you are doing (the so-called "pattern").
  • It lets your applications talk to each other over arbitrary transports: TCP, multicast, in-process, inter-process. You don't need to change your code to use a different transport.
  • It handles slow/blocked readers safely, using different strategies that depend on the messaging pattern.
  • It lets you route messages using a variety of patterns such as request-reply and pub-sub. These patterns are how you create the topology, the structure of your network.
  • It lets you create proxies to queue, forward, or capture messages with a single call. Proxies can reduce the interconnection complexity of a network.
  • It delivers whole messages exactly as they were sent, using a simple framing on the wire. If you write a 10k message, you will receive a 10k message.
  • It does not impose any format on messages. They are blobs from zero to gigabytes large. When you want to represent data you choose some other product on top, such as msgpack, Google's protocol buffers, and others.
  • It handles network errors intelligently, by retrying automatically in cases where it makes sense.
  • It reduces your carbon footprint. Doing more with less CPU means your boxes use less power, and you can keep your old boxes in use for longer. Al Gore would love ZeroMQ.

Actually ZeroMQ does rather more than this. It has a subversive effect on how you develop network-capable applications. Superficially, it's a socket-inspired API on which you do zmq_recv() and zmq_send(). But message processing rapidly becomes the central loop, and your application soon breaks down into a set of message processing tasks. It is elegant and natural. And it scales: each of these tasks maps to a node, and the nodes talk to each other across arbitrary transports. Two nodes in one process (node is a thread), two nodes on one box (node is a process), or two nodes on one network (node is a box): it's all the same, with no application code changes.

Socket Scalability

Let's see ZeroMQ's scalability in action. Here is a shell script that starts the weather server and then a bunch of clients in parallel:

wuserver &
wuclient 12345 &
wuclient 23456 &
wuclient 34567 &
wuclient 45678 &
wuclient 56789 &

As the clients run, we take a look at the active processes using the top command, and we see something like (on a 4-core box):

PID  USER  PR  NI  VIRT  RES  SHR S %CPU %MEM   TIME+  COMMAND
7136  ph   20   0 1040m 959m 1156 R  157 12.0 16:25.47 wuserver
7966  ph   20   0 98608 1804 1372 S   33  0.0  0:03.94 wuclient
7963  ph   20   0 33116 1748 1372 S   14  0.0  0:00.76 wuclient
7965  ph   20   0 33116 1784 1372 S    6  0.0  0:00.47 wuclient
7964  ph   20   0 33116 1788 1372 S    5  0.0  0:00.25 wuclient
7967  ph   20   0 33072 1740 1372 S    5  0.0  0:00.35 wuclient

Let's think for a second about what is happening here. The weather server has a single socket, and yet here we have it sending data to five clients in parallel. We could have thousands of concurrent clients. The server application doesn't see them, doesn't talk to them directly. So the ZeroMQ socket is acting like a little server, silently accepting client requests and shoving data out to them as fast as the network can handle it. And it's a multithreaded server, squeezing more juice out of your CPU.

Upgrading from ZeroMQ v2.2 to ZeroMQ v3.2

Compatible Changes

These changes don't impact existing application code directly:

  • Pub-sub filtering is now done at the publisher side instead of subscriber side. This improves performance significantly in many pub-sub use cases. You can mix v3.2 and v2.1/v2.2 publishers and subscribers safely.

Incompatible Changes

These are the main areas of impact on applications and language bindings:

  • Changed send/recv methods: zmq_send() and zmq_recv() have a different, simpler interface, and the old functionality is now provided by zmq_msg_send() and zmq_msg_recv(). Symptom: compile errors. Solution: fix up your code.
  • These two methods return positive values on success, and -1 on error. In v2.x they always returned zero on success. Symptom: apparent errors when things actually work fine. Solution: test strictly for return code == -1, not non-zero.
  • zmq_poll() now waits for milliseconds, not microseconds. Symptom: application stops responding (in fact responds 1000 times slower). Solution: use the ZMQ_POLL_MSEC macro defined below, in all zmq_poll() calls.
  • ZMQ_NOBLOCK is now called ZMQ_DONTWAIT. Symptom: compile failures on the ZMQ_NOBLOCK macro.
  • The ZMQ_HWM socket option is now broken into ZMQ_SNDHWM and ZMQ_RCVHWM. Symptom: compile failures on the ZMQ_HWM macro.
  • Most but not all zmq_getsockopt() options are now integer values. Symptom: runtime error returns on zmq_setsockopt() and zmq_getsockopt().
  • The ZMQ_SWAP option has been removed. Symptom: compile failures on ZMQ_SWAP. Solution: redesign any code that uses this functionality.

Suggested Shim Macros

For applications that want to run on both v2.x and v3.2, such as language bindings, our advice is to emulate v3.2 as far as possible. Here are C macro definitions that help your C/C++ code to work across both versions (taken from CZMQ):

#ifndef ZMQ_DONTWAIT
# define ZMQ_DONTWAIT ZMQ_NOBLOCK
#endif
#if ZMQ_VERSION_MAJOR == 2
# define zmq_msg_send(msg,sock,opt) zmq_send (sock, msg, opt)
# define zmq_msg_recv(msg,sock,opt) zmq_recv (sock, msg, opt)
# define zmq_ctx_destroy(context) zmq_term(context)
# define ZMQ_POLL_MSEC 1000         //  zmq_poll is usec
# define ZMQ_SNDHWM ZMQ_HWM
# define ZMQ_RCVHWM ZMQ_HWM
#elif ZMQ_VERSION_MAJOR == 3
# define ZMQ_POLL_MSEC 1            //  zmq_poll is msec
#endif

Warning: Unstable Paradigms!

Traditional network programming is built on the general assumption that one socket talks to one connection, one peer. There are multicast protocols, but these are exotic. When we assume "one socket = one connection", we scale our architectures in certain ways. We create threads of logic where each thread works with one socket, one peer. We place intelligence and state in these threads.

In the ZeroMQ universe, sockets are doorways to fast little background communications engines that manage a whole set of connections automagically for you. You can't see, work with, open, close, or attach state to these connections. Whether you use blocking send or receive, or poll, all you can talk to is the socket, not the connections it manages for you. The connections are private and invisible, and this is the key to ZeroMQ's scalability.

This is because your code, talking to a socket, can then handle any number of connections across whatever network protocols are around, without change. A messaging pattern sitting in ZeroMQ scales more cheaply than a messaging pattern sitting in your application code.

So the general assumption no longer applies. As you read the code examples, your brain will try to map them to what you know. You will read "socket" and think "ah, that represents a connection to another node". That is wrong. You will read "thread" and your brain will again think, "ah, a thread represents a connection to another node", and again your brain will be wrong.

If you're reading this Guide for the first time, realize that until you actually write ZeroMQ code for a day or two (and maybe three or four days), you may feel confused, especially by how simple ZeroMQ makes things for you, and you may try to impose that general assumption on ZeroMQ, and it won't work. And then you will experience your moment of enlightenment and trust, that zap-pow-kaboom satori paradigm-shift moment when it all becomes clear.


Chapter 2 - Sockets and Patterns

In Chapter 1 - Basics we took ZeroMQ for a drive, with some basic examples of the main ZeroMQ patterns: request-reply, pub-sub, and pipeline. In this chapter, we're going to get our hands dirty and start to learn how to use these tools in real programs.

We'll cover:

  • How to create and work with ZeroMQ sockets.
  • How to send and receive messages on sockets.
  • How to build your apps around ZeroMQ's asynchronous I/O model.
  • How to handle multiple sockets in one thread.
  • How to handle fatal and nonfatal errors properly.
  • How to handle interrupt signals like Ctrl-C.
  • How to shut down a ZeroMQ application cleanly.
  • How to check a ZeroMQ application for memory leaks.
  • How to send and receive multipart messages.
  • How to forward messages across networks.
  • How to build a simple message queuing broker.
  • How to write multithreaded applications with ZeroMQ.
  • How to use ZeroMQ to signal between threads.
  • How to use ZeroMQ to coordinate a network of nodes.
  • How to create and use message envelopes for pub-sub.
  • How to use the HWM (high-water mark) to protect against memory overflows.

The Socket API

To be perfectly honest, ZeroMQ does a kind of switch-and-bait on you, for which we don't apologize. It's for your own good and it hurts us more than it hurts you. ZeroMQ presents a familiar socket-based API, which requires great effort for us to hide a bunch of message-processing engines. However, the result will slowly fix your world view about how to design and write distributed software.

Sockets are the de facto standard API for network programming, as well as being useful for stopping your eyes from falling onto your cheeks. One thing that makes ZeroMQ especially tasty to developers is that it uses sockets and messages instead of some other arbitrary set of concepts. Kudos to Martin Sustrik for pulling this off. It turns "Message Oriented Middleware", a phrase guaranteed to send the whole room off to Catatonia, into "Extra Spicy Sockets!", which leaves us with a strange craving for pizza and a desire to know more.

Like a favorite dish, ZeroMQ sockets are easy to digest. Sockets have a life in four parts, just like BSD sockets:

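  • Creating and destroying sockets, which go together to form a karmic circle of socket life (see zmq_socket(), zmq_close()).
  • Configuring sockets by setting options on them and checking them if necessary (see zmq_setsockopt(), zmq_getsockopt()).
  • Plugging sockets into the network topology by creating ZeroMQ connections to and from them (see zmq_bind(), zmq_connect()).
  • Using the sockets to carry data by writing and receiving messages on them (see zmq_msg_send(), zmq_msg_recv()).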

Note that sockets are always void pointers, and messages (which we'll come to very soon) are structures. So in C you pass sockets as-such, but you pass addresses of messages in all functions that work with messages, like zmq_msg_send() and zmq_msg_recv(). As a mnemonic, realize that "in ZeroMQ, all your sockets are belong to us", but messages are things you actually own in your code.

Creating, destroying, and configuring sockets works as you'd expect for any object. But remember that ZeroMQ is an asynchronous, elastic fabric. This has some impact on how we plug sockets into the network topology and how we use the sockets after that.

Plugging Sockets into the Topology

To create a connection between two nodes, you use zmq_bind() in one node and zmq_connect() in the other. As a general rule of thumb, the node that does zmq_bind() is a "server", sitting on a well-known network address, and the node which does zmq_connect() is a "client", with unknown or arbitrary network addresses. Thus we say that we "bind a socket to an endpoint" and "connect a socket to an endpoint", the endpoint being that well-known network address.

ZeroMQ connections are somewhat different from classic TCP connections. The main notable differences are:

  • One socket may have many outgoing and many incoming connections.
  • There is no zmq_accept() method. When a socket is bound to an endpoint it automatically starts accepting connections.
  • The network connection itself happens in the background, and ZeroMQ will automatically reconnect if the network connection is broken (e.g., if the peer disappears and then comes back).
  • Your application code cannot work with these connections directly; they are encapsulated under the socket.

Many architectures follow some kind of client/server model, where the server is the component that is most static, and the clients are the components that are most dynamic, i.e., they come and go the most. There are sometimes issues of addressing: servers will be visible to clients, but not necessarily vice versa. So mostly it's obvious which node should be doing zmq_bind() (the server) and which should be doing zmq_connect() (the client). It also depends on the kind of sockets you're using, with some exceptions for unusual network architectures. We'll look at socket types later.

Now, imagine we start the client before we start the server. In traditional networking, we get a big red Fail flag. But ZeroMQ lets us start and stop pieces arbitrarily. As soon as the client node does zmq_connect(), the connection exists and that node can start to write messages to the socket. At some stage (hopefully before messages queue up so much that they start to get discarded, or the client blocks), the server comes alive, does a zmq_bind(), and ZeroMQ starts to deliver messages.

A server node can bind to many endpoints (that is, a combination of protocol and address) and it can do this using a single socket. This means it will accept connections across different transports:

zmq_bind (socket, "tcp://*:5555");
zmq_bind (socket, "tcp://*:9999");
zmq_bind (socket, "inproc://somename");

With most transports, you cannot bind to the same endpoint twice, unlike for example in UDP. The ipc transport does, however, let one process bind to an endpoint already used by a first process. It's meant to allow a process to recover after a crash.

Although ZeroMQ tries to be neutral about which side binds and which side connects, there are differences. We'll see these in more detail later. The upshot is that you should usually think in terms of "servers" as static parts of your topology that bind to more or less fixed endpoints, and "clients" as dynamic parts that come and go and connect to these endpoints. Then, design your application around this model. The chances that it will "just work" are much better like that.

Sockets have types. The socket type defines the semantics of the socket, its policies for routing messages inwards and outwards, queuing, etc. You can connect certain types of socket together, e.g., a publisher socket and a subscriber socket. Sockets work together in "messaging patterns". We'll look at this in more detail later.

It's the ability to connect sockets in these different ways that gives ZeroMQ its basic power as a message queuing system. There are layers on top of this, such as proxies, which we'll get to later. But essentially, with ZeroMQ you define your network architecture by plugging pieces together like a child's construction toy.

Sending and Receiving Messages

To send and receive messages you use the zmq_msg_send() and zmq_msg_recv() methods. The names are conventional, but ZeroMQ's I/O model is different enough from the classic TCP model that you will need time to get your head around it.

Figure 9 - TCP sockets are 1 to 1

fig9.png

Let's look at the main differences between TCP sockets and ZeroMQ sockets when it comes to working with data:

  • ZeroMQ sockets carry messages, like UDP, rather than a stream of bytes as TCP does. A ZeroMQ message is length-specified binary data. We'll come to messages shortly; their design is optimized for performance and so a little tricky.
  • ZeroMQ sockets do their I/O in a background thread. This means that messages arrive in local input queues and are sent from local output queues, no matter what your application is busy doing.
  • ZeroMQ sockets have one-to-N routing behavior built-in, according to the socket type.

The zmq_send() method does not actually send the message to the socket connection(s). It queues the message so that the I/O thread can send it asynchronously. It does not block except in some exceptional cases. So the message is not necessarily sent when zmq_send() returns to your application.

Unicast Transports

ZeroMQ provides a set of unicast transports (inproc, ipc, and tcp) and multicast transports (epgm, pgm). Multicast is an advanced technique that we'll come to later. Don't even start using it unless you know that your fan-out ratios will make 1-to-N unicast impossible.

For most common cases, use tcp, which is a disconnected TCP transport. It is elastic, portable, and fast enough for most cases. We call this disconnected because ZeroMQ's tcp transport doesn't require that the endpoint exists before you connect to it. Clients and servers can connect and bind at any time, can go and come back, and it remains transparent to applications.

The inter-process ipc transport is disconnected, like tcp. It has one limitation: it does not yet work on Windows. By convention we use endpoint names with an ".ipc" extension to avoid potential conflict with other file names. On UNIX systems, if you use ipc endpoints you need to create these with appropriate permissions, otherwise they may not be shareable between processes running under different user IDs. You must also make sure all processes can access the files, e.g., by running in the same working directory.

The inter-thread transport, inproc, is a connected signaling transport. It is much faster than tcp or ipc. This transport has a specific limitation compared to tcp and ipc: the server must issue a bind before any client issues a connect. This is something future versions of ZeroMQ may fix, but at present this defines how you use inproc sockets. We create and bind one socket and start the child threads, which create and connect the other sockets.

ZeroMQ is Not a Neutral Carrier

A common question that newcomers to ZeroMQ ask (it's one I've asked myself) is, "how do I write an XYZ server in ZeroMQ?" For example, "how do I write an HTTP server in ZeroMQ?" The implication is that if we use normal sockets to carry HTTP requests and responses, we should be able to use ZeroMQ sockets to do the same, only much faster and better.

The answer used to be "this is not how it works". ZeroMQ is not a neutral carrier: it imposes a framing on the transport protocols it uses. This framing is not compatible with existing protocols, which tend to use their own framing. For example, compare an HTTP request and a ZeroMQ request, both over TCP/IP.

Figure 10 - HTTP on the Wire

fig10.png

The HTTP request uses CR-LF as its simplest framing delimiter, whereas ZeroMQ uses a length-specified frame. So you could write an HTTP-like protocol using ZeroMQ, using for example the request-reply socket pattern. But it would not be HTTP.

Figure 11 - ZeroMQ on the Wire

fig11.png
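
To make the contrast between the two framing styles concrete, here is a small sketch in plain C with no ZeroMQ calls. The length-prefixed format below (one length byte followed by the body) is a simplified illustration chosen for this example, not the actual ZeroMQ wire format, and all three function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

//  Delimiter framing, as in an HTTP request line: scan for CR-LF.
//  Returns the length of the first line, or -1 if no delimiter is present.
long delimited_length (const char *data, size_t size)
{
    size_t index;
    for (index = 0; index + 1 < size; index++)
        if (data [index] == '\r' && data [index + 1] == '\n')
            return (long) index;
    return -1;
}

//  Length-specified framing: one length byte, then the body. This is a
//  simplified illustration for bodies up to 255 bytes, not the real
//  ZeroMQ wire format. Returns bytes written, or 0 if it does not fit.
size_t frame_encode (const void *body, size_t size,
                     unsigned char *out, size_t out_max)
{
    if (size > 255 || out_max < size + 1)
        return 0;
    out [0] = (unsigned char) size;
    memcpy (out + 1, body, size);
    return size + 1;
}

//  Reads one frame; the body may contain any bytes, including CR-LF.
//  Returns the body length, or -1 if the buffer holds an incomplete frame.
long frame_decode (const unsigned char *in, size_t in_size,
                   const unsigned char **body)
{
    if (in_size < 1 || in_size < (size_t) in [0] + 1)
        return -1;
    *body = in + 1;
    return in [0];
}
```

A reader of delimited data must scan every byte for the delimiter, while a reader of length-specified frames knows up front how much to read and never misinterprets body content. That is why the two styles cannot simply be mixed on one connection.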

Since v3.3, however, ZeroMQ has a socket option called ZMQ_ROUTER_RAW that lets you read and write data without the ZeroMQ framing. You could use this to read and write proper HTTP requests and responses. Hardeep Singh contributed this change so that he could connect to Telnet servers from his ZeroMQ application. At time of writing this is still somewhat experimental, but it shows how ZeroMQ keeps evolving to solve new problems. Maybe the next patch will be yours.

I/O Threads

We said that ZeroMQ does I/O in a background thread. One I/O thread (for all sockets) is sufficient for all but the most extreme applications. When you create a new context, it starts with one I/O thread. The general rule of thumb is to allow one I/O thread per gigabyte of data in or out per second. To raise the number of I/O threads, use the zmq_ctx_set() call before creating any sockets:

int io_threads = 4;
void *context = zmq_ctx_new ();
zmq_ctx_set (context, ZMQ_IO_THREADS, io_threads);
assert (zmq_ctx_get (context, ZMQ_IO_THREADS) == io_threads);

We've seen that one socket can handle dozens, even thousands of connections at once. This has a fundamental impact on how you write applications. A traditional networked application has one process or one thread per remote connection, and that process or thread handles one socket. ZeroMQ lets you collapse this entire structure into a single process and then break it up as necessary for scaling.

If you are using ZeroMQ for inter-thread communications only (i.e., a multithreaded application that does no external socket I/O) you can set the I/O threads to zero. It's not a significant optimization though, more of a curiosity.

Messaging Patterns

Underneath the brown paper wrapping of ZeroMQ's socket API lies the world of messaging patterns. If you have a background in enterprise messaging, or know UDP well, these will be vaguely familiar. But to most ZeroMQ newcomers, they are a surprise. We're so used to the TCP paradigm where a socket maps one-to-one to another node.

Let's recap(翻新的轮胎) briefly what ZeroMQ does for you. It delivers blobs of data (messages) to nodes, quickly and efficiently(有效地). You can map nodes to threads, processes, or nodes. ZeroMQ gives your applications a single socket API to work with, no matter what the actual transport (like in-process, inter-process, TCP, or multicast(多路广播)). It automatically(自动的) reconnects(使再接合) to peers(撒尿) as they come and go. It queues messages at both sender and receiver, as needed. It limits these queues to guard processes against running out of memory. It handles socket errors. It does all I/O in background threads. It uses lock-free techniques for talking between nodes, so there are never locks, waits, semaphores(信号), or deadlocks(僵局).

But cutting through that, it routes(路由) and queues messages according to precise(精确的) recipes(食谱) called patterns. It is these patterns that provide ZeroMQ's intelligence(智力). They encapsulate(压缩) our hard-earned experience of the best ways to distribute(分配) data and work. ZeroMQ's patterns are hard-coded but future versions may allow user-definable patterns.

ZeroMQ patterns are implemented(实施) by pairs of sockets(插座) with matching types. In other words, to understand ZeroMQ patterns you need to understand socket types and how they work together. Mostly, this just takes study; there is little that is obvious at this level.

The built-in core ZeroMQ patterns are:

  • Request-reply, which connects a set of clients to a set of services. This is a remote procedure call and task distribution pattern.
  • Pub-sub, which connects a set of publishers to a set of subscribers. This is a data distribution pattern.
  • Pipeline, which connects nodes in a fan-out/fan-in pattern that can have multiple steps and loops. This is a parallel task distribution and collection pattern.
  • Exclusive pair, which connects two sockets exclusively. This is a pattern for connecting two threads in a process, not to be confused with "normal" pairs of sockets.

We looked at the first three of these in Chapter 1 - Basics, and we'll see the exclusive pair pattern later in this chapter. The zmq_socket() man page is fairly clear about the patterns; it's worth reading several times until it starts to make sense. These are the socket combinations that are valid for a connect-bind pair (either side can bind):

  • PUB and SUB
  • REQ and REP
  • REQ and ROUTER (take care, REQ inserts an extra null frame)
  • DEALER and REP (take care, REP assumes a null frame)
  • DEALER and ROUTER
  • DEALER and DEALER
  • ROUTER and ROUTER
  • PUSH and PULL
  • PAIR and PAIR

You'll also see references to XPUB and XSUB sockets, which we'll come to later (they're like raw versions of PUB and SUB). Any other combination will produce undocumented and unreliable results, and future versions of ZeroMQ will probably return errors if you try them. You can and will, of course, bridge other socket types via code, i.e., read from one socket type and write to another.
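If you want to sanity-check a design against the list above, the table is easy to encode. This is only a sketch: the enum below is our own stand-in for the ZMQ_* socket type constants in zmq.h, is_valid_pair() is a hypothetical helper, and libzmq itself does no such check for you:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

//  Illustrative socket types; real code would use ZMQ_PUB and friends
typedef enum {
    PUB, SUB, REQ, REP, DEALER, ROUTER, PUSH, PULL, PAIR
} socket_type_t;

//  The valid connect-bind combinations, exactly as listed in the text
static const socket_type_t valid_pairs [][2] = {
    { PUB, SUB }, { REQ, REP }, { REQ, ROUTER }, { DEALER, REP },
    { DEALER, ROUTER }, { DEALER, DEALER }, { ROUTER, ROUTER },
    { PUSH, PULL }, { PAIR, PAIR }
};

//  Either side can bind, so the order of the two types does not matter
static bool is_valid_pair (socket_type_t a, socket_type_t b)
{
    size_t count = sizeof (valid_pairs) / sizeof (valid_pairs [0]);
    for (size_t i = 0; i < count; i++) {
        if ((valid_pairs [i][0] == a && valid_pairs [i][1] == b)
        ||  (valid_pairs [i][0] == b && valid_pairs [i][1] == a))
            return true;
    }
    return false;
}
```

For instance, is_valid_pair (PULL, PUB) is false: those two sockets belong to different patterns.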

High-Level Messaging Patterns


These four core patterns are cooked into ZeroMQ. They are part of the ZeroMQ API, implemented in the core C++ library, and are guaranteed to be available in all fine retail stores.

On top of those, we add high-level messaging patterns. We build these high-level patterns on top of ZeroMQ and implement them in whatever language we're using for our application. They are not part of the core library, do not come with the ZeroMQ package, and exist in their own space as part of the ZeroMQ community. For example the Majordomo pattern, which we explore in Chapter 4 - Reliable Request-Reply Patterns, sits in the GitHub Majordomo project in the ZeroMQ organization.

One of the things we aim to provide you with in this book is a set of such high-level patterns, both small (how to handle messages sanely) and large (how to make a reliable pub-sub architecture).

Working with Messages


The libzmq core library has in fact two APIs to send and receive messages. The zmq_send() and zmq_recv() methods that we've already seen and used are simple one-liners. We will use these often, but zmq_recv() is bad at dealing with arbitrary message sizes: it truncates messages to whatever buffer size you provide. So there's a second API that works with zmq_msg_t structures. It is richer but more difficult to use, and consists of the zmq_msg_*() family of functions such as zmq_msg_init(), zmq_msg_send(), zmq_msg_recv(), and zmq_msg_close().

On the wire, ZeroMQ messages are blobs of any size from zero upwards that fit in memory. You do your own serialization using protocol buffers, msgpack, JSON, or whatever else your applications need to speak. It's wise to choose a data representation that is portable, but you can make your own decisions about trade-offs.

In memory, ZeroMQ messages are zmq_msg_t structures (or classes depending on your language). Here are the basic ground rules for using ZeroMQ messages in C:

  • You create and pass around zmq_msg_t objects, not blocks of data.
  • To write a message from new data, you use zmq_msg_init_size() to create a message and at the same time allocate a block of data of some size. You then fill that data using memcpy, and pass the message to zmq_msg_send().
  • To release (not destroy) a message, you call zmq_msg_close(). This drops a reference, and eventually ZeroMQ will destroy the message.
  • After you pass a message to zmq_msg_send(), ZeroMQ will clear the message, i.e., set the size to zero. You cannot send the same message twice, and you cannot access the message data after sending it.
  • These rules don't apply if you use zmq_send() and zmq_recv(), to which you pass byte arrays, not message structures.

If you want to send the same message more than once, and it's sizable, create a second message, initialize it using zmq_msg_init(), and then use zmq_msg_copy() to create a copy of the first message. This does not copy the data but copies a reference. You can then send the message twice (or more, if you create more copies) and the message will only be finally destroyed when the last copy is sent or closed.

ZeroMQ also supports multipart messages, which let you send or receive a list of frames as a single on-the-wire message. This is widely used in real applications and we'll look at that later in this chapter and in Chapter 3 - Advanced Request-Reply Patterns.

Frames (also called "message parts" in the ZeroMQ reference manual pages) are the basic wire format for ZeroMQ messages. A frame is a length-specified block of data. The length can be zero upwards. If you've done any TCP programming you'll appreciate why frames are a useful answer to the question "how much data am I supposed to read off this network socket now?"

There is a wire-level protocol called ZMTP that defines how ZeroMQ reads and writes frames on a TCP connection. If you're interested in how this works, the spec is quite short.

Originally, a ZeroMQ message was one frame, like UDP. We later extended this with multipart messages, which are quite simply series of frames with a "more" bit set to one, followed by one with that bit set to zero. The ZeroMQ API then lets you write messages with a "more" flag and when you read messages, it lets you check if there's "more".

In the low-level ZeroMQ API and the reference manual, therefore, there's some fuzziness about messages versus frames. So here's a useful lexicon:

  • A message can be one or more parts.
  • These parts are also called "frames".
  • Each part is a zmq_msg_t object.
  • You send and receive each part separately, in the low-level API.
  • Higher-level APIs provide wrappers to send entire multipart messages.
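To see how a "more" bit ties frames into a message, here is a toy model in plain C. The layout (one flags byte, one length byte, then the data) is a simplification we made up for illustration and is not the real ZMTP encoding; write_frame() and count_frames() are hypothetical names:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

//  Toy framing, NOT real ZMTP: flags byte (bit 0 = "more"),
//  length byte, then that many bytes of data.
enum { FLAG_MORE = 1 };

//  Append one frame to out; returns bytes written, or 0 if it won't fit
static size_t write_frame (unsigned char *out, size_t cap,
                           const void *data, size_t size, int more)
{
    if (size > 255 || cap < size + 2)
        return 0;
    out [0] = more ? FLAG_MORE : 0;
    out [1] = (unsigned char) size;
    memcpy (out + 2, data, size);
    return size + 2;
}

//  Count the frames in the first message in buf: keep reading frames
//  while the "more" bit is one, stop at the frame where it is zero.
//  Returns 0 if the data ends before the message is complete.
static size_t count_frames (const unsigned char *buf, size_t len)
{
    size_t frames = 0, pos = 0;
    while (pos + 2 <= len) {
        size_t size = buf [pos + 1];
        if (pos + 2 + size > len)
            return 0;               //  Truncated frame
        frames++;
        if (!(buf [pos] & FLAG_MORE))
            return frames;          //  Last frame of the message
        pos += 2 + size;
    }
    return 0;                       //  Never saw the final frame
}
```

A three-part message such as an address, an empty delimiter, and a body is written as two frames with the bit set and a last one with it clear. A reader that sees only part of it reports nothing, which mirrors the all-or-nothing delivery described in this section.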

Some other things that are worth knowing about messages:

  • You may send zero-length messages, e.g., for sending a signal from one thread to another.
  • ZeroMQ guarantees to deliver all the parts (one or more) for a message, or none of them.
  • ZeroMQ does not send the message (single or multipart) right away, but at some indeterminate later time. A multipart message must therefore fit in memory.
  • A message (single or multipart) must fit in memory. If you want to send files of arbitrary sizes, you should break them into pieces and send each piece as separate single-part messages. Using multipart data will not reduce memory consumption.
  • You must call zmq_msg_close() when finished with a received message, in languages that don't automatically destroy objects when a scope closes. You don't call this method after sending a message.
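Breaking a large block into pieces is simple enough to sketch. The code below walks a buffer in fixed-size chunks and hands each one to a handler. All names are ours, and the sample handler only counts; in a real sender the handler would pass each chunk to zmq_send() as its own single-part message:

```c
#include <assert.h>
#include <stddef.h>

//  Called once per chunk; returns 0 to continue, nonzero to stop
typedef int (chunk_fn) (const unsigned char *chunk, size_t size, void *arg);

//  Hypothetical helper: split data into pieces of at most chunk_size
//  bytes and pass each piece to the handler in order.
static int for_each_chunk (const unsigned char *data, size_t total,
                           size_t chunk_size, chunk_fn *handler, void *arg)
{
    assert (chunk_size > 0);
    size_t offset = 0;
    while (offset < total) {
        size_t size = total - offset;
        if (size > chunk_size)
            size = chunk_size;      //  Every chunk but the last is full
        if (handler (data + offset, size, arg) != 0)
            return -1;
        offset += size;
    }
    return 0;
}

//  Sample handler that just tallies what it is given
typedef struct {
    size_t chunks;
    size_t bytes;
} tally_t;

static int tally_chunk (const unsigned char *chunk, size_t size, void *arg)
{
    (void) chunk;
    tally_t *tally = arg;
    tally->chunks++;
    tally->bytes += size;
    return 0;
}
```

Only one chunk needs to be in flight at a time, which is the whole point: memory use is bounded by the chunk size rather than by the file size.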

And to be repetitive, do not use zmq_msg_init_data() yet. This is a zero-copy method and is guaranteed to create trouble for you. There are far more important things to learn about ZeroMQ before you start to worry about shaving off microseconds.

This rich API can be tiresome to work with. The methods are optimized for performance, not simplicity. If you start using these you will almost definitely get them wrong until you've read the man pages with some care. So one of the main jobs of a good language binding is to wrap this API up in classes that are easier to use.

Handling Multiple Sockets


In the examples so far, the main loop has mostly been:

  1. Wait for message on socket.
  2. Process message.
  3. Repeat.

What if we want to read from multiple endpoints at the same time? The simplest way is to connect one socket to all the endpoints and get ZeroMQ to do the fan-in for us. This is legal if the remote endpoints are in the same pattern, but it would be wrong to connect a PULL socket to a PUB endpoint.

To actually read from multiple sockets all at once, use zmq_poll(). An even better way might be to wrap zmq_poll() in a framework that turns it into a nice event-driven reactor, but it's significantly more work than we want to cover here.

Let's start with a dirty hack, partly for the fun of not doing it right, but mainly because it lets me show you how to do nonblocking socket reads. Here is a simple example of reading from two sockets using nonblocking reads. This rather confused program acts both as a subscriber to weather updates, and a worker for parallel tasks:

// Reading from multiple sockets
// This version uses a simple recv loop

#include "zhelpers.h"

int main (void)
{
    // Connect to task ventilator
    void *context = zmq_ctx_new ();
    void *receiver = zmq_socket (context, ZMQ_PULL);
    zmq_connect (receiver, "tcp://localhost:5557");

    // Connect to weather server
    void *subscriber = zmq_socket (context, ZMQ_SUB);
    zmq_connect (subscriber, "tcp://localhost:5556");
    zmq_setsockopt (subscriber, ZMQ_SUBSCRIBE, "10001 ", 6);

    // Process messages from both sockets
    // We prioritize traffic from the task ventilator
    while (1) {
        char msg [256];
        while (1) {
            int size = zmq_recv (receiver, msg, 255, ZMQ_DONTWAIT);
            if (size != -1) {
                // Process task
            }
            else
                break;
        }
        while (1) {
            int size = zmq_recv (subscriber, msg, 255, ZMQ_DONTWAIT);
            if (size != -1) {
                // Process weather update
            }
            else
                break;
        }
        // No activity, so sleep for 1 msec
        s_sleep (1);
    }
    zmq_close (receiver);
    zmq_close (subscriber);
    zmq_ctx_destroy (context);
    return 0;
}



The cost of this approach is some additional latency on the first message (the sleep at the end of the loop, when there are no waiting messages to process). This would be a problem in applications where submillisecond latency was vital. Also, you need to check the documentation for nanosleep() or whatever function you use to make sure it does not busy-loop.

You can treat the sockets fairly by reading first from one, then the second rather than prioritizing them as we did in this example.

Now let's see the same senseless little application done right, using zmq_poll():

// Reading from multiple sockets
// This version uses zmq_poll()

#include "zhelpers.h"

int main (void)
{
    // Connect to task ventilator
    void *context = zmq_ctx_new ();
    void *receiver = zmq_socket (context, ZMQ_PULL);
    zmq_connect (receiver, "tcp://localhost:5557");

    // Connect to weather server
    void *subscriber = zmq_socket (context, ZMQ_SUB);
    zmq_connect (subscriber, "tcp://localhost:5556");
    zmq_setsockopt (subscriber, ZMQ_SUBSCRIBE, "10001 ", 6);

    // Process messages from both sockets
    while (1) {
        char msg [256];
        zmq_pollitem_t items [] = {
            { receiver, 0, ZMQ_POLLIN, 0 },
            { subscriber, 0, ZMQ_POLLIN, 0 }
        };
        // Block until at least one socket has input
        zmq_poll (items, 2, -1);
        if (items [0].revents & ZMQ_POLLIN) {
            int size = zmq_recv (receiver, msg, 255, 0);
            if (size != -1) {
                // Process task
            }
        }
        if (items [1].revents & ZMQ_POLLIN) {
            int size = zmq_recv (subscriber, msg, 255, 0);
            if (size != -1) {
                // Process weather update
            }
        }
    }
    zmq_close (receiver);
    zmq_close (subscriber);
    zmq_ctx_destroy (context);
    return 0;
}



The items structure has these four members:

typedef struct {
    void *socket; // ZeroMQ socket to poll on
    int fd; // OR, native file handle to poll on
    short events; // Events to poll on
    short revents; // Events returned after poll
} zmq_pollitem_t;

Multipart Messages


ZeroMQ lets us compose a message out of several frames, giving us a "multipart message". Realistic applications use multipart messages heavily, both for wrapping messages with address information and for simple serialization. We'll look at reply envelopes later.

What we'll learn now is simply how to blindly and safely read and write multipart messages in any application (such as a proxy) that needs to forward messages without inspecting them.

When you work with multipart messages, each part is a zmq_msg item. E.g., if you are sending a message with five parts, you must construct, send, and destroy five zmq_msg items. You can do this in advance (and store the zmq_msg items in an array or other structure), or as you send them, one-by-one.

Here is how we send the frames in a multipart message (each frame is a message object):

zmq_msg_send (&message, socket, ZMQ_SNDMORE);

zmq_msg_send (&message, socket, ZMQ_SNDMORE);

zmq_msg_send (&message, socket, 0);

Here is how we receive and process all the parts in a message, be it single part or multipart:

while (1) {
    zmq_msg_t message;
    zmq_msg_init (&message);
    zmq_msg_recv (&message, socket, 0);
    // Process the message frame

    // Read the "more" flag before closing the message
    int more = zmq_msg_more (&message);
    zmq_msg_close (&message);
    if (!more)
        break; // Last message frame
}

Some things to know about multipart messages:

  • When you send a multipart message, the first part (and all following parts) are only actually sent on the wire when you send the final part.
  • If you are using zmq_poll(), when you receive the first part of a message, all the rest has also arrived.
  • You will receive all parts of a message, or none at all.
  • Each part of a message is a separate zmq_msg item.
  • You will receive all parts of a message whether or not you check the more property.
  • On sending, ZeroMQ queues message frames in memory until you send the last one, then sends them all.
  • There is no way to cancel a partially sent message, except by closing the socket.

Intermediaries and Proxies


ZeroMQ aims for decentralized intelligence, but that doesn't mean your network is empty space in the middle. It's filled with message-aware infrastructure and quite often, we build that infrastructure with ZeroMQ. The ZeroMQ plumbing can range from tiny pipes to full-blown service-oriented brokers. The messaging industry calls this intermediation, meaning that the stuff in the middle deals with either side. In ZeroMQ, we call these proxies, queues, forwarders, devices, or brokers, depending on the context.

This pattern is extremely common in the real world and is why our societies and economies are filled with intermediaries who have no other real function than to reduce the complexity and scaling costs of larger networks. Real-world intermediaries are typically called wholesalers, distributors, managers, and so on.

The Dynamic Discovery Problem


One of the problems you will hit as you design larger distributed architectures is discovery. That is, how do pieces know about each other? It's especially difficult if pieces come and go, so we call this the "dynamic discovery problem".

There are several solutions to dynamic discovery. The simplest is to entirely avoid it by hard-coding (or configuring) the network architecture so discovery is done by hand. That is, when you add a new piece, you reconfigure the network to know about it.

Figure 12 - Small-Scale Pub-Sub Network

fig12.png

In practice, this leads to increasingly fragile and unwieldy architectures. Let's say you have one publisher and a hundred subscribers. You connect each subscriber to the publisher by configuring a publisher endpoint in each subscriber. That's easy. Subscribers are dynamic; the publisher is static. Now say you add more publishers. Suddenly, it's not so easy any more. If you continue to connect each subscriber to each publisher, the cost of avoiding dynamic discovery gets higher and higher.

Figure 13 - Pub-Sub Network with a Proxy

fig13.png

There are quite a few answers to this, but the very simplest answer is to add an intermediary; that is, a static point in the network to which all other nodes connect. In classic messaging, this is the job of the message broker. ZeroMQ doesn't come with a message broker as such, but it lets us build intermediaries quite easily.

You might wonder, if all networks eventually get large enough to need intermediaries, why don't we simply have a message broker in place for all applications? For beginners, it's a fair compromise. Just always use a star topology, forget about performance, and things will usually work. However, message brokers are greedy things; in their role as central intermediaries, they become too complex, too stateful, and eventually a problem.

It's better to think of intermediaries as simple stateless message switches. A good analogy is an HTTP proxy; it's there, but doesn't have any special role. Adding a pub-sub proxy solves the dynamic discovery problem in our example. We set the proxy in the "middle" of the network. The proxy opens an XSUB socket, an XPUB socket, and binds each to well-known IP addresses and ports. Then, all other processes connect to the proxy, instead of to each other. It becomes trivial to add more subscribers or publishers.

Figure 14 - Extended Pub-Sub

fig14.png

We need XPUB and XSUB sockets because ZeroMQ does subscription forwarding from subscribers to publishers. XSUB and XPUB are exactly like SUB and PUB except they expose subscriptions as special messages. The proxy has to forward these subscription messages from subscriber side to publisher side, by reading them from the XPUB socket and writing them to the XSUB socket. This is the main use case for XSUB and XPUB.
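Those special messages have a very simple shape: a first byte of 1 for a subscription or 0 for an unsubscription, followed by the subscription body (the topic prefix). Here is a sketch in plain C that builds and reads such a frame. The helper names are hypothetical, and a proxy normally just forwards these frames without looking inside:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

//  Build a subscription frame: byte 1 (subscribe) or 0 (unsubscribe),
//  then the topic. Returns the frame size, or 0 if out is too small.
static size_t make_subscription (unsigned char *out, size_t cap,
                                 int subscribe, const char *topic)
{
    size_t topic_size = strlen (topic);
    if (cap < topic_size + 1)
        return 0;
    out [0] = subscribe ? 1 : 0;
    memcpy (out + 1, topic, topic_size);
    return topic_size + 1;
}

//  Read a subscription frame. Returns 1 for subscribe, 0 for
//  unsubscribe, or -1 if the frame does not look like either.
//  On success, *topic points into the frame (it is not copied).
static int parse_subscription (const unsigned char *frame, size_t size,
                               const unsigned char **topic,
                               size_t *topic_size)
{
    if (size < 1 || frame [0] > 1)
        return -1;
    *topic = frame + 1;
    *topic_size = size - 1;
    return frame [0];
}
```

A subscriber that asks for the prefix "10001 " therefore shows up at the XPUB socket as a seven-byte message: the byte 1 and then the six characters of the topic.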

Shared Queue (DEALER and ROUTER sockets)


In the Hello World client/server application, we have one client that talks to one service. However, in real cases we usually need to allow multiple services as well as multiple clients. This lets us scale up the power of the service (many threads or processes or nodes rather than just one). The only constraint is that services must be stateless, all state being in the request or in some shared storage such as a database.

Figure 15 - Request Distribution

fig15.png

There are two ways to connect multiple clients to multiple servers. The brute force way is to connect each client socket to multiple service endpoints. One client socket can connect to multiple service sockets, and the REQ socket will then distribute requests among these services. Let's say you connect a client socket to three service endpoints: A, B, and C. The client makes requests R1, R2, R3, R4. R1 and R4 go to service A, R2 goes to B, and R3 goes to service C.
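The distribution in that example is plain round-robin. Here is a toy model of it in C, with no ZeroMQ involved. The type and function names are ours, and the real REQ socket does this internally across whatever endpoints it is connected to:

```c
#include <assert.h>
#include <stddef.h>

//  Toy model of a client socket connected to several service endpoints
typedef struct {
    size_t endpoints;       //  How many services we are connected to
    size_t next;            //  Index of the service that gets the next request
} round_robin_t;

//  Pick the endpoint for the next request, then advance the cursor,
//  wrapping back to the first endpoint after the last one.
static size_t next_endpoint (round_robin_t *self)
{
    assert (self->endpoints > 0);
    size_t chosen = self->next;
    self->next = (self->next + 1) % self->endpoints;
    return chosen;
}
```

With three endpoints (index 0 for A, 1 for B, 2 for C), four calls return 0, 1, 2, 0, which is the R1 to A, R2 to B, R3 to C, R4 to A sequence described above.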

This design lets you add more clients cheaply. You can also add more services. Each client will distribute its requests to the services. But each client has to know the service topology. If you have 100 clients and then you decide to add three more services, you need to reconfigure and restart 100 clients in order for the clients to know about the three new services.

That's clearly not the kind of thing we want to be doing at 3 a.m. when our supercomputing cluster has run out of resources and we desperately need to add a couple of hundred new service nodes. Too many static pieces are like liquid concrete: knowledge is distributed and the more static pieces you have, the more effort it is to change the topology. What we want is something sitting in between clients and services that centralizes all knowledge of the topology. Ideally, we should be able to add and remove services or clients at any time without touching any other part of the topology.

So we'll write a little message queuing broker that gives us this flexibility. The broker binds to two endpoints, a frontend for clients and a backend for services. It then uses zmq_poll() to monitor these two sockets for activity and when it has some, it shuttles messages between its two sockets. It doesn't actually manage any queues explicitly; ZeroMQ does that automatically on each socket.

When you use REQ to talk to REP, you get a strictly synchronous request-reply dialog. The client sends a request. The service reads the request and sends a reply. The client then reads the reply. If either the client or the service tries to do anything else (e.g., sending two requests in a row without waiting for a response), it will get an error.

But our broker has to be nonblocking. Obviously, we can use zmq_poll() to wait for activity on either socket, but we can't use REP and REQ.

Figure 16 - Extended Request-Reply

fig16.png

Luckily, there are two sockets called DEALER and ROUTER that let you do nonblocking request-response. You'll see in Chapter 3 - Advanced Request-Reply Patterns how DEALER and ROUTER sockets let you build all kinds of asynchronous request-reply flows. For now, we're just going to see how DEALER and ROUTER let us extend REQ-REP across an intermediary, that is, our little broker.

In this simple extended request-reply pattern, REQ talks to ROUTER and DEALER talks to REP. In between the DEALER and ROUTER, we have to have code (like our broker) that pulls messages off the one socket and shoves them onto the other.

The request-reply broker binds to two endpoints, one for clients to connect to (the frontend socket) and one for workers to connect to (the backend). To test this broker, you will want to change your workers so they connect to the backend socket. Here is a client that shows what I mean:



Here is the worker:



And here is the broker, which properly handles multipart messages:



Figure 17 - Request-Reply Broker

fig17.png

Using a request-reply broker makes your client/server architectures easier to scale because clients don't see workers, and workers don't see clients. The only static node is the broker in the middle.

ZeroMQ's Built-In Proxy Function


It turns out that the core loop in the previous section's rrbroker is very useful, and reusable. It lets us build pub-sub forwarders and shared queues and other little intermediaries with very little effort. ZeroMQ wraps this up in a single method, zmq_proxy():

zmq_proxy (frontend, backend, capture);

The two sockets (or three, if we want to capture data) must be properly connected, bound, and configured. When we call the zmq_proxy method, it's exactly like starting the main loop of rrbroker. Let's rewrite the request-reply broker to call zmq_proxy, and re-badge this as an expensive-sounding "message queue" (people have charged houses for code that did less):



If you're like most ZeroMQ users, at this stage your mind is starting to think, "What kind of evil stuff can I do if I plug random socket types into the proxy?" The short answer is: try it and work out what is happening. In practice, you would usually stick to ROUTER/DEALER, XSUB/XPUB, or PULL/PUSH.

Transport Bridging


A frequent request from ZeroMQ users is, "How do I connect my ZeroMQ network with technology X?" where X is some other networking or messaging technology.

Figure 18 - Pub-Sub Forwarder Proxy

fig18.png

The simple answer is to build a bridge. A bridge is a small application that speaks one protocol at one socket, and converts to/from a second protocol at another socket. A protocol interpreter, if you like. A common bridging problem in ZeroMQ is to bridge two transports or networks.

As an example, we're going to write a little proxy that sits in between a publisher and a set of subscribers, bridging two networks. The frontend socket (SUB) faces the internal network where the weather server is sitting, and the backend (PUB) faces subscribers on the external network. It subscribes to the weather service on the frontend socket, and republishes its data on the backend socket.



It looks very similar to the earlier proxy example, but the key part is that the frontend and backend sockets are on two different networks. We can use this model, for example, to connect a multicast network (pgm transport) to a tcp publisher.

Handling Errors and ETERM


ZeroMQ's error handling philosophy is a mix of fail-fast and resilience. Processes, we believe, should be as vulnerable as possible to internal errors, and as robust as possible against external attacks and errors. To give an analogy, a living cell will self-destruct if it detects a single internal error, yet it will resist attack from the outside by all means possible.

Assertions, which pepper the ZeroMQ code, are absolutely vital to robust code; they just have to be on the right side of the cellular wall. And there should be such a wall. If it is unclear whether a fault is internal or external, that is a design flaw to be fixed. In C/C++, assertions stop the application immediately with an error. In other languages, you may get exceptions or halts.

When ZeroMQ detects an external fault it returns an error to the calling code. In some rare cases, it drops messages silently if there is no obvious strategy for recovering from the error.

In most of the C examples we've seen so far there's been no error handling. Real code should do error handling on every single ZeroMQ call. If you're using a language binding other than C, the binding may handle errors for you. In C, you do need to do this yourself. There are some simple rules, starting with POSIX conventions:

  • Methods that create objects return NULL if they fail.
  • Methods that process data may return the number of bytes processed, or -1 on an error or failure.
  • Other methods return 0 on success and -1 on an error or failure.
  • The error code is provided in errno or zmq_errno().
  • A descriptive error text for logging is provided by zmq_strerror().

For example:

void *context = zmq_ctx_new ();
assert (context);
void *socket = zmq_socket (context, ZMQ_REP);
assert (socket);
int rc = zmq_bind (socket, "tcp://*:5555");
if (rc == -1) {
    printf ("E: bind failed: %s\n", zmq_strerror (errno));
    return -1;
}

There are two main exceptional conditions that you should handle as nonfatal:

  • When your code receives a message with the ZMQ_DONTWAIT option and there is no waiting data, ZeroMQ will return -1 and set errno to EAGAIN.
  • When one thread calls zmq_ctx_destroy(), and other threads are still doing blocking work, the zmq_ctx_destroy() call closes the context and all blocking calls exit with -1, and errno set to ETERM.

In C/C++, asserts can be removed entirely in optimized code, so don't make the mistake of wrapping the whole ZeroMQ call in an assert(). It looks neat; then the optimizer removes all the asserts and the calls you want to make, and your application breaks in impressive ways.

Figure 19 - Parallel Pipeline with Kill Signaling
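Here is a small sketch of why that goes wrong. RELEASE_ASSERT is a stand-in of ours that behaves the way assert() does once NDEBUG is defined, and fake_bind() is a hypothetical placeholder for any ZeroMQ call:

```c
#include <assert.h>     //  Only so callers can check calls_made below

//  Stand-in for assert() in a build with NDEBUG defined: the
//  expression is thrown away without ever being evaluated.
#define RELEASE_ASSERT(expr) ((void) 0)

static int calls_made = 0;

//  Hypothetical placeholder for a call such as zmq_bind()
static int fake_bind (void)
{
    calls_made++;
    return 0;
}

//  Wrong: the call lives inside the assertion, so it vanishes with it
static void wrapped_style (void)
{
    RELEASE_ASSERT (fake_bind () == 0);
}

//  Right: make the call first, then assert on its return code
static void separate_style (void)
{
    int rc = fake_bind ();
    RELEASE_ASSERT (rc == 0);
    (void) rc;              //  Avoid an unused-variable warning
}
```

After wrapped_style() the counter is still zero because the bind never happened, while separate_style() makes the call whether or not assertions are compiled in.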

fig19.png

Let's see how to shut down a process cleanly. We'll take the parallel pipeline example from the previous section. If we've started a whole lot of workers in the background, we now want to kill them when the batch is finished. Let's do this by sending a kill message to the workers. The best place to do this is the sink because it really knows when the batch is done.

How do we connect the sink to the workers? The PUSH/PULL sockets are one-way only. We could switch to another socket type, or we could mix multiple socket flows. Let's try the latter: using a pub-sub model to send kill messages to the workers:

  • The sink creates a PUB socket on a new endpoint.
  • Workers connect a SUB socket to this endpoint.
  • When the sink detects the end of the batch, it sends a kill to its PUB socket.
  • When a worker detects this kill message, it exits.

It doesn't take much new code in the sink:

void *controller = zmq_socket (context, ZMQ_PUB);
zmq_bind (controller, "tcp://*:5559");

// Send kill signal to workers
s_send (controller, "KILL");

Here is the worker process, which manages two sockets (a PULL socket getting tasks, and a SUB socket getting control commands), using the zmq_poll() technique we saw earlier:



Here is the modified sink application. When it's finished collecting results, it broadcasts a kill message to all workers:



Handling Interrupt Signals


Realistic applications need to shut down cleanly when interrupted with Ctrl-C or another signal such as SIGTERM. By default, these simply kill the process, meaning messages won't be flushed, files won't be closed cleanly, and so on.

Here is how we handle a signal in various languages:


The program provides s_catch_signals(), which traps Ctrl-C (SIGINT) and SIGTERM. When either of these signals arrives, the handler that s_catch_signals() installs sets the global variable s_interrupted. Thanks to your signal handler, your application will not die automatically. Instead, you have a chance to clean up and exit gracefully. You now have to check explicitly for an interrupt and handle it properly. Do this by calling s_catch_signals() (copy this from interrupt.c) at the start of your main code. This sets up the signal handling. The interrupt will affect ZeroMQ calls as follows:

  • If your code is blocked in a blocking call (sending a message, receiving a message, or polling), then when a signal arrives, the call will return with EINTR.
  • Wrappers like s_recv() return NULL if they are interrupted.

So check for an EINTR return code, a NULL return, and/or s_interrupted.

Here is a typical code fragment:

s_catch_signals ();
client = zmq_socket (...);
while (!s_interrupted) {
    char *message = s_recv (client);
    if (!message)
        break;          //  Ctrl-C used
    //  ... process the message ...
    free (message);     //  s_recv() returns a string we must free
}
zmq_close (client);

If you call s_catch_signals() and don't test for interrupts, then your application will become immune to Ctrl-C and SIGTERM, which may be useful, but is usually not.

Detecting Memory Leaks

Any long-running application has to manage memory correctly, or eventually it'll use up all available memory and crash. If you use a language that handles this automatically for you, congratulations. If you program in C or C++ or any other language where you're responsible for memory management, here's a short tutorial on using valgrind, which among other things will report on any leaks your programs have.

  • To install valgrind, e.g., on Ubuntu or Debian, issue this command:
sudo apt-get install valgrind
  • By default, ZeroMQ will cause valgrind to complain a lot. To remove these warnings, create a file called vg.supp that contains this:
{
   <socketcall_sendto>
   Memcheck:Param
   socketcall.sendto(msg)
   fun:send
   ...
}
{
   <socketcall_sendto>
   Memcheck:Param
   socketcall.send(msg)
   fun:send
   ...
}
  • Fix your applications to exit cleanly after Ctrl-C. For any application that exits by itself, that's not needed, but for long-running applications, this is essential, otherwise valgrind will complain about all currently allocated memory.
  • Build your application with -DDEBUG if it's not your default setting. That ensures valgrind can tell you exactly where memory is being leaked.
  • Finally, run valgrind thus:
valgrind --tool=memcheck --leak-check=full --suppressions=vg.supp someprog

And after fixing any errors it reported, you should get the pleasant message:

==30536== ERROR SUMMARY: 0 errors from 0 contexts...

Multithreading with ZeroMQ

ZeroMQ is perhaps the nicest way ever to write multithreaded (MT) applications. Whereas ZeroMQ sockets require some readjustment if you are used to traditional sockets, ZeroMQ multithreading will take everything you know about writing MT applications, throw it into a heap in the garden, pour gasoline over it, and set it alight. It's a rare book that deserves burning, but most books on concurrent programming do.

To make utterly perfect MT programs (and I mean that literally), we don't need mutexes, locks, or any other form of inter-thread communication except messages sent across ZeroMQ sockets.

By "perfect MT programs", I mean code that's easy to write and understand, that works with the same design approach in any programming language, and on any operating system, and that scales across any number of CPUs with zero wait states and no point of diminishing returns.

If you've spent years learning tricks to make your MT code work at all, let alone rapidly, with locks and semaphores and critical sections, you will be disgusted when you realize it was all for nothing. If there's one lesson we've learned from 30+ years of concurrent programming, it is: just don't share state. It's like two drunkards trying to share a beer. It doesn't matter if they're good buddies. Sooner or later, they're going to get into a fight. And the more drunkards you add to the table, the more they fight each other over the beer. The tragic majority of MT applications look like drunken bar fights.

The list of weird problems that you need to fight as you write classic shared-state MT code would be hilarious if it didn't translate directly into stress and risk, as code that seems to work suddenly fails under pressure. A large firm with world-beating experience in buggy code released its list of "11 Likely Problems In Your Multithreaded Code", which covers forgotten synchronization, incorrect granularity, read and write tearing, lock-free reordering, lock convoys, two-step dance, and priority inversion.

Yeah, we counted seven problems, not eleven. That's not the point though. The point is, do you really want that code running the power grid or stock market to start getting two-step lock convoys at 3 p.m. on a busy Thursday? Who cares what the terms actually mean? This is not what turned us on to programming, fighting ever more complex side effects with ever more complex hacks.

Some widely used models, despite being the basis for entire industries, are fundamentally broken, and shared state concurrency is one of them. Code that wants to scale without limit does it like the Internet does, by sending messages and sharing nothing except a common contempt for broken programming models.

You should follow some rules to write happy multithreaded code with ZeroMQ:

  • Isolate data privately within its thread and never share data in multiple threads. The only exception to this is ZeroMQ contexts, which are threadsafe.
  • Stay away from the classic concurrency mechanisms such as mutexes, critical sections, semaphores, etc. These are an anti-pattern in ZeroMQ applications.
  • Create one ZeroMQ context at the start of your process, and pass that to all threads that you want to connect via inproc sockets.
  • Use attached threads to create structure within your application, and connect these to their parent threads using PAIR sockets over inproc. The pattern is: bind parent socket, then create child thread which connects its socket.
  • Use detached threads to simulate independent tasks, with their own contexts. Connect these over tcp. Later you can move these to stand-alone processes without changing the code significantly.
  • All interaction between threads happens as ZeroMQ messages, which you can define more or less formally.
  • Don't share ZeroMQ sockets between threads. ZeroMQ sockets are not threadsafe. Technically it's possible to migrate a socket from one thread to another but it demands skill. The only place where it's remotely sane to share sockets between threads is in language bindings that need to do magic like garbage collection on sockets.

If you need to start more than one proxy in an application, for example, you will want to run each in its own thread. It is easy to make the error of creating the proxy frontend and backend sockets in one thread, and then passing the sockets to the proxy in another thread. This may appear to work at first but will fail randomly in real use. Remember: Do not use or close sockets except in the thread that created them.

If you follow these rules, you can quite easily build elegant multithreaded applications, and later split off threads into separate processes as you need to. Application logic can sit in threads, processes, or nodes: whatever your scale needs.

ZeroMQ uses native OS threads rather than virtual "green" threads. The advantage is that you don't need to learn any new threading API, and that ZeroMQ threads map cleanly to your operating system. You can use standard tools like Intel's ThreadChecker to see what your application is doing. The disadvantages are that native threading APIs are not always portable, and that if you have a huge number of threads (in the thousands), some operating systems will get stressed.

Let's see how this works in practice. We'll turn our old Hello World server into something more capable. The original server ran in a single thread. If the work per request is low, that's fine: one ØMQ thread can run at full speed on a CPU core, with no waits, doing an awful lot of work. But realistic servers have to do nontrivial work per request. A single core may not be enough when 10,000 clients hit the server all at once. So a realistic server will start multiple worker threads. It then accepts requests as fast as it can and distributes these to its worker threads. The worker threads grind through the work and eventually send their replies back.

You can, of course, do all this using a proxy broker and external worker processes, but often it's easier to start one process that gobbles up sixteen cores than sixteen processes, each gobbling up one core. Further, running workers as threads will cut out a network hop, latency, and network traffic.

The MT version of the Hello World service basically collapses the broker and workers into a single process:

//  Multithreaded Hello World server

#include "zhelpers.h"
#include <pthread.h>
#include <unistd.h>

static void *
worker_routine (void *context) {
    //  Socket to talk to dispatcher
    void *receiver = zmq_socket (context, ZMQ_REP);
    zmq_connect (receiver, "inproc://workers");

    while (1) {
        char *string = s_recv (receiver);
        printf ("Received request: [%s]\n", string);
        free (string);
        //  Do some 'work'
        sleep (1);
        //  Send reply back to client
        s_send (receiver, "World");
    }
    zmq_close (receiver);
    return NULL;
}

int main (void)
{
    void *context = zmq_ctx_new ();

    //  Socket to talk to clients
    void *clients = zmq_socket (context, ZMQ_ROUTER);
    zmq_bind (clients, "tcp://*:5555");

    //  Socket to talk to workers
    void *workers = zmq_socket (context, ZMQ_DEALER);
    zmq_bind (workers, "inproc://workers");

    //  Launch pool of worker threads
    int thread_nbr;
    for (thread_nbr = 0; thread_nbr < 5; thread_nbr++) {
        pthread_t worker;
        pthread_create (&worker, NULL, worker_routine, context);
    }
    //  Connect work threads to client threads via a queue proxy
    zmq_proxy (clients, workers, NULL);

    //  We never get here, but clean up anyhow
    zmq_close (clients);
    zmq_close (workers);
    zmq_ctx_destroy (context);
    return 0;
}


Figure 20 - Multithreaded Server

fig20.png

All the code should be recognizable to you by now. How it works:

  • The server starts a set of worker threads. Each worker thread creates a REP socket and then processes requests on this socket. Worker threads are just like single-threaded servers. The only differences are the transport (inproc instead of tcp), and the bind-connect direction.
  • The server creates a ROUTER socket to talk to clients and binds this to its external interface (over tcp).
  • The server creates a DEALER socket to talk to the workers and binds this to its internal interface (over inproc).
  • The server starts a proxy that connects the two sockets. The proxy pulls incoming requests fairly from all clients, and distributes those out to workers. It also routes replies back to their origin.

Note that creating threads is not portable in most programming languages. The POSIX library is pthreads, but on Windows you have to use a different API. In our example, the pthread_create call starts up a new thread running the worker_routine function we defined. We'll see in Chapter 3 - Advanced Request-Reply Patterns how to wrap this in a portable API.

Here the "work" is just a one-second pause. We could do anything in the workers, including talking to other nodes. This is what the MT server looks like in terms of ØMQ sockets and nodes. Note how the request-reply chain is REQ-ROUTER-queue-DEALER-REP.

Signaling Between Threads (PAIR Sockets)

When you start making multithreaded applications with ZeroMQ, you'll encounter the question of how to coordinate your threads. Though you might be tempted to insert "sleep" statements, or use multithreading techniques such as semaphores or mutexes, the only mechanism that you should use is ZeroMQ messages. Remember the story of The Drunkards and The Beer Bottle.

Let's make three threads that signal each other when they are ready. In this example, we use PAIR sockets over the inproc transport:


Figure 21 - The Relay Race

fig21.png

This is a classic pattern for multithreading with ZeroMQ:

  1. Two threads communicate over inproc, using a shared context.
  2. The parent thread creates one socket, binds it to an inproc:// endpoint, and then starts the child thread, passing the context to it.
  3. The child thread creates the second socket, connects it to that inproc:// endpoint, and then signals to the parent thread that it's ready.

Note that multithreading code using this pattern is not scalable out to processes. If you use inproc and socket pairs, you are building a tightly-bound application, i.e., one where your threads are structurally interdependent. Do this when low latency is really vital. The other design pattern is a loosely bound application, where threads have their own context and communicate over ipc or tcp. You can easily break loosely bound threads into separate processes.

This is the first time we've shown an example using PAIR sockets. Why use PAIR? Other socket combinations might seem to work, but they all have side effects that could interfere with signaling:

  • You can use PUSH for the sender and PULL for the receiver. This looks simple and will work, but remember that PUSH will distribute messages to all available receivers. If you by accident start two receivers (e.g., you already have one running and you start a second), you'll "lose" half of your signals. PAIR has the advantage of refusing more than one connection; the pair is exclusive.
  • You can use DEALER for the sender and ROUTER for the receiver. ROUTER, however, wraps your message in an "envelope", meaning your zero-size signal turns into a multipart message. If you don't care about the data and treat anything as a valid signal, and if you don't read more than once from the socket, that won't matter. If, however, you decide to send real data, you will suddenly find ROUTER providing you with "wrong" messages. DEALER also distributes outgoing messages, giving the same risk as PUSH.
  • You can use PUB for the sender and SUB for the receiver. This will correctly deliver your messages exactly as you sent them and PUB does not distribute as PUSH or DEALER do. However, you need to configure the subscriber with an empty subscription, which is annoying.

For these reasons, PAIR makes the best choice for coordination between pairs of threads.
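To see why an accidental second receiver hurts with PUSH, here is a toy model in plain C. It makes no ZeroMQ calls, distribute_round_robin() is a hypothetical helper invented for this sketch, and it assumes the sender rotates strictly among its connected receivers:

```c
#include <assert.h>
#include <stddef.h>

//  Toy model, not libzmq code: hand out 'signals' messages to
//  'receivers' peers in strict rotation, and count how many each gets.
//  'counts' must have room for 'receivers' entries.
static void distribute_round_robin (size_t signals, size_t receivers,
                                    size_t *counts)
{
    assert (receivers > 0 && counts != NULL);
    for (size_t r = 0; r < receivers; r++)
        counts [r] = 0;
    for (size_t s = 0; s < signals; s++)
        counts [s % receivers]++;
}
```

With one receiver, all ten of ten signals arrive where you expect them. With two, the receiver you care about sees only five, which is the "lost half" described above.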

Node Coordination

When you want to coordinate a set of nodes on a network, PAIR sockets won't work well any more. This is one of the few areas where the strategies for threads and nodes are different. Principally, nodes come and go whereas threads are usually static. PAIR sockets do not automatically reconnect if the remote node goes away and comes back.

Figure 22 - Pub-Sub Synchronization

fig22.png

The second significant difference between threads and nodes is that you typically have a fixed number of threads but a more variable number of nodes. Let's take one of our earlier scenarios (the weather server and clients) and use node coordination to ensure that subscribers don't lose data when starting up.

This is how the application will work:

  • The publisher knows in advance how many subscribers it expects. This is just a magic number it gets from somewhere.
  • The publisher starts up and waits for all subscribers to connect. This is the node coordination part. Each subscriber subscribes and then tells the publisher it's ready via another socket.
  • When the publisher has all subscribers connected, it starts to publish data.

In this case, we'll use a REQ-REP socket flow to synchronize subscribers and publisher. Here is the publisher:


And here is the subscriber:


This Bash shell script will start ten subscribers and then the publisher:

echo "Starting subscribers..."
for ((a=0; a<10; a++)); do
    syncsub &
done
echo "Starting publisher..."
syncpub

Which gives us this satisfying output:

Starting subscribers...
Starting publisher...
Received 1000000 updates
Received 1000000 updates
...
Received 1000000 updates
Received 1000000 updates

We can't assume that the SUB connect will be finished by the time the REQ/REP dialog is complete. There are no guarantees that outbound connects will finish in any order whatsoever, if you're using any transport except inproc. So, the example does a brute force sleep of one second between subscribing, and sending the REQ/REP synchronization.

A more robust model could be:

  • Publisher opens PUB socket and starts sending "Hello" messages (not data).
  • Subscribers connect SUB socket and when they receive a Hello message they tell the publisher via a REQ/REP socket pair.
  • When the publisher has had all the necessary confirmations, it starts to send real data.

Zero-Copy

ZeroMQ's message API lets you send and receive messages directly from and to application buffers without copying data. We call this zero-copy, and it can improve performance in some applications.

You should think about using zero-copy in the specific case where you are sending large blocks of memory (thousands of bytes), at a high frequency. For short messages, or for lower message rates, using zero-copy will make your code messier and more complex with no measurable benefit. Like all optimizations, use this when you know it helps, and measure before and after.

To do zero-copy, you use zmq_msg_init_data() to create a message that refers to a block of data already allocated with malloc() or some other allocator, and then you pass that to zmq_msg_send(). When you create the message, you also pass a function that ZeroMQ will call to free the block of data, when it has finished sending the message. This is the simplest example, assuming buffer is a block of 1,000 bytes allocated on the heap:

void my_free (void *data, void *hint) {
    free (data);
}
//  Send message from buffer, which we allocate and ZeroMQ will free for us
zmq_msg_t message;
zmq_msg_init_data (&message, buffer, 1000, my_free, NULL);
zmq_msg_send (&message, socket, 0);

Note that you don't call zmq_msg_close() after sending a message; libzmq will do this automatically when it's actually done sending the message.
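The hint argument is there so the free function can receive some context of your own. As an illustration of the ownership handover only, here is a toy model in plain C with no ZeroMQ calls. toy_send() is a hypothetical stand-in for the library, and counting_free() is a made-up callback that uses the hint as a counter:

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

//  Same shape as the free function passed to zmq_msg_init_data()
typedef void (free_fn) (void *data, void *hint);

//  Free the buffer and record the release in a counter passed as hint
static void counting_free (void *data, void *hint)
{
    free (data);
    if (hint)
        (*(int *) hint)++;
}

//  Hypothetical stand-in for the library: "transmit" the buffer without
//  copying it, then release it through the caller's callback
static size_t toy_send (void *data, size_t size, free_fn *ffn, void *hint)
{
    assert (data != NULL);
    //  ... the bytes at 'data' would be written to the network here ...
    if (ffn)
        ffn (data, hint);
    return size;
}
```

The point of the sketch is that once you hand the buffer over, you no longer own it. The callback runs whenever the sender is done, so the application must not touch or free the buffer itself after the send call.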

There is no way to do zero-copy on receive: ZeroMQ delivers you a buffer that you can store as long as you wish, but it will not write data directly into application buffers.

On writing, ZeroMQ's multipart messages work nicely together with zero-copy. In traditional messaging, you need to marshal different buffers together into one buffer that you can send. That means copying data. With ZeroMQ, you can send multiple buffers coming from different sources as individual message frames. Send each field as a length-delimited frame. To the application, it looks like a series of send and receive calls. But internally, the multiple parts get written to the network and read back with single system calls, so it's very efficient.

Pub-Sub Message Envelopes

In the pub-sub pattern, we can split the key into a separate message frame that we call an envelope. If you want to use pub-sub envelopes, make them yourself. It's optional, and in previous pub-sub examples we didn't do this. Using a pub-sub envelope is a little more work for simple cases, but it's cleaner especially for real cases, where the key and the data are naturally separate things.

Figure 23 - Pub-Sub Envelope with Separate Key

fig23.png

Subscriptions do a prefix match. That is, they look for "all messages starting with XYZ". The obvious question is: how to delimit keys from data so that the prefix match doesn't accidentally match data. The best answer is to use an envelope because the match won't cross a frame boundary. Here is a minimalist example of how pub-sub envelopes look in code. This publisher sends messages of two types, A and B.

The envelope holds the message type:


The subscriber wants only messages of type B:


When you run the two programs, the subscriber should show you this:

[B] We would like to see this
[B] We would like to see this
[B] We would like to see this
...

This example shows that the subscription filter rejects or accepts the entire multipart message (key plus data). You won't get part of a multipart message, ever. If you subscribe to multiple publishers and you want to know their address so that you can send them data via another socket (and this is a typical use case), create a three-part message.

Figure 24 - Pub-Sub Envelope with Sender Address

fig24.png
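The filtering rule described in this section (a prefix match against the first frame only, applied to the whole message) can be sketched as a toy model in plain C. This is not libzmq code: frame_t and subscription_matches() are hypothetical names invented for the illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

//  A message is an array of frames; only frames [0] takes part in matching
typedef struct {
    const char *data;
    size_t size;
} frame_t;

//  Returns true if the first frame starts with the subscription prefix.
//  An empty prefix (size 0) matches every message. The verdict applies
//  to the whole multipart message, never to individual frames.
static bool subscription_matches (const frame_t *frames, size_t frame_count,
                                  const char *prefix, size_t prefix_size)
{
    assert (frames != NULL && frame_count > 0 && prefix != NULL);
    if (prefix_size > frames [0].size)
        return false;
    return memcmp (frames [0].data, prefix, prefix_size) == 0;
}
```

A subscription to "B" accepts a message whose envelope frame is "B" and rejects one whose envelope is "A", even if the data frame of the "A" message happens to start with the letter B. That is the accidental match the envelope protects you from.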

High-Water Marks

When you can send messages rapidly from process to process, you soon discover that memory is a precious resource, and one that can be trivially filled up. A few seconds of delay somewhere in a process can turn into a backlog that blows up a server unless you understand the problem and take precautions.

The problem is this: imagine you have process A sending messages at high frequency to process B, which is processing them. Suddenly B gets very busy (garbage collection, CPU overload, whatever), and can't process the messages for a short period. It could be a few seconds for some heavy garbage collection, or it could be much longer, if there's a more serious problem. What happens to the messages that process A is still trying to send frantically? Some will sit in B's network buffers. Some will sit on the Ethernet wire itself. Some will sit in A's network buffers. And the rest will accumulate in A's memory, as rapidly as the application behind A sends them. If you don't take some precaution, A can easily run out of memory and crash.

It is a consistent, classic problem with message brokers. What makes it hurt more is that it's B's fault, superficially, and B is typically a user-written application which A has no control over.

What are the answers? One is to pass the problem upstream. A is getting the messages from somewhere else. So tell that process, "Stop!" And so on. This is called flow control. It sounds plausible, but what if you're sending out a Twitter feed? Do you tell the whole world to stop tweeting while B gets its act together?

Flow control works in some cases, but not in others. The transport layer can't tell the application layer to "stop" any more than a subway system can tell a large business, "please keep your staff at work for another half an hour. I'm too busy". The answer for messaging is to set limits on the size of buffers, and then when we reach those limits, to take some sensible action. In some cases (not for a subway system, though), the answer is to throw away messages. In others, the best strategy is to wait.

ZeroMQ uses the concept of HWM (high-water mark) to define the capacity of its internal pipes. Each connection out of a socket or into a socket has its own pipe, and HWM for sending, and/or receiving, depending on the socket type. Some sockets (PUB, PUSH) only have send buffers. Some (SUB, PULL, REQ, REP) only have receive buffers. Some (DEALER, ROUTER, PAIR) have both send and receive buffers.

In ZeroMQ v2.x, the HWM was infinite by default. This was easy but also typically fatal for high-volume publishers. In ZeroMQ v3.x, it's set to 1,000 by default, which is more sensible. If you're still using ZeroMQ v2.x, you should always set a HWM on your sockets, be it 1,000 to match ZeroMQ v3.x or another figure that takes into account your message sizes and expected subscriber performance.

When your socket reaches its HWM, it will either block or drop data depending on the socket type. PUB and ROUTER sockets will drop data if they reach their HWM, while other socket types will block. Over the inproc transport, the sender and receiver share the same buffers, so the real HWM is the sum of the HWM set by both sides.

Lastly, the HWMs are not exact; while you may get up to 1,000 messages by default, the real buffer size may be much lower (as little as half), due to the way libzmq implements its queues.
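To make the block-or-drop distinction concrete, here is a toy model of a single pipe in plain C. It makes no ZeroMQ calls, and pipe_model_t, pipe_send(), and pipe_recv() are hypothetical names invented for this sketch. Real libzmq pipes are more subtle (as noted, the effective limit is not exact), so treat this as an illustration of the policy only:

```c
#include <assert.h>
#include <stddef.h>

typedef enum { HWM_DROP, HWM_BLOCK } hwm_policy_t;
typedef enum { SEND_QUEUED, SEND_DROPPED, SEND_WOULD_BLOCK } send_result_t;

typedef struct {
    size_t hwm;             //  Maximum number of queued messages
    size_t queued;          //  Messages currently waiting in the pipe
    size_t dropped;         //  Messages thrown away at the HWM
    hwm_policy_t policy;    //  What to do when the pipe is full
} pipe_model_t;

//  Try to queue one message. A dropping pipe (think PUB) discards it
//  when full; a blocking pipe (think PUSH) reports that a real sender
//  would have to wait here until the peer makes room.
static send_result_t pipe_send (pipe_model_t *pipe)
{
    assert (pipe != NULL);
    if (pipe->queued < pipe->hwm) {
        pipe->queued++;
        return SEND_QUEUED;
    }
    if (pipe->policy == HWM_DROP) {
        pipe->dropped++;
        return SEND_DROPPED;
    }
    return SEND_WOULD_BLOCK;
}

//  The peer takes one message off the pipe, making room for the sender
static void pipe_recv (pipe_model_t *pipe)
{
    assert (pipe != NULL);
    if (pipe->queued > 0)
        pipe->queued--;
}
```

Sending five messages into a dropping pipe with an HWM of three leaves three queued and two lost. The same five sends into a blocking pipe lose nothing, but the fourth cannot proceed until the receiver catches up.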

Missing Message Problem Solver

As you build applications with ZeroMQ, you will come across this problem more than once: losing messages that you expect to receive. We have put together a diagram that walks through the most common causes for this.

Figure 25 - Missing Message Problem Solver

fig25.png

Here's a summary of what the graphic says:

  • On SUB sockets, set a subscription using zmq_setsockopt() with ZMQ_SUBSCRIBE, or you won't get messages. Because you subscribe to messages by prefix, if you subscribe to "" (an empty subscription), you will get everything.
  • If you start the SUB socket (i.e., establish a connection to a PUB socket) after the PUB socket has started sending out data, you will lose whatever it published before the connection was made. If this is a problem, set up your architecture so the SUB socket starts first, then the PUB socket starts publishing.
  • Even if you synchronize a SUB and PUB socket, you may still lose messages. It's due to the fact that internal queues aren't created until a connection is actually created. If you can switch the bind/connect direction so the SUB socket binds, and the PUB socket connects, you may find it works more as you'd expect.
  • If you're using REP and REQ sockets, and you're not sticking to the synchronous send/recv/send/recv order, ZeroMQ will report errors, which you might ignore. Then, it would look like you're losing messages. If you use REQ or REP, stick to the send/recv order, and always, in real code, check for errors on ZeroMQ calls.
  • If you're using PUSH sockets, you'll find that the first PULL socket to connect will grab an unfair share of messages. The accurate rotation of messages only happens when all PULL sockets are successfully connected, which can take some milliseconds. As an alternative to PUSH/PULL, for lower data rates, consider using ROUTER/DEALER and the load balancing pattern.
  • If you're sharing sockets across threads, don't. It will lead to random weirdness, and crashes.
  • If you're using inproc, make sure both sockets are in the same context. Otherwise the connecting side will in fact fail. Also, bind first, then connect. inproc is not a disconnected transport like tcp.
  • If you're using ROUTER sockets, it's remarkably easy to lose messages by accident, by sending malformed identity frames (or forgetting to send an identity frame). In general, setting the ZMQ_ROUTER_MANDATORY option on ROUTER sockets is a good idea, but do also check the return code on every send call.
  • Lastly, if you really can't figure out what's going wrong, make a minimal test case that reproduces the problem, and ask for help from the ZeroMQ community.


Chapter 3 - Advanced Request-Reply Patterns

In Chapter 2 - Sockets and Patterns we worked through the basics of using ZeroMQ by developing a series of small applications, each time exploring new aspects of ZeroMQ. We'll continue this approach in this chapter as we explore advanced patterns built on top of ZeroMQ's core request-reply pattern.

We'll cover:

  • How the request-reply mechanisms work
  • How to combine REQ, REP, DEALER, and ROUTER sockets
  • How ROUTER sockets work, in detail
  • The load balancing pattern
  • Building a simple load balancing message broker
  • Designing a high-level API for ZeroMQ
  • Building an asynchronous request-reply server
  • A detailed inter-broker routing example

The Request-Reply Mechanisms

We already looked briefly at multipart messages. Let's now look at a major use case, which is reply message envelopes. An envelope is a way of safely packaging up data with an address, without touching the data itself. By separating reply addresses into an envelope we make it possible to write general purpose intermediaries such as APIs and proxies that create, read, and remove addresses no matter what the message payload or structure is.

In the request-reply pattern, the envelope holds the return address for replies. It is how a ZeroMQ network with no state can create round-trip request-reply dialogs.

When you use REQ and REP sockets you don't even see envelopes; these sockets deal with them automatically. But for most of the interesting request-reply patterns, you'll want to understand envelopes and particularly ROUTER sockets. We'll work through this step-by-step.

The Simple Reply Envelope

A request-reply exchange consists of a request message, and an eventual reply message. In the simple request-reply pattern, there's one reply for each request. In more advanced patterns, requests and replies can flow asynchronously. However, the reply envelope always works the same way.

The ZeroMQ reply envelope formally consists of zero or more reply addresses, followed by an empty frame (the envelope delimiter), followed by the message body (zero or more frames). The envelope is created by multiple sockets working together in a chain. We'll break this down.

We'll start by sending "Hello" through a REQ socket. The REQ socket creates the simplest possible reply envelope, which has no addresses, just an empty delimiter frame and the message frame containing the "Hello" string. This is a two-frame message.

Figure 26 - Request with Minimal Envelope

fig26.png

The REP socket does the matching work: it strips off the envelope, up to and including the delimiter frame, saves the whole envelope, and passes the "Hello" string up to the application. Thus our original Hello World example used request-reply envelopes internally, but the application never saw them.

If you spy on the network data flowing between hwclient and hwserver, this is what you'll see: every request and every reply is in fact two frames, an empty frame and then the body. It doesn't seem to make much sense for a simple REQ-REP dialog. However you'll see the reason when we explore how ROUTER and DEALER handle envelopes.

The Extended Reply Envelope

Now let's extend the REQ-REP pair with a ROUTER-DEALER proxy in the middle and see how this affects the reply envelope. This is the extended request-reply pattern we already saw in Chapter 2 - Sockets and Patterns. We can, in fact, insert any number of proxy steps. The mechanics are the same.

Figure 27 - Extended Request-Reply Pattern

fig27.png

The proxy does this, in pseudo-code:

prepare context, frontend and backend sockets
while true:
    poll on both sockets
    if frontend had input:
        read all frames from frontend
        send to backend
    if backend had input:
        read all frames from backend
        send to frontend

The ROUTER socket, unlike other sockets, tracks every connection it has, and tells the caller about these. The way it tells the caller is to stick the connection identity in front of each message received. An identity, sometimes called an address, is just a binary string with no meaning except "this is a unique handle to the connection". Then, when you send a message via a ROUTER socket, you first send an identity frame.

The zmq_socket() man page describes it thus:

When receiving messages a ZMQ_ROUTER socket shall prepend a message part containing the identity of the originating peer to the message before passing it to the application. Messages received are fair-queued from among all connected peers. When sending messages a ZMQ_ROUTER socket shall remove the first part of the message and use it to determine the identity of the peer the message shall be routed to.

As a historical note, ZeroMQ v2.2 and earlier use UUIDs as identities. ZeroMQ v3.0 and later generate a 5-byte identity by default (0 + a random 32-bit integer). There's some impact on network performance, but only when you use multiple proxy hops, which is rare. Mostly the change was to simplify building libzmq by removing the dependency on a UUID library.

Identities are a difficult concept to understand, but understanding them is essential if you want to become a ZeroMQ expert. The ROUTER socket invents a random identity for each connection with which it works. If there are three REQ sockets connected to a ROUTER socket, it will invent three random identities, one for each REQ socket.

So if we continue our worked example, let's say the REQ socket has a 3-byte identity ABC. Internally, this means the ROUTER socket keeps a hash table where it can search for ABC and find the TCP connection for the REQ socket.

When we receive the message off the ROUTER socket, we get three frames.

Figure 28 - Request with One Address

fig28.png

The core of the proxy loop is "read from one socket, write to the other", so we literally send these three frames out on the DEALER socket. If you now sniffed the network traffic, you would see these three frames flying from the DEALER socket to the REP socket. The REP socket does as before, strips off the whole envelope including the new reply address, and once again delivers the "Hello" to the caller.

Incidentally, the REP socket can only deal with one request-reply exchange at a time, which is why if you try to read multiple requests or send multiple replies without sticking to a strict recv-send cycle, it gives an error.

You should now be able to visualize the return path. When hwserver sends "World" back, the REP socket wraps that with the envelope it saved, and sends a three-frame reply message across the wire to the DEALER socket.

Figure 29 - Reply with One Address

fig29.png

Now the DEALER reads these three frames, and sends all three out via the ROUTER socket. The ROUTER takes the first frame of the message, which is the ABC identity, and looks up the connection for this. If it finds that, it then pumps the next two frames out onto the wire.

Figure 30 - Reply with Minimal Envelope

fig30.png

The REQ socket picks this message up, and checks that the first frame is the empty delimiter, which it is. The REQ socket discards that frame and passes "World" to the calling application, which prints it out to the amazement of the younger us looking at ZeroMQ for the first time.

What's This Good For?

To be honest, the use cases for strict request-reply or extended request-reply are somewhat limited. For one thing, there's no easy way to recover from common failures like the server crashing due to buggy application code. We'll see more about this in Chapter 4 - Reliable Request-Reply Patterns. However once you grasp the way these four sockets deal with envelopes, and how they talk to each other, you can do very useful things. We saw how ROUTER uses the reply envelope to decide which client REQ socket to route a reply back to. Now let's express this another way:

  • Each time ROUTER gives you a message, it tells you which peer that message came from, as an identity.
  • You can use this with a hash table (with the identity as key) to track new peers as they arrive.
  • ROUTER will route messages asynchronously to any peer connected to it, if you prefix the identity as the first frame of the message.

ROUTER sockets don't care about the whole envelope. They don't know anything about the empty delimiter. All they care about is that one identity frame that lets them figure out which connection to send a message to.
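
As a rough sketch of the second point in the list, an application could track peers in a small table keyed by the identity frame. The code is plain C with no ZeroMQ calls. The names peer_t, peer_table_t and peer_lookup, the fixed capacity, and the length bound are all assumptions made for illustration, and a linear scan stands in for a real hash table.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_PEERS     16      // Illustrative fixed capacity
#define MAX_IDENTITY  255     // Assumed upper bound on identity length

typedef struct {
    unsigned char identity [MAX_IDENTITY];
    size_t size;
    int messages_seen;        // Per-peer state kept by the application
} peer_t;

typedef struct {
    peer_t peers [MAX_PEERS];
    int count;
} peer_table_t;

// Find the peer with this identity, adding it if it is new.
// Identities are compared as binary data, not as C strings.
// Returns NULL if the identity is empty, too long, or the table is full.
static peer_t *
peer_lookup (peer_table_t *table, const void *identity, size_t size)
{
    if (size == 0 || size > MAX_IDENTITY)
        return NULL;
    for (int index = 0; index < table->count; index++) {
        peer_t *peer = &table->peers [index];
        if (peer->size == size
        &&  memcmp (peer->identity, identity, size) == 0)
            return peer;
    }
    if (table->count == MAX_PEERS)
        return NULL;
    peer_t *peer = &table->peers [table->count++];
    memcpy (peer->identity, identity, size);
    peer->size = size;
    peer->messages_seen = 0;
    return peer;
}
```

The application would call peer_lookup with the first frame of every message it reads from the ROUTER socket, and later send that same frame back as the first frame of any message meant for that peer.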

Recap of Request-Reply Sockets

Let's recap this:

  • The REQ socket sends, to the network, an empty delimiter frame in front of the message data. REQ sockets are synchronous. REQ sockets always send one request and then wait for one reply. REQ sockets talk to one peer at a time. If you connect a REQ socket to multiple peers, requests are distributed to and replies expected from each peer one turn at a time.
  • The REP socket reads and saves all identity frames up to and including the empty delimiter, then passes the following frame or frames to the caller. REP sockets are synchronous and talk to one peer at a time. If you connect a REP socket to multiple peers, requests are read from peers in fair fashion, and replies are always sent to the same peer that made the last request.
  • The DEALER socket is oblivious to the reply envelope and handles this like any multipart message. DEALER sockets are asynchronous and like PUSH and PULL combined. They distribute sent messages among all connections, and fair-queue received messages from all connections.
  • The ROUTER socket is oblivious to the reply envelope, like DEALER. It creates identities for its connections, and passes these identities to the caller as a first frame in any received message. Conversely, when the caller sends a message, it uses the first message frame as an identity to look up the connection to send to. ROUTER sockets are asynchronous.

Request-Reply Combinations

We have four request-reply sockets, each with a certain behavior. We've seen how they connect in simple and extended request-reply patterns. But these sockets are building blocks that you can use to solve many problems.

These are the legal combinations:

  • REQ to REP
  • DEALER to REP
  • REQ to ROUTER
  • DEALER to ROUTER
  • DEALER to DEALER
  • ROUTER to ROUTER

And these combinations are invalid (and I'll explain why):

  • REQ to REQ
  • REQ to DEALER
  • REP to REP
  • REP to ROUTER

Here are some tips for remembering the semantics. DEALER is like an asynchronous REQ socket, and ROUTER is like an asynchronous REP socket. Where we use a REQ socket, we can use a DEALER; we just have to read and write the envelope ourselves. Where we use a REP socket, we can stick a ROUTER; we just need to manage the identities ourselves.

Think of REQ and DEALER sockets as "clients" and REP and ROUTER sockets as "servers". Mostly, you'll want to bind REP and ROUTER sockets, and connect REQ and DEALER sockets to them. It's not always going to be this simple, but it is a clean and memorable place to start.

The REQ to REP Combination

We've already covered a REQ client talking to a REP server but let's look at one aspect: the REQ client must initiate the message flow. A REP server cannot talk to a REQ client that hasn't first sent it a request. Technically, it's not even possible, and the API also returns an EFSM error if you try it.
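
The lockstep behind that error can be pictured as a two-state machine. The following is a toy model in plain C, not the libzmq implementation: the rep_state_t type and both functions are invented for illustration, and the model assumes a request is already waiting whenever a receive is allowed.

```c
#include <assert.h>

typedef enum {
    REP_AWAITING_REQUEST,     // Must receive before it may send
    REP_OWES_REPLY            // Must send before it may receive again
} rep_state_t;

// Both functions return 0 on success, or -1 when called out of turn,
// which is the situation where the real socket reports EFSM.
static int
rep_model_recv (rep_state_t *state)
{
    if (*state != REP_AWAITING_REQUEST)
        return -1;
    *state = REP_OWES_REPLY;
    return 0;
}

static int
rep_model_send (rep_state_t *state)
{
    if (*state != REP_OWES_REPLY)
        return -1;
    *state = REP_AWAITING_REQUEST;
    return 0;
}
```

A REQ socket follows the mirror image of this machine: it starts in the state where it must send, and only then may it receive.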

The DEALER to REP Combination

Now, let's replace the REQ client with a DEALER. This gives us an asynchronous client that can talk to multiple REP servers. If we rewrote the "Hello World" client using DEALER, we'd be able to send off any number of "Hello" requests without waiting for replies.

When we use a DEALER to talk to a REP socket, we must accurately emulate the envelope that the REQ socket would have sent, or the REP socket will discard the message as invalid. So, to send a message, we:

  • Send an empty message frame with the MORE flag set; then
  • Send the message body.

And when we receive a message, we:

  • Receive the first frame and if it's not empty, discard the whole message;
  • Receive the next frame and pass that to the application.

The REQ to ROUTER Combination

In the same way that we can replace REQ with DEALER, we can replace REP with ROUTER. This gives us an asynchronous server that can talk to multiple REQ clients at the same time. If we rewrote the "Hello World" server using ROUTER, we'd be able to process any number of "Hello" requests in parallel. We saw this in the Chapter 2 - Sockets and Patterns mtserver example.

We can use ROUTER in two distinct ways:

  • As a proxy that switches messages between frontend and backend sockets.
  • As an application that reads the message and acts on it.

In the first case, the ROUTER simply reads all frames, including the artificial identity frame, and passes them on blindly. In the second case the ROUTER must know the format of the reply envelope it's being sent. As the other peer is a REQ socket, the ROUTER gets the identity frame, an empty frame, and then the data frame.

The DEALER to ROUTER Combination

Now we can switch out both REQ and REP with DEALER and ROUTER to get the most powerful socket combination, which is DEALER talking to ROUTER. It gives us asynchronous clients talking to asynchronous servers, where both sides have full control over the message formats.

Because both DEALER and ROUTER can work with arbitrary message formats, if you hope to use these safely, you have to become a little bit of a protocol designer. At the very least you must decide whether you wish to emulate the REQ/REP reply envelope. It depends on whether you actually need to send replies or not.

The DEALER to DEALER Combination

You can swap a REP with a ROUTER, but you can also swap a REP with a DEALER, if the DEALER is talking to one and only one peer.

When you replace a REP with a DEALER, your worker can suddenly go full asynchronous, sending any number of replies back. The cost is that you have to manage the reply envelopes yourself, and get them right, or nothing at all will work. We'll see a worked example later. Let's just say for now that DEALER to DEALER is one of the trickier patterns to get right, and happily it's rare that we need it.

The ROUTER to ROUTER Combination

This sounds perfect for N-to-N connections, but it's the most difficult combination to use. You should avoid it until you are well advanced with ZeroMQ. We'll see one example of it in the Freelance pattern in Chapter 4 - Reliable Request-Reply Patterns, and an alternative DEALER to ROUTER design for peer-to-peer work in Chapter 8 - A Framework for Distributed Computing.

Invalid Combinations

Mostly, trying to connect clients to clients, or servers to servers is a bad idea and won't work. However, rather than give general vague warnings, I'll explain in detail:

  • REQ to REQ: both sides want to start by sending messages to each other, and this could only work if you timed things so that both peers exchanged messages at the same time. It hurts my brain to even think about it.
  • REQ to DEALER: you could in theory do this, but it would break if you added a second REQ because DEALER has no way of sending a reply to the original peer. Thus the REQ socket would get confused, and/or return messages meant for another client.
  • REP to REP: both sides would wait for the other to send the first message.
  • REP to ROUTER: the ROUTER socket can in theory initiate the dialog and send a properly-formatted request, if it knows the REP socket has connected and it knows the identity of that connection. It's messy and adds nothing over DEALER to ROUTER.

The common thread in this valid versus invalid breakdown is that a ZeroMQ socket connection is always biased towards one peer that binds to an endpoint, and another that connects to that. Further, which side binds and which side connects is not arbitrary, but follows natural patterns. The side which we expect to "be there" binds: it'll be a server, a broker, a publisher, a collector. The side that "comes and goes" connects: it'll be clients and workers. Remembering this will help you design better ZeroMQ architectures.

Exploring ROUTER Sockets

Let's look at ROUTER sockets a little closer. We've already seen how they work by routing individual messages to specific connections. I'll explain in more detail how we identify those connections, and what a ROUTER socket does when it can't send a message.

Identities and Addresses

The identity concept in ZeroMQ refers specifically to ROUTER sockets and how they identify the connections they have to other sockets. More broadly, identities are used as addresses in the reply envelope. In most cases, the identity is arbitrary and local to the ROUTER socket: it's a lookup key in a hash table. Independently, a peer can have an address that is physical (a network endpoint like "tcp://192.168.55.117:5670") or logical (a UUID or email address or other unique key).

An application that uses a ROUTER socket to talk to specific peers can convert a logical address to an identity if it has built the necessary hash table. Because ROUTER sockets only announce the identity of a connection (to a specific peer) when that peer sends a message, you can only really reply to a message, not spontaneously talk to a peer.

This is true even if you flip the rules and make the ROUTER connect to the peer rather than wait for the peer to connect to the ROUTER. However you can force the ROUTER socket to use a logical address in place of its identity. The zmq_setsockopt reference page calls this setting the socket identity. It works as follows:

  • The peer application sets the ZMQ_IDENTITY option of its peer socket (DEALER or REQ) before binding or connecting.
  • Usually the peer then connects to the already-bound ROUTER socket. But the ROUTER can also connect to the peer.
  • At connection time, the peer socket tells the ROUTER socket, "please use this identity for this connection".
  • If the peer socket doesn't say that, the ROUTER generates its usual arbitrary random identity for the connection.
  • The ROUTER socket now provides this logical address to the application as a prefix identity frame for any messages coming in from that peer.
  • The ROUTER also expects the logical address as the prefix identity frame for any outgoing messages.

Here is a simple example of two peers that connect to a ROUTER socket, one that imposes a logical address "PEER2":


Here is what the program prints:

----------------------------------------
[005] 006B8B4567
[000]
[039] ROUTER uses a generated 5 byte identity
----------------------------------------
[005] PEER2
[000]
[038] ROUTER uses REQ's socket identity

ROUTER Error Handling

ROUTER sockets do have a somewhat brutal way of dealing with messages they can't send anywhere: they drop them silently. It's an attitude that makes sense in working code, but it makes debugging hard. The "send identity as first frame" approach is tricky enough that we often get this wrong when we're learning, and the ROUTER's stony silence when we mess up isn't very constructive.

Since ZeroMQ v3.2 there's a socket option you can set to catch this error: ZMQ_ROUTER_MANDATORY. Set that on the ROUTER socket and then when you provide an unroutable identity on a send call, the socket will signal an EHOSTUNREACH error.

The Load Balancing Pattern

Now let's look at some code. We'll see how to connect a ROUTER socket to a REQ socket, and then to a DEALER socket. These two examples follow the same logic, which is a load balancing pattern. This pattern is our first exposure to using the ROUTER socket for deliberate routing, rather than simply acting as a reply channel.

The load balancing pattern is very common and we'll see it several times in this book. It solves the main problem with simple round robin routing (as PUSH and DEALER offer) which is that round robin becomes inefficient if tasks do not all take roughly the same time.

It's the post office analogy. If you have one queue per counter, and you have some people buying stamps (a fast, simple transaction), and some people opening new accounts (a very slow transaction), then you will find stamp buyers getting unfairly stuck in queues. Just as in a post office, if your messaging architecture is unfair, people will get annoyed.

The solution in the post office is to create a single queue so that even if one or two counters get stuck with slow work, other counters will continue to serve clients on a first-come, first-served basis.
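
To put rough numbers on the analogy, here is a small simulation in plain C that compares the two queueing strategies. It is only a sketch: the function names are invented, service times are arbitrary integer units, and every customer is assumed to be waiting from time zero.

```c
#include <assert.h>

#define MAX_COUNTERS 8    // Illustrative limit

// Latest finishing time over all counters
static int
s_latest (const int *busy_until, int counters)
{
    int finish = 0;
    for (int counter = 0; counter < counters; counter++)
        if (busy_until [counter] > finish)
            finish = busy_until [counter];
    return finish;
}

// Time at which the last counter finishes when customer i is sent
// to counter i % counters, regardless of how busy that counter is.
static int
round_robin_finish (const int *durations, int tasks, int counters)
{
    assert (counters > 0 && counters <= MAX_COUNTERS);
    int busy_until [MAX_COUNTERS] = { 0 };
    for (int task = 0; task < tasks; task++)
        busy_until [task % counters] += durations [task];
    return s_latest (busy_until, counters);
}

// Same measure when customers wait in one queue and each goes to
// whichever counter becomes free first.
static int
single_queue_finish (const int *durations, int tasks, int counters)
{
    assert (counters > 0 && counters <= MAX_COUNTERS);
    int busy_until [MAX_COUNTERS] = { 0 };
    for (int task = 0; task < tasks; task++) {
        int next_free = 0;
        for (int counter = 1; counter < counters; counter++)
            if (busy_until [counter] < busy_until [next_free])
                next_free = counter;
        busy_until [next_free] += durations [task];
    }
    return s_latest (busy_until, counters);
}
```

With three counters and the service times 10, 1, 1, 10, 1, 1, 10, 1, 1, round robin sends every slow customer to the first counter and finishes at time 30, while the single queue finishes at time 13.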

One reason PUSH and DEALER use the simplistic approach is sheer performance. If you arrive in any major US airport, you'll find long queues of people waiting at immigration. The border patrol officials will send people in advance to queue up at each counter, rather than using a single queue. Having people walk fifty yards in advance saves a minute or two per passenger. And because every passport check takes roughly the same time, it's more or less fair. This is the strategy for PUSH and DEALER: send work loads ahead of time so that there is less travel distance.

This is a recurring theme with ZeroMQ: the world's problems are diverse and you can benefit from solving different problems each in the right way. The airport isn't the post office and one size fits no one, really well.

Let's return to the scenario of a worker (DEALER or REQ) connected to a broker (ROUTER). The broker has to know when the worker is ready, and keep a list of workers so that it can take the least recently used worker each time.

The solution is really simple, in fact: workers send a "ready" message when they start, and after they finish each task. The broker reads these messages one-by-one. Each time it reads a message, it is from the last used worker. And because we're using a ROUTER socket, we get an identity that we can then use to send a task back to the worker.

It's a twist on request-reply because the task is sent with the reply, and any response for the task is sent as a new request. The following code examples should make it clearer.
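
Before the full examples, here is a minimal sketch of just the bookkeeping the broker needs: a first-in, first-out queue of worker identities. It is plain C with no ZeroMQ calls, it assumes printable string identities, and the worker_queue_t type, its functions and its capacity are invented for illustration.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define MAX_WORKERS 10    // Illustrative capacity

// Ring buffer of worker identities, oldest first
typedef struct {
    char *identities [MAX_WORKERS];
    int head;             // Index of the oldest entry
    int count;
} worker_queue_t;

// Called when a worker reports in ("ready" or a finished task).
// Stores a copy of the identity; returns -1 if the queue is full
// or memory runs out.
static int
worker_queue_push (worker_queue_t *queue, const char *identity)
{
    if (queue->count == MAX_WORKERS)
        return -1;
    size_t size = strlen (identity) + 1;
    char *copy = malloc (size);
    if (copy == NULL)
        return -1;
    memcpy (copy, identity, size);
    queue->identities [(queue->head + queue->count) % MAX_WORKERS] = copy;
    queue->count++;
    return 0;
}

// Called when a task arrives. Returns the least recently used
// worker, which the caller must free, or NULL if none is ready.
static char *
worker_queue_pop (worker_queue_t *queue)
{
    if (queue->count == 0)
        return NULL;
    char *identity = queue->identities [queue->head];
    queue->head = (queue->head + 1) % MAX_WORKERS;
    queue->count--;
    return identity;
}
```

Because a worker is pushed again only after it finishes a task, the identity at the head of the queue always belongs to the worker that has been idle longest.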

ROUTER Broker and REQ Workers

Here is an example of the load balancing pattern using a ROUTER broker talking to a set of REQ workers:


The example runs for five seconds and then each worker prints how many tasks it handled. If the routing worked, we'd expect a fair distribution of work:

Completed: 20 tasks
Completed: 18 tasks
Completed: 21 tasks
Completed: 23 tasks
Completed: 19 tasks
Completed: 21 tasks
Completed: 17 tasks
Completed: 17 tasks
Completed: 25 tasks
Completed: 19 tasks

To talk to the workers in this example, we have to create a REQ-friendly envelope consisting of an identity plus an empty envelope delimiter frame.

Figure 31 - Routing Envelope for REQ

fig31.png

ROUTER Broker and DEALER Workers

Anywhere you can use REQ, you can use DEALER. There are two specific differences:

  • The REQ socket always sends an empty delimiter frame before any data frames; the DEALER does not.
  • The REQ socket will send only one message before it receives a reply; the DEALER is fully asynchronous.

The synchronous versus asynchronous behavior has no effect on our example because we're doing strict request-reply. It is more relevant when we address recovering from failures, which we'll come to in Chapter 4 - Reliable Request-Reply Patterns.

Now let's look at exactly the same example but with the REQ socket replaced by a DEALER socket:


The code is almost identical except that the worker uses a DEALER socket, and reads and writes that empty frame before the data frame. This is the approach I use when I want to keep compatibility with REQ workers.

However, remember the reason for that empty delimiter frame: it's to allow multihop extended requests that terminate in a REP socket, which uses that delimiter to split off the reply envelope so it can hand the data frames to its application.

If we never need to pass the message along to a REP socket, we can simply drop the empty delimiter frame at both sides, which makes things simpler. This is usually the design I use for pure DEALER to ROUTER protocols.

A Load Balancing Message Broker

The previous example is half-complete. It can manage a set of workers with dummy requests and replies, but it has no way to talk to clients. If we add a second frontend ROUTER socket that accepts client requests, and turn our example into a proxy that can switch messages from frontend to backend, we get a useful and reusable tiny load balancing message broker.

Figure 32 - Load Balancing Broker

fig32.png

This broker does the following:

  • Accepts connections from a set of clients.
  • Accepts connections from a set of workers.
  • Accepts requests from clients and holds these in a single queue.
  • Sends these requests to workers using the load balancing pattern.
  • Receives replies back from workers.
  • Sends these replies back to the original requesting client.

The broker code is fairly long, but worth understanding:

// Load-balancing broker
// Clients and workers are shown here in-process

#include "zhelpers.h"
#include <pthread.h>
#define NBR_CLIENTS 10
#define NBR_WORKERS 3

// Dequeue operation for queue implemented as array of anything
#define DEQUEUE(q) memmove (&(q)[0], &(q)[1], sizeof (q) - sizeof (q [0]))

// Basic request-reply client using REQ socket
// Because s_send and s_recv can't handle 0MQ binary identities, we
// set a printable text identity to allow routing.
//

static void *
client_task(void *args)
{
    void *context = zmq_ctx_new();
    void *client = zmq_socket(context, ZMQ_REQ);

#if (defined (WIN32))
    s_set_id(client, (intptr_t)args);
    zmq_connect(client, "tcp://localhost:5672"); // frontend
#else
    s_set_id(client); // Set a printable identity
    zmq_connect(client, "ipc://frontend.ipc");
#endif

    // Send request, get reply
    s_send(client, "HELLO");
    char *reply = s_recv(client);
    printf("Client: %s\n", reply);
    free(reply);
    zmq_close(client);
    zmq_ctx_destroy(context);
    return NULL;
}

// While this example runs in a single process, that is just to make
// it easier to start and stop the example. Each thread has its own
// context and conceptually acts as a separate process.
// This is the worker task, using a REQ socket to do load-balancing.
// Because s_send and s_recv can't handle 0MQ binary identities, we
// set a printable text identity to allow routing.

static void *
worker_task(void *args)
{
    void *context = zmq_ctx_new();
    void *worker = zmq_socket(context, ZMQ_REQ);

#if (defined (WIN32))
    s_set_id(worker, (intptr_t)args);
    zmq_connect(worker, "tcp://localhost:5673"); // backend
#else
    s_set_id(worker);
    zmq_connect(worker, "ipc://backend.ipc");
#endif

    // Tell broker we're ready for work
    s_send(worker, "READY");

    while (1) {
        // Read and save all frames until we get an empty frame
        // In this example there is only 1, but there could be more
        char *identity = s_recv(worker);
        char *empty = s_recv(worker);
        assert(*empty == 0);
        free(empty);

        // Get request, send reply
        char *request = s_recv(worker);
        printf("Worker: %s\n", request);
        free(request);

        s_sendmore(worker, identity);
        s_sendmore(worker, "");
        s_send(worker, "OK");
        free(identity);
    }
    zmq_close(worker);
    zmq_ctx_destroy(context);
    return NULL;
}

// This is the main task. It starts the clients and workers, and then
// routes requests between the two layers. Workers signal READY when
// they start; after that we treat them as ready when they reply with
// a response back to a client. The load-balancing data structure is
// just a queue of next available workers.

int main(void)
{
    // Prepare our context and sockets
    void *context = zmq_ctx_new();
    void *frontend = zmq_socket(context, ZMQ_ROUTER);
    void *backend = zmq_socket(context, ZMQ_ROUTER);

#if (defined (WIN32))
    zmq_bind(frontend, "tcp://*:5672"); // frontend
    zmq_bind(backend, "tcp://*:5673"); // backend
#else
    zmq_bind(frontend, "ipc://frontend.ipc");
    zmq_bind(backend, "ipc://backend.ipc");
#endif

    int client_nbr;
    for (client_nbr = 0; client_nbr < NBR_CLIENTS; client_nbr++) {
        pthread_t client;
        pthread_create(&client, NULL, client_task, (void *)(intptr_t)client_nbr);
    }
    int worker_nbr;
    for (worker_nbr = 0; worker_nbr < NBR_WORKERS; worker_nbr++) {
        pthread_t worker;
        pthread_create(&worker, NULL, worker_task, (void *)(intptr_t)worker_nbr);
    }
    // Here is the main loop for the least-recently-used queue. It has two
    // sockets; a frontend for clients and a backend for workers. It polls
    // the backend in all cases, and polls the frontend only when there are
    // one or more workers ready. This is a neat way to use 0MQ's own queues
    // to hold messages we're not ready to process yet. When we get a client
    // request, we pop the next available worker and send the request to it,
    // including the originating client identity. When a worker replies, we
    // requeue that worker and forward the reply to the original client
    // using the reply envelope.

    // Queue of available workers
    int available_workers = 0;
    char *worker_queue[10];

    while (1) {
        zmq_pollitem_t items[] = {
            { backend, 0, ZMQ_POLLIN, 0 },
            { frontend, 0, ZMQ_POLLIN, 0 }
        };
        // Poll frontend only if we have available workers
        int rc = zmq_poll(items, available_workers ? 2 : 1, -1);
        if (rc == -1)
            break; // Interrupted

        // Handle worker activity on backend
        if (items[0].revents & ZMQ_POLLIN) {
            // Queue worker identity for load-balancing
            char *worker_id = s_recv(backend);
            assert(available_workers < NBR_WORKERS);
            worker_queue[available_workers++] = worker_id;

            // Second frame is empty
            char *empty = s_recv(backend);
            assert(empty[0] == 0);
            free(empty);

            // Third frame is READY or else a client reply identity
            char *client_id = s_recv(backend);

            // If client reply, send rest back to frontend
            if (strcmp(client_id, "READY") != 0) {
                empty = s_recv(backend);
                assert(empty[0] == 0);
                free(empty);
                char *reply = s_recv(backend);
                s_sendmore(frontend, client_id);
                s_sendmore(frontend, "");
                s_send(frontend, reply);
                free(reply);
                if (--client_nbr == 0)
                    break; // Exit after N messages
            }
            free(client_id);
        }
        // Here is how we handle a client request:

        if (items[1].revents & ZMQ_POLLIN) {
            // Now get next client request, route to least recently used worker
            // Client request is [identity][empty][request]
            char *client_id = s_recv(frontend);
            char *empty = s_recv(frontend);
            assert(empty[0] == 0);
            free(empty);
            char *request = s_recv(frontend);

            s_sendmore(backend, worker_queue[0]);
            s_sendmore(backend, "");
            s_sendmore(backend, client_id);
            s_sendmore(backend, "");
            s_send(backend, request);

            free(client_id);
            free(request);

            // Dequeue and drop the next worker identity
            free(worker_queue[0]);
            DEQUEUE(worker_queue);
            available_workers--;
        }
    }
    zmq_close(frontend);
    zmq_close(backend);
    zmq_ctx_destroy(context);
    return 0;
}


The difficult part of this program is (a) the envelopes that each socket reads and writes, and (b) the load balancing algorithm. We'll take these in turn, starting with the message envelope formats.

Let's walk through a full request-reply chain from client to worker and back. In this code we set the identity of client and worker sockets to make it easier to trace the message frames. In reality, we'd allow the ROUTER sockets to invent identities for connections. Let's assume the client's identity is "CLIENT" and the worker's identity is "WORKER". The client application sends a single frame containing "Hello".

Figure 33 - Message that Client Sends

fig33.png

Because the REQ socket adds its empty delimiter frame and the ROUTER socket adds its connection identity, the proxy reads off the frontend ROUTER socket the client address, empty delimiter frame, and the data part.

Figure 34 - Message Coming in on Frontend

fig34.png

The broker sends this to the worker, prefixed by the address of the chosen worker, plus an additional empty part to keep the REQ at the other end happy.

Figure 35 - Message Sent to Backend

fig35.png

This complex envelope stack gets chewed up first by the backend ROUTER socket, which removes the first frame. Then the REQ socket in the worker removes the empty part, and provides the rest to the worker application.

Figure 36 - Message Delivered to Worker

fig36.png

The worker has to save the envelope (which is all the parts up to and including the empty message frame) and then it can do what's needed with the data part. Note that a REP socket would do this automatically, but we're using the REQ-ROUTER pattern so that we can get proper load balancing.

On the return path, the messages are the same as when they come in, i.e., the backend socket gives the broker a message in five parts, and the broker sends the frontend socket a message in three parts, and the client gets a message in one part.
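The envelope changes at each hop can be traced in plain C, without ZeroMQ. This is only a toy model, not the broker's code: msg_t, msg_push, msg_pop, and trace_request are hypothetical helpers that treat a message as a small stack of string frames, and the identities "CLIENT" and "WORKER" are the assumed ones from the walkthrough.

```c
#include <string.h>

#define MAX_FRAMES 8

//  Toy multipart message: frames [0] is the first frame on the wire
typedef struct {
    const char *frames [MAX_FRAMES];
    int size;
} msg_t;

//  Prepend a frame, as REQ and ROUTER sockets do to the envelope
static void msg_push (msg_t *msg, const char *frame)
{
    if (msg->size == MAX_FRAMES)
        return;                     //  Toy model: ignore overflow
    memmove (&msg->frames [1], &msg->frames [0],
             msg->size * sizeof (msg->frames [0]));
    msg->frames [0] = frame;
    msg->size++;
}

//  Remove and return the first frame, or NULL if the message is empty
static const char *msg_pop (msg_t *msg)
{
    if (msg->size == 0)
        return NULL;
    const char *frame = msg->frames [0];
    msg->size--;
    memmove (&msg->frames [0], &msg->frames [1],
             msg->size * sizeof (msg->frames [0]));
    return frame;
}

//  Follow a request from client application to worker application,
//  recording the number of frames seen at each hop in counts [0..3]
static void trace_request (msg_t *msg, int counts [4])
{
    counts [0] = msg->size;         //  Client application sends the data
    msg_push (msg, "");             //  Client REQ adds empty delimiter
    msg_push (msg, "CLIENT");       //  Frontend ROUTER adds client identity
    counts [1] = msg->size;         //  What the broker reads off the frontend
    msg_push (msg, "");             //  Broker adds delimiter for worker REQ
    msg_push (msg, "WORKER");       //  Broker adds chosen worker address
    counts [2] = msg->size;         //  What the broker sends to the backend
    msg_pop (msg);                  //  Backend ROUTER strips worker address
    msg_pop (msg);                  //  Worker REQ strips empty delimiter
    counts [3] = msg->size;         //  What the worker application receives
}
```

Starting from a one-frame message holding "Hello", the recorded sizes are one, three, five, and three frames, which matches Figures 33 to 36.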

Now let's look at the load balancing algorithm. It requires that both clients and workers use REQ sockets, and that workers correctly store and replay the envelope on messages they get. The algorithm is:

  • Create a pollset that always polls the backend, and polls the frontend only if there are one or more workers available.
  • Poll for activity with infinite timeout.
  • If there is activity on the backend, we either have a "ready" message or a reply for a client. In either case, we store the worker address (the first part) on our worker queue, and if the rest is a client reply, we send it back to that client via the frontend.
  • If there is activity on the frontend, we take the client request, pop the next worker (which is the least recently used), and send the request to the backend. This means sending the worker address, empty part, and then the three parts of the client request.
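The bookkeeping behind these steps is small. Here is a minimal sketch of the worker queue and the pollset rule, with no ZeroMQ calls; worker_queue_t, worker_queue_push, worker_queue_pop, and poll_item_count are hypothetical names, and the fixed sizes are arbitrary assumptions.

```c
#include <string.h>

#define MAX_WORKERS 16
#define ID_SIZE     32

//  FIFO of idle worker addresses; the head is the least recently used
typedef struct {
    char ids [MAX_WORKERS][ID_SIZE];
    int head;                       //  Index of next worker to pop
    int count;                      //  Number of idle workers
} worker_queue_t;

//  Backend activity ("ready" or a reply): the worker is idle again.
//  Returns 0 on success, -1 if the queue is full.
static int worker_queue_push (worker_queue_t *queue, const char *id)
{
    if (queue->count == MAX_WORKERS)
        return -1;
    int tail = (queue->head + queue->count) % MAX_WORKERS;
    strncpy (queue->ids [tail], id, ID_SIZE - 1);
    queue->ids [tail][ID_SIZE - 1] = 0;
    queue->count++;
    return 0;
}

//  Frontend activity: take the worker that has been idle the longest.
//  Returns NULL if no worker is idle. The pointer stays valid until
//  later pushes wrap around and reuse the slot.
static const char *worker_queue_pop (worker_queue_t *queue)
{
    if (queue->count == 0)
        return NULL;
    const char *id = queue->ids [queue->head];
    queue->head = (queue->head + 1) % MAX_WORKERS;
    queue->count--;
    return id;
}

//  Number of poll items to hand to the poller, assuming item 0 is the
//  backend and item 1 is the frontend: leave the frontend out while no
//  worker is idle, so client requests wait in the socket's own queue
static int poll_item_count (const worker_queue_t *queue)
{
    return queue->count > 0? 2: 1;
}
```

In a real broker the result of poll_item_count would be passed as the second argument to zmq_poll(), which is how the frontend gets ignored until a worker is free.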

You should now see that you can reuse and extend the load balancing algorithm with variations based on the information the worker provides in its initial "ready" message. For example, workers might start up and do a performance self test, then tell the broker how fast they are. The broker can then choose the fastest available worker rather than the oldest.
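As a rough sketch of that variation, the broker could keep a record per worker instead of a plain queue. The worker_t structure, its speed field, and fastest_idle_worker are all hypothetical; how a worker would encode its self-test score in the "ready" message is left open here.

```c
#include <stddef.h>

//  Hypothetical record built from a worker's "ready" message
typedef struct {
    const char *address;            //  Worker identity frame
    int speed;                      //  Self-test score, higher is faster
    int idle;                       //  Nonzero if waiting for work
} worker_t;

//  Return the fastest idle worker, or NULL if all workers are busy
static worker_t *fastest_idle_worker (worker_t *workers, size_t count)
{
    worker_t *best = NULL;
    for (size_t i = 0; i < count; i++) {
        if (workers [i].idle
        && (best == NULL || workers [i].speed > best->speed))
            best = &workers [i];
    }
    return best;
}
```

The linear scan is fine for a handful of workers; with several hundred per cluster, a priority queue keyed on speed would likely be a better fit.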

A High-Level API for ZeroMQ


We're going to push request-reply onto the stack and open a different area, which is the ZeroMQ API itself. There's a reason for this detour: as we write more complex examples, the low-level ZeroMQ API starts to look increasingly clumsy. Look at the core of the worker thread from our load balancing broker:

while (true) {
    // Get one address frame and empty delimiter
    char *address = s_recv (worker);
    char *empty = s_recv (worker);
    assert (*empty == 0);
    free (empty);

    // Get request, send reply
    char *request = s_recv (worker);
    printf ("Worker: %s\n", request);
    free (request);

    s_sendmore (worker, address);
    s_sendmore (worker, "");
    s_send (worker, "OK");
    free (address);
}

That code isn't even reusable because it can only handle one reply address in the envelope, and it already does some wrapping around the ZeroMQ API. If we used the libzmq simple message API this is what we'd have to write:

while (true) {
    // Get one address frame and empty delimiter
    char address [255];
    int address_size = zmq_recv (worker, address, 255, 0);
    if (address_size == -1)
        break;

    char empty [1];
    int empty_size = zmq_recv (worker, empty, 1, 0);
    assert (empty_size <= 0);
    if (empty_size == -1)
        break;

    // Get request, send reply
    char request [256];
    int request_size = zmq_recv (worker, request, 255, 0);
    if (request_size == -1)
        break;
    // zmq_recv returns the full message size even if it truncated the data
    if (request_size > 255)
        request_size = 255;
    request [request_size] = 0;
    printf ("Worker: %s\n", request);

    zmq_send (worker, address, address_size, ZMQ_SNDMORE);
    zmq_send (worker, empty, 0, ZMQ_SNDMORE);
    zmq_send (worker, "OK", 2, 0);
}

And when code is too long to write quickly, it's also too long to understand. Up until now, I've stuck to the native API because, as ZeroMQ users, we need to know that intimately. But when it gets in our way, we have to treat it as a problem to solve.

We can't of course just change the ZeroMQ API, which is a documented public contract on which thousands of people agree and depend. Instead, we construct a higher-level API on top based on our experience so far, and most specifically, our experience from writing more complex request-reply patterns.

What we want is an API that lets us receive and send an entire message in one shot, including the reply envelope with any number of reply addresses. One that lets us do what we want with the absolute least lines of code.

Making a good message API is fairly difficult. We have a problem of terminology: ZeroMQ uses "message" to describe both multipart messages, and individual message frames. We have a problem of expectations: sometimes it's natural to see message content as printable string data, sometimes as binary blobs. And we have technical challenges, especially if we want to avoid copying data around too much.

The challenge of making a good API affects all languages, though my specific use case is C. Whatever language you use, think about how you could contribute to your language binding to make it as good as (or better than) the C binding I'm going to describe.

Features of a Higher-Level API


My solution is to use three fairly natural and obvious concepts: string (already the basis for our s_send and s_recv helpers), frame (a message frame), and message (a list of one or more frames). Here is the worker code, rewritten onto an API using these concepts:

while (true) {
    zmsg_t *msg = zmsg_recv (worker);
    zframe_reset (zmsg_last (msg), "OK", 2);
    zmsg_send (&msg, worker);
}

Cutting the amount of code we need to read and write complex messages is great: the results are easy to read and understand. Let's continue this process for other aspects of working with ZeroMQ. Here's a wish list of things I'd like in a higher-level API, based on my experience with ZeroMQ so far:

  • Automatic handling of sockets. I find it cumbersome to have to close sockets manually, and to have to explicitly define the linger timeout in some (but not all) cases. It'd be great to have a way to close sockets automatically when I close the context.
  • Portable thread management. Every nontrivial ZeroMQ application uses threads, but POSIX threads aren't portable. So a decent high-level API should hide this under a portable layer.
  • Piping from parent to child threads. It's a recurrent problem: how to signal between parent and child threads. Our API should provide a ZeroMQ message pipe (using PAIR sockets and inproc automatically).
  • Portable clocks. Even getting the time to a millisecond resolution, or sleeping for some milliseconds, is not portable. Realistic ZeroMQ applications need portable clocks, so our API should provide them.
  • A reactor to replace zmq_poll(). The poll loop is simple, but clumsy. Writing a lot of these, we end up doing the same work over and over: calculating timers, and calling code when sockets are ready. A simple reactor with socket readers and timers would save a lot of repeated work.
  • Proper handling of Ctrl-C. We already saw how to catch an interrupt. It would be useful if this happened in all applications.

The CZMQ High-Level API


Turning this wish list into reality for the C language gives us CZMQ, a ZeroMQ language binding for C. This high-level binding, in fact, developed out of earlier versions of the examples. It combines nicer semantics for working with ZeroMQ with some portability layers, and (importantly for C, but less for other languages) containers like hashes and lists. CZMQ also uses an elegant object model that leads to frankly lovely code.

Here is the load balancing broker rewritten to use a higher-level API (CZMQ for the C case):



One thing CZMQ provides is clean interrupt handling. This means that Ctrl-C will cause any blocking ZeroMQ call to exit with a return code -1 and errno set to EINTR. The high-level recv methods will return NULL in such cases. So, you can cleanly exit a loop like this:

// Shows how to handle Ctrl-C

#include <stdlib.h>
#include <stdio.h>
#include <errno.h>
#include <signal.h>
#include <unistd.h>
#include <fcntl.h>

#include <zmq.h>

// Signal handling
//
// Create a self-pipe and call s_catch_signals(pipe's writefd) in your application
// at startup, and then exit your main loop if your pipe contains any data.
// Works especially well with zmq_poll.

#define S_NOTIFY_MSG " "
#define S_ERROR_MSG "Error while writing to self-pipe.\n"

static int s_fd;
static void s_signal_handler (int signal_value)
{
    int rc = write (s_fd, S_NOTIFY_MSG, sizeof(S_NOTIFY_MSG));
    if (rc != sizeof(S_NOTIFY_MSG)) {
        write (STDOUT_FILENO, S_ERROR_MSG, sizeof(S_ERROR_MSG)-1);
        exit(1);
    }
}

static void s_catch_signals (int fd)
{
    s_fd = fd;

    struct sigaction action;
    action.sa_handler = s_signal_handler;
    // Doesn't matter if SA_RESTART set because self-pipe will wake up zmq_poll
    // But setting to 0 will allow zmq_recv to be interrupted.
    action.sa_flags = 0;
    sigemptyset (&action.sa_mask);
    sigaction (SIGINT, &action, NULL);
    sigaction (SIGTERM, &action, NULL);
}

int main (void)
{
    int rc;

    void *context = zmq_ctx_new ();
    void *socket = zmq_socket (context, ZMQ_REP);
    zmq_bind (socket, "tcp://*:5555");

    int pipefds[2];
    rc = pipe(pipefds);
    if (rc != 0) {
        perror("Creating self-pipe");
        exit(1);
    }
    // Make both ends of the pipe non-blocking
    for (int i = 0; i < 2; i++) {
        int flags = fcntl(pipefds[i], F_GETFL, 0);
        if (flags < 0) {
            perror ("fcntl(F_GETFL)");
            exit(1);
        }
        rc = fcntl (pipefds[i], F_SETFL, flags | O_NONBLOCK);
        if (rc != 0) {
            perror ("fcntl(F_SETFL)");
            exit(1);
        }
    }

    s_catch_signals (pipefds[1]);

    zmq_pollitem_t items [] = {
        { 0, pipefds[0], ZMQ_POLLIN, 0 },
        { socket, 0, ZMQ_POLLIN, 0 }
    };

    while (1) {
        rc = zmq_poll (items, 2, -1);
        if (rc == 0) {
            continue;
        } else if (rc < 0) {
            if (errno == EINTR) { continue; }
            perror("zmq_poll");
            exit(1);
        }

        // Signal pipe FD
        if (items [0].revents & ZMQ_POLLIN) {
            char buffer [1];
            read (pipefds[0], buffer, 1); // clear notifying byte
            printf ("W: interrupt received, killing server…\n");
            break;
        }

        // Read socket
        if (items [1].revents & ZMQ_POLLIN) {
            char buffer [255];
            // Use non-blocking so we can continue to check self-pipe via zmq_poll
            rc = zmq_recv (socket, buffer, 255, ZMQ_DONTWAIT);
            if (rc < 0) {
                if (errno == EAGAIN) { continue; }
                if (errno == EINTR) { continue; }
                perror("recv");
                exit(1);
            }
            printf ("W: recv\n");

            // Now send message back.
            //
        }
    }

    printf ("W: cleaning up\n");
    zmq_close (socket);
    zmq_ctx_destroy (context);
    return 0;
}

Or, if you're calling zmq_poll(), test on the return code:

if (zmq_poll (items, 2, 1000 * 1000) == -1)
break; // Interrupted

The previous example still uses zmq_poll(). So how about reactors? The CZMQ zloop reactor is simple but functional. It lets you:

  • Set a reader on any socket, i.e., code that is called whenever the socket has input.
  • Cancel a reader on a socket.
  • Set a timer that goes off once or multiple times at specific intervals.
  • Cancel a timer.

zloop of course uses zmq_poll() internally. It rebuilds its poll set each time you add or remove readers, and it calculates the poll timeout to match the next timer. Then, it calls the reader and timer handlers for each socket and timer that need attention.
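To make the timeout calculation concrete, here is a minimal sketch of how a reactor might derive its poll timeout from its pending timers. This is not zloop's actual code; loop_timer_t and next_poll_timeout are hypothetical names, and times are assumed to be milliseconds from a single clock.

```c
#include <stdint.h>
#include <stddef.h>

//  Hypothetical timer record: the time at which it next fires, in msec
typedef struct {
    int64_t when;
} loop_timer_t;

//  Poll timeout in msec that wakes the loop at the earliest timer.
//  Returns -1 (wait forever) if there are no timers, and 0 (don't wait)
//  if a timer is already due.
static long next_poll_timeout (const loop_timer_t *timers, size_t count,
                               int64_t now)
{
    if (count == 0)
        return -1;
    int64_t soonest = timers [0].when;
    for (size_t i = 1; i < count; i++)
        if (timers [i].when < soonest)
            soonest = timers [i].when;
    return soonest > now? (long) (soonest - now): 0;
}
```

The -1 and 0 return values follow the zmq_poll() convention for an infinite wait and an immediate return, so the result could be passed on after scaling to whatever unit the poll call expects.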

When we use a reactor pattern, our code turns inside out. The main logic looks like this:

zloop_t *reactor = zloop_new ();
zloop_reader (reactor, self->backend, s_handle_backend, self);
zloop_start (reactor);
zloop_destroy (&reactor);

The actual handling of messages sits inside dedicated functions or methods. You may not like the style; it's a matter of taste. What it does help with is mixing timers and socket activity. In the rest of this text, we'll use zmq_poll() in simpler cases, and zloop in more complex examples.

Here is the load balancing broker rewritten once again, this time to use zloop:



Getting applications to properly shut down when you send them Ctrl-C can be tricky. If you use the zctx class it'll automatically set up signal handling, but your code still has to cooperate. You must break any loop if zmq_poll returns -1 or if any of the zstr_recv, zframe_recv, or zmsg_recv methods return NULL. If you have nested loops, it can be useful to make the outer ones conditional on !zctx_interrupted.

If you're using child threads, they won't receive the interrupt. To tell them to shut down, you can either:

  • Destroy the context, if they are sharing the same context, in which case any blocking calls they are waiting on will end with ETERM.
  • Send them shutdown messages, if they are using their own contexts. For this you'll need some socket plumbing.

The Asynchronous Client/Server Pattern


In the ROUTER to DEALER example, we saw a 1-to-N use case where one server talks asynchronously to multiple workers. We can turn this upside down to get a very useful N-to-1 architecture where various clients talk to a single server, and do this asynchronously.

Figure 37 - Asynchronous Client/Server

fig37.png

Here's how it works:

  • Clients connect to the server and send requests.
  • For each request, the server sends 0 or more replies.
  • Clients can send multiple requests without waiting for a reply.
  • Servers can send multiple replies without waiting for new requests.

Here's code that shows how this works:

// Asynchronous client-to-server (DEALER to ROUTER)
//
// While this example runs in a single process, that is to make
// it easier to start and stop the example. Each task has its own
// context and conceptually acts as a separate process.

#include "czmq.h"

// This is our client task
// It connects to the server, and then sends a request once per second
// It collects responses as they arrive, and it prints them out. We will
// run several client tasks in parallel, each with a different random ID.

static void *
client_task (void *args)
{
    zctx_t *ctx = zctx_new ();
    void *client = zsocket_new (ctx, ZMQ_DEALER);

    // Set random identity to make tracing easier
    char identity [10];
    sprintf (identity, "%04X-%04X", randof (0x10000), randof (0x10000));
    zsocket_set_identity (client, identity);
    zsocket_connect (client, "tcp://localhost:5570");

    zmq_pollitem_t items [] = { { client, 0, ZMQ_POLLIN, 0 } };
    int request_nbr = 0;
    while (true) {
        // Tick once per second, pulling in arriving messages
        int centitick;
        for (centitick = 0; centitick < 100; centitick++) {
            zmq_poll (items, 1, 10 * ZMQ_POLL_MSEC);
            if (items [0].revents & ZMQ_POLLIN) {
                zmsg_t *msg = zmsg_recv (client);
                zframe_print (zmsg_last (msg), identity);
                zmsg_destroy (&msg);
            }
        }
        zstr_sendf (client, "request #%d", ++request_nbr);
    }
    zctx_destroy (&ctx);
    return NULL;
}

// This is our server task.
// It uses the multithreaded server model to deal requests out to a pool
// of workers and route replies back to clients. One worker can handle
// one request at a time but one client can talk to multiple workers at
// once.

static void server_worker (void *args, zctx_t *ctx, void *pipe);

void *server_task (void *args)
{
    // Frontend socket talks to clients over TCP
    zctx_t *ctx = zctx_new ();
    void *frontend = zsocket_new (ctx, ZMQ_ROUTER);
    zsocket_bind (frontend, "tcp://*:5570");

    // Backend socket talks to workers over inproc
    void *backend = zsocket_new (ctx, ZMQ_DEALER);
    zsocket_bind (backend, "inproc://backend");

    // Launch pool of worker threads, precise number is not critical
    int thread_nbr;
    for (thread_nbr = 0; thread_nbr < 5; thread_nbr++)
        zthread_fork (ctx, server_worker, NULL);

    // Connect backend to frontend via a proxy
    zmq_proxy (frontend, backend, NULL);

    zctx_destroy (&ctx);
    return NULL;
}

// Each worker task works on one request at a time and sends a random number
// of replies back, with random delays between replies:

static void
server_worker (void *args, zctx_t *ctx, void *pipe)
{
    void *worker = zsocket_new (ctx, ZMQ_DEALER);
    zsocket_connect (worker, "inproc://backend");

    while (true) {
        // The DEALER socket gives us the reply envelope and message
        zmsg_t *msg = zmsg_recv (worker);
        zframe_t *identity = zmsg_pop (msg);
        zframe_t *content = zmsg_pop (msg);
        assert (content);
        zmsg_destroy (&msg);

        // Send 0..4 replies back
        int reply, replies = randof (5);
        for (reply = 0; reply < replies; reply++) {
            // Sleep for some fraction of a second
            zclock_sleep (randof (1000) + 1);
            zframe_send (&identity, worker, ZFRAME_REUSE + ZFRAME_MORE);
            zframe_send (&content, worker, ZFRAME_REUSE);
        }
        zframe_destroy (&identity);
        zframe_destroy (&content);
    }
}

// The main thread simply starts several clients and a server, and then
// waits for the server to finish.

int main (void)
{
    zthread_new (client_task, NULL);
    zthread_new (client_task, NULL);
    zthread_new (client_task, NULL);
    zthread_new (server_task, NULL);
    zclock_sleep (5 * 1000); // Run for 5 seconds then quit
    return 0;
}



The example runs in one process, with multiple threads simulating a real multiprocess architecture. When you run the example, you'll see three clients (each with a random ID), printing out the replies they get from the server. Look carefully and you'll see each client task gets 0 or more replies per request.

Some comments on this code:

  • The clients send a request once per second, and get zero or more replies back. To make this work using zmq_poll(), we can't simply poll with a 1-second timeout, or we'd end up sending a new request only one second after we received the last reply. So we poll at a high frequency (100 times at 1/100th of a second per poll), which is approximately accurate.
  • The server uses a pool of worker threads, each processing one request synchronously. It connects these to its frontend socket using an internal queue. It connects the frontend and backend sockets using a zmq_proxy() call.

Figure 38 - Detail of Asynchronous Server

fig38.png

Note that we're doing DEALER to ROUTER dialog between client and server, but internally between the server main thread and workers, we're doing DEALER to DEALER. If the workers were strictly synchronous, we'd use REP. However, because we want to send multiple replies, we need an async socket. We do not want to route replies; they always go to the single server thread that sent us the request.

Let's think about the routing envelope. The client sends a message consisting of a single frame. The server thread receives a two-frame message (the original message prefixed by the client identity). We send these two frames on to the worker, which treats them as a normal reply envelope and returns them to us as a two-frame message. We then use the first frame as an identity to route the second frame back to the client as a reply.

It looks something like this:

     client          server       frontend       worker
   [ DEALER ]<---->[ ROUTER <----> DEALER <----> DEALER ]
             1 part         2 parts       2 parts

Now for the sockets: we could use the load balancing ROUTER to DEALER pattern to talk to workers, but it's extra work. In this case, a DEALER to DEALER pattern is probably fine: the trade-off is lower latency for each request, but higher risk of unbalanced work distribution. Simplicity wins in this case.

When you build servers that maintain stateful conversations with clients, you will run into a classic problem. If the server keeps some state per client, and clients keep coming and going, eventually it will run out of resources. Even if the same clients keep connecting, if you're using default identities, each connection will look like a new one.

We cheat in the above example by keeping state only for a very short time (the time it takes a worker to process a request) and then throwing away the state. But that's not practical for many cases. To properly manage client state in a stateful asynchronous server, you have to:

  • Do heartbeating from client to server. In our example, we send a request once per second, which can reliably be used as a heartbeat.
  • Store state using the client identity (whether generated or explicit) as key.
  • Detect a stopped heartbeat. If there's no request from a client within, say, two seconds, the server can detect this and destroy any state it's holding for that client.
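These three steps can be sketched as a small table keyed by client identity. This is a minimal illustration under stated assumptions, not part of the example above: client_state_t, client_seen, and clients_expire are hypothetical names, identities are assumed to be printable strings shorter than 32 bytes (real ZeroMQ identities may be binary), and times are milliseconds from whatever clock the server uses.

```c
#include <stdint.h>
#include <string.h>

#define MAX_CLIENTS 64
#define ID_SIZE     32

typedef struct {
    char identity [ID_SIZE];        //  Client identity, used as the key
    int64_t last_seen;              //  Time of last request, in msec
    int in_use;                     //  Nonzero if this slot holds state
} client_state_t;

//  Record a request (our heartbeat) from a client at time now.
//  Returns the client's slot, or NULL if the table is full.
static client_state_t *
client_seen (client_state_t *table, const char *identity, int64_t now)
{
    client_state_t *free_slot = NULL;
    for (int i = 0; i < MAX_CLIENTS; i++) {
        if (table [i].in_use) {
            if (strcmp (table [i].identity, identity) == 0) {
                table [i].last_seen = now;
                return &table [i];
            }
        }
        else
        if (free_slot == NULL)
            free_slot = &table [i];
    }
    if (free_slot) {
        strncpy (free_slot->identity, identity, ID_SIZE - 1);
        free_slot->identity [ID_SIZE - 1] = 0;
        free_slot->last_seen = now;
        free_slot->in_use = 1;
    }
    return free_slot;
}

//  Drop state for clients that have been silent for longer than
//  timeout msec; returns the number of clients expired
static int
clients_expire (client_state_t *table, int64_t now, int64_t timeout)
{
    int expired = 0;
    for (int i = 0; i < MAX_CLIENTS; i++) {
        if (table [i].in_use && now - table [i].last_seen > timeout) {
            table [i].in_use = 0;
            expired++;
        }
    }
    return expired;
}
```

A server would call client_seen for every request it receives and clients_expire from a timer, for example once per second with a timeout of 2000 to match the two seconds suggested above.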

Worked Example: Inter-Broker Routing


Let's take everything we've seen so far, and scale things up to a real application. We'll build this step-by-step over several iterations. Our best client calls us urgently and asks for a design of a large cloud computing facility. He has this vision of a cloud that spans many data centers, each a cluster of clients and workers, and that works together as a whole. Because we're smart enough to know that practice always beats theory, we propose to make a working simulation using ZeroMQ. Our client, eager to lock down the budget before his own boss changes his mind, and having read great things about ZeroMQ on Twitter, agrees.

Establishing the Details


Several espressos later, we want to jump into writing code, but a little voice tells us to get more details before making a sensational solution to entirely the wrong problem. "What kind of work is the cloud doing?", we ask.

The client explains:

  • Workers run on various kinds of hardware, but they are all able to handle any task. There are several hundred workers per cluster, and as many as a dozen clusters in total.
  • Clients create tasks for workers. Each task is an independent unit of work and all the client wants is to find an available worker, and send it the task, as soon as possible. There will be a lot of clients and they'll come and go arbitrarily.
  • The real difficulty is to be able to add and remove clusters at any time. A cluster can leave or join the cloud instantly, bringing all its workers and clients with it.
  • If there are no workers in their own cluster, clients' tasks will go off to other available workers in the cloud.
  • Clients send out one task at a time, waiting for a reply. If they don't get an answer within X seconds, they'll just send out the task again. This isn't our concern; the client API does it already.
  • Workers process one task at a time; they are very simple beasts. If they crash, they get restarted by whatever script started them.

So we double-check to make sure that we understood this correctly:

  • "There will be some kind of super-duper network interconnect between clusters, right?", we ask. The client says, "Yes, of course, we're not idiots."
  • "What kind of volumes are we talking about?", we ask. The client replies, "Up to a thousand clients per cluster, each doing at most ten requests per second. Requests are small, and replies are also small, no more than 1K bytes each."

So we do a little calculation and see that this will work nicely over plain TCP. Taking a generous 2,500 clients (well above the thousand quoted per cluster), 2,500 clients x 10/second x 1,000 bytes x 2 directions = 50MB/sec or 400Mb/sec, not a problem for a 1Gb network.
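The same arithmetic is easy to rerun for other client counts. The helper below is only a back-of-envelope sketch; traffic_mbits is a hypothetical name, and it assumes, like the calculation above, that requests and replies are the same size and that 1K is rounded to 1,000 bytes.

```c
//  Estimated traffic in megabits per second for `clients` clients, each
//  sending `rate` requests per second, with `bytes` bytes per request
//  and the same again per reply
static double traffic_mbits (int clients, int rate, int bytes)
{
    //  Factor 2 covers both directions: request out, reply back
    double bytes_per_sec = (double) clients * rate * bytes * 2;
    return bytes_per_sec * 8 / 1e6;
}
```

With the figures used above, traffic_mbits (2500, 10, 1000) gives 400, and the thousand clients the customer actually quoted come to 160, so either way a 1Gb network has room to spare.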

It's a straightforward problem that requires no exotic hardware or protocols, just some clever routing algorithms and careful design. We start by designing one cluster (one data center) and then we figure out how to connect clusters together.

Architecture of a Single Cluster


Workers and clients are synchronous. We want to use the load balancing pattern to route tasks to workers. Workers are all identical; our facility has no notion of different services. Workers are anonymous; clients never address them directly. We make no attempt here to provide guaranteed delivery, retry, and so on.

For reasons we already examined, clients and workers won't speak to each other directly. It makes it impossible to add or remove nodes dynamically. So our basic model consists of the request-reply message broker we saw earlier.

Figure 39 - Cluster Architecture

fig39.png

Scaling to Multiple Clusters


Now we scale this out to more than one cluster. Each cluster has a set of clients and workers, and a broker that joins these together.

Figure 40 - Multiple Clusters

fig40.png

The question is: how do we get the clients of each cluster talking to the workers of the other cluster? There are a few possibilities, each with pros and cons:

  • Clients could connect directly to both brokers. The advantage is that we don't need to modify brokers or workers. But clients get more complex and become aware of the overall topology. If we want to add a third or fourth cluster, for example, all the clients are affected. In effect we have to move routing and failover logic into the clients and that's not nice.
  • Workers might connect directly to both brokers. But REQ workers can't do that; they can only reply to one broker. We might use REPs but REPs don't give us customizable broker-to-worker routing like load balancing does, only the built-in load balancing. That's a fail; if we want to distribute work to idle workers, we precisely need load balancing. One solution would be to use ROUTER sockets for the worker nodes. Let's label this "Idea #1".
  • Brokers could connect to each other. This looks neatest because it creates the fewest additional connections. We can't add clusters on the fly, but that is probably out of scope. Now clients and workers remain ignorant of the real network topology, and brokers tell each other when they have spare capacity. Let's label this "Idea #2".

Let's explore Idea #1. In this model, we have workers connecting to both brokers and accepting jobs from either one.

Figure 41 - Idea 1: Cross-connected Workers

fig41.png

It looks feasible. However, it doesn't provide what we wanted, which was that clients get local workers if possible and remote workers only if it's better than waiting. Also workers will signal "ready" to both brokers and can get two jobs at once, while other workers remain idle. It seems this design fails because again we're putting routing logic at the edges.

So, idea #2 then. We interconnect the brokers and don't touch the clients or workers, which are REQs like we're used to.

Figure 42 - Idea 2: Brokers Talking to Each Other

fig42.png

This design is appealing because the problem is solved in one place, invisible to the rest of the world. Basically, brokers open secret channels to each other and whisper, like camel traders, "Hey, I've got some spare capacity. If you have too many clients, give me a shout and we'll deal".

In effect it is just a more sophisticated routing algorithm: brokers become subcontractors for each other. There are other things to like about this design, even before we play with real code:

  • It treats the common case (clients and workers on the same cluster) as default and does extra work for the exceptional case (shuffling jobs between clusters).
  • It lets us use different message flows for the different types of work. That means we can handle them differently, e.g., using different types of network connection.
  • It feels like it would scale smoothly. Interconnecting three or more brokers doesn't get overly complex. If we find this to be a problem, it's easy to solve by adding a super-broker.

We'll now make a worked example. We'll pack an entire cluster into one process. That is obviously not realistic, but it makes it simple to simulate, and the simulation can accurately scale to real processes. This is the beauty of ZeroMQ: you can design at the micro-level and scale that up to the macro-level. Threads become processes, and then become boxes and the patterns and logic remain the same. Each of our "cluster" processes contains client threads, worker threads, and a broker thread.

We know the basic model well by now:

  • The client (REQ) threads create workloads and pass them to the broker (ROUTER).
  • The worker (REQ) threads process workloads and return the results to the broker (ROUTER).
  • The broker queues and distributes workloads using the load balancing pattern.

Federation Versus Peering


There are several possible ways to interconnect brokers. What we want is to be able to tell other brokers, "we have capacity", and then receive multiple tasks. We also need to be able to tell other brokers, "stop, we're full". It doesn't need to be perfect; sometimes we may accept jobs we can't process immediately, then we'll do them as soon as possible.

The simplest interconnect is federation, in which brokers simulate clients and workers for each other. We would do this by connecting our frontend to the other broker's backend socket. Note that it is legal to both bind a socket to an endpoint and connect it to other endpoints.

Figure 43 - Cross-connected Brokers in Federation Model

fig43.png

This would give us simple logic in both brokers and a reasonably good mechanism: when there are no workers, tell the other broker "ready", and accept one job from it. The catch is that it is too simple for this problem. A federated broker would be able to handle only one task at a time. If the broker emulates a lock-step client and worker, it is by definition also going to be lock-step, and if it has lots of available workers they won't be used. Our brokers need to be connected in a fully asynchronous fashion.

The federation model is perfect for other kinds of routing, especially service-oriented architectures (SOAs), which route by service name and proximity rather than load balancing or round robin. So don't dismiss it as useless; it's just not right for all use cases.

Instead of federation, let's look at a peering approach in which brokers are explicitly aware of each other and talk over privileged channels. Let's break this down, assuming we want to interconnect N brokers. Each broker has (N - 1) peers, and all brokers are using exactly the same code and logic. There are two distinct flows of information between brokers:

  • Each broker needs to tell its peers how many workers it has available at any time. This can be fairly simple information, just a quantity that is updated regularly. The obvious (and correct) socket pattern for this is pub-sub. So every broker opens a PUB socket and publishes state information on that, and every broker also opens a SUB socket and connects that to the PUB socket of every other broker to get state information from its peers.
  • Each broker needs a way to delegate tasks to a peer and get replies back, asynchronously. We'll do this using ROUTER sockets; no other combination works. Each broker has two such sockets: one for tasks it receives and one for tasks it delegates. If we didn't use two sockets, it would be more work to know whether we were reading a request or a reply each time. That would mean adding more information to the message envelope.

And there is also the flow of information between a broker and its local clients and workers.
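On the receiving side, the state flow boils down to remembering the last worker count each peer published. Here is a minimal sketch of that bookkeeping with the sockets left out; peer_t, peer_table_t, peer_update, and peer_with_capacity are hypothetical names, and picking the peer with the most idle workers is just one possible policy assumed for illustration.

```c
#include <string.h>

#define MAX_PEERS 16
#define NAME_SIZE 32

//  What we know about one peer broker, from its last state message
typedef struct {
    char name [NAME_SIZE];
    int available;                  //  Idle workers the peer last reported
} peer_t;

typedef struct {
    peer_t peers [MAX_PEERS];
    int count;
} peer_table_t;

//  Apply a state update, as it might arrive on the SUB socket.
//  Returns 0 on success, -1 if the table is full.
static int peer_update (peer_table_t *table, const char *name, int available)
{
    for (int i = 0; i < table->count; i++) {
        if (strcmp (table->peers [i].name, name) == 0) {
            table->peers [i].available = available;
            return 0;
        }
    }
    if (table->count == MAX_PEERS)
        return -1;
    peer_t *peer = &table->peers [table->count++];
    strncpy (peer->name, name, NAME_SIZE - 1);
    peer->name [NAME_SIZE - 1] = 0;
    peer->available = available;
    return 0;
}

//  Choose a peer to delegate a task to: the one reporting the most
//  idle workers. Returns NULL if no peer has spare capacity.
static const char *peer_with_capacity (const peer_table_t *table)
{
    const peer_t *best = NULL;
    for (int i = 0; i < table->count; i++) {
        const peer_t *peer = &table->peers [i];
        if (peer->available > 0
        && (best == NULL || peer->available > best->available))
            best = peer;
    }
    return best? best->name: NULL;
}
```

Because the published figure is only updated periodically, it can be stale by the time a task is delegated, which is why the design above accepts that a broker may sometimes take jobs it can't process immediately.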

The Naming Ceremony


Three flows x two sockets for each flow = six sockets that we have to manage in the broker. Choosing good names is vital to keeping a multisocket juggling act reasonably coherent in our minds. Sockets do something and what they do should form the basis for their names. It's about being able to read the code several weeks later on a cold Monday morning before coffee, and not feel any pain.

Let's do a shamanistic naming ceremony for the sockets. The three flows are:

  • A local request-reply flow between the broker and its clients and workers.
  • A cloud request-reply flow between the broker and its peer brokers.
  • A state flow between the broker and its peer brokers.

Finding meaningful names that are all the same length means our code will align nicely. It's not a big thing, but attention to details helps. For each flow the broker has two sockets that we can orthogonally call the frontend and backend. We've used these names quite often. A frontend receives information or tasks. A backend sends those out to other peers. The conceptual flow is from front to back (with replies going in the opposite direction from back to front).

So in all the code we write for this tutorial, we will use these socket names:

  • localfe and localbe for the local flow.
  • cloudfe and cloudbe for the cloud flow.
  • statefe and statebe for the state flow.

For our transport and because we're simulating the whole thing on one box, we'll use ipc for everything. This has the advantage of working like tcp in terms of connectivity (i.e., it's a disconnected transport, unlike inproc), yet we don't need IP addresses or DNS names, which would be a pain here. Instead, we will use ipc endpoints called something-local, something-cloud, and something-state, where something is the name of our simulated cluster.
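The naming scheme is easy to capture in a small helper. This is only a sketch: the function name s_endpoint and the ".ipc" suffix are assumptions made for illustration, not something the text above prescribes.

```c
#include <stdio.h>

// Hypothetical helper: writes "ipc://<cluster>-<flow>.ipc" into buffer.
// Returns 0 on success, or -1 if the buffer is too small.
static int
s_endpoint (char *buffer, size_t size, const char *cluster, const char *flow)
{
    int length = snprintf (buffer, size, "ipc://%s-%s.ipc", cluster, flow);
    return (length < 0 || (size_t) length >= size)? -1: 0;
}
```

A broker named DC1 would call this three times, with "local", "cloud", and "state" as the flow, and pass the results to its bind and connect calls.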

You might be thinking that this is a lot of work for some names. Why not call them s1, s2, s3, s4, etc.? The answer is that if your brain is not a perfect machine, you need a lot of help when reading code, and we'll see that these names do help. It's easier to remember "three flows, two directions" than "six different sockets".

Figure 44 - Broker Socket Arrangement

fig44.png

Note that we connect the cloudbe in each broker to the cloudfe in every other broker, and likewise we connect the statebe in each broker to the statefe in every other broker.

Prototyping the State Flow

Because each socket flow has its own little traps for the unwary, we will test them in real code one-by-one, rather than try to throw the whole lot into code in one go. When we're happy with each flow, we can put them together into a full program. We'll start with the state flow.

Figure 45 - The State Flow

fig45.png

Here is how this works in code:


C# | Clojure | Delphi | F# | Go | Haskell | Haxe | Java | Lua | Node.js | PHP | Python | Racket | Ruby | Scala | Tcl | Ada | Basic | C++ | CL | Erlang | Felix | Objective-C | ooc | Perl | Q

Notes about this code:

  • Each broker has an identity that we use to construct ipc endpoint names. A real broker would need to work with TCP and a more sophisticated configuration scheme. We'll look at such schemes later in this book, but for now, using generated ipc names lets us ignore the problem of where to get TCP/IP addresses or names.
  • We use a zmq_poll() loop as the core of the program. This processes incoming messages and sends out state messages. We send a state message only if we did not get any incoming messages and we waited for a second. If we send out a state message each time we get one in, we'll get message storms.
  • We use a two-part pub-sub message consisting of sender address and data. Note that we will need to know the address of the publisher in order to send it tasks, and the only way is to send this explicitly as a part of the message.
  • We don't set identities on subscribers because if we did then we'd get outdated state information when connecting to running brokers.
  • We don't set a HWM on the publisher, but if we were using ZeroMQ v2.x that would be a wise idea.

We can build this little program and run it three times to simulate three clusters. Let's call them DC1, DC2, and DC3 (the names are arbitrary). We run these three commands, each in a separate window:

peering1 DC1 DC2 DC3  #  Start DC1 and connect to DC2 and DC3
peering1 DC2 DC1 DC3  #  Start DC2 and connect to DC1 and DC3
peering1 DC3 DC1 DC2  #  Start DC3 and connect to DC1 and DC2

You'll see each cluster report the state of its peers, and after a few seconds they will all happily be printing random numbers once per second. Try this and satisfy yourself that the three brokers all match up and synchronize to per-second state updates.

In real life, we'd not send out state messages at regular intervals, but rather whenever we had a state change, i.e., whenever a worker becomes available or unavailable. That may seem like a lot of traffic, but state messages are small and we've established that the inter-cluster connections are super fast.

If we wanted to send state messages at precise intervals, we'd create a child thread and open the statebe socket in that thread. We'd then send irregular state updates to that child thread from our main thread and allow the child thread to conflate them into regular outgoing messages. This is more work than we need here.
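To make "conflate" concrete, here is a minimal sketch of the bookkeeping such a child thread might do, with the sockets and the timer left out. The conflater_t type and its two functions are hypothetical names, and we assume the state is a single integer (the number of available workers).

```c
#include <stdbool.h>

typedef struct {
    int latest;         // Most recent state value seen
    bool has_value;     // False until the first update arrives
} conflater_t;

// Called for every irregular update from the main thread;
// newer values simply overwrite older ones.
static void
conflater_update (conflater_t *self, int value)
{
    self->latest = value;
    self->has_value = true;
}

// Called once per regular interval. Returns true and stores the value
// to publish in *out, or returns false if nothing has arrived yet.
static bool
conflater_tick (const conflater_t *self, int *out)
{
    if (!self->has_value)
        return false;
    *out = self->latest;
    return true;
}
```

However many updates arrive between two ticks, only the last one goes out, which is what keeps the outgoing rate regular.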

Prototyping the Local and Cloud Flows

Let's now prototype the flow of tasks via the local and cloud sockets. This code pulls requests from clients and then distributes them to local workers and cloud peers on a random basis.

Figure 46 - The Flow of Tasks

fig46.png

Before we jump into the code, which is getting a little complex, let's sketch the core routing logic and break it down into a simple yet robust design.

We need two queues, one for requests from local clients and one for requests from cloud clients. One option would be to pull messages off the local and cloud frontends, and pump these onto their respective queues. But this is kind of pointless because ZeroMQ sockets are queues already. So let's use the ZeroMQ socket buffers as queues.

This was the technique we used in the load balancing broker, and it worked nicely. We only read from the two frontends when there is somewhere to send the requests. We can always read from the backends, as they give us replies to route back. As long as the backends aren't talking to us, there's no point in even looking at the frontends.

So our main loop becomes:

  • Poll the backends for activity. When we get a message, it may be "ready" from a worker or it may be a reply. If it's a reply, route back via the local or cloud frontend.
  • If a worker replied, it became available, so we queue it and count it.
  • While there are workers available, take a request, if any, from either frontend and route to a local worker, or randomly, to a cloud peer.

Randomly sending tasks to a peer broker rather than a worker simulates work distribution across the cluster. It's dumb, but that is fine for this stage.
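The routing decision in the last step can be sketched as a pure function, with the sockets left out. Everything here is illustrative: the names are hypothetical, the one-in-five odds are an assumed value, and the reroutable flag reflects an assumption that only requests from local clients should be handed to the cloud, so a request never bounces between brokers.

```c
typedef enum { ROUTE_NONE, ROUTE_LOCAL, ROUTE_CLOUD } route_t;

// Decide where the next request should go.
//   local_capacity  number of idle local workers
//   peer_count      number of peer brokers we know about
//   reroutable      nonzero if the request came from a local client
//   dice            a random number, passed in so the logic is testable
static route_t
s_choose_route (int local_capacity, int peer_count,
                int reroutable, unsigned dice)
{
    if (local_capacity <= 0)
        return ROUTE_NONE;      // Leave the request queued on its socket
    if (reroutable && peer_count > 0 && dice % 5 == 0)
        return ROUTE_CLOUD;     // Roughly one request in five goes out
    return ROUTE_LOCAL;
}
```

Returning ROUTE_NONE when there is no capacity matches the rule above: we don't even read from the frontends until there is somewhere to send the request.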

We use broker identities to route messages between brokers. Each broker has a name that we provide on the command line in this simple prototype. As long as these names don't overlap with the ZeroMQ-generated UUIDs used for client nodes, we can figure out whether to route a reply back to a client or to a broker.
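A minimal sketch of that check, assuming the reply's address frame is available as raw bytes and the peer names come from the command line. The function name s_is_peer is hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

// Returns true if the address frame (size bytes, not null-terminated)
// matches the name of one of our peer brokers.
static bool
s_is_peer (const void *address, size_t size,
           const char *peers [], size_t peer_count)
{
    for (size_t i = 0; i < peer_count; i++)
        if (strlen (peers [i]) == size
        &&  memcmp (address, peers [i], size) == 0)
            return true;
    return false;
}
```

If the address matches a peer, the reply goes out via the cloud frontend; otherwise we treat it as a client identity and send it via the local frontend. The comparison uses memcmp with an explicit length because identities are binary and may contain zero bytes.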

Here is how this works in code. The interesting part starts around the comment "Interesting part".


C# | Delphi | F# | Go | Haskell | Haxe | Java | Lua | PHP | Python | Ruby | Scala | Tcl | Ada | Basic | C++ | Clojure | CL | Erlang | Felix | Node.js | Objective-C | ooc | Perl | Q | Racket

Run this by, for instance, starting two instances of the broker in two windows:

peering2 me you
peering2 you me

Some comments on this code:

  • In the C code at least, using the zmsg class makes life much easier, and our code much shorter. It's obviously an abstraction that works. If you build ZeroMQ applications in C, you should use CZMQ.
  • Because we're not getting any state information from peers, we naively assume they are running. The code prompts you to confirm when you've started all the brokers. In the real case, we'd not send anything to brokers who had not told us they exist.

You can satisfy yourself that the code works by watching it run forever. If there were any misrouted messages, clients would end up blocking, and the brokers would stop printing trace information. You can prove that by killing either of the brokers. The other broker tries to send requests to the cloud, and one-by-one its clients block, waiting for an answer.

Putting it All Together

Let's put this together into a single package. As before, we'll run an entire cluster as one process. We're going to take the two previous examples and merge them into one properly working design that lets you simulate any number of clusters.

This code is the size of both previous prototypes together, at 270 LoC. That's pretty good for a simulation of a cluster that includes clients and workers and cloud workload distribution. Here is the code:


Delphi | F# | Go | Haskell | Haxe | Java | Lua | PHP | Python | Ruby | Tcl | Ada | Basic | C++ | C# | Clojure | CL | Erlang | Felix | Node.js | Objective-C | ooc | Perl | Q | Racket | Scala

It's a nontrivial program and took about a day to get working. These are the highlights:

  • The client threads detect and report a failed request. They do this by polling for a response and if none arrives after a while (10 seconds), printing an error message.
  • Client threads don't print directly, but instead send a message to a monitor socket (PUSH) that the main loop collects (PULL) and prints off. This is the first case we've seen of using ZeroMQ sockets for monitoring and logging; this is a big use case that we'll come back to later.
  • Clients simulate varying loads to push the cluster to 100% load at random moments, so that tasks are shifted over to the cloud. The number of clients and workers, and delays in the client and worker threads control this. Feel free to play with them to see if you can make a more realistic simulation.
  • The main loop uses two pollsets. It could in fact use three: information, backends, and frontends. As in the earlier prototype, there is no point in taking a frontend message if there is no backend capacity.

These are some of the problems that arose during development of this program:

  • Clients would freeze, due to requests or replies getting lost somewhere. Recall that the ROUTER socket drops messages it can't route. The first tactic here was to modify the client thread to detect and report such problems. Secondly, I put zmsg_dump() calls after every receive and before every send in the main loop, until the origin of the problems was clear.
  • The main loop was mistakenly reading from more than one ready socket. This caused the first message to be lost. I fixed that by reading only from the first ready socket.
  • The zmsg class was not properly encoding UUIDs as C strings. This caused UUIDs that contain 0 bytes to be corrupted. I fixed that by modifying zmsg to encode UUIDs as printable hex strings.
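The UUID problem in the last point is worth a small illustration. A binary identity that contains a zero byte gets cut short as soon as it is treated as a C string, whereas a hex encoding has no such trouble. The function below is a sketch of the idea with a hypothetical name; it is not the actual zmsg code.

```c
#include <stddef.h>

// Encode size bytes of binary data as printable upper-case hex.
// The out buffer must hold at least size * 2 + 1 characters.
static void
s_encode_hex (const unsigned char *data, size_t size, char *out)
{
    static const char digits [] = "0123456789ABCDEF";
    for (size_t i = 0; i < size; i++) {
        out [i * 2]     = digits [data [i] >> 4];   // High nibble
        out [i * 2 + 1] = digits [data [i] & 15];   // Low nibble
    }
    out [size * 2] = 0;     // Terminate the string
}
```

The result is always a safe, printable C string of exactly twice the input length, whatever bytes the identity contains.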

This simulation does not detect disappearance of a cloud peer. If you start several peers and stop one, and it was broadcasting capacity to the others, they will continue to send it work even if it's gone. You can try this, and you will get clients that complain of lost requests. The solution is twofold: first, only keep the capacity information for a short time so that if a peer does disappear, its capacity is quickly set to zero. Second, add reliability to the request-reply chain. We'll look at reliability in the next chapter.
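The first half of that solution might look like this in outline. It's a sketch under assumptions: the peer_state_t type, the function names, and the two-second lifetime are all invented for illustration, and time is passed in as a millisecond count so the logic stays independent of any clock API.

```c
#include <stdint.h>

#define CAPACITY_TTL    2000    // msecs; an assumed value

typedef struct {
    int capacity;           // Last advertised number of idle workers
    int64_t expires_at;     // After this time the figure is stale
} peer_state_t;

// Record a state message received from the peer at time 'now'
static void
s_peer_update (peer_state_t *self, int capacity, int64_t now)
{
    self->capacity = capacity;
    self->expires_at = now + CAPACITY_TTL;
}

// Capacity we are willing to rely on at time 'now'
static int
s_peer_capacity (const peer_state_t *self, int64_t now)
{
    return now < self->expires_at? self->capacity: 0;
}
```

A peer that stops publishing simply ages out: once its last advertisement expires, the broker sees zero capacity and stops sending it work.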


Chapter 4 - Reliable Request-Reply Patterns

Chapter 3 - Advanced Request-Reply Patterns covered advanced uses of ZeroMQ's request-reply pattern with working examples. This chapter looks at the general question of reliability and builds a set of reliable messaging patterns on top of ZeroMQ's core request-reply pattern.

In this chapter, we focus heavily on user-space request-reply patterns, reusable models that help you design your own ZeroMQ architectures:

  • The Lazy Pirate pattern: reliable request-reply from the client side
  • The Simple Pirate pattern: reliable request-reply using load balancing
  • The Paranoid Pirate pattern: reliable request-reply with heartbeating
  • The Majordomo pattern: service-oriented reliable queuing
  • The Titanic pattern: disk-based/disconnected reliable queuing
  • The Binary Star pattern: primary-backup server failover
  • The Freelance pattern: brokerless reliable request-reply

What is "Reliability"?

Most people who speak of "reliability" don't really know what they mean. We can only define reliability in terms of failure. That is, if we can handle a certain set of well-defined and understood failures, then we are reliable with respect to those failures. No more, no less. So let's look at the possible causes of failure in a distributed ZeroMQ application, in roughly descending order of probability:

  • Application code is the worst offender. It can crash and exit, freeze and stop responding to input, run too slowly for its input, exhaust all memory, and so on.
  • System code, such as brokers we write using ZeroMQ, can die for the same reasons as application code. System code should be more reliable than application code, but it can still crash and burn, and especially run out of memory if it tries to queue messages for slow clients.
  • Message queues can overflow, typically in system code that has learned to deal brutally with slow clients. When a queue overflows, it starts to discard messages. So we get "lost" messages.
  • Networks can fail (e.g., WiFi gets switched off or goes out of range). ZeroMQ will automatically reconnect in such cases, but in the meantime, messages may get lost.
  • Hardware can fail and take with it all the processes running on that box.
  • Networks can fail in exotic ways, e.g., some ports on a switch may die and those parts of the network become inaccessible.
  • Entire data centers can be struck by lightning, earthquakes, fire, or more mundane power or cooling failures.

To make a software system fully reliable against all of these possible failures is an enormously difficult and expensive job and goes beyond the scope of this book.

Because the first five cases in the above list cover 99.9% of real world requirements outside large companies (according to a highly scientific study I just ran, which also told me that 78% of statistics are made up on the spot, and moreover never to trust a statistic that we didn't falsify ourselves), that's what we'll examine. If you're a large company with money to spend on the last two cases, contact my company immediately! There's a large hole behind my beach house waiting to be converted into an executive swimming pool.

Designing Reliability

So to make things brutally simple, reliability is "keeping things working properly when code freezes or crashes", a situation we'll shorten to "dies". However, the things we want to keep working properly are more complex than just messages. We need to take each core ZeroMQ messaging pattern and see how to make it work (if we can) even when code dies.

Let's take them one-by-one:

  • Request-reply: if the server dies (while processing a request), the client can figure that out because it won't get an answer back. Then it can give up in a huff, wait and try again later, find another server, and so on. As for the client dying, we can brush that off as "someone else's problem" for now.
  • Pub-sub: if the client dies (having gotten some data), the server doesn't know about it. Pub-sub doesn't send any information back from client to server. But the client can contact the server out-of-band, e.g., via request-reply, and ask, "please resend everything I missed". As for the server dying, that's out of scope for here. Subscribers can also self-verify that they're not running too slowly, and take action (e.g., warn the operator and die) if they are.
  • Pipeline: if a worker dies (while working), the ventilator doesn't know about it. Pipelines, like the grinding gears of time, only work in one direction. But the downstream collector can detect that one task didn't get done, and send a message back to the ventilator saying, "hey, resend task 324!" If the ventilator or collector dies, whatever upstream client originally sent the work batch can get tired of waiting and resend the whole lot. It's not elegant, but system code should really not die often enough to matter.

In this chapter we'll focus just on request-reply, which is the low-hanging fruit of reliable messaging.

The basic request-reply pattern (a REQ client socket doing a blocking send/receive to a REP server socket) scores low on handling the most common types of failure. If the server crashes while processing the request, the client just hangs forever. If the network loses the request or the reply, the client hangs forever.

Request-reply is still much better than TCP, thanks to ZeroMQ's ability to reconnect peers silently, to load balance messages, and so on. But it's still not good enough for real work. The only case where you can really trust the basic request-reply pattern is between two threads in the same process where there's no network or separate server process to die.

However, with a little extra work, this humble pattern becomes a good basis for real work across a distributed network, and we get a set of reliable request-reply (RRR) patterns that I like to call the Pirate patterns (you'll eventually get the joke, I hope).

There are, in my experience, roughly three ways to connect clients to servers. Each needs a specific approach to reliability:

  • Multiple clients talking directly to a single server. Use case: a single well-known server to which clients need to talk. Types of failure we aim to handle: server crashes and restarts, and network disconnects.
  • Multiple clients talking to a broker proxy that distributes work to multiple workers. Use case: service-oriented transaction processing. Types of failure we aim to handle: worker crashes and restarts, worker busy looping, worker overload, queue crashes and restarts, and network disconnects.
  • Multiple clients talking to multiple servers with no intermediary proxies. Use case: distributed services such as name resolution. Types of failure we aim to handle: service crashes and restarts, service busy looping, service overload, and network disconnects.

Each of these approaches has its trade-offs and often you'll mix them. We'll look at all three in detail.

Client-Side Reliability (Lazy Pirate Pattern)

We can get very simple reliable request-reply with some changes to the client. We call this the Lazy Pirate pattern. Rather than doing a blocking receive, we:

  • Poll the REQ socket and receive from it only when we're sure a reply has arrived.
  • Resend a request, if no reply has arrived within a timeout period.
  • Abandon the transaction if there is still no reply after several requests.
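Stripped of all socket handling, the three steps above come down to a bounded retry loop. The sketch below simulates it with a hypothetical attempt hook standing in for "send the request, then poll for a reply until the timeout"; the names, the hook signature, and the fake server are assumptions for illustration only.

```c
#include <stdbool.h>

// Hypothetical transport hook: sends request number 'sequence' and
// waits for the timeout period; returns true if a reply arrived in time.
typedef bool (attempt_fn) (int sequence, void *arg);

// Try a request up to 'retries' times. Returns the number of attempts
// used on success, or 0 if we abandon the transaction.
static int
s_lazy_pirate_request (int sequence, int retries,
                       attempt_fn *attempt, void *arg)
{
    for (int tries = 1; tries <= retries; tries++) {
        if (attempt (sequence, arg))
            return tries;
        // No reply in time. In a real client, this is where the REQ
        // socket would be closed and reopened before resending.
    }
    return 0;
}

// Stand-in for a misbehaving server: ignores the first few requests.
// 'arg' points to an int holding the number of failures still to come.
static bool
s_flaky_server (int sequence, void *arg)
{
    int *failures_left = (int *) arg;
    (void) sequence;
    if (*failures_left > 0) {
        (*failures_left)--;
        return false;
    }
    return true;
}
```

With three retries, a server that drops two requests is survived on the third attempt; one that drops three is reported as offline.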

If you try to use a REQ socket in anything other than a strict send/receive fashion, you'll get an error (technically, the REQ socket implements a small finite-state machine to enforce the send/receive ping-pong, and so the error code is called "EFSM"). This is slightly annoying when we want to use REQ in a pirate pattern, because we may send several requests before getting a reply.

The pretty good brute force solution is to close and reopen the REQ socket after an error:


C++ | C# | Clojure | Delphi | Go | Haskell | Haxe | Java | Lua | Perl | PHP | Python | Ruby | Tcl | Ada | Basic | CL | Erlang | F# | Felix | Node.js | Objective-C | ooc | Q | Racket | Scala

Run this together with the matching server:


C++ | C# | Clojure | Delphi | Go | Haskell | Haxe | Java | Lua | Perl | PHP | Python | Ruby | Scala | Tcl | Ada | Basic | CL | Erlang | F# | Felix | Node.js | Objective-C | ooc | Q | Racket

Figure 47 - The Lazy Pirate Pattern

fig47.png

To run this test case, start the client and the server in two console windows. The server will randomly misbehave after a few messages. You can check the client's response. Here is typical output from the server:

I: normal request (1)
I: normal request (2)
I: normal request (3)
I: simulating CPU overload
I: normal request (4)
I: simulating a crash

And here is the client's response:

I: connecting to server...
I: server replied OK (1)
I: server replied OK (2)
I: server replied OK (3)
W: no response from server, retrying...
I: connecting to server...
W: no response from server, retrying...
I: connecting to server...
E: server seems to be offline, abandoning

The client sequences each message and checks that replies come back exactly in order: that no requests or replies are lost, and no replies come back more than once, or out of order. Run the test a few times until you're convinced that this mechanism actually works. You don't need sequence numbers in a production application; they just help us trust our design.

The client uses a REQ socket, and does the brute force close/reopen because REQ sockets impose that strict send/receive cycle. You might be tempted to use a DEALER instead, but it would not be a good decision. First, it would mean emulating the secret sauce that REQ does with envelopes (if you've forgotten what that is, it's a good sign you don't want to have to do it). Second, it would mean potentially getting back replies that you didn't expect.

Handling failures only at the client works when we have a set of clients talking to a single server. It can handle a server crash, but only if recovery means restarting that same server. If there's a permanent error, such as a dead power supply on the server hardware, this approach won't work. Because the application code in servers is usually the biggest source of failures in any architecture, depending on a single server is not a great idea.

So, pros and cons:

  • Pro: simple to understand and implement.
  • Pro: works easily with existing client and server application code.
  • Pro: ZeroMQ automatically retries the actual reconnection until it works.
  • Con: doesn't fail over to backup or alternate servers.

Basic Reliable Queuing (Simple Pirate Pattern)

Our second approach extends the Lazy Pirate pattern with a queue proxy that lets us talk, transparently, to multiple servers, which we can more accurately call "workers". We'll develop this in stages, starting with a minimal working model, the Simple Pirate pattern.

In all these Pirate patterns, workers are stateless. If the application requires some shared state, such as a shared database, we don't know about it as we design our messaging framework. Having a queue proxy means workers can come and go without clients knowing anything about it. If one worker dies, another takes over. This is a nice, simple topology with only one real weakness, namely the central queue itself, which can become a problem to manage, and a single point of failure.

Figure 48 - The Simple Pirate Pattern

fig48.png

The basis for the queue proxy is the load balancing broker from Chapter 3 - Advanced Request-Reply Patterns. What is the very minimum we need to do to handle dead or blocked workers? Turns out, it's surprisingly little. We already have a retry mechanism in the client. So using the load balancing pattern will work pretty well. This fits with ZeroMQ's philosophy that we can extend a peer-to-peer pattern like request-reply by plugging naive proxies in the middle.

We don't need a special client; we're still using the Lazy Pirate client. Here is the queue, which is identical to the main task of the load balancing broker:


C++ | C# | Clojure | Delphi | Go | Haskell | Haxe | Java | Lua | PHP | Python | Tcl | Ada | Basic | CL | Erlang | F# | Felix | Node.js | Objective-C | ooc | Perl | Q | Racket | Ruby | Scala

Here is the worker, which takes the Lazy Pirate server and adapts it for the load balancing pattern (using the REQ "ready" signaling):


C++ | C# | Clojure | Delphi | Go | Haskell | Haxe | Java | Lua | PHP | Python | Tcl | Ada | Basic | CL | Erlang | F# | Felix | Node.js | Objective-C | ooc | Perl | Q | Racket | Ruby | Scala

To test this, start a handful of workers, a Lazy Pirate client, and the queue, in any order. You'll see that the workers eventually all crash and burn, and the client retries and then gives up. The queue never stops, and you can restart workers and clients ad nauseam. This model works with any number of clients and workers.

Robust Reliable Queuing (Paranoid Pirate Pattern)

Figure 49 - The Paranoid Pirate Pattern

fig49.png

The Simple Pirate Queue pattern works pretty well, especially because it's just a combination of two existing patterns. Still, it does have some weaknesses:

  • It's not robust in the face of a queue crash and restart. The client will recover, but the workers won't. While ZeroMQ will reconnect workers' sockets automatically, as far as the newly started queue is concerned, the workers haven't signaled ready, so don't exist. To fix this, we have to do heartbeating from queue to worker so that the worker can detect when the queue has gone away.
  • The queue does not detect worker failure, so if a worker dies while idle, the queue can't remove it from its worker queue until the queue sends it a request. The client waits and retries for nothing. It's not a critical problem, but it's not nice. To make this work properly, we do heartbeating from worker to queue, so that the queue can detect a lost worker at any stage.

We'll fix these in a properly pedantic Paranoid Pirate Pattern.

We previously used a REQ socket for the worker. For the Paranoid Pirate worker, we'll switch to a DEALER socket. This has the advantage of letting us send and receive messages at any time, rather than the lock-step send/receive that REQ imposes. The downside of DEALER is that we have to do our own envelope management (re-read Chapter 3 - Advanced Request-Reply Patterns for background on this concept).

We're still using the Lazy Pirate client. Here is the Paranoid Pirate queue proxy:

// Paranoid Pirate queue

#include "czmq.h"
#define HEARTBEAT_LIVENESS  3       // 3-5 is reasonable
#define HEARTBEAT_INTERVAL  1000    // msecs

// Paranoid Pirate Protocol constants
#define PPP_READY       "\001"      // Signals worker is ready
#define PPP_HEARTBEAT   "\002"      // Signals worker heartbeat

// Here we define the worker class; a structure and a set of functions that
// act as constructor, destructor, and methods on worker objects:

typedef struct {
    zframe_t *identity;         // Identity of worker
    char *id_string;            // Printable identity
    int64_t expiry;             // Expires at this time
} worker_t;

// Construct new worker
static worker_t *
s_worker_new (zframe_t *identity)
{
    worker_t *self = (worker_t *) zmalloc (sizeof (worker_t));
    self->identity = identity;
    self->id_string = zframe_strhex (identity);
    self->expiry = zclock_time ()
                 + HEARTBEAT_INTERVAL * HEARTBEAT_LIVENESS;
    return self;
}

// Destroy specified worker object, including identity frame.
static void
s_worker_destroy (worker_t **self_p)
{
    assert (self_p);
    if (*self_p) {
        worker_t *self = *self_p;
        zframe_destroy (&self->identity);
        free (self->id_string);
        free (self);
        *self_p = NULL;
    }
}

// The ready method puts a worker to the end of the ready list:

static void
s_worker_ready (worker_t *self, zlist_t *workers)
{
    worker_t *worker = (worker_t *) zlist_first (workers);
    while (worker) {
        if (streq (self->id_string, worker->id_string)) {
            zlist_remove (workers, worker);
            s_worker_destroy (&worker);
            break;
        }
        worker = (worker_t *) zlist_next (workers);
    }
    zlist_append (workers, self);
}

// The next method returns the next available worker identity:

static zframe_t *
s_workers_next (zlist_t *workers)
{
    worker_t *worker = (worker_t *) zlist_pop (workers);
    assert (worker);
    zframe_t *frame = worker->identity;
    worker->identity = NULL;
    s_worker_destroy (&worker);
    return frame;
}

// The purge method looks for and kills expired workers. We hold workers
// from oldest to most recent, so we stop at the first alive worker:

static void
s_workers_purge (zlist_t *workers)
{
    worker_t *worker = (worker_t *) zlist_first (workers);
    while (worker) {
        if (zclock_time () < worker->expiry)
            break;              // Worker is alive, we're done here

        zlist_remove (workers, worker);
        s_worker_destroy (&worker);
        worker = (worker_t *) zlist_first (workers);
    }
}

// The main task is a load-balancer with heartbeating on workers so we
// can detect crashed or blocked worker tasks:

int main (void)
{
    zctx_t *ctx = zctx_new ();
    void *frontend = zsocket_new (ctx, ZMQ_ROUTER);
    void *backend = zsocket_new (ctx, ZMQ_ROUTER);
    zsocket_bind (frontend, "tcp://*:5555");    // For clients
    zsocket_bind (backend, "tcp://*:5556");     // For workers

    // List of available workers
    zlist_t *workers = zlist_new ();

    // Send out heartbeats at regular intervals
    uint64_t heartbeat_at = zclock_time () + HEARTBEAT_INTERVAL;

    while (true) {
        zmq_pollitem_t items [] = {
            { backend, 0, ZMQ_POLLIN, 0 },
            { frontend, 0, ZMQ_POLLIN, 0 }
        };
        // Poll frontend only if we have available workers
        int rc = zmq_poll (items, zlist_size (workers)? 2: 1,
            HEARTBEAT_INTERVAL * ZMQ_POLL_MSEC);
        if (rc == -1)
            break;              // Interrupted

        // Handle worker activity on backend
        if (items [0].revents & ZMQ_POLLIN) {
            // Use worker identity for load-balancing
            zmsg_t *msg = zmsg_recv (backend);
            if (!msg)
                break;          // Interrupted

            // Any sign of life from worker means it's ready
            zframe_t *identity = zmsg_unwrap (msg);
            worker_t *worker = s_worker_new (identity);
            s_worker_ready (worker, workers);

            // Validate control message, or return reply to client
            if (zmsg_size (msg) == 1) {
                zframe_t *frame = zmsg_first (msg);
                if (memcmp (zframe_data (frame), PPP_READY, 1)
                &&  memcmp (zframe_data (frame), PPP_HEARTBEAT, 1)) {
                    printf ("E: invalid message from worker\n");
                    zmsg_dump (msg);
                }
                zmsg_destroy (&msg);
            }
            else
                zmsg_send (&msg, frontend);
        }
        if (items [1].revents & ZMQ_POLLIN) {
            // Now get next client request, route to next worker
            zmsg_t *msg = zmsg_recv (frontend);
            if (!msg)
                break;          // Interrupted
            zframe_t *identity = s_workers_next (workers);
            zmsg_prepend (msg, &identity);
            zmsg_send (&msg, backend);
        }
        // We handle heartbeating after any socket activity. First, we send
        // heartbeats to any idle workers if it's time. Then, we purge any
        // dead workers:
        if (zclock_time () >= heartbeat_at) {
            worker_t *worker = (worker_t *) zlist_first (workers);
            while (worker) {
                zframe_send (&worker->identity, backend,
                             ZFRAME_REUSE + ZFRAME_MORE);
                zframe_t *frame = zframe_new (PPP_HEARTBEAT, 1);
                zframe_send (&frame, backend, 0);
                worker = (worker_t *) zlist_next (workers);
            }
            heartbeat_at = zclock_time () + HEARTBEAT_INTERVAL;
        }
        s_workers_purge (workers);
    }
    // When we're done, clean up properly
    while (zlist_size (workers)) {
        worker_t *worker = (worker_t *) zlist_pop (workers);
        s_worker_destroy (&worker);
    }
    zlist_destroy (&workers);
    zctx_destroy (&ctx);
    return 0;
}


The queue extends the load balancing pattern with heartbeating of workers. Heartbeating is one of those "simple" things that can be difficult to get right. I'll explain more about that in a second.

Here is the Paranoid Pirate worker:


Some comments about this example:

  • The code includes simulation of failures, as before. This makes it (a) very hard to debug, and (b) dangerous to reuse. When you want to debug this, disable the failure simulation.
  • The worker uses a reconnect strategy similar to the one we designed for the Lazy Pirate client, with two major differences: (a) it does an exponential back-off, and (b) it retries indefinitely (whereas the client retries a few times before reporting a failure).

Try the client, queue, and workers, such as by using a script like this:

ppqueue &
for i in 1 2 3 4; do
    ppworker &
    sleep 1
done
lpclient &

You should see the workers die one-by-one as they simulate a crash, and the client eventually give up. You can stop and restart the queue and both client and workers will reconnect and carry on. And no matter what you do to queues and workers, the client will never get an out-of-order reply: the whole chain either works, or the client abandons.

Heartbeating

Heartbeating solves the problem of knowing whether a peer is alive or dead. This is not an issue specific to ZeroMQ. TCP has a long timeout (30 minutes or so), which means that it can be impossible to know whether a peer has died, been disconnected, or gone on a weekend to Prague with a case of vodka, a redhead, and a large expense account.

It's not easy to get heartbeating right. When writing the Paranoid Pirate examples, it took about five hours to get the heartbeating working properly. The rest of the request-reply chain took perhaps ten minutes. It is especially easy to create "false failures", i.e., when peers decide that they are disconnected because the heartbeats aren't sent properly.

We'll look at the three main answers people use for heartbeating with ZeroMQ.

Shrugging It Off

The most common approach is to do no heartbeating at all and hope for the best. Many if not most ZeroMQ applications do this. ZeroMQ encourages this by hiding peers in many cases. What problems does this approach cause?

  • When we use a ROUTER socket in an application that tracks peers, as peers disconnect and reconnect, the application will leak memory (resources that the application holds for each peer) and get slower and slower.
  • When we use SUB- or DEALER-based data recipients, we can't tell the difference between good silence (there's no data) and bad silence (the other end died). When a recipient knows the other side died, it can for example switch over to a backup route.
  • If we use a TCP connection that stays silent for a long while, it will, in some networks, just die. Sending something (technically, a "keep-alive" more than a heartbeat) will keep the network alive.

One-Way Heartbeats

A second option is to send a heartbeat message from each node to its peers every second or so. When one node hears nothing from another within some timeout (several seconds, typically), it will treat that peer as dead. Sounds good, right? Sadly, no. This works in some cases but has nasty edge cases in others.

For pub-sub, this does work, and it's the only model you can use. SUB sockets cannot talk back to PUB sockets, but PUB sockets can happily send "I'm alive" messages to their subscribers.

As an optimization, you can send heartbeats only when there is no real data to send. Furthermore, you can send heartbeats progressively slower and slower, if network activity is an issue (e.g., on mobile networks where activity drains the battery). As long as the recipient can detect a failure (sharp stop in activity), that's fine.

Here are the typical problems with this design:

  • It can be inaccurate when we send large amounts of data, as heartbeats will be delayed behind that data. If heartbeats are delayed, you can get false timeouts and disconnections due to network congestion. Thus, always treat any incoming data as a heartbeat, whether or not the sender optimizes out heartbeats.
  • While the pub-sub pattern will drop messages for disappeared recipients, PUSH and DEALER sockets will queue them. So if you send heartbeats to a dead peer and it comes back, it will get all the heartbeats you sent, which can be thousands. Whoa, whoa!
  • This design assumes that heartbeat timeouts are the same across the whole network. But that won't be accurate. Some peers will want very aggressive heartbeating in order to detect faults rapidly. And some will want very relaxed heartbeating, in order to let sleeping networks lie and save power.

Ping-Pong Heartbeats

The third option is to use a ping-pong dialog. One peer sends a ping command to the other, which replies with a pong command. Neither command has any payload. Pings and pongs are not correlated. Because the roles of "client" and "server" are arbitrary in some networks, we usually specify that either peer can in fact send a ping and expect a pong in response. However, because the timeouts depend on network topologies known best to dynamic clients, it is usually the client that pings the server.

This works for all ROUTER-based brokers. The same optimizations we used in the second model make this work even better: treat any incoming data as a pong, and only send a ping when not otherwise sending data.
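
To make the two optimizations concrete, here is a minimal sketch of the bookkeeping one side of a ping-pong dialog might keep. It is not taken from any Guide example: the names pinger_t, pinger_on_receive, pinger_on_send, and pinger_tick are hypothetical, and the caller is assumed to supply the current time in msecs (for instance from zclock_time) and to do the actual socket work.

```c
#include <assert.h>
#include <stdint.h>

//  Hypothetical ping-pong state for one peer; all times are in msecs
typedef struct {
    int64_t ping_interval;      //  Ping after this much outgoing silence
    int64_t peer_timeout;       //  Peer is dead after this much incoming silence
    int64_t last_sent;          //  When we last sent anything to the peer
    int64_t last_received;      //  When we last received anything from it
} pinger_t;

typedef enum {
    PINGER_IDLE,                //  Nothing to do
    PINGER_SEND_PING,           //  Caller should send a ping now
    PINGER_PEER_DEAD            //  Caller should treat the peer as dead
} pinger_action_t;

void pinger_init (pinger_t *self, int64_t now,
                  int64_t ping_interval, int64_t peer_timeout)
{
    assert (self);
    assert (ping_interval > 0 && peer_timeout > ping_interval);
    self->ping_interval = ping_interval;
    self->peer_timeout = peer_timeout;
    self->last_sent = now;
    self->last_received = now;
}

//  Any incoming message, data or pong, counts as a pong
void pinger_on_receive (pinger_t *self, int64_t now)
{
    self->last_received = now;
}

//  Any outgoing message, data or ping, postpones the next ping
void pinger_on_send (pinger_t *self, int64_t now)
{
    self->last_sent = now;
}

//  Call once per poll loop iteration to decide what to do
pinger_action_t pinger_tick (const pinger_t *self, int64_t now)
{
    if (now - self->last_received >= self->peer_timeout)
        return PINGER_PEER_DEAD;
    if (now - self->last_sent >= self->ping_interval)
        return PINGER_SEND_PING;
    return PINGER_IDLE;
}
```

When pinger_tick returns PINGER_SEND_PING, the caller sends the ping and then calls pinger_on_send, so a busy connection never sends pings at all.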

Heartbeating for Paranoid Pirate

For Paranoid Pirate, we chose the second approach. It might not have been the simplest option: if designing this today, I'd probably try a ping-pong approach instead. However, the principles are similar. The heartbeat messages flow asynchronously in both directions, and either peer can decide the other is "dead" and stop talking to it.

In the worker, this is how we handle heartbeats from the queue:

  • We calculate a liveness, which is how many heartbeats we can still miss before deciding the queue is dead. It starts at three and we decrement it each time we miss a heartbeat.
  • We wait, in the zmq_poll loop, for one second each time, which is our heartbeat interval.
  • If there's any message from the queue during that time, we reset our liveness to three.
  • If there's no message during that time, we count down our liveness.
  • If the liveness reaches zero, we consider the queue dead.
  • If the queue is dead, we destroy our socket, create a new one, and reconnect.
  • To avoid opening and closing too many sockets, we wait for a certain interval before reconnecting, and we double the interval each time until it reaches 32 seconds.

And this is how we handle heartbeats to the queue:

  • We calculate when to send the next heartbeat; this is a single variable because we're talking to one peer, the queue.
  • In the zmq_poll loop, whenever we pass this time, we send a heartbeat to the queue.

Here's the essential heartbeating code for the worker:

#define HEARTBEAT_LIVENESS 3 // 3-5 is reasonable
#define HEARTBEAT_INTERVAL 1000 // msecs
#define INTERVAL_INIT 1000 // Initial reconnect
#define INTERVAL_MAX 32000 // After exponential backoff

// If liveness hits zero, queue is considered disconnected
size_t liveness = HEARTBEAT_LIVENESS;
size_t interval = INTERVAL_INIT;

// Send out heartbeats at regular intervals
uint64_t heartbeat_at = zclock_time () + HEARTBEAT_INTERVAL;

while (true) {
    zmq_pollitem_t items [] = { { worker, 0, ZMQ_POLLIN, 0 } };
    int rc = zmq_poll (items, 1, HEARTBEAT_INTERVAL * ZMQ_POLL_MSEC);
    if (rc == -1)
        break; // Interrupted

    if (items [0].revents & ZMQ_POLLIN) {
        // Receive any message from queue
        liveness = HEARTBEAT_LIVENESS;
        interval = INTERVAL_INIT;
    }
    else
    if (--liveness == 0) {
        zclock_sleep (interval);
        if (interval < INTERVAL_MAX)
            interval *= 2;
        zsocket_destroy (ctx, worker);
        // Create a new socket and reconnect to the queue here
        liveness = HEARTBEAT_LIVENESS;
    }
    // Send heartbeat to queue if it's time
    if (zclock_time () > heartbeat_at) {
        heartbeat_at = zclock_time () + HEARTBEAT_INTERVAL;
        // Send heartbeat message to queue
    }
}

The queue does the same, but manages an expiration time for each worker.
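
As a rough illustration of per-worker expiration, here is a standalone sketch that keeps one expiry time per worker and purges the ones that have gone quiet. It is a simplification under stated assumptions, not the queue's actual code: the fixed-size worker_table_t and its functions are hypothetical, identities are plain strings, and the caller passes in the current time in msecs.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define HEARTBEAT_LIVENESS  3       //  Same values as the worker code
#define HEARTBEAT_INTERVAL  1000    //  msecs
#define MAX_WORKERS         16      //  Hypothetical capacity for this sketch

typedef struct {
    char identity [32];             //  Worker identity as a printable string
    int64_t expiry;                 //  Worker is presumed dead at this time
} table_worker_t;

typedef struct {
    table_worker_t workers [MAX_WORKERS];
    int count;
} worker_table_t;

//  Any message from a worker (READY, heartbeat, or reply) refreshes its
//  expiry. Returns 0 on success, -1 if the table is full.
int worker_table_ready (worker_table_t *self, const char *identity, int64_t now)
{
    assert (self && identity);
    int64_t expiry = now + HEARTBEAT_INTERVAL * HEARTBEAT_LIVENESS;
    for (int i = 0; i < self->count; i++) {
        if (strcmp (self->workers [i].identity, identity) == 0) {
            self->workers [i].expiry = expiry;
            return 0;
        }
    }
    if (self->count == MAX_WORKERS)
        return -1;
    table_worker_t *worker = &self->workers [self->count++];
    strncpy (worker->identity, identity, sizeof (worker->identity) - 1);
    worker->identity [sizeof (worker->identity) - 1] = '\0';
    worker->expiry = expiry;
    return 0;
}

//  Drop every worker whose expiry has passed; returns the number purged
int worker_table_purge (worker_table_t *self, int64_t now)
{
    assert (self);
    int kept = 0;
    for (int i = 0; i < self->count; i++) {
        if (self->workers [i].expiry > now)
            self->workers [kept++] = self->workers [i];
    }
    int purged = self->count - kept;
    self->count = kept;
    return purged;
}
```

With these values a worker survives three missed heartbeats, which mirrors the liveness count the worker keeps for the queue.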

Here are some tips for your own heartbeating implementation:

  • Use zmq_poll or a reactor as the core of your application's main task.
  • Start by building the heartbeating between peers, test it by simulating failures, and then build the rest of the message flow. Adding heartbeating afterwards is much trickier.
  • Use simple tracing, i.e., print to console, to get this working. To help you trace the flow of messages between peers, use a dump method such as zmsg offers, and number your messages incrementally so you can see if there are gaps.
  • In a real application, heartbeating must be configurable and usually negotiated with the peer. Some peers will want aggressive heartbeating, as low as 10 msecs. Other peers will be far away and want heartbeating as high as 30 seconds.
  • If you have different heartbeat intervals for different peers, your poll timeout should be the lowest (shortest time) of these. Do not use an infinite timeout.
  • Do heartbeating on the same socket you use for messages, so your heartbeats also act as a keep-alive to stop the network connection from going stale (some firewalls can be unkind to silent connections).
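
The poll-timeout tip is easy to get wrong when peers come and go, so here is a small sketch of it. The helper poll_timeout_msecs is a hypothetical name, and the fallback timeout for the "no peers yet" case is my assumption, chosen so the result is never infinite.

```c
#include <assert.h>
#include <stddef.h>

//  Returns the shortest heartbeat interval (msecs) among the peers, for
//  use as the poll timeout. With no peers, returns default_timeout so the
//  caller never polls with an infinite timeout.
long poll_timeout_msecs (const long *intervals, size_t count,
                         long default_timeout)
{
    assert (default_timeout > 0);
    assert (count == 0 || intervals != NULL);
    if (count == 0)
        return default_timeout;

    long timeout = intervals [0];
    for (size_t i = 1; i < count; i++) {
        if (intervals [i] < timeout)
            timeout = intervals [i];
    }
    return timeout;
}
```

The result would be multiplied by ZMQ_POLL_MSEC before being handed to zmq_poll, as the examples in this chapter do.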

Contracts and Protocols

If you're paying attention, you'll realize that Paranoid Pirate is not interoperable with Simple Pirate, because of the heartbeats. But how do we define "interoperable"? To guarantee interoperability, we need a kind of contract, an agreement that lets different teams in different times and places write code that is guaranteed to work together. We call this a "protocol".

It's fun to experiment without specifications, but that's not a sensible basis for real applications. What happens if we want to write a worker in another language? Do we have to read code to see how things work? What if we want to change the protocol for some reason? Even a simple protocol will, if it's successful, evolve and become more complex.

Lack of contracts is a sure sign of a disposable application. So let's write a contract for this protocol. How do we do that?

There's a wiki at rfc.zeromq.org that we made especially as a home for public ZeroMQ contracts.
To create a new specification, register on the wiki if needed, and follow the instructions. It's fairly straightforward, though writing technical texts is not everyone's cup of tea.

It took me about fifteen minutes to draft the new Pirate Pattern Protocol. It's not a big specification, but it does capture enough to act as the basis for arguments ("your queue isn't PPP compatible; please fix it!").

Turning PPP into a real protocol would take more work:

  • There should be a protocol version number in the READY command so that it's possible to distinguish between different versions of PPP.
  • Right now, READY and HEARTBEAT are not entirely distinct from requests and replies. To make them distinct, we would need a message structure that includes a "message type" part.
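
To show what those two changes might look like on the wire, here is one possible layout, purely as an illustration. Nothing here is part of PPP as specified: the two-byte header, the PPP_VERSION value, and the ppp_type_t codes are all assumptions made for this sketch, and it puts the version in every message rather than only in READY.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PPP_VERSION     1       //  Hypothetical protocol version
#define PPP_HEADER_SIZE 2       //  [version][message type]

//  Hypothetical message types; control and data messages share one space
typedef enum {
    PPP_TYPE_READY = 1,
    PPP_TYPE_HEARTBEAT = 2,
    PPP_TYPE_REQUEST = 3,
    PPP_TYPE_REPLY = 4
} ppp_type_t;

//  Write the header into buffer; returns bytes written, or 0 if too small
size_t ppp_header_encode (uint8_t *buffer, size_t size, ppp_type_t type)
{
    assert (buffer);
    if (size < PPP_HEADER_SIZE)
        return 0;
    buffer [0] = PPP_VERSION;
    buffer [1] = (uint8_t) type;
    return PPP_HEADER_SIZE;
}

//  Parse a header; returns 0 and sets *type, or -1 if the frame is too
//  short, the version is not ours, or the message type is unknown
int ppp_header_decode (const uint8_t *data, size_t size, ppp_type_t *type)
{
    assert (data && type);
    if (size < PPP_HEADER_SIZE || data [0] != PPP_VERSION)
        return -1;
    if (data [1] < PPP_TYPE_READY || data [1] > PPP_TYPE_REPLY)
        return -1;
    *type = (ppp_type_t) data [1];
    return 0;
}
```

With an explicit type in every message, a peer no longer has to guess from frame count and content whether it is looking at a control command or application data.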

Service-Oriented Reliable Queuing (Majordomo Pattern)

Figure 50 - The Majordomo Pattern

fig50.png

The nice thing about progress is how fast it happens when lawyers and committees aren't involved. The one-page MDP specification turns PPP into something more solid. This is how we should design complex architectures: start by writing down the contracts, and only then write software to implement them.

The Majordomo Protocol (MDP) extends and improves on PPP in one interesting way: it adds a "service name" to requests that the client sends, and asks workers to register for specific services. Adding service names turns our Paranoid Pirate queue into a service-oriented broker. The nice thing about MDP is that it came out of working code, a simpler ancestor protocol (PPP), and a precise set of improvements that each solved a clear problem. This made it easy to draft.

To implement Majordomo, we need to write a framework for clients and workers. It's really not sane to ask every application developer to read the spec and make it work, when they could be using a simpler API that does the work for them.

So while our first contract (MDP itself) defines how the pieces of our distributed architecture talk to each other, our second contract defines how user applications talk to the technical framework we're going to design.

Majordomo has two halves, a client side and a worker side. Because we'll write both client and worker applications, we will need two APIs. Here is a sketch for the client API, using a simple object-oriented approach:

// Majordomo Protocol client example
// Uses the mdcli API to hide all MDP aspects

// Lets us build this source without creating a library
#include "mdcliapi.c"

int main (int argc, char *argv [])
{
    int verbose = (argc > 1 && streq (argv [1], "-v"));
    mdcli_t *session = mdcli_new ("tcp://localhost:5555", verbose);

    int count;
    for (count = 0; count < 100000; count++) {
        zmsg_t *request = zmsg_new ();
        zmsg_pushstr (request, "Hello world");
        zmsg_t *reply = mdcli_send (session, "echo", &request);
        if (reply)
            zmsg_destroy (&reply);
        else
            break; // Interrupt or failure
    }
    printf ("%d requests/replies processed\n", count);
    mdcli_destroy (&session);
    return 0;
}

That's it. We open a session to the broker, send a request message, get a reply message back, and eventually close the connection. Here's a sketch for the worker API:

// Majordomo Protocol worker example
// Uses the mdwrk API to hide all MDP aspects

// Lets us build this source without creating a library
#include "mdwrkapi.c"

int main (int argc, char *argv [])
{
    int verbose = (argc > 1 && streq (argv [1], "-v"));
    mdwrk_t *session = mdwrk_new (
        "tcp://localhost:5555", "echo", verbose);

    zmsg_t *reply = NULL;
    while (true) {
        zmsg_t *request = mdwrk_recv (session, &reply);
        if (request == NULL)
            break; // Worker was interrupted
        reply = request; // Echo is complex… :-)
    }
    mdwrk_destroy (&session);
    return 0;
}

It's more or less symmetrical, but the worker dialog is a little different. The first time a worker does a recv(), it passes a null reply. Thereafter, it passes the current reply, and gets a new request.

The client and worker APIs were fairly simple to construct because they're heavily based on the Paranoid Pirate code we already developed. Here is the client API:


Let's see how the client API looks in action, with an example test program that does 100K request-reply cycles:


And here is the worker API:


Let's see how the worker API looks in action, with an example test program that implements an echo service:


Here are some things to note about the worker API code:

  • The APIs are single-threaded. This means, for example, that the worker won't send heartbeats in the background. Happily, this is exactly what we want: if the worker application gets stuck, heartbeats will stop and the broker will stop sending requests to the worker.
  • The worker API doesn't do an exponential back-off; it's not worth the extra complexity.
  • The APIs don't do any error reporting. If something isn't as expected, they raise an assertion (or exception depending on the language). This is ideal for a reference implementation, so any protocol errors show immediately. For real applications, the API should be robust against invalid messages.

You might wonder why the worker API is manually closing its socket and opening a new one, when ZeroMQ will automatically reconnect a socket if the peer disappears and comes back. Look back at the Simple Pirate and Paranoid Pirate workers to understand. Although ZeroMQ will automatically reconnect workers if the broker dies and comes back up, this isn't sufficient to re-register the workers with the broker. I know of at least two solutions. The simplest, which we use here, is for the worker to monitor the connection using heartbeats, and if it decides the broker is dead, to close its socket and start afresh with a new socket. The alternative is for the broker to challenge unknown workers when it gets a heartbeat from the worker and ask them to re-register. That would require protocol support.

Now let's design the Majordomo broker. Its core structure is a set of queues, one per service. We will create these queues as workers appear (we could delete them as workers disappear, but forget that for now because it gets complex). Additionally, we keep a queue of workers per service.

And here is the broker:


This is by far the most complex example we've seen. It's almost 500 lines of code. To write this and make it somewhat robust took two days. However, this is still a short piece of code for a full service-oriented broker.

Here are some things to note about the broker code:

  • The Majordomo Protocol lets us handle both clients and workers on a single socket. This is nicer for those deploying and managing the broker: it just sits on one ZeroMQ endpoint rather than the two that most proxies need.
  • The broker implements all of MDP/0.1 properly (as far as I know), including disconnection if the broker sends invalid commands, heartbeating, and the rest.
  • It can be extended to run multiple threads, each managing one socket and one set of clients and workers. This could be interesting for segmenting large architectures. The C code is already organized around a broker class to make this trivial.
  • A primary/failover or live/live broker reliability model is easy, as the broker essentially has no state except service presence. It's up to clients and workers to choose another broker if their first choice isn't up and running.
  • The examples use five-second heartbeats, mainly to reduce the amount of output when you enable tracing. Realistic values would be lower for most LAN applications. However, any retry has to be slow enough to allow for a service to restart, say 10 seconds at least.

We later improved and extended the protocol and the Majordomo implementation, which now sits in its own GitHub project. If you want a properly usable Majordomo stack, use the GitHub project.

Asynchronous Majordomo Pattern

The Majordomo implementation in the previous section is simple and stupid. The client is just the original Simple Pirate, wrapped up in a sexy API. When I fire up a client, broker, and worker on a test box, it can process 100,000 requests in about 14 seconds. That is partially due to the code, which cheerfully copies message frames around as if CPU cycles were free. But the real problem is that we're doing network round-trips. ZeroMQ disables Nagle's algorithm, but round-tripping is still slow.

Theory is great in theory, but in practice, practice is better. Let's measure the actual cost of round-tripping with a simple test program. This sends a bunch of messages, first waiting for a reply to each message, and second as a batch, reading all the replies back as a batch. Both approaches do the same work, but they give very different results. We mock up a client, broker, and worker:


On my development box, this program says:

Setting up test...
Synchronous round-trip test...
 9057 calls/second
Asynchronous round-trip test...
 173010 calls/second

Note that the client thread does a small pause before starting. This is to get around one of the "features" of the router socket: if you send a message with the address of a peer that's not yet connected, the message gets discarded. In this example we don't use the load balancing mechanism, so without the sleep, if the worker thread is too slow to connect, it will lose messages, making a mess of our test.

As we see, round-tripping in the simplest case is about 20 times slower (9,057 versus 173,010 calls per second) than the asynchronous, "shove it down the pipe as fast as it'll go" approach. Let's see if we can apply this to Majordomo to make it faster.

First, we modify the client API to send and receive in two separate methods:

mdcli_t *mdcli_new (char *broker);
void mdcli_destroy (mdcli_t **self_p);
int mdcli_send (mdcli_t *self, char *service, zmsg_t **request_p);
zmsg_t *mdcli_recv (mdcli_t *self);

It's literally a few minutes' work to refactor the synchronous client API to become asynchronous:


The differences are:

  • We use a DEALER socket instead of REQ, so we emulate REQ with an empty delimiter frame before each request and each response.
  • We don't retry requests; if the application needs to retry, it can do this itself.
  • We break the synchronous send method into separate send and recv methods.
  • The send method is asynchronous and returns immediately after sending. The caller can thus send a number of messages before getting a response.
  • The recv method waits for (with a timeout) one response and returns that to the caller.

And here's the corresponding client test program, which sends 100,000 messages and then receives 100,000 back:


The broker and worker are unchanged because we've not modified the protocol at all. We see an immediate improvement in performance. Here's the synchronous client chugging through 100K request-reply cycles:

$ time mdclient
100000 requests/replies processed

real    0m14.088s
user    0m1.310s
sys     0m2.670s

And here's the asynchronous client, with a single worker:

$ time mdclient2
100000 replies received

real    0m8.730s
user    0m0.920s
sys     0m1.550s

About 1.6 times as fast (14.1 versus 8.7 seconds). Not bad, but let's fire up 10 workers and see how it handles the traffic:

$ time mdclient2
100000 replies received

real    0m3.863s
user    0m0.730s
sys     0m0.470s

It isn't fully asynchronous because workers get their messages on a strict last-used basis. But it will scale better with more workers. On my PC, after eight or so workers, it doesn't get any faster. Four cores only stretch so far. But we got a 4x improvement in throughput with just a few minutes' work. The broker is still unoptimized. It spends most of its time copying message frames around, instead of doing zero-copy, which it could. But we're getting 25K reliable request/reply calls a second, with pretty low effort.

However, the asynchronous Majordomo pattern isn't all roses. It has a fundamental weakness, namely that it cannot survive a broker crash without more work. If you look at the mdcliapi2 code you'll see it does not attempt to reconnect after a failure. A proper reconnect would require the following:

  • A number on every request and a matching number on every reply, which would ideally require a change to the protocol to enforce.
  • Tracking and holding onto all outstanding requests in the client API, i.e., those for which no reply has yet been received.
  • In case of failover, for the client API to resend all outstanding requests to the broker.
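
To give a feel for the extra bookkeeping, here is a minimal sketch of the client-side tracking those three points imply. It is not part of mdcliapi2: tracker_t and its functions are hypothetical, the window size and body length are arbitrary, and request bodies are plain strings rather than zmsg_t messages.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define MAX_OUTSTANDING 64      //  Hypothetical window size

typedef struct {
    bool in_use;
    uint64_t sequence;          //  Number stamped on request and reply
    char body [64];             //  Copy of the request, kept until acked
} pending_t;

typedef struct {
    pending_t slots [MAX_OUTSTANDING];
    uint64_t next_sequence;
} tracker_t;

void tracker_init (tracker_t *self)
{
    assert (self);
    memset (self, 0, sizeof (*self));
    self->next_sequence = 1;    //  Zero is reserved to mean "failed"
}

//  Record a request before sending it. Returns its sequence number, or 0
//  if the window is full or the body does not fit.
uint64_t tracker_add (tracker_t *self, const char *body)
{
    assert (self && body);
    if (strlen (body) >= sizeof (self->slots [0].body))
        return 0;
    for (int i = 0; i < MAX_OUTSTANDING; i++) {
        pending_t *slot = &self->slots [i];
        if (!slot->in_use) {
            slot->in_use = true;
            slot->sequence = self->next_sequence++;
            strcpy (slot->body, body);
            return slot->sequence;
        }
    }
    return 0;
}

//  A reply arrived: forget the matching request. Returns true if that
//  sequence number was outstanding, false for duplicates or strays.
bool tracker_ack (tracker_t *self, uint64_t sequence)
{
    assert (self);
    for (int i = 0; i < MAX_OUTSTANDING; i++) {
        pending_t *slot = &self->slots [i];
        if (slot->in_use && slot->sequence == sequence) {
            slot->in_use = false;
            return true;
        }
    }
    return false;
}

//  Iterate over outstanding requests, e.g., to resend them after failover.
//  Start with *cursor = 0; returns NULL when done. Requests come back in
//  slot order, which is not necessarily sequence order.
const pending_t *tracker_next (const tracker_t *self, int *cursor)
{
    assert (self && cursor);
    while (*cursor < MAX_OUTSTANDING) {
        const pending_t *slot = &self->slots [(*cursor)++];
        if (slot->in_use)
            return slot;
    }
    return NULL;
}
```

After reconnecting to a broker, the client would walk tracker_next and resend each body with its original sequence number, so late replies from before the failover can still be matched or discarded.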

It's not a deal breaker, but it does show that performance often means complexity. Is this worth doing for Majordomo? It depends on your use case. For a name lookup service you call once per session, no. For a web frontend serving thousands of clients, probably yes.

Service Discovery

So, we have a nice service-oriented broker, but we have no way of knowing whether a particular service is available or not. We know whether a request failed, but we don't know why. It is useful to be able to ask the broker, "is the echo service running?" The most obvious way would be to modify our MDP/Client protocol to add commands to ask this. But MDP/Client has the great charm of being simple. Adding service discovery to it would make it as complex as the MDP/Worker protocol.

Another option is to do what email does, and ask that undeliverable requests be returned. This can work well in an asynchronous world, but it also adds complexity. We need ways to distinguish returned requests from replies and to handle these properly.

Let's try to use what we've already built, building on top of MDP instead of modifying it. Service discovery is, itself, a service. It might indeed be one of several management services, such as "disable service X", "provide statistics", and so on. What we want is a general, extensible solution that doesn't affect the protocol or existing applications.

So here's a small RFC that layers this on top of MDP: the Majordomo Management Interface (MMI). We already implemented it in the broker, though unless you read the whole thing you probably missed that. I'll explain how it works in the broker:

  • When a client requests a service that starts with mmi., instead of routing this to a worker, we handle it internally.
  • We handle just one service in this broker, which is mmi.service, the service discovery service.
  • The payload for the request is the name of an external service (a real one, provided by a worker).
  • The broker returns "200" (OK) or "404" (Not found), depending on whether there are workers registered for that service or not.
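
The dispatch logic described in these points is small enough to sketch in isolation. This is not the broker's code: service_t, service_is_internal, and mmi_handle are hypothetical names, the service table is a plain array, and the "501" reply for mmi.* services other than mmi.service is an assumption on my part, since the text above only defines "200" and "404".

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

//  Hypothetical view of what the broker knows about one service
typedef struct {
    const char *name;
    int workers;                //  Workers currently registered
} service_t;

//  True if the service name is reserved for the broker itself
bool service_is_internal (const char *name)
{
    assert (name);
    return strncmp (name, "mmi.", 4) == 0;
}

//  Handle an internal request; returns the status code to send back.
//  The payload is the name of the external service being looked up.
const char *mmi_handle (const char *service_name, const char *payload,
                        const service_t *services, int count)
{
    assert (service_is_internal (service_name));
    assert (payload);
    if (strcmp (service_name, "mmi.service") != 0)
        return "501";           //  Assumption: other mmi.* not implemented

    for (int i = 0; i < count; i++) {
        if (strcmp (services [i].name, payload) == 0
        &&  services [i].workers > 0)
            return "200";       //  Service has at least one worker
    }
    return "404";               //  Unknown service, or no workers for it
}
```

A broker would call service_is_internal on every client request first, and only route to a worker queue when it returns false.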

Here's how we use the service discovery in an application:

// MMI echo query example

// Lets us build this source without creating a library
#include "mdcliapi.c"

int main (int argc, char *argv [])
{
    int verbose = (argc > 1 && streq (argv [1], "-v"));
    mdcli_t *session = mdcli_new ("tcp://localhost:5555", verbose);

    // This is the service we want to look up
    zmsg_t *request = zmsg_new ();
    zmsg_addstr (request, "echo");

    // This is the service we send our request to
    zmsg_t *reply = mdcli_send (session, "mmi.service", &request);

    if (reply) {
        char *reply_code = zframe_strdup (zmsg_first (reply));
        printf ("Lookup echo service: %s\n", reply_code);
        free (reply_code);
        zmsg_destroy (&reply);
    }
    else
        printf ("E: no response from broker, make sure it's running\n");

    mdcli_destroy (&session);
    return 0;
}


Try this with and without a worker running, and you should see the little program report "200" or "404" accordingly. The implementation of MMI in our example broker is flimsy. For example, if a worker disappears, services remain "present". In practice, a broker should remove services that have no workers after some configurable timeout.

Idempotent Services

Idempotency is not something you take a pill for. What it means is that it's safe to repeat an operation. Checking the clock is idempotent. Lending one's credit card to one's children is not. While many client-to-server use cases are idempotent, some are not. Examples of idempotent use cases include:

  • Stateless task distribution, i.e., a pipeline where the servers are stateless workers that compute a reply based purely on the state provided by a request. In such a case, it's safe (though inefficient) to execute the same request many times.
  • A name service that translates logical addresses into endpoints to bind or connect to. In such a case, it's safe to make the same lookup request many times.

And here are examples of non-idempotent use cases:

  • A logging service. One does not want the same log information recorded more than once.
  • Any service that has impact on downstream nodes, e.g., sends on information to other nodes. If that service gets the same request more than once, downstream nodes will get duplicate information.
  • Any service that modifies shared data in some non-idempotent way; e.g., a service that debits a bank account is not idempotent without extra work.

When our server applications are not idempotent, we have to think more carefully about when exactly they might crash. If an application dies when it's idle, or while it's processing a request, that's usually fine. We can use database transactions to make sure a debit and a credit are always done together, if at all. If the server dies while sending its reply, that's a problem, because as far as it's concerned, it has done its work.

If the network dies just as the reply is making its way back to the client, the same problem arises. The client will think the server died and will resend the request, and the server will do the same work twice, which is not what we want.

To handle non-idempotent operations, use the fairly standard solution of detecting and rejecting duplicate requests. This means:

  • The client must stamp every request with a unique client identifier and a unique message number.
  • The server, before sending back a reply, stores it using the combination of client ID and message number as a key.
  • The server, when getting a request from a given client, first checks whether it has a reply for that client ID and message number. If so, it does not process the request, but just resends the reply.
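
Here is a minimal in-memory sketch of that duplicate detection, using the bank debit as the non-idempotent operation. All names (reply_cache_t, server_t, server_handle_debit) are hypothetical, the cache is a fixed-size ring that overwrites its oldest entry, and nothing is persisted. A real server would need to store the reply durably, ideally in the same transaction as the work itself.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define CACHE_SIZE 128          //  Hypothetical number of remembered replies

typedef struct {
    bool in_use;
    char client_id [32];
    uint64_t message_number;
    char reply [64];
} cached_reply_t;

typedef struct {
    cached_reply_t entries [CACHE_SIZE];
    int next;                   //  Next slot to overwrite (the oldest)
} reply_cache_t;

//  Returns the stored reply for this client ID and message number, or NULL
const char *reply_cache_lookup (const reply_cache_t *self,
                                const char *client_id, uint64_t message_number)
{
    for (int i = 0; i < CACHE_SIZE; i++) {
        const cached_reply_t *entry = &self->entries [i];
        if (entry->in_use
        &&  entry->message_number == message_number
        &&  strcmp (entry->client_id, client_id) == 0)
            return entry->reply;
    }
    return NULL;
}

//  Stores a reply before it is sent; returns the stored copy
const char *reply_cache_store (reply_cache_t *self, const char *client_id,
                               uint64_t message_number, const char *reply)
{
    assert (strlen (client_id) < sizeof (self->entries [0].client_id));
    assert (strlen (reply) < sizeof (self->entries [0].reply));
    cached_reply_t *entry = &self->entries [self->next];
    self->next = (self->next + 1) % CACHE_SIZE;
    entry->in_use = true;
    strcpy (entry->client_id, client_id);
    entry->message_number = message_number;
    strcpy (entry->reply, reply);
    return entry->reply;
}

//  Example non-idempotent service: debit an account
typedef struct {
    reply_cache_t cache;
    long balance;
} server_t;

const char *server_handle_debit (server_t *self, const char *client_id,
                                 uint64_t message_number, long amount)
{
    const char *cached = reply_cache_lookup (
        &self->cache, client_id, message_number);
    if (cached)
        return cached;          //  Duplicate: resend reply, do no work

    self->balance -= amount;
    char reply [64];
    snprintf (reply, sizeof (reply), "OK balance=%ld", self->balance);
    return reply_cache_store (&self->cache, client_id, message_number, reply);
}
```

A resent request with the same client ID and message number gets the original reply back and leaves the balance untouched, while the same message number from a different client is treated as new work.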

Disconnected Reliability (Titanic Pattern)

Once you realize that Majordomo is a "reliable" message broker, you might be tempted to add some spinning rust (that is, ferrous-based hard disk platters). After all, this works for all the enterprise messaging systems. It's such a tempting idea that it's a little sad to have to be negative toward it. But brutal cynicism is one of my specialties. So, some reasons you don't want rust-based brokers sitting in the center of your architecture are:

  • As you've seen, the Lazy Pirate client performs surprisingly well. It works across a whole range of architectures, from direct client-to-server to distributed queue proxies. It does tend to assume that workers are stateless and idempotent. But we can work around that limitation without resorting to rust.
  • Rust brings a whole set of problems, from slow performance to additional pieces that you have to manage, repair, and handle 6 a.m. panics from, as they inevitably break at the start of daily operations. The beauty of the Pirate patterns in general is their simplicity. They won't crash. And if you're still worried about the hardware, you can move to a peer-to-peer pattern that has no broker at all. I'll explain later in this chapter.

Having said this, however, there is one sane use case for rust-based reliability, which is an asynchronous disconnected network. It solves a major problem with Pirate, namely that a client has to wait for an answer in real time. If clients and workers are only sporadically connected (think of email as an analogy), we can't use a stateless network between clients and workers. We have to put state in the middle.

So, here's the Titanic pattern, in which we write messages to disk to ensure they never get lost, no matter how sporadically clients and workers are connected. As we did for service discovery, we're going to layer Titanic on top of MDP rather than extend it. It's wonderfully lazy because it means we can implement our fire-and-forget reliability in a specialized worker, rather than in the broker. This is excellent for several reasons:

  • It is much easier because we divide and conquer: the broker handles message routing and the worker handles reliability.
  • It lets us mix brokers written in one language with workers written in another.
  • It lets us evolve the fire-and-forget technology independently.

The only downside is that there's an extra network hop between broker and hard disk. The benefits are easily worth it.

There are many ways to make a persistent request-reply architecture. We'll aim for one that is simple and painless. The simplest design I could come up with, after playing with this for a few hours, is a "proxy service". That is, Titanic doesn't affect workers at all. If a client wants a reply immediately, it talks directly to a service and hopes the service is available. If a client is happy to wait a while, it talks to Titanic instead and asks, "hey, buddy, would you take care of this for me while I go buy my groceries?"

Figure 51 - The Titanic Pattern

fig51.png

Titanic is thus both a worker and a client. The dialog between client and Titanic goes along these lines:

  • Client: Please accept this request for me. Titanic: OK, done.
  • Client: Do you have a reply for me? Titanic: Yes, here it is. Or, no, not yet.
  • Client: OK, you can wipe that request now, I'm happy. Titanic: OK, done.

Whereas the dialog between Titanic and broker and worker goes like this:

  • Titanic: Hey, Broker, is there a coffee service? Broker: Uhm, yeah, seems like.
  • Titanic: Hey, coffee service, please handle this for me.
  • Coffee: Sure, here you are.
  • Titanic: Sweeeeet!

You can work through this and the possible failure scenarios. If a worker crashes while processing a request, Titanic retries indefinitely. If a reply gets lost somewhere, Titanic will retry. If the request gets processed but the client doesn't get the reply, it will ask again. If Titanic crashes while processing a request or a reply, the client will try again. As long as requests are fully committed to safe storage, work can't get lost.

The handshaking is pedantic, but can be pipelined, i.e., clients can use the asynchronous Majordomo pattern to do a lot of work and then get the responses later.

We need some way for a client to request its replies. We'll have many clients asking for the same services, and clients disappear and reappear with different identities. Here is a simple, reasonably secure solution:

  • Every request generates a universally unique ID (UUID), which Titanic returns to the client after it has queued the request.
  • When a client asks for a reply, it must specify the UUID for the original request.

In a realistic case, the client would want to store its request UUIDs safely, e.g., in a local database.

Before we jump off and write yet another formal specification (fun, fun!), let's consider how the client talks to Titanic. One way is to use a single service and send it three different request types. Another way, which seems simpler, is to use three services:

  • titanic.request: store a request message, and return a UUID for the request.
  • titanic.reply: fetch a reply, if available, for a given request UUID.
  • titanic.close: confirm that a reply has been stored and processed.
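
To make the three services concrete, here is a rough sketch of the storage operations behind them, using one file per message keyed by the request UUID. It is not the Guide's Titanic code: the `.req` and `.rep` extensions, the directory argument, and the function names are assumptions for illustration, and the sketch leaves out the ZeroMQ messaging and any `fsync` call a real implementation would need for durability.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define TITANIC_PATH_MAX 256    // Hypothetical limit for the sketch

// Build "<dir>/<uuid>.<ext>"; returns 0 if it fits, -1 otherwise
static int
s_filename (char *path, size_t size,
            const char *dir, const char *uuid, const char *ext)
{
    int length = snprintf (path, size, "%s/%s.%s", dir, uuid, ext);
    return (length > 0 && (size_t) length < size)? 0: -1;
}

// Used by titanic.request (ext "req") and by the dispatcher when a
// reply arrives (ext "rep"): write one message to its own file
static int
titanic_store (const char *dir, const char *uuid,
               const char *ext, const char *body)
{
    assert (dir && uuid && ext && body);
    char path [TITANIC_PATH_MAX];
    if (s_filename (path, sizeof (path), dir, uuid, ext))
        return -1;
    FILE *file = fopen (path, "w");
    if (!file)
        return -1;
    int rc = fputs (body, file) >= 0? 0: -1;
    if (fclose (file) != 0)
        rc = -1;
    return rc;
}

// Used by titanic.reply: returns 0 and fills body if the file exists,
// -1 if there is nothing stored yet for this UUID
static int
titanic_fetch (const char *dir, const char *uuid,
               const char *ext, char *body, size_t size)
{
    assert (dir && uuid && ext && body && size > 0);
    char path [TITANIC_PATH_MAX];
    if (s_filename (path, sizeof (path), dir, uuid, ext))
        return -1;
    FILE *file = fopen (path, "r");
    if (!file)
        return -1;
    size_t length = fread (body, 1, size - 1, file);
    body [length] = 0;
    fclose (file);
    return 0;
}

// Used by titanic.close: the client is done, so wipe both files
static void
titanic_close (const char *dir, const char *uuid)
{
    char path [TITANIC_PATH_MAX];
    if (s_filename (path, sizeof (path), dir, uuid, "req") == 0)
        remove (path);
    if (s_filename (path, sizeof (path), dir, uuid, "rep") == 0)
        remove (path);
}
```

With this layout, "no reply yet" is simply a missing `.rep` file, which is why the client can poll titanic.reply as often as it likes without the server keeping any per-client state.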

We'll just make a multithreaded worker, which, as we've seen from our multithreading experience with ZeroMQ, is trivial. However, let's first sketch what Titanic would look like in terms of ZeroMQ messages and frames. This gives us the Titanic Service Protocol (TSP).

Using TSP is clearly more work for client applications than accessing a service directly via MDP. Here's the shortest robust "echo" client example:


Of course this can be, and should be, wrapped up in some kind of framework or API. It's not healthy to ask average application developers to learn the full details of messaging: it hurts their brains, costs time, and offers too many ways to make buggy complexity. Additionally, it makes it hard to add intelligence.

For example, this client blocks on each request whereas in a real application, we'd want to be doing useful work while tasks are executed. This requires some nontrivial plumbing to build a background thread and talk to that cleanly. It's the kind of thing you want to wrap in a nice simple API that the average developer cannot misuse. It's the same approach that we used for Majordomo.

Here's the Titanic implementation. This server handles the three services using three threads, as proposed. It does full persistence to disk using the most brutal approach possible: one file per message. It's so simple, it's scary. The only complex part is that it keeps a separate queue of all requests, to avoid reading the directory over and over:


To test this, start mdbroker and titanic, and then run ticlient. Now start mdworker arbitrarily, and you should see the client getting a response and exiting happily.

Some notes about this code:

  • Note that some loops start by sending, others by receiving messages. This is because Titanic acts both as a client and a worker in different roles.
  • The Titanic broker uses the MMI service discovery protocol to send requests only to services that appear to be running. Since the MMI implementation in our little Majordomo broker is quite poor, this won't work all the time.
  • We use an inproc connection to send new request data from the titanic.request service through to the main dispatcher. This saves the dispatcher from having to scan the disk directory, load all request files, and sort them by date/time.

The important thing about this example is not performance (which, although I haven't tested it, is surely terrible), but how well it implements the reliability contract. To try it, start the mdbroker and titanic programs. Then start the ticlient, and then start the mdworker echo service. You can run all four of these using the -v option to do verbose activity tracing. You can stop and restart any piece except the client and nothing will get lost.

If you want to use Titanic in real cases, you'll rapidly be asking "how do we make this faster?"

Here's what I'd do, starting with the example implementation:

  • Use a single disk file for all data, rather than multiple files. Operating systems are usually better at handling a few large files than many smaller ones.
  • Organize that disk file as a circular buffer so that new requests can be written contiguously (with very occasional wraparound). One thread, writing full speed to a disk file, can work rapidly.
  • Keep the index in memory and rebuild the index at startup time, from the disk buffer. This saves the extra disk head flutter needed to keep the index fully safe on disk. You would want an fsync after every message, or every N milliseconds if you were prepared to lose the last M messages in case of a system failure.
  • Use a solid-state drive rather than spinning iron oxide platters.
  • Pre-allocate the entire file, or allocate it in large chunks, which allows the circular buffer to grow and shrink as needed. This avoids fragmentation and ensures that most reads and writes are contiguous.
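
As a rough illustration of the circular-buffer idea, here is a tiny in-memory version that stores length-prefixed records and wraps around when it reaches the end. Everything about it is an assumption made for the sketch: the 64-byte size, the 2-byte length prefix, and the names. A real store would apply the same offsets to a pre-allocated disk file and add the fsync and in-memory index mentioned above.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define RING_SIZE 64        // Tiny on purpose; a real file would be far larger

typedef struct {
    uint8_t data [RING_SIZE];
    size_t head;            // Offset of the oldest unread byte
    size_t tail;            // Offset where the next byte is written
    size_t used;            // Number of bytes currently stored
} ring_t;

// Copy bytes in at the tail, wrapping at the end of the buffer
static void
s_ring_write (ring_t *self, const void *source, size_t length)
{
    const uint8_t *bytes = source;
    for (size_t i = 0; i < length; i++) {
        self->data [self->tail] = bytes [i];
        self->tail = (self->tail + 1) % RING_SIZE;
    }
    self->used += length;
}

// Copy bytes out from the head, wrapping at the end of the buffer
static void
s_ring_read (ring_t *self, void *target, size_t length)
{
    uint8_t *bytes = target;
    for (size_t i = 0; i < length; i++) {
        bytes [i] = self->data [self->head];
        self->head = (self->head + 1) % RING_SIZE;
    }
    self->used -= length;
}

// Append one record (2-byte length, then body).
// Returns 0 on success, -1 if there is not enough free space.
static int
ring_append (ring_t *self, const void *body, uint16_t length)
{
    assert (self && body);
    if (sizeof (length) + length > RING_SIZE - self->used)
        return -1;
    s_ring_write (self, &length, sizeof (length));
    s_ring_write (self, body, length);
    return 0;
}

// Remove the oldest record into buffer. Returns its length, or -1 if
// the ring is empty or the buffer is too small (record stays queued).
static int
ring_pop (ring_t *self, void *buffer, size_t size)
{
    assert (self && buffer);
    if (self->used == 0)
        return -1;
    size_t saved_head = self->head;
    size_t saved_used = self->used;
    uint16_t length;
    s_ring_read (self, &length, sizeof (length));
    if (length > size) {
        self->head = saved_head;    // Put the length prefix back
        self->used = saved_used;
        return -1;
    }
    s_ring_read (self, buffer, length);
    return (int) length;
}
```

Because records are only ever added at the tail and removed at the head, writes stay sequential except at the wraparound point, which is the property that makes this layout attractive on disk.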

And so on. What I'd not recommend is storing messages in a database, not even a "fast" key/value store, unless you really like a specific database and don't have performance worries. You will pay a steep price for the abstraction, ten to a thousand times over a raw disk file.

If you want to make Titanic even more reliable, duplicate the requests to a second server, which you'd place in a second location just far away enough to survive a nuclear attack on your primary location, yet not so far that you get too much latency.

If you want to make Titanic much faster and less reliable, store requests and replies purely in memory. This will give you the functionality of a disconnected network, but requests won't survive a crash of the Titanic server itself.

High-Availability Pair (Binary Star Pattern)

Figure 52 - High-Availability Pair, Normal Operation

fig52.png

The Binary Star pattern puts two servers in a primary-backup high-availability pair. At any given time, one of these (the active) accepts connections from client applications. The other (the passive) does nothing, but the two servers monitor each other. If the active disappears from the network, after a certain time the passive takes over as active.

We developed the Binary Star pattern at iMatix for our OpenAMQ server. We designed it:

  • To provide a straightforward high-availability solution.
  • To be simple enough to actually understand and use.
  • To fail over reliably when needed, and only when needed.

Assuming we have a Binary Star pair running, here are the different scenarios that will result in a failover:

  • The hardware running the primary server has a fatal problem (power supply explodes, machine catches fire, or someone simply unplugs it by mistake), and disappears. Applications see this, and reconnect to the backup server.
  • The network segment on which the primary server sits crashes (perhaps a router gets hit by a power spike), and applications start to reconnect to the backup server.
  • The primary server crashes or is killed by the operator and does not restart automatically.

Figure 53 - High-availability Pair During Failover

fig53.png

Recovery from failover works as follows:

  • The operators restart the primary server and fix whatever problems were causing it to disappear from the network.
  • The operators stop the backup server at a moment when it will cause minimal disruption to applications.
  • When applications have reconnected to the primary server, the operators restart the backup server.

Recovery (to using the primary server as active) is a manual operation. Painful experience teaches us that automatic recovery is undesirable. There are several reasons:

  • Failover creates an interruption of service to applications, possibly lasting 10-30 seconds. If there is a real emergency, this is much better than total outage. But if recovery creates a further 10-30 second outage, it is better that this happens off-peak, when users have gone off the network.
  • When there is an emergency, the absolute first priority is certainty for those trying to fix things. Automatic recovery creates uncertainty for system administrators, who can no longer be sure which server is in charge without double-checking.
  • Automatic recovery can create situations where networks fail over and then recover, placing operators in the difficult position of analyzing what happened. There was an interruption of service, but the cause isn't clear.

Having said this, the Binary Star pattern will fail back to the primary server if this is running (again) and the backup server fails. In fact, this is how we provoke recovery.

The shutdown process for a Binary Star pair is to either:

  1. Stop the passive server and then stop the active server at any later time, or
  2. Stop both servers in any order but within a few seconds of each other.

Stopping the active and then the passive server with any delay longer than the failover timeout will cause applications to disconnect, then reconnect, and then disconnect again, which may disturb users.

Detailed Requirements

Binary Star is as simple as it can be, while still working accurately. In fact, the current design is the third complete redesign. Each of the previous designs we found to be too complex, trying to do too much, and we stripped out functionality until we came to a design that was understandable, easy to use, and reliable enough to be worth using.

These are our requirements for a high-availability architecture:

  • The failover is meant to provide insurance against catastrophic system failures, such as hardware breakdown, fire, accident, and so on. There are simpler ways to recover from ordinary server crashes and we already covered these.
  • Failover time should be under 60 seconds and preferably under 10 seconds.
  • Failover has to happen automatically, whereas recovery must happen manually. We want applications to switch over to the backup server automatically, but we do not want them to switch back to the primary server except when the operators have fixed whatever problem there was and decided that it is a good time to interrupt applications again.
  • The semantics for client applications should be simple and easy for developers to understand. Ideally, they should be hidden in the client API.
  • There should be clear instructions for network architects on how to avoid designs that could lead to split brain syndrome, in which both servers in a Binary Star pair think they are the active server.
  • There should be no dependencies on the order in which the two servers are started.
  • It must be possible to make planned stops and restarts of either server without stopping client applications (though they may be forced to reconnect).
  • Operators must be able to monitor both servers at all times.
  • It must be possible to connect the two servers using a high-speed dedicated network connection. That is, failover synchronization must be able to use a specific IP route.

We make the following assumptions:

  • A single backup server provides enough insurance; we don't need multiple levels of backup.
  • The primary and backup servers are equally capable of carrying the application load. We do not attempt to balance load across the servers.
  • There is sufficient budget to cover a fully redundant backup server that does nothing almost all the time.

We don't attempt to cover the following:

  • The use of an active backup server or load balancing. In a Binary Star pair, the backup server is inactive and does no useful work until the primary server goes offline.
  • The handling of persistent messages or transactions in any way. We assume the existence of a network of unreliable (and probably untrusted) servers or Binary Star pairs.
  • Any automatic exploration of the network. The Binary Star pair is manually and explicitly defined in the network and is known to applications (at least in their configuration data).
  • Replication of state or messages between servers. All server-side state must be recreated by applications when they fail over.

Here is the key terminology that we use in Binary Star:

  • Primary: the server that is normally or initially active.
  • Backup: the server that is normally passive. It will become active if and when the primary server disappears from the network, and when client applications ask the backup server to connect.
  • Active: the server that accepts client connections. There is at most one active server.
  • Passive: the server that takes over if the active disappears. Note that when a Binary Star pair is running normally, the primary server is active, and the backup is passive. When a failover has happened, the roles are switched.

To configure a Binary Star pair, you need to:

  1. Tell the primary server where the backup server is located.
  2. Tell the backup server where the primary server is located.
  3. Optionally, tune the failover response times, which must be the same for both servers.

The main tuning concern is how frequently you want the servers to check their peering status, and how quickly you want to activate failover. In our example, the failover timeout value defaults to 2,000 msec. If you reduce this, the backup server will take over as active more rapidly but may take over in cases where the primary server could recover. For example, you may have wrapped the primary server in a shell script that restarts it if it crashes. In that case, the timeout should be higher than the time needed to restart the primary server.

For client applications to work properly with a Binary Star pair, they must:

  1. Know both server addresses.
  2. Try to connect to the primary server, and if that fails, to the backup server.
  3. Detect a failed connection, typically using heartbeating.
  4. Try to reconnect to the primary, and then backup (in that order), with a delay between retries that is at least as high as the server failover timeout.
  5. Recreate all of the state they require on a server.
  6. Retransmit messages lost during a failover, if messages need to be reliable.

It's not trivial work, and we'd usually wrap this in an API that hides it from real end-user applications.

These are the main limitations of the Binary Star pattern:

  • A server process cannot be part of more than one Binary Star pair.
  • A primary server can have a single backup server, and no more.
  • The passive server does no useful work, and is thus wasted.
  • The backup server must be capable of handling full application loads.
  • Failover configuration cannot be modified at runtime.
  • Client applications must do some work to benefit from failover.

Preventing Split-Brain Syndrome

Split-brain syndrome occurs when different parts of a cluster think they are active at the same time. It causes applications to stop seeing each other. Binary Star has an algorithm for detecting and eliminating split brain, which is based on a three-way decision mechanism (a server will not decide to become active until it gets application connection requests and it cannot see its peer server).
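
The three-way decision can be written down as a single predicate. This is a sketch rather than the Guide's code: the `ha_view_t` structure and the `should_take_over` name are made up for illustration, and the caller is assumed to supply the current time in milliseconds.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

// What a server knows when it has to decide (hypothetical structure)
typedef struct {
    bool is_passive;        // We are currently the passive server
    int64_t peer_expiry;    // Time (msec) after which the peer counts as gone
} ha_view_t;

// A server takes over only when all three conditions hold:
//   1. it is the passive server,
//   2. a client is actually asking it for service, and
//   3. it has heard nothing from its peer since peer_expiry.
static bool
should_take_over (const ha_view_t *view, bool client_request, int64_t now)
{
    assert (view);
    bool peer_unseen = now >= view->peer_expiry;
    return view->is_passive && client_request && peer_unseen;
}
```

The point of requiring the client request is that silence from the peer alone proves nothing: the link between the two servers may be down while clients can still reach the active one.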

However, it is still possible to (mis)design a network to fool this algorithm. A typical scenario would be a Binary Star pair that is distributed between two buildings, where each building also had a set of applications and where there was a single network link between both buildings. Breaking this link would create two sets of client applications, each with half of the Binary Star pair, and each failover server would become active.

To prevent split-brain situations, we must connect a Binary Star pair using a dedicated network link, which can be as simple as plugging them both into the same switch or, better, using a crossover cable directly between two machines.

We must not split a Binary Star architecture into two islands, each with a set of applications. While this may be a common type of network architecture, you should use federation, not high-availability failover, in such cases.

A suitably paranoid network configuration would use two private cluster interconnects, rather than a single one. Further, the network cards used for the cluster would be different from those used for message traffic, and possibly even on different paths on the server hardware. The goal is to separate possible failures in the network from possible failures in the cluster. Network ports can have a relatively high failure rate.

Binary Star Implementation

Without further ado, here is a proof-of-concept implementation of the Binary Star server. The primary and backup servers run the same code; you choose their roles when you run the code:

// Binary Star server proof-of-concept implementation. This server does no
// real work; it just demonstrates the Binary Star failover model.

#include "czmq.h"

// States we can be in at any point in time
typedef enum {
    STATE_PRIMARY = 1,          // Primary, waiting for peer to connect
    STATE_BACKUP = 2,           // Backup, waiting for peer to connect
    STATE_ACTIVE = 3,           // Active - accepting connections
    STATE_PASSIVE = 4           // Passive - not accepting connections
} state_t;

// Events, which start with the states our peer can be in
typedef enum {
    PEER_PRIMARY = 1,           // HA peer is pending primary
    PEER_BACKUP = 2,            // HA peer is pending backup
    PEER_ACTIVE = 3,            // HA peer is active
    PEER_PASSIVE = 4,           // HA peer is passive
    CLIENT_REQUEST = 5          // Client makes request
} event_t;

// Our finite state machine
typedef struct {
    state_t state;              // Current state
    event_t event;              // Current event
    int64_t peer_expiry;        // When peer is considered 'dead'
} bstar_t;

// We send state information this often
// If peer doesn't respond in two heartbeats, it is 'dead'
#define HEARTBEAT 1000          // In msecs

// The heart of the Binary Star design is its finite-state machine (FSM).
// The FSM runs one event at a time. We apply an event to the current state,
// which checks if the event is accepted, and if so, sets a new state:

static bool
s_state_machine (bstar_t *fsm)
{
    bool exception = false;

    // These are the PRIMARY and BACKUP states; we're waiting to become
    // ACTIVE or PASSIVE depending on events we get from our peer:
    if (fsm->state == STATE_PRIMARY) {
        if (fsm->event == PEER_BACKUP) {
            printf ("I: connected to backup (passive), ready active\n");
            fsm->state = STATE_ACTIVE;
        }
        else
        if (fsm->event == PEER_ACTIVE) {
            printf ("I: connected to backup (active), ready passive\n");
            fsm->state = STATE_PASSIVE;
        }
        // Accept client connections
    }
    else
    if (fsm->state == STATE_BACKUP) {
        if (fsm->event == PEER_ACTIVE) {
            printf ("I: connected to primary (active), ready passive\n");
            fsm->state = STATE_PASSIVE;
        }
        else
        // Reject client connections when acting as backup
        if (fsm->event == CLIENT_REQUEST)
            exception = true;
    }
    else
    // These are the ACTIVE and PASSIVE states:

    if (fsm->state == STATE_ACTIVE) {
        if (fsm->event == PEER_ACTIVE) {
            // Two actives would mean split-brain
            printf ("E: fatal error - dual actives, aborting\n");
            exception = true;
        }
    }
    else
    // Server is passive
    // CLIENT_REQUEST events can trigger failover if peer looks dead
    if (fsm->state == STATE_PASSIVE) {
        if (fsm->event == PEER_PRIMARY) {
            // Peer is restarting - become active, peer will go passive
            printf ("I: primary (passive) is restarting, ready active\n");
            fsm->state = STATE_ACTIVE;
        }
        else
        if (fsm->event == PEER_BACKUP) {
            // Peer is restarting - become active, peer will go passive
            printf ("I: backup (passive) is restarting, ready active\n");
            fsm->state = STATE_ACTIVE;
        }
        else
        if (fsm->event == PEER_PASSIVE) {
            // Two passives would mean cluster would be non-responsive
            printf ("E: fatal error - dual passives, aborting\n");
            exception = true;
        }
        else
        if (fsm->event == CLIENT_REQUEST) {
            // Peer becomes active if timeout has passed
            // It's the client request that triggers the failover
            assert (fsm->peer_expiry > 0);
            if (zclock_time () >= fsm->peer_expiry) {
                // If peer is dead, switch to the active state
                printf ("I: failover successful, ready active\n");
                fsm->state = STATE_ACTIVE;
            }
            else
                // If peer is alive, reject connections
                exception = true;
        }
    }
    return exception;
}

// This is our main task. First we bind/connect our sockets with our
// peer and make sure we will get state messages correctly. We use
// three sockets; one to publish state, one to subscribe to state, and
// one for client requests/replies:

int main (int argc, char *argv [])
{
    // Arguments can be either of:
    //     -p  primary server, at tcp://localhost:5001
    //     -b  backup server, at tcp://localhost:5002
    zctx_t *ctx = zctx_new ();
    void *statepub = zsocket_new (ctx, ZMQ_PUB);
    void *statesub = zsocket_new (ctx, ZMQ_SUB);
    zsocket_set_subscribe (statesub, "");
    void *frontend = zsocket_new (ctx, ZMQ_ROUTER);
    bstar_t fsm = { 0 };

    if (argc == 2 && streq (argv [1], "-p")) {
        printf ("I: Primary active, waiting for backup (passive)\n");
        zsocket_bind (frontend, "tcp://*:5001");
        zsocket_bind (statepub, "tcp://*:5003");
        zsocket_connect (statesub, "tcp://localhost:5004");
        fsm.state = STATE_PRIMARY;
    }
    else
    if (argc == 2 && streq (argv [1], "-b")) {
        printf ("I: Backup passive, waiting for primary (active)\n");
        zsocket_bind (frontend, "tcp://*:5002");
        zsocket_bind (statepub, "tcp://*:5004");
        zsocket_connect (statesub, "tcp://localhost:5003");
        fsm.state = STATE_BACKUP;
    }
    else {
        printf ("Usage: bstarsrv { -p | -b }\n");
        zctx_destroy (&ctx);
        exit (0);
    }
    // We now process events on our two input sockets, and process these
    // events one at a time via our finite-state machine. Our "work" for
    // a client request is simply to echo it back:

    // Set timer for next outgoing state message
    int64_t send_state_at = zclock_time () + HEARTBEAT;
    while (!zctx_interrupted) {
        zmq_pollitem_t items [] = {
            { frontend, 0, ZMQ_POLLIN, 0 },
            { statesub, 0, ZMQ_POLLIN, 0 }
        };
        int time_left = (int) ((send_state_at - zclock_time ()));
        if (time_left < 0)
            time_left = 0;
        int rc = zmq_poll (items, 2, time_left * ZMQ_POLL_MSEC);
        if (rc == -1)
            break;              // Context has been shut down

        if (items [0].revents & ZMQ_POLLIN) {
            // Have a client request
            zmsg_t *msg = zmsg_recv (frontend);
            fsm.event = CLIENT_REQUEST;
            if (s_state_machine (&fsm) == false)
                // Answer client by echoing request back
                zmsg_send (&msg, frontend);
            else
                zmsg_destroy (&msg);
        }
        if (items [1].revents & ZMQ_POLLIN) {
            // Have state from our peer, execute as event
            char *message = zstr_recv (statesub);
            fsm.event = atoi (message);
            free (message);
            if (s_state_machine (&fsm))
                break;          // Error, so exit
            fsm.peer_expiry = zclock_time () + 2 * HEARTBEAT;
        }
        // If we timed out, send state to peer
        if (zclock_time () >= send_state_at) {
            char message [2];
            sprintf (message, "%d", fsm.state);
            zstr_send (statepub, message);
            send_state_at = zclock_time () + HEARTBEAT;
        }
    }
    if (zctx_interrupted)
        printf ("W: interrupted\n");

    // Shutdown sockets and context
    zctx_destroy (&ctx);
    return 0;
}


And here is the client:

// Binary Star client proof-of-concept implementation. This client does no
// real work; it just demonstrates the Binary Star failover model.

#include "czmq.h"
#define REQUEST_TIMEOUT 1000    // msecs
#define SETTLE_DELAY 2000       // Before failing over

int main (void)
{
    zctx_t *ctx = zctx_new ();

    char *server [] = { "tcp://localhost:5001", "tcp://localhost:5002" };
    uint server_nbr = 0;

    printf ("I: connecting to server at %s...\n", server [server_nbr]);
    void *client = zsocket_new (ctx, ZMQ_REQ);
    zsocket_connect (client, server [server_nbr]);

    int sequence = 0;
    while (!zctx_interrupted) {
        // We send a request, then we work to get a reply
        char request [10];
        sprintf (request, "%d", ++sequence);
        zstr_send (client, request);

        int expect_reply = 1;
        while (expect_reply) {
            // Poll socket for a reply, with timeout
            zmq_pollitem_t items [] = { { client, 0, ZMQ_POLLIN, 0 } };
            int rc = zmq_poll (items, 1, REQUEST_TIMEOUT * ZMQ_POLL_MSEC);
            if (rc == -1)
                break;          // Interrupted

            // We use a Lazy Pirate strategy in the client. If there's no
            // reply within our timeout, we close the socket and try again.
            // In Binary Star, it's the client vote that decides which
            // server is primary; the client must therefore try to connect
            // to each server in turn:

            if (items [0].revents & ZMQ_POLLIN) {
                // We got a reply from the server, must match sequence
                char *reply = zstr_recv (client);
                if (atoi (reply) == sequence) {
                    printf ("I: server replied OK (%s)\n", reply);
                    expect_reply = 0;
                    sleep (1);  // One request per second
                }
                else
                    printf ("E: bad reply from server: %s\n", reply);
                free (reply);
            }
            else {
                printf ("W: no response from server, failing over\n");

                // Old socket is confused; close it and open a new one
                zsocket_destroy (ctx, client);
                server_nbr = (server_nbr + 1) % 2;
                zclock_sleep (SETTLE_DELAY);
                printf ("I: connecting to server at %s...\n",
                        server [server_nbr]);
                client = zsocket_new (ctx, ZMQ_REQ);
                zsocket_connect (client, server [server_nbr]);

                // Send request again, on new socket
                zstr_send (client, request);
            }
        }
    }
    zctx_destroy (&ctx);
    return 0;
}


To test Binary Star, start the servers and client in any order:

bstarsrv -p     # Start primary
bstarsrv -b     # Start backup
bstarcli

You can then provoke failover by killing the primary server, and recovery by restarting the primary and killing the backup. Note how it's the client vote that triggers failover, and recovery.

Binary Star is driven by a finite state machine. Events are the peer state, so "Peer Active" means the other server has told us it's active. "Client Request" means we've received a client request. "Client Vote" means we've received a client request AND our peer is inactive for two heartbeats.
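
In the proof-of-concept server, peer-state events travel as short decimal strings and are turned back into events with atoi. If you wanted to be stricter about what arrives on the wire, one possible approach is to validate the string before handing it to the state machine. The sketch below is an addition, not part of the Guide's code; `format_state`, `parse_peer_state`, and the `STATE_MIN`/`STATE_MAX` bounds are names assumed for illustration, mirroring the values 1 to 4 used by the state enum.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>

#define STATE_MIN 1     // Same value as STATE_PRIMARY in the server
#define STATE_MAX 4     // Same value as STATE_PASSIVE in the server

// Sender side: encode our state as a decimal string
static void
format_state (char *message, size_t size, int state)
{
    assert (state >= STATE_MIN && state <= STATE_MAX);
    snprintf (message, size, "%d", state);
}

// Receiver side: returns the peer state (1..4), or -1 if the message
// is empty, not a number, has trailing junk, or is out of range
static int
parse_peer_state (const char *message)
{
    if (!message || !*message)
        return -1;
    char *end = NULL;
    errno = 0;
    long value = strtol (message, &end, 10);
    if (errno != 0 || *end != '\0'
    ||  value < STATE_MIN || value > STATE_MAX)
        return -1;
    return (int) value;
}
```

A server using this would simply skip any message for which `parse_peer_state` returns -1, rather than feeding an unknown event into the state machine.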

Note that the servers use PUB-SUB sockets for state exchange. No other socket combination will work here. PUSH and DEALER block if there is no peer ready to receive a message. PAIR does not reconnect if the peer disappears and comes back. ROUTER needs the address of the peer before it can send it a message.

Figure 54 - Binary Star Finite State Machine

fig54.png

Binary Star Reactor

Binary Star is useful and generic(类的) enough to package up as a reusable(可重复使用的) reactor(反应器) class. The reactor then runs and calls our code whenever it has a message to process. This is much nicer than copying/pasting(裱糊) the Binary Star code into each server where we want that capability(才能).

In C, we wrap the CZMQ zloop class that we saw before. zloop lets you register handlers to react to socket and timer events. In the Binary Star reactor, we provide handlers for voters and for state changes (active to passive, and vice versa). Here is the bstar API:

// bstar class - Binary Star reactor

#include "bstar.h"

// States we can be in at any point in time
typedef enum {
STATE_PRIMARY = 1, // Primary, waiting for peer to connect
STATE_BACKUP = 2, // Backup, waiting for peer to connect
STATE_ACTIVE = 3, // Active - accepting connections
STATE_PASSIVE = 4 // Passive - not accepting connections
} state_t;

// Events, which start with the states our peer can be in
typedef enum {
PEER_PRIMARY = 1, // HA peer is pending primary
PEER_BACKUP = 2, // HA peer is pending backup
PEER_ACTIVE = 3, // HA peer is active
PEER_PASSIVE = 4, // HA peer is passive
CLIENT_REQUEST = 5 // Client makes request
} event_t;

// Structure of our class

struct _bstar_t {
zctx_t *ctx; // Our private context
zloop_t *loop; // Reactor loop
void *statepub; // State publisher
void *statesub; // State subscriber
state_t state; // Current state
event_t event; // Current event
int64_t peer_expiry; // When peer is considered 'dead'
zloop_fn *voter_fn; // Voting socket handler
void *voter_arg; // Arguments for voting handler
zloop_fn *active_fn; // Call when become active
void *active_arg; // Arguments for handler
zloop_fn *passive_fn; // Call when become passive
void *passive_arg; // Arguments for handler
};

// The finite-state machine is the same as in the proof-of-concept server.
// To understand this reactor in detail, first read the CZMQ zloop class.

// We send state information this often
// If peer doesn't respond in two heartbeats, it is 'dead'
#define BSTAR_HEARTBEAT 1000 // In msecs

// Binary Star finite state machine (applies event to state)
// Returns -1 if there was an exception, 0 if event was valid.

static int
s_execute_fsm (bstar_t *self)
{
int rc = 0;
// Primary server is waiting for peer to connect
// Accepts CLIENT_REQUEST events in this state
if (self->state == STATE_PRIMARY) {
if (self->event == PEER_BACKUP) {
zclock_log ("I: connected to backup (passive), ready as active");
self->state = STATE_ACTIVE;
if (self->active_fn)
(self->active_fn) (self->loop, NULL, self->active_arg);
}
else
if (self->event == PEER_ACTIVE) {
zclock_log ("I: connected to backup (active), ready as passive");
self->state = STATE_PASSIVE;
if (self->passive_fn)
(self->passive_fn) (self->loop, NULL, self->passive_arg);
}
else
if (self->event == CLIENT_REQUEST) {
// Allow client requests to turn us into the active if we've
// waited sufficiently long to believe the backup is not
// currently acting as active (i.e., after a failover)
assert (self->peer_expiry > 0);
if (zclock_time () >= self->peer_expiry) {
zclock_log ("I: request from client, ready as active");
self->state = STATE_ACTIVE;
if (self->active_fn)
(self->active_fn) (self->loop, NULL, self->active_arg);
} else
// Don't respond to clients yet - it's possible we're
// performing a failback and the backup is currently active
rc = -1;
}
}
else
// Backup server is waiting for peer to connect
// Rejects CLIENT_REQUEST events in this state
if (self->state == STATE_BACKUP) {
if (self->event == PEER_ACTIVE) {
zclock_log ("I: connected to primary (active), ready as passive");
self->state = STATE_PASSIVE;
if (self->passive_fn)
(self->passive_fn) (self->loop, NULL, self->passive_arg);
}
else
if (self->event == CLIENT_REQUEST)
rc = -1;
}
else
// Server is active
// Accepts CLIENT_REQUEST events in this state
// The only way out of ACTIVE is death
if (self->state == STATE_ACTIVE) {
if (self->event == PEER_ACTIVE) {
// Two actives would mean split-brain
zclock_log ("E: fatal error - dual actives, aborting");
rc = -1;
}
}
else
// Server is passive
// CLIENT_REQUEST events can trigger failover if peer looks dead
if (self->state == STATE_PASSIVE) {
if (self->event == PEER_PRIMARY) {
// Peer is restarting - become active, peer will go passive
zclock_log ("I: primary (passive) is restarting, ready as active");
self->state = STATE_ACTIVE;
}
else
if (self->event == PEER_BACKUP) {
// Peer is restarting - become active, peer will go passive
zclock_log ("I: backup (passive) is restarting, ready as active");
self->state = STATE_ACTIVE;
}
else
if (self->event == PEER_PASSIVE) {
// Two passives would mean cluster would be non-responsive
zclock_log ("E: fatal error - dual passives, aborting");
rc = -1;
}
else
if (self->event == CLIENT_REQUEST) {
// We become active if the peer timeout has passed
// It's the client request that triggers the failover
assert (self->peer_expiry > 0);
if (zclock_time () >= self->peer_expiry) {
// If peer is dead, switch to the active state
zclock_log ("I: failover successful, ready as active");
self->state = STATE_ACTIVE;
}
else
// If peer is alive, reject connections
rc = -1;
}
// Call state change handler if necessary
if (self->state == STATE_ACTIVE && self->active_fn)
(self->active_fn) (self->loop, NULL, self->active_arg);
}
return rc;
}

static void
s_update_peer_expiry (bstar_t *self)
{
self->peer_expiry = zclock_time () + 2 * BSTAR_HEARTBEAT;
}

// Reactor event handlers…

// Publish our state to peer
int s_send_state (zloop_t *loop, int timer_id, void *arg)
{
bstar_t *self = (bstar_t *) arg;
zstr_sendf (self->statepub, "%d", self->state);
return 0;
}

// Receive state from peer, execute finite state machine
int s_recv_state (zloop_t *loop, zmq_pollitem_t *poller, void *arg)
{
bstar_t *self = (bstar_t *) arg;
char *state = zstr_recv (poller->socket);
if (state) {
self->event = atoi (state);
s_update_peer_expiry (self);
free (state);
}
return s_execute_fsm (self);
}

// Application wants to speak to us, see if it's possible
int s_voter_ready (zloop_t *loop, zmq_pollitem_t *poller, void *arg)
{
bstar_t *self = (bstar_t *) arg;
// If server can accept input now, call application handler
self->event = CLIENT_REQUEST;
if (s_execute_fsm (self) == 0)
(self->voter_fn) (self->loop, poller, self->voter_arg);
else {
// Destroy waiting message, no-one to read it
zmsg_t *msg = zmsg_recv (poller->socket);
zmsg_destroy (&msg);
}
return 0;
}

// This is the constructor for our bstar class. We have to tell it
// whether we're primary or backup server, as well as our local and
// remote endpoints to bind and connect to:

bstar_t *
bstar_new (int primary, char *local, char *remote)
{
bstar_t
*self;

self = (bstar_t *) zmalloc (sizeof (bstar_t));

// Initialize the Binary Star
self->ctx = zctx_new ();
self->loop = zloop_new ();
self->state = primary? STATE_PRIMARY: STATE_BACKUP;

// Create publisher for state going to peer
self->statepub = zsocket_new (self->ctx, ZMQ_PUB);
zsocket_bind (self->statepub, local);

// Create subscriber for state coming from peer
self->statesub = zsocket_new (self->ctx, ZMQ_SUB);
zsocket_set_subscribe (self->statesub, "");
zsocket_connect (self->statesub, remote);

// Set up basic reactor events
zloop_timer (self->loop, BSTAR_HEARTBEAT, 0, s_send_state, self);
zmq_pollitem_t poller = { self->statesub, 0, ZMQ_POLLIN };
zloop_poller (self->loop, &poller, s_recv_state, self);
return self;
}

// The destructor shuts down the bstar reactor:

void
bstar_destroy (bstar_t **self_p)
{
assert (self_p);
if (*self_p) {
bstar_t *self = *self_p;
zloop_destroy (&self->loop);
zctx_destroy (&self->ctx);
free (self);
*self_p = NULL;
}
}

// This method returns the underlying zloop reactor, so we can add
// additional timers and readers:

zloop_t *
bstar_zloop (bstar_t *self)
{
return self->loop;
}

// This method registers a client voter socket. Messages received
// on this socket provide the CLIENT_REQUEST events for the Binary Star
// FSM and are passed to the provided application handler. We require
// exactly one voter per bstar instance:

int
bstar_voter (bstar_t *self, char *endpoint, int type, zloop_fn handler,
void *arg)
{
// Hold actual handler+arg so we can call this later
void *socket = zsocket_new (self->ctx, type);
zsocket_bind (socket, endpoint);
assert (!self->voter_fn);
self->voter_fn = handler;
self->voter_arg = arg;
zmq_pollitem_t poller = { socket, 0, ZMQ_POLLIN };
return zloop_poller (self->loop, &poller, s_voter_ready, self);
}

// Register handlers to be called each time there's a state change:

void
bstar_new_active (bstar_t *self, zloop_fn handler, void *arg)
{
assert (!self->active_fn);
self->active_fn = handler;
self->active_arg = arg;
}

void
bstar_new_passive (bstar_t *self, zloop_fn handler, void *arg)
{
assert (!self->passive_fn);
self->passive_fn = handler;
self->passive_arg = arg;
}

// Enable/disable verbose tracing, for debugging:

void bstar_set_verbose (bstar_t *self, bool verbose)
{
zloop_set_verbose (self->loop, verbose);
}

// Finally, start the configured reactor. It will end if any handler
// returns -1 to the reactor, or if the process receives SIGINT or SIGTERM:

int
bstar_start (bstar_t *self)
{
assert (self->voter_fn);
s_update_peer_expiry (self);
return zloop_start (self->loop);
}

And here is the class implementation:

// bstar class - Binary Star reactor

#include "bstar.h"

// States we can be in at any point in time
typedef enum {
STATE_PRIMARY = 1, // Primary, waiting for peer to connect
STATE_BACKUP = 2, // Backup, waiting for peer to connect
STATE_ACTIVE = 3, // Active - accepting connections
STATE_PASSIVE = 4 // Passive - not accepting connections
} state_t;

// Events, which start with the states our peer can be in
typedef enum {
PEER_PRIMARY = 1, // HA peer is pending primary
PEER_BACKUP = 2, // HA peer is pending backup
PEER_ACTIVE = 3, // HA peer is active
PEER_PASSIVE = 4, // HA peer is passive
CLIENT_REQUEST = 5 // Client makes request
} event_t;

// Structure of our class

struct _bstar_t {
zctx_t *ctx; // Our private context
zloop_t *loop; // Reactor loop
void *statepub; // State publisher
void *statesub; // State subscriber
state_t state; // Current state
event_t event; // Current event
int64_t peer_expiry; // When peer is considered 'dead'
zloop_fn *voter_fn; // Voting socket handler
void *voter_arg; // Arguments for voting handler
zloop_fn *active_fn; // Call when become active
void *active_arg; // Arguments for handler
zloop_fn *passive_fn; // Call when become passive
void *passive_arg; // Arguments for handler
};

// The finite-state machine is the same as in the proof-of-concept server.
// To understand this reactor in detail, first read the CZMQ zloop class.

// We send state information this often
// If peer doesn't respond in two heartbeats, it is 'dead'
#define BSTAR_HEARTBEAT 1000 // In msecs

// Binary Star finite state machine (applies event to state)
// Returns -1 if there was an exception, 0 if event was valid.

static int
s_execute_fsm (bstar_t *self)
{
int rc = 0;
// Primary server is waiting for peer to connect
// Accepts CLIENT_REQUEST events in this state
if (self->state == STATE_PRIMARY) {
if (self->event == PEER_BACKUP) {
zclock_log ("I: connected to backup (passive), ready as active");
self->state = STATE_ACTIVE;
if (self->active_fn)
(self->active_fn) (self->loop, NULL, self->active_arg);
}
else
if (self->event == PEER_ACTIVE) {
zclock_log ("I: connected to backup (active), ready as passive");
self->state = STATE_PASSIVE;
if (self->passive_fn)
(self->passive_fn) (self->loop, NULL, self->passive_arg);
}
else
if (self->event == CLIENT_REQUEST) {
// Allow client requests to turn us into the active if we've
// waited sufficiently long to believe the backup is not
// currently acting as active (i.e., after a failover)
assert (self->peer_expiry > 0);
if (zclock_time () >= self->peer_expiry) {
zclock_log ("I: request from client, ready as active");
self->state = STATE_ACTIVE;
if (self->active_fn)
(self->active_fn) (self->loop, NULL, self->active_arg);
} else
// Don't respond to clients yet - it's possible we're
// performing a failback and the backup is currently active
rc = -1;
}
}
else
// Backup server is waiting for peer to connect
// Rejects CLIENT_REQUEST events in this state
if (self->state == STATE_BACKUP) {
if (self->event == PEER_ACTIVE) {
zclock_log ("I: connected to primary (active), ready as passive");
self->state = STATE_PASSIVE;
if (self->passive_fn)
(self->passive_fn) (self->loop, NULL, self->passive_arg);
}