
ØMQ - The Guide

By Pieter Hintjens, CEO of iMatix

Please use the issue tracker for all comments and errata. This version covers the latest stable release of ZeroMQ (3.2). If you are using older versions of ZeroMQ then some of the examples and explanations won't be accurate.

The Guide is originally in C, but also in PHP, Python, Lua, and Haxe. We've also translated most of the examples into C++, C#, CL, Delphi, Erlang, F#, Felix, Haskell, Java, Objective-C, Ruby, Ada, Basic, Clojure, Go, Haxe, Node.js, ooc, Perl, and Scala.

Preface


ZeroMQ in a Hundred Words


ZeroMQ (also known as ØMQ, 0MQ, or zmq) looks like an embeddable networking library but acts like a concurrency framework. It gives you sockets that carry atomic messages across various transports like in-process, inter-process, TCP, and multicast. You can connect sockets N-to-N with patterns like fan-out, pub-sub, task distribution, and request-reply. It's fast enough to be the fabric for clustered products. Its asynchronous I/O model gives you scalable multicore applications, built as asynchronous message-processing tasks. It has a score of language APIs and runs on most operating systems. ZeroMQ is from iMatix and is LGPLv3 open source.

How It Began


We took a normal TCP socket, injected it with a mix of radioactive isotopes stolen from a secret Soviet atomic research project, bombarded it with 1950-era cosmic rays, and put it into the hands of a drug-addled comic book author with a badly-disguised fetish for bulging muscles clad in spandex. Yes, ZeroMQ sockets are the world-saving superheroes of the networking world.

Figure 1 - A terrible accident…

fig1.png

The Zen of Zero


The Ø in ZeroMQ is all about tradeoffs. On the one hand this strange name lowers ZeroMQ's visibility on Google and Twitter. On the other hand it annoys the heck out of some Danish folk who write us things like "ØMG røtfl", and "Ø is not a funny looking zero!" and "Rødgrød med fløde!", which is apparently an insult that means "may your neighbours be the direct descendants of Grendel!" Seems like a fair trade.

Originally the zero in ZeroMQ was meant as "zero broker" and (as close to) "zero latency" (as possible). Since then, it has come to encompass different goals: zero administration, zero cost, zero waste. More generally, "zero" refers to the culture of minimalism that permeates the project. We add power by removing complexity rather than by exposing new functionality.

Audience


This book is written for professional programmers who want to learn how to make the massively distributed software that will dominate the future of computing. We assume you can read C code, because most of the examples here are in C even though ZeroMQ is used in many languages. We assume you care about scale, because ZeroMQ solves that problem above all others. We assume you need the best possible results with the least possible cost, because otherwise you won't appreciate the trade-offs that ZeroMQ makes. Other than that basic background, we try to present all the concepts in networking and distributed computing you will need to use ZeroMQ.

Acknowledgements


Thanks to Andy Oram for making the O'Reilly book happen, and editing this text.

Thanks to Bill Desmarais, Brian Dorsey, Daniel Lin, Eric Desgranges, Gonzalo Diethelm, Guido Goldstein, Hunter Ford, Kamil Shakirov, Martin Sustrik, Mike Castleman, Naveen Chawla, Nicola Peduzzi, Oliver Smith, Olivier Chamoux, Peter Alexander, Pierre Rouleau, Randy Dryburgh, John Unwin, Alex Thomas, Mihail Minkov, Jeremy Avnet, Michael Compton, Kamil Kisiel, Mark Kharitonov, Guillaume Aubert, Ian Barber, Mike Sheridan, Faruk Akgul, Oleg Sidorov, Lev Givon, Allister MacLeod, Alexander D'Archangel, Andreas Hoelzlwimmer, Han Holl, Robert G. Jakabosky, Felipe Cruz, Marcus McCurdy, Mikhail Kulemin, Dr. Gergő Érdi, Pavel Zhukov, Alexander Else, Giovanni Ruggiero, Rick "Technoweenie", Daniel Lundin, Dave Hoover, Simon Jefford, Benjamin Peterson, Justin Case, Devon Weller, Richard Smith, Alexander Morland, Wadim Grasza, Michael Jakl, Uwe Dauernheim, Sebastian Nowicki, Simone Deponti, Aaron Raddon, Dan Colish, Markus Schirp, Benoit Larroque, Jonathan Palardy, Isaiah Peng, Arkadiusz Orzechowski, Umut Aydin, Matthew Horsfall, Jeremy W. Sherman, Eric Pugh, Tyler Sellon, John E. Vincent, Pavel Mitin, Min RK, Igor Wiedler, Olof Åkesson, Patrick Lucas, Heow Goodman, Senthil Palanisami, John Gallagher, Tomas Roos, Stephen McQuay, Erik Allik, Arnaud Cogoluègnes, Rob Gagnon, Dan Williams, Edward Smith, James Tucker, Kristian Kristensen, Vadim Shalts, Martin Trojer, Tom van Leeuwen, Hiten Pandya, Harm Aarts, Marc Harter, Iskren Ivov Chernev, Jay Han, Sonia Hamilton, Nathan Stocks, Naveen Palli, and Zed Shaw for their contributions to this work.


Chapter 1 - Basics


Fixing the World


How to explain ZeroMQ? Some of us start by saying all the wonderful things it does. It's sockets on steroids. It's like mailboxes with routing. It's fast! Others try to share their moment of enlightenment, that zap-pow-kaboom satori paradigm-shift moment when it all became obvious. Things just become simpler. Complexity goes away. It opens the mind. Others try to explain by comparison. It's smaller, simpler, but still looks familiar. Personally, I like to remember why we made ZeroMQ at all, because that's most likely where you, the reader, still are today.

Programming is science dressed up as art because most of us don't understand the physics of software and it's rarely, if ever, taught. The physics of software is not algorithms, data structures, languages and abstractions. These are just tools we make, use, throw away. The real physics of software is the physics of people—specifically, our limitations when it comes to complexity, and our desire to work together to solve large problems in pieces. This is the science of programming: make building blocks that people can understand and use easily, and people will work together to solve the very largest problems.

We live in a connected world, and modern software has to navigate this world. So the building blocks for tomorrow's very largest solutions are connected and massively parallel. It's not enough for code to be "strong and silent" any more. Code has to talk to code. Code has to be chatty, sociable, well-connected. Code has to run like the human brain, trillions of individual neurons firing off messages to each other, a massively parallel network with no central control, no single point of failure, yet able to solve immensely difficult problems. And it's no accident that the future of code looks like the human brain, because the endpoints of every network are, at some level, human brains.

If you've done any work with threads, protocols, or networks, you'll realize this is pretty much impossible. It's a dream. Even connecting a few programs across a few sockets is plain nasty when you start to handle real life situations. Trillions? The cost would be unimaginable. Connecting computers is so difficult that the software and services to do this are a multi-billion dollar business.

So we live in a world where the wiring is years ahead of our ability to use it. We had a software crisis in the 1980s, when leading software engineers like Fred Brooks believed there was no "Silver Bullet" to "promise even one order of magnitude of improvement in productivity, reliability, or simplicity".

Brooks missed free and open source software, which solved that crisis, enabling us to share knowledge efficiently. Today we face another software crisis, but it's one we don't talk about much. Only the largest, richest firms can afford to create connected applications. There is a cloud, but it's proprietary. Our data and our knowledge is disappearing from our personal computers into clouds that we cannot access and with which we cannot compete. Who owns our social networks? It is like the mainframe-PC revolution in reverse.

We can leave the political philosophy for another book. The point is that while the Internet offers the potential of massively connected code, the reality is that this is out of reach for most of us, and so large interesting problems (in health, education, economics, transport, and so on) remain unsolved because there is no way to connect the code, and thus no way to connect the brains that could work together to solve these problems.

There have been many attempts to solve the challenge of connected code. There are thousands of IETF specifications, each solving part of the puzzle. For application developers, HTTP is perhaps the one solution to have been simple enough to work, but it arguably makes the problem worse by encouraging developers and architects to think in terms of big servers and thin, stupid clients.

So today people are still connecting applications using raw UDP and TCP, proprietary protocols, HTTP, and Websockets. It remains painful, slow, hard to scale, and essentially centralized. Distributed P2P architectures are mostly for play, not work. How many applications use Skype or Bittorrent to exchange data?

Which brings us back to the science of programming. To fix the world, we needed to do two things. One, to solve the general problem of "how to connect any code to any code, anywhere". Two, to wrap that up in the simplest possible building blocks that people could understand and use easily.

It sounds ridiculously simple. And maybe it is. That's kind of the whole point.

Starting Assumptions


We assume you are using at least version 3.2 of ZeroMQ. We assume you are using a Linux box or something similar. We assume you can read C code, more or less, as that's the default language for the examples. We assume that when we write constants like PUSH or SUBSCRIBE, you can imagine they are really called ZMQ_PUSH or ZMQ_SUBSCRIBE if the programming language needs it.

Getting the Examples


The examples live in a public GitHub repository. The simplest way to get all the examples is to clone this repository:

git clone --depth=1 https://github.com/imatix/zguide.git

Next, browse the examples subdirectory. You'll find examples by language. If there are examples missing in a language you use, you're encouraged to submit a translation. This is how this text became so useful, thanks to the work of many people. All examples are licensed under MIT/X11.

Ask and Ye Shall Receive


So let's start with some code. We start of course with a Hello World example. We'll make a client and a server. The client sends "Hello" to the server, which replies with "World". Here's the server in C, which opens a ZeroMQ socket on port 5555, reads requests on it, and replies with "World" to each request:
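
In outline, the C server looks like this. This is a minimal sketch in the spirit of the Guide's hwserver.c; the copy in the examples repository may differ in small details such as error checking:

// Hello World server in C (minimal sketch)
// Binds REP socket to tcp://*:5555
// Expects "Hello" from client, replies with "World"

#include <zmq.h>
#include <stdio.h>
#include <unistd.h>

int main (void)
{
    // Socket to talk to clients
    void *context = zmq_ctx_new ();
    void *responder = zmq_socket (context, ZMQ_REP);
    zmq_bind (responder, "tcp://*:5555");

    while (1) {
        char buffer [10];
        zmq_recv (responder, buffer, 10, 0);
        printf ("Received Hello\n");
        sleep (1);          // Do some 'work'
        zmq_send (responder, "World", 5, 0);
    }
    zmq_close (responder);
    zmq_ctx_destroy (context);
    return 0;
}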


C++ | C# | Clojure | CL | Delphi | Erlang | F# | Felix | Go | Haskell | Haxe | Java | Lua | Node.js | Objective-C | Perl | PHP | Python | Q | Racket | Ruby | Scala | Tcl | Ada | Basic | ooc

Figure 2 - Request-Reply

fig2.png

The REQ-REP socket pair is in lockstep. The client issues zmq_send() and then zmq_recv(), in a loop (or once if that's all it needs). Doing any other sequence (e.g., sending two messages in a row) will result in a return code of -1 from the send or recv call. Similarly, the service issues zmq_recv() and then zmq_send() in that order, as often as it needs to.

ZeroMQ uses C as its reference language and this is the main language we'll use for examples. If you're reading this online, the link below the example takes you to translations into other programming languages. Let's compare the same server in C++:

//
// Hello World server in C++
// Binds REP socket to tcp://*:5555
// Expects "Hello" from client, replies with "World"
//

#include <zmq.hpp>
#include <string>
#include <cstring>      // for memcpy
#include <iostream>
#ifndef _WIN32
#include <unistd.h>
#else
#include <windows.h>

#define sleep(n) Sleep(n)
#endif

int main () {
    // Prepare our context and socket
    zmq::context_t context (1);
    zmq::socket_t socket (context, ZMQ_REP);
    socket.bind ("tcp://*:5555");

    while (true) {
        zmq::message_t request;

        // Wait for next request from client
        socket.recv (&request);
        std::cout << "Received Hello" << std::endl;

        // Do some 'work'
        sleep (1);

        // Send reply back to client
        zmq::message_t reply (5);
        memcpy (reply.data (), "World", 5);
        socket.send (reply);
    }
    return 0;
}

hwserver.cpp: Hello World server

You can see that the ZeroMQ API is similar in C and C++. In a language like PHP or Java, we can hide even more and the code becomes even easier to read:

<?php
/*
 * Hello World server
 * Binds REP socket to tcp://*:5555
 * Expects "Hello" from client, replies with "World"
 * @author Ian Barber <ian(dot)barber(at)gmail(dot)com>
 */

$context = new ZMQContext(1);

// Socket to talk to clients
$responder = new ZMQSocket($context, ZMQ::SOCKET_REP);
$responder->bind("tcp://*:5555");

while (true) {
    // Wait for next request from client
    $request = $responder->recv();
    printf ("Received request: [%s]\n", $request);

    // Do some 'work'
    sleep (1);

    // Send reply back to client
    $responder->send("World");
}

hwserver.php: Hello World server

//
// Hello World server in Java
// Binds REP socket to tcp://*:5555
// Expects "Hello" from client, replies with "World"
//

import org.zeromq.ZMQ;

public class hwserver {

    public static void main(String[] args) throws Exception {
        ZMQ.Context context = ZMQ.context(1);

        // Socket to talk to clients
        ZMQ.Socket responder = context.socket(ZMQ.REP);
        responder.bind("tcp://*:5555");

        while (!Thread.currentThread().isInterrupted()) {
            // Wait for next request from the client
            byte[] request = responder.recv(0);
            System.out.println("Received Hello");

            // Do some 'work'
            Thread.sleep(1000);

            // Send reply back to client
            String reply = "World";
            responder.send(reply.getBytes(), 0);
        }
        responder.close();
        context.term();
    }
}

hwserver.java: Hello World server

The server in other languages:


C++ | C# | Clojure | CL | Delphi | Erlang | F# | Felix | Go | Haskell | Haxe | Java | Lua | Node.js | Objective-C | Perl | PHP | Python | Q | Racket | Ruby | Scala | Tcl | Ada | Basic | ooc

Here's the client code:
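
A minimal C client in the same spirit (again a sketch; the repository's hwclient.c may differ in details) connects a REQ socket to the server and loops through ten request-reply cycles:

// Hello World client in C (minimal sketch)
// Connects REQ socket to tcp://localhost:5555
// Sends "Hello" to the server, expects "World" back

#include <zmq.h>
#include <stdio.h>

int main (void)
{
    printf ("Connecting to hello world server...\n");
    void *context = zmq_ctx_new ();
    void *requester = zmq_socket (context, ZMQ_REQ);
    zmq_connect (requester, "tcp://localhost:5555");

    int request_nbr;
    for (request_nbr = 0; request_nbr != 10; request_nbr++) {
        char buffer [10];
        printf ("Sending Hello %d...\n", request_nbr);
        zmq_send (requester, "Hello", 5, 0);
        zmq_recv (requester, buffer, 10, 0);
        printf ("Received World %d\n", request_nbr);
    }
    zmq_close (requester);
    zmq_ctx_destroy (context);
    return 0;
}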


C++ | C# | Clojure | CL | Delphi | Erlang | F# | Felix | Go | Haskell | Haxe | Java | Lua | Node.js | Objective-C | Perl | PHP | Python | Q | Racket | Ruby | Scala | Tcl | Ada | Basic | ooc

Now this looks too simple to be realistic, but ZeroMQ sockets have, as we already learned, superpowers. You could throw thousands of clients at this server, all at once, and it would continue to work happily and quickly. For fun, try starting the client and then starting the server, see how it all still works, then think for a second what this means.

Let us explain briefly what these two programs are actually doing. They create a ZeroMQ context to work with, and a socket. Don't worry what the words mean. You'll pick it up. The server binds its REP (reply) socket to port 5555. The server waits for a request in a loop, and responds each time with a reply. The client sends a request and reads the reply back from the server.

If you kill the server (Ctrl-C) and restart it, the client won't recover properly. Recovering from crashing processes isn't quite that easy. Making a reliable request-reply flow is complex enough that we won't cover it until Chapter 4 - Reliable Request-Reply Patterns.

There is a lot happening behind the scenes but what matters to us programmers is how short and sweet the code is, and how often it doesn't crash, even under a heavy load. This is the request-reply pattern, probably the simplest way to use ZeroMQ. It maps to RPC and the classic client/server model.

A Minor Note on Strings


ZeroMQ doesn't know anything about the data you send except its size in bytes. That means you are responsible for formatting it safely so that applications can read it back. Doing this for objects and complex data types is a job for specialized libraries like Protocol Buffers. But even for strings, you need to take care.

In C and some other languages, strings are terminated with a null byte. We could send a string like "HELLO" with that extra null byte:

zmq_send (requester, "Hello", 6, 0);

However, if you send a string from another language, it probably will not include that null byte. For example, when we send that same string in Python, we do this:

socket.send ("Hello")

Then what goes onto the wire is a length (one byte for shorter strings) and the string contents as individual characters.

Figure 3 - A ZeroMQ string

fig3.png

And if you read this from a C program, you will get something that looks like a string, and might by accident act like a string (if by luck the five bytes find themselves followed by an innocently lurking null), but isn't a proper string. When your client and server don't agree on the string format, you will get weird results.

When you receive string data from ZeroMQ in C, you simply cannot trust that it's safely terminated. Every single time you read a string, you should allocate a new buffer with space for an extra byte, copy the string, and terminate it properly with a null.

So let's establish the rule that ZeroMQ strings are length-specified and are sent on the wire without a trailing null. In the simplest case (and we'll do this in our examples), a ZeroMQ string maps neatly to a ZeroMQ message frame, which looks like the above figure—a length and some bytes.

Here is what we need to do, in C, to receive a ZeroMQ string and deliver it to the application as a valid C string:

// Receive ZeroMQ string from socket and convert into C string
// Chops string at 255 chars, if it's longer

static char *
s_recv (void *socket) {
    char buffer [256];
    int size = zmq_recv (socket, buffer, 255, 0);
    if (size == -1)
        return NULL;
    if (size > 255)
        size = 255;
    buffer [size] = 0;
    return strdup (buffer);
}

This makes a handy helper function and in the spirit of making things we can reuse profitably, let's write a similar s_send function that sends strings in the correct ZeroMQ format, and package this into a header file we can reuse.
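
As a sketch, s_send can be as simple as this; the version in zhelpers.h may add more, but the essence is a length-specified send with no trailing null:

// Convert C string to ZeroMQ string and send to socket
static int
s_send (void *socket, char *string) {
    int size = zmq_send (socket, string, strlen (string), 0);
    return size;
}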

The result is zhelpers.h, which lets us write sweeter and shorter ZeroMQ applications in C. It is a fairly long source, and only fun for C developers, so read it at leisure.

Version Reporting


ZeroMQ does come in several versions and quite often, if you hit a problem, it'll be something that's been fixed in a later version. So it's a useful trick to know exactly what version of ZeroMQ you're actually linking with.

Here is a tiny program that does that:
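
In C, it boils down to a call to zmq_version(); a minimal sketch:

// Report ZeroMQ version (minimal sketch)

#include <zmq.h>
#include <stdio.h>

int main (void)
{
    int major, minor, patch;
    zmq_version (&major, &minor, &patch);
    printf ("Current ZeroMQ version is %d.%d.%d\n", major, minor, patch);
    return 0;
}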


C++ | C# | Clojure | CL | Delphi | Erlang | F# | Felix | Go | Haskell | Java | Lua | Node.js | Objective-C | Perl | PHP | Python | Q | Ruby | Scala | Tcl | Ada | Basic | Haxe | ooc | Racket

Getting the Message Out


The second classic pattern is one-way data distribution, in which a server pushes updates to a set of clients. Let's see an example that pushes out weather updates consisting of a zip code, temperature, and relative humidity. We'll generate random values, just like the real weather stations do.

Here's the server. We'll use port 5556 for this application:
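
A minimal C sketch of the weather server looks like this. The repository's wuserver.c uses the randof() and s_send() helpers from zhelpers.h; here we inline plain rand() and zmq_send() to keep the sketch self-contained:

// Weather update server (minimal sketch)
// Binds PUB socket to tcp://*:5556
// Publishes random weather updates

#include <zmq.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

int main (void)
{
    // Prepare our context and publisher
    void *context = zmq_ctx_new ();
    void *publisher = zmq_socket (context, ZMQ_PUB);
    zmq_bind (publisher, "tcp://*:5556");

    // Initialize random number generator
    srand ((unsigned) time (NULL));
    while (1) {
        // Make up some weather
        int zipcode     = rand () % 100000;
        int temperature = rand () % 215 - 80;
        int relhumidity = rand () % 50 + 10;

        // Send update to all subscribers as a length-specified string
        char update [24];
        sprintf (update, "%05d %d %d", zipcode, temperature, relhumidity);
        zmq_send (publisher, update, strlen (update), 0);
    }
    zmq_close (publisher);
    zmq_ctx_destroy (context);
    return 0;
}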


C++ | C# | Clojure | CL | Delphi | Erlang | F# | Felix | Go | Haskell | Haxe | Java | Lua | Node.js | Objective-C | Perl | PHP | Python | Racket | Ruby | Scala | Tcl | Ada | Basic | ooc | Q

There's no start and no end to this stream of updates; it's like a never-ending broadcast.

Here is the client application, which listens to the stream of updates and grabs anything to do with a specified zip code, by default New York City because that's a great place to start any adventure:
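
A matching C sketch of the client, simplified from the repository's wuclient.c:

// Weather update client (minimal sketch)
// Connects SUB socket to tcp://localhost:5556
// Collects weather updates and finds the average temperature for a zip code

#include <zmq.h>
#include <stdio.h>
#include <string.h>

int main (int argc, char *argv [])
{
    void *context = zmq_ctx_new ();
    void *subscriber = zmq_socket (context, ZMQ_SUB);
    zmq_connect (subscriber, "tcp://localhost:5556");

    // Subscribe to zipcode, default is NYC, 10001
    char *filter = (argc > 1) ? argv [1] : "10001 ";
    zmq_setsockopt (subscriber, ZMQ_SUBSCRIBE, filter, strlen (filter));

    // Process 100 updates
    int update_nbr = 0;
    long total_temp = 0;
    while (update_nbr < 100) {
        char string [256];
        int size = zmq_recv (subscriber, string, 255, 0);
        if (size == -1)
            break;              // Interrupted
        if (size > 255)
            size = 255;
        string [size] = 0;      // Make it a valid C string

        int zipcode, temperature, relhumidity;
        sscanf (string, "%d %d %d", &zipcode, &temperature, &relhumidity);
        total_temp += temperature;
        update_nbr++;
    }
    if (update_nbr)
        printf ("Average temperature for zipcode '%s' was %dF\n",
            filter, (int) (total_temp / update_nbr));

    zmq_close (subscriber);
    zmq_ctx_destroy (context);
    return 0;
}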


C++ | C# | Clojure | CL | Delphi | Erlang | F# | Felix | Go | Haskell | Haxe | Java | Lua | Node.js | Objective-C | Perl | PHP | Python | Racket | Ruby | Scala | Tcl | Ada | Basic | ooc | Q

Figure 4 - Publish-Subscribe

fig4.png

Note that when you use a SUB socket you must set a subscription using zmq_setsockopt() and SUBSCRIBE, as in this code. If you don't set any subscription, you won't get any messages. It's a common mistake for beginners. The subscriber can set many subscriptions, which are added together. That is, if an update matches ANY subscription, the subscriber receives it. The subscriber can also cancel specific subscriptions. A subscription is often, but not necessarily a printable string. See zmq_setsockopt() for how this works.
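
For example, in C, setting and cancelling subscriptions looks like this; the filter is a prefix match against the raw bytes of each message:

// Subscribe to everything (empty prefix), or to one or more specific prefixes
zmq_setsockopt (subscriber, ZMQ_SUBSCRIBE, "", 0);
zmq_setsockopt (subscriber, ZMQ_SUBSCRIBE, "10001 ", 6);
// Cancel one specific subscription again
zmq_setsockopt (subscriber, ZMQ_UNSUBSCRIBE, "10001 ", 6);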

The PUB-SUB socket pair is asynchronous. The client does zmq_recv(), in a loop (or once if that's all it needs). Trying to send a message to a SUB socket will cause an error. Similarly, the service does zmq_send() as often as it needs to, but must not do zmq_recv() on a PUB socket.

In theory with ZeroMQ sockets, it does not matter which end connects and which end binds. However, in practice there are undocumented differences that I'll come to later. For now, bind the PUB and connect the SUB, unless your network design makes that impossible.

There is one more important thing to know about PUB-SUB sockets: you do not know precisely when a subscriber starts to get messages. Even if you start a subscriber, wait a while, and then start the publisher, the subscriber will always miss the first messages that the publisher sends. This is because as the subscriber connects to the publisher (something that takes a small but non-zero time), the publisher may already be sending messages out.

This "slow joiner" symptom(症状) hits enough people often enough that we're going to explain it in detail. Remember that ZeroMQ does asynchronous(异步的) I/O, i.e., in the background. Say you have two nodes doing this, in this order:

  • Subscriber connects to an endpoint and receives and counts messages.
  • Publisher binds to an endpoint and immediately sends 1,000 messages.

Then the subscriber will most likely not receive anything. You'll blink, check that you set a correct filter and try again, and the subscriber will still not receive anything.

Making a TCP connection involves to and from handshaking that takes several milliseconds depending on your network and the number of hops between peers. In that time, ZeroMQ can send many messages. For sake of argument assume it takes 5 msecs to establish a connection, and that same link can handle 1M messages per second. During the 5 msecs that the subscriber is connecting to the publisher, it takes the publisher only 1 msec to send out those 1K messages.

In Chapter 2 - Sockets and Patterns we'll explain how to synchronize a publisher and subscribers so that you don't start to publish data until the subscribers really are connected and ready. There is a simple and stupid way to delay the publisher, which is to sleep. Don't do this in a real application, though, because it is extremely fragile as well as inelegant and slow. Use sleeps to prove to yourself what's happening, and then wait for Chapter 2 - Sockets and Patterns to see how to do this right.

The alternative to synchronization is to simply assume that the published data stream is infinite and has no start and no end. One also assumes that the subscriber doesn't care what transpired before it started up. This is how we built our weather client example.

So the client subscribes to its chosen zip code and collects 100 updates for that zip code. That means about ten million updates from the server, if zip codes are randomly distributed. You can start the client, and then the server, and the client will keep working. You can stop and restart the server as often as you like, and the client will keep working. When the client has collected its hundred updates, it calculates the average, prints it, and exits.

Some points about the publish-subscribe (pub-sub) pattern:

  • A subscriber can connect to more than one publisher, using one connect call each time. Data will then arrive and be interleaved ("fair-queued") so that no single publisher drowns out the others.
  • If a publisher has no connected subscribers, then it will simply drop all messages.
  • If you're using TCP and a subscriber is slow, messages will queue up on the publisher. We'll look at how to protect publishers against this using the "high-water mark" later.
  • From ZeroMQ v3.x, filtering happens at the publisher side when using a connected protocol (tcp:// or ipc://). Using the epgm:// protocol, filtering happens at the subscriber side. In ZeroMQ v2.x, all filtering happened at the subscriber side.

This is how long it takes to receive and filter 10M messages on my laptop, which is a 2011-era Intel i5, decent but nothing special:

$ time wuclient
Collecting updates from weather server...
Average temperature for zipcode '10001 ' was 28F

real    0m4.470s
user    0m0.000s
sys     0m0.008s

Divide and Conquer


Figure 5 - Parallel Pipeline

fig5.png

As a final example (you are surely getting tired of juicy code and want to delve back into philological discussions about comparative abstractive norms), let's do a little supercomputing. Then coffee. Our supercomputing application is a fairly typical parallel processing model. We have:

  • A ventilator that produces tasks that can be done in parallel
  • A set of workers that process tasks
  • A sink that collects results back from the worker processes

In reality, workers run on superfast boxes, perhaps using GPUs (graphics processing units) to do the hard math. Here is the ventilator. It generates 100 tasks, each a message telling the worker to sleep for some number of milliseconds:
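
A C sketch of the ventilator, simplified from the repository's taskvent.c; it waits for you to press Enter so the workers have time to connect before the batch starts:

// Task ventilator (minimal sketch)
// Binds PUSH socket to tcp://*:5557
// Sends a batch of 100 tasks to workers

#include <zmq.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

int main (void)
{
    void *context = zmq_ctx_new ();

    // Socket to send tasks on
    void *sender = zmq_socket (context, ZMQ_PUSH);
    zmq_bind (sender, "tcp://*:5557");

    // Socket to send the start-of-batch signal to the sink
    void *sink = zmq_socket (context, ZMQ_PUSH);
    zmq_connect (sink, "tcp://localhost:5558");

    printf ("Press Enter when the workers are ready: ");
    getchar ();
    printf ("Sending tasks to workers...\n");

    // The first message is "0" and signals start of batch
    zmq_send (sink, "0", 1, 0);

    // Send 100 tasks with random workloads from 1 to 100 msecs
    srand ((unsigned) time (NULL));
    int task_nbr;
    int total_msec = 0;         // Total expected cost in msecs
    for (task_nbr = 0; task_nbr < 100; task_nbr++) {
        int workload = rand () % 100 + 1;
        total_msec += workload;
        char string [10];
        sprintf (string, "%d", workload);
        zmq_send (sender, string, strlen (string), 0);
    }
    printf ("Total expected cost: %d msec\n", total_msec);

    zmq_close (sink);
    zmq_close (sender);
    zmq_ctx_destroy (context);
    return 0;
}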


C++ | C# | Clojure | CL | Delphi | Erlang | F# | Felix | Go | Haskell | Haxe | Java | Lua | Node.js | Objective-C | Perl | PHP | Python | Ruby | Scala | Tcl | Ada | Basic | ooc | Q | Racket

Here is the worker application. It receives a message, sleeps for that number of milliseconds, and then signals that it's finished:
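
A C sketch of the worker, simplified from the repository's taskwork.c:

// Task worker (minimal sketch)
// Connects PULL socket to tcp://localhost:5557 (collects workloads from ventilator)
// Connects PUSH socket to tcp://localhost:5558 (sends results to sink)

#include <zmq.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main (void)
{
    void *context = zmq_ctx_new ();

    // Socket to receive tasks on
    void *receiver = zmq_socket (context, ZMQ_PULL);
    zmq_connect (receiver, "tcp://localhost:5557");

    // Socket to send results to
    void *sender = zmq_socket (context, ZMQ_PUSH);
    zmq_connect (sender, "tcp://localhost:5558");

    // Process tasks forever
    while (1) {
        char string [10];
        int size = zmq_recv (receiver, string, 9, 0);
        if (size == -1)
            break;                          // Interrupted
        if (size > 9)
            size = 9;
        string [size] = 0;

        printf ("%s.", string);             // Show progress
        fflush (stdout);
        usleep (atoi (string) * 1000);      // Do the "work" (workload is in msecs)

        zmq_send (sender, "", 0, 0);        // Send empty "result" to the sink
    }
    zmq_close (receiver);
    zmq_close (sender);
    zmq_ctx_destroy (context);
    return 0;
}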


C++ | C# | Clojure | CL | Delphi | Erlang | F# | Felix | Go | Haskell | Haxe | Java | Lua | Node.js | Objective-C | Perl | PHP | Python | Ruby | Scala | Tcl | Ada | Basic | ooc | Q | Racket

Here is the sink application. It collects the 100 tasks, then calculates how long the overall processing took, so we can confirm that the workers really were running in parallel if there are more than one of them:
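
And a C sketch of the sink, simplified from the repository's tasksink.c; the s_clock() function here stands in for the zhelpers.h helper of the same name:

// Task sink (minimal sketch)
// Binds PULL socket to tcp://*:5558
// Collects results from workers and reports the elapsed time for the batch

#include <zmq.h>
#include <stdio.h>
#include <sys/time.h>

// Return current system clock in milliseconds
static long
s_clock (void)
{
    struct timeval tv;
    gettimeofday (&tv, NULL);
    return (long) tv.tv_sec * 1000 + tv.tv_usec / 1000;
}

int main (void)
{
    // Prepare our context and socket
    void *context = zmq_ctx_new ();
    void *receiver = zmq_socket (context, ZMQ_PULL);
    zmq_bind (receiver, "tcp://*:5558");

    // Wait for start of batch
    char buffer [10];
    zmq_recv (receiver, buffer, 10, 0);

    // Start our clock now
    long start_time = s_clock ();

    // Collect 100 confirmations
    int task_nbr;
    for (task_nbr = 0; task_nbr < 100; task_nbr++) {
        zmq_recv (receiver, buffer, 10, 0);
        printf (task_nbr % 10 == 0 ? ":" : ".");
        fflush (stdout);
    }
    // Calculate and report duration of batch
    printf ("\nTotal elapsed time: %d msec\n", (int) (s_clock () - start_time));

    zmq_close (receiver);
    zmq_ctx_destroy (context);
    return 0;
}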


C++ | C# | Clojure | CL | Delphi | Erlang | F# | Felix | Go | Haskell | Haxe | Java | Lua | Node.js | Objective-C | Perl | PHP | Python | Ruby | Scala | Tcl | Ada | Basic | ooc | Q | Racket

The average cost of a batch is 5 seconds. When we start 1, 2, or 4 workers we get results like this from the sink:

  • 1 worker: total elapsed time: 5034 msecs.
  • 2 workers: total elapsed time: 2421 msecs.
  • 4 workers: total elapsed time: 1018 msecs.

Let's look at some aspects of this code in more detail:

  • The workers connect upstream to the ventilator, and downstream to the sink. This means you can add workers arbitrarily. If the workers bound to their endpoints, you would need (a) more endpoints and (b) to modify the ventilator and/or the sink each time you added a worker. We say that the ventilator and sink are stable parts of our architecture and the workers are dynamic parts of it.
  • We have to synchronize the start of the batch with all workers being up and running. This is a fairly common gotcha in ZeroMQ and there is no easy solution. The zmq_connect method takes a certain time. So when a set of workers connect to the ventilator, the first one to successfully connect will get a whole load of messages in that short time while the others are also connecting. If you don't synchronize the start of the batch somehow, the system won't run in parallel at all. Try removing the wait in the ventilator, and see what happens.
  • The ventilator's PUSH socket distributes tasks to workers (assuming they are all connected before the batch starts going out) evenly. This is called load balancing and it's something we'll look at again in more detail.
  • The sink's PULL socket collects results from workers evenly. This is called fair-queuing.

Figure 6 - Fair Queuing

fig6.png

The pipeline pattern also exhibits the "slow joiner" syndrome, leading to accusations that PUSH sockets don't load balance properly. If you are using PUSH and PULL, and one of your workers gets way more messages than the others, it's because that PULL socket has joined faster than the others, and grabs a lot of messages before the others manage to connect. If you want proper load balancing, you probably want to look at the load balancing pattern in Chapter 3 - Advanced Request-Reply Patterns.

Programming with ZeroMQ


Having seen some examples, you must be eager to start using ZeroMQ in some apps. Before you start that, take a deep breath, chillax, and reflect on some basic advice that will save you much stress and confusion.

  • Learn ZeroMQ step-by-step. It's just one simple API, but it hides a world of possibilities. Take the possibilities slowly and master each one.
  • Write nice code. Ugly code hides problems and makes it hard for others to help you. You might get used to meaningless variable names, but people reading your code won't. Use names that are real words, that say something other than "I'm too careless to tell you what this variable is really for". Use consistent indentation and clean layout. Write nice code and your world will be more comfortable.
  • Test what you make as you make it. When your program doesn't work, you should know what five lines are to blame. This is especially true when you do ZeroMQ magic, which just won't work the first few times you try it.
  • When you find that things don't work as expected, break your code into pieces, test each one, see which one is not working. ZeroMQ lets you make essentially modular code; use that to your advantage.
  • Make abstractions (classes, methods, whatever) as you need them. If you copy/paste a lot of code, you're going to copy/paste errors, too.

Getting the Context Right

ZeroMQ applications always start by creating a context, and then using that for creating sockets. In C, it's the zmq_ctx_new() call. You should create and use exactly one context in your process. Technically, the context is the container for all sockets in a single process, and acts as the transport for inproc sockets, which are the fastest way to connect threads in one process. If at runtime a process has two contexts, these are like separate ZeroMQ instances. If that's explicitly what you want, OK, but otherwise remember:

Call zmq_ctx_new() once at the start of a process, and zmq_ctx_destroy() once at the end.

If you're using the fork() system call, do zmq_ctx_new() after the fork and at the beginning of the child process code. In general, you want to do interesting (ZeroMQ) stuff in the children, and boring process management in the parent.
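
In other words, the skeleton of a well-behaved single-threaded ZeroMQ process looks something like this:

int main (void)
{
    void *context = zmq_ctx_new ();             // One context per process
    void *socket = zmq_socket (context, ZMQ_REP);
    // ... bind or connect the socket and do the real work ...
    zmq_close (socket);                         // Close sockets first...
    zmq_ctx_destroy (context);                  // ...then destroy the context
    return 0;
}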

Making a Clean Exit

Classy programmers share the same motto as classy hit men: always clean-up when you finish the job. When you use ZeroMQ in a language like Python, stuff gets automatically freed for you. But when using C, you have to carefully free objects when you're finished with them or else you get memory leaks, unstable applications, and generally bad karma.

Memory leaks are one thing, but ZeroMQ is quite finicky about how you exit an application. The reasons are technical and painful, but the upshot is that if you leave any sockets open, the zmq_ctx_destroy() function will hang forever. And even if you close all sockets, zmq_ctx_destroy() will by default wait forever if there are pending connects or sends unless you set the LINGER to zero on those sockets before closing them.

The ZeroMQ objects we need to worry about are messages, sockets, and contexts. Luckily it's quite simple, at least in simple programs:

  • If you are opening and closing a lot of sockets, that's probably a sign that you need to redesign your application. In some cases socket handles won't be freed until you destroy the context.
  • When you exit the program, close your sockets and then call zmq_ctx_destroy(). This destroys the context.

This is at least the case for C development. In a language with automatic object destruction, sockets and contexts will be destroyed as you leave the scope. If you use exceptions you'll have to do the clean-up in something like a "final" block, the same as for any resource.

If you're doing multithreaded work, it gets rather more complex than this. We'll get to multithreading in the next chapter, but because some of you will, despite warnings, try to run before you can safely walk, below is the quick and dirty guide to making a clean exit in a multithreaded ZeroMQ application.

First, do not try to use the same socket from multiple threads. Please don't explain why you think this would be excellent fun, just please don't do it. Next, you need to shut down each socket that has ongoing requests. The proper way is to set a low LINGER value (1 second), and then close the socket. If your language binding doesn't do this for you automatically when you destroy a context, I'd suggest sending a patch.
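
In C, that shutdown sequence for one socket looks something like this (ZMQ_LINGER takes milliseconds):

// Don't block forever on unsent messages: linger at most one second, then close
int linger = 1000;
zmq_setsockopt (socket, ZMQ_LINGER, &linger, sizeof (linger));
zmq_close (socket);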

Finally, destroy the context. This will cause any blocking receives or polls or sends in attached threads (i.e., which share the same context) to return with an error. Catch that error, and then set linger on, and close sockets in that thread, and exit. Do not destroy the same context twice. The zmq_ctx_destroy in the main thread will block until all sockets it knows about are safely closed.

Voila! It's complex and painful enough that any language binding author worth his or her salt will do this automatically and make the socket closing dance unnecessary.

Why We Needed ZeroMQ


Now that you've seen ZeroMQ in action, let's go back to the "why".

Many applications these days consist of components that stretch across some kind of network, either a LAN or the Internet. So many application developers end up doing some kind of messaging. Some developers use message queuing products, but most of the time they do it themselves, using TCP or UDP. These protocols are not hard to use, but there is a great difference between sending a few bytes from A to B, and doing messaging in any kind of reliable way.

Let's look at the typical problems we face when we start to connect pieces using raw TCP. Any reusable messaging layer would need to solve all or most of these:

  • How do we handle I/O? Does our application block, or do we handle I/O in the background? This is a key design decision. Blocking I/O creates architectures that do not scale well. But background I/O can be very hard to do right.
  • How do we handle dynamic components, i.e., pieces that go away temporarily? Do we formally split components into "clients" and "servers" and mandate that servers cannot disappear? What then if we want to connect servers to servers? Do we try to reconnect every few seconds?
  • How do we represent a message on the wire? How do we frame data so it's easy to write and read, safe from buffer overflows, efficient for small messages, yet adequate for the very largest videos of dancing cats wearing party hats?
  • How do we handle messages that we can't deliver immediately? Particularly, if we're waiting for a component to come back online? Do we discard messages, put them into a database, or into a memory queue?
  • Where do we store message queues? What happens if the component reading from a queue is very slow and causes our queues to build up? What's our strategy then?
  • How do we handle lost messages? Do we wait for fresh data, request a resend, or do we build some kind of reliability layer that ensures messages cannot be lost? What if that layer itself crashes?
  • What if we need to use a different network transport. Say, multicast instead of TCP unicast? Or IPv6? Do we need to rewrite the applications, or is the transport abstracted in some layer?
  • How do we route messages? Can we send the same message to multiple peers? Can we send replies back to an original requester?
  • How do we write an API for another language? Do we re-implement a wire-level protocol or do we repackage a library? If the former, how can we guarantee efficient and stable stacks? If the latter, how can we guarantee interoperability?
  • How do we represent data so that it can be read between different architectures? Do we enforce a particular encoding for data types? How far is this the job of the messaging system rather than a higher layer?
  • How do we handle network errors? Do we wait and retry, ignore them silently, or abort?

Take a typical open source project like Hadoop Zookeeper and read the C API code in src/c/src/zookeeper.c. When I read this code, in January 2013, it was 4,200 lines of mystery and in there is an undocumented, client/server network communication protocol. I see it's efficient because it uses poll instead of select. But really, Zookeeper should be using a generic messaging layer and an explicitly documented wire level protocol. It is incredibly wasteful for teams to be building this particular wheel over and over.

But how to make a reusable messaging layer? Why, when so many projects need this technology, are people still doing it the hard way by driving TCP sockets in their code, and solving the problems in that long list over and over?

It turns out that building reusable messaging systems is really difficult, which is why few FOSS projects ever tried, and why commercial messaging products are complex, expensive, inflexible, and brittle. In 2006, iMatix designed AMQP which started to give FOSS developers perhaps the first reusable recipe for a messaging system. AMQP works better than many other designs, but remains relatively complex, expensive, and brittle. It takes weeks to learn to use, and months to create stable architectures that don't crash when things get hairy.

Figure 7 - Messaging as it Starts

fig7.png

Most messaging projects, like AMQP, that try to solve this long list of problems in a reusable way do so by inventing a new concept, the "broker", that does addressing, routing, and queuing. This results in a client/server protocol or a set of APIs on top of some undocumented protocol that allows applications to speak to this broker. Brokers are an excellent thing in reducing the complexity of large networks. But adding broker-based messaging to a product like Zookeeper would make it worse, not better. It would mean adding an additional big box, and a new single point of failure. A broker rapidly becomes a bottleneck and a new risk to manage. If the software supports it, we can add a second, third, and fourth broker and make some failover scheme. People do this. It creates more moving pieces, more complexity, and more things to break.

And a broker-centric setup needs its own operations team. You literally need to watch the brokers day and night, and beat them with a stick when they start misbehaving. You need boxes, and you need backup boxes, and you need people to manage those boxes. It is only worth doing for large applications with many moving pieces, built by several teams of people over several years.

Figure 8 - Messaging as it Becomes

fig8.png

So small to medium application developers are trapped. Either they avoid network programming and make monolithic applications that do not scale. Or they jump into network programming and make brittle, complex applications that are hard to maintain. Or they bet on a messaging product, and end up with scalable applications that depend on expensive, easily broken technology. There has been no really good choice, which is maybe why messaging is largely stuck in the last century and stirs strong emotions: negative ones for users, gleeful joy for those selling support and licenses.

What we need is something that does the job of messaging, but does it in such a simple and cheap way that it can work in any application, with close to zero cost. It should be a library which you just link, without any other dependencies. No additional moving pieces, so no additional risk. It should run on any OS and work with any programming language.

And this is ZeroMQ: an efficient, embeddable library that solves most of the problems an application needs to become nicely elastic across a network, without much cost.

Specifically:

  • It handles I/O asynchronously, in background threads. These communicate with application threads using lock-free data structures, so concurrent ZeroMQ applications need no locks, semaphores, or other wait states.
  • Components can come and go dynamically and ZeroMQ will automatically reconnect. This means you can start components in any order. You can create "service-oriented architectures" (SOAs) where services can join and leave the network at any time.
  • It queues messages automatically when needed. It does this intelligently, pushing messages as close as possible to the receiver before queuing them.
  • It has ways of dealing with over-full queues (called "high water mark"). When a queue is full, ZeroMQ automatically blocks senders, or throws away messages, depending on the kind of messaging you are doing (the so-called "pattern").
  • It lets your applications talk to each other over arbitrary transports: TCP, multicast, in-process, inter-process. You don't need to change your code to use a different transport.
  • It handles slow/blocked readers safely, using different strategies that depend on the messaging pattern.
  • It lets you route messages using a variety of patterns such as request-reply and pub-sub. These patterns are how you create the topology, the structure of your network.
  • It lets you create proxies to queue, forward, or capture messages with a single call. Proxies can reduce the interconnection complexity of a network.
  • It delivers whole messages exactly as they were sent, using a simple framing on the wire. If you write a 10k message, you will receive a 10k message.
  • It does not impose any format on messages. They are blobs from zero to gigabytes large. When you want to represent data you choose some other product on top, such as msgpack, Google's protocol buffers, and others.
  • It handles network errors intelligently, by retrying automatically in cases where it makes sense.
  • It reduces your carbon footprint. Doing more with less CPU means your boxes use less power, and you can keep your old boxes in use for longer. Al Gore would love ZeroMQ.

Actually ZeroMQ does rather more than this. It has a subversive effect on how you develop network-capable applications. Superficially, it's a socket-inspired API on which you do zmq_recv() and zmq_send(). But message processing rapidly becomes the central loop, and your application soon breaks down into a set of message processing tasks. It is elegant and natural. And it scales: each of these tasks maps to a node, and the nodes talk to each other across arbitrary transports. Two nodes in one process (node is a thread), two nodes on one box (node is a process), or two nodes on one network (node is a box)—it's all the same, with no application code changes.

Socket Scalability


Let's see ZeroMQ's scalability in action. Here is a shell script that starts the weather server and then a bunch of clients in parallel:

wuserver &
wuclient 12345 &
wuclient 23456 &
wuclient 34567 &
wuclient 45678 &
wuclient 56789 &

As the clients run, we take a look at the active processes using the top command, and we see something like this (on a 4-core box):

PID  USER  PR  NI  VIRT  RES  SHR S %CPU %MEM   TIME+  COMMAND
7136  ph   20   0 1040m 959m 1156 R  157 12.0 16:25.47 wuserver
7966  ph   20   0 98608 1804 1372 S   33  0.0  0:03.94 wuclient
7963  ph   20   0 33116 1748 1372 S   14  0.0  0:00.76 wuclient
7965  ph   20   0 33116 1784 1372 S    6  0.0  0:00.47 wuclient
7964  ph   20   0 33116 1788 1372 S    5  0.0  0:00.25 wuclient
7967  ph   20   0 33072 1740 1372 S    5  0.0  0:00.35 wuclient

Let's think for a second about what is happening here. The weather server has a single socket, and yet here we have it sending data to five clients in parallel. We could have thousands of concurrent clients. The server application doesn't see them, doesn't talk to them directly. So the ZeroMQ socket is acting like a little server, silently accepting client requests and shoving data out to them as fast as the network can handle it. And it's a multithreaded server, squeezing more juice out of your CPU.

Upgrading from ZeroMQ v2.2 to ZeroMQ v3.2


Compatible Changes

These changes don't impact existing application code directly:

  • Pub-sub filtering is now done at the publisher side instead of subscriber side. This improves performance significantly in many pub-sub use cases. You can mix v3.2 and v2.1/v2.2 publishers and subscribers safely.

Incompatible Changes

These are the main areas of impact on applications and language bindings:

  • Changed send/recv methods: zmq_send() and zmq_recv() have a different, simpler interface, and the old functionality is now provided by zmq_msg_send() and zmq_msg_recv(). Symptom: compile errors. Solution: fix up your code.
  • These two methods return positive values on success, and -1 on error. In v2.x they always returned zero on success. Symptom: apparent errors when things actually work fine. Solution: test strictly for return code = -1, not non-zero.
  • zmq_poll() now waits for milliseconds, not microseconds. Symptom: application stops responding (in fact responds 1000 times slower). Solution: use the ZMQ_POLL_MSEC macro defined below, in all zmq_poll calls.
  • ZMQ_NOBLOCK is now called ZMQ_DONTWAIT. Symptom: compile failures on the ZMQ_NOBLOCK macro.
  • The ZMQ_HWM socket option is now broken into ZMQ_SNDHWM and ZMQ_RCVHWM. Symptom: compile failures on the ZMQ_HWM macro.
  • Most but not all zmq_getsockopt() options are now integer values. Symptom: runtime error returns on zmq_setsockopt and zmq_getsockopt.
  • The ZMQ_SWAP option has been removed. Symptom: compile failures on ZMQ_SWAP. Solution: redesign any code that uses this functionality.

Suggested Shim Macros

For applications that want to run on both v2.x and v3.2, such as language bindings, our advice is to emulate v3.2 as far as possible. Here are C macro definitions that help your C/C++ code to work across both versions (taken from CZMQ):

#ifndef ZMQ_DONTWAIT
# define ZMQ_DONTWAIT ZMQ_NOBLOCK
#endif
#if ZMQ_VERSION_MAJOR == 2
# define zmq_msg_send(msg,sock,opt) zmq_send (sock, msg, opt)
# define zmq_msg_recv(msg,sock,opt) zmq_recv (sock, msg, opt)
# define zmq_ctx_destroy(context) zmq_term(context)
# define ZMQ_POLL_MSEC 1000
// zmq_poll is usec
# define ZMQ_SNDHWM ZMQ_HWM
# define ZMQ_RCVHWM ZMQ_HWM
#elif ZMQ_VERSION_MAJOR == 3
# define ZMQ_POLL_MSEC 1
// zmq_poll is msec
#endif

Warning: Unstable Paradigms!


Traditional network programming is built on the general assumption that one socket talks to one connection, one peer. There are multicast protocols, but these are exotic. When we assume "one socket = one connection", we scale our architectures in certain ways. We create threads of logic where each thread works with one socket, one peer. We place intelligence and state in these threads.

In the ZeroMQ universe, sockets are doorways to fast little background communications engines that manage a whole set of connections automagically for you. You can't see, work with, open, close, or attach state to these connections. Whether you use blocking send or receive, or poll, all you can talk to is the socket, not the connections it manages for you. The connections are private and invisible, and this is the key to ZeroMQ's scalability.

This is because your code, talking to a socket, can then handle any number of connections across whatever network protocols are around, without change. A messaging pattern sitting in ZeroMQ scales more cheaply than a messaging pattern sitting in your application code.

So the general assumption no longer applies. As you read the code examples, your brain will try to map them to what you know. You will read "socket" and think "ah, that represents a connection to another node". That is wrong. You will read "thread" and your brain will again think, "ah, a thread represents a connection to another node", and again your brain will be wrong.

If you're reading this Guide for the first time, realize that until you actually write ZeroMQ code for a day or two (and maybe three or four days), you may feel confused, especially by how simple ZeroMQ makes things for you, and you may try to impose that general assumption on ZeroMQ, and it won't work. And then you will experience your moment of enlightenment and trust, that zap-pow-kaboom satori paradigm-shift moment when it all becomes clear.


Chapter 2 - Sockets and Patterns


In Chapter 1 - Basics we took ZeroMQ for a drive, with some basic examples of the main ZeroMQ patterns: request-reply, pub-sub, and pipeline. In this chapter, we're going to get our hands dirty and start to learn how to use these tools in real programs.

We'll cover:

  • How to create and work with ZeroMQ sockets.
  • How to send and receive messages on sockets.
  • How to build your apps around ZeroMQ's asynchronous I/O model.
  • How to handle multiple sockets in one thread.
  • How to handle fatal and nonfatal errors properly.
  • How to handle interrupt signals like Ctrl-C.
  • How to shut down a ZeroMQ application cleanly.
  • How to check a ZeroMQ application for memory leaks.
  • How to send and receive multipart messages.
  • How to forward messages across networks.
  • How to build a simple message queuing broker.
  • How to write multithreaded applications with ZeroMQ.
  • How to use ZeroMQ to signal between threads.
  • How to use ZeroMQ to coordinate a network of nodes.
  • How to create and use message envelopes for pub-sub.
  • Using the HWM (high-water mark) to protect against memory overflows.

The Socket API


To be perfectly honest, ZeroMQ does a kind of switch-and-bait on you, for which we don't apologize. It's for your own good and it hurts us more than it hurts you. ZeroMQ presents a familiar socket-based API, which requires great effort for us to hide a bunch of message-processing engines. However, the result will slowly fix your world view about how to design and write distributed software.

Sockets are the de facto standard API for network programming, as well as being useful for stopping your eyes from falling onto your cheeks. One thing that makes ZeroMQ especially tasty to developers is that it uses sockets and messages instead of some other arbitrary set of concepts. Kudos to Martin Sustrik for pulling this off. It turns "Message Oriented Middleware", a phrase guaranteed to send the whole room off to Catatonia, into "Extra Spicy Sockets!", which leaves us with a strange craving for pizza and a desire to know more.

Like a favorite dish, ZeroMQ sockets are easy to digest. Sockets have a life in four parts, just like BSD sockets:

  • Creating and destroying sockets, which go together to form a karmic circle of socket life (see zmq_socket(), zmq_close()).
  • Configuring sockets by setting options on them and checking them if necessary (see zmq_setsockopt(), zmq_getsockopt()).
  • Plugging sockets into the network topology by creating ZeroMQ connections to and from them (see zmq_bind(), zmq_connect()).
  • Using the sockets to carry data by writing and receiving messages on them (see zmq_msg_send(), zmq_msg_recv()).

Note that sockets are always void pointers, and messages (which we'll come to very soon) are structures. So in C you pass sockets as-such, but you pass addresses of messages in all functions that work with messages, like zmq_msg_send() and zmq_msg_recv(). As a mnemonic, realize that "in ZeroMQ, all your sockets are belong to us", but messages are things you actually own in your code.

Creating, destroying, and configuring sockets works as you'd expect for any object. But remember that ZeroMQ is an asynchronous, elastic fabric. This has some impact on how we plug sockets into the network topology and how we use the sockets after that.

Plugging Sockets into the Topology

To create a connection between two nodes, you use zmq_bind() in one node and zmq_connect() in the other. As a general rule of thumb, the node that does zmq_bind() is a "server", sitting on a well-known network address, and the node which does zmq_connect() is a "client", with unknown or arbitrary network addresses. Thus we say that we "bind a socket to an endpoint" and "connect a socket to an endpoint", the endpoint being that well-known network address.

ZeroMQ connections are somewhat different from classic TCP connections. The main notable differences are:

  • One socket may have many outgoing and many incoming connections.
  • There is no zmq_accept() method. When a socket is bound to an endpoint it automatically starts accepting connections.
  • The network connection itself happens in the background, and ZeroMQ will automatically reconnect if the network connection is broken (e.g., if the peer disappears and then comes back).
  • Your application code cannot work with these connections directly; they are encapsulated under the socket.

Many architectures follow some kind of client/server model, where the server is the component that is most static, and the clients are the components that are most dynamic, i.e., they come and go the most. There are sometimes issues of addressing: servers will be visible to clients, but not necessarily vice versa. So mostly it's obvious which node should be doing zmq_bind() (the server) and which should be doing zmq_connect() (the client). It also depends on the kind of sockets you're using, with some exceptions for unusual network architectures. We'll look at socket types later.

Now, imagine we start the client before we start the server. In traditional networking, we get a big red Fail flag. But ZeroMQ lets us start and stop pieces arbitrarily. As soon as the client node does zmq_connect(), the connection exists and that node can start to write messages to the socket. At some stage (hopefully before messages queue up so much that they start to get discarded, or the client blocks), the server comes alive, does a zmq_bind(), and ZeroMQ starts to deliver messages.

A server node can bind to many endpoints (that is, a combination of protocol and address) and it can do this using a single socket. This means it will accept connections across different transports:

zmq_bind (socket, "tcp://*:5555");
zmq_bind (socket, "tcp://*:9999");
zmq_bind (socket, "inproc://somename");

With most transports, you cannot bind to the same endpoint twice, unlike for example in UDP. The ipc transport does, however, let one process bind to an endpoint already used by a first process. It's meant to allow a process to recover after a crash.

Although ZeroMQ tries to be neutral(中立的) about which side binds and which side connects, there are differences. We'll see these in more detail later. The upshot(结果) is that you should usually think in terms of "servers" as static parts of your topology(拓扑学) that bind to more or less fixed endpoints, and "clients" as dynamic(动态的) parts that come and go and connect to these endpoints. Then, design your application around this model. The chances that it will "just work" are much better like that.
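
To make the client side of that model concrete, here is a minimal sketch of the matching zmq_connect() call, assuming a context created with zmq_ctx_new(); the address and the REQ socket type here are hypothetical, chosen only for illustration:

void *socket = zmq_socket (context, ZMQ_REQ);
zmq_connect (socket, "tcp://192.168.55.210:5555");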

Sockets have types. The socket type defines the semantics of the socket, its policies for routing messages inwards and outwards, queuing, etc. You can connect certain types of socket together, e.g., a publisher socket and a subscriber socket. Sockets work together in "messaging patterns". We'll look at this in more detail later.

It's the ability to connect sockets in these different ways that gives ZeroMQ its basic power as a message queuing system. There are layers on top of this, such as proxies, which we'll get to later. But essentially, with ZeroMQ you define your network architecture by plugging pieces together like a child's construction toy.

Sending and Receiving Messages

To send and receive messages you use the zmq_msg_send() and zmq_msg_recv() methods. The names are conventional, but ZeroMQ's I/O model is different enough from the classic TCP model that you will need time to get your head around it.

Figure 9 - TCP sockets are 1 to 1

fig9.png

Let's look at the main differences between TCP sockets and ZeroMQ sockets when it comes to working with data:

  • ZeroMQ sockets carry messages, like UDP, rather than a stream of bytes as TCP does. A ZeroMQ message is length-specified binary data. We'll come to messages shortly; their design is optimized for performance and so a little tricky.
  • ZeroMQ sockets do their I/O in a background thread. This means that messages arrive in local input queues and are sent from local output queues, no matter what your application is busy doing.
  • ZeroMQ sockets have one-to-N routing behavior built in, according to the socket type.

The zmq_send() method does not actually send the message to the socket connection(s). It queues the message so that the I/O thread can send it asynchronously. It does not block except in some exception cases. So the message is not necessarily sent when zmq_send() returns to your application.

Unicast Transports

ZeroMQ provides a set of unicast transports (inproc, ipc, and tcp) and multicast transports (epgm, pgm). Multicast is an advanced technique that we'll come to later. Don't even start using it unless you know that your fan-out ratios will make 1-to-N unicast impossible.

For most common cases, use tcp, which is a disconnected TCP transport. It is elastic, portable, and fast enough for most cases. We call this disconnected because ZeroMQ's tcp transport doesn't require that the endpoint exists before you connect to it. Clients and servers can connect and bind at any time, can go and come back, and it remains transparent to applications.

The inter-process ipc transport is disconnected, like tcp. It has one limitation: it does not yet work on Windows. By convention we use endpoint names with an ".ipc" extension to avoid potential conflict with other file names. On UNIX systems, if you use ipc endpoints you need to create these with appropriate permissions otherwise they may not be shareable between processes running under different user IDs. You must also make sure all processes can access the files, e.g., by running in the same working directory.

The inter-thread transport, inproc, is a connected signaling transport. It is much faster than tcp or ipc. This transport has a specific limitation compared to tcp and ipc: the server must issue a bind before any client issues a connect. This is something future versions of ZeroMQ may fix, but at present this defines how you use inproc sockets. We create and bind one socket and start the child threads, which create and connect the other sockets.

ZeroMQ is Not a Neutral Carrier

A common question that newcomers to ZeroMQ ask (it's one I've asked myself) is, "how do I write an XYZ server in ZeroMQ?" For example, "how do I write an HTTP server in ZeroMQ?" The implication is that if we use normal sockets to carry HTTP requests and responses, we should be able to use ZeroMQ sockets to do the same, only much faster and better.

The answer used to be "this is not how it works". ZeroMQ is not a neutral carrier: it imposes a framing on the transport protocols it uses. This framing is not compatible with existing protocols, which tend to use their own framing. For example, compare an HTTP request and a ZeroMQ request, both over TCP/IP.

Figure 10 - HTTP on the Wire

fig10.png

The HTTP request uses CR-LF as its simplest framing delimiter, whereas ZeroMQ uses a length-specified frame. So you could write an HTTP-like protocol using ZeroMQ, using for example the request-reply socket pattern. But it would not be HTTP.

Figure 11 - ZeroMQ on the Wire

fig11.png

Since v3.3, however, ZeroMQ has a socket option called ZMQ_ROUTER_RAW that lets you read and write data without the ZeroMQ framing. You could use this to read and write proper HTTP requests and responses. Hardeep Singh contributed this change so that he could connect to Telnet servers from his ZeroMQ application. At time of writing this is still somewhat experimental, but it shows how ZeroMQ keeps evolving to solve new problems. Maybe the next patch will be yours.

I/O Threads

We said that ZeroMQ does I/O in a background thread. One I/O thread (for all sockets) is sufficient for all but the most extreme applications. When you create a new context, it starts with one I/O thread. The general rule of thumb is to allow one I/O thread per gigabyte of data in or out per second. To raise the number of I/O threads, use the zmq_ctx_set() call before creating any sockets:

int io_threads = 4;
void *context = zmq_ctx_new ();
zmq_ctx_set (context, ZMQ_IO_THREADS, io_threads);
assert (zmq_ctx_get (context, ZMQ_IO_THREADS) == io_threads);

We've seen that one socket can handle dozens, even thousands of connections at once. This has a fundamental impact on how you write applications. A traditional networked application has one process or one thread per remote connection, and that process or thread handles one socket. ZeroMQ lets you collapse this entire structure into a single process and then break it up as necessary for scaling.

If you are using ZeroMQ for inter-thread communications only (i.e., a multithreaded application that does no external socket I/O) you can set the I/O threads to zero. It's not a significant optimization though, more of a curiosity.
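
For example, an inproc-only process could make that adjustment right after creating its context; a one-line sketch:

zmq_ctx_set (context, ZMQ_IO_THREADS, 0);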

Messaging Patterns


Underneath the brown paper wrapping of ZeroMQ's socket API lies the world of messaging patterns. If you have a background in enterprise messaging, or know UDP well, these will be vaguely familiar. But to most ZeroMQ newcomers, they are a surprise. We're so used to the TCP paradigm where a socket maps one-to-one to another node.

Let's recap briefly what ZeroMQ does for you. It delivers blobs of data (messages) to nodes, quickly and efficiently. You can map nodes to threads, processes, or machines. ZeroMQ gives your applications a single socket API to work with, no matter what the actual transport (like in-process, inter-process, TCP, or multicast). It automatically reconnects to peers as they come and go. It queues messages at both sender and receiver, as needed. It limits these queues to guard processes against running out of memory. It handles socket errors. It does all I/O in background threads. It uses lock-free techniques for talking between nodes, so there are never locks, waits, semaphores, or deadlocks.

But cutting through that, it routes and queues messages according to precise recipes called patterns. It is these patterns that provide ZeroMQ's intelligence. They encapsulate our hard-earned experience of the best ways to distribute data and work. ZeroMQ's patterns are hard-coded but future versions may allow user-definable patterns.

ZeroMQ patterns are implemented by pairs of sockets with matching types. In other words, to understand ZeroMQ patterns you need to understand socket types and how they work together. Mostly, this just takes study; there is little that is obvious at this level.

The built-in core ZeroMQ patterns are:

  • Request-reply, which connects a set of clients to a set of services. This is a remote procedure call and task distribution pattern.
  • Pub-sub, which connects a set of publishers to a set of subscribers. This is a data distribution pattern.
  • Pipeline, which connects nodes in a fan-out/fan-in pattern that can have multiple steps and loops. This is a parallel task distribution and collection pattern.
  • Exclusive pair, which connects two sockets exclusively. This is a pattern for connecting two threads in a process, not to be confused with "normal" pairs of sockets.

We looked at the first three of these in Chapter 1 - Basics, and we'll see the exclusive pair pattern later in this chapter. The zmq_socket() man page is fairly clear about the patterns — it's worth reading several times until it starts to make sense. These are the socket combinations that are valid for a connect-bind pair (either side can bind):

  • PUB and SUB
  • REQ and REP
  • REQ and ROUTER (take care, REQ inserts an extra null frame)
  • DEALER and REP (take care, REP assumes a null frame)
  • DEALER and ROUTER
  • DEALER and DEALER
  • ROUTER and ROUTER
  • PUSH and PULL
  • PAIR and PAIR

You'll also see references to XPUB and XSUB sockets, which we'll come to later (they're like raw versions of PUB and SUB). Any other combination will produce undocumented and unreliable results, and future versions of ZeroMQ will probably return errors if you try them. You can and will, of course, bridge other socket types via code, i.e., read from one socket type and write to another.

High-Level Messaging Patterns

These four core patterns are cooked into ZeroMQ. They are part of the ZeroMQ API, implemented in the core C++ library, and are guaranteed to be available in all fine retail stores.

On top of those, we add high-level messaging patterns. We build these high-level patterns on top of ZeroMQ and implement them in whatever language we're using for our application. They are not part of the core library, do not come with the ZeroMQ package, and exist in their own space as part of the ZeroMQ community. For example the Majordomo pattern, which we explore in Chapter 4 - Reliable Request-Reply Patterns, sits in the GitHub Majordomo project in the ZeroMQ organization.

One of the things we aim to provide you with in this book is a set of such high-level patterns, both small (how to handle messages sanely) and large (how to make a reliable pub-sub architecture).

Working with Messages

The libzmq core library has in fact two APIs to send and receive messages. The zmq_send() and zmq_recv() methods that we've already seen and used are simple one-liners. We will use these often, but zmq_recv() is bad at dealing with arbitrary message sizes: it truncates messages to whatever buffer size you provide. So there's a second API that works with zmq_msg_t structures, with a richer but more difficult set of methods.

On the wire, ZeroMQ messages are blobs of any size from zero upwards that fit in memory. You do your own serialization using protocol buffers, msgpack, JSON, or whatever else your applications need to speak. It's wise to choose a data representation that is portable, but you can make your own decisions about trade-offs.

In memory, ZeroMQ messages are zmq_msg_t structures (or classes depending on your language). Here are the basic ground rules for using ZeroMQ messages in C:

  • You create and pass around zmq_msg_t objects, not blocks of data.
  • To write a message from new data, you use zmq_msg_init_size() to create a message and at the same time allocate a block of data of some size. You then fill that data using memcpy, and pass the message to zmq_msg_send().
  • To release (not destroy) a message, you call zmq_msg_close(). This drops a reference, and eventually ZeroMQ will destroy the message.
  • After you pass a message to zmq_msg_send(), ØMQ will clear the message, i.e., set the size to zero. You cannot send the same message twice, and you cannot access the message data after sending it.
  • These rules don't apply if you use zmq_send() and zmq_recv(), to which you pass byte arrays, not message structures.

If you want to send the same message more than once, and it's sizable, create a second message, initialize it using zmq_msg_init(), and then use zmq_msg_copy() to create a copy of the first message. This does not copy the data but copies a reference. You can then send the message twice (or more, if you create more copies) and the message will only be finally destroyed when the last copy is sent or closed.
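
To make those rules concrete, here is a minimal sketch (not one of the book's listings; error checking and the usual string.h include are omitted) that builds a message from new data, takes a reference-counted copy, and sends both, assuming socket is a valid ZeroMQ socket:

// Build a five-byte message from new data
zmq_msg_t msg;
zmq_msg_init_size (&msg, 5);
memcpy (zmq_msg_data (&msg), "Hello", 5);

// Take a cheap copy; this shares a reference, it does not copy the data
zmq_msg_t copy;
zmq_msg_init (&copy);
zmq_msg_copy (&copy, &msg);

// Each send clears the message it was given
zmq_msg_send (&msg, socket, 0);
zmq_msg_send (&copy, socket, 0);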

ZeroMQ also supports multipart messages, which let you send or receive a list of frames as a single on-the-wire message. This is widely used in real applications and we'll look at that later in this chapter and in Chapter 3 - Advanced Request-Reply Patterns.

Frames (also called "message parts" in the ZeroMQ reference manual pages) are the basic wire format for ZeroMQ messages. A frame is a length-specified block of data. The length can be zero upwards. If you've done any TCP programming you'll appreciate why frames are a useful answer to the question "how much data am I supposed to read off this network socket now?"

There is a wire-level protocol called ZMTP that defines how ZeroMQ reads and writes frames on a TCP connection. If you're interested in how this works, the spec is quite short.

Originally, a ZeroMQ message was one frame, like UDP. We later extended this with multipart messages, which are quite simply series of frames with a "more" bit set to one, followed by one with that bit set to zero. The ZeroMQ API then lets you write messages with a "more" flag and when you read messages, it lets you check if there's "more".

In the low-level ZeroMQ API and the reference manual, therefore, there's some fuzziness about messages versus frames. So here's a useful lexicon:

  • A message can be one or more parts.
  • These parts are also called "frames".
  • Each part is a zmq_msg_t object.
  • You send and receive each part separately, in the low-level API.
  • Higher-level APIs provide wrappers to send entire multipart messages.

Some other things that are worth knowing about messages:

  • You may send zero-length messages, e.g., for sending a signal from one thread to another.
  • ZeroMQ guarantees to deliver all the parts (one or more) for a message, or none of them.
  • ZeroMQ does not send the message (single or multipart) right away, but at some indeterminate later time. A multipart message must therefore fit in memory.
  • A message (single or multipart) must fit in memory. If you want to send files of arbitrary sizes, you should break them into pieces and send each piece as separate single-part messages. Using multipart data will not reduce memory consumption.
  • You must call zmq_msg_close() when finished with a received message, in languages that don't automatically destroy objects when a scope closes. You don't call this method after sending a message.

And to be repetitive, do not use zmq_msg_init_data() yet. This is a zero-copy method and is guaranteed to create trouble for you. There are far more important things to learn about ZeroMQ before you start to worry about shaving off microseconds.

This rich API can be tiresome to work with. The methods are optimized for performance, not simplicity. If you start using these you will almost definitely get them wrong until you've read the man pages with some care. So one of the main jobs of a good language binding is to wrap this API up in classes that are easier to use.

Handling Multiple Sockets

In all the examples so far, the main loop has been:

  1. Wait for message on socket.
  2. Process message.
  3. Repeat.

What if we want to read from multiple endpoints at the same time? The simplest way is to connect one socket to all the endpoints and get ZeroMQ to do the fan-in for us. This is legal if the remote endpoints are in the same pattern, but it would be wrong to connect a PULL socket to a PUB endpoint.

To actually read from multiple sockets all at once, use zmq_poll(). An even better way might be to wrap zmq_poll() in a framework that turns it into a nice event-driven reactor, but it's significantly more work than we want to cover here.

Let's start with a dirty hack, partly for the fun of not doing it right, but mainly because it lets me show you how to do nonblocking socket reads. Here is a simple example of reading from two sockets using nonblocking reads. This rather confused program acts both as a subscriber to weather updates, and a worker for parallel tasks:

// Reading from multiple sockets
// This version uses a simple recv loop

#include "zhelpers.h"

int main (void)
{
    // Connect to task ventilator
    void *context = zmq_ctx_new ();
    void *receiver = zmq_socket (context, ZMQ_PULL);
    zmq_connect (receiver, "tcp://localhost:5557");

    // Connect to weather server
    void *subscriber = zmq_socket (context, ZMQ_SUB);
    zmq_connect (subscriber, "tcp://localhost:5556");
    zmq_setsockopt (subscriber, ZMQ_SUBSCRIBE, "10001 ", 6);

    // Process messages from both sockets
    // We prioritize traffic from the task ventilator
    while (1) {
        char msg [256];
        while (1) {
            int size = zmq_recv (receiver, msg, 255, ZMQ_DONTWAIT);
            if (size != -1) {
                // Process task
            }
            else
                break;
        }
        while (1) {
            int size = zmq_recv (subscriber, msg, 255, ZMQ_DONTWAIT);
            if (size != -1) {
                // Process weather update
            }
            else
                break;
        }
        // No activity, so sleep for 1 msec
        s_sleep (1);
    }
    zmq_close (receiver);
    zmq_close (subscriber);
    zmq_ctx_destroy (context);
    return 0;
}


C++ | C# | Clojure | CL | Delphi | Erlang | F# | Felix | Go | Java | Lua | Objective-C | Perl | PHP | Python | Ruby | Scala | Tcl | Ada | Basic | Haskell | Haxe | Node.js | ooc | Q | Racket

The cost of this approach is some additional latency on the first message (the sleep at the end of the loop, when there are no waiting messages to process). This would be a problem in applications where submillisecond latency was vital. Also, you need to check the documentation for nanosleep() or whatever function you use to make sure it does not busy-loop.

You can treat the sockets fairly by reading first from one, then the second rather than prioritizing them as we did in this example.

Now let's see the same senseless little application done right, using zmq_poll():

// Reading from multiple sockets
// This version uses zmq_poll()

#include "zhelpers.h"

int main (void)
{
    // Connect to task ventilator
    void *context = zmq_ctx_new ();
    void *receiver = zmq_socket (context, ZMQ_PULL);
    zmq_connect (receiver, "tcp://localhost:5557");

    // Connect to weather server
    void *subscriber = zmq_socket (context, ZMQ_SUB);
    zmq_connect (subscriber, "tcp://localhost:5556");
    zmq_setsockopt (subscriber, ZMQ_SUBSCRIBE, "10001 ", 6);

    // Process messages from both sockets
    while (1) {
        char msg [256];
        zmq_pollitem_t items [] = {
            { receiver, 0, ZMQ_POLLIN, 0 },
            { subscriber, 0, ZMQ_POLLIN, 0 }
        };
        zmq_poll (items, 2, -1);
        if (items [0].revents & ZMQ_POLLIN) {
            int size = zmq_recv (receiver, msg, 255, 0);
            if (size != -1) {
                // Process task
            }
        }
        if (items [1].revents & ZMQ_POLLIN) {
            int size = zmq_recv (subscriber, msg, 255, 0);
            if (size != -1) {
                // Process weather update
            }
        }
    }
    zmq_close (receiver);
    zmq_close (subscriber);
    zmq_ctx_destroy (context);
    return 0;
}


C++ | C# | Clojure | CL | Delphi | Erlang | F# | Felix | Go | Haskell | Java | Lua | Node.js | Objective-C | Perl | PHP | Python | Ruby | Scala | Tcl | Ada | Basic | Haxe | ooc | Q | Racket

The items structure has these four members:

typedef struct {
    void *socket;       //  ZeroMQ socket to poll on
    int fd;             //  OR, native file handle to poll on
    short events;       //  Events to poll on
    short revents;      //  Events returned after poll
} zmq_pollitem_t;

Multipart Messages

ZeroMQ lets us compose a message out of several frames, giving us a "multipart message". Realistic applications use multipart messages heavily, both for wrapping messages with address information and for simple serialization. We'll look at reply envelopes later.

What we'll learn now is simply how to blindly and safely read and write multipart messages in any application (such as a proxy) that needs to forward messages without inspecting them.

When you work with multipart messages, each part is a zmq_msg item. E.g., if you are sending a message with five parts, you must construct, send, and destroy five zmq_msg items. You can do this in advance (and store the zmq_msg items in an array or other structure), or as you send them, one-by-one.

Here is how we send the frames in a multipart message (each frame is sent as a separate message object):

zmq_msg_send (&message, socket, ZMQ_SNDMORE);

zmq_msg_send (&message, socket, ZMQ_SNDMORE);

zmq_msg_send (&message, socket, 0);

Here is how we receive and process all the parts in a message, be it single part or multipart:

while (1) {
    zmq_msg_t message;
    zmq_msg_init (&message);
    zmq_msg_recv (&message, socket, 0);
    //  Process the message frame

    int more = zmq_msg_more (&message);
    zmq_msg_close (&message);
    if (!more)
        break;      //  Last message frame
}

Some things to know about multipart messages (a short sketch of a blind relay follows this list):

  • When you send a multipart message, the first part (and all following parts) are only actually sent on the wire when you send the final part.
  • If you are using zmq_poll(), when you receive the first part of a message, all the rest has also arrived.
  • You will receive all parts of a message, or none at all.
  • Each part of a message is a separate zmq_msg item.
  • You will receive all parts of a message whether or not you check the more property.
  • On sending, ZeroMQ queues message frames in memory until it has the last one, then sends them all.
  • There is no way to cancel a partially sent message, except by closing the socket.
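
Putting the sending and receiving rules together, here is a minimal sketch (not one of the linked listings) of how a proxy might relay one complete message, single part or multipart, from a frontend socket to a backend socket without inspecting it:

while (1) {
    zmq_msg_t part;
    zmq_msg_init (&part);
    zmq_msg_recv (&part, frontend, 0);
    //  Check the more flag before the part is handed to the send side
    int more = zmq_msg_more (&part);
    zmq_msg_send (&part, backend, more? ZMQ_SNDMORE: 0);
    if (!more)
        break;      //  Last frame of this message
}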

Intermediaries and Proxies

ZeroMQ aims for decentralized intelligence, but that doesn't mean your network is empty space in the middle. It's filled with message-aware infrastructure and quite often, we build that infrastructure with ZeroMQ. The ZeroMQ plumbing can range from tiny pipes to full-blown service-oriented brokers. The messaging industry calls this intermediation, meaning that the stuff in the middle deals with either side. In ZeroMQ, we call these proxies, queues, forwarders, devices, or brokers, depending on the context.

This pattern is extremely common in the real world and is why our societies and economies are filled with intermediaries who have no other real function than to reduce the complexity and scaling costs of larger networks. Real-world intermediaries are typically called wholesalers, distributors, managers, and so on.

The Dynamic Discovery Problem

One of the problems you will hit as you design larger distributed architectures is discovery. That is, how do pieces know about each other? It's especially difficult if pieces come and go, so we call this the "dynamic discovery problem".

There are several solutions to dynamic discovery. The simplest is to entirely avoid it by hard-coding (or configuring) the network architecture so discovery is done by hand. That is, when you add a new piece, you reconfigure the network to know about it.

Figure 12 - Small-Scale Pub-Sub Network

fig12.png

In practice, this leads to increasingly fragile and unwieldy architectures. Let's say you have one publisher and a hundred subscribers. You connect each subscriber to the publisher by configuring a publisher endpoint in each subscriber. That's easy. Subscribers are dynamic; the publisher is static. Now say you add more publishers. Suddenly, it's not so easy any more. If you continue to connect each subscriber to each publisher, the cost of avoiding dynamic discovery gets higher and higher.

Figure 13 - Pub-Sub Network with a Proxy

fig13.png

There are quite a few answers to this, but the very simplest answer is to add an intermediary; that is, a static point in the network to which all other nodes connect. In classic messaging, this is the job of the message broker. ZeroMQ doesn't come with a message broker as such, but it lets us build intermediaries quite easily.

You might wonder, if all networks eventually get large enough to need intermediaries, why don't we simply have a message broker in place for all applications? For beginners, it's a fair compromise. Just always use a star topology, forget about performance, and things will usually work. However, message brokers are greedy things; in their role as central intermediaries, they become too complex, too stateful, and eventually a problem.

It's better to think of intermediaries as simple stateless message switches. A good analogy is an HTTP proxy; it's there, but doesn't have any special role. Adding a pub-sub proxy solves the dynamic discovery problem in our example. We set the proxy in the "middle" of the network. The proxy opens an XSUB socket, an XPUB socket, and binds each to well-known IP addresses and ports. Then, all other processes connect to the proxy, instead of to each other. It becomes trivial to add more subscribers or publishers.

Figure 14 - Extended Pub-Sub

fig14.png

We need XPUB and XSUB sockets because ZeroMQ does subscription forwarding from subscribers to publishers. XSUB and XPUB are exactly like SUB and PUB except they expose subscriptions as special messages. The proxy has to forward these subscription messages from subscriber side to publisher side, by reading them from the XPUB socket and writing them to the XSUB socket. This is the main use case for XSUB and XPUB.
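
Using the zmq_proxy() call that we'll meet a little later in this chapter, such a proxy can be sketched in a handful of lines; the port numbers here are hypothetical:

void *context = zmq_ctx_new ();
//  Socket facing the publishers
void *frontend = zmq_socket (context, ZMQ_XSUB);
zmq_bind (frontend, "tcp://*:5557");
//  Socket facing the subscribers
void *backend = zmq_socket (context, ZMQ_XPUB);
zmq_bind (backend, "tcp://*:5558");
//  Shuttle data and subscription messages in both directions
zmq_proxy (frontend, backend, NULL);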

Shared Queue (DEALER and ROUTER sockets)

In the Hello World client/server application, we have one client that talks to one service. However, in real cases we usually need to allow multiple services as well as multiple clients. This lets us scale up the power of the service (many threads or processes or nodes rather than just one). The only constraint is that services must be stateless, all state being in the request or in some shared storage such as a database.

Figure 15 - Request Distribution

fig15.png

There are two ways to connect multiple clients to multiple servers. The brute force way is to connect each client socket to multiple service endpoints. One client socket can connect to multiple service sockets, and the REQ socket will then distribute requests among these services. Let's say you connect a client socket to three service endpoints: A, B, and C. The client makes requests R1, R2, R3, R4. R1 and R4 go to service A, R2 goes to B, and R3 goes to service C.

This design lets you add more clients cheaply. You can also add more services. Each client will distribute its requests to the services. But each client has to know the service topology. If you have 100 clients and then you decide to add three more services, you need to reconfigure and restart 100 clients in order for the clients to know about the three new services.

That's clearly not the kind of thing we want to be doing at 3 a.m. when our supercomputing cluster has run out of resources and we desperately need to add a couple of hundred new service nodes. Too many static pieces are like liquid concrete: knowledge is distributed and the more static pieces you have, the more effort it is to change the topology. What we want is something sitting in between clients and services that centralizes all knowledge of the topology. Ideally, we should be able to add and remove services or clients at any time without touching any other part of the topology.

So we'll write a little message queuing broker that gives us this flexibility. The broker binds to two endpoints, a frontend for clients and a backend for services. It then uses zmq_poll() to monitor these two sockets for activity and when it has some, it shuttles messages between its two sockets. It doesn't actually manage any queues explicitly—ZeroMQ does that automatically on each socket.

When you use REQ to talk to REP, you get a strictly synchronous request-reply dialog. The client sends a request. The service reads the request and sends a reply. The client then reads the reply. If either the client or the service try to do anything else (e.g., sending two requests in a row without waiting for a response), they will get an error.

But our broker has to be nonblocking. Obviously, we can use zmq_poll() to wait for activity on either socket, but we can't use REP and REQ.

Figure 16 - Extended Request-Reply

fig16.png

Luckily, there are two sockets called DEALER and ROUTER that let you do nonblocking request-response. You'll see in Chapter 3 - Advanced Request-Reply Patterns how DEALER and ROUTER sockets let you build all kinds of asynchronous request-reply flows. For now, we're just going to see how DEALER and ROUTER let us extend REQ-REP across an intermediary, that is, our little broker.

In this simple extended request-reply pattern, REQ talks to ROUTER and DEALER talks to REP. In between the DEALER and ROUTER, we have to have code (like our broker) that pulls messages off the one socket and shoves them onto the other.

The request-reply broker binds to two endpoints, one for clients to connect to (the frontend socket) and one for workers to connect to (the backend). To test this broker, you will want to change your workers so they connect to the backend socket. Here is a client that shows what I mean:


C++ | C# | Clojure | CL | Delphi | Erlang | F# | Go | Haskell | Haxe | Java | Lua | Node.js | Perl | PHP | Python | Racket | Ruby | Scala | Tcl | Ada | Basic | Felix | Objective-C | ooc | Q

Here is the worker:


C++ | C# | Clojure | CL | Delphi | Erlang | F# | Go | Haskell | Haxe | Java | Lua | Node.js | Perl | PHP | Python | Racket | Ruby | Scala | Tcl | Ada | Basic | Felix | Objective-C | ooc | Q

And here is the broker, which properly handles multipart messages:


C++ | C# | Clojure | CL | Delphi | Erlang | F# | Go | Haskell | Haxe | Java | Lua | Node.js | Perl | PHP | Python | Ruby | Scala | Tcl | Ada | Basic | Felix | Objective-C | ooc | Q | Racket
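
The full listings are available in the languages above. As a rough sketch of what the heart of that broker looks like in C, the core is a zmq_poll() loop that relays every frame of every message between a ROUTER frontend and a DEALER backend; the endpoints here are hypothetical and error handling is omitted:

void *frontend = zmq_socket (context, ZMQ_ROUTER);
void *backend = zmq_socket (context, ZMQ_DEALER);
zmq_bind (frontend, "tcp://*:5559");
zmq_bind (backend, "tcp://*:5560");

zmq_pollitem_t items [] = {
    { frontend, 0, ZMQ_POLLIN, 0 },
    { backend, 0, ZMQ_POLLIN, 0 }
};
while (1) {
    zmq_poll (items, 2, -1);
    if (items [0].revents & ZMQ_POLLIN) {
        while (1) {     //  Relay one whole message, frame by frame
            zmq_msg_t msg;
            zmq_msg_init (&msg);
            zmq_msg_recv (&msg, frontend, 0);
            int more = zmq_msg_more (&msg);
            zmq_msg_send (&msg, backend, more? ZMQ_SNDMORE: 0);
            if (!more)
                break;
        }
    }
    if (items [1].revents & ZMQ_POLLIN) {
        while (1) {     //  Same thing in the other direction
            zmq_msg_t msg;
            zmq_msg_init (&msg);
            zmq_msg_recv (&msg, backend, 0);
            int more = zmq_msg_more (&msg);
            zmq_msg_send (&msg, frontend, more? ZMQ_SNDMORE: 0);
            if (!more)
                break;
        }
    }
}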

Figure 17 - Request-Reply Broker

fig17.png

Using a request-reply broker makes your client/server architectures easier to scale because clients don't see workers, and workers don't see clients. The only static node is the broker in the middle.

ZeroMQ's Built-In Proxy Function

It turns out that the core loop in the previous section's rrbroker is very useful, and reusable. It lets us build pub-sub forwarders and shared queues and other little intermediaries with very little effort. ZeroMQ wraps this up in a single method, zmq_proxy():

zmq_proxy (frontend, backend, capture);

The two (or three sockets, if we want to capture data) must be properly connected, bound, and configured. When we call the zmq_proxy method, it's exactly like starting the main loop of rrbroker. Let's rewrite the request-reply broker to call zmq_proxy, and re-badge this as an expensive-sounding "message queue" (people have charged houses for code that did less):


C++ | C# | Clojure | CL | Delphi | Erlang | F# | Go | Haskell | Haxe | Java | Lua | Node.js | Perl | PHP | Python | Q | Ruby | Tcl | Ada | Basic | Felix | Objective-C | ooc | Racket | Scala
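
Sketched in C (with hypothetical port numbers), the whole "message queue" reduces to creating the two sockets, binding them, and handing them to zmq_proxy():

void *context = zmq_ctx_new ();
void *frontend = zmq_socket (context, ZMQ_ROUTER);
void *backend = zmq_socket (context, ZMQ_DEALER);
zmq_bind (frontend, "tcp://*:5559");
zmq_bind (backend, "tcp://*:5560");
zmq_proxy (frontend, backend, NULL);    //  Runs until the context is terminated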

If you're like most ZeroMQ users, at this stage your mind is starting to think, "What kind of evil stuff can I do if I plug random socket types into the proxy?" The short answer is: try it and work out what is happening. In practice, you would usually stick to ROUTER/DEALER, XSUB/XPUB, or PULL/PUSH.

Transport Bridging

A frequent request from ZeroMQ users is, "How do I connect my ZeroMQ network with technology X?" where X is some other networking or messaging technology.

Figure 18 - Pub-Sub Forwarder Proxy

fig18.png

The simple answer is to build a bridge. A bridge is a small application that speaks one protocol at one socket, and converts to/from a second protocol at another socket. A protocol interpreter, if you like. A common bridging problem in ZeroMQ is to bridge two transports or networks.

As an example, we're going to write a little proxy that sits in between a publisher and a set of subscribers, bridging two networks. The frontend socket (SUB) faces the internal network where the weather server is sitting, and the backend (PUB) faces subscribers on the external network. It subscribes to the weather service on the frontend socket, and republishes its data on the backend socket.


C++ | C# | Clojure | CL | Delphi | Erlang | F# | Go | Haskell | Haxe | Java | Lua | Node.js | Perl | PHP | Python | Ruby | Scala | Tcl | Ada | Basic | Felix | Objective-C | ooc | Q | Racket

It looks very similar to the earlier proxy example, but the key part is that the frontend and backend sockets are on two different networks. We can use this model for example to connect a multicast network (pgm transport) to a tcp publisher.
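
As a rough sketch of that bridge (the addresses here are hypothetical; the real listing is linked above), the whole thing is again just two sockets and a zmq_proxy() call:

void *context = zmq_ctx_new ();
//  Frontend faces the internal network where the weather server sits
void *frontend = zmq_socket (context, ZMQ_SUB);
zmq_connect (frontend, "tcp://192.168.55.210:5556");
zmq_setsockopt (frontend, ZMQ_SUBSCRIBE, "", 0);    //  Subscribe to everything
//  Backend faces the external network of subscribers
void *backend = zmq_socket (context, ZMQ_PUB);
zmq_bind (backend, "tcp://10.1.1.0:8100");
//  Shuttle messages out to our own subscribers
zmq_proxy (frontend, backend, NULL);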

Handling Errors and ETERM


ZeroMQ's error handling philosophy is a mix of fail-fast and resilience. Processes, we believe, should be as vulnerable as possible to internal errors, and as robust as possible against external attacks and errors. To give an analogy, a living cell will self-destruct if it detects a single internal error, yet it will resist attack from the outside by all means possible.

Assertions, which pepper the ZeroMQ code, are absolutely vital to robust code; they just have to be on the right side of the cellular wall. And there should be such a wall. If it is unclear whether a fault is internal or external, that is a design flaw to be fixed. In C/C++, assertions stop the application immediately with an error. In other languages, you may get exceptions or halts.

When ZeroMQ detects an external fault it returns an error to the calling code. In some rare cases, it drops messages silently if there is no obvious strategy for recovering from the error.

In most of the C examples we've seen so far there's been no error handling. Real code should do error handling on every single ZeroMQ call. If you're using a language binding other than C, the binding may handle errors for you. In C, you do need to do this yourself. There are some simple rules, starting with POSIX conventions:

  • Methods that create objects return NULL if they fail.
  • Methods that process data may return the number of bytes processed, or -1 on an error or failure.
  • Other methods return 0 on success and -1 on an error or failure.
  • The error code is provided in errno or zmq_errno().
  • A descriptive error text for logging is provided by zmq_strerror().

For example:

void *context = zmq_ctx_new ();
assert (context);
void *socket = zmq_socket (context, ZMQ_REP);
assert (socket);
int rc = zmq_bind (socket, "tcp://*:5555");
if (rc == -1) {
    printf ("E: bind failed: %s\n", strerror (errno));
    return -1;
}

There are two main exceptional conditions that you should handle as nonfatal (a short sketch follows this list):

  • When your code receives a message with the ZMQ_DONTWAIT option and there is no waiting data, ZeroMQ will return -1 and set errno to EAGAIN.
  • When one thread calls zmq_ctx_destroy(), and other threads are still doing blocking work, the zmq_ctx_destroy() call closes the context and all blocking calls exit with -1, and errno set to ETERM.
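
A minimal sketch of checking for both conditions around a nonblocking receive (assuming errno.h is included and socket is a valid ZeroMQ socket) might look like this:

char buffer [256];
int size = zmq_recv (socket, buffer, sizeof (buffer), ZMQ_DONTWAIT);
if (size == -1) {
    if (errno == EAGAIN) {
        //  No message waiting right now; do other work and try again later
    }
    else
    if (errno == ETERM) {
        //  The context was terminated; close the socket and exit the thread
    }
    else
        assert (0);     //  Anything else is a real error in this simple sketch
}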

In C/C++, asserts can be removed entirely in optimized code, so don't make the mistake of wrapping the whole ZeroMQ call in an assert(). It looks neat; then the optimizer removes all the asserts and the calls you want to make, and your application breaks in impressive ways.

Figure 19 - Parallel Pipeline with Kill Signaling

fig19.png

Let's see how to shut down a process cleanly. We'll take the parallel pipeline example from the previous section. If we've started a whole lot of workers in the background, we now want to kill them when the batch is finished. Let's do this by sending a kill message to the workers. The best place to do this is the sink because it really knows when the batch is done.

How do we connect the sink to the workers? The PUSH/PULL sockets are one-way only. We could switch to another socket type, or we could mix multiple socket flows. Let's try the latter: using a pub-sub model to send kill messages to the workers:

  • The sink creates a PUB socket on a new endpoint.
  • Workers connect their input socket to this endpoint.
  • When the sink detects the end of the batch, it sends a kill to its PUB socket.
  • When a worker detects this kill message, it exits.

It doesn't take much new code in the sink:

void *controller = zmq_socket (context, ZMQ_PUB);
zmq_bind (controller, "tcp://*:5559");

// Send kill signal to workers
s_send (controller, "KILL");

Here is the worker process, which manages two sockets (a PULL socket getting tasks, and a SUB socket getting control commands), using the zmq_poll() technique we saw earlier:


C++ | C# | Clojure | CL | Delphi | Erlang | F# | Go | Haskell | Haxe | Java | Lua | Node.js | Objective-C | Perl | PHP | Python | Ruby | Scala | Tcl | Ada | Basic | Felix | ooc | Q | Racket
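
The complete listings are linked above. As a rough sketch of the essential part of that worker (the addresses are hypothetical, the s_recv() helper comes from zhelpers.h as elsewhere in this chapter, and the PUSH socket to the sink plus the actual work are omitted), the loop polls the two sockets and exits on any controller message:

void *receiver = zmq_socket (context, ZMQ_PULL);        //  Tasks from the ventilator
zmq_connect (receiver, "tcp://localhost:5557");
void *controller = zmq_socket (context, ZMQ_SUB);       //  Kill signal from the sink
zmq_connect (controller, "tcp://localhost:5559");
zmq_setsockopt (controller, ZMQ_SUBSCRIBE, "", 0);

while (1) {
    zmq_pollitem_t items [] = {
        { receiver, 0, ZMQ_POLLIN, 0 },
        { controller, 0, ZMQ_POLLIN, 0 }
    };
    zmq_poll (items, 2, -1);
    if (items [0].revents & ZMQ_POLLIN) {
        char *string = s_recv (receiver);
        //  Do the work here
        free (string);
    }
    if (items [1].revents & ZMQ_POLLIN)
        break;      //  Any message on the controller socket means "exit"
}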

Here is the modified sink application. When it's finished collecting results, it broadcasts a kill message to all workers:


C++ | C# | Clojure | CL | Delphi | Erlang | F# | Go | Haskell | Haxe | Java | Lua | Node.js | Objective-C | Perl | PHP | Python | Ruby | Scala | Tcl | Ada | Basic | Felix | ooc | Q | Racket

Handling Interrupt Signals


Realistic applications need to shut down cleanly when interrupted with Ctrl-C or another signal such as SIGTERM. By default, these simply kill the process, meaning messages won't be flushed, files won't be closed cleanly, and so on.

Here is how we handle a signal in various languages:


C++ | C# | Delphi | Erlang | Go | Haskell | Haxe | Java | Lua | Node.js | Perl | PHP | Python | Ruby | Scala | Ada | Basic | Clojure | CL | F# | Felix | Objective-C | ooc | Q | Racket | Tcl

The program provides s_catch_signals(), which traps Ctrl-C (SIGINT) and SIGTERM. When either of these signals arrives, the s_catch_signals() handler sets the global variable s_interrupted. Thanks to your signal handler, your application will not die automatically. Instead, you have a chance to clean up and exit gracefully. You have to now explicitly check for an interrupt and handle it properly. Do this by calling s_catch_signals() (copy this from interrupt.c) at the start of your main code. This sets up the signal handling. The interrupt will affect ZeroMQ calls as follows:

  • If your code is blocking in a blocking call (sending a message, receiving a message, or polling), then when a signal arrives, the call will return with EINTR.
  • Wrappers like s_recv() return NULL if they are interrupted.

So check for an EINTR return code, a NULL return, and/or s_interrupted.

Here is a typical code fragment:

s_catch_signals ();
client = zmq_socket (...);
while (!s_interrupted) {
    char *message = s_recv (client);
    if (!message)
        break;          //  Ctrl-C used
}
zmq_close (client);

If you call s_catch_signals() and don't test for interrupts, then your application will become immune to Ctrl-C and SIGTERM, which may be useful, but is usually not.

Detecting Memory Leaks


Any long-running application has to manage memory correctly, or eventually it'll use up all available memory and crash. If you use a language that handles this automatically for you, congratulations. If you program in C or C++ or any other language where you're responsible for memory management, here's a short tutorial on using valgrind, which among other things will report on any leaks your programs have.

  • To install valgrind, e.g., on Ubuntu or Debian, issue this command:
sudo apt-get install valgrind
  • By default, ZeroMQ will cause valgrind to complain a lot. To remove these warnings, create a file called vg.supp that contains this:
{
   <socketcall_sendto>
   Memcheck:Param
   socketcall.sendto(msg)
   fun:send
   ...
}
{
   <socketcall_sendto>
   Memcheck:Param
   socketcall.send(msg)
   fun:send
   ...
}
  • Fix your applications to exit cleanly after Ctrl-C. For any application that exits by itself, that's not needed, but for long-running applications, this is essential, otherwise valgrind will complain about all currently allocated memory.
  • Build your application with -DDEBUG if it's not your default setting. That ensures valgrind can tell you exactly where memory is being leaked.
  • Finally, run valgrind thus:
valgrind --tool=memcheck --leak-check=full --suppressions=vg.supp someprog

And after fixing any errors it reported, you should get the pleasant message:

==30536== ERROR SUMMARY: 0 errors from 0 contexts...

Multithreading with ZeroMQ


ZeroMQ is perhaps the nicest way ever to write multithreaded (MT) applications. Whereas ZeroMQ sockets require some readjustment if you are used to traditional sockets, ZeroMQ multithreading will take everything you know about writing MT applications, throw it into a heap in the garden, pour gasoline over it, and set it alight. It's a rare book that deserves burning, but most books on concurrent programming do.

To make utterly perfect MT programs (and I mean that literally), we don't need mutexes, locks, or any other form of inter-thread communication except messages sent across ZeroMQ sockets.

By "perfect MT programs", I mean code that's easy to write and understand, that works with the same design approach in any programming language, and on any operating system, and that scales across any number of CPUs with zero wait states and no point of diminishing returns.

If you've spent years learning tricks to make your MT code work at all, let alone rapidly, with locks and semaphores and critical sections, you will be disgusted when you realize it was all for nothing. If there's one lesson we've learned from 30+ years of concurrent programming, it is: just don't share state. It's like two drunkards trying to share a beer. It doesn't matter if they're good buddies. Sooner or later, they're going to get into a fight. And the more drunkards you add to the table, the more they fight each other over the beer. The tragic majority of MT applications look like drunken bar fights.

The list of weird problems that you need to fight as you write classic shared-state MT code would be hilarious if it didn't translate directly into stress and risk, as code that seems to work suddenly fails under pressure. A large firm with world-beating experience in buggy code released its list of "11 Likely Problems In Your Multithreaded Code", which covers forgotten synchronization, incorrect granularity, read and write tearing, lock-free reordering, lock convoys, two-step dance, and priority inversion.

Yeah, we counted seven problems, not eleven. That's not the point though. The point is, do you really want that code running the power grid or stock market to start getting two-step lock convoys at 3 p.m. on a busy Thursday? Who cares what the terms actually mean? This is not what turned us on to programming, fighting ever more complex side effects with ever more complex hacks.

Some widely used models, despite being the basis for entire industries, are fundamentally broken, and shared state concurrency is one of them. Code that wants to scale without limit does it like the Internet does, by sending messages and sharing nothing except a common contempt for broken programming models.

You should follow some rules to write happy multithreaded code with ZeroMQ:

  • Isolate data privately within its thread and never share data in multiple threads. The only exception to this is ZeroMQ contexts, which are threadsafe.
  • Stay away from the classic concurrency mechanisms such as mutexes, critical sections, semaphores, etc. These are an anti-pattern in ZeroMQ applications.
  • Create one ZeroMQ context at the start of your process, and pass that to all threads that you want to connect via inproc sockets.
  • Use attached threads to create structure within your application, and connect these to their parent threads using PAIR sockets over inproc. The pattern is: bind parent socket, then create child thread which connects its socket.
  • Use detached threads to simulate independent tasks, with their own contexts. Connect these over tcp. Later you can move these to stand-alone processes without changing the code significantly.
  • All interaction between threads happens as ZeroMQ messages, which you can define more or less formally.
  • Don't share ZeroMQ sockets between threads. ZeroMQ sockets are not threadsafe. Technically it's possible to migrate a socket from one thread to another but it demands skill. The only place where it's remotely sane to share sockets between threads is in language bindings that need to do magic like garbage collection on sockets.

If you need to start more than one proxy in an application, for example, you will want to run each in their own thread. It is easy to make the error of creating the proxy frontend and backend sockets in one thread, and then passing the sockets to the proxy in another thread. This may appear to work at first but will fail randomly in real use. Remember: Do not use or close sockets except in the thread that created them.

If you follow these rules, you can quite easily build elegant multithreaded applications, and later split off threads into separate processes as you need to. Application logic can sit in threads, processes, or nodes: whatever your scale needs.

ZeroMQ uses native OS threads rather than virtual "green" threads. The advantage is that you don't need to learn any new threading API, and that ZeroMQ threads map cleanly to your operating system. You can use standard tools like Intel's ThreadChecker to see what your application is doing. The disadvantages are that native threading APIs are not always portable, and that if you have a huge number of threads (in the thousands), some operating systems will get stressed.

Let's see how this works in practice. We'll turn our old Hello World server into something more capable. The original server ran in a single thread. If the work per request is low, that's fine: one ØMQ thread can run at full speed on a CPU core, with no waits, doing an awful lot of work. But realistic servers have to do nontrivial work per request. A single core may not be enough when 10,000 clients hit the server all at once. So a realistic server will start multiple worker threads. It then accepts requests as fast as it can and distributes these to its worker threads. The worker threads grind through the work and eventually send their replies back.

You can, of course, do all this using a proxy broker and external worker processes, but often it's easier to start one process that gobbles up sixteen cores than sixteen processes, each gobbling up one core. Further, running workers as threads will cut out a network hop, latency, and network traffic.

The MT version of the Hello World service basically collapses the broker and workers into a single process:

// Multithreaded Hello World server

#include "zhelpers.h"
#include <pthread.h>
#include <unistd.h>

static void *
worker_routine (void *context) {
    // Socket to talk to dispatcher
    void *receiver = zmq_socket (context, ZMQ_REP);
    zmq_connect (receiver, "inproc://workers");

    while (1) {
        char *string = s_recv (receiver);
        printf ("Received request: [%s]\n", string);
        free (string);
        // Do some 'work'
        sleep (1);
        // Send reply back to client
        s_send (receiver, "World");
    }
    zmq_close (receiver);
    return NULL;
}

int main (void)
{
    void *context = zmq_ctx_new ();

    // Socket to talk to clients
    void *clients = zmq_socket (context, ZMQ_ROUTER);
    zmq_bind (clients, "tcp://*:5555");

    // Socket to talk to workers
    void *workers = zmq_socket (context, ZMQ_DEALER);
    zmq_bind (workers, "inproc://workers");

    // Launch pool of worker threads
    int thread_nbr;
    for (thread_nbr = 0; thread_nbr < 5; thread_nbr++) {
        pthread_t worker;
        pthread_create (&worker, NULL, worker_routine, context);
    }
    // Connect work threads to client threads via a queue proxy
    zmq_proxy (clients, workers, NULL);

    // We never get here, but clean up anyhow
    zmq_close (clients);
    zmq_close (workers);
    zmq_ctx_destroy (context);
    return 0;
}


C++ | C# | Clojure | CL | Delphi | Erlang | F# | Go | Haskell | Haxe | Java | Lua | Perl | PHP | Python | Q | Ruby | Scala | Ada | Basic | Felix | Node.js | Objective-C | ooc | Racket | Tcl

Figure 20 - Multithreaded Server

fig20.png

All the code should be recognizable to you by now. How it works:

  • The server starts a set of worker threads. Each worker thread creates a REP socket and then processes requests on this socket. Worker threads are just like single-threaded servers. The only differences are the transport (inproc instead of tcp), and the bind-connect direction.
  • The server creates a ROUTER socket to talk to clients and binds this to its external interface (over tcp).
  • The server creates a DEALER socket to talk to the workers and binds this to its internal interface (over inproc).
  • The server starts a proxy that connects the two sockets. The proxy pulls incoming requests fairly from all clients, and distributes those out to workers. It also routes replies back to their origin.

Note that creating threads is not portable in most programming languages. The POSIX library is pthreads, but on Windows you have to use a different API. In our example, the pthread_create call starts up a new thread running the worker_routine function we defined. We'll see in Chapter 3 - Advanced Request-Reply Patterns how to wrap this in a portable API.

Here the "work" is just a one-second pause. We could do anything in the workers, including talking to other nodes. This is what the MT server looks like in terms of ØMQ sockets and nodes. Note how the request-reply chain is REQ-ROUTER-queue-DEALER-REP.

Signaling Between Threads (PAIR Sockets)


When you start making multithreaded applications with ZeroMQ, you'll encounter the question of how to coordinate your threads. Though you might be tempted to insert "sleep" statements, or use multithreading techniques such as semaphores or mutexes, the only mechanism that you should use is ZeroMQ messages. Remember the story of The Drunkards and The Beer Bottle.

Let's make three threads that signal each other when they are ready. In this example, we use PAIR sockets over the inproc transport:


C++ | C# | Clojure | CL | Delphi | Erlang | F# | Go | Haskell | Haxe | Java | Lua | Perl | PHP | Python | Q | Ruby | Scala | Ada | Basic | Felix | Node.js | Objective-C | ooc | Racket | Tcl

Figure 21 - The Relay Race

fig21.png

This is a classic pattern for multithreading with ZeroMQ (a short sketch follows these steps):

  1. Two threads communicate over inproc, using a shared context.
  2. The parent thread creates one socket, binds it to an inproc:// endpoint, and then starts the child thread, passing the context to it.
  3. The child thread creates the second socket, connects it to that inproc:// endpoint, and then signals to the parent thread that it's ready.
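
Here is a minimal sketch of steps 2 and 3 in C (the endpoint name is hypothetical, the s_send()/s_recv() helpers come from zhelpers.h as in the other examples, and the full relay-race listing is linked above):

static void *
child_task (void *context) {
    void *peer = zmq_socket (context, ZMQ_PAIR);
    zmq_connect (peer, "inproc://step");
    s_send (peer, "READY");             //  Signal the parent that we're ready
    zmq_close (peer);
    return NULL;
}

// In the parent thread: bind first, then start the child
void *peer = zmq_socket (context, ZMQ_PAIR);
zmq_bind (peer, "inproc://step");
pthread_t child;
pthread_create (&child, NULL, child_task, context);
char *signal = s_recv (peer);           //  Blocks until the child has signaled
free (signal);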

Note that multithreading code using this pattern is not scalable out to processes. If you use inproc and socket pairs, you are building a tightly-bound application, i.e., one where your threads are structurally interdependent. Do this when low latency is really vital. The other design pattern is a loosely bound application, where threads have their own context and communicate over ipc or tcp. You can easily break loosely bound threads into separate processes.

This is the first time we've shown an example using PAIR sockets. Why use PAIR? Other socket combinations might seem to work, but they all have side effects that could interfere with signaling:

  • You can use PUSH for the sender and PULL for the receiver. This looks simple and will work, but remember that PUSH will distribute messages to all available receivers. If you by accident start two receivers (e.g., you already have one running and you start a second), you'll "lose" half of your signals. PAIR has the advantage of refusing more than one connection; the pair is exclusive.
  • You can use DEALER for the sender and ROUTER for the receiver. ROUTER, however, wraps your message in an "envelope", meaning your zero-size signal turns into a multipart message. If you don't care about the data and treat anything as a valid signal, and if you don't read more than once from the socket, that won't matter. If, however, you decide to send real data, you will suddenly find ROUTER providing you with "wrong" messages. DEALER also distributes outgoing messages, giving the same risk as PUSH.
  • You can use PUB for the sender and SUB for the receiver. This will correctly deliver your messages exactly as you sent them and PUB does not distribute as PUSH or DEALER do. However, you need to configure the subscriber with an empty subscription, which is annoying.

For these reasons, PAIR makes the best choice for coordination between pairs of threads.

Node Coordination


When you want to coordinate a set of nodes on a network, PAIR sockets won't work well any more. This is one of the few areas where the strategies for threads and nodes are different. Principally, nodes come and go whereas threads are usually static. PAIR sockets do not automatically reconnect if the remote node goes away and comes back.

Figure 22 - Pub-Sub Synchronization

fig22.png

The second significant difference between threads and nodes is that you typically have a fixed number of threads but a more variable number of nodes. Let's take one of our earlier scenarios (the weather server and clients) and use node coordination to ensure that subscribers don't lose data when starting up.

This is how the application will work:

  • The publisher knows in advance how many subscribers it expects. This is just a magic number it gets from somewhere.
  • The publisher starts up and waits for all subscribers to connect. This is the node coordination part. Each subscriber subscribes and then tells the publisher it's ready via another socket.
  • When the publisher has all subscribers connected, it starts to publish data.

In this case, we'll use a REQ-REP socket flow to synchronize subscribers and publisher. Here is the publisher:


C++ | C# | Clojure | CL | Delphi | Erlang | F# | Go | Haskell | Haxe | Java | Lua | Node.js | Perl | PHP | Python | Racket | Ruby | Scala | Tcl | Ada | Basic | Felix | Objective-C | ooc | Q
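Stripped to its coordination core, the publisher looks roughly like this sketch in C. The endpoints, the expected subscriber count, and the payload are illustrative, not necessarily those of the linked example.

#include <zmq.h>
#include <stdio.h>

#define SUBSCRIBERS_EXPECTED  10    //  How many subscribers we wait for

int main (void)
{
    void *context = zmq_ctx_new ();

    //  Socket that will carry the actual updates
    void *publisher = zmq_socket (context, ZMQ_PUB);
    zmq_bind (publisher, "tcp://*:5561");

    //  Socket used only for the ready handshake
    void *syncservice = zmq_socket (context, ZMQ_REP);
    zmq_bind (syncservice, "tcp://*:5562");

    //  Wait until every expected subscriber has said hello
    int subscribers = 0;
    while (subscribers < SUBSCRIBERS_EXPECTED) {
        char buffer [1];
        zmq_recv (syncservice, buffer, 1, 0);   //  Subscriber's sync request
        zmq_send (syncservice, "", 0, 0);       //  Empty reply confirms it
        subscribers++;
    }
    //  Only now do we start publishing real data
    int update_nbr;
    for (update_nbr = 0; update_nbr < 1000000; update_nbr++)
        zmq_send (publisher, "Rhubarb", 7, 0);
    zmq_send (publisher, "END", 3, 0);

    zmq_close (publisher);
    zmq_close (syncservice);
    zmq_ctx_destroy (context);
    return 0;
}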

And here is the subscriber:


C++ | C# | Clojure | CL | Delphi | Erlang | F# | Go | Haskell | Haxe | Java | Lua | Node.js | Perl | PHP | Python | Racket | Ruby | Scala | Tcl | Ada | Basic | Felix | Objective-C | ooc | Q

This Bash shell script will start ten subscribers and then the publisher:

echo "Starting subscribers..."
for ((a=0; a<10; a++)); do
    syncsub &
done
echo "Starting publisher..."
syncpub

Which gives us this satisfying output:

Starting subscribers...
Starting publisher...
Received 1000000 updates
Received 1000000 updates
...
Received 1000000 updates
Received 1000000 updates

We can't assume that the SUB connect will be finished by the time the REQ/REP dialog is complete. There are no guarantees that outbound connects will finish in any order whatsoever, if you're using any transport except inproc. So, the example does a brute force sleep of one second between subscribing and sending the REQ/REP synchronization.

A more robust model could be:

  • Publisher opens PUB socket and starts sending "Hello" messages (not data).
  • Subscribers connect SUB socket and when they receive a Hello message they tell the publisher via a REQ/REP socket pair.
  • When the publisher has had all the necessary confirmations, it starts to send real data.

Zero-Copy


ZeroMQ's message API lets you send and receive messages directly from and to application buffers without copying data. We call this zero-copy, and it can improve performance in some applications.

You should think about using zero-copy in the specific case where you are sending large blocks of memory (thousands of bytes), at a high frequency. For short messages, or for lower message rates, using zero-copy will make your code messier and more complex with no measurable benefit. Like all optimizations, use this when you know it helps, and measure before and after.

To do zero-copy, you use zmq_msg_init_data() to create a message that refers to a block of data already allocated with malloc() or some other allocator, and then you pass that to zmq_msg_send(). When you create the message, you also pass a function that ZeroMQ will call to free the block of data, when it has finished sending the message. This is the simplest example, assuming buffer is a block of 1,000 bytes allocated on the heap:

void my_free (void *data, void *hint) {
    free (data);
}
//  Send message from buffer, which we allocate and ZeroMQ will free for us
zmq_msg_t message;
zmq_msg_init_data (&message, buffer, 1000, my_free, NULL);
zmq_msg_send (&message, socket, 0);

Note that you don't call zmq_msg_close() after sending a message; libzmq will do this automatically when it's actually done sending the message.

There is no way to do zero-copy on receive: ZeroMQ delivers you a buffer that you can store as long as you wish, but it will not write data directly into application buffers.

On writing, ZeroMQ's multipart messages work nicely together with zero-copy. In traditional messaging, you need to marshal different buffers together into one buffer that you can send. That means copying data. With ZeroMQ, you can send multiple buffers coming from different sources as individual message frames. Send each field as a length-delimited frame. To the application, it looks like a series of send and receive calls. But internally, the multiple parts get written to the network and read back with single system calls, so it's very efficient.
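For instance, two fields held in separately allocated buffers can go out as two frames of one message, reusing the my_free callback from the fragment above. This is only a sketch: key, key_size, body, and body_size are hypothetical variables.

//  Send two application fields as two frames of a single message,
//  each frame referring to its own buffer with no copying
zmq_msg_t key_frame;
zmq_msg_init_data (&key_frame, key, key_size, my_free, NULL);
zmq_msg_send (&key_frame, socket, ZMQ_SNDMORE);     //  More frames follow

zmq_msg_t body_frame;
zmq_msg_init_data (&body_frame, body, body_size, my_free, NULL);
zmq_msg_send (&body_frame, socket, 0);              //  Last frame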

Pub-Sub Message Envelopes


In the pub-sub pattern, we can split the key into a separate message frame that we call an envelope. If you want to use pub-sub envelopes, make them yourself. It's optional, and in previous pub-sub examples we didn't do this. Using a pub-sub envelope is a little more work for simple cases, but it's cleaner especially for real cases, where the key and the data are naturally separate things.

Figure 23 - Pub-Sub Envelope with Separate Key

fig23.png

Subscriptions do a prefix match. That is, they look for "all messages starting with XYZ". The obvious question is: how to delimit keys from data so that the prefix match doesn't accidentally match data. The best answer is to use an envelope because the match won't cross a frame boundary. Here is a minimalist example of how pub-sub envelopes look in code. This publisher sends messages of two types, A and B.

The envelope holds the message type:


C++ | C# | Clojure | CL | Delphi | Erlang | F# | Go | Haskell | Haxe | Java | Lua | Node.js | Perl | PHP | Python | Ruby | Scala | Tcl | Ada | Basic | Felix | Objective-C | ooc | Q | Racket
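In outline, the publisher just sends two frames per message, with the key in the first frame. A sketch in C; the endpoint is illustrative.

#include <zmq.h>
#include <unistd.h>

int main (void)
{
    void *context = zmq_ctx_new ();
    void *publisher = zmq_socket (context, ZMQ_PUB);
    zmq_bind (publisher, "tcp://*:5563");

    while (1) {
        //  First frame is the envelope (the key), second frame is the body
        zmq_send (publisher, "A", 1, ZMQ_SNDMORE);
        zmq_send (publisher, "We don't want to see this", 25, 0);
        zmq_send (publisher, "B", 1, ZMQ_SNDMORE);
        zmq_send (publisher, "We would like to see this", 25, 0);
        sleep (1);
    }
    zmq_close (publisher);
    zmq_ctx_destroy (context);
    return 0;
}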

The subscriber wants only messages of type B:


C++ | C# | Clojure | CL | Delphi | Erlang | F# | Go | Haskell | Haxe | Java | Lua | Node.js | Perl | PHP | Python | Ruby | Scala | Tcl | Ada | Basic | Felix | Objective-C | ooc | Q | Racket
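And the subscriber's side, sketched the same way: it subscribes to the prefix "B" and then reads the two frames of each matching message.

#include <zmq.h>
#include <stdio.h>

int main (void)
{
    void *context = zmq_ctx_new ();
    void *subscriber = zmq_socket (context, ZMQ_SUB);
    zmq_connect (subscriber, "tcp://localhost:5563");
    //  Match only messages whose envelope frame starts with "B"
    zmq_setsockopt (subscriber, ZMQ_SUBSCRIBE, "B", 1);

    while (1) {
        char envelope [256], body [256];
        int size = zmq_recv (subscriber, envelope, 255, 0);
        if (size == -1)
            break;                      //  Interrupted
        envelope [size] = 0;
        size = zmq_recv (subscriber, body, 255, 0);     //  Second frame
        if (size == -1)
            break;
        body [size] = 0;
        printf ("[%s] %s\n", envelope, body);
    }
    zmq_close (subscriber);
    zmq_ctx_destroy (context);
    return 0;
}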

When you run the two programs, the subscriber should show you this:

[B] We would like to see this
[B] We would like to see this
[B] We would like to see this
...

This example shows that the subscription filter rejects or accepts the entire multipart message (key plus data). You won't get part of a multipart message, ever. If you subscribe to multiple publishers and you want to know their address so that you can send them data via another socket (and this is a typical use case), create a three-part message.

Figure 24 - Pub-Sub Envelope with Sender Address

fig24.png

High-Water Marks


When you can send messages rapidly from process to process, you soon discover that memory is a precious resource, and one that can be trivially filled up. A few seconds of delay somewhere in a process can turn into a backlog that blows up a server unless you understand the problem and take precautions.

The problem is this: imagine you have process A sending messages at high frequency to process B, which is processing them. Suddenly B gets very busy (garbage collection, CPU overload, whatever), and can't process the messages for a short period. It could be a few seconds for some heavy garbage collection, or it could be much longer, if there's a more serious problem. What happens to the messages that process A is still trying to send frantically? Some will sit in B's network buffers. Some will sit on the Ethernet wire itself. Some will sit in A's network buffers. And the rest will accumulate in A's memory, as rapidly as the application behind A sends them. If you don't take some precaution, A can easily run out of memory and crash.

It is a consistent, classic problem with message brokers. What makes it hurt more is that it's B's fault, superficially, and B is typically a user-written application which A has no control over.

What are the answers? One is to pass the problem upstream. A is getting the messages from somewhere else. So tell that process, "Stop!" And so on. This is called flow control. It sounds plausible, but what if you're sending out a Twitter feed? Do you tell the whole world to stop tweeting while B gets its act together?

Flow control works in some cases, but not in others. The transport layer can't tell the application layer to "stop" any more than a subway system can tell a large business, "please keep your staff at work for another half an hour. I'm too busy". The answer for messaging is to set limits on the size of buffers, and then when we reach those limits, to take some sensible action. In some cases (not for a subway system, though), the answer is to throw away messages. In others, the best strategy is to wait.

ZeroMQ uses the concept of HWM (high-water mark) to define the capacity of its internal pipes. Each connection out of a socket or into a socket has its own pipe, and a HWM for sending and/or receiving, depending on the socket type. Some sockets (PUB, PUSH) only have send buffers. Some (SUB, PULL, REQ, REP) only have receive buffers. Some (DEALER, ROUTER, PAIR) have both send and receive buffers.

In ZeroMQ v2.x, the HWM was infinite by default. This was easy but also typically fatal for high-volume publishers. In ZeroMQ v3.x, it's set to 1,000 by default, which is more sensible. If you're still using ZeroMQ v2.x, you should always set a HWM on your sockets, be it 1,000 to match ZeroMQ v3.x or another figure that takes into account your message sizes and expected subscriber performance.
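Setting a HWM is an ordinary socket option, applied before you bind or connect. A sketch with the ZeroMQ v3.x option names; the sockets and the value of 100 are illustrative. (In ZeroMQ v2.x the single equivalent option is ZMQ_HWM.)

//  Cap each outbound pipe of this PUB socket at 100 messages
int hwm = 100;
zmq_setsockopt (publisher, ZMQ_SNDHWM, &hwm, sizeof (hwm));

//  And cap each inbound pipe of the matching SUB socket the same way
zmq_setsockopt (subscriber, ZMQ_RCVHWM, &hwm, sizeof (hwm));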

When your socket reaches its HWM, it will either block or drop data depending on the socket type. PUB and ROUTER sockets will drop data if they reach their HWM, while other socket types will block. Over the inproc transport, the sender and receiver share the same buffers, so the real HWM is the sum of the HWM set by both sides.

Lastly, the HWMs are not exact; while you may get up to 1,000 messages by default, the real buffer size may be much lower (as little as half), due to the way libzmq implements its queues.

Missing Message Problem Solver


As you build applications with ZeroMQ, you will come across this problem more than once: losing messages that you expect to receive. We have put together a diagram that walks through the most common causes for this.

Figure 25 - Missing Message Problem Solver

fig25.png

Here's a summary of what the graphic says:

  • On SUB sockets, set a subscription using zmq_setsockopt() with ZMQ_SUBSCRIBE, or you won't get messages. Because you subscribe to messages by prefix, if you subscribe to "" (an empty subscription), you will get everything.
  • If you start the SUB socket (i.e., establish a connection to a PUB socket) after the PUB socket has started sending out data, you will lose whatever it published before the connection was made. If this is a problem, set up your architecture so the SUB socket starts first, then the PUB socket starts publishing.
  • Even if you synchronize a SUB and PUB socket, you may still lose messages. It's due to the fact that internal queues aren't created until a connection is actually created. If you can switch the bind/connect direction so the SUB socket binds, and the PUB socket connects, you may find it works more as you'd expect.
  • If you're using REP and REQ sockets, and you're not sticking to the synchronous send/recv/send/recv order, ZeroMQ will report errors, which you might ignore. Then, it would look like you're losing messages. If you use REQ or REP, stick to the send/recv order, and always, in real code, check for errors on ZeroMQ calls.
  • If you're using PUSH sockets, you'll find that the first PULL socket to connect will grab an unfair share of messages. The accurate rotation of messages only happens when all PULL sockets are successfully connected, which can take some milliseconds. As an alternative to PUSH/PULL, for lower data rates, consider using ROUTER/DEALER and the load balancing pattern.
  • If you're sharing sockets across threads, don't. It will lead to random weirdness, and crashes.
  • If you're using inproc, make sure both sockets are in the same context. Otherwise the connecting side will in fact fail. Also, bind first, then connect. inproc is not a disconnected transport like tcp.
  • If you're using ROUTER sockets, it's remarkably easy to lose messages by accident, by sending malformed identity frames (or forgetting to send an identity frame). In general, setting the ZMQ_ROUTER_MANDATORY option on ROUTER sockets is a good idea, but do also check the return code on every send call.
  • Lastly, if you really can't figure out what's going wrong, make a minimal test case that reproduces the problem, and ask for help from the ZeroMQ community.


Chapter 3 - Advanced Request-Reply Patterns


In Chapter 2 - Sockets and Patterns we worked through the basics of using ZeroMQ by developing a series of small applications, each time exploring new aspects of ZeroMQ. We'll continue this approach in this chapter as we explore advanced patterns built on top of ZeroMQ's core request-reply pattern.

We'll cover:

  • How the request-reply mechanisms work
  • How to combine REQ, REP, DEALER, and ROUTER sockets
  • How ROUTER sockets work, in detail
  • The load balancing pattern
  • Building a simple load balancing message broker
  • Designing a high-level API for ZeroMQ
  • Building an asynchronous request-reply server
  • A detailed inter-broker routing example

The Request-Reply Mechanisms


We already looked briefly at multipart messages. Let's now look at a major use case, which is reply message envelopes. An envelope is a way of safely packaging up data with an address, without touching the data itself. By separating reply addresses into an envelope we make it possible to write general purpose intermediaries such as APIs and proxies that create, read, and remove addresses no matter what the message payload or structure is.

In the request-reply pattern, the envelope holds the return address for replies. It is how a ZeroMQ network with no state can create round-trip request-reply dialogs.

When you use REQ and REP sockets you don't even see envelopes; these sockets deal with them automatically. But for most of the interesting request-reply patterns, you'll want to understand envelopes and particularly ROUTER sockets. We'll work through this step-by-step.

The Simple Reply Envelope

A request-reply exchange consists of a request message, and an eventual reply message. In the simple request-reply pattern, there's one reply for each request. In more advanced patterns, requests and replies can flow asynchronously. However, the reply envelope always works the same way.

The ZeroMQ reply envelope formally consists of zero or more reply addresses, followed by an empty frame (the envelope delimiter), followed by the message body (zero or more frames). The envelope is created by multiple sockets working together in a chain. We'll break this down.

We'll start by sending "Hello" through a REQ socket. The REQ socket creates the simplest possible reply envelope, which has no addresses, just an empty delimiter frame and the message frame containing the "Hello" string. This is a two-frame message.

Figure 26 - Request with Minimal Envelope

fig26.png

The REP socket does the matching work: it strips off the envelope, up to and including the delimiter frame, saves the whole envelope, and passes the "Hello" string up to the application. Thus our original Hello World example used request-reply envelopes internally, but the application never saw them.

If you spy on the network data flowing between hwclient and hwserver, this is what you'll see: every request and every reply is in fact two frames, an empty frame and then the body. It doesn't seem to make much sense for a simple REQ-REP dialog. However, you'll see the reason when we explore how ROUTER and DEALER handle envelopes.

The Extended Reply Envelope

Now let's extend the REQ-REP pair with a ROUTER-DEALER proxy in the middle and see how this affects the reply envelope. This is the extended request-reply pattern we already saw in Chapter 2 - Sockets and Patterns. We can, in fact, insert any number of proxy steps. The mechanics are the same.

Figure 27 - Extended Request-Reply Pattern

fig27.png

The proxy does this, in pseudo-code:

prepare context, frontend and backend sockets
while true:
    poll on both sockets
    if frontend had input:
        read all frames from frontend
        send to backend
    if backend had input:
        read all frames from backend
        send to frontend
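Fleshed out with the low-level C API, that loop might look like the following sketch. It assumes frontend (ROUTER) and backend (DEALER) sockets that are already set up, and simply relays every frame, preserving multipart boundaries. (Since ZeroMQ v3.2 the built-in zmq_proxy() call does this kind of blind forwarding for you.)

zmq_pollitem_t items [] = {
    { frontend, 0, ZMQ_POLLIN, 0 },
    { backend,  0, ZMQ_POLLIN, 0 }
};
while (1) {
    if (zmq_poll (items, 2, -1) == -1)
        break;              //  Interrupted
    if (items [0].revents & ZMQ_POLLIN) {
        while (1) {
            //  Copy one frame from frontend to backend, keeping the
            //  MORE flag so multipart boundaries are preserved
            zmq_msg_t message;
            zmq_msg_init (&message);
            zmq_msg_recv (&message, frontend, 0);
            int more = zmq_msg_more (&message);
            zmq_msg_send (&message, backend, more? ZMQ_SNDMORE: 0);
            if (!more)
                break;      //  Last frame of this message
        }
    }
    if (items [1].revents & ZMQ_POLLIN) {
        while (1) {
            zmq_msg_t message;
            zmq_msg_init (&message);
            zmq_msg_recv (&message, backend, 0);
            int more = zmq_msg_more (&message);
            zmq_msg_send (&message, frontend, more? ZMQ_SNDMORE: 0);
            if (!more)
                break;
        }
    }
}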

The ROUTER socket, unlike other sockets, tracks every connection it has, and tells the caller about these. The way it tells the caller is to stick the connection identity in front of each message received. An identity, sometimes called an address, is just a binary string with no meaning except "this is a unique handle to the connection". Then, when you send a message via a ROUTER socket, you first send an identity frame.

The zmq_socket() man page describes it thus:

When receiving messages a ZMQ_ROUTER socket shall prepend a message part containing the identity of the originating peer to the message before passing it to the application. Messages received are fair-queued from among all connected peers. When sending messages a ZMQ_ROUTER socket shall remove the first part of the message and use it to determine the identity of the peer the message shall be routed to.

As a historical note, ZeroMQ v2.2 and earlier use UUIDs as identities. ZeroMQ v3.0 and later generate a 5-byte identity by default (0 + a random 32-bit integer). There's some impact on network performance, but only when you use multiple proxy hops, which is rare. Mostly the change was to simplify building libzmq by removing the dependency on a UUID library.

Identities are a difficult concept to understand, but it's essential if you want to become a ZeroMQ expert. The ROUTER socket invents a random identity for each connection with which it works. If there are three REQ sockets connected to a ROUTER socket, it will invent three random identities, one for each REQ socket.

So if we continue our worked example, let's say the REQ socket has a 3-byte identity ABC. Internally, this means the ROUTER socket keeps a hash table where it can search for ABC and find the TCP connection for the REQ socket.

When we receive the message off the ROUTER socket, we get three frames.

Figure 28 - Request with One Address

fig28.png

The core of the proxy loop is "read from one socket, write to the other", so we literally send these three frames out on the DEALER socket. If you now sniffed the network traffic, you would see these three frames flying from the DEALER socket to the REP socket. The REP socket does as before, strips off the whole envelope including the new reply address, and once again delivers the "Hello" to the caller.

Incidentally, the REP socket can only deal with one request-reply exchange at a time, which is why if you try to read multiple requests or send multiple replies without sticking to a strict recv-send cycle, it gives an error.

You should now be able to visualize the return path. When hwserver sends "World" back, the REP socket wraps that with the envelope it saved, and sends a three-frame reply message across the wire to the DEALER socket.

Figure 29 - Reply with one Address

fig29.png

Now the DEALER reads these three frames, and sends all three out via the ROUTER socket. The ROUTER takes the first frame of the message, which is the ABC identity, and looks up the connection for this. If it finds that, it then pumps the next two frames out onto the wire.

Figure 30 - Reply with Minimal Envelope

fig30.png

The REQ socket picks this message up, and checks that the first frame is the empty delimiter, which it is. The REQ socket discards that frame and passes "World" to the calling application, which prints it out to the amazement of the younger us looking at ZeroMQ for the first time.

What's This Good For?

To be honest, the use cases for strict request-reply or extended request-reply are somewhat limited. For one thing, there's no easy way to recover from common failures like the server crashing due to buggy application code. We'll see more about this in Chapter 4 - Reliable Request-Reply Patterns. However, once you grasp the way these four sockets deal with envelopes, and how they talk to each other, you can do very useful things. We saw how ROUTER uses the reply envelope to decide which client REQ socket to route a reply back to. Now let's express this another way:

  • Each time ROUTER gives you a message, it tells you which peer it came from, as an identity.
  • You can use this with a hash table (with the identity as key) to track new peers as they arrive.
  • ROUTER will route messages asynchronously to any peer connected to it, if you prefix the identity as the first frame of the message.

ROUTER sockets don't care about the whole envelope. They don't know anything about the empty delimiter. All they care about is that one identity frame that lets them figure out which connection to send a message to.

Recap of Request-Reply Sockets

Let's recap this:

  • The REQ socket sends, to the network, an empty delimiter frame in front of the message data. REQ sockets are synchronous. REQ sockets always send one request and then wait for one reply. REQ sockets talk to one peer at a time. If you connect a REQ socket to multiple peers, requests are distributed to and replies expected from each peer one turn at a time.
  • The REP socket reads and saves all identity frames up to and including the empty delimiter, then passes the following frame or frames to the caller. REP sockets are synchronous and talk to one peer at a time. If you connect a REP socket to multiple peers, requests are read from peers in fair fashion, and replies are always sent to the same peer that made the last request.
  • The DEALER socket is oblivious to the reply envelope and handles this like any multipart message. DEALER sockets are asynchronous and like PUSH and PULL combined. They distribute sent messages among all connections, and fair-queue received messages from all connections.
  • The ROUTER socket is oblivious to the reply envelope, like DEALER. It creates identities for its connections, and passes these identities to the caller as a first frame in any received message. Conversely, when the caller sends a message, it uses the first message frame as an identity to look up the connection to send to. ROUTER sockets are asynchronous.

Request-Reply Combinations


We have four request-reply sockets, each with a certain behavior. We've seen how they connect in simple and extended request-reply patterns. But these sockets are building blocks that you can use to solve many problems.

These are the legal combinations:

  • REQ to REP
  • DEALER to REP
  • REQ to ROUTER
  • DEALER to ROUTER
  • DEALER to DEALER
  • ROUTER to ROUTER

And these combinations are invalid (and I'll explain why):

  • REQ to REQ
  • REQ to DEALER
  • REP to REP
  • REP to ROUTER

Here are some tips for remembering the semantics. DEALER is like an asynchronous REQ socket, and ROUTER is like an asynchronous REP socket. Where we use a REQ socket, we can use a DEALER; we just have to read and write the envelope ourselves. Where we use a REP socket, we can stick a ROUTER; we just need to manage the identities ourselves.

Think of REQ and DEALER sockets as "clients" and REP and ROUTER sockets as "servers". Mostly, you'll want to bind REP and ROUTER sockets, and connect REQ and DEALER sockets to them. It's not always going to be this simple, but it is a clean and memorable place to start.

The REQ to REP Combination

We've already covered a REQ client talking to a REP server but let's take one aspect: the REQ client must initiate the message flow. A REP server cannot talk to a REQ client that hasn't first sent it a request. Technically, it's not even possible, and the API also returns an EFSM error if you try it.

The DEALER to REP Combination

Now, let's replace the REQ client with a DEALER. This gives us an asynchronous client that can talk to multiple REP servers. If we rewrote the "Hello World" client using DEALER, we'd be able to send off any number of "Hello" requests without waiting for replies.

When we use a DEALER to talk to a REP socket, we must accurately emulate the envelope that the REQ socket would have sent, or the REP socket will discard the message as invalid. So, to send a message, we:

  • Send an empty message frame with the MORE flag set; then
  • Send the message body.

And when we receive a message, we:

  • Receive the first frame and if it's not empty, discard the whole message;
  • Receive the next frame and pass that to the application. (Both halves are sketched below.)
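Put together, the emulation is only a few calls with the raw API. A fragment: dealer is assumed to be an already-connected ZMQ_DEALER socket.

//  Sending: empty delimiter frame first, then the body
zmq_send (dealer, "", 0, ZMQ_SNDMORE);
zmq_send (dealer, "Hello", 5, 0);

//  Receiving: the first frame must be empty, the second holds the body
char empty [1];
int size = zmq_recv (dealer, empty, 1, 0);
assert (size == 0);                 //  A real program would discard instead
char body [256];
size = zmq_recv (dealer, body, 255, 0);
body [size] = 0;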

The REQ to ROUTER Combination

In the same way that we can replace REQ with DEALER, we can replace REP with ROUTER. This gives us an asynchronous server that can talk to multiple REQ clients at the same time. If we rewrote the "Hello World" server using ROUTER, we'd be able to process any number of "Hello" requests in parallel. We saw this in the Chapter 2 - Sockets and Patterns mtserver example.

We can use ROUTER in two distinct ways:

  • As a proxy that switches messages between frontend and backend sockets.
  • As an application that reads the message and acts on it.

In the first case, the ROUTER simply reads all frames, including the artificial identity frame, and passes them on blindly. In the second case the ROUTER must know the format of the reply envelope it's being sent. As the other peer is a REQ socket, the ROUTER gets the identity frame, an empty frame, and then the data frame.

The DEALER to ROUTER Combination

Now we can switch out both REQ and REP with DEALER and ROUTER to get the most powerful socket combination, which is DEALER talking to ROUTER. It gives us asynchronous clients talking to asynchronous servers, where both sides have full control over the message formats.

Because both DEALER and ROUTER can work with arbitrary message formats, if you hope to use these safely, you have to become a little bit of a protocol designer. At the very least you must decide whether you wish to emulate the REQ/REP reply envelope. It depends on whether you actually need to send replies or not.

The DEALER to DEALER Combination

You can swap a REP with a ROUTER, but you can also swap a REP with a DEALER, if the DEALER is talking to one and only one peer.

When you replace a REP with a DEALER, your worker can suddenly go full asynchronous, sending any number of replies back. The cost is that you have to manage the reply envelopes yourself, and get them right, or nothing at all will work. We'll see a worked example later. Let's just say for now that DEALER to DEALER is one of the trickier patterns to get right, and happily it's rare that we need it.

The ROUTER to ROUTER Combination

This sounds perfect for N-to-N connections, but it's the most difficult combination to use. You should avoid it until you are well advanced with ZeroMQ. We'll see one example of it in the Freelance pattern in Chapter 4 - Reliable Request-Reply Patterns, and an alternative DEALER to ROUTER design for peer-to-peer work in Chapter 8 - A Framework for Distributed Computing.

Invalid Combinations

Mostly, trying to connect clients to clients, or servers to servers is a bad idea and won't work. However, rather than give general vague warnings, I'll explain in detail:

  • REQ to REQ: both sides want to start by sending messages to each other, and this could only work if you timed things so that both peers exchanged messages at the same time. It hurts my brain to even think about it.
  • REQ to DEALER: you could in theory do this, but it would break if you added a second REQ because DEALER has no way of sending a reply to the original peer. Thus the REQ socket would get confused, and/or return messages meant for another client.
  • REP to REP: both sides would wait for the other to send the first message.
  • REP to ROUTER: the ROUTER socket can in theory initiate the dialog and send a properly-formatted request, if it knows the REP socket has connected and it knows the identity of that connection. It's messy and adds nothing over DEALER to ROUTER.

The common thread in this valid versus invalid breakdown is that a ZeroMQ socket connection is always biased towards one peer that binds to an endpoint, and another that connects to that. Further, which side binds and which side connects is not arbitrary, but follows natural patterns. The side which we expect to "be there" binds: it'll be a server, a broker, a publisher, a collector. The side that "comes and goes" connects: it'll be clients and workers. Remembering this will help you design better ZeroMQ architectures.

Exploring ROUTER Sockets


Let's look at ROUTER sockets a little closer. We've already seen how they work by routing individual messages to specific connections. I'll explain in more detail how we identify those connections, and what a ROUTER socket does when it can't send a message.

Identities and Addresses

The identity concept in ZeroMQ refers specifically to ROUTER sockets and how they identify the connections they have to other sockets. More broadly, identities are used as addresses in the reply envelope. In most cases, the identity is arbitrary and local to the ROUTER socket: it's a lookup key in a hash table. Independently, a peer can have an address that is physical (a network endpoint like "tcp://192.168.55.117:5670") or logical (a UUID or email address or other unique key).

An application that uses a ROUTER socket to talk to specific peers can convert a logical address to an identity if it has built the necessary hash table. Because ROUTER sockets only announce the identity of a connection (to a specific peer) when that peer sends a message, you can only really reply to a message, not spontaneously talk to a peer.

This is true even if you flip the rules and make the ROUTER connect to the peer rather than wait for the peer to connect to the ROUTER. However, you can force the ROUTER socket to use a logical address in place of its identity. The zmq_setsockopt reference page calls this setting the socket identity. It works as follows:

  • The peer application sets the ZMQ_IDENTITY option of its peer socket (DEALER or REQ) before binding or connecting.
  • Usually the peer then connects to the already-bound ROUTER socket. But the ROUTER can also connect to the peer.
  • At connection time, the peer socket tells the router socket, "please use this identity for this connection".
  • If the peer socket doesn't say that, the router generates its usual arbitrary random identity for the connection.
  • The ROUTER socket now provides this logical address to the application as a prefix identity frame for any messages coming in from that peer.
  • The ROUTER also expects the logical address as the prefix identity frame for any outgoing messages.

Here is a simple example of two peers that connect to a ROUTER socket, one that imposes a logical address "PEER2":


C++ | C# | Clojure | CL | Delphi | Erlang | F# | Go | Haskell | Haxe | Java | Lua | Node.js | Perl | PHP | Python | Q | Ruby | Scala | Tcl | Ada | Basic | Felix | Objective-C | ooc | Racket
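The key call on the peer's side is a single socket option, set before connecting. A fragment; the socket variable and endpoint are illustrative.

//  Impose the logical address "PEER2" on this connection
zmq_setsockopt (peer, ZMQ_IDENTITY, "PEER2", 5);
zmq_connect (peer, "tcp://localhost:5570");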

Here is what the program prints:

----------------------------------------
[005] 006B8B4567
[000]
[039] ROUTER uses a generated 5 byte identity
----------------------------------------
[005] PEER2
[000]
[038] ROUTER uses REQ's socket identity

ROUTER Error Handling

ROUTER sockets do have a somewhat brutal way of dealing with messages they can't send anywhere: they drop them silently. It's an attitude that makes sense in working code, but it makes debugging hard. The "send identity as first frame" approach is tricky enough that we often get this wrong when we're learning, and the ROUTER's stony silence when we mess up isn't very constructive.

Since ZeroMQ v3.2 there's a socket option you can set to catch this error: ZMQ_ROUTER_MANDATORY. Set that on the ROUTER socket and then when you provide an unroutable identity on a send call, the socket will signal an EHOSTUNREACH error.
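Setting the option and checking for the error looks like this (a sketch; router is an existing ROUTER socket and the identity is deliberately bogus):

//  Ask the ROUTER socket to report unroutable identities rather than
//  silently dropping the message
int mandatory = 1;
zmq_setsockopt (router, ZMQ_ROUTER_MANDATORY, &mandatory, sizeof (mandatory));

//  A send to an unknown identity now fails with EHOSTUNREACH
if (zmq_send (router, "NOSUCHPEER", 10, ZMQ_SNDMORE) == -1
&&  errno == EHOSTUNREACH)
    printf ("E: no route to that identity\n");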

The Load Balancing Pattern


Now let's look at some code. We'll see how to connect a ROUTER socket to a REQ socket, and then to a DEALER socket. These two examples follow the same logic, which is a load balancing pattern. This pattern is our first exposure to using the ROUTER socket for deliberate routing, rather than simply acting as a reply channel.

The load balancing pattern is very common and we'll see it several times in this book. It solves the main problem with simple round robin routing (as PUSH and DEALER offer) which is that round robin becomes inefficient if tasks do not all roughly take the same time.

It's the post office analogy. If you have one queue per counter, and you have some people buying stamps (a fast, simple transaction), and some people opening new accounts (a very slow transaction), then you will find stamp buyers getting unfairly stuck in queues. Just as in a post office, if your messaging architecture is unfair, people will get annoyed.

The solution in the post office is to create a single queue so that even if one or two counters get stuck with slow work, other counters will continue to serve clients on a first-come, first-served basis.

One reason PUSH and DEALER use the simplistic approach is sheer performance. If you arrive in any major US airport, you'll find long queues of people waiting at immigration. The border patrol officials will send people in advance to queue up at each counter, rather than using a single queue. Having people walk fifty yards in advance saves a minute or two per passenger. And because every passport check takes roughly the same time, it's more or less fair. This is the strategy for PUSH and DEALER: send work loads ahead of time so that there is less travel distance.

This is a recurring theme with ZeroMQ: the world's problems are diverse and you can benefit from solving different problems each in the right way. The airport isn't the post office and one size fits no one, really well.

Let's return to the scenario of a worker (DEALER or REQ) connected to a broker (ROUTER). The broker has to know when the worker is ready, and keep a list of workers so that it can take the least recently used worker each time.

The solution is really simple, in fact: workers send a "ready" message when they start, and after they finish each task. The broker reads these messages one-by-one. Each time it reads a message, it is from the last used worker. And because we're using a ROUTER socket, we get an identity that we can then use to send a task back to the worker.

It's a twist on request-reply because the task is sent with the reply, and any response for the task is sent as a new request. The following code examples should make it clearer.

ROUTER Broker and REQ Workers

Here is an example of the load balancing pattern using a ROUTER broker talking to a set of REQ workers:


C++ | C# | Clojure | CL | Delphi | Erlang | F# | Go | Haskell | Haxe | Java | Lua | Node.js | Perl | PHP | Python | Ruby | Scala | Tcl | Ada | Basic | Felix | Objective-C | ooc | Q | Racket

The example runs for five seconds and then each worker prints how many tasks they handled. If the routing worked, we'd expect a fair distribution of work:

Completed: 20 tasks
Completed: 18 tasks
Completed: 21 tasks
Completed: 23 tasks
Completed: 19 tasks
Completed: 21 tasks
Completed: 17 tasks
Completed: 17 tasks
Completed: 25 tasks
Completed: 19 tasks

To talk to the workers in this example, we have to create a REQ-friendly envelope consisting of an identity plus an empty envelope delimiter frame.

Figure 31 - Routing Envelope for REQ

fig31.png

ROUTER Broker and DEALER Workers

Anywhere you can use REQ, you can use DEALER. There are two specific differences:

  • The REQ socket always sends an empty delimiter frame before any data frames; the DEALER does not.
  • The REQ socket will send only one message before it receives a reply; the DEALER is fully asynchronous.

The synchronous versus asynchronous behavior has no effect on our example because we're doing strict request-reply. It is more relevant when we address recovering from failures, which we'll come to in Chapter 4 - Reliable Request-Reply Patterns.

Now let's look at exactly the same example but with the REQ socket replaced by a DEALER socket:


C++ | C# | Clojure | CL | Delphi | Erlang | F# | Go | Haskell | Haxe | Java | Lua | Node.js | Perl | PHP | Python | Ruby | Scala | Tcl | Ada | Basic | Felix | Objective-C | ooc | Q | Racket

The code is almost identical except that the worker uses a DEALER socket, and reads and writes that empty frame before the data frame. This is the approach I use when I want to keep compatibility with REQ workers.

However, remember the reason for that empty delimiter frame: it's to allow multihop extended requests that terminate in a REP socket, which uses that delimiter to split off the reply envelope so it can hand the data frames to its application.

If we never need to pass the message along to a REP socket, we can simply drop the empty delimiter frame at both sides, which makes things simpler. This is usually the design I use for pure DEALER to ROUTER protocols.
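In that case the worker's loop shrinks to a plain receive and send, with no delimiter handling at all. A sketch; worker is an already-connected DEALER socket.

while (1) {
    char request [256];
    int size = zmq_recv (worker, request, 255, 0);
    if (size == -1)
        break;                          //  Interrupted
    request [size] = 0;
    //  ...do the work...
    zmq_send (worker, "OK", 2, 0);      //  Reply is a single frame
}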

A Load Balancing Message Broker

The previous example is half-complete. It can manage a set of workers with dummy requests and replies, but it has no way to talk to clients. If we add a second frontend ROUTER socket that accepts client requests, and turn our example into a proxy that can switch messages from frontend to backend, we get a useful and reusable tiny load balancing message broker.

Figure 32 - Load Balancing Broker

fig32.png

This broker does the following:

  • Accepts connections from a set of clients.
  • Accepts connections from a set of workers.
  • Accepts requests from clients and holds these in a single queue.
  • Sends these requests to workers using the load balancing pattern.
  • Receives replies back from workers.
  • Sends these replies back to the original requesting client.

The broker code is fairly long, but worth understanding:

// Load-balancing broker
// Clients and workers are shown here in-process

#include "zhelpers.h"
#include <pthread.h>
#define NBR_CLIENTS 10
#define NBR_WORKERS 3

// Dequeue operation for queue implemented as array of anything
#define DEQUEUE(q) memmove (&(q)[0], &(q)[1], sizeof (q) - sizeof (q [0]))

// Basic request-reply client using REQ socket
// Because s_send and s_recv can't handle 0MQ binary identities, we
// set a printable text identity to allow routing.
//

static void *
client_task(void *args)
{
void *context = zmq_ctx_new();
void *client = zmq_socket(context, ZMQ_REQ);

#if (defined (WIN32))

s_set_id(client, (intptr_t)args);
zmq_connect(client, "tcp://localhost:5672"); // frontend
#else
s_set_id(client); // Set a printable identity
zmq_connect(client, "ipc://frontend.ipc");
#endif

// Send request, get reply
s_send(client, "HELLO");
char *reply = s_recv(client);
printf("Client: %s\n", reply);
free(reply);
zmq_close(client);
zmq_ctx_destroy(context);
return NULL;
}

// While this example runs in a single process, that is just to make
// it easier to start and stop the example. Each thread has its own
// context and conceptually acts as a separate process.
// This is the worker task, using a REQ socket to do load-balancing.
// Because s_send and s_recv can't handle 0MQ binary identities, we
// set a printable text identity to allow routing.

static void *
worker_task(void *args)
{
void *context = zmq_ctx_new();
void *worker = zmq_socket(context, ZMQ_REQ);

#if (defined (WIN32))

s_set_id(worker, (intptr_t)args);
zmq_connect(worker, "tcp://localhost:5673"); // backend
#else
s_set_id(worker);
zmq_connect(worker, "ipc://backend.ipc");
#endif

// Tell broker we're ready for work
s_send(worker, "READY");

while (1) {
// Read and save all frames until we get an empty frame
// In this example there is only 1, but there could be more
char *identity = s_recv(worker);
char *empty = s_recv(worker);
assert(*empty == 0);
free(empty);

// Get request, send reply
char *request = s_recv(worker);
printf("Worker: %s\n", request);
free(request);

s_sendmore(worker, identity);
s_sendmore(worker, "");
s_send(worker, "OK");
free(identity);
}
zmq_close(worker);
zmq_ctx_destroy(context);
return NULL;
}

// This is the main task. It starts the clients and workers, and then
// routes requests between the two layers. Workers signal READY when
// they start; after that we treat them as ready when they reply with
// a response back to a client. The load-balancing data structure is
// just a queue of next available workers.

int main(void)
{
// Prepare our context and sockets
void *context = zmq_ctx_new();
void *frontend = zmq_socket(context, ZMQ_ROUTER);
void *backend = zmq_socket(context, ZMQ_ROUTER);

#if (defined (WIN32))

zmq_bind(frontend, "tcp://*:5672"); // frontend
zmq_bind(backend, "tcp://*:5673"); // backend
#else
zmq_bind(frontend, "ipc://frontend.ipc");
zmq_bind(backend, "ipc://backend.ipc");
#endif

int client_nbr;
for (client_nbr = 0; client_nbr < NBR_CLIENTS; client_nbr++) {
pthread_t client;
pthread_create(&client, NULL, client_task, (void *)(intptr_t)client_nbr);
}
int worker_nbr;
for (worker_nbr = 0; worker_nbr < NBR_WORKERS; worker_nbr++) {
pthread_t worker;
pthread_create(&worker, NULL, worker_task, (void *)(intptr_t)worker_nbr);
}
// Here is the main loop for the least-recently-used queue. It has two
// sockets; a frontend for clients and a backend for workers. It polls
// the backend in all cases, and polls the frontend only when there are
// one or more workers ready. This is a neat way to use 0MQ's own queues
// to hold messages we're not ready to process yet. When we get a client
// reply, we pop the next available worker and send the request to it,
// including the originating client identity. When a worker replies, we
// requeue that worker and forward the reply to the original client
// using the reply envelope.

// Queue of available workers
int available_workers = 0;
char *worker_queue[10];

while (1) {
zmq_pollitem_t items[] = {
{ backend, 0, ZMQ_POLLIN, 0 },
{ frontend, 0, ZMQ_POLLIN, 0 }
};
// Poll frontend only if we have available workers
int rc = zmq_poll(items, available_workers ? 2 : 1, -1);
if (rc == -1)
break; // Interrupted

// Handle worker activity on backend
if (items[0].revents & ZMQ_POLLIN) {
// Queue worker identity for load-balancing
char *worker_id = s_recv(backend);
assert(available_workers < NBR_WORKERS);
worker_queue[available_workers++] = worker_id;

// Second frame is empty
char *empty = s_recv(backend);
assert(empty[0] == 0);
free(empty);

// Third frame is READY or else a client reply identity
char *client_id = s_recv(backend);

// If client reply, send rest back to frontend
if (strcmp(client_id, "READY") != 0) {
empty = s_recv(backend);
assert(empty[0] == 0);
free(empty);
char *reply = s_recv(backend);
s_sendmore(frontend, client_id);
s_sendmore(frontend, "");
s_send(frontend, reply);
free(reply);
if (--client_nbr == 0)
break; // Exit after N messages
}
free(client_id);
}
// Here is how we handle a client request:

if (items[1].revents & ZMQ_POLLIN) {
// Now get next client request, route to last-used worker
// Client request is [identity][empty][request]
char *client_id = s_recv(frontend);
char *empty = s_recv(frontend);
assert(empty[0] == 0);
free(empty);
char *request = s_recv(frontend);

s_sendmore(backend, worker_queue[0]);
s_sendmore(backend, "");
s_sendmore(backend, client_id);
s_sendmore(backend, "");
s_send(backend, request);

free(client_id);
free(request);

// Dequeue and drop the next worker identity
free(worker_queue[0]);
DEQUEUE(worker_queue);
available_workers--;
}
}
zmq_close(frontend);
zmq_close(backend);
zmq_ctx_destroy(context);
return 0;
}


C++ | C# | Clojure | CL | Delphi | Erlang | F# | Go | Haskell | Haxe | Java | Lua | Node.js | Perl | PHP | Python | Ruby | Scala | Tcl | Ada | Basic | Felix | Objective-C | ooc | Q | Racket

The difficult part of this program is (a) the envelopes that each socket reads and writes, and (b) the load balancing algorithm. We'll take these in turn, starting with the message envelope formats.

Let's walk through a full request-reply chain from client to worker and back. In this code we set the identity of client and worker sockets to make it easier to trace the message frames. In reality, we'd allow the ROUTER sockets to invent identities for connections. Let's assume the client's identity is "CLIENT" and the worker's identity is "WORKER". The client application sends a single frame containing "Hello".

Figure 33 - Message that Client Sends

fig33.png

Because the REQ socket adds its empty delimiter frame and the ROUTER socket adds its connection identity, the proxy reads off the frontend ROUTER socket the client address, empty delimiter frame, and the data part.

Figure 34 - Message Coming in on Frontend

fig34.png

The broker sends this to the worker, prefixed by the address of the chosen worker, plus an additional empty part to keep the REQ at the other end happy.

Figure 35 - Message Sent to Backend

fig35.png

This complex envelope stack gets chewed up first by the backend ROUTER socket, which removes the first frame. Then the REQ socket in the worker removes the empty part, and provides the rest to the worker application.

Figure 36 - Message Delivered to Worker

fig36.png

The worker has to save the envelope (which is all the parts up to and including the empty message frame) and then it can do what's needed with the data part. Note that a REP socket would do this automatically, but we're using the REQ-ROUTER pattern so that we can get proper load balancing.

On the return path, the messages are the same as when they come in, i.e., the backend socket gives the broker a message in five parts, and the broker sends the frontend socket a message in three parts, and the client gets a message in one part.

Now let's look at the load balancing algorithm. It requires that both clients and workers use REQ sockets, and that workers correctly store and replay the envelope on messages they get. The algorithm is:

  • Create a pollset that always polls the backend, and polls the frontend only if there are one or more workers available.
  • Poll for activity with infinite timeout.
  • If there is activity on the backend, we either have a "ready" message or a reply for a client. In either case, we store the worker address (the first part) on our worker queue, and if the rest is a client reply, we send it back to that client via the frontend.
  • If there is activity on the frontend, we take the client request, pop the next worker (which is the last used), and send the request to the backend. This means sending the worker address, empty part, and then the three parts of the client request.

You should now see that you can reuse and extend the load balancing algorithm with variations based on the information the worker provides in its initial "ready" message. For example, workers might start up and do a performance self test, then tell the broker how fast they are. The broker can then choose the fastest available worker rather than the oldest.

A High-Level API for ZeroMQ


We're going to push request-reply onto the stack and open a different area, which is the ZeroMQ API itself. There's a reason for this detour: as we write more complex examples, the low-level ZeroMQ API starts to look increasingly clumsy. Look at the core of the worker thread from our load balancing broker:

while (true) {
// Get one address frame and empty delimiter
char *address = s_recv (worker);
char *empty = s_recv (worker);
assert (*empty == 0);
free (empty);

// Get request, send reply
char *request = s_recv (worker);
printf ("Worker: %s\n", request);
free (request);

s_sendmore (worker, address);
s_sendmore (worker, "");
s_send (worker, "OK");
free (address);
}

That code isn't even reusable because it can only handle one reply address in the envelope, and it already does some wrapping around the ZeroMQ API. If we used the libzmq simple message API this is what we'd have to write:

while (true) {
// Get one address frame and empty delimiter
char address [255];
int address_size = zmq_recv (worker, address, 255, 0);
if (address_size == -1)
break;

char empty [1];
int empty_size = zmq_recv (worker, empty, 1, 0);
assert (empty_size <= 0);
if (empty_size == -1)
break;

// Get request, send reply
char request [256];
int request_size = zmq_recv (worker, request, 255, 0);
if (request_size == -1)
return NULL;
request [request_size] = 0;
printf ("Worker: %s\n", request);

zmq_send (worker, address, address_size, ZMQ_SNDMORE);
zmq_send (worker, empty, 0, ZMQ_SNDMORE);
zmq_send (worker, "OK", 2, 0);
}

And when code is too long to write quickly, it's also too long to understand. Up until now, I've stuck to the native API because, as ZeroMQ users, we need to know that intimately. But when it gets in our way, we have to treat it as a problem to solve.

We can't of course just change the ZeroMQ API, which is a documented public contract on which thousands of people agree and depend. Instead, we construct a higher-level API on top based on our experience so far, and most specifically, our experience from writing more complex request-reply patterns.

What we want is an API that lets us receive and send an entire message in one shot, including the reply envelope with any number of reply addresses. One that lets us do what we want with the absolute least lines of code.

Making a good message API is fairly difficult. We have a problem of terminology: ZeroMQ uses "message" to describe both multipart messages, and individual message frames. We have a problem of expectations: sometimes it's natural to see message content as printable string data, sometimes as binary blobs. And we have technical challenges, especially if we want to avoid copying data around too much.

The challenge of making a good API affects all languages, though my specific use case is C. Whatever language you use, think about how you could contribute to your language binding to make it as good (or better) than the C binding I'm going to describe.

Features of a Higher-Level API

My solution is to use three fairly natural and obvious concepts: string (already the basis for our s_send and s_recv) helpers, frame (a message frame), and message (a list of one or more frames). Here is the worker code, rewritten onto an API using these concepts:

while (true) {
zmsg_t *msg = zmsg_recv (worker);
zframe_reset (zmsg_last (msg), "OK", 2);
zmsg_send (&msg, worker);
}

Cutting the amount of code we need to read and write complex messages is great: the results are easy to read and understand. Let's continue this process for other aspects of working with ZeroMQ. Here's a wish list of things I'd like in a higher-level API, based on my experience with ZeroMQ so far:

  • Automatic handling of sockets. I find it cumbersome to have to close sockets manually, and to have to explicitly define the linger timeout in some (but not all) cases. It'd be great to have a way to close sockets automatically when I close the context.
  • Portable thread management. Every nontrivial ZeroMQ application uses threads, but POSIX threads aren't portable. So a decent high-level API should hide this under a portable layer.
  • Piping from parent to child threads. It's a recurrent problem: how to signal between parent and child threads. Our API should provide a ZeroMQ message pipe (using PAIR sockets and inproc) automatically.
  • Portable clocks. Even getting the time to a millisecond resolution, or sleeping for some milliseconds, is not portable. Realistic ZeroMQ applications need portable clocks, so our API should provide them.
  • A reactor to replace zmq_poll(). The poll loop is simple, but clumsy. Writing a lot of these, we end up doing the same work over and over: calculating timers, and calling code when sockets are ready. A simple reactor with socket readers and timers would save a lot of repeated work.
  • Proper handling of Ctrl-C. We already saw how to catch an interrupt. It would be useful if this happened in all applications.

The CZMQ High-Level API

Turning this wish list into reality for the C language gives us CZMQ, a ZeroMQ language binding for C. This high-level binding, in fact, developed out of earlier versions of the examples. It combines nicer semantics for working with ZeroMQ with some portability layers, and (importantly for C, but less for other languages) containers like hashes and lists. CZMQ also uses an elegant object model that leads to frankly lovely code.

Here is the load balancing broker rewritten to use a higher-level API (CZMQ for the C case):


C++ | Delphi | Haxe | Java | Lua | PHP | Python | Scala | Ada | Basic | C# | Clojure | CL | Erlang | F# | Felix | Go | Haskell | Node.js | Objective-C | ooc | Perl | Q | Racket | Ruby | Tcl

One thing CZMQ provides is clean interrupt handling. This means that Ctrl-C will cause any blocking ZeroMQ call to exit with a return code -1 and errno set to EINTR. The high-level recv methods will return NULL in such cases. So, you can cleanly exit a loop like this:

// Shows how to handle Ctrl-C

#include <stdlib.h>
#include <stdio.h>
#include <signal.h>
#include <unistd.h>
#include <fcntl.h>

#include <zmq.h>

// Signal handling
//
// Create a self-pipe and call s_catch_signals(pipe's writefd) in your application
// at startup, and then exit your main loop if your pipe contains any data.
// Works especially well with zmq_poll.

#define S_NOTIFY_MSG " "
#define S_ERROR_MSG "Error while writing to self-pipe.\n"

static int s_fd;
static void s_signal_handler (int signal_value)
{
int rc = write (s_fd, S_NOTIFY_MSG, sizeof(S_NOTIFY_MSG));
if (rc != sizeof(S_NOTIFY_MSG)) {
write (STDOUT_FILENO, S_ERROR_MSG, sizeof(S_ERROR_MSG)-1);
exit(1);
}
}

static void s_catch_signals (int fd)
{
s_fd = fd;

struct sigaction action;
action.sa_handler = s_signal_handler;
// Doesn't matter if SA_RESTART set because self-pipe will wake up zmq_poll
// But setting to 0 will allow zmq_read to be interrupted.
action.sa_flags = 0;
sigemptyset (&action.sa_mask);
sigaction (SIGINT, &action, NULL);
sigaction (SIGTERM, &action, NULL);
}

int main (void)
{
int rc;

void *context = zmq_ctx_new ();
void *socket = zmq_socket (context, ZMQ_REP);
zmq_bind (socket, "tcp://*:5555");

int pipefds[2];
rc = pipe(pipefds);
if (rc != 0) {
perror("Creating self-pipe");
exit(1);
}
for (int i = 0; i < 2; i++) {
int flags = fcntl(pipefds[i], F_GETFL, 0);
if (flags < 0) {
perror ("fcntl(F_GETFL)");
exit(1);
}
rc = fcntl (pipefds[i], F_SETFL, flags | O_NONBLOCK);
if (rc != 0) {
perror ("fcntl(F_SETFL)");
exit(1);
}
}

s_catch_signals (pipefds[1]);

zmq_pollitem_t items [] = {
{ 0, pipefds[0], ZMQ_POLLIN, 0 },
{ socket, 0, ZMQ_POLLIN, 0 }
};

while (1) {
rc = zmq_poll (items, 2, -1);
if (rc == 0) {
continue;
} else if (rc < 0) {
if (errno == EINTR) { continue; }
perror("zmq_poll");
exit(1);
}

// Signal pipe FD
if (items [0].revents & ZMQ_POLLIN) {
char buffer [1];
read (pipefds[0], buffer, 1); // clear notifying byte
printf ("W: interrupt received, killing server…\n");
break;
}

// Read socket
if (items [1].revents & ZMQ_POLLIN) {
char buffer [255];
// Use non-blocking so we can continue to check self-pipe via zmq_poll
rc = zmq_recv (socket, buffer, 255, ZMQ_NOBLOCK);
if (rc < 0) {
if (errno == EAGAIN) { continue; }
if (errno == EINTR) { continue; }
perror("recv");
exit(1);
}
printf ("W: recv\n");

// Now send message back.
//
}
}

printf ("W: cleaning up\n");
zmq_close (socket);
zmq_ctx_destroy (context);
return 0;
}

Or, if you're calling zmq_poll(), test on the return code:

if (zmq_poll (items, 2, 1000 * 1000) == -1)
break; // Interrupted

The previous example still uses zmq_poll(). So how about reactors? The CZMQ zloop reactor is simple but functional. It lets you:

  • Set a reader on any socket, i.e., code that is called whenever the socket has input.
  • Cancel a reader on a socket.
  • Set a timer that goes off once or multiple times at specific intervals.
  • Cancel a timer.

zloop of course uses zmq_poll() internally. It rebuilds its poll set each time you add or remove readers, and it calculates the poll timeout to match the next timer. Then, it calls the reader and timer handlers for each socket and timer that needs attention.

When we use a reactor pattern, our code turns inside out. The main logic looks like this:

zloop_t *reactor = zloop_new ();
zloop_reader (reactor, self->backend, s_handle_backend, self);
zloop_start (reactor);
zloop_destroy (&reactor);

The actual handling of messages sits inside dedicated functions or methods. You may not like the style; it's a matter of taste. What it does help with is mixing timers and socket activity. In the rest of this text, we'll use zmq_poll() in simpler cases, and zloop in more complex examples.
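To make the timer side concrete, here is a minimal sketch of a reactor with one socket reader and one repeating timer. It assumes the CZMQ v1-era API (zloop_poller taking a zmq_pollitem_t, and zloop_timer sharing the same handler type, with a NULL item for timer events); later CZMQ releases register readers and timers through slightly different calls, so treat the signatures as assumptions rather than a reference.

//  Reactor sketch: one reader, one timer (CZMQ v1-style API; signatures vary
//  by CZMQ release). Returning -1 from a handler ends zloop_start ().

#include "czmq.h"

static int
s_handle_input (zloop_t *loop, zmq_pollitem_t *item, void *arg)
{
    zmsg_t *msg = zmsg_recv (item->socket);
    if (!msg)
        return -1;                  //  Interrupted
    zmsg_dump (msg);                //  Stand-in for real work
    zmsg_destroy (&msg);
    return 0;
}

static int
s_handle_timer (zloop_t *loop, zmq_pollitem_t *item, void *arg)
{
    printf ("I: timer fired\n");    //  item is NULL for timer events
    return 0;
}

int main (void)
{
    zctx_t *ctx = zctx_new ();
    void *receiver = zsocket_new (ctx, ZMQ_PULL);
    zsocket_bind (receiver, "tcp://*:5555");    //  Hypothetical endpoint

    zloop_t *reactor = zloop_new ();
    zmq_pollitem_t poller = { receiver, 0, ZMQ_POLLIN };
    zloop_poller (reactor, &poller, s_handle_input, NULL);
    zloop_timer (reactor, 1000, 0, s_handle_timer, NULL);  //  Every second, forever
    zloop_start (reactor);          //  Blocks until a handler returns -1 or Ctrl-C

    zloop_destroy (&reactor);
    zctx_destroy (&ctx);
    return 0;
}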

Here is the load balancing broker rewritten once again, this time to use zloop:


Haxe | Java | Python | Ada | Basic | C++ | C# | Clojure | CL | Delphi | Erlang | F# | Felix | Go | Haskell | Lua | Node.js | Objective-C | ooc | Perl | PHP | Q | Racket | Ruby | Scala | Tcl

Getting applications to shut down properly when you send them Ctrl-C can be tricky. If you use the zctx class, it will automatically set up signal handling, but your code still has to cooperate. You must break out of any loop if zmq_poll returns -1 or if any of the zstr_recv, zframe_recv, or zmsg_recv methods returns NULL. If you have nested loops, it can be useful to make the outer ones conditional on !zctx_interrupted.
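As a minimal sketch of that shape (the endpoint and socket type here are illustrative, not from any particular example):

zctx_t *ctx = zctx_new ();              //  zctx installs the signal handler
void *worker = zsocket_new (ctx, ZMQ_PULL);
zsocket_connect (worker, "tcp://localhost:5555");

while (!zctx_interrupted) {
    zmsg_t *msg = zmsg_recv (worker);
    if (!msg)
        break;                          //  Ctrl-C: recv returned NULL
    //  ... handle the message ...
    zmsg_destroy (&msg);
}
zctx_destroy (&ctx);                    //  Clean shutdown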

If you're using child threads, they won't receive the interrupt. To tell them to shut down, you can either:

  • Destroy the context, if they are sharing the same context, in which case any blocking calls they are waiting on will end with ETERM.
  • Send them shutdown messages, if they are using their own contexts. For this you'll need some socket plumbing; one convenient form is sketched below.
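With CZMQ's attached threads (zthread_fork, which several examples in this chapter use), that plumbing already exists: the fork call hands both sides a PAIR pipe. Here is a hedged sketch of using it as a shutdown channel; the "TERMINATE" command string, the worker_task name, and the worker's endpoint are illustrative choices, not part of any example above.

//  Child side: poll the control pipe alongside the work socket
static void
worker_task (void *args, zctx_t *ctx, void *pipe)
{
    void *worker = zsocket_new (ctx, ZMQ_PULL);
    zsocket_connect (worker, "tcp://localhost:5555");

    while (true) {
        zmq_pollitem_t items [] = {
            { pipe,   0, ZMQ_POLLIN, 0 },
            { worker, 0, ZMQ_POLLIN, 0 }
        };
        if (zmq_poll (items, 2, -1) == -1)
            break;                      //  Interrupted
        if (items [0].revents & ZMQ_POLLIN) {
            char *command = zstr_recv (pipe);
            int stop = streq (command, "TERMINATE");
            free (command);
            if (stop)
                break;                  //  Parent asked us to exit
        }
        if (items [1].revents & ZMQ_POLLIN) {
            zmsg_t *msg = zmsg_recv (worker);
            //  ... do the actual work ...
            zmsg_destroy (&msg);
        }
    }
}

//  Parent side: fork an attached child, later ask it to stop
void *pipe = zthread_fork (ctx, worker_task, NULL);
//  ... run for a while ...
zstr_send (pipe, "TERMINATE");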

The Asynchronous Client/Server Pattern

topprevnext

In the ROUTER to DEALER example, we saw a 1-to-N use case where one server talks asynchronously to multiple workers. We can turn this upside down to get a very useful N-to-1 architecture where various clients talk to a single server, and do this asynchronously.

Figure 37 - Asynchronous Client/Server

fig37.png

Here's how it works:

  • Clients connect to the server and send requests.
  • For each request, the server sends 0 or more replies.
  • Clients can send multiple requests without waiting for a reply.
  • Servers can send multiple replies without waiting for new requests.

Here's code that shows how this works:

// Asynchronous client-to-server (DEALER to ROUTER)
//
// While this example runs in a single process, that is to make
// it easier to start and stop the example. Each task has its own
// context and conceptually acts as a separate process.

#include "czmq.h"

// This is our client task
// It connects to the server, and then sends a request once per second
// It collects responses as they arrive, and it prints them out. We will
// run several client tasks in parallel, each with a different random ID.

static void *
client_task (void *args)
{
zctx_t *ctx = zctx_new ();
void *client = zsocket_new (ctx, ZMQ_DEALER);

// Set random identity to make tracing easier
char identity [10];
sprintf (identity, "%04X-%04X", randof (0x10000), randof (0x10000));
zsocket_set_identity (client, identity);
zsocket_connect (client, "tcp://localhost:5570");

zmq_pollitem_t items [] = { { client, 0, ZMQ_POLLIN, 0 } };
int request_nbr = 0;
while (true) {
// Tick once per second, pulling in arriving messages
int centitick;
for (centitick = 0; centitick < 100; centitick++) {
zmq_poll (items, 1, 10 * ZMQ_POLL_MSEC);
if (items [0].revents & ZMQ_POLLIN) {
zmsg_t *msg = zmsg_recv (client);
zframe_print (zmsg_last (msg), identity);
zmsg_destroy (&msg);
}
}
zstr_sendf (client, "request #%d", ++request_nbr);
}
zctx_destroy (&ctx);
return NULL;
}

// This is our server task.
// It uses the multithreaded server model to deal requests out to a pool
// of workers and route replies back to clients. One worker can handle
// one request at a time but one client can talk to multiple workers at
// once.

static void server_worker (void *args, zctx_t *ctx, void *pipe);

void *server_task (void *args)
{
// Frontend socket talks to clients over TCP
zctx_t *ctx = zctx_new ();
void *frontend = zsocket_new (ctx, ZMQ_ROUTER);
zsocket_bind (frontend, "tcp://*:5570");

// Backend socket talks to workers over inproc
void *backend = zsocket_new (ctx, ZMQ_DEALER);
zsocket_bind (backend, "inproc://backend");

// Launch pool of worker threads, precise number is not critical
int thread_nbr;
for (thread_nbr = 0; thread_nbr < 5; thread_nbr++)
zthread_fork (ctx, server_worker, NULL);

// Connect backend to frontend via a proxy
zmq_proxy (frontend, backend, NULL);

zctx_destroy (&ctx);
return NULL;
}

// Each worker task works on one request at a time and sends a random number
// of replies back, with random delays between replies:

static void
server_worker (void *args, zctx_t *ctx, void *pipe)
{
void *worker = zsocket_new (ctx, ZMQ_DEALER);
zsocket_connect (worker, "inproc://backend");

while (true) {
// The DEALER socket gives us the reply envelope and message
zmsg_t *msg = zmsg_recv (worker);
zframe_t *identity = zmsg_pop (msg);
zframe_t *content = zmsg_pop (msg);
assert (content);
zmsg_destroy (&msg);

// Send 0..4 replies back
int reply, replies = randof (5);
for (reply = 0; reply < replies; reply++) {
// Sleep for some fraction of a second
zclock_sleep (randof (1000) + 1);
zframe_send (&identity, worker, ZFRAME_REUSE + ZFRAME_MORE);
zframe_send (&content, worker, ZFRAME_REUSE);
}
zframe_destroy (&identity);
zframe_destroy (&content);
}
}

// The main thread simply starts several clients and a server, and then
// waits for the server to finish.

int main (void)
{
zthread_new (client_task, NULL);
zthread_new (client_task, NULL);
zthread_new (client_task, NULL);
zthread_new (server_task, NULL);
zclock_sleep (5 * 1000); // Run for 5 seconds, then quit
return 0;
}


C++ | C# | Clojure | Delphi | Erlang | F# | Go | Haskell | Haxe | Java | Lua | Node.js | PHP | Python | Ruby | Scala | Tcl | Ada | Basic | CL | Felix | Objective-C | ooc | Perl | Q | Racket

The example runs in one process, with multiple threads simulating a real multiprocess architecture. When you run the example, you'll see three clients (each with a random ID), printing out the replies they get from the server. Look carefully and you'll see each client task gets 0 or more replies per request.

Some comments on this code:

  • The clients send a request once per second, and get zero or more replies back. To make this work using zmq_poll(), we can't simply poll with a 1-second timeout, or we'd end up sending a new request only one second after we received the last reply. So we poll at a high frequency (100 times at 1/100th of a second per poll), which is approximately accurate.
  • The server uses a pool of worker threads, each processing one request synchronously. It connects these to its frontend socket using an internal queue. It connects the frontend and backend sockets using a zmq_proxy() call.

Figure 38 - Detail of Asynchronous Server

fig38.png

Note that we're doing DEALER to ROUTER dialog between client and server, but internally between the server main thread and workers, we're doing DEALER to DEALER. If the workers were strictly synchronous, we'd use REP. However, because we want to send multiple replies, we need an async socket. We do not want to route replies; they always go to the single server thread that sent us the request.

Let's think about the routing envelope. The client sends a message consisting of a single frame. The server thread receives a two-frame message (the original message prefixed by the client identity). We send these two frames on to the worker, which treats it as a normal reply envelope and returns them to us as a two-frame message. We then use the first frame as an identity to route the second frame back to the client as a reply.

It looks something like this:

     client          server       frontend       worker
   [ DEALER ]<---->[ ROUTER <----> DEALER <----> DEALER ]
             1 part         2 parts       2 parts

Now for the sockets: we could use the load balancing ROUTER to DEALER pattern to talk to workers, but it's extra work. In this case, a DEALER to DEALER pattern is probably fine: the trade-off is lower latency for each request, but higher risk of unbalanced work distribution. Simplicity wins in this case.

When you build servers that maintain stateful conversations with clients, you will run into a classic problem. If the server keeps some state per client, and clients keep coming and going, eventually it will run out of resources. Even if the same clients keep connecting, if you're using default identities, each connection will look like a new one.

We cheat in the above example by keeping state only for a very short time (the time it takes a worker to process a request) and then throwing away the state. But that's not practical for many cases. To properly manage client state in a stateful asynchronous server, you have to:

  • Do heartbeating from client to server. In our example, we send a request once per second, which can reliably be used as a heartbeat.
  • Store state using the client identity (whether generated or explicit) as key.
  • Detect a stopped heartbeat. If there's no request from a client within, say, two seconds, the server can detect this and destroy any state it's holding for that client. The bookkeeping this implies is sketched just below.
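Here is a minimal sketch of that bookkeeping, using CZMQ's zhash and zclock. The client_state_t struct, the helper name, and the two-second TTL are illustrative choices, not part of the example above.

#define CLIENT_TTL  2000        //  Expire a client after 2 seconds of silence

typedef struct {
    int64_t expires_at;         //  zclock_time () + CLIENT_TTL
    //  ... whatever per-client state the server keeps ...
} client_state_t;

//  Call this on every request (each request doubles as a heartbeat)
static void
s_client_refresh (zhash_t *clients, char *identity)
{
    client_state_t *state = (client_state_t *) zhash_lookup (clients, identity);
    if (!state) {
        state = (client_state_t *) zmalloc (sizeof (client_state_t));
        zhash_insert (clients, identity, state);
        zhash_freefn (clients, identity, free);
    }
    state->expires_at = zclock_time () + CLIENT_TTL;
}

//  A periodic sweep (e.g., from a reactor timer) then deletes every entry
//  whose expires_at lies in the past, along with the state it holds.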

Worked Example: Inter-Broker Routing

topprevnext

Let's take everything we've seen so far, and scale things up to a real application. We'll build this step-by-step over several iterations. Our best client calls us urgently and asks for a design of a large cloud computing facility. He has this vision of a cloud that spans many data centers, each a cluster of clients and workers, and that works together as a whole. Because we're smart enough to know that practice always beats theory, we propose to make a working simulation using ZeroMQ. Our client, eager to lock down the budget before his own boss changes his mind, and having read great things about ZeroMQ on Twitter, agrees.

Establishing the Details
topprevnext

Several espressos later, we want to jump into writing code, but a little voice tells us to get more details before making a sensational solution to entirely the wrong problem. "What kind of work is the cloud doing?", we ask.

The client explains:

  • Workers run on various kinds of hardware, but they are all able to handle any task. There are several hundred workers per cluster, and as many as a dozen clusters in total.
  • Clients create tasks for workers. Each task is an independent unit of work and all the client wants is to find an available worker, and send it the task, as soon as possible. There will be a lot of clients and they'll come and go arbitrarily.
  • The real difficulty is to be able to add and remove clusters at any time. A cluster can leave or join the cloud instantly, bringing all its workers and clients with it.
  • If there are no workers in their own cluster, clients' tasks will go off to other available workers in the cloud.
  • Clients send out one task at a time, waiting for a reply. If they don't get an answer within X seconds, they'll just send out the task again. This isn't our concern; the client API does it already.
  • Workers process one task at a time; they are very simple beasts. If they crash, they get restarted by whatever script started them.

So we double-check to make sure that we understood this correctly:

  • "There will be some kind of super-duper(了不起的) network interconnect(使互相连接) between clusters, right?", we ask. The client says, "Yes, of course, we're not idiots(笨蛋)."
  • "What kind of volumes(量) are we talking about?", we ask. The client replies, "Up to a thousand clients per cluster, each doing at most ten requests per second. Requests are small, and replies are also small, no more than 1K bytes each."

So we do a little calculation(计算) and see that this will work nicely over plain TCP. 2,500 clients x 10/second x 1,000 bytes x 2 directions = 50MB/sec or 400Mb/sec, not a problem for a 1Gb network.

It's a straightforward problem that requires no exotic hardware or protocols, just some clever routing algorithms and careful design. We start by designing one cluster (one data center) and then we figure out how to connect clusters together.

Architecture of a Single Cluster
topprevnext

Workers and clients are synchronous. We want to use the load balancing pattern to route tasks to workers. Workers are all identical; our facility has no notion of different services. Workers are anonymous; clients never address them directly. We make no attempt here to provide guaranteed delivery, retry, and so on.

For reasons we already examined, clients and workers won't speak to each other directly. Doing so would make it impossible to add or remove nodes dynamically. So our basic model consists of the request-reply message broker we saw earlier.

Figure 39 - Cluster Architecture

fig39.png

Scaling to Multiple Clusters
topprevnext

Now we scale this out to more than one cluster. Each cluster has a set of clients and workers, and a broker that joins these together.

Figure 40 - Multiple Clusters

fig40.png

The question is: how do we get the clients of each cluster talking to the workers of the other cluster? There are a few possibilities, each with pros and cons:

  • Clients could connect directly to both brokers. The advantage is that we don't need to modify brokers or workers. But clients get more complex and become aware of the overall topology. If we want to add a third or fourth cluster, for example, all the clients are affected. In effect, we have to move routing and failover logic into the clients, and that's not nice.
  • Workers might connect directly to both brokers. But REQ workers can't do that; they can only reply to one broker. We might use REPs, but REPs don't give us customizable broker-to-worker routing like load balancing does, only the built-in load balancing. That's a fail; if we want to distribute work to idle workers, we precisely need load balancing. One solution would be to use ROUTER sockets for the worker nodes. Let's label this "Idea #1".
  • Brokers could connect to each other. This looks neatest because it creates the fewest additional connections. We can't add clusters on the fly, but that is probably out of scope. Now clients and workers remain ignorant of the real network topology, and brokers tell each other when they have spare capacity. Let's label this "Idea #2".

Let's explore Idea #1. In this model, we have workers connecting to both brokers and accepting jobs from either one.

Figure 41 - Idea 1: Cross-connected Workers

fig41.png

It looks feasible. However, it doesn't provide what we wanted, which was that clients get local workers if possible and remote workers only if it's better than waiting. Also, workers will signal "ready" to both brokers and can get two jobs at once, while other workers remain idle. It seems this design fails because again we're putting routing logic at the edges.

So, Idea #2 then. We interconnect the brokers and don't touch the clients or workers, which are REQs like we're used to.

Figure 42 - Idea 2: Brokers Talking to Each Other

fig42.png

This design is appealing because the problem is solved in one place, invisible to the rest of the world. Basically, brokers open secret channels to each other and whisper, like camel traders, "Hey, I've got some spare capacity. If you have too many clients, give me a shout and we'll deal".

In effect, it is just a more sophisticated routing algorithm: brokers become subcontractors for each other. There are other things to like about this design, even before we play with real code:

  • It treats the common case (clients and workers on the same cluster) as default and does extra work for the exceptional case (shuffling jobs between clusters).
  • It lets us use different message flows for the different types of work. That means we can handle them differently, e.g., using different types of network connection.
  • It feels like it would scale smoothly. Interconnecting three or more brokers doesn't get overly complex. If we find this to be a problem, it's easy to solve by adding a super-broker.

We'll now make a worked example. We'll pack an entire cluster into one process. That is obviously not realistic, but it makes it simple to simulate, and the simulation can accurately scale to real processes. This is the beauty of ZeroMQ: you can design at the micro-level and scale that up to the macro-level. Threads become processes, and then become boxes, and the patterns and logic remain the same. Each of our "cluster" processes contains client threads, worker threads, and a broker thread.

We know the basic model well by now:

  • The client (REQ) threads create workloads and pass them to the broker (ROUTER).
  • The worker (REQ) threads process workloads and return the results to the broker (ROUTER).
  • The broker queues and distributes workloads using the load balancing pattern.

Federation Versus Peering
topprevnext

There are several possible ways to interconnect brokers. What we want is to be able to tell other brokers, "we have capacity", and then receive multiple tasks. We also need to be able to tell other brokers, "stop, we're full". It doesn't need to be perfect; sometimes we may accept jobs we can't process immediately, then we'll do them as soon as possible.

The simplest interconnect is federation, in which brokers simulate clients and workers for each other. We would do this by connecting our frontend to the other broker's backend socket. Note that it is legal to both bind a socket to an endpoint and connect it to other endpoints.

Figure 43 - Cross-connected Brokers in Federation Model

fig43.png

This would give us simple logic in both brokers and a reasonably good mechanism: when there are no workers, tell the other broker "ready", and accept one job from it. The problem is that it is too simple for this problem. A federated broker would be able to handle only one task at a time. If the broker emulates a lock-step client and worker, it is by definition also going to be lock-step, and if it has lots of available workers they won't be used. Our brokers need to be connected in a fully asynchronous fashion.

The federation model is perfect for other kinds of routing, especially service-oriented architectures (SOAs), which route by service name and proximity rather than by load balancing or round robin. So don't dismiss it as useless; it's just not right for all use cases.

Instead of federation, let's look at a peering approach in which brokers are explicitly aware of each other and talk over privileged channels. Let's break this down, assuming we want to interconnect N brokers. Each broker has (N - 1) peers, and all brokers are using exactly the same code and logic. There are two distinct flows of information between brokers:

  • Each broker needs to tell its peers how many workers it has available at any time. This can be fairly simple information: just a quantity that is updated regularly. The obvious (and correct) socket pattern for this is pub-sub. So every broker opens a PUB socket and publishes state information on that, and every broker also opens a SUB socket and connects that to the PUB socket of every other broker to get state information from its peers.
  • Each broker needs a way to delegate tasks to a peer and get replies back, asynchronously. We'll do this using ROUTER sockets; no other combination works. Each broker has two such sockets: one for tasks it receives and one for tasks it delegates. If we didn't use two sockets, it would be more work to know whether we were reading a request or a reply each time. That would mean adding more information to the message envelope.

And there is also the flow of information between a broker and its local clients and workers.

The Naming Ceremony
topprevnext

Three flows x two sockets for each flow = six sockets that we have to manage in the broker. Choosing good names is vital to keeping a multisocket juggling act reasonably coherent in our minds. Sockets do something and what they do should form the basis for their names. It's about being able to read the code several weeks later on a cold Monday morning before coffee, and not feel any pain.

Let's do a shamanistic naming ceremony for the sockets. The three flows are:

  • A local request-reply flow between the broker and its clients and workers.
  • A cloud request-reply flow between the broker and its peer brokers.
  • A state flow between the broker and its peer brokers.

Finding meaningful names that are all the same length means our code will align nicely. It's not a big thing, but attention to details helps. For each flow the broker has two sockets that we can orthogonally call the frontend and backend. We've used these names quite often. A frontend receives information or tasks. A backend sends those out to other peers. The conceptual flow is from front to back (with replies going in the opposite direction from back to front).

So in all the code we write for this tutorial, we will use these socket names:

  • localfe and localbe for the local flow.
  • cloudfe and cloudbe for the cloud flow.
  • statefe and statebe for the state flow.

For our transport, and because we're simulating the whole thing on one box, we'll use ipc for everything. This has the advantage of working like tcp in terms of connectivity (i.e., it's a disconnected transport, unlike inproc), yet we don't need IP addresses or DNS names, which would be a pain here. Instead, we will use ipc endpoints called something-local, something-cloud, and something-state, where something is the name of our simulated cluster.

You might be thinking that this is a lot of work for some names. Why not call them s1, s2, s3, s4, etc.? The answer is that if your brain is not a perfect machine, you need a lot of help when reading code, and we'll see that these names do help. It's easier to remember "three flows, two directions" than "six different sockets".

Figure 44 - Broker Socket Arrangement

fig44.png

Note that we connect the cloudbe in each broker to the cloudfe in every other broker, and likewise we connect the statebe in each broker to the statefe in every other broker.

Prototyping the State Flow
topprevnext

Because each socket flow has its own little traps for the unwary, we will test them in real code one-by-one, rather than try to throw the whole lot into code in one go. When we're happy with each flow, we can put them together into a full program. We'll start with the state flow.

Figure 45 - The State Flow

fig45.png

Here is how this works in code:


C# | Clojure | Delphi | F# | Go | Haskell | Haxe | Java | Lua | Node.js | PHP | Python | Racket | Ruby | Scala | Tcl | Ada | Basic | C++ | CL | Erlang | Felix | Objective-C | ooc | Perl | Q

Notes about this code:

  • Each broker has an identity that we use to construct ipc endpoint names. A real broker would need to work with TCP and a more sophisticated configuration scheme. We'll look at such schemes later in this book, but for now, using generated ipc names lets us ignore the problem of where to get TCP/IP addresses or names.
  • We use a zmq_poll() loop as the core of the program. This processes incoming messages and sends out state messages. We send a state message only if we did not get any incoming messages and we waited for a second. If we sent out a state message each time we got one in, we'd get message storms. (The core of this loop is sketched after this list.)
  • We use a two-part pub-sub message consisting of sender address and data. Note that we will need to know the address of the publisher in order to send it tasks, and the only way is to send this explicitly as a part of the message.
  • We don't set identities on subscribers because if we did, then we'd get outdated state information when connecting to running brokers.
  • We don't set a HWM on the publisher, but if we were using ZeroMQ v2.x that would be a wise idea.
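For readers who want the shape of that loop without opening the full example, here is a hedged sketch of the state flow: bind our own state backend (PUB), subscribe to every peer's state endpoint (SUB), and publish our capacity only when a one-second poll comes back empty. The .ipc suffix and the made-up capacity figure are illustrative; the real peering1 example is the reference.

#include "czmq.h"

int main (int argc, char *argv [])
{
    //  argv [1] is our cluster name, argv [2]... are our peers
    if (argc < 2) {
        printf ("usage: peering_state me {you}...\n");
        return 0;
    }
    char *self = argv [1];
    zctx_t *ctx = zctx_new ();

    //  Bind state backend (PUB) to our own ipc endpoint
    void *statebe = zsocket_new (ctx, ZMQ_PUB);
    zsocket_bind (statebe, "ipc://%s-state.ipc", self);

    //  Connect state frontend (SUB) to every peer's endpoint
    void *statefe = zsocket_new (ctx, ZMQ_SUB);
    zsocket_set_subscribe (statefe, "");
    int argn;
    for (argn = 2; argn < argc; argn++)
        zsocket_connect (statefe, "ipc://%s-state.ipc", argv [argn]);

    while (true) {
        //  Poll for incoming state; if nothing for a second, send ours
        zmq_pollitem_t items [] = { { statefe, 0, ZMQ_POLLIN, 0 } };
        if (zmq_poll (items, 1, 1000 * ZMQ_POLL_MSEC) == -1)
            break;              //  Interrupted

        if (items [0].revents & ZMQ_POLLIN) {
            //  Two-part message: peer name, then its available capacity
            char *peer_name = zstr_recv (statefe);
            char *available = zstr_recv (statefe);
            printf ("%s - %s workers free\n", peer_name, available);
            free (peer_name);
            free (available);
        }
        else {
            //  Nothing came in: publish our own (here, random) capacity
            zstr_sendm (statebe, self);
            zstr_sendf (statebe, "%d", randof (10));
        }
    }
    zctx_destroy (&ctx);
    return 0;
}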

We can build this little program and run it three times to simulate three clusters. Let's call them DC1, DC2, and DC3 (the names are arbitrary). We run these three commands, each in a separate window:

peering1 DC1 DC2 DC3  #  Start DC1 and connect to DC2 and DC3
peering1 DC2 DC1 DC3  #  Start DC2 and connect to DC1 and DC3
peering1 DC3 DC1 DC2  #  Start DC3 and connect to DC1 and DC2

You'll see each cluster report the state of its peers, and after a few seconds they will all happily be printing random numbers once per second. Try this and satisfy yourself that the three brokers all match up and synchronize to per-second state updates.

In real life, we'd not send out state messages at regular intervals, but rather whenever we had a state change, i.e., whenever a worker becomes available or unavailable. That may seem like a lot of traffic, but state messages are small and we've established that the inter-cluster connections are super fast.

If we wanted to send state messages at precise intervals, we'd create a child thread and open the statebe socket in that thread. We'd then send irregular state updates to that child thread from our main thread and allow the child thread to conflate them into regular outgoing messages. This is more work than we need here.

Prototyping the Local and Cloud Flows
topprevnext

Let's now prototype the flow of tasks via the local and cloud sockets. This code pulls requests from clients and then distributes them to local workers and cloud peers on a random basis.

Figure 46 - The Flow of Tasks

fig46.png

Before we jump into the code, which is getting a little complex, let's sketch the core routing logic and break it down into a simple yet robust design.

We need two queues, one for requests from local clients and one for requests from cloud clients. One option would be to pull messages off the local and cloud frontends, and pump these onto their respective queues. But this is kind of pointless because ZeroMQ sockets are queues already. So let's use the ZeroMQ socket buffers as queues.

This was the technique we used in the load balancing broker, and it worked nicely. We only read from the two frontends when there is somewhere to send the requests. We can always read from the backends, as they give us replies to route back. As long as the backends aren't talking to us, there's no point in even looking at the frontends.

So our main loop becomes:

  • Poll the backends for activity. When we get a message, it may be "ready" from a worker or it may be a reply. If it's a reply, route it back via the local or cloud frontend.
  • If a worker replied, it became available, so we queue it and count it.
  • While there are workers available, take a request, if any, from either frontend and route it to a local worker or, randomly, to a cloud peer.

Randomly sending tasks to a peer broker rather than a worker simulates work distribution across the cluster. It's dumb, but that is fine for this stage.

We use broker identities to route messages between brokers. Each broker has a name that we provide on the command line in this simple prototype. As long as these names don't overlap with the ZeroMQ-generated UUIDs used for client nodes, we can figure out whether to route a reply back to a client or to a broker.
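A hedged sketch of that routing test, assuming (as described above) that the peer broker names arrive as command-line arguments and that localbe, cloudfe, and localfe are the sockets named earlier; the real peering2 example also requeues the worker, which is only hinted at here.

//  A reply has arrived on the local backend
zmsg_t *msg = zmsg_recv (localbe);
zframe_t *worker = zmsg_unwrap (msg);   //  First frame: the worker's identity
zframe_destroy (&worker);               //  (the real code requeues the worker here)

//  Next frame tells us where the reply goes: a peer broker, or a local client
zframe_t *address = zmsg_first (msg);
int is_peer = 0;
int argn;
for (argn = 2; argn < argc; argn++)
    if (zframe_streq (address, argv [argn]))
        is_peer = 1;

if (is_peer)
    zmsg_send (&msg, cloudfe);          //  Route on to the peer broker
else
    zmsg_send (&msg, localfe);          //  Route to our own client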

Here is how this works in code. The interesting part starts around the comment "Interesting part".


C# | Delphi | F# | Go | Haskell | Haxe | Java | Lua | PHP | Python | Ruby | Scala | Tcl | Ada | Basic | C++ | Clojure | CL | Erlang | Felix | Node.js | Objective-C | ooc | Perl | Q | Racket

Run this by, for instance, starting two instances of the broker in two windows:

peering2 me you
peering2 you me

Some comments on this code:

  • In the C code at least, using the zmsg class makes life much easier, and our code much shorter. It's obviously an abstraction that works. If you build ZeroMQ applications in C, you should use CZMQ.
  • Because we're not getting any state information from peers, we naively assume they are running. The code prompts you to confirm when you've started all the brokers. In the real case, we'd not send anything to brokers who had not told us they exist.

You can satisfy yourself that the code works by watching it run forever. If there were any misrouted messages, clients would end up blocking, and the brokers would stop printing trace information. You can prove that by killing either of the brokers. The other broker tries to send requests to the cloud, and one-by-one its clients block, waiting for an answer.

Putting it All Together
topprevnext

Let's put this together into a single package. As before, we'll run an entire cluster as one process. We're going to take the two previous examples and merge them into one properly working design that lets you simulate any number of clusters.

This code is the size of both previous prototypes together, at 270 LoC. That's pretty good for a simulation of a cluster that includes clients and workers and cloud workload distribution. Here is the code:


Delphi | F# | Go | Haskell | Haxe | Java | Lua | PHP | Python | Ruby | Tcl | Ada | Basic | C++ | C# | Clojure | CL | Erlang | Felix | Node.js | Objective-C | ooc | Perl | Q | Racket | Scala

It's a nontrivial program and took about a day to get working. These are the highlights:

  • The client threads detect and report a failed request. They do this by polling for a response and, if none arrives after a while (10 seconds), printing an error message.
  • Client threads don't print directly, but instead send a message to a monitor socket (PUSH) that the main loop collects (PULL) and prints off. This is the first case we've seen of using ZeroMQ sockets for monitoring and logging; this is a big use case that we'll come back to later. (The pattern is sketched after this list.)
  • Clients simulate varying loads to get the cluster 100% busy at random moments, so that tasks are shifted over to the cloud. The number of clients and workers, and the delays in the client and worker threads, control this. Feel free to play with them to see if you can make a more realistic simulation.
  • The main loop uses two pollsets. It could in fact use three: information, backends, and frontends. As in the earlier prototype, there is no point in taking a frontend message if there is no backend capacity.
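Here is a minimal, self-contained sketch of that monitor channel: a child thread PUSHes status lines over inproc, and the main thread PULLs and prints them. The endpoint name "inproc://monitor" and the message texts are illustrative, not taken from the example.

#include "czmq.h"

static void
client_task (void *args, zctx_t *ctx, void *pipe)
{
    void *monitor = zsocket_new (ctx, ZMQ_PUSH);
    zsocket_connect (monitor, "inproc://monitor");
    int count;
    for (count = 0; count < 5; count++) {
        zstr_sendf (monitor, "I: client did some work (%d)", count);
        zclock_sleep (250);
    }
    zstr_send (monitor, "DONE");
}

int main (void)
{
    zctx_t *ctx = zctx_new ();
    void *collector = zsocket_new (ctx, ZMQ_PULL);
    zsocket_bind (collector, "inproc://monitor");   //  Bind before the thread connects

    zthread_fork (ctx, client_task, NULL);
    while (true) {
        char *status = zstr_recv (collector);
        if (!status || streq (status, "DONE")) {
            free (status);
            break;
        }
        puts (status);          //  The main loop owns all printing
        free (status);
    }
    zctx_destroy (&ctx);
    return 0;
}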

These are some of the problems that arose during development of this program:

  • Clients would freeze, due to requests or replies getting lost somewhere. Recall that the ROUTER socket drops messages it can't route. The first tactic here was to modify the client thread to detect and report such problems. Secondly, I put zmsg_dump() calls after every receive and before every send in the main loop, until the origin of the problems was clear.
  • The main loop was mistakenly reading from more than one ready socket. This caused the first message to be lost. I fixed that by reading only from the first ready socket.
  • The zmsg class was not properly encoding UUIDs as C strings. This caused UUIDs that contain 0 bytes to be corrupted. I fixed that by modifying zmsg to encode UUIDs as printable hex strings.

This simulation does not detect the disappearance of a cloud peer. If you start several peers and stop one, and it was broadcasting capacity to the others, they will continue to send it work even though it's gone. You can try this, and you will get clients that complain of lost requests. The solution is twofold: first, only keep capacity information for a short time so that if a peer does disappear, its capacity is quickly set to zero. Second, add reliability to the request-reply chain. We'll look at reliability in the next chapter.


Chapter 4 - Reliable Request-Reply Patterns

topprevnext

Chapter 3 - Advanced Request-Reply Patterns covered advanced uses of ZeroMQ's request-reply pattern with working examples. This chapter looks at the general question of reliability and builds a set of reliable messaging patterns on top of ZeroMQ's core request-reply pattern.

In this chapter, we focus heavily on user-space request-reply patterns, reusable models that help you design your own ZeroMQ architectures:

  • The Lazy Pirate pattern: reliable request-reply from the client side
  • The Simple Pirate pattern: reliable request-reply using load balancing
  • The Paranoid Pirate pattern: reliable request-reply with heartbeating
  • The Majordomo pattern: service-oriented reliable queuing
  • The Titanic pattern: disk-based/disconnected reliable queuing
  • The Binary Star pattern: primary-backup server failover
  • The Freelance pattern: brokerless reliable request-reply

What is "Reliability"?

topprevnext

Most people who speak of "reliability" don't really know what they mean. We can only define reliability in terms of failure. That is, if we can handle a certain set of well-defined and understood failures, then we are reliable with respect to those failures. No more, no less. So let's look at the possible causes of failure in a distributed ZeroMQ application, in roughly descending order of probability:

  • Application code is the worst offender. It can crash and exit, freeze and stop responding to input, run too slowly for its input, exhaust all memory, and so on.
  • System code, such as brokers we write using ZeroMQ, can die for the same reasons as application code. System code should be more reliable than application code, but it can still crash and burn, and especially run out of memory if it tries to queue messages for slow clients.
  • Message queues can overflow, typically in system code that has learned to deal brutally with slow clients. When a queue overflows, it starts to discard messages. So we get "lost" messages.
  • Networks can fail (e.g., WiFi gets switched off or goes out of range). ZeroMQ will automatically reconnect in such cases, but in the meantime, messages may get lost.
  • Hardware can fail and take with it all the processes running on that box.
  • Networks can fail in exotic ways, e.g., some ports on a switch may die and those parts of the network become inaccessible.
  • Entire data centers can be struck by lightning, earthquakes, fire, or more mundane power or cooling failures.

To make a software system fully reliable against all of these possible failures is an enormously difficult and expensive job and goes beyond the scope of this book.

Because the first five cases in the above list cover 99.9% of real world requirements outside large companies (according to a highly scientific study I just ran, which also told me that 78% of statistics are made up on the spot, and moreover never to trust a statistic that we didn't falsify ourselves), that's what we'll examine. If you're a large company with money to spend on the last two cases, contact my company immediately! There's a large hole behind my beach house waiting to be converted into an executive swimming pool.

Designing Reliability

topprevnext

So to make things brutally simple, reliability is "keeping things working properly when code freezes or crashes", a situation we'll shorten to "dies". However, the things we want to keep working properly are more complex than just messages. We need to take each core ZeroMQ messaging pattern and see how to make it work (if we can) even when code dies.

Let's take them one-by-one:

  • Request-reply: if the server dies (while processing a request), the client can figure that out because it won't get an answer back. Then it can give up in a huff, wait and try again later, find another server, and so on. As for the client dying, we can brush that off as "someone else's problem" for now.
  • Pub-sub: if the client dies (having gotten some data), the server doesn't know about it. Pub-sub doesn't send any information back from client to server. But the client can contact the server out-of-band, e.g., via request-reply, and ask, "please resend everything I missed". As for the server dying, that's out of scope for here. Subscribers can also self-verify that they're not running too slowly, and take action (e.g., warn the operator and die) if they are.
  • Pipeline: if a worker dies (while working), the ventilator doesn't know about it. Pipelines, like the grinding gears of time, only work in one direction. But the downstream collector can detect that one task didn't get done, and send a message back to the ventilator saying, "hey, resend task 324!" If the ventilator or collector dies, whatever upstream client originally sent the work batch can get tired of waiting and resend the whole lot. It's not elegant, but system code should really not die often enough to matter.

In this chapter we'll focus just on request-reply, which is the low-hanging fruit of reliable messaging.

The basic request-reply pattern (a REQ client socket doing a blocking send/receive to a REP server socket) scores low on handling the most common types of failure. If the server crashes while processing the request, the client just hangs forever. If the network loses the request or the reply, the client hangs forever.

Request-reply is still much better than TCP, thanks to ZeroMQ's ability to reconnect peers silently, to load balance messages, and so on. But it's still not good enough for real work. The only case where you can really trust the basic request-reply pattern is between two threads in the same process where there's no network or separate server process to die.

However, with a little extra work, this humble pattern becomes a good basis for real work across a distributed network, and we get a set of reliable request-reply (RRR) patterns that I like to call the Pirate patterns (you'll eventually get the joke, I hope).

There are, in my experience, roughly three ways to connect clients to servers. Each needs a specific approach to reliability:

  • Multiple clients talking directly to a single server. Use case: a single well-known server to which clients need to talk. Types of failure we aim to handle: server crashes and restarts, and network disconnects.
  • Multiple clients talking to a broker proxy that distributes work to multiple workers. Use case: service-oriented transaction processing. Types of failure we aim to handle: worker crashes and restarts, worker busy looping, worker overload, queue crashes and restarts, and network disconnects.
  • Multiple clients talking to multiple servers with no intermediary proxies. Use case: distributed services such as name resolution. Types of failure we aim to handle: service crashes and restarts, service busy looping, service overload, and network disconnects.

Each of these approaches has its trade-offs and often you'll mix them. We'll look at all three in detail.

Client-Side Reliability (Lazy Pirate Pattern)

topprevnext

We can get very simple reliable request-reply with some changes to the client. We call this the Lazy Pirate pattern. Rather than doing a blocking receive, we:

  • Poll the REQ socket and receive from it only when it's sure a reply has arrived.
  • Resend a request, if no reply has arrived within a timeout period.
  • Abandon the transaction if there is still no reply after several requests. (The retry loop is sketched below; the full client follows.)
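As a hedged sketch of that retry loop (the endpoint, timeout, retry count, and request body are illustrative; the full Lazy Pirate client, linked below, adds sequencing and error reporting, and the socket reset is explained next):

#define REQUEST_TIMEOUT  2500   //  msecs
#define REQUEST_RETRIES  3      //  Before we abandon

zctx_t *ctx = zctx_new ();
void *client = zsocket_new (ctx, ZMQ_REQ);
zsocket_connect (client, "tcp://localhost:5555");

int retries_left = REQUEST_RETRIES;
zstr_send (client, "Hello");
while (retries_left) {
    zmq_pollitem_t items [] = { { client, 0, ZMQ_POLLIN, 0 } };
    if (zmq_poll (items, 1, REQUEST_TIMEOUT * ZMQ_POLL_MSEC) == -1)
        break;                          //  Interrupted

    if (items [0].revents & ZMQ_POLLIN) {
        char *reply = zstr_recv (client);
        printf ("I: server replied (%s)\n", reply);
        free (reply);
        break;                          //  Success
    }
    if (--retries_left == 0) {
        printf ("E: server seems to be offline, abandoning\n");
        break;
    }
    printf ("W: no response from server, retrying...\n");
    //  The old socket is in a confused state; close it and open a fresh one
    zsocket_destroy (ctx, client);
    client = zsocket_new (ctx, ZMQ_REQ);
    zsocket_connect (client, "tcp://localhost:5555");
    zstr_send (client, "Hello");        //  Resend the request
}
zctx_destroy (&ctx);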

If you try to use a REQ socket in anything other than a strict send/receive fashion, you'll get an error (technically, the REQ socket implements a small finite-state machine to enforce the send/receive ping-pong, and so the error code is called "EFSM"). This is slightly annoying when we want to use REQ in a pirate pattern, because we may send several requests before getting a reply.

The pretty good brute force solution is to close and reopen the REQ socket after an error:


C++ | C# | Clojure | Delphi | Go | Haskell | Haxe | Java | Lua | Perl | PHP | Python | Ruby | Tcl | Ada | Basic | CL | Erlang | F# | Felix | Node.js | Objective-C | ooc | Q | Racket | Scala

Run this together with the matching server:


C++ | C# | Clojure | Delphi | Go | Haskell | Haxe | Java | Lua | Perl | PHP | Python | Ruby | Scala | Tcl | Ada | Basic | CL | Erlang | F# | Felix | Node.js | Objective-C | ooc | Q | Racket

Figure 47 - The Lazy Pirate Pattern

fig47.png

To run this test case, start the client and the server in two console windows. The server will randomly misbehave after a few messages. You can check the client's response. Here is typical output from the server:

I: normal request (1)
I: normal request (2)
I: normal request (3)
I: simulating CPU overload
I: normal request (4)
I: simulating a crash

And here is the client's response:

I: connecting to server...
I: server replied OK (1)
I: server replied OK (2)
I: server replied OK (3)
W: no response from server, retrying...
I: connecting to server...
W: no response from server, retrying...
I: connecting to server...
E: server seems to be offline, abandoning

The client sequences each message and checks that replies come back exactly in order: that no requests or replies are lost, and no replies come back more than once or out of order. Run the test a few times until you're convinced that this mechanism actually works. You don't need sequence numbers in a production application; they just help us trust our design.
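The sequencing itself is just a counter echoed through the request body; a minimal sketch (client is the REQ socket from the example, and the message format is an illustrative choice):

int sequence = 0;
zstr_sendf (client, "%d", ++sequence);      //  Request body is just the number
char *reply = zstr_recv (client);
if (reply && atoi (reply) == sequence)
    printf ("I: server replied OK (%s)\n", reply);
else
    printf ("E: malformed reply from server: %s\n", reply? reply: "(none)");
free (reply);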

The client uses a REQ socket, and does the brute force close/reopen because REQ sockets impose that strict send/receive cycle. You might be tempted to use a DEALER instead, but it would not be a good decision. First, it would mean emulating the secret sauce that REQ does with envelopes (if you've forgotten what that is, it's a good sign you don't want to have to do it). Second, it would mean potentially getting back replies that you didn't expect.

Handling failures only at the client works when we have a set of clients talking to a single server. It can handle a server crash, but only if recovery means restarting that same server. If there's a permanent error, such as a dead power supply on the server hardware, this approach won't work. Because the application code in servers is usually the biggest source of failures in any architecture, depending on a single server is not a great idea.

So, pros and cons:

  • Pro: simple to understand and implement.
  • Pro: works easily with existing client and server application code.
  • Pro: ZeroMQ automatically retries the actual reconnection until it works.
  • Con: doesn't failover to backup or alternate servers.

Basic Reliable Queuing (Simple Pirate Pattern)

topprevnext

Our second approach extends the Lazy Pirate pattern with a queue proxy that lets us talk, transparently, to multiple servers, which we can more accurately call "workers". We'll develop this in stages, starting with a minimal working model, the Simple Pirate pattern.

In all these Pirate patterns, workers are stateless. If the application requires some shared state, such as a shared database, we don't know about it as we design our messaging framework. Having a queue proxy means workers can come and go without clients knowing anything about it. If one worker dies, another takes over. This is a nice, simple topology with only one real weakness, namely the central queue itself, which can become a problem to manage, and a single point of failure.

Figure 48 - The Simple Pirate Pattern

fig48.png

The basis for the queue proxy is the load balancing broker from Chapter 3 - Advanced Request-Reply Patterns. What is the very minimum we need to do to handle dead or blocked workers? Turns out, it's surprisingly little. We already have a retry mechanism in the client. So using the load balancing pattern will work pretty well. This fits with ZeroMQ's philosophy that we can extend a peer-to-peer pattern like request-reply by plugging naive proxies in the middle.

We don't need a special client; we're still using the Lazy Pirate client. Here is the queue, which is identical to the main task of the load balancing broker:


C++ | C# | Clojure | Delphi | Go | Haskell | Haxe | Java | Lua | PHP | Python | Tcl | Ada | Basic | CL | Erlang | F# | Felix | Node.js | Objective-C | ooc | Perl | Q | Racket | Ruby | Scala

Here is the worker, which takes the Lazy Pirate server and adapts it for the load balancing pattern (using the REQ "ready" signaling):


C++ | C# | Clojure | Delphi | Go | Haskell | Haxe | Java | Lua | PHP | Python | Tcl | Ada | Basic | CL | Erlang | F# | Felix | Node.js | Objective-C | ooc | Perl | Q | Racket | Ruby | Scala

To test this, start a handful of workers, a Lazy Pirate client, and the queue, in any order. You'll see that the workers eventually all crash and burn, and the client retries and then gives up. The queue never stops, and you can restart workers and clients ad nauseam. This model works with any number of clients and workers.

Robust Reliable Queuing (Paranoid Pirate Pattern)

topprevnext

Figure 49 - The Paranoid Pirate Pattern

fig49.png

The Simple Pirate Queue pattern works pretty well, especially because it's just a combination of two existing patterns. Still, it does have some weaknesses:

  • It's not robust in the face of a queue crash and restart. The client will recover, but the workers won't. While ZeroMQ will reconnect workers' sockets automatically, as far as the newly started queue is concerned, the workers haven't signaled ready, so they don't exist. To fix this, we have to do heartbeating from queue to worker so that the worker can detect when the queue has gone away.
  • The queue does not detect worker failure, so if a worker dies while idle, the queue can't remove it from its worker queue until the queue sends it a request. The client waits and retries for nothing. It's not a critical problem, but it's not nice. To make this work properly, we do heartbeating from worker to queue, so that the queue can detect a lost worker at any stage.

We'll fix these in a properly pedantic Paranoid Pirate Pattern.

We previously used a REQ socket for the worker. For the Paranoid Pirate worker, we'll switch to a DEALER socket. This has the advantage of letting us send and receive messages at any time, rather than the lock-step send/receive that REQ imposes. The downside of DEALER is that we have to do our own envelope management (re-read Chapter 3 - Advanced Request-Reply Patterns for background on this concept).

We're still using the Lazy Pirate client. Here is the Paranoid Pirate queue proxy:

// Paranoid Pirate queue

#include "czmq.h"
#define HEARTBEAT_LIVENESS 3 // 3-5 is reasonable
#define HEARTBEAT_INTERVAL 1000 // msecs

// Paranoid Pirate Protocol constants
#define PPP_READY "\001" // Signals worker is ready
#define PPP_HEARTBEAT "\002" // Signals worker heartbeat

// Here we define the worker class: a structure and a set of functions that
// act as constructor, destructor, and methods on worker objects:

typedef struct {
zframe_t *identity; // Identity of worker
char *id_string; // Printable identity
int64_t expiry; // Expires at this time
} worker_t;

// Construct new worker
static worker_t *
s_worker_new (zframe_t *identity)
{
worker_t *self = (worker_t *) zmalloc (sizeof (worker_t));
self->identity = identity;
self->id_string = zframe_strhex (identity);
self->expiry = zclock_time ()
+ HEARTBEAT_INTERVAL * HEARTBEAT_LIVENESS;
return self;
}

// Destroy specified worker object, including identity frame.
static void
s_worker_destroy (worker_t **self_p)
{
assert (self_p);
if (*self_p) {
worker_t *self = *self_p;
zframe_destroy (&self->identity);
free (self->id_string);
free (self);
*self_p = NULL;
}
}

// The ready method puts a worker to the end of the ready list:

static void
s_worker_ready (worker_t *self, zlist_t *workers)
{
worker_t *worker = (worker_t *) zlist_first (workers);
while (worker) {
if (streq (self->id_string, worker->id_string)) {
zlist_remove (workers, worker);
s_worker_destroy (&worker);
break;
}
worker = (worker_t *) zlist_next (workers);
}
zlist_append (workers, self);
}

// The next method returns the next available worker identity:

static zframe_t *
s_workers_next (zlist_t *workers)
{
worker_t *worker = zlist_pop (workers);
assert (worker);
zframe_t *frame = worker->identity;
worker->identity = NULL;
s_worker_destroy (&worker);
return frame;
}

// The purge method looks for and kills expired workers. We hold workers
// from oldest to most recent, so we stop at the first alive worker:

static void
s_workers_purge (zlist_t *workers)
{
worker_t *worker = (worker_t *) zlist_first (workers);
while (worker) {
if (zclock_time () < worker->expiry)
break; // Worker is alive, we're done here

zlist_remove (workers, worker);
s_worker_destroy (&worker);
worker = (worker_t *) zlist_first (workers);
}
}

// The main task is a load-balancer with heartbeating on workers so we
// can detect crashed or blocked worker tasks:

int main (void)
{
zctx_t *ctx = zctx_new ();
void *frontend = zsocket_new (ctx, ZMQ_ROUTER);
void *backend = zsocket_new (ctx, ZMQ_ROUTER);
zsocket_bind (frontend, "tcp://*:5555"); // For clients
zsocket_bind (backend, "tcp://*:5556"); // For workers

// List of available workers
zlist_t *workers = zlist_new ();

// Send out heartbeats at regular intervals
uint64_t heartbeat_at = zclock_time () + HEARTBEAT_INTERVAL;

while (true) {
zmq_pollitem_t items [] = {
{ backend, 0, ZMQ_POLLIN, 0 },
{ frontend, 0, ZMQ_POLLIN, 0 }
};
// Poll frontend only if we have available workers
int rc = zmq_poll (items, zlist_size (workers)? 2: 1,
HEARTBEAT_INTERVAL * ZMQ_POLL_MSEC);
if (rc == -1)
break; // Interrupted

// Handle worker activity on backend
if (items [0].revents & ZMQ_POLLIN) {
// Use worker identity for load-balancing
zmsg_t *msg = zmsg_recv (backend);
if (!msg)
break; // Interrupted

// Any sign of life from worker means it's ready
zframe_t *identity = zmsg_unwrap (msg);
worker_t *worker = s_worker_new (identity);
s_worker_ready (worker, workers);

// Validate control message, or return reply to client
if (zmsg_size (msg) == 1) {
zframe_t *frame = zmsg_first (msg);
if (memcmp (zframe_data (frame), PPP_READY, 1)
&& memcmp (zframe_data (frame), PPP_HEARTBEAT, 1)) {
printf ("E: invalid message from worker");
zmsg_dump (msg);
}
zmsg_destroy (&msg);
}
else
zmsg_send (&msg, frontend);
}
if (items [1].revents & ZMQ_POLLIN) {
// Now get next client request, route to next worker
zmsg_t *msg = zmsg_recv (frontend);
if (!msg)
break; // Interrupted
zframe_t *identity = s_workers_next (workers);
zmsg_prepend (msg, &identity);
zmsg_send (&msg, backend);
}
// We handle heartbeating after any socket activity. First, we send
// heartbeats to any idle workers if it's time. Then, we purge any
// dead workers:
if (zclock_time () >= heartbeat_at) {
worker_t *worker = (worker_t *) zlist_first (workers);
while (worker) {
zframe_send (&worker->identity, backend,
ZFRAME_REUSE + ZFRAME_MORE);
zframe_t *frame = zframe_new (PPP_HEARTBEAT, 1);
zframe_send (&frame, backend, 0);
worker = (worker_t *) zlist_next (workers);
}
heartbeat_at = zclock_time () + HEARTBEAT_INTERVAL;
}
s_workers_purge (workers);
}
// When we're done, clean up properly
while (zlist_size (workers)) {
worker_t *worker = (worker_t *) zlist_pop (workers);
s_worker_destroy (&worker);
}
zlist_destroy (&workers);
zctx_destroy (&ctx);
return 0;
}


C++ | C# | Go | Haskell | Haxe | Java | Lua | PHP | Python | Tcl | Ada | Basic | Clojure | CL | Delphi | Erlang | F# | Felix | Node.js | Objective-C | ooc | Perl | Q | Racket | Ruby | Scala

The queue extends the load balancing pattern with heartbeating of workers. Heartbeating is one of those "simple" things that can be difficult to get right. I'll explain more about that in a second.

Here is the Paranoid Pirate worker:


C++ | C# | Go | Haskell | Haxe | Java | Lua | PHP | Python | Tcl | Ada | Basic | Clojure | CL | Delphi | Erlang | F# | Felix | Node.js | Objective-C | ooc | Perl | Q | Racket | Ruby | Scala

Some comments about this example:

  • The code includes simulation of failures, as before. This makes it (a) very hard to debug, and (b) dangerous to reuse. When you want to debug this, disable the failure simulation.
  • The worker uses a reconnect strategy similar to the one we designed for the Lazy Pirate client, with two major differences: (a) it does an exponential back-off (sketched below), and (b) it retries indefinitely (whereas the client retries a few times before reporting a failure).
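The back-off logic in (a) amounts to a few lines. Here is a hedged sketch, reusing the PPP constants and the worker endpoint from the queue above, with illustrative interval constants; liveness, interval, ctx, and worker are assumed to be declared earlier in the worker, and the full worker example also resets its identity when it reconnects.

#define INTERVAL_INIT   1000    //  Initial reconnect delay, msecs
#define INTERVAL_MAX   32000    //  After exponential back-off

//  ... inside the worker's poll loop, when the queue has gone silent:
if (--liveness == 0) {
    printf ("W: heartbeat failure, can't reach queue\n");
    printf ("W: reconnecting in %d msec...\n", interval);
    zclock_sleep (interval);
    if (interval < INTERVAL_MAX)
        interval *= 2;          //  Back off, up to the maximum

    zsocket_destroy (ctx, worker);
    worker = zsocket_new (ctx, ZMQ_DEALER);
    zsocket_connect (worker, "tcp://localhost:5556");
    zframe_t *frame = zframe_new (PPP_READY, 1);
    zframe_send (&frame, worker, 0);    //  Tell the queue we're ready again
    liveness = HEARTBEAT_LIVENESS;
}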

Try the client, queue, and workers, such as by using a script like this:

ppqueue &
for i in 1 2 3 4; do
    ppworker &
    sleep 1
done
lpclient &

You should see the workers die one-by-one as they simulate a crash, and the client eventually give up. You can stop and restart the queue and both client and workers will reconnect and carry on. And no matter what you do to queues and workers, the client will never get an out-of-order reply: the whole chain either works, or the client abandons.

Heartbeating

topprevnext

Heartbeating solves the problem of knowing whether a peer is alive or dead. This is not an issue specific to ZeroMQ. TCP has a long timeout (30 minutes or so), which means that it can be impossible to know whether a peer has died, been disconnected, or gone on a weekend to Prague with a case of vodka, a redhead, and a large expense account.

It's not easy to get heartbeating right. When writing the Paranoid Pirate examples, it took about five hours to get the heartbeating working properly. The rest of the request-reply chain took perhaps ten minutes. It is especially easy to create "false failures", i.e., when peers decide that they are disconnected because the heartbeats aren't sent properly.

We'll look at the three main answers people use for heartbeating with ZeroMQ.

Shrugging It Off
topprevnext

The most common approach is to do no heartbeating at all and hope for the best. Many if not most ZeroMQ applications do this. ZeroMQ encourages this by hiding peers in many cases. What problems does this approach cause?

  • When we use a ROUTER socket in an application that tracks peers, as peers disconnect and reconnect, the application will leak memory (resources that the application holds for each peer) and get slower and slower.
  • When we use SUB- or DEALER-based data recipients, we can't tell the difference between good silence (there's no data) and bad silence (the other end died). When a recipient knows the other side died, it can for example switch over to a backup route.
  • If we use a TCP connection that stays silent for a long while, it will, in some networks, just die. Sending something (technically, a "keep-alive" more than a heartbeat) will keep the network alive.

One-Way Heartbeats

A second option is to send a heartbeat message from each node to its peers every second or so. When one node hears nothing from another within some timeout (several seconds, typically), it will treat that peer as dead. Sounds good, right? Sadly, no. This works in some cases but has nasty edge cases in others.

For pub-sub, this does work, and it's the only model you can use. SUB sockets cannot talk back to PUB sockets, but PUB sockets can happily send "I'm alive" messages to their subscribers.

As an optimization, you can send heartbeats only when there is no real data to send. Furthermore, you can send heartbeats progressively slower and slower, if network activity is an issue (e.g., on mobile networks where activity drains the battery). As long as the recipient can detect a failure (sharp stop in activity), that's fine.

Here are the typical problems with this design:

  • It can be inaccurate when we send large amounts of data, as heartbeats will be delayed behind that data. If heartbeats are delayed, you can get false timeouts and disconnections due to network congestion. Thus, always treat any incoming data as a heartbeat, whether or not the sender optimizes out heartbeats.
  • While the pub-sub pattern will drop messages for disappeared recipients, PUSH and DEALER sockets will queue them. So if you send heartbeats to a dead peer and it comes back, it will get all the heartbeats you sent, which can be thousands. Whoa, whoa!
  • This design assumes that heartbeat timeouts are the same across the whole network. But that won't be accurate. Some peers will want very aggressive heartbeating in order to detect faults rapidly. And some will want very relaxed heartbeating, in order to let sleeping networks lie and save power.

Ping-Pong Heartbeats

The third option is to use a ping-pong dialog. One peer sends a ping command to the other, which replies with a pong command. Neither command has any payload. Pings and pongs are not correlated. Because the roles of "client" and "server" are arbitrary in some networks, we usually specify that either peer can in fact send a ping and expect a pong in response. However, because the timeouts depend on network topologies known best to dynamic clients, it is usually the client that pings the server.

This works for all ROUTER-based brokers. The same optimizations we used in the second model make this work even better: treat any incoming data as a pong, and only send a ping when not otherwise sending data.
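As a rough illustration of that logic (this is a sketch, not code from the Guide's examples), here is what a client-side ping-pong loop might look like in CZMQ. The PING command string, the interval values, and the already-connected socket passed in are all assumptions made for the sketch:

#include "czmq.h"

#define PING_INTERVAL  1000         //  msecs between pings when idle
#define SERVER_TTL     3000         //  msecs of silence before peer is "dead"

//  Run a ping-pong heartbeat loop on an already-connected socket.
//  Returns when the peer has been silent for SERVER_TTL msecs.
static void
s_ping_pong_loop (void *peer)
{
    uint64_t ping_at = zclock_time () + PING_INTERVAL;
    uint64_t expires = zclock_time () + SERVER_TTL;

    while (true) {
        zmq_pollitem_t items [] = { { peer, 0, ZMQ_POLLIN, 0 } };
        if (zmq_poll (items, 1, PING_INTERVAL * ZMQ_POLL_MSEC) == -1)
            break;                  //  Interrupted

        if (items [0].revents & ZMQ_POLLIN) {
            //  Treat any incoming traffic as a pong
            zmsg_t *msg = zmsg_recv (peer);
            zmsg_destroy (&msg);
            expires = zclock_time () + SERVER_TTL;
        }
        if (zclock_time () >= expires)
            break;                  //  Peer looks dead; caller reconnects
        if (zclock_time () >= ping_at) {
            //  Only ping when we have nothing else to send
            zstr_send (peer, "PING");
            ping_at = zclock_time () + PING_INTERVAL;
        }
    }
}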

Heartbeating for Paranoid Pirate

For Paranoid Pirate, we chose the second approach. It might not have been the simplest option: if designing this today, I'd probably try a ping-pong approach instead. However, the principles are similar. The heartbeat messages flow asynchronously in both directions, and either peer can decide the other is "dead" and stop talking to it.

In the worker, this is how we handle heartbeats from the queue:

  • We calculate a liveness, which is how many heartbeats we can still miss before deciding the queue is dead. It starts at three and we decrement it each time we miss a heartbeat.
  • We wait, in the zmq_poll loop, for one second each time, which is our heartbeat interval.
  • If there's any message from the queue during that time, we reset our liveness to three.
  • If there's no message during that time, we count down our liveness.
  • If the liveness reaches zero, we consider the queue dead.
  • If the queue is dead, we destroy our socket, create a new one, and reconnect.
  • To avoid opening and closing too many sockets, we wait for a certain interval before reconnecting, and we double the interval each time until it reaches 32 seconds.

And this is how we handle heartbeats to the queue:

  • We calculate when to send the next heartbeat; this is a single variable because we're talking to one peer, the queue.
  • In the zmq_poll loop, whenever we pass this time, we send a heartbeat to the queue.

Here's the essential heartbeating code for the worker:

#define HEARTBEAT_LIVENESS  3       //  3-5 is reasonable
#define HEARTBEAT_INTERVAL  1000    //  msecs
#define INTERVAL_INIT       1000    //  Initial reconnect
#define INTERVAL_MAX       32000    //  After exponential backoff


//  If liveness hits zero, queue is considered disconnected
size_t liveness = HEARTBEAT_LIVENESS;
size_t interval = INTERVAL_INIT;

//  Send out heartbeats at regular intervals
uint64_t heartbeat_at = zclock_time () + HEARTBEAT_INTERVAL;

while (true) {
    zmq_pollitem_t items [] = { { worker, 0, ZMQ_POLLIN, 0 } };
    int rc = zmq_poll (items, 1, HEARTBEAT_INTERVAL * ZMQ_POLL_MSEC);

    if (items [0].revents & ZMQ_POLLIN) {
        //  Receive any message from queue
        liveness = HEARTBEAT_LIVENESS;
        interval = INTERVAL_INIT;
    }
    else
    if (--liveness == 0) {
        zclock_sleep (interval);
        if (interval < INTERVAL_MAX)
            interval *= 2;
        zsocket_destroy (ctx, worker);

        liveness = HEARTBEAT_LIVENESS;
    }
    //  Send heartbeat to queue if it's time
    if (zclock_time () > heartbeat_at) {
        heartbeat_at = zclock_time () + HEARTBEAT_INTERVAL;
        //  Send heartbeat message to queue
    }
}

The queue does the same, but manages an expiration time for each worker.
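As a sketch of what that per-worker bookkeeping can look like (the real ppqueue defines its own, fuller worker_t and helpers; the struct and function names here are illustrative), assuming the list of workers is kept ordered oldest-expiry-first:

//  Sketch of per-worker expiry tracking, for illustration only
typedef struct {
    zframe_t *identity;             //  Worker address frame
    int64_t expiry;                 //  Worker expires at this time
} worker_t;

#define HEARTBEAT_LIVENESS  3       //  Missable heartbeats before "dead"
#define HEARTBEAT_INTERVAL  1000    //  msecs

//  Reset a worker's expiry whenever we hear anything from it
static void
s_worker_refresh (worker_t *worker)
{
    worker->expiry = zclock_time ()
                   + HEARTBEAT_INTERVAL * HEARTBEAT_LIVENESS;
}

//  Drop workers that have been silent for too long, assuming the list
//  is ordered oldest-first (true if we always append freshly seen workers)
static void
s_workers_purge (zlist_t *workers)
{
    worker_t *worker = (worker_t *) zlist_first (workers);
    while (worker) {
        if (zclock_time () < worker->expiry)
            break;                  //  This worker is alive, so are the rest
        zlist_remove (workers, worker);
        zframe_destroy (&worker->identity);
        free (worker);
        worker = (worker_t *) zlist_first (workers);
    }
}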

Here are some tips for your own heartbeating implementation:

  • Use zmq_poll or a reactor as the core of your application's main task.
  • Start by building the heartbeating between peers, test it by simulating failures, and then build the rest of the message flow. Adding heartbeating afterwards is much trickier.
  • Use simple tracing, i.e., print to console, to get this working. To help you trace the flow of messages between peers, use a dump method such as zmsg offers, and number your messages incrementally so you can see if there are gaps.
  • In a real application, heartbeating must be configurable and usually negotiated with the peer. Some peers will want aggressive heartbeating, as low as 10 msecs. Other peers will be far away and want heartbeating as high as 30 seconds.
  • If you have different heartbeat intervals for different peers, your poll timeout should be the lowest (shortest time) of these. Do not use an infinite timeout.
  • Do heartbeating on the same socket you use for messages, so your heartbeats also act as a keep-alive to stop the network connection from going stale (some firewalls can be unkind to silent connections).

Contracts and Protocols


If you're paying attention, you'll realize that Paranoid Pirate is not interoperable with Simple Pirate, because of the heartbeats. But how do we define "interoperable"? To guarantee interoperability, we need a kind of contract, an agreement that lets different teams in different times and places write code that is guaranteed to work together. We call this a "protocol".

It's fun to experiment without specifications, but that's not a sensible basis for real applications. What happens if we want to write a worker in another language? Do we have to read code to see how things work? What if we want to change the protocol for some reason? Even a simple protocol will, if it's successful, evolve and become more complex.

Lack of contracts is a sure sign of a disposable application. So let's write a contract for this protocol. How do we do that?

There's a wiki at rfc.zeromq.org that we made especially as a home for public ZeroMQ contracts. To create a new specification, register on the wiki if needed, and follow the instructions. It's fairly straightforward, though writing technical texts is not everyone's cup of tea.

It took me about fifteen minutes to draft the new Pirate Pattern Protocol. It's not a big specification, but it does capture enough to act as the basis for arguments ("your queue isn't PPP compatible; please fix it!").

Turning PPP into a real protocol would take more work:

  • There should be a protocol version number in the READY command so that it's possible to distinguish between different versions of PPP.
  • Right now, READY and HEARTBEAT are not entirely distinct from requests and replies. To make them distinct, we would need a message structure that includes a "message type" part (see the sketch after this list).
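For example, a versioned command with an explicit message-type frame might be sent like this. This is only a sketch of the idea, not part of the current PPP specification: the "PPPv2" version string and the one-byte command codes are invented for illustration, and 'worker' stands for the worker's DEALER socket:

//  Hypothetical versioned READY, sketched for illustration only
//  Frame 0: protocol name and version, e.g. "PPPv2"
//  Frame 1: message type, e.g. 0x01 = READY, 0x02 = HEARTBEAT
//  Frames 2+: service name or payload frames, as needed
zmsg_t *ready = zmsg_new ();
zmsg_addstr (ready, "PPPv2");       //  Version travels with every command
zmsg_addmem (ready, "\001", 1);     //  Distinct message-type frame
zmsg_send (&ready, worker);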

Service-Oriented Reliable Queuing (Majordomo Pattern)


Figure 50 - The Majordomo Pattern

fig50.png

The nice thing about progress is how fast it happens when lawyers and committees aren't involved. The one-page MDP specification turns PPP into something more solid. This is how we should design complex architectures: start by writing down the contracts, and only then write software to implement them.

The Majordomo Protocol (MDP) extends and improves on PPP in one interesting way: it adds a "service name" to requests that the client sends, and asks workers to register for specific services. Adding service names turns our Paranoid Pirate queue into a service-oriented broker. The nice thing about MDP is that it came out of working code, a simpler ancestor protocol (PPP), and a precise set of improvements that each solved a clear problem. This made it easy to draft.

To implement Majordomo, we need to write a framework for clients and workers. It's really not sane to ask every application developer to read the spec and make it work, when they could be using a simpler API that does the work for them.

So while our first contract (MDP itself) defines how the pieces of our distributed architecture talk to each other, our second contract defines how user applications talk to the technical framework we're going to design.

Majordomo has two halves, a client side and a worker side. Because we'll write both client and worker applications, we will need two APIs. Here is a sketch for the client API, using a simple object-oriented approach:

//  Majordomo Protocol client example
//  Uses the mdcli API to hide all MDP aspects

//  Lets us build this source without creating a library
#include "mdcliapi.c"

int main (int argc, char *argv [])
{
    int verbose = (argc > 1 && streq (argv [1], "-v"));
    mdcli_t *session = mdcli_new ("tcp://localhost:5555", verbose);

    int count;
    for (count = 0; count < 100000; count++) {
        zmsg_t *request = zmsg_new ();
        zmsg_pushstr (request, "Hello world");
        zmsg_t *reply = mdcli_send (session, "echo", &request);
        if (reply)
            zmsg_destroy (&reply);
        else
            break;              //  Interrupt or failure
    }
    printf ("%d requests/replies processed\n", count);
    mdcli_destroy (&session);
    return 0;
}

That's it. We open a session to the broker, send a request message, get a reply message back, and eventually close the connection. Here's a sketch for the worker API:

//  Majordomo Protocol worker example
//  Uses the mdwrk API to hide all MDP aspects

//  Lets us build this source without creating a library
#include "mdwrkapi.c"

int main (int argc, char *argv [])
{
    int verbose = (argc > 1 && streq (argv [1], "-v"));
    mdwrk_t *session = mdwrk_new (
        "tcp://localhost:5555", "echo", verbose);

    zmsg_t *reply = NULL;
    while (true) {
        zmsg_t *request = mdwrk_recv (session, &reply);
        if (request == NULL)
            break;              //  Worker was interrupted
        reply = request;        //  Echo is complex… :-)
    }
    mdwrk_destroy (&session);
    return 0;
}

It's more or less symmetrical, but the worker dialog is a little different. The first time a worker does a recv(), it passes a null reply. Thereafter, it passes the current reply, and gets a new request.

The client and worker APIs were fairly simple to construct because they're heavily based on the Paranoid Pirate code we already developed. Here is the client API:



Let's see how the client API looks in action, with an example test program that does 100K request-reply cycles:



And here is the worker API:



Let's see how the worker API looks in action, with an example test program that implements an echo service:



Here are some things to note about the worker API code:

  • The APIs are single-threaded. This means, for example, that the worker won't send heartbeats in the background. Happily, this is exactly what we want: if the worker application gets stuck, heartbeats will stop and the broker will stop sending requests to the worker.
  • The worker API doesn't do an exponential back-off; it's not worth the extra complexity.
  • The APIs don't do any error reporting. If something isn't as expected, they raise an assertion (or exception depending on the language). This is ideal for a reference implementation, so any protocol errors show immediately. For real applications, the API should be robust against invalid messages.

You might wonder why the worker API is manually closing its socket and opening a new one, when ZeroMQ will automatically reconnect a socket if the peer disappears and comes back. Look back at the Simple Pirate and Paranoid Pirate workers to understand. Although ZeroMQ will automatically reconnect workers if the broker dies and comes back up, this isn't sufficient to re-register the workers with the broker. I know of at least two solutions. The simplest, which we use here, is for the worker to monitor the connection using heartbeats, and if it decides the broker is dead, to close its socket and start afresh with a new socket. The alternative is for the broker to challenge unknown workers when it gets a heartbeat from the worker and ask them to re-register. That would require protocol support.

Now let's design the Majordomo broker. Its core structure is a set of queues, one per service. We will create these queues as workers appear (we could delete them as workers disappear, but forget that for now because it gets complex). Additionally, we keep a queue of workers per service.
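A sketch of those core data structures, assuming CZMQ containers; the actual mdbroker code has its own, fuller definitions, so take the member names here as illustrative:

//  Sketch of the broker's core state, for illustration only
typedef struct {
    zctx_t *ctx;                //  Our context
    void *socket;               //  One ROUTER socket for clients and workers
    zhash_t *services;          //  Hash of service_t, keyed by service name
    zhash_t *workers;           //  Hash of worker_t, keyed by worker identity
    zlist_t *waiting;           //  All idle workers, for heartbeating
} broker_t;

typedef struct {
    char *name;                 //  Service name
    zlist_t *requests;          //  Client requests queued for this service
    zlist_t *waiting;           //  Idle workers registered for this service
} service_t;

typedef struct {
    char *identity;             //  Worker identity, printable
    service_t *service;         //  Service the worker is registered for
    int64_t expiry;             //  Expires at this time unless a heartbeat arrives
} worker_t;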

And here is the broker:



This is by far the most complex example we've seen. It's almost 500 lines of code. To write this and make it somewhat robust took two days. However, this is still a short piece of code for a full service-oriented broker.

Here are some things to note about the broker code:

  • The Majordomo Protocol lets us handle both clients and workers on a single socket. This is nicer for those deploying and managing the broker: it just sits on one ZeroMQ endpoint rather than the two that most proxies need.
  • The broker implements all of MDP/0.1 properly (as far as I know), including disconnection if the broker sends invalid commands, heartbeating, and the rest.
  • It can be extended to run multiple threads, each managing one socket and one set of clients and workers. This could be interesting for segmenting large architectures. The C code is already organized around a broker class to make this trivial.
  • A primary/failover or live/live broker reliability model is easy, as the broker essentially has no state except service presence. It's up to clients and workers to choose another broker if their first choice isn't up and running.
  • The examples use five-second heartbeats, mainly to reduce the amount of output when you enable tracing. Realistic values would be lower for most LAN applications. However, any retry has to be slow enough to allow for a service to restart, say 10 seconds at least.

We later improved and extended the protocol and the Majordomo implementation, which now sits in its own GitHub project. If you want a properly usable Majordomo stack, use the GitHub project.

Asynchronous Majordomo Pattern


The Majordomo implementation in the previous section is simple and stupid. The client is just the original Simple Pirate, wrapped up in a sexy API. When I fire up a client, broker, and worker on a test box, it can process 100,000 requests in about 14 seconds. That is partially due to the code, which cheerfully copies message frames around as if CPU cycles were free. But the real problem is that we're doing network round-trips. ZeroMQ disables Nagle's algorithm, but round-tripping is still slow.

Theory is great in theory, but in practice, practice is better. Let's measure the actual cost of round-tripping with a simple test program. This sends a bunch of messages, first waiting for a reply to each message, and second as a batch, reading all the replies back as a batch. Both approaches do the same work, but they give very different results. We mock up a client, broker, and worker:



On my development box, this program says:

Setting up test...
Synchronous round-trip test...
 9057 calls/second
Asynchronous round-trip test...
 173010 calls/second

Note that the client thread does a small pause before starting. This is to get around one of the "features" of the router socket: if you send a message with the address of a peer that's not yet connected, the message gets discarded. In this example we don't use the load balancing mechanism, so without the sleep, if the worker thread is too slow to connect, it will lose messages, making a mess of our test.

As we see, round-tripping in the simplest case is 20 times slower than the asynchronous, "shove it down the pipe as fast as it'll go" approach. Let's see if we can apply this to Majordomo to make it faster.

First, we modify the client API to send and receive in two separate methods:

mdcli_t *mdcli_new (char *broker);
void mdcli_destroy (mdcli_t **self_p);
int mdcli_send (mdcli_t *self, char *service, zmsg_t **request_p);
zmsg_t *mdcli_recv (mdcli_t *self);

It's literally a few minutes' work to refactor the synchronous client API to become asynchronous:



The differences are:

  • We use a DEALER socket instead of REQ, so we emulate REQ with an empty delimiter frame before each request and each response.
  • We don't retry requests; if the application needs to retry, it can do this itself.
  • We break the synchronous send method into separate send and recv methods.
  • The send method is asynchronous and returns immediately after sending. The caller can thus send a number of messages before getting a response (see the sketch after this list).
  • The recv method waits for (with a timeout) one response and returns that to the caller.
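Putting those points together, the send side boils down to something like this sketch. It is not the exact mdcliapi2 code: the real API also prepends the MDP client protocol header (omitted here), and the self->client member name for the DEALER socket is an assumption:

//  Sketch of an asynchronous send over a DEALER socket: we emulate the
//  REQ envelope with an empty delimiter frame and return immediately.
int
mdcli_send (mdcli_t *self, char *service, zmsg_t **request_p)
{
    assert (self);
    assert (request_p);
    zmsg_t *request = *request_p;

    //  Frame 0: empty delimiter (REQ emulation)
    //  Frame 1: service name
    //  Frames 2+: the application request
    zmsg_pushstr (request, service);
    zmsg_pushstr (request, "");
    zmsg_send (&request, self->client);     //  Does not wait for a reply
    *request_p = NULL;
    return 0;
}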

And here's the corresponding client test program, which sends 100,000 messages and then receives 100,000 back:



The broker and worker are unchanged because we've not modified the protocol at all. We see an immediate improvement in performance. Here's the synchronous client chugging through 100K request-reply cycles:

$ time mdclient
100000 requests/replies processed

real    0m14.088s
user    0m1.310s
sys     0m2.670s

And here's the asynchronous(异步的) client, with a single worker:

$ time mdclient2
100000 replies received

real    0m8.730s
user    0m0.920s
sys     0m1.550s

Twice as fast. Not bad, but let's fire up 10 workers and see how it handles the traffic:

$ time mdclient2
100000 replies received

real    0m3.863s
user    0m0.730s
sys     0m0.470s

It isn't fully asynchronous because workers get their messages on a strict last-used basis. But it will scale better with more workers. On my PC, after eight or so workers, it doesn't get any faster. Four cores only stretch so far. But we got a 4x improvement in throughput with just a few minutes' work. The broker is still unoptimized. It spends most of its time copying message frames around, instead of doing zero-copy, which it could. But we're getting 25K reliable request/reply calls a second, with pretty low effort.

However, the asynchronous Majordomo pattern isn't all roses. It has a fundamental weakness, namely that it cannot survive a broker crash without more work. If you look at the mdcliapi2 code you'll see it does not attempt to reconnect after a failure. A proper reconnect would require the following:

  • A number on every request and a matching number on every reply, which would ideally require a change to the protocol to enforce.
  • Tracking and holding onto all outstanding requests in the client API, i.e., those for which no reply has yet been received (see the sketch after this list).
  • In case of failover, for the client API to resend all outstanding requests to the broker.
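None of this exists in mdcliapi2 today; as a sketch of what the client-side tracking could look like, assuming the request number travels as an extra message frame and outstanding requests sit in a hash keyed by that number:

//  Hypothetical bookkeeping for resending after a failover
static int request_nbr = 0;             //  Stamped onto every request
static zhash_t *outstanding = NULL;     //  Key: request number, value: zmsg_t

static void
s_track_request (zmsg_t *request)
{
    if (!outstanding)
        outstanding = zhash_new ();
    char key [16];
    sprintf (key, "%d", ++request_nbr);
    zmsg_pushstr (request, key);        //  Number frame on the request
    zhash_insert (outstanding, key, zmsg_dup (request));
}

static void
s_reply_received (zmsg_t *reply)
{
    char *key = zmsg_popstr (reply);    //  Matching number on the reply
    zmsg_t *stored = (zmsg_t *) zhash_lookup (outstanding, key);
    if (stored)
        zmsg_destroy (&stored);         //  Request is no longer pending
    zhash_delete (outstanding, key);
    free (key);
}

//  On failover, the client API would walk 'outstanding' and resend every
//  stored request to the new broker.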

It's not a deal breaker, but it does show that performance often means complexity. Is this worth doing for Majordomo? It depends on your use case. For a name lookup service you call once per session, no. For a web frontend serving thousands of clients, probably yes.

Service Discovery


So, we have a nice service-oriented broker, but we have no way of knowing whether a particular service is available or not. We know whether a request failed, but we don't know why. It is useful to be able to ask the broker, "is the echo service running?" The most obvious way would be to modify our MDP/Client protocol to add commands to ask this. But MDP/Client has the great charm of being simple. Adding service discovery to it would make it as complex as the MDP/Worker protocol.

Another option is to do what email does, and ask that undeliverable requests be returned. This can work well in an asynchronous world, but it also adds complexity. We need ways to distinguish returned requests from replies and to handle these properly.

Let's try to use what we've already built, building on top of MDP instead of modifying it. Service discovery is, itself, a service. It might indeed be one of several management services, such as "disable service X", "provide statistics", and so on. What we want is a general, extensible solution that doesn't affect the protocol or existing applications.

So here's a small RFC that layers this on top of MDP: the Majordomo Management Interface (MMI). We already implemented it in the broker, though unless you read the whole thing you probably missed that. I'll explain how it works in the broker:

  • When a client requests a service that starts with mmi., instead of routing this to a worker, we handle it internally.
  • We handle just one service in this broker, which is mmi.service, the service discovery service.
  • The payload for the request is the name of an external service (a real one, provided by a worker).
  • The broker returns "200" (OK) or "404" (Not found), depending on whether there are workers registered for that service or not (see the sketch after this list).
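Inside the broker, the whole thing comes down to a few lines. This is a simplified sketch of the idea rather than the exact mdbroker code; it assumes a zhash of service_t objects keyed by name and a 'waiting' list of idle workers per service, as in the broker sketch earlier:

//  Simplified sketch of mmi.service handling inside the broker
if (streq (service_name, "mmi.service")) {
    //  The payload frame holds the name of the service being looked up
    char *lookup = zframe_strdup (zmsg_last (request));
    service_t *service =
        (service_t *) zhash_lookup (self->services, lookup);
    char *return_code =
        (service && zlist_size (service->waiting))? "200": "404";
    free (lookup);

    //  Replace the payload with the return code and send the message
    //  straight back to the requesting client (routing code omitted)
    zframe_reset (zmsg_last (request), return_code, strlen (return_code));
}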

Here's how we use the service discovery in an application:

//  MMI echo query example

//  Lets us build this source without creating a library
#include "mdcliapi.c"

int main (int argc, char *argv [])
{
    int verbose = (argc > 1 && streq (argv [1], "-v"));
    mdcli_t *session = mdcli_new ("tcp://localhost:5555", verbose);

    //  This is the service we want to look up
    zmsg_t *request = zmsg_new ();
    zmsg_addstr (request, "echo");

    //  This is the service we send our request to
    zmsg_t *reply = mdcli_send (session, "mmi.service", &request);

    if (reply) {
        char *reply_code = zframe_strdup (zmsg_first (reply));
        printf ("Lookup echo service: %s\n", reply_code);
        free (reply_code);
        zmsg_destroy (&reply);
    }
    else
        printf ("E: no response from broker, make sure it's running\n");

    mdcli_destroy (&session);
    return 0;
}



Try this with and without a worker running, and you should see the little program report "200" or "404" accordingly. The implementation of MMI in our example broker is flimsy. For example, if a worker disappears, services remain "present". In practice, a broker should remove services that have no workers after some configurable timeout.

Idempotent Services


Idempotency is not something you take a pill for. What it means is that it's safe to repeat an operation. Checking the clock is idempotent. Lending one's credit card to one's children is not. While many client-to-server use cases are idempotent, some are not. Examples of idempotent use cases include:

  • Stateless task distribution, i.e., a pipeline where the servers are stateless workers that compute a reply based purely on the state provided by a request. In such a case, it's safe (though inefficient) to execute the same request many times.
  • A name service that translates logical addresses into endpoints to bind or connect to. In such a case, it's safe to make the same lookup request many times.

And here are examples of non-idempotent use cases:

  • A logging service. One does not want the same log information recorded more than once.
  • Any service that has impact on downstream nodes, e.g., sends on information to other nodes. If that service gets the same request more than once, downstream nodes will get duplicate information.
  • Any service that modifies shared data in some non-idempotent way; e.g., a service that debits a bank account is not idempotent without extra work.

When our server applications are not idempotent, we have to think more carefully about when exactly they might crash. If an application dies when it's idle, or while it's processing a request, that's usually fine. We can use database transactions to make sure a debit and a credit are always done together, if at all. If the server dies while sending its reply, that's a problem, because as far as it's concerned, it has done its work.

If the network dies just as the reply is making its way back to the client, the same problem arises. The client will think the server died and will resend the request, and the server will do the same work twice, which is not what we want.

To handle non-idempotent operations, use the fairly standard solution of detecting and rejecting duplicate requests. This means:

  • The client must stamp every request with a unique client identifier and a unique message number.
  • The server, before sending back a reply, stores it using the combination of client ID and message number as a key.
  • The server, when getting a request from a given client, first checks whether it has a reply for that client ID and message number. If so, it does not process the request, but just resends the reply (see the sketch after this list).
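A sketch of that server-side check, assuming the client ID and message number arrive as strings and replies are cached in a zhash; the key format and helper names are invented for illustration:

//  Hypothetical duplicate-request check on the server side
static zhash_t *reply_cache = NULL;     //  Key: "client_id:msg_nbr", value: zmsg_t

//  Return a copy of the cached reply if we've already served this request
static zmsg_t *
s_lookup_duplicate (char *client_id, char *msg_nbr)
{
    if (!reply_cache)
        reply_cache = zhash_new ();
    char key [256];
    snprintf (key, sizeof (key), "%s:%s", client_id, msg_nbr);
    zmsg_t *cached = (zmsg_t *) zhash_lookup (reply_cache, key);
    return cached? zmsg_dup (cached): NULL;
}

//  Store the reply before sending it, so a resent request gets the same answer
static void
s_store_reply (char *client_id, char *msg_nbr, zmsg_t *reply)
{
    if (!reply_cache)
        reply_cache = zhash_new ();
    char key [256];
    snprintf (key, sizeof (key), "%s:%s", client_id, msg_nbr);
    zhash_insert (reply_cache, key, zmsg_dup (reply));
}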

Disconnected Reliability (Titanic Pattern)


Once you realize that Majordomo is a "reliable" message broker, you might be tempted to add some spinning rust (that is, ferrous-based hard disk platters). After all, this works for all the enterprise messaging systems. It's such a tempting idea that it's a little sad to have to be negative toward it. But brutal cynicism is one of my specialties. So, some reasons you don't want rust-based brokers sitting in the center of your architecture are:

  • As you've seen, the Lazy Pirate client performs surprisingly well. It works across a whole range of architectures, from direct client-to-server to distributed queue proxies. It does tend to assume that workers are stateless and idempotent. But we can work around that limitation without resorting to rust.
  • Rust brings a whole set of problems, from slow performance to additional pieces that you have to manage, repair, and handle 6 a.m. panics from, as they inevitably break at the start of daily operations. The beauty of the Pirate patterns in general is their simplicity. They won't crash. And if you're still worried about the hardware, you can move to a peer-to-peer pattern that has no broker at all. I'll explain later in this chapter.

Having said this, however, there is one sane use case for rust-based reliability, which is an asynchronous disconnected network. It solves a major problem with Pirate, namely that a client has to wait for an answer in real time. If clients and workers are only sporadically connected (think of email as an analogy), we can't use a stateless network between clients and workers. We have to put state in the middle.

So, here's the Titanic pattern, in which we write messages to disk to ensure they never get lost, no matter how sporadically clients and workers are connected. As we did for service discovery, we're going to layer Titanic on top of MDP rather than extend it. It's wonderfully lazy because it means we can implement our fire-and-forget reliability in a specialized worker, rather than in the broker. This is excellent for several reasons:

  • It is much easier because we divide and conquer: the broker handles message routing and the worker handles reliability.
  • It lets us mix brokers written in one language with workers written in another.
  • It lets us evolve the fire-and-forget technology independently.

The only downside is that there's an extra network hop between broker and hard disk. The benefits are easily worth it.

There are many ways to make a persistent request-reply architecture. We'll aim for one that is simple and painless. The simplest design I could come up with, after playing with this for a few hours, is a "proxy service". That is, Titanic doesn't affect workers at all. If a client wants a reply immediately, it talks directly to a service and hopes the service is available. If a client is happy to wait a while, it talks to Titanic instead and asks, "hey, buddy, would you take care of this for me while I go buy my groceries?"

Figure 51 - The Titanic Pattern

fig51.png

Titanic is thus both a worker and a client. The dialog between client and Titanic goes along these lines:

  • Client: Please accept this request for me. Titanic: OK, done.
  • Client: Do you have a reply for me? Titanic: Yes, here it is. Or, no, not yet.
  • Client: OK, you can wipe that request now, I'm happy. Titanic: OK, done.

Whereas the dialog between Titanic and broker and worker goes like this:

  • Titanic: Hey, Broker, is there a coffee service? Broker: Uhm, Yeah, seems like.
  • Titanic: Hey, coffee service, please handle this for me.
  • Coffee: Sure, here you are.
  • Titanic: Sweeeeet!

You can work through this and the possible failure scenarios. If a worker crashes while processing a request, Titanic retries indefinitely. If a reply gets lost somewhere, Titanic will retry. If the request gets processed but the client doesn't get the reply, it will ask again. If Titanic crashes while processing a request or a reply, the client will try again. As long as requests are fully committed to safe storage, work can't get lost.

The handshaking is pedantic, but can be pipelined, i.e., clients can use the asynchronous Majordomo pattern to do a lot of work and then get the responses later.

We need some way for a client to request its replies. We'll have many clients asking for the same services, and clients disappear and reappear with different identities. Here is a simple, reasonably secure solution:

  • Every request generates a universally unique ID (UUID), which Titanic returns to the client after it has queued the request.
  • When a client asks for a reply, it must specify the UUID for the original request.

In a realistic case, the client would want to store its request UUIDs safely, e.g., in a local database.

Before we jump off and write yet another formal specification (fun, fun!), let's consider how the client talks to Titanic. One way is to use a single service and send it three different request types. Another way, which seems simpler, is to use three services:

  • titanic.request: store a request message, and return a UUID for the request.
  • titanic.reply: fetch a reply, if available, for a given request UUID.
  • titanic.close: confirm that a reply has been stored and processed.

We'll just make a multithreaded worker, which as we've seen from our multithreading experience with ZeroMQ, is trivial. However, let's first sketch what Titanic would look like in terms of ZeroMQ messages and frames. This gives us the Titanic Service Protocol (TSP).

Using TSP is clearly more work for client applications than accessing a service directly via MDP. Here's the shortest robust "echo" client example:
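Stripped of error handling and retries, the client-side flow looks roughly like this fragment. It is a sketch, not the actual ticlient code; it reuses the synchronous mdcli API from earlier in the chapter and assumes the reply frames are laid out as shown in the comments:

//  Rough sketch of the Titanic round-trip, using the synchronous mdcli API.
//  Error handling and retry loops are omitted for brevity.
mdcli_t *session = mdcli_new ("tcp://localhost:5555", 0);

//  1. Store the request with titanic.request and get back a UUID
zmsg_t *request = zmsg_new ();
zmsg_addstr (request, "echo");                  //  Target service
zmsg_addstr (request, "Hello world");           //  Request payload
zmsg_t *reply = mdcli_send (session, "titanic.request", &request);
char *uuid = zmsg_popstr (reply);               //  Assume first frame is the UUID
zmsg_destroy (&reply);

//  2. Poll titanic.reply with the UUID until the reply is ready
//  (the real protocol returns a status code meaning "pending"; here we
//  simply assume a non-NULL reply means the answer has arrived)
while (true) {
    request = zmsg_new ();
    zmsg_addstr (request, uuid);
    reply = mdcli_send (session, "titanic.reply", &request);
    if (reply)
        break;
    zclock_sleep (5000);                        //  Try again in 5 seconds
}
//  … use the reply here …
zmsg_destroy (&reply);

//  3. Tell Titanic it can wipe the stored request now
request = zmsg_new ();
zmsg_addstr (request, uuid);
reply = mdcli_send (session, "titanic.close", &request);
zmsg_destroy (&reply);
free (uuid);
mdcli_destroy (&session);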



Of course this can be, and should be, wrapped up in some kind of framework or API. It's not healthy to ask average application developers to learn the full details of messaging: it hurts their brains, costs time, and offers too many ways to make buggy complexity. Additionally, it makes it hard to add intelligence.

For example, this client blocks on each request whereas in a real application, we'd want to be doing useful work while tasks are executed. This requires some nontrivial plumbing to build a background thread and talk to that cleanly. It's the kind of thing you want to wrap in a nice simple API that the average developer cannot misuse. It's the same approach that we used for Majordomo.

Here's the Titanic implementation. This server handles the three services using three threads, as proposed. It does full persistence to disk using the most brutal approach possible: one file per message. It's so simple, it's scary. The only complex part is that it keeps a separate queue of all requests, to avoid reading the directory over and over:



To test this, start mdbroker and titanic, and then run ticlient. Now start mdworker arbitrarily, and you should see the client getting a response and exiting happily.

Some notes about this code:

  • Note that some loops start by sending, others by receiving messages. This is because Titanic acts both as a client and a worker in different roles.
  • The Titanic broker uses the MMI service discovery protocol to send requests only to services that appear to be running. Since the MMI implementation in our little Majordomo broker is quite poor, this won't work all the time.
  • We use an inproc connection to send new request data from the titanic.request service through to the main dispatcher. This saves the dispatcher from having to scan the disk directory, load all request files, and sort them by date/time.

The important thing about this example is not performance (which, although I haven't tested it, is surely terrible), but how well it implements the reliability contract. To try it, start the mdbroker and titanic programs. Then start the ticlient, and then start the mdworker echo service. You can run all four of these using the -v option to do verbose activity tracing. You can stop and restart any piece except the client and nothing will get lost.

If you want to use Titanic in real cases, you'll rapidly be asking "how do we make this faster?"

Here's what I'd do, starting with the example implementation:

  • Use a single disk file for all data, rather than multiple files. Operating systems are usually better at handling a few large files than many smaller ones.
  • Organize that disk file as a circular buffer so that new requests can be written contiguously (with very occasional wraparound). One thread, writing full speed to a disk file, can work rapidly.
  • Keep the index in memory and rebuild the index at startup time, from the disk buffer. This saves the extra disk head flutter needed to keep the index fully safe on disk. You would want an fsync after every message, or every N milliseconds if you were prepared to lose the last M messages in case of a system failure.
  • Use a solid-state drive rather than spinning iron oxide platters.
  • Pre-allocate the entire file, or allocate it in large chunks, which allows the circular buffer to grow and shrink as needed. This avoids fragmentation and ensures that most reads and writes are contiguous.

And so on. What I'd not recommend is storing messages in a database, not even a "fast" key/value store, unless you really like a specific database and don't have performance worries. You will pay a steep price for the abstraction, ten to a thousand times over a raw disk file.

If you want to make Titanic even more reliable, duplicate the requests to a second server, which you'd place in a second location just far away enough to survive a nuclear attack on your primary location, yet not so far that you get too much latency.

If you want to make Titanic much faster and less reliable, store requests and replies purely in memory. This will give you the functionality of a disconnected network, but requests won't survive a crash of the Titanic server itself.

High-Availability Pair (Binary Star Pattern)


Figure 52 - High-Availability Pair, Normal Operation

fig52.png

The Binary Star pattern puts two servers in a primary-backup high-availability pair. At any given time, one of these (the active) accepts connections from client applications. The other (the passive) does nothing, but the two servers monitor each other. If the active disappears from the network, after a certain time the passive takes over as active.

We developed the Binary Star pattern at iMatix for our OpenAMQ server. We designed it:

  • To provide a straightforward high-availability solution.
  • To be simple enough to actually understand and use.
  • To fail over reliably when needed, and only when needed.

Assuming we have a Binary Star pair running, here are the different scenarios that will result in a failover:

  • The hardware running the primary server has a fatal problem (power supply explodes, machine catches fire, or someone simply unplugs it by mistake), and disappears. Applications see this, and reconnect to the backup server.
  • The network segment on which the primary server sits crashes, perhaps because a router gets hit by a power spike, and applications start to reconnect to the backup server.
  • The primary server crashes or is killed by the operator and does not restart automatically.

Figure 53 - High-availability Pair During Failover

fig53.png

Recovery from failover works as follows:

  • The operators restart the primary server and fix whatever problems were causing it to disappear from the network.
  • The operators stop the backup server at a moment when it will cause minimal disruption to applications.
  • When applications have reconnected to the primary server, the operators restart the backup server.

Recovery (to using the primary server as active) is a manual operation. Painful experience teaches us that automatic recovery is undesirable. There are several reasons:

  • Failover creates an interruption of service to applications, possibly lasting 10-30 seconds. If there is a real emergency, this is much better than total outage. But if recovery creates a further 10-30 second outage, it is better that this happens off-peak, when users have gone off the network.
  • When there is an emergency, the absolute first priority is certainty for those trying to fix things. Automatic recovery creates uncertainty for system administrators, who can no longer be sure which server is in charge without double-checking.
  • Automatic recovery can create situations where networks fail over and then recover, placing operators in the difficult position of analyzing what happened. There was an interruption of service, but the cause isn't clear.

Having said this, the Binary Star pattern will fail back to the primary server if this is running (again) and the backup server fails. In fact, this is how we provoke recovery.

The shutdown process for a Binary Star pair is to either:

  1. Stop the passive server and then stop the active server at any later time, or
  2. Stop both servers in any order but within a few seconds of each other.

Stopping the active and then the passive server with any delay longer than the failover timeout will cause applications to disconnect, then reconnect, and then disconnect again, which may disturb users.

Detailed Requirements

Binary Star is as simple as it can be, while still working accurately. In fact, the current design is the third complete redesign. Each of the previous designs we found to be too complex, trying to do too much, and we stripped out functionality until we came to a design that was understandable, easy to use, and reliable enough to be worth using.

These are our requirements for a high-availability architecture:

  • The failover is meant to provide insurance against catastrophic system failures, such as hardware breakdown, fire, accident, and so on. There are simpler ways to recover from ordinary server crashes and we already covered these.
  • Failover time should be under 60 seconds and preferably under 10 seconds.
  • Failover has to happen automatically, whereas recovery must happen manually. We want applications to switch over to the backup server automatically, but we do not want them to switch back to the primary server except when the operators have fixed whatever problem there was and decided that it is a good time to interrupt applications again.
  • The semantics for client applications should be simple and easy for developers to understand. Ideally, they should be hidden in the client API.
  • There should be clear instructions for network architects on how to avoid designs that could lead to split brain syndrome, in which both servers in a Binary Star pair think they are the active server.
  • There should be no dependencies on the order in which the two servers are started.
  • It must be possible to make planned stops and restarts of either server without stopping client applications (though they may be forced to reconnect).
  • Operators must be able to monitor both servers at all times.
  • It must be possible to connect the two servers using a high-speed dedicated network connection. That is, failover synchronization must be able to use a specific IP route.

We make the following assumptions:

  • A single backup server provides enough insurance; we don't need multiple levels of backup.
  • The primary and backup servers are equally capable of carrying the application load. We do not attempt to balance load across the servers.
  • There is sufficient budget to cover a fully redundant backup server that does nothing almost all the time.

We don't attempt to cover the following:

  • The use of an active backup server or load balancing. In a Binary Star pair, the backup server is inactive and does no useful work until the primary server goes offline.
  • The handling of persistent messages or transactions in any way. We assume the existence of a network of unreliable (and probably untrusted) servers or Binary Star pairs.
  • Any automatic exploration of the network. The Binary Star pair is manually and explicitly defined in the network and is known to applications (at least in their configuration data).
  • Replication of state or messages between servers. All server-side state must be recreated by applications when they fail over.

Here is the key terminology that we use in Binary Star:

  • Primary: the server that is normally or initially active.
  • Backup: the server that is normally passive. It will become active if and when the primary server disappears from the network, and when client applications ask the backup server to connect.
  • Active: the server that accepts client connections. There is at most one active server.
  • Passive: the server that takes over if the active disappears. Note that when a Binary Star pair is running normally, the primary server is active, and the backup is passive. When a failover has happened, the roles are switched.

To configure a Binary Star pair, you need to:

  1. Tell the primary server where the backup server is located.
  2. Tell the backup server where the primary server is located.
  3. Optionally, tune the failover response times, which must be the same for both servers.

The main tuning concern is how frequently you want the servers to check their peering status, and how quickly you want to activate failover. In our example, the failover timeout value defaults to 2,000 msec. If you reduce this, the backup server will take over as active more rapidly but may take over in cases where the primary server could recover. For example, you may have wrapped the primary server in a shell script that restarts it if it crashes. In that case, the timeout should be higher than the time needed to restart the primary server.

For client applications to work properly with a Binary Star pair, they must:

  1. Know both server addresses.
  2. Try to connect to the primary server, and if that fails, to the backup server.
  3. Detect a failed connection, typically using heartbeating.
  4. Try to reconnect to the primary, and then backup (in that order), with a delay between retries that is at least as high as the server failover timeout.
  5. Recreate all of the state they require on a server.
  6. Retransmit messages lost during a failover, if messages need to be reliable.

It's not trivial work, and we'd usually wrap this in an API that hides it from real end-user applications.

These are the main limitations of the Binary Star pattern:

  • A server process cannot be part of more than one Binary Star pair.
  • A primary server can have a single backup server, and no more.
  • The passive server does no useful work, and is thus wasted.
  • The backup server must be capable of handling full application loads.
  • Failover configuration cannot be modified at runtime.
  • Client applications must do some work to benefit from failover.

Preventing Split-Brain Syndrome

Split-brain syndrome occurs when different parts of a cluster think they are active at the same time. It causes applications to stop seeing each other. Binary Star has an algorithm for detecting and eliminating split brain, which is based on a three-way decision mechanism (a server will not decide to become active until it gets application connection requests and it cannot see its peer server).

However, it is still possible to (mis)design a network to fool this algorithm. A typical scenario would be a Binary Star pair distributed between two buildings, where each building also has a set of applications and where there is a single network link between both buildings. Breaking this link would create two sets of client applications, each with half of the Binary Star pair, and each failover server would become active.

To prevent split-brain situations, we must connect a Binary Star pair using a dedicated network link, which can be as simple as plugging them both into the same switch or, better, using a crossover cable directly between two machines.

We must not split a Binary Star architecture into two islands, each with a set of applications. While this may be a common type of network architecture, you should use federation, not high-availability failover, in such cases.

A suitably paranoid network configuration would use two private cluster interconnects, rather than a single one. Further, the network cards used for the cluster would be different from those used for message traffic, and possibly even on different paths on the server hardware. The goal is to separate possible failures in the network from possible failures in the cluster. Network ports can have a relatively high failure rate.

Binary Star Implementation

Without further ado, here is a proof-of-concept implementation of the Binary Star server. The primary and backup servers run the same code; you choose their roles when you run the code:

// Binary Star server proof-of-concept implementation. This server does no
// real work; it just demonstrates the Binary Star failover model.

#include "czmq.h"

// States we can be in at any point in time
typedef enum {
STATE_PRIMARY = 1, // Primary, waiting for peer to connect
STATE_BACKUP = 2, // Backup, waiting for peer to connect
STATE_ACTIVE = 3, // Active - accepting connections
STATE_PASSIVE = 4 // Passive - not accepting connections
} state_t;

// Events, which start with the states our peer can be in
typedef enum {
PEER_PRIMARY = 1, // HA peer is pending primary
PEER_BACKUP = 2, // HA peer is pending backup
PEER_ACTIVE = 3, // HA peer is active
PEER_PASSIVE = 4, // HA peer is passive
CLIENT_REQUEST = 5 // Client makes request
} event_t;

// Our finite state machine
typedef struct {
state_t state; // Current state
event_t event; // Current event
int64_t peer_expiry; // When peer is considered 'dead'
} bstar_t;

// We send state information this often
// If peer doesn't respond in two heartbeats, it is 'dead'
#define HEARTBEAT 1000
// In msecs

// The heart of the Binary Star design is its finite-state machine (FSM).
// The FSM runs one event at a time. We apply an event to the current state,
// which checks if the event is accepted, and if so, sets a new state:

static bool
s_state_machine (bstar_t *fsm)
{
bool exception = false;

// These are the PRIMARY and BACKUP states; we're waiting to become
// ACTIVE or PASSIVE depending on events we get from our peer:
if (fsm->state == STATE_PRIMARY) {
if (fsm->event == PEER_BACKUP) {
printf ("I: connected to backup (passive), ready active\n");
fsm->state = STATE_ACTIVE;
}
else
if (fsm->event == PEER_ACTIVE) {
printf ("I: connected to backup (active), ready passive\n");
fsm->state = STATE_PASSIVE;
}
// Accept client connections
}
else
if (fsm->state == STATE_BACKUP) {
if (fsm->event == PEER_ACTIVE) {
printf ("I: connected to primary (active), ready passive\n");
fsm->state = STATE_PASSIVE;
}
else
// Reject client connections when acting as backup
if (fsm->event == CLIENT_REQUEST)
exception = true;
}
else
// These are the ACTIVE and PASSIVE states:

if (fsm->state == STATE_ACTIVE) {
if (fsm->event == PEER_ACTIVE) {
// Two actives would mean split-brain
printf ("E: fatal(致命的) error - dual(双的) actives, aborting(流产)\n");
exception = true;
}
}
else
// Server is passive
// CLIENT_REQUEST events can trigger failover if peer looks dead
if (fsm->state == STATE_PASSIVE) {
if (fsm->event == PEER_PRIMARY) {
// Peer is restarting - become active, peer will go passive
printf ("I: primary (passive) is restarting, ready active\n");
fsm->state = STATE_ACTIVE;
}
else
if (fsm->event == PEER_BACKUP) {
// Peer is restarting - become active, peer will go passive
printf ("I: backup (passive) is restarting, ready active\n");
fsm->state = STATE_ACTIVE;
}
else
if (fsm->event == PEER_PASSIVE) {
// Two passives would mean cluster would be non-responsive
printf ("E: fatal error - dual passives, aborting\n");
exception = true;
}
else
if (fsm->event == CLIENT_REQUEST) {
// Peer becomes active if timeout has passed
// It's the client request that triggers the failover
assert (fsm->peer_expiry > 0);
if (zclock_time () >= fsm->peer_expiry) {
// If peer is dead, switch to the active state
printf ("I: failover successful, ready active\n");
fsm->state = STATE_ACTIVE;
}
else
// If peer is alive, reject connections
exception = true;
}
}
return exception;
}

// This is our main task. First we bind/connect our sockets with our
// peer and make sure we will get state messages correctly. We use
// three sockets; one to publish state, one to subscribe to state, and
// one for client requests/replies:

int main (int argc, char *argv [])
{
// Arguments can be either of:
// -p primary server, at tcp://localhost:5001
// -b backup server, at tcp://localhost:5002
zctx_t *ctx = zctx_new ();
void *statepub = zsocket_new (ctx, ZMQ_PUB);
void *statesub = zsocket_new (ctx, ZMQ_SUB);
zsocket_set_subscribe (statesub, "");
void *frontend = zsocket_new (ctx, ZMQ_ROUTER);
bstar_t fsm = { 0 };

if (argc == 2 && streq (argv [1], "-p")) {
printf ("I: Primary active, waiting for backup (passive)\n");
zsocket_bind (frontend, "tcp://*:5001");
zsocket_bind (statepub, "tcp://*:5003");
zsocket_connect (statesub, "tcp://localhost:5004");
fsm.state = STATE_PRIMARY;
}
else
if (argc == 2 && streq (argv [1], "-b")) {
printf ("I: Backup passive, waiting for primary (active)\n");
zsocket_bind (frontend, "tcp://*:5002");
zsocket_bind (statepub, "tcp://*:5004");
zsocket_connect (statesub, "tcp://localhost:5003");
fsm.state = STATE_BACKUP;
}
else {
printf ("Usage(使用): bstarsrv { -p | -b }\n");
zctx_destroy (&ctx);
exit (0);
}
// We now process events on our two input sockets, and process these
// events one at a time via our finite-state machine. Our "work" for
// a client request is simply to echo it back:

// Set timer for next outgoing state message
int64_t send_state_at = zclock_time () + HEARTBEAT;
while (!zctx_interrupted) {
zmq_pollitem_t items [] = {
{ frontend, 0, ZMQ_POLLIN, 0 },
{ statesub, 0, ZMQ_POLLIN, 0 }
};
int time_left = (int) ((send_state_at - zclock_time ()));
if (time_left < 0)
time_left = 0;
int rc = zmq_poll (items, 2, time_left * ZMQ_POLL_MSEC);
if (rc == -1)
break; // Context has been shut down

if (items [0].revents & ZMQ_POLLIN) {
// Have a client request
zmsg_t *msg = zmsg_recv (frontend);
fsm.event = CLIENT_REQUEST;
if (s_state_machine (&fsm) == false)
// Answer client by echoing request back
zmsg_send (&msg, frontend);
else
zmsg_destroy (&msg);
}
if (items [1].revents & ZMQ_POLLIN) {
// Have state from our peer, execute as event
char *message = zstr_recv (statesub);
fsm.event = atoi (message);
free (message);
if (s_state_machine (&fsm))
break; // Error, so exit
fsm.peer_expiry = zclock_time () + 2 * HEARTBEAT;
}
// If we timed out, send state to peer
if (zclock_time () >= send_state_at) {
char message [2];
sprintf (message, "%d", fsm.state);
zstr_send (statepub, message);
send_state_at = zclock_time () + HEARTBEAT;
}
}
if (zctx_interrupted)
printf ("W: interrupted\n");

// Shutdown sockets and context
zctx_destroy (&ctx);
return 0;
}


Haxe | Java | Python | Ruby | Tcl | Ada | Basic | C++ | C# | Clojure | CL | Delphi | Erlang | F# | Felix | Go | Haskell | Lua | Node.js | Objective-C | ooc | Perl | PHP | Q | Racket | Scala

And here is the client:

// Binary Star client proof-of-concept implementation. This client does no
// real work; it just demonstrates the Binary Star failover model.

#include "czmq.h"
#define REQUEST_TIMEOUT 1000 // msecs
#define SETTLE_DELAY 2000 // Before failing over

int main (void)
{
zctx_t *ctx = zctx_new ();

char *server [] = { "tcp://localhost:5001", "tcp://localhost:5002" };
uint server_nbr = 0;

printf ("I: connecting to server at %s…\n", server [server_nbr]);
void *client = zsocket_new (ctx, ZMQ_REQ);
zsocket_connect (client, server [server_nbr]);

int sequence = 0;
while (!zctx_interrupted) {
// We send a request, then we work to get a reply
char request [10];
sprintf (request, "%d", ++sequence);
zstr_send (client, request);

int expect_reply = 1;
while (expect_reply) {
// Poll socket for a reply, with timeout
zmq_pollitem_t items [] = { { client, 0, ZMQ_POLLIN, 0 } };
int rc = zmq_poll (items, 1, REQUEST_TIMEOUT * ZMQ_POLL_MSEC);
if (rc == -1)
break; // Interrupted

// We use a Lazy Pirate strategy in the client. If there's no
// reply within our timeout, we close the socket and try again.
// In Binary Star, it's the client vote that decides which
// server is primary; the client must therefore try to connect
// to each server in turn:

if (items [0].revents & ZMQ_POLLIN) {
// We got a reply from the server, must match sequence
char *reply = zstr_recv (client);
if (atoi (reply) == sequence) {
printf ("I: server replied OK (%s)\n", reply);
expect_reply = 0;
sleep (1); // One request per second
}
else
printf ("E: bad reply from server: %s\n", reply);
free (reply);
}
else {
printf ("W: no response from server, failing over\n");

// Old socket is confused; close it and open a new one
zsocket_destroy (ctx, client);
server_nbr = (server_nbr + 1) % 2;
zclock_sleep (SETTLE_DELAY);
printf ("I: connecting to server at %s…\n",
server [server_nbr]);
client = zsocket_new (ctx, ZMQ_REQ);
zsocket_connect (client, server [server_nbr]);

// Send request again, on new socket
zstr_send (client, request);
}
}
}
zctx_destroy (&ctx);
return 0;
}


Haxe | Java | Python | Ruby | Tcl | Ada | Basic | C++ | C# | Clojure | CL | Delphi | Erlang | F# | Felix | Go | Haskell | Lua | Node.js | Objective-C | ooc | Perl | PHP | Q | Racket | Scala

To test Binary Star, start the servers and client in any order:

bstarsrv -p     # Start primary
bstarsrv -b     # Start backup
bstarcli

You can then provoke failover by killing the primary server, and recovery by restarting the primary and killing the backup. Note how it's the client vote that triggers failover and recovery.

Binary Star is driven by a finite state machine. Events are the peer state, so "Peer Active" means the other server has told us it's active. "Client Request" means we've received a client request. "Client Vote" means we've received a client request AND our peer is inactive for two heartbeats.

Note that the servers use PUB-SUB sockets for state exchange. No other socket combination will work here. PUSH and DEALER block if there is no peer ready to receive a message. PAIR does not reconnect if the peer disappears and comes back. ROUTER needs the address of the peer before it can send it a message.

Figure 54 - Binary Star Finite State Machine

fig54.png

Binary Star Reactor

Binary Star is useful and generic enough to package up as a reusable reactor class. The reactor then runs and calls our code whenever it has a message to process. This is much nicer than copying/pasting the Binary Star code into each server where we want that capability.

In C, we wrap the CZMQ zloop class that we saw before. zloop lets you register handlers to react on socket and timer events. In the Binary Star reactor, we provide handlers for voters and for state changes (active to passive, and vice versa). Here is the bstar class implementation:

// bstar class - Binary Star reactor

#include "bstar.h"

// States we can be in at any point in time
typedef enum {
STATE_PRIMARY = 1, // Primary, waiting for peer to connect
STATE_BACKUP = 2, // Backup, waiting for peer to connect
STATE_ACTIVE = 3, // Active - accepting connections
STATE_PASSIVE = 4 // Passive - not accepting connections
} state_t;

// Events, which start with the states our peer can be in
typedef enum {
PEER_PRIMARY = 1, // HA peer is pending primary
PEER_BACKUP = 2, // HA peer is pending backup
PEER_ACTIVE = 3, // HA peer is active
PEER_PASSIVE = 4, // HA peer is passive
CLIENT_REQUEST = 5 // Client makes request
} event_t;

// Structure of our class

struct _bstar_t {
zctx_t *ctx; // Our private context
zloop_t *loop; // Reactor loop
void *statepub; // State publisher
void *statesub; // State subscriber
state_t state; // Current state
event_t event; // Current event
int64_t peer_expiry; // When peer is considered 'dead'
zloop_fn *voter_fn; // Voting socket handler
void *voter_arg; // Arguments for voting handler
zloop_fn *active_fn; // Call when become active
void *active_arg; // Arguments for handler
zloop_fn *passive_fn; // Call when become passive
void *passive_arg; // Arguments for handler
};

// The finite-state machine is the same as in the proof-of-concept server.
// To understand this reactor in detail, first read the CZMQ zloop class.

// We send state information every this often
// If peer doesn't respond in two heartbeats, it is 'dead'
#define BSTAR_HEARTBEAT 1000 // In msecs

// Binary Star finite state machine (applies event to state)
// Returns -1 if there was an exception, 0 if event was valid.

static int
s_execute_fsm (bstar_t *self)
{
int rc = 0;
// Primary server is waiting for peer to connect
// Accepts CLIENT_REQUEST events in this state
if (self->state == STATE_PRIMARY) {
if (self->event == PEER_BACKUP) {
zclock_log ("I: connected to backup (passive), ready as active");
self->state = STATE_ACTIVE;
if (self->active_fn)
(self->active_fn) (self->loop, NULL, self->active_arg);
}
else
if (self->event == PEER_ACTIVE) {
zclock_log ("I: connected to backup (active), ready as passive");
self->state = STATE_PASSIVE;
if (self->passive_fn)
(self->passive_fn) (self->loop, NULL, self->passive_arg);
}
else
if (self->event == CLIENT_REQUEST) {
// Allow client requests to turn us into the active if we've
// waited sufficiently long to believe the backup is not
// currently acting as active (i.e., after a failover)
assert (self->peer_expiry > 0);
if (zclock_time () >= self->peer_expiry) {
zclock_log ("I: request from client, ready as active");
self->state = STATE_ACTIVE;
if (self->active_fn)
(self->active_fn) (self->loop, NULL, self->active_arg);
} else
// Don't respond to clients yet - it's possible we're
// performing a failback and the backup is currently active
rc = -1;
}
}
else
// Backup server is waiting for peer to connect
// Rejects CLIENT_REQUEST events in this state
if (self->state == STATE_BACKUP) {
if (self->event == PEER_ACTIVE) {
zclock_log ("I: connected to primary (active), ready as passive");
self->state = STATE_PASSIVE;
if (self->passive_fn)
(self->passive_fn) (self->loop, NULL, self->passive_arg);
}
else
if (self->event == CLIENT_REQUEST)
rc = -1;
}
else
// Server is active
// Accepts CLIENT_REQUEST events in this state
// The only way out of ACTIVE is death
if (self->state == STATE_ACTIVE) {
if (self->event == PEER_ACTIVE) {
// Two actives would mean split-brain
zclock_log ("E: fatal(致命的) error - dual(双的) actives, aborting(流产)");
rc = -1;
}
}
else
// Server is passive
// CLIENT_REQUEST events can trigger failover if peer looks dead
if (self->state == STATE_PASSIVE) {
if (self->event == PEER_PRIMARY) {
// Peer is restarting - become active, peer will go passive
zclock_log ("I: primary (passive) is restarting, ready as active");
self->state = STATE_ACTIVE;
}
else
if (self->event == PEER_BACKUP) {
// Peer is restarting - become active, peer will go passive
zclock_log ("I: backup (passive) is restarting, ready as active");
self->state = STATE_ACTIVE;
}
else
if (self->event == PEER_PASSIVE) {
// Two passives would mean cluster would be non-responsive
zclock_log ("E: fatal error - dual passives, aborting");
rc = -1;
}
else
if (self->event == CLIENT_REQUEST) {
// Peer becomes active if timeout has passed
// It's the client request that triggers the failover
assert (self->peer_expiry > 0);
if (zclock_time () >= self->peer_expiry) {
// If peer is dead, switch to the active state
zclock_log ("I: failover successful, ready as active");
self->state = STATE_ACTIVE;
}
else
// If peer is alive, reject connections
rc = -1;
}
// Call state change handler if necessary
if (self->state == STATE_ACTIVE && self->active_fn)
(self->active_fn) (self->loop, NULL, self->active_arg);
}
return rc;
}

static void
s_update_peer_expiry (bstar_t *self)
{
self->peer_expiry = zclock_time () + 2 * BSTAR_HEARTBEAT;
}

// Reactor event handlers…

// Publish our state to peer
int s_send_state (zloop_t *loop, int timer_id, void *arg)
{
bstar_t *self = (bstar_t *) arg;
zstr_sendf (self->statepub, "%d", self->state);
return 0;
}

// Receive state from peer, execute finite state machine
int s_recv_state (zloop_t *loop, zmq_pollitem_t *poller, void *arg)
{
bstar_t *self = (bstar_t *) arg;
char *state = zstr_recv (poller->socket);
if (state) {
self->event = atoi (state);
s_update_peer_expiry (self);
free (state);
}
return s_execute_fsm (self);
}

// Application wants to speak to us, see if it's possible
int s_voter_ready (zloop_t *loop, zmq_pollitem_t *poller, void *arg)
{
bstar_t *self = (bstar_t *) arg;
// If server can accept input now, call appl handler
self->event = CLIENT_REQUEST;
if (s_execute_fsm (self) == 0)
(self->voter_fn) (self->loop, poller, self->voter_arg);
else {
// Destroy waiting message, no-one to read it
zmsg_t *msg = zmsg_recv (poller->socket);
zmsg_destroy (&msg);
}
return 0;
}

// This is the constructor for our bstar class. We have to tell it
// whether we're primary or backup server, as well as our local and
// remote endpoints to bind and connect to:

bstar_t *
bstar_new (int primary, char *local, char *remote)
{
bstar_t
*self;

self = (bstar_t *) zmalloc (sizeof (bstar_t));

// Initialize the Binary Star
self->ctx = zctx_new ();
self->loop = zloop_new ();
self->state = primary? STATE_PRIMARY: STATE_BACKUP;

// Create publisher for state going to peer
self->statepub = zsocket_new (self->ctx, ZMQ_PUB);
zsocket_bind (self->statepub, local);

// Create subscriber for state coming from peer
self->statesub = zsocket_new (self->ctx, ZMQ_SUB);
zsocket_set_subscribe (self->statesub, "");
zsocket_connect (self->statesub, remote);

// Set up basic reactor events
zloop_timer (self->loop, BSTAR_HEARTBEAT, 0, s_send_state, self);
zmq_pollitem_t poller = { self->statesub, 0, ZMQ_POLLIN };
zloop_poller (self->loop, &poller, s_recv_state, self);
return self;
}

// The destructor shuts down the bstar reactor:

void
bstar_destroy (bstar_t **self_p)
{
assert (self_p);
if (*self_p) {
bstar_t *self = *self_p;
zloop_destroy (&self->loop);
zctx_destroy (&self->ctx);
free (self);
*self_p = NULL;
}
}

// This method returns the underlying zloop reactor, so we can add
// additional timers and readers:

zloop_t *
bstar_zloop (bstar_t *self)
{
return self->loop;
}

// This method registers a client voter socket. Messages received
// on this socket provide the CLIENT_REQUEST events for the Binary Star
// FSM and are passed to the provided application handler. We require
// exactly one voter per bstar instance:

int
bstar_voter (bstar_t *self, char *endpoint, int type, zloop_fn handler,
void *arg)
{
// Hold actual handler+arg so we can call this later
void *socket = zsocket_new (self->ctx, type);
zsocket_bind (socket, endpoint);
assert (!self->voter_fn);
self->voter_fn = handler;
self->voter_arg = arg;
zmq_pollitem_t poller = { socket, 0, ZMQ_POLLIN };
return zloop_poller (self->loop, &poller, s_voter_ready, self);
}

// Register handlers to be called each time there's a state change:

void
bstar_new_active (bstar_t *self, zloop_fn handler, void *arg)
{
assert (!self->active_fn);
self->active_fn = handler;
self->active_arg = arg;
}

void
bstar_new_passive (bstar_t *self, zloop_fn handler, void *arg)
{
assert (!self->passive_fn);
self->passive_fn = handler;
self->passive_arg = arg;
}

// Enable/disable verbose tracing, for debugging:

void bstar_set_verbose (bstar_t *self, bool verbose)
{
zloop_set_verbose (self->loop, verbose);
}

// Finally, start the configured reactor. It will end if any handler
// returns -1 to the reactor, or if the process receives SIGINT or SIGTERM:

int
bstar_start (bstar_t *self)
{
assert (self->voter_fn);
s_update_peer_expiry (self);
return zloop_start (self->loop);
}



Haxe | Java | Python | Tcl | Ada | Basic | C++ | C# | Clojure | CL | Delphi | Erlang | F# | Felix | Go | Haskell | Lua | Node.js | Objective-C | ooc | Perl | PHP | Q | Racket | Ruby | Scala

This gives us the following short main program for the server:
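
As a rough illustration of how short that server becomes, here is a minimal sketch built on the bstar API above; the echo handler, argument handling, and port numbers are assumptions that mirror the proof-of-concept server, not the Guide's exact code.

// Binary Star server, reactor version (sketch, not the Guide's exact code)
// Uses the bstar class above; the only application logic left is the
// voter handler, which simply echoes client requests back.

#include "bstar.h"

// Echo client requests back; the reactor calls this for each voter message
static int s_echo (zloop_t *loop, zmq_pollitem_t *poller, void *arg)
{
    zmsg_t *msg = zmsg_recv (poller->socket);
    zmsg_send (&msg, poller->socket);
    return 0;
}

int main (int argc, char *argv [])
{
    // Arguments can be either of:
    //     -p   primary server, at tcp://localhost:5001
    //     -b   backup server, at tcp://localhost:5002
    bstar_t *bstar;
    if (argc == 2 && streq (argv [1], "-p")) {
        printf ("I: Primary active, waiting for backup (passive)\n");
        bstar = bstar_new (1,           // 1 = primary
            "tcp://*:5003", "tcp://localhost:5004");
        bstar_voter (bstar, "tcp://*:5001", ZMQ_ROUTER, s_echo, NULL);
    }
    else
    if (argc == 2 && streq (argv [1], "-b")) {
        printf ("I: Backup passive, waiting for primary (active)\n");
        bstar = bstar_new (0,           // 0 = backup
            "tcp://*:5004", "tcp://localhost:5003");
        bstar_voter (bstar, "tcp://*:5002", ZMQ_ROUTER, s_echo, NULL);
    }
    else {
        printf ("Usage: bstarsrv2 { -p | -b }\n");
        exit (0);
    }
    bstar_start (bstar);
    bstar_destroy (&bstar);
    return 0;
}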


Haxe | Java | Python | Tcl | Ada | Basic | C++ | C# | Clojure | CL | Delphi | Erlang | F# | Felix | Go | Haskell | Lua | Node.js | Objective-C | ooc | Perl | PHP | Q | Racket | Ruby | Scala

Brokerless Reliability (Freelance Pattern)


It might seem ironic to focus so much on broker-based reliability, when we often explain ZeroMQ as "brokerless messaging". However, in messaging, as in real life, the middleman is both a burden and a benefit. In practice, most messaging architectures benefit from a mix of distributed and brokered messaging. You get the best results when you can decide freely what trade-offs you want to make. This is why I can drive twenty minutes to a wholesaler to buy five cases of wine for a party, but I can also walk ten minutes to a corner store to buy one bottle for a dinner. Our highly context-sensitive relative valuations of time, energy, and cost are essential to the real world economy. And they are essential to an optimal message-based architecture.

This is why ZeroMQ does not impose a broker-centric architecture, though it does give you the tools to build brokers, aka proxies, and we've built a dozen or so different ones so far, just for practice.

So we'll end this chapter by deconstructing the broker-based reliability we've built so far, and turning it back into a distributed peer-to-peer architecture I call the Freelance pattern. Our use case will be a name resolution service. This is a common problem with ZeroMQ architectures: how do we know the endpoint to connect to? Hard-coding TCP/IP addresses in code is insanely fragile. Using configuration files creates an administration nightmare. Imagine if you had to hand-configure your web browser, on every PC or mobile phone you used, to realize that "google.com" was "74.125.230.82".

A ZeroMQ name service (and we'll make a simple implementation) must do the following:

  • Resolve a logical name into at least a bind endpoint, and a connect endpoint. A realistic name service would provide multiple bind endpoints, and possibly multiple connect endpoints as well.
  • Allow us to manage multiple parallel environments, e.g., "test" versus "production", without modifying code.
  • Be reliable, because if it is unavailable, applications won't be able to connect to the network.

Putting a name service behind a service-oriented Majordomo broker is clever from some points of view. However, it's simpler and much less surprising to just expose the name service as a server to which clients can connect directly. If we do this right, the name service becomes the only global network endpoint we need to hard-code in our code or configuration files.

Figure 55 - The Freelance Pattern

fig55.png

The types of failure we aim to handle are server crashes and restarts, server busy looping, server overload, and network issues. To get reliability, we'll create a pool of name servers so if one crashes or goes away, clients can connect to another, and so on. In practice, two would be enough. But for the example, we'll assume the pool can be any size.

In this architecture, a large set of clients connect to a small set of servers directly. The servers bind to their respective addresses. It's fundamentally different from a broker-based approach like Majordomo, where workers connect to the broker. Clients have a couple of options:

  • Use REQ sockets and the Lazy Pirate pattern. Easy, but would need some additional intelligence so clients don't stupidly try to reconnect to dead servers over and over.
  • Use DEALER sockets and blast out requests (which will be load balanced to all connected servers) until they get a reply. Effective, but not elegant.
  • Use ROUTER sockets so clients can address specific servers. But how does the client know the identity of the server sockets? Either the server has to ping the client first (complex), or the server has to use a hard-coded, fixed identity known to the client (nasty).

We'll develop each of these in the following subsections.

Model One: Simple Retry and Failover

So our menu appears to offer: simple, brutal, complex, or nasty. Let's start with simple and then work out the kinks. We take Lazy Pirate and rewrite it to work with multiple server endpoints.

Start one or several servers first, specifying a bind endpoint as the argument:

// Freelance server - Model 1
// Trivial echo service

#include "czmq.h"

int main (int argc, char *argv [])
{
if (argc < 2) {
printf ("I: syntax(语法): %s <endpoint(端点)>\n", argv [0]);
return 0;
}
zctx_t *ctx = zctx_new ();
void *server = zsocket_new (ctx, ZMQ_REP);
zsocket_bind (server, argv [1]);

printf ("I: echo service is ready at %s\n", argv [1]);
while (true) {
zmsg_t *msg = zmsg_recv (server);
if (!msg)
break; // Interrupted
zmsg_send (&msg, server);
}
if (zctx_interrupted)
printf ("W: interrupted\n");

zctx_destroy (&ctx);
return 0;
}


C# | Java | Lua | PHP | Python | Tcl | Ada | Basic | C++ | Clojure | CL | Delphi | Erlang | F# | Felix | Go | Haskell | Haxe | Node.js | Objective-C | ooc | Perl | Q | Racket | Ruby | Scala

Then start the client, specifying one or more connect endpoints as arguments:

// Freelance client - Model 1
// Uses REQ socket to query one or more services

#include "czmq.h"
#define REQUEST_TIMEOUT 1000
#define MAX_RETRIES 3 // Before we abandon

static zmsg_t *
s_try_request (zctx_t *ctx, char *endpoint, zmsg_t *request)
{
printf ("I: trying echo(回音) service at %s…\n", endpoint);
void *client = zsocket_new (ctx, ZMQ_REQ);
zsocket_connect (client, endpoint(端点));

// Send request, wait safely for reply
zmsg_t *msg = zmsg_dup (request);
zmsg_send (&msg, client);
zmq_pollitem_t items [] = { { client, 0, ZMQ_POLLIN, 0 } };
zmq_poll (items, 1, REQUEST_TIMEOUT * ZMQ_POLL_MSEC);
zmsg_t *reply = NULL;
if (items [0].revents & ZMQ_POLLIN)
reply = zmsg_recv (client);

// Close socket in any case, we're done with it now
zsocket_destroy (ctx, client);
return reply;
}

// The client uses a Lazy Pirate strategy if it only has one server to talk
// to. If it has two or more servers to talk to, it will try each server just
// once:

int main (int argc, char *argv [])
{
zctx_t *ctx = zctx_new ();
zmsg_t *request = zmsg_new ();
zmsg_addstr (request, "Hello world");
zmsg_t *reply = NULL;

int endpoints = argc - 1;
if (endpoints == 0)
printf ("I: syntax(语法): %s <endpoint(端点)> …\n", argv [0]);
else
if (endpoints == 1) {
// For one endpoint, we retry N times
int retries;
for (retries = 0; retries < MAX_RETRIES; retries++) {
char *endpoint = argv [1];
reply = s_try_request (ctx, endpoint, request);
if (reply)
break; // Successful
printf ("W: no response from %s, retrying(重试)\n", endpoint);
}
}
else {
// For multiple endpoints, try each at most once
int endpoint_nbr;
for (endpoint_nbr = 0; endpoint_nbr < endpoints; endpoint_nbr++) {
char *endpoint = argv [endpoint_nbr + 1];
reply = s_try_request (ctx, endpoint, request);
if (reply)
break; // Successful
printf ("W: no response from %s\n", endpoint);
}
}
if (reply)
printf ("Service is running OK\n");

zmsg_destroy (&request);
zmsg_destroy (&reply);
zctx_destroy (&ctx);
return 0;
}


C# | Java | PHP | Python | Tcl | Ada | Basic | C++ | Clojure | CL | Delphi | Erlang | F# | Felix | Go | Haskell | Haxe | Lua | Node.js | Objective-C | ooc | Perl | Q | Racket | Ruby | Scala

A sample run is:

flserver1 tcp://*:5555 &
flserver1 tcp://*:5556 &
flclient1 tcp://localhost:5555 tcp://localhost:5556

Although the basic approach is Lazy Pirate, the client aims to just get one successful reply. It has two techniques, depending on whether you are running a single server or multiple servers:

  • With a single server, the client will retry several times, exactly as for Lazy Pirate.
  • With multiple servers, the client will try each server at most once until it's received a reply or has tried all servers.

This solves the main weakness of Lazy Pirate, namely that it could not fail over to backup or alternate servers.

However, this design won't work well in a real application. If we're connecting many sockets and our primary name server is down, we're going to experience this painful timeout each time.

Model Two: Brutal Shotgun Massacre

Let's switch our client to using a DEALER socket. Our goal here is to make sure we get a reply back within the shortest possible time, no matter whether a particular server is up or down. Our client takes this approach:

  • We set things up, connecting to all servers.
  • When we have a request, we blast it out as many times as we have servers.
  • We wait for the first reply, and take that.
  • We ignore any other replies.

What will happen in practice is that when all servers are running, ZeroMQ will distribute the requests so that each server gets one request and sends one reply. When any server is offline and disconnected, ZeroMQ will distribute the requests to the remaining servers. So a server may in some cases get the same request more than once.

What's more annoying for the client is that we'll get multiple replies back, but there's no guarantee we'll get a precise number of replies. Requests and replies can get lost (e.g., if the server crashes while processing a request).

So we have to number requests and ignore any replies that don't match the request number. Our Model One server will work because it's an echo server, but coincidence is not a great basis for understanding. So we'll make a Model Two server that chews up the message and returns a correctly numbered reply with the content "OK". We'll use messages consisting of two parts: a sequence number and a body.

Start one or more servers, specifying a bind endpoint each time:

// Freelance server - Model 2
// Does some work, replies OK, with message sequencing

#include "czmq.h"

int main (int argc, char *argv [])
{
if (argc < 2) {
printf ("I: syntax(语法): %s <endpoint(端点)>\n", argv [0]);
return 0;
}
zctx_t *ctx = zctx_new ();
void *server = zsocket_new (ctx, ZMQ_REP);
zsocket_bind (server, argv [1]);

printf ("I: service is ready at %s\n", argv [1]);
while (true) {
zmsg_t *request = zmsg_recv (server);
if (!request)
break; // Interrupted
// Fail nastily if run against wrong client
assert (zmsg_size (request) == 2);

zframe_t *identity = zmsg_pop (request);
zmsg_destroy (&request);

zmsg_t *reply = zmsg_new ();
zmsg_add (reply, identity);
zmsg_addstr (reply, "OK");
zmsg_send (&reply, server);
}
if (zctx_interrupted)
printf ("W: interrupted\n");

zctx_destroy (&ctx);
return 0;
}


C# | Java | Lua | PHP | Python | Tcl | Ada | Basic | C++ | Clojure | CL | Delphi | Erlang | F# | Felix | Go | Haskell | Haxe | Node.js | Objective-C | ooc | Perl | Q | Racket | Ruby | Scala

Then start the client, specifying the connect endpoints as arguments:

// Freelance client - Model 2
// Uses DEALER socket to blast one or more services

#include "czmq.h"

// We design our client API as a class, using the CZMQ style
#ifdef __cplusplus
extern "C" {
#endif

typedef struct _flclient_t flclient_t;
flclient_t *flclient_new (void);
void flclient_destroy (flclient_t **self_p);
void flclient_connect (flclient_t *self, char *endpoint);
zmsg_t *flclient_request (flclient_t *self, zmsg_t **request_p);

#ifdef __cplusplus

}
#endif

// If not a single service replies within this time, give up
#define GLOBAL_TIMEOUT 2500

int main (int argc, char *argv [])
{
if (argc == 1) {
printf ("I: syntax(语法): %s <endpoint(端点)> …\n", argv [0]);
return 0;
}
// Create new freelance client object
flclient_t *client = flclient_new ();

// Connect to each endpoint
int argn;
for (argn = 1; argn < argc; argn++)
flclient_connect (client, argv [argn]);

// Send a bunch of name resolution 'requests', measure time
int requests = 10000;
uint64_t start = zclock_time ();
while (requests--) {
zmsg_t *request = zmsg_new ();
zmsg_addstr (request, "random name");
zmsg_t *reply = flclient_request (client, &request);
if (!reply) {
printf ("E: name service not available, aborting(异常终止)\n");
break;
}
zmsg_destroy (&reply);
}
printf ("Average round trip cost: %d usec\n",
(int) (zclock_time () - start) / 10);

flclient_destroy (&client);
return 0;
}

// Here is the flclient class implementation. Each instance has a
// context, a DEALER socket it uses to talk to the servers, a counter
// of how many servers it's connected to, and a request sequence number:

struct _flclient_t {
zctx_t *ctx; // Our context wrapper
void *socket; // DEALER socket talking to servers
size_t servers; // How many servers we have connected to
uint sequence; // Number of requests ever sent
};

// Constructor

flclient_t *
flclient_new (void)
{
flclient_t
*self;

self = (flclient_t *) zmalloc (sizeof (flclient_t));
self->ctx = zctx_new ();
self->socket = zsocket_new (self->ctx, ZMQ_DEALER);
return self;
}

// Destructor

void
flclient_destroy (flclient_t **self_p)
{
assert (self_p);
if (*self_p) {
flclient_t *self = *self_p;
zctx_destroy (&self->ctx);
free (self);
*self_p = NULL;
}
}

// Connect to new server endpoint

void
flclient_connect (flclient_t *self, char *endpoint)
{
assert (self);
zsocket_connect (self->socket, endpoint);
self->servers++;
}

// This method does the hard work. It sends a request to all
// connected servers in parallel (for this to work, all connections
// must be successful and completed by this time). It then waits
// for a single successful reply, and returns that to the caller.
// Any other replies are just dropped:

zmsg_t *
flclient_request (flclient_t *self, zmsg_t **request_p)
{
assert (self);
assert (*request_p);
zmsg_t *request = *request_p;

// Prefix request with sequence number and empty envelope
char sequence_text [10];
sprintf (sequence_text, "%u", ++self->sequence);
zmsg_pushstr (request, sequence_text);
zmsg_pushstr (request, "");

// Blast the request to all connected servers
int server;
for (server = 0; server < self->servers; server++) {
zmsg_t *msg = zmsg_dup (request);
zmsg_send (&msg, self->socket);
}
// Wait for a matching reply to arrive from anywhere
// Since we can poll several times, calculate each one
zmsg_t *reply = NULL;
uint64_t endtime = zclock_time () + GLOBAL_TIMEOUT;
while (zclock_time () < endtime) {
zmq_pollitem_t items [] = { { self->socket, 0, ZMQ_POLLIN, 0 } };
zmq_poll (items, 1, (endtime - zclock_time ()) * ZMQ_POLL_MSEC);
if (items [0].revents & ZMQ_POLLIN) {
// Reply is [empty][sequence][OK]
reply = zmsg_recv (self->socket);
assert (zmsg_size (reply) == 3);
free (zmsg_popstr (reply));
char *sequence = zmsg_popstr (reply);
int sequence_nbr = atoi (sequence);
free (sequence);
if (sequence_nbr == self->sequence)
break;
zmsg_destroy (&reply);
}
}
zmsg_destroy (request_p);
return reply;
}


C# | Java | PHP | Python | Tcl | Ada | Basic | C++ | Clojure | CL | Delphi | Erlang | F# | Felix | Go | Haskell | Haxe | Lua | Node.js | Objective-C | ooc | Perl | Q | Racket | Ruby | Scala

Here are some things to note about the client implementation:

  • The client is structured as a nice little class-based API that hides the dirty work of creating ZeroMQ contexts and sockets and talking to the server. That is, if a shotgun blast to the midriff can be called "talking".
  • The client will abandon the chase if it can't find any responsive server within a few seconds.
  • The client has to create a valid REP envelope, i.e., add an empty message frame to the front of the message.

The client performs 10,000 name resolution requests (fake ones, as our server does essentially nothing) and measures the average cost. On my test box, talking to one server, this requires about 60 microseconds. Talking to three servers, it takes about 80 microseconds.

The pros and cons of our shotgun approach are:

  • Pro: it is simple, easy to make and easy to understand.
  • Pro: it does the job of failover, and works rapidly, so long as there is at least one server running.
  • Con: it creates redundant network traffic.
  • Con: we can't prioritize our servers, i.e., Primary, then Secondary.
  • Con: the server can do at most one request at a time, period.

Model Three: Complex and Nasty

The shotgun approach seems too good to be true. Let's be scientific and work through all the alternatives. We're going to explore the complex/nasty option, even if it's only to finally realize that we preferred brutal. Ah, the story of my life.

We can solve the main problems of the client by switching to a ROUTER socket. That lets us send requests to specific servers, avoid servers we know are dead, and in general be as smart as we want to be. We can also solve the main problem of the server (single-threadedness) by switching to a ROUTER socket.

But doing ROUTER to ROUTER between two anonymous sockets (which haven't set an identity) is not possible. Both sides generate an identity (for the other peer) only when they receive a first message, and thus neither can talk to the other until it has first received a message. The only way out of this conundrum is to cheat, and use hard-coded identities in one direction. The proper way to cheat, in a client/server case, is to let the client "know" the identity of the server. Doing it the other way around would be insane, on top of complex and nasty, because any number of clients should be able to arise independently. Insane, complex, and nasty are great attributes for a genocidal dictator, but terrible ones for software.

Rather than invent yet another concept to manage, we'll use the connection endpoint as identity. This is a unique string on which both sides can agree without more prior knowledge than they already have for the shotgun model. It's a sneaky and effective way to connect two ROUTER sockets.

Remember how ZeroMQ identities work. The server ROUTER socket sets an identity before it binds its socket. When a client connects, they do a little handshake to exchange identities, before either side sends a real message. The client ROUTER socket, having not set an identity, sends a null identity to the server. The server generates a random UUID to designate the client for its own use. The server sends its identity (which we've agreed is going to be an endpoint string) to the client.
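
As an illustrative fragment (not taken from the Guide's code), the server side of that trick boils down to stamping the ROUTER socket with its own endpoint string before binding, using the plain libzmq socket option:

#include "czmq.h"

// Sketch: give a server ROUTER socket a known identity - its own endpoint
// string - before binding, so clients can address it by that string.
int main (void)
{
    char *endpoint = "tcp://127.0.0.1:5555";    // Illustrative endpoint
    zctx_t *ctx = zctx_new ();
    void *server = zsocket_new (ctx, ZMQ_ROUTER);
    zmq_setsockopt (server, ZMQ_IDENTITY, endpoint, strlen (endpoint));
    zsocket_bind (server, endpoint);
    // ... serve requests ...
    zctx_destroy (&ctx);
    return 0;
}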

This means that our client can route a message to the server (i.e., send on its ROUTER socket, specifying the server endpoint as identity) as soon as the connection is established. That's not immediately after doing a zmq_connect(), but some random time thereafter. Herein lies one problem: we don't know when the server will actually be available and complete its connection handshake. If the server is online, it could be after a few milliseconds. If the server is down and the sysadmin is out to lunch, it could be an hour from now.

There's a small paradox here. We need to know when servers become connected and available for work. In the Freelance pattern, unlike the broker-based patterns we saw earlier in this chapter, servers are silent until spoken to. Thus we can't talk to a server until it's told us it's online, which it can't do until we've asked it.

My solution is to mix in a little of the shotgun approach from model 2, meaning we'll fire (harmless) shots at anything we can, and if anything moves, we know it's alive. We're not going to fire real requests, but rather a kind of ping-pong heartbeat.

This brings us to the realm of protocols again, so here's a short spec that defines how a Freelance client and server exchange ping-pong commands and request-reply commands.

It is short and sweet to implement as a server. Here's our echo server, Model Three, now speaking FLP:
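
As a sketch of what such a server can look like, under some assumptions about the framing (a client identity frame, then either a "PING" control frame answered with "PONG", or a request frame echoed back with an "OK" body), here is a Model Three-style echo server. It is an illustration, not the Guide's exact FLP code.

// Freelance server - Model 3 (sketch)
// Assumed framing: [client identity][control or request][body...]

#include "czmq.h"

int main (int argc, char *argv [])
{
    if (argc < 2) {
        printf ("I: syntax: %s <endpoint>\n", argv [0]);
        return 0;
    }
    char *endpoint = argv [1];
    zctx_t *ctx = zctx_new ();

    // Bind a ROUTER socket and use the endpoint as our identity, so that
    // clients can address us with the same string they connect to
    void *server = zsocket_new (ctx, ZMQ_ROUTER);
    zmq_setsockopt (server, ZMQ_IDENTITY, endpoint, strlen (endpoint));
    zsocket_bind (server, endpoint);
    printf ("I: service is ready at %s\n", endpoint);

    while (!zctx_interrupted) {
        zmsg_t *request = zmsg_recv (server);
        if (!request)
            break;                          // Interrupted
        // Frame 0: identity of client
        // Frame 1: "PING", or a client control frame
        // Frame 2: request body
        zframe_t *identity = zmsg_pop (request);
        zframe_t *control = zmsg_pop (request);
        zmsg_t *reply = zmsg_new ();
        if (zframe_streq (control, "PING")) {
            zframe_destroy (&control);
            zmsg_addstr (reply, "PONG");
        }
        else {
            zmsg_add (reply, control);
            zmsg_addstr (reply, "OK");
        }
        zmsg_destroy (&request);
        zmsg_push (reply, identity);
        zmsg_send (&reply, server);
    }
    if (zctx_interrupted)
        printf ("W: interrupted\n");
    zctx_destroy (&ctx);
    return 0;
}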


C# | Java | Lua | Python | Tcl | Ada | Basic | C++ | Clojure | CL | Delphi | Erlang | F# | Felix | Go | Haskell | Haxe | Node.js | Objective-C | ooc | Perl | PHP | Q | Racket | Ruby | Scala

The Freelance client, however, has gotten large. For clarity, it's split into an example application and a class that does the hard work. Here's the top-level application:
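
Assuming an flcliapi class with the same overall shape as Model Two's flclient API (new/connect/request/destroy; the header name flcliapi.h is hypothetical here), the top-level application can be sketched roughly as:

// Freelance client - Model 3, top-level application (sketch)
// The flcliapi class and its header are assumptions for illustration.

#include "czmq.h"
#include "flcliapi.h"

int main (void)
{
    // Create new freelance client object
    flcliapi_t *client = flcliapi_new ();

    // Connect to several endpoints (illustrative addresses)
    flcliapi_connect (client, "tcp://localhost:5555");
    flcliapi_connect (client, "tcp://localhost:5556");
    flcliapi_connect (client, "tcp://localhost:5557");

    // Send a bunch of name resolution 'requests', measure time
    int requests = 10000;
    uint64_t start = zclock_time ();
    while (requests--) {
        zmsg_t *request = zmsg_new ();
        zmsg_addstr (request, "random name");
        zmsg_t *reply = flcliapi_request (client, &request);
        if (!reply) {
            printf ("E: name service not available, aborting\n");
            break;
        }
        zmsg_destroy (&reply);
    }
    printf ("Average round trip cost: %d usec\n",
        (int) (zclock_time () - start) / 10);

    flcliapi_destroy (&client);
    return 0;
}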


C# | Java | Python | Tcl | Ada | Basic | C++ | Clojure | CL | Delphi | Erlang | F# | Felix | Go | Haskell | Haxe | Lua | Node.js | Objective-C | ooc | Perl | PHP | Q | Racket | Ruby | Scala

And here, almost as complex and large as the Majordomo broker, is the client API class:


C# | Java | Python | Tcl | Ada | Basic | C++ | Clojure | CL | Delphi | Erlang | F# | Felix | Go | Haskell | Haxe | Lua | Node.js | Objective-C | ooc | Perl | PHP | Q | Racket | Ruby | Scala

This API implementation is fairly sophisticated and uses a couple of techniques that we've not seen before.

  • Multithreaded API: the client API consists of two parts, a synchronous flcliapi class that runs in the application thread, and an asynchronous agent class that runs as a background thread. Remember how ZeroMQ makes it easy to create multithreaded apps. The flcliapi and agent classes talk to each other with messages over an inproc socket. All ZeroMQ aspects (such as creating and destroying a context) are hidden in the API. The agent in effect acts like a mini-broker, talking to servers in the background, so that when we make a request, it can make a best effort to reach a server it believes is available.
  • Tickless poll timer: in previous poll loops we always used a fixed tick interval, e.g., 1 second, which is simple enough but not excellent on power-sensitive clients (such as notebooks or mobile phones), where waking the CPU costs power. For fun, and to help save the planet, the agent uses a tickless timer, which calculates the poll delay based on the next timeout we're expecting. A proper implementation would keep an ordered list of timeouts. We just check all timeouts and calculate the poll delay until the next one, as the sketch after this list shows.
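
A sketch of the tickless calculation (variable and socket names are illustrative):

#include "czmq.h"

// Sketch of a tickless poll: instead of waking on a fixed interval,
// compute the delay until the earliest timeout we expect and sleep
// in zmq_poll exactly that long.
static void s_tickless_poll (void *some_socket, int64_t next_heartbeat_at)
{
    // Start from "a long time away" and pull it in for each known timeout
    int64_t tickless = zclock_time () + 1000 * 3600;
    if (next_heartbeat_at < tickless)
        tickless = next_heartbeat_at;

    // Never pass a negative timeout; clamp to zero if already overdue
    int64_t timeout = tickless - zclock_time ();
    if (timeout < 0)
        timeout = 0;

    zmq_pollitem_t items [] = { { some_socket, 0, ZMQ_POLLIN, 0 } };
    int rc = zmq_poll (items, 1, timeout * ZMQ_POLL_MSEC);
    if (rc == -1)
        return;         // Context terminated or interrupted
    // ... handle socket activity and any expired timeouts here ...
}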

Conclusion


In this chapter, we've seen a variety of reliable request-reply mechanisms, each with certain costs and benefits. The example code is largely ready for real use, though it is not optimized. Of all the different patterns, the two that stand out for production use are the Majordomo pattern, for broker-based reliability, and the Freelance pattern, for brokerless reliability.


Chapter 5 - Advanced Pub-Sub Patterns


In Chapter 3 - Advanced Request-Reply Patterns and Chapter 4 - Reliable Request-Reply Patterns we looked at advanced use of ZeroMQ's request-reply pattern. If you managed to digest all that, congratulations. In this chapter we'll focus on publish-subscribe and extend ZeroMQ's core pub-sub pattern with higher-level patterns for performance, reliability, state distribution, and monitoring.

We'll cover:

  • When to use publish-subscribe
  • How to handle too-slow subscribers (the Suicidal Snail pattern)
  • How to design high-speed subscribers (the Black Box pattern)
  • How to monitor a pub-sub network (the Espresso pattern)
  • How to build a shared key-value store (the Clone pattern)
  • How to use reactors to simplify complex servers
  • How to use the Binary Star pattern to add failover to a server

Pros and Cons of Pub-Sub


ZeroMQ's low-level patterns have their different characters. Pub-sub addresses an old messaging problem, which is multicast or group messaging. It has that unique mix of meticulous simplicity and brutal indifference that characterizes ZeroMQ. It's worth understanding the trade-offs that pub-sub makes, how these benefit us, and how we can work around them if needed.

First, PUB sends each message to "all of many", whereas PUSH and DEALER rotate messages to "one of many". You cannot simply replace PUSH with PUB or vice versa and hope that things will work. This bears repeating because people seem to quite often suggest doing this.

More profoundly, pub-sub is aimed at scalability. This means large volumes of data, sent rapidly to many recipients. If you need millions of messages per second sent to thousands of points, you'll appreciate pub-sub a lot more than if you need a few messages a second sent to a handful of recipients.

To get scalability, pub-sub uses the same trick as push-pull, which is to get rid of back-chatter. This means that recipients don't talk back to senders. There are some exceptions, e.g., SUB sockets will send subscriptions to PUB sockets, but it's anonymous and infrequent.

Killing back-chatter is essential to real scalability. With pub-sub, it's how the pattern can map cleanly to the PGM multicast protocol, which is handled by the network switch. In other words, subscribers don't connect to the publisher at all, they connect to a multicast group on the switch, to which the publisher sends its messages.

When we remove back-chatter, our overall message flow becomes much simpler, which lets us make simpler APIs, simpler protocols, and in general reach many more people. But we also remove any possibility to coordinate senders and receivers. What this means is:

  • Publishers can't tell when subscribers are successfully connected, both on initial connections, and on reconnections after network failures.
  • Subscribers can't tell publishers anything that would allow publishers to control the rate of messages they send. Publishers only have one setting, which is full-speed, and subscribers must either keep up or lose messages.
  • Publishers can't tell when subscribers have disappeared due to processes crashing, networks breaking, and so on.

The downside is that we actually need all of these if we want to do reliable multicast. The ZeroMQ pub-sub pattern will lose messages arbitrarily when a subscriber is connecting, when a network failure occurs, or just if the subscriber or network can't keep up with the publisher.

The upside is that there are many use cases where almost reliable multicast is just fine. When we need this back-chatter, we can either switch to using ROUTER-DEALER (which I tend to do for most normal volume cases), or we can add a separate channel for synchronization (we'll see an example of this later in this chapter).

Pub-sub is like a radio broadcast; you miss everything before you join, and then how much information you get depends on the quality of your reception. Surprisingly, this model is useful and widespread because it maps perfectly to real world distribution of information. Think of Facebook and Twitter, the BBC World Service, and the sports results.

As we did for request-reply, let's define reliability in terms of what can go wrong. Here are the classic failure cases for pub-sub:

  • Subscribers join late, so they miss messages the server already sent.
  • Subscribers can fetch messages too slowly, so queues build up and then overflow.
  • Subscribers can drop off and lose messages while they are away.
  • Subscribers can crash and restart, and lose whatever data they already received.
  • Networks can become overloaded and drop data (specifically, for PGM).
  • Networks can become too slow, so publisher-side queues overflow and publishers crash.

A lot more can go wrong but these are the typical failures we see in a realistic system. Since v3.x, ZeroMQ forces default limits on its internal buffers (the so-called high-water mark or HWM), so publisher crashes are rarer unless you deliberately set the HWM to infinite.
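
For instance, using the CZMQ socket-option helpers for ZeroMQ 3.x, you can adjust the high-water marks per socket if the defaults don't suit you (a sketch):

#include "czmq.h"

// Sketch: adjust the send high-water mark on a publisher socket.
// A HWM of 0 means "no limit" - use with care, since an unbounded
// queue is exactly what lets a publisher run out of memory.
int main (void)
{
    zctx_t *ctx = zctx_new ();
    void *publisher = zsocket_new (ctx, ZMQ_PUB);
    zsocket_set_sndhwm (publisher, 100000);     // Queue at most 100k messages
    zsocket_bind (publisher, "tcp://*:5556");
    // ... publish ...
    zctx_destroy (&ctx);
    return 0;
}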

All of these failure cases have answers, though not always simple ones. Reliability requires complexity that most of us don't need, most of the time, which is why ZeroMQ doesn't attempt to provide it out of the box (even if there was one global design for reliability, which there isn't).

Pub-Sub Tracing (Espresso Pattern)


Let's start this chapter by looking at a way to trace pub-sub networks. In Chapter 2 - Sockets and Patterns we saw a simple proxy that used these to do transport bridging. The zmq_proxy() method has three arguments: a frontend and backend socket that it bridges together, and a capture socket to which it will send all messages.

The code is deceptively simple:
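
The heart of it is the zmq_proxy() call with a capture socket; the sketch below shows just that wiring, with illustrative endpoints and without the listener, publisher, and subscriber threads of the full example:

#include "czmq.h"

// Sketch of the Espresso idea: bridge an XSUB frontend to an XPUB backend
// and copy every message (including subscriptions) to a capture socket,
// whose other end a listener thread would read and print.
int main (void)
{
    zctx_t *ctx = zctx_new ();

    void *frontend = zsocket_new (ctx, ZMQ_XSUB);
    zsocket_connect (frontend, "tcp://localhost:6000");
    void *backend = zsocket_new (ctx, ZMQ_XPUB);
    zsocket_bind (backend, "tcp://*:6001");

    // One end of an inproc PAIR pipe; a listener holds the other end
    void *capture = zsocket_new (ctx, ZMQ_PAIR);
    zsocket_bind (capture, "inproc://capture");

    // Runs until the context is terminated
    zmq_proxy (frontend, backend, capture);

    zctx_destroy (&ctx);
    return 0;
}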


C# | Java | Python | Ada | Basic | C++ | Clojure | CL | Delphi | Erlang | F# | Felix | Go | Haskell | Haxe | Lua | Node.js | Objective-C | ooc | Perl | PHP | Q | Racket | Ruby | Scala | Tcl

Espresso works by creating a listener thread that reads a PAIR socket and prints anything it gets. That PAIR socket is one end of a pipe; the other end (another PAIR) is the socket we pass to zmq_proxy(). In practice, you'd filter interesting messages to get the essence of what you want to track (hence the name of the pattern).

The subscriber thread subscribes to "A" and "B", receives five messages, and then destroys its socket. When you run the example, the listener prints two subscription messages, five data messages, two unsubscribe messages, and then silence:

[002] 0141
[002] 0142
[007] B-91164
[007] B-12979
[007] A-52599
[007] A-06417
[007] A-45770
[002] 0041
[002] 0042

This shows neatly how the publisher socket stops sending data when there are no subscribers for it. The publisher thread is still sending messages. The socket just drops them silently.

Last Value Caching

topprevnext

If you've used commercial pub-sub systems, you may be used to some features that are missing in the fast and cheerful ZeroMQ pub-sub model. One of these is last value caching (LVC). This solves the problem of how a new subscriber catches up when it joins the network. The theory is that publishers get notified when a new subscriber joins and subscribes to some specific topics. The publisher can then rebroadcast the last message for those topics.

I've already explained why publishers don't get notified when there are new subscribers, because in large pub-sub systems, the volumes of data make it pretty much impossible. To make really large-scale pub-sub networks, you need a protocol like PGM that exploits an upscale Ethernet switch's ability to multicast data to thousands of subscribers. Trying to do a TCP unicast from the publisher to each of thousands of subscribers just doesn't scale. You get weird spikes, unfair distribution (some subscribers getting the message before others), network congestion, and general unhappiness.

PGM is a one-way protocol: the publisher sends a message to a multicast address at the switch, which then rebroadcasts that to all interested subscribers. The publisher never sees when subscribers join or leave: this all happens in the switch, which we don't really want to start reprogramming.

However, in a lower-volume network with a few dozen subscribers and a limited number of topics, we can use TCP and then the XSUB and XPUB sockets do talk to each other as we just saw in the Espresso pattern.

Can we make an LVC using ZeroMQ? The answer is yes, if we make a proxy that sits between the publisher and subscribers; an analog for the PGM switch, but one we can program ourselves.

I'll start by making a publisher and subscriber that highlight the worst case scenario. This publisher is pathological. It starts by immediately sending messages to each of a thousand topics, and then it sends one update a second to a random topic. A subscriber connects, and subscribes to a topic. Without LVC, a subscriber would have to wait an average of 500 seconds to get any data. To add some drama, let's pretend there's an escaped convict called Gregor threatening to rip the head off Roger the toy bunny if we can't fix that 8.3 minutes' delay.

Here's the publisher code. Note that it has the command line option to connect to some address, but otherwise binds to an endpoint. We'll use this later to connect to our last value cache:
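
A minimal sketch of such a pathological publisher (ports, topic format, and message text are illustrative assumptions):

#include "czmq.h"

// Sketch of a pathological publisher: sends one message per topic for a
// thousand topics, then one update per second to a random topic. It can
// connect to a given address (e.g., a proxy) or else binds its own endpoint.
int main (int argc, char *argv [])
{
    zctx_t *ctx = zctx_new ();
    void *publisher = zsocket_new (ctx, ZMQ_PUB);
    if (argc == 2)
        zsocket_connect (publisher, argv [1]);
    else
        zsocket_bind (publisher, "tcp://*:5556");

    // Give subscribers a chance to connect
    zclock_sleep (1000);

    // Send out one message for each of a thousand topics
    int topic_nbr;
    for (topic_nbr = 0; topic_nbr < 1000; topic_nbr++) {
        char topic [16];
        sprintf (topic, "%03d", topic_nbr);
        zstr_sendm (publisher, topic);
        zstr_send (publisher, "Save Roger");
    }
    // Then send one random update per second
    srandom ((unsigned) time (NULL));
    while (!zctx_interrupted) {
        zclock_sleep (1000);
        char topic [16];
        sprintf (topic, "%03d", randof (1000));
        zstr_sendm (publisher, topic);
        zstr_send (publisher, "Save Roger");
    }
    zctx_destroy (&ctx);
    return 0;
}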


C# | Java | Python | Ruby | Ada | Basic | C++ | Clojure | CL | Delphi | Erlang | F# | Felix | Go | Haskell | Haxe | Lua | Node.js | Objective-C | ooc | Perl | PHP | Q | Racket | Scala | Tcl

And here's the subscriber:
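
And a matching sketch of the subscriber (again, ports and topic format are illustrative):

#include "czmq.h"

// Sketch of the subscriber: pick one random topic, subscribe to it, and
// print whatever arrives. Connects to the publisher by default, or to
// whatever endpoint (e.g., the cache proxy) is given as argument.
int main (int argc, char *argv [])
{
    zctx_t *ctx = zctx_new ();
    void *subscriber = zsocket_new (ctx, ZMQ_SUB);
    zsocket_connect (subscriber,
        argc == 2? argv [1]: "tcp://localhost:5556");

    srandom ((unsigned) time (NULL));
    char subscription [16];
    sprintf (subscription, "%03d", randof (1000));
    zsocket_set_subscribe (subscriber, subscription);

    while (true) {
        char *topic = zstr_recv (subscriber);
        if (!topic)
            break;                  // Interrupted
        char *data = zstr_recv (subscriber);
        assert (streq (topic, subscription));
        printf ("%s\n", data);
        free (topic);
        free (data);
    }
    zctx_destroy (&ctx);
    return 0;
}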


C# | Java | Python | Ruby | Ada | Basic | C++ | Clojure | CL | Delphi | Erlang | F# | Felix | Go | Haskell | Haxe | Lua | Node.js | Objective-C | ooc | Perl | PHP | Q | Racket | Scala | Tcl

Try building and running these: first the subscriber, then the publisher. You'll see the subscriber reports getting "Save Roger" as you'd expect:

./pathosub &
./pathopub

It's when you run a second subscriber that you understand Roger's predicament. You have to leave it an awful long time before it reports getting any data. So, here's our last value cache. As I promised, it's a proxy that binds to two sockets and then handles messages on both:
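
A sketch of such a last value cache (ports are illustrative; a SUB socket faces the publisher, an XPUB socket faces the subscribers, and a hash keeps the last body seen per topic):

#include "czmq.h"

// Sketch of a last value cache proxy. On the frontend we take everything
// the publisher sends, cache it per topic, and forward it. On the backend,
// whenever a new subscription arrives, we replay the cached value.
int main (void)
{
    zctx_t *ctx = zctx_new ();
    void *frontend = zsocket_new (ctx, ZMQ_SUB);
    zsocket_set_subscribe (frontend, "");          // Take every topic
    zsocket_bind (frontend, "tcp://*:5557");       // Publisher connects here
    void *backend = zsocket_new (ctx, ZMQ_XPUB);
    zsocket_bind (backend, "tcp://*:5558");        // Subscribers connect here

    zhash_t *cache = zhash_new ();                 // topic -> last body

    while (true) {
        zmq_pollitem_t items [] = {
            { frontend, 0, ZMQ_POLLIN, 0 },
            { backend,  0, ZMQ_POLLIN, 0 }
        };
        if (zmq_poll (items, 2, 1000 * ZMQ_POLL_MSEC) == -1)
            break;                  // Interrupted

        // Any new topic data we cache and then forward to subscribers
        if (items [0].revents & ZMQ_POLLIN) {
            char *topic = zstr_recv (frontend);
            char *body = zstr_recv (frontend);
            if (!topic || !body)
                break;
            zhash_update (cache, topic, strdup (body));
            zhash_freefn (cache, topic, free);
            zstr_sendm (backend, topic);
            zstr_send (backend, body);
            free (topic);
            free (body);
        }
        // On a new subscription, replay the cached value for that topic
        if (items [1].revents & ZMQ_POLLIN) {
            zframe_t *frame = zframe_recv (backend);
            if (!frame)
                break;
            // Subscription frame is: [1 byte, 0=unsub or 1=sub][topic]
            byte *event = zframe_data (frame);
            if (event [0] == 1) {
                char *topic = zmalloc (zframe_size (frame));
                memcpy (topic, event + 1, zframe_size (frame) - 1);
                char *previous = (char *) zhash_lookup (cache, topic);
                if (previous) {
                    zstr_sendm (backend, topic);
                    zstr_send (backend, previous);
                }
                free (topic);
            }
            zframe_destroy (&frame);
        }
    }
    zhash_destroy (&cache);
    zctx_destroy (&ctx);
    return 0;
}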


C# | Java | Python | Ruby | Ada | Basic | C++ | Clojure | CL | Delphi | Erlang | F# | Felix | Go | Haskell | Haxe | Lua | Node.js | Objective-C | ooc | Perl | PHP | Q | Racket | Scala | Tcl
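 
The real lvcache example keeps a hash table keyed by topic; the sketch below cheats by assuming the three-digit topics that pathopub uses, so a plain array can hold the last value per topic. The ports 5557 and 5558 match the run commands that follow.

//  Sketch of a last value cache proxy: XSUB facing the publisher, XPUB facing
//  the subscribers, resending the cached value whenever a new subscription
//  arrives. Error handling omitted.
#include <zmq.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main (void)
{
    void *context = zmq_ctx_new ();
    void *frontend = zmq_socket (context, ZMQ_XSUB);
    void *backend = zmq_socket (context, ZMQ_XPUB);
    zmq_bind (frontend, "tcp://*:5557");    //  publishers connect here
    zmq_bind (backend, "tcp://*:5558");     //  subscribers connect here
    zmq_send (frontend, "\001", 1, 0);      //  subscribe to everything upstream

    static char cache [1000][256];          //  last body seen, per topic
    zmq_pollitem_t items [] = {
        { frontend, 0, ZMQ_POLLIN, 0 },
        { backend, 0, ZMQ_POLLIN, 0 }
    };
    while (zmq_poll (items, 2, -1) >= 0) {
        if (items [0].revents & ZMQ_POLLIN) {
            //  New update from the publisher: cache it, then pass it on
            char topic [16], body [256];
            int topic_size = zmq_recv (frontend, topic, 15, 0);
            int body_size = zmq_recv (frontend, body, 255, 0);
            topic [topic_size] = body [body_size] = 0;
            strncpy (cache [atoi (topic)], body, 255);
            zmq_send (backend, topic, topic_size, ZMQ_SNDMORE);
            zmq_send (backend, body, body_size, 0);
        }
        if (items [1].revents & ZMQ_POLLIN) {
            //  Subscription event: byte 0 is 1 (subscribe) or 0 (unsubscribe),
            //  the rest is the topic itself
            char event [256];
            int size = zmq_recv (backend, event, 255, 0);
            if (size > 1 && event [0] == 1) {
                event [size] = 0;
                char *topic = event + 1;
                if (*cache [atoi (topic)]) {
                    zmq_send (backend, topic, strlen (topic), ZMQ_SNDMORE);
                    zmq_send (backend, cache [atoi (topic)], strlen (cache [atoi (topic)]), 0);
                }
            }
            zmq_send (frontend, event, size, 0);    //  forward subscription upstream
        }
    }
    return 0;
}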

Now, run the proxy, and then the publisher:

./lvcache &
./pathopub tcp://localhost:5557

And now run as many instances of the subscriber as you want to try, each time connecting to the proxy on port 5558:

./pathosub tcp://localhost:5558

Each subscriber happily reports "Save Roger", and Gregor the Escaped Convict slinks back to his seat for dinner and a nice cup of hot milk, which is all he really wanted in the first place.

One note: by default, the XPUB socket does not report duplicate subscriptions, which is what you want when you're naively connecting an XPUB to an XSUB. Our example sneakily gets around this by using random topics so the chance of it not working is one in a million. In a real LVC proxy, you'll want to use the ZMQ_XPUB_VERBOSE option that we implement in Chapter 6 - The ZeroMQ Community as an exercise.

Slow Subscriber Detection (Suicidal Snail Pattern)


A common problem you will hit when using the pub-sub pattern in real life is the slow subscriber. In an ideal world, we stream data at full speed from publishers to subscribers. In reality, subscriber applications are often written in interpreted languages, or just do a lot of work, or are just badly written, to the extent that they can't keep up with publishers.

How do we handle a slow subscriber? The ideal fix is to make the subscriber faster, but that might take work and time. Some of the classic strategies for handling a slow subscriber are:

  • Queue messages on the publisher. This is what Gmail does when I don't read my email for a couple of hours. But in high-volume messaging, pushing queues upstream has the thrilling but unprofitable result of making publishers run out of memory and crash—especially if there are lots of subscribers and it's not possible to flush to disk for performance reasons.
  • Queue messages on the subscriber. This is much better, and it's what ZeroMQ does by default if the network can keep up with things. If anyone's going to run out of memory and crash, it'll be the subscriber rather than the publisher, which is fair. This is perfect for "peaky" streams where a subscriber can't keep up for a while, but can catch up when the stream slows down. However, it's no answer to a subscriber that's simply too slow in general.
  • Stop queuing new messages after a while. This is what Gmail does when my mailbox overflows its precious gigabytes of space. New messages just get rejected or dropped. This is a great strategy from the perspective of the publisher, and it's what ZeroMQ does when the publisher sets a HWM. However, it still doesn't help us fix the slow subscriber. Now we just get gaps in our message stream.
  • Punish slow subscribers with disconnect. This is what Hotmail (remember that?) did when I didn't log in for two weeks, which is why I was on my fifteenth Hotmail account when it hit me that there was perhaps a better way. It's a nice brutal strategy that forces subscribers to sit up and pay attention and would be ideal, but ZeroMQ doesn't do this, and there's no way to layer it on top because subscribers are invisible to publisher applications.

None of these classic strategies fit, so we need to get creative. Rather than disconnect the publisher, let's convince the subscriber to kill itself. This is the Suicidal Snail pattern. When a subscriber detects that it's running too slowly (where "too slowly" is presumably a configured option that really means "so slowly that if you ever get here, shout really loudly because I need to know, so I can fix this!"), it croaks and dies.

How can a subscriber detect this? One way would be to sequence messages (number them in order) and use a HWM at the publisher. Now, if the subscriber detects a gap (i.e., the numbering isn't consecutive), it knows something is wrong. We then tune the HWM to the "croak and die if you hit this" level.

There are two problems with this solution. One, if we have many publishers, how do we sequence messages? The solution is to give each publisher a unique ID and add that to the sequencing. Second, if subscribers use ZMQ_SUBSCRIBE filters, they will get gaps by definition. Our precious sequencing will be for nothing.

Some use cases won't use filters, and sequencing will work for them. But a more general solution is that the publisher timestamps each message. When a subscriber gets a message, it checks the time, and if the difference is more than, say, one second, it does the "croak and die" thing, possibly firing off a squawk to some operator console first.

The Suicidal Snail pattern works especially when subscribers have their own clients and service-level agreements and need to guarantee certain maximum latencies. Aborting a subscriber may not seem like a constructive way to guarantee a maximum latency, but it's the assertion model. Abort today, and the problem will be fixed. Allow late data to flow downstream, and the problem may cause wider damage and take longer to appear on the radar.

Here is a minimal example of a Suicidal Snail:


C++ | C# | Java | Lua | PHP | Python | Tcl | Ada | Basic | Clojure | CL | Delphi | Erlang | F# | Felix | Go | Haskell | Haxe | Node.js | Objective-C | ooc | Perl | Q | Racket | Ruby | Scala
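
The full example runs publisher and subscriber as two threads in one process; the essential part is the subscriber's check, sketched here in plain libzmq (the endpoint and the one-second threshold are assumptions):

//  Subscriber half of a Suicidal Snail: the publisher is assumed to send the
//  current clock, in milliseconds, as a plain text frame.
#include <zmq.h>
#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>
#include <sys/time.h>

#define MAX_ALLOWED_DELAY 1000      //  msecs

static int64_t s_clock (void)
{
    struct timeval tv;
    gettimeofday (&tv, NULL);
    return (int64_t) tv.tv_sec * 1000 + tv.tv_usec / 1000;
}

int main (void)
{
    void *context = zmq_ctx_new ();
    void *subscriber = zmq_socket (context, ZMQ_SUB);
    zmq_connect (subscriber, "tcp://localhost:5556");
    zmq_setsockopt (subscriber, ZMQ_SUBSCRIBE, "", 0);

    while (1) {
        char buffer [32];
        int size = zmq_recv (subscriber, buffer, 31, 0);
        if (size == -1)
            break;
        buffer [size] = 0;
        //  Suicide if the message is more than a second old
        int64_t clock = atoll (buffer);
        if (s_clock () - clock > MAX_ALLOWED_DELAY) {
            fprintf (stderr, "E: subscriber cannot keep up, aborting\n");
            break;
        }
        //  Otherwise, pretend to do some work here
    }
    zmq_close (subscriber);
    zmq_ctx_destroy (context);
    return 0;
}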

Here are some things to note about the Suicidal Snail example:

  • The message here consists simply of the current system clock as a number of milliseconds. In a realistic application, you'd have at least a message header with the timestamp and a message body with data.
  • The example has subscriber and publisher in a single process as two threads. In reality, they would be separate processes. Using threads is just convenient for the demonstration.

High-Speed Subscribers (Black Box Pattern)


Now let's look at one way to make our subscribers faster. A common use case for pub-sub is distributing large data streams like market data coming from stock exchanges. A typical setup would have a publisher connected to a stock exchange, taking price quotes, and sending them out to a number of subscribers. If there are a handful of subscribers, we could use TCP. If we have a larger number of subscribers, we'd probably use reliable multicast, i.e., PGM.

Figure 56 - The Simple Black Box Pattern

fig56.png

Let's imagine our feed has an average of 100,000 100-byte messages a second. That's a typical rate, after filtering market data we don't need to send on to subscribers. Now we decide to record a day's data (maybe 250 GB in 8 hours), and then replay it to a simulation network, i.e., a small group of subscribers. While 100K messages a second is easy for a ZeroMQ application, we want to replay it much faster.

So we set up our architecture with a bunch of boxes—one for the publisher and one for each subscriber. These are well-specified boxes—eight cores, twelve for the publisher.

And as we pump data into our subscribers, we notice two things:

  1. When we do even the slightest amount of work with a message, it slows down our subscriber to the point where it can't catch up with the publisher again.
  2. We're hitting a ceiling, at both publisher and subscriber, of around 6M messages a second, even after careful optimization and TCP tuning.

The first thing we have to do is break our subscriber into a multithreaded design so that we can do work with messages in one set of threads, while reading messages in another. Typically, we don't want to process every message the same way. Rather, the subscriber will filter some messages, perhaps by prefix key. When a message matches some criteria, the subscriber will call a worker to deal with it. In ZeroMQ terms, this means sending the message to a worker thread.

So the subscriber looks something like a queue device. We could use various sockets to connect the subscriber and workers. If we assume one-way traffic and workers that are all identical, we can use PUSH and PULL and delegate all the routing work to ZeroMQ. This is the simplest and fastest approach.

The subscriber talks to the publisher over TCP or PGM. The subscriber talks to its workers, which are all in the same process, over inproc://.
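
As a sketch of that simple Black Box layout (not the tuned setup from the figures that follow), the subscriber side might look like this, assuming POSIX threads, four workers, and a hypothetical "MSFT" prefix filter:

//  One thread reads from the publisher over TCP and pushes matching messages
//  to worker threads over inproc://.
#include <zmq.h>
#include <pthread.h>
#include <stdio.h>
#include <string.h>

static void *worker_task (void *context)
{
    void *work_receiver = zmq_socket (context, ZMQ_PULL);
    zmq_connect (work_receiver, "inproc://workers");
    while (1) {
        char message [256];
        int size = zmq_recv (work_receiver, message, 255, 0);
        if (size == -1)
            break;
        //  Process the message here (parse, store, aggregate, ...)
    }
    zmq_close (work_receiver);
    return NULL;
}

int main (void)
{
    void *context = zmq_ctx_new ();
    //  Socket facing the market-data publisher
    void *subscriber = zmq_socket (context, ZMQ_SUB);
    zmq_connect (subscriber, "tcp://localhost:5556");
    zmq_setsockopt (subscriber, ZMQ_SUBSCRIBE, "", 0);

    //  Socket that fans work out to the workers; bind before the workers connect
    void *work_sender = zmq_socket (context, ZMQ_PUSH);
    zmq_bind (work_sender, "inproc://workers");

    pthread_t workers [4];
    int worker_nbr;
    for (worker_nbr = 0; worker_nbr < 4; worker_nbr++)
        pthread_create (&workers [worker_nbr], NULL, worker_task, context);

    while (1) {
        char message [256];
        int size = zmq_recv (subscriber, message, 255, 0);
        if (size == -1)
            break;
        //  Filter by prefix key; only interesting messages go to the workers
        if (memcmp (message, "MSFT", 4) == 0)
            zmq_send (work_sender, message, size, 0);
    }
    zmq_close (work_sender);
    zmq_close (subscriber);
    zmq_ctx_destroy (context);
    return 0;
}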

Figure 57 - Mad Black Box Pattern

fig57.png

Now to break that ceiling. The subscriber thread hits 100% of CPU and because it is one thread, it cannot use more than one core. A single thread will always hit a ceiling, be it at 2M, 6M, or more messages per second. We want to split the work across multiple threads that can run in parallel.

The approach used by many high-performance products, which works here, is sharding. Using sharding, we split the work into parallel and independent streams, such as half of the topic keys in one stream, and half in another. We could use many streams, but performance won't scale unless we have free cores. So let's see how to shard into two streams.

With two streams, working at full speed, we would configure ZeroMQ as follows:

  • Two I/O threads, rather than one.
  • Two network interfaces (NIC), one per subscriber.
  • Each I/O thread bound to a specific NIC.
  • Two subscriber threads, bound to specific cores.
  • Two SUB sockets, one per subscriber thread.
  • The remaining cores assigned to worker threads.
  • Worker threads connected to both subscriber PUSH sockets.

Ideally, we want to match the number of fully-loaded threads in our architecture with the number of cores. When threads start to fight for cores and CPU cycles, the cost of adding more threads outweighs the benefits. There would be no benefit, for example, in creating more I/O threads.

Reliable Pub-Sub (Clone Pattern)


As a larger worked example, we'll take the problem of making a reliable pub-sub architecture. We'll develop this in stages. The goal is to allow a set of applications to share some common state. Here are our technical challenges:

  • We have a large set of client applications, say thousands or tens of thousands.
  • They will join and leave the network arbitrarily.
  • These applications must share a single eventually-consistent state.
  • Any application can update the state at any point in time.

Let's say that updates are reasonably low-volume. We don't have real time goals. The whole state can fit into memory. Some plausible use cases are:

  • A configuration that is shared by a group of cloud servers.
  • Some game state shared by a group of players.
  • Exchange rate data that is updated in real time and available to applications.

Centralized Versus Decentralized

A first decision we have to make is whether we work with a central server or not. It makes a big difference in the resulting design. The trade-offs are these:

  • Conceptually, a central server is simpler to understand because networks are not naturally symmetrical. With a central server, we avoid all questions of discovery, bind versus connect, and so on.
  • Generally, a fully-distributed architecture is technically more challenging but ends up with simpler protocols. That is, each node must act as server and client in the right way, which is delicate. When done right, the results are simpler than using a central server. We saw this in the Freelance pattern in Chapter 4 - Reliable Request-Reply Patterns.
  • A central server will become a bottleneck in high-volume use cases. If handling scale in the order of millions of messages a second is required, we should aim for decentralization right away.
  • Ironically, a centralized architecture will scale to more nodes more easily than a decentralized one. That is, it's easier to connect 10,000 nodes to one server than to each other.

So, for the Clone pattern we'll work with a server that publishes state updates and a set of clients that represent applications.

Representing State as Key-Value Pairs

We'll develop Clone in stages, solving one problem at a time. First, let's look at how to update a shared state across a set of clients. We need to decide how to represent our state, as well as the updates. The simplest plausible format is a key-value store, where one key-value pair represents an atomic unit of change in the shared state.

We have a simple pub-sub example in Chapter 1 - Basics, the weather server and client. Let's change the server to send key-value pairs, and the client to store these in a hash table. This lets us send updates from one server to a set of clients using the classic pub-sub model.

An update is either a new key-value pair, a modified value for an existing key, or a deleted key. We can assume for now that the whole store fits in memory and that applications access it by key, such as by using a hash table or dictionary. For larger stores and some kind of persistence we'd probably store the state in a database, but that's not relevant here.

This is the server:


Java | Python | Tcl | Ada | Basic | C++ | C# | Clojure | CL | Delphi | Erlang | F# | Felix | Go | Haskell | Haxe | Lua | Node.js | Objective-C | ooc | Perl | PHP | Q | Racket | Ruby | Scala

And here is the client:


Java | Python | Tcl | Ada | Basic | C++ | C# | Clojure | CL | Delphi | Erlang | F# | Felix | Go | Haskell | Haxe | Lua | Node.js | Objective-C | ooc | Perl | PHP | Q | Racket | Ruby | Scala

Figure 58 - Publishing State Updates

fig58.png

Here are some things to note about this first model:

  • All the hard work is done in a kvmsg class. This class works with key-value message objects, which are multipart ZeroMQ messages structured as three frames: a key (a ZeroMQ string), a sequence number (64-bit value, in network byte order), and a binary body (holds everything else).
  • The server generates messages with a randomized 4-digit key, which lets us simulate a large but not enormous hash table (10K entries).
  • We don't implement deletions in this version: all messages are inserts or updates.
  • The server does a 200 millisecond pause after binding its socket. This is to prevent slow joiner syndrome, where the subscriber loses messages as it connects to the server's socket. We'll remove that in later versions of the Clone code.
  • We'll use the terms publisher and subscriber in the code to refer to sockets. This will help later when we have multiple sockets doing different things.

Here is the kvmsg class, in the simplest form that works for now:


Java | Python | Tcl | Ada | Basic | C++ | C# | Clojure | CL | Delphi | Erlang | F# | Felix | Go | Haskell | Haxe | Lua | Node.js | Objective-C | ooc | Perl | PHP | Q | Racket | Ruby | Scala
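
The kvmsg class itself isn't reproduced here; as a sketch of the wire format it implements, this is how the three frames described above could be written with raw libzmq calls (kv_send is a made-up helper name, not part of the real class):

//  Send one key-value message as three frames: key, 64-bit sequence number in
//  network byte order, and an opaque body.
#include <zmq.h>
#include <stdint.h>
#include <string.h>

static void kv_send (void *socket, const char *key, uint64_t sequence,
                     const void *body, size_t body_size)
{
    zmq_send (socket, key, strlen (key), ZMQ_SNDMORE);      //  Frame 0: key
    uint8_t netorder [8];                                   //  Frame 1: sequence
    int byte_nbr;
    for (byte_nbr = 0; byte_nbr < 8; byte_nbr++)
        netorder [byte_nbr] = (uint8_t) (sequence >> (8 * (7 - byte_nbr)));
    zmq_send (socket, netorder, 8, ZMQ_SNDMORE);
    zmq_send (socket, body, body_size, 0);                  //  Frame 2: body
}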

Later, we'll make a more sophisticated kvmsg class that will work in real applications.

Both the server and client maintain hash tables, but this first model only works properly if we start all clients before the server and the clients never crash. That's very artificial.

Getting an Out-of-Band Snapshot

So now we have our second problem: how to deal with late-joining clients or clients that crash and then restart.

In order to allow a late (or recovering) client to catch up with a server, it has to get a snapshot of the server's state. Just as we've reduced "message" to mean "a sequenced key-value pair", we can reduce "state" to mean "a hash table". To get the server state, a client opens a DEALER socket and asks for it explicitly.

To make this work, we have to solve a problem of timing. Getting a state snapshot will take a certain time, possibly fairly long if the snapshot is large. We need to correctly apply updates to the snapshot. But the server won't know when to start sending us updates. One way would be to start subscribing, get a first update, and then ask for "state for update N". This would require the server storing one snapshot for each update, which isn't practical.

Figure 59 - State Replication

fig59.png

So we will do the synchronization in the client, as follows:

  • The client first subscribes to updates and then makes a state request. This guarantees that the state is going to be newer than the oldest update it has.
  • The client waits for the server to reply with state, and meanwhile queues all updates. It does this simply by not reading them: ZeroMQ keeps them queued on the socket queue.
  • When the client receives its state update, it begins once again to read updates. However, it discards any updates that are older than the state update. So if the state update includes updates up to 200, the client will discard updates numbered 200 or lower, and apply from 201 onwards.
  • The client then applies updates to its own state snapshot.

It's a simple model that exploits ZeroMQ's own internal queues. Here's the server:


Java | Python | Tcl | Ada | Basic | C++ | C# | Clojure | CL | Delphi | Erlang | F# | Felix | Go | Haskell | Haxe | Lua | Node.js | Objective-C | ooc | Perl | PHP | Q | Racket | Ruby | Scala

And here is the client:


Java | Python | Tcl | Ada | Basic | C++ | C# | Clojure | CL | Delphi | Erlang | F# | Felix | Go | Haskell | Haxe | Lua | Node.js | Objective-C | ooc | Perl | PHP | Q | Racket | Ruby | Scala

Here are some things to note about these two programs:

  • The server uses two tasks. One thread produces the updates (randomly) and sends these to the main PUB socket, while the other thread handles state requests on the ROUTER socket. The two communicate across PAIR sockets over an inproc:// connection.
  • The client is really simple. In C, it consists of about fifty lines of code. A lot of the heavy lifting is done in the kvmsg class. Even so, the basic Clone pattern is easier to implement than it seemed at first.
  • We don't use anything fancy for serializing the state. The hash table holds a set of kvmsg objects, and the server sends these, as a batch of messages, to the client requesting state. If multiple clients request state at once, each will get a different snapshot.
  • We assume that the client has exactly one server to talk to. The server must be running; we do not try to solve the question of what happens if the server crashes.

Right now, these two programs don't do anything real, but they correctly synchronize state. It's a neat example of how to mix different patterns: PAIR-PAIR, PUB-SUB, and ROUTER-DEALER.

Republishing Updates from Clients

In our second model, changes to the key-value store came from the server itself. This is a centralized model that is useful, for example if we have a central configuration file we want to distribute, with local caching on each node. A more interesting model takes updates from clients, not the server. The server thus becomes a stateless broker. This gives us some benefits:

  • We're less worried about the reliability of the server. If it crashes, we can start a new instance and feed it new values.
  • We can use the key-value store to share knowledge between active peers.

To send updates from clients back to the server, we could use a variety of socket patterns. The simplest plausible solution is a PUSH-PULL combination.

Why don't we allow clients to publish updates directly to each other? While this would reduce latency, it would remove the guarantee of consistency. You can't get consistent shared state if you allow the order of updates to change depending on who receives them. Say we have two clients, changing different keys. This will work fine. But if the two clients try to change the same key at roughly the same time, they'll end up with different notions of its value.

There are a few strategies for obtaining consistency when changes happen in multiple places at once. We'll use the approach of centralizing all change. No matter the precise timing of the changes that clients make, they are all pushed through the server, which enforces a single sequence according to the order in which it gets updates.

Figure 60 - Republishing Updates

fig60.png

By mediating all changes, the server can also add a unique sequence number to all updates. With unique sequencing, clients can detect the nastier failures, including network congestion and queue overflow. If a client discovers that its incoming message stream has a hole, it can take action. It seems sensible that the client contact the server and ask for the missing messages, but in practice that isn't useful. If there are holes, they're caused by network stress, and adding more stress to the network will make things worse. All the client can do is warn its users that it is "unable to continue", stop, and not restart until someone has manually checked the cause of the problem.

We'll now generate state updates in the client. Here's the server:


Java | Python | Tcl | Ada | Basic | C++ | C# | Clojure | CL | Delphi | Erlang | F# | Felix | Go | Haskell | Haxe | Lua | Node.js | Objective-C | ooc | Perl | PHP | Q | Racket | Ruby | Scala

And here is the client:


Java | Python | Tcl | Ada | Basic | C++ | C# | Clojure | CL | Delphi | Erlang | F# | Felix | Go | Haskell | Haxe | Lua | Node.js | Objective-C | ooc | Perl | PHP | Q | Racket | Ruby | Scala

Here are some things to note about this third design:

  • The server has collapsed to a single task. It manages a PULL socket for incoming updates, a ROUTER socket for state requests, and a PUB socket for outgoing updates.
  • The client uses a simple tickless timer to send a random update to the server once a second. In a real implementation, we would drive updates from application code.

Working with Subtrees

As we grow the number of clients, the size of our shared store will also grow. It stops being reasonable to send everything to every client. This is the classic story with pub-sub: when you have a very small number of clients, you can send every message to all clients. As you grow the architecture, this becomes inefficient. Clients specialize in different areas.

So even when working with a shared store, some clients will want to work only with a part of that store, which we call a subtree. The client has to request the subtree when it makes a state request, and it must specify the same subtree when it subscribes to updates.

There are a couple of common syntaxes for trees. One is the path hierarchy, and another is the topic tree. These look like this:

  • Path hierarchy: /some/list/of/paths
  • Topic tree: some.list.of.topics

We'll use the path hierarchy, and extend our client and server so that a client can work with a single subtree. Once you see how to work with a single subtree you'll be able to extend this yourself to handle multiple subtrees, if your use case demands it.

Here's the server implementing subtrees, a small variation on Model Three:


Java | Python | Tcl | Ada | Basic | C++ | C# | Clojure | CL | Delphi | Erlang | F# | Felix | Go | Haskell | Haxe | Lua | Node.js | Objective-C | ooc | Perl | PHP | Q | Racket | Ruby | Scala

And here is the corresponding client:


Java | Python | Tcl | Ada | Basic | C++ | C# | Clojure | CL | Delphi | Erlang | F# | Felix | Go | Haskell | Haxe | Lua | Node.js | Objective-C | ooc | Perl | PHP | Q | Racket | Ruby | Scala

Ephemeral Values

An ephemeral value is one that expires automatically unless regularly refreshed. If you think of Clone being used for a registration service, then ephemeral values would let you do dynamic values. A node joins the network, publishes its address, and refreshes this regularly. If the node dies, its address eventually gets removed.

The usual abstraction for ephemeral values is to attach them to a session, and delete them when the session ends. In Clone, sessions would be defined by clients, and would end if the client died. A simpler alternative is to attach a time to live (TTL) to ephemeral values, which the server uses to expire values that haven't been refreshed in time.

A good design principle that I use whenever possible is to not invent concepts that are not absolutely essential. If we have very large numbers of ephemeral values, sessions will offer better performance. If we use a handful of ephemeral values, it's fine to set a TTL on each one. If we use masses of ephemeral values, it's more efficient to attach them to sessions and expire them in bulk. This isn't a problem we face at this stage, and may never face, so sessions go out the window.

Now we will implement ephemeral values. First, we need a way to encode the TTL in the key-value message. We could add a frame. The problem with using ZeroMQ frames for properties is that each time we want to add a new property, we have to change the message structure. It breaks compatibility. So let's add a properties frame to the message, and write the code to let us get and put property values.

Next, we need a way to say, "delete this value". Up until now, servers and clients have always blindly inserted or updated new values into their hash table. We'll say that if the value is empty, that means "delete this key".

Here's a more complete version of the kvmsg class, which implements the properties frame (and adds a UUID frame, which we'll need later on). It also handles empty values by deleting the key from the hash, if necessary:


Java | Python | Tcl | Ada | Basic | C++ | C# | Clojure | CL | Delphi | Erlang | F# | Felix | Go | Haskell | Haxe | Lua | Node.js | Objective-C | ooc | Perl | PHP | Q | Racket | Ruby | Scala

The Model Five client is almost identical to Model Four. It uses the full kvmsg class now, and sets a randomized ttl property (measured in seconds) on each message:

kvmsg_set_prop (kvmsg, "ttl", "%d", randof (30));

Using a Reactor

Until now, we have used a poll loop in the server. In this next model of the server, we switch to using a reactor. In C, we use CZMQ's zloop class. Using a reactor makes the code more verbose, but easier to understand and build out because each piece of the server is handled by a separate reactor handler.

We use a single thread and pass a server object around to the reactor handlers. We could have organized the server as multiple threads, each handling one socket or timer, but that works better when threads don't have to share data. In this case all work is centered around the server's hashmap, so one thread is simpler.

There are three reactor handlers:

  • One to handle snapshot requests coming on the ROUTER socket;
  • One to handle incoming updates from clients, coming on the PULL socket;
  • One to expire ephemeral values that have passed their TTL.

Java | Python | Tcl | Ada | Basic | C++ | C# | Clojure | CL | Delphi | Erlang | F# | Felix | Go | Haskell | Haxe | Lua | Node.js | Objective-C | ooc | Perl | PHP | Q | Racket | Ruby | Scala
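
The server source isn't shown here; as a rough skeleton only, and assuming the CZMQ zloop API of that era (handlers take a zmq_pollitem_t and return 0 to keep looping or -1 to stop; the port numbers are placeholders), the three handlers hang off the reactor like this:

#include <czmq.h>

static int s_snapshots (zloop_t *loop, zmq_pollitem_t *poller, void *args)
{
    //  Receive ICANHAZ? from a client on the ROUTER socket and reply
    //  with KVSYNC messages followed by KTHXBAI
    return 0;
}
static int s_collector (zloop_t *loop, zmq_pollitem_t *poller, void *args)
{
    //  Receive an update on the PULL socket, store it in the hashmap,
    //  and republish it on the PUB socket
    return 0;
}
static int s_flush_ttl (zloop_t *loop, zmq_pollitem_t *poller, void *args)
{
    //  Walk the hashmap and delete any key-value pair whose TTL has expired
    return 0;
}

int main (void)
{
    zctx_t *ctx = zctx_new ();
    void *snapshot = zsocket_new (ctx, ZMQ_ROUTER);
    void *collector = zsocket_new (ctx, ZMQ_PULL);
    zsocket_bind (snapshot, "tcp://*:5556");
    zsocket_bind (collector, "tcp://*:5558");

    zloop_t *loop = zloop_new ();
    zmq_pollitem_t snapshot_poller = { snapshot, 0, ZMQ_POLLIN, 0 };
    zmq_pollitem_t collector_poller = { collector, 0, ZMQ_POLLIN, 0 };
    zloop_poller (loop, &snapshot_poller, s_snapshots, NULL);
    zloop_poller (loop, &collector_poller, s_collector, NULL);
    zloop_timer (loop, 1000, 0, s_flush_ttl, NULL);     //  every second, forever
    zloop_start (loop);                                 //  blocks until interrupted

    zloop_destroy (&loop);
    zctx_destroy (&ctx);
    return 0;
}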

Adding the Binary Star Pattern for Reliability

The Clone models we've explored up to now have been relatively simple. Now we're going to get into unpleasantly complex territory, which has me getting up for another espresso. You should appreciate that making "reliable" messaging is complex enough that you always need to ask, "Do we actually need this?" before jumping into it. If you can get away with unreliable or with "good enough" reliability, you can make a huge win in terms of cost and complexity. Sure, you may lose some data now and then. It is often a good trade-off. Having said that, and… sips… because the espresso is really good, let's jump in.

As you play with the last model, you'll stop and restart the server. It might look like it recovers, but of course it's applying updates to an empty state instead of the proper current state. Any new client joining the network will only get the latest updates instead of the full historical record.

What we want is a way for the server to recover from being killed, or crashing. We also need to provide backup in case the server is out of commission for any length of time. When someone asks for "reliability", ask them to list the failures they want to handle. In our case, these are:

  • The server process crashes and is automatically or manually restarted. The process loses its state and has to get it back from somewhere.
  • The server machine dies and is offline for a significant time. Clients have to switch to an alternate server somewhere.
  • The server process or machine gets disconnected from the network, e.g., a switch dies or a datacenter gets knocked out. It may come back at some point, but in the meantime clients need an alternate server.

Our first step is to add a second server. We can use the Binary Star pattern from Chapter 4 - Reliable Request-Reply Patterns to organize these into primary and backup. Binary Star is a reactor, so it's useful that we already refactored the last server model into a reactor style.

We need to ensure that updates are not lost if the primary server crashes. The simplest technique is to send them to both servers. The backup server can then act as a client, and keep its state synchronized by receiving updates as all clients do. It'll also get new updates from clients. It can't yet store these in its hash table, but it can hold onto them for a while.

So, Model Six introduces the following changes over Model Five:

  • We use a pub-sub flow instead of a push-pull flow for client updates sent to the servers. This takes care of fanning out the updates to both servers. Otherwise we'd have to use two DEALER sockets.
  • We add heartbeats to server updates (to clients), so that a client can detect when the primary server has died. It can then switch over to the backup server.
  • We connect the two servers using the Binary Star bstar reactor class. Binary Star relies on the clients to vote by making an explicit request to the server they consider active. We'll use snapshot requests as the voting mechanism.
  • We make all update messages uniquely identifiable by adding a UUID field. The client generates this, and the server propagates it back on republished updates.
  • The passive server keeps a "pending list" of updates that it has received from clients, but not yet from the active server; or updates it's received from the active server, but not yet from the clients. The list is ordered from oldest to newest, so that it is easy to remove updates off the head.

Figure 61 - Clone Client Finite State Machine

fig61.png

It's useful to design the client logic as a finite state machine. The client cycles through three states:

  • The client opens and connects its sockets, and then requests a snapshot from the first server. To avoid request storms, it will ask any given server only twice. One request might get lost, which would be bad luck. Two getting lost would be carelessness.
  • The client waits for a reply (snapshot data) from the current server, and if it gets it, it stores it. If there is no reply within some timeout, it fails over to the next server.
  • When the client has gotten its snapshot, it waits for and processes updates. Again, if it doesn't hear anything from the server within some timeout, it fails over to the next server.

The client loops forever. It's quite likely during startup or failover that some clients may be trying to talk to the primary server while others are trying to talk to the backup server. The Binary Star state machine handles this, hopefully accurately. It's hard to prove software correct; instead we hammer it until we can't prove it wrong.

Failover happens as follows:

  • The client detects that primary server is no longer sending heartbeats, and concludes that it has died. The client connects to the backup server and requests a new state snapshot.
  • The backup server starts to receive snapshot requests from clients, and detects that primary server has gone, so it takes over as primary.
  • The backup server applies its pending list to its own hash table, and then starts to process state snapshot requests.

When the primary server comes back online, it will:

  • Start up as passive server, and connect to the backup server as a Clone client.
  • Start to receive updates from clients, via its SUB socket.

We make a few assumptions:

  • At least one server will keep running. If both servers crash, we lose all server state and there's no way to recover it.
  • Multiple clients do not update the same hash keys at the same time. Client updates will arrive at the two servers in a different order. Therefore, the backup server may apply updates from its pending list in a different order than the primary server would or did. Updates from one client will always arrive in the same order on both servers, so that is safe.

Thus the architecture for our high-availability server pair using the Binary Star pattern has two servers and a set of clients that talk to both servers.

Figure 62 - High-availability Clone Server Pair

fig62.png

Here is the sixth and last model of the Clone server:


Java | Python | Ada | Basic | C++ | C# | Clojure | CL | Delphi | Erlang | F# | Felix | Go | Haskell | Haxe | Lua | Node.js | Objective-C | ooc | Perl | PHP | Q | Racket | Ruby | Scala | Tcl

This model is only a few hundred lines of code, but it took quite a while to get working. To be accurate, building Model Six took about a full week of "Sweet god, this is just too complex for an example" hacking. We've assembled pretty much everything and the kitchen sink into this small application. We have failover, ephemeral values, subtrees, and so on. What surprised me was that the up-front design was pretty accurate. Still, the details of writing and debugging so many socket flows are quite challenging.

The reactor-based design removes a lot of the grunt work from the code, and what remains is simpler and easier to understand. We reuse the bstar reactor from Chapter 4 - Reliable Request-Reply Patterns. The whole server runs as one thread, so there's no inter-thread weirdness going on—just a structure pointer (self) passed around to all handlers, which can do their thing happily. One nice side effect of using reactors is that the code, being less tightly integrated into a poll loop, is much easier to reuse. Large chunks of Model Six are taken from Model Five.

I built it piece by piece, and got each piece working properly before going onto the next one. Because there are four or five main socket flows, that meant quite a lot of debugging and testing. I debugged just by dumping messages to the console. Don't use classic debuggers to step through ZeroMQ applications; you need to see the message flows to make any sense of what is going on.

For testing, I always try to use Valgrind, which catches memory leaks and invalid memory accesses. In C, this is a major concern, as you can't delegate to a garbage collector. Using proper and consistent abstractions like kvmsg and CZMQ helps enormously.

The Clustered Hashmap Protocol

While the server is pretty much a mashup of the previous model plus the Binary Star pattern, the client is quite a lot more complex. But before we get to that, let's look at the final protocol. I've written this up as a specification on the ZeroMQ RFC website as the Clustered Hashmap Protocol.

Roughly, there are two ways to design a complex protocol such as this one. One way is to separate each flow into its own set of sockets. This is the approach we used here. The advantage is that each flow is simple and clean. The disadvantage is that managing multiple socket flows at once can be quite complex. Using a reactor makes it simpler, but still, it makes a lot of moving pieces that have to fit together correctly.

The second way to make such a protocol is to use a single socket pair for everything. In this case, I'd have used ROUTER for the server and DEALER for the clients, and then done everything over that connection. It makes for a more complex protocol but at least the complexity is all in one place. In Chapter 7 - Advanced Architecture using ZeroMQ we'll look at an example of a protocol done over a ROUTER-DEALER combination.

Let's take a look at the CHP specification. Note that "SHOULD", "MUST" and "MAY" are key words we use in protocol specifications to indicate requirement levels.

Goals

CHP is meant to provide a basis for reliable pub-sub across a cluster of clients connected over a ZeroMQ network. It defines a "hashmap" abstraction consisting of key-value pairs. Any client can modify any key-value pair at any time, and changes are propagated to all clients. A client can join the network at any time.

Architecture

CHP connects a set of client applications and a set of servers. Clients connect to the server. Clients do not see each other. Clients can come and go arbitrarily.

Ports and Connections

The server MUST open three ports as follows:

  • A SNAPSHOT port (ZeroMQ ROUTER socket) at port number P.
  • A PUBLISHER port (ZeroMQ PUB socket) at port number P + 1.
  • A COLLECTOR port (ZeroMQ SUB socket) at port number P + 2.

The client SHOULD open at least two connections:

  • A SNAPSHOT connection (ZeroMQ DEALER socket) to port number P.
  • A SUBSCRIBER connection (ZeroMQ SUB socket) to port number P + 1.

The client MAY open a third connection, if it wants to update the hashmap:

  • A PUBLISHER connection (ZeroMQ PUB socket) to port number P + 2.


State Synchronization

The client MUST start by sending an ICANHAZ command to its snapshot connection. This command consists of two frames as follows:

ICANHAZ command
-----------------------------------
Frame 0: "ICANHAZ?"
Frame 1: subtree specification

Both frames are ZeroMQ strings. The subtree specification MAY be empty. If not empty, it consists of a slash followed by one or more path segments, ending in a slash.
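
For instance, a client built on raw libzmq could issue the command like this (s_send_icanhaz is a made-up helper; the subtree value is only an example):

#include <zmq.h>
#include <string.h>

//  Send an ICANHAZ command on the snapshot (DEALER) connection
static void s_send_icanhaz (void *snapshot, const char *subtree)
{
    zmq_send (snapshot, "ICANHAZ?", 8, ZMQ_SNDMORE);        //  Frame 0
    zmq_send (snapshot, subtree, strlen (subtree), 0);      //  Frame 1, e.g. "/client/"
}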

The server MUST respond to an ICANHAZ command by sending zero or more KVSYNC commands to its snapshot port, followed with a KTHXBAI command. The server MUST prefix each command with the identity of the client, as provided by ZeroMQ with the ICANHAZ command. This extra frame is not shown in the commands explained below. The KVSYNC command specifies a single key-value pair as follows:

KVSYNC command
-----------------------------------
Frame 0: key, as ZeroMQ string
Frame 1: sequence number, 8 bytes in network order
Frame 2: <empty>
Frame 3: <empty>
Frame 4: value, as blob

The sequence number has no significance and may be zero.

The KTHXBAI command takes this form:

KTHXBAI command
-----------------------------------
Frame 0: "KTHXBAI"
Frame 1: sequence number, 8 bytes in network order
Frame 2: <empty>
Frame 3: <empty>
Frame 4: subtree specification

The sequence number MUST be the highest sequence number of the KVSYNC commands previously sent.

When the client has received a KTHXBAI command, it SHOULD start to receive messages from its subscriber connection and apply them.

Server-to-Client Updates

When the server has an update for its hashmap it MUST broadcast this on its publisher socket as a KVPUB command. The KVPUB command has this form:

KVPUB command
-----------------------------------
Frame 0: key, as ZeroMQ string
Frame 1: sequence number, 8 bytes in network order
Frame 2: UUID, 16 bytes
Frame 3: properties, as ZeroMQ string
Frame 4: value, as blob

The sequence number MUST be strictly incremental. The client MUST discard any KVPUB commands whose sequence numbers are not strictly greater than the last KTHXBAI or KVPUB command received.

The UUID is optional and frame 2 MAY be empty (size zero). The properties field is formatted as zero or more instances of "name=value" followed by a newline character. If the key-value pair has no properties, the properties field is empty.

If the value is empty, the client SHOULD delete its key-value entry with the specified key.

In the absence of other updates the server SHOULD send a HUGZ command at regular intervals, e.g., once per second. The HUGZ command has this format:

HUGZ command
-----------------------------------
Frame 0: "HUGZ"
Frame 1: 00000000
Frame 2: <empty>
Frame 3: <empty>
Frame 4: <empty>

The client MAY treat the absence of HUGZ as an indicator that the server has crashed (see Reliability below).

Client-to-Server Updates

When the client has an update for its hashmap, it MAY send this to the server via its publisher connection as a KVSET command. The KVSET command has this form:

KVSET command
-----------------------------------
Frame 0: key, as ZeroMQ string
Frame 1: sequence number, 8 bytes in network order
Frame 2: UUID, 16 bytes
Frame 3: properties, as ZeroMQ string
Frame 4: value, as blob

The sequence number has no significance and may be zero. The UUID SHOULD be a universally unique identifier, if a reliable server architecture is used.

If the value is empty, the server MUST delete its key-value entry with the specified key.

The server SHOULD accept the following properties:

  • ttl: specifies a time-to-live in seconds. If the KVSET command has a ttl property, the server SHOULD delete the key-value pair and broadcast a KVPUB with an empty value in order to delete this from all clients when the TTL has expired.

Reliability

CHP may be used in a dual-server configuration where a backup server takes over if the primary server fails. CHP does not specify the mechanisms used for this failover but the Binary Star pattern may be helpful.

To assist server reliability, the client MAY:

  • Set a UUID in every KVSET command.
  • Detect the lack of HUGZ over a time period and use this as an indicator that the current server has failed.
  • Connect to a backup server and re-request a state synchronization.

Scalability and Performance

CHP is designed to be scalable to large numbers (thousands) of clients, limited only by system resources on the broker. Because all updates pass through a single server, the overall throughput will be limited to some millions of updates per second at peak, and probably less.

Security

CHP does not implement any authentication, access control, or encryption mechanisms and should not be used in any deployment where these are required.

Building a Multithreaded Stack and API

The client stack we've used so far isn't smart enough to handle this protocol properly. As soon as we start doing heartbeats, we need a client stack that can run in a background thread. In the Freelance pattern at the end of Chapter 4 - Reliable Request-Reply Patterns we used a multithreaded API but didn't explain it in detail. It turns out that multithreaded APIs are quite useful when you start to make more complex ZeroMQ protocols like CHP.

Figure 63 - Multithreaded API

fig63.png

If you make a nontrivial protocol and you expect applications to implement it properly, most developers will get it wrong most of the time. You're going to be left with a lot of unhappy people complaining that your protocol is too complex, too fragile, and too hard to use. Whereas if you give them a simple API to call, you have some chance of them buying in.

Our multithreaded API consists of a frontend object and a background agent, connected by two PAIR sockets. Connecting two PAIR sockets like this is so useful that your high-level binding should probably do what CZMQ does, which is package a "create new thread with a pipe that I can use to send messages to it" method.

The multithreaded APIs that we see in this book all take the same form:

  • The constructor for the object (clone_new) creates a context and starts a background thread connected with a pipe. It holds onto one end of the pipe so it can send commands to the background thread.
  • The background thread starts an agent that is essentially a zmq_poll loop reading from the pipe socket and any other sockets (here, the DEALER and SUB sockets).
  • The main application thread and the background thread now communicate only via ZeroMQ messages. By convention, the frontend sends string commands so that each method on the class turns into a message sent to the backend agent, like this:

void
clone_connect (clone_t *self, char *address, char *service)
{
    assert (self);
    zmsg_t *msg = zmsg_new ();
    zmsg_addstr (msg, "CONNECT");
    zmsg_addstr (msg, address);
    zmsg_addstr (msg, service);
    zmsg_send (&msg, self->pipe);
}

  • If the method needs a return code, it can wait for a reply message from the agent.
  • If the agent needs to send asynchronous events back to the frontend, we add a recv method to the class, which waits for messages on the frontend pipe.
  • We may want to expose the frontend pipe socket handle to allow the class to be integrated into further poll loops. Otherwise any recv method would block the application.

The clone class has the same structure as the flcliapi class from Chapter 4 - Reliable Request-Reply Patterns and adds the logic from the last model of the Clone client. Without ZeroMQ, this kind of multithreaded API design would be weeks of really hard work. With ZeroMQ, it was a day or two of work.

The actual API methods for the clone class are quite simple:

// Create a new clone class instance
clone_t *
clone_new (void);

// Destroy a clone class instance
void
clone_destroy (clone_t **self_p);

// Define the subtree, if any, for this clone class
void
clone_subtree (clone_t *self, char *subtree);

// Connect the clone class to one server
void
clone_connect (clone_t *self, char *address, char *service);

// Set a value in the shared hashmap
void
clone_set (clone_t *self, char *key, char *value, int ttl);

// Get a value from the shared hashmap
char *
clone_get (clone_t *self, char *key);

So here is Model Six of the clone client, which has now become just a thin shell using the clone class:


Java | Python | Ada | Basic | C++ | C# | Clojure | CL | Delphi | Erlang | F# | Felix | Go | Haskell | Haxe | Lua | Node.js | Objective-C | ooc | Perl | PHP | Q | Racket | Ruby | Scala | Tcl

Note the connect method, which specifies one server endpoint. Under the hood, we're in fact talking to three ports. However, as the CHP protocol says, the three ports are on consecutive port numbers:

  • The server state router (ROUTER) is at port P.
  • The server updates publisher (PUB) is at port P + 1.
  • The server updates subscriber (SUB) is at port P + 2.

So we can fold the three connections into one logical operation (which we implement as three separate ZeroMQ connect calls).
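
In other words, the agent's connect handler does roughly this (a sketch; the socket names mirror the description above, address is e.g. "tcp://localhost" and port is P):

#include <zmq.h>
#include <stdio.h>

static void agent_connect (void *snapshot, void *subscriber, void *publisher,
                           const char *address, int port)
{
    char endpoint [256];
    snprintf (endpoint, sizeof (endpoint), "%s:%d", address, port);
    zmq_connect (snapshot, endpoint);       //  DEALER -> server ROUTER at P
    snprintf (endpoint, sizeof (endpoint), "%s:%d", address, port + 1);
    zmq_connect (subscriber, endpoint);     //  SUB -> server PUB at P + 1
    snprintf (endpoint, sizeof (endpoint), "%s:%d", address, port + 2);
    zmq_connect (publisher, endpoint);      //  PUB -> server SUB at P + 2
}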

Let's end with the source code for the clone stack. This is a complex piece of code, but easier to understand when you break it into the frontend object class and the backend agent. The frontend sends string commands ("SUBTREE", "CONNECT", "SET", "GET") to the agent, which handles these commands as well as talking to the server(s). Here is the agent's logic:

  1. Start up by getting a snapshot from the first server.
  2. When we get a snapshot switch to reading from the subscriber socket.
  3. If we don't get a snapshot then fail over to the second server.
  4. Poll on the pipe and the subscriber socket.
  5. If we got input on the pipe, handle the control message from the frontend object.
  6. If we got input on the subscriber, store or apply the update.
  7. If we didn't get anything from the server within a certain time, fail over.
  8. Repeat until the process is interrupted by Ctrl-C.
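
Condensed into a skeleton (names and the one-second timeout are illustrative, not the actual clone class source), the agent's loop follows the steps above:

#include <zmq.h>

//  pipe talks to the frontend object; snapshot and subscriber talk to the
//  currently-selected server
static void agent_loop (void *pipe, void *snapshot, void *subscriber)
{
    while (1) {
        zmq_pollitem_t items [] = {
            { pipe, 0, ZMQ_POLLIN, 0 },
            { snapshot, 0, ZMQ_POLLIN, 0 },
            { subscriber, 0, ZMQ_POLLIN, 0 }
        };
        //  Poll with a timeout so silence from the server can trigger failover
        if (zmq_poll (items, 3, 1000) == -1)
            break;                      //  interrupted
        if (items [0].revents & ZMQ_POLLIN) {
            //  Control message from the frontend: SUBTREE, CONNECT, SET, or GET
        }
        if (items [1].revents & ZMQ_POLLIN) {
            //  Snapshot data (ending in KTHXBAI): apply it, then rely on updates
        }
        if (items [2].revents & ZMQ_POLLIN) {
            //  KVPUB update or HUGZ heartbeat: store or apply it
        }
        //  If the current server stayed silent past its expiry, fail over here
    }
}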

And here is the actual clone class implementation:


Java | Python | Ada | Basic | C++ | C# | Clojure | CL | Delphi | Erlang | F# | Felix | Go | Haskell | Haxe | Lua | Node.js | Objective-C | ooc | Perl | PHP | Q | Racket | Ruby | Scala | Tcl


Chapter 6 - The ZeroMQ Community


People sometimes ask me what's so special about ZeroMQ. My standard answer is that ZeroMQ is arguably the best answer we have to the vexing question of "How do we make the distributed software that the 21st century demands?" But more than that, ZeroMQ is special because of its community. This is ultimately what separates the wolves from the sheep.

There are three main open source patterns. The first is the large firm dumping code to break the market for others. This is the Apache Foundation model. The second is tiny teams or small firms building their dream. This is the most common open source model, which can be very successful commercially. The last is aggressive and diverse communities that swarm over a problem landscape. This is the Linux model, and the one to which we aspire with ZeroMQ.

It's hard to overemphasize the power and persistence of a working open source community. There really does not seem to be a better way of making software for the long term. Not only does the community choose the best problems to solve, it solves them minimally, carefully, and it then looks after these answers for years, decades, until they're no longer relevant, and then it quietly puts them away.

To really benefit from ZeroMQ, you need to understand the community. At some point down the road you'll want to submit a patch, an issue, or an add-on. You might want to ask someone for help. You will probably want to bet a part of your business on ZeroMQ, and when I tell you that the community is much, much more important than the company that backs the product, even though I'm CEO of that company, this should be significant.

In this chapter I'm going to look at our community from several angles and conclude by explaining in detail our contract for collaboration, which we call "C4". You should find the discussion useful for your own work. We've also adapted the ZeroMQ C4 process for closed source projects with good success.

We'll cover:

  • The rough structure of ZeroMQ as a set of projects
  • What "software architecture" is really about
  • Why we use the LGPL and not the BSD license
  • How we designed and grew the ZeroMQ community
  • The business that backs ZeroMQ
  • Who owns the ZeroMQ source code
  • How to make and submit a patch to ZeroMQ
  • Who controls what patches actually go into ZeroMQ
  • How we guarantee compatibility with old code
  • Why we don't use public git branches
  • Who decides on the ZeroMQ road map
  • A worked example of a change to libzmq

Architecture of the ZeroMQ Community


You know that ZeroMQ is an LGPL-licensed project. In fact it's a collection of projects, built around the core library, libzmq. I'll visualize these projects as an expanding galaxy:

  • At the core, libzmq is the ZeroMQ core library. It's written in C++, with a low-level C API. The code is nasty, mainly because it's highly optimized but also because it's written in C++, a language that lends itself to subtle and deep nastiness. Martin Sustrik wrote the bulk of this code. Today it has dozens of people who maintain different parts of it.
  • Around libzmq, there are about 50 bindings. These are individual projects that create higher-level APIs for ZeroMQ, or at least map the low-level API into other languages. The bindings vary in quality from experimental to utterly awesome. Probably the most impressive binding is PyZMQ, which was one of the first community projects on top of ZeroMQ. If you are a binding author, you should really study PyZMQ and aspire to making your code and community as great.
  • A lot of languages have multiple bindings (Erlang, Ruby, C#, at least) written by different people over time, or taking varying approaches. We don't regulate these in any way. There are no "official" bindings. You vote by using one or the other, contributing to it, or ignoring it.
  • There are a series of reimplementations of libzmq, starting with JeroMQ, a full Java translation of the library, which is now the basis for NetMQ, a C# stack. These native stacks offer similar or identical APIs, and speak the same protocol (ZMTP) as libzmq.
  • On top of the bindings are a lot of projects that use ZeroMQ or build on it. See the "Labs" page on the wiki for a long list of projects and proto-projects that use ZeroMQ in some way. There are frameworks, web servers like Mongrel2, brokers like Majordomo, and enterprise open source tools like Storm.

Libzmq, most of the bindings, and some of the outer projects sit in the ZeroMQ community "organization" on GitHub. This organization is "run" by a group consisting of the most senior binding authors. There's very little to run as it's almost all self-managing and there's zero conflict these days.

iMatix, my firm, plays a specific role in the community. We own the trademarks and enforce them discretely in order to make sure that if you download a package calling itself "ZeroMQ", you can trust what you are getting. People have on rare occasion tried to hijack the name, maybe believing that "free software" means there is no property at stake and no one willing to defend it. One thing you'll understand from this chapter is how seriously we take the process behind our software (and I mean "us" as a community, not a company). iMatix backs the community by enforcing that process on anything calling itself "ZeroMQ" or "ØMQ". We also put money and time into the software and packaging for reasons I'll explain later.

It is not a charity exercise. ZeroMQ is a for-profit project, and a very profitable one. The profits are widely distributed among all those who invest in it. It's really that simple: take the time to become an expert in ZeroMQ, or build something useful on top of ZeroMQ, and you'll find your value as an individual, or team, or company increasing. iMatix enjoys the same benefits as everyone else in the community. It's win-win to everyone except our competitors, who find themselves facing a threat they can't beat and can't really escape. ZeroMQ dominates the future world of massively distributed software.

My firm doesn't just have the community's back—we also built the community. This was deliberate work; in the original ZeroMQ white paper from 2007, there were two projects. One was technical, how to make a better messaging system. The second was how to build a community that could take the software to dominant success. Software dies, but community survives.

How to Make Really Large Architectures


There are, it has been said (at least by people reading this sentence out loud), two ways to make really large-scale(大规模的) software. Option One is to throw massive amounts(数量) of money and problems at empires of smart people, and hope that what emerges(浮现) is not yet another career(事业) killer. If you're very lucky and are building on lots of experience, have kept your teams solid, and are not aiming for technical brilliance(光辉), and are furthermore(此外) incredibly(难以置信地) lucky, it works.

But gambling(赌博) with hundreds of millions of others' money isn't for everyone. For the rest of us who want to build large-scale software, there's Option Two, which is open source, and more specifically(特殊的), free software. If you're asking how the choice of software license is relevant(有关的) to the scale(规模) of the software you build, that's the right question.

The brilliant and visionary Eben Moglen once said, roughly, that a free software license is the contract on which a community builds. When I heard this, about ten years ago, the idea came to me: can we deliberately grow free software communities?

Ten years later, the answer is "yes", and there is almost a science to it. I say "almost" because we don't yet have enough evidence of people doing this deliberately with a documented, reproducible process. It is what I'm trying to do with Social Architecture. ZeroMQ came after Wikidot, after the Digital Standards Organization (Digistan), and after the Foundation for a Free Information Infrastructure (aka the FFII, an NGO that fights against software patents). This all came after a lot of less successful community projects like Xitami and Libero. My main takeaway from a long career of projects of every conceivable format is: if you want to build truly large-scale and long-lasting software, aim to build a free software community.

Psychology of Software Architecture

Dirkjan Ochtman pointed me to Wikipedia's definition of Software Architecture as "the set of structures needed to reason about the system, which comprise software elements, relations among them, and properties of both". For me this vapid and circular jargon is a good example of how miserably little we understand what actually makes a successful large-scale software architecture.

Architecture is the art and science of making large artificial structures for human use. If there is one thing I've learned and applied successfully in 30 years of making larger and larger software systems, it is this: software is about people. Large structures in themselves are meaningless. It's how they function for human use that matters. And in software, human use starts with the programmers who make the software itself.

The core problems in software architecture are driven by human psychology, not technology. There are many ways our psychology affects our work. I could point to the way teams seem to get stupider as they get larger or when they have to work across larger distances. Does that mean the smaller the team, the more effective? How then does a large global community like ZeroMQ manage to work successfully?

The ZeroMQ community wasn't accidental. It was a deliberate design, my contribution to the early days when the code came out of a cellar in Bratislava. The design was based on my pet science of "Social Architecture", which Wikipedia defines as "the conscious design of an environment that encourages a desired range of social behaviors leading towards some goal or set of goals." I define this more specifically as "the process, and the product, of planning, designing, and growing an online community."

One of the tenets of Social Architecture is that how we organize is more significant than who we are. The same group, organized differently, can produce wholly different results. We are like peers in a ZeroMQ network, and our communication patterns have a dramatic impact on our performance. Ordinary people, well connected, can far outperform a team of experts using poor patterns. If you're the architect of a larger ZeroMQ application, you're going to have to help others find the right patterns for working together. Do this right, and your project can succeed. Do it wrong, and your project will fail.

The two most important psychological elements are that we're really bad at understanding complexity and that we are so good at working together to divide and conquer large problems. We're highly social apes, and kind of smart, but only in the right kind of crowd.

So here is my short list of the Psychological Elements of Software Architecture:

  • Stupidity: our mental bandwidth is limited, so we're all stupid at some point. The architecture has to be simple to understand. This is the number one rule: simplicity beats functionality, every single time. If you can't understand an architecture on a cold gray Monday morning before coffee, it is too complex.
  • Selfishness: we act only out of self-interest, so the architecture must create space and opportunity for selfish acts that benefit the whole. Selfishness is often indirect and subtle. For example, I'll spend hours helping someone else understand something because that could be worth days to me later.
  • Laziness: we make lots of assumptions, many of which are wrong. We are happiest when we can spend the least effort to get a result or to test an assumption quickly, so the architecture has to make this possible. Specifically, that means it must be simple.
  • Jealousy: we're jealous of others, which means we'll overcome our stupidity and laziness to prove others wrong and beat them in competition. The architecture thus has to create space for public competition based on fair rules that anyone can understand.
  • Fear: we're unwilling to take risks, especially if it makes us look stupid. Fear of failure is a major reason people conform and follow the group in mass stupidity. The architecture should make silent experimentation easy and cheap, giving people opportunity for success without punishing failure.
  • Reciprocity: we'll pay extra in terms of hard work, even money, to punish cheats and enforce fair rules. The architecture should be heavily rule-based, telling people how to work together, but not what to work on.
  • Conformity: we're happiest to conform, out of fear and laziness, which means if the patterns are good, clearly explained and documented, and fairly enforced, we'll naturally choose the right path every time.
  • Pride: we're intensely aware of our social status, and we'll work hard to avoid looking stupid or incompetent in public. The architecture has to make sure every piece we make has our name on it, so we'll have sleepless nights stressing about what others will say about our work.
  • Greed: we're ultimately economic animals (see selfishness), so the architecture has to give us economic incentive to invest in making it happen. Maybe it's polishing our reputation as experts, maybe it's literally making money from some skill or component. It doesn't matter what it is, but there must be economic incentive. Think of architecture as a market place, not an engineering design.

These strategies work on a large scale but also on a small scale, within an organization or team.

The Importance of Contracts

Let me discuss a contentious but important area, which is what license to choose. I'll say "BSD" to cover MIT, X11, BSD, Apache, and similar licenses, and "GPL" to cover GPLv3, LGPLv3, and AGPLv3. The significant difference is the obligation to share back any forked versions, which prevents any entity from capturing the software, and thus keeps it "free".

A software license isn't technically a contract since you don't sign anything. But broadly, calling it a contract is useful since it takes the obligations of each party, and makes them legally enforceable in court, under copyright law.

You might ask, why do we need contracts at all to make open source? Surely it's all about decency, goodwill, people working together for selfless motives. Surely the principle of "less is more" applies here of all places? Don't more rules mean less freedom? Do we really need lawyers to tell us how to work together? It seems cynical and even counter-productive to force a restrictive set of rules on the happy communes of free and open source software.

But the truth about human nature is not that pretty. We're not really angels, nor devils, just self-interested winners descended from a billion-year unbroken line of winners. In business, marriage, and collective works, sooner or later, we either stop caring, or we fight and we argue.

Put this another way: a collective work has two extreme outcomes. Either it's a failure, irrelevant, and worthless, in which case every sane person walks away, without a fight. Or, it's a success, relevant, and valuable, in which case we start jockeying for power, control, and often, money.

What a well-written contract does is protect those valuable relationships from conflict. A marriage where the terms of divorce are clearly agreed up-front is much less likely to end in divorce. A business deal where both parties agree how to resolve various classic conflicts, such as one party stealing the other's clients or staff, is much less likely to end in conflict.

Similarly, a software project that has a well-written contract that defines the terms of breakup clearly is much less likely to end in breakup. The alternative seems to be to immerse the project into a larger organization that can assert pressure on teams to work together (or lose the backing and branding of the organization). This is for example how the Apache Foundation works. In my experience organization building has its own costs, and ends up favoring wealthier participants (who can afford those sometimes huge costs).

In an open source or free software project, breakup usually takes the form of a fork, where the community splits into two or more groups, each with different visions of the future. During the honeymoon period of a project, which can last years, there's no question of a breakup. It is as a project begins to be worth money, or as the main authors start to burn out, that the goodwill and generosity tend to dry up.

So when discussing software licenses, for the code you write or the code you use, a little cynicism helps. Ask yourself, not "which license will attract more contributors?" because the answer to that lies in the mission statement and contribution process. Ask yourself, "if this project had a big fight, and split three ways, which license would save us?" Or, "if the whole team was bought by a hostile firm that wanted to turn this code into a proprietary product, which license would save us?"

Long-term survival means enduring the bad times, as well as enjoying the good ones.

When BSD projects fork, they cannot easily merge again. Indeed, one-way forking of BSD projects is quite systematic: every time BSD code ends up in a commercial project, this is what's happened. When GPL projects fork, however, re-merging is trivial.

The GPL's story is relevant here. Though communities of programmers sharing their code openly were already significant by the 1980s, they tended to use minimal licenses that worked as long as no real money got involved. There was an important language stack called Emacs, originally built in Lisp by Richard Stallman. Another programmer, James Gosling (who later gave us Java), rewrote Emacs in C with the help of many contributors, on the assumption that it would be open. Stallman got that code and used it as the basis for his own C version. Gosling then sold the code to a firm which turned around and blocked anyone distributing a competing product. Stallman found this sale of the common work hugely unethical, and began developing a reusable license that would protect communities from this.

What eventually emerged was the GNU General Public License, which used traditional copyright to force remixability. It was a neat hack that spread to other domains, for instance the Creative Commons for photography and music. In 2007, we saw version 3 of the license, which was a response to belated attacks from Microsoft and others on the concept. It has become a long and complex document but corporate copyright lawyers have become familiar with it and in my experience, few companies mind using GPL software and libraries, so long as the boundaries are clearly defined.

Thus, a good contract (and I consider the modern GPL to be the best for software) lets programmers work together without upfront agreements, organizations, or assumptions of decency and goodwill. It makes it cheaper to collaborate, and turns conflict into healthy competition. GPL doesn't just define what happens with a fork; it actively encourages forks as a tool for experimentation and learning. Whereas a fork can kill a project with a "more liberal" license, GPL projects thrive on forks since successful experiments can, by contract, be remixed back into the mainstream.

Yes, there are many thriving BSD projects and many dead GPL ones. It's always wrong to generalize. A project will thrive or die for many reasons. However, in a competitive sport, one needs every advantage.

The other important part of the BSD vs. GPL story is what I call "leakage", which is the effect of pouring water into a pot with a small but real hole in the bottom.

Eat Me

Here is a story. It happened to the eldest brother-in-law of the cousin of a friend of mine's colleague at work. His name was, and still is, Patrick.

Patrick was a computer scientist with a PhD in advanced network topologies. He spent two years and his savings building a new product, and chose the BSD license because he believed that would get him more adoption. He worked in his attic, at great personal cost, and proudly published his work. People applauded, for it was truly fantastic, and his mailing lists were soon abuzz with activity and patches and happy chatter. Many companies told him how they were saving millions using his work. Some of them even paid him for consultancy and training. He was invited to speak at conferences and started collecting badges with his name on them. He started a small business, hired a friend to work with him, and dreamed of making it big.

Then one day, someone pointed him to a new project, GPL licensed, which had forked his work and was improving on it. He was irritated and upset, and asked how people (fellow open sourcers, no less!) would so shamelessly steal his code. There were long arguments on the list about whether it was even legal to relicense their BSD code as GPL code. Turned out, it was. He tried to ignore the new project, but then he soon realized that new patches coming from that project couldn't even be merged back into his work!

Worse, the GPL project got popular and some of his core contributors made first small, and then larger patches to it. Again, he couldn't use those changes, and he felt abandoned. Patrick went into a depression, his girlfriend left him for an international currency dealer called, weirdly, Patrice, and he stopped all work on the project. He felt betrayed, and utterly miserable. He fired his friend, who took it rather badly and told everyone that Patrick was a closet banjo player. Finally, Patrick took a job as a project manager for a cloud company, and by the age of forty, he had stopped programming even for fun.

Poor Patrick. I almost felt sorry for him. Then I asked him, "Why didn't you choose the GPL?" "Because it's a restrictive viral license", he replied. I told him, "You may have a PhD, and you may be the eldest brother-in-law of the cousin of a friend of my colleague, but you are an idiot and Monique was smart to leave you. You published your work inviting people to please steal your code as long as they kept this 'please steal my code' statement in the resulting work, and when people did exactly that, you got upset. Worse, you were a hypocrite because when they did it in secret, you were happy, but when they did it openly, you felt betrayed."

Seeing your hard work captured by a smarter team and then used against you is enormously painful, so why even make that possible? Every proprietary project that uses BSD code is capturing it. A public GPL fork is perhaps more humiliating, but it's fully self-inflicted.

BSD is like food. It literally (and I mean that metaphorically) whispers "eat me" in the little voice one imagines a cube of cheese might use when it's sitting next to an empty bottle of the best beer in the world, which is of course Orval, brewed by an ancient and almost extinct order of silent Belgian monks called Les Gars Labas Qui Fabrique l'Orval. The BSD license, like its near clone MIT/X11, was designed specifically by a university (Berkeley) with no profit motive to leak work and effort. It is a way to push subsidized technology at below its cost price, a dumping of under-priced code in the hope that it will break the market for others. BSD is an excellent strategic tool, but only if you're a large well-funded institution that can afford to use Option One. The Apache license is BSD in a suit.

For us small businesses who aim our investments like precious bullets, leaking work and effort is unacceptable. Breaking the market is great, but we cannot afford to subsidize our competitors. The BSD networking stack ended up putting Windows on the Internet. We cannot afford battles with those we should naturally be allies with. We cannot afford to make fundamental business errors because in the end, that means we have to fire people.

It comes down to behavioral economics and game theory. The license we choose modifies the economics of those who use our work. In the software industry, there are friends, foes, and food. BSD makes most people see us as lunch. Closed source makes most people see us as enemies (do you like paying people for software?). GPL, however, makes most people, with the exception of the Patricks of the world, our allies. Any fork of ZeroMQ is license compatible with ZeroMQ, to the point where we encourage forks as a valuable tool for experimentation. Yes, it can be weird to see someone try to run off with the ball, but here's the secret: I can get it back any time I want.

The Process

If you've accepted my thesis up to now, great! Now, I'll explain the rough process by which we actually build an open source community. This was how we built or grew or gently steered the ZeroMQ community into existence.

Your goal as leader of a community is to motivate people to get out there and explore; to ensure they can do so safely and without disturbing others; to reward them when they make successful discoveries; and to ensure they share their knowledge with everyone else (and not because we ask them, not because they feel generous, but because it's The Law).

It is an iterative process. You make a small product, at your own cost, but in public view. You then build a small community around that product. If you have a small but real hit, the community then helps design and build the next version, and grows larger. And then that community builds the next version, and so on. It's evident that you remain part of the community, maybe even a majority contributor, but the more control you try to assert over the material results, the less people will want to participate. Plan your own retirement well before someone decides you are their next problem.

Crazy, Beautiful, and Easy

You need a goal that's crazy and simple enough to get people out of bed in the morning. Your community has to attract the very best people and that demands something special. With ZeroMQ, we said we were going to make "the Fastest. Messaging. Ever.", which qualifies as a good motivator. If we'd said we were going to make "a smart transport layer that'll connect your moving pieces cheaply and flexibly across your enterprise", we'd have failed.

Then your work must be beautiful, immediately useful, and attractive. Your contributors are users who want to explore just a little beyond where they are now. Make it simple, elegant, and brutally clean. The experience when people run or use your work should be an emotional one. They should feel something, and if you accurately solved even just one big problem that until then they didn't quite realize they faced, you'll have a small part of their soul.

It must be easy to understand, use, and join. Too many projects have barriers to access: put yourself in the other person's mind and see all the reasons they come to your site, thinking "Um, interesting project, but…" and then leave. You want them to stay and try it, just once. Use GitHub and put the issue tracker right there.

If you do these things well, your community will be smart but more importantly, it will be intellectually and geographically diverse. This is really important. A group of like-minded experts cannot explore the problem landscape well. They tend to make big mistakes. Diversity beats education any time.

Stranger, Meet Stranger

How much up-front agreement do two people need to work together on something? In most organizations, a lot. But you can bring this cost down to near-zero, and then people can collaborate without having ever met, done a phone conference, meeting, or business trip to discuss Roles and Responsibilities over way too many bottles of cheap Korean rice wine.

You need well-written rules that are designed by cynical people like me to force strangers into mutually beneficial collaboration instead of conflict. The GPL is a good start. GitHub and its fork/merge strategy is a good follow-up. And then you want something like our C4 rulebook to control how work actually happens.

C4 (which I now use for every new open source project) has detailed and tested answers to a lot of common mistakes people make, such as the sin of working offline in a corner with others "because it's faster". Transparency is essential to get trust, which is essential to get scale. By forcing every single change through a single transparent process, you build real trust in the results.

Another cardinal sin that many open source developers make is to place themselves above others. "I founded this project thus my intellect is superior to that of others". It's not just immodest and rude, and usually inaccurate, it's also poor business. The rules must apply equally to everyone, without distinction. You are part of the community. Your job, as founder of a project, is not to impose your vision of the product over others, but to make sure the rules are good, honest, and enforced.

Infinite Property

One of the saddest myths of the knowledge business is that ideas are a sensible form of property. It's medieval nonsense that should have been junked along with slavery, but sadly it's still making too many powerful people too much money.

Ideas are cheap. What does work sensibly as property is the hard work we do in building a market. "You eat what you kill" is the right model for encouraging people to work hard. Whether it's moral authority over a project, money from consulting, or the sale of a trademark to some large, rich firm: if you make it, you own it. But what you really own is "footfall", participants in your project, which ultimately defines your power.

To do this requires infinite free space. Thankfully, GitHub solved this problem for us, for which I will die a grateful person (there are many reasons to be grateful in life, which I won't list here because we only have a hundred or so pages left, but this is one of them).

You cannot scale a single project with many owners like you can scale a collection of many small projects, each with fewer owners. When we embrace forks, a person can become an "owner" with a single click. Now they just have to convince others to join by demonstrating their unique value.

So in ZeroMQ, we aimed to make it easy to write bindings on top of the core library, and we stopped trying to make those bindings ourselves. This created space for others to make those, become their owners, and get that credit.

Care and Feeding

I wish a community could be 100% self-steering, and perhaps one day this will work, but today it's not the case. We're very close with ZeroMQ, but from my experience a community needs four types of care and feeding:

  • First, simply because most people are too nice, we need some kind of symbolic leadership or owners who provide ultimate authority in case of conflict. Usually it's the founders of the community. I've seen it work with self-elected groups of "elders", but old men like to talk a lot. I've seen communities split over the question "who is in charge?", and setting up legal entities with boards and such seems to make arguments over control worse, not better. Maybe because there seems to be more to fight over. One of the real benefits of free software is that it's always remixable, so instead of fighting over a pie, one simply forks the pie.
  • Second, communities need living rules, and thus they need a lawyer able to formulate and write these down. Rules are critical; when done right, they remove friction. When done wrong, or neglected, we see real friction and argument that can drive away the nice majority, leaving the argumentative core in charge of the burning house. One thing I've tried to do with the ZeroMQ and previous communities is create reusable rules, which perhaps means we don't need lawyers as much.
  • Thirdly, communities need some kind of financial backing. This is the jagged rock that breaks most ships. If you starve a community, it becomes more creative but the core contributors burn out. If you pour too much money into it, you attract the professionals, who never say "no", and the community loses its diversity and creativity. If you create a fund for people to share, they will fight (bitterly) over it. With ZeroMQ, we (iMatix) spend our time and money on marketing and packaging (like this book), and the basic care, like bug fixes, releases, and websites.
  • Lastly, sales and commercial mediation are important. There is a natural market between expert contributors and customers, but both are somewhat incompetent at talking to each other. Customers assume that support is free or very cheap because the software is free. Contributors are shy at asking a fair rate for their work. It makes for a difficult market. A growing part of my work and my firm's profits is simply connecting ZeroMQ users who want help with experts from the community able to provide it, and ensuring both sides are happy with the results.

I've seen communities of brilliant people with noble goals dying because the founders got some or all of these four things wrong. The core problem is that you can't expect consistently great leadership from any one company, person, or group. What works today often won't work tomorrow, yet structures become more solid, not more flexible, over time.

The best answer I can find is a mix of two things. One, the GPL and its guarantee of remixability. No matter how bad the authority, no matter how much they try to privatize and capture the community's work, if it's GPL licensed, that work can walk away and find a better authority. Before you say, "all open source offers this", think it through. I can kill a BSD-licensed project by hiring the core contributors and not releasing any new patches. But even with a billion dollars, I cannot kill a GPL-licensed project. Two, the philosophical anarchist model of authority, which is that we choose it, it does not own us.

The ZeroMQ Process: C4

When we say ZeroMQ we sometimes mean libzmq, the core library. In early 2012, we synthesized the libzmq process into a formal protocol for collaboration that we called the Collective Code Construction Contract, or C4. You can see this as a layer above the GPL. These are our rules, and I'll explain the reasoning behind each one.

C4 is an evolution of the GitHub Fork + Pull Model. You may get the feeling I'm a fan of git and GitHub. This would be accurate: these two tools have made such a positive impact on our work over the last years, especially when it comes to building community.

Language

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC 2119.

By starting with the RFC 2119 language, the C4 text makes very clear its intention to act as a protocol rather than a randomly written set of recommendations. A protocol is a contract between parties that defines the rights and obligations of each party. These can be peers in a network or they can be strangers working in the same project.

I think C4 is the first time anyone has attempted to codify a community's rulebook as a formal and reusable protocol spec. Previously, our rules were spread out over several wiki pages, and were quite specific to libzmq in many ways. But experience teaches us that the more formal, accurate, and reusable the rules, the easier it is for strangers to collaborate up-front. And less friction means a more scalable community. At the time of C4, we also had some disagreement in the libzmq project over precisely what process we were using. Not everyone felt bound by the same rules. Let's just say some people felt they had a special status, which created friction with the rest of the community. So codification made things clear.

It's easy to use C4: just host your project on GitHub, get one other person to join, and open the floor to pull requests. In your README, put a link to C4 and that's it. We've done this in quite a few projects and it does seem to work. I've been pleasantly surprised a few times just applying these rules to my own work, like CZMQ. None of us are so amazing that we can work without others.
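As an illustration of how little ceremony this needs, a README note along these lines is enough (the wording below is mine, not prescribed by C4):

    This project uses the C4 process (the Collective Code Construction
    Contract, published on rfc.zeromq.org). In short: fork the repository,
    make a minimal patch that solves one clearly identified problem, and
    send a pull request.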

Goals

C4 is meant to provide a reusable optimal collaboration model for open source software projects.

The short-term reason for writing C4 was to end arguments over the libzmq contribution process. The dissenters went off elsewhere. The ZeroMQ community blossomed smoothly and easily, as I'd predicted. Most people were surprised, but gratified. There have been no real criticisms of C4 except its branching policy, which I'll come to later as it deserves its own discussion.

There's a reason I'm reviewing history here: as founder of a community, you are asking people to invest in your property, trademark, and branding. In return, and this is what we do with ZeroMQ, you can use that branding to set a bar for quality. When you download a product labeled "ZeroMQ", you know that it's been produced to certain standards. It's a basic rule of quality: write down your process; otherwise you cannot improve it. Our processes aren't perfect, nor can they ever be. But any flaw in them can be fixed, and tested.

Making C4 reusable is therefore really important. To learn more about the best possible process, we need to get results from the widest range of projects.

It has these specific goals:
To maximize the scale of the community around a project, by reducing the friction for new Contributors and creating a scaled participation model with strong positive feedbacks;

The number one goal is size and health of the community, not technical quality, not profits, not performance, not market share. The goal is simply the number of people who contribute to the project. The science here is simple: the larger the community, the more accurate the results.

To relieve dependencies on key individuals by separating different skill sets so that there is a larger pool of competence in any required domain;

Perhaps the worst problem we faced in libzmq was dependence on people who could understand the code, manage GitHub branches, and make clean releases, all at the same time. It's like looking for athletes who can run marathons and sprint, swim, and also lift weights. We humans are really good at specialization. Asking us to be really good at two contradictory things reduces the number of candidates sharply, which is a Bad Thing for any project. We had this problem severely in libzmq in 2009 or so, and fixed it by splitting the role of maintainer into two: one person makes patches and another makes releases.

To allow the project to develop faster and more accurately, by increasing the diversity of the decision making process;

This is theory: not fully proven, but not falsified. The greater the diversity of the community and the number of people who can weigh in on discussions, without fear of being criticized or dismissed, the faster and more accurately the software develops. Speed is quite subjective here. Going very fast in the wrong direction is not just useless, it's actively damaging (and we suffered a lot of that in libzmq before we switched to C4).

To support the natural life cycle of project versions from experimental through to stable, by allowing safe experimentation, rapid failure, and isolation of stable code;

To be honest, this goal seems to be fading into irrelevance. It's quite an interesting effect of the process: the git master is almost always perfectly stable. This has to do with the size of changes and their latency, i.e., the time between someone writing the code and someone actually using it fully. However, people still expect "stable" releases, so we'll keep this goal there for a while.

To reduce the internal complexity of project repositories, thus making it easier for Contributors to participate and reducing the scope for error;

Curious observation: people who thrive in complex situations like to create complexity because it keeps their value high. It's the Cobra Effect (Google it). Git made branches easy and left us with the all too common syndrome of "git is easy once you understand that a git branch is just a folded five-dimensional lepton space that has a detached history with no intervening cache". Developers should not be made to feel stupid by their tools. I've seen too many top-class developers confused by repository structures to accept conventional wisdom on git branches. We'll come back to dispose of git branches shortly, dear reader.

To enforce collective ownership of the project, which increases economic incentive to Contributors and reduces the risk of hijack by hostile entities.

Ultimately, we're economic creatures, and the sense that "we own this, and our work can never be used against us" makes it much easier for people to invest in an open source project like ZeroMQ. And it can't be just a feeling, it has to be real. There are a number of aspects to making collective ownership work; we'll see these one by one as we go through C4.

Preliminaries

The project SHALL use the git distributed revision control system.

Git has its faults. Its command-line API is horribly inconsistent, and it has a complex, messy internal model that it shoves in your face at the slightest provocation. But despite doing its best to make its users feel stupid, git does its job really, really well. More pragmatically, I've found that if you stay away from certain areas (branches!), people learn git rapidly and don't make many mistakes. That works for me.

The project SHALL be hosted on github.com or equivalent, herein called the "Platform".

I'm sure one day some large firm will buy GitHub and break it, and another platform will rise in its place. Until then, GitHub serves up a near-perfect set of minimal, fast, simple tools. I've thrown hundreds of people at it, and they all stick like flies stuck in a dish of honey.

The project SHALL use the Platform issue tracker.

We made the mistake in libzmq of switching to Jira because we hadn't learned yet how to properly use the GitHub issue tracker. Jira is a great example of how to turn something useful into a complex mess because the business depends on selling more "features". But even without criticizing Jira, keeping the issue tracker on the same platform means one less UI to learn, one less login, and smooth integration between issues and patches.

The project SHOULD have clearly documented guidelines for code style.

This is a protocol plug-in: insert code style guidelines here. If you don't document the code style you use, you have no basis except prejudice to reject patches.

A "Contributor" is a person who wishes to provide a patch, being a set of commits(犯罪) that solve some clearly identified(确定) problem.
A "Maintainer" is a person who merge(合并) patches to the project. Maintainers(维持) are not developers; their job is to enforce process.

Now we move on to definitions of the parties, and the splitting of roles that saved us from the sin of structural dependency on rare individuals. This worked well in libzmq, but as you will see, it depends on the rest of the process. C4 isn't a buffet; you will need the whole process (or something very like it), or it won't hold together.

Contributors SHALL NOT have commit access to the repository unless they are also Maintainers.
Maintainers SHALL have commit access to the repository.

What we wanted to avoid was people pushing their changes directly to master. This was the biggest source of trouble in libzmq historically: large masses of raw code that took months or years to fully stabilize. We eventually followed other ZeroMQ projects like PyZMQ in using pull requests. We went further, and stipulated that all changes had to follow the same path. No exceptions for "special people".

Everyone, without distinction or discrimination, SHALL have an equal right to become a Contributor under the terms of this contract.

We had to state this explicitly. It used to be that the libzmq maintainers would reject patches simply because they didn't like them. Now, that may sound reasonable to the author of a library (though libzmq was not written by any one person), but let's remember our goal of creating a work that is owned by as many people as possible. Saying "I don't like your patch so I'm going to reject it" is equivalent to saying, "I claim to own this and I think I'm better than you, and I don't trust you". Those are toxic messages to give to others who are thinking of becoming your co-investors.

I think this fight between individual expertise and collective intelligence plays out in other areas. It defined Wikipedia, and still does, a decade after that work surpassed anything built by small groups of experts. For me, we make software by slowly synthesizing the most accurate knowledge, much as we make Wikipedia articles.

Licensing and Ownership

The project SHALL use the GPLv3 or a variant thereof (LGPL, AGPL).

I've already explained how full remixability creates better scale and why the GPL and its variants seem the optimal contract for remixable software. If you're a large business aiming to dump code on the market, you won't want C4, but then you won't really care about community either.

All contributions to the project source code ("patches") SHALL use the same license as the project.

This removes the need for any specific license or contribution agreement for patches. You fork the GPL code, you publish your remixed version on GitHub, and you or anyone else can then submit that as a patch to the original code. BSD doesn't allow this. Any work that contains BSD code may also contain unlicensed proprietary code, so you need explicit action from the author of the code before you can remix it.

All patches are owned by their authors. There SHALL NOT be any copyright assignment process.

Here we come to the key reason people trust their investments in ZeroMQ: it's logistically impossible to buy the copyrights to create a closed source competitor to ZeroMQ. iMatix can't do this either. And the more people that send patches, the harder it becomes. ZeroMQ isn't just free and open today; this specific rule means it will remain so forever. Note that this is not the case in all GPL projects, many of which still ask for copyright transfer back to the maintainers.

The project SHALL be owned collectively by all its Contributors.

This is perhaps redundant, but worth saying: if everyone owns their patches, then the resulting whole is also owned by every contributor. There's no legal concept of owning lines of code: the "work" is at least a source file.

Each Contributor SHALL be responsible for identifying themselves in the project Contributor list.

In other words, the maintainers are not karma accountants. Anyone who wants credit has to claim it themselves.

Patch Requirements

In this section, we define the obligations of the contributor: specifically, what constitutes a "valid" patch, so that maintainers have rules they can use to accept or reject patches.

Maintainers and Contributors MUST have a Platform account and SHOULD use their real names or a well-known alias.

In the worst case scenario, where someone has submitted toxic code (patented, or owned by someone else), we need to be able to trace who and when, so we can remove the code. Asking for real names or a well-known alias is a theoretical strategy for reducing the risk of bogus patches. We don't know if this actually works because we haven't had the problem yet.

A patch SHOULD be a minimal and accurate answer to exactly one identified and agreed problem.

This implements the Simplicity Oriented Design process that I'll come to later in this chapter. One clear problem, one minimal solution, apply, test, repeat.

A patch MUST adhere to the code style guidelines of the project if these are defined.

This is just sanity. I've spent time cleaning up other people's patches because they insisted on putting the else beside the if instead of just below, as Nature intended. Consistent code is healthier.
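
As an aside, this is the kind of thing a style guideline pins down; the snippet below only illustrates the "else below the if" layout, and is not an official libzmq style rule:

    #include <stdio.h>

    int main (void)
    {
        int rc = -1;                    //  Pretend result code from some call
        if (rc == -1) {
            printf ("that failed\n");   //  Error branch
        }
        else {                          //  else on its own line, below the if
            printf ("all fine\n");
        }
        return 0;
    }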

A patch MUST adhere to the "Evolution of Public Contracts" guidelines defined below.

Ah, the pain, the pain. I'm not speaking of the time at age eight when I stepped on a plank with a 4-inch nail protruding from it. That was relatively OK. I'm speaking of 2010-2011 when we had multiple parallel releases of ZeroMQ, each with different incompatible APIs or wire protocols. It was an exercise in bad rules, pointlessly enforced, that still hurts us today. The rule was, "If you change the API or protocol, you SHALL create a new major version". Give me the nail through the foot; that hurt less.

One of the big changes we made with C4 was simply to ban, outright, this kind of sanctioned sabotage. Amazingly, it's not even hard. We just don't allow the breaking of existing public contracts, period, unless everyone agrees, in which case no period. As Linus Torvalds famously put it on 23 December 2012, "WE DO NOT BREAK USERSPACE!"

A patch SHALL NOT include nontrivial code from other projects unless the Contributor is the original author of that code.

This rule has two effects. The first is that it forces people to make minimal solutions because they cannot simply import swathes of existing code. In the cases where I've seen this happen to projects, it's always bad unless the imported code is very cleanly separated. The second is that it avoids license arguments. You write the patch, you are allowed to publish it as LGPL, and we can merge it back in. But if you find a 200-line code fragment on the web and try to paste that in, we'll refuse.

A patch MUST compile cleanly and pass project self-tests on at least the principal target platform.

For cross-platform projects, it is fair to ask that the patch works on the development box used by the contributor.

A patch commit message SHOULD consist of a single short (less than 50 character) line summarizing the change, optionally followed by a blank line and then a more thorough description.

This is a good format for commit messages that fits into email (the first line becomes the subject, and the rest becomes the email body).
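
For example, a conforming commit message might look like the following (the issue number and the bug itself are invented for illustration):

    Fix #123: retry zmq_poll after EINTR

    Problem: a signal arriving during the poll made zmq_poll return
    early, so callers saw spurious timeouts.

    Solution: restart the poll with the remaining timeout when the
    call is interrupted by a signal.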

A "Correct Patch" is one that satisfies the above requirements.

Just in case it wasn't clear, we're back to legalese and definitions.

Development Process

In this section, we aim to describe the actual development process, step-by-step.

Change on the project SHALL be governed by the pattern of accurately identifying problems and applying minimal, accurate solutions to these problems.

This is an unapologetic ramming through of thirty years of software design experience. It's a profoundly simple approach to design: make minimal, accurate solutions to real problems, nothing more or less. In ZeroMQ, we don't have feature requests. Treating new features the same as bugs confuses some newcomers. But this process works, and not just in open source. Enunciating the problem we're trying to solve, with every single change, is key to deciding whether the change is worth making or not.

To initiate changes, a user SHALL log an issue on the project Platform issue tracker.

This is meant to stop us from going offline and working in a ghetto, either by ourselves or with others. Although we tend to accept pull requests that have clear argumentation, this rule lets us say "stop" to confused or too-large patches.

The user SHOULD write the issue by describing the problem they face or observe.

"Problem: we need feature X. Solution: make it" is not a good issue. "Problem: user cannot do common tasks A or B except by using a complex(复杂的) workaround(工作区). Solution: make feature X" is a decent(正派的) explanation. Because everyone I've ever worked with has needed to learn this, it seems worth restating(重申): document the real problem first, solution second.

The user SHOULD seek consensus on the accuracy of their observation, and the value of solving the problem.

And because many apparent problems are illusionary, by stating the problem explicitly we give others a chance to correct our logic. "You're only using A and B a lot because function C is unreliable. Solution: make function C work properly."

Users SHALL NOT log feature requests, ideas, suggestions, or any solutions to problems that are not explicitly documented and provable.

There are several reasons for not logging ideas, suggestions, or feature requests. In our experience, these just accumulate in the issue tracker until someone deletes them. But more profoundly, when we treat all change as problem solutions, we can prioritize trivially. Either the problem is real and someone wants to solve it now, or it's not on the table. Thus, wish lists are off the table.

Thus, the release history of the project SHALL be a list of meaningful issues logged and solved.

I'd love the GitHub issue tracker to simply list all the issues we solved in each release. Today we still have to write that by hand. If one puts the issue number in each commit, and if one uses the GitHub issue tracker, which we sadly don't yet do for ZeroMQ, this release history is easier to produce mechanically.

To work on an issue, a Contributor SHALL fork the project repository and then work on their forked repository.

Here we explain the GitHub fork + pull request model so that newcomers only have to learn one process (C4) in order to contribute.

To submit a patch, a Contributor SHALL create a Platform pull request back to the project.

GitHub has made this so simple that we don't need to learn git commands to do it, for which I'm deeply grateful. Sometimes, I'll tell people who I don't particularly like that command-line git is awesome and all they need to do is learn git's internal model in detail before trying to use it on real work. When I see them several months later they look… changed.

A Contributor SHALL NOT commit changes directly to the project.

Anyone who submits a patch is a contributor, and all contributors follow the same rules. No special privileges to the original authors, because otherwise we're not building a community, only boosting our egos.

To discuss a patch, people MAY comment on the Platform pull request, on the commit, or elsewhere.

Randomly distributed discussions may be confusing if you're walking up for the first time, but GitHub solves this for all current participants by sending emails to those who need to follow what's going on. We had the same experience and the same solution in Wikidot, and it works. There's no evidence that discussing in different places has any negative effect.

To accept or reject a patch, a Maintainer SHALL use the Platform interface.

Working via the GitHub web user interface means pull requests are logged as issues, with workflow and discussion. I'm sure there are more complex ways to work. Complexity is easy; it's simplicity that's incredibly hard.

Maintainers SHALL NOT accept their own patches.

There was a rule we defined in the FFII years ago to stop people burning out: no less than two people on any project. One-person projects tend to end in tears, or at least bitter silence. We have quite a lot of data on burnout, why it happens, and how to prevent it (even cure it). I'll explore this later in the chapter, because if you work with or on open source you need to be aware of the risks. The "no merging your own patch" rule has two goals. First, if you want your project to be C4-certified, you have to get at least one other person to help. If no one wants to help you, perhaps you need to rethink your project. Second, having a control for every patch makes it much more satisfying, keeps us more focused, and stops us breaking the rules because we're in a hurry, or just feeling lazy.

Maintainers SHALL NOT make value judgments on correct patches.

We already said this but it's worth repeating: the role of Maintainer is not to judge a patch's substance, only its technical quality. The substantive worth of a patch only emerges over time: people use it, and like it, or they do not. And if no one is using a patch, eventually it'll annoy someone else who will remove it, and no one will complain.

Maintainers SHALL merge correct patches rapidly.

There is a criterion I call change latency, which is the round-trip time from identifying a problem to testing a solution. The faster the better. If maintainers cannot respond to pull requests as rapidly as people expect, they're not doing their job (or they need more hands).

The Contributor MAY tag an issue as "Ready" after making a pull request for the issue.

By default, GitHub offers the usual variety of issue labels, but with C4 we don't use them. Instead, we need just two labels, "Urgent" and "Ready". A contributor who wants another user to test an issue can then label it as "Ready".

The user who created an issue SHOULD close the issue after checking the patch is successful.

When one person opens an issue, and another works on it, it's best to allow the original person to close the issue. That acts as a double-check that the issue was properly resolved.

Maintainers SHOULD ask for improvements to incorrect patches and SHOULD reject incorrect patches if the Contributor does not respond constructively.

Initially, I felt it was worth merging all patches, no matter how poor. There's an element of trolling involved. Accepting even obviously bogus patches could, I felt, pull in more contributors. But people were uncomfortable with this, so we defined the "correct patch" rules, and the Maintainer's role in checking for quality. On the negative side, I think we didn't take some interesting risks, which could have paid off with more participants. On the positive side, this has led to libzmq master (and master in all projects that use C4) being practically production quality, practically all the time.

Any Contributor who has value judgments on a correct patch SHOULD express these via their own patches.

In essence, the goal here is to allow users to try patches rather than to spend time arguing pros and cons. As easy as it is to make a patch, it's as easy to revert it with another patch. You might think this would lead to "patch wars", but that hasn't happened. We've had a handful of cases in libzmq where patches by one contributor were killed by another person who felt the experimentation wasn't going in the right direction. It is easier than seeking up-front consensus.

Maintainers MAY commit changes to non-source documentation directly to the project.

This exit allows maintainers who are making release notes to push those without having to create an issue which would then affect the release notes, leading to stress on the space-time fabric and possibly involuntary rerouting backwards in the fourth dimension to before the invention of cold beer. Shudder. It is simpler to agree that release notes aren't technically software.

Creating Stable Releases

We want some guarantee of stability for a production system. In the past, this meant taking unstable code and then over months hammering out the bugs and faults until it was safe to trust. iMatix's job, for years, has been to do this to libzmq, turning raw code into packages by allowing only bug fixes and no new code into a "stabilization branch". It's surprisingly not as thankless as it sounds.

Now, since we went full speed with C4, we've found that git master of libzmq is mostly perfect, most of the time. This frees our time to do more interesting things, such as building new open source layers on top of libzmq. However, people still want that guarantee: many users will simply not install except from an "official" release. So a stable release today means two things. First, a snapshot of the master taken at a time when there were no new changes for a while, and no dramatic open bugs. Second, a way to fine-tune that snapshot to fix the critical issues remaining in it.

This is the process we explain in this section.

The project SHALL have one branch ("master") that always holds the latest in-progress version and SHOULD always build.

This is redundant because every patch always builds, but it's worth restating. If the master doesn't build (and pass its tests), someone needs waking up.

The project SHALL NOT use topic branches for any reason. Personal forks MAY use topic branches.

I'll come to branches soon. In short (or "tl;dr", as they say on the webs), branches make the repository too complex and fragile, and require up-front agreement, all of which are expensive and avoidable.

To make a stable release someone SHALL fork the repository by copying it and thus become maintainer of this repository.
Forking a project for stabilization MAY be done unilaterally and without agreement of project maintainers.

It's free software. No one has a monopoly on it. If you think the maintainers aren't producing stable releases right, fork the repository and do it yourself. Forking isn't a failure, it's an essential tool for competition. You can't do this with branches, which means a branch-based release policy gives the project maintainers a monopoly. And that's bad because they'll become lazier and more arrogant than if real competition is chasing their heels.

A stabilization project SHOULD be maintained by the same process as the main project.

Stabilization projects have maintainers and contributors(贡献者) like any project. In practice we usually cherry(樱桃) pick patches from the main project to the stabilization project, but that's just a convenience.

A patch to a repository declared "stable" SHALL be accompanied(陪伴) by a reproducible(可再生的) test case.

Beware(当心) of a one-size-fits-all process. New code does not need the same paranoia(偏执狂) as code that people are trusting for production use. In the normal development process, we did not mention test cases. There's a reason for this. While I love testable(可试验的) patches, many changes aren't easily or at all testable. However, to stabilize(稳固) a code base you want to fix only serious bugs, and you want to be 100% sure every change is accurate(精确的). This means before and after tests for every change.
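
To make "before and after tests" concrete, here is a minimal sketch of the shape such a reproducible test case usually takes in the libzmq tests directory: the smallest possible topology, exactly the behavior under repair, and hard assertions. The endpoint name and message text here are illustrative only.

//  Sketch of a reproducible test case for a stabilization patch:
//  set up the minimal topology, exercise exactly the behavior being
//  fixed, and assert the expected result before and after the fix.
#include <zmq.h>
#include <string.h>
#include <assert.h>

int main (void)
{
    void *context = zmq_ctx_new ();
    void *server = zmq_socket (context, ZMQ_REP);
    void *client = zmq_socket (context, ZMQ_REQ);
    int rc = zmq_bind (server, "inproc://stable-fix");
    assert (rc == 0);
    rc = zmq_connect (client, "inproc://stable-fix");
    assert (rc == 0);

    //  The behavior under test goes here; keep it brutally minimal
    rc = zmq_send (client, "ping", 4, 0);
    assert (rc == 4);
    char buffer [16];
    rc = zmq_recv (server, buffer, sizeof (buffer), 0);
    assert (rc == 4 && memcmp (buffer, "ping", 4) == 0);

    zmq_close (client);
    zmq_close (server);
    zmq_ctx_destroy (context);
    return 0;
}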

Evolution of Public Contracts

By "public contracts", I mean APIs and protocols. Up until the end of 2011, libzmq's naturally happy state was marred by broken promises and broken contracts. We stopped making promises (aka "road maps") for libzmq completely, and our dominant theory of change is now that it emerges carefully and accurately over time. At a 2012 Chicago meetup, Garrett Smith and Chuck Remes called this the "drunken stumble to greatness", which is how I think of it now.

We stopped breaking public contracts simply by banning the practice. Before then it had been "OK" (as in we did it and everyone complained bitterly, and we ignored them) to break the API or protocol so long as we changed the major version number. Sounds fine, until you get ZeroMQ v2.0, v3.0, and v4.0 all in development at the same time, and not speaking to each other.

All Public Contracts (APIs or protocols) SHOULD be documented.

You'd think this was a given for professional software engineers but no, it's not. So, it's a rule. You want C4 certification for your project, you make sure your public contracts are documented. No "It's specified in the code" excuses. Code is not a contract. (Yes, I intend at some point to create a C4 certification process to act as a quality indicator for open source projects.)

All Public Contracts SHALL use Semantic Versioning.

This rule is mainly here because people asked for it. I've no real love for it, as Semantic Versioning is what led to the so-called "Why does ZeroMQ not speak to itself?!" debacle. I've never seen the problem that this solved. Something about runtime validation of library versions, or some-such.
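
For readers wondering what "runtime validation of library versions" looks like in practice, here is a minimal sketch using only the standard zmq_version() call and the ZMQ_VERSION_* macros from zmq.h; the exit-code handling is purely illustrative.

//  Minimal sketch: compare the library we are running against with the
//  headers we compiled against.
#include <zmq.h>
#include <stdio.h>

int main (void)
{
    int major, minor, patch;
    zmq_version (&major, &minor, &patch);
    printf ("Compiled against %d.%d.%d, running against %d.%d.%d\n",
        ZMQ_VERSION_MAJOR, ZMQ_VERSION_MINOR, ZMQ_VERSION_PATCH,
        major, minor, patch);
    //  Refuse to run against an older library than we were built for
    if (ZMQ_MAKE_VERSION (major, minor, patch) < ZMQ_VERSION)
        return 1;
    return 0;
}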

All Public Contracts SHOULD have space for extensibility and experimentation.

Now, the real thing is that public contracts do change. It's not about not changing them. It's about changing them safely. This means educating (especially protocol) designers to create that space up-front.

A patch that modifies a stable Public Contract SHOULD not break existing applications unless there is overriding consensus on the value of doing this.

Sometimes the patch is fixing a bad API that no one is using. It's a freedom we need, but it should be based on consensus, not one person's dogma. However, making random changes "just because" is not good. In ZeroMQ v3.x, did we benefit from renaming ZMQ_NOBLOCK to ZMQ_DONTWAIT? Sure, it's closer to the POSIX socket recv() call, but is that worth breaking thousands of applications? No one ever reported it as an issue. To misquote Stallman: "your freedom to create an ideal world stops one inch from my application."

A patch that introduces new features to a Public Contract SHOULD do so using new names.

We had the experience in ZeroMQ once or twice of new features using old names (or worse, using names that were still in use elsewhere). ZeroMQ v3.0 had a newly introduced "ROUTER" socket that was totally different from the existing ROUTER socket in 2.x. Dear lord, you should be face-palming, why? The reason: apparently, even smart people sometimes need regulation to stop them doing silly things.

Old names SHOULD be deprecated in a systematic fashion by marking new names as "experimental" until they are stable, then marking the old names as "deprecated".

This life cycle notation has the great benefit of actually telling users what is going on with a consistent direction. "Experimental" means "we have introduced this and intend to make it stable if it works". It does not mean, "we have introduced this and will remove it at any time if we feel like it". One assumes that code that survives more than one patch cycle is meant to be there. "Deprecated" means "we have replaced this and intend to remove it".
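
In a C header, this life cycle can be as boring as it sounds. A sketch, using the ZMQ_NOBLOCK/ZMQ_DONTWAIT rename mentioned above as the example (the numeric value and comments are illustrative, not quoted from zmq.h):

//  The stable, current name
#define ZMQ_DONTWAIT 1

//  Deprecated old name kept as an alias so existing applications still
//  compile; later it gets marked "legacy" and eventually removed
#define ZMQ_NOBLOCK ZMQ_DONTWAIT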

When sufficient time has passed, old deprecated names SHOULD be marked "legacy" and eventually removed.

In theory this gives applications time to move onto stable new contracts without risk. You can upgrade first, make sure things work, and then, over time, fix things up to remove dependencies on deprecated and legacy APIs and protocols.

Old names SHALL NOT be reused by new features.

Ah, yes, the joy when ZeroMQ v3.x renamed the top-used API functions (zmq_send() and zmq_recv()) and then recycled the old names for new methods that were utterly incompatible (and which I suspect few people actually use). You should be slapping yourself in confusion again, but really, this is what happened and I was as guilty as anyone. After all, we did change the version number! The only benefit of that experience was to get this rule.

When old names are removed, their implementations MUST provoke an exception (assertion) if used by applications.

I've not tested this rule to be certain it makes sense. Perhaps what it means is "if you can't provoke a compile error because the API is dynamic, provoke an assertion".
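
A hypothetical sketch of that last idea (the function name here is invented purely for illustration): keep a stub behind the removed name that fails loudly instead of silently misbehaving.

#include <assert.h>

//  Hypothetical stub for a removed API: callers that slipped past the
//  compile-time checks get an unambiguous assertion at run time.
int zmq_old_removed_call (void *socket_)
{
    (void) socket_;
    assert (0 && "zmq_old_removed_call was removed; use the new name");
    return -1;
}

int main (void)
{
    //  Any call now aborts with a clear message rather than corrupting state
    return zmq_old_removed_call ((void *) 0);
}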

Project Administration

The project founders SHALL act as Administrators to manage the set of project Maintainers.

Someone needs to administer the project, and it makes sense that the original founders start this ball rolling.

The Administrators SHALL ensure their own succession over time by promoting the most effective Maintainers.

At the same time, as founder of a project you really want to get out of the way before you become over-attached to it. Promoting the most active and consistent maintainers is good for everyone.

A new Contributor who makes a correct patch SHALL be invited to become a Maintainer.

I met Felix Geisendörfer in Lyons in 2012 at the Mix-IT conference where I presented Social Architecture, and one thing that came out of this was Felix's now famous Pull Request Hack. It fits elegantly into C4 and solves the problem of maintainers dropping out over time.

Administrators MAY remove Maintainers who are inactive for an extended period of time, or who repeatedly fail to apply this process accurately.

This was Ian Barber's suggestion: we need a way to crop inactive maintainers. Originally maintainers were self-elected but that makes it hard to drop troublemakers (who are rare, but not unknown).

C4 is not perfect. Few things are. The process for changing it (Digistan's COSS) is a little outdated now: it relies on a single-editor workflow with the ability to fork, but not merge. This seems to work, but it could be better to use C4 itself for protocols like C4.

A Real-Life Example

In this email thread, Dan Goes asks how to make a publisher that knows when a new client subscribes, and sends out previous matching messages. It's a standard pub-sub technique called "last value caching". Now over a 1-way transport like pgm (where subscribers literally send no packets back to publishers), this can't be done. But over TCP, it can, if we use an XPUB socket and if that socket didn't cleverly filter out duplicate subscriptions to reduce upstream traffic.
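
For readers who haven't used XPUB before, the core trick Dan needs looks roughly like this: an XPUB socket delivers subscription events to the application as messages whose first byte is 1 (subscribe) or 0 (unsubscribe), followed by the topic, and a last value cache can watch those events and resend the cached message. The sketch below is deliberately minimal (one cached value, an illustrative endpoint), and, as just described, the stock XPUB socket will only show the first subscription per topic, which is exactly the limitation this story is about.

#include <zmq.h>
#include <string.h>
#include <assert.h>

int main (void)
{
    void *context = zmq_ctx_new ();
    void *publisher = zmq_socket (context, ZMQ_XPUB);
    int rc = zmq_bind (publisher, "tcp://*:5556");
    assert (rc == 0);

    char cached [256] = "";             //  Last value we published

    while (1) {
        //  Subscription events arrive as: 0x01 + topic (subscribe)
        //  or 0x00 + topic (unsubscribe)
        char event [256];
        int size = zmq_recv (publisher, event, sizeof (event), 0);
        if (size > 0 && event [0] == 1 && *cached)
            //  A new subscriber appeared: resend the cached value
            zmq_send (publisher, cached, strlen (cached), 0);
        //  A real publisher would also update 'cached' and publish here
    }
    zmq_ctx_destroy (context);
    return 0;
}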

Though I'm not an expert contributor to libzmq, this seems like a fun problem to solve. How hard could it be? I start by forking the libzmq repository to my own GitHub account and then clone it to my laptop, where I build it:

git clone git@github.com:hintjens/libzmq.git
cd libzmq
./autogen.sh
./configure
make

Because the libzmq code is neat and well-organized, it was quite easy to find the main files to change (xpub.cpp and xpub.hpp). Each socket type has its own source file and class. They inherit from socket_base.cpp, which has this hook for socket-specific options:

//  First, check whether specific socket type overloads the option.
int rc = xsetsockopt (option_, optval_, optvallen_);
if (rc == 0 || errno != EINVAL)
    return rc;

//  If the socket type doesn't support the option, pass it to
//  the generic option parser.
return options.setsockopt (option_, optval_, optvallen_);

Then I check where the XPUB socket filters out duplicate subscriptions, in its xread_activated method:

bool unique;
if (*data == 0)
    unique = subscriptions.rm (data + 1, size - 1, pipe_);
else
    unique = subscriptions.add (data + 1, size - 1, pipe_);

//  If the subscription is not a duplicate, store it so that it can be
//  passed to the user on the next recv call.
if (unique && options.type != ZMQ_PUB)
    pending.push_back (blob_t (data, size));

At this stage, I'm not too concerned with the details of how subscriptions.rm and subscriptions.add work. The code seems obvious except that "subscription" also includes unsubscription, which confused me for a few seconds. If there's anything else weird in the rm and add methods, that's a separate issue to fix later. Time to make an issue for this change. I head over to the zeromq.jira.com site, log in, and create a new entry.

Jira kindly offers me the traditional choice between "bug" and "new feature" and I spend thirty seconds wondering where this counterproductive historical distinction came from. Presumably, the "we'll fix bugs for free, but you pay for new features" commercial proposal, which stems from the "you tell us what you want and we'll make it for $X" model of software development, and which generally leads to "we spent three times $X and we got what?!" email Fists of Fury.

Putting such thoughts aside, I create issue #443 and describe the problem and a plausible solution:

Problem: XPUB socket filters out duplicate subscriptions (deliberate design). However this makes it impossible to do subscription-based intelligence. See http://lists.zeromq.org/pipermail/zeromq-dev/2012-October/018838.html for a use case.
Solution: make this behavior configurable with a socket option.

It's naming time. The API sits in include/zmq.h, so this is where I added the option name. When you invent a concept in an API or anywhere, please take a moment to choose a name that is explicit and short and obvious. Don't fall back on generic names that need additional context to understand. You have one chance to tell the reader what your concept is and does. A name like ZMQ_SUBSCRIPTION_FORWARDING_FLAG is terrible. It technically kind of aims in the right direction, but is miserably long and obscure. I chose ZMQ_XPUB_VERBOSE: short and explicit and clearly an on/off switch with "off" being the default setting.
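
For context, this is all an application will eventually have to do with the new option; a short sketch, assuming the option takes an int used as a boolean, with 0 (filter duplicates) as the default:

#include <zmq.h>

int main (void)
{
    void *context = zmq_ctx_new ();
    void *publisher = zmq_socket (context, ZMQ_XPUB);
    //  Pass every subscription message to the application, not just
    //  the first one per topic
    int verbose = 1;
    zmq_setsockopt (publisher, ZMQ_XPUB_VERBOSE, &verbose, sizeof (verbose));
    zmq_close (publisher);
    zmq_ctx_destroy (context);
    return 0;
}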

So, it's time to add a private property to the xpub class definition in xpub.hpp:

// If true, send all subscription messages upstream, not just
// unique ones
bool verbose;

And then lift some code from router.cpp to implement the xsetsockopt method. Finally, change the xread_activated method to use this new option, and while at it, make that test on socket type more explicit too:

//  If the subscription is not a duplicate, store it so that it can be
//  passed to the user on the next recv call.
if (options.type == ZMQ_XPUB && (unique || verbose))
    pending.push_back (blob_t (data, size));

The thing builds nicely the first time. This makes me a little suspicious, but being lazy and jet-lagged I don't immediately make a test case to actually try out the change. The process doesn't demand that, even if usually I'd do it just to catch that inevitable 10% of mistakes we all make. I do however document this new option on the doc/zmq_setsockopt.txt man page. In the worst case, I added a patch that wasn't really useful. But I certainly didn't break anything.

I don't implement a matching zmq_getsockopt because "minimal" means what it says. There's no obvious use case for getting the value of an option that you presumably just set, in code. Symmetry isn't a valid reason to double the size of a patch. I did have to document the new option because the process says, "All Public Contracts SHOULD be documented."

Committing the code, I push the patch to my forked repository (the "origin"):

git commit -a -m "Fixed issue #443"
git push origin master

Switching to the GitHub web interface, I go to my libzmq fork, and press the big "Pull Request" button at the top. GitHub asks me for a title, so I enter "Added ZMQ_XPUB_VERBOSE option". I'm not sure why it asks this as I made a neat commit message but hey, let's go with the flow here.

This makes a nice little pull request with two commits; the one I'd made a month ago on the release notes to prepare for the v3.2.1 release (a month passes so quickly when you spend most of it in airports), and my fix for issue #443 (37 new lines of code). GitHub lets you continue to make commits after you've kicked off a pull request. They get queued up and merged in one go. That is easy, but the maintainer may refuse the whole bundle based on one patch that doesn't look valid.

Because Dan is waiting (at least in my highly optimistic imagination) for this fix, I go back to the zeromq-dev list and tell him I've made the patch, with a link to the commit. The faster I get feedback, the better. It's 1 a.m. in South Korea as I make this patch, so early evening in Europe, and morning in the States. You learn to count timezones when you work with people across the world. Ian is in a conference, Mikko is getting on a plane, and Chuck is probably in the office, but three hours later, Ian merges the pull request.

After Ian merges the pull request, I resynchronize my fork with the upstream libzmq repository. First, I add a remote that tells git where this repository sits (I do this just once in the directory where I'm working):

git remote add upstream git://github.com/zeromq/libzmq.git

And then I pull changes back from the upstream master and check the git log to double-check:

git pull --rebase upstream master
git log

And that is pretty much it, in terms of how much git one needs to learn and use to contribute patches to libzmq. Six git commands and some clicking on web pages. Most importantly to me as a naturally lazy, stupid, and easily confused developer, I don't have to learn git's internal models, and never have to do anything involving those infernal engines of structural complexity we call "git branches". Next up, the attempted assassination of git branches. Let's live dangerously!

Git Branches Considered Harmful

One of git's most popular features is its branches. Almost all projects that use git use branches, and the selection of the "best" branching strategy is like a rite of passage for an open source project. Vincent Driessen's git-flow may be the best known. It has base branches (master, develop), feature branches, release branches, hotfix branches, and support branches. Many teams have adopted git-flow, which even has git extensions to support it. I'm a great believer in popular wisdom, but sometimes you have to recognize mass delusion for what it is.

Here is a section of C4 that might have shocked you when you first read it:

The project SHALL NOT use topic branches for any reason. Personal forks MAY use topic branches.

To be clear, it's public branches in shared repositories that I'm talking about. Using branches for private work, e.g., to work on different issues, appears to work well enough, though it's more complexity than I personally enjoy. To channel Stallman again: "your freedom to create complexity ends one inch from our shared workspace."

Like the rest of C4, the rules on branches are not accidental. They came from our experience making ZeroMQ, starting when Martin Sustrik and I rethought how to make stable releases. We both love and appreciate simplicity (some people seem to have a remarkable tolerance for complexity). We chatted for a while… I asked him, "I'm going to start making a stable release. Would it be OK for me to make a branch in the git you're working in?" Martin didn't like the idea. "OK, if I fork the repository, I can move patches from your repo to that one". That felt much better to both of us.

The response from many in the ZeroMQ community was shock and horror. People felt we were being lazy and making contributors work harder to find the "right" repository. Still, this seemed simple, and indeed it worked smoothly. The best part was that we each worked as we wanted to. Whereas before, the ZeroMQ repository had felt horribly complex (and it wasn't even anything like git-flow), this felt simple. And it worked. The only downside was that we lost a single unified history. Now, perhaps historians will feel robbed, but I honestly can't see that the historical minutiae of who changed what, when, including every branch and experiment, are worth any significant pain or friction.

People have gotten used to the "multiple repositories" approach in ZeroMQ and we've started using that in other projects quite successfully. My own opinion is that history will judge git branches and patterns like git-flow as a complex solution to imaginary problems inherited from the days of Subversion and monolithic repositories.

More profoundly, and perhaps this is why the majority seems to be "wrong": I think the branches versus forks argument is really a deeper design-versus-evolve argument about how to make software optimally. I'll address that deeper argument in the next section. For now, I'll try to be scientific about my irrational hatred of branches, by looking at a number of criteria, and comparing branches and forks in each one.

Simplicity Versus Complexity

The simpler, the better.

There is no inherent reason why branches are more complex than forks. However, git-flow uses five types of branch, whereas C4 uses two types of fork (development and stable) and one branch (master). Circumstantial evidence is thus that branches lead to more complexity than forks. For new users, it is definitely easier, and we've measured this in practice, to learn to work with many repositories and no branches except master.

Change Latency

The smaller and more rapid the delivery, the better.

Development branches seem to correlate strongly with large, slow, risky deliveries. "Sorry, I have to merge this branch before we can test the new version" signals a breakdown in process. It's certainly not how C4 works, which is by focusing tightly on individual problems and their minimal solutions. Allowing branches in development raises change latency. Forks have a different outcome: it's up to the forker to ensure that his changes merge cleanly, and to keep them simple so they won't be rejected.

Learning Curve

The smoother the learning curve, the better.

Evidence definitely shows that learning to use git branches is complex. For some people, this is OK. For most developers, every cycle spent learning git is a cycle lost on more productive things. I've been told several times, by different people, that I do not like branches because I "never properly learned git". That is fair, but it is a criticism of the tool, not the human.

Cost of Failure

The lower the cost of failure, the better.

Branches demand more perfection from developers because mistakes potentially affect others. This raises the cost of failure. Forks make failure extremely cheap because literally nothing that happens in a fork can affect others not using that fork.

Up-front Coordination

The less need for up-front coordination, the better.

You can do a hostile fork. You cannot do a hostile branch. Branches depend on up-front coordination, which is expensive and fragile. One person can veto the desires of a whole group. For example in the ZeroMQ community we were unable to agree on a git branching model for a year. We solved that by using forking instead. The problem went away.

Scalability

The more you can scale a project, the better.

The strong assumption in all branch strategies is that the repository is the project. But there is a limit to how many people you can get to agree to work together in one repository. As I explained, the cost of up-front coordination can become fatal. A more realistic project scales by allowing anyone to start their own repositories, and ensuring these can work together. A project like ZeroMQ has dozens of repositories. Forking looks more scalable than branching.

Surprise and Expectations

The less surprising, the better.

People expect branches and find forks to be uncommon and thus confusing. This is the one aspect where branches win. If you use branches, a single patch will have the same commit hash tag, whereas across forks the patch will have different hash tags. That makes it harder to track patches as they cross forks, true. But seriously, having to track hexadecimal hash tags is not a feature. It's a bug. Sometimes better ways of working are surprising at first.

Economics of Participation

The more tangible the rewards, the better.

People like to own their work and get credit for it. This is much easier with forks than with branches. Forks create more competition in a healthy way, while branches suppress competition and force people to collaborate and share credit. This sounds positive but in my experience it demotivates people. A branch isn't a product you can "own", whereas a fork can be.

Robustness in Conflict

The more a model can survive conflict, the better.

Like it or not, people fight over ego, status, beliefs, and theories of the world. Challenge is a necessary part of science. If your organizational model depends on agreement, you won't survive the first real fight. Branches do not survive real arguments and fights, whereas forks can be hostile, and still benefit all parties. And this is indeed how free software works.

Guarantees of Isolation

The stronger the isolation between production code and experiment, the better.

People make mistakes. I've seen experimental code pushed to mainline production by error. I've seen people make bad panic changes under stress. But the real fault is in allowing two entirely separate generations of product to exist in the same protected space. If you can push to random-branch-x, you can push to master. Branches do not guarantee isolation of production critical code. Forks do.

Visibility

The more visible our work, the better.

Forks have watchers, issues, a README, and a wiki. Branches have none of these. People try forks, build them, break them, patch them. Branches sit there until someone remembers to work on them. Forks have downloads and tarballs. Branches do not. When we look for self-organization, the more visible and declarative the problems, the faster and more accurately we can work.

Conclusions

In this section, I've listed a series of arguments, most of which came from fellow team members. Here's how it seems to break down: git veterans insist that branches are the way to work, whereas newcomers tend to feel intimidated when asked to navigate git branches. Git is not an easy tool to master. What we've discovered, accidentally, is that when you stop using branches at all, git becomes trivial to use. It literally comes down to six commands (clone, remote, commit, log, push, and pull). Furthermore, a branch-free process actually works, we've used it for a couple of years now, with no visible downside except surprise to the veterans and growth of "single" projects over multiple repositories.

If you can't use forks, perhaps because your firm doesn't trust GitHub's private repositories, then you can perhaps use topic branches, one per issue. You'll still suffer the costs of getting up-front consensus, low competitiveness, and risk of human error.

Designing for Innovation

Let's look at innovation, which Wikipedia defines as "the development of new values through solutions that meet new requirements, inarticulate needs, or old customer and market needs in value-adding new ways." This really just means solving problems more cheaply. It sounds straightforward, but the history of collapsed tech giants proves that it's not. I'll try to explain how teams so often get it wrong, and suggest a way for doing innovation right.

The Tale of Two Bridges

Two old engineers were talking of their lives and boasting of their greatest projects. One of the engineers explained how he had designed one of the greatest bridges ever made.

"We built it across a river gorge," he told his friend. "It was wide and deep. We spent two years studying the land, and choosing designs and materials. We hired the best engineers and designed the bridge, which took another five years. We contracted the largest engineering firms to build the structures, the towers, the tollbooths, and the roads that would connect the bridge to the main highways. Dozens died during the construction. Under the road level we had trains, and a special path for cyclists. That bridge represented years of my life."

The second man reflected for a while, then spoke. "One evening me and a friend got drunk on vodka, and we threw a rope across a gorge," he said. "Just a rope, tied to two trees. There were two villages, one at each side. At first, people pulled packages across that rope with a pulley and string. Then someone threw a second rope, and built a foot walk. It was dangerous, but the kids loved it. A group of men then rebuilt that, made it solid, and women started to cross, every day, with their produce. A market grew up on one side of the bridge, and slowly that became a large town, because there was a lot of space for houses. The rope bridge got replaced with a wooden bridge, to allow horses and carts to cross. Then the town built a real stone bridge, with metal beams. Later, they replaced the stone part with steel, and today there's a suspension bridge standing in that same spot."

The first engineer was silent. "Funny thing," he said, "my bridge was demolished about ten years after we built it. Turns out it was built in the wrong place and no one wanted to use it. Some guys had thrown a rope across the gorge, a few miles further downstream, and that's where everyone went."

How ZeroMQ Lost Its Road Map

Presenting ZeroMQ at the Mix-IT conference in Lyon in early 2012, I was asked several times for the "road map". My answer was: there is no road map any longer. We had road maps, and we deleted them. Instead of a few experts trying to lay out the next steps, we were allowing this to happen organically. The audience didn't really like my answer. So un-French.

However, the history of ZeroMQ makes it quite clear why road maps were problematic. In the beginning, we had a small team making the library, with few contributors, and no documented road map. As ZeroMQ grew more popular and we switched to more contributors, users asked for road maps. So we collected our plans together and tried to organize them into releases. Here, we wrote, is what will come in the next release.

As we rolled out releases, we hit the problem that it's very easy to promise stuff, and rather harder to make it as planned. For one thing, much of the work was voluntary, and it's not clear how you force volunteers to commit to a road map. But also, priorities can shift dramatically over time. So we were making promises we could not keep, and the real deliveries didn't match the road maps.

The second problem was that by defining the road map, we in effect claimed territory, making it harder for others to participate. People do prefer to contribute to changes they believe were their idea. Writing down a list of things to do turns contribution into a chore rather than an opportunity.

Finally, we saw changes in ZeroMQ that were quite traumatic, and the road maps didn't help with this, despite a lot of discussion and effort to "do it right". Examples of this were incompatible changes in APIs and protocols. It was quite clear that we needed a different approach for defining the change process.

Software engineers don't like the notion that powerful, effective solutions can come into existence without an intelligent designer actively thinking things through. And yet no one in that room in Lyon would have questioned evolution. A strange irony, and one I wanted to explore further as it underpins the direction the ZeroMQ community has taken since the start of 2012.

In the dominant theory of innovation, brilliant individuals reflect on large problem sets and then carefully and precisely create a solution. Sometimes they will have "eureka" moments where they "get" brilliantly simple answers to whole large problem sets. The inventor, and the process of invention, are rare, precious, and can command a monopoly. History is full of such heroic individuals. We owe them our modern world.

Look more closely, however, and you will see that the facts don't match. History doesn't show lone inventors. It shows lucky people who steal or claim ownership of ideas that are being worked on by many. It shows brilliant people striking lucky once, and then spending decades on fruitless and pointless quests. The best known large-scale inventors like Thomas Edison were in fact just very good at systematic broad research done by large teams. It's like claiming that Steve Jobs invented every device made by Apple. It is a nice myth, good for marketing, but utterly useless as practical science.

Recent history, much better documented and less easy to manipulate, shows this well. The Internet is surely one of the most innovative and fast-moving areas of technology, and one of the best documented. It has no inventor. Instead, it has a massive economy of people who have carefully and progressively solved a long series of immediate problems, documented their answers, and made those available to all. The innovative nature of the Internet comes not from a small, select band of Einsteins. It comes from RFCs anyone can use and improve, made by hundreds and thousands of smart, but not uniquely smart, individuals. It comes from open source software anyone can use and improve. It comes from sharing, scale of community, and the continuous accretion of good solutions and disposal of bad ones.

Here thus is an alternative theory of innovation:

  1. There is an infinite problem/solution terrain.
  2. This terrain changes over time according to external conditions.
  3. We can only accurately perceive problems to which we are close.
  4. We can rank the cost/benefit economics of problems using a market for solutions.
  5. There is an optimal solution to any solvable problem.
  6. We can approach this optimal solution heuristically, and mechanically.
  7. Our intelligence can make this process faster, but does not replace it.

There are a few corollaries to this:

  • Individual creativity matters less than process. Smarter people may work faster, but they may also work in the wrong direction. It's the collective vision of reality that keeps us honest and relevant.
  • We don't need road maps if we have a good process. Functionality will emerge and evolve over time as solutions compete for market share.
  • We don't invent solutions so much as discover them. All sympathies to the creative soul. It's just an information processing machine that likes to polish its own ego and collect karma.
  • Intelligence is a social effect, though it feels personal. A person cut off from others eventually stops thinking. We can neither collect problems nor measure solutions without other people.
  • The size and diversity of the community is a key factor. Larger, more diverse communities collect more relevant problems, and solve them more accurately, and do this faster, than a small expert group.

So, when we trust the solitary experts, they make classic mistakes. They focus on ideas, not problems. They focus on the wrong problems. They make misjudgments about the value of solving problems. They don't use their own work.

Can we turn the above theory into a reusable process? In late 2011, I started documenting C4 and similar contracts, and using them both in ZeroMQ and in closed source projects. The underlying process is something I call "Simplicity Oriented Design", or SOD. This is a reproducible way of developing simple and elegant products. It organizes people into flexible supply chains that are able to navigate a problem landscape rapidly and cheaply. They do this by building, testing, and keeping or discarding minimal plausible solutions, called "patches". Living products consist of long series of patches, applied one atop the other.

SOD is relevant first because it's how we evolve ZeroMQ. It's also the basis for the design process we will use in Chapter 7 - Advanced Architecture using ZeroMQ to develop larger-scale ZeroMQ applications. Of course, you can use any software architecture methodology with ZeroMQ.

To best understand how we ended up with SOD, let's look at the alternatives.

Trash-Oriented Design

The most popular design process in large businesses seems to be Trash-Oriented Design, or TOD. TOD feeds off the belief that all we need to make money are great ideas. It's tenacious nonsense, but a powerful crutch for people who lack imagination. The theory goes that ideas are rare, so the trick is to capture them. It's like non-musicians being awed by a guitar player, not realizing that great talent is so cheap it literally plays on the streets for coins.

The main output of TODs is expensive "ideation": concepts, design documents, and products that go straight into the trash can. It works as follows:

  • The Creative People come up with long lists of "we could do X and Y". I've seen endlessly detailed lists of everything amazing a product could do. We've all been guilty of this. Once the creative work of idea generation has happened, it's just a matter of execution, of course.
  • So the managers and their consultants pass their brilliant ideas to designers who create acres of preciously refined design documents. The designers take the tens of ideas the managers came up with, and turn them into hundreds of world-changing designs.
  • These designs get given to engineers who scratch their heads and wonder who the heck came up with such nonsense. They start to argue back, but the designs come from up high, and really, it's not up to engineers to argue with creative people and expensive consultants.
  • So the engineers creep back to their cubicles, humiliated and threatened into building the gigantic but oh-so-elegant junk heap. It is bone-breaking work because the designs take no account of practical costs. Minor whims might take weeks of work to build. As the project gets delayed, the managers bully the engineers into giving up their evenings and weekends.
  • Eventually, something resembling a working product makes it out of the door. It's creaky and fragile, complex and ugly. The designers curse the engineers for their incompetence and pay more consultants to put lipstick onto the pig, and slowly the product starts to look a little nicer.
  • By this time, the managers have started to try to sell the product and they find, shockingly, that no one wants it. Undaunted, they courageously build million-dollar web sites and ad campaigns to explain to the public why they absolutely need this product. They do deals with other businesses to force the product on the lazy, stupid, and ungrateful market.
  • After twelve months of intense marketing, the product still isn't making profits. Worse, it suffers dramatic failures and gets branded in the press as a disaster. The company quietly shelves it, fires the consultants, buys a competing product from a small startup and rebrands that as its own Version 2. Hundreds of millions of dollars end up in the trash.
  • Meanwhile, another visionary manager somewhere in the organization drinks a little too much tequila with some marketing people and has a Brilliant Idea.

Trash-Oriented Design would be a caricature if it wasn't so common. Something like 19 out of 20 market-ready products built by large firms are failures (yes, 87% of statistics are made up on the spot). The remaining 1 in 20 probably only succeeds because the competitors are so bad and the marketing is so aggressive.

The main lessons of TOD are quite straightforward but hard to swallow. They are:

  • Ideas are cheap. No exceptions. There are no brilliant ideas. Anyone who tries to start a discussion with "oooh, we can do this too!" should be beaten down with all the passion one reserves for traveling evangelists. It is like sitting in a cafe at the foot of a mountain, drinking a hot chocolate and telling others, "Hey, I have a great idea, we can climb that mountain! And build a chalet on top! With two saunas! And a garden! Hey, and we can make it solar powered! Dude, that's awesome! What color should we paint it? Green! No, blue! OK, go and make it, I'll stay here and make spreadsheets and graphics!"
  • The starting point for a good design process is to collect real problems that confront real people. The second step is to evaluate these problems with the basic question, "How much is it worth to solve this problem?" Having done that, we can collect that set of problems that are worth solving.
  • Good solutions to real problems will succeed as products. Their success will depend on how good and cheap the solution is, and how important the problem is (and sadly, how big the marketing budgets are). But their success will also depend on how much they demand in effort to use—in other words, how simple they are.

Now, after slaying the dragon of utter irrelevance, we attack the demon of complexity.

Complexity-Oriented Design

Really good engineering teams and small firms can usually build decent products. But the vast majority of products still end up being too complex and less successful than they might be. This is because specialist teams, even the best, often stubbornly apply a process I call Complexity-Oriented Design, or COD, which works as follows:

  • Management correctly identifies some interesting and difficult problem with economic value. In doing so, they already leapfrog over any TOD team.
  • The team, with enthusiasm, starts to build prototypes and core layers. These work as designed and, thus encouraged, the team goes off into intense design and architecture discussions, coming up with elegant schemas that look beautiful and solid.
  • Management comes back and challenges the team with yet more difficult problems. We tend to equate cost with value, so the harder and more expensive to solve, the more the solution should be worth, in their minds.
  • The team, being engineers and thus loving to build stuff, build stuff. They build and build and build and end up with massive, perfectly-designed complexity.
  • The products go to market, and the market scratches its head and asks, "Seriously, is this the best you can do?" People do use the products, especially if they aren't spending their own money in climbing the learning curve.
  • Management gets positive feedback from its larger customers, who share the same idea that high cost (in training and use) means high value, and so continues to push the process.
  • Meanwhile somewhere across the world, a small team is solving the same problem using a better process, and a year later smashes the market to little pieces.

COD is characterized by a team obsessively solving the wrong problems in a form of collective delusion. COD products tend to be large, ambitious, complex, and unpopular. Much open source software is the output of COD processes. It is insanely hard for engineers to stop extending a design to cover more potential problems. They argue, "What if someone wants to do X?" but never ask themselves, "What is the real value of solving X?"

A good example of COD in practice is Bluetooth, a complex, over-designed set of protocols that users hate. It continues to exist only because in a massively-patented industry there are no real alternatives. Bluetooth is perfectly secure, which is close to pointless for a proximity protocol. At the same time, it lacks a standard API for developers, meaning it's really costly to use Bluetooth in applications.

On the #zeromq IRC channel, Wintre once wrote of how enraged he was many years ago when he "found that XMMS 2 had a working plugin system, but could not actually play music."

COD is a form of large-scale "rabbit-holing", in which designers and engineers cannot distance themselves from the technical details of their work. They add more and more features, utterly misreading the economics of their work.

The main lessons of COD are also simple, but hard for experts to swallow. They are:

  • Making stuff that you don't immediately have a need for is pointless. Doesn't matter how talented or brilliant you are, if you just sit down and make stuff people are not actually asking for, you are most likely wasting your time.
  • Problems are not equal. Some are simple, and some are complex. Ironically, solving the simpler problems often has more value to more people than solving the really hard ones. So if you allow engineers to just work on random things, they'll mostly focus on the most interesting but least worthwhile things.
  • Engineers and designers love to make stuff and decoration, and this inevitably leads to complexity. It is crucial to have a "stop mechanism", a way to set short, hard deadlines that force people to make smaller, simpler answers to just the most crucial problems.

Simplicity Oriented Design

Finally, we come to the rare but precious Simplicity Oriented Design, or SOD. This process starts with a realization: we do not know what we have to make until after we start making it. Coming up with ideas or large-scale designs isn't just wasteful, it's a direct hindrance to designing the truly accurate solutions. The really juicy problems are hidden like far valleys, and any activity except active scouting creates a fog that hides those distant valleys. You need to keep mobile, pack light, and move fast.

SOD works as follows:

  • We collect a set of interesting problems (by looking at how people use technology or other products) and we line these up from simple to complex, looking for and identifying patterns of use.
  • We take the simplest, most dramatic problem and we solve this with a minimal plausible solution, or "patch". Each patch solves exactly one genuine and agreed-upon problem in a brutally minimal fashion.
  • We apply one measure of quality to patches, namely "Can this be done any simpler while still solving the stated problem?" We can measure complexity in terms of concepts and models that the user has to learn or guess in order to use the patch. The fewer, the better. A perfect patch solves a problem with zero learning required by the user.
  • Our product development consists of a patch that solves the problem "we need a proof of concept" and then evolves in an unbroken line to a mature series of products, through hundreds or thousands of patches piled on top of each other.
  • We do not do anything that is not a patch. We enforce this rule with formal processes that demand that every activity or task is tied to a genuine and agreed-upon problem, explicitly enunciated and documented.
  • We build our projects into a supply chain where each project can provide problems to its "suppliers" and receive patches in return. The supply chain creates the "stop mechanism" because when people are impatiently waiting for an answer, we necessarily cut our work short.
  • Individuals are free to work on any projects, and provide patches at any place they feel it's worthwhile. No individuals "own" any project, except to enforce the formal processes. A single project can have many variations, each a collection of different, competing patches.
  • Projects export formal and documented interfaces so that upstream (client) projects are unaware of changes happening in supplier projects. Thus multiple supplier projects can compete for client projects, in effect creating a free and competitive market.
  • We tie our supply chain to real users and external clients and we drive the whole process by rapid cycles so that a problem received from outside users can be analyzed, evaluated, and solved with a patch in a few hours.
  • At every moment from the very first patch, our product is shippable. This is essential, because a large proportion of patches will be wrong (10-30%) and only by giving the product to users can we know which patches have become problems that need solving.

SOD is a hill-climbing algorithm, a reliable way of finding optimal solutions to the most significant problems in an unknown landscape. You don't need to be a genius to use SOD successfully, you just need to be able to see the difference between the fog of activity and the progress towards new real problems.

People have pointed out that hill-climbing algorithms have known limitations. One gets stuck on local peaks, mainly. But this is nonetheless how life itself works: collecting tiny incremental improvements over long periods of time. There is no intelligent designer. We reduce the risk of local peaks by spreading out widely across the landscape, but it is somewhat moot. The limitations aren't optional, they are physical laws. The theory says, this is how innovation really works, so better to embrace it and work with it than to try to work on the basis of magical thinking.

And in fact once you see all innovation as more or less successful hill-climbing, you realize why some teams and companies and products get stuck in a never-never land of diminishing prospects. They simply don't have the diversity and collective intelligence to find better hills to climb. When Nokia killed their open source projects, they cut their own throat.

A really good designer with a good team can use SOD to build world-class products, rapidly and accurately. To get the most out of SOD the designer has to use the product continuously, from day one, and develop his or her ability to smell out problems such as inconsistency, surprising behavior, and other forms of friction. We naturally overlook many annoyances, but a good designer picks these up and thinks about how to patch them. Design is about removing friction in the use of a product.

In an open source setting, we do this work in public. There's no "let's open the code" moment. Projects that do this are in my view missing the point of open source, which is to engage your users in your exploration, and to build community around the seed of the architecture.

Burnout

The ZeroMQ community has been and still is heavily dependent on pro bono individual efforts. I'd like to think that everyone was compensated in some way for their contributions, and I believe that with ZeroMQ, contributing means gaining expertise in an extraordinarily valuable technology, which leads to improved professional options.

However, not all projects will be so lucky and if you work with or in open source, you should understand the risk of burnout that volunteers face. This applies to all pro bono communities. In this section, I'll explain what causes burnout, how to recognize it, how to prevent it, and (if it happens) how to try to treat it. Disclaimer: I'm not a psychiatrist and this article is based on my own experiences of working in pro bono contexts for the last 20 years, including free software projects, and NGOs such as the FFII.

In a pro bono context, we're expected to work without direct or obvious economic incentive. That is, we sacrifice family life, professional advancement, free time, and health in order to accomplish some goal we have decided to accomplish. In any project, we need some kind of reward to make it worth continuing each day. In most pro bono projects the rewards are very indirect, superficially not economical at all. Mostly, we do things because people say, "Hey, great!" Karma is a powerful motivator.

However, we are economic beings, and sooner or later, if a project costs us a great deal and does not bring economic rewards of some kind (money, fame, a new job), we start to suffer. At a certain stage, it seems our subconscious simply gets disgusted and says, "Enough is enough!" and refuses to go any further. If we try to force ourselves, we can literally get sick.

This is what I call "burnout", though the term is also used for other kinds of exhaustion. Too much investment in a project with too little economic reward, for too long. We are great at manipulating ourselves and others, and this is often part of the process that leads to burnout. We tell ourselves that it's for a good cause and that the other guy is doing OK, so we should be able to as well.

When I got burned out on open source projects like Xitami, I remember clearly how I felt. I simply stopped working on it, refused to answer any more emails, and told people to forget about it. You can tell when someone's burned out. They go offline, and everyone starts saying, "He's acting strange… depressed, or tired…"

Diagnosis is simple. Has someone worked a lot on a project that was not paying back in any way? Did she make exceptional sacrifices? Did he lose or abandon his job or studies to do the project? If you're answering "yes", it's burnout.

There are three simple techniques I've developed over the years to reduce the risk of burnout in the teams I work with:

  • No one is irreplaceable. Working solo on a critical or popular project—the concentration of responsibility on one person who cannot set their own limits—is probably the main factor. It's a management truism: if someone in your organization is irreplaceable, get rid of him or her.
  • We need day jobs to pay the bills. This can be hard, but seems necessary. Getting money from somewhere else makes it much easier to sustain a sacrificial project.
  • Teach people about burnout. This should be a basic course in colleges and universities, as pro bono work becomes a more common way for young people to experiment professionally.

When someone is working alone on a critical project, you know they are going to blow their fuses sooner or later. It's actually fairly predictable: something like 18-36 months, depending on the individual and how much economic stress they face in their private lives. I've not seen anyone burn out after half a year, nor last five years in an unrewarding project.

There is a simple cure for burnout that works in at least some cases: get paid decently for your work. However, this pretty much destroys the freedom of movement (across that infinite problem landscape) that the volunteer enjoys.

Patterns for Success

I'll end this code-free chapter with a series of patterns for success in software engineering. They aim to capture the essence of what divides glorious success from tragic failure. They were described as "religious maniacal dogma" by a manager, and "anything else would be effing insane" by a colleague, in a single day. For me, they are science. But treat the Lazy Perfectionist and others as tools to use, sharpen, and throw away if something better comes along.

The Lazy Perfectionist

Never design anything that's not a precise, minimal answer to a problem we can identify and have to solve.

The Lazy Perfectionist spends his idle time observing others and identifying problems that are worth solving. He looks for agreement on those problems, always asking, "What is the real problem?" Then he moves, precisely and minimally, to build, or get others to build, a usable answer to one problem. He uses, or gets others to use, those solutions. And he repeats this until there are no problems left to solve, or time or money runs out.

The Benevolent Tyrant

The control of a large force is the same principle as the control of a few men: it is merely a question of dividing up their numbers. — Sun Tzu

The Benevolent Tyrant divides large problems into smaller ones and throws them at groups to focus on. She brokers contracts between these groups, in the form of APIs and the "unprotocols" we'll read about in the next chapter. The Benevolent Tyrant constructs a supply chain that starts with problems, and results in usable solutions. She is ruthless about how the supply chain works, but does not tell people what to work on, nor how to do their work.

The Earth and Sky

The ideal team consists of two sides: one writing code, and one providing feedback.

The Earth and Sky work together as a whole, in close proximity, but they communicate formally through issue tracking. Sky seeks out problems from others and from their own use of the product and feeds these to Earth. Earth rapidly answers with testable solutions. Earth and Sky can work through dozens of issues in a day. Sky talks to other users, and Earth talks to other developers. Earth and Sky may be two people, or two small groups.

The Open Door

The accuracy of knowledge comes from diversity.

The Open Door accepts contributions from almost anyone. She does not argue quality or direction, instead allowing others to argue that and get more engaged. She calculates that even a troll will bring more diverse opinion to the group. She lets the group form its opinion about what goes into stable code, and she enforces this opinion with the help of a Benevolent Tyrant.

The Laughing Clown

Perfection precludes participation.

The Laughing Clown, often acting as the Happy Failure, makes no claim to high competence. Instead his antics and bumbling attempts provoke others into rescuing him from his own tragedy. Somehow, however, he always identifies the right problems to solve. People are so busy proving him wrong they don't realize they're doing valuable work.

The Mindful General

Make no plans. Set goals, develop strategies and tactics.

The Mindful General operates in unknown territory, solving problems that are hidden until they are nearby. Thus she makes no plans, but seeks opportunities, then exploits them rapidly and accurately. She develops tactics and strategies in the field, and teaches these to her soldiers so they can move independently, and together.

The Social Engineer

If you know the enemy and know yourself, you need not fear the result of a hundred battles. — Sun Tzu

The Social Engineer reads the hearts and minds of those he works with and for. He asks, of everyone, "What makes this person angry, insecure, argumentative, calm, happy?" He studies their moods and dispositions. With this knowledge he can encourage those who are useful, and discourage those who are not. The Social Engineer never acts on his own emotions.

The Constant Gardener

He will win whose army is animated by the same spirit throughout all its ranks. — Sun Tzu

The Constant Gardener grows a process from a small seed, step-by-step as more people come into the project. She makes every change for a precise reason, with agreement from everyone. She never imposes a process from above but lets others come to consensus, and then she enforces that consensus. In this way, everyone owns the process together and by owning it, they are attached to it.

The Rolling Stone

After crossing a river, you should get far away from it. — Sun Tzu

The Rolling Stone accepts his own mortality and transience. He has no attachment to his past work. He accepts that all that we make is destined for the trash can; it is just a matter of time. With precise, minimal investments, he can move rapidly away from the past and stay focused on the present and near future. Above all, he has no ego and no pride to be hurt by the actions of others.

The Pirate Gang

Code, like all knowledge, works best as collective—not private—property.

The Pirate Gang organizes freely around problems. It accepts authority insofar as authority provides goals and resources. The Pirate Gang owns and shares all it makes: every work is fully remixable by others in the Pirate Gang. The gang moves rapidly as new problems emerge, and is quick to abandon old solutions if those stop being relevant. No persons or groups can monopolize any part of the supply chain.

The Flash Mob

Water shapes its course according to the nature of the ground over which it flows. — Sun Tzu

The Flash Mob comes together in space and time as needed, then disperses as soon as they can. Physical closeness is essential for high-bandwidth communications. But over time it creates technical ghettos, where Earth gets separated from Sky. The Flash Mob tends to collect a lot of frequent flier miles.

The Canary Watcher

Pain is not, generally, a Good Sign.

The Canary Watcher measures the quality of an organization by his own pain level, and the observed pain levels of those with whom he works. He brings new participants into existing organizations so they can express the raw pain of the innocent. He may use alcohol to get others to verbalize their pain points. He asks others, and himself, "Are you happy in this process, and if not, why not?" When an organization causes pain in himself or others, he treats that as a problem to be fixed. People should feel joy in their work.

The Hangman

Never interrupt others when they are making mistakes.

The Hangman knows that we learn only by making mistakes, and she gives others copious rope with which to learn. She only pulls the rope gently, when it's time. A little tug to remind the other of their precarious position. Allowing others to learn by failure gives the good a reason to stay, and the bad an excuse to leave. The Hangman is endlessly patient, because there is no shortcut to the learning process.

The Historian

Keeping the public record may be tedious, but it's the only way to prevent collusion.

The Historian forces discussion into the public view, to prevent collusion to own areas of work. The Pirate Gang depends on full and equal communications that do not depend on momentary presence. No one really reads the archives, but the simple possibility stops most abuses. The Historian encourages the right tool for the job: email for transient discussions, IRC for chatter, wikis for knowledge, issue tracking for recording opportunities.

The Provocateur

When a man knows he is to be hanged in a fortnight, it concentrates his mind wonderfully. — Samuel Johnson

The Provocateur creates deadlines, enemies, and the occasional impossibility. Teams work best when they don't have time for the crap. Deadlines bring people together and focus the collective mind. An external enemy can move a passive team into action. The Provocateur never takes the deadline too seriously. The product is always ready to ship. But she gently reminds the team of the stakes: fail, and we all look for other jobs.

The Mystic

When people argue or complain, just write them a Sun Tzu quotation. — Mikko Koppanen

The Mystic never argues directly. He knows that to argue with an emotional person only creates more emotion. Instead he side-steps the discussion. It's hard to be angry at a Chinese general, especially when he has been dead for 2,400 years. The Mystic plays Hangman when people insist on the right to get it wrong.


Chapter 7 - Advanced Architecture using ZeroMQ

One of the effects of using ZeroMQ at large scale is that because we can build distributed architectures so much faster than before, the limitations of our software engineering processes become more visible. Mistakes in slow motion are often harder to see (or rather, easier to rationalize away).

My experience when teaching ZeroMQ to groups of engineers is that it's rarely sufficient to explain how ZeroMQ works and then just expect them to start building successful products. Like any technology that removes friction, ZeroMQ opens the door to big blunders. If ZeroMQ is the ACME rocket-propelled shoe of distributed software development, a lot of us are like Wile E. Coyote, slamming full speed into the proverbial desert cliff.

We saw in Chapter 6 - The ZeroMQ Community that ZeroMQ itself uses a formal process for changes. One reason we built this process, over some years, was to stop the repeated cliff-slamming that happened in the library itself.

Partly, it's about slowing down and partly, it's about ensuring that when you move fast, you go—and this is essential, Dear Reader—in the right direction. It's my standard interview riddle: what's the rarest property of any software system, the absolute hardest thing to get right, the lack of which causes the slow or fast death of the vast majority of projects? The answer is not code quality, funding, performance, or even (though it's a close answer), popularity. The answer is accuracy.

Accuracy is half the challenge, and applies to any engineering work. The other half is distributed computing itself, which sets up a whole range of problems that we need to solve if we are going to create architectures. We need to encode and decode data; we need to define protocols to connect clients and servers; we need to secure these protocols against attackers; and we need to make stacks that are robust. Asynchronous messaging is hard to get right.

This chapter will tackle these challenges, starting with a basic reappraisal of how to design and build software and ending with a fully formed example of a distributed application for large-scale file distribution.

We'll cover the following juicy topics:

  • How to go from idea to working prototype safely (the MOPED pattern)
  • Different ways to serialize your data as ZeroMQ messages
  • How to code-generate binary serialization codecs
  • How to build custom code generators using the GSL tool
  • How to write and license a protocol specification
  • How to build fast restartable file transfer over ZeroMQ
  • How to use credit-based flow control for nonblocking transfers
  • How to build protocol servers and clients as state machines
  • How to make a secure protocol over ZeroMQ
  • A large-scale file publishing system (FileMQ)

Message-Oriented Pattern for Elastic Design

I'll introduce Message-Oriented Pattern for Elastic Design (MOPED), a software engineering pattern for ZeroMQ architectures. It was either "MOPED" or "BIKE", the Backronym-Induced Kinetic Effect. That's short for "BICICLE", the Backronym-Inflated See if I Care Less Effect. In life, one learns to go with the least embarrassing choice.

If you've read this book carefully, you'll have seen MOPED in action already. The development of Majordomo in Chapter 4 - Reliable Request-Reply Patterns is a near-perfect case. But cute names are worth a thousand words.

The goal of MOPED is to define a process by which we can take a rough use case for a new distributed application, and go from "Hello World" to fully-working prototype in any language in under a week.

Using MOPED, you grow, more than build, a working ZeroMQ architecture from the ground up with minimal risk of failure. By focusing on the contracts rather than the implementations, you avoid the risk of premature optimization. By driving the design process through ultra-short test-based cycles, you can be more certain that what you have works before you add more.

We can turn this into five real steps:

  • Step 1: internalize the ZeroMQ semantics.
  • Step 2: draw a rough architecture.
  • Step 3: decide on the contracts.
  • Step 4: make a minimal end-to-end solution.
  • Step 5: solve one problem and repeat.

Step 1: Internalize the Semantics

You must learn and digest ZeroMQ's "language", that is, the socket patterns and how they work. The only way to learn a language is to use it. There's no way to avoid this investment, no tapes you can play while you sleep, no chips you can plug in to magically become smarter. Read this book from the start, work through the code examples in whatever language you prefer, understand what's going on, and (most importantly) write some examples yourself and then throw them away.

At a certain point, you'll feel a clicking noise in your brain. Maybe you'll have a weird chili-induced dream where little ZeroMQ tasks run around trying to eat you alive. Maybe you'll just think "aaahh, so that's what it means!" If we did our work right, it should take two to three days. However long it takes, until you start thinking in terms of ZeroMQ sockets and patterns, you're not ready for step 2.

Step 2: Draw a Rough Architecture

From my experience, it's essential to be able to draw the core of your architecture. It helps others understand what you are thinking, and it also helps you think through your ideas. There is really no better way to design a good architecture than to explain your ideas to your colleagues, using a whiteboard.

You don't need to get it right, and you don't need to make it complete. What you do need to do is break your architecture into pieces that make sense. The nice thing about software architecture (as compared to constructing bridges) is that you really can replace entire layers cheaply if you've isolated them.

Start by choosing the core problem that you are going to solve. Ignore anything that's not essential to that problem: you will add it in later. The problem should be an end-to-end problem: the rope across the gorge.

For example, a client asked us to make a supercomputing cluster with ZeroMQ. Clients create bundles of work, which are sent to a broker that distributes them to workers (running on fast graphics processors), collects the results back, and returns them to the client.

The rope across the gorge is one client talking to a broker talking to one worker. We draw three boxes: client, broker, worker. We draw arrows from box to box showing the request flowing one way and the response flowing back. It's just like the many diagrams we saw in earlier chapters.

Be minimalistic. Your goal is not to define a real architecture, but to throw a rope across the gorge to bootstrap your process. We make the architecture successively more complete and realistic over time: e.g., adding multiple workers, adding client and worker APIs, handling failures, and so on.

Step 3: Decide on the Contracts

A good software architecture depends on contracts, and the more explicit they are, the better things scale. You don't care how things happen; you only care about the results. If I send an email, I don't care how it arrives at its destination, as long as the contract is respected. The email contract is: it arrives within a few minutes, no-one modifies it, and it doesn't get lost.

And to build a large system that works well, you must focus on the contracts before the implementations. It may sound obvious but all too often, people forget or ignore this, or are just too shy to impose themselves. I wish I could say ZeroMQ had done this properly, but for years our public contracts were second-rate afterthoughts instead of primary in-your-face pieces of work.

So what is a contract in a distributed system? There are, in my experience, two types of contract:

  • The APIs to client applications. Remember the Psychological Elements. The APIs need to be as absolutely simple, consistent, and familiar as possible. Yes, you can generate API documentation from code, but you must first design it, and designing an API is often hard.
  • The protocols that connect the pieces. It sounds like rocket science, but it's really just a simple trick, and one that ZeroMQ makes particularly easy. In fact they're so simple to write, and need so little bureaucracy that I call them unprotocols.

You write minimal contracts that are mostly just place markers. Most messages and most API methods will be missing or empty. You also want to write down any known technical requirements in terms of throughput, latency, reliability, and so on. These are the criteria on which you will accept or reject any particular piece of work.

Step 4: Write a Minimal End-to-End Solution

The goal is to test out the overall architecture as rapidly as possible. Make skeleton applications that call the APIs, and skeleton stacks that implement both sides of every protocol. You want to get a working end-to-end "Hello World" as soon as you can. You want to be able to test code as you write it, so that you can weed out the broken assumptions and inevitable errors you make. Do not go off and spend six months writing a test suite! Instead, make a minimal bare-bones application that uses our still-hypothetical API.
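
If it helps to see what "minimal" means here, the following sketch squeezes both ends of a "Hello World" into one process over inproc, using nothing but libzmq (assuming version 3.2 or later). A real skeleton would split client and server into separate programs, but the shape is the same.

#include <zmq.h>
#include <stdio.h>
#include <assert.h>

int main (void)
{
    void *ctx = zmq_ctx_new ();

    //  Skeleton "server" side: bind before connect for inproc transports
    void *server = zmq_socket (ctx, ZMQ_REP);
    assert (zmq_bind (server, "inproc://hello-test") == 0);

    //  Skeleton "client" side
    void *client = zmq_socket (ctx, ZMQ_REQ);
    assert (zmq_connect (client, "inproc://hello-test") == 0);

    char buffer [64];
    zmq_send (client, "Hello", 5, 0);                   //  Client request
    zmq_recv (server, buffer, sizeof (buffer), 0);      //  Server receives it
    zmq_send (server, "World", 5, 0);                   //  Server reply
    int size = zmq_recv (client, buffer, sizeof (buffer), 0);
    printf ("End-to-end works, got: %.*s\n", size, buffer);

    zmq_close (client);
    zmq_close (server);
    zmq_ctx_destroy (ctx);
    return 0;
}

Once something like this runs end to end, every later change has a test to hang off.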

If you design an API wearing the hat of the person who implements it, you'll start to think of performance, features, options, and so on. You'll make it more complex, more irregular, and more surprising than it should be. But, and here's the trick (it's a cheap one, was big in Japan): if you design an API while wearing the hat of the person who has to actually write apps that use it, you use all that laziness and fear to your advantage.

Write down the protocols on a wiki or shared document in such a way that you can explain every command clearly without too much detail. Strip off any real functionality, because it will only create inertia that makes it harder to move stuff around. You can always add weight. Don't spend effort defining formal message structures: pass the minimum around in the simplest possible fashion using ZeroMQ's multipart framing.

Our goal is to get the simplest test case working, without any avoidable functionality. Everything you can chop off the list of things to do, you chop. Ignore the groans from colleagues and bosses. I'll repeat this once again: you can always add functionality, that's relatively easy. But aim to keep the overall weight to a minimum.

Step 5: Solve One Problem and Repeat

You're now in the happy cycle of issue-driven development where you can start to solve tangible problems instead of adding features. Write issues that each state a clear problem, and propose a solution. As you design the API, keep in mind your standards for names, consistency, and behavior. Writing these down in prose often helps keep them sane.

From here, every single change you make to the architecture and code can be proven by running the test case, watching it not work, making the change, and then watching it work.

Now you go through the whole cycle (extending the test case, fixing the API, updating the protocol, and extending the code, as needed), taking problems one at a time and testing the solutions individually. It should take about 10-30 minutes for each cycle, with the occasional spike due to random confusion.

Unprotocols

Protocols Without The Goats

When this man thinks of protocols, this man thinks of massive documents written by committees, over years. This man thinks of the IETF, W3C, ISO, Oasis, regulatory capture, FRAND patent license disputes, and soon after, this man thinks of retirement to a nice little farm in northern Bolivia up in the mountains where the only other needlessly stubborn beings are the goats chewing up the coffee plants.

Now, I've nothing personal against committees. The useless folk need a place to sit out their lives with minimal risk of reproducing; after all, that only seems fair. But most committee protocols tend towards complexity (the ones that work), or trash (the ones we don't talk about). There are a few reasons for this. One is the amount of money at stake. More money means more people who want their particular prejudices and assumptions expressed in prose. But two is the lack of good abstractions on which to build. People have tried to build reusable protocol abstractions, like BEEP. Most did not stick, and those that did, like SOAP and XMPP, are on the complex side of things.

It used to be, decades ago, when the Internet was a young modest thing, that protocols were short and sweet. They weren't even "standards", but "requests for comments", which is as modest as you can get. It's been one of my goals since we started iMatix in 1995 to find a way for ordinary people like me to write small, accurate protocols without the overhead of the committees.

Now, ZeroMQ does appear to provide a living, successful protocol abstraction layer with its "we'll carry multipart messages over random transports" way of working. Because ZeroMQ deals silently with framing, connections, and routing, it's surprisingly easy to write full protocol specs on top of ZeroMQ, and in Chapter 4 - Reliable Request-Reply Patterns and Chapter 5 - Advanced Pub-Sub Patterns I showed how to do this.

Somewhere around mid-2007, I kicked off the Digital Standards Organization to define new simpler ways of producing little standards, protocols, and specifications. In my defense, it was a quiet summer. At the time, I wrote that a new specification should take "minutes to explain, hours to design, days to write, weeks to prove, months to become mature, and years to replace."

In 2010, we started calling such little specifications unprotocols, which some people might mistake for a dastardly plan for world domination by a shadowy international organization, but which really just means "protocols without the goats".

Contracts Are Hard

Writing contracts is perhaps the most difficult part of large-scale architecture. With unprotocols, we remove as much of the unnecessary friction as possible. What remains is still a hard set of problems to solve. A good contract (be it an API, a protocol, or a rental agreement) has to be simple, unambiguous, technically sound, and easy to enforce.

Like any technical skill, it's something you have to learn and practice. There are a series of specifications on the ZeroMQ RFC site, which are worth reading and using as a basis for your own specifications when you find yourself in need.

I'll try to summarize my experience as a protocol writer:

  • Start simple, and develop your specifications step-by-step. Don't solve problems you don't have in front of you.
  • Use very clear and consistent language. A protocol may often break down into commands and fields; use clear short names for these entities.
  • Try to avoid inventing concepts. Reuse anything you can from existing specifications. Use terminology that is obvious and clear to your audience.
  • Make nothing for which you cannot demonstrate an immediate need. Your specification solves problems; it does not provide features. Make the simplest plausible solution for each problem that you identify.
  • Implement your protocol as you build it, so that you are aware of the technical consequences of each choice. Use a language that makes it hard (like C) and not one that makes it easy (like Python).
  • Test your specification as you build it on other people. Your best feedback on a specification is when someone else tries to implement it without the assumptions and knowledge that you have in your head.
  • Cross-test rapidly and consistently, throwing others' clients against your servers and vice versa.
  • Be prepared to throw it out and start again as often as needed. Plan for this, by layering your architecture so that e.g., you can keep an API but change the underlying protocols.
  • Only use constructs that are independent of programming language and operating system.
  • Solve a large problem in layers, making each layer an independent specification. Beware of creating monolithic protocols. Think about how reusable each layer is. Think about how different teams could build competing specifications at each layer.

And above all, write it down. Code is not a specification. The point about a written specification is that no matter how weak it is, it can be systematically improved. By writing down a specification, you will also spot inconsistencies and gray areas that are impossible to see in code.

If this sounds hard, don't worry too much. One of the less obvious benefits of using ZeroMQ is that it cuts the effort necessary to write a protocol spec by perhaps 90% or more because it already handles framing, routing, queuing, and so on. This means that you can experiment rapidly, make mistakes cheaply, and thus learn rapidly.

How to Write Unprotocols

When you start to write an unprotocol specification document, stick to a consistent structure so that your readers know what to expect. Here is the structure I use:

  • Cover section: with a 1-line summary, URL to the spec, formal name, version, who to blame.
  • License for the text: absolutely needed for public specifications.
  • The change process: i.e., how can I as a reader fix problems in the specification?
  • Use of language: MUST, MAY, SHOULD, and so on, with a reference to RFC 2119.
  • Maturity indicator: is this experimental, draft, stable, legacy, or retired?
  • Goals of the protocol: what problems is it trying to solve?
  • Formal grammar: prevents arguments due to different interpretations of the text.
  • Technical explanation: semantics of each message, error handling, and so on.
  • Security discussion: explicitly, how secure the protocol is.
  • References: to other documents, protocols, and so on.

Writing clear, expressive text is hard. Do avoid trying to describe implementations of the protocol. Remember that you're writing a contract. You describe in clear language the obligations and expectations of each party, the level of obligation, and the penalties for breaking the rules. You do not try to define how each party honors its part of the deal.

Here are some key points about unprotocols:

  • As long as your process is open, then you don't need a committee: just make clean minimal designs and make sure anyone is free to improve them.
  • If you use an existing license, then you don't have legal worries afterwards. I use GPLv3 for my public specifications and advise you to do the same. For in-house work, standard copyright is perfect.
  • Formality is valuable. That is, learn to write a formal grammar such as ABNF (Augmented Backus-Naur Form) and use this to fully document your messages.
  • Use a market-driven life cycle process like Digistan's COSS so that people place the right weight on your specs as they mature (or don't).

Why use the GPLv3 for Public Specifications?

The license you choose is particularly crucial for public specifications. Traditionally, protocols are published under custom licenses, where the authors own the text and derived works are forbidden. This sounds great (after all, who wants to see a protocol forked?), but it's in fact highly risky. A protocol committee is vulnerable to capture, and if the protocol is important and valuable, the incentive for capture grows.

Once captured, like some wild animals, an important protocol will often die. The real problem is that there's no way to free a captive protocol published under a conventional license. The word "free" isn't just an adjective to describe speech or air, it's also a verb, and the right to fork a work against the wishes of the owner is essential to avoiding capture.

Let me explain this in shorter words. Imagine that iMatix writes a protocol today that's really amazing and popular. We publish the spec and many people implement it. Those implementations are fast and awesome, and free as in beer. They start to threaten an existing business. Their expensive commercial product is slower and can't compete. So one day they come to our iMatix office in Maetang-Dong, South Korea, and offer to buy our firm. Because we're spending vast amounts on sushi and beer, we accept gratefully. With evil laughter, the new owners of the protocol stop improving the public version, close the specification, and add patented extensions. Their new products support this new protocol version, but the open source versions are legally blocked from doing so. The company takes over the whole market, and competition ends.

When you contribute to an open source project, you really want to know your hard work won't be used against you by a closed source competitor. This is why the GPL beats the "more permissive" BSD/MIT/X11 licenses for most contributors. These licenses give permission to cheat. This applies just as much to protocols as to source code.

When you implement a GPLv3 specification, your applications are of course yours, and licensed any way you like. But you can be certain of two things. One, that specification will never be embraced and extended into proprietary forms. Any derived forms of the specification must also be GPLv3. Two, no one who ever implements or uses the protocol will ever launch a patent attack on anything it covers, nor can they add their patented technology to it without granting the world a free license.

Using ABNF

My advice when writing protocol specs is to learn and use a formal grammar. It's just less hassle than allowing others to interpret what you mean, and then recover from the inevitable false assumptions. The target of your grammar is other people, engineers, not compilers.

My favorite grammar is ABNF, as defined by RFC 2234, because it is probably the simplest and most widely used formal language for defining bidirectional communications protocols. Most IETF (Internet Engineering Task Force) specifications use ABNF, which is good company to be in.

I'll give a 30-second crash course in writing ABNF. It may remind you of regular expressions. You write the grammar as rules. Each rule takes the form "name = elements". An element can be another rule (which you define below) or a pre-defined terminal like CRLF, OCTET, or a number. The RFC lists all the terminals. To define alternative elements, separate them with a slash. To define repetition, use an asterisk. To group elements, use parentheses. Read the RFC because it's not intuitive.

I'm not sure if this extension is proper, but I then prefix elements with "C:" and "S:" to indicate whether they come from the client or server.

Here's a piece of ABNF for an unprotocol called NOM that we'll come back to later in this chapter:

nom-protocol    = open-peering *use-peering

open-peering    = C:OHAI ( S:OHAI-OK / S:WTF )

use-peering     = C:ICANHAZ
                / S:CHEEZBURGER
                / C:HUGZ S:HUGZ-OK
                / S:HUGZ C:HUGZ-OK

I've actually used these keywords (OHAI, WTF) in commercial projects. They make developers giggly and happy. They confuse management. They're good in first drafts that you want to throw away later.

The Cheap or Nasty Pattern

There is a general lesson I've learned over a couple of decades of writing protocols small and large. I call this the Cheap or Nasty pattern: you can often split your work into two aspects or layers and solve these separately—one using a "cheap" approach, the other using a "nasty" approach.

The key insight to making Cheap or Nasty work is to realize that many protocols mix a low-volume chatty part for control, and a high-volume asynchronous part for data. For instance, HTTP has a chatty dialog to authenticate and get pages, and an asynchronous dialog to stream data. FTP actually splits this over two ports; one port for control and one port for data.

Protocol designers who don't separate control from data tend to make horrid protocols, because the trade-offs in the two cases are almost totally opposed. What is perfect for control is bad for data, and what's ideal for data just doesn't work for control. It's especially true when we want high performance at the same time as extensibility and good error checking.

Let's break this down using a classic client/server use case. The client connects to the server and authenticates. It then asks for some resource. The server chats back, then starts to send data back to the client. Eventually, the client disconnects or the server finishes, and the conversation is over.

Now, before starting to design these messages, stop and think, and let's compare the control dialog and the data flow:

  • The control dialog lasts a short time and involves very few messages. The data flow could last for hours or days, and involve billions of messages.
  • The control dialog is where all the "normal" errors happen, e.g., not authenticated, not found, payment required, censored, and so on. In contrast, any errors that happen during the data flow are exceptional (disk full, server crashed).
  • The control dialog is where things will change over time as we add more options, parameters, and so on. The data flow should barely change over time because the semantics of a resource are fairly constant over time.
  • The control dialog is essentially a synchronous request/reply dialog. The data flow is essentially a one-way asynchronous flow.

These differences are critical. When we talk about performance, it applies only to data flows. It's pathological to design a one-time control dialog to be fast. Thus when we talk about the cost of serialization, this only applies to the data flow. The cost of encoding/decoding the control flow could be huge, and for many cases it would not change a thing. So we encode control using Cheap, and we encode data flows using Nasty.

Cheap is essentially synchronous, verbose, descriptive, and flexible. A Cheap message is full of rich information that can change for each application. Your goal as designer is to make this information easy to encode and parse, trivial to extend for experimentation or growth, and highly robust against change both forwards and backwards. The Cheap part of a protocol looks like this:

  • It uses a simple self-describing structured encoding for data, be it XML, JSON, HTTP-style headers, or some other. Any encoding is fine as long as there are standard simple parsers for it in your target languages.
  • It uses a straight request-reply model where each request has a success/failure reply. This makes it trivial to write correct clients and servers for a Cheap dialog.
  • It doesn't try, even marginally, to be fast. Performance doesn't matter when you do something only once or a few times per session.

A Cheap parser is something you take off the shelf and throw data at. It shouldn't crash, shouldn't leak memory, should be highly tolerant, and should be relatively simple to work with. That's it.

Nasty however is essentially asynchronous, terse, silent, and inflexible. A Nasty message carries minimal information that practically never changes. Your goal as designer is to make this information ultrafast to parse, and possibly even impossible to extend and experiment with. The ideal Nasty pattern looks like this:

  • It uses a hand-optimized binary layout for data, where every bit is precisely crafted.
  • It uses a pure asynchronous model where one or both peers send data without acknowledgments (or if they do, they use sneaky asynchronous techniques like credit-based flow control).
  • It doesn't try, even marginally, to be friendly. Performance is all that matters when you are doing something several million times per second.

A Nasty parser is something you write by hand, which writes or reads bits, bytes, words, and integers individually and precisely. It rejects anything it doesn't like, does no memory allocations at all, and never crashes.

Cheap or Nasty isn't a universal pattern; not all protocols have this dichotomy. Also, how you use Cheap or Nasty will depend on the situation. In some cases, it can be two parts of a single protocol. In other cases, it can be two protocols, one layered on top of the other.

Error Handling

Using Cheap or Nasty makes error handling rather simpler. You have two kinds of commands and two ways to signal errors:

  • Synchronous control commands: errors are normal: every request has a response that is either OK or an error response.
  • Asynchronous data commands: errors are exceptional: bad commands are either discarded silently, or cause the whole connection to be closed.

It's usually good to distinguish a few kinds of errors, but as always keep it minimal and add only what you need.
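
As a tiny, socket-free sketch of how the two styles differ in code (the FETCH command and the 0xAA signature byte are invented for this example), a control handler always produces an explicit reply, while a data handler just drops anything it doesn't like:

#include <stdio.h>
#include <string.h>
#include <stddef.h>

//  Cheap: a control request always gets an explicit OK or ERROR reply
static const char *control_reply (const char *command)
{
    if (strcmp (command, "FETCH") == 0)     //  Hypothetical command name
        return "OK";
    else
        return "ERROR unknown command";     //  Normal error, reported back
}

//  Nasty: a data command with a bad header is discarded without comment
static int data_accept (const unsigned char *frame, size_t size)
{
    if (size < 2 || frame [0] != 0xAA)      //  Hypothetical signature byte
        return 0;                           //  Exceptional error: drop silently
    return 1;                               //  Otherwise decode and process it
}

int main (void)
{
    printf ("%s\n", control_reply ("FETCH"));
    printf ("%s\n", control_reply ("NOSUCH"));
    unsigned char bogus [] = { 0x00, 0x01 };
    printf ("bogus data frame accepted: %d\n", data_accept (bogus, sizeof (bogus)));
    return 0;
}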

Serializing Your Data

When we start to design a protocol, one of the first questions we face is how we encode data on the wire. There is no universal answer. There are a half-dozen different ways to serialize data, each with pros and cons. We'll explore some of these.

Abstraction Level

Before looking at how to put data onto the wire, it's worth asking what data we actually want to exchange between applications. If we don't use any abstraction, we literally serialize and deserialize our internal state. That is, the objects and structures we use to implement our functionality.

Putting internal state onto the wire is however a really bad idea. It's like exposing internal state in an API. When you do this, you are hard-coding your implementation decisions into your protocols. You are also going to produce protocols that are significantly more complex than they need to be.

It's perhaps the main reason so many older protocols and APIs are so complex: their designers did not think about how to abstract them into simpler concepts. There is of course no guarantee that an abstraction will be simpler; that's where the hard work comes in.

A good protocol or API abstraction encapsulates natural patterns of use, and gives them names and properties that are predictable and regular. It chooses sensible defaults so that the main use cases can be specified minimally. It aims to be simple for the simple cases, and expressive for the rarer complex cases. It does not make any statements or assumptions about the internal implementation unless that is absolutely needed for interoperability.
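
A small, hypothetical C contrast may help: the first struct drags implementation detail onto the wire, while the second names only concepts both peers have agreed on. Neither type comes from a real ZeroMQ project; they exist only to illustrate the point.

#include <stddef.h>

//  Bad: serializing internal state ties the protocol to this exact struct
typedef struct {
    int    fd;                  //  OS file descriptor, meaningless to a peer
    void  *cache_node;          //  Pointer into our heap, meaningless to a peer
    char   path [260];          //  Buffer size is an implementation choice
} internal_file_state_t;

//  Better: an abstract message that carries only what the contract defines
typedef struct {
    char  *virtual_path;        //  Path in the namespace both peers agree on
    size_t body_size;           //  Size of the content that follows
} file_announce_t;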

ZeroMQ Framing

The simplest and most widely used serialization format for ZeroMQ applications is ZeroMQ's own multipart framing. For example, here is how the Majordomo Protocol defines a request:

Frame 0: Empty frame
Frame 1: "MDPW01" (six bytes, representing MDP/Worker v0.1)
Frame 2: 0x02 (one byte, representing REQUEST)
Frame 3: Client address (envelope stack)
Frame 4: Empty (zero bytes, envelope delimiter)
Frames 5+: Request body (opaque binary)

To read and write this in code is easy, but this is a classic example of a control flow (the whole of MDP is really, as it's a chatty request-reply protocol). When we came to improve MDP for the second version, we had to change this framing. Excellent, we broke all existing implementations!

Backwards compatibility is hard, but using ZeroMQ framing for control flows does not help. Here's how I should have designed this protocol if I'd followed my own advice (and I'll fix this in the next version). It's split into a Cheap part and a Nasty part, and uses the ZeroMQ framing to separate these:

Frame 0: "MDP/2.0" for protocol name and version
Frame 1: command header
Frame 2: command body

Where we'd expect to parse the command header in the various intermediaries (client API, broker, and worker API), and pass the command body untouched from application to application.
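
For what it's worth, sending a command in that proposed three-frame shape is only a few lines of plain libzmq; the endpoint, the header text, and the "MDP/2.0" version string below are illustrative (the revised framing is a proposal in this chapter, not a published spec):

#include <zmq.h>
#include <string.h>
#include <assert.h>

int main (void)
{
    void *ctx = zmq_ctx_new ();
    void *client = zmq_socket (ctx, ZMQ_DEALER);
    assert (zmq_connect (client, "tcp://localhost:5555") == 0);

    //  Frame 0: protocol name and version
    zmq_send (client, "MDP/2.0", 7, ZMQ_SNDMORE);
    //  Frame 1: command header (Cheap: e.g., HTTP-style headers or JSON)
    const char *header = "Command: REQUEST\nService: echo";
    zmq_send (client, header, strlen (header), ZMQ_SNDMORE);
    //  Frame 2: command body, passed through untouched (Nasty: opaque binary)
    zmq_send (client, "\x01\x02\x03", 3, 0);

    //  Don't wait for undelivered frames when shutting down this sketch
    int linger = 0;
    zmq_setsockopt (client, ZMQ_LINGER, &linger, sizeof (linger));
    zmq_close (client);
    zmq_ctx_destroy (ctx);
    return 0;
}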

Serialization Languages

Serialization languages have their fashions. XML used to be big as in popular, then it got big as in over-engineered, and then it fell into the hands of "Enterprise Information Architects" and it's not been seen alive since. Today's XML is the epitome of "somewhere in that mess is a small, elegant language trying to escape".

Still, XML was way, way better than its predecessors, which included such monsters as the Standard Generalized Markup Language (SGML), which in turn was a cool breeze compared to mind-torturing beasts like EDIFACT. So the history of serialization languages seems to be of gradually emerging sanity, hidden by waves of revolting EIAs doing their best to hold onto their jobs.

JSON popped out of the JavaScript world as a quick-and-dirty "I'd rather resign than use XML here" way to throw data onto the wire and get it back again. JSON is just minimal XML expressed, sneakily, as JavaScript source code.

Here's a simple example of using JSON in a Cheap protocol:

"protocol": {
    "name": "MTL",
    "version": 1
},
"virtual-host": "test-env"

The same data in XML would be (XML forces us to invent a single top-level entity):

<command>
    <protocol name = "MTL" version = "1" />
    <virtual-host>test-env</virtual-host>
</command>

And here it is using plain-old HTTP-style headers:

Protocol: MTL/1.0
Virtual-host: test-env

These are all pretty equivalent as long as you don't go overboard with validating parsers, schemas, and other "trust us, this is all for your own good" nonsense. A Cheap serialization language gives you space for experimentation for free ("ignore any elements/attributes/headers that you don't recognize"), and it's simple to write generic parsers that, for example, thunk a command into a hash table, or vice versa.

However, it's not all roses. While modern scripting languages support JSON and XML easily enough, older languages do not. If you use XML or JSON, you create nontrivial dependencies. It's also somewhat of a pain to work with tree-structured data in a language like C.

So you can drive your choice according to the languages for which you're aiming. If your universe is a scripting language, then go for JSON. If you are aiming to build protocols for wider system use, keep things simple for C developers and stick to HTTP-style headers.
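
To back up the claim that HTTP-style headers stay simple even in C, here is a minimal parser that thunks "Name: value" lines into a small table. It is a sketch with obvious limits (fixed-size fields, no continuation lines), and the headers it prints are just the ones from the example above.

#include <stdio.h>
#include <string.h>

#define MAX_HEADERS 32

typedef struct { char name [64]; char value [192]; } header_t;

//  Parse "Name: value" lines from a Cheap control message into a table.
//  Unknown headers are kept as-is, which is what leaves room to experiment.
static int parse_headers (const char *text, header_t *headers, int limit)
{
    int count = 0;
    while (*text && count < limit) {
        const char *eol = strchr (text, '\n');
        size_t line_len = eol ? (size_t) (eol - text) : strlen (text);
        const char *colon = memchr (text, ':', line_len);
        if (colon) {
            size_t name_len = (size_t) (colon - text);
            const char *value = colon + 1;
            while (*value == ' ')
                value++;                    //  Skip spaces after the colon
            size_t value_len = line_len - (size_t) (value - text);
            snprintf (headers [count].name, sizeof headers [count].name,
                      "%.*s", (int) name_len, text);
            snprintf (headers [count].value, sizeof headers [count].value,
                      "%.*s", (int) value_len, value);
            count++;
        }
        if (!eol)
            break;
        text = eol + 1;
    }
    return count;
}

int main (void)
{
    header_t headers [MAX_HEADERS];
    int count = parse_headers ("Protocol: MTL/1.0\nVirtual-host: test-env",
                               headers, MAX_HEADERS);
    for (int i = 0; i < count; i++)
        printf ("%s = %s\n", headers [i].name, headers [i].value);
    return 0;
}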

Serialization Libraries

The msgpack.org site says:

"It's like JSON. but fast and small."

I'm going to make the perhaps unpopular claim that "fast and small" are features that solve non-problems. The only real problem that serialization libraries solve is, as far as I can tell, the need to document the message contracts and actually serialize data to and from the wire.

Let's start by debunking "fast and small". It's based on a two-part argument. First, that making your messages smaller and reducing CPU cost for encoding and decoding will make a significant difference to your application's performance. Second, that this is equally valid across the board for all messages.

But most real applications tend to fall into one of two categories. Either the speed of serialization and size of encoding is marginal compared to other costs, such as database access or application code performance. Or, network performance really is critical, and then all significant costs occur in a few specific message types.

Thus, aiming for "fast and small" across the board is a false optimization. You neither get the easy flexibility of Cheap for your infrequent control flows, nor do you get the brutal efficiency of Nasty for your high-volume data flows. Worse, the assumption that all messages are equal in some way can corrupt your protocol design. Cheap or Nasty isn't only about serialization strategies, it's also about synchronous versus asynchronous, error handling, and the cost of change.

My experience is that most performance problems in message-based applications can be solved by (a) improving the application itself and (b) hand-optimizing the high-volume data flows. And to hand-optimize your most critical data flows, you need to cheat; to learn and exploit facts about your data, something general purpose serializers cannot do.

Now let's address documentation and the need to write our contracts explicitly and formally, rather than only in code. This is a valid problem to solve, indeed one of the main ones if we're to build a long-lasting, large-scale message-based architecture.

Here is how we describe a typical message using the MessagePack interface definition language (IDL):

message Person {
  1: string surname
  2: string firstname
  3: optional string email
}

Now, the same message using the Google protocol buffers IDL:

message Person {
  required string surname = 1;
  required string firstname = 2;
  optional string email = 3;
}

It works, but in most practical cases wins you little over a serialization language backed by decent specifications written by hand or produced mechanically (we'll come to this). The price you'll pay is an extra dependency and quite probably, worse overall performance than if you used Cheap or Nasty.

Handwritten Binary Serialization

As you'll gather from this book, my preferred language for systems programming is C (upgraded to C99, with a constructor/destructor API model and generic containers). There are two reasons I like this modernized C language. First, I'm too weak-minded to learn a big language like C++. Life just seems filled with more interesting things to understand. Second, I find that this specific level of manual control lets me produce better results, faster.

The point here isn't C versus C++, but the value of manual control for high-end professional users. It's no accident that the best cars, cameras, and espresso machines in the world have manual controls. That level of on-the-spot fine tuning often makes the difference between world class success, and being second best.

When you are really, truly concerned about the speed of serialization and/or the size of the result (often these contradict each other), you need handwritten binary serialization. In other words, let's hear it for Mr. Nasty!

Your basic process for writing an efficient Nasty encoder/decoder (codec) is:

  • Build representative data sets and test applications that can stress test your codec.
  • Write a first dumb version of the codec.
  • Test, measure, improve, and repeat until you run out of time and/or money.

Here are some of the techniques we use to make our codecs better (a minimal hand-written sketch follows the list):

  • Use a profiler. There's simply no way to know what your code is doing until you've profiled it for function counts and for CPU cost per function. When you find your hot spots, fix them.
  • Eliminate memory allocations. The heap is very fast on a modern Linux kernel, but it's still the bottleneck in most naive codecs. On older kernels, the heap can be tragically slow. Use local variables (the stack) instead of the heap where you can.
  • Test on different platforms and with different compilers and compiler options. Apart from the heap, there are many other differences. You need to learn the main ones, and allow for them.
  • Use state to compress better. If you are concerned about codec performance, you are almost definitely sending the same kinds of data many times. There will be redundancy between instances of data. You can detect these and use that to compress (e.g., a short value that means "same as last time").
  • Know your data. The best compression techniques (in terms of CPU cost for compactness) require knowing about the data. For example, the techniques used to compress a word list, a video, and a stream of stock market data are all different.
  • Be ready to break the rules. Do you really need to encode integers in big-endian network byte order? x86 and ARM account for almost all modern CPUs, yet use little-endian (ARM is actually bi-endian but Android, like Windows and iOS, is little-endian).
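
Here is a toy example of what such hand-written code looks like: a two-byte signature, a 16-bit length written big-endian by hand, and the payload, all in a caller-supplied stack buffer with no heap allocations. The 0xAA 0xA0 signature and the frame layout are invented for this sketch and belong to no real protocol.

#include <stdio.h>
#include <stdint.h>
#include <string.h>

//  Encode signature, 16-bit big-endian length, and payload into a
//  caller-supplied buffer. Returns bytes written, or 0 if it won't fit.
static size_t nasty_encode (uint8_t *buffer, size_t limit,
                            const uint8_t *payload, uint16_t size)
{
    size_t needed = 2 + 2 + (size_t) size;
    if (needed > limit)
        return 0;                        //  Reject; never allocate or resize
    buffer [0] = 0xAA;                   //  Invented protocol signature
    buffer [1] = 0xA0;
    buffer [2] = (uint8_t) (size >> 8);  //  Length, big-endian by hand
    buffer [3] = (uint8_t) (size & 0xFF);
    memcpy (buffer + 4, payload, size);
    return needed;
}

//  Decode one frame; returns bytes consumed, or 0 to reject it silently
static size_t nasty_decode (const uint8_t *buffer, size_t size,
                            const uint8_t **payload_p, uint16_t *psize_p)
{
    if (size < 4 || buffer [0] != 0xAA || buffer [1] != 0xA0)
        return 0;                        //  Bad frame: reject, don't crash
    uint16_t psize = (uint16_t) ((buffer [2] << 8) | buffer [3]);
    if ((size_t) psize + 4 > size)
        return 0;
    *payload_p = buffer + 4;
    *psize_p = psize;
    return (size_t) psize + 4;
}

int main (void)
{
    uint8_t wire [256];                  //  Stack buffer, no heap in sight
    size_t frame_size = nasty_encode (wire, sizeof (wire),
                                      (const uint8_t *) "12345", 5);
    const uint8_t *payload;
    uint16_t payload_size;
    if (nasty_decode (wire, frame_size, &payload, &payload_size))
        printf ("decoded %u payload bytes\n", (unsigned) payload_size);
    return 0;
}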

Code Generation

Reading the previous two sections, you might have wondered, "could I write my own IDL generator that was better than a general purpose one?" If this thought wandered into your mind, it probably left pretty soon after, chased by dark calculations about how much work that actually involved.

What if I told you of a way to build custom IDL generators cheaply and quickly? You can have a way to get perfectly documented contracts, code that is as evil and domain-specific as you need it to be, and all you need to do is sign away your soul (who ever really used that, am I right?) just here…

At iMatix, until a few years ago, we used code generation to build ever larger and more ambitious systems until we decided the technology (GSL) was too dangerous for common use, and we sealed the archive and locked it with heavy chains in a deep dungeon. We actually posted it on GitHub. If you want to try the examples that are coming up, grab the repository and build yourself a gsl command. Typing "make" in the src subdirectory should do it (and if you're that guy who loves Windows, I'm sure you'll send a patch with project files).

This section isn't really about GSL at all, but about a useful and little-known trick for ambitious architects who want to scale themselves, as well as their work. Once you learn the trick, you can whip up your own code generators in a short time. The code generators most software engineers know about come with a single hard-coded model. For instance, Ragel "compiles executable finite state machines from regular languages", i.e., Ragel's model is a regular language. This certainly works for a good set of problems, but it's far from universal. How do you describe an API in Ragel? Or a project makefile? Or even a finite-state machine like the one we used to design the Binary Star pattern in Chapter 4 - Reliable Request-Reply Patterns?

All these would benefit from code generation, but there's no universal model. So the trick is to design your own models as you need them, and then make code generators as cheap compilers for that model. You need some experience in how to make good models, and you need a technology that makes it cheap to build custom code generators. A scripting language, like Perl and Python, is a good option. However, we actually built GSL specifically for this, and that's what I prefer.

Let's take a simple example that ties into what we already know. We'll see more extensive examples later, because I really do believe that code generation is crucial knowledge for large-scale work. In Chapter 4 - Reliable Request-Reply Patterns, we developed the Majordomo Protocol (MDP), and wrote clients, brokers, and workers for that. Now could we generate those pieces mechanically, by building our own interface description language and code generators?

When we write a GSL model, we can use any semantics we like, in other words we can invent domain-specific languages on the spot. I'll invent a couple—see if you can guess what they represent:

slideshow
    name = Cookery level 3
    page
        title = French Cuisine
        item = Overview
        item = The historical cuisine
        item = The nouvelle cuisine
        item = Why the French live longer
    page
        title = Overview
        item = Soups and salads
        item = Le plat principal
        item = Béchamel and other sauces
        item = Pastries, cakes, and quiches
        item = Soufflé: cheese to strawberry

How about this one:

table
    name = person
    column
        name = firstname
        type = string
    column
        name = lastname
        type = string
    column
        name = rating
        type = integer

We could compile the first into a presentation. The second, we could compile into SQL to create and work with a database table. So for this exercise, our domain language, our model, consists of "classes" that contain "messages" that contain "fields" of various types. It's deliberately familiar. Here is the MDP client protocol:

<class name = "mdp_client">
    MDP/Client
    <header>
        <field name = "empty" type = "string" value = ""
            >Empty frame</field>
        <field name = "protocol" type = "string" value = "MDPC01"
            >Protocol identifier</field>
    </header>
    <message name = "request">
        Client request to broker
        <field name = "service" type = "string">Service name</field>
        <field name = "body" type = "frame">Request body</field>
    </message>
    <message name = "reply">
        Response back to client
        <field name = "service" type = "string">Service name</field>
        <field name = "body" type = "frame">Response body</field>
    </message>
</class>

And here is the MDP worker protocol:

<class name = "mdp_worker">
    MDP/Worker
    <header>
        <field name = "empty" type = "string" value = ""
            >Empty frame</field>
        <field name = "protocol" type = "string" value = "MDPW01"
            >Protocol identifier</field>
        <field name = "id" type = "octet">Message identifier</field>
    </header>
    <message name = "ready" id = "1">
        Worker tells broker it is ready
        <field name = "service" type = "string">Service name</field>
    </message>
    <message name = "request" id = "2">
        Client request to broker
        <field name = "client" type = "frame">Client address</field>
        <field name = "body" type = "frame">Request body</field>
    </message>
    <message name = "reply" id = "3">
        Worker returns reply to broker
        <field name = "client" type = "frame">Client address</field>
        <field name = "body" type = "frame">Request body</field>
    </message>
    <message name = "hearbeat" id = "4">
        Either peer tells the other it's still alive
    </message>
    <message name = "disconnect" id = "5">
        Either peer tells other the party is over
    </message>
</class>

GSL uses XML as its modeling language. XML has a poor reputation, having been dragged through too many enterprise sewers to smell sweet, but it has some strong positives, as long as you keep it simple. Any way to write a self-describing hierarchy of items and attributes would work.

Now here is a short IDL generator written in GSL that turns our protocol models into documentation:

.#  Trivial IDL generator (specs.gsl)
.#
.output "$(class.name).md"
## The $(string.trim (class.?''):left) Protocol
.for message
.   frames = count (class->header.field) + count (field)

A $(message.NAME) command consists of a multipart message of $(frames)
frames:

.   for class->header.field
.       if name = "id"
* Frame $(item ()): 0x$(message.id:%02x) (1 byte, $(message.NAME))
.       else
* Frame $(item ()): "$(value:)" ($(string.length ("$(value)")) \
bytes, $(field.:))
.       endif
.   endfor
.   index = count (class->header.field) + 1
.   for field
* Frame $(index): $(field.?'') \
.       if type = "string"
(printable string)
.       elsif type = "frame"
(opaque binary)
.           index += 1
.       else
.           echo "E: unknown field type: $(type)"
.       endif
.       index += 1
.   endfor
.endfor

The XML models and this script are in the subdirectory examples/models. To do the code generation, I give this command:

gsl -script:specs mdp_client.xml mdp_worker.xml

Here is the Markdown text we get for the worker protocol:

## The MDP/Worker Protocol

A READY command consists of a multipart message of 4 frames:

* Frame 1: "" (0 bytes, Empty frame)
* Frame 2: "MDPW01" (6 bytes, Protocol identifier)
* Frame 3: 0x01 (1 byte, READY)
* Frame 4: Service name (printable string)

A REQUEST command consists of a multipart message of 5 frames:

* Frame 1: "" (0 bytes, Empty frame)
* Frame 2: "MDPW01" (6 bytes, Protocol identifier)
* Frame 3: 0x02 (1 byte, REQUEST)
* Frame 4: Client address (opaque binary)
* Frame 6: Request body (opaque binary)

A REPLY command consists of a multipart message of 5 frames:

* Frame 1: "" (0 bytes, Empty frame)
* Frame 2: "MDPW01" (6 bytes, Protocol identifier)
* Frame 3: 0x03 (1 byte, REPLY)
* Frame 4: Client address (opaque binary)
* Frame 6: Response body (opaque binary)

A HEARTBEAT command consists of a multipart message of 3 frames:

* Frame 1: "" (0 bytes, Empty frame)
* Frame 2: "MDPW01" (6 bytes, Protocol identifier)
* Frame 3: 0x04 (1 byte, HEARTBEAT)

A DISCONNECT command consists of a multipart message of 3 frames:

* Frame 1: "" (0 bytes, Empty frame)
* Frame 2: "MDPW01" (6 bytes, Protocol identifier)
* Frame 3: 0x05 (1 byte, DISCONNECT)

This, as you can see, is close to what I wrote by hand in the original spec. Now, if you have cloned the zguide repository and you are looking at the code in examples/models, you can generate the MDP client and worker codecs. We pass the same two models to a different code generator:

gsl -script:codec_c mdp_client.xml mdp_worker.xml

Which gives us mdp_client and mdp_worker classes. Actually, MDP is so simple that it's barely worth the effort of writing the code generator. The profit comes when we want to change the protocol (which we did for the standalone Majordomo project). You modify the protocol, run the command, and out pops more perfect code.

The codec_c.gsl code generator is not short, but the resulting codecs are much better than the handwritten code I originally put together for Majordomo. For instance, the handwritten code had no error checking and would die if you passed it bogus messages.

I'm now going to explain the pros and cons of GSL-powered model-oriented code generation. Power does not come for free and one of the greatest traps in our business is the ability to invent concepts out of thin air. GSL makes this particularly easy, so it can be an equally dangerous tool.

Do not invent concepts. The job of a designer is to remove problems, not add features.

Firstly, I will lay out the advantages of model-oriented code generation:

  • You can create near-perfect abstractions that map to your real world. So, our protocol model maps 100% to the "real world" of Majordomo. This would be impossible without the freedom to tune and change the model in any way.
  • You can develop these perfect models quickly and cheaply.
  • You can generate any text output. From a single model, you can create documentation, code in any language, test tools—literally any output you can think of.
  • You can generate (and I mean this literally) perfect output because it's cheap to improve your code generators to any level you want.
  • You get a single source that combines specifications and semantics.
  • You can leverage a small team to a massive size. At iMatix, we produced the million-line OpenAMQ messaging product out of perhaps 85K lines of input models, including the code generation scripts themselves.

Now let's look at the disadvantages:

  • You add tool dependencies to your project.
  • You may get carried away and create models for the pure joy of creating them.
  • You may alienate newcomers, who will see "strange stuff", from your work.
  • You may give people a strong excuse not to invest in your project.

Cynically, model-oriented abuse works great in environments where you want to produce huge amounts of perfect code that you can maintain with little effort and which no one can ever take away from you. Personally, I like to cross my rivers and move on. But if long-term job security is your thing, this is almost perfect.

So if you do use GSL and want to create open communities around your work, here is my advice:

  • Use it only where you would otherwise be writing tiresome code by hand.
  • Design natural models that are what people would expect to see.
  • Write the code by hand first so you know what to generate.
  • Do not overuse. Keep it simple! Do not get too meta!!
  • Introduce gradually into a project.
  • Put the generated code into your repositories.

We're already using GSL in some projects around ZeroMQ. For example, the high-level C binding, CZMQ, uses GSL to generate the socket options class (zsockopt). A 300-line code generator turns 78 lines of XML model into 1,500 lines of perfect, but really boring code. That's a good win.

Transferring Files

topprevnext

Let's take a break from the lecturing and get back to our first love and the reason for doing all of this: code.

"How do I send a file?" is a common question on the ZeroMQ mailing lists. This should not be surprising, because file transfer(转让) is perhaps the oldest and most obvious type of messaging. Sending files around networks has lots of use cases apart(相距) from annoying the copyright cartels(卡特尔). ZeroMQ is very good out of the box at sending events and tasks, but less good at sending files.

I've promised, for a year or two, to write a proper explanation. Here's a gratuitous piece of information to brighten your morning: the word "proper" comes from the archaic French propre, which means "clean". The dark age English common folk, not being familiar with hot water and soap, changed the word to mean "foreign" or "upper-class", as in "that's proper food!", but later the word came to mean just "real", as in "that's a proper mess you've gotten us into!"

So, file transfer. There are several reasons you can't just pick up a random file, blindfold it, and shove it whole into a message. The most obvious reason being that despite decades of determined growth in RAM sizes (and who among us old-timers doesn't fondly remember saving up for that 1024-byte memory extension card?!), disk sizes obstinately remain much larger. Even if we could send a file with one instruction (say, using a system call like sendfile), we'd hit the reality that networks are not infinitely fast nor perfectly reliable. After trying to upload a large file several times on a slow flaky network (WiFi, anyone?), you'll realize that a proper file transfer protocol needs a way to recover from failures. That is, it needs a way to send only the part of a file that wasn't yet received.

Finally, after all this, if you build a proper file server, you'll notice that simply sending massive amounts of data to lots of clients creates that situation we like to call, in the technical parlance, "server went belly-up due to all available heap memory being eaten by a poorly designed application". A proper file transfer protocol needs to pay attention to memory use.

We'll solve these problems properly, one-by-one, which should hopefully get us to a good and proper file transfer protocol running over ZeroMQ. First, let's generate a 1GB test file with random data (real power-of-two-giga-like-Von-Neumann-intended, not the fake silicon ones the memory industry likes to sell):

dd if=/dev/urandom of=testdata bs=1M count=1024

This is large enough to be troublesome when we have lots of clients asking for the same file at once, and on many machines, 1GB is going to be too large to allocate in memory anyhow. As a base reference, let's measure how long it takes to copy this file from disk back to disk. This will tell us how much our file transfer protocol adds on top (including network costs):

$ time cp testdata testdata2

real    0m7.143s
user    0m0.012s
sys     0m1.188s

The 4-figure precision is misleading; expect variations of 25% either way. This is just an "order of magnitude" measurement.

Here's our first cut at the code, where the client asks for the test data and the server just sends it, without stopping for breath, as a series of messages, where each message holds one chunk:


Python | Ada | Basic | C++ | C# | Clojure | CL | Delphi | Erlang | F# | Felix | Go | Haskell | Haxe | Java | Lua | Node.js | Objective-C | ooc | Perl | PHP | Q | Racket | Ruby | Scala | Tcl
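
The full examples live in the zguide repository; what follows is only a minimal sketch of the model one server, under assumed names (the tcp://*:6000 endpoint and the 250,000-byte chunk size are illustrative), using CZMQ. It blindly streams every chunk to the client with no flow control at all:

//  Minimal sketch of model 1: the server blasts the whole file as a
//  stream of chunk messages with no flow control (names are illustrative)
#include "czmq.h"
#define CHUNK_SIZE 250000

int main (void)
{
    zctx_t *ctx = zctx_new ();
    void *router = zsocket_new (ctx, ZMQ_ROUTER);
    //  Stupid but simple: 0 means "no limit" in ZeroMQ v3.x
    zsocket_set_sndhwm (router, 0);
    zsocket_bind (router, "tcp://*:6000");

    //  Wait for a single "fetch" request from a client
    zframe_t *identity = zframe_recv (router);
    char *command = zstr_recv (router);
    assert (streq (command, "fetch"));
    free (command);

    FILE *file = fopen ("testdata", "r");
    assert (file);
    while (true) {
        byte *data = malloc (CHUNK_SIZE);
        size_t size = fread (data, 1, CHUNK_SIZE, file);
        //  Send chunk to client; an empty chunk signals end of file
        zframe_send (&identity, router, ZFRAME_REUSE + ZFRAME_MORE);
        zframe_t *chunk = zframe_new (data, size);
        zframe_send (&chunk, router, 0);
        free (data);
        if (size == 0)
            break;      //  Whole file sent
    }
    fclose (file);
    zframe_destroy (&identity);
    zctx_destroy (&ctx);
    return 0;
}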

It's pretty simple, but we already run into a problem: if we send too much data to the ROUTER socket, we can easily overflow it. The simple but stupid solution is to put an infinite high-water mark on the socket. It's stupid because we now have no protection against exhausting the server's memory. Yet without an infinite HWM, we risk losing chunks of large files.

Try this: set the HWM to 1,000 (in ZeroMQ v3.x this is the default) and then reduce the chunk size to 100K so we send 10K chunks in one go. Run the test, and you'll see it never finishes. As the zmq_socket() man page says with cheerful brutality, for the ROUTER socket: "ZMQ_HWM option action: Drop".
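
For reference, this is how you would set the send and receive high-water marks yourself with the raw libzmq v3.x API; the value and the router variable are illustrative:

//  Assuming 'router' is an existing ZeroMQ v3.x socket handle
int hwm = 1000;         //  ZeroMQ v3.x default; 0 would mean "no limit"
zmq_setsockopt (router, ZMQ_SNDHWM, &hwm, sizeof (hwm));
zmq_setsockopt (router, ZMQ_RCVHWM, &hwm, sizeof (hwm));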

We have to control the amount of data the server sends up-front. There's no point in it sending more than the network can handle. Let's try sending one chunk at a time. In this version of the protocol, the client will explicitly say, "Give me chunk N", and the server will fetch that specific chunk from disk and send it.

Here's the improved second model, where the client asks for one chunk at a time, and the server only sends one chunk for each request it gets from the client:


Python | Ada | Basic | C++ | C# | Clojure | CL | Delphi | Erlang | F# | Felix | Go | Haskell | Haxe | Java | Lua | Node.js | Objective-C | ooc | Perl | PHP | Q | Racket | Ruby | Scala | Tcl
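
Again, the real examples are in the repository; this is only a minimal sketch of the model two client under assumed names (the endpoint, the chunk size, and the "fetch offset size" request format are illustrative, with a matching server fetching that range from disk). It asks for one chunk, waits for it, then asks for the next:

//  Minimal sketch of the model 2 client: request one chunk at a time
#include "czmq.h"
#define CHUNK_SIZE 250000

int main (void)
{
    zctx_t *ctx = zctx_new ();
    void *dealer = zsocket_new (ctx, ZMQ_DEALER);
    zsocket_connect (dealer, "tcp://127.0.0.1:6000");

    size_t total = 0;       //  Total bytes received
    size_t chunks = 0;      //  Total chunks received
    while (true) {
        //  Ask for the next chunk at the current offset
        zstr_sendm (dealer, "fetch");
        zstr_sendfm (dealer, "%ld", (long) total);
        zstr_sendf (dealer, "%ld", (long) CHUNK_SIZE);

        zframe_t *chunk = zframe_recv (dealer);
        if (!chunk)
            break;          //  Interrupted
        chunks++;
        size_t size = zframe_size (chunk);
        zframe_destroy (&chunk);
        total += size;
        if (size < CHUNK_SIZE)
            break;          //  Last chunk received
    }
    printf ("%zd chunks received, %zd bytes\n", chunks, total);
    zctx_destroy (&ctx);
    return 0;
}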

It is much slower now, because of the to-and-fro chatting between client and server. We pay about 300 microseconds for each request-reply round-trip, on a local loop connection (client and server on the same box). It doesn't sound like much but it adds up quickly:

$ time ./fileio1
4296 chunks received, 1073741824 bytes

real    0m0.669s
user    0m0.056s
sys     0m1.048s

$ time ./fileio2
4295 chunks received, 1073741824 bytes

real    0m2.389s
user    0m0.312s
sys     0m2.136s

There are two valuable lessons here. First, while request-reply is easy, it's also too slow for high-volume data flows. Paying that 300 microseconds once would be fine. Paying it for every single chunk isn't acceptable, particularly on real networks with latencies perhaps 1,000 times higher.

The second point is something I've said before but will repeat: it's incredibly easy to experiment, measure, and improve a protocol over ZeroMQ. And when the cost of something comes way down, you can afford a lot more of it. Do learn to develop and prove your protocols in isolation: I've seen teams waste time trying to improve poorly designed protocols that are too deeply embedded in applications to be easily testable or fixable.

Our model two file transfer protocol isn't so bad, apart from performance:

  • It completely eliminates any risk of memory exhaustion. To prove that, we set the high-water mark to 1 in both sender and receiver.
  • It lets the client choose the chunk size, which is useful because if there's any tuning of the chunk size to be done, for network conditions, for file types, or to reduce memory consumption further, it's the client that should be doing this.
  • It gives us fully restartable file transfers.
  • It allows the client to cancel the file transfer at any point in time.

If we just didn't have to do a request for each chunk, it'd be a usable protocol. What we need is a way for the server to send multiple chunks without waiting for the client to request or acknowledge each one. What are our choices?

  • The server could send 10 chunks at once, then wait for a single acknowledgment. That's exactly like multiplying the chunk size by 10, so it's pointless. And yes, it's just as pointless for all values of 10.
  • The server could send chunks without any chatter from the client but with a slight delay between each send, so that it would send chunks only as fast as the network could handle them. This would require the server to know what's happening at the network layer, which sounds like hard work. It also breaks layering horribly. And what happens if the network is really fast, but the client itself is slow? Where are chunks queued then?
  • The server could try to spy on the sending queue, i.e., see how full it is, and send only when the queue isn't full. Well, ZeroMQ doesn't allow that because it doesn't work, for the same reason as throttling doesn't work. The server and network may be more than fast enough, but the client may be a slow little device.
  • We could modify libzmq to take some other action on reaching HWM. Perhaps it could block? That would mean that a single slow client would block the whole server, so no thank you. Maybe it could return an error to the caller? Then the server could do something smart like… well, there isn't really anything it could do that's any better than dropping the message.

Apart from being complex and variously unpleasant, none of these options would even work. What we need is a way for the client to tell the server, asynchronously and in the background, that it's ready for more. We need some kind of asynchronous flow control. If we do this right, data should flow without interruption from the server to the client, but only as long as the client is reading it. Let's review our three protocols. This was the first one:

C: fetch
S: chunk 1
S: chunk 2
S: chunk 3
....

And the second introduced a request for each chunk:

C: fetch chunk 1
S: send chunk 1
C: fetch chunk 2
S: send chunk 2
C: fetch chunk 3
S: send chunk 3
C: fetch chunk 4
....

Now—waves hands mysteriously—here's a changed protocol that fixes the performance problem:

C: fetch chunk 1
C: fetch chunk 2
C: fetch chunk 3
S: send chunk 1
C: fetch chunk 4
S: send chunk 2
S: send chunk 3
....

It looks suspiciously similar. In fact, it's identical except that we send multiple requests without waiting for a reply for each one. This is a technique called "pipelining" and it works because our DEALER and ROUTER sockets are fully asynchronous.

Here's the third model of our file transfer test-bench, with pipelining. The client sends a number of requests ahead (the "credit") and then each time it processes an incoming chunk, it sends one more credit. The server will never send more chunks than the client has asked for:


Python | Ada | Basic | C++ | C# | Clojure | CL | Delphi | Erlang | F# | Felix | Go | Haskell | Haxe | Java | Lua | Node.js | Objective-C | ooc | Perl | PHP | Q | Racket | Ruby | Scala | Tcl
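
As before, here is only a minimal sketch of the model three client, with the same illustrative names as the earlier sketch; the real fileio3 example is in the repository. The only change from model two is that we keep PIPELINE requests in flight and top up the credit each time a chunk arrives:

//  Minimal sketch of the model 3 client: credit-based pipelining
#include "czmq.h"
#define CHUNK_SIZE 250000
#define PIPELINE   10               //  Requests in flight at any time

int main (void)
{
    zctx_t *ctx = zctx_new ();
    void *dealer = zsocket_new (ctx, ZMQ_DEALER);
    zsocket_connect (dealer, "tcp://127.0.0.1:6000");

    size_t credit = PIPELINE;       //  Up-front credit
    size_t offset = 0;              //  Offset of next requested chunk
    size_t total = 0;               //  Total bytes received
    size_t chunks = 0;              //  Total chunks received

    while (true) {
        while (credit) {
            //  Ask for the next chunk
            zstr_sendm (dealer, "fetch");
            zstr_sendfm (dealer, "%ld", (long) offset);
            zstr_sendf (dealer, "%ld", (long) CHUNK_SIZE);
            offset += CHUNK_SIZE;
            credit--;
        }
        zframe_t *chunk = zframe_recv (dealer);
        if (!chunk)
            break;                  //  Interrupted
        chunks++;
        credit++;                   //  One chunk processed, one more credit
        size_t size = zframe_size (chunk);
        zframe_destroy (&chunk);
        total += size;
        if (size < CHUNK_SIZE)
            break;                  //  Last chunk received
    }
    printf ("%zd chunks received, %zd bytes\n", chunks, total);
    zctx_destroy (&ctx);
    return 0;
}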

That tweak gives us full control over the end-to-end pipeline including all network buffers and ZeroMQ queues at sender and receiver. We ensure the pipeline is always filled with data while never growing beyond a predefined limit. More than that, the client decides exactly when to send "credit" to the sender. It could be when it receives a chunk, or when it has fully processed a chunk. And this happens asynchronously, with no significant performance cost.

In the third model, I chose a pipeline size of 10 messages (each message is a chunk). This will cost a maximum of 2.5MB memory per client. So with 1GB of memory we can handle at least 400 clients. We can try to calculate the ideal pipeline size. It takes about 0.7 seconds to send the 1GB file, which is about 160 microseconds for a chunk. A round trip is 300 microseconds, so the pipeline needs to be at least 3-5 chunks to keep the server busy. In practice, I still got performance spikes with a pipeline of 5 chunks, probably because the credit messages sometimes get delayed by outgoing data. So at 10 chunks, it works consistently.

$ time ./fileio3
4291 chunks received, 1072741824 bytes

real    0m0.777s
user    0m0.096s
sys     0m1.120s

Do measure rigorously. Your calculations may be good, but the real world tends to have its own opinions.

What we've made is clearly not yet a real file transfer protocol, but it proves the pattern and I think it is the simplest plausible design. For a real working protocol, we might want to add some or all of:

  • Authentication and access controls, even without encryption: the point isn't to protect sensitive data, but to catch errors like sending test data to production servers.
  • A Cheap-style request including file path, optional compression, and other stuff we've learned is useful from HTTP (such as If-Modified-Since).
  • A Cheap-style response, at least for the first chunk, that provides metadata such as file size (so the client can pre-allocate, and avoid unpleasant disk-full situations).
  • The ability to fetch a set of files in one go, otherwise the protocol becomes inefficient for large sets of small files.
  • Confirmation from the client when it has fully received a file, to recover from chunks that might be lost if the client disconnects unexpectedly.

So far, our semantic has been "fetch"; that is, the recipient knows (somehow) that they need a specific file, so they ask for it. The knowledge of which files exist and where they are is then passed out-of-band (e.g., in HTTP, by links in the HTML page).

How about a "push" semantic? There are two plausible(貌似可信的) use cases for this. First, if we adopt(采取) a centralized(集中的) architecture(建筑学) with files on a main "server" (not something I'm advocating(提倡), but people do sometimes like this), then it's very useful to allow clients to upload files to the server. Second, it lets us do a kind of pub-sub for files, where the client asks for all new files of some type; as the server gets these, it forwards them to the client.

A fetch semantic is synchronous, while a push semantic is asynchronous. Asynchronous is less chatty, so faster. Also, you can do cute things like "subscribe to this path", thus creating a pub-sub file transfer architecture. That is so obviously awesome that I shouldn't need to explain what problem it solves.

Still, here is the problem with the fetch semantic: that out-of-band route to tell clients what files exist. No matter how you do this, it ends up being complex. Either clients have to poll, or you need a separate pub-sub channel to keep clients up-to-date, or you need user interaction.

Thinking this through a little more, though, we can see that fetch is just a special case of pub-sub. So we can get the best of both worlds. Here is the general design:

  • Fetch this path
  • Here is credit (repeat)

To make this work (and we will, my dear readers), we need to be a little more explicit about how we send credit to the server. The cute trick of treating a pipelined "fetch chunk" request as credit won't fly because the client doesn't know any longer what files actually exist, how large they are, anything. If the client says, "I'm good for 250,000 bytes of data", this should work equally for 1 file of 250K bytes, or 100 files of 2,500 bytes.

And this gives us "credit-based flow control", which effectively(有效地) removes the need for high-water marks, and any risk(风险) of memory overflow(充满).

State Machines

topprevnext

Software engineers tend to think of (finite) state machines as a kind of intermediary interpreter. That is, you take a regular language and compile that into a state machine, then execute the state machine. The state machine itself is rarely visible to the developer: it's an internal representation—optimized, compressed, and bizarre.

However, it turns out that state machines are also valuable as a first-class modeling language for protocol handlers, e.g., ZeroMQ clients and servers. ZeroMQ makes it rather easy to design protocols, but we've never defined a good pattern for writing those clients and servers properly.

A protocol has at least two levels:

  • How we represent individual messages on the wire.
  • How messages flow between peers, and the significance of each message.

We've seen in this chapter how to produce codecs that handle serialization. That's a good start. But if we leave the second job to developers, that gives them a lot of room to interpret. As we make more ambitious protocols (file transfer + heartbeating + credit + authentication), it becomes less and less sane to try to implement clients and servers by hand.

Yes, people do this almost systematically. But the costs are high, and they're avoidable. I'll explain how to model protocols using state machines, and how to generate neat and solid code from those models.

My experience with using state machines as a software construction tool dates to 1985 and my first real job making tools for application developers. In 1991, I turned that knowledge into a free software tool called Libero, which spat out executable state machines from a simple text model.

The thing about Libero's model was that it was readable. That is, you described your program logic as named states, each accepting a set of events, each doing some real work. The resulting state machine hooked into your application code, driving it like a boss.

Libero was charmingly good at its job, fluent in many languages, and modestly popular given the enigmatic nature of state machines. We used Libero in anger in dozens of large distributed applications, one of which was finally switched off in 2011 after 20 years of operation. State-machine driven code construction worked so well that it's somewhat impressive that this approach never hit the mainstream of software engineering.

So in this section I'm going to explain Libero's model, and demonstrate how to use it to generate ZeroMQ clients and servers. We'll use GSL again, but like I said, the principles are general and you can put together code generators using any scripting language.

As a worked example, let's see how to carry on a stateful dialog with a peer on a ROUTER socket. We'll develop the server using a state machine (and the client by hand). We have a simple protocol that I'll call "NOM". I'm using the oh-so-very-serious keywords for unprotocols proposal:

nom-protocol    = open-peering *use-peering

open-peering    = C:OHAI ( S:OHAI-OK / S:WTF )

use-peering     = C:ICANHAZ
                / S:CHEEZBURGER
                / C:HUGZ S:HUGZ-OK
                / S:HUGZ C:HUGZ-OK

I've not found a quick way to explain the true nature of state machine programming. In my experience, it invariably takes a few days of practice. After three or four days' exposure to the idea, there is a near-audible "click!" as something in the brain connects all the pieces together. We'll make it concrete by looking at the state machine for our NOM server.

A useful thing about state machines is that you can read them state by state. Each state has a unique descriptive name and one or more events, which we list in any order. For each event, we perform zero or more actions and we then move to a next state (or stay in the same state).

In a ZeroMQ protocol server, we have a state machine instance per client. That sounds complex but it isn't, as we'll see. We describe our first state, Start, as having one valid event: OHAI. We check the user's credentials and then arrive in the Authenticated state.

Figure 64 - The Start State

fig64.png

The Check Credentials action produces either an ok or an error event. It's in the Authenticated state that we handle these two possible events by sending an appropriate reply back to the client. If authentication failed, we return to the Start state where the client can try again.

Figure 65 - The Authenticated State

fig65.png

When authentication has succeeded, we arrive in the Ready state. Here we have three possible events: an ICANHAZ or HUGZ message from the client, or a heartbeat timer event.

Figure 66 - The Ready State

fig66.png

There are a few more things about this state machine model that are worth knowing:

  • Events in upper case (like "HUGZ") are external events that come from the client as messages.
  • Events in lower case (like "heartbeat") are internal events, produced by code in the server.
  • The "Send SOMETHING" actions are shorthand for sending a specific reply back to the client.
  • Events that aren't defined in a particular state are silently ignored.

Now, the original source for these pretty pictures is an XML model:

<class name = "nom_server" script = "server_c">

<state name = "start">
    <event name = "OHAI" next = "authenticated">
        <action name = "check credentials" />
    </event>
</state>

<state name = "authenticated">
    <event name = "ok" next = "ready">
        <action name = "send" message ="OHAI-OK" />
    </event>
    <event name = "error" next = "start">
        <action name = "send" message = "WTF" />
    </event>
</state>

<state name = "ready">
    <event name = "ICANHAZ">
        <action name = "send" message = "CHEEZBURGER" />
    </event>
    <event name = "HUGZ">
        <action name = "send" message = "HUGZ-OK" />
    </event>
    <event name = "heartbeat">
        <action name = "send" message = "HUGZ" />
    </event>
</state>
</class>

The code generator is in examples/models/server_c.gsl. It is a fairly complete tool that I'll use and expand for more serious work later. It generates:

  • A server class in C (nom_server.c, nom_server.h) that implements the whole protocol flow.
  • A selftest method that runs the selftest steps listed in the XML file.
  • Documentation in the form of graphics (the pretty pictures).

Here's a simple main program that starts the generated NOM server:

#include "czmq.h"
#include "nom_server.h"

int main (int argc, char *argv [])
{
printf ("Starting NOM protocol(协议) server on port 5670…\n");
nom_server_t *server = nom_server_new ();
nom_server_bind (server, "tcp://*:5670");
nom_server_wait (server);
nom_server_destroy (&server);
return 0;
}

The generated nom_server class is a fairly classic model. It accepts client messages on a ROUTER socket, so the first frame on every request is the client's connection identity. The server manages a set of clients, each with state. As messages arrive, it feeds these as events to the state machine. Here's the core of the state machine, as a mix of GSL commands and the C code we intend to generate:

client_execute (client_t *self, int event)
{
    self->next_event = event;
    while (self->next_event) {
        self->event = self->next_event;
        self->next_event = 0;
        switch (self->state) {
.for class.state
            case $(name:c)_state:
.   for event
.       if index () > 1
                else
.       endif
                if (self->event == $(name:c)_event) {
.       for action
.           if name = "send"
                    zmsg_addstr (self->reply, "$(message:)");
.           else
                    $(name:c)_action (self);
.           endif
.       endfor
.       if defined (event.next)
                    self->state = $(next:c)_state;
.       endif
                }
.   endfor
                break;
.endfor
        }
        if (zmsg_size (self->reply) > 1) {
            zmsg_send (&self->reply, self->router);
            self->reply = zmsg_new ();
            zmsg_add (self->reply, zframe_dup (self->address));
        }
    }
}

Each client is held as an object with various properties, including the variables we need to represent a state machine instance:

event_t next_event; // Next event
state_t state; // Current state
event_t event; // Current event

You will see by now that we are generating technically-perfect code that has the precise design and shape we want. The only clue that the nom_server class isn't handwritten is that the code is too good. People who complain that code generators produce poor code are accustomed to poor code generators. It is trivial to extend our model as we need it. For example, here's how we generate the selftest code.

First, we add a "selftest" item to the state machine and write our tests. We're not using any XML grammar or validation(确认) so it really is just a matter of opening the editor and adding half-a-dozen lines of text:

<selftest>
    <step send = "OHAI" body = "Sleepy" recv = "WTF" />
    <step send = "OHAI" body = "Joe" recv = "OHAI-OK" />
    <step send = "ICANHAZ" recv = "CHEEZBURGER" />
    <step send = "HUGZ" recv = "HUGZ-OK" />
    <step recv = "HUGZ" />
</selftest>

Designing on the fly, I decided that "send" and "recv" were a nice way to express "send this request, then expect this reply". Here's the GSL code that turns this model into real code:

.for class->selftest.step
.   if defined (send)
    msg = zmsg_new ();
    zmsg_addstr (msg, "$(send:)");
.       if defined (body)
    zmsg_addstr (msg, "$(body:)");
.       endif
    zmsg_send (&msg, dealer);

.   endif
.   if defined (recv)
    msg = zmsg_recv (dealer);
    assert (msg);
    command = zmsg_popstr (msg);
    assert (streq (command, "$(recv:)"));
    free (command);
    zmsg_destroy (&msg);

.   endif
.endfor

Finally, one of the more tricky but absolutely essential parts of any state machine generator is how do I plug this into my own code? As a minimal example for this exercise I wanted to implement the "check credentials" action by accepting all OHAIs from my friend Joe (Hi Joe!) and rejecting everyone else's OHAIs. After some thought, I decided to grab code directly from the state machine model, i.e., embed action bodies in the XML file. So in nom_server.xml, you'll see this:

<action name = "check credentials">
    char *body = zmsg_popstr (self->request);
    if (body && streq (body, "Joe"))
        self->next_event = ok_event;
    else
        self->next_event = error_event;
    free (body);
</action>

And the code generator grabs that C code and inserts it into the generated nom_server.c file:

.for class.action
static void
$(name:c)_action (client_t *self) {
$(string.trim (.):)
}
.endfor

And now we have something quite elegant: a single source file that describes my server state machine and also contains the native implementations for my actions. A nice mix of high-level and low-level that is about 90% smaller than the C code.

Beware, as your head spins with notions of all the amazing things you could produce with such leverage. While this approach gives you real power, it also moves you away from your peers, and if you go too far, you'll find yourself working alone.

By the way, this simple little state machine design exposes just three variables to our custom code:

  • self->next_event
  • self->request
  • self->reply

In the Libero state machine model, there are a few more concepts that we've not used here, but which we will need when we write larger state machines:

  • Exceptions, which let us write terser state machines. When an action raises an exception, further processing on the event stops. The state machine can then define how to handle exception events.
  • The Defaults state, where we can define default handling for events (especially useful for exception events).

Authentication Using SASL

topprevnext

When we designed AMQP in 2007, we chose the Simple Authentication and Security Layer (SASL) for the authentication layer, one of the ideas we took from the BEEP protocol framework. SASL looks complex at first, but it's actually simple and fits neatly into a ZeroMQ-based protocol. What I especially like about SASL is that it's scalable. You can start with anonymous access or plain text authentication and no security, and grow to more secure mechanisms over time without changing your protocol.

I'm not going to give a deep explanation now because we'll see SASL in action somewhat later. But I'll explain the principle so you're already somewhat prepared.

In the NOM protocol, the client started with an OHAI command, which the server either accepted ("Hi Joe!") or rejected. This is simple but not scalable because server and client have to agree up-front on the type of authentication they're going to do.

What SASL introduced, which is genius, is a fully abstracted and negotiable security layer that's still easy to implement at the protocol level. It works as follows:

  • The client connects.
  • The server challenges the client, passing a list of security "mechanisms" that it knows about.
  • The client chooses a security mechanism that it knows about, and answers the server's challenge with a blob of opaque data that (and here's the neat trick) some generic security library calculates and gives to the client.
  • The server takes the security mechanism the client chose, and that blob of data, and passes it to its own security library.
  • The library either accepts the client's answer, or the server challenges again.

There are a number of free SASL libraries. When we come to real code, we'll implement just two mechanisms, ANONYMOUS and PLAIN, which don't need any special libraries.
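
For the PLAIN mechanism the response blob is simply "[authzid] NUL authcid NUL password", so checking it needs no library at all. Here is a rough sketch of such a check; check_plain() is an illustrative helper, not the FileMQ implementation:

//  Sketch of checking a SASL PLAIN response by hand
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

//  Returns true if the PLAIN response matches the expected credentials
static bool
check_plain (const char *data, size_t size,
             const char *valid_login, const char *valid_password)
{
    //  Layout is: [authzid] NUL authcid NUL password
    const char *login = memchr (data, 0, size);
    if (!login)
        return false;
    login++;                                //  Skip first NUL
    const char *password = memchr (login, 0, size - (login - data));
    if (!password)
        return false;
    size_t login_len = password - login;
    password++;                             //  Skip second NUL
    size_t password_len = size - (password - data);

    return login_len == strlen (valid_login)
        && memcmp (login, valid_login, login_len) == 0
        && password_len == strlen (valid_password)
        && memcmp (password, valid_password, password_len) == 0;
}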

To support SASL, we have to add an optional challenge/response step to our "open-peering" flow. Here is what the resulting protocol grammar looks like (I'm modifying NOM to do this):

secure-nom      = open-peering *use-peering

open-peering    = C:OHAI *( S:ORLY C:YARLY ) ( S:OHAI-OK / S:WTF )

ORLY            = 1*mechanism challenge
mechanism       = string
challenge       = *OCTET

YARLY           = mechanism response
response        = *OCTET

Where ORLY and YARLY contain a string (a list of mechanisms in ORLY, one mechanism in YARLY) and a blob of opaque data. Depending on the mechanism, the initial challenge from the server may be empty. We don't care: we just pass this to the security library to deal with.

The SASL RFC goes into detail about other features (that we don't need), the kinds of ways SASL could be attacked, and so on.

Large-Scale File Publishing: FileMQ

topprevnext

Let's put all these techniques together into a file distribution system that I'll call FileMQ. This is going to be a real product, living on GitHub. What we'll make here is a first version of FileMQ, as a training tool. If the concept works, the real thing may eventually get its own book.

Why make FileMQ?
topprevnext

Why make a file distribution system? I already explained how to send large files over ZeroMQ, and it's really quite simple. But if you want to make messaging accessible to a million times more people than can use ZeroMQ, you need another kind of API. An API that my five-year old son can understand. An API that is universal, requires no programming, and works with just about every single application.

Yes, I'm talking about the file system. It's the DropBox pattern: chuck your files somewhere and they get magically copied somewhere else when the network connects again.

However, what I'm aiming for is a fully decentralized architecture that looks more like git, that doesn't need any cloud services (though we could put FileMQ in the cloud), and that does multicast, i.e., can send files to many places at once.

FileMQ must be secure(able), easily hooked into random scripting languages, and as fast as possible across our domestic and office networks.

I want to use it to back up photos from my mobile phone to my laptop over WiFi. To share presentation slides in real time across 50 laptops in a conference. To share documents with colleagues in a meeting. To send earthquake data from sensors to central clusters. To back up video from my phone as I take it, during protests or riots. To synchronize configuration files across a cloud of Linux servers.

A visionary idea, isn't it? Well, ideas are cheap. The hard part is making this, and making it simple.

Initial Design Cut: the API
topprevnext

Here's the way I see the first design. FileMQ has to be distributed, which means that every node can be a server and a client at the same time. But I don't want the protocol to be symmetrical, because that seems forced. We have a natural flow of files from point A to point B, where A is the "server" and B is the "client". If files flow back the other way, then we have two flows. FileMQ is not yet a directory synchronization protocol, but we'll bring it quite close.

Thus, I'm going to build FileMQ as two pieces: a client and a server. Then, I'll put these together in a main application (the filemq tool) that can act both as client and server. The two pieces will look quite similar to the nom_server, with the same kind of API:

fmq_server_t *server = fmq_server_new ();
fmq_server_bind (server, "tcp://*:5670");
fmq_server_publish (server, "/home/ph/filemq/share", "/public");
fmq_server_publish (server, "/home/ph/photos/stream", "/photostream");

fmq_client_t *client = fmq_client_new ();
fmq_client_connect (client, "tcp://pieter.filemq.org:5670");
fmq_client_subscribe (client, "/public/", "/home/ph/filemq/share");
fmq_client_subscribe (client, "/photostream/", "/home/ph/photos/stream");

while (!zctx_interrupted)
    sleep (1);

fmq_server_destroy (&server);
fmq_client_destroy (&client);

If we wrap this C API in other languages, we can easily script FileMQ, embed it in applications, port it to smartphones, and so on.

Initial Design Cut: the Protocol
topprevnext

The full name for the protocol is the "File Message Queuing Protocol", or FILEMQ in uppercase to distinguish it from the software. To start with, we write down the protocol as an ABNF grammar. Our grammar starts with the flow of commands between the client and server. You should recognize these as a combination of the various techniques we've seen already:

filemq-protocol = open-peering *use-peering [ close-peering ]

open-peering    = C:OHAI *( S:ORLY C:YARLY ) ( S:OHAI-OK / error )

use-peering     = C:ICANHAZ ( S:ICANHAZ-OK / error )
                / C:NOM
                / S:CHEEZBURGER
                / C:HUGZ S:HUGZ-OK
                / S:HUGZ C:HUGZ-OK

close-peering   = C:KTHXBAI / S:KTHXBAI

error           = S:SRSLY / S:RTFM

Here are the commands to and from the server:

;   The client opens peering to the server
OHAI            = signature %x01 protocol version
signature       = %xAA %xA3
protocol        = string        ; Must be "FILEMQ"
string          = size *VCHAR
size            = OCTET
version         = %x01

;   The server challenges the client using the SASL model
ORLY            = signature %x02 mechanisms challenge
mechanisms      = size 1*mechanism
mechanism       = string
challenge       = *OCTET        ; ZeroMQ frame

;   The client responds with SASL authentication information
YARLY           = signature %x03 mechanism response
response        = *OCTET        ; ZeroMQ frame

;   The server grants the client access
OHAI-OK         = signature %x04

;   The client subscribes to a virtual path
ICANHAZ         = signature %x05 path options cache
path            = string        ; Full path or path prefix
options         = dictionary
dictionary      = size *key-value
key-value       = string        ; Formatted as name=value
cache           = dictionary    ; File SHA-1 signatures

;   The server confirms the subscription
ICANHAZ-OK      = signature %x06

;   The client sends credit to the server
NOM             = signature %x07 credit
credit          = 8OCTET        ; 64-bit integer, network order
sequence        = 8OCTET        ; 64-bit integer, network order

;   The server sends a chunk of file data
CHEEZBURGER     = signature %x08 sequence operation filename
                  offset headers chunk
sequence        = 8OCTET        ; 64-bit integer, network order
operation       = OCTET
filename        = string
offset          = 8OCTET        ; 64-bit integer, network order
headers         = dictionary
chunk           = FRAME

;   Client or server sends a heartbeat
HUGZ            = signature %x09

;   Client or server responds to a heartbeat
HUGZ-OK         = signature %x0A

;   Client closes the peering
KTHXBAI         = signature %x0B

And here are the different ways the server can tell the client things went wrong:

;   Server error reply - refused due to access rights
S:SRSLY         = signature %x80 reason

;   Server error reply - client sent an invalid command
S:RTFM          = signature %x81 reason

FILEMQ lives on the ZeroMQ unprotocols website and has a registered TCP port with IANA (the Internet Assigned Numbers Authority), which is port 5670.

Building and Trying FileMQ
topprevnext

The FileMQ stack is on GitHub. It works like a classic C/C++ project:

git clone git://github.com/zeromq/filemq.git
cd filemq
./autogen.sh
./configure
make check

You want to be using the latest CZMQ master for this. Now try running the track command, which is a simple tool that uses FileMQ to track changes in one directory in another:

cd src
./track ./fmqroot/send ./fmqroot/recv

And open two file navigator windows, one into src/fmqroot/send and one into src/fmqroot/recv. Drop files into the send folder and you'll see them arrive in the recv folder. The server checks once per second for new files. Delete files in the send folder, and they're deleted in the recv folder similarly.

I use track for things like updating my MP3 player mounted as a USB drive. As I add or remove files in my laptop's Music folder, the same changes happen on the MP3 player. FILEMQ isn't a full replication protocol yet, but we'll fix that later.

Internal Architecture
topprevnext

To build FileMQ I used a lot of code generation, possibly too much for a tutorial. However, the code generators are all reusable in other stacks and will be important for our final project in Chapter 8 - A Framework for Distributed Computing. They are an evolution of the set we saw earlier:

  • codec_c.gsl: generates a message codec for a given protocol.
  • server_c.gsl: generates a server class for a protocol and state machine.
  • client_c.gsl: generates a client class for a protocol and state machine.

The best way to learn to use GSL code generation is to translate these into a language of your choice and make your own demo protocols and stacks. You'll find it fairly easy. FileMQ itself doesn't try to support multiple languages. It could, but it'd make things needlessly complex.

The FileMQ architecture actually slices into two layers. There's a generic set of classes to handle chunks, directories, files, patches, SASL security, and configuration files. Then, there's the generated stack: messages, client, and server. If I was creating a new project I'd fork the whole FileMQ project, and go and modify the three models:

  • fmq_msg.xml: defines the message formats.
  • fmq_client.xml: defines the client state machine, API, and implementation.
  • fmq_server.xml: does the same for the server.

You'd want to rename things to avoid confusion. Why didn't I make the reusable classes into a separate library? The answer is two-fold. First, no one actually needs this (yet). Second, it'd make things more complex for you as you build and play with FileMQ. It's never worth adding complexity to solve a theoretical problem.

Although I wrote FileMQ in C, it's easy to map to other languages. It is quite amazing how nice C becomes when you add CZMQ's generic zlist and zhash containers and class style. Let me go through the classes quickly:

  • fmq_sasl: encodes and decodes a SASL challenge. I only implemented the PLAIN mechanism, which is enough to prove the concept.
  • fmq_chunk: works with variable sized blobs. Not as efficient as ZeroMQ's messages but they do less weirdness and so are easier to understand. The chunk class has methods to read and write chunks from disk.
  • fmq_file: works with files, which may or may not exist on disk. Gives you information about a file (like size), lets you read and write to files, remove files, check if a file exists, and check if a file is "stable" (more on that later).
  • fmq_dir: works with directories, reading them from disk and comparing two directories to see what changed. When there are changes, returns a list of "patches".
  • fmq_patch: works with one patch, which really just says "create this file" or "delete this file" (referring to an fmq_file item each time).
  • fmq_config: works with configuration data. I'll come back to client and server configuration later.

Every class has a test method, and the main development cycle is "edit, test". These are mostly simple self tests, but they make the difference between code I can trust and code I know will still break. It's a safe bet that any code that isn't covered by a test case will have undiscovered errors. I'm not a fan of external test harnesses. But internal test code that you write as you write your functionality… that's like the handle on a knife.

You should, really, be able to read the source code and rapidly understand what these classes are doing. If you can't read the code happily, tell me. If you want to port the FileMQ implementation into other languages, start by forking the whole repository and later we'll see if it's possible to do this in one overall repo.

Public API
topprevnext

The public API consists of two classes (as we sketched earlier):

  • fmq_client: provides the client API, with methods to connect to a server, configure the client, and subscribe to paths.
  • fmq_server: provides the server API, with methods to bind to a port, configure the server, and publish a path.

These classes provide a multithreaded API, a model we've used a few times now. When you create an API instance (i.e., fmq_server_new() or fmq_client_new()), this method kicks off a background thread that does the real work, i.e., runs the server or the client. The other API methods then talk to this thread over ZeroMQ sockets (a pipe consisting of two PAIR sockets over inproc://).

If I was a keen young developer eager to use FileMQ in another language, I'd probably spend a happy weekend writing a binding for this public API, then stick it in a subdirectory of the filemq project called, say, bindings/, and make a pull request.

The actual API methods come from the state machine description, like this (for the server):

<method name = "publish">
<argument name = "location" type = "string" />
<argument name = "alias" type = "string" />
mount_t *mount = mount_new (location, alias);
zlist_append (self->mounts, mount);
</method>

Which gets turned into this code:

void
fmq_server_publish (fmq_server_t *self, char *location, char *alias)
{
    assert (self);
    assert (location);
    assert (alias);
    zstr_sendm (self->pipe, "PUBLISH");
    zstr_sendfm (self->pipe, "%s", location);
    zstr_sendf (self->pipe, "%s", alias);
}

Design Notes
topprevnext

The hardest part of making FileMQ wasn't implementing the protocol, but maintaining accurate state internally. An FTP or HTTP server is essentially stateless. But a publish/subscribe server has to maintain subscriptions, at least.

So I'll go through some of the design aspects:

  • The client detects if the server has died by the lack of heartbeats (HUGZ) coming from the server. It then restarts its dialog by sending an OHAI. There's no timeout on the OHAI because the ZeroMQ DEALER socket will queue an outgoing message indefinitely (see the liveness sketch after this list).
  • If a client stops replying with (HUGZ-OK) to the heartbeats that the server sends, the server concludes that the client has died and deletes all state for the client including its subscriptions.
  • The client API holds subscriptions in memory and replays them when it has connected successfully. This means the caller can subscribe at any time (and doesn't care when connections and authentication actually happen).
  • The server and client use virtual paths, much like an HTTP or FTP server. You publish one or more mount points, each corresponding to a directory on the server. Each of these maps to some virtual path, for instance "/" if you have only one mount point. Clients then subscribe to virtual paths, and files arrive in an inbox directory. We don't send physical file names across the network.
  • There are some timing issues: if the server is creating its mount points while clients are connected and subscribing, the subscriptions won't attach to the right mount points. So, we bind the server port as the last thing.
  • Clients can reconnect at any point; if the client sends OHAI, that signals the end of any previous conversation and the start of a new one. I might one day make subscriptions durable on the server, so they survive a disconnection. The client stack, after reconnecting, replays any subscriptions the caller application already made.
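
Here is the liveness sketch referred to in the first point above. It assumes a one-second heartbeat and uses CZMQ's millisecond clock; the names and values are illustrative, not the actual FileMQ client code:

//  Sketch of client-side tracking of server liveness
#include "czmq.h"

#define HEARTBEAT_MSEC  1000                //  Assumed heartbeat interval
#define EXPIRY_MSEC     (3 * HEARTBEAT_MSEC)

static int64_t server_expires_at = 0;       //  When we give up on the server

//  Call whenever any message (including HUGZ) arrives from the server
static void
server_is_alive (void)
{
    server_expires_at = zclock_time () + EXPIRY_MSEC;
}

//  Call from the client's poll loop; if true, resend OHAI and start over
static bool
server_looks_dead (void)
{
    return zclock_time () > server_expires_at;
}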

Configuration
topprevnext

I've built several large server products, like the Xitami web server that was popular in the late 90's, and the OpenAMQ messaging server. Getting configuration easy and obvious was a large part of making these servers fun to use.

We typically aim to solve a number of problems:

  • Ship default configuration files with the product.
  • Allow users to add custom configuration files that are never overwritten.
  • Allow users to configure from the command-line.

And then layer these one on the other, so command-line settings override custom settings, which override default settings. It can be a lot of work to do this right. For FileMQ, I've taken a somewhat simpler approach: all configuration is done from the API.

This is how we start and configure the server, for example:

server = fmq_server_new ();
fmq_server_configure (server, "server_test.cfg");
fmq_server_publish (server, "./fmqroot/send", "/");
fmq_server_publish (server, "./fmqroot/logs", "/logs");
fmq_server_bind (server, "tcp://*:5670");

We do use a specific format for the config files, which is ZPL, a minimalist syntax that we started using for ZeroMQ "devices" a few years ago, but which works well for any server:

#   Configure server for plain access
#
server
    monitor = 1             #   Check mount points
    heartbeat = 1           #   Heartbeat to clients

publish
    location = ./fmqroot/logs
    virtual = /logs

security
    echo = I: use guest/guest to login to server
    #   These are SASL mechanisms we accept
    anonymous = 0
    plain = 1
        account
            login = guest
            password = guest
            group = guest
        account
            login = super
            password = secret
            group = admin

One cute thing (which seems useful) the generated server code does is to parse this config file (when you use the fmq_server_configure() method) and execute any section that matches an API method. Thus the publish section works as an fmq_server_publish() method.

File Stability
topprevnext

It is quite common to poll a directory for changes and then do something "interesting" with new files. But as one process is writing to a file, other processes have no idea when the file has been fully written. One solution is to add a second "indicator" file that we create after creating the first file. This is intrusive, however.

There is a neater way, which is to detect when a file is "stable", i.e., no one is writing to it any longer. FileMQ does this by checking the modification time of the file. If it's more than a second old, then the file is considered stable, at least stable enough to be shipped off to clients. If a process comes along after five minutes and appends to the file, it'll be shipped off again.

For this to work, and this is a requirement for any application hoping to use FileMQ successfully, do not buffer more than a second's worth of data in memory before writing. If you use very large block sizes, the file may look stable when it's not.
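
A minimal sketch of such a stability test, using stat(2); the real fmq_file class may differ in detail:

//  Sketch of the "is this file stable?" check
#include <stdbool.h>
#include <time.h>
#include <sys/stat.h>

//  Returns true if the file was last modified more than a second ago
static bool
file_is_stable (const char *filename)
{
    struct stat stat_buf;
    if (stat (filename, &stat_buf) != 0)
        return false;                   //  Missing files aren't stable
    return time (NULL) - stat_buf.st_mtime > 1;
}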

Delivery Notifications
topprevnext

One of the nice things about the multithreaded API model we're using is that it's essentially message based. This makes it ideal for returning events back to the caller. A more conventional API approach would be to use callbacks. But callbacks that cross thread boundaries are somewhat delicate. Here's how the client sends a message back when it has received a complete file:

zstr_sendm (self->pipe, "DELIVER");
zstr_sendm (self->pipe, filename);
zstr_sendf (self->pipe, "%s/%s", inbox, filename);

We can now add a _recv() method to the API that waits for events back from the client. It makes a clean style for the caller: create the client object, configure it, and then receive and process any events it returns.
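
Here is a sketch of what the caller side could look like; fmq_client_recv() is an assumed name and signature for that _recv() method, so treat this as illustration rather than the actual API:

//  fmq_client_recv () is the assumed signature of the _recv() method,
//  returning the next event message from the client's pipe
zmsg_t *msg = fmq_client_recv (client);
while (msg) {
    char *command = zmsg_popstr (msg);
    if (streq (command, "DELIVER")) {
        char *filename = zmsg_popstr (msg);
        char *fullname = zmsg_popstr (msg);
        printf ("I: received %s (%s)\n", filename, fullname);
        free (filename);
        free (fullname);
    }
    free (command);
    zmsg_destroy (&msg);
    msg = fmq_client_recv (client);
}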

Symbolic Links
topprevnext

While using a staging area is a nice, simple API, it also creates costs for senders. If I already have a 2GB video file on a camera, and want to send it via FileMQ, the current implementation asks that I copy it to a staging area before it will be sent to subscribers.

One option is to mount the whole content directory (e.g., /home/me/Movies), but this is fragile because it means the application can't decide to send individual files. It's everything or nothing.

A simple answer is to implement portable symbolic links. As Wikipedia explains: "A symbolic link contains a text string that is automatically interpreted and followed by the operating system as a path to another file or directory. This other file or directory is called the target. The symbolic link is a second file that exists independently of its target. If a symbolic link is deleted, its target remains unaffected."

This doesn't affect the protocol in any way; it's an optimization in the server implementation. Let's make a simple portable implementation:

  • A symbolic link consists of a file with the extension .ln.
  • The filename without .ln is the published file name.
  • The link file contains one line, which is the real path to the file.

Because we've collected all operations on files in a single class (fmq_file), it's a clean change. When we create a new file object, we check if it's a symbolic link and then all read-only actions (get file size, read file) operate on the target file, not the link.
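
A minimal sketch of resolving such a portable link; resolve_link() is an illustrative helper rather than the actual fmq_file code:

//  Sketch of resolving a portable ".ln" link: the link file holds a
//  single line containing the real path of the target
#include <stdio.h>
#include <string.h>

//  Returns the target path in a caller-supplied buffer, or NULL on error
static char *
resolve_link (const char *link_name, char *target, size_t target_size)
{
    FILE *file = fopen (link_name, "r");
    if (!file)
        return NULL;
    char *result = fgets (target, (int) target_size, file);
    fclose (file);
    if (!result)
        return NULL;
    target [strcspn (target, "\r\n")] = 0;  //  Strip trailing newline
    return target;
}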

Recovery and Late Joiners
topprevnext

As it stands now, FileMQ has one major remaining problem: it provides no way for clients to recover from failures. The scenario is that a client, connected to a server, starts to receive files and then disconnects for some reason. The network may be too slow, or breaks. The client may be on a laptop which is shut down, then resumed. The WiFi may be disconnected. As we move to a more mobile world (see Chapter 8 - A Framework for Distributed Computing) this use case becomes more and more frequent. In some ways it's becoming a dominant use case.

In the classic ZeroMQ pub-sub pattern, there are two strong underlying assumptions, both of which are usually wrong in FileMQ's real world. First, that data expires very rapidly so that there's no interest in asking for old data. Second, that networks are stable and rarely break (so it's better to invest more in improving the infrastructure and less in addressing recovery).

Take any FileMQ use case and you'll see that if the client disconnects and reconnects, then it should get anything it missed. A further improvement would be to recover from partial failures, like HTTP and FTP do. But one thing at a time.

One answer to recovery is "durable subscriptions", and the first drafts of the FILEMQ protocol aimed to support this, with client identifiers that the server could hold onto and store. So if a client reappears after a failure, the server would know what files it had not received.

Stateful servers are, however, nasty to make and difficult to scale. How do we, for example, do failover to a secondary server? Where does it get its subscriptions from? It's far nicer if each client connection works independently and carries all necessary state with it.

Another nail in the coffin of durable subscriptions is that it requires up-front coordination. Up-front coordination is always a red flag, whether it's in a team of people working together, or a bunch of processes talking to each other. What about late joiners? In the real world, clients do not neatly line up and then all say "Ready!" at the same time. In the real world, they come and go arbitrarily, and it's valuable if we can treat a brand new client in the same way as a client that went away and came back.

To address this I will add two concepts to the protocol: a resynchronization option and a cache field (a dictionary). If the client wants recovery, it sets the resynchronization option, and tells the server what files it already has via the cache field. We need both, because there's no way in the protocol to distinguish between an empty field and a null field. The FILEMQ RFC describes these fields as follows:

The options field provides additional information to the server. The server SHOULD implement these options: RESYNC=1 - if the client sets this, the server SHALL send the full contents of the virtual path to the client, except files the client already has, as identified by their SHA-1 digest in the cache field.

And:

When the client specifies the RESYNC option, the cache dictionary field tells the server which files the client already has. Each entry in the cache dictionary is a "filename=digest" key/value pair where the digest SHALL be a SHA-1 digest in printable hexadecimal format. If the filename starts with "/" then it SHOULD start with the path, otherwise the server MUST ignore it. If the filename does not start with "/" then the server SHALL treat it as relative to the path.

Clients that know they are in the classic pub-sub use case just don't provide any cache data, and clients that want recovery provide their cache data. It requires no state in the server, no up-front coordination, and works equally well for brand new clients (which may have received files via some out-of-band means), and clients that received some files and were then disconnected for a while.

I decided to use SHA-1 digests for several reasons. First, it's fast enough: 150msec to digest a 25MB core dump on my laptop. Second, it's reliable: the chance of getting the same hash for different versions of one file is close enough to zero. Third, it's the most widely supported digest algorithm. A cyclic redundancy check (e.g., CRC-32) is faster but not reliable. More recent SHA versions (SHA-256, SHA-512) are more secure but take 50% more CPU cycles, and are overkill for our needs.
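For illustration, here is how you might compute such a digest in C using OpenSSL's SHA1 functions. This is not FileMQ's own hashing code; it just shows the digest format that cache entries carry (build with -lcrypto):

//  Sketch: compute a file's SHA-1 digest in printable hexadecimal form
#include <stdio.h>
#include <openssl/sha.h>

int main (int argc, char *argv [])
{
    if (argc < 2)
        return 1;
    FILE *file = fopen (argv [1], "rb");
    if (!file)
        return 1;

    SHA_CTX context;
    SHA1_Init (&context);
    unsigned char buffer [65536];
    size_t size;
    while ((size = fread (buffer, 1, sizeof (buffer), file)) > 0)
        SHA1_Update (&context, buffer, size);
    fclose (file);

    unsigned char digest [SHA_DIGEST_LENGTH];
    SHA1_Final (digest, &context);

    //  Print as uppercase hex, the format shown in the ICANHAZ dump below
    int byte_nbr;
    for (byte_nbr = 0; byte_nbr < SHA_DIGEST_LENGTH; byte_nbr++)
        printf ("%02X", digest [byte_nbr]);
    printf ("\n");
    return 0;
}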

Here is what a typical ICANHAZ message looks like when we use both caching and resyncing (this is output from the dump method of the generated codec class):

ICANHAZ:
    path='/photos'
    options={
        RESYNC=1
    }
    cache={
        DSCF0001.jpg=1FABCD4259140ACA99E991E7ADD2034AC57D341D
        DSCF0006.jpg=01267C7641C5A22F2F4B0174FFB0C94DC59866F6
        DSCF0005.jpg=698E88C05B5C280E75C055444227FEA6FB60E564
        DSCF0004.jpg=F0149101DD6FEC13238E6FD9CA2F2AC62829CBD0
        DSCF0003.jpg=4A49F25E2030B60134F109ABD0AD9642C8577441
        DSCF0002.jpg=F84E4D69D854D4BF94B5873132F9892C8B5FA94E
    }

Although we don't do this in FileMQ, the server can use the cache information to help the client catch up with deletions that it missed. To do this, it would have to log deletions, and then compare this log with the client cache when a client subscribes.

Test Use Case: The Track Tool
topprevnext

To properly test something like FileMQ we need a test case that plays with live data. One of my sysadmin tasks is to manage the MP3 tracks on my music player, which is, by the way, a Sansa Clip reflashed with Rockbox, which I highly recommend. As I download tracks into my Music folder, I want to copy these to my player, and as I find tracks that annoy me, I delete them in the Music folder and want those gone from my player too.

This is kind of overkill for a powerful file distribution protocol. I could write this using a Bash or Perl script, but to be honest the hardest work in FileMQ was the directory comparison code and I want to benefit from that. So I put together a simple tool called track, which calls the FileMQ API. From the command line it takes two arguments: the sending and the receiving directories:

./track /home/ph/Music /media/3230-6364/MUSIC

The code is a neat example of how to use the FileMQ API to do local file distribution. Here is the full program, minus the license text (it's MIT/X11 licensed):

#include "czmq.h"
#include "../include/fmq.h"

int main (int argc, char *argv [])
{
fmq_server_t *server = fmq_server_new ();
fmq_server_configure(安装) (server, "anonymous.cfg");
fmq_server_publish (server, argv [1], "/");
fmq_server_set_anonymous(匿名的) (server, true);
fmq_server_bind (server, "tcp://*:5670");

fmq_client_t *client = fmq_client_new ();
fmq_client_connect (client, "tcp://localhost:5670");
fmq_client_set_inbox(收件箱) (client, argv [2]);
fmq_client_set_resync (client, true);
fmq_client_subscribe(签署) (client, "/");

while (true) {
// Get message from fmq_client API
zmsg_t *msg = fmq_client_recv (client);
if (!msg)
break; // Interrupted
char *command = zmsg_popstr (msg);
if (streq (command, "DELIVER")) {
char *filename = zmsg_popstr (msg);
char *fullname = zmsg_popstr (msg);
printf ("I: received %s (%s)\n", filename, fullname);
free (filename);
free (fullname);
}
free (command);
zmsg_destroy (&msg);
}
fmq_server_destroy (&server);
fmq_client_destroy (&client);
return 0;
}

Note how we work with physical paths in this tool. The server publishes the physical path /home/ph/Music and maps this to the virtual path /. The client subscribes to / and receives all files in /media/3230-6364/MUSIC. I could use any structure within the server directory, and it would be copied faithfully to the client's inbox. Note the API method fmq_client_set_resync(), which causes a server-to-client synchronization.

Getting an Official Port Number

topprevnext

We've been using port 5670 in the examples for FILEMQ. Unlike all the previous examples in this book, this port isn't arbitrary but was assigned by the Internet Assigned Numbers Authority (IANA), which "is responsible for the global coordination of the DNS Root, IP addressing, and other Internet protocol resources".

I'll explain very briefly when and how to request registered port numbers for your application protocols. The main reason is to ensure that your applications can run in the wild without conflict with other protocols. Technically, if you ship any software that uses port numbers between 1024 and 49151, you should be using only IANA-registered port numbers. Many products don't bother with this, however, and tend instead to use the IANA list as "ports to avoid".

If you aim to make a public protocol of any importance, such as FILEMQ, you're going to want an IANA-registered port. I'll explain briefly how to do this:

  • Document your protocol clearly, as IANA will want a specification of how you intend to use the port. It does not have to be a fully formed protocol specification, but must be solid enough to pass expert review.
  • Decide what transport protocols you want: UDP, TCP, SCTP, and so on. With ZeroMQ you will usually only want TCP.
  • Fill in the application on iana.org, providing all the necessary information.
  • IANA will then continue the process by email until your application is accepted or rejected.

Note that you don't request a specific port number; IANA will assign you one. It's therefore wise to start this process before you ship software, not afterwards.


Chapter 8 - A Framework for Distributed Computing

topprevnext

We've gone through a journey of understanding ZeroMQ in its many aspects. By now you may have started to build your own products using the techniques I explained, as well as others you've figured out yourself. You will start to face questions about how to make these products work in the real world.

But what is that "real world"? I'll argue that it is becoming a world of ever increasing numbers of moving pieces. Some people use the phrase the "Internet of Things", suggesting that we'll see a new category of devices(装置) that are more numerous(许多的) but also more stupid than our current smart phones, tablets, laptops(膝上型轻便电脑), and servers. However, I don't think the data points this way at all. Yes, there are more and more devices, but they're not stupid at all. They're smart and powerful and getting more so all the time.

The mechanism at work is something I call "Cost Gravity" and it has the effect of reducing the cost of technology by half every 18-24 months. Put another way, our global computing capacity doubles every two years, over and over and over. The future is filled with trillions of devices that are fully powerful multi-core computers: they don't run a cut-down "operating system for things" but full operating systems and full applications.

And this is the world we're targeting with ZeroMQ. When we talk of "scale", we don't mean hundreds of computers, or even thousands. Think of clouds of tiny smart and perhaps self-replicating machines surrounding every person, filling every space, covering every wall, filling the cracks and eventually becoming so much a part of us that we get them before birth and they follow us to death.

These clouds of tiny machines talk to each other, all the time, over short-range wireless links using the Internet Protocol. They create mesh networks, pass information and tasks around like nervous signals. They augment our memory, vision, every aspect of our communications, and physical functions. And it's ZeroMQ that powers their conversations and events and exchanges of work and information.

Now, to make even a thin imitation of this come true today, we need to solve a set of technical problems. These include: How do peers discover each other? How do they talk to existing networks like the Web? How do they protect the information they carry? How do we track and monitor them, to get some idea of what they're doing? Then we need to do what most engineers forget about: package this solution into a framework that is dead easy for ordinary developers to use.

This is what we'll attempt in this chapter: to build a framework for distributed applications as an API, protocols, and implementations. It's not a small challenge, but I've claimed often that ZeroMQ makes such problems simple, so let's see if that's still true.

We'll cover:

  • Requirements for distributed computing
  • The pros and cons of WiFi for proximity networking
  • Discovery using UDP and TCP
  • A message-based API
  • Creating a new open source project
  • Peer-to-peer connectivity (the Harmony pattern)
  • Tracking peer presence and disappearance
  • Group messaging without central coordination
  • Large-scale testing and simulation
  • Dealing with high-water marks and blocked peers
  • Distributed logging and monitoring

Design for The Real World

topprevnext

Whether we're connecting a roomful of mobile devices over WiFi or a cluster of virtual boxes over simulated Ethernet, we will hit the same kinds of problems. These are:

  • Discovery: how do we learn about other nodes on the network? Do we use a discovery service, centralized mediation, or some kind of broadcast beacon?
  • Presence: how do we track when other nodes come and go? Do we use some kind of central registration service, or heartbeating or beacons?
  • Connectivity: how do we actually connect one node to another? Do we use local networking, wide-area networking, or do we use a central message broker to do the forwarding?
  • Point-to-point messaging: how do we send a message from one node to another? Do we send this to the node's network address, or do we use some indirect addressing via a centralized message broker?
  • Group messaging: how do we send a message from one node to a group of others? Do we work via a centralized message broker, or do we use a pub-sub model like ZeroMQ?
  • Testing and simulation: how do we simulate large numbers of nodes so we can test performance properly? Do we have to buy two dozen Android tablets, or can we use pure software simulation?
  • Distributed logging: how do we track what this cloud of nodes is doing so we can detect performance problems and failures? Do we create a main logging service, or do we allow every device to log the world around it?
  • Content distribution: how do we send content from one node to another? Do we use server-centric protocols like FTP or HTTP, or do we use decentralized protocols like FileMQ?

If we can solve these problems reasonably well, and the further problems that will emerge (like security and wide-area bridging), we get something like a framework for what I might call "Really Cool Distributed Applications", or as my grandkids call it, "the software our world runs on".

You should have guessed from my rhetorical questions that there are two broad directions in which we can go. One is to centralize everything. The other is to distribute everything. I'm going to bet on decentralization. If you want centralization, you don't really need ZeroMQ; there are other options you can use.

So very roughly, here's the story. One, the number of moving pieces increases exponentially over time (doubling every 24 months). Two, these pieces stop using wires because dragging cables everywhere gets really boring. Three, future applications run across clusters of these pieces using the Benevolent Tyrant pattern from Chapter 6 - The ZeroMQ Community. Four, today it's really difficult, nay still rather impossible, to build such applications. Five, let's make it cheap and easy using all the techniques and tools we've built up. Six, partay!

The Secret Life of WiFi

topprevnext

The future is clearly wireless, and while many big businesses live by concentrating data in their clouds, the future doesn't look quite so centralized. The devices at the edges of our networks get smarter every year, not dumber. They're hungry for work and information to digest and from which to profit. And they don't drag cables around, except once a night for power. It's all wireless and more and more, it's 802.11-branded WiFi of different alphabetical flavors.

Why Mesh Isn't Here Yet
topprevnext

As such a vital part of our future, WiFi has a big problem that's not often discussed, but that anyone betting on it needs to be aware of. The phone companies of the world have built themselves nice profitable mobile phone cartels in nearly every country with a functioning government, based on convincing governments that without monopoly rights to airwaves and ideas, the world would fall apart. Technically, we call this "regulatory capture" and "patents", but in fact it's just a form of blackmail and corruption. If you, the state, give me, a business, the right to overcharge, tax the market, and ban all real competitors, I'll give you 5%. Not enough? How about 10%? OK, 15% plus snacks. If you refuse, we pull service.

But WiFi snuck past this, borrowing unlicensed airspace and riding on the back of the open and unpatented and remarkably innovative Internet Protocol stack. So today, we have the curious situation where it costs me several Euro a minute to call from Seoul to Brussels if I use the state-backed infrastructure that we've subsidized over decades, but nothing at all if I can find an unregulated WiFi access point. Oh, and I can do video, send files and photos, and download entire home movies all for the same amazing price point of precisely zero point zero zero (in any currency you like). God help me if I try to send just one photo home using the service for which I actually pay. That would cost me more than the camera I took it on.

It is the price we pay for having tolerated the "trust us, we're the experts" patent system for so long. But more than that, it's a massive economic incentive to chunks of the technology sector—and especially chipset makers who own patents on the anti-Internet GSM, GPRS, 3G, and LTE stacks, and who treat the telcos as prime clients—to actively throttle WiFi development. And of course it's these firms that bulk out the IEEE committees that define WiFi.

The reason for this rant against lawyer-driven "innovation" is to steer your thinking towards "what if WiFi were really free?" This will happen one day, not too far off, and it's worth betting on. We'll see several things happen. First, much more aggressive use of airspace, especially for near-distance communications where there is no risk of interference. Second, big capacity improvements as we learn to use more airspace in parallel. Third, acceleration of the standardization process. Last, broader support in devices for really interesting connectivity.

Right now, streaming a movie from your phone to your TV is considered "leading edge". This is ridiculous. Let's get truly ambitious. How about a stadium of people watching a game, sharing photos and HD video with each other in real time, creating an ad-hoc event that literally saturates the airspace with a digital frenzy? I should be able to collect terabytes of imagery from those around me, in an hour. Why does this have to go through Twitter or Facebook and that tiny, expensive mobile data connection? How about a home with hundreds of devices all talking to each other over mesh, so when someone rings the doorbell, the porch lights stream video through to your phone or TV? How about a car that can talk to your phone and play your dubstep playlist without you plugging in wires?

To get more serious, why is our digital society in the hands of central points that are monitored, censored, logged, used to track who we talk to, collect evidence against us, and then shut down when the authorities decide we have too much free speech? The loss of privacy we're living through is only a problem when it's one-sided, but then the problem is calamitous. A truly wireless world would bypass all central censorship. It's how the Internet was designed, and it's quite feasible, technically (which is the best kind of feasible).

Some Physics
topprevnext

Naive developers of distributed software treat the network as infinitely fast and perfectly reliable. While this is approximately true for simple applications over Ethernet, WiFi rapidly proves the difference between magical thinking and science. That is, WiFi breaks so easily and dramatically under stress that I sometimes wonder how anyone would dare use it for real work. The ceiling moves up as WiFi gets better, but never fast enough to stop us hitting it.

To understand how WiFi performs technically, you need to understand a basic law of physics: the power required to connect two points increases according to the square of the distance. People who grow up in larger houses have exponentially louder voices, as I learned in Dallas. For a WiFi network, this means that as two radios get further apart, they have to either use more power or lower their signal rate.

There's only so much power you can pull out of a battery before users treat the device as hopelessly broken. Thus even though a WiFi network may be rated at a certain speed, the real bit rate between the access point (AP) and a client depends on how far apart the two are. As you move your WiFi-enabled phone away from the AP, the two radios trying to talk to each other will first increase their power and then reduce their bit rate.

This effect has some consequences of which we should be aware if we want to build robust distributed applications that don't dangle wires behind them like puppets:

  • If you have a group of devices talking to an AP, when the AP is talking to the slowest device, the whole network has to wait. It's like having to repeat a joke at a party to the designated driver who has no sense of humor, is still fully and tragically sober, and has a poor grasp of the language.
  • If you use unicast TCP and send a message to multiple devices, the AP must send the packets to each device separately. Yes, you knew this; it's also how Ethernet works. But now understand that one distant (or low-powered) device means everything waits for that slowest device to catch up.
  • If you use multicast or broadcast (which work the same, in most cases), the AP will send single packets to the whole network at once, which is awesome, but it will do it at the slowest possible bit rate (usually 1Mbps). You can adjust this rate manually in some APs. That just reduces the reach of your AP. You can also buy more expensive APs that have a little more intelligence and will figure out the highest bit rate they can safely use. You can also use enterprise APs with IGMP (Internet Group Management Protocol) support and ZeroMQ's PGM transport to send only to subscribed clients. I'd not, however, bet on such APs being widely available, ever.

As you try to put more devices onto an AP, performance rapidly gets worse to the point where adding one more device can break the whole network for everyone. Many APs solve this by randomly disconnecting clients when they reach some limit, such as four to eight devices for a mobile hotspot, 30-50 devices for a consumer AP, perhaps 100 devices for an enterprise AP.

What's the Current Status?
topprevnext

Despite its uncomfortable role as enterprise technology that somehow escaped into the wild, WiFi is already useful for more than getting a free Skype call. It's not ideal, but it works well enough to let us solve some interesting problems. Let me give you a rapid status report.

First, point-to-point versus access point-to-client. Traditional WiFi is all AP-client. Every packet has to go from client A to the AP, then to client B. You cut your bandwidth by 50%, but that's only half the problem. I explained about the inverse power law. If A and B are very close together, but both are far from the AP, they'll both be using a low bit rate. Imagine your AP is in the garage, and you're in the living room trying to stream video from your phone to your TV. Good luck!

There is an old "ad-hoc" mode that lets A and B talk to each other, but it's way too slow for anything fun, and of course, it's disabled on all mobile chipsets(芯片集). Actually, it's disabled in the top secret drivers that the chipset makers kindly provide to hardware(计算机硬件) makers. There is a new Tunneled(挖) Direct Link Setup (TDLS) protocol that lets two devices create a direct link, using an AP for discovery but not for traffic. And there's a "5G" WiFi standard (it's a marketing term, so it goes in quotes) that boosts(推动) link speeds to a gigabit(千兆比特). TDLS and 5G together make HD movie streaming from your phone to your TV a plausible(貌似可信的) reality. I assume(承担) TDLS will be restricted in various ways so as to placate(抚慰) the telcos.

Lastly, we saw standardization of the 802.11s mesh protocol in 2012, after a remarkably speedy ten years or so of work. Mesh removes the access point completely, at least in the imaginary future where it exists and is widely used. Devices talk to each other directly, and maintain little routing tables of neighbors that let them forward packets. Imagine the AP software embedded into every device, but smart enough (it's not as impressive as it sounds) to do multiple hops.

No one who is making money from the mobile data extortion racket wants to see 802.11s available, because city-wide mesh is such a nightmare for the bottom line, so it's happening as slowly as possible. The only large organization with the power (and, I assume, the surface-to-surface missiles) to get mesh technology into wide use is the US Army. But mesh will emerge, and I'd bet on 802.11s being widely available in consumer electronics by 2020 or so.

Second, if we don't have point-to-point, how far can we trust APs today? Well, if you go to a Starbucks in the US and try the ZeroMQ "Hello World" example using two laptops connected via the free WiFi, you'll find they cannot connect. Why? Well, the answer is in the name: "attwifi". AT&T is a good old incumbent telco that hates WiFi and presumably provides the service cheaply to Starbucks and others so that independents can't get into the market. But any access point you buy will support client-to-AP-to-client access, and outside the US I've never found a public AP locked down the AT&T way.

Third, performance. The AP is clearly a bottleneck; you cannot get better than half of its advertised speed even if you put A and B literally beside the AP. Worse, if there are other APs in the same airspace, they'll shout each other out. In my home, WiFi barely works at all because the neighbors two houses down have an AP which they've amplified. Even on a different channel, it interferes with our home WiFi. In the cafe where I'm sitting now there are over a dozen networks. Realistically, as long as we're dependent on AP-based WiFi, we're subject to random interference and unpredictable performance.

Fourth, battery life. There's no inherent reason that WiFi, when idle, is hungrier than Bluetooth, for example. They use the same radios and low-level framing. The main difference is tuning and in the protocols. For wireless power-saving to work well, devices have to mostly sleep and beacon out to other devices only once every so often. For this to work, they need to synchronize their clocks. This happens properly for the mobile phone part, which is why my old flip phone can run five days on a charge. When WiFi is working, it will use more power. Current power amplifier technology is also inefficient, meaning you draw a lot more energy from your battery than you pump into the air (the waste turns into a hot phone). Power amplifiers are improving as people focus more on mobile WiFi.

Lastly, mobile access points. If we can't trust centralized APs, and if our devices are smart enough to run full operating systems, can't we make them work as APs? I'm so glad you asked that question. Yes, we can, and it works quite nicely. Especially because you can switch this on and off in software, on a modern OS like Android. Again, the villains of the piece are the US telcos, who mostly detest this feature and kill it or cripple it on the phones they control. Smarter telcos realize that it's a way to amplify their "last mile" and bring higher-value products to more users, but crooks don't compete on smarts.

Conclusions
topprevnext

WiFi is not Ethernet, and although I believe future ZeroMQ applications will have a very important decentralized wireless presence, it's not going to be an easy road. Much of the basic reliability and capacity that you expect from Ethernet is missing. When you run a distributed application over WiFi, you must allow for frequent timeouts, random latencies, arbitrary disconnections, whole interfaces going down and coming up, and so on.

The technological evolution of wireless networking is best described as "slow and joyless". Applications and frameworks that try to exploit decentralized wireless are mostly absent or poor. The only existing open source framework for proximity networking is AllJoyn from Qualcomm. But with ZeroMQ, we proved that the inertia and decrepit incompetence of existing players was no reason for us to sit still. When we accurately understand problems, we can solve them. What we imagine, we can make real.

Discovery

topprevnext

Discovery is an essential part of network programming and a first-class problem for ZeroMQ developers. Every zmq_connect () call provides an endpoint string, and that has to come from somewhere. The examples we've seen so far don't do discovery: the endpoints they connect to are hard-coded as strings in the code. While this is fine for example code, it's not ideal for real applications. Networks don't behave that nicely. Things change, and it's how well we handle change that defines our long-term success.

Service Discovery
topprevnext

Let's start with definitions. Network discovery is finding out what other peers are on the network. Service discovery is learning what those peers can do for us. Wikipedia defines a "network service" as "a service that is hosted on a computer network", and a "service" as "a set of related software functionalities that can be reused for different purposes, together with the policies that should control its usage". It's not very helpful. Is Facebook a network service?

In fact the concept of "network service" has changed over time. The number of moving pieces keeps doubling every 18-24 months, breaking old conceptual models and pushing for ever simpler, more scalable ones. A service is, for me, a system-level application that other programs can talk to. A network service is one accessible remotely (as compared to, e.g., the "grep" command, which is a command-line service).

In the classic BSD socket model, a service maps 1-to-1 to a network port. A computer system offers a number of services like "FTP" and "HTTP", each with an assigned port. The BSD API has functions like getservbyname to map a service name to a port number. So a classic service maps to a network endpoint: if you know a server's IP address, you can find its FTP service, if it's running.
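As a reminder of how that classic mapping looks in code, here is a tiny example using getservbyname, which consults the system services database (/etc/services on most systems):

//  Classic BSD service lookup: map a service name to its assigned port
#include <stdio.h>
#include <netdb.h>
#include <arpa/inet.h>

int main (void)
{
    struct servent *service = getservbyname ("ftp", "tcp");
    if (service)
        printf ("ftp/tcp is port %d\n", ntohs (service->s_port));
    else
        puts ("ftp/tcp not found in the services database");
    return 0;
}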

In modern messaging, however, services don't map 1-to-1 to endpoints. One endpoint can lead to many services, and services can move around over time, between ports, or even between systems. Where is my cloud storage today? In a realistic large distributed application, therefore, we need some kind of service discovery mechanism.

There are many ways to do this and I won't try to provide an exhaustive list. However, there are a few classic patterns:

  • We can force the old 1-to-1 mapping from endpoint to service, and simply state up-front that a certain TCP port number represents a certain service. Our protocol then should let us check this ("Are the first 4 bytes of the request 'HTTP'?").
  • We can bootstrap one service off another: connect to a well-known endpoint and service, ask for the "real" service, and get an endpoint back in return. This gives us a service lookup service (see the sketch after this list). If the lookup service allows it, services can then move around as long as they update their location.
  • We can proxy one service through another, so that a well-known endpoint and service will provide other services indirectly (i.e., by forwarding messages to them). This is for instance how our Majordomo service-oriented broker works.
  • We can exchange lists of known services and endpoints, that change over time, using a gossip approach or a centralized approach (like the Clone pattern), so that each node in a distributed network can build up an eventually consistent map of the whole network.
  • We can create further abstract layers in between network endpoints and services, e.g., assigning each node a unique identifier, so we get a "network of nodes" where each node may offer some services, and may appear on random network endpoints.
  • We can discover services opportunistically, e.g., by connecting to endpoints and then asking them what services they offer. "Hi, do you offer a shared printer? If so, what's the maker and model?"
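Here is a minimal sketch of the lookup-service pattern, using a ZeroMQ REQ socket. The endpoint, the request format, and the "storage" service name are invented for this illustration; a real design would pin these down in a protocol:

//  Sketch: ask a well-known lookup service where the "real" service
//  lives, then connect to the endpoint it returns
#include <czmq.h>

int main (void)
{
    zctx_t *ctx = zctx_new ();
    void *lookup = zsocket_new (ctx, ZMQ_REQ);
    zsocket_connect (lookup, "tcp://well-known-host:5555");

    //  Ask the lookup service where "storage" currently lives
    zstr_send (lookup, "WHERE-IS storage");
    char *endpoint = zstr_recv (lookup);    //  e.g., "tcp://10.0.0.7:6001"
    if (endpoint) {
        void *storage = zsocket_new (ctx, ZMQ_DEALER);
        zsocket_connect (storage, "%s", endpoint);
        //  ... talk to the storage service here ...
        free (endpoint);
    }
    zctx_destroy (&ctx);
    return 0;
}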

There's no "right answer". The range of options is huge, and changes over time as the scale(规模) of our networks grows. In some networks the knowledge of what services run where can literally(照字面地) become political power. ZeroMQ imposes(利用) no specific(特殊的) model but makes it easy to design and build the ones that suit us best. However, to build service discovery, we must start by solving network discovery.

Network Discovery
topprevnext

Here is a list of the solutions I know for network discovery:

  • Use hard-coded endpoint strings, i.e., fixed IP addresses and agreed ports. This worked in internal networks a decade ago when there were a few "big servers" and they were so important they got static IP addresses. These days however it's no use except in examples or for in-process work (threads are the new Big Iron). You can make it hurt a little less by using DNS but this is still painful for anyone who's not also doing system administration as a side-job.
  • Get endpoint strings from configuration files. This shoves name resolution into user space, which hurts less than DNS but that's like saying a punch in the face hurts less than a kick in the groin. You now get a non-trivial management problem. Who updates the configuration files, and when? Where do they live? Do we install a distributed management tool like Salt Stack?
  • Use a message broker. You still need a hard-coded or configured endpoint string to connect to the broker, but this approach reduces the number of different endpoints in the network to one. That makes a real impact, and broker-based networks do scale nicely. However, brokers are single points of failure, and they bring their own set of worries about management and performance.
  • Use an addressing broker. In other words, use a central service to mediate address information (like a dynamic DNS setup) but allow nodes to send each other messages directly. It's a good model but still creates a point of failure and management costs.
  • Use helper libraries, like ZeroConf, that provide DNS services without any centralized infrastructure. It's a good answer for certain applications but your mileage will vary. Helper libraries aren't zero cost: they make it more complex to build the software, they have their own restrictions, and they aren't necessarily portable.
  • Build system-level discovery by sending out ARP or ICMP ECHO packets and then querying every node that responds. You can query through a TCP connection, for example, or by sending UDP messages. Some products do this, like the Eye-Fi wireless card.
  • Do user-level brute-force discovery by trying to connect to every single address in the network segment. You can do this trivially in ZeroMQ since it handles connections in the background. You don't even need multiple threads. It's brutal but fun, and works very well in demos and workshops. However it doesn't scale, and annoys decent-thinking engineers.
  • Roll your own UDP-based discovery protocol. Lots of people do this (I counted about 80 questions on this topic on StackOverflow). UDP works well for this and it's technically clear. But it's technically tricky to get right, to the point where any developer doing this the first few times will get it dramatically wrong.
  • Gossip discovery protocols. A fully interconnected network is quite effective for smaller numbers of nodes (say, up to 100 or 200). For large numbers of nodes, we need some kind of gossip protocol. That is, one where the nodes we can reasonably discover (say, on the same segment as us) tell us about nodes that are further away. Gossip protocols go beyond what we need these days with ZeroMQ, but will likely be more common in the future. One example of a wide-area gossip model is mesh networking.

The Use Case
topprevnext

Let's define our use case more explicitly. After all, all these different approaches have worked and still work to some extent. What interests me as an architect is the future, and finding designs that can continue to work for more than a few years. This means identifying long-term trends. Our use case isn't here and now, it's ten or twenty years from today.

Here are the long-term trends I see in distributed applications:

  • The overall number of moving pieces keeps increasing. My estimate is that it doubles every 24 months, but how fast it increases matters less than the fact that we keep adding more and more nodes to our networks. They're not just boxes but also processes and threads. The driver here is cost, which keeps falling. In a decade, the average teenager will carry 30-50 devices, all the time.
  • Control shifts away from the center. Possibly data too, though we're still far from understanding how to build simple decentralized information stores. In any case, the star topology is slowly dying and being replaced by clouds of clouds. In the future there's going to be much more traffic within a local environment (home, office, school, bar) than between remote nodes and the center. The maths here are simple: remote communications cost more, run more slowly, and are less natural than close-range communications. It's more accurate both technically and socially to share a holiday video with your friend over local WiFi than via Facebook.
  • Networks are increasingly collaborative, less controlled. This means people bringing their own devices and expecting them to work seamlessly. The Web showed one way to make this work but we're reaching the limits of what the Web can do, as we start to exceed the average of one device per person.
  • The cost of connecting a new node to a network must fall proportionally, if the network is to scale. This means reducing the amount of configuration a node needs: less pre-shared state, less context. Again, the Web solved this problem but at the cost of centralization. We want the same plug and play experience but without a central agency.

In a world of trillions of nodes, the ones you talk to most are the ones closest to you. This is how it works in the real world and it's the sanest way of scaling large-scale architectures. Groups of nodes, logically or physically close, connected by bridges to other groups of nodes. A local group will be anything from half-a-dozen nodes to a few thousand nodes.

So we have two basic use cases:

  • Discovery for proximity networks, that is, a set of nodes that find themselves close to each other. We can define "close to each other" as being "on the same network segment". It's not going to be true in all cases but it's true enough to be a useful place to start.
  • Discovery across wide area networks, that is, bridging of proximity networks together. We sometimes call this "federation". There are many ways to do federation but it's complex and something to cover elsewhere. For now, let's assume we do federation using a centralized broker or service.

So we are left with the problem of proximity networking. I want to just plug things into the network and have them talking to each other. Whether they're tablets in a school or a bunch of servers in a cloud, the less upfront agreement and coordination, the cheaper it is to scale. So configuration files and brokers and any kind of centralized service are all out.

I also want to allow any number of applications on a box, both because that's how the real world works (people download apps), and so that I can simulate large networks on my laptop. Upfront simulation is the only way I know to be sure a system will work when it's loaded in real life. You'd be surprised how engineers just hope things will work. "Oh, I'm sure that bridge will stay up when we open it to traffic." If you haven't simulated and fixed the three most likely failures, they'll still be there on opening day.

Running multiple instances of a service on the same machine - without upfront coordination - means we have to use ephemeral ports, i.e., ports assigned randomly for services. Ephemeral ports rule out brute-force TCP discovery and any DNS solution, including ZeroConf.
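ZeroMQ itself makes ephemeral ports easy to use. Here is a sketch, assuming CZMQ's zsocket_bind, which returns the port it picked when the endpoint ends in ":*":

//  Sketch: bind to an ephemeral port and learn which port we got, so
//  several instances of a service can share one box without any upfront
//  port agreement
#include <czmq.h>

int main (void)
{
    zctx_t *ctx = zctx_new ();
    void *server = zsocket_new (ctx, ZMQ_ROUTER);

    int port = zsocket_bind (server, "tcp://*:*");
    assert (port > 0);
    printf ("Listening on ephemeral port %d\n", port);

    //  This port number is what a discovery beacon would advertise
    //  to peers on the same segment
    zctx_destroy (&ctx);
    return 0;
}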

Finally, discovery has to happen in user space because the apps we're building will be running on random boxes that we do not necessarily own and control. For example, other people's mobile devices. So any discovery that needs root permissions is excluded. This rules out ARP and ICMP, and once again ZeroConf, since that also needs root permissions for the service parts.

Technical Requirements
topprevnext

Let's recap the requirements:

  • The simplest possible solution that works. There are so many edge cases in ad-hoc networks that every extra feature or functionality becomes a risk.
  • Supports ephemeral ports, so that we can run realistic simulations. If the only way to test is to use real devices, it becomes impossibly expensive and slow to run tests.
  • No root access needed; it must run 100% in user space. We want to ship fully packaged applications onto devices like mobile phones that we don't own and where root access isn't available.
  • Invisible to system administrators, so we do not need their help to run our applications. Whatever technique we use should be friendly to the network and available by default.
  • Zero configuration apart from installing the applications themselves. Asking the users to do any configuration is giving them an excuse to not use the applications.
  • Fully portable to all modern operating systems. We can't assume we'll be running on any specific OS. We can't assume any support from the operating system except standard user-space networking. We can assume ZeroMQ and CZMQ are available.
  • Friendly to WiFi networks with up to 100-150 participants. This means keeping messages small and being aware of how WiFi networks scale and how they break under pressure.
  • Protocol-neutral, i.e., our beaconing should not impose any specific discovery protocol. I'll explain what this means a little later.
  • Easy to re-implement in any given language. Sure, we have a nice C implementation, but if it takes too long to re-implement in another language, that excludes large chunks of the ZeroMQ community. So, again, simple.
  • Fast response time. By this, I mean a new node should be visible to its peers in a very short time, a second or two at most. Networks change shape rapidly. It's OK to take longer, even 30 seconds, to realize a peer has disappeared.

From the list of possible solutions I collected, the only option that isn't disqualified for one or more reasons is to build our own UDP-based discovery stack. It's a little disappointing that after so many decades of research into network discovery, this is where we end up. But the history of computing does seem to go from complex to simple, so maybe it's normal.

A Self-Healing P2P Network in 30 Seconds
topprevnext

I mentioned brute-force discovery. Let's see how that works. One nice thing about software is to brute-force your way through the learning experience. As long as we're happy to throw away work, we can learn rapidly simply by trying things that may seem insane from the safety of the armchair.

I'll explain a brute-force discovery approach for ZeroMQ that emerged from a workshop in 2012. It is remarkably simple and stupid: connect to every IP address in the room. If your network segment is 192.168.55.x, for instance, you do this:

connect to tcp://192.168.55.1:9000
connect to tcp://192.168.55.2:9000
connect to tcp://192.168.55.3:9000
...
connect to tcp://192.168.55.254:9000

Which in ZeroMQ-speak looks like this:

int address;
for (address = 1; address < 255; address++)
    zsocket_connect (listener, "tcp://192.168.55.%d:9000", address);

The stupid part is where we assume that connecting to ourselves is fine, where we assume that all peers are on the same network segment, and where we waste file handles as if they were free. Luckily these assumptions are often totally accurate. At least, often enough to let us do fun things.

The loop works because ZeroMQ connect calls are asynchronous and opportunistic. They lie in the shadows like hungry cats, waiting patiently to pounce on any innocent mouse that dared start up a service on port 9000. It's simple, effective, and worked first time.

It gets better: as peers leave and join the network, they'll automatically reconnect. We've designed a self-healing peer-to-peer network, in 30 seconds and three lines of code.

It won't work for real cases though. Poorer operating systems tend to run out of file handles, and networks tend to be more complex than one segment. And if one node squats a couple of hundred file handles, large-scale simulations (with many nodes on one box or in one process) are out of the question.

Still, let's see how far we can go with this approach before we throw it out. Here's a tiny decentralized chat program that lets you talk to anyone else on the same network segment. The code has two threads: a listener and a broadcaster. The listener creates a SUB socket and does the brute-force connection to all peers in the network. The broadcaster accepts input from the console and sends it on a PUB socket:


Python | Ada | Basic | C++ | C# | Clojure | CL | Delphi | Erlang | F# | Felix | Go | Haskell | Haxe | Java | Lua | Node.js | Objective-C | ooc | Perl | PHP | Q | Racket | Ruby | Scala | Tcl

The dechat program needs to know the current IP address, the interface, and an alias. We could get these in code from the operating system, but that's grunky, nonportable code. So we provide this information on the command line:

dechat 192.168.55.122 eth0 Joe
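The full dechat code is in the repository; as a rough sketch of its shape (simplified to take a network prefix and an alias instead of the address and interface, and written against the CZMQ v1 API used elsewhere in this chapter, so don't treat it as the real program), the two halves look something like this:

//  Sketch of a dechat-style program: a listener thread with a SUB socket
//  brute-force connected to the whole segment, and a main thread that
//  reads the console and publishes on a PUB socket
#include <czmq.h>

static void
listener_task (void *args, zctx_t *ctx, void *pipe)
{
    char *prefix = (char *) args;           //  e.g., "192.168.55"
    void *listener = zsocket_new (ctx, ZMQ_SUB);
    zsocket_set_subscribe (listener, "");
    int address;
    for (address = 1; address < 255; address++)
        zsocket_connect (listener, "tcp://%s.%d:9000", prefix, address);

    while (!zctx_interrupted) {
        char *message = zstr_recv (listener);
        if (!message)
            break;              //  Interrupted
        printf ("%s", message);
        free (message);
    }
}

int main (int argc, char *argv [])
{
    if (argc < 3) {
        puts ("usage: dechat-sketch <network-prefix> <alias>");
        return 0;
    }
    zctx_t *ctx = zctx_new ();
    zthread_fork (ctx, listener_task, argv [1]);

    //  Broadcaster: PUB socket bound on the agreed port
    void *broadcaster = zsocket_new (ctx, ZMQ_PUB);
    zsocket_bind (broadcaster, "tcp://*:9000");

    char line [1024];
    while (!zctx_interrupted && fgets (line, sizeof (line), stdin))
        zstr_sendf (broadcaster, "%s: %s", argv [2], line);

    zctx_destroy (&ctx);
    return 0;
}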

Preemptive Discovery over Raw Sockets
topprevnext

One of the great things about short-range wireless is the proximity. WiFi maps closely to the physical space, which maps closely to how we naturally organize. In fact, the Internet is quite abstract and this confuses a lot of people who kind of "get it" but in fact don't really. With WiFi, we have technical connectivity that is potentially super-tangible. You see what you get and you get what you see. Tangible means easy to understand and that should mean love from users instead of the typical frustration and seething hatred.

Proximity is the key. We have a bunch of WiFi radios in a room, happily beaconing to each other. For lots of applications, it makes sense that they can find each other and start chatting without any user input. After all, most real world data isn't private, it's just highly localized.

I'm in a hotel room in Gangnam, Seoul, with a 4G wireless hotspot, a Linux laptop, and a couple of Android phones. The phones and laptop are talking to the hotspot. The ifconfig command says my IP address is 192.168.1.2. Let me try some ping commands. DHCP servers tend to dish out addresses in sequence, so my phones are probably close by, numerically speaking:

$ ping 192.168.1.1
PING 192.168.1.1 (192.168.1.1) 56(84) bytes of data.
64 bytes from 192.168.1.1: icmp_req=1 ttl=64 time=376 ms
64 bytes from 192.168.1.1: icmp_req=2 ttl=64 time=358 ms
64 bytes from 192.168.1.1: icmp_req=4 ttl=64 time=167 ms
^C
--- 192.168.1.1 ping statistics ---
3 packets transmitted, 2 received, 33% packet loss, time 2001ms
rtt min/avg/max/mdev = 358.077/367.522/376.967/9.445 ms

Found one! 150-300 msec round-trip latency… that's a surprisingly high figure, something to keep in mind for later. Now I ping myself, just to try to double-check things:

$ ping 192.168.1.2
PING 192.168.1.2 (192.168.1.2) 56(84) bytes of data.
64 bytes from 192.168.1.2: icmp_req=1 ttl=64 time=0.054 ms
64 bytes from 192.168.1.2: icmp_req=2 ttl=64 time=0.055 ms
64 bytes from 192.168.1.2: icmp_req=3 ttl=64 time=0.061 ms
^C
--- 192.168.1.2 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 1998ms
rtt min/avg/max/mdev = 0.054/0.056/0.061/0.009 ms

The response time is a bit faster now, which is what we'd expect. Let's try the next couple of addresses:

$ ping 192.168.1.3
PING 192.168.1.3 (192.168.1.3) 56(84) bytes of data.
64 bytes from 192.168.1.3: icmp_req=1 ttl=64 time=291 ms
64 bytes from 192.168.1.3: icmp_req=2 ttl=64 time=271 ms
64 bytes from 192.168.1.3: icmp_req=3 ttl=64 time=132 ms
^C
--- 192.168.1.3 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2001ms
rtt min/avg/max/mdev = 132.781/231.914/291.851/70.609 ms

That's the second phone, with the same kind of latency as the first one. Let's continue and see if there are any other devices connected to the hotspot:

$ ping 192.168.1.4
PING 192.168.1.4 (192.168.1.4) 56(84) bytes of data.
^C
--- 192.168.1.4 ping statistics ---
3 packets transmitted, 0 received, 100% packet loss, time 2016ms

And that is it. Now, ping uses raw IP sockets to send ICMP_ECHO messages. The useful thing about ICMP_ECHO is that it gets a response from any IP stack that has not deliberately had echo switched off. That's still a common practice on corporate websites that fear the old "ping of death" exploit, where malformed messages could crash the machine.

I call this preemptive discovery because it doesn't take any cooperation from the device. We don't rely on any cooperation from the phones to see them sitting there; as long as they're not actively ignoring us, we can see them.

You might ask why this is useful. We don't know that the peers responding to ICMP_ECHO run ZeroMQ, that they are interested in talking to us, that they have any services we can use, or even what kind of device they are. However, knowing that there's something on address 192.168.1.3 is already useful. We also know, relatively speaking, how far away the device is; we know how many devices are on the network; and we know the rough state of the network (as in, good, poor, or terrible).

It isn't even hard to create ICMP_ECHO messages and send them. A few dozen lines of code, and we could use ZeroMQ multithreading to do this in parallel for addresses stretching out above and below our own IP address. Could be kind of fun.

However, sadly, there's a fatal flaw in my idea of using ICMP_ECHO to discover devices. To open a raw IP socket requires root privileges on a POSIX box. It stops rogue programs getting data meant for others. We can get the power to open raw sockets on Linux by giving sudo privileges to our command (the ping binary itself gets away with it because it's installed setuid root). On a mobile OS like Android, it requires root access, i.e., rooting the phone or tablet. That's out of the question for most people and so ICMP_ECHO is out of reach for most devices.

Expletive deleted! Let's try something in user space. The next step most people take is UDP multicast or broadcast. Let's follow that trail.

Cooperative Discovery Using UDP Broadcasts
topprevnext

Multicast tends to be seen as more modern and "better" than broadcast. In IPv6, broadcast doesn't work at all: you must always use multicast. Nonetheless, all IPv4 local network discovery protocols end up using UDP broadcast anyhow. The reasons: broadcast and multicast end up working much the same, except broadcast is simpler and less risky. Multicast is seen by network admins as kind of dangerous, as it can leak over network segments.

If you've never used UDP, you'll discover it's quite a nice protocol. In some ways, it reminds us of ZeroMQ, sending whole messages to peers using two different patterns: one-to-one, and one-to-many. The main problems with UDP are that (a) the POSIX socket API was designed for universal flexibility, not simplicity, (b) UDP messages are limited for practical purposes to about 1,500 bytes on LANs and 512 bytes on the Internet, and (c) when you start to use UDP for real data, you find that messages get dropped, especially as infrastructure tends to favor TCP over UDP.

Here is a minimal ping program that uses UDP instead of ICMP_ECHO:

//  UDP ping command
//  Model 1, does UDP work inline

#include <czmq.h>
#define PING_PORT_NUMBER 9999
#define PING_MSG_SIZE    1
#define PING_INTERVAL    1000  //  Once per second

static void
derp (char *s)
{
    perror (s);
    exit (1);
}

int main (void)
{
    zctx_t *ctx = zctx_new ();

    //  Create UDP socket
    int fd;
    if ((fd = socket (AF_INET, SOCK_DGRAM, IPPROTO_UDP)) == -1)
        derp ("socket");

    //  Ask operating system to let us do broadcasts from socket
    int on = 1;
    if (setsockopt (fd, SOL_SOCKET, SO_BROADCAST, &on, sizeof (on)) == -1)
        derp ("setsockopt (SO_BROADCAST)");

    //  Bind UDP socket to local port so we can receive pings
    struct sockaddr_in si_this = { 0 };
    si_this.sin_family = AF_INET;
    si_this.sin_port = htons (PING_PORT_NUMBER);
    si_this.sin_addr.s_addr = htonl (INADDR_ANY);
    if (bind (fd, (struct sockaddr *) &si_this, sizeof (si_this)) == -1)
        derp ("bind");

    byte buffer [PING_MSG_SIZE];

    //  We use zmq_poll to wait for activity on the UDP socket, because
    //  this function works on non-0MQ file handles. We send a beacon
    //  once a second, and we collect and report beacons that come in
    //  from other nodes:

    zmq_pollitem_t pollitems [] = {{ NULL, fd, ZMQ_POLLIN, 0 }};
    //  Send first ping right away
    uint64_t ping_at = zclock_time ();

    while (!zctx_interrupted) {
        long timeout = (long) (ping_at - zclock_time ());
        if (timeout < 0)
            timeout = 0;
        if (zmq_poll (pollitems, 1, timeout * ZMQ_POLL_MSEC) == -1)
            break;              //  Interrupted

        //  Someone answered our ping
        if (pollitems [0].revents & ZMQ_POLLIN) {
            struct sockaddr_in si_that;
            socklen_t si_len = sizeof (struct sockaddr_in);
            ssize_t size = recvfrom (fd, buffer, PING_MSG_SIZE, 0,
                (struct sockaddr *) &si_that, &si_len);
            if (size == -1)
                derp ("recvfrom");
            printf ("Found peer %s:%d\n",
                inet_ntoa (si_that.sin_addr), ntohs (si_that.sin_port));
        }
        if (zclock_time () >= ping_at) {
            //  Broadcast our beacon
            puts ("Pinging peers…");
            buffer [0] = '!';
            struct sockaddr_in si_that = si_this;
            inet_aton ("255.255.255.255", &si_that.sin_addr);
            if (sendto (fd, buffer, PING_MSG_SIZE, 0,
                (struct sockaddr *) &si_that, sizeof (struct sockaddr_in)) == -1)
                derp ("sendto");
            ping_at = zclock_time () + PING_INTERVAL;
        }
    }
    close (fd);
    zctx_destroy (&ctx);
    return 0;
}


C++ | Python | Ada | Basic | C# | Clojure | CL | Delphi | Erlang | F# | Felix | Go | Haskell | Haxe | Java | Lua | Node.js | Objective-C | ooc | Perl | PHP | Q | Racket | Ruby | Scala | Tcl

This code uses a single socket to broadcast 1-byte messages and receive anything that other nodes are broadcasting. When I run it, it shows just one node, which is itself:

Pinging peers...
Found peer 192.168.1.2:9999
Pinging peers...
Found peer 192.168.1.2:9999

If I switch(转换) off all networking and try again, sending a message fails, as I'd expect:

Pinging peers...
sendto: Network is unreachable

Working on the principle of solving the problems currently aiming at your throat, let's fix the most urgent issues in this first model. These issues are:

  • Using the 255.255.255.255 broadcast address is a bit dubious. On the one hand, this broadcast address means precisely "send to all nodes on the local network, and don't forward". However, if you have several interfaces (wired Ethernet, WiFi) then broadcasts will go out on your default route only, and via just one interface. What we want to do is either send our broadcast on each interface's broadcast address, or find the WiFi interface and its broadcast address.
  • Like many aspects of socket programming, getting information on network interfaces is not portable (see the sketch after this list). Do we want to write nonportable code in our applications? No, this is better hidden in a library.
  • There's no handling for errors except "abort", which is too brutal for transient problems like "your WiFi is switched off". The code should distinguish between soft errors (ignore and retry) and hard errors (assert).
  • The code needs to know its own IP address and ignore beacons that it sent out. Like finding the broadcast address, this requires inspecting the available interfaces.
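Just to show why this belongs in a library, here is a sketch of the interface enumeration using getifaddrs(), which works on Linux and the BSDs but not everywhere:

//  Sketch: list network interfaces and their broadcast addresses
#include <stdio.h>
#include <sys/socket.h>
#include <ifaddrs.h>
#include <net/if.h>
#include <netinet/in.h>
#include <arpa/inet.h>

int main (void)
{
    struct ifaddrs *interfaces;
    if (getifaddrs (&interfaces))
        return 1;

    struct ifaddrs *interface;
    for (interface = interfaces; interface; interface = interface->ifa_next) {
        //  Only IPv4 interfaces that are up and support broadcast
        if (interface->ifa_addr
        &&  interface->ifa_addr->sa_family == AF_INET
        &&  (interface->ifa_flags & IFF_UP)
        &&  (interface->ifa_flags & IFF_BROADCAST)
        &&  interface->ifa_broadaddr) {
            struct sockaddr_in *broadcast =
                (struct sockaddr_in *) interface->ifa_broadaddr;
            printf ("%s broadcasts on %s\n",
                interface->ifa_name, inet_ntoa (broadcast->sin_addr));
        }
    }
    freeifaddrs (interfaces);
    return 0;
}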

The simplest answer to these issues is to push the UDP code into a separate library that provides a clean API, like this:

// Constructor
static udp_t *
udp_new (int port_nbr);

// Destructor
static void
udp_destroy (udp_t **self_p);

// Returns UDP socket handle
static int
udp_handle (udp_t *self);

// Send message using UDP broadcast
static void
udp_send (udp_t *self, byte *buffer, size_t length);

// Receive message from UDP broadcast
static ssize_t
udp_recv (udp_t *self, byte *buffer, size_t length);

Here is the refactored UDP ping program that calls this library, which is much cleaner and nicer:

// UDP ping command
// Model 2, uses separate UDP library

#include <czmq.h>
#include "udplib.c"
#define PING_PORT_NUMBER 9999
#define PING_MSG_SIZE 1
#define PING_INTERVAL 1000 // Once per second

int main (void)
{
    zctx_t *ctx = zctx_new ();
    udp_t *udp = udp_new (PING_PORT_NUMBER);

    byte buffer [PING_MSG_SIZE];
    zmq_pollitem_t pollitems [] = {
        { NULL, udp_handle (udp), ZMQ_POLLIN, 0 }
    };
    // Send first ping right away
    uint64_t ping_at = zclock_time ();

    while (!zctx_interrupted) {
        long timeout = (long) (ping_at - zclock_time ());
        if (timeout < 0)
            timeout = 0;
        if (zmq_poll (pollitems, 1, timeout * ZMQ_POLL_MSEC) == -1)
            break;              // Interrupted

        // Someone answered our ping
        if (pollitems [0].revents & ZMQ_POLLIN)
            udp_recv (udp, buffer, PING_MSG_SIZE);

        if (zclock_time () >= ping_at) {
            puts ("Pinging peers...");
            buffer [0] = '!';
            udp_send (udp, buffer, PING_MSG_SIZE);
            ping_at = zclock_time () + PING_INTERVAL;
        }
    }
    udp_destroy (&udp);
    zctx_destroy (&ctx);
    return 0;
}



The library, udplib, hides a lot of the unpleasant code (which will become uglier as we make this work on more systems). I'm not going to print that code here. You can read it in the repository.

Now, there are more problems sizing us up and wondering if they can make lunch out of us. First, IPv4 versus IPv6 and multicast versus broadcast. In IPv6, broadcast doesn't exist at all; one uses multicast. From my experience with WiFi, IPv4 multicast and broadcast work identically except that multicast breaks in some situations where broadcast works fine. Some access points do not forward multicast packets. When you have a device (e.g., a tablet) that acts as a mobile AP, then it's possible it won't get multicast packets. Meaning, it won't see other peers on the network.

The simplest plausible solution is simply to ignore IPv6 for now, and use broadcast. A perhaps smarter solution would be to use multicast and deal with asymmetric beacons if they happen.

We'll stick with stupid and simple for now. There's always time to make it more complex.

Multiple Nodes on One Device

So we can discover nodes on the WiFi network, as long as they're sending out beacons as we expect. So I try to test with two processes. But when I run udpping2 twice, the second instance complains "'Address already in use' on bind" and exits. Oh, right. UDP and TCP both return an error if you try to bind two different sockets to the same port. This is right. The semantics of two readers on one socket would be weird, to say the least. Odd/even bytes? You get all the 1s, I get all the 0s?

However, a quick check of stackoverflow.com and some memory of a socket option called SO_REUSEADDR turns up gold. If I use that, I can bind several processes to the same UDP port, and they will all receive any message arriving on that port. It's almost as if the guys who designed this were reading my mind! (That's way more plausible than the chance that I may be reinventing the wheel.)

A quick test shows that SO_REUSEADDR works as promised. This is great because the next thing I want to do is design an API and then start dozens of nodes to see them discovering each other. It would be really cumbersome to have to test each node on a separate device. And when we get to testing how real traffic behaves on a large, flaky network, the two alternatives are simulation or temporary insanity.

And I speak from experience: we were, this summer, testing on dozens of devices at once. It takes about an hour to set up a full test run, and you need a space shielded from WiFi interference if you want any kind of reproducibility (unless your test case is "prove that interference kills WiFi networks faster than Orval can kill a thirst").

If I were a whiz Android developer with a free weekend, I'd immediately (as in, it would take me two days) port this code to my phone and get it sending beacons to my PC. But sometimes lazy is more profitable. I like my Linux laptop. I like being able to start a dozen threads from one process, and have each thread acting like an independent node. I like not having to work in a real Faraday cage when I can simulate one on my laptop.

Designing the API

I'm going to run N nodes on a device, and they are going to have to discover each other, as well as a bunch of other nodes out there on the local network. I can use UDP for local discovery as well as remote discovery. It's arguably not as efficient as using, e.g., the ZeroMQ inproc:// transport, but it has the great advantage that the exact same code will work in simulation and in real deployment.

If I have multiple nodes on one device, we clearly can't use the IP address and port number as node address. I need some logical node identifier. Arguably, the node identifier only has to be unique within the context of the device. My mind fills with complex stuff I could make, like supernodes that sit on real UDP ports and forward messages to internal nodes. I hit my head on the table until the idea of inventing new concepts leaves it.

Experience tells us that WiFi does things like disappear and reappear while applications are running. Users click on things, which does interesting things like change the IP address halfway through a session. We cannot depend on IP addresses, nor on established connections (in the TCP fashion). We need some long-lasting addressing mechanism that survives interfaces and connections being torn down and then recreated.

Here's the simplest solution I can see: we give every node a UUID, and specify that nodes, represented by their UUIDs, can appear or reappear at certain IP address:port endpoints, and then disappear again. We'll deal with recovery from lost messages later. A UUID is 16 bytes. So if I have 100 nodes on a WiFi network, each beaconing once a second, that's 1,600 bytes a second of UUIDs, or (doubling it for other random stuff) 3,200 bytes a second of beacon data that the air has to carry just for discovery and presence. Seems acceptable.

Back to concepts. We do need some names for our API. At the least we need a way to distinguish between the node object that is "us", and node objects that are our peers. We'll be doing things like creating an "us" and then asking it how many peers it knows about and who they are. The term "peer" is clear enough.

From the developer point of view, a node (the application) needs a way to talk to the outside world. Let's borrow a term from networking and call this an "interface". The interface represents us to the rest of the world and presents the rest of the world to us, as a set of other peers. It automatically does whatever discovery it must. When we want to talk to a peer, we get the interface to do that for us. And when a peer talks to us, it's the interface that delivers us the message.

This seems like a clean API design. How about the internals?

  • The interface must be multithreaded so that one thread can do I/O in the background, while the foreground API talks to the application. We used this design in the Clone and Freelance client APIs.
  • The interface background thread does the discovery business; bind to the UDP port, send out UDP beacons, and receive beacons.
  • We need to at least send UUIDs in the beacon message so that we can distinguish our own beacons from those of our peers.
  • We need to track peers that appear, and that disappear. For this, I'll use a hash table that stores all known peers and expire peers after some timeout.
  • We need a way to report peers and events to the caller. Here we get into a juicy question. How does a background I/O thread tell a foreground API thread that stuff is happening? Callbacks maybe? Heck no. We'll use ZeroMQ messages, of course.

The third iteration of the UDP ping program is even simpler and more beautiful than the second. The main body, in C, is just ten lines of code.
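The full program is in the repository; as a rough sketch of its shape (and only a sketch: zre_interface_new, zre_interface_recv, and zre_interface_destroy are my guesses at the method names, so check zre_interface.c for the real API), it reads something like this:

// UDP ping, model 3 - a sketch only, not the repository code
#include <czmq.h>
#include "zre_interface.h"      // Hypothetical header name

int main (void)
{
    zre_interface_t *interface = zre_interface_new ();
    while (true) {
        // Each received message is one event: JOINED, LEFT, and so on
        zmsg_t *msg = zre_interface_recv (interface);
        if (!msg)
            break;              // Interrupted
        zmsg_dump (msg);        // Print the event, as in the output below
        zmsg_destroy (&msg);
    }
    zre_interface_destroy (&interface);
    return 0;
}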



The interface code should be familiar if you've studied how we make multithreaded API classes:
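If you haven't seen that pattern recently, here is a compressed, self-contained sketch of it; interface_t, s_agent_task, and the placeholder agent body are illustrative, not the repository code:

// Multithreaded API class pattern: the constructor forks a background
// agent connected by an inproc pipe; the frontend object just
// exchanges messages with that agent
#include <czmq.h>

typedef struct {
    zctx_t *ctx;                // CZMQ context
    void *pipe;                 // Pipe to the background agent
} interface_t;

// Background agent: in the real class this owns the UDP, ROUTER, and
// DEALER sockets, and pushes JOINED/LEFT events up the pipe
static void
s_agent_task (void *args, zctx_t *ctx, void *pipe)
{
    while (!zctx_interrupted) {
        zclock_sleep (1000);            // Placeholder for the poll loop
        zstr_send (pipe, "placeholder");
    }
}

static interface_t *
interface_new (void)
{
    interface_t *self = (interface_t *) zmalloc (sizeof (interface_t));
    self->ctx = zctx_new ();
    self->pipe = zthread_fork (self->ctx, s_agent_task, NULL);
    return self;
}

// The frontend just reads whatever the agent sends it
static zmsg_t *
interface_recv (interface_t *self)
{
    return zmsg_recv (self->pipe);
}

int main (void)
{
    interface_t *self = interface_new ();
    while (!zctx_interrupted) {
        zmsg_t *msg = interface_recv (self);
        if (!msg)
            break;              // Interrupted
        zmsg_dump (msg);
        zmsg_destroy (&msg);
    }
    zctx_destroy (&self->ctx);
    free (self);
    return 0;
}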



When I run this in two windows, it reports one peer joining the network. I kill that peer and a few seconds later, it tells me the peer left:

--------------------------------------
[006] JOINED
[032] 418E98D4B7184844B7D5E0EE5691084C
--------------------------------------
[004] LEFT
[032] 418E98D4B7184844B7D5E0EE5691084C

What's nice about a ZeroMQ-message based API is that I can wrap this any way I like. For instance, I can turn it into callbacks if I really want those. I can also trace all activity on the API very easily.

Some notes about tuning. On Ethernet, five seconds (the expiry time I used in this code) seems like a lot. On a badly stressed WiFi network, you can get ping latencies of 30 seconds or more. If you use a too-aggressive value for the expiry, you'll disconnect nodes that are still there. On the other side, end user applications expect a certain liveliness. If it takes 30 seconds to report that a node has gone, users will get annoyed.

A decent strategy is to detect and report disappeared nodes rapidly, but only delete them after a longer interval. Visually, a node would be green when it's alive, then gray for a while as it went out of reach, then finally disappear. We're not doing this now, but will do it in the real implementation of the as-yet-unnamed framework we're making.

As we will also see later, we have to treat any input from a node, not just UDP beacons, as a sign of life. UDP may get squashed when there's a lot of TCP traffic. This is perhaps the main reason we're not using an existing UDP discovery library: it's necessary to integrate this tightly with our ZeroMQ messaging for it to work.

More About UDP

So we have discovery and presence working over UDP IPv4 broadcasts. It's not ideal, but it works for the local networks we have today. However, we can't use UDP for real work, not without additional work to make it reliable. There's a joke about UDP, but sometimes you'll get it and sometimes you won't.

We'll stick to TCP for all one-to-one messaging. There is one more use case for UDP after discovery, which is multicast file distribution. I'll explain why and how, then shelve that for another day. The why is simple: what we call "social networks" is just augmented culture. We create culture by sharing, and this means more and more sharing works that we make or remix. Photos, documents, contracts, tweets. The clouds of devices we're aiming towards do more of this, not less.

Now, there are two principal patterns for sharing content. One is the pub-sub pattern, where one node sends out content to a set of other nodes simultaneously. Second is the "late joiner" pattern, where a node arrives somewhat later and wants to catch up to the conversation. We can deal with the late joiner using TCP unicast. But doing TCP unicast to a group of clients at the same time has some disadvantages. First, it can be slower than multicast. Second, it's unfair because some will get the content before others.

Before you jump off to design a UDP multicast protocol, realize that it's not a simple calculation. When you send a multicast packet, the WiFi access point uses a low bit rate to ensure that even the furthest devices will get it safely. Most normal APs don't do the obvious optimization, which is to measure the distance of the furthest device and use that bit rate. Instead, they just use a fixed value. So if you have a few devices close to the AP, multicast will be insanely slow. But if you have a roomful of devices which all want to get the next chapter of the textbook, multicast can be insanely effective.

The curves cross at about 6-12 devices depending on the network. In theory, you could measure the curves in real time and create an adaptive protocol. That would be cool, but probably too hard for even the smartest of us.

If you do sit down and sketch out a UDP multicast protocol, realize that you need a channel for recovery, to get lost packets. You'd probably want to do this over TCP, using ZeroMQ. For now, however, we'll forget about multicast UDP and assume all traffic goes over TCP.

Spinning Off a Library Project


At this stage, however, the code is growing larger than an example should be, so it's time to create a proper GitHub project. It's a rule: build your projects in public view, and tell people about them as you go so your marketing and community building starts on Day 1. I'll walk through what this involves. I explained in Chapter 6 - The ZeroMQ Community about growing communities around projects. We need a few things:

  • A name
  • A slogan
  • A public GitHub repository
  • A README that links to the C4 process
  • License files
  • An issue tracker
  • Two maintainers
  • A first bootstrap version

The name and slogan first. The trademarks of the 21st century are domain names. So the first thing I do when spinning off a project is to look for a domain name that might work. Quite randomly, one of our old messaging projects was called "Zyre" and I have the domain name for it. The full name is a backronym: the ZeroMQ Realtime Exchange framework.

I'm somewhat shy about pushing new projects into the ZeroMQ community too aggressively, and normally would start a project in either my personal account or the iMatix organization. But we've learned that moving projects after they become popular is counterproductive. My predictions of a future filled with moving pieces are either valid or wrong. If this chapter is valid, we might as well launch this as a ZeroMQ project from the start. If it's wrong, we can delete the repository later or let it sink to the bottom of a long list of forgotten starts.

Start with the basics. The protocol (UDP and ZeroMQ/TCP) will be ZRE (ZeroMQ Realtime Exchange protocol) and the project will be Zyre. I need a second maintainer, so I invite my friend Dong Min (the Korean hacker behind JeroMQ, a pure-Java ZeroMQ stack) to join. He's been working on very similar ideas so is enthusiastic. We discuss this and we get the idea of building Zyre on top of JeroMQ, as well as on top of CZMQ and libzmq. This would make it a lot easier to run Zyre on Android. It would also give us two fully separate implementations from the start, which is always a good thing for a protocol.

So we take the FileMQ project I built in Chapter 7 - Advanced Architecture using ZeroMQ as a template for a new GitHub project. The GNU autoconf tools are quite decent, but have a painful syntax. It's easiest to copy existing project files and modify them. The FileMQ project builds a library, has test tools, license files, man pages, and so on. It's not too large so it's a good starting point.

I put together a README to summarize the goals of the project and point to C4. The issue tracker is enabled by default on new GitHub projects, so once we've pushed the UDP ping code as a first version, we're ready to go. However, it's always good to recruit more maintainers, so I create an issue "Call for maintainers" that says:

If you'd like to help click that lovely green "Merge Pull Request" button and get eternal good karma, add a comment confirming that you've read and understand the C4 process at http://rfc.zeromq.org/spec:22.

Finally, I change the issue tracker labels. By default, GitHub offers the usual variety of issue types, but with C4 we don't use them. Instead, we need just two labels ("Urgent", in red, and "Ready", in black).

Point-to-Point Messaging


I'm going to take the last UDP ping program and build a point-to-point messaging layer on top of that. Our goal is that we can detect peers as they join and leave the network, that we can send messages to them, and that we can get replies. It is a nontrivial problem to solve and takes Min and me two days to get a "Hello World" version working.

We had to solve a number of issues:

  • What information to send in the UDP beacon, and how to format it.
  • What ZeroMQ socket types to use to interconnect nodes.
  • What ZeroMQ messages to send, and how to format them.
  • How to send a message to a specific node.
  • How to know the sender of any message so we could send a reply.
  • How to recover from lost UDP beacons.
  • How to avoid overloading the network with beacons.

I'll explain these in enough detail so that you understand why we made each choice we did, with some code fragments to illustrate. We tagged this code as version 0.1.0 so you can look at the code: most of the hard work is done in zre_interface.c.

UDP Beacon Framing

Sending UUIDs across the network is the bare minimum for a logical addressing scheme. However, we have a few more aspects to get working before this will work in real use:

  • We need some protocol identification so that we can check for and reject invalid packets.
  • We need some version information so that we can change this protocol over time.
  • We need to tell other nodes how to reach us via TCP, i.e., a ZeroMQ port they can talk to us on.

Let's start with the beacon message format. We probably want a fixed protocol header that will never change in future versions and a body that depends on the version.

Figure 67 - ZRE discovery message

fig67.png

The version can be a 1-byte counter starting at 1. The UUID is 16 bytes and the port is a 2-byte port number because UDP nicely tells us the sender's IP address for every message we receive. Together with the 3-byte protocol signature, this gives us a 22-byte frame.

The C language (and a few others like Erlang) make it simple to read and write binary structures. We define the beacon frame structure:

#define BEACON_PROTOCOL "ZRE"
#define BEACON_VERSION 0x01

typedef struct {
    byte protocol [3];
    byte version;
    uuid_t uuid;
    uint16_t port;
} beacon_t;

This makes sending and receiving beacons quite simple. Here is how we send a beacon, using the zre_udp class to do the nonportable network calls:

// Beacon object
beacon_t beacon;

// Format beacon fields
beacon.protocol [0] = 'Z';
beacon.protocol [1] = 'R';
beacon.protocol [2] = 'E';
beacon.version = BEACON_VERSION;
memcpy (beacon.uuid, self->uuid, sizeof (uuid_t));
beacon.port = htons (self->port);

// Broadcast the beacon to anyone who is listening
zre_udp_send (self->udp, (byte *) &beacon, sizeof (beacon_t));

When we receive a beacon, we need to guard against bogus data. We're not going to be paranoid against, for example, denial-of-service attacks. We just want to make sure that we're not going to crash when a bad ZRE implementation sends us erroneous frames.

To validate a frame, we check its size and header. If those are OK, we assume the body is usable. When we get a UUID that isn't ourselves (recall, we'll get our own UDP broadcasts back), we can treat this as a peer:

// Get beacon frame from network
beacon_t beacon;
ssize_t size = zre_udp_recv (self->udp,
    (byte *) &beacon, sizeof (beacon_t));

// Basic validation on the frame
if (size != sizeof (beacon_t)
||  beacon.protocol [0] != 'Z'
||  beacon.protocol [1] != 'R'
||  beacon.protocol [2] != 'E'
||  beacon.version != BEACON_VERSION)
    return 0;               // Ignore invalid beacons

// If we got a UUID and it's not our own beacon, we have a peer
if (memcmp (beacon.uuid, self->uuid, sizeof (uuid_t))) {
    char *identity = s_uuid_str (beacon.uuid);
    s_require_peer (self, identity,
        zre_udp_from (self->udp), ntohs (beacon.port));
    free (identity);
}

True Peer Connectivity (Harmony Pattern)

Because ZeroMQ is designed to make distributed messaging easy, people often ask how to interconnect a set of true peers (as compared to obvious clients and servers). It is a thorny question and ZeroMQ doesn't really provide a single clear answer.

TCP, which is the most commonly-used transport in ZeroMQ, is not symmetric; one side must bind and one must connect, and though ZeroMQ tries to be neutral about this, it's not. When you connect, you create an outgoing message pipe. When you bind, you do not. When there is no pipe, you cannot write messages (ZeroMQ will return EAGAIN).

Developers who study ZeroMQ and then try to create N-to-N connections between sets of equal peers often try a ROUTER-to-ROUTER flow. It's obvious why: each peer needs to address a set of peers, which requires ROUTER. It usually ends with a plaintive email to the list.

Experience teaches us that ROUTER-to-ROUTER is particularly difficult to use successfully. At a minimum, one peer must bind and one must connect, meaning the architecture is not symmetrical. But also because you simply can't tell when you are allowed to safely send a message to a peer. It's a Catch-22: you can talk to a peer after it's talked to you, but the peer can't talk to you until you've talked to it. One side or the other will be losing messages and thus has to retry, which means the peers cannot be equal.

I'm going to explain the Harmony pattern, which solves this problem, and which we use in Zyre.

We want a guarantee that when a peer "appears" on our network, we can talk to it safely without ZeroMQ dropping messages. For this, we have to use a DEALER or PUSH socket that connects out to the peer so that even if that connection takes some non-zero time, there is immediately a pipe and ZeroMQ will accept outgoing messages.

A DEALER socket cannot address multiple peers individually. But if we have one DEALER per peer, and we connect that DEALER to the peer, we can safely send messages to a peer as soon as we've connected to it.

Now, the next problem is to know who sent us a particular message. We need a reply address that is the UUID of the node who sent any given message. DEALER can't do this unless we prefix every single message with that 16-byte UUID, which would be wasteful. ROUTER does do it if we set the identity properly before connecting to the router.

And so the Harmony pattern comes down to these components (a minimal sketch follows the list):

  • One ROUTER socket that we bind to an ephemeral port, which we broadcast in our beacons.
  • One DEALER socket per peer that we connect to the peer's ROUTER socket.
  • Reading from our ROUTER socket.
  • Writing to the peer's DEALER socket.
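Here is that wiring as a self-contained sketch. The endpoints, the UUID string, and the port are placeholders, and I'm assuming the CZMQ ephemeral-port bind described further down, so treat this as an illustration rather than the Zyre code:

// Harmony pattern wiring - a sketch with placeholder values
#include <czmq.h>

int main (void)
{
    zctx_t *ctx = zctx_new ();

    // One ROUTER, bound to an ephemeral port; CZMQ picks a free port
    // in the dynamic range and returns it, and we put that port into
    // our UDP beacon so peers know where to connect
    void *inbox = zsocket_new (ctx, ZMQ_ROUTER);
    int port = zsocket_bind (inbox, "tcp://*:*");
    printf ("I: broadcasting port %d in our beacons\n", port);

    // One DEALER per discovered peer; we set our UUID as the identity
    // so the peer's ROUTER knows who each message came from
    void *outbox = zsocket_new (ctx, ZMQ_DEALER);
    zsocket_set_identity (outbox, "OUR-UUID-AS-STRING");
    zsocket_connect (outbox, "tcp://192.168.1.10:5670");    // From beacon

    // First command to any new peer is OHAI with our own endpoint
    char endpoint [64];
    snprintf (endpoint, sizeof (endpoint), "tcp://192.168.1.2:%d", port);
    zstr_sendm (outbox, "OHAI");
    zstr_send (outbox, endpoint);

    zctx_destroy (&ctx);
    return 0;
}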

The next problem is that discovery isn't neatly synchronized. We can get the first beacon from a peer after we start to receive messages from it. A message comes in on the ROUTER socket and has a nice UUID attached to it, but no physical IP address and port. We have to force discovery over TCP. To do this, our first command to any new peer to which we connect is an OHAI command with our IP address and port. This ensures that the receiver connects back to us before trying to send us any command.

Here it is, broken down into steps:

  • If we receive a UDP beacon from a new peer, we connect to the peer through a DEALER socket.
  • We read messages from our ROUTER socket, and each message comes with the UUID of the sender.
  • If it's an OHAI message, we connect back to that peer if not already connected to it.
  • If it's any other message, we must already be connected to the peer (a good place for an assertion).
  • We send messages to each peer using the per-peer DEALER socket, which must be connected.
  • When we connect to a peer, we also tell our application that the peer exists.
  • Every time we get a message from a peer, we treat that as a heartbeat (it's alive).

If we were not using UDP but some other discovery mechanism, I'd still use the Harmony pattern for a true peer network: one ROUTER for input from all peers, and one DEALER per peer for output. Bind the ROUTER, connect the DEALER, and start each conversation with an OHAI equivalent that provides the return IP address and port. You would need some external mechanism to bootstrap each connection.

Detecting Disappearances

Heartbeating sounds simple but it's not. UDP packets get dropped when there's a lot of TCP traffic, so if we depend on UDP beacons, we'll get false disconnections. TCP traffic can be delayed for 5, 10, even 30 seconds if the network is really busy. So if we kill peers when they go quiet, we'll have false disconnections.

Because UDP beacons aren't reliable, it's tempting to add in TCP beacons. After all, TCP will deliver them reliably. However, there's one little problem. Imagine that you have 100 nodes on a network, and each node sends a TCP beacon once a second. Each beacon is 22 bytes, not counting TCP's framing overhead. That is 100 * 99 * 22 bytes per second, or about 218,000 bytes/second just for heartbeating. That's about 1-2% of a typical WiFi network's ideal capacity, which sounds OK. But when a network is stressed or fighting other networks for airspace, that extra 200K a second will break what's left. UDP broadcasts are at least low cost.

So what we do is switch to TCP heartbeats only when a specific peer hasn't sent us any UDP beacons in a while. And then we send TCP heartbeats only to that one peer. If the peer continues to be silent, we conclude it's gone away. If the peer comes back with a different IP address and/or port, we have to disconnect our DEALER socket and reconnect to the new port.
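As a sketch of that logic, with illustrative names and timeouts rather than the repository's exact code, the per-peer check on each pass through the poll loop looks roughly like this:

#include <czmq.h>
#include <stdbool.h>

#define EVASIVE_MSECS   5000    // Silent this long: send a HUGZ over TCP
#define EXPIRED_MSECS  10000    // Silent this long: the peer is gone

typedef struct {
    uint64_t last_seen;         // Last UDP beacon or TCP command, msecs
    bool pinged;                // Did we already send a HUGZ heartbeat?
} peer_t;

// Returns true if the caller should disconnect and destroy this peer
static bool
s_peer_check (peer_t *self, uint64_t now)
{
    if (now - self->last_seen > EXPIRED_MSECS)
        return true;            // No beacons, no reply to HUGZ: it's gone
    if (now - self->last_seen > EVASIVE_MSECS && !self->pinged)
        self->pinged = true;    // Quiet over UDP; real code sends HUGZ over TCP here
    return false;
}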

This gives us a set of states for each peer, though at this stage the code doesn't use a formal state machine:

  • Peer visible thanks to UDP beacon (we connect using IP address and port from beacon)
  • Peer visible thanks to OHAI command (we connect using IP address and port from command)
  • Peer seems alive (we got a UDP beacon or command over TCP recently)
  • Peer seems quiet (no activity in some time, so we send a HUGZ command)
  • Peer has disappeared (no reply to our HUGZ commands, so we destroy peer)

There's one remaining scenario we didn't address in the code at this stage. It's possible for a peer to change IP addresses and ports without actually triggering a disappearance event. For example, if the user switches off WiFi and then switches it back on, the access point can assign the peer a new IP address. We'll need to handle a disappeared WiFi interface on our node by unbinding the ROUTER socket and rebinding it when we can. Because this is not central to the design now, I decide to log an issue on the GitHub tracker and leave it for a rainy day.

Group Messaging


Group messaging is a common and very useful pattern. The concept is simple: instead of talking to a single node, you talk to a "group" of nodes. The group is just a name, a string that you agree on in the application. It's precisely like using the pub-sub prefixes in PUB and SUB sockets. In fact, the only reason I say "group messaging" and not "pub-sub" is to prevent confusion, because we're not going to use PUB-SUB sockets for this.

PUB-SUB sockets would almost work. But we've just done such a lot of work to solve the late joiner problem. Applications are inevitably going to wait for peers to arrive before sending messages to groups, so we have to build on the Harmony pattern rather than start again beside it.

Let's look at the operations we want to do on groups:

  • We want to join and leave groups.
  • We want to know what other nodes are in any given group.
  • We want to send a message to (all nodes in) a group.

These look familiar to anyone who's used Internet Relay Chat, except that we have no server. Every node will need to keep track of what each group represents. This information will not always be fully consistent across the network, but it will be close enough.

Our interface will track a set of groups (each an object). These are all the known groups with one or more member nodes, excluding ourselves. We'll track nodes as they leave and join groups. Because nodes can join the network at any time, we have to tell new peers what groups we're in. When a peer disappears, we'll remove it from all groups we know about.

This gives us some new protocol commands:

  • JOIN - we send this to all peers when we join a group.
  • LEAVE - we send this to all peers when we leave a group.

Plus, we add a groups field to the first command we send (renamed from OHAI to HELLO at this point because I need a larger lexicon of command verbs).

Lastly, let's add a way for peers to double-check the accuracy of their group data. The risk is that we miss one of the above messages. Though we are using Harmony to avoid the typical message loss at startup, it's worth being paranoid. For now, all we need is a way to detect such a failure. We'll deal with recovery later, if the problem actually happens.

I'll use the UDP beacon for this. What we want is a rolling counter that simply tells how many join and leave operations ("transitions") there have been for a node. It starts at 0 and increments for each group we join or leave. We can use a minimal 1-byte value because that will catch all failures except the astronomically rare "we lost precisely 256 messages in a row" failure (this is the one that hits during the first demo). We will also put the transitions counter into the JOIN, LEAVE, and HELLO commands. And to try to provoke the problem, we'll test by joining/leaving several hundred groups with a high-water mark set to 10 or so.

It's time to choose verbs for the group messaging. We need a command that means "talk to one peer" and one that means "talk to many peers". After some attempts, my best choices are WHISPER and SHOUT, and this is what the code uses. The SHOUT command needs to tell the user the group name, as well as the sender peer.
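In API terms, that gives the interface a small set of group methods, something like the prototypes below; the exact names are my CZMQ-style guesses, not a quote from the repository:

// Join and leave a named group
int zre_interface_join (zre_interface_t *self, const char *group);
int zre_interface_leave (zre_interface_t *self, const char *group);

// Send a message to one peer, addressed by its UUID string (WHISPER)
int zre_interface_whisper (zre_interface_t *self, char *peer, zmsg_t **msg_p);

// Send a message to all peers in a group (SHOUT)
int zre_interface_shout (zre_interface_t *self, char *group, zmsg_t **msg_p);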

Because groups are like pub-sub, you might be tempted to use this to broadcast the JOIN and LEAVE commands as well, perhaps by creating a "global" group that all nodes join. My advice is to keep groups purely as user-space concepts for two reasons. First, how do you join the global group if you need the global group to send out a JOIN command? Second, it creates special cases (reserved names) which are messy.

It's simpler just to send JOINs and LEAVEs explicitly to all connected peers, period.

I'm not going to work through the implementation of group messaging in detail because it's fairly pedantic and not too exciting. The data structures for group and peer management aren't optimal, but they're workable. We use the following:

  • A list of groups for our interface, which we can send to new peers in a HELLO command;
  • A hash of groups for other peers, which we update with information from HELLO, JOIN, and LEAVE commands;
  • A hash of peers for each group, which we update with the same three commands.

At this stage, I'm starting to get pretty happy with the binary serialization (our codec generator from Chapter 7 - Advanced Architecture using ZeroMQ), which handles lists and dictionaries as well as strings and integers.

This version is tagged in the repository as v0.2.0 and you can download the tarball if you want to check what the code looked like at this stage.

Testing and Simulation


When you build a product out of pieces, and this includes a distributed framework like Zyre, the only way to know that it will work properly in real life is to simulate real activity on each piece.

On Assertions

The proper use of assertions is one of the hallmarks of a professional programmer.

Our confirmation bias as creators makes it hard to test our work properly. We tend to write tests to prove the code works, rather than trying to prove it doesn't. There are many reasons for this. We pretend to ourselves and others that we can be (could be) perfect, when in fact we consistently make mistakes. Bugs in code are seen as "bad", rather than "inevitable", so psychologically we want to see fewer of them, not uncover more of them. "He writes perfect code" is a compliment rather than a euphemism for "he never takes risks so his code is as boring and heavily used as cold spaghetti".

Some cultures teach us to aspire to perfection and punish mistakes in education and work, which makes this attitude worse. To accept that we're fallible, and then to learn how to turn that into profit rather than shame is one of the hardest intellectual exercises in any profession. We leverage our fallibilities by working with others and by challenging our own work sooner, not later.

One trick that makes it easier is to use assertions. Assertions are not a form of error handling. They are executable theories of fact. The code asserts, "At this point, such and such must be true" and if the assertion fails, the code kills itself.

The faster you can prove code incorrect, the faster and more accurately you can fix it. Believing that code works and proving that it behaves as expected is less science, more magical thinking. It's far better to be able to say, "libzmq has five hundred assertions and despite all my efforts, not one of them fails".

So the Zyre code base is scattered with assertions, and particularly a couple on the code that deals with the state of peers. This is the hardest aspect to get right: peers need to track each other and exchange state accurately or things stop working. The algorithms depend on asynchronous messages flying around and I'm pretty sure the initial design has flaws. It always does.

And as I test the original Zyre code by starting and stopping instances of zre_ping by hand, every so often I get an assertion failure. Running by hand doesn't reproduce these often enough, so let's make a proper tester tool.

On Up-Front Testing

Being able to fully test the real behavior of individual components in the laboratory can make a 10x or 100x difference to the cost of your project. That confirmation bias engineers have toward their own work makes up-front testing incredibly profitable, and late-stage testing incredibly expensive.

I'll tell you a short story about a project we worked on in the late 1990's. We provided the software and other teams provided the hardware for a factory automation project. Three or four teams brought their experts on-site, which was a remote factory (funny how the polluting factories are always in remote border country).

One of these teams, a firm specializing in industrial automation, built ticket machines: kiosks, and software to run on them. Nothing unusual: swipe a badge, choose an option, receive a ticket. They assembled two of these kiosks on-site, each week bringing some more bits and pieces. Ticket printers, monitor screens, special keypads from Israel. The stuff had to be resistant against dust because the kiosks sat outside. Nothing worked. The screens were unreadable in the sun. The ticket printers continually jammed and misprinted. The internals of the kiosk just sat on wooden shelving. The kiosk software crashed regularly. It was comedic except that the project really, really had to work and so we spent weeks and then months on-site helping the other teams debug their bits and pieces until it worked.

A year later, there was a second factory, and the same story. By this time, the client was getting impatient. So when they came to the third and largest factory, a year later, we jumped up and said, "please let us make the kiosks and the software and everything".

We made a detailed design for the software and hardware and found suppliers for all the pieces. It took us three months to search the Internet for each component (in those days, the Internet was a lot slower), and another two months to get them assembled into stainless-steel bricks each weighing about twenty kilos. These bricks were two feet square and eight inches deep, with a large flat-screen panel behind unbreakable glass, and two connectors: one for power, one for Ethernet. You loaded up the paper bin with enough for six months, then screwed the brick into a housing, and it automatically booted, found its DNS server, loaded its Linux OS and then application software. It connected to the real server, and showed the main menu. You got access to the configuration screens by swiping a special badge and then entering a code.

The software was portable so we could test that as we wrote it, and as we collected the pieces from our suppliers we kept one of each so we had a disassembled kiosk to play with. When we got our finished kiosks, they all worked immediately. We shipped them to the client, who plugged them into their housing, switched them on, and went to business. We spent a week or so on-site, and in ten years, one kiosk broke (the screen died, and was replaced).

Lesson is, test upfront so that when you plug the thing in, you know precisely how it's going to behave. If you haven't tested it upfront, you're going to be spending weeks and months in the field ironing out problems that should never have been there.

The Zyre Tester

During manual testing, I did hit an assertion rarely. It then disappeared. Because I don't believe in magic, I know that meant the code was still wrong somewhere. So, the next step was heavy-duty testing of the Zyre v0.2.0 code to try to break its assertions, and get a good idea of how it will behave in the field.

We packaged the discovery and messaging functionality as an interface object that the main program creates, works with, and then destroys. We don't use any global variables. This makes it easy to start large numbers of interfaces and simulate real activity, all within one process. And if there's one thing we've learned from writing lots of examples, it's that ZeroMQ's ability to orchestrate multiple threads in a single process is much easier to work with than multiple processes.

The first version of the tester consists of a main thread that starts and stops a set of child threads, each running one interface, each with a ROUTER, DEALER, and UDP socket (R, D, and U in the diagram).

Figure 68 - Zyre Tester Tool

fig68.png

The nice thing is that when I am connected to a WiFi access point, all Zyre traffic (even between two interfaces in the same process) goes across the AP. This means I can fully stress test any WiFi infrastructure with just a couple of PCs running in a room. It's hard to emphasize how valuable this is: if we had built Zyre as, say, a dedicated service for Android, we'd literally need dozens of Android tablets or phones to do any large-scale testing. Kiosks, and all that.

The focus is now on breaking the current code, trying to prove it wrong. There's no point at this stage in testing how well it runs, how fast it is, how much memory it uses, or anything else. We'll work up to trying (and failing) to break each individual functionality, but first, we try to break some of the core assertions I've put into the code.

These are:

  • The first command that any node receives from a peer MUST be HELLO. In other words, messages cannot be lost during the peer-to-peer connection process.
  • The state each node calculates for its peers matches the state each peer calculates for itself. In other words, again, no messages are lost in the network.
  • When my application sends a message to a peer, we have a connection to that peer. In other words, the application only "sees" a peer after we have established a ZeroMQ connection to it.

With ZeroMQ, there are several cases where we may lose messages. One is the "late joiner" syndrome. Two is when we close sockets without sending everything. Three is when we overflow the high-water mark on a ROUTER or PUB socket. Four is when we use an unknown address with a ROUTER socket.

Now, I think Harmony gets around all these potential cases. But we're also adding UDP to the mix. So the first version of the tester simulates an unstable and dynamic network, where nodes come and go randomly. It's here that things will break.

Here is the main thread of the tester, which manages a pool of 100 threads, starting and stopping each one randomly. Every ~750 msecs it either starts or stops one random thread. We randomize the timing so that threads aren't all synchronized. After a few minutes, we have an average of 50 threads happily chatting to each other like Korean teenagers in the Gangnam subway station:

int main (int argc, char *argv [])
{
    // Initialize context for talking to tasks
    zctx_t *ctx = zctx_new ();
    zctx_set_linger (ctx, 100);

    // Get number of interfaces to simulate, default 100
    int max_interface = 100;
    int nbr_interfaces = 0;
    if (argc > 1)
        max_interface = atoi (argv [1]);

    // We address interfaces as an array of pipes
    void **pipes = zmalloc (sizeof (void *) * max_interface);

    // We will randomly start and stop interface threads
    while (!zctx_interrupted) {
        uint index = randof (max_interface);
        // Toggle interface thread
        if (pipes [index]) {
            zstr_send (pipes [index], "STOP");
            zsocket_destroy (ctx, pipes [index]);
            pipes [index] = NULL;
            zclock_log ("I: Stopped interface (%d running)",
                --nbr_interfaces);
        }
        else {
            pipes [index] = zthread_fork (ctx, interface_task, NULL);
            zclock_log ("I: Started interface (%d running)",
                ++nbr_interfaces);
        }
        // Sleep ~750 msecs randomly so we smooth out activity
        zclock_sleep (randof (500) + 500);
    }
    zctx_destroy (&ctx);
    return 0;
}

Note that we maintain a pipe to each child thread (CZMQ creates the pipe automatically when we use the zthread_fork method). It's via this pipe that we tell child threads to stop when it's time for them to leave. The child threads do the following (I'm switching to pseudo-code for clarity):

create an interface
while true:
    poll on pipe to parent, and on interface
    if parent sent us a message:
        break
    if interface sent us a message:
        if message is ENTER:
            send a WHISPER to the new peer
        if message is EXIT:
            send a WHISPER to the departed peer
        if message is WHISPER:
            send back a WHISPER 1/2 of the time
        if message is SHOUT:
            send back a WHISPER 1/3 of the time
            send back a SHOUT 1/3 of the time
    once per second:
        join or leave one of 10 random groups
destroy interface

Test Results

Yes, we broke the code. Several times, in fact. This was satisfying. I'll work through the different things we found.

Getting nodes to agree on consistent group status was the most difficult. Every node needs to track the group membership of the whole network, as I already explained in the section "Group Messaging". Group messaging is a pub-sub pattern. JOINs and LEAVEs are analogous to subscribe and unsubscribe messages. It's essential that none of these ever get lost, or we'll find nodes dropping randomly off groups.

So each node counts the total number of JOINs and LEAVEs it's ever done, and broadcasts this status (as 1-byte rolling counter) in its UDP beacon. Other nodes pick up the status, compare it to their own calculations, and if there's a difference, the code asserts.

The first problem was that UDP beacons get delayed randomly, so they're useless for carrying the status. When a beacon arrives late, the status is inaccurate and we get a false negative. To fix this, we moved the status information into the JOIN and LEAVE commands. We also added it to the HELLO command. The logic then becomes (a minimal sketch follows the list):

  • Get initial status for a peer from its HELLO command.
  • When getting a JOIN or LEAVE from a peer, increment the status counter.
  • Check that the new status counter matches the value in the JOIN or LEAVE command.
  • If it doesn't, assert.
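Here is that check as a self-contained sketch, with illustrative names rather than the repository's own:

#include <czmq.h>

typedef struct {
    byte status;        // Our count of this peer's JOINs and LEAVEs
} peer_t;

// Called for each JOIN or LEAVE we receive from a peer;
// status_in_command is the 1-byte rolling counter the peer sent
static void
s_check_status (peer_t *self, byte status_in_command)
{
    self->status = (self->status + 1) & 0xFF;
    // If the counters diverge, we lost a JOIN or LEAVE somewhere
    assert (self->status == status_in_command);
}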

Next problem we got was that messages were arriving unexpectedly on new connections. The Harmony pattern connects, then sends HELLO as the first command. This means the receiving peer should always get HELLO as the first command from a new peer. We were seeing PING, JOIN, and other commands arriving.

This turned out to be due to CZMQ's ephemeral port logic. An ephemeral port is just a dynamically assigned port that a service can get rather than asking for a fixed port number. A POSIX system usually assigns ephemeral ports in the range 0xC000 to 0xFFFF. CZMQ's logic is to look for a free port in this range, bind to that, and return the port number to the caller.

This sounds fine, until you get one node stopping and another node starting close together, and the new node getting the port number of the old node. Remember that ZeroMQ tries to re-establish a broken connection. So when the first node stopped, its peers would retry to connect. When the new node appears on that same port, suddenly all the peers connect to it and start chatting like they're old buddies.

It's a general problem that affects any larger-scale dynamic ZeroMQ application. There are a number of plausible answers. One is to not reuse ephemeral ports, which is easier said than done when you have multiple processes on one system. Another solution would be to select a random port each time, which at least reduces the risk of hitting a just-freed port. This brings the risk of a garbage connection down to perhaps 1/1000 but it's still there. Perhaps the best solution is to accept that this can happen, understand the causes, and deal with it on the application level.

We have a stateful protocol that always starts with a HELLO command. We know that it's possible for peers to connect to us, thinking we're an existing node that went away and came back, and send us other commands. Step one is when we discover a new peer, to destroy any existing peer connected to the same endpoint. It's not a full answer but at least it's polite. Step two is to ignore anything coming in from a new peer until that peer says HELLO.

This doesn't require any change to the protocol, but it must be specified in the protocol when we come to it: due to the way ZeroMQ connections work, it's possible to receive unexpected commands from a well-behaving peer and there is no way to return an error code or otherwise tell that peer to reset its connection. Thus, a peer must discard any command from a peer until it receives HELLO.

In fact, if you draw this on a piece of paper and think it through, you'll see that you never get a HELLO from such a connection. The peer will send PINGs and JOINs and LEAVEs and then eventually time out and close, as it fails to get any heartbeats back from us.

You'll also see that there's no risk of confusion, no way for commands from two peers to get mixed into a single stream on our DEALER socket.

When you are satisfied that this works, we're ready to move on. This version is tagged in the repository as v0.3.0 and you can download the tarball if you want to check what the code looked like at this stage.

Note that doing heavy simulation of lots of nodes will probably cause your process to run out of file handles, giving an assertion failure in libzmq. I raised the per-process limit to 30,000 by running (on my Linux box):

ulimit -n 30000

Tracing Activity

To debug the kinds of problems we saw here, we need extensive logging. There's a lot happening in parallel, but every problem can be traced down to a specific exchange between two nodes, consisting of a set of events that happen in strict sequence. We know how to make very sophisticated logging, but as usual it's wiser to make just what we need and no more. We have to capture:

  • Time and date for each event.
  • In which node the event occurred.
  • The peer node, if any.
  • What the event was (e.g., which command arrived).
  • Event data, if any.

The very simplest technique is to print the necessary information to the console, with a timestamp. That's the approach I used. Then it's simple to find the nodes affected by a failure, filter the log file for only messages referring to them, and see exactly what happened.
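For example, CZMQ's zclock_log prefixes each line with a date/time stamp, so one call per event is enough; the UUIDs and the event name below are made-up sample values:

#include <czmq.h>

int main (void)
{
    // One line per event: observing node, event, and peer
    zclock_log ("I: node %s: got %s from %s",
        "418E98D4B7184844", "JOIN", "B7D5E0EE56910852");
    return 0;
}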

Dealing with Blocked Peers

In any performance-sensitive ZeroMQ architecture, you need to solve the problem of flow control. You cannot simply send unlimited messages to a socket and hope for the best. At the one extreme, you can exhaust memory. This is a classic failure pattern for a message broker: one slow client stops receiving messages; the broker starts to queue them, and eventually exhausts memory and the whole process dies. At the other extreme, the socket drops messages, or blocks, as you hit the high-water mark.

With Zyre we want to distribute messages to a set of peers, and we want to do this fairly. Using a single ROUTER socket for output would be problematic because any one blocked peer would block outgoing traffic to all peers. TCP does have good algorithms for spreading the network capacity across a set of connections. And we're using a separate DEALER socket to talk to each peer, so in theory each DEALER socket will send its queued messages in the background reasonably fairly.

The normal behavior of a DEALER socket that hits its high-water mark is to block. This is usually ideal, but it's a problem for us here. Our current interface design uses one thread that distributes messages to all peers. If one of those send calls were to block, all output would block.

There are a few options to avoid blocking. One is to use zmq_poll() on the whole set of DEALER sockets, and only write to sockets that are ready. I don't like this for a couple of reasons. First, the DEALER socket is hidden inside the peer class, and it is cleaner to allow each class to handle this opaquely. Second, what do we do with messages we can't yet deliver to a DEALER socket? Where do we queue them? Third, it seems to be side-stepping the issue. If a peer is really so busy it can't read its messages, something is wrong. Most likely, it's dead.

So no polling for output. The second option is to use one thread per peer. I quite like the idea of this because it fits into the ZeroMQ design pattern of "do one thing in one thread". But this is going to create a lot of threads (square of the number of nodes we start) in the simulation, and we're already running out of file handles.

A third option is to use a nonblocking send. This is nicer and it's the solution I choose. We can then provide each peer with a reasonable outgoing queue (the HWM) and if that gets full, treat it as a fatal error on that peer. This will work for smaller messages. If we're sending large chunks (e.g., for content distribution), we'll need a credit-based flow control on top.

Therefore the first step is to prove to ourselves that we can turn the normal blocking DEALER socket into a nonblocking socket. This example creates a normal DEALER socket, connects it to some endpoint (so that there's an outgoing pipe and the socket will accept messages), sets the high-water mark to four, and then sets the send timeout to zero:
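The example itself isn't reprinted here, so here is a minimal reconstruction under stated assumptions: the endpoint, the loop count, and the exact CZMQ calls are my choices and the repository version may differ in detail:

// Shows that a DEALER socket with a zero send timeout returns EAGAIN
// instead of blocking once its high-water mark is reached
#include <czmq.h>
#include <errno.h>
#include <string.h>

int main (void)
{
    zctx_t *ctx = zctx_new ();
    void *dealer = zsocket_new (ctx, ZMQ_DEALER);

    // Queue at most four outgoing messages, and never block on send
    zsocket_set_sndhwm (dealer, 4);
    zsocket_set_sndtimeo (dealer, 0);

    // Connecting creates the outgoing pipe, so sends are accepted even
    // though nothing is listening at this placeholder endpoint
    zsocket_connect (dealer, "tcp://localhost:9876");

    int count;
    for (count = 0; count < 10; count++) {
        char string [32];
        snprintf (string, sizeof (string), "message %d", count);
        printf ("Sending message %d\n", count);
        if (zstr_send (dealer, string) == -1) {
            printf ("%s\n", strerror (errno));
            break;
        }
    }
    zctx_destroy (&ctx);
    return 0;
}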



When we run this, we send four messages successfully (they go nowhere, the socket just queues them), and then we get a nice EAGAIN error:

Sending message 0
Sending message 1
Sending message 2
Sending message 3
Sending message 4
Resource temporarily unavailable

The next step is to decide what a reasonable high-water mark would be for a peer. Zyre is meant for human interactions; that is, applications that chat at a low frequency, such as two games or a shared drawing program. I'd expect a hundred messages per second to be quite a lot. Our "peer is really dead" timeout is 10 seconds. So a high-water mark of 1,000 seems fair.

Rather than set a fixed HWM or use the default (which randomly also happens to be 1,000), we calculate it as 100 * the timeout. Here's how we configure a new DEALER socket for a peer:

// Create new outgoing socket (drop any messages in transit)
self->mailbox = zsocket_new (self->ctx, ZMQ_DEALER);

// Set our caller "From" identity so that receiving node knows
// who each message came from
zsocket_set_identity (self->mailbox, reply_to);

// Set a high-water mark that allows for reasonable activity
zsocket_set_sndhwm (self->mailbox, PEER_EXPIRED * 100);

// Send messages immediately or return EAGAIN
zsocket_set_sndtimeo (self->mailbox, 0);

// Connect through to peer node
zsocket_connect (self->mailbox, "tcp://%s", endpoint);

And finally, what do we do when we get an EAGAIN on a peer? We don't need to go through all the work of destroying the peer because the interface will do this automatically if it doesn't get any message from the peer within the expiration timeout. Just dropping the last message seems very weak; it will give the receiving peer gaps.

I'd prefer a more brutal response. Brutal is good because it forces the design to a "good" or "bad" decision rather than a fuzzy "should work but to be honest there are a lot of edge cases so let's worry about it later". Destroy the socket, disconnect the peer, and stop sending anything to it. The peer will eventually have to reconnect and re-initialize any state. It's kind of an assertion that 100 messages a second is enough for anyone. So, in the zre_peer_send method:

int
zre_peer_send (zre_peer_t *self, zre_msg_t **msg_p)
{
    assert (self);
    if (self->connected) {
        if (zre_msg_send (msg_p, self->mailbox) && errno == EAGAIN) {
            zre_peer_disconnect (self);
            return -1;
        }
    }
    return 0;
}

Where the disconnect method looks like this:

void
zre_peer_disconnect (zre_peer_t *self)
{
    // If connected, destroy socket and drop all pending messages
    assert (self);
    if (self->connected) {
        zsocket_destroy (self->ctx, self->mailbox);
        free (self->endpoint);
        self->endpoint = NULL;
        self->connected = false;
    }
}

Distributed Logging and Monitoring


Let's look at logging and monitoring. If you've ever managed a real server (like a web server), you know how vital(至关重要的) it is to have a capture(捕获) of what is going on. There are a long list of reasons, not least:

  • To measure the performance of the system over time.
  • To see what kinds of work are done the most, to optimize(最优化) performance.
  • To track errors and how often they occur.
  • To do postmortems(验尸) of failures.
  • To provide an audit(审计) trail(小径) in case of dispute(辩论).

Let's scope(范围) this in terms of the problems we think we'll have to solve:

  • We want to track key events (such as nodes leaving and rejoining the network).
  • For each event, we want to track a consistent set of data: the date/time, node that observed the event, peer that created the event, type of event itself, and other event data.
  • We want to be able to switch logging on and off at any time.
  • We want to be able to process log data mechanically because it will be sizable.
  • We want to be able to monitor a running system; that is, collect logs and analyze in real time.
  • We want log traffic to have minimal effect on the network.
  • We want to be able to collect log data at a single point on the network.

As in any design, some of these requirements are hostile to each other. For example, collecting log data in real time means sending it over the network, which will affect network traffic to some extent. However, as in any design, these requirements are also hypothetical until we have running code, so we can't take them too seriously. We'll aim for plausibly good enough and improve over time.

A Plausible Minimal Implementation
topprevnext

Arguably, just dumping log data to disk is one solution, and it's what most mobile applications do (using "debug logs"). But most failures require correlation of events from two nodes. This means searching lots of debug logs by hand to find the ones that matter. It's not a very clever approach.

We want to send log data somewhere central, either immediately, or opportunistically (i.e., store and forward). For now, let's focus on immediate logging. My first idea when it comes to sending data is to use Zyre for this. Just send log data to a group called "LOG", and hope someone collects it.

But using Zyre to log Zyre itself is a Catch-22. Who logs the logger? What if we want a verbose log of every message sent? Do we include logging messages in that or not? It quickly gets messy. We want a logging protocol that's independent of Zyre's main ZRE protocol. The simplest approach is a pub-sub protocol, where all nodes publish log data on a PUB socket and a collector picks that up via a SUB socket.

Figure 69 - Distributed Log Collection

fig69.png

The collector can, of course, run on any node. This gives us a nice range of use cases:

  • A passive log collector that stores log data on disk for eventual statistical analysis; this would be a PC with sufficient hard disk space for weeks or months of log data.
  • A collector that stores log data into a database where it can be used in real time by other applications. This might be overkill for a small workgroup, but would be snazzy for tracking the performance of larger groups. The collector could collect log data over WiFi and then forward it over Ethernet to a database somewhere.
  • A live meter application that joined the Zyre network and then collected log data from nodes, showing events and statistics in real time.

The next question is how to interconnect the nodes and collector. Which side binds, and which connects? Both ways will work here, but it's marginally better if the PUB sockets connect to the SUB socket. If you recall, ZeroMQ's internal buffers only pop into existence when there are connections. It means as soon as a node connects to the collector, it can start sending log data without loss.
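In plain libzmq terms (and leaving out error handling), the wiring looks roughly like this. It's only a sketch: the endpoint and the sample log line are placeholders, and in practice the collector and the node run in separate processes:

// Collector: bind a SUB socket and accept all log traffic
void *ctx = zmq_ctx_new ();
void *collector = zmq_socket (ctx, ZMQ_SUB);
zmq_setsockopt (collector, ZMQ_SUBSCRIBE, "", 0);
zmq_bind (collector, "tcp://*:9992");

// Node: connect a PUB socket to the collector and publish log data;
// the outgoing pipe exists as soon as we call connect, so queued
// log data isn't silently dropped
void *logger = zmq_socket (ctx, ZMQ_PUB);
zmq_connect (logger, "tcp://192.168.1.122:9992");
const char *event = "node 12AB: peer 34CD joined group GLOBAL";
zmq_send (logger, event, strlen (event), 0);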

How do we tell nodes what endpoint to connect to? We may have any number of collectors on the network, and they'll be using arbitrary network addresses and ports. We need some kind of service announcement mechanism, and here we can use Zyre to do the work for us. We could use group messaging, but it seems neater to build service discovery into the ZRE protocol itself. It's nothing complex: if a node provides a service X, it can tell other nodes about that when it sends them a HELLO command.

We'll extend the HELLO command with a headers field that holds a set of name=value pairs. Let's define that the header X-ZRELOG specifies the collector endpoint (the SUB socket). A node that acts as a collector can add a header like this (for example):

X-ZRELOG=tcp://192.168.1.122:9992

When another node sees this header, it simply connects its PUB socket to that endpoint. Log data now gets distributed to all collectors (zero or more) on the network.
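Here's a rough sketch of both sides. The method for setting an interface header is illustrative (check the v0.4.0 tarball for the real name), and self->log_pub stands in for the node's logging PUB socket:

// Collector node: announce our SUB endpoint in the HELLO headers
// (zre_interface_set_header is an assumed name)
zre_interface_set_header (interface, "X-ZRELOG", "tcp://%s:%d", host, port);

// Every other node, while processing a peer's HELLO command:
char *collector = zre_msg_headers_string (msg, "X-ZRELOG", NULL);
if (collector)
    //  Connect our PUB socket; log data now flows to the collector
    zsocket_connect (self->log_pub, "%s", collector);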

Making this first version was fairly simple and took half a day. Here are the pieces we had to make or change:

  • We made a new class zre_log that accepts log data and manages the connection to the collector, if any.
  • We added some basic management for peer headers, taken from the HELLO command.
  • When a peer has the X-ZRELOG header, we connect to the endpoint it specifies.
  • Where we were logging to stdout, we switched to logging via the zre_log class.
  • We extended the interface API with a method that lets the application set headers.
  • We wrote a simple logger application that manages the SUB socket and sets the X-ZRELOG header.
  • We send our own headers when we send a HELLO command.

This version is tagged in the Zyre repository as v0.4.0 and you can download the tarball if you want to see what the code looked like at this stage.

At this stage, the log message is just a string. We'll make more professionally structured log data in a little while.

First, a note on dynamic ports. In the zre_tester app that we use for testing, we create and destroy interfaces aggressively. One consequence is that a new interface can easily reuse a port that was just freed by another application. If there's a ZeroMQ socket somewhere trying to connect to this port, the results can be hilarious.

Here's the scenario I had, which caused a few minutes' confusion. The logger was running on a dynamic port:

  • Start logger application
  • Start tester application
  • Stop logger
  • Tester receives invalid message (and asserts as designed)

The tester created a new interface that reused the dynamic port freed by the (just stopped) logger, and suddenly that interface began to receive log data from nodes on its mailbox. We saw a similar situation before, where a new interface could reuse the port freed by an old interface and start getting old data.

The lesson is, if you use dynamic ports, be prepared to receive random data from ill-informed applications that are reconnecting to you. Switching to a static port stopped the misbehaving connection. That's not a full solution though. There are two more weaknesses:

  • As I write this, libzmq doesn't check socket types when connecting. The ZMTP/2.0 protocol does announce each peer's socket type, so this check is doable.
  • The ZRE protocol has no fail-fast (assertion) mechanism; we need to read and parse a whole message before realizing that it's invalid.

Let's address the second one. Socket pair validation wouldn't solve this fully anyway.

Protocol Assertions
topprevnext

As Wikipedia puts it, "Fail-fast systems are usually designed to stop normal operation rather than attempt to continue a possibly flawed process." A protocol like HTTP has a fail-fast mechanism in that the first four bytes that a client sends to an HTTP server must be "HTTP". If they're not, the server can close the connection without reading anything more.

Our ROUTER socket is not connection-oriented, so there's no way to "close the connection" when we get bad incoming messages. However, we can throw out the entire message if it's not valid. The problem is going to be worse when we use ephemeral ports, but it applies broadly to all protocols.

So let's define a protocol assertion as being a unique signature that we place at the start of each message and which identifies the intended protocol. When we read a message, we check the signature and if it's not what we expect, we discard the message silently. A good signature should be hard to confuse with regular data and give us enough space for a number of protocols.

I'm going to use a 16-bit signature consisting of a 12-bit pattern and a 4-bit protocol ID. The pattern %xAAA is meant to stay away from values we might otherwise expect to see at the start of a message: %x00, %xFF, and printable characters.

Figure 70 - Protocol Signature

fig70.png

As our protocol codec is generated, it's relatively easy to add this assertion. The logic is:

  • Get first frame of message.
  • Check if first two bytes are %xAAA with expected 4-bit signature.
  • If so, continue to parse rest of message.
  • If not, skip all "more" frames, get first frame, and repeat.
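Here's a minimal sketch of that check using czmq's zframe API; input is whatever socket we're reading, and the 4-bit ID of 1 matches the main ZRE protocol's %xAA %xA1 signature:

zframe_t *frame = zframe_recv (input);
while (frame) {
    byte *data = zframe_data (frame);
    if (zframe_size (frame) >= 2
    &&  data [0] == 0xAA
    &&  data [1] == (0xA0 | 1))
        break;                      //  Valid signature; parse the rest
    //  Invalid signature: discard the remaining frames of this message
    while (zsocket_rcvmore (input)) {
        zframe_destroy (&frame);
        frame = zframe_recv (input);
    }
    zframe_destroy (&frame);
    frame = zframe_recv (input);    //  Try the next message
}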

To test this, I switched the logger back to using an ephemeral port. The interface now properly detects and discards any messages that don't have a valid signature. If the message has a valid signature and is still wrong, that's a proper bug.

Binary Logging Protocol
topprevnext

Now that we have the logging framework working properly, let's look at the protocol itself. Sending strings around the network is simple, but when it comes to WiFi we really cannot afford to waste bandwidth. We have the tools to work with efficient binary protocols, so let's design one for logging.

This is going to be a pub-sub protocol and in ZeroMQ v3.x we do publisher-side filtering. This means we can do multi-level logging (errors, warnings, information) if we put the logging level at the start of the message. So our message starts with a protocol signature (two bytes), a logging level (one byte), and an event type (one byte).
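For example, a monitor that only wants errors can subscribe to just that prefix. This is a sketch: the exact prefix bytes assume the logging protocol's signature is %xAA %xA2 (protocol ID 2, as in the specification below) and that the level byte follows it directly:

// Subscribe only to error-level log messages; in ZeroMQ v3.x the
// prefix matching happens on the publisher, so lower-priority
// traffic never crosses the network at all
void *sub = zmq_socket (ctx, ZMQ_SUB);
zmq_setsockopt (sub, ZMQ_SUBSCRIBE, "\xAA\xA2\x01", 3);
zmq_bind (sub, "tcp://*:9992");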

In the first version, we send UUID strings to identify each node. As text, these are 32 characters each. We can send binary UUIDs, but it's still verbose and wasteful. We don't care about the node identifiers in the log files. All we need is some way to correlate events. So what's the shortest identifier we can use that's going to be unique enough for logging? I say "unique enough" because while we really want zero chance of duplicate UUIDs in the live code, log files are not so critical.

The simplest plausible answer is to hash the IP address and port into a 2-byte value. We'll get some collisions, but they'll be rare. How rare? As a quick sanity check, I write a small program that generates a bunch of addresses and hashes them into 16-bit values, looking for collisions. To be sure, I generate 10,000 addresses across a small number of IP addresses (matching a simulation setup), and then across a large number of addresses (matching a real-life setup). The hashing algorithm is a modified Bernstein:

uint16_t hash = 0;
while (*endpoint)
    hash = 33 * hash ^ *endpoint++;

I don't get any collisions over several runs, so this will work as an identifier for the log data. This adds four bytes (two for the node recording the event, and two for its peer in events that come from a peer).

Next, we want to store the date and time of the event. The POSIX time_t type was previously 32 bits, but because this overflows in 2038, it's a 64-bit value. We'll use this; there's no need for millisecond resolution in a log file: events are sequential, clocks are unlikely to be that tightly synchronized, and network latencies mean that precise times aren't that meaningful.

We're up to 16 bytes, which is decent. Finally, we want to allow some additional data, formatted as text and depending on the type of event. Putting this all together gives the following message specification:

<class
    name = "zre_log_msg"
    script = "codec_c.gsl"
    signature = "2"
>
This is the ZRE logging protocol - raw version.
<include filename = "license.xml" />

<!-- Protocol constants -->
<define name = "VERSION" value = "1" />

<define name = "LEVEL_ERROR" value = "1" />
<define name = "LEVEL_WARNING" value = "2" />
<define name = "LEVEL_INFO" value = "3" />

<define name = "EVENT_JOIN" value = "1" />
<define name = "EVENT_LEAVE" value = "2" />
<define name = "EVENT_ENTER" value = "3" />
<define name = "EVENT_EXIT" value = "4" />

<message name = "LOG" id = "1">
    <field name = "level" type = "number" size = "1" />
    <field name = "event" type = "number" size = "1" />
    <field name = "node" type = "number" size = "2" />
    <field name = "peer" type = "number" size = "2" />
    <field name = "time" type = "number" size = "8" />
    <field name = "data" type = "string" />
Log an event
</message>

</class>

This generates 800 lines of perfect binary codec (the zre_log_msg class). The codec does protocol assertions just like the main ZRE protocol does. Code generation has a fairly steep starting curve, but it makes it so much easier to push your designs past "amateur" into "professional".
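To give a feel for the generated API, here's roughly how a node would emit one log event with the zre_log_msg class. The constant and method names follow the same conventions as the generated zre_msg class, so treat them as indicative rather than exact:

// Build and publish a single LOG message (names are indicative)
zre_log_msg_t *msg = zre_log_msg_new (ZRE_LOG_MSG_LOG);
zre_log_msg_set_level (msg, ZRE_LOG_MSG_LEVEL_INFO);
zre_log_msg_set_event (msg, ZRE_LOG_MSG_EVENT_JOIN);
zre_log_msg_set_node  (msg, node_hash);             //  2-byte node ID
zre_log_msg_set_peer  (msg, peer_hash);             //  2-byte peer ID
zre_log_msg_set_time  (msg, (uint64_t) time (NULL));
zre_log_msg_set_data  (msg, "joined group GLOBAL");
zre_log_msg_send (&msg, publisher);                 //  Destroys msg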

Content Distribution

topprevnext

We now have a robust framework for creating groups of nodes, letting them chat to each other, and monitoring the resulting network. Next step is to allow them to distribute content as files.

As usual, we'll aim for the very simplest plausible solution and then improve that step-by-step. At the very least we want the following:

  • An application can tell the Zyre API, "Publish this file", and provide the path to a file that exists somewhere in the file system.
  • Zyre will distribute that file to all peers, both those that are on the network at that time, and those that arrive later.
  • Each time an interface receives a file it tells its application, "Here is this file".

We might eventually want more discrimination, e.g., publishing to specific groups. We can add that later if it's needed. In Chapter 7 - Advanced Architecture using ZeroMQ we developed a file distribution system (FileMQ) designed to be plugged into ZeroMQ applications. So let's use that.

Each node is going to be a file publisher and a file subscriber. We bind the publisher to an ephemeral port (if we use the standard FileMQ port 5670, we can't run multiple interfaces on one box), and we broadcast the publisher's endpoint in the HELLO message, as we did for the log collector. This lets us interconnect all nodes so that all subscribers talk to all publishers.

We need to ensure that each node has its own directory for sending and receiving files (the outbox and the inbox). Again, it's so we can run multiple nodes on one box. Because we already have a unique ID per node, we just use that in the directory name.

Here's how we set up the FileMQ API when we create a new interface:

sprintf (self->fmq_outbox, ".outbox/%s", self->identity);
mkdir (self->fmq_outbox, 0775);

sprintf (self->fmq_inbox, ".inbox/%s", self->identity);
mkdir (self->fmq_inbox, 0775);

self->fmq_server = fmq_server_new ();
self->fmq_service = fmq_server_bind (self->fmq_server, "tcp://*:*");
fmq_server_publish (self->fmq_server, self->fmq_outbox, "/");
fmq_server_set_anonymous (self->fmq_server, true);
char publisher [32];
sprintf (publisher, "tcp://%s:%d", self->host, self->fmq_service);
zhash_update (self->headers, "X-FILEMQ", strdup (publisher));

// Client will connect as it discovers new nodes
self->fmq_client = fmq_client_new ();
fmq_client_set_inbox (self->fmq_client, self->fmq_inbox);
fmq_client_set_resync (self->fmq_client, true);
fmq_client_subscribe (self->fmq_client, "/");

And when we process a HELLO command, we check for the X-FILEMQ header field:

// If peer is a FileMQ publisher, connect to it
char *publisher = zre_msg_headers_string (msg, "X-FILEMQ", NULL);
if (publisher)
    fmq_client_connect (self->fmq_client, publisher);

The last thing is to expose content distribution in the Zyre API. We need two things:

  • A way for the application to say, "Publish this file"
  • A way for the interface to tell the application, "We received this file".

In theory, the application can publish a file just by creating a symbolic link in the outbox directory, but as we're using a hidden outbox, this is a little difficult. So we add an API method publish:

// Publish file into virtual space
void
zre_interface_publish (zre_interface_t *self,
                       char *filename, char *external)
{
    zstr_sendm (self->pipe, "PUBLISH");
    zstr_sendm (self->pipe, filename);  //  Real file name
    zstr_send  (self->pipe, external);  //  Location in virtual space
}

The API passes this to the interface thread, which creates the file in the outbox directory so that the FileMQ server will pick it up and broadcast it. We could literally copy file data into this directory, but because FileMQ supports symbolic links, we use that instead. The file has a ".ln" extension and contains one line, which contains the actual pathname.
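As a sketch, the interface thread might create that stub like this; the names self->fmq_outbox, filename, and external come from the publish method above, and the exact layout of the ".ln" file is FileMQ's concern:

// Write a ".ln" stub into the outbox so FileMQ publishes the real file
char linkname [256];
snprintf (linkname, sizeof (linkname), "%s/%s.ln", self->fmq_outbox, external);
FILE *stub = fopen (linkname, "w");
if (stub) {
    fprintf (stub, "%s\n", filename);   //  One line: the actual pathname
    fclose (stub);
}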

Finally, how do we notify the recipient that a file has arrived? The FileMQ fmq_client API has a message, "DELIVER", for this, so all we have to do in zre_interface is grab this message from the fmq_client API and pass it on to our own API:

zmsg_t *msg = fmq_client_recv (fmq_client_handle (self->fmq_client));
zmsg_send (&msg, self->pipe);

This is complex code that does a lot at once. But we're only at around 10K lines of code for FileMQ and Zyre together. The most complex Zyre class, zre_interface, is 800 lines of code. This is compact. Message-based applications do keep their shape if you're careful to organize them properly.

Writing the Unprotocol

topprevnext

We have all the pieces for a formal protocol specification and it's time to put the protocol on paper. There are two reasons for this. First, to make sure that any other implementations talk to each other properly. Second, because I want to get an official port for the UDP discovery protocol and that means doing the paperwork.

Like all the other unprotocols we developed in this book, the protocol lives on the ZeroMQ RFC site. The core of the protocol specification is the ABNF grammar for the commands and fields:

zre-protocol    = greeting *traffic

greeting        = S:HELLO
traffic         = S:WHISPER
                / S:SHOUT
                / S:JOIN
                / S:LEAVE
                / S:PING R:PING-OK

;   Greet a peer so it can connect back to us
S:HELLO         = header %x01 ipaddress mailbox groups status headers
header          = signature sequence
signature       = %xAA %xA1
sequence        = 2OCTET        ; Incremental sequence number
ipaddress       = string        ; Sender IP address
string          = size *VCHAR
size            = OCTET
mailbox         = 2OCTET        ; Sender mailbox port number
groups          = strings       ; List of groups sender is in
strings         = size *string
status          = OCTET         ; Sender group status sequence
headers         = dictionary    ; Sender header properties
dictionary      = size *key-value
key-value       = string        ; Formatted as name=value

; Send a message to a peer
S:WHISPER       = header %x02 content
content         = FRAME         ; Message content as ZeroMQ frame

; Send a message to a group
S:SHOUT         = header %x03 group content
group           = string        ; Name of group
content         = FRAME         ; Message content as ZeroMQ frame

; Join a group
S:JOIN          = header %x04 group status
status          = OCTET         ; Sender group status sequence

; Leave a group
S:LEAVE         = header %x05 group status

; Ping a peer that has gone silent
S:PING          = header %x06

; Reply to a peer's ping
R:PING-OK       = header %x07

Example Zyre Application

topprevnext

Let's now make a minimal example that uses Zyre to broadcast files around a distributed network. This example consists of two programs:

  • A listener that joins the Zyre network and reports whenever it receives a file.
  • A sender that joins a Zyre network and broadcasts exactly one file.

The listener is quite short:

#include <zre.h>

int main (int argc, char *argv [])
{
    zre_interface_t *interface = zre_interface_new ();
    while (true) {
        zmsg_t *incoming = zre_interface_recv (interface);
        if (!incoming)
            break;
        zmsg_dump (incoming);
        zmsg_destroy (&incoming);
    }
    zre_interface_destroy (&interface);
    return 0;
}

And the sender isn't much longer:

#include <zre.h>

int main (int argc, char *argv [])
{
    if (argc < 3) {
        puts ("Syntax: sender filename virtualname");
        return 0;
    }
    printf ("Publishing %s as %s\n", argv [1], argv [2]);
    zre_interface_t *interface = zre_interface_new ();
    zre_interface_publish (interface, argv [1], argv [2]);
    while (true) {
        zmsg_t *incoming = zre_interface_recv (interface);
        if (!incoming)
            break;
        zmsg_dump (incoming);
        zmsg_destroy (&incoming);
    }
    zre_interface_destroy (&interface);
    return 0;
}

Conclusions

topprevnext

Building applications for unstable decentralized networks is one of the end games for ZeroMQ. As the cost of computing falls every year, such networks become more and more common, be it consumer electronics or virtual boxes in the cloud. In this chapter, we've pulled together many of the techniques from the book to build Zyre, a framework for proximity computing over a local network. Zyre isn't unique; there are and have been many attempts to open this area for applications: ZeroConf, SLP, SSDP, UPnP, DDS. But these all seem to end up too complex or otherwise too difficult for application developers to build on.

Zyre isn't finished. Like many of the projects in this book, it's an ice breaker for others. There are some major unfinished areas, which we may address in later editions of this book or versions of the software.

  • High-level APIs: the message-based API that Zyre offers now is usable but still rather more complex than I'd like for average developers. If there's one target we absolutely cannot miss, it's raw simplicity. This means we should build high-level APIs, in lots of languages, which hide all the messaging, and which come down to simple methods like start, join/leave group, get message, publish file, stop.
  • Security: how do we build a fully decentralized security system? We might be able to leverage public key infrastructure for some work, but that requires that nodes have their own Internet access, which isn't guaranteed. The answer is, as far as we can tell, to use any existing secure peer-to-peer link (TLS, Bluetooth, perhaps NFC) to exchange a session key and use a symmetric cipher. Symmetric ciphers have their advantages and disadvantages.
  • Nomadic content: how do I, as a user, manage my content across multiple devices? The Zyre + FileMQ combination might help, for local network use, but I'd like to be able to do this across the Internet as well. Are there cloud services I could use? Is there something I could make using ZeroMQ?
  • Federation: how do we scale a local-area distributed application across the globe? One plausible answer is federation, which means creating clusters of clusters. If 100 nodes can join together to create a local cluster, then perhaps 100 clusters can join together to create a wide-area cluster. The challenges are then quite similar: discovery, presence, and group messaging.


Postface

topprevnext

Tales from Out There

topprevnext

I asked some of the contributors to this book to tell us what they were doing with ZeroMQ. Here are their stories.

Rob Gagnon's Story
topprevnext

"We use ZeroMQ to assist(参加) in aggregating(聚集) thousands of events occurring every minute across our global network of telecommunications(通讯) servers so that we can accurately(精确地) report and monitor for situations that require our attention. ZeroMQ made the development of the system not only easier, but faster to develop and more robust(强健的) and fault-tolerant(容错的) than we had originally planned in our original design.

"We're able to easily add and remove clients from the network without the loss of any message. If we need to enhance(提高) the server portion(部分) of our system, we can stop and restart(重新启动) it as well without having to worry about stopping all of the clients first. The built-in(嵌入的) buffering(缓冲作用) of ZeroMQ makes this all possible."

Tom van Leeuwen's Story
topprevnext

"I was looking at creating some kind of service bus connecting all kinds of services together. There were already some products that implemented(实施) a broker, but they did not have the functionality(功能) I needed. By accident, I stumbled(踌躇) upon ZeroMQ, which is awesome(可怕的). It's very lightweight(轻量级选手), lean(瘦的), simple and easy to follow because the guide is very complete and reads very well. I've actually implemented the Titanic pattern and the Majordomo broker with some additions (client/worker authentication(证明) and workers sending a catalog explaining what they provide and how they should be addressed).

"The beautiful thing about ZeroMQ is the fact that it is a library and not an application. You can mold(霉菌) it however you like and it simply puts boring things like queuing, reconnecting(使再接合), TCP sockets(插座) and such to the background, making sure you can concentrate(集中) on what is important to you. I've implemented(实施) all kinds of workers/clients and the broker in Ruby, because that is the main language we use for development, but also some PHP clients to connect to the bus from existing PHP webapps. We use this service bus for cloud services, connecting all kinds of platform devices(装置) to a service bus exposing functionality(功能) for automation(自动化).

"ZeroMQ is very easy to understand and if you spend a day with the guide, you'll have good knowledge of how it works. I'm a network engineer, not a software developer, but managed to create a very nice solution(解决方案) for our automation needs! ZeroMQ: Thank you very much!"

Michael Jakl's Story
topprevnext

"We use ZeroMQ for distributing(分配的) millions of documents per day in our distributed processing pipeline(管道). We started out with big message queuing brokers that had their own respective(分别的) issues and problems. In the quest(追求) of simplifying(简约) our architecture(建筑学), we chose ZeroMQ to do the wiring. So far it had a huge impact(影响) in how our architecture scales(天平) and how easy it is to change and move the components(成分). The plethora(过多) of language bindings(结合) lets us choose the right tool for the job without sacrificing interoperability(互操作性) in our system. We don't use a lot of sockets (less than 10 in our whole application), but that's all we needed to split a huge monolithic(整体的) application into small independent parts.

"All in all, ZeroMQ lets me keep my sanity(明智) and helps my customers stay within budget(预算)."

Vadim Shalts's Story
topprevnext

"I am team leader in the company ActForex, which develops software for financial(金融的) markets. Due to the nature of our domain, we need to process large volumes(量) of prices quickly. In addition, it's extremely critical(鉴定的) to minimize(使减到最少) latency(潜伏) in processing orders and prices. Achieving a high throughput(生产量) is not enough. Everything must be handled in a soft real time with a predictable(可预言的) ultra(极端的) low latency per price. The system consists of multiple components exchanging messages. Each price can take a lot of processing stages, each of which increases total latency. As a consequence(结果), low and predictable latency of messaging between components becomes a key factor(因素) of our architecture.

"We investigated(调查) different solutions to find something suitable for our needs. We tried different message brokers (RabbitMQ, ActiveMQ Apollo, Kafka), but failed to reach a low and predictable latency with any of them. In the end, we chose ZeroMQ used in conjunction(结合) with ZooKeeper for service discovery. Complex(复杂的) coordination(协调) with ZeroMQ requires a relatively large effort and a good understanding, as a result of the natural complexity(复杂) of multithreading. We found that an external(外部的) agent like ZooKeeper is better choice for service discovery and coordination while ZeroMQ can be used primarily for simple messaging. ZeroMQ fit perfectly into our architecture. It allowed us to achieve the desired latency using minimal(最低的) efforts. It saved us from a bottleneck(瓶颈) in the processing of messages and made processing time very stable(稳定的) and predictable.

"I can decidedly recommend ZeroMQ for solutions where low latency is important."

How This Book Happened

topprevnext

When I set out to write a ZeroMQ book, we were still debating the pros and cons of forks and pull requests in the ZeroMQ community. Today, for what it's worth, this argument seems settled: the "liberal" policy that we adopted for libzmq in early 2012 broke our dependency on a single prime author, and opened the floor to dozens of new contributors. More profoundly, it allowed us to move to a gently organic evolutionary model that was very different from the older forced-march model.

The reason I was confident this would work was that our work on the Guide had, for a year or more, shown the way. True, the text is my own work, which is perhaps as it should be. Writing is not programming. When we write, we tell a story and one doesn't want different voices telling one tale; it feels strange.

For me the real long-term value of the book is the repository of examples: about 65,000 lines of code in 24 different languages. It's partly about making ZeroMQ accessible to more people. People already refer to the Python and PHP example repositories (two of the most complete) when they want to tell others how to learn ZeroMQ. But it's also about learning programming languages.

Here's a loop of code in Tcl:

while {1} {
    # Process all parts of the message
    zmq message message
    frontend recv_msg message
    set more [frontend getsockopt RCVMORE]
    backend send_msg message [expr {$more?"SNDMORE":""}]
    message close
    if {!$more} {
        break ; # Last message part
    }
}

And here's the same loop in Lua:

while true do
    -- Process all parts of the message
    local msg = frontend:recv()
    if (frontend:getopt(zmq.RCVMORE) == 1) then
        backend:send(msg, zmq.SNDMORE)
    else
        backend:send(msg, 0)
        break;      -- Last message part
    end
end

And this particular example (rrbroker) exists in C#, C++, CL, Clojure, Erlang, F#, Go, Haskell, Haxe, Java, Lua, Node.js, Perl, PHP, Python, Ruby, Scala, Tcl, and of course C. This code base, all provided as open source under the MIT/X11 license, may form the basis for other books or projects.

But what this collection of translations says most profoundly is this: the language you choose is a detail, even a distraction. The power of ZeroMQ lies in the patterns it gives you and lets you build, and these transcend the comings and goings of languages. My goal as a software and social architect is to build structures that can last generations. There seems no point in aiming for mere decades.

Removing Friction

topprevnext

I'll explain the technical tool chain we used in terms of the friction we removed. In this book we're telling a story and the goal is to reach as many people as possible, as cheaply and smoothly as we can.

The core idea was to host the text and examples on GitHub and make it easy for anyone to contribute. It turned out to be more complex than that, however.

Let's start with the division of labor. I'm a good writer and can produce endless amounts of decent text quickly. But what was impossible for me was to provide the examples in other languages. Because the core ZeroMQ API is in C, it seemed logical to write the original examples in C. Also, C is a neutral choice; it's perhaps the only language that doesn't create strong emotions.

How to encourage people to make translations of the examples? We tried a few approaches and finally what worked best was to offer a "choose your language" link on every single example in the text, which took people either to the translation or to a page explaining how they could contribute. The way it usually works is that as people learn ZeroMQ in their preferred language, they contribute a handful of translations or fixes to the existing ones.

At the same time, I noticed a few people quite determinedly translating every single example. This was mainly binding authors who realized that the examples were a great way to encourage people to use their bindings. For their efforts, I extended the scripts to produce language-specific versions of the book. Instead of including the C code, we'd include the Python, or PHP code. Lua and Haxe also got their dedicated versions.

Once we have an idea of who works on what, we know how to structure the work itself. It's clear that to write and test an example, what you want to work on is source code. So we import this source code when we build the book, and that's how we make language-specific versions.

I like to write in a plain text format. It's fast and works well with source control systems like git. Because the main platform for our websites is Wikidot, I write using Wikidot's very readable markup format.

At least in the first chapters, it was important to draw pictures to explain the flow of messages between peers. Making diagrams by hand is a lot of work, and when we want to get final output in different formats, image conversion becomes a chore. I started with Ditaa, which turns text diagrams into PNGs, then later switched to asciitosvg, which produces SVG files, which are rather better. Since the figures are text diagrams, embedded in the prose, it's remarkably easy to work with them.

By now you'll realize that the toolchain we use is highly customized, though it uses a lot of external tools. All are available on Ubuntu, which is a mercy, and the whole custom toolchain is in the zguide repository in the bin subdirectory.

Let's walk through the editing and publishing process. Here is how we produce the online version:

bin/buildguide

Which works as follows:

  • The original text sits in a series of text files (one per chapter).
  • The examples sit in the examples subdirectory, classified per language.
  • We take the text and process this using a custom Perl script, mkwikidot, into a set of Wikidot-ready files.
  • We do this for each of the languages that get their own version.
  • We extract the graphics and call asciitosvg and rasterize on each one to produce image files, which we store in the images subdirectory.
  • We extract inline listings (which are not translated) and store these in the listings subdirectory.
  • We use pygmentize on each example and listing to create a marked-up page in Wikidot format.
  • We upload all changed files to the online wiki using the Wikidot API.

Doing this from scratch takes a while. So we store the SHA1 signatures of every image, listing, example, and text file, and only process and upload changes, and that makes it easy to publish a new version of the text when people make new contributions.

To produce the PDF and Epub formats, we do the following:

bin/buildpdfs

Which works as follows:

  • We use the custom mkdocbook Perl program on the input files to produce a DocBook output.
  • We push the DocBook format through docbook2ps and ps2pdf to create clean PDFs in each language.
  • We push the DocBook format through db2epub to create Epub books in each language.
  • We upload the PDFs to the public wiki using the Wikidot API.

When creating a community project, it's important to lower the "change latency", which is the time it takes for people to see their work live or, at least, to see that you've accepted their pull request. If that is more than a day or two, you've often lost your contributor's interest.

Licensing

topprevnext

I want people to reuse this text in their own work: in presentations, articles, and even other books. However, the deal is that if they remix my work, others can remix theirs. I'd like credit, and have no argument against others making money from their remixes. Thus, the text is licensed under cc-by-sa.

For the examples, we started with GPL, but it rapidly became clear this wasn't workable. The point of examples is to give people reusable code fragments so they will use ZeroMQ more widely, and if these are GPL, that won't happen. We switched to MIT/X11, even for the larger and more complex examples that conceivably would work as LGPL.

However, when we started turning the examples into standalone projects (as with Majordomo), we used the LGPL. Again, remixability trumps dissemination. Licenses are tools; use them with intent, not ideology.

