Apache Storm is awesome. This is why (and how) you should be using it.

by Usama Ashraf

Continuous data streams are ubiquitous and are becoming even more so with the increasing number of IoT devices being used. Of course, this means huge volumes of data are stored, processed, and analyzed to provide predictive, actionable results.

But petabytes of data take a long time to analyze, even with tools such as Hadoop (as good as MapReduce may be) or Spark (a remedy to the limitations of MapReduce).

Often, we don’t need to deduce patterns over long periods of time. Of the petabytes of incoming data collected over months, at any given moment, we might not need to take into account all of it, just a real-time snapshot. Perhaps we don’t need to know the longest trending hashtag over five years, but just the one right now.

This is what Apache Storm is built for, to accept tons of data coming in extremely fast, possibly from various sources, analyze it, and publish real-time updates to a UI or some other place… without storing any actual data.

This article is not the ultimate guide to Apache Storm, nor is it meant to be. Storm’s pretty huge, and just one long-read probably can’t do it justice anyways. Of course, any additions, feedback or constructive criticism will be greatly appreciated.

OK, now that that’s out of the way, let’s see what we’ll be covering:

  • The necessity of Storm, the ‘why’ of it, what it is and what it isn’t

  • A bird’s eye view of how it works.

  • What a Storm topology roughly looks like in code (Java)

  • Setting up and playing with a production-worthy Storm cluster on Docker.

  • A few words on message processing reliability.

I’m also assuming that you’re at least somewhat familiar with Docker and containerization.

How It Works

The architecture of Apache Storm can be compared to a network of roads connecting a set of checkpoints. Traffic begins at a certain checkpoint (called a spout) and passes through other checkpoints (called bolts).

The traffic is of course the stream of data that is retrieved by the spout (from a data source, a public API for example) and routed to various bolts where the data is filtered, sanitized, aggregated, analyzed, and sent to a UI for people to view (or to any other target).

The network of spouts and bolts is called a topology, and the data flows in the form of tuples (list of values that may have different types).

One important thing to talk about is the direction of the data traffic.

Conventionally, we would have one or multiple spouts reading the data from an API, a queuing system, and so on. The data would then flow one-way to one or multiple bolts which may forward it to other bolts and so on. Bolts may publish the analyzed data to a UI or to another bolt.

But the traffic is almost always unidirectional, like a directed acyclic graph (DAG). Although it is certainly possible to make cycles, we’re unlikely to need such a convoluted topology.

Installing a Storm release involves a number of steps, which you’re free to follow on your machine. But later on I’ll be using Docker containers for a Storm cluster deployment, and the images will take care of setting up everything we need.

Some code

While Storm does offer support for other languages, most topologies are written in Java, since it’s the most efficient option we have.

A very basic spout that just emits random digits may look like this:

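A minimal sketch, assuming the RandomDigitSpout name used later in the article and the org.apache.storm packages (the original embedded gist may differ slightly):

// A minimal sketch of a spout that emits one random digit roughly every 50 ms.
import java.util.Map;
import java.util.Random;
import org.apache.storm.spout.SpoutOutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichSpout;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Values;
import org.apache.storm.utils.Utils;

public class RandomDigitSpout extends BaseRichSpout {
    private SpoutOutputCollector collector;
    private Random random;

    @Override
    public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
        this.collector = collector;
        this.random = new Random();
    }

    @Override
    public void nextTuple() {
        Utils.sleep(50);                                  // don't spin in a tight loop
        collector.emit(new Values(random.nextInt(10)));   // emit a digit 0-9, unanchored
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("digit"));            // the single field downstream bolts will read
    }
}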

And a simple bolt that takes in the stream of random digits and emits only the even ones:

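Again a sketch, using BaseBasicBolt for brevity (it handles acking automatically, as noted towards the end of the article); the field names "digit" and "even_digit" are assumptions carried over from the spout above:

import org.apache.storm.topology.BasicOutputCollector;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseBasicBolt;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;

public class EvenDigitBolt extends BaseBasicBolt {
    @Override
    public void execute(Tuple tuple, BasicOutputCollector collector) {
        int digit = tuple.getIntegerByField("digit");  // field declared by RandomDigitSpout
        if (digit % 2 == 0) {
            collector.emit(new Values(digit));         // pass through only the even digits
        }
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("even_digit"));
    }
}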

Another simple bolt that’ll receive the filtered stream from EvenDigitBolt, and just multiply each even digit by 10 and emit it forward:

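A sketch along the same lines, consuming the "even_digit" field emitted above:

import org.apache.storm.topology.BasicOutputCollector;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseBasicBolt;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;

public class MultiplyByTenBolt extends BaseBasicBolt {
    @Override
    public void execute(Tuple tuple, BasicOutputCollector collector) {
        int evenDigit = tuple.getIntegerByField("even_digit");
        collector.emit(new Values(evenDigit * 10));    // 4 -> 40, 6 -> 60, and so on
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("multiplied_digit"));
    }
}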

Putting them together to form our topology:

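A sketch of OurSimpleTopology wiring the three components together. The parallelism hints (1, 2 and 4) and the two workers are assumptions, chosen to match the parallelism discussion that follows:

import org.apache.storm.Config;
import org.apache.storm.StormSubmitter;
import org.apache.storm.topology.TopologyBuilder;

public class OurSimpleTopology {
    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();

        builder.setSpout("random-digit-spout", new RandomDigitSpout());     // parallelism hint defaults to 1
        builder.setBolt("even-digit-bolt", new EvenDigitBolt(), 2)
               .shuffleGrouping("random-digit-spout");
        builder.setBolt("multiply-by-ten-bolt", new MultiplyByTenBolt(), 4)
               .shuffleGrouping("even-digit-bolt");

        Config conf = new Config();
        conf.setNumWorkers(2);                                              // two worker processes

        StormSubmitter.submitTopology("our-simple-topology", conf, builder.createTopology());
    }
}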

Parallelism in Storm topologies

Fully understanding parallelism in Storm can be daunting, at least in my experience. A topology requires at least one process to operate on. Within this process, we can parallelize the execution of our spouts and bolts using threads.

In our example, RandomDigitSpout will launch just one thread, and the data spewed from that thread will be distributed among two threads of the EvenDigitBolt.

But the way this distribution happens, referred to as the stream grouping, can be important. For example, you may have a stream of temperature recordings from two cities, where the tuples emitted by the spout look like this:

// City name, temperature, time of recording
("Atlanta",       94, "2018-05-11 23:14")
("New York City", 75, "2018-05-11 23:15")
("New York City", 76, "2018-05-11 23:16")
("Atlanta",       96, "2018-05-11 23:15")
("New York City", 77, "2018-05-11 23:17")
("Atlanta",       95, "2018-05-11 23:16")
("New York City", 76, "2018-05-11 23:18")

Suppose we’re attaching just one bolt whose job is to calculate the changing average temperature of each city.

If we can reasonably expect that in any given time interval we’ll get roughly an equal number of tuples from both the cities, it would make sense to dedicate two threads to our bolt. We can send the data for Atlanta to one of them and New York to the other.

A fields grouping would serve our purpose, which partitions data among the threads by the value of the field specified in the grouping:

// The tuples with the same city name will go to the same thread.
builder.setBolt("avg-temp-bolt", new AvgTempBolt(), 2)
       .fieldsGrouping("temp-spout", new Fields("city_name"));

And of course, there are other types of groupings as well. For most cases, though, the grouping probably won’t matter much. You can just shuffle the data and throw it among the bolt threads randomly (shuffle grouping).

Now there’s another important component to this: the number of worker processes that our topology will run on.

The total number of threads that we specified will then be equally divided among the worker processes. So, in our example random digit topology, we had one spout thread, two even-digit bolt threads, and four multiply-by-ten bolt threads (giving seven in total).

Each of the two worker processes would be responsible for running two multiply-by-ten bolt threads and one even-digit bolt thread, and one of the processes would also run the single spout thread.

Of course, the two worker processes will have their main threads, which in turn will launch the spout and bolt threads. So all in all we’ll have nine threads. These are collectively called executors.

It’s important to realize that if you set a spout’s parallelism hint to be greater than one (multiple executors), you can end up emitting the same data several times.

Say the spout reads from the public Twitter stream API and uses two executors. That means that the bolts receiving the data from the spout will get the same tweet twice. It is only after the spout emits the tuples that data parallelism comes into play. In other words, the tuples get divided among the bolts according to the specified stream grouping.

Running multiple workers on a single node would be fairly pointless. Later, however, we’ll use a proper, distributed, multi-node cluster and see how workers are divided on different nodes.

Building our topology

Here’s the directory structure I suggest:

yourproject/
    pom.xml
    src/
        jvm/
            packagename/
                RandomDigitSpout.java
                EvenDigitBolt.java
                MultiplyByTenBolt.java
                OurSimpleTopology.java

Maven is commonly used for building Storm topologies, and it requires a pom.xml file (The POM) that defines various configuration details, project dependencies, and so on. Getting into the nitty-gritty of the POM will probably be overkill here.

  • First, we’ll run mvn clean inside yourproject to clear any compiled files we may have, making sure to compile each module from scratch.

  • And then mvn package to compile our code and package it in an executable JAR file, inside a newly created target folder. This might take quite a few minutes the first time, especially if your topology has many dependencies.

  • To submit our topology: storm jar target/packagename-{version number}.jar packagename.OurSimpleTopology

Hopefully by now the gap between concept and code in Storm has been somewhat bridged. However, no serious Storm deployment will be a single topology instance running on one server.

What a Storm cluster looks like

To take full advantage of Storm’s scalability and fault-tolerance, any production-grade topology would be submitted to a cluster of machines.

Storm distributions are installed on the primary node (Nimbus) and all the replica nodes (Supervisors).

The primary node runs the Storm Nimbus daemon and the Storm UI. The replica nodes run the Storm Supervisor daemons. A Zookeeper daemon on a separate node is used for coordination among the primary node and the replica nodes.

Zookeeper, by the way, is only used for cluster management and never any kind of message passing. It’s not like spouts and bolts are sending data to each other through it or anything like that. The Nimbus daemon finds available Supervisors via ZooKeeper, to which the Supervisor daemons register themselves. It also carries out other managerial tasks, some of which will become clear shortly.

The Storm UI is a web interface used to manage the state of our cluster. We’ll get to this later.

Our topology is submitted to the Nimbus daemon on the primary node, and then distributed among the worker processes running on the replica/supervisor nodes. Thanks to Zookeeper, it doesn’t matter how many replica/supervisor nodes you run initially, as you can always seamlessly add more. Storm will automatically integrate them into the cluster.

Whenever we start a Supervisor, it allocates a certain number of worker processes (that we can configure). These can then be used by the submitted topology. So in the image above, there are a total of five allocated workers.

Remember this line: conf.setNumWorkers(5)

This means that the topology will try to use a total of five workers. And since our two Supervisor nodes have a total of five allocated workers, each of the 5 allocated worker processes will run one instance of the topology.

If we had run conf.setNumWorkers(4) then one worker process would have remained idle/unused. If the number of specified workers was six and the total allocated workers were five, then because of the limitation only five actual topology workers would’ve been functional.

Before we set this all up using Docker, there are a few important things to keep in mind regarding fault-tolerance:

  • If any worker on any replica node dies, the Supervisor daemon will have it restarted. If restarting repeatedly fails, the worker will be reassigned to another machine.

  • If an entire replica node dies, its share of the work will be given to another supervisor/replica node.

  • If the Nimbus goes down, the workers will remain unaffected. However, until the Nimbus is restored, workers won’t be reassigned to other replica nodes if, say, their node crashes.

  • The Nimbus and Supervisors are themselves stateless. But with Zookeeper, some state information is stored so that things can begin where they were left off if a node crashes or a daemon dies unexpectedly.

  • Nimbus, Supervisor and Zookeeper daemons are all fail-fast. This means that they themselves are not very tolerant to unexpected errors, and will shut down if they encounter one. For this reason they have to be run under supervision using a watchdog program that monitors them constantly and restarts them automatically if they ever crash. Supervisord is probably the most popular option for this (not to be confused with the Storm Supervisor daemon).

Note: In most Storm clusters, the Nimbus itself is never deployed as a single instance but as a cluster. If this fault-tolerance is not incorporated and our sole Nimbus goes down, we’ll lose the ability to submit new topologies, gracefully kill running topologies, reassign work to other Supervisor nodes if one crashes, and so on.

For simplicity, our illustrative cluster will use a single instance. Similarly, the Zookeeper is very often deployed as a cluster but we’ll use just one.

Dockerizing The Cluster

Launching individual containers and all that goes along with them can be cumbersome, so I prefer to use Docker Compose.

We’ll be going with one Zookeeper node, one Nimbus node, and one Supervisor node initially. They’ll be defined as Compose services, all corresponding to one container each at the beginning.

Later on, I’ll use Compose scaling to add another Supervisor node (container). Here’s the entire code and the project structure:

zookeeper/
    Dockerfile
storm-nimbus/
    Dockerfile
    storm.yaml
    code/
        pom.xml
        src/
            jvm/
                coincident_hashtags/
                    ExclamationTopology.java
storm-supervisor/
    Dockerfile
    storm.yaml
docker-compose.yml

And our docker-compose.yml:

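The exact file lives in the accompanying project; below is a minimal sketch, assuming the directory layout above. No container_name is set, so Compose's default naming applies and the storm-supervisor service can be scaled later:

# A sketch of docker-compose.yml (details may differ from the real file).
version: '2'

services:
  zookeeper:
    build: ./zookeeper
    tty: true

  storm-nimbus:
    build: ./storm-nimbus
    depends_on:
      - zookeeper
    ports:
      - "8080:8080"          # the Storm UI we'll open at localhost:8080
    tty: true

  storm-supervisor:
    build: ./storm-supervisor
    depends_on:
      - zookeeper
      - storm-nimbus
    tty: true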

Feel free to explore the Dockerfiles. They basically just install the dependencies (Java 8, Storm, Maven, Zookeeper) on the relevant containers.

The storm.yaml files override certain default configurations for the Storm installations. The line ADD storm.yaml /conf inside the Nimbus and Supervisor Dockerfiles puts them inside the containers where Storm can read them.

storm-nimbus/storm.yaml:

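Roughly along these lines; the hostnames are the Compose service names, resolved by Docker's internal DNS, and the exact values are assumptions:

# Assumed overrides for the Nimbus container.
storm.zookeeper.servers:
  - "zookeeper"
nimbus.seeds: ["storm-nimbus"]
storm.local.dir: "/var/storm"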

storm-supervisor/storm.yaml:

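Similar, plus the worker slot ports. The four ports are what give each Supervisor its four workers, as we'll see shortly:

# Assumed overrides for the Supervisor containers.
storm.zookeeper.servers:
  - "zookeeper"
nimbus.seeds: ["storm-nimbus"]
storm.local.dir: "/var/storm"
supervisor.slots.ports:
  - 6700
  - 6701
  - 6702
  - 6703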

These options are adequate for our cluster. If you are curious, you can check out all the default configurations here.

Run docker-compose up at the project root.

After all the images have been built and all the services have started, open a new terminal, type docker ps and you'll see something like this:

Starting The Nimbus

Let’s SSH into the Nimbus container using its name:

docker exec -it coincidenthashtagswithapachestorm_storm-nimbus_1 bash

And then start the Nimbus daemon: storm nimbus

Starting The Storm UI

Similarly, open another terminal, SSH into the Nimbus again and launch the UI using storm ui:

Go to localhost:8080 on your browser and you’ll see a nice overview of our cluster:

The ‘Free slots’ in the Cluster Summary indicate how many total workers (on all Supervisor nodes) are available and waiting for a topology to consume them.

‘Used slots’ indicates how many of the total are currently busy with a topology. Since we haven’t launched any Supervisors yet, they’re both zero. We’ll get to Executors and Tasks later. Also, as we can see, no topologies have been submitted yet.

Starting A Supervisor Node

SSH into the one Supervisor container and launch the Supervisor daemon:

docker exec -it coincidenthashtagswithapachestorm_storm-supervisor_1 bash
storm supervisor

Now let’s go refresh our UI:

Note: Any changes in our cluster may take a few seconds to reflect on the UI.

We have a new running Supervisor which comes with four allocated workers. These four workers are the result of specifying four ports in our storm.yaml for the Supervisor node. Of course, they’re all free (four Free slots).

Let’s submit a topology to the Nimbus and put ’em to work.

Submitting A Topology To The Nimbus

SSH into the Nimbus on a new terminal. I’ve written the Dockerfile so that we land on our working (landing) directory /theproject. Inside this is code, where our topology resides.

Our topology is pretty simple. It uses a spout that generates random words and a bolt that just appends three exclamation marks (!!!) to the words. Two of these bolts are added back-to-back, and so at the end of the stream we’ll get words with six exclamation marks. It also specifies that it needs three workers (conf.setNumWorkers(3)).

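The relevant wiring looks roughly like this (a fragment; TestWordSpout ships with Storm, and the component names and parallelism hints are assumptions chosen to be consistent with the executor counts discussed below):

TopologyBuilder builder = new TopologyBuilder();

builder.setSpout("word", new TestWordSpout(), 10);
builder.setBolt("exclaim1", new ExclamationBolt(), 3)
       .shuffleGrouping("word");
builder.setBolt("exclaim2", new ExclamationBolt(), 2)
       .shuffleGrouping("exclaim1");

Config conf = new Config();
conf.setNumWorkers(3);

StormSubmitter.submitTopology("exclamation-topology", conf, builder.createTopology());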

Run these commands:

1. cd code
2. mvn clean
3. mvn package
4. storm jar target/coincident-hashtags-1.2.1.jar coincident_hashtags.ExclamationTopology

After the topology has been submitted successfully, refresh the UI:

As soon as we submitted the topology, the Zookeeper was notified. The Zookeeper in turn notified the Supervisor to download the code from the Nimbus. We now see our topology along with its three occupied workers, leaving just one free.

And ten word spout threads + three exclaim1 bolt threads + two exclaim2 bolt threads + the three main threads from the workers = a total of 18 executors.

And you might’ve noticed something new: tasks.

What are tasks?

Tasks are another concept in Storm’s parallelism. But don’t sweat it, a task is just an instance of a spout or bolt that an executor uses. They are what actually does the processing.

By default, the number of tasks is equal to the number of executors. In rare cases you might need each executor to instantiate more tasks.

// Each of the two executors (threads) of this bolt will instantiate
// two objects of this bolt (total 4 bolt objects/tasks).
builder.setBolt("even-digit-bolt", new EvenDigitBolt(), 2)
       .setNumTasks(4)
       .shuffleGrouping("random-digit-spout");

This is a shortcoming on my part, but I can’t think of a good use case where we’d need multiple tasks per executor.

Maybe if we were adding some parallelism ourselves, like spawning a new thread within the bolt to handle a long running task, then the main executor thread won’t block and will be able to continue processing using the other bolt.

However, this can make our topology hard to understand. If anyone knows of scenarios where the performance gain from multiple tasks outweighs the added complexity, please post a comment.

Anyways, returning from that slight detour, let’s see an overview of our topology. Click on the name under Topology Summary and scroll down to Worker Resources:

We can clearly see the division of our executors (threads) among the three workers. And of course all the three workers are on the same, single Supervisor node we’re running.

Now, let’s say scale out!

Add Another Supervisor

From the project root, let’s add another Supervisor node/container:

docker-compose scale storm-supervisor=2

SSH into the new container:

docker exec -it coincidenthashtagswithapachestorm_storm-supervisor_2 bash

And fire up: storm supervisor

If you refresh the UI you’ll see that we’ve successfully added another Supervisor and four more workers (total of eight workers/slots). To really take advantage of the new Supervisor, let’s increase the topology’s workers.

  • First kill the running one: storm kill exclamation-topology

  • Change this line to: conf.setNumWorkers(6)

  • Change the project version number in your pom.xml. Try using a proper scheme, like semantic versioning. I’ll just stick with 1.2.1.

  • Rebuild the topology: mvn package

  • Resubmit it: storm jar target/coincident-hashtags-1.2.1.jar coincident_hashtags.ExclamationTopology

Reload the UI:

You can now see the new Supervisor and the six busy workers out of a total of eight available ones.

Also important to note is that the six busy ones have been equally divided among the two Supervisors. Again, click the topology name and scroll down.

We see two unique Supervisor IDs, both running on different nodes, and all our executors pretty evenly divided among them. This is great.

But Storm comes with another nifty way of doing so while the topology is running — rebalancing.

On the Nimbus we’d run:

storm rebalance exclamation-topology -n 6

Or to change the number of executors for a particular component:

storm rebalance exclamation-topology -e even-digit-bolt=3

Reliable Message Processing

One question we haven’t tackled is about what happens if a bolt fails to process a tuple.

Storm provides us a mechanism by which the originating spout (specifically, the task) can replay the failed tuple. This processing guarantee doesn’t just happen by itself. It’s a conscious design choice, and does add latency.

Spouts send out tuples to bolts, which emit tuples derived from the input tuples to other bolts and so on. That one original tuple spurs an entire tree of tuples.

If any child tuple, so to speak, of the original one fails, then any remedial steps (rollbacks etc.) may well have to be taken at multiple bolts. That could get pretty hairy, so what Storm does instead is allow the original tuple to be emitted again right from the source (the spout).

Consequently, any operations performed by bolts that are a function of the incoming tuples should be idempotent.

A tuple is considered “fully processed” when every tuple in its tree has been processed, and every tuple has to be explicitly acknowledged by the bolts.

However, that’s not all. There’s another thing to be done explicitly: maintain a link between the original tuple and its child tuples. Storm will then be able to trace the origin of the child tuples and thus be able to replay the original tuple. This is called anchoring. And this has been done in our exclamation bolt:

// ExclamationBolt
// 'tuple' is the original one received from the test word spout.
// It's been anchored to/with the tuple going out.
_collector.emit(tuple, new Values(exclamatedWord.toString()));

// Explicitly acknowledge that the tuple has been processed.
_collector.ack(tuple);

The ack call will result in the ack method on the spout being called, if it has been implemented.

So, say you’re reading the tuple data from some queue and you can only take it off the queue if the tuple has been fully processed. The ack method is where you’d do that.

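For illustration, a hedged sketch of such a spout. The queue itself is stubbed out (pollFromQueue() is hypothetical); the point is the message-ID bookkeeping around emit, ack and fail:

import java.util.Map;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;
import org.apache.storm.spout.SpoutOutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichSpout;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Values;

public class QueueBackedSpout extends BaseRichSpout {
    private SpoutOutputCollector collector;
    private final Map<Object, String> inFlight = new ConcurrentHashMap<>();

    @Override
    public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
        this.collector = collector;
    }

    @Override
    public void nextTuple() {
        String message = pollFromQueue();                 // hypothetical read from the queue
        if (message == null) return;
        String msgId = UUID.randomUUID().toString();
        inFlight.put(msgId, message);
        collector.emit(new Values(message), msgId);       // emitting WITH a message ID enables tracking
    }

    @Override
    public void ack(Object msgId) {
        inFlight.remove(msgId);                           // fully processed: safe to delete from the queue now
    }

    @Override
    public void fail(Object msgId) {
        String message = inFlight.get(msgId);
        if (message != null) {
            collector.emit(new Values(message), msgId);   // replay the original tuple explicitly
        }
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("message"));
    }

    private String pollFromQueue() {
        return null;  // stand-in for a Kafka/RabbitMQ/etc. client call
    }
}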

You can also emit out tuples without anchoring:

_collector.emit(new Values(exclamatedWord.toString()))

and forgo reliability.

A tuple can fail two ways:

  1. A bolt dies and a tuple times out. Or, it times out for some other reason. The timeout is 30 seconds by default and can be changed using config.put(Config.TOPOLOGY_MESSAGE_TIMEOUT_SECS, 60)

  2. The fail method is explicitly called on the tuple in a bolt: _collector.fail(tuple). You may do this in case of an exception.

In both these cases, the fail method on the spout will be called, if it is implemented. And if we want the tuple to be replayed, it would have to be done explicitly in the fail method by calling emit, just like in nextTuple(). When tracking tuples, every one has to be acked or failed. Otherwise, the topology will eventually run out of memory.

It’s also important to know that you have to do all of this yourself when writing custom spouts and bolts. But the Storm core can help. For example, a bolt implementing BaseBasicBolt does acking automatically. Or built-in spouts for popular data sources like Kafka take care of queuing and replay logic after acknowledgment and failure.

Parting Shots

Designing a Storm topology or cluster is always about tweaking the various knobs we have and settling where the result seems optimal.

There are a few things that’ll help in this process, like using a configuration file to read parallelism hints, number of workers, and so on so you don’t have to edit and recompile your code repeatedly.

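For instance, a small fragment (assuming java.util.Properties, a hypothetical topology.properties file, and the builder/conf from the earlier topology sketch; exception handling omitted):

// Read the knobs from a file instead of hard-coding them.
Properties props = new Properties();
props.load(new FileInputStream("topology.properties"));

int numWorkers   = Integer.parseInt(props.getProperty("num.workers", "2"));
int evenBoltHint = Integer.parseInt(props.getProperty("even.bolt.parallelism", "2"));

conf.setNumWorkers(numWorkers);
builder.setBolt("even-digit-bolt", new EvenDigitBolt(), evenBoltHint)
       .shuffleGrouping("random-digit-spout");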

Define your bolts logically, one per indivisible task, and keep them light and efficient. Similarly, your spouts’ nextTuple() methods should be optimized.

Use the Storm UI effectively. By default, it doesn’t show us the complete picture, only 5% of the total tuples emitted. To monitor all of them, use config.setStatsSampleRate(1.0d).

Keep an eye on the Acks and Latency values for individual bolts and topologies via the UI. That’s what you want to look at when tuning the parameters.

Originally published at: https://www.freecodecamp.org/news/apache-storm-is-awesome-this-is-why-you-should-be-using-it-d7c37519a427/
