Storm Guide

Tutorial

In this tutorial, you'll learn how to create Storm topologies and deploy them to a Storm cluster. Java will be the main language used, but a few examples will use Python to illustrate Storm's multi-language capabilities.

Preliminaries

This tutorial uses examples from the storm-starter project. It's recommended that you clone the project and follow along with the examples. Read Setting up a development environment and Creating a new Storm project to get your machine set up.

Components of a Storm cluster

A Storm cluster is superficially similar to a Hadoop cluster. Whereas on Hadoop you run "MapReduce jobs", on Storm you run "topologies". "Jobs" and "topologies" themselves are very different: one key difference is that a MapReduce job eventually finishes, whereas a topology processes messages forever (or until you kill it).

There are two kinds of nodes on a Storm cluster: the master node and the worker nodes. The master node runs a daemon called "Nimbus" that is similar to Hadoop's "JobTracker". Nimbus is responsible for distributing code around the cluster, assigning tasks to machines, and monitoring for failures.

Each worker node runs a daemon called the "Supervisor". The supervisor listens for work assigned to its machine and starts and stops worker processes as necessary based on what Nimbus has assigned to it. Each worker process executes a subset of a topology; a running topology consists of many worker processes spread across many machines.

All coordination between Nimbus and the Supervisors is done through a Zookeeper cluster. Additionally, the Nimbus daemon and Supervisor daemons are fail-fast and stateless; all state is kept in Zookeeper or on local disk. This means you can kill -9 Nimbus or the Supervisors and they'll start back up like nothing happened. This design makes Storm clusters very stable.

Topologies

To do realtime computation on Storm, you create what are called "topologies". A topology is a graph of computation. Each node in a topology contains processing logic, and links between nodes indicate how data should be passed around between nodes.

Running a topology is straightforward. First, you package all your code and dependencies into a single jar. Then, you run a command like the following:

storm jar all-my-code.jar backtype.storm.MyTopology arg1 arg2

This runs the class backtype.storm.MyTopology with the arguments arg1 and arg2. The main function of the class defines the topology and submits it to Nimbus. The storm jar part takes care of connecting to Nimbus and uploading the jar.

Since topology definitions are just Thrift structs, and Nimbus is a Thrift service, you can create and submit topologies using any programming language. The above example is the easiest way to do it from a JVM-based language. See Running topologies on a production cluster for more information on starting and stopping topologies.

Streams

The core abstraction in Storm is the "stream". A stream is an unbounded sequence of tuples. Storm provides the primitives for transforming a stream into a new stream in a distributed and reliable way. For example, you may transform a stream of tweets into a stream of trending topics.

The basic primitives Storm provides for doing stream transformations are "spouts" and "bolts". Spouts and bolts have interfaces that you implement to run your application-specific logic.

A spout is a source of streams. For example, a spout may read tuples off of a Kestrel queue and emit them as a stream. Or a spout may connect to the Twitter API and emit a stream of tweets.

A bolt consumes any number of input streams, does some processing, and possibly emits new streams. Complex stream transformations, like computing a stream of trending topics from a stream of tweets, require multiple steps and thus multiple bolts. Bolts can do many things: run functions, filter tuples, perform streaming aggregations and streaming joins, talk to databases, and more.

Networks of spouts and bolts are packaged into a "topology", which is the top-level abstraction that you submit to Storm clusters for execution. A topology is a graph of stream transformations where each node is a spout or bolt. Edges in the graph indicate which bolts are subscribing to which streams. When a spout or bolt emits a tuple to a stream, it sends the tuple to every bolt that subscribed to that stream.

Links between nodes in your topology indicate how tuples should be passed around. For example, if there is a link between Spout A and Bolt B, a link from Spout A to Bolt C, and a link from Bolt B to Bolt C, then every time Spout A emits a tuple, it will send the tuple to both Bolt B and Bolt C. All of Bolt B's output tuples will go to Bolt C as well.
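The Spout A, Bolt B, Bolt C example can be traced with a small plain-Java model. This is only a sketch: `LinkSketch`, `link`, and `emit` are hypothetical names and not part of the Storm API, and the model assumes each bolt simply re-emits what it receives.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Plain-Java model of stream subscriptions; hypothetical, not Storm API.
final class LinkSketch {
    // component id -> ids of the bolts subscribed to its output stream
    private final Map<String, List<String>> subscribers = new LinkedHashMap<>();
    // delivery log with entries such as "A->B:hello"
    final List<String> deliveries = new ArrayList<>();

    void link(String from, String to) {
        subscribers.computeIfAbsent(from, k -> new ArrayList<>()).add(to);
    }

    // Deliver a value to every subscriber of `from`. In this toy model each
    // receiving bolt re-emits the value unchanged to its own subscribers.
    void emit(String from, String value) {
        for (String to : subscribers.getOrDefault(from, Collections.<String>emptyList())) {
            deliveries.add(from + "->" + to + ":" + value);
            emit(to, value);
        }
    }

    // The links described in the text: A->B, A->C, B->C.
    static LinkSketch example() {
        LinkSketch topology = new LinkSketch();
        topology.link("A", "B");
        topology.link("A", "C");
        topology.link("B", "C");
        return topology;
    }

    public static void main(String[] args) {
        LinkSketch topology = example();
        topology.emit("A", "hello");
        topology.deliveries.forEach(System.out::println);
    }
}
```

In this model Bolt C receives the value twice, once directly from Spout A and once through Bolt B, which matches the description above.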

 

Each node in a Storm topology executes in parallel. In your topology, you can specify how much parallelism you want for each node, and then Storm will spawn that number of threads across the cluster to do the execution.

A topology runs forever, or until you kill it. Storm will automatically reassign any failed tasks. Additionally, Storm guarantees that there will be no data loss, even if machines go down and messages are dropped.

Data model

Storm uses tuples as its data model. A tuple is a named list of values, and a field in a tuple can be an object of any type. Out of the box, Storm supports all the primitive types, strings, and byte arrays as tuple field values. To use an object of another type, you just need to implement a serializer for the type.

Every node in a topology must declare the output fields for the tuples it emits. For example, this bolt declares that it emits 2-tuples with the fields "double" and "triple":

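// Package names are those of Storm 0.9.x (backtype.storm);
// newer releases use org.apache.storm instead.
import backtype.storm.task.OutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseRichBolt;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Tuple;
import backtype.storm.tuple.Values;
import java.util.Map;

public class DoubleAndTripleBolt extends BaseRichBolt {
    private OutputCollector _collector;

    @Override
    public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
        // Keep the collector so execute() can emit and ack.
        _collector = collector;
    }

    @Override
    public void execute(Tuple input) {
        int val = input.getInteger(0);
        // Emit a 2-tuple anchored to the input, then acknowledge the input.
        _collector.emit(input, new Values(val * 2, val * 3));
        _collector.ack(input);
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("double", "triple"));
    }
}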

The declareOutputFields function declares the output fields ["double", "triple"] for the component. The rest of the bolt will be explained in the upcoming sections.
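To make "a named list of values" concrete, here is a minimal plain-Java sketch of a tuple as declared field names plus positional values. `NamedTupleSketch` and its methods are hypothetical stand-ins for illustration, not Storm's `Tuple` class.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for a tuple: declared field names plus positional values.
final class NamedTupleSketch {
    private final List<String> fields;
    private final List<Object> values;

    NamedTupleSketch(List<String> fields, List<Object> values) {
        if (fields.size() != values.size()) {
            throw new IllegalArgumentException("one value per declared field is required");
        }
        this.fields = fields;
        this.values = values;
    }

    // Positional access, comparable to reading field 0 of an input tuple.
    Object getByIndex(int index) {
        return values.get(index);
    }

    // Named access: look up the position of the declared field first.
    Object getByField(String name) {
        int index = fields.indexOf(name);
        if (index < 0) {
            throw new IllegalArgumentException("unknown field: " + name);
        }
        return values.get(index);
    }

    // Builds the 2-tuple that a bolt declaring ["double", "triple"] would emit for val.
    static NamedTupleSketch doubleAndTriple(int val) {
        return new NamedTupleSketch(
                Arrays.asList("double", "triple"),
                Arrays.<Object>asList(val * 2, val * 3));
    }

    public static void main(String[] args) {
        NamedTupleSketch tuple = doubleAndTriple(5);
        System.out.println(tuple.getByField("double") + ", " + tuple.getByField("triple"));
    }
}
```

The declared field names are what let a downstream component refer to a value by name rather than by position.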

A simple topology

Let's take a look at a simple topology to explore the concepts further and see how the code shapes up. Here is the ExclamationTopology definition from storm-starter:

TopologyBuilder builder = new TopologyBuilder();
// Spout "words" runs with a parallelism of 10.
builder.setSpout("words", new TestWordSpout(), 10);
// "exclaim1" reads from "words"; "exclaim2" reads from "exclaim1".
builder.setBolt("exclaim1", new ExclamationBolt(), 3)
        .shuffleGrouping("words");
builder.setBolt("exclaim2", new ExclamationBolt(), 2)
        .shuffleGrouping("exclaim1");

This topology contains a spout and two bolts. The spout emits words, and each bolt appends the string "!!!" to its input. The nodes are arranged in a line: the spout emits to the first bolt, which then emits to the second bolt. If the spout emits the tuples ["bob"] and ["john"], then the second bolt will emit the words ["bob!!!!!!"] and ["john!!!!!!"].
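The effect of chaining the two bolts can be checked without Storm by treating each bolt as a function on strings. This is a plain-Java sketch; `ExclaimChainSketch` is a hypothetical name and models only the per-tuple transformation, not parallelism or grouping.

```java
import java.util.function.Function;

// Models the spout -> exclaim1 -> exclaim2 line as function composition.
final class ExclaimChainSketch {
    // The per-tuple transformation each exclamation bolt applies.
    static final Function<String, String> EXCLAIM = word -> word + "!!!";

    // Pass one word through both bolts in order.
    static String throughBothBolts(String word) {
        return EXCLAIM.andThen(EXCLAIM).apply(word);
    }

    public static void main(String[] args) {
        System.out.println(throughBothBolts("bob"));   // prints bob!!!!!!
        System.out.println(throughBothBolts("john"));  // prints john!!!!!!
    }
}
```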

 

This code defines the nodes using the setSpout and setBolt methods. These methods take as input a user-specified id, an object containing the processing logic, and the amount of parallelism you want for the node. In this example, the spout is given id "words" and the bolts are given ids "exclaim1" and "exclaim2".

The object containing the processing logic implements the IRichSpout interface for spouts and the IRichBolt interface for bolts.

The last parameter, how much parallelism you want for the node, is optional. It indicates how many threads should execute that component across the cluster. If you omit it, Storm will only allocate one thread for that node.

setBolt returns an InputDeclarer object that is used to define the inputs to the Bolt. Here, component "exclaim1" declares that it wants to read all the tuples emitted by component "words" using a shuffle grouping, and component "exclaim2" declares that it wants to read all the tuples emitted by component "exclaim1" using a shuffle grouping. "Shuffle grouping" means that tuples should be randomly distributed from the input tasks to the bolt's tasks. There are many ways to group data between components. These will be explained in a few sections.

If you wanted component "exclaim2" to read all the tuples emitted by both component "words" and component "exclaim1", you would write component "exclaim2"'s definition like this:

builder.setBolt("exclaim2", new ExclamationBolt(), 5)
        .shuffleGrouping("words")
        .shuffleGrouping("exclaim1");

As you can see, input declarations can be chained to specify multiple sources for the Bolt.

Let's dig into the implementations of the spouts and bolts in this topology. Spouts are responsible for emitting new messages into the topology. TestWordSpout in this topology emits a random word from the list ["nathan", "mike", "jackson", "golda", "bertels"] as a 1-tuple every 100ms. The implementation of nextTuple() in TestWordSpout looks like this:

public void nextTuple() {
    // Throttle the spout to roughly one tuple every 100 ms.
    Utils.sleep(100);
    final String[] words = new String[] {"nathan", "mike", "jackson", "golda", "bertels"};
    final Random rand = new Random();
    // Pick one word at random and emit it as a 1-tuple.
    final String word = words[rand.nextInt(words.length)];
    _collector.emit(new Values(word));
}

As you can see, the implementation is very straightforward.

ExclamationBolt appends the string "!!!" to its input. Let's take a look at the full implementation for ExclamationBolt:

public static class ExclamationBolt implements IRichBolt {
    OutputCollector _collector;

    @Override
    public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
        _collector = collector;
    }

    @Override
    public void execute(Tuple tuple) {
        // Emit the input word with "!!!" appended, anchored to the input tuple.
        _collector.emit(tuple, new Values(tuple.getString(0) + "!!!"));
        _collector.ack(tuple);
    }

    @Override
    public void cleanup() {
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("word"));
    }

    @Override
    public Map getComponentConfiguration() {
        return null;
    }
}

The prepare method provides the bolt with an OutputCollector that is used for emitting tuples from this bolt. Tuples can be emitted at any time from the bolt: in the prepare, execute, or cleanup methods, or even asynchronously in another thread. This prepare implementation simply saves the OutputCollector as an instance variable to be used later on in the execute method.

The execute method receives a tuple from one of the bolt's inputs. The ExclamationBolt grabs the first field from the tuple and emits a new tuple with the string "!!!" appended to it. If you implement a bolt that subscribes to multiple input sources, you can find out which component the Tuple came from by using the Tuple#getSourceComponent method.

There are a few other things going on in the execute method, namely that the input tuple is passed as the first argument to emit and the input tuple is acked on the final line. These are part of Storm's reliability API for guaranteeing no data loss and will be explained later in this tutorial.

The cleanup method is called when a Bolt is being shut down and should clean up any resources that were opened. There's no guarantee that this method will be called on the cluster: for example, if the machine the task is running on blows up, there's no way to invoke the method. The cleanup method is intended for when you run topologies in local mode (where a Storm cluster is simulated in process), and you want to be able to run and kill many topologies without suffering any resource leaks.

The declareOutputFields method declares that the ExclamationBolt emits 1-tuples with one field called "word".

The getComponentConfiguration method allows you to configure various aspects of how this component runs. This is a more advanced topic that is explained further on the Configuration page.

Methods like cleanup and getComponentConfiguration are often not needed in a bolt implementation. You can define bolts more succinctly by using a base class that provides default implementations where appropriate. ExclamationBolt can be written more succinctly by extending BaseRichBolt, like so:

public static class ExclamationBolt extends BaseRichBolt {
    OutputCollector _collector;

    @Override
    public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
        _collector = collector;
    }

    @Override
    public void execute(Tuple tuple) {
        _collector.emit(tuple, new Values(tuple.getString(0) + "!!!"));
        _collector.ack(tuple);
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("word"));
    }
}

Running ExclamationTopology in local mode

Let's see how to run the ExclamationTopology in local mode and see that it's working.

Storm has two modes of operation: local mode and distributed mode. In local mode, Storm executes completely in process by simulating worker nodes with threads. Local mode is useful for testing and development of topologies. When you run the topologies in storm-starter, they'll run in local mode and you'll be able to see what messages each component is emitting. You can read more about running topologies in local mode on the Local mode page.

In distributed mode, Storm operates as a cluster of machines. When you submit a topology to the master, you also submit all the code necessary to run the topology. The master will take care of distributing your code and allocating workers to run your topology. If workers go down, the master will reassign them somewhere else. You can read more about running topologies on a cluster on Running topologies on a production cluster.

Here's the code that runs ExclamationTopology in local mode:

Config conf = new Config();
conf.setDebug(true);
conf.setNumWorkers(2);

LocalCluster cluster = new LocalCluster();
cluster.submitTopology("test", conf, builder.createTopology());
// Let the topology run for 10 seconds, then stop it and the in-process cluster.
Utils.sleep(10000);
cluster.killTopology("test");
cluster.shutdown();

First, the code defines an in-process cluster by creating a LocalCluster object. Submitting topologies to this virtual cluster is identical to submitting topologies to distributed clusters. It submits a topology to the LocalCluster by calling submitTopology, which takes as arguments a name for the running topology, a configuration for the topology, and then the topology itself.

The name is used to identify the topology so that you can kill it later on. A topology will run indefinitely until you kill it.

The configuration is used to tune various aspects of the running topology. The two configurations specified here are very common:

TOPOLOGY_WORKERS (set with setNumWorkers) specifies how many processes you want allocated around the cluster to execute the topology. Each component in the topology will execute as many threads. The number of threads allocated to a given component is configured through the setBolt and setSpout methods. Those threads exist within worker processes. Each worker process contains within it some number of threads for some number of components. For instance, you may have 300 threads specified across all your components and 50 worker processes specified in your config. Each worker process will execute 6 threads, each of which could belong to a different component. You tune the performance of Storm topologies by tweaking the parallelism for each component and the number of worker processes those threads should run within.
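The 300 threads over 50 workers arithmetic can be written down as a small helper. This is a sketch under the assumption that threads are spread as evenly as possible across workers; `WorkerMathSketch` is a hypothetical name, and the placement Storm's scheduler actually chooses may differ.

```java
// Back-of-the-envelope arithmetic for threads per worker process.
final class WorkerMathSketch {
    // Sum the parallelism passed to setSpout/setBolt for every component.
    static int totalThreads(int... parallelismHints) {
        int sum = 0;
        for (int hint : parallelismHints) {
            sum += hint;
        }
        return sum;
    }

    // Threads on the busiest worker, assuming an even spread (ceiling division).
    static int maxThreadsPerWorker(int totalThreads, int numWorkers) {
        if (numWorkers <= 0) {
            throw new IllegalArgumentException("numWorkers must be positive");
        }
        return (totalThreads + numWorkers - 1) / numWorkers;
    }

    public static void main(String[] args) {
        // The example from the text: 300 threads on 50 workers.
        System.out.println(maxThreadsPerWorker(300, 50));
        // ExclamationTopology: parallelism 10 + 3 + 2 on 2 workers.
        System.out.println(maxThreadsPerWorker(totalThreads(10, 3, 2), 2));
    }
}
```

For ExclamationTopology with setNumWorkers(2), the same reasoning gives 15 threads in total, so at most 8 on one worker under the even-spread assumption.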

TOPOLOGY_DEBUG (set with setDebug), when set to true, tells Storm to log every message emitted by a component. This is useful in local mode when testing topologies, but you probably want to keep this turned off when running topologies on the cluster.

There are many other configurations you can set for the topology. The various configurations are detailed in the Javadoc for Config.

To learn about how to set up your development environment so that you can run topologies in local mode (such as in Eclipse), see Creating a new Storm project.

Stream groupings

A stream grouping tells a topology how to send tuples between two components. Remember, spouts and bolts execute in parallel as many tasks across the cluster. If you look at how a topology is executing at the task level, it looks something like this:

[Figure not included in this copy: a topology drawn at the task level.]

When a task for Bolt A emits a tuple to Bolt B, which task should it send the tuple to?

A "stream grouping" answers this question by telling Storm how to send tuples between sets of tasks. Before we dig into the different kinds of stream groupings, let's take a look at another topology from storm-starter. This WordCountTopology reads sentences from a spout and streams out of the WordCount bolt, for each word, the total number of times it has seen that word so far:

TopologyBuilder builder = new TopologyBuilder();

builder.setSpout("sentences", new RandomSentenceSpout(), 5);
builder.setBolt("split", new SplitSentence(), 8)
        .shuffleGrouping("sentences");
// Group by the "word" field so equal words reach the same WordCount task.
builder.setBolt("count", new WordCount(), 12)
        .fieldsGrouping("split", new Fields("word"));

SplitSentence emits a tuple for each word in each sentence it receives, and WordCount keeps a map in memory from word to count. Each time WordCount receives a word, it updates its state and emits the new word count.
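The state WordCount keeps can be sketched in plain Java as a map from word to count. `WordCountSketch` and `receive` are hypothetical names; the sketch shows only the in-memory update performed by a single task, not the bolt wiring or the emit call.

```java
import java.util.HashMap;
import java.util.Map;

// In-memory word -> count state, as held by one task of a counting bolt.
final class WordCountSketch {
    private final Map<String, Integer> counts = new HashMap<>();

    // Update the state for one incoming word and return the new count,
    // which a real bolt would emit together with the word.
    int receive(String word) {
        return counts.merge(word, 1, Integer::sum);
    }

    public static void main(String[] args) {
        WordCountSketch counter = new WordCountSketch();
        for (String word : "the cow jumped over the moon".split(" ")) {
            System.out.println(word + " " + counter.receive(word));
        }
    }
}
```

Because the map lives inside one task, the counts are only correct if every occurrence of a word reaches that same task.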

There are a few different kinds of stream groupings.

The simplest kind of grouping is called a "shuffle grouping", which sends the tuple to a random task. A shuffle grouping is used in the WordCountTopology to send tuples from RandomSentenceSpout to the SplitSentence bolt. It has the effect of evenly distributing the work of processing the tuples across all of the SplitSentence bolt's tasks.
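A rough plain-Java model of a shuffle grouping picks the receiving task at random for every tuple. `ShuffleGroupingSketch` is a hypothetical name, the fixed seed exists only to make the demo repeatable, and Storm's own implementation may balance tuples differently.

```java
import java.util.Random;

// Plain-Java model of a shuffle grouping; not Storm's implementation.
final class ShuffleGroupingSketch {
    private final Random random;
    private final int numTasks;

    ShuffleGroupingSketch(int numTasks, long seed) {
        if (numTasks <= 0) {
            throw new IllegalArgumentException("numTasks must be positive");
        }
        this.numTasks = numTasks;
        this.random = new Random(seed); // seeded only so the demo is repeatable
    }

    // Pick the index of the receiving task uniformly at random.
    int chooseTask() {
        return random.nextInt(numTasks);
    }

    // Route the given number of tuples and report how many each task received.
    int[] distribute(int tuples) {
        int[] perTask = new int[numTasks];
        for (int i = 0; i < tuples; i++) {
            perTask[chooseTask()]++;
        }
        return perTask;
    }

    public static void main(String[] args) {
        // 8 tasks, matching the parallelism of the "split" bolt above.
        int[] perTask = new ShuffleGroupingSketch(8, 42L).distribute(8000);
        for (int task = 0; task < perTask.length; task++) {
            System.out.println("task " + task + ": " + perTask[task]);
        }
    }
}
```

With many tuples, each task ends up with close to the same share, which is the even distribution of work the text describes.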

A more interesting kind of grouping is the "fields grouping". A fields grouping is used between the SplitSentence bolt and the WordCount bolt. It is critical for the functioning of the WordCount bolt that the same word always go to the same task. Otherwise, more than one task will see the same word, and they'll each emit incorrect values for the count since each has incomplete information. A fields grouping lets you group a stream by a subset of its fields. This causes equal values for that subset of fields to go to the same task. Since WordCount subscribes to SplitSentence's output stream using a fields grouping on the "word" field, the same word always goes to the same task and the bolt produces the correct output.

Fields groupings are the basis of implementing streaming joins and streaming aggregations as well as a plethora of other use cases. Under the hood, fields groupings are implemented using mod hashing.
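Mod hashing can be sketched in a few lines: hash the grouping field and take the remainder modulo the number of tasks. `FieldsGroupingSketch` is a hypothetical name, and using `hashCode()` of the field value is an assumption made for illustration; the exact hash Storm applies is not stated here.

```java
// Mod hashing: equal field values always map to the same task index.
final class FieldsGroupingSketch {
    static int taskFor(Object fieldValue, int numTasks) {
        if (numTasks <= 0) {
            throw new IllegalArgumentException("numTasks must be positive");
        }
        // floorMod keeps the result in [0, numTasks) even for negative hash codes.
        return Math.floorMod(fieldValue.hashCode(), numTasks);
    }

    public static void main(String[] args) {
        // 12 tasks, matching the parallelism of the "count" bolt above.
        for (String word : new String[] {"storm", "bolt", "storm"}) {
            System.out.println(word + " -> task " + taskFor(word, 12));
        }
    }
}
```

Both occurrences of "storm" print the same task index, which is the property the WordCount bolt relies on.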

There are a few other kinds of stream groupings. You can read more about them on the Concepts page.

Defining Bolts in other languages

Bolts can be defined in any language. Bolts written in another language are executed as subprocesses, and Storm communicates with those subprocesses with JSON messages over stdin/stdout. The communication protocol just requires an adapter library of about 100 lines, and Storm ships with adapter libraries for Ruby, Python, and Fancy.

Here's the definition of the SplitSentence bolt from WordCountTopology:

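public static class SplitSentence extends ShellBolt implements IRichBolt {
    public SplitSentence() {
        // Run the bolt logic as the subprocess "python splitsentence.py".
        super("python", "splitsentence.py");
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("word"));
    }

    // IRichBolt also requires this method; no per-component configuration is set.
    @Override
    public Map<String, Object> getComponentConfiguration() {
        return null;
    }
}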

SplitSentence extends ShellBolt and declares that it runs using python with the argument splitsentence.py. Here's the implementation of splitsentence.py:

import storm

class SplitSentenceBolt(storm.BasicBolt):
    def process(self, tup):
        # Split the sentence on single spaces and emit one tuple per word.
        words = tup.values[0].split(" ")
        for word in words:
            storm.emit([word])

SplitSentenceBolt().run()

For more information on writing spouts and bolts in other languages, and to learn about how to create topologies in other languages (and avoid the JVM completely), see Using non-JVM languages with Storm.

Guaranteeing message processing

Earlier on in this tutorial, we skipped over a few aspects of how tuples are emitted. Those aspects were part of Storm's reliability API: how Storm guarantees that every message coming off a spout will be fully processed. See Guaranteeing message processing for information on how this works and what you have to do as a user to take advantage of Storm's reliability capabilities.

Transactional topologies

Storm guarantees that every message will be played through the topology at least once. A common question asked is "how do you do things like counting on top of Storm? Won't you overcount?" Storm has a feature called transactional topologies that lets you achieve exactly-once messaging semantics for most computations. See the transactional topologies documentation for more.
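The overcounting concern, and one common way around it, can be illustrated in plain Java: store the id of the last applied batch next to the count and ignore a replay that carries the same id. This is a sketch of the general idea only. `IdempotentCounterSketch` is a hypothetical name, and the sketch assumes batches arrive in order with unique ids and that the count and id are stored together atomically; it is not the transactional topologies API.

```java
// Exactly-once counting on top of at-least-once delivery, in miniature.
final class IdempotentCounterSketch {
    private long count = 0;
    private long lastTxId = -1; // id of the last batch that was applied

    // Apply a batch of `delta` occurrences; a replay with the same id is ignored.
    void apply(long txId, long delta) {
        if (txId == lastTxId) {
            return; // already counted, so the replay must not be added again
        }
        count += delta;
        lastTxId = txId;
    }

    long count() {
        return count;
    }

    public static void main(String[] args) {
        IdempotentCounterSketch counter = new IdempotentCounterSketch();
        counter.apply(1, 3);
        counter.apply(1, 3); // replay of batch 1 after a failure
        counter.apply(2, 2);
        System.out.println(counter.count()); // prints 5
    }
}
```

Without the id check, the replayed batch would be added a second time and the count would be 8 instead of 5.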

Distributed RPC

This tutorial showed how to do basic stream processing on top of Storm. There are many more things you can do with Storm's primitives. One of the most interesting applications of Storm is Distributed RPC, where you parallelize the computation of intense functions on the fly. See the Distributed RPC documentation for more.

Conclusion

This tutorial gave a broad overview of developing, testing, and deploying Storm topologies. The rest of the documentation dives deeper into all the aspects of using Storm.



Copyright © 2015 Apache Software Foundation. All Rights Reserved.
Apache Storm, Apache, the Apache feather logo, and the Apache Storm project logos are trademarks of The Apache Software Foundation.
All other marks mentioned may be trademarks or registered trademarks of their respective owners.

 
