storm官方文档

转载 2015年07月10日 17:00:37

Streams

The stream is the core abstraction in Storm. A stream is an unbounded sequence of tuples that is processed and created in parallel in a distributed fashion. Streams are defined with a schema that names the fields in the stream's tuples. By default, tuples can contain integers, longs, shorts, bytes, strings, doubles, floats, booleans, and byte arrays. You can also define your own serializers so that custom types can be used natively within tuples.

Every stream is given an id when declared. Since single-stream spouts and bolts are so common, OutputFieldsDeclarer has convenience methods for declaring a single stream without specifying an id. In this case, the stream is given the default id of "default".

Resources:

Spouts

A spout is a source of streams in a topology. Generally spouts will read tuples from an external source and emit them into the topology (e.g. a Kestrel queue or the Twitter API). Spouts can either be reliable or unreliable. A reliable spout is capable of replaying a tuple if it failed to be processed by Storm, whereas an unreliable spout forgets about the tuple as soon as it is emitted.

Spouts can emit more than one stream. To do so, declare multiple streams using the declareStream method of OutputFieldsDeclarer and specify the stream to emit to when using the emit method on SpoutOutputCollector.

The main method on spouts is nextTuplenextTuple either emits a new tuple into the topology or simply returns if there are no new tuples to emit. It is imperative that nextTupledoes not block for any spout implementation, because Storm calls all the spout methods on the same thread.

The other main methods on spouts are ack and fail. These are called when Storm detects that a tuple emitted from the spout either successfully completed through the topology or failed to be completed. ack and fail are only called for reliable spouts. See the Javadoc for more information.

Resources:

Bolts

All processing in topologies is done in bolts. Bolts can do anything from filtering, functions, aggregations, joins, talking to databases, and more.

Bolts can do simple stream transformations. Doing complex stream transformations often requires multiple steps and thus multiple bolts. For example, transforming a stream of tweets into a stream of trending images requires at least two steps: a bolt to do a rolling count of retweets for each image, and one or more bolts to stream out the top X images (you can do this particular stream transformation in a more scalable way with three bolts than with two).

Bolts can emit more than one stream. To do so, declare multiple streams using the declareStream method of OutputFieldsDeclarer and specify the stream to emit to when using theemit method on OutputCollector.

When you declare a bolt's input streams, you always subscribe to specific streams of another component. If you want to subscribe to all the streams of another component, you have to subscribe to each one individually. InputDeclarer has syntactic sugar for subscribing to streams declared on the default stream id. Saying declarer.shuffleGrouping("1")subscribes to the default stream on component "1" and is equivalent to declarer.shuffleGrouping("1", DEFAULT_STREAM_ID).

The main method in bolts is the execute method which takes in as input a new tuple. Bolts emit new tuples using the OutputCollector object. Bolts must call the ack method on theOutputCollector for every tuple they process so that Storm knows when tuples are completed (and can eventually determine that its safe to ack the original spout tuples). For the common case of processing an input tuple, emitting 0 or more tuples based on that tuple, and then acking the input tuple, Storm provides an IBasicBolt interface which does the acking automatically.

Its perfectly fine to launch new threads in bolts that do processing asynchronously. OutputCollector is thread-safe and can be called at any time.

Resources:

Stream groupings

Part of defining a topology is specifying for each bolt which streams it should receive as input. A stream grouping defines how that stream should be partitioned among the bolt's tasks.

There are seven built-in stream groupings in Storm, and you can implement a custom stream grouping by implementing the CustomStreamGrouping interface:

  1. Shuffle grouping: Tuples are randomly distributed across the bolt's tasks in a way such that each bolt is guaranteed to get an equal number of tuples.
  2. Fields grouping: The stream is partitioned by the fields specified in the grouping. For example, if the stream is grouped by the "user-id" field, tuples with the same "user-id" will always go to the same task, but tuples with different "user-id"'s may go to different tasks.
  3. Partial Key grouping: The stream is partitioned by the fields specified in the grouping, like the Fields grouping, but are load balanced between two downstream bolts, which provides better utilization of resources when the incoming data is skewed. This paper provides a good explanation of how it works and the advantages it provides.
  4. All grouping: The stream is replicated across all the bolt's tasks. Use this grouping with care.
  5. Global grouping: The entire stream goes to a single one of the bolt's tasks. Specifically, it goes to the task with the lowest id.
  6. None grouping: This grouping specifies that you don't care how the stream is grouped. Currently, none groupings are equivalent to shuffle groupings. Eventually though, Storm will push down bolts with none groupings to execute in the same thread as the bolt or spout they subscribe from (when possible).
  7. Direct grouping: This is a special kind of grouping. A stream grouped this way means that the producer of the tuple decides which task of the consumer will receive this tuple. Direct groupings can only be declared on streams that have been declared as direct streams. Tuples emitted to a direct stream must be emitted using one of the [emitDirect](/javadoc/apidocs/backtype/storm/task/OutputCollector.html#emitDirect(int, int, java.util.List) methods. A bolt can get the task ids of its consumers by either using the provided TopologyContext or by keeping track of the output of the emit method in OutputCollector (which returns the task ids that the tuple was sent to).
  8. Local or shuffle grouping: If the target bolt has one or more tasks in the same worker process, tuples will be shuffled to just those in-process tasks. Otherwise, this acts like a normal shuffle grouping.

Resources:

  • TopologyBuilder: use this class to define topologies
  • InputDeclarer: this object is returned whenever setBolt is called on TopologyBuilder and is used for declaring a bolt's input streams and how those streams should be grouped
  • CoordinatedBolt: this bolt is useful for distributed RPC topologies and makes heavy use of direct streams and direct groupings

Reliability

Storm guarantees that every spout tuple will be fully processed by the topology. It does this by tracking the tree of tuples triggered by every spout tuple and determining when that tree of tuples has been successfully completed. Every topology has a "message timeout" associated with it. If Storm fails to detect that a spout tuple has been completed within that timeout, then it fails the tuple and replays it later.

To take advantage of Storm's reliability capabilities, you must tell Storm when new edges in a tuple tree are being created and tell Storm whenever you've finished processing an individual tuple. These are done using the OutputCollector object that bolts use to emit tuples. Anchoring is done in the emit method, and you declare that you're finished with a tuple using the ack method.

This is all explained in much more detail in Guaranteeing message processing.

Tasks

Each spout or bolt executes as many tasks across the cluster. Each task corresponds to one thread of execution, and stream groupings define how to send tuples from one set of tasks to another set of tasks. You set the parallelism for each spout or bolt in the setSpout and setBolt methods of TopologyBuilder.

Workers

Topologies execute across one or more worker processes. Each worker process is a physical JVM and executes a subset of all the tasks for the topology. For example, if the combined parallelism of the topology is 300 and 50 workers are allocated, then each worker will execute 6 tasks (as threads within the worker). Storm tries to spread the tasks evenly across all the workers.

Resources:

Apache Storm 官方文档中文版

Apache Storm 官方文档中文版 原文链接    译者:魏勇 About 本项目是 Apache Storm 官方文档的中文翻译版,致力于为有实时流计算项目需求和对 Apache Sto...
  • wanglha
  • wanglha
  • 2016年05月12日 10:59
  • 1197

storm官方文档----配置文件说明

源地址:http://storm.apache.org/documentation/Configuration.html   storm由丰富的configure选项, 用来调整nibus、sup...
  • gao634209276
  • gao634209276
  • 2016年06月24日 22:05
  • 1720

Storm官方文档about页面翻译

原创翻译,如有错误请指出,谢谢。 原文链接:http://storm.incubator.apache.org/about/integrates.html   集成 Storm可以集成任何...
  • damacheng
  • damacheng
  • 2015年01月04日 16:17
  • 591

(2)Storm实时日志分析实战--Topology的设计

需求日志数据样例: 215.187.202.215 - - [1481945172991] “GET/IBEIfeng.gif?order_id=1&orderTime=1481945172991...
  • fjse51
  • fjse51
  • 2016年12月26日 15:26
  • 1168

storm的日志问题

由于目前的流计算项目要加监控和报警,因此规范的日志是必须的条件。测试了以后才发现storm的日志原来有个很大的坑。 基本问题如下:storm采用的也是log4j去打印日志,默认的日志配置文件是sto...
  • caodaoxi
  • caodaoxi
  • 2013年05月22日 17:34
  • 5736

storm组件初始化问题及与spring的结合方式

一般来说spout/bolt的生命周期如下:  1.在提交了一个topology之后(是在nimbus所在的机器么?), 创建spout/bolt实例(spout/bolt在storm中统称为comp...
  • asdfsadfasdfsa
  • asdfsadfasdfsa
  • 2017年02月27日 14:39
  • 1473

阅读Oracle官方文档一步步学习Oracle知识的正确顺序

通过阅读Oracle官方文档一步步学习Oracle知识的正确顺序是? Database Concepts 数据库基础概念文档,适合没有任何基础的新手: Oracle 12c概念官方手册A...
  • xiaobluesky
  • xiaobluesky
  • 2015年11月22日 20:58
  • 1000

Storm 1.0.0 稳定版发布

Storm 1.0.0 发布 Storm 1.0.0版本增加了很多新的特性,可用性以及性能也得到了很大的改善,该版本是Storm发展历程上一个里程碑式的版本,主要特点如下。 性能提升 Storm...
  • Glory_Z
  • Glory_Z
  • 2016年04月13日 15:32
  • 1892

Storm命令详解

在Linux终端直接输入storm,不带任何参数信息,或者输入storm help,可以查看storm命令行客户端(Command line client)提供的帮助信息。Storm 0.9.0.1版...
  • lzm1340458776
  • lzm1340458776
  • 2015年04月26日 15:08
  • 5737

storm 开发系列一 第一个程序

本文将在本地开发环境创建一个storm程序,力求简单。首先用mvn创建一个简单的工程hello_stormmvn archetype:generate -DgroupId=org.csfreebird...
  • sheismylife
  • sheismylife
  • 2015年10月13日 22:38
  • 4864
内容举报
返回顶部
收藏助手
不良信息举报
您举报文章:storm官方文档
举报原因:
原因补充:

(最多只允许输入30个字)