Storm Learning, Part 1: Storm Concepts

(Note: this post is mostly a set of study notes, compiled from the reference links at the end. If it feels too shallow, go straight to those links.)

1. Storm terminology

topology

  • All of an application's logic is packaged into a Storm topology. A Storm topology is analogous to a MapReduce job; one key difference is that an MR job eventually finishes, while a topology runs forever unless it is killed manually. A topology defines the spouts, the bolts, and the stream groupings.

The logic for a realtime application is packaged into a Storm topology. A Storm topology is analogous to a MapReduce job. One key difference is that a MapReduce job eventually finishes, whereas a topology runs forever (or until you kill it, of course). A topology is a graph of spouts and bolts that are connected with stream groupings.
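As a minimal sketch of what "defining spouts, bolts, and stream groupings" looks like in code (MySpout and MyBolt are hypothetical component classes, stand-ins for the spout and bolt sketched later in this post):

import org.apache.storm.topology.TopologyBuilder;

// Wire a spout and a bolt into a topology graph.
TopologyBuilder builder = new TopologyBuilder();
builder.setSpout("sentences", new MySpout(), 2);   // spout, parallelism 2
builder.setBolt("counter", new MyBolt(), 4)        // bolt, parallelism 4
       .shuffleGrouping("sentences");              // stream grouping
// builder.createTopology() yields the StormTopology that gets submitted.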

streams

  • Streams are the core abstraction in Storm. A stream is an unbounded sequence of tuples, and those tuples are created and processed in parallel in a distributed fashion.
  • Custom types can also be used in tuples (as long as they are serializable).

The stream is the core abstraction in Storm. A stream is an unbounded sequence of tuples that is processed and created in parallel in a distributed fashion. Streams are defined with a schema that names the fields in the stream’s tuples. By default, tuples can contain integers, longs, shorts, bytes, strings, doubles, floats, booleans, and byte arrays. You can also define your own serializers so that custom types can be used natively within tuples.

Resources:
Tuple: streams are composed of tuples
OutputFieldsDeclarer: used to declare streams and their schemas
Serialization: Information about Storm’s dynamic typing of tuples and declaring custom serializations
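Concretely, a component names the fields of its output tuples (the stream's schema) in declareOutputFields; a minimal sketch, placed inside any spout or bolt (field names are illustrative):

import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.tuple.Fields;

// Declare that tuples on this component's default stream carry
// two fields named "word" and "count".
@Override
public void declareOutputFields(OutputFieldsDeclarer declarer) {
    declarer.declare(new Fields("word", "count"));
}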

spout

  • A spout is the data entry point of a Storm topology: it connects to a data source, converts the data into tuples, and emits the tuples as a stream.
  • A spout can be reliable or unreliable: a reliable spout re-emits a tuple that failed, while an unreliable spout forgets about a tuple as soon as it is emitted.
  • A spout can emit to more than one stream.
  • The main method on a spout is nextTuple, which either emits a new tuple into the topology or simply returns when there is nothing new to emit. Take care that your implementation of this method does not block.
  • Storm calls ack on the spout when an emitted tuple completes processing, and fail when it does not (reliable spouts only). A sketch follows the quoted docs below.

A spout is a source of streams in a topology. Generally spouts will read tuples from an external source and emit them into the topology (e.g. a Kestrel queue or the Twitter API). Spouts can either be reliable or unreliable. A reliable spout is capable of replaying a tuple if it failed to be processed by Storm, whereas an unreliable spout forgets about the tuple as soon as it is emitted.

Spouts can emit more than one stream. To do so, declare multiple streams using the declareStream method of OutputFieldsDeclarer and specify the stream to emit to when using the emit method on SpoutOutputCollector.

The main method on spouts is nextTuple. nextTuple either emits a new tuple into the topology or simply returns if there are no new tuples to emit. It is imperative that nextTuple does not block for any spout implementation, because Storm calls all the spout methods on the same thread.

The other main methods on spouts are ack and fail. These are called when Storm detects that a tuple emitted from the spout either successfully completed through the topology or failed to be completed. ack and fail are only called for reliable spouts. See the Javadoc for more information.
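Putting the spout pieces together, a minimal sketch (the hard-coded sentence source is made up purely for illustration):

import java.util.Map;
import java.util.UUID;
import org.apache.storm.spout.SpoutOutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichSpout;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Values;

public class MySpout extends BaseRichSpout {
    private SpoutOutputCollector collector;
    private final String[] sentences = {"the cow jumped over the moon"};
    private int index = 0;

    @Override
    public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
        this.collector = collector; // keep the collector for nextTuple
    }

    @Override
    public void nextTuple() {
        // Emit one tuple per call, tagged with a message id so that Storm
        // will call ack/fail for it; never block in this method.
        String msgId = UUID.randomUUID().toString();
        collector.emit(new Values(sentences[index]), msgId);
        index = (index + 1) % sentences.length;
    }

    @Override
    public void ack(Object msgId) {
        // Tuple fully processed; a reliable spout can discard its replay copy.
    }

    @Override
    public void fail(Object msgId) {
        // Tuple failed or timed out; a reliable spout would re-emit it here.
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("sentence"));
    }
}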

bolt

  • A bolt takes one or more streams as input, performs computations on the data, and optionally emits one or more output streams. A bolt can subscribe to multiple streams emitted by spouts or by other bolts.
  • All processing in a topology happens in bolts. Bolts can do anything: filtering, functions, aggregations, joins (merges), talking to databases, and so on.
  • Complex processing can be split across multiple bolts.
  • Bolts can emit more than one stream.
  • The main method on a bolt is execute, which a task runs whenever a new tuple arrives. Bolts emit new tuples through an OutputCollector object, and must call the OutputCollector's ack method for every tuple they process so that Storm knows the tuple is done (and can eventually determine that it is safe to ack the original spout tuple). The common pattern is: process an input tuple, emit 0..N tuples based on it, then ack the input tuple. Storm provides the IBasicBolt interface, which calls ack automatically. A sketch follows the quoted docs below.

All processing in topologies is done in bolts. Bolts can do anything from filtering, functions, aggregations, joins, talking to databases, and more.

Bolts can do simple stream transformations. Doing complex stream transformations often requires multiple steps and thus multiple bolts. For example, transforming a stream of tweets into a stream of trending images requires at least two steps: a bolt to do a rolling count of retweets for each image, and one or more bolts to stream out the top X images (you can do this particular stream transformation in a more scalable way with three bolts than with two).

Bolts can emit more than one stream. To do so, declare multiple streams using the declareStream method of OutputFieldsDeclarer and specify the stream to emit to when using the emit method on OutputCollector.

When you declare a bolt's input streams, you always subscribe to specific streams of another component. If you want to subscribe to all the streams of another component, you have to subscribe to each one individually. InputDeclarer has syntactic sugar for subscribing to streams declared on the default stream id. Saying declarer.shuffleGrouping("1") subscribes to the default stream on component "1" and is equivalent to declarer.shuffleGrouping("1", DEFAULT_STREAM_ID).

The main method in bolts is the execute method which takes in as input a new tuple. Bolts emit new tuples using the OutputCollector object. Bolts must call the ack method on the OutputCollector for every tuple they process so that Storm knows when tuples are completed (and can eventually determine that it's safe to ack the original spout tuples). For the common case of processing an input tuple, emitting 0 or more tuples based on that tuple, and then acking the input tuple, Storm provides an IBasicBolt interface which does the acking automatically.

It's perfectly fine to launch new threads in bolts that do processing asynchronously. OutputCollector is thread-safe and can be called at any time.
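Putting the bolt pieces together, a minimal sketch of the execute/emit/ack pattern (the word-splitting logic is illustrative):

import java.util.Map;
import org.apache.storm.task.OutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichBolt;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;

public class MyBolt extends BaseRichBolt {
    private OutputCollector collector;

    @Override
    public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
        this.collector = collector;
    }

    @Override
    public void execute(Tuple input) {
        // Emit 0..N tuples derived from the input, anchored to it so that a
        // downstream failure is traced back to the original spout tuple.
        for (String word : input.getString(0).split(" ")) {
            collector.emit(input, new Values(word)); // anchored emit
        }
        collector.ack(input); // tell Storm this input tuple is fully processed
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("word"));
    }
}

Extending BaseBasicBolt instead (the IBasicBolt route mentioned above) makes the anchoring and acking implicit.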

tuple

The tuple is Storm's most basic data structure. Internally, its implementation (TupleImpl.java) uses a List of values to hold data of various types.
Values can be retrieved by position like this:

String word = tuple.getString(0);
long count = tuple.getLong(1);
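
Values can also be fetched by field name, assuming the upstream component declared fields such as "word" and "count":

String word = tuple.getStringByField("word");
long count = tuple.getLongByField("count");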

Overall, this diagram sums it up best:
[figure: overview of a topology's spouts, bolts, streams, and tuples]

Stream groupings

I won't bother translating much of this section; the original docs are direct and clear.
Data streams into the Storm cluster one tuple at a time, and Storm offers several ways to distribute those tuples across a bolt's tasks (Hadoop MapReduce has a similar partitioning notion).
Part of defining a topology is specifying for each bolt which streams it should receive as input. A stream grouping defines how that stream should be partitioned among the bolt’s tasks.

There are eight built-in stream groupings in Storm, and you can implement a custom stream grouping by implementing the CustomStreamGrouping interface:

  • Shuffle grouping: Tuples are randomly distributed across the bolt's tasks in a way such that each bolt is guaranteed to get an equal number of tuples.
  • Fields grouping: The stream is partitioned by the fields specified in the grouping. For example, if the stream is grouped by the "user-id" field, tuples with the same "user-id" will always go to the same task, but tuples with different "user-id"s may go to different tasks.
  • Partial Key grouping: The stream is partitioned by the fields specified in the grouping, like the Fields grouping, but is load balanced between two downstream bolts, which provides better utilization of resources when the incoming data is skewed. This paper provides a good explanation of how it works and the advantages it provides.
  • All grouping: The stream is replicated across all the bolt's tasks. Use this grouping with care.
  • Global grouping: The entire stream goes to a single one of the bolt's tasks. Specifically, it goes to the task with the lowest id.
  • None grouping: This grouping specifies that you don't care how the stream is grouped. Currently, none groupings are equivalent to shuffle groupings. Eventually though, Storm will push down bolts with none groupings to execute in the same thread as the bolt or spout they subscribe from (when possible).
  • Direct grouping: This is a special kind of grouping. A stream grouped this way means that the producer of the tuple decides which task of the consumer will receive this tuple. Direct groupings can only be declared on streams that have been declared as direct streams. Tuples emitted to a direct stream must be emitted using one of the emitDirect methods on OutputCollector. A bolt can get the task ids of its consumers by either using the provided TopologyContext or by keeping track of the output of the emit method in OutputCollector (which returns the task ids that the tuple was sent to).
  • Local or shuffle grouping: If the target bolt has one or more tasks in the same worker process, tuples will be shuffled to just those in-process tasks. Otherwise, this acts like a normal shuffle grouping.
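In code, the grouping is chosen when a bolt subscribes to its input; a sketch reusing the builder from the earlier wiring example (component ids, bolt classes, and field names are hypothetical):

import org.apache.storm.tuple.Fields;

// Shuffle grouping: random, even distribution across the bolt's tasks.
builder.setBolt("printer", new PrinterBolt(), 4)
       .shuffleGrouping("counter");

// Fields grouping: tuples with the same "user-id" always reach the same task.
builder.setBolt("profiler", new ProfileBolt(), 4)
       .fieldsGrouping("events", new Fields("user-id"));

// All grouping: every task of the bolt receives every tuple.
builder.setBolt("signals", new SignalBolt(), 2)
       .allGrouping("control");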

Tasks

Spouts and bolts execute as tasks across the cluster, and each task corresponds to one thread of execution. Stream groupings define how tuples are sent from the tasks of one component to the tasks of another.
Each spout or bolt executes as many tasks across the cluster. Each task corresponds to one thread of execution, and stream groupings define how to send tuples from one set of tasks to another set of tasks. You set the parallelism for each spout or bolt in the setSpout and setBolt methods of TopologyBuilder.
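The executor (thread) count is the parallelism hint passed to setSpout/setBolt; the task count can be set separately with setNumTasks. A sketch continuing the earlier builder (numbers are illustrative):

// 2 executors (threads) running 4 tasks of the spout: 2 tasks per thread.
builder.setSpout("sentences", new MySpout(), 2).setNumTasks(4);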

Workers

A topology executes on one or more worker processes. A worker is a JVM process that runs a subset of all the tasks of the topology. For example, if the topology's combined parallelism is 300 and 50 workers are allocated, each worker runs 6 tasks. Storm tries to balance the task count evenly across all workers. The number of workers for a topology is set through Config.TOPOLOGY_WORKERS.
Topologies execute across one or more worker processes. Each worker process is a physical JVM and executes a subset of all the tasks for the topology. For example, if the combined parallelism of the topology is 300 and 50 workers are allocated, then each worker will execute 6 tasks (as threads within the worker). Storm tries to spread the tasks evenly across all the workers.
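For the 300-tasks-on-50-workers example above, the worker count might be set like this:

import org.apache.storm.Config;

Config conf = new Config();
conf.setNumWorkers(50); // convenience setter for Config.TOPOLOGY_WORKERS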

2. The Storm cluster

nimbus daemon

Its function is a bit like that of HMaster in HBase (which assigns regions to the RegionServers and does some load balancing):
nimbus is responsible for distributing the topologies in a Storm cluster to the supervisors.

The main responsibility of the nimbus daemon is to manage, coordinate, and monitor the topologies running on the cluster. That includes publishing topologies, assigning tasks, and reassigning tasks when processing fails.

To publish a topology to a Storm cluster, you submit the pre-packaged topology jar and its configuration to the nimbus server. Once nimbus has received the topology package, it distributes the jar to a sufficient number of supervisor nodes. When the supervisor nodes have received the package, nimbus assigns tasks (bolt and spout instances) to each supervisor and signals the supervisors to spawn enough workers to execute the assigned tasks.
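Programmatically, that handoff to nimbus happens through StormSubmitter; a sketch continuing the builder from earlier (topology name and worker count are arbitrary, checked exceptions elided):

import org.apache.storm.Config;
import org.apache.storm.StormSubmitter;

// Package this together with the component classes into a jar, then launch
// it with the `storm jar` command so the jar is uploaded to nimbus.
Config conf = new Config();
conf.setNumWorkers(2);
StormSubmitter.submitTopology("word-count", conf, builder.createTopology());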

nimbus tracks the status of all supervisor nodes and the tasks assigned to them. If nimbus notices that a supervisor has stopped reporting heartbeats or has become unreachable, it reassigns the failed supervisor's tasks to other supervisor nodes in the cluster.

Strictly speaking, nimbus is not a single point of failure for data processing. That is because nimbus takes no part in a topology's data processing; it only handles topology initialization, task distribution, and monitoring. In fact, if the nimbus daemon dies while a topology is running, the topology keeps processing data as long as the assigned supervisors and workers stay healthy. The caveat: if a supervisor terminates abnormally while nimbus is already down, there is no nimbus daemon to reassign the dead supervisor's tasks, and data processing will fail.

supervisor daemon

Mapping this to HBase's RegionServer is not quite accurate, because the supervisor is only a daemon; the actual task execution is handed off to separate worker processes.

The supervisor daemon waits for task assignments from nimbus, then spawns and monitors the workers (JVM processes) that execute them. Supervisors and workers run in separate JVM processes, so if a worker spawned by a supervisor exits abnormally due to an error (or is force-killed, e.g. by kill -9 on Unix or taskkill on Windows), the supervisor daemon will try to respawn a new worker process.

If a worker, or even an entire supervisor node, fails, how does Storm guarantee delivery of the tuples that were in flight at the time of the failure? The answer lies in Storm's tuple anchoring and acknowledgement mechanism. When reliable delivery is enabled, tuples routed to the failed node never get acknowledged, so the spout re-emits the original tuples after a timeout. This repeats until the topology recovers from the failure and resumes normal processing.
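The anchoring in question is the anchored emit shown in the bolt sketch earlier; the contrast between the two emit forms inside a bolt's execute method is:

// Anchored: a downstream failure fails the original spout tuple,
// which the spout then replays (when reliability is enabled).
collector.emit(input, new Values(word));

// Unanchored: no link back to the spout tuple, so a downstream
// failure goes unnoticed and nothing is replayed.
collector.emit(new Values(word));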

The overall cluster architecture is best illustrated by this diagram:
[figure: Storm cluster architecture — nimbus, supervisors, and workers]

References:
Storm official docs: http://storm.apache.org/releases/current/Concepts.html
https://matt33.com/2015/05/26/the-basis-of-storm/#基础
Diagrams from: https://www.jianshu.com/p/90456bab8487
Also: https://www.howardliu.cn/storm/the-concepts-of-storm/
