Understanding the parallelism of a Storm topology

In the past few days I have been test-driving Twitter’s Storm project, which is a distributed real-time data processing platform. One of my findings so far has been that the quality of Storm’s documentation and example code is pretty good — it is very easy to get up and running with Storm. Big props to the Storm developers! At the same time, I found the sections on how a Storm topology runs in a cluster not perfectly clear, and learned that the recent releases of Storm changed some of its behavior in a way that is not yet fully reflected in the Storm wiki and in the API docs.

In this article I want to share my own understanding of the parallelism of a Storm topology after reading the documentation and writing some first prototype code. More specifically, I describe the relationships of worker processes, executors (threads) and tasks, and how you can configure them according to your needs. The article is based on Storm release 0.8.1.

Update 2012-11-05: This blog post has been merged into Storm’s documentation.

What is Storm?

For those readers unfamiliar with Storm here is a brief description taken from its homepage:

Storm is a free and open source distributed realtime computation system. Storm makes it easy to reliably process unbounded streams of data, doing for realtime processing what Hadoop did for batch processing. Storm is simple, can be used with any programming language, and is a lot of fun to use!

Storm has many use cases: realtime analytics, online machine learning, continuous computation, distributed RPC, ETL, and more. Storm is fast: a benchmark clocked it at over a million tuples processed per second per node. It is scalable, fault-tolerant, guarantees your data will be processed, and is easy to set up and operate.

What makes a running topology: worker processes, executors and tasks

Storm distinguishes between the following three main entities that are used to actually run a topology in a Storm cluster:

  • Worker processes
  • Executors (threads)
  • Tasks

Here is a simple illustration of their relationships:

Figure 1: The relationships of worker processes, executors (threads) and tasks in Storm

A worker process executes a subset of a topology. A worker process belongs to a specific topology and may run one or more executors for one or more components (spouts or bolts) of this topology. A running topology consists of many such processes running on many machines within a Storm cluster.

An executor is a thread that is spawned by a worker process. It may run one or more tasks for the same component (spout or bolt).

A task performs the actual data processing: each spout or bolt that you implement in your code executes as many tasks across the cluster. The number of tasks for a component is always the same throughout the lifetime of a topology, but the number of executors (threads) for a component can change over time. This means that the following condition holds true: #threads ≤ #tasks. By default, the number of tasks is set to be the same as the number of executors, i.e. Storm will run one task per thread.
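To make the condition #threads ≤ #tasks more tangible, here is a minimal sketch in plain Java (no Storm dependency) that splits the tasks of one component across its executors. The class name TaskAssignmentSketch and the even, contiguous split are assumptions made for illustration only; this is not Storm's actual scheduling code.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Storm: models how the tasks of one
// component could be spread over its executors (threads).
final class TaskAssignmentSketch {

    /** Splits task ids 0..numTasks-1 into numExecutors groups, as evenly as possible. */
    static List<List<Integer>> assign(int numTasks, int numExecutors) {
        // The invariant from the text: at least one executor, never more executors than tasks.
        if (numExecutors < 1 || numExecutors > numTasks) {
            throw new IllegalArgumentException("need 1 <= #executors <= #tasks");
        }
        List<List<Integer>> executors = new ArrayList<>();
        int base = numTasks / numExecutors;   // every executor gets at least this many tasks
        int extra = numTasks % numExecutors;  // the first 'extra' executors get one more
        int nextTaskId = 0;
        for (int e = 0; e < numExecutors; e++) {
            int size = base + (e < extra ? 1 : 0);
            List<Integer> tasks = new ArrayList<>();
            for (int i = 0; i < size; i++) {
                tasks.add(nextTaskId++);
            }
            executors.add(tasks);
        }
        return executors;
    }

    public static void main(String[] args) {
        System.out.println(assign(4, 2)); // two tasks per executor: [[0, 1], [2, 3]]
        System.out.println(assign(4, 4)); // the default case, one task per executor
    }
}
```

Because the number of tasks is fixed while the number of executors may change, the same four tasks can later be regrouped onto a different number of threads without any task being created or destroyed.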

Configuring the parallelism of a topology

Note that in Storm’s terminology “parallelism” is specifically used to describe the so-called parallelism hint, which means the initial number of executors (threads) of a component. In this article though I use the term “parallelism” in a more general sense to describe how you can configure not only the number of executors but also the number of worker processes and the number of tasks of a Storm topology. I will specifically call out when “parallelism” is used in the narrow definition of Storm.

The following table gives an overview of the various configuration options and how to set them in your code. There is more than one way of setting these options though, and the table lists only some of them. Storm currently has the following order of precedence for configuration settings: defaults.yaml < storm.yaml < topology-specific configuration < internal component-specific configuration < external component-specific configuration. Please take a look at the Storm documentation for more details.

  • #worker processes
    Description: How many worker processes to create for the topology across machines in the cluster.
    Configuration option: TOPOLOGY_WORKERS
    How to set in your code (examples): Config#setNumWorkers
  • #executors (threads)
    Description: How many executors to spawn per component.
    Configuration option: ?
    How to set in your code (examples): TopologyBuilder#setSpout() and TopologyBuilder#setBolt(). Note that as of Storm 0.8 the parallelism_hint parameter now specifies the initial number of executors (not tasks!) for that bolt.
  • #tasks
    Description: How many tasks to create per component.
    Configuration option: TOPOLOGY_TASKS
    How to set in your code (examples): ComponentConfigurationDeclarer#setNumTasks()
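The order of precedence for configuration settings mentioned above can be pictured as a series of maps that are merged from lowest to highest priority, so that later layers override earlier ones. The following is a minimal sketch in plain Java; the class ConfigPrecedenceSketch and the example values are hypothetical, and only three of the five layers are shown. It illustrates the idea only and is not how Storm implements it internally.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of Storm: later layers win over earlier ones.
final class ConfigPrecedenceSketch {

    /** Merges configuration layers given in order from lowest to highest precedence. */
    @SafeVarargs
    static Map<String, Object> merge(Map<String, Object>... layersLowToHigh) {
        Map<String, Object> effective = new LinkedHashMap<>();
        for (Map<String, Object> layer : layersLowToHigh) {
            effective.putAll(layer); // a key set in a later layer replaces the earlier value
        }
        return effective;
    }

    public static void main(String[] args) {
        // Example values are made up for illustration.
        Map<String, Object> defaultsYaml = Map.of("topology.workers", 1, "topology.debug", false);
        Map<String, Object> stormYaml = Map.of("topology.workers", 2);
        Map<String, Object> topologyConf = Map.of("topology.workers", 4);

        Map<String, Object> effective = merge(defaultsYaml, stormYaml, topologyConf);
        System.out.println(effective.get("topology.workers")); // value from the topology-specific layer
        System.out.println(effective.get("topology.debug"));   // untouched value from defaults.yaml
    }
}
```

The component-specific layers would simply be further arguments at the end of the merge call.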

Here is an example code snippet to show these settings in practice:

topologyBuilder.setBolt("green-bolt", new GreenBolt(), 2) // parallelism hint: 2 executors
               .setNumTasks(4)                             // 4 tasks in total
               .shuffleGrouping("blue-spout");

In the above code we configured Storm to run the bolt GreenBolt with an initial number of two executors and four associated tasks. Storm will run two tasks per executor (thread). If you do not explicitly configure the number of tasks, Storm will run by default one task per executor.

Example of a running topology

The following illustration shows what a simple topology would look like in operation. The topology consists of three components: one spout called BlueSpout and two bolts called GreenBolt and YellowBolt. The components are linked such that BlueSpout sends its output to GreenBolt, which in turn sends its own output to YellowBolt.

Figure 2: Example of a running topology in Storm

The GreenBolt was configured as per the code snippet above whereas BlueSpout and YellowBolt only set the parallelism hint (number of executors). Here is the relevant code:

Config conf = new Config();
conf.setNumWorkers(2); // use two worker processes

topologyBuilder.setSpout("blue-spout", new BlueSpout(), 2); // parallelism hint

topologyBuilder.setBolt("green-bolt", new GreenBolt(), 2)
               .setNumTasks(4)
               .shuffleGrouping("blue-spout");

topologyBuilder.setBolt("yellow-bolt", new YellowBolt(), 6)
               .shuffleGrouping("green-bolt");

StormSubmitter.submitTopology(
        "mytopology",
        conf,
        topologyBuilder.createTopology()
    );
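As a quick sanity check of the numbers in this example, the following plain-Java sketch (no Storm dependency) adds up the executors and tasks of the three components. The class TopologyTotalsSketch is hypothetical, and the per-worker figure assumes that executors are spread evenly across the worker processes, which is an assumption of this sketch rather than a guarantee.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical model of the example topology, not part of Storm.
final class TopologyTotalsSketch {

    /** Executor and task counts of one component (spout or bolt). */
    static final class Component {
        final int executors;
        final int tasks;

        /** Only a parallelism hint: the task count defaults to the executor count. */
        Component(int parallelismHint) {
            this(parallelismHint, parallelismHint);
        }

        /** Parallelism hint plus an explicit number of tasks. */
        Component(int parallelismHint, int numTasks) {
            this.executors = parallelismHint;
            this.tasks = numTasks;
        }
    }

    static Map<String, Component> exampleTopology() {
        Map<String, Component> topology = new LinkedHashMap<>();
        topology.put("blue-spout", new Component(2));
        topology.put("green-bolt", new Component(2, 4));
        topology.put("yellow-bolt", new Component(6));
        return topology;
    }

    static int totalExecutors(Map<String, Component> topology) {
        return topology.values().stream().mapToInt(c -> c.executors).sum();
    }

    static int totalTasks(Map<String, Component> topology) {
        return topology.values().stream().mapToInt(c -> c.tasks).sum();
    }

    public static void main(String[] args) {
        Map<String, Component> topology = exampleTopology();
        int numWorkers = 2; // as set via conf.setNumWorkers(2)
        System.out.println("executors: " + totalExecutors(topology)); // 2 + 2 + 6 = 10
        System.out.println("tasks: " + totalTasks(topology));         // 2 + 4 + 6 = 12
        // Assuming an even spread, each worker process runs 10 / 2 = 5 threads.
        System.out.println("executors per worker: " + totalExecutors(topology) / numWorkers);
    }
}
```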

And of course Storm comes with additional configuration settings to control the parallelism of a topology, including:

  • TOPOLOGY_MAX_TASK_PARALLELISM: This setting puts a ceiling on the number of executors that can be spawned for a single component. It is typically used during testing to limit the number of threads spawned when running a topology in local mode. You can set this option via e.g. Config#setMaxTaskParallelism().

Update Oct 18: Nathan informed me that TOPOLOGY_OPTIMIZE will be removed in a future release. I have therefore removed its entry from the configuration list above.

How to change the parallelism of a running topology

A nifty feature of Storm is that you can increase or decrease the number of worker processes and/or executors without being required to restart the cluster or the topology. The act of doing so is called rebalancing.

You have two options to rebalance a topology:

  1. Use the Storm web UI to rebalance the topology.
  2. Use the CLI tool storm rebalance as described below.

Here is an example of using the CLI tool:

# Reconfigure the topology "mytopology" to use 5 worker processes,
# the spout "blue-spout" to use 3 executors and
# the bolt "yellow-bolt" to use 10 executors.

$ storm rebalance mytopology -n 5 -e blue-spout=3 -e yellow-bolt=10
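One thing to keep in mind when rebalancing is the condition #threads ≤ #tasks from earlier in this article: since the number of tasks of a component never changes, it also limits how many executors a rebalance can give that component. The sketch below, in plain Java with a hypothetical class RebalanceSketch, assumes that a request above the task count is simply capped at the task count. That capping behavior is an assumption of this sketch, so please verify it against your Storm version.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of Storm: applies #threads <= #tasks to a rebalance request.
final class RebalanceSketch {

    /** Returns the executor count after rebalancing, assuming requests are capped at #tasks. */
    static int effectiveExecutors(int requestedExecutors, int numTasks) {
        if (requestedExecutors < 1) {
            throw new IllegalArgumentException("need at least one executor");
        }
        return Math.min(requestedExecutors, numTasks);
    }

    public static void main(String[] args) {
        // Task counts of the example topology; they were fixed when it was submitted.
        Map<String, Integer> tasks = new LinkedHashMap<>();
        tasks.put("blue-spout", 2);
        tasks.put("yellow-bolt", 6);

        // Executor counts requested on the command line via -e.
        Map<String, Integer> requested = new LinkedHashMap<>();
        requested.put("blue-spout", 3);
        requested.put("yellow-bolt", 10);

        for (Map.Entry<String, Integer> entry : requested.entrySet()) {
            int effective = effectiveExecutors(entry.getValue(), tasks.get(entry.getKey()));
            System.out.println(entry.getKey() + ": requested " + entry.getValue()
                    + ", effective " + effective);
        }
    }
}
```

Under this assumption, configuring more tasks than executors up front (as done for GreenBolt with setNumTasks) is what leaves headroom for scaling a component up later.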

References for this article

To compile this article (and to write my related test code) I used information primarily from the following sources:

Summary

My personal impression is that Storm is a very promising tool. On the one hand I like its clean and elegant design, and on the other hand I was delighted to find that a young open source tool can still have excellent documentation. In this article I tried to summarize my own understanding of the parallelism of topologies, which may or may not be 100% correct -- feel free to let me know if there are any mistakes in the description above!

Ref: http://www.michael-noll.com/blog/2012/10/16/understanding-the-parallelism-of-a-storm-topology/
