storm-Understanding the…

李荣强

Published 2015-12-24 10:31:13

Category: Translation - Big Data

Article link: https://blog.csdn.net/li951418089/article/details/50392796

Original article: http://www.michael-noll.com/blog/2012/10/16/understanding-the-parallelism-of-a-storm-topology/

Original text:

In the past few days I have been test-driving Twitter’s Storm project, which is a distributed real-time data processing platform. One of my findings so far has been that the quality of Storm’s documentation and example code is pretty good – it is very easy to get up and running with Storm. Big props to the Storm developers! At the same time, I found the sections on how a Storm topology runs in a cluster not perfectly clear, and learned that the recent releases of Storm changed some of its behavior in a way that is not yet fully reflected in the Storm wiki and in the API docs.

In this article I want to share my own understanding of the parallelism of a Storm topology after reading the documentation and writing some first prototype code. More specifically, I describe the relationships of worker processes, executors (threads) and tasks, and how you can configure them according to your needs. This article is based on Storm release 0.8.1, the latest version as of October 2012.

What is Storm?

For those readers unfamiliar with Storm, here is a brief description taken from its homepage:

Storm is a free and open source distributed realtime computation system. Storm makes it easy to reliably process unbounded streams of data, doing for realtime processing what Hadoop did for batch processing. Storm is simple, can be used with any programming language, and is a lot of fun to use!

Storm has many use cases: realtime analytics, online machine learning, continuous computation, distributed RPC, ETL, and more. Storm is fast: a benchmark clocked it at over a million tuples processed per second per node. It is scalable, fault-tolerant, guarantees your data will be processed, and is easy to set up and operate.

Translator's note: a tuple is Storm's unit of data transfer.

What makes a running topology: worker processes, executors and tasks

Storm distinguishes between the following three main entities that are used to actually run a topology in a Storm cluster:

Worker processes
Executors (threads)
Tasks

Here is a simple illustration of their relationships:

A machine in a Storm cluster may run one or more worker processes for one or more topologies. Each worker process runs executors for a specific topology.

One or more executors may run within a single worker process, with each executor being a thread spawned by the worker process. Each executor runs one or more tasks of the same component (spout or bolt).

Figure 1: The relationships of worker processes, executors (threads) and tasks in Storm
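The containment described by Figure 1 can be sketched as plain data types. This is only a minimal model for illustration, not Storm code: the class names `WorkerProcess` and `Executor`, the topology name `demo`, and the task ids are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Executor:
    """A thread inside a worker; runs tasks of exactly one component."""
    component: str
    task_ids: List[int] = field(default_factory=list)


@dataclass
class WorkerProcess:
    """A process (its own JVM in Storm) that belongs to exactly one topology."""
    topology: str
    executors: List[Executor] = field(default_factory=list)

    def thread_count(self) -> int:
        # Counts executor threads only, one thread per executor.
        return len(self.executors)

    def task_count(self) -> int:
        return sum(len(e.task_ids) for e in self.executors)


# Hypothetical example: one worker of a topology named "demo" with a spout
# executor running one task and a bolt executor running two tasks.
worker = WorkerProcess(
    topology="demo",
    executors=[
        Executor(component="spout", task_ids=[0]),
        Executor(component="bolt", task_ids=[1, 2]),
    ],
)
print(worker.thread_count(), worker.task_count())
```

In this example the worker holds two executor threads and three tasks, so the thread count does not exceed the task count.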

A worker process executes a subset of a topology, and runs in its own JVM. A worker process belongs to a specific topology and may run one or more executors for one or more components (spouts or bolts) of this topology. A running topology consists of many such processes running on many machines within a Storm cluster.

An executor is a thread that is spawned by a worker process and runs within the worker’s JVM. An executor may run one or more tasks for the same component (spout or bolt). An executor always has one thread that it uses for all of its tasks, which means that tasks run serially on an executor.

A task performs the actual data processing and is run within its parent executor’s thread of execution. Each spout or bolt that you implement in your code executes as many tasks across the cluster. The number of tasks for a component is always the same throughout the lifetime of a topology, but the number of executors (threads) for a component can change over time. This means that the following condition holds true: #threads <= #tasks. By default, the number of tasks is set to be the same as the number of executors, i.e. Storm will run one task per thread (which is usually what you want anyways).
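The default (one task per thread) and the condition #threads <= #tasks can be illustrated with a small sketch. The function names `resolve_counts` and `assign_tasks` are hypothetical, the round-robin placement is only an illustrative policy, and rejecting invalid input with an error is a choice of this sketch rather than a statement about how Storm itself reacts.

```python
from typing import Dict, List, Optional, Tuple


def resolve_counts(parallelism_hint: int,
                   num_tasks: Optional[int] = None) -> Tuple[int, int]:
    """Return (executors, tasks) for one component.

    When num_tasks is not given, it defaults to the number of executors,
    i.e. one task per thread.
    """
    tasks = parallelism_hint if num_tasks is None else num_tasks
    if parallelism_hint < 1 or parallelism_hint > tasks:
        # Assumption of this sketch: reject inputs that break #threads <= #tasks.
        raise ValueError("need 1 <= #threads <= #tasks")
    return parallelism_hint, tasks


def assign_tasks(executors: int, tasks: int) -> Dict[int, List[int]]:
    """Spread task ids round-robin over executors (illustrative policy only)."""
    assignment: Dict[int, List[int]] = {e: [] for e in range(executors)}
    for task_id in range(tasks):
        assignment[task_id % executors].append(task_id)
    return assignment


executors, tasks = resolve_counts(parallelism_hint=2, num_tasks=4)
print(assign_tasks(executors, tasks))
```

With two executors and four tasks, each executor ends up with two tasks, which it would run serially on its single thread.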

Also be aware that:

The number of executor threads can be changed after the topology has been started (see storm rebalance command below).
The number of tasks of a topology is static.

See Understanding the Internal Message Buffers of Storm for another view on the various threads that are running within the lifetime of a worker process and its associated executors and tasks.
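To make the difference between the two points concrete, the sketch below keeps a fixed set of task ids and spreads it over a changing number of executors. It is not Storm's scheduling algorithm: the helper `spread` is hypothetical, and the bound 1 <= #executors <= #tasks is an assumption derived from the condition #threads <= #tasks.

```python
from typing import List


def spread(task_ids: List[int], executor_count: int) -> List[List[int]]:
    """Partition a fixed list of task ids into executor_count contiguous groups."""
    if not 1 <= executor_count <= len(task_ids):
        raise ValueError("need 1 <= #executors <= #tasks")
    base, extra = divmod(len(task_ids), executor_count)
    groups, start = [], 0
    for i in range(executor_count):
        # The first `extra` groups take one additional task each.
        size = base + (1 if i < extra else 0)
        groups.append(task_ids[start:start + size])
        start += size
    return groups


task_ids = [0, 1, 2, 3]          # fixed for the lifetime of the topology
before = spread(task_ids, 2)     # two executors, two tasks each
after = spread(task_ids, 4)      # after changing to four executors
print(before, after)
```

The set of tasks is identical before and after; only the grouping into threads changes.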

Configuring the parallelism of a topology

Note that in Storm’s terminology “parallelism” is specifically used to describe the so-called parallelism hint, which means the initial number of executors (threads) of a component. In this article, though, I use the term “parallelism” in a more general sense to describe how you can configure not only the number of executors but also the number of worker processes and the number of tasks of a Storm topology. I will specifically call out when “parallelism” is used in the narrow definition of Storm.

The following table gives an overview of the various configuration options and how to set them in your code. There is more than one way of setting these options though, and the table lists only some of them. Storm currently has the following order of precedence for configuration settings:

external component-specific configuration > internal component-specific configuration > topology-specific configuration > storm.yaml > defaults.yaml.

Please take a look at the Storm documentation for more details.
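The order of precedence amounts to a layered lookup in which the first level that defines a setting wins. Below is a minimal sketch of that idea using Python's `collections.ChainMap`. The values at each level are hypothetical, and only the lookup order is taken from the text. Storm itself does not resolve its configuration this way in Python.

```python
from collections import ChainMap

# Hypothetical values at each level; only the lookup order comes from the text.
external_component = {}
internal_component = {}
topology_specific = {"topology.workers": 4}
storm_yaml = {"topology.workers": 2}
defaults_yaml = {"topology.workers": 1, "topology.debug": False}

# ChainMap searches its maps left to right, so earlier maps take precedence.
effective = ChainMap(
    external_component,
    internal_component,
    topology_specific,
    storm_yaml,
    defaults_yaml,
)

print(effective["topology.workers"], effective["topology.debug"])
```

Here the topology-specific value overrides both YAML files, while a setting that only appears in the lowest level falls through unchanged.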