Configurable ETL processing using Apache Storm and Kite SDK Morphlines

 

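Translator's note: while translating, I looked up many basic terms on Baidu Baike. I am a beginner, so if you spot a mistake, corrections are very welcome.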

About Adrianos Dadis

Adrianos works as a senior software engineer in the telecom business domain. He is particularly interested in enterprise integration, multi-tier architecture and middleware services. He mainly works with WebLogic, JBoss, Java EE, Spring, Drools, Oracle SOA Suite and various ESBs.

  


 

Since my first days as a software engineer, I have kept hearing the same request from many sides:

"We want everything to be configurable, we want to change everything at runtime, and we want a visual tool that applies all this logic so that non-developers can use and configure our application."

I like this generic scope too, but as we all know, software systems are not that adaptable and customer requests are not stable.

In previous years, we built such configurable applications (not 100% configurable) using traditional frameworks and techniques (JMX, distributed cache, Spring or JEE, and more).

In recent years, an additional concept has had to be included in our architecture: Big Data (or 3V, or 4V, or whatever words fit better). This new concept deprecates various solutions and workarounds that we were familiar with and applied in old 3-tier applications.

The funny thing is that many times I find myself in the same position as 10 years ago. This is the rule of software development: it never ends, and so personal excellence and new adventures never end either :-)

The main problem remains the same: how to build a configurable, distributed ETL application.

For this reason, I have built a mini adaptable solution that might be helpful in many use cases. I have used 3 common tools in the big data world: Java, Apache Storm and Kite SDK Morphlines. Java is the main programming language, Apache Storm is the distributed stream processing engine, and Kite SDK Morphlines is the configurable ETL engine.

Kite SDK Morphlines

Copied from its description: Morphlines is an open source framework that reduces the time and efforts necessary to build and change Hadoop ETL stream processing applications that extract, transform and load data into Apache Solr, HBase, HDFS, Enterprise Data Warehouses, or Analytic Online Dashboards. A morphline is a rich configuration file that makes it easy to define a transformation chain that consumes any kind of data from any kind of data source, processes the data and loads the results into a Hadoop component. It replaces Java programming with simple configuration steps, and correspondingly reduces the cost and integration effort associated with developing and maintaining custom ETL projects.

In addition to the built-in commands, you can easily implement your own Command and use it in your morphline configuration file.
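To make the idea of a transformation chain with a custom command more concrete, here is a minimal sketch in plain Java. It does not use the Kite SDK: the `Command` interface, the `step` helper and the `upperCaseName` transformation are all hypothetical stand-ins invented for illustration, and a real custom command would be written against the Morphlines API instead.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.function.Function;

public class CommandChainSketch {

    /** Hypothetical stand-in for a morphline command (NOT the Kite SDK interface). */
    interface Command {
        boolean process(Map<String, Object> record);
    }

    /** Wraps a transformation so that its result is forwarded to the next command in the chain. */
    static Command step(Function<Map<String, Object>, Map<String, Object>> transform, Command child) {
        return record -> child.process(transform.apply(record));
    }

    /** Illustrative "custom command": upper-cases the "name" field on a copy of the record. */
    static Map<String, Object> upperCaseName(Map<String, Object> record) {
        Map<String, Object> copy = new LinkedHashMap<>(record);
        Object name = copy.get("name");
        if (name instanceof String) {
            copy.put("name", ((String) name).toUpperCase(Locale.ROOT));
        }
        return copy;
    }

    /** Sends one record through the chain and returns what the final collector received. */
    static List<Map<String, Object>> run(Map<String, Object> input) {
        List<Map<String, Object>> collected = new ArrayList<>();
        Command collector = collected::add; // last command in the chain: just stores the record
        Command chain = step(CommandChainSketch::upperCaseName, collector);
        chain.process(input);
        return collected;
    }

    public static void main(String[] args) {
        Map<String, Object> input = new LinkedHashMap<>();
        input.put("name", "alice");
        input.put("age", 30);
        System.out.println(run(input));
    }
}
```

The point of the sketch is only the shape: each command does one small thing and hands the record to its child, which is what lets a configuration file reorder or replace steps without touching the rest of the chain.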

Sample Morphline configuration that reads a JSON string, parses it and then just logs a particular JSON element:

morphlines : [{
    id : json_terminal_log
    importCommands : ["org.kitesdk.**"]

    commands : [
            # read the JSON blob
            { readJson: {} }

            # extract JSON objects into head fields
            { extractJsonPaths {
              flatten: true
              paths: {
                name: /name
                age: /age
              }
            } }

            # log data
            { logInfo {
                format : "name: {}, record: {}"
                args : ["@{name}", "@{}"]
            } }
    ]
}]


Storm Morphlines Bolt

In order to use Morphlines inside Storm, I have implemented a custom MorphlinesBolt. The main responsibilities of this Bolt are:

  • Initialize the Morphlines handler via a configuration file

  • Initialize mapping instructions:

      a) from Tuple to Morphline input and

      b) from Morphline output to new output Tuple

  • Process each incoming event using the already initialized Morphlines context

  • If the Bolt is not terminal, then using the provided Mapper (type “b”), emit a new Tuple using the output of the Morphline execution
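The responsibilities above can be sketched as a small processing flow. This is a simplified model in plain Java, not the actual MorphlinesBolt: it uses neither Storm nor the Kite SDK, and the class name, the three `Function` fields and the `demo` factory are assumptions made for illustration. Here a missing mapper "b" is used to mark the bolt as terminal, which may differ from how the real bolt is configured.

```java
import java.nio.charset.StandardCharsets;
import java.util.Optional;
import java.util.function.Function;

public class MorphlinesBoltSketch {

    private final Function<String, byte[]> tupleMapper;   // mapping "a": tuple -> morphline input
    private final Function<byte[], String> morphline;     // stand-in for the initialized morphline
    private final Function<String, String> recordMapper;  // mapping "b"; null marks a terminal bolt

    public MorphlinesBoltSketch(Function<String, byte[]> tupleMapper,
                                Function<byte[], String> morphline,
                                Function<String, String> recordMapper) {
        this.tupleMapper = tupleMapper;
        this.morphline = morphline;
        this.recordMapper = recordMapper;
    }

    /** Processes one incoming tuple; returns the tuple to emit, or empty for a terminal bolt. */
    public Optional<String> execute(String tupleMessage) {
        byte[] morphlineInput = tupleMapper.apply(tupleMessage);
        String morphlineOutput = morphline.apply(morphlineInput);
        if (recordMapper == null) {
            return Optional.empty(); // terminal: the morphline ran, but nothing is emitted
        }
        return Optional.of(recordMapper.apply(morphlineOutput));
    }

    /** Builds a demo instance whose "morphline" simply trims the payload. */
    public static MorphlinesBoltSketch demo(boolean terminal) {
        Function<String, byte[]> toBytes = msg -> msg.getBytes(StandardCharsets.UTF_8);
        Function<byte[], String> trim = bytes -> new String(bytes, StandardCharsets.UTF_8).trim();
        Function<String, String> toTuple = terminal ? null : out -> "msg=" + out;
        return new MorphlinesBoltSketch(toBytes, trim, toTuple);
    }

    public static void main(String[] args) {
        System.out.println(demo(true).execute(" {\"name\":\"a\"} "));
        System.out.println(demo(false).execute(" {\"name\":\"a\"} "));
    }
}
```

Keeping the two mappers separate from the morphline itself is what allows the same bolt class to sit anywhere in a topology: only the mappers need to know the tuple layout.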

Simple Configurable ETL topologies

In order to test the custom MorphlinesBolt, I have written 2 simple tests. In these tests you can see how MorphlinesBolt is initialized and then the result of each execution. As input, I have used a custom Spout (RandomJsonTestSpout) that just emits new JSON strings every 100 ms (configurable).
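For readers who want a feel for what such a test input source does, here is a rough plain-Java sketch of a generator that produces small random JSON strings at a fixed interval. It is not the RandomJsonTestSpout from the project and does not use Storm; the class name, the sample names and the `name`/`age` fields (chosen to match the sample morphline configuration) are assumptions.

```java
import java.util.Random;

public class RandomJsonSketch {

    private static final String[] NAMES = {"alice", "bob", "carol"};

    private final Random random;

    public RandomJsonSketch(long seed) {
        this.random = new Random(seed); // seeded so that runs are repeatable
    }

    /** Builds one flat JSON document with a "name" and an "age" field. */
    public String nextJson() {
        String name = NAMES[random.nextInt(NAMES.length)];
        int age = 18 + random.nextInt(60);
        return "{\"name\":\"" + name + "\",\"age\":" + age + "}";
    }

    public static void main(String[] args) throws InterruptedException {
        RandomJsonSketch generator = new RandomJsonSketch(42L);
        long emitIntervalMs = 100L; // the configurable delay between messages
        for (int i = 0; i < 5; i++) {
            System.out.println(generator.nextJson());
            Thread.sleep(emitIntervalMs);
        }
    }
}
```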

DummyJsonTerminalLogTopology

A simple topology that configures the Morphline context via a configuration file and then executes the Morphline handler for each incoming Tuple. In this topology, MorphlinesBolt is configured as a terminal bolt, which means that it does not emit a new Tuple for each input Tuple.

// Storm 1.x+ package names; older releases use backtype.storm.* instead.
import org.apache.storm.Config;
import org.apache.storm.LocalCluster;
import org.apache.storm.StormSubmitter;
import org.apache.storm.topology.TopologyBuilder;
// RandomJsonTestSpout, String2ByteArrayTupleMapper, CmnStormCons and MorphlinesBolt
// come from the sample project.

public class DummyJsonTerminalLogTopology {
    public static void main(String[] args) throws Exception {
        Config config = new Config();

        RandomJsonTestSpout spout = new RandomJsonTestSpout().withComplexJson(false);

        // Maps the incoming tuple field to the morphline input
        String2ByteArrayTupleMapper tuppleMapper = new String2ByteArrayTupleMapper();
        tuppleMapper.configure(CmnStormCons.TUPLE_FIELD_MSG);

        // Terminal bolt: no output fields and no record mapper are configured
        MorphlinesBolt morphBolt = new MorphlinesBolt()
                .withTupleMapper(tuppleMapper)
                .withMorphlineId("json_terminal_log")
                .withMorphlineConfFile("target/test-classes/morphline_confs/json_terminal_log.conf");

        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("WORD_SPOUT", spout, 1);
        builder.setBolt("MORPH_BOLT", morphBolt, 1).shuffleGrouping("WORD_SPOUT");

        if (args.length == 0) {
            // No arguments: run in a local cluster for 10 seconds, then stop
            LocalCluster cluster = new LocalCluster();
            cluster.submitTopology("MyDummyJsonTerminalLogTopology", config, builder.createTopology());
            Thread.sleep(10000);
            cluster.killTopology("MyDummyJsonTerminalLogTopology");
            cluster.shutdown();
            System.exit(0);
        } else if (args.length == 1) {
            // One argument: submit to a real cluster under the given topology name
            StormSubmitter.submitTopology(args[0], config, builder.createTopology());
        } else {
            System.out.println("Usage: DummyJsonTerminalLogTopology <topology_name>");
        }
    }
}

 

DummyJson2StringTopology

A simple topology that configures the Morphline context via a configuration file and then executes the Morphline handler for each incoming Tuple. In this topology, MorphlinesBolt is configured as a normal bolt, which means that it emits a new Tuple for each input Tuple.

// Storm 1.x+ package names; older releases use backtype.storm.* instead.
import org.apache.storm.Config;
import org.apache.storm.LocalCluster;
import org.apache.storm.StormSubmitter;
import org.apache.storm.topology.TopologyBuilder;
// The spout, mappers, bolts and CmnStormCons come from the sample project.

public class DummyJson2StringTopology {

    public static void main(String[] args) throws Exception {
        Config config = new Config();

        RandomJsonTestSpout spout = new RandomJsonTestSpout().withComplexJson(false);

        // Maps the incoming tuple field to the morphline input
        String2ByteArrayTupleMapper tuppleMapper = new String2ByteArrayTupleMapper();
        tuppleMapper.configure(CmnStormCons.TUPLE_FIELD_MSG);

        // Normal bolt: output fields and a record mapper are configured, so a new tuple is emitted
        MorphlinesBolt morphBolt = new MorphlinesBolt()
                .withTupleMapper(tuppleMapper)
                .withMorphlineId("json2string")
                .withMorphlineConfFile("target/test-classes/morphline_confs/json2string.conf")
                //.withOutputProcessors(Arrays.asList(resultRecordHandlers));
                .withOutputFields(CmnStormCons.TUPLE_FIELD_MSG)
                .withRecordMapper(RecordHandlerFactory.genDefaultRecordHandler(String.class, new JsonNode2StringResultMapper()));

        LoggingBolt printBolt = new LoggingBolt().withFields(CmnStormCons.TUPLE_FIELD_MSG);

        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("WORD_SPOUT", spout, 1);
        builder.setBolt("MORPH_BOLT", morphBolt, 1).shuffleGrouping("WORD_SPOUT");
        builder.setBolt("PRINT_BOLT", printBolt, 1).shuffleGrouping("MORPH_BOLT");

        if (args.length == 0) {
            // No arguments: run in a local cluster for 10 seconds, then stop
            LocalCluster cluster = new LocalCluster();
            cluster.submitTopology("MyDummyJson2StringTopology", config, builder.createTopology());
            Thread.sleep(10000);
            cluster.killTopology("MyDummyJson2StringTopology");
            cluster.shutdown();
            System.exit(0);
        } else if (args.length == 1) {
            // One argument: submit to a real cluster under the given topology name
            StormSubmitter.submitTopology(args[0], config, builder.createTopology());
        } else {
            System.out.println("Usage: DummyJson2StringTopology <topology_name>");
        }
    }
}

 

Final thoughts

MorphlinesBolt can be used as part of any configurable ETL “solution” (as a single processing Bolt, as a Terminal Bolt, as part of a complex pipeline, etc.).

[Figure: morphlines_storm_topology_examples]

Source code is provided as a Maven module (sv-etl-storm-morphlines) within my collection of sample projects on GitHub.

A great combination would be to use MorphlinesBolt with Flux. This might give you a fully configurable ETL topology!
 

Reposted from: https://my.oschina.net/Bettyty/blog/787547
