Storm Compatibility (beta)

Flink streaming is compatible with the Storm API, so projects written for Storm can be reused.

You can:

  • execute a whole Storm topology on Flink, or
  • use Storm Spouts and Bolts in place of Flink sources and operators.

This document shows how to reuse Storm code in Flink.

Project Configuration

Add the flink-storm dependency in order to run Storm code with Flink:

<dependency>
	<groupId>org.apache.flink</groupId>
	<artifactId>flink-storm</artifactId>
	<version>1.0-SNAPSHOT</version>
</dependency>

Please note: besides flink-storm, you also need to include Flink's other dependencies in your jar. See WordCount Storm within flink-storm-examples/pom.xml for an example of how to package a jar correctly.

Execute a Storm Topology

Flink provides a Storm-compatible API (org.apache.flink.storm.api) that offers the following replacement classes:

  • TopologyBuilder is replaced by FlinkTopologyBuilder
  • StormSubmitter is replaced by FlinkSubmitter
  • NimbusClient and Client are replaced by FlinkClient
  • LocalCluster is replaced by FlinkLocalCluster

To submit a Storm topology to Flink, use the classes above in place of their Storm counterparts in your client code. The code that actually runs, i.e., the Spouts and Bolts, can be used unmodified.

If a topology is to be executed on a remote cluster, the following parameters need to be configured.

The parameters nimbus.host and nimbus.thrift.port are used as jobmanager.rpc.address and jobmanager.rpc.port, respectively. If a parameter is not specified, the value is taken from flink-conf.yaml.

FlinkTopologyBuilder builder = new FlinkTopologyBuilder(); // replaces: TopologyBuilder builder = new TopologyBuilder();

// actual topology assembling code and used Spouts/Bolts can be used as-is
builder.setSpout("source", new FileSpout(inputFilePath));
builder.setBolt("tokenizer", new BoltTokenizer()).shuffleGrouping("source");
builder.setBolt("counter", new BoltCounter()).fieldsGrouping("tokenizer", new Fields("word"));
builder.setBolt("sink", new BoltFileSink(outputFilePath)).shuffleGrouping("counter");

Config conf = new Config();
if(runLocal) { // submit to test cluster
	FlinkLocalCluster cluster = new FlinkLocalCluster(); // replaces: LocalCluster cluster = new LocalCluster();
	cluster.submitTopology("WordCount", conf, builder.createTopology());
} else { // submit to remote cluster
	// optional
	// conf.put(Config.NIMBUS_HOST, "remoteHost");
	// conf.put(Config.NIMBUS_THRIFT_PORT, 6123);
	FlinkSubmitter.submitTopology("WordCount", conf, builder.createTopology()); // replaces: StormSubmitter.submitTopology(topologyId, conf, builder.createTopology());
}

Embed Storm Operators in Flink Streaming Programs

As an alternative, Spouts and Bolts can be embedded into regular Flink streaming programs. The Storm compatibility layer offers a wrapper class for each, namely SpoutWrapper and BoltWrapper (org.apache.flink.storm.wrappers). The tuples emitted by Storm operators are wrapped into Flink's Tuple types (i.e., Tuple0 to Tuple25, according to the number of output fields of the Storm operator). For single-field output tuples, a conversion to the field's plain data type is also possible (e.g., String instead of Tuple1<String>).

Because Flink cannot infer the output types of Storm operators, it is necessary to declare the output type manually. In order to obtain the correct TypeInformation object, Flink's TypeExtractor can be used.

Embed Spouts

In order to use a Spout as a Flink source, use StreamExecutionEnvironment.addSource(SourceFunction, TypeInformation). The Spout object is handed to the constructor of SpoutWrapper<OUT>, which serves as the first argument to addSource(...). The generic type OUT declares the data type of the Spout's output fields.

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

// stream has `raw` type (single field output streams only)
DataStream<String> rawInput = env.addSource(
	new SpoutWrapper<String>(new FileSpout(localFilePath), new String[] { Utils.DEFAULT_STREAM_ID }), // emit default output stream as raw type
	TypeExtractor.getForClass(String.class)); // output type

// process data stream
[...]

If the Spout emits a finite stream, SpoutWrapper can be configured with the numberOfInvocations parameter so that the Spout stops automatically. This allows the Flink program to shut down automatically after all data has been processed. By default, a program runs until it is canceled manually.
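
For illustration, the source from the example above can be bounded as follows. This is a minimal sketch assuming the SpoutWrapper constructor variant that additionally takes the number of invocations; the value 10 is arbitrary:

// stop the Spout after ten calls to nextTuple()
DataStream<String> boundedInput = env.addSource(
	new SpoutWrapper<String>(
		new FileSpout(localFilePath),
		new String[] { Utils.DEFAULT_STREAM_ID }, // emit default output stream as raw type
		10), // numberOfInvocations
	TypeExtractor.getForClass(String.class)); // output type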

Embed Bolts

In order to use a Bolt as a Flink operator, use DataStream.transform(String, TypeInformation, OneInputStreamOperator). The Bolt object is handed to the constructor of BoltWrapper<IN,OUT>, which serves as the last argument to transform(...). The generic types IN and OUT declare the input and output types of the operator, respectively.

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
DataStream<String> text = env.readTextFile(localFilePath);

DataStream<Tuple2<String, Integer>> counts = text.transform(
	"tokenizer", // operator name
	TypeExtractor.getForObject(new Tuple2<String, Integer>("", 0)), // output type
	new BoltWrapper<String, Tuple2<String, Integer>>(new BoltTokenizer())); // Bolt operator

// do further processing
[...]

Named Attribute Access for Embedded Bolts

Bolts can access input tuple fields by name (in addition to access by index). To use this feature with embedded Bolts, you need to have either a

  1. POJO type input stream or
  2. Tuple type input stream and specify the input schema (i.e., the name-to-index mapping)

For POJO input types, Flink accesses the fields via reflection. For this case, Flink expects either a corresponding public member variable or a public getter method. For example, if a Bolt accesses a field via the name sentence (e.g., String s = input.getStringByField("sentence");), the input POJO class must have a member variable public String sentence; or a method public String getSentence() { ... } (pay attention to camel-case naming).
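
For illustration, a minimal POJO input type for the sentence example could look as follows (the class name SentencePojo is hypothetical):

// hypothetical POJO type whose field is read via input.getStringByField("sentence")
public class SentencePojo {

	public String sentence; // public member variable, accessed by Flink via reflection

	public SentencePojo() {} // POJO types need a public no-argument constructor
}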

For Tuple input types, it is required to specify the input schema using Storm's Fields class. For this case, the constructor of BoltWrapper takes an additional argument: new BoltWrapper<Tuple1<String>, ...>(..., new Fields("sentence")). The input type is Tuple1<String>, and Fields("sentence") specifies that input.getStringByField("sentence") is equivalent to input.getString(0).
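
Put together, wiring such a Bolt into a program could look like the following sketch; BoltTokenizerByName is a hypothetical Bolt that reads its input via input.getStringByField("sentence"):

DataStream<Tuple1<String>> sentences = ...; // single-field input stream

DataStream<Tuple2<String, Integer>> counts = sentences.transform(
	"tokenizer-by-name", // operator name
	TypeExtractor.getForObject(new Tuple2<String, Integer>("", 0)), // output type
	new BoltWrapper<Tuple1<String>, Tuple2<String, Integer>>(
		new BoltTokenizerByName(), // hypothetical Bolt accessing fields by name
		new Fields("sentence"))); // input schema (name-to-index mapping)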

See BoltTokenizerWordCountPojo and BoltTokenizerWordCountWithNames for examples.

Configuring Spouts and Bolts

In Storm, Spouts and Bolts can be configured with a globally distributed Map object that is given to the submitTopology(...) method of LocalCluster or StormSubmitter. This Map is provided by the user next to the topology and gets forwarded as a parameter to the calls Spout.open(...) and Bolt.prepare(...). If a whole topology is executed in Flink using FlinkTopologyBuilder etc., there is no special attention required – it works as in regular Storm.

For embedded usage, Flink's configuration mechanism must be used. A global configuration can be set in a StreamExecutionEnvironment via .getConfig().setGlobalJobParameters(...). Flink's regular Configuration class can be used to configure Spouts and Bolts. However, Configuration does not support arbitrary key data types as Storm does (only String keys are allowed). Thus, Flink additionally provides the StormConfig class that can be used like a raw Map to provide full compatibility with Storm.

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

StormConfig config = new StormConfig();
// set config values
[...]

// set global Storm configuration
env.getConfig().setGlobalJobParameters(config);

// assemble program with embedded Spouts and/or Bolts
[...]
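
The values stored in the StormConfig then arrive in the usual Storm callbacks. For illustration, a sketch of a Bolt reading the hypothetical key output.path in prepare(...):

// global StormConfig values are forwarded to Bolt.prepare(...) as in regular Storm
public class ConfiguredBolt extends BaseRichBolt {

	private String path;

	@Override
	public void prepare(Map stormConf, TopologyContext context, OutputCollector collector) {
		this.path = (String) stormConf.get("output.path"); // "output.path" is a hypothetical key
	}

	[...] // execute(), declareOutputFields(), ...
}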

Multiple Output Streams

Flink can also handle the declaration of multiple output streams for Spouts and Bolts. If a whole topology is executed in Flink using FlinkTopologyBuilder etc., there is no special attention required – it works as in regular Storm.

For embedded usage, the output stream will be of data type SplitStreamType<T> and must be split by using DataStream.split(...) and SplitStream.select(...). Flink provides the predefined output selector StormStreamSelector<T> for .split(...) already. Furthermore, the wrapper type SplitStreamType<T> can be removed using SplitStreamMapper<T>.

[...]

// get DataStream from Spout or Bolt which declares two output streams s1 and s2 with output type SomeType
DataStream<SplitStreamType<SomeType>> multiStream = ...

SplitStream<SplitStreamType<SomeType>> splitStream = multiStream.split(new StormStreamSelector<SomeType>());

// remove SplitStreamType using SplitStreamMapper to get data stream of type SomeType
DataStream<SomeType> s1 = splitStream.select("s1").map(new SplitStreamMapper<SomeType>()).returns(SomeType.class);
DataStream<SomeType> s2 = splitStream.select("s2").map(new SplitStreamMapper<SomeType>()).returns(SomeType.class);

// do further processing on s1 and s2
[...]

See SpoutSplitExample.java for a full example.

Flink Extensions

Finite Spouts

In Flink, streaming sources can be finite, i.e., emit a finite number of records and stop after emitting the last record. However, Spouts usually emit infinite streams. The bridge between the two approaches is the FiniteSpout interface which, in addition to IRichSpout, contains a reachedEnd() method, where the user can specify a stopping condition. The user can create a finite Spout by implementing this interface instead of (or in addition to) IRichSpout, and implementing the reachedEnd() method. In contrast to a SpoutWrapper that is configured to emit a finite number of tuples, the FiniteSpout interface allows implementing more complex termination criteria.

Although finite Spouts are not necessary to embed Spouts into a Flink streaming program or to submit a whole Storm topology to Flink, there are cases where they may come in handy:

  • to make a native Spout behave the same way as a finite Flink source with minimal modifications
  • the user wants to process a stream only for some time; after that, the Spout can stop automatically
  • reading a file into a stream
  • for testing purposes

An example of a finite Spout that emits records for 10 seconds only:

public class TimedFiniteSpout extends BaseRichSpout implements FiniteSpout {
	[...] // implement open(), nextTuple(), ...

	private long starttime = System.currentTimeMillis();

	public boolean reachedEnd() {
		return System.currentTimeMillis() - starttime > 10000L;
	}
	}
}
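
Embedding such a finite Spout works like embedding any other Spout; the compatibility layer recognizes the FiniteSpout interface and terminates the source once reachedEnd() returns true. A sketch, assuming TimedFiniteSpout declares a single String output field:

DataStream<String> timedInput = env.addSource(
	new SpoutWrapper<String>(new TimedFiniteSpout(), new String[] { Utils.DEFAULT_STREAM_ID }),
	TypeExtractor.getForClass(String.class)); // output type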

Storm Compatibility Examples

More examples can be found in flink-storm-examples. For the different versions of WordCount, see README.md. You need a correctly assembled jar (flink-storm-examples-1.0-SNAPSHOT.jar) to run these programs.

In addition to the examples that wrap individual Spouts and Bolts, there are also examples that wrap complete topologies.

You can run each example via bin/flink run <jarname>.jar.
