Common transformations
Transformation | Description |
---|---|
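Map DataStream → DataStream | Takes one element and returns one element; cleaning or conversion logic can be applied in between. |
FlatMap DataStream → DataStream | Takes one element and returns zero, one, or more elements. |
Filter DataStream → DataStream | Evaluates a predicate for each incoming element and keeps only the elements that satisfy it. |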
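KeyBy DataStream → KeyedStream | Groups the stream by a key. Two kinds of types cannot be used as keys: 1. a POJO that does not override hashCode() and therefore relies on Object.hashCode(); 2. an array of any type. Basic types such as int and long are valid keys. |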
Reduce KeyedStream → DataStream | A rolling aggregation: combines the current element with the value returned by the previous reduce call and emits the new value. |
Aggregations KeyedStream → DataStream | Rolling aggregations on a keyed stream, such as sum, min, max, minBy, and maxBy. |
Aggregations on windows WindowedStream → DataStream | Aggregates the contents of a window. The difference between min and minBy is that min returns the minimum value, whereas minBy returns the element that has the minimum value in this field (same for max and maxBy). |
Union DataStream* → DataStream | Merges two or more streams into one. Note that all of the streams must have the same type. |
Connect DataStream,DataStream → ConnectedStreams | Similar to union, but it connects exactly two streams, and the two streams may have different types. |
CoMap, CoFlatMap ConnectedStreams → DataStream | Similar to map and flatMap on a connected data stream. These methods are typically applied after two streams have been connected: DataStream<String> dataStreamSource2 = streamExecutionEnvironment.addSource(new MyNoParalleSource()).map(t->{ return String.valueOf(t+"str"); }); ConnectedStreams<Long,String> connect = dataStreamSource1.connect(dataStreamSource2); SingleOutputStreamOperator<Object> env2 = connect.map(new CoMapFunction<Long, String, Object>() { @Override public Object map1(Long aLong) throws Exception { return aLong; } @Override public Object map2(String s) throws Exception { return s; } }); env2.print(); streamExecutionEnvironment.execute(); |
Split DataStream → SplitStream | Splits one stream into several streams according to a rule. DataStream<Long> even = split.select("even","odd"); |
Select SplitStream → DataStream | Select one or more streams from a split stream. |
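
The KeyBy and Reduce rows together describe a rolling aggregation per key. The following is a minimal plain-Java sketch of that behavior, not Flink code: the class `RollingReduceSketch` and the method `keyedRollingReduce` are hypothetical names introduced here, and an ordinary `Map` stands in for Flink's keyed state. It assumes Java 9 or later for `List.of`.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BinaryOperator;
import java.util.function.Function;

/** Plain-Java model of keyBy(...).reduce(...); illustrative only, not Flink code. */
final class RollingReduceSketch {

    /**
     * Emits one output per input: the first element seen for a key is emitted
     * unchanged, every later element is combined with the previous result for
     * that key.
     */
    static <T, K> List<T> keyedRollingReduce(List<T> input,
                                             Function<T, K> keySelector,
                                             BinaryOperator<T> reducer) {
        // Hypothetical stand-in for keyed state: last reduced value per key.
        Map<K, T> lastResultPerKey = new LinkedHashMap<>();
        List<T> emitted = new ArrayList<>();
        for (T element : input) {
            K key = keySelector.apply(element);
            T previous = lastResultPerKey.get(key);
            T result = (previous == null) ? element : reducer.apply(previous, element);
            lastResultPerKey.put(key, result);
            emitted.add(result);
        }
        return emitted;
    }

    public static void main(String[] args) {
        // Key = parity of the number, reduce = running sum per key.
        List<Integer> out = keyedRollingReduce(
                List.of(1, 2, 3, 4, 5), n -> n % 2, Integer::sum);
        System.out.println(out); // prints [1, 2, 4, 6, 9]
    }
}
```

Every input element produces exactly one output element, so odd and even numbers each carry their own running sum. In a Flink job the same idea would be expressed with `keyBy` followed by `reduce` on the resulting KeyedStream.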
Parallel operations
Transformation | Description |
---|---|
Custom partitioning DataStream → DataStream | Uses a user-defined Partitioner to select the target task for each element. |
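Random partitioning DataStream → DataStream | Distributes elements randomly across the downstream partitions. |
Rebalancing (Round-robin partitioning) DataStream → DataStream | Redistributes elements round-robin across all downstream partitions, which evens out the load and mitigates data skew. |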
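Rescaling DataStream → DataStream | If the upstream operation has a parallelism of 2 and the downstream operation has a parallelism of 4, one upstream subtask sends its output to two of the downstream subtasks and the other upstream subtask sends its output to the remaining two. If the upstream parallelism is 4 and the downstream parallelism is 2, two upstream subtasks send to one downstream subtask and the other two send to the other. If the two parallelisms are not multiples of each other, one or more downstream subtasks will receive input from a different number of upstream subtasks. The difference from Rebalance is that Rebalance repartitions across all downstream subtasks, whereas Rescaling only redistributes within these local groups. |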
Broadcasting DataStream → DataStream | Broadcasts elements to every partition. |
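
The difference between Rescaling and Rebalancing can be made concrete with a small routing model. This is a plain-Java sketch rather than Flink code: `PartitionRoutingSketch`, `rescaleTargets`, and `rebalanceTargets` are hypothetical names, the concrete subtask indices are an assumption chosen for illustration, and only the case where one parallelism is a multiple of the other is covered.

```java
import java.util.ArrayList;
import java.util.List;

/** Plain-Java model of which downstream subtasks an upstream subtask feeds; not Flink code. */
final class PartitionRoutingSketch {

    /** Rescaling: each upstream subtask distributes only within its local group. */
    static List<Integer> rescaleTargets(int upstreamIndex,
                                        int upstreamParallelism,
                                        int downstreamParallelism) {
        int larger = Math.max(upstreamParallelism, downstreamParallelism);
        int smaller = Math.min(upstreamParallelism, downstreamParallelism);
        if (larger % smaller != 0) {
            throw new IllegalArgumentException(
                    "this sketch only covers parallelisms that are multiples of each other");
        }
        int ratio = larger / smaller;
        List<Integer> targets = new ArrayList<>();
        if (downstreamParallelism >= upstreamParallelism) {
            // Fan-out: upstream i feeds downstream subtasks [i * ratio, (i + 1) * ratio).
            for (int j = 0; j < ratio; j++) {
                targets.add(upstreamIndex * ratio + j);
            }
        } else {
            // Fan-in: every group of `ratio` upstream subtasks shares one downstream subtask.
            targets.add(upstreamIndex / ratio);
        }
        return targets;
    }

    /** Rebalancing: every upstream subtask distributes over all downstream subtasks. */
    static List<Integer> rebalanceTargets(int downstreamParallelism) {
        List<Integer> targets = new ArrayList<>();
        for (int j = 0; j < downstreamParallelism; j++) {
            targets.add(j);
        }
        return targets;
    }

    public static void main(String[] args) {
        System.out.println(rescaleTargets(0, 2, 4)); // prints [0, 1]
        System.out.println(rescaleTargets(1, 2, 4)); // prints [2, 3]
        System.out.println(rescaleTargets(3, 4, 2)); // prints [1]
        System.out.println(rebalanceTargets(4));     // prints [0, 1, 2, 3]
    }
}
```

With Rescaling an upstream subtask only ever talks to its own group, so less data has to cross between groups; with Rebalancing every upstream subtask reaches every downstream subtask, which is what evens out skew at the cost of a full repartition.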
An example that uses Redis as a sink
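import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.redis.RedisSink;
import org.apache.flink.streaming.connectors.redis.common.config.FlinkJedisPoolConfig;
import org.apache.flink.streaming.connectors.redis.common.mapper.RedisCommand;
import org.apache.flink.streaming.connectors.redis.common.mapper.RedisCommandDescription;
import org.apache.flink.streaming.connectors.redis.common.mapper.RedisMapper;

// The class and main declarations were missing from the original listing;
// the class name RedisSinkExample is a placeholder.
public class RedisSinkExample {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.createLocalEnvironment();
        // Read newline-delimited text from a local socket.
        DataStream<String> text = env.socketTextStream("localhost", 9000, "\n");
        // Pair every line with the fixed Redis key "l_word".
        DataStream<Tuple2<String, String>> l_word = text.map(new MapFunction<String, Tuple2<String, String>>() {
            @Override
            public Tuple2<String, String> map(String s) throws Exception {
                return new Tuple2<String, String>("l_word", s);
            }
        });
        FlinkJedisPoolConfig localhost = new FlinkJedisPoolConfig.Builder().setHost("localhost").setPort(6379)
                .build();
        final RedisSink<Tuple2<String, String>> tuple2RedisSink = new RedisSink<Tuple2<String, String>>(localhost,
                new MyRedisMapper());
        l_word.addSink(tuple2RedisSink);
        env.execute();
    }

    public static class MyRedisMapper implements RedisMapper<Tuple2<String, String>> {
        // LPUSH prepends each value to the Redis list stored under the key.
        public RedisCommandDescription getCommandDescription() {
            return new RedisCommandDescription(RedisCommand.LPUSH);
        }
        public String getKeyFromData(Tuple2<String, String> stringStringTuple2) {
            return stringStringTuple2.f0;
        }
        public String getValueFromData(Tuple2<String, String> stringStringTuple2) {
            return stringStringTuple2.f1;
        }
    }
}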