Flink DataStream API - Transformations

Map:DataStream → DataStream

Each element of the new DataStream corresponds one-to-one to an element of the original DataStream.

import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class MapDemo {
	
	private static int index = 1;
	
	public static void main(String[] args) throws Exception {
		
		StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
		
		DataStream<String> dataStream = env.readTextFile("F:/test.txt");
		
		// new MapFunction<String, String>(): the first String is the input type, the second the output type
		DataStream<String> newDataStream = dataStream.map(new MapFunction<String, String>() {

			@Override
			public String map(String value) throws Exception {
				// prefix each line with a running index
				return (index++) + ". You entered: " + value;
			}
		});
		
		newDataStream.print();
		
		env.execute("map demo start");
	}

}

Contents of F:/test.txt:

Takes
one
element
and
produces
one
element

Output:

2> 1. You entered: one
1> 2. You entered: Takes
5> 3. You entered: produces
2> 4. You entered: element
4> 5. You entered: and
7> 6. You entered: element
6> 7. You entered: one

Converting the String stream read from the file into an Integer stream (the file must now contain one integer per line, otherwise parseInt throws a NumberFormatException):

public class MapDemo {
	
	public static void main(String[] args) throws Exception {
		
		StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
		
		DataStream<String> dataStream = env.readTextFile("F:/test.txt");
		
		// new MapFunction<String, Integer>(): String is the input type, Integer the output type
		DataStream<Integer> newDataStream = dataStream.map(new MapFunction<String, Integer>() {

			@Override
			public Integer map(String value) throws Exception {
				// parse each line as an integer
				return Integer.parseInt(value);
			}
		});
		
		newDataStream.print();
		
		env.execute("map demo start");
	}

}

FlatMap:DataStream → DataStream

Each element of the original DataStream produces zero, one, or more elements in the new DataStream.

import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.util.Collector;

public class FlatMapDemo {
	
	public static void main(String[] args) throws Exception {
		
		StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
		
		DataStream<String> dataStream = env.readTextFile("F:/test.txt");
		
		
		DataStream<String> newDataStream = dataStream.flatMap(new FlatMapFunction<String, String>() {

			@Override
			public void flatMap(String value, Collector<String> out) throws Exception {
				// split each line on whitespace and emit every token
				String[] strAry = value.split("\\s+");
				for(String str : strAry) {
					out.collect(str);
				}
			}
			
		});
		
		newDataStream.print();
		
		env.execute("flatMap demo start");
	}

}

Contents of F:/test.txt:

Takes one element and produces zero, one, or more elements. A flatmap function that splits sentences to words

Output:

1> Takes
1> one
1> element
1> and
1> produces
1> zero,
1> one,
1> or
1> more
1> elements.
1> A
1> flatmap
1> function
1> that
1> splits
1> sentences
1> to
1> words

Filter:DataStream → DataStream

Evaluates a predicate for each element of the DataStream and keeps only the elements for which it returns true.

import org.apache.flink.api.common.functions.FilterFunction;
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class FilterDemo {
	
	public static void main(String[] args) throws Exception {
		
		StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
		
		DataStream<String> dataStream = env.readTextFile("F:/test.txt");
		
		
		DataStream<Integer> newDataStream = dataStream.map(new MapFunction<String, Integer>() {

			@Override
			public Integer map(String value) throws Exception {
				return Integer.parseInt(value);
			}
		}).filter(new FilterFunction<Integer>() {
			
			@Override
			public boolean filter(Integer value) throws Exception {
				// drop all values equal to 0
				return value != 0;
			}
		});
		
		newDataStream.print();
		
		env.execute("filter demo start");
	}

}

Contents of F:/test.txt:

0
1
0
2
3
4

Output:

2> 1
5> 2
8> 4
6> 3

KeyBy:DataStream → KeyedStream

Logically partitions the stream into disjoint partitions; all records with the same key are assigned to the same partition. Internally, keyBy() is implemented with hash partitioning. There are several ways to specify the key:

dataStream.keyBy("someKey") // Key by field "someKey"
dataStream.keyBy(0) // Key by the first element of a Tuple

A type cannot be used as a key if:

  1. it is a POJO type that does not override hashCode() and relies on the default Object.hashCode() implementation;
  2. it is an array of any type.
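Why arrays fail as keys can be demonstrated in plain Java, without any Flink dependency: arrays inherit identity-based equals and hashCode from Object, so two arrays with identical contents would be treated as different keys. A minimal sketch:

```java
import java.util.Arrays;

public class ArrayKeyProblem {

	public static void main(String[] args) {
		int[] a = {1, 2, 3};
		int[] b = {1, 2, 3};

		// the contents are equal ...
		System.out.println(Arrays.equals(a, b)); // true
		// ... but equals() is identity-based, so hash partitioning would
		// route these two "equal" keys to different partitions
		System.out.println(a.equals(b)); // false
	}
}
```

The same reasoning applies to POJOs that rely on Object.hashCode(): equal field values do not guarantee equal hash codes, so partitioning on them cannot be consistent.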
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class KeyByDemo {
	
	public static void main(String[] args) throws Exception {
		
		StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
		
		DataStream<String> dataStream = env.readTextFile("F:/test.txt");
		
		DataStream<WordWithCount> newDataStream = dataStream.map(
				new MapFunction<String, WordWithCount>() {

			@Override
			public WordWithCount map(String value) throws Exception {
				return new WordWithCount(value, 1L);
			}
		}).keyBy("word");
		
		newDataStream.print();
		
		env.execute("keyBy demo start");
	}
	
	/**
	 * Data type for words with count.
	 */
	public static class WordWithCount {

		public String word;
		public long count;

		public WordWithCount() {}

		public WordWithCount(String word, long count) {
			this.word = word;
			this.count = count;
		}

		@Override
		public String toString() {
			return word + " : " + count;
		}
	}

}

Contents of F:/test.txt:

hello
word
hello word
hello
word
hello word

Output:

4> hello word : 1
4> hello word : 1
6> word : 1
6> word : 1
3> hello : 1
3> hello : 1

Reduce (incremental aggregation): KeyedStream → DataStream

Incremental aggregation: a computation is performed, and a result emitted, every time an element arrives.
Aggregates the elements of a logical partition into a single element; each reduce step produces a new value.
reduce cannot be applied directly to a SingleOutputStreamOperator, because that object is an unbounded stream, and merging unbounded data is meaningless. reduce therefore has to run over grouped data or over a window, i.e. over data produced by keyBy or window/timeWindow: the ReduceFunction merges each element with the result of the previous reduce call and emits the merged result.

Note: open the socket on the Linux side first (nc -l 9000), then start the main method.

import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.common.functions.ReduceFunction;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.time.Time;

public class ReduceDemo {
	
	public static void main(String[] args) throws Exception {
		
		StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
		
		DataStream<String> dataStream = env.socketTextStream("192.168.220.150", 9000);
		
		
		DataStream<WordWithCount> newDataStream = dataStream.map(
				new MapFunction<String, WordWithCount>() {

			@Override
			public WordWithCount map(String value) throws Exception {
				return new WordWithCount(value, 1L);
			}
		}).keyBy("word")
		  .timeWindow(Time.seconds(5)) // process once every 5 seconds
		  .reduce(new ReduceFunction<WordWithCount>() {
			// the two inputs are merged; the result feeds into the next invocation
			@Override
			public WordWithCount reduce(WordWithCount a, WordWithCount b) throws Exception {
				// keep the word, add the counts
				return new WordWithCount(a.word, a.count + b.count);
			}
			}
		});
		
		newDataStream.print().setParallelism(1);
		
		env.execute("reduce demo start");
	}
	
	/**
	 * Data type for words with count.
	 */
	public static class WordWithCount {

		public String word;
		public long count;

		public WordWithCount() {}

		public WordWithCount(String word, long count) {
			this.word = word;
			this.count = count;
		}

		@Override
		public String toString() {
			return word + " : " + count;
		}
	}

}

Socket input (screenshot omitted; within one window, "a" was entered four times and "b" three times):

Output:

b : 3
a : 4
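The rolling merge performed by the ReduceFunction above can be sketched in plain Java, without a cluster. The input sequence below is chosen to reproduce the demo's output (four "a" lines and three "b" lines within one window); the per-key map stands in for Flink's keyed window state:

```java
import java.util.HashMap;
import java.util.Map;

public class RollingReduceSketch {

	public static void main(String[] args) {
		String[] input = {"a", "b", "a", "b", "a", "b", "a"}; // 4 x a, 3 x b
		Map<String, Long> state = new HashMap<>();

		for (String word : input) {
			// same logic as the ReduceFunction above: keep the word, add the counts;
			// each arriving element is merged with the previous partial result
			state.merge(word, 1L, Long::sum);
		}
		System.out.println("a : " + state.get("a")); // a : 4
		System.out.println("b : " + state.get("b")); // b : 3
	}
}
```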

Aggregation functions (incremental aggregation): KeyedStream → DataStream

Performs a "rolling" aggregation on a KeyedStream. The difference between min and minBy is that min returns the minimum value of the given field (the other fields keep previously seen values), while minBy returns the element that holds the minimum value in that field (max and maxBy behave analogously).

keyedStream.sum(0);
keyedStream.sum("key");
keyedStream.min(0);
keyedStream.min("key");
keyedStream.max(0);
keyedStream.max("key");
keyedStream.minBy(0);
keyedStream.minBy("key");
keyedStream.maxBy(0);
keyedStream.maxBy("key");

sum - Demo

import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.KeyedStream;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.sink.PrintSinkFunction;

public class SumDemo {
	
	public static void main(String[] args) throws Exception {

        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        KeyedStream keyedStream = env.fromElements(Tuple2.of(2L, 3L), Tuple2.of(1L, 5L), Tuple2.of(1L, 7L), Tuple2.of(2L, 4L), Tuple2.of(1L, 2L))
                .keyBy(0); // key by the first field of the tuple
        // sum field 0 -- note that this is the key field itself
        SingleOutputStreamOperator<Tuple2> sumStream = keyedStream.sum(0);
        // the key field is accumulated, while the value stays at the first value seen for that key
        sumStream.addSink(new PrintSinkFunction<>()).setParallelism(1);
        env.execute("execute");
    }

}

Result:

(1,5)
(2,5)
(3,5)
(2,3)
(4,3)

min - Demo

import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.KeyedStream;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.sink.PrintSinkFunction;

public class MinDemo {
	
	public static void main(String[] args) throws Exception {

        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        KeyedStream keyedStream = env.fromElements(Tuple2.of(2L, 3L), Tuple2.of(1L, 5L), Tuple2.of(1L, 7L), Tuple2.of(2L, 4L), Tuple2.of(1L, 2L))
                .keyBy(0); // key by the first field of the tuple
        // take the min of the second field (position 1)
        SingleOutputStreamOperator<Tuple2> minStream = keyedStream.min(1);
        // min only tracks the minimum of field 1; use minBy to get the whole element that carries the minimum
        minStream.addSink(new PrintSinkFunction<>()).setParallelism(1);
        env.execute("execute");
    }

}

Result:

(1,5)
(1,5)
(1,2)
(2,3)
(2,3)
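With two-field tuples, min and minBy produce the same visible results, so their difference is easy to miss. The sketch below is plain Java (not Flink API), using hypothetical three-field records {key, value, tag} to make the difference visible: min only updates the aggregated field, while minBy replaces the whole record:

```java
import java.util.Arrays;

public class MinVsMinBySketch {

	public static void main(String[] args) {
		// hypothetical records: {key, value, tag}, all with the same key
		long[][] input = {{1, 5, 100}, {1, 7, 300}, {1, 2, 200}};

		// min(1): rolling minimum of the value field; other fields keep earlier values
		long[] minAcc = input[0].clone();
		// minBy(1): the whole record that carries the minimum value
		long[] minByAcc = input[0];

		for (int i = 1; i < input.length; i++) {
			long[] cur = input[i];
			if (cur[1] < minAcc[1]) {
				minAcc[1] = cur[1]; // only the aggregated field changes
			}
			if (cur[1] < minByAcc[1]) {
				minByAcc = cur;     // the entire record is replaced
			}
		}
		System.out.println(Arrays.toString(minAcc));   // [1, 2, 100] -- tag still from the first record
		System.out.println(Arrays.toString(minByAcc)); // [1, 2, 200] -- the record that actually had the minimum
	}
}
```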

Joins

Window Join

Elements from the two streams that fall into the same window interval are combined pairwise. The window interval can be based on event time or processing time, and the shape of each combined pair is defined by the apply function. This behaves like an inner join on tables; the general pattern is:

stream.join(otherStream)
    .where(<KeySelector>)
    .equalTo(<KeySelector>)
    .window(<WindowAssigner>)
    .apply(<JoinFunction>)

Tumbling Window Join

In a tumbling window join, all elements that share a common key and a common tumbling window are joined as pairwise combinations and passed to a JoinFunction or FlatJoinFunction. Because this behaves like an inner join, elements of one stream that have no element from the other stream in their tumbling window are not emitted.

import org.apache.flink.api.common.functions.JoinFunction;
import org.apache.flink.api.java.functions.KeySelector;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;
 
...

DataStream<Integer> orangeStream = ...
DataStream<Integer> greenStream = ...

orangeStream.join(greenStream)
    .where(<KeySelector>)
    .equalTo(<KeySelector>)
    .window(TumblingEventTimeWindows.of(Time.milliseconds(2)))
    .apply(new JoinFunction<Integer, Integer, String>() {
        @Override
        public String join(Integer first, Integer second) {
            return first + "," + second;
        }
    });

Sliding Window Join

In a sliding window join, all elements with a common key and a common sliding window are joined as pairwise combinations and passed to a JoinFunction or FlatJoinFunction. Elements of one stream that have no element from the other stream in the current sliding window are not emitted.

import org.apache.flink.api.common.functions.JoinFunction;
import org.apache.flink.api.java.functions.KeySelector;
import org.apache.flink.streaming.api.windowing.assigners.SlidingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

...

DataStream<Integer> orangeStream = ...
DataStream<Integer> greenStream = ...

orangeStream.join(greenStream)
    .where(<KeySelector>)
    .equalTo(<KeySelector>)
    .window(SlidingEventTimeWindows.of(Time.milliseconds(2) /* size */, Time.milliseconds(1) /* slide */))
    .apply(new JoinFunction<Integer, Integer, String>() {
        @Override
        public String join(Integer first, Integer second) {
            return first + "," + second;
        }
    });

Session Window Join

In a session window join, all elements within windows that satisfy the same session criteria are joined as pairwise combinations and passed to a JoinFunction or FlatJoinFunction. Again, this performs an inner join, so if a session window contains elements from only one stream, no output is emitted!

import org.apache.flink.api.common.functions.JoinFunction;
import org.apache.flink.api.java.functions.KeySelector;
import org.apache.flink.streaming.api.windowing.assigners.EventTimeSessionWindows;
import org.apache.flink.streaming.api.windowing.time.Time;
 
...

DataStream<Integer> orangeStream = ...
DataStream<Integer> greenStream = ...

orangeStream.join(greenStream)
    .where(<KeySelector>)
    .equalTo(<KeySelector>)
    .window(EventTimeSessionWindows.withGap(Time.milliseconds(1)))
    .apply(new JoinFunction<Integer, Integer, String>() {
        @Override
        public String join(Integer first, Integer second) {
            return first + "," + second;
        }
    });

Interval Join

On the event-time axis, each element of the driving stream (orangeStream below) spans an interval around its own timestamp, and it is combined only with those elements of the other stream whose timestamps fall into that interval. The interval's lower bound (a negative value) and upper bound are the two arguments of between:

import org.apache.flink.api.java.functions.KeySelector;
import org.apache.flink.streaming.api.functions.co.ProcessJoinFunction;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.util.Collector;

...

DataStream<Integer> orangeStream = ...
DataStream<Integer> greenStream = ...

orangeStream
    .keyBy(<KeySelector>)
    .intervalJoin(greenStream.keyBy(<KeySelector>))
    .between(Time.milliseconds(-2), Time.milliseconds(1))
    .process(new ProcessJoinFunction<Integer, Integer, String>() {

        @Override
        public void processElement(Integer left, Integer right, Context ctx, Collector<String> out) {
            out.collect(left + "," + right);
        }
        }
    });
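The pairing rule of the interval join can be sketched in plain Java (no Flink dependency). The timestamps and values below are hypothetical, the bounds correspond to between(Time.milliseconds(-2), Time.milliseconds(1)) above, and all elements are assumed to share one key:

```java
import java.util.ArrayList;
import java.util.List;

public class IntervalJoinSketch {

	public static void main(String[] args) {
		// hypothetical {timestamp, value} pairs; a single key, for brevity
		long[][] orange = {{0, 0}, {2, 2}, {5, 5}};
		long[][] green  = {{0, 0}, {1, 1}, {6, 6}};
		long lower = -2, upper = 1; // between(Time.milliseconds(-2), Time.milliseconds(1))

		List<String> joined = new ArrayList<>();
		for (long[] o : orange) {
			for (long[] g : green) {
				// g joins o iff  o.ts + lower <= g.ts <= o.ts + upper
				if (g[0] >= o[0] + lower && g[0] <= o[0] + upper) {
					joined.add(o[1] + "," + g[1]);
				}
			}
		}
		System.out.println(joined); // [0,0, 0,1, 2,0, 2,1, 5,6]
	}
}
```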


Data Partitioning

Communication is usually the dominant cost in a distributed system, all the more so when a data-processing workload transfers large volumes of data. Controlling how data is distributed across the transport channels to achieve good network performance is therefore an important topic when building a streaming engine.
Without data partitioning, every parallel instance of a Join node has to aggregate data from all Source node instances, and the resulting data transfer can overload the network. With data partitioning, Source and Join nodes are connected one-to-one, so the two connected tasks can run in one thread of the same Slot.

  • Custom Partition: partition the data according to the specified key.

    dataStream.partitionCustom(partitioner, "someKey")
    
  • Random Partition: distribute the data uniformly at random to the downstream nodes.

    dataStream.shuffle();
    
  • Rebalance Partition: distribute the data evenly to the downstream nodes in round-robin fashion. Under some physical topologies this is the most effective partitioning method, e.g. when the Source and the operator nodes are deployed on different physical machines.

    dataStream.rebalance();
    
  • Rescale Partition: the Flink engine adjusts the data distribution within a job according to resource usage and to how the physical instances share resources at deployment time, aiming to keep data flowing within the same Slot as much as possible and thereby reduce network overhead.

    dataStream.rescale();
    
  • Broadcasting Partition: every element is broadcast to all downstream nodes.

    dataStream.broadcast();
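The contrast between key-hash distribution (as in keyBy) and round-robin distribution (as in rebalance) can be sketched in plain Java. This is only an illustration: Flink's actual implementation hashes into key groups and is more involved than a plain hashCode modulo:

```java
public class PartitionSketch {

	public static void main(String[] args) {
		String[] records = {"a", "b", "a", "c", "b", "a"};
		int parallelism = 3;

		// hash partitioning (keyBy-style): the same key always lands on the same channel
		System.out.print("hash:      ");
		for (String r : records) {
			int channel = Math.abs(r.hashCode() % parallelism);
			System.out.print(r + "->" + channel + " ");
		}
		System.out.println();

		// round-robin (rebalance-style): even spread regardless of key
		System.out.print("rebalance: ");
		int next = 0;
		for (String r : records) {
			System.out.print(r + "->" + next + " ");
			next = (next + 1) % parallelism;
		}
		System.out.println();
	}
}
```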
    

Resource Sharing

Flink chains multiple tasks into a single task running in one thread, which reduces context-switching overhead and cache footprint while increasing throughput and lowering latency. The mechanism is configurable:

// Start a new chain: in the code below the last two map functions are chained together, while the first map is not part of that chain.
dataStream.map(...).map(...).startNewChain().map(...);

// Disable chaining, so instances of this operator never share a thread with any other operator.
dataStream.map(...).disableChaining();

// Slot sharing group: all task instances in the same group run in the same Slot, isolated from instances of other groups.
dataStream.map(...).slotSharingGroup("groupName");