- flatmap
flatMap(new FlatMapFunction<String, Tuple2<String, Integer>>() {
public void flatMap(String value, Collector<Tuple2<String, Integer>> out) {
for (String token : value.split("\\W+")) {
if (token.length() > 0) {
out.collect(new Tuple2<>(token, 1));
}
}
}
})
- There is no reduceBy; use keyBy (followed by reduce or another aggregation) instead
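To make the keyBy-then-reduce idea concrete, here is a minimal plain-Java simulation. It is not Flink API: the class `KeyedRollingReduce` and method `keyByThenSum` are hypothetical names. It assumes the behavior of a keyed rolling reduce, where each incoming record updates the accumulator of its key and the updated value is emitted.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Plain-Java simulation (hypothetical names, not Flink API): partition records
// by key, then fold each key's values with a reduce function.
final class KeyedRollingReduce {

    // Returns one "key=runningCount" entry per input record, in arrival order.
    static List<String> keyByThenSum(List<String> words) {
        Map<String, Integer> stateByKey = new HashMap<>(); // one accumulator per key
        List<String> emitted = new ArrayList<>();
        for (String word : words) {
            // reduce(a, b) = a + b, applied to the accumulator of this key only
            int updated = stateByKey.merge(word, 1, Integer::sum);
            emitted.add(word + "=" + updated);
        }
        return emitted;
    }
}
```

In a Flink job the same shape would typically be written as keyBy followed by reduce or sum on the keyed stream.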
- window
A tumbling window splits the stream into non-overlapping windows; each event belongs to exactly one window.
// tumbling time window of 1 minute length
.timeWindow(Time.minutes(1))
Sliding time window: windows may overlap, so one event can belong to several windows.
// sliding time window of 1 minute length and 30 secs trigger interval
.timeWindow(Time.minutes(1), Time.seconds(30))
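The difference between the two window types comes down to simple arithmetic on the event timestamp. The sketch below is plain Java with hypothetical names (`WindowAssignment`, `tumblingStart`, `slidingStarts`), assuming non-negative millisecond timestamps and a window offset of zero; it is an illustration of the assignment rule, not Flink's own classes.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java sketch of window assignment (hypothetical names, not Flink API).
final class WindowAssignment {

    // A tumbling window of length `size` starts at the timestamp rounded down to a multiple of size.
    static long tumblingStart(long timestamp, long size) {
        return timestamp - (timestamp % size);
    }

    // A sliding window of length `size` that advances every `slide` contains the
    // timestamp in every window whose start lies in (timestamp - size, timestamp].
    static List<Long> slidingStarts(long timestamp, long size, long slide) {
        List<Long> starts = new ArrayList<>();
        long lastStart = timestamp - (timestamp % slide);
        for (long start = lastStart; start > timestamp - size; start -= slide) {
            starts.add(start);
        }
        return starts;
    }
}
```

For an event at 70 s with a 1 minute window, the tumbling window starts at 60 s; with a 30 s slide the event falls into the windows starting at 60 s and at 30 s.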
- join
DataStream
stream.join(otherStream)
.where(<KeySelector>)
.equalTo(<KeySelector>)
.window(<WindowAssigner>)
.apply(<JoinFunction>)
DataSet
weightedRatings =
ratings.join(weights)
// key of the first input
.where("category")
// key of the second input
.equalTo("f0")
// applying the JoinFunction on joining pairs
.with(new PointWeighter());
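Both join variants follow the same pattern: pick a key from each side, match equal keys, and hand every matching pair to a join function. A minimal plain-Java sketch of that inner equi-join follows. `InnerEquiJoin` and its parameters are hypothetical names, and windowing is deliberately left out.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BiFunction;
import java.util.function.Function;

// Plain-Java sketch of an inner equi-join (hypothetical names, not Flink API).
final class InnerEquiJoin {

    static <K, L, R, O> List<O> join(
            List<L> left, Function<L, K> leftKey,     // plays the role of where(...)
            List<R> right, Function<R, K> rightKey,   // plays the role of equalTo(...)
            BiFunction<L, R, O> joinFunction) {       // plays the role of with(...) / apply(...)
        // Index the right side by key so each left element finds its partners quickly.
        Map<K, List<R>> rightByKey = new HashMap<>();
        for (R r : right) {
            rightByKey.computeIfAbsent(rightKey.apply(r), k -> new ArrayList<>()).add(r);
        }
        List<O> result = new ArrayList<>();
        for (L l : left) {
            // Left elements without a partner produce no output (inner join).
            for (R r : rightByKey.getOrDefault(leftKey.apply(l), new ArrayList<>())) {
                result.add(joinFunction.apply(l, r));
            }
        }
        return result;
    }
}
```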
- With Periodic Watermarks
Generates watermarks based either on the elements arriving in the stream or purely on processing time.
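Watermark generation and emission is triggered periodically: every N milliseconds a watermark is injected into the stream automatically (200 ms by default once event time is enabled). The interval is set with ExecutionConfig.setAutoWatermarkInterval.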
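Each time getCurrentWatermark is called, the returned watermark is emitted into the stream only if it is non-null and larger than the previous one.
A maximum out-of-orderness (delay bound) can be defined.
Implement the AssignerWithPeriodicWatermarks interface.
watermark = largest event timestamp seen so far - t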
/**
* This generator generates watermarks assuming that elements arrive out of order,
* but only to a certain degree. The latest elements for a certain timestamp t will arrive
* at most n milliseconds after the earliest elements for timestamp t.
*/
public class BoundedOutOfOrdernessGenerator implements AssignerWithPeriodicWatermarks<MyEvent> {
private final long maxOutOfOrderness = 3500; // 3.5 seconds
private long currentMaxTimestamp;
@Override
public long extractTimestamp(MyEvent element, long previousElementTimestamp) {
long timestamp = element.getCreationTime();
currentMaxTimestamp = Math.max(timestamp, currentMaxTimestamp);
return timestamp;
}
@Override
public Watermark getCurrentWatermark() {
// return the watermark as current highest timestamp minus the out-of-orderness bound
return new Watermark(currentMaxTimestamp - maxOutOfOrderness);
}
}
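To see what the bounded-out-of-orderness rule means for individual events, the following plain-Java sketch replays the same arithmetic without any Flink dependency. `BoundedOutOfOrdernessSketch` and `observe` are hypothetical names; treating an event as late when its timestamp is at or below the current watermark is the assumption made here.

```java
// Plain-Java sketch of the watermark arithmetic (hypothetical names, not Flink API).
final class BoundedOutOfOrdernessSketch {
    private final long maxOutOfOrderness;
    private long currentMaxTimestamp;

    BoundedOutOfOrdernessSketch(long maxOutOfOrdernessMillis) {
        this.maxOutOfOrderness = maxOutOfOrdernessMillis;
        // Start low enough that the subtraction in currentWatermark() cannot underflow.
        this.currentMaxTimestamp = Long.MIN_VALUE + maxOutOfOrdernessMillis;
    }

    // watermark = largest timestamp seen so far - out-of-orderness bound
    long currentWatermark() {
        return currentMaxTimestamp - maxOutOfOrderness;
    }

    // Records the event and reports whether it arrived at or behind the current watermark.
    boolean observe(long eventTimestamp) {
        boolean late = eventTimestamp <= currentWatermark();
        currentMaxTimestamp = Math.max(currentMaxTimestamp, eventTimestamp);
        return late;
    }
}
```

With a bound of 3500 ms and events at 10000 and 12000, the watermark is 8500: an event at 9000 is still on time, while one at 8000 is late.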
watermark = current time - t; this variant depends only on processing time, not on the event timestamps
/**
* This generator generates watermarks that are lagging behind processing time by a fixed amount.
* It assumes that elements arrive in Flink after a bounded delay.
*/
public class TimeLagWatermarkGenerator implements AssignerWithPeriodicWatermarks<MyEvent> {
private final long maxTimeLag = 5000; // 5 seconds
@Override
public long extractTimestamp(MyEvent element, long previousElementTimestamp) {
return element.getCreationTime();
}
@Override
public Watermark getCurrentWatermark() {
// return the watermark as current time minus the maximum time lag
return new Watermark(System.currentTimeMillis() - maxTimeLag);
}
}
- tuple
Create
Tuple3.of(input1.f0, input1.f1, input1.f2 + input2.f2)
new Tuple3<>(tuple.f0,true,c); // or use the constructor
Tuple2.of(closestCentroidId, p);
new Tuple2<>(tuple.f0,true); // or use the constructor
Access fields: X.f0, X.f1
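- Flink iterations
Create an IterativeDataSet `initial`, apply transformations to it to obtain dataset2, and finally call initial.closeWith(dataset2). The result is fed back in place of the original `initial` dataset, and Flink checks whether the loop should end.
Estimating pi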
final ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
// Create initial IterativeDataSet
IterativeDataSet<Integer> initial = env.fromElements(0).iterate(10000);
DataSet<Integer> iteration = initial.map(new MapFunction<Integer, Integer>() {
@Override
public Integer map(Integer i) throws Exception {
double x = Math.random();
double y = Math.random();
return i + ((x * x + y * y < 1) ? 1 : 0);
}
});
// Iteratively transform the IterativeDataSet
DataSet<Integer> count = initial.closeWith(iteration);
count.map(new MapFunction<Integer, Double>() {
@Override
public Double map(Integer count) throws Exception {
return count / (double) 10000 * 4;
}
}).print();
env.execute("Iterative Pi Example");
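The control flow of a bulk iteration can be sketched as an ordinary loop: the step function's output becomes the next round's input until the maximum iteration count is reached. The plain-Java version below uses hypothetical names (`BulkIterationSketch`, `iterate`, `estimatePi`) and a seeded `Random` so runs are repeatable; it mirrors the pi example but is not how Flink executes it.

```java
import java.util.Random;
import java.util.function.IntUnaryOperator;

// Plain-Java sketch of a bulk iteration (hypothetical names, not Flink API).
final class BulkIterationSketch {

    // Runs `step` maxIterations times, feeding each result back as the next input.
    static int iterate(int initial, int maxIterations, IntUnaryOperator step) {
        int current = initial;
        for (int i = 0; i < maxIterations; i++) {
            current = step.applyAsInt(current);
        }
        return current;
    }

    // Monte Carlo estimate: the fraction of random points inside the unit quarter circle is about pi / 4.
    static double estimatePi(int samples, long seed) {
        Random random = new Random(seed);
        int hits = iterate(0, samples, count -> {
            double x = random.nextDouble();
            double y = random.nextDouble();
            return count + ((x * x + y * y < 1) ? 1 : 0);
        });
        return hits / (double) samples * 4;
    }
}
```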
KMeans
// get input data:
// read the points and centroids from the provided paths or fall back to default data
DataSet<Point> points = getPointDataSet(params, env);
DataSet<Centroid> centroids = getCentroidDataSet(params, env);
// set number of bulk iterations for KMeans algorithm
IterativeDataSet<Centroid> loop = centroids.iterate(10000);
DataSet<Centroid> newCentroids = points
// compute closest centroid for each point
.map(new SelectNearestCenter()).withBroadcastSet(loop, "centroids")
// count and sum point coordinates for each centroid
.map(new CountAppender())
.groupBy(0).reduce(new CentroidAccumulator())
// compute new centroids from point counts and coordinate sums
.map(new CentroidAverager());
// feed new centroids back into next iteration
DataSet<Centroid> finalCentroids = loop.closeWith(newCentroids, new TerminationCriterionImpl().getTerminatedDataSet(newCentroids, loop));
DataSet<Tuple2<Integer, Point>> clusteredPoints = points
// assign points to final clusters
.map(new SelectNearestCenter()).withBroadcastSet(finalCentroids, "centroids");
public class TerminationCriterionImpl extends TerminationCriterion {
public FilterOperator<Tuple2<Centroid,Centroid>> getTerminatedDataSet(DataSet<Centroid> newCentroids, DataSet<Centroid> oldCentroids) throws Exception {
return newCentroids.join(oldCentroids).where("id").equalTo("id").
filter (new FilterFunction<Tuple2<Centroid,Centroid>>(){
@Override
public boolean filter(Tuple2<Centroid,Centroid> value) {
return Math.sqrt((value.f0.x - value.f1.x) * (value.f0.x - value.f1.x) +
(value.f0.y - value.f1.y) * (value.f0.y- value.f1.y))>EPSILON;
}
});
}
}
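The termination criterion works by emptiness: the filter keeps only centroids that moved more than EPSILON, and the iteration stops once that set is empty or the maximum iteration count is reached. A minimal plain-Java sketch of the same loop on one-dimensional points is shown below. `KMeansLoopSketch` and its parameters are hypothetical, and keeping an empty cluster's centroid unchanged is an assumption of this sketch.

```java
// Plain-Java sketch of k-means with a convergence check (hypothetical names, not Flink API).
final class KMeansLoopSketch {

    static double[] run(double[] points, double[] initialCentroids, int maxIterations, double epsilon) {
        double[] centroids = initialCentroids.clone();
        for (int iteration = 0; iteration < maxIterations; iteration++) {
            double[] sums = new double[centroids.length];
            int[] counts = new int[centroids.length];
            // Assign each point to its nearest centroid and accumulate sum and count.
            for (double p : points) {
                int nearest = 0;
                for (int c = 1; c < centroids.length; c++) {
                    if (Math.abs(p - centroids[c]) < Math.abs(p - centroids[nearest])) {
                        nearest = c;
                    }
                }
                sums[nearest] += p;
                counts[nearest]++;
            }
            // Average the assigned points; note whether any centroid moved more than epsilon.
            boolean anyMoved = false;
            for (int c = 0; c < centroids.length; c++) {
                double updated = counts[c] == 0 ? centroids[c] : sums[c] / counts[c];
                if (Math.abs(updated - centroids[c]) > epsilon) {
                    anyMoved = true;
                }
                centroids[c] = updated;
            }
            // An empty "moved" set plays the role of the empty termination-criterion data set.
            if (!anyMoved) {
                break;
            }
        }
        return centroids;
    }
}
```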
- state
ValueState
The value can be set using x.update(value) and retrieved using x.value().
public class CountWindowAverage extends RichFlatMapFunction<Tuple2<Long, Long>, Tuple2<Long, Long>> {
/**
* The ValueState handle. The first field is the count, the second field a running sum.
*/
private transient ValueState<Tuple2<Long, Long>> sum;
@Override
public void flatMap(Tuple2<Long, Long> input, Collector<Tuple2<Long, Long>> out) throws Exception {
// access the state value
Tuple2<Long, Long> currentSum = sum.value();
// update the count
currentSum.f0 += 1;
// add the second field of the input value
currentSum.f1 += input.f1;
// update the state
sum.update(currentSum);
// if the count reaches 2, emit the average and clear the state
if (currentSum.f0 >= 2) {
out.collect(new Tuple2<>(input.f0, currentSum.f1 / currentSum.f0));
sum.clear();
}
}
@Override
public void open(Configuration config) {
ValueStateDescriptor<Tuple2<Long, Long>> descriptor =
new ValueStateDescriptor<>(
"average", // the state name
TypeInformation.of(new TypeHint<Tuple2<Long, Long>>() {}), // type information
Tuple2.of(0L, 0L)); // default value of the state, if nothing was set
sum = getRuntimeContext().getState(descriptor);
}
}
// this can be used in a streaming program like this (assuming we have a StreamExecutionEnvironment env)
env.fromElements(Tuple2.of(1L, 3L), Tuple2.of(1L, 5L), Tuple2.of(1L, 7L), Tuple2.of(1L, 4L), Tuple2.of(1L, 2L))
.keyBy(0)
.flatMap(new CountWindowAverage())
.print();
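Keyed ValueState behaves like a map from the current key to a single value that the function reads, updates, and clears. The plain-Java sketch below replays the CountWindowAverage logic against such a map. `CountWindowAverageSketch` is a hypothetical name, and the `HashMap` merely stands in for Flink's managed, fault-tolerant state.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Plain-Java sketch of keyed ValueState usage (hypothetical names, not Flink API).
final class CountWindowAverageSketch {
    // One {count, sum} pair per key, standing in for ValueState<Tuple2<Long, Long>>.
    private final Map<Long, long[]> stateByKey = new HashMap<>();

    // Returns the records emitted for this input as {key, average} pairs.
    List<long[]> flatMap(long key, long value) {
        List<long[]> out = new ArrayList<>();
        long[] current = stateByKey.getOrDefault(key, new long[] {0L, 0L}); // like value()
        current[0] += 1;      // update the count
        current[1] += value;  // add to the running sum
        stateByKey.put(key, current);                                       // like update()
        if (current[0] >= 2) {
            out.add(new long[] {key, current[1] / current[0]});
            stateByKey.remove(key);                                         // like clear()
        }
        return out;
    }
}
```

Feeding it (1,3), (1,5), (1,7), (1,4), (1,2) emits (1,4) and then (1,5); the last element stays in state because the count has not reached 2 again.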
- sink
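Result output. You can use the sinks Flink already provides, such as Kafka, JDBC, and Elasticsearch, or implement your own custom sink.
- process
ProcessFunction is a low-level stream processing operation that gives access to the basic building blocks of all (acyclic) streaming applications: events (stream elements), state (fault-tolerant, consistent, only on keyed streams), and timers (event time and processing time, only on keyed streams).
A ProcessFunction can be thought of as a FlatMapFunction with access to keyed state and timers. It handles events by being invoked for each event received in the input stream.
Customize it by defining open, close, and processElement.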
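How state and timers combine can be sketched without Flink: each element updates per-key state and registers a timer, and when the watermark passes a timer the function decides whether to emit. The plain-Java simulation below uses hypothetical names (`CountWithTimeoutSketch`, `advanceWatermark`) and an assumed 60 second timeout; it emits a key's count only if no newer element for that key arrived before the timer fired.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Plain-Java sketch of keyed state plus event-time timers (hypothetical names, not Flink API).
final class CountWithTimeoutSketch {
    static final long TIMEOUT_MS = 60_000; // assumed timeout for this illustration

    private final Map<String, long[]> stateByKey = new HashMap<>();    // {count, lastModified}
    private final TreeMap<Long, Set<String>> timers = new TreeMap<>(); // fire time -> keys

    // Called once per element: update the key's state and register a timer.
    void processElement(String key, long timestamp) {
        long[] state = stateByKey.computeIfAbsent(key, k -> new long[2]);
        state[0]++;
        state[1] = timestamp;
        timers.computeIfAbsent(timestamp + TIMEOUT_MS, t -> new LinkedHashSet<>()).add(key);
    }

    // Fires every timer at or below the watermark and returns the emitted "key=count" records.
    List<String> advanceWatermark(long watermark) {
        List<String> emitted = new ArrayList<>();
        while (!timers.isEmpty() && timers.firstKey() <= watermark) {
            Map.Entry<Long, Set<String>> due = timers.pollFirstEntry();
            for (String key : due.getValue()) {
                long[] state = stateByKey.get(key);
                // Only the timer set by the key's latest element emits; older timers are stale.
                if (state != null && due.getKey() == state[1] + TIMEOUT_MS) {
                    emitted.add(key + "=" + state[0]);
                }
            }
        }
        return emitted;
    }
}
```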