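I. Two-stream join
The format of the source logs is shown in the full code (section 4).
1. inner join
The join() operator provides "window join" semantics: an inner join on the specified key within a (tumbling/sliding/session) window.
The example below inner-joins the exposure log with the click log:
//1. inner join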
DataStream<String> innerJoinStream = clickTupleStream
.join(infoTupleStream)
.where(tuple -> tuple.f0) //key of the click log stream clickTupleStream
.equalTo(tuple -> tuple.f0) //key of the exposure log stream infoTupleStream
.window(TumblingProcessingTimeWindows.of(Time.seconds(10)))
.apply(new JoinFunction<Tuple2<String, String>, Tuple2<String, String>, String>() {
@Override
public String join(Tuple2<String, String> click, Tuple2<String, String> info) throws Exception {
return click.f0 + " " + click.f1 + " " + info.f1;
}
});
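To make the window join semantics concrete, here is a minimal plain-Java sketch with no Flink dependency. The class name WindowJoinSketch, the Event type, and the explicit timestamps are assumptions made for illustration; Flink performs the window assignment internally. Two records are paired only when they share a key and fall into the same tumbling window.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Plain-Java sketch of tumbling-window inner join semantics; not Flink code. */
final class WindowJoinSketch {

    /** Hypothetical event: a join key, a payload, and a timestamp in milliseconds. */
    static final class Event {
        final String key;
        final String value;
        final long timestampMs;

        Event(String key, String value, long timestampMs) {
            this.key = key;
            this.value = value;
            this.timestampMs = timestampMs;
        }
    }

    /** Start of the tumbling window (offset 0) that contains the timestamp. */
    static long windowStart(long timestampMs, long sizeMs) {
        return timestampMs - Math.floorMod(timestampMs, sizeMs);
    }

    /** Pairs every click with every exposure that shares its key and its window. */
    static List<String> innerJoin(List<Event> clicks, List<Event> infos, long sizeMs) {
        List<String> out = new ArrayList<>();
        for (Event click : clicks) {
            for (Event info : infos) {
                boolean sameKey = click.key.equals(info.key);
                boolean sameWindow =
                        windowStart(click.timestampMs, sizeMs) == windowStart(info.timestampMs, sizeMs);
                if (sameKey && sameWindow) {
                    out.add(click.key + " " + click.value + " " + info.value);
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Event> clicks = Arrays.asList(new Event("u1", "click:page", 12_000L));
        List<Event> infos = Arrays.asList(
                new Event("u1", "info:pageinfo", 9_000L),   // window [0, 10000): not joined
                new Event("u1", "info:pageinfo", 11_000L)); // window [10000, 20000): joined
        System.out.println(innerJoin(clicks, infos, 10_000L));
    }
}
```

The exposure at 9 s is dropped even though it is only 3 s away from the click, because the two records sit on opposite sides of a window boundary. This is the limitation that motivates the interval join in section 3.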
The source of JoinFunction is as follows:
package org.apache.flink.api.common.functions;
import org.apache.flink.annotation.Public;
import java.io.Serializable;
/**
* Interface for Join functions. Joins combine two data sets by joining their
* elements on specified keys. This function is called with each pair of joining elements.
*
* <p>By default, the joins follows strictly the semantics of an "inner join" in SQL.
* the semantics are those of an "inner join", meaning that elements are filtered out
* if their key is not contained in the other data set.
*
* <p>The basic syntax for using Join on two data sets is as follows:
* <pre>{@code
* DataSet<X> set1 = ...;
* DataSet<Y> set2 = ...;
*
* set1.join(set2).where(<key-definition>).equalTo(<key-definition>).with(new MyJoinFunction());
* }</pre>
*
* <p>{@code set1} is here considered the first input, {@code set2} the second input.
*
* <p>The Join function is an optional part of a join operation. If no JoinFunction is provided,
* the result of the operation is a sequence of 2-tuples, where the elements in the tuple are those that
* the JoinFunction would have been invoked with.
*
* <p>Note: You can use a {@link CoGroupFunction} to perform an outer join.
*
* @param <IN1> The type of the elements in the first input.
* @param <IN2> The type of the elements in the second input.
* @param <OUT> The type of the result elements.
*/
@Public
@FunctionalInterface
public interface JoinFunction<IN1, IN2, OUT> extends Function, Serializable {
/**
* The join method, called once per joined pair of elements.
*
* @param first The element from first input.
* @param second The element from second input.
* @return The resulting element.
*
* @throws Exception This method may throw exceptions. Throwing an exception will cause the operation
* to fail and may trigger recovery.
*/
OUT join(IN1 first, IN2 second) throws Exception;
}
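2. outer join
The coGroup() operator can be used to implement left/right outer joins.
The examples below perform a left and a right outer join between the exposure log and the click log:
//left outer join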
DataStream<String> leftOutJoinStream = infoTupleStream
.coGroup(clickTupleStream)
.where(tuple -> tuple.f0)
.equalTo(tuple -> tuple.f0)
.window(TumblingProcessingTimeWindows.of(Time.seconds(10)))
.apply(new CoGroupFunction<Tuple2<String, String>, Tuple2<String, String>, String>() {
@Override
public void coGroup(Iterable<Tuple2<String, String>> infoIterable, Iterable<Tuple2<String, String>> clickIterable, Collector<String> collector) throws Exception {
//iterate over the left stream
for (Tuple2<String, String> infoRecord : infoIterable) {
boolean isMatched = false;
//iterate over the right stream
for (Tuple2<String, String> clickRecord : clickIterable) {
//the right stream has a matching record
collector.collect(infoRecord.f0 + " " + infoRecord.f1 + " " + clickRecord.f1);
isMatched = true;
}
if (!isMatched) {
//no data in the right stream
collector.collect(infoRecord.f0 + " " + infoRecord.f1 + " " + null);
}
}
}
});
//right outer join
DataStream<String> rightOutJoinStream = clickTupleStream
.coGroup(infoTupleStream)
.where(tuple -> tuple.f0)
.equalTo(tuple -> tuple.f0)
.window(TumblingProcessingTimeWindows.of(Time.seconds(10)))
.apply(new CoGroupFunction<Tuple2<String, String>, Tuple2<String, String>, String>() {
@Override
public void coGroup(Iterable<Tuple2<String, String>> clickIterable, Iterable<Tuple2<String, String>> infoIterable, Collector<String> collector) throws Exception {
//iterate over the right stream
for (Tuple2<String, String> infoRecord : infoIterable) {
//track matches per right-stream record, mirroring the left outer join
boolean isMatched = false;
//iterate over the left stream
for (Tuple2<String, String> clickRecord : clickIterable) {
//matching record in the left stream
collector.collect(infoRecord.f0 + " " + clickRecord.f1 + " " + infoRecord.f1);
isMatched = true;
}
if (!isMatched) {
//no data in the left stream
collector.collect(infoRecord.f0 + " " + null + " " + infoRecord.f1);
}
}
}
});
The source of CoGroupFunction is as follows:
package org.apache.flink.api.common.functions;
import org.apache.flink.annotation.Public;
import org.apache.flink.util.Collector;
import java.io.Serializable;
/**
* The interface for CoGroup functions. CoGroup functions combine two data sets by first grouping each data set
* after a key and then "joining" the groups by calling this function with the two sets for each key.
* If a key is present in only one of the two inputs, it may be that one of the groups is empty.
*
* <p>The basic syntax for using CoGroup on two data sets is as follows:
* <pre>{@code
* DataSet<X> set1 = ...;
* DataSet<Y> set2 = ...;
*
* set1.coGroup(set2).where(<key-definition>).equalTo(<key-definition>).with(new MyCoGroupFunction());
* }</pre>
*
* <p>{@code set1} is here considered the first input, {@code set2} the second input.
*
* <p>Some keys may only be contained in one of the two original data sets. In that case, the CoGroup function is invoked
* with in empty input for the side of the data set that did not contain elements with that specific key.
*
* @param <IN1> The data type of the first input data set.
* @param <IN2> The data type of the second input data set.
* @param <O> The data type of the returned elements.
*/
@Public
@FunctionalInterface
public interface CoGroupFunction<IN1, IN2, O> extends Function, Serializable {
/**
* This method must be implemented to provide a user implementation of a
* coGroup. It is called for each pair of element groups where the elements share the
* same key.
*
* @param first The records from the first input.
* @param second The records from the second.
* @param out A collector to return elements.
*
* @throws Exception The function may throw Exceptions, which will cause the program to cancel,
* and may trigger the recovery logic.
*/
void coGroup(Iterable<IN1> first, Iterable<IN2> second, Collector<O> out) throws Exception;
}
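The outer-join logic that sits on top of coGroup() can be isolated from Flink and checked on its own. The following is a minimal plain-Java sketch; the class CoGroupOuterJoinSketch and its method names are hypothetical, and the two lists stand in for the two Iterables that one coGroup() call receives for a single key and window.

```java
import java.util.ArrayList;
import java.util.List;

/** Plain-Java sketch of outer joins built on coGroup-style groups; not Flink code. */
final class CoGroupOuterJoinSketch {

    /**
     * Left outer join for one key: every left record is emitted at least once,
     * padded with "null" when the right group is empty.
     */
    static List<String> leftOuterJoin(String key, List<String> left, List<String> right) {
        List<String> out = new ArrayList<>();
        for (String l : left) {
            if (right.isEmpty()) {
                out.add(key + " " + l + " null");
            } else {
                for (String r : right) {
                    out.add(key + " " + l + " " + r);
                }
            }
        }
        return out;
    }

    /** Right outer join for one key: the mirror image, driven by the right group. */
    static List<String> rightOuterJoin(String key, List<String> left, List<String> right) {
        List<String> out = new ArrayList<>();
        for (String r : right) {
            if (left.isEmpty()) {
                out.add(key + " null " + r);
            } else {
                for (String l : left) {
                    out.add(key + " " + l + " " + r);
                }
            }
        }
        return out;
    }
}
```

Because coGroup() is also invoked when one side is empty, the function sees keys that a plain join() would silently drop, which is what makes the null-padded rows possible.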
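3. interval join
Both join() and coGroup() associate records within a window. In some cases, however, the two streams do not advance in step. For example, a click record may be written long after the corresponding exposure record, and if both are bounded by the same window the match is easily missed. Flink therefore also provides "interval join" semantics, which joins on the specified key within a time interval of the right stream relative to the left stream. Per the Flink documentation, interval join currently supports event time only, so both streams need timestamps and watermarks.
//3. interval join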
SingleOutputStreamOperator<String> intervalJoinStream = infoTupleStream
.keyBy(record -> record.f0)
.intervalJoin(clickTupleStream.keyBy(record -> record.f0))
.between(Time.seconds(-30), Time.seconds(30)) //time interval of the right stream relative to the left stream
.process(new ProcessJoinFunction<Tuple2<String, String>, Tuple2<String, String>, String>() {
@Override
public void processElement(Tuple2<String, String> infoRecord, Tuple2<String, String> clickRecord, Context context, Collector<String> collector) throws Exception {
collector.collect(infoRecord.f0 + " " + infoRecord.f1 + " " + clickRecord.f1);
}
});
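The matching condition behind between(lower, upper) can be written as a single predicate. The sketch below is plain Java with hypothetical names (IntervalJoinSketch, inInterval); it assumes the default behaviour in which both bounds are inclusive.

```java
/** Plain-Java sketch of the interval join condition; not Flink code. */
final class IntervalJoinSketch {

    /**
     * A right-stream record matches a left-stream record with the same key when
     * leftTs + lowerMs <= rightTs <= leftTs + upperMs (both bounds inclusive).
     */
    static boolean inInterval(long leftTsMs, long rightTsMs, long lowerMs, long upperMs) {
        return rightTsMs >= leftTsMs + lowerMs && rightTsMs <= leftTsMs + upperMs;
    }

    public static void main(String[] args) {
        long exposureTs = 100_000L;
        // With between(-30 s, +30 s), clicks from 70 s to 130 s match this exposure.
        System.out.println(inInterval(exposureTs, 130_000L, -30_000L, 30_000L));
        System.out.println(inInterval(exposureTs, 130_001L, -30_000L, 30_000L));
    }
}
```

Unlike the window join, the interval is anchored at each left-stream record, so two records that are close in time always match regardless of where window boundaries would have fallen.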
4. Full code
package com.scallion.job;
import com.scallion.common.Common;
import com.scallion.utils.FlinkUtil;
import org.apache.flink.api.common.functions.CoGroupFunction;
import org.apache.flink.api.common.functions.JoinFunction;
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.functions.co.ProcessJoinFunction;
import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.util.Collector;
/**
* created by gaowj.
* created on 2021-05-13.
* function:
* origin ->
*/
public class TwoStreamJoinJob implements Job {
@Override
public void run() {
/**
* source
*/
//click log
DataStream<String> clickStream = FlinkUtil.getKafkaStream(Common.KAFKA_BROKER, Common.APP_NEWSAPP_TOPIC, Common.KAFKA_CONSUMER_GROUP_ID);
//exposure log
DataStream<String> infoStream = FlinkUtil.getKafkaStream(Common.KAFKA_BROKER, Common.APP_NEWSAPP_INFO_TOPIC, Common.KAFKA_CONSUMER_GROUP_ID);
/**
* transform
*/
SingleOutputStreamOperator<Tuple2<String, String>> clickTupleStream = clickStream.map(new MapFunction<String, Tuple2<String, String>>() {
@Override
public Tuple2<String, String> map(String record) throws Exception {
String[] split = record.split("\t");
//split[5]: userkey
//split[11]: click action type, commonly action, page, duration, btomnews
Tuple2<String, String> tuple = new Tuple2<>(split[5], "click:" + split[11]);
return tuple;
}
});
SingleOutputStreamOperator<Tuple2<String, String>> infoTupleStream = infoStream.map(new MapFunction<String, Tuple2<String, String>>() {
@Override
public Tuple2<String, String> map(String record) throws Exception {
String[] split = record.split("\t");
//split[5]: userkey
//split[11]: exposure action type, currently only pageinfo
Tuple2<String, String> tuple = new Tuple2<>(split[5], "info:" + split[11]);
return tuple;
}
});
//1. inner join
DataStream<String> innerJoinStream = clickTupleStream
.join(infoTupleStream)
.where(tuple -> tuple.f0)
.equalTo(tuple -> tuple.f0)
.window(TumblingProcessingTimeWindows.of(Time.seconds(10)))
.apply(new JoinFunction<Tuple2<String, String>, Tuple2<String, String>, String>() {
@Override
public String join(Tuple2<String, String> click, Tuple2<String, String> info) throws Exception {
return click.f0 + " " + click.f1 + " " + info.f1;
}
});
//2. left|right outer join
DataStream<String> leftOutJoinStream = infoTupleStream
.coGroup(clickTupleStream)
.where(tuple -> tuple.f0)
.equalTo(tuple -> tuple.f0)
.window(TumblingProcessingTimeWindows.of(Time.seconds(10)))
.apply(new CoGroupFunction<Tuple2<String, String>, Tuple2<String, String>, String>() {
@Override
public void coGroup(Iterable<Tuple2<String, String>> infoIterable, Iterable<Tuple2<String, String>> clickIterable, Collector<String> collector) throws Exception {
//iterate over the left stream
for (Tuple2<String, String> infoRecord : infoIterable) {
boolean isMatched = false;
//iterate over the right stream
for (Tuple2<String, String> clickRecord : clickIterable) {
//the right stream has a matching record
collector.collect(infoRecord.f0 + " " + infoRecord.f1 + " " + clickRecord.f1);
isMatched = true;
}
if (!isMatched) {
//no data in the right stream
collector.collect(infoRecord.f0 + " " + infoRecord.f1 + " " + null);
}
}
}
});
DataStream<String> rightOutJoinStream = clickTupleStream
.coGroup(infoTupleStream)
.where(tuple -> tuple.f0)
.equalTo(tuple -> tuple.f0)
.window(TumblingProcessingTimeWindows.of(Time.seconds(10)))
.apply(new CoGroupFunction<Tuple2<String, String>, Tuple2<String, String>, String>() {
@Override
public void coGroup(Iterable<Tuple2<String, String>> clickIterable, Iterable<Tuple2<String, String>> infoIterable, Collector<String> collector) throws Exception {
//iterate over the right stream
for (Tuple2<String, String> infoRecord : infoIterable) {
//track matches per right-stream record, mirroring the left outer join
boolean isMatched = false;
//iterate over the left stream
for (Tuple2<String, String> clickRecord : clickIterable) {
//matching record in the left stream
collector.collect(infoRecord.f0 + " " + clickRecord.f1 + " " + infoRecord.f1);
isMatched = true;
}
if (!isMatched) {
//no data in the left stream
collector.collect(infoRecord.f0 + " " + null + " " + infoRecord.f1);
}
}
}
});
//3. interval join
SingleOutputStreamOperator<String> intervalJoinStream = infoTupleStream
.keyBy(record -> record.f0)
.intervalJoin(clickTupleStream.keyBy(record -> record.f0))
.between(Time.seconds(-30), Time.seconds(30))
.process(new ProcessJoinFunction<Tuple2<String, String>, Tuple2<String, String>, String>() {
@Override
public void processElement(Tuple2<String, String> infoRecord, Tuple2<String, String> clickRecord, Context context, Collector<String> collector) throws Exception {
collector.collect(infoRecord.f0 + " " + infoRecord.f1 + " " + clickRecord.f1);
}
});
/**
* sink
*/
// innerJoinStream.print();
// leftOutJoinStream.print();
// rightOutJoinStream.print();
intervalJoinStream.print();
}
}
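II. Dimension-table join
1. Preloading | periodically reloading the dimension table
Use the open() method of a RichFunction to preload the dimension table into memory, optionally refreshing it on a timer. This suits dimension tables that are small and rarely updated; the data can be loaded from an external store such as Redis, HBase, or MySQL.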
package com.scallion.transform;
import com.scallion.utils.TimeUtil;
import org.apache.flink.api.common.functions.RichMapFunction;
import org.apache.flink.configuration.Configuration;
import java.util.Timer;
import java.util.TimerTask;
/**
* created by gaowj.
* created on 2021-05-14.
* function: preload | periodically reload dimension table data
* origin ->
*/
public class JoinWithDimMapFunction extends RichMapFunction<String, String> {
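//pick a data structure that fits the dimension data and keep it in memory;
//a Google Guava cache (CacheBuilder) is another option.
//volatile: written by the timer thread, read by the task thread in map()
private volatile String dimRecord;
private transient Timer timer;
//open() is called before the working method map(), so it is the place for setup such as connecting to external systems (HBase, Redis, MySQL);
//for functions that are part of an iteration, it is called at the start of each iteration superstep;
//here a timer simulates periodically fetching the dimension data from an external system and caching it in memory.
@Override
public void open(Configuration parameters) throws Exception {
TimerTask task = new TimerTask() {
@Override
public void run() {
dimRecord = "dimension data as of " + TimeUtil.getTimestampToDate(System.currentTimeMillis());
}
};
//daemon thread, so the timer cannot keep the JVM alive
timer = new Timer(true);
timer.schedule(task, 0, 1000);
}
/**
* @param record user click log
* @return
* @throws Exception
*/
@Override
public String map(String record) throws Exception {
String[] split = record.split("\t");
String res = "userkey:" + split[5] + " opa:" + split[11] + " joined with " + dimRecord;
return res;
}
@Override
public void close() throws Exception {
//stop the refresh thread when the task shuts down
if (timer != null) {
timer.cancel();
}
super.close();
}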
}
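As an alternative to a single String refreshed by java.util.Timer, the cache can be kept as an immutable map snapshot that is swapped atomically on each refresh. The sketch below is plain Java under stated assumptions: the class RefreshingDimCache and the loader supplier are hypothetical, and the loader stands in for whatever query reads the dimension table from Redis, HBase, or MySQL.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

/** Sketch of a periodically refreshed in-memory dimension cache; not Flink code. */
final class RefreshingDimCache implements AutoCloseable {

    private final Supplier<Map<String, String>> loader; // hypothetical external lookup
    private final ScheduledExecutorService scheduler;
    // volatile: replaced by the refresh thread, read by the processing thread
    private volatile Map<String, String> snapshot = Collections.emptyMap();

    RefreshingDimCache(Supplier<Map<String, String>> loader) {
        this.loader = loader;
        this.scheduler = Executors.newSingleThreadScheduledExecutor(runnable -> {
            Thread thread = new Thread(runnable, "dim-refresh");
            thread.setDaemon(true);
            return thread;
        });
    }

    /** Loads once synchronously, then refreshes every periodMs milliseconds. */
    void start(long periodMs) {
        refresh();
        scheduler.scheduleAtFixedRate(this::refreshQuietly, periodMs, periodMs, TimeUnit.MILLISECONDS);
    }

    /** Copies the loader's result so readers never observe a half-updated map. */
    void refresh() {
        snapshot = Collections.unmodifiableMap(new HashMap<>(loader.get()));
    }

    /** A failed refresh keeps the previous snapshot and does not cancel the schedule. */
    private void refreshQuietly() {
        try {
            refresh();
        } catch (RuntimeException e) {
            // keep serving the previous snapshot
        }
    }

    String lookup(String key) {
        return snapshot.get(key);
    }

    @Override
    public void close() {
        scheduler.shutdownNow();
    }
}
```

In a RichMapFunction, start() would be called from open(), lookup() from map(), and close() from close(). The synchronous first load means map() never sees an empty cache, at the cost of a slower open().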
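2. Dimension table in hot storage: use async I/O to raise lookup throughput
Databases that can serve dimension-table joins through async I/O include MySQL, Oracle, Redis, and HBase.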
package com.scallion.transform;
import com.scallion.common.Common;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.async.ResultFuture;
import org.apache.flink.streaming.api.functions.async.RichAsyncFunction;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.util.Collections;
/**
* created by gaowj.
* created on 2021-05-17.
* function: async I/O
* origin ->
*/
public class AsyncIOMySQLFunction extends RichAsyncFunction<String, String> {
private PreparedStatement ps;
private Connection conn;
@Override
public void open(Configuration parameters) throws Exception {
Class.forName(Common.DRIVERNAME);
//assign the field, not a local variable, so that close() can release the connection
conn = DriverManager.getConnection(Common.JDBCURL, Common.USERNAME, Common.PASSWORD);
ps = conn.prepareStatement("select id,name,age,sex from tongji.rt_binlog_to_kudu where name='zhanghao'");
}
@Override
public void close() throws Exception {
if (ps != null) {
ps.close();
}
if (conn != null) {
conn.close();
}
}
@Override
public void asyncInvoke(String input, ResultFuture<String> resultFuture) throws Exception {
//note: JDBC's executeQuery() blocks the calling thread; a fully asynchronous lookup needs an async client or a separate thread pool
String sqlStr = "";
try (ResultSet rs = ps.executeQuery()) {
if (rs.next()) {
sqlStr = rs.getInt("id") +
rs.getString("name") +
rs.getInt("age") +
rs.getString("sex");
}
}
resultFuture.complete(Collections.singletonList(input.split("\t")[5] + " " + sqlStr));
}
}
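Because a JDBC query blocks, a common way to keep asyncInvoke() from stalling is to hand the lookup to a thread pool and complete the result from a callback. The sketch below shows that pattern in plain Java. All names are hypothetical: AsyncLookupSketch is not a Flink class, blockingLookup stands in for the JDBC query, and the Consumer plays the role of Flink's ResultFuture.

```java
import java.util.Collection;
import java.util.Collections;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executor;
import java.util.function.Consumer;
import java.util.function.Function;

/** Sketch of wrapping a blocking lookup so that the caller is not blocked; not Flink code. */
final class AsyncLookupSketch {

    /**
     * Runs the blocking lookup on the given pool and passes the joined record to the
     * callback once the lookup finishes. A failed lookup yields an empty dimension value.
     */
    static CompletableFuture<Void> asyncInvoke(
            String input,
            Function<String, String> blockingLookup,
            Executor pool,
            Consumer<Collection<String>> resultFuture) {
        return CompletableFuture
                .supplyAsync(() -> blockingLookup.apply(input), pool)
                .exceptionally(error -> "")
                .thenAccept(dim -> resultFuture.accept(Collections.singletonList(input + " " + dim)));
    }
}
```

In a RichAsyncFunction the pool would typically be created in open() and shut down in close(), and the callback would call resultFuture.complete(...). Each pooled thread needs its own connection or a connection pool, since a single JDBC Connection should not be shared across concurrent queries.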
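RichAsyncFunction extends AbstractRichFunction and implements AsyncFunction, so it offers the open()/close() lifecycle methods alongside asyncInvoke().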
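3. Broadcast dimension table
Use Flink's Broadcast State to broadcast the dimension stream to the downstream operator and join there. Pros and cons:
Pro: dimension updates are picked up promptly.
Con: the data is held in memory by the state backend, so only a fairly small amount can be kept.
MapStateDescriptor<String, String> broadcastDesc = new MapStateDescriptor<>("broad1", String.class, String.class);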
BroadcastStream<Tuple2<String, String>> broadcastStream = socketDimStream.broadcast(broadcastDesc);
SingleOutputStreamOperator<String> broadcastWithDimStream = clickStream
.map(new MapFunction<String, String>() {
@Override
public String map(String line) throws Exception {
return line.split("\t")[5].trim();
}
})
.keyBy(line -> line)
.window(ProcessingTimeSessionWindows.withGap(Time.seconds(30)))
.reduce((line1, line2) -> line1)
.connect(broadcastStream)
.process(new DimBroadcastProcessFunction(broadcastDesc));
package com.scallion.transform;
import org.apache.flink.api.common.state.MapStateDescriptor;
import org.apache.flink.api.common.state.ReadOnlyBroadcastState;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.functions.co.BroadcastProcessFunction;
import org.apache.flink.util.Collector;
/**
* created by gaowj.
* created on 2021-05-17.
* function: broadcast dimension table
* origin ->
*/
public class DimBroadcastProcessFunction extends BroadcastProcessFunction<String, Tuple2<String, String>, String> {
MapStateDescriptor<String, String> broadcastDesc;
public DimBroadcastProcessFunction(MapStateDescriptor<String, String> broadcastDesc) {
this.broadcastDesc = broadcastDesc;
}
//called for elements of the non-broadcast stream
@Override
public void processElement(String input, ReadOnlyContext ctx, Collector<String> collector) throws Exception {
//read the broadcast state
ReadOnlyBroadcastState<String, String> state = ctx.getBroadcastState(broadcastDesc);
String cityName = "";
if (state.contains(input))
cityName = state.get(input);
collector.collect("userKey:" + input + " city:" + cityName);
}
//called for elements of the broadcast stream
@Override
public void processBroadcastElement(Tuple2<String, String> input, Context ctx, Collector<String> collector) throws Exception {
//write the dimension record into the broadcast state
System.out.println("received broadcast record: " + input);
ctx.getBroadcastState(broadcastDesc).put(input.f0, input.f1);
}
}
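The behaviour of the two callbacks can be simulated without Flink. In the plain-Java sketch below, BroadcastJoinSketch and Subtask are hypothetical names: each Subtask models one parallel instance holding its own copy of the broadcast state, and broadcast() models Flink delivering every dimension record to all instances.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Plain-Java sketch of broadcast-state join semantics; not Flink code. */
final class BroadcastJoinSketch {

    /** One parallel instance with its own local copy of the broadcast state. */
    static final class Subtask {
        private final Map<String, String> broadcastState = new HashMap<>();

        /** Broadcast side: store or overwrite the dimension record. */
        void processBroadcastElement(String userKey, String city) {
            broadcastState.put(userKey, city);
        }

        /** Non-broadcast side: read-only lookup, empty city when nothing is known yet. */
        String processElement(String userKey) {
            String city = broadcastState.getOrDefault(userKey, "");
            return "userKey:" + userKey + " city:" + city;
        }
    }

    /** Every dimension record reaches every parallel instance. */
    static void broadcast(List<Subtask> subtasks, String userKey, String city) {
        for (Subtask subtask : subtasks) {
            subtask.processBroadcastElement(userKey, city);
        }
    }
}
```

The sketch also shows an ordering caveat: an element processed before its dimension record has been broadcast is emitted with an empty city, so jobs that cannot tolerate this need to buffer or replay such elements.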
4. Temporal table function join
5. Entry-point code for the dimension-table joins
package com.scallion.job;
import com.scallion.common.Common;
import com.scallion.transform.AsyncIOMySQLFunction;
import com.scallion.transform.DimBroadcastProcessFunction;
import com.scallion.transform.JoinWithDimMapFunction;
import com.scallion.utils.FlinkUtil;
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.common.state.MapStateDescriptor;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.AsyncDataStream;
import org.apache.flink.streaming.api.datastream.BroadcastStream;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.windowing.assigners.ProcessingTimeSessionWindows;
import org.apache.flink.streaming.api.windowing.time.Time;
import java.util.concurrent.TimeUnit;
/**
* created by gaowj.
* created on 2021-05-14.
* function: dimension-table join
* origin ->
*/
public class JoinWithDimJob implements Job {
@Override
public void run() {
/**
* Source
*/
//click log
DataStream<String> clickStream = FlinkUtil.getKafkaStream(Common.KAFKA_BROKER, Common.APP_NEWSAPP_TOPIC, Common.KAFKA_CONSUMER_GROUP_ID);
//dimension table data
SingleOutputStreamOperator<Tuple2<String, String>> socketDimStream = FlinkUtil.getSocketTextStream(Common.SOCKET_IP, Common.SOCKET_PORT)
.map(new MapFunction<String, Tuple2<String, String>>() {
@Override
public Tuple2<String, String> map(String line) throws Exception {
String[] split = line.split(",");
return new Tuple2<String, String>(split[0], split[1]);
}
});
/**
* Transform
*/
//1. preloaded dimension table
SingleOutputStreamOperator<String> joinWithDimStream = clickStream.map(new JoinWithDimMapFunction());
//2. dimension table in hot storage: use async I/O to raise lookup throughput
SingleOutputStreamOperator<String> asyncIOStream = AsyncDataStream
.orderedWait(clickStream, new AsyncIOMySQLFunction(), 1000L, TimeUnit.MILLISECONDS, 10);
//3. broadcast dimension table
//declare the dimension stream as a broadcast stream
MapStateDescriptor<String, String> broadcastDesc = new MapStateDescriptor<>("broad1", String.class, String.class);
BroadcastStream<Tuple2<String, String>> broadcastStream = socketDimStream.broadcast(broadcastDesc);
SingleOutputStreamOperator<String> broadcastWithDimStream = clickStream
.map(new MapFunction<String, String>() {
@Override
public String map(String line) throws Exception {
return line.split("\t")[5].trim();
}
})
.keyBy(line -> line)
.window(ProcessingTimeSessionWindows.withGap(Time.seconds(30)))
.reduce((line1, line2) -> line1)
.connect(broadcastStream)
.process(new DimBroadcastProcessFunction(broadcastDesc));
//4. temporal table function join
/**
* Sink
*/
// joinWithDimStream.print();
// asyncIOStream.print();
broadcastWithDimStream.print();
}
}