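I. Basic Concepts
1. Conditions for joining streams
Two streams in Flink can be joined only if both of the following hold:
- The streams must be able to wait for each other, i.e. the records of both streams must fall into the same window;
- The join is an equi-join, i.e. the two streams must share a field whose values are equal for records to match.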
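2. Operators in Flink that combine two streams
Five operators are currently known to combine two streams in Flink:
- union: merges two or more streams into one. All streams must have the same element type;
- connect: connects exactly two streams, whose element types may differ;
- join: supports only inner join, i.e. records are matched only when the key exists in both streams within the same window;
- coGroup: can also implement a two-stream join. It groups the two DataStreams within the same window by key, and the matching logic is written by passing a new CoGroupFunction() to apply() and overriding its coGroup() method;
- intervalJoin: has no notion of a window. It uses timestamps directly as the join condition, which is more expressive.
join() and coGroup() are both operators for joining streams, but they differ. Prefer coGroup where possible because it is more powerful: it also receives the keys that exist on only one side, so outer joins can be written with it. The exception is a plain inner join, where join() is the more concise choice (in the DataStream API, join() is itself implemented on top of coGroup()).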
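To make the difference between join() and coGroup() concrete, here is a minimal plain-Java sketch of what each operator sees inside one window. It does not use the Flink API; the class WindowJoinSketch, its method names, and the string-based records are illustrative assumptions.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

// Plain-Java model of the contents of one window, grouped by key.
// All names here are hypothetical; this is not Flink API.
class WindowJoinSketch {

    // join(): one output per matching (left, right) pair.
    // Keys that are missing on either side produce nothing.
    static List<String> innerJoin(Map<String, List<String>> left,
                                  Map<String, List<String>> right) {
        List<String> out = new ArrayList<>();
        for (String key : new TreeSet<>(left.keySet())) {
            for (String l : left.get(key)) {
                for (String r : right.getOrDefault(key, Collections.emptyList())) {
                    out.add(l + "+" + r);
                }
            }
        }
        return out;
    }

    // coGroup(): the user function is called once per key with both groups,
    // even when one group is empty, so unmatched left rows can still be emitted.
    static List<String> leftOuterViaCoGroup(Map<String, List<String>> left,
                                            Map<String, List<String>> right) {
        List<String> out = new ArrayList<>();
        TreeSet<String> keys = new TreeSet<>(left.keySet());
        keys.addAll(right.keySet());
        for (String key : keys) {
            List<String> ls = left.getOrDefault(key, Collections.emptyList());
            List<String> rs = right.getOrDefault(key, Collections.emptyList());
            for (String l : ls) {
                if (rs.isEmpty()) {
                    out.add(l + "+null"); // left row without a partner
                } else {
                    for (String r : rs) {
                        out.add(l + "+" + r);
                    }
                }
            }
        }
        return out;
    }
}
```

With left = {k1: [o1], k2: [o2]} and right = {k1: [d1, d2]}, innerJoin drops o2 entirely, while leftOuterViaCoGroup still emits it with a null partner.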
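II. Introduction to IntervalJoin
1. About IntervalJoin
A window-based join on DataStream can only match records of the two streams that fall into the same window. In practice, data is often out of order or delayed, so the two streams progress at different rates and matching records end up in different windows, where they can no longer be joined.
Flink therefore provides an interval join on KeyedStream: intervalJoin connects two KeyedStreams and matches records with the same key within a time range relative to each record's timestamp.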
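The cross-window problem can be shown with a little arithmetic. The sketch below assigns timestamps to tumbling windows in plain Java; the class TumblingWindowSketch is a hypothetical helper, and it assumes non-negative timestamps and a window offset of zero.

```java
// Plain-Java illustration of tumbling-window assignment.
// TumblingWindowSketch is a hypothetical helper, not Flink API.
class TumblingWindowSketch {

    // Start of the tumbling window of the given size that contains the timestamp.
    // Assumes timestampMs >= 0 and a window offset of 0.
    static long windowStart(long timestampMs, long windowSizeMs) {
        return timestampMs - (timestampMs % windowSizeMs);
    }

    // Two records can meet in a window join only if they share a window.
    static boolean sameWindow(long ts1, long ts2, long windowSizeMs) {
        return windowStart(ts1, windowSizeMs) == windowStart(ts2, windowSizeMs);
    }
}
```

With 5-second windows, sameWindow(4_999L, 5_001L, 5_000L) is false although the two records are only 2 ms apart, whereas sameWindow(5_001L, 9_001L, 5_000L) is true although they are 4 s apart. A join condition that is relative to each record's own timestamp avoids this boundary effect.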
2. Syntax
LeftDStream.keyBy(OrderInfo::getId)
        .intervalJoin(RightDStream.keyBy(OrderDetail::getOrder_id))
        .between(Time.seconds(-5), Time.seconds(5))
        .process(/* a custom ProcessJoinFunction */);
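- LeftDStream and RightDStream are each keyed by the order id with keyBy, which yields two KeyedStreams; intervalJoin is then applied to them.
- The two arguments of between, lowerBound and upperBound, control which left-stream elements a right-stream element can be joined with, namely:
leftElement.timestamp + lowerBound <= rightElement.timestamp <= leftElement.timestamp + upperBound
In other words, a left element may be up to |lowerBound| later than the right element (when lowerBound is negative) or up to upperBound earlier than it (when upperBound is positive).
- Interval join requires the time characteristic to be EventTime.
- After calling intervalJoin and between on the two KeyedStreams, call process.
- process takes a custom ProcessJoinFunction as its argument. The three type parameters of ProcessJoinFunction are the element type of the left stream, the element type of the right stream, and the element type of the output stream.
- Under the hood, intervalJoin connects the two KeyedStreams to obtain a ConnectedStreams, so the two streams can share state. For intervalJoin this means that records of the two streams with the same key can access each other.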
[Figure: concept diagram of the interval join]
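The between condition above can be written as a small predicate. This is a minimal sketch in plain Java, assuming both bounds are inclusive as in the inequality; the class IntervalBoundsSketch and its parameter names are hypothetical.

```java
// Plain-Java version of the interval condition; names are hypothetical.
class IntervalBoundsSketch {

    // True when leftTs + lowerBound <= rightTs <= leftTs + upperBound.
    // All values are in milliseconds; both bounds are treated as inclusive.
    static boolean matches(long leftTs, long rightTs, long lowerBoundMs, long upperBoundMs) {
        return rightTs >= leftTs + lowerBoundMs && rightTs <= leftTs + upperBoundMs;
    }
}
```

For a left element at 10 000 ms with bounds of -5 s and +5 s, right elements at 5 000 ms and 15 000 ms still match, while 4 999 ms and 15 001 ms do not. With bounds of 0 s and +5 s, a right element at 9 999 ms no longer matches because it is earlier than the left element.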
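3. Source code walkthrough
- intervalJoin first connects the two KeyedStreams to obtain a ConnectedStreams. A ConnectedStreams represents two connected data streams that can share state; for intervalJoin this means that records of the two streams with the same key can access each other.
- The IntervalJoinOperator is then applied on the ConnectedStreams. This operator is the core of intervalJoin, and its implementation works as follows:
a. It defines two MapState objects, of type MapState<Long, List<BufferEntry<T1>>> for the left stream and MapState<Long, List<BufferEntry<T2>>> for the right stream, to buffer the records of the two streams. The Long key is the record timestamp, and the List holds the records that share that timestamp.
b. It has two methods, processElement1 and processElement2. Both delegate to processElement, which is where the actual processing happens: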
private <THIS, OTHER> void processElement(
        final StreamRecord<THIS> record,
        final MapState<Long, List<IntervalJoinOperator.BufferEntry<THIS>>> ourBuffer,
        final MapState<Long, List<IntervalJoinOperator.BufferEntry<OTHER>>> otherBuffer,
        final long relativeLowerBound,
        final long relativeUpperBound,
        final boolean isLeft) throws Exception {
    final THIS ourValue = record.getValue();
    final long ourTimestamp = record.getTimestamp();
    if (ourTimestamp == Long.MIN_VALUE) {
        throw new FlinkException("Long.MIN_VALUE timestamp: Elements used in " +
                "interval stream joins need to have timestamps meaningful timestamps.");
    }
    if (isLate(ourTimestamp)) {
        return;
    }
    addToBuffer(ourBuffer, ourValue, ourTimestamp);
    for (Map.Entry<Long, List<BufferEntry<OTHER>>> bucket: otherBuffer.entries()) {
        final long timestamp = bucket.getKey();
        if (timestamp < ourTimestamp + relativeLowerBound ||
                timestamp > ourTimestamp + relativeUpperBound) {
            continue;
        }
        for (BufferEntry<OTHER> entry: bucket.getValue()) {
            if (isLeft) {
                collect((T1) ourValue, (T2) entry.element, ourTimestamp, timestamp);
            } else {
                collect((T1) entry.element, (T2) ourValue, timestamp, ourTimestamp);
            }
        }
    }
    long cleanupTime = (relativeUpperBound > 0L) ? ourTimestamp + relativeUpperBound : ourTimestamp;
    if (isLeft) {
        internalTimerService.registerEventTimeTimer(CLEANUP_NAMESPACE_LEFT, cleanupTime);
    } else {
        internalTimerService.registerEventTimeTimer(CLEANUP_NAMESPACE_RIGHT, cleanupTime);
    }
}
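- Check for lateness: a record whose timestamp is smaller than the current watermark is considered late and is not processed.
- Add the record to its own MapState buffer, keyed by the record timestamp.
- Iterate over the other stream's buffer. Every entry that satisfies ourTimestamp + relativeLowerBound <= timestamp <= ourTimestamp + relativeUpperBound is emitted to the ProcessJoinFunction, where ourTimestamp is the timestamp of the incoming record and timestamp is the timestamp of the buffered record being joined.
- Register a cleanup timer; when it fires, onEventTime is called to remove the corresponding state. For example, if orderStream arrives 1 to 5 seconds earlier than addressStream, the cleanup time for an orderStream record is 5 seconds later, i.e. orderStream.time + 5, and the record is removed once the watermark exceeds that time. addressStream is the later-arriving stream and does not need to wait, so its records can be removed as soon as the watermark exceeds the record timestamp.
The whole processing logic is based on record timestamps, which means intervalJoin must use EventTime semantics. The between method checks whether the TimeCharacteristic is EventTime and throws an exception if it is not.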
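The steps above can be modeled in plain Java for a single key. This is a simplified sketch under stated assumptions, not Flink's IntervalJoinOperator: TreeMap stands in for MapState, advanceWatermark stands in for the timer service, records are plain strings, and all class and method names are hypothetical. An entry is removed once the watermark is greater than its cleanup time, following the description above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// Simplified single-key model of the buffering, matching and cleanup steps.
// Illustrative only; this is not Flink's IntervalJoinOperator.
class IntervalJoinSketch {
    private final long lowerBound;
    private final long upperBound;
    private final TreeMap<Long, List<String>> leftBuffer = new TreeMap<>();
    private final TreeMap<Long, List<String>> rightBuffer = new TreeMap<>();
    private long watermark = Long.MIN_VALUE;
    final List<String> output = new ArrayList<>();

    IntervalJoinSketch(long lowerBoundMs, long upperBoundMs) {
        if (lowerBoundMs > upperBoundMs) {
            throw new IllegalArgumentException("lowerBound must be <= upperBound");
        }
        this.lowerBound = lowerBoundMs;
        this.upperBound = upperBoundMs;
    }

    void processLeft(String value, long ts) {
        process(value, ts, leftBuffer, rightBuffer, lowerBound, upperBound, true);
    }

    void processRight(String value, long ts) {
        // Seen from the right stream, the interval is mirrored.
        process(value, ts, rightBuffer, leftBuffer, -upperBound, -lowerBound, false);
    }

    private void process(String value, long ts,
                         TreeMap<Long, List<String>> ours,
                         TreeMap<Long, List<String>> others,
                         long relLower, long relUpper, boolean isLeft) {
        if (ts < watermark) {
            return; // late record: dropped
        }
        ours.computeIfAbsent(ts, k -> new ArrayList<>()).add(value);
        // Emit a pair for every buffered record of the other stream inside the interval.
        for (List<String> bucket : others.subMap(ts + relLower, true, ts + relUpper, true).values()) {
            for (String other : bucket) {
                output.add(isLeft ? value + "|" + other : other + "|" + value);
            }
        }
    }

    // Stand-in for the cleanup timers: drop entries whose cleanup time is below the watermark.
    void advanceWatermark(long newWatermark) {
        watermark = newWatermark;
        leftBuffer.headMap(newWatermark - Math.max(upperBound, 0L), false).clear();
        rightBuffer.headMap(newWatermark - Math.max(-lowerBound, 0L), false).clear();
    }

    int bufferedLeftTimestamps() {
        return leftBuffer.size();
    }
}
```

As a usage example with bounds of -5 s and +5 s: after processLeft("o1", 10_000) and processRight("d1", 12_000) the output contains "o1|d1". After advanceWatermark(15_001), the entry for o1 is removed from the left buffer, and a subsequent processLeft("o2", 14_000) is dropped as late.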
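III. IntervalJoin in Practice
1. Building an order wide table from the order main table and the order detail table
(1) Requirements:
- Combine the order basic information table and the order detail table into an order wide table.
- Implement something similar to the SQL below. Note that intervalJoin only emits matched pairs, so unlike a true left join it does not output OrderInfo rows that have no matching OrderDetail.
select a.*,b.* from OrderInfo a
left join OrderDetail b on a.id = b.order_id
(2) Implementation (main code):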
// Convert the orderInfo stream to JavaBeans, extract timestamps and generate watermarks
SingleOutputStreamOperator<OrderInfo> orderInfoWithWMDS = orderInfoStrDS.map(line -> {
    OrderInfo orderInfo = JSON.parseObject(line, OrderInfo.class);
    // create_time format: yyyy-MM-dd HH:mm:ss
    String create_time = orderInfo.getCreate_time();
    String[] dateHourArr = create_time.split(" ");
    orderInfo.setCreate_date(dateHourArr[0]);
    orderInfo.setCreate_hour(dateHourArr[1].split(":")[0]);
    SimpleDateFormat sdf = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");
    long ts = sdf.parse(create_time).getTime();
    orderInfo.setCreate_ts(ts);
    return orderInfo;
}).assignTimestampsAndWatermarks(WatermarkStrategy.<OrderInfo>forBoundedOutOfOrderness(Duration.ofSeconds(1))
        .withTimestampAssigner(new SerializableTimestampAssigner<OrderInfo>() {
            @Override
            public long extractTimestamp(OrderInfo element, long recordTimestamp) {
                return element.getCreate_ts();
            }
        }));
// Convert the orderDetail stream to JavaBeans, extract timestamps and generate watermarks
SingleOutputStreamOperator<OrderDetail> orderDetailWithWMDS = orderDetailStrDS.map(line -> {
    OrderDetail orderDetail = JSON.parseObject(line, OrderDetail.class);
    String create_time = orderDetail.getCreate_time();
    SimpleDateFormat sdf = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");
    long ts = sdf.parse(create_time).getTime();
    orderDetail.setCreate_ts(ts);
    return orderDetail;
}).assignTimestampsAndWatermarks(WatermarkStrategy.<OrderDetail>forBoundedOutOfOrderness(Duration.ofSeconds(1))
        .withTimestampAssigner(new SerializableTimestampAssigner<OrderDetail>() {
            @Override
            public long extractTimestamp(OrderDetail element, long recordTimestamp) {
                return element.getCreate_ts();
            }
        }));
// Join the two streams
SingleOutputStreamOperator<OrderWide> orderWideDS = orderInfoWithWMDS.keyBy(OrderInfo::getId)
        .intervalJoin(orderDetailWithWMDS.keyBy(OrderDetail::getOrder_id))
        // In production, set the bounds to the maximum network delay so that no data is lost.
        // Plus/minus 5 seconds is used here to cover the time difference between saving
        // the main table and the detail table in the business system.
        .between(Time.seconds(-5), Time.seconds(5))
        .process(new ProcessJoinFunction<OrderInfo, OrderDetail, OrderWide>() {
            @Override
            public void processElement(OrderInfo orderInfo, OrderDetail orderDetail, Context ctx, Collector<OrderWide> out) throws Exception {
                out.collect(new OrderWide(orderInfo, orderDetail));
            }
        });
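The map steps above create a new SimpleDateFormat for every record and parse in the JVM's default time zone. As one possible alternative (an assumption, not the author's code), the same conversion can be done with a shared java.time formatter and an explicit zone. The class EventTimeParser and the zone passed in by the caller are hypothetical.

```java
import java.time.LocalDateTime;
import java.time.ZoneId;
import java.time.format.DateTimeFormatter;

// Hypothetical helper for turning create_time strings into event-time values.
class EventTimeParser {
    // DateTimeFormatter is immutable and thread-safe, so one instance can be shared.
    private static final DateTimeFormatter FORMAT =
            DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss");

    // Parses "yyyy-MM-dd HH:mm:ss" as local time in the given zone
    // and returns epoch milliseconds.
    static long toEpochMillis(String createTime, ZoneId zone) {
        return LocalDateTime.parse(createTime, FORMAT)
                .atZone(zone)
                .toInstant()
                .toEpochMilli();
    }

    // "2021-01-01 08:30:00" -> "2021-01-01"
    static String datePart(String createTime) {
        return createTime.split(" ")[0];
    }

    // "2021-01-01 08:30:00" -> "08"
    static String hourPart(String createTime) {
        return createTime.split(" ")[1].split(":")[0];
    }
}
```

Passing the zone explicitly keeps the extracted timestamps independent of the machine the job runs on; which zone is correct depends on how the business system writes create_time.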
// Bean: OrderWide
import lombok.AllArgsConstructor;
import lombok.Data;
import org.apache.commons.lang3.ObjectUtils;
import java.math.BigDecimal;
@Data
@AllArgsConstructor
public class OrderWide {
    Long detail_id;
    Long order_id;
    Long sku_id;
    BigDecimal order_price;
    Long sku_num;
    String sku_name;
    Long province_id;
    String order_status;
    Long user_id;
    BigDecimal total_amount;
    BigDecimal activity_reduce_amount;
    BigDecimal coupon_reduce_amount;
    BigDecimal original_total_amount;
    BigDecimal feight_fee;
    BigDecimal split_feight_fee;
    BigDecimal split_activity_amount;
    BigDecimal split_coupon_amount;
    BigDecimal split_total_amount;
    String expire_time;
    String create_time; // yyyy-MM-dd HH:mm:ss
    String operate_time;
    String create_date; // derived from other fields
    String create_hour;
    String province_name; // obtained by looking up the dimension table
    String province_area_code;
    String province_iso_code;
    String province_3166_2_code;
    Integer user_age;
    String user_gender;
    Long spu_id; // dimension data to be joined in
    Long tm_id;
    Long category3_id;
    String spu_name;
    String tm_name;
    String category3_name;

    public OrderWide(OrderInfo orderInfo, OrderDetail orderDetail) {
        mergeOrderInfo(orderInfo);
        mergeOrderDetail(orderDetail);
    }

    public void mergeOrderInfo(OrderInfo orderInfo) {
        if (orderInfo != null) {
            this.order_id = orderInfo.id;
            this.order_status = orderInfo.order_status;
            this.create_time = orderInfo.create_time;
            this.create_date = orderInfo.create_date;
            this.create_hour = orderInfo.create_hour;
            this.activity_reduce_amount = orderInfo.activity_reduce_amount;
            this.coupon_reduce_amount = orderInfo.coupon_reduce_amount;
            this.original_total_amount = orderInfo.original_total_amount;
            this.feight_fee = orderInfo.feight_fee;
            this.total_amount = orderInfo.total_amount;
            this.province_id = orderInfo.province_id;
            this.user_id = orderInfo.user_id;
        }
    }

    public void mergeOrderDetail(OrderDetail orderDetail) {
        if (orderDetail != null) {
            this.detail_id = orderDetail.id;
            this.sku_id = orderDetail.sku_id;
            this.sku_name = orderDetail.sku_name;
            this.order_price = orderDetail.order_price;
            this.sku_num = orderDetail.sku_num;
            this.split_activity_amount = orderDetail.split_activity_amount;
            this.split_coupon_amount = orderDetail.split_coupon_amount;
            this.split_total_amount = orderDetail.split_total_amount;
        }
    }

    // For each field, keep this object's value if it is non-null, otherwise take the other object's value.
    public void mergeOtherOrderWide(OrderWide otherOrderWide) {
        this.order_status = ObjectUtils.firstNonNull(this.order_status, otherOrderWide.order_status);
        this.create_time = ObjectUtils.firstNonNull(this.create_time, otherOrderWide.create_time);
        this.create_date = ObjectUtils.firstNonNull(this.create_date, otherOrderWide.create_date);
        this.coupon_reduce_amount = ObjectUtils.firstNonNull(this.coupon_reduce_amount, otherOrderWide.coupon_reduce_amount);
        this.activity_reduce_amount = ObjectUtils.firstNonNull(this.activity_reduce_amount, otherOrderWide.activity_reduce_amount);
        this.original_total_amount = ObjectUtils.firstNonNull(this.original_total_amount, otherOrderWide.original_total_amount);
        this.feight_fee = ObjectUtils.firstNonNull(this.feight_fee, otherOrderWide.feight_fee);
        this.total_amount = ObjectUtils.firstNonNull(this.total_amount, otherOrderWide.total_amount);
        this.user_id = ObjectUtils.<Long>firstNonNull(this.user_id, otherOrderWide.user_id);
        this.sku_id = ObjectUtils.firstNonNull(this.sku_id, otherOrderWide.sku_id);
        this.sku_name = ObjectUtils.firstNonNull(this.sku_name, otherOrderWide.sku_name);
        this.order_price = ObjectUtils.firstNonNull(this.order_price, otherOrderWide.order_price);
        this.sku_num = ObjectUtils.firstNonNull(this.sku_num, otherOrderWide.sku_num);
        // The original passed only this object's value here, so the other object's value was never used.
        this.split_activity_amount = ObjectUtils.firstNonNull(this.split_activity_amount, otherOrderWide.split_activity_amount);
        this.split_coupon_amount = ObjectUtils.firstNonNull(this.split_coupon_amount, otherOrderWide.split_coupon_amount);
        this.split_total_amount = ObjectUtils.firstNonNull(this.split_total_amount, otherOrderWide.split_total_amount);
    }
}
// Bean: OrderInfo
import lombok.Data;
import java.math.BigDecimal;
@Data
public class OrderInfo {
    Long id;
    Long province_id;
    String order_status;
    Long user_id;
    BigDecimal total_amount;
    BigDecimal activity_reduce_amount;
    BigDecimal coupon_reduce_amount;
    BigDecimal original_total_amount;
    BigDecimal feight_fee;
    String expire_time;
    String create_time;
    String operate_time;
    String create_date; // derived from other fields
    String create_hour;
    Long create_ts;
}
// Bean: OrderDetail
import lombok.Data;
import java.math.BigDecimal;
@Data
public class OrderDetail {
    Long id;
    Long order_id;
    Long sku_id;
    BigDecimal order_price;
    Long sku_num;
    String sku_name;
    String create_time;
    BigDecimal split_total_amount;
    BigDecimal split_activity_amount;
    BigDecimal split_coupon_amount;
    Long create_ts;
}