Flink: Implementing a Three-Stream Join with the DataStream API


A note from the Flink 1.7.0 docs: the Table API & SQL module is still under active development, and not every feature is supported yet.

The original plan was to turn these streams into Flink Tables and just write some Flink SQL, but I could not find a way in the Flink 1.7 documentation to specify event time on a Table object, so I fell back to the DataStream API.

This requirement joins data from three streams, with the following schemas:

Stream/table      Alias  Description               Fields
Student           a      student info              s_id (student id), s_name (name), school_name (school), dataStamp (event time)
Student_request   b      exam application          s_id (student id), s_name (name), exam_code (exam code), dataStamp (event time)
score             c      student score table       s_id (student id), s_name (name), exam_code (exam code), score (exam score), dataStamp (event time)

The three streams are processed in Flink on event time. The business logic: join table a with table b; if b contains a matching s_id and s_name, the student has applied for an exam, and we can take that student's exam_code from b. Then use b's s_id, s_name and exam_code to look up the student's exam score in table c. Finally, group by a's school_name and the student's s_name and sum that student's scores.

The business computation already exists as this SQL:

select 
a.school_name,
a.s_name,
sum(c.score)
from 
a join b
on a.s_id = b.s_id and a.s_name = b.s_name
join c 
on b.s_id = c.s_id and b.s_name = c.s_name and b.exam_code = c.exam_code 
group by a.school_name,a.s_name

The idea for implementing the three-stream join in Flink DataStream code: join a and b in 5 s windows to produce a temp stream, then join that temp stream with c in 5 s windows, and group and aggregate in a final step. As long as both joins use the same window size, the computed result is correct.
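That alignment claim can be sketched numerically. The snippet below is an illustrative sketch, not part of the job; it mirrors the window-start formula Flink uses (`TimeWindow.getWindowStartWithOffset` with offset 0) for non-negative timestamps. A record's timestamp alone determines its tumbling window, so two joins with the same window size always agree on the [start, start + size) boundaries.

```java
public class TumblingWindowDemo {
    // Start of the tumbling window containing `timestamp`
    // (offset 0, non-negative timestamps assumed).
    public static long windowStart(long timestamp, long windowSizeMs) {
        return timestamp - (timestamp % windowSizeMs);
    }

    public static void main(String[] args) {
        long size = 5000L; // the 5 s window used in this post
        // records with timestamps 1000 and 4000 share the window [0, 5000)
        System.out.println(windowStart(1000L, size)); // 0
        System.out.println(windowStart(4000L, size)); // 0
        // a record with timestamp 7000 falls into [5000, 10000)
        System.out.println(windowStart(7000L, size)); // 5000
    }
}
```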

On to the code and the operations. For convenience, the demo feeds the data in as socket streams.

Data preparation

Student

1,张三,xx大学,1000
2,李四,yy大学,2000
3,王五,yy大学,3000
4,赵六,xx大学,5000
5,田七,yy大学,4000
6,赵钱,xx大学,7000

Student_request

1,张三,A0001,1000
2,李四,A0002,2000
3,王五,A0003,3000
4,赵六,A0004,5000
5,田七,A0005,4000
6,赵钱,A0006,9000

Score

1,张三,A0001,45,1000
2,李四,A0002,56,2000
3,王五,A0003,33,3000
4,赵六,A0004,89,5000
5,田七,A0005,23,4000
6,赵钱,A0006,70,7000
POJO definitions

All three beans additionally need an all-args constructor plus getters and setters (generate them in the IDE or with Lombok); they are omitted below for brevity, but the job code relies on them.

public class Student {
    private String sId;
    private String sName;
    private String SchoolName;
    private Long dataStamp;
}
public class StudentRequest {
    private String sId;
    private String sName;
    private String examCode;
    private Long dataStamp;
}
public class Score {
    private String sId;
    private String SName;
    private String examCode;
    private Double score;
    private Long dataStamp;
}
Student stream join StudentRequest stream

Processing flow:

  1. Map each of the three socket streams into its JavaBean

  2. Assign watermarks to all three streams, pointing at the event-time field, with out-of-orderness uniformly set to 1 s

  3. The rough flow of joining the Student stream with the StudentRequest stream:

    studentStream.join(studentRequestStream)
                 .where(KeySelector)
                 .equalTo(KeySelector)
    -- where and equalTo each take a key selector: which fields of the Student stream and which fields of the StudentRequest stream to join on
                 .window()
    -- choose the window size and the window type for the business need: tumbling, sliding or session. A tumbling window fits this case.
                 .allowedLateness()
    -- lets late data still be included in the computation
                 .apply(JoinFunction)
    -- the JoinFunction given to apply declares which fields the join returns
    
    The SQL that this first join implements:
    select 
    a.s_id,
    a.s_name,
    a.school_name,
    b.exam_code
    from 
    student a join studentrequest b
    on a.s_id = b.s_id and a.s_name = b.s_name
    
package com.sou1yu.threeJoin;
import org.apache.flink.api.common.functions.JoinFunction;
import org.apache.flink.api.java.functions.KeySelector;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.api.java.tuple.Tuple4;
import org.apache.flink.streaming.api.TimeCharacteristic;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.timestamps.BoundedOutOfOrdernessTimestampExtractor;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

/**
 * @Author: sou1yu
 * @Email:sou1yu@aliyun.com
 */
public class MultipleStreamJoin {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setParallelism(1);
        env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);
        env.getConfig().setAutoWatermarkInterval(500);
        //env.setStateBackend()


        DataStreamSource<String> studentStream =env.socketTextStream("192.168.52.100", 7777);
        DataStreamSource<String> stuReqStream =env.socketTextStream("192.168.52.100", 8888);
        DataStreamSource<String> scoreStream =env.socketTextStream("192.168.52.100", 9999);
        //1.1 map Student into the bean and assign watermarks
        SingleOutputStreamOperator<Student> stuOperatorStream = studentStream
                .map(line -> {
                    String[] fields = line.split(",");
                    return new Student(fields[0], fields[1], fields[2], Long.parseLong(fields[3]));
        })      .assignTimestampsAndWatermarks(new BoundedOutOfOrdernessTimestampExtractor<Student>(Time.seconds(1)) {
                    @Override
                    public long extractTimestamp(Student element) {
                        return element.getDataStamp();
                    }
                });
        //1.2 map StudentRequest into the bean and assign watermarks
        SingleOutputStreamOperator<StudentRequest> reqOperatorStream = stuReqStream
                .map(line -> {
                    String[] fields = line.split(",");
                    return new StudentRequest(fields[0], fields[1], fields[2], Long.parseLong(fields[3]));
                })
                .assignTimestampsAndWatermarks(new BoundedOutOfOrdernessTimestampExtractor<StudentRequest>(Time.seconds(1)) {
                    @Override
                    public long extractTimestamp(StudentRequest element) {
                        return element.getDataStamp();
                    }
                });
        //1.3 map Score into the bean and assign watermarks
        SingleOutputStreamOperator<Score> scoreOperatorStream = scoreStream
                .map(line -> {
                    String[] fields = line.split(",");
                    return new Score(fields[0], fields[1], fields[2], Double.parseDouble(fields[3]), Long.parseLong(fields[4]));
                })
                .assignTimestampsAndWatermarks(new BoundedOutOfOrdernessTimestampExtractor<Score>(Time.seconds(1)) {
                    @Override
                    public long extractTimestamp(Score element) {
                        return element.getDataStamp();
                    }
                });
        //2.1 student join studentRequest
        DataStream<Tuple4<String, String, String,String>> firstJoinStream = stuOperatorStream
                .join(reqOperatorStream)
                .where(new FirstJoinStuKey())
                .equalTo(new FirstJoinReqKey())
                .window(TumblingEventTimeWindows.of(Time.seconds(5)))
            	.allowedLateness(Time.milliseconds(500))
                .apply(new FirstJoinResult());


        
        studentStream.printToErr("student>>");
        stuReqStream.printToErr("req");
        firstJoinStream.printToErr("FIRST JOIN >>");

        env.execute(MultipleStreamJoin.class.getSimpleName());

    }

    //1.1 Student-side key fields for the student-studentReq join
    public static class FirstJoinStuKey implements KeySelector<Student,Tuple2<String,String>>{

        @Override
        public Tuple2<String, String> getKey(Student value) throws Exception {
            return new Tuple2<>(value.getsId(),value.getsName());
        }
    }
    //1.2 StudentRequest-side key fields for the student-studentReq join
    public static class  FirstJoinReqKey implements KeySelector<StudentRequest,Tuple2<String,String>>{

        @Override
        public Tuple2<String, String> getKey(StudentRequest value) throws Exception {
            return new Tuple2<>(value.getsId(),value.getsName());
        }
    }
    //1.3 result fields extracted from the student-studentReq join
    public static class  FirstJoinResult implements JoinFunction<Student,StudentRequest,Tuple4<String,String,String,String>>{

        @Override
        public Tuple4<String, String, String, String> join(Student stu, StudentRequest req) throws Exception {
            return new Tuple4<>(stu.getsId(),stu.getsName(),stu.getSchoolName(),req.getExamCode());
        }
    }

}

Test and verification

  1. Open ports 7777, 8888 and 9999 on the VM (listening for the student, studentRequest and Score data respectively)

  2. Port 9999 is opened so that the Flink job can start, but no data is fed into it in this test.

The first window covers [0, 5000).

Feeding in the Student and studentRequest data, record 5 (田七) is out-of-order data and is still handled, which proves the watermark works.

When Student record 6 (赵钱) arrived and could have triggered the window computation, nothing was computed; only after the matching studentRequest record 6 (赵钱) arrived to trigger its window did a join result appear. This proves the window setup works!

temp stream join Score stream

The temp stream produced by the first join is then joined with Score. The join itself follows exactly the same steps as above, so they are not repeated. The stream coming out of that join still needs a keyBy for grouping, followed by one sum aggregation. On to the code!

package com.sou1yu.threeJoin;
import org.apache.flink.api.common.functions.JoinFunction;
import org.apache.flink.api.java.functions.KeySelector;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.api.java.tuple.Tuple3;
import org.apache.flink.api.java.tuple.Tuple4;
import org.apache.flink.streaming.api.TimeCharacteristic;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.timestamps.BoundedOutOfOrdernessTimestampExtractor;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

/**
 * @Author: sou1yu
 * @Email:sou1yu@aliyun.com
 */
public class MultipleStreamJoin {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setParallelism(1);
        env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);
        env.getConfig().setAutoWatermarkInterval(500);
        //env.setStateBackend()


        DataStreamSource<String> studentStream =env.socketTextStream("192.168.52.100", 7777);
        DataStreamSource<String> stuReqStream =env.socketTextStream("192.168.52.100", 8888);
        DataStreamSource<String> scoreStream =env.socketTextStream("192.168.52.100", 9999);
        //1.1 map Student into the bean and assign watermarks
        SingleOutputStreamOperator<Student> stuOperatorStream = studentStream
                .map(line -> {
                    String[] fields = line.split(",");
                    return new Student(fields[0], fields[1], fields[2], Long.parseLong(fields[3]));
        })      .assignTimestampsAndWatermarks(new BoundedOutOfOrdernessTimestampExtractor<Student>(Time.seconds(1)) {
                    @Override
                    public long extractTimestamp(Student element) {
                        return element.getDataStamp();
                    }
                });
        //1.2 map StudentRequest into the bean and assign watermarks
        SingleOutputStreamOperator<StudentRequest> reqOperatorStream = stuReqStream
                .map(line -> {
                    String[] fields = line.split(",");
                    return new StudentRequest(fields[0], fields[1], fields[2], Long.parseLong(fields[3]));
                })
                .assignTimestampsAndWatermarks(new BoundedOutOfOrdernessTimestampExtractor<StudentRequest>(Time.seconds(1)) {
                    @Override
                    public long extractTimestamp(StudentRequest element) {
                        return element.getDataStamp();
                    }
                });
        //1.3 map Score into the bean and assign watermarks
        SingleOutputStreamOperator<Score> scoreOperatorStream = scoreStream
                .map(line -> {
                    String[] fields = line.split(",");
                    return new Score(fields[0], fields[1], fields[2], Double.parseDouble(fields[3]), Long.parseLong(fields[4]));
                })
                .assignTimestampsAndWatermarks(new BoundedOutOfOrdernessTimestampExtractor<Score>(Time.seconds(1)) {
                    @Override
                    public long extractTimestamp(Score element) {
                        return element.getDataStamp();
                    }
                });
        //2.1 student join studentRequest
        DataStream<Tuple4<String, String, String,String>> firstJoinStream = stuOperatorStream
                .join(reqOperatorStream)
                .where(new FirstJoinStuKey())
                .equalTo(new FirstJoinReqKey())
                .window(TumblingEventTimeWindows.of(Time.seconds(5)))
            	.allowedLateness(Time.milliseconds(5000))
                .apply(new FirstJoinResult());

        //2.2 join the result of the first join with the score stream
        DataStream<Tuple3<String, String, Double>> resultJoinedStream = firstJoinStream
                .join(scoreOperatorStream)
                .where(new SecondJoinTupleKey())
                .equalTo(new SecondJoinScoreKey())
                .window(TumblingEventTimeWindows.of(Time.seconds(5)))
            	.allowedLateness(Time.milliseconds(5000))
                .apply(new SecondJoinResult());

        //3 group and aggregate the joined result
        SingleOutputStreamOperator<Tuple3<String, String, Double>> resultStream = resultJoinedStream
                .keyBy(0, 1)
                .sum(2);


        studentStream.printToErr("student>>");
        stuReqStream.printToErr("req");
        scoreStream.printToErr("score");
        resultStream.printToErr("Result Stream>>>");


        env.execute(MultipleStreamJoin.class.getSimpleName());

    }

    //2.1.1 Student-side key fields for the student-studentReq join
    public static class FirstJoinStuKey implements KeySelector<Student,Tuple2<String,String>>{

        @Override
        public Tuple2<String, String> getKey(Student value) throws Exception {
            return new Tuple2<>(value.getsId(),value.getsName());
        }
    }
    //2.1.2 StudentRequest-side key fields for the student-studentReq join
    public static class  FirstJoinReqKey implements KeySelector<StudentRequest,Tuple2<String,String>>{

        @Override
        public Tuple2<String, String> getKey(StudentRequest value) throws Exception {
            return new Tuple2<>(value.getsId(),value.getsName());
        }
    }
    //2.1.3 result fields extracted from the student-studentReq join
    public static class  FirstJoinResult implements JoinFunction<Student,StudentRequest,Tuple4<String,String,String,String>>{

        @Override
        public Tuple4<String, String, String, String> join(Student stu, StudentRequest req) throws Exception {
            //pack sId, sName, examCode from the request stream and SchoolName from the student stream into a 4-tuple
            return new Tuple4<>(req.getsId(),req.getsName(),stu.getSchoolName(),req.getExamCode());
        }
    }
    //2.2.1 firstJoin-stream key fields for the join with score
    public static  class SecondJoinTupleKey implements KeySelector<Tuple4<String, String, String,String>, Tuple3<String,String,String>>{

        @Override
        public Tuple3<String, String, String> getKey(Tuple4<String, String, String, String> value) throws Exception {
            return new Tuple3<>(value.f0,value.f1,value.f3);
        }
    }
    //2.2.2 Score-side key fields for the join with the firstJoin stream
    public static class SecondJoinScoreKey implements KeySelector<Score,Tuple3<String,String,String>>{

        @Override
        public Tuple3<String, String, String> getKey(Score value) throws Exception {
            return new Tuple3<>(value.getsId(),value.getSName(),value.getExamCode());
        }
    }
    //2.2.3 result fields extracted from the firstJoin-score join
    public static class  SecondJoinResult implements JoinFunction<Tuple4<String,String,String,String>,Score,Tuple3<String,String,Double>>{

        @Override
        public Tuple3<String, String, Double> join(Tuple4<String, String, String, String> first, Score score) throws Exception {
            //pack schoolName and sName from the firstJoin stream and score from the Score stream into a 3-tuple
            return new Tuple3<>(first.f2,first.f1,score.getScore());
        }
    }


}

For the second join I mostly used Flink's built-in Tuple types to pass values in and out, which is easy to mix up while writing. Depending on how many fields the business involves, you can instead wrap them in custom JavaBeans, which makes the KeySelector and JoinFunction implementations noticeably clearer.
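As a sketch of that suggestion (JoinKey is an invented name, not part of the code above): a small key class could replace the Tuple3 used by the second join's key selectors. The part that matters is stable equals()/hashCode(), which Flink's keying relies on.

```java
import java.util.Objects;

// Hypothetical composite key for the second join: (s_id, s_name, exam_code).
// A KeySelector on either side would simply return new JoinKey(...).
public class JoinKey {
    private final String sId;
    private final String sName;
    private final String examCode;

    public JoinKey(String sId, String sName, String examCode) {
        this.sId = sId;
        this.sName = sName;
        this.examCode = examCode;
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof JoinKey)) return false;
        JoinKey k = (JoinKey) o;
        return Objects.equals(sId, k.sId)
                && Objects.equals(sName, k.sName)
                && Objects.equals(examCode, k.examCode);
    }

    @Override
    public int hashCode() {
        return Objects.hash(sId, sName, examCode);
    }
}
```

A KeySelector<Score, JoinKey> would then read `return new JoinKey(value.getsId(), value.getSName(), value.getExamCode());`, which is harder to get wrong than remembering tuple field positions.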

Test and verification

Data preparation:

student data

1,张三,xx大学,1000
2,李四,yy大学,2000
3,王五,yy大学,3000
4,赵六,xx大学,5000
5,田七,yy大学,4000
6,田七,yy大学,4900
7,田七,yy大学,5990
8,田七,yy大学,7000
9,田七,yy大学,4500

StudentRequest data

1,张三,A0001,1000
2,李四,A0002,2000
3,王五,A0003,3000
4,赵六,A0004,5000
5,田七,A0005,4000
6,田七,A0006,4900
7,田七,A0007,5990
8,田七,A0008,7000
9,田七,A0009,4500

Score data

1,张三,A0001,45,1000
2,李四,A0002,56,2000
3,王五,A0003,33,3000
4,赵六,A0004,89,5000
5,田七,A0005,20,4000
6,田七,A0006,70,4900
7,田七,A0007,70,5990
8,田七,A0008,30,7000
9,田七,A0009,30,4500

Three key points to verify this time:

  1. computation of out-of-order data
  2. computation of late data
  3. correctness of the three-stream join result

With the current settings, analyzing the Student data:

  • window size: 5 s
  • watermark (max out-of-orderness): 1 s
  • allowedLateness: 5 s

In the Student data:

The first window is [0, 5000); window intervals are left-closed, right-open.

Records 1-3 are in-order data.

Record 4 (ts 5000) does not belong to the first window (left-closed, right-open).

Records 5 and 6 are out-of-order data; thanks to the watermark mechanism they are still assigned to the first window.

Record 7 does not belong to the first window, and it does not trigger the first window's computation either: its timestamp (5990) has not passed the watermark threshold of 5 s + 1 s.

Record 8 does not belong to the first window, but it pushes the watermark past that threshold, so the first window fires; because of allowedLateness the window is not closed yet.

Record 9 belongs to the first window, and the current maximum timestamp has not exceeded window size + watermark delay + allowedLateness, so it is computed into the first window once more.

The same reasoning applies to the other two data sources, so it is not repeated.
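The arithmetic behind that record-by-record analysis can be sketched in plain Java. This is an illustrative sketch using this post's settings, not Flink code; one subtlety is that Flink treats a window's maximum timestamp as end - 1 ms.

```java
public class WindowTimingDemo {
    static final long WINDOW_MAX_TS = 5000L - 1;  // window [0, 5000) has max timestamp 4999 ms
    static final long MAX_OUT_OF_ORDER = 1000L;   // bound given to BoundedOutOfOrdernessTimestampExtractor
    static final long ALLOWED_LATENESS = 5000L;

    // The extractor emits watermark = max seen timestamp - out-of-orderness bound.
    public static long watermark(long maxSeenTs) {
        return maxSeenTs - MAX_OUT_OF_ORDER;
    }

    // The first window fires once the watermark reaches its max timestamp...
    public static boolean fires(long maxSeenTs) {
        return watermark(maxSeenTs) >= WINDOW_MAX_TS;
    }

    // ...and its state is only dropped once the watermark reaches max timestamp + lateness.
    public static boolean closed(long maxSeenTs) {
        return watermark(maxSeenTs) >= WINDOW_MAX_TS + ALLOWED_LATENESS;
    }

    public static void main(String[] args) {
        System.out.println(fires(5990L));   // record 7: false (watermark 4990 < 4999)
        System.out.println(fires(7000L));   // record 8: true  (watermark 6000 >= 4999)
        System.out.println(closed(7000L));  // false: allowedLateness keeps the window open
        System.out.println(closed(11000L)); // true: a timestamp this large would finally drop it
    }
}
```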

The observed results of the run:

The input order was Student records 1-8, then StudentRequest records 1-8, then Score records 1-8. At that point the first window fired; note that 田七's sum is 20 + 70 = 90, and the window is not yet closed.

When I then fed in 田七's late records, again in the order Student, StudentRequest, Score, another computation was triggered after the Score record arrived, and the data was still counted into the first window: 田七's sum became 20 + 70 + 30 = 120.

Takeaways

This completes the three-stream join test. It confirms that this style of join works, and demonstrates the watermark and allowedLateness mechanisms. The code still has rough edges, such as filtering out nulls and handling invalid values, which are important checks in practice. Flink also provides a third safety net for late data, sideOutputLateData: data that misses even allowedLateness can be collected through a side output into an RDBMS or other storage, then merged back with its window's data in a batch job to keep the results complete.

