Flink: Implementing a Three-Stream Join with the DataStream API


A note from the Flink 1.7.0 docs: the Table API & SQL module is still under active development, and not every feature is supported yet.

The original plan was to turn these streams into Flink Tables and just write some Flink SQL, but I could not find a way in the Flink 1.7 documentation to specify event time on a Table object, so I fell back to the DataStream API.

This requirement joins data from three streams, with the following schemas:

Stream/table      Alias  Description               Fields
Student           a      student info              s_id (student id), s_name (name), school_name (school), dataStamp (event time)
Student_request   b      exam application          s_id (student id), s_name (name), exam_code (exam code), dataStamp (event time)
score             c      student score table       s_id (student id), s_name (name), exam_code (exam code), score (exam score), dataStamp (event time)

The three streams are processed in Flink on event time. The business logic: join table a with table b; if b contains a matching s_id and s_name, the student has applied for an exam, and we can take that student's exam_code from b. Then use b's s_id, s_name and exam_code to look up the student's exam score in table c. Finally, group by a's school_name and the student's s_name and sum that student's scores.

The business computation already exists as this SQL:

select 
a.school_name,
a.s_name,
sum(c.score)
from 
a join b
on a.s_id = b.s_id and a.s_name = b.s_name
join c 
on b.s_id = c.s_id and b.s_name = c.s_name and b.exam_code = c.exam_code 
group by a.school_name,a.s_name

The idea for implementing the three-stream join in Flink DataStream code: join a and b in 5 s windows to produce a temp stream, then join that temp stream with c in 5 s windows, and group and aggregate in a final step. As long as both joins use the same window size, the computed result is correct.
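That alignment claim can be sketched numerically. The snippet below is an illustrative sketch, not part of the job; it mirrors the window-start formula Flink uses (`TimeWindow.getWindowStartWithOffset` with offset 0) for non-negative timestamps. A record's timestamp alone determines its tumbling window, so two joins with the same window size always agree on the [start, start + size) boundaries.

```java
public class TumblingWindowDemo {
    // Start of the tumbling window containing `timestamp`
    // (offset 0, non-negative timestamps assumed).
    public static long windowStart(long timestamp, long windowSizeMs) {
        return timestamp - (timestamp % windowSizeMs);
    }

    public static void main(String[] args) {
        long size = 5000L; // the 5 s window used in this post
        // records with timestamps 1000 and 4000 share the window [0, 5000)
        System.out.println(windowStart(1000L, size)); // 0
        System.out.println(windowStart(4000L, size)); // 0
        // a record with timestamp 7000 falls into [5000, 10000)
        System.out.println(windowStart(7000L, size)); // 5000
    }
}
```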

On to the code and the operations. For convenience, the demo feeds the data in as socket streams.

Data preparation

Student

1,张三,xx大学,1000
2,李四,yy大学,2000
3,王五,yy大学,3000
4,赵六,xx大学,5000
5,田七,yy大学,4000
6,赵钱,xx大学,7000

Student_request

1,张三,A0001,1000
2,李四,A0002,2000
3,王五,A0003,3000
4,赵六,A0004,5000
5,田七,A0005,4000
6,赵钱,A0006,9000

Score

1,张三,A0001,45,1000
2,李四,A0002,56,2000
3,王五,A0003,33,3000
4,赵六,A0004,89,5000
5,田七,A0005,23,4000
6,赵钱,A0006,70,7000
POJO definitions

All three beans additionally need an all-args constructor plus getters and setters (generate them in the IDE or with Lombok); they are omitted below for brevity, but the job code relies on them.

public class Student {
    private String sId;
    private String sName;
    private String SchoolName;
    private Long dataStamp;
}
public class StudentRequest {
    private String sId;
    private String sName;
    private String examCode;
    private Long dataStamp;
}
public class Score {
    private String sId;
    private String SName;
    private String examCode;
    private Double score;
    private Long dataStamp;
}
Student stream join StudentRequest stream

Processing flow:

  1. Map each of the three socket streams into its JavaBean

  2. Assign watermarks to all three streams, pointing at the event-time field, with out-of-orderness uniformly set to 1 s

  3. The rough flow of joining the Student stream with the StudentRequest stream:

    studentStream.join(studentRequestStream)
                 .where(KeySelector)
                 .equalTo(KeySelector)
    -- where and equalTo each take a key selector: which fields of the Student stream and which fields of the StudentRequest stream to join on
                 .window()
    -- choose the window size and the window type for the business need: tumbling, sliding or session. A tumbling window fits this case.
                 .allowedLateness()
    -- lets late data still be included in the computation
                 .apply(JoinFunction)
    -- the JoinFunction given to apply declares which fields the join returns
    
    The SQL that this first join implements:
    select 
    a.s_id,
    a.s_name,
    a.school_name,
    b.exam_code
    from 
    student a join studentrequest b
    on a.s_id = b.s_id and a.s_name = b.s_name
    
package com.sou1yu.threeJoin;
import org.apache.flink.api.common.functions.JoinFunction;
import org.apache.flink.api.java.functions.KeySelector;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.api.java.tuple.Tuple4;
import org.apache.flink.streaming.api.TimeCharacteristic;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.timestamps.BoundedOutOfOrdernessTimestampExtractor;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

/**
 * @Author: sou1yu
 * @Email:sou1yu@aliyun.com
 */
public class MultipleStreamJoin {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setParallelism(1);
        env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);
        env.getConfig().setAutoWatermarkInterval(500);
        //env.setStateBackend()


        DataStreamSource<String> studentStream =env.socketTextStream("192.168.52.100", 7777);
        DataStreamSource<String> stuReqStream =env.socketTextStream("192.168.52.100", 8888);
        DataStreamSource<String> scoreStream =env.socketTextStream("192.168.52.100", 9999);
        //1.1 map Student into the bean and assign watermarks
        SingleOutputStreamOperator<Student> stuOperatorStream = studentStream
                .map(line -> {
                    String[] fields = line.split(",");
                    return new Student(fields[0], fields[1], fields[2], Long.parseLong(fields[3]));
        })      .assignTimestampsAndWatermarks(new BoundedOutOfOrdernessTimestampExtractor<Student>(Time.seconds(1)) {
                    @Override
                    public long extractTimestamp(Student element) {
                        return element.getDataStamp();
                    }
                });
        //1.2 map StudentRequest into the bean and assign watermarks
        SingleOutputStreamOperator<StudentRequest> reqOperatorStream = stuReqStream
                .map(line -> {
                    String[] fields = line.split(",");
                    return new StudentRequest(fields[0], fields[1], fields[2], Long.parseLong(fields[3]));
                })
                .assignTimestampsAndWatermarks(new BoundedOutOfOrdernessTimestampExtractor<StudentRequest>(Time.seconds(1)) {
                    @Override
                    public long extractTimestamp(StudentRequest element) {
                        return element.getDataStamp();
                    }
                });
        //1.3 map Score into the bean and assign watermarks
        SingleOutputStreamOperator<Score> scoreOperatorStream = scoreStream
                .map(line -> {
                    String[] fields = line.split(",");
                    return new Score(fields[0], fields[1], fields[2], Double.parseDouble(fields[3]), Long.parseLong(fields[4]));
                })
                .assignTimestampsAndWatermarks(new BoundedOutOfOrdernessTimestampExtractor<Score>(Time.seconds(1)) {
                    @Override
                    public long extractTimestamp(Score element) {
                        return element.getDataStamp();
                    }
                });
        //2.1 student join studentRequest
        DataStream<Tuple4<String, String, String,String>> firstJoinStream = stuOperatorStream
                .join(reqOperatorStream)
                .where(new FirstJoinStuKey())
                .equalTo(new FirstJoinReqKey())
                .window(TumblingEventTimeWindows.of(Time.seconds(5)))
            	.allowedLateness(Time.milliseconds(500))
                .apply(new FirstJoinResult());


        
        studentStream.printToErr("student>>");
        stuReqStream.printToErr("req");
        firstJoinStream.printToErr("FIRST JOIN >>");

        env.execute(MultipleStreamJoin.class.getSimpleName());

    }

    //1.1 Student-side key fields for the student-studentReq join
    public static class FirstJoinStuKey implements KeySelector<Student,Tuple2<String,String>>{

        @Override
        public Tuple2<String, String> getKey(Student value) throws Exception {
            return new Tuple2<>(value.getsId(),value.getsName());
        }
    }
    //1.2 StudentRequest-side key fields for the student-studentReq join
    public static class  FirstJoinReqKey implements KeySelector<StudentRequest,Tuple2<String,String>>{

        @Override
        public Tuple2<String, String> getKey(StudentRequest value) throws Exception {
            return new Tuple2<>(value.getsId(),value.getsName());
        }
    }
    //1.3 result fields extracted from the student-studentReq join
    public static class  FirstJoinResult implements JoinFunction<Student,StudentRequest,Tuple4<String,String,String,String>>{

        @Override
        public Tuple4<String, String, String, String> join(Student stu, StudentRequest req) throws Exception {
            return new Tuple4<>(stu.getsId(),stu.getsName(),stu.getSchoolName(),req.getExamCode());
        }
    }

}

Test and verification

  1. Open ports 7777, 8888 and 9999 on the VM (listening for the student, studentRequest and Score data respectively)

  2. Port 9999 is opened so that the Flink job can start, but no data is fed into it in this test.

The first window covers [0, 5000).

Feeding in the Student and studentRequest data, record 5 (田七) is out-of-order data and is still handled, which proves the watermark works.

When Student record 6 (赵钱) arrived and could have triggered the window computation, nothing was computed; only after the matching studentRequest record 6 (赵钱) arrived to trigger its window did a join result appear. This proves the window setup works!

temp stream join Score stream

The temp stream produced by the first join is then joined with Score. The join itself follows exactly the same steps as above, so they are not repeated. The stream coming out of that join still needs a keyBy for grouping, followed by one sum aggregation. On to the code!

package com.sou1yu.threeJoin;
import org.apache.flink.api.common.functions.JoinFunction;
import org.apache.flink.api.java.functions.KeySelector;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.api.java.tuple.Tuple3;
import org.apache.flink.api.java.tuple.Tuple4;
import org.apache.flink.streaming.api.TimeCharacteristic;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.timestamps.BoundedOutOfOrdernessTimestampExtractor;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

/**
 * @Author: sou1yu
 * @Email:sou1yu@aliyun.com
 */
public class MultipleStreamJoin {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setParallelism(1);
        env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);
        env.getConfig().setAutoWatermarkInterval(500);
        //env.setStateBackend()


        DataStreamSource<String> studentStream =env.socketTextStream("192.168.52.100", 7777);
        DataStreamSource<String> stuReqStream =env.socketTextStream("192.168.52.100", 8888);
        DataStreamSource<String> scoreStream =env.socketTextStream("192.168.52.100", 9999);
        //1.1 map Student into the bean and assign watermarks
        SingleOutputStreamOperator<Student> stuOperatorStream = studentStream
                .map(line -> {
                    String[] fields = line.split(",");
                    return new Student(fields[0], fields[1], fields[2], Long.parseLong(fields[3]));
        })      .assignTimestampsAndWatermarks(new BoundedOutOfOrdernessTimestampExtractor<Student>(Time.seconds(1)) {
                    @Override
                    public long extractTimestamp(Student element) {
                        return element.getDataStamp();
                    }
                });
        //1.2 map StudentRequest into the bean and assign watermarks
        SingleOutputStreamOperator<StudentRequest> reqOperatorStream = stuReqStream
                .map(line -> {
                    String[] fields = line.split(",");
                    return new StudentRequest(fields[0], fields[1], fields[2], Long.parseLong(fields[3]));
                })
                .assignTimestampsAndWatermarks(new BoundedOutOfOrdernessTimestampExtractor<StudentRequest>(Time.seconds(1)) {
                    @Override
                    public long extractTimestamp(StudentRequest element) {
                        return element.getDataStamp();
                    }
                });
        //1.3 map Score into the bean and assign watermarks
        SingleOutputStreamOperator<Score> scoreOperatorStream = scoreStream
                .map(line -> {
                    String[] fields = line.split(",");
                    return new Score(fields[0], fields[1], fields[2], Double.parseDouble(fields[3]), Long.parseLong(fields[4]));
                })
                .assignTimestampsAndWatermarks(new BoundedOutOfOrdernessTimestampExtractor<Score>(Time.seconds(1)) {
                    @Override
                    public long extractTimestamp(Score element) {
                        return element.getDataStamp();
                    }
                });
        //2.1 student join studentRequest
        DataStream<Tuple4<String, String, String,String>> firstJoinStream = stuOperatorStream
                .join(reqOperatorStream)
                .where(new FirstJoinStuKey())
                .equalTo(new FirstJoinReqKey())
                .window(TumblingEventTimeWindows.of(Time.seconds(5)))
            	.allowedLateness(Time.milliseconds(5000))
                .apply(new FirstJoinResult());

        //2.2 join the result of the first join with the score stream
        DataStream<Tuple3<String, String, Double>> resultJoinedStream = firstJoinStream
                .join(scoreOperatorStream)
                .where(new SecondJoinTupleKey())
                .equalTo(new SecondJoinScoreKey())
                .window(TumblingEventTimeWindows.of(Time.seconds(5)))
            	.allowedLateness(Time.milliseconds(5000))
                .apply(new SecondJoinResult());

        //3 group and aggregate the joined result
        SingleOutputStreamOperator<Tuple3<String, String, Double>> resultStream = resultJoinedStream
                .keyBy(0, 1)
                .sum(2);


        studentStream.printToErr("student>>");
        stuReqStream.printToErr("req");
        scoreStream.printToErr("score");
        resultStream.printToErr("Result Stream>>>");


        env.execute(MultipleStreamJoin.class.getSimpleName());

    }

    //2.1.1 Student-side key fields for the student-studentReq join
    public static class FirstJoinStuKey implements KeySelector<Student,Tuple2<String,String>>{

        @Override
        public Tuple2<String, String> getKey(Student value) throws Exception {
            return new Tuple2<>(value.getsId(),value.getsName());
        }
    }
    //2.1.2 StudentRequest-side key fields for the student-studentReq join
    public static class  FirstJoinReqKey implements KeySelector<StudentRequest,Tuple2<String,String>>{

        @Override
        public Tuple2<String, String> getKey(StudentRequest value) throws Exception {
            return new Tuple2<>(value.getsId(),value.getsName());
        }
    }
    //2.1.3 result fields extracted from the student-studentReq join
    public static class  FirstJoinResult implements JoinFunction<Student,StudentRequest,Tuple4<String,String,String,String>>{

        @Override
        public Tuple4<String, String, String, String> join(Student stu, StudentRequest req) throws Exception {
            //pack sId, sName, examCode from the request stream and SchoolName from the student stream into a 4-tuple
            return new Tuple4<>(req.getsId(),req.getsName(),stu.getSchoolName(),req.getExamCode());
        }
    }
    //2.2.1 firstJoin-stream key fields for the join with score
    public static  class SecondJoinTupleKey implements KeySelector<Tuple4<String, String, String,String>, Tuple3<String,String,String>>{

        @Override
        public Tuple3<String, String, String> getKey(Tuple4<String, String, String, String> value) throws Exception {
            return new Tuple3<>(value.f0,value.f1,value.f3);
        }
    }
    //2.2.2 Score-side key fields for the join with the firstJoin stream
    public static class SecondJoinScoreKey implements KeySelector<Score,Tuple3<String,String,String>>{

        @Override
        public Tuple3<String, String, String> getKey(Score value) throws Exception {
            return new Tuple3<>(value.getsId(),value.getSName(),value.getExamCode());
        }
    }
    //2.2.3 result fields extracted from the firstJoin-score join
    public static class  SecondJoinResult implements JoinFunction<Tuple4<String,String,String,String>,Score,Tuple3<String,String,Double>>{

        @Override
        public Tuple3<String, String, Double> join(Tuple4<String, String, String, String> first, Score score) throws Exception {
            //pack schoolName and sName from the firstJoin stream and score from the Score stream into a 3-tuple
            return new Tuple3<>(first.f2,first.f1,score.getScore());
        }
    }


}

For the second join I mostly used Flink's built-in Tuple types to pass values in and out, which is easy to mix up while writing. Depending on how many fields the business involves, you can instead wrap them in custom JavaBeans, which makes the KeySelector and JoinFunction implementations noticeably clearer.
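As a sketch of that suggestion (JoinKey is an invented name, not part of the code above): a small key class could replace the Tuple3 used by the second join's key selectors. The part that matters is stable equals()/hashCode(), which Flink's keying relies on.

```java
import java.util.Objects;

// Hypothetical composite key for the second join: (s_id, s_name, exam_code).
// A KeySelector on either side would simply return new JoinKey(...).
public class JoinKey {
    private final String sId;
    private final String sName;
    private final String examCode;

    public JoinKey(String sId, String sName, String examCode) {
        this.sId = sId;
        this.sName = sName;
        this.examCode = examCode;
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof JoinKey)) return false;
        JoinKey k = (JoinKey) o;
        return Objects.equals(sId, k.sId)
                && Objects.equals(sName, k.sName)
                && Objects.equals(examCode, k.examCode);
    }

    @Override
    public int hashCode() {
        return Objects.hash(sId, sName, examCode);
    }
}
```

A KeySelector<Score, JoinKey> would then read `return new JoinKey(value.getsId(), value.getSName(), value.getExamCode());`, which is harder to get wrong than remembering tuple field positions.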

Test and verification

Data preparation:

student data

1,张三,xx大学,1000
2,李四,yy大学,2000
3,王五,yy大学,3000
4,赵六,xx大学,5000
5,田七,yy大学,4000
6,田七,yy大学,4900
7,田七,yy大学,5990
8,田七,yy大学,7000
9,田七,yy大学,4500

StudentRequest data

1,张三,A0001,1000
2,李四,A0002,2000
3,王五,A0003,3000
4,赵六,A0004,5000
5,田七,A0005,4000
6,田七,A0006,4900
7,田七,A0007,5990
8,田七,A0008,7000
9,田七,A0009,4500

Score data

1,张三,A0001,45,1000
2,李四,A0002,56,2000
3,王五,A0003,33,3000
4,赵六,A0004,89,5000
5,田七,A0005,20,4000
6,田七,A0006,70,4900
7,田七,A0007,70,5990
8,田七,A0008,30,7000
9,田七,A0009,30,4500

Three key points to verify this time:

  1. computation of out-of-order data
  2. computation of late data
  3. correctness of the three-stream join result

With the current settings, analyzing the Student data:

  • window size: 5 s
  • watermark (max out-of-orderness): 1 s
  • allowedLateness: 5 s

In the Student data:

The first window is [0, 5000); window intervals are left-closed, right-open.

Records 1-3 are in-order data.

Record 4 (ts 5000) does not belong to the first window (left-closed, right-open).

Records 5 and 6 are out-of-order data; thanks to the watermark mechanism they are still assigned to the first window.

Record 7 does not belong to the first window, and it does not trigger the first window's computation either: its timestamp (5990) has not passed the watermark threshold of 5 s + 1 s.

Record 8 does not belong to the first window, but it pushes the watermark past that threshold, so the first window fires; because of allowedLateness the window is not closed yet.

Record 9 belongs to the first window, and the current maximum timestamp has not exceeded window size + watermark delay + allowedLateness, so it is computed into the first window once more.

The same reasoning applies to the other two data sources, so it is not repeated.
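The arithmetic behind that record-by-record analysis can be sketched in plain Java. This is an illustrative sketch using this post's settings, not Flink code; one subtlety is that Flink treats a window's maximum timestamp as end - 1 ms.

```java
public class WindowTimingDemo {
    static final long WINDOW_MAX_TS = 5000L - 1;  // window [0, 5000) has max timestamp 4999 ms
    static final long MAX_OUT_OF_ORDER = 1000L;   // bound given to BoundedOutOfOrdernessTimestampExtractor
    static final long ALLOWED_LATENESS = 5000L;

    // The extractor emits watermark = max seen timestamp - out-of-orderness bound.
    public static long watermark(long maxSeenTs) {
        return maxSeenTs - MAX_OUT_OF_ORDER;
    }

    // The first window fires once the watermark reaches its max timestamp...
    public static boolean fires(long maxSeenTs) {
        return watermark(maxSeenTs) >= WINDOW_MAX_TS;
    }

    // ...and its state is only dropped once the watermark reaches max timestamp + lateness.
    public static boolean closed(long maxSeenTs) {
        return watermark(maxSeenTs) >= WINDOW_MAX_TS + ALLOWED_LATENESS;
    }

    public static void main(String[] args) {
        System.out.println(fires(5990L));   // record 7: false (watermark 4990 < 4999)
        System.out.println(fires(7000L));   // record 8: true  (watermark 6000 >= 4999)
        System.out.println(closed(7000L));  // false: allowedLateness keeps the window open
        System.out.println(closed(11000L)); // true: a timestamp this large would finally drop it
    }
}
```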

The observed results of the run:

The input order was Student records 1-8, then StudentRequest records 1-8, then Score records 1-8. At that point the first window fired; note that 田七's sum is 20 + 70 = 90, and the window is not yet closed.

When I then fed in 田七's late records, again in the order Student, StudentRequest, Score, another computation was triggered after the Score record arrived, and the data was still counted into the first window: 田七's sum became 20 + 70 + 30 = 120.

Takeaways

This completes the three-stream join test. It confirms that this style of join works, and demonstrates the watermark and allowedLateness mechanisms. The code still has rough edges, such as filtering out nulls and handling invalid values, which are important checks in practice. Flink also provides a third safety net for late data, sideOutputLateData: data that misses even allowedLateness can be collected through a side output into an RDBMS or other storage, then merged back with its window's data in a batch job to keep the results complete.

