Flink: Dual-Stream Joins and Dimension-Table Joins

GScallion

Article link: https://blog.csdn.net/qq_24325581/article/details/116783479

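I. Dual-Stream Joins

The source logs are shown in the full code listing.

1. Inner join

The join() operator provides "window join" semantics: an inner join on the specified key within a (tumbling, sliding, or session) window.
The following example inner-joins the click log with the impression log.

// 1. Inner join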
DataStream<String> innerJoinStream = clickTupleStream
        .join(infoTupleStream)
        .where(tuple -> tuple.f0) // key of the click stream clickTupleStream
        .equalTo(tuple -> tuple.f0) // key of the impression stream infoTupleStream
        .window(TumblingProcessingTimeWindows.of(Time.seconds(10)))
        .apply(new JoinFunction<Tuple2<String, String>, Tuple2<String, String>, String>() {
            @Override
            public String join(Tuple2<String, String> click, Tuple2<String, String> info) throws Exception {
                return click.f0 + " " + click.f1 + " " + info.f1;
            }
        });
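
To make the window-join semantics concrete, here is a minimal plain-Java sketch (no Flink dependency) of what a tumbling-window inner join computes: two records are paired only when they share both the key and the window. The class WindowJoinSketch, the Event type, and the timestamps are hypothetical names made up for illustration; Flink's real operator does this incrementally with state and triggers rather than with nested loops.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Plain-Java illustration of tumbling-window inner-join semantics (not Flink code). */
final class WindowJoinSketch {

    /** A keyed record with a non-negative timestamp in milliseconds. */
    static final class Event {
        final String key;
        final String value;
        final long timestamp;

        Event(String key, String value, long timestamp) {
            this.key = key;
            this.value = value;
            this.timestamp = timestamp;
        }
    }

    /** Start of the tumbling window of the given size that contains the timestamp. */
    static long windowStart(long timestamp, long windowSizeMs) {
        return timestamp - (timestamp % windowSizeMs);
    }

    /** Emits one result per (left, right) pair that shares both key and window. */
    static List<String> innerJoin(List<Event> left, List<Event> right, long windowSizeMs) {
        List<String> out = new ArrayList<>();
        for (Event l : left) {
            for (Event r : right) {
                boolean sameKey = l.key.equals(r.key);
                boolean sameWindow =
                        windowStart(l.timestamp, windowSizeMs) == windowStart(r.timestamp, windowSizeMs);
                if (sameKey && sameWindow) {
                    out.add(l.key + " " + l.value + " " + r.value);
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Event> clicks = Arrays.asList(
                new Event("u1", "click:page", 1_000L),
                new Event("u2", "click:action", 12_000L));
        List<Event> infos = Arrays.asList(
                new Event("u1", "info:pageinfo", 9_000L),   // same 10 s window as the u1 click
                new Event("u2", "info:pageinfo", 8_000L));  // earlier window than the u2 click
        System.out.println(innerJoin(clicks, infos, 10_000L));
    }
}
```

In this sketch the u2 records are only 4 seconds apart but fall on different sides of a window boundary, so they are not joined. That is the limitation the interval join in section 3 addresses.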

The source of JoinFunction is as follows:

package org.apache.flink.api.common.functions;

import org.apache.flink.annotation.Public;

import java.io.Serializable;

/**
 * Interface for Join functions. Joins combine two data sets by joining their
 * elements on specified keys. This function is called with each pair of joining elements.
 *
 * <p>By default, the joins follows strictly the semantics of an "inner join" in SQL.
 * the semantics are those of an "inner join", meaning that elements are filtered out
 * if their key is not contained in the other data set.
 *
 * <p>The basic syntax for using Join on two data sets is as follows:
 * <pre>{@code
 * DataSet<X> set1 = ...;
 * DataSet<Y> set2 = ...;
 *
 * set1.join(set2).where(<key-definition>).equalTo(<key-definition>).with(new MyJoinFunction());
 * }</pre>
 *
 * <p>{@code set1} is here considered the first input, {@code set2} the second input.
 *
 * <p>The Join function is an optional part of a join operation. If no JoinFunction is provided,
 * the result of the operation is a sequence of 2-tuples, where the elements in the tuple are those that
 * the JoinFunction would have been invoked with.
 *
 * <p>Note: You can use a {@link CoGroupFunction} to perform an outer join.
 *
 * @param <IN1> The type of the elements in the first input.
 * @param <IN2> The type of the elements in the second input.
 * @param <OUT> The type of the result elements.
 */
@Public
@FunctionalInterface
public interface JoinFunction<IN1, IN2, OUT> extends Function, Serializable {

	/**
	 * The join method, called once per joined pair of elements.
	 *
	 * @param first The element from first input.
	 * @param second The element from second input.
	 * @return The resulting element.
	 *
	 * @throws Exception This method may throw exceptions. Throwing an exception will cause the operation
	 *                   to fail and may trigger recovery.
	 */
	OUT join(IN1 first, IN2 second) throws Exception;
}

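2. Outer join

The coGroup() operator can be used to implement left/right outer joins.
The following examples perform a left and a right outer join between the impression log and the click log.

// Left outer join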
DataStream<String> leftOutJoinStream = infoTupleStream
        .coGroup(clickTupleStream)
        .where(tuple -> tuple.f0)
        .equalTo(tuple -> tuple.f0)
        .window(TumblingProcessingTimeWindows.of(Time.seconds(10)))
        .apply(new CoGroupFunction<Tuple2<String, String>, Tuple2<String, String>, String>() {
            @Override
            public void coGroup(Iterable<Tuple2<String, String>> infoIterable, Iterable<Tuple2<String, String>> clickIterable, Collector<String> collector) throws Exception {
                // Iterate over the left stream
                for (Tuple2<String, String> infoRecord : infoIterable) {
                    boolean isMatched = false;
                    // Iterate over the right stream
                    for (Tuple2<String, String> clickRecord : clickIterable) {
                        // The right stream has a matching record
                        collector.collect(infoRecord.f0 + " " + infoRecord.f1 + " " + clickRecord.f1);
                        isMatched = true;
                    }
                    if (!isMatched) {
                        // No data in the right stream
                        collector.collect(infoRecord.f0 + " " + infoRecord.f1 + " " + null);
                    }
                }
            }
        });

// Right outer join
DataStream<String> rightOutJoinStream = clickTupleStream
        .coGroup(infoTupleStream)
        .where(tuple -> tuple.f0)
        .equalTo(tuple -> tuple.f0)
        .window(TumblingProcessingTimeWindows.of(Time.seconds(10)))
        .apply(new CoGroupFunction<Tuple2<String, String>, Tuple2<String, String>, String>() {
            @Override
            public void coGroup(Iterable<Tuple2<String, String>> clickIterable, Iterable<Tuple2<String, String>> infoIterable, Collector<String> collector) throws Exception {
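                // Iterate over the right stream (the impression log, whose records are all kept)
                for (Tuple2<String, String> infoRecord : infoIterable) {
                    // Reset the flag for every right-side record
                    boolean isMatched = false;
                    // Iterate over the left stream
                    for (Tuple2<String, String> clickRecord : clickIterable) {
                        // The left stream has a matching record
                        collector.collect(infoRecord.f0 + " " + clickRecord.f1 + " " + infoRecord.f1);
                        isMatched = true;
                    }
                    if (!isMatched) {
                        // No data in the left stream
                        collector.collect(infoRecord.f0 + " " + null + " " + infoRecord.f1);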
                    }
                }
            }
        });
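
The outer-join logic inside a coGroup call can be reduced to the following plain-Java sketch (no Flink dependency). It assumes the two lists already hold the records of one key in one window, which is what coGroup hands to the function; the class and method names are hypothetical. The point to notice is that the matched flag is local to each left record, so an unmatched left record still produces a row padded with null.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

/** Plain-Java illustration of the left-outer-join logic used inside a coGroup function. */
final class CoGroupOuterJoinSketch {

    /** Left outer join of two groups that already share one key and one window. */
    static List<String> leftOuterJoin(String key, List<String> left, List<String> right) {
        List<String> out = new ArrayList<>();
        for (String l : left) {
            boolean matched = false; // must be reset for every left record
            for (String r : right) {
                out.add(key + " " + l + " " + r);
                matched = true;
            }
            if (!matched) {
                // The right group is empty: keep the left record and pad with null.
                out.add(key + " " + l + " null");
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> infos = Arrays.asList("info:pageinfo");
        // Key with clicks in the window: one row per (info, click) pair.
        System.out.println(leftOuterJoin("u1", infos, Arrays.asList("click:page", "click:action")));
        // Key without clicks in the window: the impression is still emitted.
        System.out.println(leftOuterJoin("u2", infos, Collections.<String>emptyList()));
    }
}
```

A right outer join is the same loop with the roles of the two groups swapped.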

The source of CoGroupFunction is as follows:

package org.apache.flink.api.common.functions;

import org.apache.flink.annotation.Public;
import org.apache.flink.util.Collector;

import java.io.Serializable;

/**
 * The interface for CoGroup functions. CoGroup functions combine two data sets by first grouping each data set
 * after a key and then "joining" the groups by calling this function with the two sets for each key.
 * If a key is present in only one of the two inputs, it may be that one of the groups is empty.
 *
 * <p>The basic syntax for using CoGroup on two data sets is as follows:
 * <pre>{@code
 * DataSet<X> set1 = ...;
 * DataSet<Y> set2 = ...;
 *
 * set1.coGroup(set2).where(<key-definition>).equalTo(<key-definition>).with(new MyCoGroupFunction());
 * }</pre>
 *
 * <p>{@code set1} is here considered the first input, {@code set2} the second input.
 *
 * <p>Some keys may only be contained in one of the two original data sets. In that case, the CoGroup function is invoked
 * with in empty input for the side of the data set that did not contain elements with that specific key.
 *
 * @param <IN1> The data type of the first input data set.
 * @param <IN2> The data type of the second input data set.
 * @param <O> The data type of the returned elements.
 */
@Public
@FunctionalInterface
public interface CoGroupFunction<IN1, IN2, O> extends Function, Serializable {

	/**
	 * This method must be implemented to provide a user implementation of a
	 * coGroup. It is called for each pair of element groups where the elements share the
	 * same key.
	 *
	 * @param first The records from the first input.
	 * @param second The records from the second.
	 * @param out A collector to return elements.
	 *
	 * @throws Exception The function may throw Exceptions, which will cause the program to cancel,
	 *                   and may trigger the recovery logic.
	 */
	void coGroup(Iterable<IN1> first, Iterable<IN2> second, Collector<O> out) throws Exception;
}

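3. Interval join

Both join() and coGroup() associate records by window. In some cases, however, the two streams do not advance at the same pace. For example, a click record may be written long after the corresponding impression, so if a window bounds the match, the two records can easily land in different windows and fail to join. Flink therefore also provides "interval join" semantics: records are joined on the specified key within a time interval that bounds the right stream's timestamp relative to the left stream's. Note that the DataStream interval join operates on event time, so both streams need timestamps and watermarks assigned.

// 3. Interval join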
SingleOutputStreamOperator<String> intervalJoinStream = infoTupleStream
        .keyBy(record -> record.f0)
        .intervalJoin(clickTupleStream.keyBy(record -> record.f0))
        .between(Time.seconds(-30), Time.seconds(30)) // time interval of the right stream relative to the left stream
        .process(new ProcessJoinFunction<Tuple2<String, String>, Tuple2<String, String>, String>() {
            @Override
            public void processElement(Tuple2<String, String> infoRecord, Tuple2<String, String> clickRecord, Context context, Collector<String> collector) throws Exception {
                collector.collect(infoRecord.f0 + " " + infoRecord.f1 + " " + clickRecord.f1);
            }
        });
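
The matching rule behind between(lowerBound, upperBound) can be written as a single predicate. The plain-Java sketch below (no Flink dependency, hypothetical class and method names) assumes the default behaviour in which both bounds are inclusive: a right record joins a left record of the same key when its timestamp lies in [left + lowerBound, left + upperBound].

```java
/** Plain-Java illustration of the interval-join matching condition (not Flink code). */
final class IntervalJoinSketch {

    /**
     * Returns true when the right timestamp lies within
     * [leftTs + lowerBoundMs, leftTs + upperBoundMs], both ends inclusive.
     */
    static boolean inInterval(long leftTs, long rightTs, long lowerBoundMs, long upperBoundMs) {
        return rightTs >= leftTs + lowerBoundMs && rightTs <= leftTs + upperBoundMs;
    }

    public static void main(String[] args) {
        long impressionTs = 100_000L;
        long lower = -30_000L; // corresponds to Time.seconds(-30)
        long upper = 30_000L;  // corresponds to Time.seconds(30)

        // A click 25 s after the impression is inside the interval.
        System.out.println(inInterval(impressionTs, 125_000L, lower, upper));
        // A click 31 s after the impression is outside the interval.
        System.out.println(inInterval(impressionTs, 131_000L, lower, upper));
    }
}
```

Unlike the window join, the interval is anchored at each left record, so two records that are close in time are never separated by a window boundary.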

ProcessJoinFunction source-code analysis article

4. Full code

package com.scallion.job;

import com.scallion.common.Common;
import com.scallion.utils.FlinkUtil;
import org.apache.flink.api.common.functions.CoGroupFunction;
import org.apache.flink.api.common.functions.JoinFunction;
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.functions.co.ProcessJoinFunction;
import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.util.Collector;

/**
 * created by gaowj.
 * created on 2021-05-13.
 * function:
 * origin ->
 */
public class TwoStreamJoinJob implements Job {
    @Override
    public void run() {
        /**
         * source
         */
        // Click log
        DataStream<String> clickStream = FlinkUtil.getKafkaStream(Common.KAFKA_BROKER, Common.APP_NEWSAPP_TOPIC, Common.KAFKA_CONSUMER_GROUP_ID);
        // Impression log
        DataStream<String> infoStream = FlinkUtil.getKafkaStream(Common.KAFKA_BROKER, Common.APP_NEWSAPP_INFO_TOPIC, Common.KAFKA_CONSUMER_GROUP_ID);
        /**
         * transform
         */
        SingleOutputStreamOperator<Tuple2<String, String>> clickTupleStream = clickStream.map(new MapFunction<String, Tuple2<String, String>>() {
            @Override
            public Tuple2<String, String> map(String record) throws Exception {
                String[] split = record.split("\t");
                // split[5]: userkey
                // split[11]: click action type; common values are action, page, duration, btomnews
                Tuple2<String, String> tuple = new Tuple2<>(split[5], "click:" + split[11]);
                return tuple;
            }
        });
        SingleOutputStreamOperator<Tuple2<String, String>> infoTupleStream = infoStream.map(new MapFunction<String, Tuple2<String, String>>() {
            @Override
            public Tuple2<String, String> map(String record) throws Exception {
                String[] split = record.split("\t");
                // split[5]: userkey
                // split[11]: impression action type; currently only pageinfo
                Tuple2<String, String> tuple = new Tuple2<>(split[5], "info:" + split[11]);
                return tuple;
            }
        });

        // 1. Inner join
        DataStream<String> innerJoinStream = clickTupleStream
                .join(infoTupleStream)
                .where(tuple -> tuple.f0)
                .equalTo(tuple -> tuple.f0)
                .window(TumblingProcessingTimeWindows.of(Time.seconds(10)))
                .apply(new JoinFunction<Tuple2<String, String>, Tuple2<String, String>, String>() {
                    @Override
                    public String join(Tuple2<String, String> click, Tuple2<String, String> info) throws Exception {
                        return click.f0 + " " + click.f1 + " " + info.f1;
                    }
                });
        // 2. Left/right outer join
        DataStream<String> leftOutJoinStream = infoTupleStream
                .coGroup(clickTupleStream)
                .where(tuple -> tuple.f0)
                .equalTo(tuple -> tuple.f0)
                .window(TumblingProcessingTimeWindows.of(Time.seconds(10)))
                .apply(new CoGroupFunction<Tuple2<String, String>, Tuple2<String, String>, String>() {
                    @Override
                    public void coGroup(Iterable<Tuple2<String, String>> infoIterable, Iterable<Tuple2<String, String>> clickIterable, Collector<String> collector) throws Exception {
                        // Iterate over the left stream
                        for (Tuple2<String, String> infoRecord : infoIterable) {
                            boolean isMatched = false;
                            // Iterate over the right stream
                            for (Tuple2<String, String> clickRecord : clickIterable) {
                                // The right stream has a matching record
                                collector.collect(infoRecord.f0 + " " + infoRecord.f1 + " " + clickRecord.f1);
                                isMatched = true;
                            }
                            if (!isMatched) {
                                // No data in the right stream
                                collector.collect(infoRecord.f0 + " " + infoRecord.f1 + " " + null);
                            }
                        }
                    }
                });
        DataStream<String> rightOutJoinStream = clickTupleStream
                .coGroup(infoTupleStream)
                .where(tuple -> tuple.f0)
                .equalTo(tuple -> tuple.f0)
                .window(TumblingProcessingTimeWindows.of(Time.seconds(10)))
                .apply(new CoGroupFunction<Tuple2<String, String>, Tuple2<String, String>, String>() {
                    @Override
                    public void coGroup(Iterable<Tuple2<String, String>> clickIterable, Iterable<Tuple2<String, String>> infoIterable, Collector<String> collector) throws Exception {
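                        // Iterate over the right stream (the impression log, whose records are all kept)
                        for (Tuple2<String, String> infoRecord : infoIterable) {
                            // Reset the flag for every right-side record
                            boolean isMatched = false;
                            // Iterate over the left stream
                            for (Tuple2<String, String> clickRecord : clickIterable) {
                                // The left stream has a matching record
                                collector.collect(infoRecord.f0 + " " + clickRecord.f1 + " " + infoRecord.f1);
                                isMatched = true;
                            }
                            if (!isMatched) {
                                // No data in the left stream
                                collector.collect(infoRecord.f0 + " " + null + " " + infoRecord.f1);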
                            }
                        }
                    }
                });
        // 3. Interval join
        SingleOutputStreamOperator<String> intervalJoinStream = infoTupleStream
                .keyBy(record -> record.f0)
                .intervalJoin(clickTupleStream.keyBy(record -> record.f0))
                .between(Time.seconds(-30), Time.seconds(30))
                .process(new ProcessJoinFunction<Tuple2<String, String>, Tuple2<String, String>, String>() {
                    @Override
                    public void processElement(Tuple2<String, String> infoRecord, Tuple2<String, String> clickRecord, Context context, Collector<String> collector) throws Exception {
                        collector.collect(infoRecord.f0 + " " + infoRecord.f1 + " " + clickRecord.f1);
                    }
                });

        /**
         * sink
         */
//        innerJoinStream.print();
//        leftOutJoinStream.print();
//        rightOutJoinStream.print();
        intervalJoinStream.print();
    }
}

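II. Dimension-Table Joins

1. Preloading or periodically reloading the dimension table

Use the open method of a RichFunction to preload the dimension table into memory, or to reload it on a schedule. This suits dimension tables that are small and change infrequently; data from external storage systems (Redis, HBase, MySQL) can be loaded into memory this way.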

package com.scallion.transform;

import com.scallion.utils.TimeUtil;
import org.apache.flink.api.common.functions.RichMapFunction;
import org.apache.flink.configuration.Configuration;

import java.util.Timer;
import java.util.TimerTask;

/**
 * created by gaowj.
 * created on 2021-05-14.
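 * function: preload / periodically reload dimension-table data
 * origin ->
 */
public class JoinWithDimMapFunction extends RichMapFunction<String, String> {
    // Choose a suitable data structure for holding the dimension data in memory;
    // a Google Guava cache built with CacheBuilder is another option.
    // volatile: written by the timer thread, read by the task thread in map().
    private volatile String dimRecord;
    private transient Timer timer;

    // open() is called before map() starts processing, so it suits one-time setup such as
    // configuring access to external systems (HBase, Redis, MySQL).
    // For functions that are part of an iteration, it is called at the start of each iteration superstep.
    // Here we simulate periodically fetching the dimension data from an external system and caching it in memory.
    @Override
    public void open(Configuration parameters) throws Exception {
        TimerTask task = new TimerTask() {
            @Override
            public void run() {
                dimRecord = "dimension data as of " + TimeUtil.getTimestampToDate(System.currentTimeMillis());
            }
        };
        // Daemon timer, so the refresh thread cannot keep the JVM alive.
        timer = new Timer(true);
        timer.schedule(task, 0, 1000);
    }

    /**
     * @param record a user click-log record
     * @return the record joined with the cached dimension data
     * @throws Exception
     */
    @Override
    public String map(String record) throws Exception {
        String[] split = record.split("\t");
        String res = "userkey:" + split[5] + " opa:" + split[11] + " joined with " + dimRecord;
        return res;
    }

    @Override
    public void close() throws Exception {
        // Stop the refresh timer when the task shuts down.
        if (timer != null) {
            timer.cancel();
        }
        super.close();
    }
}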
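
For a dimension table with more than one row, the same preload-and-refresh idea can be sketched in plain Java (no Flink dependency) as a cache that swaps in a fresh immutable snapshot on a schedule. Everything here is an assumption made for illustration: the class name ReloadingDimCache, the Supplier standing in for the external query, and the choice of ScheduledExecutorService over Timer. Inside a Flink job, the constructor logic would live in open() and close() would be called from the function's close().

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

/** Plain-Java sketch of a periodically reloaded in-memory dimension table. */
final class ReloadingDimCache implements AutoCloseable {
    private final Supplier<Map<String, String>> loader; // hypothetical stand-in for the external query
    private final ScheduledExecutorService scheduler;
    // volatile: replaced by the scheduler thread, read by the processing thread.
    private volatile Map<String, String> snapshot;

    ReloadingDimCache(Supplier<Map<String, String>> loader, long periodMs) {
        this.loader = loader;
        // Load once up front so that lookups never observe a missing snapshot.
        this.snapshot = Collections.unmodifiableMap(new HashMap<>(loader.get()));
        this.scheduler = Executors.newSingleThreadScheduledExecutor(runnable -> {
            Thread thread = new Thread(runnable, "dim-reloader");
            thread.setDaemon(true);
            return thread;
        });
        scheduler.scheduleAtFixedRate(this::reload, periodMs, periodMs, TimeUnit.MILLISECONDS);
    }

    /** Replaces the snapshot with a fresh copy; keeps the old one if loading fails. */
    void reload() {
        try {
            snapshot = Collections.unmodifiableMap(new HashMap<>(loader.get()));
        } catch (RuntimeException e) {
            // Swallow the failure: an uncaught exception would cancel the periodic task.
        }
    }

    String lookup(String key, String defaultValue) {
        String value = snapshot.get(key);
        return value != null ? value : defaultValue;
    }

    @Override
    public void close() {
        scheduler.shutdownNow();
    }
}
```

Swapping a whole immutable map, rather than mutating a shared one, means a lookup always sees one consistent version of the table without locking.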

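2. Hot-storage dimension table: use async I/O to raise lookup throughput

Databases that can serve dimension-table joins through async I/O include MySQL, Oracle, Redis, and HBase.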

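package com.scallion.transform;

import com.scallion.common.Common;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.async.ResultFuture;
import org.apache.flink.streaming.api.functions.async.RichAsyncFunction;

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.Collections;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

/**
 * created by gaowj.
 * created on 2021-05-17.
 * function: async I/O
 * origin ->
 */
public class AsyncIOMySQLFunction extends RichAsyncFunction<String, String> {
    private transient PreparedStatement ps;
    private transient Connection conn;
    // JDBC calls block and a statement is not thread-safe, so run all queries on one worker thread.
    private transient ExecutorService executor;

    @Override
    public void open(Configuration parameters) throws Exception {
        Class.forName(Common.DRIVERNAME);
        // Assign the field (not a local variable) so that close() can release the connection.
        conn = DriverManager.getConnection(Common.JDBCURL, Common.USERNAME, Common.PASSWORD);
        ps = conn.prepareStatement("select id,name,age,sex from tongji.rt_binlog_to_kudu where name='zhanghao'");
        executor = Executors.newSingleThreadExecutor();
    }

    @Override
    public void close() throws Exception {
        if (executor != null) {
            executor.shutdown();
        }
        if (conn != null) {
            conn.close();
        }
    }

    @Override
    public void asyncInvoke(String input, ResultFuture<String> resultFuture) throws Exception {
        // Run the blocking query off the operator thread and complete the future from the callback.
        CompletableFuture.supplyAsync(() -> {
            try (ResultSet rs = ps.executeQuery()) {
                String sqlStr = "";
                if (rs.next()) {
                    sqlStr = rs.getInt("id") +
                            rs.getString("name") +
                            rs.getInt("age") +
                            rs.getString("sex");
                }
                return input.split("\t")[5] + " " + sqlStr;
            } catch (SQLException e) {
                throw new CompletionException(e);
            }
        }, executor).whenComplete((result, error) -> {
            if (error != null) {
                resultFuture.completeExceptionally(error);
            } else {
                resultFuture.complete(Collections.singletonList(result));
            }
        });
    }
}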
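
The reason async I/O raises throughput is that many lookups are in flight at once instead of each record waiting for the previous round trip. The plain-Java sketch below (no Flink dependency) shows that pattern with CompletableFuture. The class AsyncLookupSketch and the blockingLookup function are hypothetical; collecting results in submission order loosely mirrors what AsyncDataStream.orderedWait guarantees, but this is an illustration of the idea, not of Flink's internals.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.Function;

/** Plain-Java illustration of issuing dimension lookups concurrently. */
final class AsyncLookupSketch {

    /** Starts all lookups without waiting, then collects the results in input order. */
    static List<String> enrichAll(List<String> keys,
                                  Function<String, String> blockingLookup,
                                  ExecutorService pool) {
        List<CompletableFuture<String>> pending = new ArrayList<>();
        for (String key : keys) {
            // supplyAsync returns immediately; the blocking call runs on a pool thread.
            pending.add(CompletableFuture.supplyAsync(() -> key + " " + blockingLookup.apply(key), pool));
        }
        List<String> results = new ArrayList<>();
        for (CompletableFuture<String> future : pending) {
            // Waiting in submission order preserves the order of the input records.
            results.add(future.join());
        }
        return results;
    }

    public static void main(String[] args) {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        try {
            // The lambda stands in for a database query keyed by userkey.
            System.out.println(enrichAll(Arrays.asList("u1", "u2", "u3"), key -> "city-of-" + key, pool));
        } finally {
            pool.shutdown();
        }
    }
}
```

With a client that offers a native asynchronous API, the thread pool becomes unnecessary because the client's own callback can complete the future.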

[Figure: structure of the RichAsyncFunction source]

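3. Broadcast dimension table

Use Flink's Broadcast State to broadcast the dimension-table stream to the downstream operator and perform the join there. The trade-offs are:
Advantage: updates to the dimension data are picked up promptly.
Drawback: the data is kept in memory by the state backend, so only a fairly small amount of data can be held.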

MapStateDescriptor<String, String> broadcastDesc = new MapStateDescriptor<>("broad1", String.class, String.class);
BroadcastStream<Tuple2<String, String>> broadcastStream = socketDimStream.broadcast(broadcastDesc);
SingleOutputStreamOperator<String> broadcastWithDimStream = clickStream
        .map(new MapFunction<String, String>() {
            @Override
            public String map(String line) throws Exception {
                return line.split("\t")[5].trim();
            }
        })
        .keyBy(line -> line)
        .window(ProcessingTimeSessionWindows.withGap(Time.seconds(30)))
        .reduce((line1, line2) -> line1)
        .connect(broadcastStream)
        .process(new DimBroadcastProcessFunction(broadcastDesc));

package com.scallion.transform;

import org.apache.flink.api.common.state.MapStateDescriptor;
import org.apache.flink.api.common.state.ReadOnlyBroadcastState;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.functions.co.BroadcastProcessFunction;
import org.apache.flink.util.Collector;

/**
 * created by gaowj.
 * created on 2021-05-17.
 * function: broadcast dimension table
 * origin ->
 */
public class DimBroadcastProcessFunction extends BroadcastProcessFunction<String, Tuple2<String, String>, String> {
    MapStateDescriptor<String, String> broadcastDesc;

    public DimBroadcastProcessFunction(MapStateDescriptor<String, String> broadcastDesc) {
        this.broadcastDesc = broadcastDesc;
    }

    // Called for each element of the non-broadcast stream
    @Override
    public void processElement(String input, ReadOnlyContext ctx, Collector<String> collector) throws Exception {
        // Read the broadcast state
        ReadOnlyBroadcastState<String, String> state = ctx.getBroadcastState(broadcastDesc);
        String cityName = "";
        if (state.contains(input))
            cityName = state.get(input);
        collector.collect("userKey:" + input + " city:" + cityName);
    }

    // Called for each element of the broadcast stream
    @Override
    public void processBroadcastElement(Tuple2<String, String> input, Context ctx, Collector<String> collector) throws Exception {
        // Write the dimension record into the broadcast state
        System.out.println("Received broadcast record: " + input);
        ctx.getBroadcastState(broadcastDesc).put(input.f0, input.f1);
    }
}
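
The interplay of the two callbacks can be imitated in plain Java (no Flink dependency). In the sketch below, a HashMap stands in for the broadcast state and two methods stand in for processBroadcastElement and processElement. The class and method names are hypothetical. It is meant to show one consequence of the pattern: an event that arrives before its dimension record is enriched with an empty value.

```java
import java.util.HashMap;
import java.util.Map;

/** Plain-Java illustration of joining events against broadcast dimension data. */
final class BroadcastJoinSketch {
    // Stands in for the broadcast state, of which every parallel instance holds a full copy.
    private final Map<String, String> dimState = new HashMap<>();

    /** Analogue of processBroadcastElement: a dimension record (userKey, city) arrives. */
    void onDimensionUpdate(String userKey, String city) {
        dimState.put(userKey, city);
    }

    /** Analogue of processElement: enrich the event with whatever dimension data has arrived so far. */
    String onEvent(String userKey) {
        String city = dimState.getOrDefault(userKey, "");
        return "userKey:" + userKey + " city:" + city;
    }

    public static void main(String[] args) {
        BroadcastJoinSketch join = new BroadcastJoinSketch();
        System.out.println(join.onEvent("u1"));   // dimension record not seen yet
        join.onDimensionUpdate("u1", "Beijing");
        System.out.println(join.onEvent("u1"));   // now enriched with the city
    }
}
```

If early events must not be emitted with an empty value, one option is to buffer them until the matching dimension record arrives; that is a design choice beyond what the code in this article does.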

BroadcastProcessFunction source-code analysis

4. Temporal table function join

5. Entry-point code for the dimension-table joins

package com.scallion.job;

import com.scallion.common.Common;
import com.scallion.transform.AsyncIOMySQLFunction;
import com.scallion.transform.DimBroadcastProcessFunction;
import com.scallion.transform.JoinWithDimMapFunction;
import com.scallion.utils.FlinkUtil;
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.common.state.MapStateDescriptor;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.AsyncDataStream;
import org.apache.flink.streaming.api.datastream.BroadcastStream;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.windowing.assigners.ProcessingTimeSessionWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

import java.util.concurrent.TimeUnit;

/**
 * created by gaowj.
 * created on 2021-05-14.
 * function: dimension-table join
 * origin ->
 */
public class JoinWithDimJob implements Job {
    @Override
    public void run() {
        /**
         * Source
         */
        // Click log
        DataStream<String> clickStream = FlinkUtil.getKafkaStream(Common.KAFKA_BROKER, Common.APP_NEWSAPP_TOPIC, Common.KAFKA_CONSUMER_GROUP_ID);
        // Dimension-table data
        SingleOutputStreamOperator<Tuple2<String, String>> socketDimStream = FlinkUtil.getSocketTextStream(Common.SOCKET_IP, Common.SOCKET_PORT)
                .map(new MapFunction<String, Tuple2<String, String>>() {
                    @Override
                    public Tuple2<String, String> map(String line) throws Exception {
                        String[] split = line.split(",");
                        return new Tuple2<String, String>(split[0], split[1]);
                    }
                });
        /**
         * Transform
         */
        // 1. Preloaded dimension table
        SingleOutputStreamOperator<String> joinWithDimStream = clickStream.map(new JoinWithDimMapFunction());
        // 2. Hot-storage dimension table: use async I/O to raise lookup throughput
        SingleOutputStreamOperator<String> asyncIOStream = AsyncDataStream
                .orderedWait(clickStream, new AsyncIOMySQLFunction(), 1000L, TimeUnit.MILLISECONDS, 10);
        // 3. Broadcast dimension table
        // Define the dimension-table stream as a broadcast stream
        MapStateDescriptor<String, String> broadcastDesc = new MapStateDescriptor<>("broad1", String.class, String.class);
        BroadcastStream<Tuple2<String, String>> broadcastStream = socketDimStream.broadcast(broadcastDesc);
        SingleOutputStreamOperator<String> broadcastWithDimStream = clickStream
                .map(new MapFunction<String, String>() {
                    @Override
                    public String map(String line) throws Exception {
                        return line.split("\t")[5].trim();
                    }
                })
                .keyBy(line -> line)
                .window(ProcessingTimeSessionWindows.withGap(Time.seconds(30)))
                .reduce((line1, line2) -> line1)
                .connect(broadcastStream)
                .process(new DimBroadcastProcessFunction(broadcastDesc));
        // 4. Temporal table function join
        /**
         * Sink
         */
//        joinWithDimStream.print();
//        asyncIOStream.print();
        broadcastWithDimStream.print();
    }
}
