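Using Broadcast State in Flink 1.11

Preface

The previous article briefly covered how to use stateful operators. This article focuses on broadcast state. Broadcast state is the same basic state introduced in the previous article, except that it is broadcast to every parallel instance of the operator.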

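Broadcast state can also be used to join a stream with a small, slowly changing data set (a broadcast join). Flink's built-in window join and interval join, by contrast, work on keyed streams and do not rely on broadcast state.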

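Requirements

This article uses the example from the official Flink documentation:

  • One stream carries Item events; each Item has two attributes, a Color and a Shape.
  • From this Item stream we want to pick out the data that matches a specific pattern, for example a triangle followed by a square. We call such a pattern a Rule. Rules are also supplied as a stream, so many patterns can arrive over time.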
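To make the "first shape followed by second shape" requirement concrete, here is a minimal plain-Java sketch with no Flink involved. It assumes a single rule and a single key (in the job below, items are keyed by color, so this corresponds to the items of one color). The class name PatternSketch and the `#index` suffix in the output are illustrative assumptions, not part of the article's job.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of the article's job: a plain-Java version of the
// "first shape followed by second shape" rule.
class PatternSketch {

    // Returns one "first - second" entry for every waiting first element
    // that is completed by a later second element.
    static List<String> match(List<String> shapes, String first, String second) {
        List<String> waiting = new ArrayList<>(); // first elements still waiting for a second
        List<String> matches = new ArrayList<>();
        for (int i = 0; i < shapes.size(); i++) {
            String shape = shapes.get(i);
            if (shape.equals(second) && !waiting.isEmpty()) {
                for (String w : waiting) {
                    matches.add(w + " - " + shape + "#" + i);
                }
                waiting.clear();
            }
            // No else: when first equals second, one element can close a pair and open a new one.
            if (shape.equals(first)) {
                waiting.add(shape + "#" + i);
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        List<String> shapes = List.of("triangle", "circle", "rectangle", "rectangle", "triangle");
        // prints [triangle#0 - rectangle#2]
        System.out.println(match(shapes, "triangle", "rectangle"));
    }
}
```

The second rectangle produces no match because the waiting list was cleared by the first one, and the last triangle stays in the waiting list until another rectangle arrives.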

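Requirement analysis

To meet this requirement, we broadcast the Rule stream, use the connect operator to combine the two streams, and then put the matching logic in the process operator.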

MapStateDescriptor<String, Rule> ruleStateDescriptor = new MapStateDescriptor<>(
        "RulesBroadcastState",
        BasicTypeInfo.STRING_TYPE_INFO,
        TypeInformation.of(new TypeHint<Rule>() {}));

// broadcast the rules stream and create the broadcast state
BroadcastStream<Rule> ruleBroadcastStream = ruleStream
        .broadcast(ruleStateDescriptor);
DataStream<String> output = colorPartitionedStream
                 .connect(ruleBroadcastStream)
                 .process(
                     
                     // type arguments in our KeyedBroadcastProcessFunction represent: 
                     //   1. the key of the keyed stream
                     //   2. the type of elements in the non-broadcast side
                     //   3. the type of elements in the broadcast side
                     //   4. the type of the result, here a string
                     
                     new KeyedBroadcastProcessFunction<Color, Item, Rule, String>() {
                         // my matching logic
                     }
                 );

Code

POJO classes


class Item {
    Color color;
    Shape shape;

    public Color getColor() {
        return color;
    }

    public void setColor(Color color) {
        this.color = color;
    }

    public Shape getShape() {
        return shape;
    }

    public void setShape(Shape shape) {
        this.shape = shape;
    }

    @Override
    public String toString() {
        return "Item{" +
                "color=" + color.getCol() +
                ", shape=" + shape.getShp() +
                '}';
    }
}

class Color {
    String col;

    public String getCol() {
        return col;
    }

    public void setCol(String col) {
        this.col = col;
    }

    @Override
    public String toString() {
        return "Color{" +
                "col='" + col + '\'' +
                '}';
    }
}

class Shape {
    String shp;

    public Shape(String shp) {
        this.shp = shp;
    }

    public String getShp() {
        return shp;
    }

    public void setShp(String shp) {
        this.shp = shp;
    }

    @Override
    public String toString() {
        return "Shape{" +
                "shp='" + shp + '\'' +
                '}';
    }
}

class Rule {
    String name;
    Shape first;
    Shape second;

    public String getName() {
        return name;
    }

    public void setName(String name) {
        this.name = name;
    }

    public Shape getFirst() {
        return first;
    }

    public void setFirst(Shape first) {
        this.first = first;
    }

    public Shape getSecond() {
        return second;
    }

    public void setSecond(Shape second) {
        this.second = second;
    }
}
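One point worth checking in the classes above: Color is later used as the keyBy key, but it does not override equals() or hashCode(). As far as I understand, Flink partitions records by the key's hashCode(), and its documentation says a POJO that relies on Object.hashCode() cannot be used as a key. The sketch below is plain Java and only illustrates the general effect with a HashMap; ColorKey is a hypothetical variant of Color, named differently to avoid clashing with the class above.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

// Hypothetical variant of Color with value-based equals() and hashCode().
class ColorKey {
    private final String col;

    ColorKey(String col) {
        this.col = col;
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof ColorKey)) return false;
        return Objects.equals(col, ((ColorKey) o).col);
    }

    @Override
    public int hashCode() {
        return Objects.hash(col);
    }
}

class KeyEqualityDemo {
    // Counts distinct keys the way a hash-based grouping would see them.
    static int distinctKeys(Object... keys) {
        Map<Object, Integer> counts = new HashMap<>();
        for (Object k : keys) {
            counts.merge(k, 1, Integer::sum);
        }
        return counts.size();
    }

    public static void main(String[] args) {
        // Two separately created "blue" keys collapse into one group: prints 1
        System.out.println(distinctKeys(new ColorKey("blue"), new ColorKey("blue")));
        // Plain objects rely on identity, so each instance is its own group: prints 2
        System.out.println(distinctKeys(new Object(), new Object()));
    }
}
```

If items of the same color never seem to end up in the same group, adding equals() and hashCode() to Color along these lines is the first thing I would try.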

Custom sources for the two streams

One source produces a stream of Item objects, the other a stream of Rule objects.

package it.kenn.state;

import org.apache.flink.api.common.state.MapState;
import org.apache.flink.api.common.state.MapStateDescriptor;
import org.apache.flink.api.common.typeinfo.BasicTypeInfo;
import org.apache.flink.api.common.typeinfo.TypeHint;
import org.apache.flink.api.common.typeinfo.TypeInformation;
import org.apache.flink.api.java.functions.KeySelector;
import org.apache.flink.api.java.typeutils.ListTypeInfo;
import org.apache.flink.streaming.api.datastream.BroadcastStream;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.datastream.KeyedStream;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.co.KeyedBroadcastProcessFunction;
import org.apache.flink.streaming.api.functions.source.SourceFunction;
import org.apache.flink.util.Collector;

import java.util.*;

/**
 * Broadcast stream and broadcast state
 */
public class BroadcastStateDemo {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        DataStreamSource<Item> itemStream = env.addSource(new ItemsSource());
        KeyedStream<Item, Color> colorPartitionedStream = itemStream.keyBy(new KeySelector<Item, Color>() {
            @Override
            public Color getKey(Item item) throws Exception {
                return item.getColor();
            }
        });
        DataStreamSource<Rule> ruleStream = env.addSource(new RuleSource());
        // define the MapState descriptor that will be broadcast
        MapStateDescriptor<String, Rule> ruleStateDescriptor = new MapStateDescriptor<>(
                "RulesBroadcastState",
                BasicTypeInfo.STRING_TYPE_INFO,
                TypeInformation.of(new TypeHint<Rule>() {
                }));
        // broadcast ruleStream, passing the MapState descriptor to it
        BroadcastStream<Rule> ruleBroadcastStream = ruleStream.broadcast(ruleStateDescriptor);

        SingleOutputStreamOperator<String> resStream = colorPartitionedStream
                // connect the two streams
                .connect(ruleBroadcastStream)
                // the four type parameters are key, in1, in2 and out, as the source code shows
                .process(new KeyedBroadcastProcessFunction<Color, Item, Rule, String>() {

                    // a second MapState that stores the first elements of a rule while they wait for a second element
                    // the map is keyed by rule name because rules arrive as a stream and there can be many of them
                    // we keep a list as we may have many first elements waiting
                    private final MapStateDescriptor<String, List<Item>> mapStateDesc =
                            new MapStateDescriptor<>(
                                    "items",
                                    BasicTypeInfo.STRING_TYPE_INFO,
                                    new ListTypeInfo<>(Item.class));

                    // identical to the ruleStateDescriptor defined above
                    private final MapStateDescriptor<String, Rule> ruleStateDescriptor =
                            new MapStateDescriptor<>(
                                    "RulesBroadcastState",
                                    BasicTypeInfo.STRING_TYPE_INFO,
                                    TypeInformation.of(new TypeHint<Rule>() {}));

                    @Override
                    public void processElement(Item value, ReadOnlyContext ctx, Collector<String> out) throws Exception {
                        final MapState<String, List<Item>> state = getRuntimeContext().getMapState(mapStateDesc);
                        final Shape shape = value.getShape();

                        for (Map.Entry<String, Rule> entry : ctx.getBroadcastState(ruleStateDescriptor).immutableEntries()) {
                            final String ruleName = entry.getKey();
                            final Rule rule = entry.getValue();
                            List<Item> stored = state.get(ruleName);

                            if (stored == null) {
                                stored = new ArrayList<>();
                            }

                            if (shape.getShp().equals(rule.second.getShp()) && !stored.isEmpty()) {
                                for (Item i : stored) {
                                    out.collect("MATCH: " + i + " - " + value);
                                }
                                stored.clear();
                            }

                            // there is no else{} to cover if rule.first == rule.second
                            if (shape.getShp().equals(rule.first.getShp())) {
                                stored.add(value);
                            }

                            if (stored.isEmpty()) {
                                state.remove(ruleName);
                            } else {
                                state.put(ruleName, stored);
                            }
                        }
                    }

                    /**
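                     * This method does only one thing: it puts every element of the broadcast stream into the
                     * broadcast map state instead of emitting it, so that processElement can look up the
                     * rules that arrived on the Rule stream.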
                     * @param rule
                     * @param context
                     * @param collector
                     * @throws Exception
                     */
                    @Override
                    public void processBroadcastElement(Rule rule, Context context, Collector<String> collector) throws Exception {
                        context.getBroadcastState(ruleStateDescriptor).put(rule.name, rule);
                    }
                });

        resStream.print();
        env.execute();
    }
}

class ItemsSource implements SourceFunction<Item> {
    // volatile because cancel() is called from a different thread than run()
    private volatile boolean flag = true;

    @Override
    public void run(SourceContext<Item> sourceContext) throws Exception {
        String[] colors = new String[]{"blue", "yellow", "gray", "black", "red", "orange", "green", "white", "gold"};
        String[] shapes = new String[]{"triangle", "rectangle", "circle", "unknown"};
        Random random = new Random();

        while (flag) {
            // the bound of nextInt is exclusive, so use the array length to cover every entry
            int colorIndex = random.nextInt(colors.length);
            int shapeIndex = random.nextInt(shapes.length);
            Item item = new Item();
            Shape shape = new Shape(shapes[shapeIndex]);
            Color color = new Color();
            color.setCol(colors[colorIndex]);
            item.setColor(color);
            item.setShape(shape);

            sourceContext.collect(item);
        }
    }

    @Override
    public void cancel() {
        flag = false;
    }
}

class RuleSource implements SourceFunction<Rule> {
    // volatile because cancel() is called from a different thread than run()
    private volatile boolean flag = true;

    @Override
    public void run(SourceContext<Rule> sourceContext) throws Exception {
        String[] shapes = new String[]{"unknown", "circle", "rectangle", "triangle"};
        Random random = new Random();

        while (flag) {
            // every rule gets a random name and two randomly chosen shapes
            Rule rule = new Rule();
            int index1 = random.nextInt(shapes.length);
            int index2 = random.nextInt(shapes.length);
            rule.setName(UUID.randomUUID().toString());
            rule.setFirst(new Shape(shapes[index1]));
            rule.setSecond(new Shape(shapes[index2]));
            sourceContext.collect(rule);
        }
    }

    @Override
    public void cancel() {
        flag = false;
    }
}

Main logic

package it.kenn.state;

import org.apache.flink.api.common.state.MapState;
import org.apache.flink.api.common.state.MapStateDescriptor;
import org.apache.flink.api.common.typeinfo.BasicTypeInfo;
import org.apache.flink.api.common.typeinfo.TypeHint;
import org.apache.flink.api.common.typeinfo.TypeInformation;
import org.apache.flink.api.java.functions.KeySelector;
import org.apache.flink.api.java.typeutils.ListTypeInfo;
import org.apache.flink.streaming.api.datastream.BroadcastStream;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.datastream.KeyedStream;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.co.KeyedBroadcastProcessFunction;
import org.apache.flink.streaming.api.functions.source.SourceFunction;
import org.apache.flink.util.Collector;

import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Random;

/**
 * Broadcast stream and broadcast state
 */
public class BroadcastStateDemo {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        DataStreamSource<Item> itemStream = env.addSource(new ItemsSource());
        KeyedStream<Item, Color> colorPartitionedStream = itemStream.keyBy(new KeySelector<Item, Color>() {
            @Override
            public Color getKey(Item item) throws Exception {
                return item.getColor();
            }
        });
        DataStreamSource<Rule> ruleStream = env.addSource(new RuleSource());
        MapStateDescriptor<String, Rule> ruleStateDescriptor = new MapStateDescriptor<>(
                "RulesBroadcastState",
                BasicTypeInfo.STRING_TYPE_INFO,
                TypeInformation.of(new TypeHint<Rule>() {
                }));

        BroadcastStream<Rule> ruleBroadcastStream = ruleStream
                .broadcast(ruleStateDescriptor);

        SingleOutputStreamOperator<String> resStream = colorPartitionedStream.connect(ruleBroadcastStream)
                //the four type parameters are key, in1, in2 and out, as the source code shows
                .process(new KeyedBroadcastProcessFunction<Color, Item, Rule, String>() {

                    // store partial matches, i.e. first elements of the pair waiting for their second element
                    // we keep a list as we may have many first elements waiting
                    private final MapStateDescriptor<String, List<Item>> mapStateDesc =
                            new MapStateDescriptor<>(
                                    "items",
                                    BasicTypeInfo.STRING_TYPE_INFO,
                                    new ListTypeInfo<>(Item.class));

                    // identical to our ruleStateDescriptor above
                    private final MapStateDescriptor<String, Rule> ruleStateDescriptor =
                            new MapStateDescriptor<>(
                                    "RulesBroadcastState",
                                    BasicTypeInfo.STRING_TYPE_INFO,
                                    TypeInformation.of(new TypeHint<Rule>() {}));

                    @Override
                    public void processElement(Item value, ReadOnlyContext ctx, Collector<String> out) throws Exception {
                        final MapState<String, List<Item>> state = getRuntimeContext().getMapState(mapStateDesc);
                        final Shape shape = value.getShape();

                        for (Map.Entry<String, Rule> entry : ctx.getBroadcastState(ruleStateDescriptor).immutableEntries()) {
                            final String ruleName = entry.getKey();
                            final Rule rule = entry.getValue();
                            List<Item> stored = state.get(ruleName);
                            if (stored == null) {
                                stored = new ArrayList<>();
                            }
                            // compare the shape strings: Shape does not override equals(), so == and
                            // equals() on Shape objects would only compare references
                            if (shape.getShp().equals(rule.getSecond().getShp()) && !stored.isEmpty()) {
                                for (Item i : stored) {
                                    out.collect("MATCH: " + i + " - " + value);
                                }
                                stored.clear();
                            }

                            // there is no else{} to cover if rule.first == rule.second
                            if (shape.getShp().equals(rule.getFirst().getShp())) {
                                stored.add(value);
                            }

                            if (stored.isEmpty()) {
                                state.remove(ruleName);
                            } else {
                                state.put(ruleName, stored);
                            }
                        }
                    }

                    @Override
                    public void processBroadcastElement(Rule rule, Context context, Collector<String> collector) throws Exception {
                        context.getBroadcastState(ruleStateDescriptor).put(rule.name, rule);
                    }
                });

        resStream.print();
        env.execute();
    }
}

The code above still throws a NullPointerException whose cause I have not found yet. The overall idea is what matters here.
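A detail that matters for the matching logic, independent of the exception: Shape does not override equals(), so both `==` and equals() on two Shape objects compare references only. A Shape taken from an Item and a Shape taken from a Rule are always different objects, so only a comparison of the getShp() strings can match. The sketch below uses a hypothetical stand-in class, ShapeSketch, with the same single field.

```java
// Hypothetical stand-in for the article's Shape class (same field, no equals() override).
class ShapeSketch {
    final String shp;

    ShapeSketch(String shp) {
        this.shp = shp;
    }
}

class ShapeCompareDemo {
    // Reference comparison: true only for the very same object.
    static boolean sameReference(ShapeSketch a, ShapeSketch b) {
        return a == b;
    }

    // Object.equals() is inherited unchanged, so it also compares references.
    static boolean inheritedEquals(ShapeSketch a, ShapeSketch b) {
        return a.equals(b);
    }

    // Value comparison on the wrapped string.
    static boolean sameValue(ShapeSketch a, ShapeSketch b) {
        return a.shp.equals(b.shp);
    }

    public static void main(String[] args) {
        ShapeSketch fromItem = new ShapeSketch("triangle");
        ShapeSketch fromRule = new ShapeSketch("triangle");
        System.out.println(sameReference(fromItem, fromRule));   // prints false
        System.out.println(inheritedEquals(fromItem, fromRule)); // prints false
        System.out.println(sameValue(fromItem, fromRule));       // prints true
    }
}
```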

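Summary

Although the code above still has a problem (it is getting late, I will come back to it another day), that does not stop us from summarizing the pattern:

  • Pass a state descriptor when broadcasting a stream; the state it describes becomes the broadcast state.
  • Broadcast state is usually meant for one or more other streams (the same idea underlies a broadcast join), so use the connect operator to combine the non-broadcast stream with the broadcast stream. connect is called on the non-broadcast stream.
  • After obtaining the BroadcastConnectedStream, call process on it. In the code above, process receives a KeyedBroadcastProcessFunction because we applied keyBy to the non-broadcast stream beforehand; without keyBy, pass a BroadcastProcessFunction instead. Either class requires two abstract methods, processElement and processBroadcastElement, which handle the elements of the non-broadcast stream and of the broadcast stream respectively.
  • Because we use broadcast state, processBroadcastElement usually just stores the broadcast elements in the broadcast state; processElement then reads them and does the actual processing.
  • The non-broadcast side obtains the broadcast state through the read-only context object (ReadOnlyContext), reads the broadcast data from it, and then performs the business logic.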
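The division of labor described in the list above can be modeled in plain Java. This is only a rough sketch of the idea, not how Flink implements it: every name here (BroadcastPatternSketch, Instance, onRule, onItem) is made up for illustration, a rule is reduced to a single wanted shape, and the instance is chosen with a simple hash modulo instead of Flink's key groups.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical plain-Java model of the pattern; none of these names are Flink APIs.
class BroadcastPatternSketch {

    // One parallel instance of the operator, with its own copy of the broadcast state.
    static class Instance {
        final Map<String, String> rules = new HashMap<>(); // stands in for the broadcast state
        final List<String> output = new ArrayList<>();

        // Counterpart of processBroadcastElement: only store the rule.
        void onRule(String name, String wantedShape) {
            rules.put(name, wantedShape);
        }

        // Counterpart of processElement: read the rules, never modify them.
        void onItem(String color, String shape) {
            for (Map.Entry<String, String> e : rules.entrySet()) {
                if (e.getValue().equals(shape)) {
                    output.add(e.getKey() + ":" + color + "/" + shape);
                }
            }
        }
    }

    final List<Instance> instances = new ArrayList<>();

    BroadcastPatternSketch(int parallelism) {
        for (int i = 0; i < parallelism; i++) {
            instances.add(new Instance());
        }
    }

    // Broadcast side: every instance receives every rule.
    void broadcastRule(String name, String wantedShape) {
        for (Instance instance : instances) {
            instance.onRule(name, wantedShape);
        }
    }

    // Keyed side: an item goes to exactly one instance, chosen by its key.
    void sendItem(String color, String shape) {
        int index = Math.floorMod(color.hashCode(), instances.size());
        instances.get(index).onItem(color, shape);
    }

    public static void main(String[] args) {
        BroadcastPatternSketch job = new BroadcastPatternSketch(3);
        job.broadcastRule("r1", "circle");
        job.sendItem("red", "circle");   // matches r1 on the instance that owns "red"
        job.sendItem("red", "triangle"); // no rule wants a triangle
        for (Instance instance : job.instances) {
            System.out.println(instance.rules + " -> " + instance.output);
        }
    }
}
```

After the run, all three instances hold rule r1, but only the instance that owns the key "red" has produced output. That is the reason the broadcast side only writes state and the keyed side only reads it: every instance must see the same rules regardless of which items it happens to receive.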

Incidentally, the context objects passed to processElement and processBroadcastElement support the following operations, all five of which are useful:

  1. give access to the broadcast state: ctx.getBroadcastState(MapStateDescriptor<K, V> stateDescriptor)
  2. allow to query the timestamp of the element: ctx.timestamp(),
  3. get the current watermark: ctx.currentWatermark()
  4. get the current processing time: ctx.currentProcessingTime(), and
  5. emit elements to side-outputs: ctx.output(OutputTag<X> outputTag, X value).
