Flink Stream Processing API
1 Environment
1.1 getExecutionEnvironment
Creates an execution environment that represents the context of the current program. If the program is invoked standalone, this method returns a local execution environment; if the program is submitted to a cluster through the command-line client, it returns that cluster's execution environment. In other words, getExecutionEnvironment decides which environment to return based on how the program is being run, which makes it the most common way to create an execution environment.
// batch (DataSet) environment
ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
// streaming (DataStream) environment
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
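If the environment should not be inferred, it can also be requested explicitly. A minimal sketch; the host, port, and jar path are hypothetical placeholders:
// force a local environment (here with parallelism 1)
StreamExecutionEnvironment localEnv = StreamExecutionEnvironment.createLocalEnvironment(1);
// connect to a remote cluster; the jar containing the job classes is shipped along
StreamExecutionEnvironment remoteEnv = StreamExecutionEnvironment.createRemoteEnvironment("jobmanager-host", 6123, "path/to/your-job.jar");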
2 Source
2.1 Reading from a collection
public class SourceFromCollectionTest {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        DataStreamSource<String> dataStream = env.fromCollection(Arrays.asList("1", "2", "3"));
        dataStream.print().setParallelism(1);
        env.execute();
    }
}
2.2 Reading from a file
DataStreamSource<String> dataStream = env.readTextFile("file_path");
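The path is not limited to the local filesystem; readTextFile accepts any URI supported by Flink's file systems. A sketch with a hypothetical HDFS address:
DataStreamSource<String> hdfsStream = env.readTextFile("hdfs://namenode:9000/path/to/file");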
2.3 Reading from a socket
DataStream<String> dataStream = env.socketTextStream("127.0.0.1", 60000);
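For a quick local test, start the server side first with netcat, e.g. nc -lk 60000, then type lines into it.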
2.4 Reading from Kafka
- Code
public class SourceKafkaTest {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // Kafka configuration
        Properties properties = new Properties();
        properties.setProperty("bootstrap.servers", "127.0.0.1:9092");
        properties.setProperty("group.id", "consumer-group");
        properties.setProperty("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        properties.setProperty("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        properties.setProperty("auto.offset.reset", "latest");
        // read data from Kafka
        DataStream<String> dataStream = env.addSource(
                new FlinkKafkaConsumer011<String>("test", new SimpleStringSchema(), properties));
        dataStream.print().setParallelism(1);
        env.execute();
    }
}
- Test: with the job running, produce a few messages to the test topic, e.g. with the console producer that ships with Kafka (kafka-console-producer.sh --broker-list 127.0.0.1:9092 --topic test); every line sent should be printed by the job.
2.5 Custom source
- Custom source function
/**
 * Custom source: emits sourceTxt plus a counter every `sleep` milliseconds.
 */
public class MySourceFunction implements SourceFunction<String> {
    // volatile so that cancel(), called from another thread, is visible in run()
    private volatile boolean isRunning = true;
    private String sourceTxt;
    private Long sleep;

    public MySourceFunction(String sourceTxt, Long sleep) {
        this.sourceTxt = sourceTxt;
        this.sleep = sleep;
    }

    public void run(SourceContext<String> sourceContext) throws Exception {
        int i = 0;
        while (isRunning) {
            i = i + 1;
            sourceContext.collect(sourceTxt + i);
            Thread.sleep(this.sleep);
        }
    }

    public void cancel() {
        this.isRunning = false;
    }
}
- Usage
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
DataStreamSource<String> dataStream = env.addSource(new MySourceFunction("thread_1, name,", 1000L));
dataStream.print().setParallelism(1);
env.execute();
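With the arguments above, the job prints one line per second: "thread_1, name,1", "thread_1, name,2", and so on, until the source is cancelled.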
4 Transform operators
4.1 map
Role: DataStream → DataStream: takes one element and produces exactly one element.
public class MapTransFormTest {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // Kafka configuration
        Properties properties = new Properties();
        properties.setProperty("bootstrap.servers", "127.0.0.1:9092");
        properties.setProperty("group.id", "consumer-group");
        properties.setProperty("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        properties.setProperty("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        properties.setProperty("auto.offset.reset", "latest");
        // read data from Kafka
        DataStream<String> dataStream = env.addSource(
                new FlinkKafkaConsumer011<String>("test", new SimpleStringSchema(), properties));
        // map each record to a tuple and count the occurrences of each word
        dataStream.map(new MyMapFunction()).keyBy(0).sum(1).print().setParallelism(1);
        env.execute();
    }

    public static class MyMapFunction implements MapFunction<String, Tuple2<String, Integer>> {
        public Tuple2<String, Integer> map(String s) throws Exception {
            return new Tuple2<String, Integer>(s, 1);
        }
    }
}
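Since MapFunction has a single method, the mapper can also be written as a lambda. With generic tuples Flink then needs an explicit type hint, because the lambda erases the tuple's type parameters. A sketch, assuming the same dataStream as above and org.apache.flink.api.common.typeinfo.Types on the classpath:
dataStream
        .map(s -> Tuple2.of(s, 1))
        // the lambda hides the tuple's generic types, so declare them explicitly
        .returns(Types.TUPLE(Types.STRING, Types.INT))
        .keyBy(0)
        .sum(1)
        .print()
        .setParallelism(1);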
4.2 flatMap
Role: DataStream → DataStream: takes one element and produces zero or more elements.
public static void main(String[] args) throws Exception {
    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
    // Kafka configuration
    Properties properties = new Properties();
    properties.setProperty("bootstrap.servers", "127.0.0.1:9092");
    properties.setProperty("group.id", "consumer-group");
    properties.setProperty("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
    properties.setProperty("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
    properties.setProperty("auto.offset.reset", "latest");
    // read data from Kafka
    DataStream<String> dataStream = env.addSource(
            new FlinkKafkaConsumer011<String>("test", new SimpleStringSchema(), properties));
    dataStream.flatMap(new MyFlatFunction()).keyBy(0).sum(1).print().setParallelism(1);
    env.execute();
}

public static class MyFlatFunction implements FlatMapFunction<String, Tuple2<String, Integer>> {
    public void flatMap(String s, Collector<Tuple2<String, Integer>> collector) throws Exception {
        collector.collect(new Tuple2<String, Integer>(s, 1));
    }
}
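MyFlatFunction above emits exactly one element per input, so it does not yet show the zero-or-more behavior. A sketch of a variant that splits each line on commas and emits one tuple per non-empty token:
public static class SplitFlatFunction implements FlatMapFunction<String, Tuple2<String, Integer>> {
    public void flatMap(String s, Collector<Tuple2<String, Integer>> collector) throws Exception {
        // one input line may produce zero, one, or many output elements
        for (String token : s.split(",")) {
            if (!token.isEmpty()) {
                collector.collect(new Tuple2<String, Integer>(token, 1));
            }
        }
    }
}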
4.3 filter
Role: filters elements, keeping only those for which the predicate returns true.
public static void main(String[] args) throws Exception {
    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
    // Kafka configuration
    Properties properties = new Properties();
    properties.setProperty("bootstrap.servers", "127.0.0.1:9092");
    properties.setProperty("group.id", "consumer-group");
    properties.setProperty("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
    properties.setProperty("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
    properties.setProperty("auto.offset.reset", "latest");
    // read data from Kafka
    DataStream<String> dataStream = env.addSource(
            new FlinkKafkaConsumer011<String>("test", new SimpleStringSchema(), properties));
    //dataStream.map(new MyMapFunction()).keyBy(0).sum(1).print().setParallelism(1);
    //dataStream.flatMap(new MyFlatFunction()).keyBy(0).sum(1).print().setParallelism(1);
    dataStream.filter(new MyFilterFunction()).map(new MyMapFunction()).keyBy(0).sum(1).print().setParallelism(1);
    env.execute();
}

public static class MyMapFunction implements MapFunction<String, Tuple2<String, Integer>> {
    public Tuple2<String, Integer> map(String s) throws Exception {
        return new Tuple2<String, Integer>(s, 1);
    }
}

public static class MyFilterFunction implements FilterFunction<String> {
    // keep only strings that start with "a"
    public boolean filter(String s) throws Exception {
        return s.startsWith("a");
    }
}
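The same predicate can also be written inline as a lambda; no type hint is needed here since the output type is not generic:
dataStream.filter(s -> s.startsWith("a")).map(new MyMapFunction()).keyBy(0).sum(1).print().setParallelism(1);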
4.4 KeyBy
DataStream → KeyedStream: logically partitions a stream into disjoint partitions, with each partition containing the elements that share the same key; internally this is implemented with hash partitioning.
- Code
public static void main(String[] args) throws Exception {
    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
    // Kafka configuration
    Properties properties = new Properties();
    properties.setProperty("bootstrap.servers", "127.0.0.1:9092");
    properties.setProperty("group.id", "consumer-group");
    properties.setProperty("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
    properties.setProperty("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
    properties.setProperty("auto.offset.reset", "latest");
    // read data from Kafka
    DataStream<String> dataStream = env.addSource(
            new FlinkKafkaConsumer011<String>("test", new SimpleStringSchema(), properties));
    DataStream<Tuple2<String, Integer>> flatMapStream = dataStream.flatMap(new MyFlatFunction());
    flatMapStream.keyBy(0).max(1).print().setParallelism(1);
    flatMapStream.keyBy(0).maxBy(1).print().setParallelism(2);
    env.execute();
}

public static class MyFlatFunction implements FlatMapFunction<String, Tuple2<String, Integer>> {
    public void flatMap(String s, Collector<Tuple2<String, Integer>> collector) throws Exception {
        // attach a random value so that max/maxBy have something to aggregate
        Random random = new Random();
        collector.collect(new Tuple2<String, Integer>(s, random.nextInt(100)));
    }
}
4.5 Rolling aggregation operators
- sum()
- min()
- max()
- minBy()
- maxBy()
max, min and sum return the maximum, minimum and sum of the specified field respectively, leaving the other non-key fields at the values of the first element seen; minBy and maxBy instead return the entire element that contains the minimum or maximum.
Example reference: https://blog.csdn.net/Jackson_mvp/article/details/105825898
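To make the difference concrete, a sketch with a keyed stream of Tuple3<String, Integer, Long> elements (as in the Reduce example below): given the inputs ("a", 1, 100) and then ("a", 5, 200) for the same key, keyBy(0).max(1) emits ("a", 1, 100) followed by ("a", 5, 100), i.e. only the aggregated field f1 is updated while the remaining fields keep the values of the first element; keyBy(0).maxBy(1) emits ("a", 1, 100) followed by ("a", 5, 200), the complete element that holds the maximum.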
4.6 Reduce
KeyedStream → DataStream: an aggregation over a keyed stream that combines the current element with the last aggregated value to produce a new value. The returned stream contains the result of every aggregation step, not just the final result of the last aggregation.
- Code
public static void main(String[] args) throws Exception {
    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
    // Kafka configuration
    Properties properties = new Properties();
    properties.setProperty("bootstrap.servers", "127.0.0.1:9092");
    properties.setProperty("group.id", "consumer-group");
    properties.setProperty("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
    properties.setProperty("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
    properties.setProperty("auto.offset.reset", "latest");
    // read data from Kafka
    DataStream<String> dataStream = env.addSource(
            new FlinkKafkaConsumer011<String>("test", new SimpleStringSchema(), properties));
    dataStream.map(new MyMapFunction()).keyBy(0).reduce(new MyReduceFunction()).print().setParallelism(1);
    env.execute();
}

// <String, Integer, Long> ==> key, value, timestamp
public static class MyMapFunction implements MapFunction<String, Tuple3<String, Integer, Long>> {
    public Tuple3<String, Integer, Long> map(String s) throws Exception {
        String[] split = s.split(",");
        return new Tuple3<String, Integer, Long>(split[0], Integer.valueOf(split[1]), Long.valueOf(split[2]));
    }
}

public static class MyReduceFunction implements ReduceFunction<Tuple3<String, Integer, Long>> {
    /**
     * Keep the maximum value seen so far, and always carry the latest timestamp.
     */
    public Tuple3<String, Integer, Long> reduce(Tuple3<String, Integer, Long> oldVal, Tuple3<String, Integer, Long> newVal) throws Exception {
        if (oldVal.f1 < newVal.f1) {
            return newVal;
        }
        return new Tuple3<String, Integer, Long>(oldVal.f0, oldVal.f1, newVal.f2);
    }
}
- Result: for example, sending a,3,1000, a,1,2000, a,5,3000 to the topic should yield (a,3,1000), (a,3,2000), (a,5,3000): the maximum value is carried forward while the timestamp always advances to the latest one.
4.7 Split and Select (stream splitting)
4.7.1 Concepts
Note: deprecated and removed as of Flink 1.13.0; newer versions use side outputs (an OutputTag with a ProcessFunction) instead.
- Split: DataStream → SplitStream: splits one DataStream into two or more DataStreams according to some criteria.
- Select: SplitStream → DataStream: retrieves one or more DataStreams from a SplitStream.
4.7.2 Example: split the reported data into two streams around the threshold 50
public static void main(String[] args) throws Exception {
    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
    DataStream<String> dataStream = env.fromCollection(Arrays.asList(
            "a,1",
            "a,49",
            "a,55",
            "a,58"
    ));
    SplitStream<Tuple2<String, Integer>> splitStream = dataStream.map(new MyMapFunction()).split(new MySplitFunction());
    splitStream.select("high").print("high").setParallelism(1);
    splitStream.select("low").print("low").setParallelism(1);
    splitStream.select("high", "low").print("all").setParallelism(1);
    env.execute();
}

public static class MyMapFunction implements MapFunction<String, Tuple2<String, Integer>> {
    public Tuple2<String, Integer> map(String s) throws Exception {
        String[] split = s.split(",");
        return new Tuple2<String, Integer>(split[0], Integer.valueOf(split[1].trim()));
    }
}

public static class MySplitFunction implements OutputSelector<Tuple2<String, Integer>> {
    public Iterable<String> select(Tuple2<String, Integer> data) {
        // tag each element "high" or "low" around the threshold of 50
        if (data.f1 > 50) {
            return Collections.singletonList("high");
        }
        return Collections.singletonList("low");
    }
}
- Result: the expected output (modulo interleaving of the three prints) is low> (a,1) and low> (a,49) on the low stream, high> (a,55) and high> (a,58) on the high stream, and all four elements prefixed with all> on the combined stream.
4.8 Connect and CoMap
4.8.1 Concepts
- Connect: DataStream, DataStream → ConnectedStreams: connects two data streams while preserving their types. After connect, the two streams are merely placed inside one shared stream; internally each keeps its own data and form unchanged, and the two remain independent of each other.
- CoMap/CoFlatMap: ConnectedStreams → DataStream: operates on ConnectedStreams with the same functionality as map and flatMap, applying a map or flatMap transformation to each of the two streams separately.
- Code
public static void main(String[] args) throws Exception {
    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
    DataStream<Tuple2<String, Integer>> dataStream1 = env.fromCollection(Arrays.asList(
            new Tuple2<String, Integer>("a", 1),
            new Tuple2<String, Integer>("a", 2),
            new Tuple2<String, Integer>("a", 3),
            new Tuple2<String, Integer>("a", 4),
            new Tuple2<String, Integer>("a", 5),
            new Tuple2<String, Integer>("a", 6)
    ));
    DataStreamSource<Tuple2<String, String>> dataStream2 = env.fromCollection(Arrays.asList(
            new Tuple2<String, String>("b", "1"),
            new Tuple2<String, String>("b", "2"),
            new Tuple2<String, String>("b", "3"),
            new Tuple2<String, String>("b", "4"),
            new Tuple2<String, String>("b", "5"),
            new Tuple2<String, String>("b", "6"),
            new Tuple2<String, String>("b", "7")
    ));
    // connect the streams; the data of the two streams stays independent
    ConnectedStreams<Tuple2<String, Integer>, Tuple2<String, String>> connectStream = dataStream1.connect(dataStream2);
    connectStream.map(new CoMapFunction<Tuple2<String, Integer>, Tuple2<String, String>, Object>() {
        // arbitrary transformation of elements from the first stream
        @Override
        public Object map1(Tuple2<String, Integer> stringIntegerTuple2) throws Exception {
            return stringIntegerTuple2;
        }

        // arbitrary transformation of elements from the second stream
        @Override
        public Object map2(Tuple2<String, String> stringStringTuple2) throws Exception {
            return stringStringTuple2;
        }
    }).print().setParallelism(1);
    env.execute();
}
4.9 Union
DataStream → DataStream: unions two or more DataStreams, producing a new DataStream that contains all elements from all of the input streams.
4.9.1 Differences between Connect and Union:
- Union requires the two streams to have the same type; Connect allows different types, which can then be reconciled in the subsequent coMap.
- Connect can only operate on two streams, whereas Union can operate on more than two.
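Union has no code example above; a minimal sketch, assuming three existing DataStream<String> variables stream1, stream2 and stream3:
// union accepts any number of streams, as long as they all share the same type
DataStream<String> unioned = stream1.union(stream2, stream3);
unioned.print().setParallelism(1);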
5 Rich Functions
"Rich functions" are a function-class interface provided by the DataStream API; every Flink function class has a Rich version. They differ from regular functions in that they can access the runtime context and provide lifecycle methods, which makes more complex functionality possible.
- RichMapFunction
- RichFlatMapFunction
- RichFilterFunction
- …
A Rich Function has a notion of lifecycle. Typical lifecycle methods are:
- open() is the initialization method of a rich function; it is called before an operator such as map or filter is invoked.
- close() is the last method called in the lifecycle; it performs cleanup work.
- getRuntimeContext() provides information from the function's RuntimeContext, such as the parallelism the function runs with, the task name, and the state.
public static class MyMapFunction extends RichMapFunction<SensorReading, Tuple2<Integer, String>> {
    @Override
    public Tuple2<Integer, String> map(SensorReading value) throws Exception {
        return new Tuple2<>(getRuntimeContext().getIndexOfThisSubtask(), value.getId());
    }

    @Override
    public void open(Configuration parameters) throws Exception {
        System.out.println("my map open");
        // initialization work goes here, e.g. establishing a connection to HDFS
    }

    @Override
    public void close() throws Exception {
        System.out.println("my map close");
        // cleanup work goes here, e.g. closing the connection to HDFS
    }
}
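SensorReading is not defined in this article; a minimal hypothetical POJO sufficient for the example above could look like this:
public static class SensorReading {
    // hypothetical: only the id field used by the example is included
    private String id;

    public SensorReading() {}

    public SensorReading(String id) {
        this.id = id;
    }

    public String getId() {
        return id;
    }

    public void setId(String id) {
        this.id = id;
    }
}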