Flink Stream Processing API

1 Environment

1.1 getExecutionEnvironment

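Creates an execution environment that represents the context in which the current program runs. If the program is invoked standalone, this method returns a local execution environment; if it is submitted to a cluster through the command-line client, it returns that cluster's execution environment. In other words, getExecutionEnvironment decides which environment to return based on how the program is run, and it is the most common way to create an execution environment.

// Batch processing (DataSet API)
ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

// Stream processing (DataStream API)
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();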

2 Source

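2.1 Reading from a collection

import java.util.Arrays;

import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class SourceFromCollectionTest {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Build a bounded stream from an in-memory collection
        DataStreamSource<String> dataStream = env.fromCollection(Arrays.asList("1", "2", "3"));

        dataStream.print().setParallelism(1);

        // Nothing runs until execute() is called
        env.execute();
    }
}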

2.2 Reading from a file

DataStreamSource<String> dataStream = env.readTextFile("file_path");

2.3 Reading from a socket

DataStream<String> dataStream = env.socketTextStream("127.0.0.1", 60000);

2.4 Reading from Kafka

  • Code
public class SourceKafkaTest {

    public static void main(String[] args) throws Exception {


        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Kafka configuration
        Properties properties = new Properties();
        properties.setProperty("bootstrap.servers", "127.0.0.1:9092");
        properties.setProperty("group.id", "consumer-group");
        properties.setProperty("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        properties.setProperty("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        properties.setProperty("auto.offset.reset", "latest");


        // Read data from Kafka
        DataStream<String> dataStream = env.addSource(new
                FlinkKafkaConsumer011<String>("test", new SimpleStringSchema(), properties));

        dataStream.print().setParallelism(1);

        env.execute();

    }
}

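2.5 Custom source

  • Custom data source
import org.apache.flink.streaming.api.functions.source.SourceFunction;

/**
 * Custom data source: emits sourceTxt followed by an increasing counter.
 */
public class MySourceFunction implements SourceFunction<String> {

    // volatile because cancel() is called from a different thread than run()
    private volatile boolean isRunning = true;

    private final String sourceTxt;

    // Pause between two emitted records, in milliseconds
    private final long sleep;

    public MySourceFunction(String sourceTxt, long sleep) {
        this.sourceTxt = sourceTxt;
        this.sleep = sleep;
    }

    @Override
    public void run(SourceContext<String> sourceContext) throws Exception {

        int i = 0;
        while (isRunning) {
            i = i + 1;
            sourceContext.collect(sourceTxt + i);
            Thread.sleep(this.sleep);
        }

    }

    @Override
    public void cancel() {
        this.isRunning = false;
    }
}
  • Usage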
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
DataStreamSource<String> dataStream = env.addSource(new MySourceFunction("thread_1, name,", 1000L));
dataStream.print().setParallelism(1);
env.execute();

4 Transform: transformation operators

4.1 map

Purpose: DataStream → DataStream. Takes one element as input and produces exactly one element.

public class MapTransFormTest {

    public static void main(String[] args) throws Exception {

        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Kafka configuration
        Properties properties = new Properties();
        properties.setProperty("bootstrap.servers", "127.0.0.1:9092");
        properties.setProperty("group.id", "consumer-group");
        properties.setProperty("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        properties.setProperty("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        properties.setProperty("auto.offset.reset", "latest");

        // Read data from Kafka
        DataStream<String> dataStream = env.addSource(new
                FlinkKafkaConsumer011<String>("test", new SimpleStringSchema(), properties));

        // Convert each word to a 2-tuple and count how often each word occurs
        dataStream.map(new MyMapFunction()).keyBy(0).sum(1).print().setParallelism(1);

        env.execute();

    }

    public static class MyMapFunction implements MapFunction<String, Tuple2<String, Integer>> {

        public Tuple2<String, Integer> map(String s) throws Exception {
            return new Tuple2<String, Integer>(s, 1);
        }
    }

}

4.2 flatMap

Purpose: DataStream → DataStream. Takes one element as input and produces zero or more elements.

 public static void main(String[] args) throws Exception {

        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Kafka configuration
        Properties properties = new Properties();
        properties.setProperty("bootstrap.servers", "127.0.0.1:9092");
        properties.setProperty("group.id", "consumer-group");
        properties.setProperty("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        properties.setProperty("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        properties.setProperty("auto.offset.reset", "latest");

        // Read data from Kafka
        DataStream<String> dataStream = env.addSource(new
                FlinkKafkaConsumer011<String>("test", new SimpleStringSchema(), properties));

        dataStream.flatMap(new MyFlatFunction()).keyBy(0).sum(1).print().setParallelism(1);

        env.execute();

    }

    public static class MyFlatFunction implements FlatMapFunction<String, Tuple2<String, Integer>> {

        public void flatMap(String s, Collector<Tuple2<String, Integer>> collector) throws Exception {
            collector.collect(new Tuple2<String, Integer>(s, 1));
        }
    }
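
To make the difference in cardinality between map and flatMap concrete, the following sketch uses the JDK's own java.util.stream API as an analogy. It is not Flink code; the class CardinalitySketch and its methods are made up for illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

class CardinalitySketch {

    // map: every input line yields exactly one output (here, its length)
    static List<Integer> mapLengths(List<String> lines) {
        return lines.stream().map(String::length).collect(Collectors.toList());
    }

    // flatMap: every input line yields zero or more outputs (here, its words)
    static List<String> flatMapWords(List<String> lines) {
        return lines.stream()
                .flatMap(line -> Arrays.stream(line.split(" ")).filter(w -> !w.isEmpty()))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("hello flink", "", "stream");
        System.out.println(mapLengths(lines));   // one length per input line
        System.out.println(flatMapWords(lines)); // the empty line contributes nothing
    }
}
```

In a Flink FlatMapFunction the same effect comes from calling collector.collect(...) zero, one, or several times per input element.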

4.3 filter

Purpose: filtering. Only elements for which the filter function returns true are kept.

 public static void main(String[] args) throws Exception {

        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Kafka configuration
        Properties properties = new Properties();
        properties.setProperty("bootstrap.servers", "127.0.0.1:9092");
        properties.setProperty("group.id", "consumer-group");
        properties.setProperty("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        properties.setProperty("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        properties.setProperty("auto.offset.reset", "latest");

        // Read data from Kafka
        DataStream<String> dataStream = env.addSource(new
                FlinkKafkaConsumer011<String>("test", new SimpleStringSchema(), properties));

        //dataStream.map(new MyMapFunction()).keyBy(0).sum(1).print().setParallelism(1);
        //dataStream.flatMap(new MyFlatFunction()).keyBy(0).sum(1).print().setParallelism(1);
        dataStream.filter(new MyFilterFunction()).map(new MyMapFunction()).keyBy(0).sum(1).print().setParallelism(1);
        env.execute();

    }
    public static class MyMapFunction implements MapFunction<String, Tuple2<String, Integer>> {

        public Tuple2<String, Integer> map(String s) throws Exception {
            return new Tuple2<String, Integer>(s, 1);
        }
    }
    public static class MyFilterFunction implements FilterFunction<String> {

        // Keep only strings that start with "a"; everything else is dropped
        public boolean filter(String s) throws Exception {
            return s.startsWith("a");
        }
    }

4.4 KeyBy

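DataStream → KeyedStream: logically splits a stream into disjoint partitions, each containing the elements that share the same key. Internally this is implemented by hashing the key.

  • Code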
public static void main(String[] args) throws Exception {

        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Kafka configuration
        Properties properties = new Properties();
        properties.setProperty("bootstrap.servers", "127.0.0.1:9092");
        properties.setProperty("group.id", "consumer-group");
        properties.setProperty("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        properties.setProperty("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        properties.setProperty("auto.offset.reset", "latest");

        // Read data from Kafka
        DataStream<String> dataStream = env.addSource(new
                FlinkKafkaConsumer011<String>("test", new SimpleStringSchema(), properties));
        DataStream<Tuple2<String, Integer>> flatMapStream = dataStream.flatMap(new MyFlatFunction());
        flatMapStream.keyBy(0).max(1).print().setParallelism(1);
        flatMapStream.keyBy(0).maxBy(1).print().setParallelism(2);
        env.execute();

    }

    public static class MyFlatFunction implements FlatMapFunction<String, Tuple2<String, Integer>> {

        public void flatMap(String s, Collector<Tuple2<String, Integer>> collector) throws Exception {
            Random random = new Random();
            collector.collect(new Tuple2<String, Integer>(s, random.nextInt(100)));
        }
    }
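
The idea of "same key, same partition" can be sketched in plain Java as follows. This is a deliberately simplified model: the modulo formula is an assumption for illustration only, since Flink's actual key routing is more involved (keys are hashed into key groups, which are then assigned to parallel subtasks). The class KeyPartitionSketch is hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class KeyPartitionSketch {

    // Simplified stand-in for key-based routing: the same key always maps to the same index
    static int partitionFor(String key, int parallelism) {
        return Math.floorMod(key.hashCode(), parallelism);
    }

    // Groups the keys by the partition they would be routed to, preserving arrival order
    static Map<Integer, List<String>> route(List<String> keys, int parallelism) {
        Map<Integer, List<String>> partitions = new HashMap<>();
        for (String key : keys) {
            partitions.computeIfAbsent(partitionFor(key, parallelism), p -> new ArrayList<>()).add(key);
        }
        return partitions;
    }

    public static void main(String[] args) {
        System.out.println(route(Arrays.asList("a", "b", "a", "c", "b"), 2));
    }
}
```

Different keys may land in the same partition, but all elements of one key always end up together, which is what keyed aggregations rely on.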

4.5 Rolling Aggregation Operators
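
Rolling aggregations such as sum(), min(), max(), minBy() and maxBy() are applied to a KeyedStream and emit an updated result for every incoming element; the KeyBy example above uses max(1) and maxBy(1). The plain-Java sketch below models the difference between the two. It is not Flink code: the Reading type is hypothetical, and the assumption that max() leaves the non-aggregated fields at the values of the first element should be verified against the Flink version in use.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class RollingMaxSketch {

    // Hypothetical record: key, value to aggregate, and one extra field
    static final class Reading {
        final String key;
        final int value;
        final long ts;

        Reading(String key, int value, long ts) {
            this.key = key;
            this.value = value;
            this.ts = ts;
        }

        @Override
        public String toString() {
            return "(" + key + "," + value + "," + ts + ")";
        }
    }

    // byElement = false models max(): only the value field is updated
    // byElement = true models maxBy(): the whole element holding the maximum is kept
    static List<String> rolling(List<Reading> input, boolean byElement) {
        Map<String, Reading> state = new HashMap<>();
        List<String> out = new ArrayList<>();
        for (Reading r : input) {
            Reading acc = state.get(r.key);
            Reading next;
            if (acc == null) {
                next = r;
            } else if (byElement) {
                next = r.value > acc.value ? r : acc;
            } else {
                next = new Reading(acc.key, Math.max(acc.value, r.value), acc.ts);
            }
            state.put(r.key, next);
            out.add(next.toString()); // one output per input: the running result
        }
        return out;
    }

    static List<Reading> sample() {
        return Arrays.asList(new Reading("a", 1, 100L), new Reading("a", 5, 200L), new Reading("a", 3, 300L));
    }

    public static void main(String[] args) {
        System.out.println(rolling(sample(), false)); // max-style
        System.out.println(rolling(sample(), true));  // maxBy-style
    }
}
```

With a Tuple2 of key and value, as in the KeyBy example, the two operators are hard to tell apart; the difference only shows once the element carries additional fields.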

4.6 Reduce

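KeyedStream → DataStream: an aggregation over a keyed stream that merges the current element with the result of the previous aggregation to produce a new value. The returned stream contains the result of every aggregation step, not only the final result of the last one.

  • Code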
public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Kafka configuration
        Properties properties = new Properties();
        properties.setProperty("bootstrap.servers", "127.0.0.1:9092");
        properties.setProperty("group.id", "consumer-group");
        properties.setProperty("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        properties.setProperty("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        properties.setProperty("auto.offset.reset", "latest");

        // Read data from Kafka
        DataStream<String> dataStream = env.addSource(new
                FlinkKafkaConsumer011<String>("test", new SimpleStringSchema(), properties));

        dataStream.map(new MyMapFunction()).keyBy(0).reduce(new MyReduceFunction()).print().setParallelism(1);

        env.execute();
    }

    // <String, Integer, Long> ==> key, value, timestamp
    public static class MyMapFunction implements MapFunction<String, Tuple3<String, Integer, Long>> {

        public Tuple3<String, Integer, Long> map(String s) throws Exception {
            String[] split = s.split(",");
            return new Tuple3<String, Integer, Long>(split[0], Integer.valueOf(split[1]), Long.valueOf(split[2]));
        }
    }

    public static class MyReduceFunction implements ReduceFunction<Tuple3<String, Integer, Long>> {

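        /**
         * Keeps the larger value and always carries the timestamp of the newest element.
         * @param oldVal the result of the previous aggregation
         * @param newVal the newly arrived element
         * @return the merged result
         * @throws Exception
         */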
        public Tuple3<String, Integer, Long> reduce(Tuple3<String, Integer, Long> oldVal, Tuple3<String, Integer, Long> newVal) throws Exception {
            if (oldVal.f1 < newVal.f1) {
                return newVal;
            }
            return new Tuple3<String, Integer, Long>(oldVal.f0, oldVal.f1, newVal.f2);
        }
    }
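
The "one output per input" behaviour described above can be traced with a small plain-Java model. It is only a sketch of the semantics, not of how Flink executes a reduce: the Event type and the rollingReduce helper are hypothetical, and the merge logic mirrors MyReduceFunction.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class RollingReduceSketch {

    // Hypothetical stand-in for Tuple3<String, Integer, Long>
    static final class Event {
        final String key;
        final int value;
        final long ts;

        Event(String key, int value, long ts) {
            this.key = key;
            this.value = value;
            this.ts = ts;
        }

        @Override
        public String toString() {
            return "(" + key + "," + value + "," + ts + ")";
        }
    }

    // Same logic as MyReduceFunction: keep the larger value, always take the newest timestamp
    static Event reduce(Event oldVal, Event newVal) {
        if (oldVal.value < newVal.value) {
            return newVal;
        }
        return new Event(oldVal.key, oldVal.value, newVal.ts);
    }

    // Keeps one running result per key and emits it after every element
    static List<String> rollingReduce(List<Event> input) {
        Map<String, Event> state = new HashMap<>();
        List<String> out = new ArrayList<>();
        for (Event e : input) {
            // merge() stores e for a new key, otherwise reduce(previous, e)
            Event merged = state.merge(e.key, e, RollingReduceSketch::reduce);
            out.add(merged.toString());
        }
        return out;
    }

    static List<Event> sample() {
        return Arrays.asList(new Event("a", 3, 1L), new Event("a", 1, 2L),
                new Event("b", 7, 3L), new Event("a", 9, 4L));
    }

    public static void main(String[] args) {
        System.out.println(rollingReduce(sample()));
    }
}
```

Four inputs produce four outputs; the second output keeps value 3 for key "a" but already carries the timestamp of the second element.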

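4.7 Split and Select (stream splitting)

4.7.1 Concepts

Note: in newer versions (1.13.0) these operators have been removed.

  • Split
    DataStream → SplitStream: splits one DataStream into two or more DataStreams according to some characteristic.

  • Select
    SplitStream → DataStream: retrieves one or more DataStreams from a SplitStream.

4.7.2 Example: split the reported data into two streams at a threshold of 50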
public static void main(String[] args) throws Exception {

        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();


        DataStream<String> dataStream = env.fromCollection(Arrays.asList(
                "a,1",
                "a,49",
                "a,55",
                "a,58"
        ));


        SplitStream<Tuple2<String, Integer>> splitStream = dataStream.map(new MyMapFunction()).split(new MySplitFunction());

        splitStream.select("high").print("high").setParallelism(1);
        splitStream.select("low").print("low").setParallelism(1);
        splitStream.select("high", "low").print("all").setParallelism(1);

        env.execute();

    }

    public static class MyMapFunction implements MapFunction<String, Tuple2<String, Integer>> {

        public Tuple2<String, Integer> map(String s) throws Exception {
            String[] split = s.split(",");
            return new Tuple2<String, Integer>(split[0], Integer.valueOf(split[1].trim()));
        }
    }


    public static class MySplitFunction implements OutputSelector<Tuple2<String, Integer>> {

        public Iterable<String> select(Tuple2<String, Integer> data) {
            if(data.f1 > 50) {
                return Collections.singletonList("high");
            }
            return Collections.singletonList("low");
        }
    }
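
The routing performed by split and select can be modelled in a few lines of plain Java. This is only a sketch of the semantics with hypothetical names (SplitSelectSketch, tagFor, select); it is not Flink code. In Flink versions where split/select are no longer available, side outputs are the usual way to achieve the same routing.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class SplitSelectSketch {

    // Mirrors MySplitFunction: values above 50 are tagged "high", all others "low"
    static String tagFor(int value) {
        return value > 50 ? "high" : "low";
    }

    // Mirrors select(names...): keeps the values whose tag is among the requested names
    static List<Integer> select(List<Integer> values, String... names) {
        Set<String> wanted = new HashSet<>(Arrays.asList(names));
        List<Integer> out = new ArrayList<>();
        for (int v : values) {
            if (wanted.contains(tagFor(v))) {
                out.add(v);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Integer> values = Arrays.asList(1, 49, 55, 58);
        System.out.println("high " + select(values, "high"));
        System.out.println("low " + select(values, "low"));
        System.out.println("all " + select(values, "high", "low"));
    }
}
```

Note that a value of exactly 50 is tagged "low", because the comparison is strictly greater than 50.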

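4.8 Connect and CoMap

4.8.1 Concepts
  • Connect
    DataStream, DataStream → ConnectedStreams: connects two data streams while preserving their types. After being connected, the two streams are merely placed in one stream; internally each keeps its own data and form unchanged, and the two streams remain independent of each other.

  • CoMap/CoFlatMap
    ConnectedStreams → DataStream: applied to ConnectedStreams. It works like map and flatMap, applying a separate map or flatMap function to each of the two streams in the ConnectedStreams.

  • Code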

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();


        DataStream<Tuple2<String, Integer>> dataStream1 = env.fromCollection(Arrays.asList(
                new Tuple2<String, Integer>("a", 1),
                new Tuple2<String, Integer>("a", 2),
                new Tuple2<String, Integer>("a", 3),
                new Tuple2<String, Integer>("a", 4),
                new Tuple2<String, Integer>("a", 5),
                new Tuple2<String, Integer>("a", 6)
        ));

        DataStreamSource<Tuple2<String, String>> dataStream2 = env.fromCollection(Arrays.asList(
                new Tuple2<String, String>("b", "1"),
                new Tuple2<String, String>("b", "2"),
                new Tuple2<String, String>("b", "3"),
                new Tuple2<String, String>("b", "4"),
                new Tuple2<String, String>("b", "5"),
                new Tuple2<String, String>("b", "6"),
                new Tuple2<String, String>("b", "7")
        ));

        // Connect the streams; the data of the two streams stays independent
        ConnectedStreams<Tuple2<String, Integer>, Tuple2<String, String>> connectStream = dataStream1.connect(dataStream2);

        connectStream.map(new CoMapFunction<Tuple2<String, Integer>, Tuple2<String, String>, Object>() {

            // Apply any transformation to elements of the first stream
            @Override
            public Object map1(Tuple2<String, Integer> stringIntegerTuple2) throws Exception {
                return stringIntegerTuple2;
            }

            // Apply any transformation to elements of the second stream
            @Override
            public Object map2(Tuple2<String, String> stringStringTuple2) throws Exception {
                return stringStringTuple2;
            }
        }).print().setParallelism(1);

        env.execute();

    }

4.9 Union

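DataStream → DataStream: performs a union of two or more DataStreams, producing a new DataStream that contains all elements of the input DataStreams.

4.9.1 Differences between Connect and Union:
  • With Union, the streams must have the same type beforehand; with Connect the types may differ and can be unified afterwards in the coMap.
  • Connect can only operate on two streams, whereas Union can operate on several.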
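
The two differences can be illustrated with a plain-Java model in which lists stand in for streams. Everything here is hypothetical (ConnectUnionSketch, its CoMap interface, the helper methods) and only mirrors the type constraints; in a real job the elements of the inputs interleave in no fixed order, while this sketch simply processes one input after the other.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class ConnectUnionSketch {

    // union: any number of inputs, but all must share the element type T
    @SafeVarargs
    static <T> List<T> union(List<T>... streams) {
        List<T> out = new ArrayList<>();
        for (List<T> stream : streams) {
            out.addAll(stream);
        }
        return out;
    }

    // Hypothetical stand-in for a co-map function: one method per input type, one common output type
    interface CoMap<A, B, R> {
        R map1(A first);
        R map2(B second);
    }

    // connect + map: exactly two inputs whose types may differ; the function unifies them
    static <A, B, R> List<R> connectAndMap(List<A> first, List<B> second, CoMap<A, B, R> fn) {
        List<R> out = new ArrayList<>();
        for (A a : first) {
            out.add(fn.map1(a));
        }
        for (B b : second) {
            out.add(fn.map2(b));
        }
        return out;
    }

    // Connects an Integer input with a String input and maps both to String
    static List<String> demo() {
        return connectAndMap(Arrays.asList(1, 2), Arrays.asList("x"), new CoMap<Integer, String, String>() {
            @Override
            public String map1(Integer first) {
                return "int:" + first;
            }

            @Override
            public String map2(String second) {
                return "str:" + second;
            }
        });
    }

    public static void main(String[] args) {
        System.out.println(union(Arrays.asList(1, 2), Arrays.asList(3), Arrays.asList(4, 5)));
        System.out.println(demo());
    }
}
```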

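5 Rich Functions

"Rich functions" are function-class interfaces provided by the DataStream API; every Flink function class has a Rich version. Unlike a regular function, a rich function can access the context of the runtime environment and has lifecycle methods, so it can implement more complex logic.

  • RichMapFunction
  • RichFlatMapFunction
  • RichFilterFunction

Rich functions have the notion of a lifecycle. Typical lifecycle methods are:

  • open() is the initialization method of a rich function; it is called before the operator's processing method, such as map or filter, is invoked.
  • close() is the last method called in the lifecycle and is where cleanup work is done.
  • getRuntimeContext() gives access to the function's RuntimeContext, which exposes information such as the parallelism the function runs with, the task name, and state.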
    public static class MyMapFunction extends RichMapFunction<SensorReading,
            Tuple2<Integer, String>> {
        @Override
        public Tuple2<Integer, String> map(SensorReading value) throws Exception {
            return new Tuple2<>(getRuntimeContext().getIndexOfThisSubtask(),
                    value.getId());
        }

        @Override
        public void open(Configuration parameters) throws Exception {
            System.out.println("my map open");
            // Initialization work can go here, for example opening a connection to HDFS
        }

        @Override
        public void close() throws Exception {
            System.out.println("my map close");
            // Cleanup work can go here, for example closing the connection to HDFS
        }
    }
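
The order of the lifecycle calls can be made visible with a small plain-Java harness. This is a sketch under stated assumptions: RichLikeFunction and run are hypothetical stand-ins, not Flink classes, and the try/finally around the processing loop is a choice made for this sketch rather than a description of Flink's internals.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class LifecycleSketch {

    // Hypothetical, simplified stand-in for a rich map function
    interface RichLikeFunction<I, O> {
        void open();

        O map(I value);

        void close();
    }

    // Drives the lifecycle: open once, map once per element, close once at the end
    static <I, O> List<O> run(RichLikeFunction<I, O> fn, List<I> input) {
        List<O> out = new ArrayList<>();
        fn.open();
        try {
            for (I value : input) {
                out.add(fn.map(value));
            }
        } finally {
            fn.close();
        }
        return out;
    }

    // Records the order in which the lifecycle methods are called
    static List<String> demo() {
        final List<String> calls = new ArrayList<>();
        run(new RichLikeFunction<Integer, Integer>() {
            @Override
            public void open() {
                calls.add("open");
            }

            @Override
            public Integer map(Integer value) {
                calls.add("map:" + value);
                return value * 2;
            }

            @Override
            public void close() {
                calls.add("close");
            }
        }, Arrays.asList(1, 2));
        return calls;
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

In a Flink job each parallel subtask has its own function instance, so open() and close() run once per subtask rather than once per job.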