5 Flink Stream Processing Core Programming

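Like every other computing framework, Flink has a set of basic development steps and core APIs. In terms of development steps, a Flink program consists of four main parts: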

  1. Environment
  2. Source
  3. Transform
  4. Sink

5.1 Environment

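When a Flink job is submitted for execution, it must first establish a connection with the Flink framework, that is, obtain the current Flink execution environment. Only with this environment information can tasks be scheduled onto the different TaskManagers. Obtaining the environment object is straightforward.

// Batch processing environment
ExecutionEnvironment benv = ExecutionEnvironment.getExecutionEnvironment();

// Stream processing environment
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();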

5.2 Source

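Flink can obtain data from many different origins and hand it to the framework for processing. The origin the data is read from is called the data source (Source).

5.2.1 Preparation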

    <properties>
        <flink.version>1.13.0</flink.version>
        <java.version>1.8</java.version>
        <scala.binary.version>2.12</scala.binary.version>
        <slf4j.version>1.7.30</slf4j.version>
    </properties>

    <dependencies>

        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-java</artifactId>
            <version>${flink.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-streaming-java_${scala.binary.version}</artifactId>
            <version>${flink.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-clients_${scala.binary.version}</artifactId>
            <version>${flink.version}</version>
        </dependency>

        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-runtime-web_${scala.binary.version}</artifactId>
            <version>${flink.version}</version>
        </dependency>

        <dependency>
            <groupId>org.slf4j</groupId>
            <artifactId>slf4j-api</artifactId>
            <version>${slf4j.version}</version>
        </dependency>
        <dependency>
            <groupId>org.slf4j</groupId>
            <artifactId>slf4j-log4j12</artifactId>
            <version>${slf4j.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.logging.log4j</groupId>
            <artifactId>log4j-to-slf4j</artifactId>
            <version>2.14.0</version>
        </dependency>


        <!-- https://mvnrepository.com/artifact/org.projectlombok/lombok -->
        <dependency>
            <groupId>org.projectlombok</groupId>
            <artifactId>lombok</artifactId>
            <version>1.18.16</version>
        </dependency>
    </dependencies>

    <build>
        <plugins>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-assembly-plugin</artifactId>
                <version>3.3.0</version>
                <configuration>
                    <descriptorRefs>
                        <descriptorRef>jar-with-dependencies</descriptorRef>
                    </descriptorRefs>
                </configuration>
                <executions>
                    <execution>
                        <id>make-assembly</id>
                        <phase>package</phase>
                        <goals>
                            <goal>single</goal>
                        </goals>
                    </execution>
                </executions>
            </plugin>
        </plugins>
    </build>
1.	Import the annotation tool dependency (Lombok), which makes it easy to generate POJO classes

<!-- https://mvnrepository.com/artifact/org.projectlombok/lombok -->
<dependency>
    <groupId>org.projectlombok</groupId>
    <artifactId>lombok</artifactId>
    <version>1.18.16</version>
</dependency>
2.	Prepare a WaterSensor class for the demos

package com.atguigu.flink.source;

import lombok.AllArgsConstructor;
import lombok.Data;
import lombok.NoArgsConstructor;

/**
 * Water level sensor: holds one water level reading
 *
 * id: sensor id
 * ts: timestamp
 * vc: water level
 */

@Data
@NoArgsConstructor
@AllArgsConstructor
public class WaterSensor {
    private String id;
    private Long ts;
    private Integer vc;
}

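5.2.2 Reading data from a Java collection

In general, data can be held temporarily in memory in some data structure and then used as a data source. A collection type is the most common choice of data structure here.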

package com.atguigu.flink.source;


import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

import java.util.Arrays;
import java.util.List;
import java.util.Random;

public class Test01_Source_Collection {
    public static void main(String[] args) throws Exception {
        List<WaterSensor> waterSensors = Arrays.asList(
                new WaterSensor("ws_001", System.currentTimeMillis(), new Random().nextInt(50)),
                new WaterSensor("ws_002", System.currentTimeMillis(), new Random().nextInt(50)),
                new WaterSensor("ws_003", System.currentTimeMillis(), new Random().nextInt(50)));

        // 1 Create the execution environment
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setParallelism(2);

        env
                .fromCollection(waterSensors)
                .print();

        env.execute();
    }
}

5.2.3 Reading data from a file

package com.atguigu.flink.source;

import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class Test02_Source_File {
    public static void main(String[] args) throws Exception {
        // 1 Create the execution environment
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env
                .readTextFile("input")
                .print();

        env.execute();
    }
}

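Notes:

  1. The argument can be a directory or a file
  2. The path can be relative or absolute
  3. A relative path is resolved against the system property user.dir: in IDEA this is the project root directory; in standalone mode it is the root directory of the cluster node
  4. Data can also be read from an HDFS directory using a path of the form hdfs://hadoop102:8020/… Because Flink does not bundle the Hadoop dependencies, the Hadoop client dependency must be added to the pom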
<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-client</artifactId>
    <version>3.1.3</version>
</dependency>
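
The note on relative paths can be checked outside Flink. The following is a minimal plain-JDK sketch, not Flink code: the class name RelativePathDemo and the helper resolveAgainstUserDir are hypothetical names chosen for illustration. It only shows that the JVM resolves a relative path such as input against the user.dir system property.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

class RelativePathDemo {
    /** Resolves a relative path explicitly against the user.dir system property. */
    static Path resolveAgainstUserDir(String relative) {
        return Paths.get(System.getProperty("user.dir")).resolve(relative);
    }

    public static void main(String[] args) {
        // What the JVM does on its own when asked for the absolute form of "input"
        Path viaJvm = Paths.get("input").toAbsolutePath();
        // The same path built by hand from user.dir
        Path viaUserDir = resolveAgainstUserDir("input");

        System.out.println("user.dir       = " + System.getProperty("user.dir"));
        System.out.println("input resolves = " + viaJvm);
        System.out.println("same location  = " + viaJvm.equals(viaUserDir));
    }
}
```

Running this from the IDE and from a cluster node should print different user.dir values, which is why the same relative path can point to different locations in the two environments.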

5.2.4 Reading data from a socket

package com.atguigu.flink.source;

import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class Test03_Source_Socket {
    public static void main(String[] args) throws Exception {
        // 1 Create the execution environment
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Flink opens the socket connection itself, so no manual java.net.Socket is needed
        DataStreamSource<String> lineDS = env.socketTextStream("hadoop102", 9999);

        lineDS
                .print();

        // Trigger execution of the job
        env.execute();
    }
}

5.2.5 Reading data from Kafka

<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-connector-kafka_2.12</artifactId>
    <version>1.13.0</version>
</dependency>
package com.atguigu.flink.source;

import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;

import java.util.Properties;

public class Test04_Source_Kafka {
    public static void main(String[] args) throws Exception {

        // 0 Kafka configuration
        Properties properties = new Properties();
        properties.setProperty("bootstrap.servers", "hadoop102:9092,hadoop103:9092,hadoop104:9092");
        properties.setProperty("group.id", "Flink01_Source_Kafka");
        properties.setProperty("auto.offset.reset", "latest");


        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env
                .addSource(new FlinkKafkaConsumer<>("sensor", new SimpleStringSchema(), properties))
                .print("kafka source");


        env.execute();

    }
}
kafka-console-producer.sh --broker-list hadoop102:9092 --topic sensor

5.2.6 Custom Source

In most cases the data sources above are sufficient, but special cases inevitably come up, so Flink also provides a way to define custom data sources.

package com.atguigu.flink.source;

import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.source.SourceFunction;

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.Socket;
import java.nio.charset.StandardCharsets;

public class Test05_Source_Custom {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        env
                .addSource(new MySource("hadoop102", 9999))
                .print();
        env.execute();
    }

    public static class MySource implements SourceFunction<WaterSensor> {

        private final String host;
        private final int port;
        private volatile boolean isRunning = true;
        // Kept as a field so that cancel(), which runs on another thread, can close it
        private transient volatile Socket socket;

        public MySource(String host, int port) {
            this.host = host;
            this.port = port;
        }


        @Override
        public void run(SourceContext<WaterSensor> ctx) throws Exception {
            // A source that reads lines of the form id,ts,vc from a socket
            socket = new Socket(host, port);
            BufferedReader reader = new BufferedReader(new InputStreamReader(socket.getInputStream(), StandardCharsets.UTF_8));
            String line;
            while (isRunning && (line = reader.readLine()) != null) {
                String[] split = line.split(",");
                ctx.collect(new WaterSensor(split[0], Long.valueOf(split[1]), Integer.valueOf(split[2])));
            }
        }

        @Override
        public void cancel() {
            isRunning = false;
            try {
                // The socket is null if cancel() is called before run() has connected
                if (socket != null) {
                    socket.close();
                }
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }
}

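Writing a custom SourceFunction:

  1. Implement the SourceFunction interface
  2. Override two methods:
    1. run(): the main logic
    2. cancel(): the stop logic

If the Source should support a configurable parallelism, implement the ParallelSourceFunction interface instead.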
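
The run() method of the custom source splits each line on commas and assumes three well-formed fields, so a single malformed line would throw and fail the source. A minimal sketch of a more defensive parser follows. It is plain Java without Flink: SensorLineParser and the stand-in class SensorReading are hypothetical names introduced here (SensorReading mirrors the id, ts, vc fields of WaterSensor), and returning an empty Optional for bad input is one possible policy, not something the original code prescribes.

```java
import java.util.Optional;

class SensorLineParser {
    /** Minimal stand-in for the WaterSensor POJO (id, ts, vc). */
    static final class SensorReading {
        final String id;
        final long ts;
        final int vc;

        SensorReading(String id, long ts, int vc) {
            this.id = id;
            this.ts = ts;
            this.vc = vc;
        }
    }

    /** Parses "id,ts,vc"; returns empty instead of throwing on malformed input. */
    static Optional<SensorReading> parse(String line) {
        if (line == null) {
            return Optional.empty();
        }
        String[] parts = line.split(",");
        if (parts.length != 3) {
            return Optional.empty();
        }
        try {
            return Optional.of(new SensorReading(
                    parts[0].trim(),
                    Long.parseLong(parts[1].trim()),
                    Integer.parseInt(parts[2].trim())));
        } catch (NumberFormatException e) {
            // Non-numeric timestamp or water level
            return Optional.empty();
        }
    }

    public static void main(String[] args) {
        System.out.println(parse("ws_001,1577844001,45").isPresent()); // well-formed line
        System.out.println(parse("ws_001,abc,45").isPresent());        // non-numeric timestamp
        System.out.println(parse("ws_001,1577844001").isPresent());    // missing field
    }
}
```

Inside a source, the loop body could then call ctx.collect only when the Optional is present and skip or log the line otherwise.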

5.3 Transform

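Transformation operators turn one or more DataStreams into a new DataStream. A program can combine multiple transformations into a complex dataflow topology.

5.3.1 map

Purpose: transforms the elements of a stream into a new stream, consuming one element and producing one element
Parameter: a lambda expression or a MapFunction implementation
Returns: DataStream -> DataStream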
package com.atguigu.flink.transform;

import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class Test01_Map_Anonymous {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        env
                .fromElements(1, 2, 3, 4, 5, 6)

                // Anonymous inner class
/*                .map(new MapFunction<Integer, Integer>() {

                    @Override
                    public Integer map(Integer value) throws Exception {
                        return value * value;
                    }
                })
                .print();*/

                // Lambda expression
/*
                .map(ele -> ele * ele)
                .print();
*/

                // Static nested class
                .map(new MyMapFunction())
                .print();

        env.execute();
    }

    private static class MyMapFunction implements MapFunction<Integer, Integer> {
        @Override
        public Integer map(Integer value) throws Exception {
            return value * value;
        }
    }
}

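Rich…Function classes

Every Flink function class has a Rich version. It differs from the regular function in that it can access the runtime context and has lifecycle methods, so it can implement more complex logic. In other words, it offers richer functionality. Example: RichMapFunction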

// Produces a new stream whose elements are the squares of the original elements
package com.atguigu.flink.transform;

import org.apache.flink.api.common.functions.IterationRuntimeContext;
import org.apache.flink.api.common.functions.RichMapFunction;
import org.apache.flink.api.common.functions.RuntimeContext;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class Test02_Map_RichMapFunction {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        env
                .fromElements(1, 2, 3, 4, 5)
                .map(new MyRichMapFunction())
                .setParallelism(2)
                .print();

        env.execute();
    }

    private static class MyRichMapFunction extends RichMapFunction<Integer, Integer> {


        @Override
        public void setRuntimeContext(RuntimeContext t) {
            System.out.println("setRuntimeContext called");
            // Delegate to the parent so that getRuntimeContext() keeps working
            super.setRuntimeContext(t);
        }

        @Override
        public RuntimeContext getRuntimeContext() {
            System.out.println("getRuntimeContext called");
            return super.getRuntimeContext();
        }

        @Override
        public IterationRuntimeContext getIterationRuntimeContext() {
            System.out.println("getIterationRuntimeContext called");
            return super.getIterationRuntimeContext();
        }

        // Lifecycle method: initialization, called only once per parallel instance
        @Override
        public void open(Configuration parameters) throws Exception {
            System.out.println("open called");
        }

        // Lifecycle method: the last method called, for cleanup, called only once per parallel instance
        @Override
        public void close() throws Exception {
            System.out.println("close called");
        }

        @Override
        public Integer map(Integer value) throws Exception {
            System.out.println("map called for one element");
            return value * value;
        }
    }
}
setRuntimeContext called
setRuntimeContext called
open called
open called
map called for one element
map called for one element
map called for one element
map called for one element
map called for one element
close called
close called
1> 16
7> 9
6> 1
8> 25
12> 4

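How often each method runs:

  1. The lifecycle method open() is the initialization method. It is called only once per parallel instance, and it is called first.
  2. The lifecycle method close() is the last method called and does cleanup. It is called only once per parallel instance, except that when reading from a file it is called twice per parallel instance.
  3. getRuntimeContext() returns the function's RuntimeContext, which exposes information such as the parallelism the function runs with, the task name, and state. Developers can call it whenever they need the runtime context object.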

5.3.2 flatMap

Purpose: consumes one element and produces zero or more elements
Parameter: a FlatMapFunction implementation
Returns: DataStream -> DataStream
package com.atguigu.flink.transform;

import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.util.Collector;


public class Test03_FlatMap {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        env
                .fromElements(1, 2, 3, 4, 5)
                // Anonymous inner class
/*                .flatMap(new FlatMapFunction<Integer, Integer>() {
                    @Override
                    public void flatMap(Integer value, Collector<Integer> out) throws Exception {
                        out.collect(value*value);
                        out.collect(value*value*value);
                    }
                })
                .print();*/

                // Lambda
                .flatMap((Integer value, Collector<Integer> out) -> {
                    out.collect(value * value);
                    out.collect(value * value * value);
                })
                .returns(Types.INT)
                .print();

        env.execute();
    }
}

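Note: when a lambda expression is used, type erasure means the concrete generic types cannot be recovered at runtime, and everything would be treated as Object, which is extremely inefficient. Flink therefore requires the type to be specified explicitly (here with returns(Types.INT)) whenever the parameters involve generics.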
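
The JVM behavior behind this note can be observed with plain reflection. The sketch below is an illustration only and does not use Flink; the class name LambdaErasureDemo and its helper methods are hypothetical. It shows that an anonymous class still records the type arguments of the interface it implements, while a lambda does not. Flink's own type extraction is more involved than this check.

```java
import java.lang.reflect.ParameterizedType;
import java.lang.reflect.Type;
import java.util.function.Function;

class LambdaErasureDemo {
    /** True if the object's class still records the type arguments of its first interface. */
    static boolean keepsTypeArguments(Object function) {
        Type[] interfaces = function.getClass().getGenericInterfaces();
        return interfaces.length > 0 && interfaces[0] instanceof ParameterizedType;
    }

    /** Anonymous inner class: the compiler stores Function<Integer, String> in the class file. */
    static Function<Integer, String> anonymous() {
        return new Function<Integer, String>() {
            @Override
            public String apply(Integer value) {
                return "v" + value;
            }
        };
    }

    /** Lambda: the generated class implements the raw Function interface only. */
    static Function<Integer, String> lambda() {
        return value -> "v" + value;
    }

    public static void main(String[] args) {
        System.out.println("anonymous class keeps type arguments: " + keepsTypeArguments(anonymous()));
        System.out.println("lambda keeps type arguments:          " + keepsTypeArguments(lambda()));
    }
}
```

This is why the anonymous-class version of flatMap needs no extra type hint, while the lambda version is followed by returns(Types.INT).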

5.3.3 filter

Purpose: keeps the elements that satisfy the given condition (true) and drops those that do not (false).
Parameter: a FilterFunction implementation
Returns: DataStream -> DataStream
package com.atguigu.flink.transform;

import org.apache.flink.api.common.functions.FilterFunction;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;


public class Test04_Filter {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

/*        env
                .socketTextStream("localhost", 9999)
                .map(ele -> Integer.valueOf(ele))
                .filter(new FilterFunction<Integer>() {
                    @Override
                    public boolean filter(Integer value) throws Exception {
                        return value % 2 == 0;
                    }
                })
                .print();*/

        env
                .socketTextStream("localhost", 9999)
                .map(ele -> Integer.valueOf(ele))
                .filter(ele -> ele % 2 == 0)
                .print();

        env.execute();
    }
}

5.3.4 keyBy

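Purpose:
	1 Distributes the elements of a stream across partitions. Elements with the same key go to the same partition; one partition can hold several different keys
	2 Implemented internally with hash partitioning

Grouping versus partitioning:
	Group: a logical division by key. After keyBy, all data of one group is guaranteed to enter the same partition
	Partition: one parallel instance of the downstream operator (equivalent to one slot). A single partition may hold several groups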
	
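Parameter:
Key selector function: interface KeySelector<IN, KEY>
Note: values that cannot be used as the key of a KeySelector:
	1 A POJO that does not override hashCode and relies on Object's hashCode. The job runs, but the grouping is meaningless because every element ends up in its own group
	2 Arrays of any type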
Returns: DataStream -> KeyedStream
package com.atguigu.flink.transform;

import org.apache.flink.api.java.functions.KeySelector;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class Test05_KeyBy {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        env
                .fromElements(10, 3, 5, 9, 20, 8)
/*                .keyBy(new KeySelector<Integer, String>() {
                    @Override
                    public String getKey(Integer value) throws Exception {
                        return value % 2 == 0 ? "even" : "odd";
                    }
                })
                .print();*/
                .keyBy(value -> value % 2 == 0 ? "even" : "odd")
                .print();
                
        env.execute();
    }
}

Summary:

  1. By position index, only for Tuple types
KeyedStream<WaterSensor, Tuple> sensorKS = sensorDS.keyBy(0);
  2. By field name, for POJOs
KeyedStream<WaterSensor, Tuple> sensorKS = sensorDS.keyBy("id");
  3. With a KeySelector (recommended)
KeyedStream<WaterSensor, String> sensorKS = sensorDS.keyBy(new KeySelector<WaterSensor, String>() {
            @Override
            public String getKey(WaterSensor value) throws Exception {
                return value.getId();
            }
        });

5.3.5 shuffle

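Purpose: randomly redistributes the elements of a stream, so the same input data produces a different distribution on each run
Parameter: none
Returns: DataStream -> DataStream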

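5.3.6 split and select

Deprecated, and removed in 1.12

Purpose: in some cases a stream needs to be split into two or more streams according to some characteristic, by tagging the elements so they can be pulled out of the stream later.
	split tags every element of the stream. select picks the elements with a given tag and forms a new stream from them.

Parameters
	split: interface OutputSelector<OUT>
	select: a string

Returns
split: SingleOutputStreamOperator -> SplitStream
select: SplitStream -> DataStream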
// Anonymous inner class version
// One stream for odd numbers, one for even numbers
SplitStream<Integer> splitStream = env
  .fromElements(10, 3, 5, 9, 20, 8)
  .split(new OutputSelector<Integer>() {
      @Override
      public Iterable<String> select(Integer value) {
          return value % 2 == 0
            ? Collections.singletonList("even")
            : Collections.singletonList("odd");
      }
  });
splitStream
  .select("even")
  .print("even");

splitStream
  .select("odd")
  .print("odd");
env.execute();

// Lambda expression version
// One stream for odd numbers, one for even numbers
SplitStream<Integer> splitStream = env
  .fromElements(10, 3, 5, 9, 20, 8)
  .split(value -> value % 2 == 0
    ? Collections.singletonList("even")
    : Collections.singletonList("odd"));
splitStream
  .select("even")
  .print("even");

splitStream
  .select("odd")
  .print("odd");
env.execute();

5.3.7 connect

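Purpose: in some cases two streams from different origins need to be connected so their data can be matched. Take order payments and third-party transaction records: the two come from different data sources, and only after they are connected and reconciled against each other can a payment be considered truly complete. Flink's connect operator connects two streams while preserving their types. After connect, the two streams are merely placed inside one stream; internally each keeps its own data and form unchanged, and the two remain independent of each other.

Parameter: the other stream

Returns: DataStream[A], DataStream[B] -> ConnectedStreams[A,B]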
package com.atguigu.flink.transform;

import org.apache.flink.streaming.api.datastream.ConnectedStreams;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class Test06_Connect {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        DataStreamSource<Integer> intStream = env.fromElements(1, 2, 3, 4, 5);
        DataStreamSource<String> stringStream = env.fromElements("a", "b", "c");
        ConnectedStreams<Integer, String> connectedStreams = intStream.connect(stringStream);
        connectedStreams.getFirstInput().print("first");
        connectedStreams.getSecondInput().print("second");

        env.execute();
    }
}

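Notes:

  1. The two streams may hold different data types
  2. The streams are only mechanically combined; internally they remain 2 separate streams
  3. Only 2 streams can take part in a connect; a third stream cannot join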

5.3.8 union

Purpose: applies union to two or more DataStreams, producing a new DataStream that contains all elements of all of them
package com.atguigu.flink.transform;

import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class Test07_Union {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        DataStreamSource<Integer> stream1 = env.fromElements(1, 2, 3, 4, 5);
        DataStreamSource<Integer> stream2 = env.fromElements(10, 20, 30, 40, 50);
        DataStreamSource<Integer> stream3 = env.fromElements(100, 200, 300, 400, 500);

        stream1
                .union(stream2)
                .union(stream3)
                .print();
        env.execute();
    }
}

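Differences between connect and union:

  1. The streams passed to union must have the same type; with connect the types may differ
  2. connect operates on exactly two streams; union can operate on multiple streams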

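5.3.9 Simple rolling aggregation operators

sum min max minBy maxBy
Purpose: aggregates each keyed sub-stream of a KeyedStream. The aggregated results are merged back into a single stream, so the result is always a DataStream.

Parameters:
	1 If the stream holds POJOs or Scala case classes, pass the field name
	2 If the stream holds tuples, pass the position (0-based: 0, 1, ...)
	
Returns: KeyedStream -> SingleOutputStreamOperator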
// Example 1: Integer stream keyed by parity
DataStreamSource<Integer> stream = env.fromElements(1, 2, 3, 4, 5);
KeyedStream<Integer, String> kbStream = stream.keyBy(ele -> ele % 2 == 0 ? "even" : "odd");
kbStream.sum(0).print("sum");
kbStream.max(0).print("max");
kbStream.min(0).print("min");

// Example 2: POJO stream, sum by field name
ArrayList<WaterSensor> waterSensors = new ArrayList<>();
waterSensors.add(new WaterSensor("sensor_1", 1607527992000L, 20));
waterSensors.add(new WaterSensor("sensor_1", 1607527994000L, 50));
waterSensors.add(new WaterSensor("sensor_1", 1607527996000L, 30));
waterSensors.add(new WaterSensor("sensor_2", 1607527993000L, 10));
waterSensors.add(new WaterSensor("sensor_2", 1607527995000L, 30));

KeyedStream<WaterSensor, String> kbStream = env
  .fromCollection(waterSensors)
  .keyBy(WaterSensor::getId);

kbStream
  .sum("vc")
  .print("sum...");

// Example 3: POJO stream, maxBy by field name (false: on a tie, take the latest record)
ArrayList<WaterSensor> waterSensors = new ArrayList<>();
waterSensors.add(new WaterSensor("sensor_1", 1607527992000L, 20));
waterSensors.add(new WaterSensor("sensor_1", 1607527994000L, 50));
waterSensors.add(new WaterSensor("sensor_1", 1607527996000L, 50));
waterSensors.add(new WaterSensor("sensor_2", 1607527993000L, 10));
waterSensors.add(new WaterSensor("sensor_2", 1607527995000L, 30));

KeyedStream<WaterSensor, String> kbStream = env
  .fromCollection(waterSensors)
  .keyBy(WaterSensor::getId);

kbStream
  .maxBy("vc", false)
  .print("maxBy...");

env.execute();

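Notes:

Rolling aggregation operators aggregate each record as it arrives.

  1. Aggregation operators are called after keyBy, because they are defined on KeyedStream
  2. Aggregation operators work within a group. In other words, each group is computed separately.
  3. Difference between max and maxBy:
    1. max: takes the current maximum of the specified field. If the record has several fields, the other, non-compared fields keep the values of the first record
    2. maxBy: takes the current maximum of the specified field. If the record has several fields, the other fields come from the record that holds the maximum
    3. If two records both hold the maximum, the second parameter decides:
      1. true => the other fields take the earlier record's values
      2. false => the other fields take the latest record's values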
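
The difference between max and maxBy described in these notes can be traced by hand with a small simulation. The following sketch is plain Java and does not call Flink: MaxVsMaxByDemo, Reading, rollingMax and rollingMaxBy are hypothetical names, and the methods only imitate the semantics stated above for the records of one key. It is not Flink's implementation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class MaxVsMaxByDemo {
    /** Minimal stand-in for the records of a single key: timestamp plus water level. */
    static final class Reading {
        final long ts;
        final int vc;

        Reading(long ts, int vc) {
            this.ts = ts;
            this.vc = vc;
        }

        @Override
        public String toString() {
            return "(ts=" + ts + ", vc=" + vc + ")";
        }
    }

    /** Imitates max("vc"): only vc is updated, ts stays from the first record. */
    static List<Reading> rollingMax(List<Reading> input) {
        List<Reading> out = new ArrayList<>();
        Reading acc = null;
        for (Reading r : input) {
            acc = (acc == null) ? r : new Reading(acc.ts, Math.max(acc.vc, r.vc));
            out.add(acc); // one output per input record
        }
        return out;
    }

    /** Imitates maxBy("vc", first): the whole record with the largest vc is kept. */
    static List<Reading> rollingMaxBy(List<Reading> input, boolean first) {
        List<Reading> out = new ArrayList<>();
        Reading acc = null;
        for (Reading r : input) {
            // On a tie, first == false switches to the newer record
            if (acc == null || r.vc > acc.vc || (r.vc == acc.vc && !first)) {
                acc = r;
            }
            out.add(acc);
        }
        return out;
    }

    public static void main(String[] args) {
        List<Reading> sensor1 = Arrays.asList(
                new Reading(1607527992000L, 20),
                new Reading(1607527994000L, 50),
                new Reading(1607527996000L, 50));
        System.out.println("max:           " + rollingMax(sensor1));
        System.out.println("maxBy (true):  " + rollingMaxBy(sensor1, true));
        System.out.println("maxBy (false): " + rollingMaxBy(sensor1, false));
    }
}
```

In the printed lists, the ts of the last element shows which record the non-compared fields were taken from in each variant.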

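5.3.10 reduce

Purpose:
	An aggregation on a keyed stream that combines the current element with the result of the previous aggregation to produce a new value. The returned stream contains the result of every aggregation step, not just the final one. Why keep the intermediate values? Consider the nature of streaming data: there is no end, so there is no notion of a final result. Every intermediate aggregate is an output value.
	
Parameter:
	interface ReduceFunction<T>
	
Returns:
	KeyedStream => SingleOutputStreamOperator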
package com.atguigu.flink.transform;

import com.atguigu.flink.source.WaterSensor;
import org.apache.flink.api.common.functions.ReduceFunction;
import org.apache.flink.streaming.api.datastream.KeyedStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

import java.util.ArrayList;

public class Test08_Reduce {
    public static void main(String[] args) throws Exception {
        ArrayList<WaterSensor> waterSensors = new ArrayList<>();
        waterSensors.add(new WaterSensor("sensor_1", 1607527992000L, 20));
        waterSensors.add(new WaterSensor("sensor_1", 1607527994000L, 50));
        waterSensors.add(new WaterSensor("sensor_1", 1607527996000L, 50));
        waterSensors.add(new WaterSensor("sensor_2", 1607527993000L, 10));
        waterSensors.add(new WaterSensor("sensor_2", 1607527995000L, 30));

        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        KeyedStream<WaterSensor, String> kbStream = env
                .fromCollection(waterSensors)
                .keyBy(WaterSensor::getId);

/*        kbStream
                .reduce(new ReduceFunction<WaterSensor>() {
                    @Override
                    public WaterSensor reduce(WaterSensor value1, WaterSensor value2) throws Exception {
                        System.out.println("reduce function...");
                        return new WaterSensor(value1.getId(), value1.getTs(), value1.getVc() + value2.getVc());
                    }
                })
                .print("reduce...");*/

        kbStream
                .reduce((value1, value2) -> {
                    System.out.println("reduce function...");
                    return new WaterSensor(value1.getId(), value1.getTs(), value1.getVc() + value2.getVc());
                })
                .print();

        env.execute();

    }
}

Notes:

  1. When the first element of a group arrives, the reduce method is not invoked
  2. The input and output types must be the same
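
Both notes can be followed in a small plain-Java simulation of a rolling reduce. This is a sketch of the described behavior, not Flink code: RollingReduceDemo, Event and rollingReduce are hypothetical names, and a HashMap stands in for the per-key state that Flink manages.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BinaryOperator;

class RollingReduceDemo {
    /** Minimal keyed element: a key and an integer value. */
    static final class Event {
        final String key;
        final int value;

        Event(String key, int value) {
            this.key = key;
            this.value = value;
        }
    }

    /** Emits one aggregate per input element; the reducer is skipped for the first element of a key. */
    static List<Integer> rollingReduce(List<Event> events, BinaryOperator<Integer> reducer) {
        Map<String, Integer> state = new HashMap<>(); // last aggregate per key
        List<Integer> emitted = new ArrayList<>();
        for (Event e : events) {
            Integer previous = state.get(e.key);
            Integer current = (previous == null)
                    ? Integer.valueOf(e.value)           // first element: stored and emitted as is
                    : reducer.apply(previous, e.value);  // later elements: combined with the state
            state.put(e.key, current);
            emitted.add(current);
        }
        return emitted;
    }

    public static void main(String[] args) {
        List<Event> events = Arrays.asList(
                new Event("sensor_1", 20), new Event("sensor_1", 50), new Event("sensor_1", 50),
                new Event("sensor_2", 10), new Event("sensor_2", 30));
        final int[] calls = {0};
        List<Integer> out = rollingReduce(events, (a, b) -> {
            calls[0]++;
            return a + b;
        });
        System.out.println("emitted = " + out);
        System.out.println("reducer calls = " + calls[0]);
    }
}
```

With five input elements spread over two keys, the reducer runs three times, because the first element of each key bypasses it, yet five results are emitted.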

5.3.11 process

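Purpose: process is a relatively low-level operator in Flink. It can be called on many kinds of streams and gives access to more information from the stream than just the data itself
// Used on a stream before keyBy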
env
  .fromCollection(waterSensors)
  .process(new ProcessFunction<WaterSensor, Tuple2<String, Integer>>() {
      @Override
      public void processElement(WaterSensor value,
                                 Context ctx,
                                 Collector<Tuple2<String, Integer>> out) throws Exception {
          out.collect(new Tuple2<>(value.getId(), value.getVc()));
      }
  })
  .print();
// Used on a stream after keyBy
env
  .fromCollection(waterSensors)
  .keyBy(WaterSensor::getId)
  .process(new KeyedProcessFunction<String, WaterSensor, Tuple2<String, Integer>>() {
      @Override
      public void processElement(WaterSensor value, Context ctx, Collector<Tuple2<String, Integer>> out) throws Exception {
          out.collect(new Tuple2<>("key: " + ctx.getCurrentKey(), value.getVc()));
      }
  })
  .print();

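5.3.12 Operators that repartition a stream

keyBy: groups by key first, then selects the downstream partition using a double hash of the key

shuffle: distributes the elements of the stream randomly across partitions

rebalance: distributes the elements of the stream evenly across all partitions (round-robin); a performance optimization when dealing with skewed data

rescale: like rebalance, distributes data evenly in round-robin fashion, but is more efficient because rescale usually does not need to go over the network and runs through local "pipelines"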
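
The contrast between key-based and round-robin partitioning can be sketched in plain Java. The code below is an illustration under stated assumptions, not Flink's implementation: KeyPartitionSketch and its methods are hypothetical names, mix() is a generic integer scrambler standing in for the second hash that keyBy applies on top of hashCode(), and the values 128 (maximum parallelism) and 2 (parallelism) are chosen only for the demo.

```java
class KeyPartitionSketch {
    /** Stand-in for the second hash applied on top of hashCode(); not Flink's own function. */
    static int mix(int h) {
        h ^= h >>> 16;
        h *= 0x85ebca6b;
        h ^= h >>> 13;
        h *= 0xc2b2ae35;
        h ^= h >>> 16;
        return h;
    }

    /** First step: map the key to one of maxParallelism key groups. */
    static int keyGroup(Object key, int maxParallelism) {
        return Math.floorMod(mix(key.hashCode()), maxParallelism);
    }

    /** Second step: map the key group to one of the parallel subtasks. */
    static int subtask(Object key, int maxParallelism, int parallelism) {
        return keyGroup(key, maxParallelism) * parallelism / maxParallelism;
    }

    /** Round-robin selection as used conceptually by rebalance: position decides, not content. */
    static int roundRobin(int elementIndex, int parallelism) {
        return elementIndex % parallelism;
    }

    public static void main(String[] args) {
        int maxParallelism = 128; // assumed value for the demo
        int parallelism = 2;      // assumed value for the demo
        String[] keys = {"sensor_1", "sensor_2", "sensor_3", "sensor_1", "sensor_2"};
        for (int i = 0; i < keys.length; i++) {
            System.out.println(keys[i]
                    + " -> keyBy subtask " + subtask(keys[i], maxParallelism, parallelism)
                    + ", round-robin subtask " + roundRobin(i, parallelism));
        }
    }
}
```

The same key always lands on the same keyBy subtask, and with three distinct keys on two subtasks at least two keys must share one, which matches the earlier statement that one partition can hold several groups. Round-robin ignores the key, so repeated keys can land on different subtasks.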

5.4 Sink

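Sink literally means to sink down. In Flink a Sink can be understood as storing data, or more broadly as the output operation that sends processed data to a specified storage system.

The print method we have been using all along is in fact a kind of Sink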

public DataStreamSink<T> print(String sinkIdentifier) {
   PrintSinkFunction<T> printFunction = new PrintSinkFunction<>(sinkIdentifier, false);
   return addSink(printFunction).name("Print to Std. Out");
}

Flink ships with some built-in Sinks; any other Sink has to be implemented by the user.

5.4.1 KafkaSink

<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-connector-kafka_2.12</artifactId>
    <version>1.13.0</version>
</dependency>
<dependency>
    <groupId>com.alibaba</groupId>
    <artifactId>fastjson</artifactId>
    <version>1.2.75</version>
</dependency>
package com.atguigu.flink.sink;

import com.alibaba.fastjson.JSON;
import com.atguigu.flink.source.WaterSensor;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer;

import java.util.ArrayList;

public class Test01_KafkaSink {
    public static void main(String[] args) throws Exception {
        ArrayList<WaterSensor> waterSensors = new ArrayList<>();
        waterSensors.add(new WaterSensor("sensor_1", 1607527992000L, 20));
        waterSensors.add(new WaterSensor("sensor_1", 1607527994000L, 50));
        waterSensors.add(new WaterSensor("sensor_1", 1607527996000L, 50));
        waterSensors.add(new WaterSensor("sensor_2", 1607527993000L, 10));
        waterSensors.add(new WaterSensor("sensor_2", 1607527995000L, 30));

        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        env
                .fromCollection(waterSensors)
                .map(JSON::toJSONString)
                .addSink(new FlinkKafkaProducer<String>("hadoop102:9092","topic_sensor",new SimpleStringSchema()));
        env.execute();
    }
}
bin/kafka-console-consumer.sh --bootstrap-server hadoop102:9092 --topic topic_sensor

5.4.2 RedisSink

        <dependency>
            <groupId>org.apache.bahir</groupId>
            <artifactId>flink-connector-redis_2.11</artifactId>
            <version>1.0</version>
            <exclusions>
                <exclusion>
                    <artifactId>flink-streaming-java_2.11</artifactId>
                    <groupId>org.apache.flink</groupId>
                </exclusion>
                <exclusion>
                    <artifactId>flink-runtime_2.11</artifactId>
                    <groupId>org.apache.flink</groupId>
                </exclusion>
                <exclusion>
                    <artifactId>flink-core</artifactId>
                    <groupId>org.apache.flink</groupId>
                </exclusion>
                <exclusion>
                    <artifactId>flink-java</artifactId>
                    <groupId>org.apache.flink</groupId>
                </exclusion>
            </exclusions>
        </dependency>
package com.atguigu.flink.sink;

import com.alibaba.fastjson.JSON;
import com.atguigu.flink.source.WaterSensor;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.redis.RedisSink;
import org.apache.flink.streaming.connectors.redis.common.config.FlinkJedisPoolConfig;
import org.apache.flink.streaming.connectors.redis.common.mapper.RedisCommand;
import org.apache.flink.streaming.connectors.redis.common.mapper.RedisCommandDescription;
import org.apache.flink.streaming.connectors.redis.common.mapper.RedisMapper;

import java.util.ArrayList;

public class Test02_RedisSink {
    public static void main(String[] args) throws Exception {
        ArrayList<WaterSensor> waterSensors = new ArrayList<>();
        waterSensors.add(new WaterSensor("sensor_1", 1607527992000L, 20));
        waterSensors.add(new WaterSensor("sensor_1", 1607527994000L, 50));
        waterSensors.add(new WaterSensor("sensor_1", 1607527996000L, 50));
        waterSensors.add(new WaterSensor("sensor_2", 1607527993000L, 10));
        waterSensors.add(new WaterSensor("sensor_2", 1607527995000L, 30));

        FlinkJedisPoolConfig redisConfig = new FlinkJedisPoolConfig.Builder()
                .setHost("hadoop102")
                .setPort(6379)
                .setMaxTotal(100)
                .setTimeout(1000 * 10)
                .build();

        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        env
                .fromCollection(waterSensors)
                .addSink(new RedisSink<>(redisConfig, new RedisMapper<WaterSensor>() {
                    @Override
                    public RedisCommandDescription getCommandDescription() {
                        // The Redis command to use: data is stored as a Hash, and the second argument is the outer key
                        return new RedisCommandDescription(RedisCommand.HSET, "sensor");
                    }

                    @Override
                    public String getKeyFromData(WaterSensor data) {
                        // Extract the key from the data: the field of the Hash
                        return data.getId();
                    }

                    @Override
                    public String getValueFromData(WaterSensor data) {
                        // Extract the value from the data: the value of the Hash
                        return JSON.toJSONString(data);
                    }
                }));

        env.execute();
    }
}

redis-cli --raw
    hgetall sensor

5 records were sent, but Redis holds only 2 entries. The reason is that the hash fields repeat, so later records overwrite earlier ones.
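
The overwrite behavior can be reproduced without Redis. The sketch below is plain Java: HashOverwriteDemo and hsetAll are hypothetical names, and a LinkedHashMap stands in for the single Redis hash named sensor, with the sensor id as field. The values are shortened placeholders, not the actual JSON the job writes.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class HashOverwriteDemo {
    /** Simulates repeated HSET calls on one hash: each pair is {field, value}. */
    static Map<String, String> hsetAll(String[][] fieldValuePairs) {
        Map<String, String> hash = new LinkedHashMap<>();
        for (String[] pair : fieldValuePairs) {
            hash.put(pair[0], pair[1]); // a repeated field replaces the earlier value
        }
        return hash;
    }

    public static void main(String[] args) {
        String[][] records = {
                {"sensor_1", "vc=20"},
                {"sensor_1", "vc=50 (ts 1607527994000)"},
                {"sensor_1", "vc=50 (ts 1607527996000)"},
                {"sensor_2", "vc=10"},
                {"sensor_2", "vc=30"}};
        Map<String, String> hash = hsetAll(records);
        System.out.println("fields stored: " + hash.size());
        System.out.println(hash);
    }
}
```

To keep all five records, the field would have to be unique per record, for example a combination of id and timestamp. That is a design choice left to the job and not part of the original example.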

5.4.3 ElasticsearchSink

<!-- https://mvnrepository.com/artifact/org.apache.flink/flink-connector-elasticsearch6 -->
<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-connector-elasticsearch6_2.12</artifactId>
    <version>1.13.0</version>
</dependency>
package com.atguigu.flink.sink;

import com.alibaba.fastjson.JSON;
import com.atguigu.flink.source.WaterSensor;
import org.apache.flink.api.common.functions.RuntimeContext;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.elasticsearch.ElasticsearchSinkFunction;
import org.apache.flink.streaming.connectors.elasticsearch.RequestIndexer;
import org.apache.flink.streaming.connectors.elasticsearch6.ElasticsearchSink;
import org.apache.http.HttpHost;
import org.elasticsearch.action.index.IndexRequest;
import org.elasticsearch.client.Requests;
import org.elasticsearch.common.xcontent.XContentType;

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class Test03_ElasticsearchSink {
    public static void main(String[] args) throws Exception {
        ArrayList<WaterSensor> waterSensors = new ArrayList<>();
        waterSensors.add(new WaterSensor("sensor_1", 1607527992000L, 20));
        waterSensors.add(new WaterSensor("sensor_1", 1607527994000L, 50));
        waterSensors.add(new WaterSensor("sensor_1", 1607527996000L, 50));
        waterSensors.add(new WaterSensor("sensor_2", 1607527993000L, 10));
        waterSensors.add(new WaterSensor("sensor_2", 1607527995000L, 30));

        List<HttpHost> esHosts = Arrays.asList(
                new HttpHost("hadoop102", 9200),
                new HttpHost("hadoop103", 9200),
                new HttpHost("hadoop104", 9200)
        );

        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        env
                .fromCollection(waterSensors)
                .addSink(new ElasticsearchSink.Builder<WaterSensor>(esHosts, new ElasticsearchSinkFunction<WaterSensor>() {

                    @Override
                    public void process(WaterSensor element, RuntimeContext ctx, RequestIndexer indexer) {
                        // 1. Build the ES index request
                        IndexRequest request = Requests
                                .indexRequest("sensor")
                                .type("_doc")
                                .id(element.getId())
                                .source(JSON.toJSONString(element), XContentType.JSON);
                        // 2. Hand the request to the indexer, which writes it to ES
                        indexer.add(request);
                    }
                }).build());


        env.execute();
    }
}

5.4.4 Custom Sink

MySQLSink

create database test;
use test;
CREATE TABLE `sensor` (
  `id` varchar(20) NOT NULL,
  `ts` bigint(20) NOT NULL,
  `vc` int(11) NOT NULL,
  PRIMARY KEY (`id`,`ts`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
<dependency>
    <groupId>mysql</groupId>
    <artifactId>mysql-connector-java</artifactId>
    <version>5.1.49</version>
</dependency>
package com.atguigu.flink.sink;

import com.atguigu.flink.source.WaterSensor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.sink.RichSinkFunction;

import com.mysql.jdbc.Driver;

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.ArrayList;

public class Test04_MySink {
    public static void main(String[] args) throws Exception {
        ArrayList<WaterSensor> waterSensors = new ArrayList<>();
        waterSensors.add(new WaterSensor("sensor_1", 1607527992000L, 20));
        waterSensors.add(new WaterSensor("sensor_1", 1607527994000L, 50));
        waterSensors.add(new WaterSensor("sensor_1", 1607527996000L, 50));
        waterSensors.add(new WaterSensor("sensor_2", 1607527993000L, 10));
        waterSensors.add(new WaterSensor("sensor_2", 1607527995000L, 30));

        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment().setParallelism(1);
        env.fromCollection(waterSensors)
                .addSink(new RichSinkFunction<WaterSensor>() {

                    private PreparedStatement ps;
                    private Connection conn;

                    @Override
                    public void open(Configuration parameters) throws Exception {
                        conn = DriverManager.getConnection("jdbc:mysql://hadoop102:3306/test?useSSL=false", "root",
                                "123456");
                        ps = conn.prepareStatement("insert into sensor values(?, ?, ?)");
                    }

                    @Override
                    public void close() throws Exception {
                        // open() may have failed part-way, so guard against nulls
                        if (ps != null) ps.close();
                        if (conn != null) conn.close();
                    }

                    @Override
                    public void invoke(WaterSensor value, Context context) throws Exception {
                        ps.setString(1, value.getId());
                        ps.setLong(2, value.getTs());
                        ps.setInt(3, value.getVc());
                        ps.execute();
                    }
                });


        env.execute();

    }
}

5.4.5 JDBCSink

<dependency>
  <groupId>org.apache.flink</groupId>
  <artifactId>flink-connector-jdbc_2.12</artifactId>
  <version>1.13.0</version>
</dependency>
package com.atguigu.flink.sink;

import com.atguigu.flink.source.WaterSensor;
import com.mysql.jdbc.Driver;
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.connector.jdbc.JdbcConnectionOptions;
import org.apache.flink.connector.jdbc.JdbcExecutionOptions;
import org.apache.flink.connector.jdbc.JdbcSink;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class Test05_JDBCSink {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        env.setParallelism(1);

        DataStreamSource<String> streamSource = env.socketTextStream("localhost", 9999);

        SingleOutputStreamOperator<WaterSensor> result = streamSource.map(new MapFunction<String, WaterSensor>() {

            @Override
            public WaterSensor map(String value) throws Exception {
                String[] split = value.split(",");
                WaterSensor waterSensor = new WaterSensor(split[0], Long.parseLong(split[1]), Integer.parseInt(split[2]));

                return waterSensor;
            }
        });

        result.addSink(JdbcSink.sink(
                "insert into sensor values (?, ?, ?)",
                (ps, t) -> {
                    ps.setString(1, t.getId());
                    ps.setLong(2, t.getTs());
                    ps.setInt(3, t.getVc());
                },
                new JdbcExecutionOptions.Builder()
                        // Write one record at a time, similar to the ES sink
                        .withBatchSize(1)
                        .build(),
                new JdbcConnectionOptions.JdbcConnectionOptionsBuilder()
                        .withUrl("jdbc:mysql://hadoop102:3306/test?useSSL=false")
                        .withUsername("root")
                        .withPassword("123456")
                        .withDriverName(Driver.class.getName())
                        .build()
        ));

        env.execute();
    }
}
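
The `map` step above assumes every socket line has exactly three comma-separated fields with numeric `ts` and `vc`; a malformed line would throw and fail the job. Below is a minimal sketch of a more defensive parser in plain Java. The `SensorLineParser` class and its `Reading` holder are hypothetical stand-ins for `WaterSensor`, not part of the original project.

```java
import java.util.Optional;

public class SensorLineParser {

    /** Plain holder mirroring the three WaterSensor fields used in this section. */
    public static final class Reading {
        public final String id;
        public final long ts;
        public final int vc;

        Reading(String id, long ts, int vc) {
            this.id = id;
            this.ts = ts;
            this.vc = vc;
        }
    }

    /** Parses "id,ts,vc"; returns Optional.empty() for malformed input instead of throwing. */
    public static Optional<Reading> parse(String line) {
        if (line == null) {
            return Optional.empty();
        }
        String[] parts = line.split(",");
        if (parts.length != 3) {
            return Optional.empty();
        }
        try {
            return Optional.of(new Reading(
                    parts[0].trim(),
                    Long.parseLong(parts[1].trim()),
                    Integer.parseInt(parts[2].trim())));
        } catch (NumberFormatException e) {
            return Optional.empty();
        }
    }

    public static void main(String[] args) {
        System.out.println(parse("sensor_1,1607527992000,20").isPresent()); // prints true
        System.out.println(parse("sensor_1,abc,20").isPresent());           // prints false
    }
}
```

In a Flink job, a helper like this could be called from a `flatMap` so that bad lines are skipped rather than crashing the pipeline.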

5.5 Execution Mode

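Flink 1.12.0 added a new feature to the DataStream API: depending on your use case and the characteristics of the job, you can choose between different runtime execution modes.

The traditional execution mode of the DataStream API is called STREAMING. It is generally used for unbounded streams that require continuous, online processing.

Version 1.12.0 added a BATCH execution mode, which executes jobs in a way that resembles the MapReduce framework. It is generally used for bounded data, and its purpose is to move closer to unified stream and batch processing.

STREAMING is the default execution mode.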

5.5.1 Choosing an Execution Mode

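BATCH execution mode can be used only with bounded data, whereas STREAMING execution mode can be used with both bounded and unbounded data.

A general rule: when the data is bounded, use BATCH execution mode because it is more efficient. When the data is unbounded, you must use STREAMING execution mode, because it is the only mode that can process a continuous stream.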
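
The rule above can be written down as a small decision function. This is only an illustrative sketch: the `Mode` enum is a local stand-in rather than Flink's `RuntimeExecutionMode`, and the `choose` helper is hypothetical. It picks BATCH only when every source is bounded.

```java
public class ExecutionModeChooser {

    /** Local stand-in for the two concrete modes discussed in this section. */
    public enum Mode { STREAMING, BATCH }

    /** Returns BATCH only if every source is bounded; any unbounded source forces STREAMING. */
    public static Mode choose(boolean... sourceIsBounded) {
        for (boolean bounded : sourceIsBounded) {
            if (!bounded) {
                return Mode.STREAMING;
            }
        }
        return Mode.BATCH;
    }

    public static void main(String[] args) {
        System.out.println(choose(true, true));  // prints BATCH
        System.out.println(choose(true, false)); // prints STREAMING
    }
}
```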

5.5.2 Configuring the BATCH Execution Mode

There are three execution modes to choose from:

  1. STREAMING (default)
  2. BATCH
  3. AUTOMATIC

// 1. Configure via the command line
bin/flink run -Dexecution.runtime-mode=BATCH ...

// 2. Configure in code
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.setRuntimeMode(RuntimeExecutionMode.BATCH);

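Recommendation: do not set the mode in code. Set it on the command line instead, because that is more flexible: the same application can then be used for both bounded and unbounded data. Keep in mind that unbounded data cannot run in BATCH mode.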

5.5.3 Differences Between STREAMING and BATCH on Bounded Data

In STREAMING mode, a result is emitted each time a record arrives.

In BATCH mode, the results are emitted once, after all the data has been processed.
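
The difference can be mimicked in plain Java, without Flink. The sketch below is only an analogy under simplifying assumptions (single thread, in-memory list as the bounded input); the class and method names are hypothetical, and the ordering of real Flink output is not guaranteed to match.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class WordCountModesDemo {

    /** STREAMING-like: emit the updated count after every incoming word. */
    public static List<String> streamingStyle(List<String> words) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        List<String> out = new ArrayList<>();
        for (String w : words) {
            int c = counts.merge(w, 1, Integer::sum);
            out.add("(" + w + "," + c + ")");
        }
        return out;
    }

    /** BATCH-like: consume all input first, then emit one final count per word. */
    public static List<String> batchStyle(List<String> words) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String w : words) {
            counts.merge(w, 1, Integer::sum);
        }
        List<String> out = new ArrayList<>();
        counts.forEach((w, c) -> out.add("(" + w + "," + c + ")"));
        return out;
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("hello", "flink", "hello");
        System.out.println(streamingStyle(words)); // prints [(hello,1), (flink,1), (hello,2)]
        System.out.println(batchStyle(words));     // prints [(hello,2), (flink,1)]
    }
}
```

The streaming-style output contains intermediate counts for repeated words, while the batch-style output contains only the final count per word.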

The following compares the results of a WordCount program that reads a file under the different execution modes:

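// Streaming mode is the default, so this call is optional
env.setRuntimeMode(RuntimeExecutionMode.STREAMING);

[Figure: WordCount output in STREAMING mode]

// Batch mode
env.setRuntimeMode(RuntimeExecutionMode.BATCH);

[Figure: WordCount output in BATCH mode]

// Automatic mode
env.setRuntimeMode(RuntimeExecutionMode.AUTOMATIC);

[Figure: WordCount output in AUTOMATIC mode]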
