Basics (Part 2): Using the Flink DataStream API

Latest recommended article published on 2023-12-07 16:41:48

桥~

Article link: https://blog.csdn.net/firefish009/article/details/110353469

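Preface

A stream processing system usually has to handle unbounded data streams, so it adopts a data-driven processing model. Put simply, the processing operators are defined ahead of time and run as soon as data arrives; the computation logic itself is expressed as a DAG (directed acyclic graph).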

Word Count demo

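import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.util.Collector;

public class WordCount {
    public static void main(String[] args) throws Exception {
        // 1. Create the execution environment
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // 2. Configure the source: read text from a file
        String inputPath = "D:\\flink\\src\\main\\resources\\hello.txt";
        // Assumed output location; the original listing used outputPath without defining it
        String outputPath = "D:\\flink\\src\\main\\resources\\wordcount-output.txt";
        DataStreamSource<String> text = env.readTextFile(inputPath);
        // 3. Transformations: split lines into (word, 1) pairs, key by word, keep a running sum
        DataStream<Tuple2<String, Integer>> counts = text
                .flatMap(new LineSplitter())
                .keyBy(value -> value.f0) // key selector instead of the deprecated keyBy(0)
                .sum(1);
        // 4. Write the results
        counts.writeAsText(outputPath);
        // 5. Submit the job for execution
        env.execute("streaming word count");
    }

    public static final class LineSplitter implements FlatMapFunction<String, Tuple2<String, Integer>> {
        @Override
        public void flatMap(String value, Collector<Tuple2<String, Integer>> out) {
            // Lowercase the line and split it on runs of non-word characters
            String[] tokens = value.toLowerCase().split("\\W+");
            for (String token : tokens) {
                if (token.length() > 0) {
                    out.collect(new Tuple2<>(token, 1));
                }
            }
        }
    }
}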

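Step 1: create the stream execution environment. getExecutionEnvironment() returns the environment that matches where the program runs. If no parallelism is set, a cluster uses the value of parallelism.default in flink-conf.yaml (1 by default), while a local run uses the number of cores of the local machine.
Step 2: read the file data.
Step 3: use transformation operators to split the text into words and count them.
Step 4: write the output.
Step 5: execute the program. The API calls in steps 2 to 4 only build the DAG of the computation logic. Only when execute() is called explicitly on env is the JobGraph submitted to the cluster, after which data is ingested and the actual logic runs.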
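
The deferred execution described in step 5 can be pictured with a small plain-Java model. This is a sketch, not Flink code: MiniPipeline and its map, invocations, and execute methods are hypothetical names chosen for illustration. It only shows that declaring operators records them in a plan, and that nothing runs until execute is called.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

// Hypothetical stand-in for an execution environment; not part of Flink.
class MiniPipeline {
    private final List<Function<String, String>> operators = new ArrayList<>();
    private int invocations = 0; // counts how often any operator actually ran

    // Declaring an operator only appends it to the plan.
    MiniPipeline map(Function<String, String> fn) {
        operators.add(value -> {
            invocations++;
            return fn.apply(value);
        });
        return this;
    }

    int invocations() {
        return invocations;
    }

    // Only here does data flow through the recorded operators, in order.
    List<String> execute(List<String> input) {
        List<String> output = new ArrayList<>();
        for (String element : input) {
            String current = element;
            for (Function<String, String> operator : operators) {
                current = operator.apply(current);
            }
            output.add(current);
        }
        return output;
    }

    public static void main(String[] args) {
        MiniPipeline pipeline = new MiniPipeline()
                .map(String::trim)
                .map(String::toUpperCase);
        System.out.println("operator calls before execute: " + pipeline.invocations());
        System.out.println(pipeline.execute(Arrays.asList(" hello ", "flink")));
        System.out.println("operator calls after execute: " + pipeline.invocations());
    }
}
```

In the real API the plan is a graph rather than a list and execution happens on a cluster, but the split between building and running is the same idea.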

Transformation operators

1. Basic transformation operators

map

Lambda expressions are the recommended style:

```java
DataStream<String> dataStream = stream.map(r -> r.city);
```

filter

Keeps the input events for which the condition evaluates to true.

```java
DataStreamSource<WeatherRecord> stream = env.fromElements(
        new WeatherRecord("shenzhen", 15001210000L, 25.20),
        new WeatherRecord("wuhan", 15001210000L, 10.50),
        new WeatherRecord("changchun", 15001210000L, -0.50)
);
stream.filter(record -> record.temperature > 20)
      .print();
```

Output

WeatherRecord{city='shenzhen', timestamp=15001210000,temperature=25.2}
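
The article never shows the WeatherRecord class itself. The sketch below is one shape that would fit the examples and reproduce the printed line above; the field names come from the lambdas and the output, while the holder class WeatherRecordDemo and everything else are assumptions. It is written as a public static nested class with a public no-argument constructor and public fields so that Flink can treat it as a POJO.

```java
// Assumed shape of WeatherRecord; the original article does not include it.
class WeatherRecordDemo {

    // Public static nested class, public no-arg constructor, public fields:
    // the properties Flink looks for when treating a type as a POJO.
    public static class WeatherRecord {
        public String city;
        public long timestamp;
        public double temperature;

        public WeatherRecord() {
        }

        public WeatherRecord(String city, long timestamp, double temperature) {
            this.city = city;
            this.timestamp = timestamp;
            this.temperature = temperature;
        }

        @Override
        public String toString() {
            // Format chosen to match the output line shown in the article
            return "WeatherRecord{city='" + city + "', timestamp=" + timestamp
                    + ",temperature=" + temperature + "}";
        }
    }

    public static void main(String[] args) {
        System.out.println(new WeatherRecord("shenzhen", 15001210000L, 25.20));
    }
}
```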

flatMap

flatMap can do the work of both map and filter: for each input event it can emit zero, one, or more elements.

import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.util.Collector;

public class FlatMapExample {
 public static void main(String[] args) throws Exception {
     StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
     env.setParallelism(1);
     DataStreamSource<String> stream = env.fromElements("唐人街探案","你好，李焕英","暴裂无声");
     stream.flatMap(new MyFlatMap())
             .print();
     env.execute();
 }
 public static class MyFlatMap implements FlatMapFunction<String, String> {
     @Override
     public void flatMap(String s, Collector<String> collector) throws Exception {
         if (s.equals("唐人街探案")) {
             collector.collect(s);
         } else if (s.equals("你好，李焕英")) {
             collector.collect(s);
             collector.collect(s);
         }
     }
 }
}

Output

唐人街探案
你好，李焕英
你好，李焕英

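2. Keyed stream transformation operators
A keyed stream is the KeyedStream obtained by grouping a DataStream with keyBy(). Events with the same key are always routed to the same partition (note that a single partition can still hold several different keys).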
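
The routing rule can be illustrated with a simplified plain-Java model. It assumes a plain hash-modulo assignment, which is not how Flink computes partitions internally (Flink hashes keys into key groups first); KeyRoutingModel, partitionOf, and assign are hypothetical names. The point is only that one key always lands in one partition while one partition may receive several keys.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Simplified model of key-based routing; not Flink code.
class KeyRoutingModel {

    // Assumption: hash modulo parallelism. The same key always yields the same index.
    static int partitionOf(String key, int parallelism) {
        return Math.floorMod(key.hashCode(), parallelism);
    }

    // Groups the distinct keys by the partition they are routed to.
    static Map<Integer, List<String>> assign(List<String> keys, int parallelism) {
        Map<Integer, List<String>> byPartition = new TreeMap<>();
        for (String key : keys) {
            List<String> bucket = byPartition.computeIfAbsent(
                    partitionOf(key, parallelism), p -> new ArrayList<>());
            if (!bucket.contains(key)) {
                bucket.add(key);
            }
        }
        return byPartition;
    }

    public static void main(String[] args) {
        // Three distinct keys over two partitions: at least one partition holds two keys.
        System.out.println(assign(Arrays.asList("shenzhen", "wuhan", "changchun", "shenzhen"), 2));
    }
}
```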

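A KeyedStream is usually paired with rolling aggregation operators. Common ones:
sum(): keeps a rolling sum of the specified field over the input stream.
min(): keeps a rolling minimum of the specified field over the input stream.
max(): keeps a rolling maximum of the specified field over the input stream.
minBy(): finds the minimum of the specified field and returns the event that holds the smallest value seen so far.
maxBy(): finds the maximum of the specified field and returns the event that holds the largest value seen so far.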
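
The difference between min() and minBy() is easy to miss, so here is a plain-Java model of the two on a single key. It is a sketch rather than Flink code: Reading, rollingMin, and rollingMinBy are hypothetical names, and the model assumes that min() keeps the other fields of the earlier record and only overwrites the aggregated field, which is the behaviour usually described for it.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Plain-Java model of min() versus minBy() for one key; not Flink code.
class MinVersusMinBy {

    static final class Reading {
        final String sensor;       // a non-aggregated field
        final double temperature;  // the aggregated field

        Reading(String sensor, double temperature) {
            this.sensor = sensor;
            this.temperature = temperature;
        }

        @Override
        public String toString() {
            return sensor + "=" + temperature;
        }
    }

    // min(): keep the earlier record and only overwrite the aggregated field.
    static List<Reading> rollingMin(List<Reading> input) {
        List<Reading> emitted = new ArrayList<>();
        Reading current = null;
        for (Reading r : input) {
            if (current == null) {
                current = r;
            } else if (r.temperature < current.temperature) {
                current = new Reading(current.sensor, r.temperature);
            }
            emitted.add(current);
        }
        return emitted;
    }

    // minBy(): keep the whole record that carries the smallest value.
    static List<Reading> rollingMinBy(List<Reading> input) {
        List<Reading> emitted = new ArrayList<>();
        Reading current = null;
        for (Reading r : input) {
            if (current == null || r.temperature < current.temperature) {
                current = r;
            }
            emitted.add(current);
        }
        return emitted;
    }

    public static void main(String[] args) {
        List<Reading> input = Arrays.asList(
                new Reading("a", 20.0), new Reading("b", 15.0), new Reading("c", 18.0));
        System.out.println(rollingMin(input));   // [a=20.0, a=15.0, a=15.0]
        System.out.println(rollingMinBy(input)); // [a=20.0, b=15.0, b=15.0]
    }
}
```

If the other fields of the result matter, minBy() and maxBy() are the safer choice.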

import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class KeyByExample {
 public static void main(String[] args) throws Exception {
     StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
     env.setParallelism(1);
     DataStreamSource<WeatherRecord> stream = env.fromElements(
             new WeatherRecord("shenzhen",15001210000L,25.20),
             new WeatherRecord("wuhan",15001210000L,10.50),
             new WeatherRecord("shenzhen",15001210000L,32.50)
     );
     stream.map(new MapFunction<WeatherRecord, Tuple2<String, Integer>>() {
         @Override
         public Tuple2<String, Integer> map(WeatherRecord weatherRecord) throws Exception {
             return new Tuple2<String, Integer>(weatherRecord.city, 1);
         }
     }).keyBy(t -> t.f0) // key by city; replaces the deprecated index-based keyBy(0)
       .sum(1)
       .print();
     env.execute();
 }
}

Output

(shenzhen,1)
(wuhan,1)
(shenzhen,2)

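Note that the aggregate is kept per key: a result is emitted for every input event, and when the next event with the same key arrives, the previous result for that key is updated.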
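
The same behaviour can be reproduced with a HashMap standing in for the per-key state. This is a plain-Java sketch of what keyBy followed by sum(1) does to a stream of (city, 1) pairs, not the Flink implementation; RollingSum and rollingCount are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Plain-Java model of a rolling per-key sum; not Flink code.
class RollingSum {

    // Emits one (key,runningCount) result per input event.
    static List<String> rollingCount(List<String> keys) {
        Map<String, Integer> state = new HashMap<>(); // one running total per key
        List<String> emitted = new ArrayList<>();
        for (String key : keys) {
            int updated = state.merge(key, 1, Integer::sum);
            emitted.add("(" + key + "," + updated + ")");
        }
        return emitted;
    }

    public static void main(String[] args) {
        // prints [(shenzhen,1), (wuhan,1), (shenzhen,2)]
        System.out.println(rollingCount(Arrays.asList("shenzhen", "wuhan", "shenzhen")));
    }
}
```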

reduce operator
reduce is a generalized aggregation: it combines each new input event with the current aggregate for that key.
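
As a sketch of that idea, the plain-Java model below applies a user-supplied combining function to the stored aggregate and each new event, and emits the updated aggregate every time. It is not Flink code; RollingReduce, Reading, and rollingReduce are hypothetical names. The usage keeps the hottest reading per city.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BinaryOperator;
import java.util.function.Function;

// Plain-Java model of a rolling reduce on a keyed stream; not Flink code.
class RollingReduce {

    static final class Reading {
        final String city;
        final double temperature;

        Reading(String city, double temperature) {
            this.city = city;
            this.temperature = temperature;
        }

        @Override
        public String toString() {
            return city + "=" + temperature;
        }
    }

    // For each event: combine it with the aggregate stored for its key,
    // store the result, and emit it.
    static <K, V> List<V> rollingReduce(List<V> input, Function<V, K> keyOf, BinaryOperator<V> reducer) {
        Map<K, V> state = new HashMap<>();
        List<V> emitted = new ArrayList<>();
        for (V event : input) {
            // merge stores the event if the key is new, else reducer(aggregate, event)
            emitted.add(state.merge(keyOf.apply(event), event, reducer));
        }
        return emitted;
    }

    public static void main(String[] args) {
        List<Reading> input = Arrays.asList(
                new Reading("shenzhen", 25.2), new Reading("wuhan", 10.5),
                new Reading("shenzhen", 32.5), new Reading("shenzhen", 30.0));
        // prints [shenzhen=25.2, wuhan=10.5, shenzhen=32.5, shenzhen=32.5]
        System.out.println(rollingReduce(input, r -> r.city,
                (a, b) -> a.temperature >= b.temperature ? a : b));
    }
}
```

sum(), min(), and max() are special cases of this pattern with a fixed combining function.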

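3. Multi-stream transformation operators
UNION
Merges two or more input streams of the same type into one output stream. Events are merged in FIFO order as they arrive, with no ordering guarantee across the input streams, and duplicates are not removed.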
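
A small plain-Java model makes the two properties concrete: events leave in arrival order, and equal values from different inputs are both kept. This is only a sketch, not Flink code; UnionModel, Event, and the arrival field are hypothetical, and each input list is assumed to be ordered by arrival time already.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Plain-Java model of union semantics; not Flink code.
class UnionModel {

    static final class Event {
        final long arrival; // hypothetical arrival time at the union operator
        final String value;

        Event(long arrival, String value) {
            this.arrival = arrival;
            this.value = value;
        }
    }

    // Merges two inputs that are each ordered by arrival time.
    // Per-input order is preserved and nothing is de-duplicated.
    static List<String> union(List<Event> first, List<Event> second) {
        List<String> out = new ArrayList<>();
        int i = 0;
        int j = 0;
        while (i < first.size() || j < second.size()) {
            boolean takeFirst = j >= second.size()
                    || (i < first.size() && first.get(i).arrival <= second.get(j).arrival);
            out.add(takeFirst ? first.get(i++).value : second.get(j++).value);
        }
        return out;
    }

    public static void main(String[] args) {
        List<Event> a = Arrays.asList(new Event(1, "x"), new Event(4, "y"));
        List<Event> b = Arrays.asList(new Event(2, "x"), new Event(3, "z"));
        System.out.println(union(a, b)); // prints [x, x, z, y]
    }
}
```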

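CONNECT
Connects two streams whose element types may differ. This comes up often in practice: in an IoT forest-fire-prevention scenario, for example, readings from temperature sensors and smoke sensors are combined to decide whether a threshold has been exceeded, and an alert is sent if so.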

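import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.co.CoFlatMapFunction;
import org.apache.flink.util.Collector;

public class AlertForestFireStream {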
 public static void main(String[] args) throws Exception {
     StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
     DataStream<SensorReading> tempReadings = env.addSource(new SensorSource());
     DataStream<SmokeLevel> smokeReadings = env.addSource(new SmokeLevelSource()).setParallelism(1);
     tempReadings.keyBy(r -> r.id)
             // broadcast() replicates the stream to all downstream parallel tasks
             .connect(smokeReadings.broadcast())
             .flatMap(new AlertFlatMap())
             .print();
     env.execute();
 }

 public static class AlertFlatMap implements CoFlatMapFunction<SensorReading, SmokeLevel, Alert> {
     private SmokeLevel smokeLevel = SmokeLevel.LOW;
     @Override
     public void flatMap1(SensorReading sensorReading, Collector<Alert> collector) throws Exception {
         if (this.smokeLevel == SmokeLevel.HIGH && sensorReading.temperature > 30) {
             collector.collect(new Alert("Forest fire alert! " + sensorReading, sensorReading.timestamp));
         }
     }
     @Override
     public void flatMap2(SmokeLevel smokeLevel, Collector<Alert> collector) throws Exception {
         this.smokeLevel = smokeLevel;
     }
 }
}
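
The listing above depends on types the article does not show (SensorReading, SmokeLevel, Alert, and the two sources), so it cannot be run as is. The decision logic inside AlertFlatMap can still be checked in isolation with a plain-Java model. In this sketch AlertLogicModel, onTemperature, and onSmoke are hypothetical names that mirror flatMap1 and flatMap2, and the alert is reduced to a string.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java model of the two-input alert logic; not Flink code.
class AlertLogicModel {

    enum SmokeLevel { LOW, HIGH }

    // Latest smoke level seen on the second input; starts at LOW as in the article.
    private SmokeLevel smokeLevel = SmokeLevel.LOW;

    // Mirrors flatMap1: emits an alert only when smoke is HIGH and temperature > 30.
    List<String> onTemperature(String sensorId, double temperature) {
        List<String> alerts = new ArrayList<>();
        if (smokeLevel == SmokeLevel.HIGH && temperature > 30) {
            alerts.add("Forest fire alert! " + sensorId + " at " + temperature);
        }
        return alerts;
    }

    // Mirrors flatMap2: emits nothing, only remembers the newest smoke level.
    void onSmoke(SmokeLevel level) {
        this.smokeLevel = level;
    }

    public static void main(String[] args) {
        AlertLogicModel model = new AlertLogicModel();
        System.out.println(model.onTemperature("sensor_1", 35.0)); // prints []
        model.onSmoke(SmokeLevel.HIGH);
        System.out.println(model.onTemperature("sensor_1", 35.0)); // one alert
    }
}
```

The model also shows why the smoke stream is broadcast: every parallel instance keeps its own copy of smokeLevel, so each one needs to see every smoke update.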

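4. Distribution transformation operators
These define and control how data is partitioned.

rebalance(): uses a round-robin load-balancing scheme to spread the input stream evenly across all subsequent parallel tasks.

rescale(): also round-robin, but each sender only sends to a subset of the downstream parallel tasks. It is more efficient when the receiver parallelism is a multiple of the sender parallelism, or vice versa.

broadcast(): replicates every record of the input stream and sends it to all parallel tasks of the downstream operator.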
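
The three strategies differ only in which downstream channels a record may reach, which a plain-Java model can show. This is a sketch under stated assumptions, not Flink's partitioner code: PartitionerModel and its methods are hypothetical, round-robin always starts at channel 0, and the rescale case assumes the receiver count is a multiple of the sender count.

```java
import java.util.Arrays;

// Plain-Java model of rebalance, rescale, and broadcast targets; not Flink code.
class PartitionerModel {

    // rebalance(): each sender cycles over all receivers.
    static int rebalanceTarget(int recordIndex, int receivers) {
        return recordIndex % receivers;
    }

    // rescale(): each sender cycles only over its own slice of receivers.
    static int rescaleTarget(int senderIndex, int recordIndex, int senders, int receivers) {
        if (receivers % senders != 0) {
            throw new IllegalArgumentException("model assumes receivers is a multiple of senders");
        }
        int sliceSize = receivers / senders;
        return senderIndex * sliceSize + recordIndex % sliceSize;
    }

    // broadcast(): every record goes to every receiver.
    static int[] broadcastTargets(int receivers) {
        int[] targets = new int[receivers];
        for (int i = 0; i < receivers; i++) {
            targets[i] = i;
        }
        return targets;
    }

    public static void main(String[] args) {
        int senders = 2;
        int receivers = 4;
        for (int sender = 0; sender < senders; sender++) {
            for (int record = 0; record < 3; record++) {
                System.out.println("sender " + sender + ", record " + record
                        + ": rebalance -> " + rebalanceTarget(record, receivers)
                        + ", rescale -> " + rescaleTarget(sender, record, senders, receivers)
                        + ", broadcast -> " + Arrays.toString(broadcastTargets(receivers)));
            }
        }
    }
}
```

With 2 senders and 4 receivers, sender 0 only ever reaches receivers 0 and 1 under rescale and sender 1 only receivers 2 and 3, which is why fewer connections are needed than with rebalance.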
