Flink

Key features: high throughput, low latency, rich time semantics, exactly-once guarantees, high availability

Advantages of Flink over Spark Streaming: true record-at-a-time stream processing, event-time or processing-time semantics, many flexible window types, stateful processing, and streaming SQL

Configuration files: flink-conf.yaml, workers, masters
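
A minimal flink-conf.yaml sketch with commonly tuned keys (host name and memory sizes are placeholders; workers and masters simply list one host per line):

jobmanager.rpc.address: hadoop102
jobmanager.memory.process.size: 1600m
taskmanager.memory.process.size: 1728m
taskmanager.numberOfTaskSlots: 2
parallelism.default: 1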

Components: client, JobManager, TaskManager

Deployment modes: session mode, per-job mode, application mode
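
As a rough sketch, the three modes map to different CLI submissions on YARN (the application id and jar path below are placeholders):

# session mode: submit to an already running session cluster
./bin/flink run -t yarn-session -Dyarn.application.id=application_XXXX_YY myJob.jar
# per-job mode: a dedicated cluster per job
./bin/flink run -t yarn-per-job -d myJob.jar
# application mode: the job's main() runs on the cluster-side JobManager
./bin/flink run-application -t yarn-application myJob.jar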

Slots and parallelism
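
A minimal sketch of how the two relate (operator choices and numbers are arbitrary): the environment-level parallelism is the default for every operator, each operator can override it, and the number of slots a job needs is driven by the highest parallelism within each slot-sharing group.

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.setParallelism(4); // default parallelism for all operators

env.socketTextStream("hadoop102", 7777)
        .map(String::toUpperCase).setParallelism(2)            // per-operator override
        .filter(s -> !s.isEmpty()).slotSharingGroup("filters")  // put this operator in its own slot-sharing group
        .print();

env.execute();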

Operators: source, transformation, sink

env.execute() triggers job execution

Fault tolerance: checkpoints (interval, mode (exactly-once), storage location, timeout, minimum pause between checkpoints, maximum number of concurrent checkpoints; enabling incremental checkpoints and the changelog state backend reduces checkpoint duration); savepoints (triggered manually; typical scenario: restarting a Flink job)
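
A hedged sketch of those checkpoint knobs in code (the values and HDFS path are placeholders; the changelog switch assumes a Flink version where the changelog state backend is available, 1.15+):

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

// interval + exactly-once mode
env.enableCheckpointing(60 * 1000L, CheckpointingMode.EXACTLY_ONCE);

CheckpointConfig ckConf = env.getCheckpointConfig();
ckConf.setCheckpointStorage("hdfs://hadoop102:8020/flink/checkpoints"); // storage location
ckConf.setCheckpointTimeout(10 * 60 * 1000L);   // checkpoint timeout
ckConf.setMinPauseBetweenCheckpoints(1000L);    // minimum pause between checkpoints
ckConf.setMaxConcurrentCheckpoints(1);          // maximum concurrent checkpoints

// incremental checkpoints (RocksDB) plus the changelog state backend to shorten checkpoint time
env.setStateBackend(new EmbeddedRocksDBStateBackend(true));
env.enableChangelogStateBackend(true);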

Windows: time windows (tumbling, sliding, session); count windows (tumbling, sliding)

State: keyed state (ValueState, ListState, MapState, ReducingState, AggregatingState); operator state (ListState, UnionListState, BroadcastState, the last of which is used to broadcast a configuration stream so that config changes take effect in real time)
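
BroadcastState is the one primitive above not exercised later on this page, so here is a minimal hedged sketch: a low-volume config/rule stream is broadcast and joined with the main stream (both configStream and dataStream are assumed to be DataStream<String>; the descriptor name and the "active-rule" key are made up for illustration):

MapStateDescriptor<String, String> ruleDesc =
        new MapStateDescriptor<>("rules", Types.STRING, Types.STRING);

BroadcastStream<String> ruleBroadcast = configStream.broadcast(ruleDesc);

dataStream
        .connect(ruleBroadcast)
        .process(new BroadcastProcessFunction<String, String, String>() {
            @Override
            public void processElement(String value, ReadOnlyContext ctx, Collector<String> out) throws Exception {
                // read side: look up the currently broadcast config
                String rule = ctx.getBroadcastState(ruleDesc).get("active-rule");
                out.collect(value + " (rule=" + rule + ")");
            }

            @Override
            public void processBroadcastElement(String rule, Context ctx, Collector<String> out) throws Exception {
                // write side: every parallel instance updates its copy when the config stream changes
                ctx.getBroadcastState(ruleDesc).put("active-rule", rule);
            }
        })
        .print();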


streams, state, and time

streams: 

  • Bounded and unbounded streams
  • Real-time and recorded streams

state:

  • Multiple State Primitives: value, list, or map
  • Pluggable State Backends: in-memory, filesystem, or RocksDB
  • Exactly-once state consistency: via checkpoints
  • Very Large State: terabytes of state, thanks to asynchronous and incremental checkpoints
  • Scalable Applications

time: 

  • Event-time Mode
  • Watermark Support:Watermarks are also a flexible mechanism to trade-off the latency and completeness of results.
  • Late Data Handling:rerouting them via side outputs and updating previously completed results
  • Processing-time Mode:The processing-time mode can be suitable for certain applications with strict low-latency requirements that can tolerate approximate results.
public class WatermarkDemo {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setParallelism(1);

        SingleOutputStreamOperator<WaterSensor> dataSource = env.socketTextStream("hadoop102", 7777)
                .map(new MapFunction<String, WaterSensor>() {
                    @Override
                    public WaterSensor map(String value) throws Exception {
                        String[] data = value.split(",");
                        return new WaterSensor(data[0], Long.parseLong(data[1]), Integer.parseInt(data[2]));
                    }
                });
//        WatermarkStrategy<WaterSensor> watermarkStrategy = WatermarkStrategy.<WaterSensor>forMonotonousTimestamps()
        WatermarkStrategy<WaterSensor> watermarkStrategy = WatermarkStrategy
                // out-of-orderness bound: 3s
                .<WaterSensor>forBoundedOutOfOrderness(Duration.ofSeconds(3))
                // timestamp assigner: extract the event timestamp from the record
                .withTimestampAssigner((SerializableTimestampAssigner<WaterSensor>) (element, recordTimestamp) -> {
                    System.out.println("数据:" + element + ",recordTimestamp:" + recordTimestamp);
                    return element.getTs() * 1000;
                });
        SingleOutputStreamOperator<WaterSensor> ds = dataSource.assignTimestampsAndWatermarks(watermarkStrategy);
        OutputTag<WaterSensor> lateOutputTag = new OutputTag<>("late-data", Types.POJO(WaterSensor.class));
        SingleOutputStreamOperator<String> process = ds.keyBy(WaterSensor::getId)
                // tumbling window size: 10s
                .window(TumblingEventTimeWindows.of(Time.seconds(10)))
                // allowed lateness: 2s (window close is delayed by 2s; each late record arriving before the window closes triggers another firing)
                .allowedLateness(Time.seconds(2))
                // records that are still late after the window closes go to the side output (i.e. later than out-of-orderness + allowed lateness, here 3s + 2s)
                .sideOutputLateData(lateOutputTag)
                .process(new ProcessWindowFunction<WaterSensor, String, String, TimeWindow>() {
                    @Override
                    public void process(String s, Context context, Iterable<WaterSensor> elements, Collector<String> out) throws Exception {
                        long start = context.window().getStart();
                        long end = context.window().getEnd();
                        long currentProcessingTime = context.currentProcessingTime();
                        long currentWatermark = context.currentWatermark();
                        String windowStart = DateFormatUtils.format(start, "yyyy-MM-dd HH:mm:ss.SSS");
                        String windowEnd = DateFormatUtils.format(end, "yyyy-MM-dd HH:mm:ss.SSS");
                        long count = elements.spliterator().estimateSize();
                        out.collect("key=" + s + ";" + windowStart + "=>" + windowEnd + ";" + "count=" + count + ";" + "currentProcessingTime=" + currentProcessingTime + "currentWatermark=" + currentWatermark);
                    }
                });

        process.print();
        process.getSideOutput(lateOutputTag).printToErr();
        env.execute();
    }
}


Layered APIs


ProcessFunctions provide fine-grained control over time and state. A ProcessFunction can arbitrarily modify its state and register timers that will trigger a callback function in the future. Hence, ProcessFunctions can implement complex per-event business logic as required for many stateful event-driven applications.

The following example shows a KeyedProcessFunction that operates on a KeyedStream and matches START and END events. When a START event is received, the function remembers its timestamp in state and registers a timer in four hours. If an END event is received before the timer fires, the function computes the duration between END and START event, clears the state, and returns the value. Otherwise, the timer just fires and clears the state.

/**
 * Matches keyed START and END events and computes the difference between
 * both elements' timestamps. The first String field is the key attribute,
 * the second String attribute marks START and END events.
 */
public static class StartEndDuration
  extends KeyedProcessFunction<String, Tuple2<String, String>, Tuple2<String, Long>> {

  private ValueState<Long> startTime;
  
  @Override
  public void open(Configuration conf) {
    // obtain state handle
    startTime = getRuntimeContext()
      .getState(new ValueStateDescriptor<Long>("startTime", Long.class));
  }

  /** Called for each processed event. */
  @Override
  public void processElement(
      Tuple2<String, String> in,
      Context ctx,
      Collector<Tuple2<String, Long>> out) throws Exception {
  
      switch (in.f1) {
        case "START":
          // set the start time if we receive a start event.
          startTime.update(ctx.timestamp());
          // register a timer in four hours from the start event.
          ctx.timerService()
            .registerEventTimeTimer(ctx.timestamp() + 4 * 60 * 60 * 1000);
          break;
        case "END":
          // emit the duration between start and end event
          Long sTime = startTime.value();
          if (sTime != null) {
            out.collect(Tuple2.of(in.f0, ctx.timestamp() - sTime));
            // clear the state
            startTime.clear();
          }
          break;
        default:
          // do nothing
      }
  }

  /** Called when a timer fires. */
  @Override
  public void onTimer(
    long timestamp,
    OnTimerContext ctx,
    Collector<Tuple2<String, Long>> out) {

    // Timeout interval exceeded. Cleaning up the state.
    startTime.clear();
  }
}

The DataStream API provides primitives for many common stream processing operations, such as windowing, record-at-a-time transformations, and enriching events by querying an external data store. The DataStream API is available for Java and Scala and is based on functions, such as map(), reduce(), and aggregate(). Functions can be defined by extending interfaces or as Java or Scala lambda functions.

The following example shows how to sessionize a clickstream and count the number of clicks per session.

// a stream of website clicks
DataStream<Click> clicks = ...

DataStream<Tuple2<String, Long>> result = clicks
  // project clicks to userId and add a 1 for counting
  .map(
    // define function by implementing the MapFunction interface.
    new MapFunction<Click, Tuple2<String, Long>>() {
      @Override
      public Tuple2<String, Long> map(Click click) {
        return Tuple2.of(click.userId, 1L);
      }
    })
  // key by userId (field 0)
  .keyBy(0)
  // define session window with 30 minute gap
  .window(EventTimeSessionWindows.withGap(Time.minutes(30L)))
  // count clicks per session. Define function as lambda function.
  .reduce((a, b) -> Tuple2.of(a.f0, a.f1 + b.f1));

Flink features two relational APIs, the Table API and SQL. Both APIs are unified APIs for batch and stream processing, i.e., queries are executed with the same semantics on unbounded, real-time streams or bounded, recorded streams and produce the same results. The Table API and SQL leverage Apache Calcite for parsing, validation, and query optimization. They can be seamlessly integrated with the DataStream and DataSet APIs and support user-defined scalar, aggregate, and table-valued functions.

SELECT userId, COUNT(*)
FROM clicks
GROUP BY SESSION(clicktime, INTERVAL '30' MINUTE), userId
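
A hedged Java sketch of running that query, assuming a StreamExecutionEnvironment env and a table named clicks, with columns userId (STRING) and clicktime (a TIMESTAMP(3) event-time attribute with a watermark), already registered in the catalog:

StreamTableEnvironment tableEnv = StreamTableEnvironment.create(env);

// same sessionization as the SQL above; the group window makes the result append-only
Table sessionCounts = tableEnv.sqlQuery(
        "SELECT userId, COUNT(*) AS cnt " +
        "FROM clicks " +
        "GROUP BY SESSION(clicktime, INTERVAL '30' MINUTE), userId");

tableEnv.toDataStream(sessionCounts).print();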

Exactly-once guarantees between Flink and Kafka

1) Overview

Since the goal is end-to-end exactly-once, we can analyze it from the perspective of the three components involved:

(1) Inside Flink

Within Flink, the checkpoint mechanism guarantees exactly-once semantics for state and processing results.

(2) Input side

Kafka, as the input data source, persists its data and allows offsets to be reset. The Source task (FlinkKafkaConsumer) can therefore store the current read offsets as operator state and write them into checkpoints; on failure, the state is restored from the checkpoint and the FlinkKafkaConsumer connector re-submits those offsets to Kafka, so the data is re-consumed and the results remain consistent.

(3) Output side

The best way to guarantee exactly-once on the output side is, of course, two-phase commit (2PC). Kafka, Flink's natural companion, backs this up with its strongest consistency guarantee.

In other words, writing to Kafka is a two-phase commit: once results are produced they are written to Kafka as a transactional "pre-commit"; only after the checkpoint completes is the transaction actually committed. If a failure occurs in between, the transaction is rolled back and the pre-committed writes are discarded; after recovery, only operations that were confirmed committed are restored.

2) Required configuration

In a concrete application, achieving true end-to-end exactly-once needs some additional configuration:

(1) Checkpointing must be enabled

(2) Set the KafkaSink delivery guarantee to DeliveryGuarantee.EXACTLY_ONCE

(3) Configure the isolation level of the consumers reading from Kafka

The Kafka meant here is the external system being written to. In the pre-commit phase the data has already been written, but is marked as "uncommitted", and Kafka's default isolation level isolation.level is read_uncommitted, i.e. uncommitted data can be read. External applications could then consume uncommitted data directly and the transactional guarantee would be lost. The isolation level should therefore be set to read_committed, meaning a consumer that hits an uncommitted message stops consuming from that partition until the message is marked as committed. The downside is that external consumption then sees noticeably higher latency.

(4) Transaction timeout configuration

The transaction timeout configured in Flink's Kafka connector, transaction.timeout.ms, defaults to 1 hour, while the Kafka cluster's maximum transaction timeout, transaction.max.timeout.ms, defaults to 15 minutes. So when checkpoints take a long time, Kafka may decide the transaction has timed out and discard the pre-committed data while the Sink task still thinks it can keep waiting. If that checkpoint then completes and a later failure rolls back to it, this data is truly lost. The former timeout must therefore be less than or equal to the latter.

public class KafkaEOSDemo {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // HDFS is used for checkpoints, so the Hadoop dependency is required and the HDFS user must be set
        System.setProperty("HADOOP_USER_NAME", "atguigu");
        
        
        // TODO 1. enable checkpointing with exactly-once mode
        env.enableCheckpointing(5000, CheckpointingMode.EXACTLY_ONCE);
        CheckpointConfig checkpointConfig = env.getCheckpointConfig();
        checkpointConfig.setCheckpointStorage("hdfs://hadoop102:8020/chk");
        checkpointConfig.setExternalizedCheckpointCleanup(CheckpointConfig.ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION);


        // TODO 2. read from Kafka
        KafkaSource<String> kafkaSource = KafkaSource.<String>builder()
                .setBootstrapServers("hadoop102:9092,hadoop103:9092,hadoop104:9092")
                .setGroupId("atguigu")
                .setTopics("topic_1")
                .setValueOnlyDeserializer(new SimpleStringSchema())
                .setStartingOffsets(OffsetsInitializer.latest())
                .build();

        DataStreamSource<String> kafkasource = env
                .fromSource(kafkaSource, WatermarkStrategy.forBoundedOutOfOrderness(Duration.ofSeconds(3)), "kafkasource");
        

        /**
         * TODO 3. write to Kafka
         * Exactly-once writes to Kafka require ALL of the following:
         * 1. checkpointing enabled
         * 2. sink delivery guarantee set to exactly-once
         * 3. sink transactional id prefix set
         * 4. sink transaction timeout set: checkpoint interval < transaction timeout < Kafka's 15-minute max
         */
        KafkaSink<String> kafkaSink = KafkaSink.<String>builder()
                // Kafka bootstrap servers
                .setBootstrapServers("hadoop102:9092,hadoop103:9092,hadoop104:9092")
                // record serializer: target topic and value serialization
                .setRecordSerializer(
                        KafkaRecordSerializationSchema.<String>builder()
                                .setTopic("ws")
                                .setValueSerializationSchema(new SimpleStringSchema())
                                .build()
                )
                // TODO 3.1 exactly-once: enable two-phase commit
                .setDeliveryGuarantee(DeliveryGuarantee.EXACTLY_ONCE)
                // TODO 3.2 exactly-once: a transactional id prefix is required
                .setTransactionalIdPrefix("atguigu-")
                // TODO 3.3 exactly-once: transaction timeout must be larger than the checkpoint interval and smaller than the 15-minute max
                .setProperty(ProducerConfig.TRANSACTION_TIMEOUT_CONFIG, 10*60*1000+"")
                .build();


        kafkasource.sinkTo(kafkaSink);

        env.execute();
    }
}

Downstream consumers reading the "ws" topic must set the transaction isolation level to "read committed", as follows:

public class KafkaEOSDemo2 {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // consume the topic written above with two-phase commit
        KafkaSource<String> kafkaSource = KafkaSource.<String>builder()
                .setBootstrapServers("hadoop102:9092,hadoop103:9092,hadoop104:9092")
                .setGroupId("atguigu")
                .setTopics("ws")
                .setValueOnlyDeserializer(new SimpleStringSchema())
                .setStartingOffsets(OffsetsInitializer.latest())
                // TODO as a downstream consumer, set the transaction isolation level to read_committed
                .setProperty(ConsumerConfig.ISOLATION_LEVEL_CONFIG, "read_committed")
                .build();


        env
                .fromSource(kafkaSource, WatermarkStrategy.forBoundedOutOfOrderness(Duration.ofSeconds(3)), "kafkasource")
                .print();

        env.execute();
    }
}

FlinkCEP is the Complex Event Processing (CEP) library implemented on top of Flink. It allows you to detect event patterns in an endless stream of events, giving you the opportunity to get hold of what’s important in your data.

public class LoginFailDetectExample {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setParallelism(1);
        SingleOutputStreamOperator<LoginEvent> loginEventStream = env.fromElements(
                new LoginEvent("user_1", "192.168.0.1", "fail", 2000L),
                new LoginEvent("user_1", "192.168.0.2", "fail", 3000L),
                new LoginEvent("user_2", "192.168.1.29", "fail", 4000L),
                new LoginEvent("user_1", "171.56.23.10", "fail", 5000L),
                new LoginEvent("user_2", "192.168.1.29", "fail", 6000L),
                new LoginEvent("user_2", "192.168.1.29", "success", 7000L)
        ).assignTimestampsAndWatermarks(WatermarkStrategy.<LoginEvent>forBoundedOutOfOrderness(Duration.ofSeconds(10))
                .withTimestampAssigner(new SerializableTimestampAssigner<LoginEvent>() {
                    @Override
                    public long extractTimestamp(LoginEvent element, long recordTimestamp) {
                        return element.getTimestamp();
                    }
                }));

        // define the pattern: three consecutive login failures
        Pattern<LoginEvent, LoginEvent> pattern = Pattern.<LoginEvent>begin("first").where(new SimpleCondition<LoginEvent>() {
            @Override
            public boolean filter(LoginEvent value) throws Exception {
                return value.getEventType().equals("fail");
            }
        }).next("second").where(new SimpleCondition<LoginEvent>() {
            @Override
            public boolean filter(LoginEvent value) throws Exception {
                return value.getEventType().equals("fail");
            }
        }).next("third").where(new SimpleCondition<LoginEvent>() {
            @Override
            public boolean filter(LoginEvent value) throws Exception {
                return value.getEventType().equals("fail");
            }
        });

        // apply the pattern to the keyed stream
        PatternStream<LoginEvent> patternStream = CEP.pattern(loginEventStream.keyBy(LoginEvent::getUserId), pattern);

        // extract the detected complex events
        SingleOutputStreamOperator<String> warningStream = patternStream.select(new PatternSelectFunction<LoginEvent, String>() {
            @Override
            public String select(Map<String, List<LoginEvent>> pattern) throws Exception {
                LoginEvent first = pattern.get("first").get(0);
                LoginEvent second = pattern.get("second").get(0);
                LoginEvent third = pattern.get("third").get(0);
                return first.getUserId() + " failed to log in three times in a row! Login times: " + first.getTimestamp() + "," + second.getTimestamp() + "," + third.getTimestamp();
            }
        });

        warningStream.print();

        env.execute();
    }
}
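
The three chained next() steps can also be expressed with a quantifier; a small sketch on the same classes, where consecutive() restores the strict contiguity that next() gave us (the select function would then read all three events from pattern.get("fails")):

Pattern<LoginEvent, LoginEvent> threeFails = Pattern
        .<LoginEvent>begin("fails")
        .where(new SimpleCondition<LoginEvent>() {
            @Override
            public boolean filter(LoginEvent value) throws Exception {
                return value.getEventType().equals("fail");
            }
        })
        .times(3)        // three matching events
        .consecutive();  // with strict contiguity, like next()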

FlinkCDC - Change Data Capture

Use case: real-time synchronization of MySQL data.

Flink CDC jobs run in streaming mode by default, providing sub-second end-to-end latency in real-time binlog sync scenarios and effectively keeping downstream data fresh.

Flink CDC can read a database's historical data and then read CDC events with exactly-once processing, even after job failures.
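
A minimal MySQL source sketch with the DataStream connector (host, database, table, and credentials are placeholders; assumes the flink-connector-mysql-cdc dependency is on the classpath):

MySqlSource<String> mySqlSource = MySqlSource.<String>builder()
        .hostname("hadoop102")
        .port(3306)
        .databaseList("test_db")            // database to capture
        .tableList("test_db.orders")        // table, qualified with the database name
        .username("root")
        .password("******")
        .deserializer(new JsonDebeziumDeserializationSchema()) // emit change events as JSON strings
        .startupOptions(StartupOptions.initial())              // snapshot the history first, then read the binlog
        .build();

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.enableCheckpointing(5000); // checkpoints are what make the CDC reading exactly-once
env.fromSource(mySqlSource, WatermarkStrategy.noWatermarks(), "mysql-cdc-source")
        .print();
env.execute("FlinkcdcDataStream");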

Submitting a Job
./bin/flink run -d -c FlinkcdcDataStream /home/xswtest/flinkcdc-mysql-1.0-SNAPSHOT-jar-with-dependencies.jar
./bin/flink run -d -c FlinkNeo4jSink /home/xswtest/flink-neo4jsink-1.0-SNAPSHOT-jar-with-dependencies.jar
./bin/flink run -d -c FlinkPGSink /home/xswtest/flink-pgsink-1.0-SNAPSHOT-jar-with-dependencies.jar

Terminating a Job (a savepoint is taken automatically: Savepoint completed. Path: hdfs://cdh01:8020/flink/flink-savepoints/savepoint-f72ee1-0210e7b9ed27)
./bin/flink stop $JOB_ID

Starting a Job from a Savepoint
./bin/flink run -d -c FlinkcdcDataStream -s hdfs://cdh01:8020/flink/flink-savepoints/savepoint-f72ee1-0210e7b9ed27 /home/xswtest/flinkcdc-mysql-1.0-SNAPSHOT-jar-with-dependencies.jar

This is useful if your program dropped an operator that was part of the savepoint.
./bin/flink run -s <savepointPath> --allowNonRestoredState ...
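
A savepoint can also be triggered manually while the job keeps running (the target directory is a placeholder):
./bin/flink savepoint $JOB_ID hdfs://cdh01:8020/flink/flink-savepoints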
