Learn Flink: Event-driven Applications

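This article explores how to build event-driven applications with Flink's ProcessFunction. A ProcessFunction combines event processing, timers, and state, which makes it a powerful building block for stream processing. Through a worked example, the article shows how to use a KeyedProcessFunction in place of a tumbling event-time window, explains the open(), processElement(), and onTimer() methods, and touches on performance considerations and use cases for side outputs.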

Event-driven Applications

Process Functions

Introduction

A ProcessFunction combines event processing with timers and state, making it a powerful building block for stream processing applications. This is the basis for creating event-driven applications with Flink. It is very similar to a RichFlatMapFunction, but with the addition of timers.

Example

If you’ve done the hands-on exercise in the Streaming Analytics training, you will recall that it uses TumblingEventTimeWindows to compute the sum of the tips for each driver during each hour, like this:

// compute the sum of the tips per hour for each driver
DataStream<Tuple3<Long, Long, Float>> hourlyTips = fares
        .keyBy((TaxiFare fare) -> fare.driverId)
        .window(TumblingEventTimeWindows.of(Time.hours(1)))
        .process(new AddTips());

It is reasonably straightforward, and educational, to do the same thing with a KeyedProcessFunction. Let us begin by replacing the code above with this:

// compute the sum of the tips per hour for each driver
DataStream<Tuple3<Long, Long, Float>> hourlyTips = fares
        .keyBy((TaxiFare fare) -> fare.driverId)
        .process(new PseudoWindow(Time.hours(1)));

In this code snippet a KeyedProcessFunction called PseudoWindow is being applied to a keyed stream, the result of which is a DataStream<Tuple3<Long, Long, Float>> (the same kind of stream produced by the implementation that uses Flink’s built-in time windows).

The overall outline of PseudoWindow has this shape:

// Compute the sum of the tips for each driver in hour-long windows.
// The keys are driverIds.
public static class PseudoWindow extends 
        KeyedProcessFunction<Long, TaxiFare, Tuple3<Long, Long, Float>> {

    private final long durationMsec;

    public PseudoWindow(Time duration) {
        this.durationMsec = duration.toMilliseconds();
    }

    @Override
    // Called once during initialization.
    public void open(Configuration conf) {
        . . .
    }

    @Override
    // Called as each fare arrives to be processed.
    public void processElement(
            TaxiFare fare,
            Context ctx,
            Collector<Tuple3<Long, Long, Float>> out) throws Exception {

        . . .
    }

    @Override
    // Called when the current watermark indicates that a window is now complete.
    public void onTimer(long timestamp, 
            OnTimerContext context, 
            Collector<Tuple3<Long, Long, Float>> out) throws Exception {

        . . .
    }
}

Things to be aware of:

  • There are several types of ProcessFunctions – this is a KeyedProcessFunction, but there are also CoProcessFunctions, BroadcastProcessFunctions, etc.

  • A KeyedProcessFunction is a kind of RichFunction. Being a RichFunction, it has access to the open and getRuntimeContext methods needed for working with managed keyed state.

  • There are two callbacks to implement: processElement and onTimer. processElement is called with each incoming event; onTimer is called when timers fire. These can be either event time or processing time timers. Both processElement and onTimer are provided with a context object that can be used to interact with a TimerService (among other things). Both callbacks are also passed a Collector that can be used to emit results.

The open() method

// Keyed, managed state, with an entry for each window, keyed by the window's end time.
// There is a separate MapState object for each driver.
private transient MapState<Long, Float> sumOfTips;

@Override
public void open(Configuration conf) {

    MapStateDescriptor<Long, Float> sumDesc =
            new MapStateDescriptor<>("sumOfTips", Long.class, Float.class);
    sumOfTips = getRuntimeContext().getMapState(sumDesc);
}

Because the fare events can arrive out of order, it will sometimes be necessary to process events for one hour before having finished computing the results for the previous hour. In fact, if the watermarking delay is much longer than the window length, then there may be many windows open simultaneously, rather than just two. This implementation supports this by using a MapState that maps the timestamp for the end of each window to the sum of the tips for that window.

The processElement() method

public void processElement(
        TaxiFare fare,
        Context ctx,
        Collector<Tuple3<Long, Long, Float>> out) throws Exception {

    long eventTime = fare.getEventTime();
    TimerService timerService = ctx.timerService();

    if (eventTime <= timerService.currentWatermark()) {
        // This event is late; its window has already been triggered.
    } else {
        // Round up eventTime to the end of the window containing this event.
        long endOfWindow = (eventTime - (eventTime % durationMsec) + durationMsec - 1);

        // Schedule a callback for when the window has been completed.
        timerService.registerEventTimeTimer(endOfWindow);

        // Add this fare's tip to the running total for that window.
        Float sum = sumOfTips.get(endOfWindow);
        if (sum == null) {
            sum = 0.0F;
        }
        sum += fare.tip;
        sumOfTips.put(endOfWindow, sum);
    }
}

Things to consider:

  • What happens with late events? Events that are behind the watermark (i.e., late) are being dropped. If you want to do something better than this, consider using a side output, which is explained in the next section.

  • This example uses a MapState where the keys are timestamps, and sets a Timer for that same timestamp. This is a common pattern; it makes it easy and efficient to look up relevant information when the timer fires.
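
The rounding step in processElement is easy to get wrong by one, so here is a small plain-Java sketch of just that arithmetic. The class name WindowEndExample is hypothetical and nothing here depends on Flink; it assumes non-negative timestamps in milliseconds. The result is the last millisecond that still belongs to the window, which is why the formula subtracts 1: an event-time timer set to that timestamp can fire once the watermark says no earlier-or-equal timestamps remain.

```java
// Hypothetical helper, not part of Flink: isolates the window-end arithmetic.
final class WindowEndExample {

    /** Returns the last millisecond of the window containing eventTime (eventTime >= 0 assumed). */
    static long endOfWindow(long eventTime, long durationMsec) {
        return eventTime - (eventTime % durationMsec) + durationMsec - 1;
    }

    public static void main(String[] args) {
        long hour = 60 * 60 * 1000L;
        System.out.println(endOfWindow(0L, hour));       // 3599999
        System.out.println(endOfWindow(hour - 1, hour)); // 3599999, same window
        System.out.println(endOfWindow(hour, hour));     // 7199999, next window
    }
}
```

Every event in the same hour maps to the same key, so the MapState entry and the timer for that hour share one timestamp.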

The onTimer() method

public void onTimer(
        long timestamp, 
        OnTimerContext context, 
        Collector<Tuple3<Long, Long, Float>> out) throws Exception {

    long driverId = context.getCurrentKey();
    // Look up the result for the hour that just ended.
    Float sumOfTips = this.sumOfTips.get(timestamp);

    Tuple3<Long, Long, Float> result = Tuple3.of(driverId, timestamp, sumOfTips);
    out.collect(result);
    this.sumOfTips.remove(timestamp);
}

Observations:

1. The OnTimerContext context passed in to onTimer can be used to determine the current key.

2. Our pseudo-windows are being triggered when the current watermark reaches the end of each hour, at which point onTimer is called. This onTimer method removes the related entry from sumOfTips, which has the effect of making it impossible to accommodate late events. This is the equivalent of setting the allowedLateness to zero when working with Flink’s time windows.
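
To make the lifecycle concrete, the following is a rough plain-Java model of what happens for a single driver key. It is only a sketch: PseudoWindowModel and its method names are hypothetical, a TreeMap stands in for the MapState, and an explicit advanceWatermark call stands in for Flink's timer service. It is not how Flink implements timers internally.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical single-key model of the PseudoWindow logic; no Flink classes involved.
final class PseudoWindowModel {
    private final long durationMsec;
    // Window end timestamp -> running sum of tips (stands in for MapState<Long, Float>).
    private final TreeMap<Long, Float> sumOfTips = new TreeMap<>();
    private long currentWatermark = Long.MIN_VALUE;

    PseudoWindowModel(long durationMsec) {
        this.durationMsec = durationMsec;
    }

    /** Mirrors processElement: returns false when the event is late and is dropped. */
    boolean processElement(long eventTime, float tip) {
        if (eventTime <= currentWatermark) {
            return false; // late: its window has already been triggered
        }
        long endOfWindow = eventTime - (eventTime % durationMsec) + durationMsec - 1;
        sumOfTips.merge(endOfWindow, tip, Float::sum);
        return true;
    }

    /** Mirrors timers firing: emits and removes every window whose end is at or before the watermark. */
    List<String> advanceWatermark(long watermark) {
        currentWatermark = Math.max(currentWatermark, watermark);
        List<String> fired = new ArrayList<>();
        while (!sumOfTips.isEmpty() && sumOfTips.firstKey() <= currentWatermark) {
            Map.Entry<Long, Float> entry = sumOfTips.pollFirstEntry();
            fired.add(entry.getKey() + "=" + entry.getValue());
        }
        return fired;
    }

    /** Number of windows currently holding state. */
    int openWindows() {
        return sumOfTips.size();
    }
}
```

Feeding this model events from two different hours before advancing the watermark leaves two windows open at once; after the watermark passes the end of the first hour, that entry is emitted and removed, and any further event for that hour is rejected as late.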

Performance Considerations

Flink provides MapState and ListState types that are optimized for RocksDB. Where possible, these should be used instead of a ValueState object holding some sort of collection. The RocksDB state backend can append to ListState without going through (de)serialization, and for MapState, each key/value pair is a separate RocksDB object, so MapState can be efficiently accessed and updated.

Side Outputs

Introduction

There are several good reasons to want to have more than one output stream from a Flink operator, such as reporting:

  • exceptions
  • malformed events
  • late events
  • operational alerts, such as timed-out connections to external services

Side outputs are a convenient way to do this. Beyond error reporting, side outputs are also a good way to implement an n-way split of a stream.

Example

You are now in a position to do something with the late events that were ignored in the previous section.

A side output channel is associated with an OutputTag. These tags have generic types that correspond to the type of the side output’s DataStream, and they have names.

private static final OutputTag<TaxiFare> lateFares = new OutputTag<TaxiFare>("lateFares") {};

Shown above is a static OutputTag that can be referenced both when emitting late events in the processElement method of the PseudoWindow:

if (eventTime <= timerService.currentWatermark()) {
    // This event is late; its window has already been triggered.
    ctx.output(lateFares, fare);
} else {
    . . .
}

and when accessing the stream from this side output in the main method of the job:

// compute the sum of the tips per hour for each driver
SingleOutputStreamOperator<Tuple3<Long, Long, Float>> hourlyTips = fares
        .keyBy((TaxiFare fare) -> fare.driverId)
        .process(new PseudoWindow(Time.hours(1)));

hourlyTips.getSideOutput(lateFares).print();

Alternatively, you can use two OutputTags with the same name to refer to the same side output, but if you do, they must have the same type.

Closing Remarks

In this example you have seen how a ProcessFunction can be used to reimplement a straightforward time window. Of course, if Flink’s built-in windowing API meets your needs, by all means, go ahead and use it. But if you find yourself considering doing something contorted with Flink’s windows, don’t be afraid to roll your own.

Also, ProcessFunctions are useful for many other use cases beyond computing analytics. The hands-on exercise below provides an example of something completely different.

Another common use case for ProcessFunctions is for expiring stale state. If you think back to the Rides and Fares Exercise, where a RichCoFlatMapFunction is used to compute a simple join, the sample solution assumes that the TaxiRides and TaxiFares are perfectly matched, one-to-one for each rideId. If an event is lost, the other event for the same rideId will be held in state forever. This could instead be implemented as a KeyedCoProcessFunction, and a timer could be used to detect and clear any stale state.
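
The idea of expiring stale join state can be modeled in a few lines of plain Java. This is a sketch under stated assumptions, not the exercise's solution: StaleStateModel, its timeout parameter, and its method names are all hypothetical, a HashMap stands in for keyed state, and onWatermark plays the role that per-key event-time timers would play in a KeyedCoProcessFunction.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical model of a one-to-one join that clears unmatched entries after a timeout.
final class StaleStateModel {
    private final long timeoutMsec;
    // rideId -> event time of the unmatched event still waiting for its partner.
    private final Map<Long, Long> pending = new HashMap<>();

    StaleStateModel(long timeoutMsec) {
        this.timeoutMsec = timeoutMsec;
    }

    /** Returns true if this event completed a pair, false if it is now waiting in state. */
    boolean onEvent(long rideId, long eventTime) {
        if (pending.remove(rideId) != null) {
            return true; // partner was waiting; state for this rideId is cleared
        }
        pending.put(rideId, eventTime);
        return false;
    }

    /** Stands in for timers firing: drops entries whose timeout has passed and returns how many. */
    int onWatermark(long watermark) {
        int before = pending.size();
        pending.values().removeIf(eventTime -> eventTime + timeoutMsec <= watermark);
        return before - pending.size();
    }

    /** Number of events still held in state. */
    int pendingCount() {
        return pending.size();
    }
}
```

In an actual Flink job the cleanup would more likely be driven by a timer registered when the first event of a pair arrives, and the expired event could be sent to a side output rather than silently discarded.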

Hands-on

The hands-on exercise that goes with this section is the Long Ride Alerts Exercise.
