How to use stateful operations in Kafka Streams?

Latest recommended article published on 2024-04-17 17:22:26

cunxiedian8614


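The first part of the Kafka Streams API blog series covered stateless operations such as filter, map, etc. In this part, we will explore the stateful operations in the Kafka Streams DSL API. It focuses on aggregation operations such as aggregate, count, and reduce, along with a discussion of related concepts.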

Aggregation

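Aggregation operations are applied to records that share the same key. Kafka Streams supports the following aggregations: aggregate, count, and reduce. As mentioned in the previous blog, grouping is a prerequisite for aggregation. You can run groupBy (or one of its variations) on a KStream or a KTable, which results in a KGroupedStream or a KGroupedTable respectively.

Grouping on a KTable was not covered in the earlier blog on stateless operations.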

aggregate

The aggregate function has two key components: an Initializer and an Aggregator. When the first record for a key is received, the Initializer is invoked and its result is used as the starting point for the Aggregator. For subsequent records, the Aggregator uses the current record along with the aggregate computed so far. Conceptually, this is a stateful computation performed on an infinite data set. It is stateful because calculating the new state takes into account the current record (the key-value pair) along with the latest state (the current aggregate). This can be used for scenarios such as moving average, sum, count, etc.

Here is an example of how to calculate a count, i.e. the number of times a specific key was received.

The code examples are available on GitHub.

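        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> stream = builder.stream(INPUT_TOPIC);

        // Count is a simple holder for a key and its count (its definition is not shown here)
        KTable<String, Count> aggregate = stream.groupByKey()
                // Initializer: supplies the starting aggregate for a key seen for the first time
                .aggregate(new Initializer<Count>() {
                    @Override
                    public Count apply() {
                        return new Count("", 0);
                    }
                // Aggregator: builds the new aggregate from the current record and the aggregate so far
                }, new Aggregator<String, String, Count>() {
                    @Override
                    public Count apply(String k, String v, Count aggKeyCount) {
                        Integer currentCount = aggKeyCount.getCount();
                        return new Count(k, currentCount + 1);
                    }
                });

        // Emit the running count for each key to the output topic
        aggregate.toStream()
                 .map((k,v) -> new KeyValue<>(k, v.getCount()))
                 .to(COUNTS_TOPIC, Produced.with(Serdes.String(), Serdes.Integer()));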
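To make the Initializer/Aggregator mechanics concrete without a running broker, here is a minimal plain-Java sketch of the same fold-per-key idea. It is not Kafka Streams code: `KeyedAggregation`, `TriFunction`, and the `HashMap` standing in for the state store are hypothetical names chosen for illustration, and the example computes a running sum rather than a count.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Supplier;

// Plain-Java model of groupByKey().aggregate(initializer, aggregator).
// Hypothetical illustration only; none of these types come from Kafka Streams.
final class KeyedAggregation {

    // Same shape as an Aggregator: (key, value, currentAggregate) -> newAggregate.
    interface TriFunction<K, V, A> {
        A apply(K key, V value, A aggregate);
    }

    static <K, V, A> Map<K, A> aggregate(List<Map.Entry<K, V>> records,
                                         Supplier<A> initializer,
                                         TriFunction<K, V, A> aggregator) {
        Map<K, A> state = new HashMap<>(); // stands in for the state store
        for (Map.Entry<K, V> record : records) {
            K key = record.getKey();
            // First record for a key: start from the initializer's value.
            A current = state.containsKey(key) ? state.get(key) : initializer.get();
            state.put(key, aggregator.apply(key, record.getValue(), current));
        }
        return state;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> records = List.of(
                Map.entry("a", 5), Map.entry("b", 1), Map.entry("a", 7));
        // Running sum per key, starting from 0.
        Map<String, Integer> sums = aggregate(records, () -> 0, (k, v, agg) -> agg + v);
        System.out.println(sums.get("a") + " " + sums.get("b")); // prints "12 1"
    }
}
```

Note that the aggregate type (`A`) is independent of the value type (`V`), which is what allows an aggregate to produce something like a `Count` object from `String` values.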

count

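count is such a commonly used form of aggregation that it is offered as a first-class operation. Once the stream records are grouped by key (KGroupedStream), you can count the number of records of a specific key by using this operation.

The aggregate way of doing things can be replaced with a single method call!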

        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> stream = builder.stream(INPUT_TOPIC);

        stream.groupByKey().count();

reduce

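You can use reduce to combine a stream of values. The aggregate operation covered earlier is a generalized form of reduce. You can implement functions such as sum, min, max, etc. Here is an example of max: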

        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, Long> stream = builder.stream(INPUT_TOPIC, Consumed.with(Serdes.String(), Serdes.Long()));

        stream.groupByKey()
                .reduce(new Reducer<Long>() {
                    @Override
                    public Long apply(Long currentMax, Long v) {
                        Long max = (currentMax > v) ? currentMax : v;
                        return max;
                    }
                }).toStream().to(OUTPUT_TOPIC);

        return builder.build();

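Note that all aggregation operations ignore records with a null key. This is to be expected, since the whole purpose of these functions is to operate on records of a specific key.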
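The reduce behaviour and the null-key rule can be sketched in plain Java as follows. This is an illustrative model, not the Kafka Streams implementation: `KeyedReduce` is a hypothetical name, and the `HashMap` plays the role of the per-key state. The first value seen for a key is stored as is, later values are combined with the reducer, and records without a key are skipped.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BinaryOperator;

// Plain-Java model of groupByKey().reduce(reducer); hypothetical illustration only.
final class KeyedReduce {

    static <K, V> Map<K, V> reduce(List<Map.Entry<K, V>> records, BinaryOperator<V> reducer) {
        Map<K, V> state = new HashMap<>();
        for (Map.Entry<K, V> record : records) {
            if (record.getKey() == null) {
                continue; // no key means there is nothing to group by
            }
            // merge() stores the first value unchanged and applies
            // reducer(current, incoming) for every later value of the same key.
            state.merge(record.getKey(), record.getValue(), reducer);
        }
        return state;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Long>> records = List.of(
                new SimpleEntry<String, Long>("a", 3L),
                new SimpleEntry<String, Long>("a", 9L),
                new SimpleEntry<String, Long>(null, 100L), // ignored: null key
                new SimpleEntry<String, Long>("b", 4L),
                new SimpleEntry<String, Long>("a", 5L));
        Map<String, Long> max = reduce(records, Long::max);
        System.out.println(max.get("a") + " " + max.get("b")); // prints "9 4"
    }
}
```

Unlike the aggregate model, the result type here is the same as the value type, which is why reduce can be seen as the more restricted of the two operations.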

Aggregation and state stores

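In the above examples, the aggregated values were pushed to an output topic, although this is not mandatory. It is also possible to store the aggregation results in a local state store. Here is an example: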

        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> stream = builder.stream(INPUT_TOPIC);

        stream.groupByKey().count(Materialized.as("count-store"));

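In the above example, count also creates a local state store named count-store, which can then be introspected using Interactive Queries.

These state stores can either be in-memory or backed by RocksDB. This allows for scalability: each state store is local to a specific Kafka Streams application instance, which processes input from different partitions of a topic, so the overall state is distributed across (potentially) multiple instances of the application (GlobalKTables being the exception). Another key property is high availability: the contents of these state stores are backed up to Kafka as changelog (i.e. compacted) topics (although this can be disabled), so if an app instance crashes, its state store contents can be restored from Kafka itself.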
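The idea of rebuilding a store from its changelog can be illustrated with a small plain-Java sketch. It is a conceptual model only and does not show how Kafka Streams performs restoration internally: `ChangelogRestore` is a hypothetical name, the changelog is represented as an in-memory list, and a null value is assumed to act as a delete marker (tombstone).

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Conceptual model of restoring a key-value store by replaying its changelog.
// Hypothetical illustration; not the Kafka Streams restoration code.
final class ChangelogRestore {

    // Replays updates in order. The last update per key wins;
    // a null value is treated as a delete (tombstone).
    static Map<String, Long> restore(List<Map.Entry<String, Long>> changelog) {
        Map<String, Long> store = new HashMap<>();
        for (Map.Entry<String, Long> update : changelog) {
            if (update.getValue() == null) {
                store.remove(update.getKey());
            } else {
                store.put(update.getKey(), update.getValue());
            }
        }
        return store;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Long>> changelog = List.of(
                new SimpleEntry<String, Long>("user1", 1L),
                new SimpleEntry<String, Long>("user2", 1L),
                new SimpleEntry<String, Long>("user1", 2L),
                new SimpleEntry<String, Long>("user2", null), // tombstone
                new SimpleEntry<String, Long>("user1", 3L));
        System.out.println(restore(changelog)); // prints "{user1=3}"
    }
}
```

Because only the latest value per key matters for the restored state, a compacted topic (which retains at least the most recent record for each key) is sufficient as a backup.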

`KGroupedTable`

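A KGroupedTable is obtained when a groupBy* operation is invoked on a KTable. Just like KGroupedStream, having a KGroupedTable is a prerequisite for applying an aggregation to a KTable. aggregate, count, and reduce work the same way with a KGroupedTable as they do with a KGroupedStream. However, there is an important difference that needs to be highlighted.

A KTable is conceptually different from a KStream in the sense that it represents a snapshot of the data at a point in time (much like a database table). It is a mutable entity, as opposed to a KStream, which represents an immutable and infinite sequence of records. Because of this difference, the aggregate and reduce functions on a KGroupedTable also take an additional aggregator (a Reducer in the case of reduce), often known as a subtractor, which is invoked when a key is updated or a null value is received.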
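The role of the subtractor is easier to see with a small model. The following plain-Java sketch is an illustration under stated assumptions, not Kafka Streams code: `GroupedTableCount` is a hypothetical class that keeps a table of key to group (for example user to region) and a count per group. When a key moves to a different group or is deleted with a null value, the old group's count is decremented (the subtractor) before the new group's count is incremented (the adder).

```java
import java.util.HashMap;
import java.util.Map;

// Plain-Java model of counting over a grouped table; hypothetical illustration only.
final class GroupedTableCount {
    private final Map<String, String> table = new HashMap<>(); // latest group per key (the table)
    private final Map<String, Long> counts = new HashMap<>();  // aggregate per group

    // Applies an upsert (newGroup != null) or a delete (newGroup == null) for the given key.
    void update(String key, String newGroup) {
        String oldGroup = (newGroup == null) ? table.remove(key) : table.put(key, newGroup);
        if (oldGroup != null) {
            counts.merge(oldGroup, -1L, Long::sum); // subtractor: undo the old contribution
        }
        if (newGroup != null) {
            counts.merge(newGroup, 1L, Long::sum);  // adder: apply the new contribution
        }
    }

    long count(String group) {
        return counts.getOrDefault(group, 0L);
    }

    public static void main(String[] args) {
        GroupedTableCount regions = new GroupedTableCount();
        regions.update("alice", "EU");
        regions.update("bob", "EU");
        regions.update("alice", "US"); // alice moves: EU is decremented, US is incremented
        regions.update("bob", null);   // bob is deleted: EU is decremented
        System.out.println(regions.count("EU") + " " + regions.count("US")); // prints "0 1"
    }
}
```

Without the subtractor step, alice would be counted in both EU and US after her update, which is exactly the problem that does not arise with an immutable KStream.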

Windowing

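Stateful Kafka Streams operations also support windowing. This allows you to scope your stream processing pipelines to a specific time window/range, e.g. to track the number of link clicks per minute or the number of unique page views per hour.

To perform windowed aggregations on a group of records, you have to create a KGroupedStream (as explained above) using groupBy on a KStream and then use the windowedBy operation (available in two overloaded forms). You can choose between traditional windows (tumbling, hopping, or sliding) or session-based time windows.

Using windowedBy(Windows<W> windows) on a KGroupedStream returns a TimeWindowedKStream, on which you can invoke the aggregation operations mentioned above. For example, if you want the number of clicks over a specific time range (say, 5 minutes), choose a tumbling time window. This ensures that records are clearly segregated across the given time boundaries, i.e. clicks from user1 between 10:00 AM and 10:05 AM are aggregated (counted) separately, and a new time block (window) starts at 10:05 AM, at which point the click counter is reset to zero and counting starts again.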

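        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> stream = builder.stream(INPUT_TOPIC);

        // Tumbling windows of 5 minutes
        TimeWindowedKStream<String, String> windowed = stream.groupByKey()
                .windowedBy(TimeWindows.of(Duration.ofMinutes(5)));

        // The resulting key is a Windowed<String>; writing it out as is assumes suitable serdes are configured
        windowed.count().toStream().to(OUTPUT_TOPIC);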
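How a record ends up in exactly one tumbling window can be sketched with a little arithmetic. The following plain-Java snippet is an illustration only: `TumblingWindowDemo` and `windowStart` are hypothetical names, and it assumes non-negative epoch-millisecond timestamps with windows aligned to the epoch, each covering the half-open range from its start up to (but not including) start plus size.

```java
import java.time.Duration;
import java.time.Instant;

// Plain-Java sketch of tumbling window assignment; hypothetical illustration only.
final class TumblingWindowDemo {

    // Start of the single tumbling window containing the timestamp.
    // Assumes timestampMs >= 0 and windows aligned to the epoch.
    static long windowStart(long timestampMs, long sizeMs) {
        return timestampMs - (timestampMs % sizeMs);
    }

    public static void main(String[] args) {
        long size = Duration.ofMinutes(5).toMillis();
        long lastClick = Instant.parse("2024-01-01T10:04:59Z").toEpochMilli();
        long nextClick = Instant.parse("2024-01-01T10:05:00Z").toEpochMilli();

        // The click at 10:04:59 still belongs to the window starting at 10:00.
        System.out.println(Instant.ofEpochMilli(windowStart(lastClick, size)));
        // The click at 10:05:00 opens the next window, so its count starts from zero.
        System.out.println(Instant.ofEpochMilli(windowStart(nextClick, size)));
    }
}
```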

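The time window types are:

- Tumbling time windows never overlap, i.e. a record will only be part of one window.
- Hopping time windows, by contrast, can overlap, so a record can be present in one or more time ranges/windows.
- Sliding time windows are meant for use with Joining operations.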
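The difference between tumbling and hopping windows comes down to how many windows contain a given timestamp. Here is a minimal plain-Java sketch, assuming non-negative timestamps and windows aligned to time zero that start every `advance` units and span `size` units. `HoppingWindowDemo` and `windowStartsFor` are hypothetical names, and the units are arbitrary (minutes in the example).

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java sketch of hopping window assignment; hypothetical illustration only.
final class HoppingWindowDemo {

    // Returns the start of every window [start, start + size) that contains the timestamp,
    // where windows begin at multiples of `advance`.
    static List<Long> windowStartsFor(long timestamp, long size, long advance) {
        List<Long> starts = new ArrayList<>();
        // Earliest window start that can still cover the timestamp.
        long first = Math.max(0, timestamp - size + advance) / advance * advance;
        for (long start = first; start <= timestamp; start += advance) {
            starts.add(start);
        }
        return starts;
    }

    public static void main(String[] args) {
        // Hopping: 5-minute windows advancing every minute, record at minute 7.
        System.out.println(windowStartsFor(7, 5, 1)); // prints "[3, 4, 5, 6, 7]"
        // Tumbling is the special case where advance equals size.
        System.out.println(windowStartsFor(7, 5, 5)); // prints "[5]"
    }
}
```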

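Joining is another type of stateful operation. It is a broad topic that deserves a blog post (or another series?) of its own.

If you want to think in terms of "sessions", i.e. periods of activity separated by a defined gap of inactivity, use windowedBy(SessionWindows windows), which returns a SessionWindowedKStream.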

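        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> stream = builder.stream(INPUT_TOPIC);

        stream.groupByKey()
                .windowedBy(SessionWindows.with(Duration.ofMinutes(5)))
                // an aggregation is needed to obtain a KTable; count() is used here as an example
                .count()
                .toStream().to(OUTPUT_TOPIC);

        return builder.build();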
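The way an inactivity gap splits activity into sessions can be modelled in a few lines of plain Java. This is a simplified sketch rather than the Kafka Streams algorithm: `SessionDemo` is a hypothetical name, the timestamps are assumed to belong to one key and to arrive in sorted order, and a record exactly one gap away from the previous one is assumed to extend the current session.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java sketch of session windowing for a single key; hypothetical illustration only.
final class SessionDemo {

    // Groups sorted timestamps into sessions, each returned as {start, end}.
    // A new session starts when the gap to the previous record exceeds `gap`.
    static List<long[]> sessions(long[] sortedTimestamps, long gap) {
        List<long[]> result = new ArrayList<>();
        for (long ts : sortedTimestamps) {
            long[] last = result.isEmpty() ? null : result.get(result.size() - 1);
            if (last != null && ts - last[1] <= gap) {
                last[1] = ts;                      // still active: extend the current session
            } else {
                result.add(new long[] {ts, ts});   // inactivity gap exceeded: open a new session
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Activity at these minutes, with a 5-minute inactivity gap.
        long[] clicks = {0, 2, 4, 12, 13, 30};
        for (long[] s : sessions(clicks, 5)) {
            System.out.println(s[0] + ".." + s[1]);
        }
        // Three sessions are printed: 0..4, 12..13, and 30..30
    }
}
```

Unlike tumbling or hopping windows, session boundaries are driven by the data itself, so sessions for different keys can have different lengths.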

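That's it for this part of the Kafka Streams blog series. Stay tuned for the next part, which will demonstrate how to test Kafka Streams applications using the built-in test utilities.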


from: https://dev.to//itnext/how-to-use-stateful-operations-in-kafka-streams-4ia1