kafka 学习笔记（2）

最新推荐文章于 2022-05-20 18:14:00 发布

weixin_33924220

最新推荐文章于 2022-05-20 18:14:00 发布

阅读量216

点赞数

文章标签：大数据网络 java

原文链接：https://my.oschina.net/dongtianxi/blog/714935

版权

2019独角兽企业重金招聘Python工程师标准>>>

Storing Offsets Outside Kafka

offset的外部保存

The consumer application need not use Kafka's built-in offset storage, it can store offsets in a store of its own choosing. The primary use case for this is allowing the application to store both the offset and the results of the consumption in the same system in a way that both the results and offsets are stored atomically. This is not always possible, but when it is it will make the consumption fully atomic and give "exactly once" semantics that are stronger than the default "at-least once" semantics you get with Kafka's offset commit functionality.

消费者程序不需要使用kafka内置的offset存储，而是可以自主选择offset的存储方式。如果能够实现offset和result的原子性保存，将会实现exactly once的事务性保证，要比kafka的offset提交机制所提供的at-least once更加强壮。

Here are a couple of examples of this type of usage:

下面是这种应用的一些例子：

If the results of the consumption are being stored in a relational database, storing the offset in the database as well can allow committing both the results and offset in a single transaction. Thus either the transaction will succeed and the offset will be updated based on what was consumed or the result will not be stored and the offset won't be updated.
消息result和offset，如果保存在传统的数据库当中，可以用数据库事务来保证result和offset的一致性，共同成功或者是共同失败。
If the results are being stored in a local store it may be possible to store the offset there as well. For example a search index could be built by subscribing to a particular partition and storing both the offset and the indexed data together. If this is done in a way that is atomic, it is often possible to have it be the case that even if a crash occurs that causes unsync'd data to be lost, whatever is left has the corresponding offset stored as well. This means that in this case the indexing process that comes back having lost recent updates just resumes indexing from what it has ensuring that no updates are lost.
如果消息result被保存在本地存储中，那么offset也可以这样来进行存储。例如：一个search index可以订阅一个特定的partition，然后把offset和被索引的数据一起来进行保存。如果能够保证操作的原子性，那么在一些异常情况发生的时候，就能够始终保证消息result和offset的一致性。

Each record comes with its own offset, so to manage your own offset you just need to do the following:

每一条记录都有自己的offset，需要管理offset，只需要做到以下几步：

This type of usage is simplest when the partition assignment is also done manually (this would be likely in the search index use case described above). If the partition assignment is done automatically special care is needed to handle the case where partition assignments change.

如果是手动分配partition的方式，这种类型的用法是非常简单的；如果是subscribe topic的方式进而自动分配partition 的话，需要特别注意partition分配的变化。

This can be done by providing a ConsumerRebalanceListener instance in the call to subscribe(Collection, ConsumerRebalanceListener) and subscribe(Pattern, ConsumerRebalanceListener). For example, when partitions are taken from a consumer the consumer will want to commit its offset for those partitions by implementing ConsumerRebalanceListener.onPartitionsRevoked(Collection). When partitions are assigned to a consumer, the consumer will want to look up the offset for those new partitions and correctly initialize the consumer to that position by implementing ConsumerRebalanceListener.onPartitionsAssigned(Collection).

在sbuscribe一个topic的时候，可以传入一个 ConsumerRebalanceListener 类型的参数。 ConsumerRebalanceListener 有2个方法， onPartitionsRevoked 和 onPartitionsAssigned ，分别监控partition的撤销和分配事件。

Another common use for ConsumerRebalanceListener is to flush any caches the application maintains for partitions that are moved elsewhere.

没理解上去！

Controlling The Consumer's Position

控制消费者位置

In most use cases the consumer will simply consume records from beginning to end, periodically committing its position (either automatically or manually). However Kafka allows the consumer to manually control its position, moving forward or backwards in a partition at will. This means a consumer can re-consume older records, or skip to the most recent records without actually consuming the intermediate records.

在大多数情况下，消费者只是需要简单的从头到尾的消费消息记录，然后周期性的提交他们的位置。但是，kafka同样允许，消费者能够手动的自主的控制自己的位置，在一个partition当中向前或者向后移动。这也就意味着消费者可以重复的消费一些记录或者是跳过一些记录。

There are several instances where manually controlling the consumer's position can be useful.

在下面几种情况当中，手动控制消费者的位置可能是有用的。

One case is for time-sensitive record processing it may make sense for a consumer that falls far enough behind to not attempt to catch up processing all records, but rather just skip to the most recent records.

某些事件敏感性较高的系统当中，如果数据量较大，可能需要跳过某些记录，以保证对最新记录的实时监控！

Another use case is for a system that maintains local state as described in the previous section. In such a system the consumer will want to initialize its position on start-up to whatever is contained in the local store. Likewise if the local state is destroyed (say because the disk is lost) the state may be recreated on a new machine by re-consuming all the data and recreating the state (assuming that Kafka is retaining sufficient history).

另一种情况就是前面提到的offset的外部保存，这种情况下，consumer可能需要在启动的时候，用本地保存的offset来初始化自己的位置。另外，如果consumer的本地存储收到了破坏，可能需要在另一台设备上重新消费所有的数据并创建和保存offset。

Kafka allows specifying the position using seek(TopicPartition, long) to specify the new position. Special methods for seeking to the earliest and latest offset the server maintains are also available ( seekToBeginning(Collection) and seekToEnd(Collection) respectively).

kafka允许我们来使用seek来指定一个新的位置，另外还有seekToBeginning和seekToEnd.

Consumption Flow Control

消费流控制

If a consumer is assigned multiple partitions to fetch data from, it will try to consume from all of them at the same time, effectively giving these partitions the same priority for consumption. However in some cases consumers may want to first focus on fetching from some subset of the assigned partitions at full speed, and only start fetching other partitions when these partitions have few or no data to consume.

如果一个消费者被分配了多个partition，那么它会同时从多个partition获取数据。但是，某些情况下，消费者可能需要首先聚焦在某些特定的partition上，当这些特定的partion消费完毕之后，在考虑其他的partition。

One of such cases is stream processing, where processor fetches from two topics and performs the join on these two streams. When one of the topics is long lagging behind the other, the processor would like to pause fetching from the ahead topic in order to get the lagging stream to catch up.

上面这段可能涉及到stream处理部分，这部分内容暂时不明，后续添加

Another example is bootstraping upon consumer starting up where there are a lot of history data to catch up, the applications usually want to get the latest data on some of the topics before consider fetching other topics.

另一种情况，消费者在启动的时候，有大量的历史数据需要获取，而应用程序可能希望，首先获取某些topic上最新的数据，之后再考虑其他的topic。

Kafka supports dynamic controlling of consumption flows by using pause(Collection) and resume(Collection) to pause the consumption on the specified assigned partitions and resume the consumption on the specified paused partitions respectively in the future poll(long) calls.

Kafka提供了动态控制机制，pause和resume。

Multi-threaded Processing

多线程处理

The Kafka consumer is NOT thread-safe. All network I/O happens in the thread of the application making the call. It is the responsibility of the user to ensure that multi-threaded access is properly synchronized. Un-synchronized access will result in ConcurrentModificationException.

kafka消费者不是线程安全的，所有的网络I/O都和方法的调用在同一个县城当中，需要用户来保证多线程之间的同步性。如果发生了非同步的访问，将会引发 ConcurrentModificationException 。

The only exception to this rule is wakeup(), which can safely be used from an external thread to interrupt an active operation. In this case, a WakeupException will be thrown from the thread blocking on the operation. This can be used to shutdown the consumer from another thread. The following snippet shows the typical pattern:

唯一的例外是wakeup方法，wakeup方法可以再外部线程当中安全的使用，来打断一个当前活动的操作。在这种情况下，一个WakeupException将会从该操作的线程阻塞当中被抛出。可以在其他的线程当中来关闭一个consumer。下面这段代码显示了这种典型的模式：

public class KafkaConsumerRunner implements Runnable {
     private final AtomicBoolean closed = new AtomicBoolean(false);
     private final KafkaConsumer consumer;

     public void run() {
         try {
             consumer.subscribe(Arrays.asList("topic"));
             while (!closed.get()) {
                 ConsumerRecords records = consumer.poll(10000);
                 // Handle new records
             }
         } catch (WakeupException e) {
             // Ignore exception if closing
             if (!closed.get()) throw e;
         } finally {
             consumer.close();
         }
     }

     // Shutdown hook which can be called from a separate thread
     public void shutdown() {
         closed.set(true);
         consumer.wakeup();
     }
 }

Then in a separate thread, the consumer can be shutdown by setting the closed flag and waking up the consumer.

在另一个线程当中，可以通过设置closed标志位和wake up consumer的方式来关闭consumer。

closed.set(true);
     consumer.wakeup();

查看源代码可以发现，consumer内部，使用的是java nio的通讯方式，所谓的wakeup，最终是wakeup一个nio selector，因为java nio的selector，本身是阻塞式的，wakeup方法可以终止selector对感兴趣事件的等待。

We have intentionally avoided implementing a particular threading model for processing. This leaves several options for implementing multi-threaded processing of records.

我们有意去避免实现某种特定的线程模型，同时我们还可能有不同的选择来实现不同的线程模型。

1. One Consumer Per Thread

1. 每个线程一个consumer

A simple option is to give each thread its own consumer instance. Here are the pros and cons of this approach:

一个简单的选择是给每个线程自己的消费实例，下面是这种方式的优缺点。

PRO: It is the easiest to implement 最容易实现
PRO: It is often the fastest as no inter-thread co-ordination is needed 通常也是最快的，因为不存在线程间的协调问题。
PRO: It makes in-order processing on a per-partition basis very easy to implement 在每个partition的基础上，按序的处理变得非常容易。(each thread just processes messages in the order it receives them).
CON: More consumers means more TCP connections to the cluster (one per thread). In general Kafka handles connections very efficiently so this is generally a small cost. 更多的消费者意味着更多的TCP连接。虽然，对于kafka来说，有着高效的控制连接的能力，这样的消耗可能不会很大。
CON: Multiple consumers means more requests being sent to the server and slightly less batching of data which can cause some drop in I/O throughput. 多个消费者意味着会有更多的请求发送给server，同时意味着数据的批量处理可能减少，这样可能会导致I/O性能的下降。
CON: The number of total threads across all processes will be limited by the total number of partitions. 总的线程数将会受到partitions数量的限制。

2. Decouple Consumption and Processing

2. 将消费和处理进行分离

Another alternative is to have one or more consumer threads that do all data consumption and hands off ConsumerRecords instances to a blocking queue consumed by a pool of processor threads that actually handle the record processing. This option likewise has pros and cons:

另外一种方案是配置若干个线程来专门负责从Kafka当中获取数据，同时把这些数据放入队列当中，其他另一组线程专门负责处理这些数据。这样的方式同样有他的优缺点：

PRO: This option allows independently scaling the number of consumers and processors. This makes it possible to have a single consumer that feeds many processor threads, avoiding any limitation on partitions. 此种方式，允许独立的调整consumer线程和process线程的数量，可以设置一个consumer来满足多个process，这样也就避免了partition数量的限制。
CON: Guaranteeing order across the processors requires particular care as the threads will execute independently an earlier chunk of data may actually be processed after a later chunk of data just due to the luck of thread execution timing. For processing that has no ordering requirements this is not a problem. 需要考虑排序的问题
CON: Manually committing the position becomes harder as it requires that all threads co-ordinate to ensure that processing is complete for that partition. 如果要考虑业务逻辑和offset提交的一致性，可能会非常复杂。

There are many possible variations on this approach. For example each processor thread can have its own queue, and the consumer threads can hash into these queues using the TopicPartition to ensure in-order consumption and simplify commit.

这种方式有着多种可能的变化，例如：每个processor 线程可以拥有自己的队列，然后consumer可以用TopicPartition来hash进入不同的队列，进而保证按序的操作和相对简单的offset提交。

转载于:https://my.oschina.net/dongtianxi/blog/714935