Consumer Group Example - High Level Consumer

Using the High Level Consumer

Why use the High Level Consumer

Sometimes the logic that reads messages from Kafka doesn't care about handling the message offsets; it just wants the data. The High Level Consumer is therefore provided to abstract away most of the details of consuming events from Kafka.


The first thing to know is that the High Level Consumer stores the last offset read from a specific partition in ZooKeeper. This offset is stored based on the name provided to Kafka when the process starts. This name is referred to as the Consumer Group.

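For reference, the 0.8 high-level consumer keeps this bookkeeping in ZooKeeper under paths of roughly the following shape (illustrative; the exact layout is version-dependent):

/consumers/<group.id>/offsets/<topic>/<partition>  ->  last committed offset
/consumers/<group.id>/owners/<topic>/<partition>   ->  consumer thread currently owning the partition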

The Consumer Group name is global across a Kafka cluster, so you should be careful that any 'old' logic consumers are shut down before starting new code. When a new process is started with the same Consumer Group name, Kafka will add that process's threads to the set of threads available to consume the Topic and trigger a 're-balance'. During this re-balance Kafka will assign available partitions to available threads, possibly moving a partition to another process. If you have a mixture of old and new business logic, it is possible that some messages will go to the old logic.


Designing a High Level Consumer

The first thing to know about using a High Level Consumer is that it can (and should!) be a multi-threaded application. The threading model revolves around the number of partitions in your topic, and there are some very specific rules:


  • if you provide more threads than there are partitions on the topic, some threads will never see a message (see the sizing sketch after this list)

  • if you have more partitions than you have threads, some threads will receive data from multiple partitions

  • if you have multiple partitions per thread there is NO guarantee about the order in which you receive messages, other than that within a partition the offsets will be sequential. For example, you may receive 5 messages from partition 10 and 6 from partition 11, then 5 more from partition 10 followed by 5 more from partition 10, even if partition 11 has data available.

  • adding more processes/threads will cause Kafka to re-balance, possibly changing the assignment of a partition to a thread.
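As a minimal sizing sketch of the first two rules (numPartitions is assumed to be known to the caller, e.g. from how the topic was created; the 0.8 high-level API does not hand it to you directly):

// Cap the consumer thread count at the partition count: any thread
// beyond that would sit idle and never receive a message.
public static int effectiveThreadCount(int requestedThreads, int numPartitions) {
    return Math.min(requestedThreads, numPartitions);
}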

Next, your logic should expect to get an iterator from Kafka that may block if there are no new messages available.


Here is an example of a very simple consumer that expects to be threaded.


package com.test.groups;
 
import kafka.consumer.ConsumerIterator;
import kafka.consumer.KafkaStream;
 
public class ConsumerTest implements Runnable {
    private KafkaStream<byte[], byte[]> m_stream;
    private int m_threadNumber;
 
    public ConsumerTest(KafkaStream<byte[], byte[]> a_stream, int a_threadNumber) {
        m_threadNumber = a_threadNumber;
        m_stream = a_stream;
    }
 
    public void run() {
        // hasNext() blocks while the stream is empty and returns false only
        // after the connector has been shut down and the stream is drained
        ConsumerIterator<byte[], byte[]> it = m_stream.iterator();
        while (it.hasNext())
            System.out.println("Thread " + m_threadNumber + ": " + new String(it.next().message()));
        System.out.println("Shutting down Thread: " + m_threadNumber);
    }
}

The interesting part here is the while (it.hasNext()) section. Basically this code reads from Kafka until you stop it.

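If you would rather not block forever, the high-level consumer also supports a consumer.timeout.ms property: when it is set, hasNext() throws a ConsumerTimeoutException after that many milliseconds of silence instead of waiting indefinitely. A minimal sketch (the 5000 ms value is just an example):

// In createConsumerConfig(): give up waiting after 5 seconds of no messages
props.put("consumer.timeout.ms", "5000");

// In run(): treat the timeout as the end of the stream
try {
    while (it.hasNext())
        System.out.println("Thread " + m_threadNumber + ": " + new String(it.next().message()));
} catch (kafka.consumer.ConsumerTimeoutException e) {
    System.out.println("No message within 5s, stopping thread " + m_threadNumber);
}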

Configuring the test application

Unlike the SimpleConsumer, the High Level Consumer takes care of a lot of the bookkeeping and error handling for you. However, you do need to tell Kafka where to store some information. The following method defines the basics for creating a High Level Consumer:


private static ConsumerConfig createConsumerConfig(String a_zookeeper, String a_groupId) {
    Properties props = new Properties();
    props.put("zookeeper.connect", a_zookeeper);
    props.put("group.id", a_groupId);
    props.put("zookeeper.session.timeout.ms", "400");
    props.put("zookeeper.sync.time.ms", "200");
    props.put("auto.commit.interval.ms", "1000");
    return new ConsumerConfig(props);
}

The ‘zookeeper.connect’ string identifies where to find one instance of ZooKeeper in your cluster. Kafka uses ZooKeeper to store offsets of messages consumed for a specific topic and partition by this Consumer Group.
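The connect string can also list several ZooKeeper hosts separated by commas (the host names below are placeholders):

props.put("zookeeper.connect", "zk1.myco.com:2181,zk2.myco.com:2181,zk3.myco.com:2181");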

The ‘group.id’ string defines the Consumer Group this process is consuming on behalf of.

The ‘zookeeper.session.timeout.ms’ setting is how many milliseconds Kafka will wait for ZooKeeper to respond to a request (read or write) before giving up and continuing to consume messages.

The ‘zookeeper.sync.time.ms’ is the number of milliseconds a ZooKeeper ‘follower’ can be behind the master before an error occurs.

The ‘auto.commit.interval.ms’ setting is how often updates to the consumed offsets are written to ZooKeeper. Note that since the commit frequency is time based rather than based on the number of messages consumed, if an error occurs between updates to ZooKeeper, you will get replayed messages on restart.
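In other words, the high-level consumer gives you at-least-once delivery, so the processing layer should tolerate replays. One defensive sketch (illustrative only; lastSeen and process() are hypothetical, and in a real system you would persist the offsets or make process() idempotent, since replays happen after a restart when in-memory state is gone):

// Inside the per-thread run() loop: skip offsets we have already handled.
// MessageAndMetadata exposes partition() and offset() in the 0.8 API.
Map<Integer, Long> lastSeen = new HashMap<Integer, Long>();
while (it.hasNext()) {
    kafka.message.MessageAndMetadata<byte[], byte[]> m = it.next();
    Long prev = lastSeen.get(m.partition());
    if (prev != null && m.offset() <= prev)
        continue; // duplicate delivered after a replay
    process(m.message()); // your business logic (hypothetical)
    lastSeen.put(m.partition(), m.offset());
}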

More information about these settings can be found in the Kafka consumer configuration documentation.

Creating the thread pool

This example uses the java.util.concurrent package for thread management, since it makes creating a thread pool very simple.

public void run(int a_numThreads) {
    Map<String, Integer> topicCountMap = new HashMap<String, Integer>();
    topicCountMap.put(topic, new Integer(a_numThreads));
    Map<String, List<KafkaStream<byte[], byte[]>>> consumerMap = consumer.createMessageStreams(topicCountMap);
    List<KafkaStream<byte[], byte[]>> streams = consumerMap.get(topic);
 
 
    // now launch all the threads
    //
    executor = Executors.newFixedThreadPool(a_numThreads);
 
    // now create an object to consume the messages
    //
    int threadNumber = 0;
    for (final KafkaStream<byte[], byte[]> stream : streams) {
        executor.execute(new ConsumerTest(stream, threadNumber));
        threadNumber++;
    }
}


First we create a Map that tells Kafka how many threads we are providing for which topics. consumer.createMessageStreams is how we pass this information to Kafka. The return value is a map containing, for each topic, the list of KafkaStream objects to listen on. (Note that here we only asked Kafka for a single topic, but we could have asked for multiple by adding another element to the Map.)

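For example, a hypothetical request for two topics at once (the second topic name is made up for illustration) looks like this; the returned map has one list of streams per topic:

Map<String, Integer> topicCountMap = new HashMap<String, Integer>();
topicCountMap.put("myTopic", 4);    // 4 threads for myTopic
topicCountMap.put("otherTopic", 2); // 2 threads for a second, hypothetical topic
Map<String, List<KafkaStream<byte[], byte[]>>> consumerMap =
        consumer.createMessageStreams(topicCountMap);
List<KafkaStream<byte[], byte[]>> myTopicStreams = consumerMap.get("myTopic");
List<KafkaStream<byte[], byte[]>> otherTopicStreams = consumerMap.get("otherTopic");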

Finally we create the thread pool and pass a new ConsumerTest object to each thread as our business logic.


Clean Shutdown and Error Handling

Kafka does not update ZooKeeper with the last-read message offset after every read; instead it waits a short period of time. Because of this delay, it is possible that your logic has consumed a message but that fact hasn't yet been synced to ZooKeeper. So if your client exits or crashes, you may find messages being replayed the next time you start.

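If you want to shrink that replay window, the ConsumerConnector also exposes commitOffsets(), which you can call yourself once a message (or batch) has actually been processed. A minimal sketch, assuming you first turn off the time-based auto commit (process() is a hypothetical stand-in for your business logic):

// In createConsumerConfig(): take over commit responsibility
props.put("auto.commit.enable", "false");

// In the processing loop: commit after the work is durably done
while (it.hasNext()) {
    process(it.next().message()); // your business logic (hypothetical)
    consumer.commitOffsets();     // write consumed offsets to ZooKeeper now
}

Committing after every single message is expensive (each call writes to ZooKeeper), so in practice you would commit per batch; this narrows the duplicate window but does not eliminate it.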

Also note that sometimes the loss of a Broker or other event that causes the Leader for a Partition to change can also cause duplicate messages to be replayed.


To help avoid this, make sure you provide a clean way for your client to exit instead of assuming it can be 'kill -9'd.


As an example, the main here sleeps for 10 seconds, which allows the background consumer threads to consume data from their streams for 10 seconds. Since auto commit is on, they will commit offsets every second. Then shutdown is called, which calls shutdown on the consumer, then on the ExecutorService, and finally tries to wait for the ExecutorService to finish all outstanding work. This gives the consumer threads time to finish processing the few outstanding messages that may remain in their streams. Shutting down the consumer causes the iterators for each stream to return false for hasNext() once all messages already received from the server are processed, so the other threads should exit gracefully. Additionally, with auto commit enabled, the call to consumer.shutdown() will commit the final offsets.


try {
    Thread.sleep(10000);
} catch (InterruptedException ie) {
 
}
example.shutdown();

In practice, a more common pattern is to sleep indefinitely and use a shutdown hook to trigger clean shutdown.

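A minimal sketch of that pattern (requires java.util.concurrent.CountDownLatch; the structure is illustrative, not the article's code):

final ConsumerGroupExample example = new ConsumerGroupExample(zooKeeper, groupId, topic);
example.run(threads);

final CountDownLatch latch = new CountDownLatch(1);
Runtime.getRuntime().addShutdownHook(new Thread() {
    public void run() {
        example.shutdown(); // commit final offsets, stop consumer threads
        latch.countDown();  // release main so the JVM can exit
    }
});

try {
    latch.await(); // sleep indefinitely until the JVM is asked to stop (e.g. SIGTERM)
} catch (InterruptedException ie) {
    Thread.currentThread().interrupt();
}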

Running the example

The example code expects the following command line parameters:

  • ZooKeeper connection string with port number
  • Consumer Group name to use for this process
  • Topic to consume messages from
  • # of threads to launch to consume the messages

For example:

server01.myco.com:2181 group3 myTopic 4

This will connect to ZooKeeper on port 2181 of server01.myco.com, request all partitions of the topic myTopic, and consume them via 4 threads. The Consumer Group for this example is group3.
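Assuming the classes are compiled and the Kafka client jars (plus their Scala and ZooKeeper dependencies) are on the classpath, the launch looks roughly like this (the classpath below is a placeholder):

java -cp <kafka-and-dependency-jars>:. com.test.groups.ConsumerGroupExample server01.myco.com:2181 group3 myTopic 4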

Full Source Code

package com.test.groups;
 
import kafka.consumer.ConsumerConfig;
import kafka.consumer.KafkaStream;
import kafka.javaapi.consumer.ConsumerConnector;
 
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Properties;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
 
public class ConsumerGroupExample {
    private final ConsumerConnector consumer;
    private final String topic;
    private  ExecutorService executor;
 
    public ConsumerGroupExample(String a_zookeeper, String a_groupId, String a_topic) {
        consumer = kafka.consumer.Consumer.createJavaConsumerConnector(
                createConsumerConfig(a_zookeeper, a_groupId));
        this.topic = a_topic;
    }
 
    public void shutdown() {
        if (consumer != null) consumer.shutdown();
        if (executor != null) executor.shutdown();
        try {
            if (!executor.awaitTermination(5000, TimeUnit.MILLISECONDS)) {
                System.out.println("Timed out waiting for consumer threads to shut down, exiting uncleanly");
            }
        } catch (InterruptedException e) {
            System.out.println("Interrupted during shutdown, exiting uncleanly");
        }
    }
 
    public void run(int a_numThreads) {
        Map<String, Integer> topicCountMap = new HashMap<String, Integer>();
        topicCountMap.put(topic, new Integer(a_numThreads));
        Map<String, List<KafkaStream<byte[], byte[]>>> consumerMap = consumer.createMessageStreams(topicCountMap);
        List<KafkaStream<byte[], byte[]>> streams = consumerMap.get(topic);
 
        // now launch all the threads
        //
        executor = Executors.newFixedThreadPool(a_numThreads);
 
        // now create an object to consume the messages
        //
        int threadNumber = 0;
        for (final KafkaStream<byte[], byte[]> stream : streams) {
            executor.submit(new ConsumerTest(stream, threadNumber));
            threadNumber++;
        }
    }
 
    private static ConsumerConfig createConsumerConfig(String a_zookeeper, String a_groupId) {
        Properties props = new Properties();
        props.put("zookeeper.connect", a_zookeeper);
        props.put("group.id", a_groupId);
        props.put("zookeeper.session.timeout.ms", "400");
        props.put("zookeeper.sync.time.ms", "200");
        props.put("auto.commit.interval.ms", "1000");
        props.put("auto.offset.reset", "smallest"); // 从最早的数据开始消费
        // props.put("auto.offset.reset", "largest "); // 从最新的数据开始消费
 
        return new ConsumerConfig(props);
    }
 
    public static void main(String[] args) {
        String zooKeeper = args[0];
        String groupId = args[1];
        String topic = args[2];
        int threads = Integer.parseInt(args[3]);
 
        ConsumerGroupExample example = new ConsumerGroupExample(zooKeeper, groupId, topic);
        example.run(threads);
 
        try {
            Thread.sleep(10000);
        } catch (InterruptedException ie) {
 
        }
        example.shutdown();
    }
}

 

Reposted from: https://my.oschina.net/yulongblog/blog/1511371
