Preface
Kafka has the concept of an offset: for each group.id, the offset records the committed read position in each partition of each topic. When a consumer program fails and restarts, it can resume reading from that position.
The offsets of a group.id can be inspected as follows:
root@h2:~# /opt/kafka/bin/kafka-consumer-groups.sh --bootstrap-server h2:9092 --group mabb-g1 --describe
Note: This will only show information about consumers that use the Java consumer API (non-ZooKeeper-based consumers).
Consumer group 'mabb-g1' has no active members.
TOPIC PARTITION CURRENT-OFFSET LOG-END-OFFSET LAG CONSUMER-ID HOST CLIENT-ID
test6 0 39156 61587 22431 - - -
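The same information can also be read programmatically. Below is a minimal sketch (not a full tool) that computes the lag for one partition; the broker address, group, and topic are taken from the output above and are assumptions about your environment:

```java
import java.util.Collections;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.StringDeserializer;

public class OffsetLagCheck {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "h2:9092");   // assumed broker address
        props.put("group.id", "mabb-g1");
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            TopicPartition tp = new TopicPartition("test6", 0);
            // committed() returns the last offset committed by this group, or null if none
            OffsetAndMetadata committed = consumer.committed(tp);
            // endOffsets() returns the log-end offset of each requested partition
            Map<TopicPartition, Long> end = consumer.endOffsets(Collections.singletonList(tp));
            long current = (committed == null) ? 0L : committed.offset();
            System.out.println("CURRENT-OFFSET=" + current
                    + " LOG-END-OFFSET=" + end.get(tp)
                    + " LAG=" + (end.get(tp) - current));
        }
    }
}
```

This mirrors the CURRENT-OFFSET / LOG-END-OFFSET / LAG columns printed by kafka-consumer-groups.sh.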
This post covers offset-related behavior for the native Kafka API, Flink, and Spark Structured Streaming.
Kafka API
Dependency:
<dependency>
    <groupId>org.apache.kafka</groupId>
    <artifactId>kafka-clients</artifactId>
    <version>1.0.0</version>
</dependency>
Consumer code:
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

import java.util.Arrays;
import java.util.Properties;

public class KafkaConsumerTest implements Runnable {
    private final KafkaConsumer<String, String> consumer;
    private ConsumerRecords<String, String> msgList;
    private String topic;
    private static final String GROUPID = "mabb-g1";

    public static void main(String[] args) {
        KafkaConsumerTest test1 = new KafkaConsumerTest("test6");
        Thread thread1 = new Thread(test1);
        thread1.start();
    }

    public KafkaConsumerTest(String topicName) {
        Properties props = new Properties();
        // broker addresses
        props.put("bootstrap.servers", "h2:9092,h3:9092");
        // consumer group; different groups consume the same data independently
        props.put("group.id", GROUPID);
        // whether to auto-commit offsets
        props.put("enable.auto.commit", "true");
        // interval between automatic offset commits
        props.put("auto.commit.interval.ms", "100");
        // session timeout
        props.put("session.timeout.ms", "30000");
        // maximum number of records returned by a single poll
        props.put("max.poll.records", 20);
        // earliest: start from the committed offset if one exists, otherwise from the beginning of the partition
        // latest:   start from the committed offset if one exists, otherwise from newly arriving records
        // none:     start from the committed offset if every partition has one; throw an exception if any partition lacks one
        // props.put("auto.offset.reset", "earliest");
        // deserializers
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());
        this.consumer = new KafkaConsumer<String, String>(props);
        this.topic = topicName;
        // subscribe to the topic
        this.consumer.subscribe(Arrays.asList(topic));
    }

    @Override
    public void run() {
        int messageNo = 1;
        try {
            for (;;) {
                msgList = consumer.poll(20);
                if (null != msgList && msgList.count() > 0) {
                    for (ConsumerRecord<String, String> record : msgList) {
                        System.out.println(messageNo + "=======receive: key = " + record.key()
                                + ", value = " + record.value() + " offset===" + record.offset());
                        messageNo++;
                    }
                } else {
                    Thread.sleep(10000);
                }
            }
        } catch (InterruptedException e) {
            e.printStackTrace();
        } finally {
            consumer.close();
        }
    }
}
attention:
1. ***enable.auto.commit*** must be set to true for the client to periodically commit the current offset to the server.
2. ***auto.offset.reset*** only takes effect when the group has no committed offset (or the committed offset is no longer valid); when a committed offset exists, consumption always resumes from it. With none, the consumer throws an exception instead of falling back.
flink
Flink commits offsets back to Kafka and lets Kafka track them, so that on failure recovery it can resume from the last position.
Code:
Properties properties = new Properties();
properties.setProperty("bootstrap.servers", "h2:9092");
properties.setProperty("group.id", "dev_test1");
properties.setProperty("enable.auto.commit", "true");
FlinkKafkaConsumer011<String> myConsumer = new FlinkKafkaConsumer011<>("topic1", new SimpleStringSchema(), properties);
// start from the latest record
// myConsumer.setStartFromLatest();
// start from the earliest record
// myConsumer.setStartFromEarliest();
// start from a given timestamp (when the record arrived in Kafka)
// long time = System.currentTimeMillis() - TimeUnit.MINUTES.toMillis(2);
// myConsumer.setStartFromTimestamp(time);
// start from the offsets committed for the group in Kafka
myConsumer.setStartFromGroupOffsets();
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
DataStream<String> dataSource1 = env.addSource(myConsumer, "kafka-source-topic1");
The commented-out calls above show the four positions from which consumption of Kafka can start: latest, earliest, a timestamp, and the committed group offsets.
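Beyond these start positions, the consumer can also be pinned to explicit per-partition offsets with setStartFromSpecificOffsets. A sketch, continuing from myConsumer above (the partition numbers and offsets here are made up for illustration):

```java
import java.util.HashMap;
import java.util.Map;
import org.apache.flink.streaming.connectors.kafka.internals.KafkaTopicPartition;

Map<KafkaTopicPartition, Long> specificOffsets = new HashMap<>();
specificOffsets.put(new KafkaTopicPartition("topic1", 0), 23L);  // partition 0 starts at offset 23
specificOffsets.put(new KafkaTopicPartition("topic1", 1), 31L);  // partition 1 starts at offset 31
myConsumer.setStartFromSpecificOffsets(specificOffsets);
```

Partitions not listed in the map fall back to the group-offsets behavior.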
spark streaming
This post does not cover how Spark Streaming maintains Kafka offsets.
spark structured streaming
Spark Structured Streaming does not use the offsets maintained by Kafka; it keeps its own set of offsets in the checkpoint, so the Kafka offset-related settings behave specially, as follows:
Note that the following Kafka params cannot be set and the Kafka source or sink will throw an exception:
Parameter | Description |
---|---|
group.id | Kafka source will create a unique group id for each query automatically. |
auto.offset.reset | Set the source option startingOffsets to specify where to start instead. Structured Streaming manages which offsets are consumed internally, rather than rely on the kafka Consumer to do it. This will ensure that no data is missed when new topics/partitions are dynamically subscribed. Note that startingOffsets only applies when a new streaming query is started, and that resuming will always pick up from where the query left off. |
enable.auto.commit | Kafka source doesn’t commit any offset. |
Note in particular the description of auto.offset.reset.
Kafka's own parameters are configured as follows:
Kafka’s own configurations can be set via DataStreamReader.option with kafka. prefix, e.g, stream.option(“kafka.bootstrap.servers”, “host:port”).
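For example, a kafka source with a kafka.-prefixed option might be created like this (a sketch in Java; the broker address and topic are assumptions):

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

SparkSession spark = SparkSession.builder().appName("offset-demo").getOrCreate();
Dataset<Row> df = spark.readStream()
        .format("kafka")
        .option("kafka.bootstrap.servers", "h2:9092")  // "kafka." prefix passes it to the Kafka consumer
        .option("subscribe", "topic1")
        .option("startingOffsets", "earliest")         // source option, used instead of auto.offset.reset
        .load();
```

Note how startingOffsets is a plain source option, while bootstrap.servers is passed through with the kafka. prefix.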
With checkpointing enabled, Spark maintains the offsets itself, and a restarted job resumes from where the previous run stopped:
val query = df2.writeStream
.outputMode("append")
.option("checkpointLocation", "/path/to/checkpoint")
.format("console")
.option("truncate", "false")
.start()
attention
The interval at which the Kafka client sends offsets to the server is set by ***auto.commit.interval.ms***. If data has been fetched but its offset not yet committed when the program dies, that data is consumed again on the next run.
To avoid this when using the Kafka API directly, you can commit offsets manually.
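A minimal sketch of manual committing, reusing the broker, group, and topic from the earlier example: disable enable.auto.commit and call commitSync() only after the fetched batch has been processed.

```java
import java.util.Arrays;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class ManualCommitConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "h2:9092");   // assumed broker address
        props.put("group.id", "mabb-g1");
        props.put("enable.auto.commit", "false");    // disable periodic auto commit
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Arrays.asList("test6"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(1000);
                for (ConsumerRecord<String, String> record : records) {
                    System.out.println("offset=" + record.offset() + " value=" + record.value());
                }
                // commit only after the batch is fully processed;
                // a crash before this line means the batch is read again on restart
                consumer.commitSync();
            }
        }
    }
}
```

This narrows the duplicate-read window to a single batch instead of one auto-commit interval.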
With Flink, offsets are committed to Kafka only after a checkpoint completes, and checkpoints themselves run at an interval, so a small amount of duplicate reads can still occur. Code:
public class KafkaTest {
    public static void main(String[] args) throws Exception {
        Properties properties = new Properties();
        properties.setProperty("bootstrap.servers", "h2:9092");
        properties.setProperty("group.id", "test-g1");
        properties.setProperty("enable.auto.commit", "true");
        FlinkKafkaConsumer011<String> myConsumer = new FlinkKafkaConsumer011<>("topic1", new SimpleStringSchema(), properties);
        myConsumer.setStartFromGroupOffsets();
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.enableCheckpointing(1000);
        env.setStateBackend(new FsStateBackend("file:///Users/mabinbin/delete/checkpoint_dir/" + KafkaTest.class.getName()));
        DataStream<String> dataSource1 = env.addSource(myConsumer, "kafka-source-topic1");
        dataSource1.print();
        env.execute("KafkaTest");
    }
}
References
Structured Streaming + Kafka Integration Guide (Kafka broker version 0.10.0 or higher)