Preface
Kafka has the concept of an offset: for each group.id, the offset records the committed read position in each partition of each topic. When a consumer program fails and restarts, it can resume reading from that position.
The offsets of a group.id can be inspected as follows:
root@h2:~# /opt/kafka/bin/kafka-consumer-groups.sh --bootstrap-server h2:9092 --group mabb-g1 --describe
Note: This will only show information about consumers that use the Java consumer API (non-ZooKeeper-based consumers).
Consumer group 'mabb-g1' has no active members.
TOPIC PARTITION CURRENT-OFFSET LOG-END-OFFSET LAG CONSUMER-ID HOST CLIENT-ID
test6 0 39156 61587 22431 - - -
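The same information can also be read programmatically. Below is a minimal sketch (not a full tool) that computes the lag for one partition; the broker address, group, and topic are taken from the output above and are assumptions about your environment:

```java
import java.util.Collections;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.StringDeserializer;

public class OffsetLagCheck {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "h2:9092");   // assumed broker address
        props.put("group.id", "mabb-g1");
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            TopicPartition tp = new TopicPartition("test6", 0);
            // committed() returns the last offset committed by this group, or null if none
            OffsetAndMetadata committed = consumer.committed(tp);
            // endOffsets() returns the log-end offset of each requested partition
            Map<TopicPartition, Long> end = consumer.endOffsets(Collections.singletonList(tp));
            long current = (committed == null) ? 0L : committed.offset();
            System.out.println("CURRENT-OFFSET=" + current
                    + " LOG-END-OFFSET=" + end.get(tp)
                    + " LAG=" + (end.get(tp) - current));
        }
    }
}
```

This mirrors the CURRENT-OFFSET / LOG-END-OFFSET / LAG columns printed by kafka-consumer-groups.sh.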
This post covers offset-related behavior for the native Kafka API, Flink, and Spark Structured Streaming.
Kafka API
Dependency:
<dependency>
    <groupId>org.apache.kafka</groupId>
    <artifactId>kafka-clients</artifactId>
    <version>1.0.0</version>
</dependency>
Consumer code:
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

import java.util.Arrays;
import java.util.Properties;

public class KafkaConsumerTest implements Runnable {
    private final KafkaConsumer<String, String> consumer;
    private ConsumerRecords<String, String> msgList;
    private String topic;
    private static final String GROUPID = "mabb-g1";

    public static void main(String[] args) {
        KafkaConsumerTest test1 = new KafkaConsumerTest("test6");
        Thread thread1 = new Thread(test1);
        thread1.start();
    }

    public KafkaConsumerTest(String topicName) {
        Properties props = new Properties();
        // broker addresses
        props.put("bootstrap.servers", "h2:9092,h3:9092");
        // consumer group; different groups consume the same data independently
        props.put("group.id", GROUPID);
        // whether to auto-commit offsets
        props.put("enable.auto.commit", "true");
        // interval between automatic offset commits
        props.put("auto.commit.interval.ms", "100");
        // session timeout
        props.put("session.timeout.ms", "30000");
        // maximum number of records returned by a single poll
        props.put("max.poll.records", 20);
        // earliest: start from the committed offset if one exists, otherwise from the beginning of the partition
        // latest:   start from the committed offset if one exists, otherwise from newly arriving records
        // none:     start from the committed offset if every partition has one; throw an exception if any partition lacks one
        // props.put("auto.offset.reset", "earliest");
        // deserializers
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());
        this.consumer = new KafkaConsumer<String, String>(props);
        this.topic = topicName;
        // subscribe to the topic
        this.consumer.subscribe(Arrays.asList(topic));
    }

    @Override
    public void run() {
        int messageNo = 1;
        try {
            for (;;) {
                msgList = consumer.poll(20);
                if (null != msgList && msgList.count() > 0) {
                    for (ConsumerRecord<String, String> record : msgList) {
                        System.out.println(messageNo + "=======receive: key = " + record.key()
                                + ", value = " + record.value() + " offset===" + record.offset());
                        messageNo++;
                    }
                } else {
                    Thread.sleep(10000);
                }
            }
        } catch (InterruptedException e) {
            e.printStackTrace();
        } finally {
            consumer.close();
        }
    }
}
attention:
1. ***enable.auto.commit*** must be set to true for the client to periodically commit the current offset to the server.
2. ***auto.offset.reset*** only takes effect when the group has no committed offset (or the committed offset is no longer valid); when a committed offset exists, consumption always resumes from it. With none, the consumer throws an exception instead of falling back.
flink
Flink commits offsets back to Kafka and lets Kafka track them, so that on failure recovery it can resume from the last position.
Code:
Properties properties = new Properties();
properties.setProperty("bootstrap.servers", "h2:9092");
properties.setProperty("group.id", "dev_test1");
properties.setProperty("enable.auto.commit", "true");
FlinkKafkaConsumer011<String> myConsumer = new FlinkKafkaConsumer011<>("topic1", new SimpleStringSchema(), properties);
// start from the latest record
// myConsumer.setStartFromLatest();
// start from the earliest record
// myConsumer.setStartFromEarliest();
// start from a given timestamp (when the record arrived in Kafka)
// long time = System.currentTimeMillis() - TimeUnit.MINUTES.toMillis(2);
// myConsumer.setStartFromTimestamp(time);
// start from the offsets committed for the group in Kafka
myConsumer.setStartFromGroupOffsets();
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
DataStream<String> dataSource1 = env.addSource(myConsumer, "kafka-source-topic1");
The commented-out calls above show the four positions from which consumption of Kafka can start: latest, earliest, a timestamp, and the committed group offsets.
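Beyond these start positions, the consumer can also be pinned to explicit per-partition offsets with setStartFromSpecificOffsets. A sketch, continuing from myConsumer above (the partition numbers and offsets here are made up for illustration):

```java
import java.util.HashMap;
import java.util.Map;
import org.apache.flink.streaming.connectors.kafka.internals.KafkaTopicPartition;

Map<KafkaTopicPartition, Long> specificOffsets = new HashMap<>();
specificOffsets.put(new KafkaTopicPartition("topic1", 0), 23L);  // partition 0 starts at offset 23
specificOffsets.put(new KafkaTopicPartition("topic1", 1), 31L);  // partition 1 starts at offset 31
myConsumer.setStartFromSpecificOffsets(specificOffsets);
```

Partitions not listed in the map fall back to the group-offsets behavior.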
spark streaming
This post does not cover how Spark Streaming maintains Kafka offsets.
spark structured streaming
Spark Structured Streaming does not use the offsets maintained by Kafka; it keeps its own set of offsets in the checkpoint, so the Kafka offset-related settings behave specially, as follows:
Note that the following Kafka params cannot be set and the Kafka source or sink will throw an exception:
Parameter | Description |
---|---|
group.id | Kafka source will create a unique group id for each query automatically. |
auto.offset.reset | Set the source option startingOffsets to specify where to start instead. Structured Streaming manages which offsets are consumed internally, rather than rely on the kafka Consumer to do it. This will ensure that no data is missed when new topics/partitions are dynamically subscribed. Note that startingOffsets only applies when a new streaming query is started, and that resuming will always pick up from where the query left off. |
enable.auto.commit | Kafka source doesn’t commit any offset. |
Note in particular the description of auto.offset.reset.
Kafka's own parameters are configured as follows:
Kafka’s own configurations can be set via DataStreamReader.option with kafka. prefix, e.g, stream.option(“kafka.bootstrap.servers”, “host:port”).
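For example, a kafka source with a kafka.-prefixed option might be created like this (a sketch in Java; the broker address and topic are assumptions):

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

SparkSession spark = SparkSession.builder().appName("offset-demo").getOrCreate();
Dataset<Row> df = spark.readStream()
        .format("kafka")
        .option("kafka.bootstrap.servers", "h2:9092")  // "kafka." prefix passes it to the Kafka consumer
        .option("subscribe", "topic1")
        .option("startingOffsets", "earliest")         // source option, used instead of auto.offset.reset
        .load();
```

Note how startingOffsets is a plain source option, while bootstrap.servers is passed through with the kafka. prefix.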
With checkpointing enabled, Spark maintains the offsets itself, and a restarted job resumes from where the previous run stopped:
val query = df2.writeStream
.outputMode("append")
.option("checkpointLocation", "/path/to/checkpoint")
.format("console")
.option("truncate", "false")
.start()
attention
The interval at which the Kafka client sends offsets to the server is set by ***auto.commit.interval.ms***. If data has been fetched but its offset not yet committed when the program dies, that data is consumed again on the next run.
To avoid this when using the Kafka API directly, you can commit offsets manually.
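A minimal sketch of manual committing, reusing the broker, group, and topic from the earlier example: disable enable.auto.commit and call commitSync() only after the fetched batch has been processed.

```java
import java.util.Arrays;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class ManualCommitConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "h2:9092");   // assumed broker address
        props.put("group.id", "mabb-g1");
        props.put("enable.auto.commit", "false");    // disable periodic auto commit
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Arrays.asList("test6"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(1000);
                for (ConsumerRecord<String, String> record : records) {
                    System.out.println("offset=" + record.offset() + " value=" + record.value());
                }
                // commit only after the batch is fully processed;
                // a crash before this line means the batch is read again on restart
                consumer.commitSync();
            }
        }
    }
}
```

This narrows the duplicate-read window to a single batch instead of one auto-commit interval.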
With Flink, offsets are committed to Kafka only after a checkpoint completes, and checkpoints themselves run at an interval, so a small amount of duplicate reads can still occur. Code:
public class KafkaTest {
    public static void main(String[] args) throws Exception {
        Properties properties = new Properties();
        properties.setProperty("bootstrap.servers", "h2:9092");
        properties.setProperty("group.id", "test-g1");
        properties.setProperty("enable.auto.commit", "true");
        FlinkKafkaConsumer011<String> myConsumer = new FlinkKafkaConsumer011<>("topic1", new SimpleStringSchema(), properties);
        myConsumer.setStartFromGroupOffsets();
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.enableCheckpointing(1000);
        env.setStateBackend(new FsStateBackend("file:///Users/mabinbin/delete/checkpoint_dir/" + KafkaTest.class.getName()));
        DataStream<String> dataSource1 = env.addSource(myConsumer, "kafka-source-topic1");
        dataSource1.print();
        env.execute("KafkaTest");
    }
}
References
Structured Streaming + Kafka Integration Guide (Kafka broker version 0.10.0 or higher)