I'm trying to build an application with kafka-python where a consumer reads data from a range of topics. It is extremely important that the consumer never reads the same message twice, but also never misses a message.
Everything seems to work fine, except when I shut the consumer down (e.g. after a failure) and try to resume reading from the last offset. I can only re-read all the messages from the topic (which creates double reads) or listen for new messages only (and miss messages that were emitted during the breakdown). I don't encounter this problem when pausing the consumer.
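To make it concrete, this is roughly the restart behavior I'm after (just a sketch: last_offset is a placeholder for an offset I would somehow have to persist myself before the crash, and I'm assuming the topic has a single partition 0):

from kafka import KafkaConsumer, TopicPartition

consumer = KafkaConsumer(bootstrap_servers=['localhost:9092'])
tp = TopicPartition('numtest', 0)   # assuming a single partition
consumer.assign([tp])               # assign manually instead of subscribing

last_offset = 42                    # placeholder: offset persisted before the crash
consumer.seek(tp, last_offset + 1)  # resume right after the last processed message

for message in consumer:
    pass  # first the messages emitted during the downtime, then the new ones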
I created an isolated simulation in order to try to solve the problem.
Here is the generic producer:
from time import sleep
from json import dumps
from kafka import KafkaProducer

producer = KafkaProducer(bootstrap_servers=['localhost:9092'])

x = 0  # set manually to avoid duplicates

for e in range(1000):
    if e <= x:
        continue  # skip numbers that were already produced
    data = dumps({'number': e}).encode('utf-8')
    producer.send('numtest', value=data)
    print(e, 'sent.')
    sleep(5)
And the consumer. If auto_offset_reset is set to 'earliest', all the messages are read again on every restart (double reads). If it is set to 'latest', the messages emitted during the downtime are missed.
from kafka import KafkaConsumer
from pymongo import MongoClient
from json import loads

## Retrieve data from kafka (WHAT ABOUT MISSED MESSAGES?)
consumer = KafkaConsumer('numtest', bootstrap_servers=['localhost:9092'],
                         auto_offset_reset='earliest', enable_auto_commit=True,
                         auto_commit_interval_ms=1000)

## Connect to database
client = MongoClient('localhost:27017')
collection = client.counttest.counttest

## Store each message in the database
for message in consumer:
    message = loads(message.value.decode('utf-8'))
    collection.insert_one(message)
    print('{} added to {}'.format(message, collection))
I feel like the auto-commit isn't working properly.
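To check that suspicion, I tried inspecting what offset, if any, actually got committed (a sketch; 'numtest-group' is a made-up group name, since my consumer above doesn't set a group_id at all, and offsets are committed per consumer group):

from kafka import KafkaConsumer, TopicPartition

check = KafkaConsumer(bootstrap_servers=['localhost:9092'],
                      group_id='numtest-group',  # made-up group name
                      enable_auto_commit=False)
tp = TopicPartition('numtest', 0)
print(check.committed(tp))  # None would mean nothing was ever committed for this group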
I know that this question is similar to this one, but I would like a specific solution.
Thanks for helping me out.