The problem was discovered when the Kafka consumer offset stored in ZooKeeper had not been updated for a long time. Adding log statements revealed the cause:
lastOffset is actually taken from pendingOffsets, the set that records the offset of every emitted message. In the source this is a plain TreeSet, even though Alibaba's implementation handles ack and fail asynchronously; surprisingly it does not use a ConcurrentSkipListSet. Entries are removed from pendingOffsets only on ack. When a message at some offset fails, the fail path touches a different TreeSet and simply calls remove on it, but nothing is ever added to that set, so the remove is a no-op. A failed offset therefore stays in pendingOffsets forever, lastOffset can never advance past it, and the offset in ZooKeeper stops updating. The two screenshots below show the relevant code:
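Since the exact shape of the code matters here, the following is a minimal reconstruction of the buggy bookkeeping based only on the description above; all class, field, and method names are assumptions, not the literal storm-kafka source:

import java.util.TreeSet;

// Hypothetical sketch of the buggy state, reconstructed from the description.
class PendingOffsetsSketch {
    // Plain TreeSets, mutated from asynchronous ack/fail callbacks with no
    // synchronization at all.
    private final TreeSet<Long> pendingOffsets = new TreeSet<>();
    private final TreeSet<Long> failedOffsets = new TreeSet<>();

    void emit(long offset) {
        pendingOffsets.add(offset);     // every emitted offset is recorded
    }

    void ack(long offset) {
        pendingOffsets.remove(offset);  // only ack ever cleans pendingOffsets
    }

    void fail(long offset) {
        failedOffsets.remove(offset);   // no-op: nothing is ever added to this set,
                                        // and pendingOffsets is left untouched
    }

    long lastOffset() {
        // The offset committed to ZooKeeper derives from the smallest pending
        // offset, so a single failed message pins it forever.
        return pendingOffsets.isEmpty() ? -1L : pendingOffsets.first();
    }
}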
The fix:
Replace the original TreeSet with a ConcurrentSkipListSet, so the asynchronous ack and fail callbacks can mutate the sets safely.
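A minimal sketch of the resulting declarations, assuming the two sets are fields of PartitionConsumer named pendingOffsets and failedOffsets as in the code below (for an ordered set, KafkaMessageId must implement Comparable, e.g. ordered by offset):

import java.util.concurrent.ConcurrentSkipListSet;

// Thread-safe ordered sets, safe to mutate from the asynchronous ack/fail callbacks.
private final ConcurrentSkipListSet<Long> pendingOffsets = new ConcurrentSkipListSet<>();
private final ConcurrentSkipListSet<KafkaMessageId> failedOffsets = new ConcurrentSkipListSet<>();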
The fail method then becomes:
public void fail(KafkaMessageId fail) {
    // Record the failure before releasing the pending slot, so the offset is
    // never absent from both sets at once.
    failedOffsets.add(fail);
    pendingOffsets.remove(fail.getOffset());
}
Then, as the first line of PartitionConsumer's emit method, add the following, so that failed messages are re-queued before any new ones are emitted:
if (!failedOffsets.isEmpty()) {
    fillFailMessage();
}
private void fillFailMessage() {
    try {
        if (failedOffsets.isEmpty()) {
            return;
        }
        // Fetch starting from the smallest failed offset; the fetch returns that
        // message and everything after it, so one round trip can recover several
        // failed messages at once.
        KafkaMessageId first = failedOffsets.first();
        ByteBufferMessageSet msgs = consumer.fetchMessages(first.getPartition(), first.getOffset());
        // Snapshot the failed offsets for cheap membership checks.
        Set<Long> failed = failedOffsets.stream()
                .map(KafkaMessageId::getOffset)
                .collect(Collectors.toSet());
        for (MessageAndOffset msg : msgs) {
            if (failed.contains(msg.offset())) {
                LOG.info("failToSend data is partition: " + partition + ", offset: " + msg.offset()
                        + ", failedOffsets size: " + failed.size());
                // Put the offset back into pendingOffsets and queue the message
                // for re-emission, then drop it from the failed set.
                pendingOffsets.add(msg.offset());
                emittingMessages.add(msg);
                failedOffsets.removeIf(k -> k.getOffset() == msg.offset());
            }
        }
    } catch (Exception e) {
        LOG.error(e.getMessage(), e);
    }
}
Here you can decide for yourself whether to filter out messages that have already been sent successfully. That's all.
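To make the bookkeeping concrete, here is a hypothetical walk-through of the fixed flow with everything except the two sets stripped away (bare offsets stand in for full KafkaMessageId objects):

import java.util.concurrent.ConcurrentSkipListSet;

public class FixedBookkeepingDemo {
    public static void main(String[] args) {
        ConcurrentSkipListSet<Long> pendingOffsets = new ConcurrentSkipListSet<>();
        ConcurrentSkipListSet<Long> failedOffsets = new ConcurrentSkipListSet<>();

        // Emit offsets 100..102.
        for (long o = 100; o <= 102; o++) {
            pendingOffsets.add(o);
        }

        // 100 and 102 are acked; 101 fails.
        pendingOffsets.remove(100L);
        pendingOffsets.remove(102L);
        failedOffsets.add(101L);       // fixed fail(): record the failure...
        pendingOffsets.remove(101L);   // ...and release the pending slot

        // lastOffset is no longer pinned at 100; on the next emit,
        // fillFailMessage() moves 101 back into pendingOffsets for a retry.
        System.out.println("pending after fail: " + pendingOffsets);  // []
        System.out.println("queued for retry: " + failedOffsets);     // [101]
    }
}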