Message Middleware: Kafka (Part 2), Advanced Topics

烟锁迷城


Original article: https://blog.csdn.net/jiayibingdong/article/details/114192284


I. Producer Internals

1. Producer structure

[Image: producer structure diagram]

2. Source code analysis

When a producer is created:

	Producer<String,String> producer = new KafkaProducer<String,String>(pros);

The Properties constructor delegates through this(...) to the full constructor:

	public KafkaProducer(Properties properties) {
        this(propsToMap(properties), (Serializer)null, (Serializer)null, (ProducerMetadata)null, (KafkaClient)null, (ProducerInterceptors)null, Time.SYSTEM);
    }

Inside that full constructor a Sender is created; it runs on a background I/O thread.

	this.sender = this.newSender(logContext, kafkaClient, this.metadata);

Calling send first passes the record through the chain of interceptors.

	public Future<RecordMetadata> send(ProducerRecord<K, V> record, Callback callback) {
        ProducerRecord<K, V> interceptedRecord = this.interceptors.onSend(record);
        return this.doSend(interceptedRecord, callback);
    }

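Interceptors can be customized to add whatever extra processing you need.

import java.util.Map;
import org.apache.kafka.clients.producer.ProducerInterceptor;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.clients.producer.RecordMetadata;

public class MyInterceptor implements ProducerInterceptor<String, String> {
    // Called when a message is about to be sent
    @Override
    public ProducerRecord<String, String> onSend(ProducerRecord<String, String> record) {
        System.out.println("sending message");
        return record;
    }

    // Called when the server's ACK arrives, or when the send fails
    @Override
    public void onAcknowledgement(RecordMetadata metadata, Exception exception) {
        System.out.println("message acknowledged");
    }

    @Override
    public void close() {
        System.out.println("producer closed");
    }

    // Called with the producer's key-value configuration when the interceptor is set up
    @Override
    public void configure(Map<String, ?> configs) {
        System.out.println("configuration received");
    }
}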
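
As a rough mental model of what this.interceptors.onSend(record) does, the sketch below runs several interceptors in registration order and carries on when one of them throws. It uses JDK types only: InterceptorChain and the String stand-in for a record are hypothetical names for illustration, not Kafka classes. In a real producer the interceptor class is registered through the interceptor.classes setting.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.UnaryOperator;

// Hypothetical stand-in for the producer's interceptor chain (not a Kafka class).
final class InterceptorChain {
    private final List<UnaryOperator<String>> interceptors = new ArrayList<>();

    InterceptorChain add(UnaryOperator<String> interceptor) {
        interceptors.add(interceptor);
        return this;
    }

    // Pass the record through every interceptor in order.
    // A failing interceptor is skipped so that it cannot block the send.
    String onSend(String record) {
        String current = record;
        for (UnaryOperator<String> interceptor : interceptors) {
            try {
                current = interceptor.apply(current);
            } catch (RuntimeException e) {
                System.err.println("interceptor failed, skipping: " + e.getMessage());
            }
        }
        return current;
    }

    public static void main(String[] args) {
        InterceptorChain chain = new InterceptorChain()
                .add(r -> r + "|traced")
                .add(r -> { throw new IllegalStateException("boom"); })
                .add(String::toUpperCase);
        System.out.println(chain.onSend("hello")); // HELLO|TRACED
    }
}
```

The second interceptor fails, yet the record still reaches the third one, which is the behavior you want from code that sits on the send path.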

doSend, the method that actually performs the send, uses two serializers: keySerializer and valueSerializer. Both implement Serializer, and you can supply your own implementation.

try {
	   serializedKey = this.keySerializer.serialize(record.topic(), record.headers(), record.key());
} catch (ClassCastException var21) {
	   throw new SerializationException("Can't convert key of class " + record.key().getClass().getName() + " to class " + this.producerConfig.getClass("key.serializer").getName() + " specified in key.serializer", var21);
}

byte[] serializedValue;
try {
	   serializedValue = this.valueSerializer.serialize(record.topic(), record.headers(), record.value());
} catch (ClassCastException var20) {
	   throw new SerializationException("Can't convert value of class " + record.value().getClass().getName() + " to class " + this.producerConfig.getClass("value.serializer").getName() + " specified in value.serializer", var20);
}

After serialization, the producer works out which partition the record goes to:

int partition = this.partition(record, serializedKey, serializedValue, cluster);

The method looks like this:

private int partition(ProducerRecord<K, V> record, byte[] serializedKey, byte[] serializedValue, Cluster cluster) {
        Integer partition = record.partition();
        return partition != null ? partition : 
        	this.partitioner.partition(
        		record.topic(), record.key(), serializedKey, record.value(), serializedValue, cluster);
    }

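The ternary expression decides what happens next.
The routing rules are:

If the record specifies a partition, that partition is used.
If no partition is specified and a custom partitioner is configured, the custom rule decides.
If no partition is specified, there is no custom partitioner, and the key is not null, the key is hashed and taken modulo the partition count.
If no partition is specified, there is no custom partitioner, and the key is null, the default partitioner uses a sticky partition: it picks a partition at random and keeps using it until a new batch is started. (Clients older than 2.4 used an incrementing counter modulo the partition count instead.)
The partitioner is the Partitioner interface; implement it to define your own rule.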

public class SimplePartitioner implements Partitioner {
    public SimplePartitioner() {

    }

    @Override
    public void configure(Map<String, ?> configs) {
    }

    @Override
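    public int partition(String topic, Object key, byte[] keyBytes, Object value, byte[] valueBytes, Cluster cluster) {
        // Assumes every key is a numeric string and the topic has at least two partitions.
        // Records without a key go to partition 0 instead of failing in parseInt.
        if (key == null) {
            return 0;
        }
        // Even keys go to partition 0, odd keys to partition 1.
        return Integer.parseInt((String) key) % 2 == 0 ? 0 : 1;
    }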

    @Override
    public void close() {
    }
}
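
The routing order described above can be condensed into a few lines. This is a minimal sketch under stated assumptions: PartitionRouting and NullKeyStrategy are hypothetical names, and Arrays.hashCode merely stands in for the real key hash. A custom partitioner such as SimplePartitioner would replace steps 2 and 3 entirely.

```java
import java.util.Arrays;

// Hypothetical model of the routing order; not Kafka code.
final class PartitionRouting {
    // Stand-in for whatever the partitioner does when the record has no key.
    interface NullKeyStrategy {
        int choose(int numPartitions);
    }

    static int route(Integer explicitPartition, byte[] keyBytes,
                     int numPartitions, NullKeyStrategy fallback) {
        if (explicitPartition != null) {
            return explicitPartition;            // 1. the record names its partition
        }
        if (keyBytes != null) {
            // 2. hash the serialized key; Arrays.hashCode is only a placeholder hash
            return (Arrays.hashCode(keyBytes) & 0x7fffffff) % numPartitions;
        }
        return fallback.choose(numPartitions);   // 3. no key: the strategy decides
    }
}
```

Because step 2 depends only on the key bytes and the partition count, records with the same key land in the same partition as long as the partition count does not change.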

When no custom partitioner is configured, the default partitioner is used:

public class DefaultPartitioner implements Partitioner {
    private final StickyPartitionCache stickyPartitionCache = new StickyPartitionCache();

    public DefaultPartitioner() {
    }

    public void configure(Map<String, ?> configs) {
    }

    public int partition(String topic, Object key, byte[] keyBytes, Object value, byte[] valueBytes, Cluster cluster) {
        return this.partition(topic, key, keyBytes, value, valueBytes, cluster, cluster.partitionsForTopic(topic).size());
    }

    public int partition(String topic, Object key, byte[] keyBytes, Object value, byte[] valueBytes, Cluster cluster, int numPartitions) {
        return keyBytes == null ? this.stickyPartitionCache.partition(topic, cluster) : Utils.toPositive(Utils.murmur2(keyBytes)) % numPartitions;
    }

    public void close() {
    }

    public void onNewBatch(String topic, Cluster cluster, int prevPartition) {
        this.stickyPartitionCache.nextPartition(topic, cluster, prevPartition);
    }
}

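If keyBytes is not null, Utils.murmur2 (a fast non-cryptographic hash function, not a consistent-hashing scheme) is applied and the result is taken modulo the number of partitions: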

Utils.toPositive(Utils.murmur2(keyBytes)) % numPartitions;
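
Why toPositive rather than Math.abs? A 32-bit hash may be Integer.MIN_VALUE, whose absolute value overflows and stays negative, which would yield a negative partition index. The sketch below reproduces the sign-bit mask; ToPositiveDemo is a hypothetical class name used only for this illustration.

```java
final class ToPositiveDemo {
    // Clear the sign bit so the result is never negative.
    static int toPositive(int n) {
        return n & 0x7fffffff;
    }

    public static void main(String[] args) {
        int hash = Integer.MIN_VALUE;              // a legal 32-bit hash value
        System.out.println(Math.abs(hash) % 6);    // negative: Math.abs overflows here
        System.out.println(toPositive(hash) % 6);  // 0, a valid partition index
    }
}
```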

If keyBytes is null:

this.stickyPartitionCache.partition(topic, cluster);

The logic inside that method is:

public int partition(String topic, Cluster cluster) {
        Integer part = (Integer)this.indexCache.get(topic);
        return part == null ? this.nextPartition(topic, cluster, -1) : part;
    }

    public int nextPartition(String topic, Cluster cluster, int prevPartition) {
        List<PartitionInfo> partitions = cluster.partitionsForTopic(topic);
        Integer oldPart = (Integer)this.indexCache.get(topic);
        Integer newPart = oldPart;
        if (oldPart != null && oldPart != prevPartition) {
            return (Integer)this.indexCache.get(topic);
        } else {
            List<PartitionInfo> availablePartitions = cluster.availablePartitionsForTopic(topic);
            Integer random;
            if (availablePartitions.size() < 1) {
                random = Utils.toPositive(ThreadLocalRandom.current().nextInt());
                newPart = random % partitions.size();
            } else if (availablePartitions.size() == 1) {
                newPart = ((PartitionInfo)availablePartitions.get(0)).partition();
            } else {
                while(newPart == null || newPart.equals(oldPart)) {
                    random = Utils.toPositive(ThreadLocalRandom.current().nextInt());
                    newPart = ((PartitionInfo)availablePartitions.get(random % availablePartitions.size())).partition();
                }
            }

            if (oldPart == null) {
                this.indexCache.putIfAbsent(topic, newPart);
            } else {
                this.indexCache.replace(topic, prevPartition, newPart);
            }

            return (Integer)this.indexCache.get(topic);
        }
    }

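ThreadLocalRandom.current().nextInt() produces a random integer, which is reduced modulo the partition count to choose a partition. The result is stored in indexCache, so later records with a null key keep going to the same partition; nothing is incremented. Only when a new batch is created does onNewBatch call nextPartition to pick a different random partition. The branch below handles the case where no partition is currently available:

if (availablePartitions.size() < 1) {
	// Key code: pick at random among all partitions of the topic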
    random = Utils.toPositive(ThreadLocalRandom.current().nextInt());
    newPart = random % partitions.size();
}
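
To make the sticky behavior concrete, here is a simplified model of the cache. It is a sketch, not the real class: StickyModel is a hypothetical name, and partition availability is ignored so that only the "stick, then switch on a new batch" logic remains.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.ThreadLocalRandom;

// Simplified model of sticky partitioning; ignores partition availability.
final class StickyModel {
    private final ConcurrentMap<String, Integer> indexCache = new ConcurrentHashMap<>();
    private final int numPartitions;

    StickyModel(int numPartitions) {
        this.numPartitions = numPartitions;
    }

    // Keyless records keep going to the cached partition.
    int partition(String topic) {
        Integer part = indexCache.get(topic);
        return part == null ? nextPartition(topic, -1) : part;
    }

    // Called when a new batch starts: move to a different random partition.
    int nextPartition(String topic, int prevPartition) {
        Integer oldPart = indexCache.get(topic);
        if (oldPart != null && oldPart != prevPartition) {
            return oldPart; // another caller already switched partitions
        }
        Integer newPart = oldPart;
        while (newPart == null || (numPartitions > 1 && newPart.equals(oldPart))) {
            newPart = ThreadLocalRandom.current().nextInt(numPartitions);
        }
        if (oldPart == null) {
            indexCache.putIfAbsent(topic, newPart);
        } else {
            indexCache.replace(topic, prevPartition, newPart);
        }
        return indexCache.get(topic);
    }
}
```

Calling partition repeatedly returns the same value; only nextPartition with the current partition as prevPartition moves the topic elsewhere.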

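After the partition is chosen in doSend the message has to be sent, but Kafka does not send each message immediately. It accumulates messages until a batch reaches its size limit (batch.size) or the wait time (linger.ms) expires, so it needs a method that accumulates messages: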

result = this.accumulator.append(tp, timestamp, serializedKey, serializedValue, headers, interceptCallback, remainingWaitMs, false, nowMs);

this.accumulator is a RecordAccumulator, which at its core holds a ConcurrentMap:

private final ConcurrentMap<TopicPartition, Deque<ProducerBatch>> batches;

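The map is keyed by TopicPartition, so every partition has its own queue of batches. If the current batch is full or a new batch was created (batchIsFull says whether the batch is full, newBatchCreated says whether this append created a new batch), the sender thread created at the beginning is woken up: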

if (result.batchIsFull || result.newBatchCreated) {
	this.log.trace("Waking up the sender since topic {} partition {} is either full or getting a new batch", record.topic(), partition);
	this.sender.wakeup();
}
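
The accumulate-then-wake-up idea can be sketched with JDK collections alone. MiniAccumulator and its nested types are hypothetical names for illustration; the constructor argument plays the role of batch.size, and the real class additionally handles memory pooling, compression and timeouts that are left out here.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Toy accumulator: one deque of batches per topic-partition, batches capped by byte size.
final class MiniAccumulator {
    static final class AppendResult {
        final boolean batchIsFull;
        final boolean newBatchCreated;

        AppendResult(boolean batchIsFull, boolean newBatchCreated) {
            this.batchIsFull = batchIsFull;
            this.newBatchCreated = newBatchCreated;
        }
    }

    private static final class Batch {
        final List<byte[]> records = new ArrayList<>();
        int bytes;
    }

    private final ConcurrentMap<String, Deque<Batch>> batches = new ConcurrentHashMap<>();
    private final int batchSize; // plays the role of batch.size

    MiniAccumulator(int batchSize) {
        this.batchSize = batchSize;
    }

    AppendResult append(String topicPartition, byte[] value) {
        Deque<Batch> dq = batches.computeIfAbsent(topicPartition, tp -> new ArrayDeque<>());
        synchronized (dq) {
            Batch last = dq.peekLast();
            boolean created = false;
            if (last == null || last.bytes + value.length > batchSize) {
                last = new Batch();      // no room: start a new batch
                dq.addLast(last);
                created = true;
            }
            last.records.add(value);
            last.bytes += value.length;
            // "Full" means an older batch is waiting or the current one hit the limit.
            boolean full = dq.size() > 1 || last.bytes >= batchSize;
            return new AppendResult(full, created);
        }
    }
}
```

A caller would wake its sender whenever the result reports batchIsFull or newBatchCreated, exactly as in the snippet above.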

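In ProducerBatch, the method that actually accumulates is tryAppend. It first calls hasRoomFor to check whether there is enough space and then calls append to add the message. Both methods belong to this.recordsBuilder, which is a MemoryRecordsBuilder.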

public FutureRecordMetadata tryAppend(long timestamp, byte[] key, byte[] value, Header[] headers, Callback callback, long now) {
   if (!this.recordsBuilder.hasRoomFor(timestamp, key, value, headers)) {
       return null;
   } else {
       Long checksum = this.recordsBuilder.append(timestamp, key, value, headers);
       this.maxRecordSize = Math.max(this.maxRecordSize, AbstractRecords.estimateSizeInBytesUpperBound(this.magic(), this.recordsBuilder.compressionType(), key, value, headers));
       this.lastAppendTime = now;
       FutureRecordMetadata future = new FutureRecordMetadata(this.produceFuture, (long)this.recordCount, timestamp, checksum, key == null ? -1 : key.length, value == null ? -1 : value.length, Time.SYSTEM);
       this.thunks.add(new ProducerBatch.Thunk(callback, future));
       ++this.recordCount;
       return future;
   }
}

MemoryRecordsBuilder provides the two methods called above, hasRoomFor and append. We focus on append:

public Long append(long timestamp, ByteBuffer key, ByteBuffer value, Header[] headers) {
    return this.appendWithOffset(this.nextSequentialOffset(), timestamp, key, value, headers);
}
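// The method that does the real work (reached through an intermediate overload, omitted here, that passes isControlRecord = false)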
private Long appendWithOffset(long offset, boolean isControlRecord, long timestamp, ByteBuffer key, ByteBuffer value, Header[] headers) {
   try {
       if (isControlRecord != this.isControlBatch) {
           throw new IllegalArgumentException("Control records can only be appended to control batches");
       } else if (this.lastOffset != null && offset <= this.lastOffset) {
           throw new IllegalArgumentException(String.format("Illegal offset %s following previous offset %s (Offsets must increase monotonically).", offset, this.lastOffset));
       } else if (timestamp < 0L && timestamp != -1L) {
           throw new IllegalArgumentException("Invalid negative timestamp " + timestamp);
       } else if (this.magic < 2 && headers != null && headers.length > 0) {
           throw new IllegalArgumentException("Magic v" + this.magic + " does not support record headers");
       } else {
           if (this.firstTimestamp == null) {
               this.firstTimestamp = timestamp;
           }
			// Key code: choose the record format according to the magic version
           if (this.magic > 1) {
               this.appendDefaultRecord(offset, timestamp, key, value, headers);
               return null;
           } else {
               return this.appendLegacyRecord(offset, timestamp, key, value, this.magic);
           }
       }
   } catch (IOException var10) {
       throw new KafkaException("I/O exception when writing to the append stream, closing", var10);
   }
}

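If magic is greater than 1, appendDefaultRecord runs, which calls DefaultRecord.writeTo and then Utils.writeTo. This shows that what really accumulates the messages is a buffer that tracks how much has been written and where. Each message is appended to the DataOutput object this.appendStream, which is a DataOutputStream.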

private void appendDefaultRecord(long offset, long timestamp, ByteBuffer key, ByteBuffer value, Header[] headers) throws IOException {
   this.ensureOpenForRecordAppend();
   int offsetDelta = (int)(offset - this.baseOffset);
   long timestampDelta = timestamp - this.firstTimestamp;
   // Key code: serialize the record into the append stream
   int sizeInBytes = DefaultRecord.writeTo(this.appendStream, offsetDelta, timestampDelta, key, value, headers);
   this.recordWritten(offset, timestamp, sizeInBytes);
}

public static void writeTo(DataOutput out, ByteBuffer buffer, int length) throws IOException {
    if (buffer.hasArray()) {
        out.write(buffer.array(), buffer.position() + buffer.arrayOffset(), length);
    } else {
        int pos = buffer.position();

        for(int i = pos; i < length + pos; ++i) {
            out.writeByte(buffer.get(i));
        }
    }
}
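
The two branches of writeTo exist because a heap ByteBuffer exposes its backing array while a direct buffer does not. The sketch below copies the helper into a hypothetical WriteToDemo class and checks that both kinds of buffer produce the same bytes; copy is an illustrative helper, not part of Kafka.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.util.Arrays;

final class WriteToDemo {
    // Same logic as the writeTo helper quoted above.
    static void writeTo(DataOutput out, ByteBuffer buffer, int length) throws IOException {
        if (buffer.hasArray()) {
            // Heap buffer: write straight from the backing array.
            out.write(buffer.array(), buffer.position() + buffer.arrayOffset(), length);
        } else {
            // Direct buffer: no backing array, so copy byte by byte.
            int pos = buffer.position();
            for (int i = pos; i < length + pos; ++i) {
                out.writeByte(buffer.get(i));
            }
        }
    }

    // Illustrative helper: write the remaining bytes of a buffer into a byte array.
    static byte[] copy(ByteBuffer buffer) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try {
            writeTo(new DataOutputStream(bytes), buffer, buffer.remaining());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) {
        byte[] payload = {1, 2, 3, 4};
        ByteBuffer direct = ByteBuffer.allocateDirect(4);
        direct.put(payload);
        direct.flip();
        System.out.println(Arrays.equals(copy(ByteBuffer.wrap(payload)), copy(direct))); // true
    }
}
```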

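II. Server-Side Response

1. Server-side structure

[Image: server-side structure diagram]

2. How server nodes store data

When a message arrives, there are two possible rules for deciding that the write succeeded and an ACK can be returned. Option one: more than half of the nodes must finish replicating, which is why such clusters usually have an odd number of servers, so that a majority always exists. Option two: every node must finish replicating. Option one has lower latency; option two is more reliable. Kafka chooses option two.

Suppose a follower goes down and cannot answer the leader's replication traffic. Must all the other nodes wait for it? Of course not. The leader only waits for the followers that are responding normally. How does the leader decide whether a follower is responding normally?

Kafka keeps a set called the ISR (in-sync replica set) that holds the healthy followers. If a follower has not caught up with the leader within a given time, it is removed from the ISR. Once a removed follower catches up with the leader again, it is added back. The time limit is set with replica.lag.time.max.ms.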
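
The ISR rule can be modeled in a few lines. This is only a sketch: IsrTracker is a hypothetical class, broker IDs and timestamps are made up, and the real broker tracks considerably more state per replica.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Toy ISR tracker; broker ids and timestamps are illustrative.
final class IsrTracker {
    private final long replicaLagTimeMaxMs; // plays the role of replica.lag.time.max.ms
    private final Map<Integer, Long> lastCaughtUpMs = new HashMap<>();

    IsrTracker(long replicaLagTimeMaxMs) {
        this.replicaLagTimeMaxMs = replicaLagTimeMaxMs;
    }

    // Record that a follower was fully caught up with the leader at this time.
    void caughtUp(int followerId, long nowMs) {
        lastCaughtUpMs.put(followerId, nowMs);
    }

    // Followers that have lagged longer than the limit drop out of the ISR.
    Set<Integer> isr(long nowMs) {
        Set<Integer> inSync = new TreeSet<>();
        for (Map.Entry<Integer, Long> e : lastCaughtUpMs.entrySet()) {
            if (nowMs - e.getValue() <= replicaLagTimeMaxMs) {
                inSync.add(e.getKey());
            }
        }
        return inSync;
    }
}
```

With a 30-second limit, a follower last caught up at time 0 is out of the set at 40 seconds and back in as soon as it reports being caught up again.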

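3. Election mechanism

If the leader goes down, a new leader is elected from the followers.

Who runs the election: the Broker Controller. The controller is itself chosen among the brokers: the first broker to create the /controller node in ZooKeeper becomes the controller. If the controller goes down, the ephemeral /controller node it created disappears; the other brokers watch that node, are notified, and compete to create it again.
Kafka distinguishes three sets: AR (all assigned replicas), ISR (replicas that are in sync), and OSR (replicas that have fallen out of sync). All followers in the ISR are candidates. If the ISR contains no follower, the leader can only be chosen from the OSR, which is risky: a follower that has been out of touch with the leader for a long time is certainly missing some data, and if it becomes the leader and everything syncs to it, that data is lost for good. This second kind of election is therefore called an unclean election. Kafka has a switch for it, unclean.leader.election.enable, which defaults to false.
The election prefers the replica that appears first in the partition's assigned replica list and is still in the ISR.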
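
A minimal sketch of that choice, assuming the candidate order is the assigned replica list. LeaderElection and elect are hypothetical names; the real controller logic covers more cases, such as reassignment and controlled shutdown.

```java
import java.util.List;
import java.util.OptionalInt;
import java.util.Set;

final class LeaderElection {
    static OptionalInt elect(List<Integer> assignedReplicas, Set<Integer> isr,
                             Set<Integer> liveBrokers, boolean uncleanEnabled) {
        // Clean election: first assigned replica that is alive and in the ISR.
        for (int replica : assignedReplicas) {
            if (liveBrokers.contains(replica) && isr.contains(replica)) {
                return OptionalInt.of(replica);
            }
        }
        // Unclean election: any live replica, accepting possible data loss.
        if (uncleanEnabled) {
            for (int replica : assignedReplicas) {
                if (liveBrokers.contains(replica)) {
                    return OptionalInt.of(replica);
                }
            }
        }
        return OptionalInt.empty(); // no eligible replica: the partition stays offline
    }
}
```

With unclean election disabled, a partition whose in-sync replicas are all down has no leader until one of them returns, which trades availability for durability.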

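4. Leader-follower synchronization

Once a leader has been elected, data has to be synchronized.
LEO (Log End Offset): the offset of the next message to be written (latest offset + 1).
HW (High Watermark): the smallest LEO in the ISR. It limits the newest message consumers can read. If a message could be consumed before it was fully replicated and the leader then failed, the message would be lost and the consumer would hold a message that no longer exists in the log.
Synchronization proceeds as follows:

The follower sends a fetch request to the leader; the leader returns data and updates the LEO it records for that follower.
The follower receives the response, appends the messages in order, and updates its own LEO.
The leader updates the HW.

[Image: leader-follower synchronization diagram]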
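
The HW definition is simply a minimum over the in-sync replicas. A small sketch, with HighWatermark as a hypothetical helper and replica IDs chosen for illustration:

```java
import java.util.Collection;
import java.util.Map;

final class HighWatermark {
    // HW is the smallest LEO among in-sync replicas (the leader included).
    // Assumes isr is non-empty and every member has an entry in leoByReplica.
    static long compute(Map<Integer, Long> leoByReplica, Collection<Integer> isr) {
        long hw = Long.MAX_VALUE;
        for (int replica : isr) {
            hw = Math.min(hw, leoByReplica.get(replica));
        }
        return hw;
    }
}
```

For LEOs of 10 on the leader and 8 on the only in-sync follower, the HW is 8: consumers can read offsets below 8, and a lagging replica outside the ISR does not hold the HW back.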

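5. Failure handling

There are two kinds of failure: follower failure and leader failure.

1) Follower failure

When a follower fails, it is first removed from the ISR. After it recovers, it discards all messages above the HW it had recorded, then syncs from the leader so that its log cannot diverge, and finally rejoins the ISR once it has caught up.

2) Leader failure

When the leader goes down, a new leader is elected as described in the election section. The other replicas discard their messages above the HW and then sync from the new leader.

6. Server-side ACK response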

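The ACK level can be set in code:

acks=0: do not wait for an ACK, just send.
acks=1: the leader returns an ACK once it has written the message. This was the default before Kafka 3.0.
acks=-1 or all: an ACK is returned once the leader and all in-sync followers have written the message. This is the default from Kafka 3.0 on.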

Properties pros=new Properties();
pros.put("acks","1");

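III. Partitioned Storage

1. Storage structure

As mentioned in the basics article, a topic is split into multiple partitions to avoid message backlog and relieve storage pressure, and replicas are created on multiple brokers to make storage robust.
[Image: partitions and replicas across brokers]
When the replica count equals the broker count, every broker holds one replica of each partition. The original figure showed 3 partitions with 3 replicas.

When the replica count is smaller than the broker count, not every broker holds a replica of each partition. The original figure showed 4 partitions with 2 replicas.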

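2. Assignment rules

See assignReplicasToBrokers in AdminUtils.scala.

The replication factor cannot exceed the number of brokers.
The first replica (the leader) of the first partition (number 0) is placed on a broker chosen at random from brokerList.
The first replica (the leader) of each other partition is shifted one broker further along from partition 0, in order.
The remaining replicas (followers) are placed at an offset from the leader that starts from a random shift.
This spreads leaders evenly so that as few leaders as possible share one broker, which keeps a single broker failure from sending many partitions into election at once.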
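
The rules above can be sketched in Java. This follows the rack-unaware case as I understand it, with the two random starting values passed in as parameters so that the result is reproducible; ReplicaAssignment and its parameter names are hypothetical, and rack awareness is left out.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Sketch of rack-unaware assignment with fixed (non-random) starting values.
final class ReplicaAssignment {
    static Map<Integer, List<Integer>> assign(int[] brokers, int nPartitions, int replicationFactor,
                                              int startIndex, int startShift) {
        if (replicationFactor > brokers.length) {
            throw new IllegalArgumentException("replication factor larger than broker count");
        }
        Map<Integer, List<Integer>> result = new LinkedHashMap<>();
        int n = brokers.length;
        int shift = startShift;
        for (int p = 0; p < nPartitions; p++) {
            if (p > 0 && p % n == 0) {
                shift++; // vary follower placement on each pass over the brokers
            }
            int first = (p + startIndex) % n;
            List<Integer> replicas = new ArrayList<>();
            replicas.add(brokers[first]); // first replica, the preferred leader
            for (int j = 0; j < replicationFactor - 1; j++) {
                int offset = 1 + (shift + j) % (n - 1);
                replicas.add(brokers[(first + offset) % n]);
            }
            result.put(p, replicas);
        }
        return result;
    }
}
```

With brokers 0, 1, 2, three partitions, three replicas and both starting values at 0, the result is [0, 1, 2], [1, 2, 0], [2, 0, 1]: each broker leads exactly one partition.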

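3. Segmentation rules

Every partition takes reads and writes and keeps accumulating message data. To reduce the pressure, Kafka further splits each partition into segments.
Each segment consists of three files that belong together: .log, .index, and .timeindex. The .log file stores the messages, .index is the offset index, and .timeindex is the timestamp index. A new segment is rolled under three conditions:

Size: when the .log file reaches its size limit (log.segment.bytes), a new set of files named after the next offset is created.
Time: when the time limit is reached (log.roll.hours), a new set of files named after the next offset is created.
Index: when the index file is full (log.index.size.max.bytes) and cannot take more entries, a new set of files named after the next offset is created.

.index is the offset index. An offset is the sequence number of a message. Kafka uses a sparse index: unlike an ordinary index it does not create an entry for every message, but adds one entry each time a certain number of bytes of messages has been written. The setting is log.index.interval.bytes=4096, where 4096 is the default, so one index entry is written for roughly every 4 KB of messages. A denser index takes more space; a sparser index makes each lookup more expensive.
Lookup cost with a sparse index: O(log n) + O(m), where n is the number of index entries and m is the number of messages between two adjacent entries.
.timeindex is the timestamp index. Every message records a timestamp, which is needed for ordering messages and for time-based calculations. Two kinds of timestamp are available, creation time or log append time: log.message.timestamp.type=CreateTime/LogAppendTime.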

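Index lookup procedure:

Use the offset to determine which segment holds the message.
In that segment's index file, use the offset to find the position of the nearest indexed message.
Scan the log file from that position, comparing offsets, until the message is found.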
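
The first two lookup steps are both "largest entry not greater than the target" searches, which a sorted map expresses directly. In this sketch SparseIndexLookup is a hypothetical class and the TreeMaps stand in for the on-disk structures; the offsets and positions in the usage note are invented.

```java
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

final class SparseIndexLookup {
    // Segment files are named after their base offset, zero-padded to 20 digits.
    static String segmentFileName(long baseOffset) {
        return String.format(Locale.ROOT, "%020d.log", baseOffset);
    }

    // Step 1: the segment is the one with the largest base offset <= target
    // (null if the target is below the first segment).
    static Long findSegment(TreeMap<Long, ?> segmentsByBaseOffset, long targetOffset) {
        return segmentsByBaseOffset.floorKey(targetOffset);
    }

    // Step 2: the sparse index gives the file position of the nearest indexed
    // offset at or before the target; step 3 scans the .log file forward from there.
    static Map.Entry<Long, Integer> findStartPosition(TreeMap<Long, Integer> offsetToPosition,
                                                      long targetOffset) {
        return offsetToPosition.floorEntry(targetOffset);
    }
}
```

For segments starting at 0, 1000 and 2000, offset 1075 belongs to 00000000000000001000.log; if that segment's index holds entries for offsets 1000, 1040 and 1090, the scan starts at the position recorded for 1040.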

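4. Message cleanup rules

Old messages cannot pile up forever; they have to be cleaned up periodically under one of several policies.

Cleaner switch: log.cleaner.enable=true enables the log cleaner that performs compaction; the default is true.
Cleanup policy: log.cleanup.policy=delete/compact selects deletion or compaction.
Check interval: log.retention.check.interval.ms, in milliseconds; the default is 300000 (5 minutes).
Expiry: log.retention.hours, log.retention.minutes, log.retention.ms; the default is 168 hours, i.e. one week.
Size limits: log.retention.bytes, log.segment.bytes.
Note that messages are stored as key-value pairs. With the compact policy, when several messages share a key the later one supersedes the earlier ones, so compaction removes the earlier messages with a duplicate key to save storage.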
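
The effect of compaction can be illustrated on an in-memory list. This is a sketch of the outcome only: CompactionSketch and Record are hypothetical types, and the real cleaner works segment by segment in the background and also handles delete markers.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class CompactionSketch {
    static final class Record {
        final long offset;
        final String key;
        final String value;

        Record(long offset, String key, String value) {
            this.offset = offset;
            this.key = key;
            this.value = value;
        }
    }

    // Keep only the latest record per key; survivors keep their offsets and order.
    static List<Record> compact(List<Record> log) {
        Map<String, Long> latestOffset = new HashMap<>();
        for (Record r : log) {
            latestOffset.put(r.key, r.offset); // later records overwrite earlier ones
        }
        List<Record> compacted = new ArrayList<>();
        for (Record r : log) {
            if (latestOffset.get(r.key) == r.offset) {
                compacted.add(r);
            }
        }
        return compacted;
    }
}
```

Compacting (0, k1, a), (1, k2, b), (2, k1, c) leaves offsets 1 and 2: the older value for k1 is gone, and the offsets of the survivors are unchanged.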

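IV. How Kafka Consumption Works

As mentioned in the basics article, Kafka tracks how far each consumer group has read in a dedicated internal topic, __consumer_offsets.
[Image]

GroupMetadata: information about the consumers in a group (each consumer has an ID).
OffsetAndMetadata: the committed offset and metadata for each partition a group consumes.

[Image]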

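1. Consumption strategy

Multiple consumer groups on one partition: each group consumes the partition independently, so messages can be consumed more than once.
Consumers in a group equal to the number of partitions: each consumer consumes one partition.
Fewer consumers in a group than partitions: some consumers must consume several partitions. The assignment can use range assignment (RangeAssignor), round-robin assignment (RoundRobinAssignor), or sticky assignment (StickyAssignor), or the consumer can call assign to specify its partition list explicitly.
More consumers in a group than partitions: some consumers get no partition. An assignment normally stays fixed once made, so an idle consumer stays idle. The exception is a partition rebalance.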
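
As an illustration of range assignment for a single topic, the sketch below gives each consumer a contiguous block of partitions, with the first few consumers taking one extra when the division is uneven. RangeAssignSketch is a hypothetical class and the consumer names are made up.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class RangeAssignSketch {
    // Consumers must already be sorted; each gets a contiguous range of partitions.
    static Map<String, List<Integer>> assign(List<String> sortedConsumers, int numPartitions) {
        int perConsumer = numPartitions / sortedConsumers.size();
        int extra = numPartitions % sortedConsumers.size();
        Map<String, List<Integer>> result = new LinkedHashMap<>();
        int next = 0;
        for (int i = 0; i < sortedConsumers.size(); i++) {
            // The first `extra` consumers take one partition more than the rest.
            int count = perConsumer + (i < extra ? 1 : 0);
            List<Integer> owned = new ArrayList<>();
            for (int j = 0; j < count; j++) {
                owned.add(next++);
            }
            result.put(sortedConsumers.get(i), owned);
        }
        return result;
    }
}
```

Seven partitions over consumers c0, c1, c2 give [0, 1, 2], [3, 4], [5, 6]. Two partitions over the same three consumers leave c2 with an empty list, which is the idle-consumer case from the last rule.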

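2. Partition rebalance

A rebalance reassigns partitions among the consumers of a group:

A coordinator is determined for the group: the broker that leads the __consumer_offsets partition the group ID hashes to.
The consumers connect to the coordinator and join the group.
One of the participating consumers becomes the group leader, works out the assignment, and sends it to the coordinator.
The coordinator accepts the assignment and notifies every consumer of its partitions.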
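
How a group maps to its coordinator can be sketched as a hash of the group ID modulo the partition count of __consumer_offsets. CoordinatorLookup is a hypothetical helper, and the formula is written from my understanding of the broker code rather than quoted from it; 50 is the default of offsets.topic.num.partitions.

```java
final class CoordinatorLookup {
    // Hypothetical helper: which __consumer_offsets partition holds a group's offsets.
    static int offsetsPartitionFor(String groupId, int offsetsTopicPartitions) {
        int hash = groupId.hashCode();
        // Guard against Integer.MIN_VALUE, whose absolute value overflows.
        int nonNegative = (hash == Integer.MIN_VALUE) ? 0 : Math.abs(hash);
        return nonNegative % offsetsTopicPartitions;
    }

    public static void main(String[] args) {
        // 50 is the default of offsets.topic.num.partitions.
        System.out.println(offsetsPartitionFor("test", 50)); // 48
    }
}
```

The broker that leads that partition acts as the coordinator for the group, so every consumer in the group ends up talking to the same broker.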

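V. Why Kafka Processes Messages Quickly

Sequential reads and writes: there is sequential I/O (data is stored contiguously on disk) and random I/O (data is scattered across the disk). Kafka uses sequential I/O because log files are append-only: messages are written at the end and never modified in place.
Indexes.
Batched reads and writes, and compression.
Zero copy: with ordinary I/O, sending a file over the network takes 4 copies (2 DMA copies and 2 CPU copies). With zero copy, when the hardware supports SG-DMA, the data goes from disk to the kernel buffer and from there to the network card by DMA alone. That is still 2 copies rather than literally zero, but none of them is done by the CPU, so throughput improves considerably.

Ordinary copy path
[Image: ordinary copy path]
Zero-copy path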
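
In Java, the zero-copy path is reached through FileChannel.transferTo. The sketch below shows the call pattern only: ZeroCopyDemo is a hypothetical class, and because the target here is an in-memory channel rather than a socket, the operating system's zero-copy path is not actually exercised.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.channels.Channels;
import java.nio.channels.FileChannel;
import java.nio.channels.WritableByteChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

final class ZeroCopyDemo {
    // Send a whole file to a channel without reading it into a user-space buffer.
    static long send(Path file, WritableByteChannel target) throws IOException {
        try (FileChannel source = FileChannel.open(file, StandardOpenOption.READ)) {
            long position = 0;
            long size = source.size();
            while (position < size) {
                // transferTo may move fewer bytes than requested, so loop.
                position += source.transferTo(position, size - position, target);
            }
            return position;
        }
    }

    // Illustrative helper: write data to a temp file, send it, return what arrived.
    static byte[] roundTrip(byte[] data) {
        try {
            Path file = Files.createTempFile("segment", ".log");
            try {
                Files.write(file, data);
                ByteArrayOutputStream sink = new ByteArrayOutputStream();
                send(file, Channels.newChannel(sink));
                return sink.toByteArray();
            } finally {
                Files.delete(file);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

When the target is a socket channel on a platform that supports it, the same call lets the kernel move the bytes without passing them through the application.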