[003] - Kafka Internals: Producer Partitioning

Latest recommended article published on 2024-07-18 11:08:52

zhangiongcolin

Category: Apache kafka. Tags: kafka, internals, Producer, source code, partition

Article link: https://blog.csdn.net/zhangxiongcolin/article/details/83818839


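As mentioned earlier in this series, a Kafka topic is a logical concept; messages are actually stored and handled by the topic's partitions. This article describes how Kafka chooses a partition when a message is sent, and how to customize that choice.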

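1. Default Partitioning
When sending a message, the producer supplies two values: a key and a value. The key determines the partition: if a key is given, the partition is derived from the key; if no key is given, Kafka's default partitioner chooses the partition itself.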

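Kafka defines an interface, Partitioner, with a method that computes the partition a message is sent to:
int partition(String topic, Object key, byte[] keyBytes, Object value, byte[] valueBytes, Cluster cluster);
Parameters:
topic: the topic name
key: the partitioning key, or null if none was given
keyBytes: the serialized key, or null if none was given
value: the message body to send
valueBytes: the serialized value
cluster: the current cluster metadata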

KafkaProducer defines a private method that computes the partition for a record. If the record already specifies a partition, that value is returned directly; otherwise the partitioner class named in the configuration computes it:
private int partition(ProducerRecord<K, V> record, byte[] serializedKey, byte[] serializedValue, Cluster cluster) {
    Integer partition = record.partition();
    return partition != null ?
            partition :
            partitioner.partition(
                    record.topic(), record.key(), serializedKey, record.value(), serializedValue, cluster);
}
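The precedence rule in this method can be tried out without a broker. The following is a minimal stand-alone sketch, not Kafka code: `choosePartition` is a hypothetical helper, `explicitPartition` stands in for `record.partition()`, and the `IntSupplier` stands in for the configured `Partitioner`.

```java
import java.util.function.IntSupplier;

class PartitionPrecedenceSketch {

    // Hypothetical helper mirroring the "explicit partition wins" rule.
    // explicitPartition plays the role of record.partition() and may be null;
    // fallback plays the role of the configured Partitioner.
    static int choosePartition(Integer explicitPartition, IntSupplier fallback) {
        return explicitPartition != null ? explicitPartition : fallback.getAsInt();
    }

    public static void main(String[] args) {
        // Stand-in partitioner that always picks partition 1.
        IntSupplier alwaysOne = () -> 1;
        // The record names partition 0, so the partitioner is never consulted.
        System.out.println(choosePartition(0, alwaysOne));
        // No partition on the record, so the partitioner decides.
        System.out.println(choosePartition(null, alwaysOne));
    }
}
```

Because only the selected branch of the conditional is evaluated, the fallback is not invoked at all when the record carries its own partition.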

When a KafkaProducer is constructed, the following line reads the partitioner class from the configuration:
this.partitioner = config.getConfiguredInstance(ProducerConfig.PARTITIONER_CLASS_CONFIG, Partitioner.class);
In ProducerConfig, the default value of PARTITIONER_CLASS_CONFIG is defined as DefaultPartitioner:
public static final String PARTITIONER_CLASS_CONFIG = "partitioner.class";
CONFIG = new ConfigDef().define(PARTITIONER_CLASS_CONFIG,
        Type.CLASS,
        DefaultPartitioner.class,
        Importance.MEDIUM, PARTITIONER_CLASS_DOC)
DefaultPartitioner is the implementation of Partitioner that defines the default partitioning behavior:
public int partition(String topic, Object key, byte[] keyBytes, Object value, byte[] valueBytes, Cluster cluster) {
    // Look up the partitions of the topic. Each PartitionInfo describes one partition:
    // topic, partition id, leader, replicas, and so on.
    List<PartitionInfo> partitions = cluster.partitionsForTopic(topic);
    // Number of partitions in the topic
    int numPartitions = partitions.size();
    // No key was given
    if (keyBytes == null) {
        // Get the next value of the per-topic counter
        int nextValue = nextValue(topic);
        // Get the partitions that are currently available
        List<PartitionInfo> availablePartitions = cluster.availablePartitionsForTopic(topic);
        if (availablePartitions.size() > 0) {
            int part = Utils.toPositive(nextValue) % availablePartitions.size();
            return availablePartitions.get(part).partition();
        } else {
            // No partition is available; pick one from the full list anyway
            return Utils.toPositive(nextValue) % numPartitions;
        }
    } else {
        // A key was given: hash the key bytes to choose the partition
        return Utils.toPositive(Utils.murmur2(keyBytes)) % numPartitions;
    }
}
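The null-key branch has two cases: pick among the available partitions if there are any, otherwise fall back to the full partition count. The sketch below isolates that decision using only the JDK. It is an illustration under simplifying assumptions: `selectForNullKey` is a hypothetical name, `counterValue` stands in for `nextValue(topic)`, and a plain list of partition ids replaces the `PartitionInfo` list.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

class NullKeySelectionSketch {

    // Same bit mask as the toPositive helper: clears the sign bit.
    static int toPositive(int number) {
        return number & 0x7fffffff;
    }

    // counterValue stands in for nextValue(topic).
    // availablePartitions holds the ids of the partitions that are currently available
    // (a simplification of the List<PartitionInfo> used by DefaultPartitioner).
    static int selectForNullKey(int counterValue, int numPartitions, List<Integer> availablePartitions) {
        if (!availablePartitions.isEmpty()) {
            // Index into the available list, then map back to the real partition id.
            int index = toPositive(counterValue) % availablePartitions.size();
            return availablePartitions.get(index);
        }
        // Nothing is available: choose from the full range of partition ids.
        return toPositive(counterValue) % numPartitions;
    }

    public static void main(String[] args) {
        // Topic with 3 partitions where only partition 2 is available.
        List<Integer> onlyPartitionTwo = Arrays.asList(2);
        System.out.println(selectForNullKey(7, 3, onlyPartitionTwo));
        // Same topic with no available partitions: falls back to 7 % 3.
        System.out.println(selectForNullKey(7, 3, Collections.<Integer>emptyList()));
    }
}
```

Note that the index is computed against the size of the available list, so the returned id comes from that list rather than from the raw remainder.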

(1) Partitioning without a key
DefaultPartitioner defines ConcurrentMap<String, AtomicInteger> topicCounterMap, which holds one counter per topic.
When the key is null, the following code runs:
private int nextValue(String topic) {
    // Look up the counter for this topic
    AtomicInteger counter = topicCounterMap.get(topic);
    // First message sent to this topic
    if (null == counter) {
        // Start the counter at a random value
        counter = new AtomicInteger(ThreadLocalRandom.current().nextInt());
        AtomicInteger currentCounter = topicCounterMap.putIfAbsent(topic, counter);
        // Another thread registered a counter first; use that one
        if (currentCounter != null) {
            counter = currentCounter;
        }
    }
    // Return the current value and increment the counter
    return counter.getAndIncrement();
}

Test code
private final static ConcurrentMap<String, AtomicInteger> topicCounterMap = new ConcurrentHashMap<String, AtomicInteger>();

public static void main(String args[]) {
    for (int i = 0; i < 10; i++) {
        int nextValue = nextValue("beardata");
        System.out.println("Selected partition: " + (toPositive(nextValue) % 2));
    }
}

private static int nextValue(String topic) {
    AtomicInteger counter = topicCounterMap.get(topic);
    if (null == counter) {
        counter = new AtomicInteger(ThreadLocalRandom.current().nextInt());
        AtomicInteger currentCounter = topicCounterMap.putIfAbsent(topic, counter);
        if (currentCounter != null) {
            counter = currentCounter;
        }
    }
    return counter.getAndIncrement();
}

public static int toPositive(int number) {
    return number & 0x7fffffff;
}
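The counter starts at `ThreadLocalRandom.current().nextInt()`, which can be negative, so the order of masking and taking the remainder matters. DefaultPartitioner masks first (`Utils.toPositive(nextValue) % size`). The sketch below compares both orders; `maskThenMod` and `modThenMask` are hypothetical names used only for this illustration.

```java
class ToPositiveOrderSketch {

    // Clears the sign bit so the result is never negative.
    static int toPositive(int number) {
        return number & 0x7fffffff;
    }

    // Mask first, then take the remainder: the result is always in [0, numPartitions).
    static int maskThenMod(int value, int numPartitions) {
        return toPositive(value) % numPartitions;
    }

    // Remainder first, then mask: in Java, -3 % 2 is -1, and masking -1 yields Integer.MAX_VALUE.
    static int modThenMask(int value, int numPartitions) {
        return toPositive(value % numPartitions);
    }

    public static void main(String[] args) {
        int negativeCounter = -3;
        System.out.println(maskThenMod(negativeCounter, 2)); // prints 1
        System.out.println(modThenMask(negativeCounter, 2)); // prints 2147483647
    }
}
```

Only the mask-then-remainder order is guaranteed to produce a valid partition number for every counter value.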


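Practical example
We create the topic beardata with two partitions, send five messages (0, 1, 2, 3, 4) from the producer without specifying a key, and observe which partition each message lands in.
First, we inspect the partitions of beardata: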
./bin/kafka-topics.sh --describe --topic beardata --zookeeper bigdata000:2181
The output shows that the beardata topic has two partitions, 0 and 1.
We then send the messages and consume partition 0 and partition 1 of beardata separately on the consumer side.

Conclusion: when no key is specified, messages are distributed evenly across the partitions in round-robin order.
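The even spread can also be checked offline. The sketch below simulates five null-key sends to a two-partition topic and counts the records per partition. It is a simulation with a hypothetical `distribute` helper and does not talk to a broker.

```java
import java.util.Arrays;
import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.atomic.AtomicInteger;

class RoundRobinDistributionSketch {

    // Simulates sending `messages` null-key records to a topic with `numPartitions`
    // partitions and returns how many records each partition received.
    static int[] distribute(int messages, int numPartitions) {
        // Like DefaultPartitioner, start the counter at a random value.
        AtomicInteger counter = new AtomicInteger(ThreadLocalRandom.current().nextInt());
        int[] perPartition = new int[numPartitions];
        for (int i = 0; i < messages; i++) {
            // Clear the sign bit, then map the counter onto a partition id.
            int partition = (counter.getAndIncrement() & 0x7fffffff) % numPartitions;
            perPartition[partition]++;
        }
        return perPartition;
    }

    public static void main(String[] args) {
        // Five messages over two partitions: one partition gets 3 records, the other 2.
        System.out.println(Arrays.toString(distribute(5, 2)));
    }
}
```

Which partition receives the extra record depends on the random starting value of the counter.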

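(2) Partitioning with a key
When a key is given, Utils.murmur2(keyBytes) returns a 32-bit hash of keyBytes, and the partition is that hash made non-negative, modulo the number of partitions.
We send messages with two keys, colin and harper: colin sends messages 0, 2, 4, 6, 8 and harper sends messages 1, 3, 5, 7, 9.
Conclusion: colin maps to partition 1 and harper maps to partition 0, so the two keys are sent to different partitions.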
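The important property of keyed partitioning is that the same key always lands in the same partition. The sketch below shows this with the JDK only. It deliberately swaps in `Arrays.hashCode` for `Utils.murmur2`, so the partition numbers it prints are not the ones Kafka would choose; `partitionForKey` is a hypothetical helper.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class KeyedSelectionSketch {

    // Stand-in for the keyed branch of DefaultPartitioner.
    // Arrays.hashCode replaces Utils.murmur2 here, so results differ from Kafka's.
    static int partitionForKey(String key, int numPartitions) {
        byte[] keyBytes = key.getBytes(StandardCharsets.UTF_8);
        int hash = Arrays.hashCode(keyBytes);
        // Clear the sign bit, then map the hash onto a partition id.
        return (hash & 0x7fffffff) % numPartitions;
    }

    public static void main(String[] args) {
        // Repeated calls with the same key give the same partition.
        System.out.println("colin  -> " + partitionForKey("colin", 2));
        System.out.println("colin  -> " + partitionForKey("colin", 2));
        System.out.println("harper -> " + partitionForKey("harper", 2));
    }
}
```

Because the result depends on the partition count, changing the number of partitions of a topic can move existing keys to different partitions.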

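2. Custom Partitioning
In some scenarios we need our own partitioning strategy to meet business requirements. One such scenario is a topic whose messages belong to several different business types. Creating a separate topic for each type can be costly, so instead we can route messages to different partitions by type and have each consumer handle one business type by reading a specific partition. This is where a custom partitioner is worth considering.
As noted above, Kafka's Partitioner is an interface. We can implement it with our own partitioning logic and then specify the custom class in the producer configuration. In the implementation below, messages whose value is greater than 100 go to partition 0 of beardata, and messages whose value is less than or equal to 100 go to partition 1.
Implementation: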
import org.apache.kafka.clients.producer.Partitioner;
import org.apache.kafka.common.Cluster;
import java.util.Map;

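/**
 * Author: zhangxiong
 * Date: 18-8-5 4:55 PM
 * Desc:
 */
public class BearDataPartitioner implements Partitioner {

    @Override
    public int partition(String topic, Object key, byte[] keyBytes, Object value, byte[] valueBytes, Cluster cluster) {
        // The value is expected to be an integer in text form;
        // a non-numeric value makes parseInt throw NumberFormatException.
        int v = Integer.parseInt(value.toString());
        if (v > 100) {
            return 0;
        } else {
            return 1;
        }
    }

    @Override
    public void close() {
        // Nothing to release
    }

    @Override
    public void configure(Map<String, ?> map) {
        // No configuration needed
    }
}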
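To activate the custom class, the producer configuration has to name it under the `partitioner.class` key. The sketch below only builds the `Properties` object, so it runs without Kafka on the classpath. The broker port 9092 and the choice of `StringSerializer` are assumptions made for this illustration, and `producerProperties` is a hypothetical helper.

```java
import java.util.Properties;

class CustomPartitionerConfigSketch {

    static Properties producerProperties() {
        Properties props = new Properties();
        // Assumed broker address: the host name appears in the article, the port is an assumption.
        props.put("bootstrap.servers", "bigdata000:9092");
        // Assumed serializers: string values suit a partitioner that calls value.toString().
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        // "partitioner.class" is the value of ProducerConfig.PARTITIONER_CLASS_CONFIG.
        // The listing declares no package, so the bare class name is used here;
        // prefix it with the package name if the class lives in a package.
        props.put("partitioner.class", "BearDataPartitioner");
        return props;
    }

    public static void main(String[] args) {
        Properties props = producerProperties();
        System.out.println(props.getProperty("partitioner.class"));
    }
}
```

In a real application these properties would be passed to `new KafkaProducer<>(props)`, after which every record without an explicit partition is routed through BearDataPartitioner.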