Flink Kafka

不名一文

Last modified: 2022-12-07 10:33:48


First published: 2022-01-14 19:41:28

Article link: https://blog.csdn.net/u012485099/article/details/122500468

1. Flink's strategy for reading from Kafka

The available Kafka partition assignment strategies are:

org.apache.kafka.clients.consumer.RangeAssignor
org.apache.kafka.clients.consumer.RoundRobinAssignor
org.apache.kafka.clients.consumer.StickyAssignor
org.apache.kafka.clients.consumer.CooperativeStickyAssignor

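Kafka's default is RangeAssignor. In Flink SQL it can be changed with the following table option. Note that Flink's Kafka connector assigns partitions to its source subtasks itself instead of relying on Kafka's consumer-group assignment, so this option may not change how partitions are distributed across Flink subtasks.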

'properties.partition.assignment.strategy' = 'org.apache.kafka.clients.consumer.RoundRobinAssignor'

See also "Kafka consumer assignment strategies and rebalancing" (CSDN blog): https://blog.csdn.net/u011066470/article/details/124090278
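To make the difference between the first two strategies concrete, here is a minimal standalone sketch. It does not use the Kafka classes; `AssignorSketch` and its methods are hypothetical names, and it assumes every consumer subscribes to every topic and that consumers and topics are already sorted.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Simplified simulation of range vs. round-robin assignment; not Kafka's own classes.
class AssignorSketch {

    // Range: each topic is split independently into contiguous ranges,
    // so the first consumers receive the extra partition of every topic.
    static Map<String, List<String>> range(List<String> consumers, Map<String, Integer> topics) {
        Map<String, List<String>> result = emptyAssignment(consumers);
        for (Map.Entry<String, Integer> topic : topics.entrySet()) {
            int perConsumer = topic.getValue() / consumers.size();
            int extra = topic.getValue() % consumers.size();
            int next = 0;
            for (int c = 0; c < consumers.size(); c++) {
                int count = perConsumer + (c < extra ? 1 : 0);
                for (int i = 0; i < count; i++) {
                    result.get(consumers.get(c)).add(topic.getKey() + "-" + next++);
                }
            }
        }
        return result;
    }

    // Round robin: the partitions of all topics form one list that is dealt out in turn.
    static Map<String, List<String>> roundRobin(List<String> consumers, Map<String, Integer> topics) {
        Map<String, List<String>> result = emptyAssignment(consumers);
        int slot = 0;
        for (Map.Entry<String, Integer> topic : topics.entrySet()) {
            for (int p = 0; p < topic.getValue(); p++) {
                result.get(consumers.get(slot++ % consumers.size())).add(topic.getKey() + "-" + p);
            }
        }
        return result;
    }

    private static Map<String, List<String>> emptyAssignment(List<String> consumers) {
        Map<String, List<String>> result = new LinkedHashMap<>();
        for (String consumer : consumers) {
            result.put(consumer, new ArrayList<>());
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, Integer> topics = new LinkedHashMap<>();
        topics.put("t0", 3);
        topics.put("t1", 3);
        List<String> consumers = Arrays.asList("c0", "c1");
        // Range gives c0 four partitions and c1 two; round robin gives three each.
        System.out.println("range:      " + range(consumers, topics));
        System.out.println("roundRobin: " + roundRobin(consumers, topics));
    }
}
```

With two topics of three partitions each and two consumers, range assignment is skewed toward the first consumer, while round robin spreads the six partitions evenly.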

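2. Flink's strategy for writing to Kafka

2.1 Default constructor

In most cases the

FlinkKafkaProducer(String topicId,SerializationSchema<IN> serializationSchema,Properties producerConfig)

constructor is used, and the partitioner defaults to FlinkFixedPartitioner. Its formula is parallelInstanceId % partitions.length, so each sink subtask always writes to one fixed partition, and the subtasks wrap around the partition list. For example, if the Flink sink has 5 subtasks and the Kafka topic has 3 partitions, then (counting both from 1) the mapping is 1 -> 1, 2 -> 2, 3 -> 3, 4 -> 1, 5 -> 2, and so on.

The code is as follows: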

// targetTopic is the name of the target topic, partitions holds the partition IDs of that topic,
// and parallelInstanceId is the 0-based index of the current sink subtask within the sink's parallelism
@Override
    public int partition(T record, byte[] key, byte[] value, String targetTopic, int[] partitions) {
        Preconditions.checkArgument(
                partitions != null && partitions.length > 0,
                "Partitions of the target topic is empty.");

        return partitions[parallelInstanceId % partitions.length];
    }

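Question: under this strategy, if there are fewer sink subtasks than topic partitions, will some partitions receive no data?

Still to be verified in practice, but the code above suggests yes: each subtask always writes to exactly one partition (partitions[parallelInstanceId % partitions.length]), so with fewer subtasks than partitions the remaining partitions are never selected.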
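The mapping can be checked with a small standalone sketch that re-creates only the formula from the partition method above. `FixedPartitionerSketch` and its method names are hypothetical and not part of Flink.

```java
import java.util.Set;
import java.util.TreeSet;

// Standalone re-creation of partitions[parallelInstanceId % partitions.length].
class FixedPartitionerSketch {

    // Partition chosen by the sink subtask with the given 0-based index.
    static int partitionFor(int parallelInstanceId, int[] partitions) {
        if (partitions == null || partitions.length == 0) {
            throw new IllegalArgumentException("Partitions of the target topic is empty.");
        }
        return partitions[parallelInstanceId % partitions.length];
    }

    // All partitions that receive data when `parallelism` subtasks write to the topic.
    static Set<Integer> partitionsWithData(int parallelism, int[] partitions) {
        Set<Integer> used = new TreeSet<>();
        for (int subtask = 0; subtask < parallelism; subtask++) {
            used.add(partitionFor(subtask, partitions));
        }
        return used;
    }

    public static void main(String[] args) {
        // 5 subtasks, 3 partitions: every partition is used, two of them by two subtasks.
        System.out.println(partitionsWithData(5, new int[] {0, 1, 2}));       // prints [0, 1, 2]
        // 2 subtasks, 5 partitions: partitions 2, 3 and 4 never receive data.
        System.out.println(partitionsWithData(2, new int[] {0, 1, 2, 3, 4})); // prints [0, 1]
    }
}
```

If every partition must receive data in the second case, a different partitioner (or none, so that Kafka chooses the partition) would be needed.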

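2.2 Custom Kafka schema

Suppose you implement a custom KafkaSerializationSchema and call the

FlinkKafkaProducer(String defaultTopic,KafkaSerializationSchema<IN> serializationSchema,Properties producerConfig,FlinkKafkaProducer.Semantic semantic)

constructor. In this case the Flink partitioner is null.

The following calls are then made in turn:
record = kafkaSchema.serialize(next, context.timestamp());
transaction.producer.send(record, callback);

send is the Kafka producer's own send method. For records that carry no explicit partition, the partition is chosen by Kafka's DefaultPartitioner, whose code is as follows: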

/**
 * The default partitioning strategy:
 * <ul>
 * <li>If a partition is specified in the record, use it
 * <li>If no partition is specified but a key is present choose a partition based on a hash of the key
 * <li>If no partition or key is present choose the sticky partition that changes when the batch is full.
 * 
 * See KIP-480 for details about sticky partitioning.
 */
public class DefaultPartitioner implements Partitioner {

    private final StickyPartitionCache stickyPartitionCache = new StickyPartitionCache();

    public void configure(Map<String, ?> configs) {}

    /**
     * Compute the partition for the given record.
     *
     * @param topic The topic name
     * @param key The key to partition on (or null if no key)
     * @param keyBytes serialized key to partition on (or null if no key)
     * @param value The value to partition on or null
     * @param valueBytes serialized value to partition on or null
     * @param cluster The current cluster metadata
     */
    public int partition(String topic, Object key, byte[] keyBytes, Object value, byte[] valueBytes, Cluster cluster) {
        if (keyBytes == null) {
            return stickyPartitionCache.partition(topic, cluster);
        } 
        List<PartitionInfo> partitions = cluster.partitionsForTopic(topic);
        int numPartitions = partitions.size();
        // hash the keyBytes to choose a partition
        return Utils.toPositive(Utils.murmur2(keyBytes)) % numPartitions;
    }

    public void close() {}
    
    /**
     * If a batch completed for the current sticky partition, change the sticky partition. 
     * Alternately, if no sticky partition has been determined, set one.
     */
    public void onNewBatch(String topic, Cluster cluster, int prevPartition) {
        stickyPartitionCache.nextPartition(topic, cluster, prevPartition);
    }
}

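That is:

If a partition is specified, the record is written to that partition.
If a key is specified, the partition is computed from a hash of the key.
If no key is specified, sticky partitioning is used: records go to one randomly chosen partition until the batch is full, then the producer switches to another partition, which balances the load over time.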
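The order in which these three rules apply can be sketched as a single standalone function. This is a simplified model, not Kafka's implementation: `PartitionChoiceSketch` and `choosePartition` are hypothetical names, the sticky partition is passed in as a plain argument, and `Arrays.hashCode` is only a stand-in for Kafka's murmur2 hash, so the concrete partition numbers will differ from a real producer.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Simplified model of the three partition-selection rules; not Kafka's own classes.
class PartitionChoiceSketch {

    // Same idea as Kafka's Utils.toPositive: clear the sign bit.
    static int toPositive(int number) {
        return number & 0x7fffffff;
    }

    /**
     * @param explicitPartition partition set on the record, or null
     * @param keyBytes          serialized key, or null
     * @param stickyPartition   partition currently used for keyless records
     * @param numPartitions     number of partitions of the topic
     */
    static int choosePartition(Integer explicitPartition, byte[] keyBytes,
                               int stickyPartition, int numPartitions) {
        if (explicitPartition != null) {
            return explicitPartition; // rule 1: explicit partition wins
        }
        if (keyBytes != null) {
            // rule 2: hash of the key (stand-in hash, see the note above)
            return toPositive(Arrays.hashCode(keyBytes)) % numPartitions;
        }
        return stickyPartition; // rule 3: keyless records follow the sticky partition
    }

    public static void main(String[] args) {
        byte[] key = "user-42".getBytes(StandardCharsets.UTF_8);
        System.out.println(choosePartition(2, key, 0, 3));     // explicit partition
        System.out.println(choosePartition(null, key, 0, 3));  // derived from the key
        System.out.println(choosePartition(null, null, 1, 3)); // sticky partition
    }
}
```

The practical consequence is that records with the same key always land in the same partition, while keyless records follow whatever sticky partition is current.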

The code of StickyPartitionCache.nextPartition is as follows:

public int nextPartition(String topic, Cluster cluster, int prevPartition) {
        List<PartitionInfo> partitions = cluster.partitionsForTopic(topic);
        Integer oldPart = indexCache.get(topic);
        Integer newPart = oldPart;
        // Check that the current sticky partition for the topic is either not set or that the partition that 
        // triggered the new batch matches the sticky partition that needs to be changed.
        if (oldPart == null || oldPart == prevPartition) {
            List<PartitionInfo> availablePartitions = cluster.availablePartitionsForTopic(topic);
            if (availablePartitions.size() < 1) {
                Integer random = Utils.toPositive(ThreadLocalRandom.current().nextInt());
                newPart = random % partitions.size();
            } else if (availablePartitions.size() == 1) {
                newPart = availablePartitions.get(0).partition();
            } else {
                while (newPart == null || newPart.equals(oldPart)) {
                    Integer random = Utils.toPositive(ThreadLocalRandom.current().nextInt());
                    newPart = availablePartitions.get(random % availablePartitions.size()).partition();
                }
            }
            // Only change the sticky partition if it is null or prevPartition matches the current sticky partition.
            if (oldPart == null) {
                indexCache.putIfAbsent(topic, newPart);
            } else {
                indexCache.replace(topic, prevPartition, newPart);
            }
            return indexCache.get(topic);
        }
        return indexCache.get(topic);
    }

For details on the Sticky Partitioner, see "Apache Kafka Producer Improvements: Sticky Partitioner": https://www.confluent.io/blog/apache-kafka-producer-improvements-sticky-partitioner/

To be written up: load balancing. See "Analysis of Kafka rebalance issues" (CSDN blog).

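3. Committing Kafka offsets from Flink

3.1 Offset commit rules

The Flink Kafka Consumer lets you configure how offsets are committed back to the Kafka brokers. The Flink Kafka Consumer does not rely on committed offsets for its fault-tolerance guarantees. Committed offsets are only a means of exposing the consumer's progress for monitoring.

How offset committing is configured depends on whether checkpointing is enabled for the job.

Checkpointing disabled: the Flink Kafka Consumer relies on the automatic periodic offset committing of the Kafka client it uses internally. To enable or disable offset committing, set enable.auto.commit and auto.commit.interval.ms to appropriate values in the Properties passed to the consumer.
enable.auto.commit defaults to true and auto.commit.interval.ms defaults to 5000 (milliseconds). For details, see the Kafka documentation or the org.apache.kafka.clients.consumer.ConsumerConfig class.
Checkpointing enabled: when a checkpoint completes, the Flink Kafka Consumer commits the offsets stored in the checkpointed state. This keeps the offsets committed to the Kafka brokers consistent with the offsets in the checkpointed state. Offset committing can be enabled or disabled by calling setCommitOffsetsOnCheckpoints(boolean) on the consumer (the default is true). Note that in this scenario the automatic periodic offset commit settings in the Properties are ignored completely.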
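The rules above can be summarized as a small decision function. This is a simplified standalone model of the described behavior, not Flink's own code: `OffsetCommitSketch`, `CommitMode` and `commitMode` are hypothetical names, and only the enable.auto.commit key is inspected.

```java
import java.util.Properties;

// Simplified model of how the offset commit behavior is selected.
class OffsetCommitSketch {

    enum CommitMode { ON_CHECKPOINTS, KAFKA_PERIODIC, DISABLED }

    static CommitMode commitMode(boolean checkpointingEnabled,
                                 boolean commitOffsetsOnCheckpoints,
                                 Properties kafkaProps) {
        if (checkpointingEnabled) {
            // With checkpointing on, Kafka's auto-commit settings are ignored.
            return commitOffsetsOnCheckpoints ? CommitMode.ON_CHECKPOINTS : CommitMode.DISABLED;
        }
        // Without checkpointing, the Kafka client's periodic auto commit decides.
        // enable.auto.commit defaults to true in the Kafka client.
        boolean autoCommit =
                Boolean.parseBoolean(kafkaProps.getProperty("enable.auto.commit", "true"));
        return autoCommit ? CommitMode.KAFKA_PERIODIC : CommitMode.DISABLED;
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        props.setProperty("enable.auto.commit", "true");
        props.setProperty("auto.commit.interval.ms", "5000");

        System.out.println(commitMode(false, true, props));  // periodic commit by the Kafka client
        System.out.println(commitMode(true, true, props));   // commit when a checkpoint completes
        System.out.println(commitMode(true, false, props));  // no commits, auto commit ignored
    }
}
```

The third call illustrates the last rule: once checkpointing is enabled, turning off commits on checkpoints disables committing entirely, even though enable.auto.commit is true.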

Source: Kafka | Apache Flink (Consumer Offset Committing): https://nightlies.apache.org/flink/flink-docs-master/docs/connectors/datastream/kafka/#consumer-offset-committing

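3.2 Last Committed Offset

The Last Committed Offset is the offset that the consumer group has already committed. It records the current consumption position and is used to locate the starting offset the next time consumption resumes.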

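3.3 Lag monitoring

If checkpointing is enabled with an interval of 10 minutes (so offsets are committed once every 10 minutes), monitoring Kafka lag through the Last Committed Offset is clearly misleading, because the committed offset can be up to one interval behind the real consumption position. Kafka provides consumer metrics for this purpose, such as records-lag-max, which is based on the per-partition lag: the partition's Log End Offset (LEO) minus the consumer's current position offset. In Flink this metric can be reported to Prometheus for monitoring, and Flink also exposes some metrics of its own that can be used.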
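A short arithmetic sketch shows how far the two ways of computing lag can drift apart. The class name `LagSketch` and the offset values are made up for illustration only.

```java
// Compares lag computed from the consumer's position with lag computed from committed offsets.
class LagSketch {

    // Lag as seen by the consumer: log end offset minus the position it has read up to.
    static long positionLag(long logEndOffset, long currentPosition) {
        return Math.max(0L, logEndOffset - currentPosition);
    }

    // Lag derived from the last committed offset, which only advances on each commit.
    static long committedLag(long logEndOffset, long lastCommittedOffset) {
        return Math.max(0L, logEndOffset - lastCommittedOffset);
    }

    public static void main(String[] args) {
        long logEndOffset = 1_000_000L;      // latest offset in the partition
        long currentPosition = 999_900L;     // where the consumer is actually reading
        long lastCommittedOffset = 400_000L; // committed at the last checkpoint, minutes ago

        System.out.println("position-based lag:  " + positionLag(logEndOffset, currentPosition));
        System.out.println("committed-based lag: " + committedLag(logEndOffset, lastCommittedOffset));
    }
}
```

In this made-up case the job is only 100 records behind, while a monitor based on committed offsets would report a lag of 600,000 records until the next checkpoint completes.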

Apache Kafka

Kafka | Apache Flink
