Flink Sink to Kafka: how the parallelism of Flink as a producer relates to Kafka partitions

赣江

Category: Big Data. Tags: flink, kafka-producer, flink-sink, kafka-sink

Article link: https://blog.csdn.net/mar_ljh/article/details/105844607


This post looks at how the parallelism of a Flink sink relates to Kafka partitions in Flink 1.10. The official documentation is here:

https://ci.apache.org/projects/flink/flink-docs-master/dev/connectors/kafka.html#kafka-producer-partitioning-scheme

By default, if a custom partitioner is not specified for the Flink Kafka Producer, the producer will use a FlinkFixedPartitioner that maps each Flink Kafka Producer parallel subtask to a single Kafka partition (i.e., all records received by a sink subtask will end up in the same Kafka partition).

A custom partitioner can be implemented by extending the FlinkKafkaPartitioner class. All Kafka versions’ constructors allow providing a custom partitioner when instantiating the producer. Note that the partitioner implementation must be serializable, as they will be transferred across Flink nodes. Also, keep in mind that any state in the partitioner will be lost on job failures since the partitioner is not part of the producer’s checkpointed state.

It is also possible to completely avoid using any kind of partitioner, and simply let Kafka partition the written records by their attached key (as determined for each record using the provided serialization schema). To do this, provide a null custom partitioner when instantiating the producer. It is important to provide null as the custom partitioner; as explained above, if a custom partitioner is not specified the FlinkFixedPartitioner is used instead.

According to the official documentation, when no custom partitioner is passed to the constructor, FlinkFixedPartitioner is used. In extreme cases this can lead to Kafka partition skew.
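To make the skew concrete, here is a minimal sketch of the mapping that FlinkFixedPartitioner applies, as far as I understand it: each sink subtask always writes to the partition at index subtaskIndex % partitionCount. The class and method names below are hypothetical, and the code has no Flink dependency; it only mimics the mapping.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

// Hypothetical stand-alone sketch; this is not the Flink class itself.
final class FixedPartitionSketch {

    // Mimics the fixed mapping: one sink subtask always writes to one partition.
    static int partitionFor(int subtaskIndex, int partitionCount) {
        return subtaskIndex % partitionCount;
    }

    // Partitions that never receive data for a given sink parallelism.
    static List<Integer> idlePartitions(int parallelism, int partitionCount) {
        TreeSet<Integer> used = new TreeSet<>();
        for (int subtask = 0; subtask < parallelism; subtask++) {
            used.add(partitionFor(subtask, partitionCount));
        }
        List<Integer> idle = new ArrayList<>();
        for (int p = 0; p < partitionCount; p++) {
            if (!used.contains(p)) {
                idle.add(p);
            }
        }
        return idle;
    }

    public static void main(String[] args) {
        int partitionCount = 3; // matches the message-test topic used in this post
        for (int parallelism : new int[] {1, 3, 4}) {
            System.out.println("parallelism=" + parallelism
                    + " idle partitions=" + idlePartitions(parallelism, partitionCount));
        }
    }
}
```

Under this mapping, a sink with parallelism 1 writing to a 3-partition topic leaves partitions 1 and 2 empty, and a sink with parallelism 4 sends two subtasks (0 and 3) to partition 0.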

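Environment:

Flink 1.10, kafka_2.11-1.1.1, JDK 8

Producer:

Create the Kafka topic message-test with replication factor 3 and 3 partitions:

kafka_2.11-1.1.1/bin/kafka-topics.sh --create --zookeeper flu02:2181/kafka --replication-factor 3 --partitions 3 --topic message-test

Message generation: run nc -lk and submit the data by hand. The numbers 1-100 are used here (they can simply be copied from Excel).

The producer uses a Flink sink. The code is as follows: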

import java.nio.charset.StandardCharsets;
import java.util.Properties;
import java.util.Random;
import javax.annotation.Nullable;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer;
import org.apache.flink.streaming.connectors.kafka.KafkaSerializationSchema;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;

public class KafkaProducer {
    private static final Random RANDOM = new Random();
    // Pick keys whose hashes land on different partitions instead of all mapping to the same one
    private static final String[] KEYS = "a b f".split("\\W+");

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        DataStreamSource<String> text = env.socketTextStream("flu03", 2020);

        FlinkKafkaProducer<String> myProducer = new FlinkKafkaProducer<>(
                "message", // default topic; every record below names message-test explicitly
                new KafkaSerializationSchema<String>() {
                    @Override
                    public ProducerRecord<byte[], byte[]> serialize(String element, @Nullable Long timestamp) {
                        // Attach a randomly chosen key to each record
                        String key = KEYS[RANDOM.nextInt(KEYS.length)];
                        System.out.println("key: " + key + ", value: " + element);
                        // Typed byte[] record instead of the raw ProducerRecord of Strings
                        return new ProducerRecord<>("message-test",
                                key.getBytes(StandardCharsets.UTF_8),
                                element.getBytes(StandardCharsets.UTF_8));
                    }
                },
                initKafkaConfig(),
                FlinkKafkaProducer.Semantic.EXACTLY_ONCE,
                1); // the last argument is kafkaProducersPoolSize, referred to as C
        // Sink parallelism, referred to as P
        text.addSink(myProducer).setParallelism(1);

        env.execute("kafka producer");
    }

    private static Properties initKafkaConfig() {
        final Properties properties = new Properties();
        properties.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "flu02:9092,flu03:9092,flu04:9092");
        // Key/value serializers are left unset: the records are already byte[],
        // and FlinkKafkaProducer falls back to ByteArraySerializer in that case.
        //properties.put(ProducerConfig.INTERCEPTOR_CLASSES_CONFIG,MessageProducerInterceptor.class.getName());
        properties.put(ProducerConfig.CLIENT_ID_CONFIG, "hangz-factory");
        properties.put(ProducerConfig.RETRIES_CONFIG, 3);
        properties.put(ProducerConfig.ACKS_CONFIG, "all");
        return properties;
    }
}

The two values marked C and P in the code above are changed from run to run to see how many messages each partition receives.
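Each record in the producer is built with a key and no explicit partition, so, as far as I can tell, the Kafka client chooses the partition from the key. The sketch below re-implements from memory the rule used by Kafka's default partitioner for keyed records (murmur2 hash of the serialized key, made non-negative, modulo the partition count), so the spread of the keys a, b and f can be checked without a cluster. The class name is hypothetical, and the hash should be verified against org.apache.kafka.common.utils.Utils.murmur2 before relying on it.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper; a from-memory port of Kafka's key hashing, not the Kafka class itself.
final class KeyPartitionSketch {

    // 32-bit murmur2, assumed to match the variant used by the Kafka client.
    static int murmur2(byte[] data) {
        int length = data.length;
        final int seed = 0x9747b28c;
        final int m = 0x5bd1e995;
        final int r = 24;
        int h = seed ^ length;
        int length4 = length / 4;
        // Mix the input four bytes at a time (little-endian).
        for (int i = 0; i < length4; i++) {
            int i4 = i * 4;
            int k = (data[i4] & 0xff)
                    + ((data[i4 + 1] & 0xff) << 8)
                    + ((data[i4 + 2] & 0xff) << 16)
                    + ((data[i4 + 3] & 0xff) << 24);
            k *= m;
            k ^= k >>> r;
            k *= m;
            h *= m;
            h ^= k;
        }
        // Mix the remaining one to three bytes.
        int tail = length & ~3;
        switch (length % 4) {
            case 3:
                h ^= (data[tail + 2] & 0xff) << 16;
                // fall through
            case 2:
                h ^= (data[tail + 1] & 0xff) << 8;
                // fall through
            case 1:
                h ^= data[tail] & 0xff;
                h *= m;
                break;
            default:
                break;
        }
        h ^= h >>> 13;
        h *= m;
        h ^= h >>> 15;
        return h;
    }

    // Partition for a UTF-8 encoded string key.
    static int partitionFor(String key, int partitionCount) {
        byte[] keyBytes = key.getBytes(StandardCharsets.UTF_8);
        return (murmur2(keyBytes) & 0x7fffffff) % partitionCount;
    }

    public static void main(String[] args) {
        for (String key : "a b f".split("\\W+")) {
            System.out.println("key=" + key + " -> partition " + partitionFor(key, 3));
        }
    }
}
```

If the three keys print three different partitions, the random key choice spreads records over the whole topic; if two of them collide, one partition stays empty. One sanity check is available from this post: the consumer log shows records with key f arriving in partition 0.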

Consumer:

import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

// Consume with asynchronous commits inside the loop and a final synchronous commit on shutdown
public class ASyncAndSyncCommitConsumer {
    private static final Logger LOGGER = LoggerFactory.getLogger(ASyncAndSyncCommitConsumer.class);

    public static void main(String[] args) {
        // count maps each partition to the number of records received from it
        Map<Integer, Long> count = new HashMap<>(16);
        KafkaConsumer<String, String> consumer = new KafkaConsumer<>(loadProp());
        consumer.subscribe(Collections.singletonList("message-test"));
        try {
            for (; ; ) {
                ConsumerRecords<String, String> records = consumer.poll(100);

                records.forEach(record -> {
                    // Increment the counter of the partition this record came from
                    count.merge(record.partition(), 1L, Long::sum);
                    LOGGER.info("record.key()={}, record.offset()={}, record.partition()={}, record.timestamp()={}, " +
                            "record.value()={}, count:{}", record.key(), record.offset(), record.partition(), record.timestamp(), record.value(), count);
                });
                // Non-blocking commit after each batch
                consumer.commitAsync();
            }
        } catch (Exception e) {
            LOGGER.error("Unexpected error", e);
        } finally {
            try {
                // Blocking commit so the last offsets are not lost on exit
                consumer.commitSync();
            } finally {
                consumer.close();
            }
        }
    }

    private static Properties loadProp() {
        Properties properties = new Properties();
        properties.put("bootstrap.servers", "flu02:9092,flu03:9092,flu04:9092");
        properties.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        properties.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        properties.put("group.id", "test_group-1");
        properties.put("auto.offset.reset", "latest");
        // Offsets are committed manually, not automatically
        properties.put("enable.auto.commit", "false");
        return properties;
    }
}

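Test:

Check the status of the Kafka topic: kafka_2.11-1.1.1/bin/kafka-topics.sh --describe --zookeeper flu02:2181/kafka --topic message-test

The topic is healthy.

Start the socket server that feeds the job with nc -lk 2020, then copy and paste 1000 numbers into it.

Test results:

Sample consumer output: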

[29/04/20 05:18:08:008 CST] main  INFO consumer.ASyncAndSyncCommitConsumer: record.key()=f, record.offset()=2435, record.partition()=0, record.timestamp()=1588151892721, record.value()=64, count:{0=324, 1=307, 2=359}
[29/04/20 05:18:08:008 CST] main  INFO consumer.ASyncAndSyncCommitConsumer: record.key()=f, record.offset()=2436, record.partition()=0, record.timestamp()=1588151892744, record.value()=66, count:{0=325, 1=307, 2=359}
[29/04/20 05:18:08:008 CST] main  INFO consumer.ASyncAndSyncCommitConsumer: record.key()=f, record.offset()=2437, record.partition()=0, record.timestamp()=1588151892744, record.value()=69, count:{0=326, 1=307, 2=359}
[29/04/20 05:18:08:008 CST] main  INFO consumer.ASyncAndSyncCommitConsumer: record.key()=f, record.offset()=2438, record.partition()=0, record.timestamp()=1588151892744, record.value()=71, count:{0=327, 1=307, 2=359}
[29/04/20 05:18:08:008 CST] main  INFO consumer.ASyncAndSyncCommitConsumer: record.key()=f, record.offset()=2439, record.partition()=0, record.timestamp()=1588151892745, record.value()=76, count:{0=328, 1=307, 2=359}
[29/04/20 05:18:08:008 CST] main  INFO consumer.ASyncAndSyncCommitConsumer: record.key()=f, record.offset()=2440, record.partition()=0, record.timestamp()=1588151892745, record.value()=86, count:{0=329, 1=307, 2=359}
[29/04/20 05:18:08:008 CST] main  INFO consumer.ASyncAndSyncCommitConsumer: record.key()=f, record.offset()=2441, record.partition()=0, record.timestamp()=1588151892745, record.value()=88, count:{0=330, 1=307, 2=359}
[29/04/20 05:18:08:008 CST] main  INFO consumer.ASyncAndSyncCommitConsumer: record.key()=f, record.offset()=2442, record.partition()=0, record.timestamp()=1588151892745, record.value()=90, count:{0=331, 1=307, 2=359}
[29/04/20 05:18:08:008 CST] main  INFO consumer.ASyncAndSyncCommitConsumer: record.key()=f, record.offset()=2443, record.partition()=0, record.timestamp()=1588151892746, record.value()=91, count:{0=332, 1=307, 2=359}
[29/04/20 05:18:08:008 CST] main  INFO consumer.ASyncAndSyncCommitConsumer: record.key()=f, record.offset()=2444, record.partition()=0, record.timestamp()=1588151892746, record.value()=92, count:{0=333, 1=307, 2=359}
[29/04/20 05:18:08:008 CST] main  INFO consumer.ASyncAndSyncCommitConsumer: record.key()=f, record.offset()=2445, record.partition()=0, record.timestamp()=1588151892746, record.value()=94, count:{0=334, 1=307, 2=359}

Results of several runs (records received per partition):

Parallelism 1, kafkaProducersPoolSize=1: {0=35, 1=28, 2=37}; second run: {0=334, 1=307, 2=359}
Parallelism 1, kafkaProducersPoolSize=3: {0=39, 1=24, 2=37}
Parallelism 1, kafkaProducersPoolSize=4: {0=33, 1=31, 2=36}

Parallelism 3, kafkaProducersPoolSize=1: {0=36, 1=37, 2=27}
Parallelism 3, kafkaProducersPoolSize=3: {0=209, 1=198, 2=193}
Parallelism 3, kafkaProducersPoolSize=5: {0=235, 1=231, 2=234}

Parallelism 4, kafkaProducersPoolSize=1: {0=249, 1=226, 2=225}
Parallelism 4, kafkaProducersPoolSize=3: {0=327, 1=347, 2=326}
Parallelism 4, kafkaProducersPoolSize=4: {0=332, 1=328, 2=340}
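To compare the runs at a glance, the counts reported in this post can be turned into per-partition shares and measured against an even split. This is a small sketch that uses only numbers taken from the results in this post; the class and method names are hypothetical.

```java
import java.util.Locale;

// Hypothetical helper for reading the reported counts; not part of the original test setup.
final class PartitionShareSketch {

    // Largest deviation, in percentage points, of any partition's share from an even split.
    static double maxDeviationFromEven(long[] counts) {
        long total = 0;
        for (long c : counts) {
            total += c;
        }
        double even = 100.0 / counts.length;
        double worst = 0.0;
        for (long c : counts) {
            double share = 100.0 * c / total;
            worst = Math.max(worst, Math.abs(share - even));
        }
        return worst;
    }

    public static void main(String[] args) {
        // {parallelism P, pool size C, count p0, count p1, count p2}, copied from the results.
        long[][] runs = {
            {1, 1, 334, 307, 359},
            {3, 3, 209, 198, 193},
            {4, 4, 332, 328, 340},
        };
        for (long[] run : runs) {
            long[] counts = {run[2], run[3], run[4]};
            System.out.println(String.format(Locale.ROOT,
                    "P=%d C=%d max deviation from even split: %.1f points",
                    run[0], run[1], maxDeviationFromEven(counts)));
        }
    }
}
```

For these three runs the largest deviation is roughly 2.6, 1.5 and 0.7 percentage points respectively. Every configuration stays close to an even split, which is what one would expect if the partition is chosen from the random record key rather than from the sink subtask.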