Flink Sink to Kafka: how the parallelism of a Flink producer relates to Kafka partitions

This article looks at the relationship between Flink sink parallelism and Kafka partitions in Flink 1.10. The official documentation says:

https://ci.apache.org/projects/flink/flink-docs-master/dev/connectors/kafka.html#kafka-producer-partitioning-scheme

By default, if a custom partitioner is not specified for the Flink Kafka Producer, the producer will use a FlinkFixedPartitioner that maps each Flink Kafka Producer parallel subtask to a single Kafka partition (i.e., all records received by a sink subtask will end up in the same Kafka partition).

A custom partitioner can be implemented by extending the FlinkKafkaPartitioner class. All Kafka versions’ constructors allow providing a custom partitioner when instantiating the producer. Note that the partitioner implementation must be serializable, as they will be transferred across Flink nodes. Also, keep in mind that any state in the partitioner will be lost on job failures since the partitioner is not part of the producer’s checkpointed state.
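To make this concrete, here is a minimal sketch of such a custom partitioner (my own illustration, not taken from the docs or the code below); it extends FlinkKafkaPartitioner and routes each record by a hash of its serialized key, and would be handed to one of the producer constructors that accept a FlinkKafkaPartitioner:

import java.util.Arrays;

import org.apache.flink.streaming.connectors.kafka.partitioner.FlinkKafkaPartitioner;

// Minimal sketch of a custom partitioner: choose the target partition from the key hash.
// FlinkKafkaPartitioner is Serializable, and (as the docs warn) nothing stored in the
// partitioner is checkpointed, so keep it stateless.
public class KeyHashPartitioner<T> extends FlinkKafkaPartitioner<T> {

    @Override
    public int partition(T record, byte[] key, byte[] value, String targetTopic, int[] partitions) {
        if (key == null) {
            // No key attached: fall back to the first partition of the target topic.
            return partitions[0];
        }
        // Spread keyed records over the partitions that exist for the target topic.
        return partitions[(Arrays.hashCode(key) & 0x7fffffff) % partitions.length];
    }
}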

It is also possible to completely avoid using any kind of partitioner, and simply let Kafka partition the written records by their attached key (as determined for each record using the provided serialization schema). To do this, provide a null custom partitioner when instantiating the producer. It is important to provide null as the custom partitioner; as explained above, if a custom partitioner is not specified the FlinkFixedPartitioner is used instead.

Per the documentation, when no custom partitioner is passed in the constructor, FlinkFixedPartitioner is used, and in the extreme case this produces Kafka partition skew (all records from one sink subtask piling up in a single partition).
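For reference, a hedged sketch of the two choices described by the docs (my own code, assuming the Flink 1.10 universal connector's constructor that takes an Optional<FlinkKafkaPartitioner>; it is not part of the test program below):

import java.util.Optional;
import java.util.Properties;

import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer;
import org.apache.flink.streaming.connectors.kafka.partitioner.FlinkFixedPartitioner;
import org.apache.flink.streaming.connectors.kafka.partitioner.FlinkKafkaPartitioner;

public class PartitionerChoice {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "flu02:9092,flu03:9092,flu04:9092");

        // Case 1: explicit FlinkFixedPartitioner -- every record from one sink subtask goes to
        // one partition. This is also the behavior when the partitioner argument is omitted.
        FlinkKafkaProducer<String> fixed = new FlinkKafkaProducer<String>(
                "message-test",
                new SimpleStringSchema(),
                props,
                Optional.<FlinkKafkaPartitioner<String>>of(new FlinkFixedPartitioner<String>()));

        // Case 2: empty Optional -- no Flink-side partitioner at all; the Kafka client's default
        // partitioner decides, using the record key when one is attached.
        FlinkKafkaProducer<String> kafkaDecides = new FlinkKafkaProducer<String>(
                "message-test",
                new SimpleStringSchema(),
                props,
                Optional.<FlinkKafkaPartitioner<String>>empty());
    }
}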

Environment:

Flink 1.10, kafka_2.11-1.1.1, JDK 8

Producer:

Create the Kafka topic with 3 replicas and 3 partitions, topic name message-test:

kafka_2.11-1.1.1/bin/kafka-topics.sh --create --zookeeper flu02:2181/kafka --replication-factor 3 --partitions 3 --topic message-test

Message generation: nc -lk, with data submitted by hand; the numbers 1-100 are used (just copy them out of Excel).

The producer is a Flink sink; the code is as follows:

import java.util.Properties;
import java.util.Random;

import javax.annotation.Nullable;

import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer;
import org.apache.flink.streaming.connectors.kafka.KafkaSerializationSchema;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class KafkaProducer {
    private static final Random RANDOM = new Random();
    // Make sure the keys hash to different partitions; do not pick keys that all hash to the same value.
    private static final String[] keys = "a b f".split("\\W+");

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        DataStreamSource<String> text = env.socketTextStream("flu03", 2020);

        FlinkKafkaProducer<String> myProducer = new FlinkKafkaProducer<String>(
                "message", new KafkaSerializationSchema<String>() {
            @Override
            public ProducerRecord<byte[], byte[]> serialize(String element, @Nullable Long timestamp) {

                int i = RANDOM.nextInt(keys.length);
                String key = keys[i];

                System.out.println("key? " + key + " ,value:" + element);

                // Raw-typed ProducerRecord with String key/value: this works because key/value
                // serializers are explicitly set to StringSerializer in initKafkaConfig().
                // No partition is set, so the partition is chosen by Kafka's default partitioner.
                return new ProducerRecord("message-test", key, element);
            }
        // Note the last argument: kafkaProducersPoolSize, referred to as C below.
        }, initKafkaConfig(), FlinkKafkaProducer.Semantic.EXACTLY_ONCE, 1);
        // Sink parallelism, referred to as P below.
        text.addSink(myProducer).setParallelism(1);

        env.execute("kafka producer");
    }

    private static Properties initKafkaConfig() {
        final Properties properties = new Properties();
        properties.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "flu02:9092,flu03:9092,flu04:9092");
        properties.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        properties.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        //properties.put(ProducerConfig.INTERCEPTOR_CLASSES_CONFIG, MessageProducerInterceptor.class.getName());
        properties.put(ProducerConfig.CLIENT_ID_CONFIG, "hangz-factory");
        properties.put(ProducerConfig.RETRIES_CONFIG, 3);
        properties.put(ProducerConfig.ACKS_CONFIG, "all");
        return properties;
    }
}
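One detail worth noting: the first constructor argument ("message") is only the default topic id. Because a KafkaSerializationSchema is used, each record goes to whatever topic the ProducerRecord built in serialize() names, which is message-test here.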

In the code above, C and P are varied between runs to see how many messages each partition receives.

Consumer:

import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

// Consume with commitAsync() on the hot path and a final commitSync() on shutdown.
public class ASyncAndSyncCommitConsumer {
    private static final Logger LOGGER = LoggerFactory.getLogger(ASyncAndSyncCommitConsumer.class);

    public static void main(String[] args) {
        // count maps each partition to the number of messages received from it.
        Map<Integer, Long> count = new HashMap<>(16);
        KafkaConsumer<String, String> consumer = new KafkaConsumer<String, String>(loadProp());
        consumer.subscribe(Collections.singletonList("message-test"));
        try {
            for (; ; ) {
                ConsumerRecords<String, String> records = consumer.poll(100);

                records.forEach(record -> {
                    Long value = count.getOrDefault(record.partition(), 0L);
                    count.put(record.partition(), value + 1);
                    LOGGER.info("record.key()={}, record.offset()={}, record.partition()={}, record.timestamp()={}, " +
                            "record.value()={}, count:{}", record.key(), record.offset(), record.partition(), record.timestamp(), record.value(), count.toString());
                });
                consumer.commitAsync();
            }
        } catch (Exception e) {
            LOGGER.error("Unexpected error", e);
        } finally {
            try {
                consumer.commitSync();
            } finally {
                consumer.close();
            }
        }
    }

    private static Properties loadProp() {
        Properties properties = new Properties();
        properties.put("bootstrap.servers", "flu02:9092,flu03:9092,flu04:9092");
        properties.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        properties.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        properties.put("group.id", "test_group-1");
        properties.put("auto.offset.reset", "latest");
        // Offsets are committed manually, not auto-committed.
        properties.put("enable.auto.commit", "false");
        return properties;
    }
}
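The commit pattern is deliberate: commitAsync() keeps the poll loop non-blocking, while the commitSync() in the finally block makes one last blocking attempt to persist the final offsets before close(). This is the usual way to combine the two commit styles.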

Test:

Check the topic status: kafka_2.11-1.1.1/bin/kafka-topics.sh --describe --zookeeper flu02:2181/kafka --topic message-test

The output looks normal.

Start the socket input with nc -lk 2020 and paste 1000 numbers into it.

Test results:

Sample consumer output:

[29/04/20 05:18:08:008 CST] main  INFO consumer.ASyncAndSyncCommitConsumer: record.key()=f, record.offset()=2435, record.partition()=0, record.timestamp()=1588151892721, record.value()=64, count:{0=324, 1=307, 2=359}
[29/04/20 05:18:08:008 CST] main  INFO consumer.ASyncAndSyncCommitConsumer: record.key()=f, record.offset()=2436, record.partition()=0, record.timestamp()=1588151892744, record.value()=66, count:{0=325, 1=307, 2=359}
[29/04/20 05:18:08:008 CST] main  INFO consumer.ASyncAndSyncCommitConsumer: record.key()=f, record.offset()=2437, record.partition()=0, record.timestamp()=1588151892744, record.value()=69, count:{0=326, 1=307, 2=359}
[29/04/20 05:18:08:008 CST] main  INFO consumer.ASyncAndSyncCommitConsumer: record.key()=f, record.offset()=2438, record.partition()=0, record.timestamp()=1588151892744, record.value()=71, count:{0=327, 1=307, 2=359}
[29/04/20 05:18:08:008 CST] main  INFO consumer.ASyncAndSyncCommitConsumer: record.key()=f, record.offset()=2439, record.partition()=0, record.timestamp()=1588151892745, record.value()=76, count:{0=328, 1=307, 2=359}
[29/04/20 05:18:08:008 CST] main  INFO consumer.ASyncAndSyncCommitConsumer: record.key()=f, record.offset()=2440, record.partition()=0, record.timestamp()=1588151892745, record.value()=86, count:{0=329, 1=307, 2=359}
[29/04/20 05:18:08:008 CST] main  INFO consumer.ASyncAndSyncCommitConsumer: record.key()=f, record.offset()=2441, record.partition()=0, record.timestamp()=1588151892745, record.value()=88, count:{0=330, 1=307, 2=359}
[29/04/20 05:18:08:008 CST] main  INFO consumer.ASyncAndSyncCommitConsumer: record.key()=f, record.offset()=2442, record.partition()=0, record.timestamp()=1588151892745, record.value()=90, count:{0=331, 1=307, 2=359}
[29/04/20 05:18:08:008 CST] main  INFO consumer.ASyncAndSyncCommitConsumer: record.key()=f, record.offset()=2443, record.partition()=0, record.timestamp()=1588151892746, record.value()=91, count:{0=332, 1=307, 2=359}
[29/04/20 05:18:08:008 CST] main  INFO consumer.ASyncAndSyncCommitConsumer: record.key()=f, record.offset()=2444, record.partition()=0, record.timestamp()=1588151892746, record.value()=92, count:{0=333, 1=307, 2=359}
[29/04/20 05:18:08:008 CST] main  INFO consumer.ASyncAndSyncCommitConsumer: record.key()=f, record.offset()=2445, record.partition()=0, record.timestamp()=1588151892746, record.value()=94, count:{0=334, 1=307, 2=359}

Results across multiple runs:

Parallelism P = 1, kafkaProducersPoolSize C = 1:  {0=35, 1=28, 2=37}; second run: {0=334, 1=307, 2=359}
Parallelism P = 1, kafkaProducersPoolSize C = 3:  {0=39, 1=24, 2=37}
Parallelism P = 1, kafkaProducersPoolSize C = 4:  {0=33, 1=31, 2=36}

Parallelism P = 3, kafkaProducersPoolSize C = 1:  {0=36, 1=37, 2=27}
Parallelism P = 3, kafkaProducersPoolSize C = 3:  {0=209, 1=198, 2=193}
Parallelism P = 3, kafkaProducersPoolSize C = 5:  {0=235, 1=231, 2=234}

Parallelism P = 4, kafkaProducersPoolSize C = 1:  {0=249, 1=226, 2=225}
Parallelism P = 4, kafkaProducersPoolSize C = 3:  {0=327, 1=347, 2=326}
Parallelism P = 4, kafkaProducersPoolSize C = 4:  {0=332, 1=328, 2=340}
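As a cross-check that does not depend on the consumer's own counter, the latest offset of each partition can also be read straight from the brokers with the GetOffsetShell tool shipped in kafka_2.11-1.1.1 (--time -1 asks for the log-end offset; the broker list is the same as above):

kafka_2.11-1.1.1/bin/kafka-run-class.sh kafka.tools.GetOffsetShell --broker-list flu02:9092,flu03:9092,flu04:9092 --topic message-test --time -1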

Conclusion:

With the constructor used above, records are spread across all partitions rather than being pinned to one partition per subtask the way FlinkFixedPartitioner would do. Since the ProducerRecord leaves the partition null, Kafka's own default partitioner chooses the partition from the record key, and with randomly chosen keys the result looks close to round-robin.

Thread behavior was not tested here. In practice, keeping the sink parallelism equal to the number of Kafka partitions is the ideal setup.

A rough pass through the source code shows the key point:

In FlinkKafkaProducer, when the ProducerRecord's partition is left null, FlinkFixedPartitioner is not applied.
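A simplified, paraphrased outline of the branch inside FlinkKafkaProducer.invoke() (abridged from the Flink 1.10 universal connector; variable names are shortened, so treat it as a sketch rather than the exact source):

// Simplified outline of FlinkKafkaProducer.invoke(transaction, next, context):
// only the (deprecated) KeyedSerializationSchema path ever consults a FlinkKafkaPartitioner.
ProducerRecord<byte[], byte[]> record;
if (keyedSchema != null) {
    byte[] serializedKey = keyedSchema.serializeKey(next);
    byte[] serializedValue = keyedSchema.serializeValue(next);
    String targetTopic = keyedSchema.getTargetTopic(next);
    if (targetTopic == null) {
        targetTopic = defaultTopicId;
    }
    int[] partitions = partitionsFor(targetTopic);
    if (flinkKafkaPartitioner != null) {
        // FlinkFixedPartitioner (or a custom partitioner) chooses the partition here.
        record = new ProducerRecord<>(
                targetTopic,
                flinkKafkaPartitioner.partition(next, serializedKey, serializedValue, targetTopic, partitions),
                serializedKey,
                serializedValue);
    } else {
        record = new ProducerRecord<>(targetTopic, serializedKey, serializedValue);
    }
} else {
    // KafkaSerializationSchema path (the one used in this article): the schema builds the whole
    // ProducerRecord itself, so the partition is whatever serialize() set -- null here -- and no
    // Flink-side partitioner is involved.
    record = kafkaSchema.serialize(next, context.timestamp());
}
transaction.producer.send(record, callback);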

Flink also implements two-phase commit internally, via TwoPhaseCommitSinkFunction among other classes.

 

If anything here is wrong, please point it out. Much appreciated!
