This post looks at how the parallelism of a Flink Kafka sink relates to Kafka partitions in Flink 1.10. The official documentation says:
By default, if a custom partitioner is not specified for the Flink Kafka Producer, the producer will use a FlinkFixedPartitioner that maps each Flink Kafka Producer parallel subtask to a single Kafka partition (i.e., all records received by a sink subtask will end up in the same Kafka partition).
A custom partitioner can be implemented by extending the FlinkKafkaPartitioner class. All Kafka versions’ constructors allow providing a custom partitioner when instantiating the producer. Note that the partitioner implementation must be serializable, as they will be transferred across Flink nodes. Also, keep in mind that any state in the partitioner will be lost on job failures since the partitioner is not part of the producer’s checkpointed state.
It is also possible to completely avoid using any kind of partitioner, and simply let Kafka partition the written records by their attached key (as determined for each record using the provided serialization schema). To do this, provide a null custom partitioner when instantiating the producer. It is important to provide null as the custom partitioner; as explained above, if a custom partitioner is not specified the FlinkFixedPartitioner is used instead.
According to the documentation, when no custom partitioner is passed to the constructor, FlinkFixedPartitioner is used, which in extreme cases can lead to Kafka partition skew.
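The skew risk is easier to see with a small simulation. The sketch below is plain Java with no Flink dependency; the class and method names (FixedPartitionerSketch, fixedPartition, simulate) are made up for illustration. It only assumes the rule quoted above, namely that each sink subtask is pinned to a single partition, modelled here as subtask index modulo the number of partitions.

```java
import java.util.Arrays;

public class FixedPartitionerSketch {

    // Assumed rule: a sink subtask always writes to partitions[subtaskIndex % partitions.length].
    static int fixedPartition(int subtaskIndex, int[] partitions) {
        return partitions[subtaskIndex % partitions.length];
    }

    // Counts the records per partition when every subtask emits recordsPerSubtask records.
    static long[] simulate(int parallelism, int numPartitions, int recordsPerSubtask) {
        int[] partitions = new int[numPartitions];
        for (int i = 0; i < numPartitions; i++) {
            partitions[i] = i;
        }
        long[] counts = new long[numPartitions];
        for (int subtask = 0; subtask < parallelism; subtask++) {
            counts[fixedPartition(subtask, partitions)] += recordsPerSubtask;
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(simulate(1, 3, 100))); // [100, 0, 0]
        System.out.println(Arrays.toString(simulate(3, 3, 100))); // [100, 100, 100]
        System.out.println(Arrays.toString(simulate(4, 3, 100))); // [200, 100, 100]
    }
}
```

Under this rule a sink with parallelism 1 would fill only one of the three partitions, and parallelism 4 would load one partition twice as heavily as the others, which is the kind of skew the test below looks for.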
Environment:
Flink 1.10, kafka_2.11-1.1.1, JDK 8
Producer:
Create the Kafka topic with 3 replicas and 3 partitions, topic name: message-test
kafka_2.11-1.1.1/bin/kafka-topics.sh --create --zookeeper flu02:2181/kafka --replication-factor 3 --partitions 3 --topic message-test
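Message generation: nc -lk, with the data submitted by hand. Here I simply use the numbers 1-100 (copy them straight from Excel).
The producer is a Flink sink; the code is as follows: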
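import java.nio.charset.StandardCharsets;
import java.util.Properties;
import java.util.Random;
import javax.annotation.Nullable;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer;
import org.apache.flink.streaming.connectors.kafka.KafkaSerializationSchema;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.ByteArraySerializer;

public class KafkaProducer {
    private static final Random RANDOM = new Random();
    // Pick keys that hash to different partitions; keys that all hash to the same value would defeat the test.
    private static final String[] keys = "a b f".split("\\W+");

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        DataStreamSource<String> text = env.socketTextStream("flu03", 2020);
        FlinkKafkaProducer<String> myProducer = new FlinkKafkaProducer<String>(
                "message", new KafkaSerializationSchema<String>() {
                    @Override
                    public ProducerRecord<byte[], byte[]> serialize(String element, @Nullable Long timestamp) {
                        int i = RANDOM.nextInt(keys.length);
                        String key = keys[i];
                        System.out.println("key: " + key + ", value: " + element);
                        // Encode key and value as UTF-8 bytes so the record matches the declared byte[] types
                        // (the original used a raw ProducerRecord holding Strings).
                        return new ProducerRecord<>("message-test",
                                key.getBytes(StandardCharsets.UTF_8), element.getBytes(StandardCharsets.UTF_8));
                    }
                    // Note the last argument: it is kafkaProducersPoolSize, referred to as C.
                }, initKafkaConfig(), FlinkKafkaProducer.Semantic.EXACTLY_ONCE, 1);
        // Sink parallelism, referred to as P.
        text.addSink(myProducer).setParallelism(1);
        env.execute("kafka producer");
    }

    private static Properties initKafkaConfig() {
        final Properties properties = new Properties();
        properties.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "flu02:9092,flu03:9092,flu04:9092");
        // Byte-array serializers, because serialize() above already produces byte[] keys and values.
        properties.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, ByteArraySerializer.class.getName());
        properties.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, ByteArraySerializer.class.getName());
        //properties.put(ProducerConfig.INTERCEPTOR_CLASSES_CONFIG,MessageProducerInterceptor.class.getName());
        properties.put(ProducerConfig.CLIENT_ID_CONFIG, "hangz-factory");
        properties.put(ProducerConfig.RETRIES_CONFIG, 3);
        properties.put(ProducerConfig.ACKS_CONFIG, "all");
        return properties;
    }
}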
In the code above, the two values C and P are changed from run to run to see how many messages each partition receives.
Consumer:
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

// Consumes data with a combination of asynchronous and synchronous offset commits.
public class ASyncAndSyncCommitConsumer {
    private static final Logger LOGGER = LoggerFactory.getLogger(ASyncAndSyncCommitConsumer.class);

    public static void main(String[] args) {
        // count maps each partition to the number of records received from it.
        Map<Integer, Long> count = new HashMap<>(16);
        KafkaConsumer<String, String> consumer = new KafkaConsumer<String, String>(loadProp());
        consumer.subscribe(Collections.singletonList("message-test"));
        try {
            for (; ; ) {
                ConsumerRecords<String, String> records = consumer.poll(100);
                records.forEach(record -> {
                    Long value = count.getOrDefault(record.partition(), 0L);
                    count.put(record.partition(), value + 1);
                    LOGGER.info("record.key()={}, record.offset()={}, record.partition()={}, record.timestamp()={}, " +
                            "record.value()={}, count:{}", record.key(), record.offset(), record.partition(), record.timestamp(), record.value(), count.toString());
                });
                // Asynchronous commit after each batch.
                consumer.commitAsync();
            }
        } catch (Exception e) {
            LOGGER.error("Unexpected error", e);
        } finally {
            try {
                // Final synchronous commit before shutting down.
                consumer.commitSync();
            } finally {
                consumer.close();
            }
        }
    }

    private static Properties loadProp() {
        Properties properties = new Properties();
        properties.put("bootstrap.servers", "flu02:9092,flu03:9092,flu04:9092");
        properties.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        properties.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        properties.put("group.id", "test_group-1");
        properties.put("auto.offset.reset", "latest");
        // Disable automatic offset commits.
        properties.put("enable.auto.commit", "false");
        return properties;
    }
}
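Test:
Check the Kafka topic status: kafka_2.11-1.1.1/bin/kafka-topics.sh --describe --zookeeper flu02:2181/kafka --topic message-test
The topic looks healthy.
Start the listening side that feeds the socket source: nc -lk 2020, then copy and paste 1000 numbers.
Test results:
Sample consumer output: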
[29/04/20 05:18:08:008 CST] main INFO consumer.ASyncAndSyncCommitConsumer: record.key()=f, record.offset()=2435, record.partition()=0, record.timestamp()=1588151892721, record.value()=64, count:{0=324, 1=307, 2=359}
[29/04/20 05:18:08:008 CST] main INFO consumer.ASyncAndSyncCommitConsumer: record.key()=f, record.offset()=2436, record.partition()=0, record.timestamp()=1588151892744, record.value()=66, count:{0=325, 1=307, 2=359}
[29/04/20 05:18:08:008 CST] main INFO consumer.ASyncAndSyncCommitConsumer: record.key()=f, record.offset()=2437, record.partition()=0, record.timestamp()=1588151892744, record.value()=69, count:{0=326, 1=307, 2=359}
[29/04/20 05:18:08:008 CST] main INFO consumer.ASyncAndSyncCommitConsumer: record.key()=f, record.offset()=2438, record.partition()=0, record.timestamp()=1588151892744, record.value()=71, count:{0=327, 1=307, 2=359}
[29/04/20 05:18:08:008 CST] main INFO consumer.ASyncAndSyncCommitConsumer: record.key()=f, record.offset()=2439, record.partition()=0, record.timestamp()=1588151892745, record.value()=76, count:{0=328, 1=307, 2=359}
[29/04/20 05:18:08:008 CST] main INFO consumer.ASyncAndSyncCommitConsumer: record.key()=f, record.offset()=2440, record.partition()=0, record.timestamp()=1588151892745, record.value()=86, count:{0=329, 1=307, 2=359}
[29/04/20 05:18:08:008 CST] main INFO consumer.ASyncAndSyncCommitConsumer: record.key()=f, record.offset()=2441, record.partition()=0, record.timestamp()=1588151892745, record.value()=88, count:{0=330, 1=307, 2=359}
[29/04/20 05:18:08:008 CST] main INFO consumer.ASyncAndSyncCommitConsumer: record.key()=f, record.offset()=2442, record.partition()=0, record.timestamp()=1588151892745, record.value()=90, count:{0=331, 1=307, 2=359}
[29/04/20 05:18:08:008 CST] main INFO consumer.ASyncAndSyncCommitConsumer: record.key()=f, record.offset()=2443, record.partition()=0, record.timestamp()=1588151892746, record.value()=91, count:{0=332, 1=307, 2=359}
[29/04/20 05:18:08:008 CST] main INFO consumer.ASyncAndSyncCommitConsumer: record.key()=f, record.offset()=2444, record.partition()=0, record.timestamp()=1588151892746, record.value()=92, count:{0=333, 1=307, 2=359}
[29/04/20 05:18:08:008 CST] main INFO consumer.ASyncAndSyncCommitConsumer: record.key()=f, record.offset()=2445, record.partition()=0, record.timestamp()=1588151892746, record.value()=94, count:{0=334, 1=307, 2=359}
Results of multiple runs:
Parallelism 1, kafkaProducersPoolSize=1: --> {0=35, 1=28, 2=37}; second run: {0=334, 1=307, 2=359}
Parallelism 1, kafkaProducersPoolSize=3: --> {0=39, 1=24, 2=37}
Parallelism 1, kafkaProducersPoolSize=4: --> {0=33, 1=31, 2=36}
Parallelism 3, kafkaProducersPoolSize=1: --> {0=36, 1=37, 2=27}
Parallelism 3, kafkaProducersPoolSize=3: --> {0=209, 1=198, 2=193}
Parallelism 3, kafkaProducersPoolSize=5: --> {0=235, 1=231, 2=234}
Parallelism 4, kafkaProducersPoolSize=1: --> {0=249, 1=226, 2=225}
Parallelism 4, kafkaProducersPoolSize=3: --> {0=327, 1=347, 2=326}
Parallelism 4, kafkaProducersPoolSize=4: --> {0=332, 1=328, 2=340}
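To compare such count maps at a glance, they can be reduced to a single imbalance figure. The following is a minimal sketch, not part of Flink or Kafka: PartitionSkew and imbalance are hypothetical names, and the metric (busiest partition divided by the mean) is just one reasonable choice. It is fed with the {0=334, 1=307, 2=359} result listed above and, for contrast, with a fully skewed map.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

public class PartitionSkew {

    // Ratio of the busiest partition's count to the mean count; 1.0 means perfectly even.
    static double imbalance(Map<Integer, Long> counts) {
        if (counts.isEmpty()) {
            throw new IllegalArgumentException("counts must not be empty");
        }
        long total = 0;
        long max = 0;
        for (long c : counts.values()) {
            total += c;
            max = Math.max(max, c);
        }
        double mean = (double) total / counts.size();
        return max / mean;
    }

    public static void main(String[] args) {
        Map<Integer, Long> observed = new LinkedHashMap<>();
        observed.put(0, 334L);
        observed.put(1, 307L);
        observed.put(2, 359L);
        System.out.println(String.format(Locale.ROOT, "%.3f", imbalance(observed))); // 1.077

        // What a single-partition outcome would look like with the same metric.
        Map<Integer, Long> skewed = new LinkedHashMap<>();
        skewed.put(0, 100L);
        skewed.put(1, 0L);
        skewed.put(2, 0L);
        System.out.println(String.format(Locale.ROOT, "%.3f", imbalance(skewed))); // 3.000
    }
}
```

A value close to 1 means the partitions are loaded about equally; a value near the partition count means almost everything landed in one partition.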
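Conclusion:
With this setup the records are spread roughly evenly over the three partitions for every combination of P and C, so FlinkFixedPartitioner is not in effect here. Note that every record carries a random key (a, b or f), and the consumer log shows key f always landing in partition 0, so the even spread most likely comes from Kafka's default partitioner hashing the key rather than from strict round-robin.
Threads were not tested here. In practice, keeping the sink parallelism equal to the number of Kafka partitions is the more desirable setup.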
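A rough debugging pass through the source shows the key point: in FlinkKafkaProducer, when the partitioner is null, FlinkFixedPartitioner is not used by default. The constructor used above takes a KafkaSerializationSchema and no partitioner, so the target partition is decided from the ProducerRecord returned by serialize().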
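Since every record in this test carries a key, one plausible reading of the even counts is that Kafka chooses the partition from a hash of the key. The sketch below illustrates only that idea. It is plain Java, the names (KeyHashPartitionSketch, partitionFor, distribute) are hypothetical, and the hash is a stand-in (Arrays.hashCode over the UTF-8 bytes), not the hash Kafka actually uses, so the concrete partition numbers will differ from the log above.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Map;
import java.util.TreeMap;

public class KeyHashPartitionSketch {

    // Stand-in for "hash(key bytes) mod numPartitions"; NOT Kafka's real hash function.
    static int partitionFor(String key, int numPartitions) {
        int hash = Arrays.hashCode(key.getBytes(StandardCharsets.UTF_8));
        return (hash & 0x7fffffff) % numPartitions;
    }

    // Sends `records` records, cycling through the given keys, and counts records per partition.
    static Map<Integer, Long> distribute(String[] keys, int records, int numPartitions) {
        Map<Integer, Long> counts = new TreeMap<>();
        for (int i = 0; i < records; i++) {
            String key = keys[i % keys.length];
            counts.merge(partitionFor(key, numPartitions), 1L, Long::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Three keys that hash to three different partitions: even spread.
        System.out.println(distribute(new String[] {"a", "b", "f"}, 99, 3)); // {0=33, 1=33, 2=33}
        // A single key: everything lands in one partition, whatever the sink parallelism is.
        System.out.println(distribute(new String[] {"f"}, 99, 3)); // {1=99}
    }
}
```

With key-based partitioning the distribution follows the keys rather than the sink parallelism or the producer pool size, which would be consistent with the counts staying even for every P and C tried above.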
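Internally, Flink also relies on two-phase commit here: TwoPhaseCommitSinkFunction and related classes.
If anything here is wrong, please point it out. Thank you!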