Flink DataSet Kafka Sink

1. Overview

Flink is generally used for real-time computation, but its DataSet API also provides batch processing. Out of curiosity I tried it on a project, and one requirement I ran into was sinking a DataSet's records to Kafka.

Note that the official Flink DataSet API does not provide a Kafka sink, so you have to implement one yourself. How best to implement it depends on the size of the DataSet.

2. Small data volumes

This one is pretty trivial: call collect() on the DataSet to turn it into a List, then send the List's elements to Kafka. The upside is simplicity; the downside is that nothing is sent in parallel, and collect() pulls the entire result into the client's memory. That is fine for small data sets but completely infeasible for large ones. Code:

2.1 Main class

import com.alibaba.fastjson.JSON;
import dataset.sinkdata.kafka.bean.Json;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import java.util.List;
import java.util.Random;

public class Demo {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
        DataSet<String> value = ...

        // collect() triggers job execution and pulls the whole result into
        // the client JVM, so no separate env.execute() call is needed here
        // (with no new sink defined, it would throw an exception anyway).
        List<String> resLs = value.collect();
        SinkToKafka.kafkaProducer(resLs);
    }
}

2.2 The SinkToKafka class

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerRecord;
import java.util.List;
import java.util.Properties;

public class SinkToKafka {

    public static void kafkaProducer(List<String> resLs) {
        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "localhost:9092");
        props.setProperty("acks", "all");
        props.setProperty("retries", "0");
        props.setProperty("batch.size", "10");
        props.setProperty("linger.ms", "1");
        props.setProperty("buffer.memory", "10240");
        props.setProperty("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.setProperty("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        Producer<String, String> producer = new KafkaProducer<>(props);

        // Key each record by the current timestamp; send() is asynchronous.
        for (String context : resLs) {
            producer.send(new ProducerRecord<>("Test_Topic", String.valueOf(System.currentTimeMillis()), context));
        }

        // close() flushes any records still buffered in the producer.
        producer.close();
    }
}
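One caveat: KafkaProducer.send() is asynchronous and only returns a Future, so delivery failures in the loop above go unnoticed. For a small batch like this, a simple (if slow) option is to block on each Future, which turns any send error into an exception. A minimal sketch of that variant of the loop, assuming java.util.concurrent.ExecutionException is imported:

        // Blocking variant of the send loop: send() returns a
        // Future<RecordMetadata>, and get() waits for the broker's ack.
        for (String context : resLs) {
            try {
                producer.send(new ProducerRecord<>("Test_Topic",
                        String.valueOf(System.currentTimeMillis()), context)).get();
            } catch (InterruptedException | ExecutionException e) {
                // A real job would log and decide whether to retry or abort.
                throw new RuntimeException("Failed to send record to Kafka", e);
            }
        }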

3. Large data volumes

For large data volumes, fortunately the DataSet sink side provides an output() method that accepts a user-defined OutputFormat. Below is a KafkaOutputFormat I wrote by following the JDBCOutputFormat class in flink-jdbc. I tried it out and it works, and performance-wise it leaves the approach above in the dust.

3.1 KafkaOutputFormat

import org.apache.flink.api.common.io.RichOutputFormat;
import org.apache.flink.configuration.Configuration;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import java.io.IOException;
import java.util.Properties;

public class KafkaOutputFormat extends RichOutputFormat<String> {

    private static final Logger LOG = LoggerFactory.getLogger(KafkaOutputFormat.class);

    private String servers;
    private String topic;
    private String acks;
    private String retries;
    private String batchSize;
    private String bufferMemory;
    private String lingerMS;

    private transient Producer<String, String> producer;   // created in open(); not part of the serialized format


    @Override
    public void configure(Configuration parameters) {
        // No-op: all settings are supplied through the builder below.
    }

    @Override
    public void open(int taskNumber, int numTasks) throws IOException {
        // Runs once per parallel subtask; each subtask builds its own producer.
        Properties props = new Properties();

        props.setProperty("bootstrap.servers", this.servers);
        if (this.acks != null) {
            props.setProperty("acks", this.acks);
        }

        if (this.retries != null) {
            props.setProperty("retries", this.retries);
        }

        if (this.batchSize != null) {
            props.setProperty("batch.size", this.batchSize);
        }

        if (this.lingerMS != null) {
            props.setProperty("linger.ms", this.lingerMS);
        }

        if (this.bufferMemory != null) {
            props.setProperty("buffer.memory", this.bufferMemory);
        }

        props.setProperty("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.setProperty("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        producer = new KafkaProducer<>(props);
    }

    @Override
    public void writeRecord(String record) throws IOException {
        // Key each record by the current timestamp; send() buffers and sends asynchronously.
        producer.send(new ProducerRecord<>(this.topic, String.valueOf(System.currentTimeMillis()), record));
    }

    @Override
    public void close() throws IOException {
        if (producer != null) {
            producer.close();   // flushes any buffered records before closing
        }
    }


    public static KafkaOutputFormatBuilder buildKafkaOutputFormat() {
        return new KafkaOutputFormatBuilder();
    }


    public static class KafkaOutputFormatBuilder {
        private final KafkaOutputFormat format;

        public KafkaOutputFormatBuilder() {
            this.format = new KafkaOutputFormat();
        }

        public KafkaOutputFormatBuilder setBootstrapServers(String servers) {
            format.servers = servers;
            return this;
        }

        public KafkaOutputFormatBuilder setTopic(String topic) {
            format.topic = topic;
            return this;
        }

        public KafkaOutputFormatBuilder setAcks(String acks) {
            format.acks = acks;
            return this;
        }

        public KafkaOutputFormatBuilder setRetries(String retries) {
            format.retries = retries;
            return this;
        }

        public KafkaOutputFormatBuilder setBatchSize(String batchSize) {
            format.batchSize = batchSize;
            return this;
        }

        public KafkaOutputFormatBuilder setBufferMemory(String bufferMemory) {
            format.bufferMemory = bufferMemory;
            return this;
        }

        public KafkaOutputFormatBuilder setLingerMs(String lingerMS) {
            format.lingerMS = lingerMS;
            return this;
        }


        public KafkaOutputFormat finish() {
            // servers and topic are mandatory: the producer cannot connect
            // without bootstrap servers, and writeRecord() needs a topic.
            if (format.servers == null) {
                throw new IllegalArgumentException("No bootstrap servers supplied.");
            }
            if (format.topic == null) {
                throw new IllegalArgumentException("No topic supplied.");
            }
            return format;
        }
    }
}
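Because send() hands each record to a background I/O thread, a broker-side failure in writeRecord() above is silently dropped. A hedged sketch of a variant that attaches an error-logging callback (Java 8 lambda over org.apache.kafka.clients.producer.Callback; the log wording is my own, not from the original):

    @Override
    public void writeRecord(String record) throws IOException {
        ProducerRecord<String, String> kafkaRecord =
                new ProducerRecord<>(this.topic, String.valueOf(System.currentTimeMillis()), record);
        // The callback runs on the producer's I/O thread once the broker
        // acknowledges (or rejects) the record.
        producer.send(kafkaRecord, (metadata, exception) -> {
            if (exception != null) {
                LOG.error("Failed to send record to topic " + this.topic, exception);
            }
        });
    }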

3.2 Main class

import dataset.sinkdata.kafka.bean.Json;
import dataset.sinkdata.kafka.method2.KafkaOutputFormat;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;

public class Demo {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
        DataSet<String> value = ...

        value.output(
            KafkaOutputFormat.buildKafkaOutputFormat()
                .setBootstrapServers("localhost:9092")
                .setTopic("Test_Topic_1")
                .setAcks("all")
                .setBatchSize("10")
                .setBufferMemory("10240")
                .setLingerMs("1")
                .setRetries("0")
                .finish()
        );
        env.execute();
    }
}
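One thing worth knowing about this approach: output() returns a DataSink, and open() runs once per parallel subtask, so each subtask keeps its own KafkaProducer and writes concurrently. If you want to cap the number of producers, the parallelism can be set on the sink explicitly (the value 4 below is just illustrative):

        value.output(
            KafkaOutputFormat.buildKafkaOutputFormat()
                .setBootstrapServers("localhost:9092")
                .setTopic("Test_Topic_1")
                .finish()
        ).setParallelism(4);   // at most 4 concurrent KafkaProducer instances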