Manually maintaining Kafka offsets in Flink


Introduction

Let's start with a comparison to Spark: how can Flink, like Spark, keep Kafka offsets in Redis so that data is consumed exactly once, with nothing lost and nothing duplicated? Anyone who has used Spark knows that after reading from Kafka, the DStream (strictly speaking an InputDStream[ConsumerRecord[String, String]]) carries the topic, partition, offset, key, value, timestamp and so on. Maintaining the offsets only takes a single foreach over the DStream, saving them wherever suits your scenario; on restart you simply read the offsets back from Redis.
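For reference, the Spark pattern described above looks roughly like the following sketch (the helper name, the Redis host/port placeholders and the Redis key layout are my own assumptions, not from the original):

import org.apache.kafka.clients.consumer.ConsumerRecord
import org.apache.spark.streaming.dstream.InputDStream
import org.apache.spark.streaming.kafka010.HasOffsetRanges
import redis.clients.jedis.Jedis

// Sketch: persist the end offsets of each micro-batch to Redis after processing it
def saveOffsetsAfterEachBatch(stream: InputDStream[ConsumerRecord[String, String]],
                              redisHost: String, redisPort: Int): Unit = {
  stream.foreachRDD { rdd =>
    // offset ranges covered by this micro-batch
    val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
    // ... run the business logic on `rdd` first ...
    val jedis = new Jedis(redisHost, redisPort)
    offsetRanges.foreach { range =>
      // one hash per topic: field = partition, value = next offset to read
      jedis.hset(s"my_spark_${range.topic}", range.partition.toString, range.untilOffset.toString)
    }
    jedis.close()
  }
}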
First-time Flink users will find that what env.addSource returns is a DataStream[String] whose elements contain only the value. So how should this be handled?

Steps

  1. Customize FlinkKafkaConsumer010: build a KafkaDStream via a custom deserialization schema
  2. Store the offsets in Redis
  3. Read the offsets back on startup

Code

import java.nio.charset.StandardCharsets
import java.util._

import com.oneniceapp.bin.KafkaDStream
import my.nexus.util.StringUtils // private repository
import org.apache.flink.api.common.typeinfo.{TypeHint, TypeInformation}
import org.apache.flink.streaming.connectors.kafka.internals.KafkaTopicPartition
import org.apache.flink.streaming.connectors.kafka.{FlinkKafkaConsumer010, FlinkKafkaConsumerBase, KafkaDeserializationSchema}
import org.apache.kafka.clients.consumer.ConsumerRecord
import org.slf4j.LoggerFactory
import redis.clients.jedis.Jedis

import scala.collection.JavaConversions._

1. Create the KafkaDStream case class

case class KafkaDStream(topic:String, partition:Int, offset:Long, keyMessage:String, message:String){
}

2. Assemble the Kafka record info into a KafkaDStream

  /**
    * Assemble the Kafka record info
    * @param topic
    * @param groupid
    * @return
    */
  def createKafkaSource(topic:java.util.List[String], groupid:String): FlinkKafkaConsumer010[KafkaDStream] ={

    // Kafka consumer configuration
    // KeyedDeserializationSchema is outdated, so use KafkaDeserializationSchema instead
    val dataStream = new FlinkKafkaConsumer010[KafkaDStream](topic, new KafkaDeserializationSchema[KafkaDStream]() {
      override def getProducedType: TypeInformation[KafkaDStream] = TypeInformation.of(new TypeHint[KafkaDStream]() {})

      override def deserialize(record: ConsumerRecord[Array[Byte], Array[Byte]]): KafkaDStream = {
        var key: String = null
        var value: String = null
        if (record.key != null) {
          key = new String(record.key())
        }
        if (record.value != null) {
          value = new String(record.value())
        }
        val kafkasource = new KafkaDStream(record.topic(), record.partition(), record.offset(), key, value)

        kafkasource
      }
      override def isEndOfStream(s: KafkaDStream) = false
    }, getKafkaProperties(groupid))

    // commit offsets back to Kafka when checkpoints complete
    dataStream.setCommitOffsetsOnCheckpoints(true)
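    // Note: committing on checkpoints only takes effect when checkpointing is enabled on the
    // StreamExecutionEnvironment (env.enableCheckpointing(...)); without checkpointing the
    // connector falls back to Kafka's enable.auto.commit / auto.commit.interval.ms settings.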

    dataStream
  }
  
/**
* Kafka configuration
* @param groupId
* @return
*/
private def getKafkaProperties(groupId:String): Properties = {
   val kafkaProps: Properties = new Properties()
   kafkaProps.setProperty("bootstrap.servers", "kafka.brokersxxxxxxx")
   kafkaProps.setProperty("group.id", groupId)
   kafkaProps.setProperty("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
   kafkaProps.setProperty("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
   kafkaProps
 }

  /**
    * Fetch the Kafka offsets stored in Redis
    * @param topics
    * @return
    */
  def getSpecificOffsets(topics: java.util.ArrayList[String]): java.util.Map[KafkaTopicPartition, java.lang.Long] = {

    val specificStartOffsets: java.util.Map[KafkaTopicPartition, java.lang.Long] = new java.util.HashMap[KafkaTopicPartition, java.lang.Long]()

    for (topic <- topics) {
      val jedis = new Jedis(GetPropKey.redis_host, GetPropKey.redis_port)
      val key = s"my_flink_$topic"
      val partitions = jedis.hgetAll(key).toList
      for(partition <- partitions){
        if(!StringUtils.isEmpty(topic) && !StringUtils.isEmpty(partition._1) && !StringUtils.isEmpty(partition._2)){
          // assumes a class-level logger, e.g. private val logger = LoggerFactory.getLogger(getClass)
          logger.warn("topic: {}, partition: {}, offset: {}", topic.trim, partition._1.trim, partition._2.trim)
          specificStartOffsets.put(new KafkaTopicPartition(topic.trim, partition._1.trim.toInt), partition._2.trim.toLong)
        }
      }
      jedis.close()
    }
    specificStartOffsets
  }

3. Fetch the Kafka data in the main program

val topics = new java.util.ArrayList[String]
topics.add(myTopic)
val consumer = createKafkaSource(topics, groupId)
consumer.setStartFromSpecificOffsets(getSpecificOffsets(topics))
val dataStream = env.addSource(consumer)
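On the very first run there are no offsets in Redis yet, so the map returned by getSpecificOffsets is empty. A variant of the call above that guards against this (my own sketch, not part of the original code) falls back to the committed consumer group offsets:

val specificOffsets = getSpecificOffsets(topics)
if (!specificOffsets.isEmpty) {
  // resume exactly where the previous run left off, using the offsets stored in Redis
  consumer.setStartFromSpecificOffsets(specificOffsets)
} else {
  // first run: nothing in Redis yet, fall back to the committed group offsets
  consumer.setStartFromGroupOffsets()
}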

4. Save the offset. This is usually done inside a custom sink's invoke(), which guarantees that the offset is stored only after the record has been processed.

def setOffset(topic:String, partition:Int, offset:Long): Unit ={
   val jedis = new Jedis(GetPropKey.redis_host, GetPropKey.redis_port)
   val gtKey = s"my_flink_$topic"
   jedis.hset(gtKey, partition.toString, offset.toString)
   jedis.close()
 }
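Below is a minimal sketch of such a sink, assuming setOffset above is reachable from the sink (for example, defined in a serializable helper object); the class name and the commented-out processing step are placeholders:

import org.apache.flink.streaming.api.functions.sink.RichSinkFunction

class OffsetTrackingSink extends RichSinkFunction[KafkaDStream] {

  override def invoke(record: KafkaDStream): Unit = {
    // 1. do the real processing first (write to a database, call a service, ...)
    // handle(record)

    // 2. only then persist the offset, so a failure before this point replays the record
    setOffset(record.topic, record.partition, record.offset)
  }
}

It can then be attached with dataStream.addSink(new OffsetTrackingSink).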

Other notes

While using Flink I noticed something rather interesting: just like Spark, as long as you don't repartition the data and keep the original parallelism, the mapping to Kafka partitions stays fixed, so you don't have to worry about out-of-order records within the same partition.
