A Redis-Based Solution for Exactly-Once Consumption of Kafka Data in Spark Streaming

sghuu

Article link: https://blog.csdn.net/sghuu/article/details/103587487

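The demo program creates its Kafka input stream with "KafkaUtils.createDirectStream". Internally this API uses the Kafka client's low-level API, which does not commit offsets automatically (to ZooKeeper).

Official documentation for "KafkaUtils.createDirectStream":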

http://spark.apache.org/docs/2.2.0/streaming-kafka-0-8-integration.html

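3. Solutions
Option 1) Use the API provided by ZooKeeper and write your own code to commit the offsets to ZooKeeper. When the service starts, read the offsets from ZooKeeper and pass them to "KafkaUtils.createDirectStream" as an input parameter.
Pros: integrates with ZooKeeper-based monitoring systems, so consumption can be monitored.

Cons: frequent offset reads and writes may degrade the performance of the ZooKeeper cluster, which in turn can affect the stability of the Kafka cluster.

Option 2) Write your own code to maintain the offsets, and store them in MongoDB or Redis.
Pros: no impact on the ZooKeeper cluster; consumption monitoring can be implemented independently on top of MongoDB or Redis.

Cons: cannot be integrated with ZooKeeper-based monitoring systems.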
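Both options follow the same contract: load the saved offsets at startup, and save the new offsets after each batch. The following is a minimal sketch of that contract in plain Scala. The `OffsetStore` trait and the `InMemoryOffsetStore` class are hypothetical names introduced only for illustration; they are not part of Spark, Kafka, or the demo, and a real implementation would be backed by ZooKeeper, MongoDB, or Redis.

```scala
import scala.collection.mutable

// Hypothetical abstraction for illustration only; not a Spark or Kafka API.
trait OffsetStore {
  // Returns partition -> next offset to read; empty if nothing was saved yet.
  def load(groupId: String, topic: String): Map[Int, Long]
  // Persists partition -> next offset to read for this group and topic.
  def save(groupId: String, topic: String, offsets: Map[Int, Long]): Unit
}

// In-memory stand-in so the sketch runs without any external service.
class InMemoryOffsetStore extends OffsetStore {
  private val data = mutable.Map.empty[(String, String), Map[Int, Long]]

  def load(groupId: String, topic: String): Map[Int, Long] =
    data.getOrElse((groupId, topic), Map.empty[Int, Long])

  def save(groupId: String, topic: String, offsets: Map[Int, Long]): Unit = {
    // Merge, so partitions missing from this batch keep their previous offset.
    val merged = load(groupId, topic) ++ offsets
    data.update((groupId, topic), merged)
  }
}

object OffsetStoreDemo {
  def main(args: Array[String]): Unit = {
    val store = new InMemoryOffsetStore
    // First start: nothing has been saved, so the caller must pick defaults.
    println(store.load("demo-group", "demo-topic"))
    // After a batch: remember where each partition should resume.
    store.save("demo-group", "demo-topic", Map(0 -> 42L, 1 -> 17L))
    // Restart: the saved offsets are the starting point for the stream.
    println(store.load("demo-group", "demo-topic"))
  }
}
```

Keeping the store behind a small interface like this would let the same streaming job switch between Redis and MongoDB without touching the business logic.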

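4. Code example
Based on option 2 above, the offsets are saved to Redis and read back from Redis when the service restarts, so that batches whose offsets have already been saved are not consumed again.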

1) Scala utility class for working with Redis

package xxx.demo.scala_test

import redis.clients.jedis.{Jedis, JedisPool, JedisPoolConfig, Protocol}
import org.apache.commons.pool2.impl.GenericObjectPoolConfig
import org.slf4j.LoggerFactory
import com.typesafe.scalalogging.slf4j.Logger

class RedisUtil extends Serializable {
  // The pool is not serializable, so it is excluded from serialization
  @transient private var pool: JedisPool = null
  // lazy, so the logger is re-created after deserialization instead of being null
  @transient lazy val logger = Logger(LoggerFactory.getLogger("cn.com.flaginfo.demo.scala_test.RedisUtil"))

  // Creates the connection pool on the first call; later calls do nothing
  def makePool(redisHost: String, redisPort: Int,
               password: String, database: Int): Unit = {
    if (pool == null) {
      val poolConfig = new GenericObjectPoolConfig()
      pool = new JedisPool(poolConfig, redisHost, redisPort, Protocol.DEFAULT_TIMEOUT, password, database)

      // Destroy the pool when the JVM shuts down
      sys.addShutdownHook {
        pool.destroy()
        logger.debug("JedisPool destroyed by ShutdownHook")
      }
    }
  }

  // Returns the pool; makePool must have been called first
  def jedisPool: JedisPool = {
    assert(pool != null)
    pool
  }

  // Redis hash key under which the offsets of one consumer group and topic are stored
  def generateKafkaOffsetGroupIdTopicKey(groupId: String, topic: String): String = {
    groupId + "/" + topic
  }
}
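Every connection borrowed from the pool has to be closed again, otherwise the pool eventually runs out of connections. One way to make that hard to forget is the loan pattern, sketched below in plain Scala. `Resources.withResource` and `FakeConnection` are hypothetical names used only for this illustration; the sketch assumes that the pooled connection type implements `AutoCloseable` and that closing it returns it to the pool, which should hold for recent Jedis versions but is worth checking against the version in use.

```scala
// Loan pattern: run `f` with the resource and always close it afterwards,
// even when `f` throws.
object Resources {
  def withResource[R <: AutoCloseable, A](resource: R)(f: R => A): A =
    try f(resource) finally resource.close()
}

// Hypothetical stand-in for a pooled Redis connection, so the sketch has no
// external dependency.
class FakeConnection extends AutoCloseable {
  var closed = false
  def ping(): String = "PONG"
  override def close(): Unit = { closed = true }
}

object ResourcesDemo {
  def main(args: Array[String]): Unit = {
    val conn = new FakeConnection
    val reply = Resources.withResource(conn)(_.ping())
    println(s"reply=$reply closed=${conn.closed}")
  }
}
```

With a real pool, the call site would presumably look like `Resources.withResource(jedisPool.getResource())(_.hgetAll(key))`.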

2) Initialize the Redis utility

import kafka.serializer.{StringDecoder, DefaultDecoder}
import kafka.common.TopicAndPartition
import kafka.message.MessageAndMetadata
import org.apache.spark._
import org.apache.spark.streaming._
import org.apache.spark.streaming.StreamingContext._
import org.apache.spark.streaming.kafka._
import org.apache.spark.sql.SparkSession
import org.bson.Document
import com.mongodb.spark.config._
import com.mongodb.spark._
import com.mongodb._
import xxx.demo.model._
import redis.clients.jedis.{Jedis, JedisPool, JedisPoolConfig, Protocol}
import scala.collection.JavaConversions.{mapAsScalaMap}
import org.slf4j.LoggerFactory
import com.typesafe.scalalogging.slf4j.Logger

...

  val redisUtil = new RedisUtil()
  redisUtil.makePool(redisHost, redisPort, redisPassword, redisDatabase)
  val jedisPool = redisUtil.jedisPool

Note: when no password authentication is required, redisPassword must be set to null; an empty string causes an error.
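Since configuration files usually yield an empty string rather than null for an unset value, it may help to normalize the password before calling makePool. The helper below is a minimal sketch in plain Scala; `RedisPassword.normalize` is a hypothetical name, and treating whitespace-only values as "no password" is an assumption of this sketch, not something the original demo states.

```scala
object RedisPassword {
  // Maps null, empty, and whitespace-only input to null (meaning "no AUTH");
  // any other value is returned unchanged.
  // Assumption: a whitespace-only password is never intended.
  def normalize(raw: String): String =
    Option(raw).filter(_.trim.nonEmpty).orNull
}

object RedisPasswordDemo {
  def main(args: Array[String]): Unit = {
    Seq("secret", "", "   ", null).foreach { raw =>
      println(s"input=[$raw] normalized=[${RedisPassword.normalize(raw)}]")
    }
  }
}
```

The call from step 2 would then become `redisUtil.makePool(redisHost, redisPort, RedisPassword.normalize(redisPassword), redisDatabase)`.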

3) Read the previously saved offsets from Redis

val kafkaOffsetKey = redisUtil.generateKafkaOffsetGroupIdTopicKey(groupId, kafkaTopicName)

// Borrow a connection, read all saved offsets, then return the connection to the pool
val jedisForRead = jedisPool.getResource()
val allOffset: java.util.Map[String, String] =
  try jedisForRead.hgetAll(kafkaOffsetKey) finally jedisForRead.close()

val fromOffsets = scala.collection.mutable.Map[TopicAndPartition, Long]()
if (allOffset != null && !allOffset.isEmpty()) {
  // Convert the Java Map returned by Jedis into a Scala Map
  val allOffsetScala: scala.collection.mutable.Map[String, String] = mapAsScalaMap[String, String](allOffset)
  for (offset <- allOffsetScala) {
    // Pass the saved offsets to Kafka. offset._1: partition, offset._2: offset
    fromOffsets += (TopicAndPartition(newsAnalysisTopic, offset._1.toInt) -> offset._2.toLong)
  }
  logger.debug("fromOffsets : " + fromOffsets.toString())
} else {
  // First run: start every partition at offset 0
  for (i <- 0 until newsAnalysisTopicPartitionCount) {
    fromOffsets += (TopicAndPartition(newsAnalysisTopic, i) -> 0L)
  }
  logger.debug("fromOffsets : " + fromOffsets.toString())
}

// Convert the mutable map to an immutable one
val imutableFromOffsets: Map[TopicAndPartition, Long] = fromOffsets.toMap
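The offset lookup above can also be written as a pure function, which makes it easy to test without Redis or Kafka. The sketch below is one possible shape, under stated assumptions: `TopicPartitionKey` is a hypothetical stand-in for `kafka.common.TopicAndPartition`, and `StartingOffsets.resolve` is a name invented for this example. Unlike the listing above, it fills in offset 0 for any partition that has no saved entry, which also covers partitions added after the offsets were first saved; that is a design choice of this sketch rather than the original behavior.

```scala
import java.util.{HashMap => JHashMap, Map => JMap}
import scala.collection.JavaConverters._

object StartingOffsets {
  // Hypothetical stand-in for kafka.common.TopicAndPartition.
  final case class TopicPartitionKey(topic: String, partition: Int)

  // Builds the starting offset for every partition of `topic`.
  // `saved` is the Redis hash as returned by hgetAll: partition -> offset,
  // both as strings. Partitions without a saved entry start at 0.
  def resolve(topic: String,
              partitionCount: Int,
              saved: JMap[String, String]): Map[TopicPartitionKey, Long] = {
    val fromStore: Map[Int, Long] =
      if (saved == null) Map.empty
      else saved.asScala.map { case (p, o) => p.toInt -> o.toLong }.toMap

    (0 until partitionCount).map { p =>
      TopicPartitionKey(topic, p) -> fromStore.getOrElse(p, 0L)
    }.toMap
  }
}

object StartingOffsetsDemo {
  def main(args: Array[String]): Unit = {
    val saved = new JHashMap[String, String]()
    saved.put("0", "42") // partition 0 was consumed up to offset 42
    // Partition 1 has no entry, so it falls back to offset 0.
    println(StartingOffsets.resolve("demo-topic", 2, saved))
  }
}
```

Note that offset 0 may no longer exist on a topic whose old segments have been removed by retention, so a production version would probably need to query the earliest available offset instead.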

4) Define the message handler: extract the required fields from the metadata

val messageHandler: MessageAndMetadata[String, String] => (String, String, Long, Int) =
  (mmd: MessageAndMetadata[String, String]) => (mmd.topic, mmd.message, mmd.offset, mmd.partition)

5) Create the Kafka input stream

val kafkaParam = Map[String, String](
  "bootstrap.servers" -> kafkaServer,
  "group.id" -> groupId,
  "client.id" -> clientId,
  "auto.offset.reset" -> "smallest",
  "enable.auto.commit" -> "false"
)

val kafkaStream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder, (String, String, Long, Int)](ssc, kafkaParam, imutableFromOffsets, messageHandler)

Here ssc is the StreamingContext object.

6) In the business logic, write the updated offsets to Redis

val kafkaOffsetKey = redisUtil.generateKafkaOffsetGroupIdTopicKey(groupId, kafkaTopicName)

// _._1: topic name, _._2: message body, _._3: offset, _._4: partition
kafkaStream.foreachRDD { rdd =>
  if (!rdd.isEmpty()) {  // this check keeps the offsetRanges.foreach loop below from running unexpectedly
    val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges

    // Process the data
    rdd.foreach { row =>
      logger.info("message : " + row)
    }

    // Start a Redis transaction
    val jedis = jedisPool.getResource()
    try {
      val jedisPipeline = jedis.pipelined()
      jedisPipeline.multi()

      // Update the offsets
      offsetRanges.foreach { offsetRange =>
        logger.debug("partition : " + offsetRange.partition + " fromOffset:  " + offsetRange.fromOffset + " untilOffset: " + offsetRange.untilOffset)
        jedisPipeline.hset(kafkaOffsetKey, offsetRange.partition.toString(), offsetRange.untilOffset.toString())
      }

      jedisPipeline.exec()  // commit the transaction
      jedisPipeline.sync()  // flush the pipeline and read the replies
    } finally {
      jedis.close()  // return the connection to the pool
    }
  }
}
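Because the offsets are written only after a batch has been processed, a crash between the two steps leaves the old offsets in Redis, and the batch is replayed on restart. For the end-to-end result to be exactly-once, the output side therefore has to tolerate such replays. The following plain-Scala simulation sketches one common way to do that, an idempotent sink keyed by (partition, offset). `Record` and `IdempotentSink` are hypothetical names, and the in-memory map merely stands in for a real store with unique keys; none of this is taken from the original demo.

```scala
import scala.collection.mutable

object IdempotentSinkDemo {
  // Hypothetical record shape: the Kafka coordinates plus the message body.
  final case class Record(partition: Int, offset: Long, body: String)

  // Sink keyed by (partition, offset): writing the same record twice
  // overwrites the first copy instead of adding a duplicate.
  class IdempotentSink {
    private val rows = mutable.Map.empty[(Int, Long), String]
    def write(r: Record): Unit = rows.update((r.partition, r.offset), r.body)
    def size: Int = rows.size
  }

  def main(args: Array[String]): Unit = {
    val batch = Seq(Record(0, 10L, "a"), Record(0, 11L, "b"), Record(1, 5L, "c"))
    val sink = new IdempotentSink

    // First attempt: the batch is processed, but the job dies before the
    // offsets reach Redis.
    batch.foreach(sink.write)

    // Restart: the saved offsets still point at the start of the batch,
    // so the same records are delivered again.
    batch.foreach(sink.write)

    println(s"rows stored: ${sink.size}")
  }
}
```

An alternative with the same effect would be to write the results and the offsets in one transaction of the same store, so that either both are saved or neither is.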