1. Problem Description
Every time the Spark Streaming demo program that connects to Kafka is restarted, it starts consuming from the first message in the Kafka queue.
Changing enable.auto.commit and related parameters has no effect.
2. Cause Analysis
The demo program creates the Kafka input stream with "KafkaUtils.createDirectStream". Internally, this API uses the Kafka low-level consumer API, which does not support automatic offset commits (to ZooKeeper).
Official documentation for "KafkaUtils.createDirectStream":
http://spark.apache.org/docs/2.2.0/streaming-kafka-0-8-integration.html
3. Solutions
Option 1) Using the API provided by ZooKeeper, write your own code to commit offsets to ZooKeeper; when the service starts, read the offsets back from ZooKeeper and pass them to "KafkaUtils.createDirectStream" as an input parameter (see the sketch after this list).
Advantage: can integrate with a ZooKeeper-based monitoring system to monitor consumption progress.
Drawback: frequent offset reads and writes may degrade ZooKeeper cluster performance, which in turn affects the stability of the Kafka cluster.
Option 2) Write your own code to maintain offsets and save them to MongoDB or Redis.
Advantages: does not affect ZooKeeper cluster performance; consumption monitoring can be implemented independently on top of MongoDB or Redis.
Drawback: cannot integrate with a ZooKeeper-based monitoring system.
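For Option 1, a minimal sketch of the ZooKeeper read/write side, assuming the Apache Curator client and a znode layout mirroring the old high-level consumer (/consumers/<groupId>/offsets/<topic>/<partition>); the connection string and path layout are placeholders, not part of the original demo:

import org.apache.curator.framework.CuratorFrameworkFactory
import org.apache.curator.retry.ExponentialBackoffRetry

object ZkOffsetStore {
  // Placeholder connection string; replace with the real ZooKeeper ensemble.
  private val client =
    CuratorFrameworkFactory.newClient("zk-host:2181", new ExponentialBackoffRetry(1000, 3))
  client.start()

  // Hypothetical znode path, one node per (groupId, topic, partition).
  private def path(groupId: String, topic: String, partition: Int): String =
    s"/consumers/$groupId/offsets/$topic/$partition"

  def saveOffset(groupId: String, topic: String, partition: Int, offset: Long): Unit = {
    val p = path(groupId, topic, partition)
    if (client.checkExists().forPath(p) == null)
      client.create().creatingParentsIfNeeded().forPath(p)
    client.setData().forPath(p, offset.toString.getBytes("UTF-8"))
  }

  def readOffset(groupId: String, topic: String, partition: Int): Option[Long] = {
    val p = path(groupId, topic, partition)
    if (client.checkExists().forPath(p) == null) None
    else Some(new String(client.getData.forPath(p), "UTF-8").toLong)
  }
}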
4. Code Example
Following Option 2 above, the code below saves offsets to Redis and reads them back when the service restarts, so that messages are not consumed again from the beginning.
1) Scala utility class for Redis operations
package xxx.demo.scala_test

import org.apache.commons.pool2.impl.GenericObjectPoolConfig
import org.slf4j.LoggerFactory
import com.typesafe.scalalogging.slf4j.Logger
import redis.clients.jedis.{JedisPool, Protocol}

class RedisUtil extends Serializable {

  // The pool is not serializable, so exclude it from serialization.
  @transient private var pool: JedisPool = null
  @transient lazy val logger = Logger(LoggerFactory.getLogger(classOf[RedisUtil]))

  def makePool(redisHost: String, redisPort: Int, password: String, database: Int): Unit = {
    if (pool == null) {
      val poolConfig = new GenericObjectPoolConfig()
      pool = new JedisPool(poolConfig, redisHost, redisPort, Protocol.DEFAULT_TIMEOUT, password, database)
      // Release the pool when the JVM shuts down.
      sys.addShutdownHook {
        pool.destroy()
        logger.debug("JedisPool destroyed by shutdown hook")
      }
    }
  }

  def jedisPool: JedisPool = {
    assert(pool != null)
    pool
  }

  // Redis hash key holding the offsets of one (groupId, topic) pair.
  def generateKafkaOffsetGroupIdTopicKey(groupId: String, topic: String): String =
    groupId + "/" + topic
}
2) Initialize the Redis utility
import kafka.serializer.{DefaultDecoder, StringDecoder}
import kafka.common.TopicAndPartition
import kafka.message.MessageAndMetadata
import org.apache.spark._
import org.apache.spark.streaming._
import org.apache.spark.streaming.StreamingContext._
import org.apache.spark.streaming.kafka._
import org.apache.spark.sql.SparkSession
import org.bson.Document
import com.mongodb.spark.config._
import com.mongodb.spark._
import com.mongodb._
import xxx.demo.model._
import redis.clients.jedis.{Jedis, JedisPool, JedisPoolConfig, Protocol}
import scala.collection.JavaConversions.mapAsScalaMap
import org.slf4j.LoggerFactory
import com.typesafe.scalalogging.slf4j.Logger

...

val redisUtil = new RedisUtil()
redisUtil.makePool(redisHost, redisPort, redisPassword, redisDatabase)
val jedisPool = redisUtil.jedisPool
Note: when no password authentication is required, redisPassword must be set to null; an empty string causes an error.
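For example, a guarded initialization could look like this (redisPasswordConfig is a hypothetical configuration value):

// Pass null, not "", when the Redis server requires no password;
// an empty string makes the connection fail.
val redisPassword: String =
  if (redisPasswordConfig == null || redisPasswordConfig.trim.isEmpty) null
  else redisPasswordConfig
redisUtil.makePool(redisHost, redisPort, redisPassword, redisDatabase)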
3) Read the previous offsets from Redis
val kafkaOffsetKey = redisUtil.generateKafkaOffsetGroupIdTopicKey(groupId, kafkaTopicName)
// Borrow a connection from the pool and return it after reading the hash.
val jedis = jedisPool.getResource()
val allOffset: java.util.Map[String, String] =
  try jedis.hgetAll(kafkaOffsetKey) finally jedis.close()
val fromOffsets = scala.collection.mutable.Map[TopicAndPartition, Long]()
if (allOffset != null && !allOffset.isEmpty()) {
  // Convert the Java Map returned by Jedis to a Scala Map.
  val allOffsetScala: scala.collection.mutable.Map[String, String] =
    mapAsScalaMap[String, String](allOffset)
  // Feed the stored offsets to Kafka. offset._1: partition, offset._2: offset
  for (offset <- allOffsetScala) {
    fromOffsets += (TopicAndPartition(kafkaTopicName, offset._1.toInt) -> offset._2.toLong)
  }
} else {
  // First-time consumption: start every partition at offset 0.
  for (i <- 0 until kafkaTopicPartitionCount) {
    fromOffsets += (TopicAndPartition(kafkaTopicName, i) -> 0L)
  }
}
logger.debug("fromOffsets : " + fromOffsets.toString())
// Convert the mutable map to the immutable Map expected by createDirectStream.
val immutableFromOffsets = fromOffsets.toMap
4) Define the message handler, which extracts the required fields from the metadata
val messageHandler: MessageAndMetadata[String, String] => (String, String, Long, Int) =
  (mmd: MessageAndMetadata[String, String]) => (mmd.topic, mmd.message, mmd.offset, mmd.partition)
5) Create the Kafka input stream
val kafkaParam = Map[String, String](
  "bootstrap.servers"  -> kafkaServer,
  "group.id"           -> groupId,
  "client.id"          -> clientId,
  "auto.offset.reset"  -> "smallest",
  "enable.auto.commit" -> "false"
)
val kafkaStream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder, (String, String, Long, Int)](
  ssc, kafkaParam, immutableFromOffsets, messageHandler)
Here, ssc is the StreamingContext object.
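For completeness, a minimal sketch of how ssc might be created (the application name and 5-second batch interval are placeholder values, not part of the original demo):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val sparkConf = new SparkConf().setAppName("KafkaOffsetDemo")
val ssc = new StreamingContext(sparkConf, Seconds(5)) // micro-batch interval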
6) In the business-logic code, update the offsets in Redis
val kafkaOffsetKey = redisUtil.generateKafkaOffsetGroupIdTopicKey(groupId, kafkaTopicName)
// Stream element layout: _1: topic name, _2: message body, _3: offset, _4: partition
kafkaStream.foreachRDD { rdd =>
  if (!rdd.isEmpty()) {
    // The emptiness check prevents the offsetRanges loop from running on empty batches.
    val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
    // Process the data.
    rdd.foreach { row =>
      logger.info("message : " + row + offsetRanges)
    }
    // Open a Redis transaction on a pipeline.
    val jedis = jedisPool.getResource()
    val jedisPipeline = jedis.pipelined()
    jedisPipeline.multi()
    // Update the offsets: one hash field per partition, value = untilOffset
    // (the next offset to read after this batch).
    offsetRanges.foreach { offsetRange =>
      logger.debug("partition : " + offsetRange.partition
        + " fromOffset : " + offsetRange.fromOffset
        + " untilOffset : " + offsetRange.untilOffset)
      jedisPipeline.hset(kafkaOffsetKey, offsetRange.partition.toString(), offsetRange.untilOffset.toString())
    }
    jedisPipeline.exec() // commit the transaction
    jedisPipeline.sync() // flush the pipeline
    jedis.close()
  }
}
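After the foreachRDD logic is registered, the streaming job still needs to be started; a minimal closing sketch:

ssc.start()            // begin receiving and processing data
ssc.awaitTermination() // block until the job stops or fails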