Saving Kafka Offsets to a Database
一、Version Differences
In earlier versions of Kafka, consumer offsets were stored in ZooKeeper; newer versions store them inside Kafka itself, in a dedicated internal topic named __consumer_offsets.
二、Maintenance Approach
Given the topic(s) and consumer group, first check whether the database already holds a consumption record for that group. If it does not, this is the first consumption: fetch each partition's current offset for the topic and save it to the database. If a record does exist, read the per-partition offset columns from the database, wrap them in a Map, and pass that Map into the function that creates the direct DStream. After each Spark batch completes, update the offset columns in the database, which completes the offset commit.
A possible problem: if the job is stopped for too long, the offsets stored in the database may no longer exist in Kafka's log (they have been removed by retention), and an OffsetOutOfRangeException is thrown. To avoid this, every time the stream is created, check whether the stored offsets still exist in Kafka, and auto-correct them if they do not.
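The overall driver flow described above can be sketched as follows. This is a minimal sketch, not the full implementation: it assumes a `StreamingContext` named `ssc` and a `kafkaParams` map already exist, and that the helper methods `getLastCommittedOffsets` (shown later) and a per-partition `saveOffset` routine are available.

```scala
import org.apache.kafka.common.TopicPartition
import org.apache.spark.streaming.kafka010._

val topics = Array("my_topic")   // hypothetical topic name
val group  = "my_group"          // hypothetical consumer group

// 1. Read the offsets stored in the database (auto-corrected if stale).
val fromOffsets: Map[TopicPartition, Long] = getLastCommittedOffsets(topics, group)

// 2. Create the direct stream starting from those offsets.
val stream = KafkaUtils.createDirectStream[String, String](
  ssc,
  LocationStrategies.PreferConsistent,
  ConsumerStrategies.Assign[String, String](
    fromOffsets.keys.toList, kafkaParams, fromOffsets))

// 3. After processing each batch, write the new end offsets back to the database.
stream.foreachRDD { rdd =>
  val ranges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  // ... process rdd here ...
  ranges.foreach(r => saveOffset(r.topic, r.partition, r.untilOffset, group))
}
```

Committing the `untilOffset` only after the batch's processing finishes is what gives the at-least-once guarantee this approach aims for.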
三、Code Implementation
First we need a method that returns a topic's current minimum (or maximum) offsets.
1. Get the offset range
def getTopicOffset(topicName: String, MinOrMax: Int): Map[TopicPartition, Long] = {
  val parser = new OptionParser(false)
  val clientId = "GetOffset"
  val brokerList = brokerListOpt // comma-separated broker list, defined elsewhere
  ToolsUtils.validatePortOrDie(parser, brokerList)
  val metadataTargetBrokers = ClientUtils.parseBrokerList(brokerList)
  val topic = topicName
  val time = MinOrMax
  val topicsMetadata = ClientUtils.fetchTopicMetadata(Set(topic), metadataTargetBrokers, clientId, 1000).topicsMetadata
  if (topicsMetadata.size != 1 || !topicsMetadata.head.topic.equals(topic)) {
    System.err.println(("Error: no valid topic metadata for topic: %s, " +
      "probably the topic does not exist, run kafka-list-topic.sh to verify").format(topic))
    Exit.exit(1)
  }
  val partitions = topicsMetadata.head.partitionsMetadata.map(_.partitionId)
  val fromOffsets = collection.mutable.HashMap.empty[TopicPartition, Long]
  partitions.foreach { partitionId =>
    val partitionMetadataOpt = topicsMetadata.head.partitionsMetadata.find(_.partitionId == partitionId)
    partitionMetadataOpt match {
      case Some(metadata) =>
        metadata.leader match {
          case Some(leader) =>
            // Ask the partition leader for a single offset before the given
            // "time" marker (-1 = latest, -2 = earliest).
            val consumer = new SimpleConsumer(leader.host, leader.port, 10000, 100000, clientId)
            try {
              val topicAndPartition = TopicAndPartition(topic, partitionId)
              val request = OffsetRequest(Map(topicAndPartition -> PartitionOffsetRequestInfo(time, 1)))
              val offsets = consumer.getOffsetsBefore(request).partitionErrorAndOffsets(topicAndPartition).offsets
              fromOffsets += (new TopicPartition(topic, partitionId) -> offsets.head)
            } finally {
              consumer.close()
            }
          case None => System.err.println("Error: partition %d does not have a leader. Skip getting offsets".format(partitionId))
        }
      case None => System.err.println("Error: partition %d does not exist".format(partitionId))
    }
  }
  fromOffsets.toMap
}
Parameter description:
topicName: String — the name of the topic
MinOrMax: Int — -1 returns the topic's maximum (latest) offsets, -2 returns its minimum (earliest) offsets
2. Get the consumer group's last committed offsets, auto-correcting them if needed
def getLastCommittedOffsets(topicName: Array[String], groups: String): Map[TopicPartition, Long] = {
  val toplen = topicName.size
  if (LOG.isInfoEnabled())
    LOG.info("||--Topic:{},getLastCommittedOffsets from PGSQL By JINGXI--||", topicName)
  // Build the query that reads the previously saved offsets from PostgreSQL.
  // Note the leading space and the parentheses: without them the concatenated
  // SQL is malformed and AND/OR precedence would also match other groups' rows.
  var sql_str = "SELECT * FROM spark_offsets_manager WHERE groups = ? AND (topics = ?"
  for (_ <- 0 until toplen - 1) {
    sql_str += " OR topics = ?"
  }
  sql_str += ")"
  val conn = getConn()
  // Use a transaction so the inserts/updates and the final read are atomic
  conn.setAutoCommit(false)
  val fromOffsets = collection.mutable.HashMap.empty[TopicPartition, Long]
  try {
    // First-time consumption: if no record exists for (group, topic),
    // insert one row per partition with an initial offset of 0
    for (x <- 0 until toplen) {
      val statement = conn.prepareStatement("SELECT * FROM spark_offsets_manager WHERE groups = ? AND topics = ?")
      statement.setString(1, groups)
      statement.setString(2, topicName(x))
      val result = statement.executeQuery()
      if (!result.next()) {
        // -2 = earliest; we only need the partition ids here, taken directly
        // from the TopicPartition keys (parsing "topic-partition" strings
        // would break for topic names that contain a dash)
        val earliest = getTopicOffset(topicName(x), -2)
        for (tp <- earliest.keys) {
          val insert = conn.prepareStatement("INSERT INTO spark_offsets_manager (topics,partitions,lastsaveoffsets,groups) VALUES(?,?,?,?)")
          insert.setString(1, topicName(x))
          insert.setInt(2, tp.partition)
          insert.setLong(3, 0L)
          insert.setString(4, groups)
          insert.execute()
        }
        conn.commit()
      }
    }
    // Read the saved offsets for all requested topics
    val statement = conn.prepareStatement(sql_str)
    statement.setString(1, groups)
    for (x <- 0 until toplen) {
      statement.setString(x + 2, topicName(x))
    }
    val rs = statement.executeQuery()
    while (rs.next) {
      val topic = rs.getString("topics")
      val partition = rs.getInt("partitions")
      val lastsaveoffset = rs.getString("lastsaveoffsets")
      // Auto-correction: if the saved offset is missing or has already been
      // removed from Kafka by retention, reset it to the earliest retained offset
      val minOffset = getTopicOffset(topic, -2)(new TopicPartition(topic, partition))
      val lastOffset =
        if (lastsaveoffset == null || lastsaveoffset.toLong < minOffset) {
          val update = conn.prepareStatement("UPDATE spark_offsets_manager SET lastsaveoffsets = ? WHERE topics = ? AND partitions = ? AND groups = ?")
          update.setLong(1, minOffset)
          update.setString(2, topic)
          update.setInt(3, partition)
          update.setString(4, groups)
          update.execute()
          minOffset
        } else lastsaveoffset.toLong
      fromOffsets += (new TopicPartition(topic, partition) -> lastOffset)
    }
    conn.commit()
  } finally {
    conn.close()
  }
  fromOffsets.toMap
}
The code is not fully optimized, but the logic is complete; feel free to improve on it.
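The write-back side of the scheme (updating the database after each batch, as described in section 二) is not shown above. A minimal sketch, reusing `getConn()` from below and the table and column names already used in the code:

```scala
import java.sql.Connection

// Commit one partition's end offset for the given group to the database.
def saveOffset(topic: String, partition: Int, untilOffset: Long, group: String): Unit = {
  val conn = getConn()
  try {
    val st = conn.prepareStatement(
      "UPDATE spark_offsets_manager SET lastsaveoffsets = ? " +
      "WHERE topics = ? AND partitions = ? AND groups = ?")
    st.setLong(1, untilOffset)
    st.setString(2, topic)
    st.setInt(3, partition)
    st.setString(4, group)
    st.executeUpdate()
  } finally {
    conn.close()
  }
}
```

Call this from `foreachRDD` once the batch's processing has finished, one call per `OffsetRange`.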
3. Get a database connection
def getConn(): Connection = {
  val conn = DatabaseUtils.getConn()
  conn
}
You can write your own DatabaseUtils wrapper. I use PostgreSQL, but you can pick whichever database fits your needs; if you want to store the offsets somewhere other than a database, you will have to implement that logic yourself.
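For completeness, here is one possible minimal DatabaseUtils for PostgreSQL using plain JDBC. The URL, user, and password are placeholders; swap the driver and URL for a different database, and consider a connection pool (e.g. HikariCP) for production use:

```scala
import java.sql.{Connection, DriverManager}

object DatabaseUtils {
  private val url  = "jdbc:postgresql://localhost:5432/mydb" // placeholder
  private val user = "postgres"                              // placeholder
  private val pass = "postgres"                              // placeholder

  def getConn(): Connection = {
    // Load the PostgreSQL JDBC driver and open a connection
    Class.forName("org.postgresql.Driver")
    DriverManager.getConnection(url, user, pass)
  }
}
```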
四、Table Schema
You can extend the table with additional fields on top of these.
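A possible definition of the spark_offsets_manager table, with column names taken from the SQL statements in the code above; the types are reasonable guesses and can be adapted:

```sql
CREATE TABLE spark_offsets_manager (
    topics          VARCHAR(255) NOT NULL,  -- topic name
    partitions      INT          NOT NULL,  -- partition id
    lastsaveoffsets BIGINT,                 -- last committed offset
    groups          VARCHAR(255) NOT NULL,  -- consumer group id
    PRIMARY KEY (topics, partitions, groups)
);
```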