Offset Management with Spark Streaming + Kafka

Overview

Kafka + Spark Streaming is mainly used for real-time stream processing and is by now a very common architecture in the big data space. Kafka acts primarily as a buffer: all real-time data passes through it. Managing Kafka offsets is therefore a critical part of the pipeline; handled poorly, it leads to data loss or duplicate consumption. Spark Streaming + Kafka supports two consumption modes, one based on Receivers and one using the Kafka Direct API; this post uses Direct mode (recommended).

Consumption semantics

  1. At-most-once: each record is consumed at most once, which may lose data. The offset has already been committed, but the consumer dies before the data is fully processed.
  2. At-least-once: each record is consumed at least once, which may cause duplicate consumption. The data has been processed, but the consumer dies before the offset is committed.
  3. Exactly-once: each record is consumed exactly once; no data is lost and nothing is processed twice. The offset and the data must be persisted together, which means committing the offset and storing the data in the same transaction, i.e. in the same store, such as MySQL. This is hard to achieve across different stores (see the sketch after this list).
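
To make the Exactly-once requirement concrete, here is a minimal sketch of the idea in Scala (my illustration, not part of the original demos): a processed record and its offset are written inside one MySQL transaction, so either both are persisted or neither is. The events and consumer_offset tables and the connection settings are assumptions.

import java.sql.DriverManager

//hypothetical helper: store one record and its offset atomically
def storeRecordAndOffset(value: String, topic: String, partition: Int, offset: Long): Unit = {
  val conn = DriverManager.getConnection("jdbc:mysql://127.0.0.1:3306/rz", "root", "123456")
  try {
    conn.setAutoCommit(false)
    //1. persist the processed data (events is a hypothetical table)
    val insertData = conn.prepareStatement("insert into events(value) values (?)")
    insertData.setString(1, value)
    insertData.executeUpdate()
    //2. persist the offset in the same transaction (consumer_offset is also hypothetical)
    val saveOffset = conn.prepareStatement(
      "replace into consumer_offset(topic, partition_id, untilOffset) values (?, ?, ?)")
    saveOffset.setString(1, topic)
    saveOffset.setInt(2, partition)
    saveOffset.setLong(3, offset + 1)
    saveOffset.executeUpdate()
    conn.commit()     //both writes become visible together
  } catch {
    case e: Exception =>
      conn.rollback() //neither the data nor the offset is persisted
      throw e
  } finally {
    conn.close()
  }
}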

Three ways to manage offsets

  1. Automatic offset commits (avoid this approach entirely): enable.auto.commit=true. Once the consumer dies, data may be lost or consumed twice; the offsets are completely out of your control.
  2. Kafka's own offset management (At-least-once semantics; usable if your writes are idempotent): since Kafka 0.10+, offsets are stored by default in an internal topic named __consumer_offsets instead of ZooKeeper. Spark Streaming provides the commitAsync() API for committing offsets this way; enable.auto.commit must be set to false. In my tests this approach never lost data but did produce duplicates: after stopping and restarting the streaming application, the last batch before shutdown was consumed again, presumably because the asynchronous commit had not updated the offsets in time. Make sure the downstream writes are idempotent (see the upsert sketch after this list). (Patching the source to commit synchronously should make Exactly-once achievable.)
  3. Custom offset management (recommended; gives At-least-once semantics): offsets can be kept in third-party storage such as an RDBMS, Redis, ZooKeeper, Elasticsearch, or even Kafka itself. If the consumed data is stored in a transactional system, it is strongly recommended to store the offsets alongside it and use a transaction to achieve Exactly-once semantics.
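
As option 2 notes, At-least-once only duplicates work, and duplicates are harmless when the downstream write is idempotent. A minimal sketch using scalikejdbc (also used later in this post) and a hypothetical page_view table keyed by event_id: replaying the same record overwrites the existing row instead of inserting a second one.

import scalikejdbc._

//idempotent write: the primary key on event_id turns a replayed record into
//a harmless overwrite instead of a duplicate insert (page_view is hypothetical)
def upsertEvent(eventId: String, url: String, ts: Long): Unit = {
  DB.autoCommit { implicit session =>
    SQL("replace into page_view(event_id, url, ts) values (?, ?, ?)")
      .bind(eventId, url, ts)
      .update().apply()
  }
}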

Offset management demos

Kafka's built-in offset management

  1. pom.xml
    <properties>
      <scala.version>2.11.12</scala.version>
      <spark.version>2.4.2</spark.version>
      <es.version>6.6.2</es.version>
    </properties>
    
    <dependency>
      <groupId>org.scala-lang</groupId>
      <artifactId>scala-library</artifactId>
      <version>${scala.version}</version>
      <!--<scope>provided</scope>-->
    </dependency>
    <dependency>
      <groupId>org.apache.zookeeper</groupId>
      <artifactId>zookeeper</artifactId>
      <version>3.4.6</version>
    </dependency>
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-streaming_2.11</artifactId>
      <version>${spark.version}</version>
    </dependency>
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-streaming-kafka-0-10_2.11</artifactId>
      <version>${spark.version}</version>
    </dependency>
  2. Streaming demo
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.{SparkConf, TaskContext}
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010._

object BackProessureApp {

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("BackProessureApp").setMaster("local[2]")
      .set("spark.streaming.stopGracefullyOnShutdown","true")
    val ssc = new StreamingContext(conf,Seconds(5))

    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> "localhost:9092",
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id" -> "random",
      "auto.offset.reset" -> "latest",
      //must be set to false, otherwise manual offset commits do not take effect
      "enable.auto.commit" -> (false: java.lang.Boolean)
    )

    val topics = Array("streaming-topic")
    val stream = KafkaUtils.createDirectStream[String, String](
      ssc,
      LocationStrategies.PreferConsistent,
      ConsumerStrategies.Subscribe[String,String](topics,kafkaParams)
    )

    stream.foreachRDD(rdd => {
      //get the offset ranges for the current batch
      val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
      //print the offset range handled by this partition
      rdd.foreachPartition { iter =>
        val o: OffsetRange = offsetRanges(TaskContext.get.partitionId)
        println(s"${o.topic} ${o.partition} ${o.fromOffset} ${o.untilOffset} ${o.topicPartition()}")
      }
      //commit the offsets asynchronously back to Kafka (__consumer_offsets)
      stream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
    })

    ssc.start()
    ssc.awaitTermination()

  }

}
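
commitAsync is asynchronous, which is consistent with the re-consumption after restart described earlier: the last batch's offsets may not have reached Kafka before shutdown. As a diagnostic, the commitAsync overload that takes an OffsetCommitCallback reports when (and whether) each commit actually landed. Below is a minimal sketch reusing the stream from the demo above; it does not make the commit synchronous.

import java.util.{Map => JMap}
import org.apache.kafka.clients.consumer.{OffsetAndMetadata, OffsetCommitCallback}
import org.apache.kafka.common.TopicPartition

stream.foreachRDD { rdd =>
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  //... process the batch here ...
  //commit asynchronously, logging the outcome of each commit
  stream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges, new OffsetCommitCallback {
    override def onComplete(offsets: JMap[TopicPartition, OffsetAndMetadata], exception: Exception): Unit = {
      if (exception != null) println(s"offset commit failed: ${exception.getMessage}")
      else println(s"offset commit succeeded: $offsets")
    }
  })
}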

Storing offsets in MySQL

  1. Database table
    create table kafka_offset(
    group_id varchar(30),
    topic varchar(30),
    partition_id int(5),
    fromOffset bigint(18),
    untilOffset bigint(18),
    primary key(topic,group_id,partition_id)
    );

  2. scalikejdbc connection configuration (application.conf in the resources directory)

db.default.driver="com.mysql.jdbc.Driver"
db.default.url="jdbc:mysql://127.0.0.1:3306/rz?characterEncoding=utf8&useSSL=false&serverTimezone=UTC&rewriteBatchedStatements=true"
db.default.user="root"
db.default.password="123456"
  3. pom.xml
    <properties>
      <scala.version>2.11.12</scala.version>
      <spark.version>2.4.2</spark.version>
      <es.version>6.6.2</es.version>
    </properties>
    
    <dependency>
      <groupId>org.scala-lang</groupId>
      <artifactId>scala-library</artifactId>
      <version>${scala.version}</version>
      <!--<scope>provided</scope>-->
    </dependency>
    <dependency>
      <groupId>org.apache.zookeeper</groupId>
      <artifactId>zookeeper</artifactId>
      <version>3.4.6</version>
    </dependency>
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-streaming_2.11</artifactId>
      <version>${spark.version}</version>
    </dependency>
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-streaming-kafka-0-10_2.11</artifactId>
      <version>${spark.version}</version>
    </dependency>
    <dependency>
      <groupId>mysql</groupId>
      <artifactId>mysql-connector-java</artifactId>
      <version>5.1.24</version>
    </dependency>
    <dependency>
      <groupId>org.scalikejdbc</groupId>
      <artifactId>scalikejdbc_2.11</artifactId>
      <version>3.3.2</version>
    </dependency>
    <dependency>
      <groupId>org.scalikejdbc</groupId>
      <artifactId>scalikejdbc-config_2.11</artifactId>
      <version>3.3.2</version>
    </dependency>
  4. Streaming demo
import org.apache.kafka.common.TopicPartition
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.streaming.kafka010._
import scalikejdbc._
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.{SparkConf, TaskContext}

/**
  * Custom offset storage in MySQL
  */
object DefinedOffSetApp {

  private val user = "root"
  private val password = "123456"
  private val url = "jdbc:mysql://127.0.0.1:3306/rz?characterEncoding=utf8&useSSL=false&serverTimezone=UTC&rewriteBatchedStatements=true"

  def main(args: Array[String]): Unit = {
    //initialize the Spark Streaming application
    val conf = new SparkConf().setAppName("DefinedOffSetApp").setMaster("local[2]")
      .set("spark.streaming.stopGracefullyOnShutdown","true")
    val ssc = new StreamingContext(conf,Seconds(5))
    //Kafka consumer configuration
    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> "localhost:9092",
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id" -> "random",
      "auto.offset.reset" -> "latest",
      //must be set to false, otherwise manual offset management does not take effect
      "enable.auto.commit" -> (false: java.lang.Boolean)
    )
    
    //initialize scalikejdbc (loads application.conf)
    scalikejdbc.config.DBs.setupAll()
    //load the latest offsets for this group id and topic from MySQL; returns an empty map if none are stored
    val fromOffsets = DB.readOnly( implicit session => {
      //note: no WHERE filter here because the table only holds this topic and group id; add one if the table is shared
      SQL("select * from kafka_offset").map(rs => {
        new TopicPartition(rs.string("topic"),rs.int("partition_id")) -> rs.long("untilOffset")
      }).list().apply()
    }).toMap

    val topics = Array("streaming-topic")
    //create a direct stream starting from the stored per-partition offsets; if fromOffsets is empty,
    //fall back to Subscribe (the start position is then governed by auto.offset.reset)
    val stream = if(fromOffsets.isEmpty)
                  KafkaUtils.createDirectStream[String, String](
                    ssc,
                    LocationStrategies.PreferConsistent,
                    ConsumerStrategies.Subscribe[String,String](topics,kafkaParams)
                  )
                else
                  KafkaUtils.createDirectStream[String, String](
                    ssc,
                    LocationStrategies.PreferConsistent,
                    ConsumerStrategies.Assign[String, String](fromOffsets.keys.toList, kafkaParams, fromOffsets)
                  )
	
    //capture the offset ranges of the current batch
    var offsetRanges: Array[OffsetRange] = Array.empty
    stream.transform(rdd => {
      offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
      //print the offset range handled by this partition
      rdd.foreachPartition { iter =>
        val o: OffsetRange = offsetRanges(TaskContext.get.partitionId)
        println(s"${o.topic} ${o.partition} ${o.fromOffset} ${o.untilOffset}")
      }
      rdd
    }).map(x =>
      (x.key(),x.value())
    ).foreachRDD(rdd => {
      //print the batch's records
      rdd.foreach(println)
      //write each partition's offset range back to MySQL
      offsetRanges.foreach(x => {
        DB.autoCommit( implicit session => {
          SQL("replace into kafka_offset(topic,group_id,partition_id,fromOffset,untilOffset) values (?,?,?,?,?)")
            .bind(x.topic,"random",x.partition,x.fromOffset,x.untilOffset)
            .update().apply()
        })
      })
    }
    )
    ssc.start()
    ssc.awaitTermination()

  }
}
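
The demo above gives At-least-once semantics: the records are processed before the offsets are updated, so a crash in between replays the batch. Because both the results and the offsets live in MySQL here, they can instead be written in one transaction, applying the Exactly-once pattern sketched earlier to this demo. A minimal sketch with scalikejdbc's DB.localTx; the word_count(word, cnt) results table and the helper name are assumptions for illustration.

import scalikejdbc._
import org.apache.spark.streaming.kafka010.OffsetRange

//hypothetical helper: write one batch's word counts and its offsets in a
//single MySQL transaction, so either both are persisted or neither is
def saveBatchExactlyOnce(counts: Seq[(String, Long)],
                         offsetRanges: Array[OffsetRange],
                         groupId: String): Unit = {
  DB.localTx { implicit session =>
    //1. persist the results (word_count is a hypothetical table)
    counts.foreach { case (word, cnt) =>
      SQL("insert into word_count(word, cnt) values (?, ?) on duplicate key update cnt = cnt + ?")
        .bind(word, cnt, cnt).update().apply()
    }
    //2. persist the offsets in the same transaction
    offsetRanges.foreach { o =>
      SQL("replace into kafka_offset(topic,group_id,partition_id,fromOffset,untilOffset) values (?,?,?,?,?)")
        .bind(o.topic, groupId, o.partition, o.fromOffset, o.untilOffset)
        .update().apply()
    }
  } //if any statement fails, both the results and the offsets roll back together
}

Inside the foreachRDD above, this helper would replace the separate print and replace-into steps, e.g. saveBatchExactlyOnce(rdd.map(x => (x._2, 1L)).reduceByKey(_ + _).collect(), offsetRanges, "random"); the stored untilOffset then remains the single source of truth for where to resume on restart.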