Overview
Kafka + Spark Streaming is mainly used for real-time stream processing and is, so far, a very common architecture in the big data world. Kafka acts primarily as a buffer: all real-time data flows through Kafka, so managing Kafka offsets is a critical part of the pipeline, and doing it poorly leads to data loss or duplicate consumption. Spark Streaming + Kafka offers two consumption modes, one based on Receivers and one using the Kafka Direct API; this article uses the Direct mode (recommended).
Delivery semantics
- At-most-once: each record is consumed at most once, which may lose data. The offset has already been committed, but the consumer dies before the data is fully processed.
- At-least-once: each record is consumed at least once, which may cause duplicate consumption. The data has been processed, but the consumer dies before the offset is committed.
- Exactly-once: each record is consumed exactly once, with no loss and no duplication. The offset and the data are handled together: the offset commit and the data write must be in the same transaction, i.e. stored in the same database such as MySQL. Across different stores this semantic is hard to achieve; a minimal transactional sketch follows this list.
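To make the same-transaction idea concrete, here is a minimal sketch using scalikejdbc (the library the MySQL demo below relies on). The events table and its columns are hypothetical; the kafka_offset table matches the DDL given later. Both writes commit or roll back together inside one DB.localTx block, so offsets and data can never diverge.

import scalikejdbc._
import org.apache.spark.streaming.kafka010.OffsetRange

// Sketch: write the processed records and the new offsets in ONE MySQL transaction.
def saveBatch(groupId: String, records: Seq[(String, String)], ranges: Array[OffsetRange]): Unit = {
  DB.localTx { implicit session =>
    // hypothetical events(k, v) table used as the data sink
    records.foreach { case (k, v) =>
      SQL("insert into events(k, v) values (?, ?)").bind(k, v).update().apply()
    }
    // upsert the offsets inside the same transaction
    ranges.foreach { o =>
      SQL("replace into kafka_offset(topic,group_id,partition_id,fromOffset,untilOffset) values (?,?,?,?,?)")
        .bind(o.topic, groupId, o.partition, o.fromOffset, o.untilOffset)
        .update().apply()
    }
  }
}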
Three ways to manage offsets
- Automatic offset commit (just stop using this approach): enable.auto.commit=true. Once the consumer dies, data may be lost or consumed twice; the offsets are completely out of your control.
- Kafka's own offset management (At-least-once semantics; usable if your writes are idempotent): since Kafka 0.10+, offsets are by default stored in an internal topic named __consumer_offsets rather than in ZooKeeper. Spark Streaming provides the commitAsync() API specifically for committing offsets, and enable.auto.commit must be set to false. In my tests this approach never lost data, but duplicates did appear: after stopping the streaming application and restarting it, the last batch before the stop was consumed again, presumably because the asynchronous commit had not updated the offsets in time. The downstream writes therefore need to be idempotent. (Changing the commit from asynchronous to synchronous in the source code should make Exactly-once achievable.)
- Custom offset management (recommended; gives At-least-once semantics): offsets can be kept in third-party storage such as an RDBMS, Redis, ZooKeeper, ES, or even Kafka itself. If the consumed data is written to a transactional store, it is strongly recommended to keep the offsets in the same store and use a transaction to achieve Exactly-once semantics. A minimal Redis sketch follows this list; the MySQL demo below covers the RDBMS case.
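For non-transactional stores the pattern is the same: read the last saved offsets at startup and overwrite them after each batch. A minimal Redis sketch, assuming a local Redis and the Jedis client; the hash key layout offsets:<group>:<topic> is made up for illustration:

import org.apache.kafka.common.TopicPartition
import org.apache.spark.streaming.kafka010.OffsetRange
import redis.clients.jedis.Jedis
import scala.collection.JavaConverters._

// One Redis hash per group/topic: field = partition id, value = untilOffset.
def readOffsets(jedis: Jedis, group: String, topic: String): Map[TopicPartition, Long] =
  jedis.hgetAll(s"offsets:$group:$topic").asScala.map { case (partition, offset) =>
    new TopicPartition(topic, partition.toInt) -> offset.toLong
  }.toMap

def saveOffsets(jedis: Jedis, group: String, ranges: Array[OffsetRange]): Unit =
  ranges.foreach { o =>
    jedis.hset(s"offsets:$group:${o.topic}", o.partition.toString, o.untilOffset.toString)
  }

Because Redis and the data sink are not updated atomically, this still gives At-least-once, so duplicates have to be absorbed by idempotent writes.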
Offset management demos
Built-in offset management
- pom.xml
<properties>
<scala.version>2.11.12</scala.version>
<spark.version>2.4.2</spark.version>
<es.version>6.6.2</es.version>
</properties>
<dependency>
<groupId>org.scala-lang</groupId>
<artifactId>scala-library</artifactId>
<version>${scala.version}</version>
<!--<scope>provided</scope>-->
</dependency>
<dependency>
<groupId>org.apache.zookeeper</groupId>
<artifactId>zookeeper</artifactId>
<version>3.4.6</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-streaming_2.11</artifactId>
<version>${spark.version}</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-streaming-kafka-0-10_2.11</artifactId>
<version>${spark.version}</version>
</dependency>
- Streaming demo
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.{SparkConf, TaskContext}
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010._
object BackProessureApp {
def main(args: Array[String]): Unit = {
val conf = new SparkConf().setAppName("BackProessureApp").setMaster("local[2]")
.set("spark.streaming.stopGracefullyOnShutdown","true")
val ssc = new StreamingContext(conf,Seconds(5))
val kafkaParams = Map[String, Object](
"bootstrap.servers" -> "localhost:9092",
"key.deserializer" -> classOf[StringDeserializer],
"value.deserializer" -> classOf[StringDeserializer],
"group.id" -> "random",
"auto.offset.reset" -> "latest",
// must be set to false, otherwise the manual commit does not take effect
"enable.auto.commit" -> (false: java.lang.Boolean)
)
val topics = Array("streaming-topic")
val stream = KafkaUtils.createDirectStream[String, String](
ssc,
LocationStrategies.PreferConsistent,
ConsumerStrategies.Subscribe[String,String](topics,kafkaParams)
)
stream.foreachRDD(rdd => {
// get the offset ranges of the current batch
val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
rdd.foreachPartition { iter =>
// print the offset range handled by this partition
val o: OffsetRange = offsetRanges(TaskContext.get.partitionId)
println(s"${o.topic} ${o.partition} ${o.fromOffset} ${o.untilOffset} ${o.topicPartition()}")
}
// commit the offsets asynchronously (stored in __consumer_offsets)
stream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
})
ssc.start()
ssc.awaitTermination()
}
}
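With commitAsync() the offsets of a batch are only committed some time after the batch finishes, so a restart typically replays the last batch, as observed earlier. A minimal sketch of an idempotent sink that tolerates such replays, assuming the scalikejdbc dependency from the MySQL demo below and a hypothetical result(k, v) table whose primary key is the record key:

import scalikejdbc._

// Idempotent write: replaying a batch overwrites the same rows instead of duplicating them.
def upsert(records: Iterator[(String, String)]): Unit =
  DB.autoCommit { implicit session =>
    records.foreach { case (k, v) =>
      SQL("insert into result(k, v) values (?, ?) on duplicate key update v = ?")
        .bind(k, v, v).update().apply()
    }
  }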
Storing offsets in MySQL
- Database table
create table kafka_offset(
group_id varchar(30),
topic varchar(30),
partition_id int(5),
fromOffset bigint(18),
untilOffset bigint(18),
primary key(topic,group_id,partition_id)
);
- scalikejdbc connection config (application.conf under the resources directory)
db.default.driver="com.mysql.jdbc.Driver"
db.default.url="jdbc:mysql://127.0.0.1:3306/rz?characterEncoding=utf8&useSSL=false&serverTimezone=UTC&rewriteBatchedStatements=true"
db.default.user="root"
db.default.password="123456"
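When scalikejdbc-config's DBs.setupAll() is called (as in the demo below), it reads the db.default.* keys above and creates the connection pool; a minimal usage sketch:

import scalikejdbc._
import scalikejdbc.config.DBs

DBs.setupAll()   // read db.default.* from application.conf and create the connection pool
val count = DB.readOnly { implicit session =>
  SQL("select count(*) from kafka_offset").map(_.long(1)).single().apply()
}
DBs.closeAll()   // release the pool on shutdown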
- pom.xml
<properties>
<scala.version>2.11.12</scala.version>
<spark.version>2.4.2</spark.version>
<es.version>6.6.2</es.version>
</properties>
<dependency>
<groupId>org.scala-lang</groupId>
<artifactId>scala-library</artifactId>
<version>${scala.version}</version>
<!--<scope>provided</scope>-->
</dependency>
<dependency>
<groupId>org.apache.zookeeper</groupId>
<artifactId>zookeeper</artifactId>
<version>3.4.6</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-streaming_2.11</artifactId>
<version>${spark.version}</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-streaming-kafka-0-10_2.11</artifactId>
<version>${spark.version}</version>
</dependency>
<dependency>
<groupId>mysql</groupId>
<artifactId>mysql-connector-java</artifactId>
<version>5.1.24</version>
</dependency>
<dependency>
<groupId>org.scalikejdbc</groupId>
<artifactId>scalikejdbc_2.11</artifactId>
<version>3.3.2</version>
</dependency>
<dependency>
<groupId>org.scalikejdbc</groupId>
<artifactId>scalikejdbc-config_2.11</artifactId>
<version>3.3.2</version>
</dependency>
- Streaming demo
import org.apache.kafka.common.TopicPartition
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.streaming.kafka010._
import scalikejdbc._
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.{SparkConf, TaskContext}
/**
 * Custom offset storage: offsets are read from and written back to MySQL
 */
object DefinedOffSetApp {
private val user = "root"
private val password = "123456"
private val url = "jdbc:mysql://127.0.0.1:3306/rz?characterEncoding=utf8&useSSL=false&serverTimezone=UTC&rewriteBatchedStatements=true"
def main(args: Array[String]): Unit = {
// initialize the Spark Streaming application
val conf = new SparkConf().setAppName("DefinedOffSetApp").setMaster("local[2]")
.set("spark.streaming.stopGracefullyOnShutdown","true")
val ssc = new StreamingContext(conf,Seconds(5))
// Kafka consumer configuration
val kafkaParams = Map[String, Object](
"bootstrap.servers" -> "localhost:9092",
"key.deserializer" -> classOf[StringDeserializer],
"value.deserializer" -> classOf[StringDeserializer],
"group.id" -> "random",
"auto.offset.reset" -> "latest",
// must be set to false
"enable.auto.commit" -> (false: java.lang.Boolean)
)
// initialize scalikejdbc (reads application.conf)
scalikejdbc.config.DBs.setupAll()
// read the latest offsets for the given group id and topic from MySQL; returns an empty map if none exist
val fromOffsets = DB.readOnly( implicit session => {
// the table only contains this one topic and group id, so no WHERE filter is used here; be careful with this in real use
SQL("select * from kafka_offset").map(rs => {
new TopicPartition(rs.string("topic"),rs.int("partition_id")) -> rs.long("untilOffset")
}).list().apply()
}).toMap
val topics = Array("streaming-topic")
// create the DirectStream from the stored per-partition offsets; if fromOffsets is empty, subscribe normally and start according to auto.offset.reset
val stream = if(fromOffsets.isEmpty)
KafkaUtils.createDirectStream[String, String](
ssc,
LocationStrategies.PreferConsistent,
ConsumerStrategies.Subscribe[String,String](topics,kafkaParams)
)
else
KafkaUtils.createDirectStream[String, String](
ssc,
LocationStrategies.PreferConsistent,
ConsumerStrategies.Assign[String, String](fromOffsets.keys.toList, kafkaParams, fromOffsets)
)
// offset ranges of the current batch, captured inside transform
var offsetRanges:Array[OffsetRange] = Array.empty
stream.transform(rdd => {
offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
rdd.foreachPartition{iter =>
val o: OffsetRange = offsetRanges(TaskContext.get.partitionId)
println(s"${o.topic} ${o.partition} ${o.fromOffset} ${o.untilOffset}")
}
rdd
}).map(x =>
(x.key(),x.value())
).foreachRDD(rdd => {
rdd.foreach(println)
// iterate over the offset ranges of each partition and upsert them into MySQL
offsetRanges.foreach(x => {
DB.autoCommit( implicit session => {
SQL("replace into kafka_offset(topic,group_id,partition_id,fromOffset,untilOffset) values (?,?,?,?,?)")
.bind(x.topic,"random",x.partition,x.fromOffset,x.untilOffset)
.update().apply()
})
})
}
)
ssc.start()
ssc.awaitTermination()
}
}
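Note that the offsets are written to MySQL only after the batch output, so a crash between rdd.foreach(println) and the replace into statement replays that batch on restart: this demo is At-least-once. If the processed records are themselves written to the same MySQL database, the output and the offset upsert can be wrapped in a single DB.localTx block, as sketched in the delivery-semantics section, to reach Exactly-once.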