005SparkStreaming
How a Spark Streaming Application Guarantees Exactly-Once
For a streaming job to guarantee Exactly-Once, three requirements must be met:
(1) The Source supports replay.
(2) The stream processing engine itself guarantees Exactly-Once.
(3) The Sink supports idempotent or transactional updates.
In other words, to make a Spark Streaming program Exactly-Once, approach it from these three angles:
(1) Receiving data: read the data from the Source.
(2) Transforming data: transform it with DStream and RDD operators. (Spark Streaming guarantees Exactly-Once internally.)
(3) Storing data: save the results to an external system.
If a Spark Streaming program is to achieve Exactly-Once semantics, every one of these steps must guarantee Exactly-Once.
Add the following to pom.xml (the usual Spark Streaming, Spark SQL and spark-streaming-kafka-0-10 dependencies are also required and are omitted here):
<dependency>
<groupId>org.scalikejdbc</groupId>
<artifactId>scalikejdbc_2.11</artifactId>
<version>3.1.0</version>
</dependency>
<!-- https://mvnrepository.com/artifact/org.scalikejdbc/scalikejdbc-config -->
<dependency>
<groupId>org.scalikejdbc</groupId>
<artifactId>scalikejdbc-config_2.11</artifactId>
<version>3.1.0</version>
</dependency>
<dependency>
<groupId>mysql</groupId>
<artifactId>mysql-connector-java</artifactId>
<version>5.1.39</version>
</dependency>
Code implementation:
import org.apache.kafka.common.TopicPartition
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession
import org.apache.spark.streaming.kafka010.{ConsumerStrategies, HasOffsetRanges, KafkaUtils, LocationStrategies}
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.slf4j.LoggerFactory
import scalikejdbc.{ConnectionPool, DB, _}
/**
* Spark Streaming EOS (Exactly-Once Semantics):
* Input: Kafka
* Process: Spark Streaming
* Output: MySQL
*
* How EOS is guaranteed:
* 1. Offsets are managed by the application itself (enable.auto.commit=false); here they are stored in MySQL
* 2. createDirectStream is used
* 3. Transactional output: result storage and offset commit happen in the same MySQL transaction on the Driver
*/
object SparkStreamingEOSKafkaMysqlAtomic {
@transient lazy val logger = LoggerFactory.getLogger(this.getClass)
def main(args: Array[String]): Unit = {
val topic="topic1"
val group="spark_app1"
//Kafka configuration
val kafkaParams= Map[String, Object](
"bootstrap.servers" -> "node1:6667,node2:6667,node3:6667",
"key.deserializer" -> classOf[StringDeserializer],
"value.deserializer" -> classOf[StringDeserializer],
"auto.offset.reset" -> "latest",//latest earliest
"enable.auto.commit" -> (false: java.lang.Boolean),
"group.id" -> group)
//Create the database connection pool on the Driver
ConnectionPool.singleton("jdbc:mysql://node3:3306/bigdata", "", "")
val conf = new SparkConf().setAppName(this.getClass.getSimpleName.replace("$",""))
val ssc = new StreamingContext(conf,Seconds(5))
//1) On first start or on restart, build the TopicPartitions from the partitions/offsets saved in MySQL
//2) While running, each partition's offset is tracked internally in currentOffsets = Map[TopicPartition, Long]()
//3) Partitions added to the Kafka topic later are NOT picked up automatically while the job is running
val initOffset=DB.readOnly(implicit session=>{
sql"select `partition`,offset from kafka_topic_offset where topic =${topic} and `group`=${group}"
.map(item=> new TopicPartition(topic, item.get[Int]("partition")) -> item.get[Long]("offset"))
.list().apply().toMap
})
//CreateDirectStream
//Start consuming from the specified Topic, Partition and Offset
val sourceDStream =KafkaUtils.createDirectStream[String,String](
ssc,
LocationStrategies.PreferConsistent,
ConsumerStrategies.Assign[String,String](initOffset.keys,kafkaParams,initOffset)
)
sourceDStream.foreachRDD(rdd=>{
if (!rdd.isEmpty()){
val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
offsetRanges.foreach(offsetRange=>{
logger.info(s"Topic: ${offsetRange.topic},Group: ${group},Partition: ${offsetRange.partition},fromOffset: ${offsetRange.fromOffset},untilOffset: ${offsetRange.untilOffset}")
})
//Aggregate/analyze the batch
//Collect the (small) result set to the Driver
val sparkSession = SparkSession.builder.config(rdd.sparkContext.getConf).getOrCreate()
import sparkSession.implicits._
val dataFrame = sparkSession.read.json(rdd.map(_.value()).toDS)
dataFrame.createOrReplaceTempView("tmpTable")
val result=sparkSession.sql(
"""
|select
| -- per minute
| eventTimeMinute,
| -- per language
| language,
| -- page views (pv)
| count(1) pv,
| -- unique users (uv)
| count(distinct(userID)) uv
|from(
| select *, substr(eventTime,0,16) eventTimeMinute from tmpTable
|) as tmp group by eventTimeMinute,language
""".stripMargin
).collect()
//Store the results and commit the offsets on the Driver
//Result storage and offset commit are executed atomically in the same transaction
//The offsets are kept in MySQL
DB.localTx(implicit session=>{
//Store the results
result.foreach(row=>{
sql"""
insert into twitter_pv_uv (eventTimeMinute, language,pv,uv)
value (
${row.getAs[String]("eventTimeMinute")},
${row.getAs[String]("language")},
${row.getAs[Long]("pv")},
${row.getAs[Long]("uv")}
)
on duplicate key update pv=values(pv),uv=values(uv)
""".update.apply()
})
//Commit the offsets
offsetRanges.foreach(offsetRange=>{
val affectedRows = sql"""
update kafka_topic_offset set offset = ${offsetRange.untilOffset}
where
topic = ${topic}
and `group` = ${group}
and `partition` = ${offsetRange.partition}
and offset = ${offsetRange.fromOffset}
""".update.apply()
if (affectedRows != 1) {
throw new Exception(s"""Commit Kafka Topic: ${topic} Offset Failed!""")
}
})
})
}
})
ssc.start()
ssc.awaitTermination()
}
}
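The example above assumes that two MySQL tables already exist: kafka_topic_offset (which must hold one row per partition before the first run, otherwise initOffset is empty) and twitter_pv_uv. Their definitions are not shown here; the following is a possible setup sketch, in which the column types, the primary keys and the partition count of 3 are assumptions inferred from the SQL used above:
import scalikejdbc._

object CreateEosTables {
  def main(args: Array[String]): Unit = {
    //Same MySQL instance as the streaming job (user/password left empty as in the example above)
    ConnectionPool.singleton("jdbc:mysql://node3:3306/bigdata", "", "")
    DB.autoCommit { implicit session =>
      //Offset bookkeeping: one row per (topic, group, partition)
      sql"""
        create table if not exists kafka_topic_offset (
          topic       varchar(128) not null,
          `group`     varchar(128) not null,
          `partition` int          not null,
          offset      bigint       not null,
          primary key (topic, `group`, `partition`)
        )""".execute().apply()
      //Aggregation results: one row per (eventTimeMinute, language)
      sql"""
        create table if not exists twitter_pv_uv (
          eventTimeMinute varchar(16) not null,
          language        varchar(32) not null,
          pv              bigint      not null,
          uv              bigint      not null,
          primary key (eventTimeMinute, language)
        )""".execute().apply()
      //Seed the offsets so the first run starts at offset 0 (assumes topic1 has 3 partitions)
      (0 until 3).foreach { p =>
        sql"""
          insert ignore into kafka_topic_offset (topic, `group`, `partition`, offset)
          values ('topic1', 'spark_app1', ${p}, 0)""".update.apply()
      }
    }
  }
}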
ScalikeJDBC
What is ScalikeJDBC
ScalikeJDBC is a concise DB access library for Scala developers. It is SQL-based: you only write the SQL logic, and ScalikeJDBC handles the database plumbing. The library wraps the JDBC API and exposes a simple, flexible interface; in addition, QueryDSL (a query-building DSL) makes your code type-safe and reusable.
It can be used in production.
Add the required libraries to the IDEA project (pom.xml):
<!-- https://mvnrepository.com/artifact/org.scalikejdbc/scalikejdbc -->
<dependency>
<groupId>org.scalikejdbc</groupId>
<artifactId>scalikejdbc_2.11</artifactId>
<version>3.1.0</version>
</dependency>
<!-- https://mvnrepository.com/artifact/org.scalikejdbc/scalikejdbc-config -->
<dependency>
<groupId>org.scalikejdbc</groupId>
<artifactId>scalikejdbc-config_2.11</artifactId>
<version>3.1.0</version>
</dependency>
<!-- https://mvnrepository.com/artifact/mysql/mysql-connector-java -->
<dependency>
<groupId>mysql</groupId>
<artifactId>mysql-connector-java</artifactId>
<version>5.1.47</version>
</dependency>
Database operations
Database connection configuration
Create application.conf under the resources folder in IDEA:
# MySQL connection settings
db.default.driver="com.mysql.jdbc.Driver"
db.default.url="jdbc:mysql://localhost:3306/spark"
db.default.user="root"
db.default.password="123456"
ScalikeJDBC loads the default configuration by default.
Alternatively, use a custom named configuration:
# MySQL connection settings (custom name: fred)
db.fred.driver="com.mysql.jdbc.Driver"
db.fred.url="jdbc:mysql://localhost:3306/spark"
db.fred.user="root"
db.fred.password="123456"
Loading the configuration
import scalikejdbc._
import scalikejdbc.config.DBs
//Load the default configuration
DBs.setup()
//Load the custom fred configuration
DBs.setup('fred)
Querying the database and mapping the results
//Set up MySQL (loads the default configuration)
DBs.setup()
//Query a single column and collect the values into a list
val list = DB.readOnly({implicit session =>
SQL("select content from post")
.map(rs =>
rs.string("content")).list().apply()
})
for(s <- list){
println(s)
}
case class Users(id:String, name:String, nickName:String)
/**
* Query the database, map each row to an object, and return a collection
*/
//Set up MySQL (loads the fred configuration)
DBs.setup('fred)
//Query all rows and map each one to a Users object
val users = NamedDB('fred).readOnly({implicit session =>
SQL("select * from users").map(rs =>
Users(rs.string("id"), rs.string("name"),
rs.string("nickName"))).list().apply()
})
for (u <- users){
println(u)
}
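The QueryDSL mentioned in the introduction can express the same kind of query without hand-written SQL strings and with compile-time checking of column references. Below is a minimal sketch; UserRecord is a simplified mapping introduced only for this example (it covers just the id and name columns of the users table used above):
import scalikejdbc._

//Simplified mapping of the users table (only id and name)
case class UserRecord(id: String, name: String)
object UserRecord extends SQLSyntaxSupport[UserRecord] {
  override val tableName = "users"
  def apply(u: ResultName[UserRecord])(rs: WrappedResultSet): UserRecord =
    UserRecord(id = rs.get(u.id), name = rs.get(u.name))
}

val u = UserRecord.syntax("u")
//Builds "select ... from users u where u.name = ?" through the DSL
val testUsers = DB.readOnly { implicit session =>
  withSQL {
    select.from(UserRecord as u).where.eq(u.name, "test01")
  }.map(UserRecord(u.resultName)).list.apply()
}
testUsers.foreach(println)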
Inserting data
AutoCommit
/**
* Insert data using auto-commit
*/
val insertResult = DB.autoCommit({implicit session =>
SQL("insert into users(name, nickName) values(?,?)").bind("test01", "test01")
.update().apply()
})
println(insertResult)
Insert and return the generated primary key
/**
* Insert data and return the generated key
*/
val id = DB.localTx({implicit session =>
SQL("insert into users(name, nickName, sex) values(?,?,?)").bind("test", "000", "male")
.updateAndReturnGeneratedKey("nickName").apply()
})
println(id)
Transactional insert
/**
* Insert into the database within a transaction
*/
//scala.util.Try is used here so the deliberate failure can be observed; otherwise the exception would end the program before the println
val tx = scala.util.Try(DB.localTx({implicit session =>
SQL("insert into users(name, nickName, sex) values(?,?,?)").bind("test", "haha", "male").update().apply()
//The next line throws an exception on purpose; the whole transaction is rolled back and neither row is inserted
var s = 1 / 0
SQL("insert into users(name, nickName, sex) values(?,?,?)").bind("test01", "haha01", "male01").update().apply()
}))
println(s"tx = ${tx}")
Updating data
/**
* Update data (note: without a where clause this updates every row in the table)
*/
DB.localTx({implicit session =>
SQL("update users set nickName = ?").bind("xiaoming").update().apply()
})
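Finally, once all database work is done, the connection pools created by DBs.setup() / DBs.setup('fred) can be released:
import scalikejdbc.config.DBs

//Close only the default connection pool
DBs.close()
//Or close every configured pool (default, fred, ...)
DBs.closeAll()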