Structured Streaming: concept overview
A continuous stream of data (the Data stream) passes through the query logic and is appended to a conceptually unbounded table (an ever-growing result table maintained in memory). Each micro-batch behaves like rows added to that table, continuously appended and updated, and the result table is then sunk to an external system such as MySQL.
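To make that model concrete, here is a minimal, self-contained sketch that is not part of the original example: it uses Spark's built-in rate source and the console sink purely for illustration, and every name in it is made up.

import org.apache.spark.sql.SparkSession

// Minimal illustration of the "unbounded result table" model, using the built-in
// rate source (a stream of (timestamp, value) rows) and the console sink.
object UnboundedTableSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("sketch").getOrCreate()
    import spark.implicits._
    spark.sparkContext.setLogLevel("WARN")

    val stream = spark.readStream.format("rate").option("rowsPerSecond", "5").load()

    // Conceptually, every micro-batch appends rows to an unbounded input table;
    // the aggregation below is maintained incrementally over that table.
    val counts = stream.groupBy(($"value" % 10).alias("bucket")).count()

    counts.writeStream
      .format("console")
      .outputMode("complete") // re-emit the whole result table on every trigger
      .start()
      .awaitTermination()
  }
}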
With the concepts out of the way, let's get to the hands-on part: an example that reads data from Kafka, performs a grouped count, and sinks the result to MySQL.
Prerequisite: the Kafka messages look like this: {"clsd": "200", "hpzl": "07", "jgsj": "2020-07-09 23:26:27", "glbm": "530601000000", "sbbh": "010040051205004100", "dldm": "80119", "fxlx": "1", "hphm": "皖xxx", "gcbh": "53d605f46a604790991e74cbe6de8171", "cllx": "X99"}
The consumer side, reading the data from Kafka:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.Trigger

// ConfigurationManager and Constants are project-specific helpers; MysqlSink is shown below.
object Structured_kafkaToMysql {

  def main(args: Array[String]): Unit = {
    val sparkSession = SparkSession.builder().master("local[*]").appName("wordCount").getOrCreate()
    sparkSession.sparkContext.setLogLevel("WARN")

    // Read the Kafka topic as a streaming DataFrame.
    val dFrame = sparkSession.readStream.format("kafka")
      .option("kafka.bootstrap.servers", ConfigurationManager.getProperty(Constants.bootstrap_servers))
      .option("subscribe", ConfigurationManager.getProperty(Constants.topic))
      .load()
    // dFrame.printSchema()

    import sparkSession.implicits._
    // The Kafka value column is binary; cast it to a string before parsing the JSON payload.
    val ds = dFrame.selectExpr("CAST(value AS STRING)").as[String]
    ds.createOrReplaceTempView("test")
    ds.printSchema()

    // Extract the fields we need from the JSON payload.
    val passDf = sparkSession.sql(
      """
        |select
        |get_json_object(value,'$.hphm') as hphm,
        |get_json_object(value,'$.hpzl') as hpzl,
        |get_json_object(value,'$.sbbh') as sbbh,
        |get_json_object(value,'$.dldm') as dldm,
        |substr(get_json_object(value,'$.jgsj'),1,16) as date,
        |CAST(get_json_object(value,'$.jgsj') AS Timestamp) jgsj,
        |substr(get_json_object(value,'$.glbm'),1,6) as glbm
        |from test
      """.stripMargin).toDF("hphm", "hpzl", "sbbh", "dldm", "date", "jgsj", "glbm")
    passDf.printSchema()

    import org.apache.spark.sql.functions._
    // Watermark on the event-time column, then count per 1-minute window / date / glbm / dldm.
    val resultdf = passDf.withWatermark("jgsj", "5 minutes")
      .groupBy(window($"jgsj", "1 minute", "1 minute"), col("date"), col("glbm"), col("dldm"))
      .count()
    //  .sort($"date".desc)

    val sink = new MysqlSink(
      ConfigurationManager.getProperty("jdbc.url"),
      ConfigurationManager.getProperty("jdbc.user"),
      ConfigurationManager.getProperty("jdbc.password"))

    val query = resultdf.writeStream
      //  .format("console")
      .outputMode("update")               // output modes: complete / update / append
      .trigger(Trigger.ProcessingTime(0)) // trigger the next micro-batch as soon as possible
      .foreach(sink)
      .start()

    query.awaitTermination()
    // resultdf.show()  // show() is not supported on a streaming DataFrame
  }
}
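The example reads its Kafka and JDBC settings through a project-specific ConfigurationManager / Constants pair that the post does not show. A minimal sketch of what they might look like, assuming a config.properties file on the classpath (the property key names below are made up):

import java.util.Properties

// Hypothetical stand-ins for the project's configuration helpers; only the
// getProperty(key) signature used above actually matters.
object Constants {
  val bootstrap_servers = "kafka.bootstrap.servers" // assumed property key
  val topic = "kafka.topic"                         // assumed property key
}

object ConfigurationManager {
  private val props = new Properties()
  props.load(getClass.getClassLoader.getResourceAsStream("config.properties"))

  def getProperty(key: String): String = props.getProperty(key)
}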
The sink that writes to MySQL:
import java.sql.{Connection, DriverManager}
import org.apache.spark.sql.{ForeachWriter, Row}

class MysqlSink(url: String, user: String, pwd: String) extends ForeachWriter[Row] {

  var conn: Connection = _

  override def open(partitionId: Long, epochId: Long): Boolean = {
    Class.forName("com.mysql.jdbc.Driver")
    conn = DriverManager.getConnection(url, user, pwd)
    true
  }

  override def process(value: Row): Unit = {
    // Row layout from the aggregation: 0 = window, 1 = date, 2 = glbm, 3 = dldm, 4 = count.
    // REPLACE INTO overwrites the existing row for the same key, so updated counts
    // arriving in "update" mode do not create duplicates.
    val p = conn.prepareStatement("replace into real_time_pass(date, glbm, dldm, count) values(?,?,?,?)")
    p.setString(1, value(1).toString)
    p.setString(2, value(2).toString)
    p.setString(3, value(3).toString)
    p.setString(4, value(4).toString)
    p.execute()
  }

  override def close(errorOrNull: Throwable): Unit = {
    conn.close()
  }
}
Notes:
1. Cross-batch aggregation needs the withWatermark function: rows whose event time is earlier than (the maximum event time seen in the result table minus the configured delay) no longer take part in the computation. For example, with withWatermark("jgsj", "5 minutes"), once the largest jgsj observed is 23:30:00, a late row whose jgsj is before 23:25:00 is simply dropped.
2. outputMode supports three modes:
complete: output the entire result table every time
update: output only rows that were added or changed since the last trigger
append: output only newly appended rows
3. There is no built-in MySQL sink, so the write logic has to be written by hand (the MysqlSink class above).
Run the program; the result sunk to the MySQL table:
And the table design:
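The original screenshot of the table design is not reproduced here. Based on the replace into real_time_pass(date, glbm, dldm, count) statement above, a plausible design can be sketched; the column types, sizes, and connection settings below are assumptions, but some unique key is required, because REPLACE INTO only overwrites rows that collide on a primary or unique key.

import java.sql.DriverManager

// One-off helper that creates a table compatible with the sink above
// (all types and the JDBC URL are assumptions, not the author's actual design).
object CreateRealTimePassTable {
  def main(args: Array[String]): Unit = {
    val conn = DriverManager.getConnection("jdbc:mysql://localhost:3306/test", "root", "password")
    val ddl =
      """
        |CREATE TABLE IF NOT EXISTS real_time_pass (
        |  date  VARCHAR(16) NOT NULL,
        |  glbm  VARCHAR(6)  NOT NULL,
        |  dldm  VARCHAR(8)  NOT NULL,
        |  count BIGINT      NOT NULL,
        |  PRIMARY KEY (date, glbm, dldm)
        |)
      """.stripMargin
    conn.createStatement().execute(ddl)
    conn.close()
  }
}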