DataSet and DataFrame API
Since Spark 2.0, DataFrames and Datasets can represent static, bounded data as well as streaming, unbounded data. Similar to static Datasets/DataFrames, you can use the common entry point SparkSession (Scala/Java/Python/R docs) to create streaming Datasets/DataFrames from streaming sources and apply the same operations to them as to static Datasets/DataFrames.
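A minimal sketch of the shared entry point (the HDFS paths and application name below are placeholders): the only difference between the static and the streaming case is read versus readStream, and the same operations can then be applied to either DataFrame.
import org.apache.spark.sql.SparkSession

val spark = SparkSession
.builder
.appName("StructuredStreamingEntryPoint")
.master("local[*]")
.getOrCreate()

// a static, bounded DataFrame
val staticDF = spark.read.json("hdfs://train:9000/static/json")
// a streaming, unbounded DataFrame over the same kind of data;
// streaming file sources require an explicit schema, reused from the static read here
val streamingDF = spark.readStream
.schema(staticDF.schema)
.json("hdfs://train:9000/streaming/json")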
Input Sources
- File source (fault-tolerant)
Reads files written to a directory as a data stream. Supported file formats are text, csv, json, orc and parquet (a JSON variant is sketched after the CSV example below).
csv
// Read all CSV files that are written into a directory
//name,age,salary,sex,dept,deptNo
val userSchema = new StructType()
.add("name", "string")
.add("age", "integer")
.add("salary", "double")
.add("sex", "boolean")
.add("dept", "string")
.add("deptNo", "integer")
val csvDF = spark
.readStream
.option("sep", ";")
.option("header","true") //去除表头
.schema(userSchema) // 指定csv文件的架构
.csv("hdfs://train:9000/results/csv")
// Start the query that prints the incoming rows to the console
val query = csvDF.writeStream
.outputMode("Append")
.format("console")
.start()
query.awaitTermination()
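The same directory-watching pattern works for the other file formats; a minimal sketch for JSON, reusing the userSchema defined above (the HDFS path is an assumed placeholder):
// each line of each file is one JSON object, e.g.
// {"name":"tom","age":25,"salary":3000.0,"sex":true,"dept":"sale","deptNo":1}
val jsonDF = spark
.readStream
.schema(userSchema) // streaming file sources require an explicit schema
.json("hdfs://train:9000/results/json")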
- Socket source (not fault-tolerant)
import spark.implicits._
// create the input stream
val lines = spark.readStream
.format("socket")
.option("host", "train")
.option("port", 9999)
.load()
val words = lines.as[String]
.flatMap(_.split(" "))
.groupBy("value")
.count()
// Start the query that prints the running counts to the console
val query = words.writeStream
.outputMode("complete")
.format("console")
.start()
query.awaitTermination()
- Kafka source (fault-tolerant)
// subscribe to one topic
val df = spark
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", "train:9092")
.option("subscribe", "topic01")
.load()
.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
.as[(String, String)]
// Start the query that prints the records to the console
val query = df.writeStream
.outputMode("append")
.format("console")
.start()
query.awaitTermination()
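Besides a single topic, the Kafka source can also subscribe to a comma-separated list of topics or to a topic pattern; a brief sketch (the topic names are illustrative):
// subscribe to several topics
val multiTopicDF = spark
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", "train:9092")
.option("subscribe", "topic01,topic02")
.load()
// subscribe to every topic whose name matches a pattern
val patternDF = spark
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", "train:9092")
.option("subscribePattern", "topic.*")
.load()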
Basic Operations
import org.apache.spark.sql.SparkSession
case class DeviceData(device: String, deviceType: String, signal: Double, time: String)
object KafkaSource {
def main(args: Array[String]): Unit = {
val spark = SparkSession
.builder
.appName("StructuredNetworkWordCount")
.master("local[6]")
.getOrCreate()
import spark.implicits._
// create the input stream
// subscribe to one topic
val df = spark
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", "train:9092")
.option("subscribe", "topic01")
.load()
.selectExpr("CAST(value AS STRING)")
.as[String].map(line=>{
val ts = line.split(",")
DeviceData(ts(0),ts(1),ts(2).toDouble,ts(3))
}).toDF()
.groupBy("deviceType","device")
.mean("signal")
// Start the query that prints the aggregated results to the console
val query = df.writeStream
.outputMode("Complete")
.format("console")
.start()
query.awaitTermination()
}
}
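Most untyped (DataFrame) and typed (Dataset) operations that work on static data can be applied to the stream in the same way. A brief sketch on the same device stream, where deviceDS stands for the Dataset[DeviceData] built above before the aggregation and the threshold is illustrative (import spark.implicits._ is assumed to be in scope):
// untyped: projection and filtering by column name
val strongSignals = deviceDS.toDF()
.select("device", "signal")
.where("signal > 10")
// typed: the same filter expressed against the case class
val strongSignalDevices = deviceDS
.filter(_.signal > 10)
.map(_.device)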
The same mean-signal aggregation can also be expressed with Spark SQL by registering the stream as a temporary view (DeviceData is the case class defined above):
import org.apache.spark.sql.SparkSession
object KafkaSource {
def main(args: Array[String]): Unit = {
val spark = SparkSession
.builder
.appName("StructuredNetworkWordCount")
.master("local[6]")
.getOrCreate()
import spark.implicits._
// create the input stream
// subscribe to one topic
spark
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", "train:9092")
.option("subscribe", "topic01")
.load()
.selectExpr("CAST(value AS STRING)")
.as[String].map(line=>{
val ts = line.split(",")
DeviceData(ts(0),ts(1),ts(2).toDouble,ts(3))
}).toDF().createOrReplaceTempView("t_device")
val sql =
"""
|select device,deviceType,avg(signal)
| from t_device
| group by deviceType,device
""".stripMargin
val results = spark.sql(sql)
// Start the query that prints the aggregated results to the console
val query = results.writeStream
.outputMode("Complete")
.format("console")
.start()
query.awaitTermination()
}
}
Output Sinks
File sink (Append)
Stores the output to a directory.
val inputs = spark.readStream
.format("kafka")
.option("kafka.bootstrap.servers", "CentOS:9092")
.option("subscribePattern", "topic.*")
.load()
.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)") // cast the binary key/value columns; CSV cannot store binary columns
//3. Create the StreamingQuery object
val query:StreamingQuery = inputs.writeStream
.outputMode(OutputMode.Append())
.format("csv")
.option("sep", ",")
.option("header", "true")//去除表头
.option("inferSchema", "true")
.option("path", "hdfs://train:9000/structured/csv")
.option("checkpointLocation", "hdfs://train:9000/structured-checkpoints")
.start()
Note: the file sink can only be used in Append mode, and it provides exactly-once writes.
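The file sink can also split the output into subdirectories by column values via partitionBy; a minimal sketch, assuming userDF is a streaming DataFrame that contains a dept column (for example the CSV source shown earlier):
val partitionedQuery:StreamingQuery = userDF.writeStream
.outputMode(OutputMode.Append())
.format("parquet")
.option("path", "hdfs://train:9000/structured/parquet")
.option("checkpointLocation", "hdfs://train:9000/structured-checkpoints-parquet")
.partitionBy("dept") // one subdirectory per dept value
.start()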
Kafka sink (Append, Update, Complete)
Stores the output to one or more topics in Kafka. Here we describe the support for writing both streaming queries and batch queries to Kafka. Note that Kafka only provides at-least-once write semantics, so some records may be duplicated when writing a streaming or batch query to Kafka; this can happen, for example, when Kafka has to retry a message that the broker did not acknowledge, even though the broker had already received and written the record. Because of these Kafka write semantics, Structured Streaming cannot prevent such duplicates. A DataFrame written to Kafka should contain the following columns in its schema: an optional key column, a required value column and an optional topic column.
If the "topic" configuration option is not specified, the topic column is required.
Writing to Kafka requires the spark-sql-kafka dependency:
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-sql-kafka-0-10_2.11</artifactId>
<version>2.4.5</version>
</dependency>
//3. Create the StreamingQuery object
val query:StreamingQuery = results.writeStream
.outputMode(OutputMode.Update())
.format("kafka")
.option("checkpointLocation", "hdfs://CentOS:9000/structured-checkpoints-kafka")
.option("kafka.bootstrap.servers", "CentOS:9092")
.option("topic", "topic01")//覆盖DF中topic字段
.start()
The DataFrame being written must contain a value column of string type. The key column is optional; if it is missing, the key is treated as null. The topic column is also optional, provided the topic is configured through the "topic" option; otherwise a topic column must be present in the DataFrame.
Complete example
// sample input line: 1,zhangsan,1,4.5
val inputs = spark.readStream
.format("socket")
.option("host", "CentOS")
.option("port", "9999")
.load()
import org.apache.spark.sql.functions._
val results = inputs.as[String].map(_.split(","))
.map(ts => (ts(0).toInt, ts(1), ts(2).toInt * ts(3).toDouble))
.toDF("id", "name", "cost")
.groupBy("id", "name")
.agg(sum("cost") as "cost" )
.as[(Int,String,Double)]
.map(t=>(t._1+":"+t._2,t._3+""))
.toDF("key","value")
//3. Create the StreamingQuery object
val query:StreamingQuery = results.writeStream
.outputMode(OutputMode.Update())
.format("kafka")
.option("checkpointLocation", "hdfs://CentOS:9000/structured-checkpoints-kafka")
.option("kafka.bootstrap.servers", "CentOS:9092")
.option("topic", "topic01")//覆盖DF中topic字段
.start()
query.awaitTermination()
Console sink (for debugging)
Prints the output to the console/stdout every time there is a trigger. Both Append and Complete output modes are supported. Because the entire output is collected and stored in the driver's memory after every trigger, this sink should only be used for debugging on small data volumes.
//3. Create the StreamingQuery object
val query:StreamingQuery = results.writeStream
.outputMode(OutputMode.Complete())
.format("console")
.option("numRows","2")
.option("truncate","true")
.start()
Memory sink (for debugging)
The output is stored in memory as an in-memory table. Both Append and Complete output modes are supported. Because the entire output is collected and stored in the driver's memory, it should only be used for debugging on small data volumes, so use it with caution.
val lines:DataFrame = spark.readStream
.format("socket")
.option("host", "CentOS")
.option("port", 9999)
.load()
//2. Define the continuous query, just as for a static DataFrame
val wordCounts:DataFrame = lines.as[String]
.flatMap(_.split(" "))
.groupBy("value")
.count()
//3. Create the StreamingQuery object
val query:StreamingQuery = wordCounts.writeStream
.outputMode(OutputMode.Complete())
.format("memory")
.queryName("t_word")
.start()
// periodically query the in-memory table registered as t_word
new Thread(){
override def run(): Unit = {
while(true){
Thread.sleep(1000)
spark.sql("select * from t_word").show()
}
}
}.start()
Foreach|ForeachBatch sink
Runs arbitrary logic on the records of the output. With the foreach and foreachBatch operations you can apply arbitrary operations and custom write logic to the output of a streaming query. Their use cases differ slightly: foreach allows custom write logic for every row, while foreachBatch allows arbitrary operations and custom logic on the output of each micro-batch.
foreachBatch
foreachBatch(...) lets you specify a function that is executed on the output data of every micro-batch of the streaming query.
val lines:DataFrame = spark.readStream
.format("socket")
.option("host", "CentOS")
.option("port", 9999)
.load()
//2. Define the continuous query, just as for a static DataFrame
val wordCounts:DataFrame = lines.as[String]
.flatMap(_.split(" "))
.groupBy("value")
.count()
//3. Create the StreamingQuery object
val query:StreamingQuery = wordCounts.writeStream
.outputMode(OutputMode.Complete())
.foreachBatch((ds,batchID)=>{
ds.write
.mode(SaveMode.Overwrite).format("json")
.save("hdfs://CentOS:9000/results/structured-json")
})
.start()
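A common use of foreachBatch is to reuse existing batch writers and to write each micro-batch to more than one location; a sketch of that pattern (the output paths are assumed placeholders):
val multiSinkQuery:StreamingQuery = wordCounts.writeStream
.outputMode(OutputMode.Complete())
.foreachBatch((batchDF: DataFrame, batchId: Long) => {
// cache the micro-batch so it is not recomputed for each write
batchDF.persist()
batchDF.write.mode(SaveMode.Overwrite).json("hdfs://CentOS:9000/results/json-a")
batchDF.write.mode(SaveMode.Overwrite).json("hdfs://CentOS:9000/results/json-b")
batchDF.unpersist()
})
.start()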
Foreach
class WordCountWriter extends ForeachWriter[Row]{
override def open(partitionId: Long, epochId: Long): Boolean = {
// println("connection opened")
true // returning true means process() will be called for the rows of this partition
}
override def process(value: Row): Unit = {
val Row(word,count)=value
println(word,count)
}
override def close(errorOrNull: Throwable): Unit = {
// println("resources released")
}
}
//3. Create the StreamingQuery object
val query:StreamingQuery = wordCounts.writeStream
.outputMode(OutputMode.Complete())
.foreach(new WordCountWriter)
.start()
Writing data to Redis
import org.apache.spark.sql.{DataFrame, ForeachWriter, Row, SparkSession}
import org.apache.spark.sql.streaming.{OutputMode, StreamingQuery}
import redis.clients.jedis.{Jedis, JedisPool}

object StructedStreamForEach {
def main(args: Array[String]): Unit = {
val spark = SparkSession
.builder
.appName("StructuredNetworkWordCount")
.master("local[*]")
.getOrCreate()
import spark.implicits._
spark.sparkContext.setLogLevel("ERROR")
val lines: DataFrame = spark.readStream
.format("socket")
.option("host", "train")
.option("port", 9999)
.load()
//2. Define the continuous query, just as for a static DataFrame
val wordCounts: DataFrame = lines.as[String]
.flatMap(_.split(" "))
.groupBy("value")
.count()
//3. Create the StreamingQuery object
val query: StreamingQuery = wordCounts.writeStream
.outputMode(OutputMode.Complete())
.foreach(new WordCountWriter)
.start()
query.awaitTermination()
}
}
class WordCountWriter extends ForeachWriter[Row]{
// lazy: the pool is created on the executor, after the writer has been deserialized
lazy val jedisPool:JedisPool=createJedisPool()
var jedis:Jedis=null
def createJedisPool(): JedisPool = {
new JedisPool("train",6379)
}
override def open(partitionId: Long, epochId: Long): Boolean = {
jedis=jedisPool.getResource
true // returning true means process() will be called for the rows of this partition
}
override def process(value: Row): Unit = {
val Row(word,count)=value
println(word,count)
jedis.set(word.toString,count.toString)
}
override def close(errorOrNull: Throwable): Unit = {
jedis.close()
}
}