Spark Streaming
Stream Computing
- Stream data: high-volume, fast, time-varying, continuously arriving
- Characteristics:
  - data arrives rapidly and continuously
  - many data sources, complex formats
  - large data volume
  - emphasis on the overall value of the data rather than on individual records
  - data may arrive out of order or be incomplete
- Processing pipeline: data collection → real-time analysis → result feedback
- Processing engine requirements: low latency, scalability, high reliability
Spark Streaming
- Supported input sources: Kafka, Flume, HDFS, TCP sockets
- Supported output sinks: HDFS, databases, dashboards
- Spark runs tasks as threads (thread-level parallelism), so it responds quickly and can achieve second-level latency, enabling efficient stream computing
- Spark Streaming's data abstraction: the DStream (a sequence of RDDs over time)
DStream Operations
- Fixed steps when writing a Spark Streaming program:
- Start receiving and processing the data:
streamingContext.start()
- Wait for the processing to finish:
streamingContext.awaitTermination()
- Manually stop the stream computation:
streamingContext.stop()
- Create a StreamingContext:
// import the Spark and Spark Streaming classes
import org.apache.spark._
import org.apache.spark.streaming._
// local mode with 2 threads: one receives data, the other processes it
val conf = new SparkConf().setMaster("local[2]").setAppName("TestDStream")
// batch interval of 2 seconds
val ssc = new StreamingContext(conf, Seconds(2))
Input DStreams (basic input sources)
- File streams
val lines = ssc.textFileStream("xxxx")
val words = lines.flatMap(_.split(" "))
val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _)
wordCounts.print()
ssc.start()
ssc.awaitTermination()
Build and package: vim simple.sbt
Put the following in simple.sbt:
name := "Simple Project"
version := "1.0"
scalaVersion := "2.11.8"
libraryDependencies += "org.apache.spark" % "spark-streaming_2.11" % "2.1.0"
Run the sbt packaging command, then submit the jar:
sbt package
spark-submit --class "xxx" /xx/xx/xx.jar
- Socket streams
// NetworkWordCount.scala
import org.apache.spark._
import org.apache.spark.streaming._
import org.apache.spark.storage.StorageLevel
object NetworkWordCount{
def main(args:Array[String]){
// two arguments are required: hostname and port
if(args.length < 2){
System.err.println("Usage: NetworkWordCount <hostname> <port>")
System.exit(1)
}
// set the log level (StreamingExamples is a helper from the Spark examples package)
StreamingExamples.setStreamingLogLevels()
val conf = new SparkConf().setMaster("local[2]").setAppName("Name")
val ssc = new StreamingContext(conf, Seconds(1))
// socketTextStream takes the hostname args(0) and the port args(1)
val lines = ssc.socketTextStream(args(0), args(1).toInt,
StorageLevel.MEMORY_AND_DISK_SER)
val words = lines.flatMap(_.split(" "))
val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _)
wordCounts.print()
ssc.start()
ssc.awaitTermination()
}
}
Set the log level as above, use sbt to package the Scala file, and submit it to the Spark cluster to run.
Then open a terminal and start a TCP server (e.g. nc -lk 9999) to send data to the streaming program.
- Reading data from an RDD queue stream
package com.dw.streaming
import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.dstream.{DStream, InputDStream}
import scala.collection.mutable
object QueueStream {
def main(args: Array[String]): Unit = {
val conf = new SparkConf().setMaster("local[2]").setAppName("Name")
val ssc: StreamingContext = new StreamingContext(conf, Seconds(3))
    // thread-safe queue that the loop below pushes RDDs into
    val rddQueue: mutable.SynchronizedQueue[RDD[Int]] = new mutable.SynchronizedQueue[RDD[Int]]()
val queueStream: InputDStream[Int] = ssc.queueStream(rddQueue)
val mappedQueue: DStream[(Int, Int)] = queueStream.map(x => (x % 10, 1))
val reduceStream: DStream[(Int, Int)] = mappedQueue.reduceByKey(_ + _)
reduceStream.print()
ssc.start()
    // don't block with awaitTermination here: the driver keeps pushing RDDs into the queue
    for (i <- 1 to 10) {
      rddQueue += ssc.sparkContext.makeRDD(1 to 100, 2)
      Thread.sleep(1000)
    }
ssc.stop()
}
}
DStream Transformations: Stateless and Stateful
Stateless means no data from previous batch intervals is retained; each batch is processed on its own. Stateful means data from previous batch intervals is carried forward, which requires setting a checkpoint directory.
Stateless operations:
- map(func)
- flatMap(func)
- filter(func)
- repartition(numPartitions)
- reduce(func)
- count()
- union(otherStream)
- countByValue(): counts the occurrences of each element in the stream
- reduceByKey(): aggregates values that share the same key
- join(otherStream, [numTasks]): joins two DStreams of key-value pairs
- cogroup(otherStream, [numTasks])
- transform(func): exposes each batch's RDD so it can be manipulated directly; used in two situations (see the sketch after this list):
  - when the DStream API lacks the needed operation
  - when code needs to be evaluated periodically
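A minimal sketch of transform, assuming an existing StreamingContext ssc and a socket word-count stream; the blacklist RDD and its contents are hypothetical, used only to illustrate joining each batch against a static RDD (something the DStream API cannot express directly):

// hypothetical blacklist RDD, for illustration only
val blacklist = ssc.sparkContext.parallelize(List("spam", "ad")).map((_, true))
val pairs = ssc.socketTextStream("localhost", 9999).flatMap(_.split(" ")).map((_, 1))
// transform exposes each batch's RDD, so plain RDD operations apply
val filtered = pairs.transform { rdd =>
  rdd.leftOuterJoin(blacklist)
    .filter { case (_, (_, flag)) => flag.isEmpty } // keep words not on the blacklist
    .map { case (word, (count, _)) => (word, count) }
}
filtered.print()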
Stateful operations:
Sliding-window transformations:
Two parameters must be set:
1. Window length (window duration): an integer multiple of the batch interval
2. Slide interval (step size): also an integer multiple of the batch interval
- countByWindow(windowLength, slideInterval): returns the number of elements in one window of the stream
- reduceByWindow(func, windowLength, slideInterval): creates a new single-element stream by aggregating the elements in the sliding window with a custom function
- reduceByKeyAndWindow(func, invFunc, windowLength, slideInterval): when the window is large but the slide is small, performance improves by adding and removing data incrementally instead of recomputing the whole window (subtract the data that just left the window, add the data that just entered it); invFunc is the inverse of func (e.g. if func is _ + _, then invFunc is _ - _) — see the sketch below
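A minimal sketch of the incremental form, assuming a socket source on localhost:9999 and a 3-second batch interval; the checkpoint directory is required because this variant keeps state across batches:

import org.apache.spark._
import org.apache.spark.streaming._

val conf = new SparkConf().setMaster("local[2]").setAppName("WindowSketch")
val ssc = new StreamingContext(conf, Seconds(3))
ssc.checkpoint("cp") // required by the invFunc variant
val pairs = ssc.socketTextStream("localhost", 9999).flatMap(_.split(" ")).map((_, 1))
// 30-second window sliding every 3 seconds: each slide folds in the newest batch
// and subtracts the batch that just left, instead of re-reducing all 10 batches
val windowedCounts = pairs.reduceByKeyAndWindow(
  (a: Int, b: Int) => a + b, // func: fold new data in
  (a: Int, b: Int) => a - b, // invFunc: fold departing data out
  Seconds(30), Seconds(3))
windowedCounts.print()
ssc.start()
ssc.awaitTermination()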
The updateStateByKey operation
package com.dw.streaming
import org.apache.spark.SparkConf
import org.apache.spark.streaming.dstream.{DStream, ReceiverInputDStream}
import org.apache.spark.streaming.{Seconds, StreamingContext}
object StateWordCount {
def main(args: Array[String]): Unit = {
    // set the log4j log level
    StreamingExamples.setStreamingLogLevels()
val conf = new SparkConf().setMaster("local[2]").setAppName("Name")
val ssc: StreamingContext = new StreamingContext(conf, Seconds(3))
    ssc.checkpoint("cp") // stateful operations require a checkpoint directory
val lines: ReceiverInputDStream[String] = ssc.socketTextStream("localhost", 9999)
val words: DStream[String] = lines.flatMap(_.split(" "))
val wordDstream: DStream[(String, Int)] = words.map((_, 1))
    // for each key, add this batch's counts (values) to the running total (buff)
    val state: DStream[(String, Int)] = wordDstream.updateStateByKey(
      (values: Seq[Int], buff: Option[Int]) => {
        val currentCount: Int = values.sum
        val previousCount: Int = buff.getOrElse(0)
        Some(currentCount + previousCount)
      }
    )
state.print()
ssc.start()
ssc.awaitTermination()
}
}
A custom update function (conventionally named updateFunc) can also be passed to updateStateByKey(), as sketched below.
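For example, the inline function above can be pulled out into a named function (a sketch; wordDstream is the stream from the previous example):

// newValues holds this batch's counts for a key; state holds the running total
def updateFunc(newValues: Seq[Int], state: Option[Int]): Option[Int] =
  Some(newValues.sum + state.getOrElse(0))

val stateDstream = wordDstream.updateStateByKey[Int](updateFunc _)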
DStream Output Operations
Writing to files (saveAsTextFiles generates one output directory per batch interval, named from the prefix and the batch time):
stateDstream.saveAsTextFiles("xxxx")
Writing to a MySQL database:
// requires: import java.sql.{Connection, DriverManager, PreparedStatement}
stateDstream.foreachRDD(rdd => {
  // runs once per partition, so each partition shares a single connection
  def func(records: Iterator[(String, Int)]) {
    var conn: Connection = null
    var stmt: PreparedStatement = null
    try {
      val url = "xxx"
      val user = "root"
      val password = "hadoop"
      conn = DriverManager.getConnection(url, user, password)
      records.foreach(p => {
        val sql = "insert into wordcount(word, count) values (?, ?)"
        stmt = conn.prepareStatement(sql)
        stmt.setString(1, p._1)
        stmt.setInt(2, p._2)
        stmt.executeUpdate()
      })
    } catch {
      case e: Exception => e.printStackTrace()
    } finally {
      if (stmt != null) stmt.close()
      if (conn != null) conn.close()
    }
  }
  // repartition so a small number of connections handles all records
  val repartitionedRDD = rdd.repartition(3)
  repartitionedRDD.foreachPartition(func)
})
ssc.start()
ssc.awaitTermination()
Structured Streaming
Structured Streaming uses the DataFrame as its data abstraction and introduces a continuous processing mode, reducing stream-processing latency to the millisecond level.
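A minimal Structured Streaming word count for comparison with the DStream version above (a sketch assuming Spark 2.x and a socket source on localhost:9999):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("StructuredWordCount").getOrCreate()
import spark.implicits._

// the socket source presents incoming lines as an unbounded DataFrame
val lines = spark.readStream
  .format("socket")
  .option("host", "localhost")
  .option("port", 9999)
  .load()

val wordCounts = lines.as[String].flatMap(_.split(" ")).groupBy("value").count()

// "complete" mode re-emits the full updated result table after each trigger
val query = wordCounts.writeStream
  .outputMode("complete")
  .format("console")
  .start()
query.awaitTermination()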