Spark Streaming
Transformations
DStream transformations are similar to RDD transformations: they turn a DStream into a new DStream. Most of the common DStream operators are used in the same way as their Spark RDD counterparts.
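The snippets below assume that a StreamingContext and a socket text stream named lines already exist. A minimal sketch of that shared setup (the app name, master URL and 1-second batch interval are assumptions; the host and port follow the later examples):
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.dstream.DStream
val conf = new SparkConf().setAppName("DStreamTransformations").setMaster("local[2]") // 1 core for the receiver + 1 for processing
val ssc = new StreamingContext(conf, Seconds(1)) // 1-second micro-batches
val lines: DStream[String] = ssc.socketTextStream("CentOS", 9999)
// ... transformations from the sections below go here ...
ssc.start()
ssc.awaitTermination()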
map
//1,zhangsan,true
lines.map(line=> line.split(","))
.map(words=>(words(0).toInt,words(1),words(2).toBoolean))
.print()
flatMap
//hello spark
lines.flatMap(line=> line.split("\\s+"))
.map((_,1)) //(hello,1)(spark,1)
.print()
filter
//keep only lines containing "hello": records for which the predicate returns true pass the filter
lines.filter(line => line.contains("hello"))
.flatMap(line=> line.split("\\s+"))
.map((_,1))
.print()
repartition (change the number of partitions)
lines.repartition(10) //change the program's parallelism (number of partitions)
.filter(line => line.contains("hello"))
.flatMap(line=> line.split("\\s+"))
.map((_,1))
.print()
union (merge two streams)
val stream1: DStream[String] = ssc.socketTextStream("CentOS",9999)
val stream2: DStream[String] = ssc.socketTextStream("CentOS",8888)
stream1.union(stream2).repartition(10)
.filter(line => line.contains("hello"))
.flatMap(line=> line.split("\\s+"))
.map((_,1))
.print()
Note: this is equivalent to running two Receivers, so allocate at least three cores to the application.
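A minimal sketch of the corresponding core allocation (the app name and local master are assumptions): each socketTextStream receiver permanently occupies one core, so two receivers need at least one additional core for processing.
val conf = new SparkConf()
.setAppName("UnionExample")
.setMaster("local[3]") // 2 cores for the two receivers + at least 1 core for processing
val ssc = new StreamingContext(conf, Seconds(1))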
count
val stream1: DStream[String] = ssc.socketTextStream("CentOS",9999)
val stream2: DStream[String] = ssc.socketTextStream("CentOS",8888)
stream1.union(stream2).repartition(10)
.flatMap(line=> line.split("\\s+"))
.count() //count the number of elements in each micro-batch RDD
.print()
reduce(func)
Reduces all elements of each micro-batch into a single value.
val stream1: DStream[String] = ssc.socketTextStream("CentOS",9999)
val stream2: DStream[String] = ssc.socketTextStream("CentOS",8888)
stream1.union(stream2).repartition(10) // aa bb
.flatMap(line=> line.split("\\s+"))
.reduce(_+"|"+_)//aa|bb
.print()
countByValue (count occurrences of each value)
val stream1: DStream[String] = ssc.socketTextStream("CentOS",9999)
val stream2: DStream[String] = ssc.socketTextStream("CentOS",8888)
stream1.union(stream2).repartition(10) // a a b c
.flatMap(line=> line.split("\\s+"))
.countByValue() //(a,2) (b,1) (c,1)
.print()
reduceByKey(func, [numTasks])
var lines:DStream[String]=ssc.socketTextStream("CentOS",9999) //this is spark this
lines.repartition(10)
.flatMap(line=> line.split("\\s+").map((_,1)))
.reduceByKey(_+_) // (this,2)(is,1)(spark,1)
.print()
join(otherStream, [numTasks])
//1 zhangsan
val stream1: DStream[String] = ssc.socketTextStream("CentOS",9999)
//1 apple 1 4.5
val stream2: DStream[String] = ssc.socketTextStream("CentOS",8888)
val userPair:DStream[(String,String)]=stream1.map(line=>{
var tokens= line.split(" ")
(tokens(0),tokens(1))
})
val orderItemPair:DStream[(String,(String,Double))]=stream2.map(line=>{
//line is the incoming record
val tokens = line.split(" ")
//return (userId, (item, quantity * price))
(tokens(0),(tokens(1),tokens(2).toInt * tokens(3).toDouble))
})
userPair.join(orderItemPair).map(t=>(t._1,t._2._1,t._2._2._1,t._2._2._2))//1 zhangsan apple 4.5
.print()
The raw join result has the shape (1,(zhangsan,(apple,4.5))); after the map above it prints as (1,zhangsan,apple,4.5).
Note: the records to be joined from the two streams must fall into the same micro-batch (the same RDD time slot), otherwise the join cannot match them, which makes a stream-stream join of limited use.
transform
transform lets you combine a stream with an RDD in one computation: it exposes the underlying micro-batch RDD, which enables stream-batch joins.
//1 apple 2 4.5
val orderLog: DStream[String] = ssc.socketTextStream("CentOS",8888)
var userRDD=ssc.sparkContext.makeRDD(List(("1","zhangs"),("2","wangw")))
//the value type is (String,(String,Double))
val orderItemPair:DStream[(String,(String,Double))]=orderLog.map(line=>{
val tokens = line.split(" ")
(tokens(0),(tokens(1),tokens(2).toInt * tokens(3).toDouble))
})
//join the streaming (dynamic) RDD with the static RDD
orderItemPair.transform(rdd=> rdd.join(userRDD))
.print()
updateStateByKey (stateful computation, outputs the full state)
//a checkpoint directory must be set to store and back up the computation state
ssc.checkpoint("hdfs://zly:9000/spark-checkpoint") //state snapshots
val lines: DStream[String] = ssc.socketTextStream("zly",9999)
def updateFunction(newValues: Seq[Int], runningCount: Option[Int]): Option[Int] = {
//newValues holds the new values for the same key in the current batch;
//if a previous state exists it is added, otherwise 0 is used (the state is kept in memory)
val newCount = newValues.sum + runningCount.getOrElse(0)
Some(newCount)
}
lines.flatMap(_.split("\\s+"))
.map((_,1))
.updateStateByKey(updateFunction)
.print()
A checkpoint directory must be set to store the program's state information. Memory consumption is fairly heavy, because the full state is kept and emitted every batch.
mapWithState (stateful computation, incremental output)
Only keys that were updated in the current batch are output; updated keys are kept in memory, while keys without updates are kept on disk.
ssc.checkpoint("hdfs://zly:9000/spark-checkpoint") //state snapshots
val lines: DStream[String] = ssc.socketTextStream("zly",9999)
lines.flatMap(_.split("\\s+"))
.map((_,1))
//incremental output: the state function receives (key, value, state)
.mapWithState(StateSpec.function((k:String,v:Option[Int],state:State[Int])=>{
var historyCount=0
//if a state exists, read it into historyCount first and then add the new value
//if no state exists, historyCount stays 0 and only the new value is added
if(state.exists()){
historyCount=state.get()
}
historyCount += v.getOrElse(0)
//update the state
state.update(historyCount)
(k,historyCount)
}))
.print()
A checkpoint directory must be set to store the program's state information.
DStream fault recovery
import org.apache.spark.SparkConf
import org.apache.spark.streaming.dstream.DStream
import org.apache.spark.streaming.{Seconds, State, StateSpec, StreamingContext}
object SparkWordCountFailRecorver {
def main(args: Array[String]): Unit = {
var checkpointDir="hdfs://zly:9000/spark-checkpoint1"
//first try to restore from the checkpoint; if that is not possible, run recoveryFunction
var ssc= StreamingContext.getOrCreate(checkpointDir,recoveryFunction)
ssc.sparkContext.setLogLevel("FATAL")
ssc.checkpoint("hdfs://zly:9000/spark-checkpoint1") //state snapshots
//start the streaming computation
ssc.start()
ssc.awaitTermination()
}
var recoveryFunction=()=>{
println("======recoveryFunction========")
Thread.sleep(3000)
val conf = new SparkConf()
.setAppName("SparkWordCountTopology")
.setMaster("local[*]")
val ssc = new StreamingContext(conf, Seconds(1))
val lines: DStream[String] = ssc.socketTextStream("zly",9999)
def updateFunction(newValues: Seq[Int], runningCount: Option[Int]): Option[Int] = {
val newCount = newValues.sum + runningCount.getOrElse(0)
Some(newCount)
}
lines.flatMap(_.split("\\s+"))
.map((_,1))
.mapWithState(StateSpec.function((k:String,v:Option[Int],state:State[Int])=>{
var historyCount=0
if(state.exists()){
historyCount=state.get()
}
historyCount += v.getOrElse(0)
//update the state
state.update(historyCount)
(k,historyCount)
}))
.print()
//return the StreamingContext
ssc
}
}
Drawback: once the state has been persisted, later changes to the user's code no longer take effect, because the system will not call recoveryFunction again; to make code changes effective, the checkpoint directory must be deleted manually.
Window Operations
Spark Streaming also provides windowed computations, which let you apply transformations over a sliding window of data. The following figure illustrates such a sliding window.
As the figure shows, every time the window slides over the source DStream, the source RDDs that fall within the window are combined and operated on to produce the RDDs of the windowed DStream. In the figure, the operation is applied over the last 3 time units of data and slides by 2 time units. This means every window operation needs to specify two parameters.
① Window length: the duration of the window (3 time units in the figure).
② Sliding interval: the interval at which the window operation is performed (2 time units in the figure).
Note: both parameters must be multiples of the source DStream's batch interval, because for a DStream the micro-batch is the smallest atomic unit of processing. In stream computing, if window length = sliding interval the window is called a tumbling window and there is no overlap between windows; if window length > sliding interval the window is called a sliding window and consecutive windows overlap. In general the window length should be >= the sliding interval, because a window length smaller than the sliding interval would skip some data.
Some common window operations are listed below. All of them take the two parameters described above: windowLength and slideInterval.
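For example, assuming a 1-second batch interval and the lines stream from the earlier examples, a minimal sketch of the two window shapes:
// tumbling window: window length == sliding interval, no overlap between windows
lines.window(Seconds(2), Seconds(2))
// sliding window: window length > sliding interval, consecutive windows overlap by 1 second
lines.window(Seconds(3), Seconds(2))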
window(windowLength, slideInterval)
Merges the micro-batch RDDs that fall into the same window into one larger RDD.
ssc.socketTextStream("zly", 9999)
.flatMap(line => line.split("\\s+"))
.map(word => (word, 1))
.window(Seconds(2),Seconds(2))
.reduceByKey((v1, v2) => v1 + v2)
.print()
countByWindow(windowLength, slideInterval)
Returns the number of elements in a window.
ssc.checkpoint("hdfs://zly:9000/spark-checkpoints")
ssc.socketTextStream("zly", 9999)
.flatMap(line => line.split("\\s+"))
.map(word => (word, 1))
.countByWindow(Seconds(2),Seconds(2))
.print()
Equivalent to applying window first and then the count operator; a checkpoint directory must be set.
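A sketch of the equivalent window-then-count chain (same host and port as above); since this form does not use an inverse reduce function, it should not require a checkpoint directory:
ssc.socketTextStream("zly", 9999)
.flatMap(line => line.split("\\s+"))
.window(Seconds(2),Seconds(2))
.count()
.print()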
reduceByWindow(func, windowLength, slideInterval)
ssc.socketTextStream("zly", 9999)
.flatMap(line => line.split("\\s+"))
.reduceByWindow(_+" | "+_,Seconds(2),Seconds(2))
.print()
Equivalent to applying window first and then the reduce operator.
reduceByKeyAndWindow(func, windowLength, slideInterval, [numTasks])
ssc.socketTextStream("zly", 9999)
.flatMap(line => line.split("\\s+"))
.map(word => (word, 1))
.reduceByKeyAndWindow(_+_,Seconds(2),Seconds(2))
.print()
Equivalent to applying window first and then the reduceByKey operator.
reduceByKeyAndWindow(func, invFunc, windowLength, slideInterval, [numTasks])
ssc.checkpoint("hdfs://zly:9000/spark-checkpoints")
ssc.sparkContext.setLogLevel("FATAL")
ssc.socketTextStream("zly", 9999)
.flatMap(line => line.split("\\s+"))
.map((_,1))
.reduceByKeyAndWindow(
(v1,v2)=>v1+v2, //add values of newly arriving elements
(v1,v2)=>v1-v2, //subtract values of elements leaving the window
Seconds(4),Seconds(1),
filterFunc = t=> t._2 > 0) //drop entries whose count has fallen to 0
.print()
This variant is efficient only when more than half of the window's elements overlap between consecutive windows; for example, with a 4-second window sliding every 1 second, each slide only adds 1 second of new data and subtracts 1 second of expired data instead of recomputing the full 4-second window.
Output Operations
Output operations let you push a DStream's data out to external systems such as databases or file systems. Because output operations are what actually allow an external system to consume the transformed data, they trigger the real execution of all the DStream transformations (similar to actions on RDDs). Commonly used output operations include print() and foreachRDD(); the examples below use foreachRDD().
Kafka Sink
import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerConfig, ProducerRecord}
import org.apache.kafka.common.serialization.StringSerializer
object KafkaSink {
def createKafkaConnection(): KafkaProducer[String, String] = {
val props = new Properties()
props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG,"zly:9092")
props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,classOf[StringSerializer].getName)
props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,classOf[StringSerializer].getName)
props.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG,"true") //enable idempotence
props.put(ProducerConfig.RETRIES_CONFIG,"2") //number of retries
props.put(ProducerConfig.BATCH_SIZE_CONFIG,"100") //batch (buffer) size
props.put(ProducerConfig.LINGER_MS_CONFIG,"1000") //linger for at most 1000 ms
new KafkaProducer[String,String](props)
}
lazy val kafkaProducer:KafkaProducer[String,String]= createKafkaConnection()
Runtime.getRuntime.addShutdownHook(new Thread(){
override def run(): Unit = {
kafkaProducer.close()
}
})
def save(vs: Iterator[(String, Int)], topic: String): Unit = {
try{
vs.foreach(tuple=>{
val record = new ProducerRecord[String,String](topic,tuple._1,tuple._2.toString)
kafkaProducer.send(record)
})
}catch {
case e:Exception=> println("failed to write to Kafka, send an alert email")
}
}
}
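A hedged usage sketch: calling KafkaSink.save from foreachRDD/foreachPartition so the lazily created producer is reused once per executor JVM (the stream name wordCounts and the topic name topic01 are assumptions):
wordCounts.foreachRDD(rdd => {
rdd.foreachPartition(partition => {
KafkaSink.save(partition, "topic01") // assumed topic name
})
})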
Integrating DStream with DataFrame and SQL
ssc.socketTextStream("zly", 9999)
.flatMap(line => line.split("\\s+"))
.map((_,1))
.reduceByKeyAndWindow(
(v1,v2)=>v1+v2, //add values of newly arriving elements
(v1,v2)=>v1-v2, //subtract values of elements leaving the window
Seconds(4),Seconds(2),
filterFunc = t=> t._2 > 0) //drop entries whose count has fallen to 0
.foreachRDD(rdd=>{
val spark = SparkSession.builder.config(rdd.sparkContext.getConf).getOrCreate()
import spark.implicits._
val wordsDataFrame = rdd.toDF("word","count")
val props = new Properties()
props.put("user", "root")
props.put("password", "123456")
wordsDataFrame.write
.mode(SaveMode.Append)
.jdbc("jdbc:mysql://zly:3306/mysql","t_wordcount",props)
})
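The jdbc write above also requires the MySQL JDBC driver on the classpath; a sketch of the sbt dependency (the version is an assumption):
libraryDependencies += "mysql" % "mysql-connector-java" % "5.1.49"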