1. Window operations
1.1 Introduction to window functions
- Spark Streaming provides a sliding-window operation, which lets you aggregate the data of the n micro-batches that currently fall inside the window.
- A window has two parameters:
  - Window length: how many unit-time (time unit) micro-batches the window covers
  - Slide interval: like a timer, the window slides forward every so many unit times
- Note: both values must be integer multiples of the unit time (the batch interval)
- For example, with a window length of 3 time units and a slide interval of 2 time units: at time 3 the window contains 3 batches (time1, time2, time3), which can be aggregated together; when the window slides at time 5, it contains time3, time4 and time5. A minimal sketch of these parameters is shown right after this list.
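A minimal sketch of how the two parameters relate to the batch interval, assuming a 1-second batch interval and the same socket source used in the examples below (host, port and durations are illustrative):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object WindowParamsSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[*]").setAppName("windowParams")
    // Batch interval = 1 second: this is the "time unit" the window is measured in
    val ssc = new StreamingContext(conf, Seconds(1))
    val lines = ssc.socketTextStream("datanode01", 10086)
    // Window length = 3 time units, slide interval = 2 time units;
    // both must be integer multiples of the batch interval
    lines.window(Seconds(3), Seconds(2)).print()
    ssc.start()
    ssc.awaitTermination()
  }
}
```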
1.2 Common window functions
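For reference, the window operators the DStream API provides are window, countByWindow, countByValueAndWindow, reduceByWindow and reduceByKeyAndWindow. A minimal sketch of their call shapes (durations are illustrative; countByWindow and countByValueAndWindow also require ssc.checkpoint(...) to be set):

```scala
import org.apache.spark.streaming.Seconds
import org.apache.spark.streaming.dstream.DStream

// `words` is assumed to be a DStream[String] built on a 10-second batch interval
def windowOperators(words: DStream[String]): Unit = {
  val pairs: DStream[(String, Int)] = words.map((_, 1))

  // Raw union of the batches of the last 30s, recomputed every 10s
  pairs.window(Seconds(30), Seconds(10)).print()

  // Total number of records in the window (needs checkpointing)
  words.countByWindow(Seconds(30), Seconds(10)).print()

  // Occurrences of each distinct record in the window (needs checkpointing)
  words.countByValueAndWindow(Seconds(30), Seconds(10)).print()

  // Fold all records in the window with one reduce function
  words.reduceByWindow(_ + _, Seconds(30), Seconds(10)).print()

  // Word count over the window
  pairs.reduceByKeyAndWindow((a: Int, b: Int) => a + b, Seconds(30), Seconds(10)).print()
}
```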
1.3 Examples
1.3.1 The window operator
```scala
package com.xxx.SparkStreaming.Day03
import org.apache.log4j.{Level, Logger}
import org.apache.spark.SparkConf
import org.apache.spark.streaming.dstream.{DStream, ReceiverInputDStream}
import org.apache.spark.streaming.{Seconds, StreamingContext}
object _01WindowTest {
def main(args: Array[String]): Unit = {
// Filter out unneeded log output, keeping only WARN level and above
Logger.getLogger("org").setLevel(Level.WARN)
val conf: SparkConf = new SparkConf().setMaster("local[*]").setAppName("redisOffset")
val ssc = new StreamingContext(conf, Seconds(10))
val dstream: ReceiverInputDStream[String] = ssc.socketTextStream("datanode01", 10086)
dstream.print()
val winDstream: DStream[(String, Int)] = dstream.map((_, 1)).window(Seconds(30), Seconds(30))
winDstream.reduceByKey(_+_).print()
ssc.start()
ssc.awaitTermination()
}
}
```
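To try this out, start a netcat listener on the host the stream reads from, e.g. `nc -lk 10086` on datanode01, then type some words; every 30 seconds the job prints the counts of the words received over the last 30 seconds.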
1.3.2 reduceByKeyAndWindow with foreachRDD
```scala
package com.xxx.SparkStreaming.Day03
import org.apache.log4j.{Level, Logger}
import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.dstream.{DStream, ReceiverInputDStream}
import org.apache.spark.streaming.{Seconds, StreamingContext}
/**
 * Compute the top 3 Baidu hot-search terms over the last 30 seconds.
 */
object _02WindowTest {
def main(args: Array[String]): Unit = {
// Filter out unneeded log output, keeping only WARN level and above
Logger.getLogger("org").setLevel(Level.WARN)
val conf: SparkConf = new SparkConf().setMaster("local[*]").setAppName("redisOffset")
val ssc = new StreamingContext(conf, Seconds(10))
val dstream: ReceiverInputDStream[String] = ssc.socketTextStream("datanode01", 10086)
// dstream.print()
// val winDstream: DStream[(String, Int)] = dstream.map((_, 1)).window(Seconds(30), Seconds(30))
// winDstream.reduceByKey(_+_).print()
val result: DStream[(String, Int)] = dstream.map((_, 1)).reduceByKeyAndWindow(((x: Int, y: Int) => x + y), Seconds(30), Seconds(20))
result.foreachRDD(rdd=>{
val tuples: Array[(String, Int)] = rdd.sortBy(_._2, false).take(3)
for (elem <- tuples) {
println(elem)
}
})
ssc.start()
ssc.awaitTermination()
}
}
```
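Example 1.3.2 recomputes the whole 30-second window from scratch on every slide. reduceByKeyAndWindow also has an incremental overload that takes an inverse reduce function to subtract the batches that fall out of the window; it requires checkpointing to be enabled. A minimal sketch reusing the same socket DStream (the function name is illustrative):

```scala
import org.apache.spark.streaming.Seconds
import org.apache.spark.streaming.dstream.DStream

// Incremental windowed count; ssc.checkpoint(...) must be set before it runs.
// Keys whose count drops to 0 remain in the output unless a filter function is also passed.
def incrementalWindowCount(dstream: DStream[String]): DStream[(String, Int)] =
  dstream.map((_, 1)).reduceByKeyAndWindow(
    (x: Int, y: Int) => x + y, // add counts of batches entering the window
    (x: Int, y: Int) => x - y, // subtract counts of batches leaving the window
    Seconds(30),
    Seconds(20)
  )
```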
1.3.3 reduceByKeyAndWindow with transform
```scala
package com.xxx.SparkStreaming.Day03
import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.dstream.{DStream, ReceiverInputDStream}
import org.apache.spark.streaming.{Seconds, StreamingContext}
/**
 * Compute the top 3 Baidu hot-search terms over the last 30 seconds.
 * Sample input (one term per line):
 *   java
 *   c++
 *   bigdata
 *   python
 *   .....
 */
object _03WindowTest {
def main(args: Array[String]): Unit = {
val conf: SparkConf = new SparkConf().setAppName("window").setMaster("local[*]")
val ssc = new StreamingContext(conf, Seconds(10))
//The countByWindow operator requires a checkpoint directory to be set
// ssc.checkpoint("data1")
//Receive data from the netcat socket
val dStream: ReceiverInputDStream[String] = ssc.socketTextStream("datanode01", 10086)
//Count the total number of records inside the sliding window
// dStream.countByWindow(Seconds(30),Seconds(30)).print()
val result: DStream[(String, Int)] = dStream.map((_, 1)).reduceByKeyAndWindow((x:Int, y:Int) => (x + y), Seconds(30),Seconds(30))
.transform(rdd => {
val tuples: Array[(String, Int)] = rdd.sortBy(_._2, false).take(3)
ssc.sparkContext.parallelize(tuples)
})
result.print()
ssc.start()
ssc.awaitTermination()
}
}
```
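The commented-out countByWindow lines above become usable once a checkpoint directory is set; a minimal sketch of just that part, using the names from the example:

```scala
// countByWindow maintains its count incrementally (expired batches are subtracted),
// so a checkpoint directory is mandatory
ssc.checkpoint("data1")
// Total number of records received in the last 30 seconds, emitted every 30 seconds
dStream.countByWindow(Seconds(30), Seconds(30)).print()
```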
2. Integrating Spark Streaming with Spark SQL
```scala
package com.xxx.SparkStreaming.Day03
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.streaming.dstream.{DStream, ReceiverInputDStream}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.streaming.{Seconds, StreamingContext}
/**
 * Integration example of Spark SQL with Spark Streaming:
 *
 * Spark Streaming's foreachRDD and transform operators let you compute with the underlying
 * RDD operators, i.e. with Spark Core. To integrate with Spark SQL instead, convert those
 * RDDs to DataFrames / Datasets.
 *
 * Requirement:
 * Every hour, compute the top 3 products by sales volume.
 * Sample input (id name num):
 * A1001 sweater 100
 * A1002 phone 55
 * A1001 sweater 99
 * A1002 phone 33
 * A1003 trousers 12
 * A1004 socks 14
 * A1004 socks 16
 * A1005 swim_trunks 100
 * A1001 sweater 44
 * A1006 swim_trunks 1
 */
object _04SparkStreamingAndSparkSQL {
def main(args: Array[String]): Unit = {
val conf: SparkConf = new SparkConf().setMaster("local[*]").setAppName("sparkAndSparkSQL")
val sc = new SparkContext(conf)
//To use Spark SQL, a SparkSession must be created
val spark: SparkSession = SparkSession.builder().config(conf).getOrCreate()
import spark.implicits._
//Create the Spark Streaming context
//Note: when integrating Spark SQL with Spark Streaming, build the context via StreamingContext(sparkContext: SparkContext, duration: Duration)
//Otherwise a second SparkContext would be created, which throws the following error:
//Exception in thread "main" org.apache.spark.SparkException: Only one SparkContext may be running in this JVM (see SPARK-2243). To ignore this error, set spark.driver.allowMultipleContexts = true. The currently running SparkContext was created at:
val ssc = new StreamingContext(sc, Seconds(30))
//DStream of the data received from the socket
val dStream: ReceiverInputDStream[String] = ssc.socketTextStream("qianfeng01", 10086)
//Use Spark Streaming operators to split each line into (name, num)
val dStream1: DStream[(String, Int)] = dStream.map(x => {
val strings: Array[String] = x.split(" ")
(strings(1), strings(2).toInt)
}).reduceByKeyAndWindow(_ + _, Seconds(30))
//Use Spark SQL to take the top 3 (order by gives a global ordering; sort by would only sort within each partition)
dStream1.transform(rdd=>{
val df: DataFrame = rdd.toDF("name", "num")
df.createOrReplaceTempView("table")
val df1: DataFrame = spark.sql(
"""
|select
|name,num
|from table
|order by num desc
|limit 3
|""".stripMargin)
df1.rdd
}).print()
ssc.start()
ssc.awaitTermination()
}
}
```
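As the doc comment notes, foreachRDD works just as well as transform for this integration. A minimal sketch of the same top-3 query written as an output operation; it assumes the `spark` session and the reduced DStream `dStream1` from the example above, and the temp-view name `sales` is illustrative:

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.streaming.dstream.DStream

// Same top-3 query, but done inside foreachRDD (an output operation) instead of transform
def printTop3(spark: SparkSession, dStream1: DStream[(String, Int)]): Unit = {
  import spark.implicits._
  dStream1.foreachRDD(rdd => {
    val df: DataFrame = rdd.toDF("name", "num")
    df.createOrReplaceTempView("sales")
    spark.sql(
      """
        |select name, num
        |from sales
        |order by num desc
        |limit 3
        |""".stripMargin).show()
  })
}
```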