Operators on a DStream fall into two broad categories: Transformations and Output operations.
1、Transformations on DStreams
Similar to RDDs, transformations allow the data from an input DStream to be modified. DStreams support many of the transformations available on normal Spark RDDs. The sections below walk through some common operators.
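These operators behave like their RDD counterparts, applied independently to each micro-batch. As a rough plain-Scala sketch (no Spark dependency; the batch contents are made up) of what flatMap, map, and reduceByKey do to one batch of lines in a word count:

```scala
// One micro-batch of input lines (a stand-in for the RDD inside a DStream)
val batch = List("spark streaming", "spark core")

// flatMap: split each line into words
val words = batch.flatMap(_.split(" "))

// map: pair each word with an initial count of 1
val pairs = words.map(w => (w, 1))

// reduceByKey: sum the counts per word (groupBy + sum as a local analogy)
val counts = pairs.groupBy(_._1).map { case (w, kvs) => (w, kvs.map(_._2).sum) }
```

On a real DStream the same calls return new DStreams, and the computation above runs once per batch interval.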
2、Spark Streaming transform、broadcast、updateStateByKey: a blacklist example
import org.apache.log4j.{Level, Logger}
import org.apache.spark.SparkConf
import org.apache.spark.streaming.dstream.{DStream, ReceiverInputDStream}
import org.apache.spark.streaming.{Seconds, StreamingContext}

object _1Streaming_tranform {
  def main(args: Array[String]): Unit = {
    // Define the blacklist
    val black_list = List("@", "#", "$", "%")
    Logger.getLogger("org.apache.hadoop").setLevel(Level.ERROR)
    Logger.getLogger("org.apache.zookeeper").setLevel(Level.WARN)
    Logger.getLogger("org.apache.hive").setLevel(Level.WARN)
    // 1. Create the entry point: StreamingContext
    val conf = new SparkConf().setMaster("local[2]").setAppName("_1Streaming_tranform")
    val ssc = new StreamingContext(conf, Seconds(2))
    // 2. Read data from the network socket
    val inputDStream: ReceiverInputDStream[String] = ssc.socketTextStream("test", 9999)
    // 2.1 Broadcast the blacklist to the executors
    val bc = ssc.sparkContext.broadcast(black_list)
    // 2.2 Set the checkpoint directory (required by updateStateByKey)
    ssc.checkpoint("C:\\z_data\\checkPoint\\checkPoint_1")
    // 3. Business logic
    val wordDStream: DStream[String] = inputDStream.flatMap(_.split(" "))
    // transform: take each RDD out of the DStream, apply a function to it, and return a new RDD
    val filteredDStream: DStream[String] = wordDStream.transform(rdd => {
      // Filter out words that appear in the blacklist
      val blackList: List[String] = bc.value
      rdd.filter(word => !blackList.contains(word))
    })
    // 3.2 Count the remaining words, carrying state across batches
    val resultDStream = filteredDStream.map(msg => (msg, 1))
      .updateStateByKey((values: Seq[Int], state: Option[Int]) => {
        Option(values.sum + state.getOrElse(0))
      })
    // 4. Output: print the result
    resultDStream.print()
    // 5. Start the streaming job
    ssc.start()
    ssc.awaitTermination()
  }
}
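The update function passed to updateStateByKey receives the new values for a key in the current batch plus that key's previous state, and returns the new state. A minimal plain-Scala sketch of that contract applied by hand across two hypothetical batches (no Spark involved):

```scala
// The same shape of update function as above: new batch values + previous state
def updateFunc(values: Seq[Int], state: Option[Int]): Option[Int] =
  Option(values.sum + state.getOrElse(0))

// Batch 1: "a" seen twice, "b" once; no previous state yet
val afterBatch1 = Map(
  "a" -> updateFunc(Seq(1, 1), None),
  "b" -> updateFunc(Seq(1), None)
)

// Batch 2: "a" seen once more, "b" absent; their states carry over
val afterBatch2 = Map(
  "a" -> updateFunc(Seq(1), afterBatch1("a")),
  "b" -> updateFunc(Seq(), afterBatch1("b"))
)
```

Because this state must survive failures, Spark requires a checkpoint directory before updateStateByKey can run.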
3、window
// hostname, port and batchInvail (the batch interval, in seconds) are assumed to be parsed from args
// 1. Create the entry point: StreamingContext
val conf = new SparkConf().setMaster("local[2]").setAppName("WordCount_Window")
val ssc = new StreamingContext(conf, Seconds(batchInvail.toLong))
// 2. Read data from the network socket
val inputDStream: ReceiverInputDStream[String] = ssc.socketTextStream(hostname, port.toInt)
val lineDStream: DStream[String] = inputDStream.flatMap(_.split(" "))
val wordDStream: DStream[(String, Int)] = lineDStream.map((_, 1))
/**
 * Every 4 seconds, compute over the last 6 seconds of data (with a 2-second batch interval):
 * reduceFunc: the function that merges values
 * windowDuration: the window length (the last 6 seconds of data)
 * slideDuration: how often the window slides (every 4 seconds)
 */
val resultDStream: DStream[(String, Int)] = wordDStream.reduceByKeyAndWindow(
  (kv1: Int, kv2: Int) => kv1 + kv2,
  Seconds(batchInvail.toLong * 3),
  Seconds(batchInvail.toLong * 2))
resultDStream.print()
ssc.start()
ssc.awaitTermination()
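With a 2-second batch interval, a 6-second window and a 4-second slide, each window covers the last 3 batches and fires every 2 batches. A plain-Scala sketch of which batches each window aggregates (hypothetical per-batch word counts, no Spark):

```scala
// Batches in arrival order; each holds the word counts of one 2-second batch
val batches = Vector(Map("a" -> 1), Map("a" -> 2), Map("b" -> 1), Map("a" -> 1), Map("b" -> 2))

val windowSize = 3 // windowDuration / batchInterval = 6s / 2s
val slide = 2      // slideDuration / batchInterval = 4s / 2s

// Merge the word counts of all batches inside one window
def merge(ms: Seq[Map[String, Int]]): Map[String, Int] =
  ms.flatten.groupBy(_._1).map { case (k, kvs) => (k, kvs.map(_._2).sum) }

// A window fires every `slide` batches and covers the last `windowSize` batches
val windows = (slide - 1 until batches.length by slide).map { end =>
  merge(batches.slice(math.max(0, end - windowSize + 1), end + 1))
}
```

Note the windows overlap: a window longer than the slide means each batch is counted in more than one window, which is why reduceByKeyAndWindow exists instead of a plain per-batch reduceByKey.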
4、mapWithState
Example: a word-count program in which the count for a word keeps growing (say from 5 to 100 and beyond), because each batch builds on the previous state. Setting a checkpoint directory is mandatory.
ssc.checkpoint("hdfs://linux-hadoop01.ibeifeng.com:8020/beifeng/spark/streaming/chkdir45254")
def mappingFunction(key: String, values: Option[Int], state: State[Long]): (String, Long) = {
  // Get the previous state value
  val preStateValue = state.getOption().getOrElse(0L)
  // Compute the current value
  val currentStateValue = preStateValue + values.getOrElse(0)
  // Update the state
  state.update(currentStateValue)
  // Return the result
  (key, currentStateValue)
}
// messages is the input DStream, created elsewhere (e.g. from Kafka or a socket)
val result = messages.flatMap(_.split(" "))
  .map((_, 1))
  .reduceByKey(_ + _)
  .mapWithState(StateSpec.function(mappingFunction _)) // method-to-function conversion (eta expansion)
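Unlike updateStateByKey, mapWithState only touches the keys present in the current batch and lets the mapping function both update the state and emit a result. A plain-Scala sketch of that contract, using a mutable Map in place of Spark's State[Long] (a hypothetical stand-in, not Spark's actual implementation):

```scala
import scala.collection.mutable

// Stand-in for Spark's per-key State[Long]: key -> running total
val state = mutable.Map[String, Long]()

// Same logic as mappingFunction above: add this batch's count to the stored total,
// store it back, and emit the (key, new total) pair
def mapping(key: String, value: Option[Int]): (String, Long) = {
  val current = state.getOrElse(key, 0L) + value.getOrElse(0)
  state(key) = current
  (key, current)
}

// Two batches of reduced (word, count) pairs for the same key
val out1 = mapping("word", Some(5))  // first batch: count 5
val out2 = mapping("word", Some(95)) // next batch: the total keeps growing
```

This mirrors the "5 grows to 100" behaviour described above: each batch's output is built on the state left by the previous one.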