Today we introduce Spark's real-time computation framework: Spark Streaming. Spark Streaming is a stream-processing framework built on top of Spark. With its rich API and in-memory, high-speed execution engine, users can combine streaming, batch, and interactive queries in a single application.
In real-time stream processing there are two major tools: Storm and Spark Streaming. Storm is true record-at-a-time streaming: it processes each record the moment it arrives. Spark Streaming, by contrast, is micro-batch ("pseudo" real-time) streaming. So why do so many companies use Spark Streaming rather than Storm? Because of the power and compatibility of the Spark platform. First, Spark is compatible with Hadoop clusters, with Kafka, Hive, and HDFS, and with common relational and non-relational databases. Second, you can move freely between Spark Streaming, Spark Core, and Spark SQL. Third, Spark also ships GraphX and MLlib, providing rich graph-processing and machine-learning libraries. Spark Streaming's characteristics: low latency, high throughput, high fault tolerance, and integration with the Hadoop and Spark ecosystems.
Now let's introduce Spark Streaming properly.
1. DStream
A DStream (discretized stream) represents a continuous stream of data: either an input stream received from a source, or a processed stream produced by transforming another stream. A DStream is a high-level streaming abstraction built on RDDs (think of it as a flowing sequence of RDDs).
The program entry points, comparing Spark Core, Spark SQL, and Spark Streaming:
// Spark Core entry point
val conf = new SparkConf().setMaster("").setAppName("")
val sc = new SparkContext(conf)
// Spark SQL entry point
val conf = new SparkConf().setMaster("").setAppName("")
val sc = new SparkContext(conf)
val sqlContext = new SQLContext(sc)
val sparkSession = SparkSession.builder().config(conf).getOrCreate() // Spark 2.x
// Spark Streaming entry point
val conf = new SparkConf().setMaster("").setAppName("")
val sc = new SparkContext(conf)
val ssc = new StreamingContext(sc, Seconds(1))
By analogy with Spark Core's RDD and Spark SQL's DataFrame, the DStream is Spark Streaming's high-level abstraction.
2. Demos
2.1 A starter demo. On Linux, use nc to listen on a port (nc -lk 9999) and type input there. Note one limitation: the results are not accumulated across batches; each batch only shows counts for the input received in that batch.
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.dstream.{DStream, ReceiverInputDStream}

object SparkStreaming {
  def main(args: Array[String]): Unit = {
    // Entry point. Note local[2]: locally you need at least two threads,
    // one to receive data and one to process it.
    val conf = new SparkConf()
    conf.setAppName(s"${this.getClass.getSimpleName}").setMaster("local[2]")
    val sc = new SparkContext(conf)
    val ssc: StreamingContext = new StreamingContext(sc, Seconds(1))
    // Create a DStream that reads from a port fed by the Linux nc tool.
    // This is not an accumulating count: each batch only reflects the
    // current input, not historical data.
    val dStream1: ReceiverInputDStream[String] = ssc.socketTextStream("hadoop01", 9999)
    val resultDstream1: DStream[(String, Int)] = dStream1.flatMap(_.split(",")).map((_, 1)).reduceByKey(_ + _)
    resultDstream1.print()
    ssc.start()
    ssc.awaitTermination()
  }
}
2.2 Reading from HDFS as the data source. Note that textFileStream monitors a directory and only processes files newly added to it after the job starts.
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.dstream.DStream

object SparkStreaming {
  def main(args: Array[String]): Unit = {
    // Entry point; local[2] gives one receiver thread and one processing thread.
    val conf = new SparkConf()
    conf.setAppName(s"${this.getClass.getSimpleName}").setMaster("local[2]")
    val sc = new SparkContext(conf)
    val ssc: StreamingContext = new StreamingContext(sc, Seconds(1))
    // Data source on HDFS: picks up files newly added to the directory.
    val dStream: DStream[String] = ssc.textFileStream("hdfs://hadoop1:9000/streaming")
    val resultDStream: DStream[(String, Int)] = dStream.flatMap(_.split(",")).map((_, 1)).reduceByKey(_ + _)
    resultDStream.print()
    ssc.start()
    ssc.awaitTermination()
  }
}
2.3 The updateStateByKey operator, which accumulates counts across batches. Note that a checkpoint directory must be set.
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.dstream.ReceiverInputDStream

object SparkStreaming {
  def main(args: Array[String]): Unit = {
    // Entry point; local[2] gives one receiver thread and one processing thread.
    val conf = new SparkConf()
    conf.setAppName(s"${this.getClass.getSimpleName}").setMaster("local[2]")
    val sc = new SparkContext(conf)
    val ssc: StreamingContext = new StreamingContext(sc, Seconds(1))
    // updateStateByKey accumulates counts across batches and requires a checkpoint directory.
    ssc.checkpoint("hdfs://hadoop1:9000/streamingchekpointxx")
    val dStream: ReceiverInputDStream[String] = ssc.socketTextStream("hadoop01", 9999)
    val resultDStream = dStream.flatMap(_.split(","))
      .map((_, 1))
      .updateStateByKey((values: Seq[Int], valuesState: Option[Int]) => {
        val currentCount = values.sum
        val lastCount = valuesState.getOrElse(0)
        Some(currentCount + lastCount)
      })
    // An output operation is required, otherwise the job fails to start.
    resultDStream.print()
    ssc.start()
    ssc.awaitTermination()
  }
}
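The merge logic passed to updateStateByKey is just an ordinary function of (new values, previous state), so it can be sanity-checked on its own without any StreamingContext. A minimal sketch, with the same body as the closure above:

```scala
// State-update function as used with updateStateByKey: sum the values that
// arrived in this batch and add them to the previous count, if any.
def updateFunc(values: Seq[Int], state: Option[Int]): Option[Int] = {
  val currentCount = values.sum      // occurrences in the current batch
  val lastCount = state.getOrElse(0) // accumulated count so far
  Some(currentCount + lastCount)
}

// "hello" seen 3 times this batch, 5 times previously -> 8 in total.
println(updateFunc(Seq(1, 1, 1), Some(5))) // prints Some(8)
```

Note that the function always returns Some, so a key's state is kept forever; returning None instead would remove the key from the state.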
2.4 The transform operator: blacklist filtering.
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.streaming.{Seconds, StreamingContext}

object SparkStreaming {
  def main(args: Array[String]): Unit = {
    // Entry point; local[2] gives one receiver thread and one processing thread.
    val conf = new SparkConf()
    conf.setAppName(s"${this.getClass.getSimpleName}").setMaster("local[2]")
    val sc = new SparkContext(conf)
    val ssc: StreamingContext = new StreamingContext(sc, Seconds(1))
    // Transformation operators transform and filter applied to blacklist filtering.
    val blackRDD = sc.parallelize(List("&", "!", "?"))
    // Broadcast the blacklist to the executors.
    val broadcastBlackList = sc.broadcast(blackRDD.collect())
    ssc.checkpoint("hdfs://hadoop1:9000/streamingchekpoint")
    // Input
    val dStream = ssc.socketTextStream("192.168.32.10", 9999)
    /**
     * When counting words, filter out the blacklisted special symbols
     * (& ! ?) instead of counting them.
     */
    // Computation
    val resultDStream = dStream.flatMap(_.split(","))
      .map((_, 1))
      .transform(rdd => {
        // Inside transform we get a plain RDD, so all Spark Core operators
        // are available (and, via a DataFrame, Spark SQL as well).
        val blackList = broadcastBlackList.value.map((_, true))
        val result = rdd.leftOuterJoin(rdd.sparkContext.parallelize(blackList))
        // Keep only words with no match on the blacklist side of the join.
        result.filter(tuple => tuple._2._2.isEmpty)
          .map(tuple => (tuple._1, tuple._2._1))
      }).reduceByKey(_ + _)
    // Output
    resultDStream.print()
    // Start
    ssc.start()
    ssc.awaitTermination()
  }
}
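The leftOuterJoin above works, but the same effect can be had more directly by checking membership in the broadcast blacklist. A sketch of that core logic on plain Scala collections (filterBlacklisted is a made-up helper name; the blacklist values are the ones from the example):

```scala
// Blacklist filtering reduced to its essence: drop any (word, count) pair
// whose word appears in the blacklist; the survivors are then summed.
val blackList = Set("&", "!", "?")

def filterBlacklisted(pairs: Seq[(String, Int)]): Seq[(String, Int)] =
  pairs.filterNot { case (word, _) => blackList.contains(word) }

println(filterBlacklisted(Seq(("hello", 1), ("!", 1), ("spark", 1))))
// prints List((hello,1), (spark,1))
```

On an RDD the equivalent would be a filter using the broadcast value, which avoids shipping and joining a second RDD on every batch.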
2.5 Fault tolerance with updateStateByKey
package TestExamples
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.streaming.{Seconds, StreamingContext}

/**
 * Fault-tolerant use of updateStateByKey: after the program is interrupted
 * or fails and is then restarted, it resumes from where it left off.
 * This situation can easily arise in production.
 */
object TestSparkStreaming2 {
  val checkpointDirectory = "hdfs://hadoop1:9000/streamingchekpointX"

  def functionToCreateContext(): StreamingContext = {
    // Entry point
    val conf = new SparkConf().setMaster("local[2]").setAppName(s"${this.getClass.getSimpleName}")
    val sc = new SparkContext(conf)
    val ssc = new StreamingContext(sc, Seconds(1))
    // Input
    val dStream = ssc.socketTextStream("192.168.32.10", 9999)
    // Processing
    val resultDStream = dStream.flatMap(_.split(","))
      .map((_, 1))
      .updateStateByKey((values: Seq[Int], valuesState: Option[Int]) => {
        val currentCount = values.sum
        val lastCount = valuesState.getOrElse(0)
        Some(currentCount + lastCount)
      })
    // Output
    resultDStream.print()
    // Set the checkpoint directory
    ssc.checkpoint(checkpointDirectory)
    ssc
  }

  def main(args: Array[String]): Unit = {
    // Rebuild the context from the checkpoint if one exists, otherwise create it fresh.
    val ssc = StreamingContext.getOrCreate(checkpointDirectory, functionToCreateContext _)
    // Start the program
    ssc.start()
    ssc.awaitTermination()
  }
}
2.6 Interacting with MySQL
The connection pool:
package utils
import java.sql.{Connection, DriverManager}
import java.util

object MysqlPool {
  private val connectionNum = 10 // connections created per batch when the pool runs dry
  private var conNum = 0         // total connections created so far
  private val pool = new util.LinkedList[Connection]() // the pool itself

  // Load the JDBC driver once.
  Class.forName("com.mysql.jdbc.Driver")

  /**
   * Return a connection to the pool.
   */
  def releaseConn(conn: Connection): Unit = {
    pool.synchronized {
      pool.push(conn)
    }
  }

  /**
   * Borrow a connection, creating a batch of new ones if the pool is empty.
   */
  def getJdbcCoon(): Connection = {
    // Synchronize on the pool so concurrent tasks don't corrupt it.
    pool.synchronized {
      if (pool.isEmpty) {
        for (_ <- 1 to connectionNum) {
          val conn = DriverManager.getConnection("jdbc:mysql://localhost:3306/test", "root", "root")
          pool.push(conn)
          conNum += 1
        }
      }
      pool.poll()
    }
  }
}
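The borrow/return pattern inside MysqlPool can be illustrated without a database by swapping DriverManager.getConnection for an arbitrary factory function. A sketch under that assumption (SimplePool is a made-up name for illustration):

```scala
import java.util.LinkedList

// A minimal object pool: when empty, create a batch of objects via the
// factory; borrow pops one, release pushes it back. All access is
// synchronized on the pool, mirroring MysqlPool above.
class SimplePool[A](factory: () => A, batchSize: Int) {
  private val pool = new LinkedList[A]()

  def borrow(): A = pool.synchronized {
    if (pool.isEmpty) (1 to batchSize).foreach(_ => pool.push(factory()))
    pool.poll()
  }

  def release(item: A): Unit = pool.synchronized {
    pool.push(item)
  }
}

// String stands in for Connection so this runs anywhere.
var created = 0
val p = new SimplePool[String](() => { created += 1; s"conn-$created" }, 3)
val c = p.borrow() // the empty pool triggers creation of a batch of 3
p.release(c)
println(created)   // prints 3
```

A production pool would also cap the total number of connections and validate a connection before handing it out; both are omitted here, as in the original.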
package TestExamples
import org.apache.spark.streaming.dstream.ReceiverInputDStream
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.streaming.{Seconds, StreamingContext}
import utils.MysqlPool
import utils.MysqlPool.getJdbcCoon

/**
 * Created by wangyongxiang on 2018/5/16.
 * Interacting with MySQL: write the results into a MySQL table, taking
 * connections from the connection pool.
 * Uses the foreachRDD operator to write the data out.
 */
object TestSparkStreaming3 {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
    conf.setAppName(s"${this.getClass.getSimpleName}").setMaster("local[2]")
    val sc = new SparkContext(conf)
    val ssc: StreamingContext = new StreamingContext(sc, Seconds(1))
    ssc.checkpoint("hdfs://hadoop1:9000/streamingchekpointxx")
    val dStream: ReceiverInputDStream[String] = ssc.socketTextStream("hadoop01", 9999)
    val resultDStream = dStream.flatMap(_.split(","))
      .map((_, 1))
      .updateStateByKey((values: Seq[Int], valuesState: Option[Int]) => {
        val currentCount = values.sum
        val lastCount = valuesState.getOrElse(0)
        Some(currentCount + lastCount)
      })
    // The function passed to foreachRDD runs on the driver, but the code
    // inside foreachPartition runs on the executors. The connection is
    // therefore created per partition; a connection created on the driver
    // would have to be serialized to the executors, which fails.
    resultDStream.foreachRDD(rdd => {
      rdd.foreachPartition(partition => {
        val jdbcCoon = getJdbcCoon()
        val statement = jdbcCoon.createStatement()
        partition.foreach(record => {
          val word = record._1
          val count = record._2
          val sql = s"insert into wordcount values(now(),'${word}',${count})"
          statement.execute(sql)
        })
        MysqlPool.releaseConn(jdbcCoon)
      })
    })
    ssc.start()
    ssc.awaitTermination()
  }
}
2.7 Window operations
package TestExamples
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.streaming.{Seconds, StreamingContext}

/**
 * Spark Streaming window operations.
 */
object TestSparkStreaming4 {
  def main(args: Array[String]): Unit = {
    // Entry point
    val conf = new SparkConf().setMaster("local[2]").setAppName(s"${this.getClass.getSimpleName}")
    val sc = new SparkContext(conf)
    val ssc = new StreamingContext(sc, Seconds(2))
    // 1. Read the data
    // Set the checkpoint directory
    ssc.checkpoint("hdfs://hadoop1:9000/streamingchekpointxx")
    val dStream = ssc.socketTextStream("192.168.32.10", 9999)
    // 2. Process the data
    // reduceByKeyAndWindow(reduce function, window length, slide interval):
    // the last two parameters must be multiples of the DStream batch interval.
    val window = dStream.flatMap(_.split(","))
      .map((_, 1))
      .reduceByKeyAndWindow((x: Int, y: Int) => x + y, Seconds(6), Seconds(4))
    // 3. Output the data (e.g. to MySQL or Redis)
    window.print()
    // Start the program
    ssc.start()
    ssc.awaitTermination()
  }
}
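The window arithmetic can be checked on plain Scala collections. With a 2-second batch interval, a 6-second window spans 3 batches and a 4-second slide advances 2 batches at a time, so sliding over per-batch counts mimics what reduceByKeyAndWindow computes. A sketch of the semantics (not the actual implementation; windowedCounts is a made-up helper):

```scala
// Each inner Seq is one batch of (word, 1) pairs. A window merges
// windowLen consecutive batches and moves forward slide batches at a time.
def windowedCounts(batches: Seq[Seq[(String, Int)]],
                   windowLen: Int, slide: Int): Seq[Map[String, Int]] =
  batches.sliding(windowLen, slide).map { window =>
    window.flatten.groupBy(_._1).map { case (w, ps) => w -> ps.map(_._2).sum }
  }.toSeq

val batches = Seq(
  Seq(("a", 1)),           // batch ending at t = 2s
  Seq(("a", 1), ("b", 1)), // batch ending at t = 4s
  Seq(("b", 1))            // batch ending at t = 6s
)
// windowLen = 6s / 2s = 3 batches, slide = 4s / 2s = 2 batches
println(windowedCounts(batches, windowLen = 3, slide = 2))
```

This is also why the window length and slide interval must be multiples of the batch interval: a window can only be assembled from whole batches.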
3. The most important part of Spark Streaming is its integration with Kafka, which deserves a post of its own.