I. Spark Streaming Overview
1. Introduction to Spark Streaming
Official documentation: Spark Streaming - Spark 3.5.1 Documentation
Spark Streaming is one of Spark's core components, providing scalable, high-throughput, fault-tolerant stream processing for Spark. It can ingest data from many sources, such as Kafka, Flume, HDFS, or even plain TCP sockets, and the processed results can be stored in HDFS, Redis, HBase, and so on.
The basic idea of Spark Streaming is to split the live input stream into time slices (at second granularity) and have the Spark engine process each slice in a batch-like fashion.
The central abstraction in Spark Streaming is the DStream (Discretized Stream), which represents a continuous stream of data. Internally, the input data is divided by time slice (e.g., one second) into segments, each segment is turned into a Spark RDD, and every operation on a DStream is ultimately translated into operations on the underlying RDDs.
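To make the DStream-to-RDD relationship concrete, here is a minimal sketch (assuming a DStream[String] named lines, as in the examples later in this section) showing that each batch of a DStream is exposed as an ordinary RDD:
// Each batch interval yields one RDD; foreachRDD hands it to us directly
lines.foreachRDD { (rdd, time) =>
  // rdd is a plain org.apache.spark.rdd.RDD[String] holding this batch's data
  println(s"Batch at $time contains ${rdd.count()} records")
}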
2. Spark Streaming Architecture
Runtime mechanism:
- After the client submits a job, the Driver is started; the Driver acts as the master of the Spark job.
- Each job runs across multiple Executors, and each Executor runs its tasks as threads; a Spark Streaming job contains at least one receiver task.
- The Receiver packages the data it receives into Blocks, reports the block IDs to the Driver, and replicates the blocks to another Executor (how often blocks are cut is configurable; see the sketch after this list).
- The ReceiverTracker on the Driver keeps track of the block IDs reported by the Receivers.
- The Driver periodically triggers the JobGenerator, which builds the logical RDDs from the DStream lineage, creates a JobSet, and hands it to the JobScheduler.
- The JobScheduler schedules the JobSet and passes it to the DAGScheduler, which turns the logical RDDs into stages; each stage contains one or more tasks.
- The TaskScheduler dispatches the tasks to Executors and tracks their execution state.
- A batch is complete only when all of its tasks, stages, and the JobSet have finished.
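A practical consequence of this receiver/block design: how often the receiver cuts blocks is controlled by spark.streaming.blockInterval (default 200 ms), and since each block becomes one partition of the batch RDD, the number of tasks per batch is roughly batchInterval / blockInterval. A minimal sketch of the relevant settings (values are illustrative):
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf()
  .setAppName("ArchTuning")
  .setMaster("local[2]") // at least one thread for the receiver, one for processing
  // The receiver cuts its buffered data into a block every 200 ms (the default)
  .set("spark.streaming.blockInterval", "200ms")
// With a 2-second batch and 200 ms blocks, each batch RDD has about 10 partitions
val ssc = new StreamingContext(conf, Seconds(2))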
II. Spark Streaming Programming
1. Create the prjspark Project
2. Add Dependencies
<dependencies>
    <dependency>
        <groupId>org.scala-lang</groupId>
        <artifactId>scala-library</artifactId>
        <version>2.11.8</version>
    </dependency>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-core_2.11</artifactId>
        <version>2.4.3</version>
    </dependency>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-sql_2.11</artifactId>
        <version>2.4.3</version>
    </dependency>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-client</artifactId>
        <version>3.2.0</version>
    </dependency>
    <dependency>
        <groupId>mysql</groupId>
        <artifactId>mysql-connector-java</artifactId>
        <version>8.0.15</version>
    </dependency>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-streaming_2.11</artifactId>
        <version>2.4.3</version>
    </dependency>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-streaming-kafka-0-10_2.11</artifactId>
        <version>2.4.3</version>
    </dependency>
</dependencies>
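The dependencies above only pull in the libraries; for Maven to compile the Scala sources under src/main/scala, a Scala build plugin is also needed. A minimal sketch (the plugin version is an assumption, adjust it to your environment):
<build>
    <plugins>
        <plugin>
            <groupId>net.alchim31.maven</groupId>
            <artifactId>scala-maven-plugin</artifactId>
            <version>3.4.6</version>
            <executions>
                <execution>
                    <goals>
                        <goal>compile</goal>
                        <goal>testCompile</goal>
                    </goals>
                </execution>
            </executions>
        </plugin>
    </plugins>
</build>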
3. Monitoring Socket Data
Write the code:
package com.soft863.streaming

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object WordCountStreaming {
  def main(args: Array[String]): Unit = {
    // local[2]: at least two threads, one for the receiver and one for processing
    val conf = new SparkConf().setMaster("local[2]").setAppName("NetworkWordCount")
    val ssc = new StreamingContext(conf, Seconds(5))
    // Create a DStream that will connect to hostname:port, like localhost:9999
    val lines = ssc.socketTextStream("hadoop100", 9999)
    // Split each line into words and count each word within the batch
    val wordCounts = lines.flatMap(_.split(" "))
      .map(word => (word, 1))
      .reduceByKey(_ + _)
    wordCounts.print()
    ssc.start()            // Start the computation
    ssc.awaitTermination() // Wait for the computation to terminate
  }
}
Run on hadoop100:
nc -lk 9999
If the command is not found, install it first:
yum -y install nc
Result:
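The screenshot of the result is not reproduced here; as an illustration, typing a line such as hello world hello into the nc session should, in the next 5-second batch, produce output along these lines (timestamp illustrative):
-------------------------------------------
Time: 1717400000000 ms
-------------------------------------------
(hello,2)
(world,1)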
4. Monitoring HDFS Data
Write the code:
package com.soft863.streaming

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object HDFSStreaming {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("HDFSWordCount")
    val ssc = new StreamingContext(conf, Seconds(5))
    // Create a DStream that monitors the HDFS directory for newly created files
    val lines = ssc.textFileStream("hdfs://hadoop100:9000/data/worddir/")
    // Split each line into words and count each word within the batch
    val wordCounts = lines.flatMap(_.split(" "))
      .map(word => (word, 1))
      .reduceByKey(_ + _)
    wordCounts.print()
    ssc.start()            // Start the computation
    ssc.awaitTermination() // Wait for the computation to terminate
  }
}
Run (note: textFileStream only picks up files that appear in the directory after the stream has started, so upload the file while the job is running):
hadoop fs -mkdir /data/worddir
hadoop fs -put /usr/local/data/words.txt /data/worddir
5. Creating a Queue-Based Stream
Write the code:
package com.soft863.streaming

import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Seconds, StreamingContext}

import scala.collection.mutable

object QueueStreaming {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("QueueRdd")
    val ssc = new StreamingContext(conf, Seconds(1))
    // Create the queue through which RDDs can be pushed to a QueueInputDStream
    val rddQueue = new mutable.Queue[RDD[Int]]()
    // Create the QueueInputDStream and use it to do some processing
    val inputStream = ssc.queueStream(rddQueue)
    // Process the RDDs in the queue: bucket each number by its last digit and count
    val mappedStream = inputStream.map(x => (x % 10, 1))
    val reducedStream = mappedStream.reduceByKey(_ + _)
    // Print the results
    reducedStream.print()
    // Start the computation
    ssc.start()
    // Push one RDD of the integers 1 to 300 into the queue every 2 seconds
    for (i <- 1 to 30) {
      rddQueue += ssc.sparkContext.makeRDD(1 to 300)
      Thread.sleep(2000)
    }
    // Stop the streaming context so the application terminates cleanly
    ssc.stop()
  }
}
Result:
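The screenshot is not reproduced here, but the output can be derived: each pushed RDD holds the integers 1 to 300, so every last-digit bucket 0 through 9 receives exactly 30 elements. Since RDDs are pushed every 2 seconds while the batch interval is 1 second, roughly every other batch is empty; each non-empty batch should print ten pairs along these lines (timestamp illustrative, order not guaranteed):
-------------------------------------------
Time: 1717400001000 ms
-------------------------------------------
(0,30)
(1,30)
(2,30)
(3,30)
(4,30)
(5,30)
(6,30)
(7,30)
(8,30)
(9,30)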
6. Monitoring Kafka Data
Write the code:
package com.soft863.streaming

import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.kafka010.{ConsumerStrategies, KafkaUtils, LocationStrategies}
import org.apache.spark.streaming.{Seconds, StreamingContext}

object KafkaStreaming {
  def main(args: Array[String]): Unit = {
    val sc = new SparkConf().setAppName("WC").setMaster("local[2]")
    val ssc = new StreamingContext(sc, Seconds(5))
    val brokers = "hadoop100:9092"
    val topics = Array("test")
    // Consumer configuration
    val kafkaParam = Map[String, Object](
      // Addresses used to bootstrap the connection to the cluster
      "bootstrap.servers" -> brokers,
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      // Identifies which consumer group this consumer belongs to
      "group.id" -> "group1",
      // "latest" resets the offset to the latest offset when none is committed
      "auto.offset.reset" -> "latest",
      // If true, the consumer's offsets are committed automatically in the background
      "enable.auto.commit" -> (false: java.lang.Boolean)
    )
    val lineMap = KafkaUtils.createDirectStream[String, String](
      ssc,
      LocationStrategies.PreferConsistent,
      ConsumerStrategies.Subscribe[String, String](topics, kafkaParam)
    )
    val lines = lineMap.map(record => record.value)
    val words = lines.flatMap(_.split(" "))
    val pair = words.map(x => (x, 1))
    // Count words over a sliding window: 10-second window, sliding every 5 seconds
    val wordCounts = pair.reduceByKeyAndWindow(
      (x: Int, y: Int) => x + y,
      Seconds(10),
      Seconds(5))
    // pair.reduceByKey(_ + _).print()
    wordCounts.print()
    ssc.start()
    ssc.awaitTermination()
  }
}
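Because enable.auto.commit is set to false above, this job never commits consumer offsets, so on restart it falls back to auto.offset.reset. If Kafka-managed offsets are wanted, the 0-10 integration lets you commit them manually after each batch; a minimal sketch (to be added before ssc.start()):
import org.apache.spark.streaming.kafka010.{CanCommitOffsets, HasOffsetRanges}

lineMap.foreachRDD { rdd =>
  // Offset ranges are only available on the stream returned by createDirectStream
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  // Commit asynchronously once this batch's data has been handled
  lineMap.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
}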
Start the ZooKeeper and Kafka clusters.
Send messages to the test topic on hadoop100:
kafka-console-producer.sh --broker-list hadoop100:9092 --topic test