Spark Streaming Overview and Programming

I. Spark Streaming Overview

1. Introduction to Spark Streaming

Official documentation: Spark Streaming - Spark 3.5.1 Documentation

Spark Streaming is one of Spark's core components, giving Spark scalable, high-throughput, fault-tolerant stream processing. It can ingest data from many sources, such as Kafka, Flume, HDFS, or even plain TCP sockets, and the processed results can be written to HDFS, Redis, HBase, and other stores.

The basic principle of Spark Streaming is to split the live input stream into time slices (on the order of seconds) and then process each slice with the Spark engine, much like a sequence of small batch jobs.

Spark Streaming's primary abstraction is the DStream (Discretized Stream), which represents a continuous stream of data. Internally, the input data is divided by time slice (e.g., 1 second) into segments, each of which becomes an RDD, and every operation on a DStream is ultimately translated into operations on the underlying RDDs.
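To make the DStream-to-RDD relationship concrete, here is a minimal sketch (the hostname, port, and 1-second batch interval are illustrative assumptions) that uses foreachRDD to touch each micro-batch RDD directly; the RDD handed to the function is exactly one time slice of the stream:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object DStreamAsRdds {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("DStreamAsRdds")
    val ssc = new StreamingContext(conf, Seconds(1)) // 1-second time slices

    val lines = ssc.socketTextStream("localhost", 9999)

    // Every batch interval, Spark hands this function the RDD holding that
    // slice's data; DStream transformations compile down to the corresponding
    // transformations on these per-batch RDDs.
    lines.foreachRDD { (rdd, time) =>
      println(s"Batch at $time contains ${rdd.count()} lines")
    }

    ssc.start()
    ssc.awaitTermination()
  }
}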

2. Spark Streaming Architecture

Execution flow:

  1. After the client submits an application, the Driver is started; the Driver acts as the master of the Spark application.
  2. Each application runs multiple Executors, and each Executor runs its tasks as threads; a Spark Streaming application contains at least one receiver task.
  3. The Receiver ingests data and packages it into Blocks, reports the Block IDs to the Driver, and replicates each Block to another Executor (see the sketch after this list).
  4. The ReceiverTracker on the Driver keeps track of the Block IDs reported by the Receivers.
  5. The Driver periodically triggers the JobGenerator, which builds logical RDDs from the DStream lineage, assembles a JobSet, and hands it to the JobScheduler.
  6. The JobScheduler schedules the JobSet and passes it to the DAGScheduler, which turns the logical RDDs into stages, each containing one or more tasks.
  7. The TaskScheduler dispatches the tasks to Executors and tracks their execution state.
  8. A batch is complete only when all of its tasks, stages, and the JobSet have finished.
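The block generation in steps 3 and 4 is paced by spark.streaming.blockInterval (200 ms by default): each block becomes one partition, and hence one task, of the batch's RDD. A minimal sketch of tuning it, with illustrative values rather than recommendations:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object BlockIntervalDemo {
  def main(args: Array[String]): Unit = {
    // With a 2-second batch and a 500 ms block interval, each receiver
    // produces roughly 2000 / 500 = 4 blocks (= RDD partitions) per batch.
    val conf = new SparkConf()
      .setMaster("local[2]")
      .setAppName("BlockIntervalDemo")
      .set("spark.streaming.blockInterval", "500ms")
    val ssc = new StreamingContext(conf, Seconds(2))
    // ... define sources and transformations here, then call ssc.start() ...
  }
}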

II. Spark Streaming Programming

1. Create the prjspark project

Refer to the earlier post "第一个Spark程序" ("The First Spark Program") on CSDN.

2. Add dependencies

<dependencies>
    <dependency>
        <groupId>org.scala-lang</groupId>
        <artifactId>scala-library</artifactId>
        <version>2.11.8</version>
    </dependency>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-core_2.11</artifactId>
        <version>2.4.3</version>
    </dependency>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-sql_2.11</artifactId>
        <version>2.4.3</version>
    </dependency>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-client</artifactId>
        <version>3.2.0</version>
    </dependency>
    <dependency>
        <groupId>mysql</groupId>
        <artifactId>mysql-connector-java</artifactId>
        <version>8.0.15</version>
    </dependency>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-streaming_2.11</artifactId>
        <version>2.4.3</version>
    </dependency>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-streaming-kafka-0-10_2.11</artifactId>
        <version>2.4.3</version>
    </dependency>
</dependencies>
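Note that the Scala binary suffix of every Spark artifact (_2.11) must match the scala-library version (2.11.x here), and all Spark artifacts should share one version (2.4.3 here); mixing Scala 2.11 and 2.12 artifacts is a common cause of NoSuchMethodError at runtime.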

3. Monitoring socket data

Write the code:

package com.soft863.streaming

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object WordCountStreaming {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("NetworkWordCount")
    val ssc = new StreamingContext(conf, Seconds(5))

    // Create a DStream that will connect to hostname:port, like localhost:9999
    val lines = ssc.socketTextStream("hadoop100", 9999)

    // Split each line into words
    val wordCounts = lines.flatMap(_.split(" "))
      .map(word => (word, 1)).reduceByKey(_ + _)

    wordCounts.print()

    ssc.start() // Start the computation
    ssc.awaitTermination() // Wait for the computation to terminate
  }
}
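When running locally, Spark's INFO logging can bury the batch output. Optionally, add this line right after creating ssc in the example above to quiet it:

// Optional: reduce Spark's log noise so wordCounts.print() output stays visible
ssc.sparkContext.setLogLevel("WARN")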

On hadoop100, run:

nc -lk 9999

If the command is not found, install it first:

yum -y install nc

Run result:
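For example, typing hello spark hello into the nc session within one 5-second batch should produce output along these lines (the timestamp and tuple order will differ):

-------------------------------------------
Time: 1620000000000 ms
-------------------------------------------
(hello,2)
(spark,1)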

4. Monitoring HDFS data

Write the code:

package com.soft863.streaming

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object HDFSStreaming {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("HDFSWordCount")
    val ssc = new StreamingContext(conf, Seconds(5))

    // Create a DStream that monitors the HDFS directory for new files
    val lines = ssc.textFileStream("hdfs://hadoop100:9000/data/worddir/")

    // Split each line into words
    val wordCounts = lines.flatMap(_.split(" "))
      .map(word => (word, 1)).reduceByKey(_ + _)

    wordCounts.print()

    ssc.start() // Start the computation
    ssc.awaitTermination() // Wait for the computation to terminate
  }
}

Run:

hadoop fs -mkdir -p /data/worddir
hadoop fs -put /usr/local/data/words.txt /data/worddir

Note that textFileStream only detects files that appear in the directory after the streaming job has started, so execute the put command while the job is running.

5. Creating a queue stream

Write the code:

package com.soft863.streaming

import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Seconds, StreamingContext}

import scala.collection.mutable

object QueueStreaming {
  def main(args: Array[String]): Unit = {

    val conf = new SparkConf().setMaster("local[2]").setAppName("QueueRdd")
    val ssc = new StreamingContext(conf, Seconds(1))

    // Create the queue through which RDDs can be pushed to a QueueInputDStream
    val rddQueue = new mutable.Queue[RDD[Int]]()

    // Create the QueueInputDStream and use it to do some processing
    val inputStream = ssc.queueStream(rddQueue)

    // Process the RDDs in the queue: count the numbers by their last digit
    val mappedStream = inputStream.map(x => (x % 10, 1))
    val reducedStream = mappedStream.reduceByKey(_ + _)

    // Print the results
    reducedStream.print()

    // Start the computation
    ssc.start()

    // Push one RDD of the numbers 1 to 300 into the queue every 2 seconds
    for (i <- 1 to 30) {
      rddQueue += ssc.sparkContext.makeRDD(1 to 300)
      Thread.sleep(2000)
    }

    // Stop the streaming context once all RDDs have been pushed
    ssc.stop()
  }
}

Run result:
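Each pushed RDD holds the numbers 1 to 300, in which every last digit 0 through 9 occurs exactly 30 times, so a batch that received an RDD should print something along these lines (timestamp and tuple order will differ):

-------------------------------------------
Time: 1620000000000 ms
-------------------------------------------
(0,30)
(1,30)
...
(9,30)

Since an RDD is pushed only every 2 seconds while the batch interval is 1 second, roughly every other batch is empty.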

6. Monitoring Kafka data

Write the code:

package com.soft863.streaming

import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.kafka010.{ConsumerStrategies, KafkaUtils, LocationStrategies}
import org.apache.spark.streaming.{Seconds, StreamingContext}

object KafkaStreaming {
  def main(args: Array[String]): Unit = {
    val sc = new SparkConf().setAppName("WC").setMaster("local[2]")
    val ssc = new StreamingContext(sc, Seconds(5))
    val brokers = "hadoop100:9092"
    val topics = Array("test")
    // Consumer configuration
    val kafkaParam = Map[String, Object](
      "bootstrap.servers" -> brokers, // initial addresses used to connect to the cluster
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      // identifies which consumer group this consumer belongs to
      "group.id" -> "group1",
      // "latest" resets the offset to the latest offset when none is committed
      "auto.offset.reset" -> "latest",
      // if true, the consumer's offsets are committed automatically in the background
      "enable.auto.commit" -> (false: java.lang.Boolean)
    )
    val lineMap = KafkaUtils.createDirectStream[String, String](ssc,
      LocationStrategies.PreferConsistent,
      ConsumerStrategies.Subscribe[String, String](topics, kafkaParam)
    )
    val lines = lineMap.map(record => record.value)
    val words = lines.flatMap(_.split(" "))
    val pair = words.map(x => (x, 1))
    // Count words over a 10-second window that slides every 5 seconds
    val wordCounts = pair.reduceByKeyAndWindow(
      (x: Int, y: Int) => x + y,
      Seconds(10),
      Seconds(5))
    // pair.reduceByKey(_ + _).print()
    wordCounts.print()
    ssc.start()
    ssc.awaitTermination()
  }
}
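Because enable.auto.commit is false above, this example never commits offsets, so a restarted job falls back to auto.offset.reset. For at-least-once processing across restarts, one option is to commit offsets yourself after each batch using the HasOffsetRanges and CanCommitOffsets interfaces from spark-streaming-kafka-0-10; a hedged sketch that plugs into lineMap from the example above (the processing body is a placeholder):

import org.apache.spark.streaming.kafka010.{CanCommitOffsets, HasOffsetRanges}

lineMap.foreachRDD { rdd =>
  // Capture the Kafka offset ranges of this batch before any shuffle
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  // ... process the batch here ...
  // Commit the offsets back to Kafka asynchronously once processing succeeds
  lineMap.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
}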

Start the ZooKeeper and Kafka clusters.

Then send messages to the test topic on hadoop100:

kafka-console-producer.sh --broker-list hadoop100:9092 --topic test
