Spark Stream Processing

Spark Streaming

Overview

Stream processing is usually contrasted with batch processing. In the streaming model, input arrives continuously and can be regarded as unbounded in time, which means the full data set is never available before computing; likewise, results are emitted continuously and are also unbounded in time. Stream processing generally has strict latency requirements, the computation is defined first and then applied to data as it arrives, and, to improve efficiency, incremental computation is preferred over full recomputation wherever possible. In the batch model, the complete data set exists first, the computation is then defined and applied to it; the computation runs over the full data set and the result is produced once, in its entirety.
Batch Processing vs. Stream Processing

| Category | Data type | Data size | Latency | Computation | Scenario |
| --- | --- | --- | --- | --- | --- |
| Batch | Static | GB and above | 30 minutes to several hours | Finite, eventually terminates | Offline |
| Stream | Continuous, dynamic | One record at a time, a few hundred bytes | Milliseconds or sub-second | Runs 24/7 | Online |

The mainstream stream-processing frameworks today are Kafka Streams, Storm (JStorm), Spark Streaming, and Flink (Blink).

  • Kafka Streams: a stream-processing library shipped as a jar on top of Kafka; it has a low entry barrier and is simple to integrate.

  • Apache Storm: a pure stream-processing engine, capable of low-latency processing at a rate of millions of records per second.

  • Spark Streaming: a stream-processing framework built on top of Spark's batch engine. Unlike batch processing, the input is an unbounded stream and the output is continuous. Internally, Spark Streaming splits the continuous input into a series of small micro-batch RDDs, which gives it stream-like behavior; at the micro level, Spark Streaming is therefore still a batch framework.

  • Flink DataStream: offers major improvements in latency, usability, and performance, and is currently the most popular stream-processing engine.

Quick Start

  • Import the dependencies
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_2.11</artifactId>
    <version>2.4.5</version>
</dependency>

<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-streaming_2.11</artifactId>
    <version>2.4.5</version>
</dependency>
  • Write the Driver
import org.apache.spark.SparkConf
import org.apache.spark.streaming.dstream.{DStream, ReceiverInputDStream}
import org.apache.spark.streaming.{Seconds, StreamingContext}

object SparkWordCountTopology {
  def main(args: Array[String]): Unit = {
    //1. Create the streaming environment
    val conf = new SparkConf()
                    .setAppName("SparkWordCountTopology")
                    .setMaster("spark://CentOS:7077")
    val streamingContext = new StreamingContext(conf, Seconds(1))

    //2. Create the continuous input DStream
    val lines: DStream[String] = streamingContext.socketTextStream("CentOS", 9999)

    //3. Transform the discretized stream
    val result:DStream[(String,Int)] = lines.flatMap(line => line.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey((v1, v2) => v1 + v2)

    //4. Print the results to the console
    result.print()

    //5. Start the streaming computation
    streamingContext.start()
    streamingContext.awaitTermination()
  }
}
  • Package the project with mvn package
<build>
  <plugins>
    <!-- Scala compiler plugin -->
    <plugin>
      <groupId>net.alchim31.maven</groupId>
      <artifactId>scala-maven-plugin</artifactId>
      <version>4.0.1</version>
      <executions>
        <execution>
          <id>scala-compile-first</id>
          <phase>process-resources</phase>
          <goals>
            <goal>add-source</goal>
            <goal>compile</goal>
          </goals>
        </execution>
      </executions>
    </plugin>
    <!-- fat-jar (shade) plugin -->
    <plugin>
      <groupId>org.apache.maven.plugins</groupId>
      <artifactId>maven-shade-plugin</artifactId>
      <version>2.4.3</version>
      <executions>
        <execution>
          <phase>package</phase>
          <goals>
            <goal>shade</goal>
          </goals>
          <configuration>
            <filters>
              <filter>
                <artifact>*:*</artifact>
                <excludes>
                  <exclude>META-INF/*.SF</exclude>
                  <exclude>META-INF/*.DSA</exclude>
                  <exclude>META-INF/*.RSA</exclude>
                </excludes>
              </filter>
            </filters>
          </configuration>
        </execution>
      </executions>
    </plugin>
    <!-- Java compiler plugin -->
    <plugin>
      <groupId>org.apache.maven.plugins</groupId>
      <artifactId>maven-compiler-plugin</artifactId>
      <version>3.2</version>
      <configuration>
        <source>1.8</source>
        <target>1.8</target>
        <encoding>UTF-8</encoding>
      </configuration>
      <executions>
        <execution>
          <phase>compile</phase>
          <goals>
            <goal>compile</goal>
          </goals>
        </execution>
      </executions>
    </plugin>
  </plugins>
</build>
  • Install the nc utility
[root@CentOS ~]# yum -y install nmap-ncat
  • Start the nc server
[root@CentOS ~]# nc -lk 9999

  • Submit the application
[root@CentOS spark-2.4.5]# ./bin/spark-submit --master spark://CentOS:7077 --name SparkWordCountTopology --deploy-mode client  --class com.baizhi.quickstart.SparkWordCountTopology --total-executor-cores 6 /root/spark-dstream-1.0-SNAPSHOT.jar
  • Check the job on the Spark Web UI

Discretized Streams

Discretized Stream, or DStream, is the basic abstraction provided by Spark Streaming. It represents a continuous stream of data, either the input stream received from a source or the processed stream produced by transforming an input stream. Internally, a DStream is represented by a continuous series of RDDs, Spark's abstraction for an immutable distributed dataset. Each RDD in a DStream contains the data from a particular time interval.
Any operation applied to a DStream translates into operations on the underlying RDDs. For example, in the earlier quick-start example, the flatMap operation is applied to every RDD of the lines DStream to generate the RDDs of the words DStream.
Note: given how DStreams work under the hood, the Seconds() interval passed to the StreamingContext should be slightly larger than the time it takes to process one micro-batch. Otherwise data will accumulate in Spark's memory.
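One way to check this in practice is to register a StreamingListener and compare the reported batch processing time with the batch interval. A minimal sketch, assuming ssc is the StreamingContext from the examples (the printed message itself is illustrative):

import org.apache.spark.streaming.scheduler.{StreamingListener, StreamingListenerBatchCompleted}

//register before ssc.start(); prints how long each micro-batch actually took
ssc.addStreamingListener(new StreamingListener {
  override def onBatchCompleted(batchCompleted: StreamingListenerBatchCompleted): Unit = {
    val info = batchCompleted.batchInfo
    println(s"batch ${info.batchTime}: processing ${info.processingDelay.getOrElse(-1L)} ms, " +
      s"scheduling delay ${info.schedulingDelay.getOrElse(-1L)} ms")
  }
})

If the processing time regularly approaches or exceeds the batch interval, the interval should be increased or the job tuned.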

DStreams & Receivers

Every input DStream (except file streams, discussed below) is associated with a Receiver object, which receives the data from a source and stores it in Spark's memory for processing. Spark Streaming provides two categories of input sources for ingesting data from external systems:

Built-in sources

  • Basic sources: sources available directly through the StreamingContext API, e.g. fileStream (reading files) and socket streams (for testing).

socketTextStream

val lines: DStream[String] = ssc.socketTextStream("CentOS", 9999)

File Streams

val lines: DStream[String] = ssc.textFileStream("hdfs://CentOS:9000/words/src")

or

val lines: DStream[(LongWritable,Text)] =  ssc.fileStream[LongWritable,Text,TextInputFormat]("hdfs://CentOS:9000/words/src")

Spark periodically checks whether new files have appeared in the monitored directory (hdfs://CentOS:9000/words/src) and automatically reads any new file it finds. It does not monitor changes to the content of existing files. Tip: when testing, make sure the clocks of the machines involved are synchronized.

  • Queue of RDDs (for testing)
import scala.collection.mutable.Queue
import org.apache.spark.rdd.RDD

val queueRDDS = new Queue[RDD[String]]()
//generate test data in a background thread
new Thread(new Runnable {
  override def run(): Unit = {
    while(true){
      //append a test RDD to the queue
      queueRDDS += ssc.sparkContext.makeRDD(List("this is a demo","hello hello"))
      Thread.sleep(500)
    }
  }
}).start()
//2. Create the continuous input DStream from the queue
val lines: DStream[String] =  ssc.queueStream(queueRDDS)

Advanced sources

  • Advanced sources: sources that are not bundled with Spark, such as Kafka, Flume, and Kinesis; these generally require third-party dependencies.

Custom Receiver

import scala.util.Random
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver

class CustomReceiver (values:List[String]) extends Receiver[String](StorageLevel.MEMORY_ONLY) {

  override def onStart(): Unit = {
    new Thread(new Runnable {
      override def run(): Unit =  receive()
    } ).start()
  }
  override def onStop(): Unit = {}
  //receive data from the external system
  private def receive(): Unit = {
    try {
      while (!isStopped()){
        Thread.sleep(500)
        val line = values(new Random().nextInt(values.length))
        //push a randomly chosen line into Spark
        store(line)
      }
      restart("Trying to restart again")
    } catch {
      case t: Throwable =>
        restart("Error trying to restart again", t)
    }
  }
}
var arrays=List("this is a demo","good good ","study come on")
val lines: DStream[String] =  ssc.receiverStream(new CustomReceiver(arrays))

Integrating Spark with Kafka

Reference: http://spark.apache.org/docs/latest/streaming-kafka-0-10-integration.html

<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-streaming-kafka-0-10_2.11</artifactId>
  <version>2.4.5</version>
</dependency>
package com.baizhi.inputs

import org.apache.kafka.clients.consumer.{ConsumerConfig, ConsumerRecord}
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.dstream.{DStream, InputDStream}
import org.apache.spark.streaming.kafka010.{ConsumerStrategies, KafkaUtils, LocationStrategies}
import org.apache.spark.streaming.{Seconds, StreamingContext}

import scala.collection.mutable

//test input: kafka-console-producer.sh --broker-list CentOS:9092 --topic topic04
object SparkStreamingKafkaTopology {

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[6]")
      .setAppName("SparkStreamingWorldCountTopology")

    val ssc = new  StreamingContext(conf, Seconds(1))
    ssc.sparkContext.setLogLevel("ERROR")

    val kafkaParams=Map[String,Object](
      ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG->"CentOS:9092",
      ConsumerConfig.GROUP_ID_CONFIG -> "g1",
      ConsumerConfig.AUTO_OFFSET_RESET_CONFIG -> "latest",
      ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG -> (false:java.lang.Boolean),
      ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG   -> classOf[StringDeserializer],
      ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG -> classOf[StringDeserializer]
    )
    val kafkaStream: InputDStream[ConsumerRecord[String, String]] = KafkaUtils.createDirectStream(ssc, LocationStrategies.PreferConsistent, ConsumerStrategies.Subscribe[String, String](List("topic04"), kafkaParams))

    kafkaStream.map(record=>record.value())
      .flatMap(_.split("\\s+"))
      .map((_,1))
      .reduceByKey(_+_)
      .print()



    //2. Start the streaming computation
    ssc.start()
    ssc.awaitTermination()
  }

}

Transformations

DStream transformations are similar to RDD transformations: they turn one DStream into a new DStream, and most of the common operators behave the same way as their Spark RDD counterparts.

| Transformation | Meaning |
| --- | --- |
| map(func) | Return a new DStream by passing each element of the source DStream through a function func. |
| flatMap(func) | Similar to map, but each input item can be mapped to 0 or more output items. |
| filter(func) | Return a new DStream by selecting only the records of the source DStream on which func returns true. |
| repartition(numPartitions) | Changes the level of parallelism in this DStream by creating more or fewer partitions. |
| union(otherStream) | Return a new DStream that contains the union of the elements in the source DStream and otherDStream. |
| count() | Return a new DStream of single-element RDDs by counting the number of elements in each RDD of the source DStream. |
| reduce(func) | Return a new DStream of single-element RDDs by aggregating the elements in each RDD of the source DStream using a function func (which takes two arguments and returns one). The function should be associative and commutative so that it can be computed in parallel. |
| countByValue() | When called on a DStream of elements of type K, return a new DStream of (K, Long) pairs where the value of each key is its frequency in each RDD of the source DStream. |
| reduceByKey(func, [numTasks]) | When called on a DStream of (K, V) pairs, return a new DStream of (K, V) pairs where the values for each key are aggregated using the given reduce function. Note: By default, this uses Spark's default number of parallel tasks (2 for local mode, and in cluster mode the number is determined by the config property spark.default.parallelism) to do the grouping. You can pass an optional numTasks argument to set a different number of tasks. |
| join(otherStream, [numTasks]) | When called on two DStreams of (K, V) and (K, W) pairs, return a new DStream of (K, (V, W)) pairs with all pairs of elements for each key. |
| cogroup(otherStream, [numTasks]) | When called on a DStream of (K, V) and (K, W) pairs, return a new DStream of (K, Seq[V], Seq[W]) tuples. |
| transform(func) | Return a new DStream by applying a RDD-to-RDD function to every RDD of the source DStream. This can be used to do arbitrary RDD operations on the DStream. |
| updateStateByKey(func) | Return a new "state" DStream where the state for each key is updated by applying the given function on the previous state of the key and the new values for the key. This can be used to maintain arbitrary state data for each key. |

map

//1,zhangsan,true
lines.map(line=> line.split(","))
    .map(words=>(words(0).toInt,words(1),words(2).toBoolean))
    .print()

flatMap

//hello spark
lines.flatMap(line=> line.split("\\s+"))
        .map((_,1)) //(hello,1)(spark,1)
        .print()

filter

//keep only the lines that contain "hello"
lines.filter(line => line.contains("hello"))
    .flatMap(line=> line.split("\\s+"))
    .map((_,1))
    .print()

repartition (change the number of partitions)

lines.repartition(10) //change the parallelism (number of partitions)
    .filter(line => line.contains("hello"))
    .flatMap(line=> line.split("\\s+"))
    .map((_,1))
    .print()

union (merge two streams)

val stream1: DStream[String] = ssc.socketTextStream("CentOS",9999)
val stream2: DStream[String] = ssc.socketTextStream("CentOS",8888)
stream1.union(stream2).repartition(10)
    .filter(line => line.contains("hello"))
    .flatMap(line=> line.split("\\s+"))
    .map((_,1))
    .print()

count

val stream1: DStream[String] = ssc.socketTextStream("CentOS",9999)
val stream2: DStream[String] = ssc.socketTextStream("CentOS",8888)
stream1.union(stream2).repartition(10)
    .flatMap(line=> line.split("\\s+"))
    .count() //count the number of elements in each micro-batch RDD
    .print()

reduce(func)

//sample input: aa bb
val stream: DStream[String] = ssc.socketTextStream("CentOS",8888)
stream.flatMap(line=> line.split("\\s+"))
    .reduce(_+"|"+_)
    .print() // aa|bb

countByValue (count occurrences per value)

val stream: DStream[String] = ssc.socketTextStream("CentOS",8888)
stream.repartition(10) // a a b c
    .flatMap(line=> line.split("\\s+"))
    .countByValue()  //(a,2) (b,1) (c,1)
    .print()

reduceByKey(func, [numTasks])

var lines:DStream[String]=ssc.socketTextStream("CentOS",9999) //this is spark this
    lines.repartition(10)
    .flatMap(line=> line.split("\\s+").map((_,1)))
    .reduceByKey(_+_)// (this,2)(is,1)(spark ,1)
    .print()

join(otherStream, [numTasks])

//1 zhangsan
val stream1: DStream[String] = ssc.socketTextStream("CentOS",9999)
//1 apple 1 4.5
val stream2: DStream[String] = ssc.socketTextStream("CentOS",8888)

val userPair:DStream[(String,String)]=stream1.map(line=>{
    var tokens= line.split(" ")
    (tokens(0),tokens(1))
})
val orderItemPair:DStream[(String,(String,Double))]=stream2.map(line=>{
    val tokens = line.split(" ")
    (tokens(0),(tokens(1),tokens(2).toInt * tokens(3).toDouble))
})
userPair.join(orderItemPair) //(key, (user, order item))
.map(t=>(t._1,t._2._1,t._2._2._1,t._2._2._2))//1 zhangsan apple 4.5
.print()

The records to be joined from the two streams must land in the same micro-batch RDD, otherwise the join produces nothing; a stream-stream join in this form is therefore of limited practical use.

transform

transform lets you combine a stream with an RDD, because it exposes the underlying micro-batch RDD; this makes stream-batch joins possible.

//1 apple 2 4.5
val orderLog: DStream[String] = ssc.socketTextStream("CentOS",8888)
var userRDD=ssc.sparkContext.makeRDD(List(("1","zhangsan"),("2","wangwu"))) //static data

val orderItemPair:DStream[(String,(String,Double))]=orderLog.map(line=>{
    val tokens = line.split(" ")
    (tokens(0),(tokens(1),tokens(2).toInt * tokens(3).toDouble))
})
orderItemPair.transform(rdd=> rdd.join(userRDD))
.print()

updateStateByKey (stateful computation, full output)

package com.baizhi.state

import org.apache.spark.SparkConf
import org.apache.spark.streaming.dstream.DStream
import org.apache.spark.streaming.{Seconds, StreamingContext}

//[root@CentOS ~]# nc -lk 8888
object SparkStreamingUpdateStateByKeyTopology {

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[6]")
      .setAppName("SparkStreamingWorldCountTopology")

    val ssc = new  StreamingContext(conf, Seconds(1))
    ssc.sparkContext.setLogLevel("ERROR")

    //a checkpoint directory is required to store the historical state
    ssc.checkpoint("hdfs://CentOS:9000/checkpoints-dstream")

    //1. Create the input DStream; it is essentially a sequence of 1-second micro-batch RDDs
    val lines:DStream[String] = ssc.socketTextStream("CentOS", 8888)

    lines.flatMap(line=>line.split("\\s+"))
      .map(word=>(word,1))
      .updateStateByKey((values:Seq[Int],state:Option[Int])=>{ //full output: the whole state is re-emitted every batch
        val historyCount = state.getOrElse(0)
        Some(historyCount + values.sum)
      })
      .print() //print the output

    //2. Start the streaming computation
    ssc.start()
    ssc.awaitTermination()
  }

}

A checkpoint directory must be configured to store the program's state. This operator is fairly heavy on memory, since the state of every key is recomputed and emitted in every batch.

mapWithState (stateful computation, incremental output)

package com.baizhi.state

import org.apache.spark.SparkConf
import org.apache.spark.streaming.dstream.DStream
import org.apache.spark.streaming.{Seconds, State, StateSpec, StreamingContext}

//[root@CentOS ~]# nc -lk 8888
object SparkStreamingMapWithStateTopology {

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[6]")
      .setAppName("SparkStreamingWorldCountTopology")

    val ssc = new  StreamingContext(conf, Seconds(1))
    ssc.sparkContext.setLogLevel("ERROR")

    //a checkpoint directory is required to store the historical state
    ssc.checkpoint("hdfs://CentOS:9000/checkpoints-dstream")

    //1. Create the input DStream; it is essentially a sequence of 1-second micro-batch RDDs
    val lines:DStream[String] = ssc.socketTextStream("CentOS", 8888)

    lines.flatMap(line=>line.split("\\s+"))
      .map(word=>(word,1))
      .mapWithState(StateSpec.function((key:String,value:Option[Int],state:State[Int])=>{ //incremental output!
        var historyCount=0
        if(state.exists()){
          historyCount=state.getOption().getOrElse(0)
        }
        //update the state
        val newCount=historyCount+value.getOrElse(0)
        state.update(newCount)
        (key,newCount)
      }))
      .print() //print the output

    //2. Start the streaming computation
    ssc.start()
    ssc.awaitTermination()
  }

}

A checkpoint directory must be configured to store the program's state.
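mapWithState can also expire keys that have received no data for a while, via StateSpec.timeout, which helps bound the amount of state kept in memory. A minimal sketch reusing the mapping function above (the 30-minute timeout is an illustrative value, not part of the original example):

import org.apache.spark.streaming.{Minutes, State, StateSpec}

val spec = StateSpec.function((key:String, value:Option[Int], state:State[Int]) => {
    var historyCount = 0
    if (state.exists()) {
      historyCount = state.getOption().getOrElse(0)
    }
    val newCount = historyCount + value.getOrElse(0)
    //a state that is timing out cannot be updated, so guard the update
    if (!state.isTimingOut()) {
      state.update(newCount)
    }
    (key, newCount)
  }).timeout(Minutes(30)) //drop keys that received no data for 30 minutes

//usage: lines.flatMap(_.split("\\s+")).map((_,1)).mapWithState(spec).print()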

DStream Fault Recovery

package com.baizhi.failover

import org.apache.spark.SparkConf
import org.apache.spark.streaming.dstream.DStream
import org.apache.spark.streaming.{Seconds, State, StateSpec, StreamingContext}

//[root@CentOS ~]# nc -lk 8888
object SparkStreamingMapWithStateTopology {

  def main(args: Array[String]): Unit = {

    var checkpointDir="hdfs://CentOS:9000/checkpoints-failover-dstream";
    //try to recover from checkpointDir; if there is nothing to recover, failover() is called to build a new context
    var ssc=StreamingContext.getOrCreate(checkpointDir,()=>failover(checkpointDir))
    //set the log level
    ssc.sparkContext.setLogLevel("ERROR")
    //2. Start the streaming computation
    ssc.start()
    ssc.awaitTermination()
  }
  def failover(checkpointDir:String):StreamingContext={
    println("--------------------\n" * 10 )
    val conf = new SparkConf().setMaster("local[6]")
      .setAppName("SparkStreamingWorldCountTopology")
    val ssc = new  StreamingContext(conf, Seconds(1))
    //a checkpoint directory is required to store the historical state
    ssc.checkpoint(checkpointDir)

    //1. Create the input DStream; it is essentially a sequence of 1-second micro-batch RDDs
    val lines:DStream[String] = ssc.socketTextStream("CentOS", 8888)
    lines.flatMap(line=>line.split("\\s+"))
      .map(word=>(word,1))
      .mapWithState(StateSpec.function((key:String,value:Option[Int],state:State[Int])=>{ //incremental output!
        var historyCount=0
        if(state.exists()){
          historyCount=state.getOption().getOrElse(0)
        }
        //update the state
        val newCount=historyCount+value.getOrElse(0)
        state.update(newCount)
        (key,newCount)
      }))
      .print() //print the output

    ssc
  }

}

Drawback: once the state has been checkpointed, changes to the user code are no longer picked up, because the system restores from the checkpoint instead of calling the creating function again; to make modified code take effect, the checkpoint directory must be deleted manually.

Window Operations

Spark Streaming also provides windowed computations, which let you apply transformations over a sliding window of data.


Every time the window slides over the source DStream, the source RDDs that fall within the window are combined and operated on to produce the RDDs of the windowed DStream. For instance, an operation might be applied over the last 3 time units of data and slide by 2 time units. Any window operation therefore needs two parameters:

  • Window length: the duration of the window (3 time units in the example).
  • Slide interval: the interval at which the window operation is performed (2 time units in the example).

Note: both parameters must be multiples of the batch interval of the source DStream, because a micro-batch RDD is the atomic, smallest unit of processing for a DStream. In stream processing, a window whose length equals its slide interval is called a tumbling window (windows do not overlap); a window whose length is greater than its slide interval is called a sliding window (consecutive windows overlap). In general the window length should be >= the slide interval, otherwise some data would be skipped.
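For example, a tumbling window and a sliding window differ only in the two durations. A minimal sketch, where lines stands for any DStream from the earlier examples and the durations are illustrative:

//tumbling window: length == slide, every element belongs to exactly one window
val tumbling: DStream[String] = lines.window(Seconds(4), Seconds(4))
//sliding window: length > slide, consecutive windows share 2 seconds of data
val sliding: DStream[String] = lines.window(Seconds(4), Seconds(2))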

Some of the common window operations are listed below. All of them take the two parameters described above: windowLength and slideInterval.

| Transformation | Meaning |
| --- | --- |
| window(windowLength, slideInterval) | Return a new DStream which is computed based on windowed batches of the source DStream. |
| countByWindow(windowLength, slideInterval) | Return a sliding window count of elements in the stream. |
| reduceByWindow(func, windowLength, slideInterval) | Return a new single-element stream, created by aggregating elements in the stream over a sliding interval using func. The function should be associative and commutative so that it can be computed correctly in parallel. |
| reduceByKeyAndWindow(func, windowLength, slideInterval, [numTasks]) | When called on a DStream of (K, V) pairs, returns a new DStream of (K, V) pairs where the values for each key are aggregated using the given reduce function func over batches in a sliding window. Note: By default, this uses Spark's default number of parallel tasks (2 for local mode, and in cluster mode the number is determined by the config property spark.default.parallelism) to do the grouping. You can pass an optional numTasks argument to set a different number of tasks. |
| reduceByKeyAndWindow(func, invFunc, windowLength, slideInterval, [numTasks]) | A more efficient version of the above reduceByKeyAndWindow() where the reduce value of each window is calculated incrementally using the reduce values of the previous window. This is done by reducing the new data that enters the sliding window, and "inverse reducing" the old data that leaves the window. An example would be that of "adding" and "subtracting" counts of keys as the window slides. However, it is applicable only to "invertible reduce functions", that is, those reduce functions which have a corresponding "inverse reduce" function (taken as parameter invFunc). Like in reduceByKeyAndWindow, the number of reduce tasks is configurable through an optional argument. Note that checkpointing must be enabled for using this operation. |
| countByValueAndWindow(windowLength, slideInterval, [numTasks]) | When called on a DStream of (K, V) pairs, returns a new DStream of (K, Long) pairs where the value of each key is its frequency within a sliding window. Like in reduceByKeyAndWindow, the number of reduce tasks is configurable through an optional argument. |

window(windowLength, slideInterval)

ssc.socketTextStream("CentOS", 9999)
          .flatMap(line => line.split("\\s+"))
          .map(word => (word, 1))
          .window(Seconds(2),Seconds(2))
          .reduceByKey((v1, v2) => v1 + v2)
          .print()

countByWindow(windowLength, slideInterval)

ssc.checkpoint("hdfs://CentOS:9000/spark-checkpoints")
ssc.socketTextStream("CentOS", 9999)
.flatMap(line => line.split("\\s+"))
.map(word => (word, 1))
.countByWindow(Seconds(2),Seconds(2))
.print()

Equivalent to applying window first and then the count operator.
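A sketch of the logically equivalent pipeline, written explicitly as window followed by count (the same window-then-operator pattern applies to the reduceByWindow and reduceByKeyAndWindow variants below):

ssc.socketTextStream("CentOS", 9999)
  .flatMap(line => line.split("\\s+"))
  .map(word => (word, 1))
  .window(Seconds(2), Seconds(2))
  .count() //number of elements that fell into the 2-second window
  .print()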

reduceByWindow(func, windowLength, slideInterval)

ssc.socketTextStream("CentOS", 9999)
.flatMap(line => line.split("\\s+"))
.reduceByWindow(_+" | "+_,Seconds(2),Seconds(2))
.print()

Equivalent to applying window first and then the reduce operator.

reduceByKeyAndWindow(func, windowLength, slideInterval, [numTasks])

ssc.socketTextStream("CentOS", 9999)
.flatMap(line => line.split("\\s+"))
.map(word => (word, 1))
.reduceByKeyAndWindow(_+_,Seconds(2),Seconds(2))
.print()

Equivalent to applying window first and then the reduceByKey operator.

reduceByKeyAndWindow(func, invFunc, windowLength, slideInterval, [numTasks])


ssc.checkpoint("hdfs://CentOS:9000/spark-checkpoints")
ssc.sparkContext.setLogLevel("FATAL")

ssc.socketTextStream("CentOS", 9999)
.flatMap(line => line.split("\\s+"))
.map((_,1))
.reduceByKeyAndWindow(
  (v1,v2)=>v1+v2, //add the values that enter the window
  (v1,v2)=>v1-v2, //subtract the values that leave the window
  Seconds(4),Seconds(1),
  filterFunc = t=> t._2 > 0) //drop keys whose count has fallen to 0
.print()

This variant is only more efficient when consecutive windows overlap substantially (more than half of the elements are shared between slides).

Output Operations

Output operations allow DStream data to be pushed to external systems such as databases or file systems. Because they let external systems consume the transformed data, they trigger the actual execution of all DStream transformations (similar to actions on RDDs). The following output operations are currently defined:

| Output Operation | Meaning |
| --- | --- |
| print() | Prints the first ten elements of every batch of data in a DStream on the driver node running the streaming application. This is useful for development and debugging. Python API: this is called pprint() in the Python API. |
| saveAsTextFiles(prefix, [suffix]) | Save this DStream's contents as text files. The file name at each batch interval is generated based on prefix and suffix: "prefix-TIME_IN_MS[.suffix]". |
| saveAsObjectFiles(prefix, [suffix]) | Save this DStream's contents as SequenceFiles of serialized Java objects. The file name at each batch interval is generated based on prefix and suffix: "prefix-TIME_IN_MS[.suffix]". Python API: this is not available in the Python API. |
| saveAsHadoopFiles(prefix, [suffix]) | Save this DStream's contents as Hadoop files. The file name at each batch interval is generated based on prefix and suffix: "prefix-TIME_IN_MS[.suffix]". Python API: this is not available in the Python API. |
| foreachRDD(func) | The most generic output operator that applies a function, func, to each RDD generated from the stream. This function should push the data in each RDD to an external system, such as saving the RDD to files, or writing it over the network to a database. Note that the function func is executed in the driver process running the streaming application, and will usually have RDD actions in it that will force the computation of the streaming RDDs. |

SparkStreamingMapWithStateTopology

package com.baizhi.outputs

import org.apache.spark.SparkConf
import org.apache.spark.streaming.dstream.DStream
import org.apache.spark.streaming.{Seconds, State, StateSpec, StreamingContext}
//submission command
//[root@CentOS ~]# spark-submit --master spark://CentOS:7077 --name kafkaSink --deploy-mode cluster --class com.baizhi.outputs.SparkStreamingMapWithStateTopology --total-executor-cores 6 --driver-cores 2 /root/spark-dstream-1.0-SNAPSHOT.jar
object SparkStreamingMapWithStateTopology {

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("spark://CentOS:7077")
      .setAppName("SparkStreamingWorldCountTopology")

    val ssc = new  StreamingContext(conf, Seconds(1))
    ssc.sparkContext.setLogLevel("ERROR")

    //a checkpoint directory is required to store the historical state
    ssc.checkpoint("hdfs://CentOS:9000/checkpoints-dstream")

    //1. Create the input DStream; it is essentially a sequence of 1-second micro-batch RDDs
    val lines:DStream[String] = ssc.socketTextStream("CentOS", 8888)

    lines.flatMap(line=>line.split("\\s+"))
      .map(word=>(word,1))
      .mapWithState(StateSpec.function((key:String,value:Option[Int],state:State[Int])=>{ //incremental output!
        var historyCount=0
        if(state.exists()){
          historyCount=state.getOption().getOrElse(0)
        }
        //update the state
        val newCount=historyCount+value.getOrElse(0)
        state.update(newCount)
        (key,newCount)
      }))
      .foreachRDD((rdd,time)=>{
           rdd.foreachPartition(vs=>{
             //write the partition's data to the external system
             KafkaSink.writeToKafka(vs.toList,"topic04")
           })
       })

    //2. Start the streaming computation
    ssc.start()
    ssc.awaitTermination()
  }

}

Kafka Sink

package com.baizhi.outputs

import java.util.Properties

import org.apache.kafka.clients.producer.{KafkaProducer, ProducerConfig, ProducerRecord}
import org.apache.kafka.common.serialization.StringSerializer

object KafkaSink extends Serializable {
  val kafkaProducer: KafkaProducer[String,String] = createKafkaProducer()

  def createKafkaProducer(): KafkaProducer[String,String] = {
    val props = new Properties()
    //required settings
    props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG,"CentOS:9092")
    props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,classOf[StringSerializer])
    props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,classOf[StringSerializer])

    //idempotence-related settings: acks, retries
    props.put(ProducerConfig.ACKS_CONFIG,"all")
    props.put(ProducerConfig.RETRIES_CONFIG,"3")
    props.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG,"true")
    props.put(ProducerConfig.REQUEST_TIMEOUT_MS_CONFIG,"5000")

    //batching settings
    props.put(ProducerConfig.BATCH_SIZE_CONFIG,"1024")
    props.put(ProducerConfig.LINGER_MS_CONFIG,"1000")

    new KafkaProducer[String,String](props)
  }

  /**
   * @param values
   */
  def writeToKafka( values: List[(String, Int)],topic:String): Unit = {
    for(value <- values){
      val record = new ProducerRecord[String, String](topic, value._1, value._2 + "")
      kafkaProducer.send(record)
    }
    kafkaProducer.flush()
  }

  sys.addShutdownHook({
      if(kafkaProducer!=null){
        kafkaProducer.close()
      }
  })

}

RedisSink

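A Redis sink can follow the same pattern as the Kafka sink above: one client per executor JVM, written to from foreachPartition. A minimal sketch assuming the redis.clients:jedis client library is on the classpath; the host, port, and one-key-per-word layout are illustrative choices:

package com.baizhi.outputs

import redis.clients.jedis.JedisPool

object RedisSink extends Serializable {

  //one connection pool per executor JVM
  lazy val pool = new JedisPool("CentOS", 6379)

  def writeToRedis(values: List[(String, Int)]): Unit = {
    val jedis = pool.getResource
    try {
      for ((word, count) <- values) {
        jedis.set(word, count.toString)
      }
    } finally {
      jedis.close() //return the connection to the pool
    }
  }

  sys.addShutdownHook({
    pool.close()
  })
}

It would be invoked from the topology in the same way as KafkaSink, e.g. rdd.foreachPartition(vs => RedisSink.writeToRedis(vs.toList)).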

Integrating DStreams with DataFrames and SQL

ssc.socketTextStream("CentOS", 9999)
  .flatMap(line => line.split("\\s+"))
  .map((_,1))
  .reduceByKeyAndWindow(
  (v1,v2)=>v1+v2,//加上新移入元素
  (v1,v2)=>v1-v2,//减去有移除元素
  Seconds(4),Seconds(2),
  filterFunc = t=> t._2 > 0) //过滤掉值=0元素
  .foreachRDD(rdd=>{
     val spark = SparkSession.builder.config(rdd.sparkContext.getConf).getOrCreate()
      import spark.implicits._
      val wordsDataFrame = rdd.toDF("word","count")

      val props = new Properties()
      props.put("user", "root")
      props.put("password", "root")

      wordsDataFrame .write
      .mode(SaveMode.Append)
      .jdbc("jdbc:mysql://CentOS:3306/test","t_wordcount",props)
  })