Spark Streaming Tutorial

leexurui

Article link: https://blog.csdn.net/leexurui/article/details/52352085

import org.apache.log4j.{Level, Logger}
import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}

object NetworkWordCount {
  def main(args: Array[String]): Unit = {
    if (args.length < 2) {
      System.err.println("Usage: NetworkWordCount <hostname> <port>")
      System.exit(1)
    }

    // StreamingExamples.setStreamingLogLevels()
    // Suppress unnecessary log output on the terminal
    Logger.getLogger("org.apache.spark").setLevel(Level.ERROR)
    Logger.getLogger("org.eclipse.jetty.server").setLevel(Level.OFF)

    // Create the context with a 1-second batch interval (the stream is split into 1-second batches)
    val sparkConf = new SparkConf().setAppName("NetworkWordCount")
    val ssc = new StreamingContext(sparkConf, Seconds(1))

    // Create a socket stream on target ip:port and count the
    // words in input stream of \n delimited text (eg. generated by 'nc')
    // Note: this storage level has no replication, which is only suitable for running locally.
    // Replication is necessary in a distributed scenario for fault tolerance.
    val lines = ssc.socketTextStream(args(0), args(1).toInt, StorageLevel.MEMORY_AND_DISK_SER)
    val words = lines.flatMap(_.split(" "))
    val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _)
    wordCounts.print()

    ssc.start()
    ssc.awaitTermination()
  }
}

After building the jar, create ./testStream.sh with the following content:

/opt/cloudera/parcels/CDH-5.6.0-1.cdh5.6.0.p0.45/lib/spark/bin/spark-submit \
  --class NetworkWordCount \
  --master local[2] \
  SparkStream.jar \
  localhost 9999

The arguments localhost 9999 tell the program to read streaming data from a TCP source (host localhost, port 9999).

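The lines variable is a DStream that represents the stream of data to be received from the data server. Each record in this DStream is one line of text. The next step is to split each line of the DStream into words.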
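To make the per-batch computation concrete, here is a minimal sketch in plain Scala (no Spark required) of what the flatMap, map, and reduceByKey chain does to a single batch of lines. The object name WordCountSketch and the helper countWords are hypothetical names used only for illustration; in the real program the same logic runs on each batch of the DStream.

```scala
object WordCountSketch {
  // Counts the words in one batch of lines, mirroring
  // lines.flatMap(_.split(" ")).map(x => (x, 1)).reduceByKey(_ + _)
  def countWords(batch: Seq[String]): Map[String, Int] =
    batch
      .flatMap(_.split(" "))   // split every line into words
      .groupBy(identity)       // gather identical words together
      .map { case (word, occurrences) => (word, occurrences.size) }

  def main(args: Array[String]): Unit = {
    // Assumed sample input: two lines arriving in the same 1-second batch
    val batch = Seq("hello world", "hello spark")
    countWords(batch).toSeq.sortBy(_._1).foreach(println)
  }
}
```

Because each batch is counted independently, a word that appears in two different batches is reported twice with separate counts.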

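Open another terminal window on the same machine and carry out the following steps:
Run nc -lk 9999 to listen on port 9999 (nc is often not installed by default; the original post installs it with sudo yum install NetCat, though the package name may differ by distribution).
In the first window, run ./testStream.sh.
Switch to the nc -lk 9999 window, which is now waiting for input.
Type hello world and press Enter.
Go back to the first window, where the line has been broken down into key-value pairs such as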

hello 1
world 1

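which shows that the stream has counted the occurrences of each word.

You can also feed a file into the socket:

cat file1.txt | nc -lk 9999

The vertical bar in the command above is a pipe.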

Once these transformations have been set up, the computation only actually starts when you call the following methods:

ssc.start()             // Start the computation
ssc.awaitTermination()  // Wait for the computation to terminate

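Or call ssc.awaitTermination(30000), which waits until the computation terminates or 30 seconds (30000 ms) have passed. Since Spark 1.3 this overload is deprecated in favor of ssc.awaitTerminationOrTimeout(30000).

Alternatively: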

   ssc.start()
   Thread.sleep(30000)
   ssc.stop()

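which stops the computation after roughly 30 seconds.

A discretized stream (DStream) is the basic abstraction provided by Spark Streaming; it represents a continuous stream of data. Internally, a DStream is a sequence of RDDs, one per batch interval; in this example each batch covers 1 second.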



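Spark Streaming has two categories of data sources.

Basic sources: these are directly available in the StreamingContext API, for example file systems, socket connections, and Akka actors.
streamingContext.fileStream[keyClass, valueClass, inputFormatClass](dataDirectory)
Spark Streaming monitors the directory dataDirectory and processes any files created in it (nested directories are not supported). Note that: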
```
1 All files must have the same data format
2 All files must be created in `dataDirectory`
```

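For simple text files there is an easier method: streamingContext.textFileStream(dataDirectory).

- Streams based on custom actors: a DStream can be created from data received through Akka actors by calling streamingContext.actorStream(actorProps, actor-name). See the custom receiver guide for details. actorStream is not available in the Python API.
- Queue of RDDs as a stream: to test a Spark Streaming application with test data, you can also create a DStream from a queue of RDDs by calling streamingContext.queueStream(queueOfRDDs). Each RDD pushed into the queue is treated as one batch of the DStream and processed like a stream.

Advanced sources: these include Kafka, Flume, Kinesis, Twitter, and so on, and are used through extra classes. To receive data from sources such as Kafka, Flume, and Kinesis, which are not provided by the Spark core API, add the corresponding artifact spark-streaming-xyz_2.10 to the dependencies. Some common artifacts are listed in the table below: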

Source	Artifact
Kafka	spark-streaming-kafka_2.10
Flume	spark-streaming-flume_2.10
Kinesis	spark-streaming-kinesis-asl_2.10
Twitter	spark-streaming-twitter_2.10
ZeroMQ	spark-streaming-zeromq_2.10
MQTT	spark-streaming-mqtt_2.10

For the latest list, see the Apache repository.
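As an illustration of how one of these artifacts might be added, the following build.sbt fragment is a sketch under stated assumptions: the Scala and Spark versions are guesses for a CDH 5.6 cluster and are not given in the original post, so match them to your own installation. With a 2.10.x scalaVersion, the %% operator appends the _2.10 suffix shown in the table.

```scala
// build.sbt (sketch): both version numbers below are assumptions, not taken from the post
scalaVersion := "2.10.5"

val sparkVersion = "1.5.0"

libraryDependencies ++= Seq(
  // Provided by the cluster at runtime, so it is not packaged into the jar
  "org.apache.spark" %% "spark-streaming" % sparkVersion % "provided",
  // Advanced source: resolves to spark-streaming-kafka_2.10
  "org.apache.spark" %% "spark-streaming-kafka" % sparkVersion
)
```

Advanced-source artifacts are not part of the Spark installation, so they usually have to be bundled into the application jar or passed to spark-submit separately.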

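Reference: http://www.cnblogs.com/shishanyuan/p/4747749.html

SocketWordCount is almost the same as NetworkWordCount, except that it uses a 10-second batch interval, so its code is omitted here.

First start the socket simulator (it serves on socket port 9999 and sends data once per second):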

/opt/cloudera/parcels/CDH-5.6.0-1.cdh5.6.0.p0.45/lib/spark/bin/spark-submit \
  --class StreamingSimulation \
  --master local[2] \
  SparkStream.jar \
  people.txt 9999 1000

people.txt is a file in the current directory with one word per line.

Then run SocketWordCount:

/opt/cloudera/parcels/CDH-5.6.0-1.cdh5.6.0.p0.45/lib/spark/bin/spark-submit \
  --class SocketWordCount \
  --master local[2] \
  SparkStream.jar \
  localhost 9999

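The output now shows the word counts for each batch.

Because each batch covers 10 seconds and the simulator sends one word per second, each batch receives exactly 10 words from the socket.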

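The updateStateByKey operation

The updateStateByKey operation lets you maintain arbitrary state while continuously updating it with new information. Using it takes two steps:

Define the state: the state can be of any data type.
Define the state update function: specify how to update the state from the previous state and the new values from the input stream.

Let's illustrate this with an example. In the example above, each batch shows 10 words and the next batch shows another 10, with no relation between them. Suppose you want to maintain a running count of each word seen in a text stream. The running count is the state, and it is an integer: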

def updateFunction(newValues: Seq[Int], runningCount: Option[Int]): Option[Int] = {
    val newCount = ...  // add the new values with the previous running count to get the new count
    Some(newCount)
}

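This function is applied to a DStream containing words. The version used in this tutorial is:

// Update function: values holds the counts for a word in the current batch,
// state holds the count accumulated over previous batches
val updateFunc = (values: Seq[Int], state: Option[Int]) => {
  val currentCount = values.foldLeft(0)(_ + _)
  val previousCount = state.getOrElse(0)
  Some(currentCount + previousCount)
}

The update function is called for each word: values is a sequence of 1s (from the pairs produced by val wordCounts = words.map(x => (x, 1))) and state holds the previous count. The resulting stateDstream contains the running totals.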

val stateDstream = wordCounts.updateStateByKey[Int](updateFunc)
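The following plain-Scala sketch (no Spark required) imitates how the update function carries a running count from one batch to the next. The object RunningCountSketch and the helper applyBatch are hypothetical names introduced for illustration; they only model the behavior and are not part of the Spark API.

```scala
object RunningCountSketch {
  // Same update function as in the tutorial
  val updateFunc = (values: Seq[Int], state: Option[Int]) => {
    val currentCount = values.foldLeft(0)(_ + _)
    val previousCount = state.getOrElse(0)
    Some(currentCount + previousCount)
  }

  // Hypothetical helper: applies updateFunc to one batch of words on top of the existing state
  def applyBatch(state: Map[String, Int], batch: Seq[String]): Map[String, Int] = {
    // (word, 1) pairs grouped by word, like the values passed to the update function
    val newValues: Map[String, Seq[Int]] =
      batch.groupBy(identity).map { case (word, ws) => (word, ws.map(_ => 1)) }
    // The function is called for keys with new data and for keys that already have state
    val keys = (state.keySet ++ newValues.keySet).toSeq
    keys.flatMap { key =>
      // A None result would drop the key; this update function always returns Some
      updateFunc(newValues.getOrElse(key, Seq.empty), state.get(key)).toList.map(count => key -> count)
    }.toMap
  }

  def main(args: Array[String]): Unit = {
    val afterFirst = applyBatch(Map.empty, Seq("a", "b", "a"))
    val afterSecond = applyBatch(afterFirst, Seq("a", "c"))
    println(afterFirst)   // counts after the first batch
    println(afterSecond)  // running counts after the second batch
  }
}
```

The word b keeps its count in the second batch even though it does not appear there, which is the difference from the stateless per-batch word count.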

import org.apache.log4j.{Level, Logger}
import org.apache.spark.{SparkContext, SparkConf}
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.StreamingContext._

object StatefulWordCount {
  def main(args: Array[String]): Unit = {
    if (args.length != 2) {
      System.err.println("Usage: StatefulWordCount <hostname> <port>")
      System.exit(1)
    }

    Logger.getLogger("org.apache.spark").setLevel(Level.ERROR)
    Logger.getLogger("org.eclipse.jetty.server").setLevel(Level.OFF)

    // Update function: values holds the counts for a word in the current batch,
    // state holds the count accumulated over previous batches
    val updateFunc = (values: Seq[Int], state: Option[Int]) => {
      val currentCount = values.foldLeft(0)(_ + _)
      val previousCount = state.getOrElse(0)
      Some(currentCount + previousCount)
    }

    val conf = new SparkConf().setAppName("StatefulWordCount").setMaster("local[2]")
    val sc = new SparkContext(conf)

    // Create the StreamingContext with a 5-second batch interval
    val ssc = new StreamingContext(sc, Seconds(5))

    // Use the current directory as the checkpoint directory
    ssc.checkpoint(".")

    // Receive the data sent over the socket
    val lines = ssc.socketTextStream(args(0), args(1).toInt)
    val words = lines.flatMap(_.split(","))
    val wordCounts = words.map(x => (x, 1))

    // Use updateStateByKey to keep the total count of each word since the application started
    val stateDstream = wordCounts.updateStateByKey[Int](updateFunc)
    stateDstream.print()

    ssc.start()
    ssc.awaitTermination()
  }
}

Unlike the earlier examples, in this one the state carries over between batch intervals.

One more point to note in the program above:

Checkpointing

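A streaming application must run around the clock, so it has to be resilient to failures unrelated to the application logic (system failures, JVM crashes, and so on). To make this possible, Spark Streaming needs to checkpoint enough information to a fault-tolerant storage system that it can recover from failures.

When to checkpoint

Checkpointing must be enabled in either of the following two cases:

Use of stateful transformations. If the application uses updateStateByKey or reduceByKeyAndWindow (with an inverse function), a checkpoint directory must be provided so that RDDs can be checkpointed periodically.
Recovery from failures of the driver running the application. Metadata checkpoints are used to recover the progress information.

Note that simple streaming applications without the stateful transformations mentioned above can run without checkpointing. In that case recovery from driver failures is only partial (data that was received but not yet processed may be lost). This is often acceptable, and many Spark Streaming applications run this way.

The program above sets the checkpoint directory to the current directory:

ssc.checkpoint(".")

so the checkpoint records are written to the current directory on Hadoop HDFS.

The files are serialized binary data and look like garbage when opened, so there is little point in inspecting them.

For how to use checkpoints to recover from failures, see: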

https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/streaming/RecoverableNetworkWordCount.scala

https://aiyanbo.gitbooks.io/spark-programming-guide-zh-cn/content/spark-streaming/basic-concepts/checkpointing.html

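Window operations

Spark Streaming also supports windowed computations, which let you apply transformations over a sliding window of data. The original post illustrates the sliding window with a figure.

Every window operation requires two parameters:

Window length: the duration of the window
Sliding interval: the interval at which the window operation is performed

Both parameters must be multiples of the batch interval of the source DStream.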
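The relationship between batch interval, window length, and sliding interval can be modeled in plain Scala (no Spark required). The sketch below is an idealized model, assuming data arrives steadily from the very first batch; the names WindowSketch, isValid, and windowTotals are hypothetical and not part of the Spark API.

```scala
object WindowSketch {
  // Both window parameters must be multiples of the batch interval
  def isValid(batchSec: Int, windowSec: Int, slideSec: Int): Boolean =
    windowSec % batchSec == 0 && slideSec % batchSec == 0

  // Given the number of records in each batch, returns the number of records
  // covered by each window result
  def windowTotals(batchCounts: Seq[Int], batchSec: Int, windowSec: Int, slideSec: Int): Seq[Int] = {
    require(isValid(batchSec, windowSec, slideSec), "window and slide must be multiples of the batch interval")
    val windowBatches = windowSec / batchSec  // batches per window
    val slideBatches = slideSec / batchSec    // batches between two results
    // One result after every slideBatches batches, covering at most the last windowBatches batches
    (slideBatches to batchCounts.length by slideBatches).map { end =>
      batchCounts.slice(math.max(0, end - windowBatches), end).sum
    }
  }

  def main(args: Array[String]): Unit = {
    // Assumed scenario: 5-second batches of 5 records each, 30-second window, 10-second slide
    println(windowTotals(Seq.fill(12)(5), 5, 30, 10))
  }
}
```

In this model the first results cover fewer records because the window has not yet filled; a real run can differ at startup depending on when the receiver connects.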

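The following example illustrates window operations. Say you want to extend the earlier example to compute word counts over the last 30 seconds of data, every 20 seconds. To do this, the reduceByKey operation has to be applied to the pairs DStream over the last 30 seconds of data, which is done with reduceByKeyAndWindow.

// Reduce last 30 seconds of data, every 20 seconds
val windowedWordCounts = pairs.reduceByKeyAndWindow((a:Int,b:Int) => (a + b), Seconds(30), Seconds(20))

The original post shows the corresponding figure here, followed by this program fragment: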

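val ssc = new StreamingContext(sc, Seconds(5))

// Use the current directory as the checkpoint directory
ssc.checkpoint(".")

// Receive data over a socket; the host name and port must be supplied.
// The data is stored serialized in memory only (MEMORY_ONLY_SER)
val lines = ssc.socketTextStream(args(0), args(1).toInt, StorageLevel.MEMORY_ONLY_SER)
val words = lines.flatMap(_.split(","))

// Window operation: the first form recomputes the whole window each time,
// the second (commented out) form updates the previous result incrementally.
// args(2) is the window length and args(3) the sliding interval, both in seconds
val wordCounts = words.map(x => (x , 1)).reduceByKeyAndWindow((a:Int,b:Int) => (a + b), Seconds(args(2).toInt), Seconds(args(3).toInt))
//val wordCounts = words.map(x => (x , 1)).reduceByKeyAndWindow(_+_, _-_,Seconds(args(2).toInt), Seconds(args(3).toInt))

wordCounts.print()

ssc.start()
ssc.awaitTermination()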

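Here the batch interval of the source DStream is 5 seconds, so both parameters must be multiples of 5.

The results show that the first computation covered 5 words, the second 15, the third 25, the fourth 30, the fifth 30, and so on: every later computation covers a full window of 30 words. (The earlier ones are smaller because the window had not yet filled.)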
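The two forms of reduceByKeyAndWindow in the fragment above differ in how each window result is obtained: the first reduces all data in the window again, while the form with an inverse function (_-_) adjusts the previous result by adding the data that enters the window and subtracting the data that leaves it. The plain-Scala sketch below (no Spark required) compares both strategies on per-batch counts with a slide of one batch; IncrementalWindowSketch, fullRecompute, and incremental are hypothetical names for illustration only.

```scala
object IncrementalWindowSketch {
  // Full recomputation: sum every batch currently inside the window
  def fullRecompute(batchCounts: Seq[Int], windowBatches: Int): Seq[Int] =
    batchCounts.indices.map { i =>
      batchCounts.slice(math.max(0, i + 1 - windowBatches), i + 1).sum
    }

  // Incremental update: previous total + entering batch - leaving batch
  def incremental(batchCounts: Seq[Int], windowBatches: Int): Seq[Int] =
    batchCounts.indices.scanLeft(0) { (previous, i) =>
      val entering = batchCounts(i)
      val leaving = if (i >= windowBatches) batchCounts(i - windowBatches) else 0
      previous + entering - leaving
    }.tail  // drop the initial 0 used to seed the scan

  def main(args: Array[String]): Unit = {
    val counts = Seq(1, 2, 3, 4)  // assumed per-batch counts for one word
    println(fullRecompute(counts, 2))
    println(incremental(counts, 2))
  }
}
```

Both strategies give the same totals; the incremental one does less work per window when the window spans many batches, which is why Spark needs checkpointing enabled to keep the previous result available.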
