Rethinking RDD and DStream

Author: 我一拳打弯你A柱

Article link: https://blog.csdn.net/alian_w/article/details/112692207

Hi everyone, I'm the A-pillar tough guy who can smash an A-pillar with one punch.

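I still haven't dug very deep into Spark Streaming's DStream, and I keep running into problems when doing stream processing: data collected from an RDD seems to vanish, nothing gets printed, the data can't be processed or fed back, and so on. So it's time to study DStream properly.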

1. The official Spark documentation on DStream and RDD

1.1 Introduction to DStream

The official DStream documentation contains this passage:

Discretized Stream or DStream is the basic abstraction provided by Spark Streaming. It represents a continuous stream of data, either the input data stream received from source, or the processed data stream generated by transforming the input stream. Internally, a DStream is represented by a continuous series of RDDs, which is Spark’s abstraction of an immutable, distributed dataset (see Spark Programming Guide for more details). Each RDD in a DStream contains data from a certain interval, as shown in the following figure.

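The key points are two sentences in that passage: a DStream is a continuous series of RDDs, and each RDD in a DStream contains the data from a certain interval. That's not hard to understand: when we create the StreamingContext we pass in a Seconds value, and that is the interval mentioned above.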

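[Figure from the Spark Streaming guide: a DStream as a continuous series of RDDs, each holding the data of one interval]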
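To make the "one RDD per interval" idea concrete, here is a minimal sketch. It assumes local mode with two threads and a 1-second batch interval, and it uses queueStream only so that no external source is needed. The object name and the describeBatch helper are made up for illustration. foreachRDD hands over the RDD behind each interval, which is also one way to actually see the data on the driver:

```scala
import scala.collection.mutable
import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Seconds, StreamingContext, Time}

object BatchIntervalSketch {
  // Hypothetical helper: formats one batch; batchMillis is the batch time in milliseconds
  def describeBatch(batchMillis: Long, batch: Array[String]): String =
    s"batch at $batchMillis ms: ${batch.mkString(" | ")}"

  def main(args: Array[String]): Unit = {
    // local[2] and Seconds(1) are assumptions for this sketch
    val conf = new SparkConf().setMaster("local[2]").setAppName("BatchIntervalSketch")
    // Seconds(1) is the batch interval: the DStream yields one RDD per second
    val ssc = new StreamingContext(conf, Seconds(1))
    val sc = ssc.sparkContext

    // Each queued RDD becomes the content of one batch interval
    val queue = mutable.Queue[RDD[String]]()
    queue += sc.parallelize(Seq("hello spark", "hello streaming"))
    queue += sc.parallelize(Seq("one more batch"))
    val lines = ssc.queueStream(queue)

    // The function runs on the driver once per interval, with that interval's RDD
    lines.foreachRDD { (rdd: RDD[String], time: Time) =>
      if (!rdd.isEmpty()) {
        // collect() is fine here only because the demo data is tiny
        println(describeBatch(time.milliseconds, rdd.collect()))
      }
    }

    ssc.start()
    ssc.awaitTerminationOrTimeout(5000) // let a few intervals pass
    ssc.stop(stopSparkContext = true, stopGracefully = false)
  }
}
```

Once the queue is drained, the later intervals still produce an RDD each, just an empty one, which is why the sketch checks isEmpty before printing.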

Any operation applied on a DStream translates to operations on the underlying RDDs. For example, in the earlier example of converting a stream of lines to words, the flatMap operation is applied on each RDD in the lines DStream to generate the RDDs of the words DStream. This is shown in the following figure.

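The passage above explains operations on a DStream: every operation on a DStream is translated into operations on its underlying RDDs. The figure below may be hard to read at first because it has so few annotations. My understanding is: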

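[Figure from the Spark Streaming guide: the flatMap operation applied to each RDD of the lines DStream to generate the RDDs of the words DStream]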

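Each dashed outline is one DStream, so there are two DStreams in total. A DStream holds a series of RDDs: RDD (line from 0-1…), and so on. An operation on the DStream (flatMap) is therefore applied uniformly to every RDD inside it, which produces the blue RDDs underneath, and those blue RDDs together make up the words DStream.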
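The same idea as a sketch, under the same assumptions as before (local mode, a queue-backed stream, made-up object and helper names). Calling flatMap on the DStream and calling flatMap on each underlying RDD through transform should give the same words for every batch:

```scala
import scala.collection.mutable
import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Seconds, StreamingContext}

object FlatMapPerRddSketch {
  // Hypothetical helper: the per-element function applied inside every RDD
  def splitWords(line: String): Seq[String] = line.split(" ").toSeq

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("FlatMapPerRddSketch")
    val ssc = new StreamingContext(conf, Seconds(1))

    // A queue-backed stream stands in for a real source
    val queue = mutable.Queue[RDD[String]]()
    queue += ssc.sparkContext.parallelize(Seq("to be or", "not to be"))
    val lines = ssc.queueStream(queue)

    // DStream-level operation: the lines DStream becomes the words DStream
    val words = lines.flatMap(line => splitWords(line))

    // The same thing spelled out on the RDD behind each batch
    val wordsViaTransform =
      lines.transform((rdd: RDD[String]) => rdd.flatMap(line => splitWords(line)))

    // print() shows the first elements of every batch on the driver
    words.print()
    wordsViaTransform.print()

    ssc.start()
    ssc.awaitTerminationOrTimeout(5000)
    ssc.stop(stopSparkContext = true, stopGracefully = false)
  }
}
```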

1.2 Input DStreams and Receivers

Input DStreams are DStreams representing the stream of input data received from streaming sources. In the quick example, lines was an input DStream as it represented the stream of data received from the netcat server. Every input DStream (except file stream, discussed later in this section) is associated with a Receiver (Scala doc, Java doc) object which receives the data from a source and stores it in Spark’s memory for processing.

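Two sentences matter here. An input DStream is the stream of data received from a streaming source, and every input DStream (except a file stream) is associated with a Receiver object, which receives data from the source and stores it in Spark's memory for processing.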

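Note: Spark Streaming provides two categories of built-in streaming sources. Basic sources are available directly from the StreamingContext API, for example file systems and socket connections. Advanced sources such as Kafka and Kinesis are available through extra utility classes.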

Note that, if you want to receive multiple streams of data in parallel in your streaming application, you can create multiple input DStreams (discussed further in the Performance Tuning section). This will create multiple receivers which will simultaneously receive multiple data streams. But note that a Spark worker/executor is a long-running task, hence it occupies one of the cores allocated to the Spark Streaming application. Therefore, it is important to remember that a Spark Streaming application needs to be allocated enough cores (or threads, if running locally) to process the received data, as well as to run the receiver(s).

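This passage is important! To receive several streams of data in parallel, create several input DStreams, and each of them gets its own receiver. Note that a receiver runs as a long-running task and permanently occupies one core, so pay attention to core allocation: the application needs enough cores (or threads, when running locally) to run all the receivers and still process the data they receive.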
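A rough sketch of that setup, assuming two text sources listening on localhost ports 9998 and 9999 (for example two netcat sessions). The ports, the object name and the localMaster helper are all assumptions for illustration. The point is that local mode needs one thread per receiver plus at least one thread for processing:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object MultiReceiverSketch {
  // Hypothetical helper: one thread per receiver plus the threads left for processing
  def localMaster(numReceivers: Int, processingThreads: Int = 1): String =
    s"local[${numReceivers + processingThreads}]"

  def main(args: Array[String]): Unit = {
    val ports = Seq(9998, 9999) // assumed ports, not from the article
    val conf = new SparkConf()
      .setMaster(localMaster(ports.size)) // local[3] for two receivers
      .setAppName("MultiReceiverSketch")
    val ssc = new StreamingContext(conf, Seconds(5))

    // Each socketTextStream is an input DStream with its own receiver
    val streams = ports.map(port => ssc.socketTextStream("localhost", port))

    // Merge the input DStreams into one DStream and process it as usual
    val all = ssc.union(streams)
    all.print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```

With local[2] here, both threads would be taken by the receivers and no batch would ever be processed, which looks exactly like "the data goes nowhere".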

1.3 Introduction to RDD

A DStream is made up of a series of RDDs, so we still need to read the introduction to RDDs.

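Resilient Distributed Datasets (RDDs): an RDD is a fault-tolerant collection of elements that can be operated on in parallel. There are two ways to create one: 1. parallelize an existing collection in the driver program; 2. reference a dataset in an external storage system.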
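The second way can be sketched as follows. Here a temporary local file stands in for the external storage system (HDFS and the like), and the object name and the lineCount helper are made up for illustration. textFile returns an RDD with one element per line of the file:

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.Files
import org.apache.spark.{SparkConf, SparkContext}

object ExternalDatasetSketch {
  // Hypothetical helper: writes `text` to a temp file, reads it back as an RDD
  // of lines and returns how many lines the RDD holds
  def lineCount(sc: SparkContext, text: String): Long = {
    val path = Files.createTempFile("rdd-sketch", ".txt")
    Files.write(path, text.getBytes(StandardCharsets.UTF_8))
    val lines = sc.textFile(path.toUri.toString) // references the file, one element per line
    lines.count()
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("ExternalDatasetSketch")
    val sc = new SparkContext(conf)
    println(lineCount(sc, "first line\nsecond line\n"))
    sc.stop()
  }
}
```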

1.3.1 Parallelizing a collection

Parallelizing a collection is simple: just call the SparkContext's parallelize method. The code looks like this:

// sc is the SparkContext (predefined in spark-shell)
val data = Array(1, 2, 3, 4, 5)
val distData = sc.parallelize(data) // distribute the local array as an RDD

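After sc.parallelize(data), the elements are distributed across the nodes, and an operation on the RDD is the same operation applied to the data on every node. parallelize takes an optional second argument, the number of partitions. If it is omitted, Spark chooses the number automatically, and we can also set it by hand: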

sc.parallelize(data, 10)
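To check what the partition argument actually does, here is a small sketch, assuming local mode with two threads (object and helper names are made up). getNumPartitions reports how many partitions an RDD has, and glom() gathers each partition into an array so the spread of elements becomes visible:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object PartitionSketch {
  // Hypothetical helper: (automatic partition count, explicit partition count) for the same data
  def partitionCounts(sc: SparkContext): (Int, Int) = {
    val data = Array(1, 2, 3, 4, 5)
    val auto = sc.parallelize(data)       // Spark picks the number of partitions
    val manual = sc.parallelize(data, 10) // we ask for 10 partitions
    (auto.getNumPartitions, manual.getNumPartitions)
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("PartitionSketch")
    val sc = new SparkContext(conf)

    val (auto, manual) = partitionCounts(sc)
    println(s"automatic: $auto partitions, explicit: $manual partitions")

    // With 5 elements in 10 partitions, some partitions are necessarily empty
    sc.parallelize(Array(1, 2, 3, 4, 5), 10)
      .glom()
      .collect()
      .foreach(partition => println(partition.mkString("[", ", ", "]")))

    sc.stop()
  }
}
```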

1.3.2 RDD operations

RDDs support two types of operations:

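transformations: these create a new dataset from an existing one. All transformations in Spark are lazy: calling a transformation only records it on the RDD, and nothing is computed until an action is called. The action then triggers the recorded transformations first, runs on their output, and returns the result to the driver.
actions: these run a computation on the dataset and return the result to the driver.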
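Laziness is easy to observe with a counter. The sketch below is only an illustration under assumptions (local mode, no task retries, made-up object and method names): it counts how often the map function has run, once before and once after the action, using a LongAccumulator:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object LazySketch {
  // Hypothetical helper: returns (map calls seen before the action,
  // map calls seen after the action, result of the action)
  def run(sc: SparkContext): (Long, Long, Int) = {
    val calls = sc.longAccumulator("map calls")
    val lines = sc.parallelize(Seq("spark", "rdd", "dstream"))

    // Transformation: only recorded, the function has not run for any element yet
    val lengths = lines.map { s => calls.add(1); s.length }
    val before = calls.sum

    // Action: triggers the recorded map, then returns the sum to the driver
    val total = lengths.reduce(_ + _)
    val after = calls.sum

    (before, after, total)
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("LazySketch")
    val sc = new SparkContext(conf)
    val (before, after, total) = run(sc)
    println(s"map calls before action: $before, after action: $after, total length: $total")
    sc.stop()
  }
}
```

The same rule carries over to DStreams: transformations on a DStream do nothing until an output operation such as print() or foreachRDD is registered on it.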

1.4 Summary

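To sum up: an RDD is a distributed dataset, and an operation on an RDD is carried out uniformly on all nodes. A DStream contains a series of RDDs, which is what lets it handle streaming data, whereas an RDD on its own only holds a fixed dataset to be processed in parallel. The data that arrives within one interval is packed into one RDD, and that RDD runs the same parallel operation over the data on multiple nodes.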

