Spark Source Code Analysis: DStream

木东居士

Article link: https://blog.csdn.net/zhaodedong/article/details/73649905

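This article takes a close look at DStream, the core concept of Spark Streaming. It covers the definition of DStream and how it differs from other real-time processing systems, then examines its dependencies and RDD generation mechanism at the source level. A concrete example shows how a DStream is processed through transformations and RDD generation. The goal is to give readers an overall grasp of Spark Streaming.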

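0x00 Preface

This is the second post in the Spark source code analysis series. It walks through the source code of DStream, the most important concept in the design of Spark Streaming.

This post analyzes DStream in Spark Streaming. Its importance needs no elaboration: once you understand these few core Spark data structures, it becomes much easier to grasp Spark as a whole.

As in the RDD post, although the subject is DStream, the whole article is built around one concrete example, so it also serves as an overview of the Spark Streaming source code.

Article structure

Some Spark Streaming concepts, mainly those related to DStream
The overall design of DStream
An in-depth walkthrough of a concrete example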

0x01 Concepts

What is Spark Streaming

Scalable, high-throughput, fault-tolerant stream processing of live data streams!

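It is a real-time system, or more precisely a near-real-time system. I will not go into further detail here.

One point worth noting: every Streaming job is eventually converted into Spark jobs and executed by the Spark engine.

DStream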

It represents a continuous stream of data, either the input data stream received from source, or the processed data stream generated by transforming the input stream.

An RDD is defined as a read-only, partitioned collection of records, and a DStream is a template for RDDs, so we can also regard a DStream as a dataset.

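My own simple understanding is that a DStream is a data structure that wraps one more layer around RDDs. The official documentation illustrates this with a figure showing a DStream as a continuous series of RDDs, each holding the data from one time interval.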
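To make the "template for RDDs" idea concrete, here is a minimal sketch in plain Scala. It does not use Spark: `ToyRDD` and `ToyDStream` are hypothetical names introduced only for illustration, with an immutable `Seq` standing in for an RDD and a function from batch time to dataset standing in for a DStream.

```scala
// Toy model, not Spark's API: an "RDD" is just an immutable Seq, and a
// "DStream" is a recipe that yields one such dataset per batch time.
object DStreamAsTemplate {
  type ToyRDD[T] = Seq[T]

  // Hypothetical stand-in for a DStream: batch time (ms) -> dataset of that batch.
  final case class ToyDStream[T](rddAt: Long => ToyRDD[T]) {
    // A transformation on the stream is applied to every per-batch dataset.
    def map[U](f: T => U): ToyDStream[U] = ToyDStream(t => rddAt(t).map(f))
  }

  def main(args: Array[String]): Unit = {
    // Pretend the batch at time t received the two lines below (made-up data).
    val lines = ToyDStream[String](t => Seq(s"batch $t", "hello spark"))
    val upper = lines.map(_.toUpperCase)

    // The same template yields a different concrete dataset for each batch time.
    println(upper.rddAt(1000L))
    println(upper.rddAt(2000L))
  }
}
```

The point of the sketch is that operations are declared once on the stream and then replayed against each batch's dataset, which is roughly the relationship between a DStream and the RDDs it generates.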

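How Spark Streaming differs from other real-time processing systems

The following comes from the paper by Spark's authors. It is well written, so I quote it directly and excerpt only the points I care about.

We divide real-time processing frameworks into two kinds: the record-at-a-time model and the D-Stream processing model.

(Figure: the record-at-a-time model)

(Figure: the D-Stream processing model)

The differences between the two: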

Record-at-a-time processing model. Each node continuously receives records, updates internal state, and sends new records. Fault tolerance is typically achieved through replication, using a synchronization protocol like Flux or DPC to ensure that replicas of each node see records in the same order (e.g., when they have multiple parent nodes).

D-Stream processing model. In each time interval, the records that arrive are stored reliably across the cluster to form an immutable, partitioned dataset. This is then processed via deterministic parallel operations to compute other distributed datasets that represent program output or state to pass to the next interval. Each series of datasets forms one D-Stream.
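The D-Stream model described above can be sketched in a few lines of plain Scala. This is an illustration under stated assumptions, not how Spark implements it: `Record`, `discretize`, and `wordCount` are hypothetical names, arrival times are plain milliseconds, and an in-memory `Seq` stands in for a reliably stored, partitioned dataset.

```scala
object MicroBatchSketch {
  // Hypothetical record type: arrival time in milliseconds plus a payload.
  final case class Record(arrivalMs: Long, word: String)

  // Discretize the stream: records arriving in
  // [k * intervalMs, (k + 1) * intervalMs) form the immutable batch k.
  def discretize(records: Seq[Record], intervalMs: Long): Map[Long, Seq[Record]] =
    records.groupBy(r => r.arrivalMs / intervalMs)

  // A deterministic operation applied independently to one batch.
  def wordCount(batch: Seq[Record]): Map[String, Int] =
    batch.groupBy(_.word).map { case (w, rs) => w -> rs.size }

  def main(args: Array[String]): Unit = {
    val records = Seq(
      Record(100L, "a"), Record(900L, "b"),
      Record(1200L, "a"), Record(1900L, "a")
    )
    val batches = discretize(records, intervalMs = 1000L)
    batches.toSeq.sortBy(_._1).foreach { case (k, batch) =>
      println(s"interval $k -> ${wordCount(batch)}")
    }
  }
}
```

Because each batch is immutable and the operation is deterministic, the result for an interval can be recomputed from the stored batch, rather than being rebuilt from the mutable state of a long-running node.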

The problem with record-at-a-time:

In a record-at-a-time system, the major recovery challenge is rebuilding the state of a lost, or slow, node.

0x02 Source Code Analysis

DStream

A DStream internally is characterized by a few basic properties:

A list of other DStreams that the DStream depends on

A time interval at which the DStream generates an RDD

A function that is used to generate an RDD after each time interval

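Three parts of the DStream data structure matter most:

Parent dependencies
The time interval at which RDDs are generated
A function that generates an RDD

These map onto the code as shown below. Each member is implemented by concrete subclasses, which we will see later in the analysis. For now, let's work through the example step by step.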

abstract class DStream[T: ClassTag] (
    @transient private[streaming] var ssc: StreamingContext
  ) extends Serializable with Logging {

  /** Time interval after which the DStream generates an RDD */
  def slideDuration: Duration

  /** List of parent DStreams on which this DStream depends on */
  def dependencies: List[DStream[_]]

  /** Method that generates an RDD for the given time */
  def compute(validTime: Time): Option[RDD[T]]

  // RDDs generated, marked as private[streaming] so that testsuites can access it
  @transient
  private[streaming] var generatedRDDs = new HashMap[Time, RDD[T]]()

  // Reference to whole DStream graph
  private[streaming] var graph: DStreamGraph = null
}
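To see how subclasses might fill in these three members, here is a simplified model in plain Scala with no Spark dependency. Everything in it is an assumption made for illustration: `ToyDStream`, `ToyInputDStream`, and `ToyMappedDStream` are hypothetical classes, time is a plain `Long` in milliseconds, an immutable `Seq` stands in for an RDD, and the validity check (batch time must be a multiple of the interval) is deliberately cruder than what Spark does.

```scala
import scala.collection.mutable

object ToyDStreamModel {
  // Simplified stand-ins (assumptions): time is milliseconds, an "RDD" is a Seq.
  type Time = Long
  type ToyRDD[T] = Seq[T]

  abstract class ToyDStream[T] {
    def slideDuration: Long                          // interval between datasets
    def dependencies: List[ToyDStream[_]]            // parent streams
    def compute(validTime: Time): Option[ToyRDD[T]]  // builds one batch's dataset

    // Cache of datasets already generated, keyed by batch time.
    private val generated = mutable.HashMap.empty[Time, ToyRDD[T]]

    // Return the cached dataset for `time`, or compute it and remember it.
    def getOrCompute(time: Time): Option[ToyRDD[T]] =
      generated.get(time).orElse {
        val rdd = if (time % slideDuration == 0) compute(time) else None
        rdd.foreach(r => generated.put(time, r))
        rdd
      }
  }

  // An input stream has no parents; `source` is a hypothetical data feed.
  class ToyInputDStream[T](val slideDuration: Long, source: Time => Seq[T])
      extends ToyDStream[T] {
    def dependencies: List[ToyDStream[_]] = Nil
    def compute(validTime: Time): Option[ToyRDD[T]] = Some(source(validTime))
  }

  // A transformed stream depends on one parent and reuses its interval.
  class ToyMappedDStream[T, U](parent: ToyDStream[T], f: T => U)
      extends ToyDStream[U] {
    def slideDuration: Long = parent.slideDuration
    def dependencies: List[ToyDStream[_]] = List(parent)
    def compute(validTime: Time): Option[ToyRDD[U]] =
      parent.getOrCompute(validTime).map(_.map(f))
  }

  def main(args: Array[String]): Unit = {
    val input = new ToyInputDStream[String](1000L, t => Seq(s"line@$t"))
    val upper = new ToyMappedDStream[String, String](input, _.toUpperCase)
    println(upper.getOrCompute(2000L)) // a valid batch time
    println(upper.getOrCompute(2500L)) // not a multiple of the interval
  }
}
```

In this model, asking the last stream in the chain for a batch pulls the request up through `dependencies` until it reaches an input stream, which mirrors the way a chain of transformations produces one dataset per interval.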

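An example

This is the most basic word count example from the official documentation, similar to the Spark one. It is simple but highly representative.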

val conf = new SparkConf().setMaster("local[2]").setAppName("NetworkWordCount")
val ssc = new StreamingContext(conf, Second
