DStream

daladongba

Category: Spark

Article link: https://blog.csdn.net/a15512138486/article/details/106602928

DStream

- 1. What is a DStream
- 2. Advanced DStream operators

1. What is a DStream

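A discretized stream, or DStream, is the basic abstraction provided by Spark Streaming. It represents a continuous stream of data: either the input stream received from a source, or the processed stream produced by transforming an input stream. Internally, a DStream is represented by a continuous series of RDDs, where an RDD is Spark's abstraction of an immutable, distributed dataset. Each RDD in a DStream contains the data from a certain interval.
[Figure: a DStream as a sequence of RDDs, each holding the data from one time interval]
Any operation applied to a DStream is translated into operations on the underlying RDDs.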
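
The relationship between a DStream and its RDDs can be pictured with a small plain-Java sketch that has no Spark dependency. `MiniDStream` is a hypothetical class, not part of Spark: it holds one list per batch interval, each list standing in for one RDD. Its `map` simply applies the function to every batch, which is the sense in which an operation on the stream becomes an operation on the underlying RDDs.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;
import java.util.stream.Collectors;

// Hypothetical stand-in for a DStream: one List per batch interval plays the role of one RDD.
class MiniDStream<T> {
    final List<List<T>> batches;

    MiniDStream(List<List<T>> batches) {
        this.batches = batches;
    }

    // An operation on the stream is carried out batch by batch.
    <R> MiniDStream<R> map(Function<T, R> f) {
        List<List<R>> out = new ArrayList<>();
        for (List<T> batch : batches) {
            out.add(batch.stream().map(f).collect(Collectors.toList()));
        }
        return new MiniDStream<>(out);
    }

    public static void main(String[] args) {
        MiniDStream<String> lines = new MiniDStream<>(Arrays.asList(
                Arrays.asList("a", "bb"),   // data received in interval 0
                Arrays.asList("ccc")));     // data received in interval 1
        System.out.println(lines.map(String::length).batches); // [[1, 2], [3]]
    }
}
```

The real Spark classes distribute each batch across a cluster and evaluate lazily; the sketch only mirrors the per-interval structure.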

2. Advanced DStream operators

updateStateByKey operation

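// Note: updateStateByKey requires a checkpoint directory to be configured on the streaming context.
JavaPairDStream<String, Long> aggregatedDStream =
    dateProvinceCityAdidDStream.updateStateByKey(
        new Function2<List<Long>, Optional<Long>, Optional<Long>>() {
            // v1 holds the values received for a given key in the current batch;
            // v2 is an Optional holding the previously stored state for that key, if any.
            @Override
            public Optional<Long> call(List<Long> v1, Optional<Long> v2) throws Exception {
                // Start from the previous count, or 0 if this key has no state yet.
                Long clickCount = 0L;
                if (v2.isPresent()) {
                    clickCount = v2.get();
                }
                // Add the values that arrived in this batch.
                for (Long i : v1) {
                    clickCount += i;
                }
                return Optional.of(clickCount);
            }
        });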
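
To see what the update function does across several batches, here is a plain-Java simulation that needs no Spark installation. `StateSim`, `update` and `applyBatch` are hypothetical names; `java.util.Optional` stands in for Spark's `Optional`, and a `HashMap` stands in for the per-key state that Spark maintains. One simplification to be aware of: Spark invokes the update function for every key that already has state, even when the batch brings no new values for it, whereas this sketch only visits keys present in the batch. For a running sum the result is the same.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

class StateSim {
    // Same logic as the update function above: previous state (if any) plus this batch's values.
    static Optional<Long> update(List<Long> newValues, Optional<Long> state) {
        long count = state.orElse(0L);
        for (Long v : newValues) {
            count += v;
        }
        return Optional.of(count);
    }

    // Feed one batch (key -> values seen in this batch) through the update function.
    static void applyBatch(Map<String, Long> state, Map<String, List<Long>> batch) {
        for (Map.Entry<String, List<Long>> e : batch.entrySet()) {
            Optional<Long> previous = Optional.ofNullable(state.get(e.getKey()));
            update(e.getValue(), previous).ifPresent(v -> state.put(e.getKey(), v));
        }
    }

    public static void main(String[] args) {
        Map<String, Long> state = new HashMap<>();
        Map<String, List<Long>> batch1 = new HashMap<>();
        batch1.put("ad1", Arrays.asList(1L, 1L));
        Map<String, List<Long>> batch2 = new HashMap<>();
        batch2.put("ad1", Arrays.asList(1L));
        batch2.put("ad2", Arrays.asList(1L, 1L, 1L));
        applyBatch(state, batch1);
        applyBatch(state, batch2);
        System.out.println(state.get("ad1") + " " + state.get("ad2")); // prints "3 3"
    }
}
```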

transform operation
The transform operation (and its variants such as transformWith) lets you apply an arbitrary RDD-to-RDD function to a DStream.

JavaPairDStream<String, Integer> cleanedDStream = wordCounts.transform(rdd -> {
  rdd.join(spamInfoRDD).filter(...); // join data stream with spam information to do data cleaning
  ...
});
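
The snippet above leaves the filter condition elided. As an illustration of the general idea only, and not a reconstruction of the elided code, the plain-Java sketch below imitates a per-batch cleaning function without any Spark dependency. `TransformSim` and `cleaner` are hypothetical names; a `Map<String, Integer>` stands in for one batch of word counts, and a `Map<String, Boolean>` stands in for the spam information. Unlike an inner join, the lookup here keeps words that are absent from the spam table.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

class TransformSim {
    // Hypothetical per-batch cleaning function: look each word up in the spam table
    // and keep only the entries that are not flagged as spam.
    static Function<Map<String, Integer>, Map<String, Integer>> cleaner(Map<String, Boolean> spamInfo) {
        return batch -> {
            Map<String, Integer> cleaned = new LinkedHashMap<>();
            for (Map.Entry<String, Integer> e : batch.entrySet()) {
                if (!spamInfo.getOrDefault(e.getKey(), false)) {
                    cleaned.put(e.getKey(), e.getValue());
                }
            }
            return cleaned;
        };
    }

    public static void main(String[] args) {
        Map<String, Boolean> spamInfo = new LinkedHashMap<>();
        spamInfo.put("buy-now", true);
        Map<String, Integer> batch = new LinkedHashMap<>();
        batch.put("spark", 3);
        batch.put("buy-now", 7);
        System.out.println(cleaner(spamInfo).apply(batch)); // {spark=3}
    }
}
```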

window operation
- It lets you apply transformations over a sliding window of data.

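Transformation	Meaning
window(windowLength, slideInterval)	Returns a new DStream computed from windowed batches of the source DStream.
countByWindow(windowLength, slideInterval)	Returns a sliding-window count of the elements in the stream.
reduceByWindow(func, windowLength, slideInterval)	Returns a new single-element stream, created by aggregating the elements of the stream over a sliding interval using func. The function should be associative and commutative so that it can be computed correctly in parallel.
reduceByKeyAndWindow(func, windowLength, slideInterval, [numTasks])	When called on a DStream of (K, V) pairs, returns a new DStream of (K, V) pairs in which the values for each key are aggregated with the given reduce function func over the batches in a sliding window. Note: by default this uses Spark's default number of parallel tasks (2 in local mode; in cluster mode the number is determined by `spark.default.parallelism`) to do the grouping. You can pass the optional `numTasks` argument to set a different number of tasks.
reduceByKeyAndWindow(func, invFunc, windowLength, slideInterval, [numTasks])	A more efficient version of the `reduceByKeyAndWindow()` above, in which the reduce value of each window is computed incrementally from the reduce value of the previous window. This is done by reducing the new data that enters the sliding window and "inverse reducing" the old data that leaves it. An example is adding and subtracting counts of keys as the window slides. However, it applies only to invertible reduce functions, that is, reduce functions that have a corresponding inverse reduce function (passed as the parameter invFunc). As with `reduceByKeyAndWindow`, the number of reduce tasks is configurable through an optional argument. Note that checkpointing must be enabled to use this operation.
countByValueAndWindow(windowLength, slideInterval, [numTasks])	When called on a DStream whose elements are of type K, returns a new DStream of (K, Long) pairs in which the value for each key is the number of times that element occurs within the sliding window. As with `reduceByKeyAndWindow`, the number of reduce tasks is configurable through an optional argument.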
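
The difference between the two `reduceByKeyAndWindow` variants is easier to see on plain numbers. The sketch below is plain Java with no Spark dependency. `WindowSim`, `naive` and `incremental` are hypothetical names, each batch is assumed to be already reduced to a single sum for one key, and the window length is counted in batches with a slide of one batch. `naive` re-reduces the whole window at every slide, while `incremental` adds the batch entering the window and subtracts the one leaving it, which is the idea behind the `invFunc` version.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class WindowSim {
    // Recompute the sum over the last `windowLength` batches from scratch at every slide.
    static List<Long> naive(List<Long> batchSums, int windowLength) {
        List<Long> out = new ArrayList<>();
        for (int end = 0; end < batchSums.size(); end++) {
            long sum = 0;
            for (int i = Math.max(0, end - windowLength + 1); i <= end; i++) {
                sum += batchSums.get(i);
            }
            out.add(sum);
        }
        return out;
    }

    // Reuse the previous window's sum: add the entering batch (func),
    // subtract the leaving batch (invFunc).
    static List<Long> incremental(List<Long> batchSums, int windowLength) {
        List<Long> out = new ArrayList<>();
        long sum = 0;
        for (int end = 0; end < batchSums.size(); end++) {
            sum += batchSums.get(end);
            if (end >= windowLength) {
                sum -= batchSums.get(end - windowLength);
            }
            out.add(sum);
        }
        return out;
    }

    public static void main(String[] args) {
        List<Long> batchSums = Arrays.asList(1L, 2L, 3L, 4L, 5L);
        System.out.println(naive(batchSums, 3));       // [1, 3, 6, 9, 12]
        System.out.println(incremental(batchSums, 3)); // [1, 3, 6, 9, 12]
    }
}
```

The incremental version does a constant amount of work per slide regardless of window length, but it only works because addition has an inverse; a reduce function such as max has none, so it would need the first variant.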

DStream output operations: foreachRDD(), print()

dstream.foreachRDD(rdd -> {
  rdd.foreachPartition(partitionOfRecords -> {
    // ConnectionPool is a static, lazily initialized pool of connections
    Connection connection = ConnectionPool.getConnection();
    while (partitionOfRecords.hasNext()) {
      connection.send(partitionOfRecords.next());
    }
    ConnectionPool.returnConnection(connection); // return to the pool for future reuse
  });
});
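
The snippet above relies on a `ConnectionPool` that it does not define. One way such a pool might look is sketched below in plain Java with no Spark dependency. Everything here is an assumption made for illustration: `PoolSketch`, the nested `Connection` and `ConnectionPool` classes, the `CREATED` counter and `sendPartition` are hypothetical, and the "connection" only records what it was asked to send. The sketch returns the connection in a `finally` block so that it goes back to the pool even if a send fails.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.atomic.AtomicInteger;

class PoolSketch {
    // Hypothetical connection that just records what was sent.
    static class Connection {
        final List<String> sent = new ArrayList<>();

        void send(String record) {
            sent.add(record);
        }
    }

    // Static pool, filled lazily: a connection is created only when none is idle.
    static class ConnectionPool {
        private static final ConcurrentLinkedQueue<Connection> IDLE = new ConcurrentLinkedQueue<>();
        static final AtomicInteger CREATED = new AtomicInteger();

        static Connection getConnection() {
            Connection c = IDLE.poll();
            if (c == null) {
                CREATED.incrementAndGet();
                c = new Connection();
            }
            return c;
        }

        static void returnConnection(Connection c) {
            IDLE.offer(c);
        }
    }

    // Same shape as the foreachPartition body: one connection per partition, not per record.
    static void sendPartition(Iterator<String> partitionOfRecords) {
        Connection connection = ConnectionPool.getConnection();
        try {
            while (partitionOfRecords.hasNext()) {
                connection.send(partitionOfRecords.next());
            }
        } finally {
            ConnectionPool.returnConnection(connection);
        }
    }

    public static void main(String[] args) {
        sendPartition(Arrays.asList("a", "b").iterator());
        sendPartition(Arrays.asList("c").iterator());
        // prints 1: the second call reused the connection created by the first
        System.out.println(ConnectionPool.CREATED.get());
    }
}
```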