Spark Streaming Core

  • Core concepts
  • Transformations
  • Output Operations

Core Concepts

  • StreamingContext

To initialize a Spark Streaming program, a StreamingContext object has to be created which is the main entry point of all Spark Streaming functionality.


import org.apache.spark._
import org.apache.spark.streaming._

val conf = new SparkConf().setAppName(appName).setMaster(master)
val ssc = new StreamingContext(conf, Seconds(1))

// The relevant auxiliary constructors in the StreamingContext source:
def this(sparkContext: SparkContext, batchDuration: Duration) = {
  this(sparkContext, null, batchDuration)
}

def this(conf: SparkConf, batchDuration: Duration) = {
  this(StreamingContext.createNewSparkContext(conf), null, batchDuration)
}

The appName parameter is a name for your application to show on the cluster UI. master is a Spark, Mesos, Kubernetes or YARN cluster URL, or a special “local[*]” string to run in local mode. In practice, when running on a cluster, you will not want to hardcode master in the program, but rather launch the application with spark-submit and receive it there.

The batch interval must be set based on the latency requirements of your application and the available cluster resources.

import org.apache.spark.streaming._

val sc = ...                // existing SparkContext
val ssc = new StreamingContext(sc, Seconds(1))

Once the StreamingContext is defined, you have to do the following:

  • Define the input sources by creating input DStreams.

  • Define the streaming computations by applying transformation and
    output operations to DStreams.

  • Start receiving data and processing it using
    streamingContext.start().

  • Wait for the processing to be stopped (manually or due to
    any error) using streamingContext.awaitTermination().

  • The processing can be manually stopped using streamingContext.stop().
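Putting the steps above together, here is a minimal sketch of a complete program. The object name, hostname, port, and the word-count logic are illustrative assumptions only, not part of the original text.

import org.apache.spark._
import org.apache.spark.streaming._

object NetworkWordCountSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("NetworkWordCountSketch").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(1))

    // 1. Define the input source by creating an input DStream (assumed netcat server on localhost:9999).
    val lines = ssc.socketTextStream("localhost", 9999)

    // 2. Define the streaming computation by applying transformations ...
    val wordCounts = lines.flatMap(_.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)

    // ... and an output operation.
    wordCounts.print()

    // 3. Start receiving data and processing it.
    ssc.start()

    // 4. Wait for the processing to be stopped (manually or due to any error).
    ssc.awaitTermination()
  }
}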

Points to remember:

  • Once a context has been started, no new streaming computations can be set up or added to it.
  • Once a context has been stopped, it cannot be restarted.
  • Only one StreamingContext can be active in a JVM at the same time.
  • stop() on a StreamingContext also stops the SparkContext. To stop only the StreamingContext, set the optional parameter of stop() called stopSparkContext to false.
  • A SparkContext can be re-used to create multiple StreamingContexts, as long as the previous StreamingContext is stopped (without stopping the SparkContext) before the next one is created.
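As an illustration of the last two points, the sketch below (assuming an existing SparkContext named sc) stops only the StreamingContext and then re-uses the same SparkContext for a new one.

import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc1 = new StreamingContext(sc, Seconds(1))
// ... define DStreams and output operations, then ssc1.start() ...
ssc1.stop(stopSparkContext = false)   // stop only the StreamingContext; sc stays alive

val ssc2 = new StreamingContext(sc, Seconds(5))   // the same SparkContext is re-used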

DStreams

A Discretized Stream (DStream) is the basic abstraction provided by Spark Streaming. It represents a continuous stream of data, either the input data stream received from a source or the stream produced by transforming another DStream.

[Figure: a DStream represented as a continuous series of RDDs, one per batch interval]

Internally, a DStream is represented by a continuous series of RDDs, which is Spark’s abstraction of an immutable, distributed dataset (see the Spark Programming Guide for more details). Each RDD in a DStream contains data from a certain interval, as shown in the figure above.

Any operator applied to a DStream, such as map or flatMap, is translated under the hood into the same operation on every RDD in that DStream, because a DStream is made up of the RDDs of its successive batches.
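For example (an illustrative sketch, assuming lines is an existing DStream[String]), the two lines below are equivalent; map on the DStream is applied to every underlying RDD, which transform makes explicit.

val upper1 = lines.map(_.toUpperCase)                          // operates on the DStream
val upper2 = lines.transform(rdd => rdd.map(_.toUpperCase))    // the same map, written against each RDD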

Input DStreams and Receivers

Input DStreams are DStreams representing the stream of input data received from streaming sources. In the quick example, lines was an input DStream as it represented the stream of data received from the netcat server. Every input DStream (except file stream, discussed later in this section) is associated with a Receiver (Scala doc, Java doc) object which receives the data from a source and stores it in Spark’s memory for processing.

The methods that return DStreams for file systems and for live data streams are different, as the StreamingContext source below shows:

// Socket sources return a ReceiverInputDStream backed by a Receiver.
def socketStream[T: ClassTag](
    hostname: String,
    port: Int,
    converter: (InputStream) => Iterator[T],
    storageLevel: StorageLevel
  ): ReceiverInputDStream[T] = {
  new SocketInputDStream[T](this, hostname, port, converter, storageLevel)
}

// File sources monitor a directory for new files and do not need a Receiver.
def textFileStream(directory: String): DStream[String] = withNamedScope("text file stream") {
  fileStream[LongWritable, Text, TextInputFormat](directory).map(_._2.toString)
}
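For instance, a usage sketch of the two styles of input DStream (the hostname, port, and directory are placeholders, assuming an active StreamingContext named ssc):

// Receiver-based source: a socket text stream occupies one core for its Receiver.
val socketLines = ssc.socketTextStream("localhost", 9999)

// Receiver-less source: textFileStream monitors a directory for newly created files.
val fileLines = ssc.textFileStream("hdfs://namenode:8020/streaming/input")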

Transformations and Output Operations

Similar to RDDs, transformations allow the data from the input DStream to be modified. DStreams support many of the transformations available on normal Spark RDDs. Some of the common ones are as follows.

  • map(func): Return a new DStream by passing each element of the source DStream through a function func.
  • flatMap(func): Similar to map, but each input item can be mapped to 0 or more output items.
  • filter(func): Return a new DStream by selecting only the records of the source DStream on which func returns true.
  • repartition(numPartitions): Changes the level of parallelism in this DStream by creating more or fewer partitions.
  • union(otherStream): Return a new DStream that contains the union of the elements in the source DStream and otherDStream.
  • count(): Return a new DStream of single-element RDDs by counting the number of elements in each RDD of the source DStream.
  • reduce(func): Return a new DStream of single-element RDDs by aggregating the elements in each RDD of the source DStream using a function func (which takes two arguments and returns one). The function should be associative and commutative so that it can be computed in parallel.
  • countByValue(): When called on a DStream of elements of type K, return a new DStream of (K, Long) pairs where the value of each key is its frequency in each RDD of the source DStream.
  • reduceByKey(func, [numTasks]): When called on a DStream of (K, V) pairs, return a new DStream of (K, V) pairs where the values for each key are aggregated using the given reduce function. Note: By default, this uses Spark’s default number of parallel tasks (2 for local mode, and in cluster mode the number is determined by the config property spark.default.parallelism) to do the grouping. You can pass an optional numTasks argument to set a different number of tasks.
  • join(otherStream, [numTasks]): When called on two DStreams of (K, V) and (K, W) pairs, return a new DStream of (K, (V, W)) pairs with all pairs of elements for each key.
  • cogroup(otherStream, [numTasks]): When called on a DStream of (K, V) and (K, W) pairs, return a new DStream of (K, Seq[V], Seq[W]) tuples.
  • transform(func): Return a new DStream by applying an RDD-to-RDD function to every RDD of the source DStream. This can be used to do arbitrary RDD operations on the DStream.
  • updateStateByKey(func): Return a new “state” DStream where the state for each key is updated by applying the given function on the previous state of the key and the new values for the key. This can be used to maintain arbitrary state data for each key.
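As one example, a minimal sketch of updateStateByKey, assuming pairs is a DStream[(String, Int)] and that a checkpoint directory has been set with ssc.checkpoint(...), which stateful transformations require:

val runningCounts = pairs.updateStateByKey[Int] { (newValues: Seq[Int], state: Option[Int]) =>
  Some(newValues.sum + state.getOrElse(0))   // add this batch's counts to the running total
}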

Output operations allow a DStream’s data to be pushed out to external systems like a database or a file system.
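A few common output operations, sketched under the assumption that wordCounts is a DStream[(String, Int)] and that the output path is only a placeholder:

wordCounts.print()                                                     // print the first ten elements of every batch on the driver
wordCounts.saveAsTextFiles("hdfs://namenode:8020/streaming/counts")   // one output directory per batch interval

wordCounts.foreachRDD { rdd =>
  rdd.foreachPartition { partition =>
    // a real sink would open one connection per partition; records are simply printed here as a stand-in
    partition.foreach(record => println(record))
  }
}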
