Spark Streaming Getting Started (Part 2)

Input DStreams and Receivers

Input DStreams are DStreams representing the stream of input data received from streaming sources. In the quick example, lines was an input DStream as it represented the stream of data received from the netcat server. Every input DStream (except file stream, discussed later in this section) is associated with a Receiver (Scala doc / Java doc) object which receives the data from a source and stores it in Spark’s memory for processing.

【Translation】Input DStreams are DStreams representing streams of input data received from streaming sources. In the earlier example, lines was an input DStream because it represented the stream of data received from the netcat server. Every input DStream (except file streams, covered later in this section) is associated with a Receiver object, which receives the data from the source and stores it in Spark's memory for processing.
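To make this concrete, here is a minimal, self-contained sketch in the spirit of the quick example; the host, port and application name are placeholders, and the word count is only there to show some processing happening on the received data.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object InputDStreamSketch {
  def main(args: Array[String]): Unit = {
    // local[2]: one thread runs the receiver, one is left free to process the data
    val conf = new SparkConf().setMaster("local[2]").setAppName("InputDStreamSketch")
    val ssc = new StreamingContext(conf, Seconds(1))

    // lines is an input DStream; behind it a Receiver runs as a long-running task,
    // pulls data from the netcat server and stores it in Spark's memory.
    val lines = ssc.socketTextStream("localhost", 9999)
    lines.flatMap(_.split(" ")).map(word => (word, 1)).reduceByKey(_ + _).print()

    ssc.start()
    ssc.awaitTermination()
  }
}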

Spark Streaming provides two categories of built-in streaming sources.

· Basic sources: Sources directly available in the StreamingContext API. Examples: file systems, socket connections, and Akka actors.

· Advanced sources: Sources like Kafka, Flume, Kinesis, Twitter, etc. are available through extra utility classes. These require linking against extra dependencies as discussed in the linking section.

【Translation】Spark Streaming provides two categories of built-in streaming sources.

      1. Basic sources: sources available directly in the StreamingContext API, for example file systems, socket connections, and Akka actors. 【My understanding】These are sources the program can read directly, pulling the data into memory and feeding it into the computation.

      2. Advanced sources: sources such as Kafka, Flume, Kinesis and Twitter are made available through extra utility classes; the extra dependencies they require are discussed in the linking section. 【My understanding】In other words, using these advanced sources needs additional dependency packages (a short sketch contrasting the two categories follows this list).
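As a hedged sketch of the difference, assuming an existing StreamingContext named ssc: a basic source comes straight from the StreamingContext API, while an advanced source goes through a utility class shipped in a separate artifact. The directory path, ZooKeeper address, consumer group and topic name below are placeholders.

import org.apache.spark.streaming.kafka.KafkaUtils

// Basic source: created directly from the StreamingContext API, no extra jars needed.
val fileLines = ssc.textFileStream("hdfs://namenode:8020/streaming/in")

// Advanced source: created through a utility class from a separate artifact
// (spark-streaming-kafka_2.10 must be on the classpath).
val kafkaLines = KafkaUtils.createStream(
  ssc,                    // the StreamingContext
  "zk-host:2181",         // ZooKeeper quorum (placeholder)
  "my-consumer-group",    // consumer group id (placeholder)
  Map("my-topic" -> 1))   // topic -> number of receiver threads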

We are going to discuss some of the sources present in each category later in this section.

Note that, if you want to receive multiple streams of data in parallel in your streaming application, you can create multiple input DStreams (discussed further in the Performance Tuning section). This will create multiple receivers which will simultaneously receive multiple data streams. But note that a Spark worker/executor is a long-running task, hence it occupies one of the cores allocated to the Spark Streaming application. Therefore, it is important to remember that a Spark Streaming application needs to be allocated enough cores (or threads, if running locally) to process the received data, as well as to run the receiver(s).

【Translation】We will discuss some of the sources in each category later in this section.

Note that if you want to receive multiple streams of data in parallel in your streaming application, you can create multiple input DStreams (discussed further in the Performance Tuning section). This creates multiple receivers, which receive multiple data streams simultaneously. But note that a Spark worker/executor is a long-running task, so it occupies one of the cores allocated to the Spark Streaming application. Therefore, remember that a Spark Streaming application needs to be allocated enough cores (or threads, if running locally) to process the received data as well as to run the receiver(s).
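A small sketch of this, with the two ports as placeholders: two socket receivers tie up two threads of their own, so when running locally the master needs at least local[3] to leave a thread free for processing.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setMaster("local[3]").setAppName("MultiReceiverSketch")
val ssc = new StreamingContext(conf, Seconds(1))

// Two receiver-based input DStreams -> two receivers running in parallel.
val stream1 = ssc.socketTextStream("localhost", 9998)
val stream2 = ssc.socketTextStream("localhost", 9999)

// Combine the two streams and process them as one.
val merged = stream1.union(stream2)
merged.count().print()

ssc.start()
ssc.awaitTermination()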

Points to remember

When running a Spark Streaming program locally, do not use “local” or “local[1]” as the master URL. Either of these means that only one thread will be used for running tasks locally. If you are using a input DStream based on a receiver (e.g. sockets, Kafka, Flume, etc.), then the single thread will be used to run the receiver, leaving no thread for processing the received data. Hence, when running locally, always use “local[n]” as the master URL, where n > number of receivers to run (see Spark Properties for information on how to set the master).

Extending the logic to running on a cluster, the number of cores allocated to the Spark Streaming application must be more than the number of receivers. Otherwise the system will receive data, but not be able to process it.

【Translation】Points to remember:

      1. When running a Spark Streaming program locally, do not use "local" or "local[1]" as the master URL. Either of these means only one thread will be used to run tasks locally. If you are using an input DStream based on a receiver (e.g. sockets, Kafka, Flume, etc.), that single thread will be used to run the receiver (【My understanding】in other words, the one thread does nothing but receive data), leaving no thread to process the received data. Hence, when running locally, always use "local[n]" as the master URL, where n is greater than the number of receivers to run.

      2. Extending this logic to running on a cluster, the number of cores allocated to the Spark Streaming application must be greater than the number of receivers. Otherwise the system will receive data but will not be able to process it (see the configuration sketch below).
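For the cluster case, one way to express the same rule is through the cores the application is allowed to use. A hedged sketch for a standalone cluster follows; the master URL and the numbers are placeholders, and spark.cores.max is just one way of capping the cores an application takes.

import org.apache.spark.SparkConf

// With, say, 3 receivers the application needs more than 3 cores in total,
// otherwise it will only receive data and never process it.
val conf = new SparkConf()
  .setMaster("spark://master-host:7077")
  .setAppName("StreamingOnCluster")
  .set("spark.cores.max", "4")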

Basic Sources

We have already taken a look at the ssc.socketTextStream(...) in the quick example which creates a DStream from text data received over a TCP socket connection. Besides sockets, the StreamingContext API provides methods for creating DStreams from files and Akka actors as input sources.

【Translation】Basic sources:

We have already seen ssc.socketTextStream(...) in the earlier example, which creates a DStream from text data received over a TCP socket connection. Besides sockets, the StreamingContext API provides methods for creating DStreams with files and Akka actors as input sources.

File Streams: For reading data from files on any file system compatible with the HDFS API (that is, HDFS, S3, NFS, etc.), a DStream can be created as:

【Translation】File streams: to read data from files on any file system compatible with the HDFS API (that is, HDFS, S3, NFS, etc.), a DStream can be created as follows:

streamingContext.fileStream[KeyClass, ValueClass, InputFormatClass](dataDirectory)

Spark Streaming will monitor the directory dataDirectory and process any files created in that directory (files written in nested directories not supported). Note that

· The files must have the same data format.

· The files must be created in the dataDirectory by atomically moving or renaming them into the data directory.

· Once moved, the files must not be changed. So if the files are being continuously appended, the new data will not be read.

For simple text files, there is an easier method streamingContext.textFileStream(dataDirectory). And file streams do not require running a receiver, hence do not require allocating cores.

【Translation】Spark Streaming will monitor the directory dataDirectory and process any files created in that directory (files written into nested directories are not supported). Note that:

     1. The files must have the same data format.

     2. The files must be created in dataDirectory by atomically moving or renaming them into the data directory. 【My understanding】This confused me at first; the point is that Spark discovers new files by listing the directory, so a file should be fully written elsewhere and then moved or renamed into dataDirectory in a single atomic operation, so that Spark never picks up a half-written file.

     3. Once moved, the files must not be changed. So if a file is continuously appended to, the newly appended data will not be read.

For simple text files there is an easier method that we have already seen: streamingContext.textFileStream(dataDirectory). Also, file streams do not require running a receiver, so no cores need to be allocated for them.
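A hedged sketch of both file-stream variants, assuming an existing StreamingContext named ssc; the paths and the key/value/input-format types are placeholders, and new files should be written elsewhere first and then atomically moved or renamed into the monitored directory.

import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

// Simple text files: no receiver runs for this stream, so no core is tied up.
val logs = ssc.textFileStream("hdfs://namenode:8020/streaming/in")
logs.count().print()

// General form with explicit key, value and input-format classes.
val records = ssc.fileStream[LongWritable, Text, TextInputFormat](
  "hdfs://namenode:8020/streaming/in")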

Advanced Sources

This category of sources require interfacing with external non-Spark libraries, some of them with complex dependencies (e.g., Kafka and Flume). Hence, to minimize issues related to version conflicts of dependencies, the functionality to create DStreams from these sources has been moved to separate libraries that can be linked to explicitly when necessary. For example, if you want to create a DStream using data from Twitter’s stream of tweets, you have to do the following:

1. Linking: Add the artifact spark-streaming-twitter_2.10 to the SBT/Maven project dependencies.

2. Programming: Import the TwitterUtils class and create a DStream with TwitterUtils.createStream as shown below.

3. Deploying: Generate an uber JAR with all the dependencies (including the dependency spark-streaming-twitter_2.10 and its transitive dependencies) and then deploy the application. This is further explained in the Deploying section.
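Putting those three steps together, a hedged sketch of the linking and programming parts, assuming an existing StreamingContext named ssc; the Spark version in the dependency is a placeholder, and twitter4j is expected to read its OAuth credentials from system properties such as twitter4j.oauth.consumerKey.

// build.sbt (the version should match your Spark version)
libraryDependencies += "org.apache.spark" % "spark-streaming-twitter_2.10" % "1.6.0"

// application code
import org.apache.spark.streaming.twitter.TwitterUtils

// Passing None lets twitter4j pick up the OAuth credentials from system properties.
val tweets = TwitterUtils.createStream(ssc, None)
tweets.map(status => status.getText).print()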
