Spark Streaming is an extension of the core Spark API that enables scalable, high-throughput, fault-tolerant stream processing of live data streams. Data can be ingested from many sources like Kafka, Flume, Twitter, ZeroMQ, Kinesis, or TCP sockets, and can be processed using complex algorithms expressed with high-level functions like map, reduce, join, and window. Finally, processed data can be pushed out to filesystems, databases, and live dashboards. In fact, you can apply Spark's machine learning and graph processing algorithms on data streams.
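To get an intuition for what "processing with map and reduce" means on a stream, here is a toy sketch in plain Scala: each micro-batch is modeled as an ordinary sequence of lines, and the transformations are applied batch by batch. The names (BatchSketch, batch1, batch2) are illustration only, not the Spark API.

```scala
// Toy model only: each incoming batch is a plain Scala sequence of lines.
// Spark Streaming applies transformations like map and reduce to every
// batch in turn; this sketch mimics that on local collections.
object BatchSketch {
  // Two successive micro-batches of a line stream.
  val batch1 = Seq("spark streaming", "spark core")
  val batch2 = Seq("spark sql")

  // A map-like transformation applied batch by batch: words per line.
  def wordsPerLine(batch: Seq[String]): Seq[Int] =
    batch.map(_.split(" ").length)

  // A reduce-like transformation: total words in one batch.
  def totalWords(batch: Seq[String]): Int =
    wordsPerLine(batch).sum

  def main(args: Array[String]): Unit = {
    println(totalWords(batch1)) // 4
    println(totalWords(batch2)) // 2
  }
}
```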
Internally, it works as follows: Spark Streaming receives live input data streams and divides the data into batches, which are then processed by the Spark engine to generate the final stream of results in batches.
Spark Streaming provides a high-level abstraction called a discretized stream, or DStream, which represents a continuous stream of data. DStreams can be created either from input data streams from sources such as Kafka, Flume, and Kinesis, or by applying high-level operations on other DStreams. Internally, a DStream is represented as a sequence of RDDs (Resilient Distributed Datasets).
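"A sequence of RDDs" can also be pictured with a toy model (again plain Scala, not the Spark API): a stand-in DStream is a sequence of batches, and a window over the stream is just the concatenation of the last few batches.

```scala
// Toy sketch: a DStream modeled as a sequence of batches, each batch
// standing in for one RDD. A window of length n then corresponds to
// concatenating the last n batches.
object DStreamSketch {
  type Batch = Seq[String]                 // stand-in for one RDD
  val stream: Seq[Batch] =                 // stand-in for a DStream
    Seq(Seq("a", "b"), Seq("c"), Seq("d", "e"))

  // The data covered by a window of n batch intervals ending at batch i.
  def window(i: Int, n: Int): Batch =
    stream.slice((i - n + 1).max(0), i + 1).flatten

  def main(args: Array[String]): Unit =
    println(window(2, 2))                  // List(c, d, e)
}
```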
The following is an example from the AMPCamp training. It was originally run from the command line; here it has been changed to run directly as an IDEA project (more on that later).
A small example that demonstrates Spark processing a file stream:
1. Create the input directory
mkdir -p /tmp/fileStream
2. Create a Scala project
Add the Scala and Spark jars, e.g. scala-sdk-2.10.4 and spark-assembly-1.3.1-hadoop2.6.0.
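If you prefer sbt over adding jars by hand, the same dependencies can be declared in a build.sbt. This is a sketch assuming the published Maven artifacts for these versions; the assembly-jar route above works equally well.

```scala
// build.sbt — sketch matching the versions mentioned above
scalaVersion := "2.10.4"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core"      % "1.3.1",
  "org.apache.spark" %% "spark-streaming" % "1.3.1"
)
```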
3. Create a Scala object
import org.apache.spark.SparkConf
import org.apache.spark.streaming._

object TestFileStreaming {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("TestFileStreaming").setMaster("local")
    // Batch interval of 4 seconds: each batch covers 4 seconds of input.
    val ssc = new StreamingContext(conf, Seconds(4))
    // Watch the directory for newly created files.
    val lines = ssc.textFileStream("/tmp/fileStream")
    val words = lines.flatMap(_.split(" "))
    val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _)
    wordCounts.print()
    ssc.start()
    ssc.awaitTermination()
  }
}
Key points:
a. The StreamingContext processes one batch every 4 seconds;
b. RDDs are obtained from the stream and processed;
c. After starting the streaming service, be sure to call awaitTermination, otherwise the program exits immediately.
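The per-batch word count from key point b can be sanity-checked locally: on a plain Scala collection, mapping to (word, 1) pairs followed by a group-and-sum computes the same result that reduceByKey(_ + _) produces for one RDD. A local sketch, not the Spark API:

```scala
// Local equivalent of words.map(x => (x, 1)).reduceByKey(_ + _)
// for a single batch; no Spark required.
object WordCountSketch {
  def wordCounts(lines: Seq[String]): Map[String, Int] =
    lines
      .flatMap(_.split(" "))          // lines -> words
      .map(w => (w, 1))               // word -> (word, 1)
      .groupBy(_._1)                  // local stand-in for reduceByKey
      .mapValues(_.map(_._2).sum)
      .toMap

  def main(args: Array[String]): Unit =
    // Each word appears twice across the two lines.
    println(wordCounts(Seq("hello world", "hello world")))
}
```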
4. Provide input
echo "hello world" > /tmp/fileStream/hello1
echo "hello world" > /tmp/fileStream/hello2
You should then see the word counts printed in the console.
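One caveat worth noting: textFileStream only picks up files that appear in the monitored directory after the context starts, and a slow writer's file may be read while still half-written. Writing to a scratch path first and then moving the file in (a rename on the same filesystem is atomic) avoids this; the paths below just reuse this example's directory.

```shell
# Write the file outside the monitored directory first...
echo "hello world" > /tmp/hello3.tmp
# ...then move it in; rename on the same filesystem is atomic,
# so the stream never sees a half-written file.
mkdir -p /tmp/fileStream
mv /tmp/hello3.tmp /tmp/fileStream/hello3
```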