Spark Streaming Programming Guide (Partial)

Reference link: streaming-programming-guide


Spark Streaming is an extension of the core Spark API that enables scalable, high-throughput, fault-tolerant stream processing of live data streams. Data can be ingested from many sources like Kafka, Flume, Twitter, ZeroMQ, Kinesis, or TCP sockets, and can be processed using complex algorithms expressed with high-level functions like map, reduce, join and window. Finally, processed data can be pushed out to filesystems, databases, and live dashboards. In fact, you can apply Spark's machine learning and graph processing algorithms on data streams.

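As a sketch of that end-to-end pipeline (the socket source on localhost:9999, the 2-second batch interval, and the object name are illustrative assumptions, not taken from the guide), a streaming word count in Scala might look like this:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object SocketWordCount {
  def main(args: Array[String]): Unit = {
    // local[2]: one thread for the socket receiver, one for processing
    val conf = new SparkConf().setAppName("SocketWordCount").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(2))

    // Ingest a live stream from a TCP socket (feed it with: nc -lk 9999)
    val lines = ssc.socketTextStream("localhost", 9999)

    // Process with high-level functions: flatMap, map, reduceByKey
    val counts = lines.flatMap(_.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)

    // Push the processed data out (here, simply to the console)
    counts.print()

    ssc.start()
    ssc.awaitTermination()
  }
}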

[Figure: Spark Streaming]

Internally, it works as follows. Spark Streaming receives live input data streams and divides the data into batches, which are then processed by the Spark engine to generate the final stream of results in batches.

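To make the batching visible, here is a small self-contained sketch (the queue of RDDs is a stand-in for a live source; all names are made up) using queueStream and foreachRDD; each batch interval hands the Spark engine exactly one RDD:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

import scala.collection.mutable

object BatchDemo {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("BatchDemo").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(1))

    // A queue of RDDs stands in for a live source; one RDD is consumed per batch
    val queue = mutable.Queue(
      ssc.sparkContext.makeRDD(1 to 10),
      ssc.sparkContext.makeRDD(11 to 20))
    val stream = ssc.queueStream(queue)

    // Each batch arrives as a single RDD tagged with its batch time
    stream.foreachRDD { (rdd, time) =>
      println(s"Batch at $time contains ${rdd.count()} records")
    }

    ssc.start()
    ssc.awaitTermination()
  }
}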


[Figure: Spark Streaming]

Spark Streaming provides a high-level abstraction called discretized stream or DStream, which represents a continuous stream of data. DStreams can be created either from input data streams from sources such as Kafka, Flume, and Kinesis, or by applying high-level operations on other DStreams. Internally, a DStream is represented as a sequence of RDDs.

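Because a DStream is just a sequence of RDDs, new DStreams can be derived either with DStream-level operations or by dropping to the RDD level with transform. A small fragment (assuming lines is an existing DStream[String], such as the one in the example below):

// flatMap on the DStream is applied to every underlying RDD in turn
val words = lines.flatMap(_.split(" "))

// transform exposes each batch's RDD directly, so any RDD operation can be used
val nonEmpty = lines.transform(rdd => rdd.filter(_.nonEmpty))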


Below is an example from the AMPCamp training. It was originally run from the command line; here it has been changed to run directly as an IDEA project (more on that part later).

Here is a small example that demonstrates Spark processing a file stream:

1. Create the input directory

mkdir -p /tmp/fileStream

2. Create a Scala project

Add the Scala and Spark jars to the project, e.g. scala-sdk-2.10.4 and spark-assembly-1.3.1-hadoop2.6.0.
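If you build with sbt instead of adding the assembly jar by hand, the roughly equivalent settings (a sketch matching the versions above) would be:

scalaVersion := "2.10.4"

libraryDependencies += "org.apache.spark" %% "spark-streaming" % "1.3.1"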


3. Create a new Scala object

import org.apache.spark.streaming._
import org.apache.spark.SparkConf

object TestFileStreaming {
  def main(args: Array[String]) {
    // File streams use no receiver, so a single local thread is enough
    val conf = new SparkConf().setAppName("TestFileStreaming").setMaster("local")
    // Process one batch every 4 seconds
    val ssc = new StreamingContext(conf, Seconds(4))

    // Watch /tmp/fileStream for newly created files
    val lines = ssc.textFileStream("/tmp/fileStream")
    val words = lines.flatMap(_.split(" "))
    val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _)
    wordCounts.print()

    // Start the streaming job and block until it is stopped
    ssc.start()
    ssc.awaitTermination()
  }
}

Key points:

a. The StreamingContext processes one batch every 4 seconds;

b. Each batch is taken from the stream as an RDD and processed;

c. After starting the streaming service, you must call awaitTermination, otherwise the program exits immediately.
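On point c: if the job should eventually stop on its own instead of running until killed, one option (a sketch using the standard StreamingContext API) is to wait with a timeout and then shut down gracefully:

// Block for up to 60 seconds, then stop, letting in-flight batches finish
ssc.awaitTerminationOrTimeout(60000)
ssc.stop(stopSparkContext = true, stopGracefully = true)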


4. Provide input

echo "hello world" > /tmp/fileStream/hello1

echo "hello world" > /tmp/fileStream/hello2

You should then see the word counts in the program's output.
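Note: textFileStream only picks up files that appear in the directory after the job starts, and each file is expected to appear atomically. If files written with echo are not picked up reliably, write the file elsewhere first and mv it in (file name arbitrary):

echo "hello world" > /tmp/hello3 && mv /tmp/hello3 /tmp/fileStream/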

