- Overview
- Use cases
- Integration with the Spark ecosystem
- History
- Starting from a word-count example
- How it works
Overview
Spark Streaming is an extension of the core Spark API that enables scalable, high-throughput, fault-tolerant stream processing of live data streams.
scalable – scales out across many nodes
high-throughput – processes large volumes of data per unit of time
fault-tolerant – recovers efficiently from failures
Data can be ingested from many sources like Kafka, Flume, Kinesis, or TCP sockets, and can be processed using complex algorithms expressed with high-level functions like map, reduce, join and window. Finally, processed data can be pushed out to filesystems, databases, and live dashboards. In fact, you can apply Spark’s machine learning and graph processing algorithms on data streams.
In other words, Spark Streaming takes data from a variety of sources and, after processing, pushes the results out to files and other external systems.
Key characteristics: low latency; efficient recovery from failures; scales to hundreds or even thousands of nodes; and lets batch processing, machine learning, graph computation, and Spark Streaming be combined in a single application.
It works as follows: Spark Streaming receives live input data streams and divides the data into batches, which are then processed by the Spark engine to generate the final stream of results in batches.
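The micro-batching idea can be illustrated in plain Scala (a toy simulation with hypothetical names, not the Spark API): events carry timestamps, and the receiver groups them into fixed batch intervals before each batch is processed as a unit.

```scala
// Toy illustration of micro-batching (not the Spark API):
// timestamped events are grouped into fixed intervals, and each
// interval's batch is then handed to the engine as one unit.
object MicroBatchSketch {
  case class Event(timeMs: Long, payload: String)

  // Assign each event to the batch interval its timestamp falls into.
  def toBatches(events: Seq[Event], batchIntervalMs: Long): Map[Long, Seq[Event]] =
    events.groupBy(e => e.timeMs / batchIntervalMs)

  def main(args: Array[String]): Unit = {
    val events = Seq(Event(100, "a"), Event(900, "b"), Event(1200, "c"))
    // With a 1-second interval, the first two events land in the same batch.
    val batches = toBatches(events, 1000L)
    println(batches(0L).map(_.payload).mkString(","))  // a,b
    println(batches(1L).map(_.payload).mkString(","))  // c
  }
}
```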
One stack to rule them all – a single, unified stack.
Use cases
Banking, telecom, electronics, manufacturing, and e-commerce; real-time monitoring; and surfacing error messages while a web system is running.
Integration with the Spark ecosystem
Combine batch with streaming processing
Join data stream with static data sets
// Create a data set from a Hadoop file
val dataset = sparkContext.hadoopFile("file")
This reads a data set out of a file on the filesystem.
// Join each batch in the stream with the data set
kafkaStream.transform { batchRDD => batchRDD.join(dataset).filter(...) }
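The per-batch join pattern above can be sketched without Spark (a plain-Scala toy with hypothetical names, not the RDD API): each micro-batch of keyed events is inner-joined against a static lookup map that was loaded once.

```scala
// Toy sketch of joining each micro-batch with a static data set
// (plain Scala, not the Spark RDD API).
object StreamJoinSketch {
  // Static data set, e.g. loaded once from a file at startup.
  val dataset: Map[String, String] = Map("user1" -> "gold", "user2" -> "silver")

  // Join one batch of (key, event) pairs with the static map,
  // keeping only keys present in the data set (inner join).
  def joinBatch(batch: Seq[(String, String)]): Seq[(String, (String, String))] =
    batch.flatMap { case (k, v) => dataset.get(k).map(d => (k, (v, d))) }

  def main(args: Array[String]): Unit = {
    val batch = Seq(("user1", "click"), ("user3", "view"))
    println(joinBatch(batch))  // List((user1,(click,gold)))
  }
}
```

In real Spark Streaming the static side would be an RDD broadcast or joined inside `transform`, as in the fragment above; the toy only shows the per-batch data flow.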
Learn models offline, apply them online
// Learn the model offline
val model = KMeans.train(dataset, ...)
// Apply the model online on the stream
kafkaStream.map { event => model.predict(event.feature) }
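The offline/online split can be sketched without MLlib (a toy model with hypothetical names): a threshold is "trained" once on historical data, then applied to each event as it arrives in the stream.

```scala
// Toy version of "learn offline, apply online" (no MLlib):
// the "model" is just a mean threshold fitted on historical data.
object OfflineOnlineSketch {
  final case class ThresholdModel(threshold: Double) {
    def predict(feature: Double): Boolean = feature > threshold
  }

  // Offline step: fit the model once on a historical data set.
  def train(dataset: Seq[Double]): ThresholdModel =
    ThresholdModel(dataset.sum / dataset.size)

  def main(args: Array[String]): Unit = {
    val model  = train(Seq(1.0, 2.0, 3.0))  // threshold = 2.0
    val stream = Seq(1.5, 2.5, 3.5)
    // Online step: apply the fitted model to each event in the stream.
    println(stream.map(model.predict))      // List(false, true, true)
  }
}
```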
Interactively query streaming data with SQL
// Register each batch in the stream as a table
kafkaStream.foreachRDD { batchRDD => batchRDD.registerTempTable("lastEvents") }
// Interactively query the table
sqlContext.sql("select * from lastEvents")
History
Starting from a word-count example
- Run with spark-submit
- Run with spark-shell
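Spark ships a streaming NetworkWordCount example that suits this walkthrough. A sketch of launching it, assuming a Spark distribution at `$SPARK_HOME` and netcat feeding text on port 9999:

```shell
# In one terminal, start a netcat server the example will read from.
nc -lk 9999

# In another terminal, run the bundled streaming word-count example
# (run-example is a thin wrapper around spark-submit).
$SPARK_HOME/bin/run-example streaming.NetworkWordCount localhost 9999
```

Words typed into the netcat terminal are then counted batch by batch; the same code can also be pasted into spark-shell to experiment interactively.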
Starting from the Spark source code
GitHub
https://github.com/apache/spark
Run with spark-submit
/*
* Licensed to the Apache Software Foundation (ASF) under one or more
* contributor license agreements. See the NOTICE file distributed with
*