The following content is based on notes from DT大数据梦工厂 (DT Big Data Dream Factory):
StreamingContext
* A Java-friendly version of [[org.apache.spark.streaming.StreamingContext]] which is the main
* entry point for Spark Streaming functionality. It provides methods to create
* [[org.apache.spark.streaming.api.java.JavaDStream]] and
* [[org.apache.spark.streaming.api.java.JavaPairDStream]] from input sources. The internal
* org.apache.spark.api.java.JavaSparkContext (see core Spark documentation) can be accessed
* using `context.sparkContext`. After creating and transforming DStreams, the streaming
* computation can be started and stopped using `context.start()` and `context.stop()`,
* respectively. `context.awaitTermination()` allows the current thread to wait for the
* termination of a context by `stop()` or by an exception.
The StreamingContext is the sole channel to the cluster.
DStream
Receiver
* Abstract class of a receiver that can be run on worker nodes to receive external data. A
* custom receiver can be defined by defining the functions `onStart()` and `onStop()`.
In other words, a Receiver runs on the worker nodes and takes in external data.
InputDStream → DStream (transformations) → output DStream (ForEachDStream)
InputDStream (backed by a Receiver): as in socket programming, getInputStream continuously pulls data from the remote end into the receiver, which in turn stores it through the BlockManager.
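A minimal sketch of such a pull-based receiver, modeled on the custom-receiver pattern from the Spark documentation; the class name SocketLineReceiver is illustrative, and store() is what hands each record over to the BlockManager:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.Socket;
import org.apache.spark.storage.StorageLevel;
import org.apache.spark.streaming.receiver.Receiver;

public class SocketLineReceiver extends Receiver<String> {
    private final String host;
    private final int port;

    public SocketLineReceiver(String host, int port) {
        super(StorageLevel.MEMORY_AND_DISK_2());
        this.host = host;
        this.port = port;
    }

    @Override
    public void onStart() {
        // Receive on a separate thread so that onStart() returns immediately.
        new Thread() {
            @Override
            public void run() {
                try (Socket socket = new Socket(host, port);
                     BufferedReader reader = new BufferedReader(
                             new InputStreamReader(socket.getInputStream()))) {
                    String line;
                    while (!isStopped() && (line = reader.readLine()) != null) {
                        store(line); // handed over to the BlockManager for storage
                    }
                    restart("Connection closed, trying to reconnect");
                } catch (Exception e) {
                    restart("Error receiving data", e);
                }
            }
        }.start();
    }

    @Override
    public void onStop() {
        // Nothing to do: the receive thread checks isStopped() and exits on its own.
    }
}

It would be plugged in with jsc.receiverStream(new SocketLineReceiver("node11", 9999)); the built-in socketTextStream used in the example below does essentially this internally.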
Logical Model
Spark only understands RDDs, so this logical model must generate a DAG of RDDs before it can actually be executed;
A DStream is the template for RDDs, and the DStreamGraph is the template for the DAG; when a batch interval is reached, these templates are instantiated with the batch data into concrete RDDs and a DAG.
The only thing that triggers a business job in Spark Streaming is the batch interval: a Timer in the Driver generates and fires job execution according to the time interval the developer passes in.
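This per-batch instantiation is visible from user code through foreachRDD: Spark invokes the callback once per batch interval, each time with a fresh RDD generated from the DStream template. A sketch, assuming a JavaDStream<String> named lines such as the one in the example below, using the Spark 1.x signature (Function is org.apache.spark.api.java.function.Function, JavaRDD is org.apache.spark.api.java.JavaRDD; newer versions use a VoidFunction overload instead):

lines.foreachRDD(new Function<JavaRDD<String>, Void>() {
    @Override
    public Void call(JavaRDD<String> rdd) throws Exception {
        // Every interval a new RDD (note the fresh id) is created from the template.
        System.out.println("batch RDD id=" + rdd.id() + ", records=" + rdd.count());
        return null;
    }
});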
RDD source code: how is RDD data guaranteed not to change? An RDD is a read-only description of a dataset (its lineage, partitions, and compute function); transformations never modify an existing RDD but always return a new one.
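A minimal illustration of that immutability, assuming an existing JavaSparkContext named sc (the method and variable names are illustrative; needed imports are java.util.Arrays, org.apache.spark.api.java.JavaRDD, org.apache.spark.api.java.JavaSparkContext, and org.apache.spark.api.java.function.Function):

static void demoImmutability(JavaSparkContext sc) {
    JavaRDD<Integer> nums = sc.parallelize(Arrays.asList(1, 2, 3));
    JavaRDD<Integer> doubled = nums.map(new Function<Integer, Integer>() {
        @Override
        public Integer call(Integer v) throws Exception {
            return v * 2;
        }
    });
    // map() built a brand-new RDD; nums itself was never mutated.
    System.out.println(doubled.collect()); // [2, 4, 6]
    System.out.println(nums.collect());    // still [1, 2, 3]
}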
package cn.tan.bd.bdapp.bd.sparkstreaming;
import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.function.FlatMapFunction;
import org.apache.spark.api.java.function.Function2;
import org.apache.spark.api.java.function.PairFunction;
import org.apache.spark.streaming.Duration;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaPairDStream;
import org.apache.spark.streaming.api.java.JavaReceiverInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import scala.Tuple2;
/**
 * Word count over a socket text stream with Spark Streaming.
 *
 * @version created 2016-04-18 20:24:38
 */
public class SparkStreamingWordCount {
public static void main( String[] args ) {
/**
 * 1. Configure the SparkConf.
 */
SparkConf conf = new SparkConf().setMaster( "spark://node11:7077" )
.setAppName( "wordCount" );
/**
 * 2. Create the JavaStreamingContext with a 5-second batch interval.
 */
JavaStreamingContext jsc = new JavaStreamingContext(conf, Durations.seconds(5));
/**
 * 3. Create the input DStream, the data source for Spark Streaming.
 *    - The input source can be HDFS, a file, Kafka, and so on; here Streaming connects
 *      to a socket port and keeps listening while the application runs.
 *    - If most 5-second intervals carry no data, repeatedly launching empty jobs wastes
 *      resources, so real enterprise production code checks whether a batch has data
 *      before submitting the job, and submits only when it does (see the sketch after
 *      this example).
 */
JavaReceiverInputDStream<String> lines = jsc.socketTextStream( "node11", 9999);
/**
 * 4. Program against the DStream just as you would against an RDD.
 *    Before any computation takes place, each batch's DStream operations are in essence
 *    compiled into operations on that batch's RDDs. Apply transformations such as the
 *    higher-order functions map and filter to the initial DStream to express the actual
 *    computation.
 *    4.1 First step: split each line of text into individual words.
 */
JavaDStream<String> words = lines.flatMap( new FlatMapFunction<String, String>() {
    private static final long serialVersionUID = 1L;

    @Override
    public Iterable<String> call( String line ) throws Exception {
        return Arrays.asList(line.split(" "));
    }
} );
JavaPairDStream<String, Integer> pairs = words.mapToPair(new PairFunction<String, String, Integer>() {
    private static final long serialVersionUID = 1L;

    @Override
    public Tuple2<String, Integer> call( String word ) throws Exception {
        // Pair each word with an initial count of 1.
        return new Tuple2<String, Integer>(word, 1);
    }
});
JavaPairDStream<String, Integer> wordCount = pairs.reduceByKey( new Function2<Integer, Integer, Integer>() {
    private static final long serialVersionUID = 1L;

    @Override
    public Integer call( Integer v1, Integer v2 ) throws Exception {
        // Sum the partial counts for each word.
        return v1 + v2;
    }
} );
wordCount.print();      // output operation; without one, no job is triggered
jsc.start();            // start the streaming computation
jsc.awaitTermination(); // block the main thread until stop() or an exception
jsc.close();
}
}
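As noted in step 3, production code usually avoids doing work on empty batches. A hedged sketch of one common pattern: replace wordCount.print() with a foreachRDD that checks isEmpty() (available since Spark 1.3) before running the output action. This uses the Spark 1.x foreachRDD signature, matching the example above (Function is org.apache.spark.api.java.function.Function, JavaPairRDD is org.apache.spark.api.java.JavaPairRDD); newer versions take a VoidFunction instead:

wordCount.foreachRDD(new Function<JavaPairRDD<String, Integer>, Void>() {
    @Override
    public Void call(JavaPairRDD<String, Integer> rdd) throws Exception {
        if (!rdd.isEmpty()) {
            // Only touch the cluster when the batch actually contains data.
            System.out.println(rdd.collect());
        }
        return null;
    }
});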