On top of Spark Core, Spark mainly consists of four modules:
Spark SQL, Spark MLlib, Spark Streaming, and Spark GraphX.
The concept of Spark Streaming
Spark Streaming is a further wrapper around SparkContext: you specify a batch interval, and it creates the primitive DStream.
The DStream (discretized stream) is the most basic abstraction in Spark Streaming. It is an abstract distributed collection that carries description information and further encapsulates the RDD: a DStream periodically generates one RDD per batch interval (for example, with a 5-second interval, a new RDD holding that batch's records is produced every 5 seconds). Like RDD operations, DStream operations are divided into Transformations and Actions (output operations).
Features of Spark Streaming
Fault-tolerant, high-throughput, and scalable, with a rich programming API; it can also guarantee Exactly Once (exactly-once semantics), provided offsets are tracked together with the processed output.
Spark Streaming receives data from nc (netcat) and runs a simple word-count.
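To try this locally, first start a netcat server on the port used below (on Linux, typically nc -lk 8888; the host and port must match the socketTextStream call) and type lines of words into it.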
import org.apache.log4j.{Level, Logger}
import org.apache.spark.streaming.dstream.{DStream, ReceiverInputDStream}
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.{SparkConf, SparkContext}
/**
* Create a Spark Streaming environment and write the word-count Transformations,
* consuming messages from a socket port.
*
* To set the log level, the main dependencies to import are:
* <dependency>
* <groupId>log4j</groupId>
* <artifactId>log4j</artifactId>
* <version>1.2.17</version>
* </dependency>
*
* <dependency>
* <groupId>org.slf4j</groupId>
* <artifactId>slf4j-api</artifactId>
* <version>1.7.21</version>
* </dependency>
*/
object StreamingWordCount {
def main(args: Array[String]): Unit = {
Logger.getLogger("org").setLevel(Level.ERROR)
//First create an ordinary offline (batch) execution environment
val conf: SparkConf = new SparkConf().setAppName(this.getClass.getSimpleName).setMaster("local[*]")
val sc = new SparkContext(conf)
//Then wrap it with Streaming to turn it into a real-time execution environment
val ssc = new StreamingContext(sc, Seconds(5))
//Create the DStream, which contains RDDs (a further encapsulation of the RDD)
val lines: ReceiverInputDStream[String] = ssc.socketTextStream("localhost",8888)
//Write the word-count transformations
val words: DStream[String] = lines.flatMap(_.split(" "))
val wordandone: DStream[(String, Int)] = words.map((_, 1))
val reduce: DStream[(String, Int)] = wordandone.reduceByKey(_+_)
//Trigger the Action (output operation)
reduce.print()
//Start the job and keep it running
ssc.start()
//Block so the application does not exit
ssc.awaitTermination()
}
}
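Note that reduceByKey aggregates only within each 5-second batch; counts are not carried over between batches. As a minimal sketch of a cumulative variant (an assumption, not part of the original example), updateStateByKey keeps running totals and additionally requires a checkpoint directory:
ssc.checkpoint("./ck") // hypothetical local checkpoint path
//Merge this batch's counts for a key into the previously accumulated total
val totals: DStream[(String, Int)] = wordandone.updateStateByKey(
(newValues: Seq[Int], state: Option[Int]) => Some(newValues.sum + state.getOrElse(0))
)
totals.print()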
Connecting Spark Streaming directly to Kafka
First, add the dependency:
<!-- Spark dependency: the Spark Streaming to Kafka connector and its RDD programming API -->
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-streaming-kafka-0-10_2.11</artifactId>
<version>${spark.version}</version>
</dependency>
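Note that the artifact name encodes both the Kafka integration line (0-10, i.e. Kafka 0.10 or later) and the Scala version (2.11); these need to match your cluster's Kafka version and your project's Scala version.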
Code implementation:
import org.apache.kafka.clients.consumer.ConsumerRecord
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.log4j.{Level, Logger}
import org.apache.spark.SparkConf
import org.apache.spark.streaming.dstream.InputDStream
import org.apache.spark.streaming.kafka010.{ConsumerStrategies, KafkaUtils, LocationStrategies}
import org.apache.spark.streaming.{Milliseconds, StreamingContext}
/**
* Spark Streaming connected directly to Kafka,
* using Kafka's efficient low-level consumer API.
* Start the virtual machines and make sure every Kafka broker node is running.
* At the end, the partition offsets can be obtained (see the sketch after the code).
*/
object KafkaStreaming {
def main(args: Array[String]): Unit = {
Logger.getLogger("org").setLevel(Level.WARN)
//First create the Spark configuration
val conf = new SparkConf()
.setAppName(this.getClass.getSimpleName)
.setMaster("local[*]")
val ssc = new StreamingContext(conf, Milliseconds(5000))
//Kafka consumer parameters (a template copied from the official website)
val kafkaParams = Map[String, Object](
//The broker machines and ports to connect to
"bootstrap.servers" -> "doit01:9092,doit02:9092,doit03:9092",
//Deserialize keys and values as strings
"key.deserializer" -> classOf[StringDeserializer],
"value.deserializer" -> classOf[StringDeserializer],
//Specify the consumer group id
"group.id" -> "g1",
//Where to start reading; earliest = read from the beginning
"auto.offset.reset" -> "earliest",
//Do not auto-commit offsets (if not set, the default is true)
"enable.auto.commit" -> (false: java.lang.Boolean)
)
val topics = Array("wordcount2")
//Creating the direct Kafka stream takes two strategy parameters: 1. a location strategy, 2. a consumer strategy
//Create the Kafka data stream
val kafkaDStream: InputDStream[ConsumerRecord[String, String]] = KafkaUtils.createDirectStream[String, String](
ssc,
LocationStrategies.PreferConsistent, // location strategy
ConsumerStrategies.Subscribe[String, String](topics, kafkaParams) // subscription (consumer) strategy
)
//Trigger the action: take each record's value and print it
kafkaDStream.map(_.value()).print()
//Start the job
ssc.start()
//Block so the application does not exit
ssc.awaitTermination()
}
}
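Because enable.auto.commit is set to false above, offsets should be committed manually after each batch is processed. A minimal sketch under that assumption (it would replace the print call above; HasOffsetRanges and CanCommitOffsets come from the same kafka010 package):
import org.apache.spark.streaming.kafka010.{CanCommitOffsets, HasOffsetRanges}
kafkaDStream.foreachRDD { rdd =>
//Read this batch's partition offset ranges from the underlying Kafka RDD
val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
rdd.map(_.value()).foreach(println) // process the batch here
//Commit the offsets back to Kafka asynchronously once processing is done
kafkaDStream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
}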