Spark Streaming
A scalable, high-throughput, near-real-time stream-processing framework.
It integrates with many data sources.
It processes the data stream in batches; each batch covers one time interval (the data is processed once per interval).
DStream
Splits the continuous data stream into a series of RDDs; each RDD holds the data received during one time interval.
A DStream (discretized stream) is Spark Streaming's basic data abstraction and consists of a continuous sequence of RDDs.
DStreams have dependencies on one another, just like RDDs.
Window functions: aggregate how the data changes over a period of time (e.g. hourly registration counts, rolling metrics in finance).
Two important parameters:
window length: how long the window lasts, i.e. how many batch intervals the window covers
sliding interval: how often the window operation is executed, i.e. how far the window moves each time it slides
Both parameters must be multiples of the DStream batch interval, as sketched below.
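A minimal sketch (assuming a DStream of (word, 1) pairs named tup and a 5-second batch interval, as in the WordCount examples below):

// window length 30s, sliding interval 10s: every 10 seconds, aggregate the last 30 seconds of data;
// both durations are multiples of the 5-second batch interval
val windowedCounts: DStream[(String, Int)] =
  tup.reduceByKeyAndWindow((a: Int, b: Int) => a + b, Seconds(30), Seconds(10))

A full runnable example is given in the window operations section at the end of these notes.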
Resource scheduling modes (selected via the master URL; see the sketch below):
1. standalone mode (Spark's own cluster manager, no external scheduling framework required)
2. local mode (single machine, for local code debugging)
3. Spark on YARN (the Spark application runs inside the YARN resource scheduler)
4. Mesos
5. Docker
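A sketch of the common master URL forms (hostnames and ports here are placeholders):

import org.apache.spark.SparkConf

// local mode: one JVM with as many worker threads as cores
val localConf = new SparkConf().setAppName("demo").setMaster("local[*]")
// standalone mode: connect to Spark's own master process
val standaloneConf = new SparkConf().setAppName("demo").setMaster("spark://hadoop01:7077")
// Spark on YARN and Mesos are normally selected with spark-submit's --master flag
// (e.g. --master yarn, --master mesos://host:5050) rather than hard-coded in the app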
Setting the Spark log level
// Logging is org.apache.spark.Logging in Spark 1.x; in Spark 2.x the trait moved to an internal package
import org.apache.log4j.{Level, Logger}
import org.apache.spark.Logging

object LoggerLevels extends Logging {
  def setStreamingLogLevels(): Unit = {
    // only override the level if log4j has not already been configured with appenders
    val log4jInitialized: Boolean = Logger.getRootLogger.getAllAppenders.hasMoreElements
    if (!log4jInitialized) {
      logInfo("-----LoggerLevels------")
      Logger.getRootLogger.setLevel(Level.WARN)
    }
  }
}
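Call LoggerLevels.setStreamingLogLevels() at the start of main, as the Kafka and Flume examples below do, so that Spark only prints WARN-level logs and the streaming output stays readable.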
Spark Streaming WordCount
Reading data from a netcat service
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.streaming.dstream.{DStream, ReceiverInputDStream}
import org.apache.spark.streaming.{Seconds, StreamingContext}

object SparkStreamingWC {
  def main(args: Array[String]): Unit = {
    val conf: SparkConf = new SparkConf().setAppName("SparkStreamingWC").setMaster("local[2]")
    val sc: SparkContext = new SparkContext(conf)
    // 5-second batch interval
    val ssc: StreamingContext = new StreamingContext(sc, Seconds(5))
    // receive lines from the netcat server on hadoop02:8888
    val dstream: ReceiverInputDStream[String] = ssc.socketTextStream("hadoop02", 8888)
    val res: DStream[(String, Int)] = dstream.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
    res.print()
    ssc.start()
    ssc.awaitTermination()
  }
}
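For local testing, start netcat on hadoop02 with e.g. nc -lk 8888 and type space-separated words; each 5-second batch prints the word counts for that batch only.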
Netcat data WordCount, keeping the previous batches' results with updateStateByKey
/**
 * To accumulate counts across batches, call updateStateByKey.
 * The update function receives triples: the first element is the word,
 * the second is the Seq of counts for that word in the current batch, i.e. Seq(1,1,1,1,...),
 * and the third is the accumulated result from the previous batches.
 */
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}
import org.apache.spark.streaming.dstream.{DStream, ReceiverInputDStream}
import org.apache.spark.streaming.{Milliseconds, StreamingContext}

object SparkStreamingACCWC {
  def main(args: Array[String]): Unit = {
    val conf: SparkConf = new SparkConf().setAppName("SparkStreamingACCWC").setMaster("local[2]")
    val sc: SparkContext = new SparkContext(conf)
    val ssc: StreamingContext = new StreamingContext(sc, Milliseconds(5000))
    // updateStateByKey needs a checkpoint directory to store the running state
    ssc.checkpoint("hdfs://hadoop01:9000/ck_20171030")
    val dstream: ReceiverInputDStream[String] = ssc.socketTextStream("hadoop02", 8888)
    val tup: DStream[(String, Int)] = dstream.flatMap(_.split(" ")).map((_, 1))
    val res: DStream[(String, Int)] =
      tup.updateStateByKey(func, new HashPartitioner(ssc.sparkContext.defaultParallelism), false)
    res.print()
    ssc.start()
    ssc.awaitTermination()
  }

  // (word, counts in this batch, accumulated count so far) => (word, new accumulated count)
  val func = (it: Iterator[(String, Seq[Int], Option[Int])]) => {
    it.map(x => {
      (x._1, x._2.sum + x._3.getOrElse(0))
    })
  }
}
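This is the Iterator-based overload of updateStateByKey: the update function handles all keys of a partition in one call, the HashPartitioner decides how the state RDD is partitioned, and the trailing Boolean (rememberPartitioner) controls whether that partitioner is kept for the state RDDs.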
Reading data from Kafka
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}
import org.apache.spark.streaming.dstream.{DStream, ReceiverInputDStream}
import org.apache.spark.streaming.kafka.KafkaUtils
import org.apache.spark.streaming.{Milliseconds, StreamingContext}

object LoadKafkaDataAndWC {
  def main(args: Array[String]): Unit = {
    LoggerLevels.setStreamingLogLevels()
    // program arguments: ZooKeeper quorum, consumer group, comma-separated topics, threads per topic
    val Array(zkQuorum, group, topics, numThreads) = args
    val conf: SparkConf = new SparkConf().setAppName("LoadKafkaDataAndWC").setMaster("local[2]")
    val sc: SparkContext = new SparkContext(conf)
    val ssc: StreamingContext = new StreamingContext(sc, Milliseconds(5000))
    ssc.checkpoint("hdfs://hadoop01:9000/ck_20171030_1")
    val topicMap: Map[String, Int] = topics.split(",").map((_, numThreads.toInt)).toMap
    // Kafka messages arrive as (key, value) pairs; the payload is the value
    val data: ReceiverInputDStream[(String, String)] = KafkaUtils.createStream(ssc, zkQuorum, group, topicMap)
    val lines: DStream[String] = data.map(_._2)
    val tup: DStream[(String, Int)] = lines.flatMap(_.split(" ")).map((_, 1))
    val res: DStream[(String, Int)] =
      tup.updateStateByKey(func, new HashPartitioner(ssc.sparkContext.defaultParallelism), false)
    res.saveAsTextFiles("hdfs://hadoop01:9000/out-20171030_1")
    ssc.start()
    ssc.awaitTermination()
  }

  val func = (it: Iterator[(String, Seq[Int], Option[Int])]) => {
    it.map {
      case (word, counts, total) => (word, counts.sum + total.getOrElse(0))
    }
  }
}
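The job is launched with four arguments, zkQuorum group topics numThreads, for example (placeholder values) hadoop01:2181,hadoop02:2181,hadoop03:2181 group1 wordcount 2; it needs the receiver-based spark-streaming-kafka integration dependency on the classpath.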
Spark Streaming WordCount on data from Flume
1. Flume pushes data to Spark: simple and loosely coupled, but Flume reports errors if the Spark Streaming job is not running, and Spark Streaming may not be able to consume the data fast enough.
Add the dependency:
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-streaming-flume_2.10</artifactId>
<version>2.1.0</version>
</dependency>
import org.apache.spark.SparkConf
import org.apache.spark.streaming.dstream.{DStream, ReceiverInputDStream}
import org.apache.spark.streaming.flume.{FlumeUtils, SparkFlumeEvent}
import org.apache.spark.streaming.{Seconds, StreamingContext}

object FlumePushWC {
  def main(args: Array[String]): Unit = {
    LoggerLevels.setStreamingLogLevels()
    val conf: SparkConf = new SparkConf().setAppName("FlumePushWC").setMaster("local[*]")
    val ssc: StreamingContext = new StreamingContext(conf, Seconds(5))
    // push mode: Flume sends data to Spark, so this is the Spark receiver's IP address and port
    val flumeStream: ReceiverInputDStream[SparkFlumeEvent] = FlumeUtils.createStream(ssc, "192.168.216.51", 8888)
    // the actual payload of a Flume event is obtained via event.getBody
    val word: DStream[(String, Int)] = flumeStream.flatMap(x => new String(x.event.getBody.array()).split(" ")).map((_, 1))
    val res: DStream[(String, Int)] = word.reduceByKey(_ + _)
    res.print()
    ssc.start()
    ssc.awaitTermination()
  }
}
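On the Flume side, push mode needs an avro sink in the agent configuration whose hostname and port point at the Spark receiver address passed to createStream above (192.168.216.51:8888 in this example).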
2. Spark pulls data from Flume (reference: http://blog.csdn.net/qq_21234493/article/details/51303009)
Spark provides a custom Flume sink; Spark Streaming then pulls data from the channel through that sink, which is more reliable.
Following the official guide at http://spark.apache.org/docs/latest/streaming-flume-integration.html, download
spark-streaming-flume-sink_2.10-1.6.0.jar, scala-library-2.10.5.jar and commons-lang3-3.3.2.jar, and put the three jars into Flume's lib directory.
Configure flume-conf.properties:
# agent1 is the agent name
agent1.sources=source1
agent1.sinks=sink1
agent1.channels=channel1
# configure source1
agent1.sources.source1.type=spooldir
agent1.sources.source1.spoolDir=/usr/local/flume/tmp/TestDir
agent1.sources.source1.channels=channel1
agent1.sources.source1.fileHeader=false
agent1.sources.source1.interceptors=i1
agent1.sources.source1.interceptors.i1.type=timestamp
# configure sink1
agent1.sinks.sink1.type=org.apache.spark.streaming.flume.sink.SparkSink
agent1.sinks.sink1.hostname=Master
agent1.sinks.sink1.port=9999
agent1.sinks.sink1.channel=channel1
# configure channel1
agent1.channels.channel1.type=file
agent1.channels.channel1.checkpointDir=/usr/local/flume/tmp/checkpointDir
agent1.channels.channel1.dataDirs=/usr/local/flume/tmp/dataDirs
In the code above, replace the stream creation with the following; the address is the Flume agent's, and multiple InetSocketAddress instances can be put into the collection to poll several agents.
val addresses = Seq(new InetSocketAddress("172.16.0.11", 8888))
// pull mode: Spark polls the Flume SparkSink, so this is the Flume agent's IP address and port
val flumeStream: ReceiverInputDStream[SparkFlumeEvent] = FlumeUtils.createPollingStream(ssc, addresses, StorageLevel.MEMORY_AND_DISK)
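Put together, a minimal pull-based variant might look like the sketch below (the object name FlumePollWC is assumed here; the address must match the host and port of the SparkSink configured in flume-conf.properties):

import java.net.InetSocketAddress

import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.dstream.{DStream, ReceiverInputDStream}
import org.apache.spark.streaming.flume.{FlumeUtils, SparkFlumeEvent}
import org.apache.spark.streaming.{Seconds, StreamingContext}

object FlumePollWC {
  def main(args: Array[String]): Unit = {
    LoggerLevels.setStreamingLogLevels()
    val conf: SparkConf = new SparkConf().setAppName("FlumePollWC").setMaster("local[*]")
    val ssc: StreamingContext = new StreamingContext(conf, Seconds(5))
    // address(es) of the Flume agent(s) running the SparkSink; several agents can be polled at once
    val addresses = Seq(new InetSocketAddress("172.16.0.11", 8888))
    val flumeStream: ReceiverInputDStream[SparkFlumeEvent] =
      FlumeUtils.createPollingStream(ssc, addresses, StorageLevel.MEMORY_AND_DISK)
    // the payload of each Flume event is in its body
    val words: DStream[(String, Int)] =
      flumeStream.flatMap(e => new String(e.event.getBody.array()).split(" ")).map((_, 1))
    words.reduceByKey(_ + _).print()
    ssc.start()
    ssc.awaitTermination()
  }
}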
Window operations
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.streaming.dstream.{DStream, ReceiverInputDStream}
import org.apache.spark.streaming.{Seconds, StreamingContext}

object WindowOperationWC {
  def main(args: Array[String]): Unit = {
    val conf: SparkConf = new SparkConf().setAppName("WindowOperationWC").setMaster("local[2]")
    val sc: SparkContext = new SparkContext(conf)
    // 5-second batch interval
    val ssc: StreamingContext = new StreamingContext(sc, Seconds(5))
    ssc.checkpoint("hdfs://hadoop01:9000/ck_20171030_2")
    val dStream: ReceiverInputDStream[String] = ssc.socketTextStream("hadoop01", 8888)
    val tup: DStream[(String, Int)] = dStream.flatMap(_.split(" ")).map((_, 1))
    // window length 10s, sliding interval 10s: both are multiples of the 5-second batch interval
    val res: DStream[(String, Int)] = tup.reduceByKeyAndWindow((x: Int, y: Int) => x + y, Seconds(10), Seconds(10))
    res.print()
    ssc.start()
    ssc.awaitTermination()
  }
}
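With a 5-second batch interval, a 10-second window and a 10-second slide are both multiples of the batch interval, so every 10 seconds the job emits the word counts for exactly the last two batches, and consecutive windows do not overlap.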