// Local mode needs at least two threads: one for the receiver, one for processing
val sparkConf = new SparkConf().setMaster("local[*]").setAppName("hello Streaming")
// Micro-batch mode: one batch every 5 seconds
val streamingContext = new StreamingContext(sparkConf, Seconds(5))
val dStream: DStream[String] = streamingContext.socketTextStream("node102", 54321)
// Business logic:
val flatMapDStream = dStream.flatMap(_.split(" "))
val mapDStream = flatMapDStream.map((_, 1))
val reduceDStream = mapDStream.reduceByKey(_ + _)
reduceDStream.print()
reduceDStream.foreachRDD(_.foreach(println))
streamingContext.start() // start the job
streamingContext.awaitTermination() // block until terminated
Using the nc command on Linux. What nc can do:
Listen on an arbitrary TCP/UDP port: nc can act as a server listening on a given port over TCP or UDP
Port scanning: nc can act as a client initiating TCP or UDP connections
Transfer files between machines
Measure network throughput between machines
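For example, a simple file transfer between two machines (the second host and the file names here are hypothetical): the receiving side listens on a port and redirects the stream to a file, the sending side connects and pipes the file in.
[bduser@node103~]$ nc -l 9999 > received.txt
[bduser@node102~]$ nc node103 9999 < send.txt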
[bduser@node102~]$ nc -lk 54321
li zh en li test soa test spark soa
a
a a a a
b
b
b
bb
b
bb bb b bb b
-------------------------------------------
Time: 1584599230000 ms
-------------------------------------------
(,1)
(bb,1)
(a,5)
(b,4)
(,1)
(bb,1)
(a,5)
(b,4)
-------------------------------------------
Time: 1584599235000 ms
-------------------------------------------
(,1)
(bb,3)
(b,2)
(,1)
(bb,3)
(b,2)
-------------------------------------------
Time: 1584599240000 ms
-------------------------------------------
import java.io.{BufferedReader, InputStreamReader}
import java.net.{ConnectException, Socket}
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver

class MySocketReceiver(hostname: String, port: Int) extends Receiver[String](StorageLevel.MEMORY_ONLY) {
  override def onStart(): Unit = {
    // Receive on a separate thread; onStart must return immediately
    new Thread {
      override def run(): Unit = {
        var socket: Socket = null
        var inputStr: String = null
        try {
          socket = new Socket(hostname, port)
          val bufferReader = new BufferedReader(new InputStreamReader(socket.getInputStream, "UTF-8"))
          // In Scala an assignment evaluates to Unit, so the read must be wrapped in a block
          while (isStarted && { inputStr = bufferReader.readLine(); inputStr != null }) {
            store(inputStr)
          }
          restart("No more data, but the receiver must not stop; restarting")
        } catch {
          case e: ConnectException =>
            println("Connection failed"); restart("Connection lost, restarting the receiver")
          case _: Throwable =>
            restart("Unknown error, restarting the receiver")
        }
      }
    }.start()
  }

  override def onStop(): Unit = {}
}
The custom data source can then be used via streamingContext.receiverStream(<instance of custom receiver>):
// Local mode needs at least two threads
val sparkConf = new SparkConf().setMaster("local[*]").setAppName("hello Streaming")
// Micro-batch mode: one batch every 5 seconds
val streamingContext = new StreamingContext(sparkConf, Seconds(5))
// Text receiver
// val dStream: DStream[String] = streamingContext.socketTextStream("node102", 54321)
val dStream = streamingContext.receiverStream(new MySocketReceiver("node102", 12345))
// Business logic:
val flatMapDStream = dStream.flatMap(_.split(" "))
val mapDStream = flatMapDStream.map((_, 1))
val reduceDStream = mapDStream.reduceByKey(_ + _)
reduceDStream.print()
reduceDStream.foreachRDD(_.foreach(println))
streamingContext.start() // start the job
streamingContext.awaitTermination() // block until terminated
[bduser@node102~]$ nc -lk 12345
a a a a
a
a
a
a
a a a a a
a
a
a
-------------------------------------------
Time: 1584603280000 ms
-------------------------------------------
-------------------------------------------
Time: 1584603285000 ms
-------------------------------------------
(,3)
(a,13)
3.3 Basic Data Sources
File data source
Socket streams were shown in the examples above.
File streams: Spark Streaming can read files from any HDFS-API-compatible file system, via the fileStream method (see the sketch below).
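A minimal sketch of a file stream (the directory path here is hypothetical): textFileStream is the common shorthand for fileStream over plain text files, and it picks up files newly created in the monitored directory.
// Monitor a directory for new text files; each new file is read into a batch
val fileDStream: DStream[String] = streamingContext.textFileStream("hdfs://node102:9000/streaming/input")
fileDStream.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _).print()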
// Local mode needs at least two threads
val sparkConf = new SparkConf().setMaster("local[*]").setAppName("hello Streaming")
// Micro-batch mode: one batch every 5 seconds
val streamingContext = new StreamingContext(sparkConf, Seconds(5))
// Text receiver
val dStream: DStream[String] = streamingContext.socketTextStream("node102", 54321)
// val dStream = streamingContext.receiverStream(new MySocketReceiver("node102", 12345))
// Stateful transformation operators require a checkpoint directory
streamingContext.sparkContext.setCheckpointDir("./chkdir")
// Business logic:
val flatMapDStream = dStream.flatMap(_.split(" "))
val mapDStream = flatMapDStream.map((_, 1))
// Stateless transformation:
// val reduceDStream = mapDStream.reduceByKey(_ + _)
// reduceDStream.print()
// Stateful transformation: seq holds this batch's values for a key, state the running total
val updateStateDStream = mapDStream.updateStateByKey { (seq: Seq[Int], state: Option[Int]) =>
  val result = state.getOrElse(0) + seq.sum
  Some(result)
}
// Shorthand:
// mapDStream.updateStateByKey((seq: Seq[Int], state: Option[Int]) => Some(seq.sum + state.getOrElse(0)))
updateStateDStream.print()
updateStateDStream.foreachRDD(_.foreach(println))
streamingContext.start() // start the job
streamingContext.awaitTermination() // block until terminated
a
aa
a a
a
a
a
a aaaa
aa
a
-------------------------------------------
Time: 1584621695000 ms
-------------------------------------------
(aa,2)
(aaaa,1)
(,11)
(a,18)
(aa,2)
(aaaa,1)
(,11)
(a,18)
-------------------------------------------
Time: 1584621700000 ms
-------------------------------------------
(aa,2)
(aaaa,1)
(,11)
(a,18)
(aa,2)
(aaaa,1)
(,11)
(a,18)
Suppose you want to extend the earlier example to produce a word count every 10 seconds over the last 30 seconds of data. To do this, apply reduceByKey to the (word, 1) DStream over the last 30 seconds of data, using the reduceByKeyAndWindow operation.
For comparison, window functions in the SQL world:
select emp.*, rank() over (partition by deptno order by sal desc) from emp;
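Before the full example, a minimal sketch of the 30-second/10-second word count described above (assuming mapDStream: DStream[(String, Int)] as built in the surrounding examples):
// Window length 30s, slide interval 10s: every 10s, counts over the last 30s of data
val windowedWordCounts = mapDStream.reduceByKeyAndWindow(
  (a: Int, b: Int) => a + b, // merge counts inside the window
  Seconds(30),               // window length
  Seconds(10))               // slide interval
windowedWordCounts.print()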
// sliding over a plain collection, for intuition about windows
val list = List(1, 2, 3, 4, 5, 6, 7)
// first argument is the window length; second is the slide step
val iterator = list.sliding(3, 2)
for (elem <- iterator) { println(elem) }

// Local mode needs at least two threads
val sparkConf = new SparkConf().setMaster("local[*]").setAppName("hello Streaming")
// Micro-batch mode: one batch every 5 seconds
val streamingContext = new StreamingContext(sparkConf, Seconds(5))
// Text receiver
val dStream: DStream[String] = streamingContext.socketTextStream("node102", 54321)
// Window length 15s, slide interval 10s
val windowDStream = dStream.window(Seconds(15), Seconds(10))
// Business logic:
val flatMapDStream = windowDStream.flatMap(_.split(" "))
val mapDStream = flatMapDStream.map((_, 1))
// s is the running sum, v the incoming value; then window length and slide step
// val reduceDStream = mapDStream.reduceByKeyAndWindow((s: Int, v: Int) => s + v, Seconds(15), Seconds(10))
// reduceDStream.print()
// Stateless transformation:
val reduceDStream = mapDStream.reduceByKey(_ + _)
reduceDStream.print()
streamingContext.start() // start the job
streamingContext.awaitTermination() // block until terminated
List(1, 2, 3)
List(3, 4, 5)
List(5, 6, 7)
-------------------------------------------
Time: 1584625410000 ms
-------------------------------------------
-------------------------------------------
Time: 1584625420000 ms
-------------------------------------------
(aa,1)
(,2)
(a,13)
-------------------------------------------
Time: 1584671865000 ms
-------------------------------------------
-------------------------------------------
Time: 1584671870000 ms
-------------------------------------------
-------------------------------------------
Time: 1584671875000 ms
-------------------------------------------
-------------------------------------------
Time: 1584671880000 ms
-------------------------------------------
-------------------------------------------
Time: 1584671885000 ms
-------------------------------------------
(aa,1)
(,1)
(a,16)
(a ,1)
4.2.2 Flume
[bduser@node102~]$ vim flumespark.conf
a1.sources = r1
a1.sinks = k1
a1.channels = c1
# Describe/configure the source
a1.sources.r1.type = netcat
a1.sources.r1.bind = node102
a1.sources.r1.port = 44444
# Describe the sink
a1.sinks.k1.type = avro
a1.sinks.k1.hostname = node102
a1.sinks.k1.port = 12345
# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 100000
a1.channels.c1.transactionCapacity = 1000
# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
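With the configuration saved, the agent can be started, for example (the --conf directory follows the usual Flume layout; adjust to your installation):
[bduser@node102~]$ flume-ng agent --conf conf --conf-file flumespark.conf --name a1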
val sparkConf = new SparkConf().setMaster("local[*]").setAppName("Hello Streaming")
val streamingContext = new StreamingContext(sparkConf, Seconds(5))
// Push-based receiver: Spark listens on node102:12345 for the Flume avro sink
val flumeDStream = FlumeUtils.createStream(streamingContext, "node102", 12345)
// getBody returns a ByteBuffer, so decode it to a String before counting
flumeDStream.map(e => (new String(e.event.getBody.array()), 1)).reduceByKey(_ + _).print()
streamingContext.start()
streamingContext.awaitTermination()
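Note the startup order for this push-based approach: the Spark application must be running first, since it is the avro server that the Flume sink connects to on node102:12345. Then start the agent, and finally feed lines to the netcat source, e.g.:
[bduser@node102~]$ nc node102 44444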