File Streams
- Real-time log capture: monitor a directory so that any new file appearing in it is captured as part of the stream (a small feeder sketch follows the program below).
- 1. Create the directory to be monitored:
cd /usr/local/spark/mycode
mkdir streaming
cd streaming
mkdir logfile
cd logfile
- 2. Spark Scala directory-monitoring program implementing word count:
import org.apache.spark._
import org.apache.spark.streaming._

object WordCountStreaming {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("file_stream")
    val ssc = new StreamingContext(conf, Seconds(2))
    // Define the input stream: watch the logfile directory created in step 1
    val lines = ssc.textFileStream("file:///usr/local/spark/mycode/streaming/logfile")
    // Stream computation: split each line into words and count them per batch
    val words = lines.flatMap(_.split(" "))
    val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _)
    wordCounts.print()
    // Start the streaming computation
    ssc.start()
    ssc.awaitTermination() // block until stopped; an error terminates the stream
  }
}
// Run /usr/sbt/sbt/bin/sbt package in the streaming directory to compile and package,
// or call WordCountStreaming.main(Array()) directly in the spark-shell REPL to run it.
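Note that textFileStream only picks up files created in the monitored directory after the stream has started; modifying an existing file does not feed new data into a batch. Below is a minimal sketch of how the stream could be driven while the program above is running; LogFeeder is a hypothetical helper, and the path is assumed to be the logfile directory from step 1.

import java.io.{File, PrintWriter}

object LogFeeder {
  def main(args: Array[String]): Unit = {
    val dir = "/usr/local/spark/mycode/streaming/logfile" // assumed monitored directory
    for (i <- 1 to 5) {
      // Each iteration writes a brand-new file so textFileStream will detect it
      val writer = new PrintWriter(new File(dir, s"log$i.txt"))
      writer.write("hello spark hello streaming")
      writer.close()
      Thread.sleep(3000) // longer than the 2-second batch interval
    }
  }
}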
Creating a file stream with a standalone application
1. Create the program's directory structure:
cd /usr/local/spark/mycode
mkdir streaming
cd streaming
mkdir -p src/main/scala  # create the standard three-level source directory layout
cd src/main/scala
vim TestStreaming.scala
2. Write the following in TestStreaming.scala:
import org.apache.spark._
import org.apache.spark.streaming._

object WordCountStreaming {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("file_stream")
    val ssc = new StreamingContext(conf, Seconds(2))
    // Watch the logfile directory created in step 1 of the previous section
    val lines = ssc.textFileStream("file:///usr/local/spark/mycode/streaming/logfile")
    val words = lines.flatMap(_.split(" "))
    val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _)
    wordCounts.print()
    ssc.start()
    ssc.awaitTermination()
  }
}
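awaitTermination() blocks forever unless the job fails or is stopped externally. If you want the demo to shut itself down, one possible variation (not from the original notes, but using the standard StreamingContext API) is to replace the last two lines with a timed, graceful stop:

// Run for at most 60 seconds, then stop, letting in-flight batches finish
ssc.start()
if (!ssc.awaitTerminationOrTimeout(60000)) { // false means the timeout elapsed
  ssc.stop(stopSparkContext = true, stopGracefully = true)
}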
3. To compile and package, first create the sbt build file:
cd /usr/local/spark/mycode/streaming
vim simple.sbt
4. Write simple.sbt as follows:
name := "Simple Project"
version := "1.0"
scalaVersion := "2.11.8"
libraryDependencies += "org.apache.spark" % "spark-streaming_2.11" % "2.2.1"
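The _2.11 suffix of the Spark artifact must match scalaVersion (Spark 2.2.1 is published for Scala 2.11 only, which also matches the scala-2.11 jar path used in step 6). A common sbt alternative is the %% operator, which appends the Scala binary version automatically:

libraryDependencies += "org.apache.spark" %% "spark-streaming" % "2.2.1"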
5. Run the sbt compile-and-package command:
cd /usr/local/spark/mycode/streaming
/usr/local/sbt/sbt package
6. Launch the program:
cd /usr/local/spark/mycode/streaming
/usr/local/spark/bin/spark-submit --class "WordCountStreaming" /usr/local/spark/mycode/streaming/target/scala-2.11/simple-project_2.11-1.0.jar
7. Create a few .txt files containing some words in the logfile directory, and the running program will count them.
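For example (assuming the stream monitors /usr/local/spark/mycode/streaming/logfile, as in the code above):

cd /usr/local/spark/mycode/streaming/logfile
echo "hello spark hello world" > log1.txt
echo "spark streaming" > log2.txt

Within the next 2-second batch, the running program should print word-count pairs such as (hello,2) to its console.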