Goal: monitor the wctotal.log file under /opt/datas/spark-flume, feed its contents through Spark Streaming for micro-batch processing, and print the number of events to the screen.
The experiment runs in a pseudo-distributed environment, with Spark started in local mode (CDH 5.5.0).
To step through each line of code interactively, the shell is started with bin/spark-shell --master local[2] (at least two threads are needed: one for the Flume receiver and one for processing).
Integrating Flume with Spark Streaming requires adding three jars to the Spark classpath:
flume-avro-source-1.6.0-cdh5.5.0.jar
flume-ng-sdk-1.6.0-cdh5.5.0.jar
// the two jars above are in Flume's lib directory
spark-streaming-flume_2.10-1.5.2.jar
// this jar is in the external/flume/target directory after Spark is built
Start Spark in local mode with the following command (substitute your own paths for the three jars):
bin/spark-shell \
--master local[4] \
--jars externalJars/flume-ng-sdk-1.6.0-cdh5.5.0.jar,externalJars/flume-avro-source-1.6.0-cdh5.5.0.jar,externalJars/spark-streaming-flume_2.10-1.5.2.jar
Remember to separate the jars with commas and no spaces in between.
Enter the following code in the shell (it keeps receiving the stream and processing it in batches):
import org.apache.spark._
import org.apache.spark.streaming._
import org.apache.spark.streaming.StreamingContext._ // not necessary since Spark 1.3
import org.apache.spark.streaming.flume._
import org.apache.spark.storage.StorageLevel
val ssc = new StreamingContext(sc, Seconds(5))  // 5-second batch interval, reusing the shell's SparkContext (sc)
// Push-based receiver: Spark Streaming runs an Avro server on BPF:9999, matching the avro sink configured below
val stream = FlumeUtils.createStream(ssc, "BPF", 9999, StorageLevel.MEMORY_ONLY_SER_2)
// Count the events in each batch and print the result
stream.count().map(cnt => "Received " + cnt + " flume events.").print()
ssc.start() // Start the computation
ssc.awaitTermination() // Wait for the computation to terminate
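The same logic can also be packaged as a standalone application and launched with spark-submit instead of the shell. A minimal sketch is shown below; the object name FlumeEventCount is only an example, and the three jars above still have to be supplied via --jars or bundled into the application jar.
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.flume.FlumeUtils
import org.apache.spark.storage.StorageLevel

// Example standalone version of the shell code above (name and master are illustrative)
object FlumeEventCount {
  def main(args: Array[String]): Unit = {
    // at least two local threads: one for the Flume receiver, one for processing
    val conf = new SparkConf().setMaster("local[4]").setAppName("FlumeEventCount")
    val ssc = new StreamingContext(conf, Seconds(5))
    val stream = FlumeUtils.createStream(ssc, "BPF", 9999, StorageLevel.MEMORY_ONLY_SER_2)
    stream.count().map(cnt => "Received " + cnt + " flume events.").print()
    ssc.start()
    ssc.awaitTermination()
  }
}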
Flume agent configuration (conf/flume_spark_push.conf):
# define agent
a2.sources = r2
a2.channels = c2
a2.sinks = k2
# define sources
a2.sources.r2.type = exec
a2.sources.r2.command = tail -f /opt/datas/spark-flume/wctotal.log
a2.sources.r2.shell = /bin/bash -c
# define channels
a2.channels.c2.type = memory
a2.channels.c2.capacity = 1000
a2.channels.c2.transactionCapacity = 100
# define sinks
a2.sinks.k2.type = avro
a2.sinks.k2.hostname = BPF
a2.sinks.k2.port = 9999
# bind channels to sources and sinks
a2.sources.r2.channels = c2
a2.sinks.k2.channel = c2
Start the Flume agent:
bin/flume-ng agent -c conf -n a2 -f conf/flume_spark_push.conf -Dflume.root.logger=DEBUG,console
Manually append to wctotal.log (e.g., with echo) and watch the Spark Streaming output.
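For example (the sample text is arbitrary):
echo "hello spark streaming" >> /opt/datas/spark-flume/wctotal.log
Each 5-second batch then prints a line of the form "Received N flume events." in the spark-shell console, where N is the number of new lines picked up by tail -f during that batch.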