Official guide: http://spark.apache.org/docs/1.3.0/streaming-flume-integration.html
There are two integration approaches:
1. Flume pushes data to Spark Streaming (push-based)
2. Spark Streaming pulls data from Flume (pull-based, see the sketch below)
This walkthrough is based on approach 1.
A Flume agent has three components: source -> channel -> sink (here the avro sink pushes events to Spark Streaming)
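For contrast, approach 2 (pull) would configure a custom SparkSink (org.apache.spark.streaming.flume.sink.SparkSink, from the spark-streaming-flume-sink artifact) on the Flume agent, and the Spark side would poll it. A minimal receiver-side sketch with placeholder host/port, not used in the rest of this walkthrough:
import org.apache.spark.streaming._
import org.apache.spark.streaming.flume._
// poll the SparkSink running inside the Flume agent (placeholder host/port)
val ssc = new StreamingContext(sc, Seconds(5))
val pollingStream = FlumeUtils.createPollingStream(ssc, "flume-agent-host", 9988)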
1. In Flume's conf directory, add a file flume-spark-push.sh with the following contents:
# The configuration file needs to define the sources,
# the channels and the sinks.
# Sources, channels and sinks are defined per agent,
# in this case called 'a2'
#define agent name a2, and configuration
a2.sources = r2
a2.channels = c2
a2.sinks = k2
#define sources
a2.sources.r2.type = exec
a2.sources.r2.command = tail -f /opt/datas/spark-flume/wctotal.log
a2.sources.r2.shell = /bin/bash -c
#define channels
a2.channels.c2.type = memory
a2.channels.c2.capacity = 1000
a2.channels.c2.transactionCapacity = 100
#define sink
a2.sinks.k2.type = avro
a2.sinks.k2.hostname = hadoop-senior.ibeifeng.com
a2.sinks.k2.port = 9999
#define the sources and sink to the channel
a2.sources.r2.channels = c2
a2.sinks.k2.channel = c2
2. In the /opt/datas/spark-flume directory, prepare the file wctotal.log with some seed data:
hadoop spark hadoop
hadoop spark hadoop
hadoop spark hadoop
hadoop spark hadoop
hadoop spark hadoop
hadoop spark hadoop
hadoop spark hadoop
3. Copy the dependency jars into Spark's externallibs directory. They come from two places: the first is the jar built from the Spark source tree, the other two are jars from the Flume installation's lib directory:
cp /opt/modules/spark-1.3.0-src/external/flume/target/spark-streaming-flume_2.10-1.3.0.jar /opt/cdh-5.3.6/spark-1.3.0-bin-2.5.0-cdh5.3.6/externallibs/
cp /opt/cdh-5.3.6/flume-1.5.0-cdh5.3.6/lib/flume-avro-source-1.5.0-cdh5.3.6.jar /opt/cdh-5.3.6/spark-1.3.0-bin-2.5.0-cdh5.3.6/externallibs/
cp /opt/cdh-5.3.6/flume-1.5.0-cdh5.3.6/lib/flume-ng-sdk-1.5.0-cdh5.3.6.jar /opt/cdh-5.3.6/spark-1.3.0-bin-2.5.0-cdh5.3.6/externallibs/
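As an optional alternative to copying jars by hand, spark-shell in this release should also accept --packages, which resolves the integration artifact from Maven (assuming the machine has network access); a sketch, not used in the steps below:
bin/spark-shell --packages org.apache.spark:spark-streaming-flume_2.10:1.3.0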
4. Run in local mode: start spark-shell and load the dependency jars with --jars; multiple jars are separated by commas, with no spaces (a note on local cores follows the command):
bin/spark-shell --jars \
/opt/cdh-5.3.6/spark-1.3.0-bin-2.5.0-cdh5.3.6/externallibs/spark-streaming-flume_2.10-1.3.0.jar,\
/opt/cdh-5.3.6/spark-1.3.0-bin-2.5.0-cdh5.3.6/externallibs/flume-avro-source-1.5.0-cdh5.3.6.jar,\
/opt/cdh-5.3.6/spark-1.3.0-bin-2.5.0-cdh5.3.6/externallibs/flume-ng-sdk-1.5.0-cdh5.3.6.jar
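One caveat for local runs: the Flume receiver occupies one core, so if you pass an explicit master, give it at least two threads (the spark-shell default usually has enough). For example:
bin/spark-shell --master local[2] --jars <same jar list as above>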
5. At the spark-shell prompt, enter the following commands:
import org.apache.spark._
import org.apache.spark.streaming._
import org.apache.spark.streaming.StreamingContext._
import org.apache.spark.streaming.flume._
import org.apache.spark.storage.StorageLevel
// 5-second batch interval, reusing spark-shell's SparkContext (sc)
val ssc = new StreamingContext(sc, Seconds(5))
// the receiver binds to the host/port that the Flume avro sink pushes to
val stream = FlumeUtils.createStream(ssc, "hadoop-senior.ibeifeng.com", 9999, StorageLevel.MEMORY_ONLY_SER_2)
stream.count().map(cnt => "Received " + cnt + " flume event.").print()
ssc.start()
ssc.awaitTermination()
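If you wanted an actual word count over the pushed lines (the file is named wctotal.log, after all) rather than just an event count, something like the following could be added before ssc.start(); this is a hypothetical extension that assumes each event body is a plain text line:
// decode each Flume event body to a string, then split into words and count
val words = stream.map(e => new String(e.event.getBody.array())).flatMap(_.split(" "))
val counts = words.map(word => (word, 1)).reduceByKey(_ + _)
counts.print()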
6. Start the Flume agent:
bin/flume-ng agent -c conf -n a2 -f conf/flume-spark-push.sh -Dflume.root.logger=DEBUG,console
7. Go to the /opt/datas/spark-flume directory and append data to the end of wctotal.log:
echo "hadoop spark hadoop" >> wctotal.log
8. In the spark-shell window you should see log output showing that the new content was picked up:
Received 1 flume event.
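If nothing shows up, one way to check that the Spark receiver is actually listening on port 9999 is to push a test file straight at it with Flume's avro-client (bypassing the exec source); this assumes the same Flume installation:
bin/flume-ng avro-client -H hadoop-senior.ibeifeng.com -p 9999 -F /opt/datas/spark-flume/wctotal.log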