pom.xml:
Spark Streaming:
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-streaming_2.11</artifactId>
    <version>2.1.0</version>
</dependency>
Flume:
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-streaming-flume_2.11</artifactId>
    <version>2.1.0</version>
</dependency>
Apache Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data. These notes describe how to configure Flume so that Spark Streaming can receive data from it. There are two approaches.
Note: Flume support is deprecated as of Spark 2.3.0.
Approach 1: Flume-style Push-based Approach
Flume is designed to push data between Flume agents. In this approach, Spark Streaming essentially sets up a receiver that acts as an Avro agent for Flume, to which Flume can push data. The configuration steps follow.
1. General Requirements
Choose a machine in your cluster that satisfies the following conditions:
- When your Flume + Spark Streaming application is launched, one of the Spark workers must run on that machine.
- Flume can be configured to push data to a port on that machine.
Because of this push model, the Spark Streaming application must already be running, with the receiver scheduled and listening on the chosen port, before Flume starts pushing data.
2. Configuring Flume
Configure the Flume agent to send data to an Avro sink as follows:
# Describe the sink
exec-memory-avro.sinks.avro-sink.type = avro
exec-memory-avro.sinks.avro-sink.hostname = 192.168.188.1
exec-memory-avro.sinks.avro-sink.port = 9494
The Avro sink pushes the data to the specified host and port; the Spark application then receives it from there via FlumeUtils.
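For context, a complete agent configuration might look like the following sketch. Only the sink section above comes from these notes; the exec source command, the channel choice, and the wiring of names are assumptions to make the example self-contained:

```properties
# Hypothetical complete agent config: exec source -> memory channel -> avro sink.
# Component names match the exec-memory-avro prefix used above.
exec-memory-avro.sources = exec-source
exec-memory-avro.channels = memory-channel
exec-memory-avro.sinks = avro-sink

# Tail a log file (example command; adjust the path to your environment)
exec-memory-avro.sources.exec-source.type = exec
exec-memory-avro.sources.exec-source.command = tail -F /var/log/app.log
exec-memory-avro.sources.exec-source.channels = memory-channel

# Buffer events in memory between source and sink
exec-memory-avro.channels.memory-channel.type = memory

# Push events to the host/port where the Spark receiver listens
exec-memory-avro.sinks.avro-sink.type = avro
exec-memory-avro.sinks.avro-sink.hostname = 192.168.188.1
exec-memory-avro.sinks.avro-sink.port = 9494
exec-memory-avro.sinks.avro-sink.channel = memory-channel
```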
3. Configuring Spark Streaming Application
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.flume.FlumeUtils

object FlumePushWordCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("FlumePushWordCount")
    val ssc = new StreamingContext(conf, Seconds(5))
    ssc.sparkContext.setLogLevel("ERROR")

    // Push-based receiver: listens on the host/port that Flume's avro sink pushes to
    val flumeStream = FlumeUtils.createStream(ssc, "192.168.188.1", 9494)

    // Decode each Flume event body, then run a word count
    val result = flumeStream
      .map(x => new String(x.event.getBody.array()).trim)
      .flatMap(_.split(" "))
      .map((_, 1))
      .reduceByKey(_ + _)

    result.print()

    ssc.start()
    ssc.awaitTermination()
  }
}
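The word-count transformation chain used above can be illustrated with plain Scala collections, independent of Spark and Flume (a local sketch; the sample lines are made up, and `groupBy` plus a sum stands in for `reduceByKey`):

```scala
// Stand-in for the decoded Flume event bodies (sample data, not from Flume)
val lines = Seq("hello spark", "hello flume")

val counts = lines
  .flatMap(_.split(" "))                                   // split lines into words
  .map((_, 1))                                             // pair each word with 1
  .groupBy(_._1)                                           // collections analogue of reduceByKey
  .map { case (word, ones) => (word, ones.map(_._2).sum) } // sum the counts per word

println(counts)
```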
Approach 2: Pull-based Approach using a Custom Sink
Instead of Flume pushing data directly to Spark Streaming, this approach runs a custom Flume sink that works as follows:
- Flume pushes data into the sink, where it stays buffered.
- Spark Streaming uses a reliable Flume receiver and transactions to pull data from the sink. A transaction succeeds only after the data has been received and replicated by Spark Streaming.
Compared with the previous approach, this ensures stronger reliability and fault-tolerance guarantees.
1. General Requirements
Choose a machine that will run the custom sink in a Flume agent. The rest of the Flume pipeline is configured to send data to that agent. Machines in the Spark cluster should have access to the chosen machine running the custom sink.
2. Configuring Flume
Configure the Flume agent on the chosen machine to use the custom Spark sink:
exec-memory-avro.sinks.avro-sink.type = org.apache.spark.streaming.flume.sink.SparkSink
exec-memory-avro.sinks.avro-sink.hostname = slave1
exec-memory-avro.sinks.avro-sink.port = 9499
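Note that the pull-based approach requires the custom sink and its dependencies to be on the Flume agent's classpath (e.g. dropped into Flume's lib directory); the Spark Streaming + Flume integration guide lists the required artifacts. For Spark 2.1.0 with Scala 2.11 they would be roughly the following (the scala-library and commons-lang3 versions here are assumptions; check the guide for your exact Spark version):

```xml
<!-- JARs needed on the Flume agent's classpath for SparkSink -->
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-streaming-flume-sink_2.11</artifactId>
    <version>2.1.0</version>
</dependency>
<dependency>
    <groupId>org.scala-lang</groupId>
    <artifactId>scala-library</artifactId>
    <version>2.11.8</version>
</dependency>
<dependency>
    <groupId>org.apache.commons</groupId>
    <artifactId>commons-lang3</artifactId>
    <version>3.5</version>
</dependency>
```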
Data sent to this host is buffered in the sink; the Spark Streaming application then pulls it from there via FlumeUtils.
3. Configuring Spark Streaming Application
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.flume.FlumeUtils

object FlumePullWordCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("FlumePullWordCount")
    val ssc = new StreamingContext(conf, Seconds(5))
    ssc.sparkContext.setLogLevel("ERROR")

    // Pull-based receiver: polls the custom SparkSink running inside the Flume agent
    val flumeStream = FlumeUtils.createPollingStream(ssc, "slave1", 9499)

    val result = flumeStream
      .map(x => new String(x.event.getBody.array()).trim)
      .flatMap(_.split(" "))
      .map((_, 1))
      .reduceByKey(_ + _)

    result.print()

    ssc.start()
    ssc.awaitTermination()
  }
}
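Unlike the push approach, here the Flume agent (hosting the SparkSink) should be started before the Spark application, since the sink buffers data until Spark polls it. A typical launch sequence might look like the following sketch (config file name, jar path, and class packaging are assumptions):

```
# Start the Flume agent first so the SparkSink is up and buffering
flume-ng agent \
  --name exec-memory-avro \
  --conf $FLUME_HOME/conf \
  --conf-file $FLUME_HOME/conf/exec-memory-avro.conf \
  -Dflume.root.logger=INFO,console

# Then submit the Spark Streaming application, pulling in the Flume integration package
spark-submit \
  --class FlumePullWordCount \
  --packages org.apache.spark:spark-streaming-flume_2.11:2.1.0 \
  target/app.jar
```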