Spark Streaming and Flume Integration in Practice

Author: NicholasEcho

Article link: https://blog.csdn.net/weixin_41615494/article/details/79521120

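This article shows how to integrate Spark Streaming with Flume for real-time data processing. It covers two integration modes: Spark Streaming pulling data from Flume, and Flume pushing data to Spark Streaming. Worked examples show how to write the Flume configuration files and the Spark programs.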

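Flume is a framework for collecting logs in real time, and it can be connected to Spark Streaming, a real-time processing framework: Flume produces the data continuously and Spark Streaming processes it as it arrives. Spark Streaming can be connected to Flume NG in two ways. In push mode, Flume NG pushes messages to Spark Streaming. In poll mode, Spark Streaming pulls the data from Flume.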

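1. Poll mode

(1) Install Flume 1.6 or later

(2) Download the dependency jar

Put spark-streaming-flume-sink_2.11-2.0.2.jar into Flume's lib directory. The Spark documentation for this integration also lists the matching Scala library jar and the commons-lang3 jar as required on Flume's classpath.

(3) Write the Flume agent. Because Spark pulls the data, the Flume sink only needs to expose the data on the machine where Flume itself runs.

(4) Write the flume-poll.conf configuration file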

a1.sources = r1
a1.sinks = k1
a1.channels = c1

#source
a1.sources.r1.channels = c1
a1.sources.r1.type = spooldir
# directory where the data files are placed
a1.sources.r1.spoolDir = /root/data
a1.sources.r1.fileHeader = true

#channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 20000
a1.channels.c1.transactionCapacity = 5000

#sinks
a1.sinks.k1.channel = c1
a1.sinks.k1.type = org.apache.spark.streaming.flume.sink.SparkSink
a1.sinks.k1.hostname = node-1
a1.sinks.k1.port = 8888
a1.sinks.k1.batchSize = 2000
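
The sink's hostname and port in this file are the values the Spark application has to connect to, so it can help to read them from the file rather than retype them. The following is a minimal sketch, not part of the original article: `SinkEndpoint` and `read` are hypothetical names, and the sketch assumes the configuration is in the standard Java properties format so that `java.util.Properties` can parse it.

```scala
import java.io.StringReader
import java.util.Properties

// Hypothetical helper (not part of Flume or Spark).
object SinkEndpoint {
  /** Extracts (hostname, port) of one sink from Flume agent properties text.
    * Returns None if either property is missing. */
  def read(conf: String, agent: String = "a1", sink: String = "k1"): Option[(String, Int)] = {
    val props = new Properties()
    props.load(new StringReader(conf)) // handles "k=v", "k = v" and '#' comment lines
    val prefix = s"$agent.sinks.$sink"
    for {
      host <- Option(props.getProperty(s"$prefix.hostname"))
      port <- Option(props.getProperty(s"$prefix.port"))
    } yield (host.trim, port.trim.toInt)
  }
}
```

For the file in this section, `SinkEndpoint.read(text)` would yield the host `node-1` and the port `8888`, which are the values to pass to the Spark side.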

Command to start Flume:

bin/flume-ng agent -n a1 -c conf/ -f conf/flume-poll.conf -Dflume.root.logger=INFO,console

(5) Start the Spark Streaming application, which pulls the data from the machine where Flume runs

(6) Code implementation

Add the following dependency to pom.xml:

<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-streaming-flume_2.11</artifactId>
    <version>2.0.2</version>
</dependency>
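
For projects built with sbt instead of Maven, the same dependency can be sketched as below. This is an assumption-laden sketch rather than something from the original article: it assumes the project uses a Scala 2.11.x release (2.11.8 is used here only as an example) so that `%%` resolves to the `_2.11` artifact.

```scala
// build.sbt (sketch): the Scala version is an assumption and must be a 2.11.x release
scalaVersion := "2.11.8"

// Same coordinates as the Maven dependency; %% appends the _2.11 suffix
libraryDependencies += "org.apache.spark" %% "spark-streaming-flume" % "2.0.2"
```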

The full code is as follows:

package cn.testdemo.dstream.flume

import java.net.InetSocketAddress
import java.nio.charset.StandardCharsets
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.dstream.{DStream, ReceiverInputDStream}
import org.apache.spark.streaming.flume.{FlumeUtils, SparkFlumeEvent}

// Word count over Flume data with Spark Streaming: poll (pull) mode
object SparkStreamingFlume_Poll {
  def main(args: Array[String]): Unit = {
    // 1. Create the SparkConf
    val sparkConf: SparkConf = new SparkConf().setAppName("SparkStreamingFlume_Poll").setMaster("local[2]")
    // 2. Create the SparkContext
    val sc = new SparkContext(sparkConf)
    sc.setLogLevel("WARN")
    // 3. Create the StreamingContext with a 5-second batch interval
    val ssc = new StreamingContext(sc, Seconds(5))
    // Addresses of the Flume agents to pull from; several agents can be read at once.
    // Each entry must match the hostname and port of a SparkSink (port 8888 in flume-poll.conf).
    val address = Seq(new InetSocketAddress("192.168.216.120", 8888), new InetSocketAddress("192.168.216.121", 8888))
    // 4. Pull the data from Flume
    val stream: ReceiverInputDStream[SparkFlumeEvent] =
      FlumeUtils.createPollingStream(ssc, address, StorageLevel.MEMORY_AND_DISK_SER_2)
    // 5. Extract the body of each Flume event, which has the form {"header":xxxxx "body":xxxxxx}
    val lineDstream: DStream[String] = stream.map(x => new String(x.event.getBody.array(), StandardCharsets.UTF_8))
    // 6. Split each line into words and pair each word with 1
    val wordAndOne: DStream[(String, Int)] = lineDstream.flatMap(_.split(" ")).map((_, 1))
    // 7. Sum the counts of identical words
    val result: DStream[(String, Int)] = wordAndOne.reduceByKey(_ + _)
    // 8. Print the result
    result.print()

    // Start the computation and wait for it to terminate
    ssc.start()
    ssc.awaitTermination()
  }
}
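
The transformation chain in steps 5 to 7 can be checked without a running Flume agent or Spark cluster. The sketch below is an illustration under stated assumptions, not the article's code: `WordCountSketch`, `decode` and `countWords` are hypothetical names, plain Scala collections stand in for the DStream, `groupBy` plus a sum stands in for `reduceByKey`, and empty tokens are filtered out, which the streaming job does not do.

```scala
import java.nio.ByteBuffer
import java.nio.charset.StandardCharsets

// Hypothetical stand-in for the streaming job's per-batch logic.
object WordCountSketch {
  // Decode an event body as UTF-8, copying only the readable bytes
  // (a slightly more defensive variant of calling array() directly).
  def decode(body: ByteBuffer): String = {
    val bytes = new Array[Byte](body.remaining())
    body.duplicate().get(bytes) // duplicate() leaves the caller's position untouched
    new String(bytes, StandardCharsets.UTF_8)
  }

  // Split lines on spaces, pair each word with 1, then sum per word.
  def countWords(lines: Seq[String]): Map[String, Int] =
    lines.flatMap(_.split(" ")).filter(_.nonEmpty).map((_, 1))
      .groupBy(_._1)
      .map { case (word, pairs) => (word, pairs.map(_._2).sum) }
}
```

Feeding it the bodies "hello spark" and "hello flume" gives a count of 2 for "hello" and 1 each for "spark" and "flume", which is the shape of output to expect per batch from `result.print()`.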

(7) Watch the output in the IntelliJ IDEA console

2. Push mode

(1) Write the flume-push.conf configuration file

#push mode
a1.sources = r1
a1.sinks = k1
a1.channels = c1

#source
a1.sources.r1.channels = c1
a1.sources.r1.type = spooldir
a1.sources.r1.spoolDir = /root/data
a1.sources.r1.fileHeader = true

#channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 20000
a1.channels.c1.transactionCapacity = 5000

#sinks
a1.sinks.k1.channel = c1
a1.sinks.k1.type = avro
a1.sinks.k1.hostname = 192.168.11.25
a1.sinks.k1.port = 8888
# the Avro sink names this property batch-size, not batchSize
a1.sinks.k1.batch-size = 2000

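Note that the hostname and port in this configuration file are the IP address and port of the server where the Spark application runs. The Spark receiver has to be listening on that address before Flume can deliver events, so start the Spark application first.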

Start Flume:

bin/flume-ng agent -n a1 -c conf/ -f conf/flume-push.conf -Dflume.root.logger=INFO,console

(2) The code is as follows:

package cn.testdemo.dstream.flume

import java.nio.charset.StandardCharsets
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.dstream.{DStream, ReceiverInputDStream}
import org.apache.spark.streaming.flume.{FlumeUtils, SparkFlumeEvent}

// Word count over Flume data with Spark Streaming: push mode
object SparkStreamingFlume_Push {

  def main(args: Array[String]): Unit = {
    // 1. Create the SparkConf
    val sparkConf: SparkConf = new SparkConf().setAppName("SparkStreamingFlume_Push").setMaster("local[2]")
    // 2. Create the SparkContext
    val sc = new SparkContext(sparkConf)
    sc.setLogLevel("WARN")
    // 3. Create the StreamingContext with a 5-second batch interval
    val ssc = new StreamingContext(sc, Seconds(5))
    // 4. Receive the data pushed by Flume.
    // The host and port must match a1.sinks.k1.hostname and a1.sinks.k1.port (8888) in flume-push.conf.
    val stream: ReceiverInputDStream[SparkFlumeEvent] = FlumeUtils.createStream(ssc, "192.168.11.25", 8888)
    // 5. Extract the body of each Flume event, which has the form {"header":xxxxx "body":xxxxxx}
    val lineDstream: DStream[String] = stream.map(x => new String(x.event.getBody.array(), StandardCharsets.UTF_8))
    // 6. Split each line into words and pair each word with 1
    val wordAndOne: DStream[(String, Int)] = lineDstream.flatMap(_.split(" ")).map((_, 1))
    // 7. Sum the counts of identical words
    val result: DStream[(String, Int)] = wordAndOne.reduceByKey(_ + _)
    // 8. Print the result
    result.print()
    // Start the computation and wait for it to terminate
    ssc.start()
    ssc.awaitTermination()
  }
}
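
Both programs hard-code the host and port, which makes it easy for them to drift away from the Flume configuration. One option is to pass the endpoints on the command line. The following is a minimal sketch under stated assumptions: `FlumeAddresses` and `parse` are hypothetical names, and the `host:port,host:port` argument format is an illustrative choice, not something the article or Spark prescribes.

```scala
import java.net.InetSocketAddress

// Hypothetical helper for turning a command-line argument into endpoints.
object FlumeAddresses {
  /** Parses "host1:port1,host2:port2" into socket addresses.
    * Throws IllegalArgumentException on an entry that is not host:port. */
  def parse(spec: String): Seq[InetSocketAddress] =
    spec.split(",").map(_.trim).filter(_.nonEmpty).map { entry =>
      val idx = entry.lastIndexOf(':')
      require(idx > 0 && idx < entry.length - 1, s"expected host:port, got '$entry'")
      new InetSocketAddress(entry.substring(0, idx), entry.substring(idx + 1).toInt)
    }.toSeq
}
```

In poll mode the result could be passed straight to `FlumeUtils.createPollingStream` as the address list, for example `FlumeAddresses.parse(args(0))`. In push mode the first entry's `getHostString` and `getPort` would supply the arguments to `FlumeUtils.createStream`.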