Preface
All installation packages & Flume configuration files needed for this article have been uploaded; the download link is Installation packages & Flume configuration files for this article. Please download them yourself~
- As a framework for real-time log collection, Flume can be integrated with the Spark Streaming real-time processing framework.
- Flume produces data in real time, and Spark Streaming processes it in real time.
- There are two ways to connect Spark Streaming to Flume: Flume can push messages to Spark Streaming (Push), or Spark Streaming can pull data from Flume (Poll); a short sketch contrasting the two follows this list.
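Both modes are exposed through FlumeUtils in the spark-streaming-flume dependency (added in section 1.2). A minimal sketch contrasting them, with placeholder host names:

import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.flume.FlumeUtils

// Push: Flume's avro sink sends events to a receiver running inside Spark
def pushMode(ssc: StreamingContext) =
  FlumeUtils.createStream(ssc, "spark-receiver-host", 8888)

// Poll: Spark pulls events buffered by Flume's SparkSink (the mode used in this article)
def pollMode(ssc: StreamingContext) =
  FlumeUtils.createPollingStream(ssc, "flume-agent-host", 8888)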
1. Spark Streaming Pulls (Poll) Data from Flume
1.1 Flume preparation
- Install Flume 1.6 or later
- Download the dependency jar
  Put spark-streaming-flume-sink_2.11-2.0.2.jar into Flume's lib directory.
- Update the Scala dependency version under flume/lib
  From the jars folder of the Spark installation directory, take scala-library-2.11.8.jar and use it to replace the older Scala library jar (a 2.10.x version) that ships in flume/lib.
- Write the Flume agent. Note that since data is pulled, Flume only needs to produce data on the machine it runs on.
- Write the flume-poll.conf configuration file
a1.sources = r1
a1.sinks = k1
a1.channels = c1
#source
a1.sources.r1.channels = c1
a1.sources.r1.type = spooldir
a1.sources.r1.spoolDir = /root/data
a1.sources.r1.fileHeader = true
#channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 20000
a1.channels.c1.transactionCapacity = 5000
#sinks
a1.sinks.k1.channel = c1
a1.sinks.k1.type = org.apache.spark.streaming.flume.sink.SparkSink
a1.sinks.k1.hostname = hdp-node-01
a1.sinks.k1.port = 8888
a1.sinks.k1.batchSize = 2000
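A sizing note on this config: the sink's batchSize (2000) is how many events SparkSink hands over per transaction, so it must not exceed the channel's transactionCapacity (5000), which in turn must not exceed capacity (20000); violating this ordering makes the sink's channel transactions fail.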
1.2 Spark Streaming preparation: write the Spark Streaming program
- Add the dependency to the pom file
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-streaming-flume_2.11</artifactId>
    <version>2.0.2</version>
</dependency>
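If the build uses sbt rather than Maven, the equivalent coordinate would look like this (a sketch; the %% operator appends the Scala binary version, 2.11 here):

libraryDependencies += "org.apache.spark" %% "spark-streaming-flume" % "2.0.2"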
- Write the Spark program in Scala
The program must specify the Flume agent's IP address and port; see the sink settings of the Flume conf in section 1.1.
package cn.acece.sparkStreamingtest

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.dstream.{DStream, ReceiverInputDStream}
import org.apache.spark.streaming.flume.{FlumeUtils, SparkFlumeEvent}

//TODO: Spark Streaming integration with Flume -- poll (pull) mode
object SparkStreamingPollFlume {
  def main(args: Array[String]): Unit = {
    // 1. Create SparkConf
    val sparkConf: SparkConf = new SparkConf().setAppName("SparkStreamingPollFlume").setMaster("local[2]")
    // 2. Create SparkContext
    val sc = new SparkContext(sparkConf)
    sc.setLogLevel("WARN")
    // 3. Create StreamingContext with a 5-second batch interval
    val ssc = new StreamingContext(sc, Seconds(5))
    ssc.checkpoint("./flume")
    // 4. Pull data from Flume via FlumeUtils.createPollingStream;
    //    the host and port must match a1.sinks.k1.hostname/port in flume-poll.conf
    val pollingStream: ReceiverInputDStream[SparkFlumeEvent] = FlumeUtils.createPollingStream(ssc, "192.168.200.100", 8888)
    // 5. Extract the body of each Flume event {"headers":xxxxxx,"body":xxxxx}
    val data: DStream[String] = pollingStream.map(x => new String(x.event.getBody.array()))
    // 6. Split each line and map every word to 1
    val wordAndOne: DStream[(String, Int)] = data.flatMap(_.split(" ")).map((_, 1))
    // 7. Accumulate the occurrence counts of identical words across batches
    val result: DStream[(String, Int)] = wordAndOne.updateStateByKey(updateFunc)
    // 8. Print the output
    result.print()
    // 9. Start the streaming computation
    ssc.start()
    ssc.awaitTermination()
  }

  // currentValues: all the 1s for each word in the current batch, e.g. (hadoop,1) (hadoop,1) (hadoop,1)
  // historyValues: the running total for each word over all previous batches, e.g. (hadoop,100)
  def updateFunc(currentValues: Seq[Int], historyValues: Option[Int]): Option[Int] = {
    val newValue: Int = currentValues.sum + historyValues.getOrElse(0)
    Some(newValue)
  }
}
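As a quick sanity check of updateFunc's semantics, the following standalone sketch (not part of the job; the sample numbers are made up) shows how the current batch's ones are added onto the running total:

// three occurrences of a word in this batch, 100 seen in earlier batches -> 103
assert(SparkStreamingPollFlume.updateFunc(Seq(1, 1, 1), Some(100)) == Some(103))
// a word never seen before starts from zero
assert(SparkStreamingPollFlume.updateFunc(Seq(1), None) == Some(1))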
1.3 Since Spark Streaming pulls the data from Flume, Flume must be started first
- Prepare the data file data.txt under the /root/data directory on the server before starting Flume
cd /root/data
- Start Flume
flume-ng agent -n a1 \
-c /opt/bigdata/flume/conf \
-f /opt/bigdata/flume/conf/flume-poll.conf \
-Dflume.root.logger=INFO,console
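Once the agent is running, any file dropped into /root/data is ingested by the spooldir source; after a file has been fully consumed, Flume renames it with a .COMPLETED suffix. To feed new data, copy a fresh file into the directory rather than editing one that has already been processed.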
- Start the Spark Streaming program to pull data from Flume
Launch the Spark program written in section 1.2 from IDEA.
1.4 Observe the IDEA console output
Spark Streaming pulls the data from Flume successfully and runs perfectly~