Spark Streaming Integration with Flume

Flume is a framework for real-time log collection and can be paired with Spark Streaming for real-time processing: Flume produces data continuously, and Spark Streaming processes it as it arrives. Spark Streaming connects to Flume NG in two ways: Flume NG can push messages to Spark Streaming, or Spark Streaming can pull data from Flume.

In practice, the pull approach is used more often. The SparkSink it relies on acts as a buffer; Spark Streaming pulls data from the sink through a reliable Flume receiver, the pulled data is stored with multiple replicas for better fault tolerance and stability, and the pull itself is transactional. With the push approach, data loss is possible.
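The two modes correspond to two different entry points in FlumeUtils from the spark-streaming-flume module. A minimal sketch of the difference (snippet-style, e.g. in spark-shell with the spark-streaming-flume jars on the classpath; host names and ports are placeholders):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.flume.FlumeUtils

val ssc = new StreamingContext(new SparkConf().setAppName("flume-modes").setMaster("local[*]"), Seconds(6))

// Pull mode: Spark Streaming polls the SparkSink that runs inside the Flume agent
val pulled = FlumeUtils.createPollingStream(ssc, "flume-host", 7474)

// Push mode: Spark Streaming opens an Avro receiver and Flume's avro sink pushes events to it
val pushed = FlumeUtils.createStream(ssc, "receiver-host", 5555)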
  • Install Flume 1.6 or later
  • Add the spark-streaming-flume and spark-streaming-flume-sink dependencies to the pom file
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-streaming-flume_2.11</artifactId>
    <version>${spark.version}</version>
</dependency>
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-streaming-flume-sink_2.11</artifactId>
    <version>${spark.version}</version>
</dependency>

Copy spark-streaming-flume-sink_2.11.jar into Flume's lib directory.

  • Add the avro dependencies:
<dependency>
    <groupId>org.apache.avro</groupId>
    <artifactId>avro</artifactId>
    <version>1.8.2</version>
</dependency>
<dependency>
    <groupId>org.apache.avro</groupId>
    <artifactId>avro-ipc</artifactId>
    <version>1.8.2</version>
</dependency>

Copy these two jars into Flume's lib directory as well, to avoid the error Could not initialize class org.apache.spark.streaming.flume.sink.EventBatch.

  1. Pull mode:
    Configuration file spark-streaming-flume-poll.conf:
b1.sources = r1
b1.sinks = k1
b1.channels = c1

#source
b1.sources.r1.type = netcat
b1.sources.r1.bind = localhost
b1.sources.r1.port = 44444

#channel
b1.channels.c1.type = memory
b1.channels.c1.capacity = 20000
b1.channels.c1.transactionCapacity = 5000

#sink
b1.sinks.k1.type = org.apache.spark.streaming.flume.sink.SparkSink
b1.sinks.k1.hostname = hadoop005
b1.sinks.k1.port = 7474
b1.sinks.k1.batchSize = 2000

#bind source and sink to the channel
b1.sources.r1.channels = c1
b1.sinks.k1.channel = c1

Start Flume:

bin/flume-ng agent -n b1 -c conf -f job/spark-streaming-flume-poll.conf -Dflume.root.logger=INFO,console

Start telnet:

telnet localhost 44444

Application code:

import org.apache.log4j.{Level, Logger}
import org.apache.spark.SparkConf
import org.apache.spark.streaming.dstream.ReceiverInputDStream
import org.apache.spark.streaming.flume.{FlumeUtils, SparkFlumeEvent}
import org.apache.spark.streaming.{Seconds, StreamingContext}

/**
  * WordCount over data pulled from Flume (pull mode)
  */
object SparkPollFlume {
  Logger.getLogger("org").setLevel(Level.ERROR)

  def main(args: Array[String]): Unit = {
    val conf: SparkConf = new SparkConf().setAppName(this.getClass.getSimpleName).setMaster("local[*]")
    val ssc = new StreamingContext(conf, Seconds(6))
    // Pull events from the SparkSink running inside the Flume agent on hadoop005:7474
    val pollData: ReceiverInputDStream[SparkFlumeEvent] = FlumeUtils.createPollingStream(ssc, "hadoop005", 7474)

    // Decode each event body, split into words, and count per 6-second batch
    pollData.map(x => new String(x.event.getBody.array()).trim)
      .flatMap(_.split(" +"))
      .map((_, 1))
      .reduceByKey(_ + _)
      .print(1000)

    // Equivalent step-by-step version:
    // val flumeData: DStream[String] = pollData.map(x => new String(x.event.getBody.array()).trim)
    // val flatData: DStream[String] = flumeData.flatMap(_.split(" +"))
    // val mapData: DStream[(String, Int)] = flatData.map((_, 1))
    // val result: DStream[(String, Int)] = mapData.reduceByKey(_ + _)
    // result.print(1000)

    ssc.start()
    ssc.awaitTermination()
  }
}
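createPollingStream also has an overload that takes a list of sink addresses, so a single receiver can pull from several Flume agents that each run a SparkSink. A minimal sketch, reusing the StreamingContext ssc from the example above; hadoop006 is a placeholder for a second agent host:

import java.net.InetSocketAddress
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.flume.FlumeUtils

// Pull from several SparkSink endpoints with a single receiver
val addresses = Seq(
  new InetSocketAddress("hadoop005", 7474),
  new InetSocketAddress("hadoop006", 7474)
)
val multiPollData = FlumeUtils.createPollingStream(ssc, addresses, StorageLevel.MEMORY_AND_DISK_SER_2)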
  2. Push mode:
    Configuration file spark-streaming-flume-push.conf:
#push mode
b1.sources = r1
b1.sinks = k1
b1.channels = c1

#source
b1.sources.r1.type = netcat
b1.sources.r1.bind = localhost
b1.sources.r1.port = 44444

#channel
b1.channels.c1.type = memory
b1.channels.c1.capacity = 20000
b1.channels.c1.transactionCapacity = 5000

#sink
b1.sinks.k1.type = avro
b1.sinks.k1.hostname = 192.168.100.121
b1.sinks.k1.port = 5555
b1.sinks.k1.batchSize = 2000

#bind source and sink to the channel
b1.sources.r1.channels = c1
b1.sinks.k1.channel = c1

Application code:

import org.apache.log4j.{Level, Logger}
import org.apache.spark.SparkConf
import org.apache.spark.streaming.dstream.{DStream, ReceiverInputDStream}
import org.apache.spark.streaming.flume.{FlumeUtils, SparkFlumeEvent}
import org.apache.spark.streaming.{Seconds, StreamingContext}

object FlumePushSpark {
  Logger.getLogger("org").setLevel(Level.ERROR)

  def main(args: Array[String]): Unit = {
    val conf: SparkConf = new SparkConf().setAppName(this.getClass.getSimpleName).setMaster("local[*]")
    val ssc = new StreamingContext(conf, Seconds(6))
    // Open an Avro receiver on 192.168.100.121:5555; Flume's avro sink pushes events to it
    val pushData: ReceiverInputDStream[SparkFlumeEvent] = FlumeUtils.createStream(ssc, "192.168.100.121", 5555)

    // Decode each event body, split into words, and count per 6-second batch
    val originData: DStream[String] = pushData.map(x => new String(x.event.getBody.array))
    val words: DStream[String] = originData.flatMap(_.split(" +"))
    val wordWithOne: DStream[(String, Int)] = words.map((_, 1))
    val result: DStream[(String, Int)] = wordWithOne.reduceByKey(_ + _)
    result.print(1000)

    ssc.start()
    ssc.awaitTermination()
  }
}
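createStream (and createPollingStream) also accepts an explicit StorageLevel if the default replicated memory-and-disk level is not what you want. A minimal sketch, assuming the same StreamingContext, host, and port as above:

import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.flume.FlumeUtils

// Same Avro receiver, but with an explicitly chosen storage level for the received events
val pushData = FlumeUtils.createStream(ssc, "192.168.100.121", 5555, StorageLevel.MEMORY_AND_DISK_SER_2)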

Note:
Because this is push mode, the receiving side must be started first (run the Scala program so its port is listening), and only then start Flume.
Command to start Flume:

bin/flume-ng agent -n b1 -c conf -f job/spark-streaming-flume-push.conf -Dflume.root.logger=INFO,console

Start telnet:

 telnet localhost 44444

If the error Could not configure sink k1 due to: No channel configured for sink: k1 appears, the sink was bound with channels instead of channel in the configuration file; change it to b1.sinks.k1.channel = c1 as shown above.
