Integrating Flume with Spark Streaming

青柠-柠

Category: Big Data

Article link: https://blog.csdn.net/jasmine_lh/article/details/79238316

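If you are reading this, I assume your environment is already installed and you can run code, so let's get straight to it.

There are two ways to integrate Spark Streaming with Flume. I only cover the push-based approach here (be sure to follow the steps in the order I give them).

I will cover end-to-end testing in both a local environment and a server environment.

We start with the local environment.

Step 1: Write the agent configuration (I recommend keeping the conf file in the flume/conf directory)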

flume_push_streaming.conf

simple-agent.sources = netcat-source
simple-agent.sinks = avro-sink
simple-agent.channels = memory-channel

simple-agent.sources.netcat-source.type = netcat
simple-agent.sources.netcat-source.bind = hadoop000
simple-agent.sources.netcat-source.port = 44444

simple-agent.sinks.avro-sink.type = avro
simple-agent.sinks.avro-sink.hostname = hadoop000
simple-agent.sinks.avro-sink.port = 41414

simple-agent.channels.memory-channel.type = memory

simple-agent.sources.netcat-source.channels = memory-channel
simple-agent.sinks.avro-sink.channel = memory-channel

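Explanation: the agent is named simple-agent. Its source is a netcat source listening on port 44444 of host hadoop000. Its avro sink sends the events to port 41414 on hadoop000, which is where the Spark Streaming receiver listens. The channel between them is a memory channel.

Step 2: Write the Spark Streaming program (I wrote it in Scala)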

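import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.flume.FlumeUtils

/**
  * Spark Streaming + Flume integration, approach 1 (push-based)
  */
object FlumePushWordCount {

  def main(args: Array[String]): Unit = {

    // Expect exactly two arguments: the receiver's hostname and port
    if (args.length != 2) {
      System.err.println("Usage: FlumePushWordCount <hostname> <port>")
      System.exit(1)
    }

    val Array(hostname, port) = args

    // local[2]: one thread for the receiver, one for processing
    val sparkConf = new SparkConf().setMaster("local[2]").setAppName("FlumePushWordCount")
    val ssc = new StreamingContext(sparkConf, Seconds(5))

    // Create the push-based Flume stream: the receiver listens on hostname:port
    // for the events sent by the Flume avro sink
    val flumeStream = FlumeUtils.createStream(ssc, hostname, port.toInt)

    // Decode each event body to a string, split it into words, and count them per batch
    flumeStream.map(x => new String(x.event.getBody.array()).trim)
      .flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _).print()

    ssc.start()
    ssc.awaitTermination()
  }
}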
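
The program above needs the Spark Streaming Flume module on the classpath. The original post does not show its build file, so the following is only a sketch for sbt; the Scala and Spark versions are assumptions and should be matched to your own installation.

```scala
// build.sbt (sketch). Both version numbers are assumptions, not taken from the post.
scalaVersion := "2.11.8"

val sparkVersion = "2.2.0"

libraryDependencies ++= Seq(
  // Core streaming API: StreamingContext, Seconds, DStream
  "org.apache.spark" %% "spark-streaming" % sparkVersion,
  // Provides org.apache.spark.streaming.flume.FlumeUtils
  "org.apache.spark" %% "spark-streaming-flume" % sparkVersion
)
```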

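Step 3: Start the Spark Streaming program (this must be done before starting the agent). In the push approach the Spark receiver is the Avro endpoint that the Flume avro sink connects to, so it has to be listening first. The <hostname> and <port> arguments should correspond to the avro sink's hostname and port (hadoop000 and 41414 in the configuration from Step 1).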
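
One way to confirm that the receiver is up before moving on to the agent is to probe its port. This is a minimal sketch and not part of the original walkthrough: `PortCheck` and `isListening` are hypothetical names, and the host and port in the usage comment are the avro sink settings from Step 1.

```scala
import java.io.IOException
import java.net.{InetSocketAddress, Socket}

object PortCheck {
  /** Returns true if a TCP connection to host:port succeeds within timeoutMs. */
  def isListening(host: String, port: Int, timeoutMs: Int = 1000): Boolean = {
    val socket = new Socket()
    try {
      socket.connect(new InetSocketAddress(host, port), timeoutMs)
      true
    } catch {
      // Connection refused, timeout, or unknown host: nothing reachable there
      case _: IOException => false
    } finally {
      socket.close()
    }
  }
}

// Usage (hypothetical): PortCheck.isListening("hadoop000", 41414)
```

If the check returns false, the avro sink would have nothing to connect to, so it is worth starting the Spark program again before launching the agent.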

Step 4: Start the agent

flume-ng agent \
--name simple-agent \
--conf $FLUME_HOME/conf \
--conf-file $FLUME_HOME/conf/flume_push_streaming.conf \
-Dflume.root.logger=INFO,console

Step 5: Connect to the source port with telnet hadoop000 44444

Type some lines of text into the telnet session.
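
If telnet is not available, the same input can be sent from a small program. The sketch below is an illustration under assumptions rather than part of the original post: `NetcatSourceClient` and `sendLines` are hypothetical names, and it relies on the netcat source treating each newline-terminated line as one event.

```scala
import java.io.{OutputStreamWriter, PrintWriter}
import java.net.{Socket, SocketTimeoutException}
import java.nio.charset.StandardCharsets

object NetcatSourceClient {
  /** Sends each line, newline-terminated, over a single TCP connection. */
  def sendLines(host: String, port: Int, lines: Seq[String]): Unit = {
    val socket = new Socket(host, port)
    try {
      socket.setSoTimeout(2000) // do not wait forever for replies
      val out = new PrintWriter(
        new OutputStreamWriter(socket.getOutputStream, StandardCharsets.UTF_8))
      lines.foreach(line => out.print(line + "\n"))
      out.flush()
      socket.shutdownOutput() // signal that no more data will follow

      // Drain any acknowledgements the server sends back so the
      // connection is closed cleanly instead of being reset.
      val in = socket.getInputStream
      val buffer = new Array[Byte](1024)
      try {
        while (in.read(buffer) != -1) {}
      } catch {
        case _: SocketTimeoutException => () // server kept the connection open
      }
    } finally {
      socket.close()
    }
  }
}

// Usage (hypothetical): NetcatSourceClient.sendLines("hadoop000", 44444, Seq("hello spark", "hello flume"))
```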

The word counts for that input then appear in the program's console output.
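
To see which counts to expect for a given input, the per-batch transformation can be reproduced on plain Scala collections. This is only a sketch of the same map / flatMap / count logic without Spark; `WordCountSketch` and `countWords` are hypothetical names introduced for illustration.

```scala
object WordCountSketch {
  /** Counts words across a batch of lines, mirroring the streaming pipeline. */
  def countWords(lines: Seq[String]): Map[String, Int] =
    lines
      .map(_.trim)              // same trim as applied to each event body
      .flatMap(_.split(" "))    // split every line into words
      .groupBy(identity)        // group equal words together
      .map { case (word, occurrences) => (word, occurrences.size) }
}
```

For example, `WordCountSketch.countWords(Seq("hello spark", "hello flume"))` gives hello -> 2, spark -> 1, flume -> 1. The streaming job should print the same pairs when both lines arrive within one 5-second batch; lines that fall into different batches are counted separately.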