Integrating Spark Streaming with Flume (Part 1): The Push-Based Approach

Article link: https://blog.csdn.net/Wing_93/article/details/78490897


Apache Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data.

This article shows how to configure Flume so that Spark Streaming can receive data from it. Pay attention to version compatibility between Spark Streaming and Flume.

I. Push-Based Integration

Concept

Flume is designed to push data between Flume agents. In this approach, Spark Streaming essentially sets up a receiver that acts as an Avro agent for Flume, to which Flume can push the data. Here are the configuration steps.

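This is the explanation from the official documentation. In plain terms:

Flume is designed to push data between Flume agents. We can configure multiple agents, chained in series or arranged in parallel, and then push data from one agent to another. In the push-based integration, Spark Streaming sets up a receiver (whenever a receiver is involved, remember that for local testing the n in local[n] must be greater than 1; otherwise the receiver occupies the only thread and none is left to process the data). This receiver acts like an Avro agent for Flume, so Flume can push data to Spark Streaming.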
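The local[n] rule above can be checked mechanically before starting a local test. The following is a minimal sketch in plain Scala; `LocalMasterCheck`, `localThreads`, and `hasProcessingThread` are hypothetical helper names, not part of Spark, and the check only understands the `local` and `local[N]` master forms.

```scala
import scala.util.Try

// Hypothetical helper (not part of Spark): sanity-check a local master URL
// against the number of receivers the streaming job will start.
object LocalMasterCheck {
  // Matches "local[N]" as a whole string and captures N.
  private val LocalN = """local\[(\d+)\]""".r

  /** Threads implied by a local master URL, or None when the size is not fixed. */
  def localThreads(master: String): Option[Int] = master match {
    case "local"   => Some(1)
    case LocalN(n) => Try(n.toInt).toOption
    case _         => None // local[*], spark://..., yarn, and so on
  }

  /** True when at least one thread remains for processing after each receiver takes one. */
  def hasProcessingThread(master: String, receivers: Int): Boolean =
    localThreads(master).forall(_ > receivers)
}
```

With one Flume receiver, `hasProcessingThread("local[2]", 1)` holds while `hasProcessingThread("local[1]", 1)` does not. Masters whose thread count cannot be read from the string are treated as acceptable, which is an assumption of this sketch.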

General Requirements

Choose a machine in your cluster such that

When your Flume + Spark Streaming application is launched, one of the Spark workers must run on that machine.
Flume can be configured to push data to a port on that machine.

Due to the push model, the streaming application needs to be up, with the receiver scheduled and listening on the chosen port, for Flume to be able to push data.

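This passage is also important. Roughly:

First, choose a machine in your cluster. When your Flume + Spark Streaming application is launched, one of the Spark workers must run on that machine, because that is where the receiver will listen. Flume itself does not have to run on this node; it only needs to be able to reach it.

Flume must be configurable to push data to a port on that machine. Because the Avro sink is used, a port has to be specified.

In push mode, the Spark Streaming application has to be started first: its receiver must already be scheduled and listening on the chosen port before Flume can push any data.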
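Because of this start-up order, it can help to confirm that something is listening on the chosen port before launching the Flume agent. The sketch below uses only the JDK socket API; `ReceiverProbe` and `isListening` are hypothetical names, and a successful TCP connect only shows that the port is open, not that the listener is the Spark receiver.

```scala
import java.net.{InetSocketAddress, Socket}
import scala.util.Try

// Hypothetical helper: probe whether a TCP port accepts connections.
object ReceiverProbe {
  /** Returns true if a TCP connection to host:port succeeds within timeoutMs. */
  def isListening(host: String, port: Int, timeoutMs: Int = 1000): Boolean = {
    val socket = new Socket()
    try {
      // connect throws on refusal or timeout; Try turns that into false
      Try(socket.connect(new InetSocketAddress(host, port), timeoutMs)).isSuccess
    } finally {
      socket.close()
    }
  }
}
```

For the setup in this article, the call would be `ReceiverProbe.isListening("hadoop000", 41414)`, run after the Spark Streaming application has started.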

Environment Configuration

Flume configuration file:

Configure Flume agent to send data to an Avro sink by having the following in the configuration file.

agent.sinks = avroSink
agent.sinks.avroSink.type = avro
agent.sinks.avroSink.channel = memoryChannel
agent.sinks.avroSink.hostname = <chosen machine's hostname>
agent.sinks.avroSink.port = <chosen port on the machine>

This is the template from the official documentation. Following it, we write our own configuration:

Flume agent definition: flume_push_streaming.conf

simple-agent.sources = netcat-source
simple-agent.sinks = avro-sink
simple-agent.channels = memory-channel

simple-agent.sources.netcat-source.type = netcat
simple-agent.sources.netcat-source.bind = hadoop000
simple-agent.sources.netcat-source.port = 44444

simple-agent.sinks.avro-sink.type = avro
simple-agent.sinks.avro-sink.hostname = 192.168.199.203
simple-agent.sinks.avro-sink.port = 41414

simple-agent.channels.memory-channel.type = memory

simple-agent.sources.netcat-source.channels = memory-channel
simple-agent.sinks.avro-sink.channel = memory-channel

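In short: the netcat source listens on port 44444 of hadoop000; events pass through the memory channel, and the Avro sink sends them to port 41414 on the server 192.168.199.203, which is where the Spark Streaming receiver must be listening.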
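The sink's hostname and port have to match the values later passed to the Spark application, so it may be worth reading them straight from the configuration file. A minimal sketch, assuming the file is plain Java properties syntax (which the listing above is); `FlumeSinkTarget` and `sinkTarget` are hypothetical names.

```scala
import java.io.StringReader
import java.util.Properties
import scala.util.Try

// Hypothetical helper: read the Avro sink target out of a Flume agent config.
object FlumeSinkTarget {
  /** Extracts (hostname, port) of the named sink from Flume properties text. */
  def sinkTarget(conf: String, agent: String, sink: String): Option[(String, Int)] = {
    val props = new Properties()
    props.load(new StringReader(conf))
    val prefix = s"$agent.sinks.$sink."
    for {
      host <- Option(props.getProperty(prefix + "hostname"))
      port <- Option(props.getProperty(prefix + "port")).flatMap(p => Try(p.trim.toInt).toOption)
    } yield (host.trim, port)
  }
}
```

Called with the contents of flume_push_streaming.conf, the agent name `simple-agent`, and the sink name `avro-sink`, this yields the pair the receiver should be started with.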

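Application Development and Deployment

With the configuration in place, develop the application:

1. Add the following dependency to the Maven project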

 groupId = org.apache.spark
 artifactId = spark-streaming-flume_2.11
 version = 2.2.0

2. Import FlumeUtils and create the input DStream

import org.apache.spark.streaming.flume._

 val flumeStream = FlumeUtils.createStream(streamingContext, [chosen machine's hostname], [chosen port])

Following this template, let's write a demo:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.flume.FlumeUtils
import org.apache.spark.streaming.{Seconds, StreamingContext}

/**
  * First approach to integrating Spark Streaming with Flume (push-based)
  */
object FlumePushWordCount {

  def main(args: Array[String]): Unit = {

    if(args.length != 2) {
      System.err.println("Usage: FlumePushWordCount <hostname> <port>")
      System.exit(1)
    }

    val Array(hostname, port) = args

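    // Master and app name are supplied by spark-submit; for a local run in the IDE,
    // use: new SparkConf().setMaster("local[2]").setAppName("FlumePushWordCount")
    val sparkConf = new SparkConf()
    val ssc = new StreamingContext(sparkConf, Seconds(5))

    // Create the input DStream of Flume events (push-based receiver)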
    val flumeStream = FlumeUtils.createStream(ssc, hostname, port.toInt)

    flumeStream.map(x=> new String(x.event.getBody.array()).trim)
      .flatMap(_.split(" ")).map((_,1)).reduceByKey(_+_).print()

    ssc.start()
    ssc.awaitTermination()
  }
}
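The transformation chain in the demo can be tried without Spark or Flume by applying the same steps to an ordinary collection of event bodies. This is only a sketch of the per-batch logic: `LocalWordCount` and `countWords` are hypothetical names, the bodies are assumed to be UTF-8 text, and unlike the demo it drops empty tokens.

```scala
import java.nio.charset.StandardCharsets.UTF_8

// Hypothetical stand-in for one batch of the streaming word count.
object LocalWordCount {
  /** Decodes each event body, splits on spaces, and counts occurrences per word. */
  def countWords(bodies: Seq[Array[Byte]]): Map[String, Int] =
    bodies
      .map(b => new String(b, UTF_8).trim)   // same decoding step as in the demo
      .flatMap(_.split(" "))
      .filter(_.nonEmpty)                    // extra step: skip empty tokens
      .groupBy(identity)
      .map { case (word, occurrences) => word -> occurrences.size }
}
```

In the streaming job, `reduceByKey(_+_)` plays the role of the `groupBy` and `size` steps here, applied to each 5-second batch.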

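The hostname and port must be passed in as arguments, according to your own configuration. For a local run in IDEA, set them in the Program arguments field of the Run/Debug configuration.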
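The demo only checks the number of arguments, so a non-numeric port fails later with a NumberFormatException. A slightly stricter parse could look like the sketch below; `FlumeArgs` and `parse` are hypothetical names and not part of the original demo.

```scala
import scala.util.Try

// Hypothetical helper: validate the <hostname> <port> arguments up front.
object FlumeArgs {
  /** Returns Right((hostname, port)) for valid input, or Left(message) otherwise. */
  def parse(args: Array[String]): Either[String, (String, Int)] = args match {
    case Array(hostname, port) =>
      Try(port.toInt).toOption.filter(p => p > 0 && p <= 65535) match {
        case Some(p) => Right((hostname, p))
        case None    => Left(s"Invalid port: $port")
      }
    case _ => Left("Usage: FlumePushWordCount <hostname> <port>")
  }
}
```

In `main`, the Left case would be printed to System.err before exiting, and the Right case would feed `FlumeUtils.createStream`.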

Next, build the jar with Maven:

mvn clean package -DskipTests

3. Deploy the Spark application

The submit command is:

spark-submit \
--class com.imooc.spark.FlumePushWordCount \
--master local[2] \
--packages org.apache.spark:spark-streaming-flume_2.11:2.2.0 \
/home/hadoop/lib/sparktrain-1.0.jar \
hadoop000 41414

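Note the --packages option: it puts the spark-streaming-flume artifact (which provides FlumeUtils) on the classpath. This is needed because the jar built by Maven contains only the project's own compiled classes, not the dependencies declared in the pom. Note also that --packages resolves the artifact from Maven repositories at submit time, so the machine needs network access unless the artifact is already cached locally.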

4. Start Flume

The command to start the Flume agent is:

flume-ng agent  \
--name simple-agent   \
--conf $FLUME_HOME/conf    \
--conf-file $FLUME_HOME/conf/flume_push_streaming.conf  \
-Dflume.root.logger=INFO,console

5. Test

Open another terminal, run telnet hadoop000 44444, and type a few words.
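For repeatable tests, the telnet step can be replaced by a small client that writes lines to the netcat source. The following is a sketch using only JDK sockets; `NetcatSender` and `sendLines` are hypothetical names, and the 2-second read timeout is an arbitrary choice. It drains whatever the server sends back (the netcat source may acknowledge each line) so the connection closes cleanly.

```scala
import java.io.{OutputStreamWriter, PrintWriter}
import java.net.Socket
import java.nio.charset.StandardCharsets.UTF_8
import scala.util.Try

// Hypothetical test client: sends newline-terminated lines over TCP.
object NetcatSender {
  def sendLines(host: String, port: Int, lines: Seq[String]): Unit = {
    val socket = new Socket(host, port)
    try {
      socket.setSoTimeout(2000) // do not wait forever for the server to close
      val out = new PrintWriter(new OutputStreamWriter(socket.getOutputStream, UTF_8))
      lines.foreach(line => out.print(line + "\n"))
      out.flush()
      socket.shutdownOutput() // signal end of input to the server
      val in = socket.getInputStream
      // Drain any replies until the server closes the connection or the timeout fires.
      Try { while (in.read() != -1) {} }
    } finally {
      socket.close()
    }
  }
}
```

With the agent from this article running, `NetcatSender.sendLines("hadoop000", 44444, Seq("a b a", "c"))` would play the role of the telnet session.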

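The Spark Streaming application listening on port 41414 of the server configured as the sink target then prints the word counts for each batch to its console.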