1 Introduction to Flume
Flume is a highly available, highly reliable, distributed system from Cloudera for collecting, aggregating, and transporting massive amounts of log data. Flume supports pluggable data senders for collecting data from logging systems, can perform simple processing on the data, and can write the results to a variety of data receivers.
Official site: http://flume.apache.org/index.html
2 Runtime Environment
JDK 1.8.0 and Spark 2.4.0 should already be installed.
[root@centos ~]# uname -a
Linux centos 3.10.0-327.el7.x86_64 #1 SMP Thu Nov 19 22:10:57 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux
[root@centos ~]# hostname
centos
[root@centos ~]# cd /opt
[root@centos opt]# ls -l
drwxr-xr-x. 7 root root 4096 Oct 6 21:58 jdk1.8.0_192
drwxr-xr-x. 17 root root 4096 Dec 20 17:02 spark-2.4.0
3 Download and Install Flume
Download link: http://www.apache.org/dyn/closer.lua/flume/1.8.0/apache-flume-1.8.0-bin.tar.gz
3.1 Place the downloaded archive in /opt and extract it
[root@centos opt]# ls
apache-flume-1.8.0-bin.tar.gz
[root@centos opt]# tar -zxvf apache-flume-1.8.0-bin.tar.gz
3.2 Rename the directory to flume-1.8.0
[root@centos opt]# mv apache-flume-1.8.0-bin flume-1.8.0
3.3 Remove the archive
[root@centos opt]# rm -f apache-flume-1.8.0-bin.tar.gz
4 Configure Flume
In this example, a Flume agent receives string data on port 41414 and writes it out to port 4444. Spark Streaming listens on port 4444, receives the strings pushed by Flume, and counts the words.
41414 ---> Flume agent ----> 4444 ---> SparkStreaming
4.1 The Flume configuration file /opt/flume-1.8.0/conf/avro.conf
# Name the source, sink, and channel of the agent
agent.sources = netcat-source
agent.sinks = avro-sink
agent.channels = memory-channel
# Describe/configure the source
agent.sources.netcat-source.type = netcat
agent.sources.netcat-source.bind = centos
agent.sources.netcat-source.port = 41414
agent.sources.netcat-source.channels = memory-channel
# Describe the sink
agent.sinks.avro-sink.type = avro
agent.sinks.avro-sink.hostname = centos
agent.sinks.avro-sink.port = 4444
agent.sinks.avro-sink.channel = memory-channel
# Use a channel which buffers events in memory
agent.channels.memory-channel.type = memory
agent.channels.memory-channel.capacity = 1000
agent.channels.memory-channel.transactionCapacity = 100
4.2 Component types
The source is a Netcat Source: it listens on a port and turns each line of text arriving at that port into an Event. Here it is bound to the current server, centos, on port 41414.
The channel is a Memory Channel: events are buffered in memory.
The sink is an Avro Sink: the data is serialized as Avro events and sent to port 4444.
5 Start Flume
5.1 Start the Flume agent
[root@centos conf]# cd /opt/flume-1.8.0/bin
[root@centos bin]# ls
flume-ng flume-ng.cmd flume-ng.ps1
[root@centos bin]# ./flume-ng agent -c ../conf -f ../conf/avro.conf -n agent -Dflume.root.logger=INFO,console
The flume-ng command takes the configuration directory (-c), the configuration file (-f), and the agent name (-n), here agent. The agent name must match the prefix of every property in conf/avro.conf.
If -c ../conf is omitted, i.e. no configuration directory is given, the following error is reported:
log4j:WARN No appenders could be found for logger (org.apache.flume.node.Application).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
If an error about JAVA_HOME not being found is reported, set JAVA_HOME=/opt/jdk1.8.0_192 in flume-ng, or export it in conf/flume-env.sh, and start again.
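For example, JAVA_HOME can be exported in conf/flume-env.sh (created by copying flume-env.sh.template), which the flume-ng launcher sources when -c points at that conf directory:

```shell
# /opt/flume-1.8.0/conf/flume-env.sh
# Sourced by the flume-ng launcher on startup
export JAVA_HOME=/opt/jdk1.8.0_192
```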
5.2 Normal startup log
Info: Sourcing environment configuration script /opt/flume-1.8.0/conf/flume-env.sh
Info: Including Hive libraries found via () for Hive access
+ exec /opt/jdk1.8.0_192/bin/java -Xmx20m -Dflume.root.logger=INFO,console -cp '/opt/flume-1.8.0/conf:/opt/flume-1.8.0/lib/*:/lib/*' -Djava.library.path= org.apache.flume.node.Application -f ../conf/avro.conf -n agent
2018-12-22 13:49:19,611 (lifecycleSupervisor-1-0) [INFO - org.apache.flume.node.PollingPropertiesFileConfigurationProvider.start(PollingPropertiesFileConfigurationProvider.java:62)] Configuration provider starting
2018-12-22 13:49:19,616 (conf-file-poller-0) [INFO - org.apache.flume.node.PollingPropertiesFileConfigurationProvider$FileWatcherRunnable.run(PollingPropertiesFileConfigurationProvider.java:134)] Reloading configuration file:../conf/avro.conf
2018-12-22 13:49:19,623 (conf-file-poller-0) [INFO - org.apache.flume.conf.FlumeConfiguration$AgentConfiguration.addProperty(FlumeConfiguration.java:1016)] Processing:avro-sink
6 The Spark Streaming Application That Receives Flume Data
6.1 Spark dependencies in pom.xml
<dependencies>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-core_2.11</artifactId>
        <version>2.4.0</version>
    </dependency>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-streaming_2.11</artifactId>
        <version>2.4.0</version>
    </dependency>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-streaming-flume_2.11</artifactId>
        <version>2.3.0</version>
    </dependency>
</dependencies>
6.2 Source code: FlumePushWordCount.scala
package org.apache.spark.examples.streaming

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.flume.FlumeUtils

object FlumePushWordCount {
  def main(args: Array[String]): Unit = {
    if (args.length != 2) {
      System.err.println("Usage: FlumePushWordCount <hostname> <port>")
      System.exit(1)
    }
    val conf = new SparkConf().setAppName("FlumePushWordCount")
    val ssc = new StreamingContext(conf, Seconds(5))
    val Array(hostname, port) = args
    // Push-based receiver: Spark listens on <hostname>:<port> for Flume's Avro sink
    val flumeStream = FlumeUtils.createStream(ssc, hostname, port.toInt)
    // Decode each event body to a String, split on spaces, count words per batch
    flumeStream.map(x => new String(x.event.getBody.array()).trim)
      .flatMap(x => x.split(" "))
      .map((_, 1)).reduceByKey(_ + _)
      .print()
    ssc.start()
    ssc.awaitTermination()
  }
}
In the code, FlumeUtils listens on the given hostname and port; the body of each received event is decoded to a string, split on spaces, and the per-word counts are printed.
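The same split-and-count transformation can be sketched with standard shell tools on a sample line (the sample text is arbitrary):

```shell
# flatMap(split on spaces) + reduceByKey(_ + _), expressed as a shell pipeline:
# one word per line, then count duplicates, highest count first
echo "today is a nice day nice" | tr ' ' '\n' | sort | uniq -c | sort -rn
# 'nice' appears twice; every other word appears once
```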
6.3 Package the source into a jar
The jar built here is named spark-learn-1.0.jar; upload it to a directory on the server, here /opt/spark-2.4.0/lib.
7 Submit the Spark Job
7.1 Start Spark
[root@centos sbin]# cd /opt/spark-2.4.0/sbin
[root@centos sbin]# ./start-all.sh
7.2 Submit the Spark job
spark-submit \
--class org.apache.spark.examples.streaming.FlumePushWordCount \
--packages org.apache.spark:spark-streaming-flume_2.11:2.3.0 \
--master spark://centos:7077 \
--executor-memory 2G \
--total-executor-cores 2 \
/opt/spark-2.4.0/lib/spark-learn-1.0.jar centos 4444
7.3 Command parameters
The two trailing arguments are centos, the hostname of the current server, and 4444, the port to listen on.
The version of spark-streaming-flume must be supplied via the --packages option, as explained on the Spark site:
http://spark.apache.org/docs/latest/streaming-flume-integration.html
spark-streaming-flume_2.11 and its dependencies can be directly added to spark-submit using --packages (see Application Submission Guide). That is,
./bin/spark-submit --packages org.apache.spark:spark-streaming-flume_2.11:2.4.0 ...
With --packages, the required jars are downloaded automatically when the job starts, provided the server has Internet access.
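If the server cannot reach the Internet, a common alternative is to download the connector jars ahead of time and pass them with --jars instead of --packages. This is only a sketch: the jar paths and the exact transitive dependency list (e.g. the Flume SDK jar) are assumptions and depend on what spark-streaming-flume pulls in.

```shell
# Offline variant of the submit command: supply pre-downloaded jars via --jars
# (paths below are assumed, not part of the original setup)
spark-submit \
  --class org.apache.spark.examples.streaming.FlumePushWordCount \
  --jars /opt/spark-2.4.0/lib/spark-streaming-flume_2.11-2.3.0.jar,/opt/spark-2.4.0/lib/flume-ng-sdk-1.8.0.jar \
  --master spark://centos:7077 \
  --executor-memory 2G \
  --total-executor-cores 2 \
  /opt/spark-2.4.0/lib/spark-learn-1.0.jar centos 4444
```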
8 Test
8.1 Submit the Spark job first
Submit the job using the command from section 7.2.
8.2 Start flume-ng
Start the Flume agent using the command from section 5.1.
8.3 Send strings to port 41414 with telnet from the current server
[root@centos ~]# telnet centos 41414
Trying 192.168.237.131...
Connected to centos.
Escape character is '^]'.
today
OK
is
OK
a
OK
nice day
OK
8.4 The Spark log prints the word counts received
2018-12-22 14:23:55 INFO DAGScheduler:54 - Job 109 finished: print at FlumePushWordCount.scala:23, took 0.034998 s
-------------------------------------------
Time: 1545459835000 ms
-------------------------------------------
(is,1)
(a,1)
(today,1)
2018-12-22 14:24:00 INFO DAGScheduler:54 - Job 111 finished: print at FlumePushWordCount.scala:23, took 0.034998 s
-------------------------------------------
Time: 1545459840000 ms
-------------------------------------------
(day,1)
(nice,1)
This completes the example.