Spark Streaming + Flume Integration Example

 

1 Flume Overview

Flume is a highly available, highly reliable, distributed system from Cloudera for collecting, aggregating, and transporting large volumes of log data. Flume supports custom data senders for collecting data from logging systems, can perform simple processing on the data, and writes the results to a variety of data receivers.

Official site: http://flume.apache.org/index.html

 

2 Runtime Environment

  JDK 1.8.0 and Spark 2.4.0 should be installed beforehand.

[root@centos ~]# uname -a
Linux centos 3.10.0-327.el7.x86_64 #1 SMP Thu Nov 19 22:10:57 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux
[root@centos ~]# hostname
centos
[root@centos ~]# cd /opt
[root@centos opt]# ls -l
drwxr-xr-x.  7 root root  4096 Oct  6 21:58 jdk1.8.0_192
drwxr-xr-x. 17 root root  4096 Dec 20 17:02 spark-2.4.0

 

3 Download and Install Flume

Download link: http://www.apache.org/dyn/closer.lua/flume/1.8.0/apache-flume-1.8.0-bin.tar.gz

3.1 Place the downloaded archive in /opt and extract it

[root@centos opt]# ls
apache-flume-1.8.0-bin.tar.gz
[root@centos opt]# tar -zxvf apache-flume-1.8.0-bin.tar.gz

3.2 Rename the directory to flume-1.8.0

[root@centos opt]# mv apache-flume-1.8.0-bin flume-1.8.0

3.3 Remove the archive

[root@centos opt]# rm -f apache-flume-1.8.0-bin.tar.gz

 

4 Configure Flume

In this example, a Flume agent receives string data on port 41414 and forwards it to port 4444. Spark Streaming listens on port 4444, receives the strings pushed by Flume, and counts the words.

       41414 ---> Flume agent ---> 4444 ---> Spark Streaming

4.1 Create the Flume configuration file /opt/flume-1.8.0/conf/avro.conf

# Name the source, sink, and channel
agent.sources = netcat-source
agent.sinks = avro-sink
agent.channels = memory-channel

# Describe/configure the source
agent.sources.netcat-source.type = netcat
agent.sources.netcat-source.bind = centos
agent.sources.netcat-source.port = 41414
agent.sources.netcat-source.channels = memory-channel

# Describe the sink
agent.sinks.avro-sink.type = avro
agent.sinks.avro-sink.hostname = centos
agent.sinks.avro-sink.port = 4444
agent.sinks.avro-sink.channel = memory-channel

# Use a channel which buffers events in memory
agent.channels.memory-channel.type = memory
agent.channels.memory-channel.capacity = 1000
agent.channels.memory-channel.transactionCapacity = 100

4.2 Component types

The source is a Netcat Source: it listens on a port and turns each line of text arriving there into an event. Here it is bound to the current server, centos, on port 41414.

The channel is a Memory Channel: events are buffered in memory. The capacity (1000) bounds how many events the channel can hold, and transactionCapacity (100) bounds how many events a single transaction may put or take.

The sink is an Avro Sink: events are serialized as Avro and pushed to port 4444.

 

5 Start Flume

5.1 Start the Flume agent

[root@centos conf]# cd /opt/flume-1.8.0/bin
[root@centos bin]# ls
flume-ng  flume-ng.cmd  flume-ng.ps1
[root@centos bin]# ./flume-ng agent -c ../conf -f ../conf/avro.conf -n agent -Dflume.root.logger=INFO,console

The flume-ng command takes the configuration directory (-c), the configuration file (-f), and the agent name (-n agent). The agent name must match the prefix used on every line of conf/avro.conf.

If -c ../conf is omitted, i.e. no configuration directory is specified, the following error is reported:

log4j:WARN No appenders could be found for logger (org.apache.flume.node.Application).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.

If an error about a missing JAVA_HOME is reported, set JAVA_HOME=/opt/jdk1.8.0_192 in the flume-ng script (or in conf/flume-env.sh, which the startup log shows being sourced) and start again.

5.2 Normal startup log

Info: Sourcing environment configuration script /opt/flume-1.8.0/conf/flume-env.sh
Info: Including Hive libraries found via () for Hive access
+ exec /opt/jdk1.8.0_192/bin/java -Xmx20m -Dflume.root.logger=INFO,console -cp '/opt/flume-1.8.0/conf:/opt/flume-1.8.0/lib/*:/lib/*' -Djava.library.path= org.apache.flume.node.Application -f ../conf/avro.conf -n agent
2018-12-22 13:49:19,611 (lifecycleSupervisor-1-0) [INFO - org.apache.flume.node.PollingPropertiesFileConfigurationProvider.start(PollingPropertiesFileConfigurationProvider.java:62)] Configuration provider starting
2018-12-22 13:49:19,616 (conf-file-poller-0) [INFO - org.apache.flume.node.PollingPropertiesFileConfigurationProvider$FileWatcherRunnable.run(PollingPropertiesFileConfigurationProvider.java:134)] Reloading configuration file:../conf/avro.conf
2018-12-22 13:49:19,623 (conf-file-poller-0) [INFO - org.apache.flume.conf.FlumeConfiguration$AgentConfiguration.addProperty(FlumeConfiguration.java:1016)] Processing:avro-sink
 

6 Spark Streaming Code to Receive Flume Data

6.1 Spark dependencies in pom.xml

    <dependencies>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-core_2.11</artifactId>
            <version>2.4.0</version>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-streaming_2.11</artifactId>
            <version>2.4.0</version>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-streaming-flume_2.11</artifactId>
            <version>2.3.0</version>
        </dependency>
    </dependencies>

6.2 Source code: FlumePushWordCount.scala

package org.apache.spark.examples.streaming

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.flume.FlumeUtils

object FlumePushWordCount {

	def main(args: Array[String]): Unit = {
		if (args.length != 2) {
			System.err.println("Usage: FlumePushWordCount <hostname> <port>")
			System.exit(1)
		}

		val conf = new SparkConf().setAppName("FlumePushWordCount")
		val ssc = new StreamingContext(conf, Seconds(5)) // 5-second batches

		// Start an Avro listener on <hostname>:<port>; Flume's Avro sink pushes events to it
		val Array(hostname, port) = args
		val flumeStream = FlumeUtils.createStream(ssc, hostname, port.toInt)
		// Decode each event body to a string, split on spaces, and count words per batch
		flumeStream.map(x => new String(x.event.getBody.array()).trim)
			.flatMap(x => x.split(" "))
			.map((_, 1)).reduceByKey(_ + _)
			.print()

		ssc.start()
		ssc.awaitTermination()
	}
}

In this source, FlumeUtils listens on the given host and port; each received event's body is extracted as a string, split on spaces, and the resulting words are counted and printed.
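
The example above uses Flume's push model (Avro sink pushing to Spark's Avro listener). For reference, spark-streaming-flume also provides a pull-based variant, FlumeUtils.createPollingStream, where Spark polls a custom Flume sink (org.apache.spark.streaming.flume.sink.SparkSink) instead. A minimal sketch, assuming such a sink (rather than the Avro sink above) has been configured in Flume on centos:4444:

package org.apache.spark.examples.streaming

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.flume.FlumeUtils

object FlumePullWordCount {

	def main(args: Array[String]): Unit = {
		val conf = new SparkConf().setAppName("FlumePullWordCount")
		val ssc = new StreamingContext(conf, Seconds(5))

		// Spark pulls events from the Flume SparkSink rather than
		// hosting an Avro listener for Flume to push to
		val flumeStream = FlumeUtils.createPollingStream(ssc, "centos", 4444)
		flumeStream.map(x => new String(x.event.getBody.array()).trim)
			.flatMap(_.split(" "))
			.map((_, 1)).reduceByKey(_ + _)
			.print()

		ssc.start()
		ssc.awaitTermination()
	}
}

The pull model keeps events buffered in Flume until Spark has received them, so it gives stronger reliability guarantees; it does require the spark-streaming-flume-sink jar on the Flume agent's classpath.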

6.3 Package the source into a jar

The output jar here is named spark-learn-1.0.jar (built, e.g., with mvn clean package). Upload the jar to a directory on the server; here it goes to /opt/spark-2.4.0/lib.

 

7 Submit the Spark Job

7.1 Start Spark

[root@centos sbin]# cd /opt/spark-2.4.0/sbin
[root@centos sbin]# ./start-all.sh

7.2 Submit the Spark job

   spark-submit \
   --class org.apache.spark.examples.streaming.FlumePushWordCount \
   --packages org.apache.spark:spark-streaming-flume_2.11:2.3.0 \
   --master spark://centos:7077  \
   --executor-memory 2G \
   --total-executor-cores 2 \
   /opt/spark-2.4.0/lib/spark-learn-1.0.jar centos 4444

7.3 Command parameters

The command ends with two application arguments: centos, the hostname of the current server, and 4444, the port to listen on.

The --packages option supplies spark-streaming-flume and its version; the Spark website explains this:

http://spark.apache.org/docs/latest/streaming-flume-integration.html

spark-streaming-flume_2.11 and its dependencies can be directly added to spark-submit using --packages (see Application Submission Guide). That is,

 ./bin/spark-submit --packages org.apache.spark:spark-streaming-flume_2.11:2.4.0 ...

With this option, the required jars are downloaded automatically when the job starts, provided the machine has internet access. (On an offline machine, the jars can be passed explicitly with --jars instead.)

 

8 Test

8.1 Submit the Spark job first

    Submit it with the command line from 7.2 Submit the Spark job.

8.2 Start flume-ng

    Start it with the command line from 5.1 Start the Flume agent.

8.3 Send strings to port 41414 via telnet on the current server

[root@centos ~]# telnet centos 41414
Trying 192.168.237.131...
Connected to centos.
Escape character is '^]'.
today
OK
is
OK
a
OK
nice day
OK
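
If telnet is not installed, the same newline-terminated strings can be sent with a few lines of Scala (a minimal sketch; SendWords is a hypothetical name, and the host and port are the ones configured in 4.1):

import java.io.PrintWriter
import java.net.Socket

object SendWords {
	def main(args: Array[String]): Unit = {
		// Connect to the Flume netcat source configured in avro.conf
		val socket = new Socket("centos", 41414)
		val out = new PrintWriter(socket.getOutputStream, true)
		// The netcat source turns each newline-terminated line into one event
		Seq("today", "is", "a", "nice day").foreach(out.println)
		out.close()
		socket.close()
	}
}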

8.4 Spark logs the word counts received

2018-12-22 14:23:55 INFO  DAGScheduler:54 - Job 109 finished: print at FlumePushWordCount.scala:23, took 0.034998 s
-------------------------------------------
Time: 1545459835000 ms
-------------------------------------------
(is,1)
(a,1)
(today,1)

2018-12-22 14:24:00 INFO  DAGScheduler:54 - Job 111 finished: print at FlumePushWordCount.scala:23, took 0.034998 s
-------------------------------------------
Time: 1545459840000 ms
-------------------------------------------
(day,1)
(nice,1)

This completes the example.
