1 Introduction to Flume
Flume is a highly available, highly reliable, distributed system from Cloudera for collecting, aggregating, and transporting massive amounts of log data. Flume supports pluggable data senders for collecting data from logging systems, can perform simple processing on the data, and can write the results to a variety of data receivers.
Official site: http://flume.apache.org/index.html
2 Runtime Environment
JDK 1.8.0 and Spark 2.4.0 should already be installed.
[root@centos ~]# uname -a
Linux centos 3.10.0-327.el7.x86_64 #1 SMP Thu Nov 19 22:10:57 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux
[root@centos ~]# hostname
centos
[root@centos ~]# cd /opt
[root@centos opt]# ls -l
drwxr-xr-x. 7 root root 4096 Oct 6 21:58 jdk1.8.0_192
drwxr-xr-x. 17 root root 4096 Dec 20 17:02 spark-2.4.0
3 Download and Install Flume
Download link: http://www.apache.org/dyn/closer.lua/flume/1.8.0/apache-flume-1.8.0-bin.tar.gz
3.1 Place the downloaded archive in /opt and extract it
[root@centos opt]# ls
apache-flume-1.8.0-bin.tar.gz
[root@centos opt]# tar -zxvf apache-flume-1.8.0-bin.tar.gz
3.2 Rename the directory to flume-1.8.0
[root@centos opt]# mv apache-flume-1.8.0-bin flume-1.8.0
3.3 Remove the archive
[root@centos opt]# rm -f apache-flume-1.8.0-bin.tar.gz
4 Configure Flume
In this example, a Flume agent receives string data on port 41414 and writes it out to port 4444. Spark Streaming listens on port 4444, receives the strings pushed by Flume, and counts the words.
41414 ---> Flume agent ----> 4444 ---> SparkStreaming
4.1 The Flume configuration file /opt/flume-1.8.0/conf/avro.conf
# Name the source, sink, and channel of the agent
agent.sources = netcat-source
agent.sinks = avro-sink
agent.channels = memory-channel
# Describe/configure the source
agent.sources.netcat-source.type = netcat
agent.sources.netcat-source.bind = centos
agent.sources.netcat-source.port = 41414
agent.sources.netcat-source.channels = memory-channel
# Describe the sink
agent.sinks.avro-sink.type = avro
agent.sinks.avro-sink.hostname = centos
agent.sinks.avro-sink.port = 4444
agent.sinks.avro-sink.channel = memory-channel
# Use a channel which buffers events in memory
agent.channels.memory-channel.type = memory
agent.channels.memory-channel.capacity = 1000
agent.channels.memory-channel.transactionCapacity = 100
4.2 Component types
The source is a Netcat Source: it listens on a port and turns each line of text arriving at that port into an Event. Here it is bound to the current server, centos, on port 41414.
The channel is a Memory Channel: events are buffered in memory.
The sink is an Avro Sink: the data is serialized as Avro events and sent to port 4444.
5 Start Flume
5.1 Start the Flume agent
[root@centos conf]# cd /opt/flume-1.8.0/bin
[root@centos bin]# ls
flume-ng flume-ng.cmd flume-ng.ps1
[root@centos bin]# ./flume-ng agent -c ../conf -f ../conf/avro.conf -n agent -Dflume.root.logger=INFO,console
The flume-ng command takes the configuration directory (-c), the configuration file (-f), and the agent name (-n), here agent. The agent name must match the prefix of every property in conf/avro.conf.
If -c ../conf is omitted, i.e. no configuration directory is given, the following error is reported:
log4j:WARN No appenders could be found for logger (org.apache.flume.node.Application).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
If an error about JAVA_HOME not being found is reported, set JAVA_HOME=/opt/jdk1.8.0_192 in flume-ng, or export it in conf/flume-env.sh, and start again.
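For example, JAVA_HOME can be exported in conf/flume-env.sh (created by copying flume-env.sh.template), which the flume-ng launcher sources when -c points at that conf directory:

```shell
# /opt/flume-1.8.0/conf/flume-env.sh
# Sourced by the flume-ng launcher on startup
export JAVA_HOME=/opt/jdk1.8.0_192
```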
5.2 Normal startup log
Info: Sourcing environment configuration script /opt/flume-1.8.0/conf/flume-env.sh
Info: Including Hive libraries found via () for Hive access
+ exec /opt/jdk1.8.0_192/bin/java -Xmx20m -Dflume.root.logger=INFO,console -cp '/opt/flume-1.8.0/conf:/opt/flume-1.8.0/lib/*:/lib/*' -Djava.library.path= org.apache.flume.node.Application -f ../conf/avro.conf -n agent
2018-12-22 13:49:19,611 (lifecycleSupervisor-1-0) [INFO - org.apache.flume.node.PollingPropertiesFileConfigurationProvider.start(PollingPropertiesFileConfigurationProvider.java:62)] Configuration provider starting
2018-12-22 13:49:19,616 (conf-file-poller-0) [INFO - org.apache.flume.node.PollingPropertiesFileConfigurationProvider$FileWatcherRunnable.run(PollingPropertiesFileConfigurationProvider.java:134)] Reloading configuration file:../conf/avro.conf
2018-12-22 13:49:19,623 (conf-file-poller-0) [INFO - org.apache.flume.conf.FlumeConfiguration$AgentConfiguration.addProperty(FlumeConfiguration.java:1016)] Processing:avro-sink
6 The Spark Streaming Application That Receives Flume Data
6.1 Spark dependencies in pom.xml
<dependencies>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-core_2.11</artifactId>
        <version>2.4.0</version>
    </dependency>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-streaming_2.11</artifactId>
        <version>2.4.0</version>
    </dependency>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-streaming-flume_2.11</artifactId>
        <version>2.3.0</version>
    </dependency>
</dependencies>
6.2 Source code: FlumePushWordCount.scala
package org.apache.spark.examples.streaming

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.flume.FlumeUtils

object FlumePushWordCount {
  def main(args: Array[String]): Unit = {
    if (args.length != 2) {
      System.err.println("Usage: FlumePushWordCount <hostname> <port>")
      System.exit(1)
    }
    val conf = new SparkConf().setAppName("FlumePushWordCount")
    val ssc = new StreamingContext(conf, Seconds(5))
    val Array(hostname, port) = args
    // Push-based receiver: Spark listens on <hostname>:<port> for Flume's Avro sink
    val flumeStream = FlumeUtils.createStream(ssc, hostname, port.toInt)
    // Decode each event body to a String, split on spaces, count words per batch
    flumeStream.map(x => new String(x.event.getBody.array()).trim)
      .flatMap(x => x.split(" "))
      .map((_, 1)).reduceByKey(_ + _)
      .print()
    ssc.start()
    ssc.awaitTermination()
  }
}
In the code, FlumeUtils listens on the given hostname and port; the body of each received event is decoded to a string, split on spaces, and the per-word counts are printed.
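The same split-and-count transformation can be sketched with standard shell tools on a sample line (the sample text is arbitrary):

```shell
# flatMap(split on spaces) + reduceByKey(_ + _), expressed as a shell pipeline:
# one word per line, then count duplicates, highest count first
echo "today is a nice day nice" | tr ' ' '\n' | sort | uniq -c | sort -rn
# 'nice' appears twice; every other word appears once
```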
6.3 Package the source into a jar
The jar built here is named spark-learn-1.0.jar; upload it to a directory on the server, here /opt/spark-2.4.0/lib.
7 Submit the Spark Job
7.1 Start Spark
[root@centos sbin]# cd /opt/spark-2.4.0/sbin
[root@centos sbin]# ./start-all.sh
7.2 Submit the Spark job
spark-submit \
--class org.apache.spark.examples.streaming.FlumePushWordCount \
--packages org.apache.spark:spark-streaming-flume_2.11:2.3.0 \
--master spark://centos:7077 \
--executor-memory 2G \
--total-executor-cores 2 \
/opt/spark-2.4.0/lib/spark-learn-1.0.jar centos 4444
7.3 Command parameters
The two trailing arguments are centos, the hostname of the current server, and 4444, the port to listen on.
The version of spark-streaming-flume must be supplied via the --packages option, as explained on the Spark site:
http://spark.apache.org/docs/latest/streaming-flume-integration.html
spark-streaming-flume_2.11 and its dependencies can be directly added to spark-submit using --packages (see Application Submission Guide). That is,
./bin/spark-submit --packages org.apache.spark:spark-streaming-flume_2.11:2.4.0 ...
With --packages, the required jars are downloaded automatically when the job starts, provided the server has Internet access.
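If the server cannot reach the Internet, a common alternative is to download the connector jars ahead of time and pass them with --jars instead of --packages. This is only a sketch: the jar paths and the exact transitive dependency list (e.g. the Flume SDK jar) are assumptions and depend on what spark-streaming-flume pulls in.

```shell
# Offline variant of the submit command: supply pre-downloaded jars via --jars
# (paths below are assumed, not part of the original setup)
spark-submit \
  --class org.apache.spark.examples.streaming.FlumePushWordCount \
  --jars /opt/spark-2.4.0/lib/spark-streaming-flume_2.11-2.3.0.jar,/opt/spark-2.4.0/lib/flume-ng-sdk-1.8.0.jar \
  --master spark://centos:7077 \
  --executor-memory 2G \
  --total-executor-cores 2 \
  /opt/spark-2.4.0/lib/spark-learn-1.0.jar centos 4444
```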
8 Test
8.1 Submit the Spark job first
Submit the job using the command from section 7.2.
8.2 Start flume-ng
Start the Flume agent using the command from section 5.1.
8.3 Send strings to port 41414 with telnet from the current server
[root@centos ~]# telnet centos 41414
Trying 192.168.237.131...
Connected to centos.
Escape character is '^]'.
today
OK
is
OK
a
OK
nice day
OK
8.4 The Spark log prints the word counts received
2018-12-22 14:23:55 INFO DAGScheduler:54 - Job 109 finished: print at FlumePushWordCount.scala:23, took 0.034998 s
-------------------------------------------
Time: 1545459835000 ms
-------------------------------------------
(is,1)
(a,1)
(today,1)
2018-12-22 14:24:00 INFO DAGScheduler:54 - Job 111 finished: print at FlumePushWordCount.scala:23, took 0.034998 s
-------------------------------------------
Time: 1545459840000 ms
-------------------------------------------
(day,1)
(nice,1)
This completes the example.