Flume Deployment and Usage in Detail

Author: 爆发的~小宇宙

Article link: https://blog.csdn.net/yu0_zhang0/article/details/80168704

1 Official Sites

Apache link
CDH link

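2 Background

For data in relational databases, we can use Sqoop to move it between systems such as Hive, HDFS, and MySQL. But what about logs (from outside to inside)? How do we periodically collect the logs produced by nginx into HDFS?
The first idea might be a shell script scheduled with crontab. That works until the log volume grows large and we have to think about storage formats and compression formats.
This is the problem Flume was created to solve. It is an Apache top-level project, so let's start learning it.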

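3 Introduction to Flume

Flume NG is a distributed, reliable, and available system that efficiently collects, aggregates, and moves large volumes of log data from many different sources into a centralized data store. The move from Flume OG to Flume NG was an architectural rewrite, and NG is completely incompatible with OG. After the rewrite, Flume NG is more of a lightweight tool: it is simple, adapts easily to different ways of collecting logs, and supports failover and load balancing.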

3.1 Use Cases

flume->HDFS->batch
flume->kafka->streaming

3.2 Basic Architecture

[Figure: Flume basic architecture]

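3.3 The Event Concept

Before going further, it helps to introduce the event concept. At its core, Flume collects data from a data source (source) and delivers it to a specified destination (sink). To make sure delivery succeeds, the data is first buffered in a channel before it is sent on; only after the data has actually reached the destination does Flume delete its buffered copy.
Throughout the transfer, what flows is the event, and transactional guarantees are made at the event level. So what is an event? An event wraps the data being transferred and is Flume's basic unit of transfer (for a text file, usually one line); it is also the basic unit of a transaction. An event flows from source to channel to sink. Its body is a byte array, and it can carry headers. An event represents the smallest complete unit of data, coming from an external source and going to an external destination.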
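
To make the "buffer first, delete only after delivery" idea concrete, here is a minimal Python sketch. It is not Flume code: `Event` and `ToyChannel` are hypothetical names made up for illustration, and the real channel transaction API is Java and considerably richer. The sketch only shows that an event taken by a sink stays in the buffer until the delivery is committed, and goes back to the queue on rollback.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Deque, Dict, List


@dataclass
class Event:
    """Smallest unit that is moved: a byte-array body plus optional string headers."""
    body: bytes
    headers: Dict[str, str] = field(default_factory=dict)


class ToyChannel:
    """Hypothetical in-memory buffer that only mimics take/commit/rollback semantics."""

    def __init__(self) -> None:
        self._queue: Deque[Event] = deque()
        self._in_flight: List[Event] = []

    def put(self, event: Event) -> None:
        """Called by the source side: buffer the event."""
        self._queue.append(event)

    def take(self) -> Event:
        """Called by the sink side: hand out an event but keep a copy until commit."""
        event = self._queue.popleft()
        self._in_flight.append(event)
        return event

    def commit(self) -> None:
        """Delivery succeeded: drop the buffered copies."""
        self._in_flight.clear()

    def rollback(self) -> None:
        """Delivery failed: put the taken events back in their original order."""
        while self._in_flight:
            self._queue.appendleft(self._in_flight.pop())

    def __len__(self) -> int:
        return len(self._queue) + len(self._in_flight)


channel = ToyChannel()
channel.put(Event(body=b"hello", headers={"timestamp": "1524366347156"}))

channel.take()
channel.rollback()      # delivery failed: the event is still buffered
print(len(channel))     # 1

channel.take()
channel.commit()        # delivery succeeded: the buffered copy is dropped
print(len(channel))     # 0
```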

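3.4 The Three Core Components of Flume

What makes Flume work so well is one piece of its design: the agent. An agent is a Java process that runs on a log collection node, which is simply a server node.

- Source: collects data from the origin and writes it to the channel. Commonly used sources are exec, Spooling Directory, Taildir Source, and NetCat.
- Channel: buffers the data coming from the source. Commonly used channels are Memory and File.
- Sink: takes data from the channel and writes it to the destination. Commonly used sinks are HDFS, Logger, Avro, and Kafka.

Source + Channel + Sink = Agent. Data travels from source to sink in the form of events. Using Flume comes down to writing a configuration file that wires the three core components together, which makes it easy to use, configurable, pluggable, and composable.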

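3.5 File Channel vs. Memory Channel

File Channel is a durable channel: it persists every event to disk. Even if the JVM dies, the operating system crashes or reboots, or an event fails to reach the next agent in the pipeline, no data is lost. Memory Channel is a volatile channel because it keeps all events in memory: if the Java process dies, any events held in memory are lost. Its capacity is also limited by the amount of RAM, whereas File Channel can keep storing events as long as there is enough disk space.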

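3.6 Common Flume Topologies

Fan-in

Notes:
1. Multiple upstream sinks write to a single source here in order to reduce the pressure of many agents writing to HDFS at the same time.
2. When agents are chained, the upstream agent's sink and the downstream agent's source must both be of type Avro.

Fan-out

Doesn't the fan-out here look a bit like offline data processing?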

4 Installation

The author downloaded the CDH version (download link).

4.1 Configure FLUME_HOME

export FLUME_HOME=/opt/software/flume-1.6.0-cdh5.7.0-bin

export PATH=$FLUME_HOME/bin:$PATH

4.2 Edit the Configuration File

 cd $FLUME_HOME/conf
 cp flume-env.sh.template flume-env.sh
 # add the following line to flume-env.sh
 export JAVA_HOME=/usr/java/jdk1.8.0_45

5 How to Use

5.1 (Source: NetCat) (Sink: logger) (Channel: memory)

NetCat Source: listens on a specified network port. Whenever an application writes data to that port, the source picks it up.

Property Name  Default  Description
channels       –
type           –        The component type name, needs to be netcat
bind           –        Host name or IP address that logs are sent to; the netcat source listens on this host
port           –        Port number that logs are sent to; the netcat source listens on this port

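5.1.1 Configuration File

Official documentation
The official site describes everything in detail and is the place to look things up. The configuration can differ between versions, so be sure to read the documentation for the version you are running.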

vi hello.conf
# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source
a1.sources.r1.type = netcat
a1.sources.r1.bind = 0.0.0.0
a1.sources.r1.port = 44444

# Describe the sink
a1.sinks.k1.type = logger

# Use a channel which buffers events in memory
a1.channels.c1.type = memory


# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

5.1.2 Start Command

Help command: flume-ng help

Start command:

./flume-ng agent --name a1 --conf $FLUME_HOME/conf --conf-file $FLUME_HOME/conf/hello.conf -Dflume.root.logger=INFO,console

--name       name of the agent
--conf       the conf directory
--conf-file  path to the configuration file
-Dflume.root.logger=INFO,console  print logs at INFO level and above to the console

5.1.3 Test

[hadoop@hadoop ~]$ telnet localhost 44444
Trying ::1...
Connected to localhost.
Escape character is '^]'.
hello
OK
wold
OK

Note: if the telnet command is not available, install it yourself (client and server); search online if you are unsure how.
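
If installing telnet is inconvenient, a few lines of Python can stand in for it. The sketch below is an assumption-laden helper, not part of Flume: `send_lines` is a made-up name, and it assumes the NetCat source is left at its default behaviour of answering every received line with one acknowledgement line (the `OK` seen in the telnet session above).

```python
import socket
from typing import List


def send_lines(host: str, port: int, lines: List[str], timeout: float = 5.0) -> List[str]:
    """Send each line to a line-oriented TCP listener and collect one reply per line."""
    replies = []
    with socket.create_connection((host, port), timeout=timeout) as sock:
        # Wrap the socket in a text reader so replies can be read line by line.
        reader = sock.makefile("r", encoding="utf-8", newline="\n")
        for line in lines:
            sock.sendall((line + "\n").encode("utf-8"))
            replies.append(reader.readline().strip())
    return replies


# Example call against the agent from hello.conf (requires the agent to be running):
# print(send_lines("localhost", 44444, ["hello", "wold"]))
```

Unlike telnet, this client terminates lines with a bare `\n`, so the event bodies should not carry a trailing carriage return.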

ServerSocketChannelImpl[/0:0:0:0:0:0:0:0:44444]
2018-04-21 17:52:28,531 (SinkRunner-PollingRunner-DefaultSinkProcessor) [INFO - org.apache.flume.sink.LoggerSink.process(LoggerSink.java:94)] Event: { headers:{} body: 68 65 6C 6C 6F 0D                               hello. }
2018-04-21 18:00:42,815 (SinkRunner-PollingRunner-DefaultSinkProcessor) [INFO - org.apache.flume.sink.LoggerSink.process(LoggerSink.java:94)] Event: { headers:{} body: 77 6F 6C 64 0D 
As the log shows, both hello and wold were received.
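
The logger sink prints each event body as hex bytes followed by a printable preview. A quick way to read those bytes, sketched here in Python (`decode_logger_body` is just an illustrative helper name), is to decode them back to text. Doing so shows that the trailing `0D` is the carriage return that telnet sends at the end of each line.

```python
def decode_logger_body(hex_dump: str) -> str:
    """Turn the space-separated hex bytes printed by the logger sink back into text."""
    return bytes.fromhex(hex_dump).decode("utf-8")


hello = decode_logger_body("68 65 6C 6C 6F 0D")
print(repr(hello))                # 'hello\r'
print(repr(hello.rstrip("\r")))   # 'hello'
```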

5.2 (Source: NetCat) (Sink: hdfs) (Channel: file)

Write the logs to HDFS.

5.2.1 Configuration File

vi test1.conf
# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source
a1.sources.r1.type = netcat
a1.sources.r1.bind = 0.0.0.0
a1.sources.r1.port = 44444

# Describe the sink
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = hdfs://hadoop:9000/flume/
a1.sinks.k1.hdfs.writeFormat = Text
a1.sinks.k1.hdfs.fileType = DataStream
a1.sinks.k1.hdfs.rollInterval = 10
a1.sinks.k1.hdfs.rollSize = 0
a1.sinks.k1.hdfs.rollCount = 0
a1.sinks.k1.hdfs.filePrefix = %Y-%m-%d-%H-%M-%S
a1.sinks.k1.hdfs.useLocalTimeStamp = true

# Use a channel which buffers events in file
a1.channels.c1.type = file

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

5.2.2 Start Command

./flume-ng agent --name a1 --conf $FLUME_HOME/conf --conf-file $FLUME_HOME/conf/test1.conf   -Dflume.root.logger=DEBUG,console

2018-04-21 18:55:57,906 (lifecycleSupervisor-1-3) [DEBUG - org.apache.flume.source.NetcatSource.start(NetcatSource.java:190)] Source started
2018-04-21 18:55:57,907 (Thread-2) [DEBUG - org.apache.flume.source.NetcatSource$AcceptHandler.run(NetcatSource.java:270)] Starting accept handler

5.2.3 Result

[hadoop@hadoop ~]$ telnet localhost 44444
Trying ::1...
Connected to localhost.
Escape character is '^]'.
hello world
OK

hdfs dfs -text /flume/2018-04-21-19-05-47.1524366347156
hello world
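
The file name above has two parts: the configured filePrefix and a numeric suffix. The following Python sketch splits them apart under one assumption that the source does not state: that the suffix is a counter seeded from the current time in epoch milliseconds. `split_hdfs_file_name` is a hypothetical helper, not a Flume utility.

```python
from datetime import datetime, timedelta, timezone
from typing import Tuple


def split_hdfs_file_name(name: str) -> Tuple[str, datetime]:
    """Split '<filePrefix>.<counter>' and read the counter as epoch milliseconds (assumption)."""
    prefix, _, counter = name.rpartition(".")
    millis = int(counter)
    # Integer arithmetic avoids floating-point rounding of the millisecond part.
    created = datetime.fromtimestamp(millis // 1000, tz=timezone.utc) + timedelta(
        milliseconds=millis % 1000
    )
    return prefix, created


prefix, created = split_hdfs_file_name("2018-04-21-19-05-47.1524366347156")
print(prefix)               # 2018-04-21-19-05-47
print(created.isoformat())  # 2018-04-22T03:05:47.156000+00:00
```

The minutes and seconds of both parts agree. The prefix is rendered in the agent's local time (useLocalTimeStamp = true) while the decoded suffix is in UTC, so the eight-hour gap would simply reflect the time zone of the host, if the assumption holds.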

5.3 (Source: Spooling Directory) (Sink: hdfs) (Channel: file)

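Spooling Directory Source: watches a specified directory. Whenever an application adds a new file to that directory, the source picks the file up, parses its contents, and writes them to the channel. Once a file has been fully written to the channel, it is either marked as completed or deleted.
According to the official documentation, this source is quite reliable and does not lose data even if Flume restarts. To keep that guarantee, only immutable, uniquely named files may be placed in the directory. In practice, log file names can be defined through log4j so that they rarely collide, and log files are generally not modified once they have been produced, so this source suits offline data processing well.

Description of the Spooling Directory Source from the Flume official site:

Property Name   Default      Description
channels        –
type            –            The component type name, needs to be spooldir.
spoolDir        –            The directory that the Spooling Directory Source watches
fileSuffix      .COMPLETED   Suffix appended to a file once its contents have been written to the channel
deletePolicy    never        When to delete a file after its contents have been written to the channel: never or immediate
fileHeader      false        Whether to add a header storing the absolute path filename.
ignorePattern   ^$           Regular expression specifying which files to ignore (skip)
interceptors    –            Interceptors applied to events in transit, typically to set headers; timestamp is commonly used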

Two caveats for the Spooling Directory Source:

1 If a file is written to after being placed into the spooling directory, Flume will print an error to its log file and stop processing.
That is, a file copied into the spool directory must not be opened and edited again.
2 If a file name is reused at a later time, Flume will print an error to its log file and stop processing.
That is, a file with the same name as an earlier one must not be copied into this directory.
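
One common way to respect both caveats is to finish writing the file somewhere else and then move it into the spool directory under a name that is never reused. The Python sketch below illustrates that idea. `deliver_to_spool` and the staging directory are assumptions introduced for this example; the staging directory must live on the same filesystem as the spool directory so that the final rename is a single atomic step on POSIX systems.

```python
import os
import shutil
import time
import uuid


def deliver_to_spool(src_path: str, staging_dir: str, spool_dir: str) -> str:
    """Copy src_path into staging_dir, then rename it into spool_dir under a unique name."""
    base = os.path.basename(src_path)
    # Timestamp plus a random part keeps the name from ever being reused.
    unique_name = f"{base}.{int(time.time() * 1000)}.{uuid.uuid4().hex[:8]}"
    staged = os.path.join(staging_dir, unique_name)
    shutil.copyfile(src_path, staged)   # the slow copy happens outside the spool directory
    final = os.path.join(spool_dir, unique_name)
    os.rename(staged, final)            # the file appears in the spool directory fully written
    return final


# Example call (paths follow the configuration in this section; the staging path is made up):
# deliver_to_spool("input.txt", "/home/hadoop/staging", "/home/hadoop/inputfile")
```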

5.3.1 Configuration File

# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source
a1.sources.r1.type = spooldir
a1.sources.r1.spoolDir = /home/hadoop/inputfile
a1.sources.r1.fileHeader = true
a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = timestamp

# Describe the sink
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = hdfs://hadoop:9000/flume/
a1.sinks.k1.hdfs.writeFormat = Text
a1.sinks.k1.hdfs.fileType = DataStream
a1.sinks.k1.hdfs.rollInterval = 10
a1.sinks.k1.hdfs.rollSize = 0
a1.sinks.k1.hdfs.rollCount = 0
a1.sinks.k1.hdfs.filePrefix = %Y-%m-%d-%H-%M-%S
a1.sinks.k1.hdfs.useLocalTimeStamp = true

# Use a channel which buffers events in file
a1.channels.c1.type = file

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

5.3.2 Start Command

./flume-ng agent --name a1 --conf $FLUME_HOME/conf --conf-file $FLUME_HOME/conf/test2.conf  -Dflume.root.logger=INFO,console

2018-04-21 19:52:14,083 (lifecycleSupervisor-1-0) [INFO - org.apache.flume.channel.file.FileChannel.start(FileChannel.java:301)] Queue Size after replay: 0 [channel=c1]
2018-04-21 19:52:14,186 (lifecycleSupervisor-1-0) [INFO - org.apache.flume.instrumentation.MonitoredCounterGroup.register(MonitoredCounterGroup.java:120)] Monitored counter group for type: CHANNEL, name: c1: Successfully registered new MBean.
2018-04-21 19:52:14,188 (lifecycleSupervisor-1-0) [INFO - org.apache.flume.instrumentation.MonitoredCounterGroup.start(MonitoredCounterGroup.java:96)] Component type: CHANNEL, name: c1 started
2018-04-21 19:52:14,188 (conf-file-poller-0) [INFO - org.apache.flume.node.Application.startAllComponents(Application.java:173)] Starting Sink k1
2018-04-21 19:52:14,190 (conf-file-poller-0) [INFO - org.apache.flume.node.Application.startAllComponents(Application.java:184)] Starting Source r1
2018-04-21 19:52:14,190 (lifecycleSupervisor-1-3) [INFO - org.apache.flume.source.SpoolDirectorySource.start(SpoolDirectorySource.java:78)] SpoolDirectorySource source starting with directory: /home/hadoop/inputfile
2018-04-21 19:52:14,200 (lifecycleSupervisor-1-1) [INFO - org.apache.flume.instrumentation.MonitoredCounterGroup.register(MonitoredCounterGroup.java:120)] Monitored counter group for type: SINK, name: k1: Successfully registered new MBean.
2018-04-21 19:52:14,201 (lifecycleSupervisor-1-1) [INFO - org.apache.flume.instrumentation.MonitoredCounterGroup.start(MonitoredCounterGroup.java:96)] Component type: SINK, name: k1 started
2018-04-21 19:52:14,240 (lifecycleSupervisor-1-3) [INFO - org.apache.flume.instrumentation.MonitoredCounterGroup.register(MonitoredCounterGroup.java:120)] Monitored counter group for type: SOURCE, name: r1: Successfully registered new MBean.
2018-04-21 19:52:14,241 (lifecycleSupervisor-1-3) [INFO - org.apache.flume.instrumentation.MonitoredCounterGroup.start(MonitoredCounterGroup.java:96)] Component type: SOURCE, name: r1 started


cp input.txt ../inputfile/

5.3.3 Result

[hadoop@hadoop conf]$ hdfs dfs -text /flume/2018-04-21-19-53-24.1524369204246 
hello java
[hadoop@hadoop conf]$ hdfs dfs -text /flume/2018-04-21-19-53-26.1524369206318
hello hadoop
hello hive
hello sqoop
hello hdfs
hello spark

The agent log also shows whether the input file was processed successfully.

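5.4 (Source: Exec Source) (Sink: hdfs) (Channel: memory)

Exec Source: runs a specified command and uses its output as the data source.
The most common command is tail -F file: whenever an application writes to the log file, the source picks up the newly appended content.

5.4.1 Configuration File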

# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /home/hadoop/data/data.log

# Describe the sink
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = hdfs://hadoop:9000/flume
a1.sinks.k1.hdfs.writeFormat = Text
a1.sinks.k1.hdfs.fileType = DataStream
a1.sinks.k1.hdfs.rollInterval = 10
a1.sinks.k1.hdfs.rollSize = 0
a1.sinks.k1.hdfs.rollCount = 0
a1.sinks.k1.hdfs.filePrefix = %Y-%m-%d-%H-%M-%S
a1.sinks.k1.hdfs.useLocalTimeStamp = true


# Use a channel which buffers events in memory
a1.channels.c1.type=memory
a1.channels.c1.capacity=10000
a1.channels.c1.transactionCapacity=1000

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

Create an external table in Hive over the hdfs://hadoop:9000/flume directory so that the captured log content is easy to inspect:

create external table t1(infor  string)
row format delimited
fields terminated by '\t'
location '/flume/';

5.4.2 Start Command

./flume-ng agent --name a1 --conf $FLUME_HOME/conf --conf-file $FLUME_HOME/conf/test3.conf  -Dflume.root.logger=INFO,console

echo hadoop >data.log

5.4.3 Result

On HDFS:
[hadoop@hadoop ~]$ hdfs dfs -text /flume/2018-04-21-20-13-38.1524370418338 
hadoop

In Hive:
hive> select * from t1;
OK
hello world
hello java
hello hadoop
hello hive
hello sqoop
hello hdfs
hello spark
hadoop

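Summary of the Exec Source:
Exec Source and Spooling Directory Source are two common ways to collect logs. Exec Source can collect logs in real time, while Spooling Directory Source falls somewhat short on real-time collection. However, when Flume is not running or the command fails, Exec Source cannot collect the log data and logs are lost, so the completeness of the collected logs cannot be guaranteed.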
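
The loss happens because tail -F keeps no record of how far it has read. The usual remedy is to persist the read offset, so that a restarted collector resumes where it stopped. The Python sketch below illustrates that idea only. `read_new_lines` and the JSON position file are made up for this example, it does not handle log rotation or truncation, and it is not how any Flume source is actually implemented.

```python
import json
import os
from typing import List


def read_new_lines(log_path: str, position_path: str) -> List[str]:
    """Return complete lines appended since the last call, persisting the byte offset."""
    offset = 0
    if os.path.exists(position_path):
        with open(position_path) as f:
            offset = json.load(f).get("offset", 0)

    with open(log_path, "rb") as log:
        log.seek(offset)
        data = log.read()

    # Consume only up to the last newline; a half-written line is left for the next call.
    end = data.rfind(b"\n") + 1
    lines = data[:end].decode("utf-8").splitlines()

    with open(position_path, "w") as f:
        json.dump({"offset": offset + end}, f)
    return lines


# Example call (the position file path is hypothetical):
# read_new_lines("/home/hadoop/data/data.log", "/home/hadoop/data/data.pos.json")
```

Because the offset survives a restart, lines written while the collector was down are picked up on the next call instead of being skipped.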

This blog post also covers common use cases: link