Flume, a Distributed Log Collection Framework: Overview and Hands-On Case I

Apache Flume

    Flume is a distributed, reliable, and available service for efficiently collecting, aggregating (where is the aggregated data held: on disk or in memory? see the Channel component below), and moving large amounts of log data.

    It has a simple and flexible architecture based on streaming data flows. It is robust and fault tolerant with tunable reliability mechanisms and many failover and recovery mechanisms. It uses a simple extensible data model that allows for online analytic applications.

Reliability: when a node fails, logs can be delivered to other nodes without being lost. Flume offers three levels of reliability guarantees, from strongest to weakest:

  • end-to-end: the collecting agent first writes the event to disk, and deletes it only after the data has been delivered successfully; if delivery fails, the event can be resent.
  • Store on failure: the strategy also adopted by Scribe; when the receiver crashes, data is written locally, and sending resumes once the receiver recovers.
  • Best effort: no acknowledgement is performed after data is sent to the receiver.



A Brief History of Flume

  • Cloudera, 0.9.2: Flume-OG (in the 0.94.0 release, unstable log delivery was especially severe; this is noted in the troubleshooting section of the BigInsights product documentation)
  • 2011.10.22: FLUME-728, Flume-NG  ==> Apache
  • 2012.7: 1.0
  • 2015.5: 1.6
  • ~: 1.8

Flume Core Components

  • Source: collects data
  • Channel: aggregates and buffers data; the data pipeline. Memory Channel/Kafka Channel/File Channel
  • Sink: writes data out. HDFS Sink/Hive Sink/Kafka Sink...
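
The three components are wired together by name in an agent's properties file. A minimal skeleton (the component names here are placeholders) looks like this; Case 1 below fills it in with concrete types:

<agent>.sources = <src>
<agent>.channels = <ch>
<agent>.sinks = <snk>

# A source can fan out to several channels (plural key "channels");
# a sink drains exactly one channel (singular key "channel").
<agent>.sources.<src>.channels = <ch>
<agent>.sinks.<snk>.channel = <ch>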


Flume Installation

  • Java Runtime Environment - Java 1.7 or later
  • Memory - Sufficient memory for configurations used by sources, channels or sinks
  • Disk Space - Sufficient disk space for configurations used by channels or sinks
  • Directory Permissions - Read/Write permissions for directories used by the agent

1. Install the JDK

  1. Download jdk1.8.0_144 and extract it to /opt/.
  2. Add Java to the system environment variables in /etc/profile (as shown in the commands below):
     export JAVA_HOME=/opt/jdk1.8.0_144
     export PATH=$JAVA_HOME/bin:$PATH
  3. Run source /etc/profile to make the configuration take effect.
  4. Verify with java -version.
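
Steps 2-4 as concrete shell commands (a sketch assuming the extraction path above; appending to /etc/profile requires root):

# Append the environment variables to /etc/profile
cat >> /etc/profile <<'EOF'
export JAVA_HOME=/opt/jdk1.8.0_144
export PATH=$JAVA_HOME/bin:$PATH
EOF
# Reload the profile and verify the installation
source /etc/profile
java -version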

2. Install Flume

  1. Download Flume and extract it to /opt/.
  2. Add Flume to the system environment variables in /etc/profile:
     export FLUME_HOME=/opt/apache-flume-1.6.0-cdh5.7.0-bin
     export PATH=$FLUME_HOME/bin:$PATH
  3. Run source /etc/profile to make the configuration take effect.
  4. Configure flume-env.sh: export JAVA_HOME=/opt/jdk1.8.0_144
  5. Verify: flume-ng version

Flume in Practice

Case 1: Monitor a directory, collect newly added files in real time, and print their contents to the console

1. Technology selection

Spooling Directory Source

This source lets you ingest data by placing files to be ingested into a “spooling” directory on disk. This source will watch the specified directory for new files, and will parse events out of new files as they appear. The event parsing logic is pluggable. After a given file has been fully read into the channel, it is renamed to indicate completion (or optionally deleted).

Unlike the Exec source, this source is reliable and will not miss data, even if Flume is restarted or killed. In exchange for this reliability, only immutable, uniquely-named files must be dropped into the spooling directory. Flume tries to detect these problem conditions and will fail loudly if they are violated:

  1. If a file is written to after being placed into the spooling directory, Flume will print an error to its log file and stop processing.
  2. If a file name is reused at a later time, Flume will print an error to its log file and stop processing.
Property Name   Default      Description
channels        -
type            -            The component type name, needs to be spooldir
spoolDir        -            The directory from which to read files from.
fileSuffix      .COMPLETED   Suffix to append to completely ingested files

spooling source + memory channel + logger sink


2. Configuration file (saved as $FLUME_HOME/conf/spooling.conf, matching the start command below)
# Name the components on this agent
a1.sources = spooling1
a1.sinks = target1
a1.channels = channel1

# Describe/configure the source
a1.sources.spooling1.type = spooldir
a1.sources.spooling1.spoolDir = /var/log/apache/flumeSpool
a1.sources.spooling1.fileSuffix = .test
# Describe the sink
a1.sinks.target1.type = logger
# Use a channel which buffers events in memory
a1.channels.channel1.type = memory
# Bind the source and sink to the channel
a1.sources.spooling1.channels = channel1
a1.sinks.target1.channel = channel1

3. Start the agent

flume-ng agent \
--name a1  \
--conf $FLUME_HOME/conf  \
--conf-file $FLUME_HOME/conf/spooling.conf \
-Dflume.root.logger=INFO,console
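
A quick smoke test (the file name here is chosen for illustration): create a file elsewhere and move it into the spooling directory, so that it appears atomically and is never written to in place:

echo "hello flume" > /tmp/1.log
mv /tmp/1.log /var/log/apache/flumeSpool/
# The console should print the event body, and the ingested file
# should be renamed to 1.log.test (the configured fileSuffix).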
4. Analysis and summary

    The spooling directory source can monitor all files under a directory and push them through the channel to the sink. However, it only supports writing a given file in the directory once; after a file has been ingested, it must not be written to again.

Problem description: suppose spoolDir=/.../OutputData/, which contains a number of csv files: 201801010001.csv, 201801010002.csv, .... A program keeps creating csv files in this directory, putting each minute's data into a per-minute file (for minute 5, say, xxxxxxxxx05.csv). How can these newly produced data be collected in real time?

There are several ways to solve this problem:

Option 1: set ignorePattern = ^(.)*\\.csv$ so that files still being written are ignored, and have a separate watcher program act on the directory: whenever a new csv file appears, rename the previous minute's '201801010001.csv' to '201801010001.spool'. This avoids writing to a file more than once inside the spooling directory. A configuration sketch follows.
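
A minimal source configuration for Option 1 (the spoolDir path is hypothetical; the other names reuse the example above):

# Ignore files that are still being written (*.csv); only files renamed
# to another extension (e.g. *.spool) are picked up and ingested.
a1.sources.spooling1.type = spooldir
a1.sources.spooling1.spoolDir = /data/OutputData
a1.sources.spooling1.ignorePattern = ^(.)*\\.csv$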

Option 2: point spoolDir at a separate directory, say spoolDir=/.../Spooling/, rather than at the OutputData directory itself. A separate program or shell script watches OutputData: whenever a new csv file appears, it moves or copies the previous minute's '201801010001.csv' into the spooling directory. This likewise avoids writing to a file more than once inside the spooling directory. A script sketch follows.
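
A shell sketch of Option 2 (the directory names are assumptions): run it once a minute, e.g. from cron, to move every csv file except the newest one (which is still being written) into the spooling directory:

#!/bin/bash
SRC=/data/OutputData
DST=/data/Spooling
# List csv files newest-first, skip the newest (still open for writing),
# and move the rest into the spooling directory.
ls -1t "$SRC"/*.csv 2>/dev/null | tail -n +2 | while read -r f; do
  mv "$f" "$DST"/
done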


Taildir Source Explained

    Watch the specified files, and tail them in near real-time once new lines are detected being appended to each file. If new lines are still being written, this source retries reading them while waiting for the write to complete.

    This source is reliable and will not miss data even when the tailed files rotate. It periodically writes the last read position of each file to the given position file in JSON format. If Flume is stopped or goes down for some reason, it can resume tailing from the positions recorded in the existing position file.

    In another use case, this source can also start tailing from an arbitrary position in each file, using the given position file. When there is no position file at the specified path, it starts tailing from the first line of each file by default.

    Files will be consumed in order of their modification time; the file with the oldest modification time will be consumed first.

    This source does not rename, delete, or otherwise modify the file being tailed. Currently it does not support tailing binary files; it reads text files line by line.

Property Name                        Default                          Description
channels                             -
type                                 -                                The component type name, needs to be TAILDIR.
filegroups                           -                                Space-separated list of file groups. Each file group indicates a set of files to be tailed.
filegroups.<filegroupName>           -                                Absolute path of the file group. Regular expression (and not file system patterns) can be used for the filename only.
positionFile                         ~/.flume/taildir_position.json   File in JSON format to record the inode, the absolute path and the last position of each tailing file.
headers.<filegroupName>.<headerKey>  -                                Header value set with the header key. Multiple headers can be specified for one file group.

Example:


a1.sources = r1
a1.channels = c1
a1.sources.r1.type = TAILDIR
a1.sources.r1.channels = c1
a1.sources.r1.positionFile = /var/log/flume/taildir_position.json
a1.sources.r1.filegroups = f1 f2
a1.sources.r1.filegroups.f1 = /var/log/test1/example.log
a1.sources.r1.headers.f1.headerKey1 = value1
a1.sources.r1.filegroups.f2 = /var/log/test2/.*log.*
a1.sources.r1.headers.f2.headerKey1 = value2
a1.sources.r1.headers.f2.headerKey2 = value2-2
a1.sources.r1.fileHeader = true
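
To run this example, the snippet above still needs a channel definition and a sink (only the source is shown; a memory channel and logger sink as in Case 1 would do). Assuming the completed file is saved as $FLUME_HOME/conf/taildir.conf (a name chosen here for illustration), the agent is started the same way as before:

flume-ng agent \
--name a1 \
--conf $FLUME_HOME/conf \
--conf-file $FLUME_HOME/conf/taildir.conf \
-Dflume.root.logger=INFO,console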

