Flume, a Distributed Log Collection Framework: Overview and Hands-On Case I

Apache Flume

    Flume is a distributed, reliable, and available service for efficiently collecting, aggregating (where is the aggregated data held: on disk or in memory? see the Channel component below), and moving large amounts of log data.

    It has a simple and flexible architecture based on streaming data flows. It is robust and fault tolerant with tunable reliability mechanisms and many failover and recovery mechanisms. It uses a simple extensible data model that allows for online analytic applications.

Reliability: when a node fails, logs can be delivered to other nodes without being lost. Flume offers three levels of reliability guarantees, from strongest to weakest:

  • end-to-end: the collecting agent first writes the event to disk, and deletes it only after the data has been delivered successfully; if delivery fails, the event can be resent.
  • Store on failure: the strategy also adopted by Scribe; when the receiver crashes, data is written locally, and sending resumes once the receiver recovers.
  • Best effort: no acknowledgement is performed after data is sent to the receiver.



A Brief History of Flume

  • Cloudera, 0.9.2: Flume-OG (in the 0.94.0 release, unstable log delivery was especially severe; this is noted in the troubleshooting section of the BigInsights product documentation)
  • 2011.10.22: FLUME-728, Flume-NG  ==> Apache
  • 2012.7: 1.0
  • 2015.5: 1.6
  • ~: 1.8

Flume Core Components

  • Source: collects data
  • Channel: aggregates and buffers data; the data pipeline. Memory Channel/Kafka Channel/File Channel
  • Sink: writes data out. HDFS Sink/Hive Sink/Kafka Sink...
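
The three components are wired together by name in an agent's properties file. A minimal skeleton (the component names here are placeholders) looks like this; Case 1 below fills it in with concrete types:

<agent>.sources = <src>
<agent>.channels = <ch>
<agent>.sinks = <snk>

# A source can fan out to several channels (plural key "channels");
# a sink drains exactly one channel (singular key "channel").
<agent>.sources.<src>.channels = <ch>
<agent>.sinks.<snk>.channel = <ch>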


Flume Installation

  • Java Runtime Environment - Java 1.7 or later
  • Memory - Sufficient memory for configurations used by sources, channels or sinks
  • Disk Space - Sufficient disk space for configurations used by channels or sinks
  • Directory Permissions - Read/Write permissions for directories used by the agent

1. Install the JDK

  1. Download jdk1.8.0_144 and extract it to /opt/.
  2. Add Java to the system environment variables in /etc/profile (as shown in the commands below):
     export JAVA_HOME=/opt/jdk1.8.0_144
     export PATH=$JAVA_HOME/bin:$PATH
  3. Run source /etc/profile to make the configuration take effect.
  4. Verify with java -version.
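
Steps 2-4 as concrete shell commands (a sketch assuming the extraction path above; appending to /etc/profile requires root):

# Append the environment variables to /etc/profile
cat >> /etc/profile <<'EOF'
export JAVA_HOME=/opt/jdk1.8.0_144
export PATH=$JAVA_HOME/bin:$PATH
EOF
# Reload the profile and verify the installation
source /etc/profile
java -version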

2. Install Flume

  1. Download Flume and extract it to /opt/.
  2. Add Flume to the system environment variables in /etc/profile:
     export FLUME_HOME=/opt/apache-flume-1.6.0-cdh5.7.0-bin
     export PATH=$FLUME_HOME/bin:$PATH
  3. Run source /etc/profile to make the configuration take effect.
  4. Configure flume-env.sh: export JAVA_HOME=/opt/jdk1.8.0_144
  5. Verify: flume-ng version

Flume in Practice

Case 1: Monitor a directory, collect newly added files in real time, and print their contents to the console

1. Technology selection

Spooling Directory Source

This source lets you ingest data by placing files to be ingested into a “spooling” directory on disk. This source will watch the specified directory for new files, and will parse events out of new files as they appear. The event parsing logic is pluggable. After a given file has been fully read into the channel, it is renamed to indicate completion (or optionally deleted).

Unlike the Exec source, this source is reliable and will not miss data, even if Flume is restarted or killed. In exchange for this reliability, only immutable, uniquely-named files must be dropped into the spooling directory. Flume tries to detect these problem conditions and will fail loudly if they are violated:

  1. If a file is written to after being placed into the spooling directory, Flume will print an error to its log file and stop processing.
  2. If a file name is reused at a later time, Flume will print an error to its log file and stop processing.
Property Name   Default      Description
channels        -
type            -            The component type name, needs to be spooldir
spoolDir        -            The directory from which to read files from.
fileSuffix      .COMPLETED   Suffix to append to completely ingested files

spooling source + memory channel + logger sink


2. Configuration file (saved as $FLUME_HOME/conf/spooling.conf, matching the start command below)
# Name the components on this agent
a1.sources = spooling1
a1.sinks = target1
a1.channels = channel1

# Describe/configure the source
a1.sources.spooling1.type = spooldir
a1.sources.spooling1.spoolDir = /var/log/apache/flumeSpool
a1.sources.spooling1.fileSuffix = .test
# Describe the sink
a1.sinks.target1.type = logger
# Use a channel which buffers events in memory
a1.channels.channel1.type = memory
# Bind the source and sink to the channel
a1.sources.spooling1.channels = channel1
a1.sinks.target1.channel = channel1

3. Start the agent

flume-ng agent \
--name a1  \
--conf $FLUME_HOME/conf  \
--conf-file $FLUME_HOME/conf/spooling.conf \
-Dflume.root.logger=INFO,console
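
A quick smoke test (the file name here is chosen for illustration): create a file elsewhere and move it into the spooling directory, so that it appears atomically and is never written to in place:

echo "hello flume" > /tmp/1.log
mv /tmp/1.log /var/log/apache/flumeSpool/
# The console should print the event body, and the ingested file
# should be renamed to 1.log.test (the configured fileSuffix).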
4. Analysis and summary

    The spooling directory source can monitor all files under a directory and push them through the channel to the sink. However, it only supports writing a given file in the directory once; after a file has been ingested, it must not be written to again.

Problem description: suppose spoolDir=/.../OutputData/, which contains a number of csv files: 201801010001.csv, 201801010002.csv, .... A program keeps creating csv files in this directory, putting each minute's data into a per-minute file (for minute 5, say, xxxxxxxxx05.csv). How can these newly produced data be collected in real time?

There are several ways to solve this problem:

Option 1: set ignorePattern = ^(.)*\\.csv$ so that files still being written are ignored, and have a separate watcher program act on the directory: whenever a new csv file appears, rename the previous minute's '201801010001.csv' to '201801010001.spool'. This avoids writing to a file more than once inside the spooling directory. A configuration sketch follows.
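
A minimal source configuration for Option 1 (the spoolDir path is hypothetical; the other names reuse the example above):

# Ignore files that are still being written (*.csv); only files renamed
# to another extension (e.g. *.spool) are picked up and ingested.
a1.sources.spooling1.type = spooldir
a1.sources.spooling1.spoolDir = /data/OutputData
a1.sources.spooling1.ignorePattern = ^(.)*\\.csv$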

Option 2: point spoolDir at a separate directory, say spoolDir=/.../Spooling/, rather than at the OutputData directory itself. A separate program or shell script watches OutputData: whenever a new csv file appears, it moves or copies the previous minute's '201801010001.csv' into the spooling directory. This likewise avoids writing to a file more than once inside the spooling directory. A script sketch follows.
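
A shell sketch of Option 2 (the directory names are assumptions): run it once a minute, e.g. from cron, to move every csv file except the newest one (which is still being written) into the spooling directory:

#!/bin/bash
SRC=/data/OutputData
DST=/data/Spooling
# List csv files newest-first, skip the newest (still open for writing),
# and move the rest into the spooling directory.
ls -1t "$SRC"/*.csv 2>/dev/null | tail -n +2 | while read -r f; do
  mv "$f" "$DST"/
done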


Taildir Source Explained

    Watch the specified files, and tail them in near real-time once new lines are detected being appended to each file. If new lines are still being written, this source retries reading them while waiting for the write to complete.

    This source is reliable and will not miss data even when the tailed files rotate. It periodically writes the last read position of each file to the given position file in JSON format. If Flume is stopped or goes down for some reason, it can resume tailing from the positions recorded in the existing position file.

    In another use case, this source can also start tailing from an arbitrary position in each file, using the given position file. When there is no position file at the specified path, it starts tailing from the first line of each file by default.

    Files will be consumed in order of their modification time; the file with the oldest modification time will be consumed first.

    This source does not rename, delete, or otherwise modify the file being tailed. Currently it does not support tailing binary files; it reads text files line by line.

Property Name                        Default                          Description
channels                             -
type                                 -                                The component type name, needs to be TAILDIR.
filegroups                           -                                Space-separated list of file groups. Each file group indicates a set of files to be tailed.
filegroups.<filegroupName>           -                                Absolute path of the file group. Regular expression (and not file system patterns) can be used for the filename only.
positionFile                         ~/.flume/taildir_position.json   File in JSON format to record the inode, the absolute path and the last position of each tailing file.
headers.<filegroupName>.<headerKey>  -                                Header value set with the header key. Multiple headers can be specified for one file group.

Example:


a1.sources = r1
a1.channels = c1
a1.sources.r1.type = TAILDIR
a1.sources.r1.channels = c1
a1.sources.r1.positionFile = /var/log/flume/taildir_position.json
a1.sources.r1.filegroups = f1 f2
a1.sources.r1.filegroups.f1 = /var/log/test1/example.log
a1.sources.r1.headers.f1.headerKey1 = value1
a1.sources.r1.filegroups.f2 = /var/log/test2/.*log.*
a1.sources.r1.headers.f2.headerKey1 = value2
a1.sources.r1.headers.f2.headerKey2 = value2-2
a1.sources.r1.fileHeader = true
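
To run this example, the snippet above still needs a channel definition and a sink (only the source is shown; a memory channel and logger sink as in Case 1 would do). Assuming the completed file is saved as $FLUME_HOME/conf/taildir.conf (a name chosen here for illustration), the agent is started the same way as before:

flume-ng agent \
--name a1 \
--conf $FLUME_HOME/conf \
--conf-file $FLUME_HOME/conf/taildir.conf \
-Dflume.root.logger=INFO,console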

