Flume: Basic Concepts and Getting Started

Latest recommended article published on 2021-10-14 21:53:41

yutao_Struggle

Category: big data. Tags: flume

Article link: https://blog.csdn.net/yutao_Struggle/article/details/102620247

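1 Introduction to Flume

1.1 Flume Overview

Flume is a highly available, highly reliable, distributed system, originally provided by Cloudera, for collecting, aggregating, and transporting massive volumes of log data. Its use is not limited to log aggregation: because data sources are customizable, Flume can transport large amounts of event data, including but not limited to network traffic data, data generated by social media, email messages, and data from almost any other source. Flume is built on a streaming architecture and is flexible and simple.

Flume currently exists in two generations: the 0.9.x releases, collectively called Flume-og (Cloudera Flume), and the 1.x releases, collectively called Flume-ng (Apache Flume). Flume-ng went through a major re-architecture and differs considerably from Flume-og, so take care to tell them apart.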

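1.2 Architecture

A Flume event is defined as a unit of data flow with a byte payload and an optional set of string attributes. A Flume agent is a (JVM) process that hosts the components through which events flow from an external source to the next destination (hop).

An external source (such as a web server) sends events to Flume in a format that the target Flume source recognizes. For example, an Avro Flume source can receive Avro events from Avro clients or from the Avro sink of another Flume agent. A similar flow can be defined with a Thrift Flume source, which receives events from a Thrift sink, a Flume Thrift RPC client, or a Thrift client written in any language generated from the Flume Thrift protocol.

When a Flume source receives an event, it stores it in one or more channels. A channel is a passive store that keeps the event until a Flume sink consumes it. The sink removes the event from the channel and either puts it into an external repository such as HDFS (via the Flume HDFS sink) or forwards it to the Flume source of the next Flume agent in the flow. The source and sink within a given agent run asynchronously, with events staged in the channel.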

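1.2.1 Agent

An agent is a JVM process that moves data, in the form of events, from origin to destination. It is the basic unit of data transfer in Flume and consists of three main parts: Source, Channel, and Sink. Events run through the entire Flume data flow. An event is Flume's basic unit of data: it carries log data (as a byte array) together with header information. Events are produced by sources external to the agent. When a Source captures an event, it applies source-specific formatting and then pushes the event into one or more Channels. You can think of a Channel as a buffer that holds an event until a Sink has finished processing it. A Sink is responsible for persisting the logs or pushing the events on to another Source.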

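1.2.2 Source

A Source is the component that receives data into a Flume agent and wraps it into individual events. Source components can handle log data of many types and formats, including:

Avro Source: built on the Avro RPC framework; receives messages from Avro clients
Thrift Source
Exec Source: obtains data by running a Linux command, such as tail -f
JMS Source: JMS messages
Spooling Directory Source: watches a directory for newly added files
Taildir Source: watches the files in a directory for appended content; preferable to Exec because it can resume from where it left off
Twitter 1% firehose Source
Kafka Source: Kafka messages
NetCat Source: available for TCP and UDP; receives messages from NetCat clients
Sequence Generator Source
Syslog Sources
HTTP Source: receives HTTP messages
Legacy Sources
Scribe Source
Custom Source: a user-defined Source

A single Source can be attached to multiple Channels.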

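1.2.3 Channel

A Channel is a buffer that sits between a Source and a Sink, which lets the Source and the Sink run at different rates. Channels are thread-safe and can handle writes from several Sources and reads from several Sinks at the same time.

Channels supported by Flume:

Memory Channel: an in-memory queue. Suitable when data loss is acceptable. If data loss matters, do not use the Memory Channel, because a process crash, machine failure, or restart loses the buffered data.
JDBC Channel
Kafka Channel
File Channel: writes all events to disk, so no data is lost when the process shuts down or the machine goes down.
Spillable Memory Channel
Pseudo Transaction Channel
Custom Channel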

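1.2.4 Sink

A Sink continuously polls the Channel for events and removes them in batches, writing each batch to a storage or indexing system or sending it to another Flume agent. Sinks are fully transactional. Before removing a batch from the Channel, each Sink starts a transaction with the Channel. Once the batch has been written successfully to the storage system or to the next Flume agent, the Sink commits the transaction with the Channel. Once the transaction is committed, the Channel deletes the events from its internal buffer.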
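The take, deliver, then commit-or-roll-back cycle described above can be illustrated with a small Python model. This is only a sketch of the idea: ToyChannel and take_batch are hypothetical names invented here, not Flume classes, and the real channel transaction API is Java.

```python
from collections import deque


class ToyChannel:
    """Hypothetical in-memory stand-in for a channel (not a Flume class)."""

    def __init__(self, events):
        self._queue = deque(events)

    def __len__(self):
        return len(self._queue)

    def take_batch(self, batch_size, deliver):
        """Take up to batch_size events and hand them to deliver().

        If deliver() raises, the batch is put back (rollback) and False is
        returned; otherwise the events stay removed (commit).
        """
        batch = []
        while self._queue and len(batch) < batch_size:
            batch.append(self._queue.popleft())
        try:
            deliver(batch)
        except Exception:
            # Rollback: restore the events at the head, in their original order.
            self._queue.extendleft(reversed(batch))
            return False
        return True


def unavailable(batch):
    raise IOError("downstream store is unavailable")


channel = ToyChannel([b"e1", b"e2", b"e3"])
delivered = []

failed_ok = channel.take_batch(2, unavailable)            # rolled back
size_after_failure = len(channel)
succeeded_ok = channel.take_batch(2, delivered.extend)    # committed
print(failed_ok, size_after_failure, succeeded_ok, delivered, len(channel))
```

The point of the model is that events leave the buffer only after the downstream write has succeeded, which is what lets a failed delivery be retried later.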

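Sinks supported by Flume include, among others, the HDFS Sink, Logger Sink, Avro Sink, and Thrift Sink mentioned in this article; the official user guide has the full list.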

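1.2.5 Event

The event is Flume's basic unit of data transfer: data travels from origin to destination in the form of events. An Event consists of two parts, a Header and a Body. The Header stores the event's attributes as key-value pairs; the Body stores the data itself as a byte array.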
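To make the header/body split concrete, here is a minimal Python model of an event. The Event dataclass and the make_event helper are illustrative names assumed for this sketch; Flume's own event type is a Java interface.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class Event:
    """Illustrative stand-in for a Flume event: string headers plus a byte body."""
    body: bytes
    headers: Dict[str, str] = field(default_factory=dict)


def make_event(line: str, **headers: str) -> Event:
    """Wrap one line of text as an event, encoding the body as UTF-8 bytes."""
    return Event(body=line.encode("utf-8"), headers=dict(headers))


event = make_event("hello", timestamp="1571300000000", host="node1")
print(event.headers["host"], event.body)
```

Headers carry routing and metadata (for example a timestamp), while the body stays an opaque byte array that Flume does not interpret.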

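1.2.6 Interceptors

Flume lets interceptors intercept and process events in flight (before the source places the event into the channel). An interceptor must implement the org.apache.flume.interceptor.Interceptor interface. Interceptors can modify or even drop events according to the developer's configuration. Flume also supports interceptor chains, that is, several interceptors combined: events pass through the interceptors one by one, in the order specified for the chain.

Official documentation: http://flume.apache.org/releases/content/1.9.0/FlumeUserGuide.html#flume-interceptors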
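The chain behaviour (apply in order, stop as soon as an interceptor drops the event) can be sketched in Python as follows. The functions add_host, drop_debug, and run_chain are hypothetical illustrations; a real interceptor is a Java class implementing the interface named above.

```python
from typing import Callable, Dict, List, Optional, Tuple

# An event is modelled here as (headers, body); this shape is an assumption
# made for the sketch.
Event = Tuple[Dict[str, str], bytes]
Interceptor = Callable[[Event], Optional[Event]]


def add_host(event: Event) -> Optional[Event]:
    """Modify the event: add a 'host' header."""
    headers, body = event
    return ({**headers, "host": "node1"}, body)


def drop_debug(event: Event) -> Optional[Event]:
    """Drop the event (return None) when the body starts with DEBUG."""
    _, body = event
    return None if body.startswith(b"DEBUG") else event


def run_chain(event: Event, chain: List[Interceptor]) -> Optional[Event]:
    """Apply interceptors in the configured order; stop once one drops the event."""
    current: Optional[Event] = event
    for interceptor in chain:
        current = interceptor(current)
        if current is None:
            return None
    return current


chain = [add_host, drop_debug]
kept = run_chain(({}, b"INFO started"), chain)
dropped = run_chain(({}, b"DEBUG noisy detail"), chain)
print(kept, dropped)
```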

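1.2.7 Channel Selectors

Channel Selectors apply when a source delivers events to multiple channels. The two common types are replicating (the default) and multiplexing. A replicating selector copies each event to every channel, while a multiplexing selector matches the event's attributes against the configured values and sends the event to the designated channel on a match.

Official documentation: http://flume.apache.org/releases/content/1.9.0/FlumeUserGuide.html#flume-channel-selectors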
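The difference between the two selector types can be sketched in a few lines of Python. The function names, the "state" header, and the channel names c1 to c3 are assumptions chosen for illustration, not Flume configuration keys.

```python
from typing import Dict, List


def replicating(channels: List[str]) -> List[str]:
    """Every event goes to every configured channel."""
    return list(channels)


def multiplexing(headers: Dict[str, str], header: str,
                 mapping: Dict[str, List[str]], default: List[str]) -> List[str]:
    """Pick channels by the value of one header; fall back to the default."""
    return mapping.get(headers.get(header, ""), default)


all_channels = ["c1", "c2", "c3"]
mapping = {"CZ": ["c1"], "US": ["c2", "c3"]}

copies = replicating(all_channels)
routed = multiplexing({"state": "US"}, "state", mapping, ["c1"])
fallback = multiplexing({"state": "FR"}, "state", mapping, ["c1"])
print(copies, routed, fallback)
```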

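1.2.8 Sink Processors

Users can group several sinks into one unit (a sink group). Sink Processors can provide load balancing across all sinks in the group, or fail over from one sink to another in case of a temporary failure.

Official documentation: http://flume.apache.org/releases/content/1.9.0/FlumeUserGuide.html#flume-sink-processors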
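Failover within a sink group can be pictured as "try the preferred sink, move on to the next when it fails". The Python sketch below assumes that a larger priority number means a more preferred sink; failover_send and the sink names are hypothetical and only model the idea.

```python
from typing import Callable, List, Tuple

# (priority, name, send function); the tuple shape is an assumption of this sketch.
Sink = Tuple[int, str, Callable[[bytes], None]]


def failover_send(event: bytes, sinks: List[Sink]) -> str:
    """Try sinks from highest to lowest priority; return the name that succeeded."""
    for _, name, send in sorted(sinks, key=lambda s: s[0], reverse=True):
        try:
            send(event)
            return name
        except Exception:
            continue  # this sink failed, fall through to the next one
    raise RuntimeError("all sinks in the group failed")


received = []


def broken(event: bytes) -> None:
    raise ConnectionError("primary sink is down")


group = [(10, "k1", broken), (5, "k2", received.append)]
used = failover_send(b"hello", group)
print(used, received)
```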

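1.3 Features

Complex flows: Flume supports building multi-hop flows, in which events pass through several agents before reaching their final destination. It also supports fan-in and fan-out flows, contextual routing, and backup routes (failover).
Reliability: Flume uses a transactional approach to guarantee reliable delivery of events. When a node fails, logs can be delivered to other nodes without being lost. Flume has offered three levels of reliability guarantee, from strongest to weakest: end-to-end (the receiving agent first writes the event to disk and deletes it only after the data has been delivered successfully; if delivery fails, the data can be resent), store on failure (the strategy also used by Scribe: when the receiver crashes, data is written locally and sending resumes after recovery), and best effort (data is sent to the receiver without any acknowledgement).
Recoverability: events are staged in the channel, and the File Channel persists them on the local file system so that they survive a failure.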

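2 Data Ingestion Methods

Flume supports several mechanisms for ingesting data from external sources.

2.1 RPC

The Avro client included in the Flume distribution can send a given file to a Flume Avro source using the Avro RPC mechanism:

# Send /usr/logs/log.10 to the Avro source listening on localhost:41414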
$ bin/flume-ng avro-client -H localhost -p 41414 -F /usr/logs/log.10

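2.2 Exec

An exec source runs a given command and consumes its output one "line" at a time, where a line is text followed by a carriage return ('\r'), a line feed ('\n'), or both. For example: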

a2.sources.r2.type = exec
a2.sources.r2.command = tail -F /tmp/root/hive.log
a2.sources.r2.shell = /bin/bash -c

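With this configuration the source obtains its data by running the Linux command.

2.3 Network streams

Flume supports the following mechanisms for reading data from common log stream types: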

Avro
Thrift
Syslog
Netcat

3 Quick Start

# flume-ng is Flume's launcher command; it supports the following options
[root@iZnq8v4wpstsagZ apache-flume-1.9.0-bin]# bin/flume-ng help
Usage: bin/flume-ng <command> [options]...

commands:
  help                      display this help text
  agent                     run a Flume agent
  avro-client               run an avro Flume client
  version                   show Flume version info

global options:
  --conf,-c <conf>          use configs in <conf> directory
  --classpath,-C <cp>       append to the classpath
  --dryrun,-d               do not actually start Flume, just print the command
  --plugins-path <dirs>     colon-separated list of plugins.d directories. See the
                            plugins.d section in the user guide for more details.
                            Default: $FLUME_HOME/plugins.d
  -Dproperty=value          sets a Java system property value
  -Xproperty=value          sets a Java -X option

agent options:
  --name,-n <name>          the name of this agent (required)
  --conf-file,-f <file>     specify a config file (required if -z missing)
  --zkConnString,-z <str>   specify the ZooKeeper connection to use (required if -f missing)
  --zkBasePath,-p <path>    specify the base path in ZooKeeper for agent configs
  --no-reload-conf          do not reload config file if changed
  --help,-h                 display help text

avro-client options:
  --rpcProps,-P <file>   RPC client properties file with server connection params
  --host,-H <host>       hostname to which events will be sent
  --port,-p <port>       port of the avro source
  --dirname <dir>        directory to stream to avro source
  --filename,-F <file>   text file to stream to avro source (default: std input)
  --headerFile,-R <file> File containing event headers as key/value pairs on each new line
  --help,-h              display help text

  Either --rpcProps or both --host and --port must be specified.

Note that if <conf> directory is specified, then it is always included first
in the classpath.

3.1 Installing and Deploying Flume

Upload apache-flume-1.9.0-bin.tar.gz to the /opt/software directory on the Linux host.

Extract apache-flume-1.9.0-bin.tar.gz into /opt/module/:

[root@iZnq8v4wpstsagZ software]# tar -zxf apache-flume-1.9.0-bin.tar.gz -C /opt/module/

Rename flume-env.sh.template under apache-flume-1.9.0-bin/conf to flume-env.sh, then set JAVA_HOME in flume-env.sh:

[root@iZnq8v4wpstsagZ conf]# mv flume-env.sh.template flume-env.sh
[root@iZnq8v4wpstsagZ conf]# vi flume-env.sh
export JAVA_HOME=/opt/module/jdk1.8.0_144

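3.2 Introductory Flume Examples

3.2.1 Official Example: Monitoring Port Data

Official documentation: http://flume.apache.org/releases/content/1.9.0/FlumeUserGuide.html#a-simple-example
Requirement: use Flume to listen on a port, collect the data arriving on that port, and print it to the console.
Approach:

Listen on a port with the NetCat TCP Source, then send data to that port with the NetCat client tool.
Print the data to the console with the logger Sink that Flume provides.

Steps:

Install the netcat client tool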

[root@iZnq8v4wpstsagZ apache-flume-1.9.0-bin]# sudo yum install -y nc

Under apache-flume-1.9.0-bin, create the job directory and the file job/flume-netcat-logger.conf

[root@iZnq8v4wpstsagZ apache-flume-1.9.0-bin]# mkdir job
[root@iZnq8v4wpstsagZ apache-flume-1.9.0-bin]# touch job/flume-netcat-logger.conf

Write the flume-netcat-logger.conf configuration file

[root@iZnq8v4wpstsagZ job]# vim flume-netcat-logger.conf
# flume-netcat-logger.conf: a single-node Flume configuration

# Name the components of agent a1
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Configure the source
a1.sources.r1.type = netcat
a1.sources.r1.bind = 127.0.0.1
a1.sources.r1.port = 44444

# Configure the sink
a1.sinks.k1.type = logger

# Buffer events in a Memory Channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

Run Flume

[root@iZnq8v4wpstsagZ apache-flume-1.9.0-bin]# bin/flume-ng agent -c conf -f job/flume-netcat-logger.conf -n a1 -Dflume.root.logger=INFO,console

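Parameters:

--conf/-c: the configuration files are stored in the conf/ directory
--name/-n: the agent is named a1
--conf-file/-f: the configuration file read for this run is flume-netcat-logger.conf in the job directory
-Dflume.root.logger=INFO,console: -D overrides the flume.root.logger property at run time, here setting the log level to INFO and sending the log output to the console. Log levels include debug, info, warn, and error.

Test with the netcat client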

[root@iZnq8v4wpstsagZ apache-flume-1.9.0-bin]# nc -v 127.0.0.1 44444
hello

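3.2.2 Monitoring a Single Appended File in Real Time

Requirement: monitor the Hive log in real time and upload it to HDFS.
Approach:

On Unix systems you can follow a file in real time with tail -f, and Flume provides the Exec Source for this purpose. On startup the Exec Source runs a given Unix command and expects that process to produce data continuously on standard output (stderr is discarded unless logStdErr=true), wrapping the standard output into events. If the Unix process exits for any reason, the Source exits too and produces no further data. This means that configurations such as cat [named pipe] or tail -F [file] produce the desired result, whereas date probably does not: the first two commands produce a continuous stream of data, while the latter produces a single event and exits.
Flume provides the HDFS Sink for writing data to HDFS. It currently supports creating text and sequence files, and both file types can be compressed. Files written to HDFS can be rolled periodically (the current file is closed and a new one is created) based on elapsed time, data size, or number of events. The sink also buckets/partitions data by attributes such as the timestamp or the machine where the event originated. The HDFS directory path may contain formatting escape sequences, which the HDFS Sink replaces to generate the directory/file name used to store the events. Using this Sink requires Hadoop to be installed so that Flume can use the Hadoop jars to communicate with the HDFS cluster. Note that a Hadoop version that supports the sync() call is required.

Steps:

To write data to HDFS, Flume needs the relevant Hadoop jars. Copy the following jars into /opt/module/apache-flume-1.9.0-bin/lib: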

commons-configuration-1.6.jar
hadoop-auth-2.7.2.jar
hadoop-common-2.7.2.jar
hadoop-hdfs-2.7.2.jar
commons-io-2.4.jar
htrace-core-3.1.0-incubating.jar

Create the flume-file-hdfs.conf configuration file under apache-flume-1.9.0-bin/job

[root@iZnq8v4wpstsagZ apache-flume-1.9.0-bin]# vim job/flume-file-hdfs.conf
# Name the components on this agent
a2.sources = r2
a2.sinks = k2
a2.channels = c2

# Describe/configure the source
a2.sources.r2.type = exec
a2.sources.r2.command = tail -F /tmp/root/hive.log
a2.sources.r2.shell = /bin/bash -c

# Describe the sink
a2.sinks.k2.type = hdfs
a2.sinks.k2.hdfs.path = hdfs://127.0.0.1:9000/flume/%Y%m%d/%H
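# Prefix for the names of uploaded files
a2.sinks.k2.hdfs.filePrefix = logs-
# Whether to round the timestamp down; used together with roundValue and roundUnit
a2.sinks.k2.hdfs.round = true
# How many time units pass before a new directory is created
a2.sinks.k2.hdfs.roundValue = 1
# Time unit used for rounding
a2.sinks.k2.hdfs.roundUnit = hour
# Whether to use the local timestamp
a2.sinks.k2.hdfs.useLocalTimeStamp = true
# Number of events to accumulate before flushing to HDFS
a2.sinks.k2.hdfs.batchSize = 100
# File type; DataStream writes uncompressed output (CompressedStream enables compression)
a2.sinks.k2.hdfs.fileType = DataStream
# How often, in seconds, to roll to a new file
a2.sinks.k2.hdfs.rollInterval = 30
# File size, in bytes, that triggers a roll
a2.sinks.k2.hdfs.rollSize = 134217700
# 0 means rolling does not depend on the number of events
a2.sinks.k2.hdfs.rollCount = 0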

# Use a channel which buffers events in memory
a2.channels.c2.type = memory
a2.channels.c2.capacity = 1000
a2.channels.c2.transactionCapacity = 100

# Bind the source and sink to the channel
a2.sources.r2.channels = c2
a2.sinks.k2.channel = c2

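For all time-related escape sequences, a header with the key "timestamp" must exist among the event headers, unless hdfs.useLocalTimeStamp = true, in which case the sink uses the local time instead. One way to add the header automatically is the TimestampInterceptor.
a2.sinks.k2.hdfs.useLocalTimeStamp = true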
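To see what the escape sequences in hdfs.path expand to, the following Python sketch rounds a timestamp down and fills in %Y%m%d/%H. It relies on these particular escapes having the same meaning in Python's strftime; resolve_hdfs_path is a hypothetical helper and the timestamp is made up, so treat this as an illustration rather than the sink's actual code.

```python
from datetime import datetime


def resolve_hdfs_path(template: str, ts: datetime, round_hours: int = 1) -> str:
    """Round ts down to a multiple of round_hours, then expand the escapes."""
    rounded = ts.replace(hour=ts.hour - ts.hour % round_hours,
                         minute=0, second=0, microsecond=0)
    # %Y, %m, %d and %H mean the same in strftime as in the path template.
    return rounded.strftime(template)


local_now = datetime(2019, 10, 18, 14, 37, 5)  # stands in for the local clock
hourly = resolve_hdfs_path("/flume/%Y%m%d/%H", local_now)
six_hourly = resolve_hdfs_path("/flume/%Y%m%d/%H", local_now, round_hours=6)
print(hourly)       # /flume/20191018/14
print(six_hourly)   # /flume/20191018/12

# rollSize above is 28 bytes short of 128 MiB.
print(128 * 1024 * 1024 - 134217700)  # 28
```

With roundValue = 1 and roundUnit = hour, as in the configuration above, all events from the same clock hour end up in the same directory.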

Run Flume

[root@iZnq8v4wpstsagZ apache-flume-1.9.0-bin]# bin/flume-ng agent --conf conf/ --name a2 --conf-file job/flume-file-hdfs.conf

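Run Hive to generate log output

[root@iZnq8v4wpstsagZ apache-flume-1.9.0-bin]# hive
hive (default)> select * from dept order by id;

Note:
The problem with ExecSource and other asynchronous sources is that the source cannot guarantee that the client learns about a failure to put an event into the channel; in that case the data is lost. Take the tail -F [file] use case: if the channel fills up and Flume cannot send an event, Flume has no way to tell the application writing the log file that it needs to retain the log, or that the event has not been sent for some reason. When you use a one-way asynchronous interface such as ExecSource, your application can never be sure that the data has been received. For stronger reliability guarantees, consider the Spooling Directory Source, the Taildir Source, or integrating with Flume directly through the SDK.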

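3.2.3 Monitoring a Directory for Multiple New Files in Real Time

Requirement: use Flume to watch an entire directory and upload its files to HDFS.
Approach:

Flume provides the Spooling Directory Source (SpoolingDirSource), which designates a directory on local disk as the "spooling" (automatic collection) directory. The source reads files newly added to the directory and wraps their contents into events. After reading a whole file into the channel, it either deletes the file (depending on configuration) or renames it to mark it as completed, so that the source can keep watching for new files. Unlike the Exec source, SpoolingDirSource is reliable: no data is lost even if Flume is killed or restarted. The price of that guarantee is that Flume reports an error and stops as soon as it detects either of the following:
① a file is modified after being placed in the directory, while it is being collected
② a file name is reused after the file was placed in the directory (a duplicate name appears)
Requirements: only closed (completed) files may be placed in the spooling directory, and file names must be unique within a given SpoolingDirSource.

Steps:

Create the flume-dir-hdfs.conf configuration file under apache-flume-1.9.0-bin/job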

[root@iZnq8v4wpstsagZ apache-flume-1.9.0-bin]# vim job/flume-dir-hdfs.conf
a3.sources = r3
a3.sinks = k3
a3.channels = c3

# Describe/configure the source
a3.sources.r3.type = spooldir
a3.sources.r3.spoolDir = /opt/module/apache-flume-1.9.0-bin/upload
a3.sources.r3.fileSuffix = .COMPLETED
a3.sources.r3.fileHeader = true
# Ignore (do not upload) all files ending in .tmp
a3.sources.r3.ignorePattern = ([^ ]*\.tmp)

# Describe the sink
a3.sinks.k3.type = hdfs
a3.sinks.k3.hdfs.path =hdfs://127.0.0.1:9000/flume/upload/%Y%m%d/%H
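# Prefix for the names of uploaded files
a3.sinks.k3.hdfs.filePrefix = upload-
# Whether to roll directories by time
a3.sinks.k3.hdfs.round = true
# How many time units pass before a new directory is created
a3.sinks.k3.hdfs.roundValue = 1
# Time unit used for rounding
a3.sinks.k3.hdfs.roundUnit = hour
# Whether to use the local timestamp
a3.sinks.k3.hdfs.useLocalTimeStamp = true
# Number of events to accumulate before flushing to HDFS
a3.sinks.k3.hdfs.batchSize = 100
# File type; DataStream writes uncompressed output (CompressedStream enables compression)
a3.sinks.k3.hdfs.fileType = DataStream
# How often, in seconds, to roll to a new file
a3.sinks.k3.hdfs.rollInterval = 60
# Roll files at roughly 128 MB, ideally slightly below the HDFS block size
a3.sinks.k3.hdfs.rollSize = 134217700
# 0 means rolling does not depend on the number of events
a3.sinks.k3.hdfs.rollCount = 0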

# Use a channel which buffers events in memory
a3.channels.c3.type = memory
a3.channels.c3.capacity = 1000
a3.channels.c3.transactionCapacity = 100

# Bind the source and sink to the channel
a3.sources.r3.channels = c3
a3.sinks.k3.channel = c3

Run Flume

[root@iZnq8v4wpstsagZ apache-flume-1.9.0-bin]# bin/flume-ng agent --conf conf/ --name a3 --conf-file job/flume-dir-hdfs.conf

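Add files to the upload directory (create the directory before starting the agent, since the spooling directory is expected to exist)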

[root@iZnq8v4wpstsagZ apache-flume-1.9.0-bin]# mkdir upload
[root@iZnq8v4wpstsagZ apache-flume-1.9.0-bin]# touch upload/test.txt
[root@iZnq8v4wpstsagZ apache-flume-1.9.0-bin]# touch upload/test.tmp
[root@iZnq8v4wpstsagZ apache-flume-1.9.0-bin]# touch upload/test.log

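File names in a directory watched by the spooldir Source must be unique; otherwise Flume reports an error and stops working properly. Do not create a file in the watched directory and then keep modifying it: once Flume has picked up and uploaded the file as first saved, it renames it with the .COMPLETED suffix, and later modifications trigger no further events. The watched directory is scanned for file changes every 500 ms.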
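The effect of ignorePattern = ([^ ]*\.tmp) on the three files created above can be checked with a short Python sketch. It assumes the pattern has to match the whole file name, which is modelled here with re.fullmatch; is_ignored is a hypothetical helper, not part of Flume.

```python
import re

# Same regular expression as a3.sources.r3.ignorePattern in the configuration.
IGNORE_PATTERN = re.compile(r"([^ ]*\.tmp)")


def is_ignored(file_name: str) -> bool:
    """True when the whole file name matches the ignore pattern."""
    return IGNORE_PATTERN.fullmatch(file_name) is not None


candidates = ["test.txt", "test.tmp", "test.log"]
picked_up = [name for name in candidates if not is_ignored(name)]
print(picked_up)  # ['test.txt', 'test.log']
```

Under this assumption only test.txt and test.log would be uploaded and renamed with the .COMPLETED suffix, while test.tmp stays untouched.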

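3.2.4 Monitoring Multiple Appended Files in a Directory in Real Time

Requirement: use Flume to watch the files in an entire directory as they are appended to, and upload them to HDFS.
Approach: The Exec source suits following a single file that is appended to in real time, but it cannot guarantee that no data is lost. The Spooldir Source guarantees no data loss and can resume after an interruption, but it has higher latency and cannot track file changes in real time. The Taildir Source can resume after an interruption, guarantees no data loss, and tracks file changes in real time. It reads the latest content appended to multiple files and stays reliable even if Flume fails or goes down. While running, the Taildir Source records the last position read in each file in a JSON file; when the agent restarts, it resumes tailing from the recorded positions. The positions in the JSON file can be edited, in which case the Taildir Source tails from the edited positions. If the JSON file is lost, every file is read again from its first line, which produces duplicate data. The Taildir Source currently reads text files only.

Steps:

Create the flume-taildir-hdfs.conf configuration file under apache-flume-1.9.0-bin/job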

[root@iZnq8v4wpstsagZ apache-flume-1.9.0-bin]# vim job/flume-taildir-hdfs.conf
a4.sources = r4
a4.sinks = k4
a4.channels = c4

# Describe/configure the source
a4.sources.r4.type = TAILDIR
# Position file that records how far each file has been read
a4.sources.r4.positionFile = /opt/module/apache-flume-1.9.0-bin/tail_dir.json
a4.sources.r4.filegroups = f1
a4.sources.r4.filegroups.f1 = /opt/module/apache-flume-1.9.0-bin/files/file.*

# Describe the sink
a4.sinks.k4.type = hdfs
a4.sinks.k4.hdfs.path = hdfs://127.0.0.1:9000/flume/upload/%Y%m%d/%H
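# Prefix for the names of uploaded files
a4.sinks.k4.hdfs.filePrefix = upload-
# Whether to roll directories by time
a4.sinks.k4.hdfs.round = true
# How many time units pass before a new directory is created
a4.sinks.k4.hdfs.roundValue = 1
# Time unit used for rounding
a4.sinks.k4.hdfs.roundUnit = hour
# Whether to use the local timestamp
a4.sinks.k4.hdfs.useLocalTimeStamp = true
# Number of events to accumulate before flushing to HDFS
a4.sinks.k4.hdfs.batchSize = 100
# File type; DataStream writes uncompressed output (CompressedStream enables compression)
a4.sinks.k4.hdfs.fileType = DataStream
# How often, in seconds, to roll to a new file
a4.sinks.k4.hdfs.rollInterval = 60
# Roll files at roughly 128 MB
a4.sinks.k4.hdfs.rollSize = 134217700
# 0 means rolling does not depend on the number of events
a4.sinks.k4.hdfs.rollCount = 0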

# Use a channel which buffers events in memory
a4.channels.c4.type = memory
a4.channels.c4.capacity = 1000
a4.channels.c4.transactionCapacity = 100

# Bind the source and sink to the channel
a4.sources.r4.channels = c4
a4.sinks.k4.channel = c4

Run Flume

[root@iZnq8v4wpstsagZ apache-flume-1.9.0-bin]# bin/flume-ng agent --conf conf/ --name a4 --conf-file job/flume-taildir-hdfs.conf

Append content to files in the files directory

[root@iZnq8v4wpstsagZ apache-flume-1.9.0-bin]# mkdir files
[root@iZnq8v4wpstsagZ apache-flume-1.9.0-bin]# echo hello >> files/file1.txt
[root@iZnq8v4wpstsagZ apache-flume-1.9.0-bin]# echo world >> files/file2.txt

The Taildir Source maintains a position file in JSON format and periodically updates it with the latest position read in each file, which is what makes resuming possible. The position file looks like this:

{"inode":2496272,"pos":12,"file":"/opt/module/apache-flume-1.9.0-bin/files/file1.txt"}
{"inode":2496275,"pos":12,"file":"/opt/module/apache-flume-1.9.0-bin/files/file2.txt"}

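On Linux, the area that stores a file's metadata is called an inode. Every inode has a number, and the operating system uses inode numbers to tell files apart: internally, Unix/Linux systems identify files by inode number rather than by file name.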
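The resume behaviour that the position file enables boils down to "seek to the recorded offset and read whatever was appended since". A minimal Python sketch of that idea follows; read_new_bytes and the temporary file are assumptions for illustration and do not reflect how Flume implements it.

```python
import os
import tempfile


def read_new_bytes(path: str, pos: int):
    """Read everything appended after byte offset pos; return (data, new offset)."""
    with open(path, "rb") as f:
        f.seek(pos)
        data = f.read()
        return data, f.tell()


with tempfile.TemporaryDirectory() as tmp:
    log = os.path.join(tmp, "file1.txt")

    with open(log, "ab") as f:
        f.write(b"hello\n")
    first, first_pos = read_new_bytes(log, 0)

    # More data is appended later; resume from the recorded offset,
    # so the earlier bytes are not read again.
    with open(log, "ab") as f:
        f.write(b"world\n")
    second, final_pos = read_new_bytes(log, first_pos)

print(first, first_pos, second, final_pos)
```

Persisting the offset after each read is what would let a restarted process continue without re-reading or skipping data.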

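Common problem: files collected by the Taildir Source must not be renamed arbitrarily. If a log is named xxxx.tmp while it is being written and is renamed to xxx.log once complete, and the matching rule matches both names, the data is collected twice.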
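The file-system side of this problem can be observed directly: renaming a file changes its path but not its inode. The Python sketch below shows only that behaviour on a POSIX-style file system; the file names are made up, and it does not model how the Taildir Source decides whether a file is new.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    writing = os.path.join(tmp, "app.tmp")   # name while the log is being written
    finished = os.path.join(tmp, "app.log")  # name after the log is rolled

    with open(writing, "wb") as f:
        f.write(b"hello\n")

    inode_before = os.stat(writing).st_ino
    os.rename(writing, finished)
    inode_after = os.stat(finished).st_ino

same_inode = inode_before == inode_after
print(same_inode)
```

A collector that also keys on the path would see a new path after the rename and could start reading the same data from the beginning, which is why the matching rule should cover only one of the two names.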

3.3 Using Environment Variables in Configuration Files

Flume can substitute environment variables into configuration values, for example:

a1.sources = r1
a1.sources.r1.type = netcat
a1.sources.r1.bind = 0.0.0.0
a1.sources.r1.port = ${NC_PORT}
a1.sources.r1.channels = c1

This is enabled by setting a Java system property on the agent invocation (propertiesImplementation = org.apache.flume.node.EnvVarResolverProperties), for example:

[root@iZnq8v4wpstsagZ apache-flume-1.9.0-bin]# NC_PORT=44444 bin/flume-ng agent -c conf/ -f job/flume-netcat-logger.conf -n a1 -Dflume.root.logger=INFO,console -DpropertiesImplementation=org.apache.flume.node.EnvVarResolverProperties

Environment variables can also be configured in other ways, including setting them in conf/flume-env.sh.
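The substitution itself amounts to replacing each ${NAME} placeholder with the value of the environment variable NAME. The Python sketch below models that idea; substitute_env is a hypothetical helper, and leaving unknown variables untouched is a choice made for this sketch, not a statement about what EnvVarResolverProperties does in that case.

```python
import re
from typing import Dict

_VAR = re.compile(r"\$\{(\w+)\}")


def substitute_env(value: str, env: Dict[str, str]) -> str:
    """Replace ${NAME} with env[NAME]; unknown names are left as they are."""
    return _VAR.sub(lambda m: env.get(m.group(1), m.group(0)), value)


config = {
    "a1.sources.r1.bind": "0.0.0.0",
    "a1.sources.r1.port": "${NC_PORT}",
}
env = {"NC_PORT": "44444"}  # in a real run this would come from os.environ
resolved = {key: substitute_env(value, env) for key, value in config.items()}
print(resolved["a1.sources.r1.port"])  # 44444
```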

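3.4 Third-Party Plugins

Flume has a fully plugin-based architecture. It ships with many out-of-the-box sources, channels, sinks, and serializers, and many other implementations exist that are maintained separately from Flume. Although you can include custom Flume components by adding their jars to FLUME_CLASSPATH in flume-env.sh, Flume also supports the $FLUME_HOME/plugins.d directory, which automatically picks up plugins packaged in a specific format. This makes plugin management easier.

Each plugin directory under plugins.d can have up to three subdirectories: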