Flume Installation, Deployment, and Usage

Article link: https://blog.csdn.net/qq_23329167/article/details/84193982

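Flume has three components: source, channel, and sink. The source receives data, the channel buffers it in transit, and the sink delivers it to the destination.

[Figure: roles of the source, channel, and sink]

[Figure: complex topology]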

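I. Installation and Deployment

Installing Flume is simple. Upload the installation package to the node that hosts the data source (node1), then extract it: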

tar -zxvf apache-flume-1.6.0-bin.tar.gz

Next, go into the Flume directory, edit flume-env.sh under conf, and set JAVA_HOME in it.
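As a minimal sketch, the relevant line in conf/flume-env.sh might look like the following. The JDK path is an assumption for illustration only; use the location of the JDK on your own node. If conf/flume-env.sh does not exist yet, it can be created by copying conf/flume-env.sh.template.

```shell
# conf/flume-env.sh
# Assumption: the JDK is installed at /usr/local/jdk1.8.0 (hypothetical path)
export JAVA_HOME=/usr/local/jdk1.8.0
```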

Start with the simplest possible example to check that the environment works:

1. Create a new file in Flume's conf directory

vi netcat-logger.conf

# Name the components of this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe and configure the source component: r1
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444

# Describe and configure the sink component: k1
a1.sinks.k1.type = logger

# Describe and configure the channel component; here events are buffered in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Wire the source, channel, and sink together
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

2. Start the agent to collect data (run from the Flume installation directory)

bin/flume-ng agent -c conf/ -f conf/netcat-logger.conf -n a1 -Dflume.root.logger=INFO,console

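-c conf/ specifies the directory that holds Flume's own configuration files

-f conf/netcat-logger.conf specifies the collection plan defined above

-n a1 specifies the name of this agent; it must match the a1 prefix used in the configuration file

3. To test, send data to the port the agent is listening on so that it has something to collect. On node1, install telnet: yum -y install telnet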
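With the agent from step 2 still running, a test session could look like the sketch below. The host and port are the ones set in netcat-logger.conf; the text typed is arbitrary.

```shell
# Connect to the port the netcat source is listening on
telnet localhost 44444
# Now type any line, for example: hello flume
# and press Enter. The netcat source acknowledges each line with OK,
# and the logger sink prints the received event in the agent's console.
```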

4. To exit telnet, press Ctrl + ], then type quit and press Enter

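II. Examples

2.1 Collecting a Directory into HDFS

Requirement: new files keep appearing in a particular directory on the server, and each new file must be collected into HDFS as soon as it appears.

Based on this requirement, first define the following three elements:

Source: monitors a file directory: spooldir

Sink: the HDFS file system: hdfs sink

Channel: the path between source and sink; either a file channel or a memory channel can be used

1. Write the configuration file (create spooldir-hdfs.conf in the conf directory):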

# Name the components on this agent 
a1.sources = r1 
a1.sinks = k1 
a1.channels = c1 
 
# Describe/configure the source 
## Note: never drop a file into the monitored directory under a name that has already been used
a1.sources.r1.type = spooldir 
a1.sources.r1.spoolDir = /root/logs
a1.sources.r1.fileHeader = true
 
# Describe the sink 
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = /flume/events/%y-%m-%d/%H%M/
a1.sinks.k1.hdfs.filePrefix = events-
a1.sinks.k1.hdfs.round = true
a1.sinks.k1.hdfs.roundValue = 10
a1.sinks.k1.hdfs.roundUnit = minute
a1.sinks.k1.hdfs.rollInterval = 3
a1.sinks.k1.hdfs.rollSize = 20
a1.sinks.k1.hdfs.rollCount = 5
a1.sinks.k1.hdfs.batchSize = 1
a1.sinks.k1.hdfs.useLocalTimeStamp = true
a1.sinks.k1.hdfs.fileType = DataStream
 
# Use a channel which buffers events in memory 
a1.channels.c1.type = memory 
a1.channels.c1.capacity = 1000 
a1.channels.c1.transactionCapacity = 100 
 
# Bind the source and sink to the channel 
a1.sources.r1.channels = c1 
a1.sinks.k1.channel = c1

2. Start command (run the following from the Flume installation directory)

bin/flume-ng agent -c conf/ -f conf/spooldir-hdfs.conf -n a1  -Dflume.root.logger=INFO,console

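3. Keep adding new files to /root/logs. Note: once a file with a given name has been added to /root/logs, never add another file with the same name. The spooldir source treats a reused file name as an error and stops processing.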
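One way to avoid reusing a name is to give every file a unique suffix when it is dropped in. The helper below is a minimal sketch, not part of Flume: the function name spool_put and the staging directory are assumptions, and the staging directory is assumed to be on the same file system as the monitored directory so that mv is a simple rename.

```shell
# spool_put SRC STAGING_DIR SPOOL_DIR
# Copies SRC into SPOOL_DIR under a name that has not been used before.
# Hypothetical helper, not part of Flume.
spool_put() {
    src=$1
    staging=$2
    spool=$3
    base=$(basename "$src")
    stamp=$(date +%Y%m%d%H%M%S)
    n=0
    target="$spool/$base.$stamp.$n"
    # Skip names that are waiting to be read or were already processed
    # (the spooldir source renames finished files to <name>.COMPLETED by default)
    while [ -e "$target" ] || [ -e "$target.COMPLETED" ]; do
        n=$((n + 1))
        target="$spool/$base.$stamp.$n"
    done
    # Copy outside the monitored directory first, then move the finished file in,
    # so the source never sees a half-written file
    cp "$src" "$staging/$base.$$" && mv "$staging/$base.$$" "$target" || return 1
    printf '%s\n' "$target"
}

# Example (paths are assumptions):
# spool_put /var/log/app/access.log /root/staging /root/logs
```

Staging the copy outside /root/logs also keeps files unchanged once they are in the monitored directory, which the spooldir source expects.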

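2.2 Collecting a File into HDFS

Requirement: a business system writes its logs with log4j, the log keeps growing, and the data appended to the log file must be collected into HDFS in real time.

Based on this requirement, first define the following three elements:

Source: monitors content appended to a file: exec 'tail -F file'

Sink: the HDFS file system: hdfs sink

Channel: the path between source and sink; either a file channel or a memory channel can be used

1. Write the configuration file (create exec-hdfs.conf in the conf directory and put the following in it):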

# Name the components on this agent 
a1.sources = r1 
a1.sinks = k1 
a1.channels = c1 
 
# Describe/configure the source 
a1.sources.r1.type = exec 
a1.sources.r1.command = tail -F /root/logs/test.log 
 
# Describe the sink 
a1.sinks.k1.type = hdfs 
a1.sinks.k1.hdfs.path = /flume/tailout/%y-%m-%d/%H%M/ 
a1.sinks.k1.hdfs.filePrefix = events- 
a1.sinks.k1.hdfs.round = true 
a1.sinks.k1.hdfs.roundValue = 10 
a1.sinks.k1.hdfs.roundUnit = minute 
a1.sinks.k1.hdfs.rollInterval = 3 
a1.sinks.k1.hdfs.rollSize = 20 
a1.sinks.k1.hdfs.rollCount = 5 
a1.sinks.k1.hdfs.batchSize = 1 
a1.sinks.k1.hdfs.useLocalTimeStamp = true 
a1.sinks.k1.hdfs.fileType = DataStream 
 
# Use a channel which buffers events in memory 
a1.channels.c1.type = memory 
a1.channels.c1.capacity = 1000 
a1.channels.c1.transactionCapacity = 100 
 
# Bind the source and sink to the channel 
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

2. Open a second session and run a shell loop that keeps appending data to /root/logs/test.log

while true;do echo access... >> /root/logs/test.log;sleep 0.3;done

3. Start the agent (run from the Flume installation directory)

bin/flume-ng agent -c conf/ -f conf/exec-hdfs.conf -n a1  -Dflume.root.logger=INFO,console

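Parameter reference:

rollInterval

Default: 30

How long, in seconds, the hdfs sink waits before rolling the temporary file into the final target file. A value of 0 disables time-based rolling.

Note: rolling means that the hdfs sink renames the temporary file to the final target file and opens a new temporary file for writing.

rollSize

Default: 1024

When the temporary file reaches this size (in bytes), it is rolled into the target file. A value of 0 disables size-based rolling.

rollCount

Default: 10

When the number of events written reaches this value, the temporary file is rolled into the target file. A value of 0 disables count-based rolling.

round

Default: false

Whether to round the timestamp down. The timestamp is always rounded down to the start of the interval, never to the nearest value.

roundValue

Default: 1

The multiple that the timestamp is rounded down to.

roundUnit

Default: second

The unit used for rounding down; one of second, minute, hour. Example: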

a1.sinks.k1.hdfs.path = /flume/events/%y-%m-%d/%H%M/%S

a1.sinks.k1.hdfs.round = true

a1.sinks.k1.hdfs.roundValue = 10

a1.sinks.k1.hdfs.roundUnit = minute

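At 2015-10-16 17:38:59, hdfs.path still resolves to:

/flume/events/15-10-16/1730/00

Here %y expands to the two-digit year, and the seconds are reset to 00 by the rounding. Because the timestamp is rounded down to a multiple of 10 minutes, a new directory is created every 10 minutes.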
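The arithmetic behind this rounding can be reproduced with a few lines of shell. This is only a sketch of the calculation for roundUnit = minute, not Flume's own code; the function name round_down_path is hypothetical, and the path pattern is hard-coded to the one from the example.

```shell
# round_down_path "YYYY-MM-DD HH:MM:SS" ROUND_VALUE
# Hypothetical helper: prints /flume/events/%y-%m-%d/%H%M/%S for the given time
# after rounding the minutes down to a multiple of ROUND_VALUE.
round_down_path() {
    value=$2
    # Split the timestamp into its six fields
    IFS='-: ' read -r year month day hour minute second <<EOF
$1
EOF
    # Strip one leading zero so that "08" is not parsed as an octal number
    minute=${minute#0}
    # Integer division drops the remainder, which is the rounding down
    rounded=$(( minute / value * value ))
    # %y is the two-digit year; rounding to minutes resets the seconds to 00
    printf '/flume/events/%s-%s-%s/%s%02d/00\n' "${year#??}" "$month" "$day" "$hour" "$rounded"
}

round_down_path "2015-10-16 17:38:59" 10   # prints /flume/events/15-10-16/1730/00
```

Any time from 17:30:00 to 17:39:59 yields the same directory, which is why events from a whole 10-minute window land together.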