Hadoop Offline: Three Flume Collection Cases

Article link: https://blog.csdn.net/weixin_44449054/article/details/113772270

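Three Flume collection cases

Case 1

1. Requirement:
New files keep appearing in a particular directory on a server. Whenever a new file appears, it must be collected into HDFS.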

2. Solution:
Choosing the three components:
Source: Spooling Directory Source, written as spooldir in the configuration
Sink: HDFS Sink, because the files are to be collected into HDFS
Channel: either Memory Channel or File Channel will do
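Either channel works for this case. The trade-off is that a Memory Channel loses any buffered events if the agent process dies, while a File Channel persists them to disk at some cost in throughput. A minimal sketch of the File Channel variant follows, assuming an agent named a1 with a channel named c1. The two directories are hypothetical paths, not part of the original setup.

```properties
# File Channel instead of Memory Channel (both paths are assumed; adjust to your host)
a1.channels.c1.type = file
# Directory where the channel keeps its checkpoint
a1.channels.c1.checkpointDir = /export/servers/flume/checkpoint
# Comma-separated list of directories that hold the event data
a1.channels.c1.dataDirs = /export/servers/flume/data
```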

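Characteristics of spooldir:
1. It watches a directory and, as soon as a new file appears there, collects the file's contents
2. Once a file has been fully collected, the agent renames it by appending the suffix .COMPLETED
3. A file name may not be reused in the monitored directory, and a file must not be modified after it has been placed there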
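Because the source expects every file in the directory to be complete, it can help to keep half-written files out of its sight. The sketch below assumes the producer writes each file with a .tmp extension and renames it when finished. That naming convention is an assumption for illustration, as are the agent and source names a1 and r1.

```properties
a1.sources.r1.type = spooldir
a1.sources.r1.spoolDir = /export/servers/dirfile
# Skip files that are still being written (assumed convention: *.tmp)
a1.sources.r1.ignorePattern = ^.*[.]tmp$
# Suffix appended to fully ingested files (this is the default value)
a1.sources.r1.fileSuffix = .COMPLETED
# Keep ingested files on disk (the default); "immediate" would delete them instead
a1.sources.r1.deletePolicy = never
```

The character class `[.]` is used rather than an escaped dot so that the pattern does not depend on backslash handling in the properties file.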

3. Writing the Flume configuration file
1. First create the directory to be monitored: mkdir -p /export/servers/dirfile
2. In Flume's conf directory, create the configuration file: vim spooldir.conf

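# Name the agent's components
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe and configure the source
# Note: never drop a file into the monitored directory under a name that has already been used
a1.sources.r1.type = spooldir
a1.sources.r1.spoolDir = /export/servers/dirfile
a1.sources.r1.fileHeader = true

# Describe and configure the sink
a1.sinks.k1.type = hdfs
a1.sinks.k1.channel = c1
a1.sinks.k1.hdfs.path = hdfs://node01:8020/spooldir/files/%y-%m-%d/%H%M/
a1.sinks.k1.hdfs.filePrefix = events-
# Directory rounding: round the timestamp used in hdfs.path down to a multiple
# of 10 minutes, so a new directory is created every 10 minutes, not every minute
a1.sinks.k1.hdfs.round = true
a1.sinks.k1.hdfs.roundValue = 10
a1.sinks.k1.hdfs.roundUnit = minute
# File rolling: close the current file and start a new one after 3 seconds,
# 20 bytes, or 5 events, whichever comes first. These values are very small and
# produce many small files on HDFS; raise them outside of a demo
a1.sinks.k1.hdfs.rollInterval = 3
a1.sinks.k1.hdfs.rollSize = 20
a1.sinks.k1.hdfs.rollCount = 5
# Number of events written before the file is flushed to HDFS
a1.sinks.k1.hdfs.batchSize = 1
# Use the agent's local time for the escape sequences in hdfs.path
a1.sinks.k1.hdfs.useLocalTimeStamp = true
# Output file type: the default is SequenceFile; DataStream writes plain text
a1.sinks.k1.hdfs.fileType = DataStream

# Describe and configure the channel; events are buffered in memory here
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Wire the source, channel, and sink together
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1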

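4. Start the agent with this configuration:
bin/flume-ng agent -c ./conf -f ./conf/spooldir.conf -n a1 -Dflume.root.logger=INFO,console
Then copy files into /export/servers/dirfile; the agent picks them up and writes their contents to HDFS.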
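The roll values in spooldir.conf (3 seconds, 20 bytes, 5 events) make output appear in HDFS almost immediately, but they also produce a large number of very small files. For a longer-running agent, a sketch with larger thresholds might look like the following. The specific numbers are illustrative assumptions, not values from the original article.

```properties
# Start a new file every 10 minutes ...
a1.sinks.k1.hdfs.rollInterval = 600
# ... or once the current file reaches about 128 MB (the value is in bytes)
a1.sinks.k1.hdfs.rollSize = 134217728
# 0 disables rolling by event count
a1.sinks.k1.hdfs.rollCount = 0
# Flush to HDFS every 100 events (the sink's default)
a1.sinks.k1.hdfs.batchSize = 100
```

Setting any of the three roll properties to 0 turns that trigger off, so the remaining ones decide when a file is closed.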

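Case 2

1. Requirement:
A business system writes its logs with log4j, and the log file keeps growing. The data appended to the log file must be collected into HDFS in real time.

2. Solution:
Choosing the components:
Source: Exec Source, written as exec in the configuration
Sink: HDFS Sink
Channel: either Memory Channel or File Channel will do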
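Exec Source simply runs a command and turns its output into events, so if the command exits, collection stops silently. A hedged sketch of the source's restart-related options is shown below, assuming the agent and source names agent1 and source1 and the log path used in this case.

```properties
agent1.sources.source1.type = exec
agent1.sources.source1.command = tail -F /export/servers/taillogs/access_log
# Re-run the command if it exits (off by default)
agent1.sources.source1.restart = true
# Wait 10 seconds (value in milliseconds) before restarting
agent1.sources.source1.restartThrottle = 10000
# Also write the command's stderr to the agent log
agent1.sources.source1.logStdErr = true
```

Even with restart enabled, Exec Source offers no delivery guarantee: lines read while the channel is full or while the agent is down can be lost.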

3. Writing the configuration file:
In Flume's conf directory, create the configuration file: vim tail-file.conf

# Name the agent's components
agent1.sources = source1
agent1.sinks = sink1
agent1.channels = channel1

# Describe and configure source1, which runs tail -F
agent1.sources.source1.type = exec
agent1.sources.source1.command = tail -F /export/servers/taillogs/access_log
agent1.sources.source1.channels = channel1

# Host interceptor for source1 (not used for now)
#agent1.sources.source1.interceptors = i1
#agent1.sources.source1.interceptors.i1.type = host
#agent1.sources.source1.interceptors.i1.hostHeader = hostname

# Describe and configure sink1
agent1.sinks.sink1.type = hdfs
agent1.sinks.sink1.hdfs.path = hdfs://node01:8020/weblog/flume-collection/%y-%m-%d/%H-%M
agent1.sinks.sink1.hdfs.filePrefix = access_log
agent1.sinks.sink1.hdfs.maxOpenFiles = 5000
agent1.sinks.sink1.hdfs.batchSize = 100
agent1.sinks.sink1.hdfs.fileType = DataStream
agent1.sinks.sink1.hdfs.writeFormat = Text
agent1.sinks.sink1.hdfs.rollSize = 102400
agent1.sinks.sink1.hdfs.rollCount = 1000000
agent1.sinks.sink1.hdfs.rollInterval = 60
agent1.sinks.sink1.hdfs.round = true
agent1.sinks.sink1.hdfs.roundValue = 10
agent1.sinks.sink1.hdfs.roundUnit = minute
agent1.sinks.sink1.hdfs.useLocalTimeStamp = true

# Describe and configure the channel; events are buffered in memory here
agent1.channels.channel1.type = memory
agent1.channels.channel1.keep-alive = 120
agent1.channels.channel1.capacity = 500000
agent1.channels.channel1.transactionCapacity = 600

# Wire the source, channel, and sink together
agent1.sources.source1.channels = channel1
agent1.sinks.sink1.channel = channel1

4. Start the agent with this configuration:
bin/flume-ng agent -c ./conf -f ./conf/tail-file.conf -n agent1 -Dflume.root.logger=INFO,console
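The host interceptor that is commented out in tail-file.conf would stamp every event with a header naming the machine it came from. A sketch of enabling it and using that header in the HDFS path is given below. The per-host directory layout is an assumption for illustration and is not part of the original configuration.

```properties
# Add a "hostname" header to every event
agent1.sources.source1.interceptors = i1
agent1.sources.source1.interceptors.i1.type = host
agent1.sources.source1.interceptors.i1.hostHeader = hostname
# Record the host name rather than the IP address (useIP defaults to true)
agent1.sources.source1.interceptors.i1.useIP = false
# %{hostname} expands to the header value, giving one directory per source host
agent1.sinks.sink1.hdfs.path = hdfs://node01:8020/weblog/flume-collection/%{hostname}/%y-%m-%d/%H-%M
```

This layout becomes useful once several machines feed the same HDFS location, since each host's logs stay separable.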

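Case 3

1. Requirement:
The first agent collects the data in a file and sends it over the network to the second agent.
The second agent receives the data sent by the first agent and stores it in HDFS.
2. Solution:
Use node02 as the first agent
Use node03 as the second agent
Configuration on node02:
Source: Exec Source, because the first agent's job is to collect the data in a file
Sink: Avro Sink, which is the usual way to chain agents into multiple tiers
Channel: Memory Channel
Configuration on node03:
Source: Avro Source, because the first agent sends through an Avro Sink, so its data must be received with an Avro Source
Sink: HDFS Sink, because the data is finally stored in HDFS
Channel: Memory Channel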
3. Writing the configuration files:
First write the Flume configuration file on node02
cd /export/servers/apache-flume-1.6.0-cdh5.14.0-bin/conf
vim tail-avro-avro-logger.conf

# Name the agent's components
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe and configure the source
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /export/servers/taillogs/access_log
a1.sources.r1.channels = c1

# Describe and configure the sink
# The Avro Sink is the sender; hostname and port point at the Avro Source on node03
a1.sinks.k1.type = avro
a1.sinks.k1.channel = c1
a1.sinks.k1.hostname = 192.168.0.30
a1.sinks.k1.port = 4141
a1.sinks.k1.batch-size = 10

# Describe and configure the channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Wire the source, channel, and sink together
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

Then write the Flume configuration file on node03
cd /export/servers/apache-flume-1.6.0-cdh5.14.0-bin/conf
vim avro-hdfs.conf

# Name the agent's components
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe and configure the source
# The Avro Source is the receiving service; it listens on this address and port
a1.sources.r1.type = avro
a1.sources.r1.channels = c1
a1.sources.r1.bind = 192.168.0.30
a1.sources.r1.port = 4141

# Describe the sink
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = hdfs://node01:8020/avro/hdfs/%y-%m-%d/%H%M/
a1.sinks.k1.hdfs.filePrefix = events-
# Round the timestamp used in hdfs.path down to a multiple of 10 minutes
a1.sinks.k1.hdfs.round = true
a1.sinks.k1.hdfs.roundValue = 10
a1.sinks.k1.hdfs.roundUnit = minute
# Start a new file after 3 seconds, 20 bytes, or 5 events, whichever comes first
a1.sinks.k1.hdfs.rollInterval = 3
a1.sinks.k1.hdfs.rollSize = 20
a1.sinks.k1.hdfs.rollCount = 5
a1.sinks.k1.hdfs.batchSize = 1
a1.sinks.k1.hdfs.useLocalTimeStamp = true
# Output file type: the default is SequenceFile; DataStream writes plain text
a1.sinks.k1.hdfs.fileType = DataStream

# Use a Memory Channel to buffer events
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and the sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

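4. Start the agents:
Start the agent on node03 first, so that the Avro Source is already listening when the sender connects
bin/flume-ng agent -c conf -f conf/avro-hdfs.conf -n a1 -Dflume.root.logger=INFO,console
Then start the agent on node02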
bin/flume-ng agent -c conf -f conf/tail-avro-avro-logger.conf -n a1 -Dflume.root.logger=INFO,console
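The two agents only connect if the Avro Sink's hostname and port on node02 match the address and port the Avro Source is bound to on node03. A sketch of a variant that does not hard-code the IP address follows. It assumes the host name node03 resolves from node02, which the original article does not state.

```properties
# node03 (avro-hdfs.conf): listen on all network interfaces
a1.sources.r1.type = avro
a1.sources.r1.bind = 0.0.0.0
a1.sources.r1.port = 4141

# node02 (tail-avro-avro-logger.conf): send to node03 by host name
a1.sinks.k1.type = avro
a1.sinks.k1.hostname = node03
a1.sinks.k1.port = 4141
```

Binding to 0.0.0.0 keeps the node03 configuration valid if the machine's IP address changes, at the price of accepting connections on every interface.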