Requirement:
Collect data from several files or directories and write it all into the same output location.
Approach:
With multiple sources the data can come from different places, so we build one agent per source, each with its own channel. Agents a and b each read a local log through an exec source and forward events through an avro sink; agent c exposes an avro source that both of them send to, and its sink writes the merged stream to HDFS:

  a (exec: access_log) --avro--> \
                                  c (avro source -> hdfs sink) --> HDFS
  b (exec: hive log)   --avro--> /
======a.Agent====
a.sources = r1
a.channels = c1
a.sinks = k1
# define sources
a.sources.r1.type = exec
## Note: the user running the flume command must have read
## permission on /var/log/httpd/access_log
a.sources.r1.command = tail -F /var/log/httpd/access_log
a.sources.r1.shell = /bin/bash -c
# define channels
a.channels.c1.type = memory
a.channels.c1.capacity = 1000
a.channels.c1.transactionCapacity = 100
# define sinks
# avro sink: forwards events to agent c
a.sinks.k1.type = avro
# hostname of the machine agent c runs on
a.sinks.k1.hostname = vampire04
# port agent c listens on
a.sinks.k1.port = 4545
# bind the sources and sinks to the channels
a.sources.r1.channels = c1
a.sinks.k1.channel = c1
=====b.Agent======
b.sources = r2
b.channels = c2
b.sinks = k2
# define sources
b.sources.r2.type = exec
## Note: the user running the flume command must have read
## permission on the tailed path below
## (tail -F follows a regular file; if .../logs is a directory,
## point the command at the actual log file inside it)
b.sources.r2.command = tail -F /opt/modules/cdh/hive-0.13.1-cdh5.3.6/logs
b.sources.r2.shell = /bin/bash -c
# define channels
b.channels.c2.type = memory
b.channels.c2.capacity = 1000
b.channels.c2.transactionCapacity = 100
# define sinks
# avro sink: forwards events to agent c
b.sinks.k2.type = avro
# hostname of the machine agent c runs on
b.sinks.k2.hostname = vampire04
# port agent c listens on
b.sinks.k2.port = 4545
# bind the sources and sinks to the channels
b.sources.r2.channels = c2
b.sinks.k2.channel = c2
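`tail -F` only follows regular files, and the hive path above looks like it may be a log directory rather than a file. A quick sanity check; the `path_kind` helper is our own illustration, not a Flume feature:

```shell
# Print what kind of path the exec source is about to tail.
path_kind() {
  if [ -d "$1" ]; then
    echo "dir"
  elif [ -f "$1" ]; then
    echo "file"
  else
    echo "missing"
  fi
}

# A "dir" result means the tail -F command should point at a
# file inside that directory instead.
path_kind /opt/modules/cdh/hive-0.13.1-cdh5.3.6/logs
```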
==============c.Agent======================
c.sources = r3
c.channels = c3
c.sinks = k3
# define sources
# avro source: receives from the avro sinks of agents a and b
c.sources.r3.type = avro
# IP/hostname agent c binds to
c.sources.r3.bind = vampire04
# port agent c listens on
c.sources.r3.port = 4545
# define channels
c.channels.c3.type = memory
c.channels.c3.capacity = 1000
c.channels.c3.transactionCapacity = 100
# define sinks
c.sinks.k3.type = hdfs
c.sinks.k3.hdfs.path=hdfs://vampire04:8020/flume3/%Y%m%d/%H
c.sinks.k3.hdfs.filePrefix = accesslog
# enable multi-level directories based on the event timestamp:
# one directory per year/month/day, then one per hour
c.sinks.k3.hdfs.round=true
# round the timestamp down in steps of 1 hour
c.sinks.k3.hdfs.roundValue=1
c.sinks.k3.hdfs.roundUnit=hour
# use the local clock for the timestamp escapes in the path
c.sinks.k3.hdfs.useLocalTimeStamp=true
c.sinks.k3.hdfs.batchSize=1000
c.sinks.k3.hdfs.fileType=DataStream
c.sinks.k3.hdfs.writeFormat=Text
# avoid producing too many small files:
# roll a new file every 600 seconds
c.sinks.k3.hdfs.rollInterval=600
# roll a new file at 128000000 bytes; in production, to stay under
# the 128 MB HDFS block size, this is usually set to about 127 MB
c.sinks.k3.hdfs.rollSize=128000000
# do not roll files based on the number of events
c.sinks.k3.hdfs.rollCount=0
# set to 1, otherwise block replication triggers premature file
# rolls and the three roll settings above have no effect
c.sinks.k3.hdfs.minBlockReplicas=1
# bind the sources and sinks to the channels
c.sources.r3.channels = c3
c.sinks.k3.channel = c3
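To preview which directory the %Y%m%d/%H escapes in hdfs.path will expand to, note that date(1) understands the same escapes; since useLocalTimeStamp=true the sink uses the local clock, so this one-liner mirrors its behavior:

```shell
# Show today's target directory as the hdfs sink would resolve it,
# e.g. hdfs://vampire04:8020/flume3/20240115/09
date +"hdfs://vampire04:8020/flume3/%Y%m%d/%H"
```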
## Start agent c first, so its avro listener is up before a and b send to it:
bin/flume-ng agent --conf conf --conf-file conf/c.conf --name c -Dflume.root.logger=INFO,console
## Start agent b:
bin/flume-ng agent --conf conf --conf-file conf/b.conf --name b -Dflume.root.logger=INFO,console
## Start agent a:
bin/flume-ng agent --conf conf --conf-file conf/a.conf --name a -Dflume.root.logger=INFO,console
The two streams end up merged under per-day/per-hour directories in HDFS, which makes it easy to build partitioned tables on top of them later.