Requirement:
Collect data from several files or directories and write it all into the same output location.
Approach:
With multiple sources the data can come from different places, so we build one agent per source, each with its own channel. Agents a and b each read a local log through an exec source and forward events through an avro sink; agent c exposes an avro source that both of them send to, and its sink writes the merged stream to HDFS:

  a (exec: access_log) --avro--> \
                                  c (avro source -> hdfs sink) --> HDFS
  b (exec: hive log)   --avro--> /
======a.Agent====
a.sources = r1
a.channels = c1
a.sinks = k1
# define sources
a.sources.r1.type = exec
## Note: the user running the flume command must have read
## permission on /var/log/httpd/access_log
a.sources.r1.command = tail -F /var/log/httpd/access_log
a.sources.r1.shell = /bin/bash -c
# define channels
a.channels.c1.type = memory
a.channels.c1.capacity = 1000
a.channels.c1.transactionCapacity = 100
# define sinks
# avro sink: forwards events to agent c
a.sinks.k1.type = avro
# hostname of the machine agent c runs on
a.sinks.k1.hostname = vampire04
# port agent c listens on
a.sinks.k1.port = 4545
# bind the sources and sinks to the channels
a.sources.r1.channels = c1
a.sinks.k1.channel = c1
=====b.Agent======
b.sources = r2
b.channels = c2
b.sinks = k2
# define sources
b.sources.r2.type = exec
## Note: the user running the flume command must have read
## permission on the tailed path below
## (tail -F follows a regular file; if .../logs is a directory,
## point the command at the actual log file inside it)
b.sources.r2.command = tail -F /opt/modules/cdh/hive-0.13.1-cdh5.3.6/logs
b.sources.r2.shell = /bin/bash -c
# define channels
b.channels.c2.type = memory
b.channels.c2.capacity = 1000
b.channels.c2.transactionCapacity = 100
# define sinks
# avro sink: forwards events to agent c
b.sinks.k2.type = avro
# hostname of the machine agent c runs on
b.sinks.k2.hostname = vampire04
# port agent c listens on
b.sinks.k2.port = 4545
# bind the sources and sinks to the channels
b.sources.r2.channels = c2
b.sinks.k2.channel = c2
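`tail -F` only follows regular files, and the hive path above looks like it may be a log directory rather than a file. A quick sanity check; the `path_kind` helper is our own illustration, not a Flume feature:

```shell
# Print what kind of path the exec source is about to tail.
path_kind() {
  if [ -d "$1" ]; then
    echo "dir"
  elif [ -f "$1" ]; then
    echo "file"
  else
    echo "missing"
  fi
}

# A "dir" result means the tail -F command should point at a
# file inside that directory instead.
path_kind /opt/modules/cdh/hive-0.13.1-cdh5.3.6/logs
```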
==============c.Agent======================
c.sources = r3
c.channels = c3
c.sinks = k3
# define sources
# avro source: receives from the avro sinks of agents a and b
c.sources.r3.type = avro
# IP/hostname agent c binds to
c.sources.r3.bind = vampire04
# port agent c listens on
c.sources.r3.port = 4545
# define channels
c.channels.c3.type = memory
c.channels.c3.capacity = 1000
c.channels.c3.transactionCapacity = 100
# define sinks
c.sinks.k3.type = hdfs
c.sinks.k3.hdfs.path=hdfs://vampire04:8020/flume3/%Y%m%d/%H
c.sinks.k3.hdfs.filePrefix = accesslog
# enable multi-level directories based on the event timestamp:
# one directory per year/month/day, then one per hour
c.sinks.k3.hdfs.round=true
# round the timestamp down in steps of 1 hour
c.sinks.k3.hdfs.roundValue=1
c.sinks.k3.hdfs.roundUnit=hour
# use the local clock for the timestamp escapes in the path
c.sinks.k3.hdfs.useLocalTimeStamp=true
c.sinks.k3.hdfs.batchSize=1000
c.sinks.k3.hdfs.fileType=DataStream
c.sinks.k3.hdfs.writeFormat=Text
# avoid producing too many small files:
# roll a new file every 600 seconds
c.sinks.k3.hdfs.rollInterval=600
# roll a new file at 128000000 bytes; in production, to stay under
# the 128 MB HDFS block size, this is usually set to about 127 MB
c.sinks.k3.hdfs.rollSize=128000000
# do not roll files based on the number of events
c.sinks.k3.hdfs.rollCount=0
# set to 1, otherwise block replication triggers premature file
# rolls and the three roll settings above have no effect
c.sinks.k3.hdfs.minBlockReplicas=1
# bind the sources and sinks to the channels
c.sources.r3.channels = c3
c.sinks.k3.channel = c3
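To preview which directory the %Y%m%d/%H escapes in hdfs.path will expand to, note that date(1) understands the same escapes; since useLocalTimeStamp=true the sink uses the local clock, so this one-liner mirrors its behavior:

```shell
# Show today's target directory as the hdfs sink would resolve it,
# e.g. hdfs://vampire04:8020/flume3/20240115/09
date +"hdfs://vampire04:8020/flume3/%Y%m%d/%H"
```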
## Start agent c first, so its avro listener is up before a and b send to it:
bin/flume-ng agent --conf conf --conf-file conf/c.conf --name c -Dflume.root.logger=INFO,console
## Start agent b:
bin/flume-ng agent --conf conf --conf-file conf/b.conf --name b -Dflume.root.logger=INFO,console
## Start agent a:
bin/flume-ng agent --conf conf --conf-file conf/a.conf --name a -Dflume.root.logger=INFO,console
The two streams end up merged under per-day/per-hour directories in HDFS, which makes it easy to build partitioned tables on top of them later.