Distributed Log Collection: Flume Study Notes

1. Flume Architecture

[Figure: Flume architecture (Source → Channel → Sink)]
Execution flow:
[Figure: Flume agent execution flow]
The event concept:
An event encapsulates the data being transferred and is Flume's basic unit of data transfer; for a text file it is usually one line of the file. The event is also the basic unit of a transaction. An event flows from the source into the channel and then to the sink; its body is a byte array, and it can also carry headers. An event represents the smallest complete unit of data, coming from an external data source and heading toward an external destination.

2. Writing from a netcat source to the console

Create the file example.conf:

# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source
# the netcat source can be tested with telnet
a1.sources.r1.type = netcat
a1.sources.r1.bind = 0.0.0.0
# listen on port 44444
a1.sources.r1.port = 44444

# Use a channel which buffers events in memory
a1.channels.c1.type = memory

# Describe the sink (the logger sink prints events to the console)
a1.sinks.k1.type = logger

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

Run the agent:

flume-ng agent \
--conf $FLUME_HOME/config \
--conf-file $FLUME_HOME/config/example.conf \
--name a1 \
-Dflume.root.logger=INFO,console

Telnet test:

[hadoop@hadoop000 config]$ telnet localhost 44444
Trying 192.168.198.128...
Connected to localhost.
Escape character is '^]'.
aa         // send the string aa
OK
bb
OK
cc
OK

Console output:

21/01/24 01:59:18 INFO source.NetcatSource: Created serverSocket:sun.nio.ch.ServerSocketChannelImpl[/0:0:0:0:0:0:0:0:44444]
21/01/24 01:59:34 INFO sink.LoggerSink: Event: { headers:{} body: 61 61 0D                                        aa. }
21/01/24 01:59:36 INFO sink.LoggerSink: Event: { headers:{} body: 62 62 0D                                        bb. }
21/01/24 01:59:37 INFO sink.LoggerSink: Event: { headers:{} body: 63 63 0D                                        cc. }

As you can see, each line of input becomes one event, and an event consists of headers and a body; the body carries the data. (The trailing 0D byte is the carriage return that telnet appends to each line.)

3. Writing from an exec source to HDFS

Create the file exec-hdfs-agent.conf:

# Name the components on this agent
exec-hdfs-agent.sources = exec-source
exec-hdfs-agent.sinks = hdfs-sink
exec-hdfs-agent.channels = memory-channel

# Describe/configure the source
exec-hdfs-agent.sources.exec-source.type = exec
exec-hdfs-agent.sources.exec-source.command = tail -F ~/data/data.log
exec-hdfs-agent.sources.exec-source.shell = /bin/bash -c

# Describe the sink
exec-hdfs-agent.sinks.hdfs-sink.type = hdfs
exec-hdfs-agent.sinks.hdfs-sink.hdfs.path = hdfs://hadoop000:8020/data/flume/tail
exec-hdfs-agent.sinks.hdfs-sink.hdfs.fileType=DataStream
exec-hdfs-agent.sinks.hdfs-sink.hdfs.writeFormat=Text
exec-hdfs-agent.sinks.hdfs-sink.hdfs.batchSize=10

# Use a channel which buffers events in memory
exec-hdfs-agent.channels.memory-channel.type = memory

# Bind the source and sink to the channel
exec-hdfs-agent.sources.exec-source.channels = memory-channel
exec-hdfs-agent.sinks.hdfs-sink.channel = memory-channel

Run the agent:

flume-ng agent \
--conf $FLUME_HOME/config \
--conf-file $FLUME_HOME/config/exec-hdfs-agent.conf \
--name exec-hdfs-agent \
-Dflume.root.logger=INFO,console

Append some data:

[hadoop@hadoop000 data]$ echo aaa >> data.log 
[hadoop@hadoop000 data]$ echo bbb >> data.log 
[hadoop@hadoop000 data]$ echo ccc >> data.log 

Console output:

21/01/24 00:32:42 INFO hdfs.HDFSDataStream: Serializer = TEXT, UseRawLocalFileSystem = false
21/01/24 00:32:42 INFO hdfs.BucketWriter: Creating hdfs://hadoop000:8020/data/flume/tail/FlumeData.1611419562576.tmp
21/01/24 00:33:12 INFO hdfs.BucketWriter: Closing hdfs://hadoop000:8020/data/flume/tail/FlumeData.1611419562576.tmp
21/01/24 00:33:12 INFO hdfs.BucketWriter: Renaming hdfs://hadoop000:8020/data/flume/tail/FlumeData.1611419562576.tmp to hdfs://hadoop000:8020/data/flume/tail/FlumeData.1611419562576
21/01/24 00:33:12 INFO hdfs.HDFSEventSink: Writer callback called.

hdfs.rollInterval: roll to a new file every 30 seconds (the default). You can see that the gap between "Creating hdfs..." and "Closing hdfs..." above is exactly 30 seconds.
The final file name is: prefix + name + suffix.
While the file is still being written, the name is: in-use prefix + prefix + name + suffix + in-use suffix.
hdfs.filePrefix: the file prefix, FlumeData by default.
hdfs.inUsePrefix: empty by default, so nothing is prepended to FlumeData while the file is in use.
hdfs.fileSuffix: empty by default, so the final file has no suffix.
hdfs.inUseSuffix: .tmp by default while the file is in use; when the file is rolled, the Renaming step removes the .tmp.
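
For reference, these roll and naming options could be set explicitly on the hdfs-sink above; a sketch with illustrative values (when they are omitted, the defaults described above apply):

# Roll and naming options for the hdfs-sink (illustrative values)
# rollInterval = 30: roll every 30 seconds; rollSize/rollCount = 0 disables size/count-based rolling
exec-hdfs-agent.sinks.hdfs-sink.hdfs.rollInterval = 30
exec-hdfs-agent.sinks.hdfs-sink.hdfs.rollSize = 0
exec-hdfs-agent.sinks.hdfs-sink.hdfs.rollCount = 0
exec-hdfs-agent.sinks.hdfs-sink.hdfs.filePrefix = FlumeData
exec-hdfs-agent.sinks.hdfs-sink.hdfs.inUsePrefix = _
exec-hdfs-agent.sinks.hdfs-sink.hdfs.inUseSuffix = .tmp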

HDFS Sink parameters:
[Figure: HDFS Sink parameter table]
Important parameters:
batchSize: the number of events written from the source into the channel per batch, or taken from the channel by the sink in one pull (writing in batches is far more efficient than writing events one by one). A batch is written as a single transaction, so if the write fails midway it is rolled back.
transactionCapacity: the maximum number of events in one transaction, i.e. the capacity of the List buffer in the architecture diagram above, so batchSize must not exceed transactionCapacity.
capacity: the maximum number of events the channel can hold.
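
As a rule of thumb, batchSize <= transactionCapacity <= capacity. A minimal sketch of how these three could be set explicitly on the agent above (the numbers are illustrative, not recommendations):

# Illustrative sizing: batchSize <= transactionCapacity <= capacity
exec-hdfs-agent.channels.memory-channel.capacity = 10000
exec-hdfs-agent.channels.memory-channel.transactionCapacity = 1000
exec-hdfs-agent.sinks.hdfs-sink.hdfs.batchSize = 100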

Drawbacks:
Drawback 1: it depends on the Linux command tail -F, which is not very reliable. The exec source does have a restart mechanism for when the command dies (see the sketch below), but it still cannot guarantee no data loss.
Drawback 2: it can only monitor a single file, not a directory.
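
For reference, these exec source options control that restart behaviour; a sketch against the agent above (the throttle value is illustrative):

# Restart the tail command if it dies, waiting 10 seconds between attempts,
# and log its stderr (illustrative values)
exec-hdfs-agent.sources.exec-source.restart = true
exec-hdfs-agent.sources.exec-source.restartThrottle = 10000
exec-hdfs-agent.sources.exec-source.logStdErr = true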

4. Spooling Directory Source

From the official documentation (paraphrased):
This source lets you ingest data by placing files into a "spooling" directory on disk. The source watches that directory for new files and parses events out of new files as they appear. The event parsing logic is pluggable. After a given file has been fully read into the channel, completion is by default indicated by renaming the file; alternatively the file can be deleted, or a trackerDir can be used to track processed files.

Unlike the Exec source, this source is reliable and will not lose data even if Flume is restarted or killed. In exchange for this reliability, only immutable, uniquely-named files may be placed into the spooling directory. Flume tries to detect these problem conditions and will fail loudly if they are violated:

  • If a file is written to after being placed into the spooling directory, Flume will print an error to its log file and stop processing.
  • If a file name is reused at a later time, Flume will print an error to its log file and stop processing.

To avoid the above issues, it may be useful to add a unique identifier (such as a timestamp) to file names when moving them into the spooling directory.

Despite the reliability guarantees of this source, there are still cases in which events may be duplicated if certain downstream failures occur. This is consistent with the guarantees offered by other Flume components.

In short: files in the spooling directory must not be modified, so copy only fully written files into it; do not copy a file in and then continue writing to it.

# Describe/configure the source
a3.sources.r3.type = spooldir
a3.sources.r3.spoolDir = /tmp/upload
# Suffix appended to a file in spoolDir once it has been fully ingested,
# marking processed vs. unprocessed files (the file is read first, then renamed)
a3.sources.r3.fileSuffix = .COMPLETED
a3.sources.r3.fileHeader = true
# Ignore (do not upload) any file ending in .tmp
a3.sources.r3.ignorePattern = ([^ ]*\.tmp)

Advantage: it can monitor a whole directory.
Drawback: as described above, files must not be modified after being placed in the directory. A complete agent wiring for this source is sketched below.
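
The snippet above only configures the source; a minimal sketch of a full agent around it, assuming a memory channel and a logger sink (the names c3 and k3 are chosen here to match the a3/r3 convention):

a3.sources = r3
a3.channels = c3
a3.sinks = k3

# Spooling directory source (as configured above)
a3.sources.r3.type = spooldir
a3.sources.r3.spoolDir = /tmp/upload
a3.sources.r3.fileSuffix = .COMPLETED
a3.sources.r3.fileHeader = true
a3.sources.r3.ignorePattern = ([^ ]*\.tmp)

# Memory channel and logger sink for testing
a3.channels.c3.type = memory
a3.sinks.k3.type = logger

# Bind the source and sink to the channel
a3.sources.r3.channels = c3
a3.sinks.k3.channel = c3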

5. Taildir Source (very important)

Create the file taildir-memory-logger.conf:

agent.sources = s1
agent.channels = c1
agent.sinks = r1

# Channel used by the source
agent.sources.s1.channels = c1
# Channel used by the sink
agent.sinks.r1.channel = c1

# Source type
agent.sources.s1.type = TAILDIR
# Position (metadata) file that records how far each file has been read
agent.sources.s1.positionFile = /home/hadoop/tmp/flume/taildir_position.json
# Two file groups to monitor: f1 (a single file) and f2 (a filename regex)
agent.sources.s1.filegroups = f1 f2
agent.sources.s1.filegroups.f1=/home/hadoop/tmp/flume/test1/example.log
agent.sources.s1.headers.f1.headerKey1 = value1
agent.sources.s1.filegroups.f2=/home/hadoop/tmp/flume/test2/.*log.*
agent.sources.s1.headers.f2.headerKey1 = value2-1
agent.sources.s1.headers.f2.headerKey2 = value2-2
agent.sources.s1.fileHeader = true


agent.channels.c1.type = memory
agent.channels.c1.capacity = 1000
agent.channels.c1.transactionCapacity = 100

agent.sinks.r1.type = logger

Run the agent:

flume-ng agent \
--conf $FLUME_HOME/config \
--conf-file $FLUME_HOME/config/taildir-memory-logger.conf \
--name agent \
-Dflume.root.logger=INFO,console

Append to example.log (file group f1) and to 1.log and 2.log (file group f2):

[hadoop@hadoop000 test2]$ echo aaa>>1.log
[hadoop@hadoop000 test2]$ echo bbb>>1.log
[hadoop@hadoop000 test2]$ echo ccc>>2.log
[hadoop@hadoop000 test2]$ echo ddd>>2.log
[hadoop@hadoop000 test1]$ echo eee >> example.log

Console output:

21/01/24 14:15:16 INFO taildir.ReliableTaildirEventReader: headerTable: {f1={headerKey1=value1}, f2={headerKey1=value2-1, headerKey2=value2-2}}
21/01/24 14:15:16 INFO taildir.ReliableTaildirEventReader: Opening file: /home/hadoop/tmp/flume/test1/example.log, inode: 102948027, pos: 0
21/01/24 14:15:16 INFO taildir.ReliableTaildirEventReader: Opening file: /home/hadoop/tmp/flume/test2/1.log, inode: 546644, pos: 0
21/01/24 14:15:16 INFO taildir.ReliableTaildirEventReader: Opening file: /home/hadoop/tmp/flume/test2/2.log, inode: 546645, pos: 0
21/01/24 14:15:16 INFO taildir.ReliableTaildirEventReader: Updating position from position file: /home/hadoop/tmp/flume/taildir_position.json
21/01/24 14:15:16 INFO taildir.ReliableTaildirEventReader: File not found: /home/hadoop/tmp/flume/taildir_position.json, not updating position
21/01/24 14:15:16 INFO instrumentation.MonitoredCounterGroup: Monitored counter group for type: SOURCE, name: s1: Successfully registered new MBean.
21/01/24 14:15:16 INFO instrumentation.MonitoredCounterGroup: Component type: SOURCE, name: s1 started
21/01/24 14:15:38 INFO sink.LoggerSink: Event: { headers:{headerKey1=value2-1, headerKey2=value2-2, file=/home/hadoop/tmp/flume/test2/1.log} body: 61 61 61                                        aaa }
21/01/24 14:15:43 INFO sink.LoggerSink: Event: { headers:{headerKey1=value2-1, headerKey2=value2-2, file=/home/hadoop/tmp/flume/test2/1.log} body: 62 62 62                                        bbb }
21/01/24 14:15:49 INFO sink.LoggerSink: Event: { headers:{headerKey1=value2-1, headerKey2=value2-2, file=/home/hadoop/tmp/flume/test2/2.log} body: 63 63 63                                        ccc }
21/01/24 14:15:53 INFO sink.LoggerSink: Event: { headers:{headerKey1=value2-1, headerKey2=value2-2, file=/home/hadoop/tmp/flume/test2/2.log} body: 64 64 64                                        ddd }
21/01/24 14:16:15 INFO sink.LoggerSink: Event: { headers:{headerKey1=value1, file=/home/hadoop/tmp/flume/test1/example.log} body: 65 65 65                                        eee }
21/01/24 14:17:47 INFO taildir.TaildirSource: Closed file: /home/hadoop/tmp/flume/test2/1.log, inode: 546644, pos: 8
21/01/24 14:17:57 INFO taildir.TaildirSource: Closed file: /home/hadoop/tmp/flume/test2/2.log, inode: 546645, pos: 8
21/01/24 14:18:17 INFO taildir.TaildirSource: Closed file: /home/hadoop/tmp/flume/test1/example.log, inode: 102948027, pos: 4

The log output shows the offsets (pos) that record how far each file has been consumed. Even if Flume dies while data keeps being appended to the files, after a restart it resumes from the recorded offsets, so no data is lost.

Check the recorded offsets:

[hadoop@hadoop000 flume]$ cat /home/hadoop/tmp/flume/taildir_position.json

// result:
[{"inode":102948027,"pos":4,"file":"/home/hadoop/tmp/flume/test1/example.log"},
{"inode":546644,"pos":8,"file":"/home/hadoop/tmp/flume/test2/1.log"},
{"inode":546645,"pos":8,"file":"/home/hadoop/tmp/flume/test2/2.log"}]

After echo eee >> example.log, the recorded pos for example.log is 4 (the three bytes of "eee" plus the trailing newline).
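
A quick sanity check, assuming the same paths as above: the pos recorded in the JSON should equal the file's size in bytes (4 for example.log), and the inode should match the one on disk (102948027 in the run above):

[hadoop@hadoop000 flume]$ wc -c /home/hadoop/tmp/flume/test1/example.log
[hadoop@hadoop000 flume]$ ls -i /home/hadoop/tmp/flume/test1/example.log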
