Distributed Log Collection: Flume Study Notes

1. Flume Architecture

[Figure: Flume architecture (Source → Channel → Sink)]
Execution flow:
[Figure: Flume agent execution flow]
The event concept:
An event encapsulates the data being transferred and is Flume's basic unit of data transfer; for a text file it is usually one line of the file. The event is also the basic unit of a transaction. An event flows from the source into the channel and then to the sink; its body is a byte array, and it can also carry headers. An event represents the smallest complete unit of data, coming from an external data source and heading toward an external destination.

2. Writing from a netcat source to the console

Create the file example.conf:

# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source
# the netcat source can be tested with telnet
a1.sources.r1.type = netcat
a1.sources.r1.bind = 0.0.0.0
# listen on port 44444
a1.sources.r1.port = 44444

# Use a channel which buffers events in memory
a1.channels.c1.type = memory

# Describe the sink (the logger sink prints events to the console)
a1.sinks.k1.type = logger

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

Run the agent:

flume-ng agent \
--conf $FLUME_HOME/config \
--conf-file $FLUME_HOME/config/example.conf \
--name a1 \
-Dflume.root.logger=INFO,console

Telnet test:

[hadoop@hadoop000 config]$ telnet localhost 44444
Trying 192.168.198.128...
Connected to localhost.
Escape character is '^]'.
aa         // send the string aa
OK
bb
OK
cc
OK

Console output:

21/01/24 01:59:18 INFO source.NetcatSource: Created serverSocket:sun.nio.ch.ServerSocketChannelImpl[/0:0:0:0:0:0:0:0:44444]
21/01/24 01:59:34 INFO sink.LoggerSink: Event: { headers:{} body: 61 61 0D                                        aa. }
21/01/24 01:59:36 INFO sink.LoggerSink: Event: { headers:{} body: 62 62 0D                                        bb. }
21/01/24 01:59:37 INFO sink.LoggerSink: Event: { headers:{} body: 63 63 0D                                        cc. }

As you can see, each line of input becomes one event, and an event consists of headers and a body; the body carries the data. (The trailing 0D byte is the carriage return that telnet appends to each line.)

3. Writing from an exec source to HDFS

Create the file exec-hdfs-agent.conf:

# Name the components on this agent
exec-hdfs-agent.sources = exec-source
exec-hdfs-agent.sinks = hdfs-sink
exec-hdfs-agent.channels = memory-channel

# Describe/configure the source
exec-hdfs-agent.sources.exec-source.type = exec
exec-hdfs-agent.sources.exec-source.command = tail -F ~/data/data.log
exec-hdfs-agent.sources.exec-source.shell = /bin/bash -c

# Describe the sink
exec-hdfs-agent.sinks.hdfs-sink.type = hdfs
exec-hdfs-agent.sinks.hdfs-sink.hdfs.path = hdfs://hadoop000:8020/data/flume/tail
exec-hdfs-agent.sinks.hdfs-sink.hdfs.fileType=DataStream
exec-hdfs-agent.sinks.hdfs-sink.hdfs.writeFormat=Text
exec-hdfs-agent.sinks.hdfs-sink.hdfs.batchSize=10

# Use a channel which buffers events in memory
exec-hdfs-agent.channels.memory-channel.type = memory

# Bind the source and sink to the channel
exec-hdfs-agent.sources.exec-source.channels = memory-channel
exec-hdfs-agent.sinks.hdfs-sink.channel = memory-channel

Run the agent:

flume-ng agent \
--conf $FLUME_HOME/config \
--conf-file $FLUME_HOME/config/exec-hdfs-agent.conf \
--name exec-hdfs-agent \
-Dflume.root.logger=INFO,console

Append some data:

[hadoop@hadoop000 data]$ echo aaa >> data.log 
[hadoop@hadoop000 data]$ echo bbb >> data.log 
[hadoop@hadoop000 data]$ echo ccc >> data.log 

Console output:

21/01/24 00:32:42 INFO hdfs.HDFSDataStream: Serializer = TEXT, UseRawLocalFileSystem = false
21/01/24 00:32:42 INFO hdfs.BucketWriter: Creating hdfs://hadoop000:8020/data/flume/tail/FlumeData.1611419562576.tmp
21/01/24 00:33:12 INFO hdfs.BucketWriter: Closing hdfs://hadoop000:8020/data/flume/tail/FlumeData.1611419562576.tmp
21/01/24 00:33:12 INFO hdfs.BucketWriter: Renaming hdfs://hadoop000:8020/data/flume/tail/FlumeData.1611419562576.tmp to hdfs://hadoop000:8020/data/flume/tail/FlumeData.1611419562576
21/01/24 00:33:12 INFO hdfs.HDFSEventSink: Writer callback called.

hdfs.rollInterval: roll to a new file every 30 seconds (the default). You can see that the gap between "Creating hdfs..." and "Closing hdfs..." above is exactly 30 seconds.
The final file name is: prefix + name + suffix.
While the file is still being written, the name is: in-use prefix + prefix + name + suffix + in-use suffix.
hdfs.filePrefix: the file prefix, FlumeData by default.
hdfs.inUsePrefix: empty by default, so nothing is prepended to FlumeData while the file is in use.
hdfs.fileSuffix: empty by default, so the final file has no suffix.
hdfs.inUseSuffix: .tmp by default while the file is in use; when the file is rolled, the Renaming step removes the .tmp.
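
For reference, these roll and naming options could be set explicitly on the hdfs-sink above; a sketch with illustrative values (when they are omitted, the defaults described above apply):

# Roll and naming options for the hdfs-sink (illustrative values)
# rollInterval = 30: roll every 30 seconds; rollSize/rollCount = 0 disables size/count-based rolling
exec-hdfs-agent.sinks.hdfs-sink.hdfs.rollInterval = 30
exec-hdfs-agent.sinks.hdfs-sink.hdfs.rollSize = 0
exec-hdfs-agent.sinks.hdfs-sink.hdfs.rollCount = 0
exec-hdfs-agent.sinks.hdfs-sink.hdfs.filePrefix = FlumeData
exec-hdfs-agent.sinks.hdfs-sink.hdfs.inUsePrefix = _
exec-hdfs-agent.sinks.hdfs-sink.hdfs.inUseSuffix = .tmp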

HDFS Sink parameters:
[Figure: HDFS Sink parameter table]
Important parameters:
batchSize: the number of events written from the source into the channel per batch, or taken from the channel by the sink in one pull (writing in batches is far more efficient than writing events one by one). A batch is written as a single transaction, so if the write fails midway it is rolled back.
transactionCapacity: the maximum number of events in one transaction, i.e. the capacity of the List buffer in the architecture diagram above, so batchSize must not exceed transactionCapacity.
capacity: the maximum number of events the channel can hold.
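
As a rule of thumb, batchSize <= transactionCapacity <= capacity. A minimal sketch of how these three could be set explicitly on the agent above (the numbers are illustrative, not recommendations):

# Illustrative sizing: batchSize <= transactionCapacity <= capacity
exec-hdfs-agent.channels.memory-channel.capacity = 10000
exec-hdfs-agent.channels.memory-channel.transactionCapacity = 1000
exec-hdfs-agent.sinks.hdfs-sink.hdfs.batchSize = 100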

Drawbacks:
Drawback 1: it depends on the Linux command tail -F, which is not very reliable. The exec source does have a restart mechanism for when the command dies (see the sketch below), but it still cannot guarantee no data loss.
Drawback 2: it can only monitor a single file, not a directory.
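
For reference, these exec source options control that restart behaviour; a sketch against the agent above (the throttle value is illustrative):

# Restart the tail command if it dies, waiting 10 seconds between attempts,
# and log its stderr (illustrative values)
exec-hdfs-agent.sources.exec-source.restart = true
exec-hdfs-agent.sources.exec-source.restartThrottle = 10000
exec-hdfs-agent.sources.exec-source.logStdErr = true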

4. Spooling Directory Source

From the official documentation (paraphrased):
This source lets you ingest data by placing files into a "spooling" directory on disk. The source watches that directory for new files and parses events out of new files as they appear. The event parsing logic is pluggable. After a given file has been fully read into the channel, completion is by default indicated by renaming the file; alternatively the file can be deleted, or a trackerDir can be used to track processed files.

Unlike the Exec source, this source is reliable and will not lose data even if Flume is restarted or killed. In exchange for this reliability, only immutable, uniquely-named files may be placed into the spooling directory. Flume tries to detect these problem conditions and will fail loudly if they are violated:

  • If a file is written to after being placed into the spooling directory, Flume will print an error to its log file and stop processing.
  • If a file name is reused at a later time, Flume will print an error to its log file and stop processing.

To avoid the above issues, it may be useful to add a unique identifier (such as a timestamp) to file names when moving them into the spooling directory.

Despite the reliability guarantees of this source, there are still cases in which events may be duplicated if certain downstream failures occur. This is consistent with the guarantees offered by other Flume components.

In short: files in the spooling directory must not be modified, so copy only fully written files into it; do not copy a file in and then continue writing to it.

# Describe/configure the source
a3.sources.r3.type = spooldir
a3.sources.r3.spoolDir = /tmp/upload
# Suffix appended to a file in spoolDir once it has been fully ingested,
# marking processed vs. unprocessed files (the file is read first, then renamed)
a3.sources.r3.fileSuffix = .COMPLETED
a3.sources.r3.fileHeader = true
# Ignore (do not upload) any file ending in .tmp
a3.sources.r3.ignorePattern = ([^ ]*\.tmp)

Advantage: it can monitor a whole directory.
Drawback: as described above, files must not be modified after being placed in the directory. A complete agent wiring for this source is sketched below.
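
The snippet above only configures the source; a minimal sketch of a full agent around it, assuming a memory channel and a logger sink (the names c3 and k3 are chosen here to match the a3/r3 convention):

a3.sources = r3
a3.channels = c3
a3.sinks = k3

# Spooling directory source (as configured above)
a3.sources.r3.type = spooldir
a3.sources.r3.spoolDir = /tmp/upload
a3.sources.r3.fileSuffix = .COMPLETED
a3.sources.r3.fileHeader = true
a3.sources.r3.ignorePattern = ([^ ]*\.tmp)

# Memory channel and logger sink for testing
a3.channels.c3.type = memory
a3.sinks.k3.type = logger

# Bind the source and sink to the channel
a3.sources.r3.channels = c3
a3.sinks.k3.channel = c3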

5. Taildir Source (very important)

Create the file taildir-memory-logger.conf:

agent.sources = s1
agent.channels = c1
agent.sinks = r1

# Channel used by the source
agent.sources.s1.channels = c1
# Channel used by the sink
agent.sinks.r1.channel = c1

# Source type
agent.sources.s1.type = TAILDIR
# Position (metadata) file that records how far each file has been read
agent.sources.s1.positionFile = /home/hadoop/tmp/flume/taildir_position.json
# Two file groups to monitor: f1 (a single file) and f2 (a filename regex)
agent.sources.s1.filegroups = f1 f2
agent.sources.s1.filegroups.f1=/home/hadoop/tmp/flume/test1/example.log
agent.sources.s1.headers.f1.headerKey1 = value1
agent.sources.s1.filegroups.f2=/home/hadoop/tmp/flume/test2/.*log.*
agent.sources.s1.headers.f2.headerKey1 = value2-1
agent.sources.s1.headers.f2.headerKey2 = value2-2
agent.sources.s1.fileHeader = true


agent.channels.c1.type = memory
agent.channels.c1.capacity = 1000
agent.channels.c1.transactionCapacity = 100

agent.sinks.r1.type = logger

Run the agent:

flume-ng agent \
--conf $FLUME_HOME/config \
--conf-file $FLUME_HOME/config/taildir-memory-logger.conf \
--name agent \
-Dflume.root.logger=INFO,console

Append to example.log (file group f1) and to 1.log and 2.log (file group f2):

[hadoop@hadoop000 test2]$ echo aaa>>1.log
[hadoop@hadoop000 test2]$ echo bbb>>1.log
[hadoop@hadoop000 test2]$ echo ccc>>2.log
[hadoop@hadoop000 test2]$ echo ddd>>2.log
[hadoop@hadoop000 test1]$ echo eee >> example.log

Console output:

21/01/24 14:15:16 INFO taildir.ReliableTaildirEventReader: headerTable: {f1={headerKey1=value1}, f2={headerKey1=value2-1, headerKey2=value2-2}}
21/01/24 14:15:16 INFO taildir.ReliableTaildirEventReader: Opening file: /home/hadoop/tmp/flume/test1/example.log, inode: 102948027, pos: 0
21/01/24 14:15:16 INFO taildir.ReliableTaildirEventReader: Opening file: /home/hadoop/tmp/flume/test2/1.log, inode: 546644, pos: 0
21/01/24 14:15:16 INFO taildir.ReliableTaildirEventReader: Opening file: /home/hadoop/tmp/flume/test2/2.log, inode: 546645, pos: 0
21/01/24 14:15:16 INFO taildir.ReliableTaildirEventReader: Updating position from position file: /home/hadoop/tmp/flume/taildir_position.json
21/01/24 14:15:16 INFO taildir.ReliableTaildirEventReader: File not found: /home/hadoop/tmp/flume/taildir_position.json, not updating position
21/01/24 14:15:16 INFO instrumentation.MonitoredCounterGroup: Monitored counter group for type: SOURCE, name: s1: Successfully registered new MBean.
21/01/24 14:15:16 INFO instrumentation.MonitoredCounterGroup: Component type: SOURCE, name: s1 started
21/01/24 14:15:38 INFO sink.LoggerSink: Event: { headers:{headerKey1=value2-1, headerKey2=value2-2, file=/home/hadoop/tmp/flume/test2/1.log} body: 61 61 61                                        aaa }
21/01/24 14:15:43 INFO sink.LoggerSink: Event: { headers:{headerKey1=value2-1, headerKey2=value2-2, file=/home/hadoop/tmp/flume/test2/1.log} body: 62 62 62                                        bbb }
21/01/24 14:15:49 INFO sink.LoggerSink: Event: { headers:{headerKey1=value2-1, headerKey2=value2-2, file=/home/hadoop/tmp/flume/test2/2.log} body: 63 63 63                                        ccc }
21/01/24 14:15:53 INFO sink.LoggerSink: Event: { headers:{headerKey1=value2-1, headerKey2=value2-2, file=/home/hadoop/tmp/flume/test2/2.log} body: 64 64 64                                        ddd }
21/01/24 14:16:15 INFO sink.LoggerSink: Event: { headers:{headerKey1=value1, file=/home/hadoop/tmp/flume/test1/example.log} body: 65 65 65                                        eee }
21/01/24 14:17:47 INFO taildir.TaildirSource: Closed file: /home/hadoop/tmp/flume/test2/1.log, inode: 546644, pos: 8
21/01/24 14:17:57 INFO taildir.TaildirSource: Closed file: /home/hadoop/tmp/flume/test2/2.log, inode: 546645, pos: 8
21/01/24 14:18:17 INFO taildir.TaildirSource: Closed file: /home/hadoop/tmp/flume/test1/example.log, inode: 102948027, pos: 4

The log output shows the offsets (pos) that record how far each file has been consumed. Even if Flume dies while data keeps being appended to the files, after a restart it resumes from the recorded offsets, so no data is lost.

Check the recorded offsets:

[hadoop@hadoop000 flume]$ cat /home/hadoop/tmp/flume/taildir_position.json

// result:
[{"inode":102948027,"pos":4,"file":"/home/hadoop/tmp/flume/test1/example.log"},
{"inode":546644,"pos":8,"file":"/home/hadoop/tmp/flume/test2/1.log"},
{"inode":546645,"pos":8,"file":"/home/hadoop/tmp/flume/test2/2.log"}]

After echo eee >> example.log, the recorded pos for example.log is 4 (the three bytes of "eee" plus the trailing newline).
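
A quick sanity check, assuming the same paths as above: the pos recorded in the JSON should equal the file's size in bytes (4 for example.log), and the inode should match the one on disk (102948027 in the run above):

[hadoop@hadoop000 flume]$ wc -c /home/hadoop/tmp/flume/test1/example.log
[hadoop@hadoop000 flume]$ ls -i /home/hadoop/tmp/flume/test1/example.log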
