Flume Data Collection Tool: Application Examples

Author: 想做CTO的任同学...

Last modified: 2023-01-16 11:31:30

Tags: hadoop, big data, flume, java

First published: 2021-09-15 22:47:24

Article link: https://blog.csdn.net/qq_43408367/article/details/120318881

Flume: Getting-Started Example

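Requirement: listen on local port 8888 and have Flume print the received data to the console in real time
Requirement analysis:
1. The telnet tool can send data to port 8888
2. To listen for data on a port, use the netcat source
3. Use the memory channel
4. To display data in real time, use the logger sink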

Create the Flume agent configuration file, flume-netcat-logger.conf:

# a1 is the agent name. The source, channel, and sink are named r1, c1, and k1
a1.sources = r1
a1.channels = c1
a1.sinks = k1
# source
a1.sources.r1.type = netcat
a1.sources.r1.bind = linux123
a1.sources.r1.port = 8888
# channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000
a1.channels.c1.transactionCapacity = 100

# sink
a1.sinks.k1.type = logger
# Wire the source and the sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

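The memory channel is a channel implementation that buffers events in memory. It is fast, but its capacity is bounded by the JVM heap size and its reliability is limited: events still sitting in the channel are lost if the agent process dies. It suits log collection workloads that can tolerate some data loss but need high throughput.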

Start the Flume agent:

[root@linux123 conf]# $FLUME_HOME/bin/flume-ng agent --name a1 \
> --conf-file $FLUME_HOME/conf/flume-netcat-logger.conf \
> -Dflume.root.logger=INFO,console

Use telnet to send a message to port 8888: telnet linux123 8888

Check the received data in the Flume agent's console output:

INFO sink.LoggerSink: Event: { headers:{} body: 68 65 6C 6C 6F 20 77 6F 72 6C 64 0D             hello world. }
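
The logger sink prints the event body as hex bytes followed by a printable preview. As a reading aid, the hex dump above can be decoded with a few lines of Python. This is only a sketch for checking the output by hand; the helper name `decode_body` is made up for illustration and is not part of Flume.

```python
def decode_body(hex_dump: str) -> bytes:
    """Turn a space-separated hex dump (as printed by the logger sink) into bytes."""
    return bytes.fromhex(hex_dump)


body = decode_body("68 65 6C 6C 6F 20 77 6F 72 6C 64 0D")
print(body)  # b'hello world\r'
```

The trailing 0D byte is the carriage return that telnet sends at the end of the line, which is presumably why the preview ends with a dot instead of a visible character.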

Flume: Monitoring a Log File and Writing to HDFS

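Requirement: monitor a local log file and upload its content to HDFS in real time
Requirement analysis:
1. The tail -F command can follow the new content written to the local log file
2. Use the exec source. It runs a specified command and uses the command's output as the data source.
3. The source reads its events from the output of that command. Because the exec source does not record how far it has read, data may be lost when the agent process dies and is restarted.
4. Use the memory channel
5. Use the HDFS sink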
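```
tail -f
Equivalent to --follow=descriptor: follows the file descriptor. If the file is renamed or deleted,
tail stays on the original descriptor and does not pick up a new file created under the same name
tail -F
Equivalent to --follow=name --retry: follows the file name and keeps retrying,
so if the file is deleted or renamed and a file with the same name is created again, tracking continues
```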
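
The difference between following a descriptor and following a name can be reproduced without tail. The sketch below assumes a POSIX file system (renaming an open file is not allowed on Windows); the function name and the file names are hypothetical. It opens a file, simulates log rotation by renaming it, creates a new file under the old name, and then checks that the open descriptor still refers to the old file.

```python
import os
import tempfile


def descriptor_survives_rename() -> bool:
    """Return True if an open descriptor still points at the renamed file
    rather than at a new file created under the original name."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "app.log")
        with open(path, "w") as f:
            f.write("old\n")

        reader = open(path)                              # like tail -f: holds a descriptor
        os.rename(path, os.path.join(tmp, "app.log.1"))  # rotation renames the file
        with open(path, "w") as f:                       # a new file appears under the old name
            f.write("new\n")

        inode_by_descriptor = os.fstat(reader.fileno()).st_ino
        inode_by_name = os.stat(path).st_ino
        content = reader.read()
        reader.close()
        return inode_by_descriptor != inode_by_name and content == "old\n"


print(descriptor_survives_rename())
```

On a POSIX system this is expected to print True, which is the situation tail -F handles by reopening the file by name.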

Create the configuration file, flume-exec-hdfs.conf:

# Name the components on this agent
a2.sources = r2
a2.sinks = k2
a2.channels = c2

# Describe/configure the source
a2.sources.r2.type = exec
a2.sources.r2.command = tail -F /tmp/root/hive.log
# Use a channel which buffers events in memory
a2.channels.c2.type = memory
a2.channels.c2.capacity = 10000
a2.channels.c2.transactionCapacity = 500
# Describe the sink
a2.sinks.k2.type = hdfs
a2.sinks.k2.hdfs.path = hdfs://linux121:8020/flume/%Y%m%d/%H%M
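# Prefix for the names of uploaded files
a2.sinks.k2.hdfs.filePrefix = logs-
# Use the local timestamp instead of a timestamp from the event header
a2.sinks.k2.hdfs.useLocalTimeStamp = true
# Accumulate 500 events before flushing to HDFS
a2.sinks.k2.hdfs.batchSize = 500
# File type; compression is supported. DataStream writes uncompressed output
a2.sinks.k2.hdfs.fileType = DataStream
# Roll the file every 60 seconds
a2.sinks.k2.hdfs.rollInterval = 60
# Roll the file at about 128 MB (134217700 bytes, slightly under 128 MiB = 134217728 bytes)
a2.sinks.k2.hdfs.rollSize = 134217700
# Do not roll based on the number of events
a2.sinks.k2.hdfs.rollCount = 0
# Minimum number of block replicas
a2.sinks.k2.hdfs.minBlockReplicas = 1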
# Bind the source and sink to the channel
a2.sources.r2.channels = c2
a2.sinks.k2.channel = c2
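
The escape sequences in hdfs.path (%Y, %m, %d, %H, %M) are filled in from the event timestamp, here the local time because hdfs.useLocalTimeStamp is true. These particular escapes have the same meaning in Python's strftime, so the resulting directory can be previewed with a small sketch. The function name `bucket_path` and the sample timestamp are assumptions made for illustration.

```python
from datetime import datetime


def bucket_path(template: str, ts: datetime) -> str:
    """Expand %Y, %m, %d, %H and %M the way strftime does."""
    return ts.strftime(template)


ts = datetime(2021, 9, 15, 22, 47, 24)
print(bucket_path("/flume/%Y%m%d/%H%M", ts))  # /flume/20210915/2247
```

With a per-minute directory and rollInterval = 60, one would expect roughly one file per minute directory while the log is being written.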

Start Flume:

[root@linux123 conf]# $FLUME_HOME/bin/flume-ng agent --name a2 \
> --conf-file ~/conf/flume-exec-hdfs.conf \
> -Dflume.root.logger=INFO,console

Check the files on HDFS

Flume: Monitoring a Directory and Writing to HDFS

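Requirement: monitor a specified directory and upload the collected data to HDFS
Requirement analysis:
1. Use the spooldir source. It guarantees that no data is lost and can resume where it left off, but its latency is higher and it cannot monitor files in real time
2. Use the memory channel
3. Use the HDFS sink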
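The spooldir source watches a specified directory: whenever a new file is added to it, the source picks the file up, parses its content, and writes the events to the channel. Once a file has been fully read into the channel, it is marked as done by appending the .COMPLETED suffix to its name. Although the whole directory is watched automatically, only new files are detected; if content is appended to a file that has already been processed, the source does not notice it.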
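Notes:
1. Files copied into the spool directory must not be opened and edited afterwards
2. Changes in subdirectories are not monitored
3. The monitored directory is scanned for changes every 500 milliseconds
4. It is suitable for syncing new files, but not for watching and syncing log files that are appended to in real time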

Create the configuration file, flume-spooldir-hdfs.conf:

# Name the components on this agent
a3.sources = r3
a3.channels = c3
a3.sinks = k3
# Describe/configure the source
a3.sources.r3.type = spooldir
a3.sources.r3.spoolDir = /root/upload
a3.sources.r3.fileSuffix = .COMPLETED
a3.sources.r3.fileHeader = true
# Ignore files ending in .tmp; they are not uploaded
a3.sources.r3.ignorePattern = ([^ ]*\.tmp)
# Use a channel which buffers events in memory
a3.channels.c3.type = memory
a3.channels.c3.capacity = 10000
a3.channels.c3.transactionCapacity = 500
# Describe the sink
a3.sinks.k3.type = hdfs
a3.sinks.k3.hdfs.path = hdfs://linux121:8020/flume/upload/%Y%m%d/%H%M
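# Prefix for the names of uploaded files
a3.sinks.k3.hdfs.filePrefix = upload-
# Use the local timestamp instead of a timestamp from the event header
a3.sinks.k3.hdfs.useLocalTimeStamp = true
# Accumulate 500 events before flushing to HDFS
a3.sinks.k3.hdfs.batchSize = 500
# File type; DataStream writes uncompressed output
a3.sinks.k3.hdfs.fileType = DataStream
# Roll the file every 60 seconds
a3.sinks.k3.hdfs.rollInterval = 60
# Roll the file at about 128 MB (134217700 bytes, slightly under 128 MiB = 134217728 bytes)
a3.sinks.k3.hdfs.rollSize = 134217700
# Do not roll based on the number of events
a3.sinks.k3.hdfs.rollCount = 0
# Minimum number of block replicas
a3.sinks.k3.hdfs.minBlockReplicas = 1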
# Bind the source and sink to the channel
a3.sources.r3.channels = c3
a3.sinks.k3.channel = c3
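
To see which file names the ignorePattern above would skip, the same regular expression can be tried out in Python. This sketch assumes the pattern is applied to the whole file name (a full match), and that Python's `re` module behaves like Java's regex engine for this simple pattern; `is_ignored` is a hypothetical helper, not a Flume API.

```python
import re

IGNORE_PATTERN = re.compile(r"([^ ]*\.tmp)")


def is_ignored(file_name: str) -> bool:
    """True if the whole file name matches the ignore pattern."""
    return IGNORE_PATTERN.fullmatch(file_name) is not None


for name in ["data.tmp", "data.log", "my file.tmp", "data.tmp.bak"]:
    print(name, is_ignored(name))
```

Under these assumptions only data.tmp is ignored. Note that `[^ ]*` excludes spaces, so a name such as "my file.tmp" would not be skipped by this particular pattern.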

Start Flume:

[root@linux123 conf]# $FLUME_HOME/bin/flume-ng agent --name a3 \
> --conf-file ~/conf/flume-spooldir-hdfs.conf \
> -Dflume.root.logger=INFO,console

Add files to the upload directory
Check the data on HDFS

PS: HDFS Sink

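The HDFS sink always writes its output as rolling files. The rolling policies are:
1. Time based: hdfs.rollInterval, default 30, in seconds; 0 disables it
2. File-size based: hdfs.rollSize, default 1024 bytes; 0 disables it
3. Event-count based: hdfs.rollCount, default 10; 0 disables it
4. Idle-time based: hdfs.idleTimeout, default 0 (disabled), in seconds
5. HDFS replica-count based: hdfs.minBlockReplicas, which defaults to the HDFS replication factor. Set it to 1; otherwise replication of the blocks backing the file can trigger a file roll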
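
How the first three policies interact can be sketched as a simplified model: the file rolls as soon as any enabled threshold is reached, and a threshold of 0 is skipped. This is an illustration of the rules listed above, not Flume's actual implementation; `should_roll` and its parameters are hypothetical, and idleTimeout and minBlockReplicas are left out.

```python
def should_roll(elapsed_s: float, size_bytes: int, event_count: int,
                roll_interval: int = 30, roll_size: int = 1024,
                roll_count: int = 10) -> bool:
    """Roll when any enabled threshold is reached; 0 disables a threshold."""
    by_time = roll_interval > 0 and elapsed_s >= roll_interval
    by_size = roll_size > 0 and size_bytes >= roll_size
    by_count = roll_count > 0 and event_count >= roll_count
    return by_time or by_size or by_count


# With the defaults, 10 small events are already enough to roll the file.
print(should_roll(elapsed_s=1, size_bytes=200, event_count=10))
# With size and count disabled, only the elapsed time matters.
print(should_roll(1, 200, 10, roll_interval=60, roll_size=0, roll_count=0))
```

The first call illustrates why the default values tend to produce many small files on HDFS.
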
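Time-related settings:
1. hdfs.useLocalTimeStamp: use the local time instead of the timestamp in the event header; default false
2. hdfs.round: whether the timestamp is rounded down; default false. When true, it affects all time-based escape sequences except %t
3. hdfs.roundValue: round down to the highest multiple of this value (in the unit set by hdfs.roundUnit) that is less than the current time; default 1
4. hdfs.roundUnit: one of second, minute, hour; default second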
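
The effect of hdfs.round, hdfs.roundValue and hdfs.roundUnit can be illustrated with a short sketch. It assumes that rounding truncates the field named by the unit to a multiple of the value and zeroes the smaller fields; the function `round_down` is hypothetical and only models the settings described above.

```python
from datetime import datetime


def round_down(ts: datetime, round_value: int, round_unit: str) -> datetime:
    """Round ts down to a multiple of round_value in the given unit."""
    if round_unit == "second":
        return ts.replace(second=ts.second // round_value * round_value,
                          microsecond=0)
    if round_unit == "minute":
        return ts.replace(minute=ts.minute // round_value * round_value,
                          second=0, microsecond=0)
    if round_unit == "hour":
        return ts.replace(hour=ts.hour // round_value * round_value,
                          minute=0, second=0, microsecond=0)
    raise ValueError("round_unit must be second, minute or hour")


ts = datetime(2012, 6, 12, 11, 54, 34)
print(round_down(ts, 10, "minute").strftime("%Y-%m-%d %H:%M:%S"))  # 2012-06-12 11:50:00
```

With roundValue = 10 and roundUnit = minute, all events from 11:50:00 to 11:59:59 end up under the same time bucket, which reduces the number of directories and files.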

To keep the HDFS sink from producing small files, consider settings such as the following:

a1.sinks.k1.type=hdfs
a1.sinks.k1.hdfs.useLocalTimeStamp=true
a1.sinks.k1.hdfs.path=hdfs://linux121:9000/flume/events/%Y/%m/%d/%H/%M
a1.sinks.k1.hdfs.minBlockReplicas=1
a1.sinks.k1.hdfs.rollInterval=3600
a1.sinks.k1.hdfs.rollSize=0
a1.sinks.k1.hdfs.rollCount=0
a1.sinks.k1.hdfs.idleTimeout=0

Flume: Collecting Log Files to HDFS and the Local File System

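Requirement: monitor log files and send the collected data to both HDFS and the local file system
Requirement analysis:
1. Several agents need to be chained together
2. Use the taildir source
3. Use the memory channel
4. Use hdfs and file_roll as the final sinks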
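taildir source. Added in Flume 1.7.0, it is roughly a combination of the spooldir source and the exec source. It can monitor multiple directories and uses regular expressions to match the file names in them for real-time collection. It tails a set of files and records the latest consumed position of each one, so it can pick up where it left off after the agent process restarts instead of losing data. It does not work on Windows. It never touches the files it tracks: it does not rename, delete, or modify them. It does not support binary files; it reads text files line by line.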
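
The idea of recording the last consumed position per file can be sketched in a few lines. This is only an illustration of the principle under simplifying assumptions (one offset per path stored in a JSON file, complete lines only); it does not reproduce Flume's actual position file format or its handling of rotated files, and all names are hypothetical.

```python
import json
import os
import tempfile


def read_new_lines(log_path: str, position_path: str) -> list:
    """Read the lines appended since the last call and persist the new offset."""
    positions = {}
    if os.path.exists(position_path):
        with open(position_path) as f:
            positions = json.load(f)

    with open(log_path, "rb") as log:
        log.seek(positions.get(log_path, 0))  # resume where the previous run stopped
        lines = [raw.decode("utf-8").rstrip("\n") for raw in log.readlines()]
        positions[log_path] = log.tell()

    with open(position_path, "w") as f:
        json.dump(positions, f)
    return lines


with tempfile.TemporaryDirectory() as tmp:
    log_path = os.path.join(tmp, "hive.log")
    position_path = os.path.join(tmp, "position.json")

    with open(log_path, "w") as f:
        f.write("line 1\nline 2\n")
    first = read_new_lines(log_path, position_path)

    with open(log_path, "a") as f:
        f.write("line 3\n")
    second = read_new_lines(log_path, position_path)  # simulates a restarted reader

    print(first, second)
```

Because the offset lives in a file rather than in memory, the second call behaves like a reader that was restarted: it returns only the line added after the first call.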

Create the first configuration file, flume-taildir-avro.conf. It contains:

1 taildir source
2 memory channels
2 avro sinks

# Name the components on this agent
a1.sources = r1
a1.sinks = k1 k2
a1.channels = c1 c2
# Replicate the data flow to all channels
a1.sources.r1.selector.type = replicating
# source
a1.sources.r1.type = taildir
# Records the latest consumed position of each file
a1.sources.r1.positionFile = /root/flume/taildir_position.json
a1.sources.r1.filegroups = f1
# Note: .*log is a regular expression; writing *.log here would be wrong
a1.sources.r1.filegroups.f1 = /tmp/root/.*log
# sink
a1.sinks.k1.type = avro
a1.sinks.k1.hostname = linux123
a1.sinks.k1.port = 9091

a1.sinks.k2.type = avro
a1.sinks.k2.hostname = linux123
a1.sinks.k2.port = 9092
# channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000
a1.channels.c1.transactionCapacity = 500
a1.channels.c2.type = memory
a1.channels.c2.capacity = 10000
a1.channels.c2.transactionCapacity = 500
# Bind the source and sink to the channel
a1.sources.r1.channels = c1 c2
a1.sinks.k1.channel = c1
a1.sinks.k2.channel = c2
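
The note about `.*log` versus `*.log` in the file group setting is the usual difference between a regular expression and a shell glob. The sketch below uses Python's `re` module to show it, assuming the expression is matched against the whole file name; both helper functions are hypothetical. Java's regex engine, which Flume runs on, is expected to reject a leading `*` in a similar way.

```python
import re


def matches_filegroup(pattern: str, file_name: str) -> bool:
    """True if the whole file name matches the regular expression."""
    return re.fullmatch(pattern, file_name) is not None


def is_valid_regex(pattern: str) -> bool:
    """True if the pattern compiles as a regular expression."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False


print(matches_filegroup(r".*log", "hive.log"))    # True
print(matches_filegroup(r".*log", "hive.log.1"))  # False
print(is_valid_regex("*.log"))                    # False: '*' has nothing to repeat
```

Note that `.*log` matches any name ending in "log", not only names ending in ".log"; a stricter expression would be `.*\.log`.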

Create the second configuration file, flume-avro-hdfs.conf. It contains:

1 avro source
1 memory channel
1 hdfs sink

# Name the components on this agent
a2.sources = r1
a2.sinks = k1
a2.channels = c1
# Describe/configure the source
a2.sources.r1.type = avro
a2.sources.r1.bind = linux123
a2.sources.r1.port = 9091
# Describe the channel
a2.channels.c1.type = memory
a2.channels.c1.capacity = 10000
a2.channels.c1.transactionCapacity = 500
# Describe the sink
a2.sinks.k1.type = hdfs
a2.sinks.k1.hdfs.path = hdfs://linux121:8020/flume2/%Y%m%d/%H
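# Prefix for the names of uploaded files
a2.sinks.k1.hdfs.filePrefix = flume2-
# Use the local timestamp instead of a timestamp from the event header
a2.sinks.k1.hdfs.useLocalTimeStamp = true
# Accumulate 500 events before flushing to HDFS
a2.sinks.k1.hdfs.batchSize = 500
# File type; compression is supported. DataStream writes uncompressed output
a2.sinks.k1.hdfs.fileType = DataStream
# Start a new file every 60 seconds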
a2.sinks.k1.hdfs.rollInterval = 60
a2.sinks.k1.hdfs.rollSize = 0
a2.sinks.k1.hdfs.rollCount = 0
a2.sinks.k1.hdfs.minBlockReplicas = 1
# Bind the source and sink to the channel
a2.sources.r1.channels = c1
a2.sinks.k1.channel = c1

Create the third configuration file, flume-avro-file.conf. It contains:

1 avro source
1 memory channel
1 file_roll sink

# Name the components on this agent
a3.sources = r1
a3.sinks = k1
a3.channels = c2
# Describe/configure the source
a3.sources.r1.type = avro
a3.sources.r1.bind = linux123
a3.sources.r1.port = 9092
# Describe the sink
a3.sinks.k1.type = file_roll
# The directory must be created beforehand
a3.sinks.k1.sink.directory = /root/flume/output
# Describe the channel
a3.channels.c2.type = memory
a3.channels.c2.capacity = 10000
a3.channels.c2.transactionCapacity = 500
# Bind the source and sink to the channel
a3.sources.r1.channels = c2
a3.sinks.k1.channel = c2

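Start the three agents, the downstream ones (a3 and a2) first so that their avro sources are listening before a1's avro sinks connect: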

$FLUME_HOME/bin/flume-ng agent --name a3 \
--conf-file ~/conf/flume-avro-file.conf \
-Dflume.root.logger=INFO,console &

$FLUME_HOME/bin/flume-ng agent --name a2 \
--conf-file ~/conf/flume-avro-hdfs.conf \
-Dflume.root.logger=INFO,console &

$FLUME_HOME/bin/flume-ng agent --name a1 \
--conf-file ~/conf/flume-taildir-avro.conf \
-Dflume.root.logger=INFO,console &

Run hive commands to generate log entries
Then check the HDFS files, the local files, and the position file
