Flume, a Tool for Collecting Unstructured Data: Data Collection Examples

姠惢荇者

Article link: https://blog.csdn.net/hou_ge/article/details/107870934

1. Introduction to Flume

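Apache Flume is a distributed, reliable, and available system for efficiently collecting, aggregating, and moving large amounts of log data from many different sources to a centralized data store. Its use is not limited to log aggregation. Because data sources are customizable, Flume can transport large volumes of event data, including but not limited to network traffic data, data generated by social media, email messages, and other data sources.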

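Flume delivers events using two independent transactions: one from Source to Channel, and one from Channel to Sink. The File Channel is durable: once an event has been written to it, the event is not lost even if the Agent restarts. Flume also provides a Memory Channel, which has no persistent storage but offers higher throughput than the File Channel.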
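
As a rough illustration of this trade-off, the fragment below declares one channel of each type in Flume's properties format. The agent name a1, the channel names memCh and fileCh, and the two directories are placeholders chosen for this sketch, not values from a real deployment.

```properties
# Sketch only: a1, memCh, fileCh and both paths are hypothetical.
a1.channels = memCh fileCh

# Memory Channel: events sit in an in-memory queue, so they are lost if the agent stops.
a1.channels.memCh.type = memory
a1.channels.memCh.capacity = 1000
a1.channels.memCh.transactionCapacity = 100

# File Channel: events are written to disk and survive an agent restart.
a1.channels.fileCh.type = file
a1.channels.fileCh.checkpointDir = /tmp/flume/checkpoint
a1.channels.fileCh.dataDirs = /tmp/flume/data
```

A source or sink is then bound to whichever channel matches how much data loss the pipeline can tolerate.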

The main components of Flume are Event, Client, Agent, Source, Channel, and Sink.

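Event
An Event is the basic unit of data transfer in Flume; Flume carries data from its origin to its final destination in the form of events.
Agent
An Agent is the basic building block of a Flume flow. It contains a Source, a Channel, a Sink, and other components, and uses them to move Events from one node to another or to the final destination. Flume provides configuration, lifecycle management, and monitoring support for these components.
Source
A Source receives Events and places them, in batches, into one or more Channels. The Spooling Directory Source ingests files that are placed in a spooling directory on disk: it watches the specified directory for new files, parses them as they appear, and sends the data to the Channel. Once a file has been fully read, it is renamed to mark it as completed (or, optionally, deleted). The Exec Source runs a given UNIX command at startup and expects the process to keep producing data on standard output. If the process exits for any reason, the source also exits and produces no further data.
Channel
A Channel sits between the Source and the Sink and buffers Events. After the Sink has successfully sent an Event to the next Agent or to the final destination, the Event is removed from the Channel. The Memory Channel stores Events in an in-memory queue with a configured maximum capacity, so it offers no persistent storage. The File Channel is durable: once an event has been written to the Channel, it is not lost even if the agent restarts, which protects data integrity.
Sink
A Sink delivers Events to the next Agent or to the final destination, and removes them from the Channel once delivery succeeds. Flume ships with many sink types; the two used in this article are the File Roll Sink and the HDFS Sink. The File Roll Sink writes events to the local file system, so the output directory must be created there first. The HDFS Sink writes events to the Hadoop Distributed File System (HDFS). It can roll files periodically based on elapsed time, data size, or number of events, that is, close the current file and create a new one.
Other
Interceptors are attached to a Source and can decorate or filter Events in a specified order. A Sink Group lets users combine multiple Sinks; a Sink Processor can then load-balance across the Sinks in the group, or fail over to another Sink when one of them fails.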
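
To make the last two components more concrete, here is a minimal sketch of how interceptors and a failover sink group are declared. The names r1, k1, k2, g1, i1, and i2 are hypothetical, and the sketch assumes k1 and k2 are fully configured as sinks elsewhere in the same file.

```properties
# Sketch only: all component names are hypothetical; k1 and k2 still need
# their own type and channel settings elsewhere in the file.
a1.sources = r1
a1.sinks = k1 k2
a1.sinkgroups = g1

# Interceptors run in the order listed: add a timestamp header, then a host header.
a1.sources.r1.interceptors = i1 i2
a1.sources.r1.interceptors.i1.type = timestamp
a1.sources.r1.interceptors.i2.type = host

# Failover Sink Processor: the sink with the larger priority value is preferred;
# if it fails, events are sent to the other sink.
a1.sinkgroups.g1.sinks = k1 k2
a1.sinkgroups.g1.processor.type = failover
a1.sinkgroups.g1.processor.priority.k1 = 10
a1.sinkgroups.g1.processor.priority.k2 = 5
# Upper limit, in milliseconds, on the backoff period for a failed sink.
a1.sinkgroups.g1.processor.maxpenalty = 10000
```

Setting processor.type to load_balance instead spreads events across the sinks in the group rather than keeping one as a standby.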

2. Installing Flume

1. Download

wget https://mirrors.bfsu.edu.cn/apache/flume/1.9.0/apache-flume-1.9.0-bin.tar.gz

2. Extract

tar -zxvf apache-flume-1.9.0-bin.tar.gz -C ../servers/

3. Set environment variables

vim /etc/profile

Add the following configuration:

export FLUME_HOME=/export/servers/apache-flume-1.9.0-bin

export PATH=$FLUME_HOME/bin:$PATH

Reload the configuration:

source /etc/profile

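3. Reading and Writing Local Files with Flume

1. Create the directories

# Data source directory
mkdir /export/source
# Output directory
mkdir /export/dist

2. Create the configuration file
The configuration file is the core of a Flume collection job. Here it mainly defines the source directory to collect from and the output directory. Save it as job/job1.conf under the Flume installation directory, which is the path the start command in step 3 expects: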

a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Configure the Source
a1.sources.r1.type = spooldir
a1.sources.r1.spoolDir = /export/source

# Configure the Sink
a1.sinks.k1.type = file_roll
a1.sinks.k1.sink.directory = /export/dist

# Set the Channel type to memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the Source and the Sink to the Channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
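
According to the Flume user guide, the file_roll sink starts a new output file every 30 seconds by default, so /export/dist can fill up with many small files during a test. The optional lines below are a sketch of two settings that may be worth adding to the configuration above; the chosen values are illustrative only.

```properties
# Optional additions to the configuration above; values are illustrative.

# Add a header holding the absolute path of the file each event came from (default: false).
a1.sources.r1.fileHeader = true

# Roll the output file every 60 seconds instead of the default 30; 0 disables rolling
# so that all events are written to a single file.
a1.sinks.k1.sink.rollInterval = 60
```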

3. Start the Flume agent

bin/flume-ng agent -n a1 -c conf -f ./job/job1.conf -Dflume.root.logger=INFO,console

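4. Test
Once the agent has started, create a file named test.txt in the source directory. The console then prints log output for the new file.
[Screenshot: console output after test.txt is picked up]

At this point the file in the source directory has been renamed to test.txt.COMPLETED, which shows that it was read successfully.
[Screenshot: source directory listing showing test.txt.COMPLETED]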

4. HDFS-Based Collection with Flume

1. Write the configuration file

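a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Configure the Source
a1.sources.r1.type = spooldir
a1.sources.r1.spoolDir = /export/source

# Configure the Sink
a1.sinks.k1.type = hdfs
# Store files in separate directories using the %Y-%m-%d/%H%M pattern
a1.sinks.k1.hdfs.path = hdfs://node01:8020/flume/data/%Y-%m-%d/%H%M
# Roll by size only (about 10 MB); 0 disables time-based and count-based rolling
a1.sinks.k1.hdfs.rollInterval = 0
a1.sinks.k1.hdfs.rollSize = 10240000
a1.sinks.k1.hdfs.rollCount = 0
# Close a file after 3 seconds without writes
a1.sinks.k1.hdfs.idleTimeout = 3
# Write plain event data instead of the default SequenceFile
a1.sinks.k1.hdfs.fileType = DataStream
# Round the timestamp used in the path down to 10-minute boundaries
a1.sinks.k1.hdfs.round = true
a1.sinks.k1.hdfs.roundValue = 10
a1.sinks.k1.hdfs.roundUnit = minute
# Use the agent's local time rather than a timestamp header on the event
a1.sinks.k1.hdfs.useLocalTimeStamp = true

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the Source and the Sink to the Channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1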

Set the value of a1.sinks.k1.hdfs.path to match your own environment (NameNode host, port, and target directory).
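
The rounding settings decide which directory an event lands in. With hdfs.round = true, hdfs.roundValue = 10, and hdfs.roundUnit = minute, the timestamp used for the path is rounded down to a multiple of 10 minutes, so an event handled at 16:37 goes into the 1630 subdirectory of that day's folder. The fragment below is a sketch of a variant with 30-minute buckets; the host and port are copied from the configuration above, and the file prefix "events" is a hypothetical choice.

```properties
# Variant sketch: 30-minute buckets instead of 10-minute buckets.
# An event handled at 16:12 lands in .../1600, one handled at 16:37 in .../1630.
a1.sinks.k1.hdfs.path = hdfs://node01:8020/flume/data/%Y-%m-%d/%H%M
a1.sinks.k1.hdfs.round = true
a1.sinks.k1.hdfs.roundValue = 30
a1.sinks.k1.hdfs.roundUnit = minute
a1.sinks.k1.hdfs.useLocalTimeStamp = true

# Prefix for the files Flume creates (default: FlumeData); "events" is hypothetical.
a1.sinks.k1.hdfs.filePrefix = events
```

Without a timestamp header on the event or hdfs.useLocalTimeStamp = true, the sink cannot resolve the time escapes in the path.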

2. Start the Flume agent

bin/flume-ng agent -n a1 -c conf -f ./job/job2.conf -Dflume.root.logger=INFO,console

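3. Test
Once the agent has started, create a file named t1.txt in the source directory. Then open the HDFS web UI, for example http://192.168.1.8:50070/explorer.html#/flume/data, to see the files uploaded to HDFS. Port 50070 is the Hadoop 2.x default for the NameNode web UI; Hadoop 3.x defaults to 9870.
[Screenshot: HDFS file browser showing the uploaded files]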

5. Collecting Data with Flume via the exec Command

1. Write the configuration file

a1.sources = logSource
a1.channels = fileChannel
a1.sinks = hdfsSink

# Set the Source type to exec
a1.sources.logSource.type = exec
# Run tail -F to continuously follow data appended to /export/dist/test.txt
a1.sources.logSource.command = tail -F /export/dist/test.txt

# Bind the Source to fileChannel
a1.sources.logSource.channels = fileChannel

# Configure the Sink as HDFS
a1.sinks.hdfsSink.type = hdfs
# Output path, bucketed by date and by 10-minute window
a1.sinks.hdfsSink.hdfs.path = hdfs://master:8020/flume/record/%Y-%m-%d/%H%M
a1.sinks.hdfsSink.hdfs.filePrefix = transaction_log
# Roll every 600 seconds or every 10000 events; size-based rolling is disabled
a1.sinks.hdfsSink.hdfs.rollInterval = 600
a1.sinks.hdfsSink.hdfs.rollCount = 10000
a1.sinks.hdfsSink.hdfs.rollSize = 0
a1.sinks.hdfsSink.hdfs.round = true
a1.sinks.hdfsSink.hdfs.roundValue = 10
a1.sinks.hdfsSink.hdfs.roundUnit = minute
a1.sinks.hdfsSink.hdfs.fileType = DataStream
a1.sinks.hdfsSink.hdfs.useLocalTimeStamp = true
# Specify the channel the sink should use
a1.sinks.hdfsSink.channel = fileChannel

# Set the Channel type to file, and set the checkpoint directory and the data directory
a1.channels.fileChannel.type = file
a1.channels.fileChannel.checkpointDir = /export/flume/dataCheckpointDir
a1.channels.fileChannel.dataDirs = /export/flume/dataDir
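
As noted in section 1, the exec source stops producing data once its command exits. The optional lines below sketch how the source can be asked to restart the command; they are additions to the configuration above rather than part of the original setup, and the values are illustrative.

```properties
# Optional additions for the exec source above; values are illustrative.

# Restart the tail command if it exits, waiting 10 seconds (value in milliseconds)
# before each restart attempt.
a1.sources.logSource.restart = true
a1.sources.logSource.restartThrottle = 10000

# Also log the command's standard error, which helps when the tailed file is missing.
a1.sources.logSource.logStdErr = true
```

Even with restarts enabled, the exec source does not track how far it has read, so events can be lost or repeated when the command or the agent is restarted.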

2. Start the Flume agent

bin/flume-ng agent -n a1 -c conf -f ./job/job3.conf -Dflume.root.logger=INFO,console
