[Hadoop Study Notes] (4): Flume

wanger61

Last modified 2022-10-28 11:27:04

Category: Big Data Development. Tags: hadoop, learning, flume

First published 2022-10-27 17:24:18

Article link: https://blog.csdn.net/wanger61/article/details/127555541

I. Flume Overview

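Flume is a highly available, highly reliable, distributed system for collecting, aggregating, and transporting large volumes of log data. It lets you plug custom data senders into a logging system to collect data, and it can apply simple processing to that data and write it to a variety of customizable data receivers.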

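1. Basic Flume Concepts

The basic Flume workflow: Flume receives data through a component called a source, passes the data into a channel, and finally writes it to a destination called a sink.

Several basic concepts are involved: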

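Source: receives data. Once it has received enough data, the source creates a Flume event and passes it to the channel the user has configured. Netcat, Avro, and Kafka can all serve as sources.
Channel: the communication and retention mechanism that carries events. Because data arrives from the source at a different rate than it is written to the sink, the channel buffers data between the two. Channel types: 1. memory channel: keeps events in memory; fast, but uses more memory and can lose data. 2. file and JDBC channels: slower, but more reliable.
Sink: takes events from the channel and delivers them to their destination. Sink types: logger, file_roll, HDFS, HBase, Avro (for chaining agents), null (for testing), IRC (for Internet Relay Chat).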

Besides the built-in sources, channels, and sinks, Flume exposes interfaces for all three, so you can implement your own.
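To make the source, channel, sink flow concrete, here is a toy Python model of a single agent. It is only a sketch and not Flume code: the names `source`, `sink`, and `run_agent` are made up for illustration, and a blocking bounded queue stands in for the channel, which simplifies how a real channel behaves when it fills up.

```python
import queue
import threading

_DONE = object()  # sentinel telling the sink that the source has finished


def source(lines, channel):
    """Wrap each incoming line in an event (headers + body) and hand it to the channel."""
    for line in lines:
        channel.put({"headers": {}, "body": line.encode("utf-8")})
    channel.put(_DONE)


def sink(channel, out):
    """Take events off the channel and write their bodies to the destination list."""
    while True:
        event = channel.get()
        if event is _DONE:
            break
        out.append(event["body"].decode("utf-8"))


def run_agent(lines, capacity=2):
    """Run one source and one sink, connected by a bounded channel."""
    channel = queue.Queue(maxsize=capacity)  # buffers events between the two sides
    out = []
    consumer = threading.Thread(target=sink, args=(channel, out))
    consumer.start()
    source(lines, channel)  # blocks whenever the channel is full
    consumer.join()
    return out


if __name__ == "__main__":
    print(run_agent(["hello", "flume", "world"]))  # ['hello', 'flume', 'world']
```

The small `capacity` forces the source to wait for the sink, which is the rate-matching role the channel plays.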

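II. Using Flume

1. Writing Network Data to a Log File

Flume sources, channels, and sinks are set up through a configuration file.

Write the agent1.conf configuration file: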

# Names of the source, sink, and channel
agent1.sources = netsource
agent1.sinks = logsink
agent1.channels = memorychannel

# Source: netcat, listening on localhost:3000
agent1.sources.netsource.type = netcat
agent1.sources.netsource.bind = localhost
agent1.sources.netsource.port = 3000

# Sink: logger
agent1.sinks.logsink.type = logger

# Channel: memory, with its total capacity and per-transaction capacity
agent1.channels.memorychannel.type = memory
agent1.channels.memorychannel.capacity = 1000
agent1.channels.memorychannel.transactionCapacity = 100

# Connect the source and the sink to the channel
# (a source takes "channels", a sink takes a single "channel")
agent1.sources.netsource.channels = memorychannel
agent1.sinks.logsink.channel = memorychannel
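Wiring mistakes are easy to make in these files: channel settings belong under `<agent>.channels.<name>`, a source lists its channels in a plural `channels` property, and a sink names exactly one channel in a singular `channel` property. The Python sketch below checks those rules on a configuration string. `parse_properties` and `check_wiring` are hypothetical helpers written for this note, not part of Flume, and the parser only handles plain `key = value` lines.

```python
def parse_properties(text):
    """Parse simple `key = value` lines, skipping blank lines and # comments."""
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


def check_wiring(props, agent):
    """Return a list of problems in how sources and sinks reference channels."""
    problems = []
    declared = props.get(f"{agent}.channels", "").split()
    for name in props.get(f"{agent}.sources", "").split():
        used = props.get(f"{agent}.sources.{name}.channels", "").split()
        if not used:
            problems.append(f"source {name}: no 'channels' property")
        for ch in used:
            if ch not in declared:
                problems.append(f"source {name}: undeclared channel {ch}")
    for name in props.get(f"{agent}.sinks", "").split():
        ch = props.get(f"{agent}.sinks.{name}.channel")
        if ch is None:
            problems.append(f"sink {name}: no 'channel' property")
        elif ch not in declared:
            problems.append(f"sink {name}: undeclared channel {ch}")
    for ch in declared:
        if f"{agent}.channels.{ch}.type" not in props:
            problems.append(f"channel {ch}: no 'type' property")
    return problems


GOOD = """
agent1.sources = netsource
agent1.sinks = logsink
agent1.channels = memorychannel
agent1.channels.memorychannel.type = memory
agent1.sources.netsource.channels = memorychannel
agent1.sinks.logsink.channel = memorychannel
"""

if __name__ == "__main__":
    print(check_wiring(parse_properties(GOOD), "agent1"))  # []
    bad = GOOD.replace("logsink.channel =", "logsink.channels =")
    print(check_wiring(parse_properties(bad), "agent1"))
```

The second call reports that the sink has no `channel` property, which is what a plural `channels` on a sink amounts to.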

Start the agent

flume-ng agent --conf conf --conf-file agent1.conf  --name agent1

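This command starts an agent, pointing it at the configuration file and at the agent name defined in that file.

Send data over the network

Connect to the host remotely and type in some data: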

curl telnet://localhost:3000

The network data received then shows up in the flume.log file.

2. Writing Remote File Data to a Local File

Write the agent3.conf configuration file:

agent3.sources = avrosource
agent3.sinks = filesink
agent3.channels = jdbcchannel

# Source: Avro, listening on localhost:4000
agent3.sources.avrosource.type = avro
agent3.sources.avrosource.bind = localhost
agent3.sources.avrosource.port = 4000
agent3.sources.avrosource.threads = 5

# Sink: file_roll, writing into the directory below
# (rollInterval = 0 turns off time-based rolling)
agent3.sinks.filesink.type = file_roll
agent3.sinks.filesink.sink.directory = /home/hadoop/flume/files
agent3.sinks.filesink.sink.rollInterval = 0

agent3.channels.jdbcchannel.type = jdbc

agent3.sources.avrosource.channels = jdbcchannel
agent3.sinks.filesink.channel = jdbcchannel

Use the Flume Avro client to send a file to agent3:

flume-ng avro-client -H localhost -p 4000 -F /home/hadoop/message

Avro is a data serialization framework: it packages data and carries it from one place on the network to another. Flume uses both the Avro source and the standalone Avro client.

3. Writing Network Data to HDFS

Write the agent4.conf configuration file:

# Names of the source, sink, and channel
agent4.sources = netsource
agent4.sinks = hdfssink
agent4.channels = memorychannel

# Source: netcat, listening on localhost:3000
agent4.sources.netsource.type = netcat
agent4.sources.netsource.bind = localhost
agent4.sources.netsource.port = 3000

# Sink: HDFS, writing files with prefix "log" under /flume.
# rollInterval = 0 turns off time-based rolling;
# rollCount = 3 starts a new file after every 3 events (lines).
agent4.sinks.hdfssink.type = hdfs
agent4.sinks.hdfssink.hdfs.path = /flume
agent4.sinks.hdfssink.hdfs.filePrefix = log
agent4.sinks.hdfssink.hdfs.rollInterval = 0
agent4.sinks.hdfssink.hdfs.rollCount = 3
agent4.sinks.hdfssink.hdfs.fileType = DataStream

# Channel: memory, with its total capacity and per-transaction capacity
agent4.channels.memorychannel.type = memory
agent4.channels.memorychannel.capacity = 1000
agent4.channels.memorychannel.transactionCapacity = 100

# Connect the source and the sink to the channel
agent4.sources.netsource.channels = memorychannel
agent4.sinks.hdfssink.channel = memorychannel

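After sending data over the network, the collected files appear in the output directory (a .tmp suffix marks a file that is still being written):

hadoop fs -ls /flume
hadoop fs -cat "/flume/*"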
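The effect of `hdfs.rollCount = 3` can be pictured with a few lines of Python. This is a simplified sketch, not the sink's actual logic: the function name `roll_by_count` is made up, and it ignores every other roll trigger the HDFS sink has, such as `hdfs.rollSize`.

```python
def roll_by_count(events, roll_count=3):
    """Group events into files, starting a new file every `roll_count` events."""
    files = []
    for start in range(0, len(events), roll_count):
        files.append(events[start:start + roll_count])
    return files


if __name__ == "__main__":
    lines = ["a", "b", "c", "d", "e", "f", "g"]
    for index, content in enumerate(roll_by_count(lines)):
        print(index, content)
```

With seven input lines the last group holds a single event; in the real sink that group corresponds to the file still open for writing, which is the one shown with a .tmp suffix.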

III. Other Flume Features

1. Interceptors

agent5.sources.netsource.interceptors = ts
agent5.sources.netsource.interceptors.ts.type = org.apache.flume.interceptor.TimestampInterceptor$Builder

agent5.sinks.hdfssink.hdfs.path = /flume-%Y-%m-%d

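The configuration above sets up Flume's built-in timestamp interceptor, which records each event's timestamp in a header. The HDFS sink path can then use that timestamp to partition the data by date, which simplifies data management and makes the data easier for MapReduce jobs to consume.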
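As a rough picture of what happens to each event, the Python sketch below stamps an event with a `timestamp` header holding epoch milliseconds and then expands the `%Y-%m-%d` escapes in the sink path from that header. The function names are made up for illustration, and the sketch formats the date in UTC to keep the result deterministic; that is an assumption of this example, not a statement about which time zone a real deployment uses.

```python
from datetime import datetime, timezone


def add_timestamp(event, now_ms):
    """Mimic the timestamp interceptor: store epoch milliseconds in a header."""
    event["headers"]["timestamp"] = str(now_ms)
    return event


def resolve_path(template, event):
    """Expand strftime-style escapes in the path from the event's timestamp header."""
    millis = int(event["headers"]["timestamp"])
    when = datetime.fromtimestamp(millis / 1000, tz=timezone.utc)  # UTC is an assumption here
    return when.strftime(template)


if __name__ == "__main__":
    event = {"headers": {}, "body": b"hello"}
    add_timestamp(event, 1666872000000)  # 2022-10-27 12:00:00 UTC
    print(resolve_path("/flume-%Y-%m-%d", event))  # /flume-2022-10-27
```

Events stamped on different days therefore resolve to different directories, which is how the data ends up partitioned by date.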

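2. Multi-Tier Flume Networks

One Flume agent can serve as the destination of another: an Avro sink on the first agent sends events to the Avro source of the second. Chaining agents this way builds a multi-tier Flume network.

Write agent6.conf: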

agent6.sources = avrosource
agent6.sinks = avrosink
agent6.channels = memorychannel

# Source: Avro, listening on localhost:2000
agent6.sources.avrosource.type = avro
agent6.sources.avrosource.bind = localhost
agent6.sources.avrosource.port = 2000
agent6.sources.avrosource.threads = 5

# Sink: Avro, forwarding to the Avro source of agent3 on localhost:4000
agent6.sinks.avrosink.type = avro
agent6.sinks.avrosink.hostname = localhost
agent6.sinks.avrosink.port = 4000

# Channel: memory, with its total capacity and per-transaction capacity
agent6.channels.memorychannel.type = memory
agent6.channels.memorychannel.capacity = 1000
agent6.channels.memorychannel.transactionCapacity = 100

agent6.sources.avrosource.channels = memorychannel
agent6.sinks.avrosink.channel = memorychannel

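Start both agents (agent3 from the earlier example, and agent6):

flume-ng agent --conf conf --conf-file agent3.conf --name agent3
flume-ng agent --conf conf --conf-file agent6.conf --name agent6

Then send a file with the Avro client. Port 4000 reaches agent3 directly; port 2000 goes through agent6:

flume-ng avro-client -H localhost -p 4000 -F /home/hadoop/message
flume-ng avro-client -H localhost -p 2000 -F /home/hadoop/message

A file sent to port 2000 travels through agent6 to agent3, so the two agents form a data pipeline.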

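3. Multiple Sources and Multiple Sinks

A single agent can read from multiple sources and write to multiple sinks. The example below fans one source out to two sinks: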

agent7.sources = netsource
agent7.sinks = hdfssink filesink
agent7.channels = memorychannel1 memorychannel2

# Source: netcat, listening on localhost:3000
agent7.sources.netsource.type = netcat
agent7.sources.netsource.bind = localhost
agent7.sources.netsource.port = 3000

# HDFS sink: files with prefix "log" under /flume, 3 events (lines) per file
agent7.sinks.hdfssink.type = hdfs
agent7.sinks.hdfssink.hdfs.path = /flume
agent7.sinks.hdfssink.hdfs.filePrefix = log
agent7.sinks.hdfssink.hdfs.rollInterval = 0
agent7.sinks.hdfssink.hdfs.rollCount = 3
agent7.sinks.hdfssink.hdfs.fileType = DataStream

# file_roll sink: writes into the directory below
agent7.sinks.filesink.type = file_roll
agent7.sinks.filesink.sink.directory = /home/hadoop/flume/files
agent7.sinks.filesink.sink.rollInterval = 0

# Two memory channels, one per sink
agent7.channels.memorychannel1.type = memory
agent7.channels.memorychannel1.capacity = 1000
agent7.channels.memorychannel1.transactionCapacity = 100

agent7.channels.memorychannel2.type = memory
agent7.channels.memorychannel2.capacity = 1000
agent7.channels.memorychannel2.transactionCapacity = 100

# The source writes to both channels; each sink reads from its own
agent7.sources.netsource.channels = memorychannel1 memorychannel2
agent7.sinks.hdfssink.channel = memorychannel1
agent7.sinks.filesink.channel = memorychannel2

# Channel selector for the source
agent7.sources.netsource.selector.type = replicating

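Here the source's channel selector is set to replicating mode, in which every event is sent to both channels.
A multiplexing selector instead decides which channel receives an event from the value of a specific header field on that event.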
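The difference between the two selector modes can be sketched in Python. This is a toy model rather than Flume's implementation: the function names, the `type` header, and the mapping from header values to channels are all invented for the example.

```python
def replicating(event, channels):
    """Replicating selector: every event goes to every channel."""
    return list(channels)


def multiplexing(event, header, mapping, default):
    """Multiplexing selector: choose channels from the value of one header."""
    value = event["headers"].get(header)
    return mapping.get(value, default)


def dispatch(events, select, channel_names):
    """Deliver each event body to the channels chosen by `select`."""
    queues = {name: [] for name in channel_names}
    for event in events:
        for name in select(event):
            queues[name].append(event["body"])
    return queues


if __name__ == "__main__":
    names = ["memorychannel1", "memorychannel2"]
    events = [
        {"headers": {"type": "error"}, "body": "disk full"},
        {"headers": {"type": "info"}, "body": "started"},
        {"headers": {}, "body": "no header"},
    ]
    print(dispatch(events, lambda e: replicating(e, names), names))
    # Hypothetical routing: errors to channel 1, info to channel 2, the rest to channel 2.
    mapping = {"error": ["memorychannel1"], "info": ["memorychannel2"]}
    print(dispatch(events, lambda e: multiplexing(e, "type", mapping, ["memorychannel2"]), names))
```

In replicating mode both channels receive all three events; in multiplexing mode each event lands only in the channel its header value maps to, with a default for events that carry no matching value.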
