Flume Getting-Started Tutorial (Very Detailed)

西门催学不吹雪

Category: Big Data # Flume. Tags: flume

Article link: https://blog.csdn.net/weixin_42837961/article/details/104533147

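Contents

1. Flume Overview
2. Installing Flume
- 2.1 Download Location
- 2.2 Installation Steps
3. Flume Getting-Started Examples
4. Advanced Flume
5. Flume Enterprise Development Examples
6. Custom Flume Components
7. Flume Data Flow Monitoring
- 7.1 Installing and Deploying Ganglia
- 7.2 Operating Flume to Test Monitoring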

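1. Flume Overview

1.1 What Is Flume

Flume is a highly available, highly reliable, distributed system for collecting, aggregating, and transporting massive volumes of log data. It was originally developed by Cloudera and is now an Apache project. Flume is built on a streaming architecture and is simple and flexible.
Its most common use is to read data from a server's local disk in real time and write that data to HDFS.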

1.2 Flume Basic Architecture

1.2.1 Agent

An Agent is a JVM process that moves data from a source to a destination in the form of events.
An Agent has three main components: Source, Channel, and Sink.

1.2.2 Source

The Source is the component that receives data into the Flume Agent. It can handle log data of many types and formats, including avro, thrift, exec, jms, spooling directory, netcat, sequence generator, syslog, http, and legacy.

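1.2.3 Sink

The Sink continuously polls the Channel for events, removes them in batches, and either writes each batch to a storage or indexing system or forwards it to another Flume Agent.
Sink destinations include hdfs, logger, avro, thrift, file, HBase, solr, and custom sinks.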

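1.2.4 Channel

The Channel is a buffer that sits between the Source and the Sink, which lets the two run at different rates. The Channel is thread-safe and can handle writes from several Sources and reads from several Sinks at the same time.
The Channels most commonly used in Flume are Memory Channel and File Channel.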

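1.2.5 Event

The Event is the basic unit of data transfer in Flume: data travels from the source to the destination as Events. An Event consists of a Header and a Body. The Header stores the Event's attributes as key-value pairs; the Body stores the data itself as a byte array.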

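2. Installing Flume

2.1 Download Location

Flume official website

Flume 1.9.0 official documentation

2.2 Installation Steps

Upload the installation package apache-flume-1.9.0-bin.tar.gz to the Linux system.
Extract the package into the target directory: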

tar -zxvf apache-flume-1.9.0-bin.tar.gz -C /opt/module/

Rename the extracted directory (run this in the directory you extracted into):

mv apache-flume-1.9.0-bin flume

In the flume/conf directory, rename flume-env.sh.template to flume-env.sh.

mv flume-env.sh.template flume-env.sh

Edit flume-env.sh and set JAVA_HOME to the JDK path on the Linux system.

export JAVA_HOME=/usr/local/java/jdk1.8.0_151
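
As a quick sanity check after the steps above, something like the following could confirm that the launcher script is in place. It is only a sketch: FLUME_HOME is a variable introduced here for illustration, and its default assumes the /opt/module/flume location used later in this tutorial.

```shell
# Hypothetical helper variable; adjust it if Flume was extracted elsewhere.
FLUME_HOME="${FLUME_HOME:-/opt/module/flume}"

if [ -x "$FLUME_HOME/bin/flume-ng" ]; then
    # Print the installed Flume version.
    "$FLUME_HOME/bin/flume-ng" version
else
    echo "flume-ng not found under $FLUME_HOME" >&2
fi
```

If the version banner appears, the package was extracted correctly and flume-env.sh points at a usable JDK.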

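3. Flume Getting-Started Examples

3.1 Monitoring Port Data

3.1.1 Requirements

Use Flume to listen on a port, collect the data sent to that port, and print it to the console.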

3.1.2 Analysis

3.1.3 Implementation Steps

Install the netcat tool.

yum install -y nc

Create the Flume Agent configuration file flume-netcat-logger.conf.

(1) Create a job folder under the flume directory and enter it.

mkdir job
cd job/

(2) In the job folder, create the Flume Agent configuration file flume-netcat-logger.conf.

vim flume-netcat-logger.conf

(3) Add the following content to the configuration file:

# example.conf: A single-node Flume configuration
# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444

# Describe the sink
a1.sinks.k1.type = logger

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

Note: a1 is the name of the agent.

Start Flume so that it listens on the port.

Form 1 (long options):

bin/flume-ng agent --conf conf --conf-file job/flume-netcat-logger.conf --name a1 -Dflume.root.logger=INFO,console

Form 2 (short options):

bin/flume-ng agent -c conf -f job/flume-netcat-logger.conf -n a1 -Dflume.root.logger=INFO,console

Use netcat to send content to port 44444 on the local machine:

nc localhost 44444

Watch the Flume console to see the data being received.

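3.2 Monitoring a Single Appended File

3.2.1 Requirements

Monitor the Hive log in real time and print it to the console.
Monitor the Hive log in real time and write it to HDFS.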

3.2.2 Analysis

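Note: the Hive log is a local file on the Linux host, so it has to be read with a Linux command. The source type is therefore exec (execute): the source runs a Linux command (here tail -F) and reads that command's output.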

3.2.3 Implementation Steps

(I) Output to the console

Create the flume-file-logger.conf file.

vim flume-file-logger.conf

Configure the file as follows.

# Name the components on this agent
a2.sources = r2
a2.sinks = k2
a2.channels = c2

# Describe/configure the source
a2.sources.r2.type = exec
a2.sources.r2.command = tail -F /hadoop/hive-2.3.6/logs/hive.log


# Describe the sink
a2.sinks.k2.type = logger

# Use a channel which buffers events in memory
a2.channels.c2.type = memory
a2.channels.c2.capacity = 1000
a2.channels.c2.transactionCapacity = 100

# Bind the source and sink to the channel
a2.sources.r2.channels = c2
a2.sinks.k2.channel = c2

Run Flume.

bin/flume-ng agent -c conf/ -f job/flume-file-logger.conf -n a2 -Dflume.root.logger=INFO,console

Start Hive on Hadoop and run some Hive operations to generate log entries (for example: show databases;).
Check the data on the console.

(II) Output to HDFS

Create the flume-file-hdfs.conf file.

vim flume-file-hdfs.conf

Configure the file as follows.

# Name the components on this agent
a2.sources = r2
a2.sinks = k2
a2.channels = c2

# Describe/configure the source
a2.sources.r2.type = exec
a2.sources.r2.command = tail -F /hadoop/hive-2.3.6/logs/hive.log


# Describe the sink
a2.sinks.k2.type = hdfs
a2.sinks.k2.hdfs.path = hdfs://master:9000/flume/%Y%m%d/%H
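# Prefix for uploaded files
a2.sinks.k2.hdfs.filePrefix = logs-
# Whether to round the timestamp down, which controls how often a new directory is created
a2.sinks.k2.hdfs.round = true
# Number of time units per new directory
a2.sinks.k2.hdfs.roundValue = 1
# Time unit used for rounding
a2.sinks.k2.hdfs.roundUnit = hour
# Use the local timestamp instead of one taken from the event header
a2.sinks.k2.hdfs.useLocalTimeStamp = true
# Number of Events to accumulate before flushing to HDFS
a2.sinks.k2.hdfs.batchSize = 10
# File type; DataStream writes uncompressed data (CompressedStream enables compression)
a2.sinks.k2.hdfs.fileType = DataStream
# Interval in seconds before rolling to a new file
a2.sinks.k2.hdfs.rollInterval = 30
# File size in bytes that triggers a roll (just under 128 MB)
a2.sinks.k2.hdfs.rollSize = 134217700
# 0 means rolling does not depend on the number of Events
a2.sinks.k2.hdfs.rollCount = 0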

# Use a channel which buffers events in memory
a2.channels.c2.type = memory
a2.channels.c2.capacity = 1000
a2.channels.c2.transactionCapacity = 100

# Bind the source and sink to the channel
a2.sources.r2.channels = c2
a2.sinks.k2.channel = c2
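
The rollSize value in the configuration above is easier to read next to the size of a 128 MiB HDFS block. The arithmetic below is a small illustration that assumes the common 128 MiB block size; the variable names are made up for this sketch.

```shell
# 128 MiB expressed in bytes.
block_bytes=$((128 * 1024 * 1024))
# The value used for hdfs.rollSize in the configuration.
roll_size=134217700

echo "128 MiB block : $block_bytes bytes"
echo "hdfs.rollSize : $roll_size bytes"
echo "difference    : $((block_bytes - roll_size)) bytes"
```

The configured value is 28 bytes short of 128 MiB, so a file is rolled just before it would fill a block of that size.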

Run Flume.

bin/flume-ng agent -c conf/ -f job/flume-file-hdfs.conf -n a2

Start Hive on Hadoop and run some Hive operations to generate log entries (for example: show databases;).
Check the files on HDFS.
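
One way to find the files is to rebuild the directory name that the sink derives from hdfs.path. The sketch below assumes the /flume/%Y%m%d/%H layout from the configuration and that the hdfs client is on the PATH of the machine where it runs; flume_dir is a name introduced only for this example.

```shell
# Directory the sink writes to for the current hour, mirroring
# hdfs.path = .../flume/%Y%m%d/%H with useLocalTimeStamp = true.
flume_dir="/flume/$(date +%Y%m%d/%H)"
echo "Expected HDFS directory: $flume_dir"

# List the directory only if the hdfs client is available on this machine.
if command -v hdfs >/dev/null 2>&1; then
    hdfs dfs -ls "$flume_dir" || echo "Directory not found yet: $flume_dir"
fi
```

Files that are still being written carry a temporary suffix until the sink rolls them.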

3.3 Monitoring Multiple New Files in a Directory

3.3.1 Requirements

Use Flume to watch an entire directory for files and upload them to HDFS.

3.3.2 Analysis

3.3.3 Implementation Steps

Create the configuration file flume-dir-hdfs.conf.

vim flume-dir-hdfs.conf

Configure the file as follows.

# Name the components on this agent
a3.sources = r3
a3.sinks = k3
a3.channels = c3

# Describe/configure the source
a3.sources.r3.type = spooldir
a3.sources.r3.spoolDir = /opt/module/flume/upload
a3.sources.r3.fileSuffix = .COMPLETED
a3.sources.r3.fileHeader = true
# Ignore all files ending in .tmp; do not upload them
a3.sources.r3.ignorePattern = ([^ ]*\.tmp)

# Describe the sink
a3.sinks.k3.type = hdfs
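a3.sinks.k3.hdfs.path = hdfs://master:9000/flume/%Y%m%d/%H
# Prefix for uploaded files
a3.sinks.k3.hdfs.filePrefix = upload-
# Whether to round the timestamp down, which controls how often a new directory is created
a3.sinks.k3.hdfs.round = true
# Number of time units per new directory
a3.sinks.k3.hdfs.roundValue = 1
# Time unit used for rounding
a3.sinks.k3.hdfs.roundUnit = hour
# Use the local timestamp instead of one taken from the event header
a3.sinks.k3.hdfs.useLocalTimeStamp = true
# Number of Events to accumulate before flushing to HDFS
a3.sinks.k3.hdfs.batchSize = 10
# File type; DataStream writes uncompressed data (CompressedStream enables compression)
a3.sinks.k3.hdfs.fileType = DataStream
# Interval in seconds before rolling to a new file
a3.sinks.k3.hdfs.rollInterval = 60
# File size in bytes that triggers a roll, roughly 128 MB
a3.sinks.k3.hdfs.rollSize = 134217700
# 0 means rolling does not depend on the number of Events
a3.sinks.k3.hdfs.rollCount = 0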

# Use a channel which buffers events in memory
a3.channels.c3.type = memory
a3.channels.c3.capacity = 1000
a3.channels.c3.transactionCapacity = 100

# Bind the source and sink to the channel
a3.sources.r3.channels = c3
a3.sinks.k3.channel = c3
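
To see which file names the ignorePattern in the configuration above would skip, the same expression can be tried with grep. This is an approximation rather than Flume's own matching: Flume evaluates the pattern as a Java regular expression, while the sketch uses POSIX extended syntax and assumes the pattern has to match the whole file name (hence grep -x). The helper function and the sample file names are invented for the example.

```shell
# Same expression as the ignorePattern setting.
ignore_pattern='([^ ]*\.tmp)'

# Print the given names that the pattern matches in full (-x).
matches_ignore() {
    printf '%s\n' "$@" | grep -Ex "$ignore_pattern"
}

matches_ignore 1.txt draft.tmp notes.tmp.bak
```

Of the three sample names only draft.tmp is printed, so under these assumptions it is the only one the source would leave alone.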

Start the agent that monitors the folder.

bin/flume-ng agent -c conf -f job/flume-dir-hdfs.conf -n a3

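Add files to the upload folder.

(1) Create the upload folder under /opt/module/flume/. The spooling directory source expects this directory to exist, so create it before starting the agent.

mkdir upload

(2) Add files to the upload folder.

touch upload/1.txt
touch upload/2.txt
touch upload/3.txt

Check the data on HDFS.
Check the upload folder again: files that Flume has finished reading are renamed with the .COMPLETED suffix.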
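
The spooling directory source expects a file to stay unchanged once it is in the watched folder, so a common habit is to finish writing the file somewhere else and then move it in. The sketch below shows that pattern with throwaway directories created by mktemp; the directory variables and the file name app.log are placeholders, not part of the tutorial's setup.

```shell
# Throwaway directories standing in for a staging area and the spool directory.
staging_dir="$(mktemp -d)"
spool_dir="$(mktemp -d)"

# Write the file completely in the staging area first.
printf 'line 1\nline 2\n' > "$staging_dir/app.log"

# Then move the finished file into the spool directory in one step.
mv "$staging_dir/app.log" "$spool_dir/app.log"

ls "$spool_dir"
```

In a real setup spool_dir would be the directory named by spoolDir, for example /opt/module/flume/upload.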

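3.4 Monitoring Multiple Appended Files in a Directory

Exec Source suits monitoring a single file that is appended to in real time, but it cannot guarantee that no data is lost. Spooldir Source guarantees that no data is lost and can resume from where it stopped, but its latency is high, so it cannot monitor in real time. Taildir Source can resume from where it stopped, guarantees that no data is lost, and also monitors in real time.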
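
For orientation, a Taildir agent could be configured along the lines below. This is a minimal sketch and not the configuration from the original article: the agent name a4, the file name flume-taildir-logger.conf, the position file path, and the watched directory are all assumptions made for illustration. The properties type = TAILDIR, positionFile, and filegroups are the ones documented for the Taildir source.

```shell
# Hypothetical file name, chosen only for this sketch.
conf_file="flume-taildir-logger.conf"

cat > "$conf_file" <<'EOF'
# Name the components on this agent
a4.sources = r4
a4.sinks = k4
a4.channels = c4

# Taildir source: tail every file that matches the group pattern
a4.sources.r4.type = TAILDIR
# Read offsets are stored here so the agent can resume after a restart
a4.sources.r4.positionFile = /opt/module/flume/taildir_position.json
a4.sources.r4.filegroups = f1
a4.sources.r4.filegroups.f1 = /opt/module/flume/files/.*log

# Print events to the console
a4.sinks.k4.type = logger

# Buffer events in memory
a4.channels.c4.type = memory
a4.channels.c4.capacity = 1000
a4.channels.c4.transactionCapacity = 100

# Bind the source and sink to the channel
a4.sources.r4.channels = c4
a4.sinks.k4.channel = c4
EOF

echo "Wrote $conf_file"
```

With the file placed in the job folder, it could be started in the same way as the earlier examples: bin/flume-ng agent -c conf -f job/flume-taildir-logger.conf -n a4 -Dflume.root.logger=INFO,console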

3.4.1 Requirements
