Flume Log Collection

复姓独孤

Article link: https://blog.csdn.net/weixin_45077780/article/details/107920951

1. Flume Overview

1.1 Flume

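Flume is a highly available, highly reliable, distributed system for collecting, aggregating, and transporting massive volumes of log data. It was originally developed by Cloudera and is now an Apache project. Flume is built on a streaming architecture and is simple and flexible.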

1.2 Flume Basic Architecture

1.2.1 Agent

An Agent is a JVM process that moves data from its origin to its destination in the form of events.
An Agent consists of three main parts: Source, Channel, and Sink.

1.2.2 Source

A Source is the component that receives data into a Flume Agent. Sources can handle log data of many types and formats, including avro, thrift, exec, jms, spooling directory, netcat, sequence generator, syslog, http, and legacy.

1.2.3 Sink

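A Sink continuously polls the Channel for events, removes them in batches, and writes those batches to a storage or indexing system or forwards them to another Flume Agent.
Sink destinations include hdfs, logger, avro, thrift, ipc, file, HBase, solr, and custom sinks.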

1.2.4 Channel

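A Channel is the buffer that sits between a Source and a Sink, which lets the Source and the Sink run at different rates. A Channel is thread-safe and can serve writes from several Sources and reads from several Sinks at the same time.
Flume ships with several Channel types, including Memory Channel, File Channel, and Kafka Channel.
Memory Channel is an in-memory queue. It is suitable when losing data is acceptable. If data loss matters, Memory Channel should not be used, because a process crash, a machine failure, or a restart will lose whatever is still in the queue.
File Channel writes every event to disk, so data is not lost when the process stops or the machine goes down.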
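
To make the buffering role concrete, here is a minimal Python sketch that models a memory channel as a bounded in-memory queue. The class `MemoryChannelSketch` and its methods are hypothetical names made up for illustration and are not Flume's API; only the idea of a fixed capacity comes from the channel settings used later in this article.

```python
from collections import deque


class MemoryChannelSketch:
    """Toy model of a memory channel: a bounded FIFO queue held in RAM."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.queue = deque()

    def put(self, event):
        # A full channel rejects the event, so the source has to retry later.
        if len(self.queue) >= self.capacity:
            return False
        self.queue.append(event)
        return True

    def take_batch(self, max_events):
        # The sink drains events in batches at its own pace.
        batch = []
        while self.queue and len(batch) < max_events:
            batch.append(self.queue.popleft())
        return batch


channel = MemoryChannelSketch(capacity=3)
accepted = [channel.put(f"line {i}") for i in range(5)]  # a fast source
drained = channel.take_batch(2)                          # a slower sink
```

Anything still sitting in `queue` disappears when the process exits, which is the trade-off between Memory Channel and File Channel described above.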

1.2.5 Event

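The Event is the basic unit of data transfer in Flume: data travels from origin to destination as Events. An Event consists of a Header and a Body. The Header stores attributes of the event as key-value pairs, and the Body stores the data itself as a byte array.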
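
The Header and Body structure can be pictured with a small Python sketch. `EventSketch` is a hypothetical stand-in rather than Flume's own Event class, and the header key `file` and its value are made-up examples.

```python
from dataclasses import dataclass, field


@dataclass
class EventSketch:
    """Toy model of a Flume event: key-value headers plus a byte-array body."""
    body: bytes
    headers: dict = field(default_factory=dict)


line = "hello flume"
# The payload is carried as raw bytes; the headers describe it.
event = EventSketch(body=line.encode("utf-8"), headers={"file": "/tmp/app.log"})
decoded = event.body.decode("utf-8")
```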

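2. Flume Quick Start

2.1 Flume Installation and Deployment

2.1.1 Installation Resources

1) Flume official website.

2) Documentation.

3) Download page.

2.1.2 Installation Steps

1) Upload apache-flume-1.7.0-bin.tar.gz to the /opt/software directory on the Linux host.

2) Extract apache-flume-1.7.0-bin.tar.gz into /opt/module/: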

tar -zxf apache-flume-1.7.0-bin.tar.gz -C /opt/module/

3) Optionally rename the directory to flume:

mv apache-flume-1.7.0-bin flume

4) In flume/conf, rename flume-env.sh.template to flume-env.sh, then edit flume-env.sh:

mv flume-env.sh.template flume-env.sh

Set JAVA_HOME in flume-env.sh:

export JAVA_HOME=/opt/module/jdk1.8.0_261

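2.2 Flume Getting-Started Examples

2.2.1 Official Example: Monitoring Port Data

1) Requirement:
Use Flume to listen on a port, collect the data sent to that port, and print it to the console.
2) Requirements analysis:
[Figure: requirements analysis diagram]

3) Implementation steps:
1. Install the netcat tool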

[liuyongjun@hadoop102 software]$ sudo yum install -y nc

2. Test netcat:
On the hadoop102 terminal, run:

nc -lk 44444

hadoop102 acts as the server.
On the hadoop103 terminal, run:

nc hadoop102 44444

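hadoop103 acts as the client.
Type hello in this terminal and it appears on the hadoop102 terminal.

The connection works in both directions: text typed on hadoop102 also appears on hadoop103.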

3. Create the Flume Agent configuration file flume-netcat-logger.conf
Create a job folder under the flume directory and enter it.


[liuyongjun@hadoop102 flume]$ mkdir job
[liuyongjun@hadoop102 flume]$ cd job/

In the job folder, create the Flume Agent configuration file flume-netcat-logger.conf.


[liuyongjun@hadoop102 job]$ vim flume-netcat-logger.conf

Add the following content to flume-netcat-logger.conf:

# example.conf: A single-node Flume configuration

# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444

# Describe the sink
a1.sinks.k1.type = logger

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

The configuration above is taken from the official user guide.

4. Start Flume first so that it listens on the port
There are two equivalent ways to write the command.
With long options:

bin/flume-ng agent --conf conf/ --name a1  --conf-file job/flume-netcat-logger.conf  -Dflume.root.logger=INFO,console

With short options:

bin/flume-ng agent -c conf/ -n a1 -f job/flume-netcat-logger.conf -Dflume.root.logger=INFO,console

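Parameters:

--conf/-c: the directory that holds Flume's own configuration files (conf/ here).
--name/-n: the name of the agent to start, a1 here; it must match the agent name used in the configuration file.
--conf-file/-f: the configuration file to load for this run, here flume-netcat-logger.conf in the job folder.
-Dflume.root.logger=INFO,console: -D overrides the flume.root.logger property at run time, setting the log level to INFO and sending log output to the console. Log levels include debug, info, warn, and error.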

5. Use netcat to send content to port 44444 on the local machine
Open a second terminal session on hadoop102 and run:

[liuyongjun@hadoop102 ~]$ nc localhost 44444

On my machine I use the ncat command in place of nc; it behaves the same way.
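
If netcat is not available, the same check can be done from a script. The following is a minimal sketch: `send_line` is a hypothetical helper, and the host and port in the commented call are the values from the configuration in this section. The netcat source normally acknowledges each line with OK, so the helper returns whatever reply comes back.

```python
import socket


def send_line(host, port, text, timeout=5.0):
    """Send one newline-terminated line and return the peer's reply, if any."""
    with socket.create_connection((host, port), timeout=timeout) as conn:
        conn.sendall((text + "\n").encode("utf-8"))
        try:
            return conn.recv(1024).decode("utf-8").strip()
        except socket.timeout:
            return ""  # the peer accepted the line but sent nothing back


# Example call against the agent started above (requires it to be running):
# print(send_line("localhost", 44444, "hello"))
```

The line that was sent should then show up as an Event in the agent's console output.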

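2.2.2 Monitoring a Single Appended File in Real Time

1) Requirement: monitor the Hive log in real time and upload it to HDFS.
2) Requirements analysis:
[Figure: requirements analysis diagram]
Note: log levels

Log levels
1. Severity from low to high: DEBUG < INFO < WARN < ERROR < FATAL.
If the configured level is L, a message at level P is printed only when P >= L.
For example, if L is set to INFO, only messages at INFO, WARN, ERROR, and FATAL are printed.
2. Differences:
DEBUG is the lowest level. It gives a detailed view of how the program is running and can be used anywhere.
INFO carries important output: it reports the current state of the system to the user and helps locate problems.
The remaining three (WARN, ERROR, FATAL) all mean that an abnormal state was detected at run time.
WARN: recoverable; the system can keep running.
ERROR: possibly recoverable, but there is no guarantee that the system will keep working correctly.
FATAL: severe; the error is certainly unrecoverable, and letting the system keep running would have serious consequences.
3. Usage
When should INFO, WARN, and ERROR be used?
INFO prints the normal status information a program is expected to produce, which makes tracing easier.
WARN indicates a minor irregularity that does not affect operation or use.
ERROR indicates a system error or exception that prevents the intended operation from completing.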
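
The threshold rule (a message at level P is printed only when P >= L) can be written as a short Python sketch. The `LEVELS` list and the `is_logged` function are illustrative names and do not belong to any logging library.

```python
LEVELS = ["DEBUG", "INFO", "WARN", "ERROR", "FATAL"]  # lowest to highest


def is_logged(message_level, configured_level):
    """A message at level P is printed only when P >= L."""
    return LEVELS.index(message_level) >= LEVELS.index(configured_level)


# With the logger set to INFO, DEBUG messages are filtered out.
printed = [p for p in LEVELS if is_logged(p, "INFO")]
```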

3) Implementation steps:
1. To write data to HDFS, Flume needs the relevant Hadoop jars.
Copy the following jars

commons-configuration-1.6.jar
hadoop-auth-2.7.2.jar
hadoop-common-2.7.2.jar
hadoop-hdfs-2.7.2.jar
commons-io-2.4.jar
htrace-core-3.1.0-incubating.jar

into the /opt/module/flume/lib folder.
2. Create the flume-file-hdfs.conf file
Create the file:

touch flume-file-hdfs.conf

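Note: reading a file on a Linux system means running a Linux command. Because the Hive log lives on the Linux file system, the source type is exec (short for execute), which runs a Linux command to read the file.

Add the following content to the file: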

# Name the components on this agent
a2.sources = r2
a2.sinks = k2
a2.channels = c2

# Describe/configure the source
a2.sources.r2.type = exec
a2.sources.r2.command = tail -F /opt/module/hive/logs/hive.log
a2.sources.r2.shell = /bin/bash -c
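# Describe the sink
a2.sinks.k2.type = hdfs
a2.sinks.k2.hdfs.path = hdfs://hadoop102:9000/flume/%Y%m%d/%H
# Prefix for uploaded files
a2.sinks.k2.hdfs.filePrefix = logs-
# Whether to roll directories by time (round the timestamp down)
a2.sinks.k2.hdfs.round = true
# How many time units per new directory
a2.sinks.k2.hdfs.roundValue = 1
# The time unit used for rounding
a2.sinks.k2.hdfs.roundUnit = hour
# Whether to use the local timestamp
a2.sinks.k2.hdfs.useLocalTimeStamp = true
# Number of events to accumulate before flushing to HDFS
# (kept within the channel's transactionCapacity of 100)
a2.sinks.k2.hdfs.batchSize = 100
# File type; DataStream writes uncompressed data, compression is also supported
a2.sinks.k2.hdfs.fileType = DataStream
# Seconds before rolling to a new file
a2.sinks.k2.hdfs.rollInterval = 30
# File size in bytes that triggers a roll (roughly 128 MB)
a2.sinks.k2.hdfs.rollSize = 134217700
# 0 means rolling does not depend on the number of events
a2.sinks.k2.hdfs.rollCount = 0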
# Use a channel which buffers events in memory
a2.channels.c2.type = memory
a2.channels.c2.capacity = 1000
a2.channels.c2.transactionCapacity = 100
# Bind the source and sink to the channel
a2.sources.r2.channels = c2
a2.sinks.k2.channel = c2
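
To see how the `%Y%m%d/%H` escapes in hdfs.path interact with the round, roundValue, and roundUnit settings, here is a minimal Python sketch. `bucket_path` is a hypothetical helper that only imitates the rounding for the hour unit, and the timestamp is a made-up example.

```python
from datetime import datetime


def bucket_path(ts, round_value=1):
    """Round the timestamp down to a multiple of `round_value` hours, then
    expand the %Y%m%d/%H escapes used in hdfs.path."""
    rounded = ts.replace(hour=ts.hour - ts.hour % round_value,
                         minute=0, second=0, microsecond=0)
    return rounded.strftime("/flume/%Y%m%d/%H")


stamp = datetime(2020, 8, 10, 14, 37, 5)
hourly = bucket_path(stamp)                    # one directory per hour
six_hourly = bucket_path(stamp, round_value=6)  # one directory per six hours
```

With roundValue = 1 and roundUnit = hour the rounding does not change the `%H` part, because the path is already hour-granular; a larger value such as 6 would group several hours into one directory.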

3. Run Flume.

bin/flume-ng agent -c conf/ -f job/flume-file-hdfs.conf -n a2 -Dflume.root.logger=INFO,console

4. Start Hadoop and Hive, then use Hive so that it produces log output

start-dfs.sh
start-yarn.sh
bin/hive
hive (default)>

5. Check the files on HDFS.

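2.2.3 Monitoring Multiple New Files in a Directory in Real Time

1) Requirement: use Flume to watch an entire directory for new files and upload them to HDFS.
2) Requirements analysis:
[Figure: requirements analysis diagram]
3) Implementation steps:
1. Create the configuration file flume-dir-hdfs.conf
Create the file and add the following content: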

a3.sources = r3
a3.sinks = k3
a3.channels = c3
# Describe/configure the source
a3.sources.r3.type = spooldir
a3.sources.r3.spoolDir = /opt/module/flume/upload
a3.sources.r3.fileSuffix = .COMPLETED
a3.sources.r3.fileHeader = true
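# Ignore (do not upload) every file ending in .tmp
a3.sources.r3.ignorePattern = ([^ ]*\.tmp)
# Describe the sink
a3.sinks.k3.type = hdfs
a3.sinks.k3.hdfs.path = hdfs://hadoop102:9000/flume/upload/%Y%m%d/%H
# Prefix for uploaded files
a3.sinks.k3.hdfs.filePrefix = upload-
# Whether to roll directories by time (round the timestamp down)
a3.sinks.k3.hdfs.round = true
# How many time units per new directory
a3.sinks.k3.hdfs.roundValue = 1
# The time unit used for rounding
a3.sinks.k3.hdfs.roundUnit = hour
# Whether to use the local timestamp
a3.sinks.k3.hdfs.useLocalTimeStamp = true
# Number of events to accumulate before flushing to HDFS
a3.sinks.k3.hdfs.batchSize = 100
# File type; DataStream writes uncompressed data, compression is also supported
a3.sinks.k3.hdfs.fileType = DataStream
# Seconds before rolling to a new file
a3.sinks.k3.hdfs.rollInterval = 60
# File size in bytes that triggers a roll (roughly 128 MB)
a3.sinks.k3.hdfs.rollSize = 134217700
# 0 means rolling does not depend on the number of events
a3.sinks.k3.hdfs.rollCount = 0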
# Use a channel which buffers events in memory
a3.channels.c3.type = memory
a3.channels.c3.capacity = 1000
a3.channels.c3.transactionCapacity = 100
# Bind the source and sink to the channel
a3.sources.r3.channels = c3
a3.sinks.k3.channel = c3
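
The three roll settings work together: a new file is started as soon as any enabled threshold is reached, and a value of 0 switches that threshold off. The sketch below expresses this in Python. `should_roll` is a hypothetical function written for illustration, not Flume's internal code, and its defaults are the values from the configuration above.

```python
def should_roll(elapsed_seconds, bytes_written, events_written,
                roll_interval=60, roll_size=134217700, roll_count=0):
    """Return True when any enabled threshold is reached; 0 disables a check."""
    if roll_interval > 0 and elapsed_seconds >= roll_interval:
        return True
    if roll_size > 0 and bytes_written >= roll_size:
        return True
    if roll_count > 0 and events_written >= roll_count:
        return True
    return False


full_block = 128 * 1024 * 1024       # 134217728 bytes
headroom = full_block - 134217700    # bytes left below a full 128 MiB
```

The configured rollSize is 28 bytes short of 128 MiB, so a file rolls just before it would reach that size.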

2. Start the agent that monitors the folder

bin/flume-ng agent -c conf/ -n a3 -f job/flume-dir-hdfs.conf

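Notes on using the Spooling Directory Source:
Do not create a file in the monitored directory and then keep modifying it.
Files that have been fully uploaded are renamed with the .COMPLETED suffix.
The monitored folder is scanned for file changes every 500 milliseconds.
3. Add files to the upload folder
Create the upload folder under /opt/module/flume (the directory must already exist when the agent starts).
Add files to the upload folder, then check the result.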
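
One way to respect the rule about not modifying files inside the monitored directory is to write under a .tmp name first, which the ignorePattern above skips, and rename once the file is complete. The sketch below shows the idea in Python. `drop_into_spool` is a hypothetical helper, and a temporary directory stands in for /opt/module/flume/upload.

```python
import os
import tempfile


def drop_into_spool(spool_dir, name, data):
    """Write under a .tmp name (ignored by ignorePattern), then rename."""
    tmp_path = os.path.join(spool_dir, name + ".tmp")
    final_path = os.path.join(spool_dir, name)
    with open(tmp_path, "wb") as handle:
        handle.write(data)
    os.rename(tmp_path, final_path)  # atomic on the same filesystem (POSIX)
    return final_path


spool = tempfile.mkdtemp()  # stand-in for /opt/module/flume/upload
result = drop_into_spool(spool, "app.log", b"hello\n")
```

Because the rename happens only after the write has finished, the source never sees a half-written file under its final name.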

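2.2.4 Monitoring Multiple Appended Files in a Directory in Real Time

The Exec Source suits a single file that is appended to in real time, but it cannot guarantee that no data is lost. The Spooldir Source guarantees that no data is lost and can resume from where it stopped, but its latency is high and it cannot monitor in real time. The Taildir Source can resume from where it stopped, guarantees that no data is lost, and also monitors in real time.

1) Requirement: use Flume to watch the files in an entire directory as they are appended to, and upload the data to HDFS.
2) Requirements analysis:
[Figure: requirements analysis diagram]
3) Implementation steps:
1. Create the configuration file flume-taildir-hdfs.conf
Add the following content: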

a3.sources = r3
a3.sinks = k3
a3.channels = c3
# Describe/configure the source
a3.sources.r3.type = TAILDIR
a3.sources.r3.positionFile = /opt/module/flume/tail_dir.json
a3.sources.r3.filegroups = f1
a3.sources.r3.filegroups.f1 = /opt/module/flume/files/file.*
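# Describe the sink
a3.sinks.k3.type = hdfs
a3.sinks.k3.hdfs.path = hdfs://hadoop102:9000/flume/upload/%Y%m%d/%H
# Prefix for uploaded files
a3.sinks.k3.hdfs.filePrefix = upload-
# Whether to roll directories by time (round the timestamp down)
a3.sinks.k3.hdfs.round = true
# How many time units per new directory
a3.sinks.k3.hdfs.roundValue = 1
# The time unit used for rounding
a3.sinks.k3.hdfs.roundUnit = hour
# Whether to use the local timestamp
a3.sinks.k3.hdfs.useLocalTimeStamp = true
# Number of events to accumulate before flushing to HDFS
a3.sinks.k3.hdfs.batchSize = 100
# File type; DataStream writes uncompressed data, compression is also supported
a3.sinks.k3.hdfs.fileType = DataStream
# Seconds before rolling to a new file
a3.sinks.k3.hdfs.rollInterval = 60
# File size in bytes that triggers a roll (roughly 128 MB)
a3.sinks.k3.hdfs.rollSize = 134217700
# 0 means rolling does not depend on the number of events
a3.sinks.k3.hdfs.rollCount = 0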
# Use a channel which buffers events in memory
a3.channels.c3.type = memory
a3.channels.c3.capacity = 1000
a3.channels.c3.transactionCapacity = 100
# Bind the source and sink to the channel
a3.sources.r3.channels = c3
a3.sinks.k3.channel = c3

2. Start the agent that monitors the folder

bin/flume-ng agent -c conf/ -n a3 -f job/flume-taildir-hdfs.conf

3. Append content to files in the files folder
Create the files folder under /opt/module/flume.
From inside the files folder, append to some files:

 echo hello >> file1.txt
 echo world >> file2.txt
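
Resuming from a saved offset is what the positionFile setting makes possible. The following Python sketch imitates the idea: it remembers a byte offset per file in a JSON file and reads only what was appended since. `read_new_lines` is a hypothetical function and the JSON layout is simplified; the real source also tracks each file's inode. Temporary paths stand in for the files folder and tail_dir.json.

```python
import json
import os
import tempfile


def read_new_lines(path, position_file):
    """Read only what was appended since the offset saved in position_file."""
    offsets = {}
    if os.path.exists(position_file):
        with open(position_file) as handle:
            offsets = {entry["file"]: entry["pos"] for entry in json.load(handle)}
    with open(path, "rb") as handle:
        handle.seek(offsets.get(path, 0))
        lines = handle.read().decode("utf-8").splitlines()
        offsets[path] = handle.tell()
    with open(position_file, "w") as handle:
        json.dump([{"file": f, "pos": p} for f, p in offsets.items()], handle)
    return lines


workdir = tempfile.mkdtemp()
log = os.path.join(workdir, "file1.txt")
pos = os.path.join(workdir, "tail_dir.json")
with open(log, "a") as handle:
    handle.write("hello\n")
first = read_new_lines(log, pos)    # reads from the start
with open(log, "a") as handle:
    handle.write("world\n")
second = read_new_lines(log, pos)   # resumes after "hello"
```

Because the offset is stored on disk, a restarted reader continues after the last recorded position instead of re-reading or skipping data.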

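3. Advanced Flume

3.1 Flume Transactions

[Figure: Flume transactions]

Key points for interviews:
There are two transactions: put (source to channel) and take (channel to sink).
There are two temporary buffers: putList and takeList.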
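
The roles of the two buffers can be sketched in Python. This is a simplified model written under my own assumptions, not Flume's source code: the class `ChannelWithTransactions` and all of its method names are hypothetical. Events staged in the put list enter the channel only on commit, and events in the take list go back to the channel if the sink fails.

```python
from collections import deque


class ChannelWithTransactions:
    """Toy channel with a put side (source) and a take side (sink)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.queue = deque()
        self.put_list = []    # events staged by the source
        self.take_list = []   # events handed to the sink, not yet confirmed

    def do_put(self, event):
        self.put_list.append(event)

    def commit_put(self):
        # All staged events enter the channel together, or none of them do.
        if len(self.queue) + len(self.put_list) > self.capacity:
            self.put_list.clear()          # rollback: the source must resend
            return False
        self.queue.extend(self.put_list)
        self.put_list.clear()
        return True

    def do_take(self):
        event = self.queue.popleft()
        self.take_list.append(event)
        return event

    def commit_take(self):
        self.take_list.clear()             # the sink wrote the batch

    def rollback_take(self):
        # Return the unconfirmed events to the head of the queue, in order.
        self.queue.extendleft(reversed(self.take_list))
        self.take_list.clear()


channel = ChannelWithTransactions(capacity=2)
channel.do_put("e1")
channel.do_put("e2")
put_ok = channel.commit_put()
channel.do_take()
channel.rollback_take()                    # e.g. the destination write failed
after_rollback = list(channel.queue)
```

After the take rollback the channel again holds both events in their original order, so nothing is lost when a sink fails partway through a batch.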

3.2 Flume Agent Internals

[Figure: Flume Agent internals]
Key components:
1) ChannelSelector
A ChannelSelector decides which Channel an Event is sent to. There are two types: Replicating and Multiplexing.
A Replicating selector sends the same Event to every Channel, whereas a Multiplexing selector sends different Events to different Channels according to configured rules.
2) SinkProcessor
There are three types of SinkProcessor: DefaultSinkProcessor, LoadBalancingSinkProcessor, and FailoverSinkProcessor. DefaultSinkProcessor works with a single Sink, whereas the other two work with a Sink Group: LoadBalancingSinkProcessor provides load balancing, and FailoverSinkProcessor provides failover.
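
The difference between the two ChannelSelector types can be shown with a short Python sketch. The functions `replicating` and `multiplexing`, the header name `type`, and the mapping are all made up for illustration; they mirror the behavior described above rather than Flume's classes.

```python
def replicating(event, channels):
    """Every channel receives a copy of every event."""
    return list(channels)


def multiplexing(event, header, mapping, default):
    """Route by the value of one header; unmatched values go to `default`."""
    return mapping.get(event["headers"].get(header), default)


channels = ["c1", "c2"]
event = {"headers": {"type": "error"}, "body": b"disk full"}
to_all = replicating(event, channels)
to_some = multiplexing(event, "type", {"error": ["c2"]}, default=["c1"])
```

Here the replicating selector targets both channels, while the multiplexing selector sends the event only to c2 because of its header value.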

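3.3 Flume Topologies

3.3.1 Simple Chaining

[Figure: simple chaining]
This pattern connects several Flume agents in sequence, from the first source through to the storage system that the final sink writes to. Chaining too many agents is not recommended: it slows down transfer, and if any agent in the chain goes down, the whole pipeline is affected.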

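3.3.2 Replication and Multiplexing

[Figure: replication and multiplexing]
Flume can route an event stream to one or more destinations. In this pattern the same data can be replicated to several channels, or different data can be distributed to different channels, and each sink can deliver to a different destination.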

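3.3.3 Load Balancing and Failover

[Figure: load balancing and failover]
Flume can group several sinks logically into a sink group. Combined with different SinkProcessors, a sink group provides load balancing or failover.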
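
As a rough picture of load balancing within a sink group, the following Python sketch hands events to the sinks in turn (round robin). `round_robin` is a hypothetical function and the sink and event names are made up.

```python
from itertools import cycle


def round_robin(sinks, events):
    """Assign events to the sinks of a group in turn."""
    assignment = {name: [] for name in sinks}
    for sink, event in zip(cycle(sinks), events):
        assignment[sink].append(event)
    return assignment


shares = round_robin(["k1", "k2"], ["e1", "e2", "e3", "e4", "e5"])
```

Each sink ends up with roughly the same share of the traffic, which is the point of a load-balancing sink group.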

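3.3.4 Aggregation

[Figure: aggregation]
This is the most common pattern and a very practical one. Everyday web applications are typically spread across hundreds of servers, and large ones across thousands or even tens of thousands, so the logs they produce are hard to handle. This Flume topology solves the problem well: each server runs a Flume agent that collects logs and forwards them to a central collecting Flume agent, which then uploads them to HDFS, Hive, HBase, and so on for log analysis.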

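3.4 Flume Enterprise Development Examples

3.4.1 Replication and Multiplexing

1) Requirement
Flume-1 monitors a file for changes and passes the new content to Flume-2, which stores it in HDFS. Flume-1 also passes the same content to Flume-3, which writes it to the local file system.
2) Requirements analysis:
[Figure: requirements analysis diagram]
3) Implementation steps:
0. Preparation
Create a group1 folder under /opt/module/flume/job.
Create a flume3 folder under /opt/module/data/.
1. Create flume-file-flume.conf
Configure one source that reads the log file, two channels, and two sinks, which feed flume-flume-hdfs and flume-flume-dir respectively.
Edit the configuration file: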

vim flume-file-flume.conf

Add the following content:

# Name the components on this agent
a1.sources = r1
a1.sinks = k1 k2
a1.channels = c1 c2
# Replicate the data flow to all channels
a1.sources.r1.selector.type = replicating
# Describe/configure the source
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /opt/module/hive/logs/hive.log
a1.sources.r1.shell = /bin/bash -c
# Describe the sink
# An avro sink acts as a data sender
a1.sinks.k1.type = avro
a1.sinks.k1.hostname = hadoop102
a1.sinks.k1.port = 4141
a1.sinks.k2.type = avro
a1.sinks.k2.hostname = hadoop102
a1.sinks.k2.port = 4142
# Describe the channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
a1.channels.c2.type = memory
a1.channels.c2.capacity = 1000
a1.channels.c2.transactionCapacity = 100
# Bind the source and sink to the channel
a1.sources.r1.channels = c1 c2
a1.sinks.k1.channel = c1
a1.sinks.k2.channel = c2

2. Create flume-flume-hdfs.conf
Configure a source that receives the output of the upstream Flume agent and a sink that writes to HDFS.
Edit the configuration file:

 vim flume-flume-hdfs.conf

Add the following content:

# Name the components on this agent
a2.sources = r1
a2.sinks = k1
a2.channels = c1
# Describe/configure the source
# An avro source acts as a data-receiving service
a2.sources.r1.type = avro
a2.sources.r1.bind = hadoop102
a2.sources.r1.port = 4141
# Describe the sink
a2.sinks.k1.type = hdfs
a2.sinks.k1.hdfs.path = hdfs://hadoop102:9000/flume2/%Y%m%d/%H
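# Prefix for uploaded files
a2.sinks.k1.hdfs.filePrefix = flume2-
# Whether to roll directories by time (round the timestamp down)
a2.sinks.k1.hdfs.round = true
# How many time units per new directory
a2.sinks.k1.hdfs.roundValue = 1
# The time unit used for rounding
a2.sinks.k1.hdfs.roundUnit = hour
# Whether to use the local timestamp
a2.sinks.k1.hdfs.useLocalTimeStamp = true
# Number of events to accumulate before flushing to HDFS
a2.sinks.k1.hdfs.batchSize = 100
# File type; DataStream writes uncompressed data, compression is also supported
a2.sinks.k1.hdfs.fileType = DataStream
# Seconds before rolling to a new file
a2.sinks.k1.hdfs.rollInterval = 600
# File size in bytes that triggers a roll (roughly 128 MB)
a2.sinks.k1.hdfs.rollSize = 134217700
# 0 means rolling does not depend on the number of events
a2.sinks.k1.hdfs.rollCount = 0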
# Describe the channel
a2.channels.c1.type = memory
a2.channels.c1.capacity = 1000
a2.channels.c1.transactionCapacity = 100
# Bind the source and sink to the channel
a2.sources.r1.channels = c1
a2.sinks.k1.channel = c1

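3. Create flume-flume-dir.conf
Configure a source that receives the output of the upstream Flume agent and a sink that writes to a local directory.
Edit the configuration file: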

vim flume-flume-dir.conf

Add the following content:

# Name the components on this agent
a3.sources = r1
a3.sinks = k1
a3.channels = c2
# Describe/configure the source
a3.sources.r1.type = avro
a3.sources.r1.bind = hadoop102
a3.sources.r1.port = 4142
# Describe the sink
a3.sinks.k1.type = file_roll
a3.sinks.k1.sink.directory = /opt/module/data/flume3
# Describe the channel
a3.channels.c2.type = memory
a3.channels.c2.capacity = 1000
a3.channels.c2.transactionCapacity = 100
# Bind the source and sink to the channel
a3.sources.r1.channels = c2
a3.sinks.k1.channel = c2

Note: the local output directory must already exist. If it does not, it will not be created.
4. Run the configurations
Start the Flume processes in this order: flume-flume-dir, flume-flume-hdfs, flume-file-flume.

[atguigu@hadoop102 flume]$ bin/flume-ng agent --conf conf/ --name a3 --conf-file job/group1/flume-flume-dir.conf
[atguigu@hadoop102 flume]$ bin/flume-ng agent --conf conf/ --name a2 --conf-file job/group1/flume-flume-hdfs.conf
[atguigu@hadoop102 flume]$ bin/flume-ng agent --conf conf/ --name a1 --conf-file job/group1/flume-file-flume.conf

5. Start Hadoop and Hive
6. Check the data on HDFS
7. Check the data in the /opt/module/data/flume3 directory

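3.4.2 Load Balancing and Failover

1) Requirement
Flume1 monitors a port. The sinks in its sink group connect to Flume2 and Flume3 respectively, and a FailoverSinkProcessor provides failover.
2) Requirements analysis
[Figure: requirements analysis diagram]
3) Implementation steps
0. Preparation
Create a group2 folder under /opt/module/flume/job.
1. Create flume-netcat-flume.conf
Configure one netcat source, one channel, and one sink group with two sinks, which feed flume-flume-console1 and flume-flume-console2 respectively.
Edit the configuration file: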

 vim flume-netcat-flume.conf

Add the following content:

# Name the components on this agent
a1.sources = r1
a1.channels = c1
a1.sinkgroups = g1
a1.sinks = k1 k2
# Describe/configure the source
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444
a1.sinkgroups.g1.processor.type = failover
a1.sinkgroups.g1.processor.priority.k1 = 5
a1.sinkgroups.g1.processor.priority.k2 = 10
a1.sinkgroups.g1.processor.maxpenalty = 10000
# Describe the sink
a1.sinks.k1.type = avro
a1.sinks.k1.hostname = hadoop102
a1.sinks.k1.port = 4141
a1.sinks.k2.type = avro
a1.sinks.k2.hostname = hadoop102
a1.sinks.k2.port = 4142
# Describe the channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinkgroups.g1.sinks = k1 k2
a1.sinks.k1.channel = c1
a1.sinks.k2.channel = c1
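
How the failover processor chooses a sink can be sketched in a few lines of Python: among the sinks that are still alive, the one with the highest priority value is used. `pick_sink` is a hypothetical function written for illustration, and the priorities are the ones from the configuration above.

```python
def pick_sink(priorities, alive):
    """Return the live sink with the highest priority, or None if all are down."""
    candidates = [name for name in priorities if name in alive]
    if not candidates:
        return None
    return max(candidates, key=lambda name: priorities[name])


priorities = {"k1": 5, "k2": 10}          # values from the configuration above
active = pick_sink(priorities, alive={"k1", "k2"})
fallback = pick_sink(priorities, alive={"k1"})
```

With these priorities k2 is used while both sinks are healthy, and k1 takes over only when k2 is unavailable.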

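2. Create flume-flume-console1.conf
Configure a source that receives the output of the upstream Flume agent and a sink that prints to the local console.
Edit the configuration file: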

vim flume-flume-console1.conf

Add the following content:

# Name the components on this agent
a2.sources = r1
a2.sinks = k1
a2.channels = c1
# Describe/configure the source
a2.sources.r1.type = avro
a2.sources.r1.bind = hadoop102
a2.sources.r1.port = 4141
# Describe the sink
a2.sinks.k1.type = logger
# Describe the channel
a2.channels.c1.type = memory
a2.channels.c1.capacity = 1000
a2.channels.c1.transactionCapacity = 100
# Bind the source and sink to the channel
a2.sources.r1.channels = c1
a2.sinks.k1.channel = c1

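3. Create flume-flume-console2.conf
Configure a source that receives the output of the upstream Flume agent and a sink that prints to the local console.
Edit the configuration file: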

vim flume-flume-console2.conf

Add the following content:

# Name the components on this agent
a3.sources = r1
a3.sinks = k1
a3.channels = c2
# Describe/configure the source
a3.sources.r1.type = avro
a3.sources.r1.bind = hadoop102
a3.sources.r1.port = 4142
# Describe the sink
a3.sinks.k1.type = logger
# Describe the channel
a3.channels.c2.type = memory
a3.channels.c2.capacity = 1000
a3.channels.c2.transactionCapacity = 100
# Bind the source and sink to the channel
a3.sources.r1.channels = c2
a3.sinks.k1.channel = c2

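4. Run the configurations
Start the agents in this order: flume-flume-console2, flume-flume-console1, flume-netcat-flume.
5. Use netcat to send content to port 44444 on the local machine
6. Check the console logs of Flume2 and Flume3
7. Kill the agent that is currently receiving events and confirm that the other one takes over. With the priorities configured above, k2 (Flume3, port 4142) has the higher priority and is the active sink, so kill Flume3 and watch the Flume2 console.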

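3.4.3 Aggregation

1) Requirement
Flume-1 on hadoop102 monitors the file /opt/module/group.log,
Flume-2 on hadoop103 monitors the data stream on a port,
and both send their data to Flume-3 on hadoop104, which prints the final data to the console.
2) Requirements analysis
[Figure: requirements analysis diagram]
3) Implementation steps:
Distribute Flume to the other hosts: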

[atguigu@hadoop102 module]$ xsync flume

Create a group3 folder under /opt/module/flume/job on hadoop102, hadoop103, and hadoop104.
1. Create flume1-logger-flume.conf
On hadoop102, add the following content:

# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1
# Describe/configure the source
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /opt/module/group.log
a1.sources.r1.shell = /bin/bash -c
# Describe the sink
a1.sinks.k1.type = avro
a1.sinks.k1.hostname = hadoop104
a1.sinks.k1.port = 4141
# Describe the channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

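2. Create flume2-netcat-flume.conf
Configure a source that monitors the data stream on port 44444 and a sink that forwards the data to the next Flume agent. Edit the configuration file on hadoop103.
Add the following content: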

# Name the components on this agent
a2.sources = r1
a2.sinks = k1
a2.channels = c1
# Describe/configure the source
a2.sources.r1.type = netcat
a2.sources.r1.bind = hadoop103
a2.sources.r1.port = 44444
# Describe the sink
a2.sinks.k1.type = avro
a2.sinks.k1.hostname = hadoop104
a2.sinks.k1.port = 4141
# Use a channel which buffers events in memory
a2.channels.c1.type = memory
a2.channels.c1.capacity = 1000
a2.channels.c1.transactionCapacity = 100
# Bind the source and sink to the channel
a2.sources.r1.channels = c1
a2.sinks.k1.channel = c1

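3. Create flume3-flume-logger.conf
Configure a source that receives the data streams sent by flume1 and flume2, and a sink that prints the merged stream to the console.
Edit the configuration file on hadoop104 and add the following content: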

# Name the components on this agent
a3.sources = r1
a3.sinks = k1
a3.channels = c1
# Describe/configure the source
a3.sources.r1.type = avro
a3.sources.r1.bind = hadoop104
a3.sources.r1.port = 4141
# Describe the sink
a3.sinks.k1.type = logger
# Describe the channel
a3.channels.c1.type = memory
a3.channels.c1.capacity = 1000
a3.channels.c1.transactionCapacity = 100
# Bind the source and sink to the channel
a3.sources.r1.channels = c1
a3.sinks.k1.channel = c1

4. Run the configurations
Start the agents in this order: flume3-flume-logger.conf, flume2-netcat-flume.conf, flume1-logger-flume.conf.
5. On hadoop102, append content to group.log in the /opt/module directory
6. Send data to port 44444 on hadoop103
7. Check the data on hadoop104
