Notes on the Official Flume Documentation

2NaCl

Category: Distributed Computing. Tag: Flume

Article link: https://blog.csdn.net/qq_41936805/article/details/99098287

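Why write notes on the Flume documentation in particular? Because Flume and Spark are the two frameworks whose documentation I find really well written, far better than that of Hadoop, ZooKeeper, and the like. Enough preamble.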

Introduction to Flume

Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data. It has a simple and flexible architecture based on streaming data flows. It is robust and fault tolerant with tunable reliability mechanisms and many failover and recovery mechanisms. It uses a simple extensible data model that allows for online analytic application.

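In plain terms: Flume collects, aggregates, and moves large volumes of log data across machines. Its architecture is built around streaming data flows, its reliability level is tunable, it can fail over and recover, and its data model is simple enough to extend for online analytics.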

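[Figure: architecture of a Flume agent (Source, Channel, Sink) sitting between a web server and HDFS]
The diagram shows the basic architecture. The Web Server is the machine that produces the data. A Flume Source receives data from it, the data is buffered in a Channel, and a Sink then writes it to HDFS, our distributed file storage system. That is the whole Flume pipeline, and its core is the flow of events inside the Agent.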

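Put simply, Flume does log collection, and it does so in a distributed, highly available way. The same is true of most distributed systems, though. Earlier posts in this column covered other big data concepts and operations, and Flume follows the same pattern.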
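To make the Source → Channel → Sink flow concrete, here is a toy Python model of it. This is only a sketch and not Flume code: the class and function names (`Channel`, `run_source`, `run_sink`) are made up for illustration, and a plain list stands in for HDFS. Real Flume moves events in transactions; this toy version simply rejects events when the buffer is full.

```python
from collections import deque


class Channel:
    """Bounded buffer that sits between a source and a sink."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self._events = deque()

    def put(self, event: str) -> bool:
        """Store an event; return False when the channel is full."""
        if len(self._events) >= self.capacity:
            return False
        self._events.append(event)
        return True

    def take(self):
        """Return the oldest event, or None when the channel is empty."""
        return self._events.popleft() if self._events else None


def run_source(lines, channel: Channel) -> int:
    """Source side: offer each incoming line to the channel, count the accepted ones."""
    return sum(1 for line in lines if channel.put(line))


def run_sink(channel: Channel, destination: list) -> int:
    """Sink side: drain the channel into the destination (HDFS in the figure)."""
    delivered = 0
    while (event := channel.take()) is not None:
        destination.append(event)
        delivered += 1
    return delivered


if __name__ == "__main__":
    channel = Channel(capacity=1000)
    hdfs_stand_in = []  # hypothetical stand-in for the real destination
    print("accepted:", run_source(["GET /index", "GET /login"], channel))
    print("delivered:", run_sink(channel, hdfs_stand_in), hdfs_stand_in)
```

The point of the buffer in the middle is that the source and the sink do not have to run at the same speed.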

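Flume's goals:
Reliability: data is delivered safely, and when a node goes down another node can take over right away.
Scalability: capacity can be increased by adding Agents or by giving them more hardware.
Manageability: compared with similar products, it is better suited to distributed environments.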

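In practice, moving log data may take several Agents working in sequence. See the two-tier agent setup in the figure below:

[Figure: two agents chained together, the first agent's avro sink feeding the second agent's avro source]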

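The first Agent sends its output over RPC, and that output becomes the input of the second Agent. In this multi-agent setup, the sink of the previous agent and the source of the current hop must both be of type avro, with the sink pointing at the source's hostname (or IP address) and port.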

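Another situation is even more common in distributed environments: several web servers produce logs at the same time, and all of those logs need to end up in one shared HDFS. Here too we can use tiered agents:

[Figure: consolidation, with several first-tier agents feeding one second-tier agent that writes to HDFS]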

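In Flume this is done by configuring several first-tier agents with an avro sink (generally used to ship data across nodes), all pointing to the avro source of a single agent (you can also use thrift sources/sinks/clients in this scenario). That source on the second-tier agent merges the received events into a single channel, which a sink then drains to the final destination.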

Flume also supports fanning a flow out to several destinations:

[Figure: a source on agent "foo" fanning out to three channels]

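The example above shows a source on agent "foo" fanning the flow out to three different channels. The fan-out can be replicating or multiplexing. With a replicating flow, every event is sent to all three channels. With multiplexing, an event is delivered to a subset of the available channels when one of its attributes matches a preconfigured value.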

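For example, if an event attribute named "txnType" is set to "customer", the event should go to channel1 and channel3; if it is "vendor", it should go to channel2; otherwise it goes to channel3. The mapping is set in the agent's configuration file.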
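The routing rule described above can be sketched in a few lines of Python. This is a toy model of the idea, not Flume's implementation: the function names `replicate` and `multiplex` and the dictionaries are assumptions made for illustration, and only the channel names and the "txnType" values come from the text.

```python
from typing import Dict, List

ALL_CHANNELS = ["channel1", "channel2", "channel3"]

# Mapping taken from the example in the text.
MAPPING: Dict[str, List[str]] = {
    "customer": ["channel1", "channel3"],
    "vendor": ["channel2"],
}
DEFAULT = ["channel3"]  # used when no mapping matches


def replicate(headers: Dict[str, str]) -> List[str]:
    """Replicating fan-out: every event goes to every channel."""
    return list(ALL_CHANNELS)


def multiplex(headers: Dict[str, str], header_name: str = "txnType") -> List[str]:
    """Multiplexing fan-out: choose channels by the value of one event attribute."""
    value = headers.get(header_name)
    return list(MAPPING.get(value, DEFAULT))


if __name__ == "__main__":
    for event in ({"txnType": "customer"}, {"txnType": "vendor"}, {"txnType": "other"}, {}):
        print(event, "->", multiplex(event))
```

An event with an unknown or missing "txnType" falls through to the default channel, which matches the "otherwise" branch in the text.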

That covers the basics of Flume. Next, let's actually use it.

Getting Started Examples

Goal 1: Collect data from a network port and print it to the console

The key to using Flume is its configuration file. Let's look at one first:

# Name the components on this agent
# a1 is the agent name; r1, k1 and c1 name its source, sink and channel.
# Keep comments on their own lines: in a properties file a trailing "# ..."
# would be read as part of the value.
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source
# netcat: the type of the data source
a1.sources.r1.type = netcat
# Host name or IP to bind to (differs per machine; here it is my master node)
a1.sources.r1.bind = linux01
# Port the source listens on
a1.sources.r1.port = 44444

# Describe the sink
# logger: events are logged at INFO level
a1.sinks.k1.type = logger

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
# A source can feed several channels (note the plural "channels");
# a sink reads from exactly one channel.
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

With the configuration in place, let's start a Flume agent and walk through the start command:

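[centos01@linux01 conf]$ flume-ng agent \
> --name a1 \
> --conf $FLUME_HOME/conf \
> --conf-file $FLUME_HOME/conf/example.conf \
> -Dflume.root.logger=INFO,console

--name a1: the agent name defined in the configuration above
--conf: the directory holding Flume's own configuration
--conf-file: the agent configuration file to load
-Dflume.root.logger=INFO,console: print the log to the console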

The agent then starts successfully:

[Screenshot: console output of the agent after startup]

We are listening on port 44444, and as noted earlier the source type is netcat, so all that is left is to send messages to that port. (If netcat is not installed, install it with yum.)

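Once we send messages to port 44444, they show up on the Flume console in real time:
[Screenshot: the messages printed by the logger sink on the Flume console]
That completes the first task.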
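The message-sending step above is normally done with the netcat command-line client. As an alternative, here is a minimal Python sketch that does the same thing over a plain TCP socket. The function name `send_lines` is made up for illustration. It assumes the listener answers every line with one reply line (as far as I know the netcat source replies "OK" per event by default); if acknowledgements are turned off, the read will time out instead.

```python
import socket


def send_lines(host: str, port: int, lines: list, timeout: float = 5.0) -> list:
    """Send each line to a TCP listener and collect one reply line per message."""
    replies = []
    with socket.create_connection((host, port), timeout=timeout) as conn, \
            conn.makefile("r", encoding="utf-8", newline="\n") as reader:
        for line in lines:
            # The listener treats each newline-terminated line as one message.
            conn.sendall((line + "\n").encode("utf-8"))
            replies.append(reader.readline().strip())
    return replies


# Example call, using the bind host and port from the configuration above:
#   send_lines("linux01", 44444, ["hello flume", "second line"])
```

Each line sent this way should appear as one event on the Flume console.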

Goal 2: Monitor a file and print newly appended data to the console

Clearly the data no longer comes from a network port this time, so the source has to change. Let's find a suitable source in the official documentation:

Exec source runs a given Unix command on start-up and expects that process to continuously produce data on standard out (stderr is simply discarded, unless property logStdErr is set to true). If the process exits for any reason, the source also exits and will produce no further data. This means configurations such as cat [named pipe] or tail -F [file] are going to produce the desired results where as date will probably not - the former two commands produce streams of data where as the latter produces a single event and exits.

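In other words, the Exec source runs a command when it starts and collects, in real time, whatever that command keeps writing to standard output. If the process exits, the source stops producing data as well. As always, the part that matters most is the conf file. The configuration is as follows: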

Agent components: exec source + memory channel + logger sink

# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /home/centos01/modules/apache-flume-1.7.0-bin/test_dataSource/data.log
a1.sources.r1.shell = /bin/sh -c

# Describe the sink
a1.sinks.k1.type = logger

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

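Then restart the agent:
[Screenshot: the agent starting with the exec source configuration]

Next, write some data into the file:
[Screenshot: data being appended to data.log]
Flume picks it up right away:

[Screenshot: the appended lines printed on the Flume console]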
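The file-writing step above is usually done by hand from a shell. To generate test data repeatedly, a small Python helper can append lines to the monitored file instead. This is only a sketch: the function name `append_lines` and the `delay` parameter are assumptions for illustration, and the path you pass in should be the file that the `tail -F` command in the configuration watches.

```python
import time


def append_lines(path: str, lines: list, delay: float = 0.0) -> int:
    """Append lines to a log file one at a time and return how many were written."""
    written = 0
    with open(path, "a", encoding="utf-8") as log:
        for line in lines:
            log.write(line + "\n")
            log.flush()  # push the line to the file now so a tailing process can see it
            written += 1
            if delay:
                time.sleep(delay)  # optional pause to imitate a slowly growing log
    return written
```

Opening the file in append mode matters here: the exec source only sees lines added after the existing content.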
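One caveat, though: as it stands this setup has no practical use. Real data processing is either offline or real-time. For offline processing we usually use the HDFS Sink; for real-time processing we use Flume + Kafka, that is, the Kafka Sink. The configuration is similar, and the documentation covers it in great detail.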

※ Goal 3: Collect logs from server A to server B in real time

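Two servers means two agents. First, the architecture:
[Figure: an agent on server A forwarding events through an avro sink to an avro source on server B]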

Component choices:
Server A: exec source + memory channel + avro sink
Server B: avro source + memory channel + logger sink

Now for the configuration files.
Note: give the two agents different names here; reusing a1 for both invites mix-ups.
exec-memory-avro.conf

# Name the components on this agent
avro.sources = exec-source
avro.sinks = avro-sink
avro.channels = memory-channel

# Describe/configure the source
avro.sources.exec-source.type = exec
avro.sources.exec-source.command = tail -F /home/centos01/modules/apache-flume-1.7.0-bin/test_dataSource/data.log
avro.sources.exec-source.shell = /bin/sh -c

# Describe the sink
avro.sinks.avro-sink.type = avro
avro.sinks.avro-sink.hostname = linux01
avro.sinks.avro-sink.port = 44444

# Use a channel which buffers events in memory
avro.channels.memory-channel.type = memory
avro.channels.memory-channel.capacity = 1000
avro.channels.memory-channel.transactionCapacity = 100

# Bind the source and sink to the channel
avro.sources.exec-source.channels = memory-channel
avro.sinks.avro-sink.channel = memory-channel

avro-memory-logger.conf

# Name the components on this agent
avro-memory-logger.sources = avro-source
avro-memory-logger.sinks = logger-sink
avro-memory-logger.channels = memory-channel

# Describe/configure the source
avro-memory-logger.sources.avro-source.type = avro
avro-memory-logger.sources.avro-source.bind = linux01
avro-memory-logger.sources.avro-source.port = 44444

# Describe the sink
avro-memory-logger.sinks.logger-sink.type = logger

# Use a channel which buffers events in memory
avro-memory-logger.channels.memory-channel.type = memory
avro-memory-logger.channels.memory-channel.capacity = 1000
avro-memory-logger.channels.memory-channel.transactionCapacity = 100

# Bind the source and sink to the channel
avro-memory-logger.sources.avro-source.channels = memory-channel
avro-memory-logger.sinks.logger-sink.channel = memory-channel

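There are two agents, so the start order matters:
first start avro-memory-logger, so that the avro source is already listening;
then start exec-memory-avro (in the file above its agent name is avro, so that is the value to pass to --name).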

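[Screenshot: events arriving on the console of avro-memory-logger]
Keep appending data and it still comes through. The exec-memory-avro agent itself prints nothing, because its sink is an avro sink that forwards events instead of logging them; the output appears only on the console of avro-memory-logger.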

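Log collection flow:

An agent on machine A monitors a file; when users visit the main site, their activity is logged to xxx.log.
The avro sink sends each newly written log line to the hostname and port of the corresponding avro source.
The agent that owns the avro source then writes the logs to the console (or to Kafka).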
