hadoop离线阶段（第十六节—1）flume的介绍、安装、使用和自定义拦截器

最新推荐文章于 2022-02-09 21:22:47 发布

hwq317622817

最新推荐文章于 2022-02-09 21:22:47 发布

阅读量176

点赞数

文章标签： flume

本文链接：https://blog.csdn.net/hwq317622817/article/details/111084631

版权

本文详细介绍了Flume的介绍、安装、使用，包括目录到HDFS的采集、文件到HDFS的采集、多agent级联、高可用Flume配置以及负载均衡配置。同时，还展示了如何自定义拦截器以满足特定数据处理需求。

摘要由CSDN通过智能技术生成

flume的介绍

概述

Flume是一个分布式、可靠、和高可用的海量日志采集、聚合和传输的系统。
Flume可以采集文件，socket数据包、文件、文件夹、kafka等各种形式源数据，又可以将采集到的数据(下沉sink)输出到HDFS、hbase、hive、kafka等众多外部存储系统中。
一般的采集需求，通过对flume的简单配置即可实现。
Flume针对特殊场景也具备良好的自定义扩展能力，因此，flume可以适用于大部分的日常数据采集场景。

运行机制

Flume分布式系统中最核心的角色是agent，flume采集系统就是由一个个agent所连接起来形成
每一个agent相当于一个数据传递员，内部有三个组件：
a) Source：采集组件，用于跟数据源对接，以获取数据
b) Sink：下沉组件，用于往下一级agent传递数据或者往最终存储系统传递数据
c) Channel：传输通道组件，用于从source将数据传递到sink

在这里插入图片描述

Flume采集系统结构图

简单结构
单个agent采集数据
复杂结构
多级agent之间串联

flume的安装

在CDH软件集上找合适的版本的flume，本文使用的是flume-ng-1.6.0-cdh5.14.0。
安装就直接解压压缩包到指定目录即可。然后配置一个文件：
将apache-flume-1.6.0-cdh5.14.0-bin/conf下的flume-env.sh.template复制为flume-env.sh，然后vi flume-env.sh，然后去掉export JAVA_HOME的注释，并赋值为JDK的目录。

flume的使用

采集目录到HDFS

1、需求

监视Linux上的目录，看目录下的文件有无新增，有新增就把新增的文件上传到hdfs上
根据需求，首先定义以下3大要素

数据源组件，即source ——监控文件目录 : spooldir
spooldir特性：
1、监视一个目录，只要目录中出现新文件，就会采集文件中的内容
2、采集完成的文件，会被agent自动添加一个后缀：COMPLETED
3、所监视的目录中不允许重复出现相同文件名的文件
下沉组件，即sink——HDFS文件系统 : hdfs sink
通道组件，即channel——可用file channel 也可以用内存channel

2、flume配置文件开发

cd  /export/servers/apache-flume-1.6.0-cdh5.14.0-bin/conf
vim spooldir.conf

# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1
# Describe/configure the source
##注意：不能往监控目中重复丢同名文件
a1.sources.r1.type = spooldir
a1.sources.r1.spoolDir = /export/servers/dirfile
a1.sources.r1.fileHeader = true
# Describe the sink
a1.sinks.k1.type = hdfs
a1.sinks.k1.channel = c1
a1.sinks.k1.hdfs.path = hdfs://node01:8020/spooldir/files/%y-%m-%d/%H%M/
a1.sinks.k1.hdfs.filePrefix = events-
a1.sinks.k1.hdfs.round = true
a1.sinks.k1.hdfs.roundValue = 10
a1.sinks.k1.hdfs.roundUnit = minute
a1.sinks.k1.hdfs.rollInterval = 3
a1.sinks.k1.hdfs.rollSize = 20
a1.sinks.k1.hdfs.rollCount = 5
a1.sinks.k1.hdfs.batchSize = 1
a1.sinks.k1.hdfs.useLocalTimeStamp = true
#生成的文件类型，默认是Sequencefile，可用DataStream，则为普通文本
a1.sinks.k1.hdfs.fileType = DataStream
# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

3、启动flume

cd /export/servers/apache-flume-1.6.0-cdh5.14.0-bin
bin/flume-ng agent -c ./conf -f ./conf/spooldir.conf -n a1 -Dflume.root.logger=INFO,console

采集文件到HDFS

1、需求
监视某个文件内容的变化，将新增的内容滚动写入到hdfs上的文件中
根据需求，首先定义以下3大要素

采集源，即source——监控文件内容更新 : exec ‘tail -F file’
下沉目标，即sink——HDFS文件系统 : hdfs sink
Source和sink之间的传递通道——channel，可用file channel 也可以用内存channel
2、flume配置文件开发

cd /export/servers/apache-flume-1.6.0-cdh5.14.0-bin/conf
vim tail-file.conf

agent1.sources = source1
agent1.sinks = sink1
agent1.channels = channel1

# Describe/configure tail -F source1
agent1.sources.source1.type = exec
agent1.sources.source1.command = tail -F /export/servers/taillogs/access_log
agent1.sources.source1.channels = channel1

#configure host for source
#agent1.sources.source1.interceptors = i1
#agent1.sources.source1.interceptors.i1.type = host
#agent1.sources.source1.interceptors.i1.hostHeader = hostname

# Describe sink1
agent1.sinks.sink1.type = hdfs
#a1.sinks.k1.channel = c1
agent1.sinks.sink1.hdfs.path = hdfs://node01:8020/weblog/flume-collection/%y-%m-%d/%H-%M
agent1.sinks.sink1.hdfs.filePrefix = access_log
agent1.sinks.sink1.hdfs.maxOpenFiles = 5000
agent1.sinks.sink1.hdfs.batchSize= 100
agent1.sinks.sink1.hdfs.fileType = DataStream
agent1.sinks.sink1.hdfs.writeFormat =Text
agent1.sinks.sink1.hdfs.rollSize = 102400
agent1.sinks.sink1.hdfs.rollCount = 1000000
agent1.sinks.sink1.hdfs.rollInterval = 60
agent1.sinks.sink1.hdfs.round = true
agent1.sinks.sink1.hdfs.roundValue = 10
agent1.sinks.sink1.hdfs.roundUnit = minute
agent1.sinks.sink1.hdfs.useLocalTimeStamp = true

# Use a channel which buffers events in memory
agent1.channels.channel1.type = memory
agent1.channels.channel1.keep-alive = 120
agent1.channels.channel1.capacity = 500000
agent1.channels.channel1.transactionCapacity = 600

# Bind the source and sink to the channel
agent1.sources.source1.channels = channel1
agent1.sinks.sink1.channel = channel1

3、启动flume

cd /export/servers/apache-flume-1.6.0-cdh5.14.0-bin
bin/flume-ng agent -c conf -f conf/tail-file.conf -n agent1  -Dflume.root.logger=INFO,console

两个agent级联

1、需求
将上面的需求分成两步，在node02监视文件内容的变化，node02监视到内容变化后将数据发给node03，由node03完成写入hdfs的操作
在这里插入图片描述
2、将node03的flume的安装目录发给node02
3、node02的flume配置文件开发

cd /export/servers/apache-flume-1.6.0-cdh5.14.0-bin/conf
vim tail-avro-avro-logger.conf

# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1
# Describe/configure the source
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /export/servers/taillogs/access_log
a1.sources.r1.channels = c1
# Describe the sink
##sink端的avro是一个数据发送者
a1.sinks = k1
a1.sinks.k1.type = avro
a1.sinks.k1.channel = c1
a1.sinks.k1.hostname = 192.168.52.120
a1.sinks.k1.port = 4141
a1.sinks.k1.batch-size = 10
# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

4、node03的flume配置文件开发

cd /export/servers/apache-flume-1.6.0-cdh5.14.0-bin/conf
vim avro-hdfs.conf

# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1
# Describe/configure the source
##source中的avro组件是一个接收者服务
a1.sources.r1.type = avro
a1.sources.r1.channels = c1
a1.sources.r1.bind = 192.168.52.120
a1.sources.r1.port = 4141
# Describe the sink
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = hdfs://node01:8020/avro/hdfs/%y-%m-%d/%H%M/
a1.sinks.k1.hdfs.filePrefix = events-
a1.sinks.k1.hdfs.round = true
a1.sinks.k1.hdfs.roundValue = 10
a1.sinks.k1.hdfs.roundUnit = minute
a1.sinks.k1.hdfs.rollInterval = 3
a1.sinks.k1.hdfs.rollSize = 20
a1.sinks.k1.hdfs.rollCount = 5
a1.sinks.k1.hdfs.batchSize = 1
a1.sinks.k1.hdfs.useLocalTimeStamp = true
#生成的文件类型，默认是Sequencefile，可用DataStream，则为普通文本
a1.sinks.k1.hdfs.fileType = DataStream
# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

5、在node02写连续插入文件内容的脚本

cd /export/servers/shells
vi taillogs.sh

#!/bin/bash
while true
do
 date >> /export/servers/taillogs/access_log;
 sleep 0.5
done

6、先启动node03的flume（使用DEBUG运行flume），再启动node02的flume（使用DEBUG运行flume），最后运行node02上连续插入文件内容的脚本。

高可用Flum-NG配置案例failover

所谓高可用就是两个机器同时执行相同的监视采集任务，其中一个真正执行任务，另一个等真正执行任务的机器死掉后，替补上场。
1、机器角色分配
Flume的Agent和Collector分布如下表所示：

名称	HOST	角色
Agent1	node01	Web Server
Collector1	node02	AgentMstr1
Collector2	node03	AgentMstr2

2、将node03上的flume安装目录发给node01
3、在node01上进行flume配置文件开发

cd /export/servers/apache-flume-1.6.0-cdh5.14.0-bin/conf
vim agent.conf

#agent1 name
agent1.channels = c1
agent1.sources = r1
agent1.sinks = k1 k2  #关键语句
#
##set gruop
agent1.sinkgroups = g1 #关键语句
#
##set channel
agent1.channels.c1.type = memory
agent1.channels.c1.capacity = 1000
agent1.channels.c1.transactionCapacity = 100
#
agent1.sources.r1.channels = c1
agent1.sources.r1.type = exec
agent1.sources.r1.command = tail -F /export/servers/taillogs/access_log
#
agent1.sources.r1.interceptors = i1 i2
agent1.sources.r1.interceptors.i1.type = static
agent1.sources.r1.interceptors.i1.key = Type
agent1.sources.r1.interceptors.i1.value = LOGIN
agent1.sources.r1.interceptors.i2.type = timestamp
#
## set sink1
agent1.sinks.k1.channel = c1
agent1.sinks.k1.type = avro
agent1.sinks.k1.hostname = node02
agent1.sinks.k1.port = 52020
## set sink2
agent1.sinks.k2.channel = c1
agent1.sinks.k2.type = avro
agent1.sinks.k2.hostname = node03
agent1.sinks.k2.port = 52020
#
##set sink group
agent1.sinkgroups.g1.sinks = k1 k2 #关键语句
#
##set failover
agent1.sinkgroups.g1.processor.type = failover #关键语句
agent1.sinkgroups.g1.processor.priority.k1 = 10
agent1.sinkgroups.g1.processor.priority.k2 = 1
agent1.sinkgroups.g1.processor.maxpenalty

最低0.47元/天解锁文章

hwq317622817

关注

0
点赞
踩
0

收藏

觉得还不错? 一键收藏
0
评论
hadoop离线阶段（第十六节—1）flume的介绍、安装、使用和自定义拦截器

flume的介绍概述Flume是一个分布式、可靠、和高可用的海量日志采集、聚合和传输的系统。Flume可以采集文件，socket数据包、文件、文件夹、kafka等各种形式源数据，又可以将采集到的数据(下沉sink)输出到HDFS、hbase、hive、kafka等众多外部存储系统中。一般的采集需求，通过对flume的简单配置即可实现。Flume针对特殊场景也具备良好的自定义扩展能力，因此，flume可以适用于大部分的日常数据采集场景。运行机制Flume分布式系统中最核心的角色是agent
复制链接

扫一扫