Unified versions:
apache-flume: 1.8.0
jdk: 1.8
hadoop: 2.7.5
zk: 3.4.9
Flume user guide
Flume Overview and Architecture
(1) Overview
Flume is a tool for collecting, transporting, and aggregating massive amounts of data in big data systems. It is specifically about the flow, or movement, of data: taking data from one storage medium and delivering it via Flume into another.
(2) Core components
source: connects to the various data sources
sink: connects to the various destinations where data is written (where the data "sinks" to)
channel: temporarily buffers data in between
(3) Flume collection system structure
(4) How it runs
Flume itself is a Java program; it is started as an agent process on each machine where data needs to be collected.
An agent process contains: source, sink, channel.
Inside Flume, data is wrapped as events; the real payload lives in the event body. The event is the smallest unit of data in Flume.
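Schematically, an event is a map of headers plus a body of raw bytes, the same shape the logger sink prints later in this guide:
Event: { headers:{key=value, ...} body: <raw bytes> }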
(5) Deployment architectures
Simple architecture: a single agent is enough
Complex architecture: multiple agents chained together; in a chained setup there is no master/slave distinction, all agents are peers
Flume Installation and Deployment
(1) Upload and extract
Upload the installation package to the node where the data source lives (node02 in my setup)
tar -zxvf apache-flume-1.8.0-bin.tar.gz -C /export/servers/
(2) Edit the configuration file
Copy the configuration template and edit it over a Notepad++ remote connection
cp flume-env.sh.template flume-env.sh
In conf/flume-env.sh, set the Java environment variable,
so that Flume is guaranteed to resolve the JDK correctly when it runs
This file has no execute permission by default, so set it:
chmod 755 flume-env.sh
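A minimal flume-env.sh entry looks like this (the JDK path is an example; substitute your own install location):
# conf/flume-env.sh -- point Flume at the JDK
export JAVA_HOME=/export/servers/jdk1.8.0_141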
(3) A small demo to verify the environment works
cd /export/servers/apache-flume-1.8.0/conf
touch netcat-logger.conf
Connect to node02 with Notepad++ and edit the netcat-logger.conf file
- Contents (a ready-made example lives in the Flume user guide; copy and paste as-is)
# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1
# Describe/configure the source
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444
# Describe the sink
a1.sinks.k1.type = logger
# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
- Start
Since we have not set up environment variables for Flume, run the following from Flume's root directory:
bin/flume-ng agent --conf conf --conf-file conf/netcat-logger.conf --name a1 -Dflume.root.logger=INFO,console
flume-ng marks this as the next-generation Flume; --conf conf points at the configuration directory; --conf-file conf/netcat-logger.conf gives the path of the concrete collection-plan file;
--name a1 names this agent process a1; -Dflume.root.logger=INFO,console turns on logging and prints it to the console
- Short-form command
bin/flume-ng agent -c ./conf -f ./conf/netcat-logger.conf -n a1 -Dflume.root.logger=INFO,console
-c gives the configuration directory, -f gives the collection-plan file, and the final flag enables console logging
- Connect to port 44444 and send data
First install a client tool:
yum -y install telnet
telnet localhost 44444
Type hello world in the telnet window
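If everything is wired up, telnet echoes OK and the agent console prints the event, roughly like this (the body is shown as hex bytes plus text):
Event: { headers:{} body: 68 65 6C 6C 6F 20 77 6F 72 6C 64 0D hello world. }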
Flume collection configuration template
A collection plan breaks down into five parts:
1. # Name the components on this agent (name the components)
2. # Describe/configure the source (configure the source)
3. # Describe the sink (configure the sink)
4. # Use a channel which buffers events in memory (configure the channel)
5. # Bind the source and sink to the channel (bind the source and sink to the two ends of the channel)
Flume Collection Examples
1. Full collection of a directory to HDFS
(1) Requirement:
A server keeps producing new files under a particular directory; whenever a new file appears, it must be collected into HDFS
- Collection source, i.e. the source (monitors a directory): Spooling Directory Source
- Sink target, i.e. the sink (HDFS file system): HDFS Sink
- Channel between source and sink: Memory Channel (matching the config below)
# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1
# Describe/configure the source
##Note: never drop files with the same name into the monitored directory
a1.sources.r1.type = spooldir
a1.sources.r1.spoolDir = /root/logs1
a1.sources.r1.fileHeader = true
# Describe the sink
a1.sinks.k1.type = hdfs
a1.sinks.k1.channel = c1
a1.sinks.k1.hdfs.path = /flume/events/%y-%m-%d/%H%M/
a1.sinks.k1.hdfs.filePrefix = events-
a1.sinks.k1.hdfs.round = true
a1.sinks.k1.hdfs.roundValue = 10
a1.sinks.k1.hdfs.roundUnit = minute
a1.sinks.k1.hdfs.rollInterval = 60
a1.sinks.k1.hdfs.rollSize = 0
a1.sinks.k1.hdfs.rollCount = 0
a1.sinks.k1.hdfs.batchSize = 1
a1.sinks.k1.hdfs.useLocalTimeStamp = true
#Generated file type; the default is SequenceFile, while DataStream writes plain text
a1.sinks.k1.hdfs.fileType = DataStream
# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
(2) Parameter breakdown:
The roll settings control how the file being written to HDFS rolls over (file rolling):
a1.sinks.k1.hdfs.rollInterval = 3 (roll by time interval, in seconds)
a1.sinks.k1.hdfs.rollSize = 20 (roll by file size, in bytes)
a1.sinks.k1.hdfs.rollCount = 5 (roll by event count)
If all three are configured, whichever threshold is reached first triggers the roll.
To disable rolling on one dimension, set that property to 0.
The round settings enable rounding of the timestamp and control how often a new folder is created (folder rolling; one folder can hold many files).
With the example below, a new folder is generated every 10 minutes:
a1.sinks.k1.hdfs.round = true
a1.sinks.k1.hdfs.roundValue = 10
a1.sinks.k1.hdfs.roundUnit = minute
capacity: the maximum number of events the channel can store
transactionCapacity: the maximum number of events taken from the source, or given to the sink, in one transaction
(3) Start monitoring (the config above is saved as conf/spooldir-hdfs.conf)
bin/flume-ng agent -c ./conf -f ./conf/spooldir-hdfs.conf -n a1 -Dflume.root.logger=INFO,console
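To exercise it, create the watched directory and drop in a test file; a sketch using the paths from the config above:
# on the agent node: create the directory the spooldir source watches
mkdir -p /root/logs1
# drop in a uniquely named file; spooldir renames it to *.COMPLETED once collected
echo "hello flume" > /root/logs1/test-$(date +%s).log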
2. The single most important caveat
- Never drop a file with a duplicate name into the monitored directory. If a same-named file arrives, Flume raises an error and goes on strike: it stops monitoring and collecting data from then on
- How to guarantee unique names: in practice, filenames are usually given a timestamp suffix so that no two files share a name, as sketched below
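A minimal sketch of the timestamp-naming convention (access.log is a stand-in for whatever file is being handed off):
# rename with a millisecond epoch suffix before moving into the watched directory
# (%3N requires GNU date)
mv access.log /root/logs1/access.log.$(date +%s%3N)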
3. Incremental collection of a file to HDFS
(1) Requirement:
The business system writes its logs with log4j, and the log file keeps growing; we need to collect the newly generated log lines to HDFS in real time
- The three key elements:
- Collection source (monitors appended file content): Exec Source
- Sink target (HDFS): HDFS Sink
- Channel carrying data from source to sink: Memory Channel
# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1
# Describe/configure the source
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /root/log/test.log
a1.sources.r1.channels = c1
# Describe the sink
a1.sinks.k1.type = hdfs
a1.sinks.k1.channel = c1
a1.sinks.k1.hdfs.path = /flume1/tailout/%y-%m-%d/%H-%M/
a1.sinks.k1.hdfs.filePrefix = events-
a1.sinks.k1.hdfs.round = true
a1.sinks.k1.hdfs.roundValue = 10
a1.sinks.k1.hdfs.roundUnit = minute
a1.sinks.k1.hdfs.rollInterval = 3
a1.sinks.k1.hdfs.rollSize = 20
a1.sinks.k1.hdfs.rollCount = 5
a1.sinks.k1.hdfs.batchSize = 1
a1.sinks.k1.hdfs.useLocalTimeStamp = true
#Generated file type; the default is SequenceFile, while DataStream writes plain text
a1.sinks.k1.hdfs.fileType = DataStream
# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
(2) Write a shell script to simulate data generation
vim addDate.sh
with the following contents:
#!/bin/bash
while true
do
date >> /root/log/test.log
sleep 0.5   # throttle so the log grows at a readable rate
done
(3) Start the Flume monitor (the config above is saved as conf/tail-hdfs.conf)
bin/flume-ng agent -c ./conf -f ./conf/tail-hdfs.conf -n a1 -Dflume.root.logger=INFO,console
# start the data-generating script
sh addDate.sh
Flume load-balance and failover
1. Flume load balancing
(1) Use case:
When several Flume agents are chained and one node's machine cannot keep up, large amounts of data can pile up. The fix is to run several downstream agents in parallel; a parallel group brings in a load-balancing strategy for distributing work (round-robin, random, weighted), while ensuring each event leaves through exactly one sink so no data is duplicated.
(2) Distribute Flume to the other two nodes
scp -r /export/servers/flume node01:$PWD
scp -r /export/servers/flume node03:$PWD
(3) Chaining Flume agents across the network
- avro sink
- avro source
Bind these two components to an IP and port, and data can be passed across the network; this pairing is the usual glue in chained Flume architectures
(4) Chained-agent configuration
- node01, config file named exec-avro.conf
#agent1 name
agent1.channels = c1
agent1.sources = r1
agent1.sinks = k1 k2
#set group
agent1.sinkgroups = g1
#set channel
agent1.channels.c1.type = memory
agent1.channels.c1.capacity = 1000
agent1.channels.c1.transactionCapacity = 100
agent1.sources.r1.channels = c1
agent1.sources.r1.type = exec
agent1.sources.r1.command = tail -F /root/logs/123.log
# set sink1
agent1.sinks.k1.channel = c1
agent1.sinks.k1.type = avro
agent1.sinks.k1.hostname = node02
agent1.sinks.k1.port = 52020
# set sink2
agent1.sinks.k2.channel = c1
agent1.sinks.k2.type = avro
agent1.sinks.k2.hostname = node03
agent1.sinks.k2.port = 52020
#set sink group
agent1.sinkgroups.g1.sinks = k1 k2
#set load balance
agent1.sinkgroups.g1.processor.type = load_balance
agent1.sinkgroups.g1.processor.backoff = true
agent1.sinkgroups.g1.processor.selector = round_robin
agent1.sinkgroups.g1.processor.selector.maxTimeOut=10000
- node02, config file named avro-logger.conf
# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1
# Describe/configure the source
a1.sources.r1.type = avro
a1.sources.r1.channels = c1
a1.sources.r1.bind = node02
a1.sources.r1.port = 52020
# Describe the sink
a1.sinks.k1.type = logger
# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
- node03, config file named avro-logger.conf
# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1
# Describe/configure the source
a1.sources.r1.type = avro
a1.sources.r1.channels = c1
a1.sources.r1.bind = node03
a1.sources.r1.port = 52020
# Describe the sink
a1.sinks.k1.type = logger
# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
(5) Starting the chained agents
Start from the far end, away from the data source; this avoids losing data while upstream agents come up
Start node03:
bin/flume-ng agent -c conf -f conf/avro-logger.conf -n a1 -Dflume.root.logger=INFO,console
Start node02:
bin/flume-ng agent -c conf -f conf/avro-logger.conf -n a1 -Dflume.root.logger=INFO,console
Start node01:
bin/flume-ng agent -c conf -f conf/exec-avro.conf -n agent1 -Dflume.root.logger=INFO,console
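To see the round-robin distribution, generate traffic on node01 (the path comes from exec-avro.conf) and watch events alternate between the node02 and node03 consoles:
while true; do date >> /root/logs/123.log; sleep 0.5; done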
2. Failover
Building on the setup above, only node01's sink-group processor settings change to the failover configuration below; everything else stays the same. All traffic then flows to the highest-priority sink (k1 here); if it fails, Flume fails over to the next priority and backs off from the failed sink for up to maxpenalty milliseconds.
#set failover
agent1.sinkgroups.g1.processor.type = failover
agent1.sinkgroups.g1.processor.priority.k1 = 10
agent1.sinkgroups.g1.processor.priority.k2 = 1
agent1.sinkgroups.g1.processor.maxpenalty = 10000
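A quick way to watch the failover (a sketch): with all three agents running and traffic flowing, kill the node02 agent; new events should start appearing on node03's console instead.
# on node02: a Flume agent shows up as "Application" in jps
kill $(jps | awk '/Application/{print $1}')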
Flume Interceptor Example
1. Static interceptor
(1) Use case:
An interceptor lets Flume intercept data as it enters a source and stamp a k-v marker into the event headers; when the data later lands on HDFS, that marker can be used to tell the streams apart and store each in its own place
Without the static interceptor:
Event: { headers:{} body: 36 Sun Jun 2 18:26 }
With the static interceptor, we attach our own k-v markers:
Event: { headers:{type=access} body: 36 Sun Jun 2 18:26 }
Event: { headers:{type=nginx} body: 36 Sun Jun 2 18:26 }
Event: { headers:{type=web} body: 36 Sun Jun 2 18:26 }
When the data is written out later, Flume's escape syntax can read back the k-v content the interceptor added:
%{type}
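For instance, an HDFS path pattern that consumes this header and would match the required layout in the case below might look like this (a sketch; the actual node02 config later in this section writes under flume2/logs instead):
a1.sinks.k1.hdfs.path = /source/logs/%{type}/%Y%m%d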
(2) The case
- Scenario
Two log servers, A and B, produce logs in real time, mainly access.log, nginx.log, and web.log
The requirement:
collect the access.log, nginx.log, and web.log from machines A and B onto machine C, then ship everything to HDFS.
The HDFS directory layout must be:
/source/logs/access/20160101/**
/source/logs/nginx/20160101/**
/source/logs/web/20160101/**
- On node01, create the exec_source_avro_sink.conf file
# Name the components on this agent
a1.sources = r1 r2 r3
a1.sinks = k1
a1.channels = c1
# Describe/configure the source
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /root/logs/access.log
# attach an interceptor
a1.sources.r1.interceptors = i1
# make it a static interceptor
a1.sources.r1.interceptors.i1.type = static
a1.sources.r1.interceptors.i1.key = type
a1.sources.r1.interceptors.i1.value = access
a1.sources.r2.type = exec
a1.sources.r2.command = tail -F /root/logs/nginx.log
a1.sources.r2.interceptors = i2
a1.sources.r2.interceptors.i2.type = static
a1.sources.r2.interceptors.i2.key = type
a1.sources.r2.interceptors.i2.value = nginx
a1.sources.r3.type = exec
a1.sources.r3.command = tail -F /root/logs/web.log
a1.sources.r3.interceptors = i3
a1.sources.r3.interceptors.i3.type = static
a1.sources.r3.interceptors.i3.key = type
a1.sources.r3.interceptors.i3.value = web
# Describe the sink
a1.sinks.k1.type = avro
a1.sinks.k1.hostname = node02
a1.sinks.k1.port = 41414
# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 2000
a1.channels.c1.transactionCapacity = 100
# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sources.r2.channels = c1
a1.sources.r3.channels = c1
a1.sinks.k1.channel = c1
- On node02, create the avro_source_hdfs_sink.conf file
#define the agent and the names of its source, channel, and sink
a1.sources = r1
a1.sinks = k1
a1.channels = c1
#define the source
a1.sources.r1.type = avro
a1.sources.r1.bind = node02
a1.sources.r1.port =41414
#add a timestamp interceptor
a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = org.apache.flume.interceptor.TimestampInterceptor$Builder
#define the channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 20000
a1.channels.c1.transactionCapacity = 10000
#define the sink
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path=flume2/logs/%{type}/%Y%m%d
a1.sinks.k1.hdfs.filePrefix =events
a1.sinks.k1.hdfs.fileType = DataStream
a1.sinks.k1.hdfs.writeFormat = Text
#time handling: useLocalTimeStamp is unnecessary here because the timestamp interceptor on the source already sets the timestamp header
#a1.sinks.k1.hdfs.useLocalTimeStamp = true
#do not roll files by event count
a1.sinks.k1.hdfs.rollCount = 0
#roll every 20 seconds
a1.sinks.k1.hdfs.rollInterval = 20
#roll by file size (10485760 bytes = 10 MB)
a1.sinks.k1.hdfs.rollSize = 10485760
#number of events written to HDFS per batch
a1.sinks.k1.hdfs.batchSize = 20
#number of threads Flume uses for HDFS operations (create, write, etc.)
a1.sinks.k1.hdfs.threadsPoolSize=10
#timeout for HDFS operations, in milliseconds
a1.sinks.k1.hdfs.callTimeout=30000
#wire up source, channel, and sink
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
- Start Flume on node02
bin/flume-ng agent -c ./conf -f ./conf/avro_source_hdfs_sink.conf -n a1 -Dflume.root.logger=INFO,console
- Start Flume on node01
bin/flume-ng agent -c ./conf -f ./conf/exec_source_avro_sink.conf -n a1 -Dflume.root.logger=INFO,console
- Simulate data generation
Run a few shell one-liners to produce data (one per log file):
while true; do echo "access access....." >> /root/logs/access.log;sleep 0.5;done
while true; do echo "web web....." >> /root/logs/web.log;sleep 0.5;done
while true; do echo "nginx nginx....." >> /root/logs/nginx.log;sleep 0.5;done