Unified versions:
apache-flume: 1.8.0
jdk: 1.8
hadoop: 2.7.5
zk: 3.4.9
Flume user guide
Flume Overview and Architecture
(1) Overview
Flume is a tool for collecting, transporting, and aggregating massive amounts of data in big data systems. It is specifically about the flow, or movement, of data: taking data from one storage medium and delivering it via Flume into another.
(2) Core components
source: connects to the various data sources
sink: connects to the various destinations where data is written (where the data "sinks" to)
channel: temporarily buffers data in between
(3) Flume collection system structure
(4) How it runs
Flume itself is a Java program; it is started as an agent process on each machine where data needs to be collected.
An agent process contains: source, sink, channel.
Inside Flume, data is wrapped as events; the real payload lives in the event body. The event is the smallest unit of data in Flume.
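Schematically, an event is a map of headers plus a body of raw bytes, the same shape the logger sink prints later in this guide:
Event: { headers:{key=value, ...} body: <raw bytes> }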
(5) Deployment architectures
Simple architecture: a single agent is enough
Complex architecture: multiple agents chained together; in a chained setup there is no master/slave distinction, all agents are peers
Flume Installation and Deployment
(1) Upload and extract
Upload the installation package to the node where the data source lives (node02 in my setup)
tar -zxvf apache-flume-1.8.0-bin.tar.gz -C /export/servers/
(2) Edit the configuration file
Copy the configuration template and edit it over a Notepad++ remote connection
cp flume-env.sh.template flume-env.sh
In conf/flume-env.sh, set the Java environment variable,
so that Flume is guaranteed to resolve the JDK correctly when it runs
This file has no execute permission by default, so set it:
chmod 755 flume-env.sh
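A minimal flume-env.sh entry looks like this (the JDK path is an example; substitute your own install location):
# conf/flume-env.sh -- point Flume at the JDK
export JAVA_HOME=/export/servers/jdk1.8.0_141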
(3) A small demo to verify the environment works
cd /export/servers/apache-flume-1.8.0/conf
touch netcat-logger.conf
Connect to node02 with Notepad++ and edit the netcat-logger.conf file
- Contents (a ready-made example lives in the Flume user guide; copy and paste as-is)
# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1
# Describe/configure the source
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444
# Describe the sink
a1.sinks.k1.type = logger
# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
- Start
Since we have not set up environment variables for Flume, run the following from Flume's root directory:
bin/flume-ng agent --conf conf --conf-file conf/netcat-logger.conf --name a1 -Dflume.root.logger=INFO,console
flume-ng marks this as the next-generation Flume; --conf conf points at the configuration directory; --conf-file conf/netcat-logger.conf gives the path of the concrete collection-plan file;
--name a1 names this agent process a1; -Dflume.root.logger=INFO,console turns on logging and prints it to the console
- Short-form command
bin/flume-ng agent -c ./conf -f ./conf/netcat-logger.conf -n a1 -Dflume.root.logger=INFO,console
-c gives the configuration directory, -f gives the collection-plan file, and the final flag enables console logging
- Connect to port 44444 and send data
First install a client tool:
yum -y install telnet
telnet localhost 44444
Type hello world in the telnet window
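If everything is wired up, telnet echoes OK and the agent console prints the event, roughly like this (the body is shown as hex bytes plus text):
Event: { headers:{} body: 68 65 6C 6C 6F 20 77 6F 72 6C 64 0D hello world. }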
Flume collection configuration template
A collection plan breaks down into five parts:
1. # Name the components on this agent (name the components)
2. # Describe/configure the source (configure the source)
3. # Describe the sink (configure the sink)
4. # Use a channel which buffers events in memory (configure the channel)
5. # Bind the source and sink to the channel (bind the source and sink to the two ends of the channel)
Flume Collection Examples
1. Full collection of a directory to HDFS
(1) Requirement:
A server keeps producing new files under a particular directory; whenever a new file appears, it must be collected into HDFS
- Collection source, i.e. the source (monitors a directory): Spooling Directory Source
- Sink target, i.e. the sink (HDFS file system): HDFS Sink
- Channel between source and sink: Memory Channel (matching the config below)
# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1
# Describe/configure the source
##Note: never drop files with the same name into the monitored directory
a1.sources.r1.type = spooldir
a1.sources.r1.spoolDir = /root/logs1
a1.sources.r1.fileHeader = true
# Describe the sink
a1.sinks.k1.type = hdfs
a1.sinks.k1.channel = c1
a1.sinks.k1.hdfs.path = /flume/events/%y-%m-%d/%H%M/
a1.sinks.k1.hdfs.filePrefix = events-
a1.sinks.k1.hdfs.round = true
a1.sinks.k1.hdfs.roundValue = 10
a1.sinks.k1.hdfs.roundUnit = minute
a1.sinks.k1.hdfs.rollInterval = 60
a1.sinks.k1.hdfs.rollSize = 0
a1.sinks.k1.hdfs.rollCount = 0
a1.sinks.k1.hdfs.batchSize = 1
a1.sinks.k1.hdfs.useLocalTimeStamp = true
#Generated file type; the default is SequenceFile, while DataStream writes plain text
a1.sinks.k1.hdfs.fileType = DataStream
# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
(2) Parameter breakdown:
The roll settings control how the file being written to HDFS rolls over (file rolling):
a1.sinks.k1.hdfs.rollInterval = 3 (roll by time interval, in seconds)
a1.sinks.k1.hdfs.rollSize = 20 (roll by file size, in bytes)
a1.sinks.k1.hdfs.rollCount = 5 (roll by event count)
If all three are configured, whichever threshold is reached first triggers the roll.
To disable rolling on one dimension, set that property to 0.
The round settings enable rounding of the timestamp and control how often a new folder is created (folder rolling; one folder can hold many files).
With the example below, a new folder is generated every 10 minutes:
a1.sinks.k1.hdfs.round = true
a1.sinks.k1.hdfs.roundValue = 10
a1.sinks.k1.hdfs.roundUnit = minute
capacity: the maximum number of events the channel can store
transactionCapacity: the maximum number of events taken from the source, or given to the sink, in one transaction
(3) Start monitoring (the config above is saved as conf/spooldir-hdfs.conf)
bin/flume-ng agent -c ./conf -f ./conf/spooldir-hdfs.conf -n a1 -Dflume.root.logger=INFO,console
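To exercise it, create the watched directory and drop in a test file; a sketch using the paths from the config above:
# on the agent node: create the directory the spooldir source watches
mkdir -p /root/logs1
# drop in a uniquely named file; spooldir renames it to *.COMPLETED once collected
echo "hello flume" > /root/logs1/test-$(date +%s).log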
2. The single most important caveat
- Never drop a file with a duplicate name into the monitored directory. If a same-named file arrives, Flume raises an error and goes on strike: it stops monitoring and collecting data from then on
- How to guarantee unique names: in practice, filenames are usually given a timestamp suffix so that no two files share a name, as sketched below
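A minimal sketch of the timestamp-naming convention (access.log is a stand-in for whatever file is being handed off):
# rename with a millisecond epoch suffix before moving into the watched directory
# (%3N requires GNU date)
mv access.log /root/logs1/access.log.$(date +%s%3N)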
3. Incremental collection of a file to HDFS
(1) Requirement:
The business system writes its logs with log4j, and the log file keeps growing; we need to collect the newly generated log lines to HDFS in real time
- The three key elements:
- Collection source (monitors appended file content): Exec Source
- Sink target (HDFS): HDFS Sink
- Channel carrying data from source to sink: Memory Channel
# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1
# Describe/configure the source
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /root/log/test.log
a1.sources.r1.channels = c1
# Describe the sink
a1.sinks.k1.type = hdfs
a1.sinks.k1.channel = c1
a1.sinks.k1.hdfs.path = /flume1/tailout/%y-%m-%d/%H-%M/
a1.sinks.k1.hdfs.filePrefix = events-
a1.sinks.k1.hdfs.round = true
a1.sinks.k1.hdfs.roundValue = 10
a1.sinks.k1.hdfs.roundUnit = minute
a1.sinks.k1.hdfs.rollInterval = 3
a1.sinks.k1.hdfs.rollSize = 20
a1.sinks.k1.hdfs.rollCount = 5
a1.sinks.k1.hdfs.batchSize = 1
a1.sinks.k1.hdfs.useLocalTimeStamp = true
#Generated file type; the default is SequenceFile, while DataStream writes plain text
a1.sinks.k1.hdfs.fileType = DataStream
# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
(2) Write a shell script to simulate data generation
vim addDate.sh
with the following contents:
#!/bin/bash
while true
do
date >> /root/log/test.log
sleep 0.5   # throttle so the log grows at a readable rate
done
(3) Start the Flume monitor (the config above is saved as conf/tail-hdfs.conf)
bin/flume-ng agent -c ./conf -f ./conf/tail-hdfs.conf -n a1 -Dflume.root.logger=INFO,console
# start the data-generating script
sh addDate.sh
Flume load-balance and failover
1. Flume load balancing
(1) Use case:
When several Flume agents are chained and one node's machine cannot keep up, large amounts of data can pile up. The fix is to run several downstream agents in parallel; a parallel group brings in a load-balancing strategy for distributing work (round-robin, random, weighted), while ensuring each event leaves through exactly one sink so no data is duplicated.
(2) Distribute Flume to the other two nodes
scp -r /export/servers/flume node01:$PWD
scp -r /export/servers/flume node03:$PWD
(3) Chaining Flume agents across the network
- avro sink
- avro source
Bind these two components to an IP and port, and data can be passed across the network; this pairing is the usual glue in chained Flume architectures
(4) Chained-agent configuration
- node01, config file named exec-avro.conf
#agent1 name
agent1.channels = c1
agent1.sources = r1
agent1.sinks = k1 k2
#set group
agent1.sinkgroups = g1
#set channel
agent1.channels.c1.type = memory
agent1.channels.c1.capacity = 1000
agent1.channels.c1.transactionCapacity = 100
agent1.sources.r1.channels = c1
agent1.sources.r1.type = exec
agent1.sources.r1.command = tail -F /root/logs/123.log
# set sink1
agent1.sinks.k1.channel = c1
agent1.sinks.k1.type = avro
agent1.sinks.k1.hostname = node02
agent1.sinks.k1.port = 52020
# set sink2
agent1.sinks.k2.channel = c1
agent1.sinks.k2.type = avro
agent1.sinks.k2.hostname = node03
agent1.sinks.k2.port = 52020
#set sink group
agent1.sinkgroups.g1.sinks = k1 k2
#set load balance
agent1.sinkgroups.g1.processor.type = load_balance
agent1.sinkgroups.g1.processor.backoff = true
agent1.sinkgroups.g1.processor.selector = round_robin
agent1.sinkgroups.g1.processor.selector.maxTimeOut=10000
- node02, config file named avro-logger.conf
# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1
# Describe/configure the source
a1.sources.r1.type = avro
a1.sources.r1.channels = c1
a1.sources.r1.bind = node02
a1.sources.r1.port = 52020
# Describe the sink
a1.sinks.k1.type = logger
# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
- node03, config file named avro-logger.conf
# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1
# Describe/configure the source
a1.sources.r1.type = avro
a1.sources.r1.channels = c1
a1.sources.r1.bind = node03
a1.sources.r1.port = 52020
# Describe the sink
a1.sinks.k1.type = logger
# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
(5) Starting the chained agents
Start from the far end, away from the data source; this avoids losing data while upstream agents come up
Start node03:
bin/flume-ng agent -c conf -f conf/avro-logger.conf -n a1 -Dflume.root.logger=INFO,console
Start node02:
bin/flume-ng agent -c conf -f conf/avro-logger.conf -n a1 -Dflume.root.logger=INFO,console
Start node01:
bin/flume-ng agent -c conf -f conf/exec-avro.conf -n agent1 -Dflume.root.logger=INFO,console
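To see the round-robin distribution, generate traffic on node01 (the path comes from exec-avro.conf) and watch events alternate between the node02 and node03 consoles:
while true; do date >> /root/logs/123.log; sleep 0.5; done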
2. Failover
Building on the setup above, only node01's sink-group processor settings change to the failover configuration below; everything else stays the same. All traffic then flows to the highest-priority sink (k1 here); if it fails, Flume fails over to the next priority and backs off from the failed sink for up to maxpenalty milliseconds.
#set failover
agent1.sinkgroups.g1.processor.type = failover
agent1.sinkgroups.g1.processor.priority.k1 = 10
agent1.sinkgroups.g1.processor.priority.k2 = 1
agent1.sinkgroups.g1.processor.maxpenalty = 10000
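A quick way to watch the failover (a sketch): with all three agents running and traffic flowing, kill the node02 agent; new events should start appearing on node03's console instead.
# on node02: a Flume agent shows up as "Application" in jps
kill $(jps | awk '/Application/{print $1}')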
Flume Interceptor Example
1. Static interceptor
(1) Use case:
An interceptor lets Flume intercept data as it enters a source and stamp a k-v marker into the event headers; when the data later lands on HDFS, that marker can be used to tell the streams apart and store each in its own place
Without the static interceptor:
Event: { headers:{} body: 36 Sun Jun 2 18:26 }
With the static interceptor, we attach our own k-v markers:
Event: { headers:{type=access} body: 36 Sun Jun 2 18:26 }
Event: { headers:{type=nginx} body: 36 Sun Jun 2 18:26 }
Event: { headers:{type=web} body: 36 Sun Jun 2 18:26 }
When the data is written out later, Flume's escape syntax can read back the k-v content the interceptor added:
%{type}
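For instance, an HDFS path pattern that consumes this header and would match the required layout in the case below might look like this (a sketch; the actual node02 config later in this section writes under flume2/logs instead):
a1.sinks.k1.hdfs.path = /source/logs/%{type}/%Y%m%d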
(2) The case
- Scenario
Two log servers, A and B, produce logs in real time, mainly access.log, nginx.log, and web.log
The requirement:
collect the access.log, nginx.log, and web.log from machines A and B onto machine C, then ship everything to HDFS.
The HDFS directory layout must be:
/source/logs/access/20160101/**
/source/logs/nginx/20160101/**
/source/logs/web/20160101/**
- On node01, create the exec_source_avro_sink.conf file
# Name the components on this agent
a1.sources = r1 r2 r3
a1.sinks = k1
a1.channels = c1
# Describe/configure the source
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /root/logs/access.log
# attach an interceptor
a1.sources.r1.interceptors = i1
# make it a static interceptor
a1.sources.r1.interceptors.i1.type = static
a1.sources.r1.interceptors.i1.key = type
a1.sources.r1.interceptors.i1.value = access
a1.sources.r2.type = exec
a1.sources.r2.command = tail -F /root/logs/nginx.log
a1.sources.r2.interceptors = i2
a1.sources.r2.interceptors.i2.type = static
a1.sources.r2.interceptors.i2.key = type
a1.sources.r2.interceptors.i2.value = nginx
a1.sources.r3.type = exec
a1.sources.r3.command = tail -F /root/logs/web.log
a1.sources.r3.interceptors = i3
a1.sources.r3.interceptors.i3.type = static
a1.sources.r3.interceptors.i3.key = type
a1.sources.r3.interceptors.i3.value = web
# Describe the sink
a1.sinks.k1.type = avro
a1.sinks.k1.hostname = node02
a1.sinks.k1.port = 41414
# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 2000
a1.channels.c1.transactionCapacity = 100
# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sources.r2.channels = c1
a1.sources.r3.channels = c1
a1.sinks.k1.channel = c1
- On node02, create the avro_source_hdfs_sink.conf file
#define the agent and the names of its source, channel, and sink
a1.sources = r1
a1.sinks = k1
a1.channels = c1
#define the source
a1.sources.r1.type = avro
a1.sources.r1.bind = node02
a1.sources.r1.port =41414
#add a timestamp interceptor
a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = org.apache.flume.interceptor.TimestampInterceptor$Builder
#define the channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 20000
a1.channels.c1.transactionCapacity = 10000
#define the sink
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path=flume2/logs/%{type}/%Y%m%d
a1.sinks.k1.hdfs.filePrefix =events
a1.sinks.k1.hdfs.fileType = DataStream
a1.sinks.k1.hdfs.writeFormat = Text
#time handling: useLocalTimeStamp is unnecessary here because the timestamp interceptor on the source already sets the timestamp header
#a1.sinks.k1.hdfs.useLocalTimeStamp = true
#do not roll files by event count
a1.sinks.k1.hdfs.rollCount = 0
#roll every 20 seconds
a1.sinks.k1.hdfs.rollInterval = 20
#roll by file size (10485760 bytes = 10 MB)
a1.sinks.k1.hdfs.rollSize = 10485760
#number of events written to HDFS per batch
a1.sinks.k1.hdfs.batchSize = 20
#number of threads Flume uses for HDFS operations (create, write, etc.)
a1.sinks.k1.hdfs.threadsPoolSize=10
#timeout for HDFS operations, in milliseconds
a1.sinks.k1.hdfs.callTimeout=30000
#wire up source, channel, and sink
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
- Start Flume on node02
bin/flume-ng agent -c ./conf -f ./conf/avro_source_hdfs_sink.conf -n a1 -Dflume.root.logger=INFO,console
- Start Flume on node01
bin/flume-ng agent -c ./conf -f ./conf/exec_source_avro_sink.conf -n a1 -Dflume.root.logger=INFO,console
- Simulate data generation
Run a few shell one-liners to produce data (one per log file):
while true; do echo "access access....." >> /root/logs/access.log;sleep 0.5;done
while true; do echo "web web....." >> /root/logs/web.log;sleep 0.5;done
while true; do echo "nginx nginx....." >> /root/logs/nginx.log;sleep 0.5;done