Flume in Practice

痴迷的小小工匠

Category: hadoop. Tags: flume

Article link: https://blog.csdn.net/li1019865596/article/details/116672997

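I. Introduction

Flume was originally developed by Cloudera and is now an Apache project. It is a distributed, highly reliable, highly available system for collecting, aggregating, and transporting large volumes of log data.

Flume lets you plug custom data senders into a logging system to collect data. It can also apply simple processing to that data and write it to a variety of data receivers.

Put simply, Flume is a data collection engine for gathering logs in real time.

Flume has three key components: Source, Channel, and Sink.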

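Features:

Distributed: Flume is deployed as a distributed cluster and scales well
Reliable: when a node fails, logs can be delivered to another node instead of being lost
Usability: configuring and operating Flume is fairly cumbersome and demands solid technical skills from its users
Real-time collection: Flume collects data in streaming mode, in real time

Use case: real-time collection of log files.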

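Deployment modes

The aggregation mode is the most common and also very practical. A typical web application runs on hundreds of servers, and large ones on thousands or even tens of thousands, so the logs they produce are hard to handle. This topology solves the problem well: each server runs a Flume agent that collects its logs and forwards them to a central log-collecting Flume agent, which in turn writes them to HDFS, Hive, HBase, or a message queue.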

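Flume internals

Overall data flow: Source => Channel => Sink

On the way into the Channel, events pass through the channel processor, the interceptors, and the channel selector.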

Installation steps

1. Download apache-flume-1.9.0-bin.tar.gz and upload it to the /opt/software directory on linux123

2. Extract apache-flume-1.9.0-bin.tar.gz into /opt/servers/ and rename the resulting directory to flume-1.9.0

3. Add the environment variables below to /etc/profile, then run source /etc/profile so the change takes effect
export FLUME_HOME=/opt/servers/flume-1.9.0
export PATH=$PATH:$FLUME_HOME/bin
4. Rename flume-env.sh.template under $FLUME_HOME/conf to flume-env.sh and add the JAVA_HOME setting to it
cd $FLUME_HOME/conf
mv flume-env.sh.template flume-env.sh
vi flume-env.sh
# add this line inside flume-env.sh
export JAVA_HOME=/opt/servers/jdk1.8.0_231

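II. Usage

Flume supports many kinds of data sources, such as directories, HTTP, and Kafka.

Common sources include netcat, exec, spooldir, taildir, and avro, all of which appear in the cases below.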

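Collected logs need to be buffered, and Flume provides the Channel component for that. Common Channels are:

(1) memory channel: buffers events in memory (the most commonly used)

(2) file channel: buffers events in files on disk

(3) JDBC channel: buffers events in a relational database through JDBC

(4) kafka channel: buffers events in Kafka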

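Buffered data eventually has to be stored, and Flume provides the Sink component for that. Common Sinks are:

(1) logger sink: displays events on standard output; mainly used for testing

(2) avro sink: converts the Flume events it takes from the channel into Avro events and sends them to the configured hostname/port. Events are pulled from the configured channel in batches of the configured batch size

(3) null sink: discards every event it receives

(4) HDFS sink: writes events to HDFS. It can create text and sequence files, with compression supported for both file types. Files can be rolled periodically based on elapsed time, size, or number of events

(5) Hive sink: streams events containing delimited text or JSON data directly into a Hive table or partition. Events are written using Hive transactions; once a batch of events is committed to Hive, it is immediately visible to Hive queries

(6) HBase sink: stores events in HBase

(7) kafka sink: stores events in Kafka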

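Cases

1) Listen on local port 8888 and display the received data on the console in real time

Business requirement:

Listen on port 8888 of the local machine; Flume displays the data it receives on the console in real time

Requirement analysis:

The telnet tool can be used to send data to port 8888

To listen on a port, choose the netcat source

For the channel, choose memory

To display data in real time, choose the logger sink

Testing

Install the telnet tool
yum install telnet

Check whether port 8888 is already in use
lsof -i:8888

cd $FLUME_HOME/conf
vim flume-netcat-logger.conf
Add the following content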

a1.sources = r1
a1.channels = c1
a1.sinks = k1
# source
a1.sources.r1.type = netcat
a1.sources.r1.bind = linux128
a1.sources.r1.port = 8888
# channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000
a1.channels.c1.transactionCapacity = 100
# sink
a1.sinks.k1.type = logger
# wiring between source, channel, and sink
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

Run the command
$FLUME_HOME/bin/flume-ng agent --name a1 --conf-file $FLUME_HOME/conf/flume-netcat-logger.conf -Dflume.root.logger=INFO,console

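2) Monitor a log file and write it to HDFS

Business requirement: monitor a local log file and upload new content to HDFS in real time

Requirement analysis:

The tail -F command picks up new lines as they are written to a local log file

For the source, choose exec. The exec source runs a specified command and uses its output as the data source. If the agent process dies and is restarted, data may be lost

For the channel, choose memory

For the sink, choose HDFS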

cd /opt/servers/hadoop-2.9.2/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/lib

cd /opt/servers/hive-2.3.7/conf
vim $FLUME_HOME/conf/flume-exec-hdfs.conf
Add the following content
a2.sources = r2
a2.sinks = k2
a2.channels = c2

# Describe/configure the source
a2.sources.r2.type = exec
# any file readable by root that receives log output will do
a2.sources.r2.command = tail -F /tmp/root/stderr

# Use a channel which buffers events in memory
a2.channels.c2.type = memory
a2.channels.c2.capacity = 10000
a2.channels.c2.transactionCapacity = 500

# Describe the sink
a2.sinks.k2.type = hdfs
a2.sinks.k2.hdfs.path = hdfs://linux126:9000/flume/%Y%m%d/%H%M

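# prefix for uploaded files
a2.sinks.k2.hdfs.filePrefix = logs-
# use the local timestamp
a2.sinks.k2.hdfs.useLocalTimeStamp = true
# flush to HDFS once 500 events have accumulated
a2.sinks.k2.hdfs.batchSize = 500
# file type; compression is supported, but DataStream writes uncompressed data
a2.sinks.k2.hdfs.fileType = DataStream
# roll the file every 60 seconds
a2.sinks.k2.hdfs.rollInterval = 60
# roll the file at about 128 MB (134217700 bytes)
a2.sinks.k2.hdfs.rollSize = 134217700
# 0 means rolling does not depend on the number of events
a2.sinks.k2.hdfs.rollCount = 0
# minimum number of block replicas
a2.sinks.k2.hdfs.minBlockReplicas = 1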
# Bind the source and sink to the channel
a2.sources.r2.channels = c2
a2.sinks.k2.channel = c2
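
The hdfs.path value above contains time escapes (%Y%m%d/%H%M) that the sink fills in from a timestamp, here the local time because useLocalTimeStamp is true. As a rough illustration only, the Python sketch below mimics that expansion with strftime. The helper name expand_hdfs_path is hypothetical, this is not Flume's implementation, and it only covers the five escapes used in this config.

```python
from datetime import datetime


def expand_hdfs_path(template: str, event_time: datetime) -> str:
    """Expand the %Y, %m, %d, %H, %M escapes used in the config above.

    Hypothetical helper: for these five escapes, Python's strftime uses
    the same letters and meanings as the escapes in hdfs.path.
    """
    return event_time.strftime(template)


template = "hdfs://linux126:9000/flume/%Y%m%d/%H%M"
first = expand_hdfs_path(template, datetime(2021, 5, 11, 9, 30, 15))
later = expand_hdfs_path(template, datetime(2021, 5, 11, 9, 31, 2))
print(first)  # hdfs://linux126:9000/flume/20210511/0930
print(later)  # hdfs://linux126:9000/flume/20210511/0931
```

Because the path goes down to the minute, events that arrive in different minutes end up in different directories, regardless of the roll settings.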


Run the command
$FLUME_HOME/bin/flume-ng agent --name a2 --conf-file $FLUME_HOME/conf/flume-exec-hdfs.conf -Dflume.root.logger=INFO,console

Start Hadoop and Hive, then use Hive to generate log output
start-dfs.sh
start-yarn.sh
# run this several times from the command line
hive -e "show databases"

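3) Monitor a directory and write its files to HDFS

Business requirement:

Monitor a specified directory and upload the collected data to HDFS

Requirement analysis:

For the source, choose spooldir. spooldir guarantees that no data is lost and can resume where it left off, but its latency is higher and it cannot monitor in real time

For the channel, choose memory

For the sink, choose HDFS

Note:

The source watches the whole directory, but only at file granularity: if content is appended to a file that has already been processed, the source does not notice it. Files copied into the spool directory must not be opened and edited again

vim $FLUME_HOME/conf/flume-spooldir-hdfs.conf
Add the following content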

# Name the components on this agent
a3.sources = r3
a3.channels = c3
a3.sinks = k3
# Describe/configure the source
a3.sources.r3.type = spooldir
a3.sources.r3.spoolDir = /root/upload
a3.sources.r3.fileSuffix = .COMPLETED
a3.sources.r3.fileHeader = true
# ignore files ending in .tmp; they are not uploaded
a3.sources.r3.ignorePattern = ([^ ]*\.tmp)
# Use a channel which buffers events in memory
a3.channels.c3.type = memory
a3.channels.c3.capacity = 10000
a3.channels.c3.transactionCapacity = 500
# Describe the sink
a3.sinks.k3.type = hdfs
a3.sinks.k3.hdfs.path = hdfs://linux126:9000/flume/upload/%Y%m%d/%H%M
# prefix for uploaded files
a3.sinks.k3.hdfs.filePrefix = upload-
# use the local timestamp
a3.sinks.k3.hdfs.useLocalTimeStamp = true
# flush to HDFS once 500 events have accumulated
a3.sinks.k3.hdfs.batchSize = 500
# file type
a3.sinks.k3.hdfs.fileType = DataStream
# roll the file every 60 seconds
a3.sinks.k3.hdfs.rollInterval = 60
# roll the file at about 128 MB (134217700 bytes)
a3.sinks.k3.hdfs.rollSize = 134217700
# 0 means rolling does not depend on the number of events
a3.sinks.k3.hdfs.rollCount = 0
# minimum number of block replicas
a3.sinks.k3.hdfs.minBlockReplicas = 1
# Bind the source and sink to the channel
a3.sources.r3.channels = c3
a3.sinks.k3.channel = c3
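
To see which file names the ignorePattern in this config would skip, the pattern can be tried out in Python. This is only a sketch: it assumes the pattern must match the whole file name (which is how Java's Matcher.matches() behaves), and the helper name is_ignored and the sample file names are made up for illustration.

```python
import re

# Same pattern as a3.sources.r3.ignorePattern in the config above.
IGNORE_PATTERN = re.compile(r"([^ ]*\.tmp)")


def is_ignored(file_name: str) -> bool:
    """Return True if the whole file name matches the ignore pattern."""
    return IGNORE_PATTERN.fullmatch(file_name) is not None


# Hypothetical file names dropped into /root/upload
for name in ["access.log", "access.log.tmp", "part 1.tmp", "tmp.log"]:
    print(name, is_ignored(name))
```

Under that assumption, a name containing a space, such as "part 1.tmp", is not matched by [^ ]* and would therefore not be ignored.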

Run the command
$FLUME_HOME/bin/flume-ng agent --name a3 --conf-file $FLUME_HOME/conf/flume-spooldir-hdfs.conf -Dflume.root.logger=INFO,console

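4) Monitor log files and write the data to both HDFS and the local file system

Business requirement: monitor log files and upload the collected data to both HDFS and the local file system

Requirement analysis:

This requires several agents chained together

For the source, choose taildir

For the channel, choose memory

For the final sinks, choose hdfs and file_roll respectively

vim flume-taildir-avro.conf
Add the following content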
# Name the components on this agent
a1.sources = r1
a1.sinks = k1 k2
a1.channels = c1 c2
# replicate the data flow to every channel
a1.sources.r1.selector.type = replicating
# source
a1.sources.r1.type = taildir
# records the latest read position of each file
a1.sources.r1.positionFile = /root/flume/taildir_position.json
a1.sources.r1.filegroups = f1
# note: .*log is a regular expression; writing *.log here would be wrong
a1.sources.r1.filegroups.f1 = /tmp/root/.*log
# sink
a1.sinks.k1.type = avro
a1.sinks.k1.hostname = linux128
a1.sinks.k1.port = 9091
a1.sinks.k2.type = avro
a1.sinks.k2.hostname = linux128
a1.sinks.k2.port = 9092
# channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000
a1.channels.c1.transactionCapacity = 500
a1.channels.c2.type = memory
a1.channels.c2.capacity = 10000
a1.channels.c2.transactionCapacity = 500
# Bind the source and sink to the channel
a1.sources.r1.channels = c1 c2
a1.sinks.k1.channel = c1
a1.sinks.k2.channel = c2
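
The comment in this config about .*log versus *.log comes down to regular expression syntax versus shell glob syntax. The Python sketch below shows the difference. Python's re module stands in for Java's regex engine here, the helper name matches_filegroup is hypothetical, and only the file-name part of the pattern is tested.

```python
import re


def matches_filegroup(pattern: str, file_name: str) -> bool:
    """Check a file name against a filegroup pattern treated as a regex."""
    return re.fullmatch(pattern, file_name) is not None


print(matches_filegroup(".*log", "hive.log"))  # True
print(matches_filegroup(".*log", "1.log"))     # True
print(matches_filegroup(".*log", "hive.txt"))  # False

try:
    re.compile("*.log")  # glob syntax: '*' has nothing to repeat
except re.error as exc:
    print("invalid regex:", exc)
```

Note that .*log also matches a name such as catalog, since the dot before log is not required; a stricter pattern would be .*\.log.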


vim flume-avro-hdfs.conf
Add the following content

# Name the components on this agent
a2.sources = r1
a2.sinks = k1
a2.channels = c1
# Describe/configure the source
a2.sources.r1.type = avro
a2.sources.r1.bind = linux128
a2.sources.r1.port = 9091
# Describe the channel
a2.channels.c1.type = memory
a2.channels.c1.capacity = 10000
a2.channels.c1.transactionCapacity = 500
# Describe the sink
a2.sinks.k1.type = hdfs
a2.sinks.k1.hdfs.path = hdfs://linux126:9000/flume2/%Y%m%d/%H
# prefix for uploaded files
a2.sinks.k1.hdfs.filePrefix = flume2-
# use the local timestamp
a2.sinks.k1.hdfs.useLocalTimeStamp = true
# flush to HDFS once 500 events have accumulated
a2.sinks.k1.hdfs.batchSize = 500
# file type; compression is also supported
a2.sinks.k1.hdfs.fileType = DataStream
# start a new file every 60 seconds
a2.sinks.k1.hdfs.rollInterval = 60
a2.sinks.k1.hdfs.rollSize = 0
a2.sinks.k1.hdfs.rollCount = 0
a2.sinks.k1.hdfs.minBlockReplicas = 1
# Bind the source and sink to the channel
a2.sources.r1.channels = c1
a2.sinks.k1.channel = c1



vim flume-avro-file.conf
Add the following content

# Name the components on this agent
a3.sources = r1
a3.sinks = k1
a3.channels = c2
# Describe/configure the source
a3.sources.r1.type = avro
a3.sources.r1.bind = linux128
a3.sources.r1.port = 9092
# Describe the sink
a3.sinks.k1.type = file_roll
# the directory must be created in advance
a3.sinks.k1.sink.directory = /root/flume/output
# Describe the channel
a3.channels.c2.type = memory
a3.channels.c2.capacity = 10000
a3.channels.c2.transactionCapacity = 500
# Bind the source and sink to the channel
a3.sources.r1.channels = c2
a3.sinks.k1.channel = c2


Run the following commands
$FLUME_HOME/bin/flume-ng agent --name a3 --conf-file $FLUME_HOME/conf/flume-avro-file.conf -Dflume.root.logger=INFO,console&


$FLUME_HOME/bin/flume-ng agent --name a2 --conf-file $FLUME_HOME/conf/flume-avro-hdfs.conf -Dflume.root.logger=INFO,console&

$FLUME_HOME/bin/flume-ng agent --name a1 --conf-file $FLUME_HOME/conf/flume-taildir-avro.conf -Dflume.root.logger=INFO,console&

Note: the first time a1 starts it reports an error, because the position file is still empty and cannot be parsed. This can be ignored

5) Interceptors

Timestamp interceptor

Add the following to the configuration from case 1

# start of the new timestamp interceptor content
a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = timestamp
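# whether to keep a timestamp header that already exists on the event; default is false
a1.sources.r1.interceptors.i1.preserveExisting = false
# end of the new timestamp interceptor content

Save the modified configuration as timestamp.conf and start the agent: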

$FLUME_HOME/bin/flume-ng agent --name a1 --conf-file $FLUME_HOME/conf/timestamp.conf -Dflume.root.logger=INFO,console

Host interceptor

Building on the configuration above, add an i2 interceptor
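
To make the effect of these two interceptors concrete, here is a small Python model of what they do to an event's headers. It is a simplified sketch, not Flume code: the function names are hypothetical, headers are modelled as a plain dict, and the host interceptor here writes the hostname, whereas Flume's host interceptor writes the IP address by default.

```python
import socket
import time


def timestamp_interceptor(headers: dict, preserve_existing: bool = False) -> dict:
    """Add a 'timestamp' header holding the current time in epoch milliseconds."""
    if preserve_existing and "timestamp" in headers:
        return headers  # keep the value that is already there
    headers["timestamp"] = str(int(time.time() * 1000))
    return headers


def host_interceptor(headers: dict, preserve_existing: bool = False) -> dict:
    """Add a 'host' header identifying the machine that runs the agent."""
    if preserve_existing and "host" in headers:
        return headers
    # Assumption for this sketch: use the hostname rather than the IP address.
    headers["host"] = socket.gethostname()
    return headers


# Interceptors run in the order they are listed (i1, then i2).
event_headers = {"timestamp": "0"}
for interceptor in (timestamp_interceptor, host_interceptor):
    event_headers = interceptor(event_headers)
print(event_headers)
```

With preserveExisting left at false, the stale "0" timestamp is overwritten; with it set to true, the existing value would be kept.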

III. Transactions and Reliability

Flume has two transactions:

Put transaction: between the Source and the Channel
Take transaction: between the Channel and the Sink

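Put transaction

When the transaction starts, the doPut method is called; doPut places a batch of data into putList;
Before putList sends its data to the Channel, it checks whether the Channel has enough room. If the batch does not fit, none of it is put in and the only option is doRollback;
The size of a data batch is determined by the batch size configuration parameter;
The size of putList is determined by the Channel's transaction capacity parameter, which is what putList embodies (the Channel's other parameter, capacity, is the capacity of the Channel itself);
Once the data is safely in putList, doCommit can be called to move every Event in putList into the Channel; after they have all been moved, putList is cleared;

After doCommit is called, the step that moves data into the Channel is where the transaction is most likely to fail.

For example, if the Sink takes data slowly while the Source puts data quickly, data can easily pile up in the Channel.

What happens if the data in putList cannot be put into the Channel?

In that case doRollback is called, and it does two things:

Clears putList
Throws a ChannelException

The source catches the exception thrown by doRollback, collects the same batch of data again, and starts a new transaction. This is the transaction rollback.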
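
The put-side flow described above can be sketched as a toy model in Python. This is a deliberately simplified illustration, not Flume's code: the class MemoryChannelModel and the function deliver are hypothetical names, and the exception is raised from the commit step to keep the model short.

```python
from collections import deque


class ChannelException(Exception):
    """Raised when a put transaction cannot be completed."""


class MemoryChannelModel:
    """Toy channel with the capacity and transactionCapacity settings."""

    def __init__(self, capacity: int, transaction_capacity: int):
        self.capacity = capacity
        self.transaction_capacity = transaction_capacity
        self.queue = deque()  # events already committed to the channel
        self.put_list = []    # events staged by the current put transaction

    def do_put(self, event: str) -> None:
        if len(self.put_list) >= self.transaction_capacity:
            raise ChannelException("putList is full")
        self.put_list.append(event)

    def do_commit(self) -> None:
        # All or nothing: commit only if the whole putList fits.
        if len(self.queue) + len(self.put_list) > self.capacity:
            self.do_rollback()
            raise ChannelException("channel cannot hold the batch")
        self.queue.extend(self.put_list)
        self.put_list.clear()

    def do_rollback(self) -> None:
        self.put_list.clear()


def deliver(channel: MemoryChannelModel, batch: list) -> bool:
    """Source side: stage a batch, commit it, and report whether it succeeded."""
    try:
        for event in batch:
            channel.do_put(event)
        channel.do_commit()
        return True
    except ChannelException:
        channel.do_rollback()
        return False  # the source would collect this batch again and retry


channel = MemoryChannelModel(capacity=5, transaction_capacity=3)
print(deliver(channel, ["e1", "e2", "e3"]))  # True: 3 events fit
print(deliver(channel, ["e4", "e5", "e6"]))  # False: 3 + 3 exceeds capacity 5
print(list(channel.queue))                   # ['e1', 'e2', 'e3']
```

The second batch is rejected as a whole, which matches the "if it does not fit, none of it is put in" rule: the channel still holds only the first three events.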

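Take transaction

The doTake method cuts events out of the channel and into takeList. If the downstream sink is an HDFS Sink, then as each event is moved into takeList a copy is also placed in the I/O buffer used for writing to HDFS (data is first written to an I/O buffer and then flushed to HDFS);
Once takeList holds batch size events, doCommit is called, and it does two things:
1. For an HDFS Sink, explicitly calls flush on the I/O stream so that the buffered data is written to HDFS;
2. Clears takeList

The flush to HDFS is the step most likely to fail. It may time out because of network problems, so the transfer fails. doRollback is then called to roll back; because takeList still holds a copy of the data, the contents of takeList are returned to the channel unchanged, and the rollback is complete.

However, if the flush fails halfway, half of the data has already reached HDFS. doRollback still has to be called, and a rollback has no notion of "half": it returns the entire contents of takeList to the channel, and reading and writing then continue. When the next transaction starts, this easily produces duplicate data.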
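
The duplicate-data scenario described above can be reproduced with a small Python model. It is a sketch under simplifying assumptions: the function run_take_transaction and its fail_after parameter are hypothetical, HDFS is modelled as a plain list, and a failure is injected after a chosen number of events has been written.

```python
from collections import deque


def run_take_transaction(channel: deque, batch_size: int, hdfs: list, fail_after=None) -> bool:
    """Move up to batch_size events from the channel to hdfs.

    fail_after is a hypothetical knob: simulate a failure once that many
    events of the batch have already been written.
    """
    take_list = []
    while channel and len(take_list) < batch_size:
        take_list.append(channel.popleft())  # doTake: cut events out of the channel

    for written, event in enumerate(take_list):
        if fail_after is not None and written == fail_after:
            # doRollback: return the WHOLE takeList to the head of the channel
            channel.extendleft(reversed(take_list))
            return False
        hdfs.append(event)  # this event has reached HDFS
    return True  # doCommit: takeList is simply discarded


channel = deque(["e1", "e2", "e3", "e4"])
hdfs = []
run_take_transaction(channel, 4, hdfs, fail_after=2)  # fails halfway through
run_take_transaction(channel, 4, hdfs)                # the retry succeeds
print(hdfs)  # ['e1', 'e2', 'e1', 'e2', 'e3', 'e4']
```

Events e1 and e2 appear twice because they were written before the failure and then written again when the whole batch was retried; nothing is lost.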

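When Flume collects and transports data, it may duplicate data but does not lose it.
Whether Flume is actually reliable in transit also depends on the specific types of Source, Channel, and Sink in use.

Analysis of the source

exec Source followed by tail -f: data can be lost
TailDir Source: does not lose data; it guarantees that no data is lost

Analysis of the sink
HDFS Sink: data may be duplicated, but it is not lost

Finally, the channel. In principle, if data must not be lost, use a File channel; a memory channel can lose data when Flume crashes.
So with a TailDir source and an HDFS sink, data may be duplicated but is not lost.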

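IV. Configuring a Highly Available Flume Cluster

Copy Flume from linux128 to the linux126 and linux127 servers, and configure the environment variables on each
# run on linux128
cd /opt/servers
scp -r flume-1.9.0/ linux126:$PWD
scp -r flume-1.9.0/ linux127:$PWD


On linux128
vim $FLUME_HOME/conf/flume-cluster-taildir-avro.conf
Add the following content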
a1.sources = r1
a1.channels = c1
a1.sinks = k1 k2
# source
a1.sources.r1.type = TAILDIR
a1.sources.r1.positionFile =/root/flume_log/taildir_position.json
a1.sources.r1.filegroups = f1
a1.sources.r1.filegroups.f1 = /tmp/root/.*log
a1.sources.r1.fileHeader = true
# interceptor
a1.sources.r1.interceptors = i1 i2
a1.sources.r1.interceptors.i1.type = static
a1.sources.r1.interceptors.i1.key = Type
a1.sources.r1.interceptors.i1.value = LOGIN
# adds a timestamp to the event header
a1.sources.r1.interceptors.i2.type = timestamp
# channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000
a1.channels.c1.transactionCapacity = 500
# sink group
a1.sinkgroups = g1
a1.sinkgroups.g1.sinks = k1 k2
# set sink1
a1.sinks.k1.type = avro
a1.sinks.k1.hostname = linux126
a1.sinks.k1.port = 9999
# set sink2
a1.sinks.k2.type = avro
a1.sinks.k2.hostname = linux127
a1.sinks.k2.port = 9999
# set failover
a1.sinkgroups.g1.processor.type = failover
a1.sinkgroups.g1.processor.priority.k1 = 100
a1.sinkgroups.g1.processor.priority.k2 = 60
a1.sinkgroups.g1.processor.maxpenalty = 10000
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
a1.sinks.k2.channel = c1
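
With processor.type = failover, events go to the available sink with the highest priority, and a lower-priority sink takes over only when that one fails. The Python sketch below models just that selection rule. The function pick_sink is a hypothetical name, and the model leaves out the back-off that processor.maxpenalty bounds for a failed sink.

```python
def pick_sink(priorities: dict, alive: set):
    """Return the live sink with the highest priority, or None if all are down."""
    candidates = [name for name in priorities if name in alive]
    if not candidates:
        return None
    return max(candidates, key=lambda name: priorities[name])


priorities = {"k1": 100, "k2": 60}  # values from the config above

print(pick_sink(priorities, {"k1", "k2"}))  # k1
print(pick_sink(priorities, {"k2"}))        # k2
print(pick_sink(priorities, set()))         # None
```

In this setup k1 points at linux126 and k2 at linux127, so data normally flows through linux126 and switches to linux127 only while linux126 is unavailable.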


On linux126
vim $FLUME_HOME/conf/flume-cluster-avro-hdfs.conf
Add the following content

# set Agent name
a2.sources = r1
a2.channels = c1
a2.sinks = k1
# Source
a2.sources.r1.type = avro
a2.sources.r1.bind = linux126
a2.sources.r1.port = 9999
# interceptor
a2.sources.r1.interceptors = i1
a2.sources.r1.interceptors.i1.type = static
a2.sources.r1.interceptors.i1.key = Collector
a2.sources.r1.interceptors.i1.value = linux126
# set channel
a2.channels.c1.type = memory
a2.channels.c1.capacity = 10000
a2.channels.c1.transactionCapacity = 500
# HDFS Sink
a2.sinks.k1.type=hdfs
a2.sinks.k1.hdfs.path=hdfs://linux126:9000/flume/failover/
a2.sinks.k1.hdfs.fileType=DataStream
a2.sinks.k1.hdfs.writeFormat=TEXT
a2.sinks.k1.hdfs.rollInterval=60
a2.sinks.k1.hdfs.filePrefix=%Y-%m-%d
a2.sinks.k1.hdfs.minBlockReplicas=1
a2.sinks.k1.hdfs.rollSize=0
a2.sinks.k1.hdfs.rollCount=0
a2.sinks.k1.hdfs.idleTimeout=0
a2.sources.r1.channels = c1
a2.sinks.k1.channel=c1


On linux127
vim $FLUME_HOME/conf/flume-cluster-avro-file.conf
Add the following content

# set Agent name
a3.sources = r1
a3.channels = c1
a3.sinks = k1
# Source
a3.sources.r1.type = avro
a3.sources.r1.bind = linux127
a3.sources.r1.port = 9999
# interceptor
a3.sources.r1.interceptors = i1
a3.sources.r1.interceptors.i1.type = static
a3.sources.r1.interceptors.i1.key = Collector
a3.sources.r1.interceptors.i1.value = linux127
# set channel
a3.channels.c1.type = memory
a3.channels.c1.capacity = 10000
a3.channels.c1.transactionCapacity = 500
# HDFS Sink
a3.sinks.k1.type=hdfs
a3.sinks.k1.hdfs.path=hdfs://linux126:9000/flume/failover/
a3.sinks.k1.hdfs.fileType=DataStream
a3.sinks.k1.hdfs.writeFormat=TEXT
a3.sinks.k1.hdfs.rollInterval=60
a3.sinks.k1.hdfs.filePrefix=%Y-%m-%d
a3.sinks.k1.hdfs.minBlockReplicas=1
a3.sinks.k1.hdfs.rollSize=0
a3.sinks.k1.hdfs.rollCount=0
a3.sinks.k1.hdfs.idleTimeout=0
a3.sources.r1.channels = c1
a3.sinks.k1.channel=c1

Verify that it works

Run the following commands, each on the host where its configuration file was created
$FLUME_HOME/bin/flume-ng agent --name a3 --conf-file $FLUME_HOME/conf/flume-cluster-avro-file.conf -Dflume.root.logger=INFO,console

$FLUME_HOME/bin/flume-ng agent --name a2 --conf-file $FLUME_HOME/conf/flume-cluster-avro-hdfs.conf -Dflume.root.logger=INFO,console

$FLUME_HOME/bin/flume-ng agent --name a1 --conf-file $FLUME_HOME/conf/flume-cluster-taildir-avro.conf -Dflume.root.logger=INFO,console

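On linux128, in the /tmp/root directory
vim 1.log
ajsdlj

In the HDFS web UI, check how the files under flume/failover change

Verify high availability
Kill the Flume agent on linux126 and repeat the steps above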