1. What is Flume? What are its working components and how does it work? Illustrate with a diagram.
Flume definition: Flume is a highly available, highly reliable, distributed system provided by Cloudera for collecting, aggregating, and transporting massive amounts of log data. Features: simple, flexible, streaming architecture.
Working components, all inside one Agent (a JVM process):
Source: receives data; handles data of various types and formats.
Sink: continuously polls the Channel for events, processes them in batches, and writes them to storage such as HDFS or forwards them to the next Agent (JVM process).
Channel: the buffer sitting between the Source (which puts events in) and the Sink (which takes events out); it is thread-safe.
memory channel: in-memory buffer, data can be lost on failure; file channel: disk-based buffer, data is not easily lost.
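For contrast with the memory channel used throughout these notes, a minimal file channel sketch (the checkpoint/data directory paths below are illustrative assumptions, not from the notes):
a1.channels.c1.type = file
# checkpoint and data directories; events survive an agent restart
a1.channels.c1.checkpointDir = /opt/module/flume-1.9.0/data/checkpoint
a1.channels.c1.dataDirs = /opt/module/flume-1.9.0/data/flume-data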
Diagram of the run flow:
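A rough text sketch of the flow, in place of the drawing (targets are the ones named above):

  external data --> [ Agent (JVM) ]
                      Source --put--> Channel --take--> Sink --> HDFS / next Agent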

2. Write out the steps to install Flume, verify the installation succeeded, and illustrate with screenshots.
(1) Download the corresponding .tar.gz package from the official site.
Download URL: http://archive.apache.org/dist/flume/
(2) Upload the package and extract it into your own installation directory.
To switch between sibling directories under the same parent (/opt), for example:
cd /opt/software/
cd ../module/
1. Upload the downloaded .tar.gz package to the /opt/software directory.
2. Extract it into the /opt/module directory:
tar -zxvf apache-flume-1.9.0-bin.tar.gz -C /opt/module/
3. In the /opt/module installation directory, rename the extracted folder:
mv apache-flume-1.9.0-bin/ flume-1.9.0
4. In the lib directory, delete guava-11.0.2.jar, which conflicts with the newer Guava shipped with Hadoop 3.1.3:
cd /opt/module/flume-1.9.0/lib
rm -rf guava-11.0.2.jar
5. Configure the Flume environment variables (in /etc/profile.d, as root):
vim /etc/profile.d/my_env.sh
Add the following:
#flume
export FLUME_HOME=/opt/module/flume-1.9.0
export PATH=$PATH:$FLUME_HOME/bin
6. Reload the profile so the variables take effect:
source /etc/profile
7. Verify the configuration by changing into the Flume home directory:
cd $FLUME_HOME
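If the variable is set correctly, cd $FLUME_HOME lands in /opt/module/flume-1.9.0. As an extra check (standard Flume CLI, not from the original notes), the version command can be run from any directory because bin/ was added to PATH:
flume-ng version
# should report Flume 1.9.0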

3. Write a working example, analyze what it means, and illustrate with screenshots.
Use nc on node2 to listen on a port.
If you are unsure how to use nc, check either of the following:
nc -help
man nc
Listen:
nc -l localhost 44444
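A quick sanity check of nc itself (a minimal sketch; host and port follow the listener above):
# in a second terminal on node2
nc localhost 44444
# anything typed here is printed by the nc -l terminal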
Create a jobs folder under the Flume directory and change into it (later commands refer to this folder as jobs).
[itwise@node2 flume]$ mkdir jobs
[itwise@node2 flume]$ cd jobs/
In the jobs folder, create the Flume Agent configuration file flume-netcat-logger.conf.
[itwise@node2 jobs]$ vim flume-netcat-logger.conf
Add the following configuration to flume-netcat-logger.conf:
# example.conf: A single-node Flume configuration
# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1
# Describe/configure the source
# source type, bind IP, and port (must match the port nc connects to)
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444
# Describe the sink
a1.sinks.k1.type = logger
# Use a channel which buffers events in memory
# channel type, total event capacity, and transaction size (events per source put / sink take)
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
Run the test:
1. First, start Flume on node2. By default the output goes to the log4j log file; to print it to the console instead, append ,console to the logger option (i.e. -Dflume.root.logger=INFO,console, as in Method 1/2 below).
bin/flume-ng agent --conf conf --conf-file jobs/flume-netcat-logger.conf --name a1
2. Open another node2 terminal (with the command above, output is saved to the log4j log file). In this window, connect to the port Flume is listening on and type some content:
nc localhost 44444
3. Open another window and inspect the monitoring log. Note: after typing new content in the previous window, reopen the log file to see the new entries.
cd /opt/module/flume-1.9.0/
vim flume.log
# Method 1:
bin/flume-ng agent --conf conf --conf-file jobs/flume-netcat-logger.conf --name a1 -Dflume.root.logger=INFO,console
# Method 2
[itwise@node2 flume]$ bin/flume-ng agent -c conf -f jobs/flume-netcat-logger.conf -n a1 -Dflume.root.logger=INFO,console
# Common style: write the paths via environment variables
[itwise@node2 flume]$ bin/flume-ng agent -c $FLUME_HOME/conf -f $FLUME_HOME/jobs/flume-netcat-logger.conf -n a1 -Dflume.root.logger=INFO,console

Case 2: Monitor the Hive log and upload it to HDFS
Breakdown: monitor a file; whenever new content is appended to it, print that content to the console.
Extra: Linux redirection for writing to a file — > overwrites, >> appends (a small demo follows the tail command below).
tail -f $FLUME_HOME/jobs/workdir/file1.txt
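A small demo of the difference, using a throwaway file name (demo.txt is hypothetical):
echo aaa >  demo.txt   # overwrite: demo.txt now contains only aaa
echo bbb >> demo.txt   # append:    demo.txt now contains aaa then bbb
cat demo.txt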


1. Create the file:
[itwise@node2 jobs]$ pwd
/opt/module/flume-1.9.0/jobs
[itwise@node2 jobs]$ mkdir workdir
[itwise@node2 jobs]$ cd workdir/
[itwise@node2 workdir]$ touch file1.txt
# Open a window on node2 to monitor the file
[itwise@node2 home]$ tail -f $FLUME_HOME/jobs/workdir/file1.txt
# In another window, write data
[itwise@node2 workdir]$ echo 222 >> file1.txt
[itwise@node2 workdir]$ echo 333 >> file1.txt
# As data is appended, the tail window shows the changes.
Now use Flume to monitor writes to this file.
Note: paths in the configuration file must be absolute.
# example.conf: A single-node Flume configuration
# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1
# Describe/configure the source
a1.sources.r1.type = exec
a1.sources.r1.command = tail -f /opt/module/flume-1.9.0/jobs/workdir/file1.txt
# Describe the sink
a1.sinks.k1.type = logger
# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
Create the configuration file:
[itwise@node2 jobs]$ vim flume-exec-logger.conf
Add the configuration above into this file.
Run the command:
# Common style: write the paths via environment variables
[itwise@node2 flume-1.9.0]$
flume-ng agent -c $FLUME_HOME/conf -f $FLUME_HOME/jobs/flume-exec-logger.conf -n a1 -Dflume.root.logger=INFO,console
echo 4444 >> $FLUME_HOME/jobs/workdir/file1.txt


3. Upload the appended file content to HDFS
Change the sink so it writes to the target location in HDFS; everything else stays the same as above.
# example.conf: A single-node Flume configuration
# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1
# Describe/configure the source
a1.sources.r1.type = exec
a1.sources.r1.command = tail -f /opt/module/flume-1.9.0/jobs/workdir/file1.txt
# Describe the sink
#Sink
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = hdfs://node2:9820/flume/%Y%m%d/%H
# Prefix for uploaded files
a1.sinks.k1.hdfs.filePrefix = logs-
# Whether to round the directory down by time
a1.sinks.k1.hdfs.round = true
# How many time units before a new directory is created
a1.sinks.k1.hdfs.roundValue = 1
# The time unit used for rounding
a1.sinks.k1.hdfs.roundUnit = hour
# Whether to use the local timestamp
a1.sinks.k1.hdfs.useLocalTimeStamp = true
# Number of events to accumulate before flushing to HDFS
a1.sinks.k1.hdfs.batchSize = 100
# File type (compression is supported)
a1.sinks.k1.hdfs.fileType = DataStream
# How often (seconds) to roll a new file
a1.sinks.k1.hdfs.rollInterval = 60
# Roll size of each file in bytes (just under 128 MB)
a1.sinks.k1.hdfs.rollSize = 134217700
# Rolling is independent of the number of events
a1.sinks.k1.hdfs.rollCount = 0
# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
Start it (make sure the Hadoop cluster is running first):
flume-ng agent -c $FLUME_HOME/conf -f $FLUME_HOME/jobs/flume-exec-hdfs.conf -n a1 -Dflume.root.logger=INFO,console
Check HDFS (node2:9820) to see whether /flume was created and what it contains; the data written during the earlier tests should have been read and saved there (a quick shell check is sketched below). Result screenshot:
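A quick way to verify from the shell (standard HDFS commands; the date/hour path follows the %Y%m%d/%H pattern in the sink config):
hdfs dfs -ls /flume
hdfs dfs -ls /flume/20240308/20           # one of the date/hour directories
hdfs dfs -cat /flume/20240308/20/logs-*   # view the rolled files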

[itwise@node2 jobs]$ echo '我爱大数据' >> $FLUME_HOME/jobs/workdir/file1.txt
# Sample output from the run:
2024-03-08 20:45:55,415 (hdfs-k1-call-runner-4) [INFO - org.apache.flume.sink.hdfs.BucketWriter$7.call(BucketWriter.java:681)] Renaming hdfs://node2:9820/flume/20240308/20/logs-.1709901895357.tmp to hdfs://node2:9820/flume/20240308/20/logs-.1709901895357


Case 3: Monitor a directory for new files in real time and upload their contents to HDFS
Write the configuration:
Key idea: only the source changes here; everything downstream stays the same.
Create the directory to be monitored:
mkdir spooling
[itwise@node2 jobs]$ vim flume-spooling-hdfs.conf
Contents of flume-spooling-hdfs.conf:
# example.conf: A single-node Flume configuration
# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1
# Describe/configure the source
a1.sources.r1.type = spooldir
a1.sources.r1.spoolDir = /opt/module/flume-1.9.0/jobs/spooling
a1.sources.r1.fileSuffix = .COMPLETED
a1.sources.r1.ignorePattern = .*\.tmp
# Describe the sink
#Sink
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = hdfs://node2:9820/flume/%Y%m%d/%H
# Prefix for uploaded files
a1.sinks.k1.hdfs.filePrefix = logs-
# Whether to round the directory down by time
a1.sinks.k1.hdfs.round = true
# How many time units before a new directory is created
a1.sinks.k1.hdfs.roundValue = 1
# The time unit used for rounding
a1.sinks.k1.hdfs.roundUnit = hour
# Whether to use the local timestamp
a1.sinks.k1.hdfs.useLocalTimeStamp = true
# Number of events to accumulate before flushing to HDFS
a1.sinks.k1.hdfs.batchSize = 100
# File type (compression is supported)
a1.sinks.k1.hdfs.fileType = DataStream
# How often (seconds) to roll a new file
a1.sinks.k1.hdfs.rollInterval = 60
# Roll size of each file in bytes (just under 128 MB)
a1.sinks.k1.hdfs.rollSize = 134217700
# Rolling is independent of the number of events
a1.sinks.k1.hdfs.rollCount = 0
# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
Start command:
flume-ng agent -c $FLUME_HOME/conf -f $FLUME_HOME/jobs/flume-spooling-hdfs.conf -n a1 -Dflume.root.logger=INFO,console
Test that it is working by copying a file into the monitored directory:
cp $FLUME_HOME/jobs/workdir/file1.txt $FLUME_HOME/jobs/spooling/
Screenshot of the run result:



Conclusion: every new file in the directory is picked up and uploaded to HDFS, but if a new file already carries the .COMPLETED suffix, Flume skips it, because that suffix is what Flume uses to mark files that have already been uploaded (a quick check follows).
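A quick check after copying a file in (a sketch; the exact listing will differ):
ls $FLUME_HOME/jobs/spooling/
# file1.txt.COMPLETED   <- the copied file, renamed once Flume has uploaded it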
4. Monitor multiple appended files in a directory in real time and upload their contents to HDFS
# example.conf: A single-node Flume configuration
# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1
# Describe/configure the source
# Change the source type and define the file groups f1 and f2
a1.sources.r1.type = TAILDIR
a1.sources.r1.filegroups = f1 f2
a1.sources.r1.filegroups.f1 = /opt/module/flume-1.9.0/jobs/taildir/.*\.txt
a1.sources.r1.filegroups.f2 = /opt/module/flume-1.9.0/jobs/taildir/.*\.log
# The position file records the read offset of each tailed file
a1.sources.r1.positionFile = /opt/module/flume-1.9.0/jobs/position/position.json
# Describe the sink
#Sink
# HDFS target path, bucketed by the specified time pattern
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = hdfs://node2:9820/flume/%Y%m%d/%H
# Prefix for uploaded files, used to tell them apart
a1.sinks.k1.hdfs.filePrefix = logs-
# Whether to round the directory down by time
a1.sinks.k1.hdfs.round = true
# How many time units before a new directory is created
a1.sinks.k1.hdfs.roundValue = 1
# Time unit for rounding: one HDFS directory per hour
a1.sinks.k1.hdfs.roundUnit = hour
# Whether to use the local timestamp in the directory/file name
a1.sinks.k1.hdfs.useLocalTimeStamp = true
# Number of events to accumulate before flushing to HDFS
a1.sinks.k1.hdfs.batchSize = 100
# File type (compression is supported)
a1.sinks.k1.hdfs.fileType = DataStream
# How often (seconds) to roll a new file
a1.sinks.k1.hdfs.rollInterval = 60
# Roll size of each file in bytes (just under 128 MB)
a1.sinks.k1.hdfs.rollSize = 134217700
# Rolling is independent of the number of events
a1.sinks.k1.hdfs.rollCount = 0
# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
Run the command:
flume-ng agent -c $FLUME_HOME/conf -f $FLUME_HOME/jobs/flume-taildir-hdfs.conf -n a1 -Dflume.root.logger=INFO,console
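A minimal test sketch (the taildir/position directories must exist and match the paths in the config above; the file names here are made up):
mkdir -p $FLUME_HOME/jobs/taildir $FLUME_HOME/jobs/position
echo hello >> $FLUME_HOME/jobs/taildir/a.txt
echo world >> $FLUME_HOME/jobs/taildir/b.log
cat $FLUME_HOME/jobs/position/position.json   # read offsets recorded per file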
5. Flume enterprise-development case: one Taildir source replicated to two channels, with one sink to HDFS and one to a local directory (file_roll).
Run the three agents (start the downstream agents flume3 and flume2 first, then flume1):
flume-ng agent -c $FLUME_HOME/conf -f $FLUME_HOME/jobs/group1/flume3.conf -n a3 -Dflume.root.logger=INFO,console
flume-ng agent -c $FLUME_HOME/conf -f $FLUME_HOME/jobs/group1/flume2.conf -n a2 -Dflume.root.logger=INFO,console
flume-ng agent -c $FLUME_HOME/conf -f $FLUME_HOME/jobs/group1/flume1.conf -n a1 -Dflume.root.logger=INFO,console
#Flume1.conf
#Named
a1.sources = r1
a1.channels = c1 c2
a1.sinks = k1 k2
#Source
a1.sources.r1.type = TAILDIR
a1.sources.r1.filegroups = f1
a1.sources.r1.filegroups.f1 = /opt/module/flume-1.9.0/jobs/group1/taildir-flume1/.*\.txt
a1.sources.r1.positionFile = /opt/module/flume-1.9.0/jobs/group1/position-flume1/position.json
#channel selector
a1.sources.r1.selector.type = replicating
#Channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000
a1.channels.c1.transactionCapacity = 100
a1.channels.c2.type = memory
a1.channels.c2.capacity = 10000
a1.channels.c2.transactionCapacity = 100
#Sink
a1.sinks.k1.type = avro
a1.sinks.k1.hostname = localhost
a1.sinks.k1.port = 7777
a1.sinks.k2.type = avro
a1.sinks.k2.hostname = localhost
a1.sinks.k2.port = 8888
#Bind
a1.sources.r1.channels = c1 c2
a1.sinks.k1.channel = c1
a1.sinks.k2.channel = c2
#Flume2.conf
# example.conf: A single-node Flume configuration
# Name the components on this agent
a2.sources = r1
a2.sinks = k1
a2.channels = c1
# Describe/configure the source
a2.sources.r1.type = avro
a2.sources.r1.bind = localhost
a2.sources.r1.port = 7777
# Describe the sink
#Sink
a2.sinks.k1.type = hdfs
a2.sinks.k1.hdfs.path = hdfs://node2:9820/flume/%Y%m%d/%H
# Prefix for uploaded files
a2.sinks.k1.hdfs.filePrefix = logs-
# Whether to round the directory down by time
a2.sinks.k1.hdfs.round = true
# How many time units before a new directory is created
a2.sinks.k1.hdfs.roundValue = 1
# The time unit used for rounding
a2.sinks.k1.hdfs.roundUnit = hour
# Whether to use the local timestamp
a2.sinks.k1.hdfs.useLocalTimeStamp = true
# Number of events to accumulate before flushing to HDFS
a2.sinks.k1.hdfs.batchSize = 100
# File type (compression is supported)
a2.sinks.k1.hdfs.fileType = DataStream
# How often (seconds) to roll a new file
a2.sinks.k1.hdfs.rollInterval = 60
# Roll size of each file in bytes (just under 128 MB)
a2.sinks.k1.hdfs.rollSize = 134217700
# Rolling is independent of the number of events
a2.sinks.k1.hdfs.rollCount = 0
# Use a channel which buffers events in memory
a2.channels.c1.type = memory
a2.channels.c1.capacity = 1000
a2.channels.c1.transactionCapacity = 100
# Bind the source and sink to the channel
a2.sources.r1.channels = c1
a2.sinks.k1.channel = c1
#Flume3.conf
#Named
a3.sources = r1
a3.channels = c1
a3.sinks = k1
#Source
a3.sources.r1.type = avro
a3.sources.r1.bind = localhost
a3.sources.r1.port = 8888
#Channel
a3.channels.c1.type = memory
a3.channels.c1.capacity = 10000
a3.channels.c1.transactionCapacity = 100
#Sink
a3.sinks.k1.type = file_roll
a3.sinks.k1.sink.directory = /opt/module/flume-1.9.0/jobs/group1/flume3
#Bind
a3.sources.r1.channels = c1
a3.sinks.k1.channel = c1
Configuration notes (an end-to-end test is sketched below):
position-flume1 is the directory where the taildir position file is stored
taildir-flume1 is the directory being monitored
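A minimal end-to-end test sketch, assuming the directories from the configs above have been created and all three agents are running (the file name is illustrative):
mkdir -p /opt/module/flume-1.9.0/jobs/group1/taildir-flume1 /opt/module/flume-1.9.0/jobs/group1/position-flume1 /opt/module/flume-1.9.0/jobs/group1/flume3
echo test >> /opt/module/flume-1.9.0/jobs/group1/taildir-flume1/a.txt
# the same line should appear both in HDFS under /flume/... (via flume2)
# and in a rolled file under jobs/group1/flume3 (via flume3)
ls /opt/module/flume-1.9.0/jobs/group1/flume3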




This article covered the Apache Flume architecture and its working components (Source, Sink, Channel), how to install and configure Flume, and examples using a netcat listener, Hive-style log monitoring, and uploading data to HDFS. The cases spanned single-node configurations and multi-file monitoring, showing how Flume is deployed and used in enterprise applications.