1. Monitoring port data (official example)
Requirement:
Use Flume to listen on a port, collect the data arriving on it, and print it to the console.
Steps
(1) Create a work directory under the Flume installation directory
[hadoop@hadoop181 apache-flume]$ mkdir -p work
[hadoop@hadoop181 apache-flume]$ cd work
(2) Create the Flume agent configuration file
[hadoop@hadoop181 work]$ vim flume-netcat-logger.conf
(3) Configuration file contents:
# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1
# Describe/configure the source
a1.sources.r1.type = netcat
a1.sources.r1.bind = 0.0.0.0
a1.sources.r1.port = 44444
# Describe the sink
a1.sinks.k1.type = logger
# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
(4) Start the Flume agent listening on the port (-c is the conf directory, -n the agent name matching the a1 prefix in the config, -f the agent configuration file)
[hadoop@hadoop181 work]$ cd ../
[hadoop@hadoop181 apache-flume]$ bin/flume-ng agent -c conf/ -n a1 -f work/flume-netcat-logger.conf -Dflume.root.logger=INFO,console
(5) Use telnet or netcat to send data to the agent
[hadoop@hadoop181 apache-flume]$ telnet 192.168.207.181 44444
Trying 192.168.207.181...
Connected to 192.168.207.181.
Escape character is '^]'.
nihao
OK
(6) Check the agent's console output
Notes
(1) If netcat is missing, it can be installed with yum
[hadoop@hadoop181 ~]$ sudo yum -y install nc
(2) If telnet is missing, it can also be installed with yum
[hadoop@hadoop181 ~]$ sudo yum -y install telnet
2. Streaming a single appended file to HDFS in real time
Requirement:
Monitor the log file
/home/hadoop/data/flume/logs/append_log.log
in real time and upload its contents to HDFS
Steps
(1) Create the agent configuration file in Flume's work directory
[hadoop@hadoop181 work]$ vim flume-file-hdfs.conf
(2) Configuration contents:
# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1
# Source: tail the log file via an exec source
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /home/hadoop/data/flume/logs/append_log.log
a1.sources.r1.shell = /bin/bash -c
# Sink: write to HDFS, bucketed by date and hour
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = hdfs://hadoop181:9000/flume/%Y%m%d/%H
a1.sinks.k1.hdfs.filePrefix = flume-logs-
a1.sinks.k1.hdfs.round = true
a1.sinks.k1.hdfs.roundValue = 1
a1.sinks.k1.hdfs.roundUnit = hour
a1.sinks.k1.hdfs.useLocalTimeStamp = true
a1.sinks.k1.hdfs.batchSize = 1000
a1.sinks.k1.hdfs.fileType = DataStream
# Roll every 30 seconds or at ~128 MB, never by event count
a1.sinks.k1.hdfs.rollInterval = 30
a1.sinks.k1.hdfs.rollSize = 134117700
a1.sinks.k1.hdfs.rollCount = 0
# Channel: in-memory buffer
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 1000
# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
(3) Create the test file /home/hadoop/data/flume/logs/append_log.log
[hadoop@hadoop181 work]$ mkdir -p /home/hadoop/data/flume/logs
[hadoop@hadoop181 work]$ touch /home/hadoop/data/flume/logs/append_log.log
(4) Start Flume
[hadoop@hadoop181 work]$ cd ../
[hadoop@hadoop181 apache-flume]$ bin/flume-ng agent -c conf/ -n a1 -f work/flume-file-hdfs.conf -Dflume.root.logger=INFO,console
(5) Append content to the file
[hadoop@hadoop181 ~]$ echo "2">>/home/hadoop/data/flume/logs/append_log.log
(6) The agent console prints the event
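The single append in step (5) can be extended into a loop to give the exec source a steadier stream of events. A minimal sketch, using the log path from the configuration above:

```shell
# Append several timestamped lines to the monitored log so the
# exec source (tail -F) has a continuous stream of events to ship.
LOG=/home/hadoop/data/flume/logs/append_log.log
mkdir -p "$(dirname "$LOG")"
for i in 1 2 3 4 5; do
    echo "$(date '+%Y-%m-%d %H:%M:%S') test line $i" >> "$LOG"
done
```

Each appended line should show up in HDFS under the current date/hour directory once the sink rolls the file.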
3. Monitoring new log files in a directory in real time
Requirement:
Use Flume to monitor an entire directory and upload new files to HDFS
Steps
(1) Create the agent configuration file in Flume's work directory
[hadoop@hadoop181 work]$ vim flume-dir-hdfs.conf
(2) Agent configuration file contents:
# 1. Name the agent's source, channel, and sink
a1.channels = c1
a1.sources = r1
a1.sinks = k1
# 2. Describe the source
a1.sources.r1.type = spooldir
# Directory to monitor
a1.sources.r1.spoolDir = /home/hadoop/data/spooldir/logs/
# Only pick up files in the directory matching this regex
a1.sources.r1.includePattern = ^.*\.log$
# Ignore (do not upload) files ending in .tmp
a1.sources.r1.ignorePattern = ([^ ]*\.tmp)
# Suffix appended to files once they have been fully ingested
a1.sources.r1.fileSuffix = .COMPLETED
a1.sources.r1.batchSize = 100
# 3. Describe the channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 100
a1.channels.c1.transactionCapacity = 100
# 4. Describe the sink
a1.sinks.k1.type = hdfs
# HDFS target directory
a1.sinks.k1.hdfs.path = hdfs://hadoop181:9000/flume/logs/spooldir/%Y%m%d/%H
# Prefix for the files written to HDFS
a1.sinks.k1.hdfs.filePrefix = flume-
# Roll interval in seconds
a1.sinks.k1.hdfs.rollInterval = 5
# Roll by size: once the current file reaches this many bytes, Flume starts a new one
a1.sinks.k1.hdfs.rollSize = 13421000
# Roll by event count (0 disables count-based rolling)
a1.sinks.k1.hdfs.rollCount = 0
# Batch size for each write to HDFS
a1.sinks.k1.hdfs.batchSize = 100
# File format for HDFS writes (DataStream = plain text, no compression)
a1.sinks.k1.hdfs.fileType = DataStream
# Whether to round down the timestamp used in the directory path
a1.sinks.k1.hdfs.round = true
# Rounding value
a1.sinks.k1.hdfs.roundValue = 24
# Rounding unit [hour, second, minute]
a1.sinks.k1.hdfs.roundUnit = hour
# Use the local time instead of the event timestamp header
a1.sinks.k1.hdfs.useLocalTimeStamp = true
# 5. Wire up source -> channel -> sink
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
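The interplay of includePattern and ignorePattern can be checked locally with grep, no Flume needed: a file is ingested when it matches the include regex and does not match the ignore regex. The filenames below are made up for illustration:

```shell
# Classify sample filenames the way the spooldir source would:
# ingest those matching ^.*\.log$, skip those ending in .tmp.
for f in test1.log upload.log.tmp notes.txt app.log; do
    if echo "$f" | grep -Eq '^.*\.log$' && ! echo "$f" | grep -Eq '\.tmp$'; then
        echo "$f -> ingested"
    else
        echo "$f -> skipped"
    fi
done
# prints:
#   test1.log -> ingested
#   upload.log.tmp -> skipped
#   notes.txt -> skipped
#   app.log -> ingested
```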
(3) Create the test log directory
[hadoop@hadoop181 work]$ mkdir -p /home/hadoop/data/spooldir/logs/
(4) Start Flume
[hadoop@hadoop181 work]$ cd ../
[hadoop@hadoop181 apache-flume]$ bin/flume-ng agent -c conf/ -n a1 -f work/flume-dir-hdfs.conf -Dflume.root.logger=INFO,console
(5) Add a .log file to the monitored directory
[hadoop@hadoop181 apache-flume]$ cd /home/hadoop/data/spooldir/logs/
# Create a log file with some content
[hadoop@hadoop181 logs]$ echo "1">test1.log
(6) View the files in HDFS in the browser
(7) Create another one to try again
[hadoop@hadoop181 logs]$ echo "2">test2.log
(8) Check again
4. Monitoring multiple appended files in a directory in real time
Requirement:
Use Flume to monitor files being appended to in real time across a directory and upload them to HDFS
Steps:
(1) Create the configuration file flume-taildir-hdfs.conf in Flume's work directory
(2) Configuration contents:
# 1. Name the agent's source, channel, and sink
a1.sources = r1
a1.channels = c1
a1.sinks = k1
# 2. Describe the source. Note: the type name is uppercase
a1.sources.r1.type = TAILDIR
# File groups to monitor
a1.sources.r1.filegroups = fg1 fg2
# Files monitored by group fg1, under /home/hadoop/data/taildir/logs/
a1.sources.r1.filegroups.fg1 = /home/hadoop/data/taildir/logs/.*log
# Files monitored by group fg2
a1.sources.r1.filegroups.fg2 = /home/hadoop/data/taildir/txt/.*txt
# Position file recording read offsets; this is what enables resuming
# from where the source left off after a restart
a1.sources.r1.positionFile = /home/hadoop/data/taildir/postion.json
a1.sources.r1.batchSize = 100
# 3. Describe the channel. Note: transactionCapacity must be <= capacity
a1.channels.c1.type = memory
a1.channels.c1.capacity = 100
a1.channels.c1.transactionCapacity = 100
# 4. Describe the sink
a1.sinks.k1.type = hdfs
# HDFS target directory
a1.sinks.k1.hdfs.path = hdfs://hadoop181:9000/flume/logs/taildir/%Y%m%d/%H
# Prefix for the files written to HDFS
a1.sinks.k1.hdfs.filePrefix = flume-
# Roll interval in seconds
a1.sinks.k1.hdfs.rollInterval = 5
# Roll by size: once the current file reaches this many bytes, Flume starts a new one
a1.sinks.k1.hdfs.rollSize = 13421000
# Roll by event count (0 disables count-based rolling)
a1.sinks.k1.hdfs.rollCount = 0
# Batch size for each write to HDFS
a1.sinks.k1.hdfs.batchSize = 100
# Compression codec (only applies when fileType = CompressedStream), e.g.:
#a1.sinks.k1.hdfs.codeC = gzip
# File format [SequenceFile = binary sequence file, DataStream = plain text,
# CompressedStream = compressed]
a1.sinks.k1.hdfs.fileType = DataStream
# Whether to round down the timestamp used in the directory path
a1.sinks.k1.hdfs.round = true
# Rounding value
a1.sinks.k1.hdfs.roundValue = 24
# Rounding unit [hour, second, minute]
a1.sinks.k1.hdfs.roundUnit = hour
# Use the local time instead of the event timestamp header
a1.sinks.k1.hdfs.useLocalTimeStamp = true
# 5. Wire up source -> channel -> sink
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
(3) Create the test directories
[hadoop@hadoop181 logs]$ mkdir -p /home/hadoop/data/taildir/logs/
[hadoop@hadoop181 logs]$ mkdir -p /home/hadoop/data/taildir/txt/
(4) Start the Flume agent
[hadoop@hadoop181 work]$ cd ../
[hadoop@hadoop181 apache-flume]$ bin/flume-ng agent -c conf/ -n a1 -f work/flume-taildir-hdfs.conf -Dflume.root.logger=INFO,console
(5) Write files into the monitored directories to test
[hadoop@hadoop181 logs]$ cd /home/hadoop/data/taildir/logs
[hadoop@hadoop181 logs]$ echo "log1,hello">test1.log
[hadoop@hadoop181 logs]$ cd /home/hadoop/data/taildir/txt
[hadoop@hadoop181 txt]$ echo "txt1,world">test1.txt
(6) View the files in HDFS
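The positionFile configured above (postion.json) is a JSON array with one entry per tracked file; the inode, byte offset (pos), and path are what let the TAILDIR source resume after a restart. A fabricated sample, just to show the shape (the real file is written and updated by Flume):

```shell
# Write a sample position file with the same structure Flume maintains,
# then pull out the recorded offset. All values here are invented.
cat > /tmp/postion_sample.json <<'EOF'
[{"inode":123456,"pos":11,"file":"/home/hadoop/data/taildir/logs/test1.log"}]
EOF
grep -o '"pos":[0-9]*' /tmp/postion_sample.json
# prints "pos":11
```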
5. Collecting files into different HDFS directories by type
Requirement
On node1, the directory /home/hadoop/data/logs/ contains access.log, nginx.log, and web.log. Collect them into HDFS, placing each in a separate directory according to its category.
Steps
(1) Create the file exec_source_avro_sink.conf in Flume's work directory
vim exec_source_avro_sink.conf
(2) Configuration contents:
# Sources r1, r2, r3 each tail one of the three files
a1.sources = r1 r2 r3
a1.sinks = k1
a1.channels = c1
# Describe/configure the sources
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /home/hadoop/data/logs/access.log
# Interceptor name
a1.sources.r1.interceptors = i1
# Static interceptor: stamps a fixed header onto every event
a1.sources.r1.interceptors.i1.type = static
# Header key/value added to each event
a1.sources.r1.interceptors.i1.key = type
a1.sources.r1.interceptors.i1.value = access
a1.sources.r2.type = exec
a1.sources.r2.command = tail -F /home/hadoop/data/logs/nginx.log
a1.sources.r2.interceptors = i2
a1.sources.r2.interceptors.i2.type = static
a1.sources.r2.interceptors.i2.key = type
a1.sources.r2.interceptors.i2.value = nginx
a1.sources.r3.type = exec
a1.sources.r3.command = tail -F /home/hadoop/data/logs/web.log
a1.sources.r3.interceptors = i3
a1.sources.r3.interceptors.i3.type = static
a1.sources.r3.interceptors.i3.key = type
a1.sources.r3.interceptors.i3.value = web
# Describe the sink: forward events to the avro source on node2
a1.sinks.k1.type = avro
a1.sinks.k1.hostname = hadoop182
a1.sinks.k1.port = 41414
# Channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 1000
# Bind sources -> channel -> sink
a1.sources.r1.channels = c1
a1.sources.r2.channels = c1
a1.sources.r3.channels = c1
a1.sinks.k1.channel = c1
(3) On node2, create the file avro_source_hdfs_sink.conf in the work directory
vim avro_source_hdfs_sink.conf
(4) Configuration contents:
# Name the agent's source, channel, and sink
a1.sources = r1
a1.sinks = k1
a1.channels = c1
# Define the source
a1.sources.r1.type = avro
a1.sources.r1.bind = hadoop182
a1.sources.r1.port = 41414
# Timestamp interceptor: adds a timestamp header to events that lack one
a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = org.apache.flume.interceptor.TimestampInterceptor$Builder
# Define the channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 1000
# Define the sink; %{type} is filled in from the header set on node1
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = hdfs://hadoop181:9000/flume/staticlog/%{type}/%Y%m%d
a1.sinks.k1.hdfs.filePrefix = events
a1.sinks.k1.hdfs.fileType = DataStream
a1.sinks.k1.hdfs.writeFormat = Text
# Not needed here: the timestamp interceptor already supplies the timestamp
#a1.sinks.k1.hdfs.useLocalTimeStamp = true
# Do not roll by event count
a1.sinks.k1.hdfs.rollCount = 0
# Roll every 30 seconds
a1.sinks.k1.hdfs.rollInterval = 30
# Roll by size at 10 MB
a1.sinks.k1.hdfs.rollSize = 10485760
#a1.sinks.k1.hdfs.rollSize = 0
# Number of events written to HDFS per batch
a1.sinks.k1.hdfs.batchSize = 20
# Number of threads Flume uses for HDFS operations (create, write, ...)
a1.sinks.k1.hdfs.threadsPoolSize = 10
# Timeout for HDFS operations, in milliseconds
a1.sinks.k1.hdfs.callTimeout = 30000
# Wire up source -> channel -> sink
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
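How the %{type} escape in hdfs.path resolves: the static interceptors on node1 stamp each event with a type header (access, nginx, or web), and the HDFS sink on node2 substitutes the header value into the path. A local string sketch of the substitution (the date escapes %Y%m%d are expanded by Flume at write time and stay literal here):

```shell
# Substitute each possible header value into the path template, the way
# the HDFS sink expands %{type}. Pure string manipulation, no Flume needed.
TEMPLATE='hdfs://hadoop181:9000/flume/staticlog/%{type}/%Y%m%d'
for t in access nginx web; do
    echo "$TEMPLATE" | sed "s/%{type}/$t/"
done
# prints:
#   hdfs://hadoop181:9000/flume/staticlog/access/%Y%m%d
#   hdfs://hadoop181:9000/flume/staticlog/nginx/%Y%m%d
#   hdfs://hadoop181:9000/flume/staticlog/web/%Y%m%d
```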
(5) Start the agent on node2 (always start the end farthest from the data source first)
[hadoop@hadoop182 apache-flume]$ bin/flume-ng agent -c conf/ -f work/avro_source_hdfs_sink.conf -n a1 -Dflume.root.logger=INFO,console
(6) Start the agent on node1
[hadoop@hadoop181 apache-flume]$ bin/flume-ng agent -c conf/ -f work/exec_source_avro_sink.conf -n a1 -Dflume.root.logger=INFO,console
(7) View the files generated in HDFS
(8) Test appending
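To test appending, one line can be written to each of the three monitored logs on node1; each should then land under its own type directory in HDFS. The directory is the one assumed by the exec sources in exec_source_avro_sink.conf:

```shell
# Append one test entry to each of the three monitored log files.
LOGDIR=/home/hadoop/data/logs
mkdir -p "$LOGDIR"
for name in access nginx web; do
    echo "$name test $(date +%s)" >> "$LOGDIR/$name.log"
done
```

After a few seconds, /flume/staticlog/access/, /flume/staticlog/nginx/, and /flume/staticlog/web/ should each receive a new events- file.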