1. Case 2: Real-time monitoring of a single appending file
- Requirement: monitor the Hive log in real time and upload it to HDFS
- Implementation steps
# Name the components on this agent
a2.sources = r2
a2.sinks = k2
a2.channels = c2

# Describe/configure the source
a2.sources.r2.type = exec
a2.sources.r2.command = tail -F /opt/module/flume/demo/1.log
a2.sources.r2.shell = /bin/bash -c

# Describe the sink
a2.sinks.k2.type = hdfs
# To use time escape sequences (%Y%m%d/%H) in the path, one of two conditions must hold:
# use the local timestamp, or
# the event headers must contain a timestamp
a2.sinks.k2.hdfs.path = hdfs://hadoop101:8020/flume/%Y%m%d/%H
# Prefix for files uploaded to HDFS
a2.sinks.k2.hdfs.filePrefix = logs-
# Whether to roll directories based on time
a2.sinks.k2.hdfs.round = true
# How many time units before a new directory is created
a2.sinks.k2.hdfs.roundValue = 1
# The time unit used for rolling directories
a2.sinks.k2.hdfs.roundUnit = hour
# Whether to use the local timestamp
a2.sinks.k2.hdfs.useLocalTimeStamp = true
# Number of events to accumulate before flushing to HDFS
a2.sinks.k2.hdfs.batchSize = 100
# File type; compression is supported
a2.sinks.k2.hdfs.fileType = DataStream
# How often to roll a new file (60 seconds)
a2.sinks.k2.hdfs.rollInterval = 60
# The size at which each file rolls
a2.sinks.k2.hdfs.rollSize = 134217700
# Rolling is independent of the number of events (0 disables this trigger)
a2.sinks.k2.hdfs.rollCount = 0

# Use a channel which buffers events in memory
a2.channels.c2.type = memory
a2.channels.c2.capacity = 1000
a2.channels.c2.transactionCapacity = 100

# Bind the source and sink to the channel
a2.sources.r2.channels = c2
a2.sinks.k2.channel = c2

--Start the Flume agent
flume-ng agent -n a2 -c conf/ -f job/execsource_hdfssink.conf -Dflume.root.logger=INFO,console
--After startup, the flume directory appears on HDFS
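To exercise this agent without a live Hive instance, you can append lines to the tailed file by hand. A minimal sketch, assuming the demo file path from the config above:

# Create the file that the exec source tails
mkdir -p /opt/module/flume/demo
touch /opt/module/flume/demo/1.log
# Append one timestamped line per second to simulate a growing log (stop with Ctrl+C)
while true; do
  echo "$(date) test log line" >> /opt/module/flume/demo/1.log
  sleep 1
done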
- Exec Source
Exec source runs a given Unix command on start-up and expects that process to continuously produce data on standard out (stderr is simply discarded, unless property logStdErr is set to true). If the process exits for any reason, the source also exits and will produce no further data. This means configurations such as
cat [named pipe]
or
tail -F [file]
are going to produce the desired results, whereas
date
will probably not - the former two commands produce streams of data, whereas the latter produces a single event and exits.
- HDFS Sink
This sink writes events into the Hadoop Distributed File System (HDFS). It currently supports creating text and sequence files. It supports compression in both file types. The files can be rolled (close current file and create a new one) periodically based on the elapsed time or size of data or number of events. It also buckets/partitions data by attributes like timestamp or machine where the event originated. The HDFS directory path may contain formatting escape sequences that will be replaced by the HDFS sink to generate a directory/file name to store the events. Using this sink requires hadoop to be installed so that Flume can use the Hadoop jars to communicate with the HDFS cluster. Note that a version of Hadoop that supports the sync() call is required.
https://flume.apache.org/releases/content/1.9.0/FlumeUserGuide.html
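To check the rolling and bucketing described above, you can list the sink's output with the standard HDFS CLI; the date/hour path below is illustrative:

# List the time-bucketed directories created by the sink (%Y%m%d/%H)
hdfs dfs -ls -R /flume
# Inspect a rolled file (logs- prefix from the config; the path is an example)
hdfs dfs -cat /flume/20240101/10/logs-*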
2. Case 3: Real-time monitoring of multiple new files in a directory
- Requirement: use Flume to monitor an entire directory for new files and upload them to HDFS
- Implementation steps
--Create the configuration file flume-dir-hdfs.conf
vim flume-dir-hdfs.conf
--Add the following content
a3.sources = r3
a3.sinks = k3
a3.channels = c3

# Describe/configure the source
# Spooling Directory Source: watches a directory and automatically collects its contents.
# After a file in the directory has been fully read, there are two options:
# delete it, or append a suffix to its name (default .COMPLETED); configured via the deletePolicy property.
# File names in this directory must be unique; a duplicate name throws an exception.
a3.sources.r3.type = spooldir
a3.sources.r3.spoolDir = /opt/module/flume/upload
a3.sources.r3.fileSuffix = .COMPLETED
a3.sources.r3.fileHeader = true
# Ignore (do not upload) all files ending in .tmp
a3.sources.r3.ignorePattern = ([^ ]*\.tmp)

# Describe the sink
a3.sinks.k3.type = hdfs
a3.sinks.k3.hdfs.path = hdfs://hadoop101:8020/flume/upload/%Y%m%d/%H
# Prefix for files uploaded to HDFS
a3.sinks.k3.hdfs.filePrefix = upload-
# Whether to roll directories based on time
a3.sinks.k3.hdfs.round = true
# How many time units before a new directory is created
a3.sinks.k3.hdfs.roundValue = 1
# The time unit used for rolling directories
a3.sinks.k3.hdfs.roundUnit = hour
# Whether to use the local timestamp
a3.sinks.k3.hdfs.useLocalTimeStamp = true
# Number of events to accumulate before flushing to HDFS
a3.sinks.k3.hdfs.batchSize = 100
# File type; compression is supported
a3.sinks.k3.hdfs.fileType = DataStream
# How often to roll a new file
a3.sinks.k3.hdfs.rollInterval = 60
# Roll each file at roughly 128 MB
a3.sinks.k3.hdfs.rollSize = 134217700
# Rolling is independent of the number of events
a3.sinks.k3.hdfs.rollCount = 0

# Use a channel which buffers events in memory
a3.channels.c3.type = memory
a3.channels.c3.capacity = 1000
a3.channels.c3.transactionCapacity = 100

# Bind the source and sink to the channel
a3.sources.r3.channels = c3
a3.sinks.k3.channel = c3

--Start the agent that monitors the directory
flume-ng agent -n a3 -c conf/ -f job/flume-dir-hdfs.conf
--Create the upload directory under /opt/module/flume
mkdir upload
--Write some content into it; the log files appear on HDFS
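After the agent picks up a file, the spooling directory source renames it with the configured suffix. A quick way to observe this; the file name is illustrative:

echo "hello spooldir" > /opt/module/flume/upload/a.log
ls /opt/module/flume/upload
# Once the file has been ingested, the listing shows:
# a.log.COMPLETED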
When the duplicate-file-name exception is thrown, the agent dies; anything written to files in the directory after that point is not read until the agent is restarted.
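Because files must not reappear under an already-processed name, a common delivery pattern is to stage files under the ignored .tmp suffix and rename them into place. A sketch, assuming the ignorePattern from the config above; the source path is hypothetical:

src=/var/log/app/today.log                       # hypothetical file to ship
dst=/opt/module/flume/upload/app-$(date +%s).log
# The .tmp copy is ignored by the source; mv within one filesystem is atomic,
# so the source never reads a half-written file, and the timestamp in the
# name avoids the duplicate-file-name exception
cp "$src" "$dst.tmp"
mv "$dst.tmp" "$dst"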
3. Real-time monitoring of multiple appending files in a directory
- Requirement: use Flume to monitor files being appended in real time across an entire directory and upload them to HDFS
- Implementation steps
--Create the configuration file flume-taildir-hdfs.conf
vim flume-taildir-hdfs.conf
--Add the following content
a3.sources = r3
a3.sinks = k3
a3.channels = c3

# Describe/configure the source
a3.sources.r3.type = TAILDIR
# This file records how far the source has read into each file;
# if it is lost, the source reads each file again from the beginning
a3.sources.r3.positionFile = /opt/module/flume/tail_dir.json
a3.sources.r3.filegroups = f1 f2
a3.sources.r3.filegroups.f1 = /opt/module/flume/files/.*file.*
a3.sources.r3.filegroups.f2 = /opt/module/flume/files/.*log.*

# Describe the sink
a3.sinks.k3.type = hdfs
a3.sinks.k3.hdfs.path = hdfs://hadoop101:8020/flume/upload2/%Y%m%d/%H
# Prefix for files uploaded to HDFS
a3.sinks.k3.hdfs.filePrefix = upload-
# Whether to roll directories based on time
a3.sinks.k3.hdfs.round = true
# How many time units before a new directory is created
a3.sinks.k3.hdfs.roundValue = 1
# The time unit used for rolling directories
a3.sinks.k3.hdfs.roundUnit = hour
# Whether to use the local timestamp
a3.sinks.k3.hdfs.useLocalTimeStamp = true
# Number of events to accumulate before flushing to HDFS
a3.sinks.k3.hdfs.batchSize = 100
# File type; compression is supported
a3.sinks.k3.hdfs.fileType = DataStream
# How often to roll a new file
a3.sinks.k3.hdfs.rollInterval = 60
# Roll each file at roughly 128 MB
a3.sinks.k3.hdfs.rollSize = 134217700
# Rolling is independent of the number of events
a3.sinks.k3.hdfs.rollCount = 0

# Use a channel which buffers events in memory
a3.channels.c3.type = memory
a3.channels.c3.capacity = 1000
a3.channels.c3.transactionCapacity = 100

# Bind the source and sink to the channel
a3.sources.r3.channels = c3
a3.sinks.k3.channel = c3
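For reference, the position file is a JSON array with one entry per tailed file, recording the byte offset the source has reached. An illustrative example of what /opt/module/flume/tail_dir.json might contain (the inode and pos values are made up):

[{"inode": 2496272, "pos": 12, "file": "/opt/module/flume/files/file1.txt"},
 {"inode": 2496273, "pos": 4, "file": "/opt/module/flume/files/file2.txt"}]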
Exec Source is suitable for monitoring a single file that is appended in real time, but it cannot resume from where it left off after a restart;
Spooldir Source is suitable for syncing new files, but not for monitoring and syncing files that are being appended in real time;
Taildir Source is suitable for monitoring multiple files appended in real time, and it can resume from the last read position.
--Start the agent that monitors the directory
flume-ng agent -n a3 -c conf/ -f job/flume-taildir-hdfs.conf
--Create the files directory under /opt/module/flume
mkdir files
--Enter the files directory and append some data; HDFS records the new content
echo hello >> file1.txt
echo 111 > file2.txt
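To see the resume behaviour described above, stop the agent, append while it is down, and restart it. A sketch; the pgrep pattern assumes this is the only matching agent process:

kill $(pgrep -f flume-taildir-hdfs.conf)
echo appended-while-down >> file1.txt
flume-ng agent -n a3 -c conf/ -f job/flume-taildir-hdfs.conf
# Only the new line is shipped: the agent resumes from the offset recorded
# in /opt/module/flume/tail_dir.json rather than re-reading file1.txt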