Common Flume Component Examples

freesOcean

Installation

1) Flume website

http://flume.apache.org/

2) Documentation

http://flume.apache.org/FlumeUserGuide.html

3) Downloads

http://archive.apache.org/dist/flume/

Step 1: Upload apache-flume-1.7.0-bin.tar.gz to the Linux host and extract it into the /opt/ directory.

Step 2: In flume/conf, rename flume-env.sh.template to flume-env.sh, then edit flume-env.sh and set the JAVA_HOME path.
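
The two steps can be scripted. The following is a minimal sketch rather than an official procedure: the helper name setup_flume_env is hypothetical, the JDK path in the usage comment is an assumption, and the helper copies the template instead of renaming it so that the original stays available.

```shell
# Hypothetical helper: create conf/flume-env.sh in an extracted Flume directory.
# Usage: setup_flume_env <flume_home> <java_home>
setup_flume_env() {
  flume_home=$1
  java_home=$2
  # Copy (not move) the template so the original is kept for reference.
  cp "$flume_home/conf/flume-env.sh.template" "$flume_home/conf/flume-env.sh"
  # Append the JAVA_HOME export to the new file.
  printf 'export JAVA_HOME=%s\n' "$java_home" >> "$flume_home/conf/flume-env.sh"
}

# Example usage (both paths are assumptions, adjust them to your machine):
#   tar -zxf apache-flume-1.7.0-bin.tar.gz -C /opt/
#   setup_flume_env /opt/apache-flume-1.7.0-bin /usr/lib/jvm/java-8-openjdk
```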

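Definition

Flume is a highly available, highly reliable, distributed system for collecting, aggregating, and transporting large volumes of log data. It was originally developed at Cloudera.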


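Agent: a JVM process that moves data from its origin to its destination in the form of events. It has three parts:
- Source: receives data. Supported types include avro, thrift, exec, jms, spooling directory, netcat, sequence generator, syslog, http, and legacy.
- Channel: a buffer between the Source and the Sink, which allows the two to run at different rates. A Channel is thread-safe and can handle writes from several Sources and reads from several Sinks at the same time.
- Sink: polls the Channel and writes events to a destination. Destinations include hdfs, logger, avro, thrift, irc, file roll, HBase, solr, and custom sinks.

A Source can write to multiple Channels, and a Channel can be polled by multiple Sinks, but each Sink can be bound to only one Channel.

Event: the unit of data transfer, made up of a header and a body. The header holds key-value attributes and the body holds the payload as a byte array.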

Commonly used components:

Source (type value in the config file)	Description
netcat	NetCat
spooldir	Spooling Directory
TAILDIR	Taildir
exec	Exec
avro	Avro
org.apache.flume.source.kafka.KafkaSource	Kafka

Sink (type value in the config file)	Description
hdfs	HDFS Sink
hive	Hive Sink
avro	Avro Sink
logger	Logger Sink
file_roll	File Roll Sink
org.apache.flume.sink.kafka.KafkaSink	Kafka Sink

Channel (type value in the config file)	Description
memory	Memory Channel
org.apache.flume.channel.kafka.KafkaChannel	Kafka Channel
jdbc	JDBC Channel

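Example: listening on a port and writing events to the console

The nc tool:

Acting as the server:

nc -lk 4444 # listen on port 4444 on the local machine

Acting as the client:

nc 192.168.111.12 4444 # connect to port 4444 on the server

The two sides can now send data to each other.

1. Create a directory, then create a configuration file in it named netcat-flume-logger.conf

The directory and file names are up to you (the commands below use job/netcat-flume-logger.conf). The contents are as follows: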

# example.conf: A single-node Flume configuration

# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444

# Describe the sink
a1.sinks.k1.type = logger

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

a1 is the name of the agent.

2. Start Flume so that it listens on the configured port:

bin/flume-ng agent --conf conf/ --name a1 --conf-file job/netcat-flume-logger.conf -Dflume.root.logger=INFO,console

3. Use netcat as a client to connect to Flume and send data:

nc localhost 44444

Option 2: the same start command written with short options:

 bin/flume-ng agent -c conf/ -n a1 -f job/netcat-flume-logger.conf -Dflume.root.logger=INFO,console

Monitoring a single appended log file

Configuration file:

# Name the components on this agent
a2.sources = r2
a2.sinks = k2
a2.channels = c2

# Describe/configure the source
a2.sources.r2.type = exec
a2.sources.r2.command = tail -F /home/script/serverJAR/BSNService_log
a2.sources.r2.shell = /bin/bash -c

# Describe the sink
a2.sinks.k2.type = hdfs
a2.sinks.k2.hdfs.path = hdfs://master:9000/flume/%Y%m%d/%H
# Use the local time, instead of the timestamp in the event header, when expanding the path escapes
a2.sinks.k2.hdfs.useLocalTimeStamp = true

# Use a channel which buffers events in memory
a2.channels.c2.type = memory
a2.channels.c2.capacity = 1000
a2.channels.c2.transactionCapacity = 100

# Bind the source and sink to the channel
a2.sources.r2.channels = c2
a2.sinks.k2.channel = c2

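To write to HDFS, the Hadoop JARs must be added to Flume's lib directory (the original post listed them in a screenshot that is not reproduced here).

Start the agent: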

 bin/flume-ng agent --conf conf/ --name a2 --conf-file job/flume-file-hdfs.conf  -Dflume.root.logger=INFO,console

HDFS Sink properties, as described in the official documentation:

Property	Default	Description
hdfs.rollInterval	30	Number of seconds to wait before rolling current file (0 = never roll based on time interval)
hdfs.rollSize	1024	File size to trigger roll, in bytes (0: never roll based on file size)
hdfs.rollCount	10	Number of events written to file before it rolled (0 = never roll based on number of events)

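These three properties control file rolling by time, by size, and by event count, respectively.

Scenarios:

1. If logs are produced very quickly or very slowly, rolling by time bounds each file to the logs of a fixed time window.

2. The size threshold can be set to the size of one HDFS block, for example 128 MB.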
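
Since hdfs.rollSize is given in bytes, the block size has to be converted first. A small sketch of the arithmetic, assuming "128 MB" means 128 × 1024 × 1024 bytes (the variable name is illustrative):

```shell
# 128 MB expressed in bytes, the unit hdfs.rollSize expects.
block_bytes=$((128 * 1024 * 1024))
echo "$block_bytes"   # prints 134217728
```

The complete example configuration later in this post uses 134217700, a value just under this figure.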

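For example, with the interval set to 15 seconds and the count set to 5: if the count is reached first, the file is rolled, meaning the sixth event is written to a new file on HDFS; if the interval elapses first, the next event to arrive is written to a new file.

# Describe the sink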
a3.sinks.k3.type = hdfs
a3.sinks.k3.hdfs.path = hdfs://master:9000/flume/%Y%m%d/%H
a3.sinks.k3.hdfs.rollCount=5
a3.sinks.k3.hdfs.rollInterval=15
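
The interplay of the two thresholds can be traced with a small simulation. This is a simplified model of the behaviour described above, not Flume's actual implementation: it ignores hdfs.rollSize, measures time in whole seconds from when a file is opened, and the function name simulate_rolls is hypothetical.

```shell
# simulate_rolls <rollCount> <rollInterval> <arrival_second>...
# Simplified model: prints which file number each event lands in.
simulate_rolls() {
  roll_count=$1
  roll_interval=$2
  shift 2
  file=0       # current file number; 0 means no file is open yet
  opened_at=0  # arrival time of the event that opened the current file
  in_file=0    # events written to the current file so far
  for t in "$@"; do
    roll=0
    [ "$file" -eq 0 ] && roll=1
    # A threshold of 0 disables that trigger, as in the property table.
    [ "$roll_count" -gt 0 ] && [ "$in_file" -ge "$roll_count" ] && roll=1
    [ "$roll_interval" -gt 0 ] && [ $((t - opened_at)) -ge "$roll_interval" ] && roll=1
    if [ "$roll" -eq 1 ]; then
      file=$((file + 1))
      opened_at=$t
      in_file=0
    fi
    in_file=$((in_file + 1))
    echo "t=$t file=$file"
  done
}

# rollCount=5, rollInterval=15; events arrive at seconds 0..5 and at second 30.
simulate_rolls 5 15 0 1 2 3 4 5 30
```

In this run the first five events land in file 1, the sixth event (second 5) opens file 2 because the count was reached, and the event at second 30 opens file 3 because more than 15 seconds have passed.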

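/%Y%m%d/%H expands to the year, month, day, and hour of the event's timestamp, so events move to a new directory automatically when the hour changes. The following three properties customize how that timestamp is rounded:

# Whether to round the timestamp down, so that directories roll by time
a2.sinks.k2.hdfs.round = true
# How many time units make up one directory
a2.sinks.k2.hdfs.roundValue = 1
# The time unit used for rounding
a2.sinks.k2.hdfs.roundUnit = hour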
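
The four escapes used in the path (%Y, %m, %d, %H) have the same meaning in the date command, so the expansion can be previewed from a shell. This sketch assumes GNU date (for the -d @epoch option) and fixes the time zone to UTC so the result is reproducible; Flume itself expands the path from the event timestamp, or from local time when useLocalTimeStamp is true. The helper name hdfs_dir and the sample timestamps are illustrative.

```shell
# hdfs_dir <epoch_seconds>: render /flume/%Y%m%d/%H for a timestamp, in UTC.
hdfs_dir() {
  TZ=UTC date -d "@$1" +/flume/%Y%m%d/%H
}

hdfs_dir 1601562345   # 2020-10-01 14:25:45 UTC -> /flume/20201001/14
hdfs_dir 1601564399   # 2020-10-01 14:59:59 UTC -> /flume/20201001/14
hdfs_dir 1601564400   # 2020-10-01 15:00:00 UTC -> /flume/20201001/15
```

Timestamps within the same hour map to the same directory, and the first event of the next hour starts a new one.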

A complete example configuration file:

# Name the components on this agent
a2.sources = r2
a2.sinks = k2
a2.channels = c2
# Describe/configure the source
a2.sources.r2.type = exec
a2.sources.r2.command = tail -F /opt/module/hive/logs/hive.log
a2.sources.r2.shell = /bin/bash -c
# Describe the sink
a2.sinks.k2.type = hdfs
a2.sinks.k2.hdfs.path = hdfs://hadoop102:9000/flume/%Y%m%d/%H
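# Prefix for the names of uploaded files
a2.sinks.k2.hdfs.filePrefix = logs-
# Whether to round the timestamp down, so that directories roll by time
a2.sinks.k2.hdfs.round = true
# How many time units make up one directory
a2.sinks.k2.hdfs.roundValue = 1
# The time unit used for rounding
a2.sinks.k2.hdfs.roundUnit = hour
# Use the local time instead of the timestamp in the event header
a2.sinks.k2.hdfs.useLocalTimeStamp = true
# Number of events to accumulate before flushing to HDFS
# (lowered from 1000 so it does not exceed the channel's transactionCapacity of 100)
a2.sinks.k2.hdfs.batchSize = 100
# File type; DataStream writes events uncompressed (CompressedStream enables compression)
a2.sinks.k2.hdfs.fileType = DataStream
# Roll to a new file every 30 seconds
a2.sinks.k2.hdfs.rollInterval = 30
# Roll when the file reaches this size in bytes (roughly 128 MB)
a2.sinks.k2.hdfs.rollSize = 134217700
# Do not roll based on the number of events
a2.sinks.k2.hdfs.rollCount = 0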
# Use a channel which buffers events in memory
a2.channels.c2.type = memory
a2.channels.c2.capacity = 1000
a2.channels.c2.transactionCapacity = 100
# Bind the source and sink to the channel
a2.sources.r2.channels = c2
a2.sinks.k2.channel = c2

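Monitoring a directory for new files

Use the Spooling Directory (spooldir) Source.

How it works: when a new file appears in the monitored directory, the source picks it up. Once a file has been ingested, Flume renames it by appending a suffix, .COMPLETED by default. This source cannot track changes to a file's contents after the file has been placed in the directory. It scans for new files every 500 milliseconds and ignores files that already end in .COMPLETED.

Configuration file: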

a3.sources = r3
a3.sinks = k3
a3.channels = c3
# Describe/configure the source
a3.sources.r3.type = spooldir
a3.sources.r3.spoolDir = /opt/module/flume/upload
a3.sources.r3.fileSuffix = .COMPLETED
a3.sources.r3.fileHeader = true
# Ignore (do not upload) every file whose name ends in .tmp
a3.sources.r3.ignorePattern = ([^ ]*\.tmp)
# Describe the sink
a3.sinks.k3.type = hdfs
a3.sinks.k3.hdfs.path = hdfs://hadoop102:9000/flume/upload/%Y%m%d/%H
# Prefix for the names of uploaded files
a3.sinks.k3.hdfs.filePrefix = upload-
# Whether to round the timestamp down, so that directories roll by time
a3.sinks.k3.hdfs.round = true
# How many time units make up one directory
a3.sinks.k3.hdfs.roundValue = 1
# The time unit used for rounding
a3.sinks.k3.hdfs.roundUnit = hour
# Use the local time instead of the timestamp in the event header
a3.sinks.k3.hdfs.useLocalTimeStamp = true
# Number of events to accumulate before flushing to HDFS; keep it small for testing
a3.sinks.k3.hdfs.batchSize = 2
# Use a channel which buffers events in memory
a3.channels.c3.type = memory
a3.channels.c3.capacity = 1000
a3.channels.c3.transactionCapacity = 100
# Bind the source and sink to the channel
a3.sources.r3.channels = c3
a3.sinks.k3.channel = c3
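
Because the source cannot follow changes to a file once it is in the directory, it helps to make each file appear there in one step. One way to do that, sketched below, is to stage the file under a .tmp name, which the ignorePattern in the configuration skips, and then rename it. The helper name spool_put is hypothetical, and the sketch assumes file names without spaces, since the pattern ([^ ]*\.tmp) does not match names that contain them.

```shell
# spool_put <source_file> <spool_dir>
# Hypothetical helper: place a finished file into the spooling directory.
spool_put() {
  src=$1
  spool_dir=$2
  name=$(basename "$src")
  # Stage under a .tmp name, which the ignorePattern tells the source to skip.
  cp "$src" "$spool_dir/$name.tmp"
  # Renaming within one directory is a single step, so the source
  # only ever sees the complete file under its final name.
  mv "$spool_dir/$name.tmp" "$spool_dir/$name"
}

# Example usage (the log file path is an assumption):
#   spool_put /var/log/app/2020-10-01.log /opt/module/flume/upload
```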

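Resumable file reading (Taildir)

Exec Source is suited to monitoring a file that is appended to in real time, but it cannot guarantee that no data is lost. Spooldir Source guarantees that no data is lost and can resume after an interruption, but it has higher latency and cannot monitor in real time. Taildir Source can resume after an interruption, guarantees that no data is lost, and also monitors in real time.

Taildir records how far it has read in each file, which is what makes resuming possible.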
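
The idea of resuming from a recorded position can be shown with a few lines of shell. This is only an illustration of the principle, not how Taildir is implemented: it tracks one file with a plain byte offset, whereas Taildir keeps its positions in the JSON file named by positionFile. The function name read_new is hypothetical, and head -c is assumed to be available, as it is on typical Linux systems.

```shell
# read_new <file> <position_file>
# Print the bytes appended to <file> since the previous call, then
# store the new offset so the next call resumes from there.
read_new() {
  file=$1
  pos_file=$2
  pos=0
  [ -f "$pos_file" ] && pos=$(cat "$pos_file")
  # Capture the size first so that exactly this many bytes are accounted for.
  size=$(( $(wc -c < "$file") ))
  # tail -c +K starts at byte K (1-based); head -c caps the output at the captured size.
  tail -c "+$((pos + 1))" "$file" | head -c "$((size - pos))"
  echo "$size" > "$pos_file"
}

# Example usage (paths are assumptions):
#   read_new /opt/module/flume/files/file1.txt /tmp/file1.pos
```

Calling the function again after a restart prints only what was appended in the meantime, because the offset survives in the position file.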

1. Configuration file:

a3.sources = r3
a3.sinks = k3
a3.channels = c3
# Describe/configure the source
a3.sources.r3.type = TAILDIR
a3.sources.r3.positionFile = /opt/module/flume/tail_dir.json
a3.sources.r3.filegroups = f1
a3.sources.r3.filegroups.f1 = /opt/module/flume/files/file.*
# Describe the sink
a3.sinks.k3.type = hdfs
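a3.sinks.k3.hdfs.path = hdfs://hadoop102:9000/flume/upload/%Y%m%d/%H
# Prefix for the names of uploaded files
a3.sinks.k3.hdfs.filePrefix = upload-
# Whether to round the timestamp down, so that directories roll by time
a3.sinks.k3.hdfs.round = true
# How many time units make up one directory
a3.sinks.k3.hdfs.roundValue = 1
# The time unit used for rounding
a3.sinks.k3.hdfs.roundUnit = hour
# Use the local time instead of the timestamp in the event header
a3.sinks.k3.hdfs.useLocalTimeStamp = true
# Number of events to accumulate before flushing to HDFS
a3.sinks.k3.hdfs.batchSize = 100
# File type; DataStream writes events uncompressed (CompressedStream enables compression)
a3.sinks.k3.hdfs.fileType = DataStream
# Roll to a new file every 60 seconds
a3.sinks.k3.hdfs.rollInterval = 60
# Roll when the file reaches this size in bytes (roughly 128 MB)
a3.sinks.k3.hdfs.rollSize = 134217700
# Do not roll based on the number of events
a3.sinks.k3.hdfs.rollCount = 0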
# Use a channel which buffers events in memory
a3.channels.c3.type = memory
a3.channels.c3.capacity = 1000
a3.channels.c3.transactionCapacity = 100
# Bind the source and sink to the channel
a3.sources.r3.channels = c3
a3.sinks.k3.channel = c3

2. Start Flume (the agent name passed with -n must match the name used in the configuration file, a3):

bin/flume-ng agent -c conf/ -n a3 -f job/flume-taildir-hdfs.conf -Dflume.root.logger=INFO,console
