-
1. Getting-started example: monitor data on a port
-
Requirement: use Flume to listen on a port, collect the data, and print it to the console
-
Implementation steps
-
- Install the netcat tool
sudo yum install -y nc
-
- Check whether port 44444 is already in use
sudo netstat -nlp | grep 44444
-
-
Create the Flume agent configuration file flume-netcat-logger.conf
vim /opt/app/flume/job/flume-netcat-logger.conf
-
-
-
Start the agent and begin listening on the port
bin/flume-ng agent -c conf/ -n a1 -f job/flume-netcat-logger.conf -Dflume.root.logger=INFO,console
Parameter notes:
-c: the configuration directory is conf/
-n: name the agent a1
-f: the configuration file Flume reads for this run is flume-netcat-logger.conf in the job folder
-Dflume.root.logger=INFO,console: -D overrides the flume.root.logger property at runtime, here setting the console log level to INFO. Log levels include debug, info, warn, and error.
-
-
-
Use netcat to send data to port 44444 on the local machine
nc localhost 44444
Each line typed into the netcat session is sent to the port and printed by the logger sink on the agent's console.
-
-
-
Configuration file
# Name the components on this agent
# r1, k1, c1 name a1's source, sink, and channel respectively
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source
# a1's source is a netcat source listening on localhost:44444
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444

# Describe the sink
# a1's sink writes to the console (logger)
a1.sinks.k1.type = logger

# Use a channel which buffers events in memory
# total capacity is 1000 events; a transaction is committed every 100 events
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
-
-
-
Real-time monitoring of a single appended file
-
Requirement: monitor the Hive log in real time and upload it to HDFS
-
Implementation steps
-
-
Create the flume-file-hdfs.conf configuration file
vim /opt/app/flume-1.9.0/job/flume-file-hdfs.conf
-
Run Flume
bin/flume-ng agent -c conf/ -n a2 -f job/flume-file-hdfs.conf
-
-
-
Configuration file
# Name the components on this agent
a2.sources = r2
a2.sinks = k2
a2.channels = c2

# Describe/configure the source
a2.sources.r2.type = exec
a2.sources.r2.command = tail -F /opt/app/hive-3.1.2/logs/hive.log

# Describe the sink
a2.sinks.k2.type = hdfs
a2.sinks.k2.hdfs.path = hdfs://hadoop001:8020/flume/%Y%m%d/%H
# prefix for uploaded files
a2.sinks.k2.hdfs.filePrefix = logs-
# whether to roll folders by time
a2.sinks.k2.hdfs.round = true
# number of time units per new folder
a2.sinks.k2.hdfs.roundValue = 1
# the time unit itself
a2.sinks.k2.hdfs.roundUnit = hour
# whether to use the local timestamp
a2.sinks.k2.hdfs.useLocalTimeStamp = true
# number of events to accumulate before flushing to HDFS
a2.sinks.k2.hdfs.batchSize = 100
# file type; compression is supported
a2.sinks.k2.hdfs.fileType = DataStream
# how often (seconds) to roll to a new file
a2.sinks.k2.hdfs.rollInterval = 60
# roll size per file
a2.sinks.k2.hdfs.rollSize = 134217700
# rolling is independent of the event count
a2.sinks.k2.hdfs.rollCount = 0

# Use a channel which buffers events in memory
a2.channels.c2.type = memory
a2.channels.c2.capacity = 1000
a2.channels.c2.transactionCapacity = 100

# Bind the source and sink to the channel
a2.sources.r2.channels = c2
a2.sinks.k2.channel = c2
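To sanity-check the pipeline without generating real Hive traffic, you can append to the monitored file by hand and then list the sink path on HDFS; a quick sketch, assuming the paths from the config above:
echo "test line" >> /opt/app/hive-3.1.2/logs/hive.log
hadoop fs -ls -R /flume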
-
-
Real-time monitoring of multiple new files in a directory
-
Requirement: use Flume to watch all files in a directory and upload them to HDFS
-
Implementation steps
-
-
Create the flume-dir-hdfs.conf configuration file
vim flume-dir-hdfs.conf
-
Start Flume (command sketched below)
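Following the pattern of the previous examples, and taking the agent name a3 from the config below (the job-file location is an assumption):
bin/flume-ng agent -c conf/ -n a3 -f job/flume-dir-hdfs.conf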
-
-
-
Configuration file
a3.sources = r3
a3.sinks = k3
a3.channels = c3

# Describe/configure the source
a3.sources.r3.type = spooldir
a3.sources.r3.spoolDir = /opt/app/flume/upload
a3.sources.r3.fileSuffix = .COMPLETED
a3.sources.r3.fileHeader = true
# ignore (do not upload) any file ending in .tmp
a3.sources.r3.ignorePattern = ([^ ]*\.tmp)

# Describe the sink
a3.sinks.k3.type = hdfs
a3.sinks.k3.hdfs.path = hdfs://hadoop001:8020/flume/upload/%Y%m%d/%H
# prefix for uploaded files
a3.sinks.k3.hdfs.filePrefix = upload-
# whether to roll folders by time
a3.sinks.k3.hdfs.round = true
# number of time units per new folder
a3.sinks.k3.hdfs.roundValue = 1
# the time unit itself
a3.sinks.k3.hdfs.roundUnit = hour
# use the local timestamp; required here, since the event headers carry no timestamp
a3.sinks.k3.hdfs.useLocalTimeStamp = true
# number of events to accumulate before flushing to HDFS
a3.sinks.k3.hdfs.batchSize = 100
# file type; compression is supported
a3.sinks.k3.hdfs.fileType = DataStream
# how often (seconds) to roll to a new file
a3.sinks.k3.hdfs.rollInterval = 60
# roll size per file, roughly 128 MB
a3.sinks.k3.hdfs.rollSize = 134217700
# rolling is independent of the event count
a3.sinks.k3.hdfs.rollCount = 0

# Use a channel which buffers events in memory
a3.channels.c3.type = memory
a3.channels.c3.capacity = 1000
a3.channels.c3.transactionCapacity = 100

# Bind the source and sink to the channel
a3.sources.r3.channels = c3
a3.sinks.k3.channel = c3
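To exercise the spooling source, drop files into the watched directory; a sketch using the paths and patterns from the config above:
cp /etc/hosts /opt/app/flume/upload/hosts.txt
touch /opt/app/flume/upload/skip-me.tmp
ls /opt/app/flume/upload
Uploaded files are renamed with the .COMPLETED suffix, while the .tmp file is skipped because it matches ignorePattern.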
-
-
-
Real-time monitoring of multiple appended files in a directory
-
Requirement: use Flume to watch an entire directory for files being appended to in real time, and upload them to HDFS
-
Implementation steps
-
-
Create the flume-taildir-hdfs.conf configuration file
vim flume-taildir-hdfs.conf
-
Start Flume (command sketched below)
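As before, the agent name a3 comes from the config below; the job-file location is an assumption:
bin/flume-ng agent -c conf/ -n a3 -f job/flume-taildir-hdfs.conf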
-
-
Configuration file
a3.sources = r3
a3.sinks = k3
a3.channels = c3

# Describe/configure the source
a3.sources.r3.type = TAILDIR
a3.sources.r3.positionFile = /opt/app/flume-1.9.0/tail_dir.json
a3.sources.r3.filegroups = f1 f2
a3.sources.r3.filegroups.f1 = /opt/app/flume-1.9.0/files1/.*file.*
a3.sources.r3.filegroups.f2 = /opt/app/flume-1.9.0/files2/.*log.*

# Describe the sink
a3.sinks.k3.type = hdfs
a3.sinks.k3.hdfs.path = hdfs://hadoop001:8020/flume/upload2/%Y%m%d/%H
# prefix for uploaded files
a3.sinks.k3.hdfs.filePrefix = upload-
# whether to roll folders by time
a3.sinks.k3.hdfs.round = true
# number of time units per new folder
a3.sinks.k3.hdfs.roundValue = 1
# the time unit itself
a3.sinks.k3.hdfs.roundUnit = hour
# whether to use the local timestamp
a3.sinks.k3.hdfs.useLocalTimeStamp = true
# number of events to accumulate before flushing to HDFS
a3.sinks.k3.hdfs.batchSize = 100
# file type; compression is supported
a3.sinks.k3.hdfs.fileType = DataStream
# how often (seconds) to roll to a new file
a3.sinks.k3.hdfs.rollInterval = 60
# roll size per file, roughly 128 MB
a3.sinks.k3.hdfs.rollSize = 134217700
# rolling is independent of the event count
a3.sinks.k3.hdfs.rollCount = 0

# Use a channel which buffers events in memory
a3.channels.c3.type = memory
a3.channels.c3.capacity = 1000
a3.channels.c3.transactionCapacity = 100

# Bind the source and sink to the channel
a3.sources.r3.channels = c3
a3.sinks.k3.channel = c3
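To test, append to files matching the two filegroup patterns from the config above:
mkdir -p /opt/app/flume-1.9.0/files1 /opt/app/flume-1.9.0/files2
echo hello >> /opt/app/flume-1.9.0/files1/file1.txt
echo world >> /opt/app/flume-1.9.0/files2/log1.txt
Because read positions are persisted in tail_dir.json, restarting the agent resumes from the last recorded offsets instead of re-reading the files.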
-
-
-
-
Replicating and Multiplexing
-
Case requirement: Flume-1 monitors changes to a file and passes the new content to Flume-2, which stores it on HDFS. At the same time, Flume-1 passes the new content to Flume-3, which writes it to the local file system.
-
Implementation steps
-
Create flume-file-flume.conf, configuring one source that reads the log file plus two channels and two sinks, feeding flume-flume-hdfs and flume-flume-dir respectively.
vim /opt/app/flume-1.9.0/job/group1/flume-file-flume.conf
-
Create flume-flume-hdfs.conf, configuring a source that receives the upstream Flume's output and a sink that writes to HDFS.
vim /opt/app/flume-1.9.0/job/group1/flume-flume-hdfs.conf
-
Create flume-flume-dir.conf, configuring a source that receives the upstream Flume's output and a sink that writes to a local directory.
vim /opt/app/flume-1.9.0/job/group1/flume-flume-dir.conf
-
Run the configuration files (a launch sequence is sketched below)
-
Start Hadoop and Hive
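One possible launch sequence, starting the downstream agents first so their avro sources are listening before the upstream avro sinks connect (agent names come from the configs below; the file_roll target directory must exist beforehand):
mkdir -p /opt/app/data/flume3
bin/flume-ng agent -c conf/ -n a3 -f job/group1/flume-flume-dir.conf
bin/flume-ng agent -c conf/ -n a2 -f job/group1/flume-flume-hdfs.conf
bin/flume-ng agent -c conf/ -n a1 -f job/group1/flume-file-flume.conf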
-
-
Configuration files
-
flume-file-flume.conf
# Name the components on this agent
a1.sources = r1
a1.sinks = k1 k2
a1.channels = c1 c2

# replicate the data stream to all channels
a1.sources.r1.selector.type = replicating

# Describe/configure the source
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /opt/app/hive-3.1.2/logs/hive.log
a1.sources.r1.shell = /bin/bash -c

# Describe the sink
# an avro sink is a data sender
a1.sinks.k1.type = avro
a1.sinks.k1.hostname = hadoop001
a1.sinks.k1.port = 4141
a1.sinks.k2.type = avro
a1.sinks.k2.hostname = hadoop001
a1.sinks.k2.port = 4142

# Describe the channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
a1.channels.c2.type = memory
a1.channels.c2.capacity = 1000
a1.channels.c2.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1 c2
a1.sinks.k1.channel = c1
a1.sinks.k2.channel = c2
-
flume-flume-hdfs.conf
# Name the components on this agent
a2.sources = r1
a2.sinks = k1
a2.channels = c1

# Describe/configure the source
# an avro source is a data-receiving service
a2.sources.r1.type = avro
a2.sources.r1.bind = hadoop001
a2.sources.r1.port = 4141

# Describe the sink
a2.sinks.k1.type = hdfs
a2.sinks.k1.hdfs.path = hdfs://hadoop001:8020/flume2/%Y%m%d/%H
# prefix for uploaded files
a2.sinks.k1.hdfs.filePrefix = flume2-
# whether to roll folders by time
a2.sinks.k1.hdfs.round = true
# number of time units per new folder
a2.sinks.k1.hdfs.roundValue = 1
# the time unit itself
a2.sinks.k1.hdfs.roundUnit = hour
# whether to use the local timestamp
a2.sinks.k1.hdfs.useLocalTimeStamp = true
# number of events to accumulate before flushing to HDFS
a2.sinks.k1.hdfs.batchSize = 100
# file type; compression is supported
a2.sinks.k1.hdfs.fileType = DataStream
# how often (seconds) to roll to a new file
a2.sinks.k1.hdfs.rollInterval = 30
# roll size per file, roughly 128 MB
a2.sinks.k1.hdfs.rollSize = 134217700
# rolling is independent of the event count
a2.sinks.k1.hdfs.rollCount = 0

# Describe the channel
a2.channels.c1.type = memory
a2.channels.c1.capacity = 1000
a2.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a2.sources.r1.channels = c1
a2.sinks.k1.channel = c1
-
flume-flume-dir.conf
# Name the components on this agent
a3.sources = r1
a3.sinks = k1
a3.channels = c2

# Describe/configure the source
a3.sources.r1.type = avro
a3.sources.r1.bind = hadoop001
a3.sources.r1.port = 4142

# Describe the sink
a3.sinks.k1.type = file_roll
a3.sinks.k1.sink.directory = /opt/app/data/flume3

# Describe the channel
a3.channels.c2.type = memory
a3.channels.c2.capacity = 1000
a3.channels.c2.transactionCapacity = 100

# Bind the source and sink to the channel
a3.sources.r1.channels = c2
a3.sinks.k1.channel = c2
-
-
-
Load Balancing and Failover
-
Case requirement: Flume1 monitors a port; the sinks in its sink group connect to Flume2 and Flume3 respectively, using FailoverSinkProcessor to implement failover.
-
Implementation steps:
-
Create flume-netcat-flume.conf
vim /opt/app/flume-1.9.0/job/group2/flume-netcat-flume.conf
-
Create flume-flume-console1.conf
vim /opt/app/flume-1.9.0/job/group2/flume-flume-console1.conf
-
Create flume-flume-console2.conf
vim /opt/app/flume-1.9.0/job/group2/flume-flume-console2.conf
-
Run the configuration files (launch commands sketched below)
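A plausible launch order, again bringing up the downstream console agents before the upstream agent (agent names come from the configs below):
bin/flume-ng agent -c conf/ -n a2 -f job/group2/flume-flume-console1.conf -Dflume.root.logger=INFO,console
bin/flume-ng agent -c conf/ -n a3 -f job/group2/flume-flume-console2.conf -Dflume.root.logger=INFO,console
bin/flume-ng agent -c conf/ -n a1 -f job/group2/flume-netcat-flume.conf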
-
-
Configuration files:
- flume-netcat-flume.conf
# Name the components on this agent
a1.sources = r1
a1.channels = c1
a1.sinkgroups = g1
a1.sinks = k1 k2

# Describe/configure the source
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444

a1.sinkgroups.g1.processor.type = failover
a1.sinkgroups.g1.processor.priority.k1 = 5
a1.sinkgroups.g1.processor.priority.k2 = 10
a1.sinkgroups.g1.processor.maxpenalty = 10000

# Describe the sink
a1.sinks.k1.type = avro
a1.sinks.k1.hostname = hadoop001
a1.sinks.k1.port = 4141
a1.sinks.k2.type = avro
a1.sinks.k2.hostname = hadoop001
a1.sinks.k2.port = 4142

# Describe the channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinkgroups.g1.sinks = k1 k2
a1.sinks.k1.channel = c1
a1.sinks.k2.channel = c1
- flume-flume-console1.conf
# Name the components on this agent
a2.sources = r1
a2.sinks = k1
a2.channels = c1

# Describe/configure the source
a2.sources.r1.type = avro
a2.sources.r1.bind = hadoop001
a2.sources.r1.port = 4141

# Describe the sink
a2.sinks.k1.type = logger

# Describe the channel
a2.channels.c1.type = memory
a2.channels.c1.capacity = 1000
a2.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a2.sources.r1.channels = c1
a2.sinks.k1.channel = c1
- flume-flume-console2.conf
# Name the components on this agent
a3.sources = r1
a3.sinks = k1
a3.channels = c2

# Describe/configure the source
a3.sources.r1.type = avro
a3.sources.r1.bind = hadoop001
a3.sources.r1.port = 4142

# Describe the sink
a3.sinks.k1.type = logger

# Describe the channel
a3.channels.c2.type = memory
a3.channels.c2.capacity = 1000
a3.channels.c2.transactionCapacity = 100

# Bind the source and sink to the channel
a3.sources.r1.channels = c2
a3.sinks.k1.channel = c2
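To watch the failover happen, send events through the netcat source and then kill the agent behind the higher-priority sink; a test sketch:
nc localhost 44444
With both console agents up, events land on the a3 console (k2, priority 10). After killing the a3 agent (for example, find its pid with jps and kill -9 it), new events should switch to the a2 console (k1, priority 5).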
-
Load balancing
-
Implementation steps: similar to the above; only part of the flume-netcat-flume.conf configuration is modified
-
Configuration file
- flume-netcat-flume.conf
Modify the source section to (the failover-specific processor.priority.* and processor.maxpenalty lines are dropped):
# Describe/configure the source
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444
a1.sinkgroups.g1.processor.type = load_balance
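The load-balancing processor also accepts a selection policy and backoff; a sketch of the optional knobs, per the Flume documentation:
a1.sinkgroups.g1.processor.backoff = true
a1.sinkgroups.g1.processor.selector = round_robin
selector may be round_robin (the default) or random; with backoff enabled, failed sinks are temporarily blacklisted with exponentially growing timeouts.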
-
-
Aggregation
-
Case requirements:
Flume-1 on hadoop001 monitors the file /opt/app/data/flume/1.log,
Flume-2 on hadoop002 monitors the data stream on a port,
Flume-1 and Flume-2 send their data to Flume-3 on hadoop003, which prints the merged data to the console.
-
Implementation steps:
-
On hadoop001, create flume1-logger-flume.conf
Configure a source that monitors the log file and a sink that sends the data to the next Flume hop.
vim /opt/app/flume-1.9.0/job/group3/flume1-logger-flume.conf
-
On hadoop002, create flume2-netcat-flume.conf
Configure a source that monitors the data stream on port 44444 and a sink that sends the data to the next Flume hop:
vim /opt/app/flume-1.9.0/job/group3/flume2-netcat-flume.conf
-
On hadoop003, create flume3-flume-logger.conf
Configure a source that receives the streams sent by flume1 and flume2, merging them into a sink that logs to the console.
vim /opt/app/flume-1.9.0/job/group3/flume3-flume-logger.conf
-
Run the configuration files (launch commands sketched below)
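A plausible launch order, starting the downstream collector first (agent names come from the configs below):
On hadoop003: bin/flume-ng agent -c conf/ -n a3 -f job/group3/flume3-flume-logger.conf -Dflume.root.logger=INFO,console
On hadoop001: bin/flume-ng agent -c conf/ -n a1 -f job/group3/flume1-logger-flume.conf
On hadoop002: bin/flume-ng agent -c conf/ -n a2 -f job/group3/flume2-netcat-flume.conf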
-
On hadoop001, append content to /opt/app/data/flume/1.log (the file tailed by the flume1 config below); on hadoop002, send data to port 44444
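A quick test sketch, using the path and port from the configs below:
On hadoop001: echo 'hello from hadoop001' >> /opt/app/data/flume/1.log
On hadoop002: nc hadoop002 44444
Both streams should appear merged on the hadoop003 console.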
-
-
Configuration files
- hadoop001 configuration file flume1-logger-flume.conf
# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /opt/app/data/flume/1.log
a1.sources.r1.shell = /bin/bash -c

# Describe the sink
a1.sinks.k1.type = avro
a1.sinks.k1.hostname = hadoop003
a1.sinks.k1.port = 4141

# Describe the channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
- hadoop002 configuration file flume2-netcat-flume.conf
# Name the components on this agent
a2.sources = r1
a2.sinks = k1
a2.channels = c1

# Describe/configure the source
a2.sources.r1.type = netcat
a2.sources.r1.bind = hadoop002
a2.sources.r1.port = 44444

# Describe the sink
a2.sinks.k1.type = avro
a2.sinks.k1.hostname = hadoop003
a2.sinks.k1.port = 4141

# Use a channel which buffers events in memory
a2.channels.c1.type = memory
a2.channels.c1.capacity = 1000
a2.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a2.sources.r1.channels = c1
a2.sinks.k1.channel = c1
- hadoop003 configuration file flume3-flume-logger.conf
# Name the components on this agent
a3.sources = r1
a3.sinks = k1
a3.channels = c1

# Describe/configure the source
a3.sources.r1.type = avro
a3.sources.r1.bind = hadoop003
a3.sources.r1.port = 4141

# Describe the sink
a3.sinks.k1.type = logger

# Describe the channel
a3.channels.c1.type = memory
a3.channels.c1.capacity = 1000
a3.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a3.sources.r1.channels = c1
a3.sinks.k1.channel = c1
-
-
Custom Interceptor
-
Case requirement: use Flume to collect a server's local logs and, according to log type, route different kinds of logs to different analysis systems.
-
Implementation steps:
-
Using IDEA, write the custom interceptor, import the dependencies, then package it and copy the jar into Flume's lib directory.
The concrete implementation lives in the flume-demo project in the IDEA code; the class is named TypeInterceptor (a sketch follows below).
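A minimal sketch of what TypeInterceptor might look like. The header key type and the mapping values bigdata/other are taken from the flume1 config below; the keyword test on the event body is an assumption about the original implementation:

package bigdata.interceptor;

import org.apache.flume.Context;
import org.apache.flume.Event;
import org.apache.flume.interceptor.Interceptor;

import java.util.List;
import java.util.Map;

public class TypeInterceptor implements Interceptor {

    @Override
    public void initialize() {
        // nothing to set up
    }

    // Tag each event with a "type" header; the multiplexing channel
    // selector uses this header to pick a channel.
    @Override
    public Event intercept(Event event) {
        Map<String, String> headers = event.getHeaders();
        String body = new String(event.getBody());
        // assumption: events containing "bigdata" go to the bigdata system
        if (body.contains("bigdata")) {
            headers.put("type", "bigdata");
        } else {
            headers.put("type", "other");
        }
        return event;
    }

    @Override
    public List<Event> intercept(List<Event> events) {
        for (Event event : events) {
            intercept(event);
        }
        return events;
    }

    @Override
    public void close() {
        // nothing to clean up
    }

    // Flume instantiates the interceptor through this builder,
    // referenced in the config as bigdata.interceptor.TypeInterceptor$Builder
    public static class Builder implements Interceptor.Builder {
        @Override
        public Interceptor build() {
            return new TypeInterceptor();
        }

        @Override
        public void configure(Context context) {
        }
    }
}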
-
Configure Flume1 on hadoop001 with 1 netcat source and 1 sink group (2 avro sinks), plus the corresponding ChannelSelector and interceptor.
vim /opt/app/flume-1.9.0/job/group4/flume1
-
Configure Flume2 on hadoop002 with an avro source and a logger sink.
vim /opt/app/flume-1.9.0/job/group4/flume2
-
Configure Flume3 on hadoop003 with an avro source and a logger sink.
vim /opt/app/flume-1.9.0/job/group4/flume3
-
Start flume2, flume3, then flume1 (commands sketched below)
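A plausible launch order, starting the downstream logger agents first (all three configs name their agent a1; the job-file paths follow the vim commands above):
On hadoop002: bin/flume-ng agent -c conf/ -n a1 -f job/group4/flume2 -Dflume.root.logger=INFO,console
On hadoop003: bin/flume-ng agent -c conf/ -n a1 -f job/group4/flume3 -Dflume.root.logger=INFO,console
On hadoop001: bin/flume-ng agent -c conf/ -n a1 -f job/group4/flume1
Then send test lines with nc localhost 44444 on hadoop001; lines containing the interceptor's keyword should reach the hadoop002 console, everything else the hadoop003 console.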
-
-
Configuration files
-
flume1 on hadoop001
# Name the components on this agent
a1.sources = r1
a1.sinks = k1 k2
a1.channels = c1 c2

# Describe/configure the source
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444
a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = bigdata.interceptor.TypeInterceptor$Builder
a1.sources.r1.selector.type = multiplexing
a1.sources.r1.selector.header = type
a1.sources.r1.selector.mapping.bigdata = c1
a1.sources.r1.selector.mapping.other = c2

# Describe the sink
a1.sinks.k1.type = avro
a1.sinks.k1.hostname = hadoop002
a1.sinks.k1.port = 4141
a1.sinks.k2.type = avro
a1.sinks.k2.hostname = hadoop003
a1.sinks.k2.port = 4242

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
a1.channels.c2.type = memory
a1.channels.c2.capacity = 1000
a1.channels.c2.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1 c2
a1.sinks.k1.channel = c1
a1.sinks.k2.channel = c2
-
flume2.conf on hadoop002
a1.sources = r1
a1.sinks = k1
a1.channels = c1

a1.sources.r1.type = avro
a1.sources.r1.bind = hadoop002
a1.sources.r1.port = 4141

a1.sinks.k1.type = logger

a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

a1.sinks.k1.channel = c1
a1.sources.r1.channels = c1
-
flume3.conf on hadoop003
a1.sources = r1
a1.sinks = k1
a1.channels = c1

a1.sources.r1.type = avro
a1.sources.r1.bind = hadoop003
a1.sources.r1.port = 4242

a1.sinks.k1.type = logger

a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

a1.sinks.k1.channel = c1
a1.sources.r1.channels = c1
-
-