The Most Complete Flume Configuration Cookbook
Preface: While working on Flume pipelines at my company, my teammates and I collected a set of proven configuration files that cover roughly 90% of everyday Flume data-collection work. Many similar write-ups online are paywalled, so I am sharing ours here. I hope it helps!
1. Local Files —> Kafka
# Either a .conf or a .properties extension works
vim /export/server/flume-1.9.0-bin/usercase/momo_mem_kafka.properties
source type: TAILDIR
channel type: memory
sink type: org.apache.flume.sink.kafka.KafkaSink
# define a1
a1.sources = s1
a1.channels = c1
a1.sinks = k1
#define s1
a1.sources.s1.type = TAILDIR
# Position file that records the metadata (tailing offsets) of monitored files
a1.sources.s1.positionFile = /export/server/flume-1.9.0-bin/position/taildir_momo_kafka.json
# Group all data sources to monitor into one file group
a1.sources.s1.filegroups = f1
# Define what f1 is: all files under the monitored directory
a1.sources.s1.filegroups.f1 = /export/data/momo_data/.*
# Add a KV pair to the header of every event collected from f1
a1.sources.s1.headers.f1.type = momo
a1.sources.s1.fileHeader = true
#define c1
a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000
a1.channels.c1.transactionCapacity = 1000
#define k1
a1.sinks.k1.type = org.apache.flume.sink.kafka.KafkaSink
a1.sinks.k1.kafka.topic = MOMO_MSG
a1.sinks.k1.kafka.bootstrap.servers = node1:9092,node2:9092,node3:9092
a1.sinks.k1.kafka.flumeBatchSize = 10
a1.sinks.k1.kafka.producer.acks = 1
a1.sinks.k1.kafka.producer.linger.ms = 100
#bound
a1.sources.s1.channels = c1
a1.sinks.k1.channel = c1
Start the agent with this configuration file:
cd /export/server/flume-1.9.0-bin
bin/flume-ng agent -c conf/ -n a1 -f usercase/momo_mem_kafka.properties -Dflume.root.logger=INFO,console
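To confirm that events are reaching Kafka, consume the topic from any broker node. A minimal check, assuming Kafka's CLI scripts live under $KAFKA_HOME (adjust to your installation):
# read the MOMO_MSG topic from the beginning
$KAFKA_HOME/bin/kafka-console-consumer.sh --bootstrap-server node1:9092 --topic MOMO_MSG --from-beginning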
2. MySQL —> Console
source type: org.apache.flume.source.jdbc.SQLSource
channel type: memory
sink type: logger
vim Esql2console.conf
a1.sources = r1
a1.sinks = k1
a1.channels = c1
# Describe/configure the source
a1.sources.r1.type = org.apache.flume.source.jdbc.SQLSource
a1.sources.r1.hibernate.connection.url = jdbc:mysql://node1:3306/mysqlsource
a1.sources.r1.hibernate.connection.user = root
a1.sources.r1.hibernate.connection.password = hadoop
a1.sources.r1.hibernate.connection.autocommit=true
a1.sources.r1.table = student
a1.sources.r1.run.query.delay=5000
a1.sources.r1.status.file.path=/export/server/flume-1.9.0-bin
a1.sources.r1.status.file.name=a1.status
# Describe the sink
a1.sinks.k1.type = logger
# Describe the channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
Note: do not create the status file yourself; the source creates it automatically, and a manually created file causes an error.
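The agent is launched the same way as in section 1. One caveat: Apache Flume 1.9 does not ship a JDBC/SQL source, so this setup assumes a third-party SQL source plugin and the MySQL JDBC driver jar have already been copied into Flume's lib/ directory:
cd /export/server/flume-1.9.0-bin
bin/flume-ng agent -c conf/ -n a1 -f Esql2console.conf -Dflume.root.logger=INFO,console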
3. Kafka —> HDFS
source type: org.apache.flume.source.kafka.KafkaSource
channel type: memory
sink type: hdfs
# define a1
a1.sources = s1
a1.channels = c1
a1.sinks = k1
#define s1
a1.sources.s1.type = org.apache.flume.source.kafka.KafkaSource
a1.sources.s1.batchSize = 500
a1.sources.s1.batchDurationMillis = 2000
a1.sources.s1.kafka.bootstrap.servers = node1:9092
a1.sources.s1.kafka.topics = flume
#define c1
a1.channels.c1.type = memory
a1.channels.c1.keep-alive = 120
a1.channels.c1.capacity = 500000
a1.channels.c1.transactionCapacity = 600
# Describe the sink
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = /flume/fromKafka/%y-%m-%d/%H%M/
a1.sinks.k1.hdfs.filePrefix = kafka_log
a1.sinks.k1.hdfs.maxOpenFiles = 5000
a1.sinks.k1.hdfs.fileType = DataStream
a1.sinks.k1.hdfs.batchSize = 100
a1.sinks.k1.hdfs.writeFormat = Text
a1.sinks.k1.hdfs.rollInterval = 60
a1.sinks.k1.hdfs.rollCount = 100000
a1.sinks.k1.hdfs.round = true
a1.sinks.k1.hdfs.roundValue = 10
a1.sinks.k1.hdfs.roundUnit = minute
a1.sinks.k1.hdfs.useLocalTimeStamp = true
#bound
a1.sources.s1.channels = c1
a1.sinks.k1.channel = c1
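To push test data through the pipeline, publish a few messages to the flume topic and then watch the HDFS output directory. A sketch, again assuming $KAFKA_HOME points at your Kafka installation:
# type a few lines, one message each (on newer Kafka versions use --bootstrap-server instead of --broker-list)
$KAFKA_HOME/bin/kafka-console-producer.sh --broker-list node1:9092 --topic flume
# then check the sink output
hdfs dfs -ls -R /flume/fromKafka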
4. Local Directory 1 —> Local Directory 2
source type: spooldir
channel type: memory
sink type: file_roll
Use case: data backup
Example:
The local directory contains two files: 1.txt and 2.txt.
# define a1
a1.sources = s1
a1.channels = c1
a1.sinks = k1
#define s1
a1.sources.s1.type = spooldir
a1.sources.s1.spoolDir = /export/data/mylogs
#define c1
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# Describe the sink
a1.sinks.k1.type = file_roll
a1.sinks.k1.channel = c1
a1.sinks.k1.sink.directory = /export/data/backup
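A quick test run, assuming the file is saved as usercase/spooldir_file_roll.properties (a name chosen here for illustration):
mkdir -p /export/data/mylogs
echo "hello" > /export/data/mylogs/1.txt
echo "world" > /export/data/mylogs/2.txt
cd /export/server/flume-1.9.0-bin
bin/flume-ng agent -c conf/ -n a1 -f usercase/spooldir_file_roll.properties -Dflume.root.logger=INFO,console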
Result 1: the files in the source directory /export/data/mylogs get renamed with a .COMPLETED suffix:
1.txt becomes 1.txt.COMPLETED
2.txt becomes 2.txt.COMPLETED
Result 2: the target directory /export/data/backup is created automatically, and output files are named timestamp-sequence:
<epoch timestamp in seconds>-1
<epoch timestamp in seconds>-2
5. Local File —> HDFS
source type: exec
channel type: memory
sink type: hdfs
# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1
# Describe/configure the source
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /root/logs/test.log
a1.sources.r1.channels = c1
# Describe the sink
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = /flume/tailout/%y-%m-%d/%H%M/
a1.sinks.k1.hdfs.filePrefix = events-
a1.sinks.k1.hdfs.round = true
a1.sinks.k1.hdfs.roundValue = 10
a1.sinks.k1.hdfs.roundUnit = minute
a1.sinks.k1.hdfs.rollInterval = 3
a1.sinks.k1.hdfs.rollSize = 20
a1.sinks.k1.hdfs.rollCount = 5
a1.sinks.k1.hdfs.batchSize = 1
a1.sinks.k1.hdfs.useLocalTimeStamp = true
# Output file type; the default is SequenceFile, use DataStream for plain text
a1.sinks.k1.hdfs.fileType = DataStream
# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
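To try it out, keep something appending to the tailed file while the agent runs. A throwaway generator plus the usual launch command (the config name exec_mem_hdfs.conf is assumed here for illustration):
mkdir -p /root/logs
# append one line per second in the background
while true; do echo "$(date) test line" >> /root/logs/test.log; sleep 1; done &
bin/flume-ng agent -c conf/ -n a1 -f exec_mem_hdfs.conf -Dflume.root.logger=INFO,console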
6. Local Directory —> HDFS
source type: spooldir
channel type: memory
sink type: hdfs
# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1
# Describe/configure the source
## Note: never drop a file with an already-processed name into the monitored directory
a1.sources.r1.type = spooldir
a1.sources.r1.spoolDir = /root/logs
a1.sources.r1.fileHeader = true
# Describe the sink
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = /flume/events/%y-%m-%d/%H%M/
a1.sinks.k1.hdfs.filePrefix = events-
a1.sinks.k1.hdfs.round = true
a1.sinks.k1.hdfs.roundValue = 10
a1.sinks.k1.hdfs.roundUnit = minute
a1.sinks.k1.hdfs.rollInterval = 3
a1.sinks.k1.hdfs.rollSize = 20
a1.sinks.k1.hdfs.rollCount = 5
a1.sinks.k1.hdfs.batchSize = 10
a1.sinks.k1.hdfs.useLocalTimeStamp = true
# Output file type; the default is SequenceFile, use DataStream for plain text
a1.sinks.k1.hdfs.fileType = DataStream
# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
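With the agent running, a quick test is to drop a file with a brand-new name into the spool directory and then inspect HDFS:
echo "spool test" > /root/logs/test_$(date +%s).log
hdfs dfs -ls -R /flume/events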
7. Socket —> Local File
source type: netcat
channel type: memory
sink type: com.example.flumesink.MySink
# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1
# Describe/configure the source
a1.sources.r1.type = netcat
a1.sources.r1.bind = node1
a1.sources.r1.port = 9999
a1.sources.r1.channels = c1
# Describe the sink
a1.sinks.k1.type = com.example.flumesink.MySink
a1.sinks.k1.filePath=/export/servers
a1.sinks.k1.fileName=filesink.txt
# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
Install telnet:
yum -y install telnet
Connect with telnet:
telnet node1 9999
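Because MySink is a custom sink class, its jar must be on Flume's classpath (for example, copied into the lib/ directory) before the agent starts. Each line typed into the telnet session becomes one event; assuming MySink joins filePath and fileName into one output path, the result can be checked with:
cat /export/servers/filesink.txt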
8. Multi-Source Collection (an Enterprise-Grade Case)
Requirement: two log servers, A and B, produce logs in real time, mainly access.log, nginx.log, and web.log. Collect these three logs from both A and B, aggregate them on machine C, and then write everything to HDFS.
(Servers A and B ship their logs to server C over Avro, and server C writes to HDFS. This case applies the interceptor feature.)
① On servers A and B, create exec_source_avro_sink.conf
# Name the components on this agent
a1.sources = r1 r2 r3
a1.sinks = k1
a1.channels = c1
# Describe/configure the source
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /root/data/access.log
a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = static
# The static interceptor inserts a user-defined key-value pair into the header of every collected event
a1.sources.r1.interceptors.i1.key = type
a1.sources.r1.interceptors.i1.value = access
a1.sources.r2.type = exec
a1.sources.r2.command = tail -F /root/data/nginx.log
a1.sources.r2.interceptors = i2
a1.sources.r2.interceptors.i2.type = static
a1.sources.r2.interceptors.i2.key = type
a1.sources.r2.interceptors.i2.value = nginx
a1.sources.r3.type = exec
a1.sources.r3.command = tail -F /root/data/web.log
a1.sources.r3.interceptors = i3
a1.sources.r3.interceptors.i3.type = static
a1.sources.r3.interceptors.i3.key = type
a1.sources.r3.interceptors.i3.value = web
# Describe the sink
a1.sinks.k1.type = avro
a1.sinks.k1.hostname = 192.168.200.101
a1.sinks.k1.port = 41414
# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 20000
a1.channels.c1.transactionCapacity = 10000
# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sources.r2.channels = c1
a1.sources.r3.channels = c1
a1.sinks.k1.channel = c1
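For testing, servers A and B also need data flowing into the three tailed files. A throwaway generator, purely for illustration:
mkdir -p /root/data
while true; do
  echo "access $(date)" >> /root/data/access.log
  echo "nginx $(date)" >> /root/data/nginx.log
  echo "web $(date)" >> /root/data/web.log
  sleep 1
done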
② On server C, create the configuration file avro_source_hdfs_sink.conf
# Define the agent's source, channel, and sink names
a1.sources = r1
a1.sinks = k1
a1.channels = c1
# Define the source
a1.sources.r1.type = avro
a1.sources.r1.bind = mini2
a1.sources.r1.port = 41414
# Add a timestamp interceptor
a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = org.apache.flume.interceptor.TimestampInterceptor$Builder
# Define the channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 20000
a1.channels.c1.transactionCapacity = 10000
# Define the sink
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path=hdfs://192.168.200.101:9000/source/logs/%{type}/%Y%m%d
a1.sinks.k1.hdfs.filePrefix =events
a1.sinks.k1.hdfs.fileType = DataStream
a1.sinks.k1.hdfs.writeFormat = Text
# Timestamp handling
a1.sinks.k1.hdfs.useLocalTimeStamp = true
# Do not roll files by event count
a1.sinks.k1.hdfs.rollCount = 0
# Roll files by time (every 30 s)
a1.sinks.k1.hdfs.rollInterval = 30
# Roll files by size (10485760 bytes = 10 MB)
a1.sinks.k1.hdfs.rollSize = 10485760
# Number of events written to HDFS per batch
a1.sinks.k1.hdfs.batchSize = 10000
# Number of threads Flume uses for HDFS operations (create, write, etc.)
a1.sinks.k1.hdfs.threadsPoolSize=10
# Timeout (ms) for HDFS operations
a1.sinks.k1.hdfs.callTimeout=30000
# Wire source, channel, and sink together
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
Result:
Start the Flume agent on server C first, then start the agents on servers A and B.
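Concretely, following the launch pattern from the earlier sections (adjust paths to your layout):
# on server C first:
bin/flume-ng agent -c conf/ -n a1 -f avro_source_hdfs_sink.conf -Dflume.root.logger=INFO,console
# then on servers A and B:
bin/flume-ng agent -c conf/ -n a1 -f exec_source_avro_sink.conf -Dflume.root.logger=INFO,console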
Files produced on HDFS:
/source/logs/access/20240323/**
/source/logs/nginx/20240323/**
/source/logs/web/20240323/**
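The collected output can be verified with:
hdfs dfs -ls -R /source/logs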