Installing and Using Flume: A Detailed Guide

lmh450201598

Article link: https://blog.csdn.net/lmh450201598/article/details/106360162

Official documentation: http://flume.apache.org/releases/content/1.9.0/FlumeUserGuide.html

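I. Introduction to the Flume Framework

1. Flume's role in the cluster
Flume and Kafka collect data in real time, Spark and Storm process it in real time, and Impala serves real-time queries.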
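2. Overview of the Flume framework

(1) Flume is a distributed, reliable service for efficiently collecting, aggregating, and moving large volumes of log data. It is normally deployed on Unix-like systems, which is what this guide assumes.

(2) Flume is built on a streaming architecture. It is fault-tolerant, flexible, and simple, and is mainly used to feed online, real-time analysis.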

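(3) Roles
(3-1) Source: collects data. The Source is where the data stream originates, and it passes that stream on to the Channel. The Channel is loosely comparable to a Channel in Java NIO.
(3-2) Channel: bridges Sources and Sinks and behaves like a queue.
(3-3) Sink: takes data from the Channel and writes it to the destination, which can be the Source of the next agent, HDFS, or HBase.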

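(4) Unit of transfer: Event
The Event is Flume's basic unit of data transfer; data travels from its origin to its destination as events.

(5) Transfer flow
The source monitors a file. When the file receives new data, the source wraps that data in an Event, puts it into the channel, and commits the transaction. The channel is a first-in, first-out queue; the sink pulls events from it and writes them to HDFS or HBase.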
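
The put, commit, and take steps described above can be sketched in a few lines of Python. This is a toy model rather than Flume's implementation: the names ToyChannel, make_event, put_batch, and take_batch are made up for illustration, and a plain list stands in for HDFS or HBase.

```python
from collections import deque


def make_event(line):
    """Wrap one line of text in a Flume-like event (headers plus byte body)."""
    return {"headers": {}, "body": line.encode("utf-8")}


class ToyChannel:
    """In-memory FIFO queue that only changes when a transaction commits."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._queue = deque()

    def __len__(self):
        return len(self._queue)

    def put_batch(self, events):
        """Source side: commit the whole batch, or reject it if it does not fit."""
        if len(self._queue) + len(events) > self.capacity:
            raise RuntimeError("channel full, put transaction rolled back")
        self._queue.extend(events)

    def take_batch(self, max_events, deliver):
        """Sink side: hand the oldest events to deliver(); remove them only on success."""
        batch = list(self._queue)[:max_events]
        try:
            deliver(batch)
        except Exception:
            return 0  # rollback: the events stay in the channel
        for _ in batch:
            self._queue.popleft()
        return len(batch)


channel = ToyChannel(capacity=3)
channel.put_batch([make_event("first"), make_event("second")])

written = []  # stands in for HDFS or HBase
channel.take_batch(10, written.extend)
print([event["body"] for event in written])  # [b'first', b'second']
```

The point of the sketch is the ordering guarantee and the rollback path: events leave the queue in the order they arrived, and a failed delivery leaves them in place for a later retry.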

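II. Installing and Configuring Flume

1. Extract Flume into the directory /opt/modules/cdh5.3.6/apache-flume-1.5.0-cdh5.3.6-bin
2. Edit the configuration file flume-env.sh and set the Java environment variable (JAVA_HOME)
3. Show Flume's command-line help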

bin/flume-ng

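III. Listening on a Port with Flume and Printing the Received Data, Using telnet

1. Copy the packages
Copy telnet-server-0.17-59.el7.x86_64.rpm and telnet-0.17-59.el7.x86_64.rpm to a local directory on the Linux machine (for example, /opt/softwares).

2. List the rpm packages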

ls | grep rpm

3. Install telnet

sudo rpm -ivh telnet-server-0.17-59.el7.x86_64.rpm

sudo rpm -ivh telnet-0.17-59.el7.x86_64.rpm

4. Check whether port 44444 is already in use

netstat -an | grep 44444
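
The same check can be scripted. The sketch below is one possible approach, not part of Flume: the helper name port_in_use is made up here, and it only detects a process that is actively accepting TCP connections, whereas netstat -an also lists sockets in other states.

```python
import socket


def port_in_use(port, host="127.0.0.1", timeout=1.0):
    """Return True if something accepts TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        # connect_ex returns 0 on success and an errno value otherwise.
        return sock.connect_ex((host, port)) == 0
```

Calling port_in_use(44444) before starting the agent should return False on a machine where the port is free.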

5. Create the Flume agent configuration file flume-telnet.conf

# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444

# Describe the sink
a1.sinks.k1.type = logger

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
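
A common mistake in files like this is a source or sink bound to a channel that was never declared. Note that a source uses the plural key channels while a sink uses the singular channel. The following sketch checks that wiring; the helpers parse_properties and check_wiring are hypothetical names written for this article, not a Flume tool, and the parser handles only simple key = value lines.

```python
def parse_properties(text):
    """Parse 'key = value' lines, skipping blank lines and # comments."""
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


def check_wiring(props, agent):
    """Return a list of problems with source and sink channel bindings."""
    channels = set(props.get(f"{agent}.channels", "").split())
    problems = []
    for source in props.get(f"{agent}.sources", "").split():
        bound = props.get(f"{agent}.sources.{source}.channels", "").split()
        if not bound or not set(bound) <= channels:
            problems.append(f"source {source} is not bound to a declared channel")
    for sink in props.get(f"{agent}.sinks", "").split():
        if props.get(f"{agent}.sinks.{sink}.channel") not in channels:
            problems.append(f"sink {sink} is not bound to a declared channel")
    return problems


CONFIG = """
a1.sources = r1
a1.sinks = k1
a1.channels = c1
a1.sources.r1.type = netcat
a1.sources.r1.channels = c1
a1.sinks.k1.type = logger
a1.sinks.k1.channel = c1
"""

print(check_wiring(parse_properties(CONFIG), "a1"))  # []
```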

6. Start the Flume agent so that it listens on the port

bin/flume-ng agent --conf conf/ --name a1 --conf-file conf/flume-telnet.conf -Dflume.root.logger=INFO,console

Note: -Dflume.root.logger=INFO,console prints log output at INFO level to the console.

7. Use telnet to send content to port 44444 on the local machine

telnet localhost 44444

8. Exit telnet

Ctrl + ]

telnet> quit
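
If telnet is not available, a few lines of Python can play the same role. This is a minimal sketch that assumes the agent from flume-telnet.conf is running on localhost:44444 and that the netcat source acknowledges each line with a short reply (OK by default, as far as the Flume documentation describes). The helper name send_line is made up for this example.

```python
import socket


def send_line(text, host="localhost", port=44444, timeout=5.0):
    """Send one newline-terminated line and return the agent's reply, stripped."""
    with socket.create_connection((host, port), timeout=timeout) as conn:
        conn.sendall(text.encode("utf-8") + b"\n")
        return conn.recv(1024).decode("utf-8").strip()
```

With the agent running, send_line("hello flume") should make the logger sink print the event on the agent's console.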

IV. Tailing the Hive Log File and Uploading It to HDFS: HDFS Sink

1. Copy the required Hadoop jars into Flume's lib directory
The jars are:
share/hadoop/common/lib/hadoop-auth-2.5.0-cdh5.3.6.jar
share/hadoop/common/lib/commons-configuration-1.6.jar
share/hadoop/mapreduce1/lib/hadoop-hdfs-2.5.0-cdh5.3.6.jar
share/hadoop/common/hadoop-common-2.5.0-cdh5.3.6.jar

2. Create the Flume agent configuration file flume-hdfs.conf

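# Name the components on this agent
a2.sources = r2
a2.sinks = k2
a2.channels = c2

# Describe/configure the source
# Tail the Hive log. The log path below is an assumed placeholder; use your own Hive log location.
a2.sources.r2.type = exec
a2.sources.r2.command = tail -F /path/to/hive/logs/hive.log
a2.sources.r2.shell = /bin/bash -c

# Describe the sink
a2.sinks.k2.type = hdfs
a2.sinks.k2.hdfs.path = hdfs://192.168.1.20:8020/flume/%y%m%d/%H
# Prefix for uploaded files
a2.sinks.k2.hdfs.filePrefix = events-hive-
# Whether to round the timestamp down, which controls how often a new directory is started
a2.sinks.k2.hdfs.round = true
# Number of time units per directory
a2.sinks.k2.hdfs.roundValue = 1
# Time unit used for rounding
a2.sinks.k2.hdfs.roundUnit = hour
# Use the agent's local time for the path escapes. The exec source adds no timestamp header, so this must be true
a2.sinks.k2.hdfs.useLocalTimeStamp = true
# Number of events written before flushing to HDFS; must not exceed the channel's transactionCapacity (100)
a2.sinks.k2.hdfs.batchSize = 100
# File type. DataStream writes uncompressed output; CompressedStream enables compression
a2.sinks.k2.hdfs.fileType = DataStream
# Roll to a new file every 600 seconds
a2.sinks.k2.hdfs.rollInterval = 600
# Roll by size: 134217728 bytes is 128 MB, so use a value slightly below that
a2.sinks.k2.hdfs.rollSize = 134217700
# 0 means rolling does not depend on the number of events
a2.sinks.k2.hdfs.rollCount = 0
# Minimum replicas per block. Set to 1 so the roll settings above take effect; with a higher value such as 3, the sink rolls files on its own whenever a block is under-replicated
a2.sinks.k2.hdfs.minBlockReplicas = 1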

# Use a channel which buffers events in memory
a2.channels.c2.type = memory
a2.channels.c2.capacity = 1000
a2.channels.c2.transactionCapacity = 100

# Bind the source and sink to the channel
a2.sources.r2.channels = c2
a2.sinks.k2.channel = c2

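Here, a2.sources.r2.shell = /bin/bash -c names the shell that the exec source uses to run its command, so shell features such as pipes, wildcards, and backticks (``) work in that command.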
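
To see how the hdfs.path escapes and the round settings interact, the sketch below resolves a path template for a given timestamp. It is an approximation written for this article: the function names round_down and resolve_path are made up, and Python's strftime is used because %y, %m, %d, %H, and %M mean the same thing there as in the Flume escapes used in this guide.

```python
from datetime import datetime


def round_down(ts, value, unit):
    """Round ts down to a multiple of `value` units and zero the smaller fields."""
    if unit == "hour":
        return ts.replace(hour=ts.hour - ts.hour % value, minute=0, second=0, microsecond=0)
    if unit == "minute":
        return ts.replace(minute=ts.minute - ts.minute % value, second=0, microsecond=0)
    if unit == "second":
        return ts.replace(second=ts.second - ts.second % value, microsecond=0)
    raise ValueError(f"unsupported unit: {unit}")


def resolve_path(template, ts, round_value=1, round_unit="hour"):
    """Apply rounding first, then substitute the time escapes."""
    return round_down(ts, round_value, round_unit).strftime(template)


event_time = datetime(2020, 5, 26, 14, 37, 12)
print(resolve_path("/flume/%y%m%d/%H", event_time))  # /flume/200526/14
print(resolve_path("/flume/%y%m%d/%H%M", event_time, 10, "minute"))  # /flume/200526/1430
```

With roundValue = 1 and roundUnit = hour, every event from the same clock hour lands in the same directory. Rounding becomes visible once the path contains a finer escape such as %M, as in the second call.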

3. Run the agent with this configuration

bin/flume-ng agent --conf conf/ --name a2 --conf-file conf/flume-hdfs.conf
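
Once the agent is running, the sink closes the current HDFS file and starts a new one according to the three roll settings in the configuration. The sketch below models that decision under the assumption that a value of 0 disables the corresponding trigger; should_roll is a made-up name, and the real sink applies the time-based trigger with a timer rather than a check on every write.

```python
def should_roll(age_seconds, bytes_written, events_written,
                roll_interval=600, roll_size=134217700, roll_count=0):
    """Return True when any enabled trigger (time, size, event count) has fired."""
    if roll_interval > 0 and age_seconds >= roll_interval:
        return True
    if roll_size > 0 and bytes_written >= roll_size:
        return True
    if roll_count > 0 and events_written >= roll_count:
        return True
    return False


# 128 MB is 134217728 bytes; the configured value sits 28 bytes below it.
print(128 * 1024 * 1024 - 134217700)  # 28
print(should_roll(age_seconds=30, bytes_written=1024, events_written=5_000_000))  # False
print(should_roll(age_seconds=601, bytes_written=1024, events_written=10))  # True
```

Because rollCount is 0, the five million events in the second call do not trigger a roll; only the 600-second timer or the size limit can.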

V. Monitoring an Entire Directory with Flume: Spooling Directory Source

1. Create the configuration file flume-spooldir.conf

# Name the components on this agent
a3.sources = r3
a3.sinks = k3
a3.channels = c3

# Describe/configure the source
a3.sources.r3.type = spooldir
a3.sources.r3.spoolDir = /home/admin/Documents
a3.sources.r3.fileHeader = true
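# Ignore all files ending in .tmp
a3.sources.r3.ignorePattern = ([^ ]*\.tmp)

# Describe the sink
a3.sinks.k3.type = hdfs
a3.sinks.k3.hdfs.path = hdfs://192.168.1.20:8020/flume/Documents/%Y%m%d/%H
# Prefix for uploaded files
a3.sinks.k3.hdfs.filePrefix = Documents-
# Whether to round the timestamp down, which controls how often a new directory is started
a3.sinks.k3.hdfs.round = true
# Number of time units per directory
a3.sinks.k3.hdfs.roundValue = 1
# Time unit used for rounding
a3.sinks.k3.hdfs.roundUnit = hour
# Use the agent's local time for the path escapes
a3.sinks.k3.hdfs.useLocalTimeStamp = true
# Number of events written before flushing to HDFS; must not exceed the channel's transactionCapacity (100)
a3.sinks.k3.hdfs.batchSize = 100
# File type. DataStream writes uncompressed output; CompressedStream enables compression
a3.sinks.k3.hdfs.fileType = DataStream
# Roll to a new file every 600 seconds
a3.sinks.k3.hdfs.rollInterval = 600
# Roll by size, slightly below 128 MB
a3.sinks.k3.hdfs.rollSize = 134217700
# 0 means rolling does not depend on the number of events
a3.sinks.k3.hdfs.rollCount = 0
# Minimum replicas per block
a3.sinks.k3.hdfs.minBlockReplicas = 1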


# Use a channel which buffers events in memory
a3.channels.c3.type = memory
a3.channels.c3.capacity = 1000
a3.channels.c3.transactionCapacity = 100

# Bind the source and sink to the channel
a3.sources.r3.channels = c3
a3.sinks.k3.channel = c3
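
The ignorePattern value is a regular expression, so it is worth checking which file names it really covers. The sketch below assumes the source requires the whole file name to match the pattern, which is how Java's Matcher.matches() behaves; the helper name is_ignored is made up. For this particular pattern, Python's re module and Java's regex engine agree.

```python
import re

IGNORE_PATTERN = re.compile(r"([^ ]*\.tmp)")


def is_ignored(file_name):
    """Return True if the whole file name matches the ignore pattern."""
    return IGNORE_PATTERN.fullmatch(file_name) is not None


for name in ["report.tmp", "report.log", "my notes.tmp", "report.tmp.bak"]:
    print(name, is_ignored(name))
```

Under that assumption, a name containing a space, such as "my notes.tmp", is not ignored, because [^ ]* cannot match the space.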

2. Run a test

bin/flume-ng agent --conf conf/ --name a3 --conf-file conf/flume-spooldir.conf

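3. Notes

(1) Do not create a file in the monitored directory and keep modifying it; only move finished files into the directory.
(2) Files that have been fully uploaded are renamed with the .COMPLETED suffix.
(3) The monitored directory is scanned for changes every 500 milliseconds.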
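
One way to follow note (1) is to write each file in a staging directory and move it into the spool directory only once it is complete. The sketch below is an illustration under stated assumptions: the helper name drop_into_spool and the staging directory are hypothetical, and the staging directory is assumed to be on the same filesystem as the spool directory so that the rename is atomic.

```python
import os
import tempfile


def drop_into_spool(data, staging_dir, spool_dir, final_name):
    """Write data to a staging file, then move the finished file into spool_dir."""
    target = os.path.join(spool_dir, final_name)
    # The spooling source expects file names to be unique, so refuse to reuse one.
    if os.path.exists(target) or os.path.exists(target + ".COMPLETED"):
        raise FileExistsError(target)
    os.makedirs(staging_dir, exist_ok=True)
    fd, tmp_path = tempfile.mkstemp(dir=staging_dir)
    with os.fdopen(fd, "wb") as handle:
        handle.write(data)
    # The file only appears in the spool directory after it is fully written.
    os.rename(tmp_path, target)
    return target
```

A call such as drop_into_spool(b"line 1\n", "/home/admin/staging", "/home/admin/Documents", "batch-0001.log") would hand one finished file to the agent configured above; the staging path and file name are examples only.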
