The Flume Framework: Installing and Using a Log File Data Collection Tool

无名一小卒

Tags: using the Flume framework, dynamically monitoring multiple folders and files with Flume, partitioned storage with Flume

Article link: https://blog.csdn.net/h1025372645/article/details/96017243

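Introduction

This article gives a brief introduction to the Flume framework and covers the following:
How to install Flume on Linux
How to read a log file dynamically
How to use Flume to store files on HDFS
How to use Flume to store files in a specified HDFS directory
How to use Flume to store files on HDFS in partitions
How to dynamically monitor the contents of a folder
How to filter out files that should not be loaded into Flume
How to dynamically monitor multiple folders and files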

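1: A brief introduction to Flume and its installation

1.1: About Flume

(1) Distributed:

Multiple Flume agents can run on multiple machines, since log files are usually spread across different machines

(2) Collecting, aggregating, and moving

Flume collects, aggregates, and moves log data

(3) Agent components

source: reads data from the data source, converts it into a stream of events, and hands the events to the channel

channel: works like a queue that temporarily stores the events sent by the source

sink: reads events from the channel and sends them to the destination

(4) Flume is simple to use: an agent is defined by a single configuration file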
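The source, channel, and sink roles described above can be pictured with a small toy model. This is only an illustrative sketch in Python, not Flume's implementation; the class and function names (`MemoryChannel`, `run_source`, `run_sink`) are made up for this example.

```python
from collections import deque


class MemoryChannel:
    """Toy stand-in for a channel: a bounded FIFO buffer of events."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._events = deque()

    def put(self, event):
        # A full channel rejects new events instead of growing without bound.
        if len(self._events) >= self.capacity:
            raise RuntimeError("channel is full")
        self._events.append(event)

    def take(self):
        # Returns None when the channel is empty.
        return self._events.popleft() if self._events else None


def run_source(lines, channel):
    """Source role: turn raw input lines into events and hand them to the channel."""
    for line in lines:
        channel.put({"headers": {}, "body": line.encode("utf-8")})


def run_sink(channel):
    """Sink role: drain the channel and deliver each body to a destination (a list here)."""
    delivered = []
    while (event := channel.take()) is not None:
        delivered.append(event["body"].decode("utf-8"))
    return delivered


channel = MemoryChannel(capacity=1000)
run_source(["first log line", "second log line"], channel)
print(run_sink(channel))
```

The channel decouples the two sides: the source only ever calls `put`, the sink only ever calls `take`, and neither needs to know about the other.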

1.2: Flume versions

flume-ng (next generation): the version in use today

flume-og (original generation): the earlier version, now obsolete

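1.3: Installing Flume

Requirements: Linux, with Hadoop and the JDK already installed

Installation and configuration:

(1) Rename the environment template in the conf directory, then set JAVA_HOME in it

mv flume-env.sh.template  flume-env.sh

(2) Tell Flume where HDFS is:

Method 1: declare HADOOP_HOME as a global environment variable

Global configuration

Method 2: copy core-site.xml and hdfs-site.xml into the Flume configuration directory (recommended)

cp /opt/cdh5.7.6/hadoop-2.6.0-cdh5.7.6/etc/hadoop/core-site.xml /opt/cdh5.7.6/hadoop-2.6.0-cdh5.7.6/etc/hadoop/hdfs-site.xml  ./

Method 3: give the absolute HDFS path directly wherever it is used

hdfs://hostname:8020/aa/bb

(3) Add the HDFS JAR files to the lib directory, because the HDFS API is needed at run time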

Test case 1: reading the Hive log to the console

flume-conf.properties configuration file


# The configuration file needs to define the sources,
# the channels and the sinks.
# Sources, channels and sinks are defined per agent,
# in this case called 'a1'

a1.sources = s1
a1.channels = c1
a1.sinks = k1

# defined sources
a1.sources.s1.type = exec
a1.sources.s1.command = tail -F /opt/cdh5.7.6/hive-1.1.0-cdh5.7.6/logs/hive.log
a1.sources.s1.shell=/bin/sh -c


# defined channel
a1.channels.c1.type = memory
# maximum number of events the channel can hold
a1.channels.c1.capacity=1000
# maximum number of events per transaction
a1.channels.c1.transactionCapacity=100


# defined sink
a1.sinks.k1.type = logger

# bind the source and the sink to the channel
a1.sinks.k1.channel = c1
a1.sources.s1.channels = c1

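From the Flume directory, run:

bin/flume-ng agent -n a1 -c conf -f conf/flume-conf.properties -Dflume.root.logger=INFO,console

Result: the logger sink prints each event body to the console as a hex dump, truncated to a few bytes by default, so the log text is hard to read there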
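To see why the console output is hard to read, the following sketch renders the first bytes of a log line the way a hex dump would. It is an approximation written for this article, not the logger sink's actual code; the 16-byte limit is assumed to match the sink's default, and `preview_body` is a hypothetical helper name.

```python
def preview_body(body: bytes, max_bytes: int = 16) -> str:
    """Render the first max_bytes of an event body as hex plus printable text."""
    head = body[:max_bytes]
    hex_part = " ".join(f"{b:02X}" for b in head)
    # Non-printable bytes are shown as dots, as hex dump tools usually do.
    text_part = "".join(chr(b) if 32 <= b < 127 else "." for b in head)
    return f"{hex_part}  {text_part}"


line = b"2019-07-15 05:34:11,930 INFO  [main]: ql.Driver - Semantic Analysis Completed"
print(preview_body(line))
```

Only the start of the timestamp survives the truncation, which is why a sink that writes the full text, such as the HDFS sink, is more useful for inspecting log content.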

Test case 2: writing the Hive log to HDFS

flume-conf.properties configuration file

# The agent in this case is called 'a1'

a1.sources = s1
a1.channels = c1
a1.sinks = k1

# For each one of the sources, the type is defined
a1.sources.s1.type = exec
a1.sources.s1.command = tail -F /opt/cdh5.7.6/hive-1.1.0-cdh5.7.6/logs/hive.log
a1.sources.s1.shell = /bin/sh -c


# Each channel's type is defined.
a1.channels.c1.type = file
a1.channels.c1.checkpointDir=/opt/datas/flume/channel/checkpoint
a1.channels.c1.dataDirs=/opt/datas/flume/channel/data



# Each sink's type must be defined
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path=/flume/hdfs2/
# set the file type and write format so the output is plain text (avoids garbled Chinese characters)
a1.sinks.k1.hdfs.fileType=DataStream
a1.sinks.k1.hdfs.writeFormat=Text

#Specify the channel the sink should use
a1.sinks.k1.channel = c1
a1.sources.s1.channels = c1

Command: same as in test case 1

Result on HDFS

File contents:

2019-07-15 05:34:11,930 INFO  [main]: ql.Driver (Driver.java:compile(500)) - Semantic Analysis Completed
2019-07-15 05:34:12,060 INFO  [main]: ql.Driver (Driver.java:getSchema(266)) - Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:tab_name, type:string, comment:from deserializer)], properties:null)
2019-07-15 05:34:12,646 INFO  [main]: ql.Driver (Driver.java:compile(607)) - Completed compiling command(queryId=huadian_20190715053434_d650e334-f5f3-4e7e-b030-0622a759f812); Time taken: 1.595 seconds
2019-07-15 05:34:12,647 INFO  [main]: ql.Driver (Driver.java:checkConcurrency(186)) - Concurrency mode is disabled, not creating a lock manager
2019-07-15 05:34:12,647 INFO  [main]: ql.Driver (Driver.java:execute(1598)) - Executing command(queryId=huadian_20190715053434_d650e334-f5f3-4e7e-b030-0622a759f812): show tables
2019-07-15 05:34:12,665 INFO  [main]: ql.Driver (Driver.java:launchTask(1968)) - Starting task [Stage-0:DDL] in serial mode
2019-07-15 05:34:12,830 INFO  [main]: ql.Driver (Driver.java:execute(1877)) - Completed executing command(queryId=huadian_20190715053434_d650e334-f5f3-4e7e-b030-0622a759f812); Time taken: 0.183 seconds

Test case 3: controlling the size of files stored on HDFS to solve the small-files problem

flume-conf.properties configuration file

# The agent in this case is called 'a1'

a1.sources = s1
a1.channels = c1
a1.sinks = k1

# For each one of the sources, the type is defined
a1.sources.s1.type = exec
a1.sources.s1.command = tail -F /opt/cdh5.7.6/hive-1.1.0-cdh5.7.6/logs/hive.log
a1.sources.s1.shell = /bin/sh -c


# Each channel's type is defined.
a1.channels.c1.type = file
a1.channels.c1.checkpointDir=/opt/datas/flume/channel/checkpoint
a1.channels.c1.dataDirs=/opt/datas/flume/channel/data



# Each sink's type must be defined
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path=/flume/hdfs1/
# set the file type and write format so the output is plain text (avoids garbled Chinese characters)
a1.sinks.k1.hdfs.fileType=DataStream
a1.sinks.k1.hdfs.writeFormat=Text

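# roll the file after this many seconds (0 disables time-based rolling)
a1.sinks.k1.hdfs.rollInterval=0
# roll the file once it reaches this size in bytes
a1.sinks.k1.hdfs.rollSize=10240
# roll the file after this many events (0 disables count-based rolling)
a1.sinks.k1.hdfs.rollCount=0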

#Specify the channel the sink should use
a1.sinks.k1.channel = c1
a1.sources.s1.channels = c1

Result: the size of the files on HDFS changes noticeably
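The three roll settings interact as independent triggers: a file is rolled as soon as any enabled trigger fires, and a value of 0 switches that trigger off. The sketch below models this decision in Python. It is a simplified illustration, not the HDFS sink's code; `should_roll` is a hypothetical function, and its default arguments mirror the documented defaults of the HDFS sink (30 seconds, 1024 bytes, 10 events).

```python
def should_roll(elapsed_seconds, bytes_written, events_written,
                roll_interval=30, roll_size=1024, roll_count=10):
    """Return True when any enabled trigger fires; 0 disables a trigger."""
    if roll_interval > 0 and elapsed_seconds >= roll_interval:
        return True
    if roll_size > 0 and bytes_written >= roll_size:
        return True
    if roll_count > 0 and events_written >= roll_count:
        return True
    return False


# With the defaults, ten short log lines are already enough to close a file.
print(should_roll(elapsed_seconds=5, bytes_written=200, events_written=10))

# With the settings from test case 3, only the 10240-byte size trigger remains.
print(should_roll(3600, 10239, 100000, roll_interval=0, roll_size=10240, roll_count=0))
```

This is why the default settings tend to produce many small files, and why test case 3 disables the time and count triggers and rolls on size alone.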

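Test case 4: writing data to specific HDFS directories for import into Hive partitions

Partition by year, month, day, and minute

flume-conf.properties configuration file

# The agent in this case is called 'a1'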

a1.sources = s1
a1.channels = c1
a1.sinks = k1

# For each one of the sources, the type is defined
a1.sources.s1.type = exec
a1.sources.s1.command = tail -F /opt/cdh5.7.6/hive-1.1.0-cdh5.7.6/logs/hive.log
a1.sources.s1.shell = /bin/sh -c


# Each channel's type is defined.
a1.channels.c1.type = file
a1.channels.c1.checkpointDir=/opt/datas/flume/channel/checkpoint
a1.channels.c1.dataDirs=/opt/datas/flume/channel/data



# Each sink's type must be defined
a1.sinks.k1.type = hdfs
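# path in HDFS where the files are stored
a1.sinks.k1.hdfs.path=/flume/part/yearst=%Y/monthstr=%m/daystr=%d/minutestr=%M
# the path uses time escape sequences, so this must be set (the events here carry no timestamp header)
a1.sinks.k1.hdfs.useLocalTimeStamp=true
# set the file type and write format so the output is plain text (avoids garbled Chinese characters)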
a1.sinks.k1.hdfs.fileType=DataStream
a1.sinks.k1.hdfs.writeFormat=Text

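# roll the file after this many seconds (0 disables time-based rolling)
a1.sinks.k1.hdfs.rollInterval=0
# roll the file once it reaches this size in bytes
a1.sinks.k1.hdfs.rollSize=10240
# roll the file after this many events (0 disables count-based rolling)
a1.sinks.k1.hdfs.rollCount=0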

#Specify the channel the sink should use
a1.sinks.k1.channel = c1
a1.sources.s1.channels = c1

Result:

/flume/part/yearst=2019/monthstr=07/daystr=15/minutestr=50
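The escape sequences in `hdfs.path` (`%Y`, `%m`, `%d`, `%M`) have the same meaning as the corresponding `strftime` codes, so the expansion can be reproduced with a short Python sketch. This is only an illustration of how the directory name is derived from a timestamp; `expand_path` is a hypothetical helper, and the hour used below (05) is an arbitrary assumption, since the result path does not record it.

```python
from datetime import datetime

PATH_TEMPLATE = "/flume/part/yearst=%Y/monthstr=%m/daystr=%d/minutestr=%M"


def expand_path(template: str, when: datetime) -> str:
    """Fill the time escape sequences in the template from the given timestamp."""
    return when.strftime(template)


print(expand_path(PATH_TEMPLATE, datetime(2019, 7, 15, 5, 50)))
```

Because the template has no hour component, events written at minute 50 of any hour on the same day end up in the same `minutestr=50` directory.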

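Problems importing into Hive:

Loading the files written by Flume into Hive is fairly cumbersome

Reason 1:

The Hive table must store its data in ORC (a columnar storage format)

Reason 2:

The Hive table must be a bucketed table, with each record assigned to a bucket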

Test case 5: dynamically monitoring a directory with the Spooling Directory Source

flume-conf.properties configuration file

# The agent in this case is called 'a1'

a1.sources = s1
a1.channels = c1
a1.sinks = k1

# defined sources
# have Flume scan a directory
a1.sources.s1.type = spooldir
# the directory to scan
a1.sources.s1.spoolDir = /opt/datas/flume/spool

# Each channel's type is defined.
a1.channels.c1.type = file
a1.channels.c1.checkpointDir=/opt/datas/flume/channel/checkpoint
a1.channels.c1.dataDirs=/opt/datas/flume/channel/data

# Each sink's type must be defined
a1.sinks.k1.type = hdfs
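# path in HDFS where the files are stored
a1.sinks.k1.hdfs.path=/flume/part/yearst=%Y/monthstr=%m/daystr=%d/minutestr=%M
# the path uses time escape sequences, so this must be set (the events here carry no timestamp header)
a1.sinks.k1.hdfs.useLocalTimeStamp=true
# set the file type and write format so the output is plain text (avoids garbled Chinese characters)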
a1.sinks.k1.hdfs.fileType=DataStream
a1.sinks.k1.hdfs.writeFormat=Text

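# roll the file after this many seconds (0 disables time-based rolling)
a1.sinks.k1.hdfs.rollInterval=0
# roll the file once it reaches this size in bytes
a1.sinks.k1.hdfs.rollSize=10240
# roll the file after this many events (0 disables count-based rolling)
a1.sinks.k1.hdfs.rollCount=0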

#Specify the channel the sink should use
a1.sinks.k1.channel = c1
a1.sources.s1.channels = c1

Result: the files in the Linux directory are loaded successfully

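Test case 6: filtering out files that should not be loaded into Flume

In test case 5, each file is loaded only once

so data written to the file afterwards is never read

To work around this, add a filter that skips files still being written

and rename a file only when it is ready to be loaded, so that Flume picks it up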
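The write-then-rename workflow described above can be sketched from the producer's side. The following Python example is a minimal illustration, assuming the `.tmp` suffix is the one the source is told to ignore; `publish_to_spool` is a hypothetical helper and the demo uses a temporary directory instead of the real spool directory.

```python
import os
import tempfile


def publish_to_spool(spool_dir, name, lines):
    """Write the file under a .tmp name first, then rename it once it is complete."""
    tmp_path = os.path.join(spool_dir, name + ".tmp")
    final_path = os.path.join(spool_dir, name)
    with open(tmp_path, "w", encoding="utf-8") as f:
        for line in lines:
            f.write(line + "\n")
    # The rename is the signal that the file is finished and may be collected.
    os.rename(tmp_path, final_path)
    return final_path


with tempfile.TemporaryDirectory() as spool:
    publish_to_spool(spool, "app.log", ["line 1", "line 2"])
    print(os.listdir(spool))
```

While the file is being written it only exists under the ignored name, so the source never sees a half-written file.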

flume-conf.properties configuration file

# The agent in this case is called 'a1'

a1.sources = s1
a1.channels = c1
a1.sinks = k1

# defined sources
# have Flume scan a directory
a1.sources.s1.type = spooldir
# the directory to scan
a1.sources.s1.spoolDir = /opt/datas/flume/spool
# regular expression for file names to ignore
a1.sources.s1.ignorePattern=([^ ]*\.tmp)

# Each channel's type is defined.
a1.channels.c1.type = file
a1.channels.c1.checkpointDir=/opt/datas/flume/channel/checkpoint
a1.channels.c1.dataDirs=/opt/datas/flume/channel/data



# Each sink's type must be defined
a1.sinks.k1.type = hdfs
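# path in HDFS where the files are stored
a1.sinks.k1.hdfs.path=/flume/part/yearst=%Y/monthstr=%m/daystr=%d/minutestr=%M
# the path uses time escape sequences, so this must be set (the events here carry no timestamp header)
a1.sinks.k1.hdfs.useLocalTimeStamp=true
# set the file type and write format so the output is plain text (avoids garbled Chinese characters)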
a1.sinks.k1.hdfs.fileType=DataStream
a1.sinks.k1.hdfs.writeFormat=Text

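# roll the file after this many seconds (0 disables time-based rolling)
a1.sinks.k1.hdfs.rollInterval=0
# roll the file once it reaches this size in bytes
a1.sinks.k1.hdfs.rollSize=10240
# roll the file after this many events (0 disables count-based rolling)
a1.sinks.k1.hdfs.rollCount=0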

#Specify the channel the sink should use
a1.sinks.k1.channel = c1
a1.sources.s1.channels = c1

Result: files with the .tmp suffix are not loaded
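The effect of the `ignorePattern` value can be checked outside Flume. The sketch below applies the same regular expression in Python, assuming the pattern must match the whole file name (which is how the spooling source is understood to apply it); `is_ignored` and the sample file names are made up for illustration.

```python
import re

# Same expression as in the configuration above.
IGNORE_PATTERN = re.compile(r"([^ ]*\.tmp)")


def is_ignored(file_name: str) -> bool:
    """True when the whole file name matches the ignore pattern."""
    return IGNORE_PATTERN.fullmatch(file_name) is not None


for name in ["access.log", "access.log.tmp", "my file.tmp", "tmp.log"]:
    print(name, is_ignored(name))
```

Note that `[^ ]*` excludes spaces, so under this assumption a name such as `my file.tmp` would not be ignored even though it ends in `.tmp`.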

Test case 7: dynamically monitoring multiple files, buffered in a memory channel

flume-conf.properties configuration file

# The agent in this case is called 'a1'

a1.sources = s1
a1.channels = c1
a1.sinks = k1

# defined sources
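# if you compiled the source class yourself, put its fully qualified class name here
a1.sources.s1.type = TAILDIR
a1.sources.s1.positionFile =/opt/cdh5.7.6/flume-1.6.0-cdh5.7.6-bin/position/taildir_position.json
# absolute path of each file group; a regular expression (not a file system glob) may be used, but only for the file name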
a1.sources.s1.filegroups = f1 f2

a1.sources.s1.filegroups.f1 = /opt/datas/flume/taildir/test.txt
# header values, set per header key; multiple headers can be specified for one file group
a1.sources.s1.headers.f1.age = 17
a1.sources.s1.headers.f1.type = bb

a1.sources.s1.filegroups.f2 = /opt/datas/flume/taildir/huadian/.*
# header values, set per header key; multiple headers can be specified for one file group
a1.sources.s1.headers.f2.age = 18
a1.sources.s1.headers.f2.type = aa

# Each channel's type is defined.
a1.channels.c1.type = memory
# maximum number of events the channel can hold
a1.channels.c1.capacity=1000
# maximum number of events per transaction
a1.channels.c1.transactionCapacity=100


# Each sink's type must be defined
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path=/flume/taildir
# set the file type and write format so the output is plain text (avoids garbled Chinese characters)
a1.sinks.k1.hdfs.fileType=DataStream
a1.sinks.k1.hdfs.writeFormat=Text


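# roll the file after this many seconds (0 disables time-based rolling)
a1.sinks.k1.hdfs.rollInterval=0
# roll the file once it reaches this size in bytes
a1.sinks.k1.hdfs.rollSize=10240
# roll the file after this many events (0 disables count-based rolling)
a1.sinks.k1.hdfs.rollCount=0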

#Specify the channel the sink should use
a1.sinks.k1.channel = c1
a1.sources.s1.channels = c1

Result

File contents:

i
am
a
chinese

i
love
my
country

Contents collected by Flume after appending to the file:


i
am
a
chinese

i
love
my
country
test1
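Only the appended line is picked up after the append because the source remembers how far it has read in each file and stores that offset in the position file. The Python sketch below imitates this idea in a simplified form. It is not the TAILDIR source's implementation: `read_new_lines` is a hypothetical helper, and the JSON layout here keeps only a file path and a byte offset.

```python
import json
import os
import tempfile


def read_new_lines(path, position):
    """Read the complete lines appended after byte offset `position`.

    Returns (lines, new_position). A trailing partial line is left for later.
    """
    with open(path, "rb") as f:
        f.seek(position)
        data = f.read()
    end = data.rfind(b"\n") + 1  # 0 when no complete line is available yet
    return data[:end].decode("utf-8").splitlines(), position + end


with tempfile.TemporaryDirectory() as workdir:
    log_path = os.path.join(workdir, "test.txt")
    pos_path = os.path.join(workdir, "taildir_position.json")

    # First pass: read the whole file and remember the offset.
    with open(log_path, "wb") as f:
        f.write(b"i\nlove\nmy\ncountry\n")
    lines, pos = read_new_lines(log_path, 0)
    with open(pos_path, "w", encoding="utf-8") as f:
        json.dump([{"file": log_path, "pos": pos}], f)

    # Later: a line is appended, and reading resumes from the saved offset.
    with open(log_path, "ab") as f:
        f.write(b"test1\n")
    with open(pos_path, encoding="utf-8") as f:
        saved = json.load(f)[0]["pos"]
    new_lines, pos = read_new_lines(log_path, saved)
    print(new_lines)
```

Persisting the offset is also what would let a restarted agent continue where it stopped instead of re-reading the file from the beginning.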
