1. Case 2: Real-time monitoring of a single appending file
- Requirement: monitor the Hive log in real time and upload it to HDFS
- Implementation steps
# Name the components on this agent
a2.sources = r2
a2.sinks = k2
a2.channels = c2

# Describe/configure the source
a2.sources.r2.type = exec
a2.sources.r2.command = tail -F /opt/module/flume/demo/1.log
a2.sources.r2.shell = /bin/bash -c

# Describe the sink
a2.sinks.k2.type = hdfs
# To use time escape sequences (%Y%m%d/%H) in the path, one of two conditions must hold:
# use the local timestamp, or
# the event headers must contain a timestamp
a2.sinks.k2.hdfs.path = hdfs://hadoop101:8020/flume/%Y%m%d/%H
# Prefix for files uploaded to HDFS
a2.sinks.k2.hdfs.filePrefix = logs-
# Whether to roll directories based on time
a2.sinks.k2.hdfs.round = true
# How many time units before a new directory is created
a2.sinks.k2.hdfs.roundValue = 1
# The time unit used for rolling directories
a2.sinks.k2.hdfs.roundUnit = hour
# Whether to use the local timestamp
a2.sinks.k2.hdfs.useLocalTimeStamp = true
# Number of events to accumulate before flushing to HDFS
a2.sinks.k2.hdfs.batchSize = 100
# File type; compression is supported
a2.sinks.k2.hdfs.fileType = DataStream
# How often to roll a new file (60 seconds)
a2.sinks.k2.hdfs.rollInterval = 60
# The size at which each file rolls
a2.sinks.k2.hdfs.rollSize = 134217700
# Rolling is independent of the number of events (0 disables this trigger)
a2.sinks.k2.hdfs.rollCount = 0

# Use a channel which buffers events in memory
a2.channels.c2.type = memory
a2.channels.c2.capacity = 1000
a2.channels.c2.transactionCapacity = 100

# Bind the source and sink to the channel
a2.sources.r2.channels = c2
a2.sinks.k2.channel = c2

--Start the Flume agent
flume-ng agent -n a2 -c conf/ -f job/execsource_hdfssink.conf -Dflume.root.logger=INFO,console
--After startup, the flume directory appears on HDFS
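To exercise this agent without a live Hive instance, you can append lines to the tailed file by hand. A minimal sketch, assuming the demo file path from the config above:

# Create the file that the exec source tails
mkdir -p /opt/module/flume/demo
touch /opt/module/flume/demo/1.log
# Append one timestamped line per second to simulate a growing log (stop with Ctrl+C)
while true; do
  echo "$(date) test log line" >> /opt/module/flume/demo/1.log
  sleep 1
done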
- Exec Source
Exec source runs a given Unix command on start-up and expects that process to continuously produce data on standard out (stderr is simply discarded, unless property logStdErr is set to true). If the process exits for any reason, the source also exits and will produce no further data. This means configurations such as
cat [named pipe]
or
tail -F [file]
are going to produce the desired results, whereas
date
will probably not - the former two commands produce streams of data, whereas the latter produces a single event and exits.
- HDFS Sink
This sink writes events into the Hadoop Distributed File System (HDFS). It currently supports creating text and sequence files. It supports compression in both file types. The files can be rolled (close current file and create a new one) periodically based on the elapsed time or size of data or number of events. It also buckets/partitions data by attributes like timestamp or machine where the event originated. The HDFS directory path may contain formatting escape sequences that will be replaced by the HDFS sink to generate a directory/file name to store the events. Using this sink requires hadoop to be installed so that Flume can use the Hadoop jars to communicate with the HDFS cluster. Note that a version of Hadoop that supports the sync() call is required.
https://flume.apache.org/releases/content/1.9.0/FlumeUserGuide.html
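To check the rolling and bucketing described above, you can list the sink's output with the standard HDFS CLI; the date/hour path below is illustrative:

# List the time-bucketed directories created by the sink (%Y%m%d/%H)
hdfs dfs -ls -R /flume
# Inspect a rolled file (logs- prefix from the config; the path is an example)
hdfs dfs -cat /flume/20240101/10/logs-*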
2. Case 3: Real-time monitoring of multiple new files in a directory
- Requirement: use Flume to monitor an entire directory for new files and upload them to HDFS
- Implementation steps
--Create the configuration file flume-dir-hdfs.conf
vim flume-dir-hdfs.conf
--Add the following content
a3.sources = r3
a3.sinks = k3
a3.channels = c3

# Describe/configure the source
# Spooling Directory Source: watches a directory and automatically collects its contents.
# After a file in the directory has been fully read, there are two options:
# delete it, or append a suffix to its name (default .COMPLETED); configured via the deletePolicy property.
# File names in this directory must be unique; a duplicate name throws an exception.
a3.sources.r3.type = spooldir
a3.sources.r3.spoolDir = /opt/module/flume/upload
a3.sources.r3.fileSuffix = .COMPLETED
a3.sources.r3.fileHeader = true
# Ignore (do not upload) all files ending in .tmp
a3.sources.r3.ignorePattern = ([^ ]*\.tmp)

# Describe the sink
a3.sinks.k3.type = hdfs
a3.sinks.k3.hdfs.path = hdfs://hadoop101:8020/flume/upload/%Y%m%d/%H
# Prefix for files uploaded to HDFS
a3.sinks.k3.hdfs.filePrefix = upload-
# Whether to roll directories based on time
a3.sinks.k3.hdfs.round = true
# How many time units before a new directory is created
a3.sinks.k3.hdfs.roundValue = 1
# The time unit used for rolling directories
a3.sinks.k3.hdfs.roundUnit = hour
# Whether to use the local timestamp
a3.sinks.k3.hdfs.useLocalTimeStamp = true
# Number of events to accumulate before flushing to HDFS
a3.sinks.k3.hdfs.batchSize = 100
# File type; compression is supported
a3.sinks.k3.hdfs.fileType = DataStream
# How often to roll a new file
a3.sinks.k3.hdfs.rollInterval = 60
# Roll each file at roughly 128 MB
a3.sinks.k3.hdfs.rollSize = 134217700
# Rolling is independent of the number of events
a3.sinks.k3.hdfs.rollCount = 0

# Use a channel which buffers events in memory
a3.channels.c3.type = memory
a3.channels.c3.capacity = 1000
a3.channels.c3.transactionCapacity = 100

# Bind the source and sink to the channel
a3.sources.r3.channels = c3
a3.sinks.k3.channel = c3

--Start the agent that monitors the directory
flume-ng agent -n a3 -c conf/ -f job/flume-dir-hdfs.conf
--Create the upload directory under /opt/module/flume
mkdir upload
--Write some content into it; the log files appear on HDFS
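After the agent picks up a file, the spooling directory source renames it with the configured suffix. A quick way to observe this; the file name is illustrative:

echo "hello spooldir" > /opt/module/flume/upload/a.log
ls /opt/module/flume/upload
# Once the file has been ingested, the listing shows:
# a.log.COMPLETED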
When the duplicate-file-name exception is thrown, the agent dies; anything written to files in the directory after that point is not read until the agent is restarted.
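Because files must not reappear under an already-processed name, a common delivery pattern is to stage files under the ignored .tmp suffix and rename them into place. A sketch, assuming the ignorePattern from the config above; the source path is hypothetical:

src=/var/log/app/today.log                       # hypothetical file to ship
dst=/opt/module/flume/upload/app-$(date +%s).log
# The .tmp copy is ignored by the source; mv within one filesystem is atomic,
# so the source never reads a half-written file, and the timestamp in the
# name avoids the duplicate-file-name exception
cp "$src" "$dst.tmp"
mv "$dst.tmp" "$dst"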
3. Real-time monitoring of multiple appending files in a directory
- Requirement: use Flume to monitor files being appended in real time across an entire directory and upload them to HDFS
- Implementation steps
--Create the configuration file flume-taildir-hdfs.conf
vim flume-taildir-hdfs.conf
--Add the following content
a3.sources = r3
a3.sinks = k3
a3.channels = c3

# Describe/configure the source
a3.sources.r3.type = TAILDIR
# This file records how far the source has read into each file;
# if it is lost, the source reads each file again from the beginning
a3.sources.r3.positionFile = /opt/module/flume/tail_dir.json
a3.sources.r3.filegroups = f1 f2
a3.sources.r3.filegroups.f1 = /opt/module/flume/files/.*file.*
a3.sources.r3.filegroups.f2 = /opt/module/flume/files/.*log.*

# Describe the sink
a3.sinks.k3.type = hdfs
a3.sinks.k3.hdfs.path = hdfs://hadoop101:8020/flume/upload2/%Y%m%d/%H
# Prefix for files uploaded to HDFS
a3.sinks.k3.hdfs.filePrefix = upload-
# Whether to roll directories based on time
a3.sinks.k3.hdfs.round = true
# How many time units before a new directory is created
a3.sinks.k3.hdfs.roundValue = 1
# The time unit used for rolling directories
a3.sinks.k3.hdfs.roundUnit = hour
# Whether to use the local timestamp
a3.sinks.k3.hdfs.useLocalTimeStamp = true
# Number of events to accumulate before flushing to HDFS
a3.sinks.k3.hdfs.batchSize = 100
# File type; compression is supported
a3.sinks.k3.hdfs.fileType = DataStream
# How often to roll a new file
a3.sinks.k3.hdfs.rollInterval = 60
# Roll each file at roughly 128 MB
a3.sinks.k3.hdfs.rollSize = 134217700
# Rolling is independent of the number of events
a3.sinks.k3.hdfs.rollCount = 0

# Use a channel which buffers events in memory
a3.channels.c3.type = memory
a3.channels.c3.capacity = 1000
a3.channels.c3.transactionCapacity = 100

# Bind the source and sink to the channel
a3.sources.r3.channels = c3
a3.sinks.k3.channel = c3
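For reference, the position file is a JSON array with one entry per tailed file, recording the byte offset the source has reached. An illustrative example of what /opt/module/flume/tail_dir.json might contain (the inode and pos values are made up):

[{"inode": 2496272, "pos": 12, "file": "/opt/module/flume/files/file1.txt"},
 {"inode": 2496273, "pos": 4, "file": "/opt/module/flume/files/file2.txt"}]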
Exec Source is suitable for monitoring a single file that is appended in real time, but it cannot resume from where it left off after a restart;
Spooldir Source is suitable for syncing new files, but not for monitoring and syncing files that are being appended in real time;
Taildir Source is suitable for monitoring multiple files appended in real time, and it can resume from the last read position.
--Start the agent that monitors the directory
flume-ng agent -n a3 -c conf/ -f job/flume-taildir-hdfs.conf
--Create the files directory under /opt/module/flume
mkdir files
--Enter the files directory and append some data; HDFS records the new content
echo hello >> file1.txt
echo 111 > file2.txt
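To see the resume behaviour described above, stop the agent, append while it is down, and restart it. A sketch; the pgrep pattern assumes this is the only matching agent process:

kill $(pgrep -f flume-taildir-hdfs.conf)
echo appended-while-down >> file1.txt
flume-ng agent -n a3 -c conf/ -f job/flume-taildir-hdfs.conf
# Only the new line is shipped: the agent resumes from the offset recorded
# in /opt/module/flume/tail_dir.json rather than re-reading file1.txt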