1. Monitoring port data (official example)
Requirement:
Use Flume to listen on a port, collect the data arriving on it, and print it to the console.
Steps
(1) Create a work directory under the Flume installation directory
[hadoop@hadoop181 apache-flume]$ mkdir -p work
[hadoop@hadoop181 apache-flume]$ cd work
(2) Create the Flume agent configuration file
[hadoop@hadoop181 work]$ vim flume-netcat-logger.conf
(3) Configuration file contents:
# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1
# Describe/configure the source
a1.sources.r1.type = netcat
a1.sources.r1.bind = 0.0.0.0
a1.sources.r1.port = 44444
# Describe the sink
a1.sinks.k1.type = logger
# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
(4) Start the Flume agent listening on the port (-c is the conf directory, -n the agent name matching the a1 prefix in the config, -f the agent configuration file)
[hadoop@hadoop181 work]$ cd ../
[hadoop@hadoop181 apache-flume]$ bin/flume-ng agent -c conf/ -n a1 -f work/flume-netcat-logger.conf -Dflume.root.logger=INFO,console
(5) Use telnet or netcat to send data to the agent
[hadoop@hadoop181 apache-flume]$ telnet 192.168.207.181 44444
Trying 192.168.207.181...
Connected to 192.168.207.181.
Escape character is '^]'.
nihao
OK
(6) Check the agent's console output
Notes
(1) If netcat is missing, it can be installed with yum
[hadoop@hadoop181 ~]$ sudo yum -y install nc
(2) If telnet is missing, it can also be installed with yum
[hadoop@hadoop181 ~]$ sudo yum -y install telnet
2. Streaming a single appended file to HDFS in real time
Requirement:
Monitor the log file
/home/hadoop/data/flume/logs/append_log.log
in real time and upload its contents to HDFS
Steps
(1) Create the agent configuration file in Flume's work directory
[hadoop@hadoop181 work]$ vim flume-file-hdfs.conf
(2) Configuration contents:
# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1
# Source: tail the log file via an exec source
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /home/hadoop/data/flume/logs/append_log.log
a1.sources.r1.shell = /bin/bash -c
# Sink: write to HDFS, bucketed by date and hour
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = hdfs://hadoop181:9000/flume/%Y%m%d/%H
a1.sinks.k1.hdfs.filePrefix = flume-logs-
a1.sinks.k1.hdfs.round = true
a1.sinks.k1.hdfs.roundValue = 1
a1.sinks.k1.hdfs.roundUnit = hour
a1.sinks.k1.hdfs.useLocalTimeStamp = true
a1.sinks.k1.hdfs.batchSize = 1000
a1.sinks.k1.hdfs.fileType = DataStream
# Roll every 30 seconds or at ~128 MB, never by event count
a1.sinks.k1.hdfs.rollInterval = 30
a1.sinks.k1.hdfs.rollSize = 134117700
a1.sinks.k1.hdfs.rollCount = 0
# Channel: in-memory buffer
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 1000
# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
(3) Create the test file /home/hadoop/data/flume/logs/append_log.log
[hadoop@hadoop181 work]$ mkdir -p /home/hadoop/data/flume/logs
[hadoop@hadoop181 work]$ touch /home/hadoop/data/flume/logs/append_log.log
(4) Start Flume
[hadoop@hadoop181 work]$ cd ../
[hadoop@hadoop181 apache-flume]$ bin/flume-ng agent -c conf/ -n a1 -f work/flume-file-hdfs.conf -Dflume.root.logger=INFO,console
(5) Append content to the file
[hadoop@hadoop181 ~]$ echo "2">>/home/hadoop/data/flume/logs/append_log.log
(6) The agent console prints the event
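The single append in step (5) can be extended into a loop to give the exec source a steadier stream of events. A minimal sketch, using the log path from the configuration above:

```shell
# Append several timestamped lines to the monitored log so the
# exec source (tail -F) has a continuous stream of events to ship.
LOG=/home/hadoop/data/flume/logs/append_log.log
mkdir -p "$(dirname "$LOG")"
for i in 1 2 3 4 5; do
    echo "$(date '+%Y-%m-%d %H:%M:%S') test line $i" >> "$LOG"
done
```

Each appended line should show up in HDFS under the current date/hour directory once the sink rolls the file.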
3. Monitoring new log files in a directory in real time
Requirement:
Use Flume to monitor an entire directory and upload new files to HDFS
Steps
(1) Create the agent configuration file in Flume's work directory
[hadoop@hadoop181 work]$ vim flume-dir-hdfs.conf
(2) Agent configuration file contents:
# 1. Name the agent's source, channel, and sink
a1.channels = c1
a1.sources = r1
a1.sinks = k1
# 2. Describe the source
a1.sources.r1.type = spooldir
# Directory to monitor
a1.sources.r1.spoolDir = /home/hadoop/data/spooldir/logs/
# Only pick up files in the directory matching this regex
a1.sources.r1.includePattern = ^.*\.log$
# Ignore (do not upload) files ending in .tmp
a1.sources.r1.ignorePattern = ([^ ]*\.tmp)
# Suffix appended to files once they have been fully ingested
a1.sources.r1.fileSuffix = .COMPLETED
a1.sources.r1.batchSize = 100
# 3. Describe the channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 100
a1.channels.c1.transactionCapacity = 100
# 4. Describe the sink
a1.sinks.k1.type = hdfs
# HDFS target directory
a1.sinks.k1.hdfs.path = hdfs://hadoop181:9000/flume/logs/spooldir/%Y%m%d/%H
# Prefix for the files written to HDFS
a1.sinks.k1.hdfs.filePrefix = flume-
# Roll interval in seconds
a1.sinks.k1.hdfs.rollInterval = 5
# Roll by size: once the current file reaches this many bytes, Flume starts a new one
a1.sinks.k1.hdfs.rollSize = 13421000
# Roll by event count (0 disables count-based rolling)
a1.sinks.k1.hdfs.rollCount = 0
# Batch size for each write to HDFS
a1.sinks.k1.hdfs.batchSize = 100
# File format for HDFS writes (DataStream = plain text, no compression)
a1.sinks.k1.hdfs.fileType = DataStream
# Whether to round down the timestamp used in the directory path
a1.sinks.k1.hdfs.round = true
# Rounding value
a1.sinks.k1.hdfs.roundValue = 24
# Rounding unit [hour, second, minute]
a1.sinks.k1.hdfs.roundUnit = hour
# Use the local time instead of the event timestamp header
a1.sinks.k1.hdfs.useLocalTimeStamp = true
# 5. Wire up source -> channel -> sink
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
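The interplay of includePattern and ignorePattern can be checked locally with grep, no Flume needed: a file is ingested when it matches the include regex and does not match the ignore regex. The filenames below are made up for illustration:

```shell
# Classify sample filenames the way the spooldir source would:
# ingest those matching ^.*\.log$, skip those ending in .tmp.
for f in test1.log upload.log.tmp notes.txt app.log; do
    if echo "$f" | grep -Eq '^.*\.log$' && ! echo "$f" | grep -Eq '\.tmp$'; then
        echo "$f -> ingested"
    else
        echo "$f -> skipped"
    fi
done
# prints:
#   test1.log -> ingested
#   upload.log.tmp -> skipped
#   notes.txt -> skipped
#   app.log -> ingested
```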
(3) Create the test log directory
[hadoop@hadoop181 work]$ mkdir -p /home/hadoop/data/spooldir/logs/
(4) Start Flume
[hadoop@hadoop181 work]$ cd ../
[hadoop@hadoop181 apache-flume]$ bin/flume-ng agent -c conf/ -n a1 -f work/flume-dir-hdfs.conf -Dflume.root.logger=INFO,console
(5) Add a .log file to the monitored directory
[hadoop@hadoop181 apache-flume]$ cd /home/hadoop/data/spooldir/logs/
# Create a log file with some content
[hadoop@hadoop181 logs]$ echo "1">test1.log
(6) View the files in HDFS in the browser
(7) Create another one to try again
[hadoop@hadoop181 logs]$ echo "2">test2.log
(8) Check again
4. Monitoring multiple appended files in a directory in real time
Requirement:
Use Flume to monitor files being appended to in real time across a directory and upload them to HDFS
Steps:
(1) Create the configuration file flume-taildir-hdfs.conf in Flume's work directory
(2) Configuration contents:
# 1. Name the agent's source, channel, and sink
a1.sources = r1
a1.channels = c1
a1.sinks = k1
# 2. Describe the source. Note: the type name is uppercase
a1.sources.r1.type = TAILDIR
# File groups to monitor
a1.sources.r1.filegroups = fg1 fg2
# Files monitored by group fg1, under /home/hadoop/data/taildir/logs/
a1.sources.r1.filegroups.fg1 = /home/hadoop/data/taildir/logs/.*log
# Files monitored by group fg2
a1.sources.r1.filegroups.fg2 = /home/hadoop/data/taildir/txt/.*txt
# Position file recording read offsets; this is what enables resuming
# from where the source left off after a restart
a1.sources.r1.positionFile = /home/hadoop/data/taildir/postion.json
a1.sources.r1.batchSize = 100
# 3. Describe the channel. Note: transactionCapacity must be <= capacity
a1.channels.c1.type = memory
a1.channels.c1.capacity = 100
a1.channels.c1.transactionCapacity = 100
# 4. Describe the sink
a1.sinks.k1.type = hdfs
# HDFS target directory
a1.sinks.k1.hdfs.path = hdfs://hadoop181:9000/flume/logs/taildir/%Y%m%d/%H
# Prefix for the files written to HDFS
a1.sinks.k1.hdfs.filePrefix = flume-
# Roll interval in seconds
a1.sinks.k1.hdfs.rollInterval = 5
# Roll by size: once the current file reaches this many bytes, Flume starts a new one
a1.sinks.k1.hdfs.rollSize = 13421000
# Roll by event count (0 disables count-based rolling)
a1.sinks.k1.hdfs.rollCount = 0
# Batch size for each write to HDFS
a1.sinks.k1.hdfs.batchSize = 100
# Compression codec (only applies when fileType = CompressedStream), e.g.:
#a1.sinks.k1.hdfs.codeC = gzip
# File format [SequenceFile = binary sequence file, DataStream = plain text,
# CompressedStream = compressed]
a1.sinks.k1.hdfs.fileType = DataStream
# Whether to round down the timestamp used in the directory path
a1.sinks.k1.hdfs.round = true
# Rounding value
a1.sinks.k1.hdfs.roundValue = 24
# Rounding unit [hour, second, minute]
a1.sinks.k1.hdfs.roundUnit = hour
# Use the local time instead of the event timestamp header
a1.sinks.k1.hdfs.useLocalTimeStamp = true
# 5. Wire up source -> channel -> sink
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
(3) Create the test directories
[hadoop@hadoop181 logs]$ mkdir -p /home/hadoop/data/taildir/logs/
[hadoop@hadoop181 logs]$ mkdir -p /home/hadoop/data/taildir/txt/
(4) Start the Flume agent
[hadoop@hadoop181 work]$ cd ../
[hadoop@hadoop181 apache-flume]$ bin/flume-ng agent -c conf/ -n a1 -f work/flume-taildir-hdfs.conf -Dflume.root.logger=INFO,console
(5) Write files into the monitored directories to test
[hadoop@hadoop181 logs]$ cd /home/hadoop/data/taildir/logs
[hadoop@hadoop181 logs]$ echo "log1,hello">test1.log
[hadoop@hadoop181 logs]$ cd /home/hadoop/data/taildir/txt
[hadoop@hadoop181 txt]$ echo "txt1,world">test1.txt
(6) View the files in HDFS
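The positionFile configured above (postion.json) is a JSON array with one entry per tracked file; the inode, byte offset (pos), and path are what let the TAILDIR source resume after a restart. A fabricated sample, just to show the shape (the real file is written and updated by Flume):

```shell
# Write a sample position file with the same structure Flume maintains,
# then pull out the recorded offset. All values here are invented.
cat > /tmp/postion_sample.json <<'EOF'
[{"inode":123456,"pos":11,"file":"/home/hadoop/data/taildir/logs/test1.log"}]
EOF
grep -o '"pos":[0-9]*' /tmp/postion_sample.json
# prints "pos":11
```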
5. Collecting files into different HDFS directories by type
Requirement
On node1, the directory /home/hadoop/data/logs/ contains access.log, nginx.log, and web.log. Collect them into HDFS, placing each in a separate directory according to its category.
Steps
(1) Create the file exec_source_avro_sink.conf in Flume's work directory
vim exec_source_avro_sink.conf
(2) Configuration contents:
# Sources r1, r2, r3 each tail one of the three files
a1.sources = r1 r2 r3
a1.sinks = k1
a1.channels = c1
# Describe/configure the sources
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /home/hadoop/data/logs/access.log
# Interceptor name
a1.sources.r1.interceptors = i1
# Static interceptor: stamps a fixed header onto every event
a1.sources.r1.interceptors.i1.type = static
# Header key/value added to each event
a1.sources.r1.interceptors.i1.key = type
a1.sources.r1.interceptors.i1.value = access
a1.sources.r2.type = exec
a1.sources.r2.command = tail -F /home/hadoop/data/logs/nginx.log
a1.sources.r2.interceptors = i2
a1.sources.r2.interceptors.i2.type = static
a1.sources.r2.interceptors.i2.key = type
a1.sources.r2.interceptors.i2.value = nginx
a1.sources.r3.type = exec
a1.sources.r3.command = tail -F /home/hadoop/data/logs/web.log
a1.sources.r3.interceptors = i3
a1.sources.r3.interceptors.i3.type = static
a1.sources.r3.interceptors.i3.key = type
a1.sources.r3.interceptors.i3.value = web
# Describe the sink: forward events to the avro source on node2
a1.sinks.k1.type = avro
a1.sinks.k1.hostname = hadoop182
a1.sinks.k1.port = 41414
# Channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 1000
# Bind sources -> channel -> sink
a1.sources.r1.channels = c1
a1.sources.r2.channels = c1
a1.sources.r3.channels = c1
a1.sinks.k1.channel = c1
(3) On node2, create the file avro_source_hdfs_sink.conf in the work directory
vim avro_source_hdfs_sink.conf
(4) Configuration contents:
# Name the agent's source, channel, and sink
a1.sources = r1
a1.sinks = k1
a1.channels = c1
# Define the source
a1.sources.r1.type = avro
a1.sources.r1.bind = hadoop182
a1.sources.r1.port = 41414
# Timestamp interceptor: adds a timestamp header to events that lack one
a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = org.apache.flume.interceptor.TimestampInterceptor$Builder
# Define the channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 1000
# Define the sink; %{type} is filled in from the header set on node1
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = hdfs://hadoop181:9000/flume/staticlog/%{type}/%Y%m%d
a1.sinks.k1.hdfs.filePrefix = events
a1.sinks.k1.hdfs.fileType = DataStream
a1.sinks.k1.hdfs.writeFormat = Text
# Not needed here: the timestamp interceptor already supplies the timestamp
#a1.sinks.k1.hdfs.useLocalTimeStamp = true
# Do not roll by event count
a1.sinks.k1.hdfs.rollCount = 0
# Roll every 30 seconds
a1.sinks.k1.hdfs.rollInterval = 30
# Roll by size at 10 MB
a1.sinks.k1.hdfs.rollSize = 10485760
#a1.sinks.k1.hdfs.rollSize = 0
# Number of events written to HDFS per batch
a1.sinks.k1.hdfs.batchSize = 20
# Number of threads Flume uses for HDFS operations (create, write, ...)
a1.sinks.k1.hdfs.threadsPoolSize = 10
# Timeout for HDFS operations, in milliseconds
a1.sinks.k1.hdfs.callTimeout = 30000
# Wire up source -> channel -> sink
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
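How the %{type} escape in hdfs.path resolves: the static interceptors on node1 stamp each event with a type header (access, nginx, or web), and the HDFS sink on node2 substitutes the header value into the path. A local string sketch of the substitution (the date escapes %Y%m%d are expanded by Flume at write time and stay literal here):

```shell
# Substitute each possible header value into the path template, the way
# the HDFS sink expands %{type}. Pure string manipulation, no Flume needed.
TEMPLATE='hdfs://hadoop181:9000/flume/staticlog/%{type}/%Y%m%d'
for t in access nginx web; do
    echo "$TEMPLATE" | sed "s/%{type}/$t/"
done
# prints:
#   hdfs://hadoop181:9000/flume/staticlog/access/%Y%m%d
#   hdfs://hadoop181:9000/flume/staticlog/nginx/%Y%m%d
#   hdfs://hadoop181:9000/flume/staticlog/web/%Y%m%d
```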
(5) Start the agent on node2 (always start the end farthest from the data source first)
[hadoop@hadoop182 apache-flume]$ bin/flume-ng agent -c conf/ -f work/avro_source_hdfs_sink.conf -n a1 -Dflume.root.logger=INFO,console
(6) Start the agent on node1
[hadoop@hadoop181 apache-flume]$ bin/flume-ng agent -c conf/ -f work/exec_source_avro_sink.conf -n a1 -Dflume.root.logger=INFO,console
(7) View the files generated in HDFS
(8) Test appending
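To test appending, one line can be written to each of the three monitored logs on node1; each should then land under its own type directory in HDFS. The directory is the one assumed by the exec sources in exec_source_avro_sink.conf:

```shell
# Append one test entry to each of the three monitored log files.
LOGDIR=/home/hadoop/data/logs
mkdir -p "$LOGDIR"
for name in access nginx web; do
    echo "$name test $(date +%s)" >> "$LOGDIR/$name.log"
done
```

After a few seconds, /flume/staticlog/access/, /flume/staticlog/nginx/, and /flume/staticlog/web/ should each receive a new events- file.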