6.1 Concepts
Flume is a highly available, highly reliable, distributed system for collecting, aggregating, and moving massive volumes of log data.
In one sentence: a dependable, convenient log collection tool.
Flume leans toward pure data transport; that emphasis is the main difference between Flume and Logstash.
6.2 Features
- A simple, flexible stream-based architecture
- Built-in load balancing and failover mechanisms
6.3 Use Cases
- Basic scenario
  WebServer: the web application that produces the logs.
  Agent: the Flume agent, a long-running service that continuously moves data; the basic unit of data transferred inside an agent is the Event.
  An Agent is made up of three components, Source, Channel, and Sink; these are the three core components of Flume.
  An Event is the basic unit of a message in Flume; it consists of zero or more headers plus a body.
- Advanced scenarios (for example, the multi-agent website-log pipeline in section 6.5)
6.4 Core Components
- Source
  A Source tells Flume where to read data from; whatever it reads is handed to the channel behind it.
  Common Source types:
  - Exec Source
    Tails a file for newly appended content (via a command such as tail -F).
  - NetCat TCP/UDP Source
    Listens on a given port (TCP or UDP) and reads every line of data that arrives on it.
  - Spooling Directory Source
    Picks up files newly dropped into a watched directory.
  - Kafka Source
    Reads data from a Kafka message queue (see the sketch right after this list).
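A minimal Kafka Source sketch; the broker address, topic, and consumer group below are placeholder values, not part of this chapter's cluster:

```
# read events from a Kafka topic into channel c1 (placeholder broker/topic/group)
a1.sources.r1.type = org.apache.flume.source.kafka.KafkaSource
a1.sources.r1.kafka.bootstrap.servers = kafka01:9092
a1.sources.r1.kafka.topics = access_log
a1.sources.r1.kafka.consumer.group.id = flume-group
a1.sources.r1.channels = c1
```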
- Channel
  A Channel is a pipe that temporarily buffers data between the source and the sink.
  Common Channel types:
  - Memory Channel
    Pros: fast, since no disk I/O is involved.
    Cons:
    - Data can be lost: if the Flume agent dies, whatever is still in the channel is gone.
    - Memory is finite, so the channel can run out of room.
  - File Channel
    Pros: data survives an agent crash.
    Cons: somewhat slower than the memory channel because of disk I/O, though not nearly as slow as you might expect, which is why it is also a commonly used channel.
  - Spillable Memory Channel
    Pros: solves the memory channel's capacity problem by spilling overflow to disk (see the sketch right after this list).
    Cons: the in-memory portion can still be lost if the agent dies.
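A minimal Spillable Memory Channel sketch; the capacities are illustrative, and the checkpoint/data directories below are assumptions reusing this chapter's install path:

```
# hold up to memoryCapacity events in RAM; spill the overflow to disk
a1.channels.c1.type = SPILLABLEMEMORY
a1.channels.c1.memoryCapacity = 10000
a1.channels.c1.overflowCapacity = 1000000
a1.channels.c1.checkpointDir = /data/soft/apache-flume-1.9.0-bin/data/spill/checkpoint
a1.channels.c1.dataDirs = /data/soft/apache-flume-1.9.0-bin/data/spill/data
```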
- Sink
  A Sink takes data out of the Channel and writes it to the destination.
  Common Sink types:
  - Logger Sink
    Treats the data as log output, either printed to the console or written to a file; mainly used for testing.
  - HDFS Sink
    Writes the data to HDFS. Very common, mainly for offline (batch) computing scenarios.
  - Kafka Sink
    Sends the data to a Kafka message queue (see the sketch right after this list). Also very common, mainly for real-time computing: the data never lands on disk but is shipped in real time and consumed directly by a stream-processing framework.
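A minimal Kafka Sink sketch; the broker address and topic are placeholder values:

```
# publish events from channel c1 to a Kafka topic (placeholder broker/topic)
a1.sinks.k1.type = org.apache.flume.sink.kafka.KafkaSink
a1.sinks.k1.kafka.bootstrap.servers = kafka01:9092
a1.sinks.k1.kafka.topic = access_log
a1.sinks.k1.channel = c1
```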
6.5 Quick Start
- Getting-started example
  - Requirements
    Source: NetCat TCP/UDP Source
    Channel: Memory Channel
    Sink: Logger Sink
  - Write the configuration file

```
[root@bigdata04 ~]# cat /data/soft/apache-flume-1.9.0-bin/conf/example.conf
# example.conf: A single-node Flume configuration

# Name the components on this agent
# a1 is the agent name; give the source, sink, and channel their own names
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source
# source type netcat: listen on a port and read whatever arrives
a1.sources.r1.type = netcat
a1.sources.r1.bind = 0.0.0.0
a1.sources.r1.port = 44444

# Describe the sink
# sink type logger
a1.sinks.k1.type = logger

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
```
  - Start Flume

```
[root@bigdata04 apache-flume-1.9.0-bin]# bin/flume-ng agent --name a1 --conf conf --conf-file conf/example.conf -Dflume.root.logger=INFO,console
```

    The agent subcommand starts a Flume agent.
    --name: the agent's name (a1 here)
    --conf: Flume's configuration root directory
    --conf-file: the agent's own configuration file (the one containing the source, channel, and sink definitions)
    -D: passes extra JVM system properties; here it sets Flume's log level (INFO) and log destination (console)
  - Send a message
    If telnet is not installed yet, install it first: yum install -y telnet

```
[root@bigdata04 ~]# telnet 192.168.35.103 44444
Trying 192.168.35.103...
Connected to 192.168.35.103.
Escape character is '^]'.
hello big data04
OK
```
  - Console output

```
[INFO - org.apache.flume.sink.LoggerSink.process(LoggerSink.java:95)] Event: { headers:{} body: 68 65 6C 6C 6F 20 62 69 67 20 64 61 74 61 30 34 hello big data04 }
```
- Collecting file contents into HDFS
  - Requirements
    Source: Spooling Directory Source
    Channel: File Channel
    Sink: HDFS Sink
  - Set up the client node
    Configure bigdata04 as a Hadoop client node so it can reach HDFS:

```
# copy the Hadoop distribution over
[root@bigdata01 soft]# scp -rq hadoop-3.2.0 192.168.35.103:/data/soft/

# add these settings to /etc/profile on bigdata04 (or simply copy bigdata01's profile entries over)
export HADOOP_HOME=/data/soft/hadoop-3.2.0
export PATH=.:$JAVA_HOME/bin:$HADOOP_HOME/bin:$PATH
```

    Remember to reload the profile so the settings take effect:

```
source /etc/profile
```
  - Write the configuration file

```
[root@bigdata04 conf]# cat file-to-hdfs.conf
# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source
a1.sources.r1.type = spooldir
a1.sources.r1.spoolDir = /data/log/studentDir

# Use a channel which buffers events on disk
a1.channels.c1.type = file
a1.channels.c1.checkpointDir = /data/soft/apache-flume-1.9.0-bin/data/studentDir/checkpoint
a1.channels.c1.dataDirs = /data/soft/apache-flume-1.9.0-bin/data/studentDir/data

# Describe the sink
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = hdfs://192.168.35.100:9000/flume/studentDir
a1.sinks.k1.hdfs.filePrefix = stu-
a1.sinks.k1.hdfs.fileSuffix = .log
a1.sinks.k1.hdfs.fileType = DataStream
a1.sinks.k1.hdfs.writeFormat = Text
a1.sinks.k1.hdfs.rollInterval = 30
a1.sinks.k1.hdfs.rollSize = 134217728
a1.sinks.k1.hdfs.rollCount = 0

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
```
  - Start the Hadoop cluster

```
[root@bigdata01 sbin]# start-all.sh
[root@bigdata01 sbin]# pwd
/data/soft/hadoop-3.2.0/sbin
```
  - Add files
    Drop new files into /data/log/studentDir:

```
[root@bigdata04 hadoop-3.2.0]# ll /data/log/studentDir/
total 36
-rw-r--r--. 1 root root 76 Aug 23 16:08 class1.dat.COMPLETED
-rw-r--r--. 1 root root 76 Aug 23 16:13 class2.dat.COMPLETED
-rw-r--r--. 1 root root 13 Aug 25 10:24 LOL1111.txt.COMPLETED
-rw-r--r--. 1 root root 16 Aug 25 10:55 LOL222.dat.COMPLETED
-rw-r--r--. 1 root root 24 Aug 23 16:27 LOL666.dat.COMPLETED
-rw-r--r--. 1 root root 19 Aug 23 16:37 LOL9999.dat.COMPLETED
-rw-r--r--. 1 root root 27 Aug 23 16:30 LOL999.dat.COMPLETED
-rw-r--r--. 1 root root 10 Aug 23 16:14 LOL.dat.COMPLETED
-rw-r--r--. 1 root root 22 Aug 23 16:45 LOLjjjjj.dat.COMPLETED
```
    Note that once Flume finishes reading a file it renames it with a .COMPLETED suffix.
  - Start Flume

```
[root@bigdata04 apache-flume-1.9.0-bin]# bin/flume-ng agent --name a1 --conf conf --conf-file conf/file-to-hdfs.conf -Dflume.root.logger=INFO,console
```
  - Check the file in HDFS

```
[root@bigdata04 hadoop-3.2.0]# bin/hdfs dfs -cat hdfs://192.168.35.100:9000/flume/studentDir/stu-.1661396200534
hello LOL!!!
hello github!!!
```
  - Run in the background with nohup &

```
# 1. start the agent with nohup
[root@bigdata04 apache-flume-1.9.0-bin]# nohup bin/flume-ng agent --name a1 --conf conf --conf-file conf/file-to-hdfs.conf -Dflume.root.logger=INFO,console &

# 2. create a new source file
[root@bigdata04 hadoop-3.2.0]# vi /data/log/studentDir/1111.dat
11111

# 3. check HDFS
[root@bigdata04 hadoop-3.2.0]# bin/hdfs dfs -cat hdfs://192.168.35.100:9000/flume/studentDir/stu-.1661396803549
11111

# 4. check the nohup log
[root@bigdata04 apache-flume-1.9.0-bin]# cat nohup.out

# stop the background agent
[root@bigdata04 apache-flume-1.9.0-bin]# jps -ml
2740 org.apache.flume.node.Application --name a1 --conf-file conf/file-to-hdfs.conf
2966 sun.tools.jps.Jps -ml
[root@bigdata04 apache-flume-1.9.0-bin]# kill -9 2740
[root@bigdata04 apache-flume-1.9.0-bin]# jps -ml
2976 sun.tools.jps.Jps -ml
[1]+  Killed                  nohup bin/flume-ng agent --name a1 --conf conf --conf-file conf/file-to-hdfs.conf -Dflume.root.logger=INFO,console
```
    jps -ml prints the fully qualified main class and its arguments for each JVM process, which makes it easy to spot the right agent to kill.
- Collecting website logs into HDFS
  - Requirements
    bigdata02 and bigdata03 each generate an access log and run a Flume agent that forwards it over Avro to an aggregator agent on bigdata04, which writes everything to HDFS:
    bigdata02: 192.168.35.101
    bigdata03: 192.168.35.102
    bigdata04: 192.168.35.103
  - Install Flume on each node

```
# install Flume on bigdata02
[root@bigdata02 soft]# tar -zxvf apache-flume-1.9.0-bin.tar.gz
[root@bigdata02 soft]# cd apache-flume-1.9.0-bin/conf
[root@bigdata02 conf]# mv flume-env.sh.template flume-env.sh

# install Flume on bigdata03
[root@bigdata03 soft]# tar -zxvf apache-flume-1.9.0-bin.tar.gz
[root@bigdata03 soft]# cd apache-flume-1.9.0-bin/conf
[root@bigdata03 conf]# mv flume-env.sh.template flume-env.sh
```
  - Write a script that simulates a live log file, access.log

```
[root@bigdata02 log]# pwd
/data/log
[root@bigdata02 log]# vi generateAccessLog.sh
#!/bin/bash
# append a line to the log file once per second, forever
while [ "1" = "1" ]
do
    # current unix timestamp
    curr_time=`date +%s`
    # current hostname
    name=`hostname`
    echo ${name}_${curr_time} >> /data/log/access.log
    # pause for one second
    sleep 1
done

# ship the same script to bigdata03
[root@bigdata02 log]# scp -rq generateAccessLog.sh 192.168.35.102:/data/log
```
  - Write the configuration files

```
# aggregator config on bigdata04
[root@bigdata04 conf]# cat avro-to-hdfs.conf
# the agent is named a1
# name the source, channel, and sink components
a1.sources = r1
a1.channels = c1
a1.sinks = k1

# configure the source
a1.sources.r1.type = avro
a1.sources.r1.bind = 0.0.0.0
a1.sources.r1.port = 45454

# configure the channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# configure the sink
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = hdfs://192.168.35.100:9000/access/%Y%m%d
a1.sinks.k1.hdfs.filePrefix = access
a1.sinks.k1.hdfs.fileType = DataStream
a1.sinks.k1.hdfs.writeFormat = Text
a1.sinks.k1.hdfs.rollInterval = 3600
a1.sinks.k1.hdfs.rollSize = 134217728
a1.sinks.k1.hdfs.rollCount = 0
a1.sinks.k1.hdfs.useLocalTimeStamp = true

# wire the components together
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
```

```
# collector config on bigdata02
[root@bigdata02 log]# cat /data/soft/apache-flume-1.9.0-bin/conf/file-to-avro-101.conf
# the agent is named a1
# name the source, channel, and sink components
a1.sources = r1
a1.channels = c1
a1.sinks = k1

# configure the source
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /data/log/access.log

# configure the channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# configure the sink
a1.sinks.k1.type = avro
a1.sinks.k1.hostname = 192.168.35.103
a1.sinks.k1.port = 45454

# wire the components together
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
```

```
# collector config on bigdata03
[root@bigdata03 log]# cat /data/soft/apache-flume-1.9.0-bin/conf/file-to-avro-102.conf
# the agent is named a1
# name the source, channel, and sink components
a1.sources = r1
a1.channels = c1
a1.sinks = k1

# configure the source
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /data/log/access.log

# configure the channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# configure the sink
a1.sinks.k1.type = avro
a1.sinks.k1.hostname = 192.168.35.103
a1.sinks.k1.port = 45454

# wire the components together
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
```
  - Check HDFS

```
[root@bigdata04 hadoop-3.2.0]# bin/hdfs dfs -cat hdfs://192.168.35.100:9000/access/20220825/access.1661409230827.tmp
```
    A .tmp suffix means the file is still being written to; the suffix is removed once the agent is stopped and the file is closed.
  - Start and stop order

```
# start bigdata04 (the aggregator) first
[root@bigdata04 apache-flume-1.9.0-bin]# bin/flume-ng agent --name a1 --conf conf --conf-file conf/avro-to-hdfs.conf -Dflume.root.logger=INFO,console

# start bigdata03
[root@bigdata03 apache-flume-1.9.0-bin]# bin/flume-ng agent --name a1 --conf conf --conf-file conf/file-to-avro-102.conf -Dflume.root.logger=INFO,console
# start the log generator on bigdata03
[root@bigdata03 log]# sh -x generateAccessLog.sh

# start bigdata02
[root@bigdata02 conf]# bin/flume-ng agent --name a1 --conf conf --conf-file conf/file-to-avro-101.conf -Dflume.root.logger=INFO,console
# start the log generator on bigdata02
[root@bigdata02 log]# sh -x generateAccessLog.sh
```
    Note the start order: bigdata04 -> bigdata02 -> bigdata03 (the aggregator must be listening before the collectors connect).
    Stop order: bigdata02 -> bigdata03 -> bigdata04.
  - Summary
    Every Flume agent configuration boils down to three steps (see the skeleton after this list):
    - Name the components
    - Configure each component's parameters
    - Wire them together
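A bare-bones skeleton of those three steps; the `...` values are whatever your chosen source, channel, and sink require:

```
# 1. name the components
a1.sources = r1
a1.channels = c1
a1.sinks = k1

# 2. configure each component
a1.sources.r1.type = ...
a1.channels.c1.type = ...
a1.sinks.k1.type = ...

# 3. wire them together
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
```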
6.6 Advanced Flume Components
- Create the file to monitor
  The log mixes three event types (video_info, user_info, gift_record); the goal is to route each type into its own HDFS directory:

```
[root@bigdata04 conf]# cat /data/log/moreType.log
{"id":"14943445328940974601","uid":"840717325115457536","lat":"53.530598","lnt":"-2.5620373","hots":0,"title":"0","status":"1","topicId":"0","end_time":"1494344570","watch_num":0,"share_num":"1","replay_url":null,"replay_num":0,"start_time":"1494344544","timestamp":1494344571,"type":"video_info"}
{"uid":"861848974414839801","nickname":"mick","usign":"","sex":1,"birthday":"","face":"","big_face":"","email":"abc@qq.com","mobile":"","reg_type":"102","last_login_time":"1494344580","reg_time":"1494344580","last_update_time":"1494344580","status":"5","is_verified":"0","verified_info":"","is_seller":"0","level":1,"exp":0,"anchor_level":0,"anchor_exp":0,"os":"android","timestamp":1494344580,"type":"user_info"}
{"send_id":"834688818270961664","good_id":"223","video_id":"14943443045138661356","gold":"10","timestamp":1494344574,"type":"gift_record"}
```
- Source Interceptors (interceptor config file: file-to-hdfs-moreType.conf)

```
# the agent is named a1
# name the source, channel, and sink components
a1.sources = r1
a1.channels = c1
a1.sinks = k1

# configure the source
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /data/log/moreType.log

# configure the interceptors (multiple interceptors run in the order listed)
a1.sources.r1.interceptors = i1 i2 i3 i4
a1.sources.r1.interceptors.i1.type = search_replace
a1.sources.r1.interceptors.i1.searchPattern = "type":"video_info"
a1.sources.r1.interceptors.i1.replaceString = "type":"videoInfo"
a1.sources.r1.interceptors.i2.type = search_replace
a1.sources.r1.interceptors.i2.searchPattern = "type":"user_info"
a1.sources.r1.interceptors.i2.replaceString = "type":"userInfo"
a1.sources.r1.interceptors.i3.type = search_replace
a1.sources.r1.interceptors.i3.searchPattern = "type":"gift_record"
a1.sources.r1.interceptors.i3.replaceString = "type":"giftRecord"
# extract the type value into a header named logType
a1.sources.r1.interceptors.i4.type = regex_extractor
a1.sources.r1.interceptors.i4.regex = "type":"(\\w+)"
a1.sources.r1.interceptors.i4.serializers = s1
a1.sources.r1.interceptors.i4.serializers.s1.name = logType

# configure the channel
a1.channels.c1.type = file
a1.channels.c1.checkpointDir = /data/soft/apache-flume-1.9.0-bin/data/moreType/checkpoint
a1.channels.c1.dataDirs = /data/soft/apache-flume-1.9.0-bin/data/moreType/data

# configure the sink (the logType header picks the output directory)
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = hdfs://192.168.35.100:9000/moreType/%Y%m%d/%{logType}
a1.sinks.k1.hdfs.fileType = DataStream
a1.sinks.k1.hdfs.writeFormat = Text
a1.sinks.k1.hdfs.rollInterval = 3600
a1.sinks.k1.hdfs.rollSize = 134217728
a1.sinks.k1.hdfs.rollCount = 0
a1.sinks.k1.hdfs.useLocalTimeStamp = true
# add a file prefix and suffix
a1.sinks.k1.hdfs.filePrefix = data
a1.sinks.k1.hdfs.fileSuffix = .log

# wire the components together
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
```
- Channel Selectors (a sketch of both selector types follows this list)
  - Replicating Channel Selector: copies every event into all of the source's channels
  - Multiplexing Channel Selector: routes each event to a channel chosen by a header value
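A minimal sketch of the two selectors; the channels c1/c2 are illustrative, and the logType header reuses the one set by the regex_extractor interceptor above:

```
# replicating (the default): every event goes to both channels
a1.sources.r1.channels = c1 c2
a1.sources.r1.selector.type = replicating

# multiplexing (alternative): route by the value of the logType header
# a1.sources.r1.selector.type = multiplexing
# a1.sources.r1.selector.header = logType
# a1.sources.r1.selector.mapping.videoInfo = c1
# a1.sources.r1.selector.mapping.userInfo = c2
# a1.sources.r1.selector.default = c1
```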
- Sink Processors (a sketch follows this list)
  - Load balancing
  - Failover
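A minimal sink-group sketch covering both processors; sinks k1 and k2 are assumed to be defined elsewhere, and only one processor type applies per group:

```
# put two sinks in one group
a1.sinkgroups = g1
a1.sinkgroups.g1.sinks = k1 k2

# load balancing: rotate events across the sinks
a1.sinkgroups.g1.processor.type = load_balance
a1.sinkgroups.g1.processor.selector = round_robin

# failover (alternative): the highest-priority healthy sink gets all events
# a1.sinkgroups.g1.processor.type = failover
# a1.sinkgroups.g1.processor.priority.k1 = 10
# a1.sinkgroups.g1.processor.priority.k2 = 5
```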
6.7 Going Further
- Custom components
- Flume tuning
  - Adjust the Flume agent's JVM heap size (see the sketch below)
  - When running multiple agents on one server, adjust each agent's configuration so they write to separate log files
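A sketch of the heap adjustment in the flume-env.sh created earlier; the 1 GB figure is only an example:

```
# in conf/flume-env.sh: pin the agent's JVM heap at 1 GB
export JAVA_OPTS="-Xms1024m -Xmx1024m"
```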