6.1 Concepts
Flume is a highly available, highly reliable, distributed system for collecting, aggregating, and moving massive volumes of log data.
In one sentence: a dependable, convenient log collection tool.
Flume leans toward pure data transport; that emphasis is the main difference between Flume and Logstash.
6.2 Features
- A simple, flexible stream-based architecture
- Built-in load balancing and failover mechanisms
6.3 Use Cases
- Basic scenario
  WebServer: the web application that produces the logs.
  Agent: the Flume agent, a long-running service that continuously moves data; the basic unit of data transferred inside an agent is the Event.
  An Agent is made up of three components, Source, Channel, and Sink; these are the three core components of Flume.
  An Event is the basic unit of a message in Flume; it consists of zero or more headers plus a body.
- Advanced scenarios (for example, the multi-agent website-log pipeline in section 6.5)
6.4 Core Components
- Source
  A Source tells Flume where to read data from; whatever it reads is handed to the channel behind it.
  Common Source types:
  - Exec Source
    Tails a file for newly appended content (via a command such as tail -F).
  - NetCat TCP/UDP Source
    Listens on a given port (TCP or UDP) and reads every line of data that arrives on it.
  - Spooling Directory Source
    Picks up files newly dropped into a watched directory.
  - Kafka Source
    Reads data from a Kafka message queue (see the sketch right after this list).
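A minimal Kafka Source sketch; the broker address, topic, and consumer group below are placeholder values, not part of this chapter's cluster:

```
# read events from a Kafka topic into channel c1 (placeholder broker/topic/group)
a1.sources.r1.type = org.apache.flume.source.kafka.KafkaSource
a1.sources.r1.kafka.bootstrap.servers = kafka01:9092
a1.sources.r1.kafka.topics = access_log
a1.sources.r1.kafka.consumer.group.id = flume-group
a1.sources.r1.channels = c1
```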
- Channel
  A Channel is a pipe that temporarily buffers data between the source and the sink.
  Common Channel types:
  - Memory Channel
    Pros: fast, since no disk I/O is involved.
    Cons:
    - Data can be lost: if the Flume agent dies, whatever is still in the channel is gone.
    - Memory is finite, so the channel can run out of room.
  - File Channel
    Pros: data survives an agent crash.
    Cons: somewhat slower than the memory channel because of disk I/O, though not nearly as slow as you might expect, which is why it is also a commonly used channel.
  - Spillable Memory Channel
    Pros: solves the memory channel's capacity problem by spilling overflow to disk (see the sketch right after this list).
    Cons: the in-memory portion can still be lost if the agent dies.
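A minimal Spillable Memory Channel sketch; the capacities are illustrative, and the checkpoint/data directories below are assumptions reusing this chapter's install path:

```
# hold up to memoryCapacity events in RAM; spill the overflow to disk
a1.channels.c1.type = SPILLABLEMEMORY
a1.channels.c1.memoryCapacity = 10000
a1.channels.c1.overflowCapacity = 1000000
a1.channels.c1.checkpointDir = /data/soft/apache-flume-1.9.0-bin/data/spill/checkpoint
a1.channels.c1.dataDirs = /data/soft/apache-flume-1.9.0-bin/data/spill/data
```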
- Sink
  A Sink takes data out of the Channel and writes it to the destination.
  Common Sink types:
  - Logger Sink
    Treats the data as log output, either printed to the console or written to a file; mainly used for testing.
  - HDFS Sink
    Writes the data to HDFS. Very common, mainly for offline (batch) computing scenarios.
  - Kafka Sink
    Sends the data to a Kafka message queue (see the sketch right after this list). Also very common, mainly for real-time computing: the data never lands on disk but is shipped in real time and consumed directly by a stream-processing framework.
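A minimal Kafka Sink sketch; the broker address and topic are placeholder values:

```
# publish events from channel c1 to a Kafka topic (placeholder broker/topic)
a1.sinks.k1.type = org.apache.flume.sink.kafka.KafkaSink
a1.sinks.k1.kafka.bootstrap.servers = kafka01:9092
a1.sinks.k1.kafka.topic = access_log
a1.sinks.k1.channel = c1
```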
6.5 Quick Start
- Getting-started example
  - Requirements
    Source: NetCat TCP/UDP Source
    Channel: Memory Channel
    Sink: Logger Sink
  - Write the configuration file

```
[root@bigdata04 ~]# cat /data/soft/apache-flume-1.9.0-bin/conf/example.conf
# example.conf: A single-node Flume configuration

# Name the components on this agent
# a1 is the agent name; give the source, sink, and channel their own names
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source
# source type netcat: listen on a port and read whatever arrives
a1.sources.r1.type = netcat
a1.sources.r1.bind = 0.0.0.0
a1.sources.r1.port = 44444

# Describe the sink
# sink type logger
a1.sinks.k1.type = logger

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
```
  - Start Flume

```
[root@bigdata04 apache-flume-1.9.0-bin]# bin/flume-ng agent --name a1 --conf conf --conf-file conf/example.conf -Dflume.root.logger=INFO,console
```

    The agent subcommand starts a Flume agent.
    --name: the agent's name (a1 here)
    --conf: Flume's configuration root directory
    --conf-file: the agent's own configuration file (the one containing the source, channel, and sink definitions)
    -D: passes extra JVM system properties; here it sets Flume's log level (INFO) and log destination (console)
  - Send a message
    If telnet is not installed yet, install it first: yum install -y telnet

```
[root@bigdata04 ~]# telnet 192.168.35.103 44444
Trying 192.168.35.103...
Connected to 192.168.35.103.
Escape character is '^]'.
hello big data04
OK
```
  - Console output

```
[INFO - org.apache.flume.sink.LoggerSink.process(LoggerSink.java:95)] Event: { headers:{} body: 68 65 6C 6C 6F 20 62 69 67 20 64 61 74 61 30 34 hello big data04 }
```
- Collecting file contents into HDFS
  - Requirements
    Source: Spooling Directory Source
    Channel: File Channel
    Sink: HDFS Sink
  - Set up the client node
    Configure bigdata04 as a Hadoop client node so it can reach HDFS:

```
# copy the Hadoop distribution over
[root@bigdata01 soft]# scp -rq hadoop-3.2.0 192.168.35.103:/data/soft/

# add these settings to /etc/profile on bigdata04 (or simply copy bigdata01's profile entries over)
export HADOOP_HOME=/data/soft/hadoop-3.2.0
export PATH=.:$JAVA_HOME/bin:$HADOOP_HOME/bin:$PATH
```

    Remember to reload the profile so the settings take effect:

```
source /etc/profile
```
  - Write the configuration file

```
[root@bigdata04 conf]# cat file-to-hdfs.conf
# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source
a1.sources.r1.type = spooldir
a1.sources.r1.spoolDir = /data/log/studentDir

# Use a channel which buffers events on disk
a1.channels.c1.type = file
a1.channels.c1.checkpointDir = /data/soft/apache-flume-1.9.0-bin/data/studentDir/checkpoint
a1.channels.c1.dataDirs = /data/soft/apache-flume-1.9.0-bin/data/studentDir/data

# Describe the sink
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = hdfs://192.168.35.100:9000/flume/studentDir
a1.sinks.k1.hdfs.filePrefix = stu-
a1.sinks.k1.hdfs.fileSuffix = .log
a1.sinks.k1.hdfs.fileType = DataStream
a1.sinks.k1.hdfs.writeFormat = Text
a1.sinks.k1.hdfs.rollInterval = 30
a1.sinks.k1.hdfs.rollSize = 134217728
a1.sinks.k1.hdfs.rollCount = 0

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
```
  - Start the Hadoop cluster

```
[root@bigdata01 sbin]# start-all.sh
[root@bigdata01 sbin]# pwd
/data/soft/hadoop-3.2.0/sbin
```
  - Add files
    Drop new files into /data/log/studentDir:

```
[root@bigdata04 hadoop-3.2.0]# ll /data/log/studentDir/
total 36
-rw-r--r--. 1 root root 76 Aug 23 16:08 class1.dat.COMPLETED
-rw-r--r--. 1 root root 76 Aug 23 16:13 class2.dat.COMPLETED
-rw-r--r--. 1 root root 13 Aug 25 10:24 LOL1111.txt.COMPLETED
-rw-r--r--. 1 root root 16 Aug 25 10:55 LOL222.dat.COMPLETED
-rw-r--r--. 1 root root 24 Aug 23 16:27 LOL666.dat.COMPLETED
-rw-r--r--. 1 root root 19 Aug 23 16:37 LOL9999.dat.COMPLETED
-rw-r--r--. 1 root root 27 Aug 23 16:30 LOL999.dat.COMPLETED
-rw-r--r--. 1 root root 10 Aug 23 16:14 LOL.dat.COMPLETED
-rw-r--r--. 1 root root 22 Aug 23 16:45 LOLjjjjj.dat.COMPLETED
```
    Note that once Flume finishes reading a file it renames it with a .COMPLETED suffix.
  - Start Flume

```
[root@bigdata04 apache-flume-1.9.0-bin]# bin/flume-ng agent --name a1 --conf conf --conf-file conf/file-to-hdfs.conf -Dflume.root.logger=INFO,console
```
  - Check the file in HDFS

```
[root@bigdata04 hadoop-3.2.0]# bin/hdfs dfs -cat hdfs://192.168.35.100:9000/flume/studentDir/stu-.1661396200534
hello LOL!!!
hello github!!!
```
  - Run in the background with nohup &

```
# 1. start the agent with nohup
[root@bigdata04 apache-flume-1.9.0-bin]# nohup bin/flume-ng agent --name a1 --conf conf --conf-file conf/file-to-hdfs.conf -Dflume.root.logger=INFO,console &

# 2. create a new source file
[root@bigdata04 hadoop-3.2.0]# vi /data/log/studentDir/1111.dat
11111

# 3. check HDFS
[root@bigdata04 hadoop-3.2.0]# bin/hdfs dfs -cat hdfs://192.168.35.100:9000/flume/studentDir/stu-.1661396803549
11111

# 4. check the nohup log
[root@bigdata04 apache-flume-1.9.0-bin]# cat nohup.out

# stop the background agent
[root@bigdata04 apache-flume-1.9.0-bin]# jps -ml
2740 org.apache.flume.node.Application --name a1 --conf-file conf/file-to-hdfs.conf
2966 sun.tools.jps.Jps -ml
[root@bigdata04 apache-flume-1.9.0-bin]# kill -9 2740
[root@bigdata04 apache-flume-1.9.0-bin]# jps -ml
2976 sun.tools.jps.Jps -ml
[1]+  Killed                  nohup bin/flume-ng agent --name a1 --conf conf --conf-file conf/file-to-hdfs.conf -Dflume.root.logger=INFO,console
```
    jps -ml prints the fully qualified main class and its arguments for each JVM process, which makes it easy to spot the right agent to kill.
- Collecting website logs into HDFS
  - Requirements
    bigdata02 and bigdata03 each generate an access log and run a Flume agent that forwards it over Avro to an aggregator agent on bigdata04, which writes everything to HDFS:
    bigdata02: 192.168.35.101
    bigdata03: 192.168.35.102
    bigdata04: 192.168.35.103
  - Install Flume on each node

```
# install Flume on bigdata02
[root@bigdata02 soft]# tar -zxvf apache-flume-1.9.0-bin.tar.gz
[root@bigdata02 soft]# cd apache-flume-1.9.0-bin/conf
[root@bigdata02 conf]# mv flume-env.sh.template flume-env.sh

# install Flume on bigdata03
[root@bigdata03 soft]# tar -zxvf apache-flume-1.9.0-bin.tar.gz
[root@bigdata03 soft]# cd apache-flume-1.9.0-bin/conf
[root@bigdata03 conf]# mv flume-env.sh.template flume-env.sh
```
  - Write a script that simulates a live log file, access.log

```
[root@bigdata02 log]# pwd
/data/log
[root@bigdata02 log]# vi generateAccessLog.sh
#!/bin/bash
# append a line to the log file once per second, forever
while [ "1" = "1" ]
do
    # current unix timestamp
    curr_time=`date +%s`
    # current hostname
    name=`hostname`
    echo ${name}_${curr_time} >> /data/log/access.log
    # pause for one second
    sleep 1
done

# ship the same script to bigdata03
[root@bigdata02 log]# scp -rq generateAccessLog.sh 192.168.35.102:/data/log
```
  - Write the configuration files

```
# aggregator config on bigdata04
[root@bigdata04 conf]# cat avro-to-hdfs.conf
# the agent is named a1
# name the source, channel, and sink components
a1.sources = r1
a1.channels = c1
a1.sinks = k1

# configure the source
a1.sources.r1.type = avro
a1.sources.r1.bind = 0.0.0.0
a1.sources.r1.port = 45454

# configure the channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# configure the sink
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = hdfs://192.168.35.100:9000/access/%Y%m%d
a1.sinks.k1.hdfs.filePrefix = access
a1.sinks.k1.hdfs.fileType = DataStream
a1.sinks.k1.hdfs.writeFormat = Text
a1.sinks.k1.hdfs.rollInterval = 3600
a1.sinks.k1.hdfs.rollSize = 134217728
a1.sinks.k1.hdfs.rollCount = 0
a1.sinks.k1.hdfs.useLocalTimeStamp = true

# wire the components together
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
```

```
# collector config on bigdata02
[root@bigdata02 log]# cat /data/soft/apache-flume-1.9.0-bin/conf/file-to-avro-101.conf
# the agent is named a1
# name the source, channel, and sink components
a1.sources = r1
a1.channels = c1
a1.sinks = k1

# configure the source
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /data/log/access.log

# configure the channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# configure the sink
a1.sinks.k1.type = avro
a1.sinks.k1.hostname = 192.168.35.103
a1.sinks.k1.port = 45454

# wire the components together
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
```

```
# collector config on bigdata03
[root@bigdata03 log]# cat /data/soft/apache-flume-1.9.0-bin/conf/file-to-avro-102.conf
# the agent is named a1
# name the source, channel, and sink components
a1.sources = r1
a1.channels = c1
a1.sinks = k1

# configure the source
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /data/log/access.log

# configure the channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# configure the sink
a1.sinks.k1.type = avro
a1.sinks.k1.hostname = 192.168.35.103
a1.sinks.k1.port = 45454

# wire the components together
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
```
  - Check HDFS

```
[root@bigdata04 hadoop-3.2.0]# bin/hdfs dfs -cat hdfs://192.168.35.100:9000/access/20220825/access.1661409230827.tmp
```
    A .tmp suffix means the file is still being written to; the suffix is removed once the agent is stopped and the file is closed.
  - Start and stop order

```
# start bigdata04 (the aggregator) first
[root@bigdata04 apache-flume-1.9.0-bin]# bin/flume-ng agent --name a1 --conf conf --conf-file conf/avro-to-hdfs.conf -Dflume.root.logger=INFO,console

# start bigdata03
[root@bigdata03 apache-flume-1.9.0-bin]# bin/flume-ng agent --name a1 --conf conf --conf-file conf/file-to-avro-102.conf -Dflume.root.logger=INFO,console
# start the log generator on bigdata03
[root@bigdata03 log]# sh -x generateAccessLog.sh

# start bigdata02
[root@bigdata02 conf]# bin/flume-ng agent --name a1 --conf conf --conf-file conf/file-to-avro-101.conf -Dflume.root.logger=INFO,console
# start the log generator on bigdata02
[root@bigdata02 log]# sh -x generateAccessLog.sh
```
    Note the start order: bigdata04 -> bigdata02 -> bigdata03 (the aggregator must be listening before the collectors connect).
    Stop order: bigdata02 -> bigdata03 -> bigdata04.
  - Summary
    Every Flume agent configuration boils down to three steps (see the skeleton after this list):
    - Name the components
    - Configure each component's parameters
    - Wire them together
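A bare-bones skeleton of those three steps; the `...` values are whatever your chosen source, channel, and sink require:

```
# 1. name the components
a1.sources = r1
a1.channels = c1
a1.sinks = k1

# 2. configure each component
a1.sources.r1.type = ...
a1.channels.c1.type = ...
a1.sinks.k1.type = ...

# 3. wire them together
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
```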
6.6 Advanced Flume Components
- Create the file to monitor
  The log mixes three event types (video_info, user_info, gift_record); the goal is to route each type into its own HDFS directory:

```
[root@bigdata04 conf]# cat /data/log/moreType.log
{"id":"14943445328940974601","uid":"840717325115457536","lat":"53.530598","lnt":"-2.5620373","hots":0,"title":"0","status":"1","topicId":"0","end_time":"1494344570","watch_num":0,"share_num":"1","replay_url":null,"replay_num":0,"start_time":"1494344544","timestamp":1494344571,"type":"video_info"}
{"uid":"861848974414839801","nickname":"mick","usign":"","sex":1,"birthday":"","face":"","big_face":"","email":"abc@qq.com","mobile":"","reg_type":"102","last_login_time":"1494344580","reg_time":"1494344580","last_update_time":"1494344580","status":"5","is_verified":"0","verified_info":"","is_seller":"0","level":1,"exp":0,"anchor_level":0,"anchor_exp":0,"os":"android","timestamp":1494344580,"type":"user_info"}
{"send_id":"834688818270961664","good_id":"223","video_id":"14943443045138661356","gold":"10","timestamp":1494344574,"type":"gift_record"}
```
- Source Interceptors (interceptor config file: file-to-hdfs-moreType.conf)

```
# the agent is named a1
# name the source, channel, and sink components
a1.sources = r1
a1.channels = c1
a1.sinks = k1

# configure the source
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /data/log/moreType.log

# configure the interceptors (multiple interceptors run in the order listed)
a1.sources.r1.interceptors = i1 i2 i3 i4
a1.sources.r1.interceptors.i1.type = search_replace
a1.sources.r1.interceptors.i1.searchPattern = "type":"video_info"
a1.sources.r1.interceptors.i1.replaceString = "type":"videoInfo"
a1.sources.r1.interceptors.i2.type = search_replace
a1.sources.r1.interceptors.i2.searchPattern = "type":"user_info"
a1.sources.r1.interceptors.i2.replaceString = "type":"userInfo"
a1.sources.r1.interceptors.i3.type = search_replace
a1.sources.r1.interceptors.i3.searchPattern = "type":"gift_record"
a1.sources.r1.interceptors.i3.replaceString = "type":"giftRecord"
# extract the type value into a header named logType
a1.sources.r1.interceptors.i4.type = regex_extractor
a1.sources.r1.interceptors.i4.regex = "type":"(\\w+)"
a1.sources.r1.interceptors.i4.serializers = s1
a1.sources.r1.interceptors.i4.serializers.s1.name = logType

# configure the channel
a1.channels.c1.type = file
a1.channels.c1.checkpointDir = /data/soft/apache-flume-1.9.0-bin/data/moreType/checkpoint
a1.channels.c1.dataDirs = /data/soft/apache-flume-1.9.0-bin/data/moreType/data

# configure the sink (the logType header picks the output directory)
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = hdfs://192.168.35.100:9000/moreType/%Y%m%d/%{logType}
a1.sinks.k1.hdfs.fileType = DataStream
a1.sinks.k1.hdfs.writeFormat = Text
a1.sinks.k1.hdfs.rollInterval = 3600
a1.sinks.k1.hdfs.rollSize = 134217728
a1.sinks.k1.hdfs.rollCount = 0
a1.sinks.k1.hdfs.useLocalTimeStamp = true
# add a file prefix and suffix
a1.sinks.k1.hdfs.filePrefix = data
a1.sinks.k1.hdfs.fileSuffix = .log

# wire the components together
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
```
- Channel Selectors (a sketch of both selector types follows this list)
  - Replicating Channel Selector: copies every event into all of the source's channels
  - Multiplexing Channel Selector: routes each event to a channel chosen by a header value
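A minimal sketch of the two selectors; the channels c1/c2 are illustrative, and the logType header reuses the one set by the regex_extractor interceptor above:

```
# replicating (the default): every event goes to both channels
a1.sources.r1.channels = c1 c2
a1.sources.r1.selector.type = replicating

# multiplexing (alternative): route by the value of the logType header
# a1.sources.r1.selector.type = multiplexing
# a1.sources.r1.selector.header = logType
# a1.sources.r1.selector.mapping.videoInfo = c1
# a1.sources.r1.selector.mapping.userInfo = c2
# a1.sources.r1.selector.default = c1
```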
- Sink Processors (a sketch follows this list)
  - Load balancing
  - Failover
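A minimal sink-group sketch covering both processors; sinks k1 and k2 are assumed to be defined elsewhere, and only one processor type applies per group:

```
# put two sinks in one group
a1.sinkgroups = g1
a1.sinkgroups.g1.sinks = k1 k2

# load balancing: rotate events across the sinks
a1.sinkgroups.g1.processor.type = load_balance
a1.sinkgroups.g1.processor.selector = round_robin

# failover (alternative): the highest-priority healthy sink gets all events
# a1.sinkgroups.g1.processor.type = failover
# a1.sinkgroups.g1.processor.priority.k1 = 10
# a1.sinkgroups.g1.processor.priority.k2 = 5
```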
6.7 Going Further
- Custom components
- Flume tuning
  - Adjust the Flume agent's JVM heap size (see the sketch below)
  - When running multiple agents on one server, adjust each agent's configuration so they write to separate log files
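A sketch of the heap adjustment in the flume-env.sh created earlier; the 1 GB figure is only an example:

```
# in conf/flume-env.sh: pin the agent's JVM heap at 1 GB
export JAVA_OPTS="-Xms1024m -Xmx1024m"
```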