Getting Started with Flume

Tonystark_sunshine

Last modified: 2022-09-13 20:52:18


First published: 2022-09-13 19:32:10

Article link: https://blog.csdn.net/Tonystark_lz/article/details/126839153

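Overview

Definition

Flume is a highly available, highly reliable, distributed system for collecting, aggregating, and transporting large volumes of log data. It was originally developed by Cloudera and is now an Apache project. Flume is built on a streaming data-flow architecture, which keeps it simple and flexible.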

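Architecture

Flume's basic unit is the Agent, a single JVM process that moves data from its origin to its destination in the form of events (Event).
An Agent has three main components: Source, Channel, and Sink.

Source: receives incoming data and puts it into the Channel
Sink: continuously polls the Channel for events, removes them in batches, and writes them to the destination
Channel: the buffer that sits between the Source and the Sink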
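
The hand-off between the three components can be pictured with a small Python sketch. It is only an illustration under simplifying assumptions: the name run_pipeline, the DONE sentinel, and the list standing in for the destination are all made up here, and real Flume channels are transactional, which this sketch does not model.

```python
import queue
import threading


def run_pipeline(lines, capacity=1000, batch_size=100):
    """Move lines from a source thread to a sink thread through a bounded buffer."""
    channel = queue.Queue(maxsize=capacity)  # plays the role of the Channel
    delivered = []                           # stands in for the destination
    done = object()                          # sentinel that marks the end of input

    def source():
        for line in lines:
            channel.put(line)                # blocks while the channel is full
        channel.put(done)

    def sink():
        finished = False
        while not finished:
            batch = [channel.get()]          # wait for at least one event
            while len(batch) < batch_size:   # then drain up to one batch
                try:
                    batch.append(channel.get_nowait())
                except queue.Empty:
                    break
            if batch[-1] is done:            # the sentinel is always the last item
                batch.pop()
                finished = True
            delivered.extend(batch)          # "write" the batch to the destination

    workers = [threading.Thread(target=source), threading.Thread(target=sink)]
    for worker in workers:
        worker.start()
    for worker in workers:
        worker.join()
    return delivered
```

Calling run_pipeline(["a", "b", "c"], capacity=2, batch_size=2) returns the three lines in order; the small capacity forces the source to wait for the sink, which is the back-pressure a full channel provides.
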

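Flume ships with two commonly used Channel types: Memory Channel and File Channel.

Memory Channel: keeps events in memory, so it is faster, but buffered events are lost if the process dies or the machine restarts
File Channel: persists events to disk, so data survives a crash, but throughput is lower than with the in-memory buffer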

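The Event is the basic unit of data transfer. It consists of a Header and a Body: the Header stores attributes of the event as key-value pairs, and the Body stores the data itself as a byte array.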
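
As a rough picture of that structure, the following Python sketch models an event as a header dictionary plus a bytes body. The class and the make_event helper are hypothetical names for illustration; Flume's own Event is a Java interface, and this is not its API.

```python
from dataclasses import dataclass, field


@dataclass
class Event:
    """Illustrative model of an event: key-value headers plus a byte-array body."""
    headers: dict = field(default_factory=dict)
    body: bytes = b""


def make_event(text, **headers):
    """Build an event from a string, encoding the body as UTF-8 bytes."""
    return Event(headers=dict(headers), body=text.encode("utf-8"))


event = make_event("hello", host="hadoop102")
print(event.headers, event.body)
```

The host header in the example is an arbitrary attribute chosen for the demonstration; an event sent through a plain netcat source arrives with empty headers.
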

Installation

Download links

(1) Flume website: http://flume.apache.org/
(2) User guide: http://flume.apache.org/FlumeUserGuide.html
(3) Downloads: http://archive.apache.org/dist/flume/

Installation and deployment

(1) Upload apache-flume-1.9.0-bin.tar.gz to the /opt/software directory on the Linux host
(2) Extract apache-flume-1.9.0-bin.tar.gz into the /opt/module/ directory

[atguigu@hadoop102 software]$ tar -zxvf /opt/software/apache-flume-1.9.0-bin.tar.gz -C /opt/module/

(3) Rename apache-flume-1.9.0-bin to flume

[atguigu@hadoop102 module]$ mv /opt/module/apache-flume-1.9.0-bin /opt/module/flume

(4) Delete guava-11.0.2.jar from the lib directory for compatibility with Hadoop 3.1.3

[atguigu@hadoop102 lib]$ rm /opt/module/flume/lib/guava-11.0.2.jar

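(5) Edit log4j.properties under conf to set where logs are written

# "console" sends the log output to the console as well
flume.root.logger=INFO,LOGFILE,console
# Fixed directory for the log files
flume.log.dir=/opt/module/flume/logs
# Name of the log file
flume.log.file=flume.log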

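Flume Getting-Started Cases

The key task is writing the configuration file. Refer to the official documentation: https://flume.apache.org/FlumeUserGuide.html (a Chinese translation can help if the English is hard to follow)

Case 1: Monitoring port data (official example)

1) Requirement:
Use Flume to listen on a port, collect the data sent to that port, and print it to the console.
2) Steps:
(1) Install the netcat tool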

[atguigu@hadoop102 software]$ sudo yum install -y nc

(2) Check whether port 6666 is already in use

[atguigu@hadoop102 flume]$ sudo netstat -nlp | grep 6666
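
The same check can be scripted. The sketch below is one possible approach, not part of the original steps: it tries to open a TCP connection and treats success as "something is listening". The function name port_in_use and the 127.0.0.1 default are assumptions made for the example.

```python
import socket


def port_in_use(port, host="127.0.0.1"):
    """Return True if something accepts TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
        probe.settimeout(1.0)
        # connect_ex returns 0 on success and an error code otherwise
        return probe.connect_ex((host, port)) == 0
```

For example, port_in_use(6666) should be False before the agent is started and True once it is listening on that port.
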

(3) Create the Flume Agent configuration file nc-flume-log.conf in the conf directory.

[atguigu@hadoop102 conf]$ vim nc-flume-log.conf

(4) Add the following content to nc-flume-log.conf:

# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 6666

# Describe the sink
a1.sinks.k1.type = logger

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
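
A common mistake in files like this is a source or sink bound to a channel that was never declared. Note that a source uses the plural key channels while a sink uses the singular channel. The following Python sketch checks that wiring; parse_properties and check_wiring are hypothetical helpers written for this example, and the parser only handles simple key = value lines.

```python
def parse_properties(text):
    """Parse simple 'key = value' lines, skipping blanks and # comments."""
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


def check_wiring(props, agent="a1"):
    """Return a list of wiring problems; an empty list means none were found."""
    declared = set(props.get(f"{agent}.channels", "").split())
    problems = []
    for source in props.get(f"{agent}.sources", "").split():
        bound = props.get(f"{agent}.sources.{source}.channels", "").split()
        if not bound:
            problems.append(f"source {source} is not bound to any channel")
        for name in bound:
            if name not in declared:
                problems.append(f"source {source} uses undeclared channel {name}")
    for sink in props.get(f"{agent}.sinks", "").split():
        name = props.get(f"{agent}.sinks.{sink}.channel")
        if name is None:
            problems.append(f"sink {sink} is not bound to a channel")
        elif name not in declared:
            problems.append(f"sink {sink} uses undeclared channel {name}")
    return problems


SAMPLE = """
a1.sources = r1
a1.sinks = k1
a1.channels = c1
a1.sources.r1.type = netcat
a1.sinks.k1.type = logger
a1.channels.c1.type = memory
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
"""

print(check_wiring(parse_properties(SAMPLE)))
```

For the sample above the list of problems is empty; removing the last line makes the check report that sink k1 is not bound to a channel.
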

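(5) Start Flume first so that it listens on the port

[atguigu@hadoop102 flume]$ bin/flume-ng agent -n a1 -c conf -f conf/nc-flume-log.conf

Parameters:
--conf/-c: the directory that holds Flume's configuration files, here conf/
--name/-n: the name of the agent, a1; it must match the a1 prefix used in the configuration file
--conf-file/-f: the configuration file read for this run, here nc-flume-log.conf in the conf directory
(6) Use netcat to send data to port 6666 on the local machine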

[atguigu@hadoop102 ~]$ nc localhost 6666
hello
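
If netcat is not available, a few lines of Python can send the same test line. This is a sketch under assumptions: send_line is a made-up helper, and it relies on the netcat source's default behaviour of answering each received line with OK. If that acknowledgement has been turned off in the source configuration, the recv call will time out.

```python
import socket


def send_line(text, host="localhost", port=6666):
    """Send one newline-terminated line and return the server's reply."""
    with socket.create_connection((host, port), timeout=5) as conn:
        conn.sendall(text.encode("utf-8") + b"\n")
        # Read the acknowledgement the server sends back for the line
        return conn.recv(1024).decode("utf-8").strip()
```

With the agent from step (5) running, send_line("hello") sends the same data as the nc session above.
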

(7) Watch the console of the running Flume agent for the received data

2022-09-13 19:16:32,418 (SinkRunner-PollingRunner-DefaultSinkProcessor) [INFO - org.apache.flume.sink.LoggerSink.process(LoggerSink.java:95)]
 Event: { headers:{} body: 68 65 6C 6C 6F                                  hello }

(8) How the source code prints the event
The process method of LoggerSink:

if (event != null) {
    if (logger.isInfoEnabled()) {
        logger.info("Event: " + EventHelper.dumpEvent(event, maxBytesToLog));
    }
}

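Return value of the dumpEvent method: buffer is a fixed-width string that starts with the hexadecimal ASCII codes of the body's bytes, followed by the characters themselves.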

return "{ headers:" + event.getHeaders() + " body:" + buffer + " }";
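
The layout of that line can be imitated in a few lines of Python. This is an approximation for illustration, not a port of Flume's EventHelper: the helper names are made up, the 16-byte limit is an assumed default for maxBytesToLog, and the exact column spacing of the real output may differ.

```python
def dump_body(body, max_bytes=16):
    """Format up to max_bytes of body as padded upper-case hex plus printable text."""
    shown = body[:max_bytes]
    hex_part = " ".join(f"{byte:02X}" for byte in shown)
    # Replace non-printable bytes with a dot, as hex dumps usually do
    text_part = "".join(chr(byte) if 32 <= byte < 127 else "." for byte in shown)
    # Pad the hex column so the text column always starts at the same offset
    return f"{hex_part:<{max_bytes * 3 - 1}} {text_part}"


def dump_event(headers, body, max_bytes=16):
    """Assemble a line shaped like the logger sink's 'Event: {...}' output."""
    return "{ headers:" + str(headers) + " body: " + dump_body(body, max_bytes) + " }"


print("Event: " + dump_event({}, b"hello"))
```

For the body b"hello" the hex column begins with 68 65 6C 6C 6F, the same byte values shown in the log line of step (7).
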

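Case 2: Monitoring multiple appended files in a directory in real time

Taildir Source is suited to tailing multiple files that are appended to in real time, and it can resume from where it left off after a restart.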
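
The idea behind resuming is to persist, for each file, how far it has been read. The Python sketch below shows that idea in a simplified form; it is not Taildir's implementation. The function name and the path-to-offset JSON layout are assumptions for the example, and the real position file also records details such as the file's inode.

```python
import json
import os


def read_new_lines(path, position_file):
    """Return complete lines appended to path since the last call, saving the offset."""
    positions = {}
    if os.path.exists(position_file):
        with open(position_file) as handle:
            positions = json.load(handle)
    offset = positions.get(path, 0)

    with open(path, "rb") as handle:
        handle.seek(offset)              # skip everything already consumed
        data = handle.read()

    end = data.rfind(b"\n") + 1          # only consume lines that are complete
    lines = data[:end].decode("utf-8").splitlines()

    positions[path] = offset + end       # remember where to resume next time
    with open(position_file, "w") as handle:
        json.dump(positions, handle)
    return lines
```

Because the offset lives in a file rather than in memory, a process that is restarted picks up at the saved position instead of re-reading or skipping data.
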
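1) Requirement: use Flume to tail the files being appended to in a directory and upload the data to HDFS
2) Steps
(1) Create the configuration file dir-hdfs.conf in the conf directory
① Create the file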

[atguigu@hadoop102 conf]$ vim dir-hdfs.conf

② Add the following content

# Taildir source: tail every file in the watched directory
a1.sources = r1
a1.sources.r1.type = TAILDIR
a1.sources.r1.channels = c1
a1.sources.r1.positionFile = /opt/module/flume/logs/flume/taildir_position.json
a1.sources.r1.filegroups = f1
a1.sources.r1.filegroups.f1 = /opt/module/flume/logs/test1/.*

# Memory channel
a1.channels = c1
a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000
a1.channels.c1.transactionCapacity = 10000

# HDFS sink
a1.sinks = k1
a1.sinks.k1.type = hdfs
a1.sinks.k1.channel = c1
a1.sinks.k1.hdfs.path = /flume/events/%Y-%m-%d/%H%M
# The time escapes in hdfs.path need a timestamp. Taildir events carry no
# "timestamp" header, so use the agent's local time instead.
a1.sinks.k1.hdfs.useLocalTimeStamp = true
# Roll to a new file every 30 seconds or at 128 MB, never by event count
a1.sinks.k1.hdfs.rollInterval = 30
a1.sinks.k1.hdfs.rollSize = 134217728
a1.sinks.k1.hdfs.rollCount = 0
a1.sinks.k1.hdfs.fileType = DataStream
a1.sinks.k1.hdfs.filePrefix = events-
# Round the timestamp down to the nearest 10 minutes before expanding the path
a1.sinks.k1.hdfs.round = true
a1.sinks.k1.hdfs.roundValue = 10
a1.sinks.k1.hdfs.roundUnit = minute
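
To see what directory the path pattern and the three round settings produce, the following Python sketch rounds a timestamp down to a multiple of 10 minutes and then expands the escapes. It relies on %Y, %m, %d, %H and %M meaning the same in strftime as in the pattern above; resolve_path is a hypothetical helper, and it only handles minute rounding.

```python
from datetime import datetime


def resolve_path(pattern, timestamp, round_value=10):
    """Round timestamp down to a multiple of round_value minutes, then expand escapes."""
    rounded = timestamp.replace(
        minute=timestamp.minute - timestamp.minute % round_value,
        second=0,
        microsecond=0,
    )
    return rounded.strftime(pattern)


path = resolve_path("/flume/events/%Y-%m-%d/%H%M", datetime(2022, 9, 13, 19, 16, 32))
print(path)  # /flume/events/2022-09-13/1910
```

An event stamped 19:16:32 therefore lands in the 1910 directory, and a new directory starts every ten minutes.
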

(2) Append content to files in the /opt/module/flume/logs/test1/ directory
(3) Start the agent that monitors the directory

[atguigu@hadoop102 flume]$ bin/flume-ng agent -n a1 -c conf -f conf/dir-hdfs.conf

(4) Check the data on HDFS