Table of Contents
(1) Download flume-ng-1.6.0-cdh5.7.0.tar.gz
(4) Configure the Java JDK path in flume-env.sh
(5) Write a startup script for Flume (recommended for production)
(7) Check for the jq package and install the jq tool to make JSON output easier to read
(3) Write a startup script for the agent (recommended for production)
Hands-on 03: Stream a local file to HDFS in real time (the Flume node needs a configured Hadoop cluster environment)
Hands-on 04: Collect data from server A to server B and upload it to HDFS (server B needs a configured Hadoop cluster environment)
Hands-on 05: Aggregate data from multiple Flume agents into a single Flume agent (the aggregation node needs a configured Hadoop cluster environment)
First things first, the application scenario:
In the big-data stack, Flume plays the role of data collector; the collected data is then handed to a compute framework for processing. Flume is a highly available, highly reliable, distributed system from Cloudera for collecting, aggregating, and transporting massive volumes of log data. It supports pluggable data senders within a logging pipeline for collecting data, and it can perform simple processing on the data before writing it to any of a variety of (customizable) receivers.
As Hadoop's log-collection tool, Flume is very handy. When installing it, though, you will find all sorts of conflicting advice: some sources say installation is trivial (just extract and run), while others say it is complicated and depends on ZooKeeper, which must be installed first. Why the contradiction? Both are right; they are simply describing different versions of Flume.
Background: Flume is a distributed log-collection system developed by Cloudera and one of the components in the Hadoop ecosystem. It can collect logs scattered across different nodes and machines into HDFS in real time. Flume's initial releases are collectively known as Flume OG (original generation) and belonged to Cloudera. As Flume's features grew, the drawbacks of Flume OG surfaced: a bloated code base, poorly designed core components, and non-standard core configuration. In the last OG release, 0.94.0, unstable log transport was especially severe, as documented in the troubleshooting section of the BigInsights product docs. To fix these problems, on October 22, 2011 Cloudera completed Flume-728, a milestone overhaul of Flume that refactored the core components, core configuration, and code architecture; the refactored versions are collectively known as Flume NG (next generation). Another reason for the change was moving Flume under the Apache umbrella: Cloudera Flume was renamed Apache Flume.
Flume OG has three node roles: agent, collector, and master.
Flume NG has a single node role: the agent.
Flume OG vs Flume NG:
- In OG, Flume's stability depends on ZooKeeper, which it needs to manage its multiple node types (agent, collector, master), especially when the cluster runs multiple masters. OG can also manage node configuration in memory, but then you must tolerate losing that configuration when a machine fails. So stable use of OG depends on ZooKeeper.
- In NG, the number of node roles shrinks from three to one. With only one role there is nothing for ZooKeeper to coordinate, so the ZooKeeper dependency disappears. Because OG's reliance on ZooKeeper permeated its entire configuration and operation, users also had to know how to build and operate a ZooKeeper cluster.
- Installing OG: set $JAVA_HOME in flume-env.sh and provide flume-conf.xml, whose most important, mandatory settings concern the master. Every Flume node in the cluster needs the master-related properties (flume.master.servers, flume.master.store, flume.master.serverid). For stable operation you also need a ZooKeeper cluster, which requires a fairly deep understanding of ZooKeeper, plus the related flume-conf.xml properties (flume.master.zk.use.external, flume.master.zk.servers). Before transporting data with OG, you must start the master and the agents.
- Installing NG: just set $JAVA_HOME in flume-env.sh.
So when we use Flume today, we generally use Flume NG.
I. Flume Architecture and Core Components
1. The concept of an Event
At its core, Flume collects data from a data source (source) and delivers it to a specified destination (sink). To guarantee delivery, the data is buffered (channel) before reaching the sink; only after the data has actually arrived at the sink does Flume delete its buffered copy. What flows through this pipeline is the event, and transactional guarantees are made at the event level.
So what is an event?
An event wraps the data being transported and is Flume's basic unit of transfer; for a text file it is usually one line of text. The event is also the basic unit of a transaction. An event flows from source to channel to sink. It is itself a byte array and can carry headers; it represents the smallest complete unit of data, arriving from an external source and heading to an external destination.
2. Flume architecture
What makes Flume so effective is one design element: the agent. An agent is a Java process that runs on a log-collection node (that is, a server node).
An agent contains three core components, source➡️channel➡️sink, analogous to a producer, a warehouse, and a consumer.
- source: the component that collects data. It can handle log data of all types and formats, including avro, thrift, exec, jms, spooling directory, netcat, sequence generator, syslog, http, legacy, and custom sources.
- channel: once the source has collected data, it is staged in the channel; the channel is the agent's temporary data store (a simple buffer for collected data that can live in memory, jdbc, file, and so on).
- sink: the component that ships data to its destination, which can be hdfs, logger, avro, thrift, ipc, file, null, hbase, solr, or a custom sink.
3. How Flume works
The heart of Flume is the agent, which interacts with the outside world in two places: the input that receives data (source) and the output (sink), which ships data to an external destination. When the source receives data it passes it to the channel, which buffers it temporarily; the sink then delivers the channel's data to the specified destination (such as HDFS).
Note: only after the sink has successfully delivered the channel's data does the channel delete its temporary copy. This mechanism is what makes the transfer reliable and safe.
4. Flume in the broader sense
Flume supports multi-stage agents: agents can be chained, with one agent's sink writing to the next agent's source, so stages can be strung together and processed as a whole. Flume also supports fan-in and fan-out: fan-in means a source can accept multiple inputs, and fan-out means a sink can write data to multiple destinations.
Notably, Flume ships with a large number of built-in source, channel, and sink types, and the different types can be combined freely. The combinations are driven by user-written configuration files, which makes Flume very flexible.
For example: a channel can stage events in memory or persist them to local disk, and a sink can write logs to HDFS, HBase, or even another source.
Flume lets users build multi-stage flows: multiple agents can work together, with support for fan-in, fan-out, contextual routing, and backup routes, as shown in the figure below 👇:
II. Setting Up Flume
1. Prerequisites
- Flume requires Java 1.7 or later (1.8 recommended)
- Sufficient memory and disk space
- Read/write permissions on the directories the agent monitors
2. Installation
(1) Download flume-ng-1.6.0-cdh5.7.0.tar.gz
Download option 01: https://download.csdn.net/download/weixin_42018518/12314171. Flume-ng-1.6.0-cdh.zip bundles three archives: flume-ng-1.6.0-cdh5.5.0.tar.gz, flume-ng-1.6.0-cdh5.7.0.tar.gz, and flume-ng-1.6.0-cdh5.10.1.tar.gz. Pick the version you need; here we use cdh5.7.0.
Download option 02: wget http://mirrors.tuna.tsinghua.edu.cn/apache/flume/1.9.0/apache-flume-1.9.0-bin.tar.gz
(2) Upload to the server and extract
[root@yz-sre-backup019 ~]# cd
[root@yz-sre-backup019 ~]# mkdir apps
[root@yz-sre-backup019 ~]# cd /data/soft
[root@yz-sre-backup019 soft]# tar -zxvf flume-ng-1.6.0-cdh5.7.0.tar.gz -C /root/apps/
(3) Configure environment variables
[root@yz-sre-backup019 apps]# vim ~/.bash_profile
# Flume install path; adjust to wherever you installed it
export FLUME_HOME=/root/apps/apache-flume-1.6.0-cdh5.7.0-bin
export PATH=$FLUME_HOME/bin:$PATH
# Apply the changes
[root@yz-sre-backup019 apps]# source ~/.bash_profile
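As a sanity check, the effect of this PATH change can be simulated with a throwaway directory: a fake `flume-ng` script stands in for the real binary (which lives under `$FLUME_HOME/bin` on a real node), and `command -v` confirms that shell lookup finds it. Everything below is illustrative only.

```shell
# Illustration: simulate the ~/.bash_profile change with a temporary
# directory standing in for the real Flume install.
tmp=$(mktemp -d)
mkdir -p "$tmp/bin"
printf '#!/bin/sh\necho "Flume 1.6.0-cdh5.7.0"\n' > "$tmp/bin/flume-ng"
chmod +x "$tmp/bin/flume-ng"

export FLUME_HOME="$tmp"            # on a real node: /root/apps/apache-flume-1.6.0-cdh5.7.0-bin
export PATH="$FLUME_HOME/bin:$PATH"

# Because $FLUME_HOME/bin is prepended, the shell now resolves flume-ng there
command -v flume-ng
```

Prepending (rather than appending) `$FLUME_HOME/bin` guarantees this install wins over any other `flume-ng` already on the PATH.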
(4) Configure the Java JDK path in flume-env.sh
a. First, download the latest stable JDK:
Note: the JDK is available to whichever user account it is installed under
Latest version download: https://www.oracle.com/java/technologies/javase-downloads.html
b. Upload the JDK to /data/soft and adjust the file permissions:
[root@yz-sre-backup019 soft]# rz
[root@yz-sre-backup019 soft]# chmod 755 jdk-8u241-linux-x64.tar.gz
c. Extract the JDK to /usr/:
[root@yz-sre-backup019 soft]# tar -zxvf jdk-8u241-linux-x64.tar.gz -C /usr/
d. Configure the JDK environment variables:
[root@yz-sre-backup019 soft]# vim /etc/profile
# Java environment
export JAVA_HOME=/usr/jdk1.8.0_241
export CLASSPATH=.:${JAVA_HOME}/jre/lib/rt.jar:${JAVA_HOME}/lib/dt.jar:${JAVA_HOME}/lib/tools.jar
export PATH=$PATH:${JAVA_HOME}/bin
# Apply the JDK settings
[root@yz-sre-backup019 soft]# source /etc/profile
e. Check that the JDK installed correctly:
[root@yz-sre-backup019 soft]# java -version
java version "1.8.0_241"
Java(TM) SE Runtime Environment (build 1.8.0_241-b07)
Java HotSpot(TM) 64-Bit Server VM (build 25.241-b07, mixed mode)
f. Configure the Java JDK path in flume-env.sh:
[root@yz-sre-backup019 apps]# cd $FLUME_HOME/conf
# Copy the template
[root@yz-sre-backup019 conf]# cp flume-env.sh.template flume-env.sh
[root@yz-sre-backup019 conf]# vim flume-env.sh
# Point to the Java install; append this line at the end
export JAVA_HOME=/usr/jdk1.8.0_241
g. Verify
# Run flume-ng version from Flume's bin directory to print the version
[root@yz-sre-backup019 bin]# cd $FLUME_HOME/bin
[root@yz-sre-backup019 bin]# flume-ng version
# The following output means the installation succeeded
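On many machines it is handier to make this change from a script than in vim. A minimal sketch, shown against a temporary file standing in for `$FLUME_HOME/conf/flume-env.sh`: append the line only when it is absent, so a provisioning script can be re-run without duplicating it.

```shell
# Idempotent append: add the JAVA_HOME export only if flume-env.sh
# does not already contain one. Using a temp file for illustration.
conf=$(mktemp)                      # stands in for $FLUME_HOME/conf/flume-env.sh
add_java_home() {
    grep -q '^export JAVA_HOME=' "$conf" || \
        echo 'export JAVA_HOME=/usr/jdk1.8.0_241' >> "$conf"
}
add_java_home
add_java_home                       # second run is a no-op
grep -c '^export JAVA_HOME=' "$conf"   # prints 1
```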
Flume 1.6.0-cdh5.7.0
Source code repository: https://git-wip-us.apache.org/repos/asf/flume.git
Revision: 8f5f5143ae30802fe79f9ab96f893e6c54a105d1
Compiled by jenkins on Wed Mar 23 11:38:48 PDT 2016
From source with checksum 50b533f0ffc32db9246405ac4431872e
III. Flume in Practice
The key to using Flume is writing configuration files, in four steps:
- configure the source
- configure the channel
- configure the sink
- wire the three components together
Hands-on 01: Collect data from a network port and print it to the console
(1) Create directories for your Flume configuration files
[root@yz-sre-backup019 data]# mkdir -pv /data/flume/{log,job,bin}
mkdir: created directory `/data/flume'
mkdir: created directory `/data/flume/log'
mkdir: created directory `/data/flume/job'
mkdir: created directory `/data/flume/bin'
[root@yz-sre-backup019 data]# cd flume/
[root@yz-sre-backup019 flume]# ll
total 12
drwxr-xr-x 2 root root 4096 Apr 10 10:34 bin # startup scripts
drwxr-xr-x 2 root root 4096 Apr 10 10:34 job # agent configuration files for flume-ng
drwxr-xr-x 2 root root 4096 Apr 10 10:34 log # log files
(2) Configure the agent
Agent selection: netcat source + memory channel + logger sink
Create flume-netcat.conf under /data/flume/job (the directory and file name are up to you; you only need them later when starting the agent)
# flume-netcat.conf: A single_node Flume configuration
# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1
# a1: agent name; r1: source name; k1: sink name; c1: channel name
# Describe/configure the source
# Source type
a1.sources.r1.type = netcat
# Host the source binds to
a1.sources.r1.bind = localhost
# Port the source binds to
a1.sources.r1.port = 8888
# Sink type: logger, i.e. print events to the console
a1.sinks.k1.type = logger
# Channel type: memory, with a capacity of 1000 events and a transaction size of 100
a1.channels.c1.type = memory
# Maximum number of events stored in the channel
a1.channels.c1.capacity = 1000
# Maximum number of events per transaction taken from the source or given to the sink
a1.channels.c1.transactionCapacity = 100
# Wire the source and sink to the channel
# Attach the source to the channel; the property is plural (channels) because a source can feed multiple channels
a1.sources.r1.channels = c1
# Attach the sink to the channel; a sink can read from only one channel
a1.sinks.k1.channel = c1
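When provisioning nodes with a script, the same file can be written non-interactively with a heredoc instead of an editor. The sketch below writes to a temporary file so it is self-contained; on a real node the target would be /data/flume/job/flume-netcat.conf as above.

```shell
# Write the netcat-agent configuration in one shot. The quoted 'EOF'
# prevents the shell from expanding anything inside the heredoc.
conf=$(mktemp)                      # on a real node: /data/flume/job/flume-netcat.conf
cat > "$conf" <<'EOF'
a1.sources = r1
a1.sinks = k1
a1.channels = c1
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 8888
a1.sinks.k1.type = logger
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
EOF
grep -c '^a1\.' "$conf"             # prints 12
```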
(3) Start the agent
[root@yz-sre-backup019 ~]# flume-ng agent --conf ${FLUME_HOME}/conf --name a1 --conf-file /data/flume/job/flume-netcat.conf -Dflume.monitoring.type=http -Dflume.monitoring.port=10501 -Dflume.root.logger=INFO,console
# After starting, open another terminal and run the tests below
(4) Connection test
# If telnet is missing, install it first
[root@yz-sre-backup019 ~]# yum -y install telnet net-tools
[root@yz-sre-backup019 ~]# ss -ntl
State Recv-Q Send-Q Local Address:Port Peer Address:Port
LISTEN 0 128 *:22 *:*
LISTEN 0 50 127.0.0.1:8888 *:*
LISTEN 0 100 127.0.0.1:25 *:*
LISTEN 0 128 *:1988 *:*
LISTEN 0 50 *:10501 *:*
[root@yz-sre-backup019 ~]# telnet 127.0.0.1 8888
Trying 127.0.0.1...
Connected to 127.0.0.1.
Escape character is '^]'.
飞花点点轻!
OK
Welcome to Beijing...。。
OK
Python
OK
SHOWufei
OK
(5) Write a startup script for Flume (recommended for production)
- [root@yz-sre-backup019 flume]# cd /data/flume/bin
- [root@yz-sre-backup019 bin]# vim start-netcat.sh
- [root@yz-sre-backup019 bin]# chmod +x start-netcat.sh
#!/bin/bash
# @author: wufei
# @wiki: http://wiki.inf.lehe.com/pages/viewpage.action?pageId=39096985
# @email: wufei@higohappy.com
# @Date: Fri Apr 10 11:13:11 CST 2020
# Start Flume with its built-in monitoring enabled
# --conf: Flume's configuration directory
# --conf-file: your agent configuration file
# --name: the agent name, matching the config file, i.e. a1
# -Dflume.root.logger: log level and output target
nohup flume-ng agent --conf ${FLUME_HOME}/conf --conf-file /data/flume/job/flume-netcat.conf --name a1 -Dflume.monitoring.type=http -Dflume.monitoring.port=10501 -Dflume.root.logger=INFO,console >> /data/flume/log/flume-netcat.log 2>&1 &
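A stop script is a natural companion to start-netcat.sh in production. The sketch below (the script name and `pgrep` pattern are assumptions, not from the original) finds the agent by the unique config-file path on its command line and signals it; only a syntax check is run here, since no agent is live in this illustration.

```shell
# Companion stop script: locate the agent via its config-file path.
cat > /tmp/stop-netcat.sh <<'EOF'
#!/bin/bash
# Kill the Flume agent that was started with flume-netcat.conf
pids=$(pgrep -f 'flume-netcat\.conf')
if [ -n "$pids" ]; then
    kill $pids && echo "stopped: $pids"
else
    echo "no flume-netcat agent running"
fi
EOF
chmod +x /tmp/stop-netcat.sh
bash -n /tmp/stop-netcat.sh && echo "syntax ok"   # prints syntax ok
```

Matching on the config-file path is more precise than grepping for "flume", which would also match the grep process itself and any other agents on the host.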
(6) Start Flume and check the startup log
[root@yz-sre-backup019 bin]# ss -ntl
State Recv-Q Send-Q Local Address:Port Peer Address:Port
LISTEN 0 128 *:22 *:*
LISTEN 0 100 127.0.0.1:25 *:*
LISTEN 0 128 *:1988 *:*
[root@yz-sre-backup019 bin]# bash start-netcat.sh
[root@yz-sre-backup019 bin]#
[root@yz-sre-backup019 bin]# ss -ntl
State Recv-Q Send-Q Local Address:Port Peer Address:Port
LISTEN 0 128 *:22 *:*
LISTEN 0 50 127.0.0.1:8888 *:*
LISTEN 0 100 127.0.0.1:25 *:*
LISTEN 0 128 *:1988 *:*
LISTEN 0 50 *:10501 *:*
[root@yz-sre-backup019 log]# telnet 127.0.0.1 8888
Trying 127.0.0.1...
Connected to 127.0.0.1.
Escape character is '^]'.
飞花点点轻!
OK
Welcome to Beijing...。。
OK
Python
OK
SHOWufei
OK
[root@yz-sre-backup019 log]# tail -f flume-netcat.log
2020-04-10 15:31:36,708 (conf-file-poller-0) [INFO - org.apache.flume.node.Application.startAllComponents(Application.java:145)] Starting Channel c1
2020-04-10 15:31:36,832 (lifecycleSupervisor-1-0) [INFO - org.apache.flume.instrumentation.MonitoredCounterGroup.register(MonitoredCounterGroup.java:120)] Monitored counter group for type: CHANNEL, name: c1: Successfully registered new MBean.
2020-04-10 15:31:36,833 (lifecycleSupervisor-1-0) [INFO - org.apache.flume.instrumentation.MonitoredCounterGroup.start(MonitoredCounterGroup.java:96)] Component type: CHANNEL, name: c1 started
2020-04-10 15:31:36,837 (conf-file-poller-0) [INFO - org.apache.flume.node.Application.startAllComponents(Application.java:173)] Starting Sink k1
2020-04-10 15:31:36,838 (conf-file-poller-0) [INFO - org.apache.flume.node.Application.startAllComponents(Application.java:184)] Starting Source r1
2020-04-10 15:31:36,839 (lifecycleSupervisor-1-0) [INFO - org.apache.flume.source.NetcatSource.start(NetcatSource.java:155)] Source starting
2020-04-10 15:31:36,865 (lifecycleSupervisor-1-0) [INFO - org.apache.flume.source.NetcatSource.start(NetcatSource.java:169)] Created serverSocket:sun.nio.ch.ServerSocketChannelImpl[/127.0.0.1:8888]
2020-04-10 15:31:36,883 (conf-file-poller-0) [INFO - org.mortbay.log.Slf4jLog.info(Slf4jLog.java:67)] Logging to org.slf4j.impl.Log4jLoggerAdapter(org.mortbay.log) via org.mortbay.log.Slf4jLog
2020-04-10 15:31:36,944 (conf-file-poller-0) [INFO - org.mortbay.log.Slf4jLog.info(Slf4jLog.java:67)] jetty-6.1.26.cloudera.4
2020-04-10 15:31:36,979 (conf-file-poller-0) [INFO - org.mortbay.log.Slf4jLog.info(Slf4jLog.java:67)] Started SelectChannelConnector@0.0.0.0:10501
2020-04-10 15:32:30,851 (SinkRunner-PollingRunner-DefaultSinkProcessor) [INFO - org.apache.flume.sink.LoggerSink.process(LoggerSink.java:94)] Event: { headers:{} body: E9 A3 9E E8 8A B1 E7 82 B9 E7 82 B9 E8 BD BB 21 ...............! }
2020-04-10 15:32:39,853 (SinkRunner-PollingRunner-DefaultSinkProcessor) [INFO - org.apache.flume.sink.LoggerSink.process(LoggerSink.java:94)] Event: { headers:{} body: 57 65 6C 63 6F 6D 65 20 74 6F 20 42 65 69 6A 69 Welcome to Beiji }
2020-04-10 15:32:55,060 (SinkRunner-PollingRunner-DefaultSinkProcessor) [INFO - org.apache.flume.sink.LoggerSink.process(LoggerSink.java:94)] Event: { headers:{} body: 50 79 74 68 6F 6E 0D Python. }
2020-04-10 15:33:00,781 (SinkRunner-PollingRunner-DefaultSinkProcessor) [INFO - org.apache.flume.sink.LoggerSink.process(LoggerSink.java:94)] Event: { headers:{} body: 53 48 4F 57 75 66 65 69 0D SHOWufei. }
(7) Check whether the jq package is already present; install the jq tool to make JSON output easier to read
[root@yz-sre-backup019 soft]# wget -O jq https://github.com/stedolan/jq/releases/download/jq-1.6/jq-linux64
[root@yz-sre-backup019 soft]# chmod +x ./jq
[root@yz-sre-backup019 soft]# cp jq /usr/bin
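A quick check that jq works, against a hand-written sample shaped like Flume's /metrics response (the values are made up for illustration). Note the quoting: keys such as CHANNEL.c1 contain a dot, so they must be quoted inside the jq filter.

```shell
# Sample JSON shaped like the /metrics endpoint output (values invented).
sample='{"CHANNEL.c1":{"ChannelCapacity":"1000","ChannelSize":"4"}}'
echo "$sample" | jq .                               # pretty-print the whole document
echo "$sample" | jq -r '."CHANNEL.c1".ChannelSize'  # extract one metric: prints 4
```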
(8) View Flume metrics
[root@yz-sre-backup019 ~]# curl http://127.0.0.1:10501/metrics | jq
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
129 259 0 259 0 0 25951 0 --:--:-- --:--:-- --:--:-- 37000
{
"CHANNEL.c1": { # 这是 c1 的 CHANEL 监控数据,c1 该名称在 flume-netcat.conf 中配置文件中定义的
"ChannelCapacity": "1000", # channel 的容量,目前仅支持 File Channel、Memory channel 的统计数据
"ChannelFillPercentage": "0.4", # channel 已填入的百分比
"Type": "CHANNEL", # 很显然,这里是CHANNEL监控项,类型为 CHANNEL
"EventTakeSuccessCount": "0", # sink 成功从 channel 读取事件的总数量
"ChannelSize": "4", # 目前channel 中事件的总数量,目前仅支持 File Channel、Memory channel 的统计数据
"EventTakeAttemptCount": "0", # sink 尝试从 channel 拉取事件的总次数。这不意味着每次时间都被返回,因为 sink 拉取的时候 channel 可能没有任何数据
"StartTime": "1586489375175", # channel 启动时的毫秒值时间
"EventPutAttemptCount": "4", # Source 尝试写入 Channe 的事件总次数
"EventPutSuccessCount": "4", # 成功写入 channel 且提交的事件总次数
"StopTime": "0" # channel 停止时的毫秒值时间,为 0 表示一直在运行
}
}
Tip: for more metrics, see the official docs: http://flume.apache.org/FlumeUserGuide.html#monitoring
(9) Kill the Flume process
[root@yz-sre-backup019 ~]# ss -ntl
State Recv-Q Send-Q Local Address:Port Peer Address:Port
LISTEN 0 128 *:22 *:*
LISTEN 0 50 127.0.0.1:8888 *:*
LISTEN 0 100 127.0.0.1:25 *:*
LISTEN 0 128 *:1988 *:*
LISTEN 0 50 *:10501 *:*
[root@yz-sre-backup019 ~]# netstat -untalp | grep 8888
tcp 0 0 127.0.0.1:8888 0.0.0.0:* LISTEN 4565/java
[root@yz-sre-backup019 ~]# kill 4565
[root@yz-sre-backup019 ~]# netstat -untalp | grep 8888
[root@yz-sre-backup019 ~]# ss -ntl
State Recv-Q Send-Q Local Address:Port Peer Address:Port
LISTEN 0 128 *:22 *:*
LISTEN 0 100 127.0.0.1:25 *:*
LISTEN 0 128 *:1988 *:*
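The pattern above (find the PID behind the port, kill it, confirm the listener is gone) can be demonstrated without a live agent; the snippet below uses a throwaway `sleep` process in place of the Flume JVM.

```shell
# Generic stop-and-verify pattern, using a dummy background process.
sleep 300 &                       # stands in for the Flume java process
pid=$!
kill "$pid"                       # same as: kill <pid found via netstat/pgrep>
wait "$pid" 2>/dev/null || true   # reap it; a non-zero status is expected here
if kill -0 "$pid" 2>/dev/null; then
    echo "still running"
else
    echo "stopped"                # prints stopped
fi
```

`kill -0` sends no signal; it only tests whether the PID still exists, which makes it a cheap post-kill check in scripts.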
Hands-on 02: Monitor a file and collect newly appended data into a local file in real time
(1) Agent selection
exec source + memory channel + file_roll sink
(2) Configure the agent
Create flume-file.conf under /data/flume/job (the directory and file name are up to you; you only need them later when starting the agent)
# flume-file.conf: A single_node Flume configuration
# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1
# a1: agent name; r1: source name; k1: sink name; c1: channel name
# Describe/configure the source
# Source type
a1.sources.r1.type = exec
# Command the source runs
a1.sources.r1.command = tail -F /data/data.log
# Run the command string through bash -c
a1.sources.r1.shell = /bin/bash -c
# Sink type: file_roll, i.e. write to local files; the output directory must be set
a1.sinks.k1.type = file_roll
# Local directory the sink writes to
a1.sinks.k1.sink.directory = /data/flume/data
# Channel type
a1.channels.c1.type = memory
# Maximum number of events stored in the channel
a1.channels.c1.capacity = 1000
# Maximum number of events per transaction taken from the source or given to the sink
a1.channels.c1.transactionCapacity = 100
# Wire the source and sink to the channel
# Attach the source to the channel; the property is plural (channels) because a source can feed multiple channels
a1.sources.r1.channels = c1
# Attach the sink to the channel; a sink can read from only one channel
a1.sinks.k1.channel = c1
(3) Write a startup script for the agent (recommended for production)
- [root@yz-sre-backup019 flume]# cd /data/flume/bin
- [root@yz-sre-backup019 bin]# vim start-file.sh
- [root@yz-sre-backup019 bin]# chmod +x start-file.sh
#!/bin/bash
# @author: wufei
# @wiki: http://wiki.inf.lehe.com/pages/viewpage.action?pageId=39096985
# @email: wufei@higohappy.com
# @Date: Fri Apr 10 15:05:56 CST 2020
# Start Flume with its built-in monitoring enabled
# --conf: Flume's configuration directory
# --conf-file: your agent configuration file
# --name: the agent name, matching the config file, i.e. a1
# -Dflume.root.logger: log level and output target
nohup flume-ng agent --conf ${FLUME_HOME}/conf --name a1 --conf-file /data/flume/job/flume-file.conf -Dflume.monitoring.type=http -Dflume.monitoring.port=10501 -Dflume.root.logger=INFO,console >> /data/flume/log/flume-file.log 2>&1 &
(4) Start the agent and write test data into the monitored file;
check the event data in the file generated under /data/flume/data, then kill the Flume process when done.
[root@yz-sre-backup019 bin]# bash start-file.sh
[root@yz-sre-backup019 bin]# cd /data
[root@yz-sre-backup019 data]# echo "帅飞飞!!!" >> data.log
[root@yz-sre-backup019 bin]# cd /data/flume/data
[root@yz-sre-backup019 data]# tail 1586513526197-4
帅飞飞!!!
[root@yz-sre-backup019 data]# ps -ef | grep flume
root 17682 1 1 18:12 pts/1 00:00:03 /usr/jdk1.8.0_241/bin/java -Xmx20m -Dflume.monitoring.type=http -Dflume.monitoring.port=10501 -Dflume.root.logger==INFO,console -cp /root/apps/apache-flume-1.6.0-cdh5.7.0-bin/conf:/root/apps/apache-flume-1.6.0-cdh5.7.0-bin/lib/*:/lib/* -Djava.library.path= org.apache.flume.node.Application --name a1 --conf-file /data/flume/job/flume-file.conf
root 17970 956 0 18:16 pts/0 00:00:00 grep flume
[root@yz-sre-backup019 data]# kill 17682
[root@yz-sre-backup019 data]# ss -ntl
State Recv-Q Send-Q Local Address:Port Peer Address:Port
LISTEN 0 128 *:22 *:*
LISTEN 0 100 127.0.0.1:25 *:*
LISTEN 0 128 *:1988 *:*
Hands-on 03: Stream a local file to HDFS in real time (the Flume node needs a configured Hadoop cluster environment)
In the two exercises above, printing to the console serves no real purpose; in practice you usually want the data in HDFS. Only the agent configuration changes: set the sink type to hdfs and specify the HDFS URL and write path.
(1) Agent selection
exec source - memory channel - hdfs sink
(2) Configure the agent
# flume-hdfs.conf: A single_node Flume configuration
# Name the components on this agent
wufei03.sources = file_source
wufei03.sinks = hdfs_sink
wufei03.channels = memory_channel
# wufei03: agent name; file_source: source name; hdfs_sink: sink name; memory_channel: channel name
# Describe/configure the source
# Source type
wufei03.sources.file_source.type = exec
# Command the source runs
# wufei03.sources.file_source.command = tail -F /data/messages
wufei03.sources.file_source.command = tail -F /data/data.log
# Run the command string through bash -c
wufei03.sources.file_source.shell = /bin/bash -c
# Sink type: hdfs, i.e. write data to the HDFS cluster
wufei03.sinks.hdfs_sink.type = hdfs
# HDFS URL and write path for the sink
wufei03.sinks.hdfs_sink.hdfs.path = hdfs://yz-higo-nn1:9000/flume/dt=%Y-%m-%d/%H
# Prefix for uploaded files
wufei03.sinks.hdfs_sink.hdfs.filePrefix = gz_10.20.2.24-
# Whether to round down timestamps, i.e. roll directories by time
wufei03.sinks.hdfs_sink.hdfs.round = true
# Number of time units per directory
wufei03.sinks.hdfs_sink.hdfs.roundValue = 1
# The time unit used for rounding
wufei03.sinks.hdfs_sink.hdfs.roundUnit = hour
# Whether to use the local timestamp
wufei03.sinks.hdfs_sink.hdfs.useLocalTimeStamp = true
# Number of events to accumulate before flushing to HDFS
wufei03.sinks.hdfs_sink.hdfs.batchSize = 1000
# File type; compression is supported
wufei03.sinks.hdfs_sink.hdfs.fileType = DataStream
# Seconds before rolling a new file
wufei03.sinks.hdfs_sink.hdfs.rollInterval = 600
# File size (bytes) that triggers a roll
wufei03.sinks.hdfs_sink.hdfs.rollSize = 134217700
# Do not roll files based on event count
wufei03.sinks.hdfs_sink.hdfs.rollCount = 0
# Minimum number of block replicas
wufei03.sinks.hdfs_sink.hdfs.minBlockReplicas = 1
# Channel type
wufei03.channels.memory_channel.type = memory
# Maximum number of events stored in the channel
wufei03.channels.memory_channel.capacity = 1000
# Maximum number of events per transaction taken from the source or given to the sink
wufei03.channels.memory_channel.transactionCapacity = 1000
# Wire the source and sink to the channel
# Attach the source to the channel; the property is plural (channels) because a source can feed multiple channels
wufei03.sources.file_source.channels = memory_channel
# Attach the sink to the channel; a sink can read from only one channel
wufei03.sinks.hdfs_sink.channel = memory_channel
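Two of the values above are easy to misread, so a quick sanity check: the %Y-%m-%d/%H escapes in hdfs.path expand from the event timestamp (the local clock here, since useLocalTimeStamp = true), and the rollSize of 134217700 bytes sits just under the classic 128 MiB HDFS block size, so each rolled file fits in one block.

```shell
# Preview the directory layout produced by the time escapes in hdfs.path
date +"dt=%Y-%m-%d/%H"            # e.g. dt=2020-04-14/12

# rollSize (134217700) is deliberately a touch below 128 MiB:
echo $((128 * 1024 * 1024))       # prints 134217728
```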
(3) Write the startup script and start Flume
- [root@yz-sre-backup019 flume]# cd /data/flume/bin
- [root@yz-sre-backup019 bin]# vim start-hdfs.sh
- [root@yz-sre-backup019 bin]# chmod +x start-hdfs.sh
#!/bin/bash
# @author: wufei
# @wiki: http://wiki.inf.lehe.com/pages/viewpage.action?pageId=39096985
# @email: wufei@higohappy.com
# @Date: Tue Apr 14 11:56:51 CST 2020
# Start Flume with its built-in monitoring enabled
# --conf: Flume's configuration directory
# --conf-file: your agent configuration file
# --name: the agent name, matching the config file, i.e. wufei03
# -Dflume.root.logger: log level and output target
nohup flume-ng agent --conf ${FLUME_HOME}/conf --name wufei03 --conf-file /data/flume/job/flume-hdfs.conf -Dflume.monitoring.type=http -Dflume.monitoring.port=10502 -Dflume.root.logger=INFO,console >> /data/flume/log/flume-hdfs.log 2>&1 &
- [root@yz-bi-web01 bin]# bash start-hdfs.sh
- [root@yz-bi-web01 bin]# ss -ntl
State Recv-Q Send-Q Local Address:Port Peer Address:Port
.....
LISTEN 0 128 *:22 *:*
LISTEN 0 100 127.0.0.1:25 *:*
LISTEN 0 50 *:10502 *:*
LISTEN 0 128 *:80 *:*
[root@yz-bi-web01 bin]#
[root@yz-bi-web01 data]# cd /data/
[root@yz-bi-web01 data]# echo "SHOWufei" >> data.log
[root@yz-bi-web01 data]# echo "帅飞飞!!!" >> data.log
(4) Check Flume's collection logs
Info: Sourcing environment configuration script /root/apps/apache-flume-1.6.0-cdh5.7.0-bin/conf/flume-env.sh
Info: Including Hadoop libraries found via (/hadoop/hadoop/bin/hadoop) for HDFS access
Info: Excluding /hadoop/hadoop-2.7.1/share/hadoop/common/lib/slf4j-api-1.7.10.jar from classpath
Info: Excluding /hadoop/hadoop-2.7.1/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar from classpath
Info: Excluding /hadoop/hadoop-2.7.1/lib/tez/lib/slf4j-api-1.7.10.jar from classpath
Info: Including HBASE libraries found via (/hadoop/hbase/bin/hbase) for HBASE access
Info: Excluding /hadoop/hbase/lib/slf4j-api-1.7.7.jar from classpath
Info: Excluding /hadoop/hadoop-2.7.1/share/hadoop/common/lib/slf4j-api-1.7.10.jar from classpath
Info: Excluding /hadoop/hadoop-2.7.1/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar from classpath
Info: Excluding /hadoop/hadoop-2.7.1/lib/tez/lib/slf4j-api-1.7.10.jar from classpath
Info: Including Hive libraries found via (/hadoop/hive) for Hive access
...
2020-04-14 12:16:15,648 (lifecycleSupervisor-1-0) [INFO - org.apache.flume.source.ExecSource.start(ExecSource.java:169)] Exec source starting with command:tail -F /data/data.log
2020-04-14 12:16:15,650 (lifecycleSupervisor-1-1) [INFO - org.apache.flume.instrumentation.MonitoredCounterGroup.register(MonitoredCounterGroup.java:120)] Monitored counter group for type: SINK, name: hdfs_sink: Successfully registered new MBean.
2020-04-14 12:16:15,650 (lifecycleSupervisor-1-0) [INFO - org.apache.flume.instrumentation.MonitoredCounterGroup.register(MonitoredCounterGroup.java:120)] Monitored counter group for type: SOURCE, name: file_source: Successfully registered new MBean.
2020-04-14 12:16:15,650 (lifecycleSupervisor-1-1) [INFO - org.apache.flume.instrumentation.MonitoredCounterGroup.start(MonitoredCounterGroup.java:96)] Component type: SINK, name: hdfs_sink started
2020-04-14 12:16:15,650 (lifecycleSupervisor-1-0) [INFO - org.apache.flume.instrumentation.MonitoredCounterGroup.start(MonitoredCounterGroup.java:96)] Component type: SOURCE, name: file_source started
2020-04-14 12:16:15,683 (conf-file-poller-0) [INFO - org.mortbay.log.Slf4jLog.info(Slf4jLog.java:67)] Logging to org.slf4j.impl.Log4jLoggerAdapter(org.mortbay.log) via org.mortbay.log.Slf4jLog
2020-04-14 12:16:15,725 (conf-file-poller-0) [INFO - org.mortbay.log.Slf4jLog.info(Slf4jLog.java:67)] jetty-6.1.26.cloudera.4
2020-04-14 12:16:15,761 (conf-file-poller-0) [INFO - org.mortbay.log.Slf4jLog.info(Slf4jLog.java:67)] Started SelectChannelConnector@0.0.0.0:10502
2020-04-14 12:16:53,678 (SinkRunner-PollingRunner-DefaultSinkProcessor) [INFO - org.apache.flume.sink.hdfs.HDFSDataStream.configure(HDFSDataStream.java:58)] Serializer = TEXT, UseRawLocalFileSystem = false
2020-04-14 12:16:53,977 (SinkRunner-PollingRunner-DefaultSinkProcessor) [INFO - org.apache.flume.sink.hdfs.BucketWriter.open(BucketWriter.java:234)] Creating hdfs://yz-higo-nn1:9000/flume/dt=2020-04-14/12/gz_10.20.2.24-.1586837813679.tmp
2020-04-14 12:26:55,533 (hdfs-hdfs_sink-roll-timer-0) [INFO - org.apache.flume.sink.hdfs.BucketWriter.close(BucketWriter.java:363)] Closing hdfs://yz-higo-nn1:9000/flume/dt=2020-04-14/12/gz_10.20.2.24-.1586837813679.tmp
2020-04-14 12:26:55,571 (hdfs-hdfs_sink-call-runner-6) [INFO - org.apache.flume.sink.hdfs.BucketWriter$8.call(BucketWriter.java:629)] Renaming hdfs://yz-higo-nn1:9000/flume/dt=2020-04-14/12/gz_10.20.2.24-.1586837813679.tmp to hdfs://yz-higo-nn1:9000/flume/dt=2020-04-14/12/gz_10.20.2.24-.1586837813679
2020-04-14 12:26:55,580 (hdfs-hdfs_sink-roll-timer-0) [INFO - org.apache.flume.sink.hdfs.HDFSEventSink$1.run(HDFSEventSink.java:394)] Writer callback called.
2020-04-14 17:47:30,793 (SinkRunner-PollingRunner-DefaultSinkProcessor) [INFO - org.apache.flume.sink.hdfs.HDFSDataStream.configure(HDFSDataStream.java:58)] Serializer = TEXT, UseRawLocalFileSystem = false
2020-04-14 17:47:30,839 (SinkRunner-PollingRunner-DefaultSinkProcessor) [INFO - org.apache.flume.sink.hdfs.BucketWriter.open(BucketWriter.java:234)] Creating hdfs://yz-higo-nn1:9000/flume/dt=2020-04-14/17/gz_10.20.2.24-.1586857650794.tmp
(5) Check that the corresponding logs appear under the HDFS directory
[root@yz-bi-web01 ~]# hdfs dfs -ls /flume/dt=2020-04-14/12
Found 1 items
-rw-r--r-- 3 root hadoop 9 2020-04-14 12:16 /flume/dt=2020-04-14/12/gz_10.20.2.24-.1586837813679.tmp
[root@yz-bi-web01 ~]# hdfs dfs -ls /flume/dt=2020-04-14
Found 2 items
drwxrwxrwx - root hadoop 0 2020-04-14 12:26 /flume/dt=2020-04-14/12
drwxrwxrwx - root hadoop 0 2020-04-14 17:47 /flume/dt=2020-04-14/17
(6)Browsing HDFS
(7) View Flume metrics
[root@yz-bi-web01 ~]# curl http://127.0.0.1:10502/metrics | jq
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
105 841 0 841 0 0 163k 0 --:--:-- --:--:-- --:--:-- 273k
{
"SOURCE.file_source": { # source 的名称
"OpenConnectionCount": "0", # 目前与客户端或 sink 保持连接的总数量,目前仅支持 avro source 展现该度量
"Type": "SOURCE", # 当前类型为 SOURRCE
"AppendBatchAcceptedCount": "0", # 成功提交到 channel 的批次的总数量
"AppendBatchReceivedCount": "0", # 接收到事件批次的总数量
"EventAcceptedCount": "3", ## 成功写出到channel的事件总数量
"AppendReceivedCount": "0", # 每批只有一个事件的事件总数量(与 RPC 调用的一个 append 调用相等)
"StopTime": "0", # SOURCE 停止时的毫秒值时间,0 代表一直运行着
"StartTime": "1586837775650", # SOURCE 启动时的毫秒值时间
"EventReceivedCount": "3", ## 目前为止 source 已经接收到的事件总数量
"AppendAcceptedCount": "0" # 逐条录入的次数,单独传入的事件到 Channel 且成功返回的事件总数量
},
"SINK.hdfs_sink": { # sink 的名称
"BatchCompleteCount": "0", # 批量处理event的个数等于批处理大小的数量
"ConnectionFailedCount": "0", # 连接失败的次数
"EventDrainAttemptCount": "3", ## sink 尝试写出到存储的事件总数量
"ConnectionCreatedCount": "2", # 下一个阶段(或存储系统)创建链接的数量(如HDFS创建一个文件)
"Type": "SINK", # 当前类型为 SINK
"BatchEmptyCount": "2551", # 批量处理 event 的个数为 0 的数量(空的批量的数量),如果数量很大表示 source 写入数据的速度比 sink 处理数据的速度慢很多
"ConnectionClosedCount": "1", # 连接关闭的次数
"EventDrainSuccessCount": "3", ## sink成功写出到存储的事件总数量
"StopTime": "0", # SINK 停止时的毫秒值时间
"StartTime": "1586837775650", # SINK 启动时的毫秒值时间
"BatchUnderflowCount": "3" # 批量处理 event 的个数小于批处理大小的数量(比 sink 配置使用的最大批量尺寸更小的批量的数量),如果该值很高也表示 sink 比 source 更快
},
"CHANNEL.memory_channel": { # channel 的名称
"EventPutSuccessCount": "3", ## 成功写入channel且提交的事件总次数
"ChannelFillPercentage": "0.0", # channel已填入的百分比
"Type": "CHANNEL", # 当前类型为 CHANNEL
"StopTime": "0", # CHANNEL 停止时的毫秒值时间
"EventPutAttemptCount": "3", ## Source 尝试写入 Channe 的事件总次数
"ChannelSize": "0", # 目前 channel 中事件的总数量,目前仅支持 File Channel,Memory channel 的统计数据
"StartTime": "1586837775646", # CHANNEL 启动时的毫秒值时间
"EventTakeSuccessCount": "3", ## sink 成功从 channel 读取事件的总数量
"ChannelCapacity": "1000", # channel 的容量,目前仅支持 File Channel,Memory channel 的统计数据
"EventTakeAttemptCount": "2558" # sink 尝试从 channel 拉取事件的总次数。这不意味着每次时间都被返回,因为 sink 拉取的时候 channel 可能没有任何数据
}
}
(8) Kill the Flume process when testing is done
[root@yz-bi-web01 ~]# ps -ef | grep flume
root 19768 14759 0 17:57 pts/11 00:00:00 grep flume
root 26653 1 0 12:16 pts/0 00:00:26 /usr/local/jdk1.7.0_76/bin/java -Xmx20m -Dflume.monitoring.type=http -Dflume.monitoring.port=10502 -Dflume.root.logger=INFO,console -cp /root/apps/apache-flume-1.6.0-cdh5.7.0-bin/conf:/root/apps/apache-flume-1.6.0-cdh5.7.0-bin/lib/*:/hadoop/hadoop-2.7.1/etc/hadoop:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/activation-1.1.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/apacheds-i18n-2.0.0-M15.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/apacheds-kerberos-codec-2.0.0-M15.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/api-asn1-api-1.0.0-M20.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/api-util-1.0.0-M20.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/asm-3.2.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/avro-1.7.4.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/commons-beanutils-1.7.0.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/commons-beanutils-core-1.8.0.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/commons-cli-1.2.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/commons-codec-1.4.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/commons-collections-3.2.1.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/commons-compress-1.4.1.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/commons-configuration-1.6.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/commons-digester-1.8.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/commons-httpclient-3.1.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/commons-io-2.4.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/commons-lang-2.6.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/commons-logging-1.1.3.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/commons-math3-3.1.1.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/commons-net-3.1.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/curator-client-2.7.1.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/curator-framework-2.7.1.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/curator-recipes-2.7.1.jar:/hadoop/hadoop-2.7.1/share
/hadoop/common/lib/gson-2.2.4.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/guava-11.0.2.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/hadoop-annotations-2.7.1.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/hadoop-auth-2.7.1.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/hadoop-lzo-0.4.21-SNAPSHOT.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/hamcrest-core-1.3.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/htrace-core-3.1.0-incubating.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/httpclient-4.2.5.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/httpcore-4.2.5.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/jackson-core-asl-1.9.13.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/jackson-jaxrs-1.9.13.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/jackson-mapper-asl-1.9.13.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/jackson-xc-1.9.13.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/java-xmlbuilder-0.4.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/jaxb-api-2.2.2.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/jaxb-impl-2.2.3-1.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/jersey-core-1.9.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/jersey-json-1.9.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/jersey-server-1.9.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/jets3t-0.9.0.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/jettison-1.1.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/jetty-6.1.26.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/jetty-util-6.1.26.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/jsch-0.1.42.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/jsp-api-2.1.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/jsr305-3.0.0.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/junit-4.11.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/log4j-1.2.17.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/mockito-all-1.8.5.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/netty-3.6.2.Final.jar:/hadoop/hadoop-2.7.1/share/hadoop/com
mon/lib/paranamer-2.3.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/protobuf-java-2.5.0.jar:/hadoop/hadoop-2.7.1/share/hadoop/com
[root@yz-bi-web01 ~]# kill 26653
[root@yz-bi-web01 ~]# ps -ef | grep flume
root 19777 14759 0 17:58 pts/11 00:00:00 grep flume
(9) Clean up the data on HDFS
[root@yz-bi-web01 ~]# su - hadoop
[hadoop@yz-bi-web01 ~]$ hdfs dfs -rm -r -f -skipTrash /flume
Deleted /flume
[hadoop@yz-bi-web01 ~]$
实战 04:从服务器 A 收集数据到服务器 B 并上传到 HDFS(需要服务器 B 节点配置 Hadoop 集群环境)
重点:服务器 A 的 sink 类型是 avro,而服务器 B 的 source 类型是 avro。
流程:
- 机器 A 监控一个文件,把日志记录到 data.log 中
- avro sink 把新产生的日志输出到指定的 hostname 和 port 上
- 通过 avro source 对应的 agent 将日志输出到控制台、kafka、hdfs 等
(1)机器 A 配置
agent 选型:exec source + memory channel + avro sink
# flume-file.conf: A single_node Flume configuration
# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1
# a1: agent 的名称; r1: source 的名称; k1: sink 的名称; c1: channel 的名称
# Describe/configure the source
# 配置 source 的类型
a1.sources.r1.type = exec
# 配置 source 执行的命令
a1.sources.r1.command = tail -F /data/data.log
# 配置 source 让 bash 将一个字符串作为完整的命令来执行
a1.sources.r1.shell = /bin/bash -c
# 指定 sink 的类型,我们这里指定的为 avro,即将数据发送到端口,需要设置端口名称、端口号
a1.sinks.k1.type = avro
# 配置 sink 主机名称
a1.sinks.k1.hostname = 10.20.2.24
# 配置 sink 主机端口
a1.sinks.k1.port = 8888
# 配置 channel 的类型
a1.channels.c1.type = memory
# 配置通道中存储的最大 event 数
a1.channels.c1.capacity = 1000
# 配置每次事务中 channel 从 source 接收或向 sink 发送的最大 event 数
a1.channels.c1.transactionCapacity = 100
# 绑定 source 和 sink
# 把 source 和 channel 做关联,其中属性是 channels,说明 sources 可以和多个 channel 做关联
a1.sources.r1.channels = c1
# 把 sink 和 channel 做关联,只能输出到一个 channel
a1.sinks.k1.channel = c1
(2)机器 B 配置
agent 选型:avro source + memory channel + hdfs sink
# flume-hdfs.conf: A single_node Flume configuration
# Name the components on this agent
wufei03.sources = file_source
wufei03.sinks = hdfs_sink
wufei03.channels = memory_channel
# wufei03: agent 的名称; file_source: source 的名称; hdfs_sink: sink 的名称; memory_channel: channel 的名称
# Describe/configure the source
# 配置 source 的类型
wufei03.sources.file_source.type = avro
# 配置 source 绑定主机
wufei03.sources.file_source.bind = 10.20.2.24
# 配置 source 绑定主机端口
wufei03.sources.file_source.port = 8888
# 配置 sink 的类型为 hdfs,即将数据写入 HDFS 集群
wufei03.sinks.hdfs_sink.type = hdfs
# 配置 sink 输出到本 hdfs 的 url 和写入路径
wufei03.sinks.hdfs_sink.hdfs.path = hdfs://yz-higo-nn1:9000/flume/dt=%Y-%m-%d/%H
# 上传文件的前缀
wufei03.sinks.hdfs_sink.hdfs.filePrefix = gz_10.20.3.36-
# 是否按照时间滚动文件夹
wufei03.sinks.hdfs_sink.hdfs.round = true
# 多少时间单位创建一个文件夹
wufei03.sinks.hdfs_sink.hdfs.roundValue = 1
# 重新定义时间单位
wufei03.sinks.hdfs_sink.hdfs.roundUnit = hour
# 是否使用本地时间戳
wufei03.sinks.hdfs_sink.hdfs.useLocalTimeStamp = true
# 积攒多少个 event 才 flush 到 hdfs 一次
wufei03.sinks.hdfs_sink.hdfs.batchSize = 1000
# 设置文件类型,可支持压缩
wufei03.sinks.hdfs_sink.hdfs.fileType = DataStream
# 多久生成一个新文件
wufei03.sinks.hdfs_sink.hdfs.rollInterval = 600
# 设置每个文件的滚动大小
wufei03.sinks.hdfs_sink.hdfs.rollSize = 134217700
# 文件的滚动与 event 数量无关
wufei03.sinks.hdfs_sink.hdfs.rollCount = 0
# 最小副本数
wufei03.sinks.hdfs_sink.hdfs.minBlockReplicas = 1
# 配置 channel 的类型
wufei03.channels.memory_channel.type = memory
# 配置通道中存储的最大 event 数
wufei03.channels.memory_channel.capacity = 1000
# 配置每次事务中 channel 从 source 接收或向 sink 发送的最大 event 数
wufei03.channels.memory_channel.transactionCapacity = 100
# 绑定 source 和 sink
# 把 source 和 channel 做关联,其中属性是 channels,说明 sources 可以和多个 channel 做关联
wufei03.sources.file_source.channels = memory_channel
# 把 sink 和 channel 做关联,只能输出到一个 channel
wufei03.sinks.hdfs_sink.channel = memory_channel
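上面 hdfs.path 中的 %Y-%m-%d/%H 转义符会按 event 的时间戳(useLocalTimeStamp = true 时由 agent 本地生成)把数据落到对应的小时目录。下面用 date 命令做一个极简示意,推算某个时间戳对应的目录(仅演示分桶规则,并非 flume 的实现;依赖 GNU date 的 -d @秒 写法,为保证结果稳定这里固定用 UTC,实际 agent 按本地时区分桶):

```shell
# 极简示意:模拟 hdfs.path = hdfs://.../flume/dt=%Y-%m-%d/%H 对某个 event 时间戳的转义结果
# 注意:这只是按转义规则推算目录,并非 flume 本身的代码;依赖 GNU date
bucket_path() {
  local ts="$1"                          # event 时间戳(秒)
  date -u -d "@${ts}" +"/flume/dt=%Y-%m-%d/%H"
}

bucket_path 1586922950                   # 对应上文日志文件名中的 1586922950859 毫秒时间戳
# → /flume/dt=2020-04-15/03  (UTC;本地 CST 时区下即上文看到的 11 点目录)
```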
(3)编写启动脚本
// 机器 A 启动脚本
[root@yz-sre-backup019 bin]# vim start-file.sh
[root@yz-sre-backup019 bin]# cat start-file.sh
#!/bin/bash
# @author: wufei
# @wiki: http://wiki.inf.lehe.com/pages/viewpage.action?pageId=39096985
# @email: wufei@higohappy.com
# @Date: Wed Apr 15 11:22:24 CST 2020
# 启动 flume 自身的监控参数,默认执行以下脚本
# --conf: flume 的配置目录
# --conf-file: 自定义 flume 的 agent 配置文件
# --name: 指定 agent 的名称,与自定义 agent 配置文件中对应,即 a1
# -Dflume.root.logger: 日志级别和输出形式
nohup flume-ng agent --conf ${FLUME_HOME}/conf --name a1 --conf-file /data/flume/job/flume-file.conf -Dflume.monitoring.type=http -Dflume.monitoring.port=10501 -Dflume.root.logger=INFO,console >> /data/flume/log/flume-file.log 2>&1 &
// 机器 B 启动脚本
[root@yz-bi-web01 bin]# vim start-hdfs.sh
[root@yz-bi-web01 bin]# cat start-hdfs.sh
#!/bin/bash
# @author: wufei
# @wiki: http://wiki.inf.lehe.com/pages/viewpage.action?pageId=39096985
# @email: wufei@higohappy.com
# @Date: Wed Apr 15 11:22:24 CST 2020
# 启动 flume 自身的监控参数,默认执行以下脚本
# --conf: flume 的配置目录
# --conf-file: 自定义 flume 的 agent 配置文件
# --name: 指定 agent 的名称,与自定义 agent 配置文件中对应,即 wufei03
# -Dflume.root.logger: 日志级别和输出形式
nohup flume-ng agent --conf ${FLUME_HOME}/conf --name wufei03 --conf-file /data/flume/job/flume-hdfs.conf -Dflume.monitoring.type=http -Dflume.monitoring.port=10502 -Dflume.root.logger=INFO,console >> /data/flume/log/flume-hdfs.log 2>&1 &
(4)启动脚本并查看对应的日志信息
// 机器 A
[root@yz-sre-backup019 bin]# ss -ntl
State Recv-Q Send-Q Local Address:Port Peer Address:Port
LISTEN 0 128 *:22 *:*
LISTEN 0 100 127.0.0.1:25 *:*
LISTEN 0 128 *:1988 *:*
// 机器 B
[root@yz-bi-web01 bin]# ss -ntl
State Recv-Q Send-Q Local Address:Port Peer Address:Port
LISTEN 0 128 *:22 *:*
LISTEN 0 100 127.0.0.1:25 *:*
LISTEN 0 128 *:1988 *:*
// 先启动机器 B 的 agent
[root@yz-bi-web01 bin]# bash start-hdfs.sh
[root@yz-bi-web01 log]# tail -f flume-hdfs.log
Info: Sourcing environment configuration script /root/apps/apache-flume-1.6.0-cdh5.7.0-bin/conf/flume-env.sh
Info: Including Hadoop libraries found via (/hadoop/hadoop/bin/hadoop) for HDFS access
Info: Excluding /hadoop/hadoop-2.7.1/share/hadoop/common/lib/slf4j-api-1.7.10.jar from classpath
Info: Excluding /hadoop/hadoop-2.7.1/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar from classpath
Info: Excluding /hadoop/hadoop-2.7.1/lib/tez/lib/slf4j-api-1.7.10.jar from classpath
Info: Including HBASE libraries found via (/hadoop/hbase/bin/hbase) for HBASE access
Info: Excluding /hadoop/hbase/lib/slf4j-api-1.7.7.jar from classpath
Info: Excluding /hadoop/hadoop-2.7.1/share/hadoop/common/lib/slf4j-api-1.7.10.jar from classpath
Info: Excluding /hadoop/hadoop-2.7.1/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar from classpath
Info: Excluding /hadoop/hadoop-2.7.1/lib/tez/lib/slf4j-api-1.7.10.jar from classpath
Info: Including Hive libraries found via (/hadoop/hive) for Hive access
// 后启动机器 A 的 agent
[root@yz-sre-backup019 bin]# bash start-file.sh
[root@yz-sre-backup019 log]# tail -f flume-file.log
Info: Sourcing environment configuration script /root/apps/apache-flume-1.6.0-cdh5.7.0-bin/conf/flume-env.sh
Info: Including Hive libraries found via () for Hive access
+ exec /usr/jdk1.8.0_241/bin/java -Xmx20m -Dflume.monitoring.type=http -Dflume.monitoring.port=10501 -Dflume.root.logger=INFO,console -cp '/root/apps/apache-flume-1.6.0-cdh5.7.0-bin/conf:/root/apps/apache-flume-1.6.0-cdh5.7.0-bin/lib/*:/lib/*' -Djava.library.path= org.apache.flume.node.Application --name a1 --conf-file /data/flume/job/flume-file.conf
2020-04-15 11:55:35,409 (lifecycleSupervisor-1-0) [INFO - org.apache.flume.node.PollingPropertiesFileConfigurationProvider.start(PollingPropertiesFileConfigurationProvider.java:61)] Configuration provider starting
2020-04-15 11:55:35,416 (lifecycleSupervisor-1-0) [DEBUG - org.apache.flume.node.PollingPropertiesFileConfigurationProvider.start(PollingPropertiesFileConfigurationProvider.java:78)] Configuration provider started
// 插入测试数据
[root@yz-sre-backup019 ~]# cd /data/
[root@yz-sre-backup019 data]# echo "帅飞飞!!!" >> data.log
[root@yz-bi-web01 log]# tail -f flume-hdfs.log
2020-04-15 11:55:51,147 (SinkRunner-PollingRunner-DefaultSinkProcessor) [INFO - org.apache.flume.sink.hdfs.BucketWriter.open(BucketWriter.java:234)] Creating hdfs://yz-higo-nn1:9000/flume/dt=2020-04-15/11/gz_10.20.3.36-.1586922950859.tmp
[root@yz-sre-backup019 data]# echo "SHOWufei!!!" >> data.log
[root@yz-bi-web01 log]# tail -f flume-hdfs.log
2020-04-15 12:03:14,139 (SinkRunner-PollingRunner-DefaultSinkProcessor) [INFO - org.apache.flume.sink.hdfs.BucketWriter.open(BucketWriter.java:234)] Creating hdfs://yz-higo-nn1:9000/flume/dt=2020-04-15/12/gz_10.20.3.36-.1586923393969.tmp
(5)查看 hdfs 对应目录是否生成相应的日志信息
[root@yz-bi-web01 ~]# hdfs dfs -ls /flume/dt=2020-04-15
Found 2 items
drwxrwxrwx - root hadoop 0 2020-04-15 12:05 /flume/dt=2020-04-15/11
drwxrwxrwx - root hadoop 0 2020-04-15 12:03 /flume/dt=2020-04-15/12
[root@yz-bi-web01 ~]# hdfs dfs -ls /flume/dt=2020-04-15/11
Found 1 items
-rw-r--r-- 3 root hadoop 19 2020-04-15 12:05 /flume/dt=2020-04-15/11/gz_10.20.3.36-.1586922950859
[root@yz-bi-web01 ~]# hadoop fs -cat /flume/dt=2020-04-15/11/gz_10.20.3.36-.1586922950859
帅飞飞!!!
[root@yz-bi-web01 ~]# hdfs dfs -ls /flume/dt=2020-04-15/12
Found 1 items
-rw-r--r-- 3 root hadoop 18 2020-04-15 12:03 /flume/dt=2020-04-15/12/gz_10.20.3.36-.1586923393969.tmp
[root@yz-bi-web01 ~]# hadoop fs -cat /flume/dt=2020-04-15/12/gz_10.20.3.36-.1586923393969
SHOWufei!!!
(6)查看 flume 度量值
// 机器 A
[root@yz-sre-backup019 ~]# curl http://127.0.0.1:10501/metrics | jq
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
101 811 0 811 0 0 4924 0 --:--:-- --:--:-- --:--:-- 4975
{
"SINK.k1": {
"ConnectionCreatedCount": "1",
"ConnectionClosedCount": "0",
"Type": "SINK",
"BatchCompleteCount": "0",
"BatchEmptyCount": "109",
"EventDrainAttemptCount": "2",
"StartTime": "1586922935645",
"EventDrainSuccessCount": "2",
"BatchUnderflowCount": "2",
"StopTime": "0",
"ConnectionFailedCount": "0"
},
"CHANNEL.c1": {
"ChannelCapacity": "1000",
"ChannelFillPercentage": "0.0",
"Type": "CHANNEL",
"ChannelSize": "0",
"EventTakeSuccessCount": "2",
"EventTakeAttemptCount": "114",
"StartTime": "1586922935643",
"EventPutAttemptCount": "2",
"EventPutSuccessCount": "2",
"StopTime": "0"
},
"SOURCE.r1": {
"EventReceivedCount": "2",
"AppendBatchAcceptedCount": "0",
"Type": "SOURCE",
"EventAcceptedCount": "2",
"AppendReceivedCount": "0",
"StartTime": "1586922935652",
"AppendAcceptedCount": "0",
"OpenConnectionCount": "0",
"AppendBatchReceivedCount": "0",
"StopTime": "0"
}
}
// 机器 B
[root@yz-bi-web01 ~]# curl http://127.0.0.1:10502/metrics | jq
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
104 839 0 839 0 0 7163 0 --:--:-- --:--:-- --:--:-- 7295
{
"SOURCE.file_source": {
"OpenConnectionCount": "1",
"Type": "SOURCE",
"AppendBatchReceivedCount": "2",
"AppendBatchAcceptedCount": "2",
"EventAcceptedCount": "2",
"AppendReceivedCount": "0",
"StopTime": "0",
"StartTime": "1586922913313",
"EventReceivedCount": "2",
"AppendAcceptedCount": "0"
},
"SINK.hdfs_sink": {
"BatchCompleteCount": "0",
"ConnectionFailedCount": "0",
"EventDrainAttemptCount": "2",
"ConnectionCreatedCount": "2",
"Type": "SINK",
"BatchEmptyCount": "117",
"ConnectionClosedCount": "1",
"EventDrainSuccessCount": "2",
"StopTime": "0",
"StartTime": "1586922912838",
"BatchUnderflowCount": "2"
},
"CHANNEL.memory_channel": {
"EventPutSuccessCount": "2",
"ChannelFillPercentage": "0.0",
"Type": "CHANNEL",
"EventPutAttemptCount": "2",
"ChannelSize": "0",
"StopTime": "0",
"StartTime": "1586922912835",
"EventTakeSuccessCount": "2",
"ChannelCapacity": "1000",
"EventTakeAttemptCount": "121"
}
}
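度量值里最值得盯的是 SOURCE 的 EventAcceptedCount、CHANNEL 的 ChannelSize/ChannelFillPercentage 和 SINK 的 EventDrainSuccessCount:上面两台机器三者都是 2 且填充率为 0,说明数据在链路上没有积压。需要脚本化巡检时,可以用一个小函数从 metrics 的 JSON 里抓某个指标(示意写法,按上文 jq 格式化后的 `"key": "value"` 样式用 grep/sed 匹配,便于在没装 jq 的机器上使用):

```shell
# 示意:从 flume metrics 的 JSON 输出中抓取某个指标的值(不依赖 jq)
metric() {
  # $1: 指标名,如 EventDrainSuccessCount;JSON 从标准输入读入
  grep -o "\"$1\": \"[0-9.]*\"" | head -1 | sed 's/.*"\([0-9.]*\)"$/\1/'
}

echo '{"SINK.k1": {"EventDrainSuccessCount": "2"}}' | metric EventDrainSuccessCount   # → 2
```

实际使用时把 echo 换成 curl -s http://127.0.0.1:10501/metrics 即可。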
(7)测试完删掉 flume 进程并清除 hdfs 上数据
// 先删掉机器 A 的 flume 进程
[root@yz-sre-backup019 bin]# ps -ef | grep flume
root 10492 6728 0 11:54 pts/2 00:00:00 tail -f flume-file.log
root 10500 1 0 11:55 pts/0 00:00:09 /usr/jdk1.8.0_241/bin/java -Xmx20m -Dflume.monitoring.type=http -Dflume.monitoring.port=10501 -Dflume.root.logger=INFO,console -cp /root/apps/apache-flume-1.6.0-cdh5.7.0-bin/conf:/root/apps/apache-flume-1.6.0-cdh5.7.0-bin/lib/*:/lib/* -Djava.library.path= org.apache.flume.node.Application --name a1 --conf-file /data/flume/job/flume-file.conf
root 11084 5377 0 12:12 pts/0 00:00:00 grep flume
[root@yz-sre-backup019 bin]# kill 10500
[root@yz-sre-backup019 bin]# ps -ef | grep flume
root 10492 6728 0 11:54 pts/2 00:00:00 tail -f flume-file.log
root 11092 5377 0 12:12 pts/0 00:00:00 grep flume
// 后删掉机器 B 的 flume 进程
[root@yz-bi-web01 ~]# ps -ef | grep flume
root 5725 16077 0 11:54 pts/11 00:00:00 tail -f flume-hdfs.log
root 5735 1 1 11:55 pts/0 00:00:20 /usr/local/jdk1.7.0_76/bin/java -Xmx20m -Dflume.monitoring.type=http -Dflume.monitoring.port=10502 -Dflume.root.logger=INFO,console -cp /root/apps/apache-flume-1.6.0-cdh5.7.0-bin/conf:/root/apps/apache-flume-1.6.0-cdh5.7.0-bin/lib/*:/hadoop/hadoop-2.7.1/etc/hadoop:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/activation-1.1.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/apacheds-i18n-2.0.0-M15.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/apacheds-kerberos-codec-2.0.0-M15.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/api-asn1-api-1.0.0-M20.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/api-util-1.0.0-M20.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/asm-3.2.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/avro-1.7.4.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/commons-beanutils-1.7.0.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/commons-beanutils-core-1.8.0.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/commons-cli-1.2.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/commons-codec-1.4.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/commons-collections-3.2.1.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/commons-compress-1.4.1.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/commons-configuration-1.6.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/commons-digester-1.8.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/commons-httpclient-3.1.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/commons-io-2.4.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/commons-lang-2.6.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/commons-logging-1.1.3.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/commons-math3-3.1.1.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/commons-net-3.1.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/curator-client-2.7.1.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/curator-framework-2.7.1.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/curator-recipes-2.7.1.jar:/hadoop/hadoop-2.7.1/share/
hadoop/common/lib/gson-2.2.4.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/guava-11.0.2.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/hadoop-annotations-2.7.1.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/hadoop-auth-2.7.1.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/hadoop-lzo-0.4.21-SNAPSHOT.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/hamcrest-core-1.3.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/htrace-core-3.1.0-incubating.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/httpclient-4.2.5.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/httpcore-4.2.5.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/jackson-core-asl-1.9.13.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/jackson-jaxrs-1.9.13.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/jackson-mapper-asl-1.9.13.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/jackson-xc-1.9.13.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/java-xmlbuilder-0.4.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/jaxb-api-2.2.2.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/jaxb-impl-2.2.3-1.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/jersey-core-1.9.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/jersey-json-1.9.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/jersey-server-1.9.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/jets3t-0.9.0.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/jettison-1.1.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/jetty-6.1.26.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/jetty-util-6.1.26.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/jsch-0.1.42.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/jsp-api-2.1.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/jsr305-3.0.0.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/junit-4.11.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/log4j-1.2.17.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/mockito-all-1.8.5.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/netty-3.6.2.Final.jar:/hadoop/hadoop-2.7.1/share/hadoop/comm
on/lib/paranamer-2.3.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/protobuf-java-2.5.0.jar:/hadoop/hadoop-2.7.1/share/hadoop/com
root 18949 9025 0 12:12 pts/12 00:00:00 grep flume
[root@yz-bi-web01 ~]# kill 5735
[root@yz-bi-web01 ~]# ps -ef | grep flume
root 5725 16077 0 11:54 pts/11 00:00:00 tail -f flume-hdfs.log
root 18963 9025 0 12:12 pts/12 00:00:00 grep flume
// 清除 hdfs 上的数据
[root@yz-bi-web01 ~]# su - hadoop
[hadoop@yz-bi-web01 ~]$ hdfs dfs -rm -r -f -skipTrash /flume
Deleted /flume
[hadoop@yz-bi-web01 ~]$ exit
logout
[root@yz-bi-web01 ~]#
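上面手工 ps、kill 的步骤可以封装成一个小脚本,按 agent 配置文件名定位进程再停止(示意写法,假设 flume 进程命令行里带有主类 org.apache.flume.node.Application 和对应的 conf 文件路径,与上文 ps 输出一致):

```shell
#!/bin/bash
# 示意:按 agent 配置文件名停止对应的 flume 进程,省去手工 ps/kill
stop_flume() {
  local conf="$1"
  local pids
  # 上文 ps 输出显示 flume 进程命令行包含主类和 conf 文件路径,据此匹配
  pids=$(pgrep -f "org.apache.flume.node.Application.*${conf}" || true)
  if [ -n "$pids" ]; then
    kill $pids
    echo "stopped: $pids"
  else
    echo "no flume agent running with ${conf}"
  fi
}
```

例如 stop_flume flume-file.conf 即可停掉机器 A 的 agent,stop_flume flume-hdfs.conf 停掉机器 B 的 agent。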
实战 05:多 flume 汇总数据到单 flume(需要单 flume 汇聚节点配置 hadoop 集群环境)
(1)流程
- Agent1 监控文件 /data/data.log(exec source - memory channel - avro sink)
- Agent2 监控某一端口数据流 (netcat source - memory channel - avro sink)
- Agent3 实时监控指定目录的文件内容(spooldir source - memory channel - avro sink)
- Agent1、Agent2、Agent3 将数据发送给 Agent4
- Agent4 将最终数据写入到 hdfs(avro source - memory channel - hdfs sink)
(2)编写相应的 agent 配置文件
[root@yz-sre-backup019 job]# vim agent1-exec.conf      // exec source + memory channel + avro sink
[root@yz-sre-backup019 job]# vim agent2-netcat.conf    // netcat source + memory channel + avro sink
[root@yz-sre-backup019 job]# vim agent3-dir.conf       // spooldir source + memory channel + avro sink
[root@yz-sre-backup019 job]# mkdir -pv /data/flume/upload    // 创建测试监视文件夹
[root@yz-bi-web01 job]# vim agent4-hdfs.conf           // avro source + memory channel + hdfs sink
// 各配置文件的结构与实战 04 相同:先命名 agent 组件,再分别配置 source、sink、channel,最后绑定 source/sink 与 channel(正文从略)
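各 agent 的配置与实战 04 同构,这里给出 agent1-exec.conf 的一个极简示意(其中汇聚端的主机 IP 和端口为假设值,实际必须与 agent4 的 avro source 的 bind/port 保持一致):

```properties
# agent1-exec.conf 极简示意(hostname/port 为假设值)
agent1.sources = r1
agent1.sinks = k1
agent1.channels = c1

agent1.sources.r1.type = exec
agent1.sources.r1.command = tail -F /data/data.log
agent1.sources.r1.shell = /bin/bash -c

agent1.sinks.k1.type = avro
agent1.sinks.k1.hostname = 10.20.2.24
agent1.sinks.k1.port = 8888

agent1.channels.c1.type = memory
agent1.channels.c1.capacity = 1000
agent1.channels.c1.transactionCapacity = 100

agent1.sources.r1.channels = c1
agent1.sinks.k1.channel = c1
```

agent2、agent3 只需把 source 部分分别换成 netcat、spooldir 类型;agent4 则与实战 04 机器 B 的 avro source + hdfs sink 写法一致。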
(3)编写相应的 agent 启动脚本
// 四个启动脚本的写法与实战 04 相同:#!/bin/bash 开头,以 nohup flume-ng agent 方式启动,并带上 flume 自身的 http 监控参数(脚本正文从略)
(4)分别启动各 agent 并查看对应的日志信息
// 四个 agent 启动后,各自日志均以下面一行开头(其余输出从略):
Info: Sourcing environment configuration script /root/apps/apache-flume-1.6.0-cdh5.7.0-bin/conf/flume-env.sh
(5)分别进行测试并查看对应的日志信息
// agent1 测试:向被监控的 /data/data.log 追加内容,观察 agent4 日志
2020-04-15 17:16:04,150 (SinkRunner-PollingRunner-DefaultSinkProcessor) [INFO - org.apache.flume.sink.hdfs.HDFSDataStream.configure(HDFSDataStream.java:58)] Serializer = TEXT, UseRawLocalFileSystem = false
// agent2 测试:telnet 连接被监控端口发送数据
Trying 127.0.0.1...
2020-04-15 17:17:07,096 (SinkRunner-PollingRunner-DefaultSinkProcessor) [INFO - org.apache.flume.sink.hdfs.HDFSDataStream.configure(HDFSDataStream.java:58)] Serializer = TEXT, UseRawLocalFileSystem = false
// agent3 测试:向被监控目录 /data/flume/upload 放入文件
2020-04-15 17:26:37,864 (SinkRunner-PollingRunner-DefaultSinkProcessor) [INFO - org.apache.flume.sink.hdfs.HDFSDataStream.configure(HDFSDataStream.java:58)] Serializer = TEXT, UseRawLocalFileSystem = false
2020-04-15 17:27:18,969 (SinkRunner-PollingRunner-DefaultSinkProcessor) [INFO - org.apache.flume.sink.hdfs.HDFSDataStream.configure(HDFSDataStream.java:58)] Serializer = TEXT, UseRawLocalFileSystem = false
// 查看 hdfs 对应目录是否生成相应的日志信息
Found 4 items
// hdfs 目录及文件列表从略,cat 对应文件可看到各 agent 发来的测试数据:
帅飞飞!!!
飞花点点轻!
https://showufei.blog.csdn.net
(6)查看 flume 度量值
// agent1 ~ agent4 分别 curl 各自的 metrics 接口并用 jq 格式化查看度量值,输出结构与实战 04 相同(从略)
(7)测试完删掉 flume 进程并清除 hdfs 上数据
[hadoop@yz-bi-web01 ~]$ hdfs dfs -rm -r -f -skipTrash /flume
Deleted /flume
实战 06:挑选器案例
channel selector:通道挑选器,选择指定的 event 发送到指定的 channel
- Replicating Channel Selector:默认的副本挑选器,event 以副本方式发往每一个 channel,即有几个 channel 就发送几份副本
- Multiplexing Channel Selector:多路复用挑选器,可以按 event 的 header 将不同内容发送到指定的 channel
- 详情参考官方文档:http://flume.apache.org/releases/content/1.9.0/FlumeUserGuide.html#flume-channel-selectors
流程图:(图略)
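其中多路复用挑选器的配置方式可以用下面的片段示意(仅为示意;header 名 state 与取值 CZ、US 借自官方文档的例子):

```properties
# 多路复用挑选器示意:按 event header 中 state 的取值分发到不同 channel
a1.sources = r1
a1.channels = c1 c2 c3
a1.sources.r1.channels = c1 c2 c3
a1.sources.r1.selector.type = multiplexing
a1.sources.r1.selector.header = state
# state=CZ 的 event 走 c1,state=US 的走 c2,其余走默认的 c3
a1.sources.r1.selector.mapping.CZ = c1
a1.sources.r1.selector.mapping.US = c2
a1.sources.r1.selector.default = c3
```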
实战 07:主机拦截器案例
拦截器(interceptor):是 source 端的在处理过程中能够对数据(event)进行修改或丢弃的组件。
常见拦截器有:
- host interceptor:将发送的 event 添加主机名 header
- timestamp interceptor:将发送的 event 添加时间戳的 header
- 更多拦截器可参考官方文档:http://flume.apache.org/releases/content/1.9.0/FlumeUserGuide.html#flume-interceptors
(1)编辑主机拦截器配置文件(案例一)
agent 选型:netcat source + memory channel + logger sink
[root@yz-sre-backup019 job]# vim flume-host_interceptor.conf
[root@yz-sre-backup019 job]# cat flume-host_interceptor.conf
# flume-host_interceptor.conf: A single_node Flume configuration
# Name the components on this agent
wf_host_interceptor.sources = netcat_source
wf_host_interceptor.sinks = logger_sink
wf_host_interceptor.channels = memory_channel
# wf_host_interceptor: agent 的名称; netcat_source: source 的名称; logger_sink: sink 的名称; memory_channel: channel 的名称
# Describe/configure the source
# 配置 source 的类型
wf_host_interceptor.sources.netcat_source.type = netcat
# 配置 source 绑定的主机
wf_host_interceptor.sources.netcat_source.bind = 127.0.0.1
# 配置 source 绑定的主机端口
wf_host_interceptor.sources.netcat_source.port = 8888
# 指定添加拦截器
wf_host_interceptor.sources.netcat_source.interceptors = host_interceptor
wf_host_interceptor.sources.netcat_source.interceptors.host_interceptor.type = org.apache.flume.interceptor.HostInterceptor$Builder
wf_host_interceptor.sources.netcat_source.interceptors.host_interceptor.preserveExisting = false
# 指定 header 的 key
wf_host_interceptor.sources.netcat_source.interceptors.host_interceptor.hostHeader = hostname
# 指定 header 的 value 为主机 IP
wf_host_interceptor.sources.netcat_source.interceptors.host_interceptor.useIP = true
# 配置 sink 的类型,我们这里指定的为 logger,即控制台输出
wf_host_interceptor.sinks.logger_sink.type = logger
# 配置 channel 的类型为 memory
wf_host_interceptor.channels.memory_channel.type = memory
# 配置通道中存储的最大 event 数
wf_host_interceptor.channels.memory_channel.capacity = 1000
# 配置每次事务中 channel 从 source 接收或向 sink 发送的最大 event 数
wf_host_interceptor.channels.memory_channel.transactionCapacity = 100
# 绑定 source 和 sink
# 把 source 和 channel 做关联,其中属性是 channels,说明 sources 可以和多个 channel 做关联
wf_host_interceptor.sources.netcat_source.channels = memory_channel
# 把 sink 和 channel 做关联,只能输出到一个 channel
wf_host_interceptor.sinks.logger_sink.channel = memory_channel
(2)编写启动脚本
[root@yz-sre-backup019 bin]# vim start-host_interceptor.sh
[root@yz-sre-backup019 bin]# chmod +x start-host_interceptor.sh
[root@yz-sre-backup019 bin]# cat start-host_interceptor.sh
#!/bin/bash
# @author: wufei
# @wiki: http://wiki.inf.lehe.com/pages/viewpage.action?pageId=39096985
# @email: wufei@higohappy.com
# @Date: Thu Apr 16 11:33:39 CST 2020
# 启动 flume 自身的监控参数,默认执行以下脚本
# --conf: flume 的配置目录
# --conf-file: 自定义 flume 的 agent 配置文件
# --name: 指定 agent 的名称,与自定义 agent 配置文件中对应,即 wf_host_interceptor
# -Dflume.root.logger: 日志级别和输出形式
nohup flume-ng agent --conf ${FLUME_HOME}/conf --conf-file /data/flume/job/flume-host_interceptor.conf --name wf_host_interceptor -Dflume.monitoring.type=http -Dflume.monitoring.port=10520 -Dflume.root.logger=INFO,console >> /data/flume/log/flume-host_interceptor.log 2>&1 &
(3)启动并连接到指定端口发送测试数据
// 启动 agent 后查看日志
Info: Sourcing environment configuration script /root/apps/apache-flume-1.6.0-cdh5.7.0-bin/conf/flume-env.sh
2020-04-16 12:08:55,492 (conf-file-poller-0) [INFO - org.mortbay.log.Slf4jLog.info(Slf4jLog.java:67)] Started SelectChannelConnector@0.0.0.0:10520
// telnet 连接 8888 端口发送测试数据
Trying 127.0.0.1...
(4)编辑时间戳拦截器配置文件(案例二)
agent 选型:netcat source + memory channel + logger sink
[root@yz-sre-backup019 job]# vim flume-timestamp_interceptor.conf
[root@yz-sre-backup019 job]# cat flume-timestamp_interceptor.conf
# flume-timestamp_interceptor.conf: A single_node Flume configuration
# Name the components on this agent
wf_timestamp_interceptor.sources = netcat_source
wf_timestamp_interceptor.sinks = logger_sink
wf_timestamp_interceptor.channels = memory_channel
# wf_timestamp_interceptor: agent 的名称; netcat_source: source 的名称; logger_sink: sink 的名称; memory_channel: channel 的名称
# Describe/configure the source
# 配置 source 的类型
wf_timestamp_interceptor.sources.netcat_source.type = netcat
# 配置 source 绑定的主机
wf_timestamp_interceptor.sources.netcat_source.bind = 127.0.0.1
# 配置 source 绑定的主机端口
wf_timestamp_interceptor.sources.netcat_source.port = 8888
# 指定添加拦截器
wf_timestamp_interceptor.sources.netcat_source.interceptors = timestamp_interceptor
wf_timestamp_interceptor.sources.netcat_source.interceptors.timestamp_interceptor.type = timestamp
# 配置 sink 的类型,我们这里指定的为 logger,即控制台输出
wf_timestamp_interceptor.sinks.logger_sink.type = logger
# 配置 channel 的类型为 memory
wf_timestamp_interceptor.channels.memory_channel.type = memory
# 配置通道中存储的最大 event 数
wf_timestamp_interceptor.channels.memory_channel.capacity = 1000
# 配置每次事务中 channel 从 source 接收或向 sink 发送的最大 event 数
wf_timestamp_interceptor.channels.memory_channel.transactionCapacity = 100
# 绑定 source 和 sink
# 把 source 和 channel 做关联,其中属性是 channels,说明 sources 可以和多个 channel 做关联
wf_timestamp_interceptor.sources.netcat_source.channels = memory_channel
# 把 sink 和 channel 做关联,只能输出到一个 channel
wf_timestamp_interceptor.sinks.logger_sink.channel = memory_channel
(5)编写启动脚本
[root@yz-sre-backup019 bin]# vim start-timestamp_interceptor.sh
[root@yz-sre-backup019 bin]# chmod +x start-timestamp_interceptor.sh
[root@yz-sre-backup019 bin]# cat start-timestamp_interceptor.sh
#!/bin/bash
# @author: wufei
# @wiki: http://wiki.inf.lehe.com/pages/viewpage.action?pageId=39096985
# @email: wufei@higohappy.com
# @Date: Thu Apr 16 12:26:26 CST 2020
# 启动 flume 自身的监控参数,默认执行以下脚本
# --conf: flume 的配置目录
# --conf-file: 自定义 flume 的 agent 配置文件
# --name: 指定 agent 的名称,与自定义 agent 配置文件中对应,即 wf_timestamp_interceptor
# -Dflume.root.logger: 日志级别和输出形式
nohup flume-ng agent --conf ${FLUME_HOME}/conf --conf-file /data/flume/job/flume-timestamp_interceptor.conf --name wf_timestamp_interceptor -Dflume.monitoring.type=http -Dflume.monitoring.port=10521 -Dflume.root.logger=INFO,console >> /data/flume/log/flume-timestamp_interceptor.log 2>&1 &
(6)启动并连接到指定端口发送测试数据
// 启动 agent 后查看日志
Info: Sourcing environment configuration script /root/apps/apache-flume-1.6.0-cdh5.7.0-bin/conf/flume-env.sh
2020-04-16 12:28:55,062 (conf-file-poller-0) [INFO - org.mortbay.log.Slf4jLog.info(Slf4jLog.java:67)] Started SelectChannelConnector@0.0.0.0:10521
// telnet 连接 8888 端口发送测试数据,logger sink 输出的 event 带上了 timestamp header
Trying 127.0.0.1...
2020-04-16 12:30:15,386 (SinkRunner-PollingRunner-DefaultSinkProcessor) [INFO - org.apache.flume.sink.LoggerSink.process(LoggerSink.java:94)] Event: { headers:{timestamp=1587011415381} body: 53 48 4F 57 75 66 65 69 E3 80 82 2E 2E 2E E3 80 SHOWufei........ }
四、在生产环境的实际应用
待实施...