分布式日志收集框架 Flume NG 实战案例

目录

写在最前之应用场景:

一、Flume 架构和核心组件

1、Event 的概念

2、Flume 架构

3、Flume 的运行机制

4、Flume 的广义用法

二、Flume 环境搭建

1、前置条件

2、搭建

(1)下载 flume-ng-1.6.0-cdh5.7.0.tar.gz

(2)上传到服务器,并解压

(3)配置环境变量

(4)在 flume-env.sh 中配置 Java JDK 的路径

三、Flume 实战

实战 01:从指定网络端口采集数据输出到控制台

(1)自定义 flume 的配置文件存放目录

(2)配置 agent

(3)启动 agent

(4)连接测试

(5)编写 flume 的启动脚本(生产环境推荐使用这种方式)

(6)启动 flume 并查看启动日志信息

(7)查看 jq 安装包是否存在,安装 jq 工具,便于查看 json 格式的内容

(8)查看 flume 度量值

(9)删掉对应的 flume 进程

实战 02:监控一个文件实时采集新增的数据输出到本地文件中

(1)agent 的选型

(2)配置 agent

(3)编写 agent 的启动脚本(生产环境推荐使用这种方式)

(4)启动 agent 并测试往监听文件中输入数据

实战 03:实时读取本地文件到 HDFS 中(需要 flume 节点配置 Hadoop 集群环境)

(1)agent 选型

(2)配置 agent

(3)编写启动脚本并启动 flume

(4)查看 flume 日志收集信息

(5)查看 hdfs 对应目录是否生成相应的日志信息

(6)Browsing HDFS

(7)查看 flume 度量值

(8)测试完后删掉 flume 进程

(9)清除 hdfs 上数据

实战 04:从服务器 A 收集数据到服务器 B 并上传到 HDFS(需要服务器 B 节点配置 Hadoop 集群环境)

(1)机器 A 配置

(2)机器 B 配置

(3)编写启动脚本

(4)启动脚本并查看对应的日志信息

(5)查看 hdfs 对应目录是否生成相应的日志信息

(6)查看 flume 度量值

(7)测试完删掉 flume 进程并清除 hdfs 上数据

实战 05:多 flume 汇总数据到单 flume(需要单 flume 汇聚节点配置 hadoop 集群环境)

(1)流程

(2)编写相应的 agent 配置文件

(3)编写相应的 agent 启动脚本

(4)分别启动各 agent 并查看对应的日志信息

(5)分别进行测试并查看对应的日志信息

(6)查看 flume 度量值

(7)测试完删掉 flume 进程并清除 hdfs 上数据

实战 06:挑选器案例

实战 07:主机拦截器案例

(1)编辑主机拦截器配置文件(案例一)

(2)编写启动脚本

(3)启动并连接到指定端口发送测试数据

(4)编辑时间戳拦截器配置文件(案例二)

(5)编写启动脚本

(6)启动并连接到指定端口发送测试数据

四、在生产环境的实际应用


写在最前之应用场景:

flume 在大数据中扮演着数据收集的角色,收集到数据以后再通过计算框架进行处理。flume 是 Cloudera 提供的一个高可用的、高可靠的、分布式的海量日志采集、聚合和传输的系统,flume 支持在日志系统中定制各类数据发送方,用于收集数据;同时,flume 提供对数据进行简单处理,并写到各种数据接收方(可定制)的能力。

Flume 作为 Hadoop 生态中的日志采集工具非常好用,但在安装 Flume 时查阅资料会发现说法不一:有的说安装 Flume 很简单,解压即可用;有的说安装 Flume 很复杂,需要先依赖 zookeeper。那么为何会产生这种情况呢?其实两种说法都对,只是针对的 Flume 版本不同。

背景介绍:Cloudera 开发的分布式日志收集系统 Flume,是 Hadoop 周边组件之一,可以实时地将分布在不同节点、机器上的日志收集到 hdfs 中。Flume 初始的发行版本目前被统称为 Flume OG(original generation),属于 cloudera。但随着 Flume 功能的扩展,Flume OG 代码工程臃肿、核心组件设计不合理、核心配置不标准等缺点暴露出来,尤其是在 Flume OG 的最后一个发行版本 0.94.0 中,日志传输不稳定的现象尤为严重,这点可以在 BigInsights 产品文档的 troubleshooting 板块发现。为了解决这些问题,2011 年 10 月 22 日,cloudera 完成了 Flume-728,对 Flume 进行了里程碑式的改动:重构核心组件、核心配置以及代码架构,重构后的版本统称为 Flume NG(next generation);改动的另一原因是将 Flume 纳入 apache 旗下,cloudera Flume 改名为 Apache Flume。

Flume OG 有三种角色的节点:代理节点(agent)、收集节点(collector)、主节点(master)。

Flume NG 只有一种角色的节点:代理节点(agent)。

Flume OG vs Flume NG:

  • 在 OG 版本中,Flume 的使用稳定性依赖 zookeeper。它需要 zookeeper 对其多类节点(agent、collector、master)的工作进行管理,尤其是在集群中配置多个 master 的情况下。当然,OG 也可以用内存的方式管理各类节点的配置信息,但是需要用户能够忍受在机器出现故障时配置信息丢失。所以说 OG 的稳定使用是依赖 zookeeper 的。
  • 而在 NG 版本中,节点角色的数量由 3 缩减到 1,不存在多类角色的问题,所以就不再需要 zookeeper 来协调各类节点,由此脱离了对 zookeeper 的依赖。相比之下,OG 对 zookeeper 的依赖贯穿整个配置和使用过程,这就要求用户掌握 zookeeper 集群的搭建及使用。
  • OG 在安装时:在 flume-env.sh 中设置 $JAVA_HOME。 需要配置文件 flume-conf.xml。其中最主要的、必须的配置与 master 有关。集群中的每个 Flume 都需要配置 master 相关属性(如 flume.master.servers、flume.master.store、flume.master.serverid)。 如果想稳定使用 Flume 集群,还需要安装 zookeeper 集群,这需要用户对 zookeeper 有较深入的了解。 安装 zookeeper 之后,需要配置 flume-conf.xml 中的相关属性,如 flume.master.zk.use.external、flume.master.zk.servers。 在使用 OG 版本传输数据之前,需要启动 master、agent。
  • NG 在安装时,只需要在 flume-env.sh 中设置 $JAVA_HOME。

所以,当我们使用 Flume 的时候,一般都采用 Flume NG。

一、Flume 架构和核心组件

1、Event 的概念

flume 的核心是把数据从数据源(source)收集过来,再将收集到的数据送到指定的目的地(sink)。为了保证输送的过程一定成功,在送到目的地(sink)之前,会先缓存数据(channel),待数据真正到达目的地(sink)后,flume 再删除自己缓存的数据。在整个数据传输过程中,流动的是 event,即事务保证是在 event 级别进行的。

那么什么是 event 呢?

event 将传输的数据进行封装,是 flume 传输数据的基本单位,如果是文本文件,通常是一行记录;event 也是事务的基本单位。event 从 source 流向 channel,再到 sink,本身为一个字节数组,并可携带 headers(头信息),event 代表着一个数据的最小完整单元,从外部数据源来,向外部的目的地去。
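直观一点看,后文实战 01 中 logger sink 在控制台打印出来的就是一条条 event,形如下面这样(摘自后文的实际输出,headers 为头信息,body 为字节数组,右侧是对应的可读内容):

Event: { headers:{} body: 57 65 6C 63 6F 6D 65 20 74 6F 20 42 65 69 6A 69 Welcome to Beiji }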

 

2、Flume 架构

 

flume 之所以如此灵活,源于它自身的核心设计:agent。agent 本身是一个 Java 进程,运行在日志收集节点(即产生日志的服务器节点)上。

agent 里面包含 3 个核心的组件:source➡️channel➡️sink,类似生产者、仓库、消费者的架构。

  • source:source 组件是专门用来收集数据的,可以处理各种类型、各种格式的日志数据,包括 avro、thrift、exec、jms、spooling directory、netcat、sequence generator、syslog、http、legacy、自定义。
  • channel:source 组件把数据收集来以后,临时存放在 channel 中,即 channel 组件在 agent 中是专门用来存放临时数据的(对采集到的数据进行简单的缓存,可以放在 memory、jdbc、file 等等)。
  • sink:sink 组件是用于把数据发送到目的地的组件,目的地包括 hdfs、logger、avro、thrift、ipc、file、null、hbase、solr、自定义。

3、Flume 的运行机制

flume 的核心就是一个 agent,这个 agent 对外有两个进行交互的地方,一个是接收数据的输入(source),一个是数据的输出(sink),sink 负责将数据发送到外部指定的目的地。source 接收到数据之后,将数据发送给 channel,channel 作为一个数据缓冲区会临时存放这些数据,随后 sink 会将 channel 中的数据发送到指定的地方(如 HDFS 等)。

注意:只有在 sink 将 channel 中的数据成功发送出去之后,channel 才会将临时数据进行删除,这种机制保证了数据传输的可靠性与安全性。

4、Flume 的广义用法

flume 可以支持多级 flume 的 agent,即 flume 可以前后相继,例如 sink 可以将数据写到下一个 agent 的 source 中,这样就可以把多个 agent 连成串、作为一个整体来处理。flume 还支持扇入(fan-in)、扇出(fan-out):所谓扇入就是 source 可以接收多个输入,所谓扇出就是 sink 可以将数据输出到多个目的地(destination)中。

值得注意的是,flume 提供了大量内置的 source、channel 和 sink 类型。不同类型的 source、channel 和 sink 可以自由组合。组合方式基于用户设置的配置文件,非常灵活。

举个例子:channel 可以把事件暂存在内存里,也可以持久化到本地硬盘上。sink 可以把日志写入 HDFS、HBase,甚至是另一个 source 等等。
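下面给出一个扇出(fan-out)配置的最小示意(agent 名称 a1 及组件名称均为假设,仅用来说明一个 source 如何同时关联多个 channel;replicating 是 Flume 默认的 channel 选择器,会把同一条 event 复制到每个 channel):

# 一个 source、两个 channel、两个 sink 的扇出示意
a1.sources = r1
a1.channels = c1 c2
a1.sinks = k1 k2
# source 同时关联两个 channel,即扇出
a1.sources.r1.channels = c1 c2
# replicating 选择器(默认)把同一条 event 复制到所有 channel
a1.sources.r1.selector.type = replicating
# 每个 sink 只能关联一个 channel
a1.sinks.k1.channel = c1
a1.sinks.k2.channel = c2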

flume 支持用户建立多级流,也就是说,多个 agent 可以协同工作,并且支持 fan-in、fan-out、contextual routing、backup routes。如下图👇所示:

 

二、Flume 环境搭建

1、前置条件

  • flume 需要 Java 1.7 及以上(推荐 1.8)
  • 足够的内存和磁盘空间
  • 对 agent 监控目录的读写权限

2、搭建

(1)下载 flume-ng-1.6.0-cdh5.7.0.tar.gz

下载地址01:https://download.csdn.net/download/weixin_42018518/12314171,Flume-ng-1.6.0-cdh.zip 内压缩了 3 个项目,分别为:flume-ng-1.6.0-cdh5.5.0.tar.gz、flume-ng-1.6.0-cdh5.7.0.tar.gz 和 flume-ng-1.6.0-cdh5.10.1.tar.gz,选择你需要的版本,我们暂时选择 cdh5.7.0 这个版本。

下载地址 02:wget http://mirrors.tuna.tsinghua.edu.cn/apache/flume/1.9.0/apache-flume-1.9.0-bin.tar.gz

(2)上传到服务器,并解压

知识点:Mac iTerm2 安装 lrzsz(rz、sz 命令)

[root@yz-sre-backup019 ~]# cd
[root@yz-sre-backup019 ~]# mkdir apps
[root@yz-sre-backup019 ~]# cd /data/soft
[root@yz-sre-backup019 soft]# tar -zxvf flume-ng-1.6.0-cdh5.7.0.tar.gz -C /root/apps/

(3)配置环境变量

[root@yz-sre-backup019 apps]# vim ~/.bash_profile
    # 配置 Flume 的路径,根据自己安装的路径进行修改
    export FLUME_HOME=/root/apps/apache-flume-1.6.0-cdh5.7.0-bin
    export PATH=$FLUME_HOME/bin:$PATH
  
// 使配置文件生效
[root@yz-sre-backup019 apps]# source ~/.bash_profile

(4)在 flume-env.sh 中配置 Java JDK 的路径

a. 首先下载最新稳定 JDK:

注意:JDK 安装在哪个用户下,就给哪个用户使用

当前最新版下载地址:https://www.oracle.com/java/technologies/javase-downloads.html

 

b. 将下载的 JDK 上传到 /data/soft 下,并修改文件权限:

[root@yz-sre-backup019 soft]# rz
[root@yz-sre-backup019 soft]# chmod 755 jdk-8u241-linux-x64.tar.gz

c. 解压 JDK 到 /usr/:

[root@yz-sre-backup019 soft]# tar -zxvf jdk-8u241-linux-x64.tar.gz -C /usr/

d. 配置 JDK 环境变量:

[root@yz-sre-backup019 soft]# vim /etc/profile
# Java environment
    export JAVA_HOME=/usr/jdk1.8.0_241
    export CLASSPATH=.:${JAVA_HOME}/jre/lib/rt.jar:${JAVA_HOME}/lib/dt.jar:${JAVA_HOME}/lib/tools.jar
    export PATH=$PATH:${JAVA_HOME}/bin
  
// 使 JDK 配置文件生效
[root@yz-sre-backup019 soft]# source /etc/profile

e. 检查 JDK 是否安装成功:

[root@yz-sre-backup019 soft]# java -version
java version "1.8.0_241"
Java(TM) SE Runtime Environment (build 1.8.0_241-b07)
Java HotSpot(TM) 64-Bit Server VM (build 25.241-b07, mixed mode)

f. 在 flume-env.sh 中配置 Java JDK 的路径:

[root@yz-sre-backup019 apps]# cd $FLUME_HOME/conf
// 复制模板
[root@yz-sre-backup019 conf]# cp flume-env.sh.template flume-env.sh
[root@yz-sre-backup019 conf]# vim flume-env.sh
// 配置 Java 目录,在末尾新增一行
    export JAVA_HOME=/usr/jdk1.8.0_241

g. 检测

// 在 flume 的 bin 目录下执行 flume-ng version 可查看版本
[root@yz-sre-backup019 bin]# cd $FLUME_HOME/bin
[root@yz-sre-backup019 bin]# flume-ng version
// 出现以下内容,说明安装成功
Flume 1.6.0-cdh5.7.0
Source code repository: https://git-wip-us.apache.org/repos/asf/flume.git
Revision: 8f5f5143ae30802fe79f9ab96f893e6c54a105d1
Compiled by jenkins on Wed Mar 23 11:38:48 PDT 2016
From source with checksum 50b533f0ffc32db9246405ac4431872e

三、Flume 实战

使用 flume 的关键就是写配置文件,主要分为以下四步(列表后给出一个对应的配置骨架示意):

  1. 配置 source
  2. 配置 channel
  3. 配置 sink
  4. 把以上三个组件串起来
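对应这四步,可以先记住下面这个与具体 source/sink 类型无关的配置骨架(agent 名称 a1、组件名称 r1/k1/c1 均为示意,尖括号处按需替换):

# 第 1 步:配置 source
a1.sources = r1
a1.sources.r1.type = <source 类型>
# 第 2 步:配置 channel
a1.channels = c1
a1.channels.c1.type = memory
# 第 3 步:配置 sink
a1.sinks = k1
a1.sinks.k1.type = <sink 类型>
# 第 4 步:把以上三个组件串起来
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1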

实战 01:从指定网络端口采集数据输出到控制台

(1)自定义 flume 的配置文件存放目录

[root@yz-sre-backup019 data]# mkdir -pv /data/flume/{log,job,bin}
mkdir: created directory `/data/flume'
mkdir: created directory `/data/flume/log'
mkdir: created directory `/data/flume/job'
mkdir: created directory `/data/flume/bin'
[root@yz-sre-backup019 data]# cd flume/
[root@yz-sre-backup019 flume]# ll
total 12
drwxr-xr-x 2 root root 4096 Apr 10 10:34 bin    # 用于存放启动脚本
drwxr-xr-x 2 root root 4096 Apr 10 10:34 job    # 用于存放 flume 启动 agent 的配置文件
drwxr-xr-x 2 root root 4096 Apr 10 10:34 log    # 用于存放运行日志

(2)配置 agent

agent 选型:netcat source + memory channel + logger sink

在 /data/flume 的 job 目录下新建 flume-netcat.conf 文件(目录和文件名可以自定义,只需在后续启动 agent 时指定即可)。

# flume-netcat.conf: A single_node Flume configuration
  
# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1
# a1: agent 的名称; r1: source 的名称; k1: sink 的名称; c1: channel 的名称
  
# Describe/configure the source
# 配置 source 的类型
a1.sources.r1.type = netcat
# 配置 source 绑定的主机
a1.sources.r1.bind = localhost
# 配置 source 绑定的主机端口
a1.sources.r1.port = 8888
  
# 指定 sink 的类型,我们这里指定的为 logger,即控制台输出
# 配置 sink 的类型
a1.sinks.k1.type = logger
  
# 指定 channel 的类型为 memory,指定 channel 的容量是 1000,每次传输的容量是 100
# 配置 channel 的类型
a1.channels.c1.type = memory
# 配置通道中存储的最大 event 数
a1.channels.c1.capacity = 1000
# 配置每个事务中 channel 从 source 接收或向 sink 提供的最大 event 数
a1.channels.c1.transactionCapacity = 100
  
# 绑定 source 和 sink
# 把 source 和 channel 做关联,其中属性是 channels,说明 sources 可以和多个 channel 做关联
a1.sources.r1.channels = c1
# 把 sink 和 channel 做关联,只能输出到一个 channel
a1.sinks.k1.channel = c1

(3)启动 agent

[root@yz-sre-backup019 ~]# flume-ng agent --conf ${FLUME_HOME}/conf --name a1 --conf-file /data/flume/job/flume-netcat.conf -Dflume.monitoring.type=http -Dflume.monitoring.port=10501 -Dflume.root.logger=INFO,console
// 启动后另开窗口进行下面的测试

(4)连接测试

// 如果不能 telnet,记得先安装
[root@yz-sre-backup019 ~]# yum -y install telnet net-tools
[root@yz-sre-backup019 ~]# ss -ntl
State      Recv-Q Send-Q                                            Local Address:Port                                              Peer Address:Port
LISTEN     0      128                                                           *:22                                                           *:*
LISTEN     0      50                                                    127.0.0.1:8888                                                         *:*
LISTEN     0      100                                                   127.0.0.1:25                                                           *:*
LISTEN     0      128                                                           *:1988                                                         *:*
LISTEN     0      50                                                            *:10501                                                        *:*
[root@yz-sre-backup019 ~]# telnet 127.0.0.1 8888
Trying 127.0.0.1...
Connected to 127.0.0.1.
Escape character is '^]'.
飞花点点轻!
OK
Welcome to Beijing...。。
OK
Python
OK
SHOWufei
OK

(5)编写 flume 的启动脚本(生产环境推荐使用这种方式)

[root@yz-sre-backup019 flume]# cd /data/flume/bin
[root@yz-sre-backup019 bin]# vim start-netcat.sh
[root@yz-sre-backup019 bin]# chmod +x start-netcat.sh
#!/bin/bash
# @author: wufei
# @wiki: http://wiki.inf.lehe.com/pages/viewpage.action?pageId=39096985
# @email: wufei@higohappy.com
# @Date: Fri Apr 10 11:13:11 CST 2020
 
# 启动 flume 自身的监控参数,默认执行以下脚本
# --conf: flume 的配置目录
# --conf-file: 自定义 flume 的 agent 配置文件
# --name: 指定 agent 的名称,与自定义 agent 配置文件中对应,即 a1
# -Dflume.root.logger: 日志级别和输出形式
nohup flume-ng agent --conf ${FLUME_HOME}/conf --conf-file=/data/flume/job/flume-netcat.conf --name a1 -Dflume.monitoring.type=http -Dflume.monitoring.port=10501 -Dflume.root.logger=INFO,console >> /data/flume/log/flume-netcat.log 2>&1 &

(6)启动 flume 并查看启动日志信息

[root@yz-sre-backup019 bin]# ss -ntl
State      Recv-Q Send-Q                                            Local Address:Port                                              Peer Address:Port
LISTEN     0      128                                                           *:22                                                           *:*
LISTEN     0      100                                                   127.0.0.1:25                                                           *:*
LISTEN     0      128                                                           *:1988                                                         *:*
[root@yz-sre-backup019 bin]# bash start-netcat.sh
[root@yz-sre-backup019 bin]#
[root@yz-sre-backup019 bin]# ss -ntl
State      Recv-Q Send-Q                                            Local Address:Port                                              Peer Address:Port
LISTEN     0      128                                                           *:22                                                           *:*
LISTEN     0      50                                                    127.0.0.1:8888                                                         *:*
LISTEN     0      100                                                   127.0.0.1:25                                                           *:*
LISTEN     0      128                                                           *:1988                                                         *:*
LISTEN     0      50                                                            *:10501                                                        *:*
[root@yz-sre-backup019 log]# telnet 127.0.0.1 8888
Trying 127.0.0.1...
Connected to 127.0.0.1.
Escape character is '^]'.
飞花点点轻!
OK
Welcome to Beijing...。。
OK
Python
OK
SHOWufei
OK
[root@yz-sre-backup019 log]# tail -f flume-netcat.log
2020-04-10 15:31:36,708 (conf-file-poller-0) [INFO - org.apache.flume.node.Application.startAllComponents(Application.java:145)] Starting Channel c1
2020-04-10 15:31:36,832 (lifecycleSupervisor-1-0) [INFO - org.apache.flume.instrumentation.MonitoredCounterGroup.register(MonitoredCounterGroup.java:120)] Monitored counter group for type: CHANNEL, name: c1: Successfully registered new MBean.
2020-04-10 15:31:36,833 (lifecycleSupervisor-1-0) [INFO - org.apache.flume.instrumentation.MonitoredCounterGroup.start(MonitoredCounterGroup.java:96)] Component type: CHANNEL, name: c1 started
2020-04-10 15:31:36,837 (conf-file-poller-0) [INFO - org.apache.flume.node.Application.startAllComponents(Application.java:173)] Starting Sink k1
2020-04-10 15:31:36,838 (conf-file-poller-0) [INFO - org.apache.flume.node.Application.startAllComponents(Application.java:184)] Starting Source r1
2020-04-10 15:31:36,839 (lifecycleSupervisor-1-0) [INFO - org.apache.flume.source.NetcatSource.start(NetcatSource.java:155)] Source starting
2020-04-10 15:31:36,865 (lifecycleSupervisor-1-0) [INFO - org.apache.flume.source.NetcatSource.start(NetcatSource.java:169)] Created serverSocket:sun.nio.ch.ServerSocketChannelImpl[/127.0.0.1:8888]
2020-04-10 15:31:36,883 (conf-file-poller-0) [INFO - org.mortbay.log.Slf4jLog.info(Slf4jLog.java:67)] Logging to org.slf4j.impl.Log4jLoggerAdapter(org.mortbay.log) via org.mortbay.log.Slf4jLog
2020-04-10 15:31:36,944 (conf-file-poller-0) [INFO - org.mortbay.log.Slf4jLog.info(Slf4jLog.java:67)] jetty-6.1.26.cloudera.4
2020-04-10 15:31:36,979 (conf-file-poller-0) [INFO - org.mortbay.log.Slf4jLog.info(Slf4jLog.java:67)] Started SelectChannelConnector@0.0.0.0:10501
2020-04-10 15:32:30,851 (SinkRunner-PollingRunner-DefaultSinkProcessor) [INFO - org.apache.flume.sink.LoggerSink.process(LoggerSink.java:94)] Event: { headers:{} body: E9 A3 9E E8 8A B1 E7 82 B9 E7 82 B9 E8 BD BB 21 ...............! }
2020-04-10 15:32:39,853 (SinkRunner-PollingRunner-DefaultSinkProcessor) [INFO - org.apache.flume.sink.LoggerSink.process(LoggerSink.java:94)] Event: { headers:{} body: 57 65 6C 63 6F 6D 65 20 74 6F 20 42 65 69 6A 69 Welcome to Beiji }
2020-04-10 15:32:55,060 (SinkRunner-PollingRunner-DefaultSinkProcessor) [INFO - org.apache.flume.sink.LoggerSink.process(LoggerSink.java:94)] Event: { headers:{} body: 50 79 74 68 6F 6E 0D                            Python. }
2020-04-10 15:33:00,781 (SinkRunner-PollingRunner-DefaultSinkProcessor) [INFO - org.apache.flume.sink.LoggerSink.process(LoggerSink.java:94)] Event: { headers:{} body: 53 48 4F 57 75 66 65 69 0D                      SHOWufei. }

(7)查看 jq 安装包是否存在,安装 jq 工具,便于查看 json 格式的内容

[root@yz-sre-backup019 soft]# wget -O jq https://github.com/stedolan/jq/releases/download/jq-1.6/jq-linux64
[root@yz-sre-backup019 soft]# chmod +x ./jq
[root@yz-sre-backup019 soft]# cp jq /usr/bin

(8)查看 flume 度量值

 [root@yz-sre-backup019 ~]# curl http://127.0.0.1:10501/metrics | jq
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
129   259    0   259    0     0  25951      0 --:--:-- --:--:-- --:--:-- 37000
{
  "CHANNEL.c1": {                           # 这是 c1 的 CHANEL 监控数据,c1 该名称在 flume-netcat.conf 中配置文件中定义的
    "ChannelCapacity": "1000",              # channel 的容量,目前仅支持 File Channel、Memory channel 的统计数据
    "ChannelFillPercentage": "0.4",         # channel 已填入的百分比
    "Type": "CHANNEL",                      # 很显然,这里是CHANNEL监控项,类型为 CHANNEL
    "EventTakeSuccessCount": "0",           # sink 成功从 channel 读取事件的总数量
    "ChannelSize": "4",                     # 目前channel 中事件的总数量,目前仅支持 File Channel、Memory channel 的统计数据
    "EventTakeAttemptCount": "0",           # sink 尝试从 channel 拉取事件的总次数。这不意味着每次时间都被返回,因为 sink 拉取的时候 channel 可能没有任何数据
    "StartTime": "1586489375175",           # channel 启动时的毫秒值时间
    "EventPutAttemptCount": "4",            # Source 尝试写入 Channe 的事件总次数
    "EventPutSuccessCount": "4",            # 成功写入 channel 且提交的事件总次数
    "StopTime": "0"                         # channel 停止时的毫秒值时间,为 0 表示一直在运行
  }
}
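补充:结合 jq 还可以只取出某一个度量值,例如只看 channel 当前堆积的 event 数(命令仅为示意,端口与组件名称沿用本例):

// 只输出 CHANNEL.c1 的 ChannelSize
curl -s http://127.0.0.1:10501/metrics | jq '.["CHANNEL.c1"].ChannelSize'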

小提示:如果还想了解更多度量值,可参考官方文档:http://flume.apache.org/FlumeUserGuide.html#monitoring

(9)删掉对应的 flume 进程

[root@yz-sre-backup019 ~]# ss -ntl
State      Recv-Q Send-Q                                            Local Address:Port                                              Peer Address:Port
LISTEN     0      128                                                           *:22                                                           *:*
LISTEN     0      50                                                    127.0.0.1:8888                                                         *:*
LISTEN     0      100                                                   127.0.0.1:25                                                           *:*
LISTEN     0      128                                                           *:1988                                                         *:*
LISTEN     0      50                                                            *:10501                                                        *:*
[root@yz-sre-backup019 ~]# netstat -untalp  | grep 8888
tcp        0      0 127.0.0.1:8888              0.0.0.0:*                   LISTEN      4565/java
[root@yz-sre-backup019 ~]# kill 4565
[root@yz-sre-backup019 ~]# netstat -untalp  | grep 8888
[root@yz-sre-backup019 ~]# ss -ntl
State      Recv-Q Send-Q                                            Local Address:Port                                              Peer Address:Port
LISTEN     0      128                                                           *:22                                                           *:*
LISTEN     0      100                                                   127.0.0.1:25                                                           *:*
LISTEN     0      128                                                           *:1988                                                         *:*
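补充:除了按端口反查 PID,也可以按配置文件名查找并结束对应的 flume 进程,效果等同于 kill 对应的 PID(以下为通用命令,并非本文的实际操作记录):

// 按配置文件名查找对应的 flume 进程
ps -ef | grep flume-netcat.conf | grep -v grep
// 确认无误后结束该进程
pkill -f flume-netcat.conf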

实战 02:监控一个文件实时采集新增的数据输出到本地文件中

(1)agent 的选型

exec source + memory channel + file_roll sink

(2)配置 agent

在 /data/flume 的 job 目录下新建 flume-file.conf 文件(目录和文件名可以自定义,只需在后续启动 agent 时指定即可)。

# flume-file.conf: A single_node Flume configuration
  
# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1
# a1: agent 的名称; r1: source 的名称; k1: sink 的名称; c1: channel 的名称
  
# Describe/configure the source
# 配置 source 的类型
a1.sources.r1.type = exec
# 配置 source 执行的命令
a1.sources.r1.command = tail -F /data/data.log
# 配置 source 让 bash 将一个字符串作为完整的命令来执行
a1.sources.r1.shell = /bin/bash -c
  
# 指定 sink 的类型,我们这里指定的为 file_roll,即本地文件输出
# 配置 sink 的类型,将数据传输到本地文件,需要设置文件路径
a1.sinks.k1.type = file_roll
# 配置 sink 输出到本地的路径
a1.sinks.k1.sink.directory = /data/flume/data
  
# 配置 channel 的类型
a1.channels.c1.type = memory
# 配置通道中存储的最大 event 数
a1.channels.c1.capacity = 1000
# 配置每个事务中 channel 从 source 接收或向 sink 提供的最大 event 数
a1.channels.c1.transactionCapacity = 100
  
# 绑定 source 和 sink
# 把 source 和 channel 做关联,其中属性是 channels,说明 sources 可以和多个 channel 做关联
a1.sources.r1.channels = c1
# 把 sink 和 channel 做关联,只能输出到一个 channel
a1.sinks.k1.channel = c1
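注意:file_roll sink 一般不会自动创建输出目录,启动前需要保证 /data/flume/data 已存在;另外它默认每 30 秒滚动生成一个新文件(对应属性 a1.sinks.k1.sink.rollInterval,设为 0 表示不按时间滚动),可按需调整。

// 提前创建 file_roll sink 的输出目录
mkdir -p /data/flume/data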

(3)编写 agent 的启动脚本(生产环境推荐使用这种方式)

[root@yz-sre-backup019 flume]# cd /data/flume/bin
[root@yz-sre-backup019 bin]# vim start-file.sh
[root@yz-sre-backup019 bin]# chmod +x start-file.sh
#!/bin/bash
# @author: wufei
# @wiki: http://wiki.inf.lehe.com/pages/viewpage.action?pageId=39096985
# @email: wufei@higohappy.com
# @Date: Fri Apr 10 15:05:56 CST 2020
 
# 启动 flume 自身的监控参数,默认执行以下脚本
# --conf: flume 的配置目录
# --conf-file: 自定义 flume 的 agent 配置文件
# --name: 指定 agent 的名称,与自定义 agent 配置文件中对应,即 a1
# -Dflume.root.logger: 日志级别和输出形式
nohup flume-ng agent --conf ${FLUME_HOME}/conf --name a1 --conf-file /data/flume/job/flume-file.conf -Dflume.monitoring.type=http -Dflume.monitoring.port=10501 -Dflume.root.logger=INFO,console >> /data/flume/log/flume-file.log 2>&1 &

(4)启动 agent 并测试往监听文件中输入数据

在 /data/flume/data 中生成的文件中查看 event 数据,测试完后删掉 flume 进程。

[root@yz-sre-backup019 bin]# bash start-file.sh
[root@yz-sre-backup019 bin]# cd /data
[root@yz-sre-backup019 data]# echo "帅飞飞!!!" >> data.log
[root@yz-sre-backup019 bin]# cd /data/flume/data
[root@yz-sre-backup019 data]# tail 1586513526197-4
帅飞飞!!!
[root@yz-sre-backup019 data]# ps -ef | grep flume
root     17682     1  1 18:12 pts/1    00:00:03 /usr/jdk1.8.0_241/bin/java -Xmx20m -Dflume.monitoring.type=http -Dflume.monitoring.port=10501 -Dflume.root.logger==INFO,console -cp /root/apps/apache-flume-1.6.0-cdh5.7.0-bin/conf:/root/apps/apache-flume-1.6.0-cdh5.7.0-bin/lib/*:/lib/* -Djava.library.path= org.apache.flume.node.Application --name a1 --conf-file /data/flume/job/flume-file.conf
root     17970   956  0 18:16 pts/0    00:00:00 grep flume
[root@yz-sre-backup019 data]# kill 17682
[root@yz-sre-backup019 data]# ss -ntl
State      Recv-Q Send-Q                                            Local Address:Port                                              Peer Address:Port
LISTEN     0      128                                                           *:22                                                           *:*
LISTEN     0      100                                                   127.0.0.1:25                                                           *:*
LISTEN     0      128                                                           *:1988                                                         *:*

实战 03:实时读取本地文件到 HDFS 中(需要 flume 节点配置 Hadoop 集群环境)

上面两个案例把数据输出到控制台或本地文件,主要用于演示;实际需求往往是输出到 hdfs 中。为此只需要改动 agent 的配置,把 sink 的类型改为 hdfs,然后指定 hdfs 的 url 和写入路径即可。

 

(1)agent 选型

exec source + memory channel + hdfs sink

(2)配置 agent

# flume-hdfs.conf: A single_node Flume configuration
  
# Name the components on this agent
wufei03.sources = file_source
wufei03.sinks = hdfs_sink
wufei03.channels = memory_channel
# wufei03: agent 的名称; file_source: source 的名称; hdfs_sink: sink 的名称; memory_channel: channel 的名称
  
# Describe/configure the source
# 配置 source 的类型
wufei03.sources.file_source.type = exec
# 配置 source 执行的命令
# wufei03.sources.file_source.command = tail -F /data/messages
wufei03.sources.file_source.command = tail -F /data/data.log
# 配置 source 让 bash 将一个字符串作为完整的命令来执行
wufei03.sources.file_source.shell = /bin/bash -c
  
# 指定 sink 的类型,我们这里指定的为 hdfs
# 配置 sink 的类型,将数据传输到 HDFS 集群
wufei03.sinks.hdfs_sink.type = hdfs
# 配置 sink 输出到本 hdfs 的 url 和写入路径
wufei03.sinks.hdfs_sink.hdfs.path = hdfs://yz-higo-nn1:9000/flume/dt=%Y-%m-%d/%H
# 上传文件的前缀
wufei03.sinks.hdfs_sink.hdfs.filePrefix = gz_10.20.2.24-
# 是否按照时间滚动文件夹
wufei03.sinks.hdfs_sink.hdfs.round = true
# 多少时间单位创建一个文件夹
wufei03.sinks.hdfs_sink.hdfs.roundValue = 1
# 重新定义时间单位
wufei03.sinks.hdfs_sink.hdfs.roundUnit = hour
# 是否使用本地时间戳
wufei03.sinks.hdfs_sink.hdfs.useLocalTimeStamp = true
# 积攒多少个 event 才 flush 到 hdfs 一次
wufei03.sinks.hdfs_sink.hdfs.batchSize = 1000
# 设置文件类型,可支持压缩
wufei03.sinks.hdfs_sink.hdfs.fileType = DataStream
# 多久生成一个新文件
wufei03.sinks.hdfs_sink.hdfs.rollInterval = 600
# 设置每个文件的滚动大小
wufei03.sinks.hdfs_sink.hdfs.rollSize = 134217700
# 文件的滚动与 event 数量无关
wufei03.sinks.hdfs_sink.hdfs.rollCount = 0
# 最小副本数
wufei03.sinks.hdfs_sink.hdfs.minBlockReplicas = 1
  
# 配置 channel 的类型
wufei03.channels.memory_channel.type = memory
# 配置通道中存储的最大 event 数
wufei03.channels.memory_channel.capacity = 1000
# 配置每个事务中 channel 从 source 接收或向 sink 提供的最大 event 数
wufei03.channels.memory_channel.transactionCapacity = 1000
  
# 绑定 source 和 sink
# 把 source 和 channel 做关联,其中属性是 channels,说明 sources 可以和多个 channel 做关联
wufei03.sources.file_source.channels = memory_channel
# 把 sink 和 channel 做关联,只能输出到一个 channel
wufei03.sinks.hdfs_sink.channel = memory_channel
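小提示:hdfs sink 依赖 Hadoop 的客户端类库,flume-ng 启动脚本会通过 PATH 中的 hadoop 命令自动把相关 jar 加入 classpath(对应后文日志中的 "Including Hadoop libraries found via ..."),启动前可以先做几项常规检查:

// 确认本节点能找到 hadoop 命令及其版本
which hadoop
hadoop version
// 确认可以访问目标 hdfs
hdfs dfs -ls /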

(3)编写启动脚本并启动 flume

[root@yz-sre-backup019 flume]# cd /data/flume/bin
[root@yz-sre-backup019 bin]# vim start-hdfs.sh
[root@yz-sre-backup019 bin]# chmod +x start-hdfs.sh
#!/bin/bash
# @author: wufei
# @wiki: http://wiki.inf.lehe.com/pages/viewpage.action?pageId=39096985
# @email: wufei@higohappy.com
# @Date: Tue Apr 14 11:56:51 CST 2020
 
# 启动 flume 自身的监控参数,默认执行以下脚本
# --conf: flume 的配置目录
# --conf-file: 自定义 flume 的 agent 配置文件
# --name: 指定 agent 的名称,与自定义 agent 配置文件中对应,即 wufei03
# -Dflume.root.logger: 日志级别和输出形式
nohup flume-ng agent --conf ${FLUME_HOME}/conf --name wufei03 --conf-file=/data/flume/job/flume-hdfs.conf -Dflume.monitoring.type=http  -Dflume.monitoring.port=10502 -Dflume.root.logger=INFO,console  >> /data/flume/log/flume-hdfs.log  2>&1 &
[root@yz-bi-web01 bin]# bash start-hdfs.sh
[root@yz-bi-web01 bin]# ss -ntl
State      Recv-Q Send-Q                                            Local Address:Port                                              Peer Address:Port
.....
LISTEN     0      128                                                           *:22                                                           *:*
LISTEN     0      100                                                   127.0.0.1:25                                                           *:*
LISTEN     0      50                                                            *:10502                                                        *:*
LISTEN     0      128                                                           *:80                                                           *:*
[root@yz-bi-web01 bin]#
[root@yz-bi-web01 data]# cd /data/
[root@yz-bi-web01 data]# echo "SHOWufei" >> data.log
[root@yz-bi-web01 data]# echo "帅飞飞!!!" >> data.log

(4)查看 flume 日志收集信息

Info: Sourcing environment configuration script /root/apps/apache-flume-1.6.0-cdh5.7.0-bin/conf/flume-env.sh
Info: Including Hadoop libraries found via (/hadoop/hadoop/bin/hadoop) for HDFS access
Info: Excluding /hadoop/hadoop-2.7.1/share/hadoop/common/lib/slf4j-api-1.7.10.jar from classpath
Info: Excluding /hadoop/hadoop-2.7.1/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar from classpath
Info: Excluding /hadoop/hadoop-2.7.1/lib/tez/lib/slf4j-api-1.7.10.jar from classpath
Info: Including HBASE libraries found via (/hadoop/hbase/bin/hbase) for HBASE access
Info: Excluding /hadoop/hbase/lib/slf4j-api-1.7.7.jar from classpath
Info: Excluding /hadoop/hadoop-2.7.1/share/hadoop/common/lib/slf4j-api-1.7.10.jar from classpath
Info: Excluding /hadoop/hadoop-2.7.1/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar from classpath
Info: Excluding /hadoop/hadoop-2.7.1/lib/tez/lib/slf4j-api-1.7.10.jar from classpath
Info: Including Hive libraries found via (/hadoop/hive) for Hive access
  
...。。
 
2020-04-14 12:16:15,648 (lifecycleSupervisor-1-0) [INFO - org.apache.flume.source.ExecSource.start(ExecSource.java:169)] Exec source starting with command:tail -F /data/data.log
2020-04-14 12:16:15,650 (lifecycleSupervisor-1-1) [INFO - org.apache.flume.instrumentation.MonitoredCounterGroup.register(MonitoredCounterGroup.java:120)] Monitored counter group for type: SINK, name: hdfs_sink: Successfully registered new MBean.
2020-04-14 12:16:15,650 (lifecycleSupervisor-1-0) [INFO - org.apache.flume.instrumentation.MonitoredCounterGroup.register(MonitoredCounterGroup.java:120)] Monitored counter group for type: SOURCE, name: file_source: Successfully registered new MBean.
2020-04-14 12:16:15,650 (lifecycleSupervisor-1-1) [INFO - org.apache.flume.instrumentation.MonitoredCounterGroup.start(MonitoredCounterGroup.java:96)] Component type: SINK, name: hdfs_sink started
2020-04-14 12:16:15,650 (lifecycleSupervisor-1-0) [INFO - org.apache.flume.instrumentation.MonitoredCounterGroup.start(MonitoredCounterGroup.java:96)] Component type: SOURCE, name: file_source started
2020-04-14 12:16:15,683 (conf-file-poller-0) [INFO - org.mortbay.log.Slf4jLog.info(Slf4jLog.java:67)] Logging to org.slf4j.impl.Log4jLoggerAdapter(org.mortbay.log) via org.mortbay.log.Slf4jLog
2020-04-14 12:16:15,725 (conf-file-poller-0) [INFO - org.mortbay.log.Slf4jLog.info(Slf4jLog.java:67)] jetty-6.1.26.cloudera.4
2020-04-14 12:16:15,761 (conf-file-poller-0) [INFO - org.mortbay.log.Slf4jLog.info(Slf4jLog.java:67)] Started SelectChannelConnector@0.0.0.0:10502
2020-04-14 12:16:53,678 (SinkRunner-PollingRunner-DefaultSinkProcessor) [INFO - org.apache.flume.sink.hdfs.HDFSDataStream.configure(HDFSDataStream.java:58)] Serializer = TEXT, UseRawLocalFileSystem = false
2020-04-14 12:16:53,977 (SinkRunner-PollingRunner-DefaultSinkProcessor) [INFO - org.apache.flume.sink.hdfs.BucketWriter.open(BucketWriter.java:234)] Creating hdfs://yz-higo-nn1:9000/flume/dt=2020-04-14/12/gz_10.20.2.24-.1586837813679.tmp
2020-04-14 12:26:55,533 (hdfs-hdfs_sink-roll-timer-0) [INFO - org.apache.flume.sink.hdfs.BucketWriter.close(BucketWriter.java:363)] Closing hdfs://yz-higo-nn1:9000/flume/dt=2020-04-14/12/gz_10.20.2.24-.1586837813679.tmp
2020-04-14 12:26:55,571 (hdfs-hdfs_sink-call-runner-6) [INFO - org.apache.flume.sink.hdfs.BucketWriter$8.call(BucketWriter.java:629)] Renaming hdfs://yz-higo-nn1:9000/flume/dt=2020-04-14/12/gz_10.20.2.24-.1586837813679.tmp to hdfs://yz-higo-nn1:9000/flume/dt=2020-04-14/12/gz_10.20.2.24-.1586837813679
2020-04-14 12:26:55,580 (hdfs-hdfs_sink-roll-timer-0) [INFO - org.apache.flume.sink.hdfs.HDFSEventSink$1.run(HDFSEventSink.java:394)] Writer callback called.
2020-04-14 17:47:30,793 (SinkRunner-PollingRunner-DefaultSinkProcessor) [INFO - org.apache.flume.sink.hdfs.HDFSDataStream.configure(HDFSDataStream.java:58)] Serializer = TEXT, UseRawLocalFileSystem = false
2020-04-14 17:47:30,839 (SinkRunner-PollingRunner-DefaultSinkProcessor) [INFO - org.apache.flume.sink.hdfs.BucketWriter.open(BucketWriter.java:234)] Creating hdfs://yz-higo-nn1:9000/flume/dt=2020-04-14/17/gz_10.20.2.24-.1586857650794.tmp

(5)查看 hdfs 对应目录是否生成相应的日志信息

[root@yz-bi-web01 ~]# hdfs dfs -ls /flume/dt=2020-04-14/12
Found 1 items
-rw-r--r--   3 root hadoop          9 2020-04-14 12:16 /flume/dt=2020-04-14/12/gz_10.20.2.24-.1586837813679.tmp
[root@yz-bi-web01 ~]# hdfs dfs -ls /flume/dt=2020-04-14
Found 2 items
drwxrwxrwx   - root hadoop          0 2020-04-14 12:26 /flume/dt=2020-04-14/12
drwxrwxrwx   - root hadoop          0 2020-04-14 17:47 /flume/dt=2020-04-14/17

(6)Browsing HDFS

 

(7)查看 flume 度量值

[root@yz-bi-web01 ~]# curl http://127.0.0.1:10502/metrics | jq
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
105   841    0   841    0     0   163k      0 --:--:-- --:--:-- --:--:--  273k
{
  "SOURCE.file_source": {               # source 的名称
    "OpenConnectionCount": "0",         # 目前与客户端或 sink 保持连接的总数量,目前仅支持 avro source 展现该度量
    "Type": "SOURCE",                   # 当前类型为 SOURRCE
    "AppendBatchAcceptedCount": "0",    # 成功提交到 channel 的批次的总数量
    "AppendBatchReceivedCount": "0",    # 接收到事件批次的总数量
    "EventAcceptedCount": "3",          ## 成功写出到channel的事件总数量
    "AppendReceivedCount": "0",         # 每批只有一个事件的事件总数量(与 RPC 调用的一个 append 调用相等)
    "StopTime": "0",                    # SOURCE 停止时的毫秒值时间,0 代表一直运行着
    "StartTime": "1586837775650",       # SOURCE 启动时的毫秒值时间
    "EventReceivedCount": "3",          ## 目前为止 source 已经接收到的事件总数量
    "AppendAcceptedCount": "0"          # 逐条录入的次数,单独传入的事件到 Channel 且成功返回的事件总数量
  },
  "SINK.hdfs_sink": {                   # sink 的名称
    "BatchCompleteCount": "0",          # 批量处理event的个数等于批处理大小的数量
    "ConnectionFailedCount": "0",       # 连接失败的次数
    "EventDrainAttemptCount": "3",      ## sink 尝试写出到存储的事件总数量
    "ConnectionCreatedCount": "2",      # 下一个阶段(或存储系统)创建链接的数量(如HDFS创建一个文件)
    "Type": "SINK",                     # 当前类型为 SINK
    "BatchEmptyCount": "2551",          # 批量处理 event 的个数为 0 的数量(空的批量的数量),如果数量很大表示 source 写入数据的速度比 sink 处理数据的速度慢很多
    "ConnectionClosedCount": "1",       # 连接关闭的次数
    "EventDrainSuccessCount": "3",      ## sink成功写出到存储的事件总数量
    "StopTime": "0",                    # SINK 停止时的毫秒值时间
    "StartTime": "1586837775650",       # SINK 启动时的毫秒值时间
    "BatchUnderflowCount": "3"          # 批量处理 event 的个数小于批处理大小的数量(比 sink 配置使用的最大批量尺寸更小的批量的数量),如果该值很高也表示 sink 比 source 更快
  },
  "CHANNEL.memory_channel": {           # channel 的名称
    "EventPutSuccessCount": "3",        ## 成功写入channel且提交的事件总次数
    "ChannelFillPercentage": "0.0",     # channel已填入的百分比
    "Type": "CHANNEL",                  # 当前类型为 CHANNEL
    "StopTime": "0",                    # CHANNEL 停止时的毫秒值时间
    "EventPutAttemptCount": "3",        ## Source 尝试写入 Channe 的事件总次数
    "ChannelSize": "0",                 # 目前 channel 中事件的总数量,目前仅支持 File Channel,Memory channel 的统计数据
    "StartTime": "1586837775646",       # CHANNEL 启动时的毫秒值时间
    "EventTakeSuccessCount": "3",       ## sink 成功从 channel 读取事件的总数量
    "ChannelCapacity": "1000",          # channel 的容量,目前仅支持 File Channel,Memory channel 的统计数据
    "EventTakeAttemptCount": "2558"     # sink 尝试从 channel 拉取事件的总次数。这不意味着每次时间都被返回,因为 sink 拉取的时候 channel 可能没有任何数据
  }
}

(8)测试完后删掉 flume 进程

[root@yz-bi-web01 ~]# ps -ef | grep flume
root     19768 14759  0 17:57 pts/11   00:00:00 grep flume
root     26653     1  0 12:16 pts/0    00:00:26 /usr/local/jdk1.7.0_76/bin/java -Xmx20m -Dflume.monitoring.type=http -Dflume.monitoring.port=10502 -Dflume.root.logger=INFO,console -cp /root/apps/apache-flume-1.6.0-cdh5.7.0-bin/conf:/root/apps/apache-flume-1.6.0-cdh5.7.0-bin/lib/*:/hadoop/hadoop-2.7.1/etc/hadoop:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/activation-1.1.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/apacheds-i18n-2.0.0-M15.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/apacheds-kerberos-codec-2.0.0-M15.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/api-asn1-api-1.0.0-M20.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/api-util-1.0.0-M20.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/asm-3.2.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/avro-1.7.4.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/commons-beanutils-1.7.0.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/commons-beanutils-core-1.8.0.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/commons-cli-1.2.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/commons-codec-1.4.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/commons-collections-3.2.1.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/commons-compress-1.4.1.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/commons-configuration-1.6.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/commons-digester-1.8.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/commons-httpclient-3.1.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/commons-io-2.4.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/commons-lang-2.6.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/commons-logging-1.1.3.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/commons-math3-3.1.1.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/commons-net-3.1.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/curator-client-2.7.1.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/curator-framework-2.7.1.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/curator-recipes-2.7.1.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/gson-2.2.4.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/guava-11.0.2.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/hadoop-annotations-2.7.1.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/hadoop-auth-2.7.1.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/hadoop-lzo-0.4.21-SNAPSHOT.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/hamcrest-core-1.3.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/htrace-core-3.1.0-incubating.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/httpclient-4.2.5.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/httpcore-4.2.5.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/jackson-core-asl-1.9.13.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/jackson-jaxrs-1.9.13.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/jackson-mapper-asl-1.9.13.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/jackson-xc-1.9.13.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/java-xmlbuilder-0.4.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/jaxb-api-2.2.2.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/jaxb-impl-2.2.3-1.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/jersey-core-1.9.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/jersey-json-1.9.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/jersey-server-1.9.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/jets3t-0.9.0.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/jettison-1.1.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/jetty-6.1.26.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/jetty-util-6.1.26.jar:/hadoop/hadoop-2.7.1
/share/hadoop/common/lib/jsch-0.1.42.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/jsp-api-2.1.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/jsr305-3.0.0.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/junit-4.11.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/log4j-1.2.17.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/mockito-all-1.8.5.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/netty-3.6.2.Final.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/paranamer-2.3.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/protobuf-java-2.5.0.jar:/hadoop/hadoop-2.7.1/share/hadoop/com
[root@yz-bi-web01 ~]# kill 26653
[root@yz-bi-web01 ~]# ps -ef | grep flume
root     19777 14759  0 17:58 pts/11   00:00:00 grep flume

(9)清除 hdfs 上数据

[root@yz-bi-web01 ~]# su - hadoop
[hadoop@yz-bi-web01 ~]$ hdfs dfs -rm -r -f -skipTrash /flume
Deleted /flume
[hadoop@yz-bi-web01 ~]$

实战 04:从服务器 A 收集数据到服务器 B 并上传到 HDFS(需要服务器 B 节点配置 Hadoop 集群环境)

重点:服务器 A 的 sink 类型是 avro,而服务器 B 的 source 类型是 avro。

 

流程:

  • 机器 A 监控一个文件,把日志记录到 data.log 中
  • avro sink 把新产生的日志输出到指定的 hostname 和 port 上
  • 通过 avro source 对应的 agent 将日志输出到控制台、kafka、hdfs 等(A、B 两端相互对应的关键配置见下面的摘要)
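把 A、B 两端需要相互对应的关键配置摘出来(摘自下文的完整配置):机器 A 上 avro sink 的 hostname/port,必须与机器 B 上 avro source 的 bind/port 一致。

# 机器 A(flume-file.conf):avro sink 指向机器 B
a1.sinks.k1.type = avro
a1.sinks.k1.hostname = 10.20.2.24
a1.sinks.k1.port = 8888

# 机器 B(flume-hdfs.conf):avro source 监听同一地址和端口
wufei03.sources.file_source.type = avro
wufei03.sources.file_source.bind = 10.20.2.24
wufei03.sources.file_source.port = 8888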

(1)机器 A 配置

agent 选型:exec source + memory channel + avro sink

# flume-file.conf: A single_node Flume configuration
 
# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1
# a1: agent 的名称; r1: source 的名称; k1: sink 的名称; c1: channel 的名称
 
# Describe/configure the source
# 配置 source 的类型
a1.sources.r1.type = exec
# 配置 source 执行的命令
a1.sources.r1.command = tail -F /data/data.log
# 配置 source 让 bash 将一个字符串作为完整的命令来执行
a1.sources.r1.shell = /bin/bash -c
 
# 指定 sink 的类型,我们这里指定的为 avro,即将数据发送到端口,需要设置端口名称、端口号
a1.sinks.k1.type = avro
# 配置 sink 主机名称
a1.sinks.k1.hostname = 10.20.2.24
# 配置 sink 主机端口
a1.sinks.k1.port = 8888
 
# 配置 channel 的类型
a1.channels.c1.type = memory
# 配置通道中存储的最大 event 数
a1.channels.c1.capacity = 1000
# 配置每个事务中 channel 从 source 接收或向 sink 提供的最大 event 数
a1.channels.c1.transactionCapacity = 100
 
# 绑定 source 和 sink
# 把 source 和 channel 做关联,其中属性是 channels,说明 sources 可以和多个 channel 做关联
a1.sources.r1.channels = c1
# 把 sink 和 channel 做关联,只能输出到一个 channel
a1.sinks.k1.channel = c1

(2)机器 B 配置

agent 选型:avro source + memory channel + hdfs sink

# flume-hdfs.conf: A single_node Flume configuration
 
# Name the components on this agent
wufei03.sources = file_source
wufei03.sinks = hdfs_sink
wufei03.channels = memory_channel
# wufei03: agent 的名称; file_source: source 的名称; hdfs_sink: sink 的名称; memory_channel: channel 的名称
 
# Describe/configure the source
# 配置 source 的类型
wufei03.sources.file_source.type = avro
# 配置 source 绑定主机
wufei03.sources.file_source.bind = 10.20.2.24
# 配置 source 绑定主机端口
wufei03.sources.file_source.port = 8888
 
# 指定 sink 的类型,我们这里指定的为 hdfs
# 配置 sink 的类型,将数据传输到 HDFS 集群
wufei03.sinks.hdfs_sink.type = hdfs
# 配置 sink 输出到本 hdfs 的 url 和写入路径
wufei03.sinks.hdfs_sink.hdfs.path = hdfs://yz-higo-nn1:9000/flume/dt=%Y-%m-%d/%H
# 上传文件的前缀
wufei03.sinks.hdfs_sink.hdfs.filePrefix = gz_10.20.3.36-
# 是否按照时间滚动文件夹
wufei03.sinks.hdfs_sink.hdfs.round = true
# 多少时间单位创建一个文件夹
wufei03.sinks.hdfs_sink.hdfs.roundValue = 1
# 重新定义时间单位
wufei03.sinks.hdfs_sink.hdfs.roundUnit = hour
# 是否使用本地时间戳
wufei03.sinks.hdfs_sink.hdfs.useLocalTimeStamp = true
# 积攒多少个 event 才 flush 到 hdfs 一次
wufei03.sinks.hdfs_sink.hdfs.batchSize = 1000
# 设置文件类型,可支持压缩
wufei03.sinks.hdfs_sink.hdfs.fileType = DataStream
# 多久生成一个新文件
wufei03.sinks.hdfs_sink.hdfs.rollInterval = 600
# 设置每个文件的滚动大小
wufei03.sinks.hdfs_sink.hdfs.rollSize = 134217700
# 文件的滚动与 event 数量无关
wufei03.sinks.hdfs_sink.hdfs.rollCount = 0
# 最小副本数
wufei03.sinks.hdfs_sink.hdfs.minBlockReplicas = 1
 
# 配置 channel 的类型
wufei03.channels.memory_channel.type = memory
# 配置通道中存储的最大 event 数
wufei03.channels.memory_channel.capacity = 1000
# 配置每个事务中 channel 从 source 接收或向 sink 提供的最大 event 数
wufei03.channels.memory_channel.transactionCapacity = 100
 
# 绑定 source 和 sink
# 把 source 和 channel 做关联,其中属性是 channels,说明 sources 可以和多个 channel 做关联
wufei03.sources.file_source.channels = memory_channel
# 把 sink 和 channel 做关联,只能输出到一个 channel
wufei03.sinks.hdfs_sink.channel = memory_channel

(3)编写启动脚本

// 机器 A 启动脚本
[root@yz-sre-backup019 bin]# vim start-file.sh
[root@yz-sre-backup019 bin]# cat start-file.sh
#!/bin/bash
# @author: wufei
# @wiki: http://wiki.inf.lehe.com/pages/viewpage.action?pageId=39096985
# @email: wufei@higohappy.com
# @Date: Wed Apr 15 11:22:24 CST 2020
 
# 启动 flume 自身的监控参数,默认执行以下脚本
# --conf: flume 的配置目录
# --conf-file: 自定义 flume 的 agent 配置文件
# --name: 指定 agent 的名称,与自定义 agent 配置文件中对应,即 a1
# -Dflume.root.logger: 日志级别和输出形式
nohup flume-ng agent --conf ${FLUME_HOME}/conf --name a1 --conf-file /data/flume/job/flume-file.conf -Dflume.monitoring.type=http -Dflume.monitoring.port=10501 -Dflume.root.logger=INFO,console >> /data/flume/log/flume-file.log 2>&1 &
  
// 机器 B 启动脚本
[root@yz-bi-web01 bin]# vim start-hdfs.sh
[root@yz-bi-web01 bin]# cat start-hdfs.sh
#!/bin/bash
# @author: wufei
# @wiki: http://wiki.inf.lehe.com/pages/viewpage.action?pageId=39096985
# @email: wufei@higohappy.com
# @Date: Wed Apr 15 11:22:24 CST 2020
 
# 启动 flume 自身的监控参数,默认执行以下脚本
# --conf: flume 的配置目录
# --conf-file: 自定义 flume 的 agent 配置文件
# --name: 指定 agent 的名称,与自定义 agent 配置文件中对应,即 wufei03
# -Dflume.root.logger: 日志级别和输出形式
nohup flume-ng agent --conf ${FLUME_HOME}/conf --name wufei03 --conf-file=/data/flume/job/flume-hdfs.conf -Dflume.monitoring.type=http  -Dflume.monitoring.port=10502 -Dflume.root.logger=INFO,console  >> /data/flume/log/flume-hdfs.log  2>&1 &

(4)启动脚本并查看对应的日志信息

// 机器 A
[root@yz-sre-backup019 bin]# ss -ntl
State       Recv-Q Send-Q                                   Local Address:Port                                     Peer Address:Port
LISTEN      0      128                                                  *:22                                                  *:*
LISTEN      0      100                                          127.0.0.1:25                                                  *:*
LISTEN      0      128                                                  *:1988                                                *:*
// 机器 B
[root@yz-bi-web01 bin]# ss -ntl
State       Recv-Q Send-Q                                   Local Address:Port                                     Peer Address:Port
LISTEN      0      128                                                  *:22                                                  *:*
LISTEN      0      100                                          127.0.0.1:25                                                  *:*
LISTEN      0      128                                                  *:1988                                                *:*
  
// 先启动机器 B 的 agent
[root@yz-bi-web01 bin]# bash start-hdfs.sh
[root@yz-bi-web01 log]# tail -f flume-hdfs.log
Info: Sourcing environment configuration script /root/apps/apache-flume-1.6.0-cdh5.7.0-bin/conf/flume-env.sh
Info: Including Hadoop libraries found via (/hadoop/hadoop/bin/hadoop) for HDFS access
Info: Excluding /hadoop/hadoop-2.7.1/share/hadoop/common/lib/slf4j-api-1.7.10.jar from classpath
Info: Excluding /hadoop/hadoop-2.7.1/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar from classpath
Info: Excluding /hadoop/hadoop-2.7.1/lib/tez/lib/slf4j-api-1.7.10.jar from classpath
Info: Including HBASE libraries found via (/hadoop/hbase/bin/hbase) for HBASE access
Info: Excluding /hadoop/hbase/lib/slf4j-api-1.7.7.jar from classpath
Info: Excluding /hadoop/hadoop-2.7.1/share/hadoop/common/lib/slf4j-api-1.7.10.jar from classpath
Info: Excluding /hadoop/hadoop-2.7.1/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar from classpath
Info: Excluding /hadoop/hadoop-2.7.1/lib/tez/lib/slf4j-api-1.7.10.jar from classpath
Info: Including Hive libraries found via (/hadoop/hive) for Hive access
// 后启动机器 A 的 agent
[root@yz-sre-backup019 bin]# bash start-file.sh
[root@yz-sre-backup019 log]# tail -f flume-file.log
Info: Sourcing environment configuration script /root/apps/apache-flume-1.6.0-cdh5.7.0-bin/conf/flume-env.sh
Info: Including Hive libraries found via () for Hive access
+ exec /usr/jdk1.8.0_241/bin/java -Xmx20m -Dflume.monitoring.type=http -Dflume.monitoring.port=10501 -Dflume.root.logger==INFO,console -cp '/root/apps/apache-flume-1.6.0-cdh5.7.0-bin/conf:/root/apps/apache-flume-1.6.0-cdh5.7.0-bin/lib/*:/lib/*' -Djava.library.path= org.apache.flume.node.Application --name a1 --conf-file /data/flume/job/flume-file.conf
2020-04-15 11:55:35,409 (lifecycleSupervisor-1-0) [INFO - org.apache.flume.node.PollingPropertiesFileConfigurationProvider.start(PollingPropertiesFileConfigurationProvider.java:61)] Configuration provider starting
2020-04-15 11:55:35,416 (lifecycleSupervisor-1-0) [DEBUG - org.apache.flume.node.PollingPropertiesFileConfigurationProvider.start(PollingPropertiesFileConfigurationProvider.java:78)] Configuration provider started
  
// 插入测试数据
[root@yz-sre-backup019 ~]# cd /data/
[root@yz-sre-backup019 data]# echo "帅飞飞!!!" >> data.log
[root@yz-bi-web01 log]# tail -f flume-hdfs.log
2020-04-15 11:55:51,147 (SinkRunner-PollingRunner-DefaultSinkProcessor) [INFO - org.apache.flume.sink.hdfs.BucketWriter.open(BucketWriter.java:234)] Creating hdfs://yz-higo-nn1:9000/flume/dt=2020-04-15/11/gz_10.20.3.36-.1586922950859.tmp
[root@yz-sre-backup019 data]# echo "SHOWufei!!!" >> data.log
[root@yz-bi-web01 log]# tail -f flume-hdfs.log
2020-04-15 12:03:14,139 (SinkRunner-PollingRunner-DefaultSinkProcessor) [INFO - org.apache.flume.sink.hdfs.BucketWriter.open(BucketWriter.java:234)] Creating hdfs://yz-higo-nn1:9000/flume/dt=2020-04-15/12/gz_10.20.3.36-.1586923393969.tmp

(5)查看 hdfs 对应目录是否生成相应的日志信息

[root@yz-bi-web01 ~]# hdfs dfs -ls /flume/dt=2020-04-15
Found 2 items
drwxrwxrwx   - root hadoop          0 2020-04-15 12:05 /flume/dt=2020-04-15/11
drwxrwxrwx   - root hadoop          0 2020-04-15 12:03 /flume/dt=2020-04-15/12
[root@yz-bi-web01 ~]# hdfs dfs -ls /flume/dt=2020-04-15/11
Found 1 items
-rw-r--r--   3 root hadoop         19 2020-04-15 12:05 /flume/dt=2020-04-15/11/gz_10.20.3.36-.1586922950859
[root@yz-bi-web01 ~]# hadoop fs -cat /flume/dt=2020-04-15/11/gz_10.20.3.36-.1586922950859
帅飞飞!!!
[root@yz-bi-web01 ~]# hdfs dfs -ls /flume/dt=2020-04-15/12
Found 1 items
-rw-r--r--   3 root hadoop         18 2020-04-15 12:03 /flume/dt=2020-04-15/12/gz_10.20.3.36-.1586923393969.tmp
[root@yz-bi-web01 ~]# hadoop fs -cat /flume/dt=2020-04-15/12/gz_10.20.3.36-.1586923393969
SHOWufei!!!

(6)查看 flume 度量值

// 机器 A
[root@yz-sre-backup019 ~]# curl http://127.0.0.1:10501/metrics | jq
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
101   811    0   811    0     0   4924      0 --:--:-- --:--:-- --:--:--  4975
{
  "SINK.k1": {
    "ConnectionCreatedCount": "1",
    "ConnectionClosedCount": "0",
    "Type": "SINK",
    "BatchCompleteCount": "0",
    "BatchEmptyCount": "109",
    "EventDrainAttemptCount": "2",
    "StartTime": "1586922935645",
    "EventDrainSuccessCount": "2",
    "BatchUnderflowCount": "2",
    "StopTime": "0",
    "ConnectionFailedCount": "0"
  },
  "CHANNEL.c1": {
    "ChannelCapacity": "1000",
    "ChannelFillPercentage": "0.0",
    "Type": "CHANNEL",
    "ChannelSize": "0",
    "EventTakeSuccessCount": "2",
    "EventTakeAttemptCount": "114",
    "StartTime": "1586922935643",
    "EventPutAttemptCount": "2",
    "EventPutSuccessCount": "2",
    "StopTime": "0"
  },
  "SOURCE.r1": {
    "EventReceivedCount": "2",
    "AppendBatchAcceptedCount": "0",
    "Type": "SOURCE",
    "EventAcceptedCount": "2",
    "AppendReceivedCount": "0",
    "StartTime": "1586922935652",
    "AppendAcceptedCount": "0",
    "OpenConnectionCount": "0",
    "AppendBatchReceivedCount": "0",
    "StopTime": "0"
  }
}
  
// 机器 B
[root@yz-bi-web01 ~]# curl http://127.0.0.1:10502/metrics | jq
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
104   839    0   839    0     0   7163      0 --:--:-- --:--:-- --:--:--  7295
{
  "SOURCE.file_source": {
    "OpenConnectionCount": "1",
    "Type": "SOURCE",
    "AppendBatchReceivedCount": "2",
    "AppendBatchAcceptedCount": "2",
    "EventAcceptedCount": "2",
    "AppendReceivedCount": "0",
    "StopTime": "0",
    "StartTime": "1586922913313",
    "EventReceivedCount": "2",
    "AppendAcceptedCount": "0"
  },
  "SINK.hdfs_sink": {
    "BatchCompleteCount": "0",
    "ConnectionFailedCount": "0",
    "EventDrainAttemptCount": "2",
    "ConnectionCreatedCount": "2",
    "Type": "SINK",
    "BatchEmptyCount": "117",
    "ConnectionClosedCount": "1",
    "EventDrainSuccessCount": "2",
    "StopTime": "0",
    "StartTime": "1586922912838",
    "BatchUnderflowCount": "2"
  },
  "CHANNEL.memory_channel": {
    "EventPutSuccessCount": "2",
    "ChannelFillPercentage": "0.0",
    "Type": "CHANNEL",
    "EventPutAttemptCount": "2",
    "ChannelSize": "0",
    "StopTime": "0",
    "StartTime": "1586922912835",
    "EventTakeSuccessCount": "2",
    "ChannelCapacity": "1000",
    "EventTakeAttemptCount": "121"
  }
}

(7)测试完删掉 flume 进程并清除 hdfs 上数据

// 先删掉机器 A 的 flume 进程
[root@yz-sre-backup019 bin]# ps -ef | grep flume
root     10492  6728  0 11:54 pts/2    00:00:00 tail -f flume-file.log
root     10500     1  0 11:55 pts/0    00:00:09 /usr/jdk1.8.0_241/bin/java -Xmx20m -Dflume.monitoring.type=http -Dflume.monitoring.port=10501 -Dflume.root.logger==INFO,console -cp /root/apps/apache-flume-1.6.0-cdh5.7.0-bin/conf:/root/apps/apache-flume-1.6.0-cdh5.7.0-bin/lib/*:/lib/* -Djava.library.path= org.apache.flume.node.Application --name a1 --conf-file /data/flume/job/flume-file.conf
root     11084  5377  0 12:12 pts/0    00:00:00 grep flume
[root@yz-sre-backup019 bin]# kill 10500
[root@yz-sre-backup019 bin]# ps -ef | grep flume
root     10492  6728  0 11:54 pts/2    00:00:00 tail -f flume-file.log
root     11092  5377  0 12:12 pts/0    00:00:00 grep flume
  
// 后删掉机器 B 的 flume 进程
[root@yz-bi-web01 ~]# ps -ef | grep flume
root      5725 16077  0 11:54 pts/11   00:00:00 tail -f flume-hdfs.log
root      5735     1  1 11:55 pts/0    00:00:20 /usr/local/jdk1.7.0_76/bin/java -Xmx20m -Dflume.monitoring.type=http -Dflume.monitoring.port=10502 -Dflume.root.logger=INFO,console -cp /root/apps/apache-flume-1.6.0-cdh5.7.0-bin/conf:/root/apps/apache-flume-1.6.0-cdh5.7.0-bin/lib/*:/hadoop/hadoop-2.7.1/etc/hadoop:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/activation-1.1.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/apacheds-i18n-2.0.0-M15.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/apacheds-kerberos-codec-2.0.0-M15.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/api-asn1-api-1.0.0-M20.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/api-util-1.0.0-M20.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/asm-3.2.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/avro-1.7.4.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/commons-beanutils-1.7.0.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/commons-beanutils-core-1.8.0.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/commons-cli-1.2.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/commons-codec-1.4.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/commons-collections-3.2.1.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/commons-compress-1.4.1.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/commons-configuration-1.6.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/commons-digester-1.8.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/commons-httpclient-3.1.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/commons-io-2.4.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/commons-lang-2.6.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/commons-logging-1.1.3.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/commons-math3-3.1.1.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/commons-net-3.1.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/curator-client-2.7.1.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/curator-framework-2.7.1.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/curator-recipes-2.7.1.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/gson-2.2.4.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/guava-11.0.2.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/hadoop-annotations-2.7.1.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/hadoop-auth-2.7.1.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/hadoop-lzo-0.4.21-SNAPSHOT.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/hamcrest-core-1.3.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/htrace-core-3.1.0-incubating.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/httpclient-4.2.5.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/httpcore-4.2.5.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/jackson-core-asl-1.9.13.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/jackson-jaxrs-1.9.13.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/jackson-mapper-asl-1.9.13.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/jackson-xc-1.9.13.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/java-xmlbuilder-0.4.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/jaxb-api-2.2.2.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/jaxb-impl-2.2.3-1.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/jersey-core-1.9.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/jersey-json-1.9.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/jersey-server-1.9.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/jets3t-0.9.0.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/jettison-1.1.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/jetty-6.1.26.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/jetty-util-6.1.26.jar:/hadoop/hadoop-2.7.1
/share/hadoop/common/lib/jsch-0.1.42.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/jsp-api-2.1.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/jsr305-3.0.0.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/junit-4.11.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/log4j-1.2.17.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/mockito-all-1.8.5.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/netty-3.6.2.Final.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/paranamer-2.3.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/protobuf-java-2.5.0.jar:/hadoop/hadoop-2.7.1/share/hadoop/com
root     18949  9025  0 12:12 pts/12   00:00:00 grep flume
[root@yz-bi-web01 ~]# kill 5735
[root@yz-bi-web01 ~]# ps -ef | grep flume
root      5725 16077  0 11:54 pts/11   00:00:00 tail -f flume-hdfs.log
root     18963  9025  0 12:12 pts/12   00:00:00 grep flume
  
// 清除 hdfs 上的数据
[root@yz-bi-web01 ~]# su - hadoop
[hadoop@yz-bi-web01 ~]$ hdfs dfs -rm -r -f -skipTrash /flume
Deleted /flume
[hadoop@yz-bi-web01 ~]$ exit
logout
[root@yz-bi-web01 ~]#

实战 05:多 flume 汇总数据到单 flume(需要单 flume 汇聚节点配置 hadoop 集群环境)

(1)流程

  • Agent1 监控文件 /data/data.log(exec source + memory channel + avro sink)
  • Agent2 监控某一端口的数据流(netcat source + memory channel + avro sink)
  • Agent3 实时监控指定目录下的文件内容(spooldir source + memory channel + avro sink)
  • Agent1、Agent2、Agent3 将数据发送给 Agent4
  • Agent4 将最终数据写入到 hdfs(avro source + memory channel + hdfs sink),各 agent 之间的端口对应关系见下面的摘要
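汇聚的关键在于:Agent1、Agent2、Agent3 的 avro sink 都指向 Agent4 所在机器(10.20.2.24)的同一个 avro source 端口 6666。下面把这组对应关系摘出来(Agent1、Agent4 的配置行摘自下文,Agent2、Agent3 按同样方式把各自的 avro sink 指向该端口即可):

# Agent1(agent1-exec.conf)的 avro sink 指向汇聚节点
agent1.sinks.avro_sink.hostname = 10.20.2.24
agent1.sinks.avro_sink.port = 6666

# Agent4(agent4-hdfs.conf)的 avro source 在同一地址、同一端口上接收
agent4.sources.avro_source.bind = 10.20.2.24
agent4.sources.avro_source.port = 6666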

(2)编写相应的 agent 配置文件

[root@yz-sre-backup019 job]# vim agent1-exec.conf
[root@yz-sre-backup019 job]# cat agent1-exec.conf
# agent1-exec.conf: Clustered_node Flume configuration

# Name the components on this agent
agent1.sources = exec_source
agent1.sinks = avro_sink
agent1.channels = memory_channel
# agent1: agent 的名称; exec_source: source 的名称; avro_sink: sink 的名称; memory_channel: channel 的名称

# Describe/configure the source
# 配置 source 的类型
agent1.sources.exec_source.type = exec
# 配置 source 执行的命令
agent1.sources.exec_source.command = tail -F /data/data.log
# 配置 source 让 bash 将一个字符串作为完整的命令来执行
agent1.sources.exec_source.shell = /bin/bash -c

# Sink type is avro: events are forwarded to a remote agent, so the target hostname and port must be configured
agent1.sinks.avro_sink.type = avro
# 配置 sink 主机名称
agent1.sinks.avro_sink.hostname = 10.20.2.24
# 配置 sink 主机端口
agent1.sinks.avro_sink.port = 6666

# 配置 channel 的类型
agent1.channels.memory_channel.type = memory
# 配置通道中存储的最大 event 数
agent1.channels.memory_channel.capacity = 1000
# Maximum number of events per transaction that the channel accepts from a source or delivers to a sink
agent1.channels.memory_channel.transactionCapacity = 100

# 绑定 source 和 sink
# 把 source 和 channel 做关联,其中属性是 channels,说明 sources 可以和多个 channel 做关联
agent1.sources.exec_source.channels = memory_channel
# 把 sink 和 channel 做关联,只能输出到一个 channel
agent1.sinks.avro_sink.channel = memory_channel

[root@yz-bi-web01 job]# vim agent4-hdfs.conf
[root@yz-bi-web01 job]# cat agent4-hdfs.conf
# agent4-hdfs.conf: Clustered_node Flume configuration

# Name the components on this agent
agent4.sources = avro_source
agent4.sinks = hdfs_sink
agent4.channels = memory_channel
# agent4: agent 的名称; avro_source: source 的名称; hdfs_sink: sink 的名称; memory_channel: channel 的名称

# Describe/configure the source
# 配置 source 的类型
agent4.sources.avro_source.type = avro
# 配置 source 绑定主机
agent4.sources.avro_source.bind = 10.20.2.24
# 配置 source 绑定主机端口
agent4.sources.avro_source.port = 6666

# 指定 sink 的类型,我们这里指定的为 hdfs
# 配置 sink 的类型,将数据传输到 HDFS 集群
agent4.sinks.hdfs_sink.type = hdfs
# HDFS URI and target path for this sink; the escape sequences bucket data by date and hour
agent4.sinks.hdfs_sink.hdfs.path = hdfs://yz-higo-nn1:9000/flume/%Y-%m-%d/%H
# 上传文件的前缀
agent4.sinks.hdfs_sink.hdfs.filePrefix = gz_10.20.2.24
# 是否按照时间滚动文件夹
agent4.sinks.hdfs_sink.hdfs.round = true
# 多少时间单位创建一个文件夹
agent4.sinks.hdfs_sink.hdfs.roundValue = 1
# 重新定义时间单位
agent4.sinks.hdfs_sink.hdfs.roundUnit = hour
# 是否使用本地时间戳
agent4.sinks.hdfs_sink.hdfs.useLocalTimeStamp = true
# 积攒多少个 event 才 flush 到 hdfs 一次
agent4.sinks.hdfs_sink.hdfs.batchSize = 100
# 设置文件类型,可支持压缩
agent4.sinks.hdfs_sink.hdfs.fileType = DataStream
# 多久生成一个新文件
agent4.sinks.hdfs_sink.hdfs.rollInterval = 60
# 设置每个文件的滚动大小
agent4.sinks.hdfs_sink.hdfs.rollSize = 134217700
# 文件的滚动与 event 数量无关
agent4.sinks.hdfs_sink.hdfs.rollCount = 0
# 最小副本数
agent4.sinks.hdfs_sink.hdfs.minBlockReplicas = 1
# Used together with agent3's basenameHeader/basenameHeaderKey properties so uploaded files keep their original names.
# Note that this overrides the filePrefix set above; events without a fileName header (from agent1/agent2) end up with an empty prefix.
agent4.sinks.hdfs_sink.hdfs.filePrefix = %{fileName}

# 配置 channel 的类型
agent4.channels.memory_channel.type = memory
# 配置通道中存储的最大 event 数
agent4.channels.memory_channel.capacity = 1000
# Maximum number of events per transaction that the channel accepts from a source or delivers to a sink
agent4.channels.memory_channel.transactionCapacity = 100

# 绑定 source 和 sink
# 把 source 和 channel 做关联,其中属性是 channels,说明 sources 可以和多个 channel 做关联
agent4.sources.avro_source.channels = memory_channel
# 把 sink 和 channel 做关联,只能输出到一个 channel
agent4.sinks.hdfs_sink.channel = memory_channel

[root@yz-sre-backup019 job]# vim agent2-netcat.conf
[root@yz-sre-backup019 job]# cat agent2-netcat.conf
# agent2-netcat.conf: Clustered_node Flume configuration

# Name the components on this agent
agent2.sources = netcat_source
agent2.sinks = avro_sink
agent2.channels = memory_channel
# agent2: agent 的名称; netcat_source: source 的名称; avro_sink: sink 的名称; memory_channel: channel 的名称

# Describe/configure the source
# 配置 source 的类型
agent2.sources.netcat_source.type = netcat
# 配置 source 绑定的主机
agent2.sources.netcat_source.bind = 127.0.0.1
# 配置 source 绑定的主机端口
agent2.sources.netcat_source.port = 8888

# Sink type is avro: events are forwarded to a remote agent, so the target hostname and port must be configured
agent2.sinks.avro_sink.type = avro
# 配置 sink 主机名称
agent2.sinks.avro_sink.hostname = 10.20.2.24
# 配置 sink 主机端口
agent2.sinks.avro_sink.port = 6666

# 指定 channel 的类型为 memory,指定 channel 的容量是 1000,每次传输的容量是 100
# 配置 channel 的类型
agent2.channels.memory_channel.type = memory
# 配置通道中存储的最大 event 数
agent2.channels.memory_channel.capacity = 1000
# Maximum number of events per transaction that the channel accepts from a source or delivers to a sink
agent2.channels.memory_channel.transactionCapacity = 100

# 绑定 source 和 sink
# 把 source 和 channel 做关联,其中属性是 channels,说明 sources 可以和多个 channel 做关联
agent2.sources.netcat_source.channels = memory_channel
# 把 sink 和 channel 做关联,只能输出到一个 channel
agent2.sinks.avro_sink.channel = memory_channel

[root@yz-sre-backup019 job]# vim agent3-dir.conf
[root@yz-sre-backup019 job]# cat agent3-dir.conf
# agent3-dir.conf: Clustered_node Flume configuration

# Name the components on this agent
agent3.sources = spooldir_source
agent3.sinks = avro_sink
agent3.channels = memory_channel
# agent3: agent 的名称; spooldir_source: source 的名称; avro_sink: sink 的名称; memory_channel: channel 的名称

# Describe/configure the source
# 配置 source 的类型,监视一个文件夹,需要文件夹路径
agent3.sources.spooldir_source.type = spooldir
# 配置 source 监视文件夹路径
agent3.sources.spooldir_source.spoolDir = /data/flume/upload
# Suffix appended to a file once it has been fully ingested
agent3.sources.spooldir_source.fileSuffix = .COMPLETED
# Add a header carrying the absolute path of the source file
agent3.sources.spooldir_source.fileHeader = true
# 忽略所有以.tmp 结尾的文件,不上传
agent3.sources.spooldir_source.ignorePattern = ([^ ]*\.tmp)
# 获取源文件名称,方便下面的 sink 调用变量 fileName
agent3.sources.spooldir_source.basenameHeader = true
agent3.sources.spooldir_source.basenameHeaderKey = fileName

# Sink type is avro: events are forwarded to a remote agent, so the target hostname and port must be configured
agent3.sinks.avro_sink.type = avro
# 配置 sink 主机名称
agent3.sinks.avro_sink.hostname = 10.20.2.24
# 配置 sink 主机端口
agent3.sinks.avro_sink.port = 6666

# 配置 channel 的类型
agent3.channels.memory_channel.type = memory
# 配置通道中存储的最大 event 数
agent3.channels.memory_channel.capacity = 1000
# Maximum number of events per transaction that the channel accepts from a source or delivers to a sink
agent3.channels.memory_channel.transactionCapacity = 100

# 绑定 source 和 sink
# 把 source 和 channel 做关联,其中属性是 channels,说明 sources 可以和多个 channel 做关联
agent3.sources.spooldir_source.channels = memory_channel
# 把 sink 和 channel 做关联,只能输出到一个 channel
agent3.sinks.avro_sink.channel = memory_channel

[root@yz-sre-backup019 job]# mkdir -pv /data/flume/upload    # create the directory watched by the spooldir source

(3)编写相应的 agent 启动脚本

  • [root@yz-sre-backup019 bin]# vim start-agent1.sh
  • [root@yz-sre-backup019 bin]# cat start-agent1.sh

#!/bin/bash
# @author: wufei
# @wiki: http://wiki.inf.lehe.com/pages/viewpage.action?pageId=39096985
# @email: wufei@higohappy.com
# @Date: Wed Apr 15 16:29:37 CST 2020

# Start the agent with Flume's built-in HTTP monitoring enabled
# --conf: Flume configuration directory
# --conf-file: the custom agent configuration file
# --name: agent name, must match the name used in the configuration file, i.e. agent1
# -Dflume.root.logger: log level and output target
nohup flume-ng agent --conf ${FLUME_HOME}/conf --name agent1 --conf-file /data/flume/job/agent1-exec.conf -Dflume.monitoring.type=http -Dflume.monitoring.port=10501 -Dflume.root.logger=INFO,console >> /data/flume/log/agent1-exec.log 2>&1 &

  • [root@yz-sre-backup019 bin]# vim start-agent3.sh
  • [root@yz-sre-backup019 bin]# cat start-agent3.sh

#!/bin/bash
# @author: wufei
# @wiki: http://wiki.inf.lehe.com/pages/viewpage.action?pageId=39096985
# @email: wufei@higohappy.com
# @Date: Wed Apr 15 16:29:37 CST 2020

# Start the agent with Flume's built-in HTTP monitoring enabled
# --conf: Flume configuration directory
# --conf-file: the custom agent configuration file
# --name: agent name, must match the name used in the configuration file, i.e. agent3
# -Dflume.root.logger: log level and output target
nohup flume-ng agent --conf ${FLUME_HOME}/conf --name agent3 --conf-file /data/flume/job/agent3-dir.conf -Dflume.monitoring.type=http -Dflume.monitoring.port=10503 -Dflume.root.logger=INFO,console >> /data/flume/log/agent3-dir.log 2>&1 &

  • [root@yz-sre-backup019 bin]# vim start-agent2.sh
  • [root@yz-sre-backup019 bin]# cat start-agent2.sh

#!/bin/bash
# @author: wufei
# @wiki: http://wiki.inf.lehe.com/pages/viewpage.action?pageId=39096985
# @email: wufei@higohappy.com
# @Date: Wed Apr 15 16:29:37 CST 2020

# Start the agent with Flume's built-in HTTP monitoring enabled
# --conf: Flume configuration directory
# --conf-file: the custom agent configuration file
# --name: agent name, must match the name used in the configuration file, i.e. agent2
# -Dflume.root.logger: log level and output target
nohup flume-ng agent --conf ${FLUME_HOME}/conf --name agent2 --conf-file /data/flume/job/agent2-netcat.conf -Dflume.monitoring.type=http -Dflume.monitoring.port=10502 -Dflume.root.logger=INFO,console >> /data/flume/log/agent2-netcat.log 2>&1 &

  • [root@yz-bi-web01 bin]# vim start-agent4.sh
  • [root@yz-bi-web01 bin]# cat start-agent4.sh

#!/bin/bash
# @author: wufei
# @wiki: http://wiki.inf.lehe.com/pages/viewpage.action?pageId=39096985
# @email: wufei@higohappy.com
# @Date: Wed Apr 15 16:36:36 CST 2020

# Start the agent with Flume's built-in HTTP monitoring enabled
# --conf: Flume configuration directory
# --conf-file: the custom agent configuration file
# --name: agent name, must match the name used in the configuration file, i.e. agent4
# -Dflume.root.logger: log level and output target
nohup flume-ng agent --conf ${FLUME_HOME}/conf --name agent4 --conf-file=/data/flume/job/agent4-hdfs.conf -Dflume.monitoring.type=http -Dflume.monitoring.port=10504 -Dflume.root.logger=INFO,console >> /data/flume/log/agent4-hdfs.log 2>&1 &

(4)分别启动各 agent 并查看对应的日志信息

  • [root@yz-bi-web01 bin]# bash start-agent4.sh
  • [root@yz-bi-web01 bin]# tail -f /data/flume/log/agent4-hdfs.log

Info: Sourcing environment configuration script /root/apps/apache-flume-1.6.0-cdh5.7.0-bin/conf/flume-env.sh
Info: Including Hadoop libraries found via (/hadoop/hadoop/bin/hadoop) for HDFS access
Info: Excluding /hadoop/hadoop-2.7.1/share/hadoop/common/lib/slf4j-api-1.7.10.jar from classpath
Info: Excluding /hadoop/hadoop-2.7.1/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar from classpath
Info: Excluding /hadoop/hadoop-2.7.1/lib/tez/lib/slf4j-api-1.7.10.jar from classpath
Info: Including HBASE libraries found via (/hadoop/hbase/bin/hbase) for HBASE access
Info: Excluding /hadoop/hbase/lib/slf4j-api-1.7.7.jar from classpath
Info: Excluding /hadoop/hadoop-2.7.1/share/hadoop/common/lib/slf4j-api-1.7.10.jar from classpath
Info: Excluding /hadoop/hadoop-2.7.1/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar from classpath
Info: Excluding /hadoop/hadoop-2.7.1/lib/tez/lib/slf4j-api-1.7.10.jar from classpath
Info: Including Hive libraries found via (/hadoop/hive) for Hive access

... (log output truncated)
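At this point agent4's avro source (10.20.2.24:6666) is already listening, so it can be smoke-tested on its own before the three upstream agents are started. This check is not part of the original walkthrough; it is a minimal sketch using Flume's bundled avro-client:

# optional: send one local file straight to agent4's avro source
echo "avro smoke test" > /tmp/avro-test.txt
flume-ng avro-client -H 10.20.2.24 -p 6666 -F /tmp/avro-test.txt

If the source, channel and HDFS sink are wired correctly, a corresponding file should appear under the configured HDFS path shortly afterwards.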

  • [root@yz-sre-backup019 bin]# bash start-agent2.sh
  • [root@yz-sre-backup019 log]# tail -f /data/flume/log/agent2-netcat.log

Info: Sourcing environment configuration script /root/apps/apache-flume-1.6.0-cdh5.7.0-bin/conf/flume-env.sh
Info: Including Hive libraries found via () for Hive access
+ exec /usr/jdk1.8.0_241/bin/java -Xmx20m -Dflume.monitoring.type=http -Dflume.monitoring.port=10502 -Dflume.root.logger==INFO,console -cp '/root/apps/apache-flume-1.6.0-cdh5.7.0-bin/conf:/root/apps/apache-flume-1.6.0-cdh5.7.0-bin/lib/*:/lib/*' -Djava.library.path= org.apache.flume.node.Application --name agent2 --conf-file /data/flume/job/agent2-netcat.conf
2020-04-15 16:55:45,585 (lifecycleSupervisor-1-0) [INFO - org.apache.flume.node.PollingPropertiesFileConfigurationProvider.start(PollingPropertiesFileConfigurationProvider.java:61)] Configuration provider starting
2020-04-15 16:55:45,592 (lifecycleSupervisor-1-0) [DEBUG - org.apache.flume.node.PollingPropertiesFileConfigurationProvider.start(PollingPropertiesFileConfigurationProvider.java:78)] Configuration provider started

... (log output truncated)

  • [root@yz-sre-backup019 bin]# bash start-agent1.sh
  • [root@yz-sre-backup019 bin]# tail -f /data/flume/log/agent1-exec.log

Info: Sourcing environment configuration script /root/apps/apache-flume-1.6.0-cdh5.7.0-bin/conf/flume-env.sh
Info: Including Hive libraries found via () for Hive access
+ exec /usr/jdk1.8.0_241/bin/java -Xmx20m -Dflume.monitoring.type=http -Dflume.monitoring.port=10501 -Dflume.root.logger==INFO,console -cp '/root/apps/apache-flume-1.6.0-cdh5.7.0-bin/conf:/root/apps/apache-flume-1.6.0-cdh5.7.0-bin/lib/*:/lib/*' -Djava.library.path= org.apache.flume.node.Application --name agent1 --conf-file /data/flume/job/agent1-exec.conf
2020-04-15 16:54:29,551 (lifecycleSupervisor-1-0) [INFO - org.apache.flume.node.PollingPropertiesFileConfigurationProvider.start(PollingPropertiesFileConfigurationProvider.java:61)] Configuration provider starting
2020-04-15 16:54:29,558 (lifecycleSupervisor-1-0) [DEBUG - org.apache.flume.node.PollingPropertiesFileConfigurationProvider.start(PollingPropertiesFileConfigurationProvider.java:78)] Configuration provider started

... (log output truncated)

  • [root@yz-sre-backup019 bin]# bash start-agent3.sh
  • [root@yz-sre-backup019 log]# tail -f /data/flume/log/agent3-dir.log

Info: Sourcing environment configuration script /root/apps/apache-flume-1.6.0-cdh5.7.0-bin/conf/flume-env.sh
Info: Including Hive libraries found via () for Hive access
+ exec /usr/jdk1.8.0_241/bin/java -Xmx20m -Dflume.monitoring.type=http -Dflume.monitoring.port=10503 -Dflume.root.logger==INFO,console -cp '/root/apps/apache-flume-1.6.0-cdh5.7.0-bin/conf:/root/apps/apache-flume-1.6.0-cdh5.7.0-bin/lib/*:/lib/*' -Djava.library.path= org.apache.flume.node.Application --name agent3 --conf-file /data/flume/job/agent3-dir.conf
2020-04-15 16:57:24,138 (lifecycleSupervisor-1-0) [INFO - org.apache.flume.node.PollingPropertiesFileConfigurationProvider.start(PollingPropertiesFileConfigurationProvider.java:61)] Configuration provider starting
2020-04-15 16:57:24,145 (lifecycleSupervisor-1-0) [DEBUG - org.apache.flume.node.PollingPropertiesFileConfigurationProvider.start(PollingPropertiesFileConfigurationProvider.java:78)] Configuration provider started

... (log output truncated)
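Before moving on to the tests, it is worth confirming that all four agents actually came up. The original post does not include this step; a quick check using the same tools used elsewhere in this article might look like:

# on yz-sre-backup019: the three sending agents and their monitoring ports
ps -ef | grep org.apache.flume.node.Application | grep -v grep
ss -ntl | grep -E '10501|10502|10503'

# on yz-bi-web01: the aggregating agent, its avro source port and its monitoring port
ps -ef | grep org.apache.flume.node.Application | grep -v grep
ss -ntl | grep -E '6666|10504'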

(5)分别进行测试并查看对应的日志信息

// agent1 测试

  • [root@yz-sre-backup019 ~]# echo "帅飞飞!!!" >> /data/data.log
  • [root@yz-bi-web01 log]# tail -f /data/flume/log/agent4-hdfs.log

2020-04-15 17:16:04,150 (SinkRunner-PollingRunner-DefaultSinkProcessor) [INFO - org.apache.flume.sink.hdfs.HDFSDataStream.configure(HDFSDataStream.java:58)] Serializer = TEXT, UseRawLocalFileSystem = false
2020-04-15 17:16:04,444 (SinkRunner-PollingRunner-DefaultSinkProcessor) [INFO - org.apache.flume.sink.hdfs.BucketWriter.open(BucketWriter.java:234)] Creating hdfs://yz-higo-nn1:9000/flume/2020-04-15/17/.1586942164150.tmp

// agent2 测试

  • [root@yz-sre-backup019 ~]# telnet 127.0.0.1 8888

Trying 127.0.0.1...
Connected to 127.0.0.1.
Escape character is '^]'.
飞花点点轻!
OK

  • [root@yz-bi-web01 log]# tail -f /data/flume/log/agent4-hdfs.log

2020-04-15 17:17:07,096 (SinkRunner-PollingRunner-DefaultSinkProcessor) [INFO - org.apache.flume.sink.hdfs.HDFSDataStream.configure(HDFSDataStream.java:58)] Serializer = TEXT, UseRawLocalFileSystem = false
2020-04-15 17:17:07,143 (SinkRunner-PollingRunner-DefaultSinkProcessor) [INFO - org.apache.flume.sink.hdfs.BucketWriter.open(BucketWriter.java:234)] Creating hdfs://yz-higo-nn1:9000/flume/2020-04-15/17/.1586942227097.tmp

// agent3 测试
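The exact commands for this test are not shown in the original post (most likely a screenshot). Judging from the file names and contents that appear in HDFS below, the test amounted to dropping two small files into the spooling directory, roughly:

echo "https://showufei.blog.csdn.net" > /tmp/wufei.csdn
echo "https://showufei.blog.csdn.net" > /tmp/wufei.py
mv /tmp/wufei.csdn /tmp/wufei.py /data/flume/upload/

Once agent3 has ingested a file, the spooldir source renames it in place with the configured .COMPLETED suffix.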

 

  • [root@yz-bi-web01 log]# tail -f /data/flume/log/agent4-hdfs.log

2020-04-15 17:26:37,864 (SinkRunner-PollingRunner-DefaultSinkProcessor) [INFO - org.apache.flume.sink.hdfs.HDFSDataStream.configure(HDFSDataStream.java:58)] Serializer = TEXT, UseRawLocalFileSystem = false
2020-04-15 17:26:37,902 (SinkRunner-PollingRunner-DefaultSinkProcessor) [INFO - org.apache.flume.sink.hdfs.BucketWriter.open(BucketWriter.java:234)] Creating hdfs://yz-higo-nn1:9000/flume/2020-04-15/17/wufei.csdn.1586942797865.tmp

2020-04-15 17:27:18,969 (SinkRunner-PollingRunner-DefaultSinkProcessor) [INFO - org.apache.flume.sink.hdfs.HDFSDataStream.configure(HDFSDataStream.java:58)] Serializer = TEXT, UseRawLocalFileSystem = false
2020-04-15 17:27:19,009 (SinkRunner-PollingRunner-DefaultSinkProcessor) [INFO - org.apache.flume.sink.hdfs.BucketWriter.open(BucketWriter.java:234)] Creating hdfs://yz-higo-nn1:9000/flume/2020-04-15/17/wufei.py.1586942838970.tmp

// 查看 hdfs 对应目录是否生成相应的日志信息

  • [root@yz-bi-web01 ~]# su - hadoop
  • [hadoop@yz-bi-web01 ~]$ hdfs dfs -ls /flume/2020-04-15/17

Found 4 items
-rw-r--r-- 3 root hadoop 33 2020-04-15 17:17 /flume/2020-04-15/17/.1586942164150
-rw-r--r-- 3 root hadoop 18 2020-04-15 17:18 /flume/2020-04-15/17/.1586942227097
-rw-r--r-- 3 root hadoop 31 2020-04-15 17:27 /flume/2020-04-15/17/wufei.csdn.1586942797865
-rw-r--r-- 3 root hadoop 31 2020-04-15 17:28 /flume/2020-04-15/17/wufei.py.1586942838970

  • [hadoop@yz-bi-web01 ~]$ hadoop fs -cat /flume/2020-04-15/17/.1586942164150

帅飞飞!!!

  • [hadoop@yz-bi-web01 ~]$ hadoop fs -cat /flume/2020-04-15/17/.1586942227097

飞花点点轻!

  • [hadoop@yz-bi-web01 ~]$ hadoop fs -cat /flume/2020-04-15/17/wufei.csdn.1586942797865

https://showufei.blog.csdn.net

  • [hadoop@yz-bi-web01 ~]$ hadoop fs -cat /flume/2020-04-15/17/wufei.py.1586942838970

https://showufei.blog.csdn.net

 

(6)查看 flume 度量值
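The JSON blocks below come from each agent's HTTP monitoring endpoint, on the ports set in the start scripts. The original post does not show the exact commands; they were presumably along these lines, with jq as optional pretty-printing:

# on yz-sre-backup019
curl http://127.0.0.1:10501/metrics | jq .
curl http://127.0.0.1:10502/metrics | jq .
curl http://127.0.0.1:10503/metrics | jq .
# on yz-bi-web01
curl http://127.0.0.1:10504/metrics | jq .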

// agent1

{
"CHANNEL.memory_channel": {
"ChannelCapacity": "1000",
"ChannelFillPercentage": "0.0",
"Type": "CHANNEL",
"EventTakeSuccessCount": "2",
"ChannelSize": "0",
"EventTakeAttemptCount": "409",
"StartTime": "1586940869786",
"EventPutAttemptCount": "2",
"EventPutSuccessCount": "2",
"StopTime": "0"
},
"SOURCE.exec_source": {
"EventReceivedCount": "2",
"AppendBatchAcceptedCount": "0",
"Type": "SOURCE",
"EventAcceptedCount": "2",
"AppendReceivedCount": "0",
"StartTime": "1586940869797",
"AppendAcceptedCount": "0",
"OpenConnectionCount": "0",
"AppendBatchReceivedCount": "0",
"StopTime": "0"
},
"SINK.avro_sink": {
"ConnectionCreatedCount": "1",
"ConnectionClosedCount": "0",
"Type": "SINK",
"BatchCompleteCount": "0",
"BatchEmptyCount": "404",
"EventDrainAttemptCount": "2",
"StartTime": "1586940869789",
"EventDrainSuccessCount": "2",
"BatchUnderflowCount": "2",
"StopTime": "0",
"ConnectionFailedCount": "0"
}
}

 

// agent3

{
"CHANNEL.memory_channel": {
"ChannelCapacity": "1000",
"ChannelFillPercentage": "0.0",
"Type": "CHANNEL",
"EventTakeSuccessCount": "2",
"ChannelSize": "0",
"EventTakeAttemptCount": "188",
"StartTime": "1586942644042",
"EventPutAttemptCount": "2",
"EventPutSuccessCount": "2",
"StopTime": "0"
},
"SINK.avro_sink": {
"ConnectionCreatedCount": "1",
"ConnectionClosedCount": "0",
"Type": "SINK",
"BatchCompleteCount": "0",
"BatchEmptyCount": "184",
"EventDrainAttemptCount": "2",
"StartTime": "1586942644045",
"EventDrainSuccessCount": "2",
"BatchUnderflowCount": "2",
"StopTime": "0",
"ConnectionFailedCount": "0"
},
"SOURCE.spooldir_source": {
"EventReceivedCount": "2",
"AppendBatchAcceptedCount": "2",
"Type": "SOURCE",
"AppendReceivedCount": "0",
"EventAcceptedCount": "2",
"StartTime": "1586942644126",
"AppendAcceptedCount": "0",
"OpenConnectionCount": "0",
"AppendBatchReceivedCount": "2",
"StopTime": "0"
}
}

 

// agent2

{
"CHANNEL.memory_channel": {
"ChannelCapacity": "1000",
"ChannelFillPercentage": "0.0",
"Type": "CHANNEL",
"ChannelSize": "0",
"EventTakeSuccessCount": "3",
"EventTakeAttemptCount": "357",
"StartTime": "1586941307761",
"EventPutAttemptCount": "3",
"EventPutSuccessCount": "3",
"StopTime": "0"
},
"SINK.avro_sink": {
"ConnectionCreatedCount": "1",
"ConnectionClosedCount": "0",
"Type": "SINK",
"BatchCompleteCount": "0",
"BatchEmptyCount": "351",
"EventDrainAttemptCount": "3",
"StartTime": "1586941307764",
"EventDrainSuccessCount": "3",
"BatchUnderflowCount": "3",
"StopTime": "0",
"ConnectionFailedCount": "0"
}
}

 

// agent4

{
"SOURCE.avro_source": {
"OpenConnectionCount": "3",
"Type": "SOURCE",
"AppendBatchReceivedCount": "6",
"AppendBatchAcceptedCount": "6",
"EventAcceptedCount": "6",
"AppendReceivedCount": "0",
"StopTime": "0",
"StartTime": "1586942110599",
"EventReceivedCount": "6",
"AppendAcceptedCount": "0"
},
"SINK.hdfs_sink": {
"BatchCompleteCount": "0",
"ConnectionFailedCount": "0",
"EventDrainAttemptCount": "6",
"ConnectionCreatedCount": "5",
"Type": "SINK",
"BatchEmptyCount": "255",
"ConnectionClosedCount": "5",
"EventDrainSuccessCount": "6",
"StopTime": "0",
"StartTime": "1586942110125",
"BatchUnderflowCount": "6"
},
"CHANNEL.memory_channel": {
"EventPutSuccessCount": "6",
"ChannelFillPercentage": "0.0",
"Type": "CHANNEL",
"StopTime": "0",
"EventPutAttemptCount": "6",
"ChannelSize": "0",
"StartTime": "1586942110121",
"EventTakeSuccessCount": "6",
"ChannelCapacity": "1000",
"EventTakeAttemptCount": "268"
}
}

 

(7)测试完删掉 flume 进程并清除 hdfs 上数据

[hadoop@yz-bi-web01 ~]$ hdfs dfs -rm -r -f -skipTrash /flume
Deleted /flume
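The original only shows the HDFS cleanup here; stopping the agents follows the same pattern used in the earlier cases. A sketch, to be run on both yz-sre-backup019 and yz-bi-web01:

# stop every remaining flume agent on this machine
ps -ef | grep org.apache.flume.node.Application | grep -v grep | awk '{print $2}' | xargs -r kill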

实战 06:挑选器案例

channel selector: decides which channel(s) a given event is written to.

  1. Replicating Channel Selector: the default; every event is replicated to all configured channels, i.e. one copy per channel.
  2. Multiplexing Channel Selector: routes events to different channels based on the value of an event header.
  3. See the official documentation for details: http://flume.apache.org/releases/content/1.9.0/FlumeUserGuide.html#flume-channel-selectors

Flow diagram: (image in the original post, not reproduced here)
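The original section stops at the diagram and does not include a configuration. For reference, a minimal multiplexing-selector sketch is shown below; the agent/component names and the "type" header are illustrative, not taken from the original, and in practice the header would be set upstream (e.g. by an interceptor or by the sending client). A replicating selector needs no extra configuration beyond listing several channels on the source.

# illustrative sketch: route events to different channels by the value of a "type" header
a1.sources = r1
a1.channels = c1 c2
a1.sinks = k1 k2

a1.sources.r1.type = netcat
a1.sources.r1.bind = 127.0.0.1
a1.sources.r1.port = 9999

# multiplexing selector: pick the channel from the "type" header
a1.sources.r1.selector.type = multiplexing
a1.sources.r1.selector.header = type
a1.sources.r1.selector.mapping.error = c1
a1.sources.r1.selector.mapping.info = c2
a1.sources.r1.selector.default = c2

a1.channels.c1.type = memory
a1.channels.c2.type = memory

a1.sinks.k1.type = logger
a1.sinks.k2.type = logger

a1.sources.r1.channels = c1 c2
a1.sinks.k1.channel = c1
a1.sinks.k2.channel = c2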

实战 07:主机拦截器案例

Interceptor: a component attached to the source that can modify or drop events as they are processed.

常见拦截器有:

  1. host interceptor:将发送的 event 添加主机名 header
  2. timestamp interceptor:将发送的 event 添加时间戳的 header
  3. 更多拦截器可参考官方文档:http://flume.apache.org/releases/content/1.9.0/FlumeUserGuide.html#flume-interceptors

(1)编辑主机拦截器配置文件(案例一)

agent 选型:netcat source + memory channel + logger sink

[root@yz-sre-backup019 job]# vim flume-host_interceptor.conf
[root@yz-sre-backup019 job]# cat flume-host_interceptor.conf
# flume-host_interceptor.conf: A single_node Flume configuration
 
# Name the components on this agent
wf_host_interceptor.sources = netcat_source
wf_host_interceptor.sinks = logger_sink
wf_host_interceptor.channels = memory_channel
# wf_host_interceptor: agent 的名称; netcat_source: source 的名称; logger_sink: sink 的名称; memory_channel: channel 的名称
 
# Describe/configure the source
# 配置 source 的类型
wf_host_interceptor.sources.netcat_source.type = netcat
# 配置 source 绑定的主机
wf_host_interceptor.sources.netcat_source.bind = 127.0.0.1
# 配置 source 绑定的主机端口
wf_host_interceptor.sources.netcat_source.port = 8888
 
# 指定添加拦截器
wf_host_interceptor.sources.netcat_source.interceptors = host_interceptor
wf_host_interceptor.sources.netcat_source.interceptors.host_interceptor.type = org.apache.flume.interceptor.HostInterceptor$Builder
wf_host_interceptor.sources.netcat_source.interceptors.host_interceptor.preserveExisting = false
# 指定 header 的 key
wf_host_interceptor.sources.netcat_source.interceptors.host_interceptor.hostHeader = hostname
# 指定 header 的 value 为主机 IP
wf_host_interceptor.sources.netcat_source.interceptors.host_interceptor.useIP = true
 
# Sink type is logger: events are written to the console/log output
wf_host_interceptor.sinks.logger_sink.type = logger
 
# 指定 channel 的类型为 memory,指定 channel 的容量是 1000,每次传输的容量是 100
# 配置 channel 的类型
wf_host_interceptor.channels.memory_channel.type = memory
# 配置通道中存储的最大 event 数
wf_host_interceptor.channels.memory_channel.capacity = 1000
# Maximum number of events per transaction that the channel accepts from a source or delivers to a sink
wf_host_interceptor.channels.memory_channel.transactionCapacity = 100
 
# 绑定 source 和 sink
# 把 source 和 channel 做关联,其中属性是 channels,说明 sources 可以和多个 channel 做关联
wf_host_interceptor.sources.netcat_source.channels = memory_channel
# 把 sink 和 channel 做关联,只能输出到一个 channel
wf_host_interceptor.sinks.logger_sink.channel = memory_channel

(2)编写启动脚本

[root@yz-sre-backup019 bin]# vim start-host_interceptor.sh
[root@yz-sre-backup019 bin]# chmod +x start-host_interceptor.sh
[root@yz-sre-backup019 bin]# cat start-host_interceptor.sh
#!/bin/bash
# @author: wufei
# @wiki: http://wiki.inf.lehe.com/pages/viewpage.action?pageId=39096985
# @email: wufei@higohappy.com
# @Date: Thu Apr 16 11:33:39 CST 2020
 
# Start the agent with Flume's built-in HTTP monitoring enabled
# --conf: Flume configuration directory
# --conf-file: the custom agent configuration file
# --name: agent name, must match the name used in the configuration file, i.e. wf_host_interceptor
# -Dflume.root.logger: log level and output target
nohup flume-ng agent --conf ${FLUME_HOME}/conf --conf-file=/data/flume/job/flume-host_interceptor.conf --name wf_host_interceptor -Dflume.monitoring.type=http -Dflume.monitoring.port=10520 -Dflume.root.logger=INFO,console >> /data/flume/log/flume-host_interceptor.log 2>&1 &

(3)启动并连接到指定端口发送测试数据

  • [root@yz-sre-backup019 bin]# bash start-host_interceptor.sh
  • [root@yz-sre-backup019 bin]# ss -ntl

State Recv-Q Send-Q Local Address:Port Peer Address:Port
LISTEN 0 128 *:22 *:*
LISTEN 0 50 *:10520 *:*
LISTEN 0 50 127.0.0.1:8888 *:*
LISTEN 0 100 127.0.0.1:25 *:*
LISTEN 0 128 *:1988 *:*

  • [root@yz-sre-backup019 log]# tail -f /data/flume/log/flume-host_interceptor.log

Info: Sourcing environment configuration script //root/apps/apache-flume-1.6.0-cdh5.7.0-bin/conf/flume-env.sh
Info: Including Hive libraries found via () for Hive access
+ exec /usr/jdk1.8.0_241/bin/java -Xmx20m -Dflume.monitoring.type=http -Dflume.monitoring.port=10520 -Dflume.root.logger=INFO,console -cp '//root/apps/apache-flume-1.6.0-cdh5.7.0-bin/conf:/root/apps/apache-flume-1.6.0-cdh5.7.0-bin/lib/*:/lib/*' -Djava.library.path= org.apache.flume.node.Application --conf-file=/data/flume/job/flume-host_interceptor.conf --name wf_host_interceptor
2020-04-16 12:08:55,064 (lifecycleSupervisor-1-0) [INFO - org.apache.flume.node.PollingPropertiesFileConfigurationProvider.start(PollingPropertiesFileConfigurationProvider.java:61)] Configuration provider starting

... (log output truncated)

2020-04-16 12:08:55,492 (conf-file-poller-0) [INFO - org.mortbay.log.Slf4jLog.info(Slf4jLog.java:67)] Started SelectChannelConnector@0.0.0.0:10520
2020-04-16 12:09:41,350 (SinkRunner-PollingRunner-DefaultSinkProcessor) [INFO - org.apache.flume.sink.LoggerSink.process(LoggerSink.java:94)] Event: { headers:{hostname=10.20.3.36} body: 53 48 4F 57 75 66 65 69 0D SHOWufei. }
2020-04-16 12:09:50,352 (SinkRunner-PollingRunner-DefaultSinkProcessor) [INFO - org.apache.flume.sink.LoggerSink.process(LoggerSink.java:94)] Event: { headers:{hostname=10.20.3.36} body: 41 6E 20 69 6E 74 65 72 63 65 70 74 6F 72 20 69 An interceptor i }

  • [root@yz-sre-backup019 ~]# telnet 127.0.0.1 8888

Trying 127.0.0.1...
Connected to 127.0.0.1.
Escape character is '^]'.
SHOWufei
OK
An interceptor is an aircraft or ground-based missile system designed to intercept and attack enemy planes.
OK
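Beyond the logger sink used here, the hostname header becomes useful once a downstream sink references it. An illustrative fragment (not part of the original configuration) shows how an HDFS sink could bucket data by source host:

# illustrative fragment: use the "hostname" header added by the host interceptor in the HDFS path
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = hdfs://yz-higo-nn1:9000/flume/%{hostname}/%Y-%m-%d
a1.sinks.k1.hdfs.useLocalTimeStamp = true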

(4)编辑时间戳拦截器配置文件(案例二)

agent 选型:netcat source + memory channel + logger sink

[root@yz-sre-backup019 job]# vim flume-timestamp_interceptor.conf
[root@yz-sre-backup019 job]# cat flume-timestamp_interceptor.conf
# flume-timestamp_interceptor.conf: A single_node Flume configuration
 
# Name the components on this agent
wf_timestamp_interceptor.sources = netcat_source
wf_timestamp_interceptor.sinks = logger_sink
wf_timestamp_interceptor.channels = memory_channel
# wf_timestamp_interceptor: agent 的名称; netcat_source: source 的名称; logger_sink: sink 的名称; memory_channel: channel 的名称
 
# Describe/configure the source
# 配置 source 的类型
wf_timestamp_interceptor.sources.netcat_source.type = netcat
# 配置 source 绑定的主机
wf_timestamp_interceptor.sources.netcat_source.bind = 127.0.0.1
# 配置 source 绑定的主机端口
wf_timestamp_interceptor.sources.netcat_source.port = 8888
 
# 指定添加拦截器
wf_timestamp_interceptor.sources.netcat_source.interceptors = timestamp_interceptor
wf_timestamp_interceptor.sources.netcat_source.interceptors.timestamp_interceptor.type = timestamp
 
# Sink type is logger: events are written to the console/log output
wf_timestamp_interceptor.sinks.logger_sink.type = logger
 
# 指定 channel 的类型为 memory,指定 channel 的容量是 1000,每次传输的容量是 100
# 配置 channel 的类型
wf_timestamp_interceptor.channels.memory_channel.type = memory
# 配置通道中存储的最大 event 数
wf_timestamp_interceptor.channels.memory_channel.capacity = 1000
# Maximum number of events per transaction that the channel accepts from a source or delivers to a sink
wf_timestamp_interceptor.channels.memory_channel.transactionCapacity = 100
 
# 绑定 source 和 sink
# 把 source 和 channel 做关联,其中属性是 channels,说明 sources 可以和多个 channel 做关联
wf_timestamp_interceptor.sources.netcat_source.channels = memory_channel
# 把 sink 和 channel 做关联,只能输出到一个 channel
wf_timestamp_interceptor.sinks.logger_sink.channel = memory_channel

(5)编写启动脚本

[root@yz-sre-backup019 bin]# vim start-timestamp_interceptor.sh
[root@yz-sre-backup019 bin]# chmod +x start-timestamp_interceptor.sh
[root@yz-sre-backup019 bin]# cat start-timestamp_interceptor.sh
#!/bin/bash
# @author: wufei
# @wiki: http://wiki.inf.lehe.com/pages/viewpage.action?pageId=39096985
# @email: wufei@higohappy.com
# @Date: Thu Apr 16 12:26:26 CST 2020
 
# Start the agent with Flume's built-in HTTP monitoring enabled
# --conf: Flume configuration directory
# --conf-file: the custom agent configuration file
# --name: agent name, must match the name used in the configuration file, i.e. wf_timestamp_interceptor
# -Dflume.root.logger: log level and output target
nohup flume-ng agent --conf ${FLUME_HOME}/conf --conf-file=/data/flume/job/flume-timestamp_interceptor.conf --name wf_timestamp_interceptor -Dflume.monitoring.type=http -Dflume.monitoring.port=10521 -Dflume.root.logger=INFO,console >> /data/flume/log/flume-timestamp_interceptor.log 2>&1 &

(6)启动并连接到指定端口发送测试数据

  • [root@yz-sre-backup019 bin]# bash start-timestamp_interceptor.sh
  • [root@yz-sre-backup019 bin]# ss -ntl

State Recv-Q Send-Q Local Address:Port Peer Address:Port
LISTEN 0 128 *:22 *:*
LISTEN 0 50 127.0.0.1:8888 *:*
LISTEN 0 50 *:10521 *:*
LISTEN 0 100 127.0.0.1:25 *:*
LISTEN 0 128 *:1988 *:*

  • [root@yz-sre-backup019 log]# tail -f /data/flume/log/flume-timestamp_interceptor.log

Info: Sourcing environment configuration script //root/apps/apache-flume-1.6.0-cdh5.7.0-bin/conf/flume-env.sh
Info: Including Hive libraries found via () for Hive access
+ exec /usr/jdk1.8.0_241/bin/java -Xmx20m -Dflume.monitoring.type=http -Dflume.monitoring.port=10521 -Dflume.root.logger=INFO,console -cp '//root/apps/apache-flume-1.6.0-cdh5.7.0-bin/conf:/root/apps/apache-flume-1.6.0-cdh5.7.0-bin/lib/*:/lib/*' -Djava.library.path= org.apache.flume.node.Application --conf-file=/data/flume/job/flume-timestamp_interceptor.conf --name wf_timestamp_interceptor
2020-04-16 12:28:54,666 (lifecycleSupervisor-1-0) [INFO - org.apache.flume.node.PollingPropertiesFileConfigurationProvider.start(PollingPropertiesFileConfigurationProvider.java:61)] Configuration provider starting

... (log output truncated)

2020-04-16 12:28:55,062 (conf-file-poller-0) [INFO - org.mortbay.log.Slf4jLog.info(Slf4jLog.java:67)] Started SelectChannelConnector@0.0.0.0:10521

2020-04-16 12:30:15,386 (SinkRunner-PollingRunner-DefaultSinkProcessor) [INFO - org.apache.flume.sink.LoggerSink.process(LoggerSink.java:94)] Event: { headers:{timestamp=1587011415381} body: 53 48 4F 57 75 66 65 69 E3 80 82 2E 2E 2E E3 80 SHOWufei........ }
2020-04-16 12:30:24,945 (SinkRunner-PollingRunner-DefaultSinkProcessor) [INFO - org.apache.flume.sink.LoggerSink.process(LoggerSink.java:94)] Event: { headers:{timestamp=1587011424945} body: 41 6E 20 69 6E 74 65 72 63 65 70 74 6F 72 20 69 An interceptor i }

  •  [root@yz-sre-backup019 ~]# telnet 127.0.0.1 8888

Trying 127.0.0.1...
Connected to 127.0.0.1.
Escape character is '^]'.
SHOWufei...。。
OK
An interceptor is an aircraft or ground-based missile system designed to intercept and attack enemy planes.
OK
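A practical consequence of the timestamp header: an HDFS sink can resolve its %Y-%m-%d/%H style escape sequences from the event's timestamp header directly, so the hdfs.useLocalTimeStamp = true setting used in the earlier cases is no longer required. An illustrative fragment, not from the original post:

# illustrative fragment: with a timestamp interceptor upstream, the HDFS sink
# buckets by the event time carried in the "timestamp" header
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = hdfs://yz-higo-nn1:9000/flume/%Y-%m-%d/%H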

四、Practical application in production environments

To be completed...
