Collecting Data with Flume
Source: Shiyanlou (实验楼), https://www.shiyanlou.com/courses/801

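I. Lab Introduction
1.1 Lab Content
Flume is a distributed log collection system that can handle log data of many types and formats. Its sources include avro, thrift, exec, jms, spooling directory, netcat, sequence generator, syslog, http, legacy, and custom sources. This lesson focuses on practical Flume use cases.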

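1.2 Course Source
This course is based on Chapter 10 of 《Hadoop基础教程》 from 图灵教育 (Turing Education), and we sincerely thank Turing Education for authorizing Shiyanlou to publish it. To study the book systematically, please purchase 《Hadoop基础教程》.

To make sure the lab can be completed in the Shiyanlou environment, we have supplemented the book's content with additional lab guidance, such as screenshots and code comments, to help you practice hands-on.

If you have questions or suggestions about the lab, you can post them in the discussion area at any time and work through them with other students.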

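1.3 Prerequisite Courses
Flume introduction and installation: https://www.shiyanlou.com/courses/237
Hadoop deployment and management: https://www.shiyanlou.com/courses/35

1.4 Key Concepts
The core concept in Flume is the agent.
An agent contains three core components: source, channel, and sink.
The sink is the component that delivers data to its destination. Destinations include hdfs, logger, avro, thrift, ipc, file, null, hbase, solr, and custom sinks.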
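
As a rough illustration of how the three components fit together, the sketch below writes a minimal agent definition to a temporary file and checks that the source and the sink both point at a channel the agent declares. The seq (sequence generator) source, the temporary file, and the `prop` helper are assumptions made for this example only.

```shell
#!/bin/sh
# Minimal agent definition: one source, one channel, one sink.
# The temp file and the choice of a seq source are illustrative.
conf_file="$(mktemp)"
cat > "$conf_file" <<'EOF'
a1.sources = r1
a1.channels = c1
a1.sinks = k1
a1.sources.r1.type = seq
a1.sources.r1.channels = c1
a1.channels.c1.type = memory
a1.sinks.k1.type = logger
a1.sinks.k1.channel = c1
EOF

# Hypothetical helper: print the value of one exact property key.
prop() {
    # $1 = properties file, $2 = property key
    awk -F' *= *' -v key="$2" '$1 == key { print $2 }' "$1"
}

declared="$(prop "$conf_file" a1.channels)"
source_channel="$(prop "$conf_file" a1.sources.r1.channels)"
sink_channel="$(prop "$conf_file" a1.sinks.k1.channel)"

# The source writes into the channel and the sink drains it,
# so both must name a declared channel.
if [ "$source_channel" = "$declared" ] && [ "$sink_channel" = "$declared" ]; then
    echo "source r1 -> channel $declared -> sink k1"
else
    echo "wiring mismatch" >&2
fi
```

Note the asymmetry in the property names: a source uses `channels` (it may feed several channels), while a sink uses `channel` (it drains exactly one).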
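1.5 Lab Environment
Hadoop-2.7.3
Flume-1.6.0
Xfce terminal

1.6 Intended Audience
This course is of intermediate difficulty and suits users with a basic knowledge of big data and hadoop. Some familiarity with data collection will make the course easier to follow.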

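II. Lab Steps
The files needed to configure and start hadoop-2.7.3 have already been downloaded and set up in the Shiyanlou environment, so you do not have to edit any configuration files. You can find them under /opt; you only need to format HDFS and start the hadoop processes.

2.1 Preparation
First, open the Xfce terminal on the desktop and switch to the hadoop user:

su -l hadoop    # the password is hadoop

From the /opt directory, format the hadoop NameNode.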

$ cd /opt/
$ hdfs namenode -format

Still in the /opt directory, start the hadoop processes.

$ hadoop-2.7.3/sbin/start-all.sh

Use jps to check whether the hadoop processes have started.

jps
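
To make the jps check less dependent on reading the list by eye, a small filter can report which daemons are absent. This is only a sketch: it assumes a single-node setup in which start-all.sh brings up NameNode, DataNode, SecondaryNameNode, ResourceManager, and NodeManager. The `missing_daemons` helper and the sample process list are made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: read jps-style output on stdin and print every
# expected daemon that does not appear in it.
missing_daemons() {
    jps_output="$(cat)"
    for daemon in NameNode DataNode SecondaryNameNode ResourceManager NodeManager; do
        # -w matches whole words, so "NameNode" does not match "SecondaryNameNode".
        if ! printf '%s\n' "$jps_output" | grep -qw "$daemon"; then
            echo "$daemon"
        fi
    done
}

# Illustrative sample input (process IDs are invented).
# On a live system use:  jps | missing_daemons
sample='2101 NameNode
2245 DataNode
2630 ResourceManager
2755 NodeManager
2900 Jps'
printf '%s\n' "$sample" | missing_daemons
```

For the sample above the filter reports SecondaryNameNode as missing. No output means all five daemons were found.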

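You can download Flume into the environment with the commands below, then install and configure it.

Note: Flume is already configured in the Shiyanlou environment, so you do not need to carry out the configuration steps below yourself.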

cd /opt/
wget http://labfile.oss.aliyuncs.com/courses/785/apache-flume-1.6.0-bin.tar.gz
sudo tar -zxvf apache-flume-1.6.0-bin.tar.gz

Modify flume-env.sh:

cd apache-flume-1.6.0-bin/conf/
sudo cp flume-env.sh.template flume-env.sh
sudo vi flume-env.sh

Change the following in flume-env.sh:

export JAVA_HOME=/usr/lib/jvm/java-8-oracle

# Give Flume more memory and pre-allocate, enable remote monitoring via JMX
export JAVA_OPTS="-Xms100m -Xmx2000m -Dcom.sun.management.jmxremote"

Create flume-conf.properties from its template:

sudo cp flume-conf.properties.template flume-conf.properties

Use mkdir to create a logs directory under the extracted Flume directory, and chmod to grant permissions on it.

$ sudo mkdir /opt/apache-flume-1.6.0-bin/logs
$ sudo chmod 777 -R /opt/apache-flume-1.6.0-bin

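2.2 Case 1: Spool
The Spool source watches the configured directory for newly added files and reads the data out of them. Keep the following in mind:

Files copied into the spool directory must not be opened and edited afterwards.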
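
One way to respect this rule is to finish writing a file somewhere else and only then move it into the watched directory. The sketch below shows the pattern with two temporary directories standing in for a staging area and the spool directory. Both directories are assumptions for the example, not paths used by the lab.

```shell
#!/bin/sh
# Stand-ins for a staging area and the watched spool directory
# (both are temporary directories created only for this example).
staging_dir="$(mktemp -d)"
spool_dir="$(mktemp -d)"

# Write the complete file outside the watched directory first...
printf 'Hello World\n' > "$staging_dir/spool.log"

# ...then move it in. On the same filesystem mv is a rename, so the
# file appears in the spool directory already complete.
mv "$staging_dir/spool.log" "$spool_dir/spool.log"

ls "$spool_dir"
```

The rename guarantee only holds when the staging directory and the spool directory are on the same filesystem. Across filesystems mv falls back to copy and delete.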

From /opt, create the agent configuration file spool.conf.

sudo vi apache-flume-1.6.0-bin/conf/spool.conf

Add the following content:

# Describe the agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source
a1.sources.r1.type = spooldir
a1.sources.r1.channels = c1
a1.sources.r1.spoolDir = /opt/apache-flume-1.6.0-bin/logs
a1.sources.r1.fileHeader = true

# Describe the sink
a1.sinks.k1.type = logger

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

Start the Flume agent.

cd apache-flume-1.6.0-bin/
mkdir -p logs
bin/flume-ng agent -c conf -f conf/spool.conf -n a1 -Dflume.root.logger=INFO,console
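
The same launch command recurs in every case of this lab with only the configuration file changing, so it may help to see the short options spelled out. The sketch below only prints the command rather than running it. The `flume_agent_cmd` helper is hypothetical, and the long options are the spelled-out forms of -c, -f, and -n.

```shell
#!/bin/sh
# Hypothetical helper: print the launch command for a config under conf/.
flume_agent_cmd() {
    # $1 = config name without the .conf extension
    # $2 = agent name, defaulting to a1
    echo "bin/flume-ng agent --conf conf --conf-file conf/$1.conf --name ${2:-a1} -Dflume.root.logger=INFO,console"
}

flume_agent_cmd spool
```

The agent name passed with -n has to match the prefix used in the configuration file (a1 in this lab). Otherwise the agent starts without any of the configured components.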

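Open another Xfce terminal and add a file to the apache-flume-1.6.0-bin/logs directory.

su -l hadoop
cd /opt
echo "Hello World" > apache-flume-1.6.0-bin/logs/spool.log

The Xfce terminal running the Flume agent then shows the corresponding event output.

Press Ctrl + C in that terminal to stop the Flume agent.

Check the files in the logs directory: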

cd logs/
ls
more spool.log.COMPLETED
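
The spooling directory source marks a fully ingested file by renaming it with the .COMPLETED suffix (its default), which is why the listing shows spool.log.COMPLETED. As a small sketch of how that convention can be used, the hypothetical `spool_status` helper below separates finished files from pending ones. The demo directory and file names are invented for the example.

```shell
#!/bin/sh
# Hypothetical helper: report which files in a spool directory are done.
spool_status() {
    # $1 = directory to inspect
    for f in "$1"/*; do
        [ -f "$f" ] || continue
        case "$f" in
            *.COMPLETED) echo "done    ${f##*/}" ;;
            *)           echo "pending ${f##*/}" ;;
        esac
    done
}

# Temporary directory standing in for apache-flume-1.6.0-bin/logs.
demo_dir="$(mktemp -d)"
touch "$demo_dir/spool.log.COMPLETED" "$demo_dir/new.log"
spool_status "$demo_dir"
```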

2.3 Case 2: Exec
The Exec source runs a given command and turns its output into events.

From /opt, create the agent configuration file exec.conf.

sudo vi apache-flume-1.6.0-bin/conf/exec.conf

Add the following content:

# Describe the agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe the source
a1.sources.r1.type = exec
a1.sources.r1.channels = c1
a1.sources.r1.command = tail -F /opt/apache-flume-1.6.0-bin/logs/log_exec_tail

# Describe the sink
a1.sinks.k1.type = logger

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

Start the Flume agent.

cd apache-flume-1.6.0-bin/
bin/flume-ng agent -c conf -f conf/exec.conf -n a1 -Dflume.root.logger=INFO,console

Open another Xfce terminal and use a script to append lines to /opt/apache-flume-1.6.0-bin/logs/log_exec_tail:

for i in {1..1000}
do
    echo "exec tail$i" >> /opt/apache-flume-1.6.0-bin/logs/log_exec_tail
done

The Xfce terminal running the Flume agent then shows the corresponding events.

Press Ctrl + C in that terminal to stop the Flume agent.

Check the contents of the file in the logs directory:

more /opt/apache-flume-1.6.0-bin/logs/log_exec_tail
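
Paging through a thousand lines with more is tedious, so a quicker sanity check is to count the lines and look at the first and last one. The sketch below repeats the generation step against a temporary file, which stands in for log_exec_tail and is an assumption of this example. It uses a portable while loop instead of brace expansion.

```shell
#!/bin/sh
# Temporary file standing in for logs/log_exec_tail.
log_file="$(mktemp)"

# Portable counterpart of the generator loop: append 1000 numbered lines.
i=1
while [ "$i" -le 1000 ]; do
    echo "exec tail$i" >> "$log_file"
    i=$((i + 1))
done

# The line count plus the first and last line confirm the whole range arrived.
line_count="$(wc -l < "$log_file" | tr -d ' ')"
echo "lines: $line_count"
head -n 1 "$log_file"
tail -n 1 "$log_file"
```

Against the real file, the same three checks (wc -l, head -n 1, tail -n 1) can be run on /opt/apache-flume-1.6.0-bin/logs/log_exec_tail.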

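2.4 Case 3: JSONHandler
The HTTP source receives data from remote clients; its JSONHandler accepts events posted as JSON.

From /opt, create the agent configuration file json.conf.

sudo vi apache-flume-1.6.0-bin/conf/json.conf

Add the following content: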

# Describe the agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe the source
a1.sources.r1.type = org.apache.flume.source.http.HTTPSource
a1.sources.r1.port = 8888
a1.sources.r1.channels = c1

# Describe the sink
a1.sinks.k1.type = logger

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

Start the Flume agent.

cd apache-flume-1.6.0-bin/
bin/flume-ng agent -c conf -f conf/json.conf -n a1 -Dflume.root.logger=INFO,console

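From another terminal, send a POST request in JSON format.

curl -X POST -d '[{ "headers" :{"a" : "a1","b" : "b1"},"body" : "shiyanlou.org_body"}]' http://localhost:8888

The terminal running the Flume agent then shows the corresponding event.

Press Ctrl + C in that terminal to stop the Flume agent.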
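
The request body in this case is a JSON array of events, each with a headers object and a body string. As a sketch of how such a payload could be assembled in a script, the hypothetical `json_event` helper below builds a one-event array with the same two header keys. It performs no escaping, so it is only suitable for simple values like the ones used here.

```shell
#!/bin/sh
# Hypothetical helper: build a one-event payload with headers a and b.
# No escaping is done, so values must not contain quotes or backslashes.
json_event() {
    # $1 = value of header a, $2 = value of header b, $3 = event body
    printf '[{"headers":{"a":"%s","b":"%s"},"body":"%s"}]\n' "$1" "$2" "$3"
}

payload="$(json_event a1 b1 shiyanlou.org_body)"
echo "$payload"

# With the agent from json.conf running, the payload could be sent with:
#   curl -X POST -d "$payload" http://localhost:8888
```

Keeping the payload in a double-quoted shell variable avoids retyping the quotes inside the JSON, which are easy to damage when the command is copied from a web page.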

2.5 Case 4: Syslogtcp
Next, we show how to write data to HDFS.

From /opt, create the agent configuration file syslogtcp.conf.

sudo vi apache-flume-1.6.0-bin/conf/syslogtcp.conf

Add the following content:

a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source
a1.sources.r1.type = syslogtcp
a1.sources.r1.port = 4444
a1.sources.r1.host = localhost
a1.sources.r1.channels = c1

# Describe the sink

a1.sinks.k1.type = hdfs
a1.sinks.k1.channel = c1
a1.sinks.k1.hdfs.path = hdfs://localhost:9000/user/hadoop/syslogtcp
a1.sinks.k1.hdfs.filePrefix = Syslog
a1.sinks.k1.hdfs.round = true
a1.sinks.k1.hdfs.roundValue = 10
a1.sinks.k1.hdfs.roundUnit = minute