# Flume
Official site: http://flume.apache.org/
## Installing Flume
1. Download the release package
```sh
wget <download URL> --no-check-certificate
```
2. Extract the archive
```sh
tar zxvf <archive>
```
3. Configure the Java environment variable
```sh
cd $FLUME_HOME/conf
mv flume-env.sh.template flume-env.sh
vi flume-env.sh
-----------------------------------------
export JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.312.b07-1.el7_9.x86_64
```
4. Configure the Flume environment variables
```sh
vi /etc/profile
-----------------------------------------
# FLUME
export FLUME_HOME=/data/apache-flume-1.9.0-bin
export PATH=$FLUME_HOME/bin:$PATH
-----------------------------------------
source /etc/profile
```
5. Check the version
```sh
flume-ng version
Flume 1.9.0-cdh6.2.0
Source code repository: https://git-wip-us.apache.org/repos/asf/flume.git
Revision: 125073b4a71ea5104eb134ea2cd60231f5054162
Compiled by jenkins on Thu Mar 14 00:09:13 PDT 2019
From source with checksum 793f2bd22d87741fb31195b30e693d58
```
## Examples
### netcat source
![image-20210115085602523](https://gitee.com/forever428/picgo/raw/master/img/image-20210115085602523.png)
1. Write the Flume configuration file
```sh
vi flume.conf
--------------------------------
# Name the source, channel, and sink
a1.sources = s1
a1.channels = c1
a1.sinks = sk1
# Source settings
a1.sources.s1.type = netcat
a1.sources.s1.bind = 10.106.215.93
a1.sources.s1.port = 44444
a1.sources.s1.channels = c1
# Channel settings
a1.channels.c1.type = memory
# Sink settings
a1.sinks.sk1.type = logger
a1.sinks.sk1.channel = c1
```
2. Start Flume (two paths are easy to get wrong: the path to the configuration file and the path to the flume-ng script)
```sh
flume-ng agent --name a1 -f flume.conf -Dflume.root.logger=INFO,console
```
3. Send data with telnet
```sh
# Install telnet
yum -y install telnet.x86_64
# Connect to the host and port configured for the netcat source
telnet 10.106.215.93 44444
Trying 10.106.215.93...
Connected to 10.106.215.93.
Escape character is '^]'.
# Send data
hello world
OK
```
4. Check the Flume log output
![image-20210115091858853](https://gitee.com/forever428/picgo/raw/master/img/image-20210115091858853.png)
### exec source
1. Write the Flume configuration file
```sh
vi source_exec.conf
--------------------------------
# Name the source, channel, and sink
a1.sources = s1
a1.channels = c1
a1.sinks = sk1
# Source settings
# Type exec runs a command, so a command must be supplied
a1.sources.s1.type = exec
a1.sources.s1.command = tail -F /root/flume_conf/xxx.log
a1.sources.s1.channels = c1
# Channel settings
a1.channels.c1.type = memory
# Sink settings
a1.sinks.sk1.type = logger
a1.sinks.sk1.channel = c1
```
2. Start Flume. If `/root/flume_conf/xxx.log` does not exist yet, the tail command fails and the log shows `exited with 1`, so create `/root/flume_conf/xxx.log` first and then start Flume.
```sh
flume-ng agent --name a1 -f source_exec.conf -Dflume.root.logger=INFO,console
```
![image-20210115100505835](https://gitee.com/forever428/picgo/raw/master/img/image-20210115100505835.png)
3. Append data to the file. Add a few lines to the file and verify that Flume detects and collects them.
![image-20210115101114003](https://gitee.com/forever428/picgo/raw/master/img/image-20210115101114003.png)
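The append in step 3 can be reproduced from the shell; here a throwaway path under `/tmp` stands in for the `/root/flume_conf/xxx.log` used in the config above:

```sh
# Illustrative path; the exec source above tails /root/flume_conf/xxx.log
LOG=/tmp/flume_exec_demo.log
: > "$LOG"                      # create (or truncate) the file
# With tail -F running, each appended line becomes one Flume event
echo "hello flume" >> "$LOG"
echo "hello exec source" >> "$LOG"
tail -n 2 "$LOG"
```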
### spooldir source
1. Write the Flume configuration file
```sh
vi source_spooldir.conf
--------------------------------
# Name the source, channel, and sink
a1.sources = s1
a1.channels = c1
a1.sinks = sk1
# Source settings
# Type spooldir watches a directory for new files, so a directory must be supplied
a1.sources.s1.type = spooldir
a1.sources.s1.spoolDir = /root/test
a1.sources.s1.channels = c1
# Channel settings
a1.channels.c1.type = memory
# Sink settings
a1.sinks.sk1.type = logger
a1.sinks.sk1.channel = c1
--------------------------------
# Create the directory
mkdir /root/test
```
2. Start Flume
```sh
flume-ng agent --name a1 -f source_spooldir.conf -Dflume.root.logger=INFO,console
```
![image-20210115105005199](https://gitee.com/forever428/picgo/raw/master/img/image-20210115105005199.png)
3. Copy files into `/root/test` and watch Flume's output
```sh
cp flume_conf/*.conf test/
```
![image-20210115105124746](https://gitee.com/forever428/picgo/raw/master/img/image-20210115105124746.png)
4. Check the file names under `/root/test`: files that have been ingested are renamed with a `.COMPLETED` suffix
![image-20210115105256417](https://gitee.com/forever428/picgo/raw/master/img/image-20210115105256417.png)
### http source
1. Write the Flume configuration file
```sh
vi source_http.conf
--------------------------------
# Name the source, channel, and sink
a1.sources = s1
a1.channels = c1
a1.sinks = sk1
# Source settings
# Type http starts an HTTP server, so a port must be supplied
a1.sources.s1.type = http
a1.sources.s1.port = 5678
a1.sources.s1.channels = c1
# Channel settings
a1.channels.c1.type = memory
# Sink settings
a1.sinks.sk1.type = logger
a1.sinks.sk1.channel = c1
--------------------------------
```
2. Start Flume
```sh
flume-ng agent --name a1 -f source_http.conf -Dflume.root.logger=INFO,console
```
![image-20210115105832796](https://gitee.com/forever428/picgo/raw/master/img/image-20210115105832796.png)
3. Send a POST request and check the Flume output
```sh
curl -XPOST localhost:5678 -d'[{"headers":{"h1":"v1","h2":"v2"},"body":"hello body"}]'
```
![image-20210115110012034](https://gitee.com/forever428/picgo/raw/master/img/image-20210115110012034.png)
### taildir source
taildir can monitor one or more files at the same time, and keeps a position file recording the last read offset so the next run resumes where it left off. Note that in Flume's properties files comments must sit on their own lines; a trailing `#` comment would become part of the property value.
```sh
# Name the source and channel
a1.sources = r1
a1.channels = c1
# Source settings
# Set the type to TAILDIR
a1.sources.r1.type = TAILDIR
a1.sources.r1.channels = c1
# Where to store the position (offset) file
a1.sources.r1.positionFile = /var/log/flume/taildir_position.json
# Define two file groups, f1 and f2
a1.sources.r1.filegroups = f1 f2
# Absolute path for f1
a1.sources.r1.filegroups.f1 = /var/log/test1/example.log
# Add a key-value pair to the headers of events from f1
a1.sources.r1.headers.f1.headerKey1 = value1
# Absolute path pattern for f2
a1.sources.r1.filegroups.f2 = /var/log/test2/.*log.*
# Add key-value pairs to the headers of events from f2
a1.sources.r1.headers.f2.headerKey1 = value2
a1.sources.r1.headers.f2.headerKey2 = value2-2
# Whether to add a header holding the file's absolute path (default: false)
a1.sources.r1.fileHeader = true
```
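Once the agent has read some data, the position file holds a JSON array with one entry per tracked file; the inode, offset, and path values below are illustrative:

```json
[{"inode":1234567,"pos":42,"file":"/var/log/test1/example.log"},
 {"inode":1234570,"pos":128,"file":"/var/log/test2/app.log.1"}]
```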
### avro sink and source
1. Write the Flume configuration files
```sh
vi sink_avro.conf
--------------------------------
# Name the source, channel, and sink
a1.sources = s1
a1.channels = c1
a1.sinks = sk1
# Source settings
# Type http: events posted to this port are forwarded through the avro sink
a1.sources.s1.type = http
a1.sources.s1.port = 5678
a1.sources.s1.channels = c1
# Channel settings
a1.channels.c1.type = memory
# Sink settings: type avro, plus the target hostname and port
a1.sinks.sk1.type = avro
a1.sinks.sk1.hostname = localhost
a1.sinks.sk1.port = 4444
a1.sinks.sk1.channel = c1
--------------------------------
vi source_avro.conf
--------------------------------
# Name the source, channel, and sink
a1.sources = s1
a1.channels = c1
a1.sinks = sk1
# Source settings
# Type avro: listen on this hostname and port
a1.sources.s1.type = avro
a1.sources.s1.bind = localhost
a1.sources.s1.port = 4444
a1.sources.s1.channels = c1
# Channel settings
a1.channels.c1.type = memory
# Sink settings
a1.sinks.sk1.type = logger
a1.sinks.sk1.channel = c1
--------------------------------
```
2. Start Flume: start the source_avro agent (the avro server) first, then the sink_avro agent
```sh
flume-ng agent --name a1 -f source_avro.conf -Dflume.root.logger=INFO,console
flume-ng agent --name a1 -f sink_avro.conf -Dflume.root.logger=INFO,console
```
![image-20210115112325136](https://gitee.com/forever428/picgo/raw/master/img/image-20210115112325136.png)
![image-20210115112339649](https://gitee.com/forever428/picgo/raw/master/img/image-20210115112339649.png)
3. Send a POST request to the http source's port and check the Flume output
```sh
curl -XPOST localhost:5678 -d'[{"headers":{"h1":"v1","h2":"v2"},"body":"hello body"}]'
```
![image-20210115112426324](https://gitee.com/forever428/picgo/raw/master/img/image-20210115112426324.png)
### HDFS sink
1. Write the Flume configuration file
```sh
vi sink_hdfs.conf
--------------------------------
# Name the source, channel, and sink
a1.sources = s1
a1.channels = c1
a1.sinks = sk1
# Source settings
a1.sources.s1.type = netcat
a1.sources.s1.bind = localhost
a1.sources.s1.port = 44444
a1.sources.s1.channels = c1
# Channel settings
a1.channels.c1.type = memory
# Sink settings
a1.sinks.sk1.type = hdfs
a1.sinks.sk1.hdfs.path = /data/20210115
a1.sinks.sk1.channel = c1
```
2. Start Flume
```sh
flume-ng agent --name a1 -f sink_hdfs.conf -Dflume.root.logger=INFO,console
```
![image-20210115114644690](https://gitee.com/forever428/picgo/raw/master/img/image-20210115114644690.png)
3. Send data with telnet
```sh
# Start telnet
telnet localhost 44444
Trying 127.0.0.1...
Connected to localhost.
Escape character is '^]'.
# Send data
hello world
OK
hello spark
OK
hello hadoop scala
OK
```
4. Check the Flume log output
![image-20210115114808621](https://gitee.com/forever428/picgo/raw/master/img/image-20210115114808621.png)
5. Check the file on HDFS
```sh
hdfs dfs -text /data/20210115/FlumeData.1610682426351
```
![image-20210115115009833](https://gitee.com/forever428/picgo/raw/master/img/image-20210115115009833.png)
[Extension] Parameters commonly used in practice
![image-20210115115644466](https://gitee.com/forever428/picgo/raw/master/img/image-20210115115644466.png)
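As a sketch of such parameters (the property names come from the Flume HDFS sink documentation; the values are illustrative assumptions):

```sh
# Roll a new file every 30 s or at 128 MB; 0 disables rolling by event count
a1.sinks.sk1.hdfs.rollInterval = 30
a1.sinks.sk1.hdfs.rollSize = 134217728
a1.sinks.sk1.hdfs.rollCount = 0
# Write plain text instead of the default SequenceFile
a1.sinks.sk1.hdfs.fileType = DataStream
# Prefix for generated file names (default: FlumeData)
a1.sinks.sk1.hdfs.filePrefix = events
# Allow %Y%m%d-style escapes in hdfs.path without a timestamp header
a1.sinks.sk1.hdfs.useLocalTimeStamp = true
a1.sinks.sk1.hdfs.path = /data/%Y%m%d
```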
### Spark sink
1. Write the Flume configuration file
```sh
vi sink_spark-push.conf
--------------------------------
# Name the source, channel, and sink
a1.sources = s1
a1.channels = c1
a1.sinks = avroSink
# Source settings
a1.sources.s1.type = netcat
a1.sources.s1.bind = localhost
a1.sources.s1.port = 5678
a1.sources.s1.channels = c1
# Channel settings
a1.channels.c1.type = memory
# Sink settings
a1.sinks.avroSink.type = avro
a1.sinks.avroSink.channel = c1
a1.sinks.avroSink.hostname = singleNode
a1.sinks.avroSink.port = 9999
```
2. Write the Spark program
```scala
package cn.kgc.spark.Streaming
import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.dstream.{DStream, ReceiverInputDStream}
import org.apache.spark.streaming.flume.{FlumeUtils, SparkFlumeEvent}
import org.apache.spark.streaming.{Seconds, StreamingContext}
/**
 * Created by wangchunhui on 2021/1/19 11:40
 * Step 1: write the Flume configuration with an avro sink
 * Step 2: add the spark-streaming-flume dependency
 * Step 3: write the Spark program, using FlumeUtils.createStream as the data source
 * Step 4: build a jar, upload it to the cluster, and submit it with spark-submit
 * Step 5: start the Flume agent
 */
object Demo05_FlumePushWordCount {
  def main(args: Array[String]): Unit = {
    // Boilerplate: create the SparkConf and StreamingContext; the second argument is the batch interval
    val conf: SparkConf = new SparkConf().setAppName(this.getClass.getName).setMaster("local[4]")
    val ssc = new StreamingContext(conf, Seconds(5))
    // 1. Load the data source: a receiver listening on singleNode:9999 for events pushed by the avro sink
    val flumeStream: ReceiverInputDStream[SparkFlumeEvent] = FlumeUtils.createStream(ssc, "singleNode", 9999)
    // Decode the event body bytes into a String (calling toString directly on the byte array would only print its reference)
    val lines: DStream[String] = flumeStream.map(x => new String(x.event.getBody.array()).trim)
    // 2. Process the data [WordCount]
    val result: DStream[(String, Int)] = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
    // 3. Write out the result [print]
    result.print()
    // 4. Start the application
    ssc.start()
    ssc.awaitTermination()
  }
}
```
3. Build a jar, upload it to the cluster, and run it
```sh
spark-submit --class cn.kgc.spark.Streaming.Demo05_FlumePushWordCount SparkLearn-1.0-SNAPSHOT-jar-with-dependencies.jar
```
4. Start Flume
```sh
flume-ng agent --name a1 -f sink_spark-push.conf -Dflume.root.logger=INFO,console
```
5. Start telnet, type some data, and check the Spark output
```sh
telnet localhost 5678
Trying 127.0.0.1...
Connected to localhost.
Escape character is '^]'.
hello world spark hadoop hello
OK
```
![image-20210120105146709](https://gitee.com/forever428/picgo/raw/master/img/image-20210120105146709.png)