Installing Flume
Version: apache-flume-1.9.0-bin.tar.gz
1.1 Installation Links
(1) Flume official site: http://flume.apache.org/
(2) Documentation: http://flume.apache.org/FlumeUserGuide.html
(3) Downloads: http://archive.apache.org/dist/flume/
1.2 Flume Installation and Deployment
- Upload apache-flume-1.9.0-bin.tar.gz to the /opt/software directory on Linux
- Extract apache-flume-1.9.0-bin.tar.gz into the /opt/module/ directory
[sarah@hadoop102 software]$ tar -zxf apache-flume-1.9.0-bin.tar.gz -C /opt/module/
- Rename apache-flume-1.9.0-bin to flume
[sarah@hadoop102 module]$ mv /opt/module/apache-flume-1.9.0-bin /opt/module/flume
- Delete guava-11.0.2.jar from the lib directory to avoid a Guava version conflict with Hadoop 3.1.3
[sarah@hadoop102 module]$ rm /opt/module/flume/lib/guava-11.0.2.jar
Log Collection
Custom Interceptor
1. Create a Maven project named flume_interceptor
2. Create the package com.sarah.flume.interceptor
3. Add the following to pom.xml:
<dependencies>
    <dependency>
        <groupId>org.apache.flume</groupId>
        <artifactId>flume-ng-core</artifactId>
        <version>1.9.0</version>
        <scope>provided</scope>
    </dependency>
    <dependency>
        <groupId>com.alibaba</groupId>
        <artifactId>fastjson</artifactId>
        <version>1.2.62</version>
    </dependency>
</dependencies>

<build>
    <plugins>
        <plugin>
            <artifactId>maven-compiler-plugin</artifactId>
            <version>2.3.2</version>
            <configuration>
                <source>1.8</source>
                <target>1.8</target>
            </configuration>
        </plugin>
        <plugin>
            <artifactId>maven-assembly-plugin</artifactId>
            <configuration>
                <descriptorRefs>
                    <descriptorRef>jar-with-dependencies</descriptorRef>
                </descriptorRefs>
            </configuration>
            <executions>
                <execution>
                    <id>make-assembly</id>
                    <phase>package</phase>
                    <goals>
                        <goal>single</goal>
                    </goals>
                </execution>
            </executions>
        </plugin>
    </plugins>
</build>
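A note on the scopes above: flume-ng-core is marked provided because the Flume installation on the server already ships those classes in its lib directory, so only fastjson ends up bundled into the jar-with-dependencies artifact produced by the assembly plugin.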
4. Create the JSONUtils class in the com.sarah.flume.interceptor package:
package com.sarah.flume.interceptor;

import com.alibaba.fastjson.JSON;

/**
 * @author leon
 * @ClassName JSONUtils.java
 * @createTime 2022-01-23 02:22:00
 */
public class JSONUtils {

    public static boolean isJSONValidate(String log) {
        try {
            // 1. Try to parse the string as a JSON object
            JSON.parseObject(log);
            return true;
        } catch (Exception e) {
            // 2. Parsing failed, so this is not a valid JSON string
            return false;
        }
    }
}
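As a quick sanity check, a small driver like the sketch below shows the expected behavior of JSONUtils.isJSONValidate. The JSONUtilsDemo class and the sample log strings are made up for illustration only; it assumes the fastjson dependency declared above is on the classpath.

package com.sarah.flume.interceptor;

// Hypothetical demo class, not part of the tutorial project.
public class JSONUtilsDemo {
    public static void main(String[] args) {
        // A well-formed JSON log line should be accepted
        System.out.println(JSONUtils.isJSONValidate("{\"common\":{\"mid\":\"mid_1\"},\"ts\":1642876800000}")); // expected: true
        // A truncated line should be rejected, since parsing throws an exception
        System.out.println(JSONUtils.isJSONValidate("{\"common\":{\"mid\":\"mid_1\"")); // expected: false
    }
}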
5. Create the ETLInterceptor class in the com.sarah.flume.interceptor package:
package com.sarah.flume.interceptor;

import org.apache.flume.Context;
import org.apache.flume.Event;
import org.apache.flume.interceptor.Interceptor;

import java.nio.charset.StandardCharsets;
import java.util.Iterator;
import java.util.List;

/**
 * @author leon
 * @ClassName ETLInterceptor.java
 * @createTime 2022-01-23 02:25:00
 */
public class ETLInterceptor implements Interceptor {

    @Override
    public void initialize() {}

    @Override
    public Event intercept(Event event) {
        // 1. Get the event body
        byte[] body = event.getBody();
        // 2. Decode the body into a string
        String log = new String(body, StandardCharsets.UTF_8);
        // 3. Check whether the string is valid JSON
        if (JSONUtils.isJSONValidate(log)) {
            // Valid: keep the event
            return event;
        } else {
            // Invalid: drop the event by returning null
            return null;
        }
    }

    /**
     * @param events
     * @describe This is where the non-conforming events are actually removed.
     * @return
     */
    @Override
    public List<Event> intercept(List<Event> events) {
        Iterator<Event> iterator = events.iterator();
        while (iterator.hasNext()) {
            Event next = iterator.next();
            if (intercept(next) == null) {
                // Remove events whose body is not valid JSON
                iterator.remove();
            }
        }
        return events;
    }

    @Override
    public void close() {
    }

    public static class Builder implements Interceptor.Builder {

        @Override
        public Interceptor build() {
            return new ETLInterceptor();
        }

        @Override
        public void configure(Context context) {
        }
    }
}
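Before packaging, the interceptor can be exercised locally with a small driver like the sketch below. The ETLInterceptorDemo class and the sample event bodies are illustrative only (not part of the tutorial project) and assume flume-ng-core 1.9.0 is on the classpath; events whose bodies are not valid JSON should be filtered out.

package com.sarah.flume.interceptor;

import org.apache.flume.Event;
import org.apache.flume.event.EventBuilder;
import org.apache.flume.interceptor.Interceptor;

import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical local driver, for illustration only.
public class ETLInterceptorDemo {
    public static void main(String[] args) {
        Interceptor interceptor = new ETLInterceptor.Builder().build();
        interceptor.initialize();

        List<Event> events = new ArrayList<>();
        // One well-formed JSON line and one broken line
        events.add(EventBuilder.withBody("{\"mid\":\"mid_1\",\"ts\":1642876800000}", StandardCharsets.UTF_8));
        events.add(EventBuilder.withBody("not-a-json-line", StandardCharsets.UTF_8));

        // Only the valid JSON event should survive
        List<Event> kept = interceptor.intercept(events);
        System.out.println("events kept: " + kept.size()); // expected: 1

        interceptor.close();
    }
}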
6. Build the jar and upload it to the /opt/module/flume/lib directory on hadoop102
[sarah@hadoop102 flume]$ ls -al lib/ | grep flume-interceptor*
-rw-rw-r--. 1 sarah sarah 660724 Jan 23 02:39 flume-interceptor-1.0-SNAPSHOT-jar-with-dependencies.jar
7. Distribute Flume's lib directory to the other nodes (so that the Flume installations on both hadoop102 and hadoop103 have the interceptor jar)
[sarah@hadoop102 module]$ xsync flume/lib
6.3 Writing the Log-Collection Flume Configuration File
6.3.1 Writing the Configuration File
- Create a job directory under the Flume root directory, then create the agent configuration file flume-tailDir-kafka.conf inside it
[sarah@hadoop102 job]$ vim flume-tailDir-kafka.conf
# Name the agent components
a1.sources = r1
a1.channels = c1

# Describe the source
a1.sources.r1.type = TAILDIR
a1.sources.r1.filegroups = f1
a1.sources.r1.filegroups.f1 = /opt/module/applog/log/app.*
a1.sources.r1.positionFile = /opt/module/flume/taildir_position.json
a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = com.sarah.flume.interceptor.ETLInterceptor$Builder

# Describe the channel
a1.channels.c1.type = org.apache.flume.channel.kafka.KafkaChannel
a1.channels.c1.kafka.bootstrap.servers = hadoop102:9092,hadoop103:9092
a1.channels.c1.kafka.topic = topic_log
a1.channels.c1.parseAsFlumeEvent = false

# Bind the source to the channel (the Kafka channel writes directly to Kafka, so no sink is needed)
a1.sources.r1.channels = c1
- Sync the configuration file to the other nodes (mainly hadoop102 and hadoop103)
[sarah@hadoop102 flume]$ xsync job/
6.3.2 Writing the Start/Stop Script
- Create the logs folder on every node
[sarah@hadoop102 flume]$ xcall mkdir -p /opt/module/flume/logs
- Create the script file f1.sh under /home/sarah/bin/ in the user's home directory
[sarah@hadoop102 bin]$ vim f1.sh
#!/bin/bash
# 1. Check that an argument was supplied
if [ $# == 0 ];then
    echo -e "Please pass an argument:\nstart  start the log-collection Flume agents\nstop   stop the log-collection Flume agents" && exit
fi
FLUME_HOME=/opt/module/flume
# 2. Run the command that matches the argument
case $1 in
"start"){
    # 3. Start the log-collection Flume agent on hadoop102 and hadoop103
    for host in hadoop102 hadoop103
    do
        echo "---------- Starting log-collection Flume on $host ----------"
        ssh $host "nohup $FLUME_HOME/bin/flume-ng agent -n a1 -c $FLUME_HOME/conf/ -f $FLUME_HOME/job/flume-tailDir-kafka.conf -Dflume.root.logger=INFO,LOGFILE >$FLUME_HOME/logs/flume.log 2>&1 &"
    done
};;
"stop"){
    # 4. Stop the log-collection Flume agent on hadoop102 and hadoop103
    for host in hadoop102 hadoop103
    do
        echo "---------- Stopping log-collection Flume on $host ----------"
        # Count the Flume processes on this host before trying to kill them
        flume_count=$(ssh $host "ps -ef | grep flume-tailDir-kafka | grep -v grep | wc -l")
        if [ $flume_count != 0 ];then
            ssh $host "ps -ef | grep flume-tailDir-kafka | grep -v grep | awk '{print \$2}' | xargs -n1 kill -9"
        else
            echo "No log-collection Flume agent is currently running on $host"
        fi
    done
};;
esac