A requirements change calls for streaming the data in real time, so the previous HDFS sink has to be swapped out for Kafka.
Straight to the configuration:
a1.sources = r1 r2 r4
a1.sinks = k1 k2 k4
a1.channels = c1 c2 c4
a1.sources.r1.type = spooldir
a1.sources.r1.spoolDir = /home/mjxt/flume_data/001/
a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = org.apache.flume.interceptor.TimestampInterceptor$Builder
a1.channels.c1.type = memory
a1.channels.c1.capacity = 200000
a1.channels.c1.transactionCapacity = 200000
a1.sinks.k1.channel = c1
a1.sinks.k1.type = org.apache.flume.sink.kafka.KafkaSink
# No dedicated Kafka cluster here, so these broker addresses are just test values
a1.sinks.k1.kafka.topic = etc_mj_001
a1.sinks.k1.kafka.bootstrap.servers = 10.42.3.56:9092,10.42.3.55:9092,10.42.3.53:9092
a1.sinks.k1.kafka.flumeBatchSize = 20
a1.sinks.k1.kafka.producer.acks = -1
a1.sinks.k1.kafka.producer.linger.ms = 1
a1.sinks.k1.kafka.producer.compression.type = snappy
a1.sources.r2.type = spooldir
a1.sources.r2.spoolDir = /home/mjxt/flume_data/002/
a1.sources.r2.interceptors = i2
a1.sources.r2.interceptors.i2.type = org.apache.flume.interceptor.TimestampInterceptor$Builder
a1.channels.c2.type = memory
a1.channels.c2.capacity = 200000
a1.channels.c2.transactionCapacity = 200000
a1.sinks.k2.channel = c2
a1.sinks.k2.type = org.apache.flume.sink.kafka.KafkaSink
a1.sinks.k2.kafka.topic = etc_mj_002
a1.sinks.k2.kafka.bootstrap.servers = 10.42.3.56:9092,10.42.3.55:9092,10.42.3.53:9092
a1.sinks.k2.kafka.flumeBatchSize = 20
a1.sinks.k2.kafka.producer.acks = -1
a1.sinks.k2.kafka.producer.linger.ms = 1
a1.sinks.k2.kafka.producer.compression.type = snappy
a1.sources.r4.type = spooldir
a1.sources.r4.spoolDir = /home/mjxt/flume_data/004/
a1.sources.r4.interceptors = i4
a1.sources.r4.interceptors.i4.type = org.apache.flume.interceptor.TimestampInterceptor$Builder
a1.channels.c4.type = memory
a1.channels.c4.capacity = 200000
a1.channels.c4.transactionCapacity = 200000
a1.sinks.k4.channel = c4
a1.sinks.k4.type = org.apache.flume.sink.kafka.KafkaSink
a1.sinks.k4.kafka.topic = etc_mj_004
a1.sinks.k4.kafka.bootstrap.servers = 10.42.3.56:9092,10.42.3.55:9092,10.42.3.53:9092
a1.sinks.k4.kafka.flumeBatchSize = 20
a1.sinks.k4.kafka.producer.acks = -1
a1.sinks.k4.kafka.producer.linger.ms = 1
a1.sinks.k4.kafka.producer.compression.type = snappy
a1.sources.r1.channels = c1
a1.sources.r2.channels = c2
a1.sources.r4.channels = c4
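With the agent up, a quick sanity check is to attach Kafka's stock console consumer to one of the topics and watch events arrive (run from the Kafka installation directory on a broker host; the broker address and topic below are the ones configured above):

```
bin/kafka-console-consumer.sh \
  --bootstrap-server 10.42.3.56:9092 \
  --topic etc_mj_001 \
  --from-beginning
```

If nothing shows up after dropping a file into the spool directory, check the Flume agent log before suspecting Kafka.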
Since this data is not especially important, no particular Kafka tuning was done here; the defaults are fine. The source is a spooling-directory source, which gives the fastest transfer, but it is also easy to lose data with this setup.
The agent can then be restarted via a script. Since the previous setup used the HDFS sink, the old agent needs to be stopped first; just swap the app parameter in the script.
#!/bin/bash
app="spool-kafka"
log_path="/home/mjxt/shell/heartbeat.log"

# Health check: start the agent only if it is not already running
checkStatus(){
    pid=$(ps -ef | grep "$app" | grep -v "grep" | awk '{print $2}')
    datetime=$(date)
    if [ -z "${pid}" ]; then
        echo "$datetime ---- starting service $app" >> "$log_path"
        /home/mjxt/apache-flume-1.9.0-bin/bin/flume-ng agent -n a1 -c /home/mjxt/apache-flume-1.9.0-bin/conf -f /home/mjxt/apache-flume-1.9.0-bin/conf/spool-kafka.conf -Dflume.root.logger=INFO,console >/dev/null 2>&1 &
    else
        echo "$datetime ---- service $app is already running with pid ${pid}!" >> "$log_path"
    fi
}

# Kill any agent running with this config, then start a fresh one
restart(){
    process=$(ps -ef | grep "spool-kafka.conf" | grep -v grep | grep -v PPID | awk '{print $2}')
    for i in $process
    do
        echo "kill the process [$i]"
        kill -9 "$i"
    done
    cd /home/mjxt/apache-flume-1.9.0-bin/; bin/flume-ng agent -n a1 -c /home/mjxt/apache-flume-1.9.0-bin/conf -f /home/mjxt/apache-flume-1.9.0-bin/conf/spool-kafka.conf >/dev/null 2>&1 &
}
restart
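The checkStatus function only helps if something invokes it on a schedule. Assuming the script is saved as /home/mjxt/shell/heartbeat.sh (a hypothetical path chosen to match the log location above) and ends with a call to checkStatus instead of restart, a cron entry could run it every minute:

```
* * * * * /bin/bash /home/mjxt/shell/heartbeat.sh
```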
Two tuning issues remain. First, if the channel is of the memory type:
a1.channels.c4.type = memory
a1.channels.c4.capacity = 200000
a1.channels.c4.transactionCapacity = 200000
it is very easy to hit an OOM. I suggest turning the last two parameters up a bit; note that transactionCapacity must never exceed capacity. The second issue is the flume-ng launcher configuration.
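Besides raising the heap, the memory channel also has byte-based limits that can guard against OOM (byteCapacity defaults to 80% of the JVM heap). A hedged sketch with illustrative, untuned values:

```
a1.channels.c4.type = memory
a1.channels.c4.capacity = 200000
a1.channels.c4.transactionCapacity = 200000
# Cap the channel's memory footprint at 128 MB (value in bytes)
a1.channels.c4.byteCapacity = 134217728
# Reserve 20% of byteCapacity for event headers (the default)
a1.channels.c4.byteCapacityBufferPercentage = 20
```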
Open the bin/flume-ng file: JAVA_OPTS defaults to a 20m heap. Increase it; this tiny default heap is itself one of the causes of the OOM.
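One place to raise the heap is conf/flume-env.sh (copied from flume-env.sh.template if absent), which bin/flume-ng sources at startup; the 2g figure below is just an illustrative value, not a tuned recommendation:

```shell
# conf/flume-env.sh -- sourced by bin/flume-ng at startup.
# Raise the heap from the 20m default; 2g here is only an example.
export JAVA_OPTS="-Xms1g -Xmx2g"
```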
If you also need to move data into the Flume-monitored directory in real time, see my previous article: https://blog.csdn.net/mianhuatang__/article/details/125766761?spm=1001.2014.3001.5502
The two approaches combine reasonably well.
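Whichever mechanism moves the files, one detail matters for the spooling-directory source: it must never see a half-written file, or the agent errors out on it. A minimal sketch of the write-then-rename pattern, using throwaway local paths as stand-ins for the real /home/mjxt/flume_data directories:

```shell
# Stand-in directories; substitute the real spool dir, e.g. /home/mjxt/flume_data/001/
SPOOL_DIR=./spool_demo
TMP_DIR=./spool_demo_tmp
mkdir -p "$SPOOL_DIR" "$TMP_DIR"

# 1. Write the complete file outside the monitored directory.
echo "some,record,data" > "$TMP_DIR/part-0001.csv"

# 2. mv within one filesystem is a rename, so the spooling source
#    only ever sees finished files appear.
mv "$TMP_DIR/part-0001.csv" "$SPOOL_DIR/part-0001.csv"
```

The same rule applies to scp or rsync feeds: land the file under a different name or directory first, then rename it into the spool dir.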