A requirements change calls for streaming the data in real time, so the previous HDFS sink has to be swapped out for Kafka.
Straight to the configuration:
a1.sources = r1 r2 r4
a1.sinks = k1 k2 k4
a1.channels = c1 c2 c4
a1.sources.r1.type = spooldir
a1.sources.r1.spoolDir = /home/mjxt/flume_data/001/
a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = org.apache.flume.interceptor.TimestampInterceptor$Builder
a1.channels.c1.type = memory
a1.channels.c1.capacity = 200000
a1.channels.c1.transactionCapacity = 200000
a1.sinks.k1.channel = c1
a1.sinks.k1.type = org.apache.flume.sink.kafka.KafkaSink
# No dedicated Kafka cluster here, so these broker addresses are just test values
a1.sinks.k1.kafka.topic = etc_mj_001
a1.sinks.k1.kafka.bootstrap.servers = 10.42.3.56:9092,10.42.3.55:9092,10.42.3.53:9092
a1.sinks.k1.kafka.flumeBatchSize = 20
a1.sinks.k1.kafka.producer.acks = -1
a1.sinks.k1.kafka.producer.linger.ms = 1
a1.sinks.k1.kafka.producer.compression.type = snappy
a1.sources.r2.type = spooldir
a1.sources.r2.spoolDir = /home/mjxt/flume_data/002/
a1.sources.r2.interceptors = i2
a1.sources.r2.interceptors.i2.type = org.apache.flume.interceptor.TimestampInterceptor$Builder
a1.channels.c2.type = memory
a1.channels.c2.capacity = 200000
a1.channels.c2.transactionCapacity = 200000
a1.sinks.k2.channel = c2
a1.sinks.k2.type = org.apache.flume.sink.kafka.KafkaSink
a1.sinks.k2.kafka.topic = etc_mj_002
a1.sinks.k2.kafka.bootstrap.servers = 10.42.3.56:9092,10.42.3.55:9092,10.42.3.53:9092
a1.sinks.k2.kafka.flumeBatchSize = 20
a1.sinks.k2.kafka.producer.acks = -1
a1.sinks.k2.kafka.producer.linger.ms = 1
a1.sinks.k2.kafka.producer.compression.type = snappy
a1.sources.r4.type = spooldir
a1.sources.r4.spoolDir = /home/mjxt/flume_data/004/
a1.sources.r4.interceptors = i4
a1.sources.r4.interceptors.i4.type = org.apache.flume.interceptor.TimestampInterceptor$Builder
a1.channels.c4.type = memory
a1.channels.c4.capacity = 200000
a1.channels.c4.transactionCapacity = 200000
a1.sinks.k4.channel = c4
a1.sinks.k4.type = org.apache.flume.sink.kafka.KafkaSink
a1.sinks.k4.kafka.topic = etc_mj_004
a1.sinks.k4.kafka.bootstrap.servers = 10.42.3.56:9092,10.42.3.55:9092,10.42.3.53:9092
a1.sinks.k4.kafka.flumeBatchSize = 20
a1.sinks.k4.kafka.producer.acks = -1
a1.sinks.k4.kafka.producer.linger.ms = 1
a1.sinks.k4.kafka.producer.compression.type = snappy
a1.sources.r1.channels = c1
a1.sources.r2.channels = c2
a1.sources.r4.channels = c4
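With the agent up, a quick sanity check is to attach Kafka's stock console consumer to one of the topics and watch events arrive (run from the Kafka installation directory on a broker host; the broker address and topic below are the ones configured above):

```
bin/kafka-console-consumer.sh \
  --bootstrap-server 10.42.3.56:9092 \
  --topic etc_mj_001 \
  --from-beginning
```

If nothing shows up after dropping a file into the spool directory, check the Flume agent log before suspecting Kafka.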
Since this data is not especially important, no particular Kafka tuning was done here; the defaults are fine. The source is a spooling-directory source, which gives the fastest transfer, but it is also easy to lose data with this setup.
The agent can then be restarted via a script. Since the previous setup used the HDFS sink, the old agent needs to be stopped first; just swap the app parameter in the script.
#!/bin/bash
app="spool-kafka"
log_path="/home/mjxt/shell/heartbeat.log"

# Health check: start the agent only if it is not already running
checkStatus(){
    pid=$(ps -ef | grep "$app" | grep -v "grep" | awk '{print $2}')
    datetime=$(date)
    if [ -z "${pid}" ]; then
        echo "$datetime ---- starting service $app" >> "$log_path"
        /home/mjxt/apache-flume-1.9.0-bin/bin/flume-ng agent -n a1 -c /home/mjxt/apache-flume-1.9.0-bin/conf -f /home/mjxt/apache-flume-1.9.0-bin/conf/spool-kafka.conf -Dflume.root.logger=INFO,console >/dev/null 2>&1 &
    else
        echo "$datetime ---- service $app is already running with pid ${pid}!" >> "$log_path"
    fi
}

# Kill any agent running with this config, then start a fresh one
restart(){
    process=$(ps -ef | grep "spool-kafka.conf" | grep -v grep | grep -v PPID | awk '{print $2}')
    for i in $process
    do
        echo "kill the process [$i]"
        kill -9 "$i"
    done
    cd /home/mjxt/apache-flume-1.9.0-bin/; bin/flume-ng agent -n a1 -c /home/mjxt/apache-flume-1.9.0-bin/conf -f /home/mjxt/apache-flume-1.9.0-bin/conf/spool-kafka.conf >/dev/null 2>&1 &
}
restart
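The checkStatus function only helps if something invokes it on a schedule. Assuming the script is saved as /home/mjxt/shell/heartbeat.sh (a hypothetical path chosen to match the log location above) and ends with a call to checkStatus instead of restart, a cron entry could run it every minute:

```
* * * * * /bin/bash /home/mjxt/shell/heartbeat.sh
```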
Two tuning issues remain. First, if the channel is of the memory type:
a1.channels.c4.type = memory
a1.channels.c4.capacity = 200000
a1.channels.c4.transactionCapacity = 200000
it is very easy to hit an OOM. I suggest turning the last two parameters up a bit; note that transactionCapacity must never exceed capacity. The second issue is the flume-ng launcher configuration.
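Besides raising the heap, the memory channel also has byte-based limits that can guard against OOM (byteCapacity defaults to 80% of the JVM heap). A hedged sketch with illustrative, untuned values:

```
a1.channels.c4.type = memory
a1.channels.c4.capacity = 200000
a1.channels.c4.transactionCapacity = 200000
# Cap the channel's memory footprint at 128 MB (value in bytes)
a1.channels.c4.byteCapacity = 134217728
# Reserve 20% of byteCapacity for event headers (the default)
a1.channels.c4.byteCapacityBufferPercentage = 20
```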
Open the bin/flume-ng file: JAVA_OPTS defaults to a 20m heap. Increase it; this tiny default heap is itself one of the causes of the OOM.
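One place to raise the heap is conf/flume-env.sh (copied from flume-env.sh.template if absent), which bin/flume-ng sources at startup; the 2g figure below is just an illustrative value, not a tuned recommendation:

```shell
# conf/flume-env.sh -- sourced by bin/flume-ng at startup.
# Raise the heap from the 20m default; 2g here is only an example.
export JAVA_OPTS="-Xms1g -Xmx2g"
```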
If you also need to move data into the Flume-monitored directory in real time, see my previous article: https://blog.csdn.net/mianhuatang__/article/details/125766761?spm=1001.2014.3001.5502
The two approaches combine reasonably well.
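Whichever mechanism moves the files, one detail matters for the spooling-directory source: it must never see a half-written file, or the agent errors out on it. A minimal sketch of the write-then-rename pattern, using throwaway local paths as stand-ins for the real /home/mjxt/flume_data directories:

```shell
# Stand-in directories; substitute the real spool dir, e.g. /home/mjxt/flume_data/001/
SPOOL_DIR=./spool_demo
TMP_DIR=./spool_demo_tmp
mkdir -p "$SPOOL_DIR" "$TMP_DIR"

# 1. Write the complete file outside the monitored directory.
echo "some,record,data" > "$TMP_DIR/part-0001.csv"

# 2. mv within one filesystem is a rename, so the spooling source
#    only ever sees finished files appear.
mv "$TMP_DIR/part-0001.csv" "$SPOOL_DIR/part-0001.csv"
```

The same rule applies to scp or rsync feeds: land the file under a different name or directory first, then rename it into the spool dir.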