整合Flume和kafka完成实时数据采集
kafka和Flume都有发送和接收数据功能,为什么还需要配合使用呢,个人认为,Flume是一个数据采集工具,只管采集和发送,并没有存储功能,做不到缓存,接收到如果不能及时消费信息,会有数据丢失的风险,kafka完全可以解决这个问题,kafka自带存储,可以先接收,再慢慢消费,做日志缓存应该是更为合适的。
当然,没有最好的工具,只有最合适的工具,应对不同使用场景,可以根据kafka和flume的特性做调整,本文就flume+kafka做介绍。
整体流程图
flume 配置
exec-memory-avro
avro-memory-kafka
exec-memory-avro.conf
exec-memory-avro.sources = exec-source
exec-memory-avro.sinks = avro-sink
exec-memory-avro.channels = memory-channel
exec-memory-avro.sources.exec-source.type = exec
exec-memory-avro.sources.exec-source.command = tail -F /root/data/data.log
exec-memory-avro.sources.exec-source.shell = /bin/sh -c
exec-memory-avro.sinks.avro-sink.type = avro
exec-memory-avro.sinks.avro-sink.hostname = hadoop1
exec-memory-avro.sinks.avro-sink.port = 44444
exec-memory-avro.channels.memory-channel.type = memory
exec-memory-avro.sources.exec-source.channels = memory-channel
exec-memory-avro.sinks.avro-sink.channel = memory-channel
avro-memory-kafka.conf
avro-memory-kafka.sources = avro-source
avro-memory-kafka.sinks = kafka-sink
avro-memory-kafka.channels = memory-channel
avro-memory-kafka.sources.avro-source.type = avro
avro-memory-kafka.sources.avro-source.bind = hadoop000
avro-memory-kafka.sources.avro-source.port = 44444
###kafka sink 配置
avro-memory-kafka.sinks.kafka-sink.type = org.apache.flume.sink.kafka.KafkaSink
avro-memory-kafka.sinks.kafka-sink.brokerList = hadoop000:9092
avro-memory-kafka.sinks.kafka-sink.topic = hello_topic
avro-memory-kafka.sinks.kafka-sink.batchSize = 5
avro-memory-kafka.sinks.kafka-sink.requiredAcks =1
avro-memory-kafka.channels.memory-channel.type = memory
avro-memory-kafka.sources.avro-source.channels = memory-channel
avro-memory-kafka.sinks.kafka-sink.channel = memory-channel
先启动flume接收端,再启动flume发送端
flume-ng agent \
--name avro-memory-kafka \
--conf $FLUME_HOME/conf \
--conf-file $FLUME_HOME/conf/avro-memory-kafka.conf \
-Dflume.root.logger=INFO,console
flume-ng agent \
--name exec-memory-avro \
--conf $FLUME_HOME/conf \
--conf-file $FLUME_HOME/conf/exec-memory-avro.conf \
-Dflume.root.logger=INFO,console
kafka 配置
启动具体配置根据:https://blog.csdn.net/u012133048/article/details/81783477 此篇博客编写配置文件。
创建topic:
kafka-topics.sh --create --zookeeper hadoop000:2181 --replication-factor 3 --partitions 1 --topic hello_topic
打开终端消费:
kafka-console-consumer.sh --zookeeper hadoop000:2181 --topic hello_topic