In big-data streaming pipelines, a very common architecture is: Flume collects the data, Kafka transports it, and Spark Streaming processes it. This post records how to integrate Kafka with Flume, for future reference.

Kafka and Flume are assumed to be installed already; installation is not covered here, and plenty of guides are available online.

The key step is creating a Kafka configuration file for Flume, kafka-conf.properties:
# The configuration file needs to define the sources,
# the channels and the sinks.
# Sources, channels and sinks are defined per agent,
# in this case called 'agent'
agent.sources = r1
agent.channels = c1
agent.sinks = s1
# Source configuration for Flume
# Specify the source type to monitor, e.g. a port or a directory
agent.sources.r1.type = spooldir
# Directory to monitor
agent.sources.r1.spoolDir = /opt/shortcut/flume/logs/data
agent.sources.r1.fileHeader = true
# Sink configuration for Flume
# Write the output to Kafka
agent.sinks.s1.type = org.apache.flume.sink.kafka.KafkaSink
# Kafka topic to write to: test
agent.sinks.s1.topic = test
# Kafka brokerList: cm02.spark.com:9092
agent.sinks.s1.brokerList = cm02.spark.com:9092
agent.sinks.s1.requiredAcks = 1
agent.sinks.s1.batchSize = 2
# Flume中Channel相关的配置
agent.channels.c1.type = memory
agent.channels.c1.capacity = 100
agent.sources.r1.channels = c1
agent.sinks.s1.channel = c1
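Before starting Flume, the test topic should already exist on the brokers. For the ZooKeeper-era Kafka assumed throughout this post (the console consumer below still takes --zookeeper), the topic can be created roughly like this; the partition and replication counts are illustrative, not taken from the original setup:

```shell
# create the 'test' topic that the Flume sink writes to
# (replication factor and partition count are illustrative)
bin/kafka-topics.sh --create \
  --zookeeper cm01.spark.com:2181 \
  --replication-factor 1 \
  --partitions 1 \
  --topic test
```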
Start Flume with this configuration:
bin/flume-ng agent --conf conf -f ./conf/kafka-conf.properties -n agent -Dflume.root.logger=INFO,console
Start a Kafka console consumer:
bin/kafka-console-consumer.sh --zookeeper cm01.spark.com:2181,cm02.spark.com:2181,cm03.spark.com:2181 --topic test --from-beginning
Upload a text file into the directory Flume is monitoring:
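For example, a file can be dropped into the spooled directory like this. Writing the file somewhere else first and then moving it in avoids Flume picking up a half-written file; the file name and contents here are made up for illustration:

```shell
# create a sample file elsewhere first, then move it into the
# spooling directory so Flume never sees a partially written file
echo "hello kafka" > /tmp/sample.log
mv /tmp/sample.log /opt/shortcut/flume/logs/data/
```

Once a file has been ingested, the spooling directory source renames it with a .COMPLETED suffix by default.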
The file's contents should then show up in the Kafka consumer's output.