Flume + Kafka + Spark Streaming

kylin_xue

Category: Data Warehouse. Tags: stream processing, flume, kafka

Article link: https://blog.csdn.net/kylin_xue/article/details/87888273

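Building a real-time log collection system with Flume and Kafka

Flume collects the logs from a directory and hands them to a Kafka sink, which publishes them to a Kafka topic. Consumers then pull the messages from that topic.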

Environment

Hostname: s201

ZooKeeper 3.4.12: s201:2181

Kafka 0.9.0.1: s201:9092

Flume 1.7.0

Spark: 2.2.3

The Flume configuration files are as follows:

# Agent exec-memory-avro: tail the flume_msg log and forward new lines to avroSink
exec-memory-avro.sources=execSrc
exec-memory-avro.channels=memoryChannel
exec-memory-avro.sinks=avroSink

exec-memory-avro.sources.execSrc.type=exec
exec-memory-avro.sources.execSrc.command=tail -F /home/hadoop/data/flume/source/flume_msg/data.log
exec-memory-avro.sources.execSrc.shell=/bin/sh -c

exec-memory-avro.sinks.avroSink.type=avro
exec-memory-avro.sinks.avroSink.hostname=s201
exec-memory-avro.sinks.avroSink.port=33333

exec-memory-avro.sources.execSrc.channels=memoryChannel
exec-memory-avro.sinks.avroSink.channel=memoryChannel

exec-memory-avro.channels.memoryChannel.type=memory
exec-memory-avro.channels.memoryChannel.capacity=100

#-----------------------------------------------------------
#-----------------------------------------------------------

# Agent avro-memory-kafka: receive Avro events and publish them to the Kafka topic hello
avro-memory-kafka.sources=avroSource
avro-memory-kafka.sinks=kafkaSink
avro-memory-kafka.channels=memoryChannel

avro-memory-kafka.sources.avroSource.type=avro
avro-memory-kafka.sources.avroSource.bind=s201
avro-memory-kafka.sources.avroSource.port=33333

avro-memory-kafka.sinks.kafkaSink.type=org.apache.flume.sink.kafka.KafkaSink
avro-memory-kafka.sinks.kafkaSink.kafka.bootstrap.servers=s201:9092
avro-memory-kafka.sinks.kafkaSink.kafka.topic=hello
# Flume 1.7 property names; batchSize and requiredAcks are the deprecated equivalents
avro-memory-kafka.sinks.kafkaSink.flumeBatchSize=5
avro-memory-kafka.sinks.kafkaSink.kafka.producer.acks=1

avro-memory-kafka.channels.memoryChannel.type=memory
avro-memory-kafka.channels.memoryChannel.capacity=100

avro-memory-kafka.sources.avroSource.channels=memoryChannel
avro-memory-kafka.sinks.kafkaSink.channel=memoryChannel

Create the topic hello in Kafka to receive the messages published by the Flume Kafka sink.

kafka-topics.sh --create --zookeeper s201:2181/mykafka --replication-factor 1 --partitions 1 --topic hello

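Testing

Start the avro-memory-kafka Flume agent, which receives the logs.
Start the exec-memory-avro Flume agent, which sends the logs from the source.
Start the console consumer that ships with Kafka to consume the hello topic.

# Start the Avro source first so that it listens for events sent by other agents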
bin/flume-ng agent \
--name avro-memory-kafka \
--conf conf \
--conf-file conf/avro-memory-kafka.properties \
-Dflume.root.logger=INFO,console

# Tail data.log and forward new lines as soon as the file changes
bin/flume-ng agent \
--name exec-memory-avro \
--conf conf \
--conf-file conf/exec-memory-avro.properties \
-Dflume.root.logger=INFO,console

# Start a Kafka console consumer on the hello topic
kafka-console-consumer.sh --zookeeper s201:2181/mykafka --topic hello

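Once everything above is running, append data to data.log: echo "hello1" >> data.log ...

Flume 1.7.0 and later support the TAILDIR source, so the exec source of the exec-memory-avro agent can be replaced with it.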
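A minimal sketch of that change, assuming the same agent, channel, and log file as above. The source name taildirSrc, the file group name f1, and the positionFile path are hypothetical and chosen only for illustration; the sink and channel sections stay as they are.

```properties
# exec-memory-avro with a TAILDIR source instead of exec (sketch)
exec-memory-avro.sources=taildirSrc

exec-memory-avro.sources.taildirSrc.type=TAILDIR
# JSON file in which Flume records how far each file has been read (hypothetical path)
exec-memory-avro.sources.taildirSrc.positionFile=/home/hadoop/data/flume/position/taildir_position.json
# One file group named f1 that matches the log file tailed above
exec-memory-avro.sources.taildirSrc.filegroups=f1
exec-memory-avro.sources.taildirSrc.filegroups.f1=/home/hadoop/data/flume/source/flume_msg/data.log

exec-memory-avro.sources.taildirSrc.channels=memoryChannel
```

Because the read position is persisted in the position file, a restarted agent can resume where it stopped, which tail -F with the exec source cannot do.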

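Integrating Flume and Kafka with Spark Streaming

Integrating Flume with Spark Streaming

Push-based approach

Install netcat on Windows 10 and add it to the PATH environment variable; it can then be invoked directly as nc.exe hostname port.

The Flume configuration file is as follows: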

# flume_push_streaming.properties
# netcat-memory-avro
flume_push_streaming.sources=netcatSrc
flume_push_streaming.channels=memoryChannel
flume_push_streaming.sinks=avroSink

flume_push_streaming.sources.netcatSrc.type=netcat
flume_push_streaming.sources.netcatSrc.bind=s201
flume_push_streaming.sources.netcatSrc.port=22222

flume_push_streaming.sinks.avroSink.type=avro
# IP address of the machine running the local IDE
flume_push_streaming.sinks.avroSink.hostname=192.168.204.1
flume_push_streaming.sinks.avroSink.port=33333

flume_push_streaming.sources.netcatSrc.channels=memoryChannel
flume_push_streaming.sinks.avroSink.channel=memoryChannel

flume_push_streaming.channels.memoryChannel.type=memory
flume_push_streaming.channels.memoryChannel.capacity=100

The Scala code is as follows:

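import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.flume.FlumeUtils

object FlumePushWordCount {
  def main(args: Array[String]): Unit = {
    // local[2]: one thread for the receiver and at least one for processing
    val sparkConf = new SparkConf().setMaster("local[2]").setAppName("FlumePush")
    val ssc = new StreamingContext(sparkConf, Seconds(10))

    // FlumeUtils turns the stream of Flume events into a DStream for further processing.
    // Binding to 0.0.0.0 accepts connections on any network interface.
    val flumeStream = FlumeUtils.createStream(ssc, "0.0.0.0", 33333)
    // Decode each event body, split it into words, and count the words per batch
    flumeStream.map(x => new String(x.event.getBody().array()).trim)
        .flatMap(_.split(" ")).map((_, 1)).reduceByKey(_+_).print()

    ssc.start()
    ssc.awaitTermination()
  }
}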

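Note about the code above: FlumeUtils.createStream(ssc, hostname, port) returns a ReceiverInputDStream[SparkFlumeEvent], so each element has to be unwrapped through x.event before its body can be read.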
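To make the per-batch computation concrete, here is a minimal sketch in plain Scala (no Spark required) that applies the same decode, split, and count steps to an in-memory batch of event bodies. The object name BatchWordCount and the helper countWords are hypothetical, and groupBy stands in for reduceByKey.

```scala
object BatchWordCount {
  // Mirrors the streaming pipeline for one batch: decode, trim, split on spaces, count.
  def countWords(bodies: Seq[Array[Byte]]): Map[String, Int] =
    bodies
      .map(body => new String(body, "UTF-8").trim)
      .flatMap(_.split(" "))
      .groupBy(identity)
      .map { case (word, occurrences) => (word, occurrences.size) }

  def main(args: Array[String]): Unit = {
    // Two event bodies as they might arrive from netcat within one 10-second batch
    val batch = Seq("hello flume", "hello spark ").map(_.getBytes("UTF-8"))
    countWords(batch).toSeq.sortBy(_._1).foreach(println)
  }
}
```

As in the streaming version, splitting on a single space means that consecutive spaces inside a line produce empty-string tokens, which are counted like any other word.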

Testing

# 1. First start the Spark Streaming program written locally in IDEA.
# 2. Start Flume
bin/flume-ng agent \
--name flume_push_streaming \
--conf conf \
--conf-file conf/flume_push_streaming.properties \
-Dflume.root.logger=INFO,console
# 3. Use netcat to send messages to the port the Flume source listens on
# Here nc is run on Windows
nc.exe s201 22222

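Pull-based approach

Flume pushes the data into the sink, where it stays buffered. Spark Streaming then uses a reliable Flume receiver to pull the data from the sink.

In the Flume configuration file, change the sink type to org.apache.spark.streaming.flume.sink.SparkSink. This class ships in the spark-streaming-flume-sink artifact, which must be on the Flume agent's classpath.

In the Scala code, replace the createStream call with:

FlumeUtils.createPollingStream(ssc, hostname, port)  // host name and port (an Int) of the machine running the sink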

# flume_pull_streaming.properties
# The agent is named flume_pull_streaming here; pass the same name to --name when starting it.
flume_pull_streaming.sources=netcatSrc
flume_pull_streaming.channels=memoryChannel
# Differs from the push configuration from here on
flume_pull_streaming.sinks=sparkSink

flume_pull_streaming.sources.netcatSrc.type=netcat
flume_pull_streaming.sources.netcatSrc.bind=s201
flume_pull_streaming.sources.netcatSrc.port=22222
# Differs from the push configuration below
flume_pull_streaming.sinks.sparkSink.type=org.apache.spark.streaming.flume.sink.SparkSink
flume_pull_streaming.sinks.sparkSink.hostname=s201
flume_pull_streaming.sinks.sparkSink.port=33333

flume_pull_streaming.sources.netcatSrc.channels=memoryChannel
flume_pull_streaming.sinks.sparkSink.channel=memoryChannel

flume_pull_streaming.channels.memoryChannel.type=memory
flume_pull_streaming.channels.memoryChannel.capacity=100
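A complete pull-mode program might look like the following sketch. It assumes the SparkSink from the configuration above is listening on s201:33333 and that the spark-streaming-flume dependency for Spark 2.2.x is on the classpath. The object name FlumePullWordCount and the helper words are hypothetical.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.flume.FlumeUtils

object FlumePullWordCount {
  // Decodes one event body and splits it into words (kept separate so it can be checked without Spark running).
  def words(body: Array[Byte]): Seq[String] =
    new String(body, "UTF-8").trim.split(" ").toSeq

  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf().setMaster("local[2]").setAppName("FlumePull")
    val ssc = new StreamingContext(sparkConf, Seconds(10))

    // Connect to the SparkSink and pull the buffered events; the port is an Int, not a String.
    val flumeStream = FlumeUtils.createPollingStream(ssc, "s201", 33333)
    flumeStream
      .flatMap(e => words(e.event.getBody.array()))
      .map((_, 1))
      .reduceByKey(_ + _)
      .print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```

In pull mode the start order is reversed compared with push mode: the Flume agent has to be running first, because the Spark receiver connects to the sink rather than the other way around.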

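Tip: if dependencies have to be supplied on the server, pass jar file paths with the --jars option, or Maven coordinates with the --packages option.

For example: spark-submit --packages org.apache.spark:spark-streaming-kafka-0-8_2.11:2.2.3 ...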

Integrating Kafka with Spark Streaming

Official documentation: http://spark.apache.org/docs/2.2.3/streaming-kafka-0-8-integration.html

Receiver-based

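This approach uses Kafka's high-level consumer API and stores the offsets of consumed messages in ZooKeeper. To avoid data loss on failure, the receiver must also store the data in a Write Ahead Log, which adds the overhead of copying the data a second time, so it is less efficient than the direct stream. It guarantees only at-least-once delivery: records may be duplicated, and exactly-once cannot be achieved.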
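The receiver-based approach can be sketched as follows, assuming the ZooKeeper chroot s201:2181/mykafka and the topic hello from the setup above. The consumer group id, the checkpoint path, the object name, and the helper topicThreads are hypothetical.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

object KafkaReceiverWordCount {
  // Maps each topic of a comma-separated list to the number of consumer threads used for it.
  def topicThreads(topics: String, threads: Int): Map[String, Int] =
    topics.split(",").map(_.trim).filter(_.nonEmpty).map((_, threads)).toMap

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("KafkaReceiver")
      // Write received data to a write-ahead log so that it survives a driver failure
      .set("spark.streaming.receiver.writeAheadLog.enable", "true")
    val ssc = new StreamingContext(conf, Seconds(10))
    // The write-ahead log lives under the checkpoint directory (hypothetical path)
    ssc.checkpoint("/tmp/kafka-receiver-checkpoint")

    // Offsets are tracked in ZooKeeper under the consumer group (hypothetical group id)
    val stream = KafkaUtils.createStream(ssc, "s201:2181/mykafka", "wordcount-group", topicThreads("hello", 1))
    // Each element is a (key, message) pair; only the message is counted
    stream.map(_._2).flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _).print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```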

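Direct Approach (the more widely used one), introduced in Spark 1.3

Instead of using a receiver to accept data, this approach periodically queries Kafka for the latest offsets in each topic+partition and reads the resulting offset ranges from Kafka through the simple consumer API.

With this approach Spark Streaming creates as many RDD partitions as there are Kafka partitions to consume, with a one-to-one mapping between them, which simplifies parallelism.

Because the low-level Kafka API is used, the offsets are tracked by Spark Streaming in its checkpoints rather than in ZooKeeper, so each record is received exactly once despite failures. Exactly-once output additionally requires the output operation to be idempotent or transactional.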

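import kafka.serializer.StringDecoder
import org.apache.spark.streaming.kafka.KafkaUtils

val brokers = "s201:9092"     // Kafka broker list from the environment above
val topicsSet = Set("hello")  // set of topics to consume
val directStream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc,
  Map[String, String]("metadata.broker.list" -> brokers),  // kafkaParams
  topicsSet
)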
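The one-to-one mapping between Kafka partitions and RDD partitions can be illustrated with a small model in plain Scala. This is only a sketch of the idea, not Spark's implementation: the case class OffsetRange here is a hypothetical stand-in with the same four fields as Spark's class of that name, and planBatch is an invented helper.

```scala
object DirectOffsetModel {
  // Hypothetical stand-in for the offset range that one RDD partition reads in one batch.
  final case class OffsetRange(topic: String, partition: Int, fromOffset: Long, untilOffset: Long) {
    def count: Long = untilOffset - fromOffset
  }

  // Plans one range per Kafka partition: from the last consumed offset up to the latest offset.
  def planBatch(topic: String, consumed: Map[Int, Long], latest: Map[Int, Long]): Seq[OffsetRange] =
    latest.toSeq.sortBy(_._1).map { case (partition, until) =>
      OffsetRange(topic, partition, consumed.getOrElse(partition, 0L), until)
    }

  def main(args: Array[String]): Unit = {
    // Partition 0 was consumed up to offset 5; partition 1 has not been read yet
    val ranges = planBatch("hello", consumed = Map(0 -> 5L), latest = Map(0 -> 8L, 1 -> 3L))
    ranges.foreach { r =>
      println(s"${r.topic}-${r.partition}: [${r.fromOffset}, ${r.untilOffset}) -> ${r.count} records")
    }
  }
}
```

Each planned range corresponds to exactly one RDD partition of the batch, so the batch's parallelism equals the number of Kafka partitions. Since the ranges are determined before any data is read, a failed batch can be recomputed from the same offsets.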

Integrating Flume + Kafka + Spark Streaming
