streaming-kafka: Consuming Kafka data with Spark Streaming

大数据-刘耀文

Category: Spark

Article link: https://blog.csdn.net/weixin_42741866/article/details/86253668

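Kafka: consumption model

High Level Consumer API

You do not manage offsets yourself.
By default it provides at-least-once delivery semantics.
If there are more consumers than partitions, the extra consumers sit idle and are wasted.
If there are fewer consumers than partitions, one consumer reads from several partitions.
Ideally the number of partitions is an integer multiple of the number of consumers.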
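
The three cases above can be illustrated with a small sketch. It spreads partitions over consumers round-robin, which is a simplified stand-in for Kafka's real group assignment and not its actual algorithm; the object and method names are made up for this illustration.

```scala
object PartitionAssignmentSketch {
  // Assigns partition p to consumer (p % consumers).
  // Simplified illustration only, not Kafka's real assignor.
  def assign(partitions: Int, consumers: Int): Map[Int, Seq[Int]] =
    (0 until consumers).map { c =>
      c -> (0 until partitions).filter(_ % consumers == c)
    }.toMap

  def main(args: Array[String]): Unit = {
    // More consumers than partitions: consumer 3 receives nothing.
    println(assign(partitions = 3, consumers = 4)(3))
    // Fewer consumers than partitions: consumer 0 receives two partitions.
    println(assign(partitions = 4, consumers = 3)(0))
    // Integer multiple: every consumer receives the same number of partitions.
    println(assign(partitions = 6, consumers = 3).values.map(_.size).toList)
  }
}
```

With 6 partitions and 3 consumers the load is even, which is why an integer multiple is the recommended setup.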

Low Level Consumer API (Simple Consumer API)

You must manage offsets yourself.
Any delivery semantics can be implemented.
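Kafka: message organization

Sequential disk access

Uses read-ahead and large batched writes.
Avoids disk seeks.

Zero-copy (sendfile system call)

Traditional path:

Data is read from disk into the page cache in kernel space.
The application reads the data from kernel space into a user-space buffer.
The application writes the data from the user-space buffer back into the kernel socket buffer.
The data is copied from the socket buffer to the NIC buffer.

SendFile:

Data is copied from the kernel page cache to the socket buffer.
The data is copied from the socket buffer to the NIC buffer.
The data stays in kernel space the whole time, which is efficient.
The two copies into and out of user space are eliminated.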
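
On the JVM, the sendfile path is reached through `FileChannel.transferTo`. The following is a minimal sketch of that call, not Kafka's own code: it transfers a temporary file into an in-memory channel so that it runs without a network. The file contents and the names `ZeroCopySketch` and `transferAll` are assumptions for illustration, and whether the operating system actually uses sendfile depends on the target channel (typically a socket) and the platform.

```scala
import java.io.ByteArrayOutputStream
import java.nio.channels.{Channels, FileChannel, WritableByteChannel}
import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Path, StandardOpenOption}

object ZeroCopySketch {
  // Sends the whole file at `path` to `target` with FileChannel.transferTo,
  // without reading the bytes into an application-level buffer.
  def transferAll(path: Path, target: WritableByteChannel): Long = {
    val channel = FileChannel.open(path, StandardOpenOption.READ)
    try {
      val size = channel.size()
      var position = 0L
      // transferTo may move fewer bytes than requested, so loop until done.
      while (position < size) {
        position += channel.transferTo(position, size - position, target)
      }
      position
    } finally channel.close()
  }

  def main(args: Array[String]): Unit = {
    val tmp = Files.createTempFile("segment", ".log") // hypothetical log file
    Files.write(tmp, "hello tom\nhello jerry\n".getBytes(StandardCharsets.UTF_8))
    val sink = new ByteArrayOutputStream() // stands in for a socket channel
    val sent = transferAll(tmp, Channels.newChannel(sink))
    println(s"transferred $sent bytes")
    Files.delete(tmp)
  }
}
```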

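Kafka: how message lookup works

The entry number in an index file is the message's offset relative to its log segment.

OffsetIndex is a sparse index: it does not store the relative offset and position of every message.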

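Take the segment 00000000001560140916 in this partition directory as an example,
and locate the message with offset 1560140921.

Locate the segment log file.
A log file is named after the offset of the first message it contains (the base offset),
so the message is in the segment with the largest base offset that does not exceed the target: 00000000001560140916.log
Compute the offset relative to the segment.
Base offset of the segment = 1560140916
Relative offset = target offset - base offset = 1560140921 - 1560140916 = 5
Look up the index file: relative offset 5 maps to byte position 456 in the log file.
So the message is read directly from byte position 456 of 00000000001560140916.log.
Likewise, offset 1560140922 has relative offset 1560140922 - 1560140916 = 6.
If the relative offset has no entry in the index file, find the closest entry that does not exceed it
and scan the log forward sequentially from that position.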
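
The lookup can be sketched in a few lines of Scala. This is an illustration under stated assumptions, not Kafka's implementation: the relative offset is taken as the distance from the number in the segment file name, which gives 5 for the example above, and the list of segments and every index entry other than "5 maps to byte 456" are hypothetical.

```scala
object SparseIndexLookup {
  // One sparse index entry: relative offset in the segment -> byte position in the .log file.
  final case class IndexEntry(relativeOffset: Int, position: Int)

  // Picks the segment whose file-name offset is the largest one not exceeding `target`.
  def findSegment(baseOffsets: Seq[Long], target: Long): Option[Long] =
    baseOffsets.filter(_ <= target).sorted.lastOption

  // Binary search for the last entry whose relativeOffset is <= wanted.
  def floorEntry(index: Vector[IndexEntry], wanted: Int): Option[IndexEntry] = {
    var lo = 0
    var hi = index.length - 1
    var found: Option[IndexEntry] = None
    while (lo <= hi) {
      val mid = (lo + hi) >>> 1
      if (index(mid).relativeOffset <= wanted) { found = Some(index(mid)); lo = mid + 1 }
      else hi = mid - 1
    }
    found
  }

  def main(args: Array[String]): Unit = {
    val baseOffsets = Seq(0L, 1560140916L, 1560150000L) // hypothetical segments
    // Hypothetical sparse index; only the (5, 456) entry comes from the example.
    val index = Vector(IndexEntry(1, 0), IndexEntry(5, 456), IndexEntry(9, 1024))
    val target = 1560140921L
    val base = findSegment(baseOffsets, target).get
    val relative = (target - base).toInt
    // A reader would seek to the entry's position and scan forward from there.
    println(s"segment=$base relative=$relative entry=${floorEntry(index, relative)}")
  }
}
```

For relative offset 6, which has no entry of its own, `floorEntry` returns the entry for 5, and the message is found by scanning forward from byte 456.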

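Spark Streaming + Kafka integration

Receiver-based Approach

Kafka topic partitions are unrelated to the partitions of the RDDs generated by Spark Streaming. Increasing the
number of partitions in KafkaUtils.createStream only increases the number of threads in a single receiver; it does not increase Spark's parallelism.
Multiple Kafka input DStreams can be created with different groups and topics, so that multiple receivers receive data in parallel.
If write-ahead logs are enabled on a fault-tolerant storage system such as HDFS, the received data is already replicated in the log.
The storage level of the input stream should therefore be set to StorageLevel.MEMORY_AND_DISK_SER (that is, use
KafkaUtils.createStream(…, StorageLevel.MEMORY_AND_DISK_SER)).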

First, simulate a producer

import java.util.HashMap
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerConfig, ProducerRecord}
import scala.util.Random

// Sends one randomly chosen sample sentence to Kafka every second.
object KafkaWordCountProducer {
  def main(args: Array[String]): Unit = {
    // metadataBrokerList: comma-separated list of Kafka brokers; topic: topic name
    if (args.length < 2) {
      System.err.println("Usage: KafkaWordCountProducer <metadataBrokerList> <topic>")
      // Exit with a non-zero status to signal abnormal termination
      // (System.exit(0) would signal a normal exit).
      System.exit(1)
    }
    // args: node01:9092,node02:9092,node03:9092 kafkawc
    val Array(brokers, topic) = args
    // Kafka producer connection properties
    val props = new HashMap[String, Object]()
    props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, brokers)
    props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
      "org.apache.kafka.common.serialization.StringSerializer")
    props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
      "org.apache.kafka.common.serialization.StringSerializer")

    val producer = new KafkaProducer[String, String](props)

    val arr = Array(
      "hello tom",
      "hello jerry",
      "hello kitty",
      "hello suke"
    )
    val r = new Random()
    // Send one message per second to the topic given on the command line
    while (true) {
      val message = arr(r.nextInt(arr.length))
      producer.send(new ProducerRecord[String, String](topic, message))
      Thread.sleep(1000)
    }
  }

}

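Next, simulate a consumer: Receiver

import org.apache.spark.SparkConf
import org.apache.spark.streaming.kafka.KafkaUtils
import org.apache.spark.streaming.{Minutes, Seconds, StreamingContext}

/**
  * Reads Kafka data using the receiver-based approach
  */
object ReceiveKafkaWordCount {
  def main(args: Array[String]): Unit = {
    // zkQuorum: ZooKeeper host list, group: consumer group id,
    // topics: one or more topics separated by ",", numThreads: number of consumer threads
    if (args.length < 4) {
      System.err.println("Usage: KafkaWordCount <zkQuorum> <group> <topics> <numThreads>")
      System.exit(1)
    }
    // args: node01:2181,node02:2181,node03:2181 group01 kafkawc 2
    val Array(zkQuorum, group, topics, numThreads) = args
    val sparkConf = new SparkConf().setAppName("KafkaWordCount").setMaster("local[2]")
    val ssc = new StreamingContext(sparkConf, Seconds(2))
    // The windowed reduce with an inverse function below requires checkpointing
    ssc.checkpoint("checkpoint")

    // Map each topic to the number of receiver threads that consume it
    val topicMap = topics.split(",").map((_, numThreads.toInt)).toMap
    // Each record is a (key, value) pair; keep only the value
    val lines = KafkaUtils.createStream(ssc, zkQuorum, group, topicMap).map(_._2)
    val words = lines.flatMap(_.split(" "))
    // Count words over a 10-minute window that slides every 2 seconds
    val wordCounts = words.map(x => (x, 1L))
      .reduceByKeyAndWindow(_ + _, _ - _, Minutes(10), Seconds(2), 2)
    wordCounts.print()

    ssc.start()
    ssc.awaitTermination()
  }
}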

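Direct Approach (No Receivers)

Simplified parallelism: there is no need to create multiple input Kafka streams and union them. With directStream, Spark Streaming creates as many RDD partitions as there are Kafka partitions to consume, and all of them read from Kafka in parallel. Kafka partitions and RDD partitions therefore map one to one.
Efficiency: achieving zero data loss with the first approach requires storing the data in a write-ahead log, which replicates it again. This is inefficient because the data is effectively replicated twice, once by Kafka and once by the write-ahead log. The second approach removes the problem: there is no receiver, so no write-ahead log is needed. As long as Kafka retains the data long enough, messages can be recovered from Kafka.
Exactly-once semantics: the first approach uses Kafka's high-level API to store consumed offsets in Zookeeper, which is the traditional way to consume data from Kafka. Combined with write-ahead logs it can ensure zero data loss (that is, at-least-once semantics), but under some failures a few records may be consumed twice. This happens because the data reliably received by Spark Streaming and the offsets tracked by Zookeeper can become inconsistent. The second approach therefore uses the simple Kafka API, which does not use Zookeeper; Spark Streaming tracks the offsets in its checkpoints. This removes the inconsistency between Spark Streaming and Zookeeper/Kafka, so each record is received by Spark Streaming effectively exactly once despite failures. To get exactly-once semantics for the output of your results, the output operation that saves data to an external store must be either idempotent or an atomic transaction that saves results and offsets together.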
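
The idempotent-output requirement can be illustrated without Spark. In the sketch below, each record is written under a key derived from its Kafka coordinates (topic, partition, offset), so replaying a batch after a failure overwrites the same rows instead of duplicating them. `IdempotentStore` is a hypothetical in-memory stand-in for an external store with a unique key; it is not part of Spark or Kafka.

```scala
import scala.collection.mutable

object IdempotentSinkSketch {
  // A record identified by where it came from in Kafka.
  final case class Record(topic: String, partition: Int, offset: Long, value: String)

  // Hypothetical in-memory stand-in for an external store with a unique key.
  final class IdempotentStore {
    private val rows = mutable.Map.empty[(String, Int, Long), String]

    // Upsert keyed by (topic, partition, offset): a replayed record overwrites itself.
    def save(r: Record): Unit =
      rows.update((r.topic, r.partition, r.offset), r.value)

    def size: Int = rows.size
  }

  def main(args: Array[String]): Unit = {
    val store = new IdempotentStore
    val batch = Seq(
      Record("kafkawc", 0, 10L, "hello tom"),
      Record("kafkawc", 0, 11L, "hello jerry")
    )
    batch.foreach(store.save)
    batch.foreach(store.save) // simulated replay of the same batch after a failure
    println(store.size)       // still one row per distinct record
  }
}
```

The alternative named in the text, an atomic transaction, would instead write the results and the batch's offsets in one commit.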

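Simulate a consumer: Direct

import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.kafka.KafkaUtils
import org.apache.spark.streaming.{Seconds, StreamingContext}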

/**
  * Consumes messages from one or more topics in Kafka and does wordcount.
  * Usage: DirectKafkaWordCount <brokers> <topics>
  *   <brokers> is a list of one or more Kafka brokers
  *   <topics> is a list of one or more kafka topics to consume from
  *
  * Example:
  *    $ bin/run-example streaming.DirectKafkaWordCount broker1-host:port,broker2-host:port \
  *    topic1,topic2
  */
object DirectKafkaWordCount {
  def main(args: Array[String]): Unit = {
    if (args.length < 2) {
      System.err.println(s"""
                            |Usage: DirectKafkaWordCount <brokers> <topics>
                            |  <brokers> is a list of one or more Kafka brokers
                            |  <topics> is a list of one or more kafka topics to consume from
                            |
        """.stripMargin)
      System.exit(1)
    }
    
    val Array(brokers, topics) = args

    // Create context with 2 second batch interval
    val sparkConf = new SparkConf().setAppName("DirectKafkaWordCount").setMaster("local[2]")
    val ssc = new StreamingContext(sparkConf, Seconds(2))

    // Create direct kafka stream with brokers and topics
    val topicsSet = topics.split(",").toSet
    val kafkaParams = Map[String, String]("metadata.broker.list" -> brokers)
    val messages = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, topicsSet)

    // Get the lines, split them into words, count the words and print
    val lines = messages.map(_._2)
    val words = lines.flatMap(_.split(" "))
    val wordCounts = words.map(x => (x, 1L)).reduceByKey(_ + _)
    wordCounts.print()

    // Start the computation
    ssc.start()
    ssc.awaitTermination()
  }
}

Kafka offset management

Save offsets in external storage

Checkpoints
HBase
ZooKeeper
Kafka
…

Do not save offsets
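
Whichever external store is chosen, the pattern is the same: load the saved offsets on startup, process a batch, then save the new offsets. The sketch below shows that pattern with an in-memory map. The `OffsetStore` trait and all names in it are hypothetical and only illustrate the idea; a real implementation would be backed by HBase, ZooKeeper, or Kafka.

```scala
import scala.collection.mutable

object OffsetStoreSketch {
  final case class TopicPartition(topic: String, partition: Int)

  // Hypothetical storage interface for per-group offsets.
  trait OffsetStore {
    def load(group: String): Map[TopicPartition, Long]
    def save(group: String, offsets: Map[TopicPartition, Long]): Unit
  }

  // In-memory implementation, for illustration only.
  final class InMemoryOffsetStore extends OffsetStore {
    private val data = mutable.Map.empty[String, Map[TopicPartition, Long]]
    def load(group: String): Map[TopicPartition, Long] =
      data.getOrElse(group, Map.empty)
    def save(group: String, offsets: Map[TopicPartition, Long]): Unit =
      data.update(group, load(group) ++ offsets)
  }

  // Runs the batch first and saves offsets afterwards. If the job fails in
  // between, the batch is processed again on restart (at-least-once).
  def runBatch(store: OffsetStore, group: String,
               untilOffsets: Map[TopicPartition, Long])(process: => Unit): Unit = {
    process
    store.save(group, untilOffsets)
  }

  def main(args: Array[String]): Unit = {
    val store = new InMemoryOffsetStore
    val tp = TopicPartition("kafkawc", 0)
    runBatch(store, "group01", Map(tp -> 100L)) { println("processing offsets 0 until 100") }
    // A restarted job would resume from the saved offset.
    println(store.load("group01").get(tp))
  }
}
```

Not saving offsets at all means a restarted job can only start from the earliest or latest available offset.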

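Kafka offset management: Checkpoint

Enabling Spark Streaming checkpoints is the simplest way to store offsets.
A streaming checkpoint is designed to save the application's state, for example on HDFS,
so that it can be recovered after a failure.
Spark Streaming checkpoints cannot be recovered across applications.
A Spark upgrade also makes recovery impossible.
For critical production applications, managing offsets through Spark checkpoints is not recommended.

/**
  * Record offsets with a checkpoint
  * Pros: simple to implement
  * Cons: if the streaming job's business logic changes, or another job needs the offsets, they cannot be retrieved
  */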
import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.kafka.KafkaUtils
import org.apache.spark.streaming.{Duration, Seconds, StreamingContext}

object StreamingWithCheckpoint {
  def main(args: Array[String]) {
    //val Array(brokers, topics) = args
    val processingInterval = 2
    val brokers = "node01:9092,node02:9092,node03:9092"
