Ways to Integrate Spark Streaming with Kafka

UAreNotMe

Article link: https://blog.csdn.net/weixin_40981792/article/details/90477810

1. Receiver-based approach

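A DStream built this way has a Receiver, which stores the incoming data in the Executors.
This approach needs CPU cores of its own, that is, it permanently occupies one or more threads. In local mode, if N in local[N] is set to 1, only a single thread runs the Spark Streaming program. That thread is used up receiving data and none is left for computation, so the data appears never to be processed.
With this approach, received data can be lost when a failure occurs. To guarantee zero data loss, enable Spark Streaming's write-ahead log (WAL), which synchronously writes the data received from Kafka to a distributed file system such as HDFS. Then, even if an underlying node fails, the data can be recovered from the write-ahead log.
To enable it: spark.streaming.receiver.writeAheadLog.enable=true (the default is false).
Points to note:
1. Kafka topic partitions are unrelated to the partitions of the RDDs generated by Spark Streaming. Increasing the partition count passed to KafkaUtils.createStream only increases the number of threads inside a single receiver; it does not increase Spark's processing parallelism.
2. Multiple Kafka input DStreams can be created with different groups and topics, so that several Receivers receive data in parallel.
3. If a fault-tolerant storage system such as HDFS is used and the write-ahead log is enabled, the received data is already replicated in the log. The storage level of the input stream should therefore be set to StorageLevel.MEMORY_AND_DISK_SER, that is, KafkaUtils.createStream(…, StorageLevel.MEMORY_AND_DISK_SER).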
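The notes above can be combined into one configuration sketch: the write-ahead log is switched on, a checkpoint directory is set (the log is written under it), several receivers are created and unioned, and the storage level drops the in-memory replica. This is only an illustration under assumptions: the object name, the group id, the local[4] master and the HDFS path (including port 9000) are placeholders that do not come from the article, and for brevity both receivers read the same topic in one group.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

object ReceiverWithWalSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("ReceiverWithWalSketch")
      // More threads than receivers, so that batches can still be processed.
      .setMaster("local[4]")
      .set("spark.streaming.receiver.writeAheadLog.enable", "true")
    val ssc = new StreamingContext(conf, Seconds(5))
    // The write-ahead log is stored under the checkpoint directory (placeholder path).
    ssc.checkpoint("hdfs://bigdata01:9000/checkpoint/receiver-wal")

    val zk = "bigdata01:2181,bigdata02:2181,bigdata03:2181"
    // Two receivers; each one occupies a core for as long as the job runs.
    val streams = (1 to 2).map { _ =>
      KafkaUtils.createStream(ssc, zk, "wal-sketch-group", Map("t-1810-1" -> 1),
        StorageLevel.MEMORY_AND_DISK_SER) // no "_2" replica: the WAL already holds a copy
    }
    // Merge the receiver streams into one DStream and keep only the message values.
    ssc.union(streams).map(_._2).print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```

Processing parallelism after the union still depends on the RDD partitions, so a repartition step may be needed if the downstream work is heavy.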
Code example

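import org.apache.log4j.{Level, Logger}
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.dstream.{DStream, InputDStream}
import org.apache.spark.streaming.kafka.KafkaUtils

object _01SparkStreamingWithKafkaReceiverOps {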
    def main(args: Array[String]): Unit = {
        Logger.getLogger("org.apache.spark").setLevel(Level.WARN)
        Logger.getLogger("org.apache.hadoop").setLevel(Level.WARN)
        Logger.getLogger("org.spark-project").setLevel(Level.WARN)
        if(args == null || args.length < 3) {
            println(
                """Parameter Errors ! Usage: <batchInterval> <zk> <groupId>
                  |batchInterval    :  batch interval of the job, in seconds
                  |zk               :  ZooKeeper connection string
                  |groupId          :  consumer group id
                """.stripMargin)
            System.exit(-1)
        }
        val Array(batchInterval, zk, groupId) = args
        val conf = new SparkConf()
                .setAppName("SparkStreamingWithKafkaReceiver")
                .setMaster("local[*]")
        val ssc = new StreamingContext(conf, Seconds(batchInterval.toLong))
        val topics = Map[String, Int](
            "t-1810-1" -> 3
        )
        /**
          * Key   ---> the key of the Kafka message; null if no key was specified
          * value ---> the value of the Kafka message
          */
        val kafkaStream:InputDStream[(String, String)] = KafkaUtils.createStream(ssc, zk, groupId, topics)
//        kafkaStream.print()
        val retDStream:DStream[(String, Int)] = kafkaStream.map{case (key, msg) => msg}.flatMap(_.split("\\s+")).map((_, 1)).reduceByKey(_+_)

        retDStream.print()
        ssc.start()
        ssc.awaitTermination()
    }
}

Read flow analysis

2. Direct-based approach

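Unlike the Receiver approach, this one has no receiver that dedicates CPU cores or threads to receiving data. It reads from Kafka directly through Kafka's low-level API. Each read covers a range of offsets, [fromOffset, untilOffset), where untilOffset is exclusive, and the data in that bounded range makes up the DStream or RDD being processed. In other words, every RDD corresponds to a definite range of offsets.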
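To make the idea of "an RDD with a range" concrete, here is a small, Spark-free model of how one batch can be described per partition. OffsetSpan and nextBatch are hypothetical names chosen for this sketch; they only mimic the fields exposed by Spark's OffsetRange and are not Spark's implementation.

```scala
// Simplified model of the offset range that defines one batch for one partition.
final case class OffsetSpan(topic: String, partition: Int, fromOffset: Long, untilOffset: Long) {
  // untilOffset is exclusive, so the record count is a plain difference.
  def count: Long = untilOffset - fromOffset
}

object OffsetSpanDemo {
  /** Builds one span per partition, from the last consumed position up to the latest offset. */
  def nextBatch(topic: String,
                consumed: Map[Int, Long],
                latest: Map[Int, Long]): Seq[OffsetSpan] =
    consumed.toSeq.sortBy(_._1).map { case (partition, from) =>
      // If nothing new arrived, the span is empty (from == until).
      val until = math.max(from, latest.getOrElse(partition, from))
      OffsetSpan(topic, partition, from, until)
    }

  def main(args: Array[String]): Unit = {
    val batch = nextBatch("t-1810-1", Map(0 -> 10L, 1 -> 5L), Map(0 -> 25L, 1 -> 5L))
    batch.foreach { s =>
      println(s"partition ${s.partition}: [${s.fromOffset}, ${s.untilOffset}) count=${s.count}")
    }
  }
}
```

Because the next batch starts exactly at the previous untilOffset, consecutive batches neither overlap nor leave gaps.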
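Points to note:
1. Simplified parallelism: there is no need to create multiple Kafka input streams and union them. With a direct stream, Spark Streaming creates as many RDD partitions as there are Kafka partitions, and all of them read from Kafka in parallel. There is therefore a one-to-one mapping between Kafka partitions and RDD partitions.
2. Efficiency: to guarantee zero data loss, the receiver-based approach needs the WAL. That is inefficient because the data is effectively replicated twice: Kafka already replicates the data through its own reliability mechanism, and the WAL stores another copy. The direct approach does not depend on a Receiver and needs no WAL; as long as Kafka keeps replicas of the data, it can be recovered from Kafka.
3. Exactly-once semantics: the first approach uses Kafka's high-level API to store consumed offsets in ZooKeeper. Combined with the WAL it can ensure zero data loss (at-least-once semantics), but under some failures a few records may be consumed twice, because the data Spark Streaming has received and the offsets tracked in ZooKeeper can become inconsistent. The second approach therefore uses the simple Kafka API, which does not involve ZooKeeper, and Spark Streaming tracks the offsets itself in its checkpoints. This removes the inconsistency between Spark Streaming and ZooKeeper/Kafka, so each record is effectively received by Spark Streaming exactly once even when failures occur. To get exactly-once semantics for the output as well, the output operation that saves data to an external store must either be idempotent or be an atomic transaction that saves the results and the offsets together.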
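The last requirement, saving results and offsets in one atomic step, can be sketched without Spark. The class below is an in-memory stand-in for an external store; TransactionalStore and its methods are hypothetical names, and a real system would rely on a database transaction rather than a synchronized block.

```scala
import scala.collection.mutable

// In-memory stand-in for an external store that commits results and offsets together.
final class TransactionalStore {
  private val counts = mutable.Map.empty[String, Int]
  private val offsets = mutable.Map.empty[Int, Long] // partition -> next offset to read

  /** Applies a batch only if it starts exactly where the last committed batch ended. */
  def commit(partition: Int, fromOffset: Long, untilOffset: Long, words: Seq[String]): Boolean =
    synchronized {
      val expected = offsets.getOrElse(partition, 0L)
      if (fromOffset != expected) {
        false // replayed or out-of-order batch: skip it, nothing is changed
      } else {
        words.foreach(w => counts(w) = counts.getOrElse(w, 0) + 1)
        offsets(partition) = untilOffset // saved in the same "transaction" as the counts
        true
      }
    }

  def count(word: String): Int = synchronized(counts.getOrElse(word, 0))

  def committedOffset(partition: Int): Long = synchronized(offsets.getOrElse(partition, 0L))
}

object ExactlyOnceDemo {
  def main(args: Array[String]): Unit = {
    val store = new TransactionalStore
    store.commit(0, 0L, 2L, Seq("a", "b"))
    store.commit(0, 0L, 2L, Seq("a", "b")) // the same batch replayed after a failure
    println(s"count(a) = ${store.count("a")}, next offset = ${store.committedOffset(0)}")
  }
}
```

A replayed batch is rejected because its fromOffset no longer matches the committed offset, so the counts are not applied twice.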
Code example:

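import kafka.serializer.StringDecoder
import org.apache.log4j.{Level, Logger}
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.dstream.InputDStream
import org.apache.spark.streaming.kafka.{HasOffsetRanges, KafkaUtils}

object _01SparkStreamingWithDirectKafkaOps {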
    def main(args: Array[String]): Unit = {
        Logger.getLogger("org.apache.spark").setLevel(Level.WARN)
        Logger.getLogger("org.apache.hadoop").setLevel(Level.WARN)
        Logger.getLogger("org.spark-project").setLevel(Level.WARN)
        if(args == null || args.length < 3) {
            println(
                """Parameter Errors ! Usage: <batchInterval> <groupId> <topicList>
                  |batchInterval    :  batch interval of the job, in seconds
                  |groupId          :  consumer group id
                  |topicList        :  comma-separated list of topics to consume
                """.stripMargin)
            System.exit(-1)
        }
        val Array(batchInterval, groupId, topicList) = args

        val conf = new SparkConf()
                    .setAppName("SparkStreamingWithDirectKafkaOps")
                    .setMaster("local[*]")

        val ssc = new StreamingContext(conf, Seconds(batchInterval.toLong))
        val kafkaParams = Map[String, String](
            "bootstrap.servers" -> "bigdata01:9092,bigdata02:9092,bigdata03:9092",
            "group.id" -> groupId,
            //largest: start reading from the latest offset
            //smallest: start reading from the earliest offset
            "auto.offset.reset" -> "smallest"
        )
        val topics = topicList.split(",").toSet
        //read the data with the Direct approach
        val kafkaDStream:InputDStream[(String, String)] = KafkaUtils
            .createDirectStream[String, String, StringDecoder, StringDecoder](
            ssc, kafkaParams, topics)

        kafkaDStream.foreachRDD((rdd, bTime) => {
            if(!rdd.isEmpty()) {
                println("-------------------------------------------")
                println(s"Time: $bTime")
                println("-------------------------------------------")
                rdd.foreach{case (key, value) => {
                    println(value)
                }}
                //inspect the offset ranges covered by this rdd
                println("Offset ranges:")
                val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
                for (offsetRange <- offsetRanges) {
                    val topic = offsetRange.topic
                    val partition = offsetRange.partition
                    val fromOffset = offsetRange.fromOffset
                    val untilOffset = offsetRange.untilOffset
                    val count = offsetRange.count()
                    println(s"topic:${topic}, partition:${partition}, " +
                        s"fromOffset:${fromOffset}, untilOffset:${untilOffset}, count:${count}")
                }
            }
        })
        ssc.start()
        ssc.awaitTermination()
    }
}

Read flow analysis

3. Problems with the Direct approach

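If auto.offset.reset is set to largest and the program crashes and is later restarted, the data written to Kafka during the downtime may never be consumed. If it is set to smallest, the program reads from the beginning after every restart, which causes repeated consumption. Neither is what we want.
The root cause is that, by default, the offsets are outside our control: they are managed by Spark Streaming. To solve the problem we have to manage the offsets ourselves. There are many ways to do so; common choices in production include ZooKeeper, HBase, Elasticsearch, and so on. Here we use ZooKeeper to manage the offsets.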
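The three restart behaviours can be compared with a tiny model. The names below (StartOffsetDemo, startOffset, Reset) are hypothetical and the numbers are made up for illustration; the function only captures the rule "use the saved offset if there is one, otherwise fall back to auto.offset.reset".

```scala
object StartOffsetDemo {
  sealed trait Reset
  case object Smallest extends Reset
  case object Largest extends Reset

  /** Where consumption resumes after a restart.
    * stored   : offset saved by the application itself, if any
    * earliest : first offset still kept by Kafka
    * latest   : offset that the next produced record will get
    */
  def startOffset(stored: Option[Long], reset: Reset, earliest: Long, latest: Long): Long =
    stored.getOrElse(reset match {
      case Smallest => earliest
      case Largest  => latest
    })

  def main(args: Array[String]): Unit = {
    // The job had processed offsets [0, 100); 20 more records arrived while it was down.
    val (earliest, processedUntil, latest) = (0L, 100L, 120L)
    val skipped = startOffset(None, Largest, earliest, latest) - processedUntil
    val repeated = processedUntil - startOffset(None, Smallest, earliest, latest)
    val resumed = startOffset(Some(processedUntil), Smallest, earliest, latest)
    println(s"largest : skips $skipped records")
    println(s"smallest: repeats $repeated records")
    println(s"stored  : resumes at offset $resumed")
  }
}
```

Only the run with a stored offset resumes exactly where the previous run stopped, which is why the offsets need to be saved by the application.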
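ZooKeeper-based solution
1. Steps: read the saved offsets from ZooKeeper; create the message stream from those offsets (or from the topics if none are saved); run the business logic; write the updated offsets back to ZooKeeper.
2. Code: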

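import kafka.common.TopicAndPartition
import kafka.message.MessageAndMetadata
import kafka.serializer.StringDecoder
import org.apache.curator.framework.{CuratorFramework, CuratorFrameworkFactory}
import org.apache.curator.retry.ExponentialBackoffRetry
import org.apache.log4j.{Level, Logger}
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.dstream.InputDStream
import org.apache.spark.streaming.kafka.{HasOffsetRanges, KafkaUtils, OffsetRange}
import scala.collection.JavaConversions
import scala.collection.mutable

object _02SparkStreamingWithDirectKafkaOps {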
    def main(args: Array[String]): Unit = {
        Logger.getLogger("org.apache.spark").setLevel(Level.WARN)
        Logger.getLogger("org.apache.hadoop").setLevel(Level.WARN)
        Logger.getLogger("org.spark-project").setLevel(Level.WARN)
        if(args == null || args.length < 3) {
            println(
                """Parameter Errors ! Usage: <batchInterval> <groupId> <topicList>
                  |batchInterval    :  batch interval of the job, in seconds
                  |groupId          :  consumer group id
                  |topicList        :  comma-separated list of topics to consume
                """.stripMargin)
            System.exit(-1)
        }
        val Array(batchInterval, groupId, topicList) = args

        val conf = new SparkConf()
                    .setAppName("SparkStreamingWithDirectKafkaOps")
                    .setMaster("local[*]")

        val ssc = new StreamingContext(conf, Seconds(batchInterval.toLong))
        val kafkaParams = Map[String, String](
            "bootstrap.servers" -> "bigdata01:9092,bigdata02:9092,bigdata03:9092",
            "group.id" -> groupId,
            //largest: start reading from the latest offset
            //smallest: start reading from the earliest offset
            "auto.offset.reset" -> "smallest"
        )
        val topics = topicList.split(",").toSet

        val messages:InputDStream[(String, String)] = createMsg(ssc, kafkaParams, topics, groupId)
        //step 3: business logic
        messages.foreachRDD((rdd, bTime) => {
            if(!rdd.isEmpty()) {
                println("-------------------------------------------")
                println(s"Time: $bTime")
                println("-------------------------------------------")
                println("rdd count: " + rdd.count())
                //step 4: update the offsets once the batch has been processed
                store(rdd.asInstanceOf[HasOffsetRanges].offsetRanges, groupId)
            }
        })

        ssc.start()
        ssc.awaitTermination()
    }
    def createMsg(ssc:StreamingContext, kafkaParams:Map[String, String],
                  topics:Set[String], group:String):InputDStream[(String, String)] = {
        //step 1: read the offsets from ZooKeeper
        val offsets: Map[TopicAndPartition, Long] = getOffsets(topics, group)
        var messages:InputDStream[(String, String)] = null
        //step 2: create the message stream from the offsets
        if(!offsets.isEmpty) {//offsets were found in ZooKeeper
            val messageHandler = (mmd: MessageAndMetadata[String, String]) => (mmd.key, mmd.message)
            messages = KafkaUtils.createDirectStream[String, String,
                            StringDecoder, StringDecoder,
                            (String, String)](ssc,
                            kafkaParams, offsets, messageHandler)
        } else {//no stored offsets
            messages = KafkaUtils.createDirectStream[
                        String, String,
                        StringDecoder, StringDecoder](
                            ssc, kafkaParams, topics)
        }
        messages
    }

    /**
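      * step 1: read the offsets stored in ZooKeeper for each partition of the given topics
      * Default path used by the framework: /kafka/consumers/${groupId}/offsets/${topic}/${partition}
      *                                     (the node data is the offset)
      * Custom path used here:
      *     /kafka/mykafka/offsets/${topic}/${groupId}/${partition}
      *                                     (the node data is the offset)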
      */

    def getOffsets(topics:Set[String], group:String): Map[TopicAndPartition, Long] = {
        val offsets = mutable.Map[TopicAndPartition, Long]()
        for (topic <- topics) {
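            // Curator paths must start with "/"; the namespace is prepended automatically
            val path = s"/${topic}/${group}"

            checkExists(path)

            for(partition <- JavaConversions.asScalaBuffer(curator.getChildren.forPath(path))) {
                val fullPath = s"${path}/${partition}"
                //get the offset stored for this partition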
                val offset = new String(curator.getData.forPath(fullPath)).toLong
                offsets.put(TopicAndPartition(topic, partition.toInt), offset)
            }
        }
        offsets.toMap
    }

    //step 4: update the offsets
    def store(offsetRanges: Array[OffsetRange], group:String): Unit = {
        for(offsetRange <- offsetRanges) {
            val topic = offsetRange.topic
            val partition = offsetRange.partition
            val offset = offsetRange.untilOffset
            // Curator paths must start with "/"; the namespace is prepended automatically
            val fullPath = s"/${topic}/${group}/${partition}"

            checkExists(fullPath)

            curator.setData().forPath(fullPath, offset.toString.getBytes())
        }
    }

    def checkExists(path:String): Unit = {
        if(curator.checkExists().forPath(path) == null) {
            curator.create().creatingParentsIfNeeded().forPath(path)//after this call the path is guaranteed to exist
        }
    }
    val curator = {
        val zk = "bigdata01:2181,bigdata02:2181,bigdata03:2181"
        val curator:CuratorFramework = CuratorFrameworkFactory
            .builder()
            .connectString(zk)
            .namespace("kafka/mykafka/offsets")
            .retryPolicy(new ExponentialBackoffRetry(1000, 3))
            .build()
        curator.start()
        curator
    }
}