Spark Streaming Integration with Kafka (0.8 and 0.10)
Official documentation:
http://spark.apache.org/docs/2.2.0/streaming-kafka-integration.html
KafkaUtils currently provides two ways to create a DStream: KafkaUtils.createStream and KafkaUtils.createDirectStream.
There are also two major integration versions, 0.8 and 0.10:
The 0.8 integration supports both KafkaUtils.createStream and KafkaUtils.createDirectStream.
The 0.10 integration supports only KafkaUtils.createDirectStream.
0.8 createStream: consumes through Kafka's high-level consumer API and periodically saves offsets to ZooKeeper. After a failure, the job looks up the saved offset and resumes from it, but records consumed since the last save may be replayed, so duplicate consumption is possible. Its delivery semantics are at-least-once.
0.8 createDirectStream: consumes through Kafka's low-level consumer API. Offsets are kept in a Kafka topic and committed automatically; after a failure, consumption resumes from the current offset, so records that were fetched but not yet fully processed can be lost. Its default delivery semantics are at-most-once.
It also supports maintaining and committing offsets manually, which is safer.
0.10 createDirectStream: consumes through Kafka's new consumer API and manages offsets through manual commits, guaranteeing that no data is lost. Its delivery semantics are exactly-once.
1. Kafka 0.8: the createStream approach
1. Start the ZooKeeper cluster and Kafka, on all three machines:
cd /export/servers/zookeeper-3.4.5-cdh5.14.0/
bin/zkServer.sh start
cd /export/servers/kafka_2.11-1.0.0/
bin/kafka-server-start.sh config/server.properties > /dev/null 2>&1 &
2. Create the Kafka topic
On node01, create the topic sparkafka:
cd /export/servers/kafka_2.11-1.0.0/
bin/kafka-topics.sh --create --partitions 3 --replication-factor 2 --topic sparkafka --zookeeper node01:2181,node02:2181,node03:2181
3. Start a Kafka producer
On node01, run the console producer to simulate input:
cd /export/servers/kafka_2.11-1.0.0/
bin/kafka-console-producer.sh --broker-list node01:9092,node02:9092,node03:9092 --topic sparkafka
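As an alternative to the console producer, test messages can also be produced from code. The following is a minimal sketch (the SparkafkaProducer object name and the message payloads are illustrative); it assumes the Kafka client classes pulled in by the dependencies in step 4 are on the classpath:
import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

object SparkafkaProducer {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put("bootstrap.servers", "node01:9092,node02:9092,node03:9092")
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    val producer = new KafkaProducer[String, String](props)
    // Send a few test lines to the sparkafka topic
    for (i <- 1 to 10) {
      producer.send(new ProducerRecord[String, String]("sparkafka", s"test message $i"))
    }
    producer.close()
  }
}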
4. Add the Maven dependencies:
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
    <modelVersion>4.0.0</modelVersion>
    <groupId>cn.twy</groupId>
    <artifactId>spark_SparkStreaming</artifactId>
    <version>1.0-SNAPSHOT</version>
    <inceptionYear>2008</inceptionYear>
    <properties>
        <scala.version>2.11.8</scala.version>
        <spark.version>2.2.0</spark.version>
    </properties>
    <dependencies>
        <dependency>
            <groupId>org.scala-lang</groupId>
            <artifactId>scala-library</artifactId>
            <version>${scala.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-core_2.11</artifactId>
            <version>${spark.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-sql_2.11</artifactId>
            <version>${spark.version}</version>
        </dependency>
        <!-- https://mvnrepository.com/artifact/org.apache.spark/spark-streaming -->
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-streaming_2.11</artifactId>
            <version>${spark.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-client</artifactId>
            <version>2.7.5</version>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-hive_2.11</artifactId>
            <version>${spark.version}</version>
        </dependency>
        <dependency>
            <groupId>mysql</groupId>
            <artifactId>mysql-connector-java</artifactId>
            <version>5.1.38</version>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-streaming-flume_2.11</artifactId>
            <version>${spark.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-streaming-kafka-0-8_2.11</artifactId>
            <version>${spark.version}</version>
        </dependency>
    </dependencies>
    <build>
        <sourceDirectory>src/main/scala</sourceDirectory>
        <testSourceDirectory>src/test/scala</testSourceDirectory>
        <plugins>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-compiler-plugin</artifactId>
                <version>3.0</version>
                <configuration>
                    <source>1.8</source>
                    <target>1.8</target>
                    <encoding>UTF-8</encoding>
                    <!-- <verbal>true</verbal> -->
                </configuration>
            </plugin>
            <plugin>
                <groupId>net.alchim31.maven</groupId>
                <artifactId>scala-maven-plugin</artifactId>
                <version>3.2.0</version>
                <executions>
                    <execution>
                        <goals>
                            <goal>compile</goal>
                            <goal>testCompile</goal>
                        </goals>
                        <configuration>
                            <args>
                                <arg>-dependencyfile</arg>
                                <arg>${project.build.directory}/.scala_dependencies</arg>
                            </args>
                        </configuration>
                    </execution>
                </executions>
            </plugin>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-shade-plugin</artifactId>
                <version>3.1.1</version>
                <executions>
                    <execution>
                        <phase>package</phase>
                        <goals>
                            <goal>shade</goal>
                        </goals>
                        <configuration>
                            <filters>
                                <filter>
                                    <artifact>*:*</artifact>
                                    <excludes>
                                        <exclude>META-INF/*.SF</exclude>
                                        <exclude>META-INF/*.DSA</exclude>
                                        <exclude>META-INF/*.RSA</exclude>
                                    </excludes>
                                </filter>
                            </filters>
                            <transformers>
                                <transformer implementation="org.apache.maven.plugins.shade.resource.ManifestResourceTransformer">
                                    <mainClass></mainClass>
                                </transformer>
                            </transformers>
                        </configuration>
                    </execution>
                </executions>
            </plugin>
        </plugins>
    </build>
</project>
5. Write the program
Offsets are saved in ZooKeeper: val zkQuorum = "node01:2181,node02:2181,node03:2181"
val stream = KafkaUtils.createStream(ssc, zkQuorum, groupId, topics)
package cn.twy.kafkaStreaming

import org.apache.spark.streaming.kafka.KafkaUtils
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.{SparkConf, SparkContext}

object StreamKafkaReceiver {
  def main(args: Array[String]): Unit = {
    // Build the SparkConf; enable the write-ahead log so received data survives failures
    val sparkconf = new SparkConf()
      .setMaster("local[5]")
      .setAppName("hahahahaha")
      .set("spark.streaming.receiver.writeAheadLog.enable", "true")
    // Build the SparkContext and reduce log noise
    val sc = new SparkContext(sparkconf)
    sc.setLogLevel("WARN")
    // Build the StreamingContext with a 5-second batch interval
    val ssc = new StreamingContext(sc, Seconds(5))
    ssc.checkpoint("/Kafka_Receiver2")
    // Parameters for createStream
    val zkQuorum = "node01:2181,node02:2181,node03:2181"
    val groupId = "sparkafka_group"
    val topics = Map("sparkafka" -> 3)
    // Start 3 receivers to consume the topic in parallel
    val receiverDStream = (1 to 3).map { _ =>
      KafkaUtils.createStream(ssc, zkQuorum, groupId, topics)
    }
    // Union the data from all receivers into one DStream
    val unionDStream = ssc.union(receiverDStream)
    // Each record is a (key, value) pair; keep only the value
    val topicData = unionDStream.map(_._2)
    unionDStream.print()
    topicData.print()
    ssc.start()
    ssc.awaitTermination()
  }
}
6. Produce data:
7. Run the program:
Output of unionDStream.print(): (null,wo hen hao)
This shows that each raw record arrives as a (key, value) pair whose key is null and whose value is the input line.
topicData.print() prints just the value: wo hen hao
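Because topicData is an ordinary DStream[String], it can be transformed like any other DStream. As a quick illustration, here is a per-batch word count, a sketch that continues from StreamKafkaReceiver above and would be placed before ssc.start():
// Continues from StreamKafkaReceiver above: topicData is a DStream[String]
// Split each line into words, pair each word with 1, and sum counts per batch
val wordCounts = topicData.flatMap(_.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
wordCounts.print()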
2. Kafka 0.8: the createDirectStream approach
Offsets are kept in a Kafka topic and committed automatically.
1. The setup steps are the same as above.
2. Write the code:
Note: in KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topics = Set("sparkafka")),
the type parameters [String, String, StringDecoder, StringDecoder] (key type, value type, key decoder, value decoder) must be given explicitly.
kafkaParams must include "metadata.broker.list" -> "node01:9092,node02:9092,node03:9092";
the offsets are kept in Kafka topics.
package cn.twy.kafkaStreaming

import kafka.serializer.StringDecoder
import org.apache.spark.streaming.kafka.KafkaUtils
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.{SparkConf, SparkContext}

object SparkStreamingKafka_Direct {
  def main(args: Array[String]): Unit = {
    // Build the SparkConf
    val sparkConf = new SparkConf().setMaster("local[6]").setAppName("xixixixiixixxi")
    // Build the SparkContext and reduce log noise
    val sc = new SparkContext(sparkConf)
    sc.setLogLevel("WARN")
    // Build the StreamingContext with a 5-second batch interval
    val ssc = new StreamingContext(sc, Seconds(5))
    // Set the checkpoint directory
    ssc.checkpoint("./receiver_createdirectStream")
    // Kafka parameters: broker list and consumer group
    val kafkaParams = Map(
      "metadata.broker.list" -> "node01:9092,node02:9092,node03:9092",
      "group.id" -> "Kafka_Direct")
    // Create the direct DStream; the type parameters are key type, value type, key decoder, value decoder
    val dstream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, topics = Set("sparkafka"))
    // Each record is a (key, value) pair; keep only the value
    val results = dstream.map(_._2)
    // Print the results
    results.print()
    // Start the streaming job
    ssc.start()
    ssc.awaitTermination()
  }
}
3. Produce data:
4. Results:
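If the driver restarts, a direct-stream job only continues from where it left off when the StreamingContext (which records the consumed offsets in its checkpoint) is recovered rather than rebuilt. Below is a minimal sketch of the standard StreamingContext.getOrCreate pattern, reusing the brokers, topic, and checkpoint directory from the program above; the object and method names are illustrative:
package cn.twy.kafkaStreaming

import kafka.serializer.StringDecoder
import org.apache.spark.streaming.kafka.KafkaUtils
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.SparkConf

object SparkStreamingKafka_DirectRecoverable {
  val checkpointDir = "./receiver_createdirectStream"

  // Builds a fresh StreamingContext; only invoked when no checkpoint exists yet
  def createContext(): StreamingContext = {
    val sparkConf = new SparkConf().setMaster("local[6]").setAppName("directRecoverable")
    val ssc = new StreamingContext(sparkConf, Seconds(5))
    ssc.checkpoint(checkpointDir)
    val kafkaParams = Map(
      "metadata.broker.list" -> "node01:9092,node02:9092,node03:9092",
      "group.id" -> "Kafka_Direct")
    val dstream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, Set("sparkafka"))
    dstream.map(_._2).print()
    ssc
  }

  def main(args: Array[String]): Unit = {
    // Recover the context (and the consumed offsets) from the checkpoint if one exists
    val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
    ssc.start()
    ssc.awaitTermination()
  }
}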
3. Kafka 0.10: the createDirectStream approach
Offsets are committed manually, guaranteeing that no data is lost.
1. Add the Maven dependency:
The 0.8 artifact must be commented out first, otherwise the two integrations conflict and an exception is thrown:
<!--<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-streaming-kafka-0-8_2.11</artifactId>
    <version>2.2.0</version>
</dependency>-->
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-streaming-kafka-0-10_2.11</artifactId>
    <version>2.2.0</version>
</dependency>
2. Write the program:
package cn.twy.kafkaStreaming

import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.streaming.kafka010._
import org.apache.spark.{SparkConf, SparkContext, TaskContext}
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamingKafka0dot10 {
  def main(args: Array[String]): Unit = {
    // Build the SparkConf
    val sparkConf = new SparkConf().setMaster("local[6]").setAppName("xixixixiixixxi")
    // Build the SparkContext and reduce log noise
    val sc = new SparkContext(sparkConf)
    sc.setLogLevel("WARN")
    // Build the StreamingContext with a 5-second batch interval
    val ssc = new StreamingContext(sc, Seconds(5))
    // Set the checkpoint directory
    ssc.checkpoint("./010_createdirectstream")
    // Kafka connection parameters
    val brokers = "node01:9092,node02:9092,node03:9092"
    val sourcetopic = "sparkafka"
    val group = "sparkafkaGroup"
    val kafkaParam = Map(
      // Initial brokers used to bootstrap the connection to the cluster
      "bootstrap.servers" -> brokers,
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      // Consumer group this consumer belongs to
      "group.id" -> group,
      // Where to start when there is no committed offset, or the committed
      // offset no longer exists on any server; "latest" resets to the newest offset
      "auto.offset.reset" -> "latest",
      // Disable auto-commit so offsets can be committed manually after output
      "enable.auto.commit" -> (false: java.lang.Boolean)
    )
    // Create the direct DStream
    val dStream = KafkaUtils.createDirectStream[String, String](
      ssc,
      LocationStrategies.PreferConsistent,
      ConsumerStrategies.Subscribe[String, String](Array(sourcetopic), kafkaParam))
    // Process each batch
    dStream.foreachRDD { rdd =>
      if (rdd.count() > 0) {
        // Print the values
        rdd.foreach(record => println(record.value()))
        // Print the offset range of each partition
        val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
        rdd.foreachPartition { _ =>
          val o = offsetRanges(TaskContext.get.partitionId)
          println(s"${o.topic} ${o.partition} ${o.fromOffset} ${o.untilOffset}")
        }
        // Commit the offsets only after the output has completed
        dStream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
      }
    }
    ssc.start()
    ssc.awaitTermination()
  }
}
3. Produce data:
4. Output:
The first loop prints the record values.
The second loop prints, for each partition, the topic, the partition, the starting offset (fromOffset), and the ending offset (untilOffset).
The untilOffset is then committed, which guarantees that no data is lost.
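Note that commitAsync as used above is fire-and-forget: nothing in the program reports whether the commit actually reached Kafka. CanCommitOffsets in the 0-10 integration also accepts a standard org.apache.kafka.clients.consumer.OffsetCommitCallback. Below is a sketch of the foreachRDD commit step from StreamingKafka0dot10 rewritten with a logging callback; the log messages are illustrative:
import java.util
import org.apache.kafka.clients.consumer.{OffsetAndMetadata, OffsetCommitCallback}
import org.apache.kafka.common.TopicPartition
import org.apache.spark.streaming.kafka010.{CanCommitOffsets, HasOffsetRanges}

// Continues from StreamingKafka0dot10 above: dStream is the 0.10 direct DStream
dStream.foreachRDD { rdd =>
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  // ... output the batch first, then commit with a callback ...
  dStream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges, new OffsetCommitCallback {
    override def onComplete(offsets: util.Map[TopicPartition, OffsetAndMetadata], exception: Exception): Unit = {
      // Log the outcome so failed commits are visible instead of silent
      if (exception != null) println(s"offset commit failed: $exception")
      else println(s"offsets committed: $offsets")
    }
  })
}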