Spark_SparkStreaming Advanced Data Source: Kafka


Official documentation:
http://spark.apache.org/docs/2.2.0/streaming-kafka-integration.html

KafkaUtils currently provides two ways to create a DStream: KafkaUtils.createStream and KafkaUtils.createDirectStream.
The integration comes in two major versions: 0.8 and 0.10.
The 0.8 integration supports both KafkaUtils.createStream and KafkaUtils.createDirectStream.
The 0.10 integration supports only KafkaUtils.createDirectStream.

createStream (0.8 integration): consumes with Kafka's high-level consumer API. Offsets are saved to ZooKeeper at regular intervals; after a failure the consumer looks up the saved offset and recovers the data from there, but some records may be consumed again. Its delivery semantics are therefore at-least-once.

createDirectStream (0.8 integration): consumes with Kafka's low-level (simple) consumer API. The offsets of the consumed data are all kept in a topic inside Kafka and are committed automatically; after a failure, consumption resumes only from the current offset position, so data can be lost. Its delivery semantics are at-most-once.

It also supports committing and maintaining offsets manually, which is safer.

createDirectStream (0.10 integration): consumes with Kafka's low-level consumer API and manages offsets through manual commits, which guarantees that no data is lost. Its delivery semantics are exactly-once.

1. Kafka 0.8: the createStream (receiver) approach

1. Start the ZooKeeper cluster and Kafka.
Run on all three machines:
cd /export/servers/zookeeper-3.4.5-cdh5.14.0/
bin/zkServer.sh start
cd /export/servers/kafka_2.11-1.0.0/
bin/kafka-server-start.sh config/server.properties > /dev/null 2>&1 &

2. Create the Kafka topic.
Run on node01 to create the Kafka topic sparkafka:
cd /export/servers/kafka_2.11-1.0.0/
bin/kafka-topics.sh --create --partitions 3 --replication-factor 2 --topic sparkafka --zookeeper node01:2181,node02:2181,node03:2181

3. Start a Kafka producer.
Run on node01 to simulate a Kafka producer:
cd /export/servers/kafka_2.11-1.0.0/
bin/kafka-console-producer.sh --broker-list node01:9092,node02:9092,node03:9092 --topic sparkafka

4. Add the Maven dependencies (pom.xml):

<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
  <modelVersion>4.0.0</modelVersion>
  <groupId>cn.twy</groupId>
  <artifactId>spark_SparkStreaming</artifactId>
  <version>1.0-SNAPSHOT</version>
  <inceptionYear>2008</inceptionYear>
  <properties>
    <scala.version>2.11.8</scala.version>
    <spark.version>2.2.0</spark.version>
  </properties>
  <dependencies>
    <dependency>
      <groupId>org.scala-lang</groupId>
      <artifactId>scala-library</artifactId>
      <version>${scala.version}</version>
    </dependency>
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-core_2.11</artifactId>
      <version>${spark.version}</version>
    </dependency>
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-sql_2.11</artifactId>
      <version>${spark.version}</version>
    </dependency>
    <!-- https://mvnrepository.com/artifact/org.apache.spark/spark-streaming -->
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-streaming_2.11</artifactId>
      <version>2.2.0</version>
    </dependency>
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-client</artifactId>
      <version>2.7.5</version>
    </dependency>

    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-hive_2.11</artifactId>
      <version>2.2.0</version>
    </dependency>

    <dependency>
      <groupId>mysql</groupId>
      <artifactId>mysql-connector-java</artifactId>
      <version>5.1.38</version>
    </dependency>

    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-streaming-flume_2.11</artifactId>
      <version>2.2.0</version>
    </dependency>

    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-streaming-kafka-0-8_2.11</artifactId>
      <version>2.2.0</version>
    </dependency>


  </dependencies>
  <build>
    <sourceDirectory>src/main/scala</sourceDirectory>
    <testSourceDirectory>src/test/scala</testSourceDirectory>
    <plugins>
      <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-compiler-plugin</artifactId>
        <version>3.0</version>
        <configuration>
          <source>1.8</source>
          <target>1.8</target>
          <encoding>UTF-8</encoding>
          <!--    <verbal>true</verbal>-->
        </configuration>
      </plugin>
      <plugin>
        <groupId>net.alchim31.maven</groupId>
        <artifactId>scala-maven-plugin</artifactId>
        <version>3.2.0</version>
        <executions>
          <execution>
            <goals>
              <goal>compile</goal>
              <goal>testCompile</goal>
            </goals>
            <configuration>
              <args>
                <arg>-dependencyfile</arg>
                <arg>${project.build.directory}/.scala_dependencies</arg>
              </args>
            </configuration>
          </execution>
        </executions>
      </plugin>
      <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-shade-plugin</artifactId>
        <version>3.1.1</version>
        <executions>
          <execution>
            <phase>package</phase>
            <goals>
              <goal>shade</goal>
            </goals>
            <configuration>
              <filters>
                <filter>
                  <artifact>*:*</artifact>
                  <excludes>
                    <exclude>META-INF/*.SF</exclude>
                    <exclude>META-INF/*.DSA</exclude>
                    <exclude>META-INF/*.RSA</exclude>
                  </excludes>
                </filter>
              </filters>
              <transformers>
                <transformer implementation="org.apache.maven.plugins.shade.resource.ManifestResourceTransformer">
                  <mainClass></mainClass>
                </transformer>
              </transformers>
            </configuration>
          </execution>
        </executions>
      </plugin>
    </plugins>
  </build>
</project>

5. Write the program.
The offsets are stored in ZooKeeper: val zkQuorum = "node01:2181,node02:2181,node03:2181"
val stream = KafkaUtils.createStream(ssc, zkQuorum, groupId, topics)

package cn.twy.kafkaStreaming

import org.apache.spark.streaming.kafka.KafkaUtils
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.{SparkConf, SparkContext}


object StreamKafkaReceiver {
  def main(args: Array[String]): Unit = {
    // create the SparkConf; enable the receiver write-ahead log so received data can be recovered after a failure
    val sparkconf = new SparkConf().setMaster("local[5]").setAppName("hahahahaha").set("spark.streaming.receiver.writeAheadLog.enable","true")

    // create the SparkContext
    val sc = new SparkContext(sparkconf)
    // set the log level
    sc.setLogLevel("WARN")

    // create the StreamingContext with a 5-second batch interval
    val ssc = new StreamingContext(sc, Seconds(5))
    ssc.checkpoint("/Kafka_Receiver2")

    // parameters for createStream
    val zkQuorum = "node01:2181,node02:2181,node03:2181"
    val groupId = "sparkafka_group"
    val topics = Map("sparkafka" -> 3)

    // use createStream to receive the data,
    // starting 3 receivers
    val receiverDStream = (1 to 3).map(x => {
      val stream = KafkaUtils.createStream(ssc, zkQuorum, groupId, topics)
      stream
    })

    // use ssc.union to merge the data from all receivers
    val unionDStream = ssc.union(receiverDStream)
    val topicData = unionDStream.map(_._2)

    unionDStream.print()

    topicData.print()

    ssc.start()
    ssc.awaitTermination()
  }
}

6. Produce data: type a few lines (e.g. wo hen hao) into the console producer started in step 3.

7. Run the program:
unionDStream.print() shows records like (null, wo hen hao): each record received from Kafka is a (key, value) pair whose key is null and whose value is the input line.
topicData.print() prints only the value, e.g. wo hen hao.
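
Because the value side of those pairs is just a DStream[String], any ordinary DStream transformation can follow. Below is a minimal sketch of a per-batch word count over the message values; it assumes the unionDStream from the program above, and the variable name wordCounts is only illustrative.

// Minimal sketch: per-batch word count over the Kafka message values.
// Assumes the unionDStream from the receiver program above; each element is a
// (key, value) pair, so we take the value (_._2) before splitting it into words.
val wordCounts = unionDStream
  .map(_._2)                 // keep only the message value
  .flatMap(_.split(" "))     // split each line into words
  .map(word => (word, 1))
  .reduceByKey(_ + _)

wordCounts.print()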

2. Kafka 0.8: the createDirectStream approach

With this approach, the offsets of the consumed data are all maintained in a topic inside Kafka, and the offsets are committed automatically.
1. The preparation steps are the same as above.
2. Write the code.
Note: the type parameters [String, String, StringDecoder, StringDecoder] (key type, value type, key decoder, value decoder) in KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topics = Set("sparkafka")) must be specified explicitly.
"metadata.broker.list" -> "node01:9092,node02:9092,node03:9092": the list of brokers to connect to; the offsets are kept in the topic.

package cn.twy.kafkaStreaming

import kafka.serializer.StringDecoder
import org.apache.spark.streaming.kafka.KafkaUtils
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.{SparkConf, SparkContext}

object SparkStreamingKafka_Direct {
  def main(args: Array[String]): Unit = {

    // create the SparkConf
    val sparkConf = new SparkConf().setMaster("local[6]").setAppName("xixixixiixixxi")

    // create the SparkContext
    val sc = new SparkContext(sparkConf)
    sc.setLogLevel("WARN")

    // create the StreamingContext with a 5-second batch interval
    val ssc = new StreamingContext(sc, Seconds(5))

    // set the checkpoint directory
    ssc.checkpoint("./receiver_createdirectStream")

    // Kafka parameters: broker list and consumer group
    val kafkaParams = Map("metadata.broker.list" -> "node01:9092,node02:9092,node03:9092", "group.id" -> "Kafka_Direct")

    // create the direct DStream
    val dstream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topics = Set("sparkafka"))

    // keep only the message values
    val results = dstream.map(_._2)

    // print the results
    results.print()

    // start the streaming job
    ssc.start()
    ssc.awaitTermination()
  }
}
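
The 0.8 direct stream also exposes, through HasOffsetRanges, the offset range each batch consumed, which is useful if you want to log the offsets or maintain them yourself. A minimal sketch, assuming the dstream created by SparkStreamingKafka_Direct above:

// Minimal sketch: print the offset range consumed from each partition in every batch.
// Assumes the dstream from the program above; HasOffsetRanges and OffsetRange come
// from the 0.8 integration package org.apache.spark.streaming.kafka.
import org.apache.spark.streaming.kafka.{HasOffsetRanges, OffsetRange}

dstream.foreachRDD { rdd =>
  val offsetRanges: Array[OffsetRange] = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  offsetRanges.foreach { o =>
    println(s"${o.topic} ${o.partition} ${o.fromOffset} ${o.untilOffset}")
  }
}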

3. Produce data from the console producer.

4. Result: the values of the produced messages are printed for every batch.

3. Kafka 0.10: the createDirectStream approach

Offsets are committed manually, which guarantees that no data is lost.

1. Add the Maven dependency.
The 0.8 dependency must be commented out, otherwise an exception is thrown:

<!-- <dependency>
	<groupId>org.apache.spark</groupId>
	<artifactId>spark-streaming-kafka-0-8_2.11</artifactId>
	<version>2.2.0</version>
</dependency>-->
<dependency>
	<groupId>org.apache.spark</groupId>
	<artifactId>spark-streaming-kafka-0-10_2.11</artifactId>
	<version>2.2.0</version>
</dependency>

2. Write the program:

package cn.twy.kafkaStreaming

import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.streaming.kafka010._
import org.apache.spark.{SparkConf, SparkContext, TaskContext}
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamingKafka0dot10 {
  def main(args: Array[String]): Unit = {
    // create the SparkConf
    val sparkConf = new SparkConf().setMaster("local[6]").setAppName("xixixixiixixxi")

    // create the SparkContext
    val sc = new SparkContext(sparkConf)
    sc.setLogLevel("WARN")

    // create the StreamingContext with a 5-second batch interval
    val ssc = new StreamingContext(sc, Seconds(5))

    // set the checkpoint directory
    ssc.checkpoint("./010_createdirectstream")

    // Kafka connection parameters
    val brokers = "node01:9092,node02:9092,node03:9092"
    val sourcetopic = "sparkafka"
    val group = "sparkafkaGroup"

    val kafkaParam = Map(
      "bootstrap.servers" -> brokers, // addresses used to bootstrap the connection to the cluster
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      // identifies which consumer group this consumer belongs to
      "group.id" -> group,
      // used when there is no initial offset, or the current offset no longer exists on the server;
      // "latest" resets the offset to the latest offset
      "auto.offset.reset" -> "latest",
      // if true, the consumer's offsets are committed automatically in the background;
      // set it to false here so the offsets can be committed manually after processing
      "enable.auto.commit" -> (false: java.lang.Boolean)
    )

    // create the direct DStream
    val dStream = KafkaUtils.createDirectStream[String, String](ssc, LocationStrategies.PreferConsistent, ConsumerStrategies.Subscribe[String, String](Array(sourcetopic), kafkaParam))

    // process each batch RDD
    dStream.foreachRDD(rdd => {
      if (rdd.count() > 0) {
        // print the message values
        rdd.foreach(record => {
          val value = record.value()
          println(value)
        })

        // get the offset ranges consumed in this batch
        val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges

        // print topic, partition, fromOffset and untilOffset for each partition
        rdd.foreachPartition(iter => {
          val o = offsetRanges(TaskContext.get.partitionId)
          println(s"${o.topic} ${o.partition} ${o.fromOffset} ${o.untilOffset}")
        })

        // commit the offsets only after the output operations have completed
        dStream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
      }
    })

    ssc.start()
    ssc.awaitTermination()
  }
}

3. Produce data from the console producer.

4. Output:
The first loop prints the message values.
The second loop prints, for each partition, the topic, the partition number, the fromOffset it started from and the untilOffset it ended at.
The untilOffset is then committed, which guarantees that no data is lost.
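
If you also want confirmation that each commit succeeded, CanCommitOffsets has an overload of commitAsync that takes Kafka's OffsetCommitCallback. Below is a minimal sketch that could replace the plain commitAsync(offsetRanges) call in the program above; the log messages are only illustrative.

// Minimal sketch: commit with a callback so commit failures surface in the driver log.
// Drop-in replacement for the commitAsync(offsetRanges) call in the program above.
import java.util.{Map => JMap}
import org.apache.kafka.clients.consumer.{OffsetAndMetadata, OffsetCommitCallback}
import org.apache.kafka.common.TopicPartition

dStream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges, new OffsetCommitCallback {
  override def onComplete(offsets: JMap[TopicPartition, OffsetAndMetadata], exception: Exception): Unit = {
    if (exception != null) {
      println(s"offset commit failed: ${exception.getMessage}")    // could retry or alert here
    } else {
      println(s"offset commit succeeded: $offsets")
    }
  }
})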
