Spark Streaming Integration with Kafka (0.8 and 0.10)
Official documentation:
http://spark.apache.org/docs/2.2.0/streaming-kafka-integration.html
KafkaUtils currently provides two ways to create a DStream: KafkaUtils.createStream and KafkaUtils.createDirectStream.
There are also two major integration versions, 0.8 and 0.10:
The 0.8 integration supports both KafkaUtils.createStream and KafkaUtils.createDirectStream.
The 0.10 integration supports only KafkaUtils.createDirectStream.
0.8 createStream: consumes through Kafka's high-level consumer API and periodically saves offsets to ZooKeeper. After a failure, the job looks up the saved offset and resumes from it, but records consumed since the last save may be replayed, so duplicate consumption is possible. Its delivery semantics are at-least-once.
0.8 createDirectStream: consumes through Kafka's low-level consumer API. Offsets are kept in a Kafka topic and committed automatically; after a failure, consumption resumes from the current offset, so records that were fetched but not yet fully processed can be lost. Its default delivery semantics are at-most-once.
It also supports maintaining and committing offsets manually, which is safer.
0.10 createDirectStream: consumes through Kafka's new consumer API and manages offsets through manual commits, guaranteeing that no data is lost. Its delivery semantics are exactly-once.
1. Kafka 0.8: the createStream approach
1. Start the ZooKeeper cluster and Kafka, on all three machines:
cd /export/servers/zookeeper-3.4.5-cdh5.14.0/
bin/zkServer.sh start
cd /export/servers/kafka_2.11-1.0.0/
bin/kafka-server-start.sh config/server.properties > /dev/null 2>&1 &
2. Create the Kafka topic
On node01, create the topic sparkafka:
cd /export/servers/kafka_2.11-1.0.0/
bin/kafka-topics.sh --create --partitions 3 --replication-factor 2 --topic sparkafka --zookeeper node01:2181,node02:2181,node03:2181
3. Start a Kafka producer
On node01, run the console producer to simulate input:
cd /export/servers/kafka_2.11-1.0.0/
bin/kafka-console-producer.sh --broker-list node01:9092,node02:9092,node03:9092 --topic sparkafka
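As an alternative to the console producer, test messages can also be produced from code. The following is a minimal sketch (the SparkafkaProducer object name and the message payloads are illustrative); it assumes the Kafka client classes pulled in by the dependencies in step 4 are on the classpath:
import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

object SparkafkaProducer {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put("bootstrap.servers", "node01:9092,node02:9092,node03:9092")
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    val producer = new KafkaProducer[String, String](props)
    // Send a few test lines to the sparkafka topic
    for (i <- 1 to 10) {
      producer.send(new ProducerRecord[String, String]("sparkafka", s"test message $i"))
    }
    producer.close()
  }
}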
4. Add the Maven dependencies:
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
    <modelVersion>4.0.0</modelVersion>
    <groupId>cn.twy</groupId>
    <artifactId>spark_SparkStreaming</artifactId>
    <version>1.0-SNAPSHOT</version>
    <inceptionYear>2008</inceptionYear>
    <properties>
        <scala.version>2.11.8</scala.version>
        <spark.version>2.2.0</spark.version>
    </properties>
    <dependencies>
        <dependency>
            <groupId>org.scala-lang</groupId>
            <artifactId>scala-library</artifactId>
            <version>${scala.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-core_2.11</artifactId>
            <version>${spark.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-sql_2.11</artifactId>
            <version>${spark.version}</version>
        </dependency>
        <!-- https://mvnrepository.com/artifact/org.apache.spark/spark-streaming -->
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-streaming_2.11</artifactId>
            <version>${spark.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-client</artifactId>
            <version>2.7.5</version>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-hive_2.11</artifactId>
            <version>${spark.version}</version>
        </dependency>
        <dependency>
            <groupId>mysql</groupId>
            <artifactId>mysql-connector-java</artifactId>
            <version>5.1.38</version>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-streaming-flume_2.11</artifactId>
            <version>${spark.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-streaming-kafka-0-8_2.11</artifactId>
            <version>${spark.version}</version>
        </dependency>
    </dependencies>
    <build>
        <sourceDirectory>src/main/scala</sourceDirectory>
        <testSourceDirectory>src/test/scala</testSourceDirectory>
        <plugins>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-compiler-plugin</artifactId>
                <version>3.0</version>
                <configuration>
                    <source>1.8</source>
                    <target>1.8</target>
                    <encoding>UTF-8</encoding>
                    <!-- <verbal>true</verbal> -->
                </configuration>
            </plugin>
            <plugin>
                <groupId>net.alchim31.maven</groupId>
                <artifactId>scala-maven-plugin</artifactId>
                <version>3.2.0</version>
                <executions>
                    <execution>
                        <goals>
                            <goal>compile</goal>
                            <goal>testCompile</goal>
                        </goals>
                        <configuration>
                            <args>
                                <arg>-dependencyfile</arg>
                                <arg>${project.build.directory}/.scala_dependencies</arg>
                            </args>
                        </configuration>
                    </execution>
                </executions>
            </plugin>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-shade-plugin</artifactId>
                <version>3.1.1</version>
                <executions>
                    <execution>
                        <phase>package</phase>
                        <goals>
                            <goal>shade</goal>
                        </goals>
                        <configuration>
                            <filters>
                                <filter>
                                    <artifact>*:*</artifact>
                                    <excludes>
                                        <exclude>META-INF/*.SF</exclude>
                                        <exclude>META-INF/*.DSA</exclude>
                                        <exclude>META-INF/*.RSA</exclude>
                                    </excludes>
                                </filter>
                            </filters>
                            <transformers>
                                <transformer implementation="org.apache.maven.plugins.shade.resource.ManifestResourceTransformer">
                                    <mainClass></mainClass>
                                </transformer>
                            </transformers>
                        </configuration>
                    </execution>
                </executions>
            </plugin>
        </plugins>
    </build>
</project>
5. Write the program
Offsets are saved in ZooKeeper: val zkQuorum = "node01:2181,node02:2181,node03:2181"
val stream = KafkaUtils.createStream(ssc, zkQuorum, groupId, topics)
package cn.twy.kafkaStreaming

import org.apache.spark.streaming.kafka.KafkaUtils
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.{SparkConf, SparkContext}

object StreamKafkaReceiver {
  def main(args: Array[String]): Unit = {
    // Build the SparkConf; enable the write-ahead log so received data survives failures
    val sparkconf = new SparkConf()
      .setMaster("local[5]")
      .setAppName("hahahahaha")
      .set("spark.streaming.receiver.writeAheadLog.enable", "true")
    // Build the SparkContext and reduce log noise
    val sc = new SparkContext(sparkconf)
    sc.setLogLevel("WARN")
    // Build the StreamingContext with a 5-second batch interval
    val ssc = new StreamingContext(sc, Seconds(5))
    ssc.checkpoint("/Kafka_Receiver2")
    // Parameters for createStream
    val zkQuorum = "node01:2181,node02:2181,node03:2181"
    val groupId = "sparkafka_group"
    val topics = Map("sparkafka" -> 3)
    // Start 3 receivers to consume the topic in parallel
    val receiverDStream = (1 to 3).map { _ =>
      KafkaUtils.createStream(ssc, zkQuorum, groupId, topics)
    }
    // Union the data from all receivers into one DStream
    val unionDStream = ssc.union(receiverDStream)
    // Each record is a (key, value) pair; keep only the value
    val topicData = unionDStream.map(_._2)
    unionDStream.print()
    topicData.print()
    ssc.start()
    ssc.awaitTermination()
  }
}
6. Produce data:
7. Run the program:
Output of unionDStream.print(): (null,wo hen hao)
This shows that each raw record arrives as a (key, value) pair whose key is null and whose value is the input line.
topicData.print() prints just the value: wo hen hao
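Because topicData is an ordinary DStream[String], it can be transformed like any other DStream. As a quick illustration, here is a per-batch word count, a sketch that continues from StreamKafkaReceiver above and would be placed before ssc.start():
// Continues from StreamKafkaReceiver above: topicData is a DStream[String]
// Split each line into words, pair each word with 1, and sum counts per batch
val wordCounts = topicData.flatMap(_.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
wordCounts.print()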
2. Kafka 0.8: the createDirectStream approach
Offsets are kept in a Kafka topic and committed automatically.
1. The setup steps are the same as above.
2. Write the code:
Note: in KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topics = Set("sparkafka")),
the type parameters [String, String, StringDecoder, StringDecoder] (key type, value type, key decoder, value decoder) must be given explicitly.
kafkaParams must include "metadata.broker.list" -> "node01:9092,node02:9092,node03:9092";
the offsets are kept in Kafka topics.
package cn.twy.kafkaStreaming

import kafka.serializer.StringDecoder
import org.apache.spark.streaming.kafka.KafkaUtils
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.{SparkConf, SparkContext}

object SparkStreamingKafka_Direct {
  def main(args: Array[String]): Unit = {
    // Build the SparkConf
    val sparkConf = new SparkConf().setMaster("local[6]").setAppName("xixixixiixixxi")
    // Build the SparkContext and reduce log noise
    val sc = new SparkContext(sparkConf)
    sc.setLogLevel("WARN")
    // Build the StreamingContext with a 5-second batch interval
    val ssc = new StreamingContext(sc, Seconds(5))
    // Set the checkpoint directory
    ssc.checkpoint("./receiver_createdirectStream")
    // Kafka parameters: broker list and consumer group
    val kafkaParams = Map(
      "metadata.broker.list" -> "node01:9092,node02:9092,node03:9092",
      "group.id" -> "Kafka_Direct")
    // Create the direct DStream; the type parameters are key type, value type, key decoder, value decoder
    val dstream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, topics = Set("sparkafka"))
    // Each record is a (key, value) pair; keep only the value
    val results = dstream.map(_._2)
    // Print the results
    results.print()
    // Start the streaming job
    ssc.start()
    ssc.awaitTermination()
  }
}
3. Produce data:
4. Results:
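If the driver restarts, a direct-stream job only continues from where it left off when the StreamingContext (which records the consumed offsets in its checkpoint) is recovered rather than rebuilt. Below is a minimal sketch of the standard StreamingContext.getOrCreate pattern, reusing the brokers, topic, and checkpoint directory from the program above; the object and method names are illustrative:
package cn.twy.kafkaStreaming

import kafka.serializer.StringDecoder
import org.apache.spark.streaming.kafka.KafkaUtils
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.SparkConf

object SparkStreamingKafka_DirectRecoverable {
  val checkpointDir = "./receiver_createdirectStream"

  // Builds a fresh StreamingContext; only invoked when no checkpoint exists yet
  def createContext(): StreamingContext = {
    val sparkConf = new SparkConf().setMaster("local[6]").setAppName("directRecoverable")
    val ssc = new StreamingContext(sparkConf, Seconds(5))
    ssc.checkpoint(checkpointDir)
    val kafkaParams = Map(
      "metadata.broker.list" -> "node01:9092,node02:9092,node03:9092",
      "group.id" -> "Kafka_Direct")
    val dstream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, Set("sparkafka"))
    dstream.map(_._2).print()
    ssc
  }

  def main(args: Array[String]): Unit = {
    // Recover the context (and the consumed offsets) from the checkpoint if one exists
    val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
    ssc.start()
    ssc.awaitTermination()
  }
}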
3. Kafka 0.10: the createDirectStream approach
Offsets are committed manually, guaranteeing that no data is lost.
1. Add the Maven dependency:
The 0.8 artifact must be commented out first, otherwise the two integrations conflict and an exception is thrown:
<!--<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-streaming-kafka-0-8_2.11</artifactId>
    <version>2.2.0</version>
</dependency>-->
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-streaming-kafka-0-10_2.11</artifactId>
    <version>2.2.0</version>
</dependency>
2. Write the program:
package cn.twy.kafkaStreaming

import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.streaming.kafka010._
import org.apache.spark.{SparkConf, SparkContext, TaskContext}
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamingKafka0dot10 {
  def main(args: Array[String]): Unit = {
    // Build the SparkConf
    val sparkConf = new SparkConf().setMaster("local[6]").setAppName("xixixixiixixxi")
    // Build the SparkContext and reduce log noise
    val sc = new SparkContext(sparkConf)
    sc.setLogLevel("WARN")
    // Build the StreamingContext with a 5-second batch interval
    val ssc = new StreamingContext(sc, Seconds(5))
    // Set the checkpoint directory
    ssc.checkpoint("./010_createdirectstream")
    // Kafka connection parameters
    val brokers = "node01:9092,node02:9092,node03:9092"
    val sourcetopic = "sparkafka"
    val group = "sparkafkaGroup"
    val kafkaParam = Map(
      // Initial brokers used to bootstrap the connection to the cluster
      "bootstrap.servers" -> brokers,
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      // Consumer group this consumer belongs to
      "group.id" -> group,
      // Where to start when there is no committed offset, or the committed
      // offset no longer exists on any server; "latest" resets to the newest offset
      "auto.offset.reset" -> "latest",
      // Disable auto-commit so offsets can be committed manually after output
      "enable.auto.commit" -> (false: java.lang.Boolean)
    )
    // Create the direct DStream
    val dStream = KafkaUtils.createDirectStream[String, String](
      ssc,
      LocationStrategies.PreferConsistent,
      ConsumerStrategies.Subscribe[String, String](Array(sourcetopic), kafkaParam))
    // Process each batch
    dStream.foreachRDD { rdd =>
      if (rdd.count() > 0) {
        // Print the values
        rdd.foreach(record => println(record.value()))
        // Print the offset range of each partition
        val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
        rdd.foreachPartition { _ =>
          val o = offsetRanges(TaskContext.get.partitionId)
          println(s"${o.topic} ${o.partition} ${o.fromOffset} ${o.untilOffset}")
        }
        // Commit the offsets only after the output has completed
        dStream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
      }
    }
    ssc.start()
    ssc.awaitTermination()
  }
}
3. Produce data:
4. Output:
The first loop prints the record values.
The second loop prints, for each partition, the topic, the partition, the starting offset (fromOffset), and the ending offset (untilOffset).
The untilOffset is then committed, which guarantees that no data is lost.
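Note that commitAsync as used above is fire-and-forget: nothing in the program reports whether the commit actually reached Kafka. CanCommitOffsets in the 0-10 integration also accepts a standard org.apache.kafka.clients.consumer.OffsetCommitCallback. Below is a sketch of the foreachRDD commit step from StreamingKafka0dot10 rewritten with a logging callback; the log messages are illustrative:
import java.util
import org.apache.kafka.clients.consumer.{OffsetAndMetadata, OffsetCommitCallback}
import org.apache.kafka.common.TopicPartition
import org.apache.spark.streaming.kafka010.{CanCommitOffsets, HasOffsetRanges}

// Continues from StreamingKafka0dot10 above: dStream is the 0.10 direct DStream
dStream.foreachRDD { rdd =>
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  // ... output the batch first, then commit with a callback ...
  dStream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges, new OffsetCommitCallback {
    override def onComplete(offsets: util.Map[TopicPartition, OffsetAndMetadata], exception: Exception): Unit = {
      // Log the outcome so failed commits are visible instead of silent
      if (exception != null) println(s"offset commit failed: $exception")
      else println(s"offsets committed: $offsets")
    }
  })
}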