Kafka partition basics
Using Kafka 0.10
package cn.edu360.streaming.kafka10
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010._
import org.apache.spark.streaming.{Seconds, StreamingContext}
object DirectStream {
def main(args: Array[String]): Unit = {
val group = "g0"
val topic = "xiaoniu"
//create the SparkConf; remove the .setMaster(...) call when submitting the job to a cluster
val conf = new SparkConf().setAppName("DirectStream").setMaster("local[*]")
//create a StreamingContext, which wraps a SparkContext
val streamingContext = new StreamingContext(conf, Seconds(5))
//Kafka parameters
val kafkaParams = Map[String, Object](
"bootstrap.servers" -> "node1:9092,node2:9092,node3:9092",
"key.deserializer" -> classOf[StringDeserializer],
"value.deserializer" -> classOf[StringDeserializer],
"group.id" -> group,
"auto.offset.reset" -> "earliest", // or "latest"
"enable.auto.commit" -> (false: java.lang.Boolean)
)
val topics = Array(topic)
//read from Kafka with the direct approach; the consumed offsets are recorded in Kafka itself
val stream = KafkaUtils.createDirectStream[String, String](
streamingContext,
//location strategy (if Kafka and the Spark executors are co-located, preferred locations are used)
PreferConsistent,
//consumer strategy (topics can also be subscribed by pattern, e.g. my-orders-.*)
Subscribe[String, String](topics, kafkaParams)
)
//iterate over the RDDs in the DStream, taking out the RDD for each batch interval
stream.foreachRDD { rdd =>
if(!rdd.isEmpty()) {
//get the offset ranges for this RDD
val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
//process the data; foreach is an action
rdd.foreach{ line =>
println(line.key() + " " + line.value())
}
//commit the offsets
// some time later, after outputs have completed
stream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
}
}
streamingContext.start()
streamingContext.awaitTermination()
}
}
Note: with 0.10, the Kafka client and server versions must match, otherwise an exception like the following is thrown:
org.apache.kafka.common.protocol.types.SchemaException: Error reading field 'brokers': Error reading field 'host': Error reading string of length 28271, only 159 bytes available
Spark Streaming
How Spark Streaming works
Real-time computation keeps running until a failure occurs or someone stops it.
Offline (batch) computation has a defined start and end.
Spark Streaming continuously reads from a storage system and writes the processed results back to a storage system.
Internally, it works as follows. Spark Streaming receives live input data streams and divides the data into batches, which are then processed by the Spark engine to generate the final stream of results in batches.
Spark Streaming provides a high-level abstraction called discretized stream or DStream, which represents a continuous stream of data. DStreams can be created either from input data streams from sources such as Kafka, Flume, and Kinesis, or by applying high-level operations on other DStreams. Internally, a DStream is represented as a sequence of RDDs.
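The batching model described above can be illustrated without Spark at all. The plain-Scala sketch below (collections stand in for RDDs, and the batch size of 2 is an arbitrary stand-in for the batch interval) splits an incoming stream of lines into batches and applies the same word-count logic to each batch, just as a DStream yields one RDD per interval:

```scala
object MicroBatchSketch {
  // The per-batch logic Spark would run on each RDD of the DStream
  def wordCount(batch: Seq[String]): Map[String, Int] =
    batch.flatMap(_.split(" ")).groupBy(identity).map { case (w, ws) => (w, ws.size) }

  def main(args: Array[String]): Unit = {
    // Incoming lines, divided into fixed-size "batches" (standing in for batch intervals)
    val lines = Seq("hello tom", "hello jerry", "hello tom tom")
    val batches: Iterator[Seq[String]] = lines.grouped(2)
    // Each batch produces one independent result, like one RDD per interval
    batches.foreach(b => println(wordCount(b)))
  }
}
```

Each batch is processed independently; carrying state across batches requires an explicit mechanism such as updateStateByKey, shown later.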
Integrating Spark Streaming with nc
About nc
Spark Streaming + nc example
package day9
import org.apache.spark.streaming.dstream.{DStream, ReceiverInputDStream}
import org.apache.spark.streaming.{Milliseconds, StreamingContext}
import org.apache.spark.{SparkConf, SparkContext}
object StreamWordCount {
def main(args: Array[String]): Unit = {
//batch jobs create a SparkContext; for real-time computation, use a StreamingContext
val conf = new SparkConf().setAppName("StreamingWordCount").setMaster("local[*]") //at least two threads are needed: one receives the data, the other runs the computation
val sc = new SparkContext(conf)
//StreamingContext wraps a SparkContext; the wrapper adds the streaming capabilities
//the second parameter is the batch interval
val ssc = new StreamingContext(sc, Milliseconds(5000))
//with a StreamingContext we can create Spark Streaming's abstraction, the DStream
//read data from a socket
val lines: ReceiverInputDStream[String] = ssc.socketTextStream("192.168.145.101", 8888)
//operate on the DStream; working with this abstraction (a proxy/description) feels like working with a local collection
//split and flatten
val words: DStream[String] = lines.flatMap(_.split(" "))
//pair each word with 1
val wordAndOne: DStream[(String, Int)] = words.map((_, 1))
//aggregate
val reduced: DStream[(String, Int)] = wordAndOne.reduceByKey(_+_)
//print the result
reduced.print()
//start the Spark Streaming program
ssc.start()
//wait for a graceful exit
ssc.awaitTermination()
}
}
On the host with the IP used in the program, start a listener on port 8888. Install nc first: yum install nc
Create the socket service on port 8888: nc -lk 8888
Note: start the nc service first, then launch the streaming application. nc acts as the producer and Spark Streaming as the consumer; only data sent during the current batch is processed.
Integrating Spark Streaming with Kafka
Case 1
Start the Kafka service on the three designated hosts:
/appdata/kafka/bin/kafka-server-start.sh -daemon /appdata/kafka/config/server.properties
Start a console producer on one of the hosts:
/appdata/kafka/bin/kafka-console-producer.sh --broker-list node1:9092,node2:9092,node3:9092 --topic xiaoniu
Type the data to be processed into the console; the program below completes the computation.
package day9
import org.apache.spark.SparkConf
import org.apache.spark.streaming.dstream.{DStream, ReceiverInputDStream}
import org.apache.spark.streaming.kafka.KafkaUtils
import org.apache.spark.streaming.{Seconds, StreamingContext}
object KafkaWordCount {
def main(args: Array[String]): Unit = {
val conf = new SparkConf().setAppName("KafkaWordCount").setMaster("local[*]")
val ssc = new StreamingContext(conf, Seconds(5))
val zkQuorum = "node1:2181,node2:2181,node3:2181"
val groupId = "g1"
val topic = Map[String, Int]("xiaoniu" -> 1)
//create the DStream; a Kafka DStream is needed
val data: ReceiverInputDStream[(String, String)] = KafkaUtils.createStream(ssc, zkQuorum, groupId, topic)
//process the data
//the ReceiverInputDStream[(String, String)] holds tuples (the key the record was written with, the actual content)
val lines: DStream[String] = data.map(_._2)
val words: DStream[String] = lines.flatMap(_.split(" "))
//pair each word with 1
val wordAndOne: DStream[(String, Int)] = words.map((_, 1))
//aggregate
val reduced: DStream[(String, Int)] = wordAndOne.reduceByKey(_+_)
//print the result (an action)
reduced.print()
//start the Spark Streaming program
ssc.start()
//wait for a graceful exit
ssc.awaitTermination()
}
}
Case 2
Accumulate the results across batches (each print shows the running total over all data seen so far).
Drawback: after the program exits, it does not remember how far it has read; on restart it only reads and processes data produced after the restart. (With no marker of the processing position, data is easily lost.)
package day9
import org.apache.spark.{HashPartitioner, SparkConf}
import org.apache.spark.streaming.dstream.{DStream, ReceiverInputDStream}
import org.apache.spark.streaming.kafka.KafkaUtils
import org.apache.spark.streaming.{Seconds, StreamingContext}
//word count with updatable state
object StatefulKafkaWordCount {
/**
* first parameter: the aggregation key, i.e. the word
* second parameter: the counts of this word in each partition for the current batch
* third parameter: the initial value or the accumulated intermediate result
*/
val updateFunc = (iterator : Iterator[(String, Seq[Int], Option[Int])]) => {
//iterator.map(t => (t._1, t._2.sum + t._3.getOrElse(0)))
iterator.map{ case(x, y, z) => (x, y.sum + z.getOrElse(0))}
}
def main(args: Array[String]): Unit = {
val conf = new SparkConf().setAppName("KafkaWordCount").setMaster("local[*]")
val ssc = new StreamingContext(conf, Seconds(5))
//to update (accumulate) historical data, the intermediate results must be saved
ssc.checkpoint("./ck") //in production, save the intermediate results to reliable storage (e.g. HDFS), not a local path
val zkQuorum = "node1:2181,node2:2181,node3:2181"
val groupId = "g1"//the consumer group name; consumers in the same group do not consume the same data
val topic = Map[String, Int]("xiaoniu" -> 1)
//create the DStream; a Kafka DStream is needed
val data: ReceiverInputDStream[(String, String)] = KafkaUtils.createStream(ssc, zkQuorum, groupId, topic)
//process the data
//the ReceiverInputDStream[(String, String)] holds tuples (the key the record was written with, the actual content)
val lines: DStream[String] = data.map(_._2)
val words: DStream[String] = lines.flatMap(_.split(" "))
//pair each word with 1
val wordAndOne: DStream[(String, Int)] = words.map((_, 1))
//aggregate
val reduced: DStream[(String, Int)] = wordAndOne.updateStateByKey(updateFunc, new HashPartitioner(ssc.sparkContext.defaultParallelism), true)
//print the result
reduced.print()
//start the Spark Streaming program
ssc.start()
//wait for a graceful exit
ssc.awaitTermination()
}
}
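The update function's contract can be exercised on its own, outside Spark. This plain-Scala sketch uses the same signature as the updateFunc above (the sample words and counts are made up): for each key it sums the current batch's per-partition counts and adds the previous state, falling back to 0 for keys seen for the first time.

```scala
object UpdateFuncSketch {
  // Same shape as StatefulKafkaWordCount.updateFunc:
  // (key, this batch's per-partition counts, previously accumulated state)
  val updateFunc = (iterator: Iterator[(String, Seq[Int], Option[Int])]) =>
    iterator.map { case (word, batchCounts, prevState) =>
      (word, batchCounts.sum + prevState.getOrElse(0))
    }

  def main(args: Array[String]): Unit = {
    // "hello" appeared 2 and 1 times in two partitions this batch, with 5 accumulated before;
    // "tom" is seen for the first time, so its previous state is None
    val input = Iterator(
      ("hello", Seq(2, 1), Some(5): Option[Int]),
      ("tom", Seq(1), None: Option[Int]))
    updateFunc(input).foreach(println)   // (hello,8) then (tom,1)
  }
}
```

Spark calls this function once per key per batch; the returned value becomes the state passed in as the third parameter on the next batch.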
Understanding DStream in depth
DStream is Spark Streaming's high-level abstraction (a wrapper around RDDs).
1. Spark Streaming is a real-time computation framework built on Spark Core that can consume data from many sources and process it. Its most basic abstraction is the DStream (a proxy), which is essentially a series of consecutive RDDs; a DStream is really a wrapper around RDDs.
2. A DStream represents a continuous data stream and is, in essence, a series of consecutive RDDs: operating on a DStream means operating on RDDs.
The DStream generates an RDD every batch interval, and operating on the DStream essentially operates on the RDD of the corresponding interval.
3. A DStream can be thought of as an RDD factory: every RDD it produces carries the same business logic; only the data each RDD reads differs.
4. DStreams depend on one another, so at any fixed point in time the RDDs of dependent DStreams also depend on one another.
Every batch interval effectively produces a small DAG, which is periodically submitted to the cluster to run.
Receiver mode vs. the direct approach
Receiver architecture diagram
Receiver mode: a Receiver pulls data from Kafka at a fixed interval; because the data for an interval may not fit in memory, it is also written to disk.
This mode loses data easily, recovery from disk is slow, and it was dropped in 0.10.
Direct approach
Direct approach: the DStream's partitions connect directly to Kafka's partitions.
What is the difference between Receiver mode and the direct approach in Spark Streaming?
Receiver mode receives data for a fixed interval (held in memory) using Kafka's high-level API, which maintains offsets automatically; it is convenient, but data is only processed once the interval elapses, so it is less efficient and loses data easily.
The direct approach connects straight to Kafka's partitions using Kafka's lower-level API; it is more efficient, but you must maintain the offsets yourself.
Direct-approach example
Kafka 0.8
package day9
import kafka.common.TopicAndPartition
import kafka.message.MessageAndMetadata
import kafka.serializer.StringDecoder
import kafka.utils.{ZKGroupTopicDirs, ZkUtils}
import org.I0Itec.zkclient.ZkClient
import org.apache.spark.SparkConf
import org.apache.spark.streaming.dstream.{DStream, InputDStream}
import org.apache.spark.streaming.kafka.{HasOffsetRanges, KafkaUtils, OffsetRange}
import org.apache.spark.streaming.{Duration, StreamingContext}
object KafkaDirectWordCount {
def main(args: Array[String]): Unit = {
//consumer group name
val group = "g001"
//create the SparkConf
val conf = new SparkConf().setAppName("KafkaDirectWordCount").setMaster("local[2]")
//create the StreamingContext and set the batch interval
val ssc = new StreamingContext(conf, Duration(5000))
//the topic to consume
val topic = "wwcc"
//the Kafka broker list (the Spark Streaming tasks connect directly to the Kafka partitions and consume with the lower-level API, which is more efficient)
val brokerList = "node1:9092,node2:9092,node3:9092"
//the ZooKeeper address, used later to update the consumed offsets (Redis or MySQL could be used to record offsets instead)
val zkQuorum = "node1:2181,node2:2181,node3:2181"
//the set of topic names used when creating the stream; Spark Streaming can consume multiple topics at once
val topics: Set[String] = Set(topic)
//create a ZKGroupTopicDirs object, which specifies the directory in ZooKeeper where the offsets are saved
val topicDirs = new ZKGroupTopicDirs(group, topic)
//the ZooKeeper path, e.g. "/g001/offsets/wordcount/"
val zkTopicPath = s"${topicDirs.consumerOffsetDir}"
//Kafka parameters
val kafkaParams = Map(
"metadata.broker.list" -> brokerList,
"group.id" -> group,
//read from the beginning
"auto.offset.reset" -> kafka.api.OffsetRequest.SmallestTimeString
)
//create a ZooKeeper client from the quorum address, used to update the offsets
//it can read offset data from ZooKeeper and update the offsets
val zkClient = new ZkClient(zkQuorum)
//check whether the path has child nodes (the children are created when we save offsets for the different partitions)
// /g001/offsets/wordcount/0/10001"
// /g001/offsets/wordcount/1/30001"
// /g001/offsets/wordcount/2/10001"
// zkTopicPath -> /g001/offsets/wordcount/
val children = zkClient.countChildren(zkTopicPath)
var kafkaStream: InputDStream[(String, String)] = null
//if ZooKeeper has saved offsets, use them as the starting position of the kafkaStream
var fromOffsets: Map[TopicAndPartition, Long] = Map()
//if offsets were saved before
if(children > 0){
for(i <- 0 until children){
// /g001/offsets/wordcount/0/10001
// /g001/offsets/wordcount/0
val partitionOffset = zkClient.readData[String](s"$zkTopicPath/${i}")
// wordcount/0
val tp = TopicAndPartition(topic, i)
//add each partition's offset to fromOffsets
// wordcount/0 -> 10001
fromOffsets += (tp -> partitionOffset.toLong)
}
//key: the Kafka key, value: e.g. "hello tom hello jerry"
//this transforms every Kafka message into a (kafka key, message) tuple
val messageHandler = (mmd: MessageAndMetadata[String, String]) => (mmd.key(), mmd.message())
//create the direct DStream with KafkaUtils (fromOffsets makes consumption continue from the previously computed offsets)
//[String, String, StringDecoder, StringDecoder, (String, String)]
// key value key decoder value decoder
kafkaStream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder, (String, String)](ssc, kafkaParams, fromOffsets, messageHandler)
} else {
//if no offsets were saved, start from the latest (largest) or oldest (smallest) offset according to kafkaParams
kafkaStream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topics)
}
//the offset ranges
var offsetRanges = Array[OffsetRange]()
//the DStream's transform method pulls out the RDD of the current batch
//it extracts that RDD's offsets and then returns the RDD back into the DStream
val transform: DStream[(String, String)] = kafkaStream.transform { rdd =>
//get the Kafka offsets corresponding to this rdd
//the RDD is a KafkaRDD, from which the offset ranges can be obtained
offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
rdd
}
val messages: DStream[String] = transform.map(_._2)
//iterate over the RDDs in the DStream
messages.foreachRDD { rdd =>
//operate on the RDD, triggering an action
rdd.foreachPartition(partition =>
partition.foreach(x => {
println(x)
})
)
for (o <- offsetRanges) {
// /g001/offsets/wordcount/0
val zkPath = s"${topicDirs.consumerOffsetDir}/${o.partition}"
//save this partition's offset to ZooKeeper
// /g001/offsets/wordcount/0/20000
ZkUtils.updatePersistentPath(zkClient, zkPath, o.untilOffset.toString)
}
}
ssc.start()
ssc.awaitTermination()
}
}
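The read-offsets → process → commit-offsets cycle above can be simulated without Kafka or ZooKeeper. In this plain-Scala sketch, a mutable Map stands in for the per-partition ZooKeeper nodes, and the partition numbers and offsets are made up:

```scala
import scala.collection.mutable

object OffsetCycleSketch {
  // Stands in for the /g001/offsets/<topic>/<partition> nodes in ZooKeeper
  val zk = mutable.Map[Int, Long]()

  // On startup: read the saved offsets (an empty map means "no saved position")
  def restore(): Map[Int, Long] = zk.toMap

  // After processing a batch: persist each partition's untilOffset
  def commit(offsetRanges: Seq[(Int, Long)]): Unit =
    offsetRanges.foreach { case (partition, untilOffset) => zk(partition) = untilOffset }

  def main(args: Array[String]): Unit = {
    println(restore())                     // no saved position yet
    commit(Seq((0, 10001L), (1, 30001L)))  // batch 1 processed
    commit(Seq((0, 20000L)))               // batch 2 advanced partition 0
    println(restore())                     // partition 0 resumes at 20000, partition 1 at 30001
  }
}
```

Committing only after the output has completed gives at-least-once semantics: on a crash between output and commit, the batch is reprocessed from the last saved offsets.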
Optimized example
Kafka 0.8
package day9
import kafka.common.TopicAndPartition
import kafka.message.MessageAndMetadata
import kafka.serializer.StringDecoder
import kafka.utils.{ZKGroupTopicDirs, ZkUtils}
import org.I0Itec.zkclient.ZkClient
import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.dstream.{DStream, InputDStream}
import org.apache.spark.streaming.kafka.{HasOffsetRanges, KafkaUtils, OffsetRange}
import org.apache.spark.streaming.{Duration, StreamingContext}
object KafkaDirectWordCountV2 {
def main(args: Array[String]): Unit = {
//consumer group name
val group = "g001"
//create the SparkConf
val conf = new SparkConf().setAppName("KafkaDirectWordCountV2").setMaster("local[2]")
//create the StreamingContext and set the batch interval
val ssc = new StreamingContext(conf, Duration(5000))
//the topic to consume
val topic = "wwcc"
//the Kafka broker list (the Spark Streaming tasks connect directly to the Kafka partitions and consume with the lower-level API, which is more efficient)
val brokerList = "node1:9092,node2:9092,node3:9092"
//the ZooKeeper address, used later to update the consumed offsets (Redis or MySQL could be used to record offsets instead)
val zkQuorum = "node1:2181,node2:2181,node3:2181"
//the set of topic names used when creating the stream; Spark Streaming can consume multiple topics at once
val topics: Set[String] = Set(topic)
//create a ZKGroupTopicDirs object, which specifies the directory in ZooKeeper where the offsets are saved
val topicDirs = new ZKGroupTopicDirs(group, topic)
//the ZooKeeper path, e.g. "/g001/offsets/wordcount/"
val zkTopicPath = s"${topicDirs.consumerOffsetDir}"
//Kafka parameters
val kafkaParams = Map(
"metadata.broker.list" -> brokerList,
"group.id" -> group,
//read from the beginning
"auto.offset.reset" -> kafka.api.OffsetRequest.SmallestTimeString
)
//create a ZooKeeper client from the quorum address, used to update the offsets
//it can read offset data from ZooKeeper and update the offsets
val zkClient = new ZkClient(zkQuorum)
//check whether the path has child nodes (the children are created when we save offsets for the different partitions)
// /g001/offsets/wordcount/0/10001"
// /g001/offsets/wordcount/1/30001"
// /g001/offsets/wordcount/2/10001"
// zkTopicPath -> /g001/offsets/wordcount/
val children = zkClient.countChildren(zkTopicPath)
var kafkaStream: InputDStream[(String, String)] = null
//if ZooKeeper has saved offsets, use them as the starting position of the kafkaStream
var fromOffsets: Map[TopicAndPartition, Long] = Map()
//if offsets were saved before
if(children > 0){
for(i <- 0 until children){
// /g001/offsets/wordcount/0/10001
// /g001/offsets/wordcount/0
val partitionOffset = zkClient.readData[String](s"$zkTopicPath/${i}")
// wordcount/0
val tp = TopicAndPartition(topic, i)
//add each partition's offset to fromOffsets
// wordcount/0 -> 10001
fromOffsets += (tp -> partitionOffset.toLong)
}
//key: the Kafka key, value: e.g. "hello tom hello jerry"
//this transforms every Kafka message into a (kafka key, message) tuple
val messageHandler = (mmd: MessageAndMetadata[String, String]) => (mmd.key(), mmd.message())
//create the direct DStream with KafkaUtils (fromOffsets makes consumption continue from the previously computed offsets)
//[String, String, StringDecoder, StringDecoder, (String, String)]
// key value key decoder value decoder
kafkaStream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder, (String, String)](ssc, kafkaParams, fromOffsets, messageHandler)
} else {
//if no offsets were saved, start from the latest (largest) or oldest (smallest) offset according to kafkaParams
kafkaStream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topics)
}
//the offset ranges
var offsetRanges = Array[OffsetRange]()
//with the direct approach, offsets are only available on the KafkaRDD inside the KafkaDStream, so DStream transformations cannot be chained first
//iterate over the RDDs in the DStream
kafkaStream.foreachRDD { kafkaRDD =>
//only a KafkaRDD can be cast to HasOffsetRanges to obtain the offsets
offsetRanges = kafkaRDD.asInstanceOf[HasOffsetRanges].offsetRanges
val lines: RDD[String] = kafkaRDD.map(_._2)
//operate on the RDD
lines.foreachPartition(partition => partition.foreach(x => {
println(x)
}))
for (o <- offsetRanges) {
// /g001/offsets/wordcount/0
val zkPath = s"${topicDirs.consumerOffsetDir}/${o.partition}"
//save this partition's offset to ZooKeeper
// /g001/offsets/wordcount/0/20000
ZkUtils.updatePersistentPath(zkClient, zkPath, o.untilOffset.toString)
}
}
ssc.start()
ssc.awaitTermination()
}
}
Kafka 0.10
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
<modelVersion>4.0.0</modelVersion>
<groupId>cn.edu360.spark</groupId>
<artifactId>streaming-kafka10</artifactId>
<version>1.0-SNAPSHOT</version>
<properties>
<maven.compiler.source>1.8</maven.compiler.source>
<maven.compiler.target>1.8</maven.compiler.target>
<scala.version>2.11.8</scala.version>
<spark.version>2.2.0</spark.version>
<hadoop.version>2.7.3</hadoop.version>
<encoding>UTF-8</encoding>
</properties>
<dependencies>
<dependency>
<groupId>org.scala-lang</groupId>
<artifactId>scala-library</artifactId>
<version>${scala.version}</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_2.11</artifactId>
<version>${spark.version}</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-sql_2.11</artifactId>
<version>${spark.version}</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-streaming_2.11</artifactId>
<version>${spark.version}</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-streaming-kafka-0-10_2.11</artifactId>
<version>2.2.0</version>
</dependency>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-client</artifactId>
<version>${hadoop.version}</version>
</dependency>
<dependency>
<groupId>mysql</groupId>
<artifactId>mysql-connector-java</artifactId>
<version>5.1.38</version>
</dependency>
<dependency>
<groupId>redis.clients</groupId>
<artifactId>jedis</artifactId>
<version>2.9.0</version>
</dependency>
</dependencies>
<build>
<pluginManagement>
<plugins>
<plugin>
<groupId>net.alchim31.maven</groupId>
<artifactId>scala-maven-plugin</artifactId>
<version>3.2.2</version>
</plugin>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-compiler-plugin</artifactId>
<version>3.5.1</version>
</plugin>
</plugins>
</pluginManagement>
<plugins>
<plugin>
<groupId>net.alchim31.maven</groupId>
<artifactId>scala-maven-plugin</artifactId>
<executions>
<execution>
<id>scala-compile-first</id>
<phase>process-resources</phase>
<goals>
<goal>add-source</goal>
<goal>compile</goal>
</goals>
</execution>
<execution>
<id>scala-test-compile</id>
<phase>process-test-resources</phase>
<goals>
<goal>testCompile</goal>
</goals>
</execution>
</executions>
</plugin>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-compiler-plugin</artifactId>
<executions>
<execution>
<phase>compile</phase>
<goals>
<goal>compile</goal>
</goals>
</execution>
</executions>
</plugin>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-shade-plugin</artifactId>
<version>2.4.3</version>
<executions>
<execution>
<phase>package</phase>
<goals>
<goal>shade</goal>
</goals>
<configuration>
<filters>
<filter>
<artifact>*:*</artifact>
<excludes>
<exclude>META-INF/*.SF</exclude>
<exclude>META-INF/*.DSA</exclude>
<exclude>META-INF/*.RSA</exclude>
</excludes>
</filter>
</filters>
</configuration>
</execution>
</executions>
</plugin>
</plugins>
</build>
</project>
package cn.edu360.streaming.kafka10
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010._
import org.apache.spark.streaming.{Seconds, StreamingContext}
object DirectStream {
def main(args: Array[String]): Unit = {
val group = "g0"
val topic = "my-orders"
//create the SparkConf; remove .setMaster("local[2]") when submitting the job to a cluster
val conf = new SparkConf().setAppName("DirectStream").setMaster("local[2]")
//create a StreamingContext, which wraps a SparkContext
val streamingContext = new StreamingContext(conf, Seconds(5))
//Kafka parameters
val kafkaParams = Map[String, Object](
"bootstrap.servers" -> "node-1:9092,node-2:9092,node-3:9092",
"key.deserializer" -> classOf[StringDeserializer],
"value.deserializer" -> classOf[StringDeserializer],
"group.id" -> group,
"auto.offset.reset" -> "earliest", // or "latest"
"enable.auto.commit" -> (false: java.lang.Boolean)
)
val topics = Array(topic)
//read from Kafka with the direct approach; the consumed offsets are recorded in Kafka itself
val stream = KafkaUtils.createDirectStream[String, String](
streamingContext,
//location strategy (if Kafka and the Spark executors are co-located, preferred locations are used)
PreferConsistent,
//consumer strategy (topics can also be subscribed by pattern, e.g. my-orders-.*)
Subscribe[String, String](topics, kafkaParams)
)
//iterate over the KafkaRDDs in the DStream, taking out the KafkaRDD for each batch interval (the KafkaRDD is the first RDD of each interval)
stream.foreachRDD { rdd =>
if(!rdd.isEmpty()) {
//get the offset ranges for this RDD
val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
//process the data; foreach is an action
rdd.foreach{ line =>
println(line.key() + " " + line.value())
}
//commit the offsets
// some time later, after outputs have completed
stream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
}
}
streamingContext.start()
streamingContext.awaitTermination()
}
}