Spark Streaming stateful wordCount example (using updateStateByKey)
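The example starts from a simple wordcount. The state to keep differs from one application scenario to another, so updateFunction has to be adapted to the requirement at hand.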
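As a rough illustration of how the state can differ, the sketch below keeps a (sum, count) pair per key instead of a single Int, so that a running average can be derived from it. It is plain Java with no Spark dependency and only mirrors the shape of an update function (new values of the batch plus the previous state in, new state out); the names AverageStateSketch and SumCount are hypothetical and do not come from the example project.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Optional;

// Hypothetical example: the state is a (sum, count) pair rather than one Int.
final class AverageStateSketch {

    // Immutable state object kept per key between batches.
    static final class SumCount {
        final double sum;
        final long count;

        SumCount(double sum, long count) {
            this.sum = sum;
            this.count = count;
        }

        double average() {
            return count == 0 ? 0.0 : sum / count;
        }
    }

    // Same shape as an update function:
    // (new values of this batch, previous state) -> new state.
    static Optional<SumCount> update(List<Double> newValues, Optional<SumCount> oldState) {
        SumCount previous = oldState.orElse(new SumCount(0.0, 0L));
        double sum = previous.sum;
        for (double v : newValues) {
            sum += v;
        }
        return Optional.of(new SumCount(sum, previous.count + newValues.size()));
    }

    public static void main(String[] args) {
        Optional<SumCount> state = Optional.empty();
        state = update(Arrays.asList(2.0, 4.0), state);        // first batch
        state = update(Collections.singletonList(6.0), state); // second batch
        System.out.println(state.get().average());             // average over all three values
    }
}
```

In a Spark job the equivalent logic would live in the function passed to updateStateByKey, with the state type changed from Int to a pair or case class.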
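The data is received from the Kafka topic topicA. A producer randomly picks one word out of spark, hadoop, flink, hbase and kafka and sends it to topicA once per second.
The producer code is as follows: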
/**
* Copyright(C) 2018 Hangzhou xianghu.wang Technology Co., Ltd. All rights reserved.
*/
package com.ccclubs.kafka;
import com.ccclubs.uitl.KafkaUtil;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
/**
* @author xianghu.wang
* @date: 2018-12-18 15:22
* @des:
*/
public class KafkaWordProducer {
    private static final String TOPIC = "topicA";

    public static void main(String[] args) throws InterruptedException {
        // Assumes the helper returns a producer configured with String serializers
        KafkaProducer<String, String> producer = KafkaUtil.getKafkaProucer();
        String[] sources = {"spark", "hadoop", "flink", "hbase", "kafka"};
        int wordIndex;
        while (true) {
            // Pick a random word from the list
            wordIndex = (int) (Math.random() * sources.length);
            ProducerRecord<String, String> record = new ProducerRecord<>(TOPIC, sources[wordIndex]);
            System.out.println(record.value());
            producer.send(record);
            // Send one word per second
            Thread.sleep(1000);
        }
    }
}
Spark Streaming receives the words, counts the occurrences of each word, and prints the result to the console.
package com.ccclubs.streaming
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.{Seconds, StreamingContext}
/**
 * @author xianghu.wang
 * @date: 2018-12-18 15:47
 * @des: stateful wordcount
 */
object StreamingWordCount {
  def main(args: Array[String]): Unit = {
    // Create the StreamingContext with a 1-second batch interval
    val sparkConf = new SparkConf().setMaster("local[*]").setAppName("SparkStreamingDemo")
    val ssc = new StreamingContext(sparkConf, Seconds(1))
    // Set the checkpoint directory (required by updateStateByKey)
    ssc.checkpoint("./checkpoint")
    // Kafka parameters
    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> "zc01:9092",
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id" -> "SparkStreamingDemo",
      "auto.offset.reset" -> "latest"
    )
    // Kafka topics
    val topics = Array("topicA")
    // Create the DStream from Kafka
    val stream = KafkaUtils.createDirectStream[String, String](
      ssc,
      PreferConsistent,
      Subscribe[String, String](topics, kafkaParams)
    )
    // Each record in the stream is a ConsumerRecord,
    // public ConsumerRecord(topic: String, partition: Int, offset: Long, key: K, value: V)
    val kvs = stream.map(record => (record.value, 1))
    val count = kvs.updateStateByKey[Int](updateFunction _)
    // Print to the console
    count.print()
    // Start the computation
    ssc.start()
    ssc.awaitTermination()
  }

  /**
   * @param newValues the new values of the current batch for a key; the element type
   *                  matches the value type of the key-value pairs (Int here)
   * @param oldCount  the count accumulated so far, None if the key is seen for the first time
   * @return the updated count for the key (returning None would drop the key from the state)
   */
  def updateFunction(newValues: Seq[Int], oldCount: Option[Int]): Option[Int] = {
    val newCount = newValues.sum
    val previousCount = oldCount.getOrElse(0)
    Some(newCount + previousCount)
  }
}
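To make the effect of updateFunction across batches easier to follow, here is a minimal plain-Java simulation without any Spark dependency. It assumes the documented behaviour of updateStateByKey: in every batch the update function is applied to each key that has new values or already has state, and a key is dropped when the function returns nothing. The class StatefulCountSketch and the method applyBatch are hypothetical names used only for this illustration, not part of the Spark API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

// Plain-Java sketch of applying a per-key update function batch after batch.
final class StatefulCountSketch {

    // Mirrors updateFunction: sum of this batch's values plus the previous count.
    static Optional<Integer> update(List<Integer> newValues, Optional<Integer> oldCount) {
        int newCount = 0;
        for (int v : newValues) {
            newCount += v;
        }
        return Optional.of(newCount + oldCount.orElse(0));
    }

    // Applies update to every key that appears in the batch or already has state.
    static Map<String, Integer> applyBatch(Map<String, Integer> state, List<String> batch) {
        // Group the batch into key -> list of 1s, like stream.map(record => (record.value, 1)).
        Map<String, List<Integer>> grouped = new HashMap<>();
        for (String word : batch) {
            grouped.computeIfAbsent(word, k -> new ArrayList<>()).add(1);
        }
        // Keys with existing state are updated too, with an empty value list.
        for (String key : state.keySet()) {
            grouped.computeIfAbsent(key, k -> new ArrayList<>());
        }
        Map<String, Integer> next = new HashMap<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            Optional<Integer> old = Optional.ofNullable(state.get(e.getKey()));
            // An empty result would remove the key from the state.
            update(e.getValue(), old).ifPresent(c -> next.put(e.getKey(), c));
        }
        return next;
    }

    public static void main(String[] args) {
        Map<String, Integer> state = new HashMap<>();
        state = applyBatch(state, Arrays.asList("spark", "kafka", "spark"));
        state = applyBatch(state, Arrays.asList("spark", "flink"));
        System.out.println(state); // counts accumulated over both batches
    }
}
```

After the two batches in main, spark has been counted three times while kafka and flink have been counted once each; kafka keeps its count in the second batch even though it received no new values.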
Run result: in every batch the console shows the cumulative count of each word received so far.
Note: please credit the source when reposting.