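Spark Streaming Key Points
//************* What is Spark Streaming
1. Spark Streaming is a Spark component built on top of Spark Core for processing streaming data, similar to Storm.
2. Spark Streaming can be mixed with Spark Core and Spark SQL in the same program.
3. With Spark Streaming, the two main questions are:
   1. What data can it receive? Kafka, Flume, HDFS, Twitter, and so on.
   2. How can it process the data? With stateless transformations (the processing of one batch does not depend on earlier batches) and stateful transformations (the processing of a batch depends on earlier batches, for example a running total).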
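The difference between the two kinds of transformation can be sketched without Spark at all. The following is a minimal model in plain Scala, assuming each inner Seq stands in for one batch; the names StatelessVsStateful, countBatch, updateCount and runningCounts are made up for illustration. The update function deliberately has the shape (new values, previous state) => new state, which is the shape Spark's updateStateByKey expects.

```scala
object StatelessVsStateful {
  // Each inner Seq stands in for one batch of words (illustrative data).
  val batches: Seq[Seq[String]] = Seq(
    Seq("a", "b", "a"),
    Seq("b", "c"),
    Seq("a")
  )

  // Stateless: a batch is counted on its own, earlier batches play no role.
  def countBatch(batch: Seq[String]): Map[String, Int] =
    batch.groupBy(identity).map { case (w, ws) => (w, ws.size) }

  // Stateful update for one key: fold this batch's values into the previous state.
  def updateCount(newValues: Seq[Int], state: Option[Int]): Option[Int] =
    Some(state.getOrElse(0) + newValues.sum)

  // Stateful: carry a running total from one batch to the next.
  def runningCounts(bs: Seq[Seq[String]]): Seq[Map[String, Int]] =
    bs.scanLeft(Map.empty[String, Int]) { (state, batch) =>
      val fresh = countBatch(batch)
      (state.keySet ++ fresh.keySet).map { k =>
        k -> updateCount(fresh.get(k).toSeq, state.get(k)).get
      }.toMap
    }.tail // drop the empty initial state

  def main(args: Array[String]): Unit = {
    batches.map(countBatch).foreach(println) // one independent result per batch
    runningCounts(batches).foreach(println)  // totals accumulated so far
  }
}
```

With the sample data, the stateless count of the last batch contains only "a" with a count of 1, while the running total after the last batch is a -> 3, b -> 2, c -> 1.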
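//************* How Spark Streaming is implemented
1. Spark Streaming uses a "micro-batch" architecture.
2. Think of the data stream as flowing water: the micro-batch architecture cuts the flow into segments at a user-defined time interval (the batch interval). Each segment becomes one RDD in Spark, so operating on the stream means operating on each of these RDDs separately. Processing each RDD can be regarded as a small batch (offline) job.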
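The slicing step can be illustrated with a small plain-Scala sketch. It is only a model under stated assumptions: Event, slice and intervalMs are hypothetical names, a Seq plays the role of an RDD, and, unlike Spark, intervals with no events produce no batch here.

```scala
object MicroBatchSketch {
  // A hypothetical event: arrival time in milliseconds plus a payload.
  final case class Event(timeMs: Long, payload: String)

  // Assign every event to the batch whose interval contains its arrival time
  // and return the batches ordered by their start time.
  def slice(events: Seq[Event], intervalMs: Long): Seq[(Long, Seq[String])] =
    events
      .groupBy(e => e.timeMs / intervalMs * intervalMs) // key = batch start time
      .toSeq
      .sortBy(_._1)
      .map { case (start, es) => (start, es.sortBy(_.timeMs).map(_.payload)) }

  def main(args: Array[String]): Unit = {
    val events = Seq(Event(100, "a"), Event(900, "b"), Event(1200, "c"), Event(2500, "d"))
    // Every batch is handled independently, like a small offline job.
    slice(events, intervalMs = 1000).foreach { case (start, batch) =>
      println(s"batch starting at $start ms holds ${batch.size} record(s)")
    }
  }
}
```

With a 1000 ms interval the four sample events fall into three batches starting at 0, 1000 and 2000 ms.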
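//************* What is a DStream
1. DStream is an abstract class for streaming computation, analogous to RDD and DataFrame. In the source code, a DStream keeps the data it manages in a HashMap: the key is the batch time and the value is the RDD holding that batch's data.
2. An operation on a DStream is therefore an operation on all of the time-ordered RDDs it contains.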
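To make the "HashMap of time to RDD" picture concrete, here is a toy model in plain Scala. It is not Spark's implementation: MiniDStream and its members are invented for illustration, a Long stands in for the batch time, and a Seq stands in for an RDD.

```scala
import scala.collection.mutable

object MiniDStreamSketch {
  // Toy stand-in for a DStream: batch time (ms) -> data of that batch.
  final class MiniDStream[T](val batches: mutable.HashMap[Long, Seq[T]]) {
    // A transformation on the stream is applied to every per-batch collection.
    def map[U](f: T => U): MiniDStream[U] = {
      val out = mutable.HashMap.empty[Long, Seq[U]]
      batches.foreach { case (time, data) => out(time) = data.map(f) }
      new MiniDStream(out)
    }

    // The batches in time order.
    def ordered: Seq[(Long, Seq[T])] = batches.toSeq.sortBy(_._1)
  }

  def main(args: Array[String]): Unit = {
    val stream = new MiniDStream(mutable.HashMap(2000L -> Seq("c"), 1000L -> Seq("a", "b")))
    // Upper-casing the "stream" upper-cases each batch separately.
    stream.map(_.toUpperCase).ordered.foreach(println)
  }
}
```

The point of the sketch is only the structure: one map call on the stream turns into one map call per stored batch.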
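//************* How to use Spark Streaming
1. StreamingContext is the entry point to Spark Streaming. It can be created from an existing SparkContext. To test a socket source, start a text server on port 9999 with netcat:
nc -lk 9999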
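Creating the StreamingContext from an existing SparkContext might look like the sketch below. The object name, the "localhost" host and the 10-second timeout are assumptions made for the example; it expects the netcat server above to be running on the same machine and needs the Spark Streaming library on the classpath.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.streaming.{Seconds, StreamingContext}

object ContextFromSparkContext {
  def main(args: Array[String]): Unit = {
    // local[2]: one thread is occupied by the socket receiver, one does the processing.
    val conf = new SparkConf().setMaster("local[2]").setAppName("ContextFromSparkContext")
    val sc = new SparkContext(conf)

    // Reuse the existing SparkContext; the batch interval is 1 second.
    val ssc = new StreamingContext(sc, Seconds(1))

    // Count the lines that arrive in each batch.
    val lines = ssc.socketTextStream("localhost", 9999)
    lines.count().print()

    ssc.start()
    ssc.awaitTerminationOrTimeout(10000) // run for about 10 seconds (arbitrary choice)
    ssc.stop(stopSparkContext = true, stopGracefully = false)
  }
}
```

Because the StreamingContext wraps the given SparkContext, the same sc can also be used for ordinary RDD work in the same program.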
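2. Custom Receiver
   1. Create a class that extends Receiver and pass it a type parameter: the type of the data you want to receive.
   2. Override the Receiver methods onStart (called when the Receiver starts) and onStop (called when the Receiver stops normally).
   3. Use the custom Receiver in your program with streamingContext.receiverStream(new CustomerReceiver(host, port)).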
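import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver
import java.io.{BufferedReader, InputStreamReader}
import java.net.Socket
import java.nio.charset.StandardCharsets

// Receives lines of UTF-8 text from a TCP socket.
class CustomerReceiver(host: String, port: Int) extends Receiver[String](StorageLevel.MEMORY_AND_DISK) {
  // onStart must not block, so the read loop runs in its own thread.
  override def onStart(): Unit = {
    new Thread("customerThread") {
      override def run(): Unit = { receive() }
    }.start()
  }

  def receive(): Unit = {
    try {
      val socket = new Socket(host, port)
      val reader = new BufferedReader(new InputStreamReader(socket.getInputStream(), StandardCharsets.UTF_8))
      var line = reader.readLine()
      while (!isStopped() && line != null) {
        store(line) // hand the record over to Spark
        line = reader.readLine()
      }
      reader.close()
      socket.close()
      // The connection ended: ask Spark to restart the receiver and reconnect.
      restart("Trying to connect again")
    } catch {
      case e: java.io.IOException => restart("Error receiving data", e)
    }
  }

  // Nothing to clean up: the reading thread exits once isStopped() returns true.
  override def onStop(): Unit = {}
}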
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Word count over the lines delivered by CustomerReceiver.
object CustomerRec {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[4]").setAppName("NetworkWordCount")
    val ssc = new StreamingContext(conf, Seconds(1)) // 1-second batches
    val lines = ssc.receiverStream(new CustomerReceiver("HA3VM01", 9999))
    val words = lines.flatMap(_.split(" "))
    val pairs = words.map(word => (word, 1))
    val wordCount = pairs.reduceByKey(_ + _)
    wordCount.print() // prints the first elements of every batch
    ssc.start()
    ssc.awaitTermination()
  }
}
RDD queue data source
1. StreamingContext.queueStream(rddQueue) monitors a queue of RDDs: every new RDD added to the queue is picked up and processed by Spark Streaming.
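A minimal sketch of the queue source follows, assuming a local master and made-up sample data (the object name, the numbers 1 to 100 and the five iterations are illustrative only). It needs the Spark Streaming library on the classpath.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Seconds, StreamingContext}
import scala.collection.mutable

object QueueStreamSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("QueueStreamSketch")
    val ssc = new StreamingContext(conf, Seconds(1))

    // The queue that Spark Streaming watches for new RDDs.
    val rddQueue = new mutable.Queue[RDD[Int]]()
    val input = ssc.queueStream(rddQueue)

    // Count how many numbers in each batch share the same last digit.
    input.map(x => (x % 10, 1)).reduceByKey(_ + _).print()

    ssc.start()
    // Push one RDD per second; each one is consumed as a batch.
    for (_ <- 1 to 5) {
      rddQueue.synchronized { rddQueue += ssc.sparkContext.makeRDD(1 to 100, 2) }
      Thread.sleep(1000)
    }
    ssc.stop()
  }
}
```

This source is mostly useful for testing a streaming pipeline, since the input can be generated inside the program without any external service.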
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Note: this example reads text from a socket with socketTextStream (feed it with nc -lk 9999); it does not use queueStream.
object StreamingWordCount extends App {
  val sparkConf = new SparkConf().setAppName("StreamingWordCount").setMaster("local[4]")
  val ssc = new StreamingContext(sparkConf, Seconds(1))
  val lines = ssc.socketTextStream("HA3VM01", 9999)
  val words = lines.flatMap(_.split(" "))
  val pairs = words.map((_, 1))
  val result = pairs.reduceByKey(_ + _)
  result.print()
  ssc.start()
  ssc.awaitTermination()
}
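Integration with Kafka
Start Kafka on all three machines
bin/kafka-server-start.sh config/server.properties
Create the topics, then start console consumers and a console producer to test them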
bin/kafka-topics.sh --create --zookeeper ha3vm01:2181,ha3vm02:2181,ha3vm03:2181 --replication-factor 2 --partitions 2 --topic source
bin/kafka-topics.sh --create --zookeeper ha3vm01:2181,ha3vm02:2181,ha3vm03:2181 --replication-factor 2 --partitions 2 --topic target
bin/kafka-console-consumer.sh --zookeeper ha3vm01:2181,ha3vm02:2181,ha3vm03:2181 --from-beginning --topic source
bin/kafka-console-producer.sh --broker-list ha3vm01:9092,ha3vm02:9092,ha3vm03:9092 --topic source
bin/kafka-console-consumer.sh --zookeeper ha3vm01:2181,ha3vm02:2181,ha3vm03:2181 --from-beginning --topic target
import org.apache.commons.pool2.impl.{GenericObjectPool, GenericObjectPoolConfig}
import org.apache.kafka.clients.producer.KafkaProducer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.kafka010.{ConsumerStrategies, KafkaUtils, LocationStrategies}
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.kafka.common.serialization.StringDeserializer
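// Builds a commons-pool2 pool of KafkaProducerProxy objects.
object createKafkaProducerPool {
  def apply(brokerList: String, topic: String): GenericObjectPool[KafkaProducerProxy] = {
    val producerFactory = new BaseKafkaProducerFactory(brokerList, defaultTopic = Option(topic))
    val poolProducerFactory = new PooledKafkaProducerAppFactory(producerFactory)
    val poolConfig = {
      // Newer commons-pool2 releases make this class generic; there, write GenericObjectPoolConfig[KafkaProducerProxy].
      val c = new GenericObjectPoolConfig
      val maxNumProducers = 10
      c.setMaxTotal(maxNumProducers)
      c.setMaxIdle(maxNumProducers)
      c // return the config itself; without this line the block evaluates to Unit
    }
    new GenericObjectPool[KafkaProducerProxy](poolProducerFactory, poolConfig)
  }
}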
object KafkaStreaming {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[4]").setAppName("NetworkWordCount")
    val ssc = new StreamingContext(conf, Seconds(1))
    val brokers = "ha3vm01:9092,ha3vm02:9092,ha3vm03:9092"
    val sourcetopic = "source"
    val targettopic = "target"
    val group = "con-consumer-group"
    // Consumer settings; ConsumerStrategies.Subscribe expects a Map[String, Object].
    val kafkaParam = Map[String, Object](
      "bootstrap.servers" -> brokers, // Kafka broker addresses, not ZooKeeper
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id" -> group,
      "auto.offset.reset" -> "latest",
      "enable.auto.commit" -> (false: java.lang.Boolean)
    )
    val stream = KafkaUtils.createDirectStream[String, String](ssc, LocationStrategies.PreferConsistent, ConsumerStrategies.Subscribe[String, String](Array(sourcetopic), kafkaParam))
    // Forward every record value from the source topic to the target topic.
    stream.map(s => ("id:" + s.key(), ">>>>:" + s.value())).foreachRDD(rdd => {
      rdd.foreachPartition(partitionOfRecords => {
        // Runs on the executor; as written, a new pool is created for every partition of every batch.
        val pool = createKafkaProducerPool(brokers, targettopic)
        val p = pool.borrowObject()
        partitionOfRecords.foreach { message => System.out.println(message._2); p.send(message._2, Option(targettopic)) }
        pool.returnObject(p)
      })
    })
    ssc.start()
    ssc.awaitTermination()
  }
}
The tutorial video never showed the whole screen, so the complete code is not visible.
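import org.apache.commons.pool2.{BasePooledObjectFactory, PooledObject}
import org.apache.commons.pool2.impl.DefaultPooledObject
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}
import java.util.Properties

// Thin wrapper around a KafkaProducer[String, String] that the pool can manage.
case class KafkaProducerProxy(brokerList: String, config: Properties = new Properties(), defaultTopic: Option[String] = None, producer: Option[KafkaProducer[String, String]] = None) {
  type Key = String
  type Val = String
  require(brokerList != null && brokerList.nonEmpty, "Must set broker list")
  // Use the supplied producer if there is one, otherwise build a String/String producer.
  private val p = producer getOrElse {
    val props = new Properties()
    props.put("bootstrap.servers", brokerList) // Kafka broker addresses, not ZooKeeper
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    new KafkaProducer[String, String](props)
  }
  // ASSUMPTION: the body of toMessage is not visible in the source. This reconstruction
  // uses the explicit topic if given, otherwise defaultTopic, and fails if neither is set.
  private def toMessage(value: Val, key: Option[Key] = None, topic: Option[String] = None): ProducerRecord[Key, Val] = {
    val t = topic.orElse(defaultTopic).getOrElse(throw new IllegalArgumentException("Must provide topic or default topic"))
    key match {
      case Some(k) => new ProducerRecord[Key, Val](t, k, value)
      case None => new ProducerRecord[Key, Val](t, value)
    }
  }
  def send(key: Key, value: Val, topic: Option[String] = None): Unit = {
    p.send(toMessage(value, Option(key), topic))
  }
  // Scala allows default arguments on only one overload, so the topic is explicit here.
  def send(value: Val, topic: Option[String]): Unit = {
    send(null, value, topic)
  }
  def shutdown(): Unit = p.close()
}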
// Factory abstraction: newInstance() creates one KafkaProducerProxy.
abstract class KafkaProducerFactory(brokerList: String, config: Properties = new Properties(), defaultTopic: Option[String] = None) extends Serializable {
  def newInstance(): KafkaProducerProxy
}
class BaseKafkaProducerFactory(brokerList: String, config: Properties = new Properties(), defaultTopic: Option[String] = None) extends KafkaProducerFactory(brokerList, config, defaultTopic) {
  override def newInstance(): KafkaProducerProxy = new KafkaProducerProxy(brokerList, config, defaultTopic, None)
}
// Adapts a KafkaProducerFactory to the commons-pool2 object factory interface.
class PooledKafkaProducerAppFactory(val factory: KafkaProducerFactory) extends BasePooledObjectFactory[KafkaProducerProxy] with Serializable {
  override def create(): KafkaProducerProxy = factory.newInstance()
  override def wrap(obj: KafkaProducerProxy): PooledObject[KafkaProducerProxy] = new DefaultPooledObject(obj)
  // Close the underlying producer when the pool discards an object.
  override def destroyObject(p: PooledObject[KafkaProducerProxy]): Unit = {
    p.getObject.shutdown()
    super.destroyObject(p)
  }
}
Stateless transformations
Stateful transformations