Some Notes on Kafka and Spark Streaming

Latest recommended article published on 2023-01-15 10:53:22

NoOne-csdn

Category: pyspark

Article link: https://blog.csdn.net/weixin_40161254/article/details/103393504

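Spark Streaming documentation notes

Overview

Spark Streaming is a stream-processing framework.
It is an extension of the core Spark API that enables high-throughput, fault-tolerant processing of live data streams.
Spark Streaming ingests live input data from sources such as Kafka, Flume, HDFS, Kinesis, and TCP sockets, processes it, and saves the results to destinations such as HDFS, databases, and dashboards.
Spark's machine learning and graph processing algorithms can also be applied to the data streams.

Internally, it works as follows: Spark Streaming receives the input stream and divides the data into batches, which the Spark engine then processes to produce the results, also in batches.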
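
To make the batching idea concrete, here is a minimal plain-Python sketch (not Spark code) that groups timestamped records into fixed-interval batches. The function name `split_into_batches` and the sample records are hypothetical, and unlike Spark this model simply skips intervals that contain no records.

```python
from collections import defaultdict

def split_into_batches(records, batch_interval):
    """Group (timestamp, value) records into batches of batch_interval seconds.

    Plain-Python model of micro-batching; Spark Streaming does this internally.
    """
    batches = defaultdict(list)
    for timestamp, value in records:
        # Integer division assigns each record to the interval covering its timestamp.
        batch_index = int(timestamp // batch_interval)
        batches[batch_index].append(value)
    # Return the non-empty batches in time order, as a list of lists.
    return [batches[i] for i in sorted(batches)]

records = [(0.2, "a"), (0.9, "b"), (1.1, "c"), (2.5, "d"), (2.7, "e")]
batches = split_into_batches(records, batch_interval=1)
print(batches)
```

Each inner list plays the role of one batch that the engine would process as a unit.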

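DStream

Spark Streaming provides a high-level abstraction for a continuous stream of data, called a discretized stream or DStream.
A DStream can be viewed as a group of RDDs, that is, a sequence of RDDs.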
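
One way to picture "a sequence of RDDs" is a list of lists, where each inner list stands in for the RDD of one batch. The sketch below is a plain-Python model under that assumption; `map_dstream` is a hypothetical helper, not a Spark API.

```python
# A DStream modeled as a list of batches; each inner list stands in for one RDD.
dstream = [[1, 2, 3], [4, 5], []]

def map_dstream(batches, func):
    """Apply func to every element of every batch, keeping batch boundaries."""
    return [[func(x) for x in batch] for batch in batches]

doubled = map_dstream(dstream, lambda x: x * 2)
print(doubled)
```

The batch boundaries survive the operation, which mirrors how a DStream operation is carried out on each underlying RDD in turn.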

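Receiving Kafka data in Spark Streaming

To process Kafka data with Spark Streaming, the first step is to receive the data and convert it into a DStream, the Spark Streaming data structure. There are two ways to receive the data: 1. use a Receiver; 2. read directly from Kafka.

Receiver-based approach
This approach uses a receiver to pull data from Kafka; it is built on the Kafka high-level consumer API. As with all receivers, the data received from Kafka is stored in the Spark executors, and the jobs launched by Spark Streaming then process it.
For different groups and topics, multiple receivers can create separate DStreams that receive data in parallel; union can then merge them into a single DStream.
Direct approach
Spark later introduced the Direct approach. Unlike the Receiver-based approach, it has no receiver layer. Instead, it periodically queries Kafka for the latest offsets in every partition of every topic and uses them to define the offset range each batch will process. If spark.streaming.kafka.maxRatePerPartition is set, it caps how many messages per second each partition contributes to a batch.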
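
The way a per-partition rate cap bounds each batch can be sketched in plain Python. This is a simplified model rather than Spark's actual implementation: `plan_offset_ranges` and its parameters are hypothetical names, and features such as backpressure are ignored.

```python
def plan_offset_ranges(current_offsets, latest_offsets, max_rate_per_partition, batch_seconds):
    """Return {partition: (from_offset, until_offset)} for the next batch.

    current_offsets: next offset to read, per partition.
    latest_offsets: latest offset available in Kafka, per partition.
    max_rate_per_partition: cap in messages per second per partition, or None for no cap.
    """
    ranges = {}
    for partition, start in current_offsets.items():
        end = latest_offsets[partition]
        if max_rate_per_partition is not None:
            # A batch covers batch_seconds, so the cap is rate * duration messages.
            limit = start + int(max_rate_per_partition * batch_seconds)
            end = min(end, limit)
        ranges[partition] = (start, end)
    return ranges

ranges = plan_offset_ranges(
    current_offsets={0: 100, 1: 250},
    latest_offsets={0: 180, 1: 1000},
    max_rate_per_partition=50,
    batch_seconds=2,
)
print(ranges)
```

Partition 0 has fewer new messages than the cap allows, so it is read up to its latest offset, while partition 1 is cut off after 100 messages and the remainder waits for later batches.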
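Advantages
1. Simplified parallelism: one RDD partition is created per Kafka partition, so there is no need to create multiple input streams and union them.
2. Efficiency: no write-ahead log is needed to guard against data loss, because lost data can be re-read from Kafka.
3. Exactly-once semantics: offsets are tracked by Spark Streaming itself rather than through ZooKeeper.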

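RDD

An RDD (Resilient Distributed Dataset) is the most basic data abstraction in Spark. It represents an immutable, partitioned collection of elements that can be operated on in parallel.
RDDs have the characteristics of a dataflow model: automatic fault tolerance, locality-aware scheduling, and scalability. They also let users explicitly cache a working set in memory across multiple queries so that later queries can reuse it, which greatly speeds up those queries.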

Streaming operations

transform (transformation operator)
map
foreachRDD
updateStateByKey

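transform(func)

Applies func to each RDD of the DStream, turning it into another RDD, and returns the resulting DStream.
func takes either one argument (rdd) or two arguments (time, rdd).
In my pyspark test, func received two arguments.
A classic use of transform:

kafkaStreams.transform(self.storeOffsetRanges)

What func returns here is an RDD.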

foreachRDD(func)

Runs func on each RDD in the DStream, typically to update some external state.
It has no return value.
A classic example:

kafkaStreams.transform(self.storeOffsetRanges).foreachRDD(self.printOffsetRanges)
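
The post does not show the two helpers, so the following is a sketch of what they might look like, written as module-level functions in the style of the Spark 2.x Kafka integration guide. `FakeOffsetRange` and `FakeKafkaRDD` are hypothetical stand-ins that let the sketch run without Spark or Kafka; in a real job the RDD comes from the Kafka direct stream.

```python
from collections import namedtuple

# Hypothetical stand-ins so the sketch runs without Spark or Kafka.
FakeOffsetRange = namedtuple("FakeOffsetRange", "topic partition fromOffset untilOffset")

class FakeKafkaRDD:
    def __init__(self, ranges):
        self._ranges = ranges

    def offsetRanges(self):
        return self._ranges

offset_ranges = []

def storeOffsetRanges(rdd):
    """Remember the batch's offset ranges, then pass the RDD through unchanged."""
    global offset_ranges
    offset_ranges = rdd.offsetRanges()
    return rdd  # transform() expects an RDD back

def printOffsetRanges(rdd):
    """Print the ranges captured by storeOffsetRanges for the current batch."""
    for o in offset_ranges:
        print(f"{o.topic},{o.partition},{o.fromOffset},{o.untilOffset}")

rdd = FakeKafkaRDD([FakeOffsetRange("demo-topic", 0, 0, 42)])
returned = storeOffsetRanges(rdd)
printOffsetRanges(returned)
```

The first helper returns the RDD because transform builds a new DStream from it, while the second returns nothing because foreachRDD is an output operation.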

map(func, preservesPartitioning=False)

Applies func to each element of the DStream and returns a new DStream.

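updateStateByKey(updateFunc, numPartitions=None)

Maintains arbitrary state while continuously updating it with new information.
It fits cases where the current result depends not only on the data just received but also on earlier results that must be merged in.
In pyspark, a checkpoint directory must be set before using it.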
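
The stateful semantics can be modeled in plain Python. In this sketch, `update_state_by_key` is a hypothetical helper that folds one batch of (key, value) pairs into a state dict; it only imitates the behavior and is not how Spark implements it. The update function has the same shape as a PySpark one: the new values for a key plus that key's previous state.

```python
def update_func(new_values, last_sum):
    """Add this batch's values to the previous total (None when the key is new)."""
    return sum(new_values) + (last_sum or 0)

def update_state_by_key(state, batch, update):
    """Plain-Python model: fold one batch of (key, value) pairs into the state dict."""
    grouped = {}
    for key, value in batch:
        grouped.setdefault(key, []).append(value)
    new_state = {}
    # Every key seen so far or in this batch gets the update function applied.
    for key in set(state) | set(grouped):
        result = update(grouped.get(key, []), state.get(key))
        if result is not None:  # returning None drops the key from the state
            new_state[key] = result
    return new_state

state = {}
for batch in [[("hello", 1), ("world", 1)], [("hello", 1)]]:
    state = update_state_by_key(state, batch, update_func)
print(state)
```

Note that "world" keeps its count in the second batch even though it did not appear there, which is the difference from a per-batch aggregation.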

pyspark.streaming.kafka.OffsetRange(topic, partition, fromOffset, untilOffset)

offsetRanges = rdd.offsetRanges()
for o in offsetRanges:
    print(f'{o.topic},{o.partition},{o.fromOffset},{o.untilOffset}')

Each OffsetRange carries the Kafka topic, the partition, the start offset, and the end offset (exclusive).

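createDirectStream(ssc, topics, kafkaParams, fromOffsets={})

Returns a DStream.
When fromOffsets is empty, messages that were already in Kafka before the stream started are not processed.
To process a topic from the beginning, set the starting offset of every partition to 0 (this example assumes 12 partitions):

from_offsets = {}
for i in range(12):
    from_offsets[TopicAndPartition(self.topic_name, i)] = 0

To resume from where the last run stopped instead, save the offsets to a file or to Redis after each batch.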
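
As a rough illustration of the file-based option, the sketch below saves offsets as JSON and reads them back. The file layout and the names `save_offsets` and `load_offsets` are assumptions made for this example; a Redis version would follow the same save and load shape.

```python
import json
import os
import tempfile

def save_offsets(path, offsets):
    """Write {(topic, partition): until_offset} to a JSON file."""
    rows = [
        {"topic": topic, "partition": partition, "offset": offset}
        for (topic, partition), offset in offsets.items()
    ]
    # Write to a temporary file first, then rename, so a crash
    # never leaves a half-written offsets file behind.
    tmp_path = path + ".tmp"
    with open(tmp_path, "w") as f:
        json.dump(rows, f)
    os.replace(tmp_path, path)

def load_offsets(path):
    """Read offsets back; return an empty dict when nothing has been saved yet."""
    if not os.path.exists(path):
        return {}
    with open(path) as f:
        return {(r["topic"], r["partition"]): r["offset"] for r in json.load(f)}

path = os.path.join(tempfile.mkdtemp(), "offsets.json")
save_offsets(path, {("demo-topic", 0): 42, ("demo-topic", 1): 17})
restored = load_offsets(path)
print(restored)
```

On startup, each restored (topic, partition) pair could be turned into a TopicAndPartition key and passed as fromOffsets; an empty result means there is nothing to resume from.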

kafkaParams={"metadata.broker.list": self.brokers}

Transformations on DStreams

Similar to that of RDDs, transformations allow the data from the input DStream to be modified. DStreams support many of the transformations available on normal Spark RDD’s. Some of the common ones are as follows.

Transformation	Meaning

map(func)	Return a new DStream by passing each element of the source DStream through a function func.
flatMap(func)	Similar to map, but each input item can be mapped to 0 or more output items.
filter(func)	Return a new DStream by selecting only the records of the source DStream on which func returns true.
repartition(numPartitions)	Changes the level of parallelism in this DStream by creating more or fewer partitions.
union(otherStream)	Return a new DStream that contains the union of the elements in the source DStream and otherDStream.
count()	Return a new DStream of single-element RDDs by counting the number of elements in each RDD of the source DStream.
reduce(func)	Return a new DStream of single-element RDDs by aggregating the elements in each RDD of the source DStream using a function func (which takes two arguments and returns one). The function should be associative and commutative so that it can be computed in parallel.
countByValue()	When called on a DStream of elements of type K, return a new DStream of (K, Long) pairs where the value of each key is its frequency in each RDD of the source DStream.
reduceByKey(func, [numTasks])	When called on a DStream of (K, V) pairs, return a new DStream of (K, V) pairs where the values for each key are aggregated using the given reduce function. Note: By default, this uses Spark’s default number of parallel tasks (2 for local mode, and in cluster mode the number is determined by the config property spark.default.parallelism) to do the grouping. You can pass an optional numTasks argument to set a different number of tasks.
join(otherStream, [numTasks])	When called on two DStreams of (K, V) and (K, W) pairs, return a new DStream of (K, (V, W)) pairs with all pairs of elements for each key.
cogroup(otherStream, [numTasks])	When called on a DStream of (K, V) and (K, W) pairs, return a new DStream of (K, Seq[V], Seq[W]) tuples.
transform(func)	Return a new DStream by applying a RDD-to-RDD function to every RDD of the source DStream. This can be used to do arbitrary RDD operations on the DStream. Any RDD-to-RDD function can be applied, and the result is still a DStream.
updateStateByKey(func)	Return a new “state” DStream where the state for each key is updated by applying the given function on the previous state of the key and the new values for the key. This can be used to maintain arbitrary state data for each key. Notes: 1. Spark Streaming maintains one state per key; the state can be of any type, including a custom object, and the update function can likewise be user-defined. 2. The update function keeps each key's state up to date: for every new batch, Spark Streaming updates the state of all existing keys when updateStateByKey is used.
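
To contrast the stateless operations in this table with updateStateByKey, here is a plain-Python model of a flatMap, map, reduceByKey word count that works on each batch independently. `word_count_per_batch` is a hypothetical helper, not a Spark API.

```python
from collections import Counter

def word_count_per_batch(batches):
    """Model flatMap -> map -> reduceByKey: counts are computed within each batch only."""
    results = []
    for lines in batches:
        # flatMap: one line can yield zero or more words.
        words = [word for line in lines for word in line.split()]
        # Mapping to (word, 1) and reducing by key with addition collapses into a Counter.
        results.append(dict(Counter(words)))
    return results

counts = word_count_per_batch([["hello world", "hello"], [""], ["world"]])
print(counts)
```

Nothing carries over from one batch to the next, so the third batch reports "world" once regardless of what came before.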

- updateStateByKey(func)

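Notes

1. A checkpoint directory must be set when using updateStateByKey.
2. How often is the in-memory state written to disk?
By default, the checkpoint interval is a multiple of the batch interval that is at least 10 seconds. So if batchInterval is shorter than 10 seconds, state is written to disk roughly every 10 seconds (exactly 10 when batchInterval divides 10 evenly); if batchInterval is 10 seconds or longer, state is written once per batchInterval.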
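
The Spark Streaming programming guide describes the default checkpoint interval as a multiple of the batch interval that is at least 10 seconds. The sketch below models that rule in plain Python; `default_checkpoint_interval` is a hypothetical helper, and an interval set explicitly on the DStream would override this default.

```python
import math

def default_checkpoint_interval(batch_interval_seconds):
    """Smallest multiple of the batch interval that is at least 10 seconds."""
    multiples = math.ceil(10 / batch_interval_seconds)
    return batch_interval_seconds * multiples

for batch in (1, 3, 10, 15):
    print(batch, default_checkpoint_interval(batch))
```

A 3-second batch interval therefore checkpoints every 12 seconds under this rule, not every 10.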

DStream output operations

Output operations push a DStream's data to external systems or file systems. Transformations are lazy: they are only executed once an output operation is invoked.

Output operation	Meaning
pprint()	Prints the first ten elements of every batch of data in a DStream on the driver node running the streaming application. This is useful for development and debugging.
saveAsTextFiles(prefix, [suffix])	Save this DStream’s contents as text files. The file name at each batch interval is generated based on prefix and suffix: “prefix-TIME_IN_MS[.suffix]”.
foreachRDD(func)	The most generic output operator that applies a function, func, to each RDD generated from the stream. This function should push the data in each RDD to an external system, such as saving the RDD to files, or writing it over the network to a database. Note that the function func is executed in the driver process running the streaming application, and will usually have RDD actions in it that will force the computation of the streaming RDDs. This is the most commonly used output operation.

foreachRDD

The function passed to foreachRDD can take two arguments: (time, rdd).

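Demo: cumulative word count over Kafka

    # Requires Spark 2.x with the Kafka 0.8 integration (pyspark.streaming.kafka was removed in Spark 3.0).
    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext
    from pyspark.streaming.kafka import KafkaUtils, TopicAndPartition

    # Placeholders, not from the original post: set your own broker list and topic.
    host = "localhost:9092"
    topic = "test"

    # Helpers assumed here; the original post uses them without showing them.
    offset_ranges = []

    def storeOffsetRanges(rdd):
        # Remember this batch's offset ranges; transform() needs the RDD back.
        global offset_ranges
        offset_ranges = rdd.offsetRanges()
        return rdd

    def printOffsetRanges(rdd):
        for o in offset_ranges:
            print(f"{o.topic},{o.partition},{o.fromOffset},{o.untilOffset}")

    sc = SparkContext('local[2]', appName="PythonStreamingKafkaWordCount")
    ssc = StreamingContext(sc, 1)  # 1-second batch interval
    ssc.checkpoint("checkpoint")  # required by updateStateByKey

    kafkaParams = {"metadata.broker.list": host}

    # Manually start consuming from offset 0 on each partition (assumes 3 partitions).
    from_offsets = {TopicAndPartition(topic, i): 0 for i in range(3)}

    kafkastreams = KafkaUtils.createDirectStream(ssc, [topic], kafkaParams=kafkaParams, fromOffsets=from_offsets)
    lines = kafkastreams.map(lambda x: x[1])  # keep the message value, drop the key
    initialStateRDD = sc.parallelize([(u'hello', 1), (u'world', 1)])

    def updateFunc(new_values, last_sum):
        # Add this batch's counts to the running total for the key.
        return sum(new_values) + (last_sum or 0)

    running_counts = lines.flatMap(lambda line: line.split(" ")) \
        .map(lambda word: (word, 1)) \
        .updateStateByKey(updateFunc, initialRDD=initialStateRDD)

    running_counts.pprint()
    kafkastreams.transform(storeOffsetRanges).foreachRDD(printOffsetRanges)
    ssc.start()
    ssc.awaitTermination()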