spark-kafka

spark-kafka is a library for batch loading data from Kafka into Spark and for writing data from Spark back into Kafka. It does not provide a Kafka input DStream for Spark Streaming, because spark-streaming-kafka already covers that and ships with Spark.

SimpleConsumerConfig

This is the configuration KafkaRDD needs to consume data from Kafka, including metadata.broker.list and a few other settings.
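
A minimal sketch of building this config is shown below; the import path and the exact factory method for SimpleConsumerConfig are assumptions here and may differ in your version of the library:

```scala
import java.util.Properties
// import path is an assumption; adjust to where SimpleConsumerConfig lives in your build
import com.tresata.spark.kafka.SimpleConsumerConfig

val props = new Properties()
// metadata.broker.list is the one setting the text calls out as required
props.put("metadata.broker.list", "broker1:9092,broker2:9092")

val consumerConfig = SimpleConsumerConfig(props)
```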

KafkaRDD

This is an RDD that extracts data from Kafka. You need to provide a SparkContext, a Kafka topic, offset ranges per Kafka partition, and a SimpleConsumerConfig. Instead of offsets you can also provide a time, which is used to compute the offsets at construction time.
KafkaRDD creates one Spark partition per Kafka partition.
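
A minimal sketch of constructing a KafkaRDD from a time instead of explicit offset ranges; the exact apply signature is an assumption and may differ by version:

```scala
import org.apache.spark.SparkContext
import kafka.api.OffsetRequest
// import path is an assumption
import com.tresata.spark.kafka.KafkaRDD

val sc = new SparkContext("local[2]", "kafka-batch-load")

// Offsets are resolved once, at construction time, from the given time;
// here we start from the earliest offsets Kafka still retains for the topic.
// consumerConfig is the SimpleConsumerConfig built above.
val rdd = KafkaRDD(sc, "my-topic", OffsetRequest.EarliestTime, consumerConfig)
```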

KafkaRDD is an RDD[PartitionOffsetMessage], where PartitionOffsetMessage is a case class that contains the Kafka partition, offset and the message (which contains key and value/payload). The Kafka partition and offset being part of the RDD makes it easy to calculate the last offset read for each Kafka partition, which can then be used to derive the offsets to start reading for the next batch load.
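
For instance, computing the next batch's start offsets from what was just read might look like the sketch below; the partition and offset field names come from the description above and are otherwise an assumption:

```scala
import org.apache.spark.SparkContext._

// Highest offset read per Kafka partition, plus one, gives the start offset
// for the next batch load.
val nextStartOffsets: Map[Int, Long] = rdd
  .map(m => (m.partition, m.offset))
  .reduceByKey(_ max _)
  .mapValues(_ + 1L)
  .collect()
  .toMap
```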

Kafka is a dynamic system in which old messages are deleted and new messages are added. A KafkaRDD, on the other hand, has a fixed offset range per Kafka partition, set at construction time. This means that messages added to Kafka after the KafkaRDD was created will not be visible to it. It also means that messages deleted from Kafka that fall within the offset range can lead to errors inside the KafkaRDD.
For example, if you define a KafkaRDD with a start time of OffsetRequest.EarliestTime and access the RDD many hours later, you may see an OffsetOutOfRangeException because Kafka has already cleaned up the data you are trying to access.
The KafkaRDD companion object includes writeWithKeysToKafka and writeToKafka methods, which can be used to write an RDD to Kafka; you need to provide a Kafka topic and a ProducerConfig.
writeToKafka can also be used in Spark Streaming to save the RDDs underlying a DStream to Kafka (using the DStream's foreachRDD method), as sketched below.
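
A sketch of saving a streaming job's output back to Kafka via foreachRDD; the exact writeToKafka parameter order and the producer settings shown are assumptions:

```scala
import java.util.Properties
import kafka.producer.ProducerConfig
import org.apache.spark.streaming.dstream.DStream
// import path is an assumption
import com.tresata.spark.kafka.KafkaRDD

val producerProps = new Properties()
producerProps.put("metadata.broker.list", "broker1:9092,broker2:9092")
producerProps.put("serializer.class", "kafka.serializer.StringEncoder")
val producerConfig = new ProducerConfig(producerProps)

// stream is a DStream[String] built elsewhere in the streaming application
def save(stream: DStream[String]): Unit =
  stream.foreachRDD { rdd =>
    // write the RDD underlying each micro-batch to the output topic
    KafkaRDD.writeToKafka(rdd, "output-topic", producerConfig)
  }
```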
