spark-kafka

spark-kafka is a library for batch loading data from Kafka into Spark and for writing data from Spark back into Kafka. It does not provide a Kafka input DStream for Spark Streaming, because spark-streaming-kafka already covers that and ships with Spark.

SimpleConsumerConfig

This is the configuration KafkaRDD needs to consume data from Kafka, including metadata.broker.list and a few other settings.
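
A minimal sketch of building this config is shown below; the import path and the exact factory method for SimpleConsumerConfig are assumptions here and may differ in your version of the library:

```scala
import java.util.Properties
// import path is an assumption; adjust to where SimpleConsumerConfig lives in your build
import com.tresata.spark.kafka.SimpleConsumerConfig

val props = new Properties()
// metadata.broker.list is the one setting the text calls out as required
props.put("metadata.broker.list", "broker1:9092,broker2:9092")

val consumerConfig = SimpleConsumerConfig(props)
```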

KafkaRDD

This is an RDD that extracts data from Kafka. You need to provide a SparkContext, a Kafka topic, offset ranges per Kafka partition, and a SimpleConsumerConfig. Instead of offsets you can also provide a time, which is used to compute the offsets at construction time.
KafkaRDD creates one Spark partition per Kafka partition.
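
A minimal sketch of constructing a KafkaRDD from a time instead of explicit offset ranges; the exact apply signature is an assumption and may differ by version:

```scala
import org.apache.spark.SparkContext
import kafka.api.OffsetRequest
// import path is an assumption
import com.tresata.spark.kafka.KafkaRDD

val sc = new SparkContext("local[2]", "kafka-batch-load")

// Offsets are resolved once, at construction time, from the given time;
// here we start from the earliest offsets Kafka still retains for the topic.
// consumerConfig is the SimpleConsumerConfig built above.
val rdd = KafkaRDD(sc, "my-topic", OffsetRequest.EarliestTime, consumerConfig)
```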

KafkaRDD is an RDD[PartitionOffsetMessage], where PartitionOffsetMessage is a case class that contains the Kafka partition, offset and the message (which contains key and value/payload). The Kafka partition and offset being part of the RDD makes it easy to calculate the last offset read for each Kafka partition, which can then be used to derive the offsets to start reading for the next batch load.
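
For instance, computing the next batch's start offsets from what was just read might look like the sketch below; the partition and offset field names come from the description above and are otherwise an assumption:

```scala
import org.apache.spark.SparkContext._

// Highest offset read per Kafka partition, plus one, gives the start offset
// for the next batch load.
val nextStartOffsets: Map[Int, Long] = rdd
  .map(m => (m.partition, m.offset))
  .reduceByKey(_ max _)
  .mapValues(_ + 1L)
  .collect()
  .toMap
```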

Kafka is a dynamic system in which old messages are deleted and new messages are added. A KafkaRDD, on the other hand, has a fixed offset range per Kafka partition, set at construction time. This means that messages added to Kafka after the KafkaRDD was created will not be visible to it. It also means that messages deleted from Kafka that fall within the offset range can lead to errors inside the KafkaRDD.
For example, if you define a KafkaRDD with a start time of OffsetRequest.EarliestTime and access the RDD many hours later, you may see an OffsetOutOfRangeException because Kafka has already cleaned up the data you are trying to access.
The KafkaRDD companion object includes writeWithKeysToKafka and writeToKafka methods, which can be used to write an RDD to Kafka; you need to provide a Kafka topic and a ProducerConfig.
writeToKafka can also be used in Spark Streaming to save the RDDs underlying a DStream to Kafka (using the DStream's foreachRDD method), as sketched below.
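
A sketch of saving a streaming job's output back to Kafka via foreachRDD; the exact writeToKafka parameter order and the producer settings shown are assumptions:

```scala
import java.util.Properties
import kafka.producer.ProducerConfig
import org.apache.spark.streaming.dstream.DStream
// import path is an assumption
import com.tresata.spark.kafka.KafkaRDD

val producerProps = new Properties()
producerProps.put("metadata.broker.list", "broker1:9092,broker2:9092")
producerProps.put("serializer.class", "kafka.serializer.StringEncoder")
val producerConfig = new ProducerConfig(producerProps)

// stream is a DStream[String] built elsewhere in the streaming application
def save(stream: DStream[String]): Unit =
  stream.foreachRDD { rdd =>
    // write the RDD underlying each micro-batch to the output topic
    KafkaRDD.writeToKafka(rdd, "output-topic", producerConfig)
  }
```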
