Spark Streaming Kafka offset commit


街北槐花


Article link: https://blog.csdn.net/pengchengqing/article/details/79035234


Spark Streaming exposes the Kafka offsets of each batch through the following cast:

  rdd.asInstanceOf[HasOffsetRanges].offsetRanges

It is used like this:

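// Imports assume the spark-streaming-kafka-0-10 integration.
import org.apache.spark.TaskContext
import org.apache.spark.streaming.kafka010.{HasOffsetRanges, OffsetRange}

stream.foreachRDD { rdd =>
  // Read the offsets in the first method called on the stream.
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  rdd.foreachPartition { iter =>
    // Partition i of the RDD corresponds to offsetRanges(i).
    val o: OffsetRange = offsetRanges(TaskContext.get.partitionId)
    println(s"${o.topic} ${o.partition} ${o.fromOffset} ${o.untilOffset}")
  }
}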

Note that the typecast to HasOffsetRanges will only succeed if it is done in the first method called on the result of createDirectStream, not later down a chain of methods. Be aware that the one-to-one mapping between RDD partition and Kafka partition does not remain after any methods that shuffle or repartition, e.g. reduceByKey() or window().

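In other words, the offsets must be read inside the first method called on the result of createDirectStream; the cast fails in any later method of the chain. Also, after any operation that shuffles or repartitions, such as reduceByKey or window, RDD partitions no longer map one-to-one to Kafka partitions, so offsetRanges can no longer be indexed by partition ID.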
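When a pipeline does need a shuffle, one common pattern is to capture the offsets in a transform step at the very start of the chain and read them back on the driver later. The following is only a minimal sketch, assuming the spark-streaming-kafka-0-10 integration and String keys and values; the object name, method name, and the word-count logic are illustrative and not from the original post.

```scala
import org.apache.kafka.clients.consumer.ConsumerRecord
import org.apache.spark.streaming.dstream.InputDStream
import org.apache.spark.streaming.kafka010.{HasOffsetRanges, OffsetRange}

object OffsetsBeforeShuffle {
  // `stream` is assumed to be the direct result of KafkaUtils.createDirectStream.
  def countValues(stream: InputDStream[ConsumerRecord[String, String]]): Unit = {
    // Driver-side holder, overwritten once per batch.
    var offsetRanges = Array.empty[OffsetRange]

    stream
      .transform { rdd =>
        // First method called on the stream, so the cast still succeeds here.
        offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
        rdd
      }
      .map(record => (record.value(), 1L))
      .reduceByKey(_ + _) // shuffle: partitions no longer match Kafka partitions
      .foreachRDD { rdd =>
        rdd.foreach { case (value, count) => println(s"$value $count") }
        // This part of the body runs on the driver, where the captured ranges are visible.
        offsetRanges.foreach { o =>
          println(s"${o.topic} ${o.partition} ${o.fromOffset} ${o.untilOffset}")
        }
      }
  }
}
```

After the shuffle the ranges can only be used as a whole for the batch, not looked up per partition with TaskContext.get.partitionId.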

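This leads to at-most-once semantics if you commit as soon as the data is received: when processing fails afterwards, the records whose offsets were already committed are lost. To guarantee at-least-once, the offsets must be committed only after your own processing logic has finished. So you have to save the offsets of all partitions before the processing logic starts, and commit them once processing is complete.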
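One way to express this ordering with Spark's own API is sketched below. It assumes the spark-streaming-kafka-0-10 integration, whose input stream can be cast to CanCommitOffsets, and a consumer configured with enable.auto.commit set to false. The object name, the run method, and the process parameter are hypothetical stand-ins for the application's own code.

```scala
import org.apache.kafka.clients.consumer.ConsumerRecord
import org.apache.spark.streaming.dstream.InputDStream
import org.apache.spark.streaming.kafka010.{CanCommitOffsets, HasOffsetRanges}

object CommitAfterProcessing {
  // `stream` is assumed to be the direct result of KafkaUtils.createDirectStream;
  // `process` is a hypothetical placeholder for the per-partition business logic.
  def run(stream: InputDStream[ConsumerRecord[String, String]])
         (process: Iterator[ConsumerRecord[String, String]] => Unit): Unit = {
    stream.foreachRDD { rdd =>
      // 1. Capture the offsets first, before anything else touches the RDD.
      val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges

      // 2. Run the processing logic. If it throws, the batch fails here
      //    and the commit below is never reached.
      rdd.foreachPartition(process)

      // 3. Commit only after processing has finished.
      stream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
    }
  }
}
```

Because the commit happens after the output, a failure between steps 2 and 3 causes the batch to be processed again on restart, so the processing logic should tolerate duplicates.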

Here is one way to implement this:

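Store the offsets of the current partitions in external storage, such as Redis or HBase. When processing completes, look them up and commit them; if processing is interrupted, do not commit.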
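The control flow of this approach can be sketched without Spark or a real database. Everything in the block below is hypothetical: PartitionOffsets stands in for Spark's OffsetRange, OffsetStore for a thin wrapper around a Redis or HBase client, and InMemoryOffsetStore exists only so the sketch runs on its own.

```scala
import scala.collection.mutable
import scala.util.Try

// Hypothetical stand-in for Spark's OffsetRange, so the sketch runs without Spark.
final case class PartitionOffsets(topic: String, partition: Int,
                                  fromOffset: Long, untilOffset: Long)

// Hypothetical store interface; a real one would wrap a Redis or HBase client.
trait OffsetStore {
  def savePending(batchId: Long, ranges: Seq[PartitionOffsets]): Unit
  def takePending(batchId: Long): Seq[PartitionOffsets]
}

// In-memory implementation, for illustration only.
final class InMemoryOffsetStore extends OffsetStore {
  private val pending = mutable.Map.empty[Long, Seq[PartitionOffsets]]

  def savePending(batchId: Long, ranges: Seq[PartitionOffsets]): Unit =
    pending.update(batchId, ranges)

  // Removes and returns the saved ranges, or an empty Seq if none were saved.
  def takePending(batchId: Long): Seq[PartitionOffsets] =
    pending.remove(batchId).getOrElse(Seq.empty)
}

object StoreThenCommit {
  // Saves the ranges, runs `process`, and calls `commit` only if `process`
  // finished without throwing. Returns whether the batch was committed.
  def runBatch(batchId: Long, ranges: Seq[PartitionOffsets], store: OffsetStore)
              (process: () => Unit)
              (commit: Seq[PartitionOffsets] => Unit): Boolean = {
    store.savePending(batchId, ranges)          // before processing starts
    val succeeded = Try(process()).isSuccess    // the application's own logic
    if (succeeded) commit(store.takePending(batchId))
    succeeded                                   // on failure nothing is committed
  }
}
```

In a real job, process would be the body of foreachRDD and commit would write the offsets back to Kafka or mark them as done in the store; after an interrupted batch the saved ranges remain in the store and can be used to decide where to resume.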
