Structured Streaming Notes

weixin_30644369

Original link: http://www.cnblogs.com/PigeonNoir/p/10630975.html

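Built on micro-batch processing; since Spark 2.3 it also supports continuous processing.
Built on Spark SQL.
A streaming computation is expressed the same way as a standard batch query on a static table; Spark executes it as an incremental query over an unbounded input table.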
unbounded input table
- Each incoming record appears as a new row appended to the table.
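The idea of an incremental query over an unbounded table can be sketched in plain Python. This is a toy model, not Spark's API: the class `UnboundedTable` and its methods are hypothetical names, and a word count stands in for the query. It shows that updating the result from only the newly arrived rows gives the same answer as rerunning the batch query over the whole table.

```python
from collections import Counter


class UnboundedTable:
    """Toy model: an append-only input table with an incrementally kept word count."""

    def __init__(self):
        self.rows = []           # the input table: one row per record
        self.counts = Counter()  # the query result, maintained incrementally

    def append_batch(self, batch):
        # Every arriving record becomes a new row of the input table.
        self.rows.extend(batch)
        # Incremental step: only the new rows are processed.
        self.counts.update(batch)

    def batch_query(self):
        # The same query run as an ordinary batch job over the whole table.
        return Counter(self.rows)


table = UnboundedTable()
table.append_batch(["cat", "dog"])
table.append_batch(["dog", "owl"])

# The incremental result matches the full batch query.
assert table.counts == table.batch_query()
print(dict(table.counts))  # {'cat': 1, 'dog': 2, 'owl': 1}
```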
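result table
- Updated each time a new batch of input is processed. Three output modes:
- complete mode: the entire result table is written out on every update
- append mode: only new rows are appended to the result table; that is, the result for a new batch of input neither depends on nor changes rows that were already output.
- update mode: only the rows changed by the new batch of input are written out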
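To make the three modes concrete, here is a small sketch in plain Python of what would be written to the sink after one batch. The function `emit` and the dictionaries are hypothetical, chosen only for illustration; the result table is modeled as a dict from key to value. Append mode is modeled as an error when an existing row would change, which reflects the constraint that appended rows must be final.

```python
def emit(mode, old_result, new_result):
    """Return the rows written to the sink for one trigger under the given mode."""
    if mode == "complete":
        # The whole result table is written out again.
        return dict(new_result)
    if mode == "update":
        # Only rows that are new or whose value changed.
        return {k: v for k, v in new_result.items() if old_result.get(k) != v}
    if mode == "append":
        # Rows already output must never change.
        changed = [k for k in old_result if new_result.get(k) != old_result[k]]
        if changed:
            raise ValueError(f"append mode cannot revise existing rows: {changed}")
        return {k: v for k, v in new_result.items() if k not in old_result}
    raise ValueError(f"unknown mode: {mode}")


# A running word count: the new batch ["dog", "owl"] changes "dog" and adds "owl".
old_counts = {"cat": 1, "dog": 1}
new_counts = {"cat": 1, "dog": 2, "owl": 1}
print(emit("complete", old_counts, new_counts))  # {'cat': 1, 'dog': 2, 'owl': 1}
print(emit("update", old_counts, new_counts))    # {'dog': 2, 'owl': 1}

# A plain filter/projection keyed by row id: old rows stay as they are.
old_rows = {0: "cat", 1: "dog"}
new_rows = {0: "cat", 1: "dog", 2: "owl"}
print(emit("append", old_rows, new_rows))        # {2: 'owl'}
```

Calling `emit("append", old_counts, new_counts)` raises `ValueError`, because the count for "dog" was already output and would have to be revised.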
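event time
- The time at which the event occurred at the source, carried in the data itself, as opposed to the time at which Spark receives the data.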
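The difference matters as soon as a record arrives late. The following plain-Python sketch uses made-up `(event_time, arrival_time)` pairs in seconds and a hypothetical 10-second window; it groups the same records once by event time and once by arrival time.

```python
from collections import Counter

WINDOW = 10  # window length in seconds (an arbitrary choice for this sketch)


def window_start(ts):
    """Start of the fixed window that contains timestamp ts."""
    return ts - ts % WINDOW


# (event_time, arrival_time); the last record happened at t=8 but arrived at t=21.
records = [(1, 2), (4, 5), (12, 13), (8, 21)]

by_event_time = Counter(window_start(event) for event, _ in records)
by_arrival_time = Counter(window_start(arrival) for _, arrival in records)

print(dict(by_event_time))    # {0: 3, 10: 1}
print(dict(by_arrival_time))  # {0: 2, 10: 1, 20: 1}
```

Grouping by event time puts the late record into the window where it actually happened; grouping by arrival time would count it in a later window.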
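fault tolerance semantics
- end-to-end exactly-once
  - failures are detected and processing is retried
  - based on checkpointing and WAL (write-ahead logs), so processing resumes from where it stopped
- as opposed to:
  - at-most-once
    - each record is written at most once, so data may be lost. A weak guarantee.
  - at-least-once
    - each record is written at least once, so nothing is lost but duplicates are possible. A stronger guarantee.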
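How a write-ahead log plus retry can add up to exactly-once output is easier to see in a toy model. The sketch below is plain Python and does not mirror Spark's internals: `ToyEngine`, `IdempotentSink`, `offset_log` and `commit_log` are hypothetical names, and it assumes a replayable source (a list indexed by offset) and a sink that overwrites by batch id. The offset range of a batch is logged before processing, so after a crash the same range is replayed and the sink ends up with each record once.

```python
class IdempotentSink:
    """Sink keyed by batch id: rewriting a batch overwrites instead of duplicating."""

    def __init__(self):
        self.batches = {}
        self.write_calls = 0

    def write(self, batch_id, rows):
        self.write_calls += 1
        self.batches[batch_id] = list(rows)

    def all_rows(self):
        return [row for batch_id in sorted(self.batches) for row in self.batches[batch_id]]


class ToyEngine:
    def __init__(self, source, sink, batch_size=2):
        self.source = source     # replayable source: a list addressed by offset
        self.sink = sink
        self.batch_size = batch_size
        self.offset_log = {}     # WAL: batch_id -> (start, end), written BEFORE processing
        self.commit_log = set()  # batch ids whose output is known to be complete

    def run_batch(self, crash_before_commit=False):
        uncommitted = [b for b in self.offset_log if b not in self.commit_log]
        if uncommitted:
            # Recovery: replay exactly the offsets that were logged for this batch.
            batch_id = uncommitted[0]
            start, end = self.offset_log[batch_id]
        else:
            batch_id = len(self.offset_log)
            start = self.offset_log[batch_id - 1][1] if batch_id > 0 else 0
            end = min(start + self.batch_size, len(self.source))
            self.offset_log[batch_id] = (start, end)
        self.sink.write(batch_id, self.source[start:end])
        if crash_before_commit:
            raise RuntimeError("simulated crash after write, before commit")
        self.commit_log.add(batch_id)


source = ["a", "b", "c", "d"]
sink = IdempotentSink()
engine = ToyEngine(source, sink)

engine.run_batch()                              # batch 0 succeeds
try:
    engine.run_batch(crash_before_commit=True)  # batch 1 is written, then "crashes"
except RuntimeError:
    pass
engine.run_batch()                              # restart: batch 1 is replayed

print(sink.all_rows())   # ['a', 'b', 'c', 'd']
print(sink.write_calls)  # 3
```

Batch 1 was delivered twice (three write calls for two batches), which is at-least-once delivery; the output is exactly-once only because the sink is idempotent. With a sink that simply appended, "c" and "d" would appear twice.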
The API is based on Dataset and DataFrame.
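As a rough illustration of that API, here is a minimal PySpark sketch that ties the notes together: an unbounded input DataFrame, an event-time window, update output mode, and a checkpoint location. The function name `start_windowed_counts`, the window and watermark lengths, and the use of the built-in `rate` test source are choices made for this example, not something the notes prescribe.

```python
def start_windowed_counts(spark, checkpoint_dir):
    """Start a streaming query that counts rows per 10-second event-time window.

    spark: a pyspark.sql.SparkSession.
    checkpoint_dir: a writable path used for checkpoints (assumed to exist or be creatable).
    """
    # Imported lazily so this module can be loaded where pyspark is not installed.
    from pyspark.sql import functions as F

    # The built-in "rate" test source emits rows with `timestamp` and `value` columns.
    events = (
        spark.readStream
        .format("rate")
        .option("rowsPerSecond", 5)
        .load()
    )

    # Same DataFrame operations as a batch query; `timestamp` plays the role of event time.
    counts = (
        events
        .withWatermark("timestamp", "30 seconds")
        .groupBy(F.window(F.col("timestamp"), "10 seconds"))
        .count()
    )

    # Update mode: only windows whose count changed are written on each trigger.
    return (
        counts.writeStream
        .outputMode("update")
        .format("console")
        .option("checkpointLocation", checkpoint_dir)
        .start()
    )
```

To try it, create a session with `SparkSession.builder.getOrCreate()`, call `start_windowed_counts(spark, "/tmp/ckpt")` (the path is only an example), and block on the returned query with `awaitTermination()`.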