Motivation:
Many big-data applications need to process large data streams in near-real time.
- Site activity statistics
- Spam detection
- Cluster monitoring
Challenges:
- Stream processing systems must recover from failures and stragglers quickly and efficiently
- Traditional streaming systems don’t achieve these properties simultaneously
Introduction: Existing streaming systems
- Based on a continuous operator processing model
- Two approaches to recovery
1. Replication
2. Upstream backup

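Discretized Streams (D-Streams) offer a solution to the challenge of processing large-scale data streams in real time. The approach structures a streaming computation as a series of stateless, deterministic batch computations over small time intervals, which enables fast fault recovery, low latency, and high scalability. By building on Resilient Distributed Datasets (RDDs), D-Streams avoid replicating data: they track the lineage graph of operations so that lost data can be recomputed, and they periodically checkpoint state. Parallel recovery and speculative execution handle node failures and slow tasks (stragglers), keeping the system efficient on large commodity clusters. Spark Streaming, as an implementation, uses D-Streams to divide the input stream into batches held in memory and generates Spark jobs to process those batches, thereby executing the streaming application.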
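The D-Stream idea can be illustrated with a minimal sketch in plain Python. This is not Spark Streaming's API: the names `discretize`, `count_batch`, and `running_counts` are hypothetical, the "cluster" is a single process, and a `Counter` stands in for an RDD. It only shows the three ingredients described above: cutting the stream into small time intervals, running a stateless deterministic computation on each batch, and rebuilding lost state by replaying batches from the last checkpoint.

```python
from collections import Counter
from typing import Dict, Iterable, List, Optional, Tuple

Event = Tuple[float, str]  # (timestamp in seconds, key); an assumed event shape


def discretize(events: Iterable[Event], interval: float) -> Dict[int, List[str]]:
    """Split a stream into batches keyed by time-interval index."""
    batches: Dict[int, List[str]] = {}
    for ts, key in events:
        batches.setdefault(int(ts // interval), []).append(key)
    return batches


def count_batch(batch: List[str]) -> Counter:
    """Stateless, deterministic batch step: the same input always gives the same counts."""
    return Counter(batch)


def running_counts(
    batches: Dict[int, List[str]],
    upto: int,
    checkpoint: Optional[Tuple[int, Counter]] = None,
) -> Counter:
    """Rebuild the running state at interval `upto`.

    Starts from a checkpoint (interval index, saved state) if one is given,
    otherwise from the beginning, and replays every later batch. Because each
    batch step is deterministic, the replay reproduces the lost state exactly.
    """
    start, saved = checkpoint if checkpoint is not None else (-1, Counter())
    state = Counter(saved)  # copy, so the checkpoint itself is never mutated
    for i in range(start + 1, upto + 1):
        state += count_batch(batches.get(i, []))
    return state


events = [(0.2, "a"), (0.7, "b"), (1.1, "a"), (2.5, "a"), (2.9, "c")]
batches = discretize(events, interval=1.0)

full = running_counts(batches, upto=2)                      # recompute from scratch
ckpt = (1, running_counts(batches, upto=1))                 # state saved after interval 1
recovered = running_counts(batches, upto=2, checkpoint=ckpt)  # replay only interval 2
assert full == recovered
```

The checkpoint bounds how much has to be replayed after a failure. In a real system, the replay of different intervals and partitions could also be spread across many nodes, which is what parallel recovery refers to.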