DAG vs. MPP vs. MR

 

This article is mainly a summary of material from Zhihu and Zhang Baofeng's blog.

 

1. DAG vs. MPP

Native Design

In MPP, every segment is highly symmetric. In MPP in the narrow sense, each segment manages and backs up its own storage, so any query touching a given piece of data must land on the segment that owns it, which creates concurrency and straggler problems.

MPP naturally comes with an excellent compiler and optimizer, and its local runtime is a full database, so parsing, optimization, codegen, and execution happen in one seamless flow. Within a segment there is good second-level resource management and task scheduling, fine-grained enough and query-aware (query isolation, memory usage monitoring, and so on).

DAG engines naturally share storage, and the master sees the global metadata; this is what allows a single point to schedule task sets well and to coordinate upstream/downstream data shuffles between executors, task start/stop, and similar processes. Each DAG task is by design simple and idempotent, which enables task speculation and even dynamically replacing a node or updating its degree of parallelism.

DAG engines can easily do I/O against data on different storage media. Today this mostly happens at the input and output nodes, but in theory every compute node could mount a different storage or execution engine, as long as the metadata is shared.

 

Task Schedule

MPP cuts the plan vertically and builds tasks in one straight shot; each segment receives a fairly complete sub-query.

DAG cuts horizontally; node merging (including Spark's narrow dependencies and stages) is an optimization, and in principle tasks from different nodes are spread across different compute processes. Under the best conditions, e.g. Spark 2.0's whole-stage codegen, SQL can in theory be optimized to an MPP-like extreme.
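As a rough, hedged illustration of that horizontal cutting, here is a minimal Spark sketch (local mode; data and object names are invented for illustration): map and filter are narrow dependencies and fuse into one stage, while the shuffle behind reduceByKey cuts the lineage into a second stage.

```scala
import org.apache.spark.sql.SparkSession

object StageDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("stage-demo").master("local[*]").getOrCreate()
    val sc = spark.sparkContext

    // map and filter are narrow dependencies: they fuse into a single stage,
    // i.e. one task per partition runs the whole chain of operators.
    val pairs = sc.parallelize(Seq("a", "b", "a", "c"), 4)
      .map(word => (word, 1))
      .filter { case (word, _) => word != "c" }

    // reduceByKey requires a shuffle (wide dependency), so the DAG is cut into a second stage.
    val counts = pairs.reduceByKey(_ + _)

    // toDebugString prints the lineage; the indentation marks the stage boundary at the shuffle.
    println(counts.toDebugString)
    counts.collect().foreach(println)

    spark.stop()
  }
}
```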

 

OLAP Speed

MPP goes from a global compiler and optimizer down to a local database for execution; in theory this is the upper bound on the speed a DAG engine can reach.

A DAG execution node is usually a merged or codegen'd function, and user libraries are loaded when the task starts; of course, for flexibility, a database could also be plugged in.
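To make the "codegen'd function" concrete, here is a small, self-contained Spark SQL sketch (Spark 2.x or later assumed; the expressions and sizes are arbitrary): explain marks the operators that whole-stage codegen has fused into one generated function.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("codegen-demo").master("local[*]").getOrCreate()

// Operators fused by whole-stage codegen into one generated function are marked
// in the physical plan (with '*' in Spark 2.x, '*(n)' in later versions).
val df = spark.range(0L, 1000000L)
  .selectExpr("id", "id * 2 AS doubled")
  .filter("doubled % 3 = 0")

df.explain(true) // prints the parsed, analyzed, optimized, and physical plans

spark.stop()
```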

 

Shared Storage

The precondition for DAG computation is that storage is shared. Data reuse across jobs also comes naturally.
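A minimal sketch of that cross-job reuse in Spark, assuming a SparkContext is already available (the input path is hypothetical): cache the shared RDD once and let several actions, i.e. jobs, read it.

```scala
import org.apache.spark.SparkContext

val sc = SparkContext.getOrCreate()

// Hypothetical input path, only for illustration.
val cleaned = sc.textFile("hdfs:///logs/day=2017-01-01")
  .filter(_.nonEmpty)
  .cache() // keep the partitions in executor memory once the first job materializes them

val total  = cleaned.count()                             // job 1: scans the source, fills the cache
val errors = cleaned.filter(_.contains("ERROR")).count() // job 2: reuses the cached partitions
```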

In MPP in the narrow sense, storage is handled by each segment's own database. MPP in the broad sense, e.g. HAWQ, solves this through a DFS, which at the same time solves the concurrency problem caused by only one particular segment being able to execute a query, and the straggler problem as well.

 

Core Aspects

The core of MPP is the optimizer and local execution, backed by the accumulated experience of databases.

For DAG, I do not think you can point to a single core in the same way, because nothing in it is as finely honed as MPP's core. I can only say its advantages are flexibility and ease of use. From that angle, even whether MPP could be implemented on top of a DAG is, in my view, open for discussion: the DAG API itself is easy to use (usually Dataflow style), it cleanly layers under different DSLs and programming models, local execution can in theory mount different execution engines, data can come from different storage engines, and data can be reused across jobs.

 

Hybrid

Of course, the systems we see today are no longer DAG or MPP in the narrow sense. HAWQ sitting on top of a DFS and the optimization work in Spark SQL are both already hybrid implementations somewhere between DAG and MPP.

Beyond the points above, things like the master-slave architecture, how metadata is managed, how tasks are scheduled and dispatched, how query resources are monitored and isolated, whether data shuffles are pipelined or blocking, push or pull, and so on, do not change the fundamental design of these systems. Most of them are problems every distributed system has to face; the solutions simply differ in emphasis and in the common implementations chosen.

 

2. Pipeline vs. Streaming

The benefit of pipelining is that each batch of data (one record, several, or all of them) gets low latency, batches can run concurrently and out of order, and intermediate data does not easily balloon. After reading some Microsoft papers and systems, and especially after the DataFlow paper appeared, I came to see the essence of streaming. Streaming data carries intrinsic attributes: the time it was produced and the time the system processes it (event time vs. processing time). A truly complete streaming model can define clearly, at the programming-model level, when or under what conditions computation is triggered, whether and how late data is handled and into which window it should fall, and how already-emitted results are compensated. The most advanced thinking here is Google's DataFlow: it gives no implementation details, but it lays out the model very clearly and genuinely unifies streaming and batch computation. Microsoft's Trill recognized this even earlier, only at the paper level it is not stated as clearly as in DataFlow (in my humble opinion).
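As one concrete, hedged reading of those knobs, here is a small Flink DataStream sketch (Flink 1.x Scala API assumed; the elements and durations are arbitrary): the watermark declares how far event time is assumed to have advanced, the window assigner decides which interval a record belongs to, and allowedLateness lets late records re-fire the window and correct earlier output.

```scala
import org.apache.flink.streaming.api.TimeCharacteristic
import org.apache.flink.streaming.api.functions.timestamps.BoundedOutOfOrdernessTimestampExtractor
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows
import org.apache.flink.streaming.api.windowing.time.Time

object EventTimeWindowDemo {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)

    // (key, eventTimeMillis, value): event time is carried by the record itself.
    val events: DataStream[(String, Long, Int)] =
      env.fromElements(("a", 1000L, 1), ("a", 2000L, 1), ("a", 9000L, 1), ("a", 1500L, 1))

    events
      // Watermarks declare "event time has advanced to t - 5s"; they trigger window evaluation.
      .assignTimestampsAndWatermarks(
        new BoundedOutOfOrdernessTimestampExtractor[(String, Long, Int)](Time.seconds(5)) {
          override def extractTimestamp(e: (String, Long, Int)): Long = e._2
        })
      .keyBy(_._1)
      // Windows assign each record to a 10-second bucket of event time.
      .window(TumblingEventTimeWindows.of(Time.seconds(10)))
      // Records up to 1 minute late re-fire the window, i.e. earlier output gets corrected.
      .allowedLateness(Time.minutes(1))
      .sum(2)
      .print()

    env.execute("event-time window demo")
  }
}
```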

 

3. New Ideas in the Database Field

HyPer: a main-memory-based relational DBMS for mixed OLTP and OLAP workloads; see the official site for details and a large set of related papers: http://www.hyper-db.de/

X100: no further information found.

C-Store/Vertica: the C-Store/Vertica database; introductory paper: http://vldb.org/pvldb/vol5/p1790_andrewlamb_vldb2012.pdf

DB2 BLU: the columnar storage in DB2 10.5 BLU.

 

4. Flink vs. Storm

The following is the top answer to https://stackoverflow.com/questions/30699119/what-is-are-the-main-differences-between-flink-and-storm:

Disclaimer: I'm an Apache Flink committer and PMC member and only familiar with Storm's high-level design, not its internals.

Apache Flink is a framework for unified stream and batch processing. Flink's runtime natively supports both domains due to pipelined data transfers between parallel tasks, which include pipelined shuffles. Records are immediately shipped from producing tasks to receiving tasks (after being collected in a buffer for network transfer). Batch jobs can optionally be executed using blocking data transfers.
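A tiny sketch of that choice, assuming Flink's DataSet API: the default execution mode ships records between tasks in a pipelined fashion, and it can be switched to blocking transfers for batch jobs.

```scala
import org.apache.flink.api.common.ExecutionMode
import org.apache.flink.api.scala._

val batchEnv = ExecutionEnvironment.getExecutionEnvironment
// PIPELINED (the default) streams records between tasks as they are produced;
// BATCH makes shuffles use blocking, fully materialized data transfers instead.
batchEnv.getConfig.setExecutionMode(ExecutionMode.BATCH)
```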

Apache Spark is a framework that also supports batch and stream processing. Flink's batch API looks quite similar to Spark's and addresses similar use cases, but differs in the internals. For streaming, both systems follow very different approaches (mini-batches vs. streaming), which makes them suitable for different kinds of applications. I would say comparing Spark and Flink is valid and useful; however, Spark is not the stream processing engine most similar to Flink.

Coming to the original question, Apache Storm is a data stream processor without batch capabilities. In fact, Flink's pipelined engine internally looks a bit similar to Storm, i.e., the interfaces of Flink's parallel tasks are similar to Storm's bolts. Storm and Flink have in common that they aim for low-latency stream processing by pipelined data transfers. However, Flink offers a more high-level API compared to Storm. Instead of implementing the functionality of a bolt with one or more readers and collectors, Flink's DataStream API provides functions such as Map, GroupBy, Window, and Join. A lot of this functionality must be manually implemented when using Storm.

Another difference is processing semantics. Storm guarantees at-least-once processing while Flink provides exactly-once. The implementations which give these processing guarantees differ quite a bit. While Storm uses record-level acknowledgments, Flink uses a variant of the Chandy-Lamport algorithm. In a nutshell, data sources periodically inject markers into the data stream. Whenever an operator receives such a marker, it checkpoints its internal state. When a marker has been received by all data sinks, the marker (and all records which have been processed before it) are committed. In case of a failure, all source operators are reset to their state as of the last committed marker and processing is continued. This marker-checkpoint approach is more lightweight than Storm's record-level acknowledgments. This slide set and the corresponding talk discuss Flink's stream processing approach including fault tolerance, checkpointing, and state handling.
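A minimal sketch of turning that marker/checkpoint mechanism on in Flink (the interval and mode below are arbitrary choices, not recommendations):

```scala
import org.apache.flink.streaming.api.CheckpointingMode
import org.apache.flink.streaming.api.scala._

val env = StreamExecutionEnvironment.getExecutionEnvironment
// Inject a checkpoint barrier (the "marker" described above) into every source every 10 seconds.
env.enableCheckpointing(10000)
// Barrier alignment gives exactly-once state updates; AT_LEAST_ONCE trades that for lower latency.
env.getCheckpointConfig.setCheckpointingMode(CheckpointingMode.EXACTLY_ONCE)
```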

Storm also offers an exactly-once, high-level API called Trident. However, Trident is based on mini-batches and hence more similar to Spark than Flink.

Flink's adjustable latency refers to the way that Flink sends records from one task to the other. As I said before, Flink uses pipelined data transfers and forwards records as soon as they are produced. For efficiency, these records are collected in a buffer which is sent over the network once it is full or a certain time threshold is met. This threshold controls the latency of records because it specifies the maximum amount of time that a record will stay in a buffer without being sent to the next task. However, it cannot be used to give hard guarantees about the time it takes for a record to travel from entering to leaving a program, because this also depends on the processing time within tasks and the number of network transfers, among other things.
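That flush threshold is exposed directly on the execution environment (and per stream); a tiny sketch with an arbitrary value:

```scala
import org.apache.flink.streaming.api.scala._

val env = StreamExecutionEnvironment.getExecutionEnvironment
// A record waits in a network buffer until the buffer fills up or this timeout expires,
// so the timeout bounds the per-hop buffering delay, not the end-to-end latency.
env.setBufferTimeout(100) // flush at least every 100 ms
// setBufferTimeout(-1) flushes only when a buffer is full (maximum throughput);
// values near 0 minimize buffering latency at the cost of throughput.
```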

 

 

5. References

The bit of computation work I've done over the past half year (大半年来做的计算这点事) - http://blog.csdn.net/pelick/article/details/50575632

DAG vs. MPP - http://blog.csdn.net/pelick/article/details/51538080

Introduction to Apache Flink for Spark Developers: Flink vs Spark - https://www.zhihu.com/question/30151872

Top 10 Best Analytical Processing (OLAP) Tools: Business Intelligence - http://www.softwaretestinghelp.com/best-olap-tools/

Flink architecture, principles, and deployment testing (Flink架构、原理与部署测试) - http://blog.csdn.net/JdoOudDm7i/article/details/62039337

What are the similarities and differences between Apache Flink and Apache Spark, and what are their prospects? (Apache Flink和Apache Spark有什么异同?它们的发展前景分别怎样?) - https://www.zhihu.com/question/30151872

What is/are the main difference(s) between Flink and Storm? - https://stackoverflow.com/questions/30699119/what-is-are-the-main-differences-between-flink-and-storm
