首先说明一下redbook上的几篇文章是做什么的。这几篇文章是通过几位作者对不同方面的论文进行阅读和筛选后,挑出其中具备代表性或者权威的论文来做分析,为读者提供阅读指导和建议,同时,也是对某个方面的论文进行高度的总结,归纳和点评。第五版里面有几大块内容:背景和基础,新一代数据库,大规模数据处理,分布式,查询优化,交互分析,web数据等,以及两篇关于复杂分析和数据集成的最新技术分析。
Query optimization is one of the signature components of database technology—the bridge that connects declarative languages to efficient execution. Query optimizers have a reputation as one of the hardest parts of a DBMS to implement well, so it’s no surprise they remain a clear differentiator for mature commercial DBMSs. The best of the open-source relational database optimizers are limited by comparison, and some have relatively naive optimizers that only work for the simplest of queries.
查询优化是数据库标志性的技术——是连接计算机语言与高效执行方式的桥梁。查询优化被誉为 DBMS 中最难实现的部分之一,因此这也是最重要的一点来去区别于成熟的DBMS。比较来看,最好的开源关系型数据库优化器比较有限,只有一些相对简单的优化器,而且只能工作在简单的查询上面。
“optimizer”,就是数据库优化器。首先我们要知道SQL的逻辑处理过程是不考虑性能的,而实际上SQL的实际上的处理,或者说物理的处理是完全不一样的,在物理的层面上我们是可以走很多捷径的, 这也正是数据库优化器存在的意义。具体点来说提交SQL的时候我们只关心我们需要通过什么样的逻辑拿到目标数据,而至于SQL的具体如何执行,应该以什么样的顺序访问表,以什么方式访问哪一个索引,使用什么联结算法等,这些都是优化器应该考虑的事情。优化器的作用,就是为SQL生成最优化的执行计划,这些计划生成的前提是确保可以得到正确的结果,无法保证得回正确结果的执行计划是不会被考虑的,并且下一段提到了优化器“estimation techniques to guess at real plan costs”和“heuristics”启发式算法来保证效率准确度,这也说明我们需要提供足够的信息来保证优化器生成的执行计划是足够优化的。
It’s important to remember that no query optimizer is truly producing “optimal” plans. First, they all use estimation techniques to guess at real plan costs, and it’s well known that errors in these estimation techniques can balloon—in some circumstances being as bad as random guesses [7]. Second, optimizers use heuristics to limit the search space of plans they choose, since the problem is NP-hard [6]. One assumption that’s gotten significant attention recently is the traditional use of 2-table join operators; this has been shown to be theoretically inferior to new multi-way join algorithms in certain cases [12].
首先要明确的是,没有一个优化器正在创造完美的执行计划的。首先,他们都使用预测的技术来猜测真正执行计划的消耗,但众所周知,这些估算技术中的错误在一些情况下可能会激增。其次,因为问题是NP-hard,优化器会使用启发式方法(heuristics)来去限制执行计划的搜索空间。最近传统的2表连接运算使用吸引了很多的关注,这也展示出在某些情况下,理论上是不如新的多路连接算法(multi-way join algorithms)。
启发式算法是这样定义的:一个基于直观或经验构造的算法,在可接受的花费(指计算时间和空间)下给出待解决组合优化问题每一个实例的一个可行解,该可行解与最优解的偏离程度一般不能被预计。现阶段,启发式算法以仿自然体算法为主,主要有蚁群算法、模拟退火法、神经网络等。定义也不太好理解,不过看到神经网络大概知道是什么了。借助神经网络的理解,其实也就是你提供的信息越完善,越准确,就越有可能得到最佳方案,信息越多,启发式算法得到的优化方案会越好。
对于查询优化通常有两种方案。基于启发式规则:启发式优化将查询的部分与已知的模式进行匹配,以重组计划。这些规则对查询进行转换,消除低效率的部分,这种方式不需要检查数据本身。基于代价的搜索:需要读取数据并估计执行计划的成本,然后从各个计划中选择成本最低的方案。
对于启发式规则,如果两个关系代数表达式生成相同的元组集,则它们是等价的。DBMS可以在没有成本模型的条件下生成更优的查询计划,当应用程序向数据库发送SQL查询,DBMS首先要将SQL解析成语法树的标记,Binder查询系统目录将语法树标记替换为内部标识符,生成逻辑查询计划,最后由查询优化器选择最高效的执行方案。
NP-Hard:算法中经常会说到NP问题或者NPC问题,P问题就是指该问题能在多项式复杂度内解决。NP问题就是指该问题能在多项式复杂度内被验证。NP-hard问题就是比NP问题更困难解决的问题,通常NP问题都可以说是NP-hard问题,但是不是所有的NP-hard问题都是NP问题。
“multi-way join algorithms”多路连接,并不清楚这是什么,网络上也没有很明确的解释,可能就是指的我们在MySQL熟悉多个表的JOIN。
Despite these caveats, relational query optimization has proven successful, and has enabled relational database systems to serve a wide range of bread-and-butter use cases fairly well in practice. Database vendors have invested many years into getting their optimizers to perform reliably on a range of use cases. Users have learned to live with limitations on the number of joins. Optimizers still, for the most part, make declarative SQL queries a far better choice than imperative code for most uses.
尽管有这些警告,但关系型查询优化已经被证明是成功的。而且已经可以让关系型数据库系统在实践中广泛的服务于基本用例。数据库供应商已经投入了很多年来使他们的优化器在一系列用例上更加可靠。用户已经学习到了和有限的连接共存。在大部分情况下,优化器使声明式SQL查询比起命令式代码成为更好的选择。
“make declarative SQL queries a far better choice than imperative code for most uses.”,这句话提到了“declarative SQL”和“imperative code”, 也就是声明式和命令式。首先命令式就是说,你在写代码时,你需要一步一步告诉计算机怎么做,这样计算机是不具备智能的,只是很机械的完成你交代的任务;声明式是说,你告诉计算机你想要什么,计算机负责为你设计方法,最常见的声明式语言就是SQL,用户只告诉了DBMS想要获取什么,但没有指出如何计算。因此,DBMS需要将SQL语句转换成可执行的查询计划(Query Plan)。但是对同样的数据可以有多种查询方案,性能也差距很大,查询优化器(Query Optimizer)的任务就是从给定的查询中选择一个最优的方案。显然声明式很考验计算机的能力,所以这也回应了上一段的内容。
In addition to being hard to build and tune, serious query optimizers also have a tendency to grow increasingly complex over time as they evolve to handle richer workloads and more corner cases. The research literature on database query optimization is practically a field unto itself, full of technical details—many of which have been discussed in the literature by researchers at mature vendors like IBM and Microsoft who work closely with product groups. For this book, we focus on the big picture: the main architectures that have been considered for query optimization and how have they been reevaluated over time.
除了难以构建和调优以外,严重的查询优化器会有一个趋势,随着时间的推移变得越来越复杂,因为他们进化到了需要处理更丰富的工作负载,和更多极端的情况,在写数据库优化的文献上,他们是一个单独的领域,充满了技术细节——其中许多已经被成熟的供应商写文章讨论过,像和产品组有紧密联系的IBM和Microsoft。对于这本书,我们强调整个大的趋势:被我们关注的查询优化器的主要架构,和他们是如何被重新评估的。
下文中提到的System R就是IBM最早提出的查询优化器。
Volcano/Cascades
We begin with the state of the art. There are two reference architectures for query optimization from the early days of database research that cover most of the serious optimizer implementations today. The first is Selinger et al.’s System R optimizer described in Chapter 3. System R’s optimizer is textbook material, implemented in many commercial systems; every database researcher is expected to understand it in detail. The second is the architecture that Goetz Graefe and his collaborators refined across a series of research projects: Exodus, Volcano, and Cascades. Graefe’s work is not covered as frequently in the research literature or the textbooks as the System R work, but it is widely used in practice, notably in Microsoft SQL Server, but purportedly in a number of other commercial systems as well. Graefe’s papers on the topic have something of an insider’s flavor—targeted for people who know and care about implementing query optimizers. We chose the Volcano paper for this book as the most approachable representative of the work, but aficionados should also read the Cascades paper [5]—not only does it raise and address a number of detailed deficiencies of Volcano, but it’s the latest (and hence standard) reference for the approach. Recently, two open-source Cascades-style optimizers have emerged: Greenplum’s Orca optimizer is now part of the Greenplum open source, and Apache Calcite is an optimizer that can be used with multiple backend query executors and languages, including LINQ.
我们先从目前的工艺水平开始。在数据库研究早期,有两种查询优化的参考体系结构,涵盖了如今大多数重要的优化器实现。第一个是在第三章提到过的Selinger等人的R系统优化器。R系统是教科书的材料,在许多商业系统中实现;每一个数据库研究都应该详细的理解他。第二个是Goetz Graefe和它的同事经过一系列项目所提炼出来的:Exodus,、Volcano和Cascades。Graefe的工作没有被研究论文像R系统一样经常提及,但是它在现实中被广泛使用,特别是在Microsoft SQL Server中,但是据说在其他的商业系统中也有。Graefe在这方面的文章也有一些内行家的味道——对于关注查询优化器的人。我们选择Volcano来作为这本数最平易近人的代表,但爱好者们也应该读读Cascades的论文——它不仅提出并解决了许多Volcano细节缺陷,而且也是最新的参考(因此也是最标准的)。最近,两个Cascades风格的开源优化器已经合并,Greenplum的Orca优化器是Greenplum开源项目的一部分。Apache Calcite是被使用在多后端查询执行器和多后端查询执行器语言的优化器,包括LING。
首先这个Volcano和Cascades分别指两种算法,将他们并列起来是因为二者是一脉相承的关系,很多基本思想是一样的;另外一点是很多有关Volcano 优化器的相关信息其实是由Cascades相关的文章总结和介绍的。Volcano给出了一个具有扩展性和通用性的查询优化器生成框架的设计理念,和一个粗粒度的搜索算法,但在工程实现方面基本没有描述。Cascades则是对Volcano做了很多改进,给出了工程实现相关的一些细节,比如用面向对象方式,提供了相关的数据结构定义,搜索流程等。
SQL查询优化的基础算法可以分为两大类(上文提到的启发式算法,和代价规则)。首先是基于规则的优化算法,其要点在于结构匹配和替换。应用规则的算法一般需要先在关系代数结构上匹配一部分局部的结构,再根据结构的特点进行变换乃至替换操作。需要注意的是,由于变换规则要保持关系代数语义不变的大前提,因此被匹配的部分即使内部结构完全被替换,其跟外部的接口也要保持一致性(向上输出的数据内容和类型不变,下层接受输入的数量和类型不变)。
另一个就是基于成本的优化算法。基于规则的优化算法在实际使用中会面对很多问题,变换规则的选择问题:哪些规则应该被应用?以什么顺序被使用?变换效果评价的问题:经过变换的查询性能是否会变好?多种可能的方案哪个更优?现阶段主流的方法都是基于成本(Cost)估算的方法。也就是说,给定某一关系代数代表的执行方案,将会对这一方案的执行成本进行估算,最终选择估算成本最低的方案。尽管被称为基于成本的方法,这类算法仍然往往要结合规则进行方案的探索。也就是说,基于成本的方法其实是通过不断的应用规则进行变换得到新的执行方案,然后对比方案的成本优劣进行最终选择。
Volcano是一种基于成本的优化算法,其目的是基于一些假设和工程算法的实现, 在获得成本较优的执行方案的同时,可以通过剪枝和缓存中间结果(动态规划)的方法降低计算消耗。成本最优假设是Volcano的关键。这一假设认为,在最优的方案当中,取局部的结构来看其方案也是最优的。成本最优假设利用了贪心算法的思想,在计算的过程中,如果一个方案是由几个局部区域组合而成,那么在计算总成本时,我们只考虑每个局部目前已知的最优方案和成本即可。由于引入了成本最优假设,在优化过程中我们就可以对任意子树目前已知的最优方案和最优成本进行缓存。此后在计算的过程中,如果需要利用这一子树,可以直接使用之前缓存的结果。这里应用了动态规划算法的思想。
在实现上述动态规划算法的时候存在两种遍历方法,一种是自底向上的动态规划算法,一种是自顶向下的动态规划算法。自底向上的算法最为直观:当我们试图计算节点A的最优方案时,其子树上每个节点对应的等价集合和最优方案都已经计算完成了,我们只需要在 A 节点上不断寻找可以应用的规则,并利用已经计算好的子树成本计算出母树的成本,就可以得到最优方案。事实上,包括 System R 在内的一些成熟的数据库系统都采用这种方法。
然而这种方案存在一些难以解决的问题:不方便应用剪枝技巧,在查询中可能会遇到在父亲节点的某一种方案成本很高,后续完全无需考虑的情况,尽管如此,需要被利用的子计算都已经完成了,这部分计算因此不可避免。难以实现启发式计算和限制计算层数。由于程序要不断递归到最后才能得到比较好的方案, 因此即使计算量比较大也无法提前得到一个可行的方案并停止运行。
因此,Volcano采取了自顶向下的计算方法,在计算开始,每棵子树先按照原先的样子计算成本并作为初始结果。在不断应用规则的过程中,如果出现一种新的结构被加入到当前的等价集合中,且这种等价集合具有更优的成本,这时需要向上冒泡到所有依赖这一子集合的父亲等价集合,更新集合里每个元素的成本并得到新的最优成本和方案。
总的来说就是,Volcano与System-R的Search engine有所不同,Volcano的搜索路径是自顶向下的,也就是先从关系算子树的顶层开始,以深度优先的方式来向下遍历,遍历过程中进行剪枝。
Graefe’s optimizer architecture is notable for two main reasons. First, it was expressly designed to be extensible. Volcano deserves credit for being quite forward-looking—long predating MapReduce and the big data stacks—in exploring the idea that dataflow could be useful for a wide range of data-intensive applications. As a result, the Graefe optimizers are not just for compiling SQL into a plan of dataflow iterators. They can be parameterized for other input languages and execution targets; this is a highly relevant topic in recent years with the rise of specialized data models and languages on the one hand (see Chapter 2 and 9), and specialized execution engines on the other (Chapter 5). The second innovation in these optimizers was the use of a top-down or goal-oriented search strategy for finding the cheapest plan in the space of possible plans. This design choice is connected to the extensibility API in Graefe’s designs, but that is not intrinsic: the Starburst system showed how to do extensibility for Selinger’s bottom-up algorithm [9]. This “top-down” vs “bottom-up” debate for query optimization has advocates on both sides, but no clear winner; a similar top-down/bottom-up debate came out to be more or less a tie in the recursive query processing literature as well [13]. Aficionados will be interested to note that these two bodies of literature-recursive query processing and query optimizer search-were connected directly in the Evita Raced optimizer, which implemented both top-down and bottom-up optimizer search by using recursive queries as the language for implementing an optimizer
Graefe的优化器架构有两个值得注意的原因。首先它被明确设计为可扩展的。Volcano值得称赞的是,它非常具有前瞻性——早于MapReduce和大数据栈——它探索了数据流可以用于大量数据密集型应用的程序的想法。因此,Graefe的优化器不仅仅是将SQL编译到数据流迭代计划。它们可以为其他输入语言和执行目标参数化;一方面随着专业数据模型和语言的兴起,另一方面是专业执行引擎,这成了一个高度相关的主题。这些优化者的第二个创新是使用自顶向下(top-down)或目标导向(goal-oriented)的搜索策略,为了在可能的计划空间中寻找最低耗计划。这个设计选择关系到了Graefe设计的可扩展性API,但是那不是内在原因:Starburst系统展示了如何为Selinger的自底向上(bottom-up)算法做可扩展性。在查询优化上自顶向上和自底向下的争论都分别有支持者。但没有明确的赢家;类似的,自顶向下/自底向上的争论在递归查询(recursive query)处理文献中也或多或少地出现了。有兴趣的爱好者会注意到,这两种文献——递归查询处理和查询优化器搜索——直接关联在Evita Raced优化器中,它是通过使用响应式查询作为实现优化器的语言,来实现自顶向下和自顶向上的优化器搜索。
这段提到了“dataflow”,数据流主要是用来,可以将事件中的数据写入数据库,缓存,搜索索引或类似的存储系统,然后能被其他客户端查询。数据库中的数据流主要是写入数据库的时候对数据进行编码,读取的时候进行解码。
“goal-oriented search strategy”,应该是指搜索时有一些什么东西进行监督,或者说是引导,从而提高执行计划的合理性。
上一段详细解释了自顶向下与自底向上的优劣势。
这段主要是说了Graefe优化器第一个优势是探索了数据流可以用于大量数据密集型应用的程序想法,第二个是使用自顶向下或面向目标的搜索策略。
Adaptive Query Processing
By the late 1990’s, a handful of trends suggested that the overall architecture of query optimization deserved a significant rethink. These trends included:
- Continuous queries over streaming data.
- Interactive approaches to data exploration like Online Aggregation.
- Queries over data sources that are outside the DBMS and do not provide reliable statistics or performance.
- Unpredictable and dynamic execution environments, including elastic and multitenant settings and widely distributed systems like sensor networks.
- Opaque data and user-defined functions in queries, where statistics can only be estimated by observing behavior.
In addition, there was ongoing practical concern about the theoretical fact that plan cost estimation was often erratic for multi-operator queries [7]. As a result of these trends, interest emerged in adaptive techniques for processing queries, where execution plans could change mid-query. We present two complementary points in the design space for adaptive query processing; there is a long survey with a more comprehensive overview [4].
到20实际90年代末,一些趋势表明,查询优化的整体架构应该进行重大的重新思考。这些趋势包括:
- 对流数据的持续查询
- 交互式数据探索方法,像在线聚合(Online Aggregation)
- 对数据库管理系统之外的数据源查询,不能提供可靠的统计数据或性能
- 不可预测和动态执行环境,包括弹性和多租户设置和广泛分布的系统,如传感器网络。
- 查询中不透明的数据和用户定义的函数,其中统计数据只能通过观察行为来估计。
另外,对于多操作符查询的计划成本估计常常是不稳定的这一理论事实,也存在持续的实际关注。由于这些趋势,人们对处理查询的适应性技术产生了兴趣,其中执行计划可能在查询过程中发生变化。我们提出了自适应查询处理设计空间中的两个互补点;有一个更全面的概览。
这段主要讲述了查询优化近些年研究的整体趋势。
“adaptive query processing”,自适应查询处理分析实际查询运行时统计信息,并将该信息用于后续优化。随着数据量的迅速增加,错误计算复杂计划的代价可能会导致严重的性能问题。这些问题可能以分钟或小时而不是秒或分钟来衡量。AQP 会分析实际查询运行时统计信息并使用该信息来更正先前的估计。这些更新后的估计可以为后续优化提供更好的信息。
Eddies
The work on eddies, represented by our second paper, pushed hard on the issue of adaptivity: if query “re-planning” has to occur mid-execution, why not remove the architectural distinction between planning and execution entirely? In the eddies approach, the optimizer is encapsulated as a dataflow operator that is itself interposed along other dataflow edges. It can monitor the rates of dataflow along those edges, so it has dynamic knowledge of their behavior, with whatever history it cares to record. With that ongoing flow of information, it can dynamically control the rest of the aspects of query planning via dataflow routing: the order of commutative operators is determined by the order tuples are routed through operators (the focus of the first eddies paper that we include here) the choice of physical operators (e.g. join algorithms, index selection) is determined by routing tuples among multiple alternative, potentially redundant physical operators in the flow [3, 15] the scheduling of operators is determined by buffering inputs and deciding which output to deliver to next [14]. As an extension, multiple queries can be scheduled by interposing on their flows and sharing common operators [10]. Eddies intercept the ongoing dataflow of query operators while they are in flight, pipelining data from their inputs to their output. For this reason it’s important that eddy routing be implemented efficiently; Deshpande developed implementation enhancements along these lines [2]. The advantage of this pipelined approach is that eddies can adaptively change strategies in the middle of executing a pipelined operator like a join, which is useful if a query operator is either very long-lived (as in a streaming system) or a very poor choice that should be abandoned long before it runs to completion. Interestingly, the original Ingres optimizer also had the ability to make certain query optimization decisions on a per-tuple basis [18].
在涡流上,以第二篇文章为代表,在适应性问题上做了努力推动:如果“重新计划”必须发生在运行期间,为什么不删除在计划与执行之间全部的架构区别?在涡流过程中,优化器被封装为数据流运算符,它本身沿其他数据流插入。它可以监控边缘上的这些数据流,所以它能动态的知道他们的行为,伴随着无论它所记录的什么历史。伴随持续存在的信息流,它能动态的掌控经由数据流路径的查询计划方面:为了交换操作的顺序是由元组顺序路由通过操作器(我们包括的集中在第一个漩涡)的选择物理运算符(如加入算法、索引选择)是由路由元组在多个选择,潜在的冗余流中的物理运算符的调度操作符是由缓冲输入和决定哪些输出送到下一个。作为一个延伸,多个查询能被计划通过进入它们的流,以及共享相同的运算符。涡流拦截正在进行的数据流运算符,当它们在运行态,从它们输入到输出这样一个流水线状态。对于这个原因,有效的实现涡流路由是重要的;Deshpande沿着这些做了增强。流水线过程的优势是涡流能在执行运算符过程中自适应的改变策略,像连接,他是有用的,如果一个查询运算符也是非常长久的(作为在流系统中)或一个非常普通的选择,那个选择应该在它运行完成之前很久就被抛弃了。有趣的,原始的Ingres优化器也有能力去做正确的查询优化在每一个元基础上。
这段也提到了最常见的流处理的作用,就是用于监控目的,如果某个事件发生,希望能得到警报。这段主要讲了Eddies对于流的监控,来优化查询计划。
Progressive Optimization
The third paper in this section from IBM represents a much more evolutionary approach, which extends a System R style optimizer with adaptivity features; this general technique was pioneered by Kabra and DeWitt [8] but receives a more complete treatment here. Where eddies focused on intra-operator reoptimization (while data is “in motion”), this work focuses on inter-operator reoptimization (when data is “at rest”). Some of the traditional relational operators including sorting and most hash-joins are blocking: they consume their entire input before producing any output. This presents an opportunity after input is consumed to compare observed statistics to optimizer predictions, and reoptimize the “remainder” of the query plan using traditional query optimization technique. The downside of this approach is that it does no reoptimization while operators are consuming their inputs, making it inappropriate for continuous queries over streams, for pipelining operators like symmetric hash join [17] or for long-running relational queries that have poorly-chosen operators in the initial parts of the plan - e.g. when data is being accessed from data sources outside the DBMS that do not provide useful statistics [11, 16].
在来自IBM这个部分的第三篇文章,代表了更多的进化方法,扩展了System R优化器伴随着适应性特征;这个通用的科技被用来作为了个先驱,通过Kabra和DeWitt,但在这里获得了更完整的治疗。其中eddies专注于内部运算符内的重新优化(虽然数据是“运动的”)。一些传统的关系运算符,包括排序、哈希连接被锁定:它们消耗它们全部的输入在创造输出之前。这个代表一个在输入后的机会,去消耗观察到的统计数据与优化器预测进行比较,并使用传统的查询优化技术重新优化查询计划的“余数”。这个过程的缺点是当运算符计算它们输入时没有重新优化,这使它不适合在流上持续查询,对于流水线运算符。像对称哈希连接或用于长期运行的关系查询,它是在计划的初始部分中有较差的运算符,等等那些当从不提供有用统计信息的DBMS之外的数据源引入数据时。
It’s worth noting that these two architectures for adaptivity could in principle coexist: an eddy is “just” a dataflow operator, meaning that a traditional optimizer can generate a query plan with an eddy connecting a set of streaming operators, and also do reoptimization at blocking points in the dataflow in the manner of our third paper.
两个架构在原则长适应性的共存是毫无意义的:一个涡流式“仅仅”一个数据流操作符,这意味者传统的优化器能伴随一个连接一系列流运算符的涡流来生成一个查询计划,并且以我们的第三篇论文的方式在数据流量中的阻止点进行再优化。
这段主要讲了,Eddies关注于操作符内部的重新优化,而IBM的这项工作关注于操作符之间的重新优化。并与传统的操作符进行了对比。
Discussion
This brings us to a discussion of current trends in dataflow architectures, especially in the open source big data stack. Google MapReduce set back by a decade the conversation about adaptivity of data in motion, by baking blocking operators into the execution model as a fault-tolerance mechanism. It was nearly impossible to have a reasoned conversation about optimizing dataflow pipelines in the mid-to-late 2000’s because it was inconsistent with the Google/Hadoop fault tolerance model. In the last few years the discussion about execution frameworks for big data has suddenly opened up wide, with a quickly-growing variety of dataflow and query systems being deployed that have more similarities than differences (Tenzing, F1, Dremel, DryadLINQ, Naiad, Spark, Impala, Tez, Drill, Flink, etc.) Note that all of the motivating issues for adaptive optimization listed above are very topical in today’s big data discussion, but not well treated.
这带给我们去讨论现金数据流架构的趋势,特别的在开源的大数据栈中。Google MapReduce推迟了十年对于动作中数据的适应性的讨论,通过将阻塞在执行中的运算符来作为容错机制。关于优化2000年前末期的数据流水线有一个合乎逻辑的讨论时很有可能的,因为它与Google/Hadoop的容错模型不一致。在过去的一些年,关于大数据的执行框架的讨论突然变得受关注,随着快速成长的数据流种类和那些有更多相似之处的查询系统 (Tenzing, F1, Dremel, DryadLINQ, Naiad, Spark, Impala, Tez, Drill, Flink, etc.)。注意,上面对于适应性优化的所有激励问题都是非常热门的,在当今的大数据讨论中,但处理得没有很好。
More generally, I would say that the “big data” community in both research and open source has been far too slow to focus on query optimization, to the detriment of both the current systems and the query optimization field. To begin with, the “hand-planned” MapReduce programming model remained a topic of conversation for far longer than it should have. It took a long time for the Hadoop and systems research communities to accept that a declarative language like SQL or LINQ is a good general-purpose interface, even while maintaining low-level MapReduce-style dataflow programming as a special-case “fast path”. More puzzling is the fact that even when the community started building SQL interfaces like Hive, query optimization remained a little-discussed and poorly-implemented topic. Maybe it’s because query optimizers are harder to build well than query executors. Or maybe it was fallout from the historical quality divide between commercial and open source databases. MySQL was the open source de facto reference for “database technology” for the preceding decade, with a naive heuristic optimizer. Perhaps as a result, many (most?) open source big data developers didn’t understand—or trust—query optimizer technology.
更加普遍的,我能够说在研究与开源上的“大数据”领域太缓慢了,以至于不能集中去发展查询优化,对当前查询系统和查询优化领域很不利。首先“手工计划”MapReduce编程模型保持了一个讨论的主题,远远超过它应有的。去接受SQL或LINQ声明性语言是一个好的通用目的接口,这花费了很长时间对于Hadoop和系统研究讨论,即使保持低级MapReduce-style流编程作为特殊情况的“快速路径”。更多的困惑是即使讨论开始建立SQL接口,像Hive,查询优化保持很少的讨论和不怎么样的话题。也许它是因为查询优化器更难构建好,比查询执行器。或者可能它是在历史上在商业和开源数据库之间的堕落。MySQL是开源的,来源指向十年前的“数据科学”,伴随启发式优化器。可能作为一个结果,许多(大部分?)的开源大数据研究者不理解、或不信任查询优化器技术。
In any case, this tide is turning in the big data community. Declarative queries have returned as the primary interface to big data, and there are efforts underway in essentially all the projects to start building at least a 1980’s-era optimizer. Given the list of issues I mention above, I’m confident we’ll also see more innovative query optimization approaches deployed in new systems over the coming years.
在一些情况中,这个潮流正在转向大数据社区。声明性查询已作为大数据主要接口返回,在至少19实际80年代的优化,本质上所有开始建造的项目都会有努力去尝试。通过上面我提到的问题列表,我是自信的,我们将在未来几年能看到更多创新的查询方法被部署在新的系统中。
讨论这段主要讲了,作者认为人们不重视自适应优化改进问题,对于大数据查询优化大家研究太少,以及对未来的期盼。
总结:后面几段难度太大,网上中文资料也不是很多,由于时间紧迫,只能简单阅读。通过前面几段阅读我了解了,数据查询的优化器一些知识,像优化查询优化的基本方案,优化器算法的优劣,不同优化器之间的区别,整个优化器的趋势。