Paper Reading: Compilation in the Microsoft SQL Server Hekaton Engine

Compilation in the Microsoft SQL Server Hekaton Engine


Original paper: https://15721.courses.cs.cmu.edu/spring2016/papers/freedman-ieee2014.pdf

This is the first post on the topic of query compilation; follow-up posts will cover more paper readings and code walkthroughs on this topic.

Abstract

Hekaton is a new database engine optimized for memory resident data and OLTP workloads that is fully integrated into Microsoft SQL Server. A key innovation that enables high performance in Hekaton is compilation of SQL stored procedures into machine code.

1 Introduction

SQL Server and other major database management systems were designed assuming that main memory is expensive and data resides on disk. This assumption is no longer valid; over the last 30 years memory prices have dropped by a factor of 10 every 5 years. Today, one can buy a server with 32 cores and 1TB of memory for about $50K and both core counts and memory sizes are still increasing. The majority of OLTP databases fit entirely in 1TB and even the largest OLTP databases can keep the active working set in memory.

Recognizing this trend, SQL Server several years ago began building a database engine optimized for large main memories and many-core CPUs. The new engine, code named Hekaton [2] [3], is targeted for OLTP workloads.

Several main memory database systems already exist, both commercial systems [4] [5] [6] [7] [8] and research prototypes [9] [10] [11] [12]. However, Hekaton has a number of features that set it apart from the competition.

Most importantly, the Hekaton engine is integrated into SQL Server; it is not a separate DBMS. To take advantage of Hekaton, all a user has to do is declare one or more tables in a database memory optimized. This approach offers customers major benefits compared with a separate main-memory DBMS. First, customers avoid the hassle and expense of another DBMS. Second, only the most performance-critical tables need to be in main memory; other tables can be left unchanged. Third (and the focus of this article), stored procedures accessing only Hekaton tables can be compiled into native machine code for further performance gains. Fourth, conversion can be done gradually, one table and one stored procedure at a time.

Memory optimized tables are managed by Hekaton and stored entirely in main memory. Hekaton tables can be queried and updated using T-SQL in the same way as regular SQL Server tables. A query can reference both Hekaton tables and regular tables and a single transaction can update both types of tables. Furthermore, a T-SQL stored procedure that references only Hekaton tables can be compiled into native machine code. This is by far the fastest way to query and modify data in Hekaton tables and is essential to achieving our end-to-end performance goals for Hekaton.

The rest of the article is organized as follows. Section 2 outlines the high-level considerations and principles behind the design of Hekaton. Section 3 describes how stored procedures and table definitions are compiled into native code. Section 4 provides some experimental results.

Terminology. We will use the terms Hekaton table and Hekaton index to refer to tables and indexes stored in main memory and managed by Hekaton. Tables and indexes managed by the traditional SQL Server engine will be called regular tables and regular indexes. Stored procedures that have been compiled to native machine code will simply be called compiled stored procedures and traditional non-compiled stored procedures will be called interpreted stored procedures.

2 Design Considerations

Our goal at the outset of the Hekaton project was to achieve a 10-100X throughput improvement for OLTP workloads. An analysis done early on in the project drove home the fact that a 10-100X throughput improvement cannot be achieved by optimizing existing SQL Server mechanisms. Throughput can be increased in three ways: improving scalability, improving CPI (cycles per instruction), and reducing the number of instructions executed per request. The analysis showed that, even under highly optimistic assumptions, improving scalability and CPI can produce only a 3-4X improvement.

The only real hope is to reduce the number of instructions executed but the reduction needs to be dramatic. To go 10X faster, the engine must execute 90% fewer instructions and yet still get the work done. To go 100X faster, it must execute 99% fewer instructions. This level of improvement is not feasible by optimizing existing storage and execution mechanisms. Reaching the 10-100X goal requires a much more efficient way to store and process data.
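The arithmetic behind these targets is worth making explicit (this restates the figures above; it is not additional data from the paper). If instruction count dominates cost, a speedup factor $S$ requires executing only $1/S$ of the original instructions, so the required reduction is:

```latex
\text{reduction} \;=\; 1 - \frac{I_{\text{new}}}{I_{\text{old}}} \;=\; 1 - \frac{1}{S}
```

For $S = 10$ this gives a 90% reduction; for $S = 100$, 99%, matching the figures above.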

So to achieve 10-100X higher throughput, the engine must execute drastically fewer instructions per transaction, achieve a low CPI, and have no bottlenecks that limit scalability. This led us to three architectural principles that guided the design.

2.1 Optimize indexes for main memory

Current mainstream database systems use disk-oriented storage structures where records are stored on disk pages that are brought into memory as needed. This requires a complex buffer pool where a page must be protected by latching before it can be accessed. A simple key lookup in a B-tree index may require thousands of instructions even when all pages are in memory.

Hekaton indexes are designed and optimized for memory-resident data. Durability is ensured by logging and checkpointing records to external storage; index operations are not logged. During recovery Hekaton tables and their indexes are rebuilt entirely from the latest checkpoint and logs.

2.2 Eliminate latches and locks

With the growing prevalence of machines with 100’s of CPU cores, achieving good scaling is critical for high throughput. Scalability suffers when the system has shared memory locations that are updated at high rate such as latches and spinlocks and highly contended resources such as the lock manager, the tail of the transaction log, or the last page of a B-tree index [13] [14].

All of Hekaton’s internal data structures, for example, memory allocators, hash and range indexes, and the transaction map, are entirely latch-free (lock-free). There are no latches or spinlocks on any performance-critical paths in the system. Hekaton uses a new optimistic multiversion concurrency control algorithm to provide transaction isolation semantics; there are no locks and no lock table [15]. The combination of optimistic concurrency control, multiversioning and latch-free data structures results in a system where threads execute without stalling or waiting.

2.3 Compile requests to native code

SQL Server uses interpreter based execution mechanisms in the same ways as most traditional DBMSs. This provides great flexibility but at a high cost: even a simple transaction performing a few lookups may require several hundred thousand instructions.

Hekaton maximizes run time performance by converting statements and stored procedures written in T-SQL into customized, highly efficient native machine code. The generated code contains exactly what is needed to execute the request, nothing more. As many decisions as possible are made at compile time to reduce runtime overhead. For example, all data types are known at compile time allowing the generation of efficient code.

The importance of compiling stored procedures to native code is even greater when the instruction path length improvements and performance improvements enabled by the first two architectural principles are taken into account. Once other components of the system are dramatically sped up, those components that remain unimproved increasingly dominate the overall performance of the system.

3 Native Compilation in Hekaton

As noted earlier, a key architectural principle of Hekaton is to perform as much computation at compile time as possible. Hekaton maximizes run time performance by converting SQL statements and stored procedures into highly customized native code. Database systems traditionally use interpreter based execution mechanisms that perform many run time checks during the execution of even simple statements. For example, the Volcano iterator model [1] uses a relatively small number of query operators to execute any query. Each iterator by definition must be able to handle a wide variety of scenarios and cannot be customized to any one case.

Our primary goal is to support efficient execution of compile-once-and-execute-many-times workloads as opposed to optimizing the execution of ad hoc queries. We also aim for a high level of language compatibility to ease the migration of existing SQL Server applications to Hekaton tables and compiled stored procedures. Consequently, we chose to leverage and reuse technology wherever suitable. We reuse much of the SQL Server T-SQL compilation stack including the metadata, parser, name resolution, type derivation, and query optimizer. This tight integration helps achieve syntactic and semantic equivalence with the existing SQL Server T-SQL language. The output of the Hekaton compiler is C code and we leverage Microsoft’s Visual C/C++ compiler to convert the C code into machine code.

While it was not a goal to optimize ad hoc queries, we do want to preserve the ad hoc feel of the SQL language. Thus, a table or stored procedure is available for use immediately after it has been created. To create a Hekaton table or a compiled stored procedure, the user merely needs to add some additional syntax to the CREATE TABLE or CREATE PROCEDURE statement. Code generation is completely transparent to the user.

Figure 1 illustrates the overall architecture of the Hekaton compiler. There are two main points where we invoke the compiler: during creation of a Hekaton table and during creation of a compiled stored procedure.

As noted above, we begin by reusing the existing SQL Server compilation stack. We convert the output of this process into a data structure called the mixed abstract tree or MAT. This data structure is a rich abstract syntax tree capable of representing metadata, imperative logic, expressions, and query plans. We then transform the MAT into a second data structure called the pure imperative tree or PIT. The PIT is a much "simpler" data structure that can be easily converted to C code (or theoretically directly into the intermediate representation for a compiler backend such as Phoenix [17] or LLVM [16]). We discuss the details of the MAT to PIT transformation further in Section 3.2. Once we have C code, we invoke the Visual C/C++ compiler and linker to produce a DLL. At this point it is just a matter of using the OS loader to bring the newly generated code into the SQL Server address space where it can be executed.

[Figure 1: overall architecture of the Hekaton compiler]

3.1 Schema Compilation

It may not be obvious why table creation requires code generation. In fact, there are two reasons.

The first reason is that the Hekaton storage engine treats records as opaque objects. It has no knowledge of the internal content or format of records and cannot directly access or process the data in records. The Hekaton compiler provides the engine with customized callback functions for each table. These functions perform tasks such as computing a hash function on a key or record, comparing two records, and serializing a record into a log buffer. Since these functions are compiled into native code, index operations such as inserts and searches are extremely efficient.
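The shape of these per-table callbacks can be sketched as follows. This is an illustrative sketch, not Hekaton's actual interface: the names (`TableCallbacks`, `CustomerRec`, and so on) are invented, but it shows the idea that the compiler emits type-specific native functions behind which the storage engine can treat records as opaque bytes.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Per-table callback table the engine calls for type-specific work.
 * (Names are illustrative, not Hekaton's real interface.) */
typedef struct {
    uint64_t (*hash_key)(const void *key);
    int      (*compare)(const void *rec_a, const void *rec_b);
    size_t   (*serialize)(const void *rec, void *log_buf, size_t cap);
} TableCallbacks;

/* A hypothetical table keyed on a 32-bit integer id. */
typedef struct { int32_t id; char name[16]; } CustomerRec;

static uint64_t customer_hash_key(const void *key) {
    uint32_t k = *(const uint32_t *)key;
    k *= 2654435761u;                 /* Knuth multiplicative hash */
    return k;
}

static int customer_compare(const void *a, const void *b) {
    const CustomerRec *ra = a, *rb = b;
    return (ra->id > rb->id) - (ra->id < rb->id);
}

static size_t customer_serialize(const void *rec, void *buf, size_t cap) {
    if (cap < sizeof(CustomerRec)) return 0;  /* buffer too small */
    memcpy(buf, rec, sizeof(CustomerRec));
    return sizeof(CustomerRec);
}

/* What "the compiler provides the engine with customized callback
 * functions for each table" amounts to structurally: */
static const TableCallbacks customer_callbacks = {
    customer_hash_key, customer_compare, customer_serialize
};
```

Because the record layout is fixed at table-creation time, none of these functions consult column metadata at run time, which is where the advantage over a generic, interpreted record handler comes from.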

The second reason is that SQL Server’s interpreted query execution engine can be leveraged to access Hekaton tables in queries that are not part of a compiled stored procedure. We refer to this method of accessing Hekaton tables as interop. While interop leverages the interpreted query execution engine code and operators, it does require some mechanism to crack Hekaton records and extract column values. When a new table is created, the Hekaton compiler determines the record layout and saves information about this record layout for use by the interop code. We discuss interop further in Section 3.4.

3.2 Stored Procedure Compilation

There are numerous challenging problems that we had to address to translate T-SQL stored procedures into C code. Perhaps the most obvious challenge is the transformation of query plans into C code and we will discuss our approach to this problem momentarily. There are, however, many other noteworthy complications. For example, the T-SQL and C type systems and expression semantics are very different. T-SQL includes many data types such as date/time types and fixed precision numeric types that have no corresponding C data types. In addition, T-SQL supports NULLs while C does not. Finally, T-SQL raises errors for arithmetic expression evaluation failures such as overflow and division by zero while C either silently returns a wrong result or throws an OS exception that must be translated into an appropriate T-SQL error.
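One plausible shape for how generated code can bridge these mismatches is sketched below. The names (`NullableInt`, `TSqlError`) are invented for illustration and the real generated code differs; the point is that NULLs are modeled explicitly and arithmetic is checked so that overflow and division by zero surface as T-SQL-style errors rather than silent wrong results or OS exceptions.

```c
#include <limits.h>
#include <stdbool.h>

/* A nullable integer: C has no NULL-aware scalar types, so the flag
 * must be carried explicitly. (Illustrative, not Hekaton's layout.) */
typedef struct { bool is_null; int value; } NullableInt;

typedef enum { TSQL_OK, TSQL_OVERFLOW, TSQL_DIV_BY_ZERO } TSqlError;

/* T-SQL addition: NULL propagates; overflow raises an error instead of
 * wrapping silently as plain C signed arithmetic might. */
static TSqlError tsql_add(NullableInt a, NullableInt b, NullableInt *out) {
    if (a.is_null || b.is_null) { out->is_null = true; return TSQL_OK; }
    if ((b.value > 0 && a.value > INT_MAX - b.value) ||
        (b.value < 0 && a.value < INT_MIN - b.value))
        return TSQL_OVERFLOW;
    out->is_null = false;
    out->value = a.value + b.value;
    return TSQL_OK;
}

/* T-SQL division: division by zero is a raised error, not a crash. */
static TSqlError tsql_div(NullableInt a, NullableInt b, NullableInt *out) {
    if (a.is_null || b.is_null) { out->is_null = true; return TSQL_OK; }
    if (b.value == 0) return TSQL_DIV_BY_ZERO;
    if (a.value == INT_MIN && b.value == -1) return TSQL_OVERFLOW;
    out->is_null = false;
    out->value = a.value / b.value;
    return TSQL_OK;
}
```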

These complexities were a major factor in our decision to introduce the intermediate step of converting the MAT into the PIT rather than directly generating C code. The PIT is a data structure that can be easily manipulated, transformed, and even generated out of order in memory. It is much more challenging to work directly with C code in text form.

The transformation of query plans into C code warrants further discussion. To aid in this discussion, consider the simple T-SQL example in Figure 2. This procedure retrieves a customer name, address, and phone number given a customer id. The procedure declaration includes some additional syntax; we will explain below why this syntax is required.

[Figure 2: sample T-SQL procedure]
As with many query execution engines, we begin with a query plan which is constructed out of operators such as scans, joins, and aggregations. Figure 3(a) illustrates one possible plan for executing our sample query. For this example, we are naively assuming that the DBA has not created an index on Customer.Id and that the predicate is instead evaluated via a filter operator. In practice, we ordinarily would push the predicate down to the storage engine via a callback function. However, we use the filter operator to illustrate a more interesting outcome.

Each operator implements a common interface so that they can be composed into arbitrarily complex plans. In our case, this interface consists of GetFirst, GetNext, ReturnRow, and ReturnDone. However, unlike most query execution engines, we do not implement these interfaces using functions. Instead, we collapse an entire query plan into a single function using labels and gotos to implement and connect these interfaces. Figure 3(b) illustrates graphically how the operators for our example are interconnected. Each hollow circle represents a label while each arrow represents a goto statement. In many cases, we can directly link the code for the various operators bypassing intermediate operators entirely. The X’s mark labels and gotos that have been optimized out in just such a fashion. In conventional implementations, these same scenarios would result in wasted instructions where one operator merely calls another without performing any useful work.

Execution of the code represented by Figure 3(b) begins by transferring control directly to the GetFirst entry point of the scan operator. Note already the difference as compared to traditional query processors which typically begin execution at the root of the plan and invoke repeated function calls merely to reach the leaf of the tree even when the intermediate operators have no work to do. Presuming the Customers table is not empty, the scan operator retrieves the first row and transfers control to the filter operator ReturnRow entry point. The filter operator evaluates the predicate and either transfers control back to the scan operator GetNext entry point if the current row does not qualify or to the output operator ReturnRow entry point if the row qualifies. The output operator adds the row to the output result set to be returned to the client and then transfers control back to the scan operator GetNext entry point again bypassing the filter operator. When the scan operator reaches the end of the table, execution terminates immediately. Again control bypasses any intermediate operators.
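The control flow described above can be sketched in C. This is a hand-written illustration, not actual Hekaton output: a trivial scan-filter-output plan collapsed into a single function, with labels standing in for the GetFirst/GetNext/ReturnRow/ReturnDone entry points and gotos wiring the operators together.

```c
/* Illustrative sketch: scan -> filter -> output collapsed into one
 * function. Labels play the role of operator entry points; gotos
 * connect operators directly, bypassing intermediate plumbing. */
typedef struct { int id; const char *name; } Row;

static Row customers[] = { {1, "Alice"}, {42, "Bob"}, {7, "Carol"} };
static const int nrows = 3;

/* Returns 1 and fills *out if a row with matching id exists, else 0. */
static int run_query(int key, Row *out) {
    int pos = 0;
    int found = 0;
    Row *row;

    /* scan.GetFirst: execution starts here, not at the plan root */
    if (pos >= nrows) goto output_return_done;
    row = &customers[pos];
    goto filter_return_row;

scan_get_next:                      /* scan.GetNext */
    pos++;
    if (pos >= nrows) goto output_return_done;
    row = &customers[pos];
    /* fall through to filter.ReturnRow */

filter_return_row:                  /* filter.ReturnRow */
    if (row->id != key) goto scan_get_next;  /* predicate fails: next row */

    /* output.ReturnRow: emit the row, then pull the next one, going
     * straight back to the scan and bypassing the filter entirely */
    *out = *row;
    found = 1;
    goto scan_get_next;

output_return_done:                 /* output.ReturnDone */
    return found;
}
```

Note that no function call ever occurs between operators; a qualifying row moves from scan to output with two jumps, which is the "optimized out" linkage the X's in Figure 3(b) depict.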

This design is extremely flexible and can support any query operator including blocking (e.g., sort and group by aggregation) and non-blocking (e.g., nested loops join) operators. Our control flow mechanism is also flexible enough to handle operators such as merge join that alternate between multiple input streams. By keeping all of the generated code in a single function, we avoid costly argument passing between functions and expensive function calls. Although the resulting code is often challenging to read due in part to the large number of goto statements, it is important to keep in mind that our intent is not to produce code for human consumption. We rely on the compiler to generate efficient code. We have confirmed that the compiler indeed does so through inspection of the resulting assembly code.

Figure 4 gives a sample of the code produced for the sample procedure from Figure 2. We show only the code generated for the seek and filter operators with some minor editing for the sake of both brevity and clarity. While the generated code ordinarily has no comments to protect against security risks and potential "C injection attacks" (see Section 3.3), we do have the ability to add comments in internal builds for development and supportability purposes. The code in Figure 4 includes these comments for the sake of clarity. A careful analysis of this code sample will show that all of the gotos and labels are just as described above.

[Figure 4: sample of the generated code for the seek and filter operators]

We compared this design to alternatives involving multiple functions and found that the single function design resulted in the fewest number of instructions executed as well as the smallest overall binary. This result was true even with function inlining. In fact, the use of gotos allows for code sharing within a single function. For example, an outer join needs to return two different types of rows: joined rows and NULL extended rows. Using functions and inlining with multiple outer joins, there is a risk of an exponential growth in code size [18]. Using gotos, the code always grows linearly with the number of operators.

There are cases where it does not make sense to generate custom code. For example, the sort operator is best implemented using a generic sort implementation with a callback function to compare records. Some functions (e.g., non-trivial math functions) are either sufficiently complex or expensive that it makes sense to include them in a library and call them from the generated code.

3.3 Restrictions

With minor exceptions, compiled stored procedures look and feel just like any other T-SQL stored procedures. We support most of the T-SQL imperative surface area including parameter and variable declaration and assignment as well as control flow and error handling (IF, WHILE, RETURN, TRY/CATCH, and THROW). The query surface area is a bit more limited but we are expanding it rapidly. We support SELECT, INSERT, UPDATE, and DELETE. Queries currently can include filters, inner joins, sort and top sort, and basic scalar and group by aggregation.

In an effort to minimize the number of run time checks and operations that must be performed each time a compiled stored procedure is executed, we do impose some requirements.

First, unlike a conventional stored procedure which upon execution can inherit run time options from the user’s environment (e.g., to control the behavior of NULLs, errors, etc.), compiled stored procedures support a very limited set of options and those few options that can be controlled must be set at compile time only. This policy both reduces the complexity of the code generator and improves the performance of the generated code by eliminating unnecessary run time checks.

Second, compiled stored procedures must execute in a security or user context that is predefined when the procedure is created rather than in the context of the user who executes the procedure. This requirement allows us to run all permission checks once at procedure creation time instead of once per execution.

Third, compiled stored procedures must be schema bound. This restriction means that once a procedure is created, any tables referenced by that procedure cannot be dropped without first dropping the procedure. This requirement avoids the need to acquire costly schema stability locks before executing the procedure.

Fourth, compiled stored procedures must execute in the context of a single transaction. This requirement is enforced through the use of the BEGIN ATOMIC statement (and the prohibition of explicit BEGIN, COMMIT, and ROLLBACK TRANSACTION statements) and ensures that a procedure does not have to block or context switch midway through to wait for commit.

Finally, as we are building a commercial database product, we must take security into account in all aspects of the design. We were particularly concerned about the possibility of a "C injection attack" in which a malicious user might include C code in a T-SQL identifier (e.g., a table, column, procedure, or variable name) or string literal in an attempt to induce the code generator to copy this code into the generated code. Clearly, we cannot allow the execution of arbitrary C code by non-administrative users. To ensure that such an attack is not possible, we never include any user identifier names or data in the generated code even as comments and we convert string literals to a binary representation.

3.4 Query Interop

Compiled stored procedures do have limitations in the current implementation. The available query surface area is not yet complete and it is not possible to access regular tables from a compiled stored procedure. Recognizing these limitations, we implemented an additional mechanism that enables the interpreted query execution engine to access memory optimized tables. As noted in Section 3.1, we refer to this capability as interop. Interop enables several important scenarios including data import and export, ad hoc queries, and support for query functionality not available in compiled stored procedures (including queries and transactions that access both regular and Hekaton tables).

4 Performance

We measured the benefits of native compilation through a set of simple experiments where we compare the number of instructions executed by the interpreted query execution engine with the number of instructions executed by an equivalent compiled stored procedure.

For the first experiment, we isolated and compared the cost of executing the simple predicate

Item = ? and Manufacturer = ? and Price > ?

using both the interpreted expression evaluator and the equivalent native code. We measured the cost of this predicate when a) evaluated against data stored in a regular disk-based B-tree table, b) using interop to access data stored in a Hekaton table, and c) using a compiled stored procedure to access the same Hekaton table. We ran these experiments without a suitable index as our intent was to measure the cost of the predicate evaluation not the cost of an index lookup. Since short circuiting can impact the cost of evaluating the predicate, we also measured the best (first column does not match) and worst (all columns match) case costs. The results are shown in Table 1. We use the most expensive case (the interpreted expression evaluator against a regular table with a row with all columns matching) as the baseline and report all other results as a percentage of this baseline.
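For intuition, the compiled form of such a predicate is essentially a handful of direct, short-circuiting comparisons, since all column types are known at compile time. The sketch below is illustrative (`ItemRow` and its field names are invented), not the code Hekaton actually generates:

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical row layout for the predicate
 *   Item = ? and Manufacturer = ? and Price > ?            */
typedef struct {
    const char *item;
    const char *manufacturer;
    double      price;
} ItemRow;

/* Compiled form: three direct comparisons with no per-node dispatch,
 * no run-time type checks, and C's native && short-circuiting. */
static bool compiled_predicate(const ItemRow *r, const char *item,
                               const char *mfr, double min_price) {
    return strcmp(r->item, item) == 0 &&
           strcmp(r->manufacturer, mfr) == 0 &&
           r->price > min_price;
}
```

The evaluation stops at the first failing comparison, which is why the best case (first column does not match) is measurably cheaper than the worst case (all columns match).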

[Table 1: instruction counts for predicate evaluation, relative to the interpreted baseline]

Not surprisingly the compiled code is much more efficient (up to 10x fewer instructions executed) than the interpreted code. This improvement reflects the benefits of generating code where the data types and operations to be performed are known at compile time. For our second experiment, we ran the simple queries shown in Figure 5. The only difference between these queries is that the first outputs the results to the client while the second saves the result in a local variable. Because outputting to the client incurs relatively high overhead, the first query is considerably more expensive than the second. We ran the same three tests as for the first experiment comparing the interpreted query execution engine against a regular table, interop against a Hekaton table, and a compiled stored procedure against a Hekaton table. For this experiment we created an index on the Item and Manufacturer columns to simulate a more realistic scenario. The results are shown in Table 2. Once again we use the most expensive case (the interpreted query execution engine against a regular table with the results output to the client) as the baseline.

[Table 2: instruction counts for the queries in Figure 5, relative to the interpreted baseline]

Once again, we see that the compiled code is much more efficient (up to 15x fewer instructions executed) than the interpreted code. As in the first experiment, many of the gains come from compile time knowledge of the data types and operations. We are also able to eliminate many virtual function calls and conditional branches whose outcomes are known at compile time.

Finally, a comment regarding the cost of compilation. Hekaton’s primary focus is on OLTP queries where we compile once and execute thousands or millions of times. Thus, compiler performance was not a primary focus of this project. Nonetheless, for the vast majority of compiled stored procedures, the impact of compilation is not noticeable to the user. Most compilations complete in under one second and in many cases the cost of existing steps such as query optimization remains a significant fraction of the total end-to-end procedure creation cost. Some extremely complex procedures take longer to compile with the bulk of the time spent in compiler optimization. While such scenarios are rare, in the event that compilation takes unacceptably long, disabling some or all compiler optimizations generally improves compilation costs to an acceptable level albeit at some loss of runtime performance.

[Figure 5: the sample queries used in the second experiment]

5 Concluding Remarks

Hekaton is a new database engine targeted for OLTP workloads under development at Microsoft. It is optimized for large main memories and many-core processors. It is fully integrated into SQL Server, which allows customers to gradually convert their most performance-critical tables and applications to take advantage of the very substantial performance improvements offered by Hekaton. Hekaton achieves its high performance and scalability by using very efficient latch-free data structures, multiversioning, a new optimistic concurrency control scheme, and by compiling T-SQL stored procedure into efficient machine code. As evidenced by our experiments, the Hekaton compiler reduces the instruction cost for executing common queries by an order of magnitude or more.
