Paper Translation : Modern Column-Oriented Database Systems - Compression

codealy

已于 2023-06-09 10:37:29 修改

阅读量96

点赞数

分类专栏： Paper 文章标签：数据库大数据数据库架构

于 2023-04-24 21:14:30 首次发布

本文链接：https://blog.csdn.net/yueny/article/details/130352352

版权

Paper 专栏收录该内容

2 篇文章 0 订阅

订阅专栏

论文翻译，The design and implementation of modern column-oriented database systems Abadi et al., Foundations and trends in databases, 2012

Title: 现代列数据库系统的设计与实现

paper-reading-fntdbs2013-compression

如何学习这篇论文

阅读论文的摘要和引言，了解论文的主要内容和研究背景。
仔细阅读论文的每个章节，理解列式数据库系统的架构、优点、实现细节以及未来发展趋势。
尝试理解论文中使用的技术术语和概念，并查找相关资料进行深入学习。
尝试实践一些列式数据库系统，例如C-Store、MonetDB或VectorWise等，并尝试使用它们进行数据分析操作。

1. Overview

Data transfer costs from storage are often the major performance bottlenecks in database systems.

译文：存储层的数据传输成本通常是数据库系统的主要性能瓶颈。

Colume-stores are typically used in analytic applications, with queries that scan a large fraction of individual tables and compute aggregates or other statistics over them.

译文：列存储通常用于分析应用程序，查询扫描大部分单个表并计算聚合或其他统计数据。

2. Time cost 时间成本

Time cost can be separated into (时间成本分为)

寻道(seek times) 次数
传输时间(tranfer time)

After switching to an in-memory DBMS, the only ways to increase throughput is to reduce the number of instructions executed:

译文：切换到内存中 DBMS 后，提高吞吐量的唯一方法是减少执行的指令数：

To go 10x faster, the DBMS must execute 90% fewer instructions

译文：为了提高 10 倍的速度，DBMS 必须执行的指令减少 90%

To go 100x faster, the DBMS must execute 99% fewer instructions

One way to achieve such a reduction is through code specialization.

译文：一种实现方法: 通过 code specialization。

3. Advantage of Column-stores 优势

3.1 Column-store internals and advanced techniques 列存储背后的关键技术

we focus on Vectorized processing, late materialization, compression and database cracking.

and sophisticated indexing, materialized views, and vertical and horizontal partitioning.
译文：复杂的索引、物化视图和垂直和水平分区。

3.2 Block-oriented and Vectorized processing 面向块和矢量化处理

By passing cache-line sized blocks of tuples between operators, and operating on multiple values at a time, rather than using a conventional tuple-at-a-time iterator, column-stores can achieve substantially better cache utilization and CPU efficiency.
译文：通过在运算符之间传递 缓存行 大小的 元组 块，并且一次对多个值进行操作，而不是使用传统的 一次一个元组 的迭代器，列存储可以实现大幅提高缓存利用率和CPU效率。

概念

tuple-at-a-time: Operator 组成一个树，数据以此经过 Operator，一次处理一行数据。
column-at-a-time：一次处理一列，典型的是除了一列数据组成的数组

Database strategies for the query execution layer

查询执行层的数据库策略[3]

Volcano-style iterator model: tuple-at-a-time (火山式迭代器模型：元组-时间)
Full materialization
- In full materialization,each query operator works in isolation, fully consuming an, input from storage (disk, or RAM) and writing its output to storage. therefore may cause excessive resource utilization in queries that generate large intermediate results.
  译文：在完全物化中，每个查询操作者都是孤立工作的，完全消耗存储（磁盘或RAM）中的输入，并将其输出写入存储。因此，在产生大量中间结果的查询中，可能会造成过度的资源利用。
- a big intermediate result needs to be materialized, exceeding memory size, becomes an issue
Vectorized execution （矢量化执行）
- Pioneered in VectorWise, which strikes a balance(取得了平衡) between full materialization and tuple pipelining.
- This model separates query progress control logic from data processing logic.
  
  将查询进度控制逻辑与数据处理逻辑分离
- Regarding control flow, the operators in vectorized processing are similar to those in tu-ple pipelining, with the sole distinction that the next() method of eachoperator returns a vector of N tuples as opposed to only a single tuple.
  
  唯一的区别是每个运算符的next（）方法返回N个元组的矢量，而不是仅返回一个元组。
- vectorized execution combines pipelining (avoidance of material-ization of large intermediates 避免产生大量中间结果) with the array-loops code patterns that make fast.

The typical size for the vectors

The typical size for the vectors used in vectorized processing is such that each vector comfortably fits in L1 cache as this minimizes reads and writes throughout the memory hierarchy (可以最大限度地减少整个内存层次的读写).

Advantage with vectorized processing

Reduced interpretation overhead. 减少了解释的开销
- The amount of function calls performed by the query interpreter goes down by a factor equal to the vector size compared to the tuple-at-a-time model.（与 tuple-at-a-time 模型相比，查询解释器执行的函数调用量减少了一个与矢量大小相等的系数。）
  - On computationally intensive queries, e.g., TPC-H Q1, this can improve performance by two orders of magnitude. (在计算密集型查询中，例如TPC-H Q1，这可以将性能提高两个数量级。)
Better cache locality. 更好的缓存局部性
- control now stays for as many iterations as the vector size in the same primitive function, thus creating instruction locality. (VectorWise可调整向量大小，以便将计算查询所需的所有向量轻松地放在CPU缓存中。控制在同一原语函数中停留的迭代次数与向量大小相同，从而创建了指令局部性)
Compiler optimization opportunities. 编译器优化的可能性
- 正如MonetDB的描述中所提到的，vectorized primitives 通常在数组上进行严格紧密的循环，通常在数组上执行紧密的循环，适用于一些最有效的编译器优化。而且通常还会触发编译器来生成SIMD指令。
Block algorithms Block 算法
- a tuple-at-a-time execution model performs the check for every tuple, while a vectorized algorithm can first check if the output buffer has space for N more results(矢量化算法可以首先检查输出缓冲区是否有空间容纳N个以上的结果), and if so, do all the work on the vector without any checking.
Parallel memory access 并行内存访问
- Algorithms that perform memory accesses in a tight vectorized loop on modern(现代) CPUs are able to generate multiple outstanding cache misses, for different values in a vector. This is because when a cache miss occurs, modern CPUs can speculate ahead in such tight loops.
  译文：在现代CPU上，以严格的矢量化循环中执行内存访问的算法能够针对矢量中的不同值生成多个未完成的缓存未命中。这是因为当缓存未命中发生时，现代CPU可以在这种紧密的循环中提前进行推测
  
  Vectorize memory lookups in all major relational database operators, e.g., sorting, hash-table lookup, as well as hash-table probing. Such lookups often incur cache-misses, in which case code that through out-of-order speculation generates multiple parallel misses often performs four times faster than non-vectorized memory lookups.
  译文：在所有主要的关系数据库操作中，例如排序、哈希表查找以及哈希表探测，都要对内存查找进行矢量处理。这样的查找通常会导致缓存未命中，在这种情况下，通过无序推测生成多个并行未命中的代码的执行速度通常是非矢量化内存查找的四倍
Profiling 分析洞察
Since vectorized implementations of relational operators perform all expression evaluation work in a vectorized fashion, i.e., array-at-a-time for hundreds or even thousands of tuples in one go, the overhead of keeping performance profiling measurements for each individual vectorized operation is low (as bookkeeping cost is amortized over hundreds or thousands of tuples). This allows vectorized engines to provide highly detailed performance insight into where CPU cycles are spent.
译文：由于关系运算符的矢量实现以矢量方式执行所有的表达式评估工作，即一次对数百甚至数千个图元进行数组处理，因此为每个单独的矢量操作保持性能分析测量的开销很低（因为记账成本在数百或数千个图元中被摊销）。这使得矢量化引擎能够提供对CPU周期使用情况的高度详细的性能洞察。
Adaptive execution 自适应执行
Building on the latter point, performance profile information on vectorized primitives can also be exploited at run-time, during the execution of a query. For example, Vectorwise decides adaptively in case of arithmetic operations on vectors where only a subset of the values in the arrays is selected by some predicate, whether to compute the result only for the selected tuples iteratively, or for all tuples in the array. The latter strategy, while performing extra work, leads to a tight loop without if-then-else, where SIMD instructions can be used, making it overall faster as long as the percentage of selected tuples is relatively high. (后一种策略虽然执行了额外的工作，但导致了一个没有if-then-else的紧密循环，其中可以使用SIMD指令，只要所选 tuples 的比例相对较高，就会使其总体上更快。)

Vectorized Processing summarize :
Typically, sequential operators (project, selection) work best on vertical vectors (exploiting automatic memory prefetching and SIMD opportunities), whereas random access operator(hash-join or -aggregation) work best using blocks of horizontal records, due to cache locality.
译文：通常，顺序运算符（project, selection）在垂直向量上的效果最好（利用自动内存预取和SIMD机会），而随机访问运算符（hash-join或-aggregation）由于高速缓存的局部性，使用 blocks of horizontal(水平) records 的效果最好。

Since conversion between horizontal and vertical formats is cheap using vectorized execution, this creates the possibility that a query plan would change the tuple-layout as part of the query plan, possibly multiple times. This opens a new ground for query optimizers of query layout planning that should determine the best layout(确定最佳布局) for each stage of the query execution plan using cost-based estimation(使用基于成本的估计).

Vectorized Processing 相比 tuple-at-a-time 的优势

大幅减少函数调用次数
数组的处理方式可以更好的利用 CPU cache 的局部性
编译器可以利用 SIMD 指令集加速计算
特定的块算法可以简化计算
可以以更小的代价做 Profiling
适应性执行：动态选择最优实现（多臂老虎机问题）

3.3 Column-specific Compression 特定列的压缩

同一列的数据放一起信息熵要远低于来自不同列的数据。

Advantage

A major advantage of column-stores relative to row-stores is improved compression ratio.(相对于行存储，列存储的一个主要优势是提高了压缩率。)
Compression algorithms perform better on data with low information entropy (i.e., with high data value locality), and values from the same column tend to have more value locality than values from different columns.(压缩算法在信息熵低（即具有高数据值局部性）的数据上表现更好，并且来自同一列的值倾向于具有比来自不同列的值更多的值局部性。)
Compressing one column-at-a-time
This has two positive side-effects that strengthen the use of compression in column-stores; first, compression algorithms may be able to compress more data with the same common patterns as more data of the same type fit in a single page when storing data of just one attribute, and second, more similar data implies that in general the data structures, codes, etc. used for compression will be smaller and thus this leads to better compression. Furthermore, if the data is sorted by one of the columns, which is common with column-store projections, that column will be super-compressible (for example, runs of the same value can be run-length encoded).

译文：这有两个正向作用，加强了列存储中压缩的压缩率。首先，当存储只有一个属性的数据时，压缩算法可能能够压缩 具有相同公共模式 的更多数据，有更多相同类型的数据适合在一个Page中；其次，更多相似的数据意味着通常用于压缩的数据结构、代码等将更小，从而导致更友好的压缩。此外，如果数据按其中一列排序的（这在列存储投影中很常见），那么该列将是超可压缩的（例如，可以对相同值进行 长度编码）。
Exploiting extra CPU cycles 利用额外的CPU周期
Usually, the bottom line goal of a database system is performance, i.e., processing one or more queries as fast as possible, not compression ratio.

译文：通常情况下，数据库系统的底线目标是性能，即尽可能快地处理一个或多个查询，而不是压缩率。

Disk space is cheap, and is getting cheaper rapidly. However, compression does improve performance (in addition to reducing disk space); if data is compressed, then less time is spent in I/O during query processing as less data is read from disk into memory (and from memory to CPU).

译文：磁盘空间很便宜，而且正在迅速变便宜。然而，压缩确实提高了性能（除了减少磁盘空间）；如果数据被压缩，那么在查询处理过程中花在I/O上的时间就会减少，因为从磁盘读到内存（以及从内存读到CPU）的数据会减少。

Another important motivation here is that as CPUs are getting much faster compared to memory bandwidth, the cost of accessing data costs more in terms of CPU cycles than it did in the past.

译文：另一个重要的动机是，与内存带宽相比，CPU的速度越来越快，访问数据的成本在CPU周期方面比过去要高。

Intuitively, this means that now we have more CPU cycles to spare in decompressing compressed data fast which is preferable to transferring uncompressed and thus bigger data at slow speeds (in terms of waisted CPU cycles) through the memory hierarchy.

译文：直观地说，这意味着现在我们有更多的CPU周期可以用来快速解压压缩数据，这比在内存层次中以缓慢的速度（就放弃的CPU周期而言）传输未压缩的、更大的数据要好。
Fixed-width arrays and SIMD 固定宽度的数组和SIMD
Light-weight compression schemes that compress a column into mostly fixed-width (smaller) values (with exceptions handled carefully) are often preferred, since this allows a compressed column to be treated as an array.

译文：轻量级的压缩方案，将一列压缩成大部分固定宽度（较小）的值（小心处理例外情况），通常是首选，因为这允许将压缩的一列作为一个数组来处理。

Iterating through such an array (e.g., for decompression) can leverage the SIMD instruction set on modern CPUs for vectorized parallelism (as described above), significantly improving performance. With SIMD instructions we can decompress or process multiple compressed values with one instruction as long as they are packed into fixed-width and dense arrays (that nicely fit into SIMD registers of modern processors), maximizing parallelism.

译文：遍历这样一个数组（例如，用于解压）可以利用现代CPU上的SIMD指令集来实现矢量并行（如上所述），从而显著提高性能。利用SIMD指令，我们可以用一条指令解压或处理多个压缩值，只要它们被打包成固定宽度的密集数组（很适合现代处理器的SIMD寄存器），就可以最大限度地提高并行性。

Since columnstores exploit fixed-width dense arrays anyway, they can exploit SIMD execution even with uncompressed data.

译文：由于列存储无论如何都要利用固定宽度的密集数组，所以即使是未压缩的数据，它们也可以利用SIMD执行。

A modern processor has SIMD registers that typically fit 4 4-byte integers at a time and thus a column-store without compression may process 4 values at a time in this case. If data is compressed by a factor of 2 though, then we will be able to fit 8 compressed integers in the SIMD register and we can process 8 values at a time, increasing parallelism.

译文：现代处理器的SIMD寄存器通常一次可以容纳4个4字节的整数，因此在这种情况下，一个没有压缩的列存储可能一次处理4个值。如果数据被压缩了2倍，那么我们将能够在SIMD寄存器中容纳8个压缩的整数，我们可以一次处理8个值，增加并行性。
Compression algorithms 压缩(编码，压缩)算法
There are numerous possible compression schemes that can be applied, i.e., run-length encoding, bit-vector encoding, dictionary compression and patching.

译文：有许多可能的压缩方案可以应用，即运行长度编码、位矢量编码、字典压缩和修补。

Run-Length Encoding

Run-length encoding (RLE) is a form of lossless data compression in which runs of data (sequences in which the same data value occurs in many consecutive data elements) are stored as a single data value and count, rather than as the original run.

运行长度编码（RLE）是一种无损数据压缩的形式，其中数据运行（同一数据值在许多连续的数据元素中出现的序列）被存储为单一的数据值和计数，而不是作为原始运行。

it is well-suited for columns that are sorted or that have reasonable-sized runs of the same value.

它非常适合已排序或具有相同值的合理大小的列。

These runs are replaced with triples: (value, start position, runLength): These runs are replaced with triples: value, start postion, runLength. For example, if the first 42 elements of a column contain the value ‘M’, then these 42 elements can be replaced with the triple: (‘M’, 1, 42).

上述符合的列会被替换为如下triple的结构，包含三要素：值、起始位置、运行长度。例如，如果一列的前42个元素含有值’M’，那么这42个元素可以被替换成 triple： (‘M’, 1, 42).

Bit-Vector Encoding 位向量编码

Bit-Vector encoding is most useful when columns have a limited number of possible data values.

当列具有有限数量的可能数据值时，Bit-Vector encoding是最有用的。

However, it can be used even for columns with a large number of values if the bit-vectors are further compressed.

然而，如果位向量被进一步压缩，它甚至可以用于有大量数值的列。

例如：1132231，可以被如下3个bit-strings 所表示：
bit-string for value 1: 1100001
bit-string for value 2: 0001100
bit-string for value 3: 0010010

this scheme can be used to index row-stores (so-called bit-map indices(位图索引))

Dictionary 字典

Dictionary compression normally lends itself to optimizing queries by rewriting predicates on strings into predicates on integers (which are faster),but this is easiest to accomplish with a global dictionary.

字典压缩通常通过将字符串上的谓词改写为整数上的谓词来优化查询（速度更快），这是使用全局字典最容易的实现。

One benefit of dictionary compression is that it can result in fixed width columns if the system chooses all codes to be of the same width. This requires sacrificing a little bit in terms of the ultimate storage gains, but allows for CPU efficient access patterns.

字典压缩的一个好处是，如果系统选择所有代码都具有相同的宽度，那么它可以产生固定宽度的列。这需要在最终存储增益方面牺牲一点，但允许CPU高效访问模式。

One practical point of consideration is how to dictionary compress efficiently, which depends on fast hashing. One particularly fast technique is cuckoo hashing.