Column-oriented DBMS

wh62592855

A column-oriented DBMS is a database management system (DBMS) that stores its content by column rather than by row. This has advantages for databases such as data warehouses and library catalogues, where aggregates are computed over large numbers of similar data items. Some benefits of either column-oriented or row-oriented organization can be achieved with any database. Calling a system column-oriented refers both to the ease of expressing a column-oriented structure and to its focus on optimizations for column-oriented workloads. ^[1][2] This approach is contrasted with row-oriented (row store) databases and with correlation databases, which use a value-based storage structure.

Description

A database program must present its data as two-dimensional tables of columns and rows, but store it as one-dimensional strings. For example, a database might have the following table.

EmpId	Lastname	Firstname	Salary
1	Smith	Joe	40000
2	Jones	Mary	50000
3	Johnson	Cathy	44000

This simple table includes an employee identifier (EmpId), name fields (Lastname and Firstname) and a salary (Salary).

This table exists in the computer's memory (RAM) and storage (hard drive). Although RAM and hard drives differ mechanically, the operating system abstracts over the difference. Even so, the database must flatten its two-dimensional table into a one-dimensional series of bytes for the operating system to write to RAM, to the hard drive, or to both.

A row-oriented database serializes all of the values in a row together, then the values in the next row, and so on.

      1,Smith,Joe,40000;2,Jones,Mary,50000;3,Johnson,Cathy,44000;

A column-oriented database serializes all of the values of a column together, then the values of the next column, and so on.

      1,2,3;Smith,Jones,Johnson;Joe,Mary,Cathy;40000,50000,44000;
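The two serializations above can be reproduced with a short Python sketch. It is only an illustration of the idea: the `rows` list and both function names are hypothetical, and a real DBMS writes binary pages rather than comma-separated text.

```python
# Hypothetical in-memory copy of the example table (header row omitted).
rows = [
    (1, "Smith", "Joe", 40000),
    (2, "Jones", "Mary", 50000),
    (3, "Johnson", "Cathy", 44000),
]

def serialize_row_oriented(rows):
    """Write each row's values together, one row after another."""
    return "".join(",".join(str(v) for v in row) + ";" for row in rows)

def serialize_column_oriented(rows):
    """Write each column's values together, one column after another."""
    columns = zip(*rows)  # transpose: rows become columns
    return "".join(",".join(str(v) for v in col) + ";" for col in columns)

row_bytes = serialize_row_oriented(rows)
col_bytes = serialize_column_oriented(rows)
print(row_bytes)  # 1,Smith,Joe,40000;2,Jones,Mary,50000;3,Johnson,Cathy,44000;
print(col_bytes)  # 1,2,3;Smith,Jones,Johnson;Joe,Mary,Cathy;40000,50000,44000;
```

The only difference between the two functions is the transpose step, which is the whole distinction between the two layouts at this level of simplification.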

This is a simplification. Partitioning, indexing, caching, views, OLAP cubes, and transactional mechanisms such as write-ahead logging or multiversion concurrency control all dramatically affect the physical organization. That said, online transaction processing (OLTP)-focused RDBMSs tend to be more row-oriented, while online analytical processing (OLAP)-focused systems balance row-oriented and column-oriented organization.

Benefits

Comparisons between row-oriented and column-oriented systems are typically concerned with the efficiency of hard-disk access for a given workload, because seek time is very long compared with the other delays in a computer. Further, because seek time is improving slowly relative to CPU power (see Moore's Law), this focus will likely continue on systems that rely on hard disks for storage. The following over-simplified observations sketch the trade-offs between column-oriented and row-oriented organization. Note: as in-memory technology advances and RAM prices fall, hard-disk access time is becoming a less important factor.

Column-oriented systems are more efficient when an aggregate needs to be computed over many rows but only for a notably smaller subset of all columns of data, because reading that smaller subset of data can be faster than reading all data.
Column-oriented systems are more efficient when new values of a column are supplied for all rows at once, because that column data can be written efficiently and replace old column data without touching any other columns for the rows.
Row-oriented systems are more efficient when many columns of a single row are required at the same time, and when row-size is relatively small, as the entire row can be retrieved with a single disk seek.
Row-oriented systems are more efficient when writing a new row if all of the column data is supplied at the same time, as the entire row can be written with a single disk seek.
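The first and third observations can be made concrete by counting how many stored values each layout touches. This is a minimal sketch under simplifying assumptions: `row_store`, `column_store`, and the three functions are hypothetical names, and "values read" stands in for disk I/O, which a real system measures in pages and seeks.

```python
# Hypothetical layouts of the example table; names are illustrative only.
row_store = [
    (1, "Smith", "Joe", 40000),
    (2, "Jones", "Mary", 50000),
    (3, "Johnson", "Cathy", 44000),
]
column_store = {
    "EmpId": [1, 2, 3],
    "Lastname": ["Smith", "Jones", "Johnson"],
    "Firstname": ["Joe", "Mary", "Cathy"],
    "Salary": [40000, 50000, 44000],
}

def total_salary_row_store(table):
    """Sum salaries from a row layout; whole rows are read."""
    total, values_read = 0, 0
    for row in table:
        values_read += len(row)  # every field of the row comes off storage
        total += row[3]          # only the Salary field is actually used
    return total, values_read

def total_salary_column_store(table):
    """Sum salaries from a column layout; only one column is read."""
    salaries = table["Salary"]
    return sum(salaries), len(salaries)

def fetch_row_column_store(table, index):
    """Rebuild a single row from a column layout; one access per column."""
    row = tuple(column[index] for column in table.values())
    return row, len(table)

print(total_salary_row_store(row_store))        # (134000, 12)
print(total_salary_column_store(column_store))  # (134000, 3)
print(fetch_row_column_store(column_store, 1))  # ((2, 'Jones', 'Mary', 50000), 4)
```

The aggregate touches 3 values in the column layout against 12 in the row layout, while rebuilding one row from the column layout needs a separate access for each of its 4 columns.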

In practice, row-oriented architectures are well-suited for OLTP-like workloads, which are more heavily loaded with interactive transactions. Column stores are well-suited for OLAP-like workloads (e.g., data warehouses), which typically involve a smaller number of highly complex queries over all data (possibly terabytes). However, there are a number of proven row-based OLAP RDBMSs that handle terabytes, or even petabytes, of data, such as Teradata.

Storage efficiency vs. random access

Column data is of uniform type; therefore, column-oriented data offers some opportunities for storage size optimization that are not available in row-oriented data. For example, many popular modern compression schemes, such as LZW, exploit the similarity of adjacent data. The same techniques may be used on row-oriented data, but a typical implementation will achieve less effective results. The effect becomes more dramatic when a large percentage of adjacent column data is either identical or absent, as in a sparse column (similar to a sparse matrix).

The opposing trade-off is random access. Retrieving all data from a single row is more efficient when that data is stored in a single location, as in a row-oriented architecture. Further, the more compression is gained from adjacent values, the more difficult random access may become, because data might need to be uncompressed before it can be read. Therefore, column-oriented architectures are sometimes enriched by additional mechanisms aimed at minimizing the need to access compressed data^[3].
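Both sides of this trade-off show up in a small Python sketch. Run-length encoding is used here as a simple stand-in for adjacency-based compression; it is not the LZW scheme mentioned above, and the `department` column and all function names are hypothetical.

```python
from itertools import groupby

def run_length_encode(values):
    """Collapse each run of equal adjacent values into (value, run_length)."""
    return [(value, len(list(group))) for value, group in groupby(values)]

def run_length_decode(runs):
    """Expand (value, run_length) pairs back into the original sequence."""
    return [value for value, count in runs for _ in range(count)]

def value_at(runs, index):
    """Random access on encoded data: walk the runs until the index is covered."""
    if index < 0:
        raise IndexError(index)
    position = 0
    for value, count in runs:
        if index < position + count:
            return value
        position += count
    raise IndexError(index)

# Hypothetical sparse column: repeated values and a stretch of missing ones.
department = ["Sales"] * 4 + [None] * 3 + ["Support"] * 2

encoded = run_length_encode(department)
print(encoded)               # [('Sales', 4), (None, 3), ('Support', 2)]
print(value_at(encoded, 5))  # None
```

Nine stored values shrink to three runs, but reading position 5 is no longer a direct lookup: `value_at` has to scan the runs from the start. An interleaved row layout would rarely produce such long runs, because neighbouring values belong to different columns.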
