Columnar Stores: When/How/Why?

Long ago, data storage was simple — heapfiles and b-trees and that’s it. Today the options are overwhelming — ORC, Parquet, and Avro on HDFS or S3; an RDBMS solution like PostgreSQL or MariaDB; or commercial ones like Oracle and DB2. Even within RDBMS engines and cloud services there are many options!

This guide is a “random walk” into the broad realm of storage.

Sample Dataset

I pulled the Chicago public workers’ salary info. Did you know public salaries are published w/ the REAL names of each person? Look up your favorite police officer or commissioner’s salary for fun (it won’t include overtime, perks, or bribes, unfortunately).

[Image: sample rows from the salary dataset]
I added a few fake people, the Foo & Bar families. Otherwise it’s real data.

Column Oriented vs Row Oriented

First, the basic storage mechanics of data: row and column orientation.

Row-oriented storage keeps each individual record together: Doug Foo’s full record, then Jane Foo’s, and so on in sequence.

[Doug,Foo,Foostack,Programmer,M,120000] [Jane,Foo,Foostack,Programmer,F,110000] [Doug,Bar,...] ..

Column-oriented (columnar) storage keeps all of each column’s data together — so all first names (Doug, Jane, etc.), then all the last names, titles, …, and finally all salaries.

[Doug,Jane,Doug,Jane,...] [Foo,Foo,Bar,Bar,...] [Programmer,Programmer,...] ... [M,F,M,F,...] [120000,110000,100000,130000,...] ..
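To make the two layouts concrete, here is a toy sketch in Python (not a real storage engine; the “Barstack” company value is made up for illustration):

    # Toy layout demo: the same four records stored row-wise vs column-wise.
    rows = [
        ("Doug", "Foo", "Foostack", "Programmer", "M", 120000),
        ("Jane", "Foo", "Foostack", "Programmer", "F", 110000),
        ("Doug", "Bar", "Barstack", "Programmer", "M", 100000),
        ("Jane", "Bar", "Barstack", "Programmer", "F", 130000),
    ]

    # Columnar: transpose so each field becomes one contiguous sequence.
    first, last, company, title, gender, salary = map(list, zip(*rows))

    print(salary)  # [120000, 110000, 100000, 130000] -- one sequential scan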

Data is generally read off disk in blocks of KB (or MB/GB), so a single read for one record brings in a lot more than you may want.

Typical Block Sizes [*1]:

  • HDFS: 128MB (chunks)
  • Google FS: 64MB (chunks)
  • Amazon AWS S3: 128KB
  • Oracle Data Warehouse: 32KB
  • Postgres DB: 8KB
  • Linux and Windows filesystems: 4KB

Salary records are pretty small, ~50 chars (<50 bytes). An 8KB block with metadata could easily store 100+ row-oriented records [8192 / 50 = 163]. So a single read of the block containing “Doug Foo”’s record pulls in a lot of neighboring records too.

For columnar block reads, the payload varies with column size. For example, with 8KB blocks (see the sketch after this list):

  • First Name may avg 10 bytes, meaning upwards of ~800 names.
  • Salary could fit in a 32-bit int (4 bytes), yielding ~2000 entries!
  • Gender is M/F/? — could be stored as 1 byte or even a half nibble ~ 8000!
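Quick back-of-envelope math on entries per 8KB block, under the same size assumptions as the bullets above:

    # Entries per 8KB block (assumes 50-byte rows, 10-byte names,
    # 4-byte int salaries, 1-byte genders -- as estimated above).
    BLOCK = 8192
    print(BLOCK // 50)  # ~163 full row-oriented records
    print(BLOCK // 10)  # ~819 first names
    print(BLOCK // 4)   # ~2048 int32 salaries
    print(BLOCK // 1)   # ~8192 one-byte genders (more if bit-packed)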

This will be important to keep in mind as we do more analysis later.

Data Transformation

I wanted to add an M/F attribute, so I merged in a baby names database and then wrote out the various pandas-supported formats (JSON, CSV, and Parquet).

Basic pandas to gen the data:
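A minimal sketch of the idea (file and column names here are illustrative, not necessarily the exact ones used):

    import os
    import pandas as pd

    # Merge a gender flag from a baby-names list, then write each format.
    sal = pd.read_csv("chicago_salaries.csv")    # hypothetical input file
    names = pd.read_csv("baby_names.csv")        # hypothetical: name, gender
    df = sal.merge(names, left_on="first_name", right_on="name", how="left")

    df.to_csv("salaries.csv", index=False, header=False)
    df.to_json("salaries.json", orient="records")
    df.to_parquet("salaries.parquet", compression="snappy")

    for f in ("salaries.csv", "salaries.json", "salaries.parquet"):
        print(f, os.path.getsize(f), "bytes")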

Note the difference in file sizes for equivalent disk formats (38k records):

  • CSV without a header is 2.526 MB
  • Parquet (columnar) with compression is 606 KB (~1/4 of CSV)
  • JSON is 4.615 MB (+90% larger than CSV, +700% larger than Parquet)


Parquet is a binary compressed format, so it takes some effort to do a quick peek — details later.

[Image: hex dump of the Parquet file]
Notice something weird: it only shows Doug and Jane once, even though we have 2 Dougs and 2 Janes. And the first names are clustered together.
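That clustering and deduping comes from Parquet’s per-column dictionary encoding. A quicker way to peek than a hex editor, assuming pyarrow is installed (file name from the earlier sketch):

    import pyarrow.parquet as pq

    pf = pq.ParquetFile("salaries.parquet")    # file from the pandas sketch
    print(pf.metadata)                         # rows, row groups, created_by
    print(pf.schema)                           # column names and types
    print(pf.metadata.row_group(0).column(0))  # per-column encodings and sizes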

Why not store as CSV or JSON?

Obvious, but let’s spell it out:

  • Bad space utilization (numbers as strings waste space)
  • No type or structural checking (chars could wind up in numeric fields)
  • CSV — no metadata/header info; JSON — repeated meta/formatting
  • No native compression of repeating values
  • No native indexing/search ability

Note CSV and JSON have the distinct advantage of being very human-readable, and thus are sometimes good interchange formats.

Deep dive into Postgres format (RDBMS)

Postgres is a full-featured open source DB that has both traditional row-based storage (sometimes called “heapfiles”) as well as a columnar store extension (cstore_fdw).

I created a schema and loaded the salary data into two variants: a standard row version and a columnar version (a sketch of the setup follows).
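A rough sketch of that setup via psycopg2 (table and column names are illustrative; assumes the cstore_fdw extension is installed on the server):

    import psycopg2

    conn = psycopg2.connect("dbname=salaries")
    cur = conn.cursor()

    # Standard row-oriented (heapfile) table.
    cur.execute("""CREATE TABLE salaries_row (
        first_name text, last_name text, title text,
        gender char(1), salary int)""")

    # Columnar variant via the cstore_fdw foreign data wrapper.
    cur.execute("CREATE EXTENSION IF NOT EXISTS cstore_fdw")
    cur.execute("CREATE SERVER cstore_srv FOREIGN DATA WRAPPER cstore_fdw")
    cur.execute("""CREATE FOREIGN TABLE salaries_col (
        first_name text, last_name text, title text,
        gender char(1), salary int)
        SERVER cstore_srv OPTIONS (compression 'pglz')""")

    # Load the CSV into the row table, then copy across.
    with open("salaries.csv") as f:
        cur.copy_expert("COPY salaries_row FROM STDIN WITH CSV", f)
    cur.execute("INSERT INTO salaries_col SELECT * FROM salaries_row")
    conn.commit()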

Some basic stats/comparisons between the two when querying 38k records:

1. Select count(*) → the column store is slightly faster (7.9 vs 10.9 ms) and has a lower “cost”, 480 vs 873 (assume this is a metric of estimated ops/reads). In theory we read fewer blocks with a columnar store, since you just need to scan through a single column to get the count, and it is compressed.

[Image: EXPLAIN output for the count(*) test]
count(*) test, just a tad faster on columnar

2. Select sum(salary) → this should be a classic case for columnar, where we only need to scan one column or sequence of blocks, while row-based needs to scan all blocks to extract the salary from each row. But it is not as fast as we’d expect. The cost is cheaper, 493 vs 873, but the timing is only 14.1ms vs 17.1ms.

[Image: EXPLAIN output for the sum(salary) test]
38k rows; even on a single-column agg it’s not a whole lot faster

3. Larger datasets — I pumped this up to 3.8m rows and re-ran the sum query. Not much change — somewhat surprising! In theory the gains should be much bigger than the 20–30% we saw. In fact, columnar is often slower after caching takes effect! (To be fair to columnar, you really need GB-to-TB datasets that are larger than your memory caches to see real gains.) [*2]

[Image: EXPLAIN output at 3.8m rows]
3.8m rows, still not a whole lot faster w/ columnar!
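The comparisons above boil down to running the same aggregate through EXPLAIN ANALYZE against both tables. A sketch, continuing the connection from the earlier snippet (numbers will of course differ per machine):

    # Compare planner cost and actual runtime for the two layouts.
    for table in ("salaries_row", "salaries_col"):
        cur.execute(f"EXPLAIN ANALYZE SELECT sum(salary) FROM {table}")
        print(table)
        for (line,) in cur.fetchall():
            print(" ", line)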

After reading some bug reports and developer notes, my guess is that as an extension it is not as robust as standard row storage; native row formats get multi-core processing and buffer caches, for example. I should probably compare to a native column-store database…

Note the file sizes when I loaded 3.8m rows:

  • Standard row format: 321 MB
  • Columnar/compressed format: 72 MB

Huge space savings (almost 4.5x)! Notice the raw data dump for both row and columnar on disk (using “xxd -c 32” on linux/wsl2):

[Image: xxd dump of the row-format file]
Row orientation is pretty simple
[Image: xxd dump of the columnar file]
Columnar is spread out by columns w/ compressed substrings

Parquet — the most popular?

Parquet came out of Twitter and Cloudera (Hadoop) and is maintained as an Apache project. It is the de facto format for Spark, and as a result #1 in popularity. (Prior popular formats include ORC and RCFile.)

It is also natively supported by Python/pandas and offers multiple compression formats. The basic file format is columnar within groups of rows, and it stores a central schema and group-level metadata to allow parallel and partial reads.
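That layout is what enables column- and row-group-level partial reads. For example, with pyarrow (file name from the earlier sketch):

    import pyarrow.parquet as pq

    # Read just one column -- only the salary chunks are touched on disk.
    tbl = pq.read_table("salaries.parquet", columns=["salary"])
    print(tbl["salary"].to_pandas().sum())

    # Or read just one row group out of the file.
    pf = pq.ParquetFile("salaries.parquet")
    print(pf.num_row_groups, pf.read_row_group(0).num_rows)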

A modest deep dive into the raw file shows:

[Image: hex dump of the Parquet file internals]
Similar to Postgres CStore — columns are grouped in batches and compressed such that Doug and Jane only appear once in the entire file.

AVRO — row oriented w/ schema evolution

In tight competition with Parquet is Avro, which is a bit more than just a storage format, but we’ll focus only on the storage aspects. It is also an Apache project. [*3] Two main differences from Parquet:

  1. Better schema evolution
  2. Row-oriented storage

The schema and its changes (defaults for new and old rows) are embedded in the file. Adding a new column just requires defining a default, which makes slowly changing dimensions (SCD) and other changes easier to handle.

Note it is row-oriented, so you will not get some of the advantages like compression and performance that you do with Parquet — assuming your use case is more about columnar aggregation rather than record-by-record processing.

Avro code and data file example:

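A minimal sketch using the fastavro library (field names are illustrative); the “bonus” field’s default shows the schema-evolution point, since older records written without it still deserialize cleanly:

    from fastavro import parse_schema, reader, writer

    # Schema with a later-added field ("bonus") carrying a default.
    schema = parse_schema({
        "name": "Salary", "type": "record",
        "fields": [
            {"name": "first_name", "type": "string"},
            {"name": "last_name",  "type": "string"},
            {"name": "salary",     "type": "int"},
            {"name": "bonus",      "type": "int", "default": 0},
        ],
    })

    records = [
        {"first_name": "Doug", "last_name": "Foo", "salary": 120000, "bonus": 0},
        {"first_name": "Jane", "last_name": "Foo", "salary": 110000, "bonus": 0},
    ]

    with open("salaries.avro", "wb") as out:
        writer(out, schema, records)   # schema is embedded in the file

    with open("salaries.avro", "rb") as f:
        for rec in reader(f):          # reader recovers the schema from the file
            print(rec)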

Other Storage Options (MongoDB, Kafka, KDB, etc.)

NoSQL — name-value object stores like MongoDB, AWS Dynamo, and Cassandra store entire JSON (or arbitrary) objects by a distinct key. The API is simple, using basic key-value store (put/get) semantics. The high-level +/-’s (a minimal sketch follows the list):

  • + Simple programming interface
  • + Fast path to get your app up and running
  • - Inefficient to load multiple or complex data
  • - Index and query/search inflexibility
  • - Schema evolution challenges
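A minimal put/get sketch with pymongo (assumes a local mongod; database and collection names are illustrative):

    from pymongo import MongoClient

    db = MongoClient("mongodb://localhost:27017")["payroll"]

    # "put": store the whole record as one document.
    db.salaries.insert_one({"first_name": "Doug", "last_name": "Foo",
                            "title": "Programmer", "salary": 120000})

    # "get": fetch it back by a field lookup.
    print(db.salaries.find_one({"last_name": "Foo"}))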

Log-structured stores (and log-structured merge trees) and their innovative uses are a fascinating area — Kafka is the leader and innovator (don’t be fooled, it’s not just messaging, it’s really about logs, per co-creator Jay Kreps). [*4]

  • + Simple programming interface
  • + Scalable alternative to realtime message stores and databases
  • - Complex cluster administration
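The log-as-storage idea in miniature, using the kafka-python client (assumes a broker on localhost:9092; the topic name is illustrative):

    import json
    from kafka import KafkaProducer, KafkaConsumer

    # Append a record to the log...
    producer = KafkaProducer(bootstrap_servers="localhost:9092",
                             value_serializer=lambda v: json.dumps(v).encode())
    producer.send("salaries", {"first_name": "Doug", "salary": 120000})
    producer.flush()

    # ...then replay the topic from the beginning, as a consumer (or a
    # rebuilt downstream store) would.
    consumer = KafkaConsumer("salaries", bootstrap_servers="localhost:9092",
                             auto_offset_reset="earliest",
                             consumer_timeout_ms=1000)
    for msg in consumer:
        print(json.loads(msg.value))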

Finally, KDB deserves brief mention as a pioneer in columnar, realtime, and timeseries in-memory DBs. It was initially written in the ’90s and boasts unbeatable single-core performance even today. Unfortunately, it is an expensive niche product, thus sold off to First Derivatives recently and likely being bled out to die…

  • + Super fast and low footprint (800kb); fits in L1/L2 cache
  • + Leverages vector/SIMD CPU instructions
  • - Does not scale past 1 machine (doesn’t need to?)
  • - Cost and obscure language (K/Q) make it prohibitive for most consumers.

Apache Arrow (a memory columnar format and serializer) and OLAP tech are two areas worth looking into as well.

Summary: Top Considerations

  1. Does columnar fit your typical use cases (frequent aggregation on a few columns)?
  2. Does your existing database offer storage options like columnar storage?
  3. Is schema evolution/change common and a key consideration?
  4. Is storage space/capacity a key consideration?

Think it through and do a trial run. Relational DBs have made massive improvements in caching and indexing that may make columnar less necessary in some cases.

References, Footnotes, and Inspirations

Translated from: https://towardsdatascience.com/columnar-stores-when-how-why-2d6759914319
