CarbonData Dictionary Encoding

韩思明


INTRODUCTION

Encoding data reduces storage space and speeds up processing.

DESCRIPTION

Most databases and big data SQL data stores employ columnar encoding to compress data by storing small integers (surrogate values) instead of full string values. However, almost all existing databases and data stores divide the data into row groups of anywhere from a few thousand to a million rows and apply dictionary encoding only within each row group. As a result, the same column value can have different surrogate values in different row groups, so surrogate values must be converted back to actual values immediately after the data is read from disk.

CarbonData instead uses a global surrogate key, which means that a common dictionary is maintained for the full store on one machine/node. CarbonData can therefore perform all query processing work, such as grouping/aggregation and sorting, on lightweight surrogate values, and needs to convert surrogate values to actual values only for the final result. This improves performance in two ways. First, conversion from surrogate values to actual values is done only for the final result rows, which are far fewer than the rows read from the store. Second, all query processing and computation is done on lightweight surrogate values, which require less memory and CPU time than actual values.
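The difference between per-row-group and global dictionaries can be illustrated with a minimal Python sketch. The helper `build_dictionary`, the sample city names, and the rule of numbering values in order of first appearance are assumptions made for illustration; they are not taken from CarbonData's implementation.

```python
def build_dictionary(values):
    """Assign surrogate integers 1, 2, 3, ... in order of first appearance.

    The numbering rule is an assumption for this sketch only.
    """
    dictionary = {}
    for value in values:
        if value not in dictionary:
            dictionary[value] = len(dictionary) + 1
    return dictionary


# Two hypothetical row groups of the same string column.
row_group_a = ["Delhi", "Paris", "Delhi"]
row_group_b = ["Paris", "Tokyo", "Delhi"]

# Per-row-group dictionaries: each group numbers its values independently.
local_a = build_dictionary(row_group_a)
local_b = build_dictionary(row_group_b)

# Global dictionary: one mapping shared by every row group.
global_dict = build_dictionary(row_group_a + row_group_b)

# "Paris" gets a different surrogate in each row group ...
print(local_a["Paris"], local_b["Paris"])
# ... but a single surrogate under the global dictionary.
print(global_dict["Paris"])
```

With local dictionaries, surrogates from different row groups cannot be compared directly, which is why they have to be decoded right after reading. With one shared mapping, equal surrogates always mean equal values.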

ENCODING TECHNIQUE

Original Data

Figure 1

Dictionary Generation

All the Multi Dimensional Keys (MDK)* are compressed to lightweight surrogate values, which reduces memory usage. This encoding achieves data compression by storing small integers (surrogate values) instead of full string values. All nulls get the default value 0; other values are assigned surrogate values accordingly.
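Dictionary generation for a single column can be sketched as follows. Only the rule that nulls map to 0 comes from the text; the function name `generate_dictionary`, the sample values, and the choice to hand out 1, 2, 3, ... in order of first appearance are assumptions for illustration.

```python
NULL_SURROGATE = 0  # the text states that all nulls default to 0


def generate_dictionary(column_values):
    """Map each distinct column value to a small integer surrogate.

    Non-null values receive 1, 2, 3, ... in order of first appearance,
    which is an assumption of this sketch.
    """
    dictionary = {None: NULL_SURROGATE}
    next_surrogate = 1
    for value in column_values:
        if value not in dictionary:
            dictionary[value] = next_surrogate
            next_surrogate += 1
    return dictionary


# Hypothetical values of a "location" dimension column.
locations = ["Mumbai", None, "Berlin", "Mumbai", "Oslo"]
location_dict = generate_dictionary(locations)
print(location_dict)
```

Repeated values such as "Mumbai" share one dictionary entry, so the column stores one short integer per row and each full string only once.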

Figure 2

All query processing and computation, such as grouping/aggregation and sorting, is done on lightweight surrogate values, which require less memory and CPU time than actual values.

Dictionary Encoding

After the dictionary is generated (that is, the surrogate values for the column values), the table data is updated with the new surrogate values.
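The encoding step can be sketched as a lookup that swaps each dimension value for its surrogate. The dictionaries, column names, and sample rows below are hypothetical, and leaving the numeric `sales` measure unencoded is an assumption of this sketch rather than something the text specifies.

```python
# Hypothetical per-column dictionaries (nulls map to 0, as stated in the text).
dictionaries = {
    "location": {None: 0, "Mumbai": 1, "Berlin": 2},
    "month": {None: 0, "Jan": 1, "Feb": 2},
}
dimension_columns = ["location", "month"]

# Rows of (location, month, sales); sales is a measure in this sketch.
rows = [
    ("Mumbai", "Feb", 120),
    ("Berlin", "Jan", 75),
    (None, "Feb", 30),
]


def encode_row(row):
    """Replace every dimension value in the row with its surrogate."""
    *dims, measure = row
    surrogates = tuple(
        dictionaries[column][value]
        for column, value in zip(dimension_columns, dims)
    )
    return surrogates + (measure,)


encoded_rows = [encode_row(row) for row in rows]
print(encoded_rows)
```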

Figure 3

Sorting (on MDK: Multi Dimensional Keys)

The multi dimensional keys are then sorted, and the table data is rearranged accordingly.
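A minimal sketch of this step, assuming each row's MDK is the tuple of its dimension surrogates and that keys are compared lexicographically (an assumption; the text does not spell out the comparison rule). The sample rows are hypothetical.

```python
# Each row: (MDK as a tuple of surrogates, measure value). Sample data only.
encoded_rows = [
    ((2, 1), 75),
    ((1, 2), 120),
    ((0, 2), 30),
    ((1, 1), 50),
]

# Python compares tuples lexicographically, so sorting on the key tuple
# orders rows by the first dimension, then by the second.
sorted_rows = sorted(encoded_rows, key=lambda row: row[0])

for mdk, measure in sorted_rows:
    print(mdk, measure)
```

Because the keys are small integers, the comparison never touches the original strings.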

Figure 4

Blocklet Logical View

Figure 5

Conversion from surrogate values to actual values is done only for the final result rows, which are far fewer than the rows read from the store.
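Deferring the conversion to the final result can be illustrated with a small group-by sketch. The dictionary, the sample rows, and the aggregation (a sum of sales per location) are hypothetical; the sketch only shows the idea of aggregating on surrogates and decoding afterwards, not CarbonData's actual query engine.

```python
from collections import defaultdict

# Hypothetical global dictionary for a "location" column and its inverse.
location_dict = {"Mumbai": 1, "Berlin": 2, "Oslo": 3}
reverse_dict = {surrogate: value for value, surrogate in location_dict.items()}

# Encoded rows as read from the store: (location surrogate, sales).
encoded_rows = [(1, 120), (2, 75), (1, 30), (3, 10), (2, 5)]

# Group and aggregate on the integer surrogates only.
totals = defaultdict(int)
for surrogate, sales in encoded_rows:
    totals[surrogate] += sales

# Decode surrogates to actual values only for the final result rows.
result = {reverse_dict[surrogate]: total for surrogate, total in totals.items()}
print(result)
print(len(encoded_rows), "rows read,", len(result), "rows decoded")
```

Here five rows are read but only three dictionary lookups are needed, one per result group.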

*Multi Dimensional Keys (MDK) are the columns that represent the dimensions of the table (the keys used to analyse the data), e.g. Location, Month.