power-bi_在Power BI中的VertiPaq内-压缩成功!

power-bi

Have you ever wondered what makes Power BI so fast and powerful when it comes to performance? So powerful, that it performs complex calculations over millions of rows in a blink of an eye.

您是否想过什么使Power BI在性能方面如此之快和强大? 如此强大,它可以在一瞬间对数百万行执行复杂的计算。

In this series of articles, we will dig deep to discover what is “under the hood” of Power BI, how your data is being stored, compressed, queried, and finally, brought back to your report. Once you finish reading, I hope that you will get a better understanding of the hard work happening in the background and appreciate the importance of creating an optimal data model in order to get maximum performance from the Power BI engine.

在本系列文章中,我们将深入研究以发现Power BI的“内幕”,以及如何存储,压缩,查询您的数据,最后将它们带回到您的报告中。 阅读完本书后,我希望您能更好地了解在后台进行的艰苦工作,并希望了解创建最佳数据模型以从Power BI引擎获得最佳性能的重要性。

As you might recall, in the previous article we scratched the surface of VertiPaq, a powerful storage engine, which is “responsible” for the blazing-fast performance of most of your Power BI reports (whenever you are using Import mode or Composite model).

您可能还记得, 在上一篇文章中,我们介绍了功能强大的存储引擎VertiPaq的表面,它对大多数Power BI报表的出色表现(无论何时使用导入模式或复合模型)“负有责任”。 。

3、2、1…系好安全带! (3, 2, 1…Fasten your seatbelts!)

One of the key characteristics of the VertiPaq is that it’s a columnar database. We learned that columnar databases store data optimized for vertical scanning, which means that every column has its own structure and is physically separated from other columns.

VertiPaq的主要特征之一是它是一个列式数据库。 我们了解到,列式数据库存储为垂直扫描而优化的数据,这意味着每列都有其自己的结构,并且与其他列在物理上是分开的。

Image for post
Photo by Dave Hoefler on Unsplash
Dave Hoefler在Unsplash上​​的照片

That fact enables VertiPaq to apply different types of compression to each of the columns independently, choosing the optimal compression algorithm based on the values in that specific column.

这一事实使VertiPaq可以独立地对每个列应用不同类型的压缩,并根据该特定列中的值选择最佳压缩算法。

Compression is being achieved by encoding the values within the column. But, before we dive deeper into a detailed overview of encoding techniques, just keep in mind that this architecture is not exclusively related to Power BI — in the background is a Tabular model, which is also “under the hood” of SSAS Tabular and Excel Power Pivot.

通过对列中的值进行编码来实现压缩。 但是,在深入研究编码技术之前,请记住,该体系结构并不专门与Power BI相关—在后台是Tabular模型,它也是SSAS Tabular和Excel的“幕后”动力枢轴。

值编码 (Value Encoding)

This is the most desirable value encoding type since it works exclusively with integers and, therefore, require less memory than, for example, when working with text values.

这是最理想的值编码类型,因为它仅与整数一起使用,因此比例如使用文本值时需要更少的内存。

How does this look in reality? Let’s say we have a column containing a number of phone calls per day, and the value in this column varies from 4.000 to 5.000. What the VertiPaq would do, is to find the minimum value in this range (which is 4.000) as a starting point, then calculate the difference between this value and all the other values in the column, storing this difference as a new value.

现实情况如何? 假设我们有一个列,其中包含每天的电话数量,此列中的值在4.000到5.000之间。 VertiPaq要做的是找到此范围内的最小值(4.000)作为起点,然后计算该值与列中所有其他值之间的差,并将该差存储为新值。

Image for post

At first glance, 3 bits per value might not look like a significant saving, but multiply this by millions or even billions of rows and you will appreciate the amount of memory saved.

乍一看,每个值3位可能看起来不算是大笔的节省,但是将其乘以几百万甚至数十亿行,您将欣赏到节省的内存量。

As I already stressed, Value Encoding is being applied exclusively to integer data type columns (currency data type is also stored as an integer).

正如我已经强调的那样,值编码仅应用于整数数据类型列(货币数据类型也存储为整数)。

哈希编码(字典编码) (Hash Encoding (Dictionary Encoding))

This is probably the most used compression type by a VertiPaq. Using Hash encoding, VertiPaq creates a dictionary of the distinct values within one column and afterward replaces “real” values with index values from the dictionary.

这可能是VertiPaq最常用的压缩类型。 使用哈希编码,VertiPaq在一列内创建包含不同值的字典,然后用字典中的索引值替换“实际”值。

Here is the example to make things more clear:

这是使事情更清楚的示例:

Image for post

As you may notice, VertiPaq identified distinct values within the Subjects column, built a dictionary by assigning indexes to those values, and finally stored index values as pointers to “real” values. I assume you are aware that integer values require way less memory space than text, so that’s the logic behind this type of data compression.

您可能会注意到,VertiPaq在“主题”列中标识了不同的值,通过为这些值分配索引来构建字典,最后将索引值存储为“真实”值的指针。 我假设您知道整数值所需的存储空间比文本少,因此这就是这种数据压缩的逻辑。

Additionally, by being able to build a dictionary for any data type, VertiPaq is practically data type independent!

此外,通过能够为任何数据类型构建字典,VertiPaq实际上是与数据类型无关的!

This brings us to another key takeover: no matter if your column is of text, bigint or float data type — from VertiPaq perspective it’s the same — it needs to create a dictionary for each of those columns, which implies that all these columns will provide the same performance, both in terms of speed and memory space allocated! Of course, by assuming that there are no big differences between dictionary sizes between these columns.

这给我们带来了另一个关键的接管:无论您的列是文本,bigint还是float数据类型-从VertiPaq角度来看都是相同的-它需要为每个列创建一个字典,这意味着所有这些列都将提供在速度和分配的内存空间方面都具有相同的性能! 当然,假设这些列之间的字典大小之间没有太大差异。

So, it’s a myth that the data type of the column affects its size within the data model. On the opposite, the number of distinct values within the column, which is known as cardinality, mostly influence column memory consumption.

因此,列的数据类型会影响其在数据模型中的大小,这是一个神话。 相反,列中不同值的数量(称为基数 )主要影响列的内存消耗。

RLE(行程编码) (RLE (Run-Length-Encoding))

The third algorithm (RLE) creates a kind of mapping table, containing ranges of repeating values, avoiding to store every single (repeated) value separately.

第三种算法(RLE)创建一种映射表,其中包含重复值的范围,避免了单独存储每个(重复)值的情况。

Again, taking a look at an example will help to better understand this concept:

同样,看一个例子将有助于更好地理解这个概念:

Image for post

In real life, VertiPaq doesn’t store Start values, because it can quickly calculate where the next node begins by summing previous Count values.

在现实生活中,VertiPaq不存储“开始”值,因为它可以通过对先前的“计数”值求和来快速计算下一个节点的起始位置。

As powerful as it might look at first glance, the RLE algorithm is highly dependent on the ordering within the column. If the data is stored the way you see in the example above, RLE will perform great. However, if your data buckets are smaller and rotate more frequently, then RLE would not be an optimal solution.

乍一看,RLE算法虽然功能强大,但在很大程度上取决于列中的顺序。 如果按照您在上例中看到的方式存储数据,RLE将表现出色。 但是,如果您的数据存储区较小,并且旋转频率更高,则RLE将不是最佳解决方案。

One more thing to keep in mind regarding RLE: in reality, VertiPaq doesn’t store data the way it is shown in the illustration above. First, it performs Hash encoding and creating a dictionary of the subjects and then apply RLE algorithm, so the final logic, in its most simplified way, would be something like this:

关于RLE还需要记住的一件事:实际上,VertiPaq不会像上图所示那样存储数据。 首先,它执行Hash编码并创建主题字典,然后应用RLE算法,因此最终逻辑(以其最简化的方式)将类似于以下内容:

Image for post

So, RLE occurs after Value or Hash Encoding, in those scenarios when VertiPaq “thinks” that it makes sense to compress data additionally (when data is ordered in that way that RLE would achieve better compression).

因此,在VertiPaq“认为”有必要额外压缩数据的情况下(当以这种方式订购数据时,RLE将实现更好的压缩),RLE发生在“值”或“哈希编码”之后。

重新编码注意事项 (Re-Encoding considerations)

No matter how “smart” VertiPaq is, it can also make some bad decisions, based on incorrect assumptions. Before I explain how re-encoding works, let me just briefly iterate through the process of data compression for a specific column:

无论VertiPaq多么“聪明”,它也会基于错误的假设做出一些错误的决定。 在解释重新编码的工作原理之前,让我简要地迭代一下特定列的数据压缩过程:

  • VertiPaq scans sample of rows from the column

    VertiPaq扫描列中的行样本
  • If the column data type is not an integer, it will look no further and use Hash encoding

    如果列数据类型不是整数,则不会再使用Hash编码
  • If the column is of integer data type, some additional parameters are being evaluated: if the numbers in sample linearly increase, VertiPaq assumes that it is probably a primary key and chooses Value encoding

    如果该列是整数数据类型,则将评估一些其他参数:如果样本中的数字线性增加,则VertiPaq假定它可能是主键,并选择值编码
  • If the numbers in the column are reasonably close to each other (number range is not very wide, like in our example above with 4.000–5.000 phone calls per day), VertiPaq will use Value encoding. On the contrary, when values fluctuate significantly within the range (for example between 1.000 and 1.000.000), then Value encoding doesn’t make sense and VertiPaq will apply the Hash algorithm

    如果该列中的数字彼此相当接近(数字范围不是很宽,例如在上面的示例中,每天有4.000-5.00个电话),则VertiPaq将使用值编码。 相反,当值在此范围内(例如1.000到1.000.000之间)波动很大时,则值编码没有意义,VertiPaq将应用哈希算法

However, it can happen sometimes that VertiPaq makes a decision about which algorithm to use based on the sample data, but then some outlier pops-up and it needs to re-encode the column from scratch.

但是,有时可能会发生VertiPaq根据样本数据来决定使用哪种算法的情况,但随后会弹出一些异常值,因此需要从头开始对列进行重新编码。

Let’s use our previous example for the number of phone calls: VertiPaq scans the sample and chooses to apply Value encoding. Then, after processing 10 million rows, all of a sudden it found 50.000 value (it can be an error, or whatever). Now, VertiPaq re-evaluates the choice and it can decide to re-encode the column using the Hash algorithm instead. Surely, that would impact the whole process in terms of time needed for reprocessing.

让我们使用前面的示例来计算电话数量:VertiPaq扫描样本并选择应用Value编码。 然后,在处理了1000万行之后,突然发现50.000值(这可能是错误,也可能是其他)。 现在,VertiPaq重新评估选择,并且可以决定使用Hash算法对列进行重新编码。 当然,这将影响整个过程的重新处理所需的时间。

结论 (Conclusion)

In this part of the series on the “brain & muscles” behind Power BI, we dived deep into different data compression algorithms which VertiPaq performs in order to optimize our data model.

在Power BI背后“大脑和肌肉”系列的这一部分中,我们深入研究了VertiPaq为优化我们的数据模型而执行的各种数据压缩算法。

Finally, here is the list of parameters (in order of importance) that VertiPaq considers when choosing which algorithm to use:

最后,这是VertiPaq在选择使用哪种算法时要考虑的参数列表(按重要性顺序):

  • Number of distinct values in the column (Cardinality)

    列中不同值的数量(基数)
  • Data distribution in the column — column with many repeating values can be better compressed than one containing frequently changing values (RLE can be applied)

    包含多个重复值的列中的数据分布比包含频繁更改值的列(可以应用RLE)更好地压缩
  • Number of rows in the table

    表中的行数
  • Column data type — impacts only dictionary size

    列数据类型-仅影响字典大小

In the next article, I will introduce some techniques for reducing data model size and consequentially getting the better overall performance of your Power BI report.

在下一篇文章中,我将介绍一些减少数据模型大小并因此获得Power BI报表更好的整体性能的技术

翻译自: https://towardsdatascience.com/inside-vertipaq-in-power-bi-compress-for-success-68b888d9d463

power-bi

  • 0
    点赞
  • 1
    收藏
    觉得还不错? 一键收藏
  • 0
    评论
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值