Index strategies - part 2 - memory-optimized tables and columnstore indexes

In the first part we started the discussion about choosing the right table structure and described heaps, in my opinion a mostly overused option. Now let's see better alternatives.

Four different table structures - continued discussion

For years Microsoft has been trying to convince us that the Enterprise edition is worth the additional money. For SQL Server 2014 this is even more true than it was previously. So, if you are a lucky Enterprise Edition owner, you can adapt your database model to leverage the unique features of this edition. It will make the design simpler and result in better performance.

In order to improve the structure of tables, SQL Server 2014 Enterprise Edition offers two interesting options: memory-optimized tables and clustered columnstore indexes.

Memory-Optimized Tables

Because the entire memory-optimized table resides in memory (a copy of it is stored on disk for durability only) and this kind of table has lock-free structures, it fits nicely into any logging scenario. In other words, if you have a table used primarily for logging, a memory-optimized one should be your first choice.

Before a memory-optimized table can be created, a dedicated filegroup with a file (actually this file is a folder; Hekaton, the in-memory OLTP engine, uses storage built on top of the FILESTREAM feature) has to be added to the database (the script that creates this database can be found in the first article):

USE master;
GO
ALTER DATABASE Indices 
ADD FILEGROUP InMemoryOLTP
CONTAINS MEMORY_OPTIMIZED_DATA;
GO
ALTER DATABASE Indices
  ADD FILE (name = 'InMemoryOLTP', filename = 'e:\SQL\InMemoryOLTP')
  TO FILEGROUP InMemoryOLTP
GO
 

Secondly, there are many limitations for memory-optimized tables. Fortunately, they count for little in our scenario: logging tables usually don't have foreign keys, check constraints or triggers. Still, there is one important exception: memory-optimized tables don't support distributed transactions. What is more, this kind of table cannot be altered in any way (apart from dropping and re-creating it), and the 1252 character set must be used for varchar or char columns (and, come to that, if you want to put an index on a varchar column, you will need a BIN2 collation). So, our table definition can look like this:

CREATE TABLE dbo.InMemory(
	id				INT		IDENTITY(1,1) NOT NULL PRIMARY KEY
					NONCLUSTERED HASH WITH (BUCKET_COUNT = 100000),
	fixedColumns	CHAR(200) COLLATE Latin1_General_100_CI_AS
					CONSTRAINT Col2InMemDefault DEFAULT 'This column mimics all fixed-length columns',
	varColumns		VARCHAR(200) COLLATE Latin1_General_100_CI_AS
					CONSTRAINT Col3InMemDefault DEFAULT 'This column mimics all var-length columns'
)
WITH (MEMORY_OPTIMIZED = ON);
 

If we once again insert the same 100000 rows, this time into the memory-optimized table, using the same 50 concurrent sessions (the same methodology that was used to load data into the heap), the average time drops from 18 seconds to 10.

However, this improvement has its price: we have to have enough memory to store all of these rows:

Memory usage by memory optimized objects
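
The same information can be read from T-SQL; a minimal sketch, assuming the SQL Server 2014 DMV that reports memory consumed per memory-optimized table:

-- Memory consumed by the table and its indexes, in KB
SELECT OBJECT_NAME(object_id) AS table_name,
       memory_allocated_for_table_kb,
       memory_used_by_table_kb,
       memory_allocated_for_indexes_kb,
       memory_used_by_indexes_kb
FROM sys.dm_db_xtp_table_memory_stats
WHERE object_id = OBJECT_ID('dbo.InMemory');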

Now let's do our UPDATE test. Again, we will update some rows with the aid of SQLQueryStress, using five concurrent sessions and two queries: the first one is the UPDATE statement and the second one is a simple parameter substitution:

UPDATE dbo.InMemory
SET varColumns = REPLICATE('a',200)
WHERE id % @var = 0;
 
SELECT 9 AS nr
UNION 
SELECT 13
UNION 
SELECT 17
UNION 
SELECT 25
UNION 
SELECT 29;
 

For the heap, SQL Server updated about 800 rows per second. The memory-optimized table, with more than 50 000 updates per second, is a completely different story. Now, I would like to draw your attention to the fact that the rows are no longer stored in pages, so there are no side effects like the forwarding pointers in heaps.

With DELETEs, the space used by deleted rows will be reclaimed by a background process (by the way, this process is integrated with index scans, so this space can be recovered even quicker). Therefore, also in this aspect memory-optimized tables are better than heaps.
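
The work of this garbage collector can be observed through a DMV; a minimal sketch (the view exists in SQL Server 2014; the exact column list varies between builds, hence the SELECT *):

-- Cumulative statistics of the in-memory OLTP garbage collector
SELECT *
FROM sys.dm_xtp_gc_stats;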

In summary, memory-optimized tables are great for logging, as long as we have enough free memory to store the whole table.

Columnstore indexes

The second scenario in which heaps are traditionally used is staging. The classic example goes like this: to speed up data loading and avoid fragmentation, new data is loaded into an empty heap, and later this heap is switched into the destination table as its last non-empty partition. This procedure is well-defined and proven worldwide.
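
The switching step of that classic pattern can be sketched like this (dbo.Stage and dbo.Destination are hypothetical names; the staging heap must have the same structure as the destination table and a check constraint guaranteeing that its rows fit the target partition's range):

-- Classic staging pattern: load into an empty table, then switch it in
ALTER TABLE dbo.Stage
SWITCH TO dbo.Destination PARTITION 3;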

However, if your server is not CPU-constrained (most of them are not) but I/O is its bottleneck (as it is in most cases), maybe you should re-evaluate this strategy. The main idea is that compressing (or decompressing) data on the fly is more efficient than copying it in an uncompressed state and applying compression afterwards.

The new approach, dedicated to SQL Server 2014 Enterprise Edition, can be implemented with clustered columnstore indexes: first, they are finally read-write; second, they can be partitioned.

To demonstrate this concept, we will need a partition function and a partition scheme:

CREATE PARTITION FUNCTION PF_CS (int)
AS RANGE RIGHT FOR VALUES (
  1,100000,200000);
GO
 
CREATE PARTITION SCHEME PS_CS
AS 
PARTITION PF_CS
ALL TO ([PRIMARY]);
GO
 

To keep this example as simple as possible, the partitioning function was defined on the identity column. And this third table's definition was slightly adjusted, just to simplify inserting rows into the third partition:

CREATE TABLE dbo.CS(
	id				INT				IDENTITY(100000,1),
	fixedColumns	CHAR(200)		CONSTRAINT Col2CSDefault DEFAULT 'This column mimics all fixed-length columns',
	varColumns		VARCHAR(200)	CONSTRAINT Col3CSDefault DEFAULT 'This column mimics all var-length columns'
)
ON PS_CS (id);
GO
 

Finally, let's convert this table into a columnstore index by executing the following statement:

CREATE CLUSTERED COLUMNSTORE INDEX IX_CS
ON dbo.CS;
GO

This allows us to insert the same rows using exactly the same INSERT INTO … DEFAULT VALUES statement as for the heap and the memory-optimized table. This time the execution of 2000 inserts with 50 concurrent sessions took 11 seconds, so slightly longer than for the memory-optimized table but still about 40% faster than for the heap.
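
For reference, the load statement is as trivial as it was in the first part; the identity column fills itself and both remaining columns take their defaults:

-- run repeatedly by the 50 concurrent SQLQueryStress sessions
INSERT INTO dbo.CS DEFAULT VALUES;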

And without any additional work, such as adding constraints or switching partitions, we have already achieved our goal: the new data was loaded into the right partition:

SELECT $partition.PF_CS(id) AS 'Partition Number'
	,min(id) AS 'Min Value'
	,max(id) AS 'Max Value'
	,count(*) AS 'Rows In Partition'
FROM dbo.CS
GROUP BY $partition.PF_CS(id);
____________________________________________________________________________
 
Partition Number		Min Value	Max Value	Rows In Partition
3				       100000	   199999	   100000
 

Once more, the UPDATE performance was measured by executing the same statement in 5 concurrent sessions. This time SQL Server updated about 7 thousand rows per second: more or less 8 times faster than for the heap, but 7 to 10 times slower than for the memory-optimized table.

But what about reclaiming unused space? In SQL Server 2014 columnstore indexes use two different stores: columnar data is still read-only, so all changes are kept in a row-based delta store. Both stores are automatically used by SQL Server to satisfy queries, so you will always see the actual data, but the space used by deleted rows will not be reclaimed automatically:
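
You can see both stores (and the rows marked as deleted) through a catalog view; a minimal sketch, assuming SQL Server 2014, where compressed row groups and the delta store are reported side by side:

-- COMPRESSED columnar row groups plus OPEN/CLOSED delta-store
-- row groups, with their deleted-row counts
SELECT partition_number, row_group_id, state_description,
       total_rows, deleted_rows, size_in_bytes
FROM sys.column_store_row_groups
WHERE object_id = OBJECT_ID('dbo.CS');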

EXEC sp_spaceused 'CS' 
 
DELETE  dbo.CS
WHERE id % 31 = 0;
 
EXEC sp_spaceused 'CS' 
GO
____________________________________________________________________
 
name	rows		reserved 	data		index_size	unused
CS	  100000      1240 KB	  1208 KB	 24 KB		 8 KB
CS	  96774       1240 KB	  1208 KB	 24 KB		 8 KB
 

To get this space back, the columnstore index has to be rebuilt. Unfortunately, this operation cannot be done online:

ALTER INDEX IX_CS
ON [dbo].[CS]
REBUILD 
WITH (ONLINE=ON);
____________________________________________________________________________
 
ALTER INDEX REBUILD statement failed because the ONLINE option is not allowed
when rebuilding a columnstore index. Rebuild the columnstore index without
specifying the ONLINE option.
 

Due to the fact that this table is partitioned, all you need is to rebuild a single partition, so the downtime can be minimized:

ALTER INDEX IX_CS
ON [dbo].[CS]
REBUILD PARTITION = 3;
 
EXEC sp_spaceused 'CS'
________________________________________________________
 
name	rows	reserved	data	index_size	unused
CS	  96774   448 KB	  344 KB  0 KB		  104 KB
 

To sum up: clustered columnstore indexes can not only greatly simplify data loads but also speed up this process.

What is coming next?

All the remaining parts of this series will be dedicated to B-tree indexes, clustered and nonclustered. In the upcoming article we will finish our debate about choosing the right table structure. And if you find that the clustered index is the best option, then I strongly recommend evaluating with me all the plus points of nonclustered indexes.

Translated from: https://www.sqlshack.com/index-strategies-part-2-choosing-right-table-structure/
