Thoughts on why an auto-increment id is well suited as the PK (clustered index)

Head First SQL (深入浅出SQL) plus Shi Zhan's (石展) 36 MySQL rules teach us:

  • A primary key uniquely identifies each record
  • A primary key cannot be NULL -- a primary key value must be supplied when inserting a new record
  • A primary key should be concise
  • Primary key values must not be modified -- if the value could change, you would risk accidentally entering a value that is already in use; it must stay unique

1) So, per Shi Zhan's 36 rules, an auto-increment id can be used here, which guarantees uniqueness;

And the PK here is the clustered index ---- as everyone says, the index order is the table's storage order, so changing a clustered-index value changes that entry's position within the whole index, which in turn means the corresponding record has to be moved to a new physical location on disk, causing unnecessary I/O (caoz's remark boils an index down to a one-dimensional table that can simply be binary-searched).

My test result is that updating just the PK takes about the same time as updating any other column, and after the update, select * from t1 without any ORDER BY does indeed return the rows in PK order.

EXPLAIN also still says index, so nothing seems affected. But note that at this point the updated PK only lives in memory and has not yet been written back to disk; when a large amount of data is flushed to disk, the unnecessary I/O does show up.
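
A minimal sketch of that test, assuming a throwaway table t1 (table and column names are made up for illustration):

    -- hypothetical test table; the PK is the clustered index in InnoDB
    CREATE TABLE t1 (
        id  INT NOT NULL PRIMARY KEY,
        val VARCHAR(32)
    ) ENGINE=InnoDB;

    INSERT INTO t1 (id, val) VALUES (1, 'a'), (2, 'b'), (3, 'c');

    -- time an update of the PK against an update of an ordinary column
    UPDATE t1 SET id  = 10  WHERE id = 2;
    UPDATE t1 SET val = 'x' WHERE id = 3;

    -- rows come back in PK order even without ORDER BY
    SELECT * FROM t1;
    EXPLAIN SELECT * FROM t1 WHERE id = 10;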

The InnoDB storage engine is index-organized and stores rows in primary key order. Although MVCC internally does an insert first and the delete later, an update that moves a row between two other keys still has to move data; something to study further in 《MySQL 技术内幕 innodb 存储引擎》 when I find time.


2) Yet many articles online say not to use a meaningless auto-increment id

My view is this:

1) Prefer a meaningful, never-modified field as the PK

2) If no such immutable field exists: for read-mostly tables, use an auto-increment id for speed; for write-mostly tables, use a modifiable, meaningful field (or several fields together) as the PK. A rough sketch of both options follows.
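
A rough sketch of the two options, with hypothetical tables:

    -- read-mostly table: surrogate auto-increment PK
    CREATE TABLE article (
        id   INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
        slug VARCHAR(100) NOT NULL,
        body TEXT,
        UNIQUE KEY uk_slug (slug)
    ) ENGINE=InnoDB;

    -- write-mostly table: meaningful composite key as the PK
    CREATE TABLE order_item (
        order_id   INT UNSIGNED NOT NULL,
        product_id INT UNSIGNED NOT NULL,
        quantity   INT NOT NULL,
        PRIMARY KEY (order_id, product_id)
    ) ENGINE=InnoDB;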


-------------------


Attached: http://rbq10829.iteye.com/blog/260909


I. Using clustered indexes

 A clustered index reorganizes the actual data on disk so that it is sorted by the values of one or more specified columns. Because the index pages of a clustered index point to the data pages, looking data up through a clustered index is almost always faster than through a nonclustered index. Each table can have only one clustered index, and building one requires extra space of at least 120% of the table size, to hold a copy of the table and the intermediate index pages. The guiding ideas for building a clustered index are:

1. Most tables should have a clustered index, or use partitioning, to reduce contention on the table's last page; in a high-transaction environment, locking on the last page can seriously hurt throughput.

2. Under a clustered index, data is physically ordered on the data pages and duplicate values sit next to each other. For queries with range checks (BETWEEN, <, <=, >, >=) or with GROUP BY or ORDER BY, once the row with the first key value in the range is found, rows with subsequent key values are guaranteed to be physically adjacent, so no further searching is needed and large scans are avoided, which can greatly speed up the query.

3. When building a clustered index on a table with frequent inserts, do not put it on a column with monotonically increasing values (such as IDENTITY), or you will constantly run into lock contention.

4. Do not include frequently modified columns in the clustered index, because whenever the key value changes, the data row has to be moved to a new location.

5. Choose the clustered index based on the WHERE clauses and the kinds of joins in your queries.

Candidate columns for a clustered index are:

1. The primary key column, when it is used in WHERE clauses and inserts are random.

2. Columns accessed by range, e.g. pri_order > 100 and pri_order < 200 (see the sketch after this list).

3. Columns used in GROUP BY or ORDER BY.

4. Columns that are rarely modified.

5. Columns used in joins.
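
A hedged SQL Server-style sketch of candidates 1-2 above (the orders table and index name are made up):

    -- cluster on the range-scanned column so the range is one contiguous sweep
    CREATE CLUSTERED INDEX cix_orders_pri_order ON orders (pri_order);

    -- rows in this range sit on physically adjacent pages
    SELECT * FROM orders
    WHERE pri_order > 100 AND pri_order < 200;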

II. Using nonclustered indexes

 By default the indexes SQLServer creates are nonclustered. A nonclustered index does not reorganize the table's data; instead, for each row it stores the index column value plus a pointer to the page where the row lives. In other words, a nonclustered index adds an extra level between the index structure and the data itself. A table without a clustered index can have up to 250 nonclustered indexes, each offering a different sort order for accessing the data. When building a nonclustered index, weigh the query speedup against the slowdown of modifications, and also consider the following:

1. How much space the index will need.

2. Whether the candidate columns are stable.

3. How the index keys are chosen, and whether a scan would actually perform better.

4. Whether there are many duplicate values.

 For frequently updated tables, nonclustered indexes cost more overhead than either a clustered index or no index at all. For every row moved to a new page, the leaf-level entry of every nonclustered index pointing at that row must also be updated, sometimes splitting index pages. Deleting data from a page has similar overhead, and the delete process must additionally shift the remaining data to the top of the page to keep it contiguous. So create nonclustered indexes with care. Nonclustered indexes are typically used when:

1. A column is often used in aggregate functions (such as SUM, ...).

2. A column is often used in JOIN, ORDER BY, or GROUP BY.

3. The data retrieved is no more than about 20% of the rows in the table.

III. Using covering indexes

 A covering index is a nonclustered index whose entries contain all the information a query needs. It is fast precisely because the index pages already hold all the data the query requires, so the data pages never have to be visited. When a nonclustered index contains the result data, it can be faster than the clustered index (a sketch follows below).

 However, because a covering index carries more columns, it takes up more space, and UPDATE operations change the index entries. So if the covered query is infrequent or not critical, adding a covering index can actually reduce performance.
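
A sketch of a nonclustered index turned into a covering index for one query; the employee table is hypothetical, and the INCLUDE clause is later SQL Server syntax (a plain composite index over both columns achieves the same effect):

    -- index entries carry everything the query needs, so data pages are never touched
    CREATE NONCLUSTERED INDEX ix_emp_dept_salary
        ON employee (dept_id)
        INCLUDE (salary);

    SELECT dept_id, SUM(salary)
    FROM employee
    GROUP BY dept_id;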

IV. Index selection techniques

In general, the approach to building indexes is:

(1) When the primary key is often used as a WHERE condition, build the clustered index on the primary key column, especially when it is frequently used for joins.

(2) Columns with many duplicate values that often appear in range queries, sorting, or grouping, or that are accessed very frequently, are candidates for the clustered index.

(3) When several columns are often accessed together and each contains duplicate values, consider a composite index that covers one query or a group of queries, with the most frequently referenced column as the leading column; where possible, turn the key queries into covered queries.

(4) If you know that every value of the index key is unique, be sure to define the index as a unique index.

(5) When building an index on a table with frequent inserts, use fillfactor to reduce page splits, improve concurrency, and reduce deadlocks; on a read-only table, set fillfactor to 100 (see the sketch after this list).

(6) When choosing index keys, prefer columns with small data types so that each index page can hold as many keys and pointers as possible, which keeps the number of index pages a query must traverse to a minimum. Also use integer keys wherever possible, since they provide faster access than any other data type.
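
A sketch of point (5), assuming current SQL Server syntax and hypothetical tables:

    -- leave ~20% of each page free on a busy insert table to cut down page splits
    CREATE NONCLUSTERED INDEX ix_log_created_at
        ON event_log (created_at)
        WITH (FILLFACTOR = 80);

    -- on a read-only table, pack the pages completely
    CREATE NONCLUSTERED INDEX ix_ref_code
        ON ref_codes (code)
        WITH (FILLFACTOR = 100);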

V. Index maintenance

 As noted above, unsuitable indexes hurt SQLServer performance. As the application runs, the data keeps changing, and once the changes reach a certain point they start to affect how well the indexes work; at that point the indexes need maintenance. Index maintenance includes:

1. Rebuilding indexes

 As rows are inserted and deleted and data pages split, some index pages may end up holding only a little data; also, for applications that do large-block I/O, rebuilding nonclustered indexes reduces fragmentation and keeps large-block I/O efficient. Rebuilding an index essentially reorganizes the B-tree's space. Rebuild an index when:

(1) The data and usage patterns have changed significantly.

(2) The sort order has changed.

(3) A large number of inserts is about to happen or has just finished.

(4) Queries that use large-block I/O show more disk reads than expected.

(5) Heavy data modification has left data and index pages under-filled, so space usage exceeds the estimate.

(6) dbcc reports problems with the index.

When the clustered index is rebuilt, all nonclustered indexes on the table are rebuilt as well.

2. Updating index statistics

 When an index is created on a table that already contains data, SQLServer builds a distribution page that holds two kinds of statistics about the index: a distribution table and a density table. The optimizer uses this page to judge whether the index is useful for a particular query. These statistics are not recalculated automatically, which means that after the table's data changes they may be stale and steer the optimizer away from the best plan. Therefore, run the update statistics command when:

(1) Row inserts and deletes have changed the data distribution.

(2) Rows are added to a table whose data was removed with truncate table.

(3) The values of indexed columns are modified. (A sketch of both maintenance tasks follows this list.)
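
A sketch of both maintenance tasks, again in SQL Server terms with placeholder names (DBCC DBREINDEX is the older form, ALTER INDEX ... REBUILD the newer one):

    -- rebuild a fragmented index
    DBCC DBREINDEX ('orders', 'cix_orders_pri_order');
    ALTER INDEX cix_orders_pri_order ON orders REBUILD;

    -- refresh the distribution statistics the optimizer relies on
    UPDATE STATISTICS orders;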

VI. Closing remarks

 Practice shows that inappropriate indexes are not just useless, they actually degrade performance, because a large number of indexes makes inserts, updates, and deletes take more system time than having no indexes at all. For example, indexes built in the following situations are inappropriate:

1. Columns that are rarely or never referenced in queries do not benefit from an index, because the index will rarely or never be used to search for rows based on those columns.

2. Columns with only two or three values, such as male/female (yes/no), never benefit from an index.

 Also, since indexes speed up queries but slow down data updates, performance can be improved by creating the table on one segment and its nonclustered indexes on another, with the two segments on separate physical devices.



-------------------------------------------------



Some people don’t probably know, but there is a difference between how indexes work in MyISAM and how they work in InnoDB, particularly when talking from the point of view of performance enhancement. Now since, InnoDB is starting to be widely used, it is important we understand how indexing works in InnoDB. Hence, the reason for this post!

The first and foremost thing to know is that InnoDB uses clustered index to store data in the table. Now what does clustered index mean?

Clustered Index

A clustered index determines the physical order of data in a table. When thinking of a clustered index, think of a telephone directory, where data is physically arranged by the last name. Because the clustered index decides the physical storage order of the data in the table, a table can have only a single clustered index. But a clustered index can comprise multiple columns (a composite index), in the same way a telephone directory is organized by both last name and first name.

Clustered Index with respect to InnoDB

InnoDB stores indexes as B+tree data structures, and same is the case with the clustered index. But the difference is that in the case of clustered index InnoDB actually stores the index and the rows together in the same structure. When a table has a clustered index, its rows are actually stored in the index’s leaf pages. Thus InnoDB tables can also be called index-organized tables.

Now let's consider how InnoDB decides which index to use as the clustered index!

How does InnoDB select a clustered index?

With InnoDB, the PRIMARY KEY is typically synonymous with the clustered index, but what if no PRIMARY KEY exists, or there is not even a single index defined on the table? The following is how InnoDB decides what to use as the clustered index (a small sketch follows the list):

  • If there is a PRIMARY KEY defined on the table, InnoDB uses it as the clustered index.
  • If there is no PRIMARY KEY defined on the table, InnoDB uses the first UNIQUE index where all the key columns are NOT NULL as the clustered index.
  • If there is no PRIMARY KEY or no suitable UNIQUE index present, InnoDB internally generates a hidden PRIMARY KEY and then uses this hidden key as the clustered index. This hidden PRIMARY KEY is a 6-byte field that increases monotonically as new rows are inserted.
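
A minimal sketch of the three cases, using made-up tables:

    -- case 1: explicit PRIMARY KEY becomes the clustered index
    CREATE TABLE t_pk (
        id   INT NOT NULL PRIMARY KEY,
        name VARCHAR(50)
    ) ENGINE=InnoDB;

    -- case 2: no PK, so the first UNIQUE index over NOT NULL columns is used
    CREATE TABLE t_uk (
        code CHAR(8) NOT NULL,
        name VARCHAR(50),
        UNIQUE KEY uk_code (code)
    ) ENGINE=InnoDB;

    -- case 3: neither, so InnoDB generates a hidden 6-byte monotonically increasing key
    CREATE TABLE t_hidden (
        name VARCHAR(50)
    ) ENGINE=InnoDB;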

Hence, my advice is that always define a PRIMARY KEY for each table that you create. If there is no logical key that can be created, add a new auto-increment column, and use it as the PRIMARY KEY.

Did you know that Secondary Index is related to the Primary Key?

In InnoDB, every SECONDARY INDEX automatically contains the PRIMARY KEY column(s) together with the column(s) of the secondary index. That is because of the way InnoDB stores data: remember what I just told you about how data is stored; a leaf page doesn't store a pointer to the row's physical location, it stores the row's data itself. So in other words, the PRIMARY KEY is effectively the pointer to the row data.

This leads us to another interesting conclusion..

A secondary index requires two lookups! First a lookup for the secondary index itself, then a lookup for the primary key.
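
A sketch of the two lookups with a hypothetical employee table:

    CREATE TABLE employee (
        employee_id INT NOT NULL PRIMARY KEY,      -- clustered index
        last_name   VARCHAR(50) NOT NULL,
        salary      DECIMAL(10,2),
        KEY ix_last_name (last_name)               -- leaf entries store (last_name, employee_id)
    ) ENGINE=InnoDB;

    -- lookup 1: ix_last_name yields matching entries plus their PK values;
    -- lookup 2: each PK value fetches the full row from the clustered index
    SELECT * FROM employee WHERE last_name = 'Smith';

    -- covered by the secondary index alone (EXPLAIN shows "Using index"), no second lookup
    SELECT employee_id, last_name FROM employee WHERE last_name = 'Smith';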

Advantages of clustering

Clustering provided by InnoDB has very significant performance benefits, some of which are mentioned below:

  • Because the data is physically stored according to the PRIMARY KEY, data lookups by PRIMARY KEY is very fast. For example, the fastest way to find a particular employee using the unique employee_id column is to create a PRIMARY KEY on the employee_id column.
  • With clustering, search for ranges can be extremely efficient. Suppose an application frequently searches records between a range of dates, a clustered index can quickly locate the row containing the beginning date, and then retrieve all adjacent rows in the table until the last date is reached. Thus improving the performance of range queries.
  • Another positive impact of clustering is on the performance of sorting data. If a column is frequently used to sort the data retrieved from a table, it can be advantageous to cluster the table on that column, saving the cost of a sort each time the column is queried.
  • Also, because a clustered index holds both the index and the data together in one B-tree, retrieving rows from a clustered index is normally faster than a comparable lookup in a nonclustered index.
  • Secondary indexes can act as covering indexes when the data requested includes the primary key columns, because secondary indexes automatically include the primary key columns.

These benefits that I have mentioned can boost performance drastically, if you design your tables and queries accordingly. But clustered indexes have disadvantages as well.

Disadvantages of clustering

Following are some of the disadvantages of clustering:

  • If a large clustered index is defined, any secondary indexes that are defined on the same table will be significantly larger because the secondary indexes contain the clustering key.
  • Because of the way how the data is stored, secondary indexes require two lookups.
  • Clustered indexes can be expensive for columns that undergo frequent changes, because they force InnoDB to move each updated row to a new location. (An UPDATE of the primary key causes the row to move physically; I need to find time to see exactly how the move happens. That is worrying, given that by default InnoDB keeps its data in one big file.)
  • Insertions can be slow if the data is not inserted in PRIMARY KEY order, hence we can conclude that insert speed depends heavily on insertion order. Inserting rows in primary key order is the fastest way to load data into an InnoDB table. (How fast a PK insert is depends on where that PK value lands in the overall index order; with an auto-increment key the new value always lands at the end, effectively an append, which is why it is fast. This is also why Shi Zhan recommends an auto-increment id as the primary key, and indeed most applications have very few result sets that are selected strictly in primary key order. A sketch contrasting the two insert patterns follows this list.)
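
A rough sketch contrasting the two insert patterns (hypothetical tables; UUID() stands in for any non-monotonic key):

    -- monotonically increasing key: every insert appends at the right edge of the B+tree
    CREATE TABLE t_seq (
        id      BIGINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
        payload VARCHAR(100)
    ) ENGINE=InnoDB;
    INSERT INTO t_seq (payload) VALUES ('row');

    -- random key: inserts land all over the tree, causing page splits and random I/O
    CREATE TABLE t_rand (
        id      CHAR(36) NOT NULL PRIMARY KEY,
        payload VARCHAR(100)
    ) ENGINE=InnoDB;
    INSERT INTO t_rand (id, payload) VALUES (UUID(), 'row');
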
Update (thanks to sunny):

Following is another thing that one should know regarding secondary indexes:

The records in InnoDB secondary indexes are never updated in place. What that means is that an UPDATE of a secondary index column amounts to deleting the old record and inserting a new one. (So an update of a non-clustered index entry first inserts the new entry; the old one is physically deleted later, when a suitable moment arrives.)

Although I did point out some disadvantages, the fact is that they are far outweighed by the tremendous benefits that come with clustering in InnoDB. If you study and understand the aspects that I have mentioned in this article and apply them accordingly, you are going to see great performance improvements. After all, clustering is another important step in bringing MySQL closer to MSSQL and Oracle.

Showing 5 of 17 comments

    • Ovais,

      I agree that a clustered index has big advantages for queries, and should be used. That is why the MySQL storage engine TokuDB (created by Tokutek, where I work) allows users to define multiple clustering keys (http://tokutek.com/2009/05/int..., so that multiple indexes can produce faster query results. The disadvantages, slower insertion speed and disk usage, are mitigated by TokuDB's 10x-80x faster indexed insertions (http://tokutek.com/downloads/m... and compression.

      Also, you mention that with clustered indexes, secondary indexes require two lookups. Storage engines that do not cluster the primary key require an index lookup and a fetch. With MyISAM, for example, a secondary index requires a lookup in the .MYI file, followed by a fetch in the .MYD file. The fetch still incurs a disk seek, which is expensive.

    • Hmm, with respect to TokuDB, multiple clustered indexes do sound interesting, but wouldn't that require more space? Say there are 3 clustered indexes; then the table data has to be stored 3 times, which certainly means 3x the storage space. Secondly, how are inserts handled? Having multiple clustered indexes would mean major work during insertions and updates.

      And regarding secondary indexes requiring two lookups, it's just something I mentioned so that people would know; a secondary index carrying the primary key column can be an advantage as well as a disadvantage.

      And regarding the comparison with MyISAM, lookups by key are more or less always going to be faster with InnoDB.

    • Ovais,
      The downsides you mention exist, but TokuDB's performance drastically reduces them. Yes, there is more storage space required, but TokuDB has great compression on data. We have seen results ranging from 5x to 15x compression over InnoDB. As for insertions, I would not categorize it as "major work", but as "more work". The insertions into the clustering indexes are of larger size over insertions into secondary indexes, but TokuDB has really fast indexed insertions, from 10x-80x faster than InnoDB.

      TokuDB's performance characteristics make multiple clustering indexes an excellent tool for achieving great query performance in MySQL.

    • How do you think TokuDB's compression compares with the compression offered by the Barracuda file format in InnoDB?

      Well, could you explain how the clustered index updates and insertions are not major work? Besides that, with InnoDB the inserts have to be in primary key (clustered key) order, otherwise performance suffers; does that hold for TokuDB as well?

    • Ovais,

      Regarding compression, I do not know how the Barracuda file format compares with standard InnoDB and with TokuDB, so I cannot answer the question. I only know the 5x-15x improvement over standard InnoDB.

      The reason that updates and insertions are not major work is that TokuDB uses fractal trees (http://tokutek.com/2010/04/how... in place of B-trees. Insertions and updates into fractal trees are one to two orders of magnitude faster, because they do not incur the disk seeks that B-tree insertions do (http://tokutek.com/2010/05/dis....

      With TokuDB, each clustered key maintains a copy of the data in the order of the defined clustering key, so the query performance with clustering keys is really fast, for all of the reasons you mention in your post.



    -----------------------------------------------------------------------------------

    MySQL 5.5: InnoDB Change Buffering

    To speed up bulk loading of data, InnoDB implements an insert buffer, a special index in the InnoDB system tablespace that buffers modifications to secondary indexes when the leaf pages are not in the buffer pool. Batched merges from the insert buffer to the index pages result in less random access patterns than when updating the pages directly. This speeds up the operation on hard disks.

    In MySQL 5.5, the insert buffer has been extended to a change buffer, which covers all modifications of secondary index leaf pages. This will improve the performance of bulk deletes and updates, transaction rollback and the purging of deleted records (reducing the “purge lag”).

    To assess the benefits of the extended buffering, you may want to run benchmarks with the settings innodb_change_buffering=all, innodb_change_buffering=inserts, and innodb_change_buffering=none. Users of solid-state storage, where random reads are about as fast as sequential reads, might benefit from disabling the buffering altogether.
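
    For the benchmark comparison suggested above, the mode can be switched at runtime; a small sketch (innodb_change_buffering is a dynamic global variable in MySQL 5.5):

        -- try each mode and rerun the workload
        SET GLOBAL innodb_change_buffering = 'all';
        SET GLOBAL innodb_change_buffering = 'inserts';
        SET GLOBAL innodb_change_buffering = 'none';

        -- check the current setting
        SHOW GLOBAL VARIABLES LIKE 'innodb_change_buffering';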

    Read on to learn how the change buffering works.

    Operations on Secondary Indexes

    InnoDB can perform three kinds of modifications on secondary index records. If the affected index page is not in the buffer pool, the modifications can be buffered in the change buffer. When an index lookup or scan needs a page that is not in the buffer pool, the page will be read from the tablespace and any buffered changes will be merged to it.

    The following operations can modify secondary index pages:

    Insert: inserting a record; supported in all versions of InnoDB.
    Delete-mark: marking a record for deletion.
    Purge: removing a deleted record that is no longer accessible by active transactions.

    Before MySQL 5.5, UPDATE, DELETE and purge operations were performed directly on the index pages, resulting in random-access I/O. In MySQL 5.5, all of these operations can be buffered.

    Implications of InnoDB Multiversioning

    In InnoDB, there are two types of indexes: the clustered index B-tree, where the records are stored in PRIMARY KEY order, and secondary index B-trees, which identify rows by primary key. InnoDB multiversion concurrency control (MVCC) treats these indexes differently.

    Records in the clustered index can be updated in place, and their hidden system columns DB_TRX_ID and DB_ROLL_PTR point to undo log entries from which earlier versions can be reconstructed. InnoDB secondary index records do not contain any system columns, and their data is never updated in place. An UPDATE of an indexed column requires the operations Delete-mark(old), Insert(new) and eventually Purge(old) in the secondary index. An UPDATE of a PRIMARY KEY results in Delete-mark, Insert and eventually Purge in all indexes.

    When a secondary index record has been marked for deletion or when the page has been updated by a newer transaction, InnoDB will look up the clustered index record. In the clustered index, it suffices to check the DB_TRX_ID and only retrieve the correct version from the undo log when the record was modified after the reading transaction started.

    To Buffer or not to Buffer

    When a page is in the buffer pool, it will always be updated directly. When a page is loaded to the buffer pool, any buffered changes will be merged to it, so that users never see unmerged changes.

    Because change buffering works on individual leaf pages, we cannot buffer changes that would result in page splits or merges, but must perform such changes on the B-tree pages directly.

    The insert buffer bitmap keeps track of the available space on pages and prevents overflows when buffering inserts. Delete-marking records can always be buffered, because the flag will be updated in place. Purging a delete-marked record could result in an empty page, something that we do not allow. We determine the non-emptiness of a page from previously buffered operations on the same page. If there are no previously buffered operations, the purge will have to load the index page to the buffer pool.

    InnoDB refuses to buffer an operation when the on-disk change buffer tree would grow bigger than ⅓ of the in-memory buffer pool (innodb_buffer_pool_size). This might be a good rule-of-thumb, but some setups could benefit from the ability of setting the change buffer size independently of the buffer pool size.

    Conclusion

    The InnoDB change buffer is a persistent data structure and a complex mechanism that comes into play when the workload does not fit in the buffer pool. Because it trades random I/O with a larger amount of sequential I/O, it speeds up operation on hard disks, where random access is much slower than sequential access.

    On solid-state storage, there is not much difference between sequential and random access times. Change buffering may still be useful if writes to solid-state storage are expensive, either in terms of speed or the consumption of limited program/erase cycles. Change buffering could reduce the write load on user tablespaces and cause more writes to the system tablespace (which contains the insert buffer) and the redo log. These should be placed on a hard disk.
