Note on <Beginning Microsoft SQL Server 2012 Programming> - 03



Chapter 9: SQL Server Storage and Index Structures


The Extent

An extent is the basic unit of storage used to allocate space for tables and indexes. It is made up of eight contiguous 8KB data pages, for a total size of 64KB.


  • Once an extent is full, the next record takes up not just the size of the record, but the size of a whole new extent.
  • By pre-allocating this space, SQL Server saves the time of allocating new space with each record.
  • An extent is allocated within the total space reserved by the database from your disk drive.


The Page

A page is the unit of allocation within a specific extent.


  • The number of rows per page depends entirely on the size of the row.
  • A row is not allowed to be split between pages.


Two Types of Page:


  1. Data pages: These hold the actual data in your table, with the exception of any BLOB data that is not defined with the text-in-row option, varchar(max), or varbinary(max).
  2. Index pages: These hold both the non-leaf and the leaf level pages (you'll examine what these are later in the chapter) of a non-clustered index, as well as the non-leaf level pages of a clustered index.


Page Splits:

When a page becomes full, it splits. This means more than just a new page being allocated — it also means that approximately half the data from the existing page is moved to the new page.

The exception to this process is when a clustered index is in use. If there is a clustered index and the next inserted row would be physically located as the last record in the table, a new page is created, and the new row is added to the new page without relocating any of the existing data.
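As a sketch of that exception, a table clustered on an ever-increasing key always inserts at the logical end of the index, so new pages are simply allocated rather than split (the table below is hypothetical, not from the book):

```sql
-- Hypothetical log table: EventID always increases, so each INSERT lands
-- after the current last row of the clustered index. When the last page
-- fills up, SQL Server allocates a fresh page instead of splitting one.
CREATE TABLE dbo.EventLog (
    EventID  int IDENTITY(1,1) NOT NULL,
    LoggedAt datetime2    NOT NULL DEFAULT SYSDATETIME(),
    Message  varchar(400) NULL,
    CONSTRAINT PK_EventLog PRIMARY KEY CLUSTERED (EventID)
);
```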


Rows

Rows can be up to 8KB.


In addition to the limit of 8,060 bytes, there is also a maximum of 1,024 standard (non-sparse) columns.


Sparse Columns

Some columns are going to be NULL a high percentage of the time. In many scenarios, if you bump into a few of these kinds of columns, you tend to bump into a ton of them. Using sparse columns, you can increase the total number of allowed columns in a single table to 30,000.


Internally, the data from columns marked as being sparsely populated is embedded within a single column — allowing a way to break the former limitation of 1,024 columns without major architectural changes.


Image, text, ntext, geography, geometry, timestamp, and all user-defined data types are prohibited from being marked as a sparse column.


Question:
How to make use of sparse columns? The author never told.
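One possible answer, sketched from the standard T-SQL syntax (the table and columns below are hypothetical): you mark a NULL-heavy column with the SPARSE keyword, and sparse columns must be NULLable.

```sql
-- Hypothetical table: Color and WeightGrams are NULL most of the time,
-- so marking them SPARSE lets SQL Server store only the non-NULL values.
CREATE TABLE dbo.ProductAttributes (
    ProductID   int NOT NULL PRIMARY KEY,
    Color       varchar(20) SPARSE NULL,
    WeightGrams int         SPARSE NULL
);
```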




Understanding Indexes

What happens in your indexes depends on the collation information you have established for your data.


In SQL Server, you have a number of different collation options available to you. Among these options are:

  • Binary: Sorts by the numeric representation of the character.
  • Dictionary order: This sorts things just as you would expect to see in a dictionary, with a twist: you can set additional options to control sensitivity to case, accent, and the like.


Collation can be set at the database and column level, so you have a fairly fine granularity in your level of control.
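A minimal sketch of column-level collation (the table is hypothetical; the two collation names are real SQL Server collations): a binary collation sorts by the characters' numeric codes, while a dictionary-order, case-insensitive collation treats 'a' and 'A' as equal.

```sql
-- Same data type, different sort/compare behavior per column.
CREATE TABLE dbo.Codes (
    CodeBinary varchar(10) COLLATE Latin1_General_BIN   NULL, -- binary order
    CodeDict   varchar(10) COLLATE Latin1_General_CI_AS NULL  -- dictionary order
);
```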


B-Trees




Question:
How many elements will a root page (and non-leaf page) contain? What factors determine that? Will these pages in the figure be allocated within the same extent? Where is extent? How does extent fit into this structure?


You start at the root and look through the records until you find the last page that starts with a value less than what you're looking for. You then obtain a pointer to that node and look through it until you find the row that you want.

Non-leaf level nodes are nodes that are somewhere in between the root and the node that tells you where the data is physically stored. Non-leaf level nodes can then point to other non-leaf level nodes, or to leaf level nodes. Leaf level nodes are the nodes where you obtain the real reference to the actual physical data.




You start with the root node just as before, and then move to the node that starts with the highest value that is equal to or less than what you're looking for and is also in the next level down. You then repeat the process: Look for the node that has the highest starting value at or below the value for which you're looking. You keep doing this, level by level down the tree, until you get to the leaf level — from there you know the physical location of the data and can quickly navigate to it.


Updating B-Trees: A First Look at Page Splits


When you add data to the tree, a node eventually becomes full and you need to split it. Because in SQL Server a node equates to a page, this is called a page split.



Instead of just inserting one row, you are:

  1. Creating a new page
  2. Migrating rows from the existing page to the new page
  3. Adding your new row to one of the pages
  4. Adding another entry in the parent node


Because you're in a tree arrangement, you have the possibility for something of a cascading action. When you create the new page (because of the split), you need to make another entry in the parent node. This entry in the parent node also has the potential to cause a page split at that level, and the process starts all over again. Indeed, this possibility extends all the way up to and can even affect the root node.


Page splits can have a very negative impact on system performance.


How Data Is Accessed in SQL Server


Using a table scan

When a table scan is performed, SQL Server starts at the physical beginning of the table, looking through every row in the table. As it finds rows that match the criteria of your query, it includes them in the result set.


Using an index

During the query optimization process, the optimizer takes a look at all the available indexes and chooses the best one (this is primarily based on the information you specify in your joins and WHERE clause, combined with statistical information SQL Server keeps on index makeup). Once that index is chosen, SQL Server navigates the tree structure to the point of data that matches your criteria and again extracts only the records it needs.


Index Types and Index Navigation

The indexes are built over:

  • a clustered table
    Any table that has a clustered index on it. The data is physically stored in a designated order. Individual rows are uniquely identified through the use of the cluster key — the columns that define the clustered index.
  • a heap
    Any table that does not have a clustered index on it. In this case, a unique identifier, or row ID (RID), is created based on a combination of the extent, page, and row offset (its place from the top of the page) for that row. A RID is necessary only if there is no cluster key available (no clustered index).


Clustered Indexes

A clustered index is unique for any given table — you can have only one per table. A clustered index is not mandatory.

What makes a clustered index special is that the leaf level of a clustered index is the actual data — that is, the data is re-sorted to be stored in the same physical order of the index sort criteria state. This means that once you get to the leaf level of the index, you’re done; you’re at the data.

And don't forget: a new record is inserted according to its correct physical position in the clustered index order.



SQL Server knows it needs the rest of that page because the information from the node one level up notifies it that it'll also need data from a few other pages. Because this is an ordered list, it can be sure it's continuous — that means if the next page has records that should be included, the rest of this page must be included. Therefore, clustered indexes are particularly good for range searches.


Non-Clustered Indexes on a Heap

The leaf level is not the data — instead, it is the level at which you can obtain a pointer to that data. This pointer comes in the form of a row identifier or RID, which, as I described earlier in the chapter, is made up of the extent, page, and row offset for the particular row that the index is pointing to.

Because the RID has the full information on the location of the row, you can go directly to the data from the RID.

With a clustered index, the data is physically in the order of the index. That means, for a range of data, when you find the row that has the beginning of your data on it, there's a good chance that the other rows are on that page with it (that is, you're already physically almost to the next record because they are stored together). With a heap, the data is not linked together in any way other than through the index. From a physical standpoint, there is absolutely no sorting of any kind. This means that from a physical read standpoint, your system may have to retrieve records from all over the file. Indeed, it's quite possible (possibly even probable) that you'll wind up fetching data from the same page several separate times.

On the bright side, the odds are extremely high that any page that has already been read once will still be in the memory cache and, as such, will be retrieved extremely quickly.



Non-Clustered Indexes on a Clustered Table

Just as with non-clustered indexes on a heap, the non-leaf level of the index looks pretty much as it did for a clustered index. The difference does not come until you get to the leaf level.

With clustered indexes, when the server got to the leaf level, it found the actual data. With non-clustered indexes on a heap, it didn't find the actual data, but it did find an identifier that let it go right to the data (it was just one step away). With nonclustered indexes on a clustered table, it finds the cluster key.

SQL Server retrieved a list from the first index lookup, but that list doesn't logically match up with the cluster key in any continuous fashion. Each record needs to be looked up individually.



To Create Index


To create it explicitly:


CREATE [UNIQUE] [CLUSTERED|NONCLUSTERED]
INDEX <index name> ON <table or view name>(<column name> [ASC|DESC] [,...n])
[INCLUDE (<column name> [, ...n])]
[WHERE <condition>]
[WITH
	[PAD_INDEX = { ON | OFF }]
	[[,] FILLFACTOR = <fillfactor>]
	[[,] IGNORE_DUP_KEY = { ON | OFF }]
	[[,] DROP_EXISTING = { ON | OFF }]
	[[,] STATISTICS_NORECOMPUTE = { ON | OFF }]
	[[,] SORT_IN_TEMPDB = { ON | OFF }]
	[[,] ONLINE = { ON | OFF }]
	[[,] ALLOW_ROW_LOCKS = { ON | OFF }]
	[[,] ALLOW_PAGE_LOCKS = { ON | OFF }]
	[[,] MAXDOP = <maximum degree of parallelism>]
	[[,] DATA_COMPRESSION = { NONE | ROW | PAGE }]
]
[ON {<filegroup> | <partition scheme name> | DEFAULT }]



The meaning of the options:


ASC/DESC
INCLUDE: This is also what covering indexes are about. With this option, you can put the raw data of some non-key columns at the leaf level of your index, so it applies only to non-clustered indexes. The benefit is that if that raw data is enough to resolve a query, SQL Server stops there; it won't do the final lookup on the clustered index and the further I/O that implies. But there is a cost: the leaves get bigger, which means more overhead during insertion, update, and deletion.
WHERE
PAD_INDEX
FILLFACTOR: You use this option to adjust page density, namely how full your pages are (non-leaf ones?). It won't affect what happens when a page split occurs, though. By default, this percentage is two rows short of full, with a minimum of one row per page. Determine it by the read/write ratio in your system.
IGNORE_DUP_KEY: If it is ON, inserting a duplicate record just changes the level of the error to a warning. A rollback isn't issued for the transaction (the error is a warning error rather than a critical error), but the duplicate row is still rejected.
DROP_EXISTING: As its name implies. It's much more efficient than simply dropping and re-creating an existing index when you use it with a clustered index.
  • If you rebuild an exact match of the existing index, SQL Server knows that it need not touch the non-clustered indexes, while an explicit drop and create involves rebuilding all non-clustered indexes twice in order to accommodate the different row locations. If you change the structure of the index using DROP_EXISTING, the NCIs are rebuilt only once instead of twice. (Why? How?)
  • Furthermore, you cannot simply drop and re-create an index created by a constraint, for example, to implement a certain FILLFACTOR. DROP_EXISTING is a workaround for this.
STATISTICS_NORECOMPUTE: Just leave it OFF.
SORT_IN_TEMPDB:
  • If this option is ON, the intermediate pages are written to tempdb rather than the database's own files when building the index in question.
  • If tempdb and the database are on separate physical drives, there is no competition between the read and write operations of the index build, and the process therefore will be faster.
  • Leaving it OFF means the intermediate pages are written out to the same physical files that the database is stored in.
ONLINE: Setting this to ON keeps the table available to your users while the index is built, but the I/O overhead is intensive; leaving it OFF makes the index build much faster. By default, the full index operation grabs the locks (eventually a table lock) needed for full and efficient access to the table. The side effect, however, is that your users are blocked out.
ALLOW ROW/PAGE LOCKS
MAXDOP: Max degree of parallelism.
DATA_COMPRESSION: See <Professional Microsoft SQL Server 2012 Administration> by Adam Jorgensen.
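Putting a few of the options above together, here is a hedged sketch of a covering index on the AdventureWorks table used later in these notes (the key and INCLUDE columns are my choice, not the book's):

```sql
-- Non-clustered index keyed on ProductID; OrderQty and UnitPrice ride along
-- at the leaf level, so a query touching only these columns never needs the
-- follow-up lookup into the clustered index.
CREATE NONCLUSTERED INDEX IX_SalesOrderDetail_ProductID_Covering
ON Sales.SalesOrderDetail (ProductID)
INCLUDE (OrderQty, UnitPrice)
WITH (FILLFACTOR = 90, SORT_IN_TEMPDB = ON);
```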


To create it implicitly:

Implied indexes are created when one of two constraints is added to a table:

  • A PRIMARY KEY
  • A UNIQUE constraint (aka, an alternate key)

Only {CLUSTERED|NONCLUSTERED} and FILLFACTOR can be specified when an index is created implicitly by a constraint; none of the other options are allowed.
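A sketch of both implied indexes in one (hypothetical) table: the PRIMARY KEY builds a clustered index and the UNIQUE constraint a non-clustered one, and FILLFACTOR is one of the two options still allowed here:

```sql
CREATE TABLE dbo.Customers (
    CustomerID int          NOT NULL,
    Email      varchar(255) NOT NULL,
    -- Implied clustered index on CustomerID:
    CONSTRAINT PK_Customers PRIMARY KEY CLUSTERED (CustomerID)
        WITH (FILLFACTOR = 80),
    -- Implied non-clustered index on Email (the "alternate key"):
    CONSTRAINT AK_Customers_Email UNIQUE NONCLUSTERED (Email)
);
```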


How to Make Decision? Should I Create Index or Not? What Kind of Index Should It Be? Which Columns?


Rule of thumb

  • Indexes, particularly non-clustered indexes, are primarily beneficial in situations where there is a reasonably high level of selectivity within the index.
  • Consider that the looping process created by the non-clustered index is not worth it unless you have somewhere in the area of 90–95 percent uniqueness in the indexed column: if one lookup in your non-clustered index generates multiple additional lookups in the clustered index, you are quite possibly better off with the table scan.
  • However, in all likelihood, you're going to benefit from having an index on a foreign key column. Because there are indexes on the related columns in both tables, the seek for both rows is very fast.
  • While indexes speed up performance when they're reading data, they are actually very costly when modifying data (deletion, insertion, and update).
  • Sometimes, the best thing to do is choose your indexes based on the transactions that are critical to your system and use the table in question. Does the code for the transaction have a WHERE clause in it? What column(s) does it use? Is there a sorting required?
  • By default, your primary key is created with a clustered index, but you can choose not to go with that. The point here is that a primary key can go without a clustered index, and you shouldn't just unthinkingly accept the default.
  • Clustered indexes are good when the column(s) in question is:
    • the subject of a frequent ranged query
    • used in the MAX, MIN, and COUNT aggregators with GROUP BY
    • used for sorting with ORDER BY
  • Clustered indexes are not good when:
    • there are a lot of non-sequential inserts; they will be rather costly.
  • If there is no good place to create a clustered index, you are better off with the table scan: just don't have any clustered index.
  • An index is only considered for use if the first column listed in the index is used in the query. The bright side is that there doesn't have to be an exact one-for-one match to every column — just the first. Naturally, the more columns that match (in order), the better, but only the first creates a definite do-not-use situation.
  • With Filtered Indexes, you can sharply limit which rows appear in your index. One relatively common use for filtered indexes is to effectively allow a unique constraint on a NULLable column. By creating a unique index with a WHERE IS NOT NULL clause, you can prevent duplicates in the real data and still permit as many NULL values as exist.
  • Unindexable conditions: When you filter on a function, rather than a column, you won't have an index that works. Either you can create that function as a persisted computed column and put an index on it or else you can try to find a way to rewrite the query such that the column is compared without a function wrapper.
  • Check whether your indexes are in use. If they're not in use, start looking for the reason — out of order WHERE clauses, lack of selectivity, suggested indexes, or unindexable conditions — and repair the problem.
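Two of the rules above can be sketched in T-SQL (all object names are hypothetical):

```sql
-- Filtered index as a unique constraint on a NULLable column: duplicates
-- among real values are rejected, but any number of NULLs is allowed.
CREATE UNIQUE NONCLUSTERED INDEX UQ_Employees_SSN
ON dbo.Employees (SSN)
WHERE SSN IS NOT NULL;

-- Unindexable condition made indexable: persist the function's result as a
-- computed column and index that, instead of filtering on LOWER(Email).
ALTER TABLE dbo.Customers ADD EmailLower AS LOWER(Email) PERSISTED;
CREATE NONCLUSTERED INDEX IX_Customers_EmailLower
ON dbo.Customers (EmailLower);
```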



How to Take Advantage of Tools to Verify



Display Estimated Query Plan


If you get this hint (the missing-index suggestion displayed alongside the plan) with an important query, in most cases you'll want to take it.


Database Engine Tuning Advisor




How to Maintain Index


ALTER INDEX is different from other ALTER commands — it is all about maintenance and zero about structure. If you need to change the makeup of your index, you still need either to DROP and CREATE it or to CREATE and use the index with the DROP_EXISTING=ON option.



REBUILD: The old index is completely thrown away and a new one is reconstructed from scratch. Every page in both the leaf and the non-leaf levels of the index is reconstructed as you have defined it. If the index in question is clustered, the physical data is also reorganized.
DISABLE
REORGANIZE: REORGANIZE works only on the leaf level of your index — non-leaf levels of the index go untouched. This means that you're not quite getting a full optimization. Given its much lower impact on users, this is typically the tool you'll want to use as part of your regular maintenance plan. Unfortunately, REORGANIZE does not allow for the change of several index settings, such as FILLFACTOR.


ALTER INDEX { <name of index> | ALL }
	ON <table or view name>
	{ REBUILD
		[ [ WITH (
			[ PAD_INDEX = { ON | OFF } ]
			| [[,] FILLFACTOR = <fillfactor> ]
			| [[,] SORT_IN_TEMPDB = { ON | OFF } ]
			| [[,] IGNORE_DUP_KEY = { ON | OFF } ]
			| [[,] STATISTICS_NORECOMPUTE = { ON | OFF } ]
			| [[,] ONLINE = { ON | OFF } ]
			| [[,] ALLOW_ROW_LOCKS = { ON | OFF } ]
			| [[,] ALLOW_PAGE_LOCKS = { ON | OFF } ]
			| [[,] MAXDOP = <max degree of parallelism> ]
				) ]
			| [ PARTITION = <partition number>
				[ WITH ( <partition rebuild index option>
					[ ,...n ] ) ] ] ]
	| DISABLE
	| REORGANIZE
		[ PARTITION = <partition number> ]
		[ WITH ( LOB_COMPACTION = { ON | OFF } ) ]
	| SET ([ ALLOW_ROW_LOCKS= { ON | OFF } ]
		| [[,] ALLOW_PAGE_LOCKS = { ON | OFF } ]
		| [[,] IGNORE_DUP_KEY = { ON | OFF } ]
		| [[,] STATISTICS_NORECOMPUTE = { ON | OFF } ]
		)
	} [ ; ]
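For example, the two maintenance paths might look like this against the table from the fragmentation query below; the FILLFACTOR value is my choice, not something the book prescribes:

```sql
-- Light, online defragmentation of the leaf level only:
ALTER INDEX ALL ON Sales.SalesOrderDetail REORGANIZE;

-- Full rebuild (heavier; consider off-peak hours), resetting page density:
ALTER INDEX ALL ON Sales.SalesOrderDetail
REBUILD WITH (FILLFACTOR = 90, SORT_IN_TEMPDB = ON);
```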



The B-Tree mechanism doesn't really have a whole lot to offer as you delete data. Fragmentation means a situation where many of your data pages hold only a small fraction of the data they could hold. Records that are spread out all over the place cause additional overhead in data retrieval. (Instead of just loading one page and grabbing the 10 rows it requires, SQL Server may have to load 10 separate pages to get that same information. It isn't just reading the row that causes effort — SQL Server has to read that page in first.) High fragmentation equates to poor read performance, but it also equates to excellent insert performance.


To identify how full the pages and extents in your database are, you can use sys.dm_db_index_physical_stats:

DECLARE @db_id SMALLINT;
DECLARE @object_id INT;
SET @db_id = DB_ID(N'AdventureWorks');
SET @object_id = OBJECT_ID(N'AdventureWorks.Sales.SalesOrderDetail');
SELECT database_id, object_id, index_id, index_depth, avg_fragmentation_in_percent, page_count
FROM sys.dm_db_index_physical_stats(@db_id, @object_id, NULL, NULL, NULL);


Logical Scan Fragmentation: An out-of-order page is one for which the next page indicated in the index allocation map (IAM) is different from that pointed to by the next-page pointer in the leaf page.
Extent Scan Fragmentation: This tells you whether an extent is not physically located next to the extent that it should be logically next to. It just means that the leaf pages of your index are not physically in order (though they still can be logically), and it shows what percentage of the extents this problem pertains to.
avg_fragmentation_in_percent: This is what you will rely on.

You can set the FILLFACTOR to be any value between 1 and 100 as long as you are doing a full rebuild (as opposed to just a reorg). This number is the percent that your pages are full once the database reorganization is complete. Remember, though, that as your pages split, your data is still distributed 50-50 between the two pages — you cannot control the fill percentage on an ongoing basis other than by regularly rebuilding the indexes.



  • If a FILLFACTOR isn't provided: ALTER INDEX uses whatever setting was used to build the index previously. If one was never specified, the FILLFACTOR makes the page full, which is too full for most situations. (??? Elsewhere the book says that if the index has not had a specific FILLFACTOR specified, the pages will be reconstituted to be full minus two records. Which one is correct?)
  • If a FILLFACTOR is provided: The FILLFACTOR value becomes the default FILLFACTOR for that index.
  • While a REORGANIZE is always done live, and a REBUILD can be (if you have the licensing required to enable that feature), I strongly recommend against running a rebuild live: it locks resources and can cause a host of problems. At the very least, look at doing it at non-peak hours.


Caution:
Keep in mind that if you change which column(s) your clustered index is on, SQL Server needs to do a complete resorting of your entire table (remember, for a clustered index, the table sort order and the index order are the same). Several questions should come to mind from this:

  • How long will it take?
  • Do I have enough space? You will, on average, need an additional 1.2 times (the working space plus the new index) the amount of space your table is already taking up. This is also affected by how you have your maximum size and growth options set for your database.
  • Should I use the SORT_IN_TEMPDB option? If tempdb is on a separate physical array from your main database and it has enough room, the answer is probably yes.
Careful with REBUILD: as soon as you kick off a rebuild, the index you are working on is essentially gone until the rebuild is complete (unless you have the ONLINE option available). Any queries that rely on that index may become exceptionally slow (potentially by orders of magnitude). This is the sort of thing you want to test on an offline system first to get an idea how long it's going to take, and then schedule to run in off hours (preferably with someone monitoring it to be sure it's back online when peak hours come along). In general, reorganizing is going to be a better option.

