From Oracle Documents
Oracle9i Database Performance Tuning Guide and Reference
Release 2 (9.2)Part Number A96533-02
Using Histograms
The cost-based optimizer can use data value histograms to get accurate estimates of the distribution of column data. A histogram partitions the values in the column into bands, so that all column values in a band fall within the same range. Histograms provide improved selectivity estimates in the presence of data skew, resulting in optimal execution plans with nonuniform data distributions.
One of the fundamental tasks of the cost-based optimizer is determining the selectivity of predicates that appear in queries. Selectivity estimates are used to decide when to use an index and the order in which to join tables. Some attribute domains (a table's columns) are not uniformly distributed.
The cost-based optimizer uses height-based histograms on specified attributes to describe the distributions of nonuniform domains. In a height-based histogram, the column values are divided into bands so that each band contains approximately the same number of values. The useful information that the histogram provides, then, is where in the range of values the endpoints fall.
See Also: |
Consider a column C with values between 1 and 100 and a histogram with 10 buckets. If the data in C is uniformly distributed, then the histogram looks similar to Figure 3-1, where the numbers are the endpoint values.
Figure 3-1 Histogram with Uniform Distribution
Text description of the illustration pfgrf208.gif
The number of rows in each bucket is one tenth the total number of rows in the table. Four-tenths of the rows have values between 60 and 100 in this example of uniform distribution.
If the data is not uniformly distributed, then the histogram might look similar to Figure 3-2.
Figure 3-2 Histogram with Non-Uniform Distribution
Text description of the illustration pfgrf209.gif
In this case, most of the rows have the value 5 for the column; only 1/10 of the rows have values between 60 and 100.
When to Use Histograms
Histograms can affect performance and should be used only when they substantially improve query plans. Histogram statistics data is persistent, so the space required to save the data depends on the sample size. In general, create histograms on columns that are used frequently in WHERE
clauses of queries and have a highly skewed data distribution. For uniformly distributed data, the cost-based optimizer can make fairly accurate guesses about the cost of executing a particular statement without the use of histograms.
Histograms, like all other optimizer statistics, are static. They are useful only when they reflect the current data distribution of a given column. (The data in the column can change as long as the distribution remains constant.) If the data distribution of a column changes frequently, you must recompute its histogram frequently.
Histograms are not useful for columns with the following characteristics:
- All predicates on the column use bind variables.
- The column data is uniformly distributed.
- The column is unique and is used only with equality predicates.
Creating Histograms
You generate histograms by using the DBMS_STATS
package. You can generate histograms for columns of a table or partition. For example, to create a 10-bucket histogram on the SAL
column of the emp
table, issue the following statement:
EXECUTE DBMS_STATS.GATHER_TABLE_STATS ('scott','emp', METHOD_OPT => 'FOR COLUMNS SIZE 10 sal');
The SIZE
keyword declares the maximum number of buckets for the histogram. You would create a histogram on the SAL
column if there were an unusually high number of employees with the same salary and few employees with other salaries. You can also collect histograms for a single partition of a table.
Oracle Corporation recommends using the DBMS_STATS
package to have the database automatically decide which columns need histograms. This is done by specifying SIZE
AUTO
.
See Also: Oracle9i Supplied PL/SQL Packages and Types Reference for more information on the |
Choosing the Number of Buckets for a Histogram
If the number of frequently occurring distinct values in a column is relatively small, then set the number of buckets to be greater than that number. The default number of buckets for a histogram is 75. This value provides an appropriate level of detail for most data distributions. However, because the number of buckets and the data distribution both affect a histogram's usefulness, you might need to experiment with different numbers of buckets to obtain optimal results.
Types of Histograms
There are two types of histograms:
Understanding Height-Based Histograms
Height-based histograms place approximately the same number of values into each range, so that the endpoints of the range are determined by how many values are in that range. Only the last (largest) values in each bucket appear as bucket (end point) values.
Consider that a table's query results in the following four sample values: 4, 18, 30, and 35.
For a height-based histogram, each of these values occupies a portion of one bucket, in proportion to their size. The resulting selectivity is computed with the following formula:
S = Height(35) / Height(4 + 18 + 30 + 35)
Understanding Value-Based Histograms
Value-based histograms are created when the number of distinct values is less than or equal to the number of histogram buckets specified. In value-based histograms, all the values in the column have corresponding buckets, and the bucket number reflects the repetition count of each value. These can also be known as frequency histograms.
Consider the same four sample values in the previous example. In a value-based histogram, a bucket is used to represent each of the four distinct values. In other words, one bucket represents 4, one bucket represents 18, another represents 30, and another represents 35. The resulting selectivity is computed with the following formula:
S = [#rows(35)/(#rows(4) + #rows(18) + #rows(30) + #rows(35))] / #buckets
If there are many different values anticipated for a particular column of table, it is preferable to use the value-based histogram rather than the height-based histogram. This is because if there is much data skew in the height, then the skew can offset the selectivity calculation and give a nonrepresentative selectivity value.
Using Histograms
Example 3-6 illustrates the use of a histogram in order to improve the execution plan and demonstrate the skewed behavior of the s6
indexed column.
Example 3-6 Using a Histogram to Improve an Execution Plan
UPDATE so_lines l SET open_flag=null, s6=10, s6_date=sysdate, WHERE l.line_type_code in ('REGULAR','DETAIL','RETURN') AND l.open_flag = 'Y' AND NVL(l.shipped_quantity, 0)=0 OR NVL(l.shipped_quantity, 0) != 0 AND l.shipped_quantity +NVL(l.cancelled_quantity, 0)= l.ordered_quantity)) AND l.s6=18
This query shows the skewed distribution of data values for s6
. In this case, there are two distinct non-null values: 10 and 18. The majority of the rows consists of s6
= 10 (1,589,464), while a small number of rows consist of s6
= 18 (13,091).
S6: COUNT(*) ====================== 10 1,589,464 18 13,091 NULL 21,889
The selectivity of column s6
, where s6
=
18
:
S = 13,091 / (13,091 + 1,589,464) = 0.008
If No Histogram is Used: The selectivity of column s6
is assumed to be 50%, uniformly distributed across 10 and 18. This is not selective; therefore, s6
is not an ideal choice for use as an index.
If a Histogram is Used: The data distribution information is stored in the dictionary. This allows the optimizer to use this information and compute the correct selectivity based on the data distribution. In Example 3-6, the selectivity, based on the histogram data, is 0.008. This is a relatively high, or good, selectivity, which leads the optimizer to use an index on column s6
in the execution plan.
Viewing Histograms
To view histogram information, query the appropriate data dictionary view (USER_
, ALL_
, or DBA_
). The following list shows the DBA_
views:
DBA_HISTOGRAMS
DBA_PART_HISTOGRAMS
DBA_SUBPART_HISTOGRAMS
DBA_TAB_COL_STATISTICS
Number of Rows
View the DBA_HISTOGRAMS
dictionary table for the number of buckets (in other words, the number of rows) for each column:
ENDPOINT_NUMBER
ENDPOINT_VALUE
See Also: Oracle9i Database Reference for column descriptions of data dictionary views, as well as histogram use and restrictions
Verifying Histogram Statistics
To verify that histogram statistics are available, query the data dictionary's DBA_HISTOGRAMS
table, using a statement similar to Example 3-7.
Example 3-7 Verifying Histogram Statistics
SQL> SELECT ENDPOINT_NUMBER, ENDPOINT_VALUE FROM DBA_HISTOGRAMS WHERE TABLE_NAME ="SO_LINES_ALL" AND COLUMN_NAME="S2" ORDER BY ENDPOINT_NUMBER;
This query returns the following typical data:
ENDPOINT_NUMBER ENDPOINT_VALUE --------------- --------------- 1365 4 1370 5 2124 8 2228 18
One row corresponds to one bucket in the histogram. Consider the differences between ENDPOINT_NUMBER
values in Example 3-7 listed in Table 3-5.
Table 3-5 ENDPOINT_NUMBER Differences
Bucket (values) | ENDPOINT_NUMBER Difference | Number of Values in Bucket |
---|---|---|
1 (0 to 4) | N/A | N/A |
2 (4 to 5) | 1370 - 1365 | 5 |
3 (5 to 8) | 2124 - 1370 | 754 |
4 (8 to 18) | 2228 - 2124 | 104 |
Table 3-5 shows that the buckets hold very different numbers of values. The data is skewed: 754 values are between 5 and 8, but only 104 are between 8 and 18. More buckets should be used.
来自 “ ITPUB博客 ” ,链接:http://blog.itpub.net/66634/viewspace-1011565/,如需转载,请注明出处,否则将追究法律责任。
转载于:http://blog.itpub.net/66634/viewspace-1011565/