Historgram Information

Metalink Note 72539.1

PURPOSE
To provide an understanding of how histogram information is stored and
can be interpreted.

SCOPE & APPLICATION
Familiarity of the Cost Based Optimizer is useful.

RELATED DOCUMENTS
Note 1031826.6

Where there is a high degree of skew in the column distribution, called a
non-uniform. distribution of data, histograms should lead to a better
estimation of selectivity. This should produce plans that are more likely
to be optimal. 

The histogram approach provides an efficient and compact way to represent
data distributions.

When building histograms the information it stores is interpreted differently
depending on whether the number of buckets requested is less than the number
distinct values or if it is the same.  Specifically, ENDPOINT_NUMBER and
ENDPOINT_VALUE in dba/user/all_histograms would have different meanings.


EXAMPLE
-------

Table TAB1

SQL> desc tab1
Name                            Null?    Type
------------------------------- -------- ----
A                                        NUMBER(6)
B                                        NUMBER(6)

Column A contains unique values from 1 to 10000.

Column B contains 10 distinct values. The value '5' occurs 9991 times. Values
'1, 2, 3, 4, 9996, 9997, 9998, 9999, 10000' occur only once.

Test queries:

(1) select * from tab1 where b=5;
(2) select * from tab1 where b=3;

Both the above queries would use a FULL TABLE SCAN as there is no other
access method available.

Then we create an index on column B.

select lpad(INDEX_NAME,10), lpad(TABLE_NAME,10),
       lpad(COLUMN_NAME,10), COLUMN_POSITION, COLUMN_LENGTH
from user_ind_columns
where table_name='TAB1'

SQL> /

LPAD(INDEX LPAD(TABLE LPAD(COLUM COLUMN_POSITION COLUMN_LENGTH
---------- ---------- ---------- --------------- -------------
    TAB1_B       TAB1          B               1            22

Now,

(1) select * from tab1 where b=5;
(2) select * from tab1 where b=3;

Both do an INDEX RANGE SCAN to get the ROWID to do a lookup in the table.

With an INDEX present, it would preferrable to do an INDEX RANGE SCAN for
query (2), but a FULL TABLE SCAN for query (1).


ANALYZING THE TABLE
-------------------

Next, analyze the table using compute statistics:

SQL> analyze table tab1 compute statistics;

From dba_tables:

  NUM_ROWS     BLOCKS EMPTY_BLOCKS  AVG_SPACE  CHAIN_CNT AVG_ROW_LEN
---------- ---------- ------------ ---------- ---------- -----------                             
     10000         64            0         86          0          10

From dba_tab_columns:

NUM_DISTINCT LOW  HIGH   DENSITY  NUM_NULLS NUM_BUCKETS LAST_ANALYZ SAMPLE_SIZE
------------ ---- ---- --------- ---------- ----------- ----------- -----------
       10000 Full Full     .0001          0           1 30-JUN-1999       10000
          10 Full Full        .1          0           1 30-JUN-1999       10000

For Oracle7, from user_histograms:

SQL> select lpad(TABLE_NAME,10), lpad(COLUMN_NAME, 10),
  2  bucket_number, endpoint_value
  3  from user_histograms
  4  where table_name='TAB1';

TABLE_NAME COLUMN_NAME BUCKET_NUMBER ENDPOINT_VALUE
---------- ----------- ------------- --------------
      TAB1           A             0              1
      TAB1           A             1          10000
      TAB1           B             0              1
      TAB1           B             1          10000

For Oracle8, from user_tab_histograms:

SQL> select lpad(TABLE_NAME,10), lpad(COLUMN_NAME, 10),
  2  bucket_number, endpoint_value
  3  from user_tab_histograms
  4  where table_name='TAB1';

Analyze has created 1 BUCKET for each column. So all values for the column
are in the same bucket.  The BUCKET_NUMBER represents the BUCKET NUMBER and
ENDPOINT_VALUE represents the last column value in that bucket.

Now query (1) and (2) ; both do a FULL TABLE SCAN.

So, the fact that you have statistics about the table and columns does not
help the optimizer to distinguish between how many of each value we have.
The reason it does a FULL TABLE SCAN is because there is a 1 BUCKET histogram
and any value selected for should be in that bucket.


CREATING HISTOGRAMS
-------------------

What you need now is to create histograms so the Optimizer knows how many
values occur for each column.

Query (1): select * from tab1 where b=5;
           should do a FULL TABLE SCAN   and

Query (2): select * from tab1 where b=3;
           should do an INDEX RANGE SCAN

SQL> analyze table tab1 compute statistics for columns b size 10;

select lpad(TABLE_NAME,10), lpad(COLUMN_NAME, 5),
       endpoint_number, endpoint_value
from user_histograms;

TABLE_NAME COLUMN_NAME ENDPOINT_NUMBER ENDPOINT_VALUE
      TAB1           B               1              1
      TAB1           B               2              2
      TAB1           B               3              3
      TAB1           B               4              4
      TAB1           B            9995              5
      TAB1           B            9996           9996
      TAB1           B            9997           9997
      TAB1           B            9998           9998
      TAB1           B            9999           9999
      TAB1           B           10000          10000

So, now there are statistics on the table and on the columns.

You requested 10 buckets (size 10) and there are 10 distinct values.

The ENDPOINT_VALUE shows the column value and the ENDPOINT_NUMBER
shows the cumulative number of rows.

For example, for ENDPOINT_VALUE 2, it has an ENDPOINT_NUMBER 2, the previous
ENDPOINT_NUMBER is 1, hence the number of rows with value 2 is 1. 

Another example is ENDPOINT_VALUE 5. Its ENDPOINT_NUMBER is 9995. The previous
bucket ENDPOINT_NUMBER is 4, so 9995 - 4 = 9991 rows containing the value 5.

So, now QUERY (1) does in fact do a Full Table Scan.

SQL> select * from tab1 where b=5
SQL> /

Execution Plan
----------------------------------------------------------
0      SELECT STATEMENT ptimizer=ALL_ROWS (Cost=10 Card=9991 Bytes=99910)

1    0   TABLE ACCESS (FULL) OF 'TAB1' (Cost=10 Card=9991 Bytes=99910)


And, QUERY (2) does do an Index Range Scan.

SQL> select * from tab1 where b=3
SQL> /

Execution Plan
----------------------------------------------------------
0      SELECT STATEMENT ptimizer=ALL_ROWS (Cost=6 Card=500 Bytes=5000)
1    0   TABLE ACCESS (BY ROWID) OF 'TAB1' (Cost=6 Card=500 Bytes=5000)
2    1     INDEX (RANGE SCAN) OF 'TAB1_B' (NON-UNIQUE)

This is fine if you have a low number of distinct values, but there can
be tables with a huge number of distinct values.  You don't want to
create a bucket for each value. There would be too much overhead.
In this case you would request less buckets than distinct values.


CREATING HISTOGRAMS WITH LESS BUCKETS THAN DISTINCT VALUES
----------------------------------------------------------

SQL> analyze table tab1 compute statistics for columns b size 8;

SQL> select lpad(TABLE_NAME,10), lpad(COLUMN_NAME, 5),
  2>       endpoint_number, endpoint_value
  3> from user_histograms;

LPAD(TABLE LPAD( ENDPOINT_NUMBER ENDPOINT_VALUE
---------- ----- --------------- --------------
TAB1     B               0              1
TAB1     B               7              5
TAB1     B               8          10000

Here, Oracle creates the requested number of buckets but puts the same
number of values into each bucket, where there are more endpoint values
that are the same for the frequently occurring value.

The ENDPOINT_NUMBER is the actual bucket number and ENDPOINT_VALUE is
the endpoint value of the bucket determined by the column value.

From above, bucket 0 holds the low value for the column. You cannot see
buckets 1 to 6 so as to save space.

But we have bucket 1 with an endpoint of 5,
    bucket 2 with an endpoint of 5,
    bucket 3 with an endpoint of 5,
    bucket 4 with an endpoint of 5,
    bucket 5 with an endpoint of 5,
    bucket 6 with an endpoint of 5,
    bucket 7 with an endpoint of 5 AND
    bucket 8 with an endpoint of 10000

So bucket 8 contains values between 5 and 10000.
All buckets contain the same number of values (which is why they are called
height-balanced histograms), except the last bucket may have less values
then the other buckets.

If the data is uniform, you would not use histograms. However, if you request
the same number of buckets as distinct values, Oracle creates 1 bucket.  If
you request less buckets, Oracle uses an algorithm to balance values into each
bucket and any values that remain (which have to be less then the number
stored in each height-balanced bucket) go into the last bucket.


STORING CHARACTER VALUES IN HISTOGRAMS
--------------------------------------

Character columns have some exceptional behaviour, in as much as we store
histogram data for the first 32 bytes of any string.  Any predicates that
contain strings greater than 32 characters will not use histogram information
and the selectivity will be 1 / DISTINCT.

Data in histogram endpoints is normalized to double precision floating point
arithmetic.

EXAMPLE
-------

SQL> select * from morgan;

A
----------
a
b
c
d
e
e
e
e


The table contains 5 distinct values. There is one occurance of 'a', 'b', 'c'
and 'd' and 4 of 'e'.

Create a histogram with 5 buckets.

SQL> analyze table morgan compute statistics for columns a size 5;

Looking in user_histograms:

LPAD(TABLE LPAD( ENDPOINT_NUMBER ENDPOINT_VALUE
---------- ----- --------------- --------------
    MORGAN     A               1     5.0365E+35
    MORGAN     A               2     5.0885E+35
    MORGAN     A               3     5.1404E+35
    MORGAN     A               4     5.1923E+35
    MORGAN     A               8     5.2442E+35

So, ENDPOINT_VALUE 5.0365E+35 represents a
5.0885E+35 represents b
5.1404E+35 represents c
5.1923E+35 represents d
5.2442E+35 represents e

Then if you look at the cumulative values for ENDPOINT_NUMBER,
the corresponding  ENDPOINT_VALUE's are correct.

来自 “ ITPUB博客 ” ,链接:http://blog.itpub.net/16505464/viewspace-582409/,如需转载,请注明出处,否则将追究法律责任。

转载于:http://blog.itpub.net/16505464/viewspace-582409/

  • 0
    点赞
  • 0
    收藏
    觉得还不错? 一键收藏
  • 0
    评论
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值