【翻译自mos文章】解释直方图信息

 

 

解释直方图信息
来源于:
Interpreting Histogram Information (Doc ID 72539.1)

适用于:
Oracle Database - Enterprise Edition - Version 7.3.0.0 and later
Oracle Database - Standard Edition - Version 7.3.0.0 and later
Oracle Database - Personal Edition - Version 7.3.0.0 and later
Information in this document applies to any platform.

目的:
直方图信息是怎么被存储的,是怎么被解释的。

范围:
其他有用的直方图参考:
Document 1445372.1 Histograms: An Overview (10g and Above)

细节:
直方图是一种机制,该机制用来存储 列数据(column data)的详细信息。该数据被CBO使用,用来决定一个查询语句最优化的访问路径(access path).
没有直方图时,优化器依靠的所有信息是:一个列的高值和低值,该列的不同值个数,该列的空值个数,该table的记录总数。
(实际上列的高值和低值是以raw 格式存储的,因此不是特别有用),其他的信息可以从dictionary views中查询到。

没有列的统计信息时,优化器假设数据是均衡分布的,对于等值谓词,生成以一个选择率(column selectivity),该选择率是如下计算的:1/NVD(Number of Distinct Values)

有直方图时,你可以访问行数据的更多分布信息。

当一个列的数据分布不均衡时(即:列的数据分布 高度倾斜--数据分布倾斜的很厉害),Oracle 可以存储列的直方图以给出更好的选择率.这会产生比使用标准的统计信息(high and low values plus Number of Distinct Values)更好的执行计划

就具体实现而言(In terms of implementation),我们可以选择 将 每个不同值和该值的记录数存放在一起,对于值很少的记录数是有效的,此时,'width balanced' histograms 被使用。

随着不同值数量的增长,存储数据的数量变得过高,我们需要使用一个不同的方法来存储直方图数据。此时,我们可以选择 height balanced histograms.

使用如上两种方法,列直方图提供了一个有效和集中的方法来展现数据分布。当建立直方图时,存储的信息依靠“不同值的数量是否小于等于bucket(默认75个,最大254个)的数量”进行不同的解释。

 如果不同值的数量小于等于直方图bucket的数量(bucket最多254个),那么 Frequency Histogram 被建立
 如果不同值的数量大于直方图bucket的数量, Height Balanced Histogram 被建立。


Frequency Histogram
 Frequency Histogram 使用bucket来记录每一个不同值的记录个数

Height Balanced Histogram
  Height Balanced Histogram 通过把数据分割到不同bucket中来实现。每个bucket 包括相同数量的列值。每个bucket中的最高值(or END_POINT)和最低值被记录在零号bucket中。

一旦数据被存储于bucket中,我们可以识别两个类型的data value--- Non-popular values and popular values

 Non-popular values--are those that do not occur multiple times as end points.不会出现多次
 Popular values--occur multiple times as end points.会出现多次。

We can use Popular and Non-Popular Values to provide use with various statistics.Since we know how many values there are in a bucket we can use this information to estimate the number of rows in total that are covered by Popular and Non-Popular values.
•The selectivity for popular values can be obtained by calculation the proportion of bucket endpoints filled by that popular value.
•The selectivity for non popular values can now be calculated as 1/number non-popular bucket endpoints, so we can now be more accurate about selectivities than the original 1/NDV, because we have removed the popular values from the equation.


How histograms are used
直方图被用来得到column predicate 更好的selectivity 估算

Where there are fewer distinct values than buckets, the selectivity is simply calculated as we have accurate row information for each value. For the case where we have more distinct values than buckets, the following outlines how these selectivities are obtained.

Equality Predicate Selectivity calculated from:
•Popular Value: 
 Number of buckets for value / Total Number of buckets
•Non-Popular Value:
 Density see:


Document 43041.1 Query Optimizer: What is Density?


Less than < (Same principle applies for > & >= )
•All Values:
 Buckets with endpoints < value / Total No. of buckets

 

Histogram Examples

 

 Table TAB1

SQL> desc tab1
 Name                            Null?    Type
 ------------------------------- -------- ----
 A                                        NUMBER(6)
 B                                        NUMBER(6)

 

Column A contains unique values from 1 to 10000.
Column B contains 10 distinct values.

The value '5' occurs 9991 times.
Values '1, 2, 3, 4, 9996, 9997, 9998, 9999, 10000' occur only once.

i.e.

 

select distinct B , count(*)
from HTAB1
group by B
order by B
;

         B   COUNT(*)
---------- ----------
         1          1
         2          1
         3          1
         4          1
         5       9991
      9996          1
      9997          1
      9998          1
      9999          1
     10000          1

10 rows selected.

 

There is an index on Column B.
Statistics are gathered without Histograms using:

 

exec DBMS_STATS.GATHER_TABLE_STATS (NULL,'HTAB1', method_opt => 'FOR ALL COLUMNS SIZE 1');

Setup:

 

drop table HTAB1;
create table HTAB1 (a number, b number);

  Insert into HTAB1 ( A,B) values ( 1,1);
  Insert into HTAB1 ( A,B) values ( 2,2);
  Insert into HTAB1 ( A,B) values ( 3,3);
  Insert into HTAB1 ( A,B) values ( 4,4);
  Insert into HTAB1 ( A,B) values ( 9996,9996);
  Insert into HTAB1 ( A,B) values ( 9997,9997);
  Insert into HTAB1 ( A,B) values ( 9998,9998);
  Insert into HTAB1 ( A,B) values ( 9999,9999);
  Insert into HTAB1 ( A,B) values ( 10000,10000);

commit;
begin
 for i in 5 .. 9995 loop
  Insert into HTAB1 ( A,B)
values ( i,5);
  if (mod(i,100) = 0) then
     commit;
  end if;
 end loop;
 commit;
end;
/
commit;


create index HTAB1_B on HTAB1(b);
exec DBMS_STATS.GATHER_TABLE_STATS (NULL,'HTAB1', method_opt => 'FOR ALL COLUMNS SIZE 1');

alter session set OPTIMIZER_DYNAMIC_SAMPLING = 0;

 

Function to convert raw data in to numeric data:

create or replace function raw_to_number(my_input raw)
return number
as
    my_output number;
begin
    dbms_stats.convert_raw_value(my_input,my_output);
    return my_output;
end;
/  

This results in statistics as follows:

 

column COLUMN_NAME format a5 heading COL
column NUM_DISTINCT format 99990
column LOW_VALUE format 99990
column HIGH_VALUE format 99990
column DENSITY format 99990
column NUM_NULLS format 99990
column NUM_BUCKETS format 99990
column SAMPLE_SIZE format 99990
select COLUMN_NAME,NUM_DISTINCT,raw_to_number(LOW_VALUE) Low,raw_to_number(HIGH_VALUE) High,DENSITY,NUM_NULLS,
       NUM_BUCKETS,LAST_ANALYZED,SAMPLE_SIZE,HISTOGRAM
from user_tab_columns
where table_name = 'HTAB1';

COL   NUM_DISTINCT        LOW       HIGH DENSITY NUM_NULLS NUM_BUCKETS LAST_ANALYZED        SAMPLE_SIZE HISTOGRAM
----- ------------ ---------- ---------- ------- --------- ----------- -------------------- ----------- ---------------
A            10000          1      10000       0         0           1 31-jan-2013 09:32:08       10000 NONE
B               10          1      10000       0         0           1 31-jan-2013 09:32:08       10000 NONE


select lpad(TABLE_NAME,10) TAB, lpad(COLUMN_NAME, 10) COL,
 ENDPOINT_NUMBER, ENDPOINT_VALUE
from user_histograms
where table_name='HTAB1'
order by COL, ENDPOINT_NUMBER;


TAB        COL        ENDPOINT_NUMBER ENDPOINT_VALUE
---------- ---------- --------------- --------------
     HTAB1          A               0              1
     HTAB1          A               1          10000
     HTAB1          B               0              1
     HTAB1          B               1          10000

 

In the above you can see that the statistics gathering has not created a histogram. There is a single bucket and high and a low ENDPOINT_NUMBER for each column value ( you will always get 2 entries in USER_HISTOGRAMS for each column, for the high and low values respectively).

 

Test queries:
  • select * from htab1 where b=5;
    
  • select * from htab1 where b=3;
    

To replicate the tests you will need to disable OPTIMIZER_DYNAMIC_SAMPLING

 

alter session set OPTIMIZER_DYNAMIC_SAMPLING = 0;

 

See:

Document 336267.1 Optimizer Dynamic Sampling (OPTIMIZER_DYNAMIC_SAMPLING)

Without Histograms, both queries do an INDEX RANGE SCAN because the optimizer believes that the data is uniformly distributed in column B and that each predicate with return 1/10th of the values because there are 10 distinct values:

 

---------------------------------------------------------------------------------------
| Id  | Operation                   | Name    | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT            |         |  1111 |  6666 |     5   (0)| 00:00:01 |
|   1 |  TABLE ACCESS BY INDEX ROWID| HTAB1   |  1111 |  6666 |     5   (0)| 00:00:01 |
|*  2 |   INDEX RANGE SCAN          | HTAB1_B |  1111 |       |     3   (0)| 00:00:01 |
---------------------------------------------------------------------------------------

 

In fact it may be preferable to use a Full Table Scan for the select where b=5 and index lookups for the others.

 

Gathering Histogram Statistics

If we collect histogram statistics with the recommended settings:

 

exec DBMS_STATS.GATHER_TABLE_STATS (NULL,'HTAB1', method_opt => 'FOR ALL COLUMNS SIZE AUTO'); 

 

The b=5 query now does a Full Table Scan

 

 select * from htab1 where b=5;

---------------------------------------------------------------------------
| Id  | Operation         | Name  | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |       |  9991 | 69937 |     7   (0)| 00:00:01 |
|*  1 |  TABLE ACCESS FULL| HTAB1 |  9991 | 69937 |     7   (0)| 00:00:01 |
---------------------------------------------------------------------------

 

The query where B is 3 still uses an index:

 

 select * from htab1 where b=3;

---------------------------------------------------------------------------------------
| Id  | Operation                   | Name    | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT            |         |     1 |     7 |     2   (0)| 00:00:01 |
|   1 |  TABLE ACCESS BY INDEX ROWID| HTAB1   |     1 |     7 |     2   (0)| 00:00:01 |
|*  2 |   INDEX RANGE SCAN          | HTAB1_B |     1 |       |     1   (0)| 00:00:01 |
---------------------------------------------------------------------------------------

 

This is because a FREQUENCY Histogram has been created:

 

COL   NUM_DISTINCT        LOW       HIGH DENSITY NUM_NULLS NUM_BUCKETS LAST_ANALYZED        SAMPLE_SIZE HISTOGRAM
----- ------------ ---------- ---------- ------- --------- ----------- -------------------- ----------- ---------------
A            10000          1      10000       0         0           1 31-jan-2013 09:58:01       10000 NONE
B               10          1      10000       0         0          10 31-jan-2013 09:58:01       10000 FREQUENCY

TAB        COL        ENDPOINT_NUMBER ENDPOINT_VALUE
---------- ---------- --------------- --------------
     HTAB1          A               0              1
     HTAB1          A               1          10000
     HTAB1          B               1              1
     HTAB1          B               2              2
     HTAB1          B               3              3
     HTAB1          B               4              4
     HTAB1          B            9995              5
     HTAB1          B            9996           9996
     HTAB1          B            9997           9997
     HTAB1          B            9998           9998
     HTAB1          B            9999           9999
     HTAB1          B           10000          10000

12 rows selected.

 

On Column B there are 10 buckets matching up with the 10 distinct values.

The ENDPOINT_VALUE shows the column value and the ENDPOINT_NUMBER shows the cumulative number of rows. So the number of rows for ENDPOINT_VALUE 2, it has an ENDPOINT_NUMBER 2, the previous ENDPOINT_NUMBER is 1, hence the number of rows with value 2 is 1. Another example is ENDPOINT_VALUE 5. Its ENDPOINT_NUMBER is 9995. The previous bucket ENDPOINT_NUMBER is 4, so 9995 - 4 = 9991 rows containing the value 5.

Frequency histograms work fine with a low number of distinct values, but when the number exceeds the maximum number of buckets, you cannot create a bucket for each value. In this case the Optimizer creates Height balanced histograms.

Height Balanced Histograms

You can demonstrate this situation by forcing the optimizer to create fewer buckets than the Number of Distinct Values. i.e. using 8 buckets for 10 Distinct Values:

 

exec DBMS_STATS.GATHER_TABLE_STATS (NULL,'HTAB1', method_opt => 'FOR COLUMNS B SIZE 8'); 

 

So now we have gathered a HEIGHT BALANCED HISTOGRAM for Column B:

 

COL   NUM_DISTINCT        LOW       HIGH DENSITY NUM_NULLS NUM_BUCKETS LAST_ANALYZED        SAMPLE_SIZE HISTOGRAM
----- ------------ ---------- ---------- ------- --------- ----------- -------------------- ----------- ---------------
A            10000          1      10000       0         0           1 31-jan-2013 09:58:01       10000 NONE
B               10          1      10000       0         0           8 31-jan-2013 09:59:09       10000 HEIGHT BALANCED

TAB        COL        ENDPOINT_NUMBER ENDPOINT_VALUE
---------- ---------- --------------- --------------
     HTAB1          A               0              1
     HTAB1          A               1          10000
     HTAB1          B               0              1
     HTAB1          B               7              5
     HTAB1          B               8          10000

 


Notice that there are 8 Buckets against B now.

Oracle puts the same number of values in each bucket and records the endpoint of each bucket.

With HEIGHT BALANCED Histograms, the ENDPOINT_NUMBER is the actual bucket number and ENDPOINT_VALUE is the endpoint value of the bucket determined by the column value.

From the above, bucket 0 holds the low value for the column.

Because buckets 1-7 have the same endpoint, Oracle does not store all these rows to save space. But we have: bucket 1 with an endpoint of 5, bucket 2 with an endpoint of 5, bucket 3 with an endpoint of 5, bucket 4 with an endpoint of 5, bucket 5 with an endpoint of 5, bucket 6 with an endpoint of 5, bucket 7 with an endpoint of 5 AND bucket 8 with an endpoint of 10000 So bucket 1 contains values between 1 and 5, bucket 8 contains values between 5 and 10000.

All buckets contain the same number of values (which is why they are called height-balanced histograms), except the last bucket may have fewer values then the other buckets.

Storing Character Values in Histograms

For character columns, Oracle only stores the first 32 bytes of any string (there are also limits on numeric columns, but these are less frequently an issue since the majority of numbers are insufficiently large to encounter any problems). See:

Document 212809.1 Limitations of the Oracle Cost Based Optimizer

Any predicates that contain strings greater than 32 characters will not use histogram information and the selectivity will be 1 / Number of DISTINCT Values. Data in histogram endpoints is normalized to double precision floating point arithmetic.

For Example

 

SQL> select * from example;

A
----------
a
b
c
d
e
e
e
e
  

 

The table contains 5 distinct values. There is one occurence of 'a', 'b', 'c' and 'd' There are 4 occurrences of 'e'. If we create a histogram: Looking in user_histograms:

 

TABLE      COL   ENDPOINT_NUMBER ENDPOINT_VALUE
---------- ----- --------------- --------------
   EXAMPLE     A               1     5.0365E+35
   EXAMPLE     A               2     5.0885E+35
   EXAMPLE     A               3     5.1404E+35
   EXAMPLE     A               4     5.1923E+35
   EXAMPLE     A               8     5.2442E+35

 

So:

 

 ENDPOINT_VALUE 	5.0365E+35 represents a
			5.0885E+35 represents b
			5.1404E+35 represents c
			5.1923E+35 represents d
			5.2442E+35 represents e

 

Then, if you look at the cumulative values for ENDPOINT_NUMBER, the corresponding ENDPOINT_VALUE's are correct.

 

 

【6层】一字型框架办公楼(含建筑结构图、计算书) 1、资源项目源码均已通过严格测试验证,保证能够正常运行; 2、项目问题、技术讨论,可以给博主私信或留言,博主看到后会第一时间与您进行沟通; 3、本项目比较适合计算机领域相关的毕业设计课题、课程作业等使用,尤其对于人工智能、计算机科学与技术等相关专业,更为适合; 4、下载使用后,可先查看README.md或论文文件(如有),本项目仅用作交流学习参考,请切勿用于商业用途。 5、资源来自互联网采集,如有侵权,私聊博主删除。 6、可私信博主看论文后选择购买源代码。 1、资源项目源码均已通过严格测试验证,保证能够正常运行; 2、项目问题、技术讨论,可以给博主私信或留言,博主看到后会第一时间与您进行沟通; 3、本项目比较适合计算机领域相关的毕业设计课题、课程作业等使用,尤其对于人工智能、计算机科学与技术等相关专业,更为适合; 4、下载使用后,可先查看README.md或论文文件(如有),本项目仅用作交流学习参考,请切勿用于商业用途。 5、资源来自互联网采集,如有侵权,私聊博主删除。 6、可私信博主看论文后选择购买源代码。 1、资源项目源码均已通过严格测试验证,保证能够正常运行; 2、项目问题、技术讨论,可以给博主私信或留言,博主看到后会第一时间与您进行沟通; 3、本项目比较适合计算机领域相关的毕业设计课题、课程作业等使用,尤其对于人工智能、计算机科学与技术等相关专业,更为适合; 4、下载使用后,可先查看README.md或论文文件(如有),本项目仅用作交流学习参考,请切勿用于商业用途。 5、资源来自互联网采集,如有侵权,私聊博主删除。 6、可私信博主看论文后选择购买源代码。
深度学习是机器学习的一个子领域,它基于人工神经网络的研究,特别是利用多层次的神经网络来进行学习和模式识别。深度学习模型能够学习数据的高层次特征,这些特征对于图像和语音识别、自然语言处理、医学图像分析等应用至关重要。以下是深度学习的一些关键概念和组成部分: 1. **神经网络(Neural Networks)**:深度学习的基础是人工神经网络,它是由多个层组成的网络结构,包括输入层、隐藏层和输出层。每个层由多个神经元组成,神经元之间通过权重连接。 2. **前馈神经网络(Feedforward Neural Networks)**:这是最常见的神经网络类型,信息从输入层流向隐藏层,最终到达输出层。 3. **卷积神经网络(Convolutional Neural Networks, CNNs)**:这种网络特别适合处理具有网格结构的数据,如图像。它们使用卷积层来提取图像的特征。 4. **循环神经网络(Recurrent Neural Networks, RNNs)**:这种网络能够处理序列数据,如时间序列或自然语言,因为它们具有记忆功能,能够捕捉数据中的时间依赖性。 5. **长短期记忆网络(Long Short-Term Memory, LSTM)**:LSTM 是一种特殊的 RNN,它能够学习长期依赖关系,非常适合复杂的序列预测任务。 6. **生成对抗网络(Generative Adversarial Networks, GANs)**:由两个网络组成,一个生成器和一个判别器,它们相互竞争,生成器生成数据,判别器评估数据的真实性。 7. **深度学习框架**:如 TensorFlow、Keras、PyTorch 等,这些框架提供了构建、训练和部署深度学习模型的工具和库。 8. **激活函数(Activation Functions)**:如 ReLU、Sigmoid、Tanh 等,它们在神经网络中用于添加非线性,使得网络能够学习复杂的函数。 9. **损失函数(Loss Functions)**:用于评估模型的预测与真实值之间的差异,常见的损失函数包括均方误差(MSE)、交叉熵(Cross-Entropy)等。 10. **优化算法(Optimization Algorithms)**:如梯度下降(Gradient Descent)、随机梯度下降(SGD)、Adam 等,用于更新网络权重,以最小化损失函数。 11. **正则化(Regularization)**:技术如 Dropout、L1/L2 正则化等,用于防止模型过拟合。 12. **迁移学习(Transfer Learning)**:利用在一个任务上训练好的模型来提高另一个相关任务的性能。 深度学习在许多领域都取得了显著的成就,但它也面临着一些挑战,如对大量数据的依赖、模型的解释性差、计算资源消耗大等。研究人员正在不断探索新的方法来解决这些问题。
1、资源项目源码均已通过严格测试验证,保证能够正常运行;、 2项目问题、技术讨论,可以给博主私信或留言,博主看到后会第一时间与您进行沟通; 3、本项目比较适合计算机领域相关的毕业设计课题、课程作业等使用,尤其对于人工智能、计算机科学与技术等相关专业,更为适合; 4、下载使用后,可先查看README.md或论文文件(如有),本项目仅用作交流学习参考,请切勿用于商业用途。 5、资源来自互联网采集,如有侵权,私聊博主删除。 6、可私信博主看论文后选择购买源代码。 1、资源项目源码均已通过严格测试验证,保证能够正常运行; 2、项目问题、技术讨论,可以给博主私信或留言,博主看到后会第一时间与您进行沟通; 3、本项目比较适合计算机领域相关的毕业设计课题、课程作业等使用,尤其对于人工智能、计算机科学与技术等相关专业,更为适合; 4、下载使用后,可先查看README.md或论文文件(如有),本项目仅用作交流学习参考,请切勿用于商业用途。 5、资源来自互联网采集,如有侵权,私聊博主删除。 6、可私信博主看论文后选择购买源代码。 1、资源项目源码均已通过严格测试验证,保证能够正常运行; 2、项目问题、技术讨论,可以给博主私信或留言,博主看到后会第一时间与您进行沟通; 3、本项目比较适合计算机领域相关的毕业设计课题、课程作业等使用,尤其对于人工智能、计算机科学与技术等相关专业,更为适合; 4、下载使用后,可先查看README.md或论文文件(如有),本项目仅用作交流学习参考,请切勿用于商业用途。 5、资源来自互联网采集,如有侵权,私聊博主删除。 6、可私信博主看论文后选择购买源代码。
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值