Inside Oracle10g Dynamic Sampling
by Donald K. Burleson

Introduction
One of the greatest problems with the Oracle Cost-based Optimizer (CBO) was not a problem with the CBO at all, but with the failure of the Oracle DBA to gather accurate schema statistics. Even with the dbms_stats package, the schema statistics were often stale, and the DBA did not always create histograms for skewed data columns and data columns that are used to estimate the size of SQL intermediate result sets.

This resulted in a “bum rap” for Oracle’s CBO, and beginner DBAs often falsely accused it of failing to generate optimal execution plans when the real cause of the sub-optimal execution plan was the DBA’s failure to collect complete schema statistics.

Hence, Oracle has automated the function of collecting and refreshing schema statistics in Oracle10g. This automates a very important DBA task and ensures that Oracle will always gather good statistics and choose the best execution plan for any query. Using the enhanced dbms_stats package, Oracle will automatically estimate the sample size, detect skewed columns that would benefit from histograms, and refresh the schema statistics when they become stale.

begin
   dbms_stats.gather_schema_stats(
      ownname          => 'SCOTT',
      estimate_percent => dbms_stats.auto_sample_size,
      method_opt       => 'for all columns size skewonly',
      degree           => 7
   );
end;
/
However, there was always a nagging problem with the CBO. Even with good statistics, the CBO would sometimes determine a sub-optimal table-join order, causing unnecessarily large intermediate result sets. For example, consider the complex WHERE clause in the query below. Even with the best schema statistics, it can be impossible to predict a priori the optimal table-join order (the one that has the smallest intermediate baggage). Reducing the size of the intermediate row-sets can greatly improve the speed of the query.

select
   stuff
from
   customer
natural join
   orders
natural join
   item
natural join
   product
where
   credit_rating * extended_credit > .07
and
   (qty_in_stock * velocity) / .075 < 30
or
   (sku_price / 47) * (qty_in_stock / velocity) > 47;
In this example, the four-way table join returns only 18 rows, but the query carries 9,000 rows in intermediate result sets, slowing down SQL execution (refer to figure 1).



Figure 1: Sub-optimal intermediate row sets.

If we were somehow able to predict the sizes of the intermediate results, we could re-sequence the table-join order to carry less “intermediate baggage” during the four-way table join; in this example, only 3,000 intermediate rows would be carried between the table joins (refer to figure 2).



Figure 2: Optimal intermediate row sets.

Let’s take a closer look at this issue. Assume that we have a three-way table join against three large tables. This database has 50,000 student rows, 10,000 course rows, and 5,000 professor rows (refer to figure 3).



Figure 3: Number of rows in each table.

If the number of rows in the table determined the best table-join order, we would expect that any three-way table join would start by joining the professor and course tables, and then would join the RESULT set to the student table.

Ah, but whenever there is a WHERE clause, the total number of rows in each table does not matter (if you are using index access). Here is the query:

select
   student_name
from
   professor
natural join
   course
natural join
   student
where
   professor = 'jones'
and
   course = 'anthropology 610';
Stan Nowakowski
Bob Crane
James Bakke
Patty O’Furniture

4 Rows selected.

Despite the huge number of rows in each table, the final result set will be only four rows. If the CBO can estimate a priori the size of the final result, it can use sampling techniques to examine the WHERE clause of the query and determine which two tables should be joined together first.

There are only two table-join choices in our simplified example:

1. Join (student to course) and (RESULT to professor)

2. Join (professor to course) and (RESULT to student)

So, then, which is better? The best solution will be the one in which RESULT is smallest. Because the query is filtered with a WHERE clause, the number of rows in each table is incidental, and what we are really concerned about is the number of rows “where professor = ‘jones’” and “where course = ‘Anthropology 610’.”

If we know these row counts, the best table-join order becomes obvious. Assume that Professor Jones is very popular and teaches 50 courses, and that Anthropology 610 has a total of eight students. Knowing this, we can see that the size of the intermediate row baggage is very different:

Join professor to course and (RESULT to student).



Figure 4: A sub-optimal intermediate row size.

If the CBO were to join the student table to the course table first, the intermediate result set would be only eight rows, far less baggage to carry over to the final join:

Join (student to course) and (RESULT to professor).



Figure 5: An optimal intermediate row size.

Now that the first join returns only eight rows, it is easy to join this tiny eight-row result set to the professor table to get the final answer.
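
To make this concrete, here is a minimal sketch of the two single-table counts that drive the join-order decision, using the hypothetical professor, course, and student tables from this example (the 50-row and eight-row counts come from the assumptions above):

-- Hypothetical schema: how many rows survive each WHERE clause predicate?
select count(*) from course  where professor = 'jones';           -- 50 rows
select count(*) from student where course = 'anthropology 610';   --  8 rows

The predicate that returns the smaller count identifies the join that should be performed first.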

How Do We Estimate Join Cardinality?

As we can see, in the absence of column histograms, the Oracle CBO must “guess” this information, and it sometimes gets it wrong. This is one reason why the ORDERED hint is one of the most popular SQL tuning hints; it directs the optimizer to join the tables in the same order that they appear in the FROM clause, like this:

select /*+ ordered */
   student_name
from
   student
natural join
   course
natural join
   professor
where
   professor = 'jones'
and
   course = 'anthropology 610';
Remember, if the values for the professor and course table columns are not skewed, then it is unlikely that the 10g automatic statistics would have created histogram buckets in the dba_histograms view for these columns.
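
You can verify this yourself by querying dba_histograms; here is a hedged sketch (the owner and table names are from our hypothetical example). A column analyzed with size 1 shows only two endpoints (its minimum and maximum values), so more than two endpoints indicates a real histogram:

select table_name,
       column_name,
       count(*) as endpoints
from   dba_histograms
where  owner = 'SCOTT'
and    table_name in ('PROFESSOR', 'COURSE')
group  by table_name, column_name;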

As we can see, the Oracle CBO needs to be able to accurately estimate the final number of rows returned by each step of the query and then use schema metadata (from running dbms_stats) to choose the table-join order that results in the least amount of “baggage” (intermediate rows) from each of the table join operations.

But this is a daunting task. When a SQL query has a complex WHERE clause, it can be very difficult to estimate the size of the intermediate result sets, especially when the WHERE clause transforms column values with mathematical functions. Nevertheless, Oracle is committed to making the CBO choose good plans even when complete information does not exist. To that end, Oracle9i introduced the dynamic sampling method for gathering run-time schema statistics, and it is now enabled by default in Oracle10g.

Note that dynamic sampling is not for every database. Let’s take a closer look.


Dynamic Sampling

The main objective of dynamic sampling is to create more accurate selectivity and cardinality estimates, which, in turn, helps the CBO generate faster execution plans. Dynamic sampling is normally used to estimate single-table predicate selectivity when collected statistics cannot be used or are likely to lead to significant errors in estimation. It is also used to estimate table cardinality for tables without statistics or for tables whose statistics are too out of date to trust.

The optimizer_dynamic_sampling initialization parameter controls the number of blocks read by the dynamic sampling query. The parameter can be set to a value from 0 to 10. In 10g, the default for this parameter is “2,” which automatically enables dynamic sampling. Beware that the optimizer_features_enable parameter turns off dynamic sampling if it is set to a version earlier than 9.2.0.

A value of 0 means dynamic sampling will not be performed. Increasing the value of the parameter results in more aggressive dynamic sampling, in terms of both the type of tables sampled (analyzed or un-analyzed) and the amount of I/O spent on sampling.
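
For example, you could raise the sampling level for your own session; this is a syntax sketch, and level 4 is an arbitrary choice:

-- Level 0 disables dynamic sampling; higher levels sample more blocks.
alter session set optimizer_dynamic_sampling = 4;

-- Confirm the current setting (from SQL*Plus):
show parameter optimizer_dynamic_sampling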

Dynamic Sampling Is Not for Everyone

When dynamic_sampling was first introduced in Oracle9i, it was used primarily for data warehouse systems with complex queries. Because it is enabled by default in Oracle10g, you may want to turn off dynamic_sampling to remove unnecessary overhead if any of the following are true:

You have an online transaction processing (OLTP) database with small, single-table queries.
Your queries are not frequently re-executed (as determined by the executions column in v$sql and executions_delta in dba_hist_sqlstat; see the query sketch below).
Your multi-table joins have simple WHERE clause predicates with single-column values and no built-in or mathematical functions.

Conversely, dynamic sampling is ideal whenever a query will execute multiple times, because the sample time is small compared to the overall query execution time.
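
As a rough gauge of how often your statements are re-executed, here is a hedged sketch against v$sql (frequently re-executed statements amortize the cost of dynamic sampling):

select sql_id,
       executions,
       substr(sql_text, 1, 60) as sql_text
from   v$sql
order  by executions desc;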
By sampling data from the table at runtime, Oracle10g can quickly evaluate complex WHERE clause predicates and determine the selectivity of each predicate, using this information to determine the optimal table-join order. Let’s use the Oracle SQL sample clause to see how this works.

Sample Table Scans

A sample table scan retrieves a random sample of data of whatever size you choose. The sample can be from a simple table or a complex SELECT statement such as a statement involving multiple joins and complex views.

To peek inside dynamic sampling, we can run some simple SQL queries. The following SQL statements use a sample block scan and a sample row scan on the customer table. (Note that there are 50,000 rows in this table.) The first statement shows a sample block scan and the second shows a sample row scan:

select
   count(*)
from
   customer sample block (20);

  COUNT(*)
----------
     12268
select
   pol_no,
   sales_id,
   sum_assured,
   premium
from
   customer sample (0.02);

    POL_NO   SALES_ID SUM_ASSURED    PREMIUM
---------- ---------- ----------- ----------
      2895         10        2525          2
      3176         10        2525          2
      9228         10        2525          2
     11294         11        2535          4
     19846         11        2535          4
     25547         12        2545          6
     29583         12        2545          6
     40042         13        2555          8
     47331         14        2565         10
     45283         14        2565         10

10 rows selected.
We can also use the new dynamic_sampling hint to sample rows from the table:

select /*+ dynamic_sampling(customer 10) */
   pol_no,
   sales_id,
   sum_assured,
   premium
from
   customer;

    POL_NO   SALES_ID SUM_ASSURED    PREMIUM
---------- ---------- ----------- ----------
      2895         10        2525          2
      3176         10        2525          2
      9228         10        2525          2
     11294         11        2535          4
     19846         11        2535          4
     25547         12        2545          6
     29583         12        2545          6
     40042         13        2555          8
     47331         14        2565         10
     45283         14        2565         10

Conclusion

Dynamic sampling addresses an innate problem in SQL, one common to all relational databases. Estimating the optimal join order means finding the sequence that produces the smallest intermediate row-sets, and it is impossible to collect a priori statistics for every possible combination of WHERE clause predicates.

Dynamic sampling is a godsend for databases that have large n-way table joins that execute frequently. By sampling a tiny subset of the data, the Oracle 10g CBO gleans clues as to the fastest table-join order.

As we have noted, dynamic_sampling does not take long to execute, but it can be unnecessary overhead for some Oracle10g databases. Dynamic sampling is just another example of Oracle’s commitment to making Oracle10g an intelligent, self-optimizing database.

Related link:
http://www.dba-oracle.com/art_dbazine_oracle10g_dynamic_sampling_hint.htm

 