MIT6.830 Lab3 Query Optimization 实验报告

一、实验概览

lab3实现的是基于代价的查询优化器,以下是讲义给出的实验的大纲:

Recall that the main idea of a cost-based optimizer is to:

  • Use statistics about tables to estimate “costs” of different
    query plans. Typically, the cost of a plan is related to the cardinalities(基数) of
    (number of tuples produced by) intermediate joins and selections, as well as the
    selectivity of filter and join predicates.
  • Use these statistics to order joins and selections in an
    optimal way, and to select the best implementation for join
    algorithms from amongst several alternatives.

In this lab, you will implement code to perform both of these
functions.

我们可以使用表的统计数据去估计不同查询计划的代价。通常,查询计划的代价与中间进行连接和选择产生的记录数的基数有关,以及过滤和连接的选择性。

通过这些统计信息,我们可以选择最佳的连接和选择顺序,从多个查询方案中选择一个最佳的计划去执行。

优化器结构概览:

image-20211015141929283

The key at the bottom explains the symbols; you
will implement the components with double-borders. The classes and
methods will be explained in more detail in the text that follows (you may wish to refer back
to this diagram), but
the basic operation is as follows:

  1. Parser.java constructs a set of table statistics (stored in the
    statsMap container) when it is initialized. It then waits for a
    query to be input, and calls the method parseQuery on that query.
  2. parseQuery first constructs a LogicalPlan that
    represents the parsed query. parseQuery then calls the method physicalPlan on the
    LogicalPlan instance it has constructed. The physicalPlan method returns a
    DBIterator object that can be used to actually run the query.

In the exercises to come, you will implement the methods that help
physicalPlan devise an optimal plan.

简单总结一下查询优化器的构成:

1.Parser.Java在初始化时会收集并构造所有表格的统计信息,并存到statsMap中。当有查询请求发送到Parser中时,会调用parseQuery方法去处理‘

2.parseQuery方法会把解析器解析后的结果去构造出一个LogicalPlan实例,然后调用LogicalPlan实例的physicalPlan方法去执行,然后返回的是结果记录的迭代器,也就是我们在lab2中做的东西都会在physicalPlan中会被调用。

为了验证文档的描述,大致看了一下LogicalPlan的代码,可以看到有很多添加运算的方法,其中physicalPlan方法是核心,它主要是将逻辑计划翻译成实实在在的查询计划,然后通过JoinOptimize去找出最优的连接顺序然后生成逻辑计划。然后去创建各种运算器实例,然后根据计划去执行,下面是部分代码:

image-20211015143435652

可以看到,lab2我们保证的是一般的SQL语句能够执行;而在lab3,我们要考虑的事情是怎么让SQL执行得更快,最佳的连接的顺序是什么等待。

个人理解,总体的,lab3的查询优化应该分为两个阶段:

  • 第一阶段:收集表的统计信息,有了统计信息我们才可以进行估计;
  • 第二阶段:根据统计信息进行估计,找出最优的执行方案。

lab3共有4个exercise,前面两个exercise做的是第一阶段事情,后面两个exercise做的是第二阶段是事情(分工明确hhh)。

除了上面信息,实验的文档outline部分还给出了很多十分有用的信息,告诉我们如何去统计数据,如何去计算代价等等,可以说是指导方针了。

二、实验过程

Exercise 1: IntHistogram.java

想要估计查询计划的代价,首先是得有数据。那么数据是怎么从table中获取,以怎样的形式收集呢?这里用到了直方图。

简单来讲,一个直方图用于表示一个字段的统计信息,直方图将字段的值分为多个相同的区间,并统计每个区间的记录数,每个区间可以看做是一个桶,单个区间的范围大小看成桶的宽,记录数看成桶的宽,可以说是非常的形象:

image-20211015145501551

如果看不懂,可以看一下《数据库系统概念》里的图,帆船书里面的图会更容易懂一些。一张人员信息表格,年龄字段的直方图如下:

image-20211015145611089

exercise1要做的就是根据给定的数据去构造出这样的直方图,然后是根据直方图的统计信息去估算某个值的选择性(selectivity)。下面是文档描述信息:

You will need to implement
some way to record table statistics for selectivity estimation. We have
provided a skeleton class, IntHistogram that will do this. Our
intent is that you calculate histograms using the bucket-based method described
above, but you are free to use some other method so long as it provides
reasonable selectivity estimates.

We have provided a class StringHistogram that uses
IntHistogram to compute selecitivites for String
predicates. You may modify StringHistogram if you want to
implement a better estimator, though you should not need to in order to
complete this lab.

After completing this exercise, you should be able to pass the
IntHistogramTest unit test (you are not required to pass this test if you
choose not to implement histogram-based selectivity estimation).

我们在这个实验只需要实现IntHistogram,而StringHistogram会将字符串转换为int类型然后调用IntHistogram。首先,是构造器与属性部分。构造器给出直方图的数据范围(最大值最小值),桶的数量。有了这些信息,就可以构造出一个空的直方图。具体实现也是一些很简单的数学计算:

public class IntHistogram {
   

    private int[] buckets;
    private int min;
    private int max;
    private double width;
    private int ntups;

    public IntHistogram(int buckets, int min, int max) {
   
    	// some code goes here
        this.buckets = new int[buckets];
        this.min = min;
        this.max = max;
        this.ntups = 0;
        this.width = Math.max(1, (1. + max - min) / this.buckets.length);
    }

这里为了保证数据的准确,后期修改了一点计算宽度的方法,就是要保证宽度在1以上,才不会导致一些值的选择性大于1(一般是0-1,0表示这些记录一个都用不到,也就是代价很小了,1表示这些记录都会被选上,代价最大)。

然后就是构造直方图统计数据的过程,也很简单,以下是实现代码:

    private int getIndex(int v) {
   
//        if (v < min || v > max) throw new IllegalArgumentException("illegal value");
        return Math.min((int) ((v - min) / width), buckets.length - 1);
    }

    /**
     * Add a value to the set of values that you are keeping a histogram of.
     * @param v Value to add to the histogram
     */
    public void addValue(int v) {
   
    	// some code goes here
        if (v < min || v > max) return;
        int index = getIndex(v);
        ntups ++;
        buckets[index] ++;
    }

接下来是这个exercise的大块头,根据直方图已有的统计信息,去计算进行某种运算时某个值表格记录的选择性。这部分的资料在outline很详细的给出如何估计:

  • To estimate the selectivity of an equality expression,
    f=const, compute the bucket that contains value const.
    Suppose the width (range of values) of the bucket is w, the height (number of
    tuples) is h,
    and the number of tuples in the table is ntups. Then, assuming
    values are uniformly distributed throughout the bucket, the selectivity of
    the
    expression is roughly (h / w) / ntups, since (h/w)
    represents the expected number of tuples in the bin with value
    const.

  • To estimate the selectivity of a range expression f>const,
    compute the bucket b that const is in, with width w_b and height
    h_b.

    Then, b contains a fraction b_f = h_b / ntups of the
    total tuples. Assuming tuples are uniformly distributed throughout b(元组均匀分布),
    the fraction b_part of b that is > const is
    (b_right - const) / w_b, where b_right is the right endpoint of
    b’s bucket. Thus, bucket b contributes (b_f x
    b_part)
    selectivity to the predicate. In addition, buckets
    b+1…NumB-1 contribute all of their
    selectivity (which can be computed using a formula similar to
    b_f above). Summing the selectivity contributions of all the
    buckets will yield the overall selectivity of the expression.
    Figure 2 illustrates this process

简单总结一下:

  1. 对于等值运算f = const,我们要利用直方图估计一个等值表达式f = const的选择性,首先需要计算出包含该const值的桶,然后进行计算:选择性 =( 桶高/桶宽)/ 记录数,也就是 (h / w) / ntups。其中 *(h/w)*我们可以容易理解到这是符合f = const的记录数,然后去除以总记录数,就可以得到该等值运算对表格记录的选择性了。
  2. 对于非等值运算,我们采用的也是同样的思想:找出这个范围的记录数,然后除以总记录数。以下的直方图给出了如何计算一个f > const的过程

image-20211015151542459

首先是计算第一小部分:(const, b.right]这部分的选择性

b_f = h_b / ntups

b_part = (b_right - const) / w_b

b_selectivity = b_f x b_part

然后计算总的选择性(其中cnt是b.right到结尾的记录总数):

selectivity = b_selectivity + 1.0 * cnt / ntups;

计算其它的非等值运算,我们也可以使用同样的思路去做。其实,明白了选择性是干什么用的,整个exercise做起来就会简单一点。下面是代码实现:

public double estimateSelectivity(Predicate.Op op, int v) {
   

    	// some code goes here
        int index = getIndex(v);
        if (op.equals(Predicate.Op.EQUALS)) {
   
            //(high / width) / ntups
            if (index < 0 || index >= buckets.length) return 0.0;
            double selectivity = (buckets[index] / width) / ntups;
            return selectivity;
        }
        if (op.equals(Predicate.Op.NOT_EQUALS)) {
   
            return 1 - estimateSelectivity(Predicate.Op.EQUALS, v);
        }
        if (op.equals(Predicate.Op.GREATER_THAN)) {
   
            if (v <= min) return 1.0;
            if (v >= max) return 0.0;

            int cnt = 0;
            for(int i = index + 1; i < buckets.length; i++) {
   
                cnt += buckets[i];
            }
            //b_f = h_b / ntups
            //b_part = (b_right - const) / w_b
            //b_selectivity = b_f x b_part
            double b_f = 1.0 * buckets[index] / ntups;
            double b_part = (
  • 8
    点赞
  • 6
    收藏
    觉得还不错? 一键收藏
  • 14
    评论

“相关推荐”对你有帮助么?

  • 非常没帮助
  • 没帮助
  • 一般
  • 有帮助
  • 非常有帮助
提交
评论 14
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值