6.830 Lab 3: Query Optimization

  In this lab, you will implement a query optimizer on top of SimpleDB. The remainder of this document describes what is involved in adding optimizer support and provides a basic outline of how to do so.

1. Getting started


  A few new files have been added for this lab; pull them from the upstream repository:

$ cd simple-db-hw
$ git pull upstream master

1.1. Implementation hints

  Here is a rough outline of how to work through this lab; more details are given in Section 2 of this document.

  •   Implement the methods in the TableStats class that allow it to estimate the selectivities of filters and the cost of scanning a table, using histograms (or some other form of statistics of your devising) to represent the results.
  •   Implement the methods in the JoinOptimizer class that allow it to estimate the cost and cardinality of a join.
  •   Implement the orderJoins method in the JoinOptimizer class. This method must produce an optimal ordering for a series of joins, given the statistics computed in the previous two steps.

2. Optimizer outline


  Recall that the main idea of a cost-based optimizer is to:

  •   Use statistical methods to estimate the cost of different query plans. Typically, the cost of a plan is related to the cardinalities of (numbers of tuples produced by) intermediate joins and selections, as well as the selectivity of filter and join predicates.
  •   Use these statistics to order joins and selections in an optimal way, and to select the best implementation from among several alternatives.

  In this lab, you will implement the code that performs these functions.

  The optimizer is invoked from simpledb/Parser.java; you may wish to review the lab 2 parser exercise before starting this lab.

2.1 Overall Optimizer Structure

  Before getting started, you need to understand the overall structure of the SimpleDB optimizer. The control flow of SimpleDB parsing and optimization is shown in Figure 1:

1. At initialization, Parser.java constructs a set of table statistics (stored in the statsMap container). It then waits for a query to be entered and calls the parseQuery method on that query.

2. parseQuery first constructs a LogicalPlan that represents the parsed query. It then calls the physicalPlan method on the LogicalPlan instance it has constructed; physicalPlan returns a DbIterator object that can be used to actually run the query.

  In the exercises to come, you will implement the methods that help physicalPlan devise an optimal plan.

2.2. Statistics Estimation

  Accurately estimating plan cost is quite tricky. In this lab, we will focus only on the cost of sequences of joins and base table accesses; we won't worry about access-method selection (since we only have one access method, table scans) or the cost of additional operators (like aggregates).

  You only need to consider left-deep plans for this lab.

2.2.1 Overall Plan Cost

  We will write join plans of the form p = t1 join t2 join ... tn, which signifies a left-deep join where t1 is the left-most join (deepest in the tree). Given a plan like p, its cost can be expressed as:

scancost(t1) + scancost(t2) + joincost(t1 join t2) +
scancost(t3) + joincost((t1 join t2) join t3) +
... 

  Here, scancost(t1) is the I/O cost of scanning table t1, and joincost(t1 join t2) is the CPU cost of joining t1 to t2. To make I/O and CPU cost comparable, a constant scaling factor is typically used, e.g.:

cost(predicate application) = 1
cost(pageScan) = SCALING_FACTOR x cost(predicate application)

  For this lab, you can ignore the effects of caching (e.g., assume that every access to a table incurs the full cost of a scan). Therefore, scancost(t1) is simply the number of pages in t1 multiplied by SCALING_FACTOR.
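  As a quick illustration with made-up numbers (SCALING_FACTOR, i.e. the per-page I/O constant, set to 1000, and t1 occupying 100 pages):

scancost(t1) = numPages(t1) x SCALING_FACTOR = 100 x 1000 = 100,000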

2.2.2 Join Cost

  When using nested loops joins, the cost of joining two tables t1 and t2 (where t1 is the outer) is simply:

joincost(t1 join t2) = scancost(t1) + ntups(t1) x scancost(t2) //IO cost
                       + ntups(t1) x ntups(t2)  //CPU cost

  Here, ntups(t1) is the number of tuples in table t1.
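  For example, continuing with made-up numbers (SCALING_FACTOR = 1000; t1: 100 pages, 10,000 tuples; t2: 200 pages, 20,000 tuples):

joincost(t1 join t2) = scancost(t1) + ntups(t1) x scancost(t2) + ntups(t1) x ntups(t2)
                     = 100,000 + 10,000 x 200,000 + 10,000 x 20,000
                     = 2,200,100,000

  Note that the outer table's cardinality multiplies the inner table's scan cost, which is why the choice of join order matters so much.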

 

2.2.3 Filter Selectivity

  ntups can be computed directly for a base table by scanning it once. Estimating ntups after one or more selection predicates have been applied is harder; this is the filter selectivity estimation problem. Here is one approach you might use, based on computing a histogram over the values in the table:

  •   Compute the minimum and maximum values for every attribute in the table (by scanning it once).
  •   Construct a histogram for every attribute in the table. A simple approach is to use a fixed number of buckets NumB, with each bucket counting the number of records whose value falls within a fixed range of the attribute's domain. For example, if a field f ranges from 1 to 100 and there are 10 buckets, bucket 1 might count the records with values 1~10, bucket 2 the records with values 11~20, and so on.
  •   Scan the table again, looking at all of the fields of all of the tuples, and use them to populate the counts of the buckets in each histogram.
  •   To estimate the selectivity of an equality expression, f = const, compute the bucket that contains the value const. Suppose the width (range of values) of the bucket is w, its height (number of tuples) is h, and the number of tuples in the table is ntups. Then, assuming values are uniformly distributed within the bucket, the selectivity of the expression is roughly (h / w) / ntups, since (h / w) represents the expected number of tuples with value const.
  •   To estimate the selectivity of a range expression f > const, compute the bucket b that const falls in, with width w_b and height h_b. Then b contains a fraction b_f = h_b / ntups of the total tuples. Assuming tuples are uniformly distributed within b, the fraction b_part of b that is greater than const is (b_right - const) / w_b, where b_right is the right endpoint of b's range. Thus, bucket b contributes (b_f x b_part) selectivity to the predicate. In addition, buckets b+1 through NumB-1 contribute all of their selectivity. Summing the selectivity contributions of all the buckets yields the overall selectivity of the expression. Figure 2 illustrates this process; a worked example follows this list.
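  As a worked example with made-up numbers: suppose f ranges over 1~100, NumB = 10 (so w = 10), ntups = 1,000, and the buckets covering 71~80, 81~90, and 91~100 hold 120, 60, and 20 tuples respectively. Then for f = 73 and f > 73:

selectivity(f = 73)  ~ (h / w) / ntups = (120 / 10) / 1,000 = 0.012
selectivity(f > 73)  ~ b_f x b_part + selectivity of the remaining buckets
                     = (120 / 1,000) x ((80 - 73) / 10) + (60 + 20) / 1,000
                     = 0.084 + 0.08 = 0.164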

 

Exercise 1

Exercise 1: IntHistogram.java

You will need to implement some way to record table statistics for selectivity estimation. We have provided a skeleton class, IntHistogram that will do this. Our intent is that you calculate histograms using the bucket-based method described above, but you are free to use some other method so long as it provides reasonable selectivity estimates.

We have provided a class StringHistogram that uses IntHistogram to compute selecitivites for String predicates. You may modify StringHistogram if you want to implement a better estimator, though you should not need to in order to complete this lab.

After completing this exercise, you should be able to pass the IntHistogramTest unit test (you are not required to pass this test if you choose not to implement histogram-based selectivity estimation).

  Implement IntHistogram.java following the approach described above; the code is as follows:

package simpledb;

/** A class to represent a fixed-width histogram over a single integer-based field.
 */
public class IntHistogram {

    private final int[] buckets;
    private final int min, max;
    private final double width;
    private int ntups = 0;

    /**
     * Create a new IntHistogram.
     * 
     * This IntHistogram should maintain a histogram of integer values that it receives.
     * It should split the histogram into "buckets" buckets.
     * 
     * The values that are being histogrammed will be provided one-at-a-time through the "addValue()" function.
     * 
     * Your implementation should use space and have execution time that are both
     * constant with respect to the number of values being histogrammed.  For example, you shouldn't 
     * simply store every value that you see in a sorted list.
     * 
     * @param buckets The number of buckets to split the input value into.
     * @param min The minimum integer value that will ever be passed to this class for histogramming
     * @param max The maximum integer value that will ever be passed to this class for histogramming
     */
    public IntHistogram(int buckets, int min, int max) {
    	// some code goes here
        this.buckets = new int[buckets];
        this.min = min;
        this.max = max;
        this.width = (1.+max-min)/this.buckets.length;
    }

    /**
     * Map a value to the index of the bucket it falls into.
     * @param v Value to map to a bucket
     */
    private int getIndex(int v){
        if(v<min || v>max) throw new IllegalArgumentException("value out of range");
        return (int)((v-min)/width);
    }

    /**
     * Add a value to the set of values that you are keeping a histogram of.
     * @param v Value to add to the histogram
     */
    public void addValue(int v) {
    	// some code goes here
        if(v >= min && v <= max){
            buckets[getIndex(v)]++;
            ntups++;
        }
    }

    /**
     * Estimate the selectivity of a particular predicate and operand on this table.
     * 
     * For example, if "op" is "GREATER_THAN" and "v" is 5, 
     * return your estimate of the fraction of elements that are greater than 5.
     * 
     * @param op Operator
     * @param v Value
     * @return Predicted selectivity of this particular operator and value
     */
    public double estimateSelectivity(Predicate.Op op, int v) {

    	// some code goes here
        if(op.equals(Predicate.Op.LESS_THAN)){
            if(v <= min) return 0.0;
            if(v >= max) return 1.0;
            final int index = getIndex(v);
            double cnt = 0;
            for(int i=0;i<index;++i){
                cnt += buckets[i];
            }
            cnt += buckets[index]/width*(v-index*width-min);
            return cnt/ntups;
        }
        if (op.equals(Predicate.Op.LESS_THAN_OR_EQ)) {
            return estimateSelectivity(Predicate.Op.LESS_THAN, v+1);
        }
        if (op.equals(Predicate.Op.GREATER_THAN)) {
            return 1-estimateSelectivity(Predicate.Op.LESS_THAN_OR_EQ, v);
        }
        if (op.equals(Predicate.Op.GREATER_THAN_OR_EQ)) {
            return estimateSelectivity(Predicate.Op.GREATER_THAN, v-1);
        }
        if (op.equals(Predicate.Op.EQUALS)) {
            return estimateSelectivity(Predicate.Op.LESS_THAN_OR_EQ, v) -
                    estimateSelectivity(Predicate.Op.LESS_THAN, v);
        }
        if (op.equals(Predicate.Op.NOT_EQUALS)) {
            return 1 - estimateSelectivity(Predicate.Op.EQUALS, v);
        }
        return 0.0;

    }
    
    /**
     * @return
     *     the average selectivity of this histogram.
     *     
     *     This is not an indispensable method to implement the basic
     *     join optimization. It may be needed if you want to
     *     implement a more efficient optimization
     * */
    public double avgSelectivity()
    {
        // some code goes here
        int cnt = 0;
        for(int bucket:buckets) cnt += bucket;
        if(cnt == 0) return 0.0;
        return (double) cnt / ntups;   // cast to avoid integer division
    }
    
    /**
     * @return A string describing this histogram, for debugging purposes
     */
    public String toString() {
        // some code goes here
        return String.format("IntHistogram(buckets=%d, min=%d, max=%d)",
                buckets.length, min, max);
    }
}

  Running ant runtest -Dtest=IntHistogramTest reports BUILD SUCCESSFUL.
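  A quick, hand-checkable sanity test of the estimator (the class name IntHistogramSanityCheck and the values are made up; the expected outputs follow from the uniform-within-bucket assumption in the code above):

package simpledb;

/** Hypothetical sanity check for the IntHistogram implementation above. */
public class IntHistogramSanityCheck {
    public static void main(String[] args) {
        IntHistogram h = new IntHistogram(10, 1, 100);   // 10 buckets over [1, 100]
        for (int v = 1; v <= 100; v++) {
            h.addValue(v);                               // one tuple per value, so ntups = 100
        }
        // Each bucket has width 10 and height 10, so roughly:
        System.out.println(h.estimateSelectivity(Predicate.Op.EQUALS, 5));        // ~ (10/10)/100 = 0.01
        System.out.println(h.estimateSelectivity(Predicate.Op.GREATER_THAN, 73)); // ~ 0.27
        System.out.println(h.estimateSelectivity(Predicate.Op.LESS_THAN, 1));     // 0.0 (below min)
    }
}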

 

Exercise 2

 

Exercise 2: TableStats.java

The class TableStats contains methods that compute the number of tuples and pages in a table and that estimate the selectivity of predicates over the fields of that table. The query parser we have created creates one instance of TableStats per table, and passes these structures into your query optimizer (which you will need in later exercises).

You should fill in the following methods and classes in TableStats:

  • Implement the TableStats constructor: Once you have implemented a method for tracking statistics such as histograms, you should implement the TableStats constructor, adding code to scan the table (possibly multiple times) to build the statistics you need.
  • Implement estimateSelectivity(int field, Predicate.Op op, Field constant): Using your statistics (e.g., an IntHistogram or StringHistogram depending on the type of the field), estimate the selectivity of predicate field op constant on the table.
  • Implement estimateScanCost(): This method estimates the cost of sequentially scanning the file, given that the cost to read a page is costPerPageIO. You can assume that there are no seeks and that no pages are in the buffer pool. This method may use costs or sizes you computed in the constructor.
  • Implement estimateTableCardinality(double selectivityFactor): This method returns the number of tuples in the relation, given that a predicate with selectivity selectivityFactor is applied. This method may use costs or sizes you computed in the constructor.

You may wish to modify the constructor of TableStats.java to, for example, compute histograms over the fields as described above for purposes of selectivity estimation.

After completing these tasks you should be able to pass the unit tests in TableStatsTest.

  Implement TableStats.java following the hints above; the code is as follows:

package simpledb;

import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/**
 * TableStats represents statistics (e.g., histograms) about base tables in a
 * query. 
 * 
 * This class is not needed in implementing lab1 and lab2.
 */
public class TableStats {

    private static final ConcurrentHashMap<String, TableStats> statsMap = new ConcurrentHashMap<String, TableStats>();

    static final int IOCOSTPERPAGE = 1000;

    public static TableStats getTableStats(String tablename) {
        return statsMap.get(tablename);
    }

    public static void setTableStats(String tablename, TableStats stats) {
        statsMap.put(tablename, stats);
    }
    
    public static void setStatsMap(HashMap<String,TableStats> s)
    {
        try {
            java.lang.reflect.Field statsMapF = TableStats.class.getDeclaredField("statsMap");
            statsMapF.setAccessible(true);
            statsMapF.set(null, s);
        } catch (NoSuchFieldException e) {
            e.printStackTrace();
        } catch (SecurityException e) {
            e.printStackTrace();
        } catch (IllegalArgumentException e) {
            e.printStackTrace();
        } catch (IllegalAccessException e) {
            e.printStackTrace();
        }

    }

    public static Map<String, TableStats> getStatsMap() {
        return statsMap;
    }

    public static void computeStatistics() {
        Iterator<Integer> tableIt = Database.getCatalog().tableIdIterator();

        System.out.println("Computing table stats.");
        while (tableIt.hasNext()) {
            int tableid = tableIt.next();
            TableStats s = new TableStats(tableid, IOCOSTPERPAGE);
            setTableStats(Database.getCatalog().getTableName(tableid), s);
        }
        System.out.println("Done.");
    }

    /**
     * Number of bins for the histogram. Feel free to increase this value over
     * 100, though our tests assume that you have at least 100 bins in your
     * histograms.
     */
    static final int NUM_HIST_BINS = 100;

    private int ioCostPerPage;
    private DbFile dbFile;
    private int tableid;
    private int numFields;
    private int numTuples;
    private int numPages;
    private HashMap<Integer,IntHistogram> intHistogramHashMap;
    private HashMap<Integer,StringHistogram> stringHistogramHashMap;

    /**
     * Create a new TableStats object, that keeps track of statistics on each
     * column of a table
     * 
     * @param tableid
     *            The table over which to compute statistics
     * @param ioCostPerPage
     *            The cost per page of IO. This doesn't differentiate between
     *            sequential-scan IO and disk seeks.
     */
    public TableStats(int tableid, int ioCostPerPage) {
        // For this function, you'll have to get the
        // DbFile for the table in question,
        // then scan through its tuples and calculate
        // the values that you need.
        // You should try to do this reasonably efficiently, but you don't
        // necessarily have to (for example) do everything
        // in a single scan of the table.
        // some code goes here
        numTuples = 0;
        this.tableid = tableid;
        this.ioCostPerPage = ioCostPerPage;
        intHistogramHashMap = new HashMap<Integer, IntHistogram>();
        stringHistogramHashMap = new HashMap<Integer, StringHistogram>();

        dbFile = Database.getCatalog().getDatabaseFile(tableid);
        numPages = ((HeapFile)dbFile).numPages();
        TupleDesc td = dbFile.getTupleDesc();

        numFields = td.numFields();
        Type types[] = getTypes(td);

        int[] mins = new int[numFields];
        int[] maxs = new int[numFields];

        TransactionId tid = new TransactionId();
        SeqScan scan = new SeqScan(tid,tableid,"");
        try{
            scan.open();
            for(int i=0;i<numFields;++i){
                if(types[i] == Type.STRING_TYPE)
                    continue;

                int min = Integer.MAX_VALUE;
                int max = Integer.MIN_VALUE;

                while(scan.hasNext()){
                    Tuple tuple = scan.next();
                    IntField field = (IntField)tuple.getField(i);
                    int val = field.getValue();
                    if(val > max) max = val;
                    if(val < min) min = val;
                }
                scan.rewind();
                mins[i] = min;
                maxs[i] = max;
            }
            scan.close();
        }catch (Exception e){
            e.printStackTrace();
        }

        for(int i=0;i < numFields;++i){
            Type type = types[i];
            if(type == Type.INT_TYPE){
                IntHistogram intHistogram = new IntHistogram(NUM_HIST_BINS,mins[i],maxs[i]);
                intHistogramHashMap.put(i,intHistogram);
            }else{
                StringHistogram stringHistogram = new StringHistogram(NUM_HIST_BINS);
                stringHistogramHashMap.put(i,stringHistogram);
            }
        }

        addValueToHist();
    }

    private Type[] getTypes(TupleDesc td){
        int numFields = td.numFields();
        Type[] types = new Type[numFields];

        for(int i=0;i<numFields;++i){
            Type t = td.getFieldType(i);
            types[i] = t;
        }
        return types;
    }

    private void addValueToHist(){
        TransactionId tid = new TransactionId();
        SeqScan scan = new SeqScan(tid,tableid,"");
        try{
            scan.open();
            while(scan.hasNext()){
                Tuple tuple = scan.next();
                numTuples++;   // count tuples here, so the count is correct even if field 0 is a string field

                for(int i=0;i<numFields;++i){
                    Field field = tuple.getField(i);

                    if(field.getType() == Type.INT_TYPE){
                        int val = ((IntField)field).getValue();
                        intHistogramHashMap.get(i).addValue(val);
                    }else{
                        String val = ((StringField)field).getValue();
                        stringHistogramHashMap.get(i).addValue(val);
                    }
                }
            }
            scan.close();
        }catch (Exception e){
            e.printStackTrace();
        }
    }

    /**
     * Estimates the cost of sequentially scanning the file, given that the cost
     * to read a page is costPerPageIO. You can assume that there are no seeks
     * and that no pages are in the buffer pool.
     * 
     * Also, assume that your hard drive can only read entire pages at once, so
     * if the last page of the table only has one tuple on it, it's just as
     * expensive to read as a full page. (Most real hard drives can't
     * efficiently address regions smaller than a page at a time.)
     * 
     * @return The estimated cost of scanning the table.
     */
    public double estimateScanCost() {
        // some code goes here
        HeapFile heapFile = (HeapFile)dbFile;
        return heapFile.numPages() * ioCostPerPage;
    }

    /**
     * This method returns the number of tuples in the relation, given that a
     * predicate with selectivity selectivityFactor is applied.
     * 
     * @param selectivityFactor
     *            The selectivity of any predicates over the table
     * @return The estimated cardinality of the scan with the specified
     *         selectivityFactor
     */
    public int estimateTableCardinality(double selectivityFactor) {
        // some code goes here
        double cardinality = numTuples * selectivityFactor;
        return (int) cardinality;
    }

    /**
     * The average selectivity of the field under op.
     * @param field
     *        the index of the field
     * @param op
     *        the operator in the predicate
     * The semantic of the method is that, given the table, and then given a
     * tuple, of which we do not know the value of the field, return the
     * expected selectivity. You may estimate this value from the histograms.
     * */
    public double avgSelectivity(int field, Predicate.Op op) {
        // some code goes here
        return 1.0;
    }

    /**
     * Estimate the selectivity of predicate <tt>field op constant</tt> on the
     * table.
     * 
     * @param field
     *            The field over which the predicate ranges
     * @param op
     *            The logical operation in the predicate
     * @param constant
     *            The value against which the field is compared
     * @return The estimated selectivity (fraction of tuples that satisfy) the
     *         predicate
     */
    public double estimateSelectivity(int field, Predicate.Op op, Field constant) {
        // some code goes here
        double selectivity;
        if(constant.getType() == Type.INT_TYPE){
            IntField intField = (IntField) constant;
            selectivity = intHistogramHashMap.get(field).estimateSelectivity(op,intField.getValue());
        }else{
            StringField stringField = (StringField) constant;
            selectivity = stringHistogramHashMap.get(field).estimateSelectivity(op,stringField.getValue());
        }
        return selectivity;
    }

    /**
     * return the total number of tuples in this table
     * */
    public int totalTuples() {
        // some code goes here
        return numTuples;
    }

}

  Running ant runtest -Dtest=TableStatsTest reports BUILD SUCCESSFUL.
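  A rough sketch of how these statistics get consumed (the table name "mytable" and the class name TableStatsUsageSketch are hypothetical; computeStatistics, getTableStats, estimateScanCost and estimateTableCardinality are the methods shown above):

package simpledb;

/** Hypothetical illustration of how the parser/optimizer consumes TableStats. */
public class TableStatsUsageSketch {
    public static void main(String[] args) {
        // Parser.java does roughly this at startup: build stats for every table in the catalog.
        TableStats.computeStatistics();

        // Later, the optimizer looks up the stats for a table appearing in the FROM list...
        TableStats ts = TableStats.getTableStats("mytable");

        // ...and uses them to cost a sequential scan and to estimate cardinalities.
        double scanCost = ts.estimateScanCost();      // numPages * ioCostPerPage
        int card = ts.estimateTableCardinality(0.1);  // tuples surviving a 10%-selective filter
        System.out.println("scan cost = " + scanCost + ", estimated cardinality = " + card);
    }
}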

2.2.4 Join Cardinality

  Finally, observe that the cost of the join plan p above also includes expressions of the form joincost((t1 join t2) join t3). To evaluate such expressions, you need some way to estimate the size (ntups) of t1 join t2. This join cardinality estimation problem is harder than the filter selectivity estimation problem. In this lab, you don't need to implement anything sophisticated for it; a simple heuristic approach to join selectivity estimation, informed by your histograms, is sufficient.

  Here are some things to keep in mind while implementing it:

  For equality joins, when one of the attributes is a primary key, the number of tuples produced by the join cannot be larger than the cardinality of the non-primary-key attribute's table.

  For equality joins with no primary key, it is hard to say much about the output size; it could be as large as the product of the two tables' cardinalities (if every tuple in both tables has the same value), or it could be 0.

  For range joins, it is similarly hard to say anything precise about the output size. The number of output tuples should be proportional to the sizes of the inputs; it is fine to assume that a fixed fraction of the cross-product is emitted (say, 30%). In general, the cost of a range join should be larger than the cost of a non-primary-key equality join of two tables of the same size. A brief worked example of these heuristics follows.
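  As an illustration with made-up cardinalities, suppose table A has 1,000 tuples and table B has 500 tuples:

A.id = B.a_id, A.id a primary key    ->  at most 500 tuples (each B tuple matches at most one A tuple)
A.x  = B.y,    no key on either side ->  hard to bound; one simple heuristic is max(1,000, 500) = 1,000
A.x  > B.y     (range join)          ->  a fixed fraction of the cross-product, e.g. 0.3 x 1,000 x 500 = 150,000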

Exercise 3

Exercise 3: Join Cost Estimation

The class JoinOptimizer.java includes all of the methods for ordering and computing costs of joins. In this exercise, you will write the methods for estimating the selectivity and cost of a join, specifically:

  • Implement estimateJoinCost(LogicalJoinNode j, int card1, int card2, double cost1, double cost2): This method estimates the cost of join j, given that the left input is of cardinality card1, the right input of cardinality card2, that the cost to scan the left input is cost1, and that the cost to access the right input is cost2. You can assume the join is an NL join, and apply the formula mentioned earlier.
  • Implement estimateJoinCardinality(LogicalJoinNode j, int card1, int card2, boolean t1pkey, boolean t2pkey): This method estimates the number of tuples output by join j, given that the left input is size card1, the right input is size card2, and the flags t1pkey and t2pkey that indicate whether the left and right (respectively) field is unique (a primary key).

After implementing these methods, you should be able to pass the unit tests estimateJoinCostTest and estimateJoinCardinality in JoinOptimizerTest.java.

  The relevant parts of JoinOptimizer.java are implemented as follows:

    /**
     * Estimate the cost of a join.
     * 
     * The cost of the join should be calculated based on the join algorithm (or
     * algorithms) that you implemented for Lab 2. It should be a function of
     * the amount of data that must be read over the course of the query, as
     * well as the number of CPU operations performed by your join. Assume that
     * the cost of a single predicate application is roughly 1.
     * 
     * 
     * @param j
     *            A LogicalJoinNode representing the join operation being
     *            performed.
     * @param card1
     *            Estimated cardinality of the left-hand side of the query
     * @param card2
     *            Estimated cardinality of the right-hand side of the query
     * @param cost1
     *            Estimated cost of one full scan of the table on the left-hand
     *            side of the query
     * @param cost2
     *            Estimated cost of one full scan of the table on the right-hand
     *            side of the query
     * @return An estimate of the cost of this query, in terms of cost1 and
     *         cost2
     */
    public double estimateJoinCost(LogicalJoinNode j, int card1, int card2,
            double cost1, double cost2) {
        if (j instanceof LogicalSubplanJoinNode) {
            // A LogicalSubplanJoinNode represents a subquery.
            // You do not need to implement proper support for these for Lab 3.
            return card1 + cost1 + cost2;
        } else {
            // Insert your code here.
            // HINT: You may need to use the variable "j" if you implemented
            // a join algorithm that's more complicated than a basic
            // nested-loops join.
            double cost = cost1 + card1 * cost2 + card1 * card2;
            return cost;
        }
    }

    /**
     * Estimate the cardinality of a join. The cardinality of a join is the
     * number of tuples produced by the join.
     * 
     * @param j
     *            A LogicalJoinNode representing the join operation being
     *            performed.
     * @param card1
     *            Cardinality of the left-hand table in the join
     * @param card2
     *            Cardinality of the right-hand table in the join
     * @param t1pkey
     *            Is the left-hand table a primary-key table?
     * @param t2pkey
     *            Is the right-hand table a primary-key table?
     * @param stats
     *            The table stats, referenced by table names, not alias
     * @return The cardinality of the join
     */
    public int estimateJoinCardinality(LogicalJoinNode j, int card1, int card2,
            boolean t1pkey, boolean t2pkey, Map<String, TableStats> stats) {
        if (j instanceof LogicalSubplanJoinNode) {
            // A LogicalSubplanJoinNode represents a subquery.
            // You do not need to implement proper support for these for Lab 3.
            return card1;
        } else {
            return estimateTableJoinCardinality(j.p, j.t1Alias, j.t2Alias,
                    j.f1PureName, j.f2PureName, card1, card2, t1pkey, t2pkey,
                    stats, p.getTableAliasToIdMapping());
        }
    }


    /**
     * Estimate the join cardinality of two tables.
     * */
    public static int estimateTableJoinCardinality(Predicate.Op joinOp,
            String table1Alias, String table2Alias, String field1PureName,
            String field2PureName, int card1, int card2, boolean t1pkey,
            boolean t2pkey, Map<String, TableStats> stats,
            Map<String, Integer> tableAliasToId) {
        int card = 1;
        // some code goes here
        if(joinOp == Predicate.Op.EQUALS){
            if(t1pkey){
                // left field is a primary key: each right tuple matches at most one left tuple
                card = card2;
            }else if(t2pkey){
                // right field is a primary key: each left tuple matches at most one right tuple
                card = card1;
            }else{
                // no key on either side: use the larger input cardinality as a simple heuristic
                card = Math.max(card1, card2);
            }
        }else{
            // range join: assume a fixed 30% of the cross-product is emitted
            card = (int)(0.3 * card1 * card2);
        }
        return card <= 0 ? 1 : card;
    }

2.3 Join Ordering

  Now that you have implemented methods for estimating costs, you will implement the Selinger optimizer. For these methods, joins is expressed as a list of join nodes (e.g., predicates over two tables), as opposed to a list of the tables to be joined.

  The rough outline of the algorithm, in pseudocode, is:

1. j = set of join nodes
2. for (i in 1...|j|):
3.     for s in {all length i subsets of j}
4.       bestPlan = {}
5.     for s' in {all length i-1 subsets of s}
6.            subplan = optjoin(s')
7.            plan = best way to join (s-s') to subplan
8.            if (cost(plan) < cost(bestPlan))
9.               bestPlan = plan
10.      optjoin(s) = bestPlan
11. return optjoin(j)
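  As a brief illustration with hypothetical join nodes, suppose j = {A join B, B join C, C join D}:

i = 1: cache the best (only) plan for each single join node
i = 2: for each 2-element subset, e.g. {A join B, B join C}, try each element as the last
       join over the cached 1-element plan for the rest, and cache the cheapest result
i = 3: for j itself, try each element as the last join over the cached 2-element plans;
       optjoin(j) is the cheapest of these and is returned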

  To help you implement this algorithm, several classes and methods are provided to assist you. First, the method enumerateSubsets(Vector v, int size) in JoinOptimizer.java returns a set of all of the subsets of v of size size.

  Second, we have provided the method:

 private CostCard computeCostAndCardOfSubplan(HashMap<String, TableStats> stats, 
                                                HashMap<String, Double> filterSelectivities, 
                                                LogicalJoinNode joinToRemove,  
                                                Set<LogicalJoinNode> joinSet,
                                                double bestCostSoFar,
                                                PlanCache pc) 

  This method computes the cost and cardinality of the best way to join joinToRemove to joinSet - {joinToRemove}, returning the answer as a CostCard object, which includes the cost, the cardinality, and the best join ordering (as a vector). computeCostAndCardOfSubplan may return null if no plan can be found (for example, because there is no possible left-deep join), or if the cost of all plans is greater than the bestCostSoFar argument. The method uses a cache of previous join orders, pc (optjoin in the pseudocode above), to quickly find the fastest way to join joinSet - {joinToRemove}. The other arguments (stats and filterSelectivities) are parameters that are passed into the orderJoins method you will implement as part of Exercise 4. Conceptually, computeCostAndCardOfSubplan performs lines 6~8 of the pseudocode above.

  Third, we have provided the method:

private void printJoins(Vector<LogicalJoinNode> js, 
                           PlanCache pc,
                           HashMap<String, TableStats> stats,
                           HashMap<String, Double> selectivities)

  This method can be used to display a graphical representation of a join plan (for example, when the explain flag is set).

  Finally, we have provided the PlanCache class, which can be used to cache the best way to join subsets of the joins considered so far in your implementation of Selinger's algorithm (an instance of this class is needed to use computeCostAndCardOfSubplan).

Exercise 4

Exercise 4: Join Ordering

In JoinOptimizer.java, implement the method:

  Vector<LogicalJoinNode> orderJoins(HashMap<String, TableStats> stats, 
                   HashMap<String, Double> filterSelectivities,  
                   boolean explain)

This method should operate on the joins class member, returning a new Vector that specifies the order in which joins should be done. Item 0 of this vector indicates the left-most, bottom-most join in a left-deep plan. Adjacent joins in the returned vector should share at least one field to ensure the plan is left-deep. Here stats is an object that lets you find the TableStats for a given table name that appears in the FROM list of the query. filterSelectivities allows you to find the selectivity of any predicates over a table; it is guaranteed to have one entry per table name in the FROM list. Finally, explain specifies that you should output a representation of the join order for informational purposes.

You may wish to use the helper methods and classes described above to assist in your implementation. Roughly, your implementation should follow the pseudocode above, looping through subset sizes, subsets, and sub-plans of subsets, calling computeCostAndCardOfSubplan and building a PlanCache object that stores the minimal-cost way to perform each subset join.

After implementing this method, you should be able to pass all the unit tests in JoinOptimizerTest. You should also pass the system test QueryTest.

  Implement orderJoins in JoinOptimizer.java following the hints above; the code is as follows:

    /**
     * Compute a logical, reasonably efficient join on the specified tables. See
     * PS4 for hints on how this should be implemented.
     *
     * @param stats
     *            Statistics for each table involved in the join, referenced by
     *            base table names, not alias
     * @param filterSelectivities
     *            Selectivities of the filter predicates on each table in the
     *            join, referenced by table alias (if no alias, the base table
     *            name)
     * @param explain
     *            Indicates whether your code should explain its query plan or
     *            simply execute it
     * @return A Vector<LogicalJoinNode> that stores joins in the left-deep
     *         order in which they should be executed.
     * @throws ParsingException
     *             when stats or filter selectivities is missing a table in the
     *             join, or when another internal error occurs
     */
    public Vector<LogicalJoinNode> orderJoins(
            HashMap<String, TableStats> stats,
            HashMap<String, Double> filterSelectivities, boolean explain)
            throws ParsingException {
        //Not necessary for labs 1--3

        // some code goes here
        PlanCache planCache = new PlanCache();
        Set<Set<LogicalJoinNode>> nodeSets = null;
        // Consider subsets of increasing size, as in the pseudocode above.
        for(int i = 1; i <= joins.size(); ++i){
            nodeSets = enumerateSubsets(joins, i);
            for(Set<LogicalJoinNode> nodeSet : nodeSets){
                double costSoFar = Double.MAX_VALUE;
                // Try each join node in the subset as the last join to perform;
                // computeCostAndCardOfSubplan looks up the cached best plan for the rest.
                for(LogicalJoinNode node : nodeSet){
                    CostCard costCard = computeCostAndCardOfSubplan(stats, filterSelectivities,
                            node, nodeSet, costSoFar, planCache);
                    if(costCard == null) continue;
                    if(costCard.cost < costSoFar){
                        costSoFar = costCard.cost;
                        // Cache the cheapest plan found so far for this subset.
                        planCache.addPlan(nodeSet, costSoFar, costCard.card, costCard.plan);
                    }
                }
            }
        }
        // After the final iteration, nodeSets holds only the full join set;
        // its cached order is the overall best left-deep plan.
        Vector<LogicalJoinNode> res = joins;
        if (nodeSets != null)
            for (Set<LogicalJoinNode> nodeSet : nodeSets)
                res = planCache.getOrder(nodeSet);

        if (explain)
            printJoins(res, planCache, stats, filterSelectivities);

        return res;
    }

  Running ant runtest -Dtest=JoinOptimizerTest reports BUILD SUCCESSFUL.

  Running ant runsystest -Dtest=QueryTest reports BUILD SUCCESSFUL.
