MIT 6.830 Lab 2: SimpleDB Operators Lab Report

Latest recommended article published on 2023-07-25 23:53:54

跳着迪斯科学Java

Category: 6.830. Tags: sql, database

Article link: https://blog.csdn.net/weixin_45834777/article/details/120675909

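Summary (generated automatically by CSDN): This article walks through the MIT 6.830 database course lab on SimpleDB operators, covering filtering, joins, aggregation, insertion, deletion and the page eviction policy. The lab covers Filter, Join, OrderBy, Aggregates, HeapFile Mutability, Insertion and Deletion, plus an LRU page eviction algorithm. Along the way the author ran into problems such as how to handle empty data pages and how to write back dirty pages, and describes how they were solved.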

1. Lab Overview

The lab handout introduces this lab as follows:

Implement the operators Filter and Join and verify that their corresponding tests work. The Javadoc comments for
these operators contain details about how they should work. We have given you implementations of
Project and OrderBy which may help you understand how other operators work. (filter, join)
Implement IntegerAggregator and StringAggregator. Here, you will write the logic that actually computes an
aggregate over a particular field across multiple groups in a sequence of input tuples. Use integer division for
computing the average, since SimpleDB only supports integers. StringAggregator only needs to support the COUNT
aggregate, since the other operations do not make sense for strings. (aggregate functions)
Implement the Aggregate operator. As with other operators, aggregates implement the OpIterator interface so that
they can be placed in SimpleDB query plans. Note that the output of an Aggregate operator is an aggregate value of
an entire group for each call to next(), and that the aggregate constructor takes the aggregation and grouping
fields.
Implement the methods related to tuple insertion, deletion, and page eviction in BufferPool. You do not need to
worry about transactions at this point. (insertion, deletion, eviction policy)
Implement the Insert and Delete operators. Like all operators, Insert and Delete implement
OpIterator, accepting a stream of tuples to insert or delete and outputting a single tuple with an integer field
that indicates the number of tuples inserted or deleted. These operators will need to call the appropriate methods
in BufferPool that actually modify the pages on disk. Check that the tests for inserting and deleting tuples work
properly.

The work required in this lab is:

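Implement the filter and join operators. These classes all implement the OpIterator interface (through the abstract class Operator). The lab ships an OrderBy implementation that can be used as a reference. Every parsed SQL statement is ultimately executed through these operators.
Implement the aggregate functions. The database has only two types, int and string. For int fields the aggregates to implement are max, min, avg, count and so on; for string fields only count is needed. Aggregates are used together with GROUP BY queries, and grouping is optional when aggregating.
Implement Aggregate, which wraps IntegerAggregator and StringAggregator: the query plan calls Aggregate, which delegates to the concrete aggregator and finally returns the aggregate result.
Implement tuple insertion and deletion, which touches HeapPage, HeapFile and BufferPool. Once the call chain among these three is clear, the code becomes much easier to write.
Implement the BufferPool page eviction policy. The BufferPool holds 50 pages by default. When a page has to be added and the pool is already at capacity, an eviction policy must choose a page to discard; I chose LRU.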
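
As a rough illustration of the LRU idea mentioned in the last item, the sketch below builds a tiny cache on java.util.LinkedHashMap in access order. LruPageCache is a hypothetical name, not part of SimpleDB, and the sketch deliberately ignores what a real BufferPool must also do before discarding a page, such as writing back dirty pages.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical LRU cache for illustration only; this is not SimpleDB's BufferPool API. */
class LruPageCache<K, V> extends LinkedHashMap<K, V> {
    private final int capacity;

    LruPageCache(int capacity) {
        // accessOrder = true: iteration runs from least to most recently accessed entry.
        super(16, 0.75f, true);
        this.capacity = capacity;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        // Invoked after every put; returning true drops the least recently used entry.
        return size() > capacity;
    }
}
```

With a capacity of 2, putting pages 1 and 2, reading page 1, and then putting page 3 evicts page 2, because page 1 was touched more recently.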

2. Lab Walkthrough

Exercise 1: Filter and Join

Exercise 1 asks us to implement the two operators Filter and Join. The handout describes them as follows:

Filter: This operator only returns tuples that satisfy a Predicate that is specified as part of its constructor.
Hence, it filters out any tuples that do not match the predicate.
Join: This operator joins tuples from its two children according to a JoinPredicate that is passed in as part of
its constructor. We only require a simple nested loops join, but you may explore more interesting join
implementations. Describe your implementation in your lab writeup.

How Filter is implemented

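Filter is the basis of the WHERE clause in SQL, as in select * from students where id > 2: it performs conditional filtering. To filter, we call next on the Filter iterator to obtain the records that pass the condition. The result of the SQL statement above is conceptually a List<Tuple> list, a collection of tuples. Ignoring implementation details, Filter behaves like the iterator returned by list.iterator(), and each call to it.next() yields one Tuple that satisfies the condition.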

Filter extends Operator, and Operator is an abstract class that implements the OpIterator interface, so Filter is an iterator.

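Operator already implements next and hasNext, and both ultimately call fetchNext to obtain the next record, so all Filter has to do is return the next record that satisfies the filter condition. Filter has three fields: predicate, child and td.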

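predicate is the predicate, the key field for conditional filtering. child is the data source, from which we pull Tuples one at a time and test them with predicate. td describes the result tuples: in Filter it is identical to the TupleDesc of the data source, whereas other operators build a TupleDesc that matches their own output.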

The role of Predicate

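As noted above, ignoring implementation details, Filter is like the iterator returned by list.iterator(), and it.next() returns the matching Tuples one at a time. Those implementation details are handled by Predicate.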

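On each call to fetchNext, we keep pulling tuples from Filter's child data source. As soon as one Tuple passes the predicate's filter method, we return it, so every returned Tuple is a valid one that has passed the filter condition.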
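
The fetchNext loop described above can be sketched with plain java.util types. FilterSketch and its lookahead field are hypothetical stand-ins rather than SimpleDB classes: the real Filter extends Operator and works on Tuple and OpIterator. Here hasNext and next are layered on top of fetchNext, which is the arrangement the text attributes to Operator.

```java
import java.util.Iterator;
import java.util.NoSuchElementException;
import java.util.function.Predicate;

/** Hypothetical stand-in for Filter: wraps a child iterator and a predicate. */
class FilterSketch<T> implements Iterator<T> {
    private final Predicate<T> predicate; // the filter condition
    private final Iterator<T> child;      // the data source
    private T lookahead;                  // next match; null means "not fetched yet", so null elements are unsupported

    FilterSketch(Predicate<T> predicate, Iterator<T> child) {
        this.predicate = predicate;
        this.child = child;
    }

    /** Plays the role of fetchNext: returns the next match, or null once the child is exhausted. */
    private T fetchNext() {
        while (child.hasNext()) {
            T candidate = child.next();
            if (predicate.test(candidate)) {
                return candidate;
            }
        }
        return null;
    }

    @Override
    public boolean hasNext() {
        if (lookahead == null) {
            lookahead = fetchNext();
        }
        return lookahead != null;
    }

    @Override
    public T next() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        T result = lookahead;
        lookahead = null;
        return result;
    }
}
```

Wrapping the ids 1, 2, 3, 4 with the condition id > 2 yields 3 and then 4, which mirrors select * from students where id > 2.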

Filter relies on the predicate, so Predicate has to be implemented before Filter. Predicate has three basic fields: field, op and operand.

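field is the index of the field in the incoming Tuple that is compared against the operand using op. The operations supported by op are: equals, greater than, less than, not equals, greater than or equal, less than or equal, and LIKE (fuzzy match).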

operand is the value to compare against. Take the SQL statement above, select * from students where id > 2: if id is field 0, then field = 0, op = GREATER_THAN and operand = new IntField(2). The comparison is implemented in the filter method, and the Filter class also obtains its filtered tuples through predicate.filter(tuple).

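Predicate's job is to test the incoming Tuple: its field attribute says which field of the tuple is combined with operand under op. The comparison itself is delegated to the compare method of the Field class, which compares according to the operator and operand passed in. IntField is a representative example.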

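IntField.compare supports equals, greater than, less than, not equals, greater than or equal, and less than or equal. For IntField, LIKE and EQUALS both mean equality.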
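
To make the division of labour between Predicate and compare concrete, here is a minimal sketch. PredicateSketch is a hypothetical miniature: tuples are modelled as int[], and the Op constant names are chosen to resemble the ones used in the text rather than copied from the real class.

```java
/** Hypothetical miniature of Predicate plus an IntField-style compare; tuples are int[]. */
class PredicateSketch {
    enum Op { EQUALS, GREATER_THAN, LESS_THAN, LESS_THAN_OR_EQ, GREATER_THAN_OR_EQ, LIKE, NOT_EQUALS }

    private final int field;   // index of the tuple field to test
    private final Op op;       // comparison operator
    private final int operand; // value to compare against

    PredicateSketch(int field, Op op, int operand) {
        this.field = field;
        this.op = op;
        this.operand = operand;
    }

    /** Compares two ints under op, in the way the text describes for IntField.compare. */
    static boolean compare(int value, Op op, int operand) {
        switch (op) {
            case EQUALS:
            case LIKE: // for integers, LIKE degenerates to equality
                return value == operand;
            case NOT_EQUALS:
                return value != operand;
            case GREATER_THAN:
                return value > operand;
            case GREATER_THAN_OR_EQ:
                return value >= operand;
            case LESS_THAN:
                return value < operand;
            case LESS_THAN_OR_EQ:
                return value <= operand;
            default:
                throw new IllegalArgumentException("unknown op " + op);
        }
    }

    /** Returns true if tuple[field] op operand holds. */
    boolean filter(int[] tuple) {
        return compare(tuple[field], op, operand);
    }
}
```

For where id > 2 with id as field 0, the predicate is new PredicateSketch(0, Op.GREATER_THAN, 2): a tuple with id 2 is rejected and a tuple with id 3 is accepted.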

How OrderBy is implemented

The lab provides the OrderBy implementation. Its approach is similar to our Filter; the difference lies in how fetchNext obtains the next tuple.

The key fields of OrderBy are:

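1. child: the data source, supplying all the Tuples to be sorted;

2. childTups: OrderBy loads every record of child into this list during open and then sorts it;

3. asc: ascending or descending order; true means ascending;

4. orderByField: the index of the tuple field to sort by;

5. it: the iterator returned by childTups.iterator() after sorting, i.e. all the data of child ordered by the sort field.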

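Personally I do not think this implementation is ideal: when the data source holds a very large number of tuples, it may run out of memory (OOM), much like the classic problem of sorting a billion records.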

The interesting part is open, which sorts the tuples and exposes them through the iterator it; subsequent calls to fetchNext only need to read from it.

fetchNext is then much simpler: it just takes the next tuple from the result iterator.
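
The open-then-iterate structure can be sketched as follows. OrderBySketch is a hypothetical stand-in that models tuples as int[]; the field names follow the list above, but the code is an illustration and not the implementation shipped with the lab.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.Iterator;
import java.util.List;

/** Hypothetical sketch of the OrderBy idea; tuples are modelled as int[]. */
class OrderBySketch {
    private final Iterator<int[]> child;                     // data source
    private final int orderByField;                          // index of the field to sort by
    private final boolean asc;                               // true for ascending order
    private final List<int[]> childTups = new ArrayList<>(); // everything read from child
    private Iterator<int[]> it;                              // iterator over the sorted list

    OrderBySketch(int orderByField, boolean asc, Iterator<int[]> child) {
        this.orderByField = orderByField;
        this.asc = asc;
        this.child = child;
    }

    /** Loads every tuple into memory, sorts once, and keeps an iterator over the result. */
    void open() {
        while (child.hasNext()) {
            childTups.add(child.next());
        }
        Comparator<int[]> byField = Comparator.comparingInt(t -> t[orderByField]);
        childTups.sort(asc ? byField : byField.reversed());
        it = childTups.iterator();
    }

    /** Returns the next tuple in sorted order, or null when none remain. */
    int[] fetchNext() {
        return (it != null && it.hasNext()) ? it.next() : null;
    }
}
```

Because open materializes the whole input in childTups, memory use grows linearly with the number of tuples, which is exactly the OOM concern raised above.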

Implementing Join and JoinPredicate

With the relationship between Filter and Predicate and the OrderBy approach understood, Join and JoinPredicate become somewhat easier.

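Join is the basic operator behind join queries. MySQL distinguishes inner joins from outer joins; here we implement only the inner join. A join query looks like this: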

select a.*,b.* from a inner join b on a.id=b.id

Join's main fields are child1, child2, joinPredicate and td.

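child1 and child2 are the data sources for the two tables taking part in the join; tuples are pulled from them and matched using joinPredicate. td describes the result tuples. An inner join concatenates the two tables, so when the result shows every field of both tables, td can be understood simply as the concatenation of the TupleDescs of the two children. This is built in the constructor.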
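
The concatenation of the two schemas amounts to appending one array to another. The sketch below shows this for field names only; TupleDescMergeSketch and its merge method are hypothetical names used for illustration, and a full version would concatenate the field types in the same way.

```java
import java.util.Arrays;

/** Hypothetical illustration of concatenating two schemas; only field names are modelled. */
class TupleDescMergeSketch {
    /** Returns all names of left followed by all names of right. */
    static String[] merge(String[] left, String[] right) {
        String[] merged = Arrays.copyOf(left, left.length + right.length);
        System.arraycopy(right, 0, merged, left.length, right.length);
        return merged;
    }
}
```

Merging {"a.id", "a.name"} with {"b.id"} gives {"a.id", "a.name", "b.id"}, so field j of the right tuple ends up at index left.length + j in the joined tuple.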

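There are many join algorithms. The one implemented here is the simplest, the nested loop join (NLJ): take a tuple from child1 and compare it against the tuples pulled from child2 one after another. If a pair satisfies the condition, concatenate the two into a new result tuple and add it to the result; if not, move on to the next child2 tuple. When child2 is exhausted, take the next tuple from child1 and start again from the first child2 tuple, repeating until child1 is exhausted. The time complexity is O(m * n), where m is the number of records in child1 and n is the number of records in child2.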

The implementation lives in fetchNext:

    protected Tuple fetchNext() throws TransactionAbortedException, DbException {
        // t1 is a field that holds the current child1 tuple across calls: even when child1
        // is exhausted, child2 may still have tuples left to match against the current t1.
        TupleDesc td1 = child1.getTupleDesc(), td2 = child2.getTupleDesc();
        while (child1.hasNext() || t1 != null) {
            if (child1.hasNext() && t1 == null) {
                t1 = child1.next();
            }
            Tuple t2;
            while (child2.hasNext()) {
                t2 = child2.next();
                if (joinPredicate.filter(t1, t2)) {
                    // Build the result tuple: all fields of t1 followed by all fields of t2.
                    Tuple res = new Tuple(td);
                    int i = 0;
                    for (; i < td1.numFields(); i++) {
                        res.setField(i, t1.getField(i));
                    }
                    for (int j = 0; j < td2.numFields(); j++) {
                        res.setField(i + j, t2.getField(j));
                    }
                    // If the match was the last tuple of child2, rewind child2 and clear t1.
                    if (!child2.hasNext()) {
                        child2.rewind();
                        t1 = null;
                    }
                    return res;
                }
            }
            // child2 is exhausted: rewind it, otherwise the outer loop never makes progress;
            // t1 = null makes the next iteration pick the next child1 tuple.
            child2.rewind();
            t1 = null;
        }
        return null;
    }

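Pay attention to when child2 is rewound. One case is when the child1 tuple matches the very last record of child2: a rewind is needed here, because otherwise the next child1 tuple would not start comparing from the first child2 tuple. The other case is when the inner loop runs through the rest of child2 without finding a match: child1 must then move on to its next tuple, and child2 naturally has to start iterating from its first tuple again.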

The matching test mentioned above works as in Filter: we implement JoinPredicate to do the filtering, and JoinPredicate in turn relies on the compare method of the concrete Field.
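
A minimal sketch of that idea follows. JoinPredicateSketch is a hypothetical miniature that models tuples as int[] and handles only three operators; it shows how a join predicate picks one field from each side and compares the two values.

```java
/** Hypothetical miniature of JoinPredicate; tuples are int[] and only three operators are shown. */
class JoinPredicateSketch {
    enum Op { EQUALS, GREATER_THAN, LESS_THAN }

    private final int field1; // field index in the left tuple
    private final Op op;      // comparison operator
    private final int field2; // field index in the right tuple

    JoinPredicateSketch(int field1, Op op, int field2) {
        this.field1 = field1;
        this.op = op;
        this.field2 = field2;
    }

    /** Returns true if t1[field1] op t2[field2] holds. */
    boolean filter(int[] t1, int[] t2) {
        int left = t1[field1];
        int right = t2[field2];
        switch (op) {
            case EQUALS:
                return left == right;
            case GREATER_THAN:
                return left > right;
            case LESS_THAN:
                return left < right;
            default:
                throw new IllegalStateException("unhandled op " + op);
        }
    }
}
```

For on a.id = b.id with id as field 0 on both sides, the predicate is new JoinPredicateSketch(0, Op.EQUALS, 0).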

Exercise 2: Aggregates

The handout introduces Exercise 2 as follows:

An additional SimpleDB operator implements basic SQL aggregates with a GROUP BY clause. You should implement the
five SQL aggregates (COUNT, SUM, AVG, MIN, MAX) and support grouping. You only need to support aggregates over a
single field, and grouping by a single field.

In order to calculate aggregates, we use an Aggregator
interface which merges a new tuple into the existing calculation of an aggregate. The Aggregator is told during
construction what operation it should use for aggregation. Subsequently, the client code should
call Aggregator.mergeTupleIntoGroup() for every tuple in the child iterator. After all tuples have been merged, the
client can retrieve a OpIterator of aggregation results. Each tuple in the result is a pair of the
form (groupValue, aggregateValue), unless the value of the group by field was Aggregator.NO_GROUPING, in which case
the result is a single tuple of the form (aggregateValue).

Note that this implementation requires space linear in the number of distinct groups. For the purposes of this lab, you
do not need to worry about the situation where the number of groups exceeds available memory.

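Exercise 2 asks us to implement aggregate operations such as count, sum, avg, min and max, and the aggregator must also support grouped aggregation, as in the following SQL statement: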

SELECT SUM(fee) AS country_group_total_fee, country FROM member GROUP BY country

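This statement computes the total fee and the country name for each country (grouping by country name), using the SUM aggregate. fee is the aggregate field and country is the group-by field; these two fields are the key to understanding aggregation.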

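"You only need to support aggregates over a single field, and grouping by a single field." The handout tells us that we only have to group and aggregate on one field each, so there is exactly one group-by field and one aggregate field.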
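
The SUM ... GROUP BY country statement above can be mimicked with a map from group value to running total. GroupBySumSketch is a hypothetical example that models each row as {country, fee}; it is not SimpleDB code, but the merge step has the same shape as merging one tuple into its group.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical illustration of SUM with GROUP BY on a single field. */
class GroupBySumSketch {
    /** Each row is {country, fee}; returns country -> total fee, in first-seen order. */
    static Map<String, Integer> sumByCountry(Object[][] rows) {
        Map<String, Integer> totals = new LinkedHashMap<>();
        for (Object[] row : rows) {
            String country = (String) row[0]; // group-by field
            int fee = (Integer) row[1];       // aggregate field
            // Merge this row into its group: start at fee, or add fee to the existing total.
            totals.merge(country, fee, Integer::sum);
        }
        return totals;
    }
}
```

With the rows ("CN", 10), ("US", 5) and ("CN", 7), the result holds two groups: CN with a total of 17 and US with a total of 5.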

The requirements for Exercise 2:

Implement the skeleton methods in:

src/java/simpledb/execution/IntegerAggregator.java
src/java/simpledb/execution/StringAggregator.java
src/java/simpledb/execution/Aggregate.java

At this point, your code should pass the unit tests IntegerAggregatorTest, StringAggregatorTest, and AggregateTest.
Furthermore, you should be able to pass the AggregateTest system test.

Implementing IntegerAggregator

IntegerAggregator aggregates a specified field and ultimately serves its results through an iterator. Its basic fields are:

    private int groupField;
    private Type groupFieldType;
    private int aggregateField;
    private Op aggregateOp;

    private TupleDesc td;

    /**
     * Aggregate values for int fields; used for MIN, MAX, COUNT and SUM.
     */
    private Map<Field, Integer> groupMap;
    /**
     * AVG is a special case: values are first collected in a list and the
     * average is computed only at the end, to keep the result accurate.
     */
    private Map<Field, List<Integer>> avgMap;

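groupField is the index of the tuple field to group by; it is -1 when no grouping is needed, and corresponds to country in the SQL statement above. groupFieldType is the type of the group-by field, or null when there is no grouping. aggregateField is the index of the tuple field to aggregate, corresponding to fee above. aggregateOp is the aggregation operator, corresponding to SUM above. td describes the result tuples: with grouping, td is a TupleDesc with two fields of the form (groupField, aggregateField), holding the aggregate result for each group; without grouping, td has a single field that holds the aggregate result. groupMap and avgMap hold the aggregation results and are used in the computations that follow.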

Below is the constructor, which mainly initializes the fields above from its arguments. NO_GROUPING is the constant -1.

    public IntegerAggregator(int gbfield, Type gbfieldtype, int afield, Op what) {
        // some code goes here
        this.groupField = gbfield;
        this.groupFieldType = gbfieldtype;
        this.aggregateField = afield;
        this.aggregateOp = what;
        groupMap = new HashMap<>();
        avgMap = new HashMap<>();
        this.td = gbfield != NO_GROUPING
                ? new TupleDesc(new Type[]{gbfieldtype, Type.INT_TYPE}, new String[]{"gbVal", "aggVal"})
                : new TupleDesc(new Type[]{Type.INT_TYPE}, new String[]{"aggVal"});
    }

Both IntegerAggregator and StringAggregator exist to perform aggregation (with optional grouping), so their core method is mergeTupleIntoGroup. IntegerAggregator.mergeTupleIntoGroup(Tuple tup) works as follows:

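1. Use the aggregateField given to the constructor to fetch the aggregate field of tup and its value;

2. Use the groupField given to the constructor to fetch the group-by field of tup, or null if no grouping is needed; the type of the group-by field is checked here;

3. Apply the aggregateOp given to the constructor. For MIN, MAX, COUNT and SUM the result is kept in groupMap, where the key is the group-by field (null when there is no grouping) and the value is the aggregate result. AVG cannot be updated incrementally in the same way, because integer division is lossy; instead all field values are collected in a list, and the average is computed only when the aggregate result is requested.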
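
The concern in step 3 can be checked with a small experiment. The sketch below is not the author's code: AvgSketch is a hypothetical class that keeps only a running sum and count per group, an alternative to storing every value in a list that still divides exactly once. naiveRunningAvg shows what goes wrong when the average is re-truncated after every value.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical AVG helper: keeps {sum, count} per group and divides once at the end. */
class AvgSketch {
    private final Map<String, int[]> sumAndCount = new HashMap<>(); // group -> {sum, count}

    /** Merges one value into its group. */
    void merge(String group, int value) {
        int[] acc = sumAndCount.computeIfAbsent(group, g -> new int[2]);
        acc[0] += value; // running sum
        acc[1] += 1;     // running count
    }

    /** Integer division applied a single time; the group must have at least one value. */
    int avg(String group) {
        int[] acc = sumAndCount.get(group);
        return acc[0] / acc[1];
    }

    /** The lossy approach: re-average after every value, truncating each time. */
    static int naiveRunningAvg(int[] values) {
        int avg = 0;
        for (int i = 0; i < values.length; i++) {
            avg = (avg * i + values[i]) / (i + 1);
        }
        return avg;
    }
}
```

For the values 1, 2, 3, 4 the sum is 10, so dividing once gives 10 / 4 = 2, while the running version truncates at every step and ends at 1. Keeping a sum and a count uses constant space per group, at the cost of possible int overflow for very large sums.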

The code is as follows:

    public void mergeTupleIntoGroup(Tuple tup) {
        // some code goes here
        // Fetch the aggregate field
        IntField aField = (IntField) tup.getField(aggregateField);
        // Value of the aggregate field
        int value = aField.getValue();
        // Fetch the group-by field; null when aggregating without grouping
        Field gbField = groupField == NO_GROUPING ? null : tup.getField(groupField);
        if (gbField != null && gbField.getType() != this.groupFieldType && groupFieldType != null) {
            throw new IllegalArgumentException("Tuple has wrong type");
        }
        // Process the value according to the aggregate operator
        switch (aggregateOp) {
            case MIN:
                groupMap.put