MIT6.830 lab2 SimpleDB Operators

fighting_yifeng

Column: MIT6.830 Database. Tags: database

Article link: https://blog.csdn.net/fighting_yifeng/article/details/122814617


The main task of Lab 2 is to write a set of operators for SimpleDB that implement table modifications (e.g., insert and delete records), selections, joins, and aggregates.

Exercise 1 Filter and Join

Filter: This operator only returns tuples that satisfy a Predicate that is specified as part of its constructor. Hence, it filters out any tuples that do not match the predicate.
Join: This operator joins tuples from its two children according to a JoinPredicate that is passed in as part of its constructor. We only require a simple nested loops join, but you may explore more interesting join implementations. Describe your implementation in your lab writeup.

Exercise 1 asks us to complete all the methods in four classes: Predicate.java, JoinPredicate.java, Filter.java, and Join.java.

1.Predicate.java

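Predicate compares a tuple against a specified field value. Inside Filter, predicate is the condition, the key attribute that implements filtering; child is the data source, from which we fetch tuples one by one and test them with predicate; td describes the result tuples. In Filter, td is the same as that of the incoming data source, whereas other operators build a TupleDesc according to the results they return.

enum Op: the enum defines seven predicate operations: EQUALS, GREATER_THAN, LESS_THAN, LESS_THAN_OR_EQ, GREATER_THAN_OR_EQ, LIKE, and NOT_EQUALS.
Predicate constructor: takes three parameters. The first, field, is the index of the field in the incoming tuples to compare; the second, op, is the comparison operation; the third, operand, is the field value to compare against.
filter(Tuple t): compares the specified field of tuple t with operand using op, and returns true if the comparison holds.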
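The comparison performed by filter can be sketched as follows. This is a minimal illustration, not SimpleDB's code: the class SimplePredicate is hypothetical, tuples are modeled as plain int arrays, and LIKE is left out because it only makes sense for string fields.

```java
// Hypothetical stand-in for SimpleDB's Predicate; a tuple is a plain int[] here.
final class SimplePredicate {
    enum Op { EQUALS, GREATER_THAN, LESS_THAN, LESS_THAN_OR_EQ, GREATER_THAN_OR_EQ, NOT_EQUALS }

    private final int field;    // index of the tuple field to compare
    private final Op op;        // comparison operation
    private final int operand;  // constant value to compare against

    SimplePredicate(int field, Op op, int operand) {
        this.field = field;
        this.op = op;
        this.operand = operand;
    }

    // True when "tuple[field] <op> operand" holds; the tuple's field is the LEFT operand.
    boolean filter(int[] tuple) {
        int value = tuple[field];
        switch (op) {
            case EQUALS:             return value == operand;
            case GREATER_THAN:       return value > operand;
            case LESS_THAN:          return value < operand;
            case LESS_THAN_OR_EQ:    return value <= operand;
            case GREATER_THAN_OR_EQ: return value >= operand;
            case NOT_EQUALS:         return value != operand;
            default: throw new IllegalStateException("unknown op " + op);
        }
    }
}
```

With field 0, GREATER_THAN, and operand 2, the tuple {3, 9} passes and {2, 9} does not.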

2.JoinPredicate.java

JoinPredicate compares fields of two tuples using a predicate. It is most likely to be used by the Join operator.

JoinPredicate constructor: creates a new predicate over two fields of two tuples. It takes three parameters: field1 and field2 are the field indices into the first and second tuple, respectively, and op is the operation to apply.
filter(Tuple t1, Tuple t2): applies the predicate to the two specified tuples.

3.Join

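The Join operator implements the relational join operation.

Join: the constructor. It accepts the two children to join and the JoinPredicate used to join them.
fetchNext: returns the next tuple generated by the join, or null if there are no more tuples. Logically, this is the next tuple in r1 cross r2 that satisfies the join predicate. Many implementations are possible; the simplest is a nested loops join. Note that the tuples returned by this particular Join implementation are simply the concatenation of the joining tuples from the left and right relations. Therefore, if an equality predicate is used, there will be two copies of the join attribute in the results. (If needed, an additional projection operator can remove these duplicate columns.) For example, if one tuple is {1,2,3} and the other is {1,5,6}, joined on equality of the first column, the result is {1,2,3,1,5,6}.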
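A nested loops equality join can be sketched like this. It is a simplified illustration under stated assumptions: NestedLoopJoinSketch is a hypothetical name, tuples are int arrays, and the whole result is materialized in one call, whereas the real fetchNext has to return one tuple per call and remember its position in both children.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of a nested loops equality join over int[] tuples.
final class NestedLoopJoinSketch {
    // Joins on left[leftField] == right[rightField].
    // Each output tuple is the left tuple's fields followed by the right tuple's fields.
    static List<int[]> join(List<int[]> left, List<int[]> right, int leftField, int rightField) {
        List<int[]> out = new ArrayList<>();
        for (int[] l : left) {          // outer loop over the left child
            for (int[] r : right) {     // inner loop rescans the right child
                if (l[leftField] == r[rightField]) {
                    int[] merged = new int[l.length + r.length];
                    System.arraycopy(l, 0, merged, 0, l.length);
                    System.arraycopy(r, 0, merged, l.length, r.length);
                    out.add(merged);
                }
            }
        }
        return out;
    }
}
```

Joining {1,2,3} with {1,5,6} on the first column yields {1,2,3,1,5,6}, with the join attribute appearing twice.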

4.Filter

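Filter is the operator that implements relational selection. It is the basis of the WHERE clause in SQL, as in select * from students where id > 2. To perform the filtering, we call next on the Filter iterator to retrieve every record that passes. The result of the SQL statement above is conceptually a List list, that is, a collection of tuples; ignoring the implementation details, Filter behaves like the iterator returned by list.iterator(), and we call it.next() to fetch the matching tuples one at a time.

Filter: the constructor accepts the predicate to apply and the child operator from which to read the tuples to filter.
fetchNext: the AbstractDbIterator.readNext() implementation. It iterates over the tuples from the child operator, applies the predicate to them, and returns those that pass the predicate (that is, those for which Predicate.filter() returns true).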
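The look-ahead pattern behind fetchNext can be sketched with the standard java.util.Iterator interface. FilterSketch is a hypothetical class, not SimpleDB's Filter, and it assumes the child never yields null elements, since null is used to signal exhaustion.

```java
import java.util.Iterator;
import java.util.NoSuchElementException;
import java.util.function.Predicate;

// Hypothetical stand-in for Filter: wraps a child iterator and a predicate.
final class FilterSketch<T> implements Iterator<T> {
    private final Predicate<T> predicate;
    private final Iterator<T> child;
    private T next; // look-ahead slot filled by fetchNext()

    FilterSketch(Predicate<T> predicate, Iterator<T> child) {
        this.predicate = predicate;
        this.child = child;
    }

    // Pulls from the child until an element passes the predicate; null means exhausted.
    private T fetchNext() {
        while (child.hasNext()) {
            T t = child.next();
            if (predicate.test(t)) {
                return t;
            }
        }
        return null;
    }

    @Override
    public boolean hasNext() {
        if (next == null) {
            next = fetchNext();
        }
        return next != null;
    }

    @Override
    public T next() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        T result = next;
        next = null; // consume the look-ahead element
        return result;
    }
}
```

Filtering the ids 1, 2, 3, 4 with the condition id > 2 yields 3 and then 4.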

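5. Summary

A Predicate is an assertion that makes a yes/no judgment: it compares a tuple against a field. Its constructor takes a comparison operator (greater than, less than, and so on), an operand field, and the id of the field in the incoming tuple to compare. filter() receives a tuple and, as the comments in the skeleton code suggest, simply returns the result of calling the compare function. Watch the semantics: for greater-than, it must return true when the tuple's field is greater than the operand, not the other way around.

JoinPredicate also makes a judgment, but it compares specified fields of two tuples (a different field may be specified for each tuple).

Filter is essentially a wrapper around Predicate: a Predicate is constructed with an operand field and compares it against one tuple at a time, whereas Filter applies that Predicate to a whole stream of tuples supplied by its child iterator.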

Exercise2:Aggregates

Exercise 2 asks us to implement the aggregate operations count, sum, avg, min, and max, and the aggregators must also support grouping. Take the following SQL statement:

SELECT SUM(fee) AS country_group_total_fee, country FROM member GROUP BY country
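This statement queries the total fee and the name of each country (grouped by country name), using the aggregate function SUM. Here fee is the aggregate field and country is the group-by field; these two fields are the key to understanding aggregation.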

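1.Aggregate.java

Aggregate constructor: the Aggregate operator computes an aggregate (e.g., sum, avg, max, min) over a single column. The constructor takes four parameters: child, an OpIterator that supplies the tuples; afield, an int identifying the column to aggregate; gfield, an int identifying the column to group by in the result; and aop, an Aggregator.Op giving the aggregation operator to use.
groupField(): if this aggregate has a group by, returns the index of the group-by field.
groupFieldName(): if this aggregate has a group by, returns the name of the group-by field.
fetchNext(): returns the next tuple. If there is a group-by field, the first field is the group value and the second is the aggregate result for that group; if there is no group-by field, the tuple contains only the aggregate result.
getTupleDesc(): returns the TupleDesc of the tuples this aggregate produces.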

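Both IntegerAggregator and StringAggregator perform aggregation (with optional grouping), so their core method is mergeTupleIntoGroup. IntegerAggregator.mergeTupleIntoGroup(Tuple tup) works as follows:

1. Use the aggregateField given to the constructor to get the aggregate field of tup and its value;

2. Use the groupField given to the constructor to get the group-by field of tup, or null if no grouping is needed; the type of the group-by field should be checked here;

3. Apply the aggregateOp given to the constructor. For MIN, MAX, COUNT, and SUM, the running result is kept in groupMap, whose key is the group-by field (null when there is no grouping) and whose value is the aggregate so far. AVG cannot be updated incrementally, because integer division is inexact; instead, all field values are stored in a list and the average is computed only when the result is requested.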
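The grouping logic can be sketched as below. This is an illustration rather than the post's implementation: IntAggregatorSketch is a hypothetical class, groups are keyed by a String (null meaning no grouping), and instead of storing every value in a list for AVG it keeps a running sum and count per group, which gives the same result while still deferring the division to the end.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch of a grouped integer aggregator.
final class IntAggregatorSketch {
    enum Op { MIN, MAX, SUM, COUNT, AVG }

    private final Op op;
    // key: group value (null when there is no grouping); value: {min, max, sum, count}
    private final Map<String, int[]> groups = new LinkedHashMap<>();

    IntAggregatorSketch(Op op) {
        this.op = op;
    }

    // Folds one (group, value) pair into the running state of its group.
    void mergeTupleIntoGroup(String group, int value) {
        int[] s = groups.get(group);
        if (s == null) {
            groups.put(group, new int[]{value, value, value, 1});
            return;
        }
        s[0] = Math.min(s[0], value);
        s[1] = Math.max(s[1], value);
        s[2] += value;
        s[3] += 1;
    }

    // Final aggregate per group; AVG divides only here, using integer division.
    Map<String, Integer> results() {
        Map<String, Integer> out = new LinkedHashMap<>();
        for (Map.Entry<String, int[]> e : groups.entrySet()) {
            int[] s = e.getValue();
            int v;
            switch (op) {
                case MIN:   v = s[0]; break;
                case MAX:   v = s[1]; break;
                case SUM:   v = s[2]; break;
                case COUNT: v = s[3]; break;
                default:    v = s[2] / s[3]; break; // AVG
            }
            out.put(e.getKey(), v);
        }
        return out;
    }
}
```

For example, merging ("US", 10), ("CN", 7), and ("US", 25) under AVG gives 17 for US (35 / 2 in integer division) and 7 for CN.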

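2.Summary

How is a grouped query with an aggregate function executed?

0. The client sends a request containing the SQL statement (assuming we have a client and a server);

1. The SQL parser parses it and determines that the data comes from the member table, the group-by field is country (gbField = 1), the aggregate field is fee (aggField = 2), and the aggregation operator is op = SUM;

2. Using the id of the member table, call Database.getCatalog().getDatabaseFile(tableid) to get the table's HeapFile, then call the HeapFile's iterator method to obtain all of the table's records, which form the data source child;

3. Create an Aggregate from gbField, aggField, op, and child; the Aggregate constructor uses gbField, aggField, and op to create the IntegerAggregator and the description td of the result tuples;

4. Call open on the Aggregate (remember that Aggregate is itself an iterator and must be opened before next can be called). open repeatedly pulls records from child and calls the aggregator's mergeTupleIntoGroup on each; once the input is exhausted, the aggregator's iterator method produces the result iterator it;

5. Keep taking results from it and return them to the client.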

Exercise3:HeapFile Mutability

Removing tuples: To remove a tuple, you will need to implement deleteTuple. Tuples contain RecordIDs which allow you to find the page they reside on, so this should be as simple as locating the page a tuple belongs to and modifying the headers of the page appropriately.

Adding tuples: The insertTuple method in HeapFile.java is responsible for adding a tuple to a heap file. To add a new tuple to a HeapFile, you will have to find a page with an empty slot. If no such pages exist in the HeapFile, you need to create a new page and append it to the physical file on disk. You will need to ensure that the RecordID in the tuple is updated correctly.

To implement HeapPage, you will need to modify the header bitmap for methods such as insertTuple() and deleteTuple(). You may find that the getNumEmptySlots() and isSlotUsed() methods we asked you to implement in Lab 1 serve as useful abstractions. Note that there is a markSlotUsed method provided as an abstraction to modify the filled or cleared status of a tuple in the page header.

Note that it is important that the HeapFile.insertTuple() and HeapFile.deleteTuple() methods access pages using the BufferPool.getPage() method; otherwise, your implementation of transactions in the next lab will not work properly.

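Exercise 3 asks us to implement tuple insertion and deletion in HeapPage, HeapFile, and BufferPool.

1.Inserting and deleting tuples in HeapPage

To insert a tuple into a HeapPage, first find an empty slot, insert the tuple there, and then handle the related bookkeeping; to delete a tuple from a HeapPage, first find which slot holds it, then clear that slot.

2.Inserting and deleting tuples in HeapFile

In practice, tuple insertion and deletion both enter through HeapFile. Taking insertion as an example, the call relationship between HeapFile and HeapPage is as follows:

1. HeapFile.insertTuple is called.

2. HeapFile.insertTuple iterates over all data pages (fetched with BufferPool.getPage(), which looks in the BufferPool first and reads from disk only on a miss) and checks whether each has an empty slot. If one does, it calls that page's insertTuple method to insert the tuple. If no page has an empty slot, it creates a new empty data page on disk and then calls HeapPage.insertTuple on that page.

3. The page that received the tuple is added to a list and returned, indicating that it is now dirty; the caller uses this list later.

3.Inserting and deleting tuples in BufferPool

Taking insertion as an example, the call relationship between BufferPool and HeapFile is:

1. To insert a tuple, BufferPool first calls Database.getCatalog().getDatabaseFile(tableId) to get the HeapFile, that is, the table file;

2. It calls HeapFile.insertTuple(), which inserts the tuple and returns the pages that were modified;

3. It marks each returned page as dirty with HeapPage's markDirty method and puts the page into the buffer pool.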
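The slot bookkeeping can be sketched with a plain byte array as the header bitmap. HeaderBitmapSketch is a hypothetical class that tracks only which slots are occupied (no tuple data, no RecordId), and it assumes that slot i maps to bit i % 8 of byte i / 8, counting from the least significant bit.

```java
// Hypothetical sketch of a page header bitmap: one bit per slot, 1 = used.
final class HeaderBitmapSketch {
    private final byte[] header;
    private final int numSlots;

    HeaderBitmapSketch(int numSlots) {
        this.numSlots = numSlots;
        this.header = new byte[(numSlots + 7) / 8]; // round up to whole bytes
    }

    boolean isSlotUsed(int i) {
        return ((header[i / 8] >> (i % 8)) & 1) == 1;
    }

    void markSlotUsed(int i, boolean used) {
        if (used) {
            header[i / 8] |= (byte) (1 << (i % 8));
        } else {
            header[i / 8] &= (byte) ~(1 << (i % 8));
        }
    }

    // Claims the first empty slot and returns its index, or -1 when the page is full.
    int insert() {
        for (int i = 0; i < numSlots; i++) {
            if (!isSlotUsed(i)) {
                markSlotUsed(i, true);
                return i;
            }
        }
        return -1;
    }

    // Frees a slot; deleting from an empty slot is treated as an error.
    void delete(int slot) {
        if (!isSlotUsed(slot)) {
            throw new IllegalStateException("slot " + slot + " is already empty");
        }
        markSlotUsed(slot, false);
    }
}
```

After two inserts take slots 0 and 1, deleting slot 0 makes it the first empty slot again, so the next insert reuses it.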

Exercise4:Insertion and deletion

For plans that implement insert and delete queries, the top-most operator is a special Insert or Delete operator that modifies the pages on disk. These operators return the number of affected tuples. This is implemented by returning a single tuple with one integer field, containing the count.

Insert: This operator adds the tuples it reads from its child operator to the tableid specified in its constructor. It should use the BufferPool.insertTuple() method to do this.
Delete: This operator deletes the tuples it reads from its child operator from the tableid specified in its constructor. It should use the BufferPool.deleteTuple() method to do this.

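Exercise 4 asks us to implement the Insert and Delete operators. They are really just two iterators, implemented much like those in Exercise 1: process the incoming data source and return the result, with the processing usually written in fetchNext. The result tuple here has a single field, the number of rows affected by the insertion or deletion, similar to MySQL. The actual insertion and deletion call the tuple insertion and deletion methods implemented in Exercise 3.

1. Insert.java

Insert constructor: the operator adds the tuples read from the child operator to the table identified by tableId. It takes three parameters: tid, which identifies the transaction; child, an OpIterator; and tableId.
fetchNext(): uses child to read the tuples to add. Insertions must go through the BufferPool, so each tuple is added with Database.getBufferPool().insertTuple(this.tid, this.tableId, t).

2. Delete.java

Delete constructor: the operator deletes the tuples read from the child operator from the table they belong to. It takes two parameters: tid, which identifies the transaction, and child, an OpIterator.
fetchNext(): uses child to read the tuples to delete. Deletions must go through the BufferPool, so each tuple is removed with Database.getBufferPool().deleteTuple(this.tid, t).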
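The "return the count exactly once" behavior of fetchNext can be sketched as follows. InsertSketch is a hypothetical class: a plain List stands in for the table that the real operator reaches through BufferPool.insertTuple, and the count is returned as an Integer rather than as a one-field tuple.

```java
import java.util.Iterator;
import java.util.List;

// Hypothetical sketch of the Insert operator's fetchNext contract.
final class InsertSketch {
    private final Iterator<int[]> child; // source of tuples to insert
    private final List<int[]> table;     // stand-in for the target table
    private boolean fetched = false;     // ensures the insertion runs only once

    InsertSketch(Iterator<int[]> child, List<int[]> table) {
        this.child = child;
        this.table = table;
    }

    // First call: drains the child, inserts every tuple, and returns the count.
    // Later calls: return null, so the rows are never inserted twice.
    Integer fetchNext() {
        if (fetched) {
            return null;
        }
        fetched = true;
        int count = 0;
        while (child.hasNext()) {
            table.add(child.next());
            count++;
        }
        return count;
    }
}
```

Even when the child is empty, the first call returns 0 rather than null, which keeps "zero rows affected" distinguishable from "no more results". A Delete sketch would have the same shape with a removal in place of table.add.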

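3.Summary

How is a batch of records inserted?

The client sends a request whose payload is the SQL insert statement (assuming we have a client and a server);
The SQL parser parses the statement and extracts the target table and the records to insert;
The table is resolved to its table id, and the records are wrapped into a data source child (in essence, an iterator);
A transaction id is generated for this batch insert;
tid, tableId, and child are passed to the constructor of the Insert operator to create the Insert object;
hasNext is called on the Insert object to check whether there is a result; because this is the first call, hasNext invokes our fetchNext method, which performs the insertion and produces the result;
Inside fetchNext, the steps are: call Database.getBufferPool().insertTuple(tid, tableId, tuple) to insert. BufferPool's insertTuple uses tableId to look up the database file HeapFile and calls HeapFile's insertTuple method; HeapFile's insertTuple in turn calls BufferPool.getPage() to fetch a HeapPage from the buffer pool (reading it from disk and caching it only if it is not already there); once it has the HeapPage, it calls HeapPage.insertTuple() to insert the tuple; when the insertion is done, HeapFile returns the page it obtained from the BufferPool and modified, and BufferPool's insertTuple marks that page as dirty and writes it back into the buffer pool;
Throughout this process, the tuple is not actually written to disk; a page is taken from the buffer pool, the tuple is inserted into it, and the page is marked dirty and put back into the buffer pool.
Once all insertions have finished, we have a single result tuple, which is processed and returned to the client.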

Exercise5: Page eviction

If you did not implement writePage() in HeapFile.java above, you will also need to do that here. Finally, you should also implement discardPage() to remove a page from the buffer pool without flushing it to disk. We will not test discardPage() in this lab, but it will be necessary for future labs.

At this point, your code should pass the EvictionTest system test.

Since we will not be checking for any particular eviction policy, this test works by creating a BufferPool with 16
pages (NOTE: while DEFAULT_PAGES is 50, we are initializing the BufferPool with less!), scanning a file with many more
than 16 pages, and seeing if the memory usage of the JVM increases by more than 5 MB. If you do not implement an
eviction policy correctly, you will not evict enough pages, and will go over the size limitation, thus failing the test.

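Exercise 5 asks us to implement a page eviction policy for the BufferPool. Why is one needed? By default the BufferPool caches at most 50 pages (DEFAULT_PAGES); once more pages than its capacity are brought in, pages that are not needed for the moment have to be evicted from the BufferPool.

LRU is one possible eviction algorithm; see LeetCode 146, LRU Cache, for reference: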

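#include <unordered_map>

class LRUCache {
public:
    // Node of a doubly linked list ordered from most to least recently used.
    struct Node {
        int key, val;
        Node *left, *right;
        Node(int _key, int _val) : key(_key), val(_val), left(nullptr), right(nullptr) {}
    } *L, *R;                             // sentinel head (L) and tail (R)
    int n;                                // capacity, assumed to be at least 1
    std::unordered_map<int, Node*> hash;  // key -> node in the list

    LRUCache(int capacity) {
        n = capacity;
        L = new Node(-1, -1), R = new Node(-1, -1);
        L->right = R, R->left = L;
    }

    // The cache owns raw pointers, so copying is disabled.
    LRUCache(const LRUCache&) = delete;
    LRUCache& operator=(const LRUCache&) = delete;

    // Free every node still in the list, including both sentinels.
    ~LRUCache() {
        for (Node *p = L; p != nullptr;) {
            Node *next = p->right;
            delete p;
            p = next;
        }
    }

    // Return the value for key, or -1 if absent; a hit becomes most recently used.
    int get(int key) {
        if (hash.count(key) == 0) return -1;
        auto p = hash[key];
        remove(p);
        insert(p);
        return p->val;
    }

    // Insert or update key; when full, evict the least recently used entry first.
    void put(int key, int value) {
        if (hash.count(key)) {
            auto p = hash[key];
            p->val = value;
            remove(p);
            insert(p);
        } else {
            if (static_cast<int>(hash.size()) == n) {
                auto p = R->left;      // least recently used node
                remove(p);
                hash.erase(p->key);
                delete p;              // the original leaked the evicted node
            }
            auto p = new Node(key, value);
            insert(p);
            hash[key] = p;
        }
    }

    // Unlink p from the list without freeing it.
    void remove(Node *p) {
        p->left->right = p->right;
        p->right->left = p->left;
    }

    // Link p right after the head sentinel, the most recently used position.
    void insert(Node *p) {
        p->left = L; p->right = L->right;
        L->right->left = p; L->right = p;
    }
};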

/**
 * Your LRUCache object will be instantiated and called as such:
 * LRUCache* obj = new LRUCache(capacity);
 * int param_1 = obj->get(key);
 * obj->put(key,value);
 */
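Since SimpleDB itself is written in Java, the same LRU idea can also be sketched with java.util.LinkedHashMap in access order. This is only an illustration of the policy: LruPageCacheSketch is a hypothetical class, not part of SimpleDB, and a real BufferPool would additionally have to deal with dirty pages before dropping them.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical LRU cache sketch built on LinkedHashMap's access-order mode.
final class LruPageCacheSketch<K, V> extends LinkedHashMap<K, V> {
    private final int capacity;

    LruPageCacheSketch(int capacity) {
        // accessOrder = true: get() and put() move an entry to the most-recent end
        super(16, 0.75f, true);
        this.capacity = capacity;
    }

    // Called by LinkedHashMap after each insertion; returning true evicts the
    // least recently used entry.
    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > capacity;
    }
}
```

With capacity 2, putting keys 1 and 2, reading key 1, and then putting key 3 evicts key 2, because key 1 was touched more recently.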

Lab summary

lab2ha
