c-tree数据库大量数据bulk loading

大部分数据库都提供了bulk insert的手段,c-tree也有类似的功能,就是它的batch处理模式。这种模式可以减少客户端和服务器的交互,提高处理速度。根据我的测试结果,batch模式和普通插入模式的速度比大概是3.5。下表是插入1百万和1千万数据时所用时间(单位秒):

数据量Batch模式普通模式
1m2277
10m226780

以上测试所用机器Intel core i5-2410M (四核), Mem: 4GB, HD: 7200转。


不过在使用的过程中,有两点需要注意。

1. 待导入数据表屏蔽事务支持

比如我下面贴的例子程序,建表时没有加上CTCREATE_TRNLOG选项,就不会启用事务支持。

如果不知原来建表时有没有启用该选项,则可以利用c-tree提供的几个api临时屏蔽,数据导完之后再恢复。

如下面这段代码用来屏蔽:

      //get original mode
      CTCREATE_MODE mode = table->GetCreateMode();   
       
      if ((mode & CTCREATE_TRNLOG))
      {
         // remove TRN mode to disable transaction support
         mode &= ~CTCREATE_TRNLOG;
         table->UpdateCreateMode(mode);
         mode |= CTCREATE_TRNLOG;  //restore old mode;
      }
下面用来恢复:
      table->UpdateCreateMode(mode);

这一点对于插入速度影响非常大,不管是批量插入还是普通插入模式,当导入大量数据时应该都不要用事务支持。


2. Batch模式的应用程序需要启用事务控制

不要奇怪,这看起来和第1点是矛盾的。因为既然表都不支持事务控制,应用里面还要事务控制干什么? 先看完整例子:

#include <string>
#include <iostream>
#include "ctdbsdk.hpp"

using namespace std;

const unsigned BATCHNUM = 10000;
const unsigned TOTALNUM = 10000000;

int main(int argc, char *argv[]) {

    string tableName = "data_loading";
    ctdbStartDatabaseEngine();
    CTSession *MySession = new CTSession(CTSESSION_SQL);
    CTDatabase *MyDatabase = new CTDatabase(MySession);
    CTTable *MyTable = new CTTable(MyDatabase);
    CTRecord *MyRecord = new CTRecord(MyTable);
    try{
        MySession->Logon("FAIRCOMS", "admin", "ADMIN");
        MyDatabase->Connect("ctreesql");
    }
    catch(CTException e){
        cout<<"Logon Error:"<<e.GetErrorCode()<<":"<<e.GetErrorMsg()<<endl;
        exit(1);
    }

    try{
        MyDatabase->DeleteTable(tableName.c_str(),"");
    }
    catch(CTException e){
        cout<<"Delete Table: table not exist"<<endl;
    }

    try{
        MyTable->AddField("field1", CT_TIMESTAMP, 8);
        MyTable->AddField("field2", CT_VARCHAR, 20);
        MyTable->AddField("field3", CT_TIMESTAMP, 8);
        MyTable->AddField("field4", CT_VARCHAR, 50);
        MyTable->AddField("field5", CT_VARCHAR, 50);
        MyTable->AddField("field6", CT_INTEGER, 4);


        CTIndex index1 = MyTable->AddIndex("idx_date_loading_filed6", CTINDEX_FIXED, NO, NO);
        MyTable->AddSegment(index1, "field6", CTSEG_SCHSEG);

        CTIndex index2 = MyTable->AddIndex("idx_date_loading_filed1", CTINDEX_FIXED, YES, YES);
        MyTable->AddSegment(index2, "field1", CTSEG_SCHSEG);

        MyTable->Create(tableName.c_str(), CTCREATE_HUGEFILE);
    }
    catch(CTException &e){
        cout<<"Create Table Error:"<< e.GetErrorCode() <<", Msg:" << e.GetErrorMsg() <<endl<<endl;
        exit(1);
    }

    unsigned lines = 1;

    try{
    	MyTable->Open(tableName.c_str(), CTOPEN_NORMAL);

//      MyRecord->Begin();
        MyRecord->SetBatch(CTBATCH_INS, 0, 0);

        while(lines<=TOTALNUM){
            MyRecord->Clear();
            MyRecord->SetFieldAsDateTime("field1", CTDateTime::CurrentDateTime());
            MyRecord->SetFieldAsString("field2", "abcdefg");
            MyRecord->SetFieldAsDateTime("field3", CTDateTime::CurrentDateTime());
            MyRecord->SetFieldAsString("field4", "");
            MyRecord->SetFieldAsString("field5", "");
            MyRecord->SetFieldAsSigned("field6", lines);

            MyRecord->InsertBatch();

            if(lines % BATCHNUM == 0){
                MyRecord->EndBatch();
//              MyRecord->Commit();
                cout<<endl<< lines <<" records inserted";
//              MyRecord->Begin();
                MyRecord->SetBatch(CTBATCH_INS, 0, 0);
            }

            ++lines;
        }

        MyRecord->EndBatch();
//      MyRecord->Commit();
    }
    catch(CTException &e){
        cout<<"Insert Error:"<< e.GetErrorCode() <<", Msg:" << e.GetErrorMsg() <<endl<<endl;
    }

    cout<<endl<<"Done! " << lines-1 << " records inserted."<<endl<<endl;

    MyTable->Close();
    MyDatabase->Disconnect();
    MySession->Logout();

    delete MyRecord;
    delete MyTable;
    delete MyDatabase;
    delete MySession;
}

上面这段程序先建了个表,然后往里面插入1千万的数据,每1万为一个batch。这段程序可以跑,也可以快速将记录成功插入。唯一的问题是在导入过程中,c-tree server的进程所占内存逐渐攀升,最后占用了1.3G物理内存和1.5G的虚拟内存!

解决这个问题的方法就是uncomment其中注释掉的那四行,启用事物控制。根据厂家技术支持的说法,在Commit之后c-tree服务会释放内存。哎,人家的设计如此。不过这种设计有些让人迷惑。


先让我们看看原题的三个任务介绍: Task 1: Sorting the LINEITEM table by External Merge Sort Consider two cases: 1) using 5 buffer pages in memory for the external merge sort; 2) using 129 buffer pages in memory for the external merge sort. In the implementation, each buffer page occupies 8K bytes. The ORDERKEY attribute of the LINEITEM table is assumed to be the sort key in the external merge sort. Please report the number of passes and also the running time of the external merge sort in each case. Task 2: Organizing the sorted LINEITEM table into disk pages Please use the page format for storing variable-length records to organize the LINEITEM table sorted in Task 1. In the implementation, each disk page occupies 1K bytes. For each page we maintain a directory of slots, with a pair per slot. Both “record offset” and “record length” are 4 bytes wide. Task 3: Building a B-Tree over LINEITEM disk pages by Bulk Loading. Please use bulk loading to build a B-Tree over the disk pages of the LINEITEM table, which are generated in Task 2. The ORDERKEY attribute of the LINEITEM table is used as the (search) key for building the B-Tree. In the B-Tree, each internal node corresponds to a page of 1K bytes, both key and pointer are 4 bytes wide. Please report the running time of the bulk loading. A query interface is required for checking the B-Tree. For a reasonable ORDERKEY value, please print out all the pages visited along the path to find the corresponding record. Please also report the running time of the search.
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值