Indexing Algorithms (Lucene's Implementation)

Lucene is a solid full-text retrieval engine. It implements tokenization,
indexing, and querying -- the core algorithms of a search engine. Once
compiled it is a library whose interfaces anyone can use free of charge.
The code (CLucene, the C++ port) is available at
http://sourceforge.net/projects/clucene/

Let's walk through the basic flow of a retrieval engine. Below is the
Document test program from the Lucene sources; it shows the basic pattern
for using the Lucene library
(excerpted from /test/document/testDocument.cpp):
void TestBinaryDocument(CuTest *tc){
    char factbook[1024];
    strcpy(factbook, clucene_data_location);
    strcat(factbook, "/reuters-21578/feldman-cia-worldfactbook-data.txt");
    CuAssert(tc,_T("Factbook file does not exist"),Misc::dir_Exists(factbook));

    Document doc;
    Field* f;
    const char* _as;
    const char* _as2;
    const TCHAR* _ts;
    jstreams::StreamBase<char>* strm;
    RAMDirectory ram;

    const char* areaderString = "a string reader field";
    const TCHAR* treaderString = _T("a string reader field");
    int readerStringLen = strlen(areaderString);

    SimpleAnalyzer an;
    IndexWriter writer(&ram,&an,true); //no analyzer needed since we are not indexing...
------------------------------------
    //use reader
    doc.add( *_CLNEW Field(_T("readerField"),_CLNEW StringReader (treaderString),
        Field::TERMVECTOR_NO | Field::STORE_YES | Field::INDEX_NO) );
    writer.addDocument(&doc);
    doc.clear();
---------------------------------
    IndexReader* reader = IndexReader::open(&ram);
   
    //and check reader stream
    reader->document(1, &doc);
    f = doc.getField(_T("readerField"));
    _ts = f->stringValue();
    CLUCENE_ASSERT(_tcscmp(treaderString,_ts)==0);
    doc.clear();
--------------------------------
    reader->close();
    _CLDELETE(reader);
  }

     Above the first row of dashes, the test defines a document object
(Document), an analyzer (SimpleAnalyzer), an index-writer object
(IndexWriter), and a directory object (RAMDirectory).
     Between the first and second rows of dashes, the document object gains a
field through its add method. The field is named readerField and its value is
"a string reader field", taken from the treaderString variable. The index
writer then takes in the document object via its addDocument method; at this
point the index term and its value are stored in the directory object.
     Between the second and third rows of dashes, an index-reader object is
created from the directory. The reader's document method pulls the stored
values into the doc object; doc then extracts the field into the field object
f via getField, looked up by the index name "readerField"; finally f returns
its value into the string _ts through its stringValue method.

So the idea behind full-text retrieval, put simply, is:
1. Build a field from an index name and a value.
2. Add the field to a document, hand the document to the index writer, and
   the data ends up in the directory.
3. Create an index reader from the directory.
4. The reader extracts the document, the document yields a field by index
   name, and the field yields its value. The search is complete.

So retrieval is a two-phase affair: storing values into the directory, and
pulling values back out of it.

   The idea is very simple; the crux lies in how the index terms and their
values are stored -- that is, in the algorithm that adds a document to the
index. To study it further, let's look at addDocument(), the function through
which the index-writer object takes in a document:

(/index/IndexWriter.cpp)
  void IndexWriter::addDocument(Document* doc, Analyzer* analyzer) {
  //Func - Adds a document to the index
  //Pre  - doc contains a valid reference to a document
  //       ramDirectory != NULL
  //Post - The document has been added to the index of this IndexWriter
 CND_PRECONDITION(ramDirectory != NULL,"ramDirectory is NULL");

 if ( analyzer == NULL )
  analyzer = this->analyzer;

 ramDirectory->transStart();
 try {
  char* segmentName = newSegmentName();
  CND_CONDITION(segmentName != NULL, "segmentName is NULL");
  try {
   //Create the DocumentWriter using a ramDirectory and analyzer
   // supplied by the IndexWriter (this).
   DocumentWriter* dw = _CLNEW DocumentWriter(
    ramDirectory, analyzer, this );
   CND_CONDITION(dw != NULL, "dw is NULL");
   try {
    //Add the client-supplied document to the new segment.
-------------------------------------------------------------------------------------------------------------------------------------
    dw->addDocument(segmentName, doc);
----------------------------------------------------------------------------------------------------------------------------------------
   } _CLFINALLY(
    _CLDELETE(dw);
   );
   //... (the rest of the function is truncated in the original post)
   The logic of this function is also simple:
1. If the analyzer argument is NULL, fall back to the index writer's own
   analyzer.
2. The directory object calls its transStart function to mark the start of
   extracting index terms and values from the document.
3. Create a new segment name.
4. With a non-NULL segment name, create a document-writer object
   (DocumentWriter), passing the directory, the analyzer, and the index
   writer itself to its constructor.
5. The document writer takes in the document via its addDocument function.

  The key here is step 5, so let's look at its concrete implementation
(index/DocumentWriter.cpp):
void DocumentWriter::addDocument(const char* segment, Document* doc) {
    CND_PRECONDITION(fieldInfos==NULL, "fieldInfos!=NULL")

 // write field names
 fieldInfos = _CLNEW FieldInfos();
 fieldInfos->add(doc);
 
 const char* buf = Misc::segmentname(segment, ".fnm");
 fieldInfos->write(directory, buf);
 _CLDELETE_CaARRAY(buf);

 // write field values
 FieldsWriter fieldsWriter(directory, segment, fieldInfos);
 try {
  fieldsWriter.addDocument(doc);
 } _CLFINALLY( fieldsWriter.close() );
     
 // invert doc into postingTable
 clearPostingTable();     // clear postingTable
 
 size_t size = fieldInfos->size();
 fieldLengths = _CL_NEWARRAY(int32_t,size); // init fieldLengths
 fieldPositions = _CL_NEWARRAY(int32_t,size);  // init fieldPositions
 fieldOffsets = _CL_NEWARRAY(int32_t,size);    // init fieldOffsets
 memset(fieldPositions, 0, sizeof(int32_t) * size);
     
 //initialise fieldBoost array with default boost
 int32_t fbl = fieldInfos->size();
 float_t fbd = doc->getBoost();
 fieldBoosts = _CL_NEWARRAY(float_t,fbl);   // init fieldBoosts
 { //msvc6 scope fix
  for ( int32_t i=0;i<fbl;i++ )
   fieldBoosts[i] = fbd;
 }

 { //msvc6 scope fix
  for ( int32_t i=0;i<fieldInfos->size();i++ )
   fieldLengths[i] = 0;
 } //msvc6 scope fix

--------------------------------------------------------------------------------------------------------------------
 invertDocument(doc);
-------------------------------------------------------------------------------------------------------------------

 // sort postingTable into an array
 Posting** postings = NULL;
 int32_t postingsLength = 0;
 sortPostingTable(postings,postingsLength);

 //DEBUG:
 


 // write postings
 writePostings(postings,postingsLength, segment);

 // write norms of indexed fields
 writeNorms(segment);
 _CLDELETE_ARRAY( postings );
}

      The logic of this function:
1. Create a field-infos object (FieldInfos).
2. The field-infos object takes in the document via its add function.
3. Define a pointer to a buffer holding the segment file name with the .fnm
   extension (I haven't fully worked out the details here, but that's
   roughly the idea).
4. The field-infos object's write function writes the field names out to
   that file in the directory.
5. Create a fields-writer object (FieldsWriter) from the directory, segment,
   and field infos.
6. The fields writer takes in the document via its addDocument method.
7. Next come the field-info size, the offsets, and some boost handling; the
   next few lines I haven't figured out yet, ^_^
8. Then comes an important operation: the document is run through
   invertDocument.

  Step 8 is the key operation here.
Let's look at its code
(index/DocumentWriter.cpp)
void DocumentWriter::invertDocument(const Document* doc) {
 DocumentFieldEnumeration* fields = doc->fields();
 try {
  while (fields->hasMoreElements()) {
   Field* field = (Field*)fields->nextElement();
   const TCHAR* fieldName = field->name();
      const int32_t fieldNumber = fieldInfos->fieldNumber(fieldName);
     
      int32_t length = fieldLengths[fieldNumber];     // length of field
      int32_t position = fieldPositions[fieldNumber]; // position in field
      if (length>0)
       position+=analyzer->getPositionIncrementGap(fieldName);
     int32_t offset = fieldOffsets[fieldNumber];       // offset field
  
      if (field->isIndexed()) {
       if (!field->isTokenized()) { // un-tokenized field
     //FEATURE: this is bug in java: if using a Reader, then
     //field value will not be added. With CLucene, an untokenized
     //field with a reader will still be added (if it isn't stored,
     //because if it's stored, then the reader has already been read.
     const TCHAR* charBuf = NULL;
     int64_t dataLen = 0;

     if (field->stringValue() == NULL && !field->isStored() ) {
      CL_NS(util)::Reader* r = field->readerValue();
      // this call tries to read the entire stream
      // this may invalidate the string for the further calls
      // it may be better to do this via a FilterReader
      // TODO make a better implementation of this
      dataLen = r->read(charBuf, LUCENE_INT32_MAX_SHOULDBE);
      if (dataLen == -1)
       dataLen = 0;
      //todo: would be better to pass the string length, in case
      //a null char is passed, but then would need to test the output too.
     } else {
      charBuf = field->stringValue();
      dataLen = _tcslen(charBuf);
     }
     
     if(field->isStoreOffsetWithTermVector()){
      TermVectorOffsetInfo tio;
      tio.setStartOffset(offset);
      tio.setEndOffset(offset + dataLen);
      addPosition(fieldName, charBuf, position++, &tio );
     }else
      addPosition(fieldName, charBuf, position++, NULL);
     offset += dataLen;
     length++;
       } else { // field must be tokenized
           CL_NS(util)::Reader* reader; // find or make Reader
           bool delReader = false;
           if (field->readerValue() != NULL) {
             reader = field->readerValue();
           } else if (field->stringValue() != NULL) {
             reader = _CLNEW CL_NS(util)::StringReader(field->stringValue(),_tcslen(field->stringValue()),false);
             delReader = true;
           } else {
             _CLTHROWA(CL_ERR_IO,"field must have either String or Reader value");
           }
   
           try {
             // Tokenize field and add to postingTable.
             CL_NS(analysis)::TokenStream* stream = analyzer->tokenStream(fieldName, reader);
   
             try {
               CL_NS(analysis)::Token t;
               int32_t lastTokenEndOffset = -1;
               while (stream->next(&t)) {
                   position += (t.getPositionIncrement() - 1);
                  
                   if(field->isStoreOffsetWithTermVector()){
                    TermVectorOffsetInfo tio;
                    tio.setStartOffset(offset + t.startOffset());
                    tio.setEndOffset(offset + t.endOffset());
        addPosition(fieldName, t.termText(), position++, &tio);
       }else
        addPosition(fieldName, t.termText(), position++, NULL);
       
       lastTokenEndOffset = t.endOffset();
                   length++;
                   // Apply field truncation policy.
       if (maxFieldLength != IndexWriter::FIELD_TRUNC_POLICY__WARN) {
                     // The client programmer has explicitly authorized us to
                     // truncate the token stream after maxFieldLength tokens.
                     if ( length > maxFieldLength) {
                       break;
                     }
       } else if (length > IndexWriter::DEFAULT_MAX_FIELD_LENGTH) {
                     const TCHAR* errMsgBase =
                        _T("Indexing a huge number of tokens from a single")
                        _T(" field (\"%s\", in this case) can cause CLucene")
                        _T(" to use memory excessively.")
                        _T("  By default, CLucene will accept only %s tokens")
                        _T(" from a single field before forcing the")
                        _T(" client programmer to specify a threshold at")
                        _T(" which to truncate the token stream.")
                        _T("  You should set this threshold via")
                        _T(" IndexReader::maxFieldLength (set to LUCENE_INT32_MAX")
                        _T(" to disable truncation, or a value to specify maximum number of fields).");
                    
                     TCHAR defaultMaxAsChar[34];
                     _i64tot(IndexWriter::DEFAULT_MAX_FIELD_LENGTH,
                         defaultMaxAsChar, 10
                       );
                int32_t errMsgLen = _tcslen(errMsgBase)
                         + _tcslen(fieldName)
                         + _tcslen(defaultMaxAsChar);
                     TCHAR* errMsg = _CL_NEWARRAY(TCHAR,errMsgLen+1);
   
                     _sntprintf(errMsg, errMsgLen,errMsgBase, fieldName, defaultMaxAsChar);
   
         _CLTHROWT_DEL(CL_ERR_Runtime,errMsg);
                   }
               } // while token->next
      
      if(lastTokenEndOffset != -1 )
                  offset += lastTokenEndOffset + 1;
             } _CLFINALLY (
               stream->close();
               _CLDELETE(stream);
             );
           } _CLFINALLY (
             if (delReader) {
               _CLDELETE(reader);
             }
           );
       } // if/else field is to be tokenized
    fieldLengths[fieldNumber] = length; // save field length
    fieldPositions[fieldNumber] = position;   // save field position
    fieldBoosts[fieldNumber] *= field->getBoost();
    fieldOffsets[fieldNumber] = offset;
   } // if field is to be indexed
  } // while more fields available
 } _CLFINALLY (
   _CLDELETE(fields);
 );
} // DocumentWriter::invertDocument

   This function is almost entirely basic data-structure manipulation; it
calls little else. It works on values and elementary data structures, and it
is the core of the index transformation. Its logic:
1. The document object calls its fields function to obtain a field
   enumeration pointer (DocumentFieldEnumeration).
2. The enumeration pointer calls hasMoreElements to check whether the end of
   the elements has been reached.
3. If not, the enumeration pointer calls nextElement to obtain a field
   pointer.
4. What follows is the detailed processing of each field and its internal
   storage.

Reposted from http://blog.sina.com.cn/s/blog_625cce080100g0ag.html?retcode=0
