Indexing Algorithms (Lucene's Implementation)

Lucene is a solid full-text retrieval engine. It implements tokenization,
indexing, and querying -- the core algorithms of a search engine. Once
compiled it is a library whose interfaces anyone can use free of charge.
The code (CLucene, the C++ port) is available at
http://sourceforge.net/projects/clucene/

Let's walk through the basic flow of a retrieval engine. Below is the
Document test program from the Lucene sources; it shows the basic pattern
for using the Lucene library
(excerpted from /test/document/testDocument.cpp):
void TestBinaryDocument(CuTest *tc){
    char factbook[1024];
    strcpy(factbook, clucene_data_location);
    strcat(factbook, "/reuters-21578/feldman-cia-worldfactbook-data.txt");
    CuAssert(tc,_T("Factbook file does not exist"),Misc::dir_Exists(factbook));

    Document doc;
    Field* f;
    const char* _as;
    const char* _as2;
    const TCHAR* _ts;
    jstreams::StreamBase<char>* strm;
    RAMDirectory ram;

    const char* areaderString = "a string reader field";
    const TCHAR* treaderString = _T("a string reader field");
    int readerStringLen = strlen(areaderString);

    SimpleAnalyzer an;
    IndexWriter writer(&ram,&an,true); //no analyzer needed since we are not indexing...
------------------------------------
    //use reader
    doc.add( *_CLNEW Field(_T("readerField"),_CLNEW StringReader (treaderString),
        Field::TERMVECTOR_NO | Field::STORE_YES | Field::INDEX_NO) );
    writer.addDocument(&doc);
    doc.clear();
---------------------------------
    IndexReader* reader = IndexReader::open(&ram);
   
    //and check reader stream
    reader->document(1, &doc);
    f = doc.getField(_T("readerField"));
    _ts = f->stringValue();
    CLUCENE_ASSERT(_tcscmp(treaderString,_ts)==0);
    doc.clear();
--------------------------------
    reader->close();
    _CLDELETE(reader);
  }

     Above the first row of dashes, the test defines a document object
(Document), an analyzer (SimpleAnalyzer), an index-writer object
(IndexWriter), and a directory object (RAMDirectory).
     Between the first and second rows of dashes, the document object gains a
field through its add method. The field is named readerField and its value is
"a string reader field", taken from the treaderString variable. The index
writer then takes in the document object via its addDocument method; at this
point the index term and its value are stored in the directory object.
     Between the second and third rows of dashes, an index-reader object is
created from the directory. The reader's document method pulls the stored
values into the doc object; doc then extracts the field into the field object
f via getField, looked up by the index name "readerField"; finally f returns
its value into the string _ts through its stringValue method.

So the idea behind full-text retrieval, put simply, is:
1. Build a field from an index name and a value.
2. Add the field to a document, hand the document to the index writer, and
   the data ends up in the directory.
3. Create an index reader from the directory.
4. The reader extracts the document, the document yields a field by index
   name, and the field yields its value. The search is complete.

So retrieval is a two-phase affair: storing values into the directory, and
pulling values back out of it.

   The idea is very simple; the crux lies in how the index terms and their
values are stored -- that is, in the algorithm that adds a document to the
index. To study it further, let's look at addDocument(), the function through
which the index-writer object takes in a document:

(/index/IndexWriter.cpp)
  void IndexWriter::addDocument(Document* doc, Analyzer* analyzer) {
  //Func - Adds a document to the index
  //Pre  - doc contains a valid reference to a document
  //       ramDirectory != NULL
  //Post - The document has been added to the index of this IndexWriter
 CND_PRECONDITION(ramDirectory != NULL,"ramDirectory is NULL");

 if ( analyzer == NULL )
  analyzer = this->analyzer;

 ramDirectory->transStart();
 try {
  char* segmentName = newSegmentName();
  CND_CONDITION(segmentName != NULL, "segmentName is NULL");
  try {
   //Create the DocumentWriter using a ramDirectory and analyzer
   // supplied by the IndexWriter (this).
   DocumentWriter* dw = _CLNEW DocumentWriter(
    ramDirectory, analyzer, this );
   CND_CONDITION(dw != NULL, "dw is NULL");
   try {
    //Add the client-supplied document to the new segment.
-------------------------------------------------------------------------------------------------------------------------------------
    dw->addDocument(segmentName, doc);
----------------------------------------------------------------------------------------------------------------------------------------
   } _CLFINALLY(
    _CLDELETE(dw);
   );
   //... (the rest of the function is truncated in the original post)
   The logic of this function is also simple:
1. If the analyzer argument is NULL, fall back to the index writer's own
   analyzer.
2. The directory object calls its transStart function to mark the start of
   extracting index terms and values from the document.
3. Create a new segment name.
4. With a non-NULL segment name, create a document-writer object
   (DocumentWriter), passing the directory, the analyzer, and the index
   writer itself to its constructor.
5. The document writer takes in the document via its addDocument function.

  The key here is step 5, so let's look at its concrete implementation
(index/DocumentWriter.cpp):
void DocumentWriter::addDocument(const char* segment, Document* doc) {
    CND_PRECONDITION(fieldInfos==NULL, "fieldInfos!=NULL")

 // write field names
 fieldInfos = _CLNEW FieldInfos();
 fieldInfos->add(doc);
 
 const char* buf = Misc::segmentname(segment, ".fnm");
 fieldInfos->write(directory, buf);
 _CLDELETE_CaARRAY(buf);

 // write field values
 FieldsWriter fieldsWriter(directory, segment, fieldInfos);
 try {
  fieldsWriter.addDocument(doc);
 } _CLFINALLY( fieldsWriter.close() );
     
 // invert doc into postingTable
 clearPostingTable();     // clear postingTable
 
 size_t size = fieldInfos->size();
 fieldLengths = _CL_NEWARRAY(int32_t,size); // init fieldLengths
 fieldPositions = _CL_NEWARRAY(int32_t,size);  // init fieldPositions
 fieldOffsets = _CL_NEWARRAY(int32_t,size);    // init fieldOffsets
 memset(fieldPositions, 0, sizeof(int32_t) * size);
     
 //initialise fieldBoost array with default boost
 int32_t fbl = fieldInfos->size();
 float_t fbd = doc->getBoost();
 fieldBoosts = _CL_NEWARRAY(float_t,fbl);   // init fieldBoosts
 { //msvc6 scope fix
  for ( int32_t i=0;i<fbl;i++ )
   fieldBoosts[i] = fbd;
 }

 { //msvc6 scope fix
  for ( int32_t i=0;i<fieldInfos->size();i++ )
   fieldLengths[i] = 0;
 } //msvc6 scope fix

--------------------------------------------------------------------------------------------------------------------
 invertDocument(doc);
-------------------------------------------------------------------------------------------------------------------

 // sort postingTable into an array
 Posting** postings = NULL;
 int32_t postingsLength = 0;
 sortPostingTable(postings,postingsLength);

 //DEBUG:
 


 // write postings
 writePostings(postings,postingsLength, segment);

 // write norms of indexed fields
 writeNorms(segment);
 _CLDELETE_ARRAY( postings );
}

      The logic of this function:
1. Create a field-infos object (FieldInfos).
2. The field-infos object takes in the document via its add function.
3. Define a pointer to a buffer holding the segment file name with the .fnm
   extension (I haven't fully worked out the details here, but that's
   roughly the idea).
4. The field-infos object's write function writes the field names out to
   that file in the directory.
5. Create a fields-writer object (FieldsWriter) from the directory, segment,
   and field infos.
6. The fields writer takes in the document via its addDocument method.
7. Next come the field-info size, the offsets, and some boost handling; the
   next few lines I haven't figured out yet, ^_^
8. Then comes an important operation: the document is run through
   invertDocument.

  Step 8 is the key operation here.
Let's look at its code
(index/DocumentWriter.cpp)
void DocumentWriter::invertDocument(const Document* doc) {
 DocumentFieldEnumeration* fields = doc->fields();
 try {
  while (fields->hasMoreElements()) {
   Field* field = (Field*)fields->nextElement();
   const TCHAR* fieldName = field->name();
      const int32_t fieldNumber = fieldInfos->fieldNumber(fieldName);
     
      int32_t length = fieldLengths[fieldNumber];     // length of field
      int32_t position = fieldPositions[fieldNumber]; // position in field
      if (length>0)
       position+=analyzer->getPositionIncrementGap(fieldName);
     int32_t offset = fieldOffsets[fieldNumber];       // offset field
  
      if (field->isIndexed()) {
       if (!field->isTokenized()) { // un-tokenized field
     //FEATURE: this is bug in java: if using a Reader, then
     //field value will not be added. With CLucene, an untokenized
     //field with a reader will still be added (if it isn't stored,
     //because if it's stored, then the reader has already been read.
     const TCHAR* charBuf = NULL;
     int64_t dataLen = 0;

     if (field->stringValue() == NULL && !field->isStored() ) {
      CL_NS(util)::Reader* r = field->readerValue();
      // this call tries to read the entire stream
      // this may invalidate the string for the further calls
      // it may be better to do this via a FilterReader
      // TODO make a better implementation of this
      dataLen = r->read(charBuf, LUCENE_INT32_MAX_SHOULDBE);
      if (dataLen == -1)
       dataLen = 0;
      //todo: would be better to pass the string length, in case
      //a null char is passed, but then would need to test the output too.
     } else {
      charBuf = field->stringValue();
      dataLen = _tcslen(charBuf);
     }
     
     if(field->isStoreOffsetWithTermVector()){
      TermVectorOffsetInfo tio;
      tio.setStartOffset(offset);
      tio.setEndOffset(offset + dataLen);
      addPosition(fieldName, charBuf, position++, &tio );
     }else
      addPosition(fieldName, charBuf, position++, NULL);
     offset += dataLen;
     length++;
       } else { // field must be tokenized
           CL_NS(util)::Reader* reader; // find or make Reader
           bool delReader = false;
           if (field->readerValue() != NULL) {
             reader = field->readerValue();
           } else if (field->stringValue() != NULL) {
             reader = _CLNEW CL_NS(util)::StringReader(field->stringValue(),_tcslen(field->stringValue()),false);
             delReader = true;
           } else {
             _CLTHROWA(CL_ERR_IO,"field must have either String or Reader value");
           }
   
           try {
             // Tokenize field and add to postingTable.
             CL_NS(analysis)::TokenStream* stream = analyzer->tokenStream(fieldName, reader);
   
             try {
               CL_NS(analysis)::Token t;
               int32_t lastTokenEndOffset = -1;
               while (stream->next(&t)) {
                   position += (t.getPositionIncrement() - 1);
                  
                   if(field->isStoreOffsetWithTermVector()){
                    TermVectorOffsetInfo tio;
                    tio.setStartOffset(offset + t.startOffset());
                    tio.setEndOffset(offset + t.endOffset());
        addPosition(fieldName, t.termText(), position++, &tio);
       }else
        addPosition(fieldName, t.termText(), position++, NULL);
       
       lastTokenEndOffset = t.endOffset();
                   length++;
                   // Apply field truncation policy.
       if (maxFieldLength != IndexWriter::FIELD_TRUNC_POLICY__WARN) {
                     // The client programmer has explicitly authorized us to
                     // truncate the token stream after maxFieldLength tokens.
                     if ( length > maxFieldLength) {
                       break;
                     }
       } else if (length > IndexWriter::DEFAULT_MAX_FIELD_LENGTH) {
                     const TCHAR* errMsgBase =
                        _T("Indexing a huge number of tokens from a single")
                        _T(" field (\"%s\", in this case) can cause CLucene")
                        _T(" to use memory excessively.")
                        _T("  By default, CLucene will accept only %s tokens")
                        _T(" from a single field before forcing the")
                        _T(" client programmer to specify a threshold at")
                        _T(" which to truncate the token stream.")
                        _T("  You should set this threshold via")
                        _T(" IndexReader::maxFieldLength (set to LUCENE_INT32_MAX")
                        _T(" to disable truncation, or a value to specify maximum number of fields).");
                    
                     TCHAR defaultMaxAsChar[34];
                     _i64tot(IndexWriter::DEFAULT_MAX_FIELD_LENGTH,
                         defaultMaxAsChar, 10
                       );
                int32_t errMsgLen = _tcslen(errMsgBase)
                         + _tcslen(fieldName)
                         + _tcslen(defaultMaxAsChar);
                     TCHAR* errMsg = _CL_NEWARRAY(TCHAR,errMsgLen+1);
   
                     _sntprintf(errMsg, errMsgLen,errMsgBase, fieldName, defaultMaxAsChar);
   
         _CLTHROWT_DEL(CL_ERR_Runtime,errMsg);
                   }
               } // while token->next
      
      if(lastTokenEndOffset != -1 )
                  offset += lastTokenEndOffset + 1;
             } _CLFINALLY (
               stream->close();
               _CLDELETE(stream);
             );
           } _CLFINALLY (
             if (delReader) {
               _CLDELETE(reader);
             }
           );
       } // if/else field is to be tokenized
    fieldLengths[fieldNumber] = length; // save field length
    fieldPositions[fieldNumber] = position;   // save field position
    fieldBoosts[fieldNumber] *= field->getBoost();
    fieldOffsets[fieldNumber] = offset;
   } // if field is to be indexed
  } // while more fields available
 } _CLFINALLY (
   _CLDELETE(fields);
 );
} // DocumentWriter::invertDocument

   This function is almost entirely basic data-structure manipulation; it
calls little else. It works on values and elementary data structures, and it
is the core of the index transformation. Its logic:
1. The document object calls its fields function to obtain a field
   enumeration pointer (DocumentFieldEnumeration).
2. The enumeration pointer calls hasMoreElements to check whether the end of
   the elements has been reached.
3. If not, the enumeration pointer calls nextElement to obtain a field
   pointer.
4. What follows is the detailed processing of each field and its internal
   storage.

Reposted from http://blog.sina.com.cn/s/blog_625cce080100g0ag.html?retcode=0
