Lucene is a solid full-text search engine library: it implements tokenization, indexing, and querying, the core algorithms of any search engine. Once compiled it is a library whose interfaces anyone may use free of charge. The code (CLucene, the C++ port) can be downloaded from http://sourceforge.net/projects/clucene/.
Let us walk through the basic flow of a retrieval engine. Below is a Document test program from the CLucene sources; it shows the basic pattern for using the library (taken from /test/document/testDocument.cpp):
void TestBinaryDocument(CuTest *tc){
    char factbook[1024];
    strcpy(factbook, clucene_data_location);
    strcat(factbook, "/reuters-21578/feldman-cia-worldfactbook-data.txt");
    CuAssert(tc, _T("Factbook file does not exist"), Misc::dir_Exists(factbook));

    Document doc;
    Field* f;
    const char* _as;
    const char* _as2;
    const TCHAR* _ts;
    jstreams::StreamBase<char>* strm;
    RAMDirectory ram;
    const char* areaderString = "a string reader field";
    const TCHAR* treaderString = _T("a string reader field");
    int readerStringLen = strlen(areaderString);

    SimpleAnalyzer an;
    IndexWriter writer(&ram, &an, true); //no analyzer needed since we are not indexing...
    ------------------------------------
    //use reader
    doc.add( *_CLNEW Field(_T("readerField"), _CLNEW StringReader(treaderString),
        Field::TERMVECTOR_NO | Field::STORE_YES | Field::INDEX_NO) );
    writer.addDocument(&doc);
    doc.clear();
    ------------------------------------
    IndexReader* reader = IndexReader::open(&ram);
    //and check reader stream
    reader->document(1, &doc);
    f = doc.getField(_T("readerField"));
    _ts = f->stringValue();
    CLUCENE_ASSERT(_tcscmp(treaderString, _ts) == 0);
    doc.clear();
    ------------------------------------
    reader->close();
    _CLDELETE(reader);
}
Above the first divider we define a document object (Document), an analyzer (SimpleAnalyzer), an index writer (IndexWriter), and a directory object (RAMDirectory).

Between the first and second dividers, the document gains a field through its add method. The field is named readerField and its value is the string "a string reader field" held in treaderString. The index writer then takes the document in through its addDocument method; at that point the index entry and its value live inside the directory object.

After the second divider, an index reader is created from the directory. The reader's document method loads the stored document into the doc object; doc's getField method then looks up the field named "readerField" and returns it in the field pointer f; finally, f's stringValue method returns the field's value into the string _ts.
So, put simply, the idea of full-text retrieval is this:
1. Build a field from an index name and a value.
2. Add the field to a document, hand the document to the index writer, and the writer stores it in the directory.
3. Create an index reader from the directory.
4. The reader fetches the document, the document yields the field by name, and the field yields its value. The search is complete.

Retrieval therefore has two phases: storing values into the directory, and fetching values back out of it.
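Those two phases can be sketched with a toy in-memory "directory". All names here (ToyDirectory, ToyWriter, ToyReader) are hypothetical stand-ins, not CLucene's API; the real calls are shown in the test above.

```cpp
#include <cassert>
#include <map>
#include <string>

// Toy two-phase retrieval: phase one stores field values into a
// "directory", phase two reads them back by field name. This mirrors
// the Document/IndexWriter/IndexReader round trip, greatly simplified.
struct ToyDirectory {
    std::map<std::string, std::string> stored; // fieldName -> value
};

struct ToyWriter {
    ToyDirectory& dir;
    // like IndexWriter::addDocument: persist the field into the directory
    void addField(const std::string& name, const std::string& value) {
        dir.stored[name] = value;
    }
};

struct ToyReader {
    const ToyDirectory& dir;
    // like IndexReader::document + Document::getField + Field::stringValue
    std::string stringValue(const std::string& name) const {
        return dir.stored.at(name);
    }
};
```

The round trip then looks like: write "readerField" through a ToyWriter, open a ToyReader on the same directory, and read the same value back.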
The idea is simple; the crux lies in how the index terms and their values are stored, that is, in the algorithm that adds a document to the index. To study that algorithm further, let us look at the index writer's function addDocument():
(/index/IndexWriter.cpp)
void IndexWriter::addDocument(Document* doc, Analyzer* analyzer) {
//Func - Adds a document to the index
//Pre  - doc contains a valid reference to a document
//       ramDirectory != NULL
//Post - The document has been added to the index of this IndexWriter
    CND_PRECONDITION(ramDirectory != NULL, "ramDirectory is NULL");

    if (analyzer == NULL)
        analyzer = this->analyzer;

    ramDirectory->transStart();
    try {
        char* segmentName = newSegmentName();
        CND_CONDITION(segmentName != NULL, "segmentName is NULL");
        try {
            //Create the DocumentWriter using a ramDirectory and analyzer
            // supplied by the IndexWriter (this).
            DocumentWriter* dw = _CLNEW DocumentWriter(
                ramDirectory, analyzer, this);
            CND_CONDITION(dw != NULL, "dw is NULL");
            try {
                //Add the client-supplied document to the new segment.
                ----------------------------------------------------------------
                dw->addDocument(segmentName, doc);
                ----------------------------------------------------------------
            } _CLFINALLY(
                _CLDELETE(dw);
            );
            //... (remainder of the function omitted)
The flow of this function is also simple:
1. If the analyzer argument is NULL, fall back to the index writer's own analyzer.
2. The directory object calls transStart to mark the start of extracting index terms and values from the document.
3. A new segment name is created.
4. Given a valid segment name, a document writer (DocumentWriter) is constructed, taking the directory, the analyzer, and the index writer as parameters.
5. The document writer adds the document through its addDocument function.
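The transactional bracketing in step 2 can be sketched as follows. All names here (TransDirectory, addDocumentLike, the file suffixes) are hypothetical illustrations of the pattern, not CLucene's actual transaction API; the point is only that a failed document write leaves the directory unchanged.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>
#include <vector>

// A toy directory supporting the transStart/commit/abort pattern:
// writes made inside a transaction become visible atomically on
// commit and are discarded on failure.
class TransDirectory {
    std::vector<std::string> committed_;  // durable contents
    std::vector<std::string> pending_;    // writes inside the transaction
    bool inTrans_ = false;
public:
    void transStart()  { inTrans_ = true; pending_.clear(); }
    void write(const std::string& name) {
        if (inTrans_) pending_.push_back(name);
        else committed_.push_back(name);
    }
    void transCommit() {
        committed_.insert(committed_.end(), pending_.begin(), pending_.end());
        pending_.clear(); inTrans_ = false;
    }
    void transAbort() { pending_.clear(); inTrans_ = false; }
    size_t size() const { return committed_.size(); }
};

// Mirrors the shape of IndexWriter::addDocument: bracket the segment
// writes in a transaction so a failed document leaves the index unchanged.
void addDocumentLike(TransDirectory& dir, const std::string& segment, bool fail) {
    dir.transStart();
    try {
        dir.write(segment + ".fnm");  // field names
        if (fail) throw std::runtime_error("simulated write failure");
        dir.write(segment + ".fdt");  // field values
        dir.transCommit();
    } catch (...) {
        dir.transAbort();
        throw;
    }
}
```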
Step 5 is the key here, so let us look at its concrete implementation
(index/DocumentWriter.cpp)
void DocumentWriter::addDocument(const char* segment, Document* doc) {
    CND_PRECONDITION(fieldInfos == NULL, "fieldInfos!=NULL")

    // write field names
    fieldInfos = _CLNEW FieldInfos();
    fieldInfos->add(doc);
    const char* buf = Misc::segmentname(segment, ".fnm");
    fieldInfos->write(directory, buf);
    _CLDELETE_CaARRAY(buf);

    // write field values
    FieldsWriter fieldsWriter(directory, segment, fieldInfos);
    try {
        fieldsWriter.addDocument(doc);
    } _CLFINALLY( fieldsWriter.close() );

    // invert doc into postingTable
    clearPostingTable(); // clear postingTable
    size_t size = fieldInfos->size();
    fieldLengths   = _CL_NEWARRAY(int32_t, size); // init fieldLengths
    fieldPositions = _CL_NEWARRAY(int32_t, size); // init fieldPositions
    fieldOffsets   = _CL_NEWARRAY(int32_t, size); // init fieldOffsets
    memset(fieldPositions, 0, sizeof(int32_t) * size);

    //initialise fieldBoost array with default boost
    int32_t fbl = fieldInfos->size();
    float_t fbd = doc->getBoost();
    fieldBoosts = _CL_NEWARRAY(float_t, fbl); // init fieldBoosts
    { //msvc6 scope fix
        for (int32_t i = 0; i < fbl; i++)
            fieldBoosts[i] = fbd;
    }
    { //msvc6 scope fix
        for (int32_t i = 0; i < fieldInfos->size(); i++)
            fieldLengths[i] = 0;
    }

    --------------------------------------------------------------------
    invertDocument(doc);
    --------------------------------------------------------------------

    // sort postingTable into an array
    Posting** postings = NULL;
    int32_t postingsLength = 0;
    sortPostingTable(postings, postingsLength);

    // write postings
    writePostings(postings, postingsLength, segment);

    // write norms of indexed fields
    writeNorms(segment);

    _CLDELETE_ARRAY(postings);
}
The flow of this function:
1. Create a field infos object (FieldInfos).
2. The field infos object registers the document's fields through its add function.
3. Build a buffer holding the name of the segment's field-name file, i.e. the segment name plus the ".fnm" extension.
4. The field infos object's write function writes the field names to that file in the directory.
5. A fields writer (FieldsWriter) is created from the directory, the segment, and the field infos.
6. The fields writer stores the document's field values through its addDocument method.
7. The per-field bookkeeping arrays are then prepared: fieldLengths, fieldPositions, and fieldOffsets are allocated (lengths and positions zeroed), and fieldBoosts is filled with the document's boost value.
8. Finally comes the crucial operation: the document is run through invertDocument.

Step 8 is the key; let us look at its code.
(index/DocumentWriter.cpp)
void DocumentWriter::invertDocument(const Document* doc) {
    DocumentFieldEnumeration* fields = doc->fields();
    try {
        while (fields->hasMoreElements()) {
            Field* field = (Field*)fields->nextElement();
            const TCHAR* fieldName = field->name();
            const int32_t fieldNumber = fieldInfos->fieldNumber(fieldName);

            int32_t length = fieldLengths[fieldNumber];     // length of field
            int32_t position = fieldPositions[fieldNumber]; // position in field
            if (length > 0)
                position += analyzer->getPositionIncrementGap(fieldName);
            int32_t offset = fieldOffsets[fieldNumber];     // offset field

            if (field->isIndexed()) {
                if (!field->isTokenized()) { // un-tokenized field
                    //FEATURE: this is bug in java: if using a Reader, then
                    //field value will not be added. With CLucene, an untokenized
                    //field with a reader will still be added (if it isn't stored,
                    //because if it's stored, then the reader has already been read.
                    const TCHAR* charBuf = NULL;
                    int64_t dataLen = 0;
                    if (field->stringValue() == NULL && !field->isStored()) {
                        CL_NS(util)::Reader* r = field->readerValue();
                        // this call tries to read the entire stream
                        // this may invalidate the string for the further calls
                        // it may be better to do this via a FilterReader
                        // TODO make a better implementation of this
                        dataLen = r->read(charBuf, LUCENE_INT32_MAX_SHOULDBE);
                        if (dataLen == -1)
                            dataLen = 0;
                        //todo: would be better to pass the string length, in case
                        //a null char is passed, but then would need to test the output too.
                    } else {
                        charBuf = field->stringValue();
                        dataLen = _tcslen(charBuf);
                    }

                    if (field->isStoreOffsetWithTermVector()) {
                        TermVectorOffsetInfo tio;
                        tio.setStartOffset(offset);
                        tio.setEndOffset(offset + dataLen);
                        addPosition(fieldName, charBuf, position++, &tio);
                    } else
                        addPosition(fieldName, charBuf, position++, NULL);

                    offset += dataLen;
                    length++;
                } else { // field must be tokenized
                    CL_NS(util)::Reader* reader; // find or make Reader
                    bool delReader = false;
                    if (field->readerValue() != NULL) {
                        reader = field->readerValue();
                    } else if (field->stringValue() != NULL) {
                        reader = _CLNEW CL_NS(util)::StringReader(field->stringValue(), _tcslen(field->stringValue()), false);
                        delReader = true;
                    } else {
                        _CLTHROWA(CL_ERR_IO, "field must have either String or Reader value");
                    }

                    try {
                        // Tokenize field and add to postingTable.
                        CL_NS(analysis)::TokenStream* stream = analyzer->tokenStream(fieldName, reader);
                        try {
                            CL_NS(analysis)::Token t;
                            int32_t lastTokenEndOffset = -1;
                            while (stream->next(&t)) {
                                position += (t.getPositionIncrement() - 1);

                                if (field->isStoreOffsetWithTermVector()) {
                                    TermVectorOffsetInfo tio;
                                    tio.setStartOffset(offset + t.startOffset());
                                    tio.setEndOffset(offset + t.endOffset());
                                    addPosition(fieldName, t.termText(), position++, &tio);
                                } else
                                    addPosition(fieldName, t.termText(), position++, NULL);

                                lastTokenEndOffset = t.endOffset();
                                length++;

                                // Apply field truncation policy.
                                if (maxFieldLength != IndexWriter::FIELD_TRUNC_POLICY__WARN) {
                                    // The client programmer has explicitly authorized us to
                                    // truncate the token stream after maxFieldLength tokens.
                                    if (length > maxFieldLength) {
                                        break;
                                    }
                                } else if (length > IndexWriter::DEFAULT_MAX_FIELD_LENGTH) {
                                    const TCHAR* errMsgBase =
                                        _T("Indexing a huge number of tokens from a single")
                                        _T(" field (\"%s\", in this case) can cause CLucene")
                                        _T(" to use memory excessively.")
                                        _T(" By default, CLucene will accept only %s tokens")
                                        _T(" from a single field before forcing the")
                                        _T(" client programmer to specify a threshold at")
                                        _T(" which to truncate the token stream.")
                                        _T(" You should set this threshold via")
                                        _T(" IndexReader::maxFieldLength (set to LUCENE_INT32_MAX")
                                        _T(" to disable truncation, or a value to specify maximum number of fields).");
                                    TCHAR defaultMaxAsChar[34];
                                    _i64tot(IndexWriter::DEFAULT_MAX_FIELD_LENGTH,
                                        defaultMaxAsChar, 10
                                    );
                                    int32_t errMsgLen = _tcslen(errMsgBase)
                                        + _tcslen(fieldName)
                                        + _tcslen(defaultMaxAsChar);
                                    TCHAR* errMsg = _CL_NEWARRAY(TCHAR, errMsgLen + 1);
                                    _sntprintf(errMsg, errMsgLen, errMsgBase, fieldName, defaultMaxAsChar);
                                    _CLTHROWT_DEL(CL_ERR_Runtime, errMsg);
                                }
                            } // while token->next

                            if (lastTokenEndOffset != -1)
                                offset += lastTokenEndOffset + 1;
                        } _CLFINALLY (
                            stream->close();
                            _CLDELETE(stream);
                        );
                    } _CLFINALLY (
                        if (delReader) {
                            _CLDELETE(reader);
                        }
                    );
                } // if/else field is to be tokenized

                fieldLengths[fieldNumber] = length;     // save field length
                fieldPositions[fieldNumber] = position; // save field position
                fieldBoosts[fieldNumber] *= field->getBoost();
                fieldOffsets[fieldNumber] = offset;
            } // if field is to be indexed
        } // while more fields available
    } _CLFINALLY (
        _CLDELETE(fields);
    );
} // DocumentWriter::invertDocument
This function is essentially straightforward data-structure work; it calls little else and operates directly on the values and the underlying structures. It is the core of the index transformation. Its flow:
1. The document's fields function returns a field enumeration pointer (DocumentFieldEnumeration).
2. The enumeration's hasMoreElements function checks whether any fields remain.
3. If so, the enumeration's nextElement function yields the next field pointer.
4. Then comes the detailed per-field computation: for each indexed field, every token is passed to addPosition with its position (and optionally its offsets), building up the in-memory posting table.
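The per-field inversion can be boiled down to a simplified model: tokenize the field's text, assign each token a running position, and record term-to-positions entries in a posting table. The names below (PostingTable, invertField) are illustrative only, and whitespace splitting stands in for the analyzer's token stream; CLucene's real posting table also tracks frequencies and offsets.

```cpp
#include <map>
#include <sstream>
#include <string>
#include <vector>

// Simplified posting table: term -> positions at which it occurs.
// This models what addPosition builds up inside DocumentWriter.
using PostingTable = std::map<std::string, std::vector<int>>;

// Invert one field's text: split on whitespace (a stand-in for the
// analyzer's token stream) and record each token's position.
int invertField(const std::string& text, int startPosition, PostingTable& table) {
    std::istringstream tokens(text);
    std::string term;
    int position = startPosition;
    while (tokens >> term) {
        table[term].push_back(position);
        ++position;
    }
    return position; // next free position, like fieldPositions[fieldNumber]
}
```

Inverting "to be or not to be" starting at position 0 yields postings such as "to" at positions {0, 4} and "be" at {1, 5}; the returned value 6 plays the role of the saved fieldPositions entry.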
Translated from http://blog.sina.com.cn/s/blog_625cce080100g0ag.html?retcode=0