Learning the Xapian Full-Text Search Engine (1): Indexing

sirdan

Article link: https://blog.csdn.net/sirdan/article/details/23679685

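Xapian's documentation is not extensive, but it is sufficient. The companion Omega project in particular is a treasure trove, both for using Xapian and for learning it.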

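First, two important concepts: the term list and the posting list.

A term list holds the terms that index one document. Every document has its own term list.

A posting list holds the IDs of the documents indexed by one term. Every term has its own posting list.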
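
The two lists are two views of the same relationship. The sketch below is a toy model using standard containers, not Xapian's storage format; `ToyIndex` and its members are names made up for illustration.

```cpp
#include <map>
#include <set>
#include <string>
#include <vector>

using DocId = unsigned;

// Toy index (hypothetical): keeps both views of the document/term relation.
struct ToyIndex {
    std::map<DocId, std::set<std::string>> term_lists;     // document -> terms
    std::map<std::string, std::set<DocId>> posting_lists;  // term -> documents

    // Record every term of a document in both directions.
    void add(DocId did, const std::vector<std::string>& terms) {
        for (const std::string& t : terms) {
            term_lists[did].insert(t);
            posting_lists[t].insert(did);
        }
    }
};
```

Looking up `posting_lists["xapian"]` answers "which documents contain this term", while `term_lists[1]` answers "which terms index document 1".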

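To use Xapian on Windows, I suggest downloading the MSVC makefiles from the official site and placing them in the source tree for Visual C++. After fixing a few errors the build goes through.

The official site does not provide Windows makefiles for Omega, however. I tried building it with Visual C++ and did not succeed. It did build on Ubuntu, where xapian-core and the libraries it depends on must be installed first.

Porting Omega is not worth much effort, so I decided to study Omega's code instead and see how Xapian is really meant to be used.

The two most important parts of Omega are omindex (indexing) and the query side (searching). The corresponding source files are omindex.cc and query.cc.

omindex supports a wide range of formats, including HTML, PDF, XML, Excel and CSV.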

The core indexing work in omindex breaks down roughly into the following steps:

1. Store the document's data:

// Put the data in the document
Xapian::Document newdocument;
string record = "url=";
record += url;
record += "\nsample=";
record += sample;
if (!title.empty()) {
    record += "\ncaption=";
    record += generate_sample(title, TITLE_SIZE);
}
if (!author.empty()) {
    record += "\nauthor=";
    record += author;
}
record += "\ntype=";
record += mimetype;
if (last_mod != (time_t)-1) {
    record += "\nmodtime=";
    record += str(last_mod);
}
record += "\nsize=";
record += str(d.get_size());
newdocument.set_data(record);

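The data holds a lot of information: the type, size, URL and so on are all stored together in a single string.

Note that the data is not suited to frequent access, because each read is relatively expensive. For information that has to be read frequently, Xapian recommends using values instead.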
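
Because the record is just `key=value` lines, whoever displays a search result has to split it apart again. A minimal sketch of that step, assuming the layout shown above; `parse_record` is a hypothetical helper, not Omega's own parsing code.

```cpp
#include <map>
#include <sstream>
#include <string>

// Split a record of "key=value" lines into a map (hypothetical helper).
std::map<std::string, std::string> parse_record(const std::string& record) {
    std::map<std::string, std::string> fields;
    std::istringstream in(record);
    std::string line;
    while (std::getline(in, line)) {
        // Split at the first '=' only, so values may contain '=' themselves.
        std::string::size_type eq = line.find('=');
        if (eq == std::string::npos) continue;  // ignore lines without '='
        fields[line.substr(0, eq)] = line.substr(eq + 1);
    }
    return fields;
}
```

With a real database the input string would come from `Xapian::Document::get_data()`.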

2. Next, index the title and the body text:

// Index the title, document text, and keywords.
indexer.set_document(newdocument);
if (!title.empty()) {
    indexer.index_text(title, 5);
    indexer.increase_termpos(100);
}
if (!dump.empty()) {
    indexer.index_text(dump);
}
if (!keywords.empty()) {
    indexer.increase_termpos(100);
    indexer.index_text(keywords);
}
// Index the leafname of the file.
{
    indexer.increase_termpos(100);
    string leaf = d.leafname();
    string::size_type dot = leaf.find_last_of('.');
    if (dot != string::npos)
        leaf.resize(dot);
    indexer.index_text(leaf);
}
if (!author.empty()) {
    indexer.increase_termpos(100);
    indexer.index_text(author, 1, "A");
}
// mimeType:
newdocument.add_boolean_term("T" + mimetype);

indexer is a Xapian::TermGenerator. Terms can be added to a document without TermGenerator, but using it is clearly more convenient, so I recommend it.

TermGenerator only adds probabilistic terms. Boolean terms have to be added on the document itself.

indexer.index_text(title, 5);

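In the statement above, title is the text to index and 5 is the wdf increment. The wdf (within-document frequency) is the number of times a term occurs in the document; passing 5 makes each occurrence in the title count as 5 instead of 1.
Giving title terms a larger wdf is useful because it influences the ranking of search results.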
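
To make the effect of the increment concrete, here is a toy model of the wdf bookkeeping only. It splits on whitespace and does none of the case folding or stemming that the real TermGenerator can apply; `index_text_model` is a made-up name.

```cpp
#include <map>
#include <sstream>
#include <string>

// Model of wdf accounting: each occurrence of a word adds wdf_inc to its wdf.
void index_text_model(std::map<std::string, unsigned>& wdf,
                      const std::string& text, unsigned wdf_inc = 1) {
    std::istringstream in(text);
    std::string word;
    while (in >> word) wdf[word] += wdf_inc;
}
```

Indexing the title "xapian index" with an increment of 5 and then a body that mentions "xapian" once leaves "xapian" with a wdf of 6, while a word that appears only once in the body has a wdf of 1.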

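Note that title must be UTF-8 encoded, otherwise it will not be interpreted correctly.

title may contain several terms, but they must be separated, for example by spaces. A run of text with no separators is stored in the document as a single term.

Also note that index_text records the position of each term it adds. If positions are not needed, use index_text_without_positions instead, which makes the database files smaller.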

indexer.increase_termpos(100);

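This call advances the term position by 100. If the title contains two terms, at positions 1 and 2, the terms of the body text indexed next start at position 103.
This prevents phrase or NEAR searches from mistakenly matching words across the boundary between the title and the body.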
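
The position arithmetic can be checked with a small model. This is a sketch of the counter only, under the assumption that each indexed term takes the next position; `PositionModel` is a hypothetical name and is not part of Xapian.

```cpp
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Model of the position counter kept while indexing text.
struct PositionModel {
    unsigned termpos = 0;  // position of the most recently indexed term
    std::vector<std::pair<std::string, unsigned>> postings;  // (term, position)

    // Each whitespace-separated word takes the next position.
    void index_text(const std::string& text) {
        std::istringstream in(text);
        std::string word;
        while (in >> word) postings.emplace_back(word, ++termpos);
    }

    // Leave a gap so phrases cannot span two separately indexed fields.
    void increase_termpos(unsigned delta = 100) { termpos += delta; }
};
```

Indexing a two-word title, calling `increase_termpos(100)` and then indexing the body gives the positions 1, 2, 103, 104, so a phrase search window would have to be wider than 100 positions to bridge the title and the body.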

indexer.index_text(keywords);

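This indexes the keywords. Many word segmentation algorithms can extract keywords, and keywords are very useful for clustering articles and finding similar content.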

indexer.index_text(leaf);

This indexes the file name, with the directory path and the extension removed.

indexer.index_text(author, 1, "A");

This indexes the author. The extra argument "A" is a prefix. Prefixes come up often in Xapian and play an important role.
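
A prefix keeps terms from different fields apart: the author "smith" becomes a different term from the word "smith" in the body. The sketch below models only that concatenation, with whitespace splitting and ASCII lower-casing as simplifying assumptions; `prefixed_terms` is a hypothetical helper.

```cpp
#include <cctype>
#include <sstream>
#include <string>
#include <vector>

// Build prefix + lower-cased word for each whitespace-separated word.
std::vector<std::string> prefixed_terms(const std::string& text,
                                        const std::string& prefix) {
    std::vector<std::string> terms;
    std::istringstream in(text);
    std::string word;
    while (in >> word) {
        for (char& ch : word)
            ch = static_cast<char>(std::tolower(static_cast<unsigned char>(ch)));
        terms.push_back(prefix + word);
    }
    return terms;
}
```

`prefixed_terms("John Smith", "A")` yields `Ajohn` and `Asmith`, so a search restricted to authors can look for `Asmith` without matching body text.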

newdocument.add_boolean_term("T" + mimetype);

The term added here is a boolean term, which is equivalent to adding a term with a wdf of 0.

// Add last_mod as a value to allow "sort by date".
newdocument.add_value(VALUE_LASTMOD, int_to_binary_string((uint32_t)last_mod));

This adds a value holding the document's last modification time. The value can be used to sort search results by date.
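
The timestamp is stored in binary form rather than as decimal text because values are compared as byte strings when sorting, and decimal strings of different lengths do not sort numerically ("9" sorts after "10"). The sketch below assumes a fixed-width, big-endian encoding, which is my understanding of what Omega's helper produces; `to_big_endian` is a stand-in name, so check the Omega source before relying on the details.

```cpp
#include <cstdint>
#include <string>

// Encode a 32-bit unsigned value as 4 bytes, most significant byte first.
// With this layout, byte-wise string order equals numeric order.
std::string to_big_endian(std::uint32_t v) {
    std::string s(4, '\0');
    s[0] = static_cast<char>((v >> 24) & 0xff);
    s[1] = static_cast<char>((v >> 16) & 0xff);
    s[2] = static_cast<char>((v >> 8) & 0xff);
    s[3] = static_cast<char>(v & 0xff);
    return s;
}
```

Comparing two encoded timestamps with `<` on `std::string` then gives the same answer as comparing the original integers.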

// Add MD5 as a value to allow duplicate documents to be collapsed together.
newdocument.add_value(VALUE_MD5, md5);

This adds another value holding the document's MD5 checksum, which can be used to remove duplicates.
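
Duplicate removal at search time works by collapsing results that share the same value. In Xapian this is requested with `Enquire::set_collapse_key()`; the sketch below only models the idea on a ranked list and is not how the matcher implements it. `collapse_by_key` is a hypothetical name.

```cpp
#include <set>
#include <string>
#include <utility>
#include <vector>

using DocId = unsigned;

// Keep only the first (highest ranked) result for each collapse key.
std::vector<DocId> collapse_by_key(
        const std::vector<std::pair<DocId, std::string>>& ranked) {
    std::set<std::string> seen;
    std::vector<DocId> kept;
    for (const auto& r : ranked) {
        // insert() reports whether the key was new.
        if (seen.insert(r.second).second) kept.push_back(r.first);
    }
    return kept;
}
```

Given results whose MD5 values are "aaa", "bbb", "aaa", "ccc" in rank order, the third result is dropped because it has the same checksum as the first.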

// Add the file size as a value to allow "sort by size" and size ranges.
newdocument.add_value(VALUE_SIZE, Xapian::sortable_serialise(d.get_size()));

This adds one more value holding the document's size, which can be used to sort by size or to restrict results to a size range.

bool inc_tag_added = false;
if (d.is_other_readable()) {
    inc_tag_added = true;
    newdocument.add_boolean_term("I*");
} else if (d.is_group_readable()) {
    const char * group = d.get_group();
    if (group) {
        newdocument.add_boolean_term(string("I#") + group);
    }
}
const char * owner = d.get_owner();
if (owner) {
    newdocument.add_boolean_term(string("O") + owner);
    if (!inc_tag_added && d.is_owner_readable())
        newdocument.add_boolean_term(string("I@") + owner);
}

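This adds access control. If the file is readable by everyone, the term "I*" is added; otherwise, if it is readable by its group, the term "I#" followed by the group name is added. The owner is always recorded as "O" followed by the owner name, and if the file is not readable by everyone but is readable by its owner, the term "I@" followed by the owner name is added as well.

At search time, these terms can be used to decide which documents the current user is allowed to retrieve.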
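
The branching is easier to follow as a pure function from file permissions to terms. This sketch mirrors the logic of the excerpt above; the `FileAccess` struct and `access_terms` are hypothetical stand-ins for the directory iterator `d` used by omindex.

```cpp
#include <string>
#include <vector>

// Hypothetical summary of the permission checks made on a file.
struct FileAccess {
    bool other_readable;
    bool group_readable;
    bool owner_readable;
    std::string group;  // empty if the group name is unknown
    std::string owner;  // empty if the owner name is unknown
};

// Produce the boolean access terms for one file.
std::vector<std::string> access_terms(const FileAccess& f) {
    std::vector<std::string> terms;
    bool inc_tag_added = false;
    if (f.other_readable) {
        inc_tag_added = true;
        terms.push_back("I*");  // readable by everyone
    } else if (f.group_readable && !f.group.empty()) {
        terms.push_back("I#" + f.group);  // readable by this group
    }
    if (!f.owner.empty()) {
        terms.push_back("O" + f.owner);  // owner is always recorded
        if (!inc_tag_added && f.owner_readable)
            terms.push_back("I@" + f.owner);  // readable by this owner
    }
    return terms;
}
```

A search front end could then require a match on "I*", on "I#" plus one of the user's groups, or on "I@" plus the user's name.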

string ext_term("E");
for (string::iterator i = ext.begin(); i != ext.end(); ++i) {
    char ch = *i;
    if (ch >= 'A' && ch <= 'Z')
        ch |= 32;
    ext_term += ch;
}
newdocument.add_boolean_term(ext_term);

This adds a term for the file extension. The term starts with "E", followed by the extension converted to lower case.
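
The `ch |= 32` in the excerpt is the usual ASCII trick for lower-casing a letter already known to be in the range A to Z. An equivalent sketch using `std::tolower` follows; `extension_term` is a hypothetical helper name, and the default "C" locale is assumed so that only ASCII letters are changed.

```cpp
#include <cctype>
#include <string>

// Build the extension term: "E" followed by the lower-cased extension.
std::string extension_term(const std::string& ext) {
    std::string term("E");
    for (unsigned char ch : ext)
        term += static_cast<char>(std::tolower(ch));
    return term;
}
```

For example, the extensions "PDF" and "pdf" both map to the term `Epdf`, so a filter on that term finds files regardless of how the extension was capitalised.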

if (!skip_duplicates) {
    // If this document has already been indexed, update the existing
    // entry.
    if (did) {
        // We already found out the document id above.
        db.replace_document(did, newdocument);
    } else if (last_mod <= last_mod_max) {
        // We checked for the UID term and didn't find it.
        did = db.add_document(newdocument);
    } else {
        did = db.replace_document(urlterm, newdocument);
    }
    if (did < updated.size()) {
        if (usual(!updated[did])) {
            updated[did] = true;
            --old_docs_not_seen;
        }
    }
    if (verbose) {
        if (did <= old_lastdocid) {
            cout << "updated" << endl;
        } else {
            cout << "added" << endl;
        }
    }
} else {
    // If this were a duplicate, we'd have skipped it above.
    db.add_document(newdocument);
    if (verbose)
        cout << "added" << endl;
}

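This is where the document is written to the database. A duplicate document can either be skipped or used to replace and update the existing document.

That covers the main part of the index_file function. Documents in different formats first go through a dump step that extracts their text content, and only then are they indexed.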
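
The call `db.replace_document(urlterm, newdocument)` relies on a unique term: if a document carrying that term exists it is replaced, otherwise a new document is added. The toy model below illustrates that contract with standard containers. `ToyDatabase` is hypothetical, the "U" prefix on the example terms is assumed for illustration, and the case where several documents share the term is ignored.

```cpp
#include <map>
#include <string>

using DocId = unsigned;

// Toy model of "replace by unique term, or add if absent".
struct ToyDatabase {
    DocId last_docid = 0;
    std::map<std::string, DocId> by_unique_term;  // unique term -> docid
    std::map<DocId, std::string> data;            // docid -> document data

    // Replace the document carrying unique_term, or add a new one.
    DocId replace_document(const std::string& unique_term,
                           const std::string& doc_data) {
        auto it = by_unique_term.find(unique_term);
        DocId did = (it != by_unique_term.end()) ? it->second : ++last_docid;
        by_unique_term[unique_term] = did;
        data[did] = doc_data;
        return did;
    }
};
```

Indexing the same URL twice therefore keeps one document ID and overwrites its content, which is why re-running the indexer does not create duplicates.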