Whoosh, a Python Full-Text Search Library, for Beginners (2): Quick Start

shiftbank

Tags: Python, whoosh, full-text search

Article link: https://blog.csdn.net/u012387575/article/details/52188054

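I covered installing Whoosh in the previous article; if you run into problems, look there.

To get up and running with Whoosh quickly, follow along. What I am mainly doing here is translation work, so please go easy on me where the wording is rough.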

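The Index and Schema objects

To begin using Whoosh, you need an index object. The first time you create an index, you must define the index's schema.

The schema lists the fields in the index. A field is one piece of information about each document in the index, such as its title or text content.

A field can be indexed (meaning it can be searched) and/or stored (meaning the value that was indexed is returned with the search results, which is useful for fields such as the title).

The following schema has two fields, title and content: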

from whoosh.fields import Schema, TEXT

schema = Schema(title=TEXT, content=TEXT)

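You only need to create the schema once, when you create the index. The schema is stored with the index.

Creating the schema:

You use keyword arguments to map field names to field types. The list of fields and their types defines what you are indexing and what is searchable.

Whoosh comes with some very useful predefined field types, and you can easily create your own.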

whoosh.fields.ID

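This type simply indexes (and optionally stores) the entire value of the field as a single unit (that is, it does not break it up into individual words). This is useful for fields such as a file path, URL, date, or category.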

whoosh.fields.STORED

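This field is stored with the document, but it is not indexed and therefore not searchable. This is useful for document information you want to display to the user in the search results.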

whoosh.fields.KEYWORD

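This type is designed for space- or comma-separated keywords. It is indexed and searchable (and optionally stored). To save space, it does not support phrase searching.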

whoosh.fields.TEXT

This type is for body text. It indexes (and optionally stores) the text and stores term positions to allow phrase searching.

whoosh.fields.NUMERIC

This type is for numbers. You can store integers or floating point numbers.

whoosh.fields.BOOLEAN

This type is for boolean (true/false) values.

whoosh.fields.DATETIME

This type is for datetime objects (see the Whoosh documentation for more details).

whoosh.fields.NGRAM and whoosh.fields.NGRAMWORDS

These types break the field text or individual terms into N-grams (see Indexing & Searching N-grams for more information).

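(As a shortcut, if you don't need to pass any arguments to a field type, you can just give the class name and Whoosh will instantiate the object for you.)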

from whoosh.fields import Schema, STORED, ID, KEYWORD, TEXT

schema = Schema(title=TEXT(stored=True), content=TEXT,
                path=ID(stored=True), tags=KEYWORD, icon=STORED)

For more information on designing a schema, see the Whoosh documentation.

Once you have the schema, you can create an index using the create_in function:

import os.path
from whoosh.index import create_in

if not os.path.exists("index"):
    os.mkdir("index")
ix = create_in("index", schema)

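At a low level, this creates a Storage object to contain the index. A Storage object represents the medium in which the index will be stored. Usually this will be FileStorage, which stores the index as a set of files in a directory.

After you have created an index, you can open it with open_dir: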

from whoosh.index import open_dir

ix = open_dir("index")

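The IndexWriter object

OK, now that we have an Index object, we can start adding documents. The writer() method of the Index object returns an IndexWriter, which lets you add documents to the index. The IndexWriter's add_document(**kwargs) method accepts keyword arguments in which each field name is mapped to a value: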

writer = ix.writer()
writer.add_document(title=u"My document", content=u"This is my document!",
                    path=u"/a", tags=u"first short", icon=u"/icons/star.png")
writer.add_document(title=u"Second try", content=u"This is the second example.",
                    path=u"/b", tags=u"second short", icon=u"/icons/sheep.png")
writer.add_document(title=u"Third time's the charm", content=u"Examples are many.",
                    path=u"/c", tags=u"short", icon=u"/icons/book.png")
writer.commit()

Two important notes:

You don't have to fill in a value for every field. Whoosh doesn't care if you leave out a field from a document.
Indexed text fields must be passed a unicode value. Fields that are stored but not indexed (STORED field type) can be passed any pickle-able object, that is, any object that Python's pickle module can serialize.

If you have a text field that is both indexed and stored, you can index a unicode value but store a different object if necessary (it's usually not, but sometimes this is really useful) using this trick:

writer.add_document(title=u"Title to be indexed", _stored_title=u"Stored title")

Calling commit() on the IndexWriter saves the added documents to the index:

writer.commit()

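See the Whoosh documentation for more on how to index documents.

Once your documents are committed to the index, you can search for them.

The Searcher object

To begin searching the index, we need a Searcher object: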

searcher = ix.searcher()

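You'll usually want to open the searcher with a with statement so that it is closed automatically when you're done with it (searcher objects represent a number of open files, so if you don't explicitly close them and the system is slow to collect them, you can run out of file handles):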

with ix.searcher() as searcher:
    ...

Of course, this is equivalent to:

searcher = ix.searcher()
try:
    ...
finally:
    searcher.close()

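The Searcher's search() method takes a Query object. You can construct query objects directly or use a query parser to parse a query string.

For example, this query would match documents that contain both "apple" and "bear" in the content field: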

# Construct query objects directly

from whoosh.query import And, Term
myquery = And([Term("content", u"apple"), Term("content", u"bear")])

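To parse a query string, you can use the default query parser in the qparser module. The first argument to the QueryParser constructor is the default field to search.

This is usually the "body text" field. The second, optional argument is a schema to use to understand how to parse the fields: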

# Parse a query string

from whoosh.qparser import QueryParser
parser = QueryParser("content", ix.schema)
myquery = parser.parse(querystring)

Once you have a Searcher and a query object, you can use the Searcher's search() method to run the query and get a Results object:

>>> results = searcher.search(myquery)
>>> print(len(results))
1
>>> print(results[0])
{"title": "Second try", "path": "/b", "icon": "/icons/sheep.png"}

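The default QueryParser implements a query language very similar to Lucene's. It lets you connect terms with AND or OR, eliminate terms with NOT, group terms together into clauses with parentheses, do range, prefix, and wildcard queries, and specify different fields to search. By default it joins clauses together with AND (so by default, all the terms you specify must be in the document for the document to match):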

>>> print(parser.parse(u"render shade animate"))
And([Term("content", "render"), Term("content", "shade"), Term("content", "animate")])

>>> print(parser.parse(u"render OR (title:shade keyword:animate)"))
Or([Term("content", "render"), And([Term("title", "shade"), Term("keyword", "animate")])])

>>> print(parser.parse(u"rend*"))
Prefix("content", "rend")

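Whoosh's search also includes some extra features for dealing with results, such as:

Sorting results by the value of an indexed field, instead of by relevance.

Highlighting the search terms in excerpts from the original documents.

Expanding the query terms based on the top few documents found.

Paginating the results.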

See How to search for more information.