How to Search with Whoosh: Advanced Edition

鬼义虎神

Article link: https://blog.csdn.net/u013487601/article/details/104764045

Official documentation: https://whoosh.readthedocs.io/en/latest/searching.html

Once you have created an index and added documents to it, you can search for those documents.

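Contents:

The Searcher object
The Results object
Scoring and sorting
Filtering results
Which terms from my query matched?
Collapsing results
Time limited searches
Convenience methods
Combining Results objects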

I. The `Searcher` object

To get a whoosh.searching.Searcher object, call searcher() on your Index object:

searcher = myindex.searcher()

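You will usually want to open the searcher with a with statement so that it is closed automatically when you are done with it. A searcher object holds open files, so if you never close it explicitly, system resources such as file handles can run out. You can also close it by hand: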

with ix.searcher() as searcher:
    ...

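This is of course equivalent to:

# Open the searcher before the try block, so that close() is only
# called on a searcher that was actually created
searcher = ix.searcher()
try:
    ...
finally:
    searcher.close()

The Searcher object has many useful methods for getting information about the index, such as lexicon(fieldname).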

>>> list(searcher.lexicon("content"))
[u"document", u"index", u"whoosh"]

However, the most important method on the Searcher object is search(), which takes a whoosh.query.Query object and returns a Results object:

from whoosh.qparser import QueryParser

qp = QueryParser("content", schema=myindex.schema)
q = qp.parse(u"hello world")

with myindex.searcher() as s:
    results = s.search(q)

By default the results contain at most the first 10 matching documents. To get more results, use the limit keyword:

results = s.search(q, limit=20)

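If you want all results, use limit=None.

However, setting a limit whenever possible makes searches faster, because Whoosh does not need to examine and score every document.

Since displaying one page of results at a time is a common pattern, the search_page method lets you conveniently retrieve only the results on a given page: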

results = s.search_page(q, 1)

The default page length is 10 hits. You can use the pagelen keyword argument to set a different page length:

results = s.search_page(q, 5, pagelen=20)
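
To make the paging arithmetic concrete, here is a plain-Python sketch of which hits a 1-based page number covers. It is not Whoosh code: `page_bounds` is a hypothetical helper and the list of hits is made up for illustration.

```python
def page_bounds(pagenum, pagelen=10, total=None):
    """Return 0-based (start, end) slice bounds for a 1-based page number."""
    if pagenum < 1:
        raise ValueError("pagenum must be >= 1")
    start = (pagenum - 1) * pagelen
    end = start + pagelen
    if total is not None:
        # Never point past the last available hit
        start = min(start, total)
        end = min(end, total)
    return start, end


# 95 made-up hits; page 5 with 20 hits per page is the last, partly filled page
hits = ["doc%d" % i for i in range(95)]
start, end = page_bounds(5, pagelen=20, total=len(hits))
page = hits[start:end]
print(start, end, len(page))
```

Page numbers start at 1, so page 5 with pagelen=20 begins at the 81st hit.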

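II. The Results object

The Results object acts like a list of the matched documents. You can use it to access the stored fields of each hit document, for example to display them to the user.

>>> # Show the best hit's stored fields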
>>> results[0]
{"title": u"Hello World in Python", "path": u"/a/b/c"}
>>> results[0:2]
[{"title": u"Hello World in Python", "path": u"/a/b/c"},
{"title": u"Foo", "path": u"/bar"}]

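By default, Searcher.search(myquery) limits the number of hits to 10, so the number of scored hits in the Results object may be less than the number of matching documents in the index.

>>> # How many documents in the entire index would have matched?
>>> len(results)
27
>>> # How many scored and sorted documents are in this Results object?
>>> # This will often be less than len() if the number of hits was limited
>>> # (the default).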
>>> results.scored_length()
10

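Calling len(Results) runs a fast (unscored) version of the query again to count the total number of matching documents. This is usually very fast, but on large indexes it can cause a noticeable delay. To avoid this delay on very large indexes, you can use the has_exact_length(), estimated_length(), and estimated_min_length() methods to estimate the number of matching documents without calling len():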

found = results.scored_length()
if results.has_exact_length():
    print("Scored", found, "of exactly", len(results), "documents")
else:
    low = results.estimated_min_length()  # lower bound on the number of matches
    high = results.estimated_length()  # estimated number of matches

    print("Scored", found, "of between", low, "and", high, "documents")

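III. Scoring and sorting

1. Scoring

Normally the list of result documents is sorted by score. The whoosh.scoring module contains implementations of various scoring algorithms. The default is BM25F.

When you create the searcher, you can use the weighting keyword argument to set the weighting object: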

from whoosh import scoring

with myindex.searcher(weighting=scoring.TF_IDF()) as s:
    ...

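A weighting model is a WeightingModel subclass with a scorer() method that produces a "scorer" instance. This instance has a method that takes the current matcher and returns a floating point score.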
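
For intuition about the kind of number a scorer returns, here is a plain-Python sketch of the textbook single-field BM25 term score, the formula that BM25F extends to multiple weighted fields. This is not Whoosh's implementation: the function name, the idf variant, and the default values of `k1` and `b` are assumptions taken from the common formulation.

```python
import math


def bm25_term_score(tf, doc_len, avg_doc_len, n_docs, doc_freq,
                    k1=1.2, b=0.75):
    """Textbook BM25 score of one term in one document (illustrative only)."""
    # Rare terms (low doc_freq) get a higher inverse document frequency
    idf = math.log(1.0 + (n_docs - doc_freq + 0.5) / (doc_freq + 0.5))
    # Length normalization: long documents need more occurrences to score well
    norm = tf + k1 * (1.0 - b + b * doc_len / avg_doc_len)
    return idf * tf * (k1 + 1.0) / norm


base = bm25_term_score(tf=1, doc_len=100, avg_doc_len=100,
                       n_docs=1000, doc_freq=10)
more_hits = bm25_term_score(tf=3, doc_len=100, avg_doc_len=100,
                            n_docs=1000, doc_freq=10)
long_doc = bm25_term_score(tf=1, doc_len=400, avg_doc_len=100,
                           n_docs=1000, doc_freq=10)
print(base, more_hits, long_doc)
```

More occurrences of the term raise the score, while the same single occurrence in a much longer document scores lower.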

2. Sorting

See Sorting and faceting.

Highlighting snippets and More Like This

See How to create highlighted search result excerpts and Query expansion and Key word extraction for information on these topics.

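IV. Filtering results

You can use the filter keyword argument to search() to specify the set of documents that are allowed to appear in the results.

The argument can be a whoosh.query.Query object, a whoosh.searching.Results object, or a set-like object containing document numbers.

The searcher caches filters. For example, if you use the same query filter with a searcher several times, the later searches will be faster because the searcher caches the results of running the filter query.

You can also specify a mask keyword argument to specify a set of documents that must not appear in the results.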

from whoosh import qparser, query

with myindex.searcher() as s:
    qp = qparser.QueryParser("content", myindex.schema)
    user_q = qp.parse(query_string)

    # Only show documents in the "rendering" chapter
    allow_q = query.Term("chapter", "rendering")
    # Don't show any documents where the "tag" field contains "todo"
    restrict_q = query.Term("tag", "todo")

    results = s.search(user_q, filter=allow_q, mask=restrict_q)

(If you specify both a filter and a mask, and a matching document appears in both, the mask "wins" and the document is not shown.)
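
The interaction between the two arguments can be modeled with ordinary sets. The following is a plain-Python sketch of the rule described above, not Whoosh's implementation; `apply_filter_and_mask` and the document numbers are hypothetical.

```python
def apply_filter_and_mask(matched, allow=None, restrict=None):
    """Keep matched documents that pass the filter and are not masked."""
    kept = []
    filtered_count = 0
    for docnum in matched:
        not_allowed = allow is not None and docnum not in allow
        masked = restrict is not None and docnum in restrict
        if not_allowed or masked:
            # Counted as filtered out, whichever rule rejected it
            filtered_count += 1
            continue
        kept.append(docnum)
    return kept, filtered_count


matched = [1, 2, 3, 4, 5]      # documents matching the user query
allow = {2, 3, 4}              # documents matching the filter
restrict = {3}                 # documents matching the mask

kept, filtered_count = apply_filter_and_mask(matched, allow, restrict)
print(kept, filtered_count)
```

Document 3 is in both sets and is dropped, which is the "mask wins" rule.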

To find out how many results were filtered out, use results.filtered_count (or resultspage.results.filtered_count):

from datetime import datetime, timedelta
from whoosh import qparser, query

with myindex.searcher() as s:
    qp = qparser.QueryParser("content", myindex.schema)
    user_q = qp.parse(query_string)

    # Filter out documents older than 7 days
    old_q = query.DateRange("created", None, datetime.now() - timedelta(days=7))
    results = s.search(user_q, mask=old_q)

    print("Filtered out %d older documents" % results.filtered_count)

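V. Which terms from my query matched?

You can pass the terms=True keyword argument to search() to have the search record which terms in the query matched which documents:

with myindex.searcher() as s:
    results = s.search(myquery, terms=True)

You can then get information about which terms matched from the whoosh.searching.Results and whoosh.searching.Hit objects: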

# Was this results object created with terms=True?
if results.has_matched_terms():
    # What terms matched in the results?
    print(results.matched_terms())

    # What terms matched in each hit?
    for hit in results:
        print(hit.matched_terms())
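
As a rough picture of what is being recorded, the sketch below models matched terms as (fieldname, term) pairs in plain Python. It is an illustration only: `matched_terms` here is a hypothetical function, and the documents are made-up sets of already-analyzed tokens.

```python
def matched_terms(query_terms, doc_fields):
    """Return the (fieldname, term) pairs from the query found in one document."""
    return [(field, term) for field, term in query_terms
            if term in doc_fields.get(field, ())]


query_terms = [("content", "render"), ("content", "shade"), ("title", "render")]
docs = {
    "doc1": {"content": {"render", "light"}, "title": {"intro"}},
    "doc2": {"content": {"shade", "render"}, "title": {"render"}},
}

# Per-hit view: which query terms matched each document
per_hit = {name: matched_terms(query_terms, fields)
           for name, fields in docs.items()}
# Whole-result view: every query term that matched at least one document
all_matched = sorted({pair for pairs in per_hit.values() for pair in pairs})

print(per_hit["doc1"])
print(all_matched)
```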

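VI. Collapsing results

Whoosh lets you eliminate all but the top N documents with the same facet key from the results. This can be useful in a few situations:

Eliminating duplicates at search time.
Restricting the number of matches per source. For example, in a web search application, you might want to show at most three matches from any website.

Whether a document should be collapsed is determined by the value of a "collapse facet". If a document has an empty collapse key, it will never be collapsed; otherwise only the top N documents with the same collapse key will appear in the results.

See Sorting and faceting for information on facets.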

with myindex.searcher() as s:
    # Set the facet to collapse on and the maximum number of documents per
    # facet value (default is 1)
    results = s.search(myquery, collapse="hostname", collapse_limit=3)

    # Dictionary mapping collapse keys to the number of documents that
    # were filtered out by collapsing on that key
    print(results.collapsed_counts)

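Collapsing works with both scored and sorted results. You can use any of the facet types available in the whoosh.sorting module.

By default, Whoosh uses the results order (score or sort key) to determine which documents to collapse. For example, in scored results, the best scoring documents are kept. You can optionally specify a collapse_order facet to control which documents are kept when collapsing.

For example, in a product search you could display results sorted by decreasing price and eliminate all but the highest rated item of each product type: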

from whoosh import sorting

with myindex.searcher() as s:
    price_facet = sorting.FieldFacet("price", reverse=True)
    type_facet = sorting.FieldFacet("type")
    rating_facet = sorting.FieldFacet("rating", reverse=True)

    results = s.search(myquery,
                       sortedby=price_facet,  # Sort by reverse price
                       collapse=type_facet,  # Collapse on product type
                       collapse_order=rating_facet  # Collapse to highest rated
                       )

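Collapsing happens during the search, so it is usually more efficient than finding everything and post-processing the results. However, if collapsing eliminates a large number of documents, a collapsed search can take longer, because the search has to consider more documents and remove many documents it has already collected.

Since this collector must sometimes go back and remove already-collected documents, if you use it in combination with TermsCollector and/or FacetCollector, those collectors may contain information about documents that were filtered out of the final results by collapsing.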
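
The collapsing rules can be modeled in a few lines of plain Python. This is a sketch of the behavior described in this section, not Whoosh's collector: the `collapse` function, its parameters, and the product data are all hypothetical.

```python
from collections import defaultdict


def collapse(hits, key, limit=1, order=None):
    """Keep at most `limit` hits per collapse key.

    hits  -- list of dicts, already in final result order
    order -- optional sort key deciding which hits survive per group
    """
    ranked = hits if order is None else sorted(hits, key=order)
    kept_ids = set()
    seen = defaultdict(int)
    collapsed_counts = defaultdict(int)
    for hit in ranked:
        k = hit.get(key)
        if k is None:
            kept_ids.add(id(hit))      # empty collapse key: never collapsed
        elif seen[k] < limit:
            seen[k] += 1
            kept_ids.add(id(hit))
        else:
            collapsed_counts[k] += 1   # dropped by collapsing on this key
    # Survivors keep their position in the original result order
    return [h for h in hits if id(h) in kept_ids], dict(collapsed_counts)


products = [
    {"name": "A", "type": "phone", "price": 900, "rating": 3},
    {"name": "B", "type": "phone", "price": 700, "rating": 5},
    {"name": "C", "type": "laptop", "price": 1500, "rating": 4},
    {"name": "D", "type": "laptop", "price": 1200, "rating": 2},
    {"name": "E", "type": None, "price": 50, "rating": 1},
]
by_price = sorted(products, key=lambda p: -p["price"])

# Default: the result order decides who survives
plain, _ = collapse(by_price, "type")
# collapse_order analogue: keep the highest rated item of each type
rated, counts = collapse(by_price, "type", order=lambda p: -p["rating"])

print([p["name"] for p in plain])
print([p["name"] for p in rated], counts)
```

In both cases the output stays sorted by price; only the choice of which item represents each type changes.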

VII. Time limited searches

To limit the amount of time a search can take:

from whoosh.collectors import TimeLimitCollector, TimeLimit

with myindex.searcher() as s:
    # Get a collector object
    c = s.collector(limit=None, sortedby="title_exact")
    # Wrap it in a TimeLimitCollector and set the time limit to 10 seconds
    tlc = TimeLimitCollector(c, timelimit=10.0)

    # Try searching
    try:
        s.search_with_collector(myquery, tlc)
    except TimeLimit:
        print("Search took too long, aborting!")

    # You can still get partial results from the collector
    results = tlc.results()
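
The reason partial results survive a timeout is that hits are accumulated outside the code that raises the exception. The sketch below is a simplified plain-Python model of that idea, not Whoosh's mechanism: `SearchTimedOut`, `collect_with_deadline`, and the fake clock are hypothetical and exist only to make the example deterministic.

```python
import time


class SearchTimedOut(Exception):
    """Raised when collecting takes longer than the allowed time."""


def collect_with_deadline(docs, timelimit, collected, clock=time.monotonic):
    """Append docs to `collected` until the time limit is exceeded."""
    deadline = clock() + timelimit
    for doc in docs:
        if clock() > deadline:
            raise SearchTimedOut()
        collected.append(doc)


# Fake clock: every call advances time by one "second"
ticks = iter(range(100))


def fake_clock():
    return next(ticks)


partial = []
try:
    collect_with_deadline(range(10), timelimit=3, collected=partial,
                          clock=fake_clock)
    timed_out = False
except SearchTimedOut:
    timed_out = True

# Whatever was collected before the deadline is still available
print(timed_out, partial)
```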

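VIII. Convenience methods

The document() and documents() methods on the Searcher object let you retrieve the stored fields of documents that match the terms you pass as keyword arguments.

This is especially useful for fields such as dates/times, identifiers, and paths.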

>>> list(searcher.documents(indexeddate=u"20051225"))
[{"title": u"Christmas presents"}, {"title": u"Turkey dinner report"}]
>>> print(searcher.document(path=u"/a/b/c"))
{"title": "Document C"}

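These methods have some limitations:

The results are not scored.
Multiple keywords are always AND-ed together.
The entire value of each keyword argument is treated as a single term; you cannot search for multiple terms in the same field.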
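
These limitations are easier to see in a small model. The following plain-Python sketch mimics the keyword-argument matching over a list of stored-field dictionaries; it is not Whoosh code, it compares stored values rather than indexed terms, and the `documents` and `document` functions and the sample data are hypothetical.

```python
def documents(stored_docs, **kw):
    """Yield every document whose fields equal ALL of the given keywords."""
    for doc in stored_docs:
        # Each keyword value is compared as one whole term; all must match
        if all(doc.get(field) == value for field, value in kw.items()):
            yield doc


def document(stored_docs, **kw):
    """Return the first matching document, or None if there is none."""
    return next(documents(stored_docs, **kw), None)


stored = [
    {"title": "Christmas presents", "indexeddate": "20051225", "path": "/a"},
    {"title": "Turkey dinner report", "indexeddate": "20051225", "path": "/b"},
    {"title": "New year plans", "indexeddate": "20060101", "path": "/c"},
]

same_day = [d["title"] for d in documents(stored, indexeddate="20051225")]
both = document(stored, indexeddate="20051225", path="/b")
missing = document(stored, indexeddate="20051225", path="/c")
print(same_day, both["title"], missing)
```

Because the keywords are AND-ed, the last lookup finds nothing even though each keyword matches some document on its own.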

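IX. Combining Results objects

It is sometimes useful to use the results of another query to influence the order of a whoosh.searching.Results object.

For example, you might have a "best bet" field. This field contains hand-picked keywords for documents. When the user searches for those keywords, you want those documents to be placed at the top of the results list. You could try to do this by boosting the "bestbet" field tremendously, but that can have unpredictable effects on scoring. It is much easier to simply run the query twice and combine the results: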

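from whoosh.query import Or, Term

# Parse the user query
userquery = queryparser.parse(querystring)

# Get the terms searched for, as a set of (fieldname, text) pairs
termset = userquery.all_terms()

# Formulate a "best bet" query for the terms the user
# searched for in the "content" field
bbq = Or([Term("bestbet", text) for fieldname, text
          in termset if fieldname == "content"])

# Find documents matching the searched for terms
results = s.search(bbq, limit=5)

# Find documents that match the original query
allresults = s.search(userquery, limit=10)

# Add the user query results on to the end of the "best bet" results.
# If documents appear in both result sets, push them to the top of the
# combined results.
results.upgrade_and_extend(allresults)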

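The Results object supports the following methods:

Results.extend(results)

Adds the documents in "results" to the end of the list of result documents.

Results.filter(results)

Keeps only the hits that also appear in "results"; hits that are not in "results" are removed from the list of result documents.

Results.upgrade(results)

Any result documents that also appear in "results" are moved to the top of the list of result documents.

Results.upgrade_and_extend(results)

Any result documents that also appear in "results" are moved to the top of the list of result documents. Then any other documents in "results" are added to the end of the list of result documents.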
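
The reordering performed by extend, upgrade, and upgrade_and_extend can be modeled on plain lists of document ids. This is an illustrative sketch, not the Results implementation: the three functions return new lists, whereas the real methods modify the Results object in place, and the document ids are made up.

```python
def extend(results, other):
    """Append documents from `other` that are not already in `results`."""
    return results + [d for d in other if d not in results]


def upgrade(results, other):
    """Move documents that also appear in `other` to the top."""
    top = [d for d in results if d in other]
    rest = [d for d in results if d not in other]
    return top + rest


def upgrade_and_extend(results, other):
    """Upgrade shared documents, then append the remaining ones."""
    return extend(upgrade(results, other), other)


bestbet_hits = ["d7", "d3"]       # hits from the "best bet" query
user_hits = ["d1", "d3", "d4"]    # hits from the user's own query

combined = upgrade_and_extend(bestbet_hits, user_hits)
print(combined)
```

Document d3 appears in both lists, so it moves ahead of d7, and the remaining user hits follow in their original order.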