Fuzzy text search in Python

卢新生

Article link: https://blog.csdn.net/weixin_30062621/article/details/114441759

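In Python, you can use the Whoosh library for fuzzy search. Adding the FuzzyTermPlugin plugin lets you search for terms that do not match exactly: a fuzzy term matches any similar term within a given edit distance. Combined with SequencePlugin, you can also preserve word order and allow gaps between the words by setting the slop parameter. The article also gives a complete example of building an index and running a fuzzy search query.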

{1}

You can do this in Whoosh 2.7. It supports fuzzy search through the plugin whoosh.qparser.FuzzyTermPlugin:

whoosh.qparser.FuzzyTermPlugin lets you search for “fuzzy” terms, that is, terms that don’t have to match exactly. The fuzzy term will match any similar term within a certain number of “edits” (character insertions, deletions, and/or transpositions – this is called the “Damerau-Levenshtein edit distance”).
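To make "edit distance" concrete, here is a minimal sketch in plain Python. It is not Whoosh's implementation, and the function name edit_distance is made up for this illustration. It computes the "optimal string alignment" variant of the Damerau-Levenshtein distance, where an insertion, a deletion, a substitution, or a swap of two adjacent characters each counts as one edit.

```python
def edit_distance(a, b):
    """Optimal string alignment distance between strings a and b.

    Illustrative helper only (not part of Whoosh): insertions, deletions,
    substitutions and swaps of two adjacent characters each cost 1.
    """
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all i characters of a
    for j in range(cols):
        d[0][j] = j  # insert all j characters of b
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # match or substitution
            )
            # swap of two adjacent characters ("ai" <-> "ia")
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


print(edit_distance("letter", "letters"))  # 1 (one insertion)
print(edit_distance("mail", "mial"))       # 1 (one transposition)
```

With this measure, both "letters" and "mial" are within one edit of the words they were meant to be, which is why a fuzzy term can still find them.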

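To add the fuzzy plugin:

from whoosh import qparser

# my_index is an index you have already opened or created
parser = qparser.QueryParser("fieldname", my_index.schema)
parser.add_plugin(qparser.FuzzyTermPlugin())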

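Once the fuzzy plugin has been added to the parser, you can mark a term as fuzzy by appending a ~ followed by an optional maximum edit distance. If no edit distance is given, the default is 1.

For example, the following are "fuzzy" term queries: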

letter~

letter~2

letter~2/3
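The three forms above can be read as "text ~ maximum edit distance / prefix length". As I understand the Whoosh documentation, the number after the slash is the number of leading characters that must match exactly. The sketch below only shows how the syntax breaks down. parse_fuzzy_term is a hypothetical helper, not Whoosh's parser, and the default prefix length of 0 is my assumption.

```python
import re

# Hypothetical helper for illustration; Whoosh's own parser is not used here.
FUZZY_TERM = re.compile(
    r"^(?P<text>[^~\s]+)~(?P<maxdist>\d)?(?:/(?P<prefix>\d+))?$"
)


def parse_fuzzy_term(term):
    """Split 'text~maxdist/prefix' into (text, maxdist, prefixlength)."""
    match = FUZZY_TERM.match(term)
    if match is None:
        raise ValueError("not a fuzzy term: %r" % term)
    maxdist = int(match.group("maxdist") or 1)  # default edit distance is 1
    prefix = int(match.group("prefix") or 0)    # assumed default: no fixed prefix
    return match.group("text"), maxdist, prefix


for term in ("letter~", "letter~2", "letter~2/3"):
    print(term, "->", parse_fuzzy_term(term))
```

So letter~2/3 would mean: allow up to 2 edits, but require the first 3 characters ("let") to match exactly.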

{2} To keep the words in order, use the query whoosh.query.Phrase. However, you should replace the Phrase plugin with whoosh.qparser.SequencePlugin, which lets you use fuzzy terms inside a phrase:

"letter~ stamp~ mail~"

To replace the default phrase plugin with the sequence plugin:

parser = qparser.QueryParser("fieldname", my_index.schema)
parser.remove_plugin_class(qparser.PhrasePlugin)
parser.add_plugin(qparser.SequencePlugin())

{3} To allow other words in between, initialize the slop argument of the Phrase query to a larger number:

whoosh.query.Phrase(fieldname, words, slop=1, boost=1.0, char_ranges=None)

slop – the number of words allowed between each “word” in the phrase; the default of 1 means the phrase must match exactly.

You can also set the slop inside the query string, like this:

"letter~ stamp~ mail~"~10
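To show what slop changes, here is a small plain-Python sketch. It is not Whoosh code. It assumes, following the description quoted above, that slop is the largest allowed position difference between consecutive phrase words, so slop=1 means the words must be adjacent. The function name phrase_matches is made up for this example.

```python
def phrase_matches(tokens, words, slop=1):
    """Return True if `words` occur in order in `tokens`, with each pair of
    consecutive matches at most `slop` positions apart (illustration only)."""

    def search(word_index, previous_position):
        if word_index == len(words):
            return True  # every phrase word has been placed
        for position, token in enumerate(tokens):
            if token != words[word_index]:
                continue
            in_range = (
                previous_position is None
                or 0 < position - previous_position <= slop
            )
            if in_range and search(word_index + 1, position):
                return True
        return False

    return search(0, None)


tokens = "letter first stamp second mail third".split()
words = ["letter", "stamp", "mail"]
print(phrase_matches(tokens, words, slop=1))  # False: "first" sits in between
print(phrase_matches(tokens, words, slop=2))  # True: one word in between is allowed
```

A slop of 10, as in the query above, is simply a generous upper bound that lets several unrelated words sit between the phrase words while still requiring them to appear in order.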

{4} The overall solution:

{4.a} The indexer would look like:

import os

from whoosh.index import create_in
from whoosh.fields import Schema, TEXT

schema = Schema(title=TEXT(stored=True), content=TEXT)

# create_in() expects the index directory to exist already
os.makedirs("indexdir", exist_ok=True)
ix = create_in("indexdir", schema)

writer = ix.writer()
writer.add_document(title=u"First document", content=u"This is the first document we've added!")
writer.add_document(title=u"Second document", content=u"The second one is even more interesting!")
writer.add_document(title=u"Third document", content=u"letter first, stamp second, mail third")
writer.add_document(title=u"Fourth document", content=u"stamp first, mail third")
writer.add_document(title=u"Fifth document", content=u"letter first, mail third")
# "mial" is misspelled on purpose, to exercise the fuzzy matching
writer.add_document(title=u"Sixth document", content=u"letters first, stamps second, mial third wrong")
writer.add_document(title=u"Seventh document", content=u"stamp first, letters second, mail third")
writer.commit()

{4.b} The searcher would look like:

from whoosh.qparser import QueryParser, FuzzyTermPlugin, PhrasePlugin, SequencePlugin

with ix.searcher() as searcher:
    parser = QueryParser(u"content", ix.schema)
    # enable the term~n syntax
    parser.add_plugin(FuzzyTermPlugin())
    # swap the phrase plugin for the sequence plugin so fuzzy terms work inside quotes
    parser.remove_plugin_class(PhrasePlugin)
    parser.add_plugin(SequencePlugin())
    query = parser.parse(u"\"letter~2 stamp~2 mail~2\"~10")
    results = searcher.search(query)
    print("nb of results =", len(results))
    for r in results:
        print(r)

The result is:

nb of results = 2
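To see why two documents match, the query can be replayed by hand in plain Python. This is only a rough simulation under my own assumptions, not what Whoosh does internally: it uses a naive tokenizer, does no stop-word removal or scoring, and treats slop as a maximum position difference. The helper names are made up for this example.

```python
import re


def edit_distance(a, b):
    """Edit distance counting insert, delete, substitute and adjacent swap."""
    d = [[i + j if i * j == 0 else 0 for j in range(len(b) + 1)]
         for i in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


def fuzzy_sequence_matches(tokens, words, maxdist, slop):
    """True if each word has a fuzzy match in `tokens`, in order, with
    consecutive matches at most `slop` positions apart."""
    def search(word_index, previous):
        if word_index == len(words):
            return True
        for position, token in enumerate(tokens):
            in_range = previous is None or 0 < position - previous <= slop
            if (in_range and edit_distance(token, words[word_index]) <= maxdist
                    and search(word_index + 1, position)):
                return True
        return False
    return search(0, None)


documents = {
    "First document": "This is the first document we've added!",
    "Second document": "The second one is even more interesting!",
    "Third document": "letter first, stamp second, mail third",
    "Fourth document": "stamp first, mail third",
    "Fifth document": "letter first, mail third",
    "Sixth document": "letters first, stamps second, mial third wrong",
    "Seventh document": "stamp first, letters second, mail third",
}

# Equivalent of the query "letter~2 stamp~2 mail~2"~10
matches = [
    title for title, content in documents.items()
    if fuzzy_sequence_matches(re.findall(r"[a-z']+", content.lower()),
                              ["letter", "stamp", "mail"], maxdist=2, slop=10)
]
print(matches)  # ['Third document', 'Sixth document']
```

The third document matches exactly, and the sixth matches because "letters", "stamps" and "mial" are each within the allowed edit distance. The seventh contains all three words but in the wrong order, so it is rejected.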

{5} If you want fuzzy search to be the default, without writing the word~n syntax on every word of the query, you can initialize the QueryParser like this:

from whoosh.query import FuzzyTerm

parser = QueryParser(u"content", ix.schema, termclass=FuzzyTerm)

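Now you can use the query "letter stamp mail"~10, but keep in mind that FuzzyTerm has a default edit distance of maxdist=1. If you want a larger edit distance, subclass it with your own defaults:

class MyFuzzyTerm(FuzzyTerm):
    def __init__(self, fieldname, text, boost=1.0, maxdist=2, prefixlength=1, constantscore=True):
        # Python 3 form; on Python 2 write super(MyFuzzyTerm, self).__init__(...)
        super().__init__(fieldname, text, boost, maxdist, prefixlength, constantscore)

# pass the subclass as the term class instead of FuzzyTerm
parser = QueryParser(u"content", ix.schema, termclass=MyFuzzyTerm)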
