Python fuzzy search: fuzzy text search in Python

{1}

You can do this with Whoosh 2.7. It supports fuzzy search through the whoosh.qparser.FuzzyTermPlugin plugin:

whoosh.qparser.FuzzyTermPlugin lets you search for “fuzzy” terms, that is, terms that don’t have to match exactly. The fuzzy term will match any similar term within a certain number of “edits” (character insertions, deletions, and/or transpositions – this is called the “Damerau-Levenshtein edit distance”).
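To get a feel for what an "edit" means, here is a minimal pure-Python sketch of an edit distance with transpositions (the "optimal string alignment" variant of Damerau-Levenshtein). The function name `osa_distance` is made up for this illustration and is not part of Whoosh. Note that this sketch counts a substitution as a single edit, as the textbook definition does; the quoted documentation lists only insertions, deletions and transpositions, so Whoosh's own counting may differ.

```python
def osa_distance(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions, substitutions and
    transpositions of two adjacent characters (optimal string alignment).

    Illustrative helper only; this is not Whoosh's implementation.
    """
    rows, cols = len(a) + 1, len(b) + 1
    # dist[i][j] = number of edits needed to turn a[:i] into b[:j]
    dist = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        dist[i][0] = i  # delete all i characters
    for j in range(cols):
        dist[0][j] = j  # insert all j characters

    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j] + 1,         # deletion
                dist[i][j - 1] + 1,         # insertion
                dist[i - 1][j - 1] + cost,  # match or substitution
            )
            # two adjacent characters swapped
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                dist[i][j] = min(dist[i][j], dist[i - 2][j - 2] + 1)
    return dist[-1][-1]


print(osa_distance("mail", "mial"))       # 1 (one transposition)
print(osa_distance("letter", "letters"))  # 1 (one insertion)
```

With a maximum edit distance of 1, "mial" would therefore still count as a fuzzy match for "mail" under this definition.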

To add the fuzzy plugin:

from whoosh import qparser

parser = qparser.QueryParser("fieldname", my_index.schema)
parser.add_plugin(qparser.FuzzyTermPlugin())

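Once the fuzzy plugin has been added to the parser, you can mark a term as fuzzy by appending ~ followed by an optional maximum edit distance. If no edit distance is given, the default is 1.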

For example, the following are "fuzzy" term queries:

letter~

letter~2

letter~2/3
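The three forms above can be read as "term, optional maximum edit distance, optional prefix length". In Whoosh's documentation the number after the slash is the prefix length, i.e. how many leading characters must match exactly. The sketch below only illustrates how such a token breaks down; `split_fuzzy_term` and `FuzzySpec` are hypothetical names, the regular expression is not the one Whoosh uses, and the default prefix length of 0 is an assumption taken from the Whoosh docs rather than from the text above.

```python
import re
from typing import NamedTuple, Optional


class FuzzySpec(NamedTuple):
    text: str          # the term itself
    maxdist: int       # maximum number of edits allowed
    prefixlength: int  # leading characters that must match exactly


# term, then "~", then an optional distance, then an optional "/prefix"
_FUZZY = re.compile(r"^(?P<text>[^~\s]+)~(?P<maxdist>\d+)?(?:/(?P<prefix>\d+))?$")


def split_fuzzy_term(token: str) -> Optional[FuzzySpec]:
    """Break a token such as 'letter~2/3' into its parts.

    Returns None when the token carries no '~' marker.
    """
    m = _FUZZY.match(token)
    if m is None:
        return None
    maxdist = int(m.group("maxdist")) if m.group("maxdist") else 1   # default from the text
    prefix = int(m.group("prefix")) if m.group("prefix") else 0      # assumed default
    return FuzzySpec(m.group("text"), maxdist, prefix)


for token in ("letter~", "letter~2", "letter~2/3"):
    print(token, "->", split_fuzzy_term(token))
```

So `letter~2/3` asks for terms within two edits of "letter" whose first three characters are exactly "let".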

{2} To preserve word order, use a whoosh.query.Phrase query. However, you should replace the phrase plugin with whoosh.qparser.SequencePlugin, which allows fuzzy terms inside a phrase:

"letter~ stamp~ mail~"

To replace the default phrase plugin with the sequence plugin:

parser = qparser.QueryParser("fieldname", my_index.schema)
parser.remove_plugin_class(qparser.PhrasePlugin)
parser.add_plugin(qparser.SequencePlugin())

{3} To allow other words in between, set the slop argument of the phrase query to a larger number:

whoosh.query.Phrase(fieldname, words, slop=1, boost=1.0, char_ranges=None)

slop – the number of words allowed between each “word” in the phrase; the default of 1 means the phrase must match exactly.
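A small pure-Python sketch may make slop more concrete. It reads slop as "each phrase word must appear at most slop positions after the previous one", so slop=1 means the words are adjacent. That reading, the function name `matches_in_order` and the whitespace tokenization are assumptions made for illustration; Whoosh's real matcher works on indexed positions and may treat edge cases differently.

```python
from typing import Sequence


def matches_in_order(doc_words: Sequence[str], phrase: Sequence[str], slop: int = 1) -> bool:
    """Return True if the phrase words occur in order in doc_words, with each
    word at most `slop` positions after the previous phrase word.

    Illustrative only; this is not Whoosh's Phrase matcher.
    """
    if not phrase:
        return False
    # positions[k] = every index in the document where phrase[k] occurs
    positions = [[i for i, w in enumerate(doc_words) if w == p] for p in phrase]

    def extend(idx: int, prev: int) -> bool:
        # try to place phrase[idx] somewhere after position `prev`
        if idx == len(phrase):
            return True
        return any(
            prev < pos <= prev + slop and extend(idx + 1, pos)
            for pos in positions[idx]
        )

    return any(extend(1, start) for start in positions[0])


doc = "letter first stamp second mail third".split()
phrase = ["letter", "stamp", "mail"]
print(matches_in_order(doc, phrase, slop=1))   # False: the words are not adjacent
print(matches_in_order(doc, phrase, slop=10))  # True: one word in between is allowed
```

Because the check is ordered, a document that has "stamp" before "letter" does not match, no matter how large the slop is.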

You can also set the slop inside the query string itself, like this:

"letter~ stamp~ mail~"~10

{4} The overall solution:

{4.a} The indexer would look like:

import os

from whoosh.fields import Schema, TEXT
from whoosh.index import create_in

schema = Schema(title=TEXT(stored=True), content=TEXT)

# create_in() expects the index directory to exist already
os.makedirs("indexdir", exist_ok=True)
ix = create_in("indexdir", schema)

writer = ix.writer()
writer.add_document(title="First document", content="This is the first document we've added!")
writer.add_document(title="Second document", content="The second one is even more interesting!")
writer.add_document(title="Third document", content="letter first, stamp second, mail third")
writer.add_document(title="Fourth document", content="stamp first, mail third")
writer.add_document(title="Fifth document", content="mail third")
writer.add_document(title="Sixth document", content="letters first, stamps second, mial third wrong")
writer.add_document(title="Seventh document", content="letters second, mail third")
writer.commit()

{4.b} The searcher would look like:

from whoosh.qparser import QueryParser, FuzzyTermPlugin, PhrasePlugin, SequencePlugin

with ix.searcher() as searcher:
    parser = QueryParser("content", ix.schema)
    parser.add_plugin(FuzzyTermPlugin())
    # swap the phrase plugin for the sequence plugin so fuzzy terms work inside quotes
    parser.remove_plugin_class(PhrasePlugin)
    parser.add_plugin(SequencePlugin())
    query = parser.parse('"letter~2 stamp~2 mail~2"~10')
    results = searcher.search(query)
    print("nb of results =", len(results))
    for r in results:
        print(r)

This gives the following result:

nb of results = 2

{5} If you want fuzzy search to be the default, without writing the word~n syntax on every word of the query, you can initialize the QueryParser like this:

from whoosh.query import FuzzyTerm

parser = QueryParser("content", ix.schema, termclass=FuzzyTerm)

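Now you can use the query "letter stamp mail"~10, but keep in mind that FuzzyTerm has a default edit distance of maxdist = 1. If you want a larger edit distance, subclass it and change the default: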

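class MyFuzzyTerm(FuzzyTerm):
    def __init__(self, fieldname, text, boost=1.0, maxdist=2, prefixlength=1, constantscore=True):
        # same arguments as FuzzyTerm, but with a default maxdist of 2
        super(MyFuzzyTerm, self).__init__(fieldname, text, boost, maxdist, prefixlength, constantscore)
        # on Python 3, plain super().__init__(...) works as well

parser = QueryParser("content", ix.schema, termclass=MyFuzzyTerm)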
