Building a simple score-based text index and search system with Python and Redis

This article shows how to write a ScoredIndexSearch class in Python that builds a TF-IDF-based text index in Redis, supports efficient search, and pairs with a TestIndex class for functional testing.
import collections
import math
import os
import re
import unittest

import redis

NON_WORDS = re.compile("[^a-z0-9' ]")

# stop words pulled from the below url
# http://www.textfixer.com/resources/common-english-words.txt
STOP_WORDS = set('''a able about across after all almost also am among
an and any are as at be because been but by can cannot could dear did
do does either else ever every for from get got had has have he her
hers him his how however i if in into is it its just least let like
likely may me might most must my neither no nor not of off often on
only or other our own rather said say says she should since so some
than that the their them then there these they this tis to too twas us
wants was we were what when where which while who whom why will with
would yet you your'''.split())

class ScoredIndexSearch(object):
    def __init__(self, prefix, *redis_settings):
        # All of our index keys are going to be prefixed with the provided
        # prefix string.  This will allow multiple independent indexes to
        # coexist in the same Redis db.
        self.prefix = prefix.lower().rstrip(':') + ':'

        # Create a connection to our Redis server.  Responses are decoded
        # so that search results come back as strings rather than bytes.
        self.connection = redis.Redis(*redis_settings, decode_responses=True)

    @staticmethod
    def get_index_keys(content, add=True):
        # Very simple word-based parser.  We skip stop words and single
        # character words.
        words = NON_WORDS.sub(' ', content.lower()).split()
        words = [word.strip("'") for word in words]
        words = [word for word in words
                    if word not in STOP_WORDS and len(word) > 1]
        # Apply the Porter Stemmer here if you would like that functionality.

        # Apply the Metaphone/Double Metaphone algorithm by itself, or after
        # the Porter Stemmer.

        if not add:
            return words

        # Calculate the TF portion of TF/IDF.
        counts = collections.defaultdict(float)
        for word in words:
            counts[word] += 1
        wordcount = len(words)
        tf = dict((word, count / wordcount)
                    for word, count in counts.items())
        return tf

    def _handle_content(self, id, content, add=True):
        # Get the keys we want to index.
        keys = self.get_index_keys(content)
        prefix = self.prefix

        # Use a non-transactional pipeline here to improve performance.
        pipe = self.connection.pipeline(False)

        # Since adding and removing items are exactly the same, except
        # for the method used on the pipeline, we will reduce our line
        # count.
        if add:
            pipe.sadd(prefix + 'indexed:', id)
            for key, value in keys.items():
                pipe.zadd(prefix + key, {id: value})
        else:
            pipe.srem(prefix + 'indexed:', id)
            for key in keys:
                pipe.zrem(prefix + key, id)

        # Execute the insertion/removal.
        pipe.execute()

        # Return the number of keys added/removed.
        return len(keys)

    def add_indexed_item(self, id, content):
        return self._handle_content(id, content, add=True)

    def remove_indexed_item(self, id, content):
        return self._handle_content(id, content, add=False)

    def search(self, query_string, offset=0, count=10):
        # Get our search terms just like we did earlier...
        keys = [self.prefix + key
                    for key in self.get_index_keys(query_string, False)]

        if not keys:
            return [], 0

        total_docs = max(self.connection.scard(self.prefix + 'indexed:'), 1)

        def idf(count):
            # Calculate the IDF for this particular count.
            if not count:
                return 0
            return max(math.log(total_docs / count, 2), 0)

        # Get our document frequency values...
        pipe = self.connection.pipeline(False)
        for key in keys:
            pipe.zcard(key)
        sizes = pipe.execute()

        # Calculate the inverse document frequencies...
        idfs = [idf(size) for size in sizes]

        # And generate the weight dictionary for passing to zunionstore.
        weights = dict((key, idfv)
                for key, size, idfv in zip(keys, sizes, idfs) if size)

        if not weights:
            return [], 0

        # Generate a temporary result storage key
        temp_key = self.prefix + 'temp:' + os.urandom(8).hex()
        try:
            # Actually perform the union to combine the scores.
            known = self.connection.zunionstore(temp_key, weights)
            # Get the results.
            ids = self.connection.zrevrange(
                temp_key, offset, offset+count-1, withscores=True)
        finally:
            # Clean up after ourselves.
            self.connection.delete(temp_key)
        return ids, known

class TestIndex(unittest.TestCase):
    def test_index_basic(self):
        t = ScoredIndexSearch('unittest', 'dev.ad.ly')
        existing = t.connection.keys('unittest:*')
        if existing:
            t.connection.delete(*existing)

        t.add_indexed_item(1, 'hello world')
        t.add_indexed_item(2, 'this world is nice and you are really special')

        self.assertEqual(
            t.search('hello'),
            ([('1', 0.5)], 1))
        self.assertEqual(
            t.search('world'),
            ([('2', 0.0), ('1', 0.0)], 2))
        self.assertEqual(t.search('this'), ([], 0))
        self.assertEqual(
            t.search('hello really special nice world'),
            ([('2', 0.75), ('1', 0.5)], 2))

if __name__ == '__main__':
    unittest.main()

The Python class ScoredIndexSearch creates and manages a score-based index in a Redis database, and the accompanying TestIndex class exercises that functionality. The key parts of the code are explained section by section below.

Imports and setup

The code begins by importing the Python modules it needs: collections for counting, math for logarithms, os for random bytes, re for regular expressions, and unittest for the unit tests. It also imports the redis module, which handles communication with the Redis database.

Non-word characters and stop words

A regular expression, NON_WORDS, matches any character that is not a lowercase letter, digit, single quote, or space. STOP_WORDS is a set of common English words that carry little search value and are therefore ignored during text processing.
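
As a quick illustration of that filtering, the snippet below runs the same steps get_index_keys performs (the abbreviated STOP_WORDS set here is just for the example):

import re

NON_WORDS = re.compile("[^a-z0-9' ]")
STOP_WORDS = {'this', 'is', 'a', 'the', 'and'}  # abbreviated for the example

content = "This is a QUICK test -- don't index stop-words!"
words = NON_WORDS.sub(' ', content.lower()).split()
words = [w.strip("'") for w in words]
words = [w for w in words if w not in STOP_WORDS and len(w) > 1]
print(words)  # ['quick', 'test', "don't", 'index', 'stop', 'words']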

The ScoredIndexSearch class

The core job of ScoredIndexSearch is to create and maintain a score-based text index in Redis so that text searches run efficiently. Its key methods are:

  • __init__: the initializer takes a prefix string and Redis connection settings; the prefix keeps separate indexes from colliding, and the settings are used to open the Redis connection.

  • get_index_keys: a static method that tokenizes text, drops stop words and single-character words, and leaves room for optional processing such as the Porter stemmer or Metaphone. When indexing, it returns the term frequency (TF) of each remaining word.

  • _handle_content: an internal method that adds or removes a document's index entries. It uses a non-transactional Redis pipeline to batch the commands for better performance; a sketch of the resulting key layout follows this list.

  • add_indexed_item / remove_indexed_item: thin wrappers that add or remove an indexed item by delegating to _handle_content.

  • search: implements search over a query string. It computes the inverse document frequency (IDF) of each query term, then uses Redis's zunionstore command with those IDFs as weights to combine the stored TF scores, returning the matching document ids and their scores (see the usage sketch below).
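
A minimal sketch of the key layout _handle_content produces, assuming a Redis server on localhost ('demo' is an arbitrary prefix and the document is illustrative):

import redis

r = redis.Redis('localhost', decode_responses=True)

# After ScoredIndexSearch('demo', 'localhost').add_indexed_item(1, 'hello world'):
#   demo:indexed:  -- a SET of indexed document ids      -> {'1'}
#   demo:hello     -- a ZSET mapping doc id to TF score  -> {'1': 0.5}
#   demo:world     -- a ZSET mapping doc id to TF score  -> {'1': 0.5}
print(r.smembers('demo:indexed:'))   # {'1'}
print(r.zscore('demo:hello', 1))     # 0.5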

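Putting the pieces together, here is a hedged usage sketch based on the unit test below; the hostname 'localhost' and the module name scored_search are assumptions, so substitute your own:

# A minimal usage sketch, assuming the code above is saved as scored_search.py
# and a Redis server is reachable on localhost.
from scored_search import ScoredIndexSearch

index = ScoredIndexSearch('demo', 'localhost')

# Index two documents.  get_index_keys turns each into a TF mapping first,
# e.g. 'hello world' -> {'hello': 0.5, 'world': 0.5}.
index.add_indexed_item(1, 'hello world')
index.add_indexed_item(2, 'this world is nice and you are really special')

# 'world' appears in both documents, so its IDF is log2(2/2) = 0 and it
# contributes nothing; the other terms each have IDF log2(2/1) = 1.
ids, total = index.search('hello really special nice world')
print(ids, total)  # [('2', 0.75), ('1', 0.5)] 2
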
The TestIndex class

TestIndex is a unit-test class derived from unittest.TestCase. It defines a set of test cases that verify the basic behavior of ScoredIndexSearch.

Main program

Finally, when the script is run directly, it executes the unit tests.
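
If you would rather run the tests from a REPL or another script, unittest's loader can be invoked directly (this assumes the module is importable as scored_search and the Redis host used in TestIndex is reachable):

import unittest
from scored_search import TestIndex

suite = unittest.TestLoader().loadTestsFromTestCase(TestIndex)
unittest.TextTestRunner(verbosity=2).run(suite)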

Summary

This article showed how to build a simple score-based text index and search system with Python and Redis, covering text preprocessing, index creation, adding and removing content, and TF/IDF-based search scoring. The unit-test class demonstrates how to verify that each piece behaves correctly.
