Python学习:面向对象(下):实现一个搜索引擎

根据学到的面向对象的一些知识,然后模拟敏捷开发过程中的迭代开发流程,巩固面向对象的程序设计思想。从最简单最直接的搜索做起,一步步优化,但不可避免会介绍一些现代搜索引擎中的基础概念,例如语料(corpus)、倒序索引(inverted index)等。

一、“高大上”的搜索引擎

一个搜索引擎由搜索器、索引器、检索器和用户接口四个部分组成

搜索器,通俗来讲就是我们常提到的爬虫(scrawler),它能在互联网上大量爬取各类网站的内容,送给索引器。索引器拿到网页和内容后,会对内容进行处理,形成索引(index),存储于内部的数据库等待检索。

最后的用户接口很好理解,是指网页和 App 前端界面,例如百度和谷歌的搜索页面。用户通过用户接口,向搜索引擎发出询问(query),询问解析后送达检索器;检索器高效检索后,再将结果返回给用户。

我们假设搜索样本存在于本地磁盘上。为了方便,我们只提供五个文件的检索,内容我放在了下面这段代码中:

# 1.txt
I have a dream that my four little children will one day live in a nation where they will not be judged by the color of their skin but by the content of their character. I have a dream today. 

# 2.txt
I have a dream that one day down in Alabama, with its vicious racists, . . . one day right there in Alabama little black boys and black girls will be able to join hands with little white boys and white girls as sisters and brothers. I have a dream today.

# 3.txt
I have a dream that one day every valley shall be exalted, every hill and mountain shall be made low, the rough places will be made plain, and the crooked places will be made straight, and the glory of the Lord shall be revealed, and all flesh shall see it together.

# 4.txt
This is our hope. . . With this faith we will be able to hew out of the mountain of despair a stone of hope. With this faith we will be able to transform the jangling discords of our nation into a beautiful symphony of brotherhood. With this faith we will be able to work together, to pray together, to struggle together, to go to jail together, to stand up for freedom together, knowing that we will be free one day. . . .

# 5.txt
And when this happens, and when we allow freedom ring, when we let it ring from every village and every hamlet, from every state and every city, we will be able to speed up that day when all of God's children, black men and white men, Jews and Gentiles, Protestants and Catholics, will be able to join hands and sing in the words of the old Negro spiritual: "Free at last! Free at last! Thank God Almighty, we are free at last!"

先定义SearchEngineBase基类:

class SearchEngineBase(object):  #基类

    def __init__(self):
        pass

    def add_corpus(self, file_path):  #负责读取文件内容,将文件路径作为 ID
        with open(file_path, 'r') as fin:
            text = fin.read()

        self.process_corpus(file_path, text)

    def process_corpus(self, id, text):  #索引器,对内容进行处理
        raise Exception('process_corpus not implemented.')

    def search(self, query):   #搜索器
        raise Exception('search not implemented.')


def main(search_engine):  #提供搜索器和用户接口
    for file_path in ['1.txt', '2.txt', '3.txt', '4.txt', '5.txt']:
        search_engine.add_corpus(file_path)

    while True:

        query = 
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值