Design approach
- Use jieba for word segmentation
- Remove stopwords
- Tried nltk for stemming/lemmatization and similar preprocessing (the results were poor)
- Use gensim's BM25 model to build the index (see the sketch after these notes)
- Set a confidence interval for each query and run the queries one by one
- Write the search results into a dictionary
- Output the dictionary contents in the required format

(This could of course also be built on Elasticsearch.)
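A minimal sketch of the BM25 indexing and thresholded-search steps above, assuming gensim 3.8 (gensim.summarization.bm25 required an extra average_idf argument in older 3.x releases and was removed entirely in 4.0). The queries format, threshold value, and result-dictionary shape are illustrative, not taken from the original code; cutsentence is the helper defined in the code below:

from gensim.summarization.bm25 import BM25

def search(corpus, dataid, queries, stops, threshold=5.0):
    # corpus: list of token lists from pretreatment(); dataid: parallel doc ids.
    bm25 = BM25(corpus)
    results = {}
    for qid, query in queries:  # queries: iterable of (query id, query text)
        words = cutsentence(query, stops)
        scores = bm25.get_scores(words)
        # Keep only documents whose score clears the confidence threshold,
        # best matches first.
        hits = [(dataid[i], s) for i, s in enumerate(scores) if s > threshold]
        hits.sort(key=lambda x: x[1], reverse=True)
        results[qid] = hits
    return results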
# -*- coding:utf-8 -*-
import nltk.tokenize  # only used by the commented-out tokenizer experiment below
import time
import jieba
"""
———————————————————————————————————————————————————
2020信息检索期末考试
凌珑
————————————————————————————————————————————————————
"""
'''Build the index'''
def readfile(filename):
    # Open the raw data file; callers iterate over it line by line.
    file = open(filename, 'r', encoding='UTF-8')
    print("File opened successfully")
    return file
def makestops():
    # Load the stopword list: one stopword per line in stopwords.txt.
    stopwords = set()
    with open('stopwords.txt', 'r', encoding='UTF-8') as f:
        for line in f:
            stopwords.add(line.strip('\n'))
    return stopwords
def cutsentence(sen, stops):
    # Earlier attempts, kept for reference (the nltk preprocessing noted
    # above worked poorly on this data):
    # words = sen.split()
    # words = nltk.tokenize.word_tokenize(sen)
    # Search-mode segmentation also emits overlapping sub-words, which
    # trades precision for recall at query time.
    words = jieba.lcut_for_search(sen.strip(), HMM=True)
    words = [i for i in words if i not in stops]
    return words
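# Illustrative only: for a sentence like "自然语言处理很有趣", search-mode
# segmentation typically yields overlapping tokens such as
# ['自然', '语言', '自然语言', '处理'] after stopword filtering; the exact
# split depends on jieba's dictionary and the HMM flag.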
def pretreatment(datafile, corpus, dataid):
    # Tokenize every document and fill the parallel corpus/dataid lists
    # that the BM25 index is built from.
    begin = time.time()
    stops = makestops()
    for sentence in datafile:
        sen = sentence.split("\t", 1)  # each line: "<doc id>\t<text>"
        words = cutsentence(sen[1], stops)
        corpus.append(words)
        dataid.append(sen[0])
    print("Preprocessing took %.2fs" % (time.time() - begin))
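# A hedged usage sketch: 'data.txt' is a hypothetical file name (the
# original section cuts off before the driver code), with one
# "<doc id>\t<text>" record per line.
corpus, dataid = [], []
datafile = readfile('data.txt')
pretreatment(datafile, corpus, dataid)
datafile.close()
# corpus/dataid then feed the BM25 search sketched after the design notes.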