[Python] Inverted Index

Latest recommended article published on 2024-04-28 18:28:12

Birdy_C


Category: Data Structures and Algorithms. Tags: index, python

Article link: https://blog.csdn.net/birdy_/article/details/76642951


Code link

https://github.com/Birdy-C/Shakespeare-search-engine

Preprocessing

word stemming

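A single word can take several forms. In English, for example, verbs change between active and passive and nouns between singular and plural, as in live/lives/lived.
Processing English already looks complicated, but handling Chinese is in practice far more complex.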
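
To make the idea concrete, here is a deliberately crude sketch of suffix stripping that maps the three forms above to one stem. It is only an illustration under stated assumptions: the function name `naive_stem` and its two rules are hypothetical, it handles nothing beyond simple -s and -d endings, and a real project would more likely rely on an established stemmer such as NLTK's `PorterStemmer`.

```python
def naive_stem(word):
    """Crude suffix stripping; only covers simple -s and -d endings."""
    word = word.lower()
    if word.endswith("ed") and len(word) > 4:
        return word[:-1]          # "lived" -> "live"
    if word.endswith("s") and not word.endswith("ss") and len(word) > 3:
        return word[:-1]          # "lives" -> "live"
    return word                   # "live" is left unchanged


forms = ["live", "lives", "lived"]
stems = {naive_stem(w) for w in forms}   # all three forms share one stem
print(stems)
```

With stemming applied before indexing, a query for any one of these forms can match documents that contain the others.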

stop words

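Words such as "a" and "the" carry no real meaning during processing. Here, word frequencies are counted first, the stop words are then chosen by hand, and every occurrence is simply replaced with a space. This approach does not suit every case, though: a phrase like "To be or not to be" is made up almost entirely of such common words and is hard to handle this way.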
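
The following is a minimal sketch of that workflow using only the standard library: count word frequencies, pick stop words by hand from the most frequent ones, then drop them. The names `top_words` and `remove_stop_words` and the sample stop-word set are assumptions made for illustration, not taken from the original code.

```python
from collections import Counter


def top_words(text, n):
    """Return the n most frequent words as candidates for the stop list."""
    return Counter(text.lower().split()).most_common(n)


def remove_stop_words(text, stop_words):
    """Return the words of text that are not in the stop list."""
    return [w for w in text.lower().split() if w not in stop_words]


# A hand-picked stop list (hypothetical; the post stores its own in stop_word.txt).
stop_words = {"a", "the", "to", "be", "or", "not"}

print(top_words("the king and the queen", 1))
print(remove_stop_words("The king lives", stop_words))
print(remove_stop_words("To be or not to be", stop_words))
```

The last call returns an empty list, which shows the limitation mentioned above: once the common words are removed, nothing of the phrase is left to search for.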

Implementation

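Building the inverted index is split into three steps:

1 count.py determines the stop words
2 index.py builds the inverted index
3 query.py runs queries

The files involved are:

Index.txt records the files that appear
thefile.txt all words that occur (in decreasing order of frequency)
stop_word.txt the stop words
data.pkl the index that was built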

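When the inverted index is built here, only the names of the files in which a word occurs are recorded, not the positions within each file.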
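
A file-level index of this kind can be sketched as a mapping from each word to the set of file names that contain it. The code below is an illustration only, not the author's index.py or query.py: the function names, the in-memory `documents` dictionary, and the sample texts are all assumptions.

```python
from collections import defaultdict


def build_index(documents):
    """Map each word to the set of file names containing it.

    documents: dict of file name -> text (a hypothetical input shape).
    """
    index = defaultdict(set)
    for filename, text in documents.items():
        for word in text.lower().split():
            index[word].add(filename)   # file name only, no positions
    return index


def query(index, words):
    """Return the sorted file names that contain every query word."""
    postings = [index.get(w.lower(), set()) for w in words]
    if not postings:
        return []
    return sorted(set.intersection(*postings))


docs = {
    "a.txt": "the king lives",
    "b.txt": "the queen lived",
    "c.txt": "long live the king",
}
index = build_index(docs)
print(query(index, ["the", "king"]))
```

Because no positions are stored, such an index can answer "which files contain all of these words" but cannot check that the words are adjacent, so exact phrase queries are out of reach. To persist the result in a file such as data.pkl, one option would be `pickle.dump(dict(index), f)`.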

The figure shows the word-frequency statistics generated by count.py.


count.py

#-*- coding:utf-8 -*-
'''
@author birdy qian
'''
import sys
from nltk import *                              # import the Natural Language Toolkit
from operator import itemgetter                 # used as the sort key

def output_count(fdist):
    '''Print, plot, and save the vocabulary in decreasing order of frequency.'''
    # (word, count) pairs sorted by count, highest first
    vocabulary = sorted(fdist.items(), key=itemgetter(1), reverse=True)
    print(vocabulary[:250])                     # show the 250 most frequent words and their counts
    print('drawing plot.....')                  # show progress
    fdist.plot(120, cumulative=False)           # plot the 120 most frequent words

    # write every word, most frequent first, to thefile.txt
    with open('thefile.txt', 'w') as file_object:
        for j in vocabulary:
            file_object.write(j[0] + ' ')


def pre_file(filename):
    print("read file %s.txt....." % filename)   # show progress
    with open(str(filename) + '.txt', "r") as f:
        content = f.read()
    content = content.lower()
    for ch in '!"#$%&()*+,-./:;<=>?@[\\]^_‘{|}~':     # strip the punctuation
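
The listing stops at the punctuation loop. As a hedged sketch of how that step might continue, the code below assumes each punctuation character is replaced with a space, in line with how the post treats stop words, and then counts the remaining words. It uses `collections.Counter` instead of NLTK's `FreqDist` so that it runs without third-party packages; the names `strip_punctuation` and `count_words` are hypothetical and do not come from the original count.py.

```python
from collections import Counter

# Same character set as in the listing above.
PUNCTUATION = '!"#$%&()*+,-./:;<=>?@[\\]^_‘{|}~'


def strip_punctuation(content):
    """Lower-case the text and replace each punctuation character with a space."""
    content = content.lower()
    for ch in PUNCTUATION:
        content = content.replace(ch, ' ')   # assumption: replace, not delete
    return content


def count_words(content):
    """Count how often each word occurs once punctuation is stripped."""
    return Counter(strip_punctuation(content).split())


counts = count_words("To be, or not to be!")
print(counts.most_common(2))
```

Replacing with a space rather than deleting keeps hyphenated or punctuation-joined words from being merged into one token.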
