Brute-forcing an inverted index in Python

Latest recommended article published on 2024-03-06 17:17:47

weixin_30945039

Article tags: python json

Original link: http://www.cnblogs.com/chenyuan404/p/10147531.html

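Libraries to import: jieba, json

json ships with Python; to get jieba, just run pip install jieba on the command line.

This code builds the inverted index by sheer brute force, which may cause mild discomfort, so use it at your own discretion.

The code has three parts: tokenization, building the forward index, and building the inverted index.

Files needed: a corpus and a stop-word list (search online for a stop-word list yourself).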

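The corpus (shown as a screenshot in the original post) looks like this:

I used a batch of news headlines that I crawled myself, from NetEase, Toutiao, and ifeng.com, plus a small number of WeChat article titles. Corpus preparation: each sentence only needs to end with a newline, so that there is one title per line.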
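
As a rough illustration of the corpus layout described above, the sketch below writes a list of titles to a file with one title per line. The helper name `write_corpus`, the file name `test_demo.txt`, and the sample titles are made up for this example. Only the one-title-per-line layout comes from the post.

```python
import os
import tempfile

def write_corpus(titles, path):
    """Write one title per line, skipping blank entries (hypothetical helper)."""
    with open(path, 'w', encoding='utf-8') as f:
        for title in titles:
            title = title.strip()
            if title:
                f.write(title + '\n')

# Sample data invented for this example
sample_titles = ['python inverted index', '  ', 'python json tutorial']
corpus_path = os.path.join(tempfile.gettempdir(), 'test_demo.txt')
write_corpus(sample_titles, corpus_path)

# Read the file back to check the layout
with open(corpus_path, encoding='utf-8') as f:
    corpus_lines = f.read().splitlines()
```

Blank entries are dropped so that every line in the file corresponds to exactly one document ID in the later steps.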

Tokenization code:

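import jieba

# Load the stop-word list, one word per line
stopwords = []

with open('stopwords', 'r', encoding='utf-8') as f:
    for i in f:
        word = i.strip()
        stopwords.append(word)

filename = 'test.txt'

filename1 = 'test_cws.txt'
# Write the tokenized text
def write_cws():
    num = 0  # document ID; if your data already has IDs, use those instead. Here it is just a simple counter
    # 'w' rather than 'a+' so that rerunning does not append lines with duplicate IDs
    writing = open(filename1, 'w', encoding='utf-8')
    with open(filename, 'r', encoding='utf-8') as f:
        for line in f:
            content = line.strip()
            content = content.replace(' ', '')
            seg = jieba.cut(content)
            test = ''
            for i in seg:
                if i not in stopwords:
                    test += i + ' '
            # Each output line: ID, four spaces, then space-separated tokens
            writing.write(str(num) + "    " + test + '\n')
            num += 1
    writing.close()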
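
To make the output layout of the tokenization step concrete, here is a minimal sketch of the stop-word filtering and line formatting on their own. The helper `format_tokens` and the hand-written token list are assumptions made for illustration; in the post the tokens come from `jieba.cut`. The sketch also keeps the stop words in a set, since membership tests on a list take time proportional to its length.

```python
def format_tokens(doc_id, tokens, stopwords, sep='    '):
    """Build one output line: ID, four-space separator, space-joined tokens."""
    kept = [t for t in tokens if t not in stopwords]
    return str(doc_id) + sep + ' '.join(kept)

# A set gives constant-time membership tests on average
stop_set = {'的', '了'}

# Tokens written out by hand here; in the post they come from jieba.cut
line = format_tokens(0, ['我', '的', '新闻', '标题'], stop_set)

# A title made up entirely of stop words leaves nothing after the separator
empty_line = format_tokens(1, ['的'], stop_set)
```

Note the second case: a line with an ID but no tokens is something the later steps have to tolerate.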

Forward index code (as written, it maps each word to the IDs of the documents that contain it):

filename2 = 'zhengxiang.txt'
def zhengxiang():
    index = {}  # word -> list of IDs of the documents that contain it
    with open(filename1, 'r', encoding='utf-8') as f:
        for line in f:
            parts = line.strip().split('    ')
            if len(parts) < 2:
                continue  # no tokens left after stop-word removal
            num, content = parts[0], parts[1]
            for word in content.split(' '):
                if word not in index:
                    index[word] = [num]
                elif num not in index[word]:
                    index[word].append(num)

    # One line per word: the word, four spaces, then comma-separated IDs
    # ('w' rather than 'a+' so that rerunning does not append duplicates)
    with open(filename2, 'w', encoding='utf-8') as file2:
        for word, nums in index.items():
            file2.write(word + '    ' + ','.join(nums) + '\n')
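
The same word-to-document-IDs mapping can be sketched in memory, without any files, which makes it easy to see what this step produces. The function name `build_postings` and the sample lines are assumptions made for illustration; the input layout (ID, four spaces, space-separated tokens) follows the tokenized file from the post.

```python
def build_postings(lines, sep='    '):
    """Map each word to the list of document IDs that contain it."""
    postings = {}
    for line in lines:
        parts = line.strip().split(sep)
        if len(parts) < 2:
            continue  # the line has an ID but no tokens
        doc_id, content = parts[0], parts[1]
        for word in content.split(' '):
            ids = postings.setdefault(word, [])
            # Lines are read in ID order, so checking the last entry
            # is enough to avoid listing the same document twice
            if not ids or ids[-1] != doc_id:
                ids.append(doc_id)
    return postings

# Sample tokenized lines invented for this example
sample = ['0    python index', '1    python json python', '2    ']
postings = build_postings(sample)
```

Here `postings['python']` lists both documents once each, even though the word appears twice in document 1.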

Inverted index code:

import json

# Inverted index
filename3 = 'daopai.txt'
def daopai():
    with open(filename2, 'r', encoding='utf-8') as f:
        for line in f:
            try:  # this exception handling exists because my data had some problems; data formatted like the corpus described above should not raise
                word_dict = {}  # dictionary for this word, convenient to store and read
                word_list = []  # everything recorded about this word
                syc = []  # the word plus the number of documents it appears in (one per document, however often it occurs there)

                Aword = line.strip()  # Aword stands for all_word: one line of the forward-index file
                word = Aword.split('    ')[0]
                print(word)  # progress output
                nums = Aword.split('    ')[1]
                count = len(nums.split(','))
                syc.append(word + ' ' + str(count))
                word_list.append(syc)
                with open(filename1, 'r', encoding='utf-8') as r:
                    for line1 in r:
                        parts = line1.strip().split('    ')
                        if len(parts) < 2:
                            continue  # document with no tokens
                        num, words = parts[0], parts[1].split(' ')
                        acount = words.count(word)  # occurrences of the word in this document
                        if acount:
                            # record where the word occurs and how many times
                            word_list.append([num, acount])
                word_dict[word] = word_list
                # separate name so the outer file handle f is not shadowed
                with open(filename3, 'a', encoding='utf-8') as out:
                    json.dump(word_dict, out, ensure_ascii=False)
                    out.write(',')
                    out.write('\n')
            except Exception as e:
                print(line)
                print(e)
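
The function above rereads the whole tokenized file once for every word, so the work grows with the number of words times the number of documents. As an alternative sketch (not the author's method), the same structure can be built in a single pass with `collections.Counter`. The names `build_inverted` and `to_post_format` and the sample lines are assumptions made for illustration.

```python
from collections import Counter

def build_inverted(lines, sep='    '):
    """Return {word: [[doc_id, term_frequency], ...]} in a single pass."""
    inverted = {}
    for line in lines:
        parts = line.strip().split(sep)
        if len(parts) < 2:
            continue  # the line has an ID but no tokens
        doc_id = parts[0]
        # Counter gives the number of occurrences of each word in this document
        for word, tf in Counter(parts[1].split(' ')).items():
            inverted.setdefault(word, []).append([doc_id, tf])
    return inverted

def to_post_format(inverted):
    """Prepend the 'word document_count' header entry used in the post."""
    return {w: [[w + ' ' + str(len(p))]] + p for w, p in inverted.items()}

# Sample tokenized lines invented for this example
sample = ['0    python index', '1    python json python']
inverted = build_inverted(sample)
formatted = to_post_format(inverted)
```

The trade-off is memory: this version holds the whole index in a dictionary, whereas the brute-force version only ever holds one word's entries at a time.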

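The code works in stages: the raw corpus is tokenized, the tokenized file is used to build the forward index, and the forward index is used to build the inverted index. So when you run it, call write_cws(), zhengxiang(), and daopai() in that order.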
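
Once all three steps have run, each line of daopai.txt is a JSON object followed by a comma, so the file as a whole is not valid JSON. The sketch below shows one way to read it back line by line. The function name `parse_index_lines` and the sample lines are assumptions made for illustration; the line layout follows what `daopai()` writes.

```python
import json

def parse_index_lines(lines):
    """Merge lines of the form '{"word": [...]},' into one dictionary."""
    index = {}
    for line in lines:
        line = line.strip()
        if line.endswith(','):
            line = line[:-1]  # drop the trailing comma written after each object
        if not line:
            continue
        index.update(json.loads(line))
    return index

# Sample lines invented for this example, in the layout daopai() produces
sample = [
    '{"python": [["python 2"], ["0", 1], ["1", 2]]},',
    '{"json": [["json 1"], ["1", 1]]},',
]
index = parse_index_lines(sample)

# The first entry is the "word document_count" header, the rest are [doc_id, count] pairs
header, *postings = index['python']
```

With a real file, the same function can be called as `parse_index_lines(open('daopai.txt', encoding='utf-8'))`.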

If this was of some help, please give it a like. Thanks!

Reposted from: https://www.cnblogs.com/chenyuan404/p/10147531.html
