Information Retrieval and Search Engines at an Agricultural University: Lab 2

Article link: https://blog.csdn.net/qssssss79/article/details/131293387

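I. Text Preprocessing

1. Word Segmentation

Requirement: following the logic described above, implement forward maximum matching with character reduction (正向减字最大匹配法), using the string "今天是中华人民共和国获得奥运会举办权的日子" as the test case. The dictionary used by the code can be taken from dict_example.txt.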

dic = []


def init():
    with open("dict_example.txt", "r", encoding='utf-8') as f:
        for i in f:
            dic.append(i.strip())


def pipei(juzi):
    # Segments cut so far
    cut_word = []
    # Counter for the candidate strings printed as a trace
    i = 0
    while len(juzi) > 0:
        # Start each round with the whole remaining sentence as the candidate sub
        sub = juzi
        # Shrink the candidate from the right until one word is cut off the left
        while True:
            print("s[" + str(i) + "]:" + sub)
            i += 1
            # Accept a dictionary word, or fall back to a single character
            if sub in dic or len(sub) == 1:
                cut_word.append(sub)
                break
            sub = sub[:-1]
        # Remove the word that was just cut off
        juzi = juzi[len(sub):]
    return cut_word


init()
print(dic)
k = "今天是中华人民共和国获得奥运会举办权的日子"
result = pipei(k)
# Join the segments with "/" for display
seg = "/".join(result)
print("Result:")
print(seg)

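Because this task assumes that the longest word has 7 characters, the matching window should be fixed at 7 from the start. Each round takes the next 7 characters as the candidate. If the candidate is in the dictionary, it is cut off as a word; otherwise its last character is dropped and the match is retried, so the first hit is the longest match. The code is as follows: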

dic = []


def init():
    with open("dict_example.txt", "r", encoding='utf-8') as f:
        for i in f:
            dic.append(i.strip())


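def pipei(juzi):
    # Fixed window: the task assumes the longest dictionary word has 7 characters
    max_length = 7
    # Counter for the candidate strings printed as a trace
    i = 0
    # Collect segments across all input sentences (juzi is a list of sentences)
    cut_word = []
    for word in juzi:
        while len(word) > 0:
            # Candidate: the next (up to) max_length characters
            sub = word[0: max_length]
            # Drop the last character until the candidate is a dictionary word
            # or only one character is left
            while True:
                print("s[" + str(i) + "]:" + sub)
                i += 1
                if sub in dic or len(sub) == 1:
                    break
                sub = sub[0:len(sub)-1]
            cut_word.append(sub)
            # Remove the word that was just cut off
            word = word[len(sub):]
    return cut_word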


init()
print(dic)
k = ["今天是中华人民共和国获得奥运会举办权的日子"]
result = pipei(k)
# Join the segments with "/" for display
seg = "/".join(result)
print("Result:")
print(seg)
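
To see why fixing the window at 7 helps, the two variants above can be compared on a small in-memory dictionary. The following is a minimal sketch, not the lab's reference solution: the function forward_max_match, its max_len parameter, the tries counter, and the tiny vocab set are all assumptions made for illustration, and vocab is not the content of dict_example.txt.

```python
def forward_max_match(text, vocab, max_len=None):
    """Forward maximum matching with character reduction.

    Returns (segments, tries), where tries counts the candidates examined.
    max_len=None starts every round from the whole remaining text;
    an integer max_len starts from a fixed-size window instead.
    """
    segments = []
    tries = 0
    pos = 0
    while pos < len(text):
        # Right edge of the first (longest) candidate in this round
        if max_len is None:
            end = len(text)
        else:
            end = min(pos + max_len, len(text))
        while True:
            tries += 1
            # Accept a dictionary word, or fall back to a single character
            if end - pos == 1 or text[pos:end] in vocab:
                break
            # Drop the last character of the candidate and retry
            end -= 1
        segments.append(text[pos:end])
        pos = end
    return segments, tries


# Hypothetical dictionary for illustration only (not dict_example.txt)
vocab = {"今天", "是", "中华", "中华人民共和国", "获得",
         "奥运会", "举办权", "的", "日子"}
sentence = "今天是中华人民共和国获得奥运会举办权的日子"

full_segments, full_tries = forward_max_match(sentence, vocab)
fixed_segments, fixed_tries = forward_max_match(sentence, vocab, max_len=7)

print("/".join(fixed_segments))
print("candidates tried without a window:", full_tries)
print("candidates tried with max_len=7:", fixed_tries)
```

Both calls produce the same segments as long as no dictionary word is longer than max_len; the fixed window only reduces the number of candidates that are checked. Using a set for vocab also makes each membership test constant time on average, whereas the list used in the listings above is scanned linearly.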

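3. Using the open-source Python library jieba

This part mainly uses jieba.cut(sentence, cut_all=False, HMM=True). The three parameters used here are: sentence, the string to segment, which may be unicode, UTF-8, or GBK encoded; cut_all, which controls whether full mode is used and defaults to False; and HMM, which controls whether the HMM model is used and defaults to True. The call returns a generator, so the segments can be obtained by converting it to a list or by iterating over it with a for loop.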

import jieba

# Accurate mode (the default)
ds = jieba.cut("今天是中华人民共和国获得奥运会举办权的日子.")
print(list(ds))

# Full mode
ds = jieba.cut("今天是中华人民共和国获得奥运会举办权的日子.", cut_all=True)
print(list(ds))

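Loading a custom dictionary

First create a text file with one word per line and save it. I saved mine to "C:\Users\qssss\PycharmProjects\pythonProject\信息检索\实验二\一、文本预处理\udict.txt".

Then load it with the load_userdict method, after which segmentation uses the custom words.

The code is as follows: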

import jieba
# The word 大数据 must first be saved in the custom dictionary file
jieba.load_userdict(r"C:\Users\qssss\PycharmProjects\pythonProject\信息检索\实验二\一、文本预处理\udict.txt")
ds = jieba.cut("这是一本大数据相关专业的教材.")
print(list(ds))

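II. Implementing an Inverted Index

Required functionality: implement a function get_reverse_index that returns a dictionary whose keys are words and whose values are lists of the number of times the word occurs in each line of the document. In the example result given with the assignment, '网络' occurs once in each of the first and third lines, and '的' occurs twice in each of the first and second lines.

We again use jieba for segmentation. Counting then combines a dictionary with lists. In a first pass, each word's occurrences within the current line are counted and stored in the cut_words dictionary. In a second pass, each count is paired with its line number and appended to that word's list in the dic dictionary, which gives the final result.

The code is as follows: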

import jieba
stopwords = []


def remove_stopword(word_list):
    if len(stopwords) == 0:
        with open('stopword.txt', "r", encoding="utf-8") as word_input:
            for word in word_input:
                stopwords.append(word.split("\n")[0].strip())
    new_word_list = []
    for word in word_list:
        if word in stopwords:
            continue
        new_word_list.append(word)
    return new_word_list


def get_reverse_index(filepath):
    dic = {}
    with open(filepath, "r", encoding='utf-8') as f:
        # line_no is the 1-based line number; it is kept separate from the
        # per-word count so the two are never overwritten by each other
        for line_no, line in enumerate(f, start=1):
            cut_words = {}
            ac = list(jieba.cut(line.strip()))
            ac = remove_stopword(ac)

            # First pass: count each word within the current line
            for word in ac:
                if word == '\n' or word == '。' or word == '，':
                    continue
                cut_words[word] = cut_words.get(word, 0) + 1

            # Second pass: record "line number:count" for each word
            for word, count in cut_words.items():
                t = str(line_no) + ':' + str(count)
                dic.setdefault(word, []).append(t)
    print(dic)
    return dic


if __name__ == '__main__':
    dic = get_reverse_index('data.txt')
    while 1:
        search_word = input('Please input the word you want to search: ')
        if search_word in dic:
            print(dic.get(search_word))
        else:
            print(-1)
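
The two-pass idea can also be tried without jieba or any input files. The sketch below is only an illustration under stated assumptions: the function build_inverted_index and the sample lines are hypothetical, the lines are assumed to be already segmented into whitespace-separated tokens, and each posting is stored as a (line number, count) tuple rather than the "line number:count" string used above.

```python
from collections import Counter, defaultdict


def build_inverted_index(lines, tokenize=str.split):
    """Map each word to a list of (line_no, count) pairs, one per line it occurs in."""
    index = defaultdict(list)
    for line_no, line in enumerate(lines, start=1):
        # First pass: count each word within the current line
        counts = Counter(tokenize(line))
        # Second pass: attach the line number and store the posting
        for word, count in counts.items():
            index[word].append((line_no, count))
    return dict(index)


# Hypothetical, pre-segmented sample lines (tokens separated by spaces)
lines = [
    "网络 的 发展 改变 了 人们 的 生活",
    "信息 的 检索 依赖 索引 的 质量",
    "搜索 引擎 抓取 网络 页面",
]

index = build_inverted_index(lines)
print(index["网络"])           # [(1, 1), (3, 1)]
print(index["的"])             # [(1, 2), (2, 2)]
print(index.get("文档", -1))   # -1, the word does not occur
```

Keeping the line number and the count as separate fields of a tuple avoids parsing strings later, for example when ranking lines by how often a query word occurs in them.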