Python NLP: Summary of Text Similarity Computation Experiments

Latest recommended article published on 2024-08-07 17:50:00

Kang_TJU

Categories: Machine Learning, NLP, Python Learning

Article link: https://blog.csdn.net/Kang_TJU/article/details/53771335

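This post summarizes the text similarity code I wrote for my experiments. The task: given a corpus, compute the similarity between any two documents in it. The input is the corpus, and the output is the similarity matrix for the entire corpus.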
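To make the input and output concrete, here is a minimal sketch of the shape of the task: a list of tokenized documents goes in, and a symmetric N x N matrix comes out. It uses plain bag-of-words cosine similarity from the standard library, not the LDA approach this post describes, and the names `cosine` and `similarity_matrix` are illustrative assumptions rather than part of the original code.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse count vectors (Counter objects)."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is treated as similar to nothing
    return dot / (norm_a * norm_b)


def similarity_matrix(token_lists):
    """Return an N x N list of lists holding all pairwise similarities."""
    vectors = [Counter(tokens) for tokens in token_lists]
    n = len(vectors)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            sim = cosine(vectors[i], vectors[j])
            matrix[i][j] = sim
            matrix[j][i] = sim  # similarity is symmetric
    return matrix


# Toy corpus (made up for illustration): each document is already tokenized.
corpus = [
    ["text", "similarity", "with", "lda"],
    ["text", "similarity", "with", "tfidf"],
    ["cooking", "pasta", "at", "home"],
]
matrix = similarity_matrix(corpus)
```

Entry `matrix[i][j]` is the similarity between documents `i` and `j`. An LDA-based version would keep the same matrix layout and only swap the word-count vectors for per-document topic distributions.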

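Text similarity computation based on the LDA model

The main steps are as follows:

Text preprocessing
Training the LDA model
Similarity computation
Saving the results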

Each step is described in turn below.
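As a small preview of the last step in the list, saving the results, the sketch below writes a similarity matrix to CSV and reads it back. The CSV layout, the six-decimal precision, and the helper names `save_matrix` and `load_matrix` are assumptions made for illustration. The storage format actually used in these experiments may differ.

```python
import csv
import io


def save_matrix(matrix, out_file):
    """Write a similarity matrix as CSV, one row per document."""
    writer = csv.writer(out_file)
    for row in matrix:
        writer.writerow(["%.6f" % value for value in row])


def load_matrix(in_file):
    """Read the matrix back as a list of lists of floats."""
    return [[float(cell) for cell in row] for row in csv.reader(in_file)]


# Toy 2 x 2 matrix standing in for the real output of the similarity step.
toy = [[1.0, 0.25], [0.25, 1.0]]

buffer = io.StringIO()  # in-memory stand-in for an open file
save_matrix(toy, buffer)
buffer.seek(0)
restored = load_matrix(buffer)
```

With a real file, `open(path, "w", newline="")` in Python 3 would replace the in-memory buffer.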

Text preprocessing (pre_process.py)


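# -*- coding: utf-8 -*-

'''

preprocess.py
This file handles document preprocessing:
it turns each document into its corresponding token_list.
Only the final function, documents_pre_process, needs to be called.

'''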

import nltk
import traceback
import jieba
from nltk.corpus import stopwords
from nltk.stem.lancaster import LancasterStemmer
from collections import defaultdict

# Tokenization - English
def tokenize(document):
    try:

        token_list = nltk.word_tokenize(document)

        #print "[INFO]: tokenize is finished!"
        return token_list

    except Exception as e:
        # print_exc() prints the traceback itself and returns None
        traceback.print_exc()

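# Tokenization - Chinese
def tokenize_chinese(document):
    try:

        # jieba.cut returns a generator; build a list so the tokens
        # can be iterated more than once by the later steps
        token_list = list(jieba.cut( document, cut_all=False ))

        #print "[INFO]: tokenize_chinese is finished!"
        return token_list

    except Exception as e:
        traceback.print_exc()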

# Remove stopwords
def filtered_stopwords(token_list):
    try:

        # Load the stopword list once, not once per token
        english_stopwords = set(stopwords.words("english"))
        token_list_without_stopwords = [ word for word in token_list
                                         if word not in english_stopwords ]


        #print "[INFO]: filtered_words is finished!"
        return token_list_without_stopwords
    except Exception as e:
        traceback.print_exc()

# Remove punctuation
def filtered_punctuations(token_list):
    try:
        punctuations = ['', '\n', '\t', ',', '.', ':', ';', '?', '(', ')', '[', ']',