Personal Project Assignment: Paper Plagiarism Checker

3121005062郑乾睿

Last modified 2023-03-08 20:28:24

Tags: python

First published 2023-03-08 20:18:48

Article link: https://blog.csdn.net/UKBigBen/article/details/129402148

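Assignment requirements	https://bbs.csdn.net/topics/613858565
Assignment goals	Code implementation, performance analysis, unit testing, exception handling notes, PSP table
Assignment gitcode link	https://gitcode.net/UKBigBen/3121005062/-/tree/master/3121005062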

APIs Used

jieba.cut

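Segments Chinese sentences into words. The method offers several segmentation modes; only the default and simplest one, "accurate mode", is needed here. The code below calls jieba.lcut, which performs the same segmentation and returns the result as a list.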

re.match

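Because the items being compared are Chinese or English words, newline characters (\n) and punctuation in the file data should be filtered out. A regular expression is used here to keep only the tokens that qualify.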

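Code:

import re
import jieba

def filter(text):  # note: this name shadows the built-in filter()
    # Segment with jieba, then keep only tokens that start with a letter,
    # a digit, or a Chinese character; punctuation and "\n" are dropped.
    tokens = jieba.lcut(text)
    result = []
    for token in tokens:
        if re.match(r"[a-zA-Z0-9\u4e00-\u9fa5]", token):
            result.append(token)
    return result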
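
The check above relies on re.match testing only the start of each token. A minimal sketch of that behavior, using a hand-written token list in place of real jieba output (the list and the helper name keep_word_tokens are assumptions made for illustration):

```python
import re

# Matches when the FIRST character is an ASCII letter, a digit,
# or a character in the range \u4e00-\u9fa5 (common Chinese characters).
WORD_START = re.compile(r"[a-zA-Z0-9\u4e00-\u9fa5]")

def keep_word_tokens(tokens):
    """Keep tokens whose first character is a letter, digit, or Chinese character."""
    return [tok for tok in tokens if WORD_START.match(tok)]

# Hand-written tokens standing in for the output of jieba.lcut (an assumption).
tokens = ["今天", "，", "天气", "\n", "good", "!", "2023", " "]
print(keep_word_tokens(tokens))
```

Since only the first character is inspected, a token such as "a,b" would be kept whole, which is acceptable here because jieba normally emits punctuation as separate tokens.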

gensim.corpora.Dictionary.doc2bow

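The bag-of-words model (BoW model) first appeared in natural language processing (NLP) and information retrieval (IR). It ignores the grammar and word order of a text and treats it simply as a collection of words, where each word's occurrence is independent of the others.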

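Code:

import gensim

def convert_corpus(text1, text2):
    # Build one dictionary over both token lists, then convert each list
    # into a sparse bag-of-words vector of (token_id, count) pairs.
    texts = [text1, text2]
    dictionary = gensim.corpora.Dictionary(texts)
    corpus = [dictionary.doc2bow(text) for text in texts]
    return corpus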
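
The bag-of-words idea can be illustrated without gensim. The sketch below is a plain-Python approximation, not gensim's implementation: it gives each distinct word an integer id and counts occurrences, producing (id, count) pairs similar in shape to what doc2bow returns. The helper names build_vocab and to_bow are hypothetical, and gensim may assign ids in a different order.

```python
from collections import Counter

def build_vocab(texts):
    """Map each distinct token to an integer id, in order of first appearance."""
    vocab = {}
    for tokens in texts:
        for tok in tokens:
            vocab.setdefault(tok, len(vocab))
    return vocab

def to_bow(tokens, vocab):
    """Return sorted (token_id, count) pairs; word order is discarded."""
    counts = Counter(vocab[tok] for tok in tokens if tok in vocab)
    return sorted(counts.items())

# Two small pre-segmented documents (made up for illustration).
text1 = ["我", "喜欢", "看", "电影"]
text2 = ["我", "喜欢", "喜欢", "读书"]
vocab = build_vocab([text1, text2])
print(to_bow(text1, vocab))
print(to_bow(text2, vocab))
```

Two documents that share many words end up with vectors that overlap in many ids, which is what the similarity step measures.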

gensim.similarities.Similarity

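This class can be used to compute cosine similarity. Code:

def calc_similarity(text1, text2):
    # The dictionary is built here because num_features needs its size
    # (convert_corpus returns only the corpus).
    texts = [text1, text2]
    dictionary = gensim.corpora.Dictionary(texts)
    corpus = [dictionary.doc2bow(text) for text in texts]
    similarity = gensim.similarities.Similarity('-Similarity-index', corpus, num_features=len(dictionary))
    # Query with text1; index 1 holds its similarity to text2.
    test_corpus_1 = dictionary.doc2bow(text1)
    cosine_sim = similarity[test_corpus_1][1]
    return cosine_sim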
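
For reference, cosine similarity is the dot product of two vectors divided by the product of their lengths. The following is a minimal plain-Python sketch of that formula over word counts; it is an illustration of the quantity being computed, not gensim's code, and the function name cosine_similarity is an assumption.

```python
import math
from collections import Counter

def cosine_similarity(tokens_a, tokens_b):
    """Cosine of the angle between two term-count vectors."""
    a, b = Counter(tokens_a), Counter(tokens_b)
    # Only words present in both documents contribute to the dot product.
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document has no direction; treat it as dissimilar
    return dot / (norm_a * norm_b)

# Two documents sharing two of three words: 2 / (sqrt(3) * sqrt(3)) = 2/3.
print(round(cosine_similarity(["a", "b", "c"], ["a", "b", "d"]), 4))
```

The value ranges from 0 (no shared words) to 1 (identical word distributions), which is why it can be reported directly as a similarity score.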

Code Implementation

# coding: utf-8
import jieba
import gensim
import re
import os

jieba.setLogLevel(jieba.logging.INFO)

# Read the contents of the file at the given path
def get_file_contents(path):
    # The with-statement closes the file even if reading fails
    with open(path, 'r', encoding='UTF-8') as f:
        return f.read()

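# Segment the file contents with jieba, then filter out punctuation,
# escape characters, and other special symbols
def filter(text):  # note: this name shadows the built-in filter()
    tokens = jieba.lcut(text)
    result = []
    for token in tokens:
        # Keep tokens that start with a letter, digit, or Chinese character
        if re.match(r"[a-zA-Z0-9\u4e00-\u9fa5]", token):
            result.append(token)
    return result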

# Ignore grammar and word order; treat each text simply as a collection of words
def convert_corpus(text1, text2):
    texts = [text1, text2]
    dictionary = gensim.corpora.Dictionary(texts)
    corpus = [dictionary.doc2bow(text) for text in texts]
    return corpus

# Take the filtered token lists and compute their cosine similarity
# with gensim.similarities.Similarity
def calc_similarity(text1, text2):
    texts = [text1, text2]
    dictionary = gensim.corpora.Dictionary(texts)
    corpus = [dictionary.doc2bow(text) for text in texts]
    similarity = gensim.similarities.Similarity('-Similarity-index', corpus, num_features=len(dictionary))
    # Query with text1; index 1 holds its similarity to text2
    test_corpus_1 = dictionary.doc2bow(text1)
    cosine_sim = similarity[test_corpus_1][1]
    return cosine_sim


if __name__ == '__main__':
    # Raw strings keep "\t" in "\test1.txt" from being read as a tab character
    path1 = r"D:\pythonProject\test1.txt"  # absolute path of the original paper (assignment requirement)
    path2 = r"D:\pythonProject\test2.txt"  # absolute path of the plagiarized paper
    save_path = r"D:\pythonProject\save.txt"  # absolute path of the output file
    str1 = get_file_contents(path1)
    str2 = get_file_contents(path2)
    text1 = filter(str1)
    text2 = filter(str2)
    similarity = calc_similarity(text1, text2)
    print("Article similarity: %.2f" % similarity)
    # Write the similarity result to the output file
    with open(save_path, 'w', encoding="utf-8") as f:
        f.write("python main.py " + path1 + " " + path2 + " " + "Article similarity: %.2f" % similarity)
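
The string written to the output file resembles a command-line invocation, so the hard-coded paths could instead come from program arguments. A minimal sketch, assuming an invocation of the form `python main.py <original> <copy> <output>`; this argument order and the helper name parse_paths are assumptions, not taken from the code above.

```python
def parse_paths(argv):
    """Return (original_path, copied_path, output_path) from an argv-style list.

    Assumes the form `python main.py <original> <copy> <output>`;
    the argument order is an assumption made for this sketch.
    """
    if len(argv) != 4:
        # SystemExit with a message prints it to stderr and exits with status 1.
        raise SystemExit("usage: python main.py <original> <copy> <output>")
    return argv[1], argv[2], argv[3]

# Example with an explicit list; a real script would pass sys.argv instead.
print(parse_paths(["main.py", "orig.txt", "copy.txt", "ans.txt"]))
```

Taking the paths as arguments also avoids backslash-escape problems in Windows path literals, since nothing is typed into the source code.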

Test sample and run result:

Performance Analysis

Improved code:

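def filter(string):
    # Strip every character that is not a letter, digit, or Chinese character
    # in a single pass, then segment the remaining text with jieba.
    # Caveat: removing spaces also joins adjacent English words together.
    pattern = re.compile(r"[^a-zA-Z0-9\u4e00-\u9fa5]")
    string = pattern.sub("", string)
    result = jieba.lcut(string)
    return result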

Time cost after the improvement:
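
Timings of this kind can be gathered with the standard-library cProfile module. The sketch below is one possible setup, not necessarily the one used for this assignment: it profiles a per-token re.match loop over a pre-split token list, so that it runs without jieba. The names slow_filter and profile_call are hypothetical.

```python
import cProfile
import io
import pstats
import re

def slow_filter(tokens):
    """Per-token re.match, as in the original filter (tokens are pre-split here)."""
    return [t for t in tokens if re.match(r"[a-zA-Z0-9\u4e00-\u9fa5]", t)]

def profile_call(func, *args):
    """Run func(*args) under cProfile; return (result, text report)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args)
    buffer = io.StringIO()
    # Sort by cumulative time and print only the five most expensive entries.
    pstats.Stats(profiler, stream=buffer).sort_stats("cumulative").print_stats(5)
    return result, buffer.getvalue()

# Synthetic input: 40,000 tokens, half of them punctuation.
tokens = ["论文", "，", "查重", "。"] * 10000
result, report = profile_call(slow_filter, tokens)
print(len(result))
print(report)
```

Running the same helper on both versions of filter shows where the time goes, for example how many calls to re.match the per-token loop makes.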

Unit Testing

To make unit testing easier, the main entry point is modified as follows:

if __name__ == '__main__':
    path1 = input("Enter the absolute path of the original paper: ")
    path2 = input("Enter the absolute path of the plagiarized paper: ")
    # Raw string so the backslashes are not treated as escape sequences
    save_path = r"D:\pythonProject\save.txt"  # absolute path of the output file
    str1 = get_file_contents(path1)
    str2 = get_file_contents(path2)
    text1 = filter(str1)
    text2 = filter(str2)
    similarity = calc_similarity(text1, text2)
    print("Article similarity: %.2f" % similarity)
    # Write the similarity result to the output file
    with open(save_path, 'w', encoding="utf-8") as f:
        f.write("python main.py " + path1 + " " + path2 + " " + "Article similarity: %.2f" % similarity)
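
A unit test for one of the helpers might look like the sketch below, which uses the standard unittest and tempfile modules. It includes a simplified copy of get_file_contents so that it runs on its own; the test class and method names are assumptions for illustration.

```python
import os
import tempfile
import unittest

def get_file_contents(path):
    """Simplified copy of the article's reader, included so the test runs standalone."""
    with open(path, "r", encoding="UTF-8") as f:
        return f.read()

class GetFileContentsTest(unittest.TestCase):
    def test_reads_utf8_text(self):
        # Write a temporary UTF-8 file, then read it back through the function.
        with tempfile.TemporaryDirectory() as tmp:
            path = os.path.join(tmp, "sample.txt")
            with open(path, "w", encoding="UTF-8") as f:
                f.write("今天是星期天\n天气晴")
            self.assertEqual(get_file_contents(path), "今天是星期天\n天气晴")

    def test_missing_file_raises(self):
        with tempfile.TemporaryDirectory() as tmp:
            with self.assertRaises(FileNotFoundError):
                get_file_contents(os.path.join(tmp, "missing.txt"))

# Run the tests explicitly; unittest.main() would also work in a script.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(GetFileContentsTest)
outcome = unittest.TextTestRunner(verbosity=0).run(suite)
```

Temporary files keep the tests independent of any fixed location such as D:\pythonProject.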

Exception Handling

Before reading a file, the program first checks whether it exists; if it does not, it reports the problem and exits.

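if __name__ == '__main__':
    path1 = input("Enter the absolute path of the original paper: ")
    path2 = input("Enter the absolute path of the plagiarized paper: ")
    # Stop early with a message if either input file is missing
    if not os.path.exists(path1):
        print("The original paper file does not exist!")
        exit()
    if not os.path.exists(path2):
        print("The plagiarized paper file does not exist!")
        exit()
    # Raw string so the backslashes are not treated as escape sequences
    save_path = r"D:\pythonProject\save.txt"  # absolute path of the output file
    str1 = get_file_contents(path1)
    str2 = get_file_contents(path2)
    text1 = filter(str1)
    text2 = filter(str2)
    similarity = calc_similarity(text1, text2)
    print("Article similarity: %.2f" % similarity)
    # Write the similarity result to the output file
    with open(save_path, 'w', encoding="utf-8") as f:
        f.write("python main.py " + path1 + " " + path2 + " " + "Article similarity: %.2f" % similarity)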
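
An alternative to checking os.path.exists first is to attempt the read and handle the failure, which also covers files that exist but cannot be decoded. A minimal sketch; the helper name read_text_or_exit and its messages are assumptions for illustration.

```python
import sys

def read_text_or_exit(path, label):
    """Read a UTF-8 file, or exit with an error message and a non-zero status."""
    try:
        with open(path, "r", encoding="UTF-8") as f:
            return f.read()
    except FileNotFoundError:
        # sys.exit with a string prints it to stderr and exits with status 1.
        sys.exit(f"{label} does not exist: {path}")
    except UnicodeDecodeError:
        sys.exit(f"{label} is not valid UTF-8: {path}")
```

A call such as `read_text_or_exit(path1, "Original paper")` would then replace both the existence check and the call to get_file_contents.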

PSP Table

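PSP2.1	Personal Software Process Stages	Estimated time (minutes)	Actual time (minutes)
Planning	Planning	120	180
· Estimate	· Estimate how much time the task will take	240	300
Development	Development	300	320
· Analysis	· Requirements analysis (including learning new technologies)	120	100
· Design Spec	· Write the design document	30	30
· Design Review	· Design review	30	20
· Coding Standard	· Coding standard (define suitable conventions for the current development)	20	10
· Design	· Detailed design	120	150
· Coding	· Detailed coding	20	20
· Code Review	· Code review	20	10
· Test	· Testing (self-testing, modifying code, committing changes)	30	20
Reporting	Reporting	60	50
· Test Report	· Test report	20	25
· Size Measurement	· Measure the workload	5	5
· Postmortem & Process Improvement Plan	· Postmortem and process improvement plan	5	10
Total	· Total	1120	1250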