2021.3.19 Project Stage Report
Overview
After some discussion we settled on the division of labor; the module I am responsible for is the plagiarism-check part, so this report covers my work on it.
Because Python has ready-made libraries for word segmentation and similar operations, the current code is written in Python.
This Week's Summary
1. Learned how to use PyCharm and PyQt5 and built my own interface; putting a UI together with PyQt5 is fairly straightforward. Below is the first-version test interface. It only needs to look roughly right; what matters is whether the test code runs correctly.
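For reference, the skeleton of a PyQt5 test window like this takes only a few lines. This is a minimal sketch, not the actual first-version UI; the window title and widget labels are placeholders:

# ui_sketch.py (a minimal PyQt5 skeleton; labels are placeholders)
import sys

from PyQt5.QtWidgets import QApplication, QPushButton, QTextEdit, QVBoxLayout, QWidget

app = QApplication(sys.argv)
window = QWidget()
window.setWindowTitle('查重测试')  # "plagiarism-check test"
layout = QVBoxLayout(window)
layout.addWidget(QTextEdit())              # box for the text to be checked
layout.addWidget(QPushButton('开始查重'))  # button that would trigger the check
window.show()
sys.exit(app.exec_())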
2. Used Python's JPype library to bridge Java and Python. The detailed process is shown in another blog post; a minimal sketch of the idea follows.
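The core of the bridge is just starting a JVM with the Java code on the classpath and then calling into it. In this sketch the jar path, class name, and method are hypothetical, not the ones from that post:

# jpype_bridge.py (a sketch; the jar path and Java class are hypothetical)
import jpype

# start the JVM once, with our Java code on the classpath
jpype.startJVM(classpath=['D:/getTheWord/checker.jar'])

# load a Java class and call its static method like a Python function
Checker = jpype.JClass('com.example.Checker')  # hypothetical class
print(Checker.check('some text'))              # hypothetical method

jpype.shutdownJVM()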
3. Studied text-similarity algorithms, including:
- cosine similarity
- sparse matrices
- simhash
- TF-IDF
4. Tried out the algorithms above:
- First, cosine similarity. The main idea is to count how often each keyword occurs in a document, then compare two documents by the cosine similarity of their count vectors. It has a flaw: the counts should be weighted with TF-IDF, which would cancel out the differences caused by the two documents having different lengths and total word counts, and is more principled. This is only a first version, though; a TF-IDF-weighted variant is sketched after the run output below.
# cosine.py
import math

import jieba

# punctuation (full- and half-width) to strip out
excludes = {',', '。', '/', '《', '》', '?', ';', '‘', ':', '“', '【', '】', '{', '}',
            '、', '|', '!', '@', '#', '¥', '%', '……', '&', '*', '(', ')', '-', '=',
            '——', '+', '·', '~', '”',
            ',', '.', '/', '<', '>', '?', ';', '\'', ':', '"', '[', ']', '{', '}', '\\', '|',
            '~', '`', '!', '@', '#', '$', '%', '^', '&', '*', '(', ')', '_', '-', '+', '=',
            ' ', '\n'}

def getThetxt(path):
    # read the whole file and strip punctuation
    with open(path, encoding='utf-8') as f:
        d1 = f.read()
    for word in excludes:
        d1 = d1.replace(word, '')
    return d1

def getTheFrequency(words):
    # count how often each (non-empty) word occurs
    freq = {}
    for word in words:
        if word != '':
            freq[word] = freq.get(word, 0) + 1
    return freq

def compute_cosine(txt_a, txt_b):
    first = getThetxt(txt_a)
    second = getThetxt(txt_b)
    first_list = [word for word in jieba.cut(first)]
    second_list = [word for word in jieba.cut(second)]
    # frequencies: from list to dict
    first_dict = getTheFrequency(first_list)
    second_dict = getTheFrequency(second_list)
    # joint vocabulary: every word from either document
    keyWords = list(first_dict.keys())
    for word in second_dict:
        if word not in keyWords:
            keyWords.append(word)
    # count vectors over the joint vocabulary
    vect_f = [first_dict.get(word, 0) for word in keyWords]
    vect_s = [second_dict.get(word, 0) for word in keyWords]
    # cosine similarity of the two count vectors
    dot = 0
    sq1 = 0
    sq2 = 0
    for i in range(len(vect_f)):
        dot += vect_f[i] * vect_s[i]
        sq1 += vect_f[i] ** 2
        sq2 += vect_s[i] ** 2
    try:  # round() keeps two decimal places
        result = round(float(dot) / (math.sqrt(sq1) * math.sqrt(sq2)), 2)
    except ZeroDivisionError:
        result = 0.0
    return result

if __name__ == '__main__':
    print(compute_cosine('C:/Users/60917/Desktop/第一章.txt', 'C:/Users/60917/Desktop/第三章.txt'))
    print(compute_cosine('C:/Users/60917/Desktop/第二章.txt', 'C:/Users/60917/Desktop/第三章.txt'))
Run output:
D:\pythonProject\venv\Scripts\python.exe D:/getTheWord/cosine.py
Building prefix dict from the default dictionary ...
Loading model from cache C:\Users\60917\AppData\Local\Temp\jieba.cache
Loading model cost 1.067 seconds.
Prefix dict has been built successfully.
0.78
0.89
Process finished with exit code 0
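As noted above, the counts should really be TF-IDF weighted. A minimal sketch of that variant, using jieba.analyse.extract_tags, which returns (word, weight) pairs based on jieba's built-in IDF table; topK=200 is an arbitrary choice:

# tfidf_cosine.py (a sketch of the TF-IDF-weighted variant)
import math

import jieba.analyse

def tfidf_vector(text, top_k=200):
    # word -> TF-IDF weight, using jieba's default IDF table
    return dict(jieba.analyse.extract_tags(text, topK=top_k, withWeight=True))

def tfidf_cosine(text_a, text_b):
    wa = tfidf_vector(text_a)
    wb = tfidf_vector(text_b)
    dot = sum(wa.get(w, 0.0) * wb.get(w, 0.0) for w in set(wa) | set(wb))
    na = math.sqrt(sum(v * v for v in wa.values()))
    nb = math.sqrt(sum(v * v for v in wb.values()))
    return dot / (na * nb) if na and nb else 0.0

Running this on the same chapter pairs would show whether the length bias actually shrinks.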
- The second approach compares documents through a sparse similarity matrix and applies TF-IDF weighting, which makes the statistics more principled. It still has a flaw: because it calls the TF-IDF library directly, the results are somewhat biased for our lab-report use case. The TF-IDF weights should be trained on a large number of lab reports so that they are tailored to this domain; a sketch of that follows the run output below.
# SparseMatrix.py
import jieba
import gensim

# punctuation (full- and half-width) to strip out
excludes = {',', '。', '/', '《', '》', '?', ';', '‘', ':', '“', '【', '】', '{', '}',
            '、', '|', '!', '@', '#', '¥', '%', '……', '&', '*', '(', ')', '-', '=',
            '——', '+', '·', '~', '”',
            ',', '.', '/', '<', '>', '?', ';', '\'', ':', '"', '[', ']', '{', '}', '\\', '|',
            '~', '`', '!', '@', '#', '$', '%', '^', '&', '*', '(', ')', '_', '-', '+', '=',
            ' ', '\n'}

def getThetxt(path):
    # read the whole file and strip punctuation
    with open(path, encoding='utf-8') as f:
        d1 = f.read()
    for word in excludes:
        d1 = d1.replace(word, '')
    return d1

def getWordFromList(docs):
    # segment every document in the list with jieba
    words = []
    for sentence in docs:
        sentence_list = [word for word in jieba.cut(sentence)]
        words.append(sentence_list)
    return words

if __name__ == '__main__':
    first = 'C:/Users/60917/Desktop/第一章.txt'
    second = 'C:/Users/60917/Desktop/第二章.txt'
    third = 'C:/Users/60917/Desktop/第三章.txt'
    # read the files into plain strings
    f = getThetxt(first)
    s = getThetxt(second)
    t = getThetxt(third)
    # treat the first two documents as a stand-in database
    mysql = [f, s]
    mysql_list = getWordFromList(mysql)
    dictionary = gensim.corpora.Dictionary(mysql_list)
    corpus = [dictionary.doc2bow(doc) for doc in mysql_list]
    # segment the query document and turn it into a bag-of-words vector
    t_list = [word for word in jieba.cut(t)]
    test_doc_vec = dictionary.doc2bow(t_list)
    # TF-IDF weighting plus a sparse similarity index over the "database"
    tfidf = gensim.models.TfidfModel(corpus)
    index = gensim.similarities.SparseMatrixSimilarity(tfidf[corpus], num_features=len(dictionary.keys()))
    sim = index[tfidf[test_doc_vec]]
    print(sim)
Run output:
D:\pythonProject\venv\Scripts\python.exe D:/getTheWord/test.py
Building prefix dict from the default dictionary ...
Loading model from cache C:\Users\60917\AppData\Local\Temp\jieba.cache
Loading model cost 1.228 seconds.
Prefix dict has been built successfully.
[0.04587318 0.5870048 ]
Process finished with exit code 0
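Training the weights on actual lab reports, as suggested above, just means feeding the same gensim pipeline every collected report instead of two chapters. A hedged sketch, assuming a reports/ directory of .txt files (the directory and file names are placeholders):

# index_reports.py (a sketch; the reports/ directory is a placeholder)
import glob

import jieba
import gensim

docs = [open(p, encoding='utf-8').read() for p in glob.glob('reports/*.txt')]
tokenized = [[w for w in jieba.cut(d)] for d in docs]

dictionary = gensim.corpora.Dictionary(tokenized)
corpus = [dictionary.doc2bow(doc) for doc in tokenized]
tfidf = gensim.models.TfidfModel(corpus)  # the IDF now comes from the report corpus itself
index = gensim.similarities.SparseMatrixSimilarity(tfidf[corpus], num_features=len(dictionary))

# persist everything so a query does not have to rebuild the index
dictionary.save('reports.dict')
tfidf.save('reports.tfidf')
index.save('reports.index')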
- The third approach uses simhash to compare text similarity. Its shortcoming is that it does not use TF-IDF weighting; a weighted variant is sketched after the run output below.
# simhash_test.py
from simhash import Simhash

# punctuation (full- and half-width) to strip out
excludes = {',', '。', '/', '《', '》', '?', ';', '‘', ':', '“', '【', '】', '{', '}',
            '、', '|', '!', '@', '#', '¥', '%', '……', '&', '*', '(', ')', '-', '=',
            '——', '+', '·', '~', '”',
            ',', '.', '/', '<', '>', '?', ';', '\'', ':', '"', '[', ']', '{', '}', '\\', '|',
            '~', '`', '!', '@', '#', '$', '%', '^', '&', '*', '(', ')', '_', '-', '+', '=',
            ' ', '\n'}

def getThetxt(path):
    # read the whole file and strip punctuation
    with open(path, encoding='utf-8') as f:
        d1 = f.read()
    for word in excludes:
        d1 = d1.replace(word, '')
    return d1

def simhash_similarity(text1, text2):
    aa_simhash = Simhash(text1)
    bb_simhash = Simhash(text2)
    # Hamming distance between the two fingerprints,
    # normalised by the fingerprint width f (64 bits by default)
    distance = aa_simhash.distance(bb_simhash)
    similar = 1 - distance / aa_simhash.f
    return similar

if __name__ == '__main__':
    first = getThetxt('C:/Users/60917/Desktop/第一章.txt')
    second = getThetxt('C:/Users/60917/Desktop/第二章.txt')
    third = getThetxt('C:/Users/60917/Desktop/第三章.txt')
    print(simhash_similarity(first, third))
    print(simhash_similarity(second, third))
Run output:
D:\pythonProject\venv\Scripts\python.exe D:/getTheWord/simhash_test.py
0.48484848484848486
0.5303030303030303
Process finished with exit code 0
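The missing TF-IDF weighting is straightforward to add, because Simhash also accepts (token, weight) features instead of a raw string. A sketch that takes the weights from jieba's built-in TF-IDF keyword extractor; topK=200 is an arbitrary choice:

# weighted_simhash.py (a sketch: simhash over TF-IDF-weighted features)
import jieba.analyse
from simhash import Simhash

def weighted_simhash(text, top_k=200):
    # (word, TF-IDF weight) pairs from jieba's default IDF table
    features = jieba.analyse.extract_tags(text, topK=top_k, withWeight=True)
    return Simhash(features)

def weighted_similarity(text1, text2):
    a = weighted_simhash(text1)
    b = weighted_simhash(text2)
    return 1 - a.distance(b) / a.f  # f is the fingerprint width, 64 bits by default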
Summary: the computed results differ somewhat from one another, mainly because some of the code applies TF-IDF weighting while some does not.
Plans for Next Week
The implementation work for next week breaks down into the following parts:
- Implement TF-IDF myself: weights from a general-purpose library always carry some bias, whereas weights trained on our own data fit the lab-report use case better and should give more accurate results (a starting sketch follows this list).
- Start on code-similarity checking; the plan is just to get it off the ground.
- Keep polishing the interface design.
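As a possible starting point for the first item, the IDF table can be computed directly from our own collected reports and combined with per-document term frequencies. A minimal sketch, assuming docs is a list of raw report strings:

# my_tfidf.py (a sketch of hand-rolled TF-IDF trained on our own corpus)
import math
from collections import Counter

import jieba

def train_idf(docs):
    # docs: raw report strings; returns word -> IDF learned from this corpus
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(jieba.cut(doc)))  # document frequency of each word
    return {w: math.log(n / (1 + c)) for w, c in df.items()}

def tfidf_weights(doc, idf):
    # term frequency within one document times the corpus-level IDF
    tf = Counter(jieba.cut(doc))
    total = sum(tf.values())
    return {w: (c / total) * idf.get(w, 0.0) for w, c in tf.items()}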