2021.3.19 Project Stage Report
Overview
After some discussion we settled on the division of labor; the module I am responsible for is the plagiarism-check part, so this report covers my work on it.
Because Python has ready-made libraries for word segmentation and similar operations, the current code is written in Python.
This Week's Summary
1. Learned how to use PyCharm and PyQt5 and built my own interface; putting a UI together with PyQt5 is fairly straightforward. Below is the first-version test interface. It only needs to look roughly right; what matters is whether the test code runs correctly.
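For reference, the skeleton of a PyQt5 test window like this takes only a few lines. This is a minimal sketch, not the actual first-version UI; the window title and widget labels are placeholders:

# ui_sketch.py (a minimal PyQt5 skeleton; labels are placeholders)
import sys

from PyQt5.QtWidgets import QApplication, QPushButton, QTextEdit, QVBoxLayout, QWidget

app = QApplication(sys.argv)
window = QWidget()
window.setWindowTitle('查重测试')  # "plagiarism-check test"
layout = QVBoxLayout(window)
layout.addWidget(QTextEdit())              # box for the text to be checked
layout.addWidget(QPushButton('开始查重'))  # button that would trigger the check
window.show()
sys.exit(app.exec_())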
2. Used Python's JPype library to bridge Java and Python. The detailed process is shown in another blog post; a minimal sketch of the idea follows.
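The core of the bridge is just starting a JVM with the Java code on the classpath and then calling into it. In this sketch the jar path, class name, and method are hypothetical, not the ones from that post:

# jpype_bridge.py (a sketch; the jar path and Java class are hypothetical)
import jpype

# start the JVM once, with our Java code on the classpath
jpype.startJVM(classpath=['D:/getTheWord/checker.jar'])

# load a Java class and call its static method like a Python function
Checker = jpype.JClass('com.example.Checker')  # hypothetical class
print(Checker.check('some text'))              # hypothetical method

jpype.shutdownJVM()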
3. Studied text-similarity algorithms, including:
- cosine similarity
- sparse matrices
- simhash
- TF-IDF
4. Tried out the algorithms above:
- First, cosine similarity. The main idea is to count how often each keyword occurs in a document, then compare two documents by the cosine similarity of their count vectors. It has a flaw: the counts should be weighted with TF-IDF, which would cancel out the differences caused by the two documents having different lengths and total word counts, and is more principled. This is only a first version, though; a TF-IDF-weighted variant is sketched after the run output below.
# cosine.py
import math

import jieba

# punctuation (full- and half-width) to strip out
excludes = {',', '。', '/', '《', '》', '?', ';', '‘', ':', '“', '【', '】', '{', '}',
            '、', '|', '!', '@', '#', '¥', '%', '……', '&', '*', '(', ')', '-', '=',
            '——', '+', '·', '~', '”',
            ',', '.', '/', '<', '>', '?', ';', '\'', ':', '"', '[', ']', '{', '}', '\\', '|',
            '~', '`', '!', '@', '#', '$', '%', '^', '&', '*', '(', ')', '_', '-', '+', '=',
            ' ', '\n'}

def getThetxt(path):
    # read the whole file and strip punctuation
    with open(path, encoding='utf-8') as f:
        d1 = f.read()
    for word in excludes:
        d1 = d1.replace(word, '')
    return d1

def getTheFrequency(words):
    # count how often each (non-empty) word occurs
    freq = {}
    for word in words:
        if word != '':
            freq[word] = freq.get(word, 0) + 1
    return freq

def compute_cosine(txt_a, txt_b):
    first = getThetxt(txt_a)
    second = getThetxt(txt_b)
    first_list = [word for word in jieba.cut(first)]
    second_list = [word for word in jieba.cut(second)]
    # frequencies: from list to dict
    first_dict = getTheFrequency(first_list)
    second_dict = getTheFrequency(second_list)
    # joint vocabulary: every word from either document
    keyWords = list(first_dict.keys())
    for word in second_dict:
        if word not in keyWords:
            keyWords.append(word)
    # count vectors over the joint vocabulary
    vect_f = [first_dict.get(word, 0) for word in keyWords]
    vect_s = [second_dict.get(word, 0) for word in keyWords]
    # cosine similarity of the two count vectors
    dot = 0
    sq1 = 0
    sq2 = 0
    for i in range(len(vect_f)):
        dot += vect_f[i] * vect_s[i]
        sq1 += vect_f[i] ** 2
        sq2 += vect_s[i] ** 2
    try:  # round() keeps two decimal places
        result = round(float(dot) / (math.sqrt(sq1) * math.sqrt(sq2)), 2)
    except ZeroDivisionError:
        result = 0.0
    return result

if __name__ == '__main__':
    print(compute_cosine('C:/Users/60917/Desktop/第一章.txt', 'C:/Users/60917/Desktop/第三章.txt'))
    print(compute_cosine('C:/Users/60917/Desktop/第二章.txt', 'C:/Users/60917/Desktop/第三章.txt'))
Run output:
D:\pythonProject\venv\Scripts\python.exe D:/getTheWord/cosine.py
Building prefix dict from the default dictionary ...
Loading model from cache C:\Users\60917\AppData\Local\Temp\jieba.cache
Loading model cost 1.067 seconds.
Prefix dict has been built successfully.
0.78
0.89
Process finished with exit code 0
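As noted above, the counts should really be TF-IDF weighted. A minimal sketch of that variant, using jieba.analyse.extract_tags, which returns (word, weight) pairs based on jieba's built-in IDF table; topK=200 is an arbitrary choice:

# tfidf_cosine.py (a sketch of the TF-IDF-weighted variant)
import math

import jieba.analyse

def tfidf_vector(text, top_k=200):
    # word -> TF-IDF weight, using jieba's default IDF table
    return dict(jieba.analyse.extract_tags(text, topK=top_k, withWeight=True))

def tfidf_cosine(text_a, text_b):
    wa = tfidf_vector(text_a)
    wb = tfidf_vector(text_b)
    dot = sum(wa.get(w, 0.0) * wb.get(w, 0.0) for w in set(wa) | set(wb))
    na = math.sqrt(sum(v * v for v in wa.values()))
    nb = math.sqrt(sum(v * v for v in wb.values()))
    return dot / (na * nb) if na and nb else 0.0

Running this on the same chapter pairs would show whether the length bias actually shrinks.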
- The second approach compares documents through a sparse similarity matrix and applies TF-IDF weighting, which makes the statistics more principled. It still has a flaw: because it calls the TF-IDF library directly, the results are somewhat biased for our lab-report use case. The TF-IDF weights should be trained on a large number of lab reports so that they are tailored to this domain; a sketch of that follows the run output below.
# SparseMatrix.py
import jieba
import gensim

# punctuation (full- and half-width) to strip out
excludes = {',', '。', '/', '《', '》', '?', ';', '‘', ':', '“', '【', '】', '{', '}',
            '、', '|', '!', '@', '#', '¥', '%', '……', '&', '*', '(', ')', '-', '=',
            '——', '+', '·', '~', '”',
            ',', '.', '/', '<', '>', '?', ';', '\'', ':', '"', '[', ']', '{', '}', '\\', '|',
            '~', '`', '!', '@', '#', '$', '%', '^', '&', '*', '(', ')', '_', '-', '+', '=',
            ' ', '\n'}

def getThetxt(path):
    # read the whole file and strip punctuation
    with open(path, encoding='utf-8') as f:
        d1 = f.read()
    for word in excludes:
        d1 = d1.replace(word, '')
    return d1

def getWordFromList(docs):
    # segment every document in the list with jieba
    words = []
    for sentence in docs:
        sentence_list = [word for word in jieba.cut(sentence)]
        words.append(sentence_list)
    return words

if __name__ == '__main__':
    first = 'C:/Users/60917/Desktop/第一章.txt'
    second = 'C:/Users/60917/Desktop/第二章.txt'
    third = 'C:/Users/60917/Desktop/第三章.txt'
    # read the files into plain strings
    f = getThetxt(first)
    s = getThetxt(second)
    t = getThetxt(third)
    # treat the first two documents as a stand-in database
    mysql = [f, s]
    mysql_list = getWordFromList(mysql)
    dictionary = gensim.corpora.Dictionary(mysql_list)
    corpus = [dictionary.doc2bow(doc) for doc in mysql_list]
    # segment the query document and turn it into a bag-of-words vector
    t_list = [word for word in jieba.cut(t)]
    test_doc_vec = dictionary.doc2bow(t_list)
    # TF-IDF weighting plus a sparse similarity index over the "database"
    tfidf = gensim.models.TfidfModel(corpus)
    index = gensim.similarities.SparseMatrixSimilarity(tfidf[corpus], num_features=len(dictionary.keys()))
    sim = index[tfidf[test_doc_vec]]
    print(sim)
Run output:
D:\pythonProject\venv\Scripts\python.exe D:/getTheWord/test.py
Building prefix dict from the default dictionary ...
Loading model from cache C:\Users\60917\AppData\Local\Temp\jieba.cache
Loading model cost 1.228 seconds.
Prefix dict has been built successfully.
[0.04587318 0.5870048 ]
Process finished with exit code 0
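Training the weights on actual lab reports, as suggested above, just means feeding the same gensim pipeline every collected report instead of two chapters. A hedged sketch, assuming a reports/ directory of .txt files (the directory and file names are placeholders):

# index_reports.py (a sketch; the reports/ directory is a placeholder)
import glob

import jieba
import gensim

docs = [open(p, encoding='utf-8').read() for p in glob.glob('reports/*.txt')]
tokenized = [[w for w in jieba.cut(d)] for d in docs]

dictionary = gensim.corpora.Dictionary(tokenized)
corpus = [dictionary.doc2bow(doc) for doc in tokenized]
tfidf = gensim.models.TfidfModel(corpus)  # the IDF now comes from the report corpus itself
index = gensim.similarities.SparseMatrixSimilarity(tfidf[corpus], num_features=len(dictionary))

# persist everything so a query does not have to rebuild the index
dictionary.save('reports.dict')
tfidf.save('reports.tfidf')
index.save('reports.index')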
- The third approach uses simhash to compare text similarity. Its shortcoming is that it does not use TF-IDF weighting; a weighted variant is sketched after the run output below.
# simhash_test.py
from simhash import Simhash

# punctuation (full- and half-width) to strip out
excludes = {',', '。', '/', '《', '》', '?', ';', '‘', ':', '“', '【', '】', '{', '}',
            '、', '|', '!', '@', '#', '¥', '%', '……', '&', '*', '(', ')', '-', '=',
            '——', '+', '·', '~', '”',
            ',', '.', '/', '<', '>', '?', ';', '\'', ':', '"', '[', ']', '{', '}', '\\', '|',
            '~', '`', '!', '@', '#', '$', '%', '^', '&', '*', '(', ')', '_', '-', '+', '=',
            ' ', '\n'}

def getThetxt(path):
    # read the whole file and strip punctuation
    with open(path, encoding='utf-8') as f:
        d1 = f.read()
    for word in excludes:
        d1 = d1.replace(word, '')
    return d1

def simhash_similarity(text1, text2):
    aa_simhash = Simhash(text1)
    bb_simhash = Simhash(text2)
    # Hamming distance between the two fingerprints,
    # normalised by the fingerprint width f (64 bits by default)
    distance = aa_simhash.distance(bb_simhash)
    similar = 1 - distance / aa_simhash.f
    return similar

if __name__ == '__main__':
    first = getThetxt('C:/Users/60917/Desktop/第一章.txt')
    second = getThetxt('C:/Users/60917/Desktop/第二章.txt')
    third = getThetxt('C:/Users/60917/Desktop/第三章.txt')
    print(simhash_similarity(first, third))
    print(simhash_similarity(second, third))
Run output:
D:\pythonProject\venv\Scripts\python.exe D:/getTheWord/simhash_test.py
0.48484848484848486
0.5303030303030303
Process finished with exit code 0
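The missing TF-IDF weighting is straightforward to add, because Simhash also accepts (token, weight) features instead of a raw string. A sketch that takes the weights from jieba's built-in TF-IDF keyword extractor; topK=200 is an arbitrary choice:

# weighted_simhash.py (a sketch: simhash over TF-IDF-weighted features)
import jieba.analyse
from simhash import Simhash

def weighted_simhash(text, top_k=200):
    # (word, TF-IDF weight) pairs from jieba's default IDF table
    features = jieba.analyse.extract_tags(text, topK=top_k, withWeight=True)
    return Simhash(features)

def weighted_similarity(text1, text2):
    a = weighted_simhash(text1)
    b = weighted_simhash(text2)
    return 1 - a.distance(b) / a.f  # f is the fingerprint width, 64 bits by default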
Summary: the computed results differ somewhat from one another, mainly because some of the code applies TF-IDF weighting while some does not.
Plans for Next Week
The implementation work for next week breaks down into the following parts:
- Implement TF-IDF myself: weights from a general-purpose library always carry some bias, whereas weights trained on our own data fit the lab-report use case better and should give more accurate results (a starting sketch follows this list).
- Start on code-similarity checking; the plan is just to get it off the ground.
- Keep polishing the interface design.
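As a possible starting point for the first item, the IDF table can be computed directly from our own collected reports and combined with per-document term frequencies. A minimal sketch, assuming docs is a list of raw report strings:

# my_tfidf.py (a sketch of hand-rolled TF-IDF trained on our own corpus)
import math
from collections import Counter

import jieba

def train_idf(docs):
    # docs: raw report strings; returns word -> IDF learned from this corpus
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(jieba.cut(doc)))  # document frequency of each word
    return {w: math.log(n / (1 + c)) for w, c in df.items()}

def tfidf_weights(doc, idf):
    # term frequency within one document times the corpus-level IDF
    tf = Counter(jieba.cut(doc))
    total = sum(tf.values())
    return {w: (c / total) * idf.get(w, 0.0) for w, c in tf.items()}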