Preface
Here I document my process of computing text similarity with deep learning. This is a series and will be updated over time.
xmnlp (Chinese)
Code link
Installation
pip install -U xmnlp
The model I use here is xmnlp-onnx-models-v5.zip, which can be downloaded from the project's GitHub page.
Code:
import pandas as pd
from xmnlp.sv import SentenceVector

# Load the two text files
commit_file = 'commits.csv'
issues_file = 'issues.csv'
commit = pd.read_csv(commit_file)
issue = pd.read_csv(issues_file)

# Extract the text content and the corresponding identifiers
commit_content = commit['commit_content'].tolist()
commit_ids = commit['commit_id'].tolist()
issue_content = issue['issue_content'].tolist()
issue_ids = issue['issue_id'].tolist()

# Load the model; unzip the download and point model_dir at the folder
model_dir = 'model\\xmnlp-onnx-models\\sentence_vector'
sv = SentenceVector(model_dir, genre='通用')
# Other genres are available:
# sv = SentenceVector(genre='金融')
# sv = SentenceVector(genre='国际')

# Score every issue/commit pair
result = []
for issue_index in range(len(issue_content)):
    for commit_index in range(len(commit_content)):
        query = issue_content[issue_index]
        doc = commit_content[commit_index]
        commit_name = commit_ids[commit_index]
        issue_name = issue_ids[issue_index]
        sim = sv.similarity(query, doc)
        result.append([commit_name, issue_name, sim])

# Demo calls on the last pair processed
print('similarity:', sv.similarity(query, doc))
print('most similar doc:', sv.most_similar(query, commit_content))
print('query representation shape:', sv.transform(query).shape)

result_all = pd.DataFrame(result, columns=['commit', 'issue', 'sim'])
result_all.to_csv('result.csv', index=False)
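As I understand it, `sv.similarity` compares the two sentence vectors, and the standard metric for that is cosine similarity. A minimal pure-Python sketch of that metric (my illustration, not xmnlp's actual implementation):

```python
import math

def cos_sim(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Vectors pointing the same way score ~1.0; orthogonal vectors score 0.0
print(cos_sim([1.0, 2.0], [2.0, 4.0]))  # ~1.0 (up to float error)
print(cos_sim([1.0, 0.0], [0.0, 1.0]))  # 0.0
```

Higher values mean more similar texts, which is why the loop above stores the raw score in the `sim` column.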
sentence transformers
Source code
https://github.com/UKPLab/sentence-transformers
Installation
pip install -U sentence-transformers
Dataset format
- uc_Full.txt
- uc_TCName (the names corresponding to the texts)
- CN_MN_VN_CMT.txt (code file; CN = class name, MN = method name, VN = variable name, CMT = comment). Each line is one text sentence.
- code_TCName.txt (the names corresponding to the code)
Code
Here the bert-base-nli-mean-tokens model encodes each sentence into an embedding, and the cosine similarity between two embeddings then measures their distance.
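The "mean-tokens" part of the model name refers to mean pooling: averaging the per-token embeddings into one fixed-size sentence vector. A sketch with made-up numbers (the token embeddings below are hypothetical, just to show the pooling step):

```python
import numpy as np

# Hypothetical token embeddings for a 4-token sentence (embedding dim = 3)
token_embeddings = np.array([
    [0.1, 0.3, -0.2],
    [0.4, 0.0,  0.5],
    [0.2, 0.1,  0.1],
    [0.3, 0.2,  0.0],
])

# Mean pooling: average over the token axis -> one sentence vector
sentence_vector = token_embeddings.mean(axis=0)
print(sentence_vector.shape)  # (3,)
```

This is why sentences of different lengths all map to vectors of the same size, which in turn makes cosine similarity between any pair well defined.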
# -*- coding: utf-8 -*-
# @Time : 2023/6/30 14:52
# This project is based on PyTorch
import pandas as pd
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

requirement_file = "../dataset/Pig/uc_Full.txt"
code_file = "../dataset/Pig/CN_MN_VN_CMT.txt"
requirement_name_file = "../dataset/Pig/uc_TCName.txt"
code_name_file = "../dataset/Pig/code_TCName.txt"

with open(requirement_file, 'r') as f:
    requirement = [line.strip() for line in f]
with open(requirement_name_file, 'r') as f:
    requirement_name = [line.strip() for line in f]
with open(code_file, 'r') as f:
    code = [line.strip() for line in f]
with open(code_name_file, 'r') as f:
    code_name = [line.strip() for line in f]

model = SentenceTransformer('bert-base-nli-mean-tokens')

# Encode code sentences first, then requirement sentences
# (use + rather than extend so the original `code` list is not mutated)
sentences = code + requirement
sentence_embeddings = model.encode(sentences)

# Cosine similarity for every code/requirement pair.
# Requirement embeddings start at index len(code), so the j-th
# requirement is at len(code) + j (not len(code) - 1 + j).
result = []
for i in range(len(code_name)):
    for j in range(len(requirement_name)):
        sim = cosine_similarity(
            [sentence_embeddings[i]],
            [sentence_embeddings[len(code) + j]]
        )
        result.append([code_name[i], requirement_name[j], sim[0][0]])
        print(sim[0][0])

result_all = pd.DataFrame(result, columns=['code', 'requirement', 'sim'])
result_all.to_csv('result_2.csv', index=False)
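The nested loops above call `cosine_similarity` once per pair, but the same sklearn function accepts two matrices and returns the whole pairwise matrix in one call, which is much faster. A sketch with random vectors standing in for the real embeddings (the shapes here are made up for illustration):

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

rng = np.random.default_rng(0)
code_emb = rng.normal(size=(5, 8))  # stand-in for 5 code embeddings
req_emb = rng.normal(size=(3, 8))   # stand-in for 3 requirement embeddings

# One call yields the full 5 x 3 pairwise similarity matrix:
# sim_matrix[i, j] is the similarity of code i and requirement j
sim_matrix = cosine_similarity(code_emb, req_emb)
print(sim_matrix.shape)  # (5, 3)
```

With the real data, `cosine_similarity(sentence_embeddings[:len(code)], sentence_embeddings[len(code):])` would produce all the scores at once, and the result can be flattened into the same three-column CSV.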