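In a Spring Boot project I needed to compute the semantic similarity of two sentences. This could be done directly in Java with Word2Vec, but I chose to compute it with BERT instead (although it is slow...).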
1. BERT.py: edit the inputs and run it in PyCharm to try it out and check the results
from transformers import BertTokenizer, BertModel
import torch


def calculate_similarity(sentence1, sentence2):
    # Load the pretrained model; swap in any suitable model. 'bert-base-chinese' suits Chinese text.
    tokenizer = BertTokenizer.from_pretrained('bert-base-chinese')
    model = BertModel.from_pretrained('bert-base-chinese')
    # Tokenize and pad both sentences, then compute their token embeddings
    encoded_inputs = tokenizer([sentence1, sentence2], padding=True, truncation=True, return_tensors='pt')
    with torch.no_grad():
        outputs = model(**encoded_inputs)
    # Mean-pool over real tokens only; the attention mask excludes padding positions
    mask = encoded_inputs['attention_mask'].unsqueeze(-1).float()
    sentence_embeddings = (outputs.last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1)
    # Cosine similarity between the two sentence embeddings
    similarity = torch.cosine_similarity(sentence_embeddings[0], sentence_embeddings[1], dim=0)
    return similarity.item()


# Read one sentence per line from standard input and print the score
sentence1 = input()
sentence2 = input()
print(calculate_similarity(sentence1, sentence2))
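The BERT model can be swapped out; here I picked one that suits Chinese sentences. Then copy the BERT.py above into your Java project, next to the code that needs to call BERT to compute the similarity.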
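The script reduces each sentence to a single vector by averaging its token embeddings and then compares the two vectors with cosine similarity. Because the shorter sentence is padded to the length of the longer one, whether the padding positions take part in that average changes the score. Here is a minimal pure-Python sketch with toy numbers (not real BERT output); `masked_mean` and `cosine_similarity` are hypothetical helper names used only for illustration:

```python
import math


def masked_mean(token_vectors, attention_mask):
    """Average only the token vectors whose mask entry is 1 (real tokens)."""
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep]
    dim = len(token_vectors[0])
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy 2-dimensional "token embeddings"; the last row of sentence_a is padding.
sentence_a = [[1.0, 0.0], [3.0, 0.0], [0.0, 9.0]]
mask_a = [1, 1, 0]
sentence_b = [[2.0, 0.0], [2.0, 0.0], [2.0, 0.0]]
mask_b = [1, 1, 1]

pooled_a = masked_mean(sentence_a, mask_a)
pooled_b = masked_mean(sentence_b, mask_b)
with_padding = masked_mean(sentence_a, [1, 1, 1])  # padding row included

print("padding excluded:", cosine_similarity(pooled_a, pooled_b))
print("padding included:", cosine_similarity(with_padding, pooled_b))
```

In this toy case the two scores differ noticeably, which is why the attention mask matters whenever the two sentences have different lengths.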
2. SemanticSimilarityCalculator: write a Java class that calls BERT.py
package com.example.demo.service.Impl;

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.nio.charset.StandardCharsets;

public class SemanticSimilarityCalculator {
    public double calculateSimilarity(String sentence1, String sentence2) throws IOException, InterruptedException {
        // Create the Python process; replace the placeholder with the directory next to your own class
        ProcessBuilder pb = new ProcessBuilder("python", "./<directory-next-to-your-class>/BERT.py");
        // Make Python read and write UTF-8 so Chinese sentences survive the pipe
        pb.environment().put("PYTHONIOENCODING", "utf-8");
        // Forward Python's stderr (warnings, tracebacks) to the console so it cannot fill up and block the process
        pb.redirectError(ProcessBuilder.Redirect.INHERIT);
        Process process = pb.start();
        System.out.println("Sending sentences to Python process:");
        System.out.println("Sentence 1: " + sentence1);
        System.out.println("Sentence 2: " + sentence2);
        // Send the two sentences, one per line; closing the writer signals end of input
        try (BufferedWriter bw = new BufferedWriter(new OutputStreamWriter(process.getOutputStream(), StandardCharsets.UTF_8))) {
            bw.write(sentence1 + "\n");
            bw.write(sentence2 + "\n");
        } catch (IOException e) {
            throw new IOException("Failed to send data to Python process", e);
        }
        // Read the Python output (the similarity) before waiting for the process to exit
        String output;
        try (BufferedReader br = new BufferedReader(new InputStreamReader(process.getInputStream(), StandardCharsets.UTF_8))) {
            output = br.readLine();
        }
        System.out.println("Python Output: " + output); // debug statement
        int exitCode = process.waitFor();
        if (exitCode != 0) {
            throw new IOException("Python process exited with non-zero status: " + exitCode);
        }
        if (output == null) {
            throw new IOException("Python process did not produce any output");
        }
        // Parse the printed number directly so the sign and any exponent are preserved
        double similarity = Double.parseDouble(output.trim());
        // Scale to a percentage
        similarity *= 100;
        return similarity;
    }
}
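The caller has to turn the single line printed by BERT.py into a number. Python prints floats with a leading minus sign when they are negative and in scientific notation when they are very small, so a parser that keeps only digits and dots can silently change the value. The sketch below shows the difference; the helper names and the sample lines are made up for illustration:

```python
import re


def parse_by_stripping(line):
    """Keep only digits and dots, then parse; sign and exponent are lost."""
    return float(re.sub(r"[^0-9.]", "", line))


def parse_directly(line):
    """Parse the whole trimmed line as a float."""
    return float(line.strip())


# Sample lines in the shapes that print() can produce for a float.
for printed in ["0.8731\n", "-0.25\n", "3.5e-05\n"]:
    print(repr(printed), parse_by_stripping(printed), parse_directly(printed))
```

Cosine similarity ranges from -1 to 1, so a negative score is possible in principle and is worth preserving on the Java side as well.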
3. Call SemanticSimilarityCalculator wherever your own code needs the semantic similarity
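// calculateSimilarity throws IOException and InterruptedException, so call it from a method that handles or declares them
SemanticSimilarityCalculator similarityCalculator = new SemanticSimilarityCalculator();
Double sentenceSimilarity;
String sentence1 = "今天天气不错,对面只进了三个球";
String sentence2 = "今天有点晒,我没能零封对手";
sentenceSimilarity = similarityCalculator.calculateSimilarity(sentence1, sentence2);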
Run this inside your own project and it will call BERT to compute the semantic similarity of the two sentences.
Summary: BERT used this way does not work very well, and... it really is slow~
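Much of the delay likely comes from starting a new Python process and reloading the model on every call. One possible remedy, sketched here under assumptions, is to keep a single Python process alive, load the model once, and answer one sentence pair per request. The names `serve` and `dummy_score` are hypothetical, and `dummy_score` is only a stand-in so that the loop runs without the model; in practice it would be replaced by a function that reuses an already loaded tokenizer and model.

```python
import io
import sys


def serve(score_fn, in_stream=sys.stdin, out_stream=sys.stdout):
    """Read sentence pairs (two lines each) until EOF; write one score per pair."""
    while True:
        first = in_stream.readline()
        second = in_stream.readline()
        if not first or not second:
            break  # EOF or an incomplete pair ends the loop
        score = score_fn(first.rstrip("\n"), second.rstrip("\n"))
        out_stream.write(f"{score}\n")
        out_stream.flush()  # let the caller read each answer immediately


def dummy_score(a, b):
    """Placeholder scorer: share of characters in common (NOT BERT)."""
    shared = set(a) & set(b)
    union = set(a) | set(b)
    return len(shared) / len(union) if union else 1.0


# Demonstration with in-memory streams instead of a real pipe.
fake_input = io.StringIO("abc\nabd\nxyz\nxyz\n")
fake_output = io.StringIO()
serve(dummy_score, fake_input, fake_output)
print(fake_output.getvalue())
```

With a loop like this, the Java side would start the process once, keep its reader and writer open, and write two lines then read one line for each comparison, instead of launching Python every time.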
I am a beginner, so corrections and suggestions for improvement are welcome.