Computing Semantic Similarity with Word2Vec in Java

In a Spring Boot project, place a pre-trained Word2Vec model (.bin file) inside the project, add the dependencies, then segment the text in Java and compute cosine similarity to obtain the semantic similarity of two sentences. Note that you have to train or download the .bin model yourself.

1. Add the Maven dependencies

      Most dependencies you will ever want can be found here:

Maven Repository: https://mvnrepository.com/

       Here we pull in the deep-learning library and the word-segmentation library:

        <dependency>
            <groupId>org.deeplearning4j</groupId>
            <artifactId>deeplearning4j-nlp</artifactId>
            <version>1.0.0-beta7</version>
        </dependency>
        <dependency>
            <groupId>org.nd4j</groupId>
            <artifactId>nd4j-native-platform</artifactId>
            <version>1.0.0-beta7</version>
        </dependency>
        <dependency>
            <groupId>com.huaban</groupId>
            <artifactId>jieba-analysis</artifactId>
            <version>1.0.2</version>
        </dependency>

2. Write the Word2Vec helper class

        Segment the sentences — drop stopwords to improve results — compute cosine similarity
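Before the real class, the pipeline (average word vectors, then cosine similarity) can be illustrated end to end with a hand-made two-dimensional embedding table standing in for a trained model. The class name `ToyPipeline` and the toy vectors are made up for illustration only; the real class below uses a Word2Vec model instead:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy illustration of the pipeline: look up word vectors, average them,
// then compare the two sentence vectors with cosine similarity.
public class ToyPipeline {
    static final Map<String, double[]> EMB = new HashMap<>();
    static {
        EMB.put("cat", new double[]{1.0, 0.0});
        EMB.put("dog", new double[]{0.8, 0.6});
    }

    static double[] average(List<String> words) {
        double[] sum = new double[2];
        int found = 0;
        for (String w : words) {
            double[] v = EMB.get(w);
            if (v == null) continue; // skip out-of-vocabulary words
            for (int i = 0; i < v.length; i++) sum[i] += v[i];
            found++;
        }
        if (found > 0) {
            for (int i = 0; i < sum.length; i++) sum[i] /= found;
        }
        return sum;
    }

    static double cosine(double[] a, double[] b) {
        double dot = 0, na = 0, nb = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            na += a[i] * a[i];
            nb += b[i] * b[i];
        }
        return dot / (Math.sqrt(na) * Math.sqrt(nb));
    }

    public static void main(String[] args) {
        // Both toy vectors have length 1, so the cosine equals the dot product.
        double sim = cosine(average(List.of("cat")), average(List.of("dog")));
        System.out.println(sim); // ≈ 0.8
    }
}
```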

package com.example.demo.service.Impl.data;

import org.deeplearning4j.models.embeddings.loader.WordVectorSerializer;
import org.deeplearning4j.models.word2vec.Word2Vec;
import com.huaban.analysis.jieba.JiebaSegmenter;
import java.util.Arrays;
import java.util.List;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.Set;

public class SentenceSimilarityCalculator {

    private Word2Vec word2Vec;
    private JiebaSegmenter segmenter;
    private Set<String> stopwords;

    public SentenceSimilarityCalculator(String modelPath) {
        word2Vec = WordVectorSerializer.readWord2VecModel(modelPath);
        segmenter = new JiebaSegmenter();
        initializeStopwords(); // initialize the stopword set
    }

    private void initializeStopwords() {
        // Stopwords are removed before comparison to improve results;
        // as an example we drop (, ) and ,
        stopwords = new HashSet<>();
        stopwords.add(")");
        stopwords.add("(");
        stopwords.add(",");
    }

    // Filter (and optionally weight) the segmented words
    private List<String> filterAndWeightWords(List<String> words) {
        List<String> filteredWords = new ArrayList<>();
        for (String word : words) {
            // Drop stopwords and punctuation/symbol tokens
            if (!stopwords.contains(word) && !word.matches("[\\pP\\pS]+")) {
                // Weighting could be applied here
                filteredWords.add(word);
            }
        }
        return filteredWords;
    }

    public double calculateSimilarity(String sentence1, String sentence2) {
        List<String> words1 = segmenter.sentenceProcess(sentence1);
        List<String> words2 = segmenter.sentenceProcess(sentence2);

        // Filter and weight the segmented words
        List<String> filteredWords1 = filterAndWeightWords(words1);
        List<String> filteredWords2 = filterAndWeightWords(words2);

        String[] arrayWords1 = filteredWords1.toArray(new String[0]);
        String[] arrayWords2 = filteredWords2.toArray(new String[0]);

        double[] vector1 = calculateAverageVector(arrayWords1);
        double[] vector2 = calculateAverageVector(arrayWords2);

        return cosineSimilarity(vector1, vector2);
    }

    private double[] calculateAverageVector(String[] words) {
        double[] sumVector = new double[word2Vec.getWordVector(word2Vec.vocab().wordAtIndex(0)).length];

        int found = 0; // average only over words that exist in the vocabulary
        for (String word : words) {
            double[] wordVector = word2Vec.getWordVector(word);
            if (wordVector != null) {
                for (int i = 0; i < wordVector.length; i++) {
                    sumVector[i] += wordVector[i];
                }
                found++;
            }
        }

        if (found > 0) {
            for (int i = 0; i < sumVector.length; i++) {
                sumVector[i] /= found;
            }
        }
        System.out.println("Words: " + Arrays.toString(words));
        System.out.println("SumVector: " + Arrays.toString(sumVector));
        // Printed so you can watch the segmentation and vectors in the console
        return sumVector;
    }

    private double cosineSimilarity(double[] vector1, double[] vector2) {
        double dotProduct = 0.0;
        double norm1 = 0.0;
        double norm2 = 0.0;

        for (int i = 0; i < vector1.length; i++) {
            dotProduct += vector1[i] * vector2[i];
            norm1 += vector1[i] * vector1[i];
            norm2 += vector2[i] * vector2[i];
        }

        if (norm1 == 0.0 || norm2 == 0.0) {
            return 0.0; // avoid NaN when a sentence has no in-vocabulary words
        }
        return dotProduct / (Math.sqrt(norm1) * Math.sqrt(norm2));
    }

    // Optionally stretch the result:
    // similarities in [0.0, 0.2) are mapped linearly onto [-1.0, 0.2)
    public static double mapSimilarity(double similarity) { // (value - rangeStart) / rangeWidth * targetWidth + targetStart
        if (similarity >= 0.2) {
            return similarity;
        } else {
            return (similarity - 0.0) / (0.2 - 0.0) * (0.2 - (-1.0)) + (-1.0);
        }
    }
}
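To sanity-check the stretch mapping, the piecewise formula can be restated standalone: values at or above 0.2 pass through unchanged, while [0.0, 0.2) is scaled linearly onto [-1.0, 0.2). For example, 0.1 maps to (0.1 / 0.2) * 1.2 - 1.0 = -0.4. A minimal sketch (the class name `MapSimilarityCheck` is made up for illustration):

```java
// Standalone check of the piecewise mapping used by mapSimilarity:
// values >= 0.2 pass through; [0.0, 0.2) is stretched onto [-1.0, 0.2).
public class MapSimilarityCheck {
    static double mapSimilarity(double similarity) {
        if (similarity >= 0.2) {
            return similarity;
        }
        // (value - rangeStart) / rangeWidth * targetWidth + targetStart
        return (similarity - 0.0) / (0.2 - 0.0) * (0.2 - (-1.0)) + (-1.0);
    }

    public static void main(String[] args) {
        System.out.println(mapSimilarity(0.0)); // -1.0 (bottom of the stretched range)
        System.out.println(mapSimilarity(0.1)); // -0.4: (0.1 / 0.2) * 1.2 - 1.0
        System.out.println(mapSimilarity(0.5)); // 0.5 (unchanged, already >= 0.2)
    }
}
```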

3. Call the calculator wherever you need it

String modelPath = "src/main/java/com/example/demo/service/Impl/data/pretrained_word2vec.bin"; // your own path
SentenceSimilarityCalculator calculator = new SentenceSimilarityCalculator(modelPath);
double sentenceSimilarity = calculator.calculateSimilarity(sentence1, sentence2);
double mappedSimilarity = SentenceSimilarityCalculator.mapSimilarity(sentenceSimilarity);
mappedSimilarity *= 100;

        If you want to print and inspect the intermediate results:

String modelPath = "src/main/java/com/example/demo/service/Impl/data/pretrained_word2vec.bin";
SentenceSimilarityCalculator calculator = new SentenceSimilarityCalculator(modelPath);
double sentenceSimilarity = calculator.calculateSimilarity(sentence1, sentence2);
System.out.println("sentenceSimilarity between the two sentences: " + sentenceSimilarity);
double mappedSimilarity = SentenceSimilarityCalculator.mapSimilarity(sentenceSimilarity);
System.out.println("mappedSimilarity between the two sentences: " + mappedSimilarity);
mappedSimilarity *= 100;
System.out.println("FinalMappedSimilarity between the two sentences: " + mappedSimilarity);

4. Feed in (or wire up from elsewhere) the sentences whose semantic similarity you want to compute

String sentence1 = "比利时大个子费莱尼,传奇轰炸机";
String sentence2 = "泰山老队长费莱尼,来时英雄去时传奇";

        The run output is omitted here; the approach works and gives reasonably good results.
