Computing Text Similarity with Cosine Similarity

ZKYEN

Category: NLP. Tags: algorithms

Article link: https://blog.csdn.net/weixin_37790871/article/details/78374147

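1. Introduction

For the problem of deciding whether two texts are similar, this article presents the cosine similarity algorithm and, based on problems encountered in a real project, gives the corresponding solutions. Practical testing showed that cosine similarity works well for short texts but not for long texts.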

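2. Related Work

2.1 Longest common substring (based on weight space and term space)
(1) Arrange the two strings as the rows and columns of a matrix.
(2) For each cell, check whether the row character and the column character are the same; if they are, set the cell to 1 (otherwise 0).
(3) The longest diagonal run of 1s gives the longest common substring.
(4) To improve the algorithm, add the value of the upper-left neighbor (d[i-1, j-1]) to every cell whose characters match. Each cell then holds the length of the common substring ending at that position, so the largest cell value is the length of the longest common substring. The row index of that cell and the maximum value are then enough to cut the substring out of the string.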
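The improved variant in step (4) can be sketched as follows. This is a minimal illustration, not the author's code: the class name `LongestCommonSubstringSketch` and the method name are assumptions, and the matrix gets an extra zero row and column so that d[i-1][j-1] always exists.

```java
// Minimal sketch of step (4); all names here are illustrative assumptions.
final class LongestCommonSubstringSketch {

    /** Returns the longest common substring of a and b (the first one found on ties). */
    static String longestCommonSubstring(String a, String b) {
        // d[i][j] = length of the common substring ending at a[i-1] and b[j-1].
        // Row 0 and column 0 stay 0, so the upper-left neighbor always exists.
        int[][] d = new int[a.length() + 1][b.length() + 1];
        int best = 0;   // largest cell value seen so far
        int endRow = 0; // row index of that cell (exclusive end position in a)
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                if (a.charAt(i - 1) == b.charAt(j - 1)) {
                    // Matching characters: extend the diagonal run.
                    d[i][j] = d[i - 1][j - 1] + 1;
                    if (d[i][j] > best) {
                        best = d[i][j];
                        endRow = i;
                    }
                }
            }
        }
        // The row index and the maximum value are enough to cut out the substring.
        return a.substring(endRow - best, endRow);
    }

    public static void main(String[] args) {
        System.out.println(longestCommonSubstring("abcde", "xbcdy")); // prints "bcd"
    }
}
```

The matrix costs O(|a|·|b|) time and memory; only the previous row is actually needed, so the memory could be reduced if the inputs are long.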
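2.2 Minimum edit distance (based on term space)
(1) Edit distance in the narrow sense
Let A and B be two strings. The edit distance in the narrow sense is the minimum number of deletions (deleting one character from A), insertions (inserting one character into A), and substitutions (replacing one character of A with another) needed to transform A into B, written ED(A, B). Intuitively, the more steps it takes to convert one string into the other, the more the two strings differ.
(2) Steps
a) Process both texts, replacing every non-text character with the segment marker "#".
b) Take the longer text as the reference and iterate over the segments of the shorter text. When the longer text contains a segment of the shorter text, remove that segment from the longer text; for segments with no match, accumulate their lengths.
c) Compare the length of the remaining text with the sum of the lengths of the two texts; the ratio is the mismatch rate.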
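The narrow-sense definition ED(A, B) in (1) corresponds to the standard dynamic-programming (Levenshtein) computation. The sketch below illustrates that definition only; it does not implement the segment-matching steps a) to c), and the class and method names are assumptions.

```java
// Minimal sketch of ED(A, B) as defined in (1); names are illustrative assumptions.
final class EditDistanceSketch {

    /** Minimum number of single-character deletions, insertions and substitutions turning a into b. */
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1]; // distances for the prefix a[0..i-1)
        int[] curr = new int[b.length() + 1]; // distances for the prefix a[0..i)
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // empty prefix of a -> first j characters of b: j insertions
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // first i characters of a -> empty string: i deletions
            for (int j = 1; j <= b.length(); j++) {
                int substitution = prev[j - 1] + (a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1);
                int deletion = prev[j] + 1;
                int insertion = curr[j - 1] + 1;
                curr[j] = Math.min(substitution, Math.min(deletion, insertion));
            }
            int[] tmp = prev; // reuse the two rows instead of allocating a full matrix
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    public static void main(String[] args) {
        System.out.println(editDistance("kitten", "sitting")); // prints 3
    }
}
```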

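3. Cosine Similarity

Cosine similarity evaluates how similar two vectors are by computing the cosine of the angle between them.
3.1 Concept:
Plot the vectors in a vector space according to their coordinates, for example the familiar two-dimensional plane. Find the angle between them and take the cosine of that angle; this value characterizes the similarity of the two vectors.
We can therefore judge the similarity of two vectors by the size of the angle between them: the smaller the angle, the closer the cosine is to 1, the more closely their directions agree, and the more similar they are.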
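3.2 Calculation:
Take two-dimensional space as an example. Let a and b be two vectors, and let θ be the angle between them that we want to compute. Treating a and b as two sides of a triangle whose third side is c, the law of cosines gives:
$$\cos\theta = \frac{a^2 + b^2 - c^2}{2ab}$$
where a, b and c denote the side lengths. If a is the vector [x1, y1] and b is the vector [x2, y2], this can be rewritten in terms of coordinates as:
$$\cos\theta = \frac{x_1 x_2 + y_1 y_2}{\sqrt{x_1^2 + y_1^2}\,\sqrt{x_2^2 + y_2^2}}$$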

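Mathematicians have proved that this way of computing the cosine also holds for n-dimensional vectors. Let A and B be two n-dimensional vectors, with A = [A1, A2, …, An] and B = [B1, B2, …, Bn]. The cosine of the angle θ between A and B is then:
$$\cos\theta = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^2}\,\sqrt{\sum_{i=1}^{n} B_i^2}}$$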
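The n-dimensional formula translates almost directly into code: the dot product of A and B divided by the product of their lengths. The following is a minimal sketch; the class name is an assumption, and returning 0 for an all-zero vector is a choice made here (the formula itself is undefined in that case).

```java
// Minimal sketch of the n-dimensional cosine formula; names are illustrative assumptions.
final class CosineFormulaSketch {

    /** Cosine of the angle between two vectors of equal dimension. */
    static double cosine(double[] a, double[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("vectors must have the same dimension");
        }
        double dot = 0.0;   // sum of A_i * B_i
        double normA = 0.0; // sum of A_i^2
        double normB = 0.0; // sum of B_i^2
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        if (normA == 0.0 || normB == 0.0) {
            return 0.0; // assumption: a zero vector is treated as "not similar"
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        // (3*4 + 4*3) / (5 * 5) = 24 / 25, i.e. approximately 0.96
        System.out.println(cosine(new double[] {3, 4}, new double[] {4, 3}));
    }
}
```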

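Algorithm steps

(1) Vector alignment:
In practice, the two vectors that represent the features of the texts have different lengths, so they must be processed before the formula can be applied.
a) Preprocess the text: remove stop words (participles, prepositions, pronouns, etc.) and non-text symbols.
b) Merge the words of both texts into a single vector. For each word in the merged vector, if it occurs in the original text, use its term frequency in that text as the entry; otherwise set the entry to 0.
c) Example: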
Text1_1: It is a beautiful butterfly
Text1_2: beautiful butterfly
Text2_1: She is a beautiful girl
Text2_2: beautiful girl
Vector: beautiful butterfly girl
Vector1 = (1, 1, 0)
Vector2 = (1, 0, 1)
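(2) Sample:
Test1_1 and Test2_1 are random passages excerpted from articles of different types; Test1_2 and Test2_2 are the same texts after stop words and non-text symbols have been removed.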
Test1_1：In spite of the title, this article will really be on how not to grow old, which, at my time of life, is a much more important subject. My first advice would be to choose your ancestors carefully. Although both my parents died young, I have done well in this respect as regards my other ancestors. My maternal grandfather, it is true, was cut off in the flower of his youth at the age of sixty-seven, but my other three grandparents all lived to be over eighty. Of remoter ancestors I can only discover one who did not live to a great age, and he died of a disease which is now rare, namely, having his head cut off.
Test1_2：spite title article grow old which time life subject advice choose ancestors carefully parents died young respect ancestors maternal grandfather true cut flower youth age sixty-seven grandparents lived eighty remoter ancestors discover live age died disease rare namely head cut off
Test2_1：A good book may be among the best of friends. It is the same today that it always was, and it will never change. It is the most patient and cheerful of companions. It does not turn its back upon us in times of adversity or distress. It always receives us with the same kindness; amusing and instructing us in youth, and comforting and consoling us in age.
Test2_2：book friends was change patient cheerful companions times adversity distress receives kindness amusing instructing youth comforting consoling age
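The vector alignment described in step (1) can be sketched as follows, using the short example "It is a beautiful butterfly" / "She is a beautiful girl". This is an illustration under stated assumptions, not the author's implementation: the class name, method names and the tiny stop-word list `STOP_WORDS` are hypothetical, and a real project would use a full stop-word list.

```java
import java.util.*;

// Minimal sketch of vector alignment; all names and the stop-word list are assumptions.
final class VectorAlignmentSketch {

    // Hypothetical stop-word list, just large enough for the short example.
    static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList("it", "she", "is", "a"));

    /** Step a): lowercase, split on non-letters, drop stop words. */
    static List<String> preprocess(String text) {
        List<String> words = new ArrayList<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z]+")) {
            if (!token.isEmpty() && !STOP_WORDS.contains(token)) {
                words.add(token);
            }
        }
        return words;
    }

    /** Step b): merge both word lists, then fill each vector with term frequencies (0 if absent). */
    static int[][] align(List<String> words1, List<String> words2) {
        Set<String> merged = new LinkedHashSet<>(words1); // keeps first-seen order
        merged.addAll(words2);
        int[][] vectors = new int[2][merged.size()];
        int k = 0;
        for (String word : merged) {
            vectors[0][k] = Collections.frequency(words1, word);
            vectors[1][k] = Collections.frequency(words2, word);
            k++;
        }
        return vectors;
    }

    /** Cosine of the angle between two aligned term-frequency vectors. */
    static double cosine(int[] a, int[] b) {
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return (normA == 0.0 || normB == 0.0) ? 0.0 : dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        int[][] v = align(preprocess("It is a beautiful butterfly"),
                          preprocess("She is a beautiful girl"));
        System.out.println(Arrays.toString(v[0])); // prints [1, 1, 0]
        System.out.println(Arrays.toString(v[1])); // prints [1, 0, 1]
        System.out.println(cosine(v[0], v[1]));    // approximately 0.5
    }
}
```

For the two aligned vectors (1, 1, 0) and (1, 0, 1), the dot product is 1 and each vector has length √2, so the cosine is 1/2.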

Code

①Cos_Main
package NLP_Cos;

import java.io.*;

public class CosMain {
   

    public static void main(String[] args) throws Exception {
        // Step 1: preprocessing, mainly tokenization and stop-word removal.
        // Step 2: list all the words.