Short-Text Similarity in Scala

Code_LT

Last modified: 2023-09-13 18:26:59


First published: 2023-07-28 15:18:07

Article link: https://blog.csdn.net/Code_LT/article/details/131975382


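Summary (auto-generated by CSDN): This post describes a way to judge short-text similarity using edit distance (Levenshtein distance) together with the Jaccard coefficient. Nouns are extracted with the Chinese segmentation library HanLP, the Jaccard coefficient and the edit distance are computed, and configurable thresholds decide whether two texts are similar. The post also shows how to compute edit distance on a DataFrame and lists MD5 and semantic vector models as other possible comparison strategies.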

SimHash-style algorithms are better suited to judging similarity between long texts. For short texts, consider the following approaches:

1. Edit distance + Jaccard coefficient


import com.hankcs.hanlp.HanLP
import com.hankcs.hanlp.seg.common.Term
import java.util.Properties
import scala.collection.JavaConverters._

object Test extends Serializable  {

  def main(args: Array[String]): Unit = {
    val props=new Properties()
    props.setProperty("deduplicateMinJaccardDistance","0.2")
    val s1="山东今天大雨"
    val s2="云南今天大雨"
    println(isSimilar(props,s1,s2))
  
  }

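  /**
   * Extract the meaningful entities (nouns) from a text.
   *
   * @param text the text to segment
   * @return words longer than one character whose part-of-speech tag starts with "n"
   */
  def getEfficientNorms(text: String): List[String] = {
    val terms: List[Term] = HanLP.newSegment.seg(text).asScala.toList
    // Natures whose tag starts with "n" are nouns
    terms.filter(term => term.word.length > 1 && term.nature.startsWith("n")).map(term => term.word)
  }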

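  /**
   * Compute the Jaccard coefficient: |A ∩ B| / |A ∪ B|.
   *
   * @param array1 first collection of words
   * @param array2 second collection of words
   * @return a value in [0, 1]; 0.0 when both collections are empty
   */
  def getJaccardCoefficient(array1: Seq[String], array2: Seq[String]): Double = {
    val s1 = array1.toSet
    val s2 = array2.toSet
    val unionSize = s1.union(s2).size
    // Guard against 0 / 0 (NaN) when neither text yields any nouns
    if (unionSize == 0) 0.0 else s1.intersect(s2).size.toDouble / unionSize
  }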

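  /**
   * Compute the edit (Levenshtein) distance: the minimum number of
   * insertions, deletions and substitutions needed to turn word1 into word2.
   *
   * @param word1 first string
   * @param word2 second string
   * @return the edit distance
   */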
  def getLevenshtein(word1: String, word2: String): Int = {
    val m = word1.length
    val n = word2.length
    val dp = Array.ofDim[Int](m + 1, n + 1)
    for (i <- 0 to m) dp(i)(0) = i
    for (j <- 0 to n) dp(0)(j) = j
    for (i <- 1 to m; j <- 1 to n) {
      if (word1(i - 1) == word2(j - 1)) dp(i)(j) = dp(i - 1)(j - 1)
      else dp(i)(j) = (dp(i - 1)(j - 1) + 1).min((dp(i - 1)(j) + 1).min(dp(i)(j - 1) + 1))
    }
    dp(m)(n)
  }
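
  /**
   * Decide whether two strings are similar.
   *
   * Thresholds are read from props: deduplicateMaxEditDistance,
   * deduplicateMinDistanceRate and deduplicateMinJaccardDistance.
   * Despite its name, the last one is a minimum Jaccard coefficient (a similarity).
   *
   * @param props configuration holding the thresholds
   * @param text1 first text
   * @param text2 second text
   * @return true only if the pair passes all three thresholds
   */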
  def isSimilar(props: Properties, text1: String, text2: String): Boolean = {

    val maxDistance = props.getProperty("deduplicateMaxEditDistance", "10").toInt
    val minDistanceRate = props.getProperty("deduplicateMinDistanceRate", "0.3").toFloat
    val minJaccardDistance = props.getProperty("deduplicateMinJaccardDistance", "0.1").toFloat


    val lDis = getLevenshtein(text1, text2)

    val score = 1 - lDis.toDouble/ Math.max(text1.length, text2.length)

    val jDis = getJaccardCoefficient(getEfficientNorms(text1), getEfficientNorms(text2))

    jDis > minJaccardDistance && score > minDistanceRate && lDis < maxDistance

  }
}
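
To make the three thresholds concrete, the following is a minimal sketch that reproduces the arithmetic for the sample pair without depending on HanLP. The object name `ThresholdWalkthrough`, the two-row variant of the distance computation, and the hardcoded noun sets are illustrative assumptions; in particular, the noun sets are only a guess at what a segmenter might return for these sentences.

```scala
object ThresholdWalkthrough {
  // Levenshtein distance keeping only two rows of the DP table.
  def levenshtein(a: String, b: String): Int = {
    var prev: Array[Int] = Array.range(0, b.length + 1)
    for (i <- 1 to a.length) {
      val curr = new Array[Int](b.length + 1)
      curr(0) = i
      for (j <- 1 to b.length) {
        val cost = if (a(i - 1) == b(j - 1)) 0 else 1
        curr(j) = math.min(math.min(prev(j) + 1, curr(j - 1) + 1), prev(j - 1) + cost)
      }
      prev = curr
    }
    prev(b.length)
  }

  // Normalized edit similarity: 1 - distance / max length (1.0 for two empty strings).
  def editScore(a: String, b: String): Double = {
    val maxLen = math.max(a.length, b.length)
    if (maxLen == 0) 1.0 else 1.0 - levenshtein(a, b).toDouble / maxLen
  }

  // Jaccard coefficient of two word sets (0.0 when both are empty).
  def jaccard(a: Set[String], b: Set[String]): Double = {
    val unionSize = a.union(b).size
    if (unionSize == 0) 0.0 else a.intersect(b).size.toDouble / unionSize
  }

  def main(args: Array[String]): Unit = {
    val s1 = "山东今天大雨"
    val s2 = "云南今天大雨"
    // Assumed noun sets; real output depends on the segmenter and its model.
    val nouns1 = Set("山东", "大雨")
    val nouns2 = Set("云南", "大雨")
    println(levenshtein(s1, s2))    // two substitutions
    println(editScore(s1, s2))      // 1 - 2/6
    println(jaccard(nouns1, nouns2)) // 1 shared word out of 3 distinct words
  }
}
```

Under these assumptions the pair has an edit distance of 2, an edit score of about 0.67 and a Jaccard coefficient of about 0.33, so it would pass the thresholds used in `main` above (0.2 for Jaccard, and the defaults 0.3 and 10 for the other two).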

For a DataFrame, getLevenshtein can be replaced by Spark's built-in levenshtein function:

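 import org.apache.spark.sql.functions.{col, greatest, length, levenshtein, lit}

 // df is assumed to have two string columns, text1 and text2
 df.withColumn("editDistance", levenshtein(col("text1"), col("text2")))
   .withColumn("score", lit(1) - col("editDistance") / greatest(length(col("text1")), length(col("text2"))))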

2. MD5
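
The post gives no detail for this option. One plausible reading is exact-duplicate detection: hash a normalized form of each text and treat equal digests as duplicates. The sketch below works under that assumption; the object name `Md5Dedup` and the normalization rule (trim, then drop whitespace) are hypothetical choices, not something the post prescribes.

```scala
import java.nio.charset.StandardCharsets
import java.security.MessageDigest

object Md5Dedup {
  // Hypothetical normalization: trim, then remove all whitespace runs.
  def normalize(text: String): String = text.trim.replaceAll("\\s+", "")

  // Hex-encoded MD5 digest of the normalized text (32 hex characters).
  def md5Hex(text: String): String = {
    val digest = MessageDigest.getInstance("MD5")
    val bytes = digest.digest(normalize(text).getBytes(StandardCharsets.UTF_8))
    bytes.map(b => "%02x".format(b & 0xff)).mkString
  }

  // Two texts count as duplicates only if their digests are identical.
  def isDuplicate(text1: String, text2: String): Boolean =
    md5Hex(text1) == md5Hex(text2)
}
```

Because a single changed character produces a completely different digest, this only catches texts that are identical after normalization. It cannot rank near-duplicates the way the edit-distance approach does.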

3. Semantic vector models
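
The post does not say which model to use. Whatever the model, the comparison step is commonly a cosine similarity between the two sentence vectors, which can be sketched as follows. The vectors are assumed to come from some embedding model that is outside the scope of this sketch, and the object name `CosineSketch` and the default threshold of 0.8 are hypothetical.

```scala
object CosineSketch {
  // Cosine similarity of two equal-length vectors; 0.0 if either vector has zero norm.
  def cosine(a: Array[Double], b: Array[Double]): Double = {
    require(a.length == b.length, "vectors must have the same dimension")
    val dot = a.zip(b).map { case (x, y) => x * y }.sum
    val normA = math.sqrt(a.map(x => x * x).sum)
    val normB = math.sqrt(b.map(x => x * x).sum)
    if (normA == 0.0 || normB == 0.0) 0.0 else dot / (normA * normB)
  }

  // Hypothetical threshold; a suitable value depends on the model and the data.
  def isSimilar(a: Array[Double], b: Array[Double], minCosine: Double = 0.8): Boolean =
    cosine(a, b) >= minCosine
}
```

Unlike edit distance or the Jaccard coefficient, this can treat two texts as similar even when they share no characters, provided the model places them close together.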

Other ideas

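# As a rule of thumb, a ratio() value above 0.6 means the two sequences are close matches; 1 means they are identical
import difflib
difflib.SequenceMatcher(None, str1, str2).quick_ratio()  # similar in principle to the Jaccard coefficient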
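
For readers staying in Scala, the idea behind difflib's quick_ratio() can be approximated in a few lines: count the characters the two strings share as multisets (M) and return 2 * M / T, where T is the combined length. This is a sketch of that formula, not a port of difflib; the object name `QuickRatio` is hypothetical.

```scala
object QuickRatio {
  // Count how many times each character occurs in a string.
  private def charCounts(s: String): Map[Char, Int] =
    s.groupBy(identity).map { case (c, occurrences) => c -> occurrences.length }

  // 2 * M / T, where M is the multiset overlap of characters and T the total length.
  // Character order is ignored, so this is an upper bound on an order-aware ratio.
  def quickRatio(a: String, b: String): Double = {
    val total = a.length + b.length
    if (total == 0) 1.0
    else {
      val countsA = charCounts(a)
      val countsB = charCounts(b)
      val matches = countsA.iterator.map { case (c, n) =>
        math.min(n, countsB.getOrElse(c, 0))
      }.sum
      2.0 * matches / total
    }
  }
}
```

For the sample pair "山东今天大雨" and "云南今天大雨", four of the six characters are shared, so the value is 2 * 4 / 12, roughly 0.67, which is above the 0.6 rule of thumb.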

References

Using Python's difflib
