Java string similarity: similarity string comparison in Java

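Question

I want to compare several strings to each other and find the ones that are most similar. Is there a library, method, or best practice that would tell me which strings are more similar to other strings? For example:

"The quick fox jumped" -> "The fox jumped"

"The quick fox jumped" -> "The fox"

This comparison would report that the first pair is more similar than the second.

I guess I need a method such as: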

double similarityIndex(String s1, String s2)

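Does something like this exist somewhere?

Edit: Why am I doing this? I am writing a script that compares the output of an MS Project file with the output of a legacy system that handles tasks. Because the legacy system has very limited field widths, descriptions are abbreviated when values are added. I want a semi-automatic way to find which entries in MS Project are similar to the entries on the system, so that I can retrieve the generated keys. It has drawbacks, since the result still has to be checked manually, but it would save a lot of work.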

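#1 Top answer (129 votes)

The common way to calculate the similarity between two strings on a 0%-100% scale, as used in many libraries, is to measure how much (in %) of the longer string you would have to change to turn it into the shorter one: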

/**
 * Calculates the similarity (a number within 0 and 1) between two strings.
 */
public static double similarity(String s1, String s2) {
    String longer = s1, shorter = s2;
    if (s1.length() < s2.length()) { // longer should always have greater length
        longer = s2; shorter = s1;
    }
    int longerLength = longer.length();
    if (longerLength == 0) { return 1.0; /* both strings are zero length */ }
    return (longerLength - editDistance(longer, shorter)) / (double) longerLength;
}
// you can use StringUtils.getLevenshteinDistance() as the editDistance() function
// full copy-paste working code is below
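Written as a formula, the snippet above divides the number of unchanged characters by the length of the longer string. The notation here is an assumption introduced for illustration: d stands for the edit distance and |s| for the length of s. The second line works through the well-known pair "kitten" and "sitting", whose Levenshtein distance is 3.

```latex
\[
\mathrm{similarity}(s_1, s_2) =
  \frac{\max(|s_1|, |s_2|) - d(s_1, s_2)}{\max(|s_1|, |s_2|)}
\]
\[
\mathrm{similarity}(\texttt{kitten}, \texttt{sitting}) = \frac{7 - 3}{7} \approx 0.571
\]
```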

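###Computing the editDistance():

The editDistance() function above is expected to calculate the edit distance between the two strings. There are several implementations of this step, and each may suit a particular scenario better. The most common is the Levenshtein distance algorithm, which is used in the example below (for very large strings, other algorithms are likely to perform better).

Here are two options for calculating the edit distance:

You can use the Apache Commons Text implementation of Levenshtein distance: LevenshteinDistance.apply(CharSequence left, CharSequence right)

Implement it yourself. An example implementation is given below.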

###Working example:


public class StringSimilarity {

    /**
     * Calculates the similarity (a number within 0 and 1) between two strings.
     */
    public static double similarity(String s1, String s2) {
        String longer = s1, shorter = s2;
        if (s1.length() < s2.length()) { // longer should always have greater length
            longer = s2; shorter = s1;
        }
        int longerLength = longer.length();
        if (longerLength == 0) { return 1.0; /* both strings are zero length */ }
        /* // If you have Apache Commons Text, you can use it to calculate the edit distance:
        LevenshteinDistance levenshteinDistance = new LevenshteinDistance();
        return (longerLength - levenshteinDistance.apply(longer, shorter)) / (double) longerLength; */
        return (longerLength - editDistance(longer, shorter)) / (double) longerLength;
    }

    // Example implementation of the Levenshtein Edit Distance
    // See http://rosettacode.org/wiki/Levenshtein_distance#Java
    public static int editDistance(String s1, String s2) {
        s1 = s1.toLowerCase();
        s2 = s2.toLowerCase();

        int[] costs = new int[s2.length() + 1];
        for (int i = 0; i <= s1.length(); i++) {
            int lastValue = i;
            for (int j = 0; j <= s2.length(); j++) {
                if (i == 0)
                    costs[j] = j;
                else {
                    if (j > 0) {
                        int newValue = costs[j - 1];
                        if (s1.charAt(i - 1) != s2.charAt(j - 1))
                            newValue = Math.min(Math.min(newValue, lastValue),
                                    costs[j]) + 1;
                        costs[j - 1] = lastValue;
                        lastValue = newValue;
                    }
                }
            }
            if (i > 0)
                costs[s2.length()] = lastValue;
        }
        return costs[s2.length()];
    }

    public static void printSimilarity(String s, String t) {
        System.out.println(String.format(
            "%.3f is the similarity between \"%s\" and \"%s\"", similarity(s, t), s, t));
    }

    public static void main(String[] args) {
        printSimilarity("", "");
        printSimilarity("1234567890", "1");
        printSimilarity("1234567890", "123");
        printSimilarity("1234567890", "1234567");
        printSimilarity("1234567890", "1234567890");
        printSimilarity("1234567890", "1234567980");
        printSimilarity("47/2010", "472010");
        printSimilarity("47/2010", "472011");
        printSimilarity("47/2010", "AB.CDEF");
        printSimilarity("47/2010", "4B.CDEFG");
        printSimilarity("47/2010", "AB.CDEFG");
        printSimilarity("The quick fox jumped", "The fox jumped");
        printSimilarity("The quick fox jumped", "The fox");
        printSimilarity("kitten", "sitting");
    }

}

Output:

1.000 is the similarity between "" and ""
0.100 is the similarity between "1234567890" and "1"
0.300 is the similarity between "1234567890" and "123"
0.700 is the similarity between "1234567890" and "1234567"
1.000 is the similarity between "1234567890" and "1234567890"
0.800 is the similarity between "1234567890" and "1234567980"
0.857 is the similarity between "47/2010" and "472010"
0.714 is the similarity between "47/2010" and "472011"
0.000 is the similarity between "47/2010" and "AB.CDEF"
0.125 is the similarity between "47/2010" and "4B.CDEFG"
0.000 is the similarity between "47/2010" and "AB.CDEFG"
0.700 is the similarity between "The quick fox jumped" and "The fox jumped"
0.350 is the similarity between "The quick fox jumped" and "The fox"
0.571 is the similarity between "kitten" and "sitting"
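The question asks for the most similar string among several candidates, so a similarity score like the one above would typically be wrapped in a small search loop. The following is a minimal sketch under stated assumptions: the class name BestMatch and the method bestMatch are hypothetical names introduced here, and the distance function is a plain two-row Levenshtein rewritten for this example rather than the answer's exact code. Like the answer, it lower-cases its inputs before comparing.

```java
import java.util.List;

/** Sketch: pick the candidate most similar to a query string (hypothetical helper). */
class BestMatch {

    /** Two-row Levenshtein distance; compares characters exactly as given. */
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // turning "" into b[0..j) takes j insertions
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // turning a[0..i) into "" takes i deletions
            for (int j = 1; j <= b.length(); j++) {
                int substitution = prev[j - 1] + (a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1);
                int deletion = prev[j] + 1;
                int insertion = curr[j - 1] + 1;
                curr[j] = Math.min(substitution, Math.min(deletion, insertion));
            }
            int[] tmp = prev; // swap rows so prev always holds the latest row
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    /** Similarity in [0, 1]: share of the longer string that needs no edit. */
    static double similarity(String s1, String s2) {
        String a = s1.toLowerCase();
        String b = s2.toLowerCase();
        int longerLength = Math.max(a.length(), b.length());
        if (longerLength == 0) {
            return 1.0; // both strings are empty
        }
        return (longerLength - editDistance(a, b)) / (double) longerLength;
    }

    /** Returns the highest-scoring candidate, or null if the list is empty. */
    static String bestMatch(String query, List<String> candidates) {
        String best = null;
        double bestScore = -1.0;
        for (String candidate : candidates) {
            double score = similarity(query, candidate);
            if (score > bestScore) { // on ties the earlier candidate wins
                bestScore = score;
                best = candidate;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        List<String> candidates = List.of("The fox jumped", "The fox", "A slow turtle");
        System.out.println(bestMatch("The quick fox jumped", candidates)); // prints "The fox jumped"
    }
}
```

For the use case in the question, the returned match would still need a manual check; a minimum score threshold could be added to skip entries with no plausible counterpart.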

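#2 Top answer (73 votes)

Yes, there are many well-documented algorithms, such as:

Cosine similarity

Jaccard similarity

Dice's coefficient

Matching similarity

Overlap similarity

etc.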
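To make two of the set-based measures in this list concrete, here is a minimal sketch of Jaccard similarity and Dice's coefficient. It rests on assumptions that are not part of the answer: each string is represented as its set of lower-cased character bigrams, two strings without any bigrams are treated as identical, and the class and method names are hypothetical.

```java
import java.util.HashSet;
import java.util.Set;

/** Sketch of two set-based measures over character bigrams (hypothetical helper). */
class SetSimilarity {

    /** Distinct two-character substrings of the lower-cased input. */
    static Set<String> bigrams(String s) {
        String t = s.toLowerCase();
        Set<String> result = new HashSet<>();
        for (int i = 0; i + 1 < t.length(); i++) {
            result.add(t.substring(i, i + 2));
        }
        return result;
    }

    /** Number of bigrams the two sets have in common. */
    static int intersectionSize(Set<String> a, Set<String> b) {
        Set<String> common = new HashSet<>(a);
        common.retainAll(b);
        return common.size();
    }

    /** Jaccard similarity: |A ∩ B| / |A ∪ B|. */
    static double jaccard(String s1, String s2) {
        Set<String> a = bigrams(s1);
        Set<String> b = bigrams(s2);
        if (a.isEmpty() && b.isEmpty()) {
            return 1.0; // assumption: nothing to compare counts as identical
        }
        int common = intersectionSize(a, b);
        return common / (double) (a.size() + b.size() - common);
    }

    /** Dice's coefficient: 2 |A ∩ B| / (|A| + |B|). */
    static double dice(String s1, String s2) {
        Set<String> a = bigrams(s1);
        Set<String> b = bigrams(s2);
        if (a.isEmpty() && b.isEmpty()) {
            return 1.0; // same convention as in jaccard()
        }
        return 2.0 * intersectionSize(a, b) / (a.size() + b.size());
    }

    public static void main(String[] args) {
        // "night" and "nacht" share exactly one bigram ("ht") out of four each.
        System.out.println(jaccard("night", "nacht")); // 1/7
        System.out.println(dice("night", "nacht"));    // 2/8 = 0.25
    }
}
```

Unlike edit distance, these measures ignore the order in which the bigrams occur, which can be more forgiving when words in a description are rearranged.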

Check these projects:

http://www.dcs.shef.ac.uk/~sam/simmetrics.html

http://jtmt.sourceforge.net/
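Cosine similarity, the first measure named in this answer, compares two strings as vectors of term counts. The sketch below is an illustration under assumptions, not the API of either project linked above: terms are lower-cased, whitespace-separated words, an empty string scores 0 against anything, and the class and method names are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

/** Sketch of cosine similarity over word counts (hypothetical helper). */
class CosineSimilarity {

    /** Counts how often each lower-cased, whitespace-separated word occurs. */
    static Map<String, Integer> termCounts(String s) {
        Map<String, Integer> counts = new HashMap<>();
        for (String token : s.toLowerCase().split("\\s+")) {
            if (!token.isEmpty()) {
                counts.merge(token, 1, Integer::sum);
            }
        }
        return counts;
    }

    /** Euclidean length of a count vector. */
    static double norm(Map<String, Integer> vector) {
        double sumOfSquares = 0.0;
        for (int count : vector.values()) {
            sumOfSquares += (double) count * count;
        }
        return Math.sqrt(sumOfSquares);
    }

    /** Cosine of the angle between the two count vectors, in [0, 1]. */
    static double cosine(String s1, String s2) {
        Map<String, Integer> a = termCounts(s1);
        Map<String, Integer> b = termCounts(s2);
        double dot = 0.0;
        for (Map.Entry<String, Integer> entry : a.entrySet()) {
            dot += entry.getValue() * b.getOrDefault(entry.getKey(), 0);
        }
        double normA = norm(a);
        double normB = norm(b);
        if (normA == 0.0 || normB == 0.0) {
            return 0.0; // assumption: a string without words matches nothing
        }
        return dot / (normA * normB);
    }

    public static void main(String[] args) {
        System.out.println(cosine("The quick fox jumped", "The fox jumped")); // sqrt(3)/2, about 0.866
        System.out.println(cosine("The quick fox jumped", "The fox"));        // 1/sqrt(2), about 0.707
    }
}
```

With word-level terms, both example pairs from the question keep the same ranking as under the edit-distance approach. For heavily abbreviated descriptions, character n-grams may be a better choice of term than whole words.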

#3 Top answer (12 votes)

I translated the Levenshtein distance algorithm into JavaScript:

String.prototype.LevenshteinDistance = function (s2) {
    var array = new Array(this.length + 1);
    for (var i = 0; i < this.length + 1; i++)
        array[i] = new Array(s2.length + 1);

    for (var i = 0; i < this.length + 1; i++)
        array[i][0] = i;
    for (var j = 0; j < s2.length + 1; j++)
        array[0][j] = j;

    for (var i = 1; i < this.length + 1; i++) {
        for (var j = 1; j < s2.length + 1; j++) {
            if (this[i - 1] == s2[j - 1]) array[i][j] = array[i - 1][j - 1];
            else {
                array[i][j] = Math.min(array[i][j - 1] + 1, array[i - 1][j] + 1);
                array[i][j] = Math.min(array[i][j], array[i - 1][j - 1] + 1);
            }
        }
    }
    return array[this.length][s2.length];
};
