Dice's coefficient

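Dice's coefficient is a similarity measure related to the Jaccard index. This article describes how the coefficient is applied to keyword matching in information retrieval, and how it is computed as a string similarity measure by comparing the character bigrams of two strings.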

Dice's coefficient (also known as the Dice coefficient) is a similarity measure related to the Jaccard index.

For sets X and Y of keywords used in information retrieval, the coefficient may be defined as:[1]

s = \frac{2 |X \cap Y|}{|X| + |Y|}
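
The set-based definition can be sketched in a few lines of Python. This is only an illustration: the function name `dice_coefficient`, the sample keyword sets, and the decision to raise an error when both sets are empty (where the formula is 0/0) are assumptions, not part of the cited definition.

```python
def dice_coefficient(x, y):
    """Dice's coefficient for two collections of keywords, treated as sets."""
    x, y = set(x), set(y)
    total = len(x) + len(y)
    if total == 0:
        # Assumption: the formula is undefined (0/0) for two empty sets,
        # so this sketch refuses to return a value.
        raise ValueError("Dice's coefficient is undefined for two empty sets")
    return 2 * len(x & y) / total


# Hypothetical keyword sets, for illustration only.
query = {"dice", "similarity", "measure"}
document = {"jaccard", "similarity", "measure", "index"}

# Two shared keywords: 2 * 2 / (3 + 4) = 4/7
print(dice_coefficient(query, document))
```

Identical non-empty sets score 1 and disjoint sets score 0, so the value always lies between 0 and 1.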

When taken as a string similarity measure, the coefficient may be calculated for two strings x and y using bigrams as follows:[2]

s = \frac{2 n_{t}}{n_{x} + n_{y}}

where n_t is the number of character bigrams found in both strings, n_x is the number of bigrams in string x, and n_y is the number of bigrams in string y. For example, to calculate the similarity between:

night
nacht

We would find the set of bigrams in each word:

{ni, ig, gh, ht}
{na, ac, ch, ht}

Each set has 4 elements, and the intersection of these two sets has only one element: ht.

Plugging these values into the formula gives s = (2 * 1) / (4 + 4) = 0.25.
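
The worked example can be reproduced with a short Python sketch. The helper names `bigrams` and `dice_similarity` are illustrative, and two choices here are assumptions: bigrams are collected into sets (so a bigram repeated within one string counts once, matching the sets shown above), and strings too short to contain any bigram score 0.

```python
def bigrams(word):
    """Return the set of adjacent character pairs in a string."""
    return {word[i:i + 2] for i in range(len(word) - 1)}


def dice_similarity(x, y):
    """Dice's coefficient of two strings, computed over their bigram sets."""
    bx, by = bigrams(x), bigrams(y)
    total = len(bx) + len(by)
    if total == 0:
        # Assumption: neither string has a bigram, so report no similarity.
        return 0.0
    return 2 * len(bx & by) / total


print(sorted(bigrams("night")))           # ['gh', 'ht', 'ig', 'ni']
print(sorted(bigrams("nacht")))           # ['ac', 'ch', 'ht', 'na']
print(dice_similarity("night", "nacht"))  # 0.25
```

A multiset-based variant, which counts repeated bigrams separately, would give different values for strings such as "banana"; the set-based version is used here only because it matches the example above.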

Notes

  1. C. J. van Rijsbergen (1979)
  2. Kondrak, G. et al. (2003)

References

  • C. J. van Rijsbergen (1979) Information Retrieval (London: Butterworths)
  • Kondrak, G., Marcu, D. and Knight, K. (2003) "Cognates Can Improve Statistical Translation Models" in Proceedings of HLT-NAACL 2003: Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics, pp. 46-48