Jaccard index and Dice coefficient

The Jaccard index, also known as the Jaccard similarity coefficient (originally named coefficient de communauté by Paul Jaccard), is a statistic used for comparing the similarity and diversity of sample sets.

The Jaccard coefficient measures similarity between sample sets, and is defined as the size of the intersection divided by the size of the union of the sample sets:

 J(A,B) = { |A \cap B| \over |A \cup B| }.

The Jaccard distance, which measures dissimilarity between sample sets, is complementary to the Jaccard coefficient and is obtained by subtracting the Jaccard coefficient from 1, or, equivalently, by dividing the difference of the sizes of the union and the intersection of two sets by the size of the union:

 J_{\delta}(A,B) = 1 - J(A,B) = { { |A \cup B| - |A \cap B| } \over |A \cup B| }.

This distance is a proper metric.[1][2]
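The two definitions above can be sketched in a few lines of Python using built-in sets. The function names `jaccard_index` and `jaccard_distance` are illustrative, not from any particular library, and returning 1.0 when both sets are empty is an assumption made here, since the formula is undefined (0/0) in that case.

```python
def jaccard_index(a, b):
    """Size of the intersection divided by the size of the union."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        # Assumption: two empty sets are treated as identical (J = 1).
        return 1.0
    return len(a & b) / len(union)


def jaccard_distance(a, b):
    """Complement of the Jaccard index: 1 - J(A, B)."""
    return 1.0 - jaccard_index(a, b)


A = {"a", "b", "c"}
B = {"b", "c", "d"}
print(jaccard_index(A, B))     # 0.5  (2 shared elements out of 4 in the union)
print(jaccard_distance(A, B))  # 0.5
```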


Binary Properties

Given two objects, A and B, each with n binary attributes, the Jaccard coefficient is a useful measure of the overlap between the attributes of A and B. Each attribute of A and B can be either 0 or 1. The number of attributes falling into each combination of values for A and B is specified as follows:

M_{11} represents the total number of attributes where A and B both have a value of 1.
M_{01} represents the total number of attributes where the attribute of A is 0 and the attribute of B is 1.
M_{10} represents the total number of attributes where the attribute of A is 1 and the attribute of B is 0.
M_{00} represents the total number of attributes where A and B both have a value of 0.

Each attribute must fall into one of these four categories, meaning that

M_{11} + M_{01} + M_{10} + M_{00} = n.

The Jaccard similarity coefficient, J, is given as

J = {M_{11} \over M_{01} + M_{10} + M_{11}}.
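As a rough illustration of the binary form, the following Python sketch counts the four combinations for two equal-length 0/1 vectors and then applies the formula above. The helper names `attribute_counts` and `binary_jaccard` are hypothetical, and returning 1.0 when neither object has any attribute set to 1 is an assumption, since the denominator is zero in that case.

```python
def attribute_counts(a, b):
    """Return (m11, m01, m10, m00) for two equal-length 0/1 sequences."""
    if len(a) != len(b):
        raise ValueError("a and b must have the same number of attributes")
    m11 = sum(1 for x, y in zip(a, b) if (x, y) == (1, 1))
    m01 = sum(1 for x, y in zip(a, b) if (x, y) == (0, 1))
    m10 = sum(1 for x, y in zip(a, b) if (x, y) == (1, 0))
    m00 = sum(1 for x, y in zip(a, b) if (x, y) == (0, 0))
    return m11, m01, m10, m00


def binary_jaccard(a, b):
    """J = M11 / (M01 + M10 + M11); M00 does not enter the formula."""
    m11, m01, m10, _ = attribute_counts(a, b)
    denominator = m01 + m10 + m11
    if denominator == 0:
        # Assumption: two all-zero objects are treated as identical (J = 1).
        return 1.0
    return m11 / denominator


a = [1, 1, 0, 0, 1, 0]
b = [1, 0, 0, 1, 1, 0]
print(attribute_counts(a, b))  # (2, 1, 1, 2), which sums to n = 6
print(binary_jaccard(a, b))    # 0.5
```

Note that M_{00} is counted but ignored by the coefficient, so attributes absent from both objects do not make them look more similar.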



Dice's coefficient, named after Lee Raymond Dice[1] and also known as the Dice coefficient or Dice similarity coefficient (DSC), is a similarity measure over sets:

s = \frac{2 | X \cap Y |}{| X | + | Y |}

It is identical to the Sørensen similarity index, and is occasionally referred to as the Sørensen-Dice coefficient. It is similar in form to the Jaccard index, and the two are related by s = 2J / (1 + J), but they differ in some of their properties.
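A minimal Python sketch of the coefficient follows. The name `dice_coefficient` is illustrative, `fractions.Fraction` is used only to keep the result exact, and the function assumes that at least one of the two sets is non-empty (otherwise the denominator is zero).

```python
from fractions import Fraction


def dice_coefficient(x, y):
    """s = 2 |X ∩ Y| / (|X| + |Y|), returned as an exact fraction."""
    x, y = set(x), set(y)
    # Assumption: x and y are not both empty, so the denominator is non-zero.
    return Fraction(2 * len(x & y), len(x) + len(y))


X = {"a", "b", "c"}
Y = {"b", "c", "d"}
print(dice_coefficient(X, Y))  # 2/3  (2 * 2 shared elements over 3 + 3)
```

For the same pair of sets the Jaccard index is 2/4 = 1/2, so the Dice coefficient gives more weight to the shared elements.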

The function ranges between zero and one, like Jaccard. Unlike Jaccard, the corresponding difference function

d = 1 -  \frac{2 | X \cap Y |}{| X | + | Y |}

is not a proper distance metric, because it does not satisfy the triangle inequality. The simplest counterexample is given by the three sets {a}, {b}, and {a,b}: the distance between the first two is 1, while the distance between the third and each of the others is one-third, so going through {a,b} gives a total of two-thirds, which is less than the direct distance of 1.
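The counterexample can be checked numerically with a short Python sketch. The function names are illustrative, exact fractions are used to avoid rounding, and both functions assume non-empty inputs. The Jaccard distance is included only for comparison on the same three sets.

```python
from fractions import Fraction


def dice_distance(x, y):
    """d = 1 - 2 |X ∩ Y| / (|X| + |Y|); assumes x and y are not both empty."""
    x, y = set(x), set(y)
    return 1 - Fraction(2 * len(x & y), len(x) + len(y))


def jaccard_distance(x, y):
    """1 - |X ∩ Y| / |X ∪ Y|; assumes the union is non-empty."""
    x, y = set(x), set(y)
    return 1 - Fraction(len(x & y), len(x | y))


p, q, r = {"a"}, {"b"}, {"a", "b"}

direct = dice_distance(p, q)                       # 1
detour = dice_distance(p, r) + dice_distance(r, q)  # 1/3 + 1/3
print(direct, detour, direct <= detour)            # 1 2/3 False

# On the same sets the Jaccard distance does satisfy the inequality.
j_direct = jaccard_distance(p, q)                          # 1
j_detour = jaccard_distance(p, r) + jaccard_distance(r, q)  # 1/2 + 1/2
print(j_direct, j_detour, j_direct <= j_detour)            # 1 1 True
```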


