Jaccard similarity

 

The Jaccard index, also known as the Jaccard similarity coefficient (originally coined coefficient de communauté by Paul Jaccard), is a statistic used for comparing the similarity and diversity of sample sets.

The Jaccard coefficient is defined as the size of the intersection divided by the size of the union of the sample sets:

 J(A,B) = \frac{|A \cap B|}{|A \cup B|}.

The Jaccard distance, which measures dissimilarity between sample sets, is the difference between the sizes of the union and the intersection of the two sets, divided by the size of the union. Equivalently, it is obtained by subtracting the Jaccard coefficient from 1:

 J_{\delta}(A,B) = 1 - J(A,B) = \frac{|A \cup B| - |A \cap B|}{|A \cup B|}.
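The two definitions above can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: the function names `jaccard_index` and `jaccard_distance` are hypothetical, and returning 1 for two empty sets is an assumed convention, since the ratio is undefined when the union is empty.

```python
def jaccard_index(a, b):
    """Return |A ∩ B| / |A ∪ B| for two iterables treated as sets."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        # Assumption: two empty sets are treated as identical (J = 1),
        # because the formula itself is undefined for an empty union.
        return 1.0
    return len(a & b) / len(union)


def jaccard_distance(a, b):
    """Return the Jaccard distance, 1 - J(A, B)."""
    return 1.0 - jaccard_index(a, b)


A = {1, 2, 3, 4}
B = {3, 4, 5, 6}
print(jaccard_index(A, B))     # intersection has 2 elements, union has 6
print(jaccard_distance(A, B))  # (6 - 2) / 6
```

Here the sets share two of six distinct elements, so the index is 2/6 and the distance is 4/6.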

 Similarity of asymmetric binary attributes

Given two objects, A and B, each with n binary attributes, the Jaccard coefficient is a useful measure of the overlap between the attributes of A and B. Each attribute of A and B is either 0 or 1. The number of attributes falling into each combination of values for A and B is denoted as follows:

M11 represents the total number of attributes where A and B both have a value of 1.
M01 represents the total number of attributes where the attribute of A is 0 and the attribute of B is 1.
M10 represents the total number of attributes where the attribute of A is 1 and the attribute of B is 0.
M00 represents the total number of attributes where A and B both have a value of 0.

Each attribute must fall into one of these four categories, meaning that

M11 + M01 + M10 + M00 = n.

The Jaccard similarity coefficient, J, is given as

J = \frac{M_{11}}{M_{01} + M_{10} + M_{11}}.

The Jaccard distance, J', is given as

J' = \frac{M_{01} + M_{10}}{M_{01} + M_{10} + M_{11}}.
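As a sketch of the counting scheme above, the following Python tallies M11, M01, M10 and M00 for two equal-length 0/1 sequences and then applies the formulas for J and J'. The helper names `binary_counts`, `binary_jaccard` and `binary_jaccard_distance` are hypothetical, and the value returned when no attribute is 1 in either object is an assumed convention (the formula is undefined in that case).

```python
def binary_counts(a, b):
    """Count the four attribute combinations for two 0/1 sequences."""
    if len(a) != len(b):
        raise ValueError("A and B must have the same number of attributes")
    m11 = m01 = m10 = m00 = 0
    for x, y in zip(a, b):
        if x == 1 and y == 1:
            m11 += 1
        elif x == 0 and y == 1:
            m01 += 1
        elif x == 1 and y == 0:
            m10 += 1
        else:
            m00 += 1
    return m11, m01, m10, m00


def binary_jaccard(a, b):
    """J = M11 / (M01 + M10 + M11); M00 is deliberately ignored."""
    m11, m01, m10, _ = binary_counts(a, b)
    denom = m01 + m10 + m11
    if denom == 0:
        # Assumption: objects with no 1-valued attributes count as identical.
        return 1.0
    return m11 / denom


def binary_jaccard_distance(a, b):
    """J' = (M01 + M10) / (M01 + M10 + M11) = 1 - J."""
    return 1.0 - binary_jaccard(a, b)


A = [1, 1, 0, 0, 1, 0]
B = [1, 0, 1, 0, 1, 0]
print(binary_counts(A, B))   # (2, 1, 1, 2), which sums to n = 6
print(binary_jaccard(A, B))  # 2 / (1 + 1 + 2) = 0.5
```

Note that M00 never enters the coefficient, which is what makes the measure suitable for asymmetric attributes: shared absences do not count as similarity.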

 

 Tanimoto coefficient (extended Jaccard coefficient)

Cosine similarity measures the similarity between two n-dimensional vectors by the angle between them, and is often used to compare documents in text mining. Given two vectors of attributes, A and B, the angle θ between them (whose cosine is the cosine similarity) is expressed using the dot product and the magnitudes as

 \theta = \arccos \frac{A \cdot B}{\|A\| \|B\|}.

For text matching, the attribute vectors A and B are usually the tf-idf vectors of the documents.

Since the angle θ lies in the range [0, π], a value of π means the vectors are exactly opposite, π/2 means they are independent, and 0 means they point in exactly the same direction; in-between values indicate intermediate similarity or dissimilarity.
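The angle can be computed directly from the dot product and magnitudes, as in this minimal sketch using only the standard library. The function name `cosine_angle` is hypothetical; rejecting zero-length vectors and clamping the cosine to [-1, 1] before calling `math.acos` are choices made here to avoid division by zero and floating-point domain errors.

```python
import math


def cosine_angle(a, b):
    """Return the angle in radians between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("the angle is undefined for a zero vector")
    # Clamp to guard against rounding slightly outside [-1, 1].
    cos_theta = max(-1.0, min(1.0, dot / (norm_a * norm_b)))
    return math.acos(cos_theta)


print(cosine_angle([1, 0], [0, 1]))   # orthogonal vectors: pi / 2
print(cosine_angle([1, 0], [-1, 0]))  # opposite vectors: pi
print(cosine_angle([1, 2], [2, 4]))   # same direction: approximately 0
```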

This cosine similarity metric may be extended such that it yields the Jaccard coefficient in the case of binary attributes. This is the Tanimoto coefficient, T(A,B), represented as

 T(A,B) = \frac{A \cdot B}{\|A\|^2 + \|B\|^2 - A \cdot B}.
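A short Python sketch can illustrate why this formula reduces to the Jaccard coefficient for binary attributes: with 0/1 vectors, A · B counts the attributes that are 1 in both, and each squared norm counts that vector's 1s. The function name `tanimoto` is hypothetical, and returning 1 for two all-zero vectors is an assumed convention, since the denominator is zero there.

```python
def tanimoto(a, b):
    """T(A, B) = A.B / (|A|^2 + |B|^2 - A.B) for equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    denom = sum(x * x for x in a) + sum(y * y for y in b) - dot
    if denom == 0:
        # Assumption: two all-zero vectors are treated as identical.
        return 1.0
    return dot / denom


# Binary case: A.B = 2, |A|^2 = 3, |B|^2 = 3, so T = 2 / (3 + 3 - 2) = 0.5,
# matching the Jaccard coefficient M11 / (M01 + M10 + M11) = 2 / 4.
A = [1, 1, 0, 0, 1, 0]
B = [1, 0, 1, 0, 1, 0]
print(tanimoto(A, B))

# The same formula also accepts real-valued vectors.
print(tanimoto([1.0, 2.0], [2.0, 1.0]))  # 4 / (5 + 5 - 4)
```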

 

 References

  • Pang-Ning Tan, Michael Steinbach and Vipin Kumar, Introduction to Data Mining (2005), ISBN 0-321-32136-7
  • Paul Jaccard (1901) Étude comparative de la distribution florale dans une portion des Alpes et des Jura. Bulletin de la Société Vaudoise des Sciences Naturelles 37, 547-579.
  • Tanimoto, T.T. (1957) IBM Internal Report 17th Nov. 1957.

 
