Simple matching coefficient

From Wikipedia, the free encyclopedia
Jump to navigation Jump to search

The simple matching coefficient (SMC) or Rand similarity coefficient is a statistic used for comparing the similarity and diversity of sample sets.[1]

 A
01
B0{\displaystyle M_{00}}M_{00}{\displaystyle M_{10}}M_{10}
1{\displaystyle M_{01}}M_{01}{\displaystyle M_{11}}M_{11}

Given two objects, A and B, each with n binary attributes, SMC is defined as:

{\displaystyle {\begin{aligned}{\text{SMC}}&={\frac {\text{number of matching attributes}}{\text{number of attributes}}}\\[8pt]&={\frac {M_{00}+M_{11}}{M_{00}+M_{01}+M_{10}+M_{11}}}\end{aligned}}}{\displaystyle {\begin{aligned}{\text{SMC}}&={\frac {\text{number of matching attributes}}{\text{number of attributes}}}\\[8pt]&={\frac {M_{00}+M_{11}}{M_{00}+M_{01}+M_{10}+M_{11}}}\end{aligned}}}

where:

{\displaystyle M_{11}}M_{11} is the total number of attributes where  A and  B both have a value of 1.
{\displaystyle M_{01}}M_{01} is the total number of attributes where the attribute of  A is 0 and the attribute of  B is 1.
{\displaystyle M_{10}}M_{10} is the total number of attributes where the attribute of  A is 1 and the attribute of  B is 0.
{\displaystyle M_{00}}M_{00} is the total number of attributes where  A and  B both have a value of 0.

The simple matching distance (SMD), which measures dissimilarity between sample sets, is given by {\displaystyle 1-{\text{SMC}}}{\displaystyle 1-{\text{SMC}}}.[2]\

SMS is linearly related to Hamann similarity: {\displaystyle SMC=(Hamann+1)/2}{\displaystyle SMC=(Hamann+1)/2}. Also, {\displaystyle SMC=1-D^{2}/n}{\displaystyle SMC=1-D^{2}/n}, where D^2 is the squared Euclidean distance between the two objects (binary vectors) and n is the number of attributes.

Difference with the Jaccard index[edit]

The SMC is very similar to the more popular Jaccard index. The main difference is that the SMC has the term {\displaystyle M_{00}}M_{00} in its numerator and denominator, whereas the Jaccard index does not. Thus, the SMC counts both mutual presences (when an attribute is present in both sets) and mutual absence (when an attribute is absent in both sets) as matches and compares it to the total number of attributes in the universe, whereas the Jaccard index only counts mutual presence as matches and compares it to the number of attributes that have been chosen by at least one of the two sets.

In market basket analysis, for example, the basket of two consumers who we wish to compare might only contain a small fraction of all the available products in the store, so the SMC will usually return very high values of similarities even when the baskets bear very little resemblance, thus making the Jaccard index a more appropriate measure of similarity in that context. For example, consider a supermarket with 1000 products and two customers. The basket of the first customer contains salt and pepper and the basket of the second contains salt and sugar. In this scenario, the similarity between the two baskets as measured by the Jaccard index would be 1/3, but the similarity becomes 0.998 using the SMC.

In other contexts, where 0 and 1 carry equivalent information (symmetry), the SMC is a better measure of similarity. For example, vectors of demographic variables stored in dummy variables, such as binary gender, would be better compared with the SMC than with the Jaccard index since the impact of gender on similarity should be equal, independently of whether male is defined as a 0 and female as a 1 or the other way around. However, when we have symmetric dummy variables, one could replicate the behaviour of the SMC by splitting the dummies into two binary attributes (in this case, male and female), thus transforming them into asymmetric attributes, allowing the use of the Jaccard index without introducing any bias. By using this trick, the Jaccard index can be considered as making the SMC a fully redundant metric. The SMC remains, however, more computationally efficient in the case of symmetric dummy variables since it does not require adding extra dimensions.

The Jaccard index is also more general than the SMC and can be used to compare other data types than just vectors of binary attributes, such as probability measures.

See also[edit]

Notes[edit]

  • 0
    点赞
  • 0
    收藏
    觉得还不错? 一键收藏
  • 0
    评论

“相关推荐”对你有帮助么?

  • 非常没帮助
  • 没帮助
  • 一般
  • 有帮助
  • 非常有帮助
提交
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值