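This post follows on from my earlier one, "Learning Hadoop, Part 1: Similarity Computation (Cosine Distance)". Apart from the way the similarity is computed, everything else is largely the same, so here I only paste the code that computes the Jaccard distance: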
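For reference, the Jaccard similarity of two sets A and B is |A ∩ B| / |A ∪ B|, and the Jaccard distance is 1 minus that value. The sketch below is one way to compute both with the standard `java.util` set operations. It is not the code from this post: the class name `JaccardSketch` and the methods `similarity` and `distance` are hypothetical names chosen for illustration. It assumes the same convention as the post, where two empty inputs yield 1.0.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper class, for illustration only.
final class JaccardSketch {

    // Jaccard similarity: |A intersect B| / |A union B| over the distinct words.
    static double similarity(String[] words1, String[] words2) {
        Set<String> a = new HashSet<>(Arrays.asList(words1));
        Set<String> b = new HashSet<>(Arrays.asList(words2));
        if (a.isEmpty() && b.isEmpty()) {
            return 1.0; // assumed convention: two empty inputs count as identical
        }
        Set<String> intersection = new HashSet<>(a);
        intersection.retainAll(b); // keep only the words present in both sets
        Set<String> union = new HashSet<>(a);
        union.addAll(b);           // all distinct words from either set
        return (double) intersection.size() / union.size();
    }

    // Jaccard distance is the complement of the similarity.
    static double distance(String[] words1, String[] words2) {
        return 1.0 - similarity(words1, words2);
    }
}
```

Because each array is turned into a set first, this version also handles the case where only one of the two inputs is empty: the union is then the non-empty set and the similarity is 0. The post's original implementation follows.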
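import java.util.HashSet;
import java.util.Set;

protected static double compare(String[] words1, String[] words2){
    // Two empty word lists are treated as identical.
    if(words1.length==0 && words2.length==0){
        return 1.0;
    }
    Set<String> intersectionSet = new HashSet<String>();
    Set<String> unionSet = new HashSet<String>();
    // Compare every pair of words: a match goes into both sets,
    // a mismatch adds both words to the union only.
    // Note: if either array is empty, the inner body never runs for it.
    for(int i=0;i<words1.length;i++){
        for(int j=0;j<words2.length;j++){
            if(words1[i].equals(words2[j])){
                intersectionSet.add(words1[i]);
                unionSet.add(words1[i]);
            }
            else{
                unionSet.add(words1[i]);
                unionSet.add(words2[j]);
            }
        }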