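Similarity measures are a foundation of machine learning algorithms. Common ones include:
euclideanDistance, Euclidean distance
manhattanDistance, Manhattan distance
chebyshevDistance, Chebyshev distance
hammingDistance, Hamming distance
minkowskiDistance, Minkowski distance
jaccardDistance, Jaccard distance
cosineDistance, cosine similarity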
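Different distance measures suit different scenarios. For example, Euclidean distance and Manhattan distance are both special cases of Minkowski distance (orders p = 2 and p = 1 respectively). They treat every dimension of the vector as if it were measured on the same scale, and they are the most commonly used.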
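To make the relationship concrete, here is a minimal sketch (the function name minkowski and the sample vectors are illustrative, not part of the demo class below) showing that order p = 1 gives the Manhattan distance and p = 2 gives the Euclidean distance:

```python
def minkowski(v1, v2, p):
    """Minkowski distance of order p (p >= 1) between two equal-length sequences."""
    if len(v1) != len(v2):
        raise ValueError("vectors must have the same length")
    if p < 1:
        raise ValueError("p must be >= 1")
    # abs() matters: without it, odd orders could sum negative terms.
    return sum(abs(a - b) ** p for a, b in zip(v1, v2)) ** (1.0 / p)


v1, v2 = [4, 5], [1, 1]            # per-dimension differences: 3 and 4
manhattan = minkowski(v1, v2, 1)   # |3| + |4|
euclidean = minkowski(v1, v2, 2)   # sqrt(3^2 + 4^2)
print(manhattan, euclidean)
```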
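Hamming distance comes from information theory: it counts the positions at which two equal-length sequences have different symbols.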
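As a small illustration, the sketch below counts differing positions for two bit strings, and does the same for two non-negative integers by XOR-ing them and counting the set bits. The helper names are hypothetical and the integer variant is an assumption about how one might apply the idea to binary data:

```python
def hamming(s1, s2):
    """Number of positions at which two equal-length sequences differ."""
    if len(s1) != len(s2):
        raise ValueError("sequences must have the same length")
    return sum(c1 != c2 for c1, c2 in zip(s1, s2))


def hamming_bits(a, b):
    """Hamming distance between the binary forms of two non-negative integers."""
    # XOR leaves a 1 exactly where the two numbers have different bits.
    return bin(a ^ b).count("1")


d_str = hamming("1011101", "1001001")        # differs at index 2 and index 4
d_int = hamming_bits(0b1011101, 0b1001001)   # same data, as integers
print(d_str, d_int)
```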
Chebyshev distance takes the largest difference along any single dimension as the distance.
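One way to see where this fits: as the order p of the Minkowski distance grows, the largest per-dimension difference dominates the sum, so the result approaches the Chebyshev distance. The following is a rough numerical sketch with made-up vectors, not a proof:

```python
def chebyshev(v1, v2):
    """Largest absolute difference along any single dimension."""
    return max(abs(a - b) for a, b in zip(v1, v2))


def minkowski(v1, v2, p):
    """Minkowski distance of order p (assumes p >= 1 and equal lengths)."""
    return sum(abs(a - b) ** p for a, b in zip(v1, v2)) ** (1.0 / p)


v1, v2 = [2.0, 5.0, 9.0], [2.5, 4.0, 3.0]   # differences: 0.5, 1.0, 6.0
cheb = chebyshev(v1, v2)
for p in (1, 2, 10, 50):
    # The printed values shrink toward cheb as p increases.
    print(p, minkowski(v1, v2, p))
print("chebyshev", cheb)
```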
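Jaccard distance is based on the relationship between the union and the intersection of two sets: it is the share of the union's elements that are not in the intersection.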
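A short worked example may help. This sketch uses Python sets directly; the function name is illustrative, and returning 0.0 for two empty inputs is an assumed convention rather than something the text specifies:

```python
def jaccard_distance(a, b):
    """(|A or B| - |A and B|) / |A or B| for two collections treated as sets."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0  # assumption: two empty sets are treated as identical
    return (len(union) - len(a & b)) / float(len(union))


# Union is {90, 5, 2, 4} (4 elements), intersection is {5} (1 element).
d = jaccard_distance([90, 5], [2, 5, 4])
print(d)
```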
Python demo code
#coding:UTF-8
'''
Created on 2014-05-03
@author: hao
'''


class distanceCalculate():
    def euclideanDistance(self, v1, v2):
        '''
        Euclidean Distance
        '''
        if not isinstance(v1, list) or not isinstance(v2, list):
            raise ValueError('vectors should be list type')
        if len(v1) != len(v2):
            raise ValueError('two lists have different dimension')
        return pow(sum(pow(x1-x2,2) for (x1, x2) in zip(v1,v2)),0.5)

    def manhattanDistance(self, v1, v2):
        '''
        Manhattan Distance
        '''
        if not isinstance(v1, list) or not isinstance(v2, list):
            raise ValueError('vectors should be list type')
        if len(v1) != len(v2):
            raise ValueError('two lists have different dimension')
        return sum(abs(x1-x2) for (x1,x2) in zip(v1, v2))

    def chebyshevDistance(self, v1, v2):
        '''
        Chebyshev Distance
        '''
        if not isinstance(v1, list) or not isinstance(v2, list):
            raise ValueError('vectors should be list type')
        if len(v1) != len(v2):
            raise ValueError('two lists have different dimension')
        return max([abs(x1-x2) for (x1,x2) in zip(v1,v2)])

    def hammingDistance(self, s1, s2):
        '''
        the Hamming distance between two strings of equal length is the number of
        positions at which the corresponding symbols are different.
        '''
        if not isinstance(s1, str) or not isinstance(s2, str):
            raise ValueError('Hamming distance is only defined here for two strings')
        if len(s1) != len(s2):
            raise ValueError('two strings have different length')
        return sum(ch1!=ch2 for (ch1,ch2) in zip(s1,s2))

    def minkowskiDistance(self, v1, v2, exponential):
        '''
        a family of distances: exponential=1 is Manhattan, exponential=2 is Euclidean
        '''
        if not isinstance(v1, list) or not isinstance(v2, list):
            raise ValueError('vectors should be list type')
        if len(v1) != len(v2):
            raise ValueError('two lists have different dimension')
        if exponential<1:
            raise ValueError('exponential should be larger or equal to 1')
        # abs() is required; without it odd exponents give wrong (or complex) results
        return pow(sum(pow(abs(x1-x2),exponential) for (x1, x2) in zip(v1,v2)),1/float(exponential))

    def jaccardDistance(self, v1, v2):
        '''
        (A or B)-(A and B)
        ------------------
             (A or B)
        '''
        if not isinstance(v1, list) or not isinstance(v2, list):
            raise ValueError('vectors should be list type')
        # v1 and v2
        v1ANDv2 = set(v1).intersection(set(v2))
        # v1 or v2
        v1ORv2 = set(v1).union(set(v2))
        if not v1ORv2:
            # both lists are empty; treat them as identical by convention
            return 0.0
        return float(len(v1ORv2)-len(v1ANDv2))/len(v1ORv2)

    def cosineDistance(self, v1, v2):
        '''
        vector(a)*vector(b)
        -------------------
              |a|*|b|
        note: despite the method name, this returns the cosine similarity
        '''
        if not isinstance(v1, list) or not isinstance(v2, list):
            raise ValueError('vectors should be list type')
        if len(v1) != len(v2):
            raise ValueError('two lists have different dimension')
        # plain-Python dot products (the original used scipy.dot)
        dot = sum(x1*x2 for (x1,x2) in zip(v1,v2))
        norm = pow(sum(x*x for x in v1)*sum(x*x for x in v2),0.5)
        if norm == 0:
            raise ValueError('cosine similarity is undefined for a zero vector')
        return float(dot)/norm


if __name__=='__main__':
    test = distanceCalculate()
    # print(test.euclideanDistance([4,5], [2,4]))
    # print(test.manhattanDistance([90,5], [2,4]))
    # print(test.chebyshevDistance([2.2,5], [2,4]))
    # print(test.hammingDistance('s162', 's225'))
    # print(test.minkowskiDistance([4,5], [2,4],3))
    # print(test.jaccardDistance([90,5], [2,5,4]))
    print(test.cosineDistance([-5,-10], [2,4]))
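There is also the Mahalanobis distance, which measures the distance between two points while taking the covariance of the data into account.
Its advantages are that it does not depend on the measurement scale (unlike Euclidean distance) and that it removes the interference of correlations between variables.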
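quote: The Mahalanobis distance is powerful; even the Euclidean distance is a special case of it. Before studying it, consider the following picture: there are two normally distributed populations with means a and b but different variances. Which population is the point A in the picture closer to? Clearly, A is closer to the one on the left, and A is more likely to belong to the left population, even though A's Euclidean distance to a is larger. This is the intuitive meaning of the Mahalanobis distance.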
from: http://blog.sina.com.cn/s/blog_7420820c0100pl87.html
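The intuition in the quote can be checked numerically. The following is only a sketch, not the author's implementation: it assumes NumPy is available, the function name mahalanobis is illustrative, and the means, variances and the point A are made-up one-dimensional values chosen to mirror the quoted scenario.

```python
import numpy as np


def mahalanobis(x, y, cov):
    """sqrt((x - y)^T * cov^-1 * (x - y)) for a given covariance matrix."""
    diff = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    # Solve cov * z = diff rather than inverting cov explicitly.
    z = np.linalg.solve(np.asarray(cov, dtype=float), diff)
    return float(np.sqrt(diff.dot(z)))


# Hypothetical populations: mean a = 0 with variance 9 (left, wide),
# mean b = 5 with variance 0.25 (right, narrow); point A = 3.
A, a, b = [3.0], [0.0], [5.0]
d_a = mahalanobis(A, a, [[9.0]])    # Euclidean distance 3, scaled by std 3
d_b = mahalanobis(A, b, [[0.25]])   # Euclidean distance 2, scaled by std 0.5
print(d_a, d_b)

# With the identity matrix as covariance, the result is the Euclidean distance.
d_euclid = mahalanobis([4, 5], [1, 1], np.eye(2))
print(d_euclid)
```

In this made-up example A is nearer to b in Euclidean terms (2 versus 3) but nearer to a in Mahalanobis terms, which is the situation the quote describes.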
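Source of the Python demo code (the distanceCalculate class): http://blog.csdn.net/henryghx/article/details/16785671
I will add my own Mahalanobis distance implementation when I have time.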