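Norm
A norm is defined on a normed linear space 【1】 and satisfies three properties: (1) non-negativity; (2) homogeneity; (3) the triangle inequality. It is commonly used to measure the length or size of each vector in a vector space (or of a matrix); for example, the Euclidean distance between two points can be viewed as the length of the vector between them.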
Common norms / distances
$L_1$ norm / distance
$$\begin{aligned} &\|x\|_1=\sum_{i=1}^{n}|x_i|\\ &L_1(x,y)=\sum_{i=1}^{n}|x_i-y_i| \end{aligned}$$
$L_2$ norm / distance
$$\begin{aligned} &\|x\|_2=\sqrt{\sum_{i=1}^{n}x_i^2},\quad x=\{x_1,x_2,\dots,x_n\}\\ &L_2(x,y)=\sqrt{\sum_{i=1}^{n}(x_i-y_i)^2} \end{aligned}$$
$L_p$ norm / distance
$$\begin{aligned} &\|x\|_p=\sqrt[p]{\sum_{i=1}^{n}|x_i|^p}=\left(\sum_{i=1}^{n}|x_i|^p\right)^{\frac{1}{p}}\\ &L_p(x,y)=\left(\sum_{i=1}^{n}|x_i-y_i|^p\right)^{\frac{1}{p}} \end{aligned}$$
$L_\infty$ norm / distance
$$\begin{aligned} &\|x\|_\infty=\max_{1\le i\le n}|x_i|\\ &L_\infty(x,y)=\max_{1\le i\le n}|x_i-y_i| \end{aligned}$$
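The four definitions above are all special cases of the $L_p$ formula, which a few lines of Python can make concrete. This is only a minimal sketch using the standard library; the helper names `lp_norm` and `lp_distance` are made up for illustration and do not come from any library.

```python
import math

def lp_norm(x, p):
    """Return the L_p norm of a sequence of numbers; p may be math.inf."""
    if p == math.inf:
        # L-infinity norm: the largest absolute component.
        return max(abs(v) for v in x)
    return sum(abs(v) ** p for v in x) ** (1.0 / p)

def lp_distance(x, y, p):
    """Return the L_p distance between x and y, i.e. the L_p norm of x - y."""
    return lp_norm([a - b for a, b in zip(x, y)], p)

x = [3.0, -4.0]
y = [0.0, 0.0]
print(lp_norm(x, 1))         # L1 norm: |3| + |-4|
print(lp_norm(x, 2))         # L2 norm: sqrt(9 + 16)
print(lp_norm(x, math.inf))  # L-infinity norm: max(|3|, |-4|)
print(lp_distance(x, y, 2))  # L2 distance from x to the origin
```

Treating a distance as "the norm of the difference" is why each norm above comes paired with a distance.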
Distance between two points
There are many ways to express distance; for example, the angle between two vectors can be described by the cosine distance, and you can even define your own distance. Taking the commonly used distances between two points $(x_1,y_1)$ and $(x_2,y_2)$ as examples:
- Manhattan distance: $L_1(x_1,y_1,x_2,y_2)=|x_1-x_2|+|y_1-y_2|$
- Euclidean distance: $L_2(x_1,y_1,x_2,y_2)=\sqrt{(x_1-x_2)^2+(y_1-y_2)^2}$
- Mahalanobis distance (vector form): $d(\vec{a},\vec{b})=\sqrt{(\vec{a}-\vec{b})^T\Sigma^{-1}(\vec{a}-\vec{b})}$, where $\Sigma$ is the covariance matrix
- Chebyshev distance: $L_\infty(x_1,y_1,x_2,y_2)=\max(|x_1-x_2|,\,|y_1-y_2|)$
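To compare the four distances on the same pair of points, here is a small standard-library sketch. The function names are illustrative, and `mahalanobis_2d` is restricted to 2-D points with an explicitly supplied 2x2 covariance matrix (an assumption made to keep the example dependency-free).

```python
import math

def manhattan(a, b):
    return sum(abs(p - q) for p, q in zip(a, b))

def euclidean(a, b):
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))

def chebyshev(a, b):
    return max(abs(p - q) for p, q in zip(a, b))

def mahalanobis_2d(a, b, cov):
    """Mahalanobis distance for 2-D points; cov is a 2x2 covariance matrix."""
    (s11, s12), (s21, s22) = cov
    det = s11 * s22 - s12 * s21
    if det == 0:
        raise ValueError("covariance matrix is singular")
    # Closed-form inverse of a 2x2 matrix.
    inv = ((s22 / det, -s12 / det), (-s21 / det, s11 / det))
    dx, dy = a[0] - b[0], a[1] - b[1]
    # Quadratic form (a - b)^T * inv * (a - b).
    row0 = inv[0][0] * dx + inv[0][1] * dy
    row1 = inv[1][0] * dx + inv[1][1] * dy
    return math.sqrt(dx * row0 + dy * row1)

p1, p2 = (1.0, 2.0), (4.0, 6.0)
identity = ((1.0, 0.0), (0.0, 1.0))
print(manhattan(p1, p2))
print(euclidean(p1, p2))
print(chebyshev(p1, p2))
print(mahalanobis_2d(p1, p2, identity))  # equals the Euclidean distance
```

With the identity matrix as covariance, the Mahalanobis distance reduces to the Euclidean distance; a non-identity covariance rescales each direction by its variance.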
K-means clustering algorithm
Standard k-means procedure
1. Initialize $K$ centers
2. Define a distance, then compute the distance from each data point to each of the $K$ centers:
$$dist(x_{i},x_{j})=\sqrt{\sum_{d=1}^{D}(x_{i,d}-x_{j,d})^{2}}$$
3. Assign each data point to its nearest center, which partitions the data into $K$ clusters
4. Take the mean of each cluster to update its center:
$$Center_{k}=\frac{1}{|C_{k}|}\sum_{x_{i}\in C_{k}}x_{i}$$
5. Return to step 2 and continue iterating
6. Termination: either cap the number of iterations, or stop when the following objective changes little between consecutive updates:
$$J=\sum_{k=1}^{K}\sum_{x_{i}\in C_{k}}dist(x_{i},Center_{k})$$
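The six steps above can be sketched in plain Python as follows. This is a minimal illustration, not a tuned implementation: the function names, the `tol` and `seed` parameters, and the choice to keep the old center when a cluster turns out empty are all assumptions made for this example.

```python
import math
import random

def euclidean(a, b):
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))

def mean_point(points):
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def kmeans(points, k, max_iter=100, tol=1e-9, seed=0):
    """Return (centers, clusters, J) for a list of equal-length tuples."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)            # step 1: random initial centers
    prev_j = math.inf
    for _ in range(max_iter):                  # step 6a: iteration cap
        clusters = [[] for _ in range(k)]
        j = 0.0
        for x in points:
            dists = [euclidean(x, c) for c in centers]   # step 2
            nearest = dists.index(min(dists))            # step 3
            clusters[nearest].append(x)
            j += dists[nearest]                # accumulate the objective J
        # step 4: move each center to the mean of its cluster
        # (assumption: an empty cluster simply keeps its previous center)
        centers = [mean_point(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
        if abs(prev_j - j) < tol:              # step 6b: J stopped changing
            break
        prev_j = j                             # step 5: iterate again
    return centers, clusters, j

data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centers, clusters, j = kmeans(data, k=2)
print(sorted(centers))
```

On this toy data the two returned centers are the means of the two well-separated groups.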
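Initialization methods:
1. Randomly initialize the $K$ centers directly
2. Run the algorithm several times, each time with different random initial centers, then keep the set of clusters with the smallest SSE (sum of squared errors)
3. K-means++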
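The empty-cluster problem
If no data points are assigned to some cluster, the result is an empty cluster. A replacement centroid has to be chosen, otherwise the squared error (if that is the measure in use) will be larger than necessary. Two strategies are given below:
- Choose the point that is farthest from any current centroid as the replacement; this removes the point that currently contributes the most to the total squared error
- Choose the replacement centroid from the cluster with the largest SSE; this splits that cluster and lowers the total SSE of the clustering
If there are several empty clusters, repeat the process several times.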
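The first strategy (re-seed an empty cluster with the point farthest from its nearest centroid) might look like the sketch below. The function name `reseed_empty_clusters` and the early exit when every point already coincides with a center are assumptions of this example, not part of any particular library.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))

def reseed_empty_clusters(points, centers):
    """Return a copy of centers in which empty clusters have been re-seeded."""
    centers = list(centers)
    for _ in range(len(centers)):              # at most one re-seed per center
        # For every point: (distance to nearest center, index of that center).
        nearest = []
        for x in points:
            dists = [euclidean(x, c) for c in centers]
            d = min(dists)
            nearest.append((d, dists.index(d)))
        used = {idx for _, idx in nearest}
        empty = [i for i in range(len(centers)) if i not in used]
        if not empty:
            break
        # The point farthest from its nearest centroid contributes the most error.
        far = max(range(len(points)), key=lambda i: nearest[i][0])
        if nearest[far][0] == 0:
            break                              # nothing left to re-seed with
        centers[empty[0]] = points[far]
    return centers

pts = [(0, 0), (1, 0), (10, 0)]
print(reseed_empty_clusters(pts, [(0.5, 0), (100, 0)]))
```

In the usage line the second center attracts no points, so it is moved onto `(10, 0)`, the point with the largest distance to its nearest centroid.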
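Advantages: few parameters, simple, and easy to implement; when the underlying clusters are convex, well separated, and of similar size, the clustering results are good. It is efficient and scales well on large data sets, and the algorithm is fairly interpretable.
Disadvantages: the number of clusters $K$ must be specified in advance, and finding the most suitable $K$ usually takes repeated trials; it is sensitive to the choice of initial cluster centers; it easily gets stuck in a local optimum; it is not suited to discovering non-convex clusters or clusters of very different sizes; it is sensitive to noise and outliers (which can be mitigated by using the median instead of the mean); and if the latent classes are imbalanced, for example their sizes differ greatly or their variances differ, the clustering quality is poor.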
k-means++
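k-means++ gives k-means a good set of initial centers: the initial cluster centers should be as far from one another as possible.
The method is as follows:
1. Randomly choose one point from the data set as the first cluster center
2. For each data point, compute the (squared) distance to the nearest cluster center chosen so far
- $D(x_i)=\min_{r=1,2,\dots,k_{selected}}\|x_i-\mu_r\|_2^2$
3. Choose a new cluster center, favoring points with larger $D(x)$ (in standard k-means++ the point is sampled with probability proportional to $D(x)$, which is already a squared distance here)
4. Repeat steps 2 and 3 until $K$ cluster centers have been chosen
5. Run standard k-means clustering starting from these $K$ initial centers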
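Steps 1 to 4 of the seeding can be sketched with the standard library alone. The names `sq_dist` and `kmeans_pp_init` and the `seed` parameter are illustrative; sampling proportionally to $D(x)$ follows the usual k-means++ formulation rather than anything stated in more detail above.

```python
import random

def sq_dist(a, b):
    return sum((p - q) ** 2 for p, q in zip(a, b))

def kmeans_pp_init(points, k, seed=0):
    """Pick k initial centers from points using k-means++ style seeding."""
    rng = random.Random(seed)
    centers = [rng.choice(points)]             # step 1: first center at random
    while len(centers) < k:                    # step 4: repeat until k centers
        # step 2: D(x) = squared distance to the nearest chosen center
        d = [min(sq_dist(x, c) for c in centers) for x in points]
        if sum(d) == 0:
            raise ValueError("fewer distinct points than k")
        # step 3: sample the next center with probability proportional to D(x);
        # already chosen points have D(x) = 0, so they have zero weight
        centers.append(rng.choices(points, weights=d, k=1)[0])
    return centers

data = [(0, 0), (0, 1), (10, 10), (10, 11), (20, 0)]
print(kmeans_pp_init(data, k=3))
```

The returned list would then be used in place of the random centers in step 1 of standard k-means.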
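Other optimizations
- Distance-computation optimization (Elkan K-Means): uses the triangle properties that the sum of two sides is at least the third side, and that the difference of two sides is no greater than the third side, to reduce the number of distance computations
- When the data set is large, a subset of the samples can be drawn and clustered instead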
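The triangle-inequality idea can be illustrated with one of its consequences: if $d(c_a,c_b)\ge 2\,d(x,c_a)$, then $d(x,c_b)\ge d(x,c_a)$, so the distance from $x$ to $c_b$ never needs to be computed. The sketch below applies only this one pruning rule during assignment; it is a simplification for illustration, not the full Elkan algorithm, and the function names are made up.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))

def pairwise_center_dists(centers):
    return [[euclidean(a, b) for b in centers] for a in centers]

def nearest_center_pruned(x, centers, center_dists):
    """Return (index of nearest center, number of point-center distances computed)."""
    best, best_d = 0, euclidean(x, centers[0])
    computed = 1
    for j in range(1, len(centers)):
        # d(x, c_j) >= d(c_best, c_j) - d(x, c_best) by the triangle inequality,
        # so if d(c_best, c_j) >= 2 * d(x, c_best), c_j cannot be closer.
        if center_dists[best][j] >= 2 * best_d:
            continue
        d = euclidean(x, centers[j])
        computed += 1
        if d < best_d:
            best, best_d = j, d
    return best, computed

centers = [(0, 0), (10, 0), (0, 10)]
cd = pairwise_center_dists(centers)
print(nearest_center_pruned((1, 1), centers, cd))
print(nearest_center_pruned((9, 1), centers, cd))
```

The center-to-center distances are computed once per iteration and shared by all points, which is where the saving comes from when there are many more points than centers.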
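Clustering anchor boxes with K-means
Only the distance needs to be defined; everything else follows the procedure above: $d=1-IoU$
Here the distance expresses how different two boxes are: the closer they are, the larger the IoU and the smaller the distance $d$. (This is equivalent to aligning the boxes at a common point: two boxes that coincide exactly have the maximum IoU, and the more their widths and heights differ, the smaller the IoU.)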
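As a quick worked example of this distance, consider boxes described only by (width, height) and aligned at a common corner. The helper names `iou_wh` and `iou_distance` are illustrative and assume strictly positive widths and heights.

```python
def iou_wh(box_a, box_b):
    """IoU of two (width, height) boxes aligned at a common corner."""
    inter = min(box_a[0], box_b[0]) * min(box_a[1], box_b[1])
    union = box_a[0] * box_a[1] + box_b[0] * box_b[1] - inter
    return inter / union

def iou_distance(box_a, box_b):
    """Distance used for clustering: d = 1 - IoU."""
    return 1.0 - iou_wh(box_a, box_b)

# intersection = min(2, 4) * min(3, 2) = 4, union = 6 + 8 - 4 = 10
print(iou_wh((2, 3), (4, 2)))        # IoU = 4 / 10
print(iou_distance((2, 3), (4, 2)))  # d = 1 - 4 / 10
print(iou_distance((2, 3), (2, 3)))  # identical boxes: d = 0
```

Because only widths and heights enter the computation, box positions in the image play no role in the clustering.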
Implementation
# coding: utf-8
# This script is modified from https://github.com/lars76/kmeans-anchor-boxes
from __future__ import division, print_function
import numpy as np

def iou(box, clusters):
    """
    Calculates the Intersection over Union (IoU) between a box and k clusters.
    param:
        box: tuple or array, shifted to the origin (i. e. width and height)
        clusters: numpy array of shape (k, 2) where k is the number of clusters
    return:
        numpy array of shape (k,) where k is the number of clusters
    """
    # width and height of the intersection with every cluster box
    x = np.minimum(clusters[:, 0], box[0])
    y = np.minimum(clusters[:, 1], box[1])
    if np.count_nonzero(x == 0) > 0 or np.count_nonzero(y == 0) > 0:
        raise ValueError("Box has no area")

    intersection = x * y
    box_area = box[0] * box[1]
    cluster_area = clusters[:, 0] * clusters[:, 1]

    # the small constant guards against division by zero
    iou_ = intersection / (box_area + cluster_area - intersection + 1e-10)

    return iou_

def avg_iou(boxes, clusters):
    """
    Calculates the average Intersection over Union (IoU) between a numpy array of boxes and k clusters.
    param:
        boxes: numpy array of shape (r, 2), where r is the number of rows
        clusters: numpy array of shape (k, 2) where k is the number of clusters
    return:
        average IoU as a single float
    """
    return np.mean([np.max(iou(boxes[i], clusters)) for i in range(boxes.shape[0])])

def translate_boxes(boxes):
    """
    Translates all the boxes to the origin.
    param:
        boxes: numpy array of shape (r, 4)
    return:
        numpy array of shape (r, 2)
    """
    new_boxes = boxes.copy()
    for row in range(new_boxes.shape[0]):
        new_boxes[row][2] = np.abs(new_boxes[row][2] - new_boxes[row][0])
        new_boxes[row][3] = np.abs(new_boxes[row][3] - new_boxes[row][1])
    return np.delete(new_boxes, [0, 1], axis=1)

def kmeans(boxes, k, dist=np.median):
    """
    Calculates k-means clustering with the Intersection over Union (IoU) metric.
    param:
        boxes: numpy array of shape (r, 2), where r is the number of rows
        k: number of clusters
        dist: function that aggregates the boxes of a cluster into its new
              center (np.median by default; np.mean gives the classic update)
    return:
        numpy array of shape (k, 2)
    """
    rows = boxes.shape[0]

    distances = np.empty((rows, k))
    last_clusters = np.zeros((rows,))

    np.random.seed()

    # the Forgy method will fail if the whole array contains the same rows
    clusters = boxes[np.random.choice(rows, k, replace=False)]

    while True:
        # distance of every box to every cluster center: d = 1 - IoU
        for row in range(rows):
            distances[row] = 1 - iou(boxes[row], clusters)

        nearest_clusters = np.argmin(distances, axis=1)

        # stop once the assignment no longer changes
        if (last_clusters == nearest_clusters).all():
            break

        # update each center (note: empty clusters are not handled here)
        for cluster in range(k):
            clusters[cluster] = dist(boxes[nearest_clusters == cluster], axis=0)

        last_clusters = nearest_clusters

    return clusters
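
def parse_anno(annotation_path):
    """
    Reads an annotation file and returns the (width, height) of every box.
    Each line is space-separated: the first two fields are skipped, then every
    group of five values holds one box (one skipped value, x_min, y_min, x_max, y_max).
    """
    result = []
    # use a context manager so the file is always closed
    with open(annotation_path, 'r') as anno:
        for line in anno:
            s = line.strip().split(' ')
            s = s[2:]
            box_cnt = len(s) // 5
            for i in range(box_cnt):
                x_min, y_min, x_max, y_max = float(s[i*5+1]), float(s[i*5+2]), float(s[i*5+3]), float(s[i*5+4])
                width = x_max - x_min
                height = y_max - y_min
                assert width > 0
                assert height > 0
                result.append([width, height])
    result = np.asarray(result)
    return result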

def get_kmeans(anno, cluster_num=9):
    anchors = kmeans(anno, cluster_num)
    ave_iou = avg_iou(anno, anchors)

    # round down to integers and sort the anchors by area
    anchors = anchors.astype('int').tolist()
    anchors = sorted(anchors, key=lambda x: x[0] * x[1])

    return anchors, ave_iou

if __name__ == '__main__':
    annotation_path = "./data/my_data/train.txt"
    anno_result = parse_anno(annotation_path)
    anchors, ave_iou = get_kmeans(anno_result, 9)

    # format the anchors as "w1,h1, w2,h2, ..."
    anchor_string = ', '.join('{},{}'.format(anchor[0], anchor[1]) for anchor in anchors)

    print('anchors are:')
    print(anchor_string)
    print('the average iou is:')
    print(ave_iou)
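References
【1】Normed vector space
【2】Euclidean distance and Mahalanobis distance
【3】A summary of SVD in machine learning
【4】Principles of the K-Means clustering algorithm
【5】Mahalanobis distance and Euclidean distance explained in detail
【6】Understanding covariance and Mahalanobis distance
【7】Mahalanobis Distance
【8】Visualizing three distances: Euclidean, Manhattan, and Chebyshev
【9】How to understand normed linear spaces, Hilbert spaces, Banach spaces, and topological spaces
【10】K-means clustering and image segmentation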