# Prototype Clustering: Learning Vector Quantization and a Python Implementation

## Learning Vector Quantization (LVQ)

Input: a sample set $D=\left\{(x_1,y_1),(x_2,y_2),\ldots,(x_m,y_m)\right\}$; the number of prototype vectors $q$; a preset class label $\left\{t_1,t_2,\ldots,t_q\right\}$ for each prototype vector; a learning rate $\eta \in (0,1)$.

1. Initialize a set of prototype vectors $\left\{p_1,p_2,\ldots,p_q\right\}$
2. repeat
3.  Randomly pick a sample $(x_j,y_j)$ from the sample set;
4.  Compute the distance between $x_j$ and each $p_i\ (1\le i\le q)$: $d_{ji}=\left\|x_j-p_i\right\|_2$;
5.  Find the prototype vector $p_{i^\ast}$ nearest to $x_j$: $i^\ast=\arg\min_{i\in\left\{1,2,\ldots,q\right\}}d_{ji}$
6.  if $y_j=t_{i^\ast}$ then
7.   $p'=p_{i^\ast}+\eta\cdot\left(x_j-p_{i^\ast}\right)$
8.  else
9.   $p'=p_{i^\ast}-\eta\cdot\left(x_j-p_{i^\ast}\right)$
10. end if
11.  Update the prototype vector $p_{i^\ast}$ to $p'$
12. until the stopping condition is met

The crux of LVQ is in lines 6–10, i.e., how a prototype vector is updated. Intuitively, for a sample $x_j$, if the nearest prototype vector $p_{i^\ast}$ has the same class label as $x_j$, then $p_{i^\ast}$ is moved toward $x_j$. As line 7 shows, the new prototype vector is

$$p'=p_{i^\ast}+\eta\cdot\left(x_j-p_{i^\ast}\right) \tag{1.1}$$

The distance between $p'$ and $x_j$ is then

$$\left\|p'-x_j\right\|_2=\left\|p_{i^\ast}+\eta\cdot\left(x_j-p_{i^\ast}\right)-x_j\right\|_2=\left(1-\eta\right)\cdot\left\|p_{i^\ast}-x_j\right\|_2 \tag{1.2}$$

If instead $p_{i^\ast}$ and $x_j$ have different class labels, the distance between the updated prototype vector and $x_j$ grows to $\left(1+\eta\right)\cdot\left\|p_{i^\ast}-x_j\right\|_2$, so the prototype moves away from $x_j$.
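The shrink and grow factors above can be checked numerically. This small sketch uses NumPy and hypothetical 2-D vectors (both are my additions for illustration, not from the text):

```python
import numpy as np

eta = 0.1
x = np.array([0.697, 0.460])   # a sample
p = np.array([0.300, 0.200])   # a prototype vector
d = np.linalg.norm(p - x)      # distance before the update

p_same = p + eta * (x - p)     # labels match: move toward x
p_diff = p - eta * (x - p)     # labels differ: move away from x

print(np.isclose(np.linalg.norm(p_same - x), (1 - eta) * d))  # True
print(np.isclose(np.linalg.norm(p_diff - x), (1 + eta) * d))  # True
```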

### Implementing LVQ in Python

1. Data generation. We use a small dataset `data` of 13 samples. Each sample has two features, density and sugar content, plus a label indicating whether the melon is good: Y or N.

```python
data = \
"""1,0.697,0.46,Y,
2,0.774,0.376,Y,
3,0.634,0.264,Y,
4,0.608,0.318,Y,
5,0.556,0.215,Y,
6,0.403,0.237,Y,
7,0.481,0.149,Y,
8,0.437,0.211,Y,
9,0.666,0.091,N,
10,0.639,0.161,N,
11,0.657,0.198,N,
12,0.593,0.042,N,
13,0.719,0.103,N"""
```


2. Data preprocessing.

```python
import re

# A watermelon class with four attributes: id, density,
# sugar content, and whether it is a good melon.
class watermelon:
    def __init__(self, properties):
        self.number = properties[0]
        self.density = float(properties[1])
        self.sweet = float(properties[2])
        self.good = properties[3]

# Split on commas and newlines so each record yields exactly four clean fields.
a = re.split(r'[,\n]+', data.strip())
dataset = []     # dataset: the parsed sample list
for i in range(int(len(a) / 4)):
    temp = tuple(a[i * 4: i * 4 + 4])
    dataset.append(watermelon(temp))
```
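As a quick sanity check, the preprocessing step can be run on a three-row excerpt (the snippet repeats the class definition so it runs on its own, and splits on commas and newlines so the record ids stay clean):

```python
import re

# Three rows excerpted from the dataset above.
data = """1,0.697,0.46,Y,
2,0.774,0.376,Y,
3,0.634,0.264,Y"""

class watermelon:
    def __init__(self, properties):
        self.number = properties[0]
        self.density = float(properties[1])
        self.sweet = float(properties[2])
        self.good = properties[3]

a = re.split(r'[,\n]+', data.strip())
dataset = [watermelon(tuple(a[i * 4: i * 4 + 4])) for i in range(len(a) // 4)]
print(len(dataset), dataset[0].density, dataset[0].good)  # 3 0.697 Y
```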

3. Distance computation, using the Euclidean distance.

```python
import math

def dist(a, b):
    return math.sqrt(math.pow(a[0] - b[0], 2) + math.pow(a[1] - b[1], 2))
```
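A quick check of `dist` on a 3-4-5 right triangle (the definition is repeated so the snippet is self-contained):

```python
import math

def dist(a, b):
    return math.sqrt(math.pow(a[0] - b[0], 2) + math.pow(a[1] - b[1], 2))

print(dist((0, 0), (3, 4)))  # 5.0
```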

4. The algorithm.

```python
import numpy as np

def LVQ(dataset, a, q, max_iter):
    # Randomly pick q samples as the initial prototype vectors.
    P = [(i.density, i.sweet, i.good) for i in np.random.choice(dataset, q)]
    while max_iter > 0:
        # Randomly draw one sample X from dataset.
        X = np.random.choice(dataset, 1)[0]
        # Find the prototype vector P[index] nearest to X.
        m = []
        for i in range(len(P)):
            m.append(dist((X.density, X.sweet), (P[i][0], P[i][1])))
        index = np.argmin(m)
        # Get the prototype's label t and compare it with the sample's label.
        t = P[index][2]
        if t == X.good:
            # Same label: move toward X, i.e. p' = p + a * (x - p).
            P[index] = ((1 - a) * P[index][0] + a * X.density,
                        (1 - a) * P[index][1] + a * X.sweet, t)
        else:
            # Different label: move away from X, i.e. p' = p - a * (x - p).
            P[index] = ((1 + a) * P[index][0] - a * X.density,
                        (1 + a) * P[index][1] - a * X.sweet, t)
        max_iter -= 1
    return P
```


5. Plotting.

```python
import pylab as pl

def draw(C, P):
    colValue = ['r', 'y', 'g', 'b', 'c', 'k', 'm']
    # Plot each cluster in C with its own color.
    for i in range(len(C)):
        coo_X = []    # x coordinates (density)
        coo_Y = []    # y coordinates (sugar content)
        for j in range(len(C[i])):
            coo_X.append(C[i][j].density)
            coo_Y.append(C[i][j].sweet)
        pl.scatter(coo_X, coo_Y, marker='x', color=colValue[i % len(colValue)], label=i)
    # Plot the prototype vectors.
    for i in range(len(P)):
        pl.scatter(P[i][0], P[i][1], marker='o', color=colValue[i % len(colValue)], label="vector")
    pl.legend(loc='upper right')
    pl.show()
```


6. Experimental results.
