(四)MDS(多维缩放)

一 简介


原始空间中样本之间的距离在低维空间中得以保持,如上图所示,即得到多维缩放(Multiple Dimensional Scaling, MDS),这是一种经典的降维方法。

二 算法描述

思路:原始样本 A n × p A_{n \times p} An×p任意两个样本在低维空间中的欧氏距离等于原始空间中的距离。

输入:距离矩阵 D n × n D_{n \times n} Dn×n,其中 D i j D_{ij} Dij为样本 x i x_i xi x j x_j xj的距离;低维空间的维数 k k k
输出:样本在低维空间中表示 Z n × k Z_{n \times k} Zn×k
过程:降维之后样本的内积矩阵 B = Z Z T B=Z Z^T B=ZZT b i j = z ( i ) T z ( j ) b_{ij}=z_{(i)}^T z_{(j)} bij=z(i)Tz(j),有
D i j 2 = ∣ ∣ z ( i ) − z ( j ) ∣ ∣ 2 = ∣ ∣ z ( i ) ∣ ∣ 2 + ∣ ∣ z ( j ) ∣ ∣ 2 − 2 z ( i ) T z ( j ) = b i i + b j j − 2 b i j D_{ij}^2 = ||z_{(i)}-z_{(j)}||^2 =||z_{(i)}||^2+||z_{(j)}||^2 -2z_{(i)}^T z_{(j)}=b_{ii}+b_{jj}-2b_{ij} Dij2=z(i)z(j)2=z(i)2+z(j)22z(i)Tz(j)=bii+bjj2bij
假定降维之后的样本 Z Z Z被中心化,即 ∑ i = 1 n z ( i ) = \sum_{i=1}^n z_{(i)}= i=1nz(i)= 0,显然矩阵B的行和列之和均为零,即 ∑ i = 1 n b i j = ∑ j = 1 n b i j = 0. \sum _{i=1} ^nb_{ij}=\sum_{j=1}^nb_{ij}=0. i=1nbij=j=1nbij=0.易知
∑ i = 1 n D i j 2 = t r ( B ) + m b j j ∑ j = 1 n D i j 2 = t r ( B ) + m b i i ∑ i = 1 n ∑ j = 1 n D i j 2 = 2 m t r ( B ) \sum _{i=1}^nD_{ij}^2=tr(B)+mb_{jj} \\ \sum _{j=1}^nD_{ij}^2=tr(B)+mb_{ii} \\ \sum _{i=1}^n \sum _{j=1}^n D_{ij}^2=2mtr(B) i=1nDij2=tr(B)+mbjjj=1nDij2=tr(B)+mbiii=1nj=1nDij2=2mtr(B)
可得:
b i j = 1 2 n ∑ i = 1 n D i j 2 + 1 2 n ∑ j = 1 n D i j 2 − 1 2 n 2 ∑ i = 1 n ∑ j = 1 n D i j 2 − 1 2 n D i j 2 b_{ij}= \frac{1}{2n}\sum_{i=1}^nD_{ij}^2 + \frac{1}{2n}\sum_{j=1}^nD_{ij}^2 - \frac{1}{2n^2}\sum_{i=1}^n\sum_{j=1}^nD_{ij}^2 - \frac{1}{2n}D_{ij}^2 bij=2n1i=1nDij2+2n1j=1nDij22n21i=1nj=1nDij22n1Dij2
通过降维前后距离矩阵保持不 D n × n D_{n \times n} Dn×n变求内积矩阵 B n × n B_{n \times n} Bn×n,对 B B B做特征值分解, B = V Λ V T = V Λ 1 2 ( V Λ 1 2 ) T B=V\Lambda V^T=V\Lambda^{\frac{1}{2}} (V \Lambda^{\frac{1}{2}})^T B=VΛVT=VΛ21(VΛ21)T其中 Λ \Lambda Λ为所有特征值(或者删除为零的特征值)从大到小排列组成的对角矩阵, V V V为对应的特征向量矩阵。近似的,最终取 Z = V ~ Λ ~ 1 2 Z=\tilde{V}\tilde{\Lambda}^{\frac{1}{2}} Z=V~Λ~21 Λ ~ \tilde{\Lambda} Λ~为前 k k k个特征值组成的对角矩阵, V ~ \tilde{V} V~为对应的特征向量矩阵。

思考:实则也是对原矩阵进行了线性变换, Z = A × W Z=A\times W Z=A×W W W W p × k p\times k p×k的变换矩阵,从这个角度考虑跟PCA类似。(还需进一步理解)

三 python实现

(1)把所有对角矩阵中特征值为0的去掉,记得验证得到的两个距离矩阵是否相同。

# -*- coding: utf-8 -*-
"""
Spyder Editor

This is a temporary script file.
"""

import numpy as np
from scipy.spatial.distance import cdist


class MDS:
    
    def __init__(self):
        from scipy.spatial.distance import cdist
    
    def _cal_dist(self, matrix, row, col):
        if row == '*' and col == '*':
            return np.sum(matrix**2)
        elif row == '*' and col != '*':
            return np.sum(matrix[:, col]**2)
        elif row != '*' and col == '*':
            return np.sum(matrix[row, :]**2)
        else:
            return matrix[row, col]**2
    
    def fit(self, dataset):
        length = dataset.shape[0]
        # 计算原始空间的距离矩阵
        original_matrix = cdist(dataset, dataset, 'euclidean')
        # 计算 dist_i.、dist_.j 以及 dist_..
        dist_matrix = np.matrix(np.zeros(original_matrix.shape))
        rows, cols = dist_matrix.shape
        # 获得矩阵 B
        for row in range(rows):
            for col in range(cols):
                distij = self._cal_dist(original_matrix, row, col)
                dist_i = self._cal_dist(original_matrix, row, '*') / length
                dist_j = self._cal_dist(original_matrix, '*', col) / length
                dist_all = self._cal_dist(original_matrix, '*', '*') / (length**2)
                dist_matrix[row, col] = -(distij - dist_i - dist_j + dist_all) / 2
        # 计算特征值和特征向量
        feature_values, feature_vectors = np.linalg.eig(dist_matrix)
        select_feature_values = []
        for i in range(len(feature_values) - 1, -1, -1):
            if np.round(feature_values[i]) != 0:
                select_feature_values.append(feature_values[i])
            else:
                feature_vectors = np.delete(feature_vectors, i, axis=1)
        eye_matrix = np.eye(len(select_feature_values))
        for i in range(len(select_feature_values)):
            eye_matrix[i, i] = select_feature_values[i]
        return np.dot(feature_vectors, eye_matrix**0.5)


mds = MDS()
a=mds.fit(np.array([
    [1, 1, 1],
    [2, 2, 2],
    [3, 3, 3]
]))
print(a)

(2)有问题,以后再解决

import numpy as np

import math
# run this to get a test matrix
# A = np.random.randint(1,100,(5,20))
# np.save('mat.npy', A)
# exit()

A = np.array([
     [1, 1, 1],
     [2, 2, 2],
    [3, 3, 3]
 ])
#print(A)
#print(A[1])     #A[i]表示的是某一行
n,m = A.shape
Dist = np.zeros((n,n))
B = np.zeros((n,n))
for i in range(n):
    for j in range(n):
        Dist[i][j] = sum((ix-jx)**2 for ix,jx in zip(A[i], A[j]))   #距离矩阵

disti2 = np.array([0]*n)
distj2 = np.array([0]*n)

for x in range(n):
    disti2[x] = np.mean([Dist[x][j] for j in range(n)])
    distj2[x] = np.mean([Dist[i][x] for i in range(n)])

distij2 = np.mean([Dist[i][j] for i in range(n) for j in range(n)])

for i in range(n):
    for j in range(n):
        B[i][j] = -0.5*(Dist[i][j] - disti2[i] - distj2[j] + distij2)  #样本的内积矩阵

w,v = np.linalg.eig(B)     #v即为B的特征向量矩阵,不同特征值的特征向量不相关,但是不一定正交
v=v.transpose()
print(w,v)
U = [{'eVal':w[i], 'eVec':v[i]} for i in range(n)]
print(type(U))
U.sort(key = lambda obj:obj.get('eVal'), reverse = True)  #从大到小排列
k=2        #k不能等于3,大概是不能计算NAN0的次幂??
w=np.array([0]*k)
v=np.zeros((k,n))

for i in range(k):
    w[i] = U[i].get('eVal')**0.5    #U里面是小数怎么就变成整数了?
    print(type(w[i]))        #类型一直为整数,该怎么调,6开根号为2????
    v[i] = U[i].get('eVec')

ans = np.dot(v.transpose(), np.diag(w))     #降维后的到新的数据矩阵
print(ans)
ans_dist = np.zeros((n,n))
for i in range(n):
#    ans_str=""
    for j in range(n):
        ans_dist[i][j] = sum((ix-jx)**2 for ix,jx in zip(ans[i], ans[j]))
#np.set_printoptions(suppress=True)   #不使用科学计数法显示数据
print("Orign dis[][] is :")
print(Dist)
print("MDS dis[][] is :")
print(ans_dist)     #最终结果两个矩阵不相同
  • 0
    点赞
  • 2
    收藏
    觉得还不错? 一键收藏
  • 0
    评论
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值