Structure Preserving Unsupervised Feature Selection --- Machine Learning Course Project

1. Overview

This was the final project for my machine learning course: the course concludes with implementing a paper, and this post records that work.

2. Experiment Details

Only the most important parts are presented here.

The algorithm to be implemented for this course paper is Algorithm 1:
[Figure: pseudocode of Algorithm 1 (image omitted)]

2.1 Precomputing the Weight Matrix

As Algorithm 1 shows, the inputs are the original data $X\in\mathbb{R}^{n\times d}$, the weight matrix $S$, the parameters $\alpha$ and $\beta$, and the number of features $h$ to select from the original data; the output is the data restricted to the $h$ selected features, i.e. $X\in\mathbb{R}^{n\times h}$. The weight matrix $S$ must be precomputed from the original data $X$.

According to Equation 8, the weight matrix $S$ is the kernel matrix obtained by looping over all $n$ samples and evaluating the RBF kernel $k(x_i,x_j)=e^{-\frac{\|x_i-x_j\|^2}{\sigma}}$ for every pair, i.e.:
$$S=\begin{bmatrix}
k(x_1,x_1) & k(x_1,x_2) & k(x_1,x_3) & \cdots & k(x_1,x_n)\\
k(x_2,x_1) & k(x_2,x_2) & k(x_2,x_3) & \cdots & k(x_2,x_n)\\
k(x_3,x_1) & k(x_3,x_2) & k(x_3,x_3) & \cdots & k(x_3,x_n)\\
\vdots & \vdots & \vdots & \ddots & \vdots\\
k(x_n,x_1) & k(x_n,x_2) & k(x_n,x_3) & \cdots & k(x_n,x_n)
\end{bmatrix}$$

The implementation uses $\sigma=1$. The code for computing the weight matrix $S$:

def matrixGaussianKernel_S(X):
    # Build the n x n RBF kernel matrix with sigma = 1.
    num_samples = len(X)
    S = np.zeros((num_samples,num_samples))
    for i in range(num_samples):
        for j in range(num_samples):
            # k(x_i, x_j) = exp(-||x_i - x_j||^2 / sigma)
            S[i][j] = math.exp(-math.pow(np.linalg.norm(X[i]-X[j]),2)/1.0)
    return S

The parameter $X$ is the original input data $X\in\mathbb{R}^{n\times d}$, and the return value is the kernel matrix $S$.
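As an aside, the double Python loop becomes slow for large $n$; the same kernel matrix can be computed in vectorized form. A minimal sketch, assuming scipy is available (the function name is my own):

import numpy as np
from scipy.spatial.distance import cdist

def matrixGaussianKernel_S_vectorized(X, sigma=1.0):
    # Pairwise squared Euclidean distances, then the RBF kernel in one shot.
    sq_dists = cdist(X, X, metric='sqeuclidean')
    return np.exp(-sq_dists / sigma)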

2.2 Implementing the Feature Selection Algorithm

Algorithm 1 is implemented in the following steps:

Step 1: Initialize the matrix $Q=I$, where $I$ is the identity matrix and $Q\in\mathbb{R}^{d\times d}$. The initialization code for $Q$:

def initializeIdeMatrix_Q(num_feature):
    I = np.zeros((num_feature,num_feature))
    for i in range(num_feature):
        I[i][i] = 1
    return I

The parameter $num\_feature$ is the feature dimensionality $d$ of the original data $X$.
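Incidentally, numpy provides this directly; the loop above is equivalent to:

Q = np.eye(num_feature)  # d x d identity matrix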

Step 2: Compute the Laplacian matrix $L_S=D-S$, where $D$ is a diagonal matrix whose diagonal entries are the row sums of the weight matrix $S$. $D$ can be written as:
$$D=\begin{bmatrix}
sum(S_1) & 0 & 0 & \cdots & 0\\
0 & sum(S_2) & 0 & \cdots & 0\\
0 & 0 & sum(S_3) & \cdots & 0\\
\vdots & \vdots & \vdots & \ddots & \vdots\\
0 & 0 & 0 & \cdots & sum(S_n)
\end{bmatrix}$$
The function $sum(S_j)$ denotes the sum of all elements in row $j$ of $S$, i.e.
$$D_{ii}=sum(S_i)=\sum_{j=1}^{n}S_{ij}$$
Since $S$ is the RBF kernel matrix computed above, the Laplacian follows directly. The implementation:

def calculateMatrix_Ls(S):
    # D holds the row sums of S on its diagonal; the Laplacian is D - S.
    D = np.zeros((S.shape[0],S.shape[0]))
    for i in range(S.shape[0]):
        D[i][i] = np.sum(S[i])
    return D-S
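Equivalently, $D$ and $L_S$ can be built without the loop; a one-line numpy sketch (function name my own):

def calculateMatrix_Ls_vectorized(S):
    # D = diag(row sums of S); L_S = D - S.
    return np.diag(S.sum(axis=1)) - S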

Step 3: Iteratively compute the representation matrix $W$, updating the matrix $Q$ each round, until $W$ converges. The update formula for $W$ is:
$$W=(\beta X^TL_SX+X^TX+\alpha Q)^{-1}X^TX$$
The code:

def iterationUntilConvergence(alpha,beta,epsilon,Ls,Q,X):
    # Alternately update W and Q for a fixed number of iterations.
    # (To iterate to convergence instead, keep the previous W and loop
    # while np.sum(np.abs(temp_W - W)) > 0.01.)
    for i in range(50):
        # W = (beta * X^T Ls X + X^T X + alpha * Q)^{-1} X^T X
        W = np.matmul(np.matmul(
            np.linalg.inv(np.matmul(
                np.matmul(beta*X.T,Ls),X
            )+np.matmul(X.T,X)+alpha*Q),X.T),X)
        Q = updateQ(epsilon, W)
    return W
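A small numerical aside: forming the explicit inverse is generally less stable than solving the corresponding linear system. A sketch of a single $W$ update written with np.linalg.solve (the helper name is my own):

def update_W(alpha, beta, Ls, Q, X):
    # Solve (beta * X^T Ls X + X^T X + alpha * Q) W = X^T X for W,
    # avoiding the explicit matrix inverse.
    XtX = X.T @ X
    A = beta * (X.T @ Ls @ X) + XtX + alpha * Q
    return np.linalg.solve(A, XtX)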

According to the paper's experimental results, 50 training iterations are sufficient. The matrix $Q$ is updated each iteration by the $updateQ()$ function, using:
$$Q_{ii}=\frac{1}{2\sqrt{W^T_iW_i+\varepsilon}}$$
where $\varepsilon\rightarrow0$; the implementation uses $\varepsilon=0.0001$. $updateQ()$ is defined as:

def updateQ(epsilon,W):
    Q = np.zeros((W.shape[0],W.shape[0]))
    for i in range(W.shape[0]):
        # Q_ii = 1 / (2 * sqrt(||W_i||^2 + epsilon))
        Q[i][i] = 1/(2*math.sqrt(np.sum(np.square(W[i]))+epsilon))
    return Q
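The same update can be vectorized; a minimal sketch (function name my own):

def updateQ_vectorized(epsilon, W):
    # Q_ii = 1 / (2 * sqrt(||W_i||^2 + epsilon)), built from the row norms of W.
    row_sq = np.sum(np.square(W), axis=1)
    return np.diag(1.0 / (2.0 * np.sqrt(row_sq + epsilon)))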

Step 4: Let $r^{d\times 1}=[r_1,r_2,\cdots,r_d]^T$, where each $r_i=\|W_i\|_2$ is the 2-norm of row $i$ of the representation matrix $W$. Sort $r^{d\times 1}$ in descending order and take the indices of the $h$ largest values, $index=\{i_1,i_2,\cdots,i_h\}$. The feature-selected data is then $feature\_X=\{X^T_{i_1},X^T_{i_2},\cdots,X^T_{i_h}\}^T$. The implementation:

def rankBasedW(h,W,X):
    norm_W = np.linalg.norm(W,axis=1)
    index = heapq.nlargest(h,range(len(norm_W)),norm_W.take) # indices of the h largest row norms
    print(index)
    feature_X = np.zeros((X.shape[0],h))
    for i in range(X.shape[0]):
        feature_X[i] = X[i][index] # keep only the selected feature columns of this sample
    return feature_X
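For reference, the same selection can be written with plain numpy indexing; a minimal sketch (function name my own):

def rankBasedW_numpy(h, W, X):
    # Indices of the h rows of W with the largest l2 norms, in descending order.
    index = np.argsort(np.linalg.norm(W, axis=1))[::-1][:h]
    # Column-slice X down to the selected features.
    return X[:, index]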

This completes the implementation of Algorithm 1. What remains is to validate the feature-selected data $feature\_X$ experimentally.

2.3 Label Remapping and Accuracy Evaluation

The selected data $feature\_X$ is clustered with the KMeans algorithm, using the sklearn implementation directly:

from sklearn.cluster import KMeans

def k_means(data, clusters):
    return KMeans(n_clusters=clusters,random_state=0).fit(data).predict(data)

label_pred = k_means(feature_X,len(np.unique(Y)))

Here $label\_pred$ holds the labels produced by clustering. Because cluster numbering is arbitrary and need not match the ground-truth labels, $label\_pred$ is remapped with the Munkres (Hungarian) algorithm. The remapping is implemented as follows:

from munkres import Munkres
import numpy as np

def maplabels(L1, L2):
    # L1: ground-truth labels; L2: predicted cluster labels.
    L2 = L2+1
    Label1 = np.unique(L1)
    Label2 = np.unique(L2)
    nClass1 = len(Label1)
    nClass2 = len(Label2)
    nClass = np.maximum(nClass1, nClass2)
    # G[i, j] counts samples with true label Label1[i] and cluster label Label2[j].
    G = np.zeros((nClass, nClass))
    for i in range(nClass1):
        ind_cla1 = L1 == Label1[i]
        ind_cla1 = ind_cla1.astype(float)
        for j in range(nClass2):
            ind_cla2 = L2 == Label2[j]
            ind_cla2 = ind_cla2.astype(float)
            G[i, j] = np.sum(ind_cla2*ind_cla1)

    # Minimizing -G.T maximizes the total overlap between matched label pairs.
    m = Munkres()
    index = m.compute(-G.T)
    index = np.array(index)
    index = index+1
    # Relabel L2 according to the optimal assignment.
    newL2 = np.zeros(L2.shape, dtype=int)
    for i in range(nClass2):
        for j in range(len(L2)):
            if L2[j] == index[i, 0]:
                newL2[j] = index[i, 1]

    return newL2
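If the munkres package is unavailable, scipy's linear_sum_assignment performs the same Hungarian matching. A minimal sketch (my own helper, assuming the clustering yields as many clusters as there are ground-truth classes):

import numpy as np
from scipy.optimize import linear_sum_assignment

def maplabels_scipy(L1, L2):
    # Overlap matrix: G[i, j] = number of samples with true label labels1[i]
    # and cluster label labels2[j].
    labels1, labels2 = np.unique(L1), np.unique(L2)
    G = np.zeros((len(labels1), len(labels2)))
    for i, a in enumerate(labels1):
        for j, b in enumerate(labels2):
            G[i, j] = np.sum((L1 == a) & (L2 == b))
    # Hungarian matching on -G maximizes the total overlap.
    rows, cols = linear_sum_assignment(-G)
    mapping = {labels2[c]: labels1[r] for r, c in zip(rows, cols)}
    return np.array([mapping[x] for x in L2])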

In $maplabels$, the parameters $L_1$ and $L_2$ are the ground-truth labels and the predicted cluster labels, respectively, and the return value is the remapped label vector. Clustering quality is evaluated with the $AC$ (accuracy) and $NMI$ metrics. $AC$ is implemented as:

import numpy as np

def acc(L1, L2):
    # Fraction of samples whose remapped label matches the ground truth.
    correct = np.sum(L1 == L2)
    return correct/len(L2)

The parameters $L_1$ and $L_2$ are the ground-truth labels and the Munkres-remapped labels, respectively; the return value is the $AC$ score. The $NMI$ metric is computed with the sklearn library:

from sklearn import metrics

def nmi(L1, L2):
    return metrics.normalized_mutual_info_score(L1, L2)

The parameters are the same as above; the return value is the $NMI$ score.
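As a quick sanity check of the evaluation pipeline, here is a toy run with hypothetical labels, where the cluster numbering (0/1) is deliberately opposite to the true classes (1/2):

import numpy as np

Y_true = np.array([1, 1, 2, 2])          # ground-truth labels
Y_cluster = np.array([1, 1, 0, 0])       # cluster labels with swapped numbering
Y_mapped = maplabels(Y_true, Y_cluster)  # remaps clusters onto true labels
print(acc(Y_true, Y_mapped))             # 1.0: perfect match after remapping
print(nmi(Y_true, Y_mapped))             # 1.0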

3. Full Experiment Code

README:

Runtime environment:
Python 3.7 (64-bit)
PyCharm
Packages: sklearn, numpy, math, heapq, etc.
Copy all the .py files from the code folder into PyCharm and run spufs.py.
Note: install the required packages before running, and adjust the dataset paths in the code.
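For reference, the third-party packages are available on PyPI under these names (scipy is only needed for the vectorized sketches above):

pip install numpy scipy scikit-learn munkres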

Acc.py

import numpy as np

def acc(L1, L2):
    # Fraction of samples whose remapped label matches the ground truth.
    correct = np.sum(L1 == L2)
    return correct/len(L2)

datadvi.py

from scipy.io import loadmat
import numpy as np

def divdata():
    # Ask for the dataset file name and load the .mat file (key names differ per dataset).
    filename = '.../作业/机器学习/datasets/' + input("input name of data file: ")
    data = loadmat(filename)

    if filename == '.../作业/机器学习/datasets/COIL20.mat':
        dataX = data['fea']
        dataY = data['gnd'][0]

    else:
        dataX = data['X']
        dataY = data['Y'].T[0]

    print(len(dataX[0]))

    divideornot = input("divide data or not?(Yes/No): ")
    if divideornot == 'Yes':
        # Keep roughly the first 90% of each class for training, the rest for prediction.
        dataX_train = []
        dataX_predict = []
        dataY_train = []
        dataY_predict = []
        num_Y = np.unique(dataY).astype(int)
        for i in range(len(num_Y)):
            temp = dataY == num_Y[i]
            temp.astype(float)
            num_Y[i] = np.sum(temp)
            flag = 0
            for j in range(len(dataY)):
                if temp[j] == 1:
                    if flag < int(round(0.9 * num_Y[i])):
                        dataX_train.append(dataX[j])
                        dataY_train.append(dataY[j])
                        flag += 1
                    else:
                        dataX_predict.append(dataX[j])
                        dataY_predict.append(dataY[j])

        dataX_train = np.array(dataX_train)
        dataX_predict = np.array(dataX_predict)
        dataY_train = np.array(dataY_train)
        dataY_predict = np.array(dataY_predict)
        return dataX_train,dataX_predict,dataY_train,dataY_predict
    else:
        return dataX,dataX,dataY,dataY

def decreaseData(dataX,dataY):
    # Subsample each class by a user-supplied ratio (class-balanced downsampling).
    dataX_train = []
    dataY_train = []
    num_Y = np.unique(dataY).astype(int)
    print("this data has {} samples".format(len(dataX)))
    ratio = float(input("input the ratio you want to decrease: "))
    for i in range(len(num_Y)):
        temp = dataY == num_Y[i]
        temp.astype(float)
        num_Y[i] = np.sum(temp)
        flag = 0
        for j in range(len(dataY)):
            if temp[j] == 1:
                if flag < round(ratio * num_Y[i]):
                    dataX_train.append(dataX[j])
                    dataY_train.append(dataY[j])
                    flag += 1

    dataX_train = np.array(dataX_train)
    dataY_train = np.array(dataY_train)
    print(dataX_train)

    return dataX_train,dataY_train
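As an aside, the per-class 90/10 split in divdata can also be expressed with sklearn's stratified splitting; a minimal sketch (not part of the original code, and it samples randomly instead of taking the first 90% of each class):

from sklearn.model_selection import train_test_split

dataX_train, dataX_predict, dataY_train, dataY_predict = train_test_split(
    dataX, dataY, test_size=0.1, stratify=dataY, random_state=0)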

kmeans.py

from sklearn.cluster import KMeans

def k_means(data, clusters):
    return KMeans(n_clusters=clusters,random_state=0).fit(data).predict(data)

maplabels.py

from munkres import Munkres
import numpy as np

def maplabels(L1, L2):
    L2 = L2+1
    Label1 = np.unique(L1)
    Label2 = np.unique(L2)
    nClass1 = len(Label1)
    nClass2 = len(Label2)
    nClass = np.maximum(nClass1, nClass2)
    G = np.zeros((nClass, nClass))
    for i in range(nClass1):
        ind_cla1 = L1 == Label1[i]
        ind_cla1 = ind_cla1.astype(float)
        for j in range(nClass2):
            ind_cla2 = L2 == Label2[j]
            ind_cla2 = ind_cla2.astype(float)
            G[i, j] = np.sum(ind_cla2*ind_cla1)

    m = Munkres()
    index = m.compute(-G.T)
    index = np.array(index)
    index = index+1
    print(-G.T)
    print(index)
    newL2 = np.zeros(L2.shape, dtype=int)
    for i in range(nClass2):
        for j in range(len(L2)):
            if L2[j] == index[i, 0]:
                newL2[j] = index[i, 1]

    return newL2

NMI.py

from sklearn import metrics

def nmi(L1, L2):
    return metrics.normalized_mutual_info_score(L1, L2)

spufs.py

import numpy as np
import math
import heapq
import datadvi
import kmeans
import maplabels
import Acc
import NMI

def matrixGaussianKernel_S(X):
    num_samples = len(X)
    S = np.zeros((num_samples,num_samples))
    for i in range(num_samples):
        for j in range(num_samples):
            S[i][j] = math.exp(-math.pow(np.linalg.norm(X[i]-X[j]),2)/1.0)
    return S

def initializeIdeMatrix_Q(num_feature):
    I = np.zeros((num_feature,num_feature))
    for i in range(num_feature):
        I[i][i] = 1
    return I

def calculateMatrix_Ls(S):
    D = np.zeros((S.shape[0],S.shape[0]))
    for i in range(S.shape[0]):
        D[i][i] = np.sum(S[i])
    return D-S

def updateQ(epsilon,W):
    Q = np.zeros((W.shape[0],W.shape[0]))
    for i in range(W.shape[0]):
        Q[i][i] = 1/(2*math.sqrt(np.sum(np.square(W[i]))+epsilon))
    return Q

def objectiveFunc(alpha, beta, epsilon, W, Ls, X):
    # Objective value: ||X - XW||^2 + alpha * sum_i sqrt(||W_i||^2 + epsilon)
    # + beta * tr(W^T X^T Ls X W), used to monitor convergence.
    medTermValue = 0
    for i in range(W.shape[0]):
        medTermValue += math.sqrt(np.sum(np.square(W[i]))+epsilon)
    medTermValue *= alpha
    lastTerm = np.matmul(np.matmul(np.matmul(np.matmul(W.T,X.T),Ls),X),W)
    valueOfObjFun = math.pow(np.linalg.norm(X-np.matmul(X,W)),2) + medTermValue + beta*np.trace(lastTerm)
    return valueOfObjFun

def iterationUntilConvergence(alpha,beta,epsilon,Ls,Q,X):
    # Alternately update W and Q; 50 iterations suffice per the paper.
    # (To iterate to convergence instead, loop while the change in W exceeds 0.01.)
    for i in range(50):
        W = np.matmul(np.matmul(
            np.linalg.inv(np.matmul(
                np.matmul(beta*X.T,Ls),X
            )+np.matmul(X.T,X)+alpha*Q),X.T),X)
        Q = updateQ(epsilon, W)
        print(objectiveFunc(alpha,beta,epsilon,W,Ls,X))  # monitor the objective value
    return W

def rankBasedW(h,W,X):
    norm_W = np.linalg.norm(W,axis=1)
    index = heapq.nlargest(h,range(len(norm_W)),norm_W.take)
    print(index)
    feature_X = np.zeros((X.shape[0],h))
    for i in range(X.shape[0]):
        feature_X[i] = X[i][index]
    return feature_X


if __name__ == '__main__':
    alpha = float(input("input parameter alpha: "))
    beta = float(input("input parameter beta: "))
    epsilon = float(input("input parameter epsilon: "))
    h = int(input("input number of features h: "))
    X,X_pred,Y,Y_pred = datadvi.divdata()
    Q = initializeIdeMatrix_Q(X.shape[1])
    S = matrixGaussianKernel_S(X)
    Ls = calculateMatrix_Ls(S)
    W = iterationUntilConvergence(alpha,beta,epsilon,Ls,Q,X)
    feature_X = rankBasedW(h,W,X)

    print(feature_X.shape)
    label_pred = kmeans.k_means(feature_X,len(np.unique(Y)))
    label_pred = maplabels.maplabels(Y,label_pred)
    print(Acc.acc(Y,label_pred))
    print(NMI.nmi(Y,label_pred))
