1. Overview
This is a record of the final project for my machine learning course, which required implementing a published paper.
2. Experiments
Only the most important parts are included here.
The algorithm to be implemented for this course paper is shown as Algorithm 1.
2.1 Precomputing the Weight Matrix
According to Algorithm 1, the inputs are the raw data $X\in\mathbb{R}^{n\times d}$, the weight matrix $S$, the parameters $\alpha$ and $\beta$, and the number of features $h$ to select from the raw data; the output is the data restricted to the $h$ selected features, i.e. $X\in\mathbb{R}^{n\times h}$. The weight matrix $S$ must be precomputed from the raw data $X$.
According to Equation (8), the weight matrix $S$ is the kernel matrix obtained by evaluating the radial basis kernel
$$k(x_i,x_j)=e^{-\frac{\|x_i-x_j\|^2}{\sigma}}$$
over all pairs of the $n$ samples, i.e.:
$$S=\left[\begin{matrix} k(x_1,x_1)&k(x_1,x_2)&k(x_1,x_3)&\cdots&k(x_1,x_n)\\ k(x_2,x_1)&k(x_2,x_2)&k(x_2,x_3)&\cdots&k(x_2,x_n)\\ k(x_3,x_1)&k(x_3,x_2)&k(x_3,x_3)&\cdots&k(x_3,x_n)\\ \vdots&\vdots&\vdots&\ddots&\vdots\\ k(x_n,x_1)&k(x_n,x_2)&k(x_n,x_3)&\cdots&k(x_n,x_n) \end{matrix}\right]$$
In the implementation, $\sigma=1$. The code for computing the weight matrix $S$:
import math
import numpy as np

def matrixGaussianKernel_S(X):
    num_samples = len(X)
    S = np.zeros((num_samples, num_samples))
    for i in range(num_samples):
        for j in range(num_samples):
            # RBF kernel entry with sigma = 1
            S[i][j] = math.exp(-math.pow(np.linalg.norm(X[i] - X[j]), 2) / 1.0)
    return S
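The double Python loop above scales poorly for large $n$; assuming NumPy broadcasting semantics, an equivalent vectorized sketch (the function name is mine, not from the original code) is:

```python
import numpy as np

def gaussian_kernel_vectorized(X, sigma=1.0):
    # Pairwise squared distances via ||a-b||^2 = ||a||^2 + ||b||^2 - 2 a.b
    sq = np.sum(X ** 2, axis=1)
    dist2 = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)
    dist2 = np.maximum(dist2, 0.0)  # clip tiny negative round-off
    return np.exp(-dist2 / sigma)
```

This produces the same kernel matrix $S$ as the loop version, with a diagonal of ones.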
The parameter X is the raw input data $X\in\mathbb{R}^{n\times d}$, and the return value is the kernel matrix $S$.
2.2 Implementing the Feature Selection Algorithm
Algorithm 1 is implemented in the following steps:
Step 1: Initialize the matrix $Q=I$, where $I$ is the identity matrix and $Q\in\mathbb{R}^{d\times d}$. The initialization code:
def initializeIdeMatrix_Q(num_feature):
    I = np.zeros((num_feature, num_feature))
    for i in range(num_feature):
        I[i][i] = 1
    return I
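The loop is equivalent to NumPy's built-in identity constructor; a minimal equivalence check (a sketch, not part of the original code):

```python
import numpy as np

def initialize_Q_eye(num_feature):
    # np.eye builds the same d x d identity matrix as the explicit loop
    return np.eye(num_feature)
```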
The parameter num_feature is $d$, the number of features in the raw data $X$.
Step 2: Compute the Laplacian matrix $L_S=D-S$, where $D$ is a diagonal matrix whose diagonal entries are the row sums of the weight matrix $S$. $D$ can be written as:
$$D=\left[\begin{matrix} sum(S_1)&0&0&\cdots&0\\ 0&sum(S_2)&0&\cdots&0\\ 0&0&sum(S_3)&\cdots&0\\ \vdots&\vdots&\vdots&\ddots&\vdots\\ 0&0&0&\cdots&sum(S_n) \end{matrix}\right]$$
The function $sum(S_j)$ denotes the sum of all elements in row $j$ of the matrix $S$, i.e.
$$D_{ii}=sum(S_i)=\sum_{j=1}^{n}S_{ij}$$
Since $S$ is the radial basis kernel matrix described above, the Laplacian matrix follows directly. The implementation:
def calculateMatrix_Ls(S):
    D = np.zeros((S.shape[0], S.shape[0]))
    for i in range(S.shape[0]):
        D[i][i] = np.sum(S[i])
    return D - S
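A defining property of $L_S = D - S$ is that every row sums to zero, since $D_{ii}$ collects exactly the $i$-th row sum of $S$; the self-check below (my sketch, using np.diag rather than the explicit loop) can catch indexing mistakes:

```python
import numpy as np

def laplacian(S):
    # D holds the row sums of S on its diagonal; L_S = D - S
    return np.diag(S.sum(axis=1)) - S
```

For a symmetric kernel matrix $S$, $L_S$ is also symmetric positive semidefinite.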
Step 3: Iteratively compute the representation matrix $W$, updating the matrix $Q$ after each pass, until $W$ converges. The update formula for $W$ is:
$$W=(\beta X^TL_SX+X^TX+\alpha Q)^{-1}X^TX$$
The implementation:
def iterationUntilConvergence(alpha, beta, epsilon, Ls, Q, X):
    # A fixed 50 iterations suffices per the paper; alternatively, loop
    # while np.sum(np.abs(temp_W - W)) > 0.01 for a tolerance-based stop.
    for i in range(50):
        W = np.matmul(np.matmul(
            np.linalg.inv(np.matmul(
                np.matmul(beta * X.T, Ls), X
            ) + np.matmul(X.T, X) + alpha * Q), X.T), X)
        Q = updateQ(epsilon, W)
    return W
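Explicitly forming the inverse of a $d\times d$ matrix is avoidable: np.linalg.solve is generally more accurate and cheaper. The sketch below (helper names are mine; the 0.01 tolerance mirrors the convergence criterion mentioned in the text) combines that with an early stop:

```python
import numpy as np

def update_Q_diag(W, epsilon):
    # Q_ii = 1 / (2 * sqrt(||W_i||^2 + epsilon)), vectorized over rows
    return np.diag(1.0 / (2.0 * np.sqrt(np.sum(W ** 2, axis=1) + epsilon)))

def iterate_W(alpha, beta, epsilon, Ls, X, max_iter=50, tol=0.01):
    d = X.shape[1]
    Q = np.eye(d)
    W_prev = np.zeros((d, d))
    for _ in range(max_iter):
        A = beta * X.T @ Ls @ X + X.T @ X + alpha * Q
        W = np.linalg.solve(A, X.T @ X)  # avoids computing A^{-1} explicitly
        if np.sum(np.abs(W - W_prev)) < tol:
            break
        W_prev = W
        Q = update_Q_diag(W, epsilon)
    return W
```

Since $X^TX$ is positive semidefinite, $Q$ has positive diagonal entries, and $L_S$ is positive semidefinite, the system matrix $A$ is positive definite and the solve is well posed.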
According to the paper's experimental results, 50 training iterations are sufficient. The matrix $Q$ is updated with the updateQ() method, computed as:
$$Q_{ii}=\frac{1}{2\sqrt{W^T_iW_i+\varepsilon}}$$
where $\varepsilon\rightarrow0$; the implementation uses $\varepsilon=0.0001$. updateQ() is defined as:
def updateQ(epsilon, W):
    Q = np.zeros((W.shape[0], W.shape[0]))
    for i in range(W.shape[0]):
        Q[i][i] = 1 / (2 * math.sqrt(np.sum(np.square(W[i])) + epsilon))
    return Q
Step 4: Let $r^{d\times 1}=[r_1,r_2,\cdots,r_d]^T$, where $r_i=\|W_i\|_2$ is the 2-norm of row $i$ of the representation matrix $W$. Sort $r^{d\times 1}$ in descending order and take the indices of the $h$ largest values, $index=\{i_1,i_2,\cdots,i_h\}$; the feature data selected from the raw data $X$ is then $feature\_X=\{X^T_{i_1},X^T_{i_2},\cdots,X^T_{i_h}\}^T$. The implementation:
def rankBasedW(h, W, X):
    norm_W = np.linalg.norm(W, axis=1)
    index = heapq.nlargest(h, range(len(norm_W)), norm_W.take)  # indices of the h largest row norms
    print(index)
    feature_X = np.zeros((X.shape[0], h))
    for i in range(X.shape[0]):
        feature_X[i] = X[i][index]  # keep only the selected features of each sample
    return feature_X
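The heapq-based top-$h$ selection can equivalently be written with np.argsort and fancy indexing (a sketch; the function name is mine):

```python
import numpy as np

def select_top_features(W, X, h):
    # r_i = ||W_i||_2 per row of W; keep the h columns of X with largest scores
    r = np.linalg.norm(W, axis=1)
    index = np.argsort(r)[::-1][:h]  # indices of the h largest norms, descending
    return X[:, index], index
```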
This completes Algorithm 1; the selected feature data $feature\_X$ still needs to be validated experimentally.
2.3 Label Remapping and Accuracy Computation
The selected data $feature\_X$ is clustered with the KMeans algorithm, directly using the sklearn machine learning library:
from sklearn.cluster import KMeans

def k_means(data, clusters):
    return KMeans(n_clusters=clusters, random_state=0).fit(data).predict(data)

label_pred = k_means(feature_X, len(np.unique(Y)))
Here $label\_pred$ is the label vector produced by clustering. Because the cluster ids assigned by KMeans need not follow the same convention as the original labels, $label\_pred$ is remapped with the Munkres (Hungarian) algorithm. The remapping is implemented as:
from munkres import Munkres
import numpy as np

def maplabels(L1, L2):
    L2 = L2 + 1
    Label1 = np.unique(L1)
    Label2 = np.unique(L2)
    nClass1 = len(Label1)
    nClass2 = len(Label2)
    nClass = np.maximum(nClass1, nClass2)
    # G[i, j] counts samples with true label Label1[i] and predicted label Label2[j]
    G = np.zeros((nClass, nClass))
    for i in range(nClass1):
        ind_cla1 = (L1 == Label1[i]).astype(float)
        for j in range(nClass2):
            ind_cla2 = (L2 == Label2[j]).astype(float)
            G[i, j] = np.sum(ind_cla2 * ind_cla1)
    # Munkres minimizes cost, so negate G to maximize the total overlap
    m = Munkres()
    index = np.array(m.compute(-G.T)) + 1
    newL2 = np.zeros(L2.shape, dtype=int)
    for i in range(nClass2):
        for j in range(len(L2)):
            if L2[j] == index[i, 0]:
                newL2[j] = index[i, 1]
    return newL2
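If SciPy is available, the same optimal assignment can be computed with scipy.optimize.linear_sum_assignment instead of the munkres package; a minimal sketch (function and variable names are mine):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def map_labels(true_labels, pred_labels):
    # Assumes the same number of true classes and predicted clusters.
    # G[i, j] = overlap between true class i and predicted cluster j
    true_ids = np.unique(true_labels)
    pred_ids = np.unique(pred_labels)
    G = np.zeros((len(true_ids), len(pred_ids)))
    for i, t in enumerate(true_ids):
        for j, p in enumerate(pred_ids):
            G[i, j] = np.sum((true_labels == t) & (pred_labels == p))
    row, col = linear_sum_assignment(-G)  # negate to maximize total overlap
    mapping = {pred_ids[c]: true_ids[r] for r, c in zip(row, col)}
    return np.array([mapping[p] for p in pred_labels])
```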
The parameters $L_1$ and $L_2$ are the original labels and the predicted labels from clustering, and the return value is the remapped label vector. Clustering quality is evaluated with the $AC$ and $NMI$ metrics. $AC$ is implemented as:
import numpy as np

def acc(L1, L2):
    matches = np.sum(L1 == L2)  # positions where the two label vectors agree
    return matches / len(L2)
The parameters $L_1$ and $L_2$ are the original labels and the labels remapped by the Munkres algorithm; the return value is the $AC$ accuracy. The $NMI$ metric is computed via the sklearn library:
from sklearn import metrics

def nmi(L1, L2):
    return metrics.normalized_mutual_info_score(L1, L2)
The parameters are as above; the return value is the $NMI$ score.
3. Experiment Code
README:
Environment:
Python 3.7 (64-bit)
PyCharm
Packages: sklearn, numpy, math, heapq, etc.
Copy all .py files from the code folder into PyCharm and run spufs.py.
Note: install the required packages and update the file paths in the code before running.
Acc.py
import numpy as np

def acc(L1, L2):
    matches = np.sum(L1 == L2)
    return matches / len(L2)
datadvi.py
from scipy.io import loadmat
import numpy as np

def divdata():
    filename = '.../作业/机器学习/datasets/' + input("input name of data file: ")
    data = loadmat(filename)
    if filename == '.../作业/机器学习/datasets/COIL20.mat':
        dataX = data['fea']
        dataY = data['gnd'][0]
    else:
        dataX = data['X']
        dataY = data['Y'].T[0]
    print(len(dataX[0]))
    divideornot = input("divide data or not?(Yes/No): ")
    if divideornot == 'Yes':
        dataX_train = []
        dataX_predict = []
        dataY_train = []
        dataY_predict = []
        num_Y = np.unique(dataY).astype(int)
        for i in range(len(num_Y)):
            temp = dataY == num_Y[i]
            temp.astype(float)
            num_Y[i] = np.sum(temp)
            flag = 0
            for j in range(len(dataY)):
                if temp[j] == 1:
                    if flag < int(round(0.9 * num_Y[i])):
                        dataX_train.append(dataX[j])
                        dataY_train.append(dataY[j])
                        flag += 1
                    else:
                        dataX_predict.append(dataX[j])
                        dataY_predict.append(dataY[j])
        dataX_train = np.array(dataX_train)
        dataX_predict = np.array(dataX_predict)
        dataY_train = np.array(dataY_train)
        dataY_predict = np.array(dataY_predict)
        return dataX_train, dataX_predict, dataY_train, dataY_predict
    else:
        return dataX, dataX, dataY, dataY

def decreaseData(dataX, dataY):
    dataX_train = []
    dataY_train = []
    num_Y = np.unique(dataY).astype(int)
    print("this data has {} samples".format(len(dataX)))
    ratio = float(input("input the ratio you want to decrease: "))
    for i in range(len(num_Y)):
        temp = dataY == num_Y[i]
        temp.astype(float)
        num_Y[i] = np.sum(temp)
        flag = 0
        for j in range(len(dataY)):
            if temp[j] == 1:
                if flag < round(ratio * num_Y[i]):
                    dataX_train.append(dataX[j])
                    dataY_train.append(dataY[j])
                    flag += 1
    dataX_train = np.array(dataX_train)
    dataY_train = np.array(dataY_train)
    print(dataX_train)
    return dataX_train, dataY_train
kmeans.py
from sklearn.cluster import KMeans

def k_means(data, clusters):
    return KMeans(n_clusters=clusters, random_state=0).fit(data).predict(data)
maplabels.py
from munkres import Munkres
import numpy as np

def maplabels(L1, L2):
    L2 = L2 + 1
    Label1 = np.unique(L1)
    Label2 = np.unique(L2)
    nClass1 = len(Label1)
    nClass2 = len(Label2)
    nClass = np.maximum(nClass1, nClass2)
    # G[i, j] counts samples with true label Label1[i] and predicted label Label2[j]
    G = np.zeros((nClass, nClass))
    for i in range(nClass1):
        ind_cla1 = (L1 == Label1[i]).astype(float)
        for j in range(nClass2):
            ind_cla2 = (L2 == Label2[j]).astype(float)
            G[i, j] = np.sum(ind_cla2 * ind_cla1)
    # Munkres minimizes cost, so negate G to maximize the total overlap
    m = Munkres()
    index = np.array(m.compute(-G.T)) + 1
    newL2 = np.zeros(L2.shape, dtype=int)
    for i in range(nClass2):
        for j in range(len(L2)):
            if L2[j] == index[i, 0]:
                newL2[j] = index[i, 1]
    return newL2
NMI.py
from sklearn import metrics

def nmi(L1, L2):
    return metrics.normalized_mutual_info_score(L1, L2)
spufs.py
import numpy as np
import math
import heapq
import datadvi
import kmeans
import maplabels
import Acc
import NMI

def matrixGaussianKernel_S(X):
    num_samples = len(X)
    S = np.zeros((num_samples, num_samples))
    for i in range(num_samples):
        for j in range(num_samples):
            S[i][j] = math.exp(-math.pow(np.linalg.norm(X[i] - X[j]), 2) / 1.0)
    return S

def initializeIdeMatrix_Q(num_feature):
    I = np.zeros((num_feature, num_feature))
    for i in range(num_feature):
        I[i][i] = 1
    return I

def calculateMatrix_Ls(S):
    D = np.zeros((S.shape[0], S.shape[0]))
    for i in range(S.shape[0]):
        D[i][i] = np.sum(S[i])
    return D - S

def updateQ(epsilon, W):
    Q = np.zeros((W.shape[0], W.shape[0]))
    for i in range(W.shape[0]):
        Q[i][i] = 1 / (2 * math.sqrt(np.sum(np.square(W[i])) + epsilon))
    return Q

def objectiveFunc(alpha, beta, epsilon, W, Ls, X):
    # alpha * sum_i sqrt(||W_i||^2 + epsilon): smoothed L2,1 regularizer
    medTermValue = 0
    for i in range(W.shape[0]):
        medTermValue += math.sqrt(np.sum(np.square(W[i])) + epsilon)
    medTermValue *= alpha
    lastTerm = np.matmul(np.matmul(np.matmul(np.matmul(W.T, X.T), Ls), X), W)
    return math.pow(np.linalg.norm(X - np.matmul(X, W)), 2) + medTermValue + beta * np.trace(lastTerm)

def iterationUntilConvergence(alpha, beta, epsilon, Ls, Q, X):
    for i in range(50):
        W = np.matmul(np.matmul(
            np.linalg.inv(np.matmul(
                np.matmul(beta * X.T, Ls), X
            ) + np.matmul(X.T, X) + alpha * Q), X.T), X)
        Q = updateQ(epsilon, W)
        print(objectiveFunc(alpha, beta, epsilon, W, Ls, X))  # monitor convergence
    return W

def rankBasedW(h, W, X):
    norm_W = np.linalg.norm(W, axis=1)
    index = heapq.nlargest(h, range(len(norm_W)), norm_W.take)
    print(index)
    feature_X = np.zeros((X.shape[0], h))
    for i in range(X.shape[0]):
        feature_X[i] = X[i][index]
    return feature_X

if __name__ == '__main__':
    alpha = float(input("input parameter alpha: "))
    beta = float(input("input parameter beta: "))
    epsilon = float(input("input parameter epsilon: "))
    h = int(input("input number of features h: "))
    X, X_pred, Y, Y_pred = datadvi.divdata()
    Q = initializeIdeMatrix_Q(X.shape[1])
    S = matrixGaussianKernel_S(X)
    Ls = calculateMatrix_Ls(S)
    W = iterationUntilConvergence(alpha, beta, epsilon, Ls, Q, X)
    feature_X = rankBasedW(h, W, X)
    print(feature_X.shape)
    label_pred = kmeans.k_means(feature_X, len(np.unique(Y)))
    label_pred = maplabels.maplabels(Y, label_pred)
    print(Acc.acc(Y, label_pred))
    print(NMI.nmi(Y, label_pred))