1. Code implementation
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn import metrics
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
df = pd.read_excel('ALL_AML_data.xlsx')
labels = pd.read_excel('ALL_AML_labels.xlsx')
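y_true = labels.iloc[:, 0].values.copy()  # take the label values from the first column of the Excel sheet (copied so they can be edited in place)
# Swap labels 1 and 2 in the ground truth so that its numbering matches the cluster numbering produced by KMeans
for j, label in enumerate(y_true):
    if label == 2:
        y_true[j] = 1
    elif label == 1:
        y_true[j] = 2
print("y_true", y_true)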
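# Reduce the features to 2 dimensions with PCA. The projection is used only for plotting; KMeans is fit on the original data
pca = PCA(n_components=2)
new_pca = pd.DataFrame(pca.fit_transform(df))
# Cluster the samples into 3 groups; random_state fixes the initialization so the run is repeatable
kmeans = KMeans(n_clusters=3, random_state=10).fit(df)
# KMeans numbers its clusters from 0, so add 1 to match the 1-3 numbering used in y_true
y_pred = kmeans.labels_ + 1
print("y_pred", y_pred)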
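# MSE (Mean Squared Error); note that it treats the class labels as plain numbers
print("MSE:", metrics.mean_squared_error(y_true, y_pred))
# RMSE (Root Mean Squared Error)
print("RMSE:", np.sqrt(metrics.mean_squared_error(y_true, y_pred)))
# MAE (Mean Absolute Error)
print("MAE:", metrics.mean_absolute_error(y_true, y_pred))
# Accuracy: fraction of samples whose cluster label equals the true label
print("ACCURACY:", metrics.accuracy_score(y_true, y_pred))
# With normalize=False, accuracy_score returns the number of correctly labeled samples instead
print(metrics.accuracy_score(y_true, y_pred, normalize=False))
# Visualization: plot each cluster in the 2-D PCA projection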
d = new_pca[y_pred == 1]
plt.plot(d[0], d[1], 'r.')
d = new_pca[y_pred == 2]
plt.plot(d[0], d[1], 'go')
d = new_pca[y_pred == 3]
plt.plot(d[0], d[1], 'b*')
plt.gcf().savefig('kmeans.png')
plt.show()
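The script above swaps labels 1 and 2 by hand so that the ground truth lines up with the cluster numbers. That works for this run, but KMeans cluster numbers are arbitrary and may change with a different random_state. One possible alternative, sketched here under the assumption that SciPy is installed, is to count how often each cluster overlaps each true class and let linear_sum_assignment (the Hungarian algorithm) pick the best one-to-one mapping. The helper name align_labels and the toy arrays are made up for illustration and are not part of the original script.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn import metrics


def align_labels(y_true, y_pred):
    """Relabel y_pred so that its cluster ids agree with y_true as much as possible.

    Hypothetical helper for illustration; not part of the original script.
    """
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    classes = np.unique(y_true)
    clusters = np.unique(y_pred)
    # counts[i, j] = number of samples in cluster i whose true class is j
    counts = np.zeros((len(clusters), len(classes)), dtype=int)
    for i, c in enumerate(clusters):
        for j, k in enumerate(classes):
            counts[i, j] = np.sum((y_pred == c) & (y_true == k))
    # Hungarian algorithm: one-to-one cluster-to-class mapping with maximum overlap
    rows, cols = linear_sum_assignment(counts, maximize=True)
    mapping = {clusters[r]: classes[c] for r, c in zip(rows, cols)}
    # A cluster left unmatched (more clusters than classes) keeps its own id
    return np.array([mapping.get(p, p) for p in y_pred])


# Toy data, made up for illustration: cluster ids are a shuffled version of the classes
y_true_toy = np.array([1, 1, 1, 2, 2, 3, 3, 3])
y_pred_toy = np.array([2, 2, 2, 3, 3, 1, 1, 3])

y_aligned = align_labels(y_true_toy, y_pred_toy)
acc = metrics.accuracy_score(y_true_toy, y_aligned)
# The adjusted Rand index does not depend on how the clusters are numbered
ari = metrics.adjusted_rand_score(y_true_toy, y_pred_toy)

print("aligned:", y_aligned)
print("accuracy after alignment:", acc)
print("ARI:", ari)
```

On the toy data, one sample stays in the wrong group after alignment, so the accuracy is 7/8. The adjusted Rand index is computed on the unaligned predictions because it is invariant to relabeling, which makes it a reasonable companion to accuracy when cluster numbers cannot be trusted.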
2. Results:
y_true [2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 1 1 1 1 1 1 1 3 3 3 3 3 3 3 3 3 3 3]
y_pred [2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 1 1 1 1 1 1 1 2 2 3 3 2 2 2 3 2 2 3]
MSE: 0.1891891891891892
RMSE: 0.43495883620084
MAE: 0.1891891891891892
ACCURACY: 0.8108108108108109
30
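As a sanity check, the numbers above can be recomputed by hand from the two printed arrays. The sketch below uses only NumPy and rebuilds the arrays from the printed output (18 twos, 8 ones, then the 11 samples of class 3). Every wrong prediction is off by exactly 1, so the squared error and the absolute error coincide, which is why MSE and MAE print the same value.

```python
import numpy as np

# Arrays rebuilt from the printed output above
y_true = np.array([2] * 18 + [1] * 8 + [3] * 11)
y_pred = np.array([2] * 18 + [1] * 8 + [2, 2, 3, 3, 2, 2, 2, 3, 2, 2, 3])

diff = y_true - y_pred
n_total = len(y_true)
n_wrong = int(np.count_nonzero(diff))   # samples whose labels disagree
n_correct = n_total - n_wrong

mse = np.mean(diff ** 2)                # mean of squared label differences
rmse = np.sqrt(mse)
mae = np.mean(np.abs(diff))             # mean of absolute label differences
accuracy = n_correct / n_total

print("wrong:", n_wrong, "of", n_total)  # 7 of 37
print("MSE:", mse)
print("RMSE:", rmse)
print("MAE:", mae)
print("ACCURACY:", accuracy)
print(n_correct)                         # 30
```

All seven errors are class-3 samples that were assigned to cluster 2, so MSE = MAE = 7/37 and accuracy = 30/37. Because MSE and MAE depend on the numeric distance between label values, they would change if the same clusters were numbered differently; accuracy after label alignment is the easier figure to interpret here.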