Fuzzy C-Means (FCM) Algorithm Steps
Algorithm Overview
Fuzzy C-Means is a membership-based clustering method that allows a single data point to belong to several cluster centers at once; the degree of belonging to each cluster is expressed by a membership value.
Initialization
- Data input: load the dataset and specify the number of clusters C, the fuzziness exponent m (typically 2), the error threshold, and the maximum number of iterations.
- Membership initialization: randomly initialize the membership matrix so that each data point's memberships over the clusters sum to 1.
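The initialization step above can be sketched in a few lines of NumPy; the toy sizes and the fixed seed here are assumptions chosen purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)   # seeded for reproducibility (assumption)
n_samples, n_clusters = 6, 3     # toy sizes, not from the actual dataset

# Random positive matrix, then normalize each row so memberships sum to 1
U = rng.random((n_samples, n_clusters))
U /= U.sum(axis=1, keepdims=True)

print(U.sum(axis=1))  # every entry is 1.0
```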
Updating the Cluster Centers
- Compute each cluster center from the current membership matrix. A cluster center is the weighted average of all data points, where each point's weight is its membership raised to the power m.
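This weighted average can be written as a single vectorized expression, an alternative to the per-cluster loop used in the full implementation below. Data, memberships, and sizes are toy assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((6, 2))             # toy data matrix (assumption)
U = rng.random((6, 3))
U /= U.sum(axis=1, keepdims=True)  # valid membership matrix: rows sum to 1
m = 2                              # fuzziness exponent

W = U ** m                                    # weights: memberships to the m-th power
centers = (W.T @ X) / W.sum(axis=0)[:, None]  # one weighted mean per cluster
print(centers.shape)  # (3, 2)
```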
Updating the Memberships
- For each data point and each cluster, update the membership based on the distance between the point and the cluster center. Closer cluster centers receive higher memberships.
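A broadcast-based sketch of this update, using the standard FCM formula u_ij = 1 / Σ_k (d_ij / d_ik)^(2/(m-1)); the data and centers here are toy assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((6, 2))        # toy data (assumption)
centers = rng.random((3, 2))  # toy current centers (assumption)
m = 2

# d[i, j] = Euclidean distance from point i to center j
d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
# u_ij = 1 / sum_k (d_ij / d_ik)^(2/(m-1)); closer centers get larger memberships
ratio = (d[:, :, None] / d[:, None, :]) ** (2 / (m - 1))
U = 1.0 / ratio.sum(axis=2)
print(U.sum(axis=1))  # each row sums to 1
```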
Iteration and Convergence
- Repeat the center-update and membership-update steps until the maximum number of iterations is reached or the change in the cluster centers falls below the preset error threshold.
Algorithm Output
- Output the final cluster centers and the membership matrix for every data point.
Error Rate Calculation
- If the dataset includes ground-truth class labels, methods such as the Hungarian algorithm can be used to compute the error rate of the clustering result.
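The Hungarian-algorithm matching step can be illustrated on a small confusion matrix; the matrix values below are assumptions chosen for the example, not results from the iris data:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# Toy confusion matrix (values assumed): rows = true labels, columns = clusters
conf = np.array([[8, 1, 0],
                 [0, 7, 2],
                 [1, 0, 9]])
# linear_sum_assignment minimizes cost, so negate to maximize matched counts
row, col = linear_sum_assignment(-conf)
correct = conf[row, col].sum()           # points covered by the best matching
error_rate = 1 - correct / conf.sum()    # everything else counts as an error
print(round(error_rate, 3))  # → 0.143
```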
import numpy as np
from sklearn.cluster import SpectralClustering


class FuzzyCMeans:
    def __init__(self, data, n_clusters=3, m=2, error=0.005, max_iter=100):
        self.data = data
        self.n_clusters = n_clusters
        self.m = m            # Fuzziness exponent
        self.error = error    # Threshold for the stopping criterion
        self.max_iter = max_iter
        self.centers = None
        self.membership = None

    def initialize_membership(self):
        # Randomly initialize the membership matrix so that the memberships
        # of each data point sum to 1
        membership = np.random.rand(self.data.shape[0], self.n_clusters)
        membership_sum = membership.sum(axis=1).reshape(-1, 1)
        return membership / membership_sum

    def initialize_centers(self, X, C):
        """
        Initialize the cluster centers by sampling C distinct data points.
        X: ndarray, shape (n_samples, n_features)
        C: int, number of clusters
        """
        n_samples = X.shape[0]
        indices = np.random.choice(n_samples, C, replace=False)
        return X[indices]

    def update_centers(self):
        # Update each cluster center as the weighted mean of the data,
        # weighting each point by its membership raised to the m-th power
        centers = np.zeros((self.n_clusters, self.data.shape[1]))
        for j in range(self.n_clusters):
            numerator = np.power(self.membership[:, j], self.m).reshape(-1, 1) * self.data
            denominator = np.power(self.membership[:, j], self.m).sum()
            centers[j] = numerator.sum(axis=0) / denominator
        return centers

    def update_membership(self):
        # Update the membership matrix from the distances to the current centers
        membership = np.zeros((self.data.shape[0], self.n_clusters))
        for i in range(self.data.shape[0]):
            for j in range(self.n_clusters):
                sum_term = np.sum([(np.linalg.norm(self.data[i] - self.centers[j]) /
                                    np.linalg.norm(self.data[i] - self.centers[k])) **
                                   (2 / (self.m - 1)) for k in range(self.n_clusters)])
                membership[i, j] = 1 / sum_term
        return membership

    def fit(self):
        # Main loop of the Fuzzy C-Means algorithm
        self.membership = self.initialize_membership()
        self.centers = self.initialize_centers(self.data, self.n_clusters)
        print("Initial cluster centers:")
        print(self.centers)
        print("Maximum iterations:", self.max_iter)
        for iteration in range(self.max_iter):
            prev_centers = np.copy(self.centers)
            self.centers = self.update_centers()
            self.membership = self.update_membership()
            # Stop once the centers have stabilized
            if np.linalg.norm(self.centers - prev_centers) < self.error:
                break
        return self.centers, self.membership
# Example usage
# Read the iris data file: each row is tab-separated, with four feature
# columns followed by one class-label column
file_path = 'iris.dat'  # replace with the actual path to the file
with open(file_path, 'r') as file:
    iris_data = file.readlines()

# Convert the lines to a NumPy array of floats
data_array = np.array([line.strip().split('\t') for line in iris_data], dtype=float)
data = data_array[:, :4]
label = data_array[:, 4:]
label = label.astype("int").flatten()

fcm = FuzzyCMeans(data, n_clusters=3, m=2, max_iter=200)
centers, membership = fcm.fit()
print("Cluster centers:", centers)

# Assign each point to the cluster with the highest membership
cluster_indices = np.argmax(membership, axis=1)
print("Clustering result:", cluster_indices)
from scipy.optimize import linear_sum_assignment


def calculate_error_rate(original_labels, clustered_labels):
    # Build the confusion matrix between true labels and cluster indices
    size = max(np.max(original_labels), np.max(clustered_labels)) + 1
    confusion_matrix = np.zeros((size, size), dtype=int)
    for i in range(len(original_labels)):
        confusion_matrix[original_labels[i], clustered_labels[i]] += 1
    # Use the Hungarian algorithm (linear sum assignment) to find the best
    # matching between clusters and true labels
    row_ind, col_ind = linear_sum_assignment(-confusion_matrix)
    # Total number of correctly matched points
    correct_matches = confusion_matrix[row_ind, col_ind].sum()
    # Error rate = fraction of points not covered by the best matching
    error_rate = 1 - correct_matches / len(original_labels)
    return error_rate


error_rate = calculate_error_rate(label, cluster_indices)
print("C-means error rate:", error_rate)
# Spectral clustering for comparison
n_clusters = 3  # cluster the data into 3 groups
clustering = SpectralClustering(n_clusters=n_clusters, affinity='nearest_neighbors',
                                n_neighbors=50, random_state=0)
labels = clustering.fit_predict(data)
error_rate = calculate_error_rate(label, labels)
print("Spectral clustering error rate:", error_rate)