The K-MEANS and DBSCAN Algorithms

潇洒哥611

Last modified 2023-10-26 20:10:33

Tags: kmeans, algorithms, machine learning

First published 2023-10-26 20:01:22

Article link: https://blog.csdn.net/qq_72985002/article/details/134059881

K-MEANS

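Procedure: pick a few initial centers at random in the data to be clustered, compute the distance from every data point to each of these centers, and assign each point to its nearest center. Then replace each center with the centroid (mean) of the cluster assigned to it, and repeat until the iteration limit is reached.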

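import numpy as np

class KMeans:
    def __init__(self, data, num_clusters):
        self.data = data
        self.num_clusters = num_clusters

    def train(self, max_iterations):
        # 1. Randomly choose K initial centroids.
        #    centroids_init, centroids_find_closest and centroids_compute are
        #    static helper methods of this class that are not shown in this excerpt.
        centroids = KMeans.centroids_init(self.data, self.num_clusters)
        # 2. Start training.
        num_examples = self.data.shape[0]
        closest_centroids_ids = np.empty((num_examples, 1))
        for _ in range(max_iterations):
            # 3. For every sample, compute the distance to the K centroids and keep the nearest.
            closest_centroids_ids = KMeans.centroids_find_closest(self.data, centroids)
            # 4. Move each centroid to the mean of the samples assigned to it.
            centroids = KMeans.centroids_compute(self.data, closest_centroids_ids, self.num_clusters)
        return centroids, closest_centroids_ids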
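The three helpers that train calls are not shown in the post. The following is only a rough sketch of what they might look like, written as standalone functions and assuming Euclidean distance. The seed parameter, the fallback for empty clusters, and the demo data are my own assumptions, not the author's code.

```python
import numpy as np

def centroids_init(data, num_clusters, seed=0):
    """Pick num_clusters distinct samples as initial centroids (seed is an assumed extra)."""
    rng = np.random.default_rng(seed)
    ids = rng.permutation(data.shape[0])[:num_clusters]
    return data[ids].astype(float)

def centroids_find_closest(data, centroids):
    """Return, for every sample, the index of its nearest centroid."""
    # Squared Euclidean distances, shape (num_examples, num_clusters).
    dists = ((data[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return dists.argmin(axis=1)

def centroids_compute(data, closest_ids, num_clusters):
    """Move every centroid to the mean of the samples assigned to it."""
    centroids = np.zeros((num_clusters, data.shape[1]))
    for c in range(num_clusters):
        members = data[closest_ids == c]
        # Assumption: an empty cluster falls back to the overall data mean.
        centroids[c] = members.mean(axis=0) if len(members) else data.mean(axis=0)
    return centroids

def kmeans(data, num_clusters, max_iterations, seed=0):
    """Same loop as train above, using the helper sketches."""
    centroids = centroids_init(data, num_clusters, seed)
    closest_ids = np.zeros(data.shape[0], dtype=int)
    for _ in range(max_iterations):
        closest_ids = centroids_find_closest(data, centroids)
        centroids = centroids_compute(data, closest_ids, num_clusters)
    return centroids, closest_ids

# Tiny demo: two well-separated groups on a line.
demo = np.array([[0.0, 0.0], [1.0, 0.0], [10.0, 0.0], [11.0, 0.0]])
demo_centroids, demo_ids = kmeans(demo, num_clusters=2, max_iterations=10)
print(demo_centroids, demo_ids)
```

On this demo the two centroids end up in the middle of each group. On harder data a random start can get stuck in a poor local optimum, which is why library implementations usually run several initializations.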

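Semi-supervised learning

Cluster the training set into n clusters (k = 50 in the code in this section), then for each cluster find the image closest to its centroid. We call these images the representative images and give each of them a label.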
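The snippets in this section use kmeans, k, X_digits_dist, X_representative_digits and y_representative_digits without showing where they come from. One plausible way to build them is sketched here, assuming the scikit-learn digits dataset and a default train/test split. Both are guesses on my part, and looking the labels up in y_train only stands in for labeling the 50 images by hand.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split

# Assumed data: the 8x8 digits dataset with a default 75/25 split.
X_digits, y_digits = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X_digits, y_digits, random_state=42)

k = 50
kmeans = KMeans(n_clusters=k, n_init=10, random_state=42)
# fit_transform returns the distance from every sample to every centroid,
# shape (n_samples, k).
X_digits_dist = kmeans.fit_transform(X_train)

# For each cluster (column), the row index of the sample nearest its centroid.
representative_digit_idx = np.argmin(X_digits_dist, axis=0)
X_representative_digits = X_train[representative_digit_idx]

# Stand-in for manual labeling: look the 50 labels up in y_train.
y_representative_digits = y_train[representative_digit_idx]
```

Only these k images need a label, which is the point of the approach: the labeling effort goes to the instances that best summarize each cluster.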

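from sklearn.linear_model import LogisticRegression

# Baseline: train logistic regression on only the first 50 labeled instances.
# X_train, y_train, X_test, y_test are assumed to be an existing train/test split.
n_labeled = 50

log_reg = LogisticRegression(random_state=42)
log_reg.fit(X_train[:n_labeled], y_train[:n_labeled])
log_reg.score(X_test, y_test)  # 0.8266666666666667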


# Train on the 50 representative instances (one per cluster) instead.
log_reg = LogisticRegression(random_state=42)
log_reg.fit(X_representative_digits, y_representative_digits)
log_reg.score(X_test, y_test)  # 0.9244444444444444

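import numpy as np

# Propagate each representative's label to all other instances in the same cluster.
# kmeans is the fitted clustering model and k is its number of clusters.
y_train_propagated = np.empty(len(X_train), dtype=np.int32)
for i in range(k):
    y_train_propagated[kmeans.labels_ == i] = y_representative_digits[i]

log_reg = LogisticRegression(random_state=42)
log_reg.fit(X_train, y_train_propagated)
log_reg.score(X_test, y_test)  # 0.9288888888888889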

# Propagate labels only to the 20% of instances closest to each centroid,
# then train on that subset.
# X_digits_dist is assumed to hold the distance from every training instance
# to every centroid, shape (n_samples, k).
percentile_closest = 20

# Distance from each instance to the centroid of its own cluster.
X_cluster_dist = X_digits_dist[np.arange(len(X_train)), kmeans.labels_]
for i in range(k):
    in_cluster = (kmeans.labels_ == i)
    cluster_dist = X_cluster_dist[in_cluster]  # distances of the instances in cluster i
    cutoff_distance = np.percentile(cluster_dist, percentile_closest)  # 20th-percentile distance
    above_cutoff = (X_cluster_dist > cutoff_distance)
    X_cluster_dist[in_cluster & above_cutoff] = -1  # mark everything beyond the cutoff with -1
partially_propagated = (X_cluster_dist != -1)
X_train_partially_propagated = X_train[partially_propagated]
y_train_partially_propagated = y_train_propagated[partially_propagated]
log_reg = LogisticRegression(random_state=42)
log_reg.fit(X_train_partially_propagated, y_train_partially_propagated)
log_reg.score(X_test, y_test)  # 0.9422222222222222

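Evaluation methods

Inertia: the sum of squared distances from each sample to its nearest centroid.

Inertia always gets smaller as k grows (it reaches 0 when every sample is its own cluster), so a lower inertia does not by itself mean a better choice of k.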
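To make this concrete, here is a small sketch that computes inertia by hand for k = 1, 2 and 4 on four made-up points. The inertia function and the data are illustrative only; scikit-learn exposes the same quantity as the inertia_ attribute of a fitted KMeans.

```python
import numpy as np

def inertia(X, centroids, labels):
    """Sum of squared distances from each sample to its assigned centroid."""
    diffs = X - centroids[labels]
    return float((diffs ** 2).sum())

X = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])

# k = 1: a single centroid at the overall mean.
one = inertia(X, X.mean(axis=0, keepdims=True), np.zeros(4, dtype=int))

# k = 2: one centroid per visible group.
two = inertia(X, np.array([[0.0, 1.0], [10.0, 1.0]]), np.array([0, 0, 1, 1]))

# k = 4: every sample is its own centroid.
four = inertia(X, X, np.arange(4))

print(one, two, four)  # 104.0 4.0 0.0
```

The value drops sharply from k = 1 to k = 2 and only a little after that, which is the "elbow" people usually look for when choosing k from inertia.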

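Silhouette coefficient: the mean of the silhouette coefficients of all samples.

a(i): the mean distance from sample i to the other samples in its own cluster; this is the intra-cluster dissimilarity of sample i.

b(i): the mean distance b_ij from sample i to all samples of another cluster Cj, taking the minimum over all other clusters; this is the inter-cluster dissimilarity of sample i.

The silhouette coefficient of sample i is s(i) = (b(i) - a(i)) / max(a(i), b(i)), which lies between -1 and 1:

If s(i) is close to 1, sample i is well clustered;
if s(i) is close to -1, sample i would fit better in another cluster;
if s(i) is approximately 0, sample i lies on the boundary between two clusters.

Averaging the silhouette coefficients of all points gives the overall silhouette coefficient of the clustering.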
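As a rough illustration of a(i) and b(i), the sketch below computes the per-sample silhouette coefficient s(i) = (b(i) - a(i)) / max(a(i), b(i)) with plain NumPy. It assumes Euclidean distance and at least two clusters; the function name and the data are made up for this example. scikit-learn provides silhouette_score and silhouette_samples in sklearn.metrics for real use.

```python
import numpy as np

def silhouette_per_sample(X, labels):
    """Silhouette coefficient of every sample (needs at least two clusters)."""
    n = len(X)
    # Pairwise Euclidean distances, shape (n, n).
    dists = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2))
    scores = np.zeros(n)
    for i in range(n):
        same = labels == labels[i]
        n_same = same.sum()
        if n_same == 1:
            continue  # usual convention: s(i) = 0 for a single-sample cluster
        # a(i): mean distance to the other members of the same cluster
        # (the distance of i to itself is 0, so divide by n_same - 1).
        a = dists[i, same].sum() / (n_same - 1)
        # b(i): smallest mean distance to the members of any other cluster.
        b = min(dists[i, labels == c].mean()
                for c in np.unique(labels) if c != labels[i])
        scores[i] = (b - a) / max(a, b)
    return scores

X = np.array([[0.0], [1.0], [10.0], [11.0]])
labels = np.array([0, 0, 1, 1])
s = silhouette_per_sample(X, labels)
print(s, s.mean())
```

For the first point, a = 1 and b = 10.5, so s is about 0.905; the overall mean is about 0.900, which matches the intuition that these two clusters are well separated.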

DBSCAN

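Core object: a point whose density reaches the threshold set for the algorithm is a core point, i.e. its r-neighborhood contains at least minPts points.

Directly density-reachable: if point p lies in the r-neighborhood of point q and q is a core point, then p is directly density-reachable from q.
Density-reachable: if there is a sequence of points q0, q1, ..., qk in which every qi is directly density-reachable from qi-1, then qk is density-reachable from q0. This is in effect the "propagation" of direct density-reachability.

Density-connected: if points q and k are both density-reachable from some core point p, then q and k are density-connected.

Border point: a non-core point that belongs to a cluster; the cluster cannot grow any further through it.

Procedure: pick an arbitrary point and, if it is a core point, spread outward from it through density-reachable points until a cluster has formed.

Then move on to the next unvisited point and repeat, until every point has been assigned to a cluster or marked as noise.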
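The procedure can be sketched in a few lines of NumPy. This is a simplified illustration, not scikit-learn's implementation: it assumes Euclidean distance, counts a point as part of its own neighborhood, and builds the full distance matrix, so it only suits small data. The names dbscan_sketch, eps and min_pts are my own.

```python
from collections import deque

import numpy as np

NOISE = -1
UNVISITED = -2

def dbscan_sketch(X, eps, min_pts):
    """Return (labels, is_core); noise points get the label -1."""
    n = len(X)
    dists = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2))
    # Neighborhood of every point (includes the point itself).
    neighbors = [np.flatnonzero(dists[i] <= eps) for i in range(n)]
    is_core = np.array([len(nb) >= min_pts for nb in neighbors])

    labels = np.full(n, UNVISITED)
    cluster_id = 0
    for start in range(n):
        # A new cluster can only be started from an unvisited core point.
        if labels[start] != UNVISITED or not is_core[start]:
            continue
        labels[start] = cluster_id
        queue = deque([start])
        while queue:
            p = queue.popleft()  # p is always a core point
            for q in neighbors[p]:
                if labels[q] == UNVISITED:
                    labels[q] = cluster_id  # q is directly density-reachable from p
                    if is_core[q]:
                        queue.append(q)  # only core points keep spreading
        cluster_id += 1

    # Anything never reached from a core point is noise.
    labels[labels == UNVISITED] = NOISE
    return labels, is_core

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1],      # dense square
              [5, 5], [5, 6], [6, 5], [6, 6],      # second dense square
              [2.4, 1],                            # border point of the first square
              [20, 20]], dtype=float)              # isolated point
labels, is_core = dbscan_sketch(X, eps=1.5, min_pts=4)
print(labels.tolist())  # [0, 0, 0, 0, 1, 1, 1, 1, 0, -1]
```

The point at (2.4, 1) is within eps of only one other point, so it is not a core point, but it still joins cluster 0 as a border point because a core point reaches it. The isolated point is reached by nobody and ends up as noise.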

from sklearn.cluster import DBSCAN
dbscan = DBSCAN(eps = 0.05,min_samples=5)
dbscan.fit(X)

Two parameters: ϵ (eps) sets the neighborhood radius.

MinPts (min_samples): the density threshold.

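dbscan.labels_[:10]  # cluster label of the first 10 samples (-1 means noise)
dbscan.core_sample_indices_[:10]  # indices of the first 10 core samples
np.unique(dbscan.labels_)  # which labels occur, i.e. how many clusters there are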
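The post does not say what X is. As a self-contained example, the sketch below assumes the two-moons toy dataset from scikit-learn (my choice, not necessarily the author's data) and shows how to count clusters without counting the noise label -1.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Assumed data: 1000 points on two interleaving half circles.
X, y = make_moons(n_samples=1000, noise=0.05, random_state=42)

dbscan = DBSCAN(eps=0.05, min_samples=5)
dbscan.fit(X)

labels = dbscan.labels_
n_clusters = len(set(labels) - {-1})       # -1 is noise, not a cluster
n_noise = int(np.sum(labels == -1))
n_core = len(dbscan.core_sample_indices_)
print(n_clusters, n_noise, n_core)
```

If the result has many small clusters or a lot of noise, eps is probably too small for the data; raising it merges neighboring dense regions.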
