聚类分析和K-means&Dbscan原理

最新推荐文章于 2023-07-15 09:23:53 发布

indigo女孩

最新推荐文章于 2023-07-15 09:23:53 发布

阅读量2.6k

点赞数 1

文章标签： python 数据分析聚类 kmeans

本文链接：https://blog.csdn.net/qq_45003520/article/details/123688989

版权

聚类分析和K-means&Dbscan原理

聚类概念

无监督问题：手里没有标签
聚类：相似的东西分为一组
难点：如何评估、如何调参

K-means算法

基本概念

要得到簇的个数，需要指定K值
质心：均值，即向量各维取平均即可（迭代时用）
距离的度量：常用欧几里得距离（欧氏距离）和余弦相似度（先正则化）
优化目标：（即损失函数最小） $min\sum^K_{i=1}\sum_{x\in{C_i}}{dist(c_i,x)^2}$

流程

首先给定k值，确定簇的个数
根据k值，随机选取k个质心
根据质心计算每个点的距离，根据距离的远近，将点分到不同的簇中
得到k个簇后，重新计算质心，再重复以上步骤，直到点不在发生变化

优势和劣势

优势：

简单、快速、适合常规数据集

劣势：

k值难确定
复杂度与样本呈线性关系
很难发现任意形状的簇（例如：环绕、纠缠）
初始值的选择很重要（所以只能多次实验，随机选择多次初始点）

Dbscan算法

基本概念

核心对象：若某个点的密度达到算法设定的阈值则其为核心点（即r邻域内的点的数量不小于minPts，minPts由自己设置）
epsilon-邻域的距离阈值：设定的半径r(r也由自己设置）
直接密度可达：若某点p在点q的r邻域内，且q时核心点则p-q直接密度可达
密度可达：若有一个点序列q0、q1、…、qk，对任意qi-qi-1是直接密度可达的，则称从q0到qk密度可达，这实际上是直接密度可达的“传播”
密度相连：若从某核心点p出发，点q和点k都是密度可达的，则称点q和点k是密度相连的。
边界点：属于某一个类的非核心点，不能发展下线
噪声点：不属于任何一个类簇的点，从任何一个核心点出发都是密度不可达的

流程

输入参数：数据集D、epsilon半径、MinPts密度阈值
标记所有对象为unvisited
随机选择一个unvisited对象p，把p标记为visited
情况一：若p的epsilon邻域至少有MinPts对象，则创建一个新簇C，并把p添加到C中
令N为p的epsilon邻域中的对象集合，对N中每一个点q：
若q是unvisited，则把它标记为visited，若q的epsilon邻域至少有MinPts个对象，则把这些对象添加到N；
若q还不是任何簇的成员，把p添加到C
输出C
情况二：标记p为噪声
直到没有标记为unvisited的对象

参数选择

半径epsilon，可以根据k距离来设定，找突变点（即随机选择点pi，计算pi到其他所有点的距离。将这些距离从小到大排序，以突变点为间隔分类）
k距离：给定数据集P={pi，i=0,1,2,…,n}，计算P(i)到集合D的子集S中所有点之间的距离，按照从小到大的顺序排列，d(k)就被称为k-距离。
MinPts：k-距离中k的值，一般取得小一些，多次尝试

优势和劣势

优势

不需要指定簇的个数
可以发现任意形状的簇
擅长找到离群点（检测任务）
两个参数就够了

劣势

高维数据有些困难（可以做降维）
参数难以选择（参数对结果的影响非常大）
Sklearn中效率很慢（数据削减策略）

k-means算法代码实现

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

class Kmeans:
    def __init__(self, data, num_clustres):  # data为数据集，num_k为k值
        self.data = data
        self.num_clustres = num_clustres

    def train(self, max_iterations):  # 最大迭代次数
        # 利用centroids_init()函数来随机选择num_clustres个中心点
        centroids = Kmeans.centroids_init(self.data,self.num_clustres)
        num_examples = self.data.shape[0]
        cloest_centroids_ids = np.empty((num_examples,1))
        for _ in range(max_iterations):
            # 利用centroids_find_closest()函数，得到当前每一个样本点到k个中心点的距离，并且找到最近的一个
            cloest_centroids_ids =Kmeans.centroids_find_closest(self.data,centroids)
            # 利用centroids_compute()进行中心点位置更新
            centroids = Kmeans.centroids_compute(self.data,cloest_centroids_ids,self.num_clustres)
        return centroids,cloest_centroids_ids

    def centroids_init(data,num_clustres): # 随机选择k个初始点
        num_examples = data.shape[0] # 又多少个样本数
        random_ids = np.random.permutation(num_examples)# numpy.random.permutation(x) 随机排列一个序列，或者数组。
        centroids = data[random_ids[:num_clustres],:] # 行为random选出那几行，列为：即全部特征
        # 这里需要求输入的data为ndarray的形式，因此在外面输入data时：data = data.values
        return centroids

    def centroids_find_closest(data,centroids): # 每个点找到最近的中心点
        num_example = data.shape[0]
        num_centroids = centroids.shape[0]
        cloest_centroids_ids = np.zeros((num_example,1))
        for example_index in range(num_example):
            distance = np.zeros((num_centroids,1)) # 对每个样本构造一个距离矩阵
            for centroids_index in range(num_centroids):
                distance_diff = data[example_index,:] - centroids[centroids_index,:]
                distance[centroids_index] = np.sum(distance_diff ** 2)
            cloest_centroids_ids[example_index] = np.argmin(distance) # num_centroids个距离里取最小值
        return cloest_centroids_ids #返回每个样品的簇id

    def centroids_compute(data,cloest_centroids_ids,num_clustres): #利用均值，更新簇的中心点
        num_features = data.shape[1]
        centroids = np.zeros((num_clustres,num_features))
        for centroids_id in range(num_clustres):
            cloest_ids = cloest_centroids_ids == centroids_id # 点是哪一个类的
            centroids[centroids_id] = np.mean(data[cloest_ids.flatten(),:],axis=0) # flatten返回一个一维数组，只能适用于numpy对象，即array或者mat
        return centroids

鸢尾花数据集

data = pd.read_csv('Desktop\数据分析\data\iris.csv')
data.head()

	Unnamed: 0	Sepal.Length	Sepal.Width	Petal.Length	Petal.Width	Species
0	1	5.1	3.5	1.4	0.2	setosa
1	2	4.9	3.0	1.4	0.2	setosa
2	3	4.7	3.2	1.3	0.2	setosa
3	4	4.6	3.1	1.5	0.2	setosa
4	5	5.0	3.6	1.4	0.2	setosa

#有四个特征，先进行降维
new_data = data.drop(columns=['Unnamed: 0','Species'])
def x_stander(x):
    x_mean = np.mean(x)
    x_std = np.std(x)
    return (x-x_mean)/x_std
data_std = x_stander(new_data)
ew,ev = np.linalg.eig(np.cov(data_std.T))
ew,ev

(array([2.93808505, 0.9201649 , 0.14774182, 0.02085386]),
 array([[ 0.52106591, -0.37741762, -0.71956635,  0.26128628],
        [-0.26934744, -0.92329566,  0.24438178, -0.12350962],
        [ 0.5804131 , -0.02449161,  0.14212637, -0.80144925],
        [ 0.56485654, -0.06694199,  0.63427274,  0.52359713]]))

ew_sort = ew[np.argsort(ew)[::-1]]
ev_sort = ev[np.argsort(ew)[::-1]]
sum_total = np.sum(ew_sort)
sum_i = 0
n = 1
for i in ew_sort:
    sum_i += i
    percent = round(sum_i/sum_total,2)*100
    print(f'第{n}个元素的占比为{percent}%')
    n += 1

第1个元素的占比为73.0%
第2个元素的占比为96.0%
第3个元素的占比为99.0%
第4个元素的占比为100.0%

V = ev_sort[:,:2]
data_1 = new_data.dot(V)
data_1.head()

	0	1
0	2.640270	-5.204041
1	2.670730	-4.666910
2	2.454606	-4.773636
3	2.545517	-4.648463
4	2.561228	-5.258629

data['Species'].value_counts()
iris_types=['setosa','versicolor','virginica']

plt.figure(figsize=(12,5))
plt.scatter(data_1[0][:],data_1[1][:])

<matplotlib.collections.PathCollection at 0x27311641be0>

在这里插入图片描述

num_example = data_1.shape[0]
data_2 = data_1[:].values.reshape(num_example,2)
num_clustres = 3
max_iterations = 50
data_2.dtype

dtype('float64')

kmeans = Kmeans(data_2,num_clustres)
centroids,cloest_centroids_ids = kmeans.train(max_iterations)

plt.figure(figsize=(12,8))
for cloest_centroids_id, centroid in enumerate(centroids):
    current_examples_index = (cloest_centroids_id == cloest_centroids_ids).flatten()
    plt.scatter(data_2[:,0][current_examples_index],data_2[:,1][current_examples_index],label=cloest_centroids_id)
    # data_2 = data_1[:].values.reshape(num_example,2)时将data_2转化成了array，因此采用data[0]时会报错，需用array的形式取第一列
plt.show()

在这里插入图片描述

聚类算法实践

Kmeans算法和Dbscan算法
半监督问题解决方案
聚类算法的评估

import numpy as np
import os
import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline
plt.rcParams['axes.labelsize']=14
plt.rcParams['xtick.labelsize']=12
plt.rcParams['ytick.labelsize']=12
import warnings
warnings.filterwarnings('ignore')
np.random.seed(42)

from sklearn.datasets import make_blobs

blob_centers = np.array([[0.2,2.3],
                         [-1.5,2.3],
                         [-2.8,1.8],
                         [-2.8,2.8],
                         [-2.8,1.3]])

blob_std = np.array([0.4,0.3,0.1,0.1,0.1])

X,y = make_blobs(n_samples=2000,centers=blob_centers,n_features=2,
                 cluster_std=blob_std,random_state=7)
# 确定五个中心点blob_centers，画5个簇。n_features为有多少个feature默认为2
# cluster_std为以这5个点为中心向四周发散，他的发散程度
# random_state类似随机种子
X,y #后面为无监督学习，所以不会使用y

(array([[-2.69823941,  1.3454702 ],
        [-2.87459835,  1.8097575 ],
        [ 0.96077126,  1.17046777],
        ...,
        [-2.80303543,  2.72948115],
        [ 0.24057359,  2.40103109],
        [-2.63807768,  1.95621065]]),
 array([4, 2, 0, ..., 3, 0, 2]))

def plot_clusters(X,y=None):
    plt.scatter(X[:,0],X[:,1],c=y,s=1)
    plt.xlabel('$x_1$',fontsize=14)
    plt.ylabel('$y_1$',fontsize=14,rotation=0)
plt.figure(figsize=(8,4))
plot_clusters(X)
plt.show()

在这里插入图片描述

决策边界

from sklearn.cluster import KMeans
k = 5
kmeans = KMeans(n_clusters = k, random_state = 42)
y_pred = kmeans.fit_predict(X)

fit_predict(X)与kmeans.labels_得到预测结果相同
kmeans.cluster_centers_ 簇中心点
transform 各个点到各个簇的中心点的距离

y_pred,kmeans.labels_

(array([4, 1, 0, ..., 3, 0, 1]), array([4, 1, 0, ..., 3, 0, 1]))

kmeans.cluster_centers_

array([[ 0.20876306,  2.25551336],
       [-2.80389616,  1.80117999],
       [-1.46679593,  2.28585348],
       [-2.79290307,  2.79641063],
       [-2.80037642,  1.30082566]])

X_new = np.array([[0,2],[3,2],[-3,3],[-3,2.5]])
kmeans.predict(X_new)

array([0, 0, 3, 3])

kmeans.transform(X_new)

array([[0.32995317, 2.81093633, 1.49439034, 2.9042344 , 2.88633901],
       [2.80290755, 5.80730058, 4.4759332 , 5.84739223, 5.84236351],
       [3.29399768, 1.21475352, 1.69136631, 0.29040966, 1.71086031],
       [3.21806371, 0.72581411, 1.54808703, 0.36159148, 1.21567622]])

def plot_data(X):
    plt.plot(X[:,0],X[:,1],'k.',markersize=2)#'k.'指定风格
    
def plot_centroids(centroids,weights=None,circle_color='w',cross_color='k'):
    if weights is not None:
        centroids = centroids[weights > weights.max()/10]
    plt.scatter(centroids[:,0],centroids[:,1],
               marker='o',s=50,linewidths=8,color=circle_color,
               zorder=11,alpha=0.9)
    plt.scatter(centroids[:,0],centroids[:,1],
               marker='x',s=10,linewidths=10,color=cross_color,
               zorder=11,alpha=1)
    
def plot_decision_boundaries(clusterer,X,resolution=1000,show_centroids=True,
                            show_xlabels=True,show_ylabels=True):
    mins = X.min(axis=0)-0.1
    maxs = X.max(axis=0)+0.1
    xx,yy = np.meshgrid(np.linspace(mins[0],maxs[0],resolution),
                       np.linspace(mins[1],maxs[1],resolution))
    Z = clusterer.predict(np.c_[xx.ravel(),yy.ravel()])
    Z = Z.reshape(xx.shape)
    
    #绘制等高线,contour为不加颜色
    plt.contourf(Z,extent=(mins[0],maxs[0],mins[1],maxs[1]),
                 cmap='Pastel2')
    plt.contour(Z,extent=(mins[0],maxs[0],mins[1],maxs[1]),
                 linewidths=1,colors='k')
    plot_data(X)
    if show_centroids:
        plot_centroids(clusterer.cluster_centers_)
    if show_xlabels:
        plt.xlabel('$x_1$',fontsize=14)
    else:
        plt.tick_params(labelbottom='off')
    if show_ylabels:
        plt.ylabel('$y_1$',fontsize=14,rotation=0)
    else:
        plt.tick_params(labelleft='off')

plt.figure(figsize=(8,4))
plot_decision_boundaries(kmeans,X)
plt.show

<function matplotlib.pyplot.show(close=None, block=None)>

在这里插入图片描述

算法流程

n_clusters:指定簇的个数
init:初始点的位置，'random’随机，或者可以指定一个ndarray
n_init:默认值为10，会跑10次，选择出最好的一次
max_iter:最大迭代次数

kmeans_iter1 = KMeans(n_clusters = 5,init = 'random',n_init=1,max_iter=1,random_state=1)
kmeans_iter2 = KMeans(n_clusters = 5,init = 'random',n_init=1,max_iter=2,random_state=1)
kmeans_iter3 = KMeans(n_clusters = 5,init = 'random',n_init=1,max_iter=3,random_state=1)

kmeans_iter1.fit(X)
kmeans_iter2.fit(X)
kmeans_iter3.fit(X)

KMeans(init='random', max_iter=3, n_clusters=5, n_init=1, random_state=1)

plt.figure(figsize=(12,8))

plt.subplot(321)
plot_data(X)
plot_centroids(kmeans_iter1.cluster_centers_,circle_color='r',cross_color='k')
plt.title('Update cluster_centers')

plt.subplot(322)
plot_decision_boundaries(kmeans_iter1,X,show_xlabels=False,show_ylabels=False)
plt.title('label')

plt.subplot(323)
# plot_data(X)变成了第一次迭代的决策边界
plot_decision_boundaries(kmeans_iter1,X,show_xlabels=False,show_ylabels=False)
plot_centroids(kmeans_iter2.cluster_centers_)

plt.subplot(324)
plot_decision_boundaries(kmeans_iter2,X,show_xlabels=False,show_ylabels=False)
plt.title('label')

plt.subplot(325)
plot_decision_boundaries(kmeans_iter2,X,show_xlabels=False,show_ylabels=False)
plot_centroids(kmeans_iter3.cluster_centers_)

plt.subplot(326)
plot_decision_boundaries(kmeans_iter3,X,show_xlabels=False,show_ylabels=False)
plt.title('label')

plt.show()

在这里插入图片描述

不稳定的结果

根据初始位置的不同，迭代会发生差异

def plot_clusterer_comparison(c1,c2,X):
    c1.fit(X)
    c2.fit(X)
    
    plt.figure(figsize=(10,4))
    plt.subplot(121)
    plot_decision_boundaries(c1,X)
    plt.subplot(122)
    plot_decision_boundaries(c2,X)

c1 = KMeans(n_clusters=5,init='random',n_init=1,random_state=11)
c2 = KMeans(n_clusters=5,init='random',n_init=1,random_state=19)
plot_clusterer_comparison(c1,c2,X)

在这里插入图片描述

评估指标

Inertia指标：每个样本与其质心的距离

kmeans.inertia_
# 默认进行10次实验，返回inertia值最小的一次

211.59853725816828

X_dist = kmeans.transform(X)
X_dist,kmeans.labels_

(array([[3.04611916, 0.46779778, 1.54944305, 1.45402521, 0.11146795],
        [3.11541584, 0.07122059, 1.48612753, 0.99002955, 0.51431557],
        [1.32016676, 3.81713488, 2.67154781, 4.09069201, 3.76340605],
        ...,
        [3.04886464, 0.92830156, 1.40795651, 0.06769209, 1.42865797],
        [0.14895409, 3.10300136, 1.71125   , 3.05913478, 3.23385668],
        [2.8625311 , 0.22700281, 1.21678483, 0.85434589, 0.67518173]]),
 array([4, 1, 0, ..., 3, 0, 1]))

X_dist[np.arange(len(X_dist)),kmeans.labels_]
# np.arange(len(X_dist))为X_dist所有样本个数 array([0,1,2,...,1997,1998,1999])

array([0.11146795, 0.07122059, 1.32016676, ..., 0.06769209, 0.14895409,
       0.22700281])

np.sum(X_dist[np.arange(len(X_dist)),kmeans.labels_]**2)

211.59853725816862

kmeans.score(X)

-211.59853725816836

c1.inertia_,c2.inertia_

(223.2910857281904, 237.4624916944286)

找到最佳簇数

若k值越大，得到的结果一定会越来越小

拐点法

kmeans_per_k = [KMeans(n_clusters = k).fit(X) for k in range(1,10)]
inertias = [model.inertia_ for model in kmeans_per_k]

plt.figure(figsize=(8,4))
plt.plot(range(1,10),inertias,'bo-')
plt.axis([1.8,5,0,1300])
plt.show()
# 4为拐点，因此可以令k=4，但也并不精确，因为此处n_clusters = 5

在这里插入图片描述

轮廓系数

ai:计算样本i到同簇其他样本的平均距离ai。ai越小，说明样本i越应该被聚类到该簇。将ai称为样本i的簇内不相似度。
bi：计算样本i到其他某簇cj的所有样本的平均距离bi，称为样本i与簇cj的不相似度。定义为样本i的簇间不相似度：bi=min{bi1,bi2,…bik}

$s_i=\frac{b_i-a_i}{max{a_i,b_i}}$

$s_i = \left\{ \begin{matrix} 1-\frac{a_i}{b_i},&a_i<b_i \\0,&a_i=b_a\\ \frac{b_i}{a_i}-1,&a_i>b_i \end{matrix} \right.$

结论：

si接近1，则说明样本i聚类合理
si接近-1，则说明样本i更应该分类到另外的簇
若si近似于0，则说明样本i在两个簇的边界上

注意：2 <= n_labels <= n_sample-1

from sklearn.metrics import silhouette_score 
silhouette_score(X,kmeans.labels_)

0.655517642572828

silhouette_scores = [silhouette_score(X,model.labels_) for model in kmeans_per_k[1:]]
silhouette_scores

[0.5966442557582528,
 0.5723900247411775,
 0.688531617595759,
 0.655517642572828,
 0.6020248775444942,
 0.6070979466596362,
 0.561138795623175,
 0.5667631147874026]

plt.figure(figsize=(8,4))
plt.plot(range(2,10),silhouette_scores,'bo-')
plt.show()

在这里插入图片描述

kmeans存在的问题

X1,y1 = make_blobs(n_samples=1000,centers=((4,-4),(0,0)),random_state=42)
X1 = X1.dot(np.array([[0.374,0.95],[0.732,0.598]]))
X2,y2 = make_blobs(n_samples=250,centers=1,random_state=42)
X2 = X2 +[6,-8]
X = np.r_[X1,X2]
y = np.r_[y1,y2]

plot_data(X)

在这里插入图片描述

good_kmeans = KMeans(n_clusters=3,init=np.array([[-1,1.5],[0,0],[4,1]]),n_init=1,random_state=42)
bad_kmeans = KMeans(n_clusters=3,random_state=42)
good_kmeans.fit(X)
bad_kmeans.fit(X)

KMeans(n_clusters=3, random_state=42)

plt.figure(figsize=(10,4))
plt.subplot(121)
plot_decision_boundaries(good_kmeans,X)
plt.title('good kmeans inertia:{}'.format(good_kmeans.inertia_))
plt.subplot(122)
plot_decision_boundaries(bad_kmeans,X)
plt.title('bad kmeans inertia:{}'.format(bad_kmeans.inertia_))
plt.show()

在这里插入图片描述

此时bad_kmeans的inertia值为2179.48,而good_kmeans的inertia值为2241.90
因此inertia的值只能作为一个参考，并不能作为实际的判断指标

图像分割小例子

from matplotlib.image import imread
image = imread('Desktop/flower.png')
image.shape #(长,宽,rgb)

(2083, 1280, 3)

X = image.reshape(-1,3)
#此处-1意思是（2083*1280*3）/维度3
X.shape

(2666240, 3)

flower_kmeans = KMeans(n_clusters=2,random_state=42).fit(X)

flower_kmeans.cluster_centers_

array([[0.2163994 , 0.1809465 , 0.13141222],
       [0.89492637, 0.6847649 , 0.71480083]], dtype=float32)

segmented_img = flower_kmeans.cluster_centers_[flower_kmeans.labels_].reshape((2083, 1280, 3))

segmented_imgs = []
n_colors = (10,8,6,4,2)
for n_cluster in n_colors:
    img_kmeans = KMeans(n_clusters=n_cluster,random_state=42).fit(X)
    segmented_img = img_kmeans.cluster_centers_[img_kmeans.labels_]
    segmented_imgs.append(segmented_img.reshape(image.shape))

plt.figure(figsize=(10,5))
plt.subplot(231)
plt.imshow(image)
plt.title('original image')

for idx,n_cluster in enumerate(n_colors):#enumerate返回枚举值：0，10；1，8...
    plt.subplot(232+idx)
    plt.imshow(segmented_imgs[idx])
    plt.title('{}colors'.format(n_cluster))

在这里插入图片描述

半监督学习

首先，让我们将训练集聚类为50个集群，然后对于每个聚类，找到最靠近质心的图像。将这些图像称为代表性图像：

from sklearn.datasets import load_digits

X_digits,y_digits = load_digits(return_X_y = True)

from sklearn.model_selection import train_test_split

X_train,X_test,y_train,y_test = train_test_split(X_digits,y_digits,random_state=42)

X_digits.shape,X_train.shape

((1797, 64), (1347, 64))

from sklearn.linear_model import LogisticRegression
n_labeled = 50

log_reg = LogisticRegression(random_state=42)
log_reg.fit(X_train[:n_labeled],y_train[:n_labeled])
log_reg.score(X_test,y_test)

0.8266666666666667

k = 50
linear_kmeans = KMeans(n_clusters=k,random_state=42)
X_digits_dist = linear_kmeans.fit_transform(X_train)
X_digits_dist.shape

(1347, 50)

representative_digits_idx = np.argmin(X_digits_dist,axis=0)
X_representative_digits = X_train[representative_digits_idx]

plt.figure(figsize=(8,2))
for index,X_representative_digit in enumerate(X_representative_digits):
    plt.subplot(k // 10,10,index+1)
    plt.imshow(X_representative_digit.reshape(8,8),cmap='binary',interpolation='bilinear')
    plt.axis('off')
plt.show()

在这里插入图片描述

y_representative_digits = np.array([4,8,0,6,8,3,7,7,9,2,
                                   5,5,8,5,2,1,2,9,6,1,
                                   1,6,9,0,8,3,0,7,4,1,
                                   6,5,2,4,1,8,6,3,9,2,
                                   4,2,9,4,7,6,2,3,1,1])

log_reg = LogisticRegression(random_state=42)
log_reg.fit(X_representative_digits,y_representative_digits)
log_reg.score(X_test,y_test)

0.92

扩展到其他实例中

y_train_propagated = np.empty(len(X_train),dtype=np.int32)
for i in range(k):
    y_train_propagated[linear_kmeans.labels_==i] = y_representative_digits[i]
    
log_reg = LogisticRegression(random_state=42)
log_reg.fit(X_train,y_train_propagated)
log_reg.score(X_test,y_test)

0.9288888888888889

只选择前20个

percentile_closest = 20

X_cluster_dist = X_digits_dist[np.arange(len(X_train)),linear_kmeans.labels_]
for i in range(k):
    in_cluster = (linear_kmeans.labels_ == i)
    cluster_dist = X_cluster_dist[in_cluster]#选择属于当前簇的所有样本
    cutoff_distance = np.percentile(cluster_dist,percentile_closest)#排序找到前20个
    above_cutoff = (X_cluster_dist>cutoff_distance) #返回false，true结果
    X_cluster_dist[in_cluster & above_cutoff] = -1

partially_propagated = (X_cluster_dist != -1)
x_train_partially_propagated = X_train[partially_propagated]
y_train_partially_propagated = y_train_propagated[partially_propagated]

log_reg = LogisticRegression(random_state=42)
log_reg.fit(x_train_partially_propagated,y_train_partially_propagated)
log_reg.score(X_test,y_test)

0.9444444444444444

DBSCAN算法

from sklearn.datasets import make_moons
X,y = make_moons(n_samples=1000,noise=0.05,random_state=42)

plt.plot(X[:,0],X[:,1],'b.')

[<matplotlib.lines.Line2D at 0x2731446e220>]

在这里插入图片描述

from sklearn.cluster import DBSCAN
dbscan = DBSCAN(eps=0.05,min_samples=5)
dbscan.fit(X)

DBSCAN(eps=0.05)

dbscan.labels_[:10] #DBSCAN中-1表示离群点

array([ 0,  2, -1, -1,  1,  0,  0,  0,  2,  5], dtype=int64)

dbscan.core_sample_indices_[:10]#核心对象的索引

array([ 0,  4,  5,  6,  7,  8, 10, 11, 12, 13], dtype=int64)

np.unique(dbscan.labels_) #一共7个簇

array([-1,  0,  1,  2,  3,  4,  5,  6], dtype=int64)

dbscan2 = DBSCAN(eps=0.2,min_samples=5)
dbscan2.fit(X)

DBSCAN(eps=0.2)

def plot_dbscan(dbscan,X,size,show_xlabels=True,show_ylabels=True):
    core_mask = np.zeros_like(dbscan.labels_,dtype=bool)
    core_mask[dbscan.core_sample_indices_] = True
    anomalies_mask = dbscan.labels_ == -1
    non_core_mask = ~(core_mask|anomalies_mask)
    
    cores = dbscan.components_
    anomalies = X[anomalies_mask]
    non_cores = X[non_core_mask]
    
    plt.scatter(cores[:,0],cores[:,1],
               c=dbscan.labels_[core_mask],marker='o',s=size,cmap='Paired')
    plt.scatter(cores[:,0],cores[:,1],marker='*',s=20,c=dbscan.labels_[core_mask])
    plt.scatter(anomalies[:,0],anomalies[:,1],c='r',marker='x',s=100)
    plt.scatter(non_cores[:,0],non_cores[:,1],c=dbscan.labels_[non_core_mask],marker='.')
    if show_xlabels:
        plt.xlabel('$x_1$',fontsize=14)
    else:
        plt.tick_params(labelbottom='off')
    if show_ylabels:
        plt.ylabel('$x_2$',fontsize=14,rotation=0)
    else:
        plt.tick_params(labelleft='off')
    plt.title('eps={:.2f},min_samples={}'.format(dbscan.eps,dbscan.min_samples),fontsize=14)

plt.figure(figsize=(9,3.2))

plt.subplot(121)
plot_dbscan(dbscan,X,size=100)
plt.subplot(122)
plot_dbscan(dbscan2,X,size=600,show_ylabels=False)

plt.show()

在这里插入图片描述

indigo女孩

关注

1
点赞
踩
13

收藏

觉得还不错? 一键收藏
1
评论
聚类分析和K-means&Dbscan原理

聚类分析和K-means&Dbscan原理聚类概念无监督问题：手里没有标签聚类：相似的东西分为一组难点：如何评估、如何调参K-means算法基本概念要得到簇的个数，需要指定K值质心：均值，即向量各维取平均即可（迭代时用）距离的度量：常用欧几里得距离（欧氏距离）和余弦相似度（先正则化）优化目标：（即损失函数最小）min∑i=1K∑x∈Cidist(ci,x)2min\sum^K_{i=1}\sum_{x\in{C_i}}{dist(c_i,x)^2}mini=1∑Kx∈Ci
复制链接

扫一扫