14 Python Data Analysis: Clustering Algorithms - K-means - DBSCAN

A记录学习路线

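Clustering

How does clustering differ from classification (discriminant analysis)?
Classification learns from samples whose class labels are already known; clustering receives no labels and groups the samples purely by how similar they are to one another.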

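The key measure between sample points: distance

• Definition of distance
• Common distances
– Euclidean distance (euclidean): distance in the everyday sense, i.e. the straight-line distance between two points
– Mahalanobis distance (mahalanobis): takes the correlation between variables into account and does not depend on the units of the variables
– Cosine distance (cosine): measures how similar two vectors are in direction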
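
The three distances listed above can be computed with `scipy.spatial.distance.cdist`. The snippet below is a minimal sketch on made-up toy data (the array `X` and the factor 100 are illustrative assumptions, not values from the course material); it also checks the claim that the Mahalanobis distance does not depend on the units of the variables.

```python
import numpy as np
from scipy.spatial.distance import cdist

# Toy data (illustrative): 5 samples, 2 strongly correlated features.
X = np.array([[1.0, 2.0],
              [2.0, 4.1],
              [3.0, 5.9],
              [4.0, 8.2],
              [5.0, 9.8]])

a, b = X[[0]], X[[4]]  # two samples, kept 2-D because cdist expects matrices

# Euclidean distance: straight-line distance between the two samples.
d_euclidean = cdist(a, b, metric='euclidean')[0, 0]

# Mahalanobis distance: needs the inverse covariance matrix of the data.
VI = np.linalg.inv(np.cov(X, rowvar=False))
d_mahalanobis = cdist(a, b, metric='mahalanobis', VI=VI)[0, 0]

# Cosine distance: 1 - cos(angle between the two vectors).
d_cosine = cdist(a, b, metric='cosine')[0, 0]

# Changing the unit of the second feature (here: multiplying it by 100)
# changes the Euclidean distance but not the Mahalanobis distance.
X_scaled = X * np.array([1.0, 100.0])
VI_scaled = np.linalg.inv(np.cov(X_scaled, rowvar=False))
d_mahalanobis_scaled = cdist(X_scaled[[0]], X_scaled[[4]],
                             metric='mahalanobis', VI=VI_scaled)[0, 0]

print(d_euclidean, d_mahalanobis, d_mahalanobis_scaled, d_cosine)
```

The two samples point in almost the same direction, so their cosine distance is close to 0 even though their Euclidean distance is large.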

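(Agglomerative) hierarchical clustering

Idea
1 At the start, every sample forms its own cluster
2 Choose a measure for the distance between samples and for the distance between clusters, and compute these distances
3 Merge the two closest clusters into a new cluster
4 Repeat steps 2-3, i.e. keep merging the two nearest clusters, reducing the number of clusters by one each time, until all samples have been merged into a single cluster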
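
The four steps can be sketched directly in NumPy. This is a deliberately naive illustration, not how scikit-learn implements it: the function name `agglomerative`, the `n_clusters` stopping point (instead of merging all the way down to one cluster) and the toy data are assumptions made for the example, and only average and complete linkage are supported.

```python
import numpy as np

def agglomerative(X, n_clusters, linkage="average"):
    """Naive agglomerative clustering; returns a list of index lists."""
    # Step 1: every sample starts as its own cluster.
    clusters = [[i] for i in range(len(X))]
    # Step 2: pairwise Euclidean distances between all samples.
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)

    def cluster_distance(a, b):
        block = D[np.ix_(a, b)]  # distances between members of a and b
        return block.max() if linkage == "complete" else block.mean()

    # Steps 3-4: merge the two closest clusters until n_clusters remain.
    while len(clusters) > n_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = cluster_distance(clusters[i], clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]  # j > i, so index i is unaffected
    return clusters

X = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
              [5.0, 5.0], [5.1, 5.2], [5.2, 4.9]])
result = agglomerative(X, n_clusters=2)
print(result)
```

Setting `n_clusters=1` reproduces step 4 literally: merging continues until a single cluster is left.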

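Ways to compute the distance between clusters

• Sum of squared deviations (Ward's method): ward
• Average linkage: average
• Maximum distance (complete linkage): complete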
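
To see how the choice of linkage affects the merge tree, the sketch below runs SciPy's `linkage` with each of the three methods on the same data. The data (two Gaussian blobs drawn with a fixed seed) is an assumption chosen for illustration.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Illustrative data: two well-separated blobs of 10 points each.
rng = np.random.RandomState(0)
X = np.vstack([rng.normal(0.0, 0.3, size=(10, 2)),
               rng.normal(5.0, 0.3, size=(10, 2))])

results = {}
for method in ("ward", "average", "complete"):
    Z = linkage(X, method=method)  # (n - 1) x 4 table, one row per merge
    # Cut the tree so that at most 2 clusters remain.
    results[method] = fcluster(Z, t=2, criterion="maxclust")
    print(method, "height of the final merge: %.2f" % Z[-1, 2])
```

On data this well separated all three methods recover the same two groups; only the merge heights differ, because each method measures the distance between clusters differently.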

AgglomerativeClustering

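AgglomerativeClustering(n_clusters=2, affinity='euclidean', memory=Memory(cachedir=None), connectivity=None, n_components=None, compute_full_tree='auto', linkage='ward', pooling_func=)
This is the signature of an older scikit-learn release; recent releases use metric in place of affinity and no longer accept n_components or pooling_func.
Attributes
– labels_
– n_leaves_
– n_components_ (n_connected_components_ in recent releases)
– children_
Methods
– fit
– fit_predict
– get_params
– set_params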
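
A minimal usage sketch of the estimator and the attributes listed above. The six-point data set is made up for the example, and only parameters that exist in both old and recent scikit-learn releases (`n_clusters`, `linkage`) are passed.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Illustrative data: two tight groups of three points.
X = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
              [5.0, 5.0], [5.1, 5.2], [5.2, 4.9]])

model = AgglomerativeClustering(n_clusters=2, linkage="ward")
labels = model.fit_predict(X)  # fit the tree and return one label per sample

print(labels)            # cluster label of each sample
print(model.n_leaves_)   # number of leaves in the tree (= number of samples)
print(model.children_)   # the two nodes merged at each step
```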

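Dynamic clustering: the K-means method

• Algorithm:
1 Choose K points as the initial centroids
2 Assign every point to its nearest centroid, forming K clusters
3 Recompute the centroid of each cluster
4 Repeat steps 2-3 until the centroids no longer change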
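
The four steps translate almost line by line into NumPy. The sketch below is an illustration only: the function name `kmeans`, the random choice of initial centroids, the `max_iter` cap and the toy data are assumptions, and scikit-learn's `KMeans` uses a more careful initialization (k-means++).

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    rng = np.random.RandomState(seed)
    # Step 1: pick k distinct samples as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 2: assign every point to its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: recompute each centroid as the mean of its cluster
        # (an empty cluster simply keeps its previous centroid).
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Step 4: stop once the centroids no longer move.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels

X = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
              [5.0, 5.0], [5.1, 5.2], [5.2, 4.9]])
centroids, labels = kmeans(X, k=2)
print(centroids)
print(labels)
```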

Optimal solution
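
A common way to write down the optimum that K-means searches for, assuming the standard within-cluster sum-of-squares criterion (the quantity scikit-learn exposes as `inertia_`), is the following, where $C_k$ denotes the $k$-th cluster and $\mu_k$ its centroid:

```latex
\min_{C_1,\dots,C_K} \; \sum_{k=1}^{K} \sum_{x_i \in C_k} \lVert x_i - \mu_k \rVert^2,
\qquad
\mu_k = \frac{1}{|C_k|} \sum_{x_i \in C_k} x_i
```

Neither the assignment step nor the centroid update can increase this sum, so the iteration converges, but only to a local optimum that depends on the initial centroids.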

KMeans

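KMeans(n_clusters=8, init='k-means++', n_init=10, max_iter=300, tol=0.0001, precompute_distances='auto', verbose=0, random_state=None, copy_x=True, n_jobs=1)
This is the signature of an older scikit-learn release; recent releases no longer accept precompute_distances or n_jobs.
Attributes
– cluster_centers_
– labels_
– inertia_
Methods
– fit
– fit_predict
– predict
– get_params
– set_params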
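
A short usage sketch of `KMeans` that touches each attribute listed above. The nine sample points and the two query points passed to `predict` are made up for the example.

```python
import numpy as np
from sklearn.cluster import KMeans

# Illustrative data: one group near (2, 2) and one near (5.5, 7).
X = np.array([[1, 1], [2, 3], [3, 2], [1, 2],
              [5, 8], [6, 6], [5, 7], [5, 6], [6, 7]], dtype=float)

km = KMeans(n_clusters=2, init="k-means++", n_init=10, random_state=0)
km.fit(X)

print(km.cluster_centers_)  # one centroid per cluster
print(km.labels_)           # cluster index of each training sample
print(km.inertia_)          # sum of squared distances to the nearest centroid

# predict assigns new points to the nearest learned centroid.
new_labels = km.predict(np.array([[0.0, 0.0], [6.0, 8.0]]))
print(new_labels)
```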

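Elbow method

Plot the average distortion (the mean distance from each sample to its nearest centroid) against k. The distortion generally falls as k grows; a reasonable k is the one at the "elbow" of the curve, beyond which adding more clusters brings little further improvement.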

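Pros and cons of K-means

• Efficient. The result does depend on the initial centroids, although k-means++ initialization and several restarts (n_init) make this much less of a problem in practice
• Cannot handle non-spherical clusters
• Cannot handle clusters of different sizes or densities
• Outliers can distort the result considerably (so remove them first)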

Density-based methods: DBSCAN

• DBSCAN = Density-Based Spatial Clustering of Applications with Noise
• The algorithm turns regions of sufficiently high density into clusters and can discover clusters of arbitrary shape
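
The "arbitrary shape" point is easiest to see on data that is not made of round blobs. The sketch below compares K-means and DBSCAN on scikit-learn's two-moons toy data set; the data set, `eps=0.2` and `min_samples=5` are assumptions picked for this illustration, and agreement with the true moon membership is measured with the adjusted Rand index (1.0 means a perfect match).

```python
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score

# Two interleaving half-circles: clearly separated, but not spherical.
X, y_true = make_moons(n_samples=300, noise=0.05, random_state=0)

km_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
db_labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)

ari_kmeans = adjusted_rand_score(y_true, km_labels)
ari_dbscan = adjusted_rand_score(y_true, db_labels)
print("K-means ARI: %.2f" % ari_kmeans)
print("DBSCAN  ARI: %.2f" % ari_dbscan)
```

K-means cuts the plane with a straight boundary and therefore mixes the two moons, while DBSCAN follows each dense band.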

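Key concepts

r-neighborhood: the region within radius r of a given point
Core point: a point whose r-neighborhood contains at least M points
Directly density-reachable: if point p lies in the r-neighborhood of a core point q, then p is directly density-reachable from q
Density-reachable: if there is a chain of points p1, p2, …, pn with p1 = q and pn = p such that each pi+1 is directly density-reachable from pi with respect to r and M, then p is density-reachable from q with respect to r and M
Density-connected: if the sample set D contains a point o from which both p and q are density-reachable with respect to r and M, then p and q are density-connected with respect to r and M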
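
These definitions can be checked on a handful of points. The sketch below labels each point as core, border (not core, but directly density-reachable from a core point) or noise. The function name `classify_points` and the data are hypothetical, and the point itself is counted as part of its own r-neighborhood, which is the convention scikit-learn uses for `min_samples`.

```python
import numpy as np

def classify_points(X, r, M):
    """Return boolean masks (is_core, is_border, is_noise)."""
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    within_r = D <= r                       # within_r[i, j]: j lies in i's r-neighborhood
    is_core = within_r.sum(axis=1) >= M     # at least M points, the point itself included
    # Border: not core, but inside the r-neighborhood of at least one core point.
    is_border = ~is_core & within_r[:, is_core].any(axis=1)
    is_noise = ~is_core & ~is_border
    return is_core, is_border, is_noise

X = np.array([[0.0, 0.0], [0.0, 0.5], [0.5, 0.0], [0.5, 0.5],  # dense square
              [1.2, 0.5],                                     # close to one corner
              [5.0, 5.0]])                                    # isolated point
is_core, is_border, is_noise = classify_points(X, r=0.8, M=4)
print(is_core, is_border, is_noise)
```

The four corners of the square are core points, the point at (1.2, 0.5) is directly density-reachable from the corner (0.5, 0.5) but is not a core point itself, and the isolated point is noise.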

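Basic idea of the algorithm
1 Choose suitable values for r and M
2 Examine every sample point; if the r-neighborhood of a point p contains at least M points, create a new cluster with p as its core point
3 Repeatedly look for points that are directly density-reachable (and, further along the chain, density-reachable) from these core points and add them to the corresponding cluster; merge clusters whose core points turn out to be density-connected
4 The algorithm ends when no new point can be added to any cluster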

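DBSCAN algorithm description

Input: a database of n objects, radius e, minimum number of points MinPts (r and M above)
Output: all generated clusters that meet the density requirement
(1) Repeat
(2) Take an unprocessed point from the database;
(3) IF the point is a core point THEN find all objects that are density-reachable from it and form a cluster;
(4) ELSE the point is a border point or noise (a non-core object); skip it and move on to the next point;
(5) UNTIL all points have been processed.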
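
The procedure can be sketched in plain Python as follows. This is an illustrative O(n²) version under stated assumptions, not scikit-learn's implementation: the function name `dbscan`, the sentinel values and the toy data are made up, a point counts as its own neighbor, and -1 marks noise as in scikit-learn.

```python
import numpy as np

NOISE, UNVISITED = -1, -2

def dbscan(X, eps, min_pts):
    n = len(X)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    neighbours = [np.flatnonzero(D[i] <= eps) for i in range(n)]
    labels = np.full(n, UNVISITED)
    cluster_id = 0
    for p in range(n):                        # (1)-(2) take an unprocessed point
        if labels[p] != UNVISITED:
            continue
        if len(neighbours[p]) < min_pts:      # (4) not a core point
            labels[p] = NOISE                 # may still be claimed as a border point later
            continue
        labels[p] = cluster_id                # (3) core point: grow a new cluster
        queue = list(neighbours[p])
        while queue:
            q = queue.pop()
            if labels[q] == NOISE:            # border point reached from a core point
                labels[q] = cluster_id
            if labels[q] != UNVISITED:
                continue
            labels[q] = cluster_id
            if len(neighbours[q]) >= min_pts:  # q is a core point too: keep expanding
                queue.extend(neighbours[q])
        cluster_id += 1
    return labels                             # (5) every point has been processed

X = np.array([[0.0, 0.0], [0.0, 0.5], [0.5, 0.0], [0.5, 0.5],
              [1.2, 0.5], [5.0, 5.0]])
labels = dbscan(X, eps=0.8, min_pts=4)
print(labels)
```

The first five points end up in cluster 0 (four core points plus one border point) and the isolated point is labelled -1.
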
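DBSCAN is very sensitive to its user-defined parameters: small changes can lead to very different results. There is no general rule for choosing them, so they are usually set from experience.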
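
The sensitivity to the radius is easy to reproduce. The sketch below runs scikit-learn's `DBSCAN` on the two-moons toy data with three values of `eps`; the data set and the specific values are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

summary = {}
for eps in (0.03, 0.2, 1.0):
    db = DBSCAN(eps=eps, min_samples=5).fit(X)
    labels = db.labels_
    n_clusters = len(set(labels)) - (1 if -1 in labels else 0)  # -1 marks noise
    n_noise = int(np.sum(labels == -1))
    n_core = len(db.core_sample_indices_)
    summary[eps] = (n_clusters, n_noise, n_core)
    print("eps=%.2f  clusters=%d  noise=%d  core=%d"
          % (eps, n_clusters, n_noise, n_core))
```

A radius that is too small leaves many points as noise, a radius that is too large merges both moons into a single cluster, and only an intermediate value recovers the two moons.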

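DBSCAN(eps=0.5, min_samples=5, metric='euclidean', algorithm='auto', leaf_size=30, p=None, random_state=None)
Here eps is the radius r and min_samples is the minimum count M. This is the signature of an older scikit-learn release; recent releases no longer accept random_state.
Attributes
– core_sample_indices_
– components_
– labels_
Methods
– fit
– fit_predict
– get_params
– set_params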

Code

# coding: utf-8

# In[1]:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy import ndimage
from sklearn import manifold, datasets
from sklearn.cluster import AgglomerativeClustering


# In[2]:

digits = datasets.load_digits(n_class=10)
X = digits.data
y = digits.target
n_samples, n_features = X.shape
print(X[:5, :])
print(n_samples, n_features)


# In[3]:

# Visualize the clustering
def plot_clustering(X_red, X, labels, title=None):
    x_min, x_max = np.min(X_red, axis=0), np.max(X_red, axis=0)
    X_red = (X_red - x_min) / (x_max - x_min)

    plt.figure(figsize=(6, 4))
    for i in range(X_red.shape[0]):
        plt.text(X_red[i, 0], X_red[i, 1], str(y[i]),
                 color=plt.cm.nipy_spectral(labels[i] / 10.),
                 fontdict={'weight': 'bold', 'size': 9})

    plt.xticks([])
    plt.yticks([])
    if title is not None:
        plt.title(title, size=17)
    plt.axis('off')
    plt.tight_layout()


# In[ ]:

# 2D embedding of the digits dataset
print("Computing embedding")
X_red = manifold.SpectralEmbedding(n_components=2).fit_transform(X)
print("Done.")

from sklearn.cluster import AgglomerativeClustering

for linkage in ('ward', 'average', 'complete'):
    clustering = AgglomerativeClustering(linkage=linkage, n_clusters=10)
    clustering.fit(X_red)
    plot_clustering(X_red, X, clustering.labels_, "%s linkage" % linkage)


plt.show()


# In[3]:

# %matplotlib inline  (IPython/Jupyter magic; enable it when running in a notebook)
X0 = np.array([7, 5, 7, 3, 4, 1, 0, 2, 8, 6, 5, 3])
X1 = np.array([5, 7, 7, 3, 6, 4, 0, 2, 7, 8, 5, 7])
plt.figure()
plt.axis([-1, 9, -1, 9])
plt.grid(True)
plt.plot(X0, X1, 'k.');


# In[4]:

C1 = [1, 4, 5, 9, 11]
C2 = list(set(range(12)) - set(C1))
X0C1, X1C1 = X0[C1], X1[C1]
X0C2, X1C2 = X0[C2], X1[C2]
plt.figure()
plt.axis([-1, 9, -1, 9])
plt.grid(True)
plt.plot(X0C1, X1C1, 'rx')
plt.plot(X0C2, X1C2, 'g.')
plt.plot(4,6,'rx',ms=12.0)
plt.plot(5,5,'g.',ms=12.0);


# In[5]:

C1 = [1, 2, 4, 8, 9, 11]
C2 = list(set(range(12)) - set(C1))
X0C1, X1C1 = X0[C1], X1[C1]
X0C2, X1C2 = X0[C2], X1[C2]
plt.figure()
plt.axis([-1, 9, -1, 9])
plt.grid(True)
plt.plot(X0C1, X1C1, 'rx')
plt.plot(X0C2, X1C2, 'g.')
plt.plot(3.8,6.4,'rx',ms=12.0)
plt.plot(4.57,4.14,'g.',ms=12.0);


# In[6]:

C1 = [0, 1, 2, 4, 8, 9, 10, 11]
C2 = list(set(range(12)) - set(C1))
X0C1, X1C1 = X0[C1], X1[C1]
X0C2, X1C2 = X0[C2], X1[C2]
plt.figure()
plt.axis([-1, 9, -1, 9])
plt.grid(True)
plt.plot(X0C1, X1C1, 'rx')
plt.plot(X0C2, X1C2, 'g.')
plt.plot(5.5,7.0,'rx',ms=12.0)
plt.plot(2.2,2.8,'g.',ms=12.0);


# In[7]:

cluster1 = np.random.uniform(0.5, 1.5, (2, 10))
cluster2 = np.random.uniform(3.5, 4.5, (2, 10))
X = np.hstack((cluster1, cluster2)).T
plt.figure()
plt.axis([0, 5, 0, 5])
plt.grid(True)
plt.plot(X[:,0],X[:,1],'k.');


# In[8]:

from sklearn.cluster import KMeans
from scipy.spatial.distance import cdist
K = range(1, 10)
meandistortions = []
for k in K:
    kmeans = KMeans(n_clusters=k, n_init=10)
    kmeans.fit(X)
    meandistortions.append(sum(np.min(cdist(X, kmeans.cluster_centers_, 'euclidean'), axis=1)) / X.shape[0])
plt.plot(K, meandistortions, 'bx-')
plt.xlabel('k')
plt.ylabel('Average distortion')
plt.title('Best k')


# In[9]:

import numpy as np
x1 = np.array([1, 2, 3, 1, 5, 6, 5, 5, 6, 7, 8, 9, 7, 9])
x2 = np.array([1, 3, 2, 2, 8, 6, 7, 6, 7, 1, 2, 1, 1, 3])
X = np.array(list(zip(x1, x2))).reshape(len(x1), 2)
plt.figure()
plt.axis([0, 10, 0, 10])
plt.grid(True)
plt.plot(X[:,0],X[:,1],'k.');


# In[10]:

from sklearn.cluster import KMeans
from scipy.spatial.distance import cdist
K = range(1, 10)
meandistortions = []
for k in K:
    kmeans = KMeans(n_clusters=k, n_init=10)
    kmeans.fit(X)
    meandistortions.append(sum(np.min(cdist(X, kmeans.cluster_centers_, 'euclidean'), axis=1)) / X.shape[0])
plt.plot(K, meandistortions, 'bx-')
plt.xlabel('k')
plt.ylabel('Average distortion')
plt.title('Best K')


# In[11]:


# ===================================
# Demo of DBSCAN clustering algorithm
# ===================================
#
# Finds core samples of high density and expands clusters from them.

import numpy as np

from sklearn.cluster import DBSCAN
from sklearn import metrics
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler


##############################################################################
# Generate sample data
centers = [[1, 1], [-1, -1], [1, -1]]
X, labels_true = make_blobs(n_samples=750, centers=centers, cluster_std=0.4,
                            random_state=0)

X = StandardScaler().fit_transform(X)

##############################################################################
# Compute DBSCAN
db = DBSCAN(eps=0.3, min_samples=10).fit(X)
core_samples_mask = np.zeros_like(db.labels_, dtype=bool)
core_samples_mask[db.core_sample_indices_] = True
labels = db.labels_

# Number of clusters in labels, ignoring noise if present.
n_clusters_ = len(set(labels)) - (1 if -1 in labels else 0)

print('Estimated number of clusters: %d' % n_clusters_)
print("Homogeneity: %0.3f" % metrics.homogeneity_score(labels_true, labels))
print("Completeness: %0.3f" % metrics.completeness_score(labels_true, labels))
print("V-measure: %0.3f" % metrics.v_measure_score(labels_true, labels))
print("Adjusted Rand Index: %0.3f"
      % metrics.adjusted_rand_score(labels_true, labels))
print("Adjusted Mutual Information: %0.3f"
      % metrics.adjusted_mutual_info_score(labels_true, labels))
print("Silhouette Coefficient: %0.3f"
      % metrics.silhouette_score(X, labels))

##############################################################################
# Plot result
import matplotlib.pyplot as plt

# Black removed and is used for noise instead.
unique_labels = set(labels)
colors = plt.cm.Spectral(np.linspace(0, 1, len(unique_labels)))
for k, col in zip(unique_labels, colors):
    if k == -1:
        # Black used for noise.
        col = 'k'

    class_member_mask = (labels == k)

    xy = X[class_member_mask & core_samples_mask]
    plt.plot(xy[:, 0], xy[:, 1], 'o', markerfacecolor=col,
             markeredgecolor='k', markersize=14)

    xy = X[class_member_mask & ~core_samples_mask]
    plt.plot(xy[:, 0], xy[:, 1], 'o', markerfacecolor=col,
             markeredgecolor='k', markersize=6)

plt.title('Estimated number of clusters: %d' % n_clusters_)
plt.show()