sklearn Clustering Algorithms: HAC

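Basic idea
Hierarchical Agglomerative Clustering (HAC) is a widely used bottom-up clustering algorithm. It starts by treating every sample as its own cluster, then repeatedly merges the two closest clusters until a stopping condition is met, for example when the number of clusters has fallen to 20% of the initial count (that is, 80% of the initial clusters have been merged away). In summary, HAC proceeds as follows.
    (1) Treat every data point in the training set as its own cluster;
    (2) Compute the distance between every pair of clusters and merge the two clusters that are closest (most similar);
    (3) Repeat step (2) until the stopping condition is met.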
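The three steps above can be sketched in a few lines of NumPy. This is a deliberately naive illustration, not how scikit-learn implements the algorithm. The function name naive_hac, the sample points, and the choice of "smallest pairwise point distance" as the cluster distance are all assumptions made for the example.

```python
import numpy as np


def naive_hac(points, n_clusters):
    """Merge the two closest clusters until n_clusters remain (illustrative only)."""
    # Step 1: every point starts as its own cluster (stored as lists of indices).
    clusters = [[i] for i in range(len(points))]

    def cluster_distance(a, b):
        # Assumption: cluster distance = smallest pairwise point distance.
        return min(np.linalg.norm(points[i] - points[j]) for i in a for j in b)

    # Step 3: repeat until the stopping condition (a target cluster count) is met.
    while len(clusters) > n_clusters:
        # Step 2: find the closest pair of clusters and merge them.
        best = None
        for p in range(len(clusters)):
            for q in range(p + 1, len(clusters)):
                d = cluster_distance(clusters[p], clusters[q])
                if best is None or d < best[0]:
                    best = (d, p, q)
        _, p, q = best
        clusters[p] = clusters[p] + clusters[q]
        del clusters[q]
    return clusters


# Made-up data: two well-separated groups of three points each.
points = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                   [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])
result = naive_hac(points, n_clusters=2)
print(result)
```

The double loop recomputes every cluster distance on each pass, so the sketch is only practical for very small inputs.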
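The merge step needs a definition of the distance between two clusters. Four linkage criteria are commonly used:
    (1) Single-link: the distance between the two closest points of the two clusters (MIN);
    (2) Complete-link: the distance between the two farthest points of the two clusters (MAX);
    (3) Average-link: the mean distance over all pairs of points, one taken from each cluster (AVERAGE);
    (4) Ward-link: the increase in the within-cluster sum of squared deviations caused by merging the two clusters.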
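To make the four criteria concrete, the sketch below evaluates each of them by hand for two tiny clusters. The arrays A and B and the helper sse are made up for illustration. The Ward value is the raw increase in the sum of squares, which is not necessarily the scaled number a library reports as its merge distance.

```python
import numpy as np

# Two tiny example clusters (made-up data for illustration).
A = np.array([[0.0, 0.0], [0.0, 2.0]])
B = np.array([[3.0, 0.0], [5.0, 0.0]])

# All pairwise Euclidean distances between points of A and points of B.
pairwise = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2)

single_link = pairwise.min()     # closest pair (MIN)
complete_link = pairwise.max()   # farthest pair (MAX)
average_link = pairwise.mean()   # mean over all pairs (AVERAGE)


def sse(points):
    """Sum of squared distances from each point to the cluster centroid."""
    return float(((points - points.mean(axis=0)) ** 2).sum())


# Ward: how much the within-cluster sum of squares grows if A and B merge.
ward_increase = sse(np.vstack([A, B])) - sse(A) - sse(B)

print(single_link, complete_link, average_link, ward_increase)
```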
API reference

class sklearn.cluster.AgglomerativeClustering(
	n_clusters=2,
	*, 
	affinity='euclidean', 
	memory=None, 
	connectivity=None,
	compute_full_tree='auto', 
	linkage='ward', 
	distance_threshold=None, 	
	compute_distances=False
)
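Parameter | Type | Description
n_clusters | int or None, default=2 | Number of clusters to find. Exactly one of n_clusters and distance_threshold must be None.
affinity | str or callable, default='euclidean' | Metric used to compute the linkage, e.g. 'euclidean', 'manhattan', 'cosine'. With linkage='ward' only 'euclidean' is accepted. Renamed to metric in scikit-learn 1.2; affinity was removed in 1.4.
memory | str or object with the joblib.Memory interface, default=None | Used to cache the output of the tree computation. A string is taken as the path to the caching directory.
connectivity | array-like or callable, default=None | Connectivity matrix that defines, for each sample, its neighboring samples according to a given structure of the data.
compute_full_tree | 'auto' or bool, default='auto' | If False, tree construction stops early at n_clusters, which reduces computation time when the number of clusters is not small compared to the number of samples. Only useful together with a connectivity matrix. Must be True when distance_threshold is not None.
linkage | {'ward', 'complete', 'average', 'single'}, default='ward' | Linkage criterion, i.e. how the distance between two clusters is measured. The default is 'ward'.
distance_threshold | float, default=None | Linkage distance at or above which clusters are not merged. If not None, n_clusters must be None and compute_full_tree must be True.
compute_distances | bool, default=False | If True, distances between clusters are computed even when distance_threshold is not used. Useful for drawing dendrograms.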
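The interplay between n_clusters and distance_threshold is easiest to see side by side. The following is a minimal sketch on made-up data. It uses linkage='single' only because the merge distances are then easy to check by hand, and it leaves the metric at its default so the same code works whether the installed version calls that parameter affinity or metric.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Made-up data: two groups of three points, far apart from each other.
X = np.array([[0, 0], [0, 1], [1, 0],
              [10, 10], [10, 11], [11, 10]], dtype=float)

# Option 1: fix the number of clusters.
by_count = AgglomerativeClustering(n_clusters=2, linkage="single").fit(X)

# Option 2: fix a distance cut-off instead; n_clusters must then be None.
by_distance = AgglomerativeClustering(
    n_clusters=None, distance_threshold=3.0, linkage="single"
).fit(X)

print(by_count.n_clusters_, by_distance.n_clusters_)
print(by_distance.labels_)
```

Within each group the single-link merge distance is 1, while the two groups are more than 13 apart, so a threshold of 3.0 stops the merging once the two groups have formed.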
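Attribute | Type | Description
n_clusters_ | int | Number of clusters found.
labels_ | ndarray of shape (n_samples,) | Cluster label of each sample.
n_leaves_ | int | Number of leaves in the hierarchical tree.
n_connected_components_ | int | Estimated number of connected components in the graph.
n_features_in_ | int | Number of features seen during fit.
feature_names_in_ | ndarray of shape (n_features_in_,) | Names of the features seen during fit. Defined only when X has feature names that are all strings.
children_ | array-like of shape (n_samples-1, 2) | The children of each non-leaf node.
distances_ | array-like of shape (n_nodes-1,) | Distances between the nodes at the corresponding positions in children_. Only computed if distance_threshold is used or compute_distances is True.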
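A quick way to get a feel for these attributes is to fit a small model and print them. The sketch below reuses the six-point toy array from the official example and assumes a scikit-learn version that supports compute_distances (0.24 or later).

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

X = np.array([[1, 2], [1, 4], [1, 0],
              [4, 2], [4, 4], [4, 0]])

# compute_distances=True stores merge distances even though n_clusters is set.
model = AgglomerativeClustering(n_clusters=2, compute_distances=True).fit(X)

n_samples = X.shape[0]
print("n_clusters_:", model.n_clusters_)
print("labels_ shape:", model.labels_.shape)
print("n_leaves_:", model.n_leaves_)
print("children_ shape:", model.children_.shape)
print("distances_ shape:", model.distances_.shape)

# Row i of children_ describes the merge that creates node n_samples + i.
# Indices below n_samples are original samples; larger ones are earlier merges.
for i, (left, right) in enumerate(model.children_):
    print(f"node {n_samples + i}: merges {left} and {right} "
          f"at distance {model.distances_[i]:.3f}")
```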
Method | Description
fit(X[, y]) | Fit the hierarchical clustering from features, or distance matrix.
fit_predict(X[, y]) | Fit and return the result of each sample's clustering assignment.
get_params([deep]) | Get parameters for this estimator.
set_params(**params) | Set the parameters of this estimator.
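The four methods can be exercised together in a short sketch. The data is the same toy array used in the official example; the variable names are arbitrary.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

X = np.array([[1, 2], [1, 4], [1, 0],
              [4, 2], [4, 4], [4, 0]])

# fit() stores the result on the estimator; fit_predict() also returns labels.
fitted = AgglomerativeClustering(n_clusters=2).fit(X)
labels = AgglomerativeClustering(n_clusters=2).fit_predict(X)
same = np.array_equal(fitted.labels_, labels)

# get_params() returns the constructor arguments as a dict;
# set_params() updates them in place and returns the estimator itself.
model = AgglomerativeClustering()
before = model.get_params()["linkage"]
model.set_params(linkage="average", n_clusters=3)
after = model.get_params()["linkage"]

print(same, before, after)
```

Note that the table lists no predict method: the estimator only labels the samples it was fitted on, so assigning new points to clusters needs a separate step.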

Code example

>>> from sklearn.cluster import AgglomerativeClustering
>>> import numpy as np
>>> X = np.array([[1, 2], [1, 4], [1, 0],
...               [4, 2], [4, 4], [4, 0]])
>>> clustering = AgglomerativeClustering().fit(X)
>>> clustering
AgglomerativeClustering()
>>> clustering.labels_
array([1, 1, 1, 0, 0, 0])

Worked examples
test1.py

import numpy as np

from matplotlib import pyplot as plt
from scipy.cluster.hierarchy import dendrogram
from sklearn.datasets import load_iris
from sklearn.cluster import AgglomerativeClustering


def plot_dendrogram(model, **kwargs):
    # Create linkage matrix and then plot the dendrogram

    # create the counts of samples under each node
    counts = np.zeros(model.children_.shape[0])
    n_samples = len(model.labels_)
    for i, merge in enumerate(model.children_):
        current_count = 0
        for child_idx in merge:
            if child_idx < n_samples:
                current_count += 1  # leaf node
            else:
                current_count += counts[child_idx - n_samples]
        counts[i] = current_count

    linkage_matrix = np.column_stack(
        [model.children_, model.distances_, counts]
    ).astype(float)

    # Plot the corresponding dendrogram
    dendrogram(linkage_matrix, **kwargs)


iris = load_iris()
X = iris.data

# setting distance_threshold=0 ensures we compute the full tree.
model = AgglomerativeClustering(distance_threshold=0, n_clusters=None)

model = model.fit(X)
plt.title("Hierarchical Clustering Dendrogram")
# plot the top three levels of the dendrogram
plot_dendrogram(model, truncate_mode="level", p=3)
plt.xlabel("Number of points in node (or index of point if no parenthesis).")
plt.show()

Output:
[Figure: Hierarchical Clustering Dendrogram]
test2.py

import time
import numpy as np
import matplotlib.pyplot as plt
import mpl_toolkits.mplot3d.axes3d as p3
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_swiss_roll

# #############################################################################
# Generate data (swiss roll dataset)
n_samples = 1500
noise = 0.05
X, _ = make_swiss_roll(n_samples, noise=noise)
# Make it thinner
X[:, 1] *= 0.5

# #############################################################################
# Compute clustering
print("Compute unstructured hierarchical clustering...")
st = time.time()
ward = AgglomerativeClustering(n_clusters=6, linkage="ward").fit(X)
elapsed_time = time.time() - st
label = ward.labels_
print("Elapsed time: %.2fs" % elapsed_time)
print("Number of points: %i" % label.size)

# #############################################################################
# Plot result
fig = plt.figure()
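# Axes3D(fig) is not added to the figure automatically in recent matplotlib
# releases, so request the 3D axes from the figure instead.
ax = fig.add_subplot(111, projection="3d")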
ax.view_init(7, -80)
for l in np.unique(label):
    ax.scatter(
        X[label == l, 0],
        X[label == l, 1],
        X[label == l, 2],
        color=plt.cm.jet(float(l) / np.max(label + 1)),
        s=20,
        edgecolor="k",
    )
plt.title("Without connectivity constraints (time %.2fs)" % elapsed_time)


# #############################################################################
# Define the structure A of the data: here, a 10-nearest-neighbors graph
from sklearn.neighbors import kneighbors_graph

connectivity = kneighbors_graph(X, n_neighbors=10, include_self=False)

# #############################################################################
# Compute clustering
print("Compute structured hierarchical clustering...")
st = time.time()
ward = AgglomerativeClustering(
    n_clusters=6, connectivity=connectivity, linkage="ward"
).fit(X)
elapsed_time = time.time() - st
label = ward.labels_
print("Elapsed time: %.2fs" % elapsed_time)
print("Number of points: %i" % label.size)

# #############################################################################
# Plot result
fig = plt.figure()
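# Axes3D(fig) is not added to the figure automatically in recent matplotlib
# releases, so request the 3D axes from the figure instead.
ax = fig.add_subplot(111, projection="3d")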
ax.view_init(7, -80)
for l in np.unique(label):
    ax.scatter(
        X[label == l, 0],
        X[label == l, 1],
        X[label == l, 2],
        color=plt.cm.jet(float(l) / np.max(label + 1)),
        s=20,
        edgecolor="k",
    )
plt.title("With connectivity constraints (time %.2fs)" % elapsed_time)

plt.show()

Output:
[Figures: roll1, roll2 (the two 3D scatter plots produced by the script)]
