Basic Idea
Hierarchical Agglomerative Clustering (HAC) is a clustering algorithm that works well in practice. Its core idea is to start with every sample as its own cluster and then repeatedly merge the two closest clusters until some termination condition is met, for example that the current number of clusters has dropped to 20% of the initial number, i.e. 80% of the clusters have been merged away. In summary, HAC proceeds as follows.
(1) Treat every data point in the training set as its own cluster;
(2) Compute the distance between every pair of clusters and merge the two closest (most similar) ones;
(3) Repeat step (2) until the termination condition is met (a minimal sketch of this loop follows below).
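To make the loop concrete, here is a minimal single-link sketch in plain NumPy. The function name naive_hac and the O(n³) pair search are illustrative only; in practice you would call sklearn.cluster.AgglomerativeClustering instead.

import numpy as np

def naive_hac(X, n_clusters):
    # Start with every sample in its own cluster.
    clusters = [[i] for i in range(len(X))]
    # Precompute all pairwise Euclidean distances.
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    while len(clusters) > n_clusters:
        # Find the pair of clusters with the smallest single-link (MIN) distance.
        best = (np.inf, 0, 1)
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                dist = d[np.ix_(clusters[a], clusters[b])].min()
                if dist < best[0]:
                    best = (dist, a, b)
        _, a, b = best
        # Merge the pair; pop b (b > a) so index a stays valid.
        clusters[a].extend(clusters.pop(b))
    # Turn the membership lists into a label vector.
    labels = np.empty(len(X), dtype=int)
    for k, members in enumerate(clusters):
        labels[members] = k
    return labels

X = np.array([[1, 2], [1, 4], [1, 0], [4, 2], [4, 4], [4, 0]], dtype=float)
print(naive_hac(X, 2))  # [0 0 0 1 1 1]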
In this algorithm, the similarity between two clusters can be measured in the following four ways (a small computation illustrating each follows the list):
(1) Single-link: the distance between the two closest points of the two clusters, i.e. MIN;
(2) Complete-link: the distance between the two farthest points of the two clusters, i.e. MAX;
(3) Average-link: the average distance over all pairs of points across the two clusters, i.e. AVERAGE;
(4) Ward-link: the increase in the within-cluster sum of squared deviations caused by merging the two clusters.
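All four measures are easy to compute by hand for two small clusters; the arrays A and B below are made-up data used only for illustration.

import numpy as np

A = np.array([[1.0, 2.0], [1.0, 4.0]])
B = np.array([[4.0, 2.0], [4.0, 4.0], [4.0, 0.0]])

# All pairwise Euclidean distances between points of A and points of B.
pairwise = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=-1)

print("single (MIN):   ", pairwise.min())
print("complete (MAX): ", pairwise.max())
print("average:        ", pairwise.mean())

# Ward: increase in the total within-cluster sum of squared deviations
# after merging, ESS(A ∪ B) - ESS(A) - ESS(B).
def ess(C):
    return ((C - C.mean(axis=0)) ** 2).sum()

merged = np.vstack([A, B])
print("ward increment: ", ess(merged) - ess(A) - ess(B))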
API Reference
class sklearn.cluster.AgglomerativeClustering(
n_clusters=2,
*,
affinity='euclidean',
memory=None,
connectivity=None,
compute_full_tree='auto',
linkage='ward',
distance_threshold=None,
compute_distances=False
)
Parameter | Type | Description |
---|---|---|
n_clusters | int or None, default=2 | The number of clusters to find; exactly one of n_clusters and distance_threshold must be None |
affinity | str or callable, default='euclidean' | Metric used to compute the linkage; can be 'euclidean', 'manhattan', 'cosine', etc. If linkage is 'ward', only 'euclidean' is accepted |
memory | str or object with the joblib.Memory interface, default=None | Used to cache the output of the tree computation; a path to a caching directory may be given |
connectivity | array-like or callable, default=None | Connectivity matrix defining, for each sample, its neighboring samples according to a given structure of the data |
compute_full_tree | 'auto' or bool, default='auto' | Whether to build the full tree or stop early once n_clusters is reached; stopping early can reduce computation time when the number of clusters is not small compared to the number of samples |
linkage | {'ward', 'complete', 'average', 'single'}, default='ward' | Which linkage criterion (merge rule) to use; defaults to 'ward' |
distance_threshold | float, default=None | The linkage distance threshold at or above which clusters will not be merged; if not None, n_clusters must be None and compute_full_tree must be True (see the example after this table) |
compute_distances | bool, default=False | If True, compute distances between clusters even when distance_threshold is not used, which is useful for dendrogram visualization |
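As a small illustration of how n_clusters and distance_threshold interact, the snippet below cuts the tree at a fixed merge distance instead of asking for a fixed number of clusters (the threshold value 3.0 is arbitrary, chosen only for this toy data).

from sklearn.cluster import AgglomerativeClustering
import numpy as np

X = np.array([[1, 2], [1, 4], [1, 0], [4, 2], [4, 4], [4, 0]], dtype=float)

# Cut the tree at a linkage distance of 3.0; n_clusters must then be None,
# and the full tree is computed automatically.
model = AgglomerativeClustering(n_clusters=None, distance_threshold=3.0)
model.fit(X)
print(model.n_clusters_)  # number of clusters found below the threshold
print(model.labels_)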
Attribute | Type | Description |
---|---|---|
n_clusters_ | int | The number of clusters found by the algorithm |
labels_ | ndarray of shape (n_samples,) | Cluster label of each sample |
n_leaves_ | int | Number of leaves in the hierarchical tree |
n_connected_components_ | int | The estimated number of connected components in the graph |
n_features_in_ | int | Number of features seen during fit |
feature_names_in_ | ndarray of shape (n_features_in_,) | Names of features seen during fit |
children_ | array-like of shape (n_samples-1, 2) | The children of each non-leaf node (see the example after this table) |
distances_ | array-like of shape (n_nodes-1,) | Distances between the corresponding pairs of children in children_; only present when distance_threshold is used or compute_distances=True |
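The encoding of children_ is easiest to see on a tiny example (the three 1-D points below are made up): an entry smaller than n_samples is a leaf, i.e. a sample index, while an entry i >= n_samples refers to the cluster formed at merge step i - n_samples.

from sklearn.cluster import AgglomerativeClustering
import numpy as np

X = np.array([[0.0], [0.1], [5.0]])
model = AgglomerativeClustering(n_clusters=2, compute_distances=True).fit(X)

# Row i of children_ lists the two nodes merged at step i.
print(model.children_)   # [[0 1] [2 3]]: samples 0 and 1 merge first,
                         # then sample 2 merges with that cluster (node 3)
print(model.distances_)  # the corresponding merge distances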
Method | Description |
---|---|
fit(X[, y]) | Fit the hierarchical clustering from features, or distance matrix. |
fit_predict(X[, y]) | Fit and return the result of each sample's clustering assignment. |
get_params([deep]) | Get parameters for this estimator. |
set_params(**params) | Set the parameters of this estimator. |
Code Example
>>> from sklearn.cluster import AgglomerativeClustering
>>> import numpy as np
>>> X = np.array([[1, 2], [1, 4], [1, 0],
... [4, 2], [4, 4], [4, 0]])
>>> clustering = AgglomerativeClustering().fit(X)
>>> clustering
AgglomerativeClustering()
>>> clustering.labels_
array([1, 1, 1, 0, 0, 0])
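The same assignment can be obtained in a single call with fit_predict:
>>> AgglomerativeClustering().fit_predict(X)
array([1, 1, 1, 0, 0, 0])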
Worked Examples
test1.py
import numpy as np
from matplotlib import pyplot as plt
from scipy.cluster.hierarchy import dendrogram
from sklearn.datasets import load_iris
from sklearn.cluster import AgglomerativeClustering
def plot_dendrogram(model, **kwargs):
    # Create linkage matrix and then plot the dendrogram

    # create the counts of samples under each node
    counts = np.zeros(model.children_.shape[0])
    n_samples = len(model.labels_)
    for i, merge in enumerate(model.children_):
        current_count = 0
        for child_idx in merge:
            if child_idx < n_samples:
                current_count += 1  # leaf node
            else:
                current_count += counts[child_idx - n_samples]
        counts[i] = current_count

    linkage_matrix = np.column_stack(
        [model.children_, model.distances_, counts]
    ).astype(float)

    # Plot the corresponding dendrogram
    dendrogram(linkage_matrix, **kwargs)
iris = load_iris()
X = iris.data
# setting distance_threshold=0 ensures we compute the full tree.
model = AgglomerativeClustering(distance_threshold=0, n_clusters=None)
model = model.fit(X)
plt.title("Hierarchical Clustering Dendrogram")
# plot the top three levels of the dendrogram
plot_dendrogram(model, truncate_mode="level", p=3)
plt.xlabel("Number of points in node (or index of point if no parenthesis).")
plt.show()
Output: a dendrogram titled "Hierarchical Clustering Dendrogram" showing the top three levels of the hierarchy (figure omitted).
test2.py
import time
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_swiss_roll
# #############################################################################
# Generate data (swiss roll dataset)
n_samples = 1500
noise = 0.05
X, _ = make_swiss_roll(n_samples, noise=noise)
# Make it thinner
X[:, 1] *= 0.5
# #############################################################################
# Compute clustering
print("Compute unstructured hierarchical clustering...")
st = time.time()
ward = AgglomerativeClustering(n_clusters=6, linkage="ward").fit(X)
elapsed_time = time.time() - st
label = ward.labels_
print("Elapsed time: %.2fs" % elapsed_time)
print("Number of points: %i" % label.size)
# #############################################################################
# Plot result
fig = plt.figure()
# Axes3D(fig) no longer attaches itself to the figure in recent matplotlib,
# so create the 3D axes via add_subplot instead.
ax = fig.add_subplot(projection="3d")
ax.view_init(7, -80)
for l in np.unique(label):
    ax.scatter(
        X[label == l, 0],
        X[label == l, 1],
        X[label == l, 2],
        color=plt.cm.jet(float(l) / np.max(label + 1)),
        s=20,
        edgecolor="k",
    )
plt.title("Without connectivity constraints (time %.2fs)" % elapsed_time)
# #############################################################################
# Define the structure A of the data: here, the 10 nearest neighbors of each sample
from sklearn.neighbors import kneighbors_graph
connectivity = kneighbors_graph(X, n_neighbors=10, include_self=False)
# #############################################################################
# Compute clustering
print("Compute structured hierarchical clustering...")
st = time.time()
ward = AgglomerativeClustering(
    n_clusters=6, connectivity=connectivity, linkage="ward"
).fit(X)
elapsed_time = time.time() - st
label = ward.labels_
print("Elapsed time: %.2fs" % elapsed_time)
print("Number of points: %i" % label.size)
# #############################################################################
# Plot result
fig = plt.figure()
ax = fig.add_subplot(projection="3d")  # 3D axes, as above
ax.view_init(7, -80)
for l in np.unique(label):
    ax.scatter(
        X[label == l, 0],
        X[label == l, 1],
        X[label == l, 2],
        color=plt.cm.jet(float(l) / np.max(label + 1)),
        s=20,
        edgecolor="k",
    )
plt.title("With connectivity constraints (time %.2fs)" % elapsed_time)
plt.show()
Output: two 3D scatter plots of the clustered swiss roll, "Without connectivity constraints" and "With connectivity constraints" (figures omitted).