Caliński, Tadeusz, and Jerzy Harabasz. “A dendrite method for cluster analysis.” Communications in Statistics-theory and Methods 3.1 (1974): 1-27.
Formula and Overview
For a dataset $E$ of size $n_E$ partitioned into $k$ clusters, the CH (Calinski-Harabasz) index is computed as:
$$s = \frac{\mathrm{tr}(B_k)}{\mathrm{tr}(W_k)} \times \frac{n_E - k}{k - 1}$$
where $B_k$ is the between-cluster dispersion matrix and $W_k$ is the within-cluster dispersion matrix, defined as:
$$W_k = \sum_{q=1}^k \sum_{x \in C_q} (x - c_q) (x - c_q)^T$$
$$B_k = \sum_{q=1}^k n_q (c_q - c_E) (c_q - c_E)^T$$
In $W_k$: $C_q$ is the set of points in cluster $q$, and $c_q$ is the center of cluster $q$.
In $B_k$: $c_E$ is the center of the whole dataset $E$, and $n_q$ is the number of points in cluster $q$.
A higher value indicates a better clustering: the clusters are dense and well separated.
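To make the formula concrete, here is a minimal NumPy sketch that evaluates it directly. It relies on the fact that the trace of each dispersion matrix is a sum of squared Euclidean distances, so the matrices never need to be built. The function name `ch_index` and the toy data are assumptions made for illustration, not part of any library, and the sketch assumes $\mathrm{tr}(W_k) > 0$.

```python
import numpy as np

def ch_index(X, labels):
    """Evaluate s = tr(B_k) / tr(W_k) * (n_E - k) / (k - 1) from the definitions."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    n_E = X.shape[0]
    cluster_ids = np.unique(labels)
    k = cluster_ids.size
    if not 1 < k < n_E:
        raise ValueError("the index needs 2 <= k <= n_E - 1 clusters")

    c_E = X.mean(axis=0)  # center of the whole dataset E
    tr_W = 0.0            # trace of the within-cluster dispersion matrix
    tr_B = 0.0            # trace of the between-cluster dispersion matrix
    for q in cluster_ids:
        members = X[labels == q]     # points in cluster C_q
        c_q = members.mean(axis=0)   # center of cluster q
        n_q = members.shape[0]       # number of points in cluster q
        tr_W += ((members - c_q) ** 2).sum()
        tr_B += n_q * ((c_q - c_E) ** 2).sum()

    # Assumes tr_W > 0, i.e. not every point coincides with its cluster center.
    return float(tr_B / tr_W * (n_E - k) / (k - 1))

# Toy example: two tight clusters of two points each, far apart.
X_demo = [[0, 0], [0, 2], [10, 0], [10, 2]]
labels_demo = [0, 0, 1, 1]
demo_score = ch_index(X_demo, labels_demo)
print(demo_score)  # tr(B_k) = 100, tr(W_k) = 4, so s = 25 * 2 / 1 = 50.0
```

On the toy data, each point lies at distance 1 from its cluster center and each cluster center lies at distance 5 from the overall center, which gives the value 50.0 by hand as well.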
Code Implementation
import pandas as pd
import numpy as np
from sklearn import metrics
from sklearn.cluster import KMeans
# Random integer data: 200 samples with 10 features each
dataframe = pd.DataFrame(data=np.random.randint(0, 50, size=(200, 10)))
# Use KMeans clustering as an example
kmeans_model = KMeans(n_clusters=3, random_state=1).fit(dataframe)
labels = kmeans_model.labels_
score = metrics.calinski_harabasz_score(dataframe, labels)
print(score)
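Because a higher score indicates a better clustering, one common use of the index is to compare several candidate values of k on the same data. The sketch below is only an illustration under stated assumptions: the synthetic three-blob data, the candidate range 2 to 6, and the names `scores` and `best_k` are not from the original example.

```python
import numpy as np
from sklearn import metrics
from sklearn.cluster import KMeans

# Assumed synthetic data: three well-separated 2-D blobs of 50 points each.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(50, 2)) for c in centers])

# Fit KMeans for each candidate k and record the CH score of the result.
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=1).fit_predict(X)
    scores[k] = metrics.calinski_harabasz_score(X, labels)

# Pick the k with the highest score.
best_k = max(scores, key=scores.get)
for k, s in scores.items():
    print(k, round(s, 1))
print("best k:", best_k)
```

On data this cleanly separated, the score should peak at the true number of blobs. On real data the maximum is only a heuristic and is worth checking against other criteria.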
References
sklearn: https://scikit-learn.org/stable/modules/clustering.html#calinski-harabasz-index