1. Initial data preparation. Adapt this part to your own data; it is included here only to keep the workflow complete.
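import pandas as pd
from sklearn import preprocessing
# Columns used as clustering features.
predictors = ['Birth_Rate','Death_Rate']
# Standardize each column to zero mean and unit variance (Province is assumed to be an existing DataFrame).
X = preprocessing.scale(Province[predictors])
X = pd.DataFrame(X)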
2. Iterate over different parameter values
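import numpy as np
from sklearn import cluster

res = []
# Grid over the neighborhood radius (eps) and the minimum neighbor count (min_samples).
for eps in np.arange(0.001, 1, 0.05):
    for min_samples in range(2, 10):
        dbscan = cluster.DBSCAN(eps=eps, min_samples=min_samples)
        dbscan.fit(X)
        # Number of clusters, excluding the noise label -1.
        n_clusters = len([i for i in set(dbscan.labels_) if i != -1])
        # Number of points labelled as noise (outliers).
        outliners = np.sum(np.where(dbscan.labels_ == -1, 1, 0))
        # Sizes of the clusters (noise excluded), stored as a string.
        stats = str(pd.Series([i for i in dbscan.labels_ if i != -1]).value_counts().values)
        res.append({'eps': eps, 'min_samples': min_samples, 'n_clusters': n_clusters, 'outliners': outliners, 'stats': stats})
df = pd.DataFrame(res)
# Show the parameter combinations that produce exactly 3 clusters.
df.loc[df.n_clusters == 3, :]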

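- eps: the radius sits at an abrupt change point
- min_samples: the chosen minimum sample count also sits at a sharp peak
- n_clusters: the number of clusters shows no clear advantage
- outliners: the number of outliers is at an abrupt change
- stats: samples are distributed fairly evenly across the clusters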
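
One way to make the "abrupt change" observation concrete is to look for the largest jump in the outlier count between neighboring eps values at a fixed min_samples. The sketch below is only an illustration: `largest_outlier_jump` is a hypothetical helper that is not part of the original notes, and `demo_rows` is made-up data shaped like the entries of `res`.

```python
# Hypothetical helper, not part of the original notes: locate the largest
# change in outlier count between consecutive eps values.
def largest_outlier_jump(rows, min_samples):
    """Return (eps_before, eps_after, change) for the biggest change in
    'outliners' between neighboring eps values at a fixed min_samples."""
    # Keep only the rows for the requested min_samples, ordered by eps.
    subset = sorted(
        (r for r in rows if r['min_samples'] == min_samples),
        key=lambda r: r['eps'],
    )
    if len(subset) < 2:
        return None  # nothing to compare
    best = None
    for prev, curr in zip(subset, subset[1:]):
        change = curr['outliners'] - prev['outliners']
        # Compare by magnitude so that drops and rises are both considered.
        if best is None or abs(change) > abs(best[2]):
            best = (prev['eps'], curr['eps'], change)
    return best


# Made-up rows shaped like the entries of `res`, for illustration only.
demo_rows = [
    {'eps': 0.1, 'min_samples': 3, 'n_clusters': 5, 'outliners': 20},
    {'eps': 0.2, 'min_samples': 3, 'n_clusters': 4, 'outliners': 18},
    {'eps': 0.3, 'min_samples': 3, 'n_clusters': 3, 'outliners': 6},
    {'eps': 0.4, 'min_samples': 3, 'n_clusters': 3, 'outliners': 5},
    {'eps': 0.3, 'min_samples': 5, 'n_clusters': 2, 'outliners': 9},
]
jump = largest_outlier_jump(demo_rows, min_samples=3)
print(jump)
```

With the real grid search results, the same call would be `largest_outlier_jump(res, min_samples=3)`, since `res` is already a list of dictionaries with these keys. A large jump only flags a candidate eps; whether it is a good choice still depends on the cluster count and size distribution listed above.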