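I need to cluster customer data that contains both categorical and numerical features. The numerical features are not on the same scale (age, income, ...). After scaling them with a standard scaler, I tried Mclust on the numerical data, but it gave me overlapping groups.
1- Should I normalize the data if the result of the standard scaler is not satisfactory?
2- What is the best way to do K-Prototype clustering?
3- Should the choice of clustering method depend on the data distribution?
I use pandas.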
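For context, the scaling step described above can be sketched roughly as follows. This is a minimal illustration, not the exact code used: it assumes that "standard scaler" refers to scikit-learn's `StandardScaler`, and the column names (`age`, `income`, `segment`) and values are made-up placeholders.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data; column names and values are placeholders.
customers = pd.DataFrame({
    "age": [23, 35, 47, 52, 61],
    "income": [28000, 54000, 61000, 90000, 43000],
    "segment": ["a", "b", "a", "c", "b"],  # categorical, left unscaled
})

numeric_cols = ["age", "income"]

# Standardize only the numeric columns to zero mean and unit variance.
scaler = StandardScaler()
scaled_numeric = pd.DataFrame(
    scaler.fit_transform(customers[numeric_cols]),
    columns=numeric_cols,
    index=customers.index,
)

# Per-column mean and (population) standard deviation after scaling.
print(scaled_numeric.mean())
print(scaled_numeric.std(ddof=0))
```

The categorical column is kept out of the scaler on purpose, since standardization only makes sense for the numerical features.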
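This is what I used:
# K-means clustering: search for K
import matplotlib.pyplot as plt
import numpy as np
from scipy.spatial import distance as sci_distance
from sklearn import cluster as sk_cluster
cdata = data  # `data` is defined earlier (not shown)
K = range(1, 10)
# Fit one KMeans model per candidate K (lists, so the results can be reused)
KM = [sk_cluster.KMeans(n_clusters=k).fit(cdata) for k in K]
centroids = [km.cluster_centers_ for km in KM]
# Distance from every point to every centroid, then to its nearest centroid
D_k = [sci_distance.cdist(cdata, cent, 'euclidean') for cent in centroids]
dist = [np.min(D, axis=1) for D in D_k]
# Average distance to the nearest centroid for each K
avgWithinSS = [sum(d) / cdata.shape[0] for d in dist]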
plt.plot(K, avgWithinSS, 'b*-')
plt.grid(True)
plt.xlabel('Number of clusters')
plt.ylabel('Aver