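spark_ML_Clustering
MLlib supports relatively few clustering models. The main ones are K-means, the Gaussian mixture model (GMM), bisecting K-means, and latent Dirichlet allocation (LDA).
1, K-means clustering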
from pyspark.ml.clustering import KMeans
from pyspark.ml.evaluation import ClusteringEvaluator
# Load the data (assumes an active SparkSession named `spark`)
dfdata = spark.read.format("libsvm").load("data/sample_kmeans_data.txt")
# Train the KMeans model
kmeans = KMeans().setK(2).setSeed(1)
model = kmeans.fit(dfdata)
# Make predictions
dfpredictions = model.transform(dfdata)
# Evaluate the clustering with the silhouette score
evaluator = ClusteringEvaluator()
silhouette = evaluator.evaluate(dfpredictions)
print("Silhouette with squared euclidean distance = " + str(silhouette))
# Print the cluster centers
centers = model.clusterCenters()
print("Cluster Centers: ")
for center in centers:
    print(center)
Silhouette with squared euclidean distance = 0.9997530305375207
Cluster Centers:
[9.1 9.1 9.1]
[0.1 0.1 0.1]
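To make the silhouette score above less of a black box, here is a minimal pure-Python sketch of the computation using squared Euclidean distance. It is not Spark's implementation. The six points are an assumption about the contents of sample_kmeans_data.txt (chosen to be consistent with the printed cluster centers), and the function name `silhouette_score` is hypothetical.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def silhouette_score(points, labels):
    """Mean silhouette over all points; assumes at least two clusters."""
    scores = []
    for i, p in enumerate(points):
        # Collect squared distances from p to every other point, per cluster.
        by_cluster = {}
        for j, q in enumerate(points):
            if i != j:
                by_cluster.setdefault(labels[j], []).append(sq_dist(p, q))
        own = by_cluster.pop(labels[i], [])
        if not own:
            # Singleton cluster: the usual convention is a score of 0.
            scores.append(0.0)
            continue
        a = sum(own) / len(own)                                   # cohesion
        b = min(sum(d) / len(d) for d in by_cluster.values())     # separation
        scores.append(0.0 if max(a, b) == 0 else (b - a) / max(a, b))
    return sum(scores) / len(scores)


# Assumed sample data: two tight groups near 0.1 and 9.1 in three dimensions.
points = [(0.0, 0.0, 0.0), (0.1, 0.1, 0.1), (0.2, 0.2, 0.2),
          (9.0, 9.0, 9.0), (9.1, 9.1, 9.1), (9.2, 9.2, 9.2)]
labels = [0, 0, 0, 1, 1, 1]
score = silhouette_score(points, labels)
print("Silhouette =", score)
```

On these assumed points the result is about 0.99975, in line with the value printed by ClusteringEvaluator. A score close to 1 means each point is far nearer to its own cluster than to the other one.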
2, Gaussian mixture model
from pyspark.ml.clustering import GaussianMixture
dfdata = spark.read.format("libsvm").load("data/sample_kmeans_data.txt")
gmm = GaussianMixture().setK(2).setSeed(538009335)
model = gmm.fit(dfdata)
print("Gaussians shown as a DataFrame: ")
model.gaussiansDF.show(truncate=True)
Gaussians shown as a DataFrame:
+--------------------+--------------------+
|                mean|                 cov|
+--------------------+--------------------+
|[0.10000000000001...|0.006666666666806...|
|[9.09999999999998...|0.006666666666812...|
+--------------------+--------------------+
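3, Bisecting k-means
Bisecting k-means is a top-down (divisive) hierarchical clustering algorithm. All sample points start in a single cluster, which is then repeatedly split in two with K-means until the desired number of clusters is reached.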
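The splitting procedure described above can be sketched in plain Python. This is a simplified illustration, not Spark's implementation: it always splits the cluster with the largest sum of squared errors, which is one common choice, and the helper names (`two_means`, `bisecting_kmeans`) and the sample points are assumptions made for the example.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(col) / n for col in zip(*points))


def sse(points):
    """Sum of squared distances from each point to the cluster mean."""
    c = mean(points)
    return sum(sq_dist(p, c) for p in points)


def two_means(points, rng, iters=20):
    """Split one cluster into two using ordinary k-means with k=2."""
    centers = rng.sample(points, 2)
    groups = ([], [])
    for _ in range(iters):
        groups = ([], [])
        for p in points:
            idx = 0 if sq_dist(p, centers[0]) <= sq_dist(p, centers[1]) else 1
            groups[idx].append(p)
        if not groups[0] or not groups[1]:
            break  # degenerate split, e.g. duplicate initial centers
        centers = [mean(g) for g in groups]
    return [g for g in groups if g]


def bisecting_kmeans(points, k, seed=1):
    """Start with one cluster and bisect until there are k clusters."""
    rng = random.Random(seed)
    clusters = [list(points)]
    while len(clusters) < k:
        candidates = [c for c in clusters if len(c) > 1]
        if not candidates:
            break  # nothing left that can be split
        target = max(candidates, key=sse)  # split the "worst" cluster
        halves = two_means(target, rng)
        if len(halves) < 2:
            break
        clusters.remove(target)
        clusters.extend(halves)
    return clusters


# Assumed sample data: two tight groups near 0.1 and 9.1 in three dimensions.
points = [(0.0, 0.0, 0.0), (0.1, 0.1, 0.1), (0.2, 0.2, 0.2),
          (9.0, 9.0, 9.0), (9.1, 9.1, 9.1), (9.2, 9.2, 9.2)]
clusters = bisecting_kmeans(points, k=2)
for cluster in clusters:
    print(mean(cluster), cluster)
```

Because every split is an ordinary two-way k-means, the result is a binary tree of clusters. Stopping at k leaves simply cuts that tree at the desired size.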
from pyspark.ml.clustering import BisectingKMeans
dfdata = spark.read.format("libsvm").load("data/sample_kmeans_data.txt")
bkm = BisectingKMeans().setK(2).setSeed(1)
model = bkm.fit(dfdata)
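# Within-set sum of squared errors (WSSSE).
# Note: computeCost is deprecated as of Spark 3.0; ClusteringEvaluator is the suggested replacement.
cost = model.computeCost(dfdata)
print("Within Set Sum of Squared Errors = " + str(cost))
# Print the cluster centers
print("Cluster Centers: ")
centers = model.clusterCenters()
for center in centers:
    print(center)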
Within Set Sum of Squared Errors = 0.11999999999994547
Cluster Centers:
[0.1 0.1 0.1]
[9.1 9.1 9.1]
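The "Within Set Sum of Squared Errors" reported above is the sum, over all points, of the squared distance to the nearest cluster center. A minimal sketch of that calculation follows. The six points are an assumption about the sample file's contents, and `wssse` is a hypothetical helper name, not a Spark API.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def wssse(points, centers):
    """Sum of squared distances from each point to its nearest center."""
    return sum(min(sq_dist(p, c) for c in centers) for p in points)


# Assumed sample data and the centers printed by the model above.
points = [(0.0, 0.0, 0.0), (0.1, 0.1, 0.1), (0.2, 0.2, 0.2),
          (9.0, 9.0, 9.0), (9.1, 9.1, 9.1), (9.2, 9.2, 9.2)]
centers = [(0.1, 0.1, 0.1), (9.1, 9.1, 9.1)]
cost = wssse(points, centers)
print("WSSSE =", cost)
```

Each cluster contributes roughly 0.03 + 0 + 0.03, so the total comes to about 0.12, consistent with the value printed by the model. Lower is tighter, but the cost always shrinks as k grows, so it is best compared across runs with the same k.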