聚类
K-means
k均值是最常用的聚类算法之一,它将数据点聚集成预定数量的聚类。 MLlib实现包括称为kmeans ||的k-means ++方法的并行变体。
Means被实现为一个估计器,并生成一个KMeansModel作为基础模型。
from pyspark.ml.clustering import KMeans
from pyspark.ml.evaluation import ClusteringEvaluator
# Loads data.
dataset = spark.read.format("libsvm").load("data/mllib/sample_kmeans_data.txt")
# Trains a k-means model.
kmeans = KMeans().setK(2).setSeed(1)
model = kmeans.fit(dataset)
# Make predictions
predictions = model.transform(dataset)
# Evaluate clustering by computing Silhouette score
evaluator = ClusteringEvaluator()
silhouette =