Synced from Buracag's blog.
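This section covers the clustering algorithms implemented in Spark ML. The demo algorithms include K-means, LDA, and Gaussian mixture models (GMM), among others.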
1. K-means
KMeans is implemented as an Estimator and generates a KMeansModel as the base model.
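To make the Estimator/Model split concrete, here is a minimal pure-Python sketch. It is not Spark's implementation: ToyKMeans, ToyKMeansModel, and the helper functions are hypothetical names invented for illustration, and the algorithm is plain Lloyd iteration with random initialization (Spark's KMeans defaults to k-means|| initialization). The point is only the shape of the API: the estimator holds parameters, fit() returns a separate model object, and the model's transform() adds a prediction to each point.

```python
import random


def _sq_dist(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def _nearest(point, centers):
    """Index of the center closest to `point` (ties go to the lower index)."""
    return min(range(len(centers)), key=lambda i: _sq_dist(point, centers[i]))


def _mean(group):
    """Component-wise mean of a non-empty list of tuples."""
    n = len(group)
    return tuple(sum(col) / n for col in zip(*group))


class ToyKMeansModel:
    """Fitted model (hypothetical): stores centers and assigns points to them."""

    def __init__(self, centers):
        self.centers = centers

    def cluster_centers(self):
        return self.centers

    def transform(self, points):
        # Pair each point with the index of its nearest center.
        return [(p, _nearest(p, self.centers)) for p in points]


class ToyKMeans:
    """Estimator (hypothetical): holds parameters only; fit() returns a model."""

    def __init__(self, k=2, seed=1, max_iter=20):
        self.k, self.seed, self.max_iter = k, seed, max_iter

    def fit(self, points):
        rng = random.Random(self.seed)
        centers = rng.sample(points, self.k)  # random initial centers
        for _ in range(self.max_iter):
            # Assignment step: group points by nearest center.
            groups = [[] for _ in range(self.k)]
            for p in points:
                groups[_nearest(p, centers)].append(p)
            # Update step: move each center to its group's mean;
            # an empty group keeps its previous center.
            new_centers = [_mean(g) if g else c for g, c in zip(groups, centers)]
            if new_centers == centers:
                break  # assignments are stable, so stop early
            centers = new_centers
        return ToyKMeansModel(centers)


# Toy data invented for this sketch: two well-separated groups.
points = [(0.0, 0.0), (0.1, 0.1), (0.2, 0.2), (9.0, 9.0), (9.1, 9.1), (9.2, 9.2)]
model = ToyKMeans(k=2, seed=1).fit(points)
for point, cluster in model.transform(points):
    print(point, "->", cluster)
```

In Spark the same roles are played by KMeans (parameters plus fit) and KMeansModel (clusterCenters plus transform), operating on DataFrames rather than Python lists.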
1.1 Input columns
| Param name | Type(s) | Default | Description |
|---|---|---|---|
| featuresCol | Vector | "features" | Feature vector |
1.2 Output columns
| Param name | Type(s) | Default | Description |
|---|---|---|---|
| predictionCol | Int | "prediction" | Predicted cluster center |
Example code:
# -*- coding: utf-8 -*-
# @Time : 2019/8/8 15:52
# @Author : buracagyang
# @File : kmeans_example.py
# @Software : PyCharm
"""
Describe:
"""
from __future__ import print_function
from pyspark.ml.clustering import KMeans
from pyspark.ml.evaluation import ClusteringEvaluator
from pyspark.sql import SparkSession
if __name__ == "__main__":
    spark = SparkSession.builder.appName("KMeansExample").getOrCreate()

    # Load the sample data in libsvm format
    dataset = spark.read.format("libsvm").load("../data/mllib/sample_kmeans_data.txt")

    # Fit a K-means model with k=2 and a fixed seed for reproducibility
    kmeans = KMeans().setK(2).setSeed(1)
    model = kmeans.fit(dataset)

    # Assign each row to a cluster (adds the "prediction" column)
    predictions = model.transform(dataset)

    # Evaluate the clustering by computing the Silhouette score
    evaluator = ClusteringEvaluator()
    silhouette = evaluator.evaluate(predictions)
    print("Silhouette with squared euclidean distance = " + str(silhouette))

    # Print the learned cluster centers
    centers = model.clusterCenters()
    print("Cluster Centers: ")
    for center in centers:
        print(center)

    spark.stop()
Output:
Silhouette with squared euclidea