KMeans
class pyspark.ml.clustering.KMeans(featuresCol='features', predictionCol='prediction', k=2, initMode='k-means||', initSteps=2, tol=0.0001, maxIter=20, seed=None, distanceMeasure='euclidean')
K-means clustering with a k-means++-like initialization mode (the k-means|| algorithm of Bahmani et al.).
The parameters are the same as those described in the BisectingKMeans.md document.
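To give an intuition for what the `initMode` parameter controls: k-means|| is a parallelized variant of k-means++ seeding, which picks each new initial center with probability proportional to its squared distance from the nearest center already chosen. The following is a minimal pure-Python sketch of that sequential k-means++ idea (not Spark's implementation; the function name and structure are illustrative only):

```python
import random

def kmeans_pp_init(points, k, rng=None):
    """Pick k initial centers, k-means++ style: the first at random, each
    subsequent one with probability proportional to its squared distance
    from the nearest center chosen so far."""
    rng = rng or random.Random(1)
    centers = [rng.choice(points)]
    while len(centers) < k:
        # squared distance from each point to its nearest chosen center
        d2 = [min((p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2 for c in centers)
              for p in points]
        r = rng.uniform(0, sum(d2))
        acc = 0.0
        for p, w in zip(points, d2):
            acc += w
            if acc >= r:
                centers.append(p)
                break
    return centers

points = [(0.0, 0.0), (1.0, 1.0), (9.0, 8.0), (8.0, 9.0)]
print(kmeans_pp_init(points, 2))
```

Far-away points get a large sampling weight, so the initial centers tend to be spread out; k-means|| (`initMode='k-means||'`) achieves the same effect in `initSteps` parallel rounds instead of k sequential picks.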
01. Create the dataset
from pyspark.sql import SparkSession
spark = SparkSession.builder.config("spark.driver.host","192.168.1.10")\
.config("spark.ui.showConsoleProgress","false").appName("KMeans")\
.master("local[*]").getOrCreate()
from pyspark.ml.linalg import Vectors
data = [(Vectors.dense([0.0, 0.0]),), (Vectors.dense([1.0, 1.0]),),
(Vectors.dense([9.0, 8.0]),), (Vectors.dense([8.0, 9.0]),)]
df = spark.createDataFrame(data, ["features"])
df.show()
df.printSchema()
Output:
+---------+
| features|
+---------+
|[0.0,0.0]|
|[1.0,1.0]|
|[9.0,8.0]|
|[8.0,9.0]|
+---------+
root
|-- features: vector (nullable = true)
02. Train a KMeans model on the data and transform the original DataFrame
from pyspark.ml.clustering import KMeans
kMeans = KMeans(k=2,seed=1)
model = kMeans.fit(df)
model.transform(df).show()
Output:
+---------+----------+
| features|prediction|
+---------+----------+
|[0.0,0.0]| 1|
|[1.0,1.0]| 1|
|[9.0,8.0]| 0|
|[8.0,9.0]| 0|
+---------+----------+
03. Inspect model attributes and summary information
print(model.hasSummary)
print(model.clusterCenters())
print(model.summary.clusterSizes)
Output:
True
[array([8.5, 8.5]), array([0.5, 0.5])]
[2, 2]
04. Construct test data and predict with the fitted model
testdata = spark.createDataFrame([
(Vectors.dense([8.2,8.2]),)
],["test"])
testdata.show()
# model.transform(testdata) would fail here: the model expects a "features" column
testdata = testdata.withColumnRenamed("test","features")
model.transform(testdata).show()
Output:
+---------+
| test|
+---------+
|[8.2,8.2]|
+---------+
+---------+----------+
| features|prediction|
+---------+----------+
|[8.2,8.2]| 0|
+---------+----------+