Clustering - spark.mllib
Clustering is an unsupervised learning problem in which the goal is to group subsets of entities together based on some notion of similarity. Clustering is often used for exploratory analysis and/or as a component of a hierarchical supervised learning pipeline (in which a distinct classifier or regression model is trained for each cluster).
The spark.mllib package supports the following models:
K-means
Gaussian mixture
Power iteration clustering (PIC)
Latent Dirichlet allocation (LDA)
Bisecting k-means
Streaming k-means
K-means
K-means is one of the most commonly used clustering algorithms, and it clusters data points into a predefined number of clusters. The spark.mllib implementation includes a parallelized variant of the k-means++ method called kmeans||.
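For intuition about the k-means++ seeding that kmeans|| parallelizes, here is a minimal single-machine NumPy sketch. The function name kmeans_pp_seed and the toy data are illustrative only, not the spark.mllib implementation:

```python
import numpy as np

def kmeans_pp_seed(points, k, rng):
    """k-means++ seeding: each new center is drawn with probability
    proportional to its squared distance from the nearest center
    chosen so far, which spreads the initial centers apart."""
    centers = [points[rng.integers(len(points))]]
    for _ in range(k - 1):
        c = np.array(centers)
        # Squared distance from each point to its nearest chosen center.
        d2 = ((points[:, None, :] - c[None, :, :]) ** 2).sum(-1).min(axis=1)
        centers.append(points[rng.choice(len(points), p=d2 / d2.sum())])
    return np.array(centers)

rng = np.random.default_rng(0)
# Two well-separated blobs of 20 points each.
pts = np.vstack([rng.normal(0, 0.1, (20, 2)), rng.normal(5, 0.1, (20, 2))])
centers = kmeans_pp_seed(pts, 2, rng)
```

Because the sampling is distance-weighted, the two seeds almost always land in different blobs, which is exactly why this initialization tends to beat purely random seeding.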
The spark.mllib implementation has the following parameters:
k is the desired number of clusters.
maxIterations is the maximum number of iterations to run.
initializationMode specifies either random initialization or initialization via kmeans||.
runs is the number of times to run the k-means algorithm (k-means is not guaranteed to find a globally optimal solution, and when run multiple times on a given dataset, the algorithm returns the best clustering result found).
initializationSteps determines the number of steps in the kmeans|| algorithm.
epsilon determines the distance threshold within which we consider k-means to have converged.
initialModel is an optional set of cluster centers used for initialization. If this parameter is supplied, only one run is performed.
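To make the roles of maxIterations and epsilon concrete, here is a minimal single-machine NumPy sketch of the underlying Lloyd iteration. The function name lloyd and the toy data are illustrative, not part of spark.mllib:

```python
import numpy as np

def lloyd(points, centers, epsilon=1e-4, maxIterations=100):
    """Plain Lloyd iteration: stop when the total squared movement
    of the centers drops below epsilon, or after maxIterations."""
    centers = centers.copy()
    for _ in range(maxIterations):
        # Assignment step: index of the nearest center for each point.
        d2 = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = d2.argmin(axis=1)
        # Update step: each center becomes the mean of its points.
        new_centers = np.array([points[labels == j].mean(axis=0)
                                for j in range(len(centers))])
        moved = ((centers - new_centers) ** 2).sum()
        centers = new_centers
        if moved < epsilon:  # the epsilon convergence test
            break
    return centers, labels

pts = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
centers, labels = lloyd(pts, pts[[0, 2]])
```

On this toy data the centers converge to the two blob means after two passes; the distributed script below implements the same loop with RDD operations.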
from __future__ import print_function
import sys
import numpy as np
from pyspark.sql import SparkSession
def parseVector(line):
    return np.array([float(x) for x in line.split(',')])
def closestPoint(p, centers):
    bestIndex = 0
    closest = float("+inf")
    for i in range(len(centers)):
        tempDist = np.sum((p - centers[i]) ** 2)
        if tempDist < closest:
            closest = tempDist
            bestIndex = i
    return bestIndex
if __name__ == "__main__":
    if len(sys.argv) != 3:
        print("Usage: kmeans <k> <convergeDist>", file=sys.stderr)
        exit(-1)

    print("""WARN: This is a naive implementation of KMeans Clustering and is given
    as an example! Please refer to examples/src/main/python/ml/kmeans_example.py for an
    example on how to use ML's KMeans implementation.""", file=sys.stderr)

    spark = SparkSession\
        .builder\
        .appName("PythonKMeans")\
        .getOrCreate()

    # The input path is hard-coded here rather than read from sys.argv.
    inputFile = "hdfs://node1:8020/mv_training/training_set.txt"
    lines = spark.read.text(inputFile).rdd.map(lambda r: r[0])
    # Parse each line into a NumPy array so the distance arithmetic works.
    data = lines.map(parseVector).cache()

    K = int(sys.argv[1])
    convergeDist = float(sys.argv[2])

    kPoints = data.takeSample(False, K, 1)
    tempDist = 1.0
    while tempDist > convergeDist:
        closest = data.map(
            lambda p: (closestPoint(p, kPoints), (p, 1)))
        pointStats = closest.reduceByKey(
            lambda p1_c1, p2_c2: (p1_c1[0] + p2_c2[0], p1_c1[1] + p2_c2[1]))
        newPoints = pointStats.map(
            lambda st: (st[0], st[1][0] / st[1][1])).collect()

        tempDist = sum(np.sum((kPoints[iK] - p) ** 2) for (iK, p) in newPoints)

        for (iK, p) in newPoints:
            kPoints[iK] = p
    print("Final centers: " + str(kPoints))

    # Assign each point to its nearest final center and save the result.
    closest = data.map(lambda p: (p, closestPoint(p, kPoints)))
    closest.repartition(1).saveAsTextFile("file:///root/kmeans_result")

    spark.stop()
The final cluster centers are kPoints; closest holds each row of data together with its assigned cluster index, and is saved to a file.
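Because k-means only finds a local optimum (the motivation for the runs parameter above), different clusterings are usually compared by their within-set sum of squared errors. Below is a minimal NumPy sketch of that cost on hypothetical data; in spark.mllib, KMeansModel exposes the same quantity via computeCost:

```python
import numpy as np

def wssse(points, centers):
    """Within-Set Sum of Squared Errors: squared distance from each
    point to its nearest center, summed over all points."""
    d2 = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    return d2.min(axis=1).sum()

pts = np.array([[0.0, 0.0], [2.0, 0.0], [10.0, 0.0], [12.0, 0.0]])
good = np.array([[1.0, 0.0], [11.0, 0.0]])  # centers at the blob means
bad = np.array([[0.0, 0.0], [12.0, 0.0]])   # centers at the blob edges
```

A lower WSSSE indicates a tighter clustering; here wssse(pts, good) evaluates to 4.0 versus 8.0 for wssse(pts, bad).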
Official documentation: Clustering - spark.mllib