【Spark】聚类分析

同步于Buracag的博客

本节主要讲Spark ML中关于聚类算法的实现。示例的算法Demo包含:K-means、LDA、高斯混合模型(GMM)等。

1. K-means

KMeans作为Estimator实现,并生成KMeansModel作为基本模型。

1.1 输入

Param nameType(s)DefaultDescription
featuresColVector“features”Feature vector

1.2 输出

Param nameType(s)DefaultDescription
predictionColInt“prediction”Predicted cluster center

示例代码如下:

# -*- coding: utf-8 -*-
# @Time     : 2019/8/8 15:52
# @Author   : buracagyang
# @File     : kmeans_example.py
# @Software : PyCharm

"""
Describe:
        
"""

from __future__ import print_function
from pyspark.ml.clustering import KMeans
from pyspark.ml.evaluation import ClusteringEvaluator
from pyspark.sql import SparkSession


if __name__ == "__main__":
    spark = SparkSession.builder.appName("KMeansExample").getOrCreate()

    dataset = spark.read.format("libsvm").load("../data/mllib/sample_kmeans_data.txt")

    kmeans = KMeans().setK(2).setSeed(1)
    model = kmeans.fit(dataset)

    predictions = model.transform(dataset)

    # 通过计算Silhouette得分来评估聚类
    evaluator = ClusteringEvaluator()
    silhouette = evaluator.evaluate(predictions)
    print("Silhouette with squared euclidean distance = " + str(silhouette))

    centers = model.clusterCenters()
    print("Cluster Centers: ")
    for center in centers:
        print(center)

    spark.stop()

结果如下:

Silhouette with squared euclidean distance = 0.999753030538
Cluster Centers: 
[0.1 0.1 0.1]
[9.1 9.1 9.1]

2. 隐狄利克雷分布(Latent Dirichlet Allocation, LDA)

LDA实现支持EMLDAOptimizer和OnlineLDAOptimizer的Estimator,并生成LDAModel作为基本模型。 如果需要,用户可以将EMLDAOptimizer生成的LDAModel转换为DistributedLDAModel。

示例代码如下:

# -*- coding: utf-8 -*-
# @Time     : 2019/8/8 15:57
# @Author   : buracagyang
# @File     : lda_example.py
# @Software : PyCharm

"""
Describe:
        
"""

from __future__ import print_function
from pyspark.ml.clustering import LDA
from pyspark.sql import SparkSession


if __name__ == "__main__":
    spark = SparkSession.builder.appName("LDAExample").getOrCreate()

    dataset = spark.read.format("libsvm").load("../data/mllib/sample_lda_libsvm_data.txt")

    lda = LDA(k=10, maxIter=10)
    model = lda.fit(dataset)

    ll = model.logLikelihood(dataset)
    lp = model.logPerplexity(dataset)
    print("The lower bound on the log likelihood of the entire corpus: " + str(ll))
    print("The upper bound on perplexity: " + str(lp))

    topics = model.describeTopics(3)
    print("The topics described by their top-weighted terms:")
    topics.show(truncate=False)

    transformed = model.transform(dataset)
    transformed.show(truncate=False)

    spark.stop()

结果如下:

The lower bound on the log likelihood of the entire corpus: -819.009950373
The upper bound on perplexity: 3.15003826823
The topics described by their top-weighted terms:
+-----+-----------+---------------------------------------------------------------+
|topic|termIndices|termWeights                                                    |
+-----+-----------+---------------------------------------------------------------+
|0    |[5, 0, 4]  |[0.15844458913187548, 0.14007374754187465, 0.13600455491609725]|
|1    |[4, 10, 1] |[0.10928579282605518, 0.10301239992895456, 0.10233720065679922]|
|2    |[7, 2, 8]  |[0.09993621815412573, 0.0995604576055824, 0.09839140083978774] |
|3    |[10, 6, 9] |[0.22028286673724704, 0.14180332771724063, 0.10480970083316132]|
|4    |[3, 1, 7]  |[0.1069290730844232, 0.09913915882531774, 0.09829708091766262] |
|5    |[8, 4, 3]  |[0.10062018802985315, 0.10039557022704547, 0.09964881942009583]|
|6    |[7, 6, 8]  |[0.10241014766676104, 0.10114682616315203, 0.09877798196420218]|
|7    |[3, 10, 4] |[0.23627099191080478, 0.11550793060134483, 0.09113132802908004]|
|8    |[2, 4, 10] |[0.11417002337049241, 0.09981723889288864, 0.09638496973844993]|
|9    |[1, 5, 3]  |[0.11538963974318006, 0.10464760125021952, 0.09761099598591011]|
+-----+-----------+---------------------------------------------------------------+

3. 二分K-means(Bisecting K-means)

二分k-means是一种使用分裂(或“自上而下”)方法的层次聚类:首先将所有点作为一个簇, 然后将该簇一分为二,递归地执行拆分。二分K-means通常比常规K-means快得多,但它通常会产生不同的聚类。

BisectingKMeans作为Estimator实现,并生成BisectingKMeansModel作为基本模型。

示例代码如下:

# -*- coding: utf-8 -*-
# @Time     : 2019/8/8 16:12
# @Author   : buracagyang
# @File     : bisecting_k_means_example.py
# @Software : PyCharm

"""
Describe:
        
"""

from __future__ import print_function
from pyspark.ml.clustering import BisectingKMeans
from pyspark.sql import SparkSession


if __name__ == "__main__":
    spark = SparkSession.builder.appName("BisectingKMeansExample").getOrCreate()

    dataset = spark.read.format("libsvm").load("../data/mllib/sample_kmeans_data.txt")
    bkm = BisectingKMeans().setK(2).setSeed(1)
    model = bkm.fit(dataset)

    cost = model.computeCost(dataset)
    print("Within Set Sum of Squared Errors = " + str(cost))

    print("Cluster Centers: ")
    centers = model.clusterCenters()
    for center in centers:
        print(center)

    spark.stop()

结果如下:

Within Set Sum of Squared Errors = 0.12
Cluster Centers: 
[0.1 0.1 0.1]
[9.1 9.1 9.1]

4. 混合高斯模型(Gaussian Mixture Model, GMM)

高斯混合模型表示复合分布,其中从k个高斯子分布中的一个绘制点,每个子分布具有其自己的概率。 spark.ml实现使用期望最大化算法求解最大似然模型。

4.1 输入

Param nameType(s)DefaultDescription
featuresColVector“features”Feature vector

4.2 输出

Param nameType(s)DefaultDescription
predictionColInt“prediction”Predicted cluster center
probabilityColVector“probability”Probability of each cluster

示例代码如下:

# -*- coding: utf-8 -*-
# @Time     : 2019/8/8 16:21
# @Author   : buracagyang
# @File     : gaussian_mixture_example.py
# @Software : PyCharm

"""
Describe:
        
"""

from __future__ import print_function
from pyspark.ml.clustering import GaussianMixture
from pyspark.sql import SparkSession


if __name__ == "__main__":
    spark = SparkSession.builder.appName("GaussianMixtureExample").getOrCreate()

    dataset = spark.read.format("libsvm").load("../data/mllib/sample_kmeans_data.txt")

    gmm = GaussianMixture().setK(2).setSeed(1)
    model = gmm.fit(dataset)

    print("Gaussians shown as a DataFrame: ")
    model.gaussiansDF.show(truncate=False)

    spark.stop()

结果如下:

Gaussians shown as a DataFrame: 
+-------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|mean                                                         |cov                                                                                                                                                                                                     |
+-------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|[0.10000000000001552,0.10000000000001552,0.10000000000001552]|0.006666666666806454  0.006666666666806454  0.006666666666806454  
0.006666666666806454  0.006666666666806454  0.006666666666806454  
0.006666666666806454  0.006666666666806454  0.006666666666806454  |
|[9.099999999999984,9.099999999999984,9.099999999999984]      |0.006666666666812185  0.006666666666812185  0.006666666666812185  
0.006666666666812185  0.006666666666812185  0.006666666666812185  
0.006666666666812185  0.006666666666812185  0.006666666666812185  |
+-------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+


相关系列:

【Spark】Pipelines

【Spark】特征工程1-Extractors

【Spark】特征工程2-Transformers

【Spark】分类和回归算法-分类

【Spark】分类和回归算法-回归

【Spark】聚类分析

【Spark】协同过滤

【Spark】频繁项集挖掘

【Spark】模型选择和调优

  • 2
    点赞
  • 7
    收藏
    觉得还不错? 一键收藏
  • 0
    评论

“相关推荐”对你有帮助么?

  • 非常没帮助
  • 没帮助
  • 一般
  • 有帮助
  • 非常有帮助
提交
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值