PySpark Machine Learning (4): KMeans and GMM

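This post implements two classic clustering algorithms, KMeans (K-means) and GMM (Gaussian mixture model), in a PySpark environment. The code is shown below: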

1. KMeans implementation:

%pyspark

from pyspark.ml.clustering import KMeans
from pyspark.ml.feature import StringIndexer
from pyspark.ml.linalg import Vectors
from pyspark.sql import Row  # Row is used below to build the training DataFrame

# 1. Load the data and build the training set
df = spark.sql("""select * from XXX""")

# The last column is the label; the 7 columns before it are the features
trainingSet = df.rdd.map(list).map(lambda x: Row(label=x[-1], features=Vectors.dense(x[-8:-1]))).toDF()
# trainingSet.show()

# 2. Train the KMeans model
# Use a separate name for the instance so the StringIndexer class is not shadowed
indexer = StringIndexer(inputCol='label', outputCol='indexed')
si_model = indexer.fit(trainingSet)
td = si_model.transform(trainingSet)

kmeans = KMeans(k=2, seed=1)
model = kmeans.fit(td)

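print(model.clusterCenters())
print(model.summary.k, model.summary.clusterSizes)
# Note: computeCost is deprecated since Spark 2.4 and removed in Spark 3.0
print(model.computeCost(td))

result = model.transform(td)
result.show()  # show() prints the rows itself and returns None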

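total_amount = result.count()
# Cluster ids are arbitrary, so cluster 0 need not correspond to label index 0;
# this comparison is meaningful only if the two numberings happen to agree
correct_amount = result.filter(result.indexed == result.prediction).count()
precision_rate = correct_amount / total_amount
print("Clustering accuracy: {}".format(precision_rate))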

# StringIndexer gives index 0 to the most frequent label; "positive" below means index 0
positive_amount = result.filter(result.indexed == 0).count()
negative_amount = result.filter(result.indexed == 1).count()

print("Positive samples: {}, negative samples: {}".format(positive_amount, negative_amount))

positive_precision_amount = result.filter(result.indexed == 0).filter(result.prediction == 0).count()
negative_precision_amount = result.filter(result.indexed == 1).filter(result.prediction == 1).count()
print("Correctly clustered positive samples: {}, correctly clustered negative samples: {}".format(positive_precision_amount, negative_precision_amount))

positive_clustering_rate = positive_precision_amount / positive_amount
negative_clustering_rate = negative_precision_amount / negative_amount
print("Positive clustering accuracy: {}, negative clustering accuracy: {}".format(positive_clustering_rate, negative_clustering_rate))
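
One caveat about the accuracy computed above: KMeans numbers its clusters arbitrarily, so comparing `indexed` with `prediction` directly can report a very low score even when the clusters separate the classes well. The following is a minimal sketch in plain Python (not part of the original post) that takes the best accuracy over every possible renaming of cluster ids. The function name `best_mapping_accuracy` and the sample pairs are hypothetical; in practice the pairs might be collected from `result.select("indexed", "prediction")`.

```python
from collections import Counter
from itertools import permutations


def best_mapping_accuracy(pairs, k):
    """Return the highest accuracy over all cluster-id -> label-index mappings.

    pairs: iterable of (label_index, cluster_id) tuples, both in range(k).
    """
    counts = Counter(pairs)            # (label, cluster) -> number of rows
    total = sum(counts.values())
    best = 0
    # Try every way of renaming cluster ids to label indices (fine for small k).
    for mapping in permutations(range(k)):
        # mapping[c] is the label index assigned to cluster c
        correct = sum(n for (label, cluster), n in counts.items()
                      if mapping[cluster] == label)
        best = max(best, correct)
    return best / total


# Hypothetical (indexed, prediction) pairs for illustration only.
sample = [(0, 1), (0, 1), (0, 1), (1, 0), (1, 0), (0, 0)]
naive = sum(1 for label, cluster in sample if label == cluster) / len(sample)
print(naive, best_mapping_accuracy(sample, k=2))
```

For k=2 this amounts to taking the larger of the naive accuracy and one minus it. Enumerating permutations only scales to small k.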


2. GMM implementation:

%pyspark

from pyspark.ml.clustering import GaussianMixture
from pyspark.ml.feature import StringIndexer
from pyspark.ml.linalg import Vectors
from pyspark.sql import Row  # Row is used below to build the training DataFrame

# 1. Load the data and build the training set
df = spark.sql("""select * from XXX""")
# The last column is the label; the 7 columns before it are the features
trainingSet = df.rdd.map(list).map(lambda x: Row(label=x[-1], features=Vectors.dense(x[-8:-1]))).toDF()
# trainingSet.show()

# 2. Train the GaussianMixture model
# Use a separate name for the instance so the StringIndexer class is not shadowed
indexer = StringIndexer(inputCol='label', outputCol='indexed')
si_model = indexer.fit(trainingSet)
td = si_model.transform(trainingSet)

gm = GaussianMixture(k=2,tol=0.0001,maxIter=10,seed=10)
gModel = gm.fit(td)
print(gModel.weights)
print(gModel.summary.k,gModel.summary.clusterSizes)
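# The fitted model is named gModel in this cell, not model
# print(gModel.gaussiansDF.select("mean").head())
# print(gModel.gaussiansDF.select("cov").head())

result = gModel.transform(td)
result.show()  # show() prints the rows itself and returns None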

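total_amount = result.count()
# Component ids are arbitrary, so component 0 need not correspond to label index 0;
# this comparison is meaningful only if the two numberings happen to agree
correct_amount = result.filter(result.indexed == result.prediction).count()
precision_rate = correct_amount / total_amount
print("Clustering accuracy: {}".format(precision_rate))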

# StringIndexer gives index 0 to the most frequent label; "positive" below means index 0
positive_amount = result.filter(result.indexed == 0).count()
negative_amount = result.filter(result.indexed == 1).count()

print("Positive samples: {}, negative samples: {}".format(positive_amount, negative_amount))

positive_precision_amount = result.filter(result.indexed == 0).filter(result.prediction == 0).count()
negative_precision_amount = result.filter(result.indexed == 1).filter(result.prediction == 1).count()
print("Correctly clustered positive samples: {}, correctly clustered negative samples: {}".format(positive_precision_amount, negative_precision_amount))

positive_clustering_rate = positive_precision_amount / positive_amount
negative_clustering_rate = negative_precision_amount / negative_amount
print("Positive clustering accuracy: {}, negative clustering accuracy: {}".format(positive_clustering_rate, negative_clustering_rate))
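
Unlike KMeans, a Gaussian mixture assigns each point a probability for every component; the transformed DataFrame carries these in a `probability` column next to `prediction`. As a rough illustration of where such probabilities come from, here is a one-dimensional sketch in plain Python. It is not Spark's implementation, and the weights, means, and variances are made-up values rather than anything fitted from the data above.

```python
import math


def gaussian_pdf(x, mean, var):
    """Density of a 1-D normal distribution with the given mean and variance."""
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def responsibilities(x, weights, means, variances):
    """Posterior probability of each mixture component given the point x."""
    joint = [w * gaussian_pdf(x, m, v)
             for w, m, v in zip(weights, means, variances)]
    total = sum(joint)
    return [j / total for j in joint]


# Hypothetical two-component mixture; these numbers are illustrative only.
weights = [0.5, 0.5]
means = [0.0, 4.0]
variances = [1.0, 1.0]

# x = 2.0 is the midpoint between the two means
probs = responsibilities(2.0, weights, means, variances)
# Taking the argmax turns the soft assignment into a hard cluster id
hard = max(range(len(probs)), key=probs.__getitem__)
print(probs, hard)
```

Points near one mean get a probability close to 1 for that component, while points between the means are split, which is information a hard KMeans assignment discards.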





