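Introduction
Spark MLlib official site: http://spark.apache.org/docs/latest/ml-guide.html
MLlib is an algorithm library built on top of Spark Core that provides a rich set of machine learning algorithms. Its simple API lets you build a model and then use that model for prediction, analysis, recommendation, and similar tasks.
It includes the following tools:
1) Algorithm tools: classification, regression, clustering, collaborative filtering, etc.
2) Featurization tools: feature extraction, transformation, dimensionality reduction, selection, etc.
3) Pipelines: tools for constructing, evaluating, and tuning machine learning pipelines
4) Persistence: saving and loading algorithms, models, and pipelines
5) Utilities: linear algebra, statistics, data handling, etc.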
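As a rough illustration of items 2) and 3) above, the sketch below chains a feature transformer and a clustering algorithm into one pipeline. It uses the DataFrame-based API in the org.apache.spark.ml package. The object name PipelineSketch, the column names x, y, z, and the four sample rows are made up for this illustration, and k = 2 is an arbitrary choice.

```scala
import org.apache.spark.ml.{Pipeline, PipelineStage}
import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.sql.SparkSession

// Hypothetical example object; not part of Spark.
object PipelineSketch {

  // Fits a two-stage pipeline and returns the cluster index assigned to each input row.
  def clusterIndices(spark: SparkSession): Array[Int] = {
    import spark.implicits._

    // Made-up sample rows: two points near the origin, two near (9, 9, 9).
    val df = Seq(
      (0.0, 0.0, 0.0),
      (0.1, 0.1, 0.1),
      (9.0, 9.0, 9.0),
      (9.1, 9.1, 9.1)
    ).toDF("x", "y", "z")

    // Stage 1 (featurization): pack the three numeric columns into one vector column.
    val assembler = new VectorAssembler()
      .setInputCols(Array("x", "y", "z"))
      .setOutputCol("features")

    // Stage 2 (algorithm): k-means on the assembled vectors.
    val kmeans = new KMeans()
      .setK(2)
      .setSeed(1L)
      .setFeaturesCol("features")
      .setPredictionCol("prediction")

    // The pipeline runs the stages in order when fitted.
    val pipeline = new Pipeline().setStages(Array[PipelineStage](assembler, kmeans))
    val model = pipeline.fit(df)

    // Applying the fitted pipeline adds a "prediction" column with the cluster index.
    model.transform(df).select("prediction").collect().map(_.getInt(0))
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("PipelineSketch")
      .master("local[2]")
      .getOrCreate()
    try println(clusterIndices(spark).mkString(", "))
    finally spark.stop()
  }
}
```

Which index each group receives depends on initialization, so only the grouping of the rows is meaningful, not the index values themselves.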
Spark MLlib supports many algorithms. The example below uses KMeans, a clustering algorithm, to show basic MLlib usage.
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.clustering.{KMeans, KMeansModel}
import org.apache.spark.mllib.linalg.Vectors

object KMeansExample {
  def main(args: Array[String]): Unit = {
    // When submitting to a cluster, omit setMaster and let spark-submit supply it:
    //val conf = new SparkConf().setAppName("KMeansExample")
    // Here the job runs locally with two worker threads.
    val conf = new SparkConf().setAppName("Simple Application").setMaster("local[2]")
    val sc = new SparkContext(conf)

    // Load and parse the data: one point per line, coordinates separated by spaces
    val data = sc.textFile("G:\\spark-2.4.4-bin-hadoop2.7\\data\\mllib/kmeans_data.txt")
    val parsedData = data.map(s => Vectors.dense(s.split(' ').map(_.toDouble))).cache()

    // Cluster the data into three classes using KMeans
    val numClusters = 3
    val numIterations = 20
    val clusters = KMeans.train(parsedData, numClusters, numIterations)

    // Evaluate clustering by computing Within Set Sum of Squared Errors
    val WSSSE = clusters.computeCost(parsedData)
    println(s"Within Set Sum of Squared Errors = $WSSSE")

    // Print the centroids of the three resulting clusters
    println("Cluster centers:")
    for (c <- clusters.clusterCenters) {
      println(c.toString)
    }

    // Use the model to classify a single test point
    println(" ")
    val v1 = Vectors.dense("6.5 4.444 3.0".split(" ").map(_.toDouble))
    println(s"v1=(${v1(0)},${v1(1)},${v1(2)}) belongs to cluster: " + clusters.predict(v1))

    // Save and load model
    //clusters.save(sc, "target/org/apache/spark/KMeansExample/KMeansModel")
    //val sameModel = KMeansModel.load(sc, "target/org/apache/spark/KMeansExample/KMeansModel")

    sc.stop()
  }
}
The contents of kmeans_data.txt are as follows:
0.0 0.0 0.0
0.1 0.1 0.1
0.2 0.2 0.2
5.6 5.5 4.3
9.0 9.0 9.0
9.1 9.1 9.1
9.2 9.2 9.2
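To make the computeCost and predict calls in the program above more concrete, here is a minimal plain-Scala sketch of the quantities they correspond to: the index of the nearest center, and the sum of squared distances from every point to its nearest center (WSSSE). It does not use Spark. The object name KMeansByHand and its helper functions are hypothetical, the points are hard-coded, and the three centers are assumed values (the means of the three visible groups), not output copied from an actual KMeans.train run.

```scala
// Hypothetical helper object for illustration; not part of Spark MLlib.
object KMeansByHand {
  type Point = Array[Double]

  // Squared Euclidean distance between two points of the same dimension.
  def squaredDistance(a: Point, b: Point): Double =
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

  // Index of the center closest to p.
  def predict(centers: Seq[Point], p: Point): Int =
    centers.indices.minBy(i => squaredDistance(centers(i), p))

  // Within Set Sum of Squared Errors: each point contributes its squared
  // distance to the nearest center.
  def wssse(centers: Seq[Point], points: Seq[Point]): Double =
    points.map(p => centers.map(c => squaredDistance(c, p)).min).sum

  // Parse one line of space-separated coordinates, as the Spark program does.
  def parse(line: String): Point = line.split(' ').map(_.toDouble)

  // The sample points, hard-coded so the sketch needs no file access.
  val points: Seq[Point] = Seq(
    "0.0 0.0 0.0", "0.1 0.1 0.1", "0.2 0.2 0.2",
    "5.6 5.5 4.3",
    "9.0 9.0 9.0", "9.1 9.1 9.1", "9.2 9.2 9.2"
  ).map(parse)

  // Assumed centers: the mean of each of the three visible groups.
  val centers: Seq[Point] = Seq(
    Array(0.1, 0.1, 0.1),
    Array(5.6, 5.5, 4.3),
    Array(9.1, 9.1, 9.1)
  )

  def main(args: Array[String]): Unit = {
    val v1 = parse("6.5 4.444 3.0")
    println(s"WSSSE = ${wssse(centers, points)}")
    println(s"v1 is nearest to center index ${predict(centers, v1)}")
  }
}
```

With these assumed centers the test point is nearest to the center at index 1 and the cost is about 0.12. The centers and cluster indices produced by KMeans.train depend on initialization and may differ.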