I won't go into how SVD is implemented under the hood; let's talk instead about what SVD can do for us. Calling the SVD algorithm gives us a singular value for each component of the data: the larger the singular value, the more that component influences the result. When a singular value is very small, we can discard the corresponding component and still classify the data fairly accurately; this is where the dimensionality-reduction power of SVD shows up.
Below we call MLlib's SVD and KMeans implementations to verify how well SVD-based dimensionality reduction holds up.
1. First, call the SVD algorithm to inspect the singular values of the data:
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.linalg.{Matrix, SingularValueDecomposition, Vector, Vectors}
import org.apache.spark.mllib.linalg.distributed.RowMatrix

object MySVD {
  val conf = new SparkConf().setAppName("Svd").setMaster("local")

  def main(args: Array[String]): Unit = {
    test1()
  }

  def test1(): Unit = {
    val sc = new SparkContext(conf)
    val data = Array(
      Vectors.dense(5.0, 1.0, 1.0, 3.0, 7.0),
      Vectors.dense(2.0, 0.0, 3.0, 4.0, 5.0),
      Vectors.dense(4.0, 0.0, 0.0, 6.0, 7.0),
      Vectors.dense(4.0, 1.0, 0.0, 6.0, 7.0))
    val dataRDD = sc.parallelize(data, 2)
    val mat: RowMatrix = new RowMatrix(dataRDD)
    // Compute the top 5 singular values and corresponding singular vectors.
    val svd: SingularValueDecomposition[RowMatrix, Matrix] = mat.computeSVD(5, computeU = true)
    val U: RowMatrix = svd.U // The U factor is a RowMatrix.
    val s: Vector = svd.s    // The singular values are stored in a local dense vector.
    val V: Matrix = svd.V    // The V factor is a local dense matrix.
    println("U factor is:")
    U.rows.collect().foreach(println)
    println(s"Singular values are: $s")
    println(s"V factor is:\n$V")
    sc.stop()
  }
}
Result:
Singular values are: [18.07857954647125,2.8132737647378407,2.604276497395555,0.6842486621559452,7.432464828786868E-8]
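Before reading too much into these numbers, it helps to quantify them: the squared singular values measure how much of the matrix's total "energy" (its squared Frobenius norm) each component carries. A quick plain-Scala check on the values printed above (no Spark needed):

```scala
object SingularEnergy {
  // Singular values printed by the SVD run above.
  val s: Array[Double] = Array(18.07857954647125, 2.8132737647378407,
    2.604276497395555, 0.6842486621559452, 7.432464828786868e-8)

  // Each component's share of the total squared "energy".
  def shares(s: Array[Double]): Array[Double] = {
    val sq = s.map(x => x * x)
    val total = sq.sum
    sq.map(_ / total)
  }

  def main(args: Array[String]): Unit = {
    shares(s).zipWithIndex.foreach { case (r, i) =>
      println(f"component ${i + 1}: ${r * 100}%.4f%% of energy")
    }
  }
}
```

The first component alone carries roughly 95% of the energy, while the last two together contribute well under 1%, which is what justifies dropping a dimension below.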
This shows that the first component dominates by a wide margin, while the last singular value is essentially zero and the second-to-last (about 0.68) is very small. (Strictly speaking, singular values rank the principal directions rather than the original input features, but here they serve as a rough guide to feature importance.) So for the test below, we first run KMeans on the original data, then remove the second-to-last feature and run KMeans again.
2. Next, run KMeans before and after removing that feature:
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

object TestKmeansSvd {
  def main(args: Array[String]): Unit = {
    beforeSvd()
    afterSvd()
  }

  // KMeans on the original 5-feature data.
  def beforeSvd(): Unit = {
    val conf = new SparkConf().setAppName("k-means").setMaster("local")
    val sc = new SparkContext(conf)
    val data1 = Array(
      Vectors.dense(5.0, 1.0, 1.0, 3.0, 7.0),
      Vectors.dense(2.0, 0.0, 3.0, 4.0, 5.0),
      Vectors.dense(4.0, 0.0, 0.0, 6.0, 7.0),
      Vectors.dense(4.0, 1.0, 0.0, 6.0, 7.0))
    val parsedData = sc.parallelize(data1, 2)
    val numClusters = 2    // number of clusters to split the data into
    val numIterations = 20 // number of iterations
    // Train the model with the given parameters and data.
    val clusters = KMeans.train(parsedData, numClusters, numIterations)
    // Evaluate clustering by computing Within Set Sum of Squared Errors.
    val WSSSE = clusters.computeCost(parsedData)
    println("Within Set Sum of Squared Errors = " + WSSSE)
    // Predict and print the cluster assignments.
    clusters.predict(parsedData).foreach(println)
    sc.stop()
  }

  // KMeans on the reduced data, with the second-to-last feature removed.
  def afterSvd(): Unit = {
    val conf = new SparkConf().setAppName("k-means").setMaster("local")
    val sc = new SparkContext(conf)
    val data1 = Array(
      Vectors.dense(5.0, 1.0, 1.0, 7.0),
      Vectors.dense(2.0, 0.0, 3.0, 5.0),
      Vectors.dense(4.0, 0.0, 0.0, 7.0),
      Vectors.dense(4.0, 1.0, 0.0, 7.0))
    val parsedData = sc.parallelize(data1, 2)
    val numClusters = 2    // number of clusters to split the data into
    val numIterations = 20 // number of iterations
    // Train the model with the given parameters and data.
    val clusters = KMeans.train(parsedData, numClusters, numIterations)
    // Evaluate clustering by computing Within Set Sum of Squared Errors.
    val WSSSE = clusters.computeCost(parsedData)
    println("Within Set Sum of Squared Errors = " + WSSSE)
    // Predict and print the cluster assignments.
    clusters.predict(parsedData).foreach(println)
    sc.stop()
  }
}
The two runs produce the same cluster assignments, which suggests that SVD-guided dimensionality reduction is fairly reliable here. You can run one more test: remove the first feature instead and cluster again, and the assignments will differ.
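Dropping a raw feature column is a simplification. SVD-based reduction proper projects the data onto the top-k right singular vectors, i.e. computes Y = A * Vk where Vk is the first k columns of svd.V. Below is a minimal plain-Scala sketch of that projection; the basis vk here is an illustrative stand-in (it just picks out the first and last features), not the real values of svd.V:

```scala
object ProjectionSketch {
  // The 4 x 5 data matrix from the post (row-major).
  val a: Array[Array[Double]] = Array(
    Array(5.0, 1.0, 1.0, 3.0, 7.0),
    Array(2.0, 0.0, 3.0, 4.0, 5.0),
    Array(4.0, 0.0, 0.0, 6.0, 7.0),
    Array(4.0, 1.0, 0.0, 6.0, 7.0))

  // Project rows of a (m x n) onto the k columns of vk (n x k): y = a * vk.
  def project(a: Array[Array[Double]], vk: Array[Array[Double]]): Array[Array[Double]] =
    a.map(row => vk.head.indices.map(j => row.indices.map(i => row(i) * vk(i)(j)).sum).toArray)

  def main(args: Array[String]): Unit = {
    // Illustrative 5 x 2 basis (a stand-in for the first two columns of svd.V).
    val vk = Array(
      Array(1.0, 0.0),
      Array(0.0, 0.0),
      Array(0.0, 0.0),
      Array(0.0, 0.0),
      Array(0.0, 1.0))
    project(a, vk).foreach(r => println(r.mkString(", ")))
  }
}
```

In Spark itself the same projection can be done with RowMatrix.multiply on the first k columns of svd.V; the projected rows can then be fed to KMeans.train exactly as above.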