1. Generate an array of N elements. Each element is a Vector of size D, filled with D randomly generated Doubles (Array --> Vector --> Double).
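// Assumes N (number of points), D (dimensions), R (coordinate range) and rand are defined elsewhere.
// DenseVector.fill is assumed to come from breeze.linalg.
def generateData = {
  // Each point has D coordinates drawn uniformly from [0, R).
  def generatePoint(i: Int) = {
    DenseVector.fill(D){rand.nextDouble * R}
  }
  Array.tabulate(N)(generatePoint)
}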
2. Given the K centers, find the center closest to point p and return that center's index.
def closestPoint(p: Vector[Double], centers: HashMap[Int, Vector[Double]]): Int = {
  var bestIndex = 0
  var closest = Double.PositiveInfinity
  // Centers are keyed 1..K, so the loop starts at 1.
  for (i <- 1 to centers.size) {
    val vCurr = centers(i)
    val tempDist = squaredDistance(p, vCurr)
    if (tempDist < closest) {
      closest = tempDist
      bestIndex = i
    }
  }
  bestIndex
}
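As a small worked example of step 2, the sketch below uses plain Scala collections in place of the vector type used above, so it runs without extra dependencies. The names `ClosestPointSketch` and `Point` are illustrative and not part of the original code, and `squaredDistance` is re-implemented here on the assumption that it means the sum of squared coordinate differences.

```scala
// Sketch only: plain Scala Seq[Double] stands in for the Vector type used in the text.
object ClosestPointSketch {
  type Point = Seq[Double] // assumption: illustrative alias, not the original type

  // Sum of squared coordinate differences between a and b.
  def squaredDistance(a: Point, b: Point): Double =
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

  // Key of the center nearest to p; centers must be non-empty.
  def closestPoint(p: Point, centers: Map[Int, Point]): Int =
    centers.minBy { case (_, c) => squaredDistance(p, c) }._1

  def main(args: Array[String]): Unit = {
    val centers = Map(1 -> Seq(0.0, 0.0), 2 -> Seq(10.0, 10.0))
    println(closestPoint(Seq(1.0, 2.0), centers)) // nearer to center 1
    println(closestPoint(Seq(9.0, 7.0), centers)) // nearer to center 2
  }
}
```

Using `minBy` avoids the mutable `bestIndex` and `closest` variables and does not depend on the keys being exactly 1..K.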
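3. Randomly pick K Vectors from the N-element array as the initial centers.
// points is assumed to be a mutable HashSet, so a repeated pick is dropped and the loop runs until K distinct points are chosen.
while (points.size < K) {
  points.add(data(rand.nextInt(N)))
}
// kPoints is assumed to be a mutable HashMap from index (1..K) to center.
val iter = points.iterator
for (i <- 1 to points.size) {
  kPoints.put(i, iter.next())
}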
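The loop above only terminates if the data contains at least K distinct points. A hedged alternative, sketched below, shuffles the distinct points and takes the first K, which makes that precondition explicit. This is a swapped-in technique, not the original code: `InitialCentersSketch`, `initialCenters` and the `Point` alias are hypothetical names, and plain Scala collections stand in for the vector type.

```scala
import scala.util.Random

// Sketch only: shuffle-and-take instead of the retry loop shown in the text.
object InitialCentersSketch {
  type Point = Seq[Double] // assumption: illustrative alias

  // Picks k distinct points from data and numbers them 1..k.
  def initialCenters(data: Seq[Point], k: Int, rand: Random): Map[Int, Point] = {
    val candidates = data.distinct
    require(k <= candidates.size, "need at least k distinct points")
    val chosen = rand.shuffle(candidates).take(k)
    chosen.zipWithIndex.map { case (p, i) => (i + 1, p) }.toMap
  }
}
```

Passing a seeded `Random` (for example `new Random(42)`) makes the choice reproducible between runs.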
4. For each Vector, find its closest center and return (index, (vector, 1)), where the 1 is a count.
val closest = data.map(p => (closestPoint(p, kPoints), (p, 1)))
5. Group by index.
val mappings = closest.groupBy[Int](x => x._1)
6. For each center, sum all the vectors assigned to it and count them. pair._2 holds the (index, (vector, 1)) entries of one group; reduceLeft folds them from the left and returns (index, (sum(vector), totalCount)).
val pointStats = mappings.map { pair =>
  pair._2.reduceLeft[(Int, (Vector[Double], Int))] {
    case ((id1, (x1, y1)), (id2, (x2, y2))) => (id1, (x1 + x2, y1 + y2))
  }
}
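To make the shapes in steps 5 and 6 concrete, here is a minimal sketch on three hand-written points. It assumes plain Scala `Seq[Double]` values instead of the vector type above; `GroupReduceSketch`, `Point` and `add` are illustrative names that do not appear in the original code.

```scala
// Sketch only: shows how groupBy followed by reduceLeft yields (sum, count) per center.
object GroupReduceSketch {
  type Point = Seq[Double] // assumption: illustrative alias

  // Element-wise sum of two points of equal length.
  def add(a: Point, b: Point): Point = a.zip(b).map { case (x, y) => x + y }

  // Input: (centerIndex, (point, 1)) pairs. Output: centerIndex -> (sum of points, count).
  def pointStats(closest: Seq[(Int, (Point, Int))]): Map[Int, (Point, Int)] =
    closest.groupBy(_._1).map { case (id, group) =>
      val total = group.map(_._2).reduceLeft[(Point, Int)] {
        case ((s1, c1), (s2, c2)) => (add(s1, s2), c1 + c2)
      }
      (id, total)
    }

  def main(args: Array[String]): Unit = {
    val closest = Seq(
      (1, (Seq(1.0, 1.0), 1)),
      (2, (Seq(4.0, 4.0), 1)),
      (1, (Seq(3.0, 5.0), 1))
    )
    // Center 1 receives two points, center 2 receives one.
    println(pointStats(closest))
  }
}
```

Carrying the count of 1 alongside each vector means a single reduction produces both the sum and the group size that step 7 divides by.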
7. Recompute the K centers as (index, sum(vector) / totalCount).
val newPoints = pointStats.map { mapping =>
  (mapping._1, mapping._2._1 * (1.0 / mapping._2._2))
}
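8. Sum the squared distances between the new K centers and the previous ones to check for convergence. If the total has not dropped below the convergence threshold, repeat steps 4 to 9.
tempDist = 0.0
for (mapping <- newPoints) {
  tempDist += squaredDistance(kPoints(mapping._1), mapping._2)
}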
9. Update the K indexed centers.
for (newP <- newPoints) {
  kPoints.put(newP._1, newP._2)
}
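Steps 4 to 9 repeat until the centers stop moving. The sketch below puts them into one loop under stated assumptions: plain Scala `Seq[Double]` replaces the vector type, and `KMeansLoopSketch`, `run` and the `convergeDist` threshold parameter are names introduced here for illustration (the text only says the computation repeats until convergence). `convergeDist` is assumed to be non-negative.

```scala
// Sketch only: one possible shape for the iteration described in steps 4 to 9.
object KMeansLoopSketch {
  type Point = Seq[Double] // assumption: illustrative alias

  def squaredDistance(a: Point, b: Point): Double =
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

  def closestPoint(p: Point, centers: Map[Int, Point]): Int =
    centers.minBy { case (_, c) => squaredDistance(p, c) }._1

  // Repeats the steps until the total squared movement of the centers is at most convergeDist.
  def run(data: Seq[Point], initial: Map[Int, Point], convergeDist: Double): Map[Int, Point] = {
    var kPoints = initial
    var tempDist = Double.PositiveInfinity
    while (tempDist > convergeDist) {
      // Steps 4 and 5: assign each point to its nearest center and group by index.
      val mappings = data.groupBy(p => closestPoint(p, kPoints))
      // Steps 6 and 7: average the points of each group to get the new center.
      val newPoints = mappings.map { case (id, group) =>
        val sum = group.reduceLeft[Point]((a, b) => a.zip(b).map { case (x, y) => x + y })
        (id, sum.map(_ / group.size))
      }
      // Step 8: total squared distance between old and new centers.
      tempDist = newPoints.map { case (id, c) => squaredDistance(kPoints(id), c) }.sum
      // Step 9: overwrite the moved centers; a center with no points keeps its position.
      kPoints = kPoints ++ newPoints
    }
    kPoints
  }
}
```

For example, calling `run` with the four points (0,0), (0,2), (10,10), (10,12) and initial centers (0,0) and (10,10) moves the centers to the two cluster means and then stops, because the next pass leaves them unchanged.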