Spark kNN. Spark is an excellent distributed computing framework. This post implements k-nearest neighbors (kNN) on Spark using the Euclidean distance. Below is a simple implementation; there may well be problems in it, and corrections from readers are welcome.
import org.apache.log4j.{Level, Logger}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

// Quiet the noisy framework logs
Logger.getLogger("org.apache.spark").setLevel(Level.WARN)
Logger.getLogger("org.eclipse.jetty.server").setLevel(Level.OFF)

val conf = new SparkConf().setAppName("knn")
conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
val sc = new SparkContext(conf)
val k: Int = 6
val path = "hdfs://master:9000/knn.txt"
// Each input line: two coordinates and a label, whitespace-separated
val data = sc.textFile(path).map { line =>
  val pair = line.split("\\s+")
  (pair(0).toDouble, pair(1).toDouble, pair(2))
}
val total: Array[RDD[(Double, Double, String)]] = data.randomSplit(Array(0.7, 0.3))
val train = total(0).cache()
val test = total(1).cache()
train.count() // force materialization of the cached RDDs
test.count()
// Ship the (small) training set and k to every executor
val bcTrainSet = sc.broadcast(train.collect())
val bck = sc.broadcast(k)
val resultSet = test.map { line =>
  val x = line._1
  val y = line._2
  val trainData = bcTrainSet.value
  // Euclidean distance from (x, y) to every training point, paired with that point's label
  val set = scala.collection.mutable.ArrayBuffer.empty[(Double, String)]
  trainData.foreach { case (tx, ty, label) =>
    val distance = math.sqrt(math.pow(x - tx, 2) + math.pow(y - ty, 2))
    set += ((distance, label))
  }
  val list = set.sortBy(_._1)
  // Count the labels among the k nearest neighbors
  val categoryCountMap = scala.collection.mutable.Map.empty[String, Int]
  val k = bck.value
  for (i <- 0 until k) {
    val category = list(i)._2
    val count = categoryCountMap.getOrElse(category, 0) + 1
    categoryCountMap += (category -> count)
  }
  // Predict the most frequent label
  val (rCategory, _) = categoryCountMap.maxBy(_._2)
  (x, y, rCategory)
}
// repartition(1) so the result is written out as a single part file
resultSet.repartition(1).saveAsTextFile("hdfs://master:9000/knn/result")
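For reference, based on the parsing code above, knn.txt is assumed to hold one point per line: two coordinates followed by a class label, separated by whitespace. The values below are made-up sample data, not from the original post:

```
1.0 2.0 A
1.5 1.8 A
5.0 8.0 B
6.0 9.0 B
```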
The above is the most basic implementation. A weighted variant is also possible: for example, when tallying votes, multiply each neighbor's vote by the inverse of its distance, so that nearer neighbors count more toward the final tally.
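A minimal self-contained sketch of that distance-weighted vote, runnable outside Spark. The `neighbors` data here is hypothetical, standing in for the (distance, label) pairs computed in the map above; the small `eps` is an added guard against division by zero when a test point coincides with a training point:

```scala
// Distance-weighted kNN vote: each of the k nearest neighbors contributes
// 1 / (distance + eps) to its label's score instead of a flat count of 1.
val neighbors = Seq((0.5, "A"), (1.0, "B"), (2.0, "A"), (0.1, "B"))
val k = 3
val eps = 1e-9 // avoid division by zero for zero distances
val scores = neighbors.sortBy(_._1).take(k)
  .groupBy(_._2)
  .map { case (label, ds) => label -> ds.map { case (d, _) => 1.0 / (d + eps) }.sum }
val predicted = scores.maxBy(_._2)._1 // "B": its close neighbor at distance 0.1 dominates
```

With a flat count the same three neighbors would also be decided by majority (2 votes to 1), but weighting makes the outcome robust when counts tie: the class with nearer neighbors wins.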