2 - Hands-On Implementation of the Algorithm

The singleton object below encapsulates the Euclidean distance formula, a method that applies the distance to a KMeansModel, a method that computes the mean distance-to-centroid for a model trained with a given k, and a method that evaluates a range of k values.

package NetWork

import org.apache.spark.mllib.clustering.{KMeans, KMeansModel}
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

/**
 * This singleton object encapsulates the Euclidean distance formula, a method that applies the
 * distance to a KMeansModel, a method that computes the mean distance-to-centroid for a given k,
 * and a method that evaluates a range of k values.
 */
object CountClass {

  /**
   * Euclidean distance formula:
   * x.toArray.zip(y.toArray) pairs up the corresponding elements of the two vectors,
   * map(p => p._1 - p._2) takes their differences,
   * map(d => d*d).sum gives the sum of squared differences,
   * math.sqrt() takes the square root.
   *
   * @param x first vector
   * @param y second vector
   * @return the Euclidean distance between x and y
   */
  def distance(x: Vector, y: Vector) = {
    math.sqrt(x.toArray.zip(y.toArray).map(p => p._1 - p._2).map(d => d*d).sum)
  }

  /**
   * Applies the Euclidean distance to a model:
   * KMeansModel.predict internally calls the KMeans object's findClosest method.
   *
   * @param datum a data point
   * @param model the trained KMeansModel
   * @return the distance from the point to the centroid of its cluster
   */
  def distToCentroid(datum: Vector, model: KMeansModel) = {
    //find the cluster whose centroid is closest to the point
    val cluster = model.predict(datum)
    //look up that cluster's centroid
    val centroid = model.clusterCenters(cluster)
    distance(centroid, datum)
  }

  /**
   * Mean distance to centroid for a model trained with the given k.
   *
   * @param data the data as an RDD of Vectors
   * @param k the number of clusters
   * @return the mean distance from each point to its closest centroid
   */
  def clusteringScore(data: RDD[Vector], k: Int) = {
    val kmeans = new KMeans()
    kmeans.setK(k)
    val model = kmeans.run(data)
    data.map(datum => distToCentroid(datum, model)).mean()
  }

  /**
   * Evaluates a range of k values.
   * Scala's (x to y by z) builds a numeric collection: an arithmetic sequence over the
   * closed interval [x, y] with step z. The syntax is used here to build a series of
   * k values and then run a task for each of them.
   * @param data the data as an RDD of Vectors
   */
  def check(data: RDD[Vector]) = {
    (5 to 40 by 5).map(k => (k, CountClass.clusteringScore(data, k))).foreach(println)
  }
}
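
As a quick sanity check of the distance helper above, here is a minimal sketch using two made-up two-dimensional vectors (hypothetical values, not taken from the dataset); for the classic 3-4-5 right triangle the result should be exactly 5.0:

import org.apache.spark.mllib.linalg.Vectors

val a = Vectors.dense(0.0, 0.0)
val b = Vectors.dense(3.0, 4.0)
//sqrt((0-3)^2 + (0-4)^2) = sqrt(25) = 5.0
println(CountClass.distance(a, b))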

1. Count the samples in the dataset by class label and sort the counts in descending order

package NetWork

import org.apache.spark.{SparkConf, SparkContext}

/**
 * Counts the samples per class label and sorts the counts in descending order.
 */
object CountExample {
  def main(args: Array[String]) {

    //create the Spark entry points (SparkConf and SparkContext)
    val conf = new SparkConf().setAppName("CheckValue1").setMaster("local")
    val sc= new SparkContext(conf)
    val HDFS_DATA_PATH = "hdfs://hadoop102:9000/Spark_Network_Abnormality/kddcup.data"
    val rawData = sc.textFile(HDFS_DATA_PATH)
    val sort_result = rawData.map(_.split(",").last).countByValue().toSeq.sortBy(_._2).reverse
    sort_result.foreach(println)
  }
}
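
To see what countByValue combined with sortBy produces, here is a minimal local sketch over a few hand-made label strings (hypothetical values, not real records from kddcup.data), assuming the same SparkContext sc as above:

//hypothetical labels, only to illustrate countByValue + descending sort
val sample = sc.parallelize(Seq("normal.", "smurf.", "normal.", "neptune.", "normal."))
val sorted = sample.countByValue().toSeq.sortBy(_._2).reverse
sorted.foreach(println)
//(normal.,3)
//(smurf.,1)    <- order of ties may vary
//(neptune.,1)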

2. Build a KMeansModel from the samples in the dataset, then print each cluster's centroid

package NetWork

import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.{SparkConf, SparkContext}

object SetKMeansModel {
    def main(args: Array[String]) {
      //create the Spark entry points (SparkConf and SparkContext)
      val conf = new SparkConf().setAppName("CheckValue1").setMaster("local")
      val sc= new SparkContext(conf)
      val HDFS_DATA_PATH = "hdfs://hadoop102:9000/Spark_Network_Abnormality/kddcup.data"
      val rawData = sc.textFile(HDFS_DATA_PATH)

      val LabelsAndData = rawData.map{   //this block performs RDD[String] => RDD[(String, Vector)]
        line =>
          //toBuffer creates a mutable list (Buffer)
          val buffer = line.split(",").toBuffer
          buffer.remove(1, 3)
          val label = buffer.remove(buffer.length-1)
          val vector = Vectors.dense(buffer.map(_.toDouble).toArray)
          (label, vector)
      }
      val data = LabelsAndData.values.cache()  //keep only the vectors and cache them

      //build the KMeansModel with default parameters (k defaults to 2)
      val kmeans = new KMeans()
      val model = kmeans.run(data)
      model.clusterCenters.foreach(println)
    }
}
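
To make the parsing block above concrete, here is a trace on a single shortened, made-up record (the real kddcup.data records have 41 feature columns plus a label; the three columns dropped by buffer.remove(1, 3) are the categorical protocol/service/flag fields, which cannot be converted to Double):

import org.apache.spark.mllib.linalg.Vectors

//shortened, hypothetical record: duration, three categorical fields, three numeric features, label
val line = "0,tcp,http,SF,215,45076,1,normal."
val buffer = line.split(",").toBuffer
buffer.remove(1, 3)                           //drops "tcp", "http", "SF"
val label = buffer.remove(buffer.length - 1)  //"normal."
val vector = Vectors.dense(buffer.map(_.toDouble).toArray)
println(label)   //normal.
println(vector)  //[0.0,215.0,45076.0,1.0]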


Program output:
             Vector 1:
              [48.34019491959669,1834.6215497618625,826.2031900016945,5.7161172049003456E-6,
              6.487793027561892E-4,7.961734678254053E-6,0.012437658596734055,
              3.205108575604837E-5,0.14352904910348827,0.00808830584493399,
              6.818511237273984E-5,3.6746467745787934E-5,0.012934960793560386,
              0.0011887482315762398,7.430952366370449E-5,0.0010211435092468404,
              0.0,4.082940860643104E-7,8.351655530445469E-4,334.9735084506668,
              295.26714620807076,0.17797031701994342,0.1780369894027253,
              0.05766489875327374,0.05772990937912739,0.7898841322630883,
              0.021179610609908736,0.02826081009629284,232.98107822302248,
              189.21428335201279,0.7537133898006421,0.030710978823798966,
              0.6050519309248854,0.006464107887636004,0.1780911843182601,
              0.17788589813474293,0.05792761150001131,0.05765922142400886]
              Vector 2:
              [10999.0,0.0,1.309937401E9,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,
              0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,1.0,1.0,1.0,
              0.0,0.0,255.0,1.0,0.0,0.65,1.0,0.0,0.0,0.0,1.0,1.0]

3. Count how many times each label appears in each cluster, i.e. count by (cluster, label), and print the counts in a readable format

package NetWork

import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.{SparkConf, SparkContext}

/**
 * Counts how many times each label appears in each cluster, i.e. counts by (cluster, label),
 * and prints the counts in a readable format.
 */
object CountByCluster {
  def main(args: Array[String]) {
    //create the Spark entry points (SparkConf and SparkContext)
    val conf = new SparkConf().setAppName("CheckValue1").setMaster("local")
    val sc= new SparkContext(conf)
    val HDFS_DATA_PATH = "hdfs://hadoop102:9000/Spark_Network_Abnormality/kddcup.data"
    val rawData = sc.textFile(HDFS_DATA_PATH)

    val LabelsAndData = rawData.map{   //this block performs RDD[String] => RDD[(String, Vector)]
      line =>
        //toBuffer creates a mutable list (Buffer)
        val buffer = line.split(",").toBuffer
        buffer.remove(1, 3)
        val label = buffer.remove(buffer.length-1)
        val vector = Vectors.dense(buffer.map(_.toDouble).toArray)
        (label, vector)
    }
    val data = LabelsAndData.values.cache()  //keep only the vectors and cache them

    //build the KMeansModel with default parameters (k defaults to 2)
    val kmeans = new KMeans()
    val model = kmeans.run(data)

    /** From CheckValue1 we already know that the dataset contains 23 label classes, so the
     * 2-cluster model from CheckValue2 is certainly not accurate. Below I use the given class
     * labels to see directly which sample types each cluster contains: the labels in each
     * cluster are counted and printed in a readable format.
     */
    //count (cluster, label) occurrences
    val clusterLabelCount = LabelsAndData.map {
      case (label, datum) =>
        val cluster = model.predict(datum)
        (cluster, label)
    }.countByValue()
    //print the cluster-label counts
    println("计数结果如下")
    clusterLabelCount.toSeq.sorted.foreach {
      case ((cluster, label), count) =>
        //format the output with Scala's f string interpolator
        println(f"$cluster%1s$label%18s$count%8s")
    }
  }
}
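
countByValue here returns a Map keyed by (cluster, label) with occurrence counts as values, and the f interpolator pads the three fields into aligned columns. A minimal local sketch with hypothetical pairs standing in for the predict results:

//hypothetical (cluster, label) pairs, only to show the shape of the counts and the formatting
val pairs = Seq((0, "normal."), (0, "normal."), (0, "smurf."), (1, "portsweep."))
val clusterLabelCount = pairs.groupBy(identity).map { case (key, occ) => (key, occ.size.toLong) }
clusterLabelCount.toSeq.sorted.foreach { case ((cluster, label), count) =>
  //%1s prints the cluster id, %18s right-aligns the label, %8s right-aligns the count
  println(f"$cluster%1s$label%18s$count%8s")
}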


Count results:
0             back.    2203
0  buffer_overflow.      30
0        ftp_write.       8
0     guess_passwd.      53
0             imap.      12
0          ipsweep.   12481
0             land.      21
0       loadmodule.       9
0         multihop.       7
0          neptune. 1072017
0             nmap.    2316
0           normal.  972781
0             perl.       3
0              phf.       4
0              pod.     264
0        portsweep.   10412
0          rootkit.      10
0            satan.   15892
0            smurf. 2807886
0              spy.       2
0         teardrop.     979
0      warezclient.    1020
0      warezmaster.      20
1        portsweep.       1

4. Evaluate different values of k on the dataset

package NetWork

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.{SparkConf, SparkContext}
object EvaluateByK {
  def main(args: Array[String]) {

    //create the Spark entry points (SparkConf and SparkContext)
    val conf = new SparkConf().setAppName("CheckValue1").setMaster("local")
    val sc= new SparkContext(conf)
    val HDFS_DATA_PATH = "hdfs://hadoop102:9000/Spark_Network_Abnormality/kddcup.data"
    val rawData = sc.textFile(HDFS_DATA_PATH)

    val LabelsAndData = rawData.map{   //this block performs RDD[String] => RDD[(String, Vector)]
      line =>
        //toBuffer creates a mutable list (Buffer)
        val buffer = line.split(",").toBuffer
        buffer.remove(1, 3)
        val label = buffer.remove(buffer.length-1)
        val vector = Vectors.dense(buffer.map(_.toDouble).toArray)
        (label, vector)
    }
    val data = LabelsAndData.values.cache()  //keep only the vectors and cache them


    CountClass.check(data)     //evaluate the choices of k for k = 5, 10, 15, 20, 25, 30, 35, 40
  }
}


(5,1938.8583418059188)
(10,1722.0822302073261)
(15,1624.0912356095741)
(20,1298.0334124883611)
(25,1449.809133112103)
(30,1286.6501839804016)
(35,1298.1353507223746)
(40,1285.6059077996915)

5. Evaluate a wider range of k values on the dataset to roughly locate the interval that contains the best k. Here the clustering is run multiple times for each k, with setRuns() setting the number of runs.

package NetWork

import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.{SparkConf, SparkContext}
object EvaluateByK2 {
  def main(args: Array[String]) {

    //create the Spark entry points (SparkConf and SparkContext)
    val conf = new SparkConf().setAppName("CheckValue1").setMaster("local")
    val sc= new SparkContext(conf)
    val HDFS_DATA_PATH = "hdfs://hadoop102:9000/Spark_Network_Abnormality/kddcup.data"
    val rawData = sc.textFile(HDFS_DATA_PATH)

    val LabelsAndData = rawData.map{   //this block performs RDD[String] => RDD[(String, Vector)]
      line =>
        //toBuffer creates a mutable list (Buffer)
        val buffer = line.split(",").toBuffer
        buffer.remove(1, 3)
        val label = buffer.remove(buffer.length-1)
        val vector = Vectors.dense(buffer.map(_.toDouble).toArray)
        (label, vector)
    }
    val data = LabelsAndData.values.cache()  //keep only the vectors and cache them

    //evaluate k = 30, 40, ..., 100 in parallel; a fresh KMeans is configured inside the
    //closure so that the run count and convergence threshold apply to every k and the
    //parallel tasks do not share a mutable KMeans instance
    (30 to 100 by 10).par.map { k =>
      val kmeans = new KMeans()
      kmeans.setK(k)
      kmeans.setRuns(10)          //number of runs for each k
      kmeans.setEpsilon(1.0e-6)   //tighter convergence threshold
      val model = kmeans.run(data)
      (k, data.map(datum => CountClass.distToCentroid(datum, model)).mean())
    }.toList.foreach(println)
  }
}
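
If you prefer to keep all of the scoring logic inside CountClass, an alternative is a second scoring method that also applies the run count and convergence threshold. This is a sketch of a hypothetical helper (clusteringScore2 is not part of the object shown earlier):

//hypothetical variant of clusteringScore that also sets runs and epsilon
def clusteringScore2(data: RDD[Vector], k: Int, runs: Int = 10): Double = {
  val kmeans = new KMeans()
  kmeans.setK(k)
  kmeans.setRuns(runs)        //run KMeans several times for each k
  kmeans.setEpsilon(1.0e-6)   //tighter convergence threshold than the default
  val model = kmeans.run(data)
  data.map(datum => distToCentroid(datum, model)).mean()
}

With this added to CountClass, the evaluation in main reduces to (30 to 100 by 10).par.map(k => (k, CountClass.clusteringScore2(data, k))).toList.foreach(println).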


(30,1435.2909068752092)
(40,1296.9767432250865)
(50,1043.0428251656863)
(60,1049.0796806917588)
(70,1262.2202539255009)
(80,932.3610275820481)
(90,1009.9958281781237)
(100,1062.1627309290627)

Summary: as k increases, the score keeps dropping. What we want to find is the critical point past which increasing k no longer lowers the score significantly, that is, the elbow of the k-vs-score curve. The curve usually keeps descending past the elbow but eventually flattens out. In this example the score is still dropping noticeably at k = 100, so the elbow should lie beyond k = 100.
