Spark MLlib (2): SVM

Original post · August 30, 2016, 13:48:27
package com.qh

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.classification.{SVMModel, SVMWithSGD}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.util.MLUtils

/**
  * Created by hadoop on 8/29/16.
  * Spark 2.0.0
  * Scala 2.11.8
  * Classification algorithm: Linear Support Vector Machines (SVM)
  */
object MLlib_SVM {
  private val path = "hdfs://master:9000/Spark/MLlib/SVM"

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
    conf.setAppName("MLlib SVM")
      .setMaster("spark://master:7077")
    val sc = new SparkContext(conf)

    /*
    The data sets are Spark's bundled sample data.
    scala> LIBSVMData.collect()
    res0: Array[org.apache.spark.mllib.regression.LabeledPoint] = Array((0.0,(692,[127,128,129,
    130,131,154,155,156,157,158,159,181,182,183,184,185,186,187,188,189,207,208,209,210,211,212,
    213,214,215,216,217,235,236,237,238,239,240,241,242,243,244,245,262,263,264,265,266,267,268,
    269,270,271,272,273,289,290,291,292,293,294,295,296,297,300,301,302,316,317,318,319,320,321,
    328,329,330,343,344,345,346,347,348,349,356,357,358,371,372,373,374,384,385,386,399,400,401,
    412,413,414,426,427,428,429,440,441,442,454,455,456,457,466,467,468,469,470,482,483,484,493,
    494,495,496,497,510,511,512,520,521,522,523,538,539,540,547,548,549,550,566,567,568,569,570,
    571,572,573,574,575,576,577,578,594,595,596,597,598,599,600,601,602,603,604,622,623,624,625,
    626,627,628,629,630,651,652,653,654,655,656,657],[51.0,159.0,2...

    scala> SVMData.collect()
    res1: Array[org.apache.spark.mllib.regression.LabeledPoint] = Array((1.0,[0.0,2.52078447201548,
    0.0,0.0,0.0,2.004684436494304,2.000347299268466,0.0,2.228387042742021,2.228387042742023,0.0,0.0,
    0.0,0.0,0.0,0.0]), (0.0,[2.857738033247042,0.0,0.0,2.619965104088255,0.0,2.004684436494304,
    2.000347299268466,0.0,2.228387042742021,2.228387042742023,0.0,0.0,0.0,0.0,0.0,0.0]), (0.0,
    [2.857738033247042,0.0,2.061393766919624,0.0,0.0,2.004684436494304,0.0,0.0,2.228387042742021,
    2.228387042742023,0.0,0.0,0.0,0.0,0.0,0.0]), (1.0,[0.0,0.0,2.061393766919624,2.619965104088255,0.0,
    2.004684436494304,2.000347299268466,0.0,0.0,0.0,0.0,2.055002875864414,0.0,0.0,0.0,0.0]), (1.0,
    [2.857738033247042,0.0,2.061393766919624,2.619965104088255,0.0,2.004684436494304,0.0,0.0,0.0,0.0,0.0,
    2.055002875864414,0.0,0.0,0.0,0.0]), (...
     */
    val LIBSVMData = MLUtils.loadLibSVMFile(sc, path + "/LIBSVMData.txt")

    val data = sc.textFile(path + "/SVMData.txt")
    val SVMData = data.map { line =>
      val parts = line.split("\\s+")
      LabeledPoint(parts(0).toDouble, Vectors.dense(parts.tail.map(_.toDouble)))
    }
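    // For reference, each line of SVMData.txt is assumed to look like
    // "1.0 0.0 2.52 0.0 ...": a 0/1 label followed by whitespace-separated
    // feature values, which is what the parsing above expects.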

    /*
    Use 60% of the data to train the model and the remaining 40% to test it.
    def randomSplit(weights: Array[Double], seed: Long = Utils.random.nextLong): Array[RDD[T]]
    splits one RDD into multiple RDDs according to the given weights.

    One of Spark's most important features is the ability to persist (or cache)
    a dataset in memory across operations; cache and persist provide this:
    1) RDD.cache() simply calls persist() with the MEMORY_ONLY storage level
    2) persist() lets you set the StorageLevel explicitly to match your storage
       needs (see the sketch after the splits below)
    3) neither cache nor persist is an action
     */
    val parsedDataLib = LIBSVMData.randomSplit(Array(0.6, 0.4))
    val TrainDataLib = parsedDataLib(0).cache()
    val TestDataLib = parsedDataLib(1)

    val parsedData = SVMData.randomSplit(Array(0.6, 0.4))
    val TrainData = parsedData(0).cache()
    val TestData = parsedData(1)
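
    // A sketch, not in the original post, of point 2 in the comment above:
    // persist() with an explicit StorageLevel. TestData is reused twice below,
    // so letting its partitions spill to disk under memory pressure is reasonable.
    import org.apache.spark.storage.StorageLevel
    TestData.persist(StorageLevel.MEMORY_AND_DISK)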

    /*
    The labeled-point type:
    LabeledPoint(label: Double, features: Vector)

    def train(input: RDD[LabeledPoint], numIterations: Int, stepSize: Double, regParam: Double,
              miniBatchFraction: Double, initialWeights: Vector): SVMModel
    input: training samples; the label must be either 1.0 or 0.0, and the features are Doubles
    numIterations: number of gradient-descent iterations (default 100)
    stepSize: step size for each gradient-descent iteration (default 1.0)
    regParam: regularization parameter (default 0.01)
    miniBatchFraction: fraction of the data used in each iteration (default 1.0)
    initialWeights: the initial weight vector; its size should equal the number of features in the data
    (an alternative way to configure these parameters is sketched below)
     */
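    /*
    A sketch, not from the original post: instead of the static train() helper,
    SVMWithSGD can be instantiated and its optimizer configured directly, for
    example switching to L1 regularization with L1Updater:

      import org.apache.spark.mllib.optimization.L1Updater
      val svmAlg = new SVMWithSGD()
      svmAlg.optimizer
        .setNumIterations(200)
        .setRegParam(0.1)
        .setUpdater(new L1Updater)
      val modelL1 = svmAlg.run(TrainDataLib)
     */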
    var model = SVMWithSGD.train(TrainDataLib, 100)
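    // Clear the default 0.0 decision threshold so predict() returns raw margin
    // scores rather than hard 0/1 labels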
    model.clearThreshold()
    var scoreAndLabels = TestDataLib.map(x => (model.predict(x.features), x.label))
    scoreAndLabels.saveAsTextFile(path + "/LIBSVMData")
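
    // A sketch, not in the original post: scoreAndLabels already holds
    // (raw score, true label) pairs, which is exactly what MLlib's
    // BinaryClassificationMetrics expects; metricsLib is an illustrative name.
    import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics
    val metricsLib = new BinaryClassificationMetrics(scoreAndLabels)
    println(s"LIBSVMData test AUC = ${metricsLib.areaUnderROC()}")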
    /*
    Save the model so it can be reloaded later and used for prediction directly.
     */
    model.save(sc, path + "/LIBSVMDataModel")
    // Load the model back
    var sameModel = SVMModel.load(sc, path + "/LIBSVMDataModel")
    // Verify that the loaded model reproduces the results: compare with the LIBSVMData output
    scoreAndLabels = TestDataLib.map(x => (sameModel.predict(x.features), x.label))
    scoreAndLabels.saveAsTextFile(path + "/LIBSVMDataCom")
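
    // A sketch, not in the original post: the trained and reloaded models can
    // also be compared programmatically rather than by inspecting the saved files.
    val numDiffs = TestDataLib
      .map(x => (model.predict(x.features), sameModel.predict(x.features)))
      .filter { case (a, b) => a != b }
      .count()
    println(s"Predictions differing after reload: $numDiffs")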

    model = SVMWithSGD.train(TrainData, 100)
    model.clearThreshold()
    scoreAndLabels = TestData.map(x => (model.predict(x.features), x.label))
    scoreAndLabels.saveAsTextFile(path + "/SVMData")
    model.save(sc, path + "/SVMDataModel")
    sameModel = SVMModel.load(sc, path + "/SVMDataModel")
    scoreAndLabels = TestData.map(x => (sameModel.predict(x.features), x.label))
    scoreAndLabels.saveAsTextFile(path + "/SVMDataCom")

    sc.stop()
  }
}