Even the cleverest cook cannot make a meal without rice: you need data first, and only then can you do data analysis. This is the biggest flaw in the courses I have taught so far; I never told my audience how to obtain data in the first place. It is also a problem I keep running into myself: when I am learning a new technology and have no data to play with, purely abstract explanations feel distant and hard to grasp. The book I am currently working through, 《Spark MLlib机器学习》, does not even give a download link for its datasets. Honestly, I find it mediocre and just want to finish it quickly.
- Data processing
- Generating sample data
I. Data Processing
MLUtils provides helpers for loading, saving, and preprocessing all the data needed by MLlib's algorithms. Its most important method is loadLibSVMFile, which loads data in LIBSVM format and returns it as an RDD; that data can then be fed into classification and regression algorithms.
1. loadLibSVMFile
Loads a file in LIBSVM format; I ran into this format before while building an embedded AI engine~
import org.apache.log4j.{Level, Logger}
import org.apache.spark.SparkContext._
import org.apache.spark.SparkContext
import org.apache.spark.SparkConf
import org.apache.spark.mllib.util.MLUtils
// Vector
import org.apache.spark.mllib.linalg.Vector
// Vector factory
import org.apache.spark.mllib.linalg.Vectors
// Dense vector
import org.apache.spark.mllib.linalg.DenseVector
// Summary statistics
import org.apache.spark.mllib.stat.{MultivariateStatisticalSummary, Statistics}
// Matrix
import org.apache.spark.mllib.linalg.{Matrix, Matrices}
// Distributed row matrix
import org.apache.spark.mllib.linalg.distributed.RowMatrix
// RDD
import org.apache.spark.rdd.RDD

object WordCount {
  def main(args: Array[String]) {
    Logger.getLogger("org.apache.spark").setLevel(Level.ERROR)
    val conf = new SparkConf().setAppName("HACK-AILX10").setMaster("local")
    val sc = new SparkContext(conf)
    // load a LIBSVM-format file; each line becomes a LabeledPoint
    val data = MLUtils.loadLibSVMFile(sc, "C:studysparkailx10.txt")
    data.foreach(println(_))
  }
}
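For reference, LIBSVM is a plain-text format with one sample per line: a label followed by space-separated index:value pairs, with one-based, ascending indices. The lines below are a made-up illustration of what a file such as ailx10.txt could look like (the numbers are purely for demonstration); loadLibSVMFile turns each line into a LabeledPoint and converts the indices to zero-based internally.

1 1:0.5 3:1.2
0 2:0.8 4:0.3
1 1:1.0 2:0.7 4:0.1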
2. saveLibSVMFile
Saves data in LIBSVM format to the specified path.
import org.apache.log4j.{Level, Logger}
import org.apache.spark.SparkContext._
import org.apache.spark.SparkContext
import org.apache.spark.SparkConf
import org.apache.spark.mllib.util.MLUtils
// Vector
import org.apache.spark.mllib.linalg.Vector
// Vector factory
import org.apache.spark.mllib.linalg.Vectors
// Dense vector
import org.apache.spark.mllib.linalg.DenseVector
// Summary statistics
import org.apache.spark.mllib.stat.{MultivariateStatisticalSummary, Statistics}
// Matrix
import org.apache.spark.mllib.linalg.{Matrix, Matrices}
// Distributed row matrix
import org.apache.spark.mllib.linalg.distributed.RowMatrix
// RDD
import org.apache.spark.rdd.RDD

object WordCount {
  def main(args: Array[String]) {
    Logger.getLogger("org.apache.spark").setLevel(Level.ERROR)
    val conf = new SparkConf().setAppName("HACK-AILX10").setMaster("local")
    val sc = new SparkContext(conf)
    val data = MLUtils.loadLibSVMFile(sc, "C:studysparkailx10.txt")
    // coalesce(1, true) shuffles the data into a single partition,
    // so the output directory contains just one part file
    MLUtils.saveAsLibSVMFile(data.coalesce(1, true), "C:studysparkhack")
  }
}
3. appendBias
Appends a bias term to a vector; used in regression and classification algorithms (skipped here, more on it later).
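Although the book skips it, here is a minimal sketch of what appendBias does, runnable in spark-shell (my own illustration, not from the book): it returns a new vector with a constant 1.0 appended as the last element, which serves as the intercept term.

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.util.MLUtils

val v = Vectors.dense(2.0, 3.0)
// appendBias adds a trailing 1.0, so the result should be [2.0, 3.0, 1.0]
val vWithBias = MLUtils.appendBias(v)
println(vWithBias)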
4. fastSquaredDistance
A fast way to compute the squared distance between vectors, used mainly inside the KMeans clustering algorithm (skipped here, more on it later).
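As far as I can tell, fastSquaredDistance is package-private inside MLlib, so it is not meant to be called from user code directly. If you only need the squared Euclidean distance between two vectors, the public helper Vectors.sqdist does that; a minimal spark-shell sketch (my own, not from the book):

import org.apache.spark.mllib.linalg.Vectors

val a = Vectors.dense(1.0, 2.0)
val b = Vectors.dense(4.0, 6.0)
// squared Euclidean distance: (4-1)^2 + (6-2)^2 = 25.0
println(Vectors.sqdist(a, b))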
II. Generating Sample Data
1. generateKMeansRDD
Generates training samples for KMeans, returned as RDD[Array[Double]].
import org.apache.log4j.{Level, Logger}
import org.apache.spark
import org.apache.spark.SparkContext._
import org.apache.spark.SparkContext
import org.apache.spark.SparkConf
import org.apache.spark.mllib.util.{KMeansDataGenerator, MLUtils}
// Vector
import org.apache.spark.mllib.linalg.Vector
// Vector factory
import org.apache.spark.mllib.linalg.Vectors
// Dense vector
import org.apache.spark.mllib.linalg.DenseVector
// Summary statistics
import org.apache.spark.mllib.stat.{MultivariateStatisticalSummary, Statistics}
// Matrix
import org.apache.spark.mllib.linalg.{Matrix, Matrices}
// Distributed row matrix
import org.apache.spark.mllib.linalg.distributed.RowMatrix
// RDD
import org.apache.spark.rdd.RDD

object WordCount {
  def main(args: Array[String]) {
    Logger.getLogger("org.apache.spark").setLevel(Level.ERROR)
    val conf = new SparkConf().setAppName("HACK-AILX10").setMaster("local")
    val sc = new SparkContext(conf)
    // arguments: 10 points, 2 cluster centers, 5 dimensions,
    // scaling factor 1.0, 3 partitions
    val ailx_kmeans_rdd = KMeansDataGenerator.generateKMeansRDD(sc,
      10, 2, 5, 1.0, 3)
    // print each generated point (an Array[Double]) on its own line
    for (i <- ailx_kmeans_rdd) {
      for (j <- i) {
        print(j + " ")
      }
      println()
    }
  }
}
2. generateLinearRDD
Generates training samples for linear regression, returned as RDD[LabeledPoint].
import org.apache.log4j.{Level, Logger}
import org.apache.spark
import org.apache.spark.SparkContext._
import org.apache.spark.SparkContext
import org.apache.spark.SparkConf
import org.apache.spark.mllib.util.{KMeansDataGenerator, LinearDataGenerator, MLUtils}
// Vector
import org.apache.spark.mllib.linalg.Vector
// Vector factory
import org.apache.spark.mllib.linalg.Vectors
// Dense vector
import org.apache.spark.mllib.linalg.DenseVector
// Summary statistics
import org.apache.spark.mllib.stat.{MultivariateStatisticalSummary, Statistics}
// Matrix
import org.apache.spark.mllib.linalg.{Matrix, Matrices}
// Distributed row matrix
import org.apache.spark.mllib.linalg.distributed.RowMatrix
// RDD
import org.apache.spark.rdd.RDD

object WordCount {
  def main(args: Array[String]) {
    Logger.getLogger("org.apache.spark").setLevel(Level.ERROR)
    val conf = new SparkConf().setAppName("HACK-AILX10").setMaster("local")
    val sc = new SparkContext(conf)
    // arguments: 10 examples, 3 features, noise eps 1.0,
    // 1 partition, intercept 0.0
    val ailx_linear_rdd = LinearDataGenerator.generateLinearRDD(sc,
      10, 3, 1.0, 1, 0)
    ailx_linear_rdd.foreach(println)
  }
}
3. generateLogisticRDD
Generates training samples for logistic regression, returned as RDD[LabeledPoint].
import org.apache.log4j.{Level, Logger}
import org.apache.spark
import org.apache.spark.SparkContext._
import org.apache.spark.SparkContext
import org.apache.spark.SparkConf
import org.apache.spark.mllib.util.{KMeansDataGenerator, LinearDataGenerator, LogisticRegressionDataGenerator, MLUtils}
// Vector
import org.apache.spark.mllib.linalg.Vector
// Vector factory
import org.apache.spark.mllib.linalg.Vectors
// Dense vector
import org.apache.spark.mllib.linalg.DenseVector
// Summary statistics
import org.apache.spark.mllib.stat.{MultivariateStatisticalSummary, Statistics}
// Matrix
import org.apache.spark.mllib.linalg.{Matrix, Matrices}
// Distributed row matrix
import org.apache.spark.mllib.linalg.distributed.RowMatrix
// RDD
import org.apache.spark.rdd.RDD

object WordCount {
  def main(args: Array[String]) {
    Logger.getLogger("org.apache.spark").setLevel(Level.ERROR)
    val conf = new SparkConf().setAppName("HACK-AILX10").setMaster("local")
    val sc = new SparkContext(conf)
    // arguments: 10 examples, 3 features, eps 1.0,
    // 1 partition, probability of label 1 = 0.5
    val ailx_logistic_rdd = LogisticRegressionDataGenerator.generateLogisticRDD(sc,
      10, 3, 1.0, 1, 0.5)
    ailx_logistic_rdd.foreach(println)
  }
}
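To tie the two halves of this post together: since the whole problem is getting data in the first place, one option (my own sketch, not from the book) is to generate samples with one of the generators above and then write them out with saveAsLibSVMFile, which leaves you with a local LIBSVM file to experiment on. The output path below is only an illustration.

import org.apache.log4j.{Level, Logger}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.util.{LinearDataGenerator, MLUtils}

object GenerateAndSave {
  def main(args: Array[String]): Unit = {
    Logger.getLogger("org.apache.spark").setLevel(Level.ERROR)
    val conf = new SparkConf().setAppName("HACK-AILX10").setMaster("local")
    val sc = new SparkContext(conf)
    // generate 100 linear-regression samples with 3 features (eps = 1.0)
    val samples = LinearDataGenerator.generateLinearRDD(sc, 100, 3, 1.0)
    // merge into one partition so the output directory holds a single part file
    MLUtils.saveAsLibSVMFile(samples.coalesce(1, shuffle = true), "C:/study/spark/generated")
    sc.stop()
  }
}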
That's it for this post~