Update 2018-07-12: sqlContext throws an error. This is easy to miss: since Spark 2.0 it can no longer be used directly, but is accessed through the SparkSession instead, so you need to pull it from there:
val sqlContext = spark.sqlContext — already tested, this works.
Reading libsvm-format data and converting it to a DataFrame:
val df = spark.read.format("libsvm").load("/home/uaa/dap/sex_test")
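To make the libsvm layout concrete: each line is a label followed by index:value pairs, with 1-based ascending indices. The tiny parser below is only a sketch to illustrate the format (the real loader is spark.read.format("libsvm") as above); the sample line and the parseLibsvm name are made up for illustration.

```scala
// A libsvm line looks like: "label idx1:val1 idx2:val2 ..." (indices 1-based).
// Hypothetical parser, for illustration only.
def parseLibsvm(line: String): (Double, Array[(Int, Double)]) = {
  val tokens = line.trim.split("\\s+")
  val label = tokens.head.toDouble
  val features = tokens.tail.map { t =>
    val Array(i, v) = t.split(":")
    (i.toInt - 1, v.toDouble) // shift to 0-based indices
  }
  (label, features)
}

val (label, features) = parseLibsvm("1.0 3:2.5 7:0.1")
// label is 1.0; features are Array((2, 2.5), (6, 0.1))
```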
MLlib data types: a prerequisite for training is that the input data has the correct format; MLlib accepts the following input types.
import org.apache.spark.mllib.linalg.{Vector, Vectors}
val a: Vector = Vectors.dense(1.0, 2.0, 3.0, 0.0) creates a dense vector.
A vector can also be represented sparsely: val a: Vector = Vectors.sparse(4, Array(0, 1, 2), Array(1.0, 2.0, 3.0)). In the sparse form, the first argument is the vector's length, the first array holds the indices of the non-zero elements, and the second array holds their values.
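To see that the two forms agree, a quick sketch (requires the spark-mllib dependency on the classpath):

```scala
import org.apache.spark.mllib.linalg.{Vector, Vectors}

// Sparse storage keeps only the length, the non-zero indices, and their values.
val dense: Vector = Vectors.dense(1.0, 2.0, 3.0, 0.0)
val sparse: Vector = Vectors.sparse(4, Array(0, 1, 2), Array(1.0, 2.0, 3.0))

// Both expand to the same array:
// dense.toArray and sparse.toArray are both Array(1.0, 2.0, 3.0, 0.0)
```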
A vector alone is not enough: in supervised learning we also need labels. In spark.mllib, a LabeledPoint is typically used to pair a label with its feature vector.
import org.apache.spark.mllib.regression.LabeledPoint
val test = LabeledPoint(1.0, Vectors.dense(1.0, 2.0, 3.0)); you can also load labeled data directly with the MLUtils helper, e.g.: import org.apache.spark.mllib.util.MLUtils
val data: RDD[LabeledPoint] = MLUtils.loadLibSVMFile(sc, "/path/test.txt")
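A minimal sketch of constructing labeled points by hand (the labels and feature values here are made up; requires spark-mllib on the classpath):

```scala
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

// A LabeledPoint pairs a Double label with a feature vector.
val pos = LabeledPoint(1.0, Vectors.dense(1.0, 2.0, 3.0))

// The features can just as well be sparse:
val neg = LabeledPoint(0.0, Vectors.sparse(3, Array(0, 2), Array(1.0, 3.0)))

// pos.label is 1.0; pos.features.size is 3
```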
Matrices: import org.apache.spark.mllib.linalg.{Matrix, Matrices}
val sparseMatrix: Matrix = Matrices.sparse(3, 2, Array(0, 1, 3), Array(0, 2, 1), Array(9.0, 6.0, 8.0)) — the arguments are (numRows, numCols, colPtrs, rowIndices, values) in compressed sparse column (CSC) form.
val denseMatrix: Matrix = Matrices.dense(3, 2, Array(1.0, 2.0, 3.0, 4.0, 5.0, 6.0)) — the values are laid out in column-major order. As these snippets show, Scala is a statically typed language: a variable's type is declared before use.
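The column-major and CSC layouts are easiest to see side by side; a sketch with made-up values:

```scala
import org.apache.spark.mllib.linalg.{Matrix, Matrices}

// Dense 3x2 matrix; the value array is read column by column:
//   1.0  4.0
//   2.0  5.0
//   3.0  6.0
val m: Matrix = Matrices.dense(3, 2, Array(1.0, 2.0, 3.0, 4.0, 5.0, 6.0))

// Sparse 3x2 matrix in CSC form:
//   colPtrs    = Array(0, 1, 3) -> column 0 holds 1 value, column 1 holds 2
//   rowIndices = Array(0, 2, 1) -> row of each stored value
//   values     = Array(9.0, 6.0, 8.0)
val s: Matrix = Matrices.sparse(3, 2, Array(0, 1, 3), Array(0, 2, 1), Array(9.0, 6.0, 8.0))
// so s(0, 0) is 9.0, s(2, 1) is 6.0, s(1, 1) is 8.0
```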
Computing summary statistics:
import org.apache.spark.mllib.stat.{MultivariateStatisticalSummary, Statistics}
val observations: RDD[Vector] = .....
val summary: MultivariateStatisticalSummary = Statistics.colStats(observations)
After this you can read off column statistics such as summary.mean, summary.variance, and so on.
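Putting it together, a sketch assuming an existing SparkContext sc; the sample rows are made up:

```scala
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.mllib.stat.{MultivariateStatisticalSummary, Statistics}
import org.apache.spark.rdd.RDD

val observations: RDD[Vector] = sc.parallelize(Seq(
  Vectors.dense(1.0, 10.0),
  Vectors.dense(2.0, 20.0),
  Vectors.dense(3.0, 30.0)
))

val summary: MultivariateStatisticalSummary = Statistics.colStats(observations)
summary.mean        // per-column means: [2.0, 20.0]
summary.variance    // per-column variances: [1.0, 100.0]
summary.numNonzeros // non-zero counts per column
```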
Correlation: import org.apache.spark.mllib.stat.Statistics
val seriesX: RDD[Double] = ...... val seriesY: RDD[Double] = ....
val correlation: Double = Statistics.corr(seriesX, seriesY, "pearson")
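A sketch, again assuming an existing SparkContext sc; the two series are made up, and since seriesY is an exact multiple of seriesX the Pearson coefficient comes out as 1.0:

```scala
import org.apache.spark.mllib.stat.Statistics
import org.apache.spark.rdd.RDD

val seriesX: RDD[Double] = sc.parallelize(Seq(1.0, 2.0, 3.0, 4.0))
val seriesY: RDD[Double] = sc.parallelize(Seq(2.0, 4.0, 6.0, 8.0))

// "pearson" is the default method; "spearman" is also supported.
val correlation: Double = Statistics.corr(seriesX, seriesY, "pearson")
// correlation is 1.0 for perfectly linear series
```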
Key-value pairs: Spark provides extra operations on pair RDDs via import org.apache.spark.rdd.PairRDDFunctions; note that the exact-sampling variant below is not available in Python. Stratified sampling by key:
val fractions: Map[K, Double] = ..... // sampling fraction for each key
val approxSample = data.sampleByKey(withReplacement = false, fractions)
val exactSample = data.sampleByKeyExact(withReplacement = false, fractions)
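A stratified-sampling sketch assuming an existing SparkContext sc; the keys and fractions are made up:

```scala
import org.apache.spark.rdd.RDD

val data: RDD[(String, Int)] = sc.parallelize(Seq(
  ("a", 1), ("a", 2), ("a", 3), ("b", 4), ("b", 5)
))

// Sampling fraction per key: keep about half of the "a" rows, all "b" rows.
val fractions = Map("a" -> 0.5, "b" -> 1.0)

// sampleByKey: one pass, approximate per-key sample sizes.
val approxSample = data.sampleByKey(withReplacement = false, fractions)

// sampleByKeyExact: extra passes, exact per-key sample sizes (Scala/Java only).
val exactSample = data.sampleByKeyExact(withReplacement = false, fractions)
```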