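I. Machine learning basics
Supervised learning uses training samples that have been labeled in advance (with human involvement). Classification algorithms predict discrete variables (decision trees, kNN, SVM, Bayes classifiers, perceptron); regression algorithms predict continuous variables (linear regression, nonlinear regression).
Unsupervised learning works on training data that carries no labels, for example clustering algorithms and certain neural networks (such as autoencoders).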
II. Components of MLlib
1. Libraries:
a. spark-mllib: the original API, built on top of RDDs
b. spark-ml: a higher-level API built on DataFrames
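To make the difference between the two packages concrete, here is a minimal sketch of the DataFrame-based spark-ml API. It assumes a local SparkSession; the application name, the made-up training rows, and the parameter values are illustrative assumptions, not part of the original notes. Note that spark-ml uses the vector types from org.apache.spark.ml.linalg rather than org.apache.spark.mllib.linalg.

```scala
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession

// Local session for illustration only (app name and master are assumptions).
val spark = SparkSession.builder()
  .appName("ml-api-sketch")
  .master("local[2]")
  .getOrCreate()

// Made-up training data: a DataFrame with "label" and "features" columns,
// which are the default column names spark-ml estimators look for.
val training = spark.createDataFrame(Seq(
  (1.0, Vectors.dense(2.0, 0.1)),
  (1.0, Vectors.dense(1.5, 0.3)),
  (0.0, Vectors.dense(0.2, 1.8)),
  (0.0, Vectors.dense(0.1, 2.2))
)).toDF("label", "features")

// An estimator is configured with setters and fitted on a DataFrame.
val lr = new LogisticRegression().setMaxIter(10).setRegParam(0.01)
val model = lr.fit(training)

// The fitted model is a transformer: it appends prediction columns.
val predictions = model.transform(training).select("label", "prediction")
predictions.show()
```

In the RDD-based spark-mllib API the same task would take an RDD[LabeledPoint] instead of a DataFrame.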
2. Data types
a. Vectors:
Sparse vectors, dense vectors, and labeled points (vectors with a label)
import org.apache.spark.mllib.linalg.{Vector, Vectors}
// Create a dense vector (1.0, 0.0, 3.0).
val dv: Vector = Vectors.dense(1.0, 0.0, 3.0)
// Create a sparse vector (1.0, 0.0, 3.0) by specifying the indices and values of its nonzero entries.
val sv1: Vector = Vectors.sparse(3, Array(0, 2), Array(1.0, 3.0))
// Create a sparse vector (1.0, 0.0, 3.0) by specifying its nonzero entries as (index, value) pairs.
val sv2: Vector = Vectors.sparse(3, Seq((0, 1.0), (2, 3.0)))
Creating labeled points:
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
// Create a labeled point with a positive label and a dense feature vector.
val pos = LabeledPoint(1.0, Vectors.dense(1.0, 0.0, 3.0))
// Create a labeled point with a negative label and a sparse feature vector.
val neg = LabeledPoint(0.0, Vectors.sparse(3, Array(0, 2), Array(1.0, 3.0)))
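b. Matrices:
Local (single-machine) matrices, which are either dense or sparse
Distributed matrices: RowMatrix / IndexedRowMatrix (commonly used for computing singular values, matrix multiplication, and so on) / CoordinateMatrix / BlockMatrix. The indexed types can be converted into one another through methods such as toIndexedRowMatrix(), toCoordinateMatrix(), and toBlockMatrix(), and toRowMatrix() produces a RowMatrix by dropping the row indices.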
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.mllib.linalg.distributed.RowMatrix
import org.apache.spark.rdd.RDD
val rows: RDD[Vector] = ... // an RDD of local vectors
// Create a RowMatrix from an RDD[Vector].
val mat: RowMatrix = new RowMatrix(rows)
// Get its size.
val m = mat.numRows()
val n = mat.numCols()
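The conversions between distributed matrix types mentioned above can be sketched as follows. This is only an illustration under stated assumptions: it runs against a local SparkContext, and the 3 x 2 matrix, the application name, and the variable names are made up for the example.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.{BlockMatrix, CoordinateMatrix, IndexedRow, IndexedRowMatrix, RowMatrix}

// Local context for illustration only (app name and master are assumptions).
val sc = SparkContext.getOrCreate(
  new SparkConf().setAppName("matrix-sketch").setMaster("local[2]"))

// A made-up 3 x 2 matrix, stored as one IndexedRow per row.
val indexedRows = sc.parallelize(Seq(
  IndexedRow(0L, Vectors.dense(1.0, 2.0)),
  IndexedRow(1L, Vectors.dense(3.0, 4.0)),
  IndexedRow(2L, Vectors.dense(5.0, 6.0))
))
val indexed = new IndexedRowMatrix(indexedRows)

// IndexedRowMatrix -> RowMatrix: the row indices are dropped.
val rowMat: RowMatrix = indexed.toRowMatrix()
// IndexedRowMatrix -> CoordinateMatrix: one (i, j, value) entry per element.
val coordMat: CoordinateMatrix = indexed.toCoordinateMatrix()
// CoordinateMatrix -> BlockMatrix, using the default block size.
val blockMat: BlockMatrix = coordMat.toBlockMatrix()

// BlockMatrix supports distributed multiplication: (2 x 3) * (3 x 2) gives 2 x 2.
val gram = blockMat.transpose.multiply(blockMat)
println(s"${rowMat.numRows()} x ${rowMat.numCols()}")
println(gram.toLocalMatrix())
```

Which representation to pick depends on the operation: RowMatrix and IndexedRowMatrix expose computeSVD, while BlockMatrix is the type that offers add and multiply against another distributed matrix.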
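3. Statistics library
Summary statistics: min/max/mean (Statistics.colStats())
Correlation analysis: Statistics.corr()
Random data generation: val u = RandomRDDs.normalRDD(sc, 100000L, 10)
Hypothesis testing, and so on.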
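The statistics functions listed above can be tried together in one small sketch. It assumes a local SparkContext; the observation values, the application name, and the variable names are made up for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.random.RandomRDDs
import org.apache.spark.mllib.stat.Statistics

// Local context for illustration only (app name and master are assumptions).
val sc = SparkContext.getOrCreate(
  new SparkConf().setAppName("stats-sketch").setMaster("local[2]"))

// Made-up observations: each vector is one row, each position one column.
val observations = sc.parallelize(Seq(
  Vectors.dense(1.0, 10.0),
  Vectors.dense(2.0, 20.0),
  Vectors.dense(3.0, 30.0)
))

// Column-wise summary statistics (min, max, mean, variance, ...).
val summary = Statistics.colStats(observations)
println(summary.min)
println(summary.max)
println(summary.mean)

// Pearson correlation between two series; both RDDs need the same
// number of partitions and the same number of elements per partition.
val x = sc.parallelize(Seq(1.0, 2.0, 3.0, 4.0), 2)
val y = sc.parallelize(Seq(2.0, 4.0, 6.0, 8.5), 2)
val correlation = Statistics.corr(x, y, "pearson")
println(correlation)

// 100000 i.i.d. samples from the standard normal distribution, in 10 partitions.
val u = RandomRDDs.normalRDD(sc, 100000L, 10)

// Chi-squared goodness-of-fit test of observed counts against a uniform distribution.
val testResult = Statistics.chiSqTest(Vectors.dense(25.0, 30.0, 20.0, 25.0))
println(testResult.pValue)
```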
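4. Machine learning algorithms
a. Classification algorithms
b. Regression algorithms: LogisticRegressionWithSGD.train() (despite the name, logistic regression is normally used as a classifier; this SGD variant has been deprecated since Spark 2.0)
c. Clustering algorithms: KMeans.train()
d. Collaborative filtering algorithms
e. Dimensionality reduction algorithms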
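As one example from the list above, here is a hedged sketch of training a binary classifier with the RDD-based API. It uses LogisticRegressionWithLBFGS, the variant the Spark deprecation notice points to in place of the SGD one; the local SparkContext, the application name, and the four training points are assumptions made for illustration. The points are chosen so that they can be separated without an intercept term, since this API does not add one by default.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

// Local context for illustration only (app name and master are assumptions).
val sc = SparkContext.getOrCreate(
  new SparkConf().setAppName("classification-sketch").setMaster("local[2]"))

// Made-up training set: label 1.0 when the first feature dominates, else 0.0.
val training = sc.parallelize(Seq(
  LabeledPoint(1.0, Vectors.dense(2.0, 0.1)),
  LabeledPoint(1.0, Vectors.dense(1.5, 0.3)),
  LabeledPoint(0.0, Vectors.dense(0.2, 1.8)),
  LabeledPoint(0.0, Vectors.dense(0.1, 2.2))
)).cache()

// Train a two-class logistic regression model.
val model = new LogisticRegressionWithLBFGS()
  .setNumClasses(2)
  .run(training)

// Predict the class of a new, unseen point.
val prediction = model.predict(Vectors.dense(3.0, 0.0))
println(prediction)
```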
K-means clustering algorithm
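A minimal sketch of KMeans.train() on the RDD-based API follows. It assumes a local SparkContext, and the six two-dimensional points, the choice of k = 2, and the iteration count are made up for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

// Local context for illustration only (app name and master are assumptions).
val sc = SparkContext.getOrCreate(
  new SparkConf().setAppName("kmeans-sketch").setMaster("local[2]"))

// Made-up data: two well-separated groups of points.
val points = sc.parallelize(Seq(
  Vectors.dense(0.0, 0.0), Vectors.dense(0.1, 0.1), Vectors.dense(0.2, 0.0),
  Vectors.dense(9.0, 9.0), Vectors.dense(9.1, 9.1), Vectors.dense(9.2, 9.0)
)).cache()

// Arguments: data, k (number of clusters), maximum number of iterations.
val model = KMeans.train(points, 2, 20)
model.clusterCenters.foreach(println)

// Assign new points to the nearest learned center.
val clusterA = model.predict(Vectors.dense(0.05, 0.05))
val clusterB = model.predict(Vectors.dense(9.05, 9.05))
println(s"$clusterA $clusterB")
```

The cluster indices themselves are arbitrary; only the fact that the two test points land in different clusters is meaningful.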
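Recommendation algorithm (ALS): collaborative filtering is the commonly used approach, and it can be user-based or item-based. MLlib's ALS (alternating least squares) implements a model-based form of collaborative filtering through matrix factorization.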
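The ALS recommender can be sketched as below. This is an illustration under assumptions: a local SparkContext, a tiny made-up set of ratings, and arbitrary values for rank, iteration count, and regularization. Results on such a small data set are not meaningful, and they vary from run to run because the factors are initialized randomly.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.recommendation.{ALS, Rating}

// Local context for illustration only (app name and master are assumptions).
val sc = SparkContext.getOrCreate(
  new SparkConf().setAppName("als-sketch").setMaster("local[2]"))

// Made-up explicit ratings: Rating(user, product, rating).
val ratings = sc.parallelize(Seq(
  Rating(1, 10, 5.0), Rating(1, 11, 1.0),
  Rating(2, 10, 4.0), Rating(2, 12, 5.0),
  Rating(3, 11, 2.0), Rating(3, 12, 4.0)
))

// Arguments: ratings, rank (number of latent factors), iterations, lambda (regularization).
val model = ALS.train(ratings, 2, 10, 0.01)

// Predict the rating of user 1 for product 12, which user 1 has not rated.
val predicted = model.predict(1, 12)
println(predicted)

// Top 2 product recommendations for user 1.
val top = model.recommendProducts(1, 2)
top.foreach(println)
```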