1.1 The SVM (Support Vector Machine) Algorithm
For the theory behind support vector machines, see the following articles:
Support Vector Machines (SVM), Part 1
http://www.cnblogs.com/jerrylead/archive/2011/03/13/1982639.html
Support Vector Machines (SVM), Part 2
http://www.cnblogs.com/jerrylead/archive/2011/03/13/1982684.html
http://www.cnblogs.com/jerrylead/archive/2011/03/18/1988406.html
http://www.cnblogs.com/jerrylead/archive/2011/03/18/1988415.html
Support Vector Machines, Part 5: The SMO Algorithm
http://www.cnblogs.com/jerrylead/archive/2011/03/18/1988419.html
[Figure: the SVM objective function and its gradient-descent update formula]
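As a hedged reconstruction, not a copy of the original formulas, the objective and one gradient-descent step can be written out from the components named later in this section (hinge loss with an L2 penalty). The notation is an assumption: labels are taken as y_i in {-1, +1} (the {0, 1} input labels rescaled as y = 2*label - 1), \lambda stands for regParam, B_t is the mini-batch sampled at iteration t, and the step size is assumed to decay as stepSize / sqrt(t).

```latex
\min_{w}\; f(w) = \frac{\lambda}{2}\,\lVert w\rVert_2^2
  + \frac{1}{n}\sum_{i=1}^{n}\max\bigl(0,\; 1 - y_i\, w^{\top}x_i\bigr)

g_t = \frac{1}{|B_t|}\sum_{i\in B_t}
\begin{cases}
-\,y_i\, x_i, & y_i\, w_t^{\top}x_i < 1\\
0, & \text{otherwise}
\end{cases}

w_{t+1} = (1 - \eta_t\lambda)\, w_t - \eta_t\, g_t,
\qquad \eta_t = \frac{\text{stepSize}}{\sqrt{t}}
```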
[Figure: code structure of SVM in MLlib]
1.2 Spark MLlib SVM Source Code Analysis
1.2.1 SVMWithSGD
Training starts from the train function, which is defined in the companion object of the SVMWithSGD class; inside train, a new SVMWithSGD instance is created.
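The class/companion-object arrangement described above can be sketched in plain Scala without Spark. This is only an illustration of the pattern: ToyTrainer, ToyModel, and the placeholder run method are hypothetical names, not part of MLlib, and no real training is performed.

```scala
// Hypothetical stand-ins; none of these names come from Spark.
final case class ToyModel(weights: Vector[Double], intercept: Double)

class ToyTrainer private (private var stepSize: Double, private var numIterations: Int) {
  // Public no-argument constructor supplies default parameters.
  def this() = this(1.0, 100)

  // Placeholder "training": returns zero weights with one entry per feature.
  def run(input: Seq[(Double, Vector[Double])]): ToyModel =
    ToyModel(Vector.fill(input.head._2.size)(0.0), 0.0)
}

object ToyTrainer {
  // The companion object may call the private constructor; outside code may not.
  def train(input: Seq[(Double, Vector[Double])], numIterations: Int, stepSize: Double): ToyModel =
    new ToyTrainer(stepSize, numIterations).run(input)
}
```

Callers would write ToyTrainer.train(data, 10, 0.5) and never construct the trainer with explicit parameters themselves, which keeps the parameterized constructor an internal detail.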
package org.apache.spark.mllib.classification
// 1 Class: SVMWithSGD
class SVMWithSGD private (
    private var stepSize: Double,
    private var numIterations: Int,
    private var regParam: Double,
    private var miniBatchFraction: Double)
  extends GeneralizedLinearAlgorithm[SVMModel] with Serializable {

  private val gradient = new HingeGradient()
  private val updater = new SquaredL2Updater()
  override val optimizer = new GradientDescent(gradient, updater)
    .setStepSize(stepSize)
    .setNumIterations(numIterations)
    .setRegParam(regParam)
    .setMiniBatchFraction(miniBatchFraction)
  override protected val validators = List(DataValidators.binaryLabelValidator)

  /**
   * Construct a SVM object with default parameters: {stepSize: 1.0, numIterations: 100,
   * regParm: 0.01, miniBatchFraction: 1.0}.
   */
  def this() = this(1.0, 100, 0.01, 1.0)

  override protected def createModel(weights: Vector, intercept: Double) = {
    new SVMModel(weights, intercept)
  }
}
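Parameters and members of the SVMWithSGD class:
stepSize: step size used in each iteration; default 1.0
numIterations: number of iterations; default 100
regParam: regularization parameter; default 0.01
miniBatchFraction: fraction of the samples used in each iteration; default 1.0
gradient: HingeGradient(), computes the (sub)gradient of the hinge loss;
updater: SquaredL2Updater(), updates the weights with L2-norm regularization;
optimizer: GradientDescent(gradient, updater), runs the gradient-descent optimization.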
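To make the roles of gradient, updater, and optimizer concrete, here is a minimal plain-Scala sketch of one full-batch step on dense arrays. It is not the MLlib implementation: HingeSgdSketch and its function names are hypothetical, and it assumes labels in {0, 1} rescaled to {-1, +1} and a step size that decays as stepSize / sqrt(iter).

```scala
object HingeSgdSketch {
  // Dot product of two equally sized dense vectors.
  def dot(a: Array[Double], b: Array[Double]): Double =
    a.zip(b).map { case (x, y) => x * y }.sum

  // Hinge-loss subgradient and loss for one example; label is 0.0 or 1.0.
  def hingeGradient(x: Array[Double], label: Double, w: Array[Double]): (Array[Double], Double) = {
    val y = 2.0 * label - 1.0 // map {0, 1} to {-1, +1}
    val margin = y * dot(w, x)
    if (margin < 1.0) (x.map(v => -y * v), 1.0 - margin)
    else (Array.fill(x.length)(0.0), 0.0)
  }

  // L2-regularized update; assumes the step size decays as stepSize / sqrt(iter).
  def l2Update(w: Array[Double], grad: Array[Double], stepSize: Double,
               iter: Int, regParam: Double): Array[Double] = {
    val eta = stepSize / math.sqrt(iter.toDouble)
    w.zip(grad).map { case (wi, gi) => (1.0 - eta * regParam) * wi - eta * gi }
  }

  // One full-batch step: average the per-example subgradients, then update.
  def step(data: Seq[(Double, Array[Double])], w: Array[Double], stepSize: Double,
           iter: Int, regParam: Double): Array[Double] = {
    val sum = data.foldLeft(Array.fill(w.length)(0.0)) { case (acc, (label, x)) =>
      val (g, _) = hingeGradient(x, label, w)
      acc.zip(g).map { case (a, b) => a + b }
    }
    l2Update(w, sum.map(_ / data.size), stepSize, iter, regParam)
  }
}
```

Calling HingeSgdSketch.step repeatedly with iter = 1, 2, ... mimics the fixed number of iterations controlled by numIterations; sampling a subset of data before each call would play the part of miniBatchFraction.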
// 2 The train method
object SVMWithSGD {
  /**
   * Train a SVM model given an RDD of (label, features) pairs. We run a fixed number
   * of iterations of gradient descent using the specified step size. Each iteration uses
   * `miniBatchFraction` fraction of the data to calculate the gradient. The weights used in
   * gradient descent are initialized using the initial weights provided.
   *
   * NOTE: Labels used in SVM should be {0, 1}.
   *
   * @param input RDD of (label, array of features) pairs.
   * @param numIterations Number of iterations of gradient descent to run.
   * @param stepSize Step size to be used for each iteration of gradient descent.
   * @param regParam Regularization parameter.
   * @param miniBatchFraction Fraction of data to be used per iteration.
   * @param initialWeights Initial set of weights to be used. Array should be equal in size to
   *        the number of features in the data.
   */