StreamingLinearRegressionWithSGD Source Code Analysis
Hi everyone, I'm the tough guy who can smash a car's A-pillar with one punch.
Long time no see; I really haven't blogged in a while. I've been preparing for exams and wrote a 2020 year-in-review post. This week I'm continuing the same line of work as before: figuring out how to benchmark the performance of Spark's streaming machine learning algorithms. Today I'll walk through the source code of the streaming linear regression algorithm, and share my take on it from a distributed-computing point of view.
It's long, but you'll get something out of reading it all the way through. Short on time? Bookmark it first! (Important enough that I'd say it three times.)
The official StreamingLinearRegressionWithSGD example
From the Spark website we can find the official example:
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.regression.StreamingLinearRegressionWithSGD
val trainingData = ssc.textFileStream(args(0)).map(LabeledPoint.parse).cache()
val testData = ssc.textFileStream(args(1)).map(LabeledPoint.parse)
val numFeatures = 3
val model = new StreamingLinearRegressionWithSGD()
.setInitialWeights(Vectors.zeros(numFeatures))
model.trainOn(trainingData)
model.predictOnValues(testData.map(lp => (lp.label, lp.features))).print()
ssc.start()
ssc.awaitTermination()
You can see the code falls into four parts: loading the data, initializing the model, training/prediction, and starting the ssc. Loading and cleaning the data will of course differ depending on the data source and format, so here we'll focus on the model initialization and the training/prediction parts.
Initializing the model
val numFeatures = 3
val model = new StreamingLinearRegressionWithSGD()
.setInitialWeights(Vectors.zeros(numFeatures))
As you can see, all StreamingLinearRegressionWithSGD needs is the **initial weights (InitialWeights)**. Let's open StreamingLinearRegressionWithSGD and look at its constructor:
class StreamingLinearRegressionWithSGD private[mllib] (
    private var stepSize: Double,
    private var numIterations: Int,
    private var regParam: Double,
    private var miniBatchFraction: Double)
  extends StreamingLinearAlgorithm[LinearRegressionModel, LinearRegressionWithSGD]
  with Serializable {
  /**
   * Construct a StreamingLinearRegression object with default parameters:
   * {stepSize: 0.1, numIterations: 50, miniBatchFraction: 1.0}.
   * Initial weights must be set before using trainOn or predictOn
   * (see `StreamingLinearAlgorithm`)
   */
  @Since("1.1.0")
  def this() = this(0.1, 50, 0.0, 1.0)
  @Since("1.1.0")
  val algorithm = new LinearRegressionWithSGD(stepSize, numIterations, regParam, miniBatchFraction)
  protected var model: Option[LinearRegressionModel] = None
  ..... // the rest are the parameter setters
You can see that StreamingLinearRegressionWithSGD extends StreamingLinearAlgorithm, which constrains the parameter types. Below the constructor there is a member called algorithm, an instance of LinearRegressionWithSGD, and the model field holds a LinearRegressionModel.
In short, StreamingLinearRegressionWithSGD extends StreamingLinearAlgorithm and carries two key members, algorithm (a LinearRegressionWithSGD) and model (a LinearRegressionModel), each of which handles part of the work.
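Since the rest of the class is just setters for those constructor parameters, a slightly fuller configuration would look something like the sketch below (setStepSize, setNumIterations, setMiniBatchFraction and setInitialWeights are the setters exposed by StreamingLinearRegressionWithSGD; the values are made up):
val configuredModel = new StreamingLinearRegressionWithSGD()
  .setStepSize(0.05)          // SGD step size (default 0.1)
  .setNumIterations(100)      // iterations per batch (default 50)
  .setMiniBatchFraction(0.5)  // fraction of each RDD sampled per iteration (default 1.0)
  .setInitialWeights(Vectors.zeros(numFeatures))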
Summary
Back to the point: in the official streaming linear regression example, model initialization just sets the initial weights, and the streaming linear regression class, through its parent-class relationships plus those initial weights, takes care of the rest of the setup.
Training and prediction
I'll cover this in two parts, since they are two separate code paths, but you need the class inheritance relationships clear in your head first or it gets confusing.
Training: trainOn
Lines 77-101 of StreamingLinearAlgorithm; click into the trainOn code:
/**
 * Update the model by training on batches of data from a DStream.
 * This operation registers a DStream for training the model,
 * and updates the model based on every subsequent
 * batch of data from the stream.
 *
 * @param data DStream containing labeled data
 */
def trainOn(data: DStream[LabeledPoint]): Unit = {
  if (model.isEmpty) {
    throw new IllegalArgumentException("Model must be initialized before starting training.")
  }
  data.foreachRDD { (rdd, time) =>
    if (!rdd.isEmpty) {
      model = Some(algorithm.run(rdd, model.get.weights))
      logInfo(s"Model updated at time ${time.toString}")
      val display = model.get.weights.size match {
        case x if x > 100 => model.get.weights.toArray.take(100).mkString("[", ",", "...")
        case _ => model.get.weights.toArray.mkString("[", ",", "]")
      }
      logInfo(s"Current model: weights, ${display}")
    }
  }
}
This trainOn method belongs to the abstract class StreamingLinearAlgorithm:
abstract class StreamingLinearAlgorithm[
M <: GeneralizedLinearModel,
A <: GeneralizedLinearAlgorithm[M]] extends Logging {
No inheritance is being declared here; the square brackets declare two type parameters. `M <: GeneralizedLinearModel` is an upper type bound: the type M must be GeneralizedLinearModel or one of its subclasses, and likewise for A. So this abstract class is parameterized over two types.
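If upper type bounds are unfamiliar, here is a tiny standalone sketch (nothing to do with Spark, all names made up) of what `A <: SomeType` buys you:
abstract class Animal { def name: String }
class Dog extends Animal { def name = "dog" }
// Container accepts Animal or any subclass, just like M <: GeneralizedLinearModel.
class Container[A <: Animal](val a: A) {
  def label: String = a.name   // Animal's methods are callable thanks to the bound
}
val ok = new Container(new Dog)      // compiles: Dog is a subclass of Animal
// val bad = new Container("oops")   // would not compile: String is not an Animal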
Back to trainOn. The first check (lines 87-89) just verifies that the model has been initialized, so we can skip it. data.foreachRDD (lines 90-100) is the core of training: for each RDD, if it is not empty, algorithm.run() is called to train; the rest only formats the weights for logging, which we can also skip.
Summary
So trainOn is essentially just a loop (foreachRDD); the real work lives in algorithm's run method and in the model's data structure!
algorithm
First, what kind of object is algorithm?
/** The algorithm to use for updating. */
protected val algorithm: A
Clicking on algorithm takes us to its definition: it is a value of the type parameter A, and the comment says it is the algorithm used to update the model. So what is A?
abstract class StreamingLinearAlgorithm[
M <: GeneralizedLinearModel,
A <: GeneralizedLinearAlgorithm[M]] extends Logging {
A must be GeneralizedLinearAlgorithm[M] or one of its subclasses. OK, let's park it here; I have a feeling we'll run into it again shortly.
model
Now for model. It's not hard to guess that model is an M, i.e. GeneralizedLinearModel or a subclass, so it should be a generalized linear model object:
/** The model to be updated and used for prediction. */
protected var model: Option[M]
The definition confirms it: this is the model that gets updated and is then used for prediction.
Summary
So now we know the two key objects, algorithm and model: algorithm's run method updates model, and model, once its weights and intercept are updated, is used for prediction. It's a safe guess that algorithm.run does something like a least-squares / gradient-descent fit, and that model must expose a predict method for scoring samples.
algorithm.run
Lines 233-370 of GeneralizedLinearAlgorithm. Next we dig into algorithm's run method to see how it trains the weights. Click into run; it's long, so rather than pasting the whole thing at once I'll go through it piece by piece.
(Lines 233-239) First the comment:
/**
* Run the algorithm with the configured parameters on an input RDD
* of LabeledPoint entries starting from the initial weights provided.
*
*/
def run(input: RDD[LabeledPoint], initialWeights: Vector): M = {
Starting from the provided initial weight vector and the configured parameters, it trains on the input RDD of LabeledPoint entries. run takes an RDD[LabeledPoint] and initialWeights: Vector, and returns an M, i.e. the model type GeneralizedLinearModel.
(Lines 241-243) Check the number of features: if numFeatures is still negative (i.e. never set), it can't be trusted, so it is taken straight from the input RDD:
if (numFeatures < 0) {
  numFeatures = input.map(_.features.size).first()
}
(Lines 245-248) Check the storage level; it's only a warning, so we can skip it:
if (input.getStorageLevel == StorageLevel.NONE) {
logWarning("The input data is not directly cached, which may hurt performance if its"
+ " parent RDDs are also uncached.")
}
(Lines 250-253) Validate the data; if it fails validation, throw an exception:
// Check the data properties before running the optimizer
if (validateData && !validators.forall(func => func(input))) {
throw new SparkException("Input validation failed.")
}
Now for the main event!
(Lines 255-273) The comment:
/**
 * Scaling columns to unit variance as a heuristic to reduce the condition number:
 *
 * During the optimization process, the convergence (rate) depends on the condition number of
 * the training dataset. Scaling the variables often reduces this condition number
 * heuristically, thus improving the convergence rate. Without reducing the condition number,
 * some training datasets mixing the columns with different scales may not be able to converge.
 *
 * GLMNET and LIBSVM packages perform the scaling to reduce the condition number, and return
 * the weights in the original scale.
 * See page 9 in http://cran.r-project.org/web/packages/glmnet/glmnet.pdf
 *
 * Here, if useFeatureScaling is enabled, we will standardize the training features by dividing
 * the variance of each column (without subtracting the mean), and train the model in the
 * scaled space. Then we transform the coefficients from the scaled space to the original scale
 * as GLMNET and LIBSVM do.
 *
 * Currently, it's only enabled in LogisticRegressionWithLBFGS
 */
In plain terms, this comment explains the feature-scaling trick: run needs to preprocess the data so the optimization converges well (convergence depends on the condition number of the training set, and scaling the columns reduces it), and afterwards the coefficients are transformed back to the original scale, as GLMNET and LIBSVM do.
(Lines 274-278) The feature scaler: when useFeatureScaling is true, a StandardScaler is fitted:
val scaler = if (useFeatureScaling) {
new StandardScaler(withStd = true, withMean = false).fit(input.map(_.features))
} else {
null
}
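As a side note, here is a small standalone sketch of what that scaler does: with withStd = true and withMean = false it scales each column to unit standard deviation without centering it (sc is an existing SparkContext, and the data is made up):
import org.apache.spark.mllib.feature.StandardScaler
import org.apache.spark.mllib.linalg.Vectors
val features = sc.parallelize(Seq(
  Vectors.dense(1.0, 100.0),
  Vectors.dense(2.0, 300.0),
  Vectors.dense(3.0, 500.0)))
val demoScaler = new StandardScaler(withStd = true, withMean = false).fit(features)
val scaled = demoScaler.transform(features)  // each column divided by its standard deviation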
(Lines 280-295) Build the training data, appending the intercept term and applying feature scaling as required:
// Prepend an extra variable consisting of all 1.0's for the intercept.
// TODO: Apply feature scaling to the weight vector instead of input data.
val data =
  if (addIntercept) {              // intercept requested
    if (useFeatureScaling) {       // feature scaling requested
      // For every LabeledPoint in the input RDD, take the label and features,
      // scale the features, then append the bias term
      input.map(lp => (lp.label, appendBias(scaler.transform(lp.features)))).cache()
    } else {
      // No scaling needed; just append the bias term
      input.map(lp => (lp.label, appendBias(lp.features))).cache()
    }
  } else {                         // same idea, without the intercept
    if (useFeatureScaling) {
      input.map(lp => (lp.label, scaler.transform(lp.features))).cache()
    } else {
      input.map(lp => (lp.label, lp.features))
    }
  }
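appendBias (from org.apache.spark.mllib.util.MLUtils) simply tacks a constant 1.0 onto the end of the feature vector, so the intercept can be learned as one more weight. A quick illustration with made-up values:
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.util.MLUtils
val v = Vectors.dense(0.5, -1.2)
val withBias = MLUtils.appendBias(v)  // [0.5, -1.2, 1.0]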
(Lines 297-307) Here the intercept is folded into the weights, producing a new variable initialWeightsWithIntercept:
/**
 * TODO: For better convergence, in logistic regression, the intercepts should be computed
 * from the prior probability distribution of the outcomes; for linear regression,
 * the intercept should be set as the average of response.
 */
val initialWeightsWithIntercept = if (addIntercept && numOfLinearPredictor == 1) {
  appendBias(initialWeights)
} else {
  /** If `numOfLinearPredictor > 1`, initialWeights already contains intercepts. */
  initialWeights
}
(Line 309) Call the Optimizer's optimize method on data and initialWeightsWithIntercept, where data: RDD[(Double, Vector)], initialWeightsWithIntercept: Vector, and the result weightsWithIntercept: Vector:
val weightsWithIntercept = optimizer.optimize(data, initialWeightsWithIntercept)
Here data is an RDD of (LabeledPoint.label, features with the bias appended), and the optimization produces a single vector. I had a look at the optimizer object: it is a trait, with no concrete optimize implementation at this level, so at this point we can't tell how the optimization actually works. That's a gap we'll fill in later.
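For reference, the Optimizer trait (in org.apache.spark.mllib.optimization) is essentially just this one-method contract, which is why clicking into it shows no algorithm:
trait Optimizer extends Serializable {
  def optimize(data: RDD[(Double, Vector)], initialWeights: Vector): Vector
}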
(Lines 311-315) Pull the intercept out of weightsWithIntercept:
val intercept = if (addIntercept && numOfLinearPredictor == 1) {
weightsWithIntercept(weightsWithIntercept.size - 1)
} else {
0.0
}
(Lines 317-321) Pull the weight vector out of weightsWithIntercept:
var weights = if (addIntercept && numOfLinearPredictor == 1) {
Vectors.dense(weightsWithIntercept.toArray.slice(0, weightsWithIntercept.size - 1))
} else {
weightsWithIntercept
}
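In other words, the last element of weightsWithIntercept is the intercept and the rest are the weights. A tiny made-up example:
val weightsWithInterceptExample = Vectors.dense(0.8, -0.3, 2.5)
val interceptExample = weightsWithInterceptExample(weightsWithInterceptExample.size - 1)  // 2.5
val weightsExample = Vectors.dense(weightsWithInterceptExample.toArray.dropRight(1))      // [0.8, -0.3]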
(Lines 323-330) The comment:
/**
* The weights and intercept are trained in the scaled space; we're converting them back to
* the original scale.
*
* Math shows that if we only perform standardization without subtracting means, the intercept
* will not be changed. w_i = w_i' / v_i where w_i' is the coefficient in the scaled space, w_i
* is the coefficient in the original space, and v_i is the variance of the column i.
*/
The weights and intercept were trained in the scaled space, so now they have to be converted back to the original scale. As the comment notes, because we only standardized without subtracting the means, the intercept is unchanged and each weight is just rescaled (w_i = w_i' / v_i).
So what follows is the un-scaling of the weights.
(Lines 331-355) The un-scaling:
if (useFeatureScaling) {             // only when feature scaling was applied
  if (numOfLinearPredictor == 1) {   // a single linear predictor -- our linear regression case
    weights = scaler.transform(weights)  // just transform the weights back with the scaler
  } else {                           // multiple linear predictors (e.g. multinomial logistic regression)
    /**
     * For `numOfLinearPredictor > 1`, we have to transform the weights back to the original
     * scale for each set of linear predictor. Note that the intercepts have to be explicitly
     * excluded when `addIntercept == true` since the intercepts are part of weights now.
     */
    var i = 0
    val n = weights.size / numOfLinearPredictor
    val weightsArray = weights.toArray
    while (i < numOfLinearPredictor) {
      val start = i * n
      val end = (i + 1) * n - { if (addIntercept) 1 else 0 }
      val partialWeightsArray = scaler.transform(
        Vectors.dense(weightsArray.slice(start, end))).toArray
      System.arraycopy(partialWeightsArray, 0, weightsArray, start, partialWeightsArray.length)
      i += 1
    }
    weights = Vectors.dense(weightsArray)
  }
}
The multi-predictor (logistic regression) branch is fiddly, but readable if you go through it carefully. Our focus this time is linear regression, so I won't dwell on it.
(Lines 357-368) Wrap-up:
// Warn at the end of the run as well, for increased visibility.
if (input.getStorageLevel == StorageLevel.NONE) {
logWarning("The input data was not directly cached, which may hurt performance if its" + " parent RDDs are also uncached.")
}
// Unpersist cached data
if (data.getStorageLevel != StorageLevel.NONE) {
data.unpersist()
}
createModel(weights, intercept)
Check the storage level again, log warnings as needed, unpersist the cached data, and finally build a fresh model via createModel(weights, intercept), which is what the whole run method returns.
Summary
That's the whole training path: a conventional training procedure that trains out weights and an intercept and then builds a new model from them and returns it.
Prediction: predictOn
val lr: StreamingLinearRegressionWithSGD = new StreamingLinearRegressionWithSGD().setInitialWeights(Vectors.zeros(10))
lr.predictOn() // argument omitted here; this line is only for jumping into the method
Click into predictOn:
/**
* Use the model to make predictions on batches of data from a DStream
*
* @param data DStream containing feature vectors
* @return DStream containing predictions
*
*/
@Since("1.1.0")
def predictOn(data: DStream[Vector]): DStream[Double] = {
if (model.isEmpty) {
throw new IllegalArgumentException("Model must be initialized before starting prediction.")
}
data.map{x => model.get.predict(x)}
}
Clearly this uses the trained model to make predictions, and the core is the last line: data starts as a DStream[Vector], we map over it, and the model's predict method is applied to each Vector.
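By the way, the official example at the top used predictOnValues rather than predictOn: it takes a DStream of (key, Vector) pairs and keeps the key next to each prediction, which is convenient for carrying the true label along:
model.predictOnValues(testData.map(lp => (lp.label, lp.features))).print()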
Click into model.get.predict(x) and you land at lines 82-84 of GeneralizedLinearModel:
/**
* Predict values for a single data point using the model trained.
*
* @param testData array representing a single data point
* @return Double prediction from the trained model
*
*/
@Since("1.0.0")
def predict(testData: Vector): Double = {
predictPoint(testData, weights, intercept)
}
This delegates to predictPoint, passing in the sample vector to predict, the weight vector, and the intercept. Click into predictPoint:
protected def predictPoint(dataMatrix: Vector, weightMatrix: Vector, intercept: Double): Double
We're still inside the abstract class GeneralizedLinearModel, and the method has no body, so it is clearly an abstract method of the base class. From the inheritance analysis earlier we know the linear regression model (LinearRegressionModel) extends this generalized linear model (GeneralizedLinearModel), so the implementation should live there:
class LinearRegressionModel @Since("1.1.0") (
@Since("1.0.0") override val weights: Vector,
@Since("0.8.0") override val intercept: Double)
extends GeneralizedLinearModel(weights, intercept) with RegressionModel with Serializable
with Saveable with PMMLExportable {
override protected def predictPoint(
dataMatrix: Vector,
weightMatrix: Vector,
intercept: Double): Double = {
weightMatrix.asBreeze.dot(dataMatrix.asBreeze) + intercept
}
Open LinearRegressionModel and this is the very first method, and the implementation is simple: the dot product of the weight vector with the sample vector, plus the intercept. Exactly the textbook formula, nothing surprising.
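As a sanity check, here is the same formula computed by hand on made-up numbers (asBreeze is internal to Spark, so this sketch just does the dot product manually):
import org.apache.spark.mllib.linalg.Vectors
val w = Vectors.dense(2.0, -1.0)
val x = Vectors.dense(3.0, 4.0)
val b = 0.5
val prediction = w.toArray.zip(x.toArray).map { case (wi, xi) => wi * xi }.sum + b
// 2*3 + (-1)*4 + 0.5 = 2.5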
Summary
That wraps up the predictOn side, and this loop wasn't complicated: the generalized linear model defines the interface, and the linear regression model extends it and implements predictPoint. We also now have a rough picture of the whole inheritance tree. But from the trainOn part there is still one open question: we never found the optimizer's source code. That's next.
Optimizer
First, we need to work out where the optimizer lives. Remember that the streaming linear regression we're using is the SGD flavor, i.e. stochastic gradient descent, so let's go back to StreamingLinearRegressionWithSGD:
class StreamingLinearRegressionWithSGD private[mllib] (
private var stepSize: Double,
private var numIterations: Int,
private var regParam: Double,
private var miniBatchFraction: Double)
extends StreamingLinearAlgorithm[LinearRegressionModel, LinearRegressionWithSGD]
It extends StreamingLinearAlgorithm, and the streaming linear algorithm class has two type parameters; the second one is LinearRegressionWithSGD:
abstract class StreamingLinearAlgorithm[
M <: GeneralizedLinearModel,
A <: GeneralizedLinearAlgorithm[M]] extends Logging {
StreamingLinearAlgorithm itself doesn't tell us much more, since its bound only says A must be a subtype of GeneralizedLinearAlgorithm. So let's click from the first snippet into LinearRegressionWithSGD:
/**
 * Train a linear regression model with no regularization using Stochastic Gradient Descent.
 * This solves the least squares regression formulation
 *              f(weights) = 1/n ||A weights-y||^2^
 * (which is the mean squared error).
 * Here the data matrix has n rows, and the input RDD holds the set of rows of A, each with
 * its corresponding right hand side label y.
 * See also the documentation for the precise formulation.
 */
@Since("0.8.0")
class LinearRegressionWithSGD private[mllib] (
private var stepSize: Double,
private var numIterations: Int,
private var regParam: Double,
private var miniBatchFraction: Double)
extends GeneralizedLinearAlgorithm[LinearRegressionModel] with Serializable {
private val gradient = new LeastSquaresGradient()
private val updater = new SimpleUpdater()
@Since("0.8.0")
override val optimizer = new GradientDescent(gradient, updater)
.setStepSize(stepSize)
.setNumIterations(numIterations)
.setRegParam(regParam)
.setMiniBatchFraction(miniBatchFraction)
As you can see, optimizer is a GradientDescent object. Click into it:
class GradientDescent private[spark] (private var gradient: Gradient, private var updater: Updater)
extends Optimizer with Logging {
GradientDescent extends the Optimizer trait, and its optimize method is the last method in the class:
/**
 * Runs gradient descent on the given training data.
 * @param data training data
 * @param initialWeights initial weights
 * @return solution vector
 */
def optimize(data: RDD[(Double, Vector)], initialWeights: Vector): Vector = {
  val (weights, _) = GradientDescent.runMiniBatchSGD(
    data,               // RDD[(Double, Vector)], i.e. RDD[(label, feature vector)]
    gradient,           // gradient (LeastSquaresGradient here)
    updater,            // updater (SimpleUpdater here)
    stepSize,           // step size
    numIterations,      // number of iterations
    regParam,           // regularization parameter
    miniBatchFraction,  // fraction of data sampled per mini-batch
    initialWeights,     // initial weights -- not necessarily the zero vector; after a few batches it already carries values
    convergenceTol)     // convergence tolerance
  weights
}
All of these parameters are documented on the setters above. In short, optimize calls GradientDescent.runMiniBatchSGD with this bundle of parameters, gets back a tuple (weights, _), and returns the weights vector.
runMiniBatchSGD
Click in and there is a big block of comments plus the method body; every parameter is documented, so I won't repeat them:
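Before going in, it helps to see what the updater piece does in isolation. Here is a sketch calling SimpleUpdater (the no-regularization updater wired in above) directly on made-up numbers; its compute returns the stepped weights plus the regularization value, which is always 0.0 for SimpleUpdater:
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.optimization.SimpleUpdater
val simpleUpdater = new SimpleUpdater()
val (steppedWeights, regVal) = simpleUpdater.compute(
  Vectors.dense(1.0, 1.0),   // current weights
  Vectors.dense(0.5, -0.5),  // averaged gradient for this mini-batch
  0.1,                       // stepSize (the step is scaled by stepSize / sqrt(iter))
  1,                         // iteration number
  0.0)                       // regParam (ignored by SimpleUpdater)
// steppedWeights is roughly [0.95, 1.05], regVal is 0.0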
/**
* Run stochastic gradient descent (SGD) in parallel using mini batches.
* In each iteration, we sample a subset (fraction miniBatchFraction) of the total data
* in order to compute a gradient estimate.
* Sampling, and averaging the subgradients over this subset is performed using one standard
* spark map-reduce in each iteration.
*
* @param data Input data for SGD. RDD of the set of data examples, each of
* the form (label, [feature values]).
* @param gradient Gradient object (used to compute the gradient of the loss function of
* one single data example)
* @param updater Updater function to actually perform a gradient step in a given direction.
* @param stepSize initial step size for the first step
* @param numIterations number of iterations that SGD should be run.
* @param regParam regularization parameter
* @param miniBatchFraction fraction of the input data set that should be used for
* one iteration of SGD. Default value 1.0.
* @param convergenceTol Minibatch iteration will end before numIterations if the relative
* difference between the current weight and the previous weight is less
* than this value. In measuring convergence, L2 norm is calculated.
* Default value 0.001. Must be between 0.0 and 1.0 inclusively.
* @return A tuple containing two elements. The first element is a column matrix containing
* weights for every feature, and the second element is an array containing the
* stochastic loss computed for every iteration.
*/
def runMiniBatchSGD(
data: RDD[(Double, Vector)],
gradient: Gradient,
updater: Updater,
stepSize: Double,
numIterations: Int,
regParam: Double,
miniBatchFraction: Double,
initialWeights: Vector,
convergenceTol: Double): (Vector, Array[Double]) = {
// convergenceTol should be set with non minibatch settings
if (miniBatchFraction < 1.0 && convergenceTol > 0.0) {
logWarning("Testing against a convergenceTol when using miniBatchFraction " +
"< 1.0 can be unstable because of the stochasticity in sampling.")
}
if (numIterations * miniBatchFraction < 1.0) {
logWarning("Not all examples will be used if numIterations * miniBatchFraction < 1.0: " +
s"numIterations=$numIterations and miniBatchFraction=$miniBatchFraction")
}
// History of the stochastic loss at each iteration, kept in an ArrayBuffer sized to numIterations
val stochasticLossHistory = new ArrayBuffer[Double](numIterations)
// Record previous weight and current one to calculate solution vector difference
var previousWeights: Option[Vector] = None
var currentWeights: Option[Vector] = None
// number of training examples
val numExamples = data.count()
// If there is no data, return the initial weights (note: not necessarily a zero vector -- it's whatever weights entered this round of training)
// if no data, return initial weights to avoid NaNs
if (numExamples == 0) {
logWarning("GradientDescent.runMiniBatchSGD returning initial weights, no data found")
return (initialWeights, stochasticLossHistory.toArray)
}
if (numExamples * miniBatchFraction < 1) {
logWarning("The miniBatchFraction is too small")
}
// Initialize weights as a column vector
var weights = Vectors.dense(initialWeights.toArray)
val n = weights.size
/**
* For the first iteration, the regVal will be initialized as sum of weight squares
* if it's L2 updater; for L1 updater, the same logic is followed.
*/
var regVal = updater.compute(
weights, Vectors.zeros(weights.size), 0, 1, regParam)._2
var converged = false // indicates whether converged based on convergenceTol
var i = 1
// Keep iterating until we converge within the given tolerance or hit the iteration limit
while (!converged && i <= numIterations) {
val bcWeights = data.context.broadcast(weights)
// Sample a subset (fraction miniBatchFraction) of the total data
// compute and sum up the subgradients on this subset (this is one map-reduce)
val (gradientSum, lossSum, miniBatchSize) = data.sample(false, miniBatchFraction, 42 + i)
.treeAggregate((BDV.zeros[Double](n), 0.0, 0L))(
seqOp = (c, v) => {
// c: (grad, loss, count), v: (label, features)
val l = gradient.compute(v._2, v._1, bcWeights.value, Vectors.fromBreeze(c._1))
(c._1, c._2 + l, c._3 + 1)
},
combOp = (c1, c2) => {
// c: (grad, loss, count)
(c1._1 += c2._1, c1._2 + c2._2, c1._3 + c2._3)
})
bcWeights.destroy()
if (miniBatchSize > 0) {
/**
* lossSum is computed using the weights from the previous iteration
* and regVal is the regularization value computed in the previous iteration as well.
*/
stochasticLossHistory += lossSum / miniBatchSize + regVal
val update = updater.compute(
weights, Vectors.fromBreeze(gradientSum / miniBatchSize.toDouble),
stepSize, i, regParam)
weights = update._1
regVal = update._2
previousWeights = currentWeights
currentWeights = Some(weights)
if (previousWeights != None && currentWeights != None) {
converged = isConverged(previousWeights.get,
currentWeights.get, convergenceTol)
}
} else {
logWarning(s"Iteration ($i/$numIterations). The size of sampled batch is zero")
}
i += 1
}
logInfo("GradientDescent.runMiniBatchSGD finished. Last 10 stochastic losses %s".format(
stochasticLossHistory.takeRight(10).mkString(", ")))
(weights, stochasticLossHistory.toArray)
}
Honestly, this chunk is heavy going, so I'll save a dedicated post for it. For now you can think of it this way: optimize takes a batch of samples, starts from the weights handed in by the model, repeatedly samples a mini-batch, computes and averages the subgradients in one map-reduce step, lets the updater apply the step (plus any L1/L2 regularization, if a regularizing updater is configured), and finally hands back a suitable weights vector once it hits the iteration limit or the convergence tolerance.
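To make the loop concrete, here is a single-machine sketch of the same idea (plain Scala, not Spark; all names are mine, the gradient is the least-squares gradient (w.x - y) * x, and the step mimics SimpleUpdater's stepSize / sqrt(iter) scaling):
def miniBatchSGDSketch(
    data: Array[(Double, Array[Double])],   // (label, features)
    initialWeights: Array[Double],
    stepSize: Double,
    numIterations: Int,
    miniBatchFraction: Double): Array[Double] = {
  val rnd = new scala.util.Random(42)
  val weights = initialWeights.clone()
  for (iter <- 1 to numIterations) {
    // Sample roughly miniBatchFraction of the data, like data.sample(false, miniBatchFraction, 42 + i)
    val batch = data.filter(_ => rnd.nextDouble() < miniBatchFraction)
    if (batch.nonEmpty) {
      // Sum the least-squares gradients over the batch
      val gradSum = new Array[Double](weights.length)
      batch.foreach { case (y, x) =>
        val err = x.zip(weights).map { case (xi, wi) => xi * wi }.sum - y
        for (j <- gradSum.indices) gradSum(j) += err * x(j)
      }
      // SimpleUpdater-style step on the averaged gradient: w -= (stepSize / sqrt(iter)) * avgGrad
      val scale = stepSize / math.sqrt(iter) / batch.length
      for (j <- weights.indices) weights(j) -= scale * gradSum(j)
    }
  }
  weights
}
The Spark version does exactly this, except the gradient sum and count are computed with treeAggregate across the cluster, the current weights are broadcast each iteration, and the loop can also stop early once the weight change falls below convergenceTol.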
Inheritance tree
The structure behind StreamingLinearRegressionWithSGD is fairly tangled, so to make it easier to follow I wanted to draw an inheritance tree:
Blame my rusty UML for how it looks, but as long as the general idea comes across it's fine. The core of it is the two classes in the middle, LinearRegressionModel and LinearRegressionWithSGD.
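In text form, the hierarchy we walked through looks roughly like this:
StreamingLinearRegressionWithSGD
  extends StreamingLinearAlgorithm[LinearRegressionModel, LinearRegressionWithSGD]
    - model: Option[LinearRegressionModel]    (the M <: GeneralizedLinearModel slot)
    - algorithm: LinearRegressionWithSGD      (the A <: GeneralizedLinearAlgorithm[M] slot)
LinearRegressionWithSGD extends GeneralizedLinearAlgorithm[LinearRegressionModel]
    - optimizer: GradientDescent(LeastSquaresGradient, SimpleUpdater), which extends Optimizer
LinearRegressionModel extends GeneralizedLinearModel with RegressionModel
    - predictPoint = weights.dot(x) + intercept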
Distributed view
Since we now know the core is still LinearRegressionModel, this is essentially the single-machine algorithm lifted into a distributed setting. I tried to sketch a diagram of it; corrections welcome:
Also, in trainOn we saw that the update strategy appears to simply replace the model wholesale with the one trained on the latest batch, which... feels a bit odd.
Summary
That's it for this source-code walkthrough of the streaming linear regression algorithm. Digging through layer by layer, the structure is roughly clear now: essentially the offline linear regression model is reused inside the streaming one, and the model update strategy still feels a bit off to me; maybe something to poke at later. I also think I need to brush up on the relationship between DStream and RDD.