StreamingLinearRegressionWithSGD Source Code Analysis

Hi everyone, I'm the tough guy who can knock out an A-pillar with one punch.

Long time no see; I really haven't blogged in a while. I've been preparing for exams and wrote a 2020 year-in-review. This week I'm continuing last week's work on how to benchmark the streaming machine learning algorithms in Spark. Today I'll walk through the source code of the streaming linear regression algorithm and, drawing on what I know about distributed computing, share my take on it.

It's a long one. Read it through patiently and you'll get something out of it; if you're short on time, bookmark it first!

The official StreamingLinearRegressionWithSGD example

The official example can be found on the Spark website:

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.regression.StreamingLinearRegressionWithSGD

val trainingData = ssc.textFileStream(args(0)).map(LabeledPoint.parse).cache()
val testData = ssc.textFileStream(args(1)).map(LabeledPoint.parse)

val numFeatures = 3
val model = new StreamingLinearRegressionWithSGD()
  .setInitialWeights(Vectors.zeros(numFeatures))

model.trainOn(trainingData)
model.predictOnValues(testData.map(lp => (lp.label, lp.features))).print()

ssc.start()
ssc.awaitTermination()

The code splits into four parts: loading data, initializing the model, training and prediction, and starting the StreamingContext (ssc). Loading and cleaning data of course depends on the data source and format, so we will focus on the two parts that matter here: initializing the model, and training and prediction.

Initializing the model

val numFeatures = 3
val model = new StreamingLinearRegressionWithSGD()
  .setInitialWeights(Vectors.zeros(numFeatures))

As you can see, StreamingLinearRegressionWithSGD only needs its **initial weights (InitialWeights)** set. Step into StreamingLinearRegressionWithSGD and look at its constructor:

class StreamingLinearRegressionWithSGD private[mllib] (
    private var stepSize: Double,
    private var numIterations: Int,
    private var regParam: Double,
    private var miniBatchFraction: Double)
  extends StreamingLinearAlgorithm[LinearRegressionModel, LinearRegressionWithSGD]
  with Serializable {

  /**
   * Construct a StreamingLinearRegression object with default parameters:
   * {stepSize: 0.1, numIterations: 50, miniBatchFraction: 1.0}.
   * Initial weights must be set before using trainOn or predictOn
   * (see `StreamingLinearAlgorithm`)
   */
  @Since("1.1.0")
  def this() = this(0.1, 50, 0.0, 1.0)

  @Since("1.1.0")
  val algorithm = new LinearRegressionWithSGD(stepSize, numIterations, regParam, miniBatchFraction)

  protected var model: Option[LinearRegressionModel] = None
  
  // ... the rest are parameter setters

So StreamingLinearRegressionWithSGD extends StreamingLinearAlgorithm, which constrains the type parameters. Below that there is a member called algorithm, an instance of LinearRegressionWithSGD, and the model member holds a value of type LinearRegressionModel.

In short, StreamingLinearRegressionWithSGD extends StreamingLinearAlgorithm and carries an algorithm member (a LinearRegressionWithSGD instance) and a model member (a LinearRegressionModel); the two play different roles, as we will see.

Summary

Back to the point: in the official streaming linear regression example, initializing the model means setting the initial weights, and the streaming linear regression class, through its parent-class machinery plus those initial weights, takes care of the rest of the setup.
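Where does that initial model actually come from? As far as I can tell from reading the source, setInitialWeights wraps the weights into the model via the algorithm object; here is a rough sketch of how it is wired up (treat the body as approximate, not a verbatim quote):

// Sketch based on my reading of StreamingLinearRegressionWithSGD:
// the initial weights are wrapped into a LinearRegressionModel with intercept 0.0,
// which is why trainOn can later read model.get.weights without failing.
def setInitialWeights(initialWeights: Vector): this.type = {
  this.model = Some(algorithm.createModel(initialWeights, 0.0))
  this
}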

Training and prediction

I'll cover this in two parts, because they are two separate code paths; but you have to keep their inheritance relationships straight, otherwise it all looks like a mess.

Training: trainOn

Lines 77-101 of StreamingLinearAlgorithm; step into trainOn:

/**
 * Update the model by training on batches of data from a DStream.
 * This operation registers a DStream for training the model,
 * and updates the model based on every subsequent
 * batch of data from the stream.
 *
 * @param data DStream containing labeled data
 */
def trainOn(data: DStream[LabeledPoint]): Unit = {
    if (model.isEmpty) {
        throw new IllegalArgumentException("Model must be initialized before starting training.")
    }
    data.foreachRDD { (rdd, time) =>
        if (!rdd.isEmpty) {
            model = Some(algorithm.run(rdd, model.get.weights))
            logInfo(s"Model updated at time ${time.toString}")
            val display = model.get.weights.size match {
                case x if x > 100 => model.get.weights.toArray.take(100).mkString("[", ",", "...")
                case _ => model.get.weights.toArray.mkString("[", ",", "]")
            }
            logInfo(s"Current model: weights, ${display}")
        }
    }
}

This trainOn method belongs to the abstract class StreamingLinearAlgorithm:

abstract class StreamingLinearAlgorithm[
    M <: GeneralizedLinearModel,
    A <: GeneralizedLinearAlgorithm[M]] extends Logging {

There is no extends clause of interest here, but note the square brackets with M and A: `M <: GeneralizedLinearModel` means the type parameter M must be GeneralizedLinearModel or one of its subclasses (an upper type bound). So this abstract class declares two type parameters.
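If the `<:` notation is new to you, here is a tiny self-contained illustration of an upper type bound (my own example, not from the Spark source):

// Illustration only: T is restricted to Animal or one of its subclasses.
class Animal { def name: String = "animal" }
class Dog extends Animal { override def name: String = "dog" }

class Shelter[T <: Animal](resident: T) {
  // Because T <: Animal, Animal's members can safely be used on resident.
  def who(): String = resident.name
}

val ok = new Shelter(new Dog)    // compiles: Dog is a subclass of Animal
// val bad = new Shelter("cat")  // would not compile: String is not <: Animal
println(ok.who())                // prints "dog"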

Back to trainOn: the first check (lines 87-89) only verifies that the model has been initialized, so we can skip it. data.foreachRDD (lines 90-100) is the heart of the training: it iterates over each RDD and, if the RDD is non-empty, calls algorithm.run() to train; the rest only formats the weights for logging, which we can also skip.

Summary

So trainOn is essentially just a loop (foreachRDD); the real work lives in algorithm's run method and in the model's data structure!

algorithm

First, what kind of object is algorithm?

/** The algorithm to use for updating. */
protected val algorithm: A

Clicking through to its definition, algorithm is of the generic type A; the comment says it is the algorithm object used to update the model. So what is A?

abstract class StreamingLinearAlgorithm[
    M <: GeneralizedLinearModel,
    A <: GeneralizedLinearAlgorithm[M]] extends Logging {

A must be GeneralizedLinearAlgorithm[M] or one of its subclasses (an upper bound again). OK, let's park it here and dig deeper later, because I suspect we will run into it again.

model

Now for model: it isn't hard to guess that model is an M, i.e. GeneralizedLinearModel or a subclass, so it should be a generalized linear model object:

/** The model to be updated and used for prediction. */
  protected var model: Option[M]

The definition confirms it: this is the model object that gets updated and is then used for prediction.

Summary

So now we know the two objects, algorithm and model: algorithm's training routine updates model, and model, once its weights and intercept are updated, is used for prediction. It's not hard to guess that algorithm.run does something like a least-squares-style fit, and that model must have a predict method for scoring samples.

algorithm.run

Lines 233-370 of GeneralizedLinearAlgorithm. Next we dive into algorithm's run method to see how it produces the weights. Step into run; it's long, so let's go through it line by line instead of pasting it all at once.

Lines 233-239) First, the comment:

/**
 * Run the algorithm with the configured parameters on an input RDD
 * of LabeledPoint entries starting from the initial weights provided.
 *
 */
def run(input: RDD[LabeledPoint], initialWeights: Vector): M = {

Starting from the provided initial weight vector and using the configured parameters, it trains on the input RDD[LabeledPoint]. run takes an RDD[LabeledPoint] and initialWeights: Vector, and returns an M, i.e. the model type bounded by GeneralizedLinearModel.

Lines 241-243) Checks the number of features: if numFeatures is still < 0 it cannot be trusted, so it is taken directly from the first element of the input RDD:

if (numFeatures < 0) {
    numFeatures = input.map(_.features.size).first()
}

Lines 245-248) Checks the storage level and just logs a warning; we can skip it:

if (input.getStorageLevel == StorageLevel.NONE) {
    logWarning("The input data is not directly cached, which may hurt performance if its"
               + " parent RDDs are also uncached.")
}

Lines 250-253) Validates the data and throws an exception if validation fails:

// Check the data properties before running the optimizer
if (validateData && !validators.forall(func => func(input))) {
    throw new SparkException("Input validation failed.")
}

Now for the main event!

Lines 255-273) The comment:

/**
 * Scaling columns to unit variance as a heuristic to reduce the condition number:
 *
 * During the optimization process, the convergence (rate) depends on the condition number of
 * the training dataset. Scaling the variables often reduces this condition number
 * heuristically, thus improving the convergence rate. Without reducing the condition number,
 * some training datasets mixing the columns with different scales may not be able to converge.
 *
 * GLMNET and LIBSVM packages perform the scaling to reduce the condition number, and return
 * the weights in the original scale.
 * See page 9 in http://cran.r-project.org/web/packages/glmnet/glmnet.pdf
 *
 * Here, if useFeatureScaling is enabled, we will standardize the training features by dividing
 * the variance of each column (without subtracting the mean), and train the model in the
 * scaled space. Then we transform the coefficients from the scaled space to the original scale
 * as GLMNET and LIBSVM do.
 *
 * Currently, it's only enabled in LogisticRegressionWithLBFGS
 */

In plain terms, this comment explains the feature-scaling approach: run needs to preprocess the data so that the optimization converges well, and the comment spells out how.
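To make the scaling concrete, here is a small, hypothetical example of the same idea (divide each column by its standard deviation, do not subtract the mean), assuming a SparkContext named sc is already available:

import org.apache.spark.mllib.feature.StandardScaler
import org.apache.spark.mllib.linalg.Vectors

// Hypothetical data: two features on very different scales.
val features = sc.parallelize(Seq(
  Vectors.dense(1.0, 1000.0),
  Vectors.dense(2.0, 2000.0),
  Vectors.dense(3.0, 3000.0)))

// withStd = true, withMean = false: divide each column by its standard deviation
// without centering it, which is exactly the heuristic the comment describes.
val scaler = new StandardScaler(withStd = true, withMean = false).fit(features)
scaler.transform(features).collect().foreach(println)  // both columns now have comparable scale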

Lines 274-278) The feature scaler; when useFeatureScaling = true a scaler is built:

val scaler = if (useFeatureScaling) {
    new StandardScaler(withStd = true, withMean = false).fit(input.map(_.features))
} else {
    null
}

Lines 280-295) Builds the data in the right shape: add the bias term if needed, scale if needed:

// Prepend an extra variable consisting of all 1.0's for the intercept.
// TODO: Apply feature scaling to the weight vector instead of input data.
val data =
if (addIntercept) { // an intercept (bias) term is requested
    if (useFeatureScaling) { // feature scaling is requested
        // for every LabeledPoint in the input RDD, take the label and the features,
        // scale the features, then append the bias term
        input.map(lp => (lp.label, appendBias(scaler.transform(lp.features)))).cache()
    } else {
        // no scaling needed, just append the bias term and we're done
        input.map(lp => (lp.label, appendBias(lp.features))).cache()
    }
} else { // same idea, but without the bias term
    if (useFeatureScaling) {
        input.map(lp => (lp.label, scaler.transform(lp.features))).cache()
    } else {
        input.map(lp => (lp.label, lp.features))
    }
}

Lines 297-307) Here the intercept is folded into the weights, producing a new variable initialWeightsWithIntercept:

/**
 * TODO: For better convergence, in logistic regression, the intercepts should be computed
 * from the prior probability distribution of the outcomes; for linear regression,
 * the intercept should be set as the average of response.
 */
val initialWeightsWithIntercept = if (addIntercept && numOfLinearPredictor == 1) {
    appendBias(initialWeights)
} else {
    /** If `numOfLinearPredictor > 1`, initialWeights already contains intercepts. */
    initialWeights
}
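The appendBias used here should be MLUtils.appendBias: it simply appends a constant 1.0 to the end of a vector, so the intercept can be learned as one extra weight. A small illustration with made-up values:

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.util.MLUtils.appendBias

val v = Vectors.dense(0.5, -1.2)
// appendBias adds a trailing 1.0: [0.5, -1.2] becomes [0.5, -1.2, 1.0];
// the weight learned for that last position plays the role of the intercept.
println(appendBias(v))  // [0.5,-1.2,1.0]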

Line 309) Uses the Optimizer's optimize method on data and initialWeightsWithIntercept, where data: RDD[(Double, Vector)], initialWeightsWithIntercept: Vector, and the result weightsWithIntercept: Vector:

val weightsWithIntercept = optimizer.optimize(data, initialWeightsWithIntercept)

data is RDD[(LabeledPoint.label, features with the bias appended)], and the optimization finally yields a vector. I looked at the optimizer object: it is a trait, with no concrete optimize implementation defined there, so at this point we can't see how the optimization actually works; we'll come back to that.
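For reference, the Optimizer trait itself is tiny; as far as I can tell it only declares the optimize contract and leaves the actual algorithm to implementations such as GradientDescent. A sketch from my reading of the source:

import org.apache.spark.rdd.RDD
import org.apache.spark.mllib.linalg.Vector

// Sketch of org.apache.spark.mllib.optimization.Optimizer (approximate, from my reading):
// the trait only fixes the contract; GradientDescent, LBFGS, etc. do the real work.
trait Optimizer extends Serializable {
  def optimize(data: RDD[(Double, Vector)], initialWeights: Vector): Vector
}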

Lines 311-315) Extract the intercept from weightsWithIntercept:

val intercept = if (addIntercept && numOfLinearPredictor == 1) {
    weightsWithIntercept(weightsWithIntercept.size - 1)
} else {
    0.0
}

Lines 317-321) Extract the weight vector from weightsWithIntercept:

var weights = if (addIntercept && numOfLinearPredictor == 1) {
    Vectors.dense(weightsWithIntercept.toArray.slice(0, weightsWithIntercept.size - 1))
} else {
    weightsWithIntercept
}

Lines 323-330) The comment:

/**
 * The weights and intercept are trained in the scaled space; we're converting them back to
 * the original scale.
 *
 * Math shows that if we only perform standardization without subtracting means, the intercept
 * will not be changed. w_i = w_i' / v_i where w_i' is the coefficient in the scaled space, w_i
 * is the coefficient in the original space, and v_i is the variance of the column i.
 */
The weights and intercept were trained in the scaled space; now they have to be converted back to the original space.

So what follows is the back-conversion of the weights and intercept.
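Why does a single scaler.transform call on the weight vector undo the scaling? A short derivation in my own notation, writing $\sigma_i$ for the per-column scale the scaler divides by (the Spark comment loosely calls it the variance $v_i$):

$$\hat{y} \;=\; \sum_i w_i' \,\frac{x_i}{\sigma_i} + b \;=\; \sum_i \frac{w_i'}{\sigma_i}\, x_i + b \quad\Longrightarrow\quad w_i = \frac{w_i'}{\sigma_i}, \qquad b \text{ unchanged.}$$

Because the mean is never subtracted, the intercept stays the same, and dividing each weight by its column's scale is exactly what scaler.transform does, so applying it to the weight vector recovers the original-scale weights.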

Lines 331-355) The back-conversion:

if (useFeatureScaling) { // only runs when feature scaling was applied
    if (numOfLinearPredictor == 1) { // a single linear predictor (our linear regression case)
        weights = scaler.transform(weights) // converting back is just one scaler.transform call
    } else { // multiple linear predictors (e.g. multinomial logistic regression)
        /**
         * For `numOfLinearPredictor > 1`, we have to transform the weights back to the original
         * scale for each set of linear predictor. Note that the intercepts have to be explicitly
         * excluded when `addIntercept == true` since the intercepts are part of weights now.
         */
        var i = 0
        val n = weights.size / numOfLinearPredictor
        val weightsArray = weights.toArray
        while (i < numOfLinearPredictor) {
            val start = i * n
            val end = (i + 1) * n - { if (addIntercept) 1 else 0 }

            val partialWeightsArray = scaler.transform(
                Vectors.dense(weightsArray.slice(start, end))).toArray

            System.arraycopy(partialWeightsArray, 0, weightsArray, start, partialWeightsArray.length)
            i += 1
        }
        weights = Vectors.dense(weightsArray)
    }
}

The multi-predictor branch is fairly tedious, but it's readable if you go through it carefully. Our focus this time is linear regression, so we'll skip it.

Lines 357-368) Wrap-up:

// Warn at the end of the run as well, for increased visibility.
if (input.getStorageLevel == StorageLevel.NONE) {
    logWarning("The input data was not directly cached, which may hurt performance if its" + " parent RDDs are also uncached.")
}

// Unpersist cached data
if (data.getStorageLevel != StorageLevel.NONE) {
    data.unpersist()
}

createModel(weights, intercept)

It checks the storage level again, logs a warning, unpersists the cached data, and finally builds a new model, which the whole run method returns.
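createModel is an abstract hook declared on GeneralizedLinearAlgorithm; for our case LinearRegressionWithSGD overrides it, roughly like this (sketch from my reading of the source):

// Sketch: the trained weights and intercept are simply wrapped into a fresh
// LinearRegressionModel, which is what run() ultimately returns.
override protected def createModel(weights: Vector, intercept: Double): LinearRegressionModel = {
  new LinearRegressionModel(weights, intercept)
}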

Summary

That's the whole run path behind trainOn: a fairly standard training procedure that, in the end, uses the trained weights and intercept to build a new model and return it.

Prediction: predictOn

val lr: StreamingLinearRegressionWithSGD = new StreamingLinearRegressionWithSGD().setInitialWeights(Vectors.zeros(10))
lr.predictOn(testData.map(_.features))  // predictOn expects a DStream[Vector]

Step into predictOn:

/**
   * Use the model to make predictions on batches of data from a DStream
   *
   * @param data DStream containing feature vectors
   * @return DStream containing predictions
   *
   */
@Since("1.1.0")
def predictOn(data: DStream[Vector]): DStream[Double] = {
    if (model.isEmpty) {
        throw new IllegalArgumentException("Model must be initialized before starting prediction.")
    }
    data.map{x => model.get.predict(x)}
}

Clearly this uses the trained model to predict, and the core is the last line: data is a DStream[Vector], we map over it and call the model's predict method on each Vector.

Step into model.get.predict(x) and you land in lines 82-84 of GeneralizedLinearModel:

/**
   * Predict values for a single data point using the model trained.
   *
   * @param testData array representing a single data point
   * @return Double prediction from the trained model
   *
   */
@Since("1.0.0")
def predict(testData: Vector): Double = {
    predictPoint(testData, weights, intercept)
}

This calls predictPoint, passing the sample vector to predict, the weight vector, and the intercept. Step into predictPoint:

protected def predictPoint(dataMatrix: Vector, weightMatrix: Vector, intercept: Double): Double

It's still inside abstract class GeneralizedLinearModel, and the method has no body: it's clearly an abstract hook declared by the class. From the inheritance analysis above we know the linear regression model (LinearRegressionModel) extends this generalized linear model (GeneralizedLinearModel), so the method should be implemented there:

class LinearRegressionModel @Since("1.1.0") (
    @Since("1.0.0") override val weights: Vector,
    @Since("0.8.0") override val intercept: Double)
extends GeneralizedLinearModel(weights, intercept) with RegressionModel with Serializable
with Saveable with PMMLExportable {

    override protected def predictPoint(
        dataMatrix: Vector,
        weightMatrix: Vector,
        intercept: Double): Double = {
        weightMatrix.asBreeze.dot(dataMatrix.asBreeze) + intercept
    }

Open LinearRegressionModel and this is the first method; the implementation is also simple: dot the weight vector with the sample vector and add the intercept. Exactly the textbook formula, nothing surprising.
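A quick sanity check with made-up numbers:

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LinearRegressionModel

// Made-up model: weights [2.0, 3.0], intercept 0.5.
val m = new LinearRegressionModel(Vectors.dense(2.0, 3.0), 0.5)

// prediction = weights . features + intercept = 2.0*1.0 + 3.0*4.0 + 0.5 = 14.5
println(m.predict(Vectors.dense(1.0, 4.0)))  // 14.5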

Summary

That wraps up predictOn. This loop wasn't complicated: a generalized linear model is defined, and the linear regression model extends it and implements the hook. We also now have a rough picture of the whole inheritance tree. One open question from the trainOn part remains, though: we never found the optimizer's source code. Let's do that now.

Optimizer

First, we need to figure out where optimizer lives. Remember that the streaming linear regression we're using is the SGD variant, i.e. stochastic gradient descent, so let's go back to StreamingLinearRegressionWithSGD:

class StreamingLinearRegressionWithSGD private[mllib] (
    private var stepSize: Double,
    private var numIterations: Int,
    private var regParam: Double,
    private var miniBatchFraction: Double)
extends StreamingLinearAlgorithm[LinearRegressionModel, LinearRegressionWithSGD]

It extends StreamingLinearAlgorithm, and the streaming linear algorithm class takes two type parameters; the second one is LinearRegressionWithSGD:

abstract class StreamingLinearAlgorithm[
    M <: GeneralizedLinearModel,
    A <: GeneralizedLinearAlgorithm[M]] extends Logging {

StreamingLinearAlgorithm itself doesn't tell us much more: its bound only says the type must be GeneralizedLinearAlgorithm or a subclass. So let's click through to LinearRegressionWithSGD from the first snippet above:

/**
 * Train a linear regression model with no regularization using Stochastic Gradient Descent.
 * This solves the least squares regression formulation
 *              f(weights) = 1/n ||A weights-y||^2^
 * (which is the mean squared error).
 * Here the data matrix has n rows, and the input RDD holds the set of rows of A, each with
 * its corresponding right hand side label y.
 * See also the documentation for the precise formulation.
 */
@Since("0.8.0")
class LinearRegressionWithSGD private[mllib] (
    private var stepSize: Double,
    private var numIterations: Int,
    private var regParam: Double,
    private var miniBatchFraction: Double)
  extends GeneralizedLinearAlgorithm[LinearRegressionModel] with Serializable {
      private val gradient = new LeastSquaresGradient()
      private val updater = new SimpleUpdater()
      @Since("0.8.0")
      override val optimizer = new GradientDescent(gradient, updater)
      .setStepSize(stepSize)
      .setNumIterations(numIterations)
      .setRegParam(regParam)
      .setMiniBatchFraction(miniBatchFraction)
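Two collaborators are wired into the optimizer here: LeastSquaresGradient and SimpleUpdater. As a hedged sketch of what they compute (heavily simplified from my reading of the optimization package; the real code works in place on Breeze vectors):

// Sketch only: per-example squared-error loss and its gradient.
// diff = w.x - y ; loss = diff^2 / 2 ; gradient = diff * x
def leastSquaresGradient(x: Array[Double], y: Double, w: Array[Double]): (Array[Double], Double) = {
  val diff = x.zip(w).map { case (xi, wi) => xi * wi }.sum - y
  (x.map(_ * diff), diff * diff / 2.0)
}

// Sketch only: SimpleUpdater applies a plain, unregularized gradient step,
// shrinking the step size as 1 / sqrt(iteration).
def simpleUpdate(w: Array[Double], grad: Array[Double],
                 stepSize: Double, iter: Int): Array[Double] = {
  val thisStep = stepSize / math.sqrt(iter)
  w.zip(grad).map { case (wi, gi) => wi - thisStep * gi }
}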

As you can see, optimizer is a GradientDescent object. Step into it:

class GradientDescent private[spark] (private var gradient: Gradient, private var updater: Updater)
extends Optimizer with Logging {

GradientDescent extends the Optimizer trait, and the optimize method can be found as the last method of the class:

/**
   * Runs gradient descent on the given training data.
   * @param data training data
   * @param initialWeights initial weights
   * @return solution vector
   */
def optimize(data: RDD[(Double, Vector)], initialWeights: Vector): Vector = {
    val (weights, _) = GradientDescent.runMiniBatchSGD(
        data, // RDD[(Double, Vector)], i.e. RDD[(label, feature vector)]
        gradient,  // gradient object
        updater, // updater object
        stepSize, // step size
        numIterations, // number of iterations
        regParam, // regularization parameter
        miniBatchFraction, // fraction of the data sampled per mini-batch
        initialWeights, // initial weights; not necessarily a zero vector, after a few batches it carries learned values
        convergenceTol) // convergence tolerance
    weights
}

All of these parameters are documented next to their setters above. In short, optimize hands everything to GradientDescent's runMiniBatchSGD, which does the computation and returns a tuple (weights, _); optimize itself then returns the weights vector.

runMiniBatchSGD

Step into it; there's a large doc comment plus the method body, and every parameter is explained there, so I won't repeat them:

/**
   * Run stochastic gradient descent (SGD) in parallel using mini batches.
   * In each iteration, we sample a subset (fraction miniBatchFraction) of the total data
   * in order to compute a gradient estimate.
   * Sampling, and averaging the subgradients over this subset is performed using one standard
   * spark map-reduce in each iteration.
   *
   * @param data Input data for SGD. RDD of the set of data examples, each of
   *             the form (label, [feature values]).
   * @param gradient Gradient object (used to compute the gradient of the loss function of
   *                 one single data example)
   * @param updater Updater function to actually perform a gradient step in a given direction.
   * @param stepSize initial step size for the first step
   * @param numIterations number of iterations that SGD should be run.
   * @param regParam regularization parameter
   * @param miniBatchFraction fraction of the input data set that should be used for
   *                          one iteration of SGD. Default value 1.0.
   * @param convergenceTol Minibatch iteration will end before numIterations if the relative
   *                       difference between the current weight and the previous weight is less
   *                       than this value. In measuring convergence, L2 norm is calculated.
   *                       Default value 0.001. Must be between 0.0 and 1.0 inclusively.
   * @return A tuple containing two elements. The first element is a column matrix containing
   *         weights for every feature, and the second element is an array containing the
   *         stochastic loss computed for every iteration.
   */
def runMiniBatchSGD(
    data: RDD[(Double, Vector)],
    gradient: Gradient,
    updater: Updater,
    stepSize: Double,
    numIterations: Int,
    regParam: Double,
    miniBatchFraction: Double,
    initialWeights: Vector,
    convergenceTol: Double): (Vector, Array[Double]) = {

    // convergenceTol should be set with non minibatch settings
    if (miniBatchFraction < 1.0 && convergenceTol > 0.0) {
        logWarning("Testing against a convergenceTol when using miniBatchFraction " +
                   "< 1.0 can be unstable because of the stochasticity in sampling.")
    }

    if (numIterations * miniBatchFraction < 1.0) {
        logWarning("Not all examples will be used if numIterations * miniBatchFraction < 1.0: " +
                   s"numIterations=$numIterations and miniBatchFraction=$miniBatchFraction")
    }

    // history of the stochastic losses, kept in an ArrayBuffer sized by the number of iterations
    val stochasticLossHistory = new ArrayBuffer[Double](numIterations)
    // Record previous weight and current one to calculate solution vector difference
    var previousWeights: Option[Vector] = None
    var currentWeights: Option[Vector] = None

    // number of examples
    val numExamples = data.count()

    // if no data, return the initial weights to avoid NaNs
    // (note: not necessarily a zero vector; it's whatever weights this training round started with)
    if (numExamples == 0) {
        logWarning("GradientDescent.runMiniBatchSGD returning initial weights, no data found")
        return (initialWeights, stochasticLossHistory.toArray)
    }

    if (numExamples * miniBatchFraction < 1) {
        logWarning("The miniBatchFraction is too small")
    }

    // Initialize weights as a column vector
    var weights = Vectors.dense(initialWeights.toArray)
    val n = weights.size

    /**
     * For the first iteration, the regVal will be initialized as sum of weight squares
     * if it's L2 updater; for L1 updater, the same logic is followed.
     */
    // On the first iteration, with an L2 updater, regVal is initialized to the sum of squared weights;
    // with an L1 updater the same logic is followed.
    var regVal = updater.compute(
        weights, Vectors.zeros(weights.size), 0, 1, regParam)._2

    var converged = false // indicates whether converged based on convergenceTol
    var i = 1
    // keep iterating until we converge within the specified tolerance or hit the iteration limit
    while (!converged && i <= numIterations) {
        val bcWeights = data.context.broadcast(weights)
        // Sample a subset (fraction miniBatchFraction) of the total data
        // compute and sum up the subgradients on this subset (this is one map-reduce)
        val (gradientSum, lossSum, miniBatchSize) = data.sample(false, miniBatchFraction, 42 + i)
        .treeAggregate((BDV.zeros[Double](n), 0.0, 0L))(
            seqOp = (c, v) => {
                // c: (grad, loss, count), v: (label, features)
                val l = gradient.compute(v._2, v._1, bcWeights.value, Vectors.fromBreeze(c._1))
                (c._1, c._2 + l, c._3 + 1)
            },
            combOp = (c1, c2) => {
                // c: (grad, loss, count)
                (c1._1 += c2._1, c1._2 + c2._2, c1._3 + c2._3)
            })
        bcWeights.destroy()

        if (miniBatchSize > 0) {
            /**
             * lossSum is computed using the weights from the previous iteration
             * and regVal is the regularization value computed in the previous iteration as well.
             */
            stochasticLossHistory += lossSum / miniBatchSize + regVal
            val update = updater.compute(
                weights, Vectors.fromBreeze(gradientSum / miniBatchSize.toDouble),
                stepSize, i, regParam)
            weights = update._1
            regVal = update._2

            previousWeights = currentWeights
            currentWeights = Some(weights)
            if (previousWeights != None && currentWeights != None) {
                converged = isConverged(previousWeights.get,
                                        currentWeights.get, convergenceTol)
            }
        } else {
            logWarning(s"Iteration ($i/$numIterations). The size of sampled batch is zero")
        }
        i += 1
    }

    logInfo("GradientDescent.runMiniBatchSGD finished. Last 10 stochastic losses %s".format(
        stochasticLossHistory.takeRight(10).mkString(", ")))

    (weights, stochasticLossHistory.toArray)

}

Honestly, this chunk is a lot to digest, so I'll probably cover it in a dedicated post later. For now we can understand it roughly as: optimize takes a batch of samples and, starting from the model's initial weights, repeatedly samples a mini-batch, computes gradients, and applies the updater (which handles L1/L2 regularization when configured), finally handing back a suitable weights vector (stopping when the iteration limit or the convergence tolerance is reached).
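The tolerance side of that stopping condition comes from the isConverged check in the loop; as far as I can tell it is an L2-norm test on how much the weights moved between two iterations, roughly like this (sketch, using Breeze as the surrounding code does):

import breeze.linalg.{norm, DenseVector => BDV}
import org.apache.spark.mllib.linalg.Vector

// Sketch of the convergence test: stop once the weight vector moved by less than
// convergenceTol relative to the current weight norm (with a floor of 1.0).
def isConverged(previous: Vector, current: Vector, convergenceTol: Double): Boolean = {
  val prev = new BDV(previous.toArray)
  val curr = new BDV(current.toArray)
  norm(prev - curr) < convergenceTol * math.max(norm(curr), 1.0)
}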

Inheritance tree

The structure behind StreamingLinearRegressionWithSGD is rather tangled, so to make it easier to follow I drew an inheritance-tree diagram:

[Figure: inheritance tree of StreamingLinearRegressionWithSGD and related classes]

Blame my shaky UML for how it looks, but the gist should come across: the core of it is the two classes in the middle, LinearRegressionModel and LinearRegressionWithSGD.

Distributed view

Since we now know the core is still LinearRegressionModel, this is essentially the single-machine algorithm lifted into a distributed setting. I tried to sketch a diagram of it; corrections welcome:

[Figure: sketch of how training and prediction are distributed across the cluster]

Also, in trainOn the update strategy seems to be simply replacing the model wholesale, which... feels a bit odd.

Summary

So that's it for this source-code walkthrough of the streaming linear regression algorithm. Digging through it layer by layer, the structure is roughly clear: at the end of the day it reuses the offline linear regression model inside the streaming one, and the model update strategy still feels off to me; maybe something to poke at later. I also think I need to brush up on the relationship between DStream and RDD.
