StreamingLinearRegressionWithSGD Source Code Analysis
Hi everyone, I'm the tough guy who can smash a car's A-pillar with one punch.
Long time no see; I really haven't blogged in a while. I've been preparing for exams and wrote a 2020 year-in-review post. This week I'm continuing the same line of work as before: figuring out how to benchmark the performance of Spark's streaming machine learning algorithms. Today I'll walk through the source code of the streaming linear regression algorithm, and share my take on it from a distributed-computing point of view.
It's long, but you'll get something out of reading it all the way through. Short on time? Bookmark it first! (Important enough that I'd say it three times.)
The official StreamingLinearRegressionWithSGD example
From the Spark website we can find the official example:
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.regression.StreamingLinearRegressionWithSGD
val trainingData = ssc.textFileStream(args(0)).map(LabeledPoint.parse).cache()
val testData = ssc.textFileStream(args(1)).map(LabeledPoint.parse)
val numFeatures = 3
val model = new StreamingLinearRegressionWithSGD()
.setInitialWeights(Vectors.zeros(numFeatures))
model.trainOn(trainingData)
model.predictOnValues(testData.map(lp => (lp.label, lp.features))).print()
ssc.start()
ssc.awaitTermination()
You can see the code falls into four parts: loading the data, initializing the model, training/prediction, and starting the ssc. Loading and cleaning the data will of course differ depending on the data source and format, so here we'll focus on the model initialization and the training/prediction parts.
Initializing the model
val numFeatures = 3
val model = new StreamingLinearRegressionWithSGD()
.setInitialWeights(Vectors.zeros(numFeatures))
As you can see, all StreamingLinearRegressionWithSGD needs is the **initial weights (InitialWeights)**. Let's open StreamingLinearRegressionWithSGD and look at its constructor:
class StreamingLinearRegressionWithSGD private[mllib] (
    private var stepSize: Double,
    private var numIterations: Int,
    private var regParam: Double,
    private var miniBatchFraction: Double)
  extends StreamingLinearAlgorithm[LinearRegressionModel, LinearRegressionWithSGD]
  with Serializable {
  /**
   * Construct a StreamingLinearRegression object with default parameters:
   * {stepSize: 0.1, numIterations: 50, miniBatchFraction: 1.0}.
   * Initial weights must be set before using trainOn or predictOn
   * (see `StreamingLinearAlgorithm`)
   */
  @Since("1.1.0")
  def this() = this(0.1, 50, 0.0, 1.0)
  @Since("1.1.0")
  val algorithm = new LinearRegressionWithSGD(stepSize, numIterations, regParam, miniBatchFraction)
  protected var model: Option[LinearRegressionModel] = None
  ..... // the rest are the parameter setters
You can see that StreamingLinearRegressionWithSGD extends StreamingLinearAlgorithm, which constrains the parameter types. Below the constructor there is a member called algorithm, an instance of LinearRegressionWithSGD, and the model field holds a LinearRegressionModel.
In short, StreamingLinearRegressionWithSGD extends StreamingLinearAlgorithm and carries two key members, algorithm (a LinearRegressionWithSGD) and model (a LinearRegressionModel), each of which handles part of the work.
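Since the rest of the class is just setters for those constructor parameters, a slightly fuller configuration would look something like the sketch below (setStepSize, setNumIterations, setMiniBatchFraction and setInitialWeights are the setters exposed by StreamingLinearRegressionWithSGD; the values are made up):
val configuredModel = new StreamingLinearRegressionWithSGD()
  .setStepSize(0.05)          // SGD step size (default 0.1)
  .setNumIterations(100)      // iterations per batch (default 50)
  .setMiniBatchFraction(0.5)  // fraction of each RDD sampled per iteration (default 1.0)
  .setInitialWeights(Vectors.zeros(numFeatures))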
Summary
Back to the point: in the official streaming linear regression example, model initialization just sets the initial weights, and the streaming linear regression class, through its parent-class relationships plus those initial weights, takes care of the rest of the setup.
Training and prediction
I'll cover this in two parts, since they are two separate code paths, but you need the class inheritance relationships clear in your head first or it gets confusing.
Training: trainOn
Lines 77-101 of StreamingLinearAlgorithm; click into the trainOn code:
/**
 * Update the model by training on batches of data from a DStream.
 * This operation registers a DStream for training the model,
 * and updates the model based on every subsequent
 * batch of data from the stream.
 *
 * @param data DStream containing labeled data
 */
def trainOn(data: DStream[LabeledPoint]): Unit = {
  if (model.isEmpty) {
    throw new IllegalArgumentException("Model must be initialized before starting training.")
  }
  data.foreachRDD { (rdd, time) =>
    if (!rdd.isEmpty) {
      model = Some(algorithm.run(rdd, model.get.weights))
      logInfo(s"Model updated at time ${time.toString}")
      val display = model.get.weights.size match {
        case x if x > 100 => model.get.weights.toArray.take(100).mkString("[", ",", "...")
        case _ => model.get.weights.toArray.mkString("[", ",", "]")
      }
      logInfo(s"Current model: weights, ${display}")
    }
  }
}
This trainOn method belongs to the abstract class StreamingLinearAlgorithm:
abstract class StreamingLinearAlgorithm[
M <: GeneralizedLinearModel,
A <: GeneralizedLinearAlgorithm[M]] extends Logging {
No inheritance is being declared here; the square brackets declare two type parameters. `M <: GeneralizedLinearModel` is an upper type bound: the type M must be GeneralizedLinearModel or one of its subclasses, and likewise for A. So this abstract class is parameterized over two types.
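If upper type bounds are unfamiliar, here is a tiny standalone sketch (nothing to do with Spark, all names made up) of what `A <: SomeType` buys you:
abstract class Animal { def name: String }
class Dog extends Animal { def name = "dog" }
// Container accepts Animal or any subclass, just like M <: GeneralizedLinearModel.
class Container[A <: Animal](val a: A) {
  def label: String = a.name   // Animal's methods are callable thanks to the bound
}
val ok = new Container(new Dog)      // compiles: Dog is a subclass of Animal
// val bad = new Container("oops")   // would not compile: String is not an Animal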
Back to trainOn. The first check (lines 87-89) just verifies that the model has been initialized, so we can skip it. data.foreachRDD (lines 90-100) is the core of training: for each RDD, if it is not empty, algorithm.run() is called to train; the rest only formats the weights for logging, which we can also skip.
Summary
So trainOn is essentially just a loop (foreachRDD); the real work lives in algorithm's run method and in the model's data structure!
algorithm
First, what kind of object is algorithm?
/** The algorithm to use for updating. */
protected val algorithm: A
Clicking on algorithm takes us to its definition: it is a value of the type parameter A, and the comment says it is the algorithm used to update the model. So what is A?
abstract class StreamingLinearAlgorithm[
M <: GeneralizedLinearModel,
A <: GeneralizedLinearAlgorithm[M]] extends Logging {
A must be GeneralizedLinearAlgorithm[M] or one of its subclasses. OK, let's park it here; I have a feeling we'll run into it again shortly.
model
Now for model. It's not hard to guess that model is an M, i.e. GeneralizedLinearModel or a subclass, so it should be a generalized linear model object:
/** The model to be updated and used for prediction. */
protected var model: Option[M]
The definition confirms it: this is the model that gets updated and is then used for prediction.
Summary
So now we know the two key objects, algorithm and model: algorithm's run method updates model, and model, once its weights and intercept are updated, is used for prediction. It's a safe guess that algorithm.run does something like a least-squares / gradient-descent fit, and that model must expose a predict method for scoring samples.
algorithm.run
Lines 233-370 of GeneralizedLinearAlgorithm. Next we dig into algorithm's run method to see how it trains the weights. Click into run; it's long, so rather than pasting the whole thing at once I'll go through it piece by piece.
(Lines 233-239) First the comment:
/**
* Run the algorithm with the configured parameters on an input RDD
* of LabeledPoint entries starting from the initial weights provided.
*
*/
def run(input: RDD[LabeledPoint], initialWeights: Vector): M = {
Starting from the provided initial weight vector and the configured parameters, it trains on the input RDD of LabeledPoint entries. run takes an RDD[LabeledPoint] and initialWeights: Vector, and returns an M, i.e. the model type GeneralizedLinearModel.
(Lines 241-243) Check the number of features: if numFeatures is still negative (i.e. never set), it can't be trusted, so it is taken straight from the input RDD:
if (numFeatures < 0) {
  numFeatures = input.map(_.features.size).first()
}
(Lines 245-248) Check the storage level; it's only a warning, so we can skip it:
if (input.getStorageLevel == StorageLevel.NONE) {
logWarning("The input data is not directly cached, which may hurt performance if its"
+ " parent RDDs are also uncached.")
}
(Lines 250-253) Validate the data; if it fails validation, throw an exception:
// Check the data properties before running the optimizer
if (validateData && !validators.forall(func => func(input))) {
throw new SparkException("Input validation failed.")
}
Now for the main event!
(Lines 255-273) The comment:
/**
 * Scaling columns to unit variance as a heuristic to reduce the condition number:
 *
 * During the optimization process, the convergence (rate) depends on the condition number of
 * the training dataset. Scaling the variables often reduces this condition number
 * heuristically, thus improving the convergence rate. Without reducing the condition number,
 * some training datasets mixing the columns with different scales may not be able to converge.
 *
 * GLMNET and LIBSVM packages perform the scaling to reduce the condition number, and return
 * the weights in the original scale.
 * See page 9 in http://cran.r-project.org/web/packages/glmnet/glmnet.pdf
 *
 * Here, if useFeatureScaling is enabled, we will standardize the training features by dividing
 * the variance of each column (without subtracting the mean), and train the model in the
 * scaled space. Then we transform the coefficients from the scaled space to the original scale
 * as GLMNET and LIBSVM do.
 *
 * Currently, it's only enabled in LogisticRegressionWithLBFGS
 */
In plain terms, this comment explains the feature-scaling trick: run needs to preprocess the data so the optimization converges well (convergence depends on the condition number of the training set, and scaling the columns reduces it), and afterwards the coefficients are transformed back to the original scale, as GLMNET and LIBSVM do.
(Lines 274-278) The feature scaler: when useFeatureScaling is true, a StandardScaler is fitted:
val scaler = if (useFeatureScaling) {
new StandardScaler(withStd = true, withMean = false).fit(input.map(_.features))
} else {
null
}
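As a side note, here is a small standalone sketch of what that scaler does: with withStd = true and withMean = false it scales each column to unit standard deviation without centering it (sc is an existing SparkContext, and the data is made up):
import org.apache.spark.mllib.feature.StandardScaler
import org.apache.spark.mllib.linalg.Vectors
val features = sc.parallelize(Seq(
  Vectors.dense(1.0, 100.0),
  Vectors.dense(2.0, 300.0),
  Vectors.dense(3.0, 500.0)))
val demoScaler = new StandardScaler(withStd = true, withMean = false).fit(features)
val scaled = demoScaler.transform(features)  // each column divided by its standard deviation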
(Lines 280-295) Build the training data, appending the intercept term and applying feature scaling as required:
// Prepend an extra variable consisting of all 1.0's for the intercept.
// TODO: Apply feature scaling to the weight vector instead of input data.
val data =
  if (addIntercept) {              // intercept requested
    if (useFeatureScaling) {       // feature scaling requested
      // For every LabeledPoint in the input RDD, take the label and features,
      // scale the features, then append the bias term
      input.map(lp => (lp.label, appendBias(scaler.transform(lp.features)))).cache()
    } else {
      // No scaling needed; just append the bias term
      input.map(lp => (lp.label, appendBias(lp.features))).cache()
    }
  } else {                         // same idea, without the intercept
    if (useFeatureScaling) {
      input.map(lp => (lp.label, scaler.transform(lp.features))).cache()
    } else {
      input.map(lp => (lp.label, lp.features))
    }
  }
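appendBias (from org.apache.spark.mllib.util.MLUtils) simply tacks a constant 1.0 onto the end of the feature vector, so the intercept can be learned as one more weight. A quick illustration with made-up values:
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.util.MLUtils
val v = Vectors.dense(0.5, -1.2)
val withBias = MLUtils.appendBias(v)  // [0.5, -1.2, 1.0]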
(Lines 297-307) Here the intercept is folded into the weights, producing a new variable initialWeightsWithIntercept:
/**
 * TODO: For better convergence, in logistic regression, the intercepts should be computed
 * from the prior probability distribution of the outcomes; for linear regression,
 * the intercept should be set as the average of response.
 */
val initialWeightsWithIntercept = if (addIntercept && numOfLinearPredictor == 1) {
  appendBias(initialWeights)
} else {
  /** If `numOfLinearPredictor > 1`, initialWeights already contains intercepts. */
  initialWeights
}
(Line 309) Call the Optimizer's optimize method on data and initialWeightsWithIntercept, where data: RDD[(Double, Vector)], initialWeightsWithIntercept: Vector, and the result weightsWithIntercept: Vector:
val weightsWithIntercept = optimizer.optimize(data, initialWeightsWithIntercept)
Here data is an RDD of (LabeledPoint.label, features with the bias appended), and the optimization produces a single vector. I had a look at the optimizer object: it is a trait, with no concrete optimize implementation at this level, so at this point we can't tell how the optimization actually works. That's a gap we'll fill in later.
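For reference, the Optimizer trait (in org.apache.spark.mllib.optimization) is essentially just this one-method contract, which is why clicking into it shows no algorithm:
trait Optimizer extends Serializable {
  def optimize(data: RDD[(Double, Vector)], initialWeights: Vector): Vector
}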
(Lines 311-315) Pull the intercept out of weightsWithIntercept:
val intercept = if (addIntercept && numOfLinearPredictor == 1) {
weightsWithIntercept(weightsWithIntercept.size - 1)
} else {
0.0
}
(Lines 317-321) Pull the weight vector out of weightsWithIntercept:
var weights = if (addIntercept && numOfLinearPredictor == 1) {
Vectors.dense(weightsWithIntercept.toArray.slice(0, weightsWithIntercept.size - 1))
} else {
weightsWithIntercept
}
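In other words, the last element of weightsWithIntercept is the intercept and the rest are the weights. A tiny made-up example:
val weightsWithInterceptExample = Vectors.dense(0.8, -0.3, 2.5)
val interceptExample = weightsWithInterceptExample(weightsWithInterceptExample.size - 1)  // 2.5
val weightsExample = Vectors.dense(weightsWithInterceptExample.toArray.dropRight(1))      // [0.8, -0.3]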
(Lines 323-330) The comment:
/**
* The weights and intercept are trained in the scaled space; we're converting them back to
* the original scale.
*
* Math shows that if we only perform standardization without subtracting means, the intercept
* will not be changed. w_i = w_i' / v_i where w_i' is the coefficient in the scaled space, w_i
* is the coefficient in the original space, and v_i is the variance of the column i.
*/
The weights and intercept were trained in the scaled space, so now they have to be converted back to the original scale. As the comment notes, because we only standardized without subtracting the means, the intercept is unchanged and each weight is just rescaled (w_i = w_i' / v_i).
So what follows is the un-scaling of the weights.
(Lines 331-355) The un-scaling:
if (useFeatureScaling) {             // only when feature scaling was applied
  if (numOfLinearPredictor == 1) {   // a single linear predictor -- our linear regression case
    weights = scaler.transform(weights)  // just transform the weights back with the scaler
  } else {                           // multiple linear predictors (e.g. multinomial logistic regression)
    /**
     * For `numOfLinearPredictor > 1`, we have to transform the weights back to the original
     * scale for each set of linear predictor. Note that the intercepts have to be explicitly
     * excluded when `addIntercept == true` since the intercepts are part of weights now.
     */
    var i = 0
    val n = weights.size / numOfLinearPredictor
    val weightsArray = weights.toArray
    while (i < numOfLinearPredictor) {
      val start = i * n
      val end = (i + 1) * n - { if (addIntercept) 1 else 0 }
      val partialWeightsArray = scaler.transform(
        Vectors.dense(weightsArray.slice(start, end))).toArray
      System.arraycopy(partialWeightsArray, 0, weightsArray, start, partialWeightsArray.length)
      i += 1
    }
    weights = Vectors.dense(weightsArray)
  }
}
The multi-predictor (logistic regression) branch is fiddly, but readable if you go through it carefully. Our focus this time is linear regression, so I won't dwell on it.
(Lines 357-368) Wrap-up:
// Warn at the end of the run as well, for increased visibility.
if (input.getStorageLevel == StorageLevel.NONE) {
logWarning("The input data was not directly cached, which may hurt performance if its" + " parent RDDs are also uncached.")
}
// Unpersist cached data
if (data.getStorageLevel != StorageLevel.NONE) {
data.unpersist()
}
createModel(weights, intercept)
Check the storage level again, log warnings as needed, unpersist the cached data, and finally build a fresh model via createModel(weights, intercept), which is what the whole run method returns.
Summary
That's the whole training path: a conventional training procedure that trains out weights and an intercept and then builds a new model from them and returns it.
Prediction: predictOn
val lr: StreamingLinearRegressionWithSGD = new StreamingLinearRegressionWithSGD().setInitialWeights(Vectors.zeros(10))
lr.predictOn() // argument omitted here; this line is only for jumping into the method
Click into predictOn:
/**
* Use the model to make predictions on batches of data from a DStream
*
* @param data DStream containing feature vectors
* @return DStream containing predictions
*
*/
@Since("1.1.0")
def predictOn(data: DStream[Vector]): DStream[Double] = {
if (model.isEmpty) {
throw new IllegalArgumentException("Model must be initialized before starting prediction.")
}
data.map{x => model.get.predict(x)}
}
Clearly this uses the trained model to make predictions, and the core is the last line: data starts as a DStream[Vector], we map over it, and the model's predict method is applied to each Vector.
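By the way, the official example at the top used predictOnValues rather than predictOn: it takes a DStream of (key, Vector) pairs and keeps the key next to each prediction, which is convenient for carrying the true label along:
model.predictOnValues(testData.map(lp => (lp.label, lp.features))).print()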
Click into model.get.predict(x) and you land at lines 82-84 of GeneralizedLinearModel:
/**
* Predict values for a single data point using the model trained.
*
* @param testData array representing a single data point
* @return Double prediction from the trained model
*
*/
@Since("1.0.0")
def predict(testData: Vector): Double = {
predictPoint(testData, weights, intercept)
}
This delegates to predictPoint, passing in the sample vector to predict, the weight vector, and the intercept. Click into predictPoint:
protected def predictPoint(dataMatrix: Vector, weightMatrix: Vector, intercept: Double): Double
We're still inside the abstract class GeneralizedLinearModel, and the method has no body, so it is clearly an abstract method of the base class. From the inheritance analysis earlier we know the linear regression model (LinearRegressionModel) extends this generalized linear model (GeneralizedLinearModel), so the implementation should live there:
class LinearRegressionModel @Since("1.1.0") (
@Since("1.0.0") override val weights: Vector,
@Since("0.8.0") override val intercept: Double)
extends GeneralizedLinearModel(weights, intercept) with RegressionModel with Serializable
with Saveable with PMMLExportable {
override protected def predictPoint(
dataMatrix: Vector,
weightMatrix: Vector,
intercept: Double): Double = {
weightMatrix.asBreeze.dot(dataMatrix.asBreeze) + intercept
}
Open LinearRegressionModel and this is the very first method, and the implementation is simple: the dot product of the weight vector with the sample vector, plus the intercept. Exactly the textbook formula, nothing surprising.
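As a sanity check, here is the same formula computed by hand on made-up numbers (asBreeze is internal to Spark, so this sketch just does the dot product manually):
import org.apache.spark.mllib.linalg.Vectors
val w = Vectors.dense(2.0, -1.0)
val x = Vectors.dense(3.0, 4.0)
val b = 0.5
val prediction = w.toArray.zip(x.toArray).map { case (wi, xi) => wi * xi }.sum + b
// 2*3 + (-1)*4 + 0.5 = 2.5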
Summary
That wraps up the predictOn side, and this loop wasn't complicated: the generalized linear model defines the interface, and the linear regression model extends it and implements predictPoint. We also now have a rough picture of the whole inheritance tree. But from the trainOn part there is still one open question: we never found the optimizer's source code. That's next.
Optimizer
First, we need to work out where the optimizer lives. Remember that the streaming linear regression we're using is the SGD flavor, i.e. stochastic gradient descent, so let's go back to StreamingLinearRegressionWithSGD:
class StreamingLinearRegressionWithSGD private[mllib] (
private var stepSize: Double,
private var numIterations: Int,
private var regParam: Double,
private var miniBatchFraction: Double)
extends StreamingLinearAlgorithm[LinearRegressionModel, LinearRegressionWithSGD]
It extends StreamingLinearAlgorithm, and the streaming linear algorithm class has two type parameters; the second one is LinearRegressionWithSGD:
abstract class StreamingLinearAlgorithm[
M <: GeneralizedLinearModel,
A <: GeneralizedLinearAlgorithm[M]] extends Logging {
StreamingLinearAlgorithm itself doesn't tell us much more, since its bound only says A must be a subtype of GeneralizedLinearAlgorithm. So let's click from the first snippet into LinearRegressionWithSGD:
/**
 * Train a linear regression model with no regularization using Stochastic Gradient Descent.
 * This solves the least squares regression formulation
 *              f(weights) = 1/n ||A weights-y||^2^
 * (which is the mean squared error).
 * Here the data matrix has n rows, and the input RDD holds the set of rows of A, each with
 * its corresponding right hand side label y.
 * See also the documentation for the precise formulation.
 */
@Since("0.8.0")
class LinearRegressionWithSGD private[mllib] (
private var stepSize: Double,
private var numIterations: Int,
private var regParam: Double,
private var miniBatchFraction: Double)
extends GeneralizedLinearAlgorithm[LinearRegressionModel] with Serializable {
private val gradient = new LeastSquaresGradient()
private val updater = new SimpleUpdater()
@Since("0.8.0")
override val optimizer = new GradientDescent(gradient, updater)
.setStepSize(stepSize)
.setNumIterations(numIterations)
.setRegParam(regParam)
.setMiniBatchFraction(miniBatchFraction)
As you can see, optimizer is a GradientDescent object. Click into it:
class GradientDescent private[spark] (private var gradient: Gradient, private var updater: Updater)
extends Optimizer with Logging {
GradientDescent extends the Optimizer trait, and its optimize method is the last method in the class:
/**
 * Runs gradient descent on the given training data.
 * @param data training data
 * @param initialWeights initial weights
 * @return solution vector
 */
def optimize(data: RDD[(Double, Vector)], initialWeights: Vector): Vector = {
  val (weights, _) = GradientDescent.runMiniBatchSGD(
    data,               // RDD[(Double, Vector)], i.e. RDD[(label, feature vector)]
    gradient,           // gradient (LeastSquaresGradient here)
    updater,            // updater (SimpleUpdater here)
    stepSize,           // step size
    numIterations,      // number of iterations
    regParam,           // regularization parameter
    miniBatchFraction,  // fraction of data sampled per mini-batch
    initialWeights,     // initial weights -- not necessarily the zero vector; after a few batches it already carries values
    convergenceTol)     // convergence tolerance
  weights
}
All of these parameters are documented on the setters above. In short, optimize calls GradientDescent.runMiniBatchSGD with this bundle of parameters, gets back a tuple (weights, _), and returns the weights vector.
runMiniBatchSGD
Click in and there is a big block of comments plus the method body; every parameter is documented, so I won't repeat them:
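Before going in, it helps to see what the updater piece does in isolation. Here is a sketch calling SimpleUpdater (the no-regularization updater wired in above) directly on made-up numbers; its compute returns the stepped weights plus the regularization value, which is always 0.0 for SimpleUpdater:
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.optimization.SimpleUpdater
val simpleUpdater = new SimpleUpdater()
val (steppedWeights, regVal) = simpleUpdater.compute(
  Vectors.dense(1.0, 1.0),   // current weights
  Vectors.dense(0.5, -0.5),  // averaged gradient for this mini-batch
  0.1,                       // stepSize (the step is scaled by stepSize / sqrt(iter))
  1,                         // iteration number
  0.0)                       // regParam (ignored by SimpleUpdater)
// steppedWeights is roughly [0.95, 1.05], regVal is 0.0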
/**
* Run stochastic gradient descent (SGD) in parallel using mini batches.
* In each iteration, we sample a subset (fraction miniBatchFraction) of the total data
* in order to compute a gradient estimate.
* Sampling, and averaging the subgradients over this subset is performed using one standard
* spark map-reduce in each iteration.
*
* @param data Input data for SGD. RDD of the set of data examples, each of
* the form (label, [feature values]).
* @param gradient Gradient object (used to compute the gradient of the loss function of
* one single data example)
* @param updater Updater function to actually perform a gradient step in a given direction.
* @param stepSize initial step size for the first step
* @param numIterations number of iterations that SGD should be run.
* @param regParam regularization parameter
* @param miniBatchFraction fraction of the input data set that should be used for
* one iteration of SGD. Default value 1.0.
* @param convergenceTol Minibatch iteration will end before numIterations if the relative
* difference between the current weight and the previous weight is less
* than this value. In measuring convergence, L2 norm is calculated.
* Default value 0.001. Must be between 0.0 and 1.0 inclusively.
* @return A tuple containing two elements. The first element is a column matrix containing
* weights for every feature, and the second element is an array containing the
* stochastic loss computed for every iteration.
*/
def runMiniBatchSGD(
data: RDD[(Double, Vector)],
gradient: Gradient,
updater: Updater,
stepSize: Double,
numIterations: Int,
regParam: Double,
miniBatchFraction: Double,
initialWeights: Vector,
convergenceTol: Double): (Vector, Array[Double]) = {
// convergenceTol should be set with non minibatch settings
if (miniBatchFraction < 1.0 && convergenceTol > 0.0) {
logWarning("Testing against a convergenceTol when using miniBatchFraction " +
"< 1.0 can be unstable because of the stochasticity in sampling.")
}
if (numIterations * miniBatchFraction < 1.0) {
logWarning("Not all examples will be used if numIterations * miniBatchFraction < 1.0: " +
s"numIterations=$numIterations and miniBatchFraction=$miniBatchFraction")
}
// History of the stochastic loss at each iteration, kept in an ArrayBuffer sized to numIterations
val stochasticLossHistory = new ArrayBuffer[Double](numIterations)
// Record previous weight and current one to calculate solution vector difference
var previousWeights: Option[Vector] = None
var currentWeights: Option[Vector] = None
// number of training examples
val numExamples = data.count()
// If there is no data, return the initial weights (note: not necessarily a zero vector -- it's whatever weights entered this round of training)
// if no data, return initial weights to avoid NaNs
if (numExamples == 0) {
logWarning("GradientDescent.runMiniBatchSGD returning initial weights, no data found")
return (initialWeights, stochasticLossHistory.toArray)
}
if (numExamples * miniBatchFraction < 1) {
logWarning("The miniBatchFraction is too small")
}
// Initialize weights as a column vector
var weights = Vectors.dense(initialWeights.toArray)
val n = weights.size
/**
* For the first iteration, the regVal will be initialized as sum of weight squares
* if it's L2 updater; for L1 updater, the same logic is followed.
*/
var regVal = updater.compute(
weights, Vectors.zeros(weights.size), 0, 1, regParam)._2
var converged = false // indicates whether converged based on convergenceTol
var i = 1
// Keep iterating until we converge within the given tolerance or hit the iteration limit
while (!converged && i <= numIterations) {
val bcWeights = data.context.broadcast(weights)
// Sample a subset (fraction miniBatchFraction) of the total data
// compute and sum up the subgradients on this subset (this is one map-reduce)
val (gradientSum, lossSum, miniBatchSize) = data.sample(false, miniBatchFraction, 42 + i)
.treeAggregate((BDV.zeros[Double](n), 0.0, 0L))(
seqOp = (c, v) => {
// c: (grad, loss, count), v: (label, features)
val l = gradient.compute(v._2, v._1, bcWeights.value, Vectors.fromBreeze(c._1))
(c._1, c._2 + l, c._3 + 1)
},
combOp = (c1, c2) => {
// c: (grad, loss, count)
(c1._1 += c2._1, c1._2 + c2._2, c1._3 + c2._3)
})
bcWeights.destroy()
if (miniBatchSize > 0) {
/**
* lossSum is computed using the weights from the previous iteration
* and regVal is the regularization value computed in the previous iteration as well.
*/
stochasticLossHistory += lossSum / miniBatchSize + regVal
val update = updater.compute(
weights, Vectors.fromBreeze(gradientSum / miniBatchSize.toDouble),
stepSize, i, regParam)
weights = update._1
regVal = update._2
previousWeights = currentWeights
currentWeights = Some(weights)
if (previousWeights != None && currentWeights != None) {
converged = isConverged(previousWeights.get,
currentWeights.get, convergenceTol)
}
} else {
logWarning(s"Iteration ($i/$numIterations). The size of sampled batch is zero")
}
i += 1
}
logInfo("GradientDescent.runMiniBatchSGD finished. Last 10 stochastic losses %s".format(
stochasticLossHistory.takeRight(10).mkString(", ")))
(weights, stochasticLossHistory.toArray)
}
Honestly, this chunk is heavy going, so I'll save a dedicated post for it. For now you can think of it this way: optimize takes a batch of samples, starts from the weights handed in by the model, repeatedly samples a mini-batch, computes and averages the subgradients in one map-reduce step, lets the updater apply the step (plus any L1/L2 regularization, if a regularizing updater is configured), and finally hands back a suitable weights vector once it hits the iteration limit or the convergence tolerance.
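To make the loop concrete, here is a single-machine sketch of the same idea (plain Scala, not Spark; all names are mine, the gradient is the least-squares gradient (w.x - y) * x, and the step mimics SimpleUpdater's stepSize / sqrt(iter) scaling):
def miniBatchSGDSketch(
    data: Array[(Double, Array[Double])],   // (label, features)
    initialWeights: Array[Double],
    stepSize: Double,
    numIterations: Int,
    miniBatchFraction: Double): Array[Double] = {
  val rnd = new scala.util.Random(42)
  val weights = initialWeights.clone()
  for (iter <- 1 to numIterations) {
    // Sample roughly miniBatchFraction of the data, like data.sample(false, miniBatchFraction, 42 + i)
    val batch = data.filter(_ => rnd.nextDouble() < miniBatchFraction)
    if (batch.nonEmpty) {
      // Sum the least-squares gradients over the batch
      val gradSum = new Array[Double](weights.length)
      batch.foreach { case (y, x) =>
        val err = x.zip(weights).map { case (xi, wi) => xi * wi }.sum - y
        for (j <- gradSum.indices) gradSum(j) += err * x(j)
      }
      // SimpleUpdater-style step on the averaged gradient: w -= (stepSize / sqrt(iter)) * avgGrad
      val scale = stepSize / math.sqrt(iter) / batch.length
      for (j <- weights.indices) weights(j) -= scale * gradSum(j)
    }
  }
  weights
}
The Spark version does exactly this, except the gradient sum and count are computed with treeAggregate across the cluster, the current weights are broadcast each iteration, and the loop can also stop early once the weight change falls below convergenceTol.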
Inheritance tree
The structure behind StreamingLinearRegressionWithSGD is fairly tangled, so to make it easier to follow I wanted to draw an inheritance tree:
Blame my rusty UML for how it looks, but as long as the general idea comes across it's fine. The core of it is the two classes in the middle, LinearRegressionModel and LinearRegressionWithSGD.
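In text form, the hierarchy we walked through looks roughly like this:
StreamingLinearRegressionWithSGD
  extends StreamingLinearAlgorithm[LinearRegressionModel, LinearRegressionWithSGD]
    - model: Option[LinearRegressionModel]    (the M <: GeneralizedLinearModel slot)
    - algorithm: LinearRegressionWithSGD      (the A <: GeneralizedLinearAlgorithm[M] slot)
LinearRegressionWithSGD extends GeneralizedLinearAlgorithm[LinearRegressionModel]
    - optimizer: GradientDescent(LeastSquaresGradient, SimpleUpdater), which extends Optimizer
LinearRegressionModel extends GeneralizedLinearModel with RegressionModel
    - predictPoint = weights.dot(x) + intercept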
Distributed view
Since we now know the core is still LinearRegressionModel, this is essentially the single-machine algorithm lifted into a distributed setting. I tried to sketch a diagram of it; corrections welcome:
Also, in trainOn we saw that the update strategy appears to simply replace the model wholesale with the one trained on the latest batch, which... feels a bit odd.
Summary
That's it for this source-code walkthrough of the streaming linear regression algorithm. Digging through layer by layer, the structure is roughly clear now: essentially the offline linear regression model is reused inside the streaming one, and the model update strategy still feels a bit off to me; maybe something to poke at later. I also think I need to brush up on the relationship between DStream and RDD.