线性回归
- 一元线性回归
- hθ(x)=θ0+θ1x h θ ( x ) = θ 0 + θ 1 x ——————–1
- 多元线性回归
- hθ(x)=∑mi=1θixi=θTX h θ ( x ) = ∑ i = 1 m θ i x i = θ T X —————–2
损失函数
- J(θ)=1/2∑mi=1(hθ(xi)−yi)2 J ( θ ) = 1 / 2 ∑ i = 1 m ( h θ ( x i ) − y i ) 2 —————3
- 1/2 是为了求导时系数为1,平方里是真实值减去估计值
- 我们的目的就是求其最小值
最小二乘法要求较为苛刻,求矩阵的逆比较慢,要求 X X 是满秩
梯度下降法
- 梯度下降法
- 我们的目的是为了求 的极小值
- 梯度方向有 J(θ) J ( θ ) 对 θ θ 的偏导数确定,由于求的是极小值,因此梯度方向是偏导数的反方向,如果在区间内是凸函数,就是说梯度在更新 时,在梯度是负数时增加 θ θ
- θj:=θj−α∂∂θjJ(θ) θ j := θ j − α ∂ ∂ θ j J ( θ ) ——————-4
- 其中 α α 为学习速率,过大容易越过最小值,过小容易造成迭代次数过多,收敛变慢
- 如果只有一条样本
- ∂∂θjJ(θ)=(hθ(x)−y)∂∂θj(∑mi=0θixi−y)=(hθ(x)−y)xj ∂ ∂ θ j J ( θ ) = ( h θ ( x ) − y ) ∂ ∂ θ j ( ∑ i = 0 m θ i x i − y ) = ( h θ ( x ) − y ) x j —————–5
- 当数量不唯一时,将(5)带入(3)求偏导那么每个参数 θj θ j 沿梯度方向变化如下:
- θj:=θj+α∑mi=0(yi−hθ(xi))xij θ j := θ j + α ∑ i = 0 m ( y i − h θ ( x i ) ) x j i ——————————6
- 这里 m m 代表的时所有样本
- 随机梯度下降
- 梯度下降法会训练所有样本然后更新梯度,每次迭代复杂度O(mn),样本很大我们采用随机梯度下降
- 每读取一条样本,对 进行更新
- 虽然速度快但是会在最小值(极小?)附近震荡,造成永远不能收敛
- 为减少计算复杂度,对于 θT θ T 的变化程度设置一个阈值
源码分析
MLlib源码分析
建立线性回归
org/apache/spark/mllib/regression/LinearRegression.scala
object LinearRegressionWithSGD //是LogisticRegressionWithSGD的伴生对象(可以理解为单例) //主要定义train方法,传递训练参数和RDD给run方法,是LogisticRegressionWithSGD类的入口 //基于随机梯度下降法的线性回归模型 //损失函数f(weights) = 1/n ||A weights-y||^2^ class LinearRegressionWithSGD private[mllib] (//在mllib包下的 private var stepSize: Double,//迭代步长 private var numIterations: Int,//迭代次数 private var regParam: Double,//正则化参数 private var miniBatchFraction: Double)//迭代参与样本的比例 extends GeneralizedLinearAlgorithm[LogisticRegressionModel] with Serializable //采用最小平方损失函数的梯度下降法 private val gradient = new LeastSquaresGradient() //采用简单梯度更新,无正则化 private val updater = new SimpleUpdater() @Since("0.8.0") //新建梯度优化计算方法 override val optimizer = new GradientDescent(gradient, updater) .setStepSize(stepSize) .setNumIterations(numIterations) .setRegParam(regParam) .setMiniBatchFraction(miniBatchFraction)
训练的run方法
org/apache/spark/mllib/regression/GeneralizedLinearAlgorithm.scala
/** * Run the algorithm with the configured parameters on an input RDD * of LabeledPoint entries starting from the initial weights provided. * */ @Since("1.0.0") def run(input: RDD[LabeledPoint], initialWeights: Vector): M = { //特征维度 //求RDD中features的数量 if (numFeatures < 0) { numFeatures = input.map(_.features.size).first() } //输入样本检测 if (input.getStorageLevel == StorageLevel.NONE) { logWarning("The input data is not directly cached, which may hurt performance if its" + " parent RDDs are also uncached.") } // Check the data properties before running the optimizer if (validateData && !validators.forall(func => func(input))) { throw new SparkException("Input validation failed.") } /** 数据降维,优化过程中,收敛取决于训练数据的维度,降维提高收敛速度 目前只适用于逻辑回归 */ val scaler = if (useFeatureScaling) { new StandardScaler(withStd = true, withMean = false).fit(input.map(_.features)) } else { null } // Prepend an extra variable consisting of all 1.0's for the intercept. //增加偏置项bias:θ0常数项 // TODO: Apply feature scaling to the weight vector instead of input data. val data = if (addIntercept) { if (useFeatureScaling) { input.map(lp => (lp.label, appendBias(scaler.transform(lp.features)))).cache() } else { //在feature后添加bias,持久化到内存 input.map(lp => (lp.label, appendBias(lp.features))).cache() } } else { if (useFeatureScaling) { input.map(lp => (lp.label, scaler.transform(lp.features))).cache() } else { input.map(lp => (lp.label, lp.features)) } } /** * TODO: For better convergence, in logistic regression, the intercepts should be computed * from the prior probability distribution of the outcomes; for linear regression, * the intercept should be set as the average of response. * 初始权重,包括增加偏置项 */ val initialWeightsWithIntercept = if (addIntercept && numOfLinearPredictor == 1) { appendBias(initialWeights) } else { /** If `numOfLinearPredictor > 1`, initialWeights already contains intercepts. */ initialWeights } //权重训练优化,进行梯度下降学习,返回最优权重 val weightsWithIntercept = optimizer.optimize(data, initialWeightsWithIntercept) //获取偏置项 val intercept = if (addIntercept && numOfLinearPredictor == 1) { weightsWithIntercept(weightsWithIntercept.size - 1) } else { 0.0 } //获取权重 var weights = if (addIntercept && numOfLinearPredictor == 1) { Vectors.dense(weightsWithIntercept.toArray.slice(0, weightsWithIntercept.size - 1)) } else { weightsWithIntercept } /** * The weights and intercept are trained in the scaled space; we're converting them back to * the original scale. * 对于降维,权重要进行还原 * Math shows that if we only perform standardization without subtracting means, the intercept * will not be changed. w_i = w_i' / v_i where w_i' is the coefficient in the scaled space, w_i * is the coefficient in the original space, and v_i is the variance of the column i. */ if (useFeatureScaling) { if (numOfLinearPredictor == 1) { weights = scaler.transform(weights) } else { /** * For `numOfLinearPredictor > 1`, we have to transform the weights back to the original * scale for each set of linear predictor. Note that the intercepts have to be explicitly * excluded when `addIntercept == true` since the intercepts are part of weights now. */ var i = 0 val n = weights.size / numOfLinearPredictor val weightsArray = weights.toArray while (i < numOfLinearPredictor) { val start = i * n val end = (i + 1) * n - { if (addIntercept) 1 else 0 } val partialWeightsArray = scaler.transform( Vectors.dense(weightsArray.slice(start, end))).toArray System.arraycopy(partialWeightsArray, 0, weightsArray, start, partialWeightsArray.length) i += 1 } weights = Vectors.dense(weightsArray) } } // Warn at the end of the run as well, for increased visibility. if (input.getStorageLevel == StorageLevel.NONE) { logWarning("The input data was not directly cached, which may hurt performance if its" + " parent RDDs are also uncached.") } // Unpersist cached data if (data.getStorageLevel != StorageLevel.NONE) { data.unpersist(false) } //返回训练模型 createModel(weights, intercept) } }
梯度下降求解权重
org/apache/spark/mllib/optimization/GradientDescent.scala
optimizer.optimize->GradientDescent.optimize->GradientDescent.runMiniBatchSGD
def runMiniBatchSGD( data: RDD[(Double, Vector)], gradient: Gradient, updater: Updater, stepSize: Double, numIterations: Int, regParam: Double, miniBatchFraction: Double, initialWeights: Vector, convergenceTol: Double): (Vector, Array[Double]) = { //历史迭代误差数组 val stochasticLossHistory = new ArrayBuffer[Double](numIterations) // Record previous weight and current one to calculate solution vector difference var previousWeights: Option[Vector] = None var currentWeights: Option[Vector] = None //训练样本数m val numExamples = data.count() // if no data, return initial weights to avoid NaNs if (numExamples == 0) { logWarning("GradientDescent.runMiniBatchSGD returning initial weights, no data found") return (initialWeights, stochasticLossHistory.toArray) } if (numExamples * miniBatchFraction < 1) { logWarning("The miniBatchFraction is too small") } // 初始化权重Initialize weights as a column vector var weights = Vectors.dense(initialWeights.toArray) val n = weights.size /** * For the first iteration, the regVal will be initialized as sum of weight squares * if it's L2 updater; for L1 updater, the same logic is followed. */ //这里的compute进行了第一次迭代,正则化值初始化为权重的加权平方和 var regVal = updater.compute( weights, Vectors.zeros(weights.size), 0, 1, regParam)._2 var converged = false // indicates whether converged based on convergenceTol var i = 1 //权重迭代计算 while (!converged && i <= numIterations) { //广播权重变量,每个Executer都接受到当前权重 val bcWeights = data.context.broadcast(weights) // Sample a subset (fraction miniBatchFraction) of the total data // 随机抽样样本,对抽样的样本子集,采用treeAggregate的RDD方法,进行聚合计算 //计算每个样本的权重向量,误差值,然后对所有样本权重向量和误差值进行累加,这是一次map-reduce // compute and sum up the subgradients on this subset (this is one map-reduce) val (gradientSum, lossSum, miniBatchSize) = data.sample(false, miniBatchFraction, 42 + i) .treeAggregate((BDV.zeros[Double](n), 0.0, 0L))(//初始值 //将v合并到c seqOp = (c, v) => { // c: (grad, loss, count), v: (label, features) //返回的是loss val l = gradient.compute(v._2, v._1, bcWeights.value, Vectors.fromBreeze(c._1)) (c._1, c._2 + l, c._3 + 1) }, //合并两个c combOp = (c1, c2) => { // c: (grad, loss, count) // (c1._1 += c2._1, c1._2 + c2._2, c1._3 + c2._3) }) bcWeights.destroy(blocking = false) if (miniBatchSize > 0) { /** * lossSum is computed using the weights from the previous iteration * and regVal is the regularization value computed in the previous iteration as well. */ //保存误差,更新权重 stochasticLossHistory += lossSum / miniBatchSize + regVal val update = updater.compute( weights, Vectors.fromBreeze(gradientSum / miniBatchSize.toDouble), stepSize, i, regParam) weights = update._1 regVal = update._2 previousWeights = currentWeights currentWeights = Some(weights) if (previousWeights != None && currentWeights != None) { converged = isConverged(previousWeights.get, currentWeights.get, convergenceTol) } } else { logWarning(s"Iteration ($i/$numIterations). The size of sampled batch is zero") } i += 1 } logInfo("GradientDescent.runMiniBatchSGD finished. Last 10 stochastic losses %s".format( stochasticLossHistory.takeRight(10).mkString(", "))) //返回模型和误差数组 (weights, stochasticLossHistory.toArray) } //判断是否收敛 private def isConverged( previousWeights: Vector, currentWeights: Vector, convergenceTol: Double): Boolean = { // To compare with convergence tolerance. val previousBDV = previousWeights.asBreeze.toDenseVector val currentBDV = currentWeights.asBreeze.toDenseVector // This represents the difference of updated weights in the iteration. val solutionVecDiff: Double = norm(previousBDV - currentBDV) solutionVecDiff < convergenceTol * Math.max(norm(currentBDV), 1.0) } }
梯度计算
gradient.compute计算每个样本的梯度和误差,线性回归中使用LeastSquaresGradient。最小二乘
org/apache/spark/mllib/optimization/Gradient.scala
/** * :: DeveloperApi :: * Compute gradient and loss for a Least-squared loss function, as used in linear regression. * This is correct for the averaged least squares loss function (mean squared error) * L = 1/2n ||A weights-y||^2 * See also the documentation for the precise formulation. */ @DeveloperApi class LeastSquaresGradient extends Gradient { override def compute(data: Vector, label: Double, weights: Vector): (Vector, Double) = { //y-h0(x) val diff = dot(data, weights) - label val loss = diff * diff / 2.0 val gradient = data.copy scal(diff, gradient) //梯度值:x*(y-h0(x)) (gradient, loss) } override def compute( data: Vector, label: Double, weights: Vector, cumGradient: Vector): Double = { val diff = dot(data, weights) - label //y-h0(x) axpy(diff, data, cumGradient) // y= x* (y-h0(x))+cumGradient diff * diff / 2.0 } }
权重更新
/** * :: DeveloperApi :: * A simple updater for gradient descent *without* any regularization. * Uses a step-size decreasing with the square root of the number of iterations. */ @DeveloperApi class SimpleUpdater extends Updater { override def compute( weightsOld: Vector, gradient: Vector, stepSize: Double, iter: Int, regParam: Double): (Vector, Double) = { //当前迭代次数的平方根的倒数作为趋近的因子α val thisIterStepSize = stepSize / math.sqrt(iter) val brzWeights: BV[Double] = weightsOld.asBreeze.toDenseVector brzAxpy(-thisIterStepSize, gradient.asBreeze, brzWeights) (Vectors.fromBreeze(brzWeights), 0) } }