Computing Variance and Standard Deviation with Mahout


早上的阳光


Article link: https://blog.csdn.net/u010011737/article/details/51909012


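Standard deviation is the most commonly used measure of statistical dispersion in probability and statistics. It is defined as the square root of the arithmetic mean of the squared deviations of each value in a population from the population mean, and it reflects how widely the individual values within a group are spread. As a measure of dispersion it has two basic properties:

It is non-negative, and it has the same unit as the measured data. Note that the standard deviation of a whole population, the standard deviation of a random variable, and the standard deviation of a sample drawn from a population are distinct quantities.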

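Standard formula:

Suppose we have a set of real values X₁, X₂, X₃, ..., Xn whose arithmetic mean is μ:

$$\mu = \frac{1}{n}\sum_{i=1}^{n} X_i$$

The standard deviation, also called the experimental standard deviation, is then

$$S = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(X_i - \mu\right)^2}$$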

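Put simply, the standard deviation measures how far a set of values is spread around its mean. A large standard deviation means that most values differ considerably from the mean; a small standard deviation means that the values lie close to the mean.

For example, the sets {0, 5, 9, 14} and {5, 6, 8, 9} both have a mean of 7, but the second set has the smaller standard deviation.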
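The comparison of the two sets can be checked with a few lines of Java. This is only a minimal sketch: the class name `SpreadDemo` and its helper methods are made up for illustration and are not part of Mahout, and the divisor n (population form) is assumed.

```java
import java.util.Arrays;

final class SpreadDemo {

    // Arithmetic mean of the values (NaN for an empty array).
    static double mean(double[] xs) {
        return Arrays.stream(xs).average().orElse(Double.NaN);
    }

    // Population standard deviation: square root of the mean squared deviation (divisor n).
    static double populationStdDev(double[] xs) {
        double mu = mean(xs);
        double sumOfSquaredDeviations = 0.0;
        for (double x : xs) {
            double d = x - mu;
            sumOfSquaredDeviations += d * d;
        }
        return Math.sqrt(sumOfSquaredDeviations / xs.length);
    }

    public static void main(String[] args) {
        double[] wide = {0, 5, 9, 14};
        double[] narrow = {5, 6, 8, 9};
        System.out.printf("wide:   mean=%.1f stdDev=%.3f%n", mean(wide), populationStdDev(wide));
        System.out.printf("narrow: mean=%.1f stdDev=%.3f%n", mean(narrow), populationStdDev(narrow));
    }
}
```

Both sets have mean 7, but the squared deviations sum to 106 for the first set and only 10 for the second, so the standard deviations come out at roughly 5.15 and 1.58.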

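The standard deviation can also serve as a measure of uncertainty. In the physical sciences, for example, the standard deviation of a set of repeated measurements indicates the precision of those measurements. It plays a decisive role when judging whether measurements agree with a prediction: if the measured mean lies too far from the predicted value (with the distance expressed relative to the standard deviation), the measurements are considered to contradict the prediction. This is intuitive: if the measurements all fall outside a certain range around the predicted value, it is reasonable to question whether the prediction is correct.

In investing, the standard deviation is used as an indicator of how stable returns are. The larger the standard deviation, the further returns stray from their historical average, so returns are less stable and the risk is higher. Conversely, a smaller standard deviation means more stable returns and lower risk.

For example, groups A and B each have 6 students who sit the same language test. Group A scores 95, 85, 75, 65, 55, 45 and group B scores 73, 72, 71, 69, 68, 67. Both groups have a mean of 70, but the standard deviation of group A is about 17.08 points while that of group B is about 2.16 points, which shows that the gap between students in group A is much larger than in group B.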

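For a whole population (that is, when computing the population variance), divide by n under the square root (Excel function: STDEVP);

For a sample (that is, when estimating the variance from a sample), divide by (n-1) under the square root (Excel function: STDEV);

Because we mostly work with samples, dividing by (n-1) under the square root is the more common choice.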
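The effect of the two divisors is easy to see on the test scores of groups A and B. The sketch below is illustrative only; `DivisorDemo` and its method names are hypothetical and not part of Mahout or Excel.

```java
final class DivisorDemo {

    // Sum of squared deviations from the arithmetic mean.
    static double sumSquaredDeviations(double[] xs) {
        double sum = 0.0;
        for (double x : xs) {
            sum += x;
        }
        double mean = sum / xs.length;
        double ss = 0.0;
        for (double x : xs) {
            ss += (x - mean) * (x - mean);
        }
        return ss;
    }

    // Divisor n: the data are the whole population (what Excel's STDEVP computes).
    static double populationStdDev(double[] xs) {
        return Math.sqrt(sumSquaredDeviations(xs) / xs.length);
    }

    // Divisor n - 1: the data are a sample (what Excel's STDEV computes). Needs at least 2 values.
    static double sampleStdDev(double[] xs) {
        return Math.sqrt(sumSquaredDeviations(xs) / (xs.length - 1));
    }

    public static void main(String[] args) {
        double[] groupA = {95, 85, 75, 65, 55, 45};
        double[] groupB = {73, 72, 71, 69, 68, 67};
        System.out.printf("A: population=%.2f sample=%.2f%n",
                populationStdDev(groupA), sampleStdDev(groupA));
        System.out.printf("B: population=%.2f sample=%.2f%n",
                populationStdDev(groupB), sampleStdDev(groupB));
    }
}
```

With the divisor n the results are about 17.08 and 2.16, the figures quoted for the two groups; with n-1 they rise to about 18.71 and 2.37. The gap between the two forms shrinks as n grows.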

Variance

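When the data are widely spread (that is, they fluctuate strongly around the mean), the sum of the squared differences between each value and the mean is large, and so is the variance; when the data are concentrated, that sum is small and the variance is small. The larger the variance, the more the data fluctuate; the smaller the variance, the less they fluctuate.

The mean of the squared differences between each value in a sample and the sample mean is called the sample variance; the arithmetic square root of the sample variance is called the sample standard deviation. Both measure how much a sample fluctuates: the larger the sample variance or sample standard deviation, the more the sample data fluctuate.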

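Variance and standard deviation are the most important and most commonly used measures of dispersion. Variance is the mean of the squared deviations of each value from the mean, and it is the principal way of measuring the dispersion of numerical data. The standard deviation is the arithmetic square root of the variance and is denoted S. The corresponding formula for the variance is

$$S^2 = \frac{1}{n}\sum_{i=1}^{n}\left(X_i - \mu\right)^2$$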

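Unlike the variance, the standard deviation has the same unit as the variable itself, which makes it easier to interpret; this is why analyses more often report the standard deviation.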

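Now let's look at the Mahout code:

In mahout-mr-0.11.0.jar, the package org.apache.mahout.math.hadoop.stats contains the statistics code.

BasicStats provides the basic statistical methods (mean, variance, standard deviation, and so on).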

/**
 * Licensed to the Apache Software Foundation (ASF) under one or more
 * contributor license agreements.  See the NOTICE file distributed with
 * this work for additional information regarding copyright ownership.
 * The ASF licenses this file to You under the Apache License, Version 2.0
 * (the "License"); you may not use this file except in compliance with
 * the License.  You may obtain a copy of the License at
 *
 *     http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

package org.apache.mahout.math.hadoop.stats;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;
import org.apache.mahout.common.HadoopUtil;
import org.apache.mahout.common.Pair;
import org.apache.mahout.common.iterator.sequencefile.PathType;
import org.apache.mahout.common.iterator.sequencefile.SequenceFileDirIterable;

import java.io.IOException;

/**
 * Methods for calculating basic stats (mean, variance, stdDev, etc.) in map/reduce
 */
public final class BasicStats {

  private BasicStats() {
  }

  /**
   * Calculate the variance of values stored as
   *
   * @param input    The input file containing the key and the count
   * @param output   The output to store the intermediate values
   * @param baseConf
   * @return The variance (based on sample estimation)
   */
  public static double variance(Path input, Path output,
                                Configuration baseConf)
    throws IOException, InterruptedException, ClassNotFoundException {
    VarianceTotals varianceTotals = computeVarianceTotals(input, output, baseConf);
    return varianceTotals.computeVariance();
  }

  /**
   * Calculate the variance by a predefined mean of values stored as
   *
   * @param input    The input file containing the key and the count
   * @param output   The output to store the intermediate values
   * @param mean The mean based on which to compute the variance
   * @param baseConf
   * @return The variance (based on sample estimation)
   */
  public static double varianceForGivenMean(Path input, Path output, double mean,
                                Configuration baseConf)
    throws IOException, InterruptedException, ClassNotFoundException {
    VarianceTotals varianceTotals = computeVarianceTotals(input, output, baseConf);
    return varianceTotals.computeVarianceForGivenMean(mean);
  }

  private static VarianceTotals computeVarianceTotals(Path input, Path output,
                                Configuration baseConf) throws IOException, InterruptedException,
          ClassNotFoundException {
    Configuration conf = new Configuration(baseConf);
    conf.set("io.serializations",
                    "org.apache.hadoop.io.serializer.JavaSerialization,"
                            + "org.apache.hadoop.io.serializer.WritableSerialization");
    Job job = HadoopUtil.prepareJob(input, output, SequenceFileInputFormat.class,
        StandardDeviationCalculatorMapper.class, IntWritable.class, DoubleWritable.class,
        StandardDeviationCalculatorReducer.class, IntWritable.class, DoubleWritable.class,
        SequenceFileOutputFormat.class, conf);
    HadoopUtil.delete(conf, output);
    job.setCombinerClass(StandardDeviationCalculatorReducer.class);
    boolean succeeded = job.waitForCompletion(true);
    if (!succeeded) {
      throw new IllegalStateException("Job failed!");
    }

    // Now extract the computed sum
    Path filesPattern = new Path(output, "part-*");
    double sumOfSquares = 0;
    double sum = 0;
    double totalCount = 0;
    for (Pair<Writable, Writable> record : new SequenceFileDirIterable<>(
            filesPattern, PathType.GLOB, null, null, true, conf)) {

      int key = ((IntWritable) record.getFirst()).get();
      if (key == StandardDeviationCalculatorMapper.SUM_OF_SQUARES.get()) {
        sumOfSquares += ((DoubleWritable) record.getSecond()).get();
      } else if (key == StandardDeviationCalculatorMapper.TOTAL_COUNT
              .get()) {
        totalCount += ((DoubleWritable) record.getSecond()).get();
      } else if (key == StandardDeviationCalculatorMapper.SUM
              .get()) {
        sum += ((DoubleWritable) record.getSecond()).get();
      }
    }

    VarianceTotals varianceTotals = new VarianceTotals();
    varianceTotals.setSum(sum);
    varianceTotals.setSumOfSquares(sumOfSquares);
    varianceTotals.setTotalCount(totalCount);

    return varianceTotals;
  }

  /**
   * Calculate the standard deviation
   *
   * @param input    The input file containing the key and the count
   * @param output   The output file to write the counting results to
   * @param baseConf The base configuration
   * @return The standard deviation
   */
  public static double stdDev(Path input, Path output,
                              Configuration baseConf) throws IOException, InterruptedException,
          ClassNotFoundException {
    return Math.sqrt(variance(input, output, baseConf));
  }

  /**
   * Calculate the standard deviation given a predefined mean
   *
   * @param input    The input file containing the key and the count
   * @param output   The output file to write the counting results to
   * @param mean The mean based on which to compute the standard deviation
   * @param baseConf The base configuration
   * @return The standard deviation
   */
  public static double stdDevForGivenMean(Path input, Path output, double mean,
                              Configuration baseConf) throws IOException, InterruptedException,
          ClassNotFoundException {
    return Math.sqrt(varianceForGivenMean(input, output, mean, baseConf));
  }
}

Methods: BasicStats.variance() returns the variance, and BasicStats.stdDev() returns the standard deviation.
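The job in computeVarianceTotals only collects three totals: the sum, the sum of squares, and the count. The variance can be recovered from these through the identity $\sum (x_i - \bar{x})^2 = \sum x_i^2 - (\sum x_i)^2 / n$. The sketch below shows one way this could work. It is not Mahout's VarianceTotals class: the name `RunningTotals` and its methods are hypothetical, and the n-1 divisor is an assumption based on the "based on sample estimation" remark in the Javadoc.

```java
final class RunningTotals {
    private double sum;
    private double sumOfSquares;
    private double count;

    // Mapper-like step: fold one observation into the three totals.
    void add(double x) {
        sum += x;
        sumOfSquares += x * x;
        count += 1.0;
    }

    // Combiner/reducer-like step: totals from two partitions simply add up.
    void merge(RunningTotals other) {
        sum += other.sum;
        sumOfSquares += other.sumOfSquares;
        count += other.count;
    }

    // Sample variance (divisor n - 1, assumed) computed from the totals alone.
    double sampleVariance() {
        return (count * sumOfSquares - sum * sum) / (count * (count - 1.0));
    }

    double sampleStdDev() {
        return Math.sqrt(sampleVariance());
    }

    public static void main(String[] args) {
        // Two "partitions" of the group A scores, merged as a reducer would merge them.
        RunningTotals left = new RunningTotals();
        for (double x : new double[] {95, 85, 75}) {
            left.add(x);
        }
        RunningTotals right = new RunningTotals();
        for (double x : new double[] {65, 55, 45}) {
            right.add(x);
        }
        left.merge(right);
        System.out.printf("variance=%.2f stdDev=%.2f%n",
                left.sampleVariance(), left.sampleStdDev());
    }
}
```

Because the three totals are plain sums, partial results can be added in any order, which is consistent with the job reusing StandardDeviationCalculatorReducer as its combiner. One caveat of the sum-of-squares form is that it can lose precision when the mean is large compared with the spread of the data.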