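Standard deviation is the most commonly used measure of statistical dispersion in probability and statistics. It is defined as the square root of the arithmetic mean of the squared deviations of each value in a population from the population mean, and it reflects how spread out the individuals within a group are. A measure of dispersion should, in principle, have two properties: it is non-negative, and it has the same units as the measured data.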
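Standard formula:
Standard deviation is also called the experimental standard deviation. For N values x_1, ..., x_N with arithmetic mean \mu, the formula is
$$\sigma = \sqrt{\frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2}$$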
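Put simply, standard deviation measures how far a set of data is spread around its mean. A large standard deviation means that most values differ substantially from the mean; a small standard deviation means that the values lie close to the mean.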
For example, the two sets {0, 5, 9, 14} and {5, 6, 8, 9} both have a mean of 7, but the second set has the smaller standard deviation.
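The comparison of the two sets can be checked with a few lines of Java. This is a minimal sketch, assuming the population form of the standard deviation (dividing by the number of values); the class name SpreadExample and the method name populationStdDev are made up for illustration.

```java
final class SpreadExample {

    /** Population standard deviation: square root of the mean squared deviation. */
    static double populationStdDev(double[] values) {
        double sum = 0.0;
        for (double v : values) {
            sum += v;
        }
        double mean = sum / values.length;

        double squaredDeviations = 0.0;
        for (double v : values) {
            squaredDeviations += (v - mean) * (v - mean);
        }
        // Assumption: divide by n (population form), not by n - 1.
        return Math.sqrt(squaredDeviations / values.length);
    }

    public static void main(String[] args) {
        double[] first = {0, 5, 9, 14};
        double[] second = {5, 6, 8, 9};
        System.out.printf("first:  %.3f%n", populationStdDev(first));   // about 5.148
        System.out.printf("second: %.3f%n", populationStdDev(second));  // about 1.581
    }
}
```

Both sets share the mean 7, yet the squared deviations sum to 106 for the first set and to 10 for the second, which is why the second standard deviation is smaller.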
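Standard deviation can serve as a measure of uncertainty. In the physical sciences, for example, the standard deviation of a set of repeated measurements indicates the precision of those measurements. It also plays a decisive role in judging whether measurements agree with a prediction: if the mean of the measurements lies too far from the predicted value (with the distance judged against the standard deviation), the measurements are considered to contradict the prediction. This is easy to understand: if the measurements all fall outside a certain range around the predicted value, it is reasonable to question whether the prediction is correct.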
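Applied to investing, standard deviation serves as an indicator of how stable returns are. The larger the standard deviation, the further returns stray from their historical average, so returns are less stable and risk is higher. Conversely, the smaller the standard deviation, the more stable the returns and the lower the risk.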
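For example, groups A and B each have 6 students who sat the same Chinese language test. Group A scored 95, 85, 75, 65, 55, 45, and group B scored 73, 72, 71, 69, 68, 67. Both groups have a mean of 70, but the standard deviation (dividing by n = 6) is about 17.08 points for group A and about 2.16 points for group B, which shows that the students in group A differ from one another far more than those in group B.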
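For a population (that is, when computing the population variance), divide by n under the square root (Excel function: STDEVP);
For a sample (that is, when estimating the variance from a sample), divide by (n-1) under the square root (Excel function: STDEV);
Because most of the data encountered in practice are samples, dividing by (n-1) under the square root is the more common choice.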
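The effect of the two divisors can be seen on the test scores of groups A and B. The following is a minimal sketch rather than a reference implementation; the class name DivisorExample and its method names are made up for illustration.

```java
final class DivisorExample {

    /** Sum of squared deviations from the arithmetic mean. */
    static double sumOfSquaredDeviations(double[] values) {
        double sum = 0.0;
        for (double v : values) {
            sum += v;
        }
        double mean = sum / values.length;

        double total = 0.0;
        for (double v : values) {
            total += (v - mean) * (v - mean);
        }
        return total;
    }

    /** Divides by n: the data are treated as the whole population. */
    static double populationStdDev(double[] values) {
        return Math.sqrt(sumOfSquaredDeviations(values) / values.length);
    }

    /** Divides by n - 1: the data are treated as a sample (needs at least 2 values). */
    static double sampleStdDev(double[] values) {
        return Math.sqrt(sumOfSquaredDeviations(values) / (values.length - 1));
    }

    public static void main(String[] args) {
        double[] groupA = {95, 85, 75, 65, 55, 45};
        double[] groupB = {73, 72, 71, 69, 68, 67};
        System.out.printf("A: population %.2f, sample %.2f%n",
                populationStdDev(groupA), sampleStdDev(groupA));  // about 17.08 and 18.71
        System.out.printf("B: population %.2f, sample %.2f%n",
                populationStdDev(groupB), sampleStdDev(groupB));  // about 2.16 and 2.37
    }
}
```

Dividing by (n-1) always gives the larger of the two results, and the gap narrows as n grows.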
Variance
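When the data are widely spread (that is, they fluctuate strongly around the mean), the sum of squared differences between each value and the mean is large, so the variance is large; when the data are tightly clustered, that sum is small, so the variance is small. Thus the larger the variance, the greater the fluctuation in the data; the smaller the variance, the smaller the fluctuation. [5]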
In mahout-mr-0.11.0.jar, org.apache.mahout.math.hadoop.stats is the statistics package.
BasicStats: basic statistical methods (mean, variance, standard deviation, etc.)
/**
 * Licensed to the Apache Software Foundation (ASF) under one or more
 * contributor license agreements. See the NOTICE file distributed with
 * this work for additional information regarding copyright ownership.
 * The ASF licenses this file to You under the Apache License, Version 2.0
 * (the "License"); you may not use this file except in compliance with
 * the License. You may obtain a copy of the License at
 *
 * http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

package org.apache.mahout.math.hadoop.stats;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;
import org.apache.mahout.common.HadoopUtil;
import org.apache.mahout.common.Pair;
import org.apache.mahout.common.iterator.sequencefile.PathType;
import org.apache.mahout.common.iterator.sequencefile.SequenceFileDirIterable;

import java.io.IOException;

/**
 * Methods for calculating basic stats (mean, variance, stdDev, etc.) in map/reduce
 */
public final class BasicStats {

  private BasicStats() {
  }

  /**
   * Calculate the variance of values stored as
   *
   * @param input    The input file containing the key and the count
   * @param output   The output to store the intermediate values
   * @param baseConf The base configuration
   * @return The variance (based on sample estimation)
   */
  public static double variance(Path input, Path output,
                                Configuration baseConf)
    throws IOException, InterruptedException, ClassNotFoundException {
    VarianceTotals varianceTotals = computeVarianceTotals(input, output, baseConf);
    return varianceTotals.computeVariance();
  }

  /**
   * Calculate the variance by a predefined mean of values stored as
   *
   * @param input    The input file containing the key and the count
   * @param output   The output to store the intermediate values
   * @param mean     The mean based on which to compute the variance
   * @param baseConf The base configuration
   * @return The variance (based on sample estimation)
   */
  public static double varianceForGivenMean(Path input, Path output, double mean,
                                            Configuration baseConf)
    throws IOException, InterruptedException, ClassNotFoundException {
    VarianceTotals varianceTotals = computeVarianceTotals(input, output, baseConf);
    return varianceTotals.computeVarianceForGivenMean(mean);
  }

  private static VarianceTotals computeVarianceTotals(Path input, Path output,
                                                      Configuration baseConf)
    throws IOException, InterruptedException, ClassNotFoundException {
    // Work on a copy so the caller's configuration is left unchanged
    Configuration conf = new Configuration(baseConf);
    conf.set("io.serializations",
             "org.apache.hadoop.io.serializer.JavaSerialization,"
             + "org.apache.hadoop.io.serializer.WritableSerialization");
    Job job = HadoopUtil.prepareJob(input, output, SequenceFileInputFormat.class,
        StandardDeviationCalculatorMapper.class, IntWritable.class, DoubleWritable.class,
        StandardDeviationCalculatorReducer.class, IntWritable.class, DoubleWritable.class,
        SequenceFileOutputFormat.class, conf);
    // Delete any existing output path before the job runs
    HadoopUtil.delete(conf, output);
    job.setCombinerClass(StandardDeviationCalculatorReducer.class);
    boolean succeeded = job.waitForCompletion(true);
    if (!succeeded) {
      throw new IllegalStateException("Job failed!");
    }

    // Now extract the computed sum
    Path filesPattern = new Path(output, "part-*");
    double sumOfSquares = 0;
    double sum = 0;
    double totalCount = 0;
    for (Pair<Writable, Writable> record : new SequenceFileDirIterable<>(
        filesPattern, PathType.GLOB, null, null, true, conf)) {
      int key = ((IntWritable) record.getFirst()).get();
      if (key == StandardDeviationCalculatorMapper.SUM_OF_SQUARES.get()) {
        sumOfSquares += ((DoubleWritable) record.getSecond()).get();
      } else if (key == StandardDeviationCalculatorMapper.TOTAL_COUNT.get()) {
        totalCount += ((DoubleWritable) record.getSecond()).get();
      } else if (key == StandardDeviationCalculatorMapper.SUM.get()) {
        sum += ((DoubleWritable) record.getSecond()).get();
      }
    }

    VarianceTotals varianceTotals = new VarianceTotals();
    varianceTotals.setSum(sum);
    varianceTotals.setSumOfSquares(sumOfSquares);
    varianceTotals.setTotalCount(totalCount);
    return varianceTotals;
  }

  /**
   * Calculate the standard deviation
   *
   * @param input    The input file containing the key and the count
   * @param output   The output file to write the counting results to
   * @param baseConf The base configuration
   * @return The standard deviation
   */
  public static double stdDev(Path input, Path output,
                              Configuration baseConf)
    throws IOException, InterruptedException, ClassNotFoundException {
    return Math.sqrt(variance(input, output, baseConf));
  }

  /**
   * Calculate the standard deviation given a predefined mean
   *
   * @param input    The input file containing the key and the count
   * @param output   The output file to write the counting results to
   * @param mean     The mean based on which to compute the standard deviation
   * @param baseConf The base configuration
   * @return The standard deviation
   */
  public static double stdDevForGivenMean(Path input, Path output, double mean,
                                          Configuration baseConf)
    throws IOException, InterruptedException, ClassNotFoundException {
    return Math.sqrt(varianceForGivenMean(input, output, mean, baseConf));
  }
}
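
BasicStats leaves the final arithmetic to VarianceTotals, whose source is not shown here. The sketch below is not that class. It is a hypothetical, Hadoop-free illustration of why the three totals the job collects (sum, sum of squares, count) are enough to obtain a sample variance, using the identity s^2 = (n * Σx^2 - (Σx)^2) / (n * (n - 1)). The class name RunningTotalsSketch and its method names are assumptions made for the example.

```java
final class RunningTotalsSketch {

    /**
     * Sample variance (divisor n - 1) from three accumulated totals.
     * Hypothetical helper; not the Mahout VarianceTotals implementation.
     */
    static double sampleVariance(double sum, double sumOfSquares, double count) {
        return (count * sumOfSquares - sum * sum) / (count * (count - 1.0));
    }

    static double sampleStdDev(double sum, double sumOfSquares, double count) {
        return Math.sqrt(sampleVariance(sum, sumOfSquares, count));
    }

    public static void main(String[] args) {
        double[] groupB = {73, 72, 71, 69, 68, 67};

        // One pass over the data, keeping only three running totals.
        double sum = 0.0;
        double sumOfSquares = 0.0;
        double count = 0.0;
        for (double v : groupB) {
            sum += v;
            sumOfSquares += v * v;
            count += 1.0;
        }

        System.out.printf("variance: %.2f%n", sampleVariance(sum, sumOfSquares, count)); // about 5.60
        System.out.printf("std dev:  %.2f%n", sampleStdDev(sum, sumOfSquares, count));   // about 2.37
    }
}
```

Because each total is a plain sum, partial results from different mappers can be added together in any order, which suits a combiner and reducer. The trade-off is that subtracting two large, nearly equal numbers can lose precision when the mean is large relative to the spread.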