Measures of Scale

Latest recommended article published on 2024-10-20 19:52:38

code_tailor

Category: statistics. Tags: machine learning, artificial intelligence, deep learning

Article link: https://blog.csdn.net/u013226430/article/details/123563241

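This article discusses several ways to measure the variability of a data set, including the variance, standard deviation, range, average absolute deviation, and median absolute deviation. These measures describe the spread near the center of the data and in the tails, and they are compared under different distributions such as the normal, double exponential, chi-square, and Tukey lambda distributions. The standard deviation is strongly affected by extreme values, whereas the median absolute deviation and the interquartile range give more stable measures of spread when extreme values are present.

Summary generated automatically by CSDN.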

https://www.itl.nist.gov/div898/handbook/eda/section3/eda356.htm

Scale, Variability, or Spread

A fundamental task in many statistical analyses is to characterize the spread, or variability, of a data set. Measures of scale are simply attempts to estimate this variability.
When assessing the variability of a data set, there are two key components:

1. How spread out are the data values near the center?
2. How spread out are the tails?

Different numerical summaries will give different weight to these two elements. The choice of scale estimator is often driven by which of these components you want to emphasize.
The histogram is an effective graphical technique for showing both of these components of the spread.

Definitions of Variability

For univariate data, there are several common numerical measures of the spread:

variance - the variance is defined as
$$s^2 = \frac{\sum_{i=1}^{N} (Y_i - \bar{Y})^2}{N - 1}$$
where $\bar{Y}$ is the mean of the data.

The variance is roughly the arithmetic average of the squared distance from the mean. Squaring the distance from the mean has the effect of giving greater weight to values that are further from the mean. For example, a point 2 units from the mean adds 4 to the above sum while a point 10 units from the mean adds 100 to the sum. Although the variance is intended to be an overall measure of spread, it can be greatly affected by the tail behavior.
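As a rough illustration of how a single far-out point can dominate the variance, the sketch below computes the sample variance with the N - 1 denominator for two small data sets. The helper name `sample_variance` and both data sets are made up for this example; they do not come from the handbook.

```python
from statistics import mean

def sample_variance(values):
    """Sample variance: sum of squared deviations from the mean over N - 1."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two values")
    y_bar = mean(values)
    return sum((y - y_bar) ** 2 for y in values) / (n - 1)

# Hypothetical data: the two sets differ only in their largest value.
compact = [8, 9, 10, 11, 12]
with_outlier = [8, 9, 10, 11, 22]

var_compact = sample_variance(compact)
var_outlier = sample_variance(with_outlier)

# In the second set the mean is 12, so the value 22 lies 10 units away
# and contributes 100 of the 130 units in the sum of squares.
print(var_compact, var_outlier)
```

Changing one value out of five raises the variance from 2.5 to 32.5, which is the tail sensitivity described above.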

standard deviation - the standard deviation is the square root of the variance. That is,
$$s = \sqrt{\frac{\sum_{i=1}^{N} (Y_i - \bar{Y})^2}{N - 1}}$$
The standard deviation restores the units of the spread to the original data units (the variance squares the units).
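The point about units can be checked numerically. The sketch below uses Python's standard `statistics` module on a hypothetical set of heights (the data and variable names are assumptions for illustration): converting centimeters to meters scales the standard deviation by 1/100 but the variance by 1/10000.

```python
import math
from statistics import stdev, variance

# Hypothetical heights in centimeters.
heights_cm = [150, 160, 170, 180, 190]
heights_m = [h / 100 for h in heights_cm]

var_cm2 = variance(heights_cm)   # units: cm^2
sd_cm = math.sqrt(var_cm2)       # units: cm, same as the data

var_m2 = variance(heights_m)     # units: m^2
sd_m = stdev(heights_m)          # units: m

# The standard deviation rescales like the data; the variance rescales
# by the square of the conversion factor.
print(sd_cm / sd_m, var_cm2 / var_m2)
```

Here `variance` and `stdev` both use the N - 1 denominator, so they match the definitions given in the text.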

range - the range is the largest value minus the smallest value in a data set. Note that this measure is based only on the lowest and highest extreme values in the sample. The spread near the center of the data is not captured at all