Probability and Mathematical Statistics 1: Overview and Descriptive Statistics (Part 2)

Lum0s!

Article link: https://blog.csdn.net/m0_59751822/article/details/123190061

Probability Theory: Sections 1.3 and 1.4

1.3 Measures of Location
1.4 Measures of Variability

1.3 Measures of Location

Visual summaries of data are excellent tools for obtaining preliminary impressions and insights. More formal data analysis often requires the calculation and interpretation of numerical summary measures. That is, from the data we try to extract several summarizing quantities that might serve to characterize the data set and convey some of its prominent features.

The Mean

For a given set of numbers, the most familiar measure of center is the mean, or arithmetic average, of the set.

We will often refer to the arithmetic average as the sample mean and denote it by $\overline{x}$.

The sample mean $\overline{x}$ of observations $x_1, x_2, \ldots, x_n$ is given by

$\overline{x}=\frac{x_1+x_2+\cdots+x_n}{n}=\frac{\sum_{i=1}^n x_i}{n}$

The numerator of $\overline{x}$ can be written more informally as $\sum x_i$, where the summation is over all sample observations.

A physical interpretation of the sample mean demonstrates how it assesses the center of a sample.

The sample mean can be regarded as the balance point of the distribution of observations.

Just as $\overline{x}$ represents the average value of the observations in a sample, the average of all values in the population can be calculated. This average is called the population mean and is denoted by the Greek letter $\mu$.

One of our first tasks in statistical inference will be to present methods based on the sample mean for drawing conclusions about a population mean.

The mean suffers from one deficiency that makes it an inappropriate measure of center under some circumstances: its value can be greatly affected by the presence of even a single outlier (an unusually large or small observation).
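To make the outlier sensitivity concrete, here is a minimal sketch that computes the sample mean directly from its definition. The helper name `sample_mean` and both data lists are made up for illustration; they do not come from the original notes.

```python
def sample_mean(xs):
    """Arithmetic average: the sum of the observations divided by their count."""
    return sum(xs) / len(xs)

# Hypothetical data: the second list replaces the largest value with an outlier.
sample = [10, 12, 14, 16, 18]
with_outlier = [10, 12, 14, 16, 180]

print(sample_mean(sample))        # 14.0
print(sample_mean(with_outlier))  # 46.4
```

Changing a single observation moves the mean from 14.0 to 46.4, well above four of the five values in the sample.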

The Median

The sample median is the middle value once the observations are ordered from smallest to largest. When the observations are denoted by $x_1, \ldots, x_n$, we will use the symbol $\tilde{x}$ to represent the sample median.


The population median, denoted by $\tilde{\mu}$, is the middle value of the population. We can think of using the sample median $\tilde{x}$ to make an inference about $\tilde{\mu}$.
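A minimal sketch of the sample median, assuming the usual convention: when $n$ is odd the median is the single middle ordered value, and when $n$ is even it is the average of the two middle ordered values. The function name `sample_median` and the data are illustrative only.

```python
def sample_median(xs):
    """Middle value of the ordered sample (average of the two middle values if n is even)."""
    ordered = sorted(xs)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

print(sample_median([15, 3, 9, 7, 11]))  # 9   (ordered: 3, 7, 9, 11, 15)
print(sample_median([15, 3, 9, 7]))      # 8.0 (ordered: 3, 7, 9, 15)
```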


Other Measures of Location: Quartiles, Percentiles, and Trimmed Means

The median (population or sample) divides the data set into two parts of equal size. To obtain finer measures of location, we could divide the data into more than two such parts. Roughly speaking, quartiles divide the data set into four equal parts (three cut points split the ordered data into four segments).

Similarly, a data set (sample or population) can be even more finely divided using percentiles (99 cut points split the ordered data into 100 segments).
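As a rough illustration, Python's standard `statistics.quantiles` function returns the cut points directly. Note that several conventions for computing quartiles and percentiles exist, and the default method of this function is only one of them, so the values may differ slightly from a textbook's. The data below are hypothetical.

```python
import statistics

data = [3, 5, 7, 8, 9, 11, 13, 15, 16, 18, 20]  # hypothetical sample, n = 11

quartiles = statistics.quantiles(data, n=4)      # 3 cut points -> 4 segments
percentiles = statistics.quantiles(data, n=100)  # 99 cut points -> 100 segments

print(len(quartiles))    # 3
print(len(percentiles))  # 99
# The second quartile coincides with the median.
print(quartiles[1] == statistics.median(data))  # True
```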

The mean is quite sensitive to a single outlier, whereas the median is impervious to many outliers.

Extreme behavior of either type might be undesirable.

A trimmed mean is a compromise between $\bar{x}$ and $\tilde{x}$ . A 10% trimmed mean, for example, would be computed by eliminating the smallest 10% and the largest 10% of the sample and then averaging what remains.
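The trimming procedure can be sketched as follows. The function name `trimmed_mean` and the data are assumptions for illustration, and this version simply truncates when the trimming proportion times $n$ is not an integer, which is a simplification rather than a standard rule.

```python
def trimmed_mean(xs, proportion):
    """Drop the smallest and largest `proportion` of the sample, then average the rest."""
    if not 0 <= proportion < 0.5:
        raise ValueError("proportion must be in [0, 0.5)")
    ordered = sorted(xs)
    k = int(proportion * len(ordered))  # number of observations removed from each end
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)

data = [2, 4, 5, 6, 7, 8, 9, 10, 13, 100]  # hypothetical sample with one outlier

print(sum(data) / len(data))     # 16.4 (ordinary mean, pulled up by 100)
print(trimmed_mean(data, 0.10))  # 7.75 (drops 2 and 100)
```

For this sample the median is 7.5, so the 10% trimmed mean of 7.75 falls between the median and the mean of 16.4.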

Categorical Data and Sample Proportions

Suppose a 1 is recorded for each observation in a given category and a 0 for each observation not in that category. Then the sample proportion of observations in the category is the sample mean of the sequence of 1s and 0s.
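A short sketch of this 0/1 coding, using made-up category labels:

```python
labels = ["A", "B", "A", "C", "A", "B", "A", "A", "C", "A"]  # hypothetical categorical data

# Code each observation as 1 if it falls in category "A" and 0 otherwise.
indicators = [1 if label == "A" else 0 for label in labels]

# The sample proportion is just the sample mean of the indicators.
proportion = sum(indicators) / len(indicators)
print(proportion)  # 0.6
```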

1.4 Measures of Variability

Measures of Variability for Sample Data

Our primary measures of variability involve the deviations from the mean, $x_1 - \bar{x}, x_2 - \bar{x}, \ldots, x_n - \bar{x}$.

If all the deviations are small in magnitude, then all the $x_i$'s are close to the mean and there is little variability. Alternatively, if some of the deviations are large in magnitude, then some $x_i$'s lie far from $\bar{x}$, suggesting a greater amount of variability.

The sample variance, denoted by $s^2$, is given by

$s^2=\frac{\sum (x_i-\bar{x})^2}{n-1}=\frac{S_{xx}}{n-1}$

Note that the denominator is $n-1$, not $n$. If we used a divisor $n$ in the sample variance, the resulting quantity would tend to underestimate the population variance $\sigma^2$ (produce estimated values that are too small on the average), whereas dividing by the slightly smaller $n-1$ corrects this underestimation.

The sample standard deviation, denoted by $s$, is the (positive) square root of the variance:

$s=\sqrt{s^2}$

Note that $s^2$ and $s$ are both nonnegative. One appealing property of the standard deviation is that the unit for $s$ is the same as the unit for each of the $x_i$'s.
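The definitions of $s^2$ and $s$ can be sketched as below and cross-checked against Python's standard `statistics.variance` and `statistics.stdev`, which also use the $n-1$ divisor. The helper name `sample_variance` and the data are illustrative assumptions.

```python
import math
import statistics

def sample_variance(xs):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(xs)
    xbar = sum(xs) / n
    return sum((x - xbar) ** 2 for x in xs) / (n - 1)

data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical sample; mean 5, S_xx = 32

s2 = sample_variance(data)  # 32 / 7
s = math.sqrt(s2)

print(math.isclose(s2, statistics.variance(data)))  # True
print(math.isclose(s, statistics.stdev(data)))      # True
```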

Motivation for $s^2$

We will use $\sigma^2$ (the square of the lowercase Greek letter sigma) to denote the population variance and $\sigma$ to denote the population standard deviation (the square root of $\sigma^2$).

When the population is finite and consists of $N$ values, the population variance is (note that the denominator here is $N$):

$\sigma^2=\frac{\sum_{i=1}^{N} (x_i-\mu)^2}{N}$
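The effect of the divisor can be checked exactly on a tiny example. The sketch below assumes a hypothetical population of four values and enumerates all 16 equally likely ordered samples of size 2 drawn with replacement; exact fractions avoid rounding. It is only an illustration of the underestimation claim, not a proof.

```python
from fractions import Fraction
from itertools import product

def variance(xs, divisor):
    """Sum of squared deviations from the mean of xs, divided by `divisor`."""
    xbar = Fraction(sum(xs), len(xs))
    return sum((x - xbar) ** 2 for x in xs) / divisor

population = [1, 2, 3, 4]                        # hypothetical finite population
sigma2 = variance(population, len(population))   # population variance, divisor N

n = 2
samples = list(product(population, repeat=n))    # all ordered samples of size 2

avg_with_n_minus_1 = sum(variance(s, n - 1) for s in samples) / len(samples)
avg_with_n = sum(variance(s, n) for s in samples) / len(samples)

print(sigma2)              # 5/4
print(avg_with_n_minus_1)  # 5/4 (matches sigma^2 on average)
print(avg_with_n)          # 5/8 (too small on average)
```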

It is customary to refer to $s^2$ as being based on $n-1$ degrees of freedom (df). This terminology reflects the fact that although $s^2$ is based on the $n$ quantities $x_1 - \bar{x}, x_2 - \bar{x}, \ldots, x_n - \bar{x}$, these sum to 0, so specifying the values of any $n-1$ of the quantities determines the remaining value. For example, if $n = 4$ and $x_1 - \bar{x} = 8$, $x_2 - \bar{x} = -6$, and $x_4 - \bar{x} = -4$, then automatically $x_3 - \bar{x} = 2$, so only three of the four values of $x_i - \bar{x}$ are freely determined (3 df).
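The $n = 4$ example can be reproduced numerically. The sample below is a made-up one chosen so that its deviations are 8, -6, 2, and -4; the missing deviation is then recovered from the other three.

```python
data = [18, 4, 12, 6]  # hypothetical sample with mean 10
xbar = sum(data) / len(data)
deviations = [x - xbar for x in data]

print(deviations)       # [8.0, -6.0, 2.0, -4.0]
print(sum(deviations))  # 0.0

# Knowing any three deviations determines the fourth, because they sum to 0.
known = [deviations[0], deviations[1], deviations[3]]
recovered_third = -sum(known)
print(recovered_third)  # 2.0
```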

A Computing Formula for $s^2$

An alternative expression for the numerator of $s^2$ is

$S_{xx}=\sum (x_i-\bar{x})^2=\sum x_i^2-\frac{\left(\sum x_i\right)^2}{n}$
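A quick numerical check that the two expressions for $S_{xx}$ agree, on hypothetical data. In floating-point arithmetic the shortcut form can lose accuracy when the values are large relative to their spread, so the definitional form is often the safer choice in code.

```python
import math

data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical sample
n = len(data)
xbar = sum(data) / n

sxx_definition = sum((x - xbar) ** 2 for x in data)           # sum of squared deviations
sxx_shortcut = sum(x * x for x in data) - sum(data) ** 2 / n  # 232 - 40**2 / 8

print(sxx_definition)  # 32.0
print(sxx_shortcut)    # 32.0
print(math.isclose(sxx_definition, sxx_shortcut))  # True
```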


Proposition: Let $x_1, x_2, \ldots, x_n$ be a sample and $c$ be any nonzero constant.

If $y_1 = x_1 + c$, $y_2 = x_2 + c$, …, $y_n = x_n + c$, then $s_y^2 = s_x^2$, and

If $y_1 = cx_1, \ldots, y_n = cx_n$, then $s_y^2 = c^2 s_x^2$ and $s_y = |c| s_x$,

where $s_x^2$ is the sample variance of the $x$'s and $s_y^2$ is the sample variance of the $y$'s.
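Both parts of the proposition can be checked numerically with the standard `statistics` module. The data and the constant `c = -3` are arbitrary choices for illustration; a negative constant is used to show why the absolute value appears in $s_y = |c| s_x$.

```python
import math
import statistics

x = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical sample
c = -3                        # any nonzero constant

shifted = [xi + c for xi in x]  # y_i = x_i + c
scaled = [c * xi for xi in x]   # y_i = c * x_i

# Adding a constant leaves the variance unchanged.
print(math.isclose(statistics.variance(shifted), statistics.variance(x)))        # True
# Multiplying by c multiplies the variance by c^2 and the standard deviation by |c|.
print(math.isclose(statistics.variance(scaled), c**2 * statistics.variance(x)))  # True
print(math.isclose(statistics.stdev(scaled), abs(c) * statistics.stdev(x)))      # True
```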

Boxplots

In recent years, a pictorial summary called a boxplot has been used successfully to describe several of a data set's most prominent features. These features include (1) center, (2) spread, (3) the extent and nature of any departure from symmetry, and (4) identification of outliers, observations that lie unusually far from the main body of the data.

Order the n observations from smallest to largest and separate the smallest half from the largest half; the median $\tilde{x}$ is included in both halves if n is odd. Then the lower fourth is the median of the smallest half and the upper fourth is the median of the largest half. A measure of spread that is resistant to outliers is the fourth spread $f_s$ , given by

$f_s = \text{upper fourth} - \text{lower fourth}$

Roughly speaking, the fourth spread is unaffected by the positions of those observations in the smallest 25% or the largest 25% of the data. Hence it is resistant to outliers. The fourths are very similar to quartiles, and the fourth spread is similar to the interquartile range, the difference between the upper and lower quartiles.
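The construction of the fourths described above can be sketched as follows. The function names `median` and `fourths` and the data are illustrative assumptions; the median is included in both halves when $n$ is odd, as the text specifies.

```python
def median(ordered):
    """Median of an already sorted list."""
    n = len(ordered)
    mid = n // 2
    return ordered[mid] if n % 2 == 1 else (ordered[mid - 1] + ordered[mid]) / 2

def fourths(xs):
    """Return (lower fourth, upper fourth) of the sample."""
    ordered = sorted(xs)
    n = len(ordered)
    half = (n + 1) // 2  # size of each half; the median is shared when n is odd
    return median(ordered[:half]), median(ordered[n - half:])

odd_sample = [1, 3, 5, 7, 9, 11, 13, 15, 17]  # n = 9, median 9 is in both halves
lower, upper = fourths(odd_sample)
print(lower, upper, upper - lower)  # 5 13 8

even_sample = [1, 2, 3, 4, 5, 6, 7, 8]        # n = 8
lower, upper = fourths(even_sample)
print(lower, upper, upper - lower)  # 2.5 6.5 4.0
```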


In descriptive-statistics output from statistical software, Q1 and Q3 denote the lower and upper quartiles, respectively, and IQR (interquartile range) is the difference between these quartiles. SE Mean is $s/\sqrt{n}$, the "standard error of the mean".


Boxplots That Show Outliers

Any observation farther than $1.5f_s$ from the closest fourth is an outlier. An outlier is extreme if it is more than $3f_s$ from the nearest fourth, and it is mild otherwise.

Draw a whisker out from each end of the box to the smallest and largest observations that are not outliers. Then represent each mild outlier by a closed circle and each extreme outlier by an open circle.
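The outlier rule and the whisker endpoints can be sketched in a few lines. All function names and the data are hypothetical, and the fourths are computed by splitting the ordered sample into halves with the median shared when $n$ is odd.

```python
def median(ordered):
    """Median of an already sorted list."""
    n = len(ordered)
    mid = n // 2
    return ordered[mid] if n % 2 == 1 else (ordered[mid - 1] + ordered[mid]) / 2

def fourths(xs):
    """Return (lower fourth, upper fourth); the median is in both halves if n is odd."""
    ordered = sorted(xs)
    n = len(ordered)
    half = (n + 1) // 2
    return median(ordered[:half]), median(ordered[n - half:])

def classify(x, lower, upper, fs):
    """Label x by its distance from the nearest fourth, measured in units of fs."""
    distance = max(lower - x, x - upper, 0)
    if distance > 3 * fs:
        return "extreme"
    if distance > 1.5 * fs:
        return "mild"
    return "not an outlier"

data = [20, 22, 23, 24, 25, 26, 27, 28, 30, 45, 70]  # hypothetical sample
lower, upper = fourths(data)  # 23.5 and 29.0
fs = upper - lower            # 5.5

labels = {x: classify(x, lower, upper, fs) for x in data}
inliers = [x for x in data if labels[x] == "not an outlier"]

print(labels[45], labels[70])      # mild extreme
print(min(inliers), max(inliers))  # 20 30 (whisker endpoints)
```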


Comparative Boxplots

A comparative or side-by-side boxplot is a very effective way of revealing similarities and differences between two or more data sets consisting of observations on the same variable.

