Probability and Mathematical Statistics 1: Overview and Descriptive Statistics (Part 2)

1.3 Measures of Location

Visual summaries of data are excellent tools for obtaining preliminary impressions and insights. More formal data analysis often requires the calculation and interpretation of numerical summary measures. That is, from the data we try to extract several summarizing quantities that might serve to characterize the data set and convey some of its prominent features.

The Mean

For a given set of numbers $x_1, x_2, \dots, x_n$, the most familiar measure of center is the mean, or arithmetic average, of the set. We will often refer to the arithmetic average as the sample mean and denote it by $\overline{x}$.

The sample mean $\overline{x}$ of observations $x_1, x_2, \dots, x_n$ is given by

$$\overline{x}=\frac{x_1+x_2+\cdots+x_n}{n}=\frac{1}{n}\sum_{i=1}^n x_i$$

The numerator of $\overline{x}$ can be written more informally as $\sum x_i$, where the summation is over all sample observations.

A physical interpretation of the sample mean shows how it assesses the center of a sample: the sample mean can be regarded as the balance point of the distribution of observations.

Just as $\overline{x}$ represents the average value of the observations in a sample, the average of all values in the population can be calculated. This average is called the population mean and is denoted by the Greek letter $\mu$.

One of our first tasks in statistical inference will be to present methods based on the sample mean for drawing conclusions about a population mean.

The mean suffers from one deficiency that makes it an inappropriate measure of center under some circumstances: its value can be greatly affected by the presence of even a single outlier (an unusually large or small observation).
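The sensitivity of the mean to a single outlier can be illustrated with a minimal sketch. The data values and the helper name `sample_mean` are hypothetical, chosen only for illustration.

```python
def sample_mean(xs):
    """Arithmetic average: the sum of the observations divided by n."""
    return sum(xs) / len(xs)

data = [2.0, 3.0, 4.0, 5.0, 6.0]   # hypothetical sample
with_outlier = data[:-1] + [60.0]  # same sample with the largest value replaced by an outlier

print(sample_mean(data))           # 4.0
print(sample_mean(with_outlier))   # 14.8
```

Changing one observation out of five moves the mean from 4.0 to 14.8, well away from the bulk of the data.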

The Median

The sample median is the middle value once the observations are ordered from smallest to largest. When $n$ is odd it is the single middle value; when $n$ is even it is the average of the two middle values. When the observations are denoted by $x_1, \dots, x_n$, we will use the symbol $\tilde{x}$ to represent the sample median.
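A short sketch of the median computation, assuming the usual convention that for an even number of observations the median is the average of the two middle ordered values. The function name `sample_median` is an illustrative choice.

```python
def sample_median(xs):
    """Middle value of the ordered observations (average of the two middle values if n is even)."""
    ordered = sorted(xs)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

print(sample_median([5, 1, 3]))          # 3
print(sample_median([4, 1, 3, 2]))       # 2.5
print(sample_median([2, 3, 4, 5, 60]))   # 4
```

The last call shows that a single very large observation leaves the median unchanged.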


The population median, denoted by $\tilde{\mu}$, is the middle value of the population. We can use the sample median $\tilde{x}$ to make an inference about $\tilde{\mu}$.


Other Measures of Location: Quartiles, Percentiles, and Trimmed Means

The median (population or sample) divides the data set into two parts of equal size. To obtain finer measures of location, we could divide the data into more than two such parts. Roughly speaking, quartiles divide the data set into four equal parts (three cut points split the data into four segments).

Similarly, a data set (sample or population) can be even more finely divided using percentiles (99 cut points split the data into 100 segments).
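As a rough illustration, Python's standard `statistics.quantiles` function returns the cut points directly. Different software uses slightly different interpolation rules for quartiles and percentiles, so the exact values below depend on that function's default method. The data are hypothetical.

```python
import statistics

data = [1, 2, 3, 4, 5, 6, 7]  # hypothetical sample

quartiles = statistics.quantiles(data, n=4)      # 3 cut points -> 4 segments
percentiles = statistics.quantiles(data, n=100)  # 99 cut points -> 100 segments

print(len(quartiles), len(percentiles))
print(quartiles[1] == statistics.median(data))   # the second quartile is the median
```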

The mean is quite sensitive to a single outlier, whereas the median is impervious to many outliers.

Extreme behavior of either type might be undesirable.

A trimmed mean is a compromise between $\bar{x}$ and $\tilde{x}$. A 10% trimmed mean, for example, would be computed by eliminating the smallest 10% and the largest 10% of the sample and then averaging what remains.
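The trimming procedure can be sketched as follows. This version assumes that the trimming proportion times $n$ is a whole number (it simply truncates otherwise); the function name `trimmed_mean` and the data are hypothetical.

```python
def trimmed_mean(xs, proportion):
    """Drop the smallest and largest `proportion` of the sample, then average the rest."""
    if not 0 <= proportion < 0.5:
        raise ValueError("proportion must be in [0, 0.5)")
    ordered = sorted(xs)
    # Number of observations dropped from EACH end; the tiny offset guards
    # against floating-point results such as 2.9999999999999996.
    k = int(len(ordered) * proportion + 1e-9)
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)

data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]  # hypothetical sample with one outlier
print(sum(data) / len(data))             # 14.5  (ordinary mean)
print(trimmed_mean(data, 0.10))          # 5.5   (drops 1 and 100)
```

With a trimming proportion of 0 the function returns the ordinary mean, and as the proportion approaches 50% the result approaches the median.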

Categorical Data and Sample Proportions

For categorical data, suppose a 1 is recorded for an observation in the category of interest and a 0 for an observation not in the category. Then the sample proportion of observations in the category is the sample mean of the sequence of 1s and 0s.
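A minimal sketch of this 0/1 coding, using made-up survey responses:

```python
responses = ["yes", "no", "yes", "yes", "no"]  # hypothetical categorical data

# Code each observation as 1 if it falls in the category "yes", else 0.
indicators = [1 if r == "yes" else 0 for r in responses]

# The sample proportion is the sample mean of the 1s and 0s.
proportion = sum(indicators) / len(indicators)
print(proportion)  # 0.6
```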

1.4 Measures of Variability

Measures of Variability for Sample Data

Our primary measures of variability involve the deviations from the mean, $x_1-\bar{x}, x_2-\bar{x}, \dots, x_n-\bar{x}$.

If all the deviations are small in magnitude, then all the $x_i$'s are close to the mean and there is little variability. Alternatively, if some of the deviations are large in magnitude, then some $x_i$'s lie far from $\bar{x}$, suggesting a greater amount of variability.

The sample variance, denoted by $s^2$, is given by

$$s^2=\frac{\sum (x_i-\bar{x})^2}{n-1}=\frac{S_{xx}}{n-1}$$

Note that the divisor is $n-1$, not $n$. If we used a divisor $n$ in the sample variance, the resulting quantity would tend to underestimate the population variance $\sigma^2$ (produce estimated values that are too small on average), whereas dividing by the slightly smaller $n-1$ corrects this underestimation.

The sample standard deviation, denoted by $s$, is the (positive) square root of the variance:

$$s=\sqrt{s^2}$$

Note that $s^2$ and $s$ are both nonnegative. One appealing property of the standard deviation is that the unit for $s$ is the same as the unit for each of the $x_i$'s.
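The definitions of $s^2$ and $s$ translate directly into code. The sketch below uses a hypothetical sample and cross-checks the hand-written function against Python's standard `statistics.variance`, which also uses the $n-1$ divisor.

```python
import math
import statistics

def sample_variance(xs):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(xs)
    xbar = sum(xs) / n
    sxx = sum((x - xbar) ** 2 for x in xs)
    return sxx / (n - 1)

data = [2.0, 4.0, 4.0, 6.0, 9.0]   # hypothetical sample, mean 5.0

s2 = sample_variance(data)         # 28 / 4 = 7.0
s = math.sqrt(s2)                  # same unit as the observations

print(s2, s)
print(math.isclose(s2, statistics.variance(data)))  # True
```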

Motivation for $s^2$

We will use $\sigma^2$ (the square of the lowercase Greek letter sigma) to denote the population variance and $\sigma$ to denote the population standard deviation (the square root of $\sigma^2$).

When the population is finite and consists of $N$ values, the divisor is $N$:

$$\sigma^2=\frac{\sum_{i=1}^{N} (x_i-\mu)^2}{N}$$
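To make the difference between the two divisors concrete, the following sketch applies both formulas to the same hypothetical values, using the standard-library functions `statistics.pvariance` (divisor $N$) and `statistics.variance` (divisor $n-1$).

```python
import statistics

values = [2.0, 4.0, 4.0, 6.0, 9.0]  # hypothetical data; sum of squared deviations is 28

# Treating the values as an entire finite population: divide by N = 5.
sigma2 = statistics.pvariance(values)

# Treating the same values as a sample: divide by n - 1 = 4.
s2 = statistics.variance(values)

print(sigma2, s2)
```

The sample version is always the larger of the two, which is what offsets the tendency to underestimate.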

It is customary to refer to $s^2$ as being based on $n-1$ degrees of freedom (df). This terminology reflects the fact that although $s^2$ is based on the $n$ quantities $x_1-\bar{x}, x_2-\bar{x}, \dots, x_n-\bar{x}$, these sum to 0, so specifying the values of any $n-1$ of the quantities determines the remaining value. For example, if $n=4$ and $x_1-\bar{x}=8$, $x_2-\bar{x}=-6$, and $x_4-\bar{x}=-4$, then automatically $x_3-\bar{x}=2$, so only three of the four values of $x_i-\bar{x}$ are freely determined (3 df).
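The degrees-of-freedom argument can be checked numerically. The first part reproduces the $n = 4$ example from the text; the second part uses a hypothetical sample to confirm that the deviations sum to zero.

```python
# Part 1: three known deviations (x1, x2, x4) determine the fourth.
known_deviations = {1: 8, 2: -6, 4: -4}
d3 = -sum(known_deviations.values())   # deviations must sum to 0
print(d3)  # 2

# Part 2: deviations from the mean sum to zero for a hypothetical sample.
sample = [2.0, 4.0, 4.0, 6.0, 9.0]
xbar = sum(sample) / len(sample)
deviations = [x - xbar for x in sample]
print(sum(deviations))  # 0.0
```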

A Computing Formula for $s^2$

An alternative expression for the numerator of $s^2$ is

$$S_{xx}=\sum (x_i-\bar{x})^2=\sum x_i^2-\frac{\left(\sum x_i\right)^2}{n}$$
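A quick numerical check that the definitional and computing forms of $S_{xx}$ agree, using a hypothetical sample:

```python
data = [2.0, 4.0, 4.0, 6.0, 9.0]   # hypothetical sample
n = len(data)
xbar = sum(data) / n

# Definitional form: sum of squared deviations from the mean.
sxx_definition = sum((x - xbar) ** 2 for x in data)

# Computing form: sum of squares minus (sum)^2 / n.
sxx_shortcut = sum(x ** 2 for x in data) - sum(data) ** 2 / n

print(sxx_definition, sxx_shortcut)  # 28.0 28.0
```

The computing form avoids calculating each deviation, but on data with a large mean and small spread it can lose precision in floating-point arithmetic because it subtracts two large, nearly equal numbers.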


Proposition: Let $x_1, x_2, \dots, x_n$ be a sample and $c$ be any nonzero constant.

  1. If $y_1 = x_1 + c$, $y_2 = x_2 + c$, …, $y_n = x_n + c$, then $s_y^2 = s_x^2$, and

  2. If $y_1 = cx_1, \dots, y_n = cx_n$, then $s_y^2 = c^2 s_x^2$ and $s_y = |c| s_x$,

    where $s_x^2$ is the sample variance of the x's and $s_y^2$ is the sample variance of the y's.
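Both parts of the proposition can be verified numerically. The sample and the constant `c` below are hypothetical; `statistics.variance` and `statistics.stdev` are the standard-library sample versions (divisor $n-1$).

```python
import math
import statistics

x = [2.0, 4.0, 4.0, 6.0, 9.0]   # hypothetical sample
c = -3.0                         # any nonzero constant

shifted = [xi + c for xi in x]   # y_i = x_i + c
scaled = [c * xi for xi in x]    # y_i = c * x_i

# 1. Adding a constant leaves the variance unchanged.
print(math.isclose(statistics.variance(shifted), statistics.variance(x)))

# 2. Multiplying by c scales the variance by c^2 and the standard deviation by |c|.
print(math.isclose(statistics.variance(scaled), c ** 2 * statistics.variance(x)))
print(math.isclose(statistics.stdev(scaled), abs(c) * statistics.stdev(x)))
```

A negative `c` is used deliberately so that the absolute value in $s_y = |c| s_x$ matters.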

Boxplots

In recent years, a pictorial summary called a boxplot has been used successfully to describe several of a data set's most prominent features. These features include (1) center, (2) spread, (3) the extent and nature of any departure from symmetry, and (4) identification of outliers, observations that lie unusually far from the main body of the data.

Order the $n$ observations from smallest to largest and separate the smallest half from the largest half; the median $\tilde{x}$ is included in both halves if $n$ is odd. Then the lower fourth is the median of the smallest half and the upper fourth is the median of the largest half. A measure of spread that is resistant to outliers is the fourth spread $f_s$, given by

$$f_s = \text{upper fourth} - \text{lower fourth}$$

Roughly speaking, the fourth spread is unaffected by the positions of those observations in the smallest 25% or the largest 25% of the data. Hence it is resistant to outliers. The fourths are very similar to quartiles, and the fourth spread is similar to the interquartile range, the difference between the upper and lower quartiles.
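The construction of the fourths described above can be sketched as follows. The function names `fourths` and `fourth_spread` and the data are illustrative assumptions.

```python
import statistics

def fourths(xs):
    """Return (lower fourth, upper fourth) of the sample."""
    ordered = sorted(xs)
    n = len(ordered)
    half = (n + 1) // 2                      # the median belongs to both halves when n is odd
    lower = statistics.median(ordered[:half])
    upper = statistics.median(ordered[n - half:])
    return lower, upper

def fourth_spread(xs):
    lower, upper = fourths(xs)
    return upper - lower

data = [1, 2, 3, 4, 5, 6, 7]                 # hypothetical sample, n odd
print(fourths(data))                         # (2.5, 5.5)
print(fourth_spread(data))                   # 3.0
print(fourth_spread([1, 2, 3, 4, 5, 6, 700]))  # 3.0, unchanged by the extreme value
```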


Q1 and Q3 denote the lower and upper quartiles, respectively, and IQR (interquartile range) is the difference between these quartiles. SE Mean is $s/\sqrt{n}$, the "standard error of the mean."


Boxplots That Show Outliers

Any observation farther than $1.5f_s$ from the closest fourth is an outlier. An outlier is extreme if it is more than $3f_s$ from the nearest fourth, and it is mild otherwise.
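The outlier rule can be sketched in a few lines. The helper names and the data are hypothetical; the fourths are computed as the medians of the lower and upper halves, with the median included in both halves when $n$ is odd.

```python
import statistics

def fourths(xs):
    """Return (lower fourth, upper fourth) of the sample."""
    ordered = sorted(xs)
    n = len(ordered)
    half = (n + 1) // 2
    return statistics.median(ordered[:half]), statistics.median(ordered[n - half:])

def classify_outliers(xs):
    """Return (mild, extreme) outliers using the 1.5*fs and 3*fs rules."""
    lower, upper = fourths(xs)
    fs = upper - lower
    mild, extreme = [], []
    for x in sorted(xs):
        # Distance beyond the nearest fourth; 0 for points inside the box.
        distance = max(lower - x, x - upper, 0)
        if distance > 3 * fs:
            extreme.append(x)
        elif distance > 1.5 * fs:
            mild.append(x)
    return mild, extreme

data = [1, 2, 3, 4, 5, 6, 7, 15, 40]   # hypothetical sample: fourths 3 and 7, fs = 4
print(classify_outliers(data))          # ([15], [40])
```

Here 15 lies 8 units above the upper fourth (more than $1.5f_s = 6$ but not more than $3f_s = 12$), so it is mild, while 40 lies 33 units above and is extreme.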

Draw a whisker out from each end of the box to the smallest and largest observations that are not outliers. Then represent each mild outlier by a closed circle and each extreme outlier by an open circle.


Comparative Boxplots

A comparative or side-by-side boxplot is a very effective way of revealing similarities and differences between two or more data sets consisting of observations on the same variable.
