

In Part 2 we covered frequency distributions and histograms. Part 4 covers sample space, probability, permutations and combinations.


In this part we will look into:

1. Measures of central tendency
2. Measures of variation and position
3. Exploratory data analysis

Central tendency and variation play an important role in describing data. Consider the following statements:

A. The average household income is $50,000.
B. More than 40% of the people gathered in the procession were teenagers.
C. 50% of all products bought in cities today are bought online.

Though these statements are plain, they convey a lot of information. Being able to measure central tendency is an important and basic skill in statistics.

Statistic: a measure obtained by using the data values from a sample. Eg: the average daily sales of a sales agent over 10 randomly chosen days.

Parameter: a measure obtained by using all the data values of a population. Eg: the average daily sales of a sales agent over his entire time in the job.


Measures of central tendency

  1. Mean : The mean, also known as the arithmetic average, is found by adding the values of the data and dividing by the total number of values.


Eg: The mean height of a family whose 5 members have heights of 5', 4.5', 3', 6' and 3.5' is (5 + 4.5 + 3 + 6 + 3.5)/5 = 22/5 = 4.4 ft.

To find the mean of grouped data, take the midpoint of each class and multiply it by the class frequency.

The class and frequency (f) are given. Multiplying the midpoint (Xm) by f gives the total f*Xm for each class. Sum these totals and divide by the sum of the frequencies to get the mean.
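The grouped-mean procedure can be sketched in a few lines of Python. The frequency table below is hypothetical (the original table was an image that is not available here), and `grouped_mean` is just an illustrative helper name.

```python
# Hypothetical frequency table: (lower class limit, upper class limit, frequency f).
classes = [(10, 20, 3), (20, 30, 5), (30, 40, 2)]


def grouped_mean(table):
    """Approximate the mean of grouped data from class midpoints."""
    total_f = sum(f for _, _, f in table)
    # f * Xm for each class, where Xm is the class midpoint.
    total_f_xm = sum(f * (low + high) / 2 for low, high, f in table)
    return total_f_xm / total_f


print(grouped_mean(classes))
```

Here the midpoints are 15, 25 and 35, so the f*Xm totals are 45, 125 and 70, and the mean is 240/10 = 24.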

2. Median: the midpoint of the ordered data array, also known as the 50th percentile of the data. The data must be arranged in order before the median can be found.

Eg: Find the median of the daily vehicle pass charge for five U.S. National Parks. The costs are $25, $15, $15, $20, and $15.
Order the data: 15, 15, 15, 20, 25. The midpoint of the data array is 15. If there is no single midpoint (the number of values is even), average the 2 middle numbers.

The midrange is a rough estimate of the middle. It is found by adding the lowest and highest values in the data set and dividing by 2.
Eg: for 2, 3, 6, 8, 4, 1 the midrange = (1 + 8)/2 = 4.5
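As a quick check, the median and midrange examples can be reproduced with Python's standard `statistics` module. This is only a small sketch; `midrange` is a helper written here for illustration, not a library function.

```python
from statistics import median

park_fees = [25, 15, 15, 20, 15]
print(median(park_fees))      # median() sorts the data itself

values = [2, 3, 6, 8, 4, 1]   # even count: the two middle values are averaged
print(median(values))


def midrange(data):
    """Rough estimate of the middle: (lowest + highest) / 2."""
    return (min(data) + max(data)) / 2


print(midrange(values))
```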

3. Weighted mean: a variant of the mean in which each value is multiplied by its weight, and the total is divided by the sum of the weights. Eg: If you purchase milk from 3 shops where the price per gallon is $2, $3 and $4 respectively, the average price you paid per gallon is not necessarily $3. Assume you bought 1 gal, 2 gal and 3 gal from the respective shops; the average price paid per gallon is (1x2 + 2x3 + 3x4)/(1 + 2 + 3) = 20/6 = 3.33.
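The milk example translates directly into code. The sketch below assumes the weights are the gallons bought at each shop; `weighted_mean` is an illustrative helper name.

```python
prices_per_gal = [2, 3, 4]   # price in dollars at each shop
gallons = [1, 2, 3]          # weights: gallons bought at each shop


def weighted_mean(values, weights):
    """Sum of value * weight divided by the sum of the weights."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


print(round(weighted_mean(prices_per_gal, gallons), 2))
```

With equal weights the function reduces to the ordinary mean, which is why the unweighted answer of 3 only holds if you buy the same amount at every shop.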

4. Mode: the value that occurs most often, that is, the value with the highest frequency. A dataset can have more than one mode, or no mode at all.
Eg: the mode of 2, 3, 3, 4 is 3
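A small sketch of the three cases (one mode, several modes, no mode). The `modes` helper is hypothetical and follows the convention used above that a dataset in which no value repeats has no mode.

```python
from collections import Counter


def modes(values):
    """Return every value that shares the highest frequency (empty if nothing repeats)."""
    counts = Counter(values)
    if not counts:
        return []
    top = max(counts.values())
    if top == 1:
        return []            # every value occurs once: no mode
    return [v for v, c in counts.items() if c == top]


print(modes([2, 3, 3, 4]))   # one mode
print(modes([1, 1, 2, 2]))   # two modes
print(modes([1, 2, 3]))      # no mode
```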

Measures of Variation

A coach wants to select a squad of 5 tall kids for athletics and is considering the groups below (heights in ft).
Group A: 3.8, 4.2, 5, 5.5, 7.5 (crazy tall guy)
Group B: 5.1, 5, 5.2, 5.3, 5.4
Though Groups A and B have the same average height (5.2 ft), the heights in Group B are far more evenly distributed. This is a small data set, so the high variation in Group A is visible at a glance. To quantify the variation in data we use the range, variance and standard deviation.

  1. Range is the highest value minus the lowest value.


    Range for the above examples:

    Group A: 7.5 - 3.8 = 3.7

    Group B: 5.4 - 5 = 0.4

  2. Population variance is the average of the squared distances of each value from the mean. It is denoted by σ² (the Greek letter sigma, squared): σ² = Σ(X - μ)² / N, where μ is the population mean and N is the number of values in the population.

3. Population standard deviation: the square root of the variance. The symbol for the population standard deviation is σ (sigma): σ = √(Σ(X - μ)² / N).
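To see how these measures separate the two squads, here is a minimal sketch using `pvariance` and `pstdev` from Python's standard `statistics` module, treating each group as a complete population.

```python
from statistics import mean, pstdev, pvariance

group_a = [3.8, 4.2, 5, 5.5, 7.5]
group_b = [5.1, 5, 5.2, 5.3, 5.4]

for name, heights in (("A", group_a), ("B", group_b)):
    spread = max(heights) - min(heights)          # range
    print(
        name,
        round(mean(heights), 2),
        round(spread, 2),
        round(pvariance(heights), 3),             # divides by N
        round(pstdev(heights), 3),                # square root of pvariance
    )
```

Both groups have a mean of 5.2 ft, but the variance of Group A (about 1.676) is far larger than that of Group B (about 0.02).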

Figure: Mean, variance and SD calculation

4. Sample variance and sample standard deviation: when the population variance formula is applied to a sample, it does not give the best estimate of the population variance; it usually underestimates it. Therefore, instead of dividing by n, compute the variance of a sample by dividing by n - 1: s² = Σ(X - X̄)² / (n - 1), where X̄ is the sample mean. The same applies to the sample standard deviation, s = √s².
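The effect of dividing by n - 1 can be seen by running both formulas on the same numbers. This sketch reuses the Group A heights and treats them, for illustration only, as a sample rather than a population.

```python
from statistics import pvariance, stdev, variance

sample = [3.8, 4.2, 5, 5.5, 7.5]

print(round(pvariance(sample), 3))  # divides by n
print(round(variance(sample), 3))   # divides by n - 1
print(round(stdev(sample), 3))      # square root of the sample variance
```

The sum of squared deviations is 8.38 in both cases; dividing by 5 gives 1.676 and dividing by 4 gives 2.095, so the sample formula always yields the larger value.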


Variance and Standard Deviation for Grouped Data



Coefficient of variation: a statistic that allows you to compare standard deviations when the units are different. It is the standard deviation divided by the mean, expressed as a percentage: CVar = (s / X̄) x 100%.

The range can be used to approximate the standard deviation. The approximation is called the range rule of thumb: S.D ≈ Range/4.
Eg: for 5, 8, 9, 11, 18 the range = 18 - 5 = 13, so S.D ≈ 13/4 = 3.25
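It is worth comparing the rule of thumb with the standard deviation computed from the data. The sketch below assumes the five values are a sample, so it uses `statistics.stdev`.

```python
from statistics import stdev

data = [5, 8, 9, 11, 18]

estimate = (max(data) - min(data)) / 4   # range rule of thumb
actual = stdev(data)                     # sample standard deviation

print(estimate, round(actual, 2))
```

For this data the computed standard deviation is about 4.87 against an estimate of 3.25, a reminder that the rule gives only a rough approximation, especially on a small dataset.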

Chebyshev's Theorem

The proportion of values from a data set that fall within k standard deviations of the mean is at least 1 - 1/k², where k is a number greater than 1.

For example, the theorem states that at least 75% of the data values fall within 2 standard deviations of the mean of the data set. This result is found by substituting k = 2 in the expression:
1 - 1/k² = 1 - 1/2² = 0.75 or 75%
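The bound is easy to tabulate for other values of k. A minimal sketch, with `chebyshev_lower_bound` as an illustrative helper name:

```python
def chebyshev_lower_bound(k):
    """Minimum proportion of values within k standard deviations of the mean."""
    if k <= 1:
        raise ValueError("k must be greater than 1")
    return 1 - 1 / k**2


for k in (1.5, 2, 3):
    print(k, round(chebyshev_lower_bound(k), 3))
```

For k = 3 the bound is 1 - 1/9, or roughly 89%. Remember these are minimums that hold for any distribution; real data often does better.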


The Empirical (Normal) Rule

Chebyshev’s theorem applies to any distribution regardless of its shape. However, when a distribution is bell-shaped (or what is called normal), the following statements, which make up the empirical rule, are true.


  1. Approximately 68% of the data values will fall within 1 standard deviation of the mean.

  2. Approximately 95% of the data values will fall within 2 standard deviations of the mean.

  3. Approximately 99.7% of the data values will fall within 3 standard deviations of the mean.
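The three percentages can be checked against the normal curve itself. This sketch uses `NormalDist` from Python's standard `statistics` module (available from Python 3.8); `share_within` is an illustrative helper name.

```python
from statistics import NormalDist

standard_normal = NormalDist()  # mean 0, standard deviation 1


def share_within(k):
    """Proportion of a normal distribution within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)


for k in (1, 2, 3):
    print(k, round(share_within(k), 4))
```

Compare these with Chebyshev's guarantee: for k = 2 the normal curve holds about 95% of the values, well above the 75% minimum that applies to any distribution.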


Measures of Position

“You can’t compare apples and oranges.” But with the use of statistics, it can be done to some extent. Suppose that a student scored 90 on a music test and 45 on an English exam. Direct comparison of raw scores is impossible, since the exams might not be equivalent in terms of number of questions, value of each question, and so on. However, a comparison of a relative standard similar to both can be made. This comparison uses the mean and standard deviation and is called a standard score or z score. (We also use z scores in later chapters.)


A standard score or z score tells how many standard deviations a data value is above or below the mean for a specific distribution of values. If a standard score is zero, then the data value is the same as the mean.


A z score or standard score for a value is obtained by subtracting the mean from the value and dividing the result by the standard deviation. The symbol for a standard score is z. The formula is

z = (value - mean) / standard deviation

For samples, the formula is z = (X - X̄)/s, and for populations it is z = (X - μ)/σ.

The z score represents the number of standard deviations that a data value falls above or below the mean.
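Here is how the music and English scores could be compared. Only the raw scores 90 and 45 come from the example above; the class means and standard deviations are hypothetical numbers chosen for illustration.

```python
def z_score(value, mean, std_dev):
    """Number of standard deviations a value lies above (+) or below (-) the mean."""
    return (value - mean) / std_dev


# Hypothetical class statistics: not given in the article.
music_z = z_score(90, mean=82, std_dev=8)     # (90 - 82) / 8 = 1.0
english_z = z_score(45, mean=35, std_dev=5)   # (45 - 35) / 5 = 2.0

print(music_z, english_z)
```

Under these assumed figures the student's relative position is higher in English (2 standard deviations above the mean) than in music (1 standard deviation above), even though the raw English score is lower.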



Percentile calculation: calculating a percentile is simple, yet it is an important skill in statistics. It starts from the formula c = (n x p)/100, where
n = total number of values in the array
p = percentile required
* If c is a whole number, the percentile is the average of the c-th and (c+1)-th values of the sorted array.
* If c is not a whole number, round it up to the next whole number and take the value at that position.

Eg: Calculate the 30th percentile of the array of 20 numbers below.
[3,4,1,7,6,8,12,11,14,18,2,5,13,9,19,16,10,17,15,20]
Step 1: Arrange the numbers in ascending order.
[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20]
Step 2: Identify the number of values and the percentile required, and substitute them in the formula.
n = 20
p = 30
c = (n x p)/100 = 20 x 30/100 = 6
Step 3: The 6th number is 6. Since c is a whole number, average the 6th and 7th numbers: (6 + 7)/2 = 6.5. So 6.5 is the 30th percentile.
Similarly, 5.5 is the 25th percentile.
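The counting rule can be written as a short function. This is a sketch of the method described in this article, not of any library's percentile definition (library functions often interpolate differently and may give slightly different answers); `percentile` is an illustrative helper name.

```python
import math


def percentile(values, p):
    """p-th percentile (0 < p < 100) using the counting rule from this article."""
    if not 0 < p < 100:
        raise ValueError("p must be between 0 and 100, exclusive")
    ordered = sorted(values)
    c = len(ordered) * p / 100
    if c == int(c):
        c = int(c)
        # Whole number: average the c-th and (c+1)-th values (1-indexed).
        return (ordered[c - 1] + ordered[c]) / 2
    # Otherwise round up and take the value at that position (1-indexed).
    return ordered[math.ceil(c) - 1]


data = [3, 4, 1, 7, 6, 8, 12, 11, 14, 18, 2, 5, 13, 9, 19, 16, 10, 17, 15, 20]
print(percentile(data, 30))
print(percentile(data, 25))
```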

Quartiles, Interquartile Range and Outliers


Quartiles divide the distribution into four groups, separated by Q1, Q2 and Q3, where Q1 is the 25th percentile, Q2 is the 50th percentile (the median) and Q3 is the 75th percentile.

Q1 is calculated as the median of the values between the lowest value and Q2. Likewise, Q3 is the median of the values between Q2 and the highest value.

Interquartile range (IQR) is defined as the difference Q3 - Q1 and is the range of the middle 50% of the data.

Outliers: extremely high or extremely low values of a dataset are called outliers. Any value in the dataset that lies outside the interval from Q1 - 1.5 x IQR to Q3 + 1.5 x IQR can be treated as an outlier.
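The quartile, IQR and outlier steps fit together as follows. This sketch finds Q1 and Q3 as the medians of the lower and upper halves of the sorted data and uses the common 1.5 x IQR fences; the dataset and the helper names are hypothetical.

```python
from statistics import median


def quartiles(values):
    """Q1, Q2, Q3 via medians of the lower and upper halves (needs at least 2 values)."""
    ordered = sorted(values)
    half = len(ordered) // 2          # the middle value is excluded when the count is odd
    return median(ordered[:half]), median(ordered), median(ordered[-half:])


def find_outliers(values):
    """Values below Q1 - 1.5 * IQR or above Q3 + 1.5 * IQR."""
    q1, _, q3 = quartiles(values)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in values if v < low or v > high]


data = [5, 6, 12, 13, 15, 18, 22, 50]   # hypothetical data
print(quartiles(data))
print(find_outliers(data))
```

Here Q1 = 9 and Q3 = 20, so the IQR is 11 and the fences are -7.5 and 36.5; only the value 50 falls outside them.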


Exploratory data analysis (EDA)

Exploratory data analysis, laid out by John Tukey, organizes data using a stem and leaf plot. It uses the median as the measure of central tendency, the IQR as the measure of variation, and the boxplot (also called a box-and-whisker plot) to show the spread of the data graphically.
The popular 5-number summary shown in a boxplot consists of:
1. Lowest value in the dataset
2. Q1
3. Median
4. Q3
5. Highest value in the dataset
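The numbers behind a boxplot can be computed without drawing anything. The sketch below applies the median-of-halves approach for Q1 and Q3 to the sorted 20-number array from the percentile example; `five_number_summary` is an illustrative helper name.

```python
from statistics import median


def five_number_summary(values):
    """(lowest, Q1, median, Q3, highest) for a dataset with at least 2 values."""
    ordered = sorted(values)
    half = len(ordered) // 2
    return (
        ordered[0],
        median(ordered[:half]),    # Q1: median of the lower half
        median(ordered),           # Q2: the median
        median(ordered[-half:]),   # Q3: median of the upper half
        ordered[-1],
    )


numbers = list(range(1, 21))       # 1, 2, ..., 20
print(five_number_summary(numbers))
```

Q1 comes out as 5.5, which agrees with the 25th percentile found earlier for the same data.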


Summary of concepts covered in this part

• Measures of central tendency: mean, median, mode

• Measures of variation: range, variance, standard deviation

• Difference between population and sample standard deviation

• Chebyshev's theorem for the proportion of data within k standard deviations of the mean

• Measures of position: z score

• Percentile, IQR, outliers and 5-number summary


Reference: Elementary Statistics — Bluman

Translated from: https://medium.com/analytics-vidhya/statistics-deep-dive-part3-8f16ddb56242




