Statistics Deep Dive, Part 3

weixin_26721705

Original article: https://medium.com/analytics-vidhya/statistics-deep-dive-part3-8f16ddb56242

In Part 2 we covered frequency distributions and histograms. Part 4 covers sample space, probability, permutations and combinations.

In this part we will look at:

1. Measures of central tendency
2. Measures of variation and position
3. Exploratory data analysis

When describing data, measures of central tendency and variation play an important role. Consider the following statements:

A. The average household income is $50,000.
B. More than 40% of the people gathered in the procession were teenagers.
C. 50% of all products bought in cities today are purchased online.

Though these statements are plain, they convey a lot of information. Being able to measure central tendency is a fundamental skill in statistics.

Statistic: a measure obtained by using data values from a sample. E.g.: the average of a sales agent's sales on 10 randomly chosen days.

Parameter: a measure obtained by using all the data values of a population. E.g.: the sales agent's average sales per day over the whole of his job.

Measures of central tendency

1. Mean: The mean, also known as the arithmetic average, is found by adding the values of the data and dividing by the total number of values.

E.g.: The mean height of a family whose 5 members have heights 5', 4.5', 3', 6' and 3.5' is (5 + 4.5 + 3 + 6 + 3.5)/5 = 22/5 = 4.4

To find the mean of grouped data, take the midpoint of each class and multiply it by the class frequency.

[Figure: grouped frequency table.] Class and frequency (f) are given. The midpoint (Xm) is computed for each class, and f*Xm gives the total for that class. These totals are summed and then divided by the sum of the frequencies to get the mean.
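The grouped-mean procedure can be sketched in a few lines of Python. The class limits and frequencies below are hypothetical values chosen for illustration, since the original table is not available here, and `grouped_mean` is an illustrative helper name.

```python
# Hypothetical grouped data: (lower class limit, upper class limit, frequency f)
classes = [(0, 10, 3), (10, 20, 5), (20, 30, 2)]

def grouped_mean(classes):
    """Mean of grouped data: sum(f * Xm) / sum(f), where Xm is the class midpoint."""
    total_frequency = sum(f for _, _, f in classes)
    weighted_total = sum(f * (low + high) / 2 for low, high, f in classes)
    return weighted_total / total_frequency

print(grouped_mean(classes))  # (3*5 + 5*15 + 2*25) / 10 = 14.0
```

The result is an approximation of the true mean, because every value in a class is treated as if it sat at the class midpoint.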

2. Median: The median is the midpoint of the data array, also known as the 50th percentile of the data. The data must be arranged in order before the median is calculated.

E.g.: Find the median of the daily vehicle pass charge for five U.S. National Parks. The costs are $25, $15, $15, $20, and $15.
Order the data: 15, 15, 15, 20, 25. The midpoint of the ordered array is 15, so the median is $15. If the number of values is even, there is no single midpoint, and the median is the average of the two middle values.

The midrange is a rough estimate of the middle. It is found by adding the lowest and highest values in the data set and dividing by 2.
E.g.: for 2, 3, 6, 8, 4, 1, Midrange = (1 + 8)/2 = 4.5

3. Weighted mean: A variation of the mean in which each value is multiplied by its weight, and the sum of these products is divided by the sum of the weights. E.g.: If you purchase milk from 3 shops where the price per gallon is $2, $3 and $4 respectively, the average price you paid per gallon is not necessarily $3. Assume you bought 1 gal, 2 gal and 3 gal from the three shops; the average price paid per gallon is (1x2 + 2x3 + 3x4)/(1 + 2 + 3) = 20/6 = 3.33.

4. Mode: The mode is the value that occurs most often, i.e. the value with the highest frequency. A data set can have more than one mode, or no mode at all. E.g.: the mode of 2, 3, 3, 4 is 3.
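The four measures above can be reproduced with Python's standard `statistics` module. This is a minimal sketch that reuses the article's example numbers; the variable names are illustrative only.

```python
import statistics

heights = [5, 4.5, 3, 6, 3.5]          # family heights in feet
park_fees = [25, 15, 15, 20, 15]       # daily vehicle pass charges in dollars
values = [2, 3, 6, 8, 4, 1]            # data for the midrange example

mean_height = statistics.mean(heights)        # sum of values / number of values
median_fee = statistics.median(park_fees)     # sorts the data internally
midrange = (min(values) + max(values)) / 2    # (lowest + highest) / 2

# Weighted mean: prices per gallon weighted by the gallons bought at each shop.
prices = [2, 3, 4]
gallons = [1, 2, 3]
weighted_mean = sum(p * g for p, g in zip(prices, gallons)) / sum(gallons)

# multimode returns every most-frequent value, so it also covers multi-modal data.
modes = statistics.multimode([2, 3, 3, 4])

print(mean_height, median_fee, midrange, round(weighted_mean, 2), modes)
```

`statistics.multimode` is used instead of `statistics.mode` because a data set may have several modes.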

Measures of Variation

A coach wants to select 5 tall kids for an athletics squad and considers the groups below (heights in ft):
Group A: 3.8, 4.2, 5, 5.5, 7.5 (one crazy tall guy)
Group B: 5.1, 5, 5.2, 5.3, 5.4
Though Group A and Group B have the same average height (5.2 ft), the heights in Group B are much more evenly distributed. This is a small data set, so the high variation in Group A is visible at a glance. To quantify variation in data we use the range, variance and standard deviation.

1. Range: The range is the highest value minus the lowest value.

Range for the groups above:

Group A: 7.5 - 3.8 = 3.7
Group B: 5.4 - 5 = 0.4

2. Population variance: The average of the squared distances of each value from the mean, σ² = Σ(X - μ)²/N. Variance is denoted by the Greek letter sigma squared (σ²).

3. Population standard deviation: The square root of the variance. The symbol for the population standard deviation is sigma (σ).

Problem: Mean, variance and SD calculation
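As a worked illustration of these calculations, the sketch below computes the mean, range, population variance and population standard deviation for the coach's two groups from earlier in this section. It treats each group as a complete population, which is an assumption made for the example, and `describe` is an illustrative helper name.

```python
import statistics

group_a = [3.8, 4.2, 5, 5.5, 7.5]
group_b = [5.1, 5, 5.2, 5.3, 5.4]

def describe(group):
    """Return (mean, range, population variance, population standard deviation)."""
    mean = statistics.mean(group)
    value_range = max(group) - min(group)
    variance = statistics.pvariance(group)   # divides by N
    std_dev = statistics.pstdev(group)       # square root of pvariance
    return mean, value_range, variance, std_dev

for name, group in (("A", group_a), ("B", group_b)):
    mean, value_range, variance, std_dev = describe(group)
    print(f"Group {name}: mean={mean:.2f} range={value_range:.2f} "
          f"variance={variance:.3f} sd={std_dev:.3f}")
```

Both groups have a mean of 5.2, but the variance of Group A (about 1.676) is far larger than that of Group B (about 0.02), which matches what the eye sees in the raw numbers.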

4. Sample variance and sample standard deviation: The population variance formula, when applied to a sample, does not give the best estimate of the population variance; it usually underestimates it. Therefore, instead of dividing by n, find the variance of the sample by dividing by n - 1. The same applies to the sample standard deviation.
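The difference between the two divisors is easy to see numerically. The sketch below uses a small sample (5, 8, 9, 11, 18, chosen for illustration) and checks the hand computation against `statistics.pvariance` (divides by n) and `statistics.variance` (divides by n - 1).

```python
import statistics

sample = [5, 8, 9, 11, 18]
n = len(sample)
mean = sum(sample) / n
squared_deviations = sum((x - mean) ** 2 for x in sample)

divide_by_n = squared_deviations / n              # population formula
divide_by_n_minus_1 = squared_deviations / (n - 1)  # sample formula

print(divide_by_n, statistics.pvariance(sample))
print(divide_by_n_minus_1, statistics.variance(sample))
```

Dividing by n - 1 always gives the larger value, which compensates for the tendency of the population formula to underestimate when applied to a sample.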

Problem: Variance and standard deviation for grouped data
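For grouped data the same idea applies, with each class midpoint standing in for the values in that class. The sketch below is one way to do it, assuming the data are a sample (divisor n - 1); the class limits and frequencies are hypothetical, and `grouped_sample_variance` is an illustrative name.

```python
import math

# Hypothetical grouped data: (lower class limit, upper class limit, frequency f)
classes = [(0, 10, 3), (10, 20, 5), (20, 30, 2)]

def grouped_sample_variance(classes):
    """Sample variance of grouped data: sum(f * (Xm - mean)^2) / (n - 1)."""
    midpoints = [((low + high) / 2, f) for low, high, f in classes]
    n = sum(f for _, f in midpoints)
    mean = sum(f * xm for xm, f in midpoints) / n
    squared = sum(f * (xm - mean) ** 2 for xm, f in midpoints)
    return squared / (n - 1)

variance = grouped_sample_variance(classes)
std_dev = math.sqrt(variance)
print(round(variance, 3), round(std_dev, 3))
```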

Coefficient of variation: A statistic that allows you to compare standard deviations when the units are different. It is the standard deviation divided by the mean, expressed as a percentage.
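A short sketch of the coefficient of variation, computed as the sample standard deviation divided by the mean and expressed as a percentage. The height and weight figures are hypothetical and exist only to show two variables measured in different units.

```python
import statistics

def coefficient_of_variation(data):
    """Sample standard deviation as a percentage of the mean."""
    return statistics.stdev(data) / statistics.mean(data) * 100

heights_cm = [150, 160, 170, 180, 190]   # hypothetical heights
weights_kg = [50, 60, 70, 80, 90]        # hypothetical weights

print(round(coefficient_of_variation(heights_cm), 2))
print(round(coefficient_of_variation(weights_kg), 2))
```

Both lists have the same standard deviation in their own units, yet the weights vary more relative to their mean, which is exactly the comparison the coefficient of variation makes possible.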

The range can be used to approximate the standard deviation. The approximation is called the range rule of thumb: S.D. ≈ Range/4.
E.g.: for 5, 8, 9, 11, 18, Range = 18 - 5 = 13, so S.D. ≈ 13/4 = 3.25
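To get a feel for how rough this approximation is, the sketch below compares the rule-of-thumb estimate with the sample standard deviation computed by `statistics.stdev` for the same five values.

```python
import statistics

data = [5, 8, 9, 11, 18]

rule_of_thumb = (max(data) - min(data)) / 4   # range / 4
actual = statistics.stdev(data)               # sample standard deviation

print(rule_of_thumb, round(actual, 2))
```

For this small data set the estimate (3.25) is noticeably below the computed value (about 4.87), so the rule is best treated as a quick sanity check rather than a substitute for the calculation.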

Chebyshev’s Theorem

The proportion of values from a data set that fall within k standard deviations of the mean is at least 1 - 1/k², where k is a number greater than 1.

For example, the theorem states that at least 75% of the data values fall within 2 standard deviations of the mean of the data set. This result is found by substituting k = 2 in the expression: 1 - 1/k² = 1 - 1/2² = 0.75, or 75%.
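The bound is a one-line formula, sketched below for a few values of k. The function name `chebyshev_lower_bound` is illustrative.

```python
def chebyshev_lower_bound(k):
    """Minimum proportion of values within k standard deviations of the mean."""
    if k <= 1:
        raise ValueError("k must be greater than 1")
    return 1 - 1 / k**2

for k in (1.5, 2, 3):
    print(k, round(chebyshev_lower_bound(k), 4))
```

The values are lower bounds that hold for any distribution; a particular data set will often have a larger share of its values inside the interval.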

The Empirical (Normal) Rule

Chebyshev’s theorem applies to any distribution regardless of its shape. However, when a distribution is bell-shaped (or what is called normal), the following statements, which make up the empirical rule, are true.


Approximately 68% of the data values will fall within 1 standard deviation of the mean.
Approximately 95% of the data values will fall within 2 standard deviations of the mean.
Approximately 99.7% of the data values will fall within 3 standard deviations of the mean.
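These three percentages can be checked against the normal distribution's cumulative distribution function. The sketch below uses `statistics.NormalDist` from the Python standard library (available from Python 3.8); `share_within` is an illustrative helper name.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

def share_within(k):
    """Proportion of a normal distribution within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

for k in (1, 2, 3):
    print(k, round(share_within(k), 4))
```

The results (about 0.6827, 0.9545 and 0.9973) are much higher than the Chebyshev bounds for the same k, because the empirical rule assumes a bell-shaped distribution.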

Measures of Position

“You can’t compare apples and oranges.” But with the use of statistics, it can be done to some extent. Suppose that a student scored 90 on a music test and 45 on an English exam. Direct comparison of raw scores is impossible, since the exams might not be equivalent in terms of number of questions, value of each question, and so on. However, a comparison of a relative standard similar to both can be made. This comparison uses the mean and standard deviation and is called a standard score or z score. (We also use z scores in later chapters.)


A standard score or z score tells how many standard deviations a data value is above or below the mean for a specific distribution of values. If a standard score is zero, then the data value is the same as the mean.


A z score or standard score for a value is obtained by subtracting the mean from the value and dividing the result by the standard deviation. The symbol for a standard score is z. The formula is


z = (value - mean) / standard deviation

For samples, the formula is z = (X - X̄)/s; for populations, it is z = (X - μ)/σ.

The z score represents the number of standard deviations that a data value falls above or below the mean.
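The music and English scores from the example can be compared once each test's mean and standard deviation are known. The source does not give those figures, so the means and standard deviations below are assumptions invented for this sketch.

```python
def z_score(value, mean, std_dev):
    """Number of standard deviations that value lies above (+) or below (-) the mean."""
    return (value - mean) / std_dev

# Assumed class statistics (not from the article): music mean 82, SD 8; English mean 35, SD 5.
z_music = z_score(90, mean=82, std_dev=8)
z_english = z_score(45, mean=35, std_dev=5)

print(z_music, z_english)
```

Under these assumed figures the student's relative position is higher in English (z = 2.0) than in music (z = 1.0), even though the raw music score is twice as large.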


Percentile calculation: Finding the value at a given percentile is simple, yet it is an important skill in statistics. First compute the position c with the formula c = (n x p)/100, where
n = total number of values in the array
p = percentile required
* If c is a whole number, the percentile is the average of the c-th and (c+1)-th values of the ordered data.
* If c is not a whole number, round it up to the next whole number and take the value at that position.

E.g.: Calculate the 30th percentile of the array of 20 numbers below.
[3,4,1,7,6,8,12,11,14,18,2,5,13,9,19,16,10,17,15,20]
Step 1: Arrange the numbers in ascending order.
[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20]
Step 2: Identify the total number of values and the percentile required, and substitute them in the formula.
n = 20
p = 30
c = (n x p)/100 = (20 x 30)/100 = 6
Since c is a whole number, average the 6th and 7th values: (6 + 7)/2 = 6.5. So 6.5 is the 30th percentile. Similarly, 5.5 is the 25th percentile.
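The counting rule described above can be written as a small function. This is a sketch of that specific rule, assuming an integer percentile p strictly between 0 and 100; library routines such as `statistics.quantiles` interpolate differently and may return slightly different values. The name `percentile_value` is illustrative.

```python
def percentile_value(data, p):
    """Value at the p-th percentile using the c = n * p / 100 counting rule."""
    if not 0 < p < 100:
        raise ValueError("p must be strictly between 0 and 100")
    ordered = sorted(data)
    # position is the whole part of c; remainder is non-zero when c is fractional
    position, remainder = divmod(len(ordered) * p, 100)
    if remainder == 0:
        # c is whole: average the c-th and (c+1)-th values (1-based positions)
        return (ordered[position - 1] + ordered[position]) / 2
    # c is fractional: round up to the next whole position (1-based)
    return ordered[position]

numbers = [3, 4, 1, 7, 6, 8, 12, 11, 14, 18, 2, 5, 13, 9, 19, 16, 10, 17, 15, 20]
print(percentile_value(numbers, 30))  # 6.5
print(percentile_value(numbers, 25))  # 5.5
```

Integer arithmetic with `divmod` is used to decide whether c is whole, which avoids floating-point comparisons.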

Quartiles, Interquartile Range and Outliers


Quartiles divide the distribution into four groups, separated by Q1, Q2 and Q3, where Q1 is the 25th percentile, Q2 is the 50th percentile (the median) and Q3 is the 75th percentile.

Interquartile range (IQR) is defined as the difference between Q3 and Q1 (IQR = Q3 - Q1) and is the range of the middle 50% of the data.

Outliers: Extremely high or extremely low values of a data set are called outliers. Any value in the data set that is below Q1 - 1.5*IQR or above Q3 + 1.5*IQR can be treated as an outlier.
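A minimal sketch of outlier detection with the common 1.5 x IQR fences (lower fence Q1 - 1.5*IQR, upper fence Q3 + 1.5*IQR). The data values are hypothetical, and the quartiles come from `statistics.quantiles`, whose interpolation method can give slightly different quartiles than a hand count.

```python
import statistics

# Hypothetical data with one suspiciously large value
data = [5, 8, 9, 11, 18, 12, 10, 7, 9, 45]

q1, q2, q3 = statistics.quantiles(data, n=4)   # three cut points: Q1, Q2, Q3
iqr = q3 - q1

lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = [x for x in data if x < lower_fence or x > upper_fence]
print(q1, q3, iqr, outliers)
```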

Exploratory data analysis (EDA)

Exploratory data analysis, laid out by John Tukey, organizes data with a stem and leaf plot. It uses the median for central tendency, the IQR for variation, and a boxplot (also called a box-and-whisker plot) to show the spread of the data graphically. The popular five-number summary shown in a boxplot consists of:
1. Lowest value in the data set
2. Q1
3. Median
4. Q3
5. Highest value in the data set
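The five-number summary can be assembled from standard-library calls, as sketched below. The function name `five_number_summary` is illustrative, the data (the integers 1 to 20) are an arbitrary example, and the quartiles again come from `statistics.quantiles`, so they may differ slightly from a hand count.

```python
import statistics

def five_number_summary(data):
    """Return (lowest value, Q1, median, Q3, highest value)."""
    q1, _, q3 = statistics.quantiles(data, n=4)
    return min(data), q1, statistics.median(data), q3, max(data)

numbers = list(range(1, 21))
summary = five_number_summary(numbers)
print(summary)
```

These five values are what a boxplot draws: the box spans Q1 to Q3 with a line at the median, and the whiskers reach toward the lowest and highest values.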

Summary of concepts covered in this part
• Measures of central tendency: mean, median, mode
• Measures of variation: range, variance, standard deviation
• Difference between population and sample standard deviation
• Chebyshev’s theorem for the proportion of data within k standard deviations of the mean
• Measures of position: z score
• Percentiles, IQR, outliers and the five-number summary

Reference: Elementary Statistics, Bluman
