Notes - Berkeley Statistics 2.1X - Week2

Original link: http://www.cnblogs.com/hphp/p/3584133.html

-Week2, 2014/03/06, hphp

Feel free to discuss and repost; when reposting, please credit the source.

Week2. Location, Representation of Data

Summarizing data can help us understand them, especially when there are many data. This chapter presents several ways to summarize quantitative data by a typical value (a measure of location, such as the mean, median, or mode) and a measure of how well the typical value represents the list (a measure of spread, such as the range, interquartile range, or standard deviation). Markov's and Chebychev's inequalities show that these summary measures can contain a surprisingly large amount of information about the data.

Lecture 3.1 The median and the mode

Measures of location

Measures of location do just that: They try to capture with a single number what is typical of the data.

Mean, Median, Mode.

Median: The median is the number that divides the (ordered) data in half: the smallest number that is at least as big as half the data. At least half the data are equal to or smaller than the median, and at least half the data are equal to or greater than the median.

E.g. list: 1, 2, 3, 4

median --> 2

1st quartile (1/4) --> 1

3rd quartile (3/4) --> 3
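The quartile rule in the example above can be sketched in Python. This is a minimal sketch, assuming the "smallest number that is at least as big as a fraction p of the data" definition from these notes; the function name `percentile` is made up for illustration, and other textbooks and libraries use different quartile conventions.

```python
from fractions import Fraction
import math


def percentile(data, p):
    """Smallest entry that is at least as big as a fraction p of the data.

    p is a fraction between 0 and 1, e.g. Fraction(1, 2) for the median.
    (Hypothetical helper, not a standard library function.)
    """
    ordered = sorted(data)
    # Number of entries that must be <= the answer (at least one entry).
    need = max(1, math.ceil(Fraction(p) * len(ordered)))
    return ordered[need - 1]


data = [1, 2, 3, 4]
print(percentile(data, Fraction(1, 4)))  # 1
print(percentile(data, Fraction(1, 2)))  # 2
print(percentile(data, Fraction(3, 4)))  # 3
```

Exact fractions are used instead of floats so that a product such as 3/4 * 4 lands exactly on an integer before rounding up.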

However, the mean, the median, and the mode are "as close as possible" to all the data: for each of these three measures of location, the sum of the distances between each datum and the measure of location is as small as it can be. The differences among the three measures of location are in how "distance" is defined.[1]
The mean, median, and mode can be related (approximately) to the histogram: loosely speaking, the mode is the highest bump, the median is where half the area is to the right and half is to the left, and the mean is where the histogram would balance, were it a solid object cut out of a uniform block of metal. (All these heuristics are approximate, and depend on the class intervals.)
[datum: a single data value]
[Symmetric distribution - average, balanced.]

The center

Median: the "half point" of the data --> 31.4 mm

The mode: the "most common" value

the value that has the highest frequency

 4 | 8
 5 | 9
 6 | 3337
 7 | 000235
 8 | 012345788
 9 | 015556
10 | 0

Rows containing the most frequent leaves (each repeated three times):

 6 | 333
 7 | 000
 9 | 555
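Finding the "most common" value in the stem-and-leaf display can be sketched as a frequency count. This sketch assumes each stem is the tens part and each leaf is the units digit (the notes do not state the units), so the row `6 | 3337` is read as 63, 63, 63, 67.

```python
from collections import Counter

# Stem-and-leaf rows copied from the notes.
# Assumption: stem = tens, leaf = units digit.
stem_and_leaf = {
    4: "8",
    5: "9",
    6: "3337",
    7: "000235",
    8: "012345788",
    9: "015556",
    10: "0",
}

values = [
    stem * 10 + int(leaf)
    for stem, leaves in stem_and_leaf.items()
    for leaf in leaves
]

counts = Counter(values)
highest = max(counts.values())
# Every value that reaches the highest frequency is a mode.
modes = sorted(value for value, count in counts.items() if count == highest)

print(len(values))     # 28
print(modes, highest)  # [63, 70, 95] 3
```

Under that reading, three values tie for the highest frequency, which matches the three highlighted rows.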

A unimodal distribution

Unimodal: one peak

Lecture 3.2 The average

average = mean

The average is not necessarily the center, not necessarily a member of the list, and not necessarily even a possible value of the variable.

Not much different from what I already understood.

Lecture 3.3 Comparing and combining averages

What's the relation between these groups?

[National Health and Nutrition Examination Survey 1999-2000] [Note the data and the source.]

The data are cross-sectional, not longitudinal.

Comparing the numbers

The averages of different groups: "how are the groups related to each other?"

E.g.

            ave
section1    60
section2    70

Can't tell the overall average, because information is missing (the section sizes).

            ave    section size
section1    60     20
section2    70     30

average = total/50 = (60*20 + 70*30)/50 = 66

Weighted average of averages

            ave    section size    section proportion
section1    60     20              2/5
section2    70     30              3/5

average = 60*2/5 + 70*3/5 = 66

average = SUM(average[i]*weight[i]) [weights are the section proportions.]
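The weighted average of section averages can be sketched as follows. The function name `combined_average` is made up for illustration; the numbers are the two sections from the table above.

```python
from fractions import Fraction


def combined_average(averages, sizes):
    """Overall average of several groups, given each group's average and size."""
    total_size = sum(sizes)
    # Each group's weight is its proportion of the combined list.
    weights = [Fraction(size, total_size) for size in sizes]
    return sum(avg * weight for avg, weight in zip(averages, weights))


overall = combined_average([60, 70], [20, 30])
print(overall)        # 66
print((60 + 70) / 2)  # 65.0, the plain average of averages, which ignores the sizes
```

The plain average of the two section averages agrees with the overall average only when the sections are the same size.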

Lecture 3.4 The average and the histogram; The average and the median.

The median is largely unaffected by outliers.

[ Statistics that are not affected too much by small subsets of the data are resistant. The median is resistant; the mean is not. ]
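Resistance can be illustrated with two short lists that differ in a single entry (the same lists appear in the histogram example later in these notes). This is only a sketch; `median_low` from Python's `statistics` module is used because it returns the smaller of the two middle values for an even-length list, which matches the median definition in these notes for these lists.

```python
from statistics import mean, median_low

original = [2, 3, 3, 4]
with_outlier = [2, 3, 3, 7]  # the largest entry changed from 4 to 7

for data in (original, with_outlier):
    print(data, "mean =", mean(data), "median =", median_low(data))
```

Changing one entry moves the mean from 3 to 3.75, while the median stays at 3.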

A right-skewed distribution: the average is typically greater than the median.

E.g. incomes

[affluent - rich, wealthy]

[gizmos and gadgets - novelty devices and accessories]

[disingenuously - insincerely, slyly]

[pledge to - promise to]

[Articles report median incomes instead of average incomes.]

What does an average test score tell you?

If a lot of people did not get good scores, the histogram will have a left-hand tail.

The average and the histogram

list: 2, 3, 3, 4

average = [ (1*2) + (2*3) + (1*4) ]/4 = 1/4*2 + 2/4*3 + 1/4*4 = 3

1/4, 2/4, 1/4 --> the proportions (percents) of the entries equal to 2, 3, 4.

list: 2, 3, 3, 7

average = [ (1*2) + (2*3) + (1*7) ]/4 = 1/4*2 + 2/4*3 + 1/4*7 = 3.75

1/4, 2/4, 1/4 --> the proportions (percents) of the entries equal to 2, 3, 7.

The average is the center of gravity of the histogram.

1/4, 2/4, 1/4: weights
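The "center of gravity" view can be sketched by computing the average from the distinct values and their proportions rather than from the raw list. The helper name `average_from_histogram` is made up for illustration.

```python
from collections import Counter
from fractions import Fraction


def average_from_histogram(data):
    """Average computed as the sum of (proportion of value) * value."""
    counts = Counter(data)
    n = len(data)
    # The weight of each distinct value is the proportion of entries equal to it.
    return sum(Fraction(count, n) * value for value, count in counts.items())


print(average_from_histogram([2, 3, 3, 4]))  # 3
print(average_from_histogram([2, 3, 3, 7]))  # 15/4
```

Both lists have the same weights (1/4, 2/4, 1/4); only the position of the last bar changes, and the balance point moves with it.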

Lecture 3.5 Markov's inequality

How far can you be above average? How big can the tail be?

Andrey Markov (1856-1922)
The average age of a group of people is 20 years. What proportion can be more than 80 years old?

Markov's inequality:

If a list has only non-negative entries, then for any k > 0 the proportion of entries that are at least as large as k times the average is at most 1/k.

[The proof can use average = Sum(weight*value).]
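The weighted-sum hint can be spelled out as a short derivation. The symbols are introduced here for illustration and are not from the lecture: the x_i are the distinct values in the list, the w_i are their proportions, mu > 0 is the average, and p is the proportion of entries that are at least k times mu.

```latex
\[
\mu \;=\; \sum_i w_i x_i
\;\ge\; \sum_{i:\, x_i \ge k\mu} w_i x_i
\;\ge\; k\mu \sum_{i:\, x_i \ge k\mu} w_i
\;=\; k\mu\, p
\quad\Longrightarrow\quad
p \;\le\; \frac{1}{k}
\]
```

The first inequality drops the terms with x_i < k mu, which is allowed only because every entry is non-negative; the second replaces each remaining x_i by the smaller number k mu.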

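Taking care of the edge

Question: more than 80 years old: > 80
Markov: at least 80 years old: >= 80

Since 80 = 4 * 20, Markov says at most 1/4 of the people are >= 80. The people who are > 80 are among them, so at most 1/4 are more than 80.

But if k = 0.5, the bound 1/k is 200%, which is true but says nothing, since no proportion can exceed 100%. The inequality is only informative for k > 1.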
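Markov's inequality can be checked numerically on small lists. This is a sketch only: the two age lists are invented for illustration (both have average 20), and the helper names are hypothetical.

```python
from fractions import Fraction


def proportion_at_least(data, threshold):
    """Proportion of entries that are >= threshold."""
    return Fraction(sum(1 for x in data if x >= threshold), len(data))


def markov_holds(data, k):
    """Check that the proportion of entries >= k * average is at most 1/k."""
    if min(data) < 0:
        raise ValueError("Markov's inequality needs non-negative entries")
    average = Fraction(sum(data), len(data))
    return proportion_at_least(data, k * average) <= 1 / Fraction(k)


# Invented lists, both with average 20.
extreme = [0, 0, 0, 80]
typical = [10, 15, 20, 25, 30]

print(proportion_at_least(extreme, 80))  # 1/4
print(proportion_at_least(typical, 80))  # 0
print(markov_holds(extreme, 4), markov_holds(typical, 4))  # True True
```

The first list reaches the bound exactly: a quarter of the entries equal 4 times the average, so the 1/k limit cannot be improved without more information about the list.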

Lecture 4.1 How well the average (or another typical value) represents the data

Measures of location summarize what is typical of elements of a list, but not every element is typical. Are all the elements close to each other? Are most of the elements close to each other? What is the biggest difference between elements? On the average, how far are the elements from each other? Measures of spread or variability tell us.

The three most common measures of spread or variability are the range, the interquartile range (IQR), and the standard deviation (SD).

The range of a list is the largest value minus the smallest value.

It is the width of the smallest interval that contains all the data, so it measures spread. It is not resistant, because changing just one datum can make it arbitrarily large.

Range and interquartile range.

How far are these data from the center?
Spread
IQR: interquartile range, the 3rd quartile minus the 1st quartile.

The middle 50% of the data are spread over 8 years.
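The range and the IQR can be sketched in a few lines. The quartile rule below is an assumption: it uses the "smallest number that is at least as big as p percent of the data" convention, and other conventions give slightly different quartiles. The list is the one used in the standard deviation lecture, and the outlier version is invented to show resistance.

```python
def percentile(data, percent):
    """Smallest entry that is at least as big as `percent` percent of the data."""
    ordered = sorted(data)
    # Integer ceiling of percent * n / 100, with at least one entry required.
    need = max(1, -(-percent * len(ordered) // 100))
    return ordered[need - 1]


def value_range(data):
    return max(data) - min(data)


def iqr(data):
    return percentile(data, 75) - percentile(data, 25)


data = [2, 3, 3, 4, 4, 5, 6, 7]
with_outlier = [2, 3, 3, 4, 4, 5, 6, 70]  # one entry changed from 7 to 70

print(value_range(data), iqr(data))                  # 5 2
print(value_range(with_outlier), iqr(with_outlier))  # 68 2
```

One changed entry inflates the range from 5 to 68, while the IQR stays at 2: the range is not resistant, the IQR is.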

Lecture 4.2 Standard Deviation

Deviation from average: roughly how far are the numbers from their average?

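list: 2, 3, 3, 4, 4, 5, 6, 7; average = 4.25
deviations (entry - average): -2.25, -1.25, -1.25, -0.25, -0.25, 0.75, 1.75, 2.75 ---> the average of the deviations is 0.
Averaging the absolute values of the deviations would avoid this cancellation, BUT absolute values do not have good mathematical properties, so the deviations are squared instead.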

Standard Deviation

Root mean square (rms) of the deviations from the average.

The rms (root mean square) of a list measures the average size of its entries. It is defined as follows:

rms = square-root( (sum of the squares of the entries)/(number of entries) )

= [ (sum of the squares of the entries)/(number of entries) ]^(1/2).
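The rms definition translates directly into code. A minimal sketch, with an invented list chosen so that the ordinary average is 0 while the typical size of the entries is clearly 3:

```python
import math


def rms(entries):
    """Root mean square: square the entries, average the squares, take the root."""
    return math.sqrt(sum(x * x for x in entries) / len(entries))


entries = [3, -3, 3, -3]
print(sum(entries) / len(entries))  # 0.0
print(rms(entries))                 # 3.0
```

Squaring removes the signs before averaging, which is why the rms measures size where the plain average cancels out.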

How is the SD computed, and how well does it represent a list of data?

List: 2, 3, 3, 4, 4, 5, 6, 7; average = 4.25

variance = mean square of the deviations from the average = 19.5/8 = 2.4375, about 2.44

SD = square-root(2.44) = 1.56 (approximately)

The average and sd use the same units.

---> The SD is a measure of the spread of the data.

The interval average +- SD is roughly [2.7, 5.8].
It picks up a good chunk of the list (5 of the 8 entries), but not all.
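The SD calculation for this list can be reproduced step by step. A sketch, assuming the SD divides by the number of entries (as in the rms definition in these notes) rather than by one less; `statistics.pstdev` uses the same convention and serves as a cross-check.

```python
import math
from statistics import pstdev

data = [2, 3, 3, 4, 4, 5, 6, 7]

average = sum(data) / len(data)
deviations = [x - average for x in data]
# Variance: the mean of the squared deviations from the average.
variance = sum(d * d for d in deviations) / len(data)
sd = math.sqrt(variance)

# Entries that fall inside the interval average +- SD.
inside = [x for x in data if average - sd <= x <= average + sd]

print(average)          # 4.25
print(sum(deviations))  # 0.0
print(variance)         # 2.4375
print(round(sd, 2))     # 1.56
print(inside)           # [3, 3, 4, 4, 5]
```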

Lecture 4.3 Properties of the SD: Chebychev's inequality

In a nutshell

Rough statement: No matter what the list, the vast majority of entries will be in the range average +- a few SDs.

Chebychev (19th century)
Chebychev's inequality:

Precise statement:

No matter what the list, a proportion of at least 1 - 1/k^2 of the entries will be in the range average +/- k*SD.

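Proof

Intuitively, suppose more than 1/k^2 of the entries were at least k*SD away from the average. Each of those entries has a squared deviation of at least (k*SD)^2, so they alone would push the mean squared deviation above (1/k^2)*(k*SD)^2 = SD^2. That contradicts the definition of the SD, so at most 1/k^2 of the entries can be that far out.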
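Chebychev's bound can be checked numerically on the list from the SD lecture. This is a sketch, not a proof; the helper name `proportion_within` is made up, and it counts entries strictly closer than k SDs to the average, which is the conservative reading of "in the range".

```python
import math


def proportion_within(data, k):
    """Proportion of entries strictly less than k SDs away from the average."""
    n = len(data)
    average = sum(data) / n
    sd = math.sqrt(sum((x - average) ** 2 for x in data) / n)
    inside = sum(1 for x in data if abs(x - average) < k * sd)
    return inside / n


data = [2, 3, 3, 4, 4, 5, 6, 7]
for k in (1.5, 2, 3):
    # Left: observed proportion. Right: Chebychev's guaranteed minimum.
    print(k, proportion_within(data, k), ">=", 1 - 1 / k**2)
```

For this list the observed proportions are well above the guaranteed minimum, which is typical: the bound has to hold for every possible list, so it is usually far from tight.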

Footnotes

[1]. Meaning of "distance" for the mean, median, and mode:

For the mean, the distance between two numbers is defined to be the square of their difference.

That is, the sum of the squares of the differences between the data and the mean is smaller than the sum of squares of the differences between the data and any other number. (Equivalently, the rms or root mean square of the differences from the mean is smaller than the rms of the list of differences from any other number; the rms is defined in Lecture 4.2.)

For the median, the distance between two numbers is defined to be the absolute value of their difference. That is, the sum of the absolute values of the differences between a median and the data is no larger than the sum of the absolute values of the differences between any other number and the data.

For the mode, the distance between two numbers is defined to be zero if the numbers are equal, and one if they are not equal. That is, the number of data that differ from a mode is no larger than the number of data that differ from any other value. Equivalently, a mode is a number from which the fewest possible data differ: a "most common" value.
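The three notions of distance can be compared with a brute-force search. This is a sketch under stated assumptions: the list and the candidate grid (0 to 10 in steps of 0.25) are chosen for illustration, and a grid search only finds a minimizer that happens to lie on the grid.

```python
data = [2, 3, 3, 7]
candidates = [i / 4 for i in range(41)]  # 0, 0.25, 0.5, ..., 10


def total_squared(c):
    """Sum of squared differences between the data and c."""
    return sum((x - c) ** 2 for x in data)


def total_absolute(c):
    """Sum of absolute differences between the data and c."""
    return sum(abs(x - c) for x in data)


def total_mismatch(c):
    """Number of data that differ from c."""
    return sum(1 for x in data if x != c)


best_squared = min(candidates, key=total_squared)
best_absolute = min(candidates, key=total_absolute)
best_mismatch = min(candidates, key=total_mismatch)

print(best_squared, best_absolute, best_mismatch)  # 3.75 3.0 3.0
```

The minimizers are the mean (3.75), the median (3), and the mode (3) of the list, in that order.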

Reposted from: https://www.cnblogs.com/hphp/p/3584133.html
