Notes - Berkeley Statistics 2.1X - Week2

-Week2, 2014/03/06, hphp

Discussion and reposting are welcome; please credit the source when reposting.

Week2. Location, Representation of data

Summarizing data helps us understand them, especially when there are many data. This chapter presents several ways to summarize quantitative data by a typical value (a measure of location, such as the mean, median, or mode) and a measure of how well the typical value represents the list (a measure of spread, such as the range, interquartile range, or standard deviation). Markov's and Chebychev's inequalities show that these summary measures can contain a surprisingly large amount of information about the data.


Lecture 3.1 The median and the mode

Measures of location

  • Measures of location do just that: They try to capture with a single number what is typical of the data.

Mean , Median , Mode.

  • Median: The median is the number that divides the (ordered) data in half: the smallest number that is at least as big as half the data. At least half the data are equal to or smaller than the median, and at least half the data are equal to or greater than the median.

        E.g. list: 1, 2, 3, 4

        median (1/2) --> 2

        1st quartile (1/4) --> 1

        3rd quartile (3/4) --> 3

  • However, the mean, the median, and the mode are "as close as possible" to all the data: for each of these three measures of location, the sum of the distances between each datum and the measure of location is as small as it can be. The differences among the three measures of location are in how "distance" is defined.[1]

  • The mean, median, and mode can be related (approximately) to the histogram: loosely speaking, the mode is the highest bump, the median is where half the area is to the right and half is to the left, and the mean is where the histogram would balance, were it a solid object cut out of a uniform block of metal. (All these heuristics are approximate, and depend on the class intervals.)

  • [datum : a single data value]

  • [Symmetric Distribution - average , balanced.]
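The median and quartile example above (list 1, 2, 3, 4) can be sketched in code. This is only a minimal sketch of the "smallest number that is at least as big as a given fraction of the data" definition used in these notes; the function name `quantile` is my own, and other textbooks and libraries use different quartile conventions.

```python
import math

def quantile(data, p):
    """Smallest value that is at least as big as a fraction p of the data.

    Assumes 0 < p <= 1 and a non-empty list (hypothetical helper, not from the course).
    """
    ordered = sorted(data)
    # 1-based position of the value in the ordered list.
    rank = max(1, math.ceil(p * len(ordered)))
    return ordered[rank - 1]

values = [1, 2, 3, 4]
print(quantile(values, 0.5))   # median
print(quantile(values, 0.25))  # 1st quartile
print(quantile(values, 0.75))  # 3rd quartile
```

For the list 1, 2, 3, 4 this gives 2, 1 and 3, matching the example in the notes.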

The center 

  • Median : the "half point" of the data --- > 31.4 mm

The Mode: The "most common" value

  • the value that has the highest frequency

            4  | 8
            5  | 9
            6  | 3337
            7  | 000235
            8  | 012345788
            9  | 015556
            10 | 0

            6 | 333
            7 | 000
            9 | 555
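To check the modes in the stem-and-leaf display above, here is a small sketch. It assumes each row is read as stem followed by one leaf digit (so "6 | 3337" means 63, 63, 63, 67); that reading is my interpretation of the display, and the helper name `modes` is made up for illustration.

```python
from collections import Counter

def modes(data):
    """Return every value that has the highest frequency, in sorted order."""
    counts = Counter(data)
    highest = max(counts.values())
    return sorted(value for value, count in counts.items() if count == highest)

# Values read off the stem-and-leaf display (assumed reading: stem, then one leaf digit).
scores = [
    48, 59,
    63, 63, 63, 67,
    70, 70, 70, 72, 73, 75,
    80, 81, 82, 83, 84, 85, 87, 88, 88,
    90, 91, 95, 95, 95, 96,
    100,
]
print(modes(scores))
```

Under that reading, three values (63, 70 and 95) tie for the highest frequency, which is what the second, shorter display picks out.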

A unimodal distribution 

  • Unimodal : one peak


Lecture 3.2 The average

average - mean

The average need not be at the center of the data, and need not even be a member of the list or a value the variable can take.

Not much different from what I already understood.


Lecture 3.3 Comparing and combining averages

What's the relation between these groups

081503161283109.png

[National Health and Nutrition Examination 1999-2000]   [note the data and the source.]

  • the data are not longitudinal, but are cross sectional.

Comparing the numbers

  • the averages of different groups: "how are the groups related to each other?"

E.g.

              ave
  section 1   60
  section 2   70

We can't tell the overall average, because the section sizes are missing.

  • With the section sizes:

              ave   section size
  section 1   60    20
  section 2   70    30

average = total/50 = (60*20 + 70*30)/50 = 66

  • weighted average of averages

              ave   section size   section proportion
  section 1   60    20             2/5
  section 2   70    30             3/5

average = 60*2/5 + 70*3/5 = 66

average = SUM(average[i]*weight[i]) [weights are the section proportions.]
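The weighted average of averages can be sketched as follows, using the two-section numbers from the notes (averages 60 and 70, sizes 20 and 30). The function name `combined_average` is hypothetical; the sketch only illustrates that weighting by section proportion gives the same answer as total divided by total size.

```python
import math

def combined_average(averages, sizes):
    """Overall average from group averages, weighted by each group's proportion."""
    total_size = sum(sizes)
    weights = [size / total_size for size in sizes]  # section proportions
    return sum(avg * w for avg, w in zip(averages, weights))

averages = [60, 70]
sizes = [20, 30]

weighted = combined_average(averages, sizes)
# Same quantity computed as total score / total number of students.
direct = sum(a * s for a, s in zip(averages, sizes)) / sum(sizes)

print(weighted, direct)
print(math.isclose(weighted, direct))
```

Note that the plain average of the two section averages, (60 + 70)/2 = 65, would be wrong here because the sections have different sizes.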


Lecture 3.4 The average and the histogram; The average and the median.

the median is unaffected by outliers.

[ Statistics that are not affected too much by small subsets of the data are resistant. The median is resistant; the mean is not. ]

A right-skewed distribution : average is greater than the median.

081503176903364.png

incomes

[affluent - rich, wealthy]

[gizmos and gadgets - novelties and small devices]

[disingenuously - slyly, insincerely]

[pledge to - to promise]

[Articles report median incomes instead of average incomes.]

What does an average test score tell you?

  • If a number of people got low scores, the histogram gets a left-hand tail, which pulls the average below the median.

The average and the histogram

  • list : 2, 3, 3, 4

average = [ (1*2) + (2*3) + (1*4) ]/4  = 1/4*2 + 2/4*3 + 1/4*4

1/4,2/4,1/4 --> the percent/ proportions..

  • list : 2, 3, 3, 7

average = [ (1*2) + (2*3) + (1*7) ]/4  = 1/4*2 + 2/4*3 + 1/4*7 = 3.75

1/4,2/4,1/4 --> the percent/ proportions..

  • the average is the center of gravity of the histogram

1/4,2/4,1/4:weights
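The "average as center of gravity" idea above can be checked with a short sketch: compute the proportion of each distinct value (the histogram weights) and sum value times proportion. The function name is my own.

```python
from collections import Counter

def average_from_proportions(data):
    """Average computed as SUM(value * proportion of entries equal to value)."""
    counts = Counter(data)
    n = len(data)
    return sum(value * count / n for value, count in counts.items())

print(average_from_proportions([2, 3, 3, 4]))  # 3.0
print(average_from_proportions([2, 3, 3, 7]))  # 3.75
```

Both lists have the same weights (1/4, 2/4, 1/4); moving the largest value from 4 to 7 shifts the balance point from 3 to 3.75 while the median stays at 3.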


Lecture 3.5 Markov's inequality

How far can you be above average? How big can the tail be?

  • Andrey Markov (1856-1922)
  • The average age of a group of people is 20 years. What proportion are more than 80 years old?

081503189566878.png

  • Markov's inequality: 

If a list has only non-negative entries, then the proportion of entries that are at least as large as k times the average is at most 1/k.

[Could use average = Sum( weight*value ) as a proof.]

  • taking care of the edge
Question: more than 80 years old: > 80
Markov: more than or equal to 80 years old : >= 80
Since 80 = 4 * 20, the proportion >= 80 is at most 1/4, so the proportion > 80 is also at most 1/4.

  • But if k = 0.5, the bound is 1/k = 200%, which is true but tells us nothing; the inequality is only informative for k > 1.
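A quick numerical check of Markov's inequality, as a sketch only. The list of ages below is hypothetical (chosen so that the average is 20 and the bound is met exactly); the helper name `markov_check` is also made up.

```python
def markov_check(data, k):
    """Return (observed proportion >= k * average, Markov bound 1/k).

    Assumes non-negative entries with a positive average.
    """
    if any(x < 0 for x in data):
        raise ValueError("Markov's inequality needs non-negative entries")
    average = sum(data) / len(data)
    if average == 0:
        raise ValueError("this sketch assumes a positive average")
    threshold = k * average
    observed = sum(1 for x in data if x >= threshold) / len(data)
    return observed, 1 / k

ages = [0, 0, 0, 80]  # hypothetical ages, average = 20
observed, bound = markov_check(ages, 4)
print(observed, bound)

# The strict version of the question (> 80) can only give a smaller proportion.
strictly_older = sum(1 for x in ages if x > 80) / len(ages)
print(strictly_older)
```

With this list exactly one quarter of the entries are at least 80, so the bound 1/4 cannot be improved in general, while the proportion strictly above 80 is 0.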


Lecture 4.1 How the average/other represents data

Measures of location summarize what is typical of elements of a list, but not every element is typical. Are all the elements close to each other? Are most of the elements close to each other? What is the biggest difference between elements? On the average, how far are the elements from each other? Measures of spread or variability tell us.

The three most common measures of spread or variability are the range, the interquartile range (IQR), and the standard deviation (SD).

The range of a list is the largest value minus the smallest value.

It is the width of the smallest interval that contains all the data, so it measures spread. It is not resistant, because changing just one datum can make it arbitrarily large.

Range and interquartile range.

  • How far are these data from the center.
  • Spread 
  • IQR : Inter quartile range

081503207683590.png

The middle 50% of the data are spread over 8 years.
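The range and the IQR can be sketched as below. The data behind the figure are not available in these notes, so the list here is purely illustrative, and the quartiles use the "smallest value at least as big as a fraction p of the data" convention (other conventions give slightly different quartiles). The helper names are my own.

```python
import math

def quantile(data, p):
    """Smallest value at least as big as a fraction p of the data (0 < p <= 1)."""
    ordered = sorted(data)
    rank = max(1, math.ceil(p * len(ordered)))
    return ordered[rank - 1]

def value_range(data):
    """Largest value minus smallest value."""
    return max(data) - min(data)

def iqr(data):
    """Width of the interval holding the middle 50% of the data."""
    return quantile(data, 0.75) - quantile(data, 0.25)

data = [2, 3, 3, 4, 4, 5, 6, 7]  # illustrative list, not the data in the figure
print(value_range(data), iqr(data))

# Changing one datum makes the range arbitrarily large; the IQR does not move.
with_outlier = [2, 3, 3, 4, 4, 5, 6, 700]
print(value_range(with_outlier), iqr(with_outlier))
```

The second call illustrates why the range is not resistant while the IQR is.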


Lecture 4.2 Standard Deviation

Deviation from average: roughly how far are the numbers from their average?

  • list : 2, 3, 3, 4, 4, 5, 6, 7 average = 4.25
  • deviations (entry - average): -2.25, -1.25, -1.25, -0.25, -0.25, 0.75, 1.75, 2.75    --->    the average of the deviations is 0.
  • Averaging the absolute values of the deviations would avoid this cancellation, BUT absolute values do not have good math properties.

Standard Deviation

  • Root mean square (rms) of the deviations from the average.

The rms (root mean square) of a list measures the average size of its entries. It is defined as follows:

rms = square-root( (sum of the squares of the entries)/(number of entries) )

= [ (sum of squares of the entries)/(number of entries) ]^(1/2).

  • How is the SD computed for a list of data?

List: 2, 3, 3, 4, 4, 5, 6, 7    average = 4.25

variance = mean square of the deviations from the average = 19.5/8 = 2.4375 (about 2.44)

SD = square-root(2.44) = 1.56

The average and the SD use the same units.

---> SD is a measure of the spread of the data.

  • The interval average +- SD is roughly [2.69, 5.81]
  • It picks up a good chunk of the list (5 of the 8 entries), but not all.
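The SD computation above can be reproduced with a short sketch. It follows the definition in these notes (rms of the deviations from the average, dividing by the number of entries, not by n - 1); the function names are my own.

```python
import math

def rms(entries):
    """Root mean square: sqrt( (sum of squares of the entries) / (number of entries) )."""
    return math.sqrt(sum(x * x for x in entries) / len(entries))

def standard_deviation(data):
    """SD as the rms of the deviations from the average."""
    average = sum(data) / len(data)
    return rms([x - average for x in data])

data = [2, 3, 3, 4, 4, 5, 6, 7]
average = sum(data) / len(data)
sd = standard_deviation(data)
print(average, round(sd, 2))

# Entries inside the interval average +- SD.
inside = [x for x in data if abs(x - average) <= sd]
print(inside)
```

For this list the entries inside average +- SD are 3, 3, 4, 4 and 5, which is the "good chunk" mentioned above.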

081503219094375.png


Lecture 4.3 Properties of the SD: Chebychev's inequality

In a nutshell

Rough statement : No matter what the list, the vast majority of entries will be in the range average +- a_few_SDs.

  • Chebychev (19th century)
  • Chebychev's inequality:

081503239879773.png

  • Precise statement:

No matter what the list, a proportion of at least 1-1/k^2 of the entries will be in the range average +/- k*SD

Proof

Intuitively, if the proportion of entries that are k*SD or more away from the average were bigger than 1/k^2, then those entries alone would make the mean squared deviation bigger than (1/k^2)*(k*SD)^2 = SD^2, which is impossible.
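Chebychev's inequality can also be checked numerically. The sketch below uses the list from Lecture 4.2 and counts entries strictly inside average +/- k*SD, which is the conservative way to count; the helper name `chebychev_check` is hypothetical, and the sketch assumes the SD is positive.

```python
import math

def chebychev_check(data, k):
    """Return (proportion strictly inside average +/- k*SD, bound 1 - 1/k^2).

    Assumes SD > 0 (a list whose entries are not all equal).
    """
    n = len(data)
    average = sum(data) / n
    sd = math.sqrt(sum((x - average) ** 2 for x in data) / n)
    if sd == 0:
        raise ValueError("this sketch assumes a positive SD")
    inside = sum(1 for x in data if abs(x - average) < k * sd) / n
    return inside, 1 - 1 / k ** 2

data = [2, 3, 3, 4, 4, 5, 6, 7]
for k in (1.5, 2, 3):
    inside, bound = chebychev_check(data, k)
    print(k, inside, round(bound, 4))
```

For this list the observed proportions are well above the bound, which is typical: the inequality holds for every list, so it is usually far from tight for any particular one.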


Footnotes

[1]. meaning of distances for "Mean, Median, Mode":

    For the mean, the distance between two numbers is defined to be the square of their difference. 

    That is, the sum of the squares of the differences between the data and the mean is smaller than the sum of squares of the differences between the data and any other number. (Equivalently, the rms or root mean square of the differences from the mean is smaller than the rms of the list of differences from any other number—the rms is defined and discussed below.)

    For the median, the distance between two numbers is defined to be the absolute value of their difference. That is, the sum of the absolute values of the differences between a median and the data is no larger than the sum of the absolute values of the differences between any other number and the data. 

    For the mode, the distance between two numbers is defined to be zero if the numbers are equal, and one if they are not equal. That is, the number of data that differ from a mode is no larger than the number of data that differ from any other value. Equivalently, a mode is a number from which the fewest possible data differ: a "most common" value. 
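The three notions of distance in this footnote can be compared with a brute-force sketch: for each distance, search a grid of candidate values for the one with the smallest total distance to the data. The list and the grid (0 to 10 in steps of 0.25) are arbitrary choices for illustration, and all names are my own.

```python
def squared(x, c):
    """Distance used by the mean: square of the difference."""
    return (x - c) ** 2

def absolute(x, c):
    """Distance used by the median: absolute value of the difference."""
    return abs(x - c)

def mismatch(x, c):
    """Distance used by the mode: 0 if equal, 1 otherwise."""
    return 0 if x == c else 1

def best_center(data, candidates, distance):
    """Candidate with the smallest total distance to the data."""
    return min(candidates, key=lambda c: sum(distance(x, c) for x in data))

data = [2, 3, 3, 7]
candidates = [i / 4 for i in range(41)]  # 0, 0.25, ..., 10 (illustrative grid)

print(best_center(data, candidates, squared))   # 3.75, the mean
print(best_center(data, candidates, absolute))  # 3.0, the median
print(best_center(data, candidates, mismatch))  # 3.0, the mode
```

Only the definition of distance changes between the three calls; the search itself is the same.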





Reposted from: https://www.cnblogs.com/hphp/p/3584133.html
