More Distributions and the Central Limit Theorem
The normal distribution
What is the normal distribution?
Symmetrical
Area = 1
Curve never hits 0
Described by mean and standard deviation
- Mean: 20
- Standard deviation: 3
- Standard normal distribution
  - Mean: 0
  - Standard deviation: 1
Areas under the normal distribution
68% falls within 1 standard deviation
95% falls within 2 standard deviations
99.7% falls within 3 standard deviations
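These percentages can be checked directly with `pnorm()` on the standard normal distribution. A minimal sketch; `within_k_sd` is a helper name introduced here for illustration only:

```r
# Share of a normal distribution lying within k standard deviations of the mean
within_k_sd <- function(k) pnorm(k) - pnorm(-k)

coverage <- sapply(1:3, within_k_sd)
round(coverage, 4)
# 0.6827 0.9545 0.9973
```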
Lots of histograms look normal
Normal distribution
Women's heights from NHANES
Mean: 161cm Standard deviation: 7cm
Approximating data with the normal distribution
What percent of women are shorter than 154 cm?
pnorm(154, mean = 161, sd = 7) #0.159
16% of women in the survey are shorter than 154 cm
What percent of women are taller than 154 cm?
pnorm(154, mean = 161, sd = 7, lower.tail = FALSE) # 0.8413447
What percent of women are 154-157cm?
pnorm(157, mean = 161, sd = 7) - pnorm(154, mean = 161, sd = 7) # 0.1252
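The same interval can also be computed on the standard normal scale by converting each height to a z-score first. A short sketch of that equivalence; the variable names are illustrative and not part of the original notes:

```r
mean_height <- 161
sd_height <- 7

# Convert heights to z-scores: (value - mean) / standard deviation
z_low <- (154 - mean_height) / sd_height   # -1
z_high <- (157 - mean_height) / sd_height

# Probability on the standard normal scale (mean 0, sd 1)
p_standard <- pnorm(z_high) - pnorm(z_low)

# Probability on the original scale, as computed above
p_original <- pnorm(157, mean = mean_height, sd = sd_height) -
  pnorm(154, mean = mean_height, sd = sd_height)

all.equal(p_standard, p_original)
# TRUE
```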
What height are 90% of women shorter than?
qnorm(0.9, mean = 161, sd = 7) # 169.9709
What height are 90% of women taller than?
qnorm(0.9, mean = 161, sd = 7, lower.tail = FALSE) # 152.03
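`qnorm()` is the inverse of `pnorm()`, so feeding a quantile back in should recover the probability. A small sketch of that round trip, using the same mean and standard deviation as above (variable names are illustrative):

```r
# Height below which 90% of women fall
h90 <- qnorm(0.9, mean = 161, sd = 7)

# Feeding that height back into pnorm() recovers the probability
p_back <- pnorm(h90, mean = 161, sd = 7)
p_back
# 0.9

# With lower.tail = FALSE, the two cutoffs mirror each other around the mean
h10 <- qnorm(0.9, mean = 161, sd = 7, lower.tail = FALSE)
(h90 + h10) / 2
# 161
```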
Generating random numbers
# Generate 10 random heights
rnorm(10, mean = 161, sd = 7)
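Random draws differ on every run unless a seed is set. The sketch below fixes an arbitrary seed (42 is an assumption, any integer works) and draws many heights to show that the sample statistics settle near the parameters:

```r
set.seed(42)  # arbitrary seed, chosen only so the draws can be reproduced

heights <- rnorm(10000, mean = 161, sd = 7)

# With many draws, the sample statistics land close to the parameters
mean(heights)  # should be near 161
sd(heights)    # should be near 7
```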
The central limit theorem
Rolling the dice 5 times
library(dplyr) # provides the %>% pipe used below
die <- c(1, 2, 3, 4, 5, 6)
# Roll 5 times
sample_of_5 <- sample(die, 5, replace=TRUE)
sample_of_5
# 1 3 4 1 1
mean(sample_of_5)
# 2.0
# Roll 5 times and take mean
sample(die, 5, replace=TRUE) %>% mean()
# 4.4
sample(die, 5, replace=TRUE) %>% mean()
# 3.8
Rolling the dice 5 times 10 times
Repeat 10 times:
- Roll 5 times
- Take the mean
sample_means <- replicate(10, sample(die, 5, replace=TRUE) %>% mean())
sample_means
# 3.8, 4.0, 3.8, 3.6, 3.2, 4.8, 2.6, 3.0, 2.6, 2.0
Sampling distributions
Sampling distribution of the sample mean
100 sample means
replicate(100, sample(die, 5, replace=TRUE) %>% mean())
# 2.8 3.2 1.8 4.6 4.0 2.8 4.4 2.4 3.4 2.8 4.2 3.4...
1000 sample means
sample_means <- replicate(1000, sample(die, 5, replace=TRUE) %>% mean())
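To inspect the shape of this sampling distribution without a plot, the 1000 means can be tabulated and compared with the theoretical values for a fair die (mean 3.5, standard deviation of the mean sqrt(35/12)/sqrt(5)). This is a sketch with an arbitrary seed; it uses plain function calls instead of `%>%` so it runs without extra packages:

```r
set.seed(1)  # arbitrary seed for reproducibility
die <- c(1, 2, 3, 4, 5, 6)

sample_means <- replicate(1000, mean(sample(die, 5, replace = TRUE)))

# Counts per distinct mean: largest near the middle, tapering off at both ends
table(sample_means)

# Compare with the theoretical values for a fair die
mean(sample_means)  # should be near 3.5
sd(sample_means)    # should be near sqrt(35 / 12) / sqrt(5), about 0.76
```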
Central limit theorem
The sampling distribution of a statistic becomes closer to the normal distribution as the number of trials increases.
- Samples should be random and independent
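One rough way to probe how normal a sampling distribution looks is to check what share of the sample means falls within one standard deviation of their centre; a normal distribution gives about 68%. The sketch below assumes 30 rolls per sample and an arbitrary seed, neither of which comes from the notes:

```r
set.seed(2)  # arbitrary seed for reproducibility
die <- 1:6

# 5000 means of 30 rolls each
means_30 <- replicate(5000, mean(sample(die, 30, replace = TRUE)))

# Share of sample means within 1 standard deviation of their centre
centre <- mean(means_30)
spread <- sd(means_30)
share_within_1sd <- mean(abs(means_30 - centre) <= spread)
share_within_1sd  # a normal distribution would give about 0.68
```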
Standard deviation and the CLT
replicate(1000, sample(die, 5, replace=TRUE)%>%sd())
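The sample standard deviations can be summarized the same way as the sample means. In this sketch (arbitrary seed, illustrative variable names), the last line computes the population standard deviation of a fair die for comparison; with only 5 rolls per sample, the typical sample value tends to sit somewhat below it:

```r
set.seed(3)  # arbitrary seed for reproducibility
die <- 1:6

sample_sds <- replicate(1000, sd(sample(die, 5, replace = TRUE)))
summary(sample_sds)

# Population standard deviation of a fair die, for comparison
sqrt(mean((die - mean(die))^2))
# 1.707825
```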
Proportions and the CLT
sales_team <- c("Amir", "Brian", "Claire", "Damian")
sample(sales_team, 10, replace=TRUE)
Sampling distribution of proportion
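The notes use a `sample_props` vector without showing how it is built. One plausible construction, assuming each of the four names is equally likely to be drawn (`prop_claire` is a hypothetical helper and the seed is arbitrary):

```r
set.seed(4)  # arbitrary seed for reproducibility
sales_team <- c("Amir", "Brian", "Claire", "Damian")

# Proportion of "Claire" in one sample of 10 draws
prop_claire <- function() {
  mean(sample(sales_team, 10, replace = TRUE) == "Claire")
}

# Repeat 1000 times to build the sampling distribution of the proportion
sample_props <- replicate(1000, prop_claire())
mean(sample_props)  # should be near the true proportion, 1/4
```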
Mean of sampling distribution
# Estimate expected value of die
mean(sample_means)
# 3.48
# Estimate proportion of "Claire"s
mean(sample_props)
# 0.26
- Estimate characteristics of unknown underlying distribution
- More easily estimate characteristics of large populations
The Poisson distribution
Poisson processes
- Events appear to happen at a certain rate, but completely at random
- Examples
  - Number of animals adopted from an animal shelter per week
  - Number of people arriving at a restaurant per hour
  - Number of earthquakes in California per year
Poisson distribution
- Probability of some # of events occurring over a fixed period of time
- Examples
  - Probability of >= 5 animals adopted from an animal shelter per week
  - Probability of 12 people arriving at a restaurant per hour
  - Probability of < 20 earthquakes in California per year
Lambda (λ)
- λ = average number of events per time interval
- Average number of adoptions per week = 8
- The distribution peaks at or near lambda
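The location of the peak can be checked with `dpois()`. In this sketch for lambda = 8, the probabilities of 7 and 8 events tie for the maximum, which is what happens whenever lambda is a whole number (variable names are illustrative):

```r
lambda <- 8
counts <- 0:20
probs <- dpois(counts, lambda = lambda)

# Probabilities of 7, 8 and 9 events: lambda - 1 and lambda share the peak
round(probs[counts %in% c(7, 8, 9)], 4)
# 0.1396 0.1396 0.1241
```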
Probability of a single value
If the average number of adoptions per week is 8, what is P(# adoptions in a week = 5)?
dpois(5, lambda = 8)
# 0.09160366
Probability of less than or equal to
If the average number of adoptions per week is 8, what is P(# adoptions in a week <= 5)?
ppois(5, lambda = 8)
# 0.1912361
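Because the Poisson distribution is discrete, `ppois()` is just a running sum of `dpois()`. A quick check of that relationship for the example above:

```r
# P(X <= 5) is the sum of P(X = 0), P(X = 1), ..., P(X = 5)
p_sum <- sum(dpois(0:5, lambda = 8))
p_cdf <- ppois(5, lambda = 8)

all.equal(p_sum, p_cdf)
# TRUE
```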
Probability of greater than
If the average number of adoptions per week is 8, what is P(# adoptions in a week > 5)?
ppois(5, lambda = 8, lower.tail = FALSE)
# 0.8087639
If the average number of adoptions per week is 10, what is P(# adoptions in a week > 5)?
ppois(5, lambda = 10, lower.tail = FALSE)
# 0.932914
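Note that `lower.tail = FALSE` gives a strict "greater than". For a question phrased as "at least 5" (>= 5), the cutoff has to move down by one, since counts are whole numbers. A sketch with lambda = 8 (variable names are illustrative):

```r
# P(X > 5): strictly more than 5
p_more_than_5 <- ppois(5, lambda = 8, lower.tail = FALSE)

# P(X >= 5): shift the cutoff down by one
p_at_least_5 <- ppois(4, lambda = 8, lower.tail = FALSE)

# The difference between the two is exactly P(X = 5)
all.equal(p_at_least_5 - p_more_than_5, dpois(5, lambda = 8))
# TRUE
```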
Sampling from a Poisson distribution
rpois(10, lambda = 8)
# 13 6 11 7 10 8 7 3 7 6
The CLT still applies!
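This can be simulated the same way as the die rolls. The sketch assumes 30 draws per sample and an arbitrary seed; the means should centre on lambda with a spread near sqrt(lambda / 30):

```r
set.seed(5)  # arbitrary seed for reproducibility

# 1000 means of 30 Poisson draws each (lambda = 8)
pois_means <- replicate(1000, mean(rpois(30, lambda = 8)))

mean(pois_means)  # should be near lambda = 8
sd(pois_means)    # should be near sqrt(8 / 30), about 0.52
```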
More probability distributions
Exponential distribution
- Probability of time between Poisson events
- Examples
  - Probability of > 1 day between adoptions
  - Probability of < 10 minutes between restaurant arrivals
  - Probability of 6-8 months between earthquakes
- Also uses lambda (rate)
- Continuous (time)
Customer service requests
- On average, one customer service ticket is created every 2 minutes
- λ = 0.5 customer service tickets created each minute
Lambda in exponential distribution
How long until a new request is created?
P(wait < 1 min) =
pexp(1, rate = 0.5)
# 0.3934693
P(wait > 4 min) =
pexp(4, rate = 0.5, lower.tail = FALSE)
# 0.1353353
P(1 min < wait < 4 min) =
pexp(4, rate = 0.5) - pexp(1, rate = 0.5)
# 0.4711954
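These values follow from the closed form of the exponential CDF, P(wait < t) = 1 - exp(-rate * t). A short sketch comparing the formula with `pexp()` (variable names are illustrative):

```r
rate <- 0.5

# Closed-form probabilities
p_less_1 <- 1 - exp(-rate * 1)  # P(wait < 1 min)
p_more_4 <- exp(-rate * 4)      # P(wait > 4 min)

all.equal(p_less_1, pexp(1, rate = rate))
# TRUE
all.equal(p_more_4, pexp(4, rate = rate, lower.tail = FALSE))
# TRUE
```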
Expected value of exponential distribution
In terms of rate (Poisson):
- λ = 0.5 requests per minute
In terms of time (exponential):
- 1/λ = 2 minutes per request on average (1 request per 2 minutes)
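A simulation makes the 1/λ relationship concrete. This sketch draws many waiting times with `rexp()` (arbitrary seed, illustrative variable name) and checks that their average is close to 2 minutes:

```r
set.seed(6)  # arbitrary seed for reproducibility

# 10000 simulated waiting times, in minutes
waits <- rexp(10000, rate = 0.5)

mean(waits)  # should be near 1 / 0.5 = 2 minutes
```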
(Student's) t-distribution
- Similar shape to the normal distribution
Degrees of freedom
- Has a degrees of freedom (df) parameter, which affects the thickness of the tails
- Lower df = thicker tails, higher standard deviation
- Higher df = closer to normal distribution
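The effect of df on the tails can be seen by comparing tail probabilities from `pt()` with the normal tail from `pnorm()`. The df values below are illustrative choices, not taken from the notes:

```r
dfs <- c(2, 10, 100)

# Probability of falling more than 2 units below the centre
tail_t <- pt(-2, df = dfs)
tail_normal <- pnorm(-2)

# The t tails shrink toward the normal tail as df grows
round(tail_t, 4)
round(tail_normal, 4)
# 0.0228
```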
Log-normal distribution
- Variable whose logarithm is normally distributed
- Examples:
  - Length of chess games
  - Adult blood pressure
  - Number of hospitalizations in the 2003 SARS outbreak
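The defining property can be checked in R with `rlnorm()` and `plnorm()`. In this sketch, `meanlog = 0` and `sdlog = 1` are illustrative parameter values and the seed is arbitrary:

```r
set.seed(7)  # arbitrary seed for reproducibility

# Draw from a log-normal distribution
x <- rlnorm(10000, meanlog = 0, sdlog = 1)

# Taking logs gives values that behave like a normal sample
mean(log(x))  # should be near 0
sd(log(x))    # should be near 1

# The log-normal CDF is the normal CDF applied to log(x)
cdf_matches <- all.equal(plnorm(2, meanlog = 0, sdlog = 1),
                         pnorm(log(2), mean = 0, sd = 1))
cdf_matches
# TRUE
```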