Content from DataCamp (English-only version)
Summary Statistics
What is statistics
What is statistics?
- The field of statistics: the practice and study of collecting and analyzing data
- A summary statistic: a fact about or summary of some data
What can statistics do?
- How likely is someone to purchase a product? Are people more likely to purchase it if they can use a different payment system?
- How many occupants will your hotel have? How can you optimize occupancy?
- How many sizes of jeans need to be manufactured so they can fit 95% of the population? Should the same number of each size be produced?
- A/B tests: Which ad is more effective in getting people to purchase a product?
What can't statistics do?
- Why is Game of Thrones so popular?
Instead...
- Are series with more violent scenes viewed by more people?
But...
- Even so, this can't tell us whether more violent scenes lead to more views.
Types of statistics
Descriptive statistics
- Describe and summarize data
- 50% of friends drive to work
- 25% take the bus
- 25% bike
Inferential statistics
- Use a sample of data to make inferences about a larger population
What percentage of people drive to work?
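The two kinds of statistics can be contrasted in a few lines of base R. This is only a sketch: the `commute` vector is made-up data chosen to match the 50% / 25% / 25% split above, and the exact binomial interval from `binom.test()` is one of several possible ways to express uncertainty about the population share.

```r
# Hypothetical sample: how 8 friends get to work (made-up data)
commute <- c("drive", "drive", "bus", "bike", "drive", "bus", "drive", "bike")

# Descriptive: summarize the sample itself as proportions
sample_props <- prop.table(table(commute))
print(sample_props)

# Inferential: use the sample to estimate the share of drivers in the population
n_drive <- sum(commute == "drive")
estimate <- n_drive / length(commute)

# A 95% exact binomial confidence interval for that population share
ci <- binom.test(n_drive, length(commute))$conf.int
print(estimate)
print(ci)
```

With only 8 observations the interval is wide, which is the point: a small sample says little about the larger population.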
Types of data
Numeric (Quantitative)
- Continuous (Measured)
  - Airplane speed
  - Time spent waiting in line
- Discrete (Counted)
  - Number of pets
  - Number of packages shipped
Categorical (Qualitative)
- Nominal (Unordered)
  - Married/unmarried
  - Country of residence
- Ordinal (Ordered)
Categorical data can be represented as numbers
Nominal (Unordered)
- Married/unmarried (1/0)
- Country of residence (1, 2, ...)
Ordinal (Ordered)
- Strongly disagree (1)
- Somewhat disagree (2)
- Neither agree nor disagree (3)
- Somewhat agree (4)
- Strongly agree (5)
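In R, this numeric representation is what a factor stores internally. The sketch below uses made-up survey responses. The variable names (`responses`, `likert_levels`, `agreement`, `married`) are illustrative and not from the course.

```r
# Hypothetical survey responses (made-up data)
responses <- c("Somewhat agree", "Strongly disagree",
               "Strongly agree", "Neither agree nor disagree")

# Ordinal: declare the levels in their natural order
likert_levels <- c("Strongly disagree", "Somewhat disagree",
                   "Neither agree nor disagree", "Somewhat agree",
                   "Strongly agree")
agreement <- factor(responses, levels = likert_levels, ordered = TRUE)

# The integer codes follow the level order: 4 1 5 3
codes <- as.integer(agreement)
print(codes)

# Ordered factors support comparisons
first_is_higher <- agreement[1] > agreement[2]  # TRUE

# Nominal: a 1/0 code carries no order; it is only a label
married <- c("married", "unmarried", "married")
married_code <- as.integer(married == "married")  # 1 0 1
```

The codes for a nominal variable are arbitrary, so arithmetic on them (a mean of country codes, say) is not meaningful.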
Why does data type matter?
Measures of center
Mammal sleep data
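library(ggplot2)  # provides the msleep dataset
library(dplyr)    # provides %>%, count(), filter(), summarize(), select()
msleep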
Histograms
How long do mammals in this dataset typically sleep?
What's a typical value?
Where is the center of the data?
- Mean
- Median
- Mode
Measures of center: mean
mean(msleep$sleep_total)
Measures of center: median
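sort(msleep$sleep_total)      # sort the 83 values in ascending order
sort(msleep$sleep_total)[42]  # the middle value (42nd of 83)
median(msleep$sleep_total)    # same result, computed directly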
Measures of center: mode
Most frequent value
msleep %>% count(sleep_total, sort=TRUE)
msleep %>% count(vore, sort=TRUE)
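Base R has no function for the statistical mode (the built-in `mode()` reports an object's storage type instead). A minimal sketch of a helper, assuming ties should all be returned; `stat_mode` and the `diets` vector are hypothetical names with made-up data:

```r
# Hypothetical helper: return the most frequent value(s) of x as character
stat_mode <- function(x) {
  counts <- table(x)
  names(counts)[as.vector(counts) == max(counts)]
}

# Made-up diet categories for illustration
diets <- c("herbi", "carni", "herbi", "omni", "herbi", "omni")
most_common <- stat_mode(diets)
print(most_common)  # "herbi"
```

The mode is the only one of the three measures of center that also works for categorical data such as `vore`.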
Adding an outlier
msleep %>%
  filter(vore == "insecti") %>%
  summarize(mean_sleep = mean(sleep_total),
            median_sleep = median(sleep_total))
Mean: 16.5 -> 13.2
Median: 18.9 -> 18.1
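The same effect can be reproduced on a small vector. The values below are made up for illustration and are not the insectivore data from `msleep`:

```r
# Hypothetical sleep times in hours (made-up data)
sleep_hours <- c(18, 19, 20, 19)
mean_before <- mean(sleep_hours)      # 19
median_before <- median(sleep_hours)  # 19

# Add one extreme value and recompute
with_outlier <- c(sleep_hours, 0)
mean_after <- mean(with_outlier)      # 15.2
median_after <- median(with_outlier)  # 19

print(c(mean_before, mean_after))
print(c(median_before, median_after))
```

The mean is pulled toward the outlier while the median stays put, which is why the median is usually the better summary for skewed data.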
Measures of spread
What is spread?
Variance
Average squared distance from each data point to the data's mean
dists <- msleep$sleep_total - mean(msleep$sleep_total)
dists
# 1.66626506 6.56626506 ... -4.13373494 2.06626506 -0.63373494
squared_dists <- (dists)^2
sum_sq_dists <- sum(squared_dists)
sum_sq_dists
# 1624.066
sum_sq_dists/82 #19.80568
var(msleep$sleep_total) #19.80568
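The steps above amount to the sample variance formula. In the notation assumed here, $x_i$ are the `sleep_total` values, $\bar{x}$ is their mean, and $n = 83$ is the number of observations (so the divisor 82 is $n - 1$, which is also what `var()` uses):

```latex
s^2 = \frac{1}{n - 1} \sum_{i=1}^{n} (x_i - \bar{x})^2
    = \frac{1624.066}{82} \approx 19.806
```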
Standard deviation
sqrt(var(msleep$sleep_total))
#4.450357
sd(msleep$sleep_total)
#4.450357
Mean absolute deviation
dists <- msleep$sleep_total - mean(msleep$sleep_total)
mean(abs(dists))
#3.566701
Standard deviation vs. mean absolute deviation
- SD squares distances, penalizing longer distances more than shorter ones
- MAD penalizes each distance equally
- One isn't better than the other, but SD is more common than MAD
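A small comparison on made-up numbers shows both measures side by side. `mean_abs_dev` is a hypothetical helper. Note that base R's `mad()` computes the median absolute deviation, which is a different statistic.

```r
# Hypothetical helper: mean absolute deviation around the mean
mean_abs_dev <- function(x) mean(abs(x - mean(x)))

# Made-up data, without and with one far-away point
x <- c(1, 2, 3, 4, 5)
x_out <- c(1, 2, 3, 4, 15)

mad_x <- mean_abs_dev(x)          # 1.2
mad_x_out <- mean_abs_dev(x_out)  # 4

print(c(sd = sd(x), mad = mad_x))
print(c(sd = sd(x_out), mad = mad_x_out))
```

Both measures grow when the far-away point is added. The squaring in SD is what makes that single point account for most of its value.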
Quartiles
quantile(msleep$sleep_total)
# 0% 25% 50% 75% 100%
#1.90 7.85 10.10 13.75 19.90
Second quartile/50th percentile = median
Boxplots use quartiles
ggplot(msleep, aes(y=sleep_total)) + geom_boxplot()
Quantiles
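Quantiles generalize quartiles: the data can be split into any number of equal-sized pieces by passing the cut points to `quantile()` through `probs`. A sketch on a made-up vector (not the `msleep` data), using R's default quantile method:

```r
# Hypothetical data: the integers 1 to 9
x <- 1:9

# Quartiles: cut points at 0%, 25%, 50%, 75%, 100%
quartiles <- quantile(x, probs = seq(0, 1, 0.25))
print(quartiles)  # 1 3 5 7 9

# Quintiles: five pieces, cut points every 20%
quintiles <- quantile(x, probs = seq(0, 1, 0.2))
print(quintiles)
```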
Interquartile range(IQR)
Height of the box in a boxplot
quantile(msleep$sleep_total, 0.75) - quantile(msleep$sleep_total, 0.25)
# 75%
# 5.9
Outliers
Outlier: data point that is substantially different from the others
How do we know what a substantial difference is? A data point is an outlier if:
- data < Q1 - 1.5 × IQR, or
- data > Q3 + 1.5 × IQR
Finding outliers
iqr <- quantile(msleep$bodywt, 0.75) - quantile(msleep$bodywt, 0.25)
lower_threshold <- quantile(msleep$bodywt, 0.25) - 1.5 * iqr
upper_threshold <- quantile(msleep$bodywt, 0.75) + 1.5 * iqr
msleep %>% filter(bodywt < lower_threshold | bodywt > upper_threshold) %>%
  select(name, vore, sleep_total, bodywt)
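The same rule can be wrapped in a reusable base R function. This is a sketch: `find_outliers` and the `weights` vector are hypothetical, and the 1.5 multiplier is the rule of thumb stated above, not a universal definition.

```r
# Hypothetical helper: return the values of x outside the 1.5 * IQR fences
find_outliers <- function(x) {
  q1 <- quantile(x, 0.25, names = FALSE)
  q3 <- quantile(x, 0.75, names = FALSE)
  iqr <- q3 - q1
  x[x < q1 - 1.5 * iqr | x > q3 + 1.5 * iqr]
}

# Made-up body weights with one extreme value
weights <- c(1, 2, 3, 4, 5, 6, 7, 8, 100)
flagged <- find_outliers(weights)
print(flagged)  # 100
```

Here Q1 is 3 and Q3 is 7, so the fences sit at -3 and 13 and only the value 100 falls outside them.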