Introduction to Statistics in R: 01 - Summary Statistics

521R

Last modified: 2024-03-15 21:47:42


First published: 2024-03-13 21:36:42

Article link: https://blog.csdn.net/m0_67662731/article/details/136692193

The content of this article comes from DataCamp (English-only course).

Summary Statistics

What is statistics

What is statistics?

The field of statistics - the practice and study of collecting and analyzing data
A summary statistic - a fact about or summary of some data

What can statistics do?

How likely is someone to purchase a product? Are people more likely to purchase it if they can use a different payment system?
How many occupants will your hotel have? How can you optimize occupancy?
How many sizes of jeans need to be manufactured so they can fit 95% of the population? Should the same number of each size be produced?
A/B tests: Which ad is more effective in getting people to purchase a product?

What can't statistics do?

Why is Game of Thrones so popular?

Instead...

Are series with more violent scenes viewed by more people?

But...

Even so, this can't tell us if more violent scenes lead to more views.

Types of statistics

Descriptive statistics

Describe and summarize data
50% of friends drive to work
25% take the bus
25% bike
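The shares above can be reproduced with a short base R sketch. The `commute` vector is made-up illustrative data for eight friends, not something from the course.

```r
# Hypothetical commute modes for eight friends (illustrative data only)
commute <- c("drive", "drive", "bus", "bike", "drive", "bus", "drive", "bike")

# Count each mode, then convert the counts to proportions
counts <- table(commute)
shares <- prop.table(counts)
shares   # bike 0.25, bus 0.25, drive 0.50
```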

Inferential statistics

Use a sample of data to make inferences about a larger population

What percentage of people drive to work?
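A minimal sketch of the idea, assuming a simulated population: the `population` vector, its 50% share of drivers, and the sample size of 200 are all invented for illustration. The sample proportion serves as an estimate of the population proportion, which in practice is unknown.

```r
set.seed(42)  # make the random sample reproducible

# Hypothetical population of 10,000 commuters; the true drive share is 50%
population <- rep(c("drive", "bus", "bike"), times = c(5000, 2500, 2500))

# Draw a random sample of 200 people and estimate the share of drivers
commute_sample <- sample(population, size = 200)
estimated_share <- mean(commute_sample == "drive")
estimated_share
```

The estimate will usually land close to, but not exactly on, the true 50%. That gap is the sampling error that inferential methods try to quantify.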

Types of data

Numeric (Quantitative)

Continuous (Measured)
- Airplane speed
- Time spent waiting in line
Discrete (Counted)
- Number of pets
- Number of packages shipped

Categorical (Qualitative)

Nominal (Unordered)
- Married/unmarried
- Country of residence
Ordinal (Ordered)

Categorical data can be represented as numbers

Nominal (Unordered)

Married/unmarried (1/0)
Country of residence (1, 2, ...)

Ordinal (Ordered)

Strongly disagree (1)
Somewhat disagree (2)
Neither agree nor disagree (3)
Somewhat agree (4)
Strongly agree (5)
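In R, one common way to hold categorical data with numeric codes is a factor. The sketch below uses made-up responses. The variable names are illustrative, and the 1/0 coding for marital status is produced by subtracting 1 from R's 1-based level codes.

```r
# Nominal: unordered categories, integer codes are arbitrary labels
marital <- factor(c("married", "unmarried", "married"),
                  levels = c("unmarried", "married"))
marital_codes <- as.integer(marital) - 1   # 1 0 1 (married = 1, unmarried = 0)

# Ordinal: ordered categories, integer codes follow the ranking
agreement_levels <- c("Strongly disagree", "Somewhat disagree",
                      "Neither agree nor disagree",
                      "Somewhat agree", "Strongly agree")
responses <- factor(c("Somewhat agree", "Strongly disagree", "Strongly agree"),
                    levels = agreement_levels, ordered = TRUE)
response_codes <- as.integer(responses)    # 4 1 5

# Order comparisons are only meaningful for the ordered factor
responses[1] > responses[2]                # TRUE
```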

Why does data type matter?

Measures of center

Mammal sleep data

library(ggplot2)  # msleep ships with ggplot2
msleep

Histograms

How long do mammals in this dataset typically sleep?

What's a typical value?

Where is the center of the data?

Mean
Median
Mode

Measures of center: mean

mean(msleep$sleep_total)

Measures of center: median

sort(msleep$sleep_total)
sort(msleep$sleep_total)[42]
median(msleep$sleep_total)
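The index 42 works because msleep has 83 rows, so the middle of the sorted values is position (83 + 1) / 2 = 42. A small sketch of the general rule, using a made-up vector `sleep_hours`:

```r
# Hypothetical sleep times in hours (illustrative data only)
sleep_hours <- c(10.1, 12.5, 3.0, 14.4, 8.7)

sorted_hours <- sort(sleep_hours)   # 3.0 8.7 10.1 12.5 14.4
n <- length(sorted_hours)

# Odd n: the middle element; even n: the average of the two middle elements
if (n %% 2 == 1) {
  manual_median <- sorted_hours[(n + 1) / 2]
} else {
  manual_median <- mean(sorted_hours[c(n / 2, n / 2 + 1)])
}

manual_median         # 10.1
median(sleep_hours)   # 10.1
```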

Measures of center: mode

Most frequent value

library(dplyr)  # provides %>% and count()
msleep %>% count(sleep_total, sort = TRUE)
msleep %>% count(vore, sort = TRUE)
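Base R has no built-in function for the statistical mode, so a frequency table is one way to get it without dplyr. This is a sketch on a made-up `diet` vector. Note that `which.max()` returns only the first value when several are tied for most frequent.

```r
# Hypothetical diet categories (illustrative data only)
diet <- c("herbi", "carni", "herbi", "omni", "herbi", "carni")

diet_counts <- table(diet)   # frequency of each distinct value
mode_value <- names(diet_counts)[which.max(diet_counts)]
mode_value                   # "herbi"
```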

Adding an outlier

msleep %>%
  filter(vore == "insecti") %>%
  summarize(mean_sleep = mean(sleep_total),
            median_sleep = median(sleep_total))

Mean: 16.5 -> 13.2

Median: 18.9 -> 18.1
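The same effect shows up on a tiny made-up vector: one extreme value moves the mean much more than the median. The numbers below are illustrative and are not the msleep values.

```r
# Hypothetical sleep times in hours (illustrative data only)
sleep_hours <- c(16, 17, 18, 19, 20)
mean(sleep_hours)      # 18
median(sleep_hours)    # 18

# Add a single very low value
with_outlier <- c(sleep_hours, 0)
mean(with_outlier)     # 15
median(with_outlier)   # 17.5
```

This is why the median is often preferred as the measure of center when the data are skewed or contain outliers.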

Measures of spread

What is spread?

Variance

Average squared distance from each data point to the data's mean

dists <- msleep$sleep_total - mean(msleep$sleep_total)
dists

# 1.66626506 6.56626506 ... -4.13373494 2.06626506 -0.63373494

squared_dists <- (dists)^2

sum_sq_dists <- sum(squared_dists)
sum_sq_dists

# 1624.066

sum_sq_dists / 82  # divide by n - 1 (83 observations), gives 19.80568

var(msleep$sleep_total) #19.80568

Standard deviation

sqrt(var(msleep$sleep_total))
#4.450357

sd(msleep$sleep_total)
#4.450357

Mean absolute deviation

dists <- msleep$sleep_total - mean(msleep$sleep_total)
mean(abs(dists))
#3.566701

Standard deviation vs. mean absolute deviation

SD squares distances, penalizing longer distances more than shorter ones
MAD penalizes each distance equally
One isn't better than the other, but SD is more common than MAD
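To make the difference concrete, here is a sketch with two made-up vectors that share the same mean and the same mean absolute deviation. The helper `mean_abs_dev()` is defined here for illustration and is not a base R function. Base R's `mad()` computes the median absolute deviation, which is a different statistic.

```r
# Illustrative helper: mean absolute deviation from the mean
mean_abs_dev <- function(x) mean(abs(x - mean(x)))

tight <- c(8, 8, 12, 12)    # every point is 2 away from the mean of 10
wide  <- c(6, 10, 10, 14)   # two points on the mean, two points 4 away

mean_abs_dev(tight)   # 2
mean_abs_dev(wide)    # 2

sd(tight)             # about 2.31
sd(wide)              # about 3.27
```

Both vectors have the same MAD, but the SD is larger for `wide` because squaring gives its two distant points more weight.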

Quartiles

quantile(msleep$sleep_total)
#  0%   25%   50%   75%   100%  
#1.90  7.85 10.10 13.75  19.90

Second quartile/50th percentile = median

Boxplots use quartiles

ggplot(msleep, aes(y=sleep_total)) + geom_boxplot()

Quantiles
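Quartiles are a special case of quantiles, which can split the data at any set of proportions. A minimal sketch using the `probs` argument of `quantile()` on a made-up vector `values`:

```r
# Illustrative data: the integers 0 to 100
values <- 0:100

# Split into five equal parts by listing the cut points explicitly
quintiles <- quantile(values, probs = c(0, 0.2, 0.4, 0.6, 0.8, 1))
quintiles   # 0 20 40 60 80 100

# The same cut points generated with seq()
quantile(values, probs = seq(0, 1, by = 0.2))
```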

Interquartile range (IQR)

Height of the box in a boxplot

quantile(msleep$sleep_total, 0.75) - quantile(msleep$sleep_total, 0.25)
# 75%
# 5.9
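The "75%" label in that output is just the name carried over from the first `quantile()` call. The value itself is the difference between the two quartiles. Base R also provides `IQR()`, which returns the same number without the label. A sketch on a made-up vector:

```r
# Illustrative data only
values <- c(1, 3, 5, 7, 9, 11, 13, 15, 17)

# Manual IQR: third quartile minus first quartile, with the name dropped
manual_iqr <- unname(quantile(values, 0.75) - quantile(values, 0.25))
manual_iqr     # 8

# Built-in equivalent from the stats package
IQR(values)    # 8
```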

Outliers

Outlier: data point that is substantially different from the others

How do we know what a substantial difference is? A data point is an outlier if:

data < Q1 - 1.5 × IQR or
data > Q3 + 1.5 × IQR

Finding outliers

iqr <- quantile(msleep$bodywt, 0.75) - quantile(msleep$bodywt, 0.25)
lower_threshold <- quantile(msleep$bodywt, 0.25) - 1.5 * iqr
upper_threshold <- quantile(msleep$bodywt, 0.75) + 1.5 * iqr

msleep %>%
  filter(bodywt < lower_threshold | bodywt > upper_threshold) %>%
  select(name, vore, sleep_total, bodywt)
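The same 1.5 × IQR rule can be wrapped in a small base R helper. This is a sketch: `find_outliers()` and the `body_weights` vector are made up for illustration, and the helper does not handle missing values.

```r
# Illustrative helper: return the values outside Q1 - 1.5 * IQR and Q3 + 1.5 * IQR
find_outliers <- function(x) {
  q1 <- quantile(x, 0.25, names = FALSE)
  q3 <- quantile(x, 0.75, names = FALSE)
  iqr <- q3 - q1
  x[x < q1 - 1.5 * iqr | x > q3 + 1.5 * iqr]
}

# Hypothetical body weights in kg (illustrative data only)
body_weights <- c(1, 2, 2, 3, 3, 3, 4, 4, 5, 60)
find_outliers(body_weights)   # 60
```

Here Q1 is 2.25 and Q3 is 4, so the thresholds are -0.375 and 6.625, and only the value 60 falls outside them.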
