Introduction to Statistics in R:01-Summary Statistics

Content from DataCamp (English-only version)

Summary Statistics

What is statistics

What is statistics?

  • The field of statistics - the practice and study of collecting and analyzing data

  • A summary statistic - a fact about or summary of some data

What can statistics do?

  • How likely is someone to purchase a product? Are people more likely to purchase it if they can use a different payment system?

  • How many occupants will your hotel have? How can you optimize occupancy?

  • How many sizes of jeans need to be manufactured so they can fit 95% of the population? Should the same number of each size be produced?

  • A/B tests: Which ad is more effective in getting people to purchase a product?

What can't statistics do?

  • Why is Game of Thrones so popular?

Instead...

  • Are series with more violent scenes viewed by more people?

But...

  • Even so, this can't tell us if more violent scenes lead to more views.

Types of statistics

Descriptive statistics
  • Describe and summarize data

  • 50% of friends drive to work

  • 25% take the bus

  • 25% bike

Inferential statistics
  • Use a sample of data to make inferences about a larger population

  • What percentage of people drive to work?
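The descriptive example above (shares of friends by commute mode) can be sketched in base R. The `commute` vector is a hypothetical stand-in for the four friends, not data from the course.

```r
# Hypothetical commute modes for four friends (illustrative only)
commute <- c("drive", "drive", "bus", "bike")

# table() counts each mode; prop.table() converts counts to shares
shares <- prop.table(table(commute))
shares
# bike 0.25, bus 0.25, drive 0.50
```

These shares describe only the friends who were asked. Saying something about all commuters from this sample would be an inferential question.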

Types of data

Numeric (Quantitative)
  • Continuous (Measured)

    • Airplane speed

    • Time spent waiting in line

  • Discrete (Counted)

    • Number of pets

    • Number of packages shipped

Categorical (Qualitative)
  • Nominal (Unordered)

    • Married/unmarried

    • Country of residence

  • Ordinal (Ordered)

Categorical data can be represented as numbers

Nominal (Unordered)
  • Married/unmarried (1/0)

  • Country of residence (1, 2, ...)

Ordinal (Ordered)
  • Strongly disagree (1)

  • Somewhat disagree (2)

  • Neither agree nor disagree (3)

  • Somewhat agree (4)

  • Strongly agree (5)
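In R, one common way to attach numbers to ordered categories like these is an ordered factor. The sketch below uses hypothetical survey responses (not from the course) and assumes the five agreement levels listed above.

```r
# Hypothetical survey responses (illustrative only)
responses <- c("Somewhat agree", "Strongly disagree",
               "Strongly agree", "Somewhat agree")

likert_levels <- c("Strongly disagree", "Somewhat disagree",
                   "Neither agree nor disagree",
                   "Somewhat agree", "Strongly agree")

# An ordered factor stores each response as an integer code (1 to 5) plus a label
agreement <- factor(responses, levels = likert_levels, ordered = TRUE)

codes <- as.integer(agreement)
codes
# 4 1 5 4

# Order-based comparisons are meaningful for ordinal data
agreement[1] > agreement[2]
# TRUE
```

The codes only record order. The gap between 1 and 2 is not necessarily the same as the gap between 4 and 5, so arithmetic on them should be treated with care.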

Why does data type matter?

Measures of center

Mammal sleep data

msleep

Histograms

How long do mammals in this dataset typically sleep?

What's a typical value?

Where is the center of the data?

  • Mean

  • Median

  • Mode

Measures of center: mean

mean(msleep$sleep_total)

Measures of center: median

sort(msleep$sleep_total)
sort(msleep$sleep_total)[42]
median(msleep$sleep_total)
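The sort-then-index approach above can be tried on a small vector. The values here are hypothetical, and the sketch assumes an odd number of observations, in which case the median sits at position (n + 1) / 2 of the sorted data. That would account for the index 42 above if the dataset has 83 rows.

```r
# Hypothetical sleep times in hours (odd number of values)
sleep_hours <- c(12.1, 8.4, 19.7, 10.1, 14.9)

sorted_hours <- sort(sleep_hours)
n <- length(sorted_hours)

# For odd n, the median is the middle element of the sorted values
middle_value <- sorted_hours[(n + 1) / 2]
middle_value
# 12.1

median(sleep_hours)
# 12.1
```

For an even number of values, median() averages the two middle elements instead.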

Measures of center: mode

Most frequent value

msleep %>% count(sleep_total, sort=TRUE)
msleep %>% count(vore, sort=TRUE)
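Base R has no built-in function for the statistical mode, so counting is the usual route. As a minimal sketch without dplyr, the same idea works with table(). The `diets` vector is hypothetical.

```r
# Hypothetical diet categories (illustrative only)
diets <- c("herbi", "carni", "herbi", "omni", "herbi", "carni")

# table() counts each distinct value; sort() puts the most frequent first
diet_counts <- sort(table(diets), decreasing = TRUE)

most_common <- names(diet_counts)[1]
most_common
# "herbi"
```

If several values tie for the highest count, this returns only one of them.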

Adding an outlier

msleep %>%
  filter(vore == "insecti") %>%
  summarize(mean_sleep = mean(sleep_total),
            median_sleep = median(sleep_total))

Mean: 16.5 -> 13.2

Median: 18.9 -> 18.1
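The same pattern, where the mean moves much more than the median, can be reproduced on a small made-up vector. The numbers below are hypothetical and are not the course's insectivore data.

```r
# Hypothetical sleep times in hours (illustrative only)
sleep_hours  <- c(18.1, 18.9, 19.7, 19.9)
with_outlier <- c(sleep_hours, 0)   # add one extreme value

mean(sleep_hours)      # 19.15
mean(with_outlier)     # 15.32

median(sleep_hours)    # 19.3
median(with_outlier)   # 18.9
```

Here a single extreme value shifts the mean by almost 4 hours but the median by less than half an hour, which is why the median is often preferred for skewed data.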

Measures of spread

What is spread?

Variance

Average squared distance from each data point to the data's mean (for a sample, the sum of squared distances is divided by n - 1)

dists <- msleep$sleep_total - mean(msleep$sleep_total)
dists
# 1.66626506 6.56626506 ... -4.13373494 2.06626506 -0.63373494
squared_dists <- (dists)^2
sum_sq_dists <- sum(squared_dists)
sum_sq_dists
# 1624.066
sum_sq_dists / 82 # divide by n - 1 (83 rows - 1); 19.80568
var(msleep$sleep_total) #19.80568
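The steps above can be checked by hand on a small vector. The values are hypothetical and were chosen so the arithmetic is easy to follow (their mean is 5).

```r
# Hypothetical values with mean 5 (illustrative only)
values <- c(2, 4, 4, 4, 5, 5, 7, 9)

dists <- values - mean(values)   # -3 -1 -1 -1 0 0 2 4
sum_sq_dists <- sum(dists^2)     # 9 + 1 + 1 + 1 + 0 + 0 + 4 + 16 = 32

# Sample variance divides by n - 1, which is what var() does
sample_var <- sum_sq_dists / (length(values) - 1)

all.equal(sample_var, var(values))
# TRUE
```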

Standard deviation

sqrt(var(msleep$sleep_total))
#4.450357
sd(msleep$sleep_total)
#4.450357

Mean absolute deviation

dists <- msleep$sleep_total - mean(msleep$sleep_total)
mean(abs(dists))
#3.566701

Standard deviation vs. mean absolute deviation

  • SD squares distances, penalizing longer distances more than shorter ones

  • MAD penalizes each distance equally

  • One isn't better than the other, but SD is more common than MAD
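To illustrate the difference in how the two measures penalize distance, the sketch below adds one far-away point to a hypothetical vector and compares how much each measure grows. `mean_abs_dev` is a helper defined here for illustration. Note that base R's mad() computes the median absolute deviation, which is a different statistic.

```r
# Hypothetical values (illustrative only)
values <- c(2, 4, 4, 4, 5, 5, 7, 9)
with_far_point <- c(values, 30)   # add one distant observation

# Helper for the mean absolute deviation (not a base R function)
mean_abs_dev <- function(x) mean(abs(x - mean(x)))

# How many times larger does each measure get after adding the far point?
sd_ratio  <- sd(with_far_point) / sd(values)
mad_ratio <- mean_abs_dev(with_far_point) / mean_abs_dev(values)

sd_ratio > mad_ratio
# TRUE
```

Because SD squares each distance, the distant point inflates it proportionally more than it inflates the mean absolute deviation.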

Quartiles

quantile(msleep$sleep_total)
#  0%   25%   50%   75%   100%  
#1.90  7.85 10.10 13.75  19.90

Second quartile/50th percentile = median

Boxplots use quartiles

ggplot(msleep, aes(y=sleep_total)) + geom_boxplot()

Quantiles
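Quartiles are one special case of quantiles. quantile() also accepts a `probs` argument for splitting the data at other points. As a minimal sketch on hypothetical values, splitting into five equal parts looks like this:

```r
# Hypothetical values (illustrative only)
values <- 1:11

# seq(0, 1, 0.2) gives the cut points 0, 0.2, 0.4, 0.6, 0.8, 1
quintiles <- quantile(values, probs = seq(0, 1, 0.2))
quintiles
# 1 3 5 7 9 11 (at 0%, 20%, 40%, 60%, 80%, 100%)
```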

Interquartile range (IQR)

Height of the box in a boxplot

quantile(msleep$sleep_total, 0.75) - quantile(msleep$sleep_total, 0.25)
# 75%
# 5.9
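Base R also provides IQR(), which should give the same result as subtracting the two quartiles. A small check on hypothetical values:

```r
# Hypothetical values (illustrative only)
values <- 1:9

q <- quantile(values, c(0.25, 0.75))   # 25% = 3, 75% = 7
manual_iqr <- unname(q[2] - q[1])      # drop the "75%" name
manual_iqr
# 4

IQR(values)
# 4
```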

Outliers

Outlier: data point that is substantially different from the others

How do we know what a substantial difference is? A data point is an outlier if:

  • data < Q1 - 1.5 x IQR or

  • data > Q3 + 1.5 x IQR

Finding outliers

iqr <- quantile(msleep$bodywt, 0.75) - quantile(msleep$bodywt, 0.25)
lower_threshold <- quantile(msleep$bodywt, 0.25) - 1.5 * iqr
upper_threshold <- quantile(msleep$bodywt, 0.75) + 1.5 * iqr
msleep %>%
  filter(bodywt < lower_threshold | bodywt > upper_threshold) %>%
  select(name, vore, sleep_total, bodywt)
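The same 1.5 x IQR rule can be sketched without dplyr on a plain vector. The `weights` values are hypothetical, with one deliberately extreme observation.

```r
# Hypothetical body weights (illustrative only)
weights <- c(1, 2, 3, 4, 5, 6, 7, 8, 100)

q1 <- quantile(weights, 0.25)   # 3
q3 <- quantile(weights, 0.75)   # 7
iqr <- q3 - q1                  # 4

lower_threshold <- q1 - 1.5 * iqr   # -3
upper_threshold <- q3 + 1.5 * iqr   # 13

# Keep only the values that fall outside the two thresholds
outliers <- weights[weights < lower_threshold | weights > upper_threshold]
outliers
# 100
```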
