Introduction to Statistics in R: 04-Correlation and Experimental Design

Correlation

Relationships between two variables

  • x = explanatory/independent variable

  • y = response/dependent variable

Correlation coefficient

  • Quantifies the linear relationship between two variables

  • Number between -1 and 1

  • Magnitude corresponds to strength of relationship

  • Sign (+ or -) corresponds to direction of relationship

Magnitude = strength of relationship

0.99 (very strong relationship)

0.75 (strong relationship)

0.56 (moderate relationship)

0.21 (weak relationship)

0.04 (no relationship)

  • Knowing the value of x doesn't tell us anything about y

Sign = direction

Visualizing relationships

ggplot(df, aes(x, y)) + 
    geom_point()

Adding a trendline

ggplot(df, aes(x, y)) + 
    geom_point() +
    geom_smooth(method = "lm", se = FALSE)

Computing correlation

cor(df$x, df$y)
# -0.7472765
cor(df$y, df$x)
# -0.7472765

Correlation with missing values

df$x
# -3.2508382 -9.1599807 3.4515013 4.1505899    NA 11.9806140
cor(df$x, df$y)
# NA
cor(df$x, df$y, use = "pairwise.complete.obs")
# -0.7471757
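
To make the effect of use = "pairwise.complete.obs" concrete, here is a minimal sketch on made-up vectors. The names x_toy and y_toy and their values are assumptions for illustration, not the course data. With only two vectors, the option should give the same result as dropping every incomplete pair before calling cor().

```r
# Made-up example vectors (hypothetical, not the course data)
x_toy <- c(1, 2, 3, NA, 5, 6)
y_toy <- c(2.1, 3.9, 6.2, 8.1, 9.8, 12.2)

# Default behaviour: a single NA makes the whole result NA
r_default <- cor(x_toy, y_toy)

# Use only the pairs where both values are present
r_pairwise <- cor(x_toy, y_toy, use = "pairwise.complete.obs")

# Manual equivalent: drop incomplete pairs first, then correlate
keep <- complete.cases(x_toy, y_toy)
r_manual <- cor(x_toy[keep], y_toy[keep])

print(r_default)
print(all.equal(r_pairwise, r_manual))
```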

Many ways to calculate correlation

  • Used in this course: Pearson product-moment correlation (r)

    • Most common

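    • $r = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2}\,\sqrt{\sum_i (y_i - \bar{y})^2}}$

    • $\bar{x}$ = mean of x, $\bar{y}$ = mean of y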

  • Variations on this formula:

    • Kendall's tau

    • Spearman's rho
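
As a rough illustration of the list above, the sketch below computes Pearson's r by hand from its definition and compares it with cor(), then asks cor() for the two variations through its method argument. The vectors x_toy and y_toy are made up for this example.

```r
# Made-up example vectors (hypothetical): y falls as x rises
x_toy <- c(2, 4, 5, 7, 9, 12)
y_toy <- c(10, 9, 7, 6, 3, 1)

# Pearson's r from its definition: deviations from the means
dx <- x_toy - mean(x_toy)
dy <- y_toy - mean(y_toy)
r_manual <- sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2))

# The same coefficient from cor(), whose default method is "pearson"
r_pearson <- cor(x_toy, y_toy)

# Rank-based variations
r_spearman <- cor(x_toy, y_toy, method = "spearman")
r_kendall  <- cor(x_toy, y_toy, method = "kendall")

# Spearman's rho is Pearson's r applied to the ranks
r_rank <- cor(rank(x_toy), rank(y_toy))

print(c(manual = r_manual, pearson = r_pearson))
print(c(spearman = r_spearman, kendall = r_kendall, on_ranks = r_rank))
```

Because y_toy decreases every time x_toy increases, both rank-based coefficients come out at -1 even though the relationship is not perfectly linear.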

Correlation caveats

Non-linear relationships

Correlation only accounts for linear relationships

Correlation shouldn't be used blindly

cor(df$x, df$y)
# 0.1786163

Always visualize your data
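
A small sketch of why the coefficient can mislead: in the made-up data below (x_toy and y_toy are assumptions, not the data behind the 0.1786163 above), y is completely determined by x, yet the correlation is essentially zero because the relationship is a symmetric curve rather than a line.

```r
# Made-up data (hypothetical): a perfect quadratic relationship
x_toy <- seq(-3, 3, by = 0.5)
y_toy <- x_toy^2

# The linear correlation is essentially zero despite the exact dependence
r_quad <- cor(x_toy, y_toy)
print(round(r_quad, 10))
```

A scatter plot of these two vectors would show the U shape immediately, which the single number hides.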

Mammal sleep data

msleep

Body weight vs. awake time

cor(msleep$bodywt, msleep$awake)
# 0.3119801

Distribution of body weight

Log transformation

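library(dplyr)
library(ggplot2)  # also provides the msleep dataset

# Store the log-transformed column so it can be reused below
msleep_log <- msleep %>%
    mutate(log_bodywt = log(bodywt))

ggplot(msleep_log, aes(log_bodywt, awake)) +
    geom_point() +
    geom_smooth(method = "lm", se = FALSE)

cor(msleep_log$log_bodywt, msleep_log$awake)
# 0.5687943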

Other transformations

  • Log transformation (log(x))

  • Square root transformation (sqrt(x))

  • Reciprocal transformation (1/x)

  • Combinations of these, e.g.:

    • log(x) and log(y)

    • sqrt(x) and 1/y
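
The sketch below illustrates the idea on synthetic data where the right transformation is known in advance. The vectors x_toy and y_toy are made up, and y is constructed to be exactly linear in log(x), so this is an idealized case rather than what real data will look like.

```r
# Synthetic data (hypothetical): y is exactly linear in log(x)
x_toy <- exp(seq(0, 5, length.out = 50))
y_toy <- 3 + 2 * log(x_toy)

# Correlation on the raw scale understates the relationship
r_raw <- cor(x_toy, y_toy)

# After the log transformation the relationship is linear
r_log <- cor(log(x_toy), y_toy)

print(c(raw = r_raw, log_transformed = r_log))
```

With real data the appropriate transformation is not known ahead of time, so plotting the variables before and after transforming is still the safest check.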

Why use a transformation?

  • Certain statistical methods rely on variables having a linear relationship

    • Correlation coefficient

    • Linear regression

  • Covered in more detail in the course Introduction to Regression in R

Correlation does not imply causation

"x is correlated with y" does not mean "x causes y"

Confounding
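
A simulated sketch of confounding, under stated assumptions: a hypothetical third variable z_conf drives both x_toy and y_toy, and neither of the two affects the other. All names, the sample size, and the seed are made up for illustration. The two variables still end up clearly correlated, and the correlation largely disappears once the linear effect of the confounder is removed from each.

```r
set.seed(42)  # arbitrary seed so the simulation is reproducible
n <- 5000

# Hypothetical confounder that influences both variables
z_conf <- rnorm(n)
x_toy  <- z_conf + rnorm(n)  # x depends on the confounder, not on y
y_toy  <- z_conf + rnorm(n)  # y depends on the confounder, not on x

# x and y are correlated even though neither causes the other
r_xy <- cor(x_toy, y_toy)

# Remove the linear effect of the confounder from each variable
x_res <- resid(lm(x_toy ~ z_conf))
y_res <- resid(lm(y_toy ~ z_conf))
r_adjusted <- cor(x_res, y_res)

print(c(raw = r_xy, adjusted_for_confounder = r_adjusted))
```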

Design of experiments

Vocabulary

Experiment aims to answer: What is the effect of the treatment on the response?

  • Treatment: explanatory/independent variable

  • Response: response/dependent variable

What is the effect of an advertisement on the number of products purchased?

  • Treatment: advertisement

  • Response: number of products purchased

Controlled experiments

  • Participants are assigned by researchers to either treatment group or control group

    • Treatment group sees advertisement

    • Control group does not

  • Groups should be comparable so that causation can be inferred

  • If groups are not comparable, this could lead to confounding (bias)

    • Treatment group average age: 25

    • Control group average age: 50

    • Age is a potential confounder

The gold standard of experiments will use...

  • Randomized controlled trial

    • Participants are assigned to treatment/control randomly, not based on any other characteristics

    • Choosing randomly helps ensure that groups are comparable

  • Placebo

    • Resembles treatment, but has no effect

    • Participants will not know which group they're in

    • In clinical trials, a sugar pill ensures that the effect of the drug is actually due to the drug itself and not the idea of receiving the drug

  • Double-blind trial

    • Person administering the treatment/running the study doesn't know whether the treatment is real or a placebo

    • Prevents bias in the response and/or analysis of results

Fewer opportunities for bias = more reliable conclusion about causation
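
Random assignment, as used in a randomized controlled trial, can be sketched in a few lines. The participants data frame, its columns, the age range, and the seed below are all hypothetical. Shuffling a balanced vector of labels gives equal group sizes, and comparing the mean age per group is one quick check that the groups look comparable.

```r
set.seed(1)  # arbitrary seed so the assignment is reproducible
n <- 200

# Hypothetical participants with a characteristic that could confound
participants <- data.frame(
    id  = seq_len(n),
    age = sample(18:70, n, replace = TRUE)
)

# Shuffle a balanced vector of labels: assignment ignores age entirely
participants$group <- sample(rep(c("treatment", "control"), each = n / 2))

# Check group sizes and whether average age is similar across groups
group_sizes <- table(participants$group)
mean_age <- tapply(participants$age, participants$group, mean)

print(group_sizes)
print(mean_age)
```

Randomization does not guarantee identical groups in any single draw, so a balance check like the one above is still worth running.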

Observational studies

  • Participants are not assigned randomly to groups

    • Participants assign themselves, usually based on pre-existing characteristics

  • Many research questions are not conducive to a controlled experiment

    • You can't force someone to smoke or have a disease

    • You can't make someone have certain past behavior

  • Establish association, not causation

    • Effects can be confounded by factors that got certain people into the control or treatment group

    • There are ways to control for confounders to get more reliable conclusions about association

Longitudinal vs. cross-sectional studies

Longitudinal study

  • Participants are followed over a period of time to examine the effect of treatment on response

  • Effect of age on height is not confounded by generation

  • More expensive, results take longer

Cross-sectional study

  • Data on participants is collected from a single snapshot in time

  • Effect of age on height is confounded by generation

  • Cheaper, faster, more convenient
