Correlation
Relationships between two variables
-
x = explanatory/independent variable
-
y = response/dependent variable
Correlation coefficient
-
Quantifies the linear relationship between two variables
-
Number between -1 and 1
-
Magnitude corresponds to strength of relationship
-
Sign (+ or -) corresponds to direction of relationship
Magnitude = strength of relationship
0.99 (very strong relationship)
0.75 (strong relationship)
0.56 (moderate relationship)
0.21 (weak relationship)
0.04 (no relationship)
-
Knowing the value of x doesn't tell us anything about y
Sign = direction
Visualizing relationships
ggplot(df, aes(x, y)) +
geom_point()
Adding a trendline
ggplot(df, aes(x, y)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE)
Computing correlation
cor(df$x, df$y)
# -0.7472765
cor(df$y, df$x)
# -0.7472765
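The calls above assume a data frame df with columns x and y that these notes never define. A minimal sketch with made-up values (hypothetical, chosen only to give a negative relationship) shows that cor() returns the same number whichever variable comes first:

```r
# Hypothetical data: y tends to fall as x rises (illustration only)
df <- data.frame(
  x = c(1, 2, 3, 4, 5, 6, 7, 8),
  y = c(9.1, 8.4, 8.9, 6.2, 5.8, 4.1, 4.4, 2.0)
)

r_xy <- cor(df$x, df$y)  # Pearson correlation of x with y
r_yx <- cor(df$y, df$x)  # same value: argument order does not matter

r_xy < 0                        # TRUE: negative sign, y decreases as x increases
isTRUE(all.equal(r_xy, r_yx))   # TRUE: correlation is symmetric
```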
Correlation with missing values
df$x
# -3.2508382 -9.1599807 3.4515013 4.1505899 NA 11.9806140
cor(df$x, df$y)
# NA
cor(df$x, df$y, use = "pairwise.complete.obs")
# -0.7471757
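As a rough illustration of what the use argument does, the sketch below uses two short made-up vectors (not the course data). For a single pair of vectors, "pairwise.complete.obs" amounts to dropping every row where either value is missing:

```r
# Hypothetical vectors with one missing x value (illustration only)
x <- c(-3.2, -9.1, 3.4, 4.1, NA, 11.9)
y <- c(4.0, 7.5, 1.2, 0.3, 2.2, -3.1)

r_default <- cor(x, y)  # NA: a single missing value propagates to the result
r_pair    <- cor(x, y, use = "pairwise.complete.obs")

# Equivalent by hand: keep only positions where both values are present
keep     <- complete.cases(x, y)
r_manual <- cor(x[keep], y[keep])

isTRUE(all.equal(r_pair, r_manual))  # TRUE
```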
Many ways to calculate correlation
-
Used in this course: Pearson product-moment correlation (r)
-
Most common
-
x(bar) = mean of x
-
-
Variations on this formula:
-
Kendall's tau
-
Spearman's rho
-
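The formula itself did not survive in these notes, so the following is only a sketch of the standard sample Pearson correlation, checked against cor() on made-up data. The method argument of cor() selects the Kendall and Spearman variants; the vectors and variable names are hypothetical.

```r
# Hypothetical data (illustration only)
x <- c(2, 4, 5, 7, 9, 12)
y <- c(10, 9, 7, 8, 4, 1)

n <- length(x)
# Pearson r: sum of products of deviations from the means,
# scaled by (n - 1) times the two sample standard deviations
r_manual <- sum((x - mean(x)) * (y - mean(y))) / ((n - 1) * sd(x) * sd(y))

r_pearson  <- cor(x, y)                       # method = "pearson" is the default
r_kendall  <- cor(x, y, method = "kendall")   # Kendall's tau (rank-based)
r_spearman <- cor(x, y, method = "spearman")  # Spearman's rho (rank-based)

# Spearman's rho is Pearson's r computed on the ranks of the data
r_rank <- cor(rank(x), rank(y))
```

Because Kendall's tau and Spearman's rho work on ranks, they measure monotonic rather than strictly linear association.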
Correlation caveats
Non-linear relationships
Correlation only accounts for linear relationships
Correlation shouldn't be used blindly
cor(df$x, df$y)
# 0.1786163
Always visualize your data
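A small made-up example of why the coefficient alone can mislead: below, y is completely determined by x, yet the linear correlation is essentially zero because the relationship is quadratic. The data are hypothetical and not the data behind the value above.

```r
# Hypothetical data: a perfect but non-linear (quadratic) relationship
x <- seq(-3, 3, by = 0.5)
y <- x^2

r <- cor(x, y)  # essentially 0, even though y is fully determined by x

# A scatter plot, e.g. plot(x, y), would reveal the U-shaped pattern
# that the coefficient misses.
```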
Mammal sleep data
msleep
Body weight vs. awake time
cor(msleep$bodywt, msleep$awake)
# 0.3119801
Distribution of body weight
Log transformation
msleep %>%
mutate(log_bodywt = log(bodywt)) %>%
ggplot(aes(log_bodywt, awake)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE)
cor(log(msleep$bodywt), msleep$awake)
# 0.5687943
Other transformations
-
Log transformation (log(x))
-
Square root transformation (sqrt(x))
-
Reciprocal transformation (1/x)
-
Combinations of these, e.g.:
-
log(x) and log(y)
-
sqrt(x) and 1/y
-
Why use a transformation?
-
Certain statistical methods rely on variables having a linear relationship
-
Correlation coefficient
-
Linear regression
-
-
Introduction to Regression in R
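To illustrate why a transformation can help, here is a sketch on simulated data (not msleep) where y depends linearly on log(x) rather than on x. The data-generating choices, seed and variable names are assumptions made for the example.

```r
set.seed(1)  # make the simulated noise reproducible

# Hypothetical data: y is linear in log(x), not in x itself
x <- exp(seq(0, 5, length.out = 50))
y <- 3 * log(x) + rnorm(50, sd = 0.5)

r_raw <- cor(x, y)       # weakened by the curved relationship
r_log <- cor(log(x), y)  # stronger once log(x) straightens the pattern

# Other candidate transformations can be compared in the same way
r_sqrt  <- cor(sqrt(x), y)
r_recip <- cor(1 / x, y)

r_log > r_raw  # TRUE for this simulated data
```

Which transformation works best depends on the shape of the data, so it is worth plotting each candidate before settling on one.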
Correlation does not imply causation
x being correlated with y does not mean that x causes y
Confounding
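Confounding can be sketched with a small simulation. Here a hypothetical third variable z drives both x and y, and x has no effect on y at all, yet the two are clearly correlated. The setup, seed and names are assumptions for illustration only.

```r
set.seed(42)  # reproducible simulation
n <- 5000

# Hypothetical setup: z is a confounder that drives both x and y;
# x has no direct effect on y
z <- rnorm(n)
x <- z + rnorm(n)
y <- z + rnorm(n)

r_xy <- cor(x, y)  # clearly positive despite no causal link from x to y

# Remove the part of x and y explained by z, then correlate what is left
r_partial <- cor(resid(lm(x ~ z)), resid(lm(y ~ z)))
```

Once z is accounted for, the remaining correlation between x and y is close to zero, which is what one would expect if the association came entirely from the confounder.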
Design of experiments
Vocabulary
An experiment aims to answer: What is the effect of the treatment on the response?
-
Treatment: explanatory/independent variable
-
Response: response/dependent variable
What is the effect of an advertisement on the number of products purchased?
-
Treatment: advertisement
-
Response: number of products purchased
Controlled experiments
-
Participants are assigned by researchers to either treatment group or control group
-
Treatment group sees advertisement
-
Control group does not
-
-
Groups should be comparable so that causation can be inferred
-
If groups are not comparable, this could lead to confounding (bias)
-
Treatment group average age: 25
-
Control group average age: 50
-
Age is a potential confounder
-
The gold standard of experiments will use...
-
Randomized controlled trial
-
Participants are assigned to treatment/control randomly, not based on any other characteristics
-
Choosing randomly helps ensure that groups are comparable
-
-
Placebo
-
Resembles treatment, but has no effect
-
Participants will not know which group they're in
-
In clinical trials, a sugar pill ensures that the effect of the drug is actually due to the drug itself and not the idea of receiving the drug
-
The gold standard of experiments will use...
-
Double-blind trial
-
Person administering the treatment/running the study doesn't know whether the treatment is real or a placebo
-
Prevents bias in the response and/or analysis of results
-
Fewer opportunities for bias = more reliable conclusion about causation
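The random assignment used in a randomized controlled trial can be sketched as follows. The participants, ages and group sizes are made up for illustration; the point is only that shuffling the group labels tends to balance characteristics such as age across groups.

```r
set.seed(123)  # reproducible assignment

# Hypothetical participants with ages between 18 and 70
participants <- data.frame(
  id  = 1:200,
  age = sample(18:70, 200, replace = TRUE)
)

# Shuffle the labels so assignment ignores every participant characteristic
participants$group <- sample(rep(c("treatment", "control"), each = 100))

# With random assignment, the mean ages of the two groups should be similar
mean_age <- tapply(participants$age, participants$group, mean)
mean_age
```

Compare this with the earlier example of group average ages of 25 and 50, where age could confound the result.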
Observational studies
-
Participants are not assigned randomly to groups
-
Participants assign themselves, usually based on pre-existing characteristics
-
-
Many research questions are not conducive to a controlled experiment
-
You can't force someone to smoke or have a disease
-
You can't make someone have certain past behavior
-
-
Establish association, not causation
-
Effects can be confounded by factors that got certain people into the control or treatment group
-
There are ways to control for confounders to get more reliable conclusions about association
-
Longitudinal vs. cross-sectional studies
Longitudinal study
-
Participants are followed over a period of time to examine the effect of treatment on response
-
Effect of age on height is not confounded by generation
-
More expensive, results take longer
Cross-sectional study
-
Data on participants is collected from a single snapshot in time
-
Effect of age on height is confounded by generation
-
Cheaper, faster, more convenient