Part 1
For each of the four datasets...
- Compute the mean and variance of both x and y
- Compute the correlation coefficient between x and y
- Compute the linear regression line: $y = \beta_0 + \beta_1 x + \epsilon$ (hint: use statsmodels and look at the Statsmodels notebook)
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols
anascombe = pd.read_csv('anscombe.csv')
print('the mean of x is', anascombe['x'].mean())
print('the variance of x is', anascombe['x'].var())  # .var() gives the sample variance; .std() is its square root
print('the mean of y is', anascombe['y'].mean())
print('the variance of y is', anascombe['y'].var())  # .var() gives the sample variance; .std() is its square root
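# Pearson correlation between the two columns only; this also avoids errors if the CSV has non-numeric columns
print('the correlation coefficient between x and y is', anascombe['x'].corr(anascombe['y']))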
# Regress y on x so the fitted coefficients correspond to y = β0 + β1x + ϵ
model = ols('y ~ x', anascombe).fit()
print(model.summary())
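The cell above summarizes the whole file at once, while the task asks for the statistics of each of the four datasets. A minimal sketch of a per-dataset version follows. It assumes the CSV has a grouping column named `dataset` alongside `x` and `y` (this column name is an assumption, so adjust it to match the file), and the helper name `summarize_by_dataset` is hypothetical. To keep the sketch dependent on pandas only, it uses the closed-form least-squares estimates, β1 = cov(x, y) / var(x) and β0 = mean(y) − β1·mean(x), instead of statsmodels; for simple linear regression these should agree with the `ols` fit.

```python
import pandas as pd


def summarize_by_dataset(df, group_col="dataset"):
    """Per-group mean, variance, correlation, and least-squares fit of y on x.

    Assumes columns named 'x', 'y', and `group_col` (hypothetical names).
    """
    rows = {}
    for name, g in df.groupby(group_col):
        # Closed-form simple linear regression of y on x
        slope = g["x"].cov(g["y"]) / g["x"].var()
        intercept = g["y"].mean() - slope * g["x"].mean()
        rows[name] = {
            "mean_x": g["x"].mean(),
            "var_x": g["x"].var(),  # sample variance (ddof=1)
            "mean_y": g["y"].mean(),
            "var_y": g["y"].var(),
            "corr_xy": g["x"].corr(g["y"]),  # Pearson correlation
            "beta0": intercept,
            "beta1": slope,
        }
    # One row per dataset, one column per statistic
    return pd.DataFrame.from_dict(rows, orient="index")
```

Calling `summarize_by_dataset(anascombe)` would return one row per dataset, which makes it easy to compare the summary statistics side by side.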
Part 2
Using Seaborn, visualize all four datasets.
hint: use sns.FacetGrid combined with plt.scatter
import pandas as pd
import matplotlib.pyplot as plt
anascombe = pd.read_csv('anscombe.csv')
f, ax = plt.subplots()
ax.scatter(anascombe['x'], anascombe['y'])
ax.set_xlabel('x')
ax.set_ylabel('y')
plt.show()