# Anscombe's quartet

Anscombe's quartet comprises four datasets and is rather famous. Why? You'll find out in this exercise.

In [4]:
anascombe = pd.read_csv('data/anscombe.csv')
anascombe.head()
Out[4]:
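|   | dataset | x  | y    |
|---|---------|----|------|
| 0 | I       | 10 | 8.04 |
| 1 | I       | 8  | 6.95 |
| 2 | I       | 13 | 7.58 |
| 3 | I       | 9  | 8.81 |
| 4 | I       | 11 | 8.33 |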

## Part 1

For each of the four datasets...

• Compute the mean and variance of both x and y
• Compute the correlation coefficient between x and y
• Compute the linear regression line: $y = \beta_0 + \beta_1 x + \epsilon$ (hint: use statsmodels and look at the Statsmodels notebook)
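
# Mean of x and y within each dataset
print(anascombe.groupby('dataset')['x'].mean())
print(anascombe.groupby('dataset')['y'].mean())

# Sample variance of x and y within each dataset
print(anascombe.groupby('dataset')['x'].var())
print(anascombe.groupby('dataset')['y'].var())

# Pearson correlation between x and y within each dataset
print(anascombe.groupby('dataset')['x'].corr(anascombe['y']))

# OLS fit of y on x for each dataset
# (sm is assumed to be statsmodels.api, imported in an earlier cell)
for name, group in anascombe.groupby('dataset'):
    x = sm.add_constant(group.x)  # adds the intercept column for β0
    model = sm.OLS(group.y, x).fit()
    print(model.summary())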
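
As a cross-check on the pandas and statsmodels output, the same quantities can be computed by hand. The sketch below is not part of the original exercise: it hard-codes the quartet's values as published by Anscombe (1973), on the assumption that `data/anscombe.csv` contains the same numbers, and uses the closed-form simple-regression estimates, slope = cov(x, y) / var(x) and intercept = mean(y) - slope * mean(x). The names `QUARTET`, `summarize`, and `SUMMARY` are illustrative only.

```python
from math import sqrt
from statistics import mean, variance

# Values transcribed from Anscombe (1973); assumed to match data/anscombe.csv.
X_SHARED = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]  # x for datasets I, II, III
QUARTET = {
    "I": (X_SHARED,
          [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (X_SHARED,
           [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (X_SHARED,
            [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def summarize(x, y):
    """Return means, sample variances, Pearson r, and OLS slope/intercept."""
    n = len(x)
    x_bar, y_bar = mean(x), mean(y)
    var_x, var_y = variance(x), variance(y)  # sample variance (n - 1)
    # Sample covariance, using the same n - 1 denominator as variance().
    cov_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / (n - 1)
    slope = cov_xy / var_x
    return {
        "mean_x": x_bar,
        "mean_y": y_bar,
        "var_x": var_x,
        "var_y": var_y,
        "r": cov_xy / sqrt(var_x * var_y),
        "slope": slope,
        "intercept": y_bar - slope * x_bar,
    }


SUMMARY = {name: summarize(x, y) for name, (x, y) in QUARTET.items()}

for name, stats in SUMMARY.items():
    print(name, {key: round(value, 3) for key, value in stats.items()})
```

All four datasets should report nearly the same summary: a mean of x of 9, a variance of x of 11, a mean of y of about 7.5, a correlation of about 0.816, and a fitted line of roughly y = 3 + 0.5x. That agreement is what the plots in Part 2 put into perspective.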

## Part 2

Using Seaborn, visualize all four datasets.

hint: use sns.FacetGrid combined with plt.scatter

g = sns.FacetGrid(anascombe, col="dataset", hue="dataset")
g.map(plt.scatter, "x", "y")