Part 1
(1) Compute the mean and variance of both x and y
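# `anascombe` is the DataFrame holding the quartet (columns: dataset, x, y)
Group = anascombe.groupby('dataset')
print(Group['x'].mean())
print(Group['y'].mean())
print(Group['x'].var())   # pandas .var() is the sample variance (ddof=1)
print(Group['y'].var())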
output
dataset
I      9.0
II     9.0
III    9.0
IV     9.0
Name: x, dtype: float64
dataset
I      7.500909
II     7.500909
III    7.500000
IV     7.500909
Name: y, dtype: float64
dataset
I      11.0
II     11.0
III    11.0
IV     11.0
Name: x, dtype: float64
dataset
I      4.127269
II     4.127629
III    4.122620
IV     4.123249
Name: y, dtype: float64
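As a cross-check on the pandas results, the statistics for dataset I can be recomputed with only the standard library. This is a sketch under assumptions: the x and y values are the standard published Anscombe dataset I, hard-coded because the loading step for `anascombe` is not shown here, and the names `x1` and `y1` are illustrative. `statistics.variance` is the sample variance (divisor n - 1), which is also the pandas default for `.var()`.

```python
import statistics

# Assumption: standard published values of Anscombe dataset I
x1 = [10.0, 8.0, 13.0, 9.0, 11.0, 14.0, 6.0, 4.0, 12.0, 7.0, 5.0]
y1 = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]

mean_x = statistics.mean(x1)
mean_y = statistics.mean(y1)
var_x = statistics.variance(x1)  # sample variance, divisor n - 1
var_y = statistics.variance(y1)

print(f"mean x = {mean_x:.6f}, mean y = {mean_y:.6f}")
print(f"var  x = {var_x:.6f}, var  y = {var_y:.6f}")
```

The printed values should agree with the dataset I rows of the output above.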
(2) Compute the correlation coefficient between x and y
print ( Group.corr() )
output
                  x         y
dataset
I       x  1.000000  0.816421
        y  0.816421  1.000000
II      x  1.000000  0.816237
        y  0.816237  1.000000
III     x  1.000000  0.816287
        y  0.816287  1.000000
IV      x  1.000000  0.816521
        y  0.816521  1.000000
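The off-diagonal entries above are Pearson correlation coefficients. A minimal sketch of that definition (sum of cross-deviations divided by the square root of the product of squared deviations) is given here, assuming the standard published Anscombe quartet values, which are hard-coded because the loading step is not shown. The helper `pearson_r` and the dictionary `quartet` are illustrative names, not part of the original notebook.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation computed directly from its definition."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    # The 1/(n - 1) factors of covariance and variances cancel out.
    return sxy / math.sqrt(sxx * syy)

# Assumption: standard published Anscombe quartet values
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]   # shared by datasets I-III
x4 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]
quartet = {
    'I':   (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    'II':  (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    'III': (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    'IV':  (x4,   [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

r = {name: pearson_r(xs, ys) for name, (xs, ys) in quartet.items()}
for name, value in r.items():
    print(f"{name}: r = {value:.6f}")
```

All four coefficients come out near 0.816, matching the table, even though the four scatter plots look very different.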
(3) Compute the linear regression line: y = β0 + β1x + ϵ (hint: use statsmodels and look at the Statsmodels notebook)
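For simple linear regression, the least-squares coefficients follow from the summary statistics alone: β1 = r * (s_y / s_x) and β0 = mean(y) - β1 * mean(x). The sketch below applies these identities to dataset I, using the values printed in parts (1) and (2), as a sanity check on what a fitted model should report. The variable names are illustrative.

```python
import math

# Summary statistics for dataset I, copied from the outputs of parts (1) and (2)
mean_x, mean_y = 9.0, 7.500909
var_x, var_y = 11.0, 4.127269
r_xy = 0.816421

# Slope: correlation times the ratio of standard deviations
beta1 = r_xy * math.sqrt(var_y / var_x)
# Intercept: the fitted line passes through the point of means
beta0 = mean_y - beta1 * mean_x

print(f"beta0 = {beta0:.4f}, beta1 = {beta1:.4f}")
```

The result is an intercept close to 3.0 and a slope close to 0.5. Because the four datasets share nearly identical means, variances, and correlations, the same line is expected for each of them.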
import statsmodels.formula.api as smf  # imported here in case no earlier cell did so

# Split the quartet into one DataFrame per dataset
dat1 = anascombe[anascombe['dataset'] == 'I'].reset_index(drop=True)
dat2 = anascombe[anascombe['dataset'] == 'II'].reset_index(drop=True)
dat3 = anascombe[anascombe['dataset'] == 'III'].reset_index(drop=True)
dat4 = anascombe[anascombe['dataset'] == 'IV'].reset_index(drop=True)

# Fit y = β0 + β1*x by ordinary least squares for each dataset
lin_model1 = smf.ols('y ~ x', dat1).fit()
print("For data set I:")
print(lin_model1.summary())
lin_model2 = smf.ols('y ~ x', dat2).fit()
print("\nFor data set II:")
print(lin_model2.summary())
lin_model3 = smf.ols('y ~ x', dat3).fit()
print("\nFor data set III:")
print(lin_model3.summary())
lin_model4 = smf.ols('y ~ x', dat4).fit()
print("\nFor data set I