This week covered the pandas library, along with a brief look at the Jupyter tool. Jupyter can be thought of as a live script editor: it lets you lay out a document while also recording the results of code execution.
Below is the exported result of this assignment.
%matplotlib inline
import random
import numpy as np
import scipy as sp
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm
import statsmodels.formula.api as smf
sns.set_context("talk")
Anscombe’s quartet
Anscombe’s quartet comprises four datasets and is rather famous. Why? You’ll find out in this exercise.
anascombe = pd.read_csv('data/anscombe.csv')
anascombe.head()
   dataset     x     y
0        I  10.0  8.04
1        I   8.0  6.95
2        I  13.0  7.58
3        I   9.0  8.81
4        I  11.0  8.33
Part 1
For each of the four datasets…
- Compute the mean and variance of both x and y
- Compute the correlation coefficient between x and y
- Compute the linear regression line:
y = β₀ + β₁x + ϵ
(hint: use statsmodels and look at the Statsmodels notebook)
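As a sanity check on the regression step, the full model with an intercept can also be computed in closed form. The sketch below hard-codes the x and y values of dataset I (an assumption made here for self-containment; the notebook itself reads them from data/anscombe.csv) and recovers the well-known Anscombe fit of slope ≈ 0.500 and intercept ≈ 3.00:

```python
import numpy as np

# Dataset I of Anscombe's quartet, hard-coded for illustration
# (the notebook reads these values from data/anscombe.csv).
x = np.array([10.0, 8.0, 13.0, 9.0, 11.0, 14.0, 6.0, 4.0, 12.0, 7.0, 5.0])
y = np.array([8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68])

# Closed-form ordinary least squares for y = b0 + b1*x + e:
#   b1 = cov(x, y) / var(x),  b0 = mean(y) - b1 * mean(x)
b1 = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
b0 = y.mean() - b1 * x.mean()
print('slope     b1 = {:.3f}'.format(b1))  # ~ 0.500
print('intercept b0 = {:.2f}'.format(b0))  # ~ 3.00
```

All four datasets of the quartet yield essentially the same slope and intercept, which is the point of the exercise.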
print('the mean of x and y for each dataset')
print(anascombe.groupby('dataset')[['x', 'y']].mean())
print('the variance of x and y for each dataset')
print(anascombe.groupby('dataset')[['x', 'y']].var())
print('the correlation coefficient between x and y for each dataset')
print(anascombe.groupby('dataset')[['x', 'y']].corr())
ds = ['I', 'II', 'III', 'IV']
for name in ds:
    subset = anascombe[anascombe.dataset == name]
    print('the linear regression of dataset {}'.format(name))
    # Note: this fits y = β₁x through the origin; wrap the regressor in
    # sm.add_constant() to estimate the intercept β₀ as well.
    print(sm.OLS(subset['y'], subset['x']).fit().summary())
the mean of x and y for each dataset
             x         y
dataset
I          9.0  7.500909
II         9.0  7.500909
III        9.0  7.500000
IV         9.0  7.500909
the variance of x and y for each dataset
              x         y
dataset
I          11.0  4.127269
II         11.0  4.127629
III        11.0  4.122620
IV         11.0  4.123249
the correlation coefficient between x and y for each dataset
                  x         y
dataset
I       x  1.000000  0.816421
        y  0.816421  1.000000
II      x  1.000000  0.816237
        y  0.816237  1.000000
III     x  1.000000  0.816287
        y  0.816287  1.000000
IV      x  1.000000  0.816521
        y  0.816521  1.000000
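The ≈ 0.8164 correlations in the table above can be verified by hand from the definition of the Pearson coefficient, r = cov(x, y) / (σₓ·σᵧ). The sketch below does this for dataset I, with the values hard-coded for self-containment (an assumption; the notebook reads them from data/anscombe.csv):

```python
import numpy as np

# Dataset I, hard-coded for illustration (normally read from data/anscombe.csv).
x = np.array([10.0, 8.0, 13.0, 9.0, 11.0, 14.0, 6.0, 4.0, 12.0, 7.0, 5.0])
y = np.array([8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68])

# Pearson correlation: sample covariance normalized by both standard deviations.
cov_xy = np.cov(x, y, ddof=1)[0, 1]
r = cov_xy / (x.std(ddof=1) * y.std(ddof=1))
print('r = {:.6f}'.format(r))  # ~ 0.816421, matching the table above
```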
the linear regression of dataset I
OLS Regression Results
==============================================================================
Dep. Variable: y R-squared: 0.963
Model: OLS Adj. R-squared: 0.959
Method: Least Squares F-statistic: 257.9
Date: Tue, 12 Jun 2018 Prob (F-statistic): 1.81e-08
Time: 18:15:41 Log-Likelihood: -20.044
No. Observations: 11 AIC: 42.09
Df Residuals: 10 BIC: 42.49
Df Model: 1
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
x 0.7968 0.050 16.059 0.000 0.686 0.907
==============================================================================
Omnibus: 1.171 Durbin-Watson: 2.491
Prob(Omnibus): 0.557 Jarque-Bera (JB): 0.684
Skew: -0.572 Prob(JB): 0.710
Kurtosis: 2.573 Cond. No. 1.00
==============================================================================
the linear regression of dataset II
OLS Regression Results
==============================================================================
Dep. Variable: y R-squared: 0.963
Model: OLS Adj. R-squared: 0.959
Method: Least Squares F-statistic: 257.7
Date: Tue, 12 Jun 2018 Prob (F-statistic): 1.82e-08
Time: 18:15:41 Log-Likelihood: -20.049
No. Observations: 11 AIC: 42.10
Df Residuals: 10 BIC: 42.50
Df Model: 1
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
x 0.7968 0.050 16.053 0.000 0.686 0.907
==============================================================================
Omnibus: 4.616 Durbin-Watson: 2.550
Prob(Omnibus): 0.099 Jarque-Bera (JB): 2.202
Skew: -1.093 Prob(JB): 0.333
Kurtosis: 3.153 Cond. No. 1.00
==============================================================================
the linear regression of dataset III
OLS Regression Results
==============================================================================
Dep. Variable: y R-squared: 0.963
Model: OLS Adj. R-squared: 0.959
Method: Least Squares F-statistic: 257.7
Date: Tue, 12 Jun 2018 Prob (F-statistic): 1.82e-08
Time: 18:15:41 Log-Likelihood: -20.047
No. Observations: 11 AIC: 42.09
Df Residuals: 10 BIC: 42.49
Df Model: 1
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
x 0.7967 0.050 16.053 0.000 0.686 0.907
==============================================================================
Omnibus: 0.727 Durbin-Watson: 1.874
Prob(Omnibus): 0.695 Jarque-Bera (JB): 0.614
Skew: -0.215 Prob(JB): 0.735
Kurtosis: 1.925 Cond. No. 1.00
==============================================================================
the linear regression of dataset IV
OLS Regression Results
==============================================================================
Dep. Variable: y R-squared: 0.963
Model: OLS Adj. R-squared: 0.959
Method: Least Squares F-statistic: 258.0
Date: Tue, 12 Jun 2018 Prob (F-statistic): 1.81e-08
Time: 18:15:41 Log-Likelihood: -20.043
No. Observations: 11 AIC: 42.09
Df Residuals: 10 BIC: 42.48
Df Model: 1
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
x 0.7968 0.050 16.062 0.000 0.686 0.907
==============================================================================
Omnibus: 0.522 Durbin-Watson: 0.947
Prob(Omnibus): 0.770 Jarque-Bera (JB): 0.468
Skew: -0.395 Prob(JB): 0.791
Kurtosis: 2.370 Cond. No. 1.00
==============================================================================
Part 2
Using Seaborn, visualize all four datasets.
hint: use sns.FacetGrid combined with plt.scatter
# hue= overlays all four datasets on one axes; col='dataset' would instead
# draw each dataset in its own panel. (size= was renamed height= in seaborn >= 0.9.)
g = sns.FacetGrid(anascombe, hue='dataset', size=5)
g.map(plt.scatter, 'x', 'y')
<seaborn.axisgrid.FacetGrid at 0x1861cb86160>
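If seaborn is unavailable, or you want one panel per dataset rather than an overlay, the same four-panel view can be sketched with plain matplotlib. The quartet values are hard-coded below for self-containment (an assumption; the notebook reads them from data/anscombe.csv), and the output filename is hypothetical:

```python
import matplotlib
matplotlib.use('Agg')  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt

# Anscombe's quartet, hard-coded for illustration
# (normally read from data/anscombe.csv).
x123 = [10.0, 8.0, 13.0, 9.0, 11.0, 14.0, 6.0, 4.0, 12.0, 7.0, 5.0]
data = {
    'I':   (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    'II':  (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    'III': (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    'IV':  ([8.0] * 7 + [19.0] + [8.0] * 3,
            [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

# One scatter panel per dataset, sharing axes so the shapes are comparable.
fig, axes = plt.subplots(2, 2, figsize=(8, 6), sharex=True, sharey=True)
for ax, (name, (xs, ys)) in zip(axes.flat, data.items()):
    ax.scatter(xs, ys)
    ax.set_title('dataset ' + name)
fig.tight_layout()
fig.savefig('anscombe_facets.png')  # hypothetical output path
```

Seen side by side, the four datasets with near-identical summary statistics look strikingly different, which is the lesson of the quartet.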