Problem
Link
Summary
Anscombe’s quartet
Anscombe’s quartet comprises four datasets and is rather famous. Why? You’ll find out in this exercise.
Part 1
For each of the four datasets…
Compute the mean and variance of both x and y
Compute the correlation coefficient between x and y
Compute the linear regression line: y = β₀ + β₁x + ϵ (hint: use statsmodels and look at the Statsmodels notebook)
Part 2
Using Seaborn, visualize all four datasets.
hint: use sns.FacetGrid combined with plt.scatter
Solution
# -*- coding: utf-8 -*-
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.formula.api as smf

sns.set_context("talk")

anscombe = pd.read_csv('data/anscombe.csv')
print(anscombe.head())

# Mean of x and y for each dataset
print('The mean of x and y: ')
print(anscombe.groupby(['dataset'])[['x', 'y']].mean())

# Variance of x and y for each dataset
print('\nThe variance of x and y: ')
print(anscombe.groupby(['dataset'])[['x', 'y']].var())

# Correlation matrix of x and y for each dataset
print('\nThe correlation coefficient between x and y: ')
print(anscombe.groupby(['dataset'])[['x', 'y']].corr())

# Fit y = b0 + b1*x by ordinary least squares, one dataset at a time
datasets = ['I', 'II', 'III', 'IV']
for dataset in datasets:
    lin_model = smf.ols('y ~ x', anscombe[anscombe['dataset'] == dataset]).fit()
    print('\nThe linear model for dataset', dataset)
    print(lin_model.summary())
    print('\n')

# One scatter panel per dataset
g = sns.FacetGrid(anscombe, col='dataset', hue="y")
g.map(plt.scatter, 'x', 'y')
plt.show()  # needed when running as a plain script rather than in an IDE console
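The loop in the solution prints a full summary for every dataset, which makes the four fits hard to compare at a glance. A possible alternative is to read the coefficients straight off each fitted result and collect them in one table. This is only a sketch: `fit_lines` is a made-up helper name, and it assumes a DataFrame with `dataset`, `x`, and `y` columns like `anscombe`. It also never calls `summary()`, whose residual diagnostics are, as far as I can tell, what triggers small-sample warnings such as the `kurtosistest` one.

```python
import pandas as pd
import statsmodels.formula.api as smf


def fit_lines(df):
    """Fit y ~ x separately for each dataset and return one row per dataset."""
    rows = {}
    for name, group in df.groupby("dataset"):
        result = smf.ols("y ~ x", data=group).fit()
        rows[name] = {
            "intercept": result.params["Intercept"],  # estimate of beta0
            "slope": result.params["x"],              # estimate of beta1
            "r_squared": result.rsquared,
        }
    # Keys of `rows` become the index, the inner dict keys become the columns
    return pd.DataFrame.from_dict(rows, orient="index")
```

With the data loaded as in the solution, `print(fit_lines(anscombe))` would give a four-row table that can be compared against the `coef` columns in the summaries.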
Results
runfile('E:/OLD F/Onedrive/Python Assignments/pandas_assignment.py', wdir='E:/OLD F/Onedrive/Python Assignments')
  dataset     x     y
0       I  10.0  8.04
1       I   8.0  6.95
2       I  13.0  7.58
3       I   9.0  8.81
4       I  11.0  8.33
The mean of x and y:
           x         y
dataset
I        9.0  7.500909
II       9.0  7.500909
III      9.0  7.500000
IV       9.0  7.500909
The variance of x and y:
            x         y
dataset
I        11.0  4.127269
II       11.0  4.127629
III      11.0  4.122620
IV       11.0  4.123249
The correlation coefficient between x and y:
                  x         y
dataset
I       x  1.000000  0.816421
        y  0.816421  1.000000
II      x  1.000000  0.816237
        y  0.816237  1.000000
III     x  1.000000  0.816287
        y  0.816287  1.000000
IV      x  1.000000  0.816521
        y  0.816521  1.000000
The linear model for dataset I
                            OLS Regression Results
==============================================================================
Dep. Variable:                        y   R-squared:                     0.667
Model:                              OLS   Adj. R-squared:                0.629
Method:                   Least Squares   F-statistic:                   17.99
Date:                  Wed, 13 Jun 2018   Prob (F-statistic):          0.00217
Time:                          16:46:40   Log-Likelihood:              -16.841
No. Observations:                    11   AIC:                           37.68
Df Residuals:                         9   BIC:                           38.48
Df Model:                             1
Covariance Type:              nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      3.0001      1.125      2.667      0.026       0.456       5.544
x              0.5001      0.118      4.241      0.002       0.233       0.767
==============================================================================
Omnibus:                          0.082   Durbin-Watson:                 3.212
Prob(Omnibus):                    0.960   Jarque-Bera (JB):              0.289
Skew:                            -0.122   Prob(JB):                      0.865
Kurtosis:                         2.244   Cond. No.                       29.1
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
The linear model for dataset II
                            OLS Regression Results
==============================================================================
Dep. Variable:                        y   R-squared:                     0.666
Model:                              OLS   Adj. R-squared:                0.629
Method:                   Least Squares   F-statistic:                   17.97
Date:                  Wed, 13 Jun 2018   Prob (F-statistic):          0.00218
Time:                          16:46:40   Log-Likelihood:              -16.846
No. Observations:                    11   AIC:                           37.69
Df Residuals:                         9   BIC:                           38.49
Df Model:                             1
Covariance Type:              nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      3.0009      1.125      2.667      0.026       0.455       5.547
x              0.5000      0.118      4.239      0.002       0.233       0.767
==============================================================================
Omnibus:                          1.594   Durbin-Watson:                 2.188
Prob(Omnibus):                    0.451   Jarque-Bera (JB):              1.108
Skew:                            -0.567   Prob(JB):                      0.575
Kurtosis:                         1.936   Cond. No.                       29.1
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
The linear model for dataset III
                            OLS Regression Results
==============================================================================
Dep. Variable:                        y   R-squared:                     0.666
Model:                              OLS   Adj. R-squared:                0.629
Method:                   Least Squares   F-statistic:                   17.97
Date:                  Wed, 13 Jun 2018   Prob (F-statistic):          0.00218
Time:                          16:46:40   Log-Likelihood:              -16.838
No. Observations:                    11   AIC:                           37.68
Df Residuals:                         9   BIC:                           38.47
Df Model:                             1
Covariance Type:              nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      3.0025      1.124      2.670      0.026       0.459       5.546
x              0.4997      0.118      4.239      0.002       0.233       0.766
==============================================================================
Omnibus:                         19.540   Durbin-Watson:                 2.144
Prob(Omnibus):                    0.000   Jarque-Bera (JB):             13.478
Skew:                             2.041   Prob(JB):                    0.00118
Kurtosis:                         6.571   Cond. No.                       29.1
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
The linear model for dataset IV
                            OLS Regression Results
==============================================================================
Dep. Variable:                        y   R-squared:                     0.667
Model:                              OLS   Adj. R-squared:                0.630
Method:                   Least Squares   F-statistic:                   18.00
Date:                  Wed, 13 Jun 2018   Prob (F-statistic):          0.00216
Time:                          16:46:40   Log-Likelihood:              -16.833
No. Observations:                    11   AIC:                           37.67
Df Residuals:                         9   BIC:                           38.46
Df Model:                             1
Covariance Type:              nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      3.0017      1.124      2.671      0.026       0.459       5.544
x              0.4999      0.118      4.243      0.002       0.233       0.766
==============================================================================
Omnibus:                          0.555   Durbin-Watson:                 1.662
Prob(Omnibus):                    0.758   Jarque-Bera (JB):              0.524
Skew:                             0.010   Prob(JB):                      0.769
Kurtosis:                         1.931   Cond. No.                       29.1
==============================================================================
Analysis
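The means, variances, and correlation coefficients of the four datasets are almost identical, yet their distributions are completely different.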
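This also explains why the four fitted lines coincide. For a simple least-squares fit of y on x, the slope and intercept depend only on the summary statistics: β1 = r · s_y / s_x and β0 = mean(y) − β1 · mean(x). The short check below plugs in the dataset I values printed in the results; the function name `line_from_summary` is only for illustration.

```python
import math


def line_from_summary(mean_x, mean_y, var_x, var_y, r):
    """Least-squares intercept and slope of y on x, from summary statistics alone."""
    slope = r * math.sqrt(var_y / var_x)   # beta1 = r * s_y / s_x
    intercept = mean_y - slope * mean_x    # the fitted line passes through (mean_x, mean_y)
    return intercept, slope


# Mean, variance, and correlation for dataset I, copied from the printed results
intercept, slope = line_from_summary(9.0, 7.500909, 11.0, 4.127269, 0.816421)
print(round(intercept, 4), round(slope, 4))
```

Both numbers agree with the `coef` column reported for dataset I (3.0001 and 0.5001). Since the other three datasets have nearly the same five statistics, they get nearly the same line, whatever their scatter looks like.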
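Analyzing data through individual summary statistics alone (mean, variance, correlation coefficient, and so on) can therefore be misleading, and it can also make patterns in the data hard to spot. Adding a visualization is well worth it and very effective.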
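One way to make the contrast visible is to draw every dataset together with its fitted line. The sketch below uses `sns.lmplot`, which is a swap for the `FacetGrid` plus `plt.scatter` combination used in the solution, not what the exercise hint asks for. `plot_quartet` is a hypothetical helper name, and the DataFrame is assumed to have the same `dataset`, `x`, and `y` columns.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: render off-screen; drop this line in an interactive session
import pandas as pd
import seaborn as sns


def plot_quartet(df):
    """Scatter each dataset in its own panel and overlay its least-squares line."""
    # ci=None leaves out the bootstrap confidence band so the panels stay uncluttered
    return sns.lmplot(data=df, x="x", y="y", col="dataset", col_wrap=2, ci=None)
```

Calling `plot_quartet(anscombe).savefig("anscombe.png")` would write the four panels to a file (the file name is arbitrary). The lines should look alike across panels while the point clouds do not.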
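A dataset that is too small also causes some problems. Here the output often contains warnings, and some of them are due to the small sample size, for example:
\stats.py:1334: UserWarning: kurtosistest only valid for n>=20 ... continuing anyway, n=11
  "anyway, n=%i" % int(n))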