This assignment involves the use of Jupyter Notebook.
Before starting, Jupyter Notebook must be installed; I installed it from the command line (cmd):
pip install jupyter notebook
Once installed, running jupyter notebook opens the default home page.
Save the notebook file into the folder served by the home page, and it can then be opened there for editing.
This assignment mainly uses three libraries: pandas, statsmodels, and seaborn.
pandas is a NumPy-based toolkit and one of the most powerful data-analysis and exploration tools for Python. It provides high-level data structures and convenient tools that make data handling in Python fast and simple; here we use it to process the data table (a .csv file in this assignment).
statsmodels focuses on statistical modeling and analysis. It interoperates with pandas data structures, and together with pandas it forms a powerful data-mining combination for Python.
seaborn is a Python data-visualization library built on matplotlib. It offers a higher-level API, which makes common plots quicker and easier to produce.
Now for the concrete implementation.
Header imports (given):
%matplotlib inline
import random
import numpy as np
import scipy as sp
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm
import statsmodels.formula.api as smf
sns.set_context("talk")
Note the "%matplotlib inline" statement used here.
%matplotlib inline is a magic command available in the IPython interpreter; it embeds plots directly in the notebook output, so the plt.show() step can be omitted.
Anscombe's quartet
Anscombe's quartet comprises four datasets and is rather famous. Why? You'll find out in this exercise.
Reading the file (given):
anascombe = pd.read_csv('data/anscombe.csv')
anascombe.head()
Part 1
For each of the four datasets...
- Compute the mean and variance of both x and y
- Compute the correlation coefficient between x and y
- Compute the linear regression line: y=β0+β1x+ϵ (hint: use statsmodels and look at the Statsmodels notebook)
First, the code for the first two tasks:
As required, these statistics are computed within each of the four groups, so we use groupby to split the data by dataset:
print("The mean of x and y: ")
print(anascombe.groupby('dataset').mean(), end='\n\n')
print("The variance of x and y: ")
print(anascombe.groupby('dataset').var(), end='\n\n')
print("The correlation coefficient between x and y: ")
print(anascombe.groupby('dataset').corr(), end='\n\n')
The mean of x and y:
           x         y
dataset
I        9.0  7.500909
II       9.0  7.500909
III      9.0  7.500000
IV       9.0  7.500909

The variance of x and y:
            x         y
dataset
I        11.0  4.127269
II       11.0  4.127629
III      11.0  4.122620
IV       11.0  4.123249

The correlation coefficient between x and y:
                  x         y
dataset
I       x  1.000000  0.816421
        y  0.816421  1.000000
II      x  1.000000  0.816237
        y  0.816237  1.000000
III     x  1.000000  0.816287
        y  0.816287  1.000000
IV      x  1.000000  0.816521
        y  0.816521  1.000000
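The same statistics can also be gathered in a single agg call. A minimal sketch, with dataset I of the quartet inlined as stand-in data so it runs without data/anscombe.csv:

```python
import pandas as pd

# Stand-in frame in the same shape as anscombe.csv (columns: dataset, x, y);
# the values are dataset I of the quartet.
df = pd.DataFrame({
    "dataset": ["I"] * 11,
    "x": [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5],
    "y": [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26,
          10.84, 4.82, 5.68],
})

# Mean and variance of x and y in one pass
stats = df.groupby("dataset")[["x", "y"]].agg(["mean", "var"])
print(stats)

# Pairwise correlations per group; the x-y entry is the coefficient we want
r = df.groupby("dataset")[["x", "y"]].corr()
print(r.loc[("I", "x"), "y"])
```

Selecting the columns with groupby(...)[["x", "y"]] restricts the computation to the numeric columns, which also avoids warnings on newer pandas versions.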
For the linear regression (ordinary least squares), there are two possible approaches:
1) R-style formulas -- smf.ols
2) numpy arrays -- sm.OLS
See the official statsmodels documentation on ordinary least squares for details.
In my code, pandas.Series.unique() lists the distinct dataset labels:
sets = anascombe.dataset.unique()
for name in sets:                       # avoid shadowing the built-in `set`
    print("\nFor data set " + name + ": ")
    x = anascombe[anascombe['dataset'] == name].x
    X = sm.add_constant(x)              # add the intercept column
    y = anascombe[anascombe['dataset'] == name].y
    linear = sm.OLS(y, X).fit()
    print(linear.summary())
For all four datasets the regression output is essentially identical (full OLS summaries omitted; key figures below):

Dataset   const (β0)   x (β1)   R-squared   Prob (F-statistic)
I         3.0001       0.5001   0.667       0.00217
II        3.0009       0.5000   0.666       0.00218
III       3.0025       0.4997   0.666       0.00218
IV        3.0017       0.4999   0.667       0.00216

The residual diagnostics, however, differ sharply (e.g. Durbin-Watson ranges from 1.662 for IV to 3.212 for I, and dataset III shows a strong skew of 2.041), a first hint that the four datasets are not alike.
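The first approach above, R-style formulas via smf.ols, reaches the same fit with less setup, since the intercept term is added automatically. A minimal sketch with dataset I inlined so it runs without the csv file:

```python
import pandas as pd
import statsmodels.formula.api as smf

# Dataset I of the quartet
df = pd.DataFrame({
    "x": [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5],
    "y": [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26,
          10.84, 4.82, 5.68],
})

# "y ~ x" fits y = b0 + b1*x; the intercept is implicit in the formula
model = smf.ols("y ~ x", data=df).fit()
print(model.params)    # Intercept close to 3.0, slope close to 0.5
print(model.rsquared)
```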
Part 2
Using Seaborn, visualize all four datasets.
hint: use sns.FacetGrid combined with plt.scatter
I use the seaborn.FacetGrid function to draw the plots:
g = sns.FacetGrid(anascombe, col="dataset", hue="dataset")
g.map(plt.scatter, "x", "y")
...
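seaborn can also draw the fitted regression line in each panel: lmplot is essentially FacetGrid combined with regplot. A sketch, with the first four points of datasets I and II inlined as stand-in data:

```python
import pandas as pd
import seaborn as sns

# Stand-in for anscombe.csv: the first four points of datasets I and II
df = pd.DataFrame({
    "dataset": ["I"] * 4 + ["II"] * 4,
    "x": [10, 8, 13, 9, 10, 8, 13, 9],
    "y": [8.04, 6.95, 7.58, 8.81, 9.14, 8.14, 8.74, 8.77],
})

# One panel per dataset: scatter points plus the fitted line
g = sns.lmplot(data=df, x="x", y="y", col="dataset")
```

With the full quartet, the four nearly identical regression lines over four very different point clouds are exactly what makes Anscombe's quartet famous.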