【CME 193】 ipynb

Exercises.ipynb

This assignment involves using Jupyter Notebook.

Jupyter Notebook must be installed first; I installed it from the command line (cmd):

pip install jupyter notebook

After a successful install, running jupyter notebook opens the default home page.

Save the notebook file into a folder under that home directory, and you can open and edit it from the home page.


This assignment mainly uses three libraries: pandas, statsmodels, and seaborn.

pandas is a NumPy-based tool and one of the most powerful data analysis and exploration libraries for Python. It provides high-level data structures and convenient utilities that make working with data fast and simple. Here we mainly use it to handle tabular data (a .csv file in this assignment).

statsmodels focuses on statistical modeling and analysis. It interoperates with pandas, and together they form a powerful data-analysis combination in Python.

Seaborn is a Python data visualization library built on top of matplotlib; it provides a higher-level API that is more convenient to use.
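If any of them are missing, they can be installed with pip in the same way as Jupyter above, for example:

pip install pandas statsmodels seaborn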


Now let's look at the implementation.

The import/setup cell (provided):

%matplotlib inline

import random

import numpy as np
import scipy as sp
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

import statsmodels.api as sm
import statsmodels.formula.api as smf

sns.set_context("talk")

Note the use of the %matplotlib inline magic here.

%matplotlib inline can be used directly in IPython/Jupyter; it embeds plots inline in the notebook output, so the plt.show() step can be omitted.



Anscombe's quartet

Anscombe's quartet comprises four datasets and is rather famous. Why? You'll find out in this exercise.

Reading the file (provided):

anascombe = pd.read_csv('data/anscombe.csv')
anascombe.head()


Part 1

For each of the four datasets...

  • Compute the mean and variance of both x and y
  • Compute the correlation coefficient between x and y
  • Compute the linear regression line: y = β₀ + β₁x + ε (hint: use statsmodels and look at the Statsmodels notebook)

First, the code for the first two items:

As the problem requires, these statistics are computed within each dataset, so we use pandas' groupby to split the rows by the dataset column:


print ("The mean of x and y: ")
print (anascombe.groupby('dataset').mean(),end = '\n\n')

print ("The variance of x and y: ")
print (anascombe.groupby('dataset').var(),end = '\n\n')

print ("The correlation coefficient between x and y: ")
print (anascombe.groupby('dataset').corr(),end = '\n\n')
The mean of x and y: 
           x         y
dataset               
I        9.0  7.500909
II       9.0  7.500909
III      9.0  7.500000
IV       9.0  7.500909

The variance of x and y: 
            x         y
dataset                
I        11.0  4.127269
II       11.0  4.127629
III      11.0  4.122620
IV       11.0  4.123249

The correlation coefficient between x and y: 
                  x         y
dataset                      
I       x  1.000000  0.816421
        y  0.816421  1.000000
II      x  1.000000  0.816237
        y  0.816237  1.000000
III     x  1.000000  0.816287
        y  0.816287  1.000000
IV      x  1.000000  0.816521
        y  0.816521  1.000000
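The grouped corr() call prints the full 2x2 correlation matrix for every dataset. If only the single x-y coefficient is wanted, a small sketch of one way to extract it (not required by the assignment) is:

# x-y correlation per dataset, as a plain Series instead of a matrix
xy_corr = anascombe.groupby('dataset').apply(lambda g: g['x'].corr(g['y']))
print(xy_corr)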

For the linear regression (ordinary least squares), statsmodels offers two interfaces:

1) R-style formulas: smf.ols

2) NumPy arrays: sm.OLS

See the official statsmodels documentation on ordinary least squares for details.
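The formula interface is not used in my solution below, but as a rough sketch (assuming the same anascombe DataFrame, with 'I' as an example subset) it would look something like this; the intercept is added automatically:

# R-style formula interface (sketch): fit y ~ x on one of the four datasets
subset = anascombe[anascombe['dataset'] == 'I']
model = smf.ols('y ~ x', data=subset).fit()
print(model.summary())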

In my own solution I use pandas.Series.unique() to list the distinct dataset labels, then fit sm.OLS within each group:

sets = anascombe.dataset.unique()
for name in sets:  # 'name' rather than 'set', to avoid shadowing the built-in
    print("\nFor data set " + name + ": ")
    x = anascombe[anascombe['dataset'] == name].x
    X = sm.add_constant(x)  # add the intercept column
    y = anascombe[anascombe['dataset'] == name].y
    linear = sm.OLS(y, X).fit()
    print(linear.summary())
For data set I: 
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                      y   R-squared:                       0.667
Model:                            OLS   Adj. R-squared:                  0.629
Method:                 Least Squares   F-statistic:                     17.99
Date:                Tue, 12 Jun 2018   Prob (F-statistic):            0.00217
Time:                        14:04:36   Log-Likelihood:                -16.841
No. Observations:                  11   AIC:                             37.68
Df Residuals:                       9   BIC:                             38.48
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          3.0001      1.125      2.667      0.026       0.456       5.544
x              0.5001      0.118      4.241      0.002       0.233       0.767
==============================================================================
Omnibus:                        0.082   Durbin-Watson:                   3.212
Prob(Omnibus):                  0.960   Jarque-Bera (JB):                0.289
Skew:                          -0.122   Prob(JB):                        0.865
Kurtosis:                       2.244   Cond. No.                         29.1
==============================================================================

For data set II: 
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                      y   R-squared:                       0.666
Model:                            OLS   Adj. R-squared:                  0.629
Method:                 Least Squares   F-statistic:                     17.97
Date:                Tue, 12 Jun 2018   Prob (F-statistic):            0.00218
Time:                        14:04:36   Log-Likelihood:                -16.846
No. Observations:                  11   AIC:                             37.69
Df Residuals:                       9   BIC:                             38.49
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          3.0009      1.125      2.667      0.026       0.455       5.547
x              0.5000      0.118      4.239      0.002       0.233       0.767
==============================================================================
Omnibus:                        1.594   Durbin-Watson:                   2.188
Prob(Omnibus):                  0.451   Jarque-Bera (JB):                1.108
Skew:                          -0.567   Prob(JB):                        0.575
Kurtosis:                       1.936   Cond. No.                         29.1
==============================================================================

For data set III: 
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                      y   R-squared:                       0.666
Model:                            OLS   Adj. R-squared:                  0.629
Method:                 Least Squares   F-statistic:                     17.97
Date:                Tue, 12 Jun 2018   Prob (F-statistic):            0.00218
Time:                        14:04:36   Log-Likelihood:                -16.838
No. Observations:                  11   AIC:                             37.68
Df Residuals:                       9   BIC:                             38.47
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          3.0025      1.124      2.670      0.026       0.459       5.546
x              0.4997      0.118      4.239      0.002       0.233       0.766
==============================================================================
Omnibus:                       19.540   Durbin-Watson:                   2.144
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               13.478
Skew:                           2.041   Prob(JB):                      0.00118
Kurtosis:                       6.571   Cond. No.                         29.1
==============================================================================

For data set IV: 
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                      y   R-squared:                       0.667
Model:                            OLS   Adj. R-squared:                  0.630
Method:                 Least Squares   F-statistic:                     18.00
Date:                Tue, 12 Jun 2018   Prob (F-statistic):            0.00216
Time:                        14:04:36   Log-Likelihood:                -16.833
No. Observations:                  11   AIC:                             37.67
Df Residuals:                       9   BIC:                             38.46
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          3.0017      1.124      2.671      0.026       0.459       5.544
x              0.4999      0.118      4.243      0.002       0.233       0.766
==============================================================================
Omnibus:                        0.555   Durbin-Watson:                   1.662
Prob(Omnibus):                  0.758   Jarque-Bera (JB):                0.524
Skew:                           0.010   Prob(JB):                        0.769
Kurtosis:                       1.931   Cond. No.                         29.1
==============================================================================
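If only the fitted intercept and slope are needed rather than the whole summary table, they can be read from the fitted results object; a small sketch (here linear is simply whatever fit was produced last in the loop above):

# params is indexed by ['const', 'x'], i.e. (beta_0, beta_1); rsquared is the R^2 of the fit
print(linear.params)
print(linear.rsquared)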



Part 2

Using Seaborn, visualize all four datasets.

hint: use sns.FacetGrid combined with plt.scatter


I use seaborn.FacetGrid for plotting; see its documentation for details.

g = sns.FacetGrid(anascombe, col="dataset", hue="dataset")
g.map(plt.scatter, "x", "y")  
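
The scatter plots show why the quartet is famous: the four datasets share nearly identical means, variances, correlations, and regression lines, yet look completely different. As an optional follow-up sketch (not part of the assignment), seaborn's lmplot can draw each panel together with its fitted regression line:

# one panel per dataset, each with its own fitted line; ci=None hides the confidence band
sns.lmplot(data=anascombe, x="x", y="y", col="dataset", hue="dataset", ci=None)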
   

2018.06.12

...
