(Week 14)Python-pandas_exercises

cme193-ipython-notebooks-lecture

These exercises come from CME 193. The original problems and introductions to the relevant libraries can be found in cme193-ipython-notebooks-lecture.

Part 0

Load the dataset

import pandas as pd
anascombe = pd.read_csv('anscombe.csv')
anascombe.head()
  dataset     x  y
0       I  10.0
1       I   8.0
2       I  13.0
3       I   9.0
4       I  11.0

Part 1

For each of the four datasets…

  • Compute the mean and variance of both x and y
  • Compute the correlation coefficient between x and y
  • Compute the linear regression line: $y = \beta_0 + \beta_1 x + \epsilon$ (hint: use statsmodels and look at the Statsmodels notebook)

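Solution:

  • Call mean() and var() to compute the mean and variance of x and y in each dataset

  • Call corr() to compute the correlation coefficient between x and y within each dataset

  • Call fit() on the ols object from statsmodels to fit a linear regression model, then read its parameters to get the regression equation

The code is as follows: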

import pandas as pd
import statsmodels.formula.api as smf

anascombe = pd.read_csv('anscombe.csv')
# Select several columns from a groupby with a list ([['x', 'y']]);
# the bare form ['x', 'y'] is rejected by recent pandas versions.
print('mean of both x and y:')
print(anascombe.groupby('dataset')[['x', 'y']].mean(), '\n')
print('variance of both x and y:')
print(anascombe.groupby('dataset')[['x', 'y']].var(), '\n')
print('correlation coefficient between x and y:')
print(anascombe.groupby('dataset')[['x', 'y']].corr(), '\n')
for i in range(0, 4):
    # Assumes each dataset occupies 11 consecutive rows, in the order I, II, III, IV
    dataset = anascombe[i * 11: (i + 1) * 11]
    lin_model = smf.ols('y ~ x', dataset).fit()
    params = lin_model.params   # fitted parameters: Intercept and the slope x
    print('the linear regression line of dataset ' + str(i) + ':')
    print('y = ' + str(params.x) + 'x + ' + str(params.Intercept))

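The output is as follows (the means, variances, correlation coefficients, and regression lines are nearly identical across the four datasets):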

mean of both x and y:
           x         y
dataset               
I        9.0  7.500909
II       9.0  7.500909
III      9.0  7.500000
IV       9.0  7.500909 

variance of both x and y:
            x         y
dataset                
I        11.0  4.127269
II       11.0  4.127629
III      11.0  4.122620
IV       11.0  4.123249 

correlation coefficient between x and y:
                  x         y
dataset                      
I       x  1.000000  0.816421
        y  0.816421  1.000000
II      x  1.000000  0.816237
        y  0.816237  1.000000
III     x  1.000000  0.816287
        y  0.816287  1.000000
IV      x  1.000000  0.816521
        y  0.816521  1.000000 

the linear regression line of dataset 0:
y = 0.500090909091x + 3.00009090909
the linear regression line of dataset 1:
y = 0.5x + 3.00090909091
the linear regression line of dataset 2:
y = 0.499727272727x + 3.00245454545
the linear regression line of dataset 3:
y = 0.499909090909x + 3.00172727273
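
As a cross-check on the statsmodels fit, the same quantities can be computed from the textbook closed-form formulas: slope = S_xy / S_xx and intercept = mean(y) - slope * mean(x). The sketch below does this in plain Python for dataset I only. The x and y lists are an assumption: they are the commonly published values of Anscombe's dataset I, which anscombe.csv is expected to contain. All variable names are illustrative.

```python
from math import sqrt

# Assumed values: dataset I of Anscombe's quartet as commonly published;
# anscombe.csv is expected to hold the same 11 points.
x = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n

# Sums of squared deviations and of cross-deviations
sxx = sum((xi - mean_x) ** 2 for xi in x)
syy = sum((yi - mean_y) ** 2 for yi in y)
sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))

# Sample variance (divide by n - 1), which is also the pandas default for var()
var_x = sxx / (n - 1)
var_y = syy / (n - 1)

# Pearson correlation and ordinary least-squares coefficients
corr = sxy / sqrt(sxx * syy)
slope = sxy / sxx
intercept = mean_y - slope * mean_x

print(f"mean:      x = {mean_x:.6f}, y = {mean_y:.6f}")
print(f"variance:  x = {var_x:.6f}, y = {var_y:.6f}")
print(f"corr(x, y) = {corr:.6f}")
print(f"y = {slope:.6f}x + {intercept:.6f}")
```

If the assumed values match the CSV, the printed numbers should agree with the dataset I rows in the output above (slope close to 0.5001, intercept close to 3.0001).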

Part 2

Using Seaborn, visualize all four datasets.

hint: use sns.FacetGrid combined with plt.scatter

import matplotlib.pyplot as plt
import seaborn as sns
g = sns.FacetGrid(anascombe, col="dataset",  hue="y") # one panel per dataset; hue="y" gives each distinct y value its own color
g = g.map(plt.scatter, "x", "y", edgecolor="w")
g.savefig('a.png')

The resulting figure is saved to a.png.
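
Because the four regression lines are nearly the same, it can help to draw the fitted line in each panel and compare it with the points. Here is a minimal sketch that swaps in sns.lmplot for the FacetGrid plus plt.scatter approach from the hint; lmplot draws a scatter plot and a least-squares line per facet. The data is embedded so the snippet runs without anscombe.csv, and these values are an assumption (the commonly published Anscombe's quartet). The variable names and the output file name anscombe_lmplot.png are hypothetical.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen so the script also runs without a display

import pandas as pd
import seaborn as sns

# Assumed values: Anscombe's quartet as commonly published.
x_shared = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]   # x for datasets I, II, III
x_iv = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]           # x for dataset IV
quartet = {
    "I":   (x_shared, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II":  (x_shared, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x_shared, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV":  (x_iv,     [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

# Build one long-format frame with the same columns as anscombe.csv
frames = [
    pd.DataFrame({"dataset": name, "x": xs, "y": ys})
    for name, (xs, ys) in quartet.items()
]
df = pd.concat(frames, ignore_index=True)

# One panel per dataset; ci=None skips the bootstrap confidence band
g = sns.lmplot(data=df, x="x", y="y", col="dataset", ci=None, height=3)
g.savefig("anscombe_lmplot.png")
```

With the fitted line drawn, the four panels look very different even though their summary statistics nearly coincide, which is the point of the exercise.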
