(Week 14)Python-pandas_exercises


茵茵的聪聪

Article link: https://blog.csdn.net/qq_36153312/article/details/80668319

These exercises come from [CME 193]; the original problems and an introduction to the relevant libraries are in cme193-ipython-notebooks-lecture.

Part 0

Load the dataset

import pandas as pd
anascombe = pd.read_csv('anscombe.csv')
anascombe.head()

dataset	x	y
0	I	10.0
1	I	8.0
2	I	13.0
3	I	9.0
4	I	11.0

Part 1

For each of the four datasets…

Compute the mean and variance of both x and y
Compute the correlation coefficient between x and y
Compute the linear regression line: $y=β_0+β_1x+ϵ$ (hint: use statsmodels and look at the Statsmodels notebook)

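Solution:

Call mean() and var() to compute the mean and variance of x and y in each dataset
Call corr() to compute the correlation coefficient between x and y in each dataset
Call fit() on a statsmodels ols model to fit the linear regression, then read its parameters to obtain the regression line

The code is as follows: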

import pandas as pd
import statsmodels.formula.api as smf

anascombe = pd.read_csv('anscombe.csv')
# Select the columns with a list; newer pandas rejects a bare tuple here
print('mean of both x and y:')
print(anascombe.groupby('dataset')[['x', 'y']].mean(), '\n')
print('variance of both x and y:')
print(anascombe.groupby('dataset')[['x', 'y']].var(), '\n')
print('correlation coefficient between x and y:')
print(anascombe.groupby('dataset')[['x', 'y']].corr(), '\n')
for i in range(0, 4):
    # Assumes each dataset occupies 11 consecutive rows, in order I to IV
    dataset = anascombe[i * 11: (i + 1) * 11]
    lin_model = smf.ols('y ~ x', dataset).fit()
    params = lin_model.params   # fitted parameters: the intercept (Intercept) and the slope (x)
    print('the linear regression line of dataset ' + str(i) + ':')
    print('y = ' + str(params.x) + 'x + ' + str(params.Intercept))
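
A side note on the loop above: slicing by row position only works if the file lists exactly 11 rows per dataset, in order. A minimal sketch of an alternative is to iterate over groupby, which pairs each label with its own rows. The toy frame and the fit_lines helper below are hypothetical, and np.polyfit stands in for the statsmodels fit only to keep the sketch short; it is not the method used in this post.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for anscombe.csv (not the real values).
# The rows are deliberately interleaved to show that order does not matter.
toy = pd.DataFrame({
    "dataset": ["A", "B", "A", "B", "A", "B"],
    "x": [1.0, 1.0, 2.0, 2.0, 3.0, 3.0],
    "y": [3.0, 0.5, 5.0, 1.0, 7.0, 1.5],
})

def fit_lines(df):
    """Return {label: (slope, intercept)} for each group in the 'dataset' column."""
    lines = {}
    for label, group in df.groupby("dataset"):
        # A degree-1 polynomial fit is an ordinary least-squares line.
        slope, intercept = np.polyfit(group["x"], group["y"], deg=1)
        lines[label] = (slope, intercept)
    return lines

lines = fit_lines(toy)
for label, (slope, intercept) in lines.items():
    print(f"dataset {label}: y = {slope:.3f}x + {intercept:.3f}")
```

With the real file, the same loop would label each line by the dataset name (I to IV) rather than by a positional index.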

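The solution code above prints the following (the means, variances, correlation coefficients, and regression lines are nearly identical across the four datasets):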

mean of both x and y:
           x         y
dataset               
I        9.0  7.500909
II       9.0  7.500909
III      9.0  7.500000
IV       9.0  7.500909 

variance of both x and y:
            x         y
dataset                
I        11.0  4.127269
II       11.0  4.127629
III      11.0  4.122620
IV       11.0  4.123249 

correlation coefficient between x and y:
                  x         y
dataset                      
I       x  1.000000  0.816421
        y  0.816421  1.000000
II      x  1.000000  0.816237
        y  0.816237  1.000000
III     x  1.000000  0.816287
        y  0.816287  1.000000
IV      x  1.000000  0.816521
        y  0.816521  1.000000 

the linear regression line of dataset 0:
y = 0.500090909091x + 3.00009090909
the linear regression line of dataset 1:
y = 0.5x + 3.00090909091
the linear regression line of dataset 2:
y = 0.499727272727x + 3.00245454545
the linear regression line of dataset 3:
y = 0.499909090909x + 3.00172727273
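
The near-identical regression lines are no surprise given the near-identical summary statistics: for simple linear regression the slope is cov(x, y) / var(x) and the intercept is mean(y) - slope * mean(x). The sketch below checks this for dataset I with the standard library only. The ols_line helper is a hypothetical name, and the values are the standard published Anscombe dataset I, assumed here to match anscombe.csv.

```python
from statistics import mean, variance

# Anscombe's dataset I (standard published values; assumed to match anscombe.csv).
x = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]

def ols_line(xs, ys):
    """Closed-form simple linear regression: returns (slope, intercept)."""
    mx, my = mean(xs), mean(ys)
    # Sample covariance, using the same n - 1 denominator as variance().
    cov_xy = sum((a - mx) * (b - my) for a, b in zip(xs, ys)) / (len(xs) - 1)
    slope = cov_xy / variance(xs)
    intercept = my - slope * mx
    return slope, intercept

slope, intercept = ols_line(x, y)
print(f"y = {slope:.4f}x + {intercept:.4f}")
```

The result should agree with the statsmodels line printed for dataset 0 above, up to rounding.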

Part 2

Using Seaborn, visualize all four datasets.

hint: use sns.FacetGrid combined with plt.scatter

import matplotlib.pyplot as plt
import seaborn as sns
g = sns.FacetGrid(anascombe, col="dataset",  hue="y")  # one panel per dataset; hue="y" colors each point by its y value
g = g.map(plt.scatter, "x", "y", edgecolor="w")
g.savefig('a.png')