pandas Exercises

odls

Tags: Advanced Programming Techniques course assignment

Article link: https://blog.csdn.net/cuidiji6092/article/details/80671447

Source of the exercises:

https://nbviewer.jupyter.org/github/schmit/cme193-ipython-notebooks-lecture/blob/master/Exercises.ipynb

Anscombe's quartet

Anscombe's quartet comprises four datasets, and is rather famous. Why? You'll find out in this exercise.

Initial data:

Part 1

For each of the four datasets...

Compute the mean and variance of both x and y
Compute the correlation coefficient between x and y
Compute the linear regression line: y = β0 + β1x + ϵ (hint: use statsmodels and look at the Statsmodels notebook)
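
For reference, the coefficients of a one-variable least-squares fit have a closed form, so the regression step can be checked by hand against the other statistics. The symbols r (the sample correlation coefficient) and s_x, s_y (the sample standard deviations) are notation introduced here, not taken from the exercise:

```latex
\hat{\beta}_1
  = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}
  = r \, \frac{s_y}{s_x},
\qquad
\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}
```

Because the fitted line depends only on the means, the variances, and the correlation, any two datasets that agree on those quantities must also agree on the regression line.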

Results:

means:

variance:

correlation coefficient:

model:

Part 2

Using Seaborn, visualize all four datasets.

hint: use sns.FacetGrid combined with plt.scatter

Code:

import random

import numpy as np
import scipy as sp
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

import statsmodels.api as sm
import statsmodels.formula.api as smf

sns.set_context("talk")

# Read the data and display the first few rows
anascombe = pd.read_csv('Anscombe.csv')
data = anascombe.head()
print(data)
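# Compute and display the means
# (select columns with a list; tuple-style ['x','y'] fails on recent pandas)
means = anascombe.groupby('dataset')[['x', 'y']].mean()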
print("the mean of x and y:")
print(means)
# Compute and display the variances (var(), not std(), which gives the standard deviation)
variance = anascombe.groupby('dataset')[['x', 'y']].var()
print("the variance of x and y:")
print(variance)
# Compute and display the correlation coefficients (a 2x2 matrix per dataset)
corr = anascombe.groupby('dataset')[['x', 'y']].corr()
print("the correlation coefficient of x and y:")
print(corr)

print()
# Fit a linear model for each dataset and print the results
l = ['I','II','III','IV']
for i in l:
    x = anascombe[anascombe['dataset'] == i]['x']
    y = anascombe[anascombe['dataset'] == i]['y']
    # Add a constant column so the model includes an intercept
    x = sm.add_constant(x)
    model = sm.OLS(y,x).fit()
    print('the model of data '+i+' :')
    print(model.params)
    print(model.summary())
    
# Part 2: scatter plot of y against x, one panel per dataset
g = sns.FacetGrid(anascombe, col="dataset")
g.map(plt.scatter, "x", "y")
plt.show()
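
As a cross-check that does not depend on Anscombe.csv, here is a minimal sketch with the quartet typed in by hand. It assumes the standard published values of the quartet, and it swaps in np.polyfit for the regression instead of statsmodels, so it is an independent check and not the method the exercise asks for. The names quartet and summary are made up for this example.

```python
import numpy as np
import pandas as pd

# Anscombe's quartet typed in by hand (assumed to match Anscombe.csv).
# Datasets I, II and III share the same x values.
x_shared = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    "I": (x_shared,
          [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x_shared,
           [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x_shared,
            [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

rows = {}
for name, (xs, ys) in quartet.items():
    x = np.asarray(xs, dtype=float)
    y = np.asarray(ys, dtype=float)
    # A degree-1 polynomial fit is ordinary least squares on one regressor;
    # polyfit returns the highest-degree coefficient first.
    slope, intercept = np.polyfit(x, y, deg=1)
    rows[name] = {
        "mean_x": x.mean(),
        "mean_y": y.mean(),
        "var_x": x.var(ddof=1),   # ddof=1 gives the sample variance, as pandas .var() does
        "var_y": y.var(ddof=1),
        "corr": np.corrcoef(x, y)[0, 1],
        "intercept": intercept,
        "slope": slope,
    }

# One row per dataset, one column per statistic
summary = pd.DataFrame.from_dict(rows, orient="index")
print(summary.round(3))
```

The four rows should come out nearly identical, even though the scatter plots from Part 2 look completely different. That is why the quartet is famous: summary statistics alone can hide the shape of the data.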