cme193-ipython-notebooks-lecture
This exercise comes from CME 193; the original problem statement and an introduction to the relevant libraries are in cme193-ipython-notebooks-lecture.
Part 0
Load the dataset
import pandas as pd
anascombe = pd.read_csv('anscombe.csv')
anascombe.head()
 | dataset | x | y |
---|---|---|---|
0 | I | 10.0 | |
1 | I | 8.0 | |
2 | I | 13.0 | |
3 | I | 9.0 | |
4 | I | 11.0 | |
Part 1
For each of the four datasets…
- Compute the mean and variance of both x and y
- Compute the correlation coefficient between x and y
- Compute the linear regression line: $y = \beta_0 + \beta_1 x + \epsilon$ (hint: use statsmodels and look at the Statsmodels notebook)
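For reference, the quantities requested in the list above can be written out with the standard sample formulas. This is a sketch of the usual textbook definitions, not something stated in the original exercise; the hats on the coefficients denote least-squares estimates, and the variance uses the n - 1 denominator, which is also the default of pandas var().

```latex
\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i,
\qquad
s_x^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2

r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
         {\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \, \sum_{i=1}^{n} (y_i - \bar{y})^2}}

\hat{\beta}_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
                     {\sum_{i=1}^{n} (x_i - \bar{x})^2},
\qquad
\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}
```

The same formulas apply to y by swapping the roles of the two variables in the mean and variance.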
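Solution:
Call mean() and var() to compute the mean and variance of x and y in each dataset
Call corr() to compute the correlation coefficient between x and y in each dataset
Call fit() on the ols model from statsmodels to fit a linear regression, then read its parameters to obtain the regression equation
The code is as follows: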
import pandas as pd
import statsmodels.formula.api as smf

anascombe = pd.read_csv('anscombe.csv')

# Select columns with a list: groupby(...)['x', 'y'] fails on pandas 2.x
print('mean of both x and y:')
print(anascombe.groupby('dataset')[['x', 'y']].mean(), '\n')
print('variance of both x and y:')
print(anascombe.groupby('dataset')[['x', 'y']].var(), '\n')
print('correlation coefficient between x and y:')
print(anascombe.groupby('dataset')[['x', 'y']].corr(), '\n')

for i in range(0, 4):
    # Assumes each dataset occupies 11 consecutive rows, in file order
    dataset = anascombe[i * 11: (i + 1) * 11]
    lin_model = smf.ols('y ~ x', dataset).fit()
    params = lin_model.params  # fitted parameters: Intercept and the slope x
    print('the linear regression line of dataset ' + str(i) + ':')
    print('y = ' + str(params.x) + 'x + ' + str(params.Intercept))
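The output is as follows (the means, variances, correlation coefficients, and regression lines are nearly identical across the four datasets):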
mean of both x and y:
x y
dataset
I 9.0 7.500909
II 9.0 7.500909
III 9.0 7.500000
IV 9.0 7.500909
variance of both x and y:
x y
dataset
I 11.0 4.127269
II 11.0 4.127629
III 11.0 4.122620
IV 11.0 4.123249
correlation coefficient between x and y:
x y
dataset
I x 1.000000 0.816421
y 0.816421 1.000000
II x 1.000000 0.816237
y 0.816237 1.000000
III x 1.000000 0.816287
y 0.816287 1.000000
IV x 1.000000 0.816521
y 0.816521 1.000000
the linear regression line of dataset 0:
y = 0.500090909091x + 3.00009090909
the linear regression line of dataset 1:
y = 0.5x + 3.00090909091
the linear regression line of dataset 2:
y = 0.499727272727x + 3.00245454545
the linear regression line of dataset 3:
y = 0.499909090909x + 3.00172727273
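The numbers reported for dataset I can be cross-checked without pandas or statsmodels. The following is a minimal sketch that applies the closed-form least-squares formulas (slope = Sxy / Sxx, intercept = mean(y) - slope * mean(x)) in plain Python. The x and y lists are the published values of Anscombe's dataset I, typed in here as an assumption about what the first 11 rows of anscombe.csv contain; the variable names are illustrative only.

```python
# Dataset I of Anscombe's quartet (published values; assumed to match
# the first 11 rows of anscombe.csv).
x = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n

# Sums of squared deviations and of cross-products around the means.
sxx = sum((xi - mean_x) ** 2 for xi in x)
syy = sum((yi - mean_y) ** 2 for yi in y)
sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))

# Sample variances with the n - 1 denominator (the pandas var() default).
var_x = sxx / (n - 1)
var_y = syy / (n - 1)

# Pearson correlation coefficient.
r = sxy / (sxx * syy) ** 0.5

# Closed-form ordinary least squares for y = beta0 + beta1 * x.
beta1 = sxy / sxx
beta0 = mean_y - beta1 * mean_x

print(f"mean: x = {mean_x:.6f}, y = {mean_y:.6f}")
print(f"variance: x = {var_x:.6f}, y = {var_y:.6f}")
print(f"correlation: {r:.6f}")
print(f"y = {beta1:.6f}x + {beta0:.6f}")
```

The printed values should agree with the dataset I rows of the output above up to rounding, which confirms that the statsmodels fit is the ordinary least-squares line.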
Part 2
Using Seaborn, visualize all four datasets.
hint: use sns.FacetGrid combined with plt.scatter
import matplotlib.pyplot as plt
import seaborn as sns
g = sns.FacetGrid(anascombe, col="dataset", hue="y")  # one panel per dataset; hue="y" gives each distinct y value its own color
g = g.map(plt.scatter, "x", "y", edgecolor="w")
g.savefig('a.png')
The resulting figure: