Data source and tutorial for this exercise:
https://github.com/schmit/cme193-ipython-notebooks-lecture
Part 1
For each of the four datasets...
- Compute the mean and variance of both x and y
- Compute the correlation coefficient between x and y
- Compute the linear regression line: y=β0+β1x+ϵ (hint: use statsmodels and look at the Statsmodels notebook)
Compute the mean and variance of both x and y.
print(" mean")
print(anascombe.groupby("dataset").mean())
print("\n variance")
print(anascombe.groupby("dataset").var())
Output:
 mean
           x         y
dataset
I        9.0  7.500909
II       9.0  7.500909
III      9.0  7.500000
IV       9.0  7.500909

 variance
            x         y
dataset
I        11.0  4.127269
II       11.0  4.127629
III      11.0  4.122620
IV       11.0  4.123249
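pandas' var() divides by n - 1 by default (the sample variance). As a rough cross-check of the numbers for dataset I, here is a minimal sketch that uses only the Python standard library. The x and y lists are dataset I of Anscombe's quartet typed in by hand; that they match the CSV loaded into anascombe is an assumption, since the loading step is not shown here.

```python
import statistics

# Dataset I of Anscombe's quartet, entered by hand
# (assumed to match the rows of `anascombe` where dataset == "I").
x = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]

mean_x = statistics.mean(x)
mean_y = statistics.mean(y)

# statistics.variance is the sample variance (divisor n - 1),
# which is the same convention as pandas' default ddof=1.
var_x = statistics.variance(x)
var_y = statistics.variance(y)

print("mean:    ", mean_x, round(mean_y, 6))
print("variance:", var_x, round(var_y, 6))
```

Swapping statistics.variance for statistics.pvariance would give the population variance (divisor n), which is slightly smaller and would not match the table.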
Compute the correlation coefficient between x and y
anascombe.groupby("dataset").corr()
Output:
                  x         y
dataset
I       x  1.000000  0.816421
        y  0.816421  1.000000
II      x  1.000000  0.816237
        y  0.816237  1.000000
III     x  1.000000  0.816287
        y  0.816287  1.000000
IV      x  1.000000  0.816521
        y  0.816521  1.000000
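The off-diagonal entries are Pearson correlation coefficients. To make the formula explicit, the following sketch computes it by hand for dataset I. The helper name pearson_r is made up for this illustration, and the data values are typed in by hand on the assumption that they match the CSV used above.

```python
import math

# Dataset I of Anscombe's quartet, entered by hand (assumed to match the CSV).
x = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]


def pearson_r(xs, ys):
    """Pearson correlation: co-deviation of x and y, normalised by both spreads."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(xs, ys))
    sxx = sum((a - mean_x) ** 2 for a in xs)
    syy = sum((b - mean_y) ** 2 for b in ys)
    # The 1/(n - 1) factors of covariance and variances cancel out.
    return sxy / math.sqrt(sxx * syy)


r = pearson_r(x, y)
print(round(r, 6))
```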
Compute the linear regression line: y=β0+β1x+ϵ (hint: use statsmodels and look at the Statsmodels notebook)
import numpy as np
import statsmodels.formula.api as smf

def rmse(y, yhat):
    # Note: this is the square root of the *sum* of squared residuals.
    # A conventional RMSE would use np.mean instead of np.sum.
    return np.sum((y - yhat)**2)**0.5

def ols_by_dataset(anascombe, dataset):
    print("For dataset {}:".format(dataset))
    # Keep only the rows that belong to the requested dataset
    is_dataset = anascombe["dataset"] == dataset
    subset = anascombe[is_dataset].reset_index(drop=True)
    lin_model = smf.ols("y ~ x", subset).fit()
    # params is indexed by term name: "Intercept" is β0 and "x" is β1
    print("y = {}x + {}".format(lin_model.params["x"], lin_model.params["Intercept"]))
    # print(lin_model.summary())
    preds = lin_model.predict(subset[["x"]])
    print('The RMSE is {}\n'.format(rmse(subset['y'], preds)))

ols_by_dataset(anascombe, 'I')
ols_by_dataset(anascombe, 'II')
ols_by_dataset(anascombe, 'III')
ols_by_dataset(anascombe, 'IV')
Output:
For dataset I:
y = 0.500090909090909x + 3.000090909090909
The RMSE is 3.7098099681789622
For dataset II:
y = 0.5x + 3.0009090909090905
The RMSE is 3.711642616024731
For dataset III:
y = 0.4997272727272728x + 3.002454545454545
The RMSE is 3.708934054169988
For dataset IV:
y = 0.49990909090909075x + 3.0017272727272735
The RMSE is 3.7070864570441304
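For a single predictor, the least-squares fit has a closed form: the slope β1 is the co-deviation of x and y divided by the squared deviation of x, and the intercept is β0 = mean(y) - β1 * mean(x). The sketch below applies this to dataset I as a sanity check on the statsmodels fit. It also shows how the value returned by the rmse helper above (square root of the summed squared residuals) relates to the usual root-mean-square error. The data values are typed in by hand and assumed to match the CSV; the variable names are illustrative only.

```python
import math

# Dataset I of Anscombe's quartet, entered by hand (assumed to match the CSV).
x = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n

# Closed-form simple linear regression
sxx = sum((a - mean_x) ** 2 for a in x)
sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
slope = sxy / sxx                      # estimate of β1
intercept = mean_y - slope * mean_x    # estimate of β0

# Residuals of the fitted line
residuals = [b - (intercept + slope * a) for a, b in zip(x, y)]
sse = sum(e ** 2 for e in residuals)

root_sse = math.sqrt(sse)        # what the rmse helper above computes
rmse_value = math.sqrt(sse / n)  # conventional RMSE: mean before the square root

print("slope     =", round(slope, 6))
print("intercept =", round(intercept, 6))
print("sqrt(SSE) =", round(root_sse, 4))
print("RMSE      =", round(rmse_value, 4))
```

The two error measures differ only by a factor of sqrt(n), so comparisons across the four datasets (each with 11 points) come out the same either way.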
Part 2
Using Seaborn, visualize all four datasets.
hint: use sns.FacetGrid combined with plt.scatter
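import matplotlib.pyplot as plt
import seaborn as sns

# One panel per dataset, each showing a scatter plot of y against x
g = sns.FacetGrid(anascombe, col="dataset")
g.map(plt.scatter, "x", "y")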