Part 1
For each of the four datasets…
Compute the mean and variance of both x and y
Compute the correlation coefficient between x and y
Compute the linear regression line: y=β0+β1x+ϵ (hint: use statsmodels and look at the Statsmodels notebook)
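With a single predictor, the regression line in the last task has a closed form: the slope β1 is the covariance of x and y divided by the variance of x, and the intercept β0 is mean(y) − β1·mean(x). The sketch below is only a hand check of that formula, not the statsmodels route the hint suggests. The helper name `simple_ols` and the sample points are made up for illustration.

```python
from statistics import mean


def simple_ols(x, y):
    """Return (b0, b1) for the least-squares line y = b0 + b1 * x."""
    mx, my = mean(x), mean(y)
    # Sum of cross-deviations and sum of squared x-deviations.
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    b1 = sxy / sxx       # slope
    b0 = my - b1 * mx    # intercept
    return b0, b1


# Made-up points lying exactly on y = 1 + 2x, so the fit is easy to verify.
x = [0, 1, 2, 3]
y = [1, 3, 5, 7]
b0, b1 = simple_ols(x, y)
print(b0, b1)
```

On any one of the four datasets, the values returned here should agree with `params[0]` and `params[1]` from the statsmodels fit, up to floating-point rounding.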
import numpy as np
import seaborn
import statsmodels.api as sma
import pandas
seaborn.set_context("talk")
anascombe = pandas.read_csv('C:/Users/sysusdcsgjh/Desktop/anscombe.csv')
print(anascombe)
print()
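# The slicing below assumes each dataset occupies 11 consecutive rows of the CSV
l = 0
r = 11
for i in range(4):
    xi = anascombe.x[l:r].values
    yi = anascombe.y[l:r].values
    meanxi = np.mean(xi)
    meanyi = np.mean(yi)
    # np.var defaults to the population variance (ddof=0); pass ddof=1 for the sample variance
    varxi = np.var(xi)
    varyi = np.var(yi)
    coefxy = np.corrcoef(xi, yi)[0][1]
    # add_constant prepends a column of ones so that OLS also fits the intercept β0
    tmp = sma.add_constant(xi)
    model = sma.OLS(yi, tmp)
    rst = model.fit()
    params = rst.params
    print('mean_x'+str(i+1), ': ', meanxi)
    print('mean_y'+str(i+1), ': ', meanyi)
    print('varx'+str(i+1), ': ', varxi)
    print('vary'+str(i+1), ': ', varyi)
    print('coef_xy: ', coefxy)
    print('Linear regression: y =', params[0], '+', params[1], '* x')
    print()
    l += 11
    r += 11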
Results:
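A note on the Part 1 loop: it picks out each dataset by row position, which only works if the CSV stores the four datasets in blocks of 11 rows. A possible alternative is to group on the `dataset` column instead. The sketch below assumes the file has the columns `dataset`, `x` and `y`, and uses a tiny made-up frame in place of anscombe.csv so that it runs on its own.

```python
import pandas as pd

# Made-up stand-in for anscombe.csv with the same columns: dataset, x, y.
df = pd.DataFrame({
    "dataset": ["I", "I", "I", "II", "II", "II"],
    "x": [1.0, 2.0, 3.0, 1.0, 2.0, 3.0],
    "y": [2.0, 4.0, 6.0, 3.0, 2.0, 1.0],
})

# Mean and variance of x and y per dataset.
# Note: pandas' var uses ddof=1 (sample variance), unlike np.var's default ddof=0.
summary = df.groupby("dataset")[["x", "y"]].agg(["mean", "var"])

# Pearson correlation between x and y within each dataset.
corr = {name: g["x"].corr(g["y"]) for name, g in df.groupby("dataset")}

print(summary)
print(corr)
```

Because the grouping is by label, this version does not depend on row order or on every dataset having the same number of rows.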
Part 2
Using Seaborn, visualize all four datasets.
hint: use sns.FacetGrid combined with plt.scatter
import seaborn
import pandas
import matplotlib.pyplot as plt
anascombe = pandas.read_csv('C:/Users/sysusdcsgjh/Desktop/anscombe.csv')
seaborn.set(style='whitegrid')  # data visualization
# `height` replaces the `size` argument, which was renamed in seaborn 0.9
graph = seaborn.FacetGrid(anascombe, col='dataset', hue='dataset', height=3)
graph.map(plt.scatter,'x','y')
plt.show()