Part 1
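The groupby function of a DataFrame splits the whole table into several groups by a key column, which makes mean() and var() easy to apply. It is less convenient for corrwith(), though, because that call needs self to be a DataFrame and other to be a Series. So I also build a separate list of per-dataset frames, used only for the correlation coefficient and the least-squares fit.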
import numpy as np
import scipy as sp
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm
import statsmodels.formula.api as smf
anascombe = pd.read_csv('anscombe.csv')
grouped=anascombe.groupby('dataset')
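dgroup = [None] * 4
for i in range(4):
    # Rows i*11 .. i*11+10 hold dataset i; .loc slices are inclusive (.ix was removed in pandas 1.0)
    dgroup[i] = anascombe.loc[i*11:i*11+10, ['x', 'y']]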
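print(" Mean:")
print(grouped.mean())  # mean of x and y per dataset
print("\n Variance:")
print(grouped.var())  # variance of x and y per dataset
print('\n Correlation coefficient:')
for i in range(4):
    # corrwith needs a DataFrame on the left and a Series on the right
    print(i, '\t', dgroup[i][['x']].corrwith(dgroup[i]['y']), '\n')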
# Ordinary least squares: fit y = const + slope * x for each dataset
for i in range(4):
    X = sm.add_constant(dgroup[i]['x'])
    Y = dgroup[i]['y']
    lin_model = sm.OLS(Y, X)
    result = lin_model.fit()
    print(result.summary())
The results show that, judged by summary statistics alone, the four samples are almost identical.
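The same per-dataset statistics can also be computed directly from groupby, without building a separate list. The following is only a sketch under some assumptions: the data is typed in from the commonly published Anscombe values (assumed to match anscombe.csv), `describe` is a hypothetical helper name, and np.polyfit stands in for sm.OLS to keep the example short.

```python
import numpy as np
import pandas as pd

# Anscombe's quartet typed in by hand, so the sketch does not need anscombe.csv
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
x4 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]
quartet = {
    "I": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": (x4, [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}
frames = [
    pd.DataFrame({"dataset": name, "x": xs, "y": ys})
    for name, (xs, ys) in quartet.items()
]
df = pd.concat(frames, ignore_index=True)


def describe(group):
    """Correlation and straight-line least-squares fit for one dataset (hypothetical helper)."""
    slope, intercept = np.polyfit(group["x"], group["y"], deg=1)
    return pd.Series({
        "corr": group["x"].corr(group["y"]),
        "slope": slope,
        "intercept": intercept,
    })


# One row per dataset, one column per statistic
stats = df.groupby("dataset")[["x", "y"]].apply(describe)
print(stats.round(3))
```

Selecting the x and y columns before apply keeps the grouping column out of each group, so the helper only ever sees the two numeric columns.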
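Drawn with lmplot, the four groups look much alike. Drawing plt.scatter subplots through FacetGrid and comparing them side by side, however, shows that the four datasets are fundamentally different: panel one is an ordinary scatter plot, panel two is a parabola, panel three is a straight line with a single outlier, and the same goes for panel four. This assignment made me more familiar with drawing charts in Python.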
import numpy as np
import scipy as sp
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm
import statsmodels.formula.api as smf
sns.set()
anascombe = pd.read_csv('anscombe.csv')
g = sns.FacetGrid(anascombe, col="dataset", height=4)  # one subplot per dataset (size= was renamed to height= in seaborn 0.9)
g.map(plt.scatter, "x", "y")  # draw each subplot with plt.scatter
plt.show()
#g = sns.lmplot(x ='x',y ='y',data = anascombe,hue ='dataset')
#plt.show()
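What the scatter plots show for datasets II and III can also be checked numerically. This is a rough sketch rather than part of the original assignment: the values are typed in from the commonly published quartet (assumed to match anscombe.csv), and `r_squared` is a hypothetical helper built on np.polyfit.

```python
import numpy as np

# x is shared by datasets II and III; values typed in by hand (assumed to match anscombe.csv)
x = np.array([10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5], dtype=float)
y2 = np.array([9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74])
y3 = np.array([7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73])


def r_squared(xs, ys, deg):
    """R^2 of a degree-`deg` polynomial least-squares fit (hypothetical helper)."""
    coeffs = np.polyfit(xs, ys, deg)
    ss_res = np.sum((ys - np.polyval(coeffs, xs)) ** 2)
    ss_tot = np.sum((ys - ys.mean()) ** 2)
    return 1.0 - ss_res / ss_tot


# Dataset II: compare a straight line with a parabola
r2_line_II = r_squared(x, y2, 1)
r2_parabola_II = r_squared(x, y2, 2)

# Dataset III: find the point with the largest residual, then refit without it
line = np.polyfit(x, y3, 1)
outlier = int(np.argmax(np.abs(y3 - np.polyval(line, x))))
keep = np.arange(len(x)) != outlier
r2_line_III = r_squared(x, y3, 1)
r2_trimmed_III = r_squared(x[keep], y3[keep], 1)

print("II  line vs parabola R^2:", round(r2_line_II, 3), round(r2_parabola_II, 3))
print("III outlier at x =", x[outlier], "y =", y3[outlier])
print("III line R^2 with and without it:", round(r2_line_III, 3), round(r2_trimmed_III, 3))
```

A quadratic fit that explains nearly all of the variance in dataset II, and a linear fit that does the same for dataset III once one point is dropped, agree with the parabola and the line-plus-outlier seen in the panels.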