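Problem details can be found at:
【Prerequisites】
For an accessible explanation of Anscombe's quartet, see: https://www.zhihu.com/question/67493742
Note that the correlation coefficient formula used in that Zhihu post is equivalent to the commonly seen formula (shown below); only the form in which it is written differs.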
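For reference, here is a sketch of that equivalence. It assumes the Zhihu post uses the z-score form, which is also the form the code in this post computes; s_x and s_y denote population standard deviations (dividing by n, not n - 1).

```latex
r \;=\; \frac{1}{n}\sum_{i=1}^{n} z_{x,i}\, z_{y,i},
\qquad
z_{x,i} = \frac{x_i - \bar{x}}{s_x},
\quad
s_x = \sqrt{\frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})^2}
\quad \text{(and likewise for } y\text{)}

\Longrightarrow\quad
r \;=\; \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
             {\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \;\sum_{i=1}^{n} (y_i - \bar{y})^2}}
```

The factor 1/n in front cancels against the two factors of 1/sqrt(n) hidden in s_x and s_y, which is why both forms give the same value.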
【Imported modules】
import os
import random
import numpy as np
import scipy as sp
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm
import statsmodels.formula.api as smf
【Solution】
For an accessible introduction to OLS in statsmodels, see: https://zhuanlan.zhihu.com/p/22692029
# Build the path portably (the original hard-coded Windows backslashes)
path = os.path.join(os.path.dirname(os.getcwd()), 'data', 'anscombe.csv')
anascombe = pd.read_csv(path)  # read the CSV file
for i in range(4):
    # Each dataset occupies 11 consecutive rows of the file
    x = anascombe.x[0+11*i:11+11*i].values
    y = anascombe.y[0+11*i:11+11*i].values  # 11 samples
    print("For dataset %d: " % (i+1))
    # Means
    XMean = np.mean(x)
    YMean = np.mean(y)
    print("mean of x: " + str(XMean))
    print("mean of y: " + str(YMean))
    # Variances (np.var defaults to the population variance, ddof=0)
    XVar = np.var(x)
    YVar = np.var(y)
    print("variance of x: " + str(XVar))
    print("variance of y: " + str(YVar))
    # Standard deviations
    XSD = np.sqrt(XVar)  # equivalent to np.std(x)
    YSD = np.sqrt(YVar)  # equivalent to np.std(y)
    # z-scores
    ZX = (x - XMean) / XSD
    ZY = (y - YMean) / YSD
    # Correlation coefficient: mean of the products of the z-scores
    r = np.sum(ZX * ZY) / len(x)
    print("correlation coefficient between x and y: " + str(r))
    X = sm.add_constant(x)  # add a constant (intercept) column for the regression
    model = sm.OLS(y, X)  # OLS of the response y on the regressors X
    results = model.fit()  # fit the regression
    print(results.params)  # print the fitted parameters (intercept, slope)
    print(results.summary())  # print the full summary
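As a sanity check on the z-score calculation above, the following sketch compares it with np.corrcoef on Anscombe's dataset I. The data are embedded so the check runs without the CSV file; the helper name pearson_r and its ddof parameter are illustrative and not part of the original code.

```python
import numpy as np

# Anscombe's dataset I, embedded so the check runs without the CSV file
x = np.array([10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5], dtype=float)
y = np.array([8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68])


def pearson_r(x, y, ddof=0):
    """Correlation via z-scores (hypothetical helper for illustration).

    ddof=0 uses population standard deviations and divides by n,
    ddof=1 uses sample standard deviations and divides by n - 1.
    """
    n = len(x)
    zx = (x - x.mean()) / x.std(ddof=ddof)
    zy = (y - y.mean()) / y.std(ddof=ddof)
    return np.sum(zx * zy) / (n - ddof)


r_population = pearson_r(x, y, ddof=0)  # the convention used in the loop above
r_sample = pearson_r(x, y, ddof=1)      # the n - 1 convention
r_numpy = np.corrcoef(x, y)[0, 1]       # NumPy's built-in result

print("z-score r (ddof=0):", r_population)
print("z-score r (ddof=1):", r_sample)
print("np.corrcoef:       ", r_numpy)

# The variance, unlike r, does depend on the convention
print("population variance of x:", np.var(x))
print("sample variance of x:    ", np.var(x, ddof=1))
```

The correlation coefficient is the same under either convention as long as the standard deviation and the final divisor use the same ddof. The variance does change, which is why np.var(x) here differs from the sample variance often quoted for the quartet.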
For seaborn's plotting functions, see: https://blog.csdn.net/yutao03081/article/details/79064669
g = sns.FacetGrid(anascombe, col="dataset")  # one panel per dataset
g.map(plt.scatter, "x", "y")  # scatter plot of y against x in each panel
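To see why the scatter plots matter, it can help to overlay each dataset's fitted line. The sketch below is one way to do that with plain matplotlib and np.polyfit rather than seaborn or statsmodels. The quartet values are embedded so it runs without the CSV file, and the names quartet and fits are illustrative.

```python
import numpy as np
import matplotlib.pyplot as plt

# Anscombe's quartet, embedded so the sketch runs without the CSV file
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    "I": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

fig, axes = plt.subplots(1, 4, figsize=(16, 4), sharex=True, sharey=True)
fits = {}  # dataset name -> (slope, intercept)
line_x = np.array([3.0, 20.0])  # x-range over which each fitted line is drawn

for ax, (name, (xs, ys)) in zip(axes, quartet.items()):
    xs = np.asarray(xs, dtype=float)
    ys = np.asarray(ys, dtype=float)
    # Degree-1 least-squares fit; polyfit returns the highest-degree coefficient first
    slope, intercept = np.polyfit(xs, ys, 1)
    fits[name] = (slope, intercept)
    ax.scatter(xs, ys)
    ax.plot(line_x, slope * line_x + intercept, color="red")
    ax.set_title("dataset = " + name)
    ax.set_xlabel("x")

axes[0].set_ylabel("y")
fig.tight_layout()

for name, (slope, intercept) in fits.items():
    print("dataset %s: slope=%.3f, intercept=%.3f" % (name, slope, intercept))
# Call plt.show() in an interactive session, or fig.savefig(...) to write a file
```

All four fitted lines have a slope near 0.5 and an intercept near 3, yet the point clouds look completely different. That is the point of the quartet.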